Procedia Computer Science, Volume 88, 2016, Pages 169–175. doi:10.1016/j.procs.2016.07.421
7th Annual International Conference on Biologically Inspired Cognitive Architectures, BICA 2016
Recognizing permuted words with Vector Symbolic Architectures: A Cambridge test for machines
Denis Kleyko 1,*, Evgeny Osipov 1, and Ross W. Gayler 2
1 Luleå University of Technology, 971 87 Luleå, Sweden
2 Independent Researcher, Melbourne, VIC, Australia
* Corresponding author. Email: [email protected]; [email protected]; [email protected]
Abstract
This paper proposes a simple encoding scheme for words using principles of Vector Symbolic Architectures. The proposed encoding allows a valid word in the dictionary to be found for a given permuted word (represented using the proposed approach) using only a single operation – calculation of the Hamming distance to the distributed representations of the valid words in the dictionary. The proposed encoding scheme can be used as an additional processing mechanism for word embedding models, which also form vectors to represent the meanings of words, in order to match distorted words in a text to valid words in the dictionary.
Keywords: Vector Symbolic Architectures, distributed representation, spell-checking, word embedding
1 Introduction
In recent years, word embedding models capturing the similarity of usage and meaning of words have gained significant attention in the research community. These models represent words as vectors in a high-dimensional space. The vectors are formed by processing large text corpora, for example, by counting the co-occurrence of words in documents. Other models take into account a window of words around the current word of interest. There are many word embedding vector models, for example, Latent Semantic Analysis [1], Random Indexing [2], Context Vectors, Word2Vec [3], and BEAGLE [4]. These models have several potential applications in the area of natural language processing, including [5] document retrieval, synonym tests, semantic similarity search, word sense disambiguation, and bilingual information extraction. Despite the ability of these models to represent the usage and semantics of words, they are brittle to distorted input strings such as typographical errors. This is unlike human beings, who are able to cope with distorted words. A simple example of this fact is an Internet meme, the so-called “Cambridge test”. Readers are presented with a text of words with scrambled letters (referred to later as permuted words). It appears that most people have no problem understanding text similar to that illustrated in the left box of Figure 1. Note that this “Cambridge test” should not be considered a well-studied scientific effect. As described in [6, 7], no such study was done at Cambridge
University. Some related research can be traced to the work of Graham Rawlinson [8], who found that randomizing letters in the middle of words has little or no effect on the ability to understand the text. For human readers the multi-word context also plays an important role. The relevant point for the purposes of this paper is that permutation of the letters within a word does not prevent its recognition, and the presence of the correct letters, even in the wrong order, provides some cues to recognition. This effect was our inspiration to create an encoding scheme that forms a distributed representation of a word which resists the effect of locally permuted letters.
Figure 1: The reconstruction process of the permuted text of the “Cambridge test”.
To the best of our knowledge, the approaches used in text editors and search engines [9] account only for simple permutations, e.g. the permutation of two adjacent letters, because estimating the similarity between an arbitrarily distorted word and the valid words in the dictionary would require non-linear complexity. For example, Levenshtein distance, which handles arbitrary distortion (permuting letters, substitution of a letter by another letter, and insertion of a new letter), requires O(n · m) operations for every pair of words, where n and m are the lengths of the two words. This paper considers only the case of permuted words. It proposes an encoding scheme that allows a valid word to be found in the dictionary from a given permuted word, using only a single operation – calculation of the Hamming distance to the distributed representations of the valid words in the dictionary. The overhead of the approach (see Figure 1) is the maintenance of the memory storing the distributed representations of the words from the dictionary. The distributed representation of a word is formed using high-dimensional random vectors and operations on them. The paradigm of such a data representation is called Vector Symbolic Architectures (VSAs). This article is structured as follows. Section 2 provides the fundamentals of the theory of Vector Symbolic Architectures and Binary Spatter Codes (a specific instance of VSA) relevant to the scope of the article. Section 3 presents the main contribution of this article – a VSA-based, permutation-invariant encoding scheme for words. The performance evaluation of the proposed scheme is given in Section 4. The conclusions are presented in Section 5.
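As a point of reference for the O(n · m) cost mentioned above, the following is a minimal sketch (in Python, our own illustrative choice) of the standard dynamic-programming computation of Levenshtein distance; it is not part of the proposed scheme.

```python
# Standard dynamic-programming Levenshtein distance, shown only to illustrate
# the O(n * m) per-word-pair cost that the proposed single-operation lookup avoids.
def levenshtein(s: str, t: str) -> int:
    n, m = len(s), len(t)
    prev = list(range(m + 1))          # distances from the empty prefix of s to every prefix of t
    for i in range(1, n + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[m]

print(levenshtein("aoccdrnig", "according"))   # fills n * m table cells for a single pair
```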
2 Fundamentals of Vector Symbolic Architecture
Distributed data representation is widely used for computer-based semantic reasoning [10, 11]. The cognitive capabilities achievable using distributed representations have been demonstrated by creating systems capable of solving Raven’s progressive matrices [12, 13] and via the imitation of concept learning in honey bees [14, 15]. Recently, they have also been applied in several distributed semantic [16] and signal processing [17, 18, 19, 20, 21, 22, 23] applications. VSAs are a family of bio-inspired high-dimensional vector representations of structured knowledge and operations on those vector representations. In this paper we use only the Binary Spatter Code variant of VSAs [10], in which the individual elements take only the binary values “0” or “1”. Their development was stimulated by studies on brain activity which showed that
the processing of even simple mental events involves simultaneous activity in many dispersed neurons. The activity pattern of a population of neurons can be modelled as a vector in a vector space by associating each neuron with a basis dimension of the space and representing the activity level of each neuron as the projection of the vector onto the dimension corresponding to the neuron. VSAs can be viewed as an abstraction of distributed neural systems that can be easily mapped back to a neural implementation, but the interesting properties of a VSA system arise from the mathematical properties of the vector space rather than its neural implementation. Information in VSAs is represented in a distributed fashion. A single, unitary concept is represented with a codeword instantiated as a high-dimensional vector (HD-vector). These representational vector spaces are generally of dimension at least 500, and typically much higher. Randomness means that the values of the elements of an HD-vector are independent of each other, and “0” and “1” components have approximately the same density, i.e. ρ0 ≈ ρ1 ≈ 0.5. In very high-dimensional spaces, the Hamming distance between two arbitrarily chosen HD-vectors is very strongly concentrated around 0.5. That is, arbitrarily chosen HD-vectors are, with overwhelmingly high probability, almost orthogonal (i.e. effectively dissimilar). This is similar to the behaviour of symbolic representations – arbitrarily chosen symbols are generally different. Interested readers are referred to [10] and [24] for a comprehensive analysis of the probabilistic properties of the hyperdimensional representation space.
Similarity metric. The similarity between two binary vector representations is characterized by the Hamming distance (i.e. the Hamming weight of the result of the bit-wise XOR of the two vectors). This measures the number of positions in which the two compared vectors have different values:
$\Delta_H(A, B) = \frac{\|A \otimes B\|_1}{n} = \frac{1}{n}\sum_{i=0}^{n-1} a_i \otimes b_i$, where $a_i$, $b_i$ are the bits at position $i$ in vectors $A$ and $B$ of dimension $n$, $\|\cdot\|_1$ is the Hamming weight (the count of elements having the value 1), and $\otimes$ denotes the bit-wise XOR operation.
Generation of HD-vectors. Several random binary vectors with the above properties can be generated from one such vector via permutation, of which the cyclic shift operation [25] is an instance. Using this operation, a sequence of K vectors which are dissimilar to a given initial random HD-vector A (i.e. the normalized Hamming distance between them equals approximately 0.5) is obtained by cyclically shifting A by i positions, where −K ≤ i ≤ K. This operation is denoted here as Sh(A, i). The cyclic shift operation has the following properties:
• it is invertible, i.e. if B = Sh(A, i) then A = Sh(B, −i);
• it is associative in the sense that Sh(B, i + j) = Sh(Sh(B, i), j) = Sh(Sh(B, j), i);
• it preserves the Hamming weight of the result: $\|B\|_1 = \|Sh(B, i)\|_1$; and
• the result is dissimilar to the vector being shifted: $\Delta_H(B, Sh(B, i)) \approx 0.5$.
Note that the cyclic shift is a special case of the permutation operation [10]. In the context of VSA, permutations have been used to encode a variety of data structures, including sequences of semantically-bound elements [16].
Bundling of vectors. The bundling operation is used to combine multiple entities into one structure, analogous to adding elements to a set. It is implemented here by a thresholded sum of the HD-codes representing the entities. A bit-wise thresholded sum of n vectors yields 0 when n/2 or more arguments are 0, and 1 otherwise. If the number of operands is even, the resulting ties are broken randomly, which is equivalent to adding an extra random HD-code [10]. The terms “thresholded sum” and “majority sum” are used interchangeably in this work, and both refer to the sum [A + B + C]. The relevant properties of the majority sum are:
• the result is a random vector, i.e. the density of “1” components is approximately equal to the density of “0” components: ρ0 ≈ ρ1 ≈ 0.5;
• the result is similar to all vectors included in the sum;
• as the number of HD-codes that are operands of the majority sum increases, the Hamming distance between the result of the operation and any of the operands tends to 0.5 (which is the expected similarity of any two HD-codes selected at random). That is, the information capacity of the result vector is finite and information about the operands is lost as more of them are combined.
The algebra of VSA includes another operation, binding. Since this operation is not used in the scope of this article, the description of its properties is omitted.
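The operations described in this section can be summarized in a short sketch. The following Python/NumPy code is our own illustration of the Binary Spatter Code primitives (random HD-vector generation, normalized Hamming distance, cyclic shift, and majority sum); the function names and the choice of NumPy are assumptions, not part of the original formulation.

```python
import numpy as np

DIM = 10_000                      # dimensionality used later in the paper
rng = np.random.default_rng()

def random_hd() -> np.ndarray:
    """Random binary HD-vector with density of ones approximately 0.5."""
    return rng.integers(0, 2, size=DIM, dtype=np.uint8)

def hamming(a: np.ndarray, b: np.ndarray) -> float:
    """Normalized Hamming distance: Hamming weight of the bit-wise XOR, divided by n."""
    return np.count_nonzero(a ^ b) / a.size

def shift(a: np.ndarray, i: int) -> np.ndarray:
    """Cyclic shift Sh(A, i); invertible via shift(result, -i)."""
    return np.roll(a, i)

def majority(vectors) -> np.ndarray:
    """Bit-wise thresholded (majority) sum; an extra random HD-vector breaks
    ties when the number of operands is even."""
    vectors = list(vectors)
    if len(vectors) % 2 == 0:
        vectors.append(random_hd())
    return (np.sum(vectors, axis=0) > len(vectors) // 2).astype(np.uint8)

A, B = random_hd(), random_hd()
print(hamming(A, B))                              # ~0.5: random HD-vectors are quasi-orthogonal
print(hamming(A, shift(A, 3)))                    # ~0.5: a shifted vector is dissimilar to the original
print(hamming(A, majority([A, B, random_hd()])))  # < 0.5: the sum remains similar to its operands
```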
3 VSA-based encoding of words
The encoding starts with the generation of random vectors for the alphabet, i.e. a dictionary of letters containing 26 HD-vectors is created at initialization. This fixed dictionary is used throughout the life cycle of the system to form distributed representations of words. The encoding of a word into a distributed representation constructs a bag-of-letters representation consisting of two parts. The first part of the bag-of-letters encoding consists of the HD-vectors corresponding to each of the letters present in the word. Note that if a letter occurs in the word n times, its representation is also used n times in the encoding. This part of the encoding ensures that words consisting of similar sets of letters have similar representations. That is, a bag-of-letters based on this encoding step alone is identical for all permutations of the input word.
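A minimal sketch of this initialization step follows; the dimension, variable names, and the use of NumPy are our own illustrative assumptions.

```python
import numpy as np
import string

DIM = 10_000
rng = np.random.default_rng()

# One fixed random HD-vector per letter, generated once at initialization and
# reused for every word that is encoded afterwards.
letter_vectors = {letter: rng.integers(0, 2, size=DIM, dtype=np.uint8)
                  for letter in string.ascii_lowercase}
```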
Figure 2: Illustration of the scheme for VSA representation of a word by encoding all its letters and their cyclic shifts and taking the majority sum of the individual HD-vectors.
Figure 3: The dictionary of valid words. The entries are ordered by similarity to the first input word “aoccdrnig”.
The second part encodes each letter in its specific absolute position in the word. The cyclic shift operation is applied to the initial HD-vector representing a letter, where the size of the shift equals the position of this letter in the word. The resulting HD-vector represents the letter-position combination. This part of the encoding represents the order of the letters in the word. Therefore, even if two words consist of the same letters (e.g. reversed words), this part of their representations will be different because the letters are not in the same positions. Finally, the majority sum operation is applied to all the HD-vectors (letters and letter-position combinations). The result of the majority sum operation is the distributed representation of the word. That is, a representation has been constructed of a bag-of-letters whose components are both letters and letter-position combinations.
As an example of the encoding scheme, consider the word “bica”. The encoding process is illustrated in Figure 2. First, 4 HD-vectors for the letters “a”, “b”, “c”, and “i” are taken from the dictionary of letters. Next, 4 letter-position vectors are formed by cyclic shifts of the letter HD-vectors. For example, the letter “b” is in the first position; therefore, its letter-position combination in the word “bica” is represented as Sh(b, 1), where b is the initial HD-vector from the dictionary of letters. Because the number of vectors participating in the majority sum should be odd, a fixed random vector R is added to the components. Finally, the 9 HD-vectors are used as input to the majority sum operation. The output is also a binary HD-vector. This vector is the distributed representation of the word “bica”.
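The complete encoding can be expressed as a short function. This is our own sketch of the scheme described above; the names (encode_word, letter_vectors, R) and the NumPy implementation are assumptions, not the authors' code.

```python
import numpy as np
import string

DIM = 10_000
rng = np.random.default_rng()
letter_vectors = {c: rng.integers(0, 2, size=DIM, dtype=np.uint8)
                  for c in string.ascii_lowercase}
R = rng.integers(0, 2, size=DIM, dtype=np.uint8)   # fixed random vector used when the operand count is even

def encode_word(word: str) -> np.ndarray:
    """Distributed representation of a word: majority sum over its letter
    HD-vectors and its letter-position HD-vectors (cyclic shifts)."""
    components = []
    for position, letter in enumerate(word, start=1):
        components.append(letter_vectors[letter])                     # bag-of-letters part
        components.append(np.roll(letter_vectors[letter], position))  # letter-position part
    if len(components) % 2 == 0:                                      # keep the number of operands odd
        components.append(R)
    return (np.sum(components, axis=0) > len(components) // 2).astype(np.uint8)

# For "bica": 4 letter vectors + 4 letter-position vectors + R = 9 operands of the majority sum.
bica = encode_word("bica")
```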
4 Performance evaluation
This section presents the evaluation of the proposed representational scheme when the system is presented with a permuted text. The goal is to reconstruct the original text. The dimensionality of the vectors for all experiments was fixed to 10,000 elements. This dimensionality is a typical choice when using distributed representations [10]; a smaller dimensionality, e.g. 1,000, could also be used, but the accuracy of word recognition decreases. During the reconstruction, the permuted text of the “Cambridge test” was used as input. The distorted text from Figure 1 reads: “aoccdrnig to rseearhcer at cmabrigde uinervtisy it deos not mttaer in waht oredr the ltteers in wrod are the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae the rset can be toatl mses and you can sitll raed it wouthit porbelm tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef but the wrod as wlohe”. In this text the first and last letters of each word are unaltered, while the interior letters of each word are randomly permuted. Our algorithm takes words from the input text one by one. Punctuation characters and articles were deleted. All letters were mapped to lower case. The whole process is outlined in Figure 1. The dictionary of valid words used in the original text is created at system initialization. The dictionary contains the 48 words shown in Figure 3. (The VSA algorithm is completely insensitive to the order of the words in the dictionary. For presentation purposes the words in the dictionary have been ordered by similarity to the first test word.) The distributed representation (HD representation) of each valid word was formed according to the encoding scheme presented above. The dictionary includes all 48 words and their corresponding distributed representations. It is depicted as “HD representation of the dictionary” in Figure 1. During the operating phase, when a new permuted word is presented, the system first encodes the word into its distributed representation. Next, the Hamming distance from the distributed representation of the permuted word to each representation in the dictionary is calculated. Finally, the dictionary word corresponding to the minimal Hamming distance from the test word is issued as the output. The simulation of the task was repeated 100 times. In every run, new HD-vectors for the dictionary of letters were initialized at random. The results within each simulation run are completely deterministic because the representations are fixed. However, there is random variation between runs due to the random choice of vectors for each letter representation. It will be seen in the results below that the variation in similarity due to the random choice of letter vectors is very small compared to the variation in similarity due to the differences in the structure of the words. That is, the representations reflect the structural variations between the words far more than they reflect the random variations of the vectors used to construct the representations. In every simulation run, all the permuted words in the input text were correctly transformed to the corresponding valid words from the dictionary.
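The recognition step just described can be sketched as follows. The code repeats the encoding from the previous sketch so that it is self-contained; the word list shown is only an illustrative subset of the 48-word dictionary, and all names are our own assumptions.

```python
import numpy as np
import string

DIM = 10_000
rng = np.random.default_rng()
letters = {c: rng.integers(0, 2, size=DIM, dtype=np.uint8) for c in string.ascii_lowercase}
R = rng.integers(0, 2, size=DIM, dtype=np.uint8)

def encode_word(word: str) -> np.ndarray:
    # Majority sum of letter vectors and cyclically shifted letter-position vectors.
    comps = []
    for pos, c in enumerate(word, start=1):
        comps += [letters[c], np.roll(letters[c], pos)]
    if len(comps) % 2 == 0:
        comps.append(R)
    return (np.sum(comps, axis=0) > len(comps) // 2).astype(np.uint8)

# Illustrative subset of the valid-word dictionary used in the experiment.
valid_words = ["according", "researcher", "cambridge", "university", "matter", "order", "letters", "word"]
hd_dictionary = {w: encode_word(w) for w in valid_words}

def recognize(permuted: str) -> str:
    """Return the dictionary word with the minimal normalized Hamming distance
    to the encoded permuted word."""
    query = encode_word(permuted)
    return min(hd_dictionary,
               key=lambda w: np.count_nonzero(query ^ hd_dictionary[w]) / DIM)

print(recognize("aoccdrnig"))   # expected to print "according"
```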
Figure 4: The results of the Hamming distance calculation for the first input word “aoccdrnig”.
The results of the Hamming distance calculation during one of the simulations, for the first input word “aoccdrnig” against the 48 alternatives in the dictionary, are presented in Figure 4. The Hamming distances are sorted in ascending order (i.e. decreasing similarity to the input word). The dictionary of words is displayed in the same order in Figure 3. This input word is clearly most similar to the dictionary word “according” (it has the minimal Hamming distance). The next three closest alternatives are “can”, “and”, and “cambridge”. All letters of the dictionary words “and” and “can” are present in the input word. Moreover, “and” and “aoccdrnig” have the letter “a” in the same position. The similarity of the input word to the dictionary word “can” is increased because the letter “c” is encoded twice in the input word, which doubles its contribution to the similarity. The input word “aoccdrnig” and the dictionary word “cambridge” have 6 letters in common (none in the same position).
5 Conclusion
In this article, an encoding scheme for words based on high-dimensional distributed representations and Vector Symbolic Architectures has been proposed. This encoding allows a valid word to be found in a dictionary from a given permuted word. It is also worth noting that this correction process is based on very simple arithmetic manipulation of HD-vectors, whereas traditional approximate string similarity methods require more complex computations, such as dynamic programming. These simple arithmetic manipulations could be implemented very rapidly and cheaply in a specialized processor, such as a GPU. These (and similar) calculations are also well suited to implementation with neural networks. As mentioned above, the choice of dimensionality of the representations affects the accuracy of the recognition process. The encoding with 1,000-element vectors shows worse accuracy than the encoding with 10,000-element vectors. On the other hand, calculations with 10,000-element vectors require more computational resources as well as more memory for the storage of the distributed representations. Thus, the choice of a particular dimensionality is application-specific and requires a trade-off between recognition accuracy and computational resources.
Acknowledgements. This work is supported by the Swedish Foundation for International Cooperation in Research and Higher Education (STINT), institutional grant IG2011-2025.
References
[1] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211–240, 1997.
[2] P. Kanerva, J. Kristofersson, and A. Holst. Random Indexing of text samples for Latent Semantic Analysis. In Proceedings of the 22nd Annual Conference of the Cognitive Science Society, page 1036, 2000.
[3] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26 (NIPS 2013), pages 3111–3119, 2013.
[4] M. N. Jones and D. J. K. Mewhort. Representing word meaning and order information in a composite holographic lexicon. Psychological Review, 114(1):1–37, 2007.
[5] T. Cohen and D. Widdows. Empirical distributional semantics: Methods and biomedical applications. Journal of Biomedical Informatics, 42:390–405, 2009.
[6] Scrambled text. [Online]. Available: http://www.brainhq.com/brain-resources/brain-teasers/scrambled-text.
[7] M. Davis. [Online]. Available: http://www.mrc-cbu.cam.ac.uk/people/matt.davis/Cmabrigde/.
[8] G. Rawlinson. Reibadailty. New Scientist, 162(2188):55, 1999.
[9] P. Norvig. How to write a spelling corrector. [Online]. Available: http://norvig.com/spell-correct.html.
[10] P. Kanerva. Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random vectors. Cognitive Computation, 1(2):139–159, 2009.
[11] T. A. Plate. Holographic reduced representations: Distributed representation for cognitive structures. Center for the Study of Language and Information (CSLI), 2003.
[12] B. Emruli, R. W. Gayler, and F. Sandin. Analogical mapping and inference with binary spatter codes and sparse distributed memory. In Proceedings of the IJCNN, pages 1–8, 2013.
[13] D. Rasmussen and C. Eliasmith. A neural model of rule generation in inductive reasoning. Topics in Cognitive Science, 3(1):140–153, 2011.
[14] D. Kleyko, E. Osipov, R. W. Gayler, A. I. Khan, and A. G. Dyer. Imitation of honey bees' concept learning processes using vector symbolic architectures. Biologically Inspired Cognitive Architectures, 14:57–72, 2015.
[15] D. Kleyko, E. Osipov, M. Bjork, H. Toresson, and A. Oberg. Fly-The-Bee: A game imitating concept learning in bees. Procedia Computer Science, 71:25–30, 2015.
[16] G. Recchia, M. Sahlgren, P. Kanerva, and M. N. Jones. Encoding sequential information in semantic space models: Comparing holographic reduced representation and random permutation. Computational Intelligence and Neuroscience, pages 1–18, 2015.
[17] D. Kleyko, E. Osipov, A. Senior, A. I. Khan, and Y. A. Sekercioglu. Holographic Graph Neuron: A bio-inspired architecture for pattern processing. IEEE Transactions on Neural Networks and Learning Systems, PP(99):1–13, 2016.
[18] O. Rasanen and S. Kakouros. Modeling dependencies in multiple parallel data streams with hyperdimensional computing. IEEE Signal Processing Letters, 21(7):899–903, 2014.
[19] D. Kleyko and E. Osipov. Brain-like classifier of temporal patterns. In Proceedings of the 2nd International Conference on Computer and Information Sciences (ICCOINS), pages 1–6, 2014.
[20] O. Rasanen. Generating hyperdimensional distributed representations from continuous valued multivariate sensory input. In Proceedings of the 37th Annual Meeting of the Cognitive Science Society, pages 1943–1948, 2015.
[21] D. Kleyko, E. Osipov, N. Papakonstantinou, V. Vyatkin, and A. Mousavi. Fault detection in the hyperspace: Towards intelligent automation systems. In IEEE International Conference on Industrial Informatics (INDIN), pages 1–6, 2015.
[22] O. Rasanen and J. Saarinen. Sequence prediction with sparse distributed hyperdimensional coding applied to the analysis of mobile phone use patterns. IEEE Transactions on Neural Networks and Learning Systems, PP(99):1–12, 2015.
[23] D. Kleyko, E. Osipov, and D. A. Rachkovskij. Modification of Holographic Graph Neuron using sparse distributed representations. Procedia Computer Science, pages 1–7, 2016.
[24] P. Kanerva. Sparse Distributed Memory. The MIT Press, 1988.
[25] D. Kleyko and E. Osipov. On bidirectional transitions between localist and distributed representations: The case of common substrings search using vector symbolic architecture. Procedia Computer Science, 41:104–113, 2014.