Citation Reference: P. Kokol, V. Podgorelec, M. Zorman, T. Kokol, T. Njivar, Computer and Natural Language Texts - A Comparison Based on Long-Range Correlations, Journal of the American Society for Information Science, John Wiley & Sons, vol. 50, num. 14, pp. 1295-1301, December 1999.

COMPUTER AND NATURAL LANGUAGE TEXTS – A COMPARISON BASED ON LONG-RANGE CORRELATIONS*

Peter Kokol 1, Vili Podgorelec 1, Milan Zorman 1, Tatjana Kokol 2, Tatjana Njivar 2

1 University of Maribor, FERI, LSD, Smetanova 17, 2000 Maribor, Slovenia
2 The School of Textiles, Pariska 34, 2000 Maribor, Slovenia

Endless invention, endless experiment,
Brings knowledge of motion, but not of stillness;
Knowledge of speech, but not of silence;
Knowledge of words, and ignorance of the Word.
All our knowledge brings us nearer to our ignorance,
All our ignorance brings us nearer to death,
But nearness to death no nearer to God.
Where is the Life we have lost in living?
Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information?

T. S. Eliot, The Rock

Abstract

Long-range power-law correlation (LRC) – correlation whose range (the maximal propagation distance of the effect of some disturbance within a system) extends over the whole system – has been found in many systems that can be represented as strings of symbols. LRC between characters has also been identified in natural language texts. The aim of this paper is to show that long-range power-law correlation can also be found in computer programs, meaning that some common laws hold for both natural language texts and computer programs. This fact enables one to draw parallels between these two different types of human writings and, on the other hand, to measure the differences between them.

1. Introduction: Language and Communication

Both computer programs and natural language texts (human writings in the text that follows) can in general be classified as means of communication. Communication can be defined as (Young, 1978) the interaction between systems or parts of a system using a prearranged code, and human communication as (Francois, 1997) the organised behaviour among a human group, according to a shared cultural and semantic code.

* This work was supported by a grant of the Slovenian Ministry of Science and Technology – J2-0514-0796-98.

It is clear from both of the above definitions that the code is crucial for effective communication; however, it is not always so obvious that the code is in general complex, offering physical, perceptive, syntactic, semantic, pragmatic and cultural aspects in accordance with the characteristics of the environment shared by the communicating systems. In general we can argue that the code is embodied in the communication language (or simply the language in the text that follows). There are many more or less related definitions of language supporting various hypotheses, for example (Francois 1997):

1. A set of signs that can be combined through rules in order to form significant associations.
2. The subset of all possible meaningful combinations that can be formed from a set of signs.
3. A set of signs with more or less commonly admitted, or coded, values that can be used for the communication of messages between senders and receivers.

Another interesting view (very much related to the second definition above) has been pointed out by the Slavic linguist Jakobson, building on the theories of de Saussure (Gibbs, 1998). He argued that all languages share certain structures and rules that allow some combinations and forbid others. Without such structures and rules, words have no inherent meaning (e.g. "oo" occurs in both "soothe" and "cool").

Another aspect necessarily associated with human writings and communication is the concept of meaning (Parkinson 1982). As with the definition of language, there are many different definitions of meaning, but in this paper we are only interested in the relation between meaning and information. In that context (Francois 1997) meaning is mainly subservient to implicit knowledge, i.e. formerly acquired and organised information.

1.1. Human Writings and Complexity

In recent years the so-called "science of complexity" has introduced concepts like self-organisation, fractals, chaos and similar. Following these concepts, Atlan (1989) has proposed a theory that, in order to develop language structures, rules, words and vocabularies, a language should go through a self-organising system called internalisation (Figure 1). As can be seen from Figure 1, language development and the meaning of the human writings generated by that language are both parts of a "closed loop" related through the common phenomenon called complexity, which will be discussed later on. It is well known (Francois 1997) that complexity, entropy and information are related – information reduces entropy and enlarges complexity.

[Figure 1 residue: a concept diagram whose recoverable labels are INTERNALISATION (a self-organising system), LANGUAGE, GRAMMAR, RULES, SYMBOLS, VOCABULARY, WORDS, ORGANISED INFORMATION, MEANING, COMPLEXITY and HUMAN WRITINGS, connected by relations such as "producing", "is defined by", "consists of", "is characterised by" and "can be defined as".]

Figure 1. Language generation process.

1.2. Long-Range Correlations

Long-range power-law correlation (LRC) has been discovered in a wide variety of systems. Wilson (1989) points out that when the correlation length (defined in Wilson 1989 as the maximal propagation distance of the effect of some disturbance within a system) increases, the small variations do not disappear, but merely become a finer structure superimposed on the large-scale structure. As a consequence, LRC is very important for understanding a system's behaviour, since we can quantify it with a critical exponent. Quantification of this kind of scaling behaviour for apparently unrelated systems allows us to recognise similarities between different systems, leading to an underlying unification. For example, LRC has been identified in DNA sequences (Peng 1992) and natural language texts (Schenkel 1993, Ebeling 1995); the consequence is that DNA and human writings can be analysed using very similar techniques (Buldyrev 1994). The most important assumptions for the existence of long-range correlation in human writings are (see Ebeling (1994) for a discussion of some other reasons):

• The first reason follows from the definition of the language – to be meaningful, combinations of signs have to be correlated; the correlation represents a filter which reduces the set of sign combinations actually used. There are about 100,000 different words in the typical vocabulary of a natural language (correlated signs), out of about 10^12 possible combinations of non-correlated sign strings (a rough order-of-magnitude illustration follows after this list).

• The second reason is predictability. In order to identify the content of a human writing, we only need to scan some of its first sentences – the words of the rest are usually correlated with the words of the first sentences.

• Systems generated by a self-organising process exhibit long-range correlation.
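As a rough order-of-magnitude illustration of the first point (our own back-of-the-envelope estimate, not Ebeling's derivation): with an alphabet of roughly 30 signs and words of up to about eight signs, the number of unrestricted sign strings is

30^8 ≈ 6.6 × 10^11 ≈ 10^12,

of which only on the order of 10^5 survive as actual vocabulary entries once the correlations between signs act as a filter.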

Another very important characteristic of long-range correlation is that it represents a lower bound on the complexity of the system being analysed (Li 1997) and is therefore very useful in researching the information content and meaning of human writings.

1.3. The aim of the paper

The aim of this paper is first to show that long-range power-law correlation can also be found in computer programs, a finding indicating that some common laws hold for both natural language texts and computer programs. It is not so important to explicitly know these laws, but knowledge of their existence can be very valuable, because it enables one to draw parallels between these two different types of human writings. For example, the development of various computer languages is well documented, so an analysis can be performed which may yield some interesting findings that can then help one to better understand the development of natural languages. On the other hand, we would like to show that the differences between natural language texts and computer programs introduced in the next section are reflected in the critical exponent of the long-range power-law correlation.

2. Natural and computer language texts

Computer languages belong to the greater family of artificial (constructed) languages. However, many of the "human artificial languages" invented for various purposes, e.g. Erone, a language for self-expression; Esperanto and Klingon, universal communication languages; Lojban, a logical language for aiding reasoning, did not become widely recognised, while computer languages have been accepted internationally (Miller 1991). Many studies exist that compare natural languages, find similarities between them and search for common roots (see Miller 1991 for an overview), and many studies compare programming languages (Wegner 1989, Tucker 1997, Malik 1999), but very few make direct

comparisons between programming and natural languages. Schenkel (1993), using long-range correlation, revealed that computer programs are much more optimised than natural language texts. But in his analyses he used translated computer programs (machine code) and not source programs written in higher-level programming languages. Kokol (1996), comparing natural language texts and computer programs using Zipf's law, found that Zipf's law holds for both types of texts, but that the critical exponents of the law differ significantly. The common points of some linguistic theory texts briefly touching the above subject (see Arista 1999 for links) are summarised below.

Both ordinary texts written in natural language and computer programs can be represented as strings of symbols (words, characters, sentences, etc.) (Miller 1991, Kokol 1997, Schenkel 1993). Computer programs are written according to strict grammatical rules (context-free and regular grammars) (Floyd, 1994). They have vocabularies of keywords, reserved words and operators, from which programmers select appropriate keywords during the programming process (Kokol 1996, Eghe 1990). In addition, they have vocabularies of numbers and vocabularies of identifiers (names of user-defined variables, procedures, functions, modules, labels, etc.) created by programmers. These are, in general, not language dependent, and this distinguishes computer languages from natural ones. Another great difference between natural and computer languages is that computer languages are much more restricted and formal than natural languages and have much more limited vocabularies. But maybe the greatest distinction between computer and natural languages is ambiguity – texts written in natural language are ambiguous, computer programs are not (Floyd 1994).

3. Measures of Complexity

Many different quantities (Gell-Mann, 1995) have been proposed as measures of complexity to capture all our intuitive ideas about what is meant by complexity. Some of these quantities are computational complexity, information content, algorithmic information content, the length of a concise description of a set of the entity's regularities, logical depth, etc. (in contemplating various phenomena we frequently have to distinguish between effective complexity and logical depth – for example, some very complex behaviour patterns can be generated from very simple formulas, like Mandelbrot's fractal set, the energy levels of atomic nuclei, the unified quantum theory, etc., which means that they have little effective complexity and great logical depth). Li (1991) relates complexity to difficulty concerning the system in question, for example the difficulty of constructing a system, the difficulty of describing the system, etc. It is also well known that complexity is related to entropy. Several

authors speculate that the relation is one to one (i.e. algorithmic complexity is equivalent to entropy as a measure of randomness) (Grassberger, 1989), but Li (1991) shows that the relation is one to many or many to one, depending on the definition of complexity and the choice of the system being studied. Many other measures of complexity have been proposed, and a very extensive bibliography about them can be found in Edmonds (1997).

Using the assumption that meaning and information content in text are founded on the correlation between language symbols, one of the meaningful measures of complexity of human writings is entropy as established by Shannon (1951). Yet, when a text is very long it is almost impossible to calculate the Shannon information entropy, so Grassberger (1989) proposed an approximate method to estimate it. But entropy does not directly reveal the correlation properties of texts, so another, more general measure is needed. One possibility is to use the Fourier power spectrum; however, a method yielding much higher-quality scaling data was introduced recently. This method, called long-range correlation (Schenkel 1993), is based on a generalisation of entropy and is very appropriate for measuring the complexity of human writings. Various quantities for the calculation of long-range correlation in linear symbolic sequences have been introduced in the literature and are discussed by Ebeling (1995). The most popular methods are dynamic entropies, 1/f^δ scaling exponents, higher-order cumulants, mutual information, correlation functions, mean square deviations, and mapping of the sequence into a random walk. It is agreed by many authors (Ebeling 1995, Schenkel 1993, Peng 1992) that the mapping into a random walk is the most effective and successful approach in the analysis of human writings.

4. Calculation of the Long-range Power Law Correlation

In order to analyse the long-range correlation of a string of symbols, the best way is to first map the string into a Brownian walk model (Schenkel 1993) (the Brownian walk is defined by Francois (1997) as the archetypical, random, unpredictable process). Namely, the Brownian walk model is well researched and publicised, and the derived theories and methodologies are widely agreed upon and, in addition, easy to implement on a computer. There are various possibilities to implement the above mapping (Kokol 1997). In this paper we use the so-called CHAR method described by Schenkel (1993) and Kokol (1997). A character is taken to be the basic symbol of a human writing. Each character is then transformed into a six-bit binary representation according to a fixed code table. It has been shown by Schenkel (1993) that the selection of the code table does not

influence the results as long as all possible codes are used (i.e. we have 64 different codes for the six-bit representation – in our case we assigned 56 codes to the letters and the remaining codes to special symbols like the period, comma, mathematical operators, etc.). The obtained binary string is then transformed into a two-dimensional Brownian walk model (the Brownian walk in the text which follows) using each bit as one move – 0 as a step down and 1 as a step up. The whole mapping is schematically presented in Figure 2. An important statistical quantity characterising any walk is the root-mean-square fluctuation F about the average of the displacement. In a two-dimensional Brownian walk model, F is defined as:

F²(l) ≡ ⟨[∆y(l, l₀)]²⟩ − ⟨∆y(l, l₀)⟩²

where (Fig. 2)

∆y(l, l₀) ≡ y(l₀ + l) − y(l₀),

l is the distance between two points of the walk on the x axis,

l₀ is the initial position (beginning point) on the x axis where the calculation of F(l) for one pass starts,

y is the position of the walk – the distance between the initial position and the current position on the y axis,

and ⟨·⟩ denotes the average over all initial positions l₀. This is equivalent to the following algorithm:

1. Start with l₀ = 1.
2. Begin the loop with l = 1.
3. Calculate the quantity ∆y(l, l₀) and its square.
4. Increment l and, until the end of the walk is reached, go to step 3.
5. Increment the beginning point l₀ sequentially and, until the end of the walk is reached, go to step 2.
6. Average all of the calculated quantities to obtain F²(l).

F(l) can distinguish two possible types of behaviour:

• if the string sequence is uncorrelated (normal random walk), or there are local correlations extending up to a characteristic range, i.e. Markov chains or symbolic sequences generated by regular grammars (Li, 1991), then F(l) ≈ l^0.5;

• if there is no characteristic length and the correlations are "infinite", then the scaling property of F(l) is described by a power law F(l) ≈ l^α with α ≠ 0.5.
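A quick sanity check of the first case (a standard random-walk result, added here for completeness): for a walk of independent ±1 steps, ∆y(l, l₀) has zero mean and variance l over any window of length l, so

F(l) = (⟨[∆y]²⟩ − ⟨∆y⟩²)^0.5 = l^0.5,

which is exactly the uncorrelated baseline against which the measured α is compared.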

The power law is most easily recognised if we plot F(l) against l on a double logarithmic scale (Fig. 2). If a power law describes the scaling property, then the resulting curve is linear and its slope represents α. If there are long-range correlations in the strings analysed, α should not be equal to 0.5.
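As a concrete illustration of the whole procedure, the sketch below is our own minimal Python implementation of the CHAR method as described above, not the authors' original program; the particular code table, the set of window lengths l and all function names are illustrative assumptions.

# Minimal sketch of the CHAR method (illustrative, not the authors' code):
# characters are mapped to six-bit codes, the bits drive a walk
# (0 = step down, 1 = step up), F(l) is the root-mean-square fluctuation
# of the displacement, and alpha is the slope of log F(l) versus log l.
import math
import string

# Hypothetical fixed code table: 64 six-bit codes for letters, digits and
# two delimiters. Schenkel (1993) reports that the particular assignment
# does not influence the results, so any fixed bijection will do here.
SYMBOLS = string.ascii_uppercase + string.ascii_lowercase + string.digits + ".,"
CODE_TABLE = {ch: format(i, "06b") for i, ch in enumerate(SYMBOLS)}

def to_walk(text):
    """Map a text to the walk profile y(0), y(1), ... (spaces etc. ignored)."""
    bits = "".join(CODE_TABLE[ch] for ch in text if ch in CODE_TABLE)
    y, walk = 0, [0]
    for b in bits:
        y += 1 if b == "1" else -1
        walk.append(y)
    return walk

def fluctuation(walk, l):
    """Root-mean-square fluctuation F(l), averaged over all start points l0."""
    diffs = [walk[l0 + l] - walk[l0] for l0 in range(len(walk) - l)]
    mean = sum(diffs) / len(diffs)
    mean_sq = sum(d * d for d in diffs) / len(diffs)
    return math.sqrt(max(mean_sq - mean * mean, 0.0))

def estimate_alpha(text, lengths=(4, 8, 16, 32, 64, 128, 256, 512, 1024)):
    """Estimate alpha as the least-squares slope of log F(l) against log l."""
    walk = to_walk(text)
    pts = []
    for l in lengths:
        if l >= len(walk):
            break
        f = fluctuation(walk, l)
        if f > 0.0:
            pts.append((math.log(l), math.log(f)))
    # Ordinary least-squares slope (assumes the text is long enough to give
    # at least two usable points on the log-log plot).
    n = len(pts)
    sx = sum(x for x, _ in pts)
    sy = sum(y for _, y in pts)
    sxx = sum(x * x for x, _ in pts)
    sxy = sum(x * y for x, y in pts)
    return (n * sxy - sx * sy) / (n * sxx - sx * sx)

For an uncorrelated symbol sequence this estimate should come out close to 0.5; persistent deviations from 0.5 are the signature of the long-range correlation discussed above.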

[Figure 2 residue: a schematic showing a fragment of a human writing ("… for I:= 1 to B …") transformed via a code table (A = 000000, B = 000001, C = 000010, …) into a string of binary symbols ("… 010010010101010101011010 …") and then into Brownian motion (0 = step down, 1 = step up); F(l) is computed using F²(l) ≡ ⟨[∆y(l)]²⟩ − ⟨∆y(l)⟩² with ∆y(l) ≡ y(l₀ + l) − y(l₀); F(l) is then plotted on a log–log scale and a curve through the points is computed, whose slope (if linear) represents α (in the example, α = 0.59).]

Figure 2. The calculation of F(l).

The main difference between random sequences and human writings is purpose. Namely, writing or programming is done consciously and with purpose, which is not the case with random processes; therefore we anticipate that α should differ from 0.5. The difference in α between different writings can be attributed to various factors like personal preferences, the standards used, the language, the type of text or problem being solved, the type of organisation in which the writer (or programmer) works, different syntactic, semantic and pragmatic rules, etc.

To calculate α for a human writing, the text must first be represented as a string over 64 symbols: letters (upper and lower case are distinguished) and various delimiting symbols such as commas, mathematical operators, periods, etc.; empty spaces are ignored. Each of these symbols is then represented by a six-bit binary number. In this manner the writing is transformed into a string of 0s and 1s. Each 0 is then interpreted as a downward step and each 1 as an upward step of the Brownian walk. Then F(l) for that walk is computed and plotted, and finally α is calculated from the slope of the resulting curve. The method is schematically presented in Figure 2.
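Assuming the illustrative estimate_alpha sketch given after Section 4 (the function and file names are hypothetical), applying the method to a single text could look like this:

# Usage sketch, assuming the estimate_alpha() function defined earlier.
with open("emma.txt", encoding="utf-8") as fh:   # any natural language text or program source
    text = fh.read()

alpha = estimate_alpha(text)
print(f"estimated alpha = {alpha:.3f}")   # values near 0.5 suggest no long-range correlation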

5. The analysis of human writings

To compare natural language texts and computer programs we first selected three natural languages, namely two world languages, English and German, and our native language, Slovenian, and three popular computer languages in wide use, FORTRAN, Pascal and C++. From the available literature on CDs we randomly selected twenty works of well-known belletristic authors for each language (see Table 1). Note that all works are different; there are no translations. In a similar manner we selected 20 computer programs for each computer language. Students wrote the programs during programming courses in the last year of study, meaning that the level of the selected programs can be estimated as professional, like their counterparts, the natural language texts. All programs were also different; no programs were rewritten from one programming language to another. The programs came from different domains, ranging from engineering programs and system programs to programs for financial analysis, and were from 1000 to 2000 lines long.

Table 1. The α calculated for some English texts.

Title                                          Author                   α
Emma                                           Jane Austen              0.54661
The Wonderful Wizard of Oz                     L. Frank Baum            0.49417
A Tale of Two Cities                           Charles Dickens          0.47933
The Time Machine                               H. G. Wells              0.53280
Julius Caesar                                  Shakespeare              0.57251
Romeo and Juliet                               Shakespeare              0.52445
The Strange Case of Dr. Jekyll and Mr. Hyde    Robert Louis Stevenson   0.48013
...                                            ...                      ...

The human writings were analysed using the method introduced in Section 4, and the results are presented in Tables 1, 2 and 3. We see that the mean α for natural language texts is very near 0.5, but individual texts differ from this critical value significantly. We can also observe that the mean values for texts in different languages are nearly the same. However, the mean values for programs, except for FORTRAN, differ essentially from 0.5, and we can see that the programming language influences the mean value significantly.

Table 2. The average, standard deviation, minimal and maximal values of α for computer programs.

                     C++      Pascal   Fortran
Mean value           0.643    0.581    0.491
Standard deviation   0.085    0.066    0.016
Minimal              0.412    0.311    0.325
Maximal              0.773    0.724    0.587

Table 3. The average, standard deviation, minimal and maximal values of α for natural language texts.

                     English  German   Slovenian
Mean value           0.501    0.493    0.504
Standard deviation   0.029    0.022    0.048
Minimal              0.463    0.456    0.428
Maximal              0.720    0.544    0.629

6. Discussion

Assuming that the critical exponent of the long-range correlation α measures complexity (or information content) (Kokol 1997), we can summarise the observations from the previous section in the following comments:

1. The α mean values for the various natural languages do not differ significantly (p > 0.05, Student's t-test). The number of items in the vocabularies of all three analysed languages is similar (Miller 1991). Also, the culture and history of belletristic writing in Britain and Germany are comparable, and Slovenia, as a small country, has been strongly influenced by German culture (Bradbury 1996).

2. The α mean values for the various programming languages differ significantly (p < 0.05, Student's t-test). The sizes of the vocabularies of the three analysed programming languages are different: C++ has the largest vocabulary and FORTRAN the smallest. In addition, C++ and Pascal are structured languages while FORTRAN is not – therefore the entropy in FORTRAN programs can be larger and, as a consequence, the information content and complexity smaller (Kokol 1999).

3. The α mean values for the programming languages are significantly higher than the α mean values for the natural languages (p < 0.05, Student's t-test). We can attribute this difference to some of the differences between programming and natural languages presented in Section 2:

Programming languages are more formal. Formality reduces the chances for randomness and forces order. Therefore, on average, the entropy is smaller and the information content and complexity larger (Lyu 1995).

Natural languages are ambiguous (Floyd 1994). Ambiguity enlarges the possibility of randomness and disorder (entropy). Consequently, information content and complexity are smaller.

Programming languages have much smaller vocabularies of identifiers, and some identifiers in programs are not language dependent. The number of words in programming language vocabularies is small; they can be expanded with user-defined identifiers, but the overall number is still small in comparison with natural language vocabularies (the number of different words in an average program is several hundred, while the number of different words in a belletristic text is a few thousand) (Miller 1991, Kokol 1996).

It can be assumed that operators (i.e. verbs in natural languages; reserved words, mathematical functions, etc. in programming languages) produce numerous relations between operands (i.e. nouns and adjectives in natural languages; identifiers, numbers, etc. in programming languages) and therefore, according to the definition (Morowitz 1995, Devlin 1992), the complexity. However, the operators and operands can also be a source of randomness if the rules for combining them are not strict enough. Taking the ambiguity of natural languages and the stricter semantic, syntactic and pragmatic rules of programming languages into account, we can presume that the larger natural language vocabularies contribute more to the randomness (entropy) than to the complexity (information content).
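The statistical comparisons above rely on a two-sample Student's t-test; a sketch of how such a comparison can be run is given below. The α samples shown are placeholders (the English values echo Table 1, the C++ values are invented for illustration), and scipy is assumed to be available.

# Illustrative two-sample t-test comparing alpha values of two groups.
from scipy import stats

alpha_english = [0.54661, 0.49417, 0.47933, 0.53280, 0.57251, 0.52445, 0.48013]  # from Table 1
alpha_cpp = [0.64, 0.71, 0.58, 0.66, 0.70, 0.61, 0.62]  # hypothetical placeholder values

t_stat, p_value = stats.ttest_ind(alpha_cpp, alpha_english)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # p < 0.05 -> the means differ significantly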

7. Conclusion

The results presented in this paper demonstrate the presence of long-range correlation in both natural language texts and computer programs and indicate the existence of some common laws. At the same time, the results show that long-range correlation can be used to differentiate between the above two types of human writings, emphasising the following characteristics:

• programming languages are more formal;
• natural languages are ambiguous;
• programming languages have much smaller vocabularies of identifiers, but some identifiers in programs are not language dependent;
• larger natural language vocabularies contribute more to the randomness (entropy) than to the complexity (information content).

Since long-range correlation, complexity and information content are closely related, the presence of long-range correlation suggests its possible use in disciplines where:

• the information content of human writings must be estimated, as in information retrieval, data mining, Internet searching, and similar;

• entropy must be measured, as in estimating the number of faults in computer programs, the number of faults in information retrieval queries, and similar.

The presence of long-range correlation in both natural language texts and computer programs indicates a possible direction for our future research. The existence of common laws enables us to find and identify them, providing an opportunity to gain more insight into natural language development and, as a consequence, more insight into the development of the human species in general.

8. References

1. Morowitz, H. (1995). The Emergence of Complexity. Complexity 1(1), 4.
2. Gell-Mann, M. (1995). What is Complexity. Complexity 1(1), 16-19.
3. Schenkel, A., Zhang, J., Zhang, Y. (1993). Long Range Correlations in Human Writings. Fractals 1(1), 47-55.
4. Peitgen, H.-O. et al. (1993). Chaos and Fractals – New Frontiers of Science. Springer Verlag.
5. Kokol, P., Kokol, T. (1996). Linguistic Laws and Computer Programs. Journal of the American Society for Information Science 47(10), 781-785.
6. Kokol, P., Brest, J., Žumer, V. (1997). Long-range Correlations in Computer Programs. Cybernetics and Systems 28(1), 43-57.
7. Kokol, P. et al. (1999). Alpha – a Generic Software Complexity Metric. Proceedings of the ESCOM Conference, to be published.
8. Shannon, C. E. (1951). Prediction and Entropy of Printed English. Bell System Technical Journal 30, 50.
9. Grassberger, P. (1989). Estimating the Information Content of Symbol Sequences and Efficient Codes. IEEE Transactions on Information Theory 35, 669.
10. Benthem, J., Meulen, A. (1997). The Handbook of Logic and Language. North-Holland.
11. Gibbs, W. W. (1998). The Profile of Claude Levi-Strauss. Scientific American 278(1), 24-25.
12. Buldyrev, S. V. et al. (1994). Fractals in Biology and Medicine: From DNA to the Heartbeat. In: Fractals in Science (Eds. Bunde, A., Havlin, S.), Springer Verlag.
13. Atlan, H. (1989). Noise, Complexity and Meaning in Cognitive Systems. Rev. Intern. de Systemique 3(2).
14. Young, J. Z. (1978). Programs of the Brain. Oxford U.P., Oxford.
15. Francois, C. (Ed.) (1997). International Encyclopedia of Systems and Cybernetics. K.G. Saur, Munchen.
16. Wilson, K. (1989). Les phenomenes de physique et les echelles de longueur. Pour la Science. Diffus, Berlin, Paris.
17. Parkinson, G. H. R. (Ed.) (1982). The Theory of Meaning. Oxford U.P., Oxford.
18. Li, W. (1991). On the Relationship Between Complexity and Entropy for Markov Chains and Regular Languages. Complex Systems 5(4), 381-399.
19. Edmonds, B. (1997). Measures of Complexity. http://www.cpm.mmu.ac.uk/~bruce/combib/
20. Peng, C. K. et al. (1992). Nature 356, 168.
21. Ebeling, W., Neiman, A., Poschel, T. (1995). Dynamic Entropies, Long Range Correlations and Fluctuations in Complex Linear Structures. Proceedings of Coherent Approach to Fluctuations, World Scientific.
22. Ebeling, W., Poschel, T., Albrecht, K. F. (1994). Entropy, Transformation and Word Distribution of Information Carrying Sequences. Research Report, University of Berlin, Berlin.
23. Li, W. (1997). The Study of Correlation Structures of DNA Sequences: A Critical Review. To be published in Computers & Chemistry.
24. Wegner, P. (Ed.) (1989). Programming Languages Paradigms. Computing Surveys 21(3).
25. Miller, G. A. (1991). The Science of Words. New York: Scientific American Library.
26. Arista, R., Dry, H. (1999). The Linguist List. (http://www.linguistlist.org)
27. Floyd, R. W., Beigel, R. (1994). The Language of Machines. New York: Computer Science Press.
28. Devlin, K. (1992). Logic and Information. London: Cambridge University Press.
29. Bradbury, M. (1996). The Atlas of Literature. London: LTD Press.
30. Lyu, M. R. (Ed.) (1995). Software Reliability Engineering. Washington: Computer Society Press.
31. Eghe, L., Rousseau, R. (1990). Quantitative Methods in Library, Documentation and Information Science. Amsterdam: Elsevier.
32. Tucker, A. B. (Ed.) (1997). The Computer Science and Engineering Handbook. Boca Raton: CRC Press.
33. Malik, A. M. (1999). Evolution of the High Level Programming Languages. Sigplan 33(12), 72-80.
