Information Retrieval and Database Architecture for Conventional Japanese Character Dictionaries

Lothar M. Schmitt, Jens Herder, and Subhash Bhalla
University of Aizu, Fukushima 965-80, Japan
voice: [+81](242)37-2537, fax: [+81](242)37-2549
email: {lothar, herder, bhalla} [email protected]
Abstract

The cycle of abstraction-reconstruction, which occurs as a fundamental principle in the development of culture and in cognitive processes, is described and analyzed. This approach leads to the recognition of boundary conditions for, and directions of, probable development of cognitive tools. It is shown how the transition from a conventional Japanese-English character dictionary to a multi-dimensional language database is an instance of such an abstraction-reconstruction cycle. The individual phases in the design of a multi-dimensional language database based upon different computer software technologies are investigated in regard to the underlying cycle. The methods used in the design of a multi-dimensional language database include the use of unix software tools, classical database methods as well as the use of search engines based upon full-text search. Several directions of application and extension for multi-dimensional language databases are discussed.

Keywords: abstraction-reconstruction cycle, dictionary, unix, database, retrieval methods, hypertext, search engine
1 Introduction

The introduction of the personal computer in the 1980's caused severe changes in the publishing industries. From then on, typesetting and layout of books has been done using text-editing and text-formatting programs [68][54]. The spread of personal computers has now progressed so far into society that not only the production of books is done on computers but also accessing information from books (or properly reformatted sources) is done via personal computers [31]. This holds, in particular, for dictionaries and scientific publications.
The phenomena, which we have just listed, are part of an abstraction-reconstruction cycle which we shall describe and analyze in this paper. By an abstraction-reconstruction cycle, we mean the succession of two processes: first, tools are reduced in size; second, the technology or knowledge gained in the reduction process is used to (re)build most of the environment where the reduction process took place. Abstraction-reconstruction cycles seem to govern many cultural and cognitive processes, as will be outlined below. For example, the introduction of the personal computer is the result of several efforts of minimalization and subsequent reconstruction, both on the hardware as well as on the software level of computer components. A key abstraction process in regard to the use of computers is the reduction of the underlying writing system to the binary system, which can be used to represent most types of data and allows the use of electronic devices for data processing on the hardware level. An abstraction process on the software level is the introduction of a somewhat minimal procedural programming language (C [49]). Both the minimal writing system and the minimal programming language are subsequently used to implement other, higher-level software components for data processing such as text editors that (re)construct typewriter functions. In regard to cognitive processes, the concept of an abstraction-reconstruction cycle seems at least to describe the development of the results of cognitive processes, such as tools. We do not insist that this concept describes processes that actually occur in the brain. For example, Maxwell's equations [6] are the result of an abstraction process, and the whole of non-relativistic electrodynamics can be recovered from them. The concept of an abstraction-reconstruction cycle can deliver concrete support in decision making due to an understanding of the overall process and its direction.
For example, one observes that the components that are added as peripherals to computers make the computer more and more humane in the sense that man-machine communication becomes simpler and relies on normal communication skills such as plain language [73]. Obviously, such a point of view allows one to make decisions about future developments of peripherals, e.g., which developments to prefer and which of those to consider first. It also shows the limits of developments: a computer keyboard must fit the dimensions of the human hand as long as it is used for input of textual data; and one does not need to process hundreds of frames per second for a virtual reality application. The starting point of our work is the source file of a well-known Japanese-English dictionary, the Nelson dictionary [66], which was generated with a "What You See Is What You Get" editor with a printed book in mind. Already the process of using an editing program on a personal computer is a significant step of abstraction in regard to the representation of the book by a writing system. Consequently, the system computer-editor-data is able to interact to a much higher degree with the user than the result (i.e., the printed book). For example, the computer can determine whether or not a specific word exists in the file representing the dictionary in a much faster and more accurate way than a human can make the same decision using the book. Already, this system is a cognitive tool supporting certain cognitive tasks [52]. We have used unix tools to transform the source file of the dictionary into two different multi-dimensional databases. First, we generated a system of tables for a classical relational database application [28]. This goes along with a process of reduction of the overall size of the data. In fact, a relational database ideally defines a minimal set of propositional units [81]. From this data representation, we can reconstruct the dictionary as a book using, for example, TeX, and we can (re)implement the standard lookup method of the Nelson dictionary (using radicals and stroke counts) by computer, but we can go far beyond that, as we shall outline below. The second database application, which we have constructed, is a clickable hierarchy of html-formatted files [65]. We access this file system using a search engine (Excite [31]), which automatically generates a user-friendly interface, and an index based upon a weighted graph which in turn is based on probabilistic principles combined with the evaluation of contextual distance. Here, the system we use becomes even more humane: the input language is plain written language, in contrast to a noun-phrase (quantifier) driven classical database language such as sql.
Altogether, we present a detailed analysis which shows why the transition from a conventional Japanese-English character dictionary to a multi-dimensional language database is an instance of an abstraction-reconstruction cycle, and we indicate several possible directions for future development and applications.
2 The Abstraction-Reconstruction Cycle

In this section, we shall illustrate the concept of an abstraction-reconstruction (or sometimes just reduction-reconstruction) cycle that occurs in many cultural and cognitive processes. Our analysis in regard to the development of culture will be from a historical point of view. The analysis in regard to cognitive processes will be from the point of view of properties of neural networks. Some people accept the idea that new technologies and cultural development can be "mental externalizations" [37], and this is, essentially, our point of view here. However, one can also find sharp criticism of such reasoning [44]. The concept of an abstraction-reconstruction cycle has concrete applications. Viewing (external) processes in this way allows one to determine or even choose their direction of development. It also allows one to determine the next canonical step or saturation limits for implementation of the reconstruction process in regard to available technology. This may be used to make investment or other planning decisions.
2.1 Reduction in Size

Reduction in size is one of the most prominent driving forces in cultural development. Very often, an individual, group or society that uses smaller tools for a specific purpose operates in a more energy-efficient way and/or gains a technical, financial, military or other advantage. Smaller tools can be stored and operated in greater numbers in the same physical space, which in some cases creates a higher chance for synergy effects such as networking. Prominent examples of hardware that have undergone a process of reduction in size include stone tools of prehistoric humans, tools in medicine, guns, atomic bombs, bicycles, automobiles, telephones, and computers (in most of their components). The limits of these processes are in many cases defined by the dimensions of the human body. Most tools mentioned above must fit the
human hand or carry the human body. Atomic warfare is bound by the requirement of a critical mass of nuclear fuel. The size of logic gates in semiconductor chips is limited by the size of atoms. In a similar way, structuring (and the concurrent simplification of data) is possibly the most prominent driving force in the development of the way we see our world, record our world, and think about our world. Proper structuring (or encoding) makes it possible to process more data concurrently through limited channels. The simplest form, clustering or coding experience by category, is fundamental to mental life because it greatly reduces the demands on perceptual processes, storage space and reasoning processes, all of which are known to be limited [79]. Clustering seems to be a process that occurs in the human brain from earliest childhood [72]. If clustering is iterated, then this naturally yields hierarchical data organization. Experimental work (see, e.g., [15], [16], [17], and [26]) shows that some data such as words are indeed stored in a categorized, hierarchical fashion in the brain. Along with hierarchical data organization go the process of abstraction (by naming clusters and properties in the hierarchy and discarding multiple occurrences of information in the clusters) and the process of theory development. As stated in [12], abstraction and theory development produce cognitive economy through the integration of information in a small number of general principles. (See also [25].) Consequently, a subject that is properly described by axioms and underlying principles can be stored, presented and, hopefully, understood or processed more easily in a shorter amount of time. Every field of research constantly undergoes a process of abstraction. This tendency is possibly most prominent in Mathematics. The reader may, for example, compare the presentations of Analysis in [33] and [55]. An example of a grand unifying principle from a non-mathematical science is the discovery of plate tectonics in Geology. Extending our present discussion drastically to non-sciences, one may even go as far as seeing the Ten Commandments as a process of formulating a minimal set of laws that governs human behavior in society. Note that combined with the process of abstraction is, usually, the creation of new, specialized language. Cognitive processes in the brain may be the result of computational processes similar to those in artificial neural networks [22][43]. In [75], the internal hierarchical organization in a trained artificial neural network (NETtalk) designed to pronounce English text was examined. It was found that the structure of the internal hierarchical representation could be identified
with important distinctions and structures in the outside world [22]. Associated with every process of reduction of size is a collection of certain threshold levels determined by the past, current or future environment. If the size of a tool is below one of the associated thresholds, then its mere existence in connection with the existence of other tools in the environment opens up new possibilities. An example is the use of the compact disk, which was originally intended as a replacement for the classical vinyl record, as a mass storage device for data in computer technology. Thresholds may lie above and below the originally anticipated limit dimension. An example is the interplay of operating tools and the microscope in micro-surgery, where the dimensions of the operating tools are below the originally anticipated level. The history of science is, of course, filled with examples of the interplay of ideas from different disciplines (see [18] for examples). Note that time is an important factor to be taken into consideration in regard to combinations of tools or ideas. One component of a possible new tool may be available much earlier in time than another. In fact, components may "miss" each other in time since one may have been forgotten when the other is invented.
2.2 Reduction of the Writing System

For many areas of research or information processing, the process of abstraction or reduction in size means a reduction of the writing system used to describe the information. Possibly the most spectacular examples from modern-day sciences are the introduction of quarks [6] in Physics to bring order into the subatomic particle zoo and the description of the genome of living beings as strings of dna in Biology. Other examples are the introduction of risc technology in computers and the implementation of the unix [48] command compress [87]. All these examples of reduction of the underlying writing system triggered new technology or enhanced existing technology: the search for missing quarks, gene-manipulated food and genetic algorithms [51], new assembler languages and higher data processing speed, and reduced storage space and higher rates of data transmission, respectively. Other examples of the reduction of the writing or representation system are the invention of money (a small set of standardized tokens for exchange) and the attempt to introduce a single currency in the European Community. The Chinese and the Japanese languages use a large logographic system [42] of thousands of characters (kanji). Originally, a kanji represented one word in a
pictorial way. Over time, the representations of kanji have evolved so much that, today, the pictorial information is lost to a large degree. In both cultures, we can see a tendency of reduction and simplification of the writing system. In the Chinese culture, this happened, e.g., through the simplification of characters and the resulting collapse of characters in 1955 and 1964. In the Japanese culture, a similar example is the classification of kanji into main and non-main characters by the Japanese Ministry of Education [66]. Another example of reduction and simplification of the writing system in the Japanese culture is the introduction of hiragana and katakana, consisting of 46 characters each, which represent the sound of words in a syllabic writing system [42]. Another process of abstraction associated with the Chinese and the Japanese languages is the Chinese classification of kanji in regard to 214 radicals and the number of strokes in the kanji [88]. Even though these data do not uniquely identify a kanji and, thus, only allow classification in clusters of a limited number of single kanji, it is the introduction of a reduced 224-symbol writing system that makes this method work. The Nelson dictionary [66], where kanji are listed by radical and stroke number, is one realization of this abstraction process. A further step in reduction of the size of a writing system is the use of an alphabetic system instead of a logographic or a syllabic system. Most alphabetic writing systems contain about 25 letters [27]. The earliest-known alphabet was the North Semitic alphabet, consisting of 22 consonant letters. The North Semitic alphabet served as model for the Phoenician alphabet, which in turn served as model for the Greek alphabet. The Greek alphabet, in contrast to the earlier alphabets from which it developed, also contains letters for vowels. (See [27] for a detailed account of the facts just mentioned. [42] states that the alphabetic system was developed at only one place: ancient Greece. This seems to be incorrect.) The Roman alphabet, which is mostly used in European cultures, developed from the Greek alphabet. The more abstract concept of data representation through an alphabet had profound technical implications. Gutenberg's (re)invention of the printing technique could spread in Europe because the small set of characters allowed easy reproduction of the equipment used for printing, in contrast to printing the larger Chinese logographic system [18]. Printed material, particularly reprints of classical Greek and Roman sources and the Bible, was suddenly available in abundance. This flow of information was part of the Renaissance, which changed the mindset in
European culture and is still felt today. Later, the reduced size of the alphabet made it possible to invent the transportable typewriter for the us military when the technology for handling metal had caught up. Numbers are part of every language. Together with the words for elementary processing operations such as addition, numbers may be seen as a closed, small subsystem of most languages. The ten Arabic numerals are a very efficient writing system for numbers. The reduced size of the writing system combined with the relative simplicity of elementary arithmetic operations (e.g., compared to the use of grammatical rules in real language processing) made it possible to go one step further in the development of data-processing tools: mechanical computing devices such as cash registers transform data (i.e., the input and output of the processing device do not essentially coincide). The minimal, non-trivial alphabet is the binary alphabet. Using the binary alphabet and Boolean algebra (which is even simpler than arithmetic) allows one to process information by electric circuits in computers rather than mechanically. It is worth noticing that the physical representation of electric circuits underwent the process of size reduction from electric switches (relays), to vacuum tubes, to transistors and vlsi semiconductor chips. Consequently, computers went down in size from room-filling machines to the limiting outer size of a keyboard with a display. The process of optimization of internal components is far from complete. We would like to state that the size of the writing system of a language does not place any judgment upon that particular language or culture. The statement, supported by historical evidence, is simply that if the number of possible states of a device is limited, then only a smaller writing system can be processed. In regard to cognitive processes, it is worth noting that there are four distinct phases in a child's development of learning to read text represented by an alphabetic writing system [42]: the sight-vocabulary phase (reading a limited set of whole words), the discrimination-net phase (reading through deciding a match to a known whole word), the phonological recoding phase (using spelling-to-sound rules), and last the orthographic phase (using spelling alone). These correspond to the stages of the reduction process described above: the first two phases correspond to a logographic writing system; the third phase corresponds to a syllabic writing system; and the last phase corresponds, of course, to the underlying alphabetic writing system. Even though it seems unnecessary to use,
e.g., spelling-to-sound rules for reading, the phases in the process of learning to read show strong similarity to the stages of abstraction or size reduction of writing systems as outlined above.
2.3 Reconstruction

Concurrent with the process of reduction in size is, in most cases, the inverse process of (partial) reconstruction. When a certain level of simplification, reduction in size, or abstraction has been reached, entities that have been recognized as being of particular value in the previous process are reconstructed with the advanced technology now available. For example, the size of one's automobile may function as a demonstration of status and power. Consequently, large automobiles are still built today, but with the advanced technology obtained in optimizing smaller cars. In general, information is lost in the combined process of reduction-reconstruction. For example, Maxwell's equations [6] are used to deduce the whole of non-relativistic electrodynamics. However, it may not be studied anymore by physicists how specific laws of electrodynamics were obtained from experimental data. In the case of processing written language, it is interesting to observe how the process of reduction of the writing system which led to the applicability of processing by electric circuits is reversed. The first step is to represent integers as binary strings, making it possible to implement and perform numerical calculations. Processing numbers is (re)implemented by early programming languages. A subsequent step is the representation of the Roman alphabet as ascii code. This allowed record-oriented data processing, including text, using programming languages such as cobol [39]. Shortly afterwards, interactive processing of textual and numerical data on terminals, in particular retrieval, became possible through database technology using database retrieval languages such as sql [28]. Later, with the arrival of the vlsi semiconductor chip and the personal computer, the typewriter became obsolete, and processing of text, even mathematical treatises, using such tools as troff [68] and TeX [54] became standard. Processing of kanji was implemented through the production of specialized word processors in the Japanese market. This development reversed the process of reduction of the writing system: more and more rare kanji are in use today since they can easily be found in kanji-processing devices. The next step in regard to representation and processing of written language is the combined representation of ascii and kanji characters in, e.g., the unicode [83]
and its subsequent processing by computers in the Internet environment. Finally, a single standard is to be expected which identifies every character in use (or ever used) uniquely as a binary integer. It is interesting to note that a reduction of the size of the unit that can be placed on paper by the output device was also necessary to make this development possible. Printing the Roman alphabet requires such units to have the size of a printed letter. The introduction of matrix printers made printing kanji characters possible for computers. This invention by Japanese companies was certainly triggered by the desire to print Japanese kanji. Note, however, that the way kanji are printed is fundamentally different from the way kanji are written. The latter type of information is lost in the reconstruction process. Actually, the statement made above in regard to the replacement of the typewriter by the personal computer does not go far enough: the personal computer with its multiple window displays is replacing the desk top, including tools such as the fax machine, the typewriter, the pocket calculator, the pen for drawing, etc. The culminating point in the process of reconstruction of information processing may well be the introduction of the mouse as a substitute for the writing quill or even the finger. That the aspect of information processing with the finger is an important point in this development can be seen from the recent introduction of touchpads for laptops, where the touch of the finger is used to steer the cursor on the screen. This development of reduction and reconstruction of information processing is accompanied by the individual's loss of ability to perform certain tasks without equipment. Pocket calculators diminish the ability to perform numerical calculations on paper or in one's head. Machine-based processing of text diminishes the ability to write. In particular, computer-based processing of kanji makes it somewhat unnecessary to learn to write the (less common) characters by hand. The discussion of the two preceding sections until this point is summarized in Figure 1. Reduction occurs from a maximal size of a system of data or a tool (top left corner) until a minimal size is reached at point t0 in time. This is symbolized by the left descending line. In this process, thresholds 1-3 are crossed. The thresholds are determined by the overall (external) environment of the particular system being reduced. After a threshold is crossed, certain cooperative effects with other tools (combinations) can take place. After the minimum is reached at time t0, (partial) reconstruction occurs, symbolized by the rightmost ascending line (1). Reconstruction may occur
also at earlier stages of the reduction process, or may even occur continuously. This is symbolized by the second ascending line (2) in the figure.

Figure 1: Minimalization and reconstruction (size vs. time; reduction crosses thresholds 1-3 and reaches a minimum at t0, after which partial reconstruction and possible combinations with other tools occur)

Hopfield nets, a certain type of symmetric, single-layer artificial neural networks, show the following behavior [43][76]: the state of a Hopfield net after a new stimulus will be the training example that most closely resembles the stimulus. Thus, Hopfield nets function as associative memory: from an input pattern, the closest matching stored pattern is reproduced or recognized. It is well known that the human mind has the ability to reconstruct or recognize a seemingly full, correct representation of data from incomplete data. One example of this phenomenon is the phoneme restoration effect [85][5]. In [80], various effects of completion of optical data (mostly in regard to the blind spot) are reported. Subvocalizing, as described in [29], may be seen as a process of completing patterns acquired while learning to read. If one accepts that an abstraction process yields selected elements, pieces, or remnants from a larger set of data, and that completion by associative memory is a (possible) cognitive process in the brain, then this might give an indication why the reconstruction processes described above take place even when they may not be necessary. For example, the appreciation of the tool or gadget "eyes" (i.e., a pair of eyes on the screen that always looks in the direction of the cursor) may lie in the fact that the eyes add a human or pet-animal touch to the screen besides their practical use. Observe
also that we more and more reconstruct "ourselves" by combining computers with cameras, microphones, and loudspeakers and recreating our fantasies in virtual reality [8].
2.4 Abstraction as a Threat

An important aspect of the changes due to an abstraction-reconstruction cycle is the effect of detachment or separation from (understanding) tools and processes. This effect is discussed next in regard to the acceptance of new technologies. As we have seen above, the introduction of a simplified way to describe, classify and process information can induce severe technological or social change. New tools may make certain professions obsolete. Prominent examples are the dramatic change in the printing industry triggered by computer-based publishing techniques or the changes in industrial production due to the introduction of robots. A single European currency may be a threat to the autonomy of many governments in Europe. It may also be a threat to the role of the us dollar as leading world currency. Similarly, convenient access to any language may be seen by some people as a threat to their culture, autonomy or profession. Convenient access to a language leads to easier communication, comparison and judgment of data, services and goods. Suppliers can communicate more easily with potential customers. Such processes may be seen as a threat to the financial interests of parts of the business community and specialized workforce. One particular aspect of the process of abstraction is the new language accompanying such change. Afterwards, individuals can be classified into those that master the new language and the associated processes (experts) and those that do not. This generates new dependencies and new power structures. In general, skilled individuals gain power, and untrained individuals lose power. A more compact presentation of a mathematical field such as Analysis ([33] vs. [55]: the latter is more abstract, hence shorter and easier to understand) or the introduction of Maxwell's equations makes it simpler for the trained individual to acquire and to use this information. But a requirement for being able to access information completely is a "basic" education in regard to the subject. With today's technology, many people can almost instantly reproduce thousands of kanji. But the requirement for being able to access this way of data processing is to be able to operate a computer. People who do not have the basic skills are left out. Since no one can acquire the basic skills in all fields, this process can be seen
as an overall fractalization of culture and society. The associated feeling of loss of foundation or orientation may be felt as a threat, too.

The discussion of the two preceding paragraphs in this section up to this point is summarized in Figure 2, which shows the same process of abstraction-reconstruction as Figure 1. The level of expertise is lowest when essentially no understanding of the abstraction process is needed to operate a system (top left corner). The level of expertise is still relatively low while only reduction occurs (left descending line). After the point of minimalization is reached at time t0, a fully competent user of a system must have a relatively high degree of expertise since reconstruction from a minimal system is now necessary. The level of expertise is highest when it is required to reconstruct the full original range of a system in all detail starting at the point of minimalization (top right corner).

Figure 2: Increase of required level of expertise (degree of expertise needed to handle a system of a given size, from lowest at the top left to highest at the top right)

As outlined in [60], one can understand a particular emotional state as a node in the neural network of the brain which shares strong associative connections to other nodes connecting information causally linked to past occurrences of the particular emotional state. If a massive rewiring or change of synaptic connections has to occur in the neural network of the brain due to changes in the environment, and one assumes that such change relates in many instances to nodes containing information about emotional states, then this may be an explanation why technological change causes emotional states ranging from euphoria to fear and depression.
3 Processing Tools

The programming language C [49] can be seen as a (non-unique) somewhat minimal point of simplification where the following goals are achieved concurrently to a very high degree: C is close to actual machine language (the true minimal alphabet of machine operation) through the pointer concept; C is portable, i.e., independent from actual machine language and design; C is a procedural programming language which allows recursive definitions; and C supports processing both numerical data and text conveniently. If it is accepted that C represents a minimal point of simplification as characterized above, then it is natural to ask what type of reconstruction process is implemented by the most prominent software components used in this project: sed [48] and awk [4] under unix; database retrieval techniques using a relational database model [28]; and Excite, a search engine for systems of files [31].
3.1 Data Processing with UNIX Tools

The design of procedural programming languages is certainly strongly influenced by the dominance of the functional view of structure in Mathematics and by the fact that many early computer scientists were mathematicians by education. Functional modeling is a simple method applicable to well-defined, limited domains that actually can be completely formalized. Language is not a well-defined, limited domain. Language is a complicated, living, dynamical entity that constantly undergoes change and is potentially unlimited. Processing language is determined by the constant flow of input and output of data, the selection and processing of significant components in the flow such as patterns in the flow (e.g., grammar), and the spatial relation (i.e., context) of entities such as words in the flow. Language processing is also strongly influenced by our ability to process information like speech concurrently with information coming from other sources such as pictures. The unix operating system and programming environment with its commands and tools is a collection
of concepts which, combined, lead away from the functional, mathematical viewpoint of C and towards processing data in a way that more closely resembles the principles of language processing. A prominent feature of unix is the organization of programs in a flow of data using the pipe mechanism. Selection mechanisms or filters based upon pattern matching are realized by unix tools such as sed, awk, lex [57], or perl [84]. Moreover, lex is a tool designed to implement lexical analysis. A large class of grammars can be defined and processed using yacc [45]. The unix programming environment contains tools specifically designed for typesetting such as nroff [68] or troff. It is worth noting that troff includes the possibility to process mathematical formulas which are encoded in the way they are spoken. Finally, note that the unix operating system allows pseudo-parallelism on a single machine, i.e., the machine switches between certain tasks so fast that the user gets the impression that these tasks are performed concurrently. In summary, data processing with unix tools implements certain aspects of language processing (but is certainly far from fully implementing the latter) after a process of simplification towards a minimal, procedural, underlying programming language, i.e., C.
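As an illustration, such a flow of data through pattern-based filters might look as follows (a minimal sketch; the file name nelson.txt and the markup stripped by sed are hypothetical stand-ins):

    # Tabulate how often each word occurs in entries mentioning "flute":
    # grep selects lines, sed strips markup, awk counts -- a flow of data
    # through a pipe, each stage acting as a pattern-matching filter.
    grep 'flute' nelson.txt |
        sed 's/<[^>]*>//g' |
        awk '{ for (i = 1; i <= NF; i++) count[$i]++ }
             END { for (w in count) print count[w], w }' |
        sort -rn

Each stage is a small, reusable tool; the pipe composes them without any intermediate files, which is precisely the "flow of data" view described above.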
3.2 File Organization with UNIX

Unix uses a hierarchy of directories and files (file tree) to organize data, formalizing the process of clustering described above. Together with the unix command ln, this allows organization of data in the format of a directed graph or so-called network. As described above, using clustering and hierarchies is an efficient way to organize data for specific purposes and is certainly more structured than a linear table of data. However, organizing data hierarchically is not optimal for many practical applications. Some data can easily be retrieved and combined in a hierarchy (e.g., "Who is your superior?" in regard to a bureaucratic hierarchy) but other data require an extensive search through the tree (e.g., "Do all persons on your level have the same salary?"). As indicated by the latter examples, a hierarchy represents a weighted view of data which favors retrieval and communication of certain types of data and hinders the retrieval and communication of others. One possible form of organization of data for a radical and stroke-count based dictionary such as the Nelson dictionary [66] is a file tree organized by

1. Radical, identified through its radical number (top layer). (Specific data about the radical in a separate file.)
2. Number of additional strokes in the kanji containing the radical.
3. Kanji containing the radical, identified by its sequential listing number in [66]. (Kanji in a separate file.)
4. Number of strokes in the second kanji of a compound that contains the kanji (3) as first.
5. Compounds containing the kanji (3) as first kanji and having a second kanji with the specified number of strokes (4). (In a file.)

The data are organized in a relatively small format since, e.g., the information about the radical used to identify a group of kanji is contained in the list of subdirectories of the radical. Thus, it can be seen as a table in a database. The standard procedure in which the Nelson dictionary is used can be implemented by simply using the unix commands cd and cat in the file hierarchy proposed above. In addition to the standard method of use, all sorts of queries and combinations of data can be implemented using unix tools. However, non-standard queries and combinations of data may require extensive searches. In addition, non-standard searches, if not prefabricated as single unix commands, would require a detailed understanding of several programming languages. For example, traversing the file tree essentially needs shell programming, while searching in files essentially needs a programming language capable of pattern matching. An alternative to the organization of a file tree as described above would be formatting the individual files and directories in html [65] format and implementing the hierarchy through clickable links in the display of the files. A directory in the hierarchy as described above would then be represented as a clickable table of, e.g., radicals or stroke numbers. In summary, data organization in the file tree of unix favors hierarchical data organization, which has distinct advantages and drawbacks. A properly defined file tree represents a reduced volume of data and favors certain access methods. Non-anticipated access methods (i.e., access methods that are not implemented as a single unix command or as, e.g., a clickable, graphical user interface) require a high amount of specific, technical programming.
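As an illustration, the standard radical-and-stroke-count lookup in such a file tree could be performed as follows (a minimal sketch; the directory layout and file names are hypothetical):

    # Look up kanji under radical 214 with 2 additional strokes,
    # mirroring the printed lookup method of the Nelson dictionary.
    cd /dict/radical/214      # top layer: radical number
    cat radical.info          # data about the radical itself
    cd strokes/2              # second layer: additional stroke count
    cat */kanji.info          # all matching kanji entries

    # A non-standard query, by contrast, requires a full traversal:
    find /dict/radical -name compound.info -exec grep -l 'flute' {} +

The contrast between the two fragments makes the weighted view of the hierarchy concrete: the anticipated access path is a handful of cd and cat commands, while the non-anticipated query must visit the entire tree.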
3.3 Database Retrieval Techniques

The use of computers for the purpose of data management became very widespread after the invention of the cobol programming language, which is specifically designed to implement business transactions. In the 1960's, it was very common to organize data in a hierarchy based on the model of a hierarchically organized workforce. As indicated above, this model had limitations. An ad-hoc extension was the use of linked lists (networks as an improvement over hierarchies). However, traversal within the network model was very cumbersome. Also, the model was not suitable for all types of questions related to information extraction. In 1972, Codd, working at ibm, proposed the relational database model and the usage of relational algebra and relational calculus as a possible solution [24][28]. Essentially, relational algebra and relational calculus use the language of set theory to formulate queries and to retrieve data. ibm launched prototype development projects at three separate laboratories. The sql language and many other rdbms architectural aspects are a result of this research activity. In the 1970's, the restrictions on the size of main memory and the processing capacity of machines led to problems in dealing with such operations as JOIN for large data tables and delayed the acceptance of Codd's model. It took more than 10 years for the relational database model to become widely accepted. From the viewpoint of our discussion, the proper design and use of tables that represent relations of data is a simplification or reduction of size in several regards. First, by proper design of tables (i.e., defining carefully which data are put together with which other data in tables), the size of the files containing the data is significantly reduced compared, e.g., to a list of all complete records. See [28] for techniques to achieve a minimal representation of data. Second, generating tables properly defines basic relations from which other relations can be (re)generated (temporarily) using such operations as JOIN. Generating excessively large intermediate tables via JOIN can be avoided by properly preselecting from the tables containing the basic relations. The reduction to a set of basic relations is, obviously, a process of abstraction similar to distributing the data in a hierarchical unix file tree where the basic relations are defined by listing subdirectories. In summary, data organization through tables and relations defines a set of propositional units [81]. Data retrieval methods are implemented using relational algebra and relational calculus, which are very similar to the language of set theory. Data retrieval methods through a programming language such as sql implement a more specialized type of language consisting of a certain collection of quantifiers (or noun-phrases).
Access methods, which are not anticipated by the designer of the database system, may require programming, but only in a single programming language.
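A minimal sketch of this style of retrieval, assuming the tables described in Section 4.3 have been loaded into an sqlite3 database dict.db (the table and column names here are hypothetical):

    # Recover a derived relation (a radical together with its kanji)
    # temporarily via JOIN; preselecting by radical number keeps the
    # intermediate table small.
    sqlite3 dict.db "
        SELECT r.comment, k.kanji
        FROM   radical r
        JOIN   kanji   k ON k.radical_no = r.radical_no
        WHERE  r.radical_no = 214;
    "

The WHERE clause is applied before the join contributes rows to the result, which is exactly the preselection strategy mentioned above for avoiding excessively large intermediate tables.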
3.4 Using a Search Engine

The use of a search engine such as Excite [31] is probably the simplest way of implementing search methods and generating a simple, user-friendly interface with today's computing means. The task of the organizer/designer of the underlying database is reduced to providing a not necessarily structured collection of files, each containing a complete set of data for one unit. Units are radicals, kanji and compounds in our case. The files containing data sets can be formatted using the html language such that they can be viewed conveniently using an Internet browser. In fact, the file a user sees displayed can even be generated dynamically. The search engine then automatically builds up an index based upon probabilistic relational principles. This is somewhat similar to the organization or learning of (artificial) neural networks, where the weights for synaptic connections are changed in accordance with frequency of use and relational encounter using feedback mechanisms. The search engine also generates an interface for input of text which can be incorporated in the display of every file containing a data set. This makes it possible to proceed from any displayed result further in a search without having to return to a single point of entry. The input language for the interface is normal language (i.e., words, lists of words, sentences), not a specialized programming language. In view of our discussion, human written language is fully reconstructed as the input device. In addition to normal language, the interface allows Boolean combinations of language fragments as input; e.g., it allows searches for files explicitly containing two given words. In summary, using a search engine such as Excite [31] fully reconstructs normal written language as the input language for search in the database. By full reconstruction, we mean that any piece of normal written language is admissible as an input pattern which is used for the subsequent search. In that sense, the interface and its use is much more humane than the usage of unix or relational database technology. The design of the database is reduced to a mere listing of all entries. Abstraction is performed by the machine in creating a weighted graph of relations between units in the database. Note that the use of the search engine as a "black box" takes away control from the designer of the database and gives control to the designer of the search
engine. The designer of the database is detached from the processes that generate the database and its relations.
4 Creating a Database from a Dictionary

In this section, we shall describe some specific implementation issues in regard to our project and some of their applications. The techniques, which we have developed here to reformat and enhance the source file of [66], can be applied to other sources.
4.1 The Original File

The starting point of our work is the source file of the Nelson dictionary [66] in Rich Text Format (rtf), which was generated with a "What You See Is What You Get" (wysiwyg) editor with a printed book in mind. While being a valuable object already, this source has several severe flaws in regard to being accessible by computer for purposes other than formatting and subsequent printing:
The source is "dirty": it contains many control sequences coming from the wysiwyg editor which have no meaning except for the format and the spacing in the printed book. In particular, the source is not formatted in a way that it can be used directly as a database in connection with programming languages such as prolog [23] or sql.
The source cannot be "cleaned" of the control sequences mentioned above in an easy fashion (i.e., by just removing them). In fact, some of the control sequences in the source do carry meaning: Japanese is represented using kanji, kun-pronunciation and on-pronunciation. The kun-pronunciation and the on-pronunciation are typeset using the small-caps and the italics character sets, respectively. In the source, the associated text is framed by corresponding unique pairs of control sequences.
The source was typed with a regular layout in the printed book in mind. Though quite regular already, it contains a certain collection of describable irregularities: one example is that the italics mode used to indicate on-pronunciation was turned off in, e.g., "kotoyo(su)" just before the closing parenthesis was typed.
Some minor portions of the source are very irregular and require correction by a human inspector prior to processing by machine. The complete listing from the source for the last radical No. 214 is as follows:
17-STROKE RADICALS\par \par Rad. \f16704 \'EA\'9E\f16 214\par \par \i Fue\plain flute. Nickname: Flute.\par \par \f16704 \'EA\'9E\f16 7107 J737e M48882 \scaps Yaku \plain flute.\par
where the last two lines in our listing are just one line in the source. A second source file was generated from the first using the wysiwyg editor's feature for producing a control-sequence-free version of a file. The second source file contains kanji characters formatted in a way which is more suitable for our purposes. In fact, kanji characters are encoded in such a way that they can be read directly by an Internet browser, which we use for display. Thus, using the second source file saved the work of implementing a conversion program for kanji encoding. In addition, the second source file is already somewhat more regular in the sense that the spacing in this file corresponds to information we can use. For example, an empty line corresponds here, in most cases, to a new entry in the dictionary. However, some valuable information such as the distinction between kun-pronunciation and on-pronunciation is lost in this version of the source. Note that in regard to our discussion of an abstraction-reconstruction cycle, the use of a wysiwyg editor already creates a flexible, interactive environment for the person who typesets a document, but not for the end-user. The document already represents a machine-searchable database. For example, a document in a wysiwyg editor can be checked for spelling or grammatical mistakes by machine.
4.2 Processing with sed and awk

We have developed a family of carefully crafted sed and awk based filters that are able to transform the two combined source files into a database of almost any desired format. These include a prolog database, a relational database consisting of several tables, and a collection of html-formatted files which can be accessed and displayed with an Internet browser and searched with a search engine. As described above, the first task in the process of generating a database from our sources is to discard some but not all of the formatting control sequences of the first source file. In other words, we have to re-edit the source. This is preferably done by a program. The stream editor sed is ideally suited for this purpose since it combines pattern matching, which we need to recognize relevant pieces of the source, with editing functions such as substitution. First, both source files were merged line-by-line and relevant information was extracted from every pair of lines. Merging was achieved through pattern matching, observing that not all but most lines correspond one-to-one in both sources. Unneeded control sequences were eliminated in this process. As mentioned above, one problem in that regard was the fact that openers and terminators of, e.g., the italics mode were sometimes scrambled with control sequences for other functions of the editor. In order to counter this effect, a finite collection of commutation relations for control sequences was applied to the first source file using the tagged regular expression mechanism of sed in order to match openers and terminators properly. With these techniques, the kanji and the three character sets used to represent kun-pronunciation, on-pronunciation and English could be identified properly. After the source file was properly cleaned and relevant pieces from the two sources identified (marked or tagged) by sed, we used awk to generate a format from which all sorts of applications are now possible. The original source is, up to a small number of exceptions, typed regularly enough that, using pattern matching in awk and a suitably defined grammar, the three categories of entry, i) radical, ii) kanji, and iii) compound, can be identified properly. This application naturally yields a valuable by-product: by counting all units, a complete index for the dictionary including the compounds is generated, which was a previously non-existing feature. This is useful for finding compounds in the dictionary. All relevant pieces of data in the final format can be picked out by awk as fields in a table and framed with any desired syntax. We remark that parsers for rtf files that are selective in the way needed here do not exist. Note that we can recreate the book by framing elements in the final table with suitable TeX syntax and processing the resulting file using jlatex. In addition, it is possible to use the reformatted source to create English→kanji as well as kun/on-pronunciation→English or kun/on-pronunciation→kanji dictionaries, which considerably extend the previously existing kanji→kun/on-pronunciation→English dictionary, simply by using the unix command sort combined with the generation of suitable formatting syntax using, e.g., awk. In summary, the combination of unix text processing tools with the source file of a dictionary can be used to create other types of dictionaries by machine. The use of ad-hoc methods at this stage of the project, rather than the development of a sophisticated grammatical description of the whole source and processing with higher-level tools such as lex and yacc, is justified by the fact that the formatting operations just described are used only once. Commenting on the theme of reduction of size, let us remark that the reformatted source, which includes labeling key words, has about 60% of the size of the original rich-text-formatted file.
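The flavor of these filters can be sketched as follows (a minimal sketch; the markers and field layout shown are simplified stand-ins for the actual control sequences of the source):

    # Step 1 (sed): tag the on-pronunciation, which is framed by the
    # italics control sequences \i ... \plain; the tagged regular
    # expression \(...\) captures the framed text.
    sed 's/\\i \([^\\]*\)\\plain/<ON>\1<\/ON>/g' nelson.rtf |
    # Step 2 (awk): pick the tagged field out of each entry and frame
    # it with '#' separators for a database table.
    awk 'match($0, /<ON>[^<]*<\/ON>/) {
             on = substr($0, RSTART + 4, RLENGTH - 9)
             print ++entry "#" on
         }'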
4.3 The Relational Database

One application of the reformatting process just described is the generation of several tables for a database. The tables correspond to the three categories of entry in the Nelson dictionary: i) radical, ii) kanji, and iii) compound. Table 1 contains entries of the following type with data about radicals:

radical#No. of strokes#No. of radical#Comment
This table is intended for the search by radical as key index, as originally intended in the Nelson dictionary. It identifies the number of the radical, which is used in other tables as the key index. The Comment field contains the whole structured data set for the radical, including such items as its meaning or references to listings in other dictionaries. It can be displayed to give the user of the database additional information during a search. Table 2 contains entries of the following type with data about radicals:

No. of radical#word1#...#word10#Comment

where word1...word10 list kun/on-pronunciations and English meanings. This table is intended for the search by English meaning or pronunciation. Again, Comment contains the whole structured data set for the radical. Table 3 contains entries of the following type with data about kanji:

No. of radical#Kanji#classification#No. of additional strokes#index No.#reference1#reference2#reference3#Comment

Here, classification is the classification of the kanji by the Ministry of Education of Japan. index No. sequentially counts all kanji with the same number of strokes that follow a particular radical in the Nelson dictionary, as part of the complete index of the dictionary. reference1...reference3 refer to the cross-reference system within the Nelson dictionary. This table is intended for continuation of the original search method in the Nelson dictionary by combining it with Table 1. Comment contains the whole structured data set for the kanji, including pronunciations and meaning. Table 4 contains entries of the following type with data about kanji:

kanji#word1#...#word10#Comment

where word1...word10 list kun/on-pronunciations and English meanings. Again, Comment contains the whole structured data set for the kanji. Tables 5 and 6 contain data about compounds and are structured similarly to Tables 3 and 4, respectively. Tables 1-6 define a relational database scheme. The possibilities of searching the database with queries based upon the relational calculus far exceed the possibilities of the original source. The use of the sql database language to access information raises the level of access and control for the user considerably. For example, we can search by approximate stroke count in the first and the second kanji of a compound by searching a range of stroke numbers. Such a search frees, in particular, the beginner from the obligation to identify the radical in the leading kanji and to provide an exact stroke count for kanji and radicals in order to use the database. Possibly, the ability to identify radicals and provide exact stroke counts may become obsolete. The fact that the machine can combine queries searching multiple tables has two main consequences: i) the response time for a search is reduced dramatically (compared to manual search); ii) the set of possible solutions becomes small enough that the human end-user can conveniently pick the particular data he or she needs. Note that combining suitable queries as above is needed to eliminate a very important problem which arises in the use of Japanese dictionaries: multiplicity. For example, the search by stroke count in a single
kanji can lead to several hundred entries in the dictionary. Another example is the fact that a single kanji character usually has many different meanings. Large sets of solutions for simple queries make proper identification of kanji and compounds in text quite tedious, in particular if several dictionaries have to be used. At this stage, we can also extend the database in a simple manner just by providing additional tables. For example, we can compile an additional table that contains the numbers of horizontal, vertical or diagonal strokes in kanji. These can, of course, also be searched approximately, which may counter ambiguities in regard to what constitutes, e.g., a diagonal stroke. See [1], [2], [3], and [10] for a more detailed analysis of some aspects of the discussion in this section. The use of the sql programming language with its quantifier (noun-phrase) based syntax not only introduces search methods that can be formulated closer to ordinary language: it also allows approximate searches in regard to numerical data which could not be performed by humans in view of the excessive amounts of data that would have to be scanned. Combining the reformatted source of the Nelson dictionary with database methods opens up a variety of previously impossible applications. Commenting on the theme of reduction of size, let us remark that the combined tables have approximately the same size as the original rtf file due to the listing of comments. If this listing is ignored and only those fields are considered that are unstructured and directly accessible via sql, then the combined size of the tables is approximately 15% of the original source.
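The approximate search described above might, for instance, be expressed as follows (a minimal sketch against a hypothetical compound table derived from Tables 5 and 6, with assumed column names for the stroke counts):

    # Find compounds by approximate stroke counts alone: the user
    # guesses 11-13 strokes for the first kanji and 6-8 for the
    # second; no radical identification or exact count is required.
    sqlite3 dict.db "
        SELECT compound, comment
        FROM   compound
        WHERE  first_kanji_strokes  BETWEEN 11 AND 13
          AND  second_kanji_strokes BETWEEN 6  AND 8;
    "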
4.4 Using the Search Engine

We have also generated a collection of html-formatted files containing data related to items such as radicals, kanji and compounds. Essentially, every file contains the complete set of data pertaining to one item. This is done in order to give the search engine (Excite) the greatest possible amount of sensible information which it can use to generate the weighted graph underlying the indexing procedure. In addition to the data pertaining to the items, the html link mechanism, which can link a symbol in one file with another file (or some other place in the same file), is used to retain the original hierarchy of the Nelson dictionary and to allow the classical lookup method. Consequently, the file containing data related to a particular radical contains a link to a clickable table of stroke numbers, which in turn contains a link to a clickable table of kanji. In addition to that, the file containing data related to a particular radical contains a link to a clickable table of kanji containing this radical. This is useful for users who cannot count strokes but may be able to identify a kanji visually. Similar links are implemented for the transition from kanji to compounds. A typical display for a compound is shown in Figure 3. Note that searching by clicking relies to a large extent on processing visual information and pattern matching. For example, instead of identifying a radical and subsequently scanning the book linearly (!) until the radical is found, the user is presented with a table containing all radicals in an array. Searching for the radical can at least in part use parallel processing since one is able to look at several radicals in the table at the same time. If the classical lookup method is used extensively in our system, then the places of radicals in the table will be learned and the process of finding a radical and associated information is accelerated. Usually, a user cannot open the Nelson dictionary at precisely the page where the entries pertaining to a particular radical start. One disadvantage in the implementation of the search engine at this time is that it cannot process digits and can otherwise only process ascii characters. In order to overcome these difficulties and to use the search engine effectively, we have taken the following measures: i) We have translated numbers in the data set into words; thus, instead of searching for the number 14, one can search for the word fourteen. ii) Using the l-command of sed, we have converted every kanji (and compound) into a unique pseudo-word and included that pseudo-word in the file containing the data in regard to the kanji. In addition, we have implemented the same conversion mechanism as a separate unix command. Using this unix command in a terminal window combined with the copy-paste mechanism of a workstation's window system allows a relatively convenient input method for kanji into the search engine. We expect that this technical difficulty will cease to exist in the near future. As outlined above, searching a system of files with, e.g., Excite uses the whole of the languages involved as input language. Thus, the user does not have to learn any specialized programming language in order to use this system. In addition, searches can be combined in a way similar to the database access method of using Boolean operators (i.e., special keywords such as AND) which the search engine understands.
As outlined above, searching a system of files with, e.g., Excite uses the whole of the languages involved as the input language. Thus, the user does not have to learn any specialized programming language in order to use this system. In addition, searches can be combined, in a way similar to the database access method, by using Boolean operators (i.e., special keywords such as AND) which the search engine understands. The search engine does not merely find exact matches to a given query: it produces a collection of matches ordered by probabilistic closeness to the given query. This probabilistic component of the search, together with the fact that the user does not have to learn a query language, gives the search engine method using, e.g., Excite additional power compared with classical database methods using, e.g., sql. Overall, our system supports the following goals formulated in [86]: i) The dominance of the page as display format is maintained. ii) Formal navigation through the visual structure is enhanced. iii) Indexing is made possible on every page through clickable links and the enclosed window for further textual input into the search engine. iv) Invisible structural context is constructed by the search engine. In fact, the search engine may find matches to a query that do not contain components of the query but are close to these in the underlying weighted graph of the search engine's indexing mechanism. v) We can automatically generate the pages and update any information. In addition, we achieve the following goals [52]: i) The cognitive load is shared by the tool and the user, since the tool provides support for lower-level skills such as visual pattern matching, so that resources are left over for higher-order thinking skills. ii) The tool allows the user/learner to engage in cognitive activity that would otherwise be out of reach, such as identifying kanji without being able to perform a stroke count. Note that, in addition to the above, the use of the search engine breaks the linear structure of the original dictionary completely and allows multiple orderings of information [56].
5 Future Extensions
In view of the abstraction-reconstruction cycle that we have presented as an underlying (cognitive) principle for the process of generating several variants of the Nelson dictionary, we can ask which future developments and extensions are likely to be implemented. Some of the ideas listed below have been developed in [1], [2], and [3]. In regard to the view that we mainly create externalizations of ourselves, we can expect that electronic dictionaries will be coupled with all sorts of sophisticated audio and visual devices. The database can be coupled with kanji recognition systems. It can likewise be coupled with sound processing devices to allow natural and artificial speech as input and output media. This can be used to construct tools for impaired persons.
Figure 3: Display of a compound
It can also be used to train skills such as correct pronunciation. As pointed out in [64], learning a particular language is strongly connected to learning aspects of the culture(s) associated with that language. In view of that, one can think of combining our database with other tools such as the electronic version of Webster's dictionary [38] in order to include definitions of words in the words' host language. Another possibility is the combination with an electronic version of a dictionary such as Random House's Word Menu [35], where the vocabulary is hierarchically structured similarly to the way the outside world seems to be structured. For example, the cluster of words related to the kitchen contains all tools usually used in that room. Finally, let us mention the idea of coupling the dictionary with an online corpus such as the Brown corpus [21], from which example sentences for the use of a particular word or phrase can be extracted. As pointed out in [40], (reading) skills are enhanced if words are presented together with examples of usage. Using an increasingly humane input/output mechanism for database systems, with plain language as input medium and a graphically enhanced display as output, we achieve some of the overall goals formulated in [73]: i) Computer-based tools are designed pedagogically, that is, as cognitive instructional tools for mindful teachers and learners in a culture of problem solving (e.g., performing a search in the database is simplified). ii) The minds of intentional learners are extended and empowered (e.g., through enhanced possibilities for search). iii) Learners are provided with guidance (e.g., through the use of clickable tables). iv) Tools stimulate and provide communication (e.g., the tool can be used for translation).
6 Conclusion
Our presentation started with the introduction of a general principle, the abstraction-reconstruction cycle, which seemingly occurs in many (large-scale) cultural as well as cognitive processes. Creating a multi-dimensional language database by machine from the source of a linear dictionary in printed form can be seen as canonically embedded in such an abstraction-reconstruction cycle, in which information is organized in a more and more structured way, and in which tools such as the originally printed version of the dictionary from which this information was extracted can be reconstructed with great efficiency, even in extended versions. In addition to that, the development of a multi-dimensional
language database can be seen as part of the general trend in the development of computer tools to mimic or reconstruct cognitive abilities of the mind. This point of view makes it possible to envision future developments of the tools used in this project in such a way that their overall appearance and use become more and more humane.
7 Acknowledgement
The authors wish to thank C.L. Nehaniv for several valuable comments and for pointing out several references in regard to this paper.
References
[1] H. Abramson, S. Bhalla, K.T. Christianson, J.M. Goodwin, J.R. Goodwin, and J. Sarraille (1995). Towards CD-ROM Based Japanese↔English Dictionaries: Justification and Some Implementation Issues. Proc. Third Natural Language Processing Pacific-Rim Symposium, Seoul, Korea, (Dec. 4-6, 1995), 174-179.
[2] H. Abramson, S. Bhalla, K.T. Christianson, J.M. Goodwin, J.R. Goodwin, J. Sarraille, and L.M. Schmitt (1996a). Multimedia, Multilingual Hyperdictionaries: A Japanese↔English Example. Joint International Conference of The Association for Literary and Linguistic Computing & The Association for Computers and the Humanities, Bergen, Norway, (June 25-29, 1996). http://www.hd.uib.no/allc-ach.abstract.html
[3] H. Abramson, S. Bhalla, K.T. Christianson, J.M. Goodwin, J.R. Goodwin, J. Sarraille, and L.M. Schmitt (1996b). The Logic of Kanji Lookup in a Japanese↔English Hyperdictionary. Joint International Conference of The Association for Literary and Linguistic Computing & The Association for Computers and the Humanities, Bergen, Norway, (June 25-29, 1996). http://www.hd.uib.no/allc-ach.abstract.html
[4] A.V. Aho, B.W. Kernighan and P.J. Weinberger (1978). awk: A Pattern Scanning and Processing Language (Second Edition). In: [47]. http://cm.bell-labs.com/7thEdMan/vol2/awk
[5] M. Akagi (1992). Psychoacoustic Evidence for Contextual Effect Models. In: [82], 63-78.
[6] M. Alonso and E.J. Finn (1992). Physics. Reading, MA: Addison-Wesley.
[7] D.P. Ausubel (1968). Educational Psychology: A Cognitive View. New York: Holt, Rinehart and Winston.
[8] N.I. Badler, C.B. Phillips and B.L. Webber (1993). Simulating Humans: Computer Graphics Animation and Control. Oxford: Oxford University Press.
[9] B.G. Bara (1995). Cognitive Science: A Developmental Approach to the Simulation of the Mind. Hillsdale, NJ: Lawrence Erlbaum Associates.
[10] S. Bhalla, H. Abramson, K.T. Christianson, J.M. Goodwin, J.R. Goodwin, J. Sarraille, and L.M. Schmitt (1997). Recognition of Japanese Kanji Characters by Non-Japanese Learners Through a Support Database System. Proc. International Conference on Cognitive Technology CT97, The University of Aizu, Japan.
[11] M.H. Bornstein and M.E. Lamb (Eds.) (1992). Developmental Psychology: An Advanced Textbook. Hillsdale, NJ: Lawrence Erlbaum Associates.
[12] G. Botterill (1996). Folk Psychology and Theoretical Status. In: [20], 105-118.
[13] J.D. Bransford (1979). Human Cognition: Learning, Understanding and Remembering. Belmont, CA: Wadsworth Publishers.
[14] J.S. Bruner (1966). Towards a Theory of Instruction. Cambridge, MA: Harvard University Press.
[15] W.A. Bousfield (1953). The Occurrence of Clustering in the Recall of Randomly Arranged Associates. J. of General Psychology 49, 229-240.
[16] W.A. Bousfield and B.M. Cohen (1955). The Occurrence of Clustering in the Recall of Randomly Arranged Words of Different Frequencies of Usage. J. of Genetic Psychology 52, 83-95.
[17] G.H. Bower, M.C. Clark, A.M. Lesgold and D. Winzenz (1969). Hierarchical Retrieval Schemes in Recall of Categorized Word Lists. J. of Verbal Learning and Verbal Behavior 8, 323-343.
[18] J. Burke (1985). The Day the Universe Changed. Boston: Little, Brown and Co.
[19] J.B. Carroll (1983). Studying Individual Differences in Cognitive Abilities: Through and Beyond Factor Analysis. In: [30], 1-34.
[20] P. Carruthers and P.K. Smith (1996). Theories of Theories of the Mind. Cambridge: Cambridge University Press.
[21] Centre for Humanistic Research. The Brown Corpus. P.O. Box 54, Bergen: University of Bergen.
[22] P.M. Churchland (1990). Cognitive Activity in Neural Networks. In: [67], 199-219.
[23] W.F. Clocksin and C.S. Mellish (1981). Programming in Prolog. Berlin: Springer-Verlag.
[24] E.F. Codd (1990). The Relational Model for Database Management: Version 2. Reading, MA: Addison-Wesley.
[25] F. Colin (1985). Theory and Understanding. Oxford: Basil Blackwell.
[26] A.M. Collins and M.R. Quillian (1969). Retrieval Time from Semantic Memory. J. of Verbal Learning and Verbal Behavior 8, 240-247.
[27] D. Crystal (1987). The Cambridge Encyclopedia of Language. Cambridge: Cambridge University Press.
[28] C.J. Date (1995). An Introduction to Database Systems. Reading, MA: Addison-Wesley.
[29] E. Dechant (1991). Understanding and Teaching Reading: An Interactive Model. Hillsdale, NJ: Lawrence Erlbaum Associates.
[30] R.F. Dillon and R.R. Schmeck (Eds.) (1983). Individual Differences in Cognition: Volume 1. New York: Academic Press.
[31] Excite Inc. (1996). Excite for Web Servers. Mountain View, CA: Excite Inc. http://www.excite.com/navigate/
[32] M.W. Eysenck (Ed.) (1990). Cognitive Psychology: An International Review. Chichester: John Wiley & Sons.
[33] E. Fischer (1990). Intermediate Real Analysis. Berlin: Springer-Verlag.
[34] J.A. Fodor (1983). Modularity of Mind. Cambridge, MA: Bradford Books, MIT Press.
[35] S. Glazier (1992). Random House Word Menu. New York: Random House.
[36] B. Gorayska and J.L. Mey (1996). Cognitive Technology: In Search of a Human Interface. Advances in Psychology 113. Amsterdam: Elsevier.
[37] B. Gorayska and J.L. Mey (1996). Of Minds and Men. In: [36].
[38] P.B. Gove (Ed.) (1961). Webster's Third New International Dictionary. Springfield, MA: Merriam-Webster.
[39] R.T. Grauer (1985). Structured COBOL Programming. Englewood Cliffs, NJ: Prentice-Hall.
[40] F. Grellet (1981). Developing Reading Skills: A Practical Guide to Reading Comprehension Exercises. Cambridge: Cambridge University Press.
[41] J. Halpern (1990). New Japanese-English Character Dictionary. Tokyo: Kenkyusha.
[42] M. Harris and M. Coltheart (1992). Language Processing in Children and Adults: An Introduction. London: Routledge.
[43] M.H. Hassoun (1995). Artificial Neural Networks. Cambridge, MA: Bradford Books, MIT Press.
[44] H. Hendriks-Jansen (1996). Catching Ourselves in the Act. Cambridge, MA: Bradford Books, MIT Press.
[45] S.C. Johnson (1978). yacc: Yet Another Compiler-Compiler. In: [47]. http://cm.bell-labs.com/7thEdMan/vol2/yacc.bun
[46] F.C. Keil (1990). Constraints on the Acquisition and Representation of Knowledge. In: [32], 197-220.
[47] B.W. Kernighan and M.D. McIlroy (1978). UNIX Programmer's Manual (Seventh Edition). Murray Hill, NJ: Bell Laboratories.
[48] B.W. Kernighan and R. Pike (1984). The UNIX Programming Environment. Englewood Cliffs, NJ: Prentice Hall.
[49] B.W. Kernighan and D.M. Ritchie (1988). The C Programming Language. Englewood Cliffs, NJ: Prentice Hall.
[50] D. Klahr (1992). Information-Processing Approaches to Cognitive Development. In: [11], 273-336.
[51] J.R. Koza (1992). Genetic Programming. Cambridge, MA: Bradford Books, MIT Press.
[52] S.P. Lajoie (1993). Computer Environments as Cognitive Tools for Enhancing Learning. In: [53], 261-288.
[53] S.P. Lajoie and S.J. Derry (Eds.) (1993). Computers as Cognitive Tools. Hillsdale, NJ: Lawrence Erlbaum Associates.
[54] L. Lamport (1986). LaTeX: A Document Preparation System. Reading, MA: Addison-Wesley.
[55] S. Lang (1983). Undergraduate Analysis. First Edition 1968 by Addison-Wesley. Berlin: Springer-Verlag.
[56] R. Lehrer (1993). Authors of Knowledge: Patterns of Hypermedia Design. In: [53], 197-228.
[57] M.E. Lesk and E. Schmidt (1978). lex: A Lexical Analyzer Generator. In: [47]. http://cm.bell-labs.com/7thEdMan/vol2/lex
[58] B.E.F. Lindblom and M. Studdert-Kennedy (1967). On the Role of Formant Transitions in Vowel Recognition. J. Acoust. Soc. Am. 42(4), 686-694.
[59] K. Lunde (1993). Understanding Japanese Information Processing. Sebastopol, CA: O'Reilly & Associates.
[60] C. MacLeod (1990). Mood Disorders and Cognition. In: [32], 9-56.
[61] D.W. Massaro (1986). Psychophysics Versus Specialized Processes in Speech Perception: An Alternative Perspective. In: [77], 46-65.
[62] D.W. Massaro (1987). Speech Perception by Ear and Eye: A Paradigm for Psychological Inquiry. Hillsdale, NJ: Lawrence Erlbaum Associates.
[63] D.W. Massaro (1984). Building and Testing Models of Reading Processes. In: [69], 111-146.
[64] B.S. Mikulecky (1990). A Short Course in Teaching Reading Skills. Reading, MA: Addison-Wesley.
[65] NCSA: The National Center for Supercomputing Applications (1996). A Beginner's Guide to HTML. The University of Illinois at Urbana-Champaign. http://www.ncsa.uiuc.edu/General/Internet/WWW/HTMLPrimer.html
[66] A.N. Nelson (1962, 1995). The Original Modern Reader's Japanese-English Character Dictionary (Classic Edition). Rutland, VT: Charles E. Tuttle Company.
[67] D.N. Osherson and E.E. Smith (Eds.) (1990). An Invitation to Cognitive Science, Volume 3: Thinking. Cambridge, MA: MIT Press.
[68] J.F. Ossanna and B.W. Kernighan (1978). Troff User's Manual. In: [47]. http://cm.bell-labs.com/plan9/doc/troff.ps
[69] P.D. Pearson (Ed.) (1984). Handbook of Reading Research. New York: Longman.
[70] N.A. Pickett and A.A. Laster (1993). Technical English: Writing, Reading and Speaking. New York: Harper Collins College Publishers.
[71] H.L. Pick, P. Van den Broek and D.C. Knill (1992). Cognition: Conceptual and Methodological Issues. Washington: American Psychological Association.
[72] D. Premack (1992). On the Origins of Domain-Specific Primitives. In: [71], 189-212.
[73] K. Reusser (1993). Tutoring Systems and Pedagogical Theory: Representational Tools for Understanding, Planning and Reflection in Problem Solving. In: [53], 143-178.
[74] A. Revonsuo and M. Kamppinen (Eds.) (1994). Consciousness in Philosophy and Cognitive Neuroscience. Hillsdale, NJ: Lawrence Erlbaum Associates.
[75] C.R. Rosenberg and T.J. Sejnowski (1987). Parallel Networks That Learn to Pronounce English Text. Complex Systems 1, 145-168.
[76] S. Russell and P. Norvig (1995). Artificial Intelligence. Englewood Cliffs, NJ: Prentice Hall.
[77] M.E.H. Schouten (Ed.) (1987). The Psychophysics of Speech Perception. NATO ASI Series D: No. 39. Dordrecht: Martinus Nijhoff Publishers.
[78] G. Segal (1996). The Modularity of Theory of Mind. In: [20], 141-157.
[79] E.E. Smith (1990). Categorization. In: [67], 33-53.
[80] P. Smith-Churchland and V.S. Ramachandran (1994). Filling In: Why Dennett Is Wrong. In: [74], 65-92.
[81] N.A. Stillings, M.H. Feinstein, J.L. Garfield, E.L. Rissland, D.A. Rosenbaum, S.E. Weisler and L. Baker-Ward (1992). Cognitive Science. Cambridge, MA: Bradford Books, MIT Press.
[82] Y. Tohkura, E. Vatikiotis-Bateson and Y. Sagisaka (Eds.) (1992). Speech Perception, Production and Linguistic Structure. Amsterdam: IOS Press.
[83] The Unicode Consortium (1996). The Unicode Standard, Version 2.0. Reading, MA: Addison-Wesley Developers Press.
[84] L. Wall and R.L. Schwartz (1990). Programming Perl. Sebastopol, CA: O'Reilly & Associates.
[85] R.M. Warren (1970). Perceptual Restoration of Missing Speech Sounds. Science 167, 392-393.
[86] D.D. Weinberger (1993). The Active Document: Making Pages Smarter. In: [70], 498-503.
[87] T.A. Welch (1984). A Technique for High Performance Data Compression. Computer 17(6), 8-19.
[88] L. Wieger (1965). Chinese Characters. New York: Paragon Book Reprint Corp., and Dover Publications.