Computational Linguistics or Linguistic Informatics?

ISSN 0005-1055, Automatic Documentation and Mathematical Linguistics, 2014, Vol. 48, No. 3, pp. 149–157. © Allerton Press, Inc., 2014. Original Russian Text © V.A. Yatsko, 2014, published in Nauchno-Tekhnicheskaya Informatsiya, Seriya 2, 2014, No. 3, pp. 00–00.

Computational Linguistics or Linguistic Informatics?
V. A. Yatsko
Katanov State University of Khakasia, Abakan, Russia
e-mail: [email protected]
Received February 13, 2014

Abstract—The concept of "linguistic informatics" is introduced in order to refer to a scientific domain that studies the distribution patterns of text information, as well as the problems, principles, methods, and algorithms applied for the development of linguistic software and hardware. The key terms and concepts of the field are investigated, and a classification of linguistic software is introduced.

Keywords: processing of natural language units, algorithms and programs, classification, linguistic informatics

DOI: 10.3103/S0005105514030042

Today's world is characterized by the widespread use of natural language processing technologies, which are part of the global digitalization of society. Millions of users worldwide send queries to information retrieval systems, give their phones voice commands, and perform automatic text summarization. At the same time, they are largely unaware that such activities have been made possible by the development of a subject area that embraces research on, and development of, algorithms and programs for processing natural-language texts.

A variety of terms are currently used to refer to this subject area. The term computational linguistics is the most common. In our opinion, it is not very suitable, because using the name of a discipline as the head word of the phrase limits the application scope of the term to that particular discipline. Few people know about "information biology" or "computer historiography," since developments within these domains are intended for biologists and historians. Accordingly, the terms "computational linguistics" and "corpus linguistics" imply that linguists are the major group of users of the products created by representatives of these disciplines. While this holds in the second case, in the first the phrase structure contradicts its denotation, because computational linguistics is generally associated with the development of linguistic programs and systems, including, for example, information retrieval systems, whose users are by no means limited to linguists. See the definition by R. Grishman: "Computational linguistics deals with the study of computer systems that are dedicated to the analysis and generation of natural-language units" [1, p. 4].

In the foreign literature, the term "natural language processing" (NLP) is also widely used; among other things, it is employed to define the concept of computational linguistics, see the definitions provided in [2]. In our opinion, this term quite accurately defines the object of activity, i.e., natural-language units;
however, it does not specify the subject of study. In fact, it delineates an area of practical activity rather than a scientific discipline. Introducing a separate word to refer to the scientific dimension, for example, "the theory of natural-language unit processing," makes the term too heavy, and its abbreviation does not fit the sound structure of the Russian language.

Special attention should be paid to the term "applied linguistics," whose accepted Russian scientific meaning differs from its Anglo-American, or generally Western, interpretation. Until recently, applied linguistics was understood abroad mainly as language teaching methodology: "Until recently, the major bulk of developments in applied linguistics were dedicated to language study, especially of English as a foreign or second language" [3, p. 4]. At present, the subject area of applied linguistics tends to expand: it also includes speech therapy and problems of translation; see the Oxford dictionary, which defines applied linguistics as "a branch of linguistics that deals with practical applications of language study, for example for the purposes of language teaching, translation, and speech therapy" [4].

This interpretation differs significantly from the Russian scientific understanding of applied linguistics. According to Yu.V. Rozhdestvenskii, "the task of applied linguistics consists in the introduction of new speech materials, identification of the most effective methods of verbal communication based on new methods, validation of language norms through training, and dissemination of new forms of verbal communication based on the study of new types of texts (the creation, transfer, storage, and use of such texts)" [5, p. 215]. He believes that applied linguistics includes three main areas: linguistic didactics, linguistic semiotics, and information services [5, p. 299]. Information services include areas of activity related to libraries, archives, administration, information retrieval, summarization, compilation of information dictionaries, bilingual translation, and automated control systems.
Referring to the "linguistic part" of the information services theory, Yu.V. Rozhdestvenskii suggests applying the term "linguistic informatics" [5, p. 354]. There is no doubt that this interpretation of applied linguistics is considerably broader than the one accepted abroad; specifically, it covers areas that the Anglo-American tradition usually assigns to other subject areas. For example, information-retrieval problems are often included in the subject area of information science, whereas the development of automated control systems is covered by computer science [6]. At the same time, under Yu.V. Rozhdestvenskii's interpretation, speech therapy should not be part of applied linguistics; rather, it belongs to the application of linguistics, since in this case linguistic information is used "to solve practical problems that concern other areas of science or practice" [5, p. 298]. On the other hand, A.N. Baranov points out that applied linguistics should be understood as "an activity aimed at the application of scientific knowledge about the structure and function of language in the non-linguistic disciplines..." [7, p. 7].

In addition to these contradictory interpretations, the term "applied linguistics" has the deficiency noted above: the word "linguistics" serves as the head word of the phrase. The term is also distinctive in that it covers a variety of activities that are not directly related to automatic text processing, including linguistic semiotics, linguistic didactics, and terminology, as specified in the passport of the 10.02.21 specialty "Applied and Mathematical Linguistics" [8].

In [9] we proposed the term "linguistic informatics" and defined its specific subject area on the basis of an information-linguistic model; the problems of the discipline were limited to the study of information retrieval and summarization. Later, Japanese researchers used linguistic informatics to refer to a subject area that includes the development of software tools for the study of foreign languages [10]. In this paper, we interpret the meaning of the term "linguistic informatics" from new angles, analyze the basic concepts and structure of this subject area, and reveal its interdisciplinary nature. Special attention is paid to a comparison of the concepts of linguistic informatics and theoretical linguistics, which allows us to establish the specific features of this subject area.

We believe that the use of the word informatics as the head word is justified because informatics embraces the creation of products, particularly software and hardware tools and technologies designed for various groups of users, including laymen. In the prototypical representation, the word informatics is associated with semantic components such as "computers," "programs," and "the Internet," and, unlike the word linguistics, does not correspond to a narrow professional area. The attribute linguistic restricts the subject area to the problems of developing linguistic software, hardware, and technology. Linguistic hardware is understood as computing equipment that is specifically designed for the processing of natural-language texts.

Such equipment is widely used in systems of speech and acoustic recognition, as well as optical character recognition. Linguistic software is understood as programs, applications, and systems whose input consists of natural-language texts and that operate on the basis of linguistic algorithms for the processing of natural language units.

The operation of numerous linguistic applications is based on the lexical decomposition of text, which results in the recognition of tokens in the input text and the generation of a token list as output. A token can be defined as a sequence of alphabetic and/or numeric characters separated on the left and right by text formatting and/or punctuation characters. Breaking a text into tokens is called tokenization; tokenization programs are called tokenizers. Consider the text "Bye-bye, dearie," he smiled. Here the token Bye-bye is recognized by the quotation marks on the left and the comma on the right; the token dearie is recognized by the space on the left and the comma on the right; the token he is recognized by the spaces on the left and right; and the token smiled by the space on the left and the full stop on the right. As can be seen from this example, tokens usually coincide with words; therefore, the term token is matched by the term word in theoretical linguistics. User instructions and linguistic software interfaces often refer to words rather than tokens, because the former term is more familiar and understandable. However, from the viewpoint of theoretical interpretation, the two terms differ significantly. The linguistic interpretation of the term word is usually given as a relationship between the sign, the denoted object, and the meaning, represented in the form of the well-known semantic triangle. As shown above, in linguistic informatics only signs distinguished by formal features are taken into account. Accordingly, the research goals associated with words and tokens differ. In linguistics, research concerns the interpretation of the meanings of words, the distinction between occasional and usual word uses, and the identification of new meanings and the conditions of their actualization. Linguistic informatics studies the statistical features of the distribution of tokens in the text, which provide the basis for developing the weighting formulas needed to identify the most salient terms.

In this context, a distinction is made between unique and total tokens. The term unique token refers to a token without regard to the number of its repetitions in a text, while total tokens counts tokens together with their repetitions. In the British National Corpus, the unique token the is repeated 5,973,437 times, i.e., the corpus contains 5,973,437 total tokens of the; in the Corpus of Contemporary American English [11], the frequency of the is 25,063,954. Thus, the number of total tokens is usually greater than the number of unique tokens. When terms are weighted, this circumstance allows one to use probability scores and thus eliminate the dependence of weight scores on the size of the text.
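To make the distinction concrete, here is a minimal tokenization and counting sketch in Python; the regular expression and the sample text are purely illustrative and are not meant to reproduce any particular tokenizer discussed in this article:

    import re
    from collections import Counter

    def tokenize(text):
        # A token is a run of alphabetic/numeric characters; internal hyphens are
        # kept so that "Bye-bye" remains a single token, as in the example above.
        return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text.lower())

    text = '"Bye-bye, dearie," he smiled.'
    tokens = tokenize(text)      # the list of total tokens
    counts = Counter(tokens)     # unique tokens with their raw frequencies

    print(len(tokens), len(counts))   # number of total vs. unique tokens
    # The probability (relative frequency) of a token is its raw frequency
    # divided by the number of total tokens in the text or corpus.
    print(counts["he"] / len(tokens))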


The probability value for the in the British National Corpus is 0.05973 (the corpus contains 100 million words), and 0.05569 in the Corpus of Contemporary American English (450 million words). The difference between the probability values is about four thousandths, while the difference between the raw frequencies is about nineteen million. In linguistic informatics, the process of converting raw frequencies with the aim of eliminating the dependence on the size of the text and bringing different values to a common form is called normalization. The binary logarithm is widely used as a means of normalization: the binary logarithm of 5,973,437 is 22.51013 and that of 25,063,954 is 24.57911, which results in a difference of about two rather than nineteen million. Clearly, this meaning of the term normalization differs from that of the same term in linguistics.

In the process of lexical decomposition, it is important for systems designed to represent the meaning of a text to recognize as single tokens such expressions as geographical names, personal names, abbreviations, and stable combinations. When combinations such as New York or N.A.T.O are divided into separate tokens, the meaning of the text may be reproduced incompletely or even distorted. Therefore, some linguistic software applies special filters and additional rules for phrase recognition. When processing the text [12], the Essence application [13], which is designed for automatic summarization, recognizes the combinations Jamrul Hussain, Nilufa Begum, and NEW YORK as single tokens. While processing the same text, the AntConc statistical analysis program [14] divides all of these combinations into separate tokens. This difference is explained by different functionality and user audiences: statistical analysis programs output data on the frequency of text units, and their users are experts in automatic text processing and linguists who apply these data in their professional activities, whereas summarization systems belong to general-purpose software and are intended for a much wider range of users.

In linguistics, the lexicon of a language is classified on the basis of semantic, syntactic, etymological, and stylistic criteria. In linguistic informatics, the classification of lexical units into notional and functional words (stop words) is widely used. In the literature [15, 16], the uniform distribution of stop words across various types of text is highlighted as one of their main features. In any sufficiently large English-language text, articles, pronouns, prepositions, and conjunctions are the most frequent words. As noted by W.N. Francis [16], the ten most frequent words in the English language account for between 20 and 30% of all tokens. The removal of stop words can significantly (in some cases by almost 40%) reduce the size of linguistic databases and also increase the speed and accuracy of searching. At the same time, stop words serve as one of the key parameters in automatic text classification: the fact that they occur in all types of text, regardless of genre and stylistic features, makes it possible to compare different texts and to identify patterns of stop-word distribution that are specific to individual groups, categories, and genres of texts [17]. Filtering of stop words is an important element of text processing in information retrieval systems and automatic text classification systems and is performed on the basis of special stop lists or algorithms.

Stop-word filtering is often performed using the TF*IDF formula suggested by G. Salton and C.S. Yang in 1973 [18], as well as its interpretations [19]. According to this formula, the distribution of terms in the analyzed text is compared with their distribution in a collection of text documents: the largest weights are assigned to terms that are frequent in the given document but rare in the other documents of the collection, while terms found both in the given document and in all texts of the collection receive zero scores. Thus, the formula describes a certain pattern of distribution of textual information. The main problem associated with the use of the TF*IDF formula is the uncertainty of the quantitative and stylistic composition of the text collection against which the analyzed text is compared. One approach to solving this problem is zonal text analysis based on an interpretation of Bradford's law [20].
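The distribution pattern captured by TF*IDF can be sketched as follows. This is a generic textbook-style variant with a base-2 logarithm, not the exact formula of [18, 19]; the toy documents are invented for illustration.

    import math
    from collections import Counter

    def tf_idf(term, document, collection):
        """Weight a term by its frequency in a document, discounted by the number
        of documents in the collection that also contain it."""
        tf = Counter(document)[term] / len(document)          # term frequency
        df = sum(1 for doc in collection if term in doc)      # document frequency
        idf = math.log2(len(collection) / df) if df else 0.0  # inverse document frequency
        return tf * idf

    # Documents are lists of tokens produced by a tokenizer.
    docs = [["the", "dog", "barks"], ["the", "cat", "sleeps"], ["the", "dog", "bites"]]
    print(tf_idf("barks", docs[0], docs))  # rare, salient term: highest weight
    print(tf_idf("dog", docs[0], docs))    # found in two of three documents: lower weight
    print(tf_idf("the", docs[0], docs))    # stop word found in every document: zero weight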


We have described the processing of lexical text units in some detail because it vividly demonstrates the interdisciplinary nature of the subject area and because lexical decomposition is a fundamental algorithm that underlies a number of algorithms operating at different levels of the language system. Morphological analysis, annotation, phrase decomposition, n-gram splitting, clause splitting, and anaphora resolution are all performed on the basis of tokenization.

Morphological analysis algorithms are used to recognize the elements of the morphological structure of words: roots, stems, suffixes, and endings. Stemming and lemmatization are the algorithms most commonly used at the morphological level. The objective of stemming is to identify the stemmas of different word forms that have the same meaning. The stemmer's input is a list of tokens, and its output is a list of the corresponding stems (stemmas). Stemming can significantly improve the quality and effectiveness of search and is widely used in various types of information retrieval systems. In theoretical linguistics, the term stem refers to the unchangeable part of a lexical unit. The term stemma, by contrast, refers to the sequence of characters remaining after the removal of strings contained in certain data files; its function is token identification. Thus, the Lancaster stemmer [21] removes er from the token daughter, because the data file on which it operates contains such a string. From the point of view of theoretical linguistics, this is an error, since er is part of the word's stem. From the perspective of linguistic informatics, no error occurs, since the stemma daught makes it possible to treat the tokens daughter and daughters as identical. At the same time, when er is stripped from cater, producing the stemma cat, not only the tokens catered, caters, and catering are identified, but also cat, cats, and cat's. As a result, an overstemming error occurs, because tokens with different meanings are equated. In our stemmer [22], suffixes and endings are matched with the corresponding parts of speech, which reduces the number of such errors but requires preliminary annotation with part-of-speech tags.
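A toy suffix-stripping stemmer illustrates both the idea of a stemma and the overstemming error; the suffix list below is invented for illustration and has nothing to do with the data files of the Lancaster stemmer [21] or of the stemmer described in [22].

    # Illustrative suffix list; a real stemmer reads its strings from data files
    # and usually applies them iteratively and under additional conditions.
    SUFFIXES = ["ers", "er", "ing", "ed", "s"]

    def stem(token):
        """Strip the first matching suffix and return the remaining stemma."""
        for suffix in SUFFIXES:
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                return token[: -len(suffix)]
        return token

    print(stem("daughter"), stem("daughters"))  # both yield 'daught': the two tokens are equated
    print(stem("caters"), stem("cats"))         # both yield 'cat': overstemming equates different meanings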


Annotation is performed by tagging: the tagger's input is a list of tokens, and its output is a list in which every token is assigned a tag, i.e., an identification mark showing its linguistic features. The most common type of tagger is the part-of-speech (POS) tagger, which recognizes a token's part of speech and attributes the corresponding tag to it. In addition to information about the part of speech, such tags convey lexical-grammatical and semantic characteristics of words. For example, NN is a common singular noun, NNS is a common plural noun, AJC is an adjective in the comparative degree, etc. The term tag came into scientific use with the development of electronic text corpora. It has no analog in theoretical linguistics, despite its wide application in computer science, particularly for marking up descriptors in hypertext languages. At present, annotation is widely used in automatic text classification systems, where POS tags and their combinations serve as classification parameters.

Other types of tags include semantic tags and knowledge-role tags. The former are used in factual information retrieval systems, while the latter are applied in text mining systems. The factual information retrieval system (IRS) InFact, developed by Insightful Corporation, applies tags such as Person, Location, and Organization. In response to the request [Organization/Name]>buy>[Organization/Name]^money, the IRS outputs those parts of the text that contain information about the acquisition of one company by another for a certain amount of money [23]. It should be noted that the input of such an IRS consists of text in an artificial language, which prevents us from attributing such systems to the subject area of linguistic informatics. The same applies to cryptographic systems, which should be studied within the scope of computer science. In [24], diagnostic reports on the state of the high-voltage insulation of rotating devices were annotated with the knowledge roles Observed Object, Symptom, and Cause. As a result, a system was created that allows engineers to obtain information about the symptoms of a malfunction in a specific object, as well as its causes and solutions. The lexicographic information developed within the FrameNet project [25] was used as the linguistic database. Semantic and knowledge-role annotation involves the recognition of both individual words and phrases; such annotation requires the preliminary design and application of special phrase-structure grammars at the syntactic level of the language system.
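As an illustration of the part-of-speech annotation described above, the following sketch uses the NLTK toolkit, assuming its "punkt" tokenizer and "averaged_perceptron_tagger" models have been downloaded; note that NLTK outputs Penn Treebank tags rather than the BNC-style tags (NN, NNS, AJC) cited above.

    import nltk
    # One-time setup: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

    tokens = nltk.word_tokenize("The black dog chased a smaller cat")
    tagged = nltk.pos_tag(tokens)  # input: a list of tokens; output: (token, tag) pairs
    print(tagged)
    # e.g. [('The', 'DT'), ('black', 'JJ'), ('dog', 'NN'), ('chased', 'VBD'),
    #       ('a', 'DT'), ('smaller', 'JJR'), ('cat', 'NN')]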

One of the fundamental algorithms applied at the syntactic level is syntactic splitting, performed by syntactic splitters. The splitter's input is a text, and its output is a list of the sentences of the text. Syntactic splitting algorithms recognize sentences on the basis of text-formatting characters: spaces, punctuation marks, and end-of-line marks. Thus, the term sentence in linguistic informatics refers to a sequence of character strings separated on the left and right by text-formatting characters and punctuation marks. Sentence recognition is complicated by the lack of standard text formatting; full stops, exclamation marks, and question marks, which are commonly used as separators, can occur not only at the end but also in the middle of sentences. Some text units that are formatted as sentences are not sentences as such: these include tables of contents, titles of individual sections, titles of figures and tables, text contained in tables and graphs, as well as headers and footers. Meanwhile, sentences are the basic units of analysis in many systems; in automatic summarization systems, the output consists of sentences, and errors in sentence recognition significantly reduce the overall effectiveness of such systems.

We have introduced a deduction-inversion architecture of text splitting, according to which a text is first divided into paragraphs, then into words, and afterwards sentences are generated on the basis of the words. Thus, decomposition starts with a larger unit (paragraphs), continues with a smaller unit (words), and finishes with another, larger unit (sentences). The deduction-inversion splitting architecture makes it possible to ignore such text components as headings, subheadings, and tables of contents, because these elements are not included in paragraphs [22].
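The general idea of deduction-inversion splitting can be sketched as follows; this is a minimal illustration of the paragraphs-to-words-to-sentences order, not the implementation described in [22], and it deliberately ignores the hard cases (abbreviations, headings, tables) mentioned above.

    import re

    def split_sentences(text):
        """Split a text into paragraphs, the paragraphs into words, and then
        re-assemble sentences from the words of each paragraph."""
        sentences = []
        for paragraph in re.split(r"\n\s*\n", text):   # 1. larger unit: paragraphs
            current = []
            for word in paragraph.split():             # 2. smaller unit: words
                current.append(word)
                if word.endswith((".", "!", "?")):     # 3. larger unit again: sentences
                    sentences.append(" ".join(current))
                    current = []
            if current:                                # paragraph ending without a terminator
                sentences.append(" ".join(current))
        return sentences

    print(split_sentences("First sentence. Second one!\n\nA new paragraph here."))
    # ['First sentence.', 'Second one!', 'A new paragraph here.']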

Table. Algorithms and automatic text-processing software

Algorithm           | Program           | Recognized/processed unit | Linguistic term | Language level
Symbol recognition  | OCR               | Symbol                    | Grapheme        | Grapheme
Stemming            | Stemmers          | Stemma                    | Word stem       | Morphological
Lemmatization       | Lemmatizers       | Lemma                     | Lexeme          | Morphological
Tokenization        | Tokenizers        | Token                     | Word            | Lexical
Annotation          | Taggers           | Tag                       |                 | Lexical
Term weighting      | Weighting filters | Weight score              |                 | Lexical
N-gram splitting    | N-gram splitters  | N-gram                    |                 | Syntactic
Phrase chunking     | Chunkers          | Phrase                    | Phrase          | Syntactic
Parsing             | Parsers           | Sentence                  | Sentence        | Syntactic
Syntactic splitting | Text splitters    | Sentence                  | Sentence        | Syntactic
Clause splitting    | Clause splitters  | Clause                    | Proposition     | Syntactic
Anaphora resolution | Resolvers         | Anaphora                  | Anaphora        | Discursive

Syntactic decomposition provides the basis for a variety of algorithms for recognizing the phrase structure of sentences. The most widespread are n-gram algorithms, which recognize phrases consisting of two (bigrams), three (trigrams), or more (tetragrams, pentagrams, hexagrams, heptagrams, and octagrams) tokens [26]. In this case, splitting into phrases takes into account the position of tokens in the sentence. For example, the sentence John has a dog includes four unigrams, three bigrams (John has, has a, a dog), two trigrams (John has a, has a dog), and one tetragram, which is the entire sentence. The number of bigrams and trigrams in a sentence is n - 1 and n - 2, respectively, where n is the number of tokens in the sentence; in general, the number of n-grams of order i (counting from bigrams, i = 2) is ng_i(s) = n - i + 1. N-gram recognition is based on the corresponding rules. Analysis of n-gram distributions makes it possible to identify statistically significant phrases; it is often applied in stochastic POS annotation algorithms. N-gram distributions are also used in automatic classification and categorization, serving as an important parameter for determining whether a text belongs to a specific category, type, group, or genre. When the analysis is performed at the syntactic level, bigrams and trigrams serve as the basic units, because the recurrence of phrases with many tokens is unlikely. N-gram analysis of higher orders is applied in automatic spelling correction, as well as in OCR (optical character recognition) systems, in which the characters of a token serve as the basic units.

Phrase-decomposition programs, viz., chunkers, whose output consists of phrases of a certain type (nominal, verbal, prepositional, adjectival, or adverbial), are used for the analysis of morphologically significant phrases. The most common are nominal (noun-phrase) chunkers, which recognize phrases headed by a noun. Phrases of this type denote the objects described in a text, and ranking them by weighting scores yields a list of keywords that reflect the main content of the text. Recognition of such phrases is performed on the basis of preliminary part-of-speech annotation and the combination of individual parts of speech into phrases according to grammar rules. Phrase-structure rules were developed for the English language within the concept of generative grammar proposed by Noam Chomsky. Grammar rules are written as NP → NN; NP → Det NN; NP → Det A NN, where the composition of the (in this case, nominal) phrase and the word order are specified. The first rule states that a noun phrase may consist of a single noun (NN); in the second, the phrase consists of a determiner (Det) and a noun, where the determiner precedes the noun and the reverse order is ungrammatical; in the third, the phrase is composed of a determiner, an adjective (A), and a noun, other word orders again being ungrammatical. To date, a variety of grammars have been created on the basis of Chomsky's concept; they are divided into two main types: derivational and non-derivational. Derivational grammars distinguish between the surface and deep structure of phrases and sentences and formulate additional rules for deriving surface structures from deep structures; a syntactic structure is represented as a hierarchical dependency tree. Non-derivational grammars describe surface, usually linear, syntactic structures. The choice of a particular type of grammar is determined by the problems of a specific research project. Derivational grammars underlie the functioning of syntactic parsers, which output a graph of the syntactic structure of a sentence. Similar to POS taggers, syntactic parsers are trained on sentences with manually marked-up syntactic structure and apply the rules to determine the most probable variant on the basis of hidden Markov models. One example is Lexparser, developed at Stanford University, USA [27].
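The noun-phrase rules above can be approximated with a chunker operating over POS tags; the sketch below uses NLTK's RegexpParser with an illustrative pattern and a pre-tagged sentence (a POS tagger would normally supply the tags).

    import nltk

    # Roughly NP -> NN | Det NN | Det A NN, written as a chunk pattern over POS tags.
    chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")

    tagged = [("the", "DT"), ("black", "JJ"), ("dog", "NN"),
              ("chased", "VBD"), ("a", "DT"), ("cat", "NN")]
    print(chunker.parse(tagged))
    # (S (NP the/DT black/JJ dog/NN) chased/VBD (NP a/DT cat/NN))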


Hierarchical syntactic structures are applied in machine translation systems in order to establish the equivalence of syntactic structures in two languages. At the syntactic level, decomposition can be performed not only into phrases and sentences but also into clauses, i.e., elementary predicative structures that embody a judgment. The concept of a clause corresponds to some extent to the concept of a proposition in linguistics; however, clauses are identified on formal grounds, which can, for instance, include the presence of a nominal group followed by a verbal group. Splitting into clauses is used in text mining systems for a more adequate representation of text content (e.g., see [24]). The most common algorithms applied at the discursive level are anaphora resolution algorithms, which replace anaphoric pronouns with the preceding co-referent object names. In linguistic informatics, a discourse is understood as a text whose components (clauses, sentences) are related through repetitions of lexical and/or syntactic units. We believe that the study of logical-semantic relationships between text units and the modeling of the logical-semantic structure of a text lie beyond the subject of linguistic informatics and belong to artificial intelligence research.

The table given above lists algorithms and programs grouped by the levels of the language system; they characterize the specific subject area of linguistic informatics. Program names are given in English, as their transliterations are commonly used as the Russian equivalents.

The described algorithms and programs underlie linguistic software, which can be classified by a variety of criteria. According to the material form of the input text, one can distinguish between systems for processing oral and written texts. In the former case one usually speaks of speech processing, and in the latter case of text processing. The early days of linguistic informatics were associated with the problems of written text processing and with the development of information retrieval, summarization, and machine translation systems, which marked the late 1950s and 1960s. Speech-processing systems have been developed intensively since the 1990s, when household speech-recognition systems emerged. At present, they are widely used in question-answering systems, in systems that recognize individual personal characteristics, such as age, sex, and even the level of alcohol intoxication [28], and in systems of voice control over technical objects, including nanosystems [29].

Based on the form of speech activity, one can distinguish algorithms for processing monologic and dialogic speech. For a long time, monologic texts, primarily the texts of scientific papers, were the object of automatic text analysis. The development of the Internet has stimulated the development of dialogic genres of writing: chats, blogs, and forums. Processing such texts is specific and requires special algorithms that take paralinguistic features into account.

In addition, dialogic speech-processing systems, viz., question-answering and machine translation systems, are also being developed intensively. Based on the degree of intellectuality of the results obtained by users, one can distinguish a separate group of algorithms that help to extract implicit information contained in a text, or new information that is not yet part of the processed text. Such algorithms are developed within text mining and differ significantly from traditional information retrieval and summarization algorithms, which aim at identifying the most significant information contained in the text. Text mining is widely used in industry and medicine as a means of knowledge sharing. In medicine, the digitization and knowledge-role tagging of patient records allow doctors to use search engines to find diagnoses that match certain symptoms, treatment methods that have already been applied by other doctors, prescribed medicines and drugs, and the results of treatment [30]. Text mining of opinions about commercial products [31] is also developing successfully; specifically, it allows manufacturers to identify the advantages and disadvantages of their products and to implement effective marketing policies.

In terms of target groups of users, one can identify universal, special, and professional linguistic software. Universal (general-purpose) systems are designed for all groups of users, regardless of their profession, age, or social status. A typical example is Internet information retrieval systems, which are used globally by billions of people every day. Special linguistic software is designed for specific groups of users; text mining systems, for example, are usually positioned as systems that support decision-making by representatives of certain professional groups. Professional linguistic software is intended for specialists in the field of linguistic informatics and supports research in automatic text analysis; it includes a number of statistical analysis programs that provide information on the number of unique and total tokens, the number of n-grams, the contexts in which a lexical unit is used, and probabilistic and statistical indicators of co-occurrence [14, 32].

Depending on the mode of operation, linguistic applications and systems can be divided into automatic and automated ones. Automated systems operate in a discrete mode; most of the currently developed software belongs to this type. These are, for example, information retrieval systems, whose operation starts with a user's query and ends with the output of the result. Automatic systems operate in a continuous mode; an example is speech summarization systems that track news events. Note that in the names of specific linguistic software there is no strict distinction between the terms "automatic" and "automated": information retrieval systems are correctly characterized as automated, while summarization systems are traditionally called automatic (see the title of the well-known collection of papers Advances in Automatic Text Summarization), although the reference is in fact to systems operating in a discrete mode.

Along with automatic and automated systems and applications, computer-assisted (computer-aided) text processing software is also being developed. Systems of this type are most commonly used in bilingual translation and foreign-language teaching with the aim of improving the effectiveness of teacher and translator performance.


Translation memory systems contain databases with previously translated texts, dictionaries, and corpora. With the help of these systems, translators can insert words, phrases, and sentences from previous translations into the text and check the context of use of lexical units against dictionaries and corpora.
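A naive in-memory sketch of the translation-memory idea is given below; the stored segments, the Russian translations, and the use of a simple string-similarity ratio are illustrative assumptions rather than a description of any real translation memory system.

    import difflib

    # Previously translated segments paired with their stored translations.
    memory = {
        "The contract enters into force on the date of signature.":
            "Договор вступает в силу с даты подписания.",
        "The contract may be terminated by either party.":
            "Договор может быть расторгнут любой из сторон.",
    }

    def suggest(segment, threshold=0.75):
        """Yield stored translations of segments sufficiently similar to the input."""
        for source, target in memory.items():
            score = difflib.SequenceMatcher(None, segment, source).ratio()
            if score >= threshold:
                yield source, target, round(score, 2)

    for match in suggest("The contract enters into force on the day of signature."):
        print(match)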

The operation of linguistic software is supported by lexicographic resources, which include term lists, glossaries, statistical terminology dictionaries, thesauri, and ontologies. Term lists contain the linguistic units needed to perform certain software functions; these include the lists of suffixes and endings on the basis of which stemmers operate and the stop lists used in the corresponding filters. In addition to the terms themselves, glossaries contain meta-information. In morphological dictionaries, which support the functioning of part-of-speech taggers, each token is assigned a part-of-speech tag (or two or more possible tags). In statistical terminology dictionaries, each linguistic unit is accompanied by information about its distribution in texts or files. In the non-lemmatized dictionary compiled by A. Kilgarriff [33] on the basis of the British National Corpus, each token is assigned a part-of-speech tag, its corpus frequency, and the number of files in which it occurs. The statistical data contained in such dictionaries are essential for determining the probabilistic characteristics needed in the development of a number of text processing systems. For example, probabilistic parameters can be taken into account in the development of stochastic part-of-speech taggers, where token-tag combinations with low probability can be ignored. The frequency of frequent as an adjective in the corpus is almost 60 times higher than its frequency as a verb; in probabilistic terms, the difference is 0.002321 - 0.000039 = 0.002282 (per 1,000,000 total tokens in the corpus). In this case, it is possible to ignore the verbal forms and attribute the adjective tag to all tokens of frequent, since the probability of an error is extremely small.

Thesauri provide information about terms associated through structural semantic relations: synonymic, antonymic, hyponymic, or hyperonymic. The most widely known thesaurus of English is WordNet, developed at Princeton University, USA. This thesaurus is distributed as open source and has been localized for various programming languages [34]. The problems of applying this thesaurus to automatic text analysis are discussed at various international conferences, which illustrates how relevant the development of this particular type of dictionary is. The basic concept underlying the architecture of WordNet is the synonymous series (synset), a group of semantically related terms distributed by parts of speech that vary in their degree of semantic proximity. Semantic proximity is defined by the distance from the source (initial) word. If, for example, the word courage is taken as the initial word, the synonymous row of the first level consists of the words through which this word is directly explained: courageousness, bravery, and braveness. The synonymous row of the second level includes synonyms of these words, such as the word spirit. The synonymous row of the third level includes synonyms of the word spirit: character, fiber, and fibre. The synonymous row of the fourth level includes the synonyms of the words at the third level, etc. Thus, a typical hypertext structure with continuous transitions from one cluster of synonyms to another is created.
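The synset structure can be inspected with NLTK's WordNet interface, assuming the WordNet data have been downloaded; the sketch only approximates the level-by-level traversal described above, following relations from the first synset of the initial word.

    from nltk.corpus import wordnet as wn
    # One-time setup: import nltk; nltk.download("wordnet")

    # First level: the synsets in which the initial word occurs, with their members.
    for synset in wn.synsets("courage"):
        print(synset.name(), synset.lemma_names())

    # Moving one level further by following relations of the first synset.
    first = wn.synsets("courage")[0]
    for related in first.hypernyms():
        print(related.name(), related.lemma_names())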


Thesauri are used in information retrieval systems, automatic text summarization systems, systems of automatic classification and categorization of texts, and text mining systems. In information retrieval systems, thesauri are an effective means of query expansion. Ontologies are defined as complex structured dictionaries that model the structure of a particular subject area on the basis of functional relations between its components; they are used to support text mining systems. The complexity of an ontology lies in its multilevel hierarchy: at certain levels, the ontological components, viz., concepts and categories, are related to specific terms (instantiations). Ontologies are classified into formal and linguistic ones. The specificity of a linguistic ontology is that it is associated with grammar rules that make it possible to recognize the components of the ontology and their interrelations in the text. In [31], a six-level linguistic ontology was described that was designed to support a system for the automatic analysis of customer opinions about commercial products. The upper levels of the ontology contain semantic and syntactic categories of terms that express a positive or negative assessment or change its intensity. The ontology is linked to a linear grammar through which the product names listed in a search query are matched to the components of the ontology.

The above information allows one to define linguistic informatics as a discipline that studies the patterns of distribution of textual information, as well as the problems, principles, methods, and algorithms applied in the development of linguistic software and hardware. Linguistic informatics is an interdisciplinary science; its development rests on mathematical, technical, and linguistic foundations. Mathematics and linguistics play a multi-faceted methodological role. The methodological role of mathematics increases with the level of research and development: the development of applied linguistic software requires knowledge of the basic elements of Boolean algebra and propositional calculus, which are common to all types of programming, whereas theoretical and fundamental research is impossible without knowledge of the related mathematical domains, such as set theory, graph theory, probability theory, statistical analysis, and the laws of distribution of text information. At present, one of the fundamental problems of this domain whose solution is impossible without mathematical tools is the development of criteria for assessing the representativeness of text corpora. The methodological role of linguistics increases with the growing complexity of the linguistic units being processed and recognized. While tokenization can be performed on the basis of text formatting alone, stemming requires knowledge of the morphological structure of words; chunking and parsing rely on knowledge of the phrase structure of sentences; clause splitting requires knowledge of the structure of predicative constructions; and anaphora resolution is based on knowledge of inter-phrase relations between sentences. One of the relevant theoretical problems requiring serious linguistic analysis is the development of the role grammars that support text mining systems.

One of the main problems affecting the development of the subject area is the training of experts, who must combine knowledge of the humanities with knowledge of the technical sciences and mathematics. In foreign universities, such specialists are trained in master's programs whose content includes technical, mathematical, and linguistic components. One example is the master's program of the Department of Linguistics at the University of Washington in Seattle [35]. The technical component involves good programming skills in C++ and Java (knowledge of Perl and/or Python is also recommended), knowledge of data structures and algorithms, finite automata and transducers, as well as the ability to use server clusters on the UNIX platform. The mathematical component includes probability theory and statistical analysis. The linguistic component includes an introduction to phonetics and syntax, with special emphasis on the study of the articulatory and acoustic correlates of phonological units and on the development of the formal grammars needed to build applications; the study of methods of shallow processing of natural language units, including tokenization, annotation, morphological analysis, and parsing; and the study of methods of deep processing of natural language units, including the grammars and algorithms necessary to relate deep structures to surface syntactic structures. At the final stage of training, students learn to develop information retrieval, question-answering, and machine translation systems, as well as applications and programs for language study, spelling and grammar checking, handwriting recognition and optical character recognition, document clustering, and speech recognition and synthesis. During their studies, students do internships at the largest companies involved in the development of linguistic software, such as Microsoft, Google, and InXight. This master's program is of interest because it gives an idea of the interdisciplinary nature and structure of the subject area. The most important of the three components is naturally the linguistic one; it is no accident that such courses are offered by linguistics departments. The subject area includes information retrieval and question-answering systems, speech and text recognition systems, machine translation, and foreign-language teaching systems. The master's program, as stated on the website of the University of Washington, is one of the few top-notch programs worldwide that train specialists in the field of computational linguistics. There is no doubt that the development of language technologies and the creation of similar training programs are relevant for Russia as well.

REFERENCES

1. Grishman, R., Computational Linguistics: An Introduction, Cambridge: Cambridge University Press, 1986.
2. Richter, F., Introduction to Computational Linguistics, 2005. http://www.sfs.uni-tuebingen.de/~fr/teaching/ws05-06/icl/slides/lecture2.pdf
3. Naves, O., Applied Linguistics: What It Is and the History of the Discipline, 2002. http://diposit.ub.edu/dspace/bitstream/2445/4701/1/Naves2008ALDisciplinePartIonGrabe2002.pdf
4. Definition of applied linguistics, Oxford Dictionaries, 2012. http://www.oxforddictionaries.com/definition/english/applied-linguistics?q=applied+linguistics
5. Rozhdestvenskii, Yu.V., Lektsii po obshchemu yazykoznaniyu (Lectures on General Linguistics), Moscow: Vysshaya Shkola, 1990.
6. Information science - definition and more, Merriam-Webster Dictionary, 2012. http://www.merriam-webster.com/dictionary/information%20science
7. Baranov, A.N., Vvedenie v prikladnuyu lingvistiku: uchebnoe posobie (Introduction to Applied Linguistics: A Tutorial), Moscow: Editorial URSS, 2001.
8. Passports of scientific worker specialty nomenclature, FGAU GNII ITT "Informika". http://www.edu.ru/db/portal/spec_pass/vuz_ds_pasport.php?spec=10.02.21
9. Yatsko, V.A., Linguistic aspects of informatics, Nauchno-Tekhnicheskaya Informatsiya, Ser. 1, 1996, no. 2, pp. 1-7.
10. Linguistic Informatics - State of the Art and the Future, Kawaguchi, Y., et al., Eds., Amsterdam: Benjamins, 2005.
11. The Corpus of Contemporary American English (COCA), Brigham Young University, 2012. http://corpus.byu.edu/coca/
12. Missing tot's trail goes cold after three months, 2009. http://edition.cnn.com/2009/CRIME/01/13/grace.coldcase.hussain/index.html?eref=rss_crime
13. Sillanpaa, M., Lost knowledge - DM Partner's Essence, 2009. http://bigmenoncontent.com/2009/06/04/lost-knowledge-%E2%80%93-dm-partner%E2%80%99s-essence/
14. Laurence Anthony's software, 2011. http://www.antlab.sci.waseda.ac.jp/software.html
15. Tsz-Wai, L.R., He, B., and Ounis, I., Automatically building a stopword list for an information retrieval system, J. Digital Inform. Manag.: Special Issue on the 5th Dutch-Belgian Information Retrieval Workshop (DIR'05), 2005, vol. 3, no. 1, pp. 3-8.
16. Francis, W.N., Frequency Analysis of English Usage: Lexicon and Grammar, Boston: Houghton Mifflin, 1982.
17. Santini, M., Automatic identification of genre in Web pages, PhD Thesis, Brighton: University of Brighton, 2007. http://www.itri.brighton.ac.uk/~Marina.Santini/MSantini_PhD_Thesis.zip
18. Salton, G. and Yang, C.S., On the specification of term values in automatic indexing, J. Documentation, 1973, vol. 29, pp. 351-372.
19. Yatsko, V.A., TF*IDF revisited, Int. J. Comput. Linguistics Natural Language Eng., 2013, vol. 2, pp. 385-387.
20. Yatsko, V.A., Method of zonal data analysis, V Mire Nauchnykh Otkrytii, 2013, no. 6.1, pp. 166-182.
21. Paice, C.D., Another stemmer, SIGIR Forum, 1990, vol. 24, no. 3, pp. 56-61.
22. Yatsko, V.A., Starikov, M.S., Larchenko, E.V., and Vishnyakov, T.N., The algorithms for preliminary text processing: decomposition, annotation, morphological analysis, Autom. Docum. Math. Ling., 2009, vol. 43, pp. 336-343.
23. Marchisio, G., Dhillon, N., Liang, J., et al., A case study in natural language based Web search, in Natural Language Processing and Text Mining, Kao, A. and Poteet, S., Eds., London, 2007, pp. 69-90.


24. Mustafaraj, E., Hoof, V., and Freisleben, D., Mining diagnostic text reports by learning to annotate knowledge roles, in Natural Language Processing and Text Mining, Kao, A. and Poteet, S., Eds., London, 2007, pp. 45-68.
25. Baker, C.F., Fillmore, C.J., and Lowe, J.B., The Berkeley FrameNet project, 1998. http://acl.ldc.upenn.edu/C/C98/C98-1013.pdf?origin=publication_detail
26. Bickel, S., Haider, P., and Scheffer, T., Predicting sentences using n-gram language models, 2005. http://delivery.acm.org/10.1145/1230000/1220600/p193-bickel.pdf
27. The Stanford parser: a statistical parser, The Stanford Natural Language Processing Group, 2014. http://nlp.stanford.edu/software/lex-parser.shtml
28. Levit, M., Huber, R., Batliner, A., and Noth, E., Use of prosodic speech characteristics for automated detection of alcohol intoxication, Proc. Workshop on Prosody and Speech Recognition, New York, 2001, pp. 103-106.
29. Potapova, R.K., Nanotechnologies and linguistics: Forecasting and perspectives of interaction, Nanotekhnologii v lingvistike i lingvodidaktike: mif ili real'nost'? Opyt sozdaniya obshchego obrazovatel'nogo prostranstva stran SNG. Tezisy Mezhdunarodnoi nauchno-prakticheskoi konferentsii (Proc. Int. Sci.-Pract. Conf. "Nanotechnologies in Linguistics: Myth or Reality? Experience of Formation of Common Educational Space of SNG Countries"), Moscow, 2007.
30. Li, Q., Zhai, H., Deleger, L., et al., A sequence labeling approach to link medications and their attributes in clinical notes and clinical trial announcements for information extraction, J. Amer. Med. Inform. Assoc., 2013, vol. 20, pp. 915-921.
31. Yatsko, V.A. and Starikov, M.S., On the experience of designing an ontology for automatic analysis of user sentiments about commercial products, Autom. Docum. Math. Ling., 2011, vol. 45, pp. 163-168.
32. Yatsko's Computational Linguistics Laboratory, 2013. http://yatsko.zohosites.com/linguistic-toobox-a-concordancer.html
33. Kilgarriff, A., BNC database and word frequency lists, 1998. http://www.kilgarriff.co.uk/bnc-readme.html
34. About WordNet, 2012. http://wordnet.princeton.edu
35. UW professional master's in computational linguistics, University of Washington, 2014. http://www.compling.uw.edu/about
