Language Modeling with Morphosyntactic Linguistic Wavelets

Daniela López De Luise, Débora Hisgen, Alberto Abaigar
AIGroup, Universidad de Palermo, Mario Bravo 1050, Buenos Aires, Argentina
[email protected], [email protected], [email protected]
Abstract

This paper presents Morphosyntactic Linguistic Wavelets (MLW), an approach for managing information extracted from Spanish sentences. MLW has been applied to automatic text summarization, automatic topic extraction, and chatter-bots, and is useful for automatically extracting and modeling the information contained in texts and dialogs. Here, MLW is used to help extract and organize information from natural-language dialogs. Results are analyzed from a knowledge-representation perspective and evaluated statistically. The paper also describes how MLW relates to the kernel of the chatter-bot. This language-modeling approach has the advantage of learning Spanish-specific information from dialogs and building structures that are useful for producing answers in future dialogs.
Keywords: Natural Language Processing, Morphosyntactic Linguistic Wavelets, Dialog Processing

1. Introduction

One of the most compelling problems in Computational Linguistics is modeling language efficiently. Several challenging aspects must be taken into account in order to create a human-like linguistic model. These aspects are related to terms used frequently in the literature, such as "abstraction" [1], "ontology" [2] [3], "semantics" [4], "concept" [5] and "cognitive processes" [6]. As stated in [7], the word "abstraction" is associated with a construction process that builds a hierarchy of internal ontology representations. That hierarchy is individual, because each person builds it while being exposed to a particular sequence of events in the real world. The internal mental structure therefore has a singular and specific configuration.

The aim of this paper is to present a general overview of Morphosyntactic Linguistic Wavelets (MLW), explain how they organize information, and analyze their compatibility with the linguistic process in humans. MLW has been studied in many applications, such as automatic text summarization, automatic topic extraction, document profiling [8] and chatter-bots [9]. In this paper it is used as part of the kernel of a chatter-bot prototype named WIH (Word Intelligent Handler) to make it more flexible.

Two competing theories of language acquisition dominate the linguistic and psycho-linguistic communities:
- Nativist theory: posited by Chomsky [10], it claims that linguistic capacity is innate; certain linguistic universals are therefore given to language learners for free, requiring only the adjustment of a set of parameters in order to fully acquire a language.
- Emergentist theory: posited by Bates [11], it claims that language emerges as the result of various competing constraints, all consistent with general cognitive abilities, so no dedicated provisions for a universal grammar are required.
Consequently, "[linguistic universals] do not consist of specific linguistic categories or constructions; they consist of general cognitive abilities". Taking MLW as the core of the WIH chatter-bot follows the emergentist conception, although linguistic wavelets are a mix of Chomsky's original conception and mathematical wavelets. Although natural dialogs are hard to imitate, many engineers have proposed and developed prototypes that reproduce natural language in dialogs. Among them is J. Weizenbaum, who created one of the first chatter-bots, ELIZA [12]. Others are K. Colby (creator of the chatter-bot Parry) [13] [14], R. Wallace (creator of ALICE) [15] [16], and T. Winograd (with SHRDLU) [14].
Most Natural Language Processing (NLP) proposals split up the linguistic problem by dividing the work into layers such as the phonological, morphological, syntactic, semantic and pragmatic ones. This division simplifies the problem considerably, but it also forces the modeling process to manipulate knowledge explicitly. Consequently, current solutions require tagging, lengthy dictionaries, and frameworks such as WebODE [17], ContentWeb [18] [19], XTAG [20], MorphoLogic Recognition Assistant [21], etc. Although such frameworks can be used, MLW does not require them. WIH is a chatter-bot that applies the MLW model in order to dynamically create, organize, modify, purge, and access internal structures and transitional information.

The rest of the paper is organized as follows: an introduction to Morphosyntactic Linguistic Wavelets (Section 2), an overview of WIH and its relation to MLW (Section 3), test cases (Section 4), and conclusions and future work (Section 5).
2. Introduction to Morphosyntactic Linguistic Wavelets

Traditional wavelets are mathematical functions based on principles similar to those of Fourier analysis. They are useful for digital signal processing and image compression, decomposing data into components (called coefficients) and parameters. Their counterpart, Morphosyntactic Linguistic Wavelets (MLW), are heuristics used for similar purposes. They take their name from traditional wavelets because they share many of their characteristics [22]:
- Decomposition of information
- Several levels of decomposition (granularity)
- Extraction of the main information
- Possibility of summarization (compression)
- De-noising

These are the concepts that govern MLW. Their inputs are sentences, and their outputs are graded decompositions with certain parameters. The decompositions and parameters constitute the information "learned" from the input. Every sentence has information to be learned, and that information spans different abstraction levels:
- Low level: words in the vocabulary, punctuation, etc.
- Medium level: formal sentence structures, the usage of certain natural-language expressions, finding the proper context for certain words, etc.
- High level: topic, summary of the information, related information, etc.

It is important to note that MLW can be applied to written text, web text, and sentences in dialogs. In the current paper it is part of the core of a chatter-bot; consequently, the input consists of sentences belonging to dialogs. These sentences are fed in order to the conversational robot WIH. For more details see Section 4.

Several steps are needed to turn a text or a sentence into an MLW:
1. Input sentences and pre-process them
2. Translate the text into an oriented graph (called Eci) preserving most morphosyntactic properties
3. Apply filtering using the most suitable approach
4. If the abstraction granularity and details are insufficient for the current problem:
   4.1. Insert a new node, called Ece, into the knowledge organization
   4.2. Repeat from Step 3
5. Take the resulting sequence of filterings as the current representation of the knowledge about the ontology of the text
6. Take the resulting Eci as the internal representation of the new text event
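The steps above can be sketched as a simple control loop. The following is a minimal, self-contained sketch under stated assumptions: every helper (the stream splitter, the toy Eci graph as ordered word pairs, the filter, and the stopping rule) is a hypothetical stand-in, not the original WIH implementation.

```python
# Minimal sketch of the MLW pipeline (Steps 1-6).
# All helpers below are illustrative stand-ins for the real machinery.

def preprocess(sentence):
    # Step 1: split the sentence into streams s1..sn (plain words here).
    return sentence.split()

def build_eci(streams):
    # Step 2: a stand-in "oriented graph" Eci as ordered word pairs.
    return list(zip(streams, streams[1:]))

def apply_filter(eci):
    # Step 3: a toy filter that keeps only the edge heads.
    return [head for head, _ in eci]

def granularity_ok(filterings):
    # Step 4: toy stopping rule -- stop after one decomposition level.
    return len(filterings) >= 1

def mlw_pipeline(sentence, knowledge):
    streams = preprocess(sentence)
    eci = build_eci(streams)
    filterings = []
    while True:
        filterings.append(apply_filter(eci))
        if granularity_ok(filterings):
            break
        knowledge.setdefault("ece", []).append(filterings[-1])  # Step 4.1
        # Step 4.2: the loop repeats from Step 3
    knowledge["ontology"] = filterings                  # Step 5
    knowledge.setdefault("events", []).append(eci)      # Step 6
    return eci, filterings
```

Applied to a short Spanish sentence such as "el perro ladra fuerte", the sketch yields one Eci (the word-pair graph) and one level of filtering, both stored in the knowledge dictionary.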
2.1. Inputting sentences and pre-processing

Given a sentence s composed of a set of n streams {s1, ..., sn}, the first step converts it into a set of constituents with an internal and universal format. This format is independent of the natural language used to generate s, and adds certain contextual and morphosyntactic information to every word. The result is a set of derived constituents in the internal format {EBH1, ..., EBHn}. The translation from si to EBHi has two main steps:
a) Deriving relevant attributes
b) Building EBHi using si and the attributes

The first step relies on certain language-dependent knowledge that is briefly described in the next subsection. The second is a general procedure. For sentences extracted from web sites, the pre-processing can also require filtering out (or converting to tokens) links, images, and any other non-textual information. For implementation purposes there may be additional pre-processing to ensure that the streams si consist of valid characters, numbers or symbols. None of these actions affects the global process.

2.1.1. Deriving relevant attributes

Every stream si is automatically processed to derive the information that describes it best. This information is called a "descriptor". There are several descriptors. Their values are restricted for practical reasons, but even so, they are very useful for capturing important characteristics of every stream.

- tipo-pal (type of si): tipo-pal = {sustantivo, verbo, otro} (noun, verb, other). This descriptor reflects the idea that basic statements within s are combinations of nouns and verbs. The value declares whether si is a verb, a noun or another type of word. It is derived with a J48 induction tree.
- pal-ant-tipo (type of si-1): pal-ant-tipo = {ninguna, otro} (none, other). The value is otro if a stream si-1 exists; otherwise it is ninguna. It is used to infer the most likely type for si, capturing whether si is the first stream in s.
- tipo-pag (type of web page): tipo-pag = {índice, contenido} (index, content). The índice value is used when processing web pages whose URL contains the special tag "index". It is irrelevant for the current application of MLW.
- long-palabra (stream length): the number of characters in si, taking si as a syntagm. It is relevant for inferring pal-ant-tipo [8].
- cant-vocales-fuertes (number of strong vowels): the number of occurrences of "a", "e", "o" in si. As with long-palabra, its relevance has been shown statistically [8].
- cant-vocales-debiles (number of weak vowels): the number of occurrences of "i", "u" in si. It has the same relevance as long-palabra.
- empieza-mayuscula (does si start with uppercase?): empieza-mayuscula = {si, no} (yes, no). Denotes whether si starts with an uppercase letter. It has the same relevance as long-palabra.
- resaltada (is it emphasized?): resaltada = {si, no} (yes, no). Denotes whether si is enclosed in single or double quotes, or written in all uppercase. It is used only for text processing, so it is not relevant for the current chatter-bot application.
- es-titulo (is it part of a title?): es-titulo = {si, no} (yes, no). Denotes whether si is the first stream in the current text/web page. It is used only for text/web processing, so it is not relevant for the current chatter-bot application.
- long-oracion (length of the sentence): the number of streams si in the sentence (excluding special symbols and numbers). It is used only for text/web processing, so it is not relevant for the current chatter-bot application.
- stem: the invariant root of si, when it is detected as a known word in the current vocabulary (that is, the vocabulary learned up to that moment). It follows the algorithm designed by [23].
- id-palabra (si identification): a unique identifier that represents si. For practical reasons it is currently si itself, because it is used to generate sentences during a dialog and to feed the vocabulary.
- id-caso (web site): the unique identification of the web page containing si. It provides a reference back to the web site of the page from which the text being processed was extracted. This way MLW can also be used as an automatic indexer by content. It is used only for web processing, so it is not relevant for the current chatter-bot application.
- po (evaluation of si as part of s): a very special descriptor, because it is the only one that depends on certain basic conditions explicitly declared for the language in use (in this case Spanish). It is essentially a value that detects the occurrence of certain prefixes or combinations of words in Spanish. Two types of cases are detected: opposition and ambiguity. Table 1 shows the criteria for these special cases.

Table 1. Opposition and ambiguity in EBH

Opposition patterns (vi):
no X (-1), sin X (-1), des X (-1), inh X (-1), anti X (-1), dis X (-1)

Ambiguity patterns (vi):
muy X (0.7), algo X (0.24), mucho/a(s) X (1.9), escaso/a(s) X (-0.33), excesivamente X (-1.99), abundantemente X (1.2), demasiado/a(s) X (-1.5), tan X (0.9), poco/a(s) X (0.12), bastante(s) X (0.8), escasamente X (-0.33), excesiva/o X (-1.8), abundante(s) X (1.3), exageradamente X (-1.99), exagerada/o(s) X (-1.90)
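To make the descriptor derivation concrete, the following is a minimal sketch. The surface descriptors (long-palabra, cant-vocales-fuertes, cant-vocales-debiles, empieza-mayuscula, pal-ant-tipo) follow the definitions given above, and the po weights are taken from Table 1. Everything else is an assumption: all names are illustrative, only a subset of the Table 1 patterns is included (inflected forms and accented vowels are omitted), and treating "des", "inh", "anti" and "dis" as word prefixes while "no X" and "sin X" match the preceding word is a guess about how the patterns are applied.

```python
# Hypothetical sketch of descriptor extraction for a stream si.
# Names, the pattern subset, and the prefix-vs-word matching strategy
# are assumptions; only the vi weights come from Table 1.

STRONG_VOWELS = set("aeo")
WEAK_VOWELS = set("iu")

OPPOSITION_WORDS = {"no": -1.0, "sin": -1.0}
OPPOSITION_PREFIXES = {"des": -1.0, "inh": -1.0, "anti": -1.0, "dis": -1.0}
AMBIGUITY_WORDS = {  # subset of Table 1; inflected forms omitted
    "muy": 0.7, "algo": 0.24, "tan": 0.9, "bastante": 0.8,
    "escasamente": -0.33, "excesivamente": -1.99,
    "abundantemente": 1.2, "exageradamente": -1.99,
}

def po(word, previous=None):
    """vi weight triggered by `word` in its local context (Table 1)."""
    if previous is not None:
        prev = previous.lower()
        if prev in OPPOSITION_WORDS:
            return OPPOSITION_WORDS[prev]
        if prev in AMBIGUITY_WORDS:
            return AMBIGUITY_WORDS[prev]
    for prefix, weight in OPPOSITION_PREFIXES.items():
        if word.lower().startswith(prefix):
            return weight
    return 0.0

def descriptors(word, previous=None):
    """Surface descriptors for one stream, given the preceding stream."""
    low = word.lower()
    return {
        "long-palabra": len(word),
        "cant-vocales-fuertes": sum(c in STRONG_VOWELS for c in low),
        "cant-vocales-debiles": sum(c in WEAK_VOWELS for c in low),
        "empieza-mayuscula": "si" if word[:1].isupper() else "no",
        "pal-ant-tipo": "ninguna" if previous is None else "otro",
        "po": po(word, previous),
    }
```

For example, for the stream "Palermo" the sketch yields long-palabra 7, three strong vowels, no weak vowels and empieza-mayuscula "si", while "conocido" preceded by "no" (or the prefixed form "desconocido") receives the opposition weight -1.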
These values, taken from a survey, are applied to any si. vi>0 indicates a positive connotation; vi