Textual Data Mining through the Synergistic Combination of Classifiers and Linguistic Processors

Sylvain Delisle & Ismaïl Biskri
Département de mathématiques et d'informatique
Université du Québec à Trois-Rivières
Trois-Rivières, Québec, Canada, G9A 5H7

Abstract

Numerical data mining tools are generally quite robust but only provide coarse-granularity results; such tools can handle very large inputs. Computational linguistic tools are able to provide fine-granularity results but are less robust; such tools, often semi-automatic, usually handle relatively short inputs. A synergistic combination of both types of tools is the basis of our hybrid approach. First, a connectionist classifier is used to locate potentially interesting documents, or segments thereof. Second, the user selects segments that will be forwarded to the linguistic processor in order to semi-automatically analyse their textual data and extract relevant information or knowledge elements. We present the main characteristics of our hybrid approach to textual data mining, plus a methodology by which it can be put to use. We also report on the results of a first evaluation involving a corpus made up of two texts pertaining to two different domains.

Keywords: Data mining, connectionist classifier, linguistic processor, information extraction, knowledge extraction.

1. Introduction

Nowadays, institutions and organisations accumulate, use and maintain tremendous amounts of documents which are often poorly classified or categorised. Even the WWW (World Wide Web), which can be considered an electronic library, is plagued with this problem. This situation seriously complicates information retrieval tasks, thus making information or knowledge extraction from these documents a time-consuming and very expensive activity.

Traditional natural language processing and computerised information retrieval almost always use a priori knowledge ([7]) in order to process input texts: without such knowledge, these systems simply cannot function at all. The nature of this a priori knowledge is diverse—morphological, lexical, syntactic, semantic and encyclopedic (see, amongst others, [12], [17], [18], [20] and [9])—and makes it more or less difficult to feed into the system. This is an instance of the so-called knowledge engineering bottleneck. In this work, we propose to look at knowledge/information extraction, also called textual data mining here, in a different way. Indeed, we do not want to pre-feed the system with the very knowledge we hope to extract from the documents except, perhaps, with very general or encyclopedic (public) knowledge. Thus, it is the text that contains the knowledge we are seeking, and we must find a way to extract it. However, this extraction process must be performed in such a way that full-fledged detailed (syntactic) analysis of huge volumes of text is not a prerequisite, all the more so if the analysis is semi-automatic. This would simply not be a practical solution, especially since most of the time only a small portion of the searched documents is relevant and truly deserves more costly processing. That is why we argue that a sensible strategy is to first apply cost-reasonable numerical methods, with a tool such as a connectionist classifier, and then more expensive linguistic methods, with tools such as syntactic and semantic analysers. In the first phase, "rough" numerical methods help (quickly) select documents which, according to the user's needs, deserve more "refined" (time-consuming) processing that will, in the end, make the extraction process a reality. It is the interfacing of these two different types of methods that is at the very heart of the hybrid approach to textual data mining we propose in this paper.

2. Classifiers for Textual Data Mining

Having established the important role that numerical methods play in the two-phase model we propose, let us now look more specifically at classifiers—see [11], [14] and [16] for fundamentals. These mathematical tools are well-adapted to the task of "rough" or "first-cut" textual data mining if we are willing to consider text processing for data mining as a classification task based on a set of equivalence criteria. This perspective is quite different from the more traditional one in which text processing is seen as a text understanding process, or a paraphrasing process, or an information recall process. We propose to consider classifiers as text processing "machines" that can identify text segments containing similar information.

Despite the success that classifiers have had in a wide variety of tasks, they suffer from two serious limitations. First, standard models can only handle stable corpora: if the texts change, the numerical analysis must be redone. This is clearly a severe problem with utterly unstable corpora such as those on the WWW. The same is true of models that require the computation of precise statistical parameters. Second, the results produced by classifiers are generally problematic as far as their linguistic interpretation goes ([6]). Associations between classes and words are not always clear or unambiguous.

However, we think these limitations should not prevent researchers from considering classifiers for textual data mining purposes. Indeed, we think hybrid strategies of the type we propose here are very promising. But there is more. Recent research on classifiers gives hope that some of their limitations will be overcome one day. Let us simply mention "emerging computations" classifiers, be they genetic, Markovian ([5]) or, most of all, connectionist. The latter use various models: linear or non-linear, thermodynamics-based, competition-based, backpropagation-based, and based on activation or learning rules. Such models can often deal with weak or even contradictory constraints and, most importantly, can adapt to new situations even in the presence of a certain noise level. They show good generalisation and robustness qualities thanks to the strength of their internal classification capability. The properties of ART (Adaptive Resonance Theory) networks are particularly interesting—for the basic elements of these networks, see section 6.4.1 of [11], which also contains references to the work of Grossberg, who introduced them. Indeed, from [11]: "they are capable of stable categorisation of an arbitrary sequence of unlabeled input patterns in real time. These architectures are capable of continuous training with nonstationary inputs. They also solve the stability-plasticity dilemma; namely, they let the network adapt yet prevent current inputs from destroying past training.". More complex versions of the basic ART model (ART1), called ART2, ART2-A, ART3 and ARTMAP, are capable of clustering binary and analog input patterns. As we will see in section 3.1, we have used an ART neural network which behaves as a classifier.
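For illustration only, here is a minimal sketch of a Fuzzy ART classifier of the general kind discussed above (complement coding, choice function, vigilance test, prototype update). It is not the CONTERM/FUZZYART implementation referred to in section 3.1; the function name, parameter values and toy segment vectors are our own.

import numpy as np

def fuzzy_art(inputs, rho=0.6, alpha=0.001, beta=1.0):
    """Cluster rows of `inputs` (values in [0, 1]) with a minimal Fuzzy ART.
    rho: vigilance (higher = more, tighter classes); alpha: choice parameter;
    beta: learning rate (1.0 = fast learning). Returns (prototypes, assignments)."""
    coded = np.hstack([inputs, 1.0 - inputs])      # complement coding: I = [a, 1 - a]
    weights = []                                   # one prototype per class
    assignments = []
    for pattern in coded:
        if not weights:                            # first pattern founds the first class
            weights.append(pattern.copy())
            assignments.append(0)
            continue
        W = np.array(weights)
        overlap = np.minimum(pattern, W).sum(axis=1)    # |I ^ w_j|
        choice = overlap / (alpha + W.sum(axis=1))      # category choice T_j
        winner = -1
        for j in np.argsort(-choice):                   # try best candidates first
            if overlap[j] / pattern.sum() >= rho:       # vigilance test
                winner = j
                break
        if winner < 0:                                  # nothing close enough: new class
            weights.append(pattern.copy())
            assignments.append(len(weights) - 1)
        else:                                           # update the winning prototype
            weights[winner] = (beta * np.minimum(pattern, weights[winner])
                               + (1 - beta) * weights[winner])
            assignments.append(int(winner))
    return np.array(weights), assignments

# Toy term-frequency profiles for four text segments (two underlying topics).
segments = np.array([[0.9, 0.8, 0.0, 0.1],
                     [0.8, 0.9, 0.1, 0.0],
                     [0.0, 0.1, 0.9, 0.8],
                     [0.1, 0.0, 0.8, 0.9]])
_, classes = fuzzy_art(segments, rho=0.6)
print(classes)   # e.g. [0, 0, 1, 1]: segments grouped by topic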

3. Implementation of the Hybrid Model

We now look at how we have implemented the hybrid model proposed above. It consists of two interconnected subsystems, numerical and linguistic, the first of which is relatively less dependent on the specific natural language used.

3.1. The classifier subsystem

We use a classifier that takes a corpus as input and produces classes of terms which will be used as valuable information by the user. The latter, typically a knowledge engineer, will look at the lexical groupings (associations), i.e. the classes, as a means to select text segments that deserve further investigation through linguistic processing. The classifier is thus used as a filtering mechanism, allowing the user to retain only those corpus segments that are relevant with respect to his or her information needs. The specific classifier we used is implemented in Visual Basic; it is called CONTERM and was developed at the LANCI laboratory (Laboratoire d'ANalyse Cognitive de l'Information, Université du Québec à Montréal, Québec, Canada). In order to apply CONTERM to a new corpus, one must first prepare its lexicon (after having eliminated certain words which may not be useful, i.e. stop words, and performed a morphological analysis), then carry out a matrix transformation on the corpus (which allows the computation of every lemma's frequency within each text segment) so that it is in a format amenable to the neural network, and, finally, perform a classification extraction process with the help of an ART neural network—the FUZZYART neural network was implemented on the MATLAB matrix programming platform. The output of the neural network is a matrix representing the classification associated with the input text. Each line of this matrix contains sorted pairs indicating whether or not each lexicon entry is a member of this particular class's prototype (a prototype is created for every class). Thus, a class is characterized by the presence of a certain number of terms (words). Every class identifies terms that occur in text segments which, according to the classifier, have similarities. These classes are also characterized by the terms that are present in all the text segments appearing in the same class. In brief, the classifier's raw output is a sequence of classes characterized by a set of terms (words) and a set of text segments.
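The preparatory pipeline just described (stop-word removal, per-segment lemma counts, a matrix handed to the neural network) can be pictured with a short sketch. It is not CONTERM's code: the stop-word list, the whitespace tokenisation and the absence of real morphological analysis are simplifying assumptions of ours.

from collections import Counter

# Illustrative sketch, not the authors' implementation.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "into", "and", "but", "these", "are", "to", "be"}

def prepare_segments(text, words_per_segment=10):
    """Cut a text into fixed-size segments of content words (stop words removed)."""
    tokens = [w.strip(".,;:!?").lower() for w in text.split()]
    content = [w for w in tokens if w and w not in STOP_WORDS]
    return [content[i:i + words_per_segment]
            for i in range(0, len(content), words_per_segment)]

def segment_term_matrix(segments):
    """Return (lexicon, matrix) where matrix[i][j] is the frequency of
    lexicon[j] in segment i -- the representation fed to the classifier."""
    lexicon = sorted({w for seg in segments for w in seg})
    matrix = []
    for seg in segments:
        counts = Counter(seg)
        matrix.append([counts[w] for w in lexicon])
    return lexicon, matrix

text = ("These are cumulus clouds. On a sunny afternoon, these clouds are likely "
        "to be white. But sometimes cumulus clouds turn into black storm clouds.")
lexicon, matrix = segment_term_matrix(prepare_segments(text))
print(lexicon)
print(matrix)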

3.2. The linguistic subsystem

This subsystem's role is to perform an in-depth linguistic analysis of the text segments selected by the user with the help of the classifier subsystem. In doing so, the user typically aims at extracting fine-grained and structured knowledge representations associated with the segments of interest. For instance, the user could be interested in a specific verb (activity) and could use the linguistic subsystem's outputs to easily construct a semantic network associated with this verb, since these outputs are equivalent to predicate structures of the form Predicate(argument1, argument2, ..., argumentn).

Syntactic analysis of textual material is performed by DIPETT (Domain-Independent Parser of English Technical Texts). The output of this broad-coverage parser consists of a single parse tree, the best analysis according to its surface-based heuristics. Tests on a number of actual, unedited texts have produced either full parses or fragmentary parses (into sequences of phrases) for up to 95% of sentences. A sentence, DIPETT's unit of processing, contains a main clause and possibly subordinate clauses. Before semantic analysis takes place, parse trees of structurally complex sentences are decomposed into parse trees for clauses. We do not assume that input texts are free of ambiguity, but the user is expected to help the system select only one parse tree and only one semantic analysis for each input string analysed.

Semantic analysis is performed by a separate module called HAIKU. Each clause (HAIKU's unit of processing) is built around a main verb, which is essential in case analysis ([2]; [10]; [7]). Case relations represent semantic relationships between a clause's main verb and its syntactic arguments (subject, objects, prepositional phrases, adverbials). Links labeled by cases usually correspond to roles in the activity denoted by the verb; for example, the agent case identifies an entity that acts. Cases appear in texts as predicate-argument structures, each case denoted by a syntactic marker and occupied by a particular filler. For example, in "The thief broke the window with a stone", with is the marker and a stone is the instrument of the break activity. DIPETT and HAIKU are robust components of a fully implemented text analysis system, called TANKA, originally designed for knowledge acquisition—details appear in [3], [7] and [8].
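Since the output is described as predicate-argument structures whose arguments carry case roles, a small illustrative data structure may help fix ideas. The class and field names below are our own, not HAIKU's internal representation; the second example anticipates the sentence analysed in section 5.

from dataclasses import dataclass, field

@dataclass
class CaseFrame:
    """A predicate (main verb) with its case-labelled arguments, i.e. an
    illustrative stand-in for Predicate(argument1, ..., argumentn)."""
    verb: str
    roles: dict = field(default_factory=dict)   # case label -> (marker, filler)

    def add(self, case, filler, marker=None):
        self.roles[case] = (marker, filler)
        return self

# "The thief broke the window with a stone."
broke = CaseFrame("break")
broke.add("agent", "the thief", marker="subject")
broke.add("object", "the window", marker="direct object")
broke.add("instrument", "a stone", marker="with")

# "But sometimes cumulus clouds turn into black storm clouds." (see section 5)
turn = CaseFrame("turn")
turn.add("experiencer", "cumulus clouds", marker="subject")
turn.add("effect", "black storm clouds", marker="into")
turn.add("frequency", "sometimes", marker="adverb")

for frame in (broke, turn):
    print(frame.verb, {case: filler for case, (marker, filler) in frame.roles.items()})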

4. The Methodology Associated with the Hybrid Model

The implemented model is supported by a methodology that guides the user through the knowledge extraction process. This methodology is organised in four major steps which are now described.

4.1 Partitioning of the initial corpus into its different domains with the CONTERM classifier

If a corpus contains texts or segments about different domains, these can easily be separated from each other with the help of a CONTERM classification. Indeed, these different domains will normally be described with different words. The only important exception could be that of functional words (e.g. determiners, prepositions) which are used domain-independently. However, this problem is eliminated during the preparation of the lexicon (section 3.1): for the most part, the functional vocabulary is discarded. This will ensure that the classifier is not "distracted" by noisy tokens during its computations; only vocabulary specific to the corpus' various domains will affect the classifier's results (i.e. the classes). Yet, CONTERM's results do not correspond to an automatic interpretation of the corpus. At this stage, what we obtain is a coarse partitioning of the corpus into its main themes or topics, each of these comprising one or more classes of words. The user is the only one who can associate a meaningful interpretation to the partitions produced by CONTERM.

4.2 Theme (class) identification within the partitions obtained in the first step

After the partitioning of the corpus into its main topics, the user now identifies sub-topics within each main topic (partition) by looking at the classes. The most frequent words within each class can be seen as a set of constraints that mutually influence each other and thus provide a good indication as to what the sub-topic is. This represents a progression toward a finer interpretation of the text. Again, as in the first step, this identification of sub-topics is performed by the user. Sub-topics of particular interest to the user will be considered in the next step.
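For illustration, the following sketch shows one simple way of surfacing the most frequent words of a class as sub-topic cues. The word lists are invented and the function is ours, not part of CONTERM.

from collections import Counter

def top_terms(class_segments, k=3):
    """Return the k most frequent content words over the segments of one class,
    as a rough cue for naming the sub-topic (illustrative sketch)."""
    counts = Counter(w for seg in class_segments for w in seg)
    return [w for w, _ in counts.most_common(k)]

weather_class = [["cumulus", "clouds", "white", "sunlight"],
                 ["clouds", "rain", "moisture", "storm"],
                 ["storm", "clouds", "rain", "sky"]]
print(top_terms(weather_class))   # e.g. ['clouds', 'storm', 'rain']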

4.3 Exploration of the classes with regard to the user's needs

This third step is manual but could become automatic—this is an item for future work. At this point, the user has assigned a theme to every class, or at least to most of them. He or she is therefore able to select the themes that best correspond to his or her information/knowledge needs. The selected themes indicate which text segments belong to the classes: these text segments share similarities and thus deserve to undergo further processing with the linguistic subsystem.
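Since this step could become automatic, the selection can be pictured as a simple lookup from the chosen themes back to the segments recorded for each class. The sketch below is ours (theme names and segment numbers are invented), not part of the implemented system.

def segments_for_themes(classes, chosen_themes):
    """Collect the ids of every text segment belonging to a class whose
    user-assigned theme was selected; these segments go to the linguistic subsystem."""
    selected = set()
    for theme, segment_ids in classes.items():
        if theme in chosen_themes:
            selected.update(segment_ids)
    return sorted(selected)

# theme assigned by the user in step 4.2  ->  segments grouped by the classifier
classes = {"clouds and rain": [3, 7, 12],
           "storms":          [7, 15],
           "engine parts":    [90, 91, 120]}
print(segments_for_themes(classes, {"clouds and rain", "storms"}))  # [3, 7, 12, 15]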

4.4 Detailed linguistic processing of the selected text segments

This final step is the one in which the text segments kept by the user, after the filtering operation performed in the three previous steps, will finally undergo detailed linguistic analysis in order to extract structured information or knowledge representations from the selected text segments. An example will illustrate this step in the following section.

5. Putting the implemented model to the test

Using the methodology described in section 4, we put our model to the test on a corpus made up of the concatenation of two different English technical texts. The first text comes from an introductory book on weather phenomena ([15]) and contains some 600 sentences. The second text comes from a book on small engines ([1]) and contains some 1600 sentences. As opposed to [19], for instance, our approach can handle corpora made up of non-homogeneous texts originating from different domains. These two concatenated texts were saved in a single file.

Following the methodology of section 4, the first step produced 243 text segments, each containing an average of 10 words. Segments #1 to #67 represented the first text (on weather phenomena), while segments #68 to #243 represented the second text (on small engines). CONTERM was able to perfectly separate the two texts and we obtained 83 classes. In the second step, assuming we were interested in weather phenomena, we analysed the classes associated with that domain. We found the following outstanding classes: (clouds, moisture, rain), (earth), (storm), (sky), etc. In the third step, we selected amongst the outstanding classes of the previous step those that were relevant with respect to our information needs. Let us assume we were looking for information on clouds and rain because we wanted to construct a semantic network of the concept clouds. We then chose sentences associated with the selected classes in order to perform a detailed linguistic analysis on them. Let us assume that we picked the following sentences (all of them authentic, from [15]): "These are cumulus clouds. On a sunny afternoon, these clouds are likely to be white. The tops may shine in the sunlight. We might call them fair-weather clouds. But sometimes cumulus clouds turn into black storm clouds.".

Finally, in step 4, we submitted the text segments to the linguistic processor. Let us now illustrate the linguistic analysis with this simple sentence: "But sometimes cumulus clouds turn into black storm clouds.". First of all, a detailed syntactic analysis is (automatically) performed by the DIPETT parser—we pass over morphological and lexical analysis here—which produces a single correct parse tree. Here is a simplified version of that parse tree, in which words are underlined:

( but:conj-coord
  ( sometimes:adv
    ( ( ( cumulus:noun ) clouds:noun )
      turn:verb
      ( into:prep
        ( ( black:adj
            ( ( storm:noun ) clouds:noun ) ) ) ) ) ) )

Then, with the help of the semi-automatic semantic analyser HAIKU, the user identifies the semantic relationships between verbs (predicates) and their arguments, as well as between head nouns and their modifiers. These results (associations) are constantly accumulated by HAIKU in an incremental knowledge base in order to improve the processing of future sentences and facilitate user intervention. In fact, HAIKU accomplishes a kind of instance-based learning by which it improves its own performance as more sentences are processed. By the same token, the knowledge base incrementally updated by HAIKU constitutes a structured model of the concepts analysed by the linguistic subsystem. Following is a simplified excerpt from the actual session in which our example sentence ("But sometimes cumulus clouds turn into black storm clouds.") was analysed. The first phase is a case-based semantic analysis (comments are in italics, user inputs are underlined):

{HAIKU shows how it decomposed the input sentence, using the parse tree produced by DIPETT.}

CURRENT SUBJECT: cumulus clouds
CURRENT VERB   : turn
CURRENT COMPL  : into black storm clouds
CURRENT MODIFS : sometimes

{The syntactic pattern (CMP) associated with the current sentence contains a subject, a prepositional phrase introduced by 'into' and an adverb.}

CMP (syntactic pattern) found by HAIKU: psubj-into-adv

{After having attempted to find in its knowledge base a relevant semantic pattern (CP) to propose for the current sentence, via a nearest-neighbour algorithm, HAIKU informs the user that it has no highly probable candidates to suggest.}

S3: entirely new CMP with known verb; no obvious candidate CPs to propose

{HAIKU informs the user of what it knows about the preposition 'into': it can introduce a "time to" (tto) case, a "location to" (lto) case or a "direction" (dir) case. Each is presented with an example sentence.}

here is what the cmDict knows about this Case marker: INTO
{format: Case(COmmonness,Example); CO=1:common, CO=2:uncommon}
tto(1,We worked INTO THE NIGHT.)
lto(1,Jump INTO THE POOL.)
dir(1,Turn the boat INTO THE WIND.)

{HAIKU asks the user to suggest a semantic pattern (CP) for the current sentence. The user enters the pattern expr-eff-freq which means that the subject (psubj) plays a role of "experiencer" (expr), the 'into' prepositional phrase plays a role of "effect" (eff), and the adverb 'sometimes' plays a role of "frequency" (freq) in the current sentence. Note that the CP entered by the user, i.e. expr-eff-freq, is the only input that the user had to enter so far. This new information will be added to HAIKU's knowledge base so that it can handle similar situations the next time around and make informative suggestions to the user.}

please enter the new CP (e.g. agt-objtat), or enter 'h' to see the current input string and CMP, or CR to abort ["new CP"/h/CR]? expr-eff-freq

{Here, HAIKU makes sure that syntactic and semantic patterns are properly matched. The user approves by typing in 'Y'.}

CMP & CP will be paired as follows: psubj/expr into/eff adv/freq ; Correct [n/Y]? Y

{HAIKU informs the user about the updates it is making to its knowledge base (called dictionaries here). For the most part, these changes are made relative to the current sentence's main verb, i.e. turn.}

UPDATING mDict for turn ...
+ turn [psubj-into-adv:1,psubj-to:1]
  [[adv,[freq:1]],[into,[eff:1]],[psubj,[expr:2]],[to,[eff:1]]]
UPDATING cmpDict for psubj-into-adv ...
+ psubj-into-adv [[expr-eff-freq:1,
  '[but,sometimes,cumulus,clouds,turn,into,black,storm,clouds,.]']]
UPDATING cpDict for expr-eff-freq ...
+ expr-eff-freq [turn]
UPDATING ccvpIndex ...
+ ccvpIndex: psubj-into-adv expr-eff-freq turn102
VERB PREDICATE STRUCTURE (verb_pred_struct) FOR turn ...
+ [psubj-into-adv/expr-eff-freq/np-into-adv]
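The dictionary updates shown above amount to remembering which semantic pattern (CP) has been paired with which verb and syntactic pattern (CMP), so that later sentences can receive better suggestions. The following toy sketch mimics that bookkeeping with our own simplified structure; it is not HAIKU's actual mDict/cmpDict/cpDict format.

from collections import defaultdict

class PatternMemory:
    """Toy stand-in for HAIKU's incremental dictionaries: remembers which
    semantic pattern (CP) was chosen for a given verb and syntactic pattern (CMP),
    and suggests the most frequent one the next time around."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # (verb, cmp) -> cp -> n

    def record(self, verb, cmp_label, cp_label):
        self.counts[(verb, cmp_label)][cp_label] += 1

    def suggest(self, verb, cmp_label):
        seen = self.counts.get((verb, cmp_label))
        if not seen:
            return None                        # "no obvious candidate CPs to propose"
        return max(seen, key=seen.get)         # most frequent CP for this verb/CMP

memory = PatternMemory()
memory.record("turn", "psubj-into-adv", "expr-eff-freq")   # the session above
print(memory.suggest("turn", "psubj-into-adv"))            # expr-eff-freq
print(memory.suggest("break", "psubj-with"))               # None: ask the user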

HAIKU also updates the original parse tree with the semantic information it just collected, thus transforming the tree into a hybrid syntaxico-semantic representation which can be trivially translated into a semantic network or a conceptual graph (semantic cases are capitalised and words are underlined):

( but:conj-coord
  ( [FREQ: sometimes:adv]
    ( [EXPR: ( ( cumulus:noun ) clouds:noun )]
      turn:verb
      ( [EFF: into:prep
          ( ( black:adj ( ( storm:noun ) clouds:noun ) ) ) ] ) ) ) )
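To illustrate the "trivial translation" into a semantic network mentioned here, the small sketch below turns case-annotated constituents like the ones above into (head, relation, filler) triples. The tuple format and function name are ours, not the system's.

def frame_to_triples(verb, cases):
    """Turn a case-annotated clause into semantic-network edges:
    one (verb, CASE, filler) triple per labelled constituent."""
    return [(verb, case, filler) for case, filler in cases]

# Case-labelled constituents read off the annotated parse tree above.
triples = frame_to_triples("turn", [("EXPR", "cumulus clouds"),
                                    ("FREQ", "sometimes"),
                                    ("EFF", "black storm clouds")])
for head, relation, filler in triples:
    print(f"{head} --{relation}--> {filler}")
# turn --EXPR--> cumulus clouds
# turn --FREQ--> sometimes
# turn --EFF--> black storm clouds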

In the last phase of semantic analysis, HAIKU analyses nominal compounds in the current sentence, i.e. cumulus clouds and black storm clouds.

For the phrase 'cumulus cloud' there is a relationship between (cumulus) and (cumulus_cloud).

{An NMR label, different from the semantic cases used above, identifies the semantic relationship between a head noun and its modifiers. 'prop' (underlined) means "property".}

> Please enter a valid NMR label ('a' to abort): prop
Property (prop): cumulus_cloud is cumulus

{HAIKU makes a first suggestion which happens to be wrong: 'black' is not a modifier of 'storm' but of 'clouds'. The user enters 'n' (no) and HAIKU adjusts its interpretation.}

For the phrase 'black storm cloud' is that 'black storm' [Y/n]? n
 ______________
|              |
black     __________
          |        |
          storm   cloud
For the phrase 'storm cloud' there is a relationship between (storm) and (storm_cloud).

{Here, ‘resu’ (underlined) means “result”.} > Please enter a valid NMR label ('a' to abort): resu Result (resu): storm is a result of storm_cloud For the phrase 'black storm_cloud', do you accept the assignment: Property (prop): black_storm_cloud is black [n/a//Y] Y

{Finally, HAIKU outputs the compounds' structures.}

NOUN MODIFIER RELATIONSHIPS
Property (prop): cumulus_cloud is cumulus
Property (prop): black_storm_cloud is black
Result (resu): storm is a result of storm_cloud
storm_cloud isa cloud

It is straightforward to augment the hybrid syntaxico-semantic structure produced above with such information. We clearly see how the linguistic subsystem can really support the user’s knowledge extraction task.
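The augmentation mentioned here can reuse the same triple representation sketched earlier: the noun-modifier relationships simply become additional edges attached to the nouns of the clause. A brief, hedged continuation of that sketch (the tuple encoding is ours; directions follow the wording of the session transcript above):

# Noun-modifier relationships output by HAIKU, recast as extra network edges.
nmr_triples = [
    ("cumulus_cloud", "prop", "cumulus"),        # cumulus_cloud is cumulus
    ("black_storm_cloud", "prop", "black"),      # black_storm_cloud is black
    ("storm", "resu", "storm_cloud"),            # storm is a result of storm_cloud
    ("storm_cloud", "isa", "cloud"),             # storm_cloud isa cloud
]
for head, relation, dependent in nmr_triples:
    print(f"{head} --{relation}--> {dependent}")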

6. Conclusion

We have presented a hybrid model of textual data mining built on top of two implemented subsystems, one numerical and one linguistic. We think such approaches are very promising as they exemplify "best of both worlds" strategies. Our emphasis in this paper has been not so much on the internal details of the two subsystems, nor on their mathematical and linguistic bases, but rather on their interfacing and their appropriateness for textual data mining. We also illustrated, with the help of a 4-step methodology, how one could actually use such a hybrid system in order to carry out knowledge extraction tasks. The salient feature of our approach is the synergistic combination of a numerical (classifier) processor with a linguistic (syntactic and semantic) processor. The classifier helps the user filter out irrelevant texts from a potentially very large corpus. Typically, this leaves him or her with a drastically smaller subcorpus which deserves more detailed investigation and analysis. This is where the linguistic processor steps in and supports the user in performing an in-depth analysis of the selected text segments. This leads to the construction of conceptually oriented (structured) representations of the knowledge/information occurring in the selected text segments. Such representations could easily be fed into a knowledge-based system, for instance. As a final remark, let us simply mention that we have plans for future work on other hybrid models built on various kinds of classifiers and linguistic subsystems (e.g. [4], [13], [21]). We also want to work on more tightly coupled interfaces between the numerical and the linguistic processors; up to now, we have only used a loose connection between the two subsystems.

References

[1] Atkinson, H.F. (1990), Mechanics of Small Engines, McGraw-Hill.
[2] Barker, K., Copeck, T., Delisle, S., Szpakowicz, S. (1997), "Systematic Construction of a Versatile Case System", Journal of Natural Language Engineering 3 (4), 279-315.
[3] Barker, K., Delisle, S., Szpakowicz, S. (1998), "Test-Driving TANKA: Evaluating a Semi-Automatic System of Text Analysis for Knowledge Acquisition", Proceedings of the 12th Canadian Artificial Intelligence Conference—CAI-98, Vancouver (BC, Canada), 60-71.
[4] Biskri, I., Meunier, J.G. (1998), "Vers un modèle hybride pour le traitement de l'information lexicale dans les bases de données", Actes du Colloque JADT-98, Nice (France).
[5] Bouchaffra, D. (1993), "Theory and Algorithms for analysing the consistent Region in Probabilistic Logic", International Journal of Computers & Mathematics with Applications, (ed.) E.Y. Rodin, Vol. 25, No. 3, Pergamon Press.
[6] Church, K.W., Hanks, P. (1990), "Word Association Norms, Mutual Information, and Lexicography", Computational Linguistics 16, 22-29.
[7] Delisle, S. (1994), Text Processing without a priori Domain Knowledge: Semi-Automatic Linguistic Analysis for Incremental Knowledge Acquisition, Ph.D. Thesis, Department of Computer Science, University of Ottawa.
[8] Delisle, S., Barker, K., Copeck, T., Szpakowicz, S. (1996), "Interactive Semantic Analysis of Technical Texts", Computational Intelligence, 12(2), 273-306.
[9] Delisle, S., Létourneau, S., Matwin, S. (1998), "Experiments with Learning Parsing Heuristics", Proceedings of the COLING-ACL'98 Conference, Montréal (Québec, Canada), 307-314.
[10] Delisle, S., Szpakowicz, S. (1997), "Extraction of Predicate-Argument Structures from Texts", Proceedings of the 2nd Conference on Recent Advances in Natural Language Processing—RANLP-97, Tzigov Chark (Bulgaria), 318-323.
[11] Hassoun, M.H. (1995), Fundamentals of Artificial Neural Networks, MIT Press.
[12] Jacobs, P., Zernik, U. (1988), "Acquiring Lexical Knowledge from Text: A Case Study", Proceedings of the Seventh National Conference on Artificial Intelligence (AAAI-88), Saint Paul (Minnesota, USA), 739-744.
[13] Jouis, C., Biskri, I., Descles, J.P., Le Priol, F., Meunier, J.G., Mustafa, W., Nault, G. (1997), "Vers l'intégration d'une approche sémantique linguistique et d'une approche numérique pour un outil d'aide à la construction de bases terminologiques", Actes du colloque JST97, Avignon (France).
[14] Kasabov, N.K. (1996), Foundations of Neural Networks, Fuzzy Systems, and Knowledge Engineering, MIT Press.
[15] Larrick, N. (1961), Junior Science Book of Rain, Hail, Sleet & Snow, Champaign: Garrard Publishing Company.
[16] Mehrotra, K., Chilukuri, K.M., Ranka, S. (1997), Elements of Artificial Neural Networks, MIT Press.
[17] Moulin, B., Rousseau, D. (1990), "Un outil pour l'acquisition des connaissances à partir de textes prescriptifs", ICO, Québec 3 (2), 108-120.
[18] Regoczei, S., Hirst, G. (1989), On Extracting Knowledge from Text: Modeling the Architecture of Language Users, TR CSRI 225, University of Toronto.
[19] Toussaint, Y., Namer, F., Daille, B., Jacquemin, C., Royauté, J., Hathout, N. (1998), "Une approche linguistique et statistique pour l'analyse de l'information en corpus", Actes de la Conférence TALN-98, Paris (France), 182-191.
[20] Zarri, G.P. (1990), "Représentation des connaissances pour effectuer des traitements inférentiels complexes sur des documents en langage naturel", Office de la langue française (ed.), Les industries de la langue : Perspectives 1990, Gouvernement du Québec.
[21] Biskri, I., Delisle, S. (1999), "Un modèle hybride pour le textual data mining : un mariage de raison entre le numérique et le linguistique", Actes de la Sixième Conférence sur le Traitement Automatique des Langues Naturelles (TALN-99), Cargèse (Corse), July 12-17, to appear.