Creating Template Contract Documents Using Multi-Agent Text Understanding and Clustering in the Car Insurance Domain

Igor Minakov 1, George Rzevski 2, Petr Skobelev 1, Simon Volman 1

1 MAGENTA Development, 1a Osipenko St., Samara, 443010, Russia
{minakov, skobelev, volman}@magenta-technology.ru
2 MAGENTA Technology, 130 Shaftesbury Avenue, London W1D 5EU, UK
{george.rzevski}@magenta-technology.com

Abstract. The paper discusses problems in the automated processing and classification of unstructured text information and suggests a new approach based on multi-agent technology. The approach was applied for one of the UK insurance companies to analyze 25,000 documents in the car insurance domain, leading to the development of a system capable of analyzing documents, classifying them into a hierarchical semantic structure and building a template that includes suitable parts of all similar documents. The paper describes the system, presents testing results and discusses future prospects.

Keywords: template building, problem domain ontology, multi-agent text understanding, semantic network comparison, text mining, data mining, classification, clustering
1. Introduction and Problem Definition

Nowadays companies work with an enormous number of documents: contracts, emails, business letters, licenses, manuals, financial and technical reports, etc. Even for a medium-size company the number of documents becomes so high that it is impossible to process them manually, without electronic document management and CRM systems. But sometimes, in addition to the usual tasks, deeper analysis is required, including semantic search and comparison of documents, grouping of documents with similar sense, and even document generation methods.

The company under investigation, one of the five biggest insurance companies in the UK, had the following problem. A car insurance contract naturally depends on many parameters, from the client's gender and age to education level, yearly income, class of car and driving history. Over the last 20 years the company's lawyers had prepared more than 25,000 documents in the field of car insurance contracts. The task was now to re-analyze them, group them according to their semantics, and prepare a template contract for each group of documents: a contract that includes the best and most frequent clauses from the documents within the group, to be used in future as a basis for all new contracts. As the market also contains contracts from
competing insurance companies, part of the task was to group these external documents as well and take them into account during the analysis and template building. The initial estimate was that there would be around 100 groups and that the whole work would take approximately 16 man-years of highly qualified legal experts. Our task was to automate this process with a decision-making support system, thus saving time and money.

The existing systems and algorithms that can group documents into clusters (even the best ones, like LSA [1], Scatter/Gather [2] and STC [3]) have a number of limitations [4]: some of them require a "seed" document in each group, some require expert pre-analysis, including the desired number of resulting groups, and some simply produce mediocre results because they represent semantics as keywords extracted from documents, causing a lot of noise and non-relevant results. We therefore decided to apply and adjust our own technology for text understanding, which represents the sense of a document as a semantic network built on a problem domain ontology. A clustering method is then applied to find similar documents. Existing clustering methods work poorly with semantic networks, so we applied our multi-agent clustering algorithm to find groups of documents. Finally, heuristic methods are applied to create a template from each group of similar documents. Below we describe the approach and methods in detail.

2. System workflow

At the first stage our system requires an ontology of the problem domain: car insurance. It is constructed in a semi-automatic way. Using the ontology constructor tool [5], an expert inputs a set of domain documents (in our case, insurance contracts), and the system suggests a hierarchy of objects, attributes and relations, which the expert can adjust manually. The heuristics are similar to those used in [6] and [7]. For our domain the ontology included more than 400 objects (like "document", "agreement", "contractor", "terms of contract"), an average of 6 attributes per object (like "amount", "class of car", "car parameters", "contract length", "warranty conditions"), and 37 relations (like "have", "between", "part of contract", "belongs to", "guarantee").

At the next stage, using the text understanding algorithm [8,9], each document is automatically assigned a semantic descriptor, which represents its sense in terms of the problem domain ontology (Fig. 1). The descriptor represents not all the words and information in a document, but the most relevant and important. For example, from the fragment in the figure it is possible to reconstruct that it describes a certificate for an option, which belongs to one party of the agreement and which in turn grants the right to its further usage; the option belongs to the registered holder. By similar analysis it is quite easy to interpret the semantic network, preserving the sense of the document. In addition to semantic descriptors, keywords and key clauses are created, so that standard clustering methods can also be used.
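The paper does not formalize the descriptor structure; purely as an illustration, a semantic descriptor can be thought of as a small labelled graph over ontology concepts. The following minimal Python sketch shows one possible representation, with all concept and relation names being hypothetical, echoing the Fig. 1 fragment:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Concept:
    """An instance of an ontology object, e.g. "certificate" or "option"."""
    name: str
    attributes: tuple = ()   # (attribute, value) pairs from the ontology

@dataclass
class SemanticDescriptor:
    """The sense of a document as a network of concepts and relations."""
    edges: set = field(default_factory=set)   # (source, relation, target)

    def add(self, src: Concept, relation: str, dst: Concept) -> None:
        self.edges.add((src, relation, dst))

# Fragment mirroring the Fig. 1 example: a certificate for an option
# that belongs to the registered holder (names are illustrative only).
d = SemanticDescriptor()
d.add(Concept("certificate"), "certifies", Concept("option"))
d.add(Concept("option"), "belongs to", Concept("registered holder"))
```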
After semantic descriptors have been built for all documents, the clustering algorithm [10,11] is applied. It creates a hierarchy of clusters; semantic descriptors are also used as cluster descriptions, which helps to interpret the reasons behind grouping documents together.
Fig. 1. Fragment of semantic descriptor for insurance contract
As a result we get a hierarchical structure, where each record and cluster can be part of several other clusters if its semantic descriptor is close to several clusters simultaneously (Fig. 2).
Fig. 2. Example of created clusters for car insurance contracts
For each cluster the system represents its documents as a record with a number of key clauses that describe it. The procedure for comparing clauses takes into account the number of similar words in the clauses and their relative order. Clauses with a sufficiently high level of similarity are considered the same. The key clauses are then calculated as the most frequent clauses among the documents in a set. After the clusters of documents have been created, the system clusters them again, finding the most popular clauses, similar clauses and abnormalities for each group of clusters. All key clauses that are popular or unique/abnormal are then joined together to form the final template. To determine the order in which they should appear in the resulting document, a dynamic programming algorithm is used, taking into account the relative order in which clauses appear in all documents of the group (Fig. 3). Afterwards experts can re-adjust the order of clauses, select an option or edit words in clauses with partial matching (where similar clauses have some differences in wording), or include additional clauses.
Fig. 3. Creating a template based on a set of key clauses from the grouped documents
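The paper specifies the comparison criteria (shared words and their order) and mentions a dynamic programming step for clause ordering without giving details. The Python sketch below is a simplified stand-in: the similarity measure combines word overlap with an order-sensitive sequence match, and the ordering step greedily ranks clauses by how often they precede the others across the group's documents. The weights and the greedy ranking are assumptions, not the authors' actual values or algorithm:

```python
from difflib import SequenceMatcher

def clause_similarity(a: str, b: str) -> float:
    """Combine shared-word overlap with word-order agreement."""
    wa, wb = a.lower().split(), b.lower().split()
    overlap = len(set(wa) & set(wb)) / max(len(set(wa) | set(wb)), 1)
    order = SequenceMatcher(None, wa, wb).ratio()  # order-sensitive match
    return 0.5 * overlap + 0.5 * order             # illustrative weights

def order_clauses(docs: list[list[str]], clauses: list[str]) -> list[str]:
    """Order template clauses by how often one precedes another across
    the documents of the group (a greedy stand-in for the paper's
    dynamic programming step). Documents are lists of clause labels,
    assumed already canonicalised via clause_similarity matching."""
    wanted = set(clauses)
    precedes = {(a, b): 0 for a in clauses for b in clauses if a != b}
    for doc in docs:
        pos = {c: i for i, c in enumerate(doc) if c in wanted}
        for a in pos:
            for b in pos:
                if a != b and pos[a] < pos[b]:
                    precedes[(a, b)] += 1
    return sorted(clauses,
                  key=lambda c: -sum(precedes[(c, o)] for o in clauses if o != c))
```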
3. The approach offered

In this section we briefly describe the technologies used, so that the reader can follow the logic of the solution. A more detailed description can be found in [9,10], and a detailed comparison with other methods, along with a performance analysis and a number of real-life applications, can be found in [7,11]. It is worth mentioning that the multi-agent algorithms follow a holistic approach by building Open Resource and Demand Networks, where demands and resources can join or leave the network dynamically, in real time [12]. In this methodology we use agents with Demand and Resource roles (records and clusters for clustering; words and their meanings for text understanding) and use semantic networks to formally represent the input situation, incoming events and the results of ongoing matching and re-matching between agents.

3.1 Multi-agent Text Understanding
The method consists of the following four steps (see Fig. 4):
• morphological analysis;
• syntactic analysis;
• semantic analysis;
• pragmatics.

The process goes as follows. The whole text is divided into sentences, which are fed into the meaning extraction process one by one; the first three stages are applied to each sentence. After the text is parsed, the resulting semantic descriptor enters the fourth stage, pragmatics.

Morphological Analysis
• An agent is assigned to each word in the sentence;
• word agents access the ontology and acquire relevant knowledge on morphology;
• word agents execute morphological analysis of the sentence and establish the characteristics of each word, such as gender, number, case, tense, etc.;
• if morphological analysis results in polysemy, i.e. a situation in which some word could play several roles in a sentence (a noun, an adjective or a verb), several agents are assigned to the same word, each representing one of its possible roles, thus creating several branches of possible sentence parsing results.

Syntactic Analysis
• Word agents access the ontology and acquire relevant knowledge on syntax;
• word agents execute syntactic analysis, aiming to identify the syntactic structure of the sentence. For example, a subject searches for a predicate of the same gender and number, and a predicate looks for a suitable subject and objects. Conflicts are resolved through a process of negotiation. A grammatically correct sentence is represented by means of a syntactic descriptor;
• if the results of the syntactic analysis are ambiguous, i.e. several variants of the syntactic structure of the sentence under consideration are feasible, each feasible variant is represented by a different syntactic descriptor. If no syntactic descriptor is fully valid, then several variants with a high enough level of correctness (i.e. more than 80% of grammatically possible links present in the syntactic descriptor) are selected for the next stage.

Semantic Analysis
• Word agents access the ontology and acquire relevant knowledge on semantics, including possible relations between concepts and valid attribute values;
• each selected grammatical structure of the sentence under consideration is subjected to semantic analysis, aimed at establishing the semantic compatibility of words in each grammatically correct sentence. From the ontology, word agents learn the possible meanings of the words they represent and, by consulting each other, attempt to eliminate inappropriate alternatives by building the least contradictive semantic descriptor based on the problem domain ontology;
• once agents agree on a grammatically and semantically correct sentence, they create a semantic descriptor of the sentence, which is a network of the concepts and attribute values contained in the sentence;
• if a solution that satisfies all agents cannot be found, agents compose a message to the user explaining the difficulties and suggesting how the issues could be resolved, or select the most probable decision automatically if the level of correctness is high enough (i.e. more than 80% of links are valid according to the problem domain ontology).

Pragmatics
• Each new grammatically and semantically correct sentence generated in the previous steps is checked for semantic compatibility with the semantic descriptors of preceding sentences. In this process agents may decide to modify previously agreed semantic interpretations of words or sentences by returning to earlier stages of negotiation (described above) with their new knowledge, which improves confidence in certain options and results in the reconstruction of the semantic descriptors of preceding sentences;
• when all sentences are processed, the final semantic descriptor of the whole document is constructed, thus providing a computer-readable semantic interpretation of the text.
Fig. 4. General scheme of natural language processing algorithm
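As a toy illustration only (the real system performs full morphological, syntactic and semantic negotiation between word agents, as described above), the overall control flow of the pipeline can be sketched as follows. Here the ontology is reduced to a set of allowed (concept, relation, concept) triples and the multi-agent negotiation is collapsed into a single filtering step:

```python
import re

def understand(text: str, relations: set) -> list:
    """Toy sketch of the pipeline in [8,9]: sentences are processed one
    by one, candidate readings are generated, and only the readings
    consistent with the problem domain ontology survive."""
    descriptor = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = set(re.findall(r"\w+", sentence.lower()))
        # Morphology/syntax collapsed: any ontology triple whose concepts
        # both occur in the sentence is a candidate semantic link.
        descriptor += [(a, rel, b) for (a, rel, b) in relations
                       if a in words and b in words]
    return descriptor

ontology = {("certificate", "certifies", "option"),
            ("option", "belongs to", "holder")}
print(understand("The certificate certifies the option. "
                 "The option belongs to the holder.", ontology))
```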
3.2 Multi-agent Clustering Algorithm

Within the approach offered, the agents of records and clusters play the main role. Entering clusters, agents form virtual communities similar to temporary hierarchies, which can be organized in different ways. In a simple case, records try to find the most "profitable" clusters with the maximum density, which serves as the metric for their decisions (in more complex cases the metric can include the number of records, the number of sub-dimensions of a cluster, time of life in a cluster, types of attributes, etc.). The search always starts with the nearest points and extends gradually. When a record finds a suitable cluster, it makes an offer and waits for a reply. The cluster reconsiders its locality, calculates its variant and either accepts or rejects the offer. Thus, instead of the centralized, optimal "front-office" decision of a classic algorithm, the approach offered produces decisions taken at the lowest level. These decisions are based only on the current local balance of interests of a particular record and some cluster. If both sides agree, the record enters the cluster; if not, the record searches for other variants. The stages of this process in a simple 2D case are shown in Fig. 5, where: a – the first record enters; b – the second record enters and the first cluster is created; c – the third record enters, forming the second cluster from the first cluster and the third record; d – the second cluster "lures" the records from the first cluster to enter it (as it is two-dimensional and, by the selected value formula, more "profitable" for records), the first cluster is dropped, and the fourth record comes; e – a new cluster is formed; f – a new cluster "lures" the records from the inner cluster, and a new record comes; g – a new record comes and the process starts from the very beginning; h – the final cluster.
Fig. 5. The stages of forming a cluster
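The negotiation described above is distributed and continuous; the single-pass Python sketch below (with an assumed Euclidean metric and admission radius) only illustrates the offer/accept step for the simple 2D density case, omitting the "luring" and re-negotiation dynamics shown in Fig. 5:

```python
import math

def cluster(records: list, radius: float = 1.0) -> list:
    """Single-pass sketch of the offer/accept step: each record offers
    itself to the nearest cluster; the cluster accepts if the record
    lies within its admission radius, otherwise the record founds a
    new cluster. The radius stands in for the density-based "profit"
    metric; the real algorithm re-negotiates memberships continuously."""
    clusters = []
    for r in records:
        best, best_d = None, radius
        for c in clusters:
            centre = tuple(sum(xs) / len(c) for xs in zip(*c))
            d = math.dist(r, centre)
            if d < best_d:            # the record makes an offer
                best, best_d = c, d
        if best is not None:          # the cluster accepts the offer
            best.append(r)
        else:                         # no profitable cluster: found a new one
            clusters.append([r])
    return clusters

print(cluster([(0, 0), (0.2, 0.1), (5, 5), (5.1, 4.9), (0.1, 0.3)]))
# -> [[(0, 0), (0.2, 0.1), (0.1, 0.3)], [(5, 5), (5.1, 4.9)]]
```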
In a more complicated case, agents can interact in a virtual market and use money, which regulates the records' abilities or restricts them in making decisions in a natural way. For example, in transportation logistics, the sum of money available to a record can be set as the price of an order, which the customer is due to pay. The record of a VIP order is then "richer" than the record of a single order from a random customer. This makes it possible for the first record to form clusters with distant records, taking richer context into account and generating more clusters as a result. In this situation different models of the microeconomics of clusters or records can be applied. For example, clusters can charge records for the right to enter in accordance with the "club system" model (the amount of the charge depends neither on the situation, nor on the richness of a record, nor on the number of club members) or according to the "shareholder" model (the amount depends on the situation). In the latter case, a record can even increase its capital by entering or quitting a cluster at the right time. Obviously, if a record sees a more beneficial cluster but does not have enough money to enter it, it can leave its current cluster, which may cause a wave of cluster break-ups. In the process of interaction between clusters and records real negotiation is conducted, with parties meeting each other half-way, for example, by mutual concessions. All this provides different variants of taking approximate decisions and control over the trade-off between accuracy and cost/time effectiveness. Clusters and records can make decisions according to different rules, combining, for example, simple local density with "public value" (the number of clusters a record participates in, or the number of records in a cluster), "richness" or "lifespan" of a cluster or a record. These combinations of rules depend on what the user wants to see: bigger and steadier groups of clusters or, on the contrary, the most dynamic and miniature ones; the densest or the most disperse; the "richest", the most "die-hard" or any others.

The clustering process for a stable data structure resembles crystallization: records create various structures at the micro level, and the produced structures in their turn start participating in the clustering process. The process stops when the whole structure is crystallized. The outcome is a set of high-level structures, which are more or less stable but are adjusted in real time as new records (events) come into the system.

For the task of analyzing semantic descriptors we need to introduce a distance metric that allows us to calculate semantic proximity based on the ontology. This is done by finding the largest common sub-graph of the semantic networks, using partial matching of ontology concepts depending on their distance within the ontology (calculated by a heuristic method that takes into account such parameters as having the same parent, the same attributes, being connected by a relation, etc.). Once the cluster value formula is defined as a distance between records, any two records can be compared, but it is not necessary to compare all records among themselves to take a decision (in this domain the transitivity rule does not hold: if document A is close to document B, and document B is close to document C, we cannot say how close A is to C; this is why many classical clustering algorithms are unable to work with semantic networks). In this scenario clusters also receive a semantic descriptor, which is the common part of the semantic descriptors of their members; a cluster can thus itself be represented as a record and be clustered with other records and clusters based on their semantic proximity.
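The paper describes the proximity measure only at the level of its ingredients (largest common sub-graph, partial concept matching via shared parents, attributes and relations). A simplified Python sketch under those assumptions, with illustrative weights, might look like this:

```python
def concept_proximity(a: str, b: str, parent: dict) -> float:
    """Partial match of two ontology concepts: full score for equal
    concepts, a reduced score for siblings sharing a parent (the
    paper's heuristic also weighs shared attributes and relations)."""
    if a == b:
        return 1.0
    if parent.get(a) is not None and parent.get(a) == parent.get(b):
        return 0.5   # illustrative weight for same-parent concepts
    return 0.0

def descriptor_proximity(edges_a: set, edges_b: set, parent: dict) -> float:
    """Approximate the common-sub-graph measure: score each edge of one
    descriptor by its best partial match in the other, normalised by
    the larger descriptor's size. Note the result is not transitive,
    as the paper points out."""
    if not edges_a or not edges_b:
        return 0.0
    total = 0.0
    for (s1, rel1, t1) in edges_a:
        best = 0.0
        for (s2, rel2, t2) in edges_b:
            if rel1 == rel2:
                best = max(best, (concept_proximity(s1, s2, parent) +
                                  concept_proximity(t1, t2, parent)) / 2)
        total += best
    return total / max(len(edges_a), len(edges_b))
```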
4. Test analysis and results

To assess the quality of the results, we ran a number of test experiments. As an input source we obtained a set of documents that had already been processed manually by the lawyer experts of the client company: the documents had been divided into groups and templates had been created (that work was done before this project started, so we had valid test samples). The following procedure was applied: a problem domain ontology was created based on the full set of documents; semantic descriptors were then built and clustered in automatic mode, without human intervention; and templates were created for each group in semi-automatic mode, with an expert who was not the same person as the one who prepared the test set. The results of this analysis can be found in Table 1.
Table 1. Comparing experts' manual test analysis with system results

# of Docs | # of Groups* (auto/manual) | Max hierarchy level (auto/manual) | Avg. docs per group (auto/manual) | Same and similar groups (%) | Template validness level (%)
       11 |   7 / 9                    |  4 / 3                            |  2 / 2                            | 94 %                        | 96 %
       43 |  23 / 14                   |  5 / 2                            |  4 / 3                            | 91 %                        | 90 %
      125 |  54 / 28                   |  9 / 3                            |  6 / 4                            | 87 %                        | 81 %
      864 | 279 / 43                   | 13 / 3                            |  5 / 14                           | 88 %                        | 79 %

* – shows the overall number of groups, not only high-level ones
Let us state several interesting facts. First of all, quite obviously, the smaller the number of documents, the better the results of manual human analysis. In addition, humans on average prefer to simplify the structure, so with many documents it is hard for an expert to create a hierarchy of many levels; he usually ends up with logic like "very similar / quite similar / more or less similar / non-similar". The overall number of groups in manual mode is also considerably lower. But even with these simplifications, in all our tests one or several groups specified by the expert (usually at the top levels of the hierarchy, i.e. more generalized) were the same as in the automatic mode. A number of the differences can be explained by the system feature that if a document is similar in sense to several clusters, it is included in all such clusters, while a human expert prefers to assign it to only one group. This is also one of the reasons why correctness appears to decrease with scale: when there are too many documents, it is hard for a human to see and remember all similarities, so he usually relies on a few assigned or discovered criteria. In groups that were found both by the computer and by the human expert, documents missed by the human (approx. 35%) were much more frequent than documents wrongly assigned (11%) or missed by the computer but found by the expert (7%).

To summarize the results, even in automatic mode, without human guidance, the system gives approximately 90% agreement in grouping and 80-85% in template building. These results show good quality for a decision-making support process. To conclude the practical testing, we can give the real numbers observed when we deployed this system in the client's company. There were approximately 25,000 instances of car insurance contracts and agreements, 30 pages long on average. Starting from 16 man-years as the initial estimate, we were able to solve this task in 40 man-months (16 man-years ≈ 192 man-months), i.e. speeding up the work almost five times.

5. Conclusion

The paper shows the following results achieved:
• A method and tool were developed to intelligently classify and analyze documents in any given problem domain.
• A car insurance ontology was developed. Being open and easy to enrich, it can be a valuable resource in the current information-oriented society.
• The usage of this system significantly sped up the process of document analysis and development, thus saving time and money and freeing highly qualified resources from monotonous and routine work.

The system under consideration combines the latest developments in the fields of text understanding and knowledge management. It is a good example of a knowledge integration application and provides tools and methods for generalizing, searching, analyzing, comparing, classifying and, finally, building new and existing knowledge. This approach is applicable in any given professional problem domain, providing a powerful suite of tools to work with enterprise knowledge.
References

1. Dumais, Susan T., Furnas, George W., Landauer, Thomas K. (Bell Communications Research, 435 South St., Morristown, NJ 07960), Harshman, Richard (University of Western Ontario): Indexing by Latent Semantic Analysis.
2. Cutting, Douglass R., Karger, David R., Pedersen, Jan O., Tukey, John W.: Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections. SIGIR '92, 1992, pp. 318-329.
3. Zamir, Oren Eli: A Phrase-Based Method for Grouping Search Engine Results. University of Washington, Department of Computer Science & Engineering.
4. Steinbach, M., Karypis, G., Kumar, V.: A Comparison of Document Clustering Techniques. Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 20-23, 2000, Boston, MA, USA.
5. Andreev, V., Iwkushkin, K., Minakov, I., Rzevski, G., Skobelev, P.: The Constructor of Ontologies for Multi-Agent Systems. 3rd International Conference "Complex Systems: Control and Modelling Problems", Samara, Russia, September 4-9, 2001, pp. 480-488.
6. Morin, E.: Automatic Acquisition of Semantic Relations between Terms from Technical Corpora. Proc. of the Fifth Int. Congress on Terminology and Knowledge Engineering (TKE-99), TermNet-Verlag, Vienna.
7. Faure, D., Poibeau, T.: First Experiments of Using Semantic Knowledge Learned by ASIUM for Information Extraction Task Using INTEX. In: S. Staab, A. Maedche, C. Nedellec, P. Wiemer-Hastings (eds.), Proceedings of the Workshop on Ontology Learning, 14th European Conference on Artificial Intelligence ECAI'00, Berlin, Germany, 2000.
8. Andreev, V., Iwkushkin, K., Karyagin, D., Minakov, I., Rzevski, G., Skobelev, P., Tomin, M.: Development of the Multi-Agent System for Text Understanding. 3rd International Conference "Complex Systems: Control and Modelling Problems", Samara, Russia, September 4-9, 2001, pp. 489-495.
9. Minakov, I., Rzevski, G., Skobelev, P.: Automated Text Analysis. Patent Application No. 305634, UK, 2004.
10. Minakov, I., Rzevski, G., Skobelev, P.: Data Mining. Patent Application No. 0403145.6, UK, 2004.
11. Minakov, I., Rzevski, G., Skobelev, P., Volman, S.: Automatic Generation of Business Rules for a Logistics Company Using Multi-Agent Clustering. 1st International Conference on Business Information, Organisation and Process Management (BIOPoM 2006), Westminster Business School, University of Westminster, London, June 2006.
12. Skobelev, P.O.: Holonic Systems Simulation. Proc. of the 2nd International Conference "Complex Systems: Control and Modelling Problems", Samara, June 20-23, 2000, pp. 73-79.