Understanding Text Mining: a Pragmatic Approach

Sergio Bolasco, Alessio Canzonetti, Federico M. Capo, Francesca Della Ratta-Rinaldi, Bhupesh K. Singh

Università "La Sapienza" di Roma, Via del Castro Laurenziano 9, 00161, Roma, Italy
{sergio.bolasco, alessio.canzonetti, francesca.dellaratta}@uniroma1.it [email protected] [email protected]
http://geostasto.eco.uniroma1.it/

Abstract. Making correct decisions often requires analysing large volumes of textual information. Text Mining is a budding new field that endeavours to garner meaningful information from natural language text. It is the process of applying automatic methods to analyse and structure textual data in order to create usable knowledge from previously unstructured information. Text Mining is inherently interdisciplinary, borrowing heavily from neighbouring fields such as data mining and computational linguistics. In this paper a number of real applications are collected in order to define the state of the art in Text Mining and to single out future needs and scenarios.

1 Introduction

We have focused on corporate users and clients, companies, scientific communities and others who may have used Text Mining (TM) techniques to achieve their goals. In order to fulfil the objectives of the research and to give as detailed a study as possible of the most important TM applications, two strategies were pursued. First, some of the main European and Italian companies offering TM services were contacted, in order to collect information on the characteristics of the applications, the type of technology used, the problems and solutions, and possible future scenarios. Secondly, a detailed web search was made to collect further information about operators or developers and applications. On the basis of the material collected, a synthetic grid was built in which to place the 100 cases that we considered most relevant, by typology of function and sector of activity. A profile was compiled for each case, describing systematically the initial problem, the solution and the results obtained, together with the sources consulted to draft the profile.


2 The Approach to the Study

In order to delineate the state of the art of the main TM applications, the starting point was to analyse the work carried out by the companies that have been developing software solutions for TM for some time. Having examined the features of the main Italian companies, the research was extended to other important international market players (SAS, SPSS, Temis, Inxight, etc.). A directory of customers that used TM tools for Business Intelligence or research was compiled. The analysis of the cases allowed us to identify three alternative points of view from which to interpret the TM applications: the functions that they satisfy, the sectors of activity and the type of results that can be obtained.

The applications can be divided into four main typologies: Knowledge Management (KM) and Human Resources (HR); Marketing, ranging from Customer Relationship Management (CRM) to Market Analysis (MA); Technology, ranging from Technology Watch (TW) to Patent Analysis (PA); and, lastly, Natural Language Processing (NLP). Within these macro-groups, the 11 categories of functions reported in the table below were identified.

Table 1. TM functions by sectors of application

KM & HR
• Support for decision making and Competitive Intelligence
• Extraction Transformation Loading
• Human Resources Management: employee motivation and CV analysis

CRM & MA
• Customer care and CRM
• Market Analysis
• Customer Opinion Analysis on virtual communities (mail and newsgroups)

TW
• Patents, scientific abstracts and financial news

NLP
• Questioning in Natural Language and search engines
• Multilingual applications
• Dictionary and Spelling Correction
• Voice Recognition

By linking the functions with the sectors of application, a grid was constructed in which all the case studies (around 100) are placed (see Table 2). We have tried to make the series of cases as complete as possible by searching the web. The study, being based on material taken from the Internet, has a number of limitations: the material collected is heterogeneous and often has a promotional slant, which makes it difficult to produce homogeneous profiles, and the customers using TM often demand confidentiality. Despite these limits, the joint analysis of the different “case studies” gives, in our opinion, an adequate picture of the state of the art of TM applications. In the following sections the TM applications analysed are briefly described according to:
a) the possible types of results to be obtained;
b) the main specifications of the sectors of application;
c) the type of function, with reference to the most significant cases recorded in the NEMIS report of the WG3 (each case has a code number and is thus easy to find).

3 The Results Obtained with the Help of Text Mining

The applications analysed present a considerable variety from the point of view of objectives, the type of texts analysed and the strategy of processing and analysis chosen. As a rule, the text mining process is preceded by two essential phases:
1) the pre-processing phase, in which text retrieval, formatting and filing are done;
2) the lexical processing phase, which involves the identification and lemmatisation of words.

Following these two phases, the actual Text Mining processing is extremely diversified, as it is strictly linked to the objectives to be achieved. These are:
1. automatic analysis of documents and their categorisation/classification for subsequent information retrieval;
2. search for relevant entities for information extraction;
3. formulation of queries in natural language, interpreted by NLP processes based on artificial intelligence algorithms;
4. processing of multi-lingual texts for the retrieval of information independently of the original language of the documents.

3.1 Automatic categorisation/classification of documents

The automatic analysis of documents is aimed at different types of results: a) the classification of documents within a predefined grid of categories; b) the clusterisation of the texts according to conceptual similarity or vocabulary; c) the extraction of semantic information from the text; d) text summarisation.

a) The classification of documents in a predefined “grid” of categories is probably the most frequent case. This technique is used, for example, for the management of document bases, as in the case of large publishers or of legal information, or for CRM applications that provide for automatic message routing.

b) When, on the other hand, a grid to classify the documents is not available, clusterisation techniques are used, which group the documents according to the similarity of their contents. Clusterisation is the process most frequently used when the content of the documents under analysis is highly variable and often unknown to the user (as in the case of documents returned by a search engine): the subdivision into groups gives an idea of the conceptual domains to which the documents belong, since it is usually possible to consult the list of words characterising each cluster (for example, by means of the TFIDF index). Clusterisation can be used not only for information retrieval, but also for identifying trends and topics by reading the texts. Thus, clusterisation provides an organised overview of the topics contained in the documents.


c) The analysis can be aimed at the extraction of relevant semantic contents from the text being examined, as in the case of Customer Opinion Analysis (COA), used in marketing applications in which a set of messages is analysed to obtain information concerning customers’ opinions.

d) Finally, Text Summarisation makes it possible to create summaries and/or abstracts of documents automatically. Text Summarisation procedures carry out a linguistic/statistical analysis of the document under examination to identify the topics dealt with and to eliminate the parts that are not significant for the purposes of synthesis.

3.2 Search for relevant entities

The applications dedicated to the search for relevant information do not involve the classification of documents but the formulation of an answer to a specific query. Information extraction is frequent in the applications of Competitive Intelligence, Technology Watch and Market Analysis, which have in common the objective of extracting strategic information from a vast number of documents. In most cases statistical techniques of data reduction are used. The information extracted generally consists of lists of competitors and the products they offer, lists of potential clients, partnerships between companies, news of investments or ventures in new markets, and information on new technologies or patents.

3.3 Formulation of queries in natural language

Natural Language Processing forms the basis of most TM processes. Among the most significant applications that use this type of linguistic technology are those that permit the management of queries in natural language, used above all for CRM or eGovernment. These applications facilitate contact and the retrieval of information on the Internet (or on an intranet) for users who are not particularly familiar with query languages. The result of this process is the extraction of information with maximum precision and minimum effort on the part of the user.

3.4 Processing of multi-lingual texts

A sector in continuous expansion, which will undoubtedly represent one of the future TM developments, is that of the management and interpretation of multi-lingual corpora. The ability to interact simultaneously with texts drafted in different languages (starting with special dictionaries in which the "translators" have been contextually tested) is exploited above all in the field of search engines, which use multi-lingual platforms to extract documents of interest. A specific case study on the extraction of information from multi-lingual corpora has been carried out by the company Synthema in the context of the NEMIS project (see Neri, F., Text Mining applied to multilingual corpora, Nemis case study, 2004).
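As an illustration of the cluster characterisation mentioned in Section 3.1, the following minimal sketch ranks the words of each group of documents with a TF-IDF score. It is only one possible formulation, not the procedure used in any of the cases surveyed: each cluster is treated as a single pseudo-document, and the function names and the toy documents are assumptions made for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase a text and return its alphabetic tokens (English-only toy rule)."""
    return re.findall(r"[a-z]+", text.lower())

def top_terms_per_cluster(clusters, k=3):
    """Return the k highest-scoring TF-IDF terms for each cluster.

    `clusters` maps a cluster label to the list of documents assigned to it.
    Each cluster is treated as one pseudo-document, so a term scores high
    when it is frequent inside the cluster and absent from the others.
    """
    counts = {label: Counter(t for doc in docs for t in tokenize(doc))
              for label, docs in clusters.items()}
    n_clusters = len(counts)
    # Number of clusters in which each term occurs at least once.
    df = Counter(term for c in counts.values() for term in c)
    result = {}
    for label, c in counts.items():
        total = sum(c.values())
        scores = {t: (f / total) * math.log(n_clusters / df[t])
                  for t, f in c.items()}
        # Sort by decreasing score, then alphabetically to break ties.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result[label] = ranked[:k]
    return result

# Hypothetical documents, already grouped into two clusters.
clusters = {
    "call_centre": ["call centre operators answer customer calls",
                    "the call centre routes customer requests"],
    "patents": ["the new patent filed for drug compound",
                "patent analysis of drug research"],
}
print(top_terms_per_cluster(clusters, k=2))
```

In this toy corpus the word "the" occurs in both clusters, so its inverse document frequency is zero and it never appears among the characteristic terms, which is the behaviour that makes the index useful for labelling clusters.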


4 The Sectors of Text Mining Application

The main TM applications are most often used in the following sectors:
• Publishing and media;
• Telecommunications, energy and other services industries;
• Information technology sector and Internet;
• Banks, insurance, venture capitalists and financial markets;
• Political institutions, political analysts, public administration and legal documents;
• Pharmaceutical and research companies and healthcare.

The sectors analysed are characterised by a fair variety in the applications being tried out; however, it is possible to identify some sector-specific features in the use of TM, linked to the type of production and to the knowledge management objectives that lead these sectors to use TM. The publishing sector, for example, is marked by the prevalence of Extraction Transformation Loading (ETL) applications for cataloguing, production and the optimisation of information retrieval. In the banking and insurance sectors, on the other hand, CRM applications are prevalent; they are aimed at improving the management of customer communication by means of automatic message re-routing systems and of Questioning in Natural Language (QNL) applications, which allow search engines to be queried in natural language. In the institutional field, ETL applications are prevalent for the filing and management of legal and normative documents, while CRM and QNL applications are used to increase citizens’ participation and the dissemination of information. In the medical and pharmaceutical sectors, applications of Competitive Intelligence and Technology Watch are widespread for the analysis, classification and extraction of information from articles, scientific abstracts and patents. A sector in which several types of applications are widely used is that of telecommunications and service companies: the most important objectives of these industries all find an answer in TM applications, from market analysis to human resources management, from spelling correction to customer opinion surveys.

5 Type of Functions and Objectives

5.1 Text Mining Applications in Knowledge Management and Human Resources

Competitive Intelligence (CI)

Companies need to organise and modify their strategies according to the demands and opportunities that the market presents. This requires them to collect information about themselves, the market and their competitors, to manage enormous amounts of data, and to analyse them in order to make plans. The aim of Competitive Intelligence is to select only the relevant information by automatic reading of these data.


Once the material has been collected, it is classified into categories to build a database, which is then analysed to obtain answers to specific questions that are crucial for company strategies. Typical queries concern the products, the sectors of investment of the competitors, the partnerships existing in markets, the relevant financial indicators, and the names of the employees of a company with a certain profile of competences (see cases CI-TES1 and CI-TES2). In some cases the introduction of TM replaces existing systems, as in the case of Total (CI-TES2) where, before the introduction of TM, there was a division entirely dedicated to the continuous monitoring of information (financial, geopolitical, technical and economic) and to answering the queries coming from other sectors of the company. In these cases the return on investment from the use of TM technologies was self-evident when compared with the results previously achieved by manual operators.

In some cases, if a scheme of categories is not defined a priori, clusterisation procedures are used to classify the set of documents considered relevant to a certain topic into clusters of documents with similar contents. The analysis of the key concepts present in the single clusters gives an overall view of the subjects dealt with in the single texts. A good example of this is given by the IBM corporation (CI-OT1), realised with Online Analyst: in order to answer the needs of the sales department, all the documents regarding call centres were extracted from the company’s document base and classified into 30 groups, the features of which are visualised with the list of key concepts characterising them.
The software used permitted the creation of some summary tables, which in the example in question give the names of the competitors and the number of times they are mentioned, the names of other companies involved in the market, the principal partnerships among companies and the list of information technology services identified in the documents, with their occurrences.

Cluster analysis procedures were also used to define a classification system into which new documents could be added at any time. This is the case, for example, of the pharmaceutical company GlaxoSmithKline (CI-PH3) which, using SAS Text Miner on a sample of PubMed articles, identified some macro-categories into which to classify the references to its own and its competitors’ pharmaceutical products.

Another important case of CI was the assessment of hospitals and of the work of hospital doctors carried out by researchers of the University of Louisville (Kentucky, USA) on the relationship between diagnoses and medical prescriptions (CI-PH1). Using Text Miner, the researchers analysed the pharmaceutical orders together with the information on the medicine’s code number, the name of the medicine, the diagnosis and the names of the doctors who made out the prescription. Cluster analysis made it possible to identify different groups in which to recognise common medical practice or, in anomalous cases, a wrong prescription or a particularly innovative treatment. The natural result of this study was the creation of predefined lists to verify the relation between diagnoses and prescriptions.

Extraction Transformation Loading (ETL)

Extraction Transformation Loading applications are aimed at filing non-structured textual material into categories and structured fields. ETL is usually associated with search engines that guarantee the retrieval of information, generally by means of systems providing for conceptual browsing and questioning in natural language. The applications are found in the publishing sector, the legal and political document field and the medical and healthcare field. The case studies presented refer to big European and American publishing groups (see in particular ETL-PM1, ETL-PM3 and ETL-PM5), which share the need to file their documents and to facilitate their retrieval by advanced search engines permitting conceptual search. For the filing, complex systems of taxonomies are usually defined to interact with automatic tools.

In the legal documents sector the document filing and information management operations have to deal with the particular features of legal language, which require the identification and tagging of the elements relevant for juridical purposes and the normalisation of normative references (e.g. making the abbreviation "art." equivalent to "article"). In the case of legal documents (ETL-POL2), the principal aim of the filing procedures is the optimisation of search and information retrieval.

In the healthcare sector, the experience of the NHS medical centre of Modena (ETL-PH2) is indicative, as it was aimed at the sharing of knowledge in complex systems in which people with different professional experience operate. An integrated knowledge system was realised with the Cogito software of Expert System, by which non-structured documents of different types and formats are harmonised and put into one single database, which can be accessed by doctors or call centre operators.

Human Resources Management (HR)

TM techniques are also used to manage human resources strategically, mainly with applications aimed at analysing staff opinions, monitoring the level of employee satisfaction, and reading and storing CVs for the selection of new personnel.
Two examples of this type of application are the CV analysis system realised by Crédit Lyonnais (HR-BIF1) and the system for measuring the level of employee motivation realised by ConocoPhillips (HR-TES1).

In the context of human resources management, TM techniques are often used to monitor the state of health of a company by means of the systematic analysis of informal documents. A good example of this is the case of ConocoPhillips (HR-TES1), a fast-moving American company, which developed an internal system, the VSM (Virtual Signs Monitor), able to detect the intangible but crucial aspects of company life, the degree of experience and knowledge and the “productive” abilities. The approach chosen by Conoco was that of “measuring” the company mood by means of the indicators suggested by Sumantra Ghoshal’s theory of ‘The Individualized Corporation’, which contrasts the traditional managerial model, founded on concepts of constraint, contract, control and compliance, with a new model based on completely different pillars, such as stretch, discipline, trust and reciprocal support. This new managerial model, according to Ghoshal’s formulation, encourages cooperation and collaboration between the elements of an organisation, improving its results. The collaboration with Temis enabled Conoco to refine its system for monitoring textual sources such as e-mails, internal surveys of employees’ opinions, declarations of the management, and internal and external chat lines, all of which are important means for sounding the evolution of company culture.


The morpho-syntactic and semantic analysis made it possible to relate the occurrences of certain expressions present in the textual sources to one (or more) of the indicators suggested by Ghoshal and representative of the two contrasting managerial models (“Organization Man” and “Individualized Corporation”).

5.2 Text Mining Applications in Customer Relationship Management and Market Analysis

Customer Relationship Management (CRM)

In the CRM domain the most widespread applications are related to the management of the contents of clients’ messages. This kind of analysis often aims at automatically rerouting specific requests to the appropriate service or at supplying immediate answers to the most frequently asked questions. With reference to the Italian market, the sectors in which customer care seems to be most developed are banks, insurance companies and public institutions, where civic networks have recently been developed.

The need to give quick answers to potential and existing customers is particularly felt in the insurance sector, above all following the recent spread of companies that deliver services exclusively by phone. Among the different cases identified, one of the most interesting in Italy is that of Linear Assicurazioni (CRM-BIF7), which has adopted an automatic support system for its call centre operators.

One of the first European civic networks was Iperbole, the website of the municipality of Bologna. Iperbole is a real electronic notice board giving citizens information supplied by different sources (local authorities, firms, social bodies, citizens' associations) and enabling all the members of the urban community to participate in public debates on local topics or to communicate (by electronic mail) with other members of the community and with the promoters of the network themselves.
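The automatic rerouting of client requests described in this subsection can be pictured with a deliberately simple keyword-based router. This is a sketch under stated assumptions: the services, the keyword lists and the function name are invented for illustration, and the systems cited in the case studies rely on far richer linguistic analysis than keyword counting.

```python
import re
from collections import Counter

# Hypothetical routing grid: each service is described by a few keywords.
ROUTING_GRID = {
    "claims": {"accident", "claim", "damage", "theft"},
    "billing": {"invoice", "payment", "premium", "refund"},
    "contracts": {"cancel", "policy", "quote", "renewal"},
}

def route_message(message, grid=ROUTING_GRID, default="operator"):
    """Return the service whose keywords occur most often in the message.

    A message matching no keyword is sent to `default` (a human operator).
    Ties are resolved in favour of the service listed first in `grid`.
    """
    words = Counter(re.findall(r"[a-z]+", message.lower()))
    scores = {service: sum(words[k] for k in keywords)
              for service, keywords in grid.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default

print(route_message("I had an accident and want to file a claim for the damage"))
print(route_message("Please send me a refund for the double payment"))
print(route_message("Good morning, can somebody call me back?"))
```

The fallback to a human operator reflects the semi-automated nature of the call centres mentioned above: the classifier only takes over the requests it can assign with some evidence.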
Market Analysis (MA)

Market Analysis, instead, uses TM mainly to analyse competitors and/or monitor customers' opinions in order to identify new potential customers, as well as to determine a company’s image through the analysis of press reviews and other relevant sources. For many companies tele-marketing and e-mail activity represents one of the main sources for acquiring new customers. The Italian company Celi developed InfoDiver, a tool used by various companies which visits a list of target sites, analyses their contents and automatically produces a list containing the names (with address, telephone number and e-mail) of all the companies meeting the required criteria (MA-TES2).

TM tools also make it possible to present more complex market scenarios. This is the case, for example, of the consortium Telcal (Telematica Calabria, MA-TES3), set up by the Region of Calabria and some telecommunications companies with the aim of promoting innovation throughout the Region by building a capillary telematic network over the territory. The Consortium started the development of an application called Market Intelligence System (MIS), an “intelligent” software, developed with the contribution of Temis and based on automatic text reading, which allows the small and medium-sized companies of the Region to discover, according to their own characteristics and without any prior indication, in which parts of the world market to promote their products. It screens a careful selection of websites and thousands of articles coming from about 3,000 sources of the international press, periodically downloaded into a database. The system filters the information and obtains answers to precise queries by the users, strategically supporting their marketing activity.

5.3 Text Mining Applications in Technology Watch (TW)

Technological monitoring, which analyses the characteristics of existing technologies and identifies emerging ones, is characterised by two elements: the capacity to identify in a non-ordinary way what already exists and is consolidated, and the capacity to identify what is available only at an embryonic stage, recognising its potential, its fields of application and its relationships with existing technology.

The case studies show that, as in the example of the European Molecular Biology Laboratory in Germany (TW-PH3), specific statistical techniques (factorial techniques) are applied to represent the concepts and scientific topics around which the documents being analysed are organised. The result of this application is a search engine that is useful for identifying and creating search keys for documents belonging to similar subject contexts.

Another interesting case is that of AREA Science Park, the science park of Trieste, Italy (TW-OT1), engaged in the valuation of research and the transfer of innovation to production. Synthema developed a patent search and classification engine, able to download, from public and private databases, the patents and scientific publications of potential interest to the user.
The bibliographical references of each document belonging to the same thematic domain are first normalised and then imported into the system. A multi-lingual tool (Italian, English, French and German) then allows the operator to automatically extract the key information from the texts and to index and store it in the database.

5.4 Text Mining Applications in Natural Language Processing and Multilingual Aspects

Questioning in Natural Language

The most important application of the linguistic competences developed in the TM context is the construction of websites that support systems of questioning in natural language. An example is the conceptual meta-search engine adopted for the website of the President of the Council of Ministers (QNL-POL2), developed by Expert System, which is able to recognise and interpret natural language, thus simplifying the search for information on the part of the users.


The need to make sites cater as much as possible for customers who are not necessarily expert in computers or web search is common also to companies that do an important part of their business on the web. A good example of this is the French company La Redoute (QNL-INF5), an on-line commerce company that saw a noticeable increase in its orders following the introduction, in 2001, of a search engine that supports queries in natural language, based on the software iCatalog (Sinequa). The iIntuiton technology, also produced by Sinequa, was used to provide natural language querying for the French cinema site AlloCiné (QNL-INF4) and for Leroy Merlin (QNL-OT4), one of the most important companies in the field of DIY and building materials.

Multi-lingual Applications

In NLP, Text Mining applications characterised by multi-lingualism are also quite frequent. Besides the examples already quoted in this section and others in the report, further applications are the system of the American company Inktomi for identifying and retrieving web pages in different languages (ML-INF1) and the multi-lingual search engine of the company Verity (ML-INF2). Inktomi, which manages an archive of websites to which search engines around the world refer, experimented with Text Mining to identify and analyse web pages published in different languages by means of the Inxight LinguistX Platform® tool, a natural language processing engine (produced by the American company Inxight). Verity is a Californian company that deals with the design and construction of company portals and knowledge management systems able to manage information automatically by means of advanced search systems. Through the phases of text parsing (normalisation, segmentation, lemmatisation and de-compounding) the Verity search engine permits more efficient retrieval operations, thanks to its ability to manage documents drafted in 10 languages. This has enabled the Californian company to enter the international market, from which it was previously excluded.
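To make the four text parsing phases named above more concrete, the following toy pipeline applies normalisation, segmentation, lemmatisation and de-compounding in sequence. It is not a description of the Verity engine: the lemma table, the vocabulary and all function names are assumptions invented for the example, and a real system would rely on large language-specific dictionaries.

```python
import re
import unicodedata

# Toy resources invented for this example.
LEMMAS = {"documents": "document", "engines": "engine", "searched": "search"}
VOCABULARY = {"text", "mining", "data", "base", "search", "engine", "document"}

def normalise(text):
    """Lowercase the text and strip diacritics (for instance 'é' becomes 'e')."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def segment(text):
    """Split normalised text into alphabetic tokens."""
    return re.findall(r"[a-z]+", text)

def lemmatise(token):
    """Map an inflected form to its lemma when the toy table knows it."""
    return LEMMAS.get(token, token)

def decompound(token):
    """Split a token into two known words if such a split exists."""
    if token in VOCABULARY:
        return [token]
    for i in range(2, len(token) - 1):
        head, tail = token[:i], token[i:]
        if head in VOCABULARY and tail in VOCABULARY:
            return [head, tail]
    return [token]

def parse(text):
    """Apply the four phases in sequence and return the resulting terms."""
    terms = []
    for token in segment(normalise(text)):
        terms.extend(decompound(lemmatise(token)))
    return terms

print(parse("Database documents searched by Textmining engines"))
```

Reducing every form to a small set of base terms is what allows a query such as "search engine" to match a document containing "searched" or "engines"; de-compounding matters most for languages that form long compounds, such as German.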

References

Balbi, S., Bolasco, S., Verde, R., (2002), “Text mining on elementary forms in complex lexical structures”, JADT 2002. Actes des 6es journées internationales d’analyse statistique de données textuelles, Saint-Malo, IRISA, pp. 89-100.
Bolasco, S., Bisceglia, B., Baiocchi, F., (2004), “Estrazione automatica di informazione dai testi”, Mondo Digitale, a. III, n. 1, pp. 27-43.
Castells, M., (2002), Galassia Internet, Milano, Feltrinelli. [orig. 2001, The Internet Galaxy, Oxford University Press]
Collica, R. S., (2003), “Mining textual data for CRM applications”, DM Review, February 2003.
Dini, L., Mazzini, G., (2002), “Opinion classification through information extraction”, in Zanasi, Brebbia, Ebecken and Melli (eds), Data Mining III, WIT Press, pp. 299-310.
Evans, D., (2000), “Advanced text mining tools help CRM systems 'see' customers”, Unisys World, July 2000.
Feldman, R., Dagan, I., (1995), “Knowledge Discovery in Textual Databases”, Proceedings of KDD-95, pp. 112-117.
Lebart, L., Salem, A., Berry, L., (1998), Exploring Textual Data, Kluwer Academic Publishers, Dordrecht-Boston-London.
Mani, I., Maybury, M. T., (eds), (1999), Advances in Automatic Text Summarization, The MIT Press, Cambridge, Massachusetts.
Nasukawa, T., Nagano, T., (2001), “Text analysis and knowledge mining system”, IBM Systems Journal, Vol. 40, No. 4.
Neri, F., (2004), Text Mining applied to multilingual corpora, Nemis case study.
Sullivan, D., (2001), Document Warehousing and Text Mining. Techniques for Improving Business Operations, Marketing and Sales, New York, Wiley.
Sullivan, D., (2000), “Beyond the numbers”, Intelligent Enterprise, Vol. 3, No. 14.
Zanasi, A., (2002), “L’analisi dei testi nel CRM analitico”, Computerworld Online, http://www.cwi.it/ncwi/news.nsf/0/db572cab7fe31ab0c1256b83005c5426?OpenDocument.
Zanasi, A., (2001), “Text Mining: the new Competitive Intelligence Frontier. Real cases in industrial, banking and telecom/SMEs world”, VSST2001 Conference Proceedings, Barcelona.

Appendix

Table 2. Some main case studies by sector and application typology (*)

Sectors and typical applications
• BIF - Banks, insurance and financial markets: analysis of financial news, semi-automated call centres, automatic analysis of curriculum vitae
• POL - Political Institutions, Public Administration and Legal Documents: automatic document analysis, search engine dedicated to on-line public services
• INF - Informatics, Internet, Linguistic Resources On Line: information retrieval of texts independent of language used, automated translation
• TES - Telecommunications, Energy and Other Services Industries: semi-automated call centres, extraction of lists of customers/competitors, strategy planning
• PH - Pharmaceutical and Healthcare: automatic extraction of data from abstracts of biomedical content, management of clinical data
• PM - Publishing and Media: automated retrieval of archives created by publishers and broadcasters
• OT - Others

Application typologies and functions
• Knowledge Management and Human Resources: CI - Supports Decision Making and Competitive Intelligence; ETL - Extraction Transformation Loading (transformation of free text into structured text and database filling); HR - Human Resources: Employee Motivation and CV Analysis
• Customer Relationship Management and Market Analysis: CRM - Customer Care and CRM; MA - Market Analysis; COA - Customer Opinion Analysis on Virtual Communities (Mail and Newsgroups)
• Technology Watch: TW - Patents, Scientific Abstracts and Financial News
• Natural Language Processing: ORT - Dictionary and Spelling Correction; QNL - Questioning in Natural Language and Search Engines; ML - Multi-lingual Writing and Teaching Languages; VR - Voice Recognition

Case studies cited in the paper, by application typology and sector
• Knowledge Management and Human Resources: BIF: HR-BIF1; POL: ETL-POL2; TES: CI-TES1, CI-TES2, HR-TES1; PH: CI-PH1, CI-PH3, ETL-PH2; PM: ETL-PM1, ETL-PM3, ETL-PM5; OT: CI-OT1
• Customer Relationship Management and Market Analysis: BIF: CRM-BIF7; TES: MA-TES2, MA-TES3
• Technology Watch: PH: TW-PH3; OT: TW-OT1
• Natural Language Processing: POL: QNL-POL2; INF: QNL-INF4, QNL-INF5, ML-INF1, ML-INF2; OT: QNL-OT4

Total: 100 cases. (The per-cell case counts of the original grid are not reproduced here.)

(*) The codes refer to the case studies mentioned in the paper and illustrated in the WG3 NEMIS Final Report (see http://nemis.cti.gr/).