Introduction to Text Mining Overview - Universidade do Porto

Introduction to Text Mining

Pavel Brazdil LIAAD INESC Porto LA FEP, Univ. of Porto

Escola de Verão "Aspectos de Processamento da LN" (Summer School on Aspects of Natural Language Processing), F. Letras, UP, 4th June 2009, http://www.liaad.up.pt

Overview
1. Introduction
1.1 What is Text Mining
1.2 Documents and Document Collections
1.3 Representation of Documents and Features
1.4 Basic Text Mining Tasks
1.5 Advanced Text Mining Tasks
2. Document Classification
3. Clustering of Documents
4. Information Extraction
5. Discovery of Patterns and Trends


Bibliography
- R. Feldman, J. Sanger: The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, Cambridge Univ. Press, 2007.
- S. Weiss, N. Indurkhya, T. Zhang, F. Damerau: Text Mining: Predictive Methods for Analyzing Unstructured Information, Springer, 2005.
- F. Sebastiani: Machine Learning in Automated Text Categorization, ACM Computing Surveys, Vol. 34, No. 1, 2002.
- F. Colas, P. Brazdil: On the Behavior of SVM and Some Older Algorithms in Binary Text Classification Tasks, in Text, Speech and Dialogue, LNCS, Vol. 4188, pp. 45-52, 2006.
- J. Cordeiro, P. Brazdil: Learning Text Extraction Rules without Ignoring Stop Words, in the 4th International Workshop on Pattern Recognition in Information Systems (PRIS 2004), pp. 128-138.
- K. Patil, P. Brazdil: SumGraph: Text Summarization using Centrality in the Pathfinder Network, International Journal on Computer Science and Information Systems, 2(1), pp. 18-32, 2007.
- Ingo Feinerer: Introduction to the tm Package: Text Mining in R, http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
- Luís Torgo: A Linguagem R, programação para a Análise de Dados, Escolar Editora, 2009.
- 20Newsgroups, http://people.csail.mit.edu/jrennie/20Newsgroups/

1.1 What is Text Mining
Text mining (TM) seeks to extract useful information from a collection of documents. It is similar to data mining (DM), but the data sources are unstructured or semi-structured documents.
TM methods involve:
- Basic pre-processing / TM operations, such as identification / extraction of representative features (this can be done in several phases)
- Advanced text mining operations, involving identification of complex patterns (e.g. relationships between previously identified concepts)
TM exploits techniques / methodologies from data mining, machine learning, information retrieval and corpus-based computational linguistics.


1.2 Documents and Document Collections
A document collection is a grouping of text-based documents. It can be either static or dynamic (growing over time).
A document is a unit of discrete textual data within a collection, usually representing some real-world document such as a business report, memorandum, email, research paper or news story.
A document can be a member of several document collections (e.g. legal affairs and computing equipment, if it falls under both).


1.2 Document Collection PubMed
An example of a real-world document collection: PubMed is the National Library of Medicine's on-line repository of citation-related information for biomedical research papers. This on-line service contains abstracts for more than 12 million research papers. About 40,000 new abstracts are added every month.
Search for a particular document: keyword search is not very useful, as protein or gene returns 2,800,000 documents. Even a more specific term, epidermal growth factor receptor, returned 10,000 documents.
Identification of relationships across documents: manual attempts are labour-intensive or impossible to achieve. Automatic methods enhance the speed and efficiency of research activities.

1.2 Document Structure
Text documents can be:
- unstructured, i.e. free-style text (although from a linguistic perspective they are really structured objects)
- weakly structured, adhering to some pre-specified format, like most scientific papers, business reports, legal memoranda, news stories etc.
- semistructured, exploiting heavy document templating or style sheets.


1.3 Document Representation and Features
An irregular and implicitly structured representation is transformed into an explicitly structured representation. We can distinguish:
- feature-based representation,
- relational representation.
In a feature-based representation, documents are represented by a set of features.


1.3 Document Representation and Features
Examples of some commonly used features:
- Characters: make it possible to recognize e.g. morphological features. So-called bigrams (trigrams) represent sequences of 2 (or 3) characters.
- Words: often the term word-level tokens is used instead. Tokens can be annotated (e.g. with labels representing noun, verb etc.). A bag-of-words representation exploits words, but their order is ignored. A word stem represents a group of related words stripped of a suffix.
- Terms: may represent single words or multiword units, such as “White House”.
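The character-level and word-level features listed above can be illustrated with a minimal Python sketch. The function names and the suffix list are assumptions made for this illustration only; in particular, `naive_stem` is a toy rule, not a real stemming algorithm.

```python
import re
from collections import Counter

def char_ngrams(text, n):
    """Return all overlapping character sequences of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_tokens(text):
    """Lowercase the text and split it into word-level tokens."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(text):
    """Count each token; the order of the words is discarded."""
    return Counter(word_tokens(text))

def naive_stem(word, suffixes=("ing", "ed", "es", "s")):
    """Strip the first matching suffix (toy rule, hypothetical suffix list)."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

print(char_ngrams("text", 2))                # character bigrams
print(bag_of_words("the cat saw the dog"))   # word counts, order ignored
print([naive_stem(w) for w in ["mining", "mined", "mines"]])
```

Two texts containing the same words in a different order yield the same bag of words, which is exactly the information loss the representation accepts.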


1.3 Document Representation and Features
Examples of some commonly used features (continued):
- Concepts: for example, the concept identifier “car” can represent different words in the text, such as automobile, car, sports car etc. Concepts are useful to represent synonyms and help to resolve polysemy.


1.3 Problem of High Dimensionality
Structured representations of natural language documents usually lead to a very large number of features. For instance, one small Reuters collection of 15,000 documents contains 25,000 non-trivial features (word stems).
Some algorithms do not deal very well with large numbers of features, and hence it is necessary to employ feature reduction techniques.
Another problem is feature sparsity: each document contains only a small number of all potential features.
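Sparsity and one simple form of feature reduction can be made concrete on a toy collection. The sketch below is only an illustration: the four documents are invented, and dropping terms that occur in fewer than two documents is one assumed reduction rule among many possible ones.

```python
from collections import Counter

# Invented toy collection (not taken from Reuters).
docs = [
    "oil prices rise",
    "oil exports fall",
    "bank raises interest rates",
    "bank cuts interest rates",
]
tokenized = [d.split() for d in docs]
vocabulary = sorted({w for doc in tokenized for w in doc})

# Document-term matrix: one row per document, one column per feature.
matrix = [[doc.count(term) for term in vocabulary] for doc in tokenized]

# Sparsity: fraction of matrix cells that are zero.
nonzero = sum(1 for row in matrix for value in row if value > 0)
sparsity = 1 - nonzero / (len(matrix) * len(vocabulary))

# Feature reduction by document frequency: keep terms found in >= 2 documents.
doc_freq = Counter(term for doc in tokenized for term in set(doc))
reduced_vocabulary = [t for t in vocabulary if doc_freq[t] >= 2]

print(len(vocabulary), "features, sparsity", sparsity)
print("after reduction:", reduced_vocabulary)
```

Even with four short documents most cells of the matrix are zero; with thousands of documents and tens of thousands of word stems the proportion of zeros becomes far larger.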


1.4 Basic Text Mining Tasks
- Document classification (categorization)
- Information retrieval
- Clustering / organization of documents
- Information extraction
More information to follow.


1.4 Information Retrieval
Retrieval of documents in response to a “query document” (as a special case, the query document can consist of a few keywords).
[Diagram: Document Collection + Query Document -> Document Matcher -> Retrieved Documents]
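One possible way to realise the document matcher is to compare bag-of-words vectors with cosine similarity. This is an assumed choice for illustration, not necessarily the matcher intended in the lecture; the collection is invented and the function names are hypothetical.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words vector of a text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query, collection, top_k=2):
    """Return up to top_k documents, most similar to the query first."""
    q = vectorize(query)
    scored = [(cosine(q, vectorize(doc)), doc) for doc in collection]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

collection = [
    "growth factor receptor in tumour cells",
    "stock market report for june",
    "epidermal growth factor receptor signalling",
]
results = retrieve("epidermal growth factor receptor", collection)
print(results)
```

The query is treated as a short document, so a longer query document works with the same code.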


1.4 Document Classification
Classification of documents into predefined categories (classes).
[Diagram: New Document -> Classification -> Agriculture, Finance, Legislation, etc.]
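As a rough illustration of classification into predefined categories, the sketch below trains a small multinomial Naive Bayes classifier with Laplace smoothing. The choice of Naive Bayes is an assumption made here (the slide does not name a method), and the labelled training texts are invented.

```python
import math
from collections import Counter, defaultdict

def train(labelled_docs):
    """Collect per-class word counts and per-class document counts."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in labelled_docs:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocabulary

def classify(text, model):
    """Return the class with the highest log-probability for the text."""
    word_counts, class_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # words never seen in training are ignored
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented training examples for two of the categories on the slide.
training = [
    ("wheat harvest and crop yields", "Agriculture"),
    ("dairy farm subsidies", "Agriculture"),
    ("bank interest rates and loans", "Finance"),
    ("stock market and bank shares", "Finance"),
]
model = train(training)
print(classify("crop subsidies for wheat", model))
print(classify("bank loans", model))
```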


1.4 Clustering / Organizing Documents
Unsupervised process through which documents are classified into groups called clusters.
[Diagram: Document Collection -> Document Organizer -> Group 1, Group 2, Group 3, etc.]
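A very simple document organizer can be sketched as single-pass ("leader") clustering over word sets with Jaccard similarity. This technique is chosen here only because it is short and deterministic; it is an assumption, not the method of the lecture, and the threshold value and documents are invented.

```python
def jaccard(a, b):
    """Overlap between two word sets, from 0 (disjoint) to 1 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(docs, threshold=0.2):
    """Put each document into the first cluster whose leader is similar enough.

    Each cluster is a list of document indices; its first index is the leader.
    No class labels are used, so the process is unsupervised.
    """
    clusters = []
    word_sets = [set(d.lower().split()) for d in docs]
    for i, words in enumerate(word_sets):
        for members in clusters:
            if jaccard(words, word_sets[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])  # no similar leader: start a new cluster
    return clusters

docs = [
    "wheat crop harvest",
    "bank interest rates",
    "wheat harvest yields",
    "bank rates rise",
]
groups = cluster(docs)
print(groups)
```

Unlike classification, the number and meaning of the groups are not fixed in advance; they emerge from the similarity threshold.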


1.4 Information Extraction
IE involves identification of certain entities in the text, their extraction and representation in a pre-specified format (e.g. a table).

Input (two real-estate advertisements, in Portuguese):

T5 Duplex em Gaia Data: 2002-05-10 15:01:24 PST Excelente localização no centro da cidade. 2 WC, despensa, terraço com marquise com 70 m2; 119700 euros; Tel. 966969663

Apartamento pouco usada T4, 2 wc´s, 3º andar com vista panorámica. Excelente localização, a poucos metros da zona central de Loulé. Perto metros do tribunal, biblioteca, piscinas, e diversos estabelecimentos comerciais. Preço: 132.180 Euros (negociavel) 936109097

Output: Filled in Template / Table

Price     Type   Location   Area
119 700   T5     Gaia       70
132.180   T4     Loulé      ?
...
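The template filling above can be approximated with hand-written regular expressions. The patterns below are assumptions tuned to these two advertisements only (for example, the location is taken to be the capitalised word after "em" or "de"); they are a sketch, not the rule-learning approach a real IE system would use.

```python
import re

AD_1 = ("T5 Duplex em Gaia Data: 2002-05-10 15:01:24 PST Excelente localização "
        "no centro da cidade. 2 WC, despensa, terraço com marquise com 70 m2; "
        "119700 euros; Tel. 966969663")
AD_2 = ("Apartamento pouco usada T4, 2 wc´s, 3º andar com vista panorámica. "
        "Excelente localização, a poucos metros da zona central de Loulé. "
        "Perto metros do tribunal, biblioteca, piscinas, e diversos "
        "estabelecimentos comerciais. Preço: 132.180 Euros (negociavel) 936109097")

def extract(ad):
    """Fill the Price / Type / Location / Area template from one advertisement."""
    price = re.search(r"(\d[\d. ]*\d)\s*euros", ad, re.IGNORECASE)
    kind = re.search(r"\bT\d\b", ad)                      # e.g. T4, T5
    place = re.search(r"\b(?:em|de)\s+([A-Z]\w+)", ad)    # capitalised word after em/de
    area = re.search(r"(\d+)\s*m2", ad)
    return {
        # Thousands separators (dot or space) are removed before conversion.
        "price": int(re.sub(r"[. ]", "", price.group(1))) if price else None,
        "type": kind.group(0) if kind else None,
        "location": place.group(1) if place else None,
        "area": int(area.group(1)) if area else None,   # None plays the role of "?"
    }

for ad in (AD_1, AD_2):
    print(extract(ad))
```

The second advertisement states no area, so the corresponding slot stays empty, matching the "?" in the table.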


1.5 Advanced Text Mining Tasks
- Concept co-occurrence; quantification of co-occurrence
- Identification of trends in data; identification of new topics
- Summarization


1.5 Concept Co-occurrence
Detection of concept co-occurrence in documents, e.g. Disease - Medical Drug (based on BioWorld articles):
- Rheumatoid Arthritis - Rituximab
- Rheumatoid Arthritis - Infliximab
- Prostate Carcinoma - APC8015 Vaccine
- etc.

The frequency of co-occurrence can be expressed in numerical form, or using a graphical representation (e.g. a circle graph; the width of the line indicates the strength of the connection).
[Diagram: circle graph linking Diseases (Rheumatoid Arthritis, Prostate Carcinoma) to Medical Drugs (Rituximab, Infliximab, APC8015 Vaccine)]
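Quantifying co-occurrence in numerical form amounts to counting, for each disease-drug pair, the documents that mention both. The sketch below assumes concepts can be found by plain substring matching and uses invented sentences rather than BioWorld articles; the resulting counts could serve as the line widths of a circle graph.

```python
from collections import Counter
from itertools import product

# Concept lists taken from the example; matching by lowercase substring
# is a simplifying assumption.
DISEASES = ["rheumatoid arthritis", "prostate carcinoma"]
DRUGS = ["rituximab", "infliximab"]

def cooccurrences(documents):
    """Count documents in which a disease and a drug are both mentioned."""
    counts = Counter()
    for text in documents:
        lowered = text.lower()
        found_diseases = [d for d in DISEASES if d in lowered]
        found_drugs = [m for m in DRUGS if m in lowered]
        # Each (disease, drug) pair is counted once per document.
        counts.update(product(found_diseases, found_drugs))
    return counts

# Invented toy documents.
documents = [
    "Report on rituximab and rheumatoid arthritis",
    "Rheumatoid arthritis: infliximab compared with rituximab",
    "Note on prostate carcinoma",
]
counts = cooccurrences(documents)
for (disease, drug), n in sorted(counts.items()):
    print(disease, "-", drug, ":", n)
```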

1.5 Identification of Trends in Data
Identification of trends in data:
- How does the news concerning a particular disease (e.g. rheumatoid arthritis) and a particular medical drug (e.g. Rituximab) change over time?
- How does the news concerning a particular company and a particular product (e.g. the medical drug Rituximab) change over time?
Identification of new topics in the data:
- Did any new articles appear concerning a certain type of company (e.g. a pharmaceutical company) and a particular type of product (e.g. a medical drug useful for treating lung cancer)?
Identification of disappearing topics in the data.
Identification of the period covered by a certain topic.
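A first approximation to trend identification is to count, per time period, the documents that mention a given pair of concepts. The sketch assumes each document carries an ISO date string and groups by month; the dated texts are invented, and `trend` is a hypothetical helper name.

```python
from collections import Counter

def trend(dated_docs, terms):
    """Count, per month, the documents that mention every term in `terms`."""
    counts = Counter()
    for date, text in dated_docs:
        lowered = text.lower()
        if all(term in lowered for term in terms):
            counts[date[:7]] += 1  # "YYYY-MM" prefix of an ISO date
    return dict(sorted(counts.items()))

# Invented toy data: (date, text) pairs.
dated_docs = [
    ("2008-01-15", "Report on rituximab and rheumatoid arthritis"),
    ("2008-01-20", "Rheumatoid arthritis: rituximab follow-up"),
    ("2008-02-03", "Infliximab and rheumatoid arthritis"),
    ("2008-03-11", "Rituximab in rheumatoid arthritis patients"),
]
monthly = trend(dated_docs, ["rituximab", "rheumatoid arthritis"])
print(monthly)
```

A month that is absent from the result had no matching documents; the first and last months present give a crude estimate of the period covered by the topic.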

1.5 Summarization
Summarization of a single document:
- Selection of some sentences summarizing the document
  D1: S1, S2, S3, S4 -> S4
Summarization of several documents:
- Selection of a single representative document
  D1, D2, D3, D4, D5 -> D6
- Selection of representative sentences from different documents
  D1: S1, S2, S3, S4; D2: S5, S6, S7; D3: S8, S9, S10 -> Dsummar: S1, S6, S4
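Sentence selection for a single document can be sketched by scoring each sentence with the average frequency of its words in the whole document and keeping the best ones. This frequency-based scoring is an assumption chosen for brevity (it is not the SumGraph method cited in the bibliography), and the example text is invented.

```python
import re
from collections import Counter

def summarize(text, n_sentences=1):
    """Select the n highest-scoring sentences, returned in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence):
        # Average document-wide frequency of the words in the sentence.
        tokens = re.findall(r"[a-z]+", sentence.lower())
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:n_sentences])  # restore original order
    return [sentences[i] for i in chosen]

text = ("Text mining extracts information from documents. "
        "The weather was nice. "
        "Text mining uses documents.")
print(summarize(text, 1))
print(summarize(text, 2))
```

For several documents, the same scoring could be applied to the pooled sentences of all documents, which corresponds to selecting representative sentences from different documents.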