Taxonomy generation for document collections

US006446061B1

(12) United States Patent
Doerre et al.

(10) Patent No.: US 6,446,061 B1
(45) Date of Patent: Sep. 3, 2002

(54) TAXONOMY GENERATION FOR DOCUMENT COLLECTIONS

(75) Inventors: Jochen Doerre; Peter Gerstl; Sebastian Goeser, all of Stuttgart; Adrian Mueller, Boeblingen; Roland Seiffert, Herrenberg, all of (DE)

(73) Assignee: International Business Machines Corporation, Armonk, NY (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(21) Appl. No.: 09/345,260

(22) Filed: Jun. 30, 1999

(30) Foreign Application Priority Data
Jul. 31, 1998 (EP) 98114371

(51) Int. Cl.7: G06F 17/30

(52) U.S. Cl.: 707/3

(58) Field of Search: 707/101, 6, 3; 705/10; 706/50

(56) References Cited

U.S. PATENT DOCUMENTS

6,233,575 B1 * 6/2000 Agrawal et al. 707/6

OTHER PUBLICATIONS

Y.S. Maarek and G.E. Kaiser, “Change Management for Very Large Software Systems”, Conference Proceedings, Seventh Annual International Phoenix Conference on Computers and Communications, p. 280-285, 1988.

Yoelle S. Maarek, “On the use of cluster analysis for assisting maintenance of large software systems”, Third Israel Conference on Computer Systems and Software Engineering, p. 178-186, 1988.

J. Mothe, T. Dkaki, B. Dousset, “Mining Information in Order to Extract Hidden and Strategic Information”, Proceedings of Computer-Assisted Information Searching on Internet, RIAO97, p. 32-51, Jun. 1997.

M. Iwayama, T. Tokunaga, “Cluster-Based Text Categorization: A Comparison of Category Search Strategies”, Proceedings of SIGIR 1995, p. 273-280, Jul. 1995, ACM.

Y. Maarek, A.J. Wecker, “The Librarian’s Assistant: Automatically organizing on-line books into dynamic bookshelves”, Proceedings of RIAO’94, Intelligent Multimedia, IR Systems and Management, NY, 1994.

Yoelle S. Maarek, D.M. Berry and G.E. Kaiser, “An Information Retrieval Approach for Automatically Constructing Software Libraries”, IEEE Transactions on Software Engineering, vol. 17, No. 8, p. 800-813, Aug. 1991.

(List continued on next page.)

Primary Examiner: Diane D. Mizrahi
Assistant Examiner: Apu Mofiz
(74) Attorney, Agent, or Firm: R. Bruce Brodie; Marc D. McSwain

(57) ABSTRACT

This mechanism relates to a method within the area of information mining within a multitude of documents stored on computer systems. More particularly, this mechanism relates to a computerized method of generating a content taxonomy of a multitude of electronic documents. The technique proposed by the current invention is able to improve at the same time the scalability and the coherence and selectivity of taxonomy generation. The fundamental approach of the current invention comprises a subset selection step, wherein a subset of a multitude of documents is being selected. In a taxonomy generation step a taxonomy is generated for that selected subset of documents, the taxonomy being a tree structured taxonomy hierarchy. Moreover this method comprises a routing selection step assigning each unprocessed document to the taxonomy hierarchy based on largest similarity.

31 Claims, 4 Drawing Sheets


OTHER PUBLICATIONS

Richard Helm and Yoelle S. Maarek, “Integrating Information Retrieval and Domain Specific Approaches for Browsing and Retrieval in Object Oriented Class Libraries”, Proceedings of OOPSLA’91, p. 47-61, Phoenix, AZ, Oct. 1991.

Yoelle S. Maarek and Frank A. Smadja, “Full Text Indexing Based on Lexical Relations. An Application: Software Libraries”, Proceedings of SIGIR’89, 12th International Conference on Research and Development in Information Retrieval, ed. N.J. Belkin and C.J. van Rijsbergen, Special Issue of the SIGIR Forum, ACM Press, p. 198-206, Cambridge, MA, Jun. 1989.

Yoelle S. Maarek and Daniel M. Berry, “The Use of Lexical Affinities in Requirements Extraction”, Proceedings of the Fifth International Workshop on Software Specification and Design, p. 196-202, Pittsburgh, PA, May 19-20, 1989.

Michael Herscovici et al., “The Shark-Search Algorithm. An Application: Tailored Web Site Mapping”, Proceedings of WWW7, the 7th International World Wide Web Conference, Brisbane, Apr. 1998. Also appeared in the Journal of Computer Networks and ISDN 30 (1998), p. 317-326. Also available at http://www7.scu.edu.au/programme/fullpapers/1849/com1849.htm.

Yoelle S. Maarek et al., “WebCutter: A system for dynamic and tailorable site mapping”, Proceedings of WWW6, the 6th International World Wide Web Conference, Santa Clara, CA, Apr. 1997. Also appeared in the Journal of Computer Networks and ISDN 29 (1997), p. 1269-1279. Also available at http://www.scope.gmd.de/info/www6/technical/paper040/paper40.htm.

Yoelle S. Maarek and Israel Z. Ben Shaul, “Automatically Organizing Bookmarks per Contents”, Proceedings of WWW5, the 5th International World Wide Web Conference, Paris, May 1996. Also appeared in the Journal of Computer Networks and ISDN 28, No. 7-11, p. 1321-1334. Also available at http://www5conf.inria.fr/?chihtml/papers/P37/Overview.htm.

* cited by examiner

U.S. Patent, Sep. 3, 2002, Sheet 2 of 4, US 6,446,061 B1

FIG. 2 (flow diagram): document database (210); subset selection (201), either randomly or randomly according to date; subset; feature extraction (202) using linguistic features (LF) and lexical affinities (LA); feature vector; hierarchical clustering (203) with optional slicing; taxonomy; labeling (optional); categorization training, which computes category schemes; category scheme; routing (206), assigning documents to the taxonomy generated from the subset; taxonomy of document database.

U.S. Patent, Sep. 3, 2002, Sheet 3 of 4, US 6,446,061 B1

FIG. 3 (example taxonomy tree): nodes are labeled with word pairs; legible labels include “nt,window”, “applic,develop”, “microsoft,window”, “corp,microsoft”, “managed,network”, “site,web”, “anti,virus”, “ip,switch”, “nt,server”, “ce,window”, “unix,workstation” and “compute,network”.

U.S. Patent, Sep. 3, 2002, Sheet 4 of 4, US 6,446,061 B1

FIG. 4 (example taxonomy tree): nodes are labeled with terms and names; legible labels include “bank, NatWest Securities, comment”, “bank, banking, Banc One Corp.”, “banking, Federal Reserve Ban, fund”, “bank, fund, pay”, “IWHome Banking”, “bank, Federal Reserve Ban, International Bank” and “bank, banking, IWHome Banking”.


TAXONOMY GENERATION FOR DOCUMENT COLLECTIONS

FIELD OF THE INVENTION

The present invention relates to a method within the area of information mining within a multitude of documents stored on computer systems. More particularly, the invention relates to a computerized method of generating a content taxonomy of a multitude of electronic documents.

BACKGROUND OF THE INVENTION

Organizations generate and collect large volumes of data, which they use in daily operations. Yet many companies are unable to capitalize fully on the value of this data because information implicit in the data is not easy to discern. Operational systems record transactions as they occur, day and night, and store the transaction data in files and databases. Documents are produced and placed in shared files or in repositories provided by document management systems. The growth of the Internet, and its increased worldwide acceptance as a core channel both for communication among individuals and for business operations, has multiplied the sources of information and therefore the opportunities for obtaining competitive advantages. Business Intelligence Solutions is the term that describes the processes that together are used to enable improved decision making. Information mining is the process of data mining and/or text mining. It uses advanced technology for gleaning valuable insights from these sources that enable the business user to make the right business decisions and thus obtain the competitive advantages required to thrive in today’s competitive environment. Information mining in general generates previously unknown, comprehensible, and actionable information from any source, including transactions, documents, e-mail, Web pages, and other sources, and uses it to make crucial business decisions.

Data is the raw material. It can be a set of discrete facts about events, and in that case, it is most usefully described as structured records of transactions, and it is usually of numeric or literal type. But documents and Web pages are also a source of unstructured data, delivered as a stream of bits which can be decoded as words and sentences of text in a certain language. Industry analysts estimate that unstructured data represents 80% of an enterprise’s information compared to 20% from structured data; it comprises data from different sources, such as text, image, video, and audio; text is, however, the most predominant variety of unstructured data.

The IBM Intelligent Miner Family is a set of offerings that enables the business professional and in general any knowledge worker to use the computer to generate meaningful information and useful insights from both structured data and text. Although the general problems to solve (e.g. clustering, classification) are similar for the different data types, the technology used in each case is different, because it needs to be optimized to the media involved, the user needs, and to the best use of the computing resources. For that reason, the IBM Intelligent Family is comprised of two specialized products: the IBM Intelligent Miner for Data, and the IBM Intelligent Miner for Text.

Information mining has been defined as the process of generating previously unknown, comprehensible, and actionable information from any source. This definition exposes the fundamental differences between information mining and the traditional approaches to data analysis such as query and reporting and online analytical processing (OLAP) for structured data, and from full text search for textual data. In essence, information mining is distinguished by the fact that it is aimed at the discovery of information and knowledge, without a previously formulated hypothesis. By definition, the information discovered through the mining process must have been previously unknown, that is, it is unlikely that the information could have been hypothesized in advance. For structured data, the interchangeable terms “data mining” and “knowledge discovery in databases” describe a multidisciplinary field of research that includes machine learning, statistics, database technology, rule based systems, neural networks, and visualization. “Text mining” technology is also based on different approaches of the same technologies; moreover it exploits techniques of computational linguistics.

Both data mining and text mining share key concepts of knowledge extraction, such as the discovery of which features are important for clustering, that is, finding groups of similar objects that differ significantly from other objects. They also share the concept of classification, which refers to finding out to which class a certain database record (in the case of data mining) or a certain document (in the case of text mining) belongs. The classification schema can be discovered automatically through clustering techniques (the machine finds the groups or clusters and assigns to each cluster a generalized title or cluster label that becomes the class name). In other cases the taxonomy can be provided by the user, and the process is called categorization.

Many of the technologies and tools developed in information mining are dedicated to the task of discovery and extraction of information or knowledge from text documents, called feature extraction. The basic pieces of information in text, such as the language of the text or company names or dates mentioned, are called features. Information extraction from unconstrained text is the extraction of the linguistic items that provide representative or otherwise relevant information about the document content. These features are used to assign documents to categories in a given scheme, group documents by subject, focus on specific parts of information within documents, or improve the quality of information retrieval systems. The extracted features can also serve as meta data about the analyzed documents. Extracting implicit data from text can be interesting for many reasons; for instance:

to highlight important information, e.g. to highlight important terms in documents. This can give a quick impression whether the document is of any interest.

to find names of competitors, e.g. when doing a case study in a certain business area one can do a names extraction on the documents that one has received from different sources and then sort them by names of competitors.

to find and store key concepts. This could replace a text retrieval system where huge indexes are not appropriate but only a few key concepts of the underlying document collection should be stored in a database.

to use related topics for query refinement, e.g. store the key concepts found in a database and build an application for query refinement on top of it. Thus topics that are related to the users’ initial queries can be suggested to help them refine their queries.

Feature extraction from texts, and the harvesting of crisp and vague information, require sophisticated knowledge models, which tend to become domain specific. A recent research prototype has been disclosed by J. Mothe, T. Dkaki, B. Dousset, “Mining Information in Order to Extract Hidden and Strategic Information”, Proceedings of Computer-
Assisted Information Searching on Internet, RIAO97, pp 32-51, June 1997.

A further technology of major importance in information mining is dedicated to the task of clustering of documents. Within a collection of objects a cluster could be defined as a group of objects whose members are more similar to each other than to the members of any other group. In information mining clustering is used to segment a document collection into subsets, the clusters, with the members of each cluster being similar with respect to certain interesting features. For clustering no predefined taxonomy or classification schemes are necessary. This automatic analysis of information can be used for several different purposes:

to provide an overview of the contents of a large document collection;

to identify hidden structures between groups of objects, e.g. clustering allows that related documents are all connected by hyper links;

to ease the process of browsing to find similar or related information, e.g. to get an overview over documents;

to detect duplicate and almost identical documents in an archive.

Typically, the goal of cluster analysis is to determine a set of clusters, or a clustering, in which the inter-cluster similarity is minimized and intra-cluster similarity is maximized. In general, there is no unique or best solution to this task. A number of different algorithms have been proposed that are more or less appropriate for different data collections and interests.

Hierarchical clustering works especially well for textual data. In contrast to flat or linear clustering where the clusters have no genuine relationship, the clusters in a hierarchical approach are arranged in a clustering tree where related clusters occur in the same branch of the tree. Clustering algorithms have a long tradition. Examples and overviews of clustering algorithms may be found in M. Iwayama, T. Tokunaga, “Cluster-Based Text Categorization: A Comparison of Category Search Strategies”, in: Proceedings of SIGIR 1995, pp 273-280, July 1995, ACM, or in Y. Maarek, A. J. Wecker, “The Librarian’s Assistant: Automatically organizing on-line books into dynamic bookshelves”, in: Proceedings of RIAO ’94, Intelligent Multimedia, IR Systems and Management, NY, 1994.

A further technology of major importance in information mining is dedicated to the task of categorization of documents. In general, to categorize objects means to assign them to predefined categories or classes from a taxonomy. The categories may be overlapping or distinct, depending on the domain of interest. For text mining, categorization can mean to assign categories to documents or to organize documents with respect to a predefined organization. Categorization in the context of text mining means to assign documents to preexisting categories, sometimes called topics or themes. The categories are chosen to match the intended use of the collection and have to be trained beforehand. By assigning documents to categories, text mining can help to organize them. While categorization cannot replace the kind of cataloging a librarian does, it provides a much less expensive alternative.

State of the art technologies for taxonomy generation suffer several deficiencies, like:

the problem of navigational balance: the taxonomy must be well-balanced for navigation by an end-user. In particular, the fan-out at each level of the hierarchy must be limited, the depth must be limited, and there must not be empty nodes.

the problem of orientation: nodes in the taxonomy should reflect “concepts” and give sufficient orientation for a user traversing the taxonomy.

the problem of coherence and selectivity: the leaf nodes in the taxonomy should be maximally coherent with all assigned documents having the same thematic content. Related documents from different nodes should appear within short distance in the taxonomy structure.

The most important problems of the current state of the art technologies for taxonomy generation are

the problem of scalability: any document of the collection must be assigned to some leaf node in the taxonomy and the whole taxonomy generation process must be applicable to a significantly larger number of documents and still be able to generate a taxonomy within a reasonable amount of time.

the problem of domain-independence: no hand-coded knowledge on the domain to be analyzed derived from an analysis of the given document collection to steer and to speed up the taxonomy generation process should be used.

OBJECTIVE OF THE INVENTION

The invention is based on the objective to improve the scalability of a taxonomy generation process, allowing a taxonomy generation method to cope with increasing numbers of documents to be analyzed in a reasonable amount of time. It is a further objective of the current invention to improve said scalability and at the same time to guarantee domain-independence of the taxonomy generation method.

SUMMARY OF THE INVENTION

The current invention teaches a method of generating a content taxonomy of a multitude of documents (210) stored on a computer system and said method being executable by a computer system. The fundamental approach of the current invention comprises a subset-selection-step (201), wherein a subset of said multitude of documents is selected. In a taxonomy-generation-step (202 to 205) a taxonomy is generated for that selected subset of documents, said taxonomy being a tree-structured taxonomy-hierarchy. Said subset is divided into a set of clusters with largest intra-similarity and each of said clusters of largest intra-similarity is assigned to a leaf-node of said taxonomy-hierarchy as outer cluster. The inner-nodes of said taxonomy-hierarchy are ordering said subset, starting with said outer clusters, into inner-clusters with increasing cluster-size and decreasing similarity. Moreover said method comprises a routing selection-step (206), wherein for each unprocessed document of said multitude of documents not belonging to said subset its similarities with said outer-clusters are computed and said document is assigned to a leaf-node of said taxonomy-hierarchy comprising the outer-cluster with largest similarity.

The technique proposed by the current invention is able to improve at the same time the scalability and the coherence and selectivity of taxonomy generation. Scalability is provided as the taxonomy generation step, being the most time consuming part of the overall process, is operating on the selected subset of documents only. This approach alone would not be sufficient for solving the overall problem. It is due to the remaining features of claim 1 that the taxonomy of a reasonably selected and reasonably sized subset of documents is already a stable taxonomy with respect to the complete multitude of documents. The introduction of a separate routing selection step allows the mass of the documents to be assigned very efficiently in an already computed taxonomy. By exploiting a hierarchical taxonomy approach the leaf nodes in the taxonomy are coherent with all assigned documents having the same thematic content and related documents from different nodes appear within short distance in the taxonomy structure.
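The three steps summarized above (subset selection, taxonomy generation on the subset, and routing of the remaining documents) can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names are hypothetical, plain word counts stand in for the linguistic features and lexical affinities, cosine similarity is an assumed similarity measure, the summed feature vector of a cluster is an assumed stand-in for a category scheme, and only the leaf level (the outer clusters) of the hierarchy is built.

```python
import math
import random
from collections import Counter


def feature_vector(text):
    """Toy feature extraction (assumption): lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def summed(vectors):
    """Sum of feature vectors; used here as a cluster's category scheme."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total


def cluster_leaves(vectors, n_leaves):
    """Bottom-up clustering: start with one cluster per document and
    repeatedly merge the most similar pair until n_leaves clusters remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_leaves:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(summed(vectors[k] for k in clusters[i]),
                             summed(vectors[k] for k in clusters[j]))
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters


def generate_and_route(docs, subset_size, n_leaves, seed=0):
    """Return {document index: leaf index} for every document in docs."""
    rng = random.Random(seed)
    subset_ids = rng.sample(range(len(docs)), subset_size)   # subset selection
    vectors = [feature_vector(docs[i]) for i in subset_ids]  # feature extraction
    leaves = cluster_leaves(vectors, n_leaves)               # leaf level only
    schemes = [summed(vectors[k] for k in leaf) for leaf in leaves]
    assignment = {subset_ids[k]: n
                  for n, leaf in enumerate(leaves) for k in leaf}
    for i, text in enumerate(docs):                          # routing
        if i not in assignment:
            vec = feature_vector(text)
            assignment[i] = max(range(len(schemes)),
                                key=lambda n: cosine(vec, schemes[n]))
    return assignment
```

In this sketch the expensive pairwise clustering touches only the sampled subset, while every other document costs one similarity computation per leaf, which mirrors the scalability argument made above.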

The taxonomy generated according to the current invention is very stable, i.e. increasing the size of a reasonably sized subset of documents will not change the taxonomy in any essential manner. Moreover the proposed method is completely domain independent, i.e. no hand-coded knowledge on the domain to be analyzed derived from an analysis of the given document collection is required to steer and to speed up the taxonomy generation process. As a result the complete taxonomy generation process is fully automatic and does not require any human intervention or adaptation.

Additional advantages are accomplished by the aspect that said taxonomy-generation-step comprises a first feature-extraction-step (202) extracting for each document of said subset its features and computing its feature statistics in a feature vector (212) as a representation of said document.

Introducing a distinct feature extraction step increases flexibility of the proposed method as it becomes possible to exploit different feature extraction technologies depending on the intended purpose of the taxonomy, depending on the document domain and depending on the characteristics of the various feature extraction technologies. Storing the time consuming computation of the feature-vectors speeds up processing as the feature-vectors can be used again in later processing steps.

Additional advantages are accomplished by the aspect that said taxonomy-generation-step comprises a clustering step (203) using a hierarchical clustering algorithm to generate said taxonomy-hierarchy and using said feature vectors for determining similarity.

Using the hierarchical clustering algorithm, working bottom-up, i.e. which starts with clusters comprising a single document and then working upwards by merging clusters until the root cluster has been generated, guarantees good coherence and selectivity of the taxonomy. Moreover the

[...]

very effective as only two feature-vectors have to be compared, the feature-vector of the recently processed document and the category-scheme.

Based on above approach, the similarity calculation of an unprocessed document with respect to an outer-cluster is very effective as only two feature-vectors have to be compared, the feature-vector of the unprocessed document and the category-scheme.

Additional advantages are accomplished by the aspect that said first-feature-extraction-step and/or said second feature-extraction-step extract features based on lexical affinities within said documents.

Exploiting lexical affinity technology allows the proposed method to determine (in a domain independent manner) multi-word phrases which have a much higher semantic meaning compared to the single terms. Thus orientation for the users is improved as the taxonomy is able to reflect “concepts”.

Additional advantages are accomplished by the aspect that said first-feature-extraction-step and/or said second feature-extraction-step extract features based on linguistic features within said documents.

Exploiting linguistic features technology allows the proposed method to determine (in a domain independent manner) names of people, organizations, locations, domain terms (multi-word terms) and other significant phrases from the text. Different variants are associated with a single canonical form. Thus in cases where “names” are of importance the proposed feature improves orientation and selectivity for the users.

Additional advantages are accomplished by the aspect that said lexical affinities are extracted with a window of M words to identify co-occurring words.

Allowing the window size used to determine lexical affinities to be adjusted gives the freedom to control processing time versus complexity of extracted features. M being a natural number with 1