US006446061B1

(12) United States Patent
     Doerre et al.

(10) Patent No.: US 6,446,061 B1
(45) Date of Patent: Sep. 3, 2002

(54) TAXONOMY GENERATION FOR DOCUMENT COLLECTIONS

(75) Inventors: Jochen Doerre; Peter Gerstl; Sebastian Goeser, all of Stuttgart; Adrian Mueller, Boeblingen; Roland Seiffert, Herrenberg, all of (DE)

(73) Assignee: International Business Machines Corporation, Armonk, NY (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(21) Appl. No.: 09/345,260
(22) Filed: Jun. 30, 1999

(30) Foreign Application Priority Data
     Jul. 31, 1998 (EP) 98114371

(51) Int. Cl.7: G06F 17/30
(52) U.S. Cl.: 707/3
(58) Field of Search: 707/101, 6, 3; 705/10; 706/50

(56) References Cited

U.S. PATENT DOCUMENTS
6,233,575 B1 * 6/2000 Agrawal et al. 707/6

OTHER PUBLICATIONS
J. Mothe, T. Dkaki, B. Dousset, "Mining Information in Order to Extract Hidden and Strategic Information", Proceedings of Computer-Assisted Information Searching on Internet, RIAO97, p. 32-51, Jun. 1997.
M. Iwayama, T. Tokunaga, "Cluster-Based Text Categorization: A Comparison of Category Search Strategies", Proceedings of SIGIR 1995, p. 273-280, Jul. 1995, ACM.
Y. Maarek, A.J. Wecker, "The Librarian's Assistant: Automatically organizing on-line books into dynamic bookshelves", Proceedings of RIAO'94, Intelligent Multimedia, IR Systems and Management, NY, 1994.
Yoelle S. Maarek, D.M. Berry and G.E. Kaiser, "An Information Retrieval Approach for Automatically Constructing Software Libraries", IEEE Transactions on Software Engineering, vol. 17, No. 8, p. 800-813, Aug. 1991.
Y.S. Maarek and G.E. Kaiser, "Change Management for Very Large Software Systems", Conference Proceedings, Seventh Annual International Phoenix Conference on Computers and Communications, p. 280-285, 1988.
Yoelle S. Maarek, "On the use of cluster analysis for assisting maintenance of large software systems", Third Israel Conference on Computer Systems and Software Engineering, p. 178-186, 1988.

(List continued on next page.)

Primary Examiner: Diane D. Mizrahi
Assistant Examiner: Apu Mofiz
(74) Attorney, Agent, or Firm: R. Bruce Brodie; Marc D. McSwain

(57) ABSTRACT

This mechanism relates to a method within the area of information mining within a multitude of documents stored on computer systems. More particularly, this mechanism relates to a computerized method of generating a content taxonomy of a multitude of electronic documents. The technique proposed by the current invention is able to improve at the same time the scalability and the coherence and selectivity of taxonomy generation. The fundamental approach of the current invention comprises a subset selection step, wherein a subset of a multitude of documents is being selected. In a taxonomy generation step a taxonomy is generated for that selected subset of documents, the taxonomy being a tree structured taxonomy hierarchy. Moreover this method comprises a routing selection step assigning each unprocessed document to the taxonomy hierarchy based on largest similarity.

31 Claims, 4 Drawing Sheets
[Front-page drawing: reduced copy of the FIG. 2 flow diagram (subset selection 201, feature extraction, hierarchical clustering, optional labeling, categorization training, routing).]
OTHER PUBLICATIONS
Richard Helm and Yoelle S. Maarek, "Integrating Information Retrieval and Domain Specific Approaches for Browsing and Retrieval in Object Oriented Class Libraries", Proceedings of OOPSLA'91, p. 47-61, Phoenix, AZ, Oct. 1991.
Yoelle S. Maarek and Frank A. Smadja, "Full Text Indexing Based on Lexical Relations. An Application: Software Libraries", Proceedings of SIGIR'89, 12th International Conference on Research and Development in Information Retrieval, ed. N.J. Belkin and C.J. van Rijsbergen, Special Issue of the SIGIR Forum, ACM Press, p. 198-206, Cambridge, MA, Jun. 1989.
Yoelle S. Maarek and Daniel M. Berry, "The Use of Lexical Affinities in Requirements Extraction", Proceedings of the Fifth International Workshop on Software Specification and Design, p. 196-202, Pittsburgh, PA, May 19-20, 1989.
Michael Herscovici et al., "The Shark-Search Algorithm - An Application: Tailored Web Site Mapping", Proceedings of WWW7, the 7th International World Wide Web Conference, Brisbane, Apr. 1998. Also appeared in the Journal of Computer Networks and ISDN 30 (1998), p. 317-326. Also available at http://www7.scu.edu.au/programme/fullpapers/1849/com1849.htm.
Yoelle S. Maarek et al., "WebCutter: A system for dynamic and tailorable site mapping", Proceedings of WWW6, the 6th International World Wide Web Conference, Santa Clara, CA, Apr. 1997. Also appeared in the Journal of Computer Networks and ISDN 29 (1997), p. 1269-1279. Also available at http://www.scope.gmd.de/info/www6/technical/paper040/paper40.htm.
Yoelle S. Maarek and Israel Z. Ben Shaul, "Automatically Organizing Bookmarks per Contents", Proceedings of WWW5, the 5th International World Wide Web Conference, Paris, May 1996. Also appeared in the Journal of Computer Networks and ISDN 28, No. 7-11, p. 1321-1334. Also available at http://www5conf.inria.fr/?chihtml/papers/P37/Overview.htm.

* cited by examiner
U.S. Patent    Sep. 3, 2002    Sheet 2 of 4    US 6,446,061 B1

[FIG. 2: flow diagram. Elements, top to bottom: document database (210); subset selection (201): randomly, or randomly according to date; subset (212); feature extraction (202): linguistic features (LF), lexical affinities (LA), producing feature vectors; hierarchical clustering, with optional slicing; taxonomy (214); labeling (optional); categorization training: compute category schemes; category scheme; routing: assigning documents to the taxonomy generated from the subset; taxonomy of document database (216).]

FIG. 2
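The flow of FIG. 2 (subset selection, feature extraction, hierarchical clustering, category schemes, routing) can be sketched in a few lines. The following Python sketch is only an illustration under stated assumptions, not the patented implementation: plain word counts stand in for the LF/LA features, cosine similarity over summed count vectors is an assumed similarity measure, clustering stops at a fixed number of leaf clusters instead of building the full tree, and every function name is hypothetical.

```python
import math
import random
from collections import Counter


def feature_vector(text):
    """Word counts for one document (assumed stand-in for LF/LA features)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity of two sparse count vectors (assumed measure)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def summed(vectors):
    """Sum of count vectors; plays the role of a cluster's category scheme."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total


def select_subset(n_docs, size, seed=0):
    """Subset selection: draw `size` document indices at random."""
    return sorted(random.Random(seed).sample(range(n_docs), size))


def cluster_bottom_up(vectors, n_leaves):
    """Start with one cluster per document and repeatedly merge the two
    most similar clusters until only n_leaves clusters remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_leaves:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(summed(vectors[k] for k in clusters[i]),
                             summed(vectors[k] for k in clusters[j]))
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters


def route(vector, schemes):
    """Routing: index of the leaf whose scheme is most similar to vector."""
    return max(range(len(schemes)), key=lambda i: cosine(vector, schemes[i]))


docs = [
    "windows nt server",
    "nt server windows network",
    "bank fund pay",
    "bank banking fund",
    "windows nt workstation",
    "bank federal reserve fund",
]
# Fixed subset so the run is reproducible; select_subset(len(docs), 4)
# would draw the subset at random instead.
subset_ids = [0, 1, 2, 3]
vectors = [feature_vector(docs[i]) for i in subset_ids]
leaves = cluster_bottom_up(vectors, n_leaves=2)
schemes = [summed(vectors[k] for k in leaf) for leaf in leaves]
# Only documents outside the subset are routed; each needs one
# comparison per leaf scheme rather than one per subset document.
routed = {i: route(feature_vector(docs[i]), schemes)
          for i in range(len(docs)) if i not in subset_ids}
```

In this toy run the expensive pairwise clustering touches only the four subset documents, while the two remaining documents are placed by comparing each against two category schemes. That division of work is what the summary credits for scalability.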
U.S. Patent    Sep. 3, 2002    Sheet 3 of 4    US 6,446,061 B1

[FIG. 3: example taxonomy tree whose nodes are labeled with word pairs. Legible labels include: nt,window; microsoft,window; corp,microsoft; applic,develop; managed,network; site,web; anti,virus; ip,switch; nt,server; ce,window; unix,workstation; compute,network.]

FIG. 3
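The node labels in FIG. 3 are comma-separated word pairs such as nt,window, which appear to be lexical affinities, that is, words co-occurring within a window of M words as described in the summary. A minimal Python sketch of counting such pairs follows. It is an assumption-laden illustration: the function name is hypothetical, the pairs are unordered, and real lexical-affinity extraction would likely add steps such as stop-word removal and stemming, which are omitted here.

```python
from collections import Counter


def lexical_affinities(words, m):
    """Count unordered pairs of distinct words that co-occur within a
    window of m consecutive words (i.e. at most m - 1 positions apart)."""
    pairs = Counter()
    for i, word in enumerate(words):
        # The window starting at position i covers words[i] .. words[i + m - 1].
        for other in words[i + 1:i + m]:
            if other != word:
                pairs[tuple(sorted((word, other)))] += 1
    return pairs


tokens = "windows nt server runs windows nt".split()
narrow = lexical_affinities(tokens, 2)  # adjacent words only
wide = lexical_affinities(tokens, 3)    # also pairs one word apart
```

Here narrow counts the pair ("nt", "windows") twice and never sees ("server", "windows"), whereas wide also counts ("server", "windows") twice. A larger M finds more candidate pairs at the cost of more comparisons per word, which is the trade-off between processing time and feature complexity that the summary mentions.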
U.S. Patent    Sep. 3, 2002    Sheet 4 of 4    US 6,446,061 B1

[FIG. 4: example taxonomy tree whose nodes are labeled with terms and names. Legible labels include: bank, NatWest Securities, comment; bank, banking, Banc One Corp.; bank; banking; fund; bank, fund, pay; Home Banking; bank, Federal Reserve Ban, International Bank; bank, banking, Home Banking.]

FIG. 4
TAXONOMY GENERATION FOR DOCUMENT COLLECTIONS

FIELD OF THE INVENTION

The present invention relates to a method within the area of information mining within a multitude of documents stored on computer systems. More particularly, the invention relates to a computerized method of generating a content taxonomy of a multitude of electronic documents.

BACKGROUND OF THE INVENTION

Organizations generate and collect large volumes of data, which they use in daily operations. Yet many companies are unable to capitalize fully on the value of this data because information implicit in the data is not easy to discern. Operational systems record transactions as they occur, day and night, and store the transaction data in files and databases. Documents are produced and placed in shared files or in repositories provided by document management systems. The growth of the Internet, and its increased worldwide acceptance as a core channel both for communication among individuals and for business operations, has multiplied the sources of information and therefore the opportunities for obtaining competitive advantages. Business Intelligence Solutions is the term that describes the processes that together are used to enable improved decision making. Information mining is the process of data mining and/or text mining. It uses advanced technology for gleaning valuable insights from these sources that enable the business user making the right business decisions and thus obtaining the competitive advantages required to thrive in today's competitive environment. Information Mining in general generates previously unknown, comprehensible, and actionable information from any source, including transactions, documents, e-mail, Web pages, and other, and using it to make crucial business decisions.

Data is the raw material. It can be a set of discrete facts about events, and in that case, it is most usefully described as structured records of transactions, and it is usually of numeric or literal type. But documents and Web pages are also a source of an unstructured data, delivered as a stream of bits which can be decodified as words and sentences of text in a certain language. Industry analysts estimate that unstructured data represent 80% of an enterprise information compared to 20% from structured data; it comprises data from different sources, such as text, image, video, and audio; text is, however, the most predominant variety of unstructured data.

The IBM Intelligent Miner Family is a set of offerings that enables the business professional and in general any knowledge worker to use the computer to generate meaningful information and useful insights from both structured data and text. Although the general problems to solve (e.g. clustering, classification) are similar for the different data types, the technology used in each case is different, because it needs to be optimized to the media involved, the user needs, and to the best use of the computing resources. For that reason, the IBM Intelligent Family is comprised of two specialized products: the IBM Intelligent Miner for Data, and the IBM Intelligent Miner for Text.

Information mining has been defined as the process of generating previously unknown, comprehensible, and actionable information from any source. This definition exposes the fundamental differences between information mining and the traditional approaches to data analysis such as query and reporting and online analytical processing (OLAP) for structured data, and from full text search for textual data. In essence, information mining is distinguished by the fact that it is aimed at the discovery of information and knowledge, without a previously formulated hypothesis. By definition, the information discovered through the mining process must have been previously unknown, that is, it is unlikely that the information could have been hypothesized in advance. For structured data, the interchangeable terms "data mining" and "knowledge discovery in databases" describe a multidisciplinary field of research that include machine learning, statistics, database technology, rule based systems, neural networks, and visualization. "Text mining" technology is also based on different approaches of the same technologies; moreover it exploits techniques of computational linguistics.

Both data mining and text mining share key concepts of knowledge extraction, such as the discovery of which features are important for clustering, that is, finding groups of similar objects that differ significantly from other objects. They also share the concept of classification, which refers to finding out to which class it belongs a certain database record, in the case of data mining, or to a document, in the case of text mining. The classification schema can be discovered automatically through clustering techniques (the machine finds the groups or clusters and assigns to each cluster a generalized title or cluster label that becomes the class name). In other cases the taxonomy can be provided by the user, and the process is called categorization.

Many of the technologies and tools developed in information mining are dedicated to the task of discovery and extraction of information or knowledge from text documents, called feature extraction. The basic pieces of information in text, such as the language of the text or company names or dates mentioned, are called features. Information extraction from unconstrained text is the extraction of the linguistic items that provide representative or otherwise relevant information about the document content. These features are used to assign documents to categories in a given scheme, group documents by subject, focus on specific parts of information within documents, or improve the quality of information retrieval systems. The extracted features can also serve as meta data about the analyzed documents. Extracting implicit data from text can be interesting for many reasons; for instance:

  to highlight important information e.g. to highlight important terms in documents. This can give a quick impression whether the document is of any interest.

  to find names of competitors e.g. when doing a case study in a certain business area one can do a names extraction on the documents that one has received from different sources and then sort them by names of competitors.

  to find and store key concepts. This could replace a text retrieval system where huge indexes are not appropriate but only a few key concepts of the underlying document collection should be stored in a database.

  to use related topics for query refinement e.g. store the key concepts found in a database and build an application for query refinement on top of it. Thus topics that are related to the users' initial queries can be suggested to help them refine their queries.

Feature extraction from texts, and the harvesting of crisp and vague information, require sophisticated knowledge models, which tend to become domain specific. A recent research prototype has been disclosed by J. Mothe, T. Dkaki, B. Dousset, "Mining Information in Order to Extract Hidden and Strategic Information", Proceedings of Computer-Assisted Information Searching on Internet, RIAO97, pp 32-51, June 1997.

A further technology of major importance in information mining is dedicated to the task of clustering of documents. Within a collection of objects a cluster could be defined as a group of objects whose members are more similar to each other than to the members of any other group. In information mining clustering is used to segment a document collection into subsets, the clusters, with the members of each cluster being similar with respect to certain interesting features. For clustering no predefined taxonomy or classification schemes are necessary. This automatic analysis of information can be used for several different purposes:

  to provide an overview of the contents of a large document collection;

  to identify hidden structures between groups of objects e.g. clustering allows that related documents are all connected by hyper links;

  to ease the process of browsing to find similar or related information e.g. to get an overview over documents;

  to detect duplicate and almost identical documents in an archive.

Typically, the goal of cluster analysis is to determine a set of clusters, or a clustering, in which the inter-cluster similarity is minimized and intra-cluster similarity is maximized. In general, there is no unique or best solution to this task. A number of different algorithms have been proposed that are more or less appropriate for different data collections and interests. Hierarchical clustering works especially well for textual data. In contrast to flat or linear clustering where the clusters have no genuine relationship, the clusters in a hierarchical approach are arranged in a clustering tree where related clusters occur in the same branch of the tree. Clustering algorithms have a long tradition. Examples and overviews of clustering algorithms may be found in M. Iwayama, T. Tokunaga, "Cluster-Based Text Categorization: A Comparison of Category Search Strategies", in: Proceedings of SIGIR 1995, pp 273-280, July 1995, ACM, or in Y. Maarek, A. J. Wecker, "The Librarian's Assistant: Automatically organizing on-line books into dynamic bookshelves", in: Proceedings of RIAO '94, Intelligent Multimedia, IR Systems and Management, NY, 1994.

A further technology of major importance in information mining is dedicated to the task of categorization of documents. In general, to categorize objects means to assign them to predefined categories or classes from a taxonomy. The categories may be overlapping or distinct, depending on the domain of interest. For text mining, categorization can mean to assign categories to documents or to organize documents with respect to a predefined organization. Categorization in the context of text mining means to assign documents to preexisting categories, sometimes called topics or themes. The categories are chosen to match the intended use of the collection and have to be trained beforehand. By assigning documents to categories, text mining can help to organize them. While categorization cannot replace the kind of cataloging a librarian does, it provides a much less expensive alternative.

State of the art technologies for taxonomy generation suffer several deficiencies, like:

  the problem of navigational balance: the taxonomy must be well-balanced for navigation by an end-user. In particular, the fan-out at each level of the hierarchy must be limited, the depth must be limited, and there must not be empty nodes.

  the problem of orientation: nodes in the taxonomy should reflect "concepts" and give sufficient orientation for a user traversing the taxonomy.

  the problem of coherence and selectivity: the leaf nodes in the taxonomy should be maximally coherent with all assigned documents having the same thematic content. Related documents from different nodes should appear within short distance in the taxonomy structure.

The most important problems of the current state of the art technologies for taxonomy generation are

  the problem of scalability: any document of the collection must be assigned to some leaf node in the taxonomy and the whole taxonomy generation process must be applicable to a significantly larger number of documents and still being able to generate a taxonomy within a reasonable amount of time.

  the problem of domain-independence: no hand-coded knowledge on the domain to be analyzed derived from an analysis of the given document collection to steer and to speed up the taxonomy generation process should be used.

OBJECTIVE OF THE INVENTION

The invention is based on the objective to improve the scalability of an taxonomy generation process allowing a taxonomy generation method to cope with increasing numbers of documents to be analyzed in a reasonable amount of time. It is a further objective of the current invention to improve said scalability and at the same time to guarantee domain-independence of the taxonomy generation method.

SUMMARY OF THE INVENTION

The current invention teaches a method of generating a content taxonomy of a multitude of documents (210) stored on a computer system and said method being executable by a computer system. The fundamental approach of the current invention comprises a subset-selection-step (201), wherein a subset of said multitude of documents is being selected. In a taxonomy-generation-step (202 to 205) a taxonomy is generated for that selected subset of documents, said taxonomy being a tree-structured taxonomy-hierarchy. Said subset is divided into a set of clusters with largest intra-similarity and each of said clusters of largest intra-similarity is assigned to a leaf-node of said taxonomy-hierarchy as outer cluster. The inner-nodes of said taxonomy-hierarchy are ordering said subset, starting with said outer clusters, into inner-clusters with increasing cluster-size and decreasing similarity. Moreover said method comprises a routing selection-step (206), wherein for each unprocessed document of said multitude of documents not belonging to said subset its similarities with said outer-clusters are computed and said document is assigned to a leaf-node of said taxonomy-hierarchy comprising the outer-cluster with largest similarity.

The technique proposed by the current invention is able to improve at the same time the scalability and the coherence and selectivity of taxonomy generation. Scalability is provided as the taxonomy generation step, being the most time consuming part of the overall process, is operating on the selected subset of documents only. This approach alone would not be sufficient for solving the overall problem. It's due to the rest of the features of the claim 1 that the taxonomy of a reasonable selected and reasonable sized subset of documents is already a stable taxonomy with respect to the complete multitude of documents. The introduction of a separate routing selection step allows the mass of the documents to be assigned very efficiently in an already computed taxonomy. By exploiting a hierarchical taxonomy approach the leaf nodes in the taxonomy are coherent with
all assigned documents having the same thematic content and related documents from different nodes appear within short distance in the taxonomy structure. The taxonomy generated according the current invention is very stable, i.e. increasing the size of reasonable sized subset of documents will not change the taxonomy in any essential manner. Moreover the proposed method is completely domain independent, i.e. no hand-coded knowledge on the domain to be analyzed derived from an analysis of the given document collection is required to steer and to speed up the taxonomy generation process. As a result the complete taxonomy generation process is fully automatic and does not require any human intervention or adaptation.

Additional advantages are accomplished by the aspect that said taxonomy-generation-step comprises a first feature-extraction-step (202) extracting for each document of said subset its features and computing its feature statistics in a feature vector (212) as a representation of said document.

Introducing a distinct feature extraction step increases flexibility of the proposed method as it becomes possible to exploit different feature extraction technologies depending on the intended purpose of the taxonomy, depending on the document domain and depending on the characteristics of the various feature extraction technologies. Storing the time consuming computation of the feature-vectors speeds up processing as the feature-vectors can be used again in later processing steps.

Additional advantages are accomplished by the aspect that said taxonomy-generation-step comprises a clustering step (203) using a hierarchical clustering algorithm to generate said taxonomy-hierarchy and using said feature vectors for determining similarity.

Using the hierarchical clustering algorithm, working bottom-up, i.e. which starts with clusters comprising a single document and then working upwards by merging clusters until the root clusters has been generated, guarantees good coherence and selectivity of the taxonomy. Moreover the [...]

[...] very effective as only two feature-vectors have to be compared, the feature-vector of the recently processed document and the category-scheme.

Based on above approach similarity calculations of an unprocessed document with respect to an outer-cluster is very effective as only two feature-vectors have to be compared, the feature-vector of the unprocessed document and the category-scheme.

Additional advantages are accomplished by the aspect that said first-feature-extraction-step and/or said second feature-extraction-step extract features based on lexical affinities within said documents.

Exploiting lexical affinity technology allows the proposed method to determine (in a domain independent manner) multi-word phrases which have a much higher semantic meaning compared to the single terms. Thus orientation for the users is improved as the taxonomy is able to reflect "concepts".

Additional advantages are accomplished by the aspect that said first-feature-extraction-step and/or said second feature-extraction-step extract features based on linguistic features within said documents.

Exploiting linguistic features technology allows the proposed method to determine (in a domain independent manner) names of people, organizations, locations, domain terms (multi-word terms) and other significant phrases from the text. Different variants are associated with a single canonical form. Thus in cases where "names" are of importance the proposed feature improves orientation and selectivity for the users.

Additional advantages are accomplished by the aspect that said lexical affinities are extracted with a window of M words to identify co-occurring words.

Allowing to adjust the window size to determine lexical affinities gives the freedom to control processing time versus complexity of extracted features. M being a natural number with 1