Knowledge Discovery from Text Documents Based on Paragraph Maps

Proceedings of the 33rd Hawaii International Conference on System Sciences - 2000

Ari Visa, Jarmo Toivonen, Piia Ruokonen
Lappeenranta University of Technology
P.O. Box 20, FIN-53851 Lappeenranta, Finland
{Ari.Visa, Jarmo.Toivonen, [email protected]

Hannu Vanharanta
Pori School of Technology and Economics, Tampere University of Technology
Bredantie 28-30 C 6, FIN-02700 Kauniainen, Finland
[email protected]

Barbro Back
Åbo Akademi University
Lemminkäisenkatu 14 A, FIN-20520 Turku, Finland
[email protected]

Abstract

In law, physics, business, and many other fields there are large numbers of documents, and the organisation of these documents is essential. The right way of organising the documents reveals a great deal about their information content. Text documents are commonly characterised and classified by keywords, which are usually defined by the authors. Nowadays there exists a tremendous amount of uncharacterised text documents, both on the Internet and in old paper-based archives. It is important that this information can be managed and the knowledge retrieved, preferably without reading the documents. We propose a new technology based on multilevel hierarchies; here we concentrate only on the highest level. The technology is based on a hierarchy of Self-Organizing Maps (SOM) and on a smart encoding of words. Our experiment with a text document (an annual report) shows that it is possible to distinguish between different types of paragraphs. It is possible to distinguish an original paragraph from one containing the same words in random order. It is also possible to categorise paragraphs or, for instance, to find all unauthorised citations of paragraphs within a long text document. The only requirement is that a considerable amount of text documents be available for the training process. Finally, text documents can be classified based on the trained types of paragraphs. This means that unknown documents can be categorised without reading them. This facility can be called knowledge discovery.

1 Introduction

Nowadays, a substantial amount of important text documents exists on the Internet or in various databases. These documents are seldom characterised with keywords or in any other way. However, it is necessary that the documents can be clustered and organised in a reasonable way, because this makes it possible to retrieve the knowledge from them. Many research groups are addressing this problem [13], [2], [5], etc. The problem has been approached through automata theory, grammars, and language theories; in some applications rule-based systems have also been useful. Neural network methods offer an associative approach to the problem. This is the case for neural networks that use competitive learning, for instance the Self-Organizing Map (SOM) [3, 4]. The only demand is that a large quantity of text documents is available. The main ideas of how to use the SOM were published by Ritter and Kohonen [12]. Nowadays there are several SOM-based attempts at different applications, mainly browsing, information retrieval, and data mining [13, 14, 10, 7, 8, 9, 2, 5, 1, 15]. The difficulty in using the SOM lies in how to construct numerical representations for documents that contain the relevant information on their contents. A rather commonly used method in text retrieval is to encode a document as a histogram of its words. In the word histogram representation the information on relative word order in the document is lost, but efficiency of representation is gained. In large document collections the vocabularies may also become prohibitively large.
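For reference, a minimal sketch of the conventional bag-of-words encoding discussed above (this is the baseline the paper contrasts with, not the paper's own method; the vocabulary and tokenisation here are illustrative assumptions):

```python
from collections import Counter

def word_histogram(document: str, vocabulary: list[str]) -> list[int]:
    # Bag-of-words encoding: one bin per vocabulary word.
    # Relative word order is lost, which is the weakness noted above,
    # and the vector length grows with the vocabulary.
    counts = Counter(document.lower().split())
    return [counts[w] for w in vocabulary]

# Two documents with the same words in different order
# produce identical histograms.
vocab = ["dividend", "increase", "percent", "quarterly"]
a = word_histogram("quarterly dividend increase of 16 percent", vocab)
b = word_histogram("increase of 16 percent quarterly dividend", vocab)
assert a == b
```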


In WEBSOM [2, 5], word category histograms are used. The words are clustered with the so-called "self-organizing semantic maps" [12]. These maps use the statistics of the textual contexts of words to provide information on their relations. The size of the word histograms can thus be reduced, and at the same time the semantic similarity of the words is expressed in the closeness of the word categories on the semantic map. This closeness can be taken into account in the encoding of the documents. The approach is suitable for browsing purposes, but in cases where the vocabulary is large and almost unlimited the accuracy is not sufficient. In this article, the problem of how to compare, classify, and search documents based on their contents is first discussed and formulated. The new method is then described. Finally, the results and the differences from other approaches are discussed.

2 Theory

In many fields, e.g. law, physics, and business, there are lots of documents, and many users are interested in comparing, classifying, and searching them. The actual vocabularies of the documents are usually large and specific. At the same time the vocabularies are open, which means that new terms or words may appear from time to time.

Let us define the problem, first at the word level. Assume a set of words W and a subset of words, a dictionary, W_N with W_N ⊂ W. A word w usually belongs to the dictionary but that is not necessary, i.e. {w | w ∈ W ∧ w ∉ W_N}. The new words are projected onto the known words in W_N. This formulation makes it difficult to use grammars or rules. To achieve adaptability to different vocabularies, i.e. application fields, and to cope with the fact that new words may appear, the dictionary W_N is created by unsupervised learning, which is a kind of clustering process. The only demand is that a large amount of representative text documents is available.

At the sentence and paragraph levels the problem is similar. The successive words w_i, w_{i+1}, w_{i+2}, ..., w_{i+n} define a set of sentences S. This set may be very large. The successive sentences s_i, s_{i+1}, s_{i+2}, ..., s_{i+nn} similarly define a set of paragraphs P, which is also very large. Now we should estimate these sets S and P by subsets S_N and P_N. Some unsupervised learning methods are capable of estimating the relation between succeeding units; one way of doing this is to use Self-Organizing Maps for the clustering. The set W_N and the relation sets S_N and P_N are used to characterise the text document at the word, sentence, and paragraph levels.

3 Proposed solution

The proposed new technology is based on a hierarchy of Self-Organizing Maps (SOM) and on a smart encoding of words. The original text is preprocessed, i.e. compound words are united into one word, numbers are rounded, etc. The filtered text is then translated into a form suitable for clustering. This is done by encoding. The encoding of words is a wide subject and there are several approaches to it: 1) The word is recognised and replaced with a code. This approach is sensitive to new words. 2) Succeeding words are replaced with a code. This method is language sensitive. 3) Each word is analysed character by character, and based on the characters a key entry to a code table is calculated. This approach is sensitive to capital letters and conjugation if the code table is not arranged in a special way. We chose the last alternative, because it is accurate and suitable for statistical analysis. A word w is transformed into a number in the following manner:

    y = \sum_{i=0}^{L-1} c_i \, 2^{4(L-i)}    (1)

where L is the number of characters in the word and c_i is a character within the word w. The word w is thus assigned a value, a word vector Y, by a tabulated function. The table has a size of N × M, where N is the length of the table and M is the length of the word vector. Now

    Y = f(y) \bmod P    (2)

where P is a suitable prime and N < P < W_N. The table consists of Gray-coded binary numbers. The idea is that similar words get the same word vector, and words that resemble each other get word vectors that are close to each other. Note that the relation is not a unique one. The actual code in the table is produced from the binary code in an iterative manner. Note that x < N:

    y_1 = x_1
    y_2 = x_1 \oplus x_2
    y_3 = x_2 \oplus x_3
    ...
    y_N = x_{N-1} \oplus x_N    (3)

where \oplus is a logical exclusive-or operation. A word w is thus associated with a word vector Y.
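A minimal sketch of this encoding step, assuming ASCII codes for the characters c_i and a hypothetical prime P; the paper does not publish its code table, so the Gray coding of Formula (3) is shown in its standard bitwise form:

```python
def word_to_number(word: str) -> int:
    # Formula (1): each character is weighted by a four-bit shift
    # depending on its position in the word.
    L = len(word)
    return sum(ord(c) * 2 ** (4 * (L - i)) for i, c in enumerate(word))

def gray(x: int) -> int:
    # Formula (3) as a bitwise operation: y_1 = x_1, y_k = x_{k-1} XOR x_k,
    # which for a binary integer is simply x XOR (x >> 1).
    return x ^ (x >> 1)

P = 251  # a suitable prime; the actual value is not given in the paper

def table_key(word: str) -> int:
    # Formula (2): the number y selects a Gray-coded entry of the table.
    return gray(word_to_number(word) % P)

# Example: compute table keys for two words.
print(table_key("dividend"), table_key("dividends"))
```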

The word vectors Y are clustered by the Self-Organizing Feature Map [3]:

    n_i(t+1) = n_i(t) + \alpha(t) [Y(t) - n_i(t)],  for i \in N_c
    n_i(t+1) = n_i(t),  for i \notin N_c    (4)


where n_i is an element of the word map, N_c is a neighbourhood of the best matching element, and \alpha(t) is a decreasing function with the following properties:

    \sum_{t=1}^{\infty} \alpha(t) = \infty,  \sum_{t=1}^{\infty} \alpha^2(t) < \infty    (5)

The neighbourhood N_c should also have a decreasing behaviour as a function of time t. The clustering can also be achieved by other algorithms. As a result of the clustering process a word map of W_N elements is created. The whole process of creating a word map is illustrated in Figure 1.

In the second step the filtered text is encoded word by word. A K-gram, i.e. a small neighbourhood of K encoded words, is taken as input to a Self-Organizing process. The small neighbourhood glides step by step over the sentence. The words are translated into the input vectors to the SOM, the sentence vectors S (illustrated in Figure 2). The SOM produces a sentence map of size S_N, see Figure 3. All the training documents are processed in a similar manner.

The SOM has so far been used twice, first to produce a word map and then to produce a sentence map. The text documents are first encoded by the word map and then processed by the sentence map. Each sentence vector S is compared with the sentence map, the best match is determined according to Formula (6), and the sentence is replaced with the corresponding address on the sentence map:

    ||S - s_c|| = \min_i ||S - s_i||    (6)

where s_i is an element of the sentence map.
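A compact sketch of one SOM training step (Formula (4)) together with the best-match search of Formula (6); the harmonic learning-rate schedule and the shrinking radius are illustrative choices that satisfy the conditions of Formula (5), not values from the paper:

```python
import numpy as np

def som_step(nodes, coords, y, t, alpha0=0.5, radius0=3.0, tau=1000.0):
    """One update of Formula (4) for input vector y at time step t.

    nodes:  (units, dim) array of model vectors n_i
    coords: (units, 2) array of grid positions of the map units
    """
    alpha = alpha0 / (1.0 + t)                    # alpha(t): sum diverges, sum of squares converges, cf. (5)
    radius = max(1.0, radius0 / (1.0 + t / tau))  # shrinking neighbourhood N_c
    # Best matching unit, as in Formula (6).
    bmu = int(np.argmin(np.linalg.norm(nodes - y, axis=1)))
    # Units inside the neighbourhood move towards y; the rest stay put.
    in_nc = np.linalg.norm(coords - coords[bmu], axis=1) <= radius
    nodes[in_nc] += alpha * (y - nodes[in_nc])
    return nodes

# Usage sketch for the word map: nodes = np.random.rand(52 * 40, dim),
# coords = the grid positions of a 52*40 lattice.
```

The same update rule is reused for all three levels of the hierarchy: the word map, the sentence map, and the paragraph map.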

In the third step the document is considered paragraph by paragraph as fixed-length K-grams, i.e. K is 30. These K-grams, consisting of addresses on the sentence map, are considered as paragraph vectors C and clustered by the SOM. The process is illustrated in Figure 4. The created map is now called a paragraph map, and its size is P_N. The paragraphs are processed in a similar way as the sentences: a window of K sentence-map addresses glides over the paragraph, and each K-gram is replaced with the address of the best matching unit on the paragraph map, see Formula (7). Now a histogram A corresponding to the best matching elements is collected. The best match is determined by

    ||C - c_c|| = \min_i ||C - c_i||    (7)

where c_i is an element of the paragraph map. The histogram A, consisting of P_N bins, is normalised by the sentence count of the paragraph. It is then possible to compare, search, or classify text paragraphs, as can be seen in Figure 5. If the histogram A is instead normalised by the paragraph count of the document, it is possible to compare, search, and classify whole text documents. Note that it is not necessary to know anything about the actual text document to do this. It is sufficient to give one document as a prototype. The technology returns all similar documents, assigns a number to the difference, or clusters similar documents, all based on the given prototype document.
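A sketch of the histogram collection under these definitions; only K = 30 and the normalisation are stated above, so the window stride and the treatment of sentence-map addresses as plain unit indices are assumptions:

```python
import numpy as np

def paragraph_histogram(sentence_addresses, paragraph_map, K=30):
    """Slide a K-gram window over a paragraph's sentence-map addresses,
    find the best matching paragraph-map unit for each window (Formula (7)),
    and collect the histogram A normalised by the sentence count.

    paragraph_map: (P_N_units, K) array of model vectors c_i
    """
    hist = np.zeros(len(paragraph_map))
    n = len(sentence_addresses)
    # Paragraphs shorter than K yield no windows and an all-zero histogram.
    windows = [sentence_addresses[i:i + K] for i in range(n - K + 1)]
    for w in windows:
        c = np.asarray(w, dtype=float)  # paragraph vector C
        bmu = int(np.argmin(np.linalg.norm(paragraph_map - c, axis=1)))
        hist[bmu] += 1                  # one hit per best matching unit
    return hist / max(n, 1)             # normalise by the sentence count
```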

4 Experiments

The experiments are divided into two parts: first, to demonstrate the possibilities of a paragraph map, and second, to demonstrate the functionality in a real case, an annual report. In these experiments the size of the word map W_N is 52*40, the size of the sentence map S_N is 20*15, and the size of the paragraph map P_N is 13*10. These sizes were determined experimentally; note that the sizes should be checked from problem to problem. Levenshtein metrics is used for the distance calculations, as it is advantageous in histogram comparisons.

In the first case we demonstrate that the method is capable of distinguishing between paragraphs. For that purpose, one paragraph (paragraph 1 in Figure 8) from the 1995 annual report of Repap Enterprises is selected. First the paragraph is processed by the paragraph map; the resulting histogram can be seen in Figure 6. Then the words of the paragraph are mixed into random order, and the mixed paragraph is reprocessed by the paragraph map. The result is seen in Figure 7. The results are different, and they can also be considered statistically different at the 96% confidence level. This judgement is based on a modification of the Lilliefors method [11]. Note that no word map or sentence map has been analysed; the judgement is based only on the paragraph map. The original paragraph is naturally recognised independently of its place in the document, and small modifications, i.e. replacement of some words, do not disturb this fact.

Second, 11 paragraphs are selected from the Repap Enterprises annual report. The paragraphs are shown in Figure 8. They are processed by the paragraph map, and the results are represented as a confusion matrix of distances, i.e. differences, between the paragraphs in Table 1. It can be seen that the contents of the first and the ninth paragraphs resemble each other, while the content of the first paragraph differs from the content of the eighth paragraph. Succeeding paragraphs often resemble each other.
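For reference, the classic dynamic-programming Levenshtein distance [6]; the paper applies a Levenshtein-style metric to the paragraph histograms, and since its exact adaptation to real-valued bins is not spelled out, treating the inputs as plain sequences here is an assumption:

```python
def levenshtein(a, b):
    """Edit distance between sequences a and b: the minimum number of
    insertions, deletions, and substitutions turning a into b."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                     # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j                     # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]
```

The distances in Table 1 are of this kind, computed between the normalised paragraph histograms.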


5 Discussion

Knowledge discovery from text documents is a difficult task for a computer; computers are not yet capable of reaching the same level as human beings. It is common that


text documents are characterised with some keywords, and the clustering or the classification is based on these keywords. Our approach is different: it is a multilevel one, consisting of word, sentence, and paragraph level considerations. In this paper only the paragraph level is reported. In our approach the search, the comparison, and the classification are also based on a considerable number of keywords. However, these keywords are extracted automatically and combined into "typical" sentences and paragraphs. The keywords are learnt in an unsupervised way from the given documents. This is a big advantage compared with the traditional methods. The competitive learning based neural networks make this possible. Neural networks also make it possible to adapt the proposed solution to different user groups or application fields. The smart word encoding and the statistical nature of the competitive learning based neural networks also make our approach language independent, which is a clear advantage. The relations between the words and the sentences make it possible to cluster paragraphs and documents. This paper concentrates mainly on paragraphs, hence the validation of the results is easier. The search, the comparison, and the classification are based on one desired prototype document; the similar documents, the distances, or the groupings are returned. This is the reason why we claim that our technology does knowledge discovery from text documents.

The size of the word map, the vocabulary W_N, is chosen by the needs of the user group. However, it is not necessary that the size of the vocabulary W_N in the proposed approach be as large as in conventional histogram-based classification cases. The suitable size for W_N is found by experiments. The same procedure is also used to find the size of the sentence map S_N and the size of the paragraph map P_N. One should, however, observe that it is useful to assume the following relation: P_N < S_N < W_N. This is a strong assumption.

The classification in our approach is based on the comparison between the histograms. The procedure is one form of vector quantization. The selected distance metric is a delicate matter; several alternatives have been considered. Currently, Levenshtein metrics is used in the proposed approach. Levenshtein metrics is a common metric used in the comparison of long strings and histograms [6].

The fact that the proposed approach can handle words that are not included in the dictionary is a big advantage. In many fields the development is rapid and new terms are often introduced. The encoding, the unsupervised way of creating the prototypes, and the classification based on the distance metric make it possible to consider new words easily. The methodology also makes the proposed approach almost independent of language. However, the word encoding is crucial to the success of the whole process. It is very important that the encoding and the rest of the

method have a high degree of compatibility. The encoding depends on the problem and should be checked or selected for each specific case.

The validation of the results is a difficult task. It seems that succeeding paragraphs resemble each other, but this is not always true. The same tendency can be seen in our annual reports. It is not rare that the results based on word, sentence, and paragraph maps contradict each other. It is extremely important to use preclassified material for the verification of the results.

The histogram approach makes it possible to gain accuracy and speed in comparison. At the same time the information related to successive words is usually lost. This drawback is avoided in the proposed approach, where one gathers the histograms over the paragraph vectors. One drawback of the proposed method is the use of K-grams in sentence and paragraph analysis: the K-gram approach involves many truncation problems. Our research has developed the proposed method in such a way that K-grams can be avoided both in sentence and paragraph analysis, but those results are not reported here. Our research is now focusing on the merging of the word, sentence, and paragraph maps and on the use of this methodology in knowledge discovery.

6 Acknowledgements

The financial support of TEKES (grant number 40887/97) is gratefully acknowledged.


References

[1] B. Back, H. Vanharanta, A. Visa, and J. Toivonen. Toward Computer Aided Analysis of Text in Annual Reports. In Proc. of ECAIS'99, European Conference on Accounting Information Systems, Bordeaux, France, 1999. To be published.
[2] T. Honkela, S. Kaski, K. Lagus, and T. Kohonen. Newsgroup Exploration with WEBSOM Method and Browsing Interface. Technical Report A32, Helsinki University of Technology, 1996.
[3] T. Kohonen. Self-Organized Formation of Topologically Correct Feature Maps. Biological Cybernetics, 43:59-69, 1982.
[4] T. Kohonen. Self-Organizing Maps, volume 30 of Series in Information Sciences. Springer-Verlag, Heidelberg, 2nd edition, 1997.
[5] T. Kohonen, S. Kaski, K. Lagus, and T. Honkela. Very Large Two-Level SOM for the Browsing of Newsgroups. In Proc. of ICANN'96, International Conference on Artificial Neural Networks, pages 269-274. Springer, 1996.
[6] V. I. Levenshtein. Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady, 8(10):707-710, 1966.
[7] X. Lin, D. Soergel, and G. Marchionini. A Self-Organizing Semantic Map for Information Retrieval. In Proc. of the 14th Ann. Int. ACM/SIGIR Conf. on R&D in Information Retrieval, pages 262-269, 1991.
[8] B. Martín-del-Brío and C. Serrano-Cinca. Self-Organizing Neural Networks for the Analysis and Representation of Data: Some Financial Cases. Neural Computing & Applications, 1(3):193-206, 1993.
[9] B. Martín-del-Brío and C. Serrano-Cinca. Self-Organizing Neural Networks: The Financial State of Spanish Companies. In A. Refenes, editor, Neural Networks in the Capital Markets. John Wiley & Sons, New York, 1995.
[10] D. Merkl, A. Tjoa, and G. Kappel. A Self-Organizing Map that Learns the Semantic Similarity of Reusable Software Components. In Proc. of ACNN'94, 5th Australian Conf. on Neural Networks, pages 13-16, 1994.
[11] J. S. Milton and J. C. Arnold. Probability and Statistics in the Engineering and Computing Sciences. McGraw-Hill, New York, 2nd edition, 1987.
[12] H. Ritter and T. Kohonen. Self-Organizing Semantic Maps. Biological Cybernetics, 61(4):241-254, 1989.
[13] J. C. Scholtes. Unsupervised Learning and the Information Retrieval Problem. In Proc. of IJCNN'91, Int. Joint Conf. on Neural Networks, volume I, pages 95-100, Piscataway, NJ, 1991. IEEE Service Center.
[14] A. Ultsch. Knowledge Acquisition with Self-Organizing Neural Networks. In I. Aleksander and J. Taylor, editors, Artificial Neural Networks, 2, volume I, pages 735-738, Amsterdam, Netherlands, 1992. North-Holland.
[15] A. Visa, J. Toivonen, B. Back, and H. Vanharanta. Toward Text Understanding - Comparison of Text Documents by Sentence Map. In Proc. of EUFIT'99, European Congress on Intelligent Techniques and Soft Computing, 1999.


Figure 1. Creating the word map. (The original text from all documents is filtered, the words are encoded into word vectors Y, and SOM_PAK clusters all word vectors into the word map W.)

Figure 2. Creating the sentence vectors from the encoded words. (Each word is replaced by its word vector Y_i and its word map unit number; the consecutive unit numbers form the sentence vectors S_1, S_2, ...)

Figure 3. The process of creating a sentence map. (The filtered text is encoded with the word map W into sentence vectors S, and SOM_PAK clusters all sentence vectors into the sentence map.)


Figure 4. The process of creating a paragraph map. (The filtered text is encoded with the word map W and the sentence map S into paragraph vectors C, and SOM_PAK clusters all paragraph vectors into the paragraph map P.)

Figure 5. Comparing and classifying paragraphs based on the extracted paragraph histograms. (Two histograms, number of hits per map node, are compared by Levenshtein distance.)


Figure 6. A paragraph processed by the paragraph map. (The labeled paragraph map, the text of the original paragraph, and the histogram of the original paragraph, number of hits per map node.)

Figure 7. The same paragraph processed by the paragraph map, but with the words mixed into random order. (The labeled paragraph map, the text of the mixed paragraph, and the histogram of the mixed paragraph.)

      1         2         3         4         5         6         7         8         9         10        11
1     0         0.402680  0.490630  0.360184  0.455891  0.377417  0.432828  1.331780  0.322734  0.513050  0.564370
2     0.402680  0         0.324610  0.252349  0.299686  0.492949  0.376951  0.946231  0.292787  0.341271  0.289462
3     0.490630  0.324610  0         0.171817  0.286951  0.290550  0.248050  1.179650  0.297791  0.314119  0.363189
4     0.360184  0.252349  0.171817  0         0.353957  0.283710  0.280889  1.157530  0.289592  0.348074  0.389062
5     0.455891  0.299686  0.286951  0.353957  0         0.343657  0.369582  0.893041  0.312570  0.209892  0.253063
6     0.377417  0.492949  0.290550  0.283710  0.343657  0         0.375874  1.276930  0.457142  0.445964  0.503248
7     0.432828  0.376951  0.248050  0.280889  0.369582  0.375874  0         1.307990  0.260735  0.320877  0.321729
8     1.331780  0.946231  1.179650  1.157530  0.893041  1.276930  1.307990  0         0.980411  0.688229  0.773381
9     0.322734  0.292787  0.297791  0.289592  0.312570  0.457142  0.260735  0.980411  0         0.297793  0.180347
10    0.513050  0.341271  0.314119  0.348074  0.209892  0.445964  0.320877  0.688229  0.297793  0         0.271380
11    0.564370  0.289462  0.363189  0.389062  0.253063  0.503248  0.321729  0.773381  0.180347  0.271380  0

Table 1. A confusion matrix showing the Levenshtein distances between the paragraphs.


Figure 8. The paragraphs of the test report.
