Instance-Based Learning and Information Extraction for the Generation of Metadata

Andreas D. Lattner
(Center for Computing Technologies – TZI, University of Bremen, Germany
[email protected])

Otthein Herzog
(Center for Computing Technologies – TZI, University of Bremen, Germany
[email protected])

3rd International Conference on Knowledge Management (I-KNOW ’03), Graz, Austria, July 2-4, 2003, pp. 472-479.

Abstract: Knowledge Management has recently become popular in enterprises hoping to achieve a competitive advantage. Describing information by metadata allows for performing detailed queries uniformly over information items from different information sources and enables goal-directed search and automatic provision of relevant information. As the manual acquisition of metadata is very costly, support for this task is desired. This work presents two metadata extractors for the creation of metadata. The first applies Instance-Based Learning for the adoption of metadata from similar objects. The second extracts information by applying regular expressions. Both extractors have been integrated into the metadata generation framework of the KnowWork system and have been evaluated in experiments on two real-world data sets from the engineering domain.

Key Words: Metadata Generation, Instance-Based Learning, Information Extraction, Knowledge Management
Category: H.3.1

1 Introduction

Knowledge Management has recently become popular in enterprises hoping to achieve a competitive advantage. Having uniform access to all information within a company allows the users to find information faster. Information can be structured by setting up an ontology for existing information items. An ontology is the “explicit specification of a conceptualization” [Gruber 1993]. All information classes and their properties can be defined in an ontology. Instances of these classes are representations of business information objects (e.g., a bill of material of a certain product). Their properties – attributes and associations to other objects – are described by metadata. Metadata allows for performing detailed queries uniformly over information items from different information sources. It enables goal-directed search and automatic provision of relevant information.

But how does the metadata get into the system? The manual acquisition of metadata is very costly. As the users see no direct benefit, the motivation for entering metadata is usually quite low. Therefore, support by semi-automated metadata generation is needed to overcome this situation. Semi-automated means that the created metadata has to be understood as a suggestion, which should be verified by the user. As no metadata generation approach will provide perfect metadata, there is a trade-off between checking the created metadata and letting erroneous data into the system.

2 Related Work

There are many different fields related to metadata generation. If unstructured documents have to be processed, text classification or information extraction can be used for the creation of metadata (e.g., [Yang and Liu 1999, Hobbs et al. 1996]). In both cases one can distinguish between automated and manual approaches. [Sebastiani 2002] gives a good survey of machine learning in the area of automated text categorization. Examples of learning information extraction rules can be found in [Soderland 1999] and [Junker et al. 1999]. Inter- and intranet web pages may also be enhanced by metadata, e.g., in the Semantic Web context. Various approaches treat the classification of web pages, e.g., [Pierre 2001], metadata generation for web pages [Jenkins et al. 1999, Stuckenschmidt and van Harmelen 2001], or the creation of knowledge bases from the World Wide Web [Craven et al. 2000]. If metadata has to be created for databases, other approaches can be applied. Database contents can be adopted directly via database wrappers or can be mapped to the defined vocabulary of an appropriate ontology (e.g., [Tork Roth and Schwarz 1997, Bergamaschi et al. 1999]). The technologies to be applied strongly depend on the information sources managed by the system.

[Figure 1 shows an example ontology: a Person class (Name, Phone, …) linked via an “authors” association to a Document class (Creation date, URI, …), with Change Report (Software variant, Area of validity, …) and Travel Application (Document-ID, Checked by, …) as subclasses of Document. The attributes are connected to metadata generation extractors 1 to n.]

Figure 1: Mapping between attributes and metadata generation extractors


[Figure 2 shows a new document receiving the annotation “Contact person: Mrs. Green”, adopted from its most similar annotated documents, where “Mrs. Green” outweighs a neighbor annotated with “Contact person: Mr. Blue”.]

Figure 2: Adoption of metadata from the k-nearest neighbors

3 Metadata Generation with KnowWork

The metadata generation framework MetaGen is a module of the KnowWork system [Tönshoff et al. 2001] and has been introduced in [Lattner and Apitz 2002]. The KnowWork system allows for managing information classes and information items within its domain model, an ontology representation. Information items can be described by attributes and linked to other items via metadata. With the metadata generation framework it is possible to create metadata for arbitrary information items. Its flexible structure allows for integrating metadata extraction modules as needed by implementing an extractor interface. The different extractors can be connected to all defined attributes for creating metadata. Fig. 1 illustrates the mapping between metadata generation extractors and attributes from different information types. Two extractors have been implemented and evaluated so far: the TextSimilarityExtractor and the RegExExtractor. Both are briefly described in the next two subsections.

3.1 TextSimilarityExtractor

The TextSimilarityExtractor uses an instance-based learning approach for the creation of metadata. It adopts metadata from the k-nearest neighbors (k-nn) by applying a similarity measure based on text content. An example is given in Fig. 2. If values for attribute a of object o are to be created, the following steps are performed (a code sketch follows the list):

• Collect the k-nearest neighbors (default setting is k = 5) of the object o which also have the attribute a, i.e., which are instances of the class (or one of its subclasses) where the attribute a is defined.
• Collect and count all values for attribute a of the neighbors n1, n2, …, nk.
• Take over the values:
  – If a is a single-valued attribute, take the value with the most appearances as created metadata.
  – If a is set-valued, take the l most frequent values as created metadata, where l is the average number of values for attribute a over the k neighbors.
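A minimal sketch of this voting scheme, assuming neighbors are given as dictionaries mapping attribute names to value lists and are already sorted by descending similarity; the function name and the rounding of l are our own assumptions, not taken from the paper's implementation:

```python
from collections import Counter

def adopt_values(neighbors, attribute, set_valued, k=5):
    """Adopt metadata values for one attribute from the k nearest
    neighbors, following the voting scheme described above."""
    # Keep only neighbors that actually carry the attribute.
    candidates = [n for n in neighbors if attribute in n][:k]
    if not candidates:
        return []

    # Count how often each value occurs among the neighbors.
    counts = Counter(v for n in candidates for v in n[attribute])
    if not counts:
        return []

    if not set_valued:
        # Single-valued: take the value with the most appearances.
        return [counts.most_common(1)[0][0]]

    # Set-valued: take the l most frequent values, where l is the
    # average number of values per neighbor (rounded, by assumption).
    l = round(sum(len(n[attribute]) for n in candidates) / len(candidates))
    return [value for value, _ in counts.most_common(l)]
```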


In the case of the TextSimilarityExtractor the similarity measure is computed from text documents in vector representations. Our implementation uses the mindaccess SDK, a software development kit for integrating the mindaccess system. mindaccess features, among others, search and classification techniques for text documents [insiders 2002]. All documents d_j are represented by their term vectors $d_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$. The utilized term-weighting strategy follows the TF-IDF scheme. The similarity between two documents d_1 and d_2 is computed as the cosine of the angle between their vectors (cf. [Baeza-Yates and Ribeiro-Neto 1999]):

\[
\mathrm{sim}(d_1, d_2) = \frac{\vec{d}_1 \cdot \vec{d}_2}{|\vec{d}_1| \times |\vec{d}_2|} = \frac{\sum_{i=1}^{t} w_{i,1} \times w_{i,2}}{\sqrt{\sum_{i=1}^{t} w_{i,1}^{2}} \times \sqrt{\sum_{i=1}^{t} w_{i,2}^{2}}}
\]
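This computation can be mirrored in a small self-contained sketch. Note that the actual system delegates this to the mindaccess SDK; the simple tf × log(N/df) weighting below is only one common TF-IDF variant, assumed here for illustration:

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build sparse TF-IDF term vectors for a list of token lists,
    using a simple tf * log(N/df) weighting (an assumed variant)."""
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def sim(d1, d2):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(w * d2.get(t, 0.0) for t, w in d1.items())
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)
```

The k-nearest neighbors of a new document are then simply the k annotated documents with the highest sim values.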

3.2 RegExExtractor

With the RegExExtractor, regular expressions for information extraction from texts can be defined. It uses the Jakarta ORO package (http://jakarta.apache.org/oro/), which provides Perl5-compatible regular expressions.

[Figure 3 shows three simplified extraction rules (Document-ID: “Doc. No.: ($value) [\n]”; Author: “Applicant: ($value) [\n]”; Checked by: “checked by: ($value) (”) being matched against the text of a travel application (Doc. No.: 12345678, Applicant: M. Meyer, checked by: K. Müller (1.1.01)), yielding the annotations Document-ID: 12345678, Author: M. Meyer, and Checked by: K. Müller.]

Figure 3: Information extraction with regular expressions

If certain patterns that indicate where attribute values can be found appear frequently in texts, this information can be used for information extraction. Many documents contain such patterns, e.g., for pointing out the creation date or author data. In these cases, extraction rules can be defined. In the following example the author name is expected after the “Created by:” string. All characters after the text “Created by:” and before the end of the line are extracted as the value. After the colon, at least one tab or white space is expected. The corresponding extraction rule is ”Created by:[\t ]+([\w0-9-_]+)$”. Fig. 3 illustrates the use of regular expressions for the extraction of information; the figure shows simplified extraction rules for better understanding.
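A minimal sketch of this rule matching, using Python's re module instead of Jakarta ORO (the literal '-' in the character class is escaped for Python; the function name and sample text are illustrative assumptions):

```python
import re

# The extraction rule from the text above; group 1 captures the value,
# analogous to the "($value)" placeholders in Fig. 3.
AUTHOR_RULE = re.compile(r"Created by:[\t ]+([\w0-9\-_]+)$", re.MULTILINE)

def extract(text, rules):
    """Apply extraction rules (attribute name -> compiled pattern) to a
    text and collect the values captured by group 1 of each match."""
    metadata = {}
    for attribute, pattern in rules.items():
        values = [match.group(1) for match in pattern.finditer(text)]
        if values:
            metadata[attribute] = values
    return metadata

sample = "Doc. No.: 12345678\nCreated by: Meyer\n"
print(extract(sample, {"Author": AUTHOR_RULE}))  # {'Author': ['Meyer']}
```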


4 Evaluation

Both extractors have been applied to real-world data sets from the engineering domain, which have been provided by two of our application partners in the KnowWork project. Due to non-disclosure agreements we are not granted permission to show any original data used in our experiments.

The data set from the first company consists of 101 documents of three different document types. All documents have six attributes: five single-valued and one set-valued attribute (Tab. 1). The other data set for evaluation has been provided by a second company. It has 95 documents of a single document type (Change Report). 21 attributes have been assigned to each document. Eight attributes are single-valued and 13 are set-valued (Tab. 2).

Three independent experiments have been performed for each of the two data sets. For each experiment, the data sets were randomly divided into a training set (ca. 60% of the documents) and a testing set (ca. 40% of the documents). The sizes of the testing sets were 41 documents in the first case and 38 in the second one. For each document from the test sets, values for all attributes have been created. The quality of the metadata generation was evaluated by computing precision and recall of the created metadata. The precision is the ratio of the correctly created values to all created values. The ratio of the correctly created values to all actual values for an attribute determines the recall (see the sketch below). These values have been calculated for each attribute on its own and for all attributes together.

For the first data set only the TextSimilarityExtractor has been used. As these documents were very homogeneous, taking over attribute values from similar documents worked very well. The precision was on average 92.7% at a recall of 91.7%. These results should not be overestimated, because for many attributes only a few possible values existed (e.g., file format). The most challenging attribute here was the set-valued “keywords” attribute. But even in this case, the TextSimilarityExtractor returned good results with a precision of 79.4% and a recall of 76.2% (Tab. 1).

The data set from the second company was more complex. Some attribute values could not be determined by the k-nn approach (e.g., the creation date). For the first seven attributes (see Table 2) the RegExExtractor was used with manually created extraction rules. The TextSimilarityExtractor was applied to the remaining fourteen attributes. The overall average precision and recall on this data set turned out to be 67.7% and 67.9%, respectively. For some attributes the precision and recall values of the TextSimilarityExtractor were quite low. This happens if some attribute values appear only sporadically, or if text similarity gives little prediction of attribute values. The TextSimilarityExtractor performed poorly at creating values for the attributes “product type”, “change reason”, “hardware version”, and “software version”. In the worst case the precision and recall are 25.7% and 27.7%. Nevertheless, in many cases quite good results were achieved. For twelve of the 21 attributes the precision and recall values were higher than 75% on average (Tab. 2).
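The per-attribute evaluation measure defined above can be stated compactly; in this sketch (names and example values are our own, for illustration) precision and recall are computed for the created values of one attribute of one document:

```python
def precision_recall(created, actual):
    """Precision and recall of created metadata values for one
    attribute: correct/created and correct/actual, as defined above."""
    created, actual = set(created), set(actual)
    correct = len(created & actual)
    precision = correct / len(created) if created else 0.0
    recall = correct / len(actual) if actual else 0.0
    return precision, recall

# Two of three created keywords are correct, and both actual keywords
# were found: precision = 2/3, recall = 1.0.
p, r = precision_recall({"motor", "gearbox", "pump"}, {"motor", "gearbox"})
```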


Attribute      | Set valued | Extractor | Precision Exp. 1 | Recall Exp. 1 | Precision Exp. 2 | Recall Exp. 2 | Precision Exp. 3 | Recall Exp. 3 | Precision average | Recall average
---------------|------------|-----------|------|------|------|------|------|------|------|------
Contact person |            | TextSim.  | 90.2 | 90.2 | 94.9 | 94.9 | 87.8 | 87.8 | 91.0 | 91.0
File type      |            | TextSim.  | 100  | 100  | 100  | 100  | 100  | 100  | 100  | 100
Project        |            | TextSim.  | 100  | 100  | 100  | 100  | 100  | 100  | 100  | 100
Customer       |            | TextSim.  | 100  | 100  | 100  | 100  | 100  | 100  | 100  | 100
Keywords       | X          | TextSim.  | 79.7 | 81.9 | 87.1 | 79.2 | 71.2 | 67.5 | 79.4 | 76.2
Author         |            | TextSim.  | 97.6 | 97.6 | 94.9 | 94.9 | 97.6 | 97.6 | 96.7 | 96.7
Overall        |            |           | 92.8 | 93.5 | 95.1 | 92.7 | 90.3 | 89.0 | 92.7 | 91.7

Table 1: Evaluation results of data set from company 1

5 Conclusion

The experiments on the two data sets show quite promising results. It can be seen that good results can be achieved on real-world data even with fairly simple approaches. Even though the first data set was quite homogeneous, the experiments showed the practicability of the two integrated metadata generation extractors. For some attributes it is not recommended to apply these extractors, because the quality of the created metadata is not good enough. But if the extractors are applied only to suitable attributes, they can be a great help for the user during the metadata acquisition phase. Depending on the user's requirements, the recall of documents based on the created metadata can be increased by taking over more values. This has the advantage that more values (probably including the right ones) are presented to the user. As taking over more values might also include erroneous ones, such a modification could lead to worse precision values.

Acknowledgements

The content of this paper is a partial result of the KnowWork project, which is funded by the German Ministry for Education and Research (BMBF) under grant 01 IN 001 D. We wish to express our gratitude to the KnowWork colleagues and students at TZI for their contribution during the development and implementation of some of the ideas and concepts presented in this paper. We also want to acknowledge the efforts of the KnowWork project partners, especially the enterprises which provided the data sets for the evaluation of the extractors, and insiders Wissensbasierte Systeme GmbH for the provision and integration of their technologies into the KnowWork system.


Attribute         | Set valued | Extractor | Precision Exp. 1 | Recall Exp. 1 | Precision Exp. 2 | Recall Exp. 2 | Precision Exp. 3 | Recall Exp. 3 | Precision average | Recall average
------------------|------------|-----------|------|------|------|------|------|------|------|------
Report ID         | X          | RegEx     | 97.4 | 97.4 | 97.4 | 97.4 | 100  | 100  | 98.3 | 98.3
SAP number        |            | RegEx     | 100  | 100  | 80   | 76.2 | 91.7 | 100  | 90.6 | 92.1
Product series    |            | RegEx     | 73.7 | 60.9 | 79.5 | 66.0 | 83.8 | 73.8 | 79.0 | 66.9
Author            |            | RegEx     | 97.4 | 97.4 | 89.5 | 89.5 | 92.1 | 92.1 | 93.0 | 93.0
Checked by        |            | RegEx     | 100  | 73.7 | 100  | 79.0 | 100  | 79.0 | 100  | 77.2
Approved by       |            | RegEx     | 91.4 | 84.2 | 97.0 | 84.2 | 100  | 89.5 | 96.1 | 86.0
Creation date     |            | RegEx     | 91.7 | 86.8 | 100  | 86.8 | 94.1 | 84.2 | 95.3 | 86.0
Power/kW          | X          | TextSim.  | 54.7 | 53.2 | 54.0 | 54.7 | 56.5 | 64.1 | 55.1 | 57.3
Product type      | X          | TextSim.  | 35.6 | 37.0 | 34.6 | 31.4 | 40.4 | 41.4 | 36.9 | 34.6
OEM variant       | X          | TextSim.  | 76.7 | 75.0 | 66.7 | 62.2 | 80.5 | 73.3 | 74.6 | 70.2
Mech. Constr.     | X          | TextSim.  | 87.2 | 87.2 | 85.7 | 84.0 | 87.5 | 84.0 | 86.8 | 85.1
Mounting form     | X          | TextSim.  | 52.0 | 56.3 | 46.8 | 45.0 | 56.3 | 58.4 | 51.7 | 53.3
Hardware variant  | X          | TextSim.  | 69.2 | 80.0 | 85.4 | 78.9 | 84.0 | 80.8 | 79.6 | 79.9
Software variant  | X          | TextSim.  | 84.2 | 84.2 | 89.5 | 89.5 | 92.1 | 89.7 | 88.6 | 87.8
Change reason     | X          | TextSim.  | 47.4 | 43.9 | 47.4 | 46.2 | 50.0 | 46.3 | 48.3 | 45.5
Hardware version  | X          | TextSim.  | 24.1 | 29.2 | 31.7 | 29.2 | 21.1 | 24.7 | 25.7 | 27.7
Software version  | X          | TextSim.  | 41.4 | 52.2 | 45.7 | 36.2 | 53.5 | 62.0 | 46.8 | 50.1
Area of validity  | X          | TextSim.  | 77.1 | 83.1 | 79.0 | 80.4 | 75.2 | 82.4 | 77.1 | 82.0
Categories        | X          | TextSim.  | 51.7 | 55.4 | 58.0 | 50.0 | 66.1 | 63.8 | 58.6 | 56.4
File type         |            | TextSim.  | 100  | 100  | 100  | 100  | 100  | 100  | 100  | 100
Paper format      |            | TextSim.  | 100  | 100  | 100  | 100  | 100  | 100  | 100  | 100
Overall           |            |           | 66.5 | 68.6 | 67.9 | 64.5 | 68.7 | 70.8 | 67.7 | 67.9

Table 2: Evaluation results of data set from company 2


References

[Baeza-Yates and Ribeiro-Neto 1999] Baeza-Yates, R.; Ribeiro-Neto, B.: “Modern Information Retrieval”. ACM Press New York, Addison-Wesley, 1999.
[Bergamaschi et al. 1999] Bergamaschi, S.; Castano, S.; Vincini, M.: “Semantic Integration of Semistructured and Structured Data Sources”. SIGMOD Record, 28(1), 1999, p. 54-59.
[Craven et al. 2000] Craven, M.; DiPasquo, D.; Freitag, D.; McCallum, A.; Mitchell, T.; Nigam, K.; Slattery, S.: “Learning to Construct Knowledge Bases from the World Wide Web”. Artificial Intelligence, 118(1-2), 2000, p. 69-113.
[Gruber 1993] Gruber, T. R.: “A Translation Approach to Portable Ontology Specifications”. Knowledge Acquisition, 5(2), 1993, p. 199-220.
[Hobbs et al. 1996] Hobbs, J.; Appelt, D.; Bear, J.; Israel, D.; Kameyama, M.; Stickel, M.; Tyson, M.: “FASTUS: Extracting Information from Natural Language Texts”. In: E. Roche and Y. Schabes (Eds.): Finite State Devices for Natural Language Processing, MIT Press, 1996.
[insiders 2002] “mindaccess – Overview and Concepts, Release 2.7”. Technical Report, insiders Wissensbasierte Systeme GmbH, 08.11.2002.
[Jenkins et al. 1999] Jenkins, C.; Jackson, M.; Burden, P.; Wallis, J.: “Automatic RDF Metadata Generation for Resource Discovery”. Computer Networks, 31, 1999, p. 1305-1320.
[Junker et al. 1999] Junker, M.; Sintek, M.; Rinck, M.: “Learning for Text Categorization and Information Extraction with ILP”. Proceedings of the Workshop on Learning Language in Logic, 1999.
[Lattner and Apitz 2002] Lattner, A. D.; Apitz, R.: “A Metadata Generation Framework for Heterogeneous Information Sources”. Proceedings of the 2nd International Conference on Knowledge Management (I-KNOW ’02), Graz, Austria, July 11-12, 2002, p. 164-169.
[Pierre 2001] Pierre, J. M.: “On the Automated Classification of Web Sites”. Linköping Electronic Articles in Computer and Information Science, Vol. 6, 2001.
[Sebastiani 2002] Sebastiani, F.: “Machine Learning in Automated Text Categorization”. ACM Computing Surveys, 34(1), 2002, p. 1-47.
[Soderland 1999] Soderland, S.: “Learning Information Extraction Rules for Semi-Structured and Free Text”. Machine Learning, 34(1-3), 1999, p. 233-272.
[Stuckenschmidt and van Harmelen 2001] Stuckenschmidt, H.; van Harmelen, F.: “Ontology-based Metadata Generation from Semi-Structured Information”. Proceedings of the 1st International Conference on Knowledge Capture (K-CAP 2001), Morgan Kaufmann, 2001, p. 163-170.
[Tönshoff et al. 2001] Tönshoff, H. K.; Apitz, R.; Lattner, A. D.; Schlieder, C.: “KnowWork – An Approach to Co-ordinate Knowledge within Technical Sales, Design and Process Planning Departments”. Proceedings of the 7th International Conference on Concurrent Enterprising, Bremen, Germany, June 27-29, 2001, p. 231-239.
[Tork Roth and Schwarz 1997] Tork Roth, M.; Schwarz, P.: “Don’t Scrap It, Wrap It! A Wrapper Architecture for Legacy Data Sources”. Proceedings of the 23rd VLDB Conference, Athens, Greece, 1997, p. 266-275.
[Yang and Liu 1999] Yang, Y.; Liu, X.: “A Re-examination of Text Categorization Methods”. Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’99), 1999, p. 42-49.
