SemDupl: Semantic-based Duplicate Identification

Sven Hartrumpf, Hermann Helbig, Tim vor der Brück, Christian Eichhorn
Arbeitsgruppe Intelligente Informations- und Kommunikationssysteme
FernUniversität in Hagen
58084 Hagen, Germany
{sven.hartrumpf,hermann.helbig,tim.vorderbrueck}@FernUni-Hagen.de
[email protected]

July 2011

FernUniversität in Hagen, Informatik-Bericht

Contents

1 Overview, Introduction and Motivation; Accomplishment of Tasks
2 Comparison with other Systems
3 Architecture of SemDupl
4 The SemDupl Corpus
5 The Shallow Duplicate Detectors SemDupl-shallow and SemDupl-quick
  5.1 Features in SemDupl-shallow
  5.2 Combination of Features
  5.3 Training Phase
  5.4 Application Phase
  5.5 Aggregation of Feature Values
  5.6 SemDupl-quick (SQ)
  5.7 Comparison of Shallow Approaches
6 Linguistic Phenomena Relevant for Semantic Duplicates
  6.1 Types of Paraphrases for Semantic Duplicates
  6.2 Restrictive Contexts and Other Precision Problems for Semantic Duplicates
7 Knowledge Acquisition for Deep Duplicate Detectors
  7.1 Hypernyms and Meronyms
  7.2 Deep vs. Shallow Patterns
  7.3 Validation of Relation Hypotheses
  7.4 System Architecture: Relation Extraction
8 Annotation GUI
9 Extraction of Entailments
  9.1 Extracting Entailments Employing a Search Engine
  9.2 Extracting Entailments from SNs Basically Following Ravichandran and Hovy
  9.3 Extracting Entailments Basically Following Lin and Pantel
10 The Deep Duplicate Detector SemDupl-deep
11 Combination
12 Evaluation
13 Evaluation Interpretation and Conclusions
A Extracted Knowledge
B Program Documentation and Manuals
  B.1 SemDupl-deep
  B.2 SemDupl-shallow
  B.3 SemDupl-quick
  B.4 Combiner
  B.5 Knowledge Acquisition
    B.5.1 Hyponyms, Meronyms and Synonyms
    B.5.2 Entailments
  B.6 The GUI
    B.6.1 Text Collection Maintenance
    B.6.2 Text Collection Analysis
    B.6.3 Text Duplicate Search
  B.7 Class Diagram SemDupl-shallow
    B.7.1 Class Text
    B.7.2 Class Toolbox
    B.7.3 Class Comparer
    B.7.4 Class LAUT
    B.7.5 Class CErkenner

List of Figures

1 Architecture of the duplicate detector SemDupl.
2 Data model of SemDupl-quick.
3 Shallow pattern for hyponymy extraction where the premise is given as a regular expression.
4 Deep pattern for hyponymy extraction where the premise is given as an SN.
5 Matching a deep pattern to a semantic network.
6 Deep pattern for hyponymy extraction where the premise is given as an SN.
7 Application of a deep pattern to an SN representing the sentence 'He sells all common string instruments except celli.'
8 System architecture of SemQuire.
9 Screenshot of the annotation tool.
10 Graphical user interface of SemChecker for displaying the result of the logical validation, which marks the entry sub(land.1.1, provinz.1.1) as potentially defective.
11 System architecture of the Combiner.
12 Precision of hyponymy hypotheses depending on score.
13 Screenshot of the GUI for SemDupl.
14 Class diagram of SemDupl-shallow.

List of Tables

1 Assumed equivalent formulas with given confidence score.
2 Confusion matrix for WCopyFind.
3 Confusion matrices for shallow approaches.
4 Confusion matrices for deep approaches.
5 F-measure, precision, recall, and accuracy for shallow approaches.
6 F-measure, precision, recall, and accuracy for deep approaches.
7 Extracted semantic and lexical relation hypotheses.
8 Lines of source code of the SemDupl components.
9 Hypernymy relations with high confidence score.
9 Hypernymy relations with high confidence score (ctd.).
10 Meronymy hypotheses with high confidence score.
10 Meronymy hypotheses with high confidence score (ctd.).
10 Meronymy hypotheses with high confidence score (ctd.).
11 Synonymy hypotheses with high confidence score.
11 Synonymy hypotheses with high confidence score (ctd.).
12 Selection of entailments.
12 Selection of entailments (ctd.).

1 Overview, Introduction and Motivation; Accomplishment of Tasks

What has been the motivation for the project SemDupl (Semantic Duplicate Detection)? Duplicate detection becomes more and more important in the information society as the number of available text documents grows rapidly; see for example the growth of the web. At a similar speed, the number of duplicates increases. Duplicates are relevant in many different areas. In search engines, question answering, and other information access applications, duplicates should not be counted as different documents for scoring purposes and should be excluded from result pages presented to users. Copyright owners want to find cases of copyright violations, even if the violators apply intelligent techniques to hide the origin of their plagiarisms. Desktop administration and backup tools need ways to suggest which files could be unneeded duplicates. Depending on the application, near-duplicates are subsumed under the term duplicates in this paper.

In prior work, duplicate detection worked with shallow checkers; they can be called shallow because they work only on surface-oriented factors or features and do not step into the semantics of words, sentences, paragraphs, or even whole texts. Surface-oriented features of a given text are derived from n-grams, rare words, spelling errors, etc. But two texts can be semantic duplicates, i.e., express the same content, without sharing many words or word sequences and hence without having similar values of shallow features. Shallow checkers can be easily tricked by experienced users who employ advanced paraphrase techniques. Therefore, a deep (semantic) approach that compares full semantic representations of two given texts has been designed, implemented, and evaluated in the SemDupl (Semantic Duplicate Detection) project. To achieve this goal, the following tasks had to be accomplished according to the project proposal:

(A1) Conceptual work In the preparation phase, existing shallow duplicate recognizers had to be compared, with the aim to find one recognizer which can be used as a baseline (see Section 2) and as a starting point for combining shallow and deep methods for duplicate detection (see Section 3).

(A2) Knowledge acquisition This task turned out to be the most ambitious part of the project, since the knowledge acquisition had to be done automatically, and its results are the precondition for the successful operation of the deep duplicate recognizer. To this end, thousands of basic relations (subordination relations, meronymy relations, and semantic entailments, so-called meaning postulates) had to be found automatically by means of statistical learning methods and validated with logical methods (see Section 7 and Section 9).

(A3) Creation of convenient test beds As a basis for checking whether the duplicate recognizer works well, sufficiently large corpora of pairs of duplicate texts and non-duplicates are needed. These corpora were partially handcrafted and partially gathered from existing collections (see Section 4). Several corpora form the SemDupl corpus, which can be seen as a valuable resource produced in the SemDupl project.

(A4) Textual entailment During the work on the project, it turned out that the theorem prover which had been developed in the LogAnswer project (funded by DFG under contract number HE 2847/10-1) is also well suited for textual entailment. Thus the main work consisted in the generation of a large number of textual entailment patterns, deep as well as shallow ones (see Section 7.2).

(A5) Technical realization of the duplicate recognizer The programs embodying the recognizer have been developed according to the architectures shown in Figure 1 and Figure 11. Their technical handling is described in Appendix B.

(A6) Evaluation The results of the evaluation of the recognizer developed in the SemDupl project are finally described in Section 12.

2 Comparison with other Systems

As stated, detecting duplicates is of high interest for holders of rights and for tutors. Therefore, many tools exist to detect plagiarism in given corpora or on the web. Below are some of the best-ranked systems according to the 2008 test of the University of Applied Sciences Berlin (HTW) (Weber-Wulff, 2009).

Copyscape (http://www.copyscape.com/; first and third place in the test, for the premium and free versions), a plagiarism checker by Indigo Stream Technologies Ltd. Given a text, it searches the Internet for possible duplicates of this text using the document's words in the given order.

Plagiarism Detector (http://plagiarism-detector.com/; second place) by SkyLine, Inc. uses non-overlapping n-grams with a configurable spacing between them to find online plagiarism of a given text in various possible input formats.

Urkund (http://www.urkund.de/; fourth place) by PrioInfo AB aims to check papers written by students for possible plagiarism and searches the Internet (with known paper mills), its own corpus of scientific publications, and papers previously checked for plagiarism.

WCopyfind (http://plagiarism.phys.virginia.edu/Wsoftware.html; marked as good) is an n-gram based plagiarism checker of the University of Virginia, Charlottesville (Balaguer, 2009). It targets students' papers, searching a corpus which has to be compiled by the user. Since it is open-source software, this tool was used as a comparison for our SemDupl system.

3 Architecture of SemDupl

SemDupl is a complex software system, whose top-level architecture is illustrated in Figure 1 (Hartrumpf et al., 2010b,a). (For simplicity, only one of the two shallow duplicate detectors is included in the graph.) The user can access SemDupl via a graphical user interface (GUI) or via a command line interface (CLI); for other application programs, an application program interface (API) is provided.

The GUI is designed with two different user groups in mind: (1) developers and testers that use SemDupl on a daily basis, and (2) normal users that want to use SemDupl to investigate their own corpus for duplicates. It was implemented in Hop, a Scheme-based language for the Web 2.0, because the main parts of SemDupl (parser, deep detector) are implemented in Scheme, too, and because Hop allows concise and elegant programs that run efficiently across the Internet. The details of the GUI are explained in Appendix B.6.

The main functionality (or mode) is described in Figure 1: testing a single text (the candidate text) against a corpus of texts (the base corpus) in order to decide whether it is a duplicate (and if so, from what texts and in which parts exactly). The candidate text is distributed by the central controller to three duplicate detectors: two shallow ones (SemDupl-shallow, SemDupl-quick), to quickly find the easy cases; and a deep one (SemDupl-deep), to tackle the difficult cases. The results of the three detectors are combined to give a single result with one confidence score for the reported decision. The shallow detector SemDupl-shallow and the Combiner need corpus statistics derived from an annotated corpus of plagiarism. In addition, lexical-semantic relations are used to follow semantic paraphrasing at least on the word level.

SemDupl-deep is much more complex than the shallow detector SemDupl-shallow: it starts with the syntactico-semantic parser WOCADI (short for word-class based disambiguating parser; Hartrumpf, 2003), which delivers semantic networks of the MultiNet formalism (short for Multilayered Extended Semantic Networks; Helbig, 2006) and syntactic dependency trees. Only the semantic networks are further processed in SemDupl-deep. The base corpus is completely parsed during a preprocessing step, which is comparable to an indexing step in simpler systems. The semantic representations of the candidate texts are expanded (query expansion) and textual entailments are precalculated by a large rule component. The resulting representations are the input for a semantic matcher, which can only run efficiently over the base corpus due to a specialized index for semantic networks.

The knowledge aspect of SemDupl-deep is depicted in the lower right corner of Figure 1. Automatic knowledge acquisition methods deliver semantic facts and rules from parsed corpora like the German Wikipedia. In addition, knowledge engineers can validate and extend the automatically generated resources.

4 The SemDupl Corpus

The corpus used in the learning process of the Combiner and of SemDupl-shallow (see Section 5) and for evaluation purposes is composed of five manually annotated corpora:

RSS news (semdupl-rss) Automatically collected news feed articles of different German media, consisting of 99 texts fully annotated with 298 duplicate pairs.

Prose (semdupl-prose) Short stories by Edgar Allan Poe translated to German by different translators (split into 136 parts of about 600 words each) with 200 duplicate pairs.


Figure 1: Architecture of the duplicate detector SemDupl. [Diagram omitted in this text version. It shows users and application programs accessing SemDupl through the GUI, CLI, and API; a controller passing the candidate text to the shallow duplicate detector (SemDupl-shallow) and the semantic duplicate detector (SemDupl-deep); the combination of duplicate results into a single result (score, text ID, matches); and the supporting components: base corpus, NL parser, parsed text corpus, semantic sentence index, query expansion, textual entailment component, semantic rules (equivalences, entailments), knowledge acquisition methods and corpora (e.g. Wikipedia), facts (lexical-semantic relations, world knowledge), duplicate statistics, annotated duplicate corpus, and a knowledge workbench used by a knowledge engineer.]

Internet (semdupl-google) 100 texts collected from Google (the 10 top texts for each of the 10 fastest growing search terms in 2008), containing 224 duplicate pairs.

Minimal test units (semdupl-units) A collection of (mostly) minimal text pairs. These were written to test for single paraphrase phenomena; the phenomena will be discussed in Section 6. This subcorpus contains 134 duplicate pairs and 5642 non-duplicate pairs.

Plagiarism (htw) Weber-Wulff's collection of plagiarisms, annotated as 77 texts with 141 duplicate pairs and 5788 non-duplicate pairs.

5 The Shallow Duplicate Detectors SemDupl-shallow and SemDupl-quick

SemDupl-shallow is an approach to detect duplicates in a corpus of text documents using shallow techniques only (Eichhorn, 2009). It employs the decision tree learning algorithm ID3 (Quinlan, 1986) to learn to distinguish between duplicate pairs and non-duplicate pairs of texts.

5.1 Features in SemDupl-shallow

Features in SemDupl-shallow use only the surface structure of the analyzed texts. These features are the following:

Checksum To test quickly whether two texts are identical, a checksum of each text is calculated with the MD5 algorithm (Rivest, 1992).

Word sets The words of the compared texts as sets of words. The sets calculated are: all words of the text, the words of the text without stop words, the words of the text in stemmed form using the Porter stemming algorithm (Porter, 1997), and the words of the text united with the sets of their synonyms.

Typos A good strategy for humans searching for plagiarism is to compare the spelling mistakes made by different authors. If a text is plagiarized, its typos are often copied, too, as shown by Weber-Wulff (2002). Typos are calculated using the spell checker GNU Aspell.

Length of words and sentences Weber-Wulff (2002) shows that texts which are plagiarized often share the same style of writing. Since the average length of words and sentences (per paragraph) is one characteristic of the writer's style, these two feature values are calculated and compared.

N-grams Word n-grams are sequences of n ∈ ℕ words from the texts. In SemDupl-shallow, different types of n-grams are used to compare a pair of texts (a small code sketch of these variants is given after the list):

Standard n-grams N-grams with a length of 3 and 7.

Alliterations These are special n-grams with a length of 3 or 7 where all words start with the same letter.

Phonetic alliterations These are alliterations with the words sharing the same initial phoneme; initial phonemes are determined as described by Brügge and Mohs (2003).

K-skip-n-grams Special n-grams where up to k ∈ ℕ words are skipped between the elements of the n-grams, with a skip-length of k = 2 and a length of n = 3.
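To make the n-gram variants concrete, the following is a minimal Python sketch (an editorial illustration only; SemDupl-shallow is not implemented in Python, and the function names are ours):

    from itertools import combinations

    def ngrams(tokens, n):
        # all contiguous word n-grams of length n
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def alliterations(tokens, n):
        # n-grams whose words all start with the same letter
        return [g for g in ngrams(tokens, n)
                if len({w[0].lower() for w in g}) == 1]

    def k_skip_ngrams(tokens, n, k):
        # n-grams where up to k tokens may be skipped between elements
        result = set()
        for i in range(len(tokens)):
            window = tokens[i:i + n + (n - 1) * k]
            for idx in combinations(range(1, len(window)), n - 1):
                pos = (0,) + idx
                if all(b - a - 1 <= k for a, b in zip(pos, pos[1:])):
                    result.add(tuple(window[j] for j in pos))
        return result

    tokens = "aber dann dachte er lange nach".split()
    print(ngrams(tokens, 3)[0])                                   # ('aber', 'dann', 'dachte')
    print(('aber', 'er', 'nach') in k_skip_ngrams(tokens, 3, 2))  # True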

5.2 Combination of Features

In order to combine the different features which may be calculated on a given text pair, the system uses a trained decision tree (Quinlan, 1986). Starting at the root of the tree, the value of the feature given in the node is calculated, and the decision tree branch appropriate to the calculated value is chosen. This is repeated until a leaf is reached and the pair falls into the class of duplicates or non-duplicates. This differs somewhat from the usual use of decision trees, where all values are known prior to the use of the tree, but in this way only some of the features have to be calculated, which reduces processing time. Since there are 39 features which may be calculated and the learned tree has a minimal depth of 3, a maximal depth of 14, and an average depth of 9.6, roughly only a quarter of all features has to be calculated in order to determine whether the pair is a duplicate pair or a non-duplicate pair. A sketch of this lazy traversal follows below.
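The following minimal sketch (ours; Node, classify, and the feature name are illustrative, not the original implementation) shows the lazy traversal: a feature value is computed only when a node on the actual path asks for it.

    class Node:
        def __init__(self, feature=None, threshold=None,
                     low=None, high=None, label=None):
            self.feature, self.threshold = feature, threshold
            self.low, self.high = low, high
            self.label = label          # set on leaves only

    def classify(node, pair, feature_funcs):
        cache = {}                      # each feature computed at most once
        while node.label is None:
            f = node.feature
            if f not in cache:
                cache[f] = feature_funcs[f](pair)
            node = node.low if cache[f] <= node.threshold else node.high
        return node.label

    # hypothetical feature function and tree for illustration
    feature_funcs = {"trigram_overlap": lambda pair: 0.8}
    tree = Node("trigram_overlap", 0.5,
                low=Node(label="non-duplicate"),
                high=Node(label="duplicate"))
    print(classify(tree, ("text a", "text b"), feature_funcs))  # duplicate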

5.3 Training Phase

Training is done in two steps. In the first step, each feature for each annotated pair of corpus texts is calculated and stored in order to rerun the actual training without time-consuming recalculation of feature values. In the second step, a decision tree is calculated using the ID3 algorithm (Quinlan, 1986). Since the values are from the interval [0, 1], an extended version of the algorithm is used which is capable of dealing with intervals instead of discrete values only (Petersohn, 2005). Both the original and the extended algorithm are part of the learning environment RapidMiner (Mierswa et al., 2006), which is used as the machine learning component in this process.

5.4 Application Phase

During the application phase, the learned decision tree is traversed. The 111 leaves of the tree consist of:

• 30 leaves containing only duplicate pairs,
• 49 leaves containing only non-duplicate pairs, and
• 32 leaves containing pairs of both classes.

If the tree is used for detecting duplicates, the leaves can be interpreted in different ways (a sketch of these interpretations follows after the list):

• The standard interpretation (which shows the best overall detection rate) is to take the class of the majority of the examples to determine the class of the leaf.
• If false detection of duplicates has to be reduced, another interpretation called presumption of innocence can be used. In this case, a leaf falls into the duplicate class only if there is no evidence that it may be a non-duplicate; therefore only pure duplicate leaves belong to this class, while the others are seen as non-duplicate leaves.
• If, in contrast, false negatives are to be avoided, the interpretation of presumption of guilt can be used: only leaves without any duplicates are seen as non-duplicates, while all other leaves are seen as members of the duplicate class.
• Sometimes, SemDupl-shallow may be used as a preprocessor for another automated method of detection. Then it is crucial to eliminate all pairs of texts from the analyzed text set that can be reliably classified as duplicate or non-duplicate, in order to save machine time which would be used to classify these examples again. Also, the pairs detected to be non-duplicates should contain as few duplicates as possible, because if they are eliminated during preprocessing, there is no chance of detecting them with the latter method. This preprocessing or filtering can be done by using the three-class interpretation, allowing SemDupl-shallow to classify pairs as unknown. A pair is classified as unknown if its leaf contains examples of both classes.
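A minimal sketch of the four interpretations (our illustration; a leaf is summarized by the counts of duplicate and non-duplicate training pairs that reached it):

    def interpret_leaf(dup, nondup, policy):
        if policy == "majority":        # standard interpretation
            return "duplicate" if dup >= nondup else "non-duplicate"
        if policy == "innocence":       # only pure duplicate leaves count
            return "duplicate" if nondup == 0 else "non-duplicate"
        if policy == "guilt":           # only pure non-duplicate leaves count
            return "non-duplicate" if dup == 0 else "duplicate"
        if policy == "three-class":     # for use as a filter/preprocessor
            if dup and nondup:
                return "unknown"
            return "duplicate" if dup else "non-duplicate"

    for policy in ("majority", "innocence", "guilt", "three-class"):
        print(policy, interpret_leaf(3, 5, policy))
    # majority/innocence: non-duplicate; guilt: duplicate; three-class: unknown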

5.5 Aggregation of Feature Values

In order to derive an aggregated numerical value for duplicity, different measures have been tested. Among these methods, the arithmetic mean of the feature values has proven to be the most useful. Calculated on the values of the annotated corpus, the aggregated value of duplicates falls into the interval [0.021, 1]; the aggregated value of non-duplicates falls into [0.016, 0.701]. If the aggregated value is to be determined, all features have to be calculated, which needs a lot more time than just determining the pair's class. A small numerical sketch is given below.
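As a small numerical sketch (ours, with invented feature values; the intervals are the ones reported above):

    def aggregate(feature_values):
        # arithmetic mean of all feature values
        return sum(feature_values) / len(feature_values)

    score = aggregate([0.9, 0.7, 0.8, 1.0])
    # duplicates fell into [0.021, 1], non-duplicates into [0.016, 0.701],
    # so a score above 0.701 was only observed for duplicate pairs
    print(score, score > 0.701)  # 0.85 True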

5.6 SemDupl-quick (SQ)

Tests indicated that the shallow approach of SemDupl-shallow achieves good results regarding precision and accuracy, but due to its time complexity it is rather unsuited for large corpora. So another shallow approach was devised, using only features that can be calculated efficiently. In the preprocessing phase, SemDupl-quick (SQ) searches the given texts for misspelled words and words with a frequency class above a given threshold and compiles all n-grams with lengths from 3 to 7. These values are used as indices, whereas the text's ID (e.g. filename) is used as value. This generates a database with a list of text IDs for each value (with a table for each feature). In the detection phase, all rows r containing a given text are searched inside the tables. For each other affected text found inside the rows, the ratio between the total number of rows r and the number of rows in r containing the affected text ID is calculated for each table (and therefore feature). These scores are combined linearly and normalized, resulting in a combined score for each text pair. A text is regarded as a duplicate if the score is greater than a given threshold.

Figure 2: Data model of SemDupl-quick. [Diagram omitted in this text version: a table Indicator with columns wid and word is linked 1 to 1..* to a table Indicator_files with columns wid and filename.]

Figure 2 shows the data model of SemDupl-quick. In the initialization phase, for each feature the two tables (following the schema of Indicator and Indicator_files) are filled. An example entry of the table trigrams is:

Table trigrams: wid: 3, word: aber dann dachte

An example entry of the table trigrams_files is:

Table trigrams_files: wid: 3, filename: brief1_01.txt

For determining the overlap score of a file with all other files, only the files table is used. The score is determined by calculating the number of common entries, i.e., the relative number of wids which occur together with both filenames. A small sketch of this scheme is given below.
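The following is a minimal in-memory sketch of this scheme (ours; dictionaries stand in for the database tables, and the names follow Figure 2):

    from collections import defaultdict

    word_to_wid = {}                     # lookup side of table Indicator (wid, word)
    indicator_files = defaultdict(set)   # table Indicator_files: wid -> filenames

    def index(filename, grams):
        for gram in grams:
            wid = word_to_wid.setdefault(gram, len(word_to_wid))
            indicator_files[wid].add(filename)

    def overlap_scores(filename):
        # relative number of shared wids for each other file
        own = [w for w, files in indicator_files.items() if filename in files]
        counts = defaultdict(int)
        for wid in own:
            for other in indicator_files[wid] - {filename}:
                counts[other] += 1
        return {other: n / len(own) for other, n in counts.items()}

    index("brief1_01.txt", ["aber dann dachte", "dann dachte er"])
    index("brief2_01.txt", ["aber dann dachte"])
    print(overlap_scores("brief1_01.txt"))  # {'brief2_01.txt': 0.5}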

5.7 Comparison of Shallow Approaches

SemDupl-shallow is ready to instantly check an arbitrary pair of texts without any preprocessing steps, as an out-of-the-box duplicate detector. Its capability to learn the definition of duplicates on an annotated corpus leads to a detection which has a lower chance of failing because of badly user-set thresholds. Its downside is that it has to inspect every possible text pair in order to detect all duplicates in a given corpus, resulting in quadratic time complexity, so it should be used on small or filtered corpora. SemDupl-quick, on the other hand, uses preprocessing, resulting in a lower time complexity while detecting; but only some of the possible features can be used as index values, and the thresholds, which are defined by the user, may, if not set well, become a source of errors.

6 Linguistic Phenomena Relevant for Semantic Duplicates

6.1 Types of Paraphrases for Semantic Duplicates

The following problems must be solved by advanced duplicate detectors. Most of them are out of reach for standard surface-oriented methods; only items 1, 2, 4, 5, 13, 14, and 15 can probably be detected by shallow approaches, at least partially. For each problem, an illustrative example is given and a solution in our semantic approach is described. Most of these examples are already covered by our deep detector SemDupl-deep (if not indicated otherwise).

1. different word forms (e.g. 'alle Barockkirchen'/'all baroque churches' vs. 'jede Barockkirche'/'every baroque church'): solved by the morphology component, more specifically by the operation called lemmatization. Differences in syntactic case are irrelevant in paraphrases, but differences in number must be investigated more closely. In the above example, the difference is irrelevant because together with the quantifiers both noun forms mean almost the same. Such semantic equivalences of syntactically different noun phrases can be determined (in part) by the parser.

2. different orthography (e.g. 'Fluss'/'river' vs. 'Fluß'): solved by the morphology component, by way of spelling normalization (new and old orthography in German; also spelling variants).

3. semantically light or empty expressions: 'Der Außenminister will den Grund der Diskussion ja dann doch wissen.'/'The secretary of foreign affairs now wants to know the reason for the discussion.' vs. 'Der Außenminister will den Grund der Diskussion wissen.'/'The secretary of foreign affairs wants to know the reason for the discussion.'

4. abbreviations/acronyms and expanded forms ('die EU'/'the EU' vs. 'die Europäische Union'/'the European Union'): solved by lexical resources available to the parser.

5. different hyphenation of compounds (e.g. 'Barock-Kirche'/'baroque church' vs. 'Barockkirche'): solved by the compound analysis component.

6. different word order, e.g. 'Der Schüler freute sich über das Buch.'/'The student was happy about the book.' vs. 'Über das Buch freute sich der Schüler.'/literally: 'About the book the student was happy.': solved by the parser. In German, word order is much freer than in English.

7. discontinuous word forms (e.g. German verbs with separable prefix: '... als der Minister den Namen aufschrieb.'/'... as the secretary noted the name down.' vs. 'Der Minister schrieb den Namen auf.'/'The secretary noted the name down.'): solved by the parser, which combines a separated verb prefix with the corresponding base verb.

8. different voices (active or passive in German), e.g. 'Der Lehrling reparierte das Auto.'/'The apprentice repaired the car.' vs. 'Das Auto wurde vom Lehrling repariert.'/'The car was repaired by the apprentice.': solved by the parser because it generates identical semantic networks.

9. arguments in different clauses, e.g. 'Die Frau lief 10 Kilometer.'/'The woman ran 10 kilometers.' vs. 'Es gelang der Frau, 10 Kilometer zu laufen.'/'The woman managed to run 10 kilometers.' In this example, the lexical syntax and semantics of the control verb 'gelingen'/'to manage' allow similar content to be identified.

10. nominalization of situations, e.g. 'der Kauf der kleinen Firma durch den Weltkonzern'/'the acquisition of the small company by the global corporation' vs. 'Der Weltkonzern kauft die kleine Firma.'/'The global corporation buys the small company.' Solution: the verb and the corresponding nominalization are linked in the lexicon and lead to isomorphic semantic networks that can be easily matched to find similar sentences.

11. information distribution across sentences, e.g. 'Der Räuber verließ die Bank und flüchtete zum Bahnhof.'/'The bandit left the bank and escaped to the train station.' vs. 'Der Räuber verließ die Bank. Er flüchtete zum Bahnhof.'/'The bandit left the bank. He escaped to the train station.' As the parser works sentence by sentence, the representations on the sentence level are different. But if one allows splitting a conjunction (of situations) when searching, the information can also be found in the document version that uses several sentences. (Note that the reverse case, where information split across several sentences of the candidate document matches a document with a combining sentence, is already covered without any extensions.)

12. names vs. definite descriptions, e.g. 'Paris' vs. 'die französische Hauptstadt'/'the French capital'. This effect can be compensated by so-called question decomposition in SemDupl-deep. As the term question decomposition indicates, this kind of decomposition was introduced for question answering, but it can also be applied in semantic search for similar sentences. Currently, only one direction (the replacement of a name) can be detected, mainly for efficiency reasons.

13. synonyms: partially solved by HaGenLex (Hartrumpf et al., 2003) plus GermaNet (Hamp and Feldweg, 1997) (relation syno); for compounds, many synonyms can be inferred from synonyms of parts, e.g. 'Großstadt-Atmosphäre'/literally: 'the big city atmosphere' vs. 'Metropolen-Atmosphäre'/'the metropolis atmosphere'.

14. hyponyms (or viewed reversely: hypernyms; in the case of verbs, these are often called troponyms), e.g. 'Geschäftsbrief'/'business letter' vs. 'Brief'/'letter': solved by lexico-semantic relations (sub and similar relations) stored in HaGenLex and derived from other sources like GermaNet.

15. antonyms (plus negation), e.g. 'nicht richtig'/'not right' vs. 'falsch'/'false': solved by lexico-semantic relations (anto and its subrelations) stored in HaGenLex and derived from other sources like GermaNet. In some cases, the negation of an antonym is not completely equivalent to the original concept. Corpus analyses and statistics can help to detect such cases.

16. compounds vs. analytical expressions like complex NPs and clauses: compounds with regular semantics can be paraphrased by a complex constituent, e.g. 'Finanzierungslücke'/'finance gap' vs. 'Lücke bei der Finanzierung'/'gap in financing' and 'die Großstadt-Atmosphäre'/'the metropolis atmosphere' vs. 'die großstädtische Atmosphäre'/'the metropolitan atmosphere'. Solution: transformation of compound semantics by the rule component of SemDupl.

17. idioms, e.g. 'einen Entwurf in die Tonne werfen'/literally: 'to throw a draft into the bin' vs. 'einen Entwurf verwerfen'/'to discard a draft': These are probably not solvable in all cases in the near future. But cases like 'ins Gras beißen'/'kick the bucket' vs. 'sterben'/'to die' and the example given above can be entered in an idiom lexicon. Currently, an idiom lexicon of around 250 verb-based idioms is employed.

18. support verb constructions (SVCs), e.g. 'einen Widerspruch äußern' vs. 'widersprechen'. Solution: an SVC lexicon that allows normalizing an SVC to a form without the SVC. In SemDupl, this is achieved by MultiNet rules applied during query expansion.

19. coreferences (different expressions referring to the same entity): solved by the coreference module. For example, intrasentential coreferences: 'Syrakus, das im Jahr 734 v. Chr. gegründet wurde ...'/'Syracuse, which was founded in the year 734 BC ...' vs. 'Syrakus wurde im Jahr 734 v. Chr. gegründet.'/'Syracuse was founded in the year 734 BC.' (the semantic representation of the first sentence contains a correct resolution of the relative pronoun 'das'); intersentential coreferences: 'Adenauer', 'Konrad Adenauer', 'er' is a possible sequence of noun phrases (forming a so-called coreference chain), as is 'Konrad Adenauer', 'er', 'Adenauer', which leads to completely different surface sentences, but almost identical semantics.

20. presuppositions: existential presuppositions, e.g. 'Der König von Frankreich reiste 1730 ...'/'The king of France traveled in 1730 ...' implies (presupposes) 'Es gab einen König von Frankreich 1730.'/'There was a king of France in 1730.'

21. entailments (especially entailments of verbs), e.g. 'Die Firma verkauft ein Patent an den Konkurrenten.'/'The company sells a patent to the business rival.' vs. 'Der Konkurrent kauft ein Patent von der Firma.'/'The business rival buys a patent from the company.': covered in part by entailments from HaGenLex and entailments derived from knowledge bases like GermaNet and manual translations of XWordNet.

22. metaphors like 'am Anfang des Lebensweges'/'at the beginning of the path of life' vs. 'nach der Geburt'/'after birth' (metaphor: life-is-journey or life-is-way). Metaphors span a whole range of phenomena. For example, lexicalized metaphors can be easily handled by the lexicon, but more general cases would require a special metaphor interpretation module to be used by the parser.

This enumeration could be extended further and further. In the end, complete text understanding plus complete paraphrasing and inferencing is needed; this is the ultimate goal of natural language understanding (NLU). This goal can only be approximated step by step, phenomenon by phenomenon, module by module, as indicated in this section.

6.2 Restrictive Contexts and Other Precision Problems for Semantic Duplicates

For each problem, an illustrative example and a solution is given.

1. incorrectly collapsed discourse entities, e.g. 'IBM kaufte 2000 andere Firmen.'/literally: 'IBM bought 2000 other companies': '2000' is a year, not a quantifier for 'andere Firmen'. Partially solved by sentence segmentation and parsing of sentences.

2. incorrect phrases: solved by parsing sentences.

3. incorrectly selected reading (wrong reading of an ambiguous word or constituent), e.g. 'Die Birne schmeckt gut.'/'The pear tastes good.' would be irrelevant if the query is about 'Birne' in the sense of 'Glühbirne'/'light bulb'. Solved by the lexicon and the parser.

4. negation: constituent negation (a compatibility test for the fact layer feature in MultiNet suffices); sentence negation, similarly.

5. other modalities, e.g. 'Frankreich erzielt vielleicht höhere Einnahmen.'/'Maybe France achieves higher incomes.' would (probably) be an incorrect duplicate match if the other text reads 'Frankreich erzielt höhere Einnahmen.'/'France achieves higher incomes.' Incompatible modalities are tested in the semantic network representations. Similarly, hypothetical situations must be excluded from answering factual questions, e.g. 'Wenn London in Frankreich liegt, ist Schnee weiß.'/'If London is located inside France, then snow is white.' A sentence that could incorrectly match as a (partial) duplicate of the main clause is: 'London liegt in Frankreich.'/'London is located inside France.' Other examples of modality come from epistemic modals like 'glauben'/'to believe', e.g. 'Er glaubte, dass der Minister vom Präsidenten entlassen wird.'/'He believed that the secretary was dismissed by the president.' vs. 'Der Minister wird vom Präsidenten entlassen.'/'The secretary was dismissed by the president.'

7 Knowledge Acquisition for Deep Duplicate Detectors

The deep duplicate detector SemDupl-deep can only be as good as the underlying knowledge bases. Therefore, the SemDupl project tries

• to consolidate our existing knowledge sources,
• to automatically (or semi-automatically) derive new knowledge bases, and
• to validate these new knowledge bases.

7.1 Hypernyms and Meronyms

A type of textual entailment that is both quite easy to create and to detect is a sentence pair where the two sentences are almost identical, with the exception that certain words (or, on a semantic level, concepts) of the original sentence are replaced by synonyms, hypernyms, or holonyms (the inverses of meronyms). This is a modification which can be made quite easily to obfuscate a plagiarism. For example, 'His father went to Bavaria.' implies 'His father went to Germany.' In the second sentence, 'Bavaria' is replaced by one of its holonyms, 'Germany'. Thus, a large collection of synonyms, hypernyms, and holonyms is quite vital for entailment recognition. Since Wikipedia is often a source for creating plagiarisms or duplicates, synonyms, hypernyms, and holonyms are extracted from Wikipedia using a pattern-based approach (vor der Brück, 2009; vor der Brück, 2010a,c,b, 2011; vor der Brück and Stenzhorn, 2010; Tim vor der Brück, 2010; vor der Brück and Helbig, 2010).

(a1 sub0 a2) ← a1 ([word ","] [cat (art)]? a1)* [word "und (and)"] [cat (art)]? [lemma "ander (other)"] [cat (a)]? a2

Figure 3: Shallow pattern for hyponymy extraction where the premise is given as a regular expression.

For this, a distinction is made between shallow and deep patterns. Both types of patterns consist of a conclusion part of the form (a1 syno a2), (a1 sub0 a2), or (a1 mero a2), which specifies that, if the premise holds, a synonymy/hyponymy/meronymy relationship holds between the constants which are assigned to the variables a1 and a2. The assignments for both variables are determined by matching the premise part to a linguistic structure which is created by analyzing the associated sentence. In the following, we only describe the extraction of hyponyms; the extraction of synonyms and meronyms works quite similarly. The premise of a shallow pattern is given as a regular expression and is matched to the token information given by the parser. This token information contains the following information:

• word: the word form
• category: the grammatical category
• lemmas: a list of possible lemmas
• readings: a list of possible readings
• reading: the reading determined by the word sense disambiguation of the parser
• lemma: the lemma determined by the word sense disambiguation of the parser

If a match is successful, then the variables of the conclusion are replaced with the constants matched to the same variables in the premise. Note that a variable of the premise can be matched several times, either if it appears several times in the premise or if that variable is in the scope of a Kleene-star operator (both are true in the pattern given in Figure 3). In this case, the relations stated by the Cartesian product of all possible variable instantiations are extracted; a simplified sketch of such a pattern is given below.
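As an illustration, here is a drastically simplified shallow pattern in Python regular expression syntax (ours; the real token-level patterns also use the category and lemma constraints of Figure 3, and the names PATTERN and extract_sub0 are hypothetical):

    import re

    # simplified stand-in for the pattern "X(, Y)* und andere Z"
    PATTERN = re.compile(
        r"(?P<a1>\w+), (?:eine?n? )?(?P<a1b>\w+) und andere[nrs]? (?P<a2>\w+)")

    def extract_sub0(sentence):
        m = PATTERN.search(sentence)
        if not m:
            return []
        # Cartesian product: every a1 instantiation paired with the a2 match
        return [(m.group("a1"), m.group("a2")),
                (m.group("a1b"), m.group("a2"))]

    print(extract_sub0(
        "Er besitzt ein Cello, eine Geige und andere Instrumente."))
    # [('Cello', 'Instrumente'), ('Geige', 'Instrumente')]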

The premise of a deep pattern is given as a semantic network graph, whereas the conclusion is the semantic relation to be deduced. The deep pattern is applied by a graph pattern matcher (or an automated theorem prover if axioms are to be employed). An example pattern is given in Equation (1) and Figure 4:

(a1 sub0 a2) ← follows_*itms(c, d) ∧ (d pred a2) ∧ (d prop ander.1.1 (other.1.1)) ∧ (c sub a1)    (1)

Here, follows_*itms(c, d) denotes the fact that c precedes d in the argument list of the function *itms. A sketch of how such a premise can be matched is given below.
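The following sketch (our simplified re-implementation, not the project's matcher) shows how a premise like Equation (1) can be matched against an SN given as a set of (node, relation, node) triples; the *itms argument order is encoded here as an explicit 'follows' edge, which is an assumption of this illustration.

    def match(premise, sn, binding=None):
        # backtracking search over variable bindings; variables start with "?"
        binding = binding or {}
        if not premise:
            yield dict(binding)
            return
        (s, rel, o), rest = premise[0], premise[1:]
        for (s2, rel2, o2) in sn:
            if rel2 != rel:
                continue
            trial, ok = dict(binding), True
            for term, value in ((s, s2), (o, o2)):
                if term.startswith("?"):
                    ok = ok and trial.setdefault(term, value) == value
                else:
                    ok = ok and term == value
            if ok:
                yield from match(rest, sn, trial)

    # SN excerpt for "Der alte Mann besitzt ein Cello und andere Instrumente."
    sn = {("c1", "sub", "cello.1.1"), ("c2", "pred", "instrument.1.1"),
          ("c2", "prop", "ander.1.1"), ("c1", "follows", "c2")}
    premise = [("?c", "follows", "?d"), ("?d", "pred", "?a2"),
               ("?d", "prop", "ander.1.1"), ("?c", "sub", "?a1")]
    for b in match(premise, sn):
        print((b["?a1"], "sub0", b["?a2"]))
    # ('cello.1.1', 'sub0', 'instrument.1.1')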

Figure 4: Deep pattern for hyponymy extraction where the premise is given as an SN. [Diagram omitted in this text version: the premise graph of Equation (1), with nodes c and d in an *ITMS list (follows edge), d connected via PRED to a2 and via PROP to ander.1.1 (other.1.1), c connected via SUB to a1, and the conclusion (a1 SUB a2) shown as a dashed arc.]

Figure 5: Matching a deep pattern to a semantic network. [Diagram omitted in this text version: the SN of a sentence about an old man presently owning a cello and other instruments, with the pattern variables bound as a1 = cello.1.1 and a2 = instrument.1.1.]

Figure 6: Deep pattern for hyponymy extraction where the premise is given as an SN. [Diagram omitted in this text version: a1 and a2 are connected via pred and subm arcs to a *DIFF construction, yielding the conclusion (a1 sub0 a2).]

Figure 5 illustrates the application of the deep pattern that is displayed in Figure 4 to a semantic network representing the sentence 'Der alte Mann besitzt ein Cello und andere Instrumente.'/'The old man owns a cello and other instruments.' The semantic network arcs which are matched with the literals of the pattern are printed in bold. The dashed edge represents the inferred fact (cello.1.1 sub instrument.1.1), with the variable instantiations a1 = cello.1.1, a2 = instrument.1.1. Note that we consider instance-of relations as a special kind of hyponymy as well, and such relations were also extracted by our algorithm. Instance-of relations involving named entities are represented in MultiNet using attribute-value constructions; e.g. 'Bonn ist eine Stadt.'/'Bonn is a city.' is represented by

(a attr b) ∧ (b sub name.1.1) ∧ (b val bonn.0) ∧ (a sub stadt.1.1)    (2)

7.2 Deep vs. Shallow Patterns

On the one hand, a shallow pattern has the advantage that it is also applicable if the parse fails; it only relies on the fact that the tokenization is successful. On the other hand, deep patterns are still applicable if there are additional constituents or subclauses between hyponyms and hypernyms, which usually cannot be covered by shallow patterns. The following sentences (the first two originating from the Wikipedia corpus) are typical examples where the hyponymy relationship could only be extracted using deep patterns (hyponym and hypernym are set in bold face in the original):

'Die Stadt besitzt eine romanische Kathedrale aus dem 12. Jahrhundert, viele andere romanische Kirchen und einige Museen.'/'The city contains a Romanesque cathedral from the 12th century, a lot of other Romanesque churches and some museums.'

Another advantage of deep patterns is illustrated by the following sentence: 'Auf jeden Fall sind nicht alle Vorfälle aus dem Bermudadreieck oder aus anderen Weltgegenden vollständig geklärt.'/'In any case, not all incidents from the Bermuda Triangle or from other world areas are fully explained.' From this sentence, a hyponymy pair cannot be extracted by the Hearst pattern 'X or other Y' (Hearst, 1992). The application of this pattern fails due to the word 'aus'/'from', which cannot be matched. To extract this relation by means of shallow patterns, an additional pattern would have to be introduced. This could also be the case if syntactic patterns were used instead, since the coordination of 'Bermudadreieck'/'Bermuda Triangle' and 'Weltgegenden'/'world areas' is not represented in the syntactic constituency tree but only on the semantic level. (Note that some dependency parsers do a semantic-oriented normalization as well.)

Figure 7: Application of a deep pattern to an SN representing the sentence 'He sells all common string instruments except celli.' Arcs matched with the pattern premise are printed in bold. The dashed arc is inferred by application of the pattern. [Diagram omitted in this text version: cello.1.1 and string_instrument.1.1 take part in a *DIFF construction (arcs pred and subm), from which the arc (cello.1.1 sub0 string_instrument.1.1) is inferred; further nodes represent he.1.1, common.1.1, sell.1.1, and the present tense.]

Thus, the same deep pattern can be used for the hyponymy extraction in this sentence as for extracting the hyponymy relationship from the phrase 'das Bermuda-Dreieck oder andere Weltgegenden'/'the Bermuda Triangle or other world areas'. Furthermore, several syntactic or surface representations are frequently mapped to the same semantic network, as in the sentences:

1. 'Er besitzt ein Cello, eine Geige und andere Instrumente.'/'He owns a cello, a violin and other instruments.'

2. 'Er besitzt eine Geige, ein Cello sowie andere Instrumente.'/'He owns a violin, a cello as well as other instruments.'

Thus, the hyponymy relationships that cellos and violins are instruments can be extracted by the application of the same deep pattern. However, to extract the same information by the application of shallow or even syntactic patterns, two different patterns would have to be defined. Another example:

• 'Er verkauft alle gebräuchlichen Streichinstrumente außer Celli.'/'He sells all common string instruments except celli.'
• 'Er verkauft alle gebräuchlichen Streichinstrumente bis auf Celli.'/'He sells all common string instruments aside from celli.'
• 'Er verkauft alle gebräuchlichen Streichinstrumente ausgenommen Celli.'/'He sells all common string instruments excluding celli.'

All three sentences have different dependency trees (tested by applying the Stanford Dependency Parser (de Marneffe and Manning, 2008) to the English translations). However, the SN representations of all three sentences are identical (depicted in Figure 7), i.e., the pattern given in Figure 6 can be applied to extract the hyponymy relation (cello.1.1 sub0 string_instrument.1.1) from all three sentences, while three different syntactic patterns would have to be designed if the same relation were to be extracted from the dependency parse. The same holds if surface representations are used. Thus, the use of a deep semantic representation reduces the number of required patterns in comparison to a surface-based or dependency-based representation. A further advantage of the deep semantic approach consists in the fact that person names are already identified by the parser, which simplifies the extraction of hyponyms (instance-of relations) relating to persons. Finally, the deep approach allows the usage of logical axioms, which can make the patterns more generally applicable.

7.3 Validation of Relation Hypotheses

Naturally, not all hypotheses extracted by the patterns are correct. Therefore, a validation component is required. A two-step validation is followed here. In the first step, the ontological sorts and semantic features (Helbig, 2006) of the two concepts involved in a relation hypothesis are compared. Only if the sort/feature combination is admissible is the hypothesis stored in the database. In the second step, several statistical features are calculated and employed to derive a confidence score (vor der Brück, 2010a; vor der Brück and Helbig, 2010).

Let us look at the first validation step more closely. First we give a short overview of ontological sorts and semantic features. The ontological sorts (MultiNet defines 45 sorts) form a taxonomy. In contrast to other taxonomies, ontological sorts are not necessarily lexicalized, i.e., they do not necessarily denote lexical entries. The following list shows a small selection of ontological sorts below the sort object (o):

• Objects
  - Concrete objects
    * Discrete objects: e.g. 'chair'
    * Substances: e.g. 'milk', 'honey'
  - Abstract objects: e.g. 'race', 'robbery'

Semantic features are certain properties which can be set, unset, or underspecified. The following semantic features are defined:

• animal
• animate
• axial
• artif (artificial)
• geogr (geographic)
• human
• info
• instit (institution)
• instru (instrument)
• legper (legal person)
• mental
• method
• movable
• potag (potential agent)
• spatial
• thconc (theoretical concept)

Sample characteristics for the concept bear.1.1 (the suffix .1.1 denotes the reading numbered .1.1 of the given word): discrete object; animal +, animate +, artif -, human -, spatial +, thconc -, ...

Each hypothesis stored in the database has passed a consistency check using ontological sorts and features. For hyponymy hypotheses, the ontological sorts/features of the hypernym must subsume the sorts/features of the hyponym. For synonymy hypotheses, ontological sorts and features must be identical for the two compared concepts. For meronymy hypotheses, the allowed combinations are learned by a tree-augmented naïve Bayes algorithm. Examples (a sketch of the subsumption test is given after the list):

• bear.1.1 (animal:+) can be no hyponym/synonym of house (animal:-)
• milk.1.1 (substance) can be no hyponym/synonym of house (discrete object)
• house.1.1 (movable:-) can be no hyponym/synonym of chair.1.1 (movable:+)
• living_being.1.1 (animal:underspecified) can be a hypernym but no hyponym/synonym of bear.1.1 (animal:+)
• child.1.1 (human:+, etype:0) can be no meronym of man.1.1 (human:+, etype:0)
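A minimal sketch of the subsumption test behind these examples (ours; feature values are "+", "-", or None for underspecified, and the feature dictionaries are abbreviated illustrations):

    def may_subsume(hyper_feats, hypo_feats):
        # every feature specified for the hypernym must agree with the hyponym
        return all(hypo_feats.get(f) == v
                   for f, v in hyper_feats.items() if v is not None)

    bear = {"animal": "+", "animate": "+", "artif": "-", "human": "-"}
    house = {"animal": "-", "artif": "+"}
    living_being = {"animal": None, "animate": "+"}

    print(may_subsume(house, bear))         # False: bear is no hyponym of house
    print(may_subsume(living_being, bear))  # True: a possible hypernym of bear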

Figure 8: System architecture of SemQuire. [Diagram omitted in this text version: a text is analyzed by WOCADI (using HaGenLex) into a token list and an SN; deep patterns are applied to the SN and shallow patterns to the token list; the resulting hypotheses pass a validation filter and a validation scoring step before being stored in the KB.]

7.4 System Architecture: Relation Extraction

To find meronymy relations in a text, this text is processed by our knowledge acquisition tool called SemQuire (the name is derived from 'semantically acquire knowledge'). A compact sketch of this loop is given after the list.

1. The sentences of the German Wikipedia are analyzed by the deep linguistic parser WOCADI. As a result of the parsing, a token list, a syntactic dependency tree, and a semantic network are created.

2. Shallow patterns, consisting of a regular expression in the premise, are applied to the token lists, and deep patterns are applied to the SNs to generate proposals for meronymy relations.

3. A validation tool using ontological sorts and semantic features checks whether the proposals are technically admissible, to reduce the amount of data stored in the knowledge base (KB).

4. If the validation is successful, the meronymy candidate pair is added to the KB. Steps 2-4 are repeated until all sentences are processed.

5. Each meronymy candidate pair in the KB is assigned a confidence score estimating the likelihood of its correctness.

6. The correct hyponymy/meronymy subrelation is determined. (No further subrelation exists for synonymy.)
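The loop can be summarized by the following compact sketch (ours; the callables passed in are stand-ins for WOCADI, the pattern matchers, and the validation components, and the name semquire is our label for the pipeline):

    def semquire(sentences, parse, apply_deep, apply_shallow,
                 admissible, score):
        kb = []
        for sentence in sentences:                               # step 1
            tokens, sn = parse(sentence)
            hypotheses = apply_deep(sn) + apply_shallow(tokens)  # step 2
            kb += [h for h in hypotheses if admissible(h)]       # steps 3-4
        return [(h, score(h)) for h in kb]                       # step 5

    print(semquire(
        ["Bonn ist eine Stadt."],
        parse=lambda s: (s.split(), {"sn-of": s}),
        apply_deep=lambda sn: [("bonn.0", "sub", "stadt.1.1")],
        apply_shallow=lambda tokens: [],
        admissible=lambda h: True,
        score=lambda h: 0.9))
    # [(('bonn.0', 'sub', 'stadt.1.1'), 0.9)]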

8 Annotation GUI

A GUI-based tool called SemChecker is provided for annotating the extracted hyponym/meronym/synonym hypotheses for correctness (Danninger, 2009). SemChecker is primarily an annotation tool for relation hypotheses but can also be used to identify inconsistent hypotheses. SemChecker displays the hypotheses contained in the database, including their associated database attributes. In the upper part of the window in Figure 9, the contents of the database table RELATION are given. If one of these entries is selected, the associated entries of the SOURCE table, which specify the source of that entry, are displayed. Additionally, by means of the attributes FILENAME and SENTENCE_INDEX, the sentence from which the hypothesis was extracted is identified and shown. This sentence is quite vital for the annotator to decide whether a hypothesis is correct or not. This is especially the case for technical terms which are not generally known. Without the text, the annotator is frequently not able to decide about the hypothesis' correctness. In addition, this sentence is often quite useful to identify shortcomings of the relation extraction process and therefore helps to further improve this process. Furthermore, several bits of information from the lexicon are shown for the concepts involved in the relation hypothesis, including ontological sorts and semantic features as well as an example sentence which clarifies the intended reading. In the shown case, the mistake that the wrong reading was chosen for one of the words of the hypothesis can be identified easily. Usually such errors are based on the fact that a reading was incorrectly chosen by the word sense disambiguation of the parser.

Figure 9: Screenshot of the annotation tool. [Screenshot omitted in this text version.]

In the middle of the graphical user interface, there are several buttons whose functions are defined as follows:

• view concept1: Display the first concept (hyponym/meronym/first synonym) in LIA by employing the LIA remote control. For this to work, LIA has to be started beforehand.
• view concept2: Display the second concept (hypernym/holonym/second synonym) in LIA.
• view primary1: Display the associated reading belonging to the base word (primary word) of the first concept, if existing, in LIA.
• view primary2: Display the associated reading belonging to the base word (primary word) of the second concept, if existing, in LIA.
• score features: Display the features which were used to calculate the confidence score for the selected hypothesis. These values must not be confused with the semantic features of MultiNet.
• axioms: Display the axioms that were applied to derive a contradiction for this hypothesis. This is only possible if the automated theorem prover was able to derive a contradiction for the selected hypothesis.
• filter: Activate a filter, which causes only those hypotheses to be displayed whose database attributes fulfill the filtering conditions. SemChecker supports filtering according to equality or interval checks for numerical values.

Additionally, SemChecker provides the possibility to sort the displayed data according to the database attributes. For this, the user just selects the table border above the desired column.

Figure 10: Graphical user interface of SemChecker for displaying the result of the logical validation, which marks the entry sub(land.1.1, provinz.1.1) as potentially defective. [Screenshot omitted in this text version.]

Beside its use as an annotation tool, SemChecker can also be used to show inconsistent relation hypotheses. Entries which caused a contradiction are marked

by an attention symbol. Additionally, if the button axioms is selected, SemChecker displays all axioms which were required for deriving the contradiction proof for the selected hypothesis. This is illustrated in Figure 10.

9 Extraction of Entailments

In addition to semantic relations, a collection of entailments can be useful for plagiarism or duplicate detection. An entailment is a relationship between two expressions which holds if the truth of the first expression (the base expression/text) implies the truth of the other (the hypothesis). Three different approaches for entailment extraction were devised; they are described in the following three subsections.

9.1 Extracting Entailments Employing a Search Engine

We basically follow the approach of Ravichandran and Hovy (2002), which identifies entailments by collecting texts with identical noun phrases, using the assumption that such texts often contain similar contents. Consider for example the noun phrases John McEnroe, Björn Borg, Wimbledon, 1980. Sentences containing such expressions could be:

• John McEnroe lost against Björn Borg in the final of Wimbledon 1980. or
• Björn Borg beat John McEnroe in the final of Wimbledon 1980.

Thus, if a lot of such texts are examined, the entailments

• X beat Y → Y lost against X and
• Y lost against X → X beat Y

can eventually be learned. The entailment extraction is done in the following steps (a small sketch of the last step is given after the list):

• SNs are created for all texts which contain the given noun phrases.
• The texts are extracted by web search engine queries.
• Nominal phrases which are employed for the search are replaced by variables.
• Frequently occurring substructures S in these SNs are learned following the Minimum Description Length principle (Cook and Holder, 1994).
• Entailments are created by building the Cartesian product over S: S × S, i.e., the first component of a pair s ∈ S × S represents the base expression, the second the hypothesis.

Entailments extracted by this approach are given in Appendix A.
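A tiny sketch of the last step (ours; the substructure strings are stand-ins for learned SN substructures): once the set S of frequent substructures has been learned, candidate entailments are simply all ordered pairs over S.

    from itertools import product

    S = ["X beat Y", "Y lost against X"]   # stand-ins for learned substructures
    candidates = [(base, hypo) for base, hypo in product(S, S)
                  if base != hypo]
    print(candidates)
    # [('X beat Y', 'Y lost against X'), ('Y lost against X', 'X beat Y')]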

9.2 Extracting Entailments from SNs Basically Following Ravichandran and Hovy

The approach described in Section 9.1 profits from a large corpus (the part of the web indexed by the search engine), and the precision is quite good due to the fact that the method is semi-automatic and a human user has to provide the search tuples. However, this can also be a disadvantage, since human interaction is required and a large amount of labor is needed to extract a high number of entailments. Furthermore, the approach is not fully semantics-oriented, since surface strings are used for the search engine queries. Therefore, we devised an additional approach which is fully automatic and is also based on the approach of Ravichandran and Hovy (2002). In this approach, we extract entailments from a newsfeed corpus in the form of semantic networks. The semantic networks were derived by a deep syntactico-semantic analysis (the WOCADI parser). The search tuples were determined by corpus statistics following the method introduced by Szpektor et al. (2004) called TE/ASE.

Table 1: Assumed equivalent formulas with given confidence score.

Formula 1                                   Formula 2                                  Score
compl1 (holen) nach.dircl                   compl1 (werden) an.loc (institut) in.loc   1.000
compl1 (holen) nach.dircl                   compl1 (niederlassen) in.compl2            1.000
compl1 (holen) nach.dircl                   compl1 (gehen) nach.dircl                  0.219
adj (studieren) compl1                      in.loc (promovieren) compl2                0.106
compl1 (kommen) auf.circ (Weg) zu.compl2    compl1 (zurückkehren) zu.purp              1.000

The semantic networks were derived by a deep syntactico-semantic analysis (the WOCADI parser). The search tuples were determined by corpus statistics, following the method called TE/ASE introduced by Szpektor et al. (2004).

9.3 Extracting Entailments Basically Following Lin and Pantel

We further tested a different entailment extraction approach (Zauner, 2009), which is based on the method of Lin and Pantel (2001). For this method, it is not necessary to construct any search tuples as for the two other approaches; constructing such tuples is not trivial and results in either much manual work or considerable computation if done automatically. However, the other two approaches create several intermediate results which can be used to manually influence the final result, which is not possible here. This approach extracts all paths from a dependency tree which connect two nouns with each other. These nouns are referred to as slot fillers. Two text passages are considered similar (and therefore paraphrases) if the associated paths connect essentially the same nouns with the same frequency. For this purpose, every path is assigned two vectors which specify the number of occurrences of the slot filler expressions in the text corpus. For the determination of the similarity of the vector pairs, a distance function is defined in which the occurrence of usually rare nouns has more influence on the similarity calculation than that of more frequent ones. Note that this algorithm is quite similar to the synonymy/hyponymy recognition by textual context analysis (Cimiano et al., 2005). Instead of dependency trees, this approach was applied to SNs following the MultiNet formalism. Furthermore, a hybrid was realized which is based on dependency trees but replaces the edge labels of the trees by the semantic relations. Some of the extracted entailments are shown in Table 1.
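As an illustration of the path comparison, here is a minimal sketch; the PMI-style weighting is our reading of the "rare nouns count more" property, not necessarily the exact distance function used in the system. Filler counts per path slot are assumed to be given as Counter-like mappings from noun to count:

    import math

    def slot_similarity(fillers1, fillers2, word_freq, corpus_size):
        # Weight each slot filler by (clamped) pointwise mutual information,
        # so that rare nouns influence the similarity more than frequent ones.
        def weights(fillers):
            n = sum(fillers.values())
            return {w: max(0.0, math.log((c / n) / (word_freq[w] / corpus_size)))
                    for w, c in fillers.items()}
        w1, w2 = weights(fillers1), weights(fillers2)
        shared = sum(w1[w] + w2[w] for w in w1.keys() & w2.keys())
        norm = sum(w1.values()) + sum(w2.values())
        return shared / norm if norm else 0.0

    def path_similarity(path1, path2, word_freq, corpus_size):
        # Two paths are similar if both of their slots (the nouns at each
        # end) are filled by essentially the same nouns.
        sim_x = slot_similarity(path1["X"], path2["X"], word_freq, corpus_size)
        sim_y = slot_similarity(path1["Y"], path2["Y"], word_freq, corpus_size)
        return math.sqrt(sim_x * sim_y)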

10 The Deep Duplicate Detector SemDupl-deep

To adequately handle linguistic phenomena, i.e., to identify the paraphrase phenomena discussed in Section 6.1 and not to get disturbed by the non-paraphrase phenomena discussed in Section 6.2, a deep semantic approach to duplicate detection has been developed. It integrates existing tools for producing semantic representations for text: the WOCADI parser and the CORUDIS coreference resolver. In an indexing phase, all texts in the base corpus are transformed into semantic representations by WOCADI and CORUDIS (Hartrumpf, 2003). In the detection step, the duplicate candidate is analyzed in the same way as the texts of the base corpus. For each sentence in the candidate, a semantic search query is sent to a retrieval system that contains all the semantic representations for the base corpus. Matches are collected, and after all sentences of the candidate have been investigated, scores are calculated from the results for the text sentences. The overlap score averaged over all candidate sentences turned out to be a good text-level score. The individual overlap score is calculated by the retrieval system, based on the distances of related concepts and the distance between the left-hand side and the right-hand side of inference rules (a sketch of this detection loop is given at the end of this section).

The power and potential of this detector was illustrated by some examples in Section 6. Some more real-world examples from the SemDupl corpus are presented next. The following four examples can be detected as semantically equivalent by a unit conversion rule between meters and kilometers and abbreviation definitions for `m' and `km': `Die Frau lief 10 Kilometer/10000 Meter/10 km/10000 m.'/`The woman ran 10 kilometers/10000 meters/10 km/10000 m.' Another example that shows the inferential power of SemDupl-deep in the area of verb meanings are the three following sentences: `Der alte Parlamentarier stellt einen Antrag auf eine Beratung.'/`The old parliamentarian files an application for a debate.', a variant with the support verb construction (SVC) replaced by a full verb: `Der alte Parlamentarier beantragt eine Beratung.'/`The old parliamentarian applies for a debate.', and finally a variant with an infinitive clause replacing the NP taking the object role of the verb: `Der alte Parlamentarier beantragt zu beraten.'/`The old parliamentarian applies to debate.' (literally translated). All three semantic duplicates can be linked; this linking successfully used knowledge about support verb constructions (`einen Antrag stellen'/`to file an application' and `beantragen'/`to apply') and relationships between verbs and their nominalizations (`beraten'/`to debate' and `Beratung'/`debate').

From the independent-translations subcorpus, the following is an illustrative example of what the deep method can detect: `Man weiß ferner, dass sie noch im Besitze des Dokumentes ist.'/`Furthermore, one knows that she still holds possession of the document.' (corpus text brief1_01) vs. `Man weiß ebenfalls, daß sich das Schriftstück noch in ihrem Besitz befindet.'/`Likewise, one knows that the document is still in her possession.' (corpus text brief2_01).

To evaluate the inferential aspects of SemDupl-deep, the English test set from RTE-2 (Recognizing Textual Entailment Challenge) was translated to German by a native German speaker. Then, the translations of the IE and IR parts were checked by an English native speaker in order to ensure that no information from the English versions was missing or mistranslated, and finally revised by a second German native speaker.
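The detection loop can be sketched as follows; semantic_search is a hypothetical stand-in for the retrieval system and is assumed to return, for one sentence representation, an overlap score per base-corpus text:

    from collections import defaultdict

    def detect_duplicates(candidate_sentence_sns, semantic_search, threshold=0.5):
        # Sum up, per base-corpus text, the overlap scores of all
        # candidate sentences (unmatched sentences contribute 0).
        overlaps = defaultdict(float)
        for sn in candidate_sentence_sns:
            for text_id, score in semantic_search(sn).items():
                overlaps[text_id] += score
        # The text-level score is the overlap averaged over all
        # candidate sentences.
        n = len(candidate_sentence_sns)
        return {text_id: total / n
                for text_id, total in overlaps.items() if total / n >= threshold}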

11 Combination

The architecture of the Combiner is illustrated in Figure 11. First, SemDupl-shallow, SemDupl-quick, and SemDupl-deep produce XML output containing the feature values for a given training set of annotated text pairs. The annotation is 1 for duplicates and 0 for non-duplicates.


Figure 11: System architecture of the Combiner.


Table 2: Confusion matrix for WCopyfind. D=duplicate, ND=no duplicate, PD=predicted duplicate, PND=predicted non-duplicate.

        PD      PND
D       39      215
ND       8    21645

The XML output is combined by the XMLCombiner, which produces a single LIBSVM file as the result. This file is used by the support vector machine LIBSVM (Chang and Lin, 2001) to create a model file. Afterwards, a classification for a previously unseen data set can be done. Again, SemDupl-shallow, SemDupl-quick, and SemDupl-deep produce XML output for this dataset. This output is combined by svm_classifier.out, which additionally employs the model file to classify each instance of this dataset. The classification output consists of a list of 4-tuples (t1, t2, r, p) ∈ T × T × {0, 1} × [0, 1] where

• T: the text corpus
• t1: a text from the text corpus
• t2: the text t1 is compared with
• r: the classification value, which can be duplicate (1) or non-duplicate (0)
• p: the probability estimate that t1 and t2 are actually duplicates of each other

Note that we use the single features from the ShallowChecker and not the total score as determined by the decision tree. A sketch of this classification step is given below.
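A minimal sketch of this classification step; the per-pair feature rows are assumed to be parsed from the three XML outputs already, and scikit-learn's SVC (which wraps LIBSVM) stands in for the svm-train/svm_classifier.out pipeline:

    from sklearn.svm import SVC

    def classify_pairs(train_X, train_y, test_X, test_pairs):
        # train_X/test_X: combined feature rows (deep, quick, shallow) per pair;
        # train_y: 1 = duplicate, 0 = non-duplicate; test_pairs: (t1, t2) ids.
        model = SVC(probability=True)  # LIBSVM-backed SVM with prob. estimates
        model.fit(train_X, train_y)
        r = model.predict(test_X)
        p = model.predict_proba(test_X)[:, 1]
        # Emit the 4-tuples (t1, t2, r, p) described above.
        return [(t1, t2, int(ri), float(pi))
                for (t1, t2), ri, pi in zip(test_pairs, r, p)]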

12 Evaluation

The three individual detectors (shallow, quick, deep) as well as the combined system have been evaluated on the SemDupl corpus, which is annotated for duplicates. For each text pair and each approach, a set of feature values is generated, where high values indicate that the texts are duplicates. These values are combined by the support vector machine classifier WLSVM (EL-Manzalawy and Honavar, 2005), which is based on LIBSVM (Chang and Lin, 2001), as described in the last section. For training this classifier, the text pairs of our corpus were used (in 10-fold cross-validation). The confusion matrices calculated for the shallow and deep approaches are shown in Table 3 and Table 4. Furthermore, the knowledge extraction process was evaluated. Hyponyms, meronyms, synonyms, and entailments were automatically extracted. Table 7 shows the number of extracted relation hypotheses, including the number of hypotheses which were manually annotated and the number annotated as correct. The precision depending on the validation score is given in Figure 12. Part of the extracted knowledge is listed in Appendix A. Furthermore, the number of source code lines of the individual systems was determined; it is given in Table 8.
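A minimal sketch of this evaluation, assuming the combined feature vectors and gold labels are already loaded (again with scikit-learn's LIBSVM-backed SVC as a stand-in for WLSVM):

    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import confusion_matrix
    from sklearn.svm import SVC

    def cross_validated_confusion(features, labels):
        # 10-fold cross-validated predictions over all text pairs.
        predicted = cross_val_predict(SVC(), features, labels, cv=10)
        # Rows = actual, columns = predicted, in sorted label order
        # (0 = non-duplicate, 1 = duplicate).
        return confusion_matrix(labels, predicted)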

Table 3: Confusion matrices for shallow approaches. SQ=SemDupl-quick, SS=SemDupl-shallow, D=duplicate, ND=no duplicate, PD=predicted duplicate, PND=predicted non-duplicate.

        SQ                SS                SQ+SS
        PD      PND       PD      PND       PD      PND
D       97      157      200       54      201       53
ND      16    21637       14    21639       13    21640

Table 4: Confusion matrices for deep approaches. D=duplicate, ND=no duplicate, PD=predicted duplicate, PND=predicted non-duplicate, SD=SemDupl-deep.

        SD                SD+SQ             SD+SQ+SS
        PD      PND       PD      PND       PD      PND
D       42      212      106      148      202       52
ND       5    21648       16    21637       11    21642

Table 5: F-Measure, precision, recall, and accuracy for shallow approaches.

Measure     WCopyfind   SQ      SS      SQ+SS
F-Measure   0.259       0.529   0.855   0.859
Precision   0.830       0.858   0.935   0.939
Recall      0.154       0.382   0.787   0.791
Accuracy    0.990       0.992   0.997   0.997

Table 6: F-Measure, precision, recall, and accuracy for deep approaches.

Measure     SD      SD+SQ   SD+SQ+SS
F-Measure   0.279   0.564   0.865
Precision   0.894   0.869   0.948
Recall      0.165   0.417   0.795
Accuracy    0.990   0.993   0.997
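As a cross-check, the measures in Tables 5 and 6 follow directly from the confusion matrices in Tables 2 to 4; for example, for the combined system SD+SQ+SS: precision = 202/(202 + 11) ≈ 0.948, recall = 202/(202 + 52) ≈ 0.795, and F-measure = 2 · 0.948 · 0.795/(0.948 + 0.795) ≈ 0.865.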


Table 7: Extracted semantic and lexical relation hypotheses.

Type of Relation   Hypotheses   Annotated   Annotated+Correct
Synonymy               88 624         996                 164
Hyponymy              391 153      29 523                7617
Meronymy            1 483 701      49 839                1421
Entailments           150 000         413                  91


Figure 12: Precision of hyponymy hypotheses depending on the score.

13 Evaluation Interpretation and Conclusions

In order to compare the results of the combined system with plagiarism detection software, WCopyfind was evaluated on our text corpus; see Table 2 and Table 5 for details. Tables 5 to 7 show the results of our approach. It can be seen that each approach of our system generates significantly better results in terms of F-measure, precision, and recall than WCopyfind (determined with confidence intervals of level 99%). Similarly (at 95% confidence), the best combined shallow+deep approach outperforms the best shallow approach. Furthermore, F-measure, precision, and recall of the combined shallow+deep system are considerably higher than the associated values of the combination of the two shallow systems. Note that the SemDupl-deep approach can show its full potential only on more professionally constructed duplicates; for example, on the semdupl-units subcorpus (which was excluded from the evaluation because it was constructed), it performs much better than any shallow system.

Table 8: Lines of source code of the SemDupl components.

Component         Number of lines of source code
SemDupl-quick                            4 000
SemDupl-deep                            19 000
SemDupl-shallow                          8 000
SemQuire                               100 000

The SemDupl system shows that a deep, semantic approach to duplicate detection pays off. The combination of deep and shallow methods outperforms each single method. In future work, we want to improve the coverage of the deep approach by further extending its knowledge bases by (semi-)automatic means.


A Extracted Knowledge

Part of the extracted knowledge is given in the tables below. Table 9 contains a selection of hyponymy hypotheses, Table 10 contains highly scored meronymy hypotheses, and synonymy hypotheses are listed in Table 11. Finally, some entailment examples are shown in Table 12. It must be remarked that the automatic extraction of entailments from texts is one of the most ambitious tasks in automatic knowledge acquisition. Therefore, this task is far from being completed and has to be considered a research field in itself. Further research should be carried out in the following directions:

• Including new methods into the automatic acquisition of entailments.
• Enlarging the base of textual corpora for this task.
• Improving the computation of the scores judging the quality of the entailments.
• Refining the logical validation of entailments and finding generalizations over the achieved results (generation of axiom schemata).


Table 9: Hypernymy relations with high confidence score.

Relation                                                   Score
sub(silber.1.1, metall.1.1, categ, situa)                  0.9737
sub(bronze.1.1, werkstoff.1.1, categ, situa)               0.9696
sub(wasserdampf.1.1, gas.1.1, categ, situa)                0.9696
sub(trompete.1.1, instrument.1.1, categ, situa)            0.9591
sub(gold.1.1, metall.1.1, categ, situa)                    0.9589
sub(blei.1.1, metall.1.1, categ, situa)                    0.9572
sub(gitarre.1.1, instrument.1.1, categ, situa)             0.9567
sub(blut.1.1, körperflüssigkeit.1.1, categ, situa)         0.9548
sub(physik.1.1, naturwissenschaft.1.1, categ, situa)       0.9546
sub(aluminium.1.1, leichtmetall.1.1, categ, situa)         0.9521
sub(schmuck.1.1, gegenstand.1.1, categ, situa)             0.9518
sub(kunststoff.1.1, material.1.1, categ, situa)            0.9504
sub(buch.1.1, schrift.1.1, categ, situa)                   0.9486
sub(präsident.1.1, staatsoberhaupt.1.1, categ, situa)      0.9485
sub(kruzifix.1.1, element.1.1, categ, situa)               0.9458
sub(tierarzt.1.1, person.1.1, categ, situa)                0.9458
sub(vesper.1.1, ding.1.1, categ, situa)                    0.9458
sub(sultan.1.1, person.1.1, categ, situa)                  0.9458
sub(knochen.1.1, gegenstand.1.1, categ, situa)             0.9458
sub(aluminium.1.1, stoff.1.1, categ, situa)                0.9458
sub(milchstraße.1.1, objekt.1.1, categ, situa)             0.9458
sub(dolmen.1.1, struktur.1.1, categ, situa)                0.9458
sub(gewerkschaft.1.1, bündnis.1.1, categ, situa)           0.9458
sub(heizöl.1.1, stoff.1.1, categ, situa)                   0.9458
sub(energie.1.1, stoff.1.1, categ, situa)                  0.9458
sub(editor.1.1, programm.1.1, categ, situa)                0.9458
sub(vollholz.1.1, werkstoff.1.1, categ, situa)             0.9458
sub(erbauer.1.1, person.1.1, categ, situa)                 0.9458
sub(honig.1.1, stoff.1.1, categ, situa)                    0.9458
sub(regierung.1.1, institution.1.1, categ, situa)          0.9458
sub(gummi.1.1, kunststoff.1.1, categ, situa)               0.9458
sub(körperschaft.1.1, institution.1.1, categ, situa)       0.9458
sub(erde.1.1, existenzbereich.1.1, categ, situa)           0.9458
sub(alkohol.1.1, droge.1.1, categ, situa)                  0.9452
sub(pflanzenöl.1.1, flüssigkeit.1.1, categ, situa)         0.9425
sub(zoll.1.1, behörde.1.1, categ, situa)                   0.9424
sub(klavier.1.1, instrument.1.1, categ, situa)             0.9415
sub(kupfer.1.1, material.1.1, categ, situa)                0.9377
sub(biologie.1.1, naturwissenschaft.1.1, categ, situa)     0.9375
sub(vater.1.1, person.1.1, categ, situa)                   0.9368

Table 9: Hypernymy relations with high confidence score (ctd.).

Relation                                                       Score
sub(kleidung.1.1, ding.1.1, categ, situa)                      0.9356
sub(phosphor.1.1, element.1.1, categ, situa)                   0.9356
sub(schule.1.1, gebäude.1.1, categ, situa)                     0.9352
sub(schlagzeug.1.1, instrument.1.1, categ, situa)              0.9318
sub(methan.1.1, treibhausgas.1.1, categ, situa)                0.9315
sub(frucht.1.1, pflanzenteil.1.1, categ, situa)                0.9298
sub(englisch.2.1, sprache.1.1, categ, situa)                   0.9291
sub(käse.1.1, milchprodukt.1.1, categ, situa)                  0.9286
sub(chemie.1.1, naturwissenschaft.1.1, categ, situa)           0.9281
sub(holz.1.1, stoff.1.1, categ, situa)                         0.9281
sub(aluminiumoxid.1.1, material.1.1, categ, situa)             0.9279
sub(wysiwyg-texteditor.1.1, programm.1.1, categ, situa)        0.9279
sub(cayennepfeffer.1.1, gewürz.1.1, categ, situa)              0.9279
sub(carolinalilie.1.1, pflanze.1.1, categ, situa)              0.9279
sub(havelland.2.1, landschaft.1.1, categ, situa)               0.9279
subs(karnevalssession.1.1, veranstaltung.1.1, categ, situa)    0.9279
sub(exoskeletts.1.1, tier.1.1, categ, situa)                   0.9279
sub(zypressenwolfsmilch.1.1, pflanze.1.1, categ, situa)        0.9279
sub(qawwali.1.1, musikstil.1.1, categ, situa)                  0.9279
sub(hammaburg.1.1, kirche.1.1, categ, situa)                   0.9279
sub(verbandsvorsitzend.1.1, mitglied.1.1, categ, situa)        0.9279
sub(untergrund.1.2, bestandteil.1.1, categ, situa)             0.9279
sub(knaanischen.1.1, sprache.1.1, categ, situa)                0.9279
sub(jugendbildungswerk.1.2, organisation.1.1, categ, situa)    0.9279
sub(wrangelschen.1.1, gebäude.1.1, categ, situa)               0.9279
sub(kriebelmücke.1.1, insekt.1.1, categ, situa)                0.9279
sub(dozentin.1.1, frau.1.1, categ, situa)                      0.9279
sub(betelwachs.1.1, produkt.1.1, categ, situa)                 0.9279
sub(beteiligungshaushalt.1.1, form.1.1, categ, situa)          0.9279
sub(neokonservatismus.1.1, form.1.1, categ, situa)             0.9279
sub(tongji-universität.1.1, einrichtung.1.2, categ, situa)     0.9279
sub(phosphor.1.1, mineralien.1.1, categ, situa)                0.9279
sub(plektrum.1.1, musiker.1.1, categ, situa)                   0.9279
sub(sünderin.1.1, film.1.1, categ, situa)                      0.9279
sub(senf.1.1, gewürz.1.1, categ, situa)                        0.9261
sub(aluminium.1.1, metall.1.1, categ, situa)                   0.9249
sub(gemüse.1.1, produkt.1.1, categ, situa)                     0.9231

Table 10: Meronymy hypotheses with high confidence score.

Relation                                                                        Score
sub(x, haut.1.1, categ, categ) ∧ pars(x, fisch.1.1, categ, proto)               1.0000
sub(x, dach.1.1, categ, categ) ∧ pars(x, haus.1.1, categ, proto)                0.9999
sub(x, motor.1.1, categ, categ) ∧ pars(x, fahrzeug.1.1, categ, proto)           0.9993
pars(gepäckwagen.1.1, zug.1.1, proto, proto)                                    0.9990
sub(x, dach.1.1, categ, categ) ∧ pars(x, auto.1.1, categ, proto)                0.9982
sub(x, kopf.1.1, categ, categ) ∧ pars(x, mensch.1.1, categ, proto)              0.9975
sub(x, ufer.1.1, categ, categ) ∧ pars(x, fluss.1.1, categ, proto)               0.9973
sub(x, schwanz.1.1, categ, categ) ∧ pars(x, tier.1.1, categ, proto)             0.9973
sub(x, turm.1.1, categ, categ) ∧ pars(x, burg.1.1, categ, proto)                0.9971
sub(x, kind.1.1, categ, categ) ∧ pars(x, paar.3.1, categ, proto)                0.9967
sub(x, wagen.1.1, categ, categ) ∧ pars(x, zug.1.1, categ, proto)                0.9964
sub(x, fakultät.1.1, categ, categ) ∧ subm(x, hochschule.1.1, categ, ktype)      0.9964
sub(x, zeiger.1.1, categ, categ) ∧ pars(x, uhr.1.1, categ, proto)               0.9963
sub(x, kopf.1.1, categ, categ) ∧ pars(x, fisch.1.1, categ, proto)               0.9962
sub(x, dach.1.1, categ, categ) ∧ pars(x, saal.1.1, categ, proto)                0.9962
pars(fassade.1.1, querhaus.1.1, categ, situa)                                   0.9962
sub(x, dach.1.1, categ, categ) ∧ pars(x, hauptschiff.1.1, categ, proto)         0.9962
sub(x, dach.1.1, categ, categ) ∧ pars(x, seitenschiff.1.1, categ, proto)        0.9962
sub(x, giebel.1.1, categ, categ) ∧ pars(x, gebäude.1.1, categ, proto)           0.9961
pars(haut.1.1, auge.1.1, categ, situa)                                          0.9959
sub(x, fassade.1.1, categ, categ) ∧ pars(x, haus.1.1, categ, proto)             0.9959
sub(x, fassade.1.1, categ, categ) ∧ pars(x, nachbarhaus.1.1, categ, proto)      0.9959
sub(x, dach.1.1, categ, categ) ∧ pars(x, neubau.1.1, categ, proto)              0.9956
sub(x, bau.1.1, categ, categ) ∧ pars(x, gebiet.1.1, categ, proto)               0.9956
sub(x, dach.1.1, categ, categ) ∧ pars(x, fahrzeug.1.1, categ, proto)            0.9956
sub(x, dach.1.1, categ, categ) ∧ pars(x, dom.1.1, categ, proto)                 0.9955
sub(x, hafen.1.1, categ, categ) ∧ pars(x, schiff.1.1, categ, proto)             0.9954
sub(x, herz.1.1, categ, categ) ∧ pars(x, mensch.1.1, categ, proto)              0.9954
sub(x, netzhaut.1.1, categ, categ) ∧ pars(x, auge.1.1, categ, proto)            0.9953
pars(ortsteil.1.1, stadt.1.1, proto, proto)                                     0.9952
sub(x, haut.1.1, categ, categ) ∧ pars(x, wirtstier.1.1, categ, proto)           0.9952
sub(x, haut.1.1, categ, categ) ∧ pars(x, tier.1.1, categ, proto)                0.9952
sub(x, haut.1.1, categ, categ) ∧ pars(x, frosch.1.1, categ, proto)              0.9952
sub(x, haut.1.1, categ, categ) ∧ pars(x, arm.1.1, categ, proto)                 0.9952
sub(x, musik.1.1, categ, categ) ∧ pars(x, szene.1.1, categ, proto)              0.9951
sub(x, dorf.1.1, categ, categ) ∧ pars(x, gemeinde.1.1, categ, proto)            0.9951
pars(gegenstand.1.1, welt.1.1, proto, proto)                                    0.9951
sub(x, erdgeschoss.1.1, categ, categ) ∧ pars(x, haus.1.1, categ, proto)         0.9951
sub(x, etage.1.1, categ, categ) ∧ pars(x, haus.1.1, categ, proto)               0.9951
sub(x, erdgeschoss.1.1, categ, categ) ∧ pars(x, gebäude.1.1, categ, proto)      0.9950

Table 10: Meronymy hypotheses with high confidence score (ctd.).

Relation                                                                            Score
sub(x, sockel.1.1, categ, categ) ∧ pars(x, gebäude.1.1, categ, proto)               0.9950
sub(x, innenraum.1.1, categ, categ) ∧ pars(x, gebäude.1.1, categ, proto)            0.9950
sub(x, fundament.1.1, categ, categ) ∧ pars(x, gebäude.1.1, categ, proto)            0.9950
sub(x, obergeschoss.1.1, categ, categ) ∧ pars(x, haus.1.1, categ, proto)            0.9950
sub(x, fundament.1.1, categ, categ) ∧ pars(x, haus.1.1, categ, proto)               0.9950
sub(x, dach.1.1, categ, categ) ∧ pars(x, gebäude.1.1, categ, proto)                 0.9949
sub(x, spitzdach.1.1, categ, categ) ∧ pars(x, turm.1.1, categ, proto)               0.9949
sub(x, wagen.1.1, categ, categ) ∧ pars(x, d-zug.1.1, categ, proto)                  0.9949
pars(bord.1.1, flugzeugträger.1.1, proto, proto)                                    0.9949
sub(x, bord.1.1, categ, categ) ∧ pars(x, raumschiff.1.1, categ, proto)              0.9949
sub(x, bord.1.1, categ, categ) ∧ pars(x, boot.1.1, categ, proto)                    0.9949
sub(x, bord.1.1, categ, categ) ∧ pars(x, satellit.1.1, categ, proto)                0.9949
sub(x, stockwerk.1.1, categ, categ) ∧ pars(x, gebäude.1.1, categ, proto)            0.9949
sub(x, bord.1.1, categ, categ) ∧ pars(x, fangschiff.1.1, categ, categ)              0.9949
sub(x, fassade.1.1, categ, categ) ∧ pars(x, gebäude.1.1, categ, proto)              0.9949
sub(x, fraktion.1.1, categ, categ) ∧ subm(x, partei.1.1, categ, ktype)              0.9947
sub(x, bündnis.1.1, categ, categ) ∧ subm(x, partei.1.1, categ, ktype)               0.9947
sub(x, titel.1.1, categ, categ) ∧ pars(x, werk.1.1, categ, proto)                   0.9946
sub(x, ast.1.1, categ, categ) ∧ pars(x, baum.1.1, categ, proto)                     0.9946
sub(x, wurzel.1.1, categ, categ) ∧ pars(x, baum.1.1, categ, proto)                  0.9946
sub(x, schnauze.1.1, categ, categ) ∧ pars(x, tier.1.1, categ, proto)                0.9946
sub(x, körper.1.1, categ, categ) ∧ pars(x, tier.1.1, categ, proto)                  0.9946
sub(x, körper.1.1, categ, categ) ∧ pars(x, pferd.1.1, categ, proto)                 0.9946
sub(x, deckel.1.1, categ, categ) ∧ pars(x, topf.1.1, categ, proto)                  0.9945
sub(x, hilfsmotor.1.1, categ, categ) ∧ pars(x, segelschiff.1.1, categ, proto)       0.9945
sub(x, antriebshebel.1.1, categ, categ) ∧ pars(x, maschine.1.1, categ, proto)       0.9945
sub(x, dach.1.1, categ, categ) ∧ pars(x, waisenhaus.1.1, categ, proto)              0.9945
sub(x, dach.1.1, categ, categ) ∧ pars(x, mittelschiff.1.1, categ, proto)            0.9945
sub(x, dach.1.1, categ, categ) ∧ pars(x, wohngebäude.1.1, categ, proto)             0.9945
sub(x, dach.1.1, categ, categ) ∧ pars(x, reichstagsgebäude.1.1, categ, proto)       0.9945
sub(x, dach.1.1, categ, categ) ∧ pars(x, flughafengebäude.1.1, categ, proto)        0.9945
sub(x, dach.1.1, categ, categ) ∧ pars(x, hauptgebäude.1.1, categ, proto)            0.9945
sub(x, dach.1.1, categ, categ) ∧ pars(x, kirchenschiff.1.1, categ, proto)           0.9945
sub(x, dach.1.1, categ, categ) ∧ pars(x, wolkenkratzer.1.1, categ, proto)           0.9945
sub(x, dach.1.1, categ, categ) ∧ pars(x, nachbarhaus.1.1, categ, proto)             0.9945
sub(x, dach.1.1, categ, categ) ∧ pars(x, schulhaus.1.1, categ, proto)               0.9945
sub(x, dach.1.1, categ, categ) ∧ pars(x, sockelbau.1.1, categ, proto)               0.9945
sub(x, dach.1.1, categ, categ) ∧ pars(x, münster.1.1, categ, proto)                 0.9945
sub(x, dach.1.1, categ, categ) ∧ pars(x, haupthaus.1.1, categ, proto)               0.9945
sub(x, dach.1.1, categ, categ) ∧ pars(x, schulgebäude.1.1, categ, proto)            0.9945

Table 10: Meronymy hypotheses with high confidence score (ctd.).

Relation                                                                             Score
sub(x, dach.1.1, categ, categ) ∧ pars(x, fahrerhaus.1.1, categ, proto)               0.9945
sub(x, dach.1.1, categ, categ) ∧ pars(x, wasserhochbehälter.1.1, categ, proto)       0.9945
sub(x, dach.1.1, categ, categ) ∧ pars(x, rauchsalon.1.1, categ, proto)               0.9945
sub(x, dach.1.1, categ, categ) ∧ pars(x, parlamentsgebäude.1.1, categ, proto)        0.9945
sub(x, dach.1.1, categ, categ) ∧ pars(x, westbau.1.1, categ, proto)                  0.9945
sub(x, dach.1.1, categ, categ) ∧ pars(x, zugfahrzeug.1.1, categ, proto)              0.9945
sub(x, dach.1.1, categ, categ) ∧ pars(x, kraftfahrzeug.1.1, categ, proto)            0.9945
sub(x, dach.1.1, categ, categ) ∧ pars(x, bus.1.1, categ, proto)                      0.9945
sub(x, dach.1.1, categ, categ) ∧ pars(x, studio.1.1, categ, proto)                   0.9945
sub(x, dach.1.1, categ, categ) ∧ pars(x, torbau.1.1, categ, proto)                   0.9945
sub(x, dach.1.1, categ, categ) ∧ pars(x, priesterhaus.1.1, categ, proto)             0.9945
sub(x, dach.1.1, categ, categ) ∧ pars(x, landhaus.1.1, categ, proto)                 0.9945
sub(x, dach.1.1, categ, categ) ∧ pars(x, glockenhaus.1.1, categ, proto)              0.9945
sub(x, dach.1.1, categ, categ) ∧ pars(x, führerhaus.1.1, categ, proto)               0.9945
sub(x, dach.1.1, categ, categ) ∧ pars(x, hochbau.1.1, categ, categ)                  0.9945
sub(x, dach.1.1, categ, categ) ∧ pars(x, zwerchhaus.1.1, categ, proto)               0.9945
sub(x, dach.1.1, categ, categ) ∧ pars(x, hubschrauber.1.1, categ, proto)             0.9945
sub(x, dach.1.1, categ, categ) ∧ pars(x, mittelbau.1.1, categ, proto)                0.9945
sub(x, dach.1.1, categ, categ) ∧ pars(x, oiya-gebäude.1.1, categ, categ)             0.9945
sub(x, dach.1.1, categ, categ) ∧ pars(x, festsaal.1.1, categ, proto)                 0.9945
sub(x, dach.1.1, categ, categ) ∧ pars(x, bahnhofsgebäude.1.1, categ, categ)          0.9945
sub(x, dach.1.1, categ, categ) ∧ pars(x, amtsgebäude.1.1, categ, categ)              0.9945
sub(x, dach.1.1, categ, categ) ∧ pars(x, reaktorgebäude.1.1, categ, proto)           0.9945
sub(x, dach.1.1, categ, categ) ∧ pars(x, generatorhaus.1.1, categ, categ)            0.9945
sub(x, umschlag.1.1, categ, categ) ∧ pars(x, hafen.1.1, categ, proto)                0.9944
sub(x, fuß.1.1, categ, categ) ∧ pars(x, tier.1.1, categ, proto)                      0.9944
sub(x, etage.1.1, categ, categ) ∧ pars(x, gebäude.1.1, categ, proto)                 0.9944
sub(x, vorderseite.1.1, categ, categ) ∧ pars(x, haus.1.1, categ, proto)              0.9944
sub(x, schauseite.1.1, categ, categ) ∧ pars(x, haus.1.1, categ, proto)               0.9944
sub(x, titel.1.1, categ, categ) ∧ pars(x, novelle.1.1, categ, proto)                 0.9943
sub(x, regierung.1.1, categ, categ) ∧ pars(x, welt.1.1, categ, proto)                0.9943
sub(x, fuß.1.1, categ, categ) ∧ pars(x, patient.1.1, categ, proto)                   0.9943
sub(x, hand.1.1, categ, categ) ∧ pars(x, mensch.1.1, categ, proto)                   0.9943
sub(x, gesicht.1.1, categ, categ) ∧ pars(x, mensch.1.1, categ, proto)                0.9943
sub(x, ohr.1.1, categ, categ) ∧ pars(x, mensch.1.1, categ, proto)                    0.9943
sub(x, auge.1.1, categ, categ) ∧ pars(x, mensch.1.1, categ, proto)                   0.9943
sub(x, dach.1.1, categ, categ) ∧ pars(x, wagen.1.1, categ, proto)                    0.9942
sub(x, kopf.1.1, categ, categ) ∧ pars(x, tier.1.1, categ, proto)                     0.9942
sub(x, hornhaut.1.1, categ, categ) ∧ pars(x, auge.1.1, categ, proto)                 0.9942
sub(x, konstruktion.1.1, categ, categ) ∧ pars(x, wagen.1.1, categ, proto)            0.9942

Table 11: Synonymy hypotheses with high confidence score.

Relation                                                                          Score
syno(kopfschmerz.1.1, cephalgie.1.1, categ, categ)                                0.7107
syno(parallelepiped.1.1, spat.1.1, categ, categ)                                  0.7107
syno(kopfschmerz.1.1, kephalgie.1.1, categ, categ)                                0.7107
syno(kopfschmerz.1.1, kephalalgie.1.1, categ, categ)                              0.7107
syno(kopfschmerz.1.1, zephalgie.1.1, categ, categ)                                0.7107
syno(kopfschmerz.1.1, cephalaea.1.1, categ, categ)                                0.7107
syno(refsum-syndrom.1.1, refsum-thiébaut-krankheit.1.1, categ, categ)             0.7107
syno(honigmagen.1.1, honigblase.1.1, categ, categ)                                0.7107
syno(rhesusinkompatibilität.1.1, rh-inkompatibilität.1.1, categ, categ)           0.7107
syno(ott-zeichen.1.1, ott-maß.1.1, categ, categ)                                  0.7107
syno(versetzungszeichen.1.1, akzidens.1.1, categ, categ)                          0.7107
syno(surrogatmarker.1.1, surrogatparameter.1.1, categ, categ)                     0.7107
syno(glucose-6-phosphat.1.1, robisonester.1.1, categ, categ)                      0.7107
syno(vierhügelplatte.1.1, lamina.1.1, categ, categ)                               0.7107
syno(nierenzellkarzinom.1.1, grawitz-tumor.1.1, categ, categ)                     0.7107
syno(hornschwiele.1.1, tylositas.1.1, categ, categ)                               0.7107
syno(rhesusinkompatibilität.1.1, rhesusunverträglichkeit.1.1, categ, categ)       0.7107
syno(verkehrsleistung.1.1, beförderungsleistung.1.1, categ, categ)                0.7107
syno(kreiselkäfer.1.1, breithalskäfer.1.1, categ, categ)                          0.7107
syno(honigmagen.1.1, sozialmagen.1.1, categ, categ)                               0.7107
syno(verkehrsleistung.1.1, transportleistung.1.1, categ, categ)                   0.7107
syno(taiwan_taoyuan_international_airport.0, tty_airport.0, categ, categ)         0.7107
syno(chiang_kai-shek_international_airport.0, cks_airport.0, categ, categ)        0.7107
syno(karnaugh-veitch-diagramm.1.1, kv-diagramm.1.1, categ, categ)                 0.7107
syno(propagandistin.1.1, propagator.1.1, categ, categ)                            0.7107
syno(schober-zeichen.0, schober'sches_zeichen.0, categ, categ)                    0.7107
syno(alpha-1-antitrypsinmangel.1.1, laurell-eriksson-syndrom.1.1, categ, categ)   0.7107
syno(refsum-syndrom.1.1, heredopathia.1.1, categ, categ)                          0.7107
syno(propagandistin.1.1, verkaufsförderer.1.1, categ, categ)                      0.7107
syno(parallelepiped.1.1, parallelflach.2.1, categ, categ)                         0.7107
syno(alpha-1-antitrypsinmangel.1.1, proteaseinhibitormangel.1.1, categ, categ)    0.7107
syno(versetzungszeichen.1.1, akzidentalen.1.1, categ, categ)                      0.7107
syno(lapacho.1.1, iperoxo.1.1, categ, categ)                                      0.7107
syno(la_crosse-enzephalitis.0, crosse_la.0, categ, categ)                         0.7107
syno(mbtps1.0, s1p.0, categ, categ)                                               0.7107
syno(laurent.0, saint_laurent.0, categ, categ)                                    0.7107

Table 11: Synonymy hypotheses with high confidence score (ctd.).

Relation                                                                        Score
syno(zenionidae.1.1, zeniontidae.1.1, categ, categ)                             0.7107
syno(parallelepiped.1.1, parallelotop.1.1, categ, categ)                        0.7107
syno(erythropoetin.1.1, erythropoietin.1.1, categ, categ)                       0.7107
syno(erythropoetin.1.1, epoetin.1.1, categ, categ)                              0.7107
syno(zenionidae.1.1, macrurocyttidae.1.1, categ, categ)                         0.7107
syno(laurent.0, pinot_saint_laurent.0, categ, categ)                            0.7107
syno(fingerperimetrie.1.1, konfrontationsperimetrie.1.1, categ, categ)          0.7107
syno(perimetrie.1.1, goldmannperimetrie.1.1, categ, categ)                      0.7107
syno(perimetrie.1.1, computerperimetrie.1.1, categ, categ)                      0.7107
syno(invagination.1.1, intussuszeption.1.1, categ, categ)                       0.7107
syno(nierenzellkarzinom.1.1, hypernephrom.1.1, categ, categ)                    0.7107
syno(colitis.1.1, kollagenkolitis.1.1, categ, categ)                            0.7107
syno(hornschwiele.1.1, tylosis.1.1, categ, categ)                               0.7107
syno(colitis.1.1, kollagencolitis.1.1, categ, categ)                            0.7107
syno(perimetrie.1.1, schwellenperimetrie.1.1, categ, categ)                     0.7107
syno(ethylenimin.1.1, aziridin.1.1, categ, categ)                               0.7106
syno(basaliom.1.1, basalzellkarzinom.1.1, categ, categ)                         0.7106
syno(hornschwiele.1.1, tylom.1.1, categ, categ)                                 0.7106
syno(art.1.1, alpha-fehler.1.1, categ, categ)                                   0.7105
syno(filzstift.1.1, filzschreiber.1.1, categ, categ)                            0.7105
syno(verhalten.1.1, verhaltensformung.1.1, categ, categ)                        0.7105
syno(filzstift.1.1, filzmaler.1.1, categ, categ)                                0.7105
syno(filzstift.1.1, faserschreiber.1.1, categ, categ)                           0.7105
syno(filzstift.1.1, fasermaler.1.1, categ, categ)                               0.7105
syno(krankheit.1.1, morbus.1.1, categ, categ)                                   0.7105
syno(loránd-eötvös-universität_budapest.0, elte.0, categ, categ)                0.7105
syno(amyotrophe_lateralsklerose.0, als.0, categ, categ)                         0.7105
syno(sender_policy_framework.0, spf.0, categ, categ)                            0.7105
syno(financial_reporting.0, icofr.0, categ, categ)                              0.7105
syno(Åtvidabergs_.0, å.0, categ, categ)                                         0.7105
syno(unshiu_mikan.0, mikan.0, categ, categ)                                     0.7105
syno(initial-v.0, dulv.0, categ, categ)                                         0.7105
syno(hunter's_rank.0, hr.0, categ, categ)                                       0.7105
syno(bande_dessinée.0, dnap.0, categ, categ)                                    0.7105
syno(radboud-universität_nimwegen.0, ru.0, categ, categ)                        0.7105
syno(sociaal-democratische_arbeiderspartij.0, sdap.0, categ, categ)             0.7105
syno(azienda_trasporti_milanesi.0, atm.0, categ, categ)                         0.7105
syno(koca_mimar_sinan_aga.0, mimar_sinan.0, categ, categ)                       0.7105
syno(integrada.0, oilb.0, categ, categ)                                         0.7105

Table 12: Selection of entailments.

Premise:    temp(d, b) ∧ subs(d, tod.1.1) ∧ aff(d, a)
Conclusion: subs(d, versterben.1.1) ∧ temp(d, b) ∧ aff(d, a)
Score:      0.9081

Premise:    subs(d, versterben.1.1) ∧ temp(d, b) ∧ aff(d, a)
Conclusion: temp(d, b) ∧ subs(d, tod.1.1) ∧ aff(d, a)
Score:      0.9081

Premise:    subs(e, entdecken.1.1) ∧ obj(e, b) ∧ exp(e, a)
Conclusion: subs(m, entdeckung.1.1) ∧ temp(n, present.0) ∧ obj(m, b) ∧ assoc(n, m) ∧ agt(n, a)
Score:      0.5567

Premise:    subs(e, entdeckung.1.1) ∧ temp(f, present.0) ∧ obj(e, b) ∧ assoc(f, e) ∧ agt(f, a)
Conclusion: subs(m, entdecken.1.1) ∧ obj(m, b) ∧ exp(m, a)
Score:      0.5567

Premise:    subs(e, bekräftigen.1.1) ∧ temp(e, past.0) ∧ obj(e, d) ∧ sub(d, forderung.1.1) ∧ obj(d, a) ∧ ante(b, e)
Conclusion: semrel(e, d) ∧ subs(d, aussprechen.1.4) ∧ purp(d, b) ∧ agt(d, a)
Score:      0.5464

Premise:    attch(b, e) ∧ temp(d, c) ∧ subs(d, gewinnen.1.1) ∧ exp(d, a)
Conclusion: assoc(e, b) ∧ temp(e, past.0) ∧ temp(e, c) ∧ subs(d, sieg.1.1) ∧ scar(e, d) ∧ exp(d, a)
Score:      0.5464

Premise:    assoc(e, b) ∧ temp(e, past.0) ∧ temp(e, c) ∧ subs(d, sieg.1.1) ∧ scar(e, d) ∧ exp(d, a)
Conclusion: attch(b, e) ∧ temp(d, c) ∧ subs(d, gewinnen.1.1) ∧ exp(d, a)
Score:      0.5464

Premise:    subs(d, sterben.1.1) ∧ temp(d, b) ∧ aff(d, a)
Conclusion: subs(d, versterben.1.1) ∧ temp(d, b) ∧ aff(d, a)
Score:      0.5461

Premise:    subs(d, versterben.1.1) ∧ temp(d, b) ∧ aff(d, a)
Conclusion: subs(d, sterben.1.1) ∧ temp(d, b) ∧ aff(d, a)
Score:      0.5461

Premise:    subs(e, gewinnen.1.1) ∧ temp(e, c) ∧ attch(b, f) ∧ exp(e, a)
Conclusion: temp(m, past.0) ∧ subs(n, sieg.1.1) ∧ temp(m, c) ∧ assoc(m, b) ∧ scar(m, n) ∧ exp(n, a)
Score:      0.5459

Premise:    temp(e, past.0) ∧ subs(f, sieg.1.1) ∧ temp(e, c) ∧ assoc(e, b) ∧ scar(e, f) ∧ exp(f, a)
Conclusion: subs(m, gewinnen.1.1) ∧ temp(m, c) ∧ attch(b, n) ∧ exp(m, a)
Score:      0.5459

Premise:    subs(e, entdecken.1.1) ∧ obj(e, b) ∧ exp(e, a)
Conclusion: temp(m, erst.2.1) ∧ subs(m, landen.1.2) ∧ temp(m, past.0) ∧ *in(n, b) ∧ loc(m, n) ∧ agt(m, a)
Score:      0.5461

Table 12: Selection of entailments (ctd.).

Premise:    temp(e, erst.2.1) ∧ subs(e, landen.1.2) ∧ temp(e, past.0) ∧ *in(f, b) ∧ loc(e, f) ∧ agt(e, a)
Conclusion: subs(m, entdecken.1.1) ∧ obj(m, b) ∧ exp(m, a)
Score:      0.5461

Premise:    sub(e, forderung.1.1) ∧ ante(b, e) ∧ obj(d, e) ∧ temp(d, present.0) ∧ subs(d, pochen.1.1) ∧ agt(d, a)
Conclusion: obj(d, b) ∧ subs(d, fordern.1.1) ∧ agt(d, a)
Score:      0.5461

Premise:    subs(e, entdecken.1.1) ∧ obj(e, b) ∧ exp(e, a)
Conclusion: subr(m, attch.0) ∧ subs(n, entdecken.1.2) ∧ temp(o, past.0) ∧ arg1(m, p) ∧ arg2(m, b) ∧ obj(n, b) ∧ mcont(o, m) ∧ mexp(o, a)
Score:      0.5461

Premise:    semrel(e, d) ∧ subs(d, aussprechen.1.4) ∧ purp(d, b) ∧ agt(d, a)
Conclusion: sub(e, forderung.1.1) ∧ ante(b, e) ∧ obj(d, e) ∧ temp(d, present.0) ∧ subs(d, pochen.1.1) ∧ agt(d, a)
Score:      0.5461

Premise:    sub(e, forderung.1.1) ∧ ante(b, e) ∧ obj(d, e) ∧ temp(d, present.0) ∧ subs(d, pochen.1.1) ∧ agt(d, a)
Conclusion: semrel(e, d) ∧ subs(d, aussprechen.1.4) ∧ purp(d, b) ∧ agt(d, a)
Score:      0.5461

Premise:    obj(d, b) ∧ subs(d, pochen.1.1) ∧ agt(d, a)
Conclusion: subs(e, bekräftigen.1.1) ∧ temp(e, past.0) ∧ obj(e, d) ∧ sub(d, forderung.1.1) ∧ obj(d, a) ∧ ante(b, e)
Score:      0.5461

Premise:    subs(e, bekräftigen.1.1) ∧ temp(e, past.0) ∧ obj(e, d) ∧ sub(d, forderung.1.1) ∧ obj(d, a) ∧ ante(b, e)
Conclusion: obj(d, b) ∧ subs(d, pochen.1.1) ∧ agt(d, a)
Score:      0.5461

Premise:    obj(d, b) ∧ subs(d, pochen.1.1) ∧ agt(d, a)
Conclusion: obj(d, b) ∧ subs(d, fordern.1.1) ∧ agt(d, a)
Score:      0.5461

Premise:    obj(d, b) ∧ subs(d, fordern.1.1) ∧ agt(d, a)
Conclusion: obj(d, b) ∧ subs(d, pochen.1.1) ∧ agt(d, a)
Score:      0.5461

Premise:    obj(e, b) ∧ *mods(f, besonders.1.1, fordern.1.1) ∧ subs(e, f) ∧ sourc(e, d) ∧ pred(d, reihe.1.1) ∧ attch(a, d)
Conclusion: semrel(e, d) ∧ subs(d, aussprechen.1.4) ∧ purp(d, b) ∧ agt(d, a)
Score:      0.5461

Premise:    obj(d, b) ∧ subs(d, pochen.1.1) ∧ agt(d, a)
Conclusion: sub(d, forderung.1.1) ∧ ante(b, d) ∧ obj(d, a)
Score:      0.5461

Premise:    sub(e, erfinder.1.1) ∧ arg2(f, g) ∧ equ(a, e) ∧ obj(h, b) ∧ arg1(f, a) ∧ agt(h, e)
Conclusion: subs(m, erfinden.1.1) ∧ agt(m, b) ∧ agt(m, a)
Score:      0.1131

Premise:    subs(e, erfinden.1.1) ∧ agt(e, b) ∧ agt(e, a)
Conclusion: sub(m, erfinder.1.1) ∧ arg2(n, o) ∧ equ(a, m) ∧ obj(p, b) ∧ arg1(n, a) ∧ agt(p, m)
Score:      0.1131

B Program Documentation and Manuals

The SemDupl system consists of three individual detectors, a combination module, and a GUI. The GUI provides the easiest access to the detectors, but hides most of their details. Therefore, the GUI is mainly intended for the casual user, while for experimentation the single detectors should be called with explicit parameter settings as described in their manual sections below.

B.1 SemDupl-deep

The deep detector is implemented in the Scheme program semdupl. This program is contained in the directory semdupl-deep and can be built by calling make. The main module is semdupl, which needs several other modules as defined in semdupl.mod. SemDupl-deep uses a special key-value store in order to access large collections of semantic networks efficiently. The current implementation employs a Tokyo Tyrant server for this. It must be started (currently on the server named ki205) with the following script:

start-gnet-servers.205

First, texts must be registered with SemDupl-deep. To register the text poe13, issue the following command:

semdupl --register poe13 ...

Registration assigns a unique ID consisting of four characters to each text. These IDs must be used in the following when referring to texts. If later on a text with ID Fzzz should be discarded, the following command suffices:

semdupl --delete Fzzz ...

This deletes the registration information and removes all representations for the text with ID Fzzz. Before a text can be used in the deep detection of duplicates, it must be analyzed by the parser and the coreference resolver. For text Fzzz, the command is:

semdupl --analyze Fzzz ...

To process a long list of texts, the following command is often more convenient:

xargs -a semdupl --analyze

In addition, some metadata statistics (like the number of paragraphs and sentences) must be collected for SemDupl-deep:

find /cl/sd/net/ -type f | sort | xargs wocadi-tool -collect-metadata > /cl/registry/db_metadata.tsv
cd /cl/registry/
tchmgr importtsv db_metadata.tch < db_metadata.tsv

To detect duplicates for a given text (e.g. with ID Fzzz), the test option is needed:

semdupl --test Fzzz

This command accepts several additional options:

--start-sentence N: For detection, sentences in the candidate text are only considered from the Nth sentence onwards. In other words, the first N-1 sentences are ignored.

--stop-sentence N: The detection stops after the Nth sentence in the candidate text. The main purpose of this is to speed up the detection, because the complexity of SemDupl-deep is linear in the number of sentences in the candidate text.

--html: Output is written in HTML format instead of plain text format.

--show-sentence: The corresponding sentence pairs between the candidate text and the duplicate at hand are shown in the output.

To transform the output of SemDupl-deep into a format accepted by the evaluation module, the XML results option (short: -x) is needed, e.g.

semdupl --xml-results semdupl-deep.log > eval.semdupl-deep.xml

B.2 SemDupl-shallow

All options of this program can be set by command line arguments. Since SemDupl-shallow is usually called with certain attributes which are normally constant, there is the possibility to specify default options in a configuration file in XML format. If not specified otherwise, the file CErkenner.properties contains the configuration. Listing 1 shows the settings which are given in this file.

[Listing 1: settings in CErkenner.properties; only the configuration values survive here: Einstellungen fuer Flatchecker, /usr/bin/aspell, de, UTF-8, semdupl, stdout, anonymous, 0.001, warning, semdupl, semdupl, /cl/sd/utf, 1, stderr, anonymous, 132.176.72.201, 17, MAX]

B.3 SemDupl-quick

The SemDupl-quick program is called via the Bash script SemDupl-quick with the following options:

-put or -p: The contents of the given files are inserted into the database if not already contained.
-check or -c: Checks the given texts for duplicates.
-noput: The contents of the given files are not inserted into the database.
-text [...]: Space-separated list of texts which should be checked by SemDupl-quick. If this option is specified as the very first option, the keyword -text can be omitted. The texts must be specified relative to the SemDupl-quick base directory or absolutely. The use of wildcards (e.g. * or ?) is also possible.
-textdir: All files inside this directory are included into the database. This directory can be specified absolutely or relative to the SemDupl-quick base directory.
-textfile: A text file is specified which contains a list of filenames, each on a separate line. All these files are included into the database.
-R: If this option is specified, a given directory is searched for files recursively.
-loglevel: This option specifies the logging level for debug and error output. Possible values are: off (no logging output at all), severe, warning, info, config, fine, finer, finest, all (most detailed output).
-aspellpath: Path to the executable of GNU Aspell.
-lang: Mnemonic for the language for GNU Aspell (de for German).
-encoding: The encoding used by GNU Aspell (ISO-8859-1 for Latin-1 or utf8 for UTF-8).
-dbpath: Name of the database.
-dbip: IP address of the database server.
-dbpasswd: Password for the database server.
-wortschatzpasswd: Password for the database server of the Leipzig Wortschatz system.
-sourcepath: Optional parameter giving the source path for the specified files. In this way, there is no need to state the path of each file separately.
-processors: The number of processors, usually 1, 2, 4, or 0 for all processors.
-error or stderr: Controls the error output of the program. stderr writes all output to the console (stderr stream); otherwise, errors are written into the file with the given name.
-maxfreq or MAX: Frequency class which is assumed if wortschatz.uni-leipzig.de returns no result for the query for the frequency class of a word. Use the keyword MAX for the greatest possible integer number (Integer.MAX).

B.4 Combiner

This section describes how duplicate features from the individual classifiers SemDupl-deep, SemDupl-quick, and SemDupl-shallow can be combined into a decision (duplicate or no duplicate). To create such classification values for previously unseen data, do the following:

• Run SemDupl-quick, SemDupl-deep, and SemDupl-shallow on this data and copy the files created by these classifiers to the locations expected by svm_classify.sh, which are:
  - /home_pc/gesamt/semdupl/input_data/semdupl-deep.xml for SemDupl-deep
  - /home_pc/gesamt/semdupl/input_data/semdupl-quick.xml for SemDupl-quick
  - /home_pc/gesamt/semdupl/input_data/semdupl-shallow.xml for SemDupl-shallow
• Switch to the directory /home_pc/gesamt/semdupl/Combiner/scripts
• Type: ./svm_classify.sh

B.5 Knowledge Acquisition

B.5.1 Hyponyms, Meronyms and Synonyms

The knowledge acquisition process for semantic and lexical relations is done in the following steps:

• Learning deep and shallow patterns
• Applying deep and shallow patterns to extract relation hypotheses
• Filtering hypotheses
• Determining features
• Assigning a quality score to the hypotheses

Learning deep and shallow patterns

Shallow patterns can be learned using the relation extraction system developed by Duman (2008), which is not described in this paper. In order to extract deep patterns, first a set of semantic networks containing a known hyponymy, meronymy, or synonymy relation has to be collected. Switch to the directory trunk/src/pattern_extraction/tools and enter

./example_searcher.out -d -m -n -r -s -y

where the parameters have the following meaning:

-d: directory where the SNs reside (omitting the directory net; key in configuration file: directory)
-m: maximum number of files used for learning
-n: negative examples: file to store the negative examples in which no meronym or hyponym shows up
-p: output directory
-r: relation list: file which contains a list of relations (hyponyms/meronyms)
-s: start index: numerical index where to start
-y: hyperrelation: sub0 or mero (syno is not yet supported)

Afterwards, the patterns can be learned. For that, switch to the directory src/pattern_extraction/deep and type

./mdl_pattern_extractor.out -c -d -m -p -o

The parameters have the following meaning:

-c: either mdl (minimum description length) or kernel. Recommendation: use the minimum description length principle for learning
-d: directory where the SNs reside
-m: number of files used for learning
-o: file where to store the patterns
-p: directory where the positive examples reside, as determined by example_searcher.out

Application of Deep Patterns

Deep patterns have to be defined in FOIL (First Order Inductive Learner) input format. An example line is:

sub0(A,B) :- sub(C,A), f_itms(D,C), pred(E,B), f_itms(D,E), h_itms(C,E), prop(E,ander#1#1), refer(E,det) ;0.1;0.1; undandere_neighbor_sub

The period (.) has a special meaning in FOIL and is therefore replaced by a #. The conclusion is given by sub0(A,B); :- is the inference operator (←). The premise is given by a comma-separated list which is interpreted as a logical conjunction. The predicate f is used for functions with a variable number of arguments, like *itms; it makes it easier to define patterns in which such functions appear. f_x(C0,C1) means that C1 appears as a parameter of function x and that the result is C0. The predicate g_x(C0,C1) is used to specify that C0 precedes C1 in the argument list of function x. The predicate h_x(C0,C1) specifies that argument C0 immediately precedes C1 in the argument list of function x. For example, a term d = *itms(a, b) gives rise to f_itms(d, a), f_itms(d, b), and h_itms(a, b). The two numbers (here 0.1) are currently not used. The last argument is the name of the pattern. Patterns can be commented out by a preceding //.

All extracted relation hypotheses are stored in a database. This database can be created by:

cd acquire/trunk/src/pattern_application/tools
./database_table_creator_mysql.out -b

Afterwards, the database can be filled with extracted hyponymy hypotheses by:

cd acquire/trunk/src/pattern_application/deep
./pattern_applier.out -d -p [-e ] -b -u -q

where the parameters have the following meaning:

-d: where the semantic networks reside (key in configuration file: directory)
-b: (key in configuration file: db_name)
-q: (key in configuration file: db_password)
-u: (key in configuration file: db_username)
-e: exclude hyponyms with compounds (e.g. a Graubär is a Bär)
-i: (key in configuration file: history_deep_hyponyms)

Similarly, meronyms can be extracted by:

cd acquire/trunk/src/pattern_application/deep
./meronym_pattern_applier.out -d -p -b -u -q

The parameters are mostly identical to those of ./pattern_applier.out. Synonyms can be extracted by:

cd acquire/trunk/src/pattern_application/deep
./synonym_pattern_applier.out -d -p -b -u -q

Application of Shallow Patterns

Besides the usage of deep patterns, there is also the possibility to apply shallow patterns. Shallow patterns are applied to the contents of the analysis-ml tag (token list) of WOCADI. The format of a pattern is: (conclusion premise num1 num2 name). The parameters num1 and num2 are currently not used. An example entry is:

(((syno a b)) (a ((word respektive)) b) 0.5 0.5 shallow_respektiv)

The variables a and b are matched with concepts of type object. The system tries to unify the token entries (e.g. ((word "respektive"))) with the token list provided by the parser. The character ? is used to specify an optional element, e.g. ? ((cat (adj))). The wildcard * is used to specify zero or more occurrences of one expression, like: * (((word ",")) b). The variables a and b are allowed to appear several times in a token list. In this case, the Cartesian product of the bound constants is created as the set of extracted relations. d is used to specify a disjunction, e.g. (d ((((word und))) (((word oder))))). The shallow hyponym pattern applier can be called in the following way:

cd trunk/src/pattern_application/shallow
./shallow_pattern_applier.out -d -p -b -u -q

Shallow meronym patterns can be applied by:

cd trunk/src/pattern_application/shallow
./shallow_meronym_applier.out -d -p -b -u -q

Synonym patterns can be applied by:

cd trunk/src/pattern_application/shallow
./shallow_synonym_applier.out -d -p -b -u -q
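To make the pattern semantics described above concrete, the following is a minimal matcher for a simplified form of this token-pattern language (literal words, variables, the optional marker ?, the wildcard *, and the disjunction d); it is an illustrative sketch, not the actual applier code:

    def match(pattern, tokens, bindings=None):
        # Pattern items: ('word', w) literal token, ('var', v) variable,
        # ('opt', item) for ?, ('star', item) for *, ('or', alts) for d.
        # Returns a list of variable bindings; a variable may bind several
        # tokens, from which the applier would build the Cartesian product.
        if bindings is None:
            bindings = {}
        if not pattern:
            return [bindings] if not tokens else []
        head, rest = pattern[0], pattern[1:]
        results = []
        if head[0] == 'word':
            if tokens and tokens[0] == head[1]:
                results += match(rest, tokens[1:], bindings)
        elif head[0] == 'var':
            if tokens:
                new = {k: list(v) for k, v in bindings.items()}
                new.setdefault(head[1], []).append(tokens[0])
                results += match(rest, tokens[1:], new)
        elif head[0] == 'opt':
            results += match([head[1]] + rest, tokens, bindings)  # with element
            results += match(rest, tokens, bindings)              # without it
        elif head[0] == 'star':
            results += match(rest, tokens, bindings)              # zero occurrences
            if tokens:  # one more occurrence (item must consume a token)
                results += match([head[1], head] + rest, tokens, bindings)
        elif head[0] == 'or':
            for alt in head[1]:
                results += match([alt] + rest, tokens, bindings)
        return results

    # Example corresponding to the syno pattern above:
    # match([('var', 'a'), ('word', 'respektive'), ('var', 'b')],
    #       ['Haus', 'respektive', 'Gebaeude'])
    # yields [{'a': ['Haus'], 'b': ['Gebaeude']}].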

Feature Calculation

First, the features have to be calculated. The dataset is divided into training and evaluation partitions. Type:

cd acquire_repos_scheme/trunk/src/feature_combination/tools

and

eval_training_set_divider.out -b -u -q -f

where the argument of -f is the relative amount of training data (e.g. 0.7 = 7 parts training data, 3 parts evaluation data). Afterwards, calculate the features in the following way. First type:

cd acquire_repos_scheme/trunk/src/feature_combination/

Make sure that the history file is removed:

rm feature_calculation.hist

Then start the feature calculation process:

./feature_calculator.out -b -p -b -u -q -F

The current progress is always written to the file feature_calculation.hist. Thus, if the program crashes, the feature calculation process can be continued simply by restarting the feature calculator as described above.

Score Calculation

Finally, the score must be calculated. First, a certain subset of the entire annotated data is selected. To do this, switch to the directory acquire_repos_scheme/trunk/src/feature_combination/tools and type:

./combination_training_set_creator.out -e -m -r -i [-I] -o -d -p -b -u -q

The parameters have the following meaning:

-e num_examples_pattern: specifies the number of examples for hyponymy / meronymy / synonymy hypotheses per pattern
-m num_example: total number of patterns
-r generalized relation: name of the generalized relation; can be sub0, mero, or syno
-i initial_set: this option can be used to enlarge an existing training set
-I: this option is used to specify that the numbers of positive and negative examples need not coincide
-o: output file suffix containing the selected training data. The files: