Automated Detection of Reference Structures in Law

Emile DE MAAT, Radboud WINKELS and Tom VAN ENGERS
Leibniz Center for Law, University of Amsterdam
{deMaat, Winkels, vanEngers}@uva.nl

Abstract. Combining legal content stores of different providers usually costs considerable time, effort and money, because the links between the different parts of the constituent sources within those stores are usually 'hard-wired'. In practice, users of legal content are confronted with a vendor lock-in situation and have to find work-arounds when they want to combine their own content with the content provided by others. In the BSN project we developed a parser that enables the creation of a referential structure on top of a legal content store. We empirically tested the parser's effectiveness and found an accuracy of over 95%, even for complex references.

Keywords. Reference parsing, Natural Language Processing, Tagging Legal Texts

Introduction

The Dutch Tax and Customs Administration (DTCA) is one of the many organisations that deal with a multitude of electronic legal data, from various sources and in different formats. In addition to the data from other sources, the DTCA itself produces new and enriches existing legal information. All this legal information needs to be integrated and (inter)connected to support the legal experts in the DTCA. One of the problems of the DTCA and similar organisations is the dependency between the software that supports access to the content (the portal) and the content's structure and form. In practice this may result in a vendor lock-in: it is hard to switch supplier, because for one thing all internal data has to be re-integrated.

In order to overcome this problem, the DTCA started a project in which a semantic network for their legal content is constructed. An overview of that project is given in [1], as well as a description of the general architecture of the semantic network. In this paper we discuss recent developments for part of this project: the further development and testing of a parser that automatically finds references in and between legal sources.

As stated above, organisations making use of various collections of data from different content providers face several problems. For our current purpose, these are the most important ones:

1. Limited Scope: References are limited to the particular collection, i.e. interrelations between documents belonging to one collection are supported but not between documents belonging to different collections.
2. Incomplete: Not all potential references within a collection are explicit, i.e. can be followed automatically by the users.

Adding the additional relations by hand is a very expensive operation, both in terms of time and effort. Therefore, work has started to attempt to discover these relations automatically, using parsing techniques. Earlier work on Italian sources [2] has indicated that automated detection of references can be of great help, as 85% of all references could be detected automatically, and another 9% could at least partially be detected.

1. The structure of references

1.1. Simple References

We distinguish four types of simple references.

1. The simplest structure is a reference by name, which consists of the name of the entity being referred to: "Douanewet" ("Customs Law"), "Wet installaties Noordzee" ("Law installations North Sea").
2. Next are the references comprised of label and number. These references consist of a label, such as article or chapter, combined with a number (or letter, or some other designation). Examples of such references are "artikel 1" ("article 1") and "afdeling 1A" ("part 1A"). In the case of members or subparts that are numbered, the number may appear as an ordinal in front of the label, instead of a number following the label: "eerste lid" ("first member"). This ordinal numbering is hardly used above the level of article (i.e. in legislation no references to the first section or the second chapter will be found).
3. A variation of label and number are references comprised of a label, a number and/or a publication date (and sometimes a venue of publication). An example of such a reference is "de wet van 13 april 1995" ("the law of April 13th, 1995"). These patterns are usually more elaborate than label and number patterns, and may require additional keywords and other elements (such as brackets). References including publication information refer to a complete document.
4. Finally, there are the anaphors, indirect references, which often refer to an earlier reference: "dat artikel" ("that article") and "het volgende artikel" ("the next article"). These references can always be resolved to one of the former types.

1.2. Complex References

The first type of complex reference is the multi valued reference. This is a label and number reference that includes several numbers. For example, "artikel 12, 13 en 15" ("article 12, 13 and 15"). Often, these numbers are represented as a range: "artikel 13-18" ("article 13-18").
These ranges can themselves be included in a list containing more numbers: "artikelen 12, 14-18, 20, 22 en 24-26" ("articles 12, 14-18, 20, 22 and 24-26"). Multi valued references can also be constructed using ordinals: "eerste en tweede lid" ("first and second member"). They differ from multiple simple references in that the label is not repeated, so both references need to be read as one to determine what is referenced. In a multi valued reference, the label may be plural (i.e. "articles" instead of "article"), but this is not necessary.

The second type of complex reference is a multi-layered reference. This is a reference that consists of several simple references, which "navigate" through the structure of the target document. For example: "Bankwet 1998, artikel 1, eerste lid" ("Banklaw 1998, article 1, first member"). These references are ordered in one of three ways:

1. Zooming in: the reference starts with the broadest part and ends with the narrowest part, as in the example given above.
2. Zooming out: the reference starts with the narrowest part and ends with the broadest part: "lid 1, artikel 1, Bankwet 1998" ("member 1, article 1, Banklaw 1998"). In this case, the parts may be connected through the word "van" ("of"): "eerste lid van artikel 1 van de Bankwet 1998" ("first member of article 1 of the Banklaw 1998").
3. Zooming in, then zooming out: the reference starts at some (convenient) level in the target document, then "zooms in" and finally "zooms out" again: "artikel 11a, tweede lid van de Consulaire Wet" ("article 11a, second member, of the Consular Law"). The "zooming out" part usually consists of one step, sometimes two, but seldom more.

A multi-layered reference can have a multi-valued reference as its lowest level. For example: "lid 1-3 en 5 van artikel 5 van de Gaswet" ("member 1-3 and 5 of article 5 of the Gaslaw"). When "zooming in", multi-valued references can occur on more levels, resulting in a more tree-like description: "Gaswet, artikel 5, eerste tot en met derde en vijfde lid en 5a, tweede lid" ("Gaslaw, article 5, first through third and fifth member, and 5a, second member") (footnote 1).

1.3. Special cases

A still common exception to the structures presented above is the use of the word "aanhef" ("opening words").
This is used when an element in the text contains a list that is preceded by a description of the list, without which the list does not make sense. In these cases, the reference can be made to this description (the "opening words") and one or more of the list elements. For example: "artikel 12, aanhef en onderdelen i en j" ("article 12, opening words and parts i and j").

One special case is the exception to a range, for example: "artikelen 1-12, uitgezonderd artikel 7" ("articles 1-12, with the exception of article 7") (footnote 2). An exception can also occur on different levels within the reference, with the higher level being the "range" from which the lower level is omitted: "article 1, with the exception of the second member". Here, article 1 represents (for example) the range article 1, first member through article 1, fifth member.

Another special case is the use of the word "telkens" ("each time"), which is used to shorten a list when there is a series of references with lower level references, where the lower references have the same number. For example, when a reference is made to the first member of article 1, the first member of article 2 and the first member of article 4, the reference can be shortened to "artikelen 1, 2 en 4, telkens het eerste lid" ("articles 1, 2 and 4, each time the first member").

Footnote 1. This reference gets rather confusing when, instead of ordinals, numbers are used for the members as well, resulting in: "Gaslaw, article 5, member 1-3 and 5, and 5a, member 2". To minimise this confusion, it is common in these cases to use numbers for articles and ordinals for members.
Footnote 2. Because of the possible existence of articles numbered, for example, 6a or 7.1, this reference is different from "articles 1-6 and 8-12".

1.4. Complete and incomplete references

So far we can classify references using two distinctions:

1. Single layer or multi-layered: does the reference refer to a single structure unit, or does it specify specific subparts of a unit?
2. Single valued or multi valued: does the reference refer to one location, or to multiple locations?

A final distinction must be added that is important for the resolving of such references (see section 3). This distinction is whether the reference is complete or incomplete. A reference is complete if it identifies the complete document (in these cases, the law) it refers to. It is incomplete if it does not include that information. Thus, "member 1, article 1, Banklaw 1998" is a complete reference, whereas "member 1, article 1" is an incomplete reference.

In [2], the distinction between complete and incomplete references has also been made. There, they were named "well-formed references" and "not well-formed references". Not well-formed references were so named because they do not contain sufficient information to distinguish the document referred to (50% of their cases, while 35% were well formed). As we shall see in section 3, this is not a problem when parsing Dutch legal texts (nor in most other texts, as is discussed in section 5). As long as we know the context in which the reference has been found, we can resolve the reference.
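The multi valued references of section 1.2 can be illustrated with a small helper that splits a designation list into single values and ranges. This is a minimal sketch and not part of the parser described in this paper: the names DESIGNATION, ITEM and parse_designations are hypothetical, and the designation pattern (digits, an optional ".digits" part and an optional letter) is an assumption based on the examples "11a" and "7.1". Ranges are stored as (first, last) pairs and are deliberately not expanded, since an article such as 6a or 7.1 may lie inside a range.

```python
import re

# Hypothetical pattern for a designation such as "12", "11a" or "7.1".
DESIGNATION = r"\d+(?:\.\d+)?[a-z]?"
ITEM = re.compile(rf"({DESIGNATION})(?:\s*-\s*({DESIGNATION}))?")

# Separators: a comma (optionally followed by "en"/"and") or a bare "en"/"and".
SEPARATOR = re.compile(r"\s*,\s*(?:(?:en|and)\s+)?|\s+(?:en|and)\s+")


def parse_designations(text):
    """Split e.g. '12, 14-18, 20, 22 en 24-26' into (first, last) pairs.

    A single value n is returned as (n, n); a range a-b as (a, b).
    Designations stay strings and ranges are not expanded.
    """
    result = []
    for part in SEPARATOR.split(text.strip()):
        match = ITEM.fullmatch(part)
        if match is None:
            raise ValueError(f"not a designation or range: {part!r}")
        first, last = match.groups()
        result.append((first, last if last else first))
    return result


print(parse_designations("12, 14-18, 20, 22 en 24-26"))
```

Keeping the end points as strings leaves the decision of which articles fall inside a range to the resolving step, where the structure of the target document is known.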

2. Finding references

The references as presented above follow a very strict structure, which can easily be represented using a regular expression or context free grammar. Basic references, such as an article, are simple:

<article> → "article" <designation>

The rules for other simple references are similar, using a different label. In order to allow for "zoom-out references", the option of adding a reference to a higher level is added:

<article> → "article" <designation> [[","] ["of"] <higher level>]

Here, <higher level> represents the rules to match a book or a law. For a member, this is the rule to match an article, etc. Similarly, in order to allow for "zoom-in references", the option of adding references to lower levels is added:

<article> → "article" <designation> [[","] <lower level>] [[","] ["of"] <higher level>]

<lower level> represents the rules to match a member, subpart or sentence. For a member, this is the rule to match a subpart or sentence, etc.

Finally, to allow multiple references, it is possible to add a list of designations and/or ranges of designations, each with their own sub-references:

<article> → "article" <designation> [[","] <lower level>] ["-" <designation> [[","] <lower level>]] [([","] <designation> [[","] <lower level>])* "and" <designation> [[","] <lower level>]] [[","] ["of"] <higher level>]

The first optional section allows for ranges to be included instead of a single designation. The second optional section makes it possible to have a list of designations and ranges. The grammar above needs to be expanded to allow for equivalent words and constructions. For example, the word "article" may be replaced by "articles", and when specifying a range, instead of a dash the words "up to and including" may be used. With those modifications, a grammar based on those derivation rules is able to recognise most references.

However, one problem arises. Names of laws (or names of any other regulation) do not follow a clear pattern. There are few keywords: the only certainty is that the word "law" appears somewhere in the name, but not always as a separate word. The names can even contain commas and other names, which makes it even more difficult to separate the names from the other text. Therefore, we have decided not to try to recognise names by matching them to a regular expression, but to simply compare the text with a list of names of laws. A drawback of this approach is of course that if a name is not on the list, it will not be recognised. In practice, it is possible to maintain a list of all published laws and regulations. However, within a law, local names may be defined (usually to abbreviate the official name). These names will be missed unless they are added to the list (footnote 3).

Another problem is that simple patterns consisting of label and number are the same as the headings of articles and chapters (which are usually also indicated using label and number). In order to avoid detecting these headings as references, they should be marked as headings, not as actual text. Such markings are also necessary to correctly resolve the references, as will be discussed in the next section.
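The combination of a label-and-number pattern with a lookup in a list of law names can be sketched as a single regular expression. The following is a simplified Python illustration under stated assumptions, not the grammar used in our parser: it only covers Dutch article references with lists and ranges plus an optional zoom-out to a law name, and it leaves out zoom-in to members. KNOWN_LAWS is an illustrative four-entry list, and all pattern and function names are hypothetical.

```python
import re

# Illustrative list only; in practice this would hold all published laws.
KNOWN_LAWS = ["Bankwet 1998", "Consulaire Wet", "Gaswet", "Douanewet"]

DESIGNATION = r"\d+[a-z]?"
# A single designation or a range ("14-18", "1 tot en met 3").
RANGE = rf"{DESIGNATION}(?:\s*(?:-|tot en met)\s*{DESIGNATION})?"
# One item, or a list whose last item is introduced by "en".
LIST = rf"{RANGE}(?:(?:\s*,\s*{RANGE})*\s+en\s+{RANGE})?"
# Names are matched literally, longest first, instead of by a name pattern.
LAW = "|".join(re.escape(n) for n in sorted(KNOWN_LAWS, key=len, reverse=True))

ARTICLE_REF = re.compile(
    rf"\b(?P<label>artikel(?:en)?)\s+(?P<designations>{LIST})"
    rf"(?:\s*,?\s*van\s+(?:de\s+)?(?P<law>{LAW}))?",
    re.IGNORECASE,
)


def find_article_references(text):
    """Return (designations, law name or None) for each article reference."""
    return [
        (m.group("designations"), m.group("law"))
        for m in ARTICLE_REF.finditer(text)
    ]


sample = "Zie artikel 5 van de Gaswet en artikelen 12, 14-18 en 20."
print(find_article_references(sample))
```

A reference without a law name comes back with None in the second position, which corresponds to an incomplete reference in the sense of section 1.4.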

Footnote 3. The sentences in which alternative names are defined follow a strict format themselves. It is likely that we can solve this problem by using a two-pass method, first searching the document for any alternative name definitions and then searching it for references (using any local names found).

3. Resolving references

After a reference has been found in a text, it should also be resolved, meaning that the URI of the document that is referred to should be found. As has been discussed in [1], there are (at least) three different levels of reference to a specific regulation:

1. A reference to a work, referenced by its citation title (or date of publication, if no citation title exists).
2. A reference to a source. A source is a version of a work at a particular time.
3. A reference to a manifestation. A manifestation is a specific publication of a source. Two manifestations can differ in terms of, for example, medium, lay-out and comments.

References found in legislation always refer to a work; for case law and commentaries references to sources can be found, and in commentaries sometimes even to manifestations or sources that never came into operation.

Constructing a URI for the work referenced should be done by resolving the name found by means of a resolver (such as an online database) or by simply reading the URI from the list attached to the name of the law. After this, the URI for the precise location in the document must be found. This can also be done by means of a resolver or, if the URI methodology supports it, the URI can be constructed using the base URI and the information found. For example, within the Norme In Rete project, it is prescribed that the identifier for "article 2" should be "art2", and that the complete identifier for article 2 of the Destructionlaw should be "#art2" appended to the URI for the Destructionlaw [4].

Things get a little bit more complicated if we do not have a complete reference. In that case, we do not have a name which points to a base document. Within the text of a law, however, such an incomplete reference means a reference within the current document. Thus, all that is needed is that we know the identity of the text that we are parsing, and we can resolve the reference in a way similar to the resolving of the complete reference. In order to establish the identity of the text it will not always be sufficient that we can identify the law the text belongs to. A reference to "the first member" means "the first member of this article of this law". In order to resolve this reference, we will need not only the name of the law being parsed, but also the designation of the current article.
This means that the input document for a reference parser should contain sufficient information on the structure of a document. It helps if this structure is already made explicit, as in an XML document. Our parser works on METAlex documents (footnote 4).

Another group of references that is somewhat harder to resolve are the anaphors. From the point of view of resolving the reference, the anaphors come in three groups. The first group consists of those references that refer to the current text: "this article", "this law". Such references are easily resolved if the identity of the current location is known, as discussed above. The second group of anaphors refers to an earlier point in the text, such as "the previous article". These can be resolved using structure information as well (though they require that the parser not only knows its current location, but also keeps a (limited) history). Finally, there are those anaphors that refer to an earlier reference, for example "that article", referring not to the current article, but to an article that was mentioned earlier in the text. Usually, this is the most recent reference to an article in the text. In order to resolve these references, the parser keeps a history of the references found so far. This history can be limited to the current piece of text, since such anaphors will not cross the boundaries of (for example) different articles (footnote 5).

Footnote 4. For a description of METAlex see www.metalex.eu [5].
Footnote 5. Unlike in common natural language texts, where anaphoric references can be quite complicated to resolve, see e.g. [3].
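The resolution steps of this section (complete references, incomplete references that fall back on the current law, and the three groups of anaphors) can be summarised in a short sketch. It is an illustration under assumptions, not our implementation: the Context class, the resolve function and the LAW_URIS table are hypothetical, the URIs are placeholders, and the "#art" fragment follows the "#art2" example given above.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical name-to-URI list; the URIs are placeholders, not real identifiers.
LAW_URIS = {
    "Gaswet": "urn:example:gaswet",
    "Bankwet 1998": "urn:example:bankwet1998",
}


@dataclass
class Context:
    law: str                                # name of the law being parsed
    article: str                            # designation of the current article
    previous_article: Optional[str] = None  # limited structural history
    history: list = field(default_factory=list)  # references found so far


def article_uri(law, article):
    # Base URI of the work plus a fragment modelled on the "#art2" example.
    return f"{LAW_URIS[law]}#art{article}"


def resolve(ctx, law=None, article=None, anaphor=None):
    """Resolve a complete, incomplete or anaphoric article reference."""
    if anaphor == "this":          # "this article": the current location
        target = (ctx.law, ctx.article)
    elif anaphor == "previous":    # "the previous article": structure history
        if ctx.previous_article is None:
            raise LookupError("no previous article known")
        target = (ctx.law, ctx.previous_article)
    elif anaphor == "that":        # "that article": most recent reference
        if not ctx.history:
            raise LookupError("no earlier reference in this piece of text")
        target = ctx.history[-1]
    else:
        # Incomplete reference: fall back to the law being parsed.
        target = (law or ctx.law, article)
        ctx.history.append(target)
    return article_uri(*target)


ctx = Context(law="Gaswet", article="5", previous_article="4")
print(resolve(ctx, law="Bankwet 1998", article="1"))
print(resolve(ctx, article="2"))
print(resolve(ctx, anaphor="that"))
```

In this sketch the history would be reset whenever the parser enters a new article, in line with the observation that such anaphors do not cross article boundaries.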

Table 1. Results of the reference parser applied to six randomly chosen Dutch laws

4. Results

In order to test our approach, a grammar has been constructed containing most of the patterns mentioned in section 1. Not included were the special cases for references containing exceptions to ranges and references constructed using "each time". We randomly selected six Dutch laws. The only additional requirements were that we wanted to include one law written before 1900, and one between 1920 and 1949, since we expected that the references, i.e. the language used to express them, would be different from modern laws. We applied our parser and measured the number of correctly identified simple references, the number of missed simple references, the number of (partly) correctly identified complex references, the number of missed complex references and the number of skipped references to sources other than laws. Table 1 presents the results.

Almost all references were found correctly and completely (99% of the simple ones and 95% of the complex ones). The few misses were caused by missing labels, names or patterns from the grammar. The grammar can be corrected for this, and will be if those labels and patterns occur often enough (footnote 6). However, there may always remain some patterns that are too rare to include.

False positives occur when one of the labels (such as "member") is used in a different meaning, for example, when "the first member" does not refer to the first member of an article, but to the first member of a certain committee discussed in the text. These false positives may be identified when trying to resolve them, as there is seldom a complete reference that the anaphor refers to. A reference to "the first member" (of a committee) may be proven to be a false positive if the current article does not include any members.

Footnote 6. The missing names will be added in any case. The parser used a list of current laws, and some of the references found referred to retracted laws.
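The check for false positives described above can be sketched as a lookup in the structure of the document being parsed. This is only an illustration: the STRUCTURE index and the function name are hypothetical, and a real check would read the structure from the marked-up input document.

```python
# Hypothetical structure index: article designation -> designations of its members.
STRUCTURE = {
    "1": ["1", "2", "3"],
    "2": [],  # an article without members
}


def is_plausible_member_reference(current_article, member, structure=STRUCTURE):
    """Keep an incomplete 'member' reference only if the current article
    actually contains a member with that designation."""
    return member in structure.get(current_article, [])


# "the first member" inside article 1 can be resolved ...
print(is_plausible_member_reference("1", "1"))
# ... but inside article 2 it cannot, so it is likely a false positive.
print(is_plausible_member_reference("2", "1"))
```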

5. Expanding it to Other Legal Sources

The results presented here are based on references from laws to laws and within laws. However, there are of course many more documents that refer to laws and regulations. As part of our research, various examples of such documents and references used in those documents have been studied, though most of the results of these studies have yet to be implemented and tested.

5.1. References from other types of documents

For the development of the Semantic Network for the DTCA, it was very important to study the legal commentaries supplied by various publishers. These commentaries link together case law, regulations and parliamentary reports and form an overview of the relevant information for a certain legal problem. In general, in almost any document quoting a law, the same method of addressing the law and parts of such a law is used (footnote 7). This means that the grammar that was developed for the law should also be able to detect references in other documents. Outside laws and regulations, unofficial names for laws are often used besides the official names. This means that the list of names of laws will need to be expanded for the parsing of these documents. Since there is no official source for these unofficial names, it may occur more often that a name cannot be recognised because it is not on the list. Other than that, few problems should occur.

There are, however, two important differences. First of all, depending on the type of document, the references do not point to a work (as is the case for a law), but instead refer to a source. For example, commentaries are based on the law at the moment of writing, and may no longer apply after certain changes have been made. The version of the law referred to is usually not mentioned in the text. Instead, it should be derived from the time of writing of the text. When resolving the reference, the work referred to can be found by using the methods described in section 3. In order to find the correct source, a date must be provided. As stated before, in some cases, such a date is simply the conception date of the document. Thus, for a commentary, the conception date could be passed along to the parser to resolve the references (footnote 8).

The second important difference is that in a document that is not structured using articles etc., an incomplete reference does not refer to a location within the referring document itself. Instead, an incomplete reference is an abbreviation of an earlier complete reference. For example, after introducing "article 14 of the Law on Legal Support", a writer may abbreviate it to "article 14". This may lead to confusion when a writer refers to more than one "article 14". In this case, the writer usually refers to the most recently mentioned "article 14". Resolving these references requires a list (history) of those references found earlier in (that section of) the source document.

Footnote 7. Actually, in many cases the people that write legal commentaries for publishers are the same people that write the draft legislation at one of the ministries, or work at one of the law enforcement organizations as legal expert.
Footnote 8. It may be better to consider commentaries to refer to ranges of versions of a law (or individual articles or members), as a commentary may apply to several versions of a law.

5.2. References to other types of documents

Several other types of legal documents, such as Royal Decrees, decisions and court cases, are also referred to in legal texts. If structured, such documents follow similar structures as a law, and their subparts are referred to using the same formulations. In general, most of these documents are not named. Instead, they are identified by a number and/or date (and sometimes venue) of publication. This means that these references require additions to the grammar. For each type of document, a couple of different reference formats exist. However, when these additions to the grammar have been made, there will be less need for maintenance, as there is no need to continue adding new names to the list.

As said, not all documents are structured using divisions like chapter or article. Lacking structural identifiers, writers will sometimes refer to certain pages. This makes such a reference a reference to a manifestation, as the pages and their numbering are dependent on (among others) the lay-out. In itself, this is not a problem, since we are able to construct a URI for the manifestation. But we might be interested in a link to the source. This means that for these sources, publishers creating a new manifestation of a source must strive to maintain the page information of the original manifestation. This is a requirement that cannot be fulfilled using most currently available formats (footnote 9).
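The date-based selection of a source discussed in section 5.1 can be sketched as a lookup in the version history of a work. This is a minimal illustration under assumptions: the VERSIONS table, its dates and its URIs are invented placeholders, and source_at is a hypothetical helper. It picks the version in force on the conception date of the referring document.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical version history of one work, sorted by date of entry into force.
VERSIONS = [
    (date(1998, 1, 1), "urn:example:gaswet@1998-01-01"),
    (date(2003, 7, 1), "urn:example:gaswet@2003-07-01"),
    (date(2005, 1, 1), "urn:example:gaswet@2005-01-01"),
]


def source_at(versions, when):
    """Return the URI of the version in force on `when`.

    `versions` must be sorted by start date; the version whose start date
    is the latest one not after `when` is selected.
    """
    starts = [start for start, _ in versions]
    index = bisect_right(starts, when)
    if index == 0:
        raise LookupError("no version in force on that date")
    return versions[index - 1][1]


# A commentary conceived on 15 March 2004 refers to the 2003 version.
print(source_at(VERSIONS, date(2004, 3, 15)))
```

If commentaries are taken to refer to ranges of versions, as footnote 8 suggests, the same table could be queried with a start and an end date instead of a single date.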

Conclusion

We have described a parser that automatically finds references in and between Dutch laws and legislation. A test on six very diverse Dutch laws showed an accuracy of 95-99% and hardly any false positives. In the Norme-in-Rete project a similar result was achieved on a much larger, but less diverse Italian corpus [2]. Their parser found 85% of the references, but only 35% could be resolved. We can resolve every reference found.

The parsing technique used is a simple, but effective one. As soon as we need to go deeper into the meaning of the texts to be parsed, we will have to resort to natural language techniques and grammars, as we and others have done before [6][7].

The approach is general enough to extend to other legal sources like case law and legal commentaries, but it remains to be seen if we can achieve as high a success rate as with laws. The basis for the success with laws is the strict and predictable way legal drafters refer to and within legislation, and the structured nature of the documents we are dealing with. Both these advantages are weaker for case law and even more so for commentaries. However, some of the problems we have to deal with will disappear with further standardisation. If the structure of legal documents and the way they represent references is standardised, integration of sets from different origins will be straightforward (footnote 10).

The approach developed is not only useful if we want to build a referential structure (or semantic network, as it is called in this project) on top of a legal content store. It will also give us the opportunity to support legislative drafters or writers of commentaries when writing their texts. We have planned to develop an open source based editing environment that enables the writing of such legal texts. In such an editing environment we will use the reference parser as a basis for functionality such as reference checking and automated completion.

Footnote 9. It is advisable to refrain from this type of manifestation dependent referencing.
Footnote 10. This is one of the reasons we have started such a standardisation process within Europe based on METAlex as an interchange format. For more information see: http://www.cenorm.be/cenorm/businessdomains/businessdomains/isss/activity/ws_metalex.asp

Acknowledgements

We thank the Dutch Tax and Customs Administration for offering us the opportunity to test our theories in a very interesting practical situation. We also want to thank the European Commission for having started the 6th framework programme and supporting our Estrella project (footnote 11).

Footnote 11. See www.estrellaproject.org

References

[1] Winkels, R., Boer, A., de Maat, E., van Engers, T., Breebaart, M., and Melger, H. Constructing a semantic network for legal content. In: A. Gardner (ed), ICAIL-2005: Proceedings of the Tenth International Conference on Artificial Intelligence and Law, pp. 125-140. ACM Press (2005).
[2] Palmirani, M., Brighi, R., and Massini, M. Automated Extraction of Normative References in Legal Texts. In: G. Sartor (ed), ICAIL-2003: Proceedings of the 9th International Conference on Artificial Intelligence and Law, pp. 105-106. ACM Press (2003).
[3] Webber, B. So What Can We Talk about Now? In: M. Brady and R. Berwick (eds), Computational Models of Discourse, pp. 331-371. MIT Press, Cambridge, MA (1982).
[4] Spinosa, P. Identification of Legal Documents through URNs. In: O. Signore and B. Hopgood (eds), Proceedings of the Euroweb 2001 Conference "The Web in Public Administration". Felici, Pisa (2001).
[5] Boer, A., Hoekstra, R., Winkels, R., van Engers, T. and Willaert, F. METAlex: Legislation in XML. In: T. Bench-Capon et al. (eds), Legal Knowledge and Information Systems. Jurix 2002, pp. 1-10. Amsterdam, IOS Press (2002).
[6] de Maat, E. and van Engers, T. Mission impossible?: Automated norm analysis of legal texts. In: D. Bourcier (ed), Legal Knowledge and Information Systems. Jurix 2003, pp. 143-144. Amsterdam, IOS Press (2003).
[7] Bolioli, A., Dini, L., Mercatali, P. and Romano, F. For the automated mark-up of Italian legislative texts in XML. In: T. Bench-Capon et al. (eds), Legal Knowledge and Information Systems. Jurix 2002, pp. 21-30. Amsterdam, IOS Press (2002).
