The rules for other simple references are similar, using a different label. In order to allow for "zoom-out references", the option of adding a reference to a higher level is added:
Here, the higher-level symbol represents the rules to match a book or a law. For a member, this is the rule to match an article, etc. Similarly, in order to allow for "zoom-in references", the option of adding references to lower levels is added.
Here, the lower-level symbol represents the rules to match a member, subpart or sentence. For a member, this is the rule to match a subpart or sentence, etc.
Finally, to allow multiple references, it is possible to add a list of designations and/or ranges of designations, each with their own sub-references.
The first optional section allows for ranges to be included instead of a single designation. The second optional section makes it possible to have a list of designations and ranges.
The grammar above needs to be expanded to allow for equivalent words and constructions. For example, the word "article" may be replaced by "articles", and when specifying a range, the words "up to and including" may be used instead of a dash. With those modifications, a grammar based on these derivation rules is able to recognise most references.
However, one problem arises. Names of laws (or of any other regulation) do not follow a clear pattern. There are few keywords: the only certainty is that the word "law" appears somewhere in the name, but not always as a separate word. The names can even contain commas and other names, which makes it even more difficult to separate the names from the surrounding text. Therefore, we have decided not to try to recognise names by matching them against a regular expression, but simply to compare the text with a list of names of laws. A drawback of this approach is, of course, that a name that is not on the list will not be recognised. In practice, it is possible to maintain a list of all published laws and regulations. However, within a law, local names may be defined (usually to abbreviate the official name). These names will be missed unless they are added to the list³.
Another problem is that simple patterns consisting of a label and a number are identical to the headings of articles and chapters (which are usually also indicated using a label and a number). In order to avoid detecting these headings as references, they should be marked as headings, not as actual text. Such markings are also necessary to resolve the references correctly, as will be discussed in the next section.
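The comparison of the text with a list of names can be illustrated with a minimal sketch. It is not the implementation used in our parser: the function names, the sample list and the example URIs are assumptions made for illustration only. Names are tried longest first, so that a name that contains another name is preferred when both start at the same position.

```python
import re

# Hypothetical excerpt of the name list; a real list would contain all
# published laws and regulations. The URIs are invented for this sketch.
LAW_NAMES = {
    "Destructionlaw": "urn:example:destructionlaw",
    "Law on Legal Support": "urn:example:law-on-legal-support",
}


def build_name_pattern(names):
    """Compile one pattern that matches any listed name literally."""
    # Longest names first: with regex alternation the first alternative
    # that matches at a position wins, so longer names take precedence.
    ordered = sorted(names, key=len, reverse=True)
    return re.compile("|".join(re.escape(name) for name in ordered))


def find_law_names(text, names=LAW_NAMES):
    """Return (offset, name, uri) for every listed name found in text."""
    pattern = build_name_pattern(names)
    return [(m.start(), m.group(0), names[m.group(0)])
            for m in pattern.finditer(text)]


hits = find_law_names(
    "See article 14 of the Law on Legal Support and article 2 of the Destructionlaw."
)
```

Local names defined within a law could be handled in the same way by adding them to the dictionary before the text is searched.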
³ The sentences in which alternative names are defined follow a strict format themselves. It is likely that we can solve this problem by using a two-pass method, first searching the document for any alternative name definitions and then searching it for references (using any local names found).
3. Resolving references
After a reference has been found in a text, it should also be resolved, meaning that the URI of the document that is referred to should be found. As has been discussed in [1], there are (at least) three different levels of reference to a specific regulation:
1. A reference to a work, referenced by its citation title (or date of publication, if no citation title exists).
2. A reference to a source. A source is a version of a work at a particular time.
3. A reference to a manifestation. A manifestation is a specific publication of a source. Two manifestations can differ in terms of, for example, medium, lay-out and comments.
References found in legislation always refer to a work; in case law and commentaries, references to sources can be found, and in commentaries sometimes even references to manifestations or to sources that never came into operation.
Constructing a URI for the work referenced should be done by resolving the name found by means of a resolver (such as an online database) or by simply reading the URI from the list attached to the name of the law. After this, the URI for the precise location in the document must be found. This can also be done by means of a resolver or, if the URI methodology supports it, the URI can be constructed using the base URI and the information found. For example, within the Norme In Rete project, it is prescribed that the identifier for "article 2" should be "art2", and that the complete identifier for article 2 of the Destructionlaw should be "#art2" appended to the URI for the Destructionlaw [4].
Things get a little more complicated if we do not have a complete reference. In that case, we do not have a name that points to a base document. Within the text of a law, however, such an incomplete reference means a reference within the current document. Thus, all that is needed is that we know the identity of the text that we are parsing, and we can resolve the reference in a way similar to resolving a complete reference. In order to establish the identity of the text, it will not always be sufficient to identify the law the text belongs to. A reference to "the first member" means "the first member of this article of this law". In order to resolve this reference, we need not only the name of the law being parsed, but also the designation of the current article.
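The construction of a URI from a base URI and the information found, including the use of the current location for incomplete references, can be sketched as follows. This is an illustration under assumptions, not our implementation: the "#art2" form follows the example above, whereas the "mem" suffix for members, the class and function names and the example URIs are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Context:
    """Identity of the text being parsed (hypothetical structure)."""
    law_uri: str             # URI of the law that contains the reference
    article: Optional[str]   # designation of the current article, if any


def resolve(ctx, law_uri=None, article=None, member=None):
    """Build a URI for a reference, filling gaps from the current context."""
    if law_uri is None:
        # Incomplete reference: it points into the current document.
        law_uri = ctx.law_uri
        if article is None and member is not None:
            # "the first member" means a member of the current article.
            article = ctx.article
    if member is not None and article is None:
        raise ValueError("a member reference needs an enclosing article")
    parts = []
    if article is not None:
        parts.append(f"art{article}")   # identifier form from the example
    if member is not None:
        parts.append(f"mem{member}")    # assumed form, for illustration
    return law_uri + ("#" + "-".join(parts) if parts else "")


ctx = Context(law_uri="urn:example:destructionlaw", article="5")
complete = resolve(ctx, law_uri="urn:example:destructionlaw", article="2")
incomplete = resolve(ctx, member="1")
```

In this sketch, `complete` addresses article 2 of the named law, while `incomplete` is resolved against article 5, the article in which the reference was found.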
This means that the input document for a reference parser should contain sufficient information on the structure of the document. It helps if this structure is already made explicit, as in an XML document. Our parser works on METAlex documents⁴.
Another group of references that is somewhat harder to resolve are the anaphors. From the point of view of resolving the reference, the anaphors come in three groups. The first group consists of references to the current text: "this article", "this law". Such references are easily resolved if the identity of the current location is known, as discussed above. The second group of anaphors refers to an earlier point in the text, such as "the previous article". These can be resolved using structure information as well (though they require that the parser not only knows its current location, but also keeps a (limited) history). Finally, there are anaphors that refer to an earlier reference, for example "that article", which refers not to the current article but to an article that was mentioned earlier in the text. Usually, this is the most recent reference to an article in the text. In order to resolve these references, the parser keeps a history of the references found so far. This history can be limited to the current piece of text, since such anaphors will not cross the boundaries of (for example) different articles⁵.
⁴ For a description of METAlex see www.metalex.eu [5].
⁵ Unlike in common natural language texts, where anaphoric references can be quite complicated to resolve; see e.g. [3].
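The three groups of anaphors can be illustrated with a small sketch that keeps two histories: one of the articles passed so far and one of the article references found in the current article. The class, its method names and the example URIs are assumptions made for this illustration; only the three example phrases are taken from the text above.

```python
class AnaphorResolver:
    """Minimal sketch of anaphor resolution using two histories."""

    def __init__(self):
        self.articles_seen = []  # structural history, in document order
        self.mentioned = []      # article references found in the current article

    def enter_article(self, uri):
        """Called when the parser enters a new article of the document."""
        self.articles_seen.append(uri)
        # Anaphors do not cross article boundaries, so the reference
        # history is restarted for every article.
        self.mentioned = []

    def record_reference(self, uri):
        """Called for every resolved reference to an article."""
        self.mentioned.append(uri)

    def resolve(self, phrase):
        """Return the URI an anaphor points to, or None if unknown."""
        if phrase == "this article":          # group 1: current location
            return self.articles_seen[-1] if self.articles_seen else None
        if phrase == "the previous article":  # group 2: earlier location
            if len(self.articles_seen) >= 2:
                return self.articles_seen[-2]
            return None
        if phrase == "that article":          # group 3: earlier reference
            return self.mentioned[-1] if self.mentioned else None
        return None


resolver = AnaphorResolver()
resolver.enter_article("urn:example:law#art1")
resolver.enter_article("urn:example:law#art2")
resolver.record_reference("urn:example:otherlaw#art14")
```

After these calls, "this article" resolves to article 2, "the previous article" to article 1, and "that article" to the most recently mentioned article 14 of the other law.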
Table 1. Results of the reference parser applied to six randomly chosen Dutch laws
4. Results
In order to test our approach, a grammar has been constructed containing most of the patterns mentioned in section 1. Not included were the special cases for references containing exceptions to ranges and references constructed using "each time". We randomly selected six Dutch laws. The only additional requirements were that we wanted to include one law written before 1900 and one written between 1920 and 1949, since we expected that the language used to express references in those laws would differ from that of modern laws. We applied our parser and measured the number of correctly identified simple references, the number of missed simple references, the number of (partly) correctly identified complex references, the number of missed complex references and the number of skipped references to sources other than laws. Table 1 presents the results.
Almost all references were found correctly and completely (99% of the simple ones and 95% of the complex ones). The few misses were caused by labels, names or patterns missing from the grammar. The grammar can be corrected for this, and will be if those labels and patterns occur often enough⁶. However, there may always remain some patterns that are too rare to include.
False positives occur when one of the labels (such as "member") is used with a different meaning, for example, when "the first member" does not refer to the first member of an article, but to the first member of a certain committee discussed in the text. These false positives may be identified when trying to resolve them, as there is seldom a complete reference that the anaphor refers to. A reference to "the first member" (of a committee) may be proven to be a false positive if the current article does not include any members.
⁶ The missing names will be added in any case. The parser used a list of current laws, and some of the references found referred to repealed laws.
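The check on false positives during resolution can be sketched as a lookup in the document structure. The function name and the dictionary of member counts are hypothetical; rejecting a member number that exceeds the number of members in the article is an additional assumption that goes slightly beyond the case described above.

```python
def is_false_positive(member_number, article_id, members_per_article):
    """Flag a candidate reference such as "the first member" as a false
    positive when the current article cannot contain that member."""
    count = members_per_article.get(article_id, 0)
    # No members at all: the candidate cannot be a structural reference.
    # A number beyond the last member is treated the same way (assumption).
    return count == 0 or member_number > count


# Hypothetical structure information: number of members per article.
structure = {"art3": 0, "art4": 2}
committee_member = is_false_positive(1, "art3", structure)
real_member = is_false_positive(1, "art4", structure)
```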
5. Expanding it to Other Legal Sources
The results presented here are based on references from laws to laws and within laws. However, there are of course many more documents that refer to laws and regulations. As part of our research, various examples of such documents and of the references used in them have been studied, though most of the results of these studies have yet to be implemented and tested.
5.1. References from other types of documents
For the development of the Semantic Network for the DTCA, it was very important to study the legal commentaries supplied by various publishers. These commentaries link together case law, regulations and parliamentary reports and form an overview of the relevant information for a certain legal problem. In general, almost any document quoting a law uses the same method of addressing the law and its parts⁷. This means that the grammar that was developed for laws should also be able to detect references in other documents. Outside laws and regulations, unofficial names for laws are often used alongside the official names. This means that the list of names of laws will need to be expanded in order to parse these documents. Since there is no official source for these unofficial names, it may occur more often that a name cannot be recognised because it is not on the list. Other than that, few problems should occur.
There are, however, two important differences. First of all, depending on the type of document, the references do not point to a work (as is the case for a law), but instead refer to a source. For example, commentaries are based on the law at the moment of writing, and may no longer apply after certain changes have been made. The version of the law referred to is usually not mentioned in the text. Instead, it should be derived from the time of writing of the text. When resolving the reference, the work referred to can be found by using the methods described in section 3.
In order to find the correct source, a date must be provided. As stated before, in some cases, such a date is simply the conception date of the document. Thus, for a commentary, the conception date could be passed along to the parser to resolve the references⁸.
The second important difference is that in a document that is not structured using articles etc., an incomplete reference does not refer to a location within the referring document itself. Instead, an incomplete reference is an abbreviation of an earlier complete reference. For example, after introducing "article 14 of the Law on Legal Support", a writer may abbreviate it to "article 14". This may lead to confusion when a writer refers to more than one "article 14". In this case, the writer usually refers to the most recently mentioned "article 14". Resolving these references requires a list (history) of the references found earlier in (that section of) the source document.
⁷ Actually, in many cases the people who write legal commentaries for publishers are the same people who write the draft legislation at one of the ministries, or who work at one of the law enforcement organizations as legal experts.
⁸ It may be better to consider commentaries to refer to ranges of versions of a law (or of individual articles or members), as a commentary may apply to several versions of a law.
5.2. References to other types of documents
Several other types of legal documents, such as Royal Decrees, decisions and court cases, are also referred to in legal texts. If structured, such documents follow structures similar to that of a law, and their subparts are referred to using the same formulations. In general, most of these documents are not named. Instead, they are identified by a number and/or date (and sometimes venue) of publication. This means that these references require additions to the grammar. For each type of document, a couple of different reference formats exist. However, once these additions to the grammar have been made, there will be less need for maintenance, as there is no need to continue adding new names to the list.
As said, not all documents are structured using divisions like chapter or article. Lacking structural identifiers, writers will sometimes refer to certain pages. This makes such a reference a reference to a manifestation, as the pages and their numbering depend on (among other things) the lay-out. In itself, this is not a problem, since we are able to construct a URI for the manifestation. However, we might be interested in a link to the source. This means that for these sources, publishers creating a new manifestation of a source must strive to maintain the page information of the original manifestation. This is a requirement that cannot be fulfilled using most currently available formats⁹.
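The selection of a source by date, as discussed in section 5.1, can be sketched as a lookup in a list of versions ordered by the date on which each version came into force. This is an illustration under assumptions: the function name, the representation of versions as (date, URI) pairs and the example URIs are hypothetical.

```python
from bisect import bisect_right
from datetime import date


def source_in_force(versions, on):
    """Return the URI of the version in force on the given date.

    versions: list of (start_date, uri) pairs, sorted by start_date.
    Returns None if the date lies before the first version.
    """
    starts = [start for start, _ in versions]
    # Index of the first version that starts after the given date;
    # the version just before it is the one in force.
    index = bisect_right(starts, on)
    return versions[index - 1][1] if index else None


# Hypothetical versions of one work.
versions = [
    (date(1990, 1, 1), "urn:example:law@1990-01-01"),
    (date(2001, 7, 1), "urn:example:law@2001-07-01"),
]
# A commentary conceived in 1995 refers to the 1990 version.
source = source_in_force(versions, date(1995, 3, 2))
```

If a commentary applies to several versions of a law, the same lookup could be repeated for the first and last date of the period concerned.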
⁹ It is advisable to refrain from this type of manifestation-dependent referencing.
Conclusion
We have described a parser that automatically finds references in and between Dutch laws and legislation. A test on six very diverse Dutch laws showed an accuracy of 95-99% and hardly any false positives. In the Norme In Rete project, a similar result was achieved on a much larger, but less diverse, Italian corpus [2]. Their parser found 85% of the references, but only 35% could be resolved. We can resolve every reference found. The parsing technique used is simple but effective. As soon as we need to go deeper into the meaning of the texts to be parsed, we will have to resort to natural language techniques and grammars, as we and others have done before [6][7].
The approach is general enough to extend to other legal sources like case law and legal commentaries, but it remains to be seen whether we can achieve as high a success rate as with laws. The basis for the success with laws is the strict and predictable way in which legal drafters refer to and within legislation, and the structured nature of the documents we are dealing with. Both these advantages are weaker for case law and even more so for commentaries. However, some of the problems we have to deal with will disappear with further standardisation. If the structure of legal documents and the way they represent references are standardised, integration of sets from different origins will be straightforward¹⁰.
The approach developed is not only useful if we want to build a referential structure (or semantic network, as it is called in this project) on top of a legal content store. It will also give us the opportunity to support legislative drafters or writers of commentaries when writing their texts. We plan to develop an open source editing environment for writing such legal texts. In such an editing environment we will use the reference parser as a basis for functionality such as reference checking and automated completion.
¹⁰ This is one of the reasons we have started such a standardisation process within Europe based on METAlex as an interchange format. For more information see: http://www.cenorm.be/cenorm/businessdomains/businessdomains/isss/activity/ws_metalex.asp
Acknowledgements
We thank the Dutch Tax and Customs Administration for offering us the opportunity to test our theories in a very interesting practical situation. We also want to thank the European Commission for having started the 6th framework programme and supporting our Estrella project¹¹.
References
[1] Winkels, R., Boer, A., de Maat, E., van Engers, T., Breebaart, M. and Melger, H. Constructing a semantic network for legal content. In: A. Gardner (ed.), ICAIL-2005: Proceedings of the Tenth International Conference on Artificial Intelligence and Law, pp. 125-140. ACM Press (2005).
[2] Palmirani, M., Brighi, R. and Massini, M. Automated Extraction of Normative References in Legal Texts. In: G. Sartor (ed.), ICAIL-2003: Proceedings of the 9th International Conference on Artificial Intelligence and Law, pp. 105-106. ACM Press (2003).
[3] Webber, B. So What Can We Talk about Now? In: M. Brady and R. Berwick (eds.), Computational Models of Discourse, pp. 331-371. MIT Press, Cambridge, MA (1982).
[4] Spinosa, P. Identification of Legal Documents through URNs. In: O. Signore and B. Hopgood (eds.), Proceedings of the Euroweb 2001 Conference "The Web in Public Administration". Felici, Pisa (2001).
[5] Boer, A., Hoekstra, R., Winkels, R., van Engers, T. and Willaert, F. METAlex: Legislation in XML. In: T. Bench-Capon et al. (eds.), Legal Knowledge and Information Systems. Jurix 2002, pp. 1-10. IOS Press, Amsterdam (2002).
[6] de Maat, E. and van Engers, T. Mission impossible?: Automated norm analysis of legal texts. In: D. Bourcier (ed.), Legal Knowledge and Information Systems. JURIX 2003, pp. 143-144. IOS Press, Amsterdam (2003).
[7] Bolioli, A., Dini, L., Mercatali, P. and Romano, F. For the automated mark-up of Italian legislative texts in XML. In: T. Bench-Capon et al. (eds.), Legal Knowledge and Information Systems. Jurix 2002, pp. 21-30. IOS Press, Amsterdam (2002).
¹¹ See www.estrellaproject.org