Proceedings of the Workshop on Semantic Web technology for Law
June 8, 2007

at the International Conference on AI and Law ICAIL 2007 Stanford, USA

Organization
• Michel Klein, VU University Amsterdam
• Paulo Quaresma, Universidade de Évora, Portugal
• Núria Casellas, Institute of Law and Technology, UAB, Spain

Program Committee
• Richard Benjamins, ISOCO, Spain
• Danièle Bourcier, CNRS/Université Paris, France
• Núria Casellas, Institute of Law and Technology, UAB, Spain
• Bob DuCharme, Innodata Isogen, USA
• Roberto García González, Universitat de Lleida, Spain
• Rinke Hoekstra, Leibniz Center for Law, University of Amsterdam, the Netherlands
• Michel Klein, VU University Amsterdam, the Netherlands
• Arno R. Lodder, Centre for Electronic Dispute Resolution, the Netherlands
• John McClure, Legal-RDF.org / Hypergrove Engineering, USA
• Henry Prakken, Department of Information and Computing Sciences, Utrecht University / Faculty of Law, Groningen University, the Netherlands
• Paulo Quaresma, Universidade de Évora, Portugal
• Peter Spyns, Vrije Universiteit Brussels, Belgium
• Daniela Tiscornia, Institute of Legal Information Theories and Techniques, Italian National Research Council
• Joshua H. Walker, CodeX, Stanford University
• Radboud Winkels, Leibniz Center for Law, University of Amsterdam, the Netherlands

Website http://www.cs.vu.nl/~mcaklein/SW4Law/

Contents

Regular papers

ACILA - Automatic Detection of Arguments in Legal Cases
Raquel Mochales and Marie-Francine Moens

Mapping legal cases to RDF named graphs using a minimal deontic ontology for computer-assisted legal querying
Alan Abrahams and David Eyers

EGODO and Applications: sharing, retrieving and exchanging legal documentation across e-Government
Fernando Ortiz

MetaVex: Regulation Drafting meets the Semantic Web
Saskia van de Ven and Rinke Hoekstra

Semantic Case Law Retrieval – Findings and Challenges
Elisabeth M. Uijttenbroek, Michel Klein, Arno R. Lodder, Frank van Harmelen and Paul Huygen

Using an Extended Argumentation Framework based on Confidence Degrees for Legal Ontology Mapping
Cássia T. dos Santos, Paulo Quaresma and Renata Vieira

Accelerating Semantic Search with Application of Specific Platforms
Marius Monton, Jordi Carrabina, Carlos Montero, Javier Serrano, Xavier Binefa, Ciro Gracia, Mercedes Blázquez, Jesús Contreras, Emma Teodoro, Núria Casellas, Joan-Josep Vallbé, Marta Poblet and Pompeu Casanovas

Position papers

An Open Model for a Web-based Case Law Repository
Edward Bryant

Semantic Wikis for Law
John McClure


Regular Papers


ACILA - Automatic Detection of Arguments in Legal Cases

Raquel Mochales Palau
Katholieke Universiteit Leuven
Tiensestraat 41, B-3000 Leuven, Belgium
+32 16328713
[email protected]

Marie-Francine Moens
Katholieke Universiteit Leuven
Tiensestraat 41, B-3000 Leuven, Belgium
+32 16325383
[email protected]

ABSTRACT
This paper summarizes the work done during the first months of the ACILA project. The summary contains an overview of the project and the main problems in argumentation, as well as related previous research. It presents the first experiments done to automatically classify arguments in texts and the conclusions reached, as well as the approaches that will be followed next.

Categories and Subject Descriptors
I.2.1 [Artificial Intelligence]: Applications and Expert Systems - Cartography, Games, Industrial automation, Law, Medicine and Science, Natural language interfaces and Office automation.

General Terms
Algorithms.

Keywords
Information visualization and semantic annotation.

1. INTRODUCTION
In the legal domain the need for tools that automatically help digest the masses of textual information is enormous. Legal professionals are confronted with large quantities of legal cases, legislative and doctrinal texts, apart from the enormous amounts of written text produced when drafting investigation reports or police files.

Several aspects of information retrieval are especially relevant in the legal domain. Current research in this domain focuses mostly on the visualization or drafting of arguments (e.g., the Araucaria project [15]), on automatically generating outlines of legal memorandums (e.g., HYPO [3]), on indexing arguments according to their content (e.g., SMILE [6]), or on the visualization of defeasible argumentation (the ARGUMED project [20]). However, there are also other fields to explore, for example the automated recognition of argumentation structures, the automated recognition of arguments in texts, or the automatic classification of arguments according to their argument type (e.g., counter-argument, rebuttal). If we consider how much time a legal professional must invest in an information search to build or find a viable argument that supports or rejects his or her own or the opponent's claim, it is obvious that further investigation of the automatic detection of arguments is needed.

Notwithstanding the practical need in the legal domain, automatic detection and classification of arguments in a legal text entail many fundamental research questions. These questions regard the study of legal argumentation structures, the construction of a taxonomy of rhetorical discourse structures for the legal field, the natural language processing of legal texts, the automatic classification of arguments according to their rhetorical type, and the convenient and user-friendly visualization and presentation of the results to the user. These objectives are exactly what the ACILA project (Automatic detection and ClassIfication of arguments in Legal cAses - grant K.U.Leuven OT 03/06, 2006-2010) aims at.

2. ARGUMENTATION
Argumentation is concerned primarily with reaching conclusions through reasoning, that is, claims based on premises. In law it is used, primarily, to test the validity of certain kinds of evidence.

Most researchers describe an argument as a set of statements, consisting of a number of premises, a number of inferences, and a conclusion, which is said to have the following property: if the premises are true, then the conclusion must be true, or highly likely to be true. We restrict this definition as follows: an argument is a set of clauses, consisting of a number of premises, a number of inferences, and a conclusion, comprising at least two clauses in total. An argument proceeds from the premises to the inferences to the conclusion by employing a particular form of reasoning. If the reasoning is deductive, then the conclusion follows necessarily from the premises. If the reasoning is inductive, the argument may show only that the conclusion is highly likely to be true if the premises are true. Other forms of reasoning are also used, with corresponding variations in the precise sense in which the conclusion follows from the premises. Theoretical models of legal reasoning that represent argumentation structures in a logical formalism have been built in studies on legal reasoning by [4], [14], [17], [22] and others.
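This working definition maps directly onto a small data structure; the sketch below is our own encoding, not a formalization taken from the cited literature:

from dataclasses import dataclass

@dataclass
class Argument:
    premises: list      # clauses the argument proceeds from
    inferences: list    # intermediate reasoning steps
    conclusion: str

    def __post_init__(self):
        # the definition above requires at least two clauses in total
        if len(self.premises) + len(self.inferences) + 1 < 2:
            raise ValueError("an argument needs at least two clauses")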


3. ARGUMENT DETECTION
In real life, there are no easy mechanical rules to identify arguments, and people usually have to rely on the context in order to determine which are the premises and which the conclusions. However, the presence of certain indicators sometimes facilitates the detection of the premises or the conclusion. For example, if someone makes a statement and then adds "this is because…", it is quite likely that the first statement is a conclusion, supported by the statements that come afterwards. Given this lack of a formalized human methodology, our first approach was to treat automated detection as a classification problem [13]. We represented a sentence as a vector of features and trained a classifier on manually annotated examples. We defined generic features that could easily be extracted from the texts and studied their contribution to the classification of sentences as argumentative. The feature vectors of these training examples served as input for state-of-the-art pattern recognition algorithms, namely a multinomial naive Bayes classifier and a maximum entropy model.

The corpus used was extracted from the Araucaria project of the University of Dundee [15]. The Araucaria corpus is composed of two distinct data sets: a structured English set, collected and analysed according to a specific methodology, and an unstructured multi-lingual set of user-contributed analyses. The data was collected over a six-week period in 2003, during which a weekly regime of data collection harvested arguments from 19 newspapers (from the UK, US, India, Australia, South Africa, Germany, China, Russia and Israel, in their English editions), from 4 parliamentary records (UK, US and India), from 5 court reports (from the UK, US and Canada), from 6 magazines (UK, US and India), and from 14 further online discussion boards and "cause" sources such as Human Rights Watch and GlobalWarning.org. These sources were selected because they offered (a) a long-term online archive of material; (b) free access to archive material; and (c) a reasonable likelihood of argumentation. Each week, the first argument encountered in each source was identified and analysed by hand. Examples extracted from the Araucaria corpus can be found in Table 1.
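As a rough illustration of this sentence-level classification set-up, the sketch below (ours, not the project's implementation; it assumes scikit-learn and reduces the generic features to simple bag-of-words counts) trains a multinomial naive Bayes model on labelled sentences:

# Sketch: sentences labelled 1 (argumentative) / 0 (non-argumentative);
# features here are unigram/bigram counts, far simpler than the paper's.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

sentences = [
    "The pension rights should apply, because we lived together as husband and wife.",
    "The hearing took place on 12 March.",
]
labels = [1, 0]

model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(sentences, labels)
print(model.predict(["This is because the statute clearly forbids it."]))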

Table 1. Some examples from the Araucaria corpus

From online discussion boards: "I just wanted to make it clear, should it not have been crystal in my original post, I am not attacking TJ."

From parliamentary records: "Being a juror is more akin to voting, a civic duty that occurs in private."

The skill of distinguishing arguments from non-arguments is sophisticated and requires training: it is a typical learning outcome of an undergraduate critical thinking course. The analysis of arguments, including the categorization of text by an argumentation scheme, is more challenging yet, and faces the additional problem that multiple analyses may be possible (thereby reducing intercoder reliability). The Araucaria corpus analysis employed the rigorous scheme-based analysis approach of [15] to mitigate these problems.

The Araucaria corpus had 1899 sentences that contained an argument and 827 sentences without arguments. For our project we manually added 1072 sentences containing no argument, extracted from the same sources as the ones used for the Araucaria project. The main conclusion of this test, which reached 73.75% average accuracy, was that approaching the automated argument detection problem by taking into account only single sentences loses a large amount of inter-sentence information. For example, sometimes only one of two clauses from the same argument, contained in different sentences, can be detected, due to the lack of information about the connections between these sentences. Consider the sentences: "The government should apply in my case the same pension rights as in a married woman's case. We were never married, but I lived with him a long time as his wife and I called myself Mrs." The second sentence would be classified as a simple explanation, not as an argument, when the first sentence is not taken into account. How can we capture the information relating all the sentences that are part of an argument? Our research is currently based on finding the answer to this question.

4. ARGUMENTATION SEMANTICS
When thinking about how to find the possible relations between two sentences, an obvious answer is to check the semantic links between them. Pronouns, generalizations or specifications, negations, cue phrases or synonyms/antonyms could help to identify a hidden relation structure between all the sentences of a text. For example, Figure 1 shows one of the most obvious kinds of links between sentences, the one created by the appearance of the same word in different sentences.

Figure 1. Semantic Relations


However, finding these different kinds of phenomena is not an easy task. Co-reference has been studied for years and there are different approaches that could be used [7, 8]. But dealing with synonyms/antonyms or negative relations presents a major difficulty, because these problems are normally solved with ontologies, which normally do not contain all the possible relations between words. A common example of such an ontology is WordNet [http://wordnet.princeton.edu]. In the sentence of Figure 1, some relations, such as "reckless" vs. "bat out of hell" or "anyone" vs. "Tom", would be too hard to identify. These relations need world knowledge to conclude that they are synonyms or generalizations; the only way to collect this information in semantic ontologies is to take into account all the proper names and all the "common" expressions.


However, this would require constant updating, since the world is always changing.
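A lookup of this kind could be sketched as follows (our illustration, assuming NLTK's WordNet interface; the paper does not commit to a particular tool). It finds a pair like "reckless"/"heedless" but, as argued above, fails on pairs that require world knowledge:

# Return 'synonym' or 'antonym' if WordNet links some senses of the two words.
from nltk.corpus import wordnet as wn  # requires the NLTK 'wordnet' data

def wordnet_link(word_a, word_b):
    for syn in wn.synsets(word_a):
        for lemma in syn.lemmas():
            if lemma.name() == word_b:
                return "synonym"   # word_b names a lemma of a sense of word_a
            if any(ant.name() == word_b for ant in lemma.antonyms()):
                return "antonym"
    return None

print(wordnet_link("reckless", "heedless"))  # 'synonym' in standard WordNet
print(wordnet_link("reckless", "Tom"))       # None: needs world knowledge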

5. ARGUMENTATION STRUCTURE
To avoid the use of semantic ontologies, we arrived at our current approach: to perform a study of a text's argumentative structure, as others have studied the rhetorical structure of texts [11, 12, 21]. The rhetorical structure of a text is a main indicator of how information in that text is ordered into a coherent informational structure. One can argue that in argumentative texts the rhetorical structure is even more important than in other texts, since these texts have the purpose of persuading. In the past, efforts have been made to construct formalized models that can be easily implemented in natural language processing applications, one of the most successful of which is Rhetorical Structure Theory (RST). RST describes what parts or segments texts have and what principles of combination can be found to combine parts into entire texts. It assumes a text to have a hierarchical organization based on asymmetrical nucleus-satellite relationships. This means that pairs of adjacent elementary units combine into parent units or text spans, which are again recursively merged until a single unit spans the entire text. The constituent halves of any text span are linked together by a text structuring relation, which typically holds between a semantically more central unit, the nucleus, and a more peripheral one, the satellite (although there also exists a small set of multinuclear relations which consist of two or more equivalent units). Rhetorical relations are not mutually exclusive: according to Hovy [9], it is possible to assign different relations to the same text fragments. Although in theory the set of rhetorical relations is open, it is generally assumed that the number of relevant rhetorical relations is relatively small. The classification designed by Mann and Thompson [11] seems to have been accepted by a large part of the discourse analysis community. They distinguish the following rhetorical relations. Nucleus-satellite relations: evidence, justification, antithesis, concession, circumstance, solutionhood, elaboration, background, enablement, motivation, volitional cause, non-volitional cause, volitional result, non-volitional result, purpose, condition, interpretation, evaluation, restatement, summary. Multi-nuclear relations: sequence, contrast, joint. The result of a complete rhetorical analysis is a tree structure, every node of which represents a rhetorical relation. However, the nucleus-satellite distinction does not always correctly reflect the relations between the corresponding text units: [24] argue, and demonstrate with a set of 135 texts, that trees cannot adequately represent many coherence structures in natural language texts. The discourse structure of these texts contains various kinds of crossed dependencies as well as nodes with multiple parents; neither phenomenon can be represented using trees. These authors argue that a chain graph representation is better than a tree. In a labeled coherence chain graph, an ordered array of nodes represents the discourse segments; the order in which the nodes occur reflects the temporal sequencing of the discourse segments. Labeled directed or undirected edges represent the coherence relations that hold among the discourse segments.
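Such a labelled chain graph is straightforward to encode; the sketch below (our naming, not the notation of [24]) shows why crossed dependencies and nodes with multiple parents, which break a tree representation, pose no problem for a graph:

# Ordered segment nodes plus labelled edges for coherence relations.
class ChainGraph:
    def __init__(self, segments):
        self.segments = list(segments)  # order reflects temporal sequencing
        self.edges = []                 # (source index, target index, label)

    def add_relation(self, src, dst, label):
        self.edges.append((src, dst, label))

g = ChainGraph(["S0", "S1", "S2", "S3"])
g.add_relation(0, 2, "evidence")
g.add_relation(1, 3, "contrast")      # crosses the (0, 2) edge
g.add_relation(1, 2, "elaboration")   # S2 now has two parents, impossible in a tree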


RST may be very successful in providing a formalization of discourse structure, but it nevertheless has some inherent shortcomings that make it less suitable for detecting argumentative relations. For example:
- The theory has been primarily designed for analyzing monologue discourse [11], while argumentative texts have some inherent dialogical properties (e.g., the alternation of the plaintiff's and the defendant's arguments in a case text).
- Arguments are often constructed in a highly parallel arrangement in which the only applicable RST relation for many inter-premise and inter-argument spans is "joint" (which functions in RST as a catch-all for when other relations are inapplicable).
- RST relations can fail to capture argumentative relations and moves. Inferential structures (like Modus Tollens or Disjunctive Syllogism) and rhetorical figures (such as ad hominem or ad populum arguments) simply function at a different level from that captured by RST.

Despite these shortcomings, we believe that an improved RST based on a graph structure could help determine argumentation structures. However, the corpus extracted from the Araucaria project is not a good starting point for this approach, because it contains only the arguments and not their context. When detecting rhetorical relations we will work with a corpus containing legal reports from the European Court of Human Rights (ECHR). The ECHR was selected as a source because it offers (a) a long-term archive of material and (b) a high likelihood of argumentation.

The ECHR corpus will be a set of English reports, manually annotated. Currently we have collected around 33,000 English documents from the August 2006 and December 2006 editions. During the next months we will hire two legal experts to annotate as many documents as possible; we are planning to annotate around 14,000 lines. For each argument we will annotate where it starts and ends, as well as its structure; if an argument is nested inside another one, this will also be shown. Each clause of the document will be marked with the argumentative relation that it holds, with a special mark for clauses that hold no argumentative relation.

These ECHR documents have a clear structure: first an explanation of the legal process up to now, then a presentation of the facts, followed by the complaints, if any, and finally the applicable law and the court decision. Some of the reports also contain an individual opinion of one of the judges; this opinion is added to the end of the document as an independent paragraph and is clearly marked. There are quite long documents, with more than 300 sentences, and short documents, of between 40 and 60 sentences. The average size, however, is 145 sentences per document, where each sentence contains an average of 49 tokens, making a final average of 6912 tokens per document. The language used in these documents is clearly polite and formal; the appearance of "common or colloquial" expressions can be estimated as zero. Passive and active voice are mixed depending on the section of the document or the "speaker": for example, all literal references to what a person said are presented as passive statements, while the judges' conclusions, opinions and comments are presented in active voice.


Among all argumentation structures we will pay special attention to Toulmin's theory [19], in which he argued that arguments need to be analyzed using a richer format than the traditional one of formal logic, in which only premises and conclusions are distinguished. He proposed a scheme that, next to data and claim, distinguishes between warrant, backing, rebuttal and qualifier. The datum consists of certain facts that support the claim. The warrant is an inference license according to which the datum supports the claim, while the backing, in its turn, provides support for the warrant. A rebuttal provides conditions of exception for the argument, and the qualifier can express the degree of force that the datum gives to the claim by the warrant. As an illustration of this theory, see Figure 2, which shows Toulmin's analysis of the claim "Harry is a British subject" [20]. We want to base our work on this scheme because it reflects a basic structure for argumentation. In our first studies, our primary goal is to explore the possibilities and difficulties of automated text understanding with regard to argumentation detection. We believe that more advanced models of argumentation structures that incorporate specific and complex argument classes would only complicate our experiments and divert us from our primary goal.

Figure 2. Toulmin's analysis of "Harry is a British subject"

Toulmin's structure can be represented as a graph, where the nodes represent the clauses and the relations between them are warrant, claim, rebuttal, datum, backing or qualifier. However, this structure raises several questions: if the claim is the nucleus of the graph, what happens with arguments where the conclusion is implicit? What happens when a premise is part of two arguments? An argumentative text has a main claim, which is the nucleus of the main graph, but in a non-argumentative text there is no main graph, only a set of disconnected sub-graphs (Figure 3). These two representations are known as a connected graph and a set of connected graphs. A connected graph is a graph in which there is a path between every pair of vertices. There are various levels of connectivity, depending on the degree to which each pair of nodes is connected. In a connected graph it is possible to visit all the nodes starting from a single one and following the edges from one node to another, sometimes passing through a vertex more than once. There are some special algorithms that can only be applied to connected graphs. Is a connected graph easier to represent, visualize and also to visit?
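Testing this property is a standard breadth-first traversal; the following sketch (our illustration, with hypothetical clause names) checks whether every node of an argument graph is reachable, i.e. whether the text yields one connected graph rather than disconnected sub-graphs:

from collections import deque

def is_connected(nodes, edges):
    """edges: dict mapping each node to the set of its (undirected) neighbours."""
    if not nodes:
        return True
    seen, queue = set(), deque([next(iter(nodes))])
    while queue:
        n = queue.popleft()
        if n not in seen:
            seen.add(n)
            queue.extend(edges.get(n, ()))
    return seen == nodes

edges = {"claim": {"datum", "qualifier"}, "datum": {"claim", "warrant"},
         "warrant": {"datum", "backing"}, "backing": {"warrant"},
         "qualifier": {"claim"}}
print(is_connected({"claim", "datum", "warrant", "backing", "qualifier"}, edges))  # True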

Figure 3. Argumentative/Non-Argumentative Graphs

Once all these questions have been answered, one problem will still remain: how to automatically construct the argumentative graphs from a given text? [12] gives an automated approach to rhetorical tree construction using a shift-reduce algorithm. Our expectation is to find a similar algorithm able to automatically detect argumentative graphs.

6. SEMANTIC WEB
Concluding the previous section, we can state that we will construct a (semi-)automatic annotation system comparable to the ones used in semantic web technologies. (Semi-)automatic annotation systems on the semantic web basically make use of NLP (Natural Language Processing) techniques to extract references in the text to certain concepts described in ontologies, and normally require the input of seed patterns or corpora of documents in order to train the system [16]. Our system will extract references in the text to argument units (claim, rebuttal, warrant, etc.) that are described in an argumentative taxonomy, in our case Toulmin's structure. With the extracted references to the units we will re-compose the full argument.



However, while we believe this approach will be successful in detecting arguments, we doubt that it can be seen as a generally defined and accepted taxonomy for argumentation, as we have no knowledge of any convention on which argument structure is the best representation for argumentation.

7. NATURAL LANGUAGE PROCESSING
Automated recognition of argumentative structures in a text first needs to deal with some natural language problems. A first problem regards boundary detection and segmentation. [18] remark that semantic/discourse segmentation is a notoriously under-researched problem, both at the sentence and the discourse level. They suggest segmentation based on the syntactic structure, but are vague on how to translate syntax into discourse segmentation. Argument segmentation presents similar problems to discourse segmentation and is still a problem we should work on. Humans use lexical cues, pronouns, other forms of phoric reference, and tense and aspect to correctly detect argumentation and to recognize the argument relations between the discourse segments [9]. A computer can also find these lexical cues, pronouns, and so on; however, the computer has trouble dealing with their ambiguity and with implicit relations [2]. Part of our work will be to deal with word ambiguity and missing or implicit information.


An automatic way to learn which lexical cues can be used to identify an argument is to use a learning corpus. [12] shows that it is possible to learn specific rhetorical cues from a corpus of training data, taking into account many more features than the ones commonly considered.
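As a toy contrast (our example; the cue list is hand-picked and deliberately naive), a rule-based spotter of the kind such learned cues would replace might look like this:

import re

# Flags sentences containing one of a few common argument indicators.
ARGUMENT_CUES = re.compile(
    r"\b(because|therefore|thus|since|consequently|it follows that)\b",
    re.IGNORECASE)

def has_argument_cue(sentence):
    return bool(ARGUMENT_CUES.search(sentence))

print(has_argument_cue("This is because the applicant was never notified."))  # True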

8. VISUALIZATION
Once the argumentative relations are detected, and the relations between them classified, an important issue is to define a representation that is suitable to capture the argumentative structure, both intra- and inter-argument, and that allows easy access to the information or additional computations. Our first approach will be to highlight the arguments found in the text; this highlighting can also be accompanied by the probability that the highlighted text is an argument. The second step will go further and also present the argument structure of the text, that is, the relations between the different clauses and between the different segments, which should be valuable for a legal professional when searching and assessing arguments for his or her legal case at hand.

9. CONCLUSIONS
To initiate the research on argument detection and classification in text, rhetorical structure analysis seems to us the most logical way.

A theory of the discourse structure of legal argumentation is currently lacking [5], while there have been several studies of rhetorical discourse structures in general texts. Furthermore, in order to make any system for argument recognition in legal texts commercially viable, it needs to be grounded in a generic framework, and of the options studied so far we think rhetorical structure analysis is the one providing the best generic methodology for legal text analysis. Still, we expect to find some problems due to implicit arguments that depend on external knowledge, but we will try to acquire this knowledge automatically from large text corpora.

10. REFERENCES
[1] Aleven, V. Teaching Case-Based Argumentation Through a Model and Examples. Ph.D. Dissertation, University of Pittsburgh, Pittsburgh, PA, 1997.
[2] Allen, J. Natural Language Understanding. Benjamin-Cummings Publishing Co., Inc., Redwood City, CA, 1995.
[3] Ashley, K. Reasoning with Cases and Hypotheticals. MIT Press, Cambridge, MA, 1990.
[4] Bench-Capon, T.J.M. Argument in artificial intelligence and law. Artificial Intelligence and Law, 5(4):249–261, 1997.
[5] Branting, K. Reasoning with Rules and Precedents: A Computational Model of Legal Analysis. Kluwer Academic Publishers, Dordrecht/Boston, MA, 2000.
[6] Bruninghaus, S. and Ashley, K.D. Reasoning with textual cases. Proceedings of the 4th International Conference on Case-Based Reasoning (ICCBR-05), Chicago, IL, 2005.
[7] Cardie, C. and Ng, V. Improving machine learning approaches to coreference resolution. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, 2002.
[8] Cardie, C. and Wagstaff, K. Noun Phrase Coreference as Clustering. Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC), Maryland, MD, 1999.
[9] Hovy, E.H. Automated discourse generation using discourse structure relations. Artificial Intelligence, 63(1-2):341–385, 1993.
[10] Kintsch, W. Comprehension: A Paradigm for Cognition. Cambridge University Press, Cambridge, England, 1998.
[11] Mann, W. and Thompson, S. Rhetorical structure theory: Toward a functional theory of text organization. Text, 8(3):243–281, 1988.
[12] Marcu, D. The Theory and Practice of Discourse Parsing and Summarization. MIT Press, Cambridge, MA, 2000.
[13] Moens, M.-F., Boiy, E., Mochales, R. and Reed, C. Automatic detection of arguments in legal texts. Proceedings of the 11th International Conference on Artificial Intelligence and Law (ICAIL), Stanford, CA, 2007.
[14] Prakken, H. Analysing reasoning about evidence with formal models of argumentation. Law, Probability and Risk, 3:33–50. Oxford University Press, Oxford, UK, 2004.
[15] Reed, C. and Rowe, G. Araucaria: Software for argument analysis, diagramming and representation. International Journal of AI Tools, 14(3-4):961–980, 2004.
[16] Sanchez-Fernandez, L. and Fernandez-Garcia, N. The Semantic Web: Fundamentals and a Brief State-of-the-Art. Upgrade, The European Journal for the Informatics Professional, 6(6):5-11, December 2005.
[17] Sartor, G. Legal reasoning, a cognitive approach to the law. In A Treatise of Legal Philosophy and General Jurisprudence, Volume 5, 2005.
[18] Soricut, R. and Marcu, D. Sentence level discourse parsing using syntactic and lexical information. Proceedings of the Human Language Technology and North American Association for Computational Linguistics Conference (HLT/NAACL), Edmonton, Canada, 2003.
[19] Toulmin, S.E. The Uses of Argument. University Printing House, Cambridge, UK, 1958.
[20] Verheij, B. Virtual Arguments: On the Design of Argument Assistants for Lawyers and Other Arguers. TMC Asser Press, The Hague, the Netherlands, 2005.
[21] Thompson, S., Matthiessen, C. and Mann, W. Rhetorical structure theory and text analysis. In Discourse Description: Diverse Linguistic Analyses of a Fund-Raising Text, 39–78, 1992.
[22] Walton, D. Argumentation Methods for Artificial Intelligence in Law. Springer, 2005.
[23] Winograd, T. Understanding Natural Language. Academic Press, Inc., Orlando, FL, 1972.
[24] Wolf, F. and Gibson, E. Coherence in Natural Language. MIT Press, Cambridge, MA, 2006.


Mapping legal cases to RDF named graphs using a minimal deontic ontology for computer-assisted legal inference

Alan S. Abrahams
Wharton School, University of Pennsylvania
[email protected]

David M. Eyers
University of Cambridge Computer Laboratory
JJ Thomson Avenue, Cambridge, United Kingdom
{firstname.lastname}@cl.cam.ac.uk

Jean Bacon
University of Cambridge Computer Laboratory
JJ Thomson Avenue, Cambridge, United Kingdom
{firstname.lastname}@cl.cam.ac.uk

ABSTRACT
The complexity of natural language employed within legal documents will leave their meaning opaque to automated computer interpretation for many years to come. Even so, this paper proposes that informative inference of deontic state can still be performed over a simplified encoding of legal semantics. We discuss how to map numerous deontic concepts into our minimal encoding. We use RDF named graphs to independently represent the numerous conflicting perspectives on the state of affairs being examined within a legal case. Inference over these named graphs is performed using an implementation of event calculus. We encode the key parts of a commercial case from the High Court of South Africa, and demonstrate that useful inferences can be formed on the basis of the minimal encoding.

1. INTRODUCTION

Justice and the law, internationally, are immense resource consumers in terms of staff time, money, and both electronic and physical records. Increasingly, legal information is publicly available through the World Wide Web: many countries provide summarised representations of their courts’ cases. As Semantic Web technologies mature, there are significant increases in the potential for computers to be able to correlate legal material from distributed information sources. By better organising public legal data, we believe that it will become more accessible to the public, as well as increasing the efficiency of legal practitioners. Compared to the explosive expansion of the Web, though, the Semantic Web is growing more slowly. It is unlikely that

ontologies and parsing techniques will be widely available to encode existing legal records for many years yet. However, we argue that significant benefit to searching and analysis of legal documents can be gained even if only a minimal encoding of legal semantics is applied to cases. This paper presents such a minimal encoding, and discusses how to map more complex deontic concepts into it. Inference of deontic state at points in time is performed using an implementation of the Event Calculus [29]. This implementation operates over data stored using RDF [33]. More specifically, we use RDF named graphs [34, 8]. This paper is organised as follows. Section 2 describes the semantic web technology we employ. We then provide a brief introduction to the simplified event calculus in section 3. An overview of related research is presented in section 4. Section 5 introduces our minimal deontic ontology. A worked example is described in section 6. Finally, section 7 provides concluding remarks.

2. RDF NAMED GRAPHS

The Resource Description Framework (RDF – [33]) provides a straightforward, expressive means for knowledge representation. Each RDF statement has a subject, a predicate and an object. The instances of concepts need names that are defined with global scope and decentralised ownership. The World Wide Web Universal Resource Identifier (URI – [6]) system is used to meet both requirements. URIs contain (among other parts) a domain name portion, e.g. vt.edu, and a local path. Domain names are owned by particular organisations and can be used to locate the servers of an organisation unambiguously within the Internet. The local paths are under the control of the owner of the domain name, and thus can be defined at will. URIs themselves are just names, but by convention their global lookup process can be used to confirm they are genuine – for example returning human or machine readable information should the URIs be requested using a web browser. Each subject and predicate of an RDF statement is a URI,


and statement objects can be either a URI or a literal value. The example below indicates that the title of RFC 3870 is "application/rdf+xml Media Type Registration". We use 'S', 'P' and 'O' to label the RDF statement's subject, predicate and object respectively. Note that the concept of "title" in this RDF statement is drawn from the Dublin Core ontology.

S: http://www.ietf.org/rfc/rfc3870.txt
P: http://purl.org/dc/elements/1.1/title
O: "application/rdf+xml Media Type Registration"

This paper does not focus on implementation-level RDF concerns; indeed, our prototype evaluations are written in Prolog and use an RDF-compatible, functional term representation of our data. Our requirement was for an encoding of directed graphs with named edges: RDF can easily encode such structures. Unlike the ideal global semantic web, however, we are encoding data that is highly subjective, transactional and possibly contradictory. Clearly, court cases frequently arise from parties having conflicting beliefs. Also, the submissions received within legal proceedings are sets of statements presented together, and thus are transactional. To support the representation of this sort of data, we use RDF named graphs [34]. RDF named graphs introduce an extra term to each statement, creating a quad instead of a triple. Three of the elements of this quad are the usual elements of RDF triples; the fourth element allows grouping of sets of statements. For example, we could rephrase the above RDF triple in named graph form ('G' labelling the graph):

G: http://vt.edu/people/abra/claim/20070501a
S: http://www.ietf.org/rfc/rfc3870.txt
P: http://purl.org/dc/elements/1.1/title
O: "application/rdf+xml Media Type Registration"

By referring to this RDF statement using the graph name, we can reason about an individual at Virginia Tech's claim that RFC 3870's title is as shown. It is worth pointing out, however, that there is some interplay between RDF named graphs and the implementation of the RDF store in use. Strictly, if the RDF Schema [35] semantics are added to storage and inferencing components, the notion of closed lists can be used to describe named graphs. However, the URIs that describe the elements of any such named graph can only build the graph's statements through reification, so the representation ends up somewhat bulky and inconvenient. We note that this is a similar but more extreme version of the situation with typed RDF literals, which can also be implemented using standard RDF statements and untyped literals, provided that the RDF engine understands the conventions employed. There is no restriction on RDF statements using graph-name URIs in their subjects, objects and predicates (although the former two contexts would probably be more intuitive than the latter). Indeed, we employ statements about the containment of named graphs within other named graphs to effect collections of RDF statements and named graphs, in a manner similar to the discrete "files" used in most computer software.
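Purely as an illustration (the authors' prototype uses a Prolog term representation, not this library), the same quad can be stored and queried with Python's rdflib, whose Dataset type implements named graphs:

from rdflib import Dataset, URIRef, Literal

ds = Dataset()
claim = ds.graph(URIRef("http://vt.edu/people/abra/claim/20070501a"))
claim.add((
    URIRef("http://www.ietf.org/rfc/rfc3870.txt"),
    URIRef("http://purl.org/dc/elements/1.1/title"),
    Literal("application/rdf+xml Media Type Registration"),
))

# Each stored statement is effectively a quad: subject, predicate, object, graph.
for s, p, o, g in ds.quads((None, None, None, None)):
    print(s, p, o, g)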


Accelerating Semantic Search with Application of Specific Platforms
Marius Monton, Jordi Carrabina, Carlos Montero, Javier Serrano, Xavier Binefa, Ciro Gracia, Mercedes Blázquez, Jesús Contreras, Emma Teodoro, Núria Casellas, Joan-Josep Vallbé, Marta Poblet and Pompeu Casanovas

    Q := V[G]
    while Q is not an empty set
        u := Extract_Max(Q)
        for each vertex v adjacent to u
            if d[u] + w(u,v) > d[v]
                d[v] := d[u] + w(u,v)
                previous[v] := u

Figure 1: Dijkstra's Algorithm

That condition is actually not valid on our ontologies (where the ratio is around 1 to 10), so other techniques are needed to improve this algorithm. The process to cover an ontology of 10,000 nodes (with different connectivity degrees) might take from 9 hours up to 4 days. Improving the computational time would result in a more efficient and cheaper application of semantic technologies. In this paper we outline the development of computational acceleration for ontological searches using application-specific embedded systems, based on hardware platforms and FPGAs (Field Programmable Gate Arrays) or CPLDs (Complex Programmable Logic Devices).

Figure 2: Execution time of the algorithm according to the number of nodes (axes: time in seconds against number of nodes, from 10 to 500)

3. INTRODUCTION TO RECONFIGURABLE DEVICES
The computational platforms used for this kind of application have evolved over time, experimenting with different architectures (servers, meshes of processors, clusters, etc.). But complex indexing and search problems represent a broad spectrum of algorithms that combine different computations with specific platform requirements, which demands architectures that support such heterogeneity. When planning the architecture of a general-purpose machine, there is the additional requirement of supporting a number of different applications on the same platform, leading the whole system to be programmable. Our purpose is the development of a system capable of prototyping the implementation for a specific problem, semantic search, with reconfigurable devices.

A system like this is configured at compilation time from a description written in a high-level language, partitioned into processes to be executed on different resources, ranging from processors to application-specific hardware resources. This process of describing both hardware and software is based on a set of design methodologies known as "hardware-software codesign". In recent years, due to increasing device size and density, FPGAs can not only implement hardware modules but can also contain several processors (soft-core processors).

4. ACCELERATION SCHEME

Our acceleration platform will be based on a PCI expansion card for a standard PC. This PCI card will contain an FPGA as the main computational device and the amount of memory required for complex problems. The expansion board will accelerate the process of finding the next suitable vertex and keeping the edge set updated. Due to the amount of memory needed to store all vertices and edges, external memory is used. The FPGA will access this memory, and it will also be accessible to the SW application (usually running on a PC) through a standard Application Programming Interface (API). Using this API, the current Dijkstra code (included in the SW application) will call the Extract_Max function, and this function will be executed on our HW accelerator platform. This way, the most costly function will be implemented in HW, with a fast execution time, running in parallel to the rest of the SW application. There are three possible approaches to implementing the Extract_Max function on our platform, using different computational structures: (i) sorted array, (ii) sorted linked list and (iii) heap.

4.1 Sorted array

Inside the FPGA it is possible to store only a small amount of information. The design should allow keeping an array permanently sorted: insertion is done in the right place, and the rest of the array is shifted accordingly. This implementation is fast (one cycle per insertion): its complexity is O(1) for insertions and Extract_Max, but it carries a great penalty on the delete and update functions, which are O(|E|), and it cannot store large amounts of data due to internal FPGA memory limitations.

Figure 3: Sorted Array

Figure 4: FPGA basic blocks (logic blocks, routing, SRAM, memory blocks, I/O blocks)

4.2 Sorted linked list
A linked list can be implemented using memory external to the FPGA. In this implementation, the FPGA is in charge of accessing the external memory to ensure that the linked list is always sorted. The FPGA finds the insertion point by exploring the list and then modifies the list to insert, update or delete a node. The complexity of this operation is O(|E|).

Figure 5: Linked List

4.3 Binary Heap
In this approach, the FPGA is in charge of maintaining a binary heap of nodes. In this implementation, memory external to the FPGA is used to store the heap. For every operation, the FPGA needs to access the external memory several times, which means slower speed. This implementation has O(1) complexity for the Extract_Max function and O(log |E|) for insert and delete operations, and it can store a large number of edges thanks to the use of external memory. This last solution can be used to implement both costly functions (the Extract_Max function and the update of adjacent vertices), sharing the same storage structure. This method can be optimally implemented with an FPGA, using external memory to store the array of the binary heap.
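In software terms the bookkeeping is the familiar priority-queue pattern; the sketch below (ours, illustrating the stated complexities rather than the hardware design) uses Python's heapq, negating priorities since heapq is a min-heap:

import heapq

heap = []  # backing array; in the hardware design this lives in external memory

def insert(node, priority):            # O(log n)
    heapq.heappush(heap, (-priority, node))

def extract_max():                     # pop the maximum, O(log n) to restore the heap
    priority, node = heapq.heappop(heap)
    return node, -priority

insert("v1", 0.3); insert("v2", 0.9); insert("v3", 0.5)
print(extract_max())                   # ('v2', 0.9)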

Figure 6: Binary Heap

In these approaches, the main bottleneck will surely be the PCI transfer between the PC and our HW acceleration platform. For that reason, the next stage of development will involve the implementation of the complete algorithm in the reconfigurable platform, where the software only stores the graph into memory and the PCI board returns all possible relations between nodes.

5. CONCLUSIONS AND FURTHER WORK


In this paper we presented a set of proposals to improve ontology search, based on the implementation of reconfigurable devices. This research is currently being developed within the nationally


funded E-Sentencias Project, whose objective is the development of a software-hardware system for lawyers to manage the documentation connected to their legal cases and the related multimedia files. With these improvements, we will be able to do complex searches and relationship extractions in large ontologies in a few seconds instead of the current minutes or hours. Moreover, the plan to develop a full platform for managing ontologies can enhance the use of these technologies in new applications.

6. ACKNOWLEDGMENTS
E-Sentencias (E-Sentencias. Plataforma hardware-software de aceleración del proceso de generación y gestión de conocimiento e imágenes para la justicia) is a project funded by the Ministerio de Industria, Turismo y Comercio (FIT-350101-2006-26). A consortium of: Intelligent Software Components (iSOCO), Wolters Kluwer España, UAB Institute of Law and Technology (IDT-UAB), Centro de Prototipos y Soluciones Hardware Software (CEPHIS-UAB) and Digital Video Semantics (Dpt. Computer Science UAB).

7. REFERENCES
[1] Casellas, N., Jakulin, A., Vallbé, J.-J. and Casanovas, P. Acquiring an ontology from the text. In M. Ali and R. Dapoigny, editors, Advances in Applied Artificial Intelligence, 19th International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems (IEA/AIE 2006), Annecy, France, June 27-30 2006, Lecture Notes in Computer Science 4031, Springer, 2006: 1000-1013.
[2] McCarty, L.T. A language for legal discourse, I. Basic features. In Proceedings of the Second International Conference on Artificial Intelligence and Law, Vancouver, Canada, 1989.
[3] Stamper, R.K. The role of semantics in legal expert reasoning and legal systems. Ratio Juris, 4(2): 219-244, 1991.
[4] Stamper, R.K. Signs, Information, Norms and Systems. In B. Holmqvist and P. Andersen, editors, Signs of Work. De Gruyter, 1996.
[5] Valente, A. A Modelling Approach to Legal Knowledge Engineering. IOS Press, 1995.
[6] Breuker, J., Elhag, A., Petkov, E. and Winkels, R. Ontologies for legal information serving and knowledge management. In Legal Knowledge and Information Systems, Jurix 2002: The Fifteenth Annual Conference. IOS Press, 2002.
[7] Kralingen, R.W. van. Frame-Based Conceptual Models of Statute Law. Computer/Law Series, No. 16, Kluwer Law International, 1995.
[8] Gangemi, A., Pisanelli, D.M. and Steve, G. A formal ontology framework to represent Norm Dynamics. In Proceedings of the Second International Workshop on Legal Ontologies, 2001.
[9] Casanovas, P., Poblet, M., Casellas, N., Vallbé, J.-J., Ramos, F., Benjamins, V.R., Blázquez, M., Rodrigo, L., Contreras, J. and Gorroñogoitia, J. D10.2.1 Legal Case Study: Legal Scenario. Technical Report SEKT, EU-IST Project IST-2003-506826, 2004.
[10] Casanovas, P., Casellas, N., Vallbé, J.-J., Poblet, M., Benjamins, V.R., Blázquez, M., Peña-Ortiz, R. and Contreras, J. Semantic Web: A Legal Case Study. In J. Davies, R. Studer and P. Warren, editors, Semantic Web Technologies: Trends and Research in Ontology-based Systems. John Wiley & Sons, 2006.
[11] Benjamins, V.R., Casanovas, P., Gangemi, A. and Breuker, J., editors. Law and the Semantic Web: Legal Ontologies, Methodologies, Legal Information Retrieval, and Applications. Lecture Notes in Computer Science, Springer Verlag, 2005.
[12] Casanovas, P., Casellas, N., Vallbé, J.-J., Poblet, M., Ramos, R., Gorroñogoitia, J., Contreras, J., Blázquez, M. and Benjamins, V.R. Iuriservice II: Ontology Development and Architectural Design. In Proceedings of the Tenth International Conference on Artificial Intelligence and Law (ICAIL 2005). Alma Mater Studiorum-University of Bologna, CIRSFID, 2005.
[13] Goos, G., Hartmanis, J. and van Leeuwen, J. Intelligent Search on XML Data: Applications, Languages, Models, Implementations, and Benchmarks, 125, 2003.


Position Papers


An Open Model for a Web-Based Semantic Case Law Repository

Edward T. Bryant
OpenGavel Project
http://www.opengavel.com
2730 N. Sawyer Ave., Chicago, Illinois 60647
[email protected]

ABSTRACT
A web-based repository of U.S. court decisions that is free to the public and based on open standards would encourage a wider use of semantic technologies and increase innovation. This paper articulates some of the reasoning behind the need for such a repository, its potential benefits, and what elements of the proposed repository would be most important in meeting its goals. Lastly, the paper describes the OpenGavel project, a current effort to implement an open case law repository, and highlights some of the remaining obstacles to its success.

Keywords
Semantic web, openness, case law, public access, open access, standardization.

1. INTRODUCTION
Although various debates surrounding software development and intellectual property policy have focused on the concepts of "openness" and "freedom" (e.g., open source software), these concepts have been relatively recent additions to debates concerning the information at the heart of other industries, such as the legal information industry. Yet, concern over public access to legal information is the primary reason behind the United States' current system of commercialized publishing of court decisions. The commercialization of case law publishing developed as a method by which printed court opinions could reach a wider distribution [4, 7]. The trade-off of this cost-shifting bargain was that primary and secondary sources of legal information were increasingly available only to legal professionals. Whatever the pros and cons of commercialized law publishing were in the print era, advances in technology have both lowered the costs of mass dissemination and increased the need for tools to navigate the vast amount of data [3]. The spread of information retrieval technologies has also resulted in continuing redefinitions of the public's expectations of "access" [1]. While the availability of a printed copy of a decision may have satisfied the need for public access in the past, the modern definition of access has become a moving target that shifts as the public incorporates new expectations, such as web availability, search capability, and other information retrieval technologies. Knowledge-centric navigation and other semantic technologies can be viewed as the latest in a long line of information technology tools that further push public expectations. By the mid-1990s, there was a growing recognition that these advances in information technology should result in more effective and equitable access to the law, and that greater consistency would allow for more use of artificial intelligence in


the retrieval of legal information [6]. Although significant attention has recently been given to open access publishing in the field of legal scholarship [1, 2, 3], the value of such scholarship in the development of the common law is clearly less than that provided by court opinions [1]. The work of professors Peter Martin and Tom Bruce at the Legal Information Institute at Cornell Law School (LII)1 has been largely responsible for what progress has been made in providing access to primary sources of legal information, including court opinions. The work of the LII has led to the spread of LII organizations in many countries around the world. Despite the progress of the LII and the increasing number of U.S. courts that now post their opinions on the web, inconsistencies among the methods and formats used to post court opinions continue to underscore the need for a combined open repository based on a common standard.

2. WHY DO WE NEED AN OPEN MODEL?

2.1 Reasons to be Pessimistic About the Current System
Absent an open case law repository, the implementation of semantic technologies in the U.S. legal field is largely dependent on the three major U.S. legal publishers2 and the issuing courts themselves. There are several reasons to be pessimistic that these organizations have sufficient incentives to promote the use of these technologies for the benefit of the general public. Some of these reasons include:

1 2



There is a underserved audience for legal materials that courts and legal publishers do not have an incentive to address [4, 2].



Courts have little incentive to dedicate resources toward technologies that do not directly apply to the court’s internal workflow (e.g., communicating between parties and the court, organizing case files, e-filing, etc.).



Consolidation of the legal publishing industry limits innovation by virtue of the number of publishers that maintain case law databases [1].



The target market of legal publishers narrows the type of research tools that they have an incentive to develop.

http://www.law.cornell.edu. The top three U.S. legal publishers are Thomson (Thomson West), Reed Elsevier (LexisNexis), and Wolters Kluwer (CCH / Loislaw).

53



Individual courts have little incentive to adhere to standards and the lack of uniformity creates a barrier to the extraction of semantic information.



The project should aim to create a retrospective archive within the chosen specialty (i.e., include decisions issued before cases began being posted online).

Even under the current system, there are several projects that are independently attempting to promote greater public access to the law (e.g., LII) or encourage greater use of semantic technologies (e.g., MetaLex, NormInRete, and Legal XML). Several of these projects, however, focus on statutory and regulatory law rather than common law court decisions.



The repository should be designed to allow stable and relatively simple inbound links from external sites.

2.2 Benefits of an Open Model Repository There would be several obvious benefits to the creation of an open model repository: •

Greater access to the law is commonly cited as a positive contribution to numerous civic goals, including the rule of law as well as fairness and equality in the legal system [8].



Lower barriers to innovation through the ability to easily gather standardized source documents during the experimentation and development of new semantic technologies.



Lower barriers involved in the study of legal concepts, court behavior, or citation patterns.



Encourage the use of semantic technologies to address the difficulty of understanding legal materials by nonprofessionals [4].

4.THE OBSTACLES AHEAD Several obstacles to an open access case law repository still remain. The primary obstacle is establishing ongoing financial resources to support the project. Although many automated techniques may be used to mine more recent court opinions, significant resources may be required to populate the repository with older cases. The project is currently exploring whether an adbased economic model can support the continued growth of such a repository. Additional obstacles include the finality of posted court decisions (e.g., slip opinions, etc.), the integration of neutral citation features [5], and the uncertainty of private publisher claims over rights to pagination and court opinion text [5].

5.REFERENCES



Provide a foundation for the development of web-based tools designed to navigate and retrieve legal information based on semantic technologies.

[1] Arewa, Olufunmilayo. Open Access in a Closed Universe: Lexis, Westlaw, Law Schools, and the Legal Information Market. Lewis & Clark L. Rev. vol. 10 (2006), 797-839. http://ssrn.com/abstract=888321.



Provide a foundation for the further development of more robust secondary legal materials on the web (e.g., legal commentary, specialized collections, etc.).

[2] Carroll, Michael W. The Movement for Open Access Law. Lewis & Clark L. Rev. vol. 10 (2006), 741-760. http://ssrn.com/abstract=918298.

3.CHARACTERISTICS OF THE OPEN MODEL CASE LAW REPOSITORY The OpenGavel project (http://www.opengavel.com) was started to apply an open model to the creation of a web-based case law repository. Given the goals and potential benefits of an open case law repository, several characteristics have become essential to its development: •

Documents should be marked-up in an XML language specifically designed for court documents.



The project should be set-up as a non-profit to eliminate the incentive to focus on any one target market.



The project should clearly relinquish any claim to rights in the court opinions through a public domain dedication.



54

In following these characteristics, OpenGavel has designed an XML language and posted a few dozen cases, each of which display a Creative Commons public domain dedication. The project has decided to focus on copyright cases as its initial specialty because it is an active topic of discussion on the web and the numerous blogs and related commentary should provide a varied audience for the material.

4. THE OBSTACLES AHEAD

Several obstacles to an open access case law repository still remain. The primary obstacle is establishing ongoing financial resources to support the project. Although many automated techniques may be used to mine more recent court opinions, significant resources may be required to populate the repository with older cases. The project is currently exploring whether an ad-based economic model can support the continued growth of such a repository. Additional obstacles include the finality of posted court decisions (e.g., slip opinions), the integration of neutral citation features [5], and the uncertainty of private publisher claims over rights to pagination and court opinion text [5].

5. REFERENCES

[1] Arewa, Olufunmilayo. Open Access in a Closed Universe: Lexis, Westlaw, Law Schools, and the Legal Information Market. Lewis & Clark L. Rev. vol. 10 (2006), 797-839. http://ssrn.com/abstract=888321.

[2] Carroll, Michael W. The Movement for Open Access Law. Lewis & Clark L. Rev. vol. 10 (2006), 741-760. http://ssrn.com/abstract=918298.

[3] Litman, Jessica. The Economics of Open-Access Law Publishing. John M. Olin Ctr. for Law & Economics paper #06-005 (2006). http://ssrn.com/abstract=912304.

[4] Martin, Peter W. Digital Law: Some Speculations on the Future of Legal Information Technology. NCAIR Sponsored Program on the Future of Legal Info. Tech. (May 1995). http://law.cornell.edu/papers/fut95fnl.htm.

[5] Martin, Peter W. Neutral Citation, Court Web Sites, and Access to Case Law. Cornell Law School research paper No. 06-047 (2007). http://ssrn.com/abstract=950387.

[6] Martin, Peter W. Legal Information: A Strong Case for Free Content. Prepared for Conference on Free Information Ecology, Info. L. Inst., N.Y. Univ. Sch. of Law (March 2000). http://www.law.cornell.edu/papers/free.rtf.

[7] May, Christopher. Between Commodification and ‘Openness’: The Information Society and the Ownership of Knowledge. Journal of Info. L. & Tech. (2005). http://www2.warwick.ac.uk/fac/soc/law/elj/jilt/2005_2/may/.

[8] Poulin, Daniel. Open access to law in developing countries. First Monday (Dec. 2004). http://www.firstmonday.org/issues/issue9_12/poulin/index.html.


Semantic Wikis for Law

John McClure ([email protected])
Legal-RDF.org / Hypergrove Engineering

Semantic Wikis represent a fundamental, generational shift in the production and distribution of information that has important implications throughout the practice of law. The requirements of semantic markup in general, within the context of Semantic Wikis in particular, are discussed in this paper to guide policies regarding the adoption of wiki technology by public and private agencies.

Historic Opportunity

Injecting semantic markup into legal documents – despite clear benefits in operational efficiencies, legal analysis capacities, and social transparencies – has real costs that can be difficult to justify. The remarkable popularity of wiki technology, however, presents an opportunity to resolve this dilemma.

A wiki is a web content management application whose typical interface is near ideal for use by non-technical authors. A wiki normally includes: first, an ability to create, modify, and discuss the contents of XHTML web pages (i.e., of ‘articles’ that are organized into ‘namespaces’); second, a simple page-naming syntax that is translatable into proper URLs when creating XHTML for the article; and third, a non-SGML syntax to mark up document components commonly needed by authors: headings, italicized and bold text, enumerated and bullet lists, and hyperlinks to internal or external locations. More sophisticated users can incorporate XHTML markup into their pages, which is preserved by the wiki server whenever it translates the page to XHTML.

A semantic wiki builds upon this simplicity by adding two features: first, an extension of the markup syntax that allows an author to name text strings within the displayed or hidden content of the page; and second, a syntax for queries over the named text strings. Semantic wikis can be distinguished by whether the names assigned to the text strings are emergently defined in a folksonomy or formally pre-defined in an ontology. Social wikis usually feature folksonomies, owing to the perceived requirements and skills of their user bases, while operational wikis feature their own ontology in order to flow information reliably between a wiki and conventional databases.

For several practical reasons, industry and governments have been reluctant to adopt XHTML or XML, much less RDF, for markup of their documents and reports. Wikis and semantic wikis, however, appear to offer an historic opportunity for a near-term change of heart that can (finally) lead to the exchange of semantically marked-up documents. Primarily driving this is the emerging appreciation that wikis can be more than just a web content management application. Rather, they can be a self-contained document management system that packages together, as a single unit, the complete record for a particular relationship. This view matters because it addresses an actual and significant current business problem: while standards exist to represent a single document (or page), there is no standard for representation of a set of documents. Wikis provide a path to a standard for packaging together all multi-page documents, single-page flyers and emails, and other material about a particular event, relation, contact, place, or other thing.

Because wikis in general are easy for an author to use, are becoming widely installed, and are a flexible, useful tool applicable in numerous domains, semantic wikis offer the chance to achieve the socially desirable goal of establishing semantic markup as a common business practice performed on one’s documents prior to their publication and distribution. No other approach seems feasible.
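To make the wiki constructs just listed concrete, here is how they look in the widely deployed MediaWiki dialect (shown purely as an illustration; the paper does not tie itself to any particular wiki engine):

    == A heading ==
    '''Bold''' and ''italic'' text.
    * A bulleted item
    # An enumerated item
    [[Another article]] and [http://example.org an external link]

Each page written this way is translated by the wiki server into XHTML for display – exactly the pipeline that the semantic extensions discussed below build upon.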

Smart Wikis

The tipping point for widespread buy-in to routine document markup using semantic tags probably won’t occur without integrating workstation-based wikis with word processors [note 1]. Authors need to retain the strong feeling of control they’ve historically had over their documents; they need to extend that control to the wikis that hold the document(s) they are authoring; and they need to use wiki-centered tools that require semantic tagging for their operation. A wiki application (with tools) that operates on file directories would fulfill this need.

A document wiki is a wiki that holds single-page articles and multi-page documents, both stored in wiki-style syntax and translated to XHTML for display. A smart wiki is a document wiki that has been enhanced with an embedded ontology; i.e., a smart wiki is a semantic wiki that can manage multi-page documents as easily as single-page articles. (Social wikis, on the other hand, seem to have little need to store documents with their single-page articles.)

An ontology distinguishes a smart wiki from a document wiki and is integrated in three ways. First, authors can reference only terms defined in the ontology when tagging a document, an article page, or the text, image, or sound content of a document or article page. Second, the ontology is referenced during the specification and processing of queries over tagged information. These ‘ask’ commands can be embedded within an article page or a document, eliminating the concern that wiki or document wiki content is static; they allow queries to specify typing (is-a) filters to restrict the range of a result set, and also the format of the result, as tables, lists, or icons. The third integration point between an ontology and a wiki is one or more data-entry forms, associated either with a general article or with an article specific to a type of thing defined in the ontology, such as a person, business, organization, or government [note 2]. Each field on a form is named using terms defined in the ontology; an author’s data entry can either be used to generate the initial content for the article or document, or used to maintain tagged, hidden data items associated with the article page or document.

Note 1: The HyperGrover Wiki tool, for example, provides document pagination, a.k.a. WYSIWYG, as a key feature of a document editor that is integrated with its user interface.
Note 2: The HyperGrover Wiki, for example, has forms for Dublin Core data – for a general article – and for many types and sub-types of ‘articles’.
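The paper does not give a grammar for the ‘ask’ commands; a hypothetical query, written in the double-bracket style used elsewhere in the paper (the property names, the is-a filter keyword, and the format parameter are all invented for illustration), might look like:

    [[ask :: Contact is-a Business | show = Name, Place | format = table]]

Embedded in an article, such a command would be evaluated each time the page is rendered, returning a table of every Business contact held in the wiki – which is what keeps smart-wiki content from being static.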

Wiki Namespaces

A wiki ‘namespace’ is a named group of articles held within the wiki. Most articles are located in a “main” namespace, although articles can be lodged in namespaces called project, template, category, help, or user (a correlated set of namespaces exists for ‘talk’ about another article). Essentially, an article is little more than a ‘resource’ as defined by RDF (the Resource Description Framework). And it is perhaps not a coincidence that RDF’s metamodel distinguishes between a resource and ‘talk’ about a resource. Wiki ‘categories’ and RDF ‘classes’ are also unmistakably similar: these classify or categorize (your pick) an article or resource. RDF allows one to define properties of instances of a class, a capacity unknown to wiki categories. In short, RDF seems appropriate technology for defining categories which may be affixed to wiki articles, documents, or their text content. However, RDF and wikis define the term ‘namespace’ differently: RDF namespaces are for resources that are limited to definitions of classes and/or properties, while wiki namespaces are groups of any kind of article/resource. Wiki namespaces are useful because they can enforce a coordination of the wiki’s namespace structure with that of its associated RDF ontology. A set of namespaces is therefore proposed [note 3] which extends (or refines some of) the standard MediaWiki namespaces:

main – general topical articles*
document – multi-page ‘articles’
ask – namespace index queries
category – folksonomy entries*
contact – persons, businesses, ...
email – single-page email entries
event – actions and activities
help – correlated view material*
place – physical locations
project – wiki software*
special – virtual namespace*
template – boilerplate, rules, ...*
term – ontology entries
thing – physical artifacts
user – registered wiki authors*
image – non-text figures*

A correlated ‘talk’ namespace exists for each of the namespaces named above. Asterisked namespaces identify those standard in MediaWiki software, although in one case (‘project’) its use has been modified. The ‘project’ namespace is designed to contain all the software needed to view articles, documents, and data-entry forms hosted by the wiki. This namespace is necessary to address a little-discussed problem with digital signatures. E-signature standards guarantee that both binary and textual material can be signed and then later extracted with no loss or alteration. That is a valuable service; however, a legal problem exists when the signed material contains executable software deemed relevant to a party’s actions during a transaction. While a digital signature ensures that the material has not changed when extracted at a later time, it cannot ensure that the behavior of the executable software will be identical at a time subsequent to the signature. Such legal assurance may be had under only one condition: when the software is neither compiled nor tokenized. This can be achieved only when software is coded in Javascript or another textual language and then interpreted by a generic processor, e.g., Internet Explorer or Firefox [note 4]. Therefore, the ‘project’ namespace is a repository for the textual programs that control the operation of the wiki itself; in this way, transactions conducted using the wiki are defensible: parties have the capacity to review the substance of the entire environment with which they interacted during the transaction.

Document wikis must be viewed as composed of distributed namespaces. That is, to avoid wasteful duplication of software and other material across local and remote wikis, URIs assigned to namespaces should be handled such that a wiki can override material retrieved from those locations.

Note 3: The HyperGrover Wiki implements namespaces locally as folders and remotely as a (non-opaque) token in an article’s URL.
Note 4: The HyperGrover Wiki tool uses Javascript and HTML for viewing/editing articles, for viewing documents, and for forms data-entry.
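As an illustration of the arrangement described in note 3 (the host name and article names are invented), the same document article might be addressed both ways:

    Local (folder):  wiki/document/Smith_v_Jones/Complaint.wiki
    Remote (URL):    http://wiki.example.org/document:Smith_v_Jones/Complaint

Because the namespace token (‘document:’) in the remote URL is non-opaque, an external system can recognize which namespace – and hence which kind of resource – a link points to.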


Semantic Markup

The primary goal of marking up legal and related documents is a more efficient, less costly information flow between all parties and their semantic applications. This section provides an overview of the markup inserted into the material housed in a smart wiki.

As mentioned above, wiki articles and documents have text commands expressed in a syntax specific to wikis. The format to ‘tag’ a hyperlink in an article is [[property-name :: address]]. This syntax identifies both the name of the article’s object-property and the value of the object-property. For text properties, the format is [[property-name ::= value]]. This syntax is quite easy for a non-technical author to use. For more adept authors, XHTML elements may be freely interspersed in the body of an article. W3 has developed a syntax for annotating XHTML elements with RDF property-names, of the form <span property="property-name">value</span> (when tagging a hyperlink, the rel attribute is used instead of the property attribute). The W3 syntax, known as RDF/A, is more powerful than the wiki syntax because RDF/A can leverage the element hierarchy in an XHTML document to allow RDF properties to be associated with resources that are linked to the document. However, the RDF/A approach can often result in complicated annotations.

Accordingly, a dotted-name is proposed [note 5] that not only simplifies XHTML annotations (by allowing one to avoid nested tags) but also provides wiki authors a similarly powerful facility. The wiki syntax is [[tag-1…tag-n ::= value]] while, in XHTML using RDF/A, it is <span property="tag-1...tag-n">value</span>. For example, [[Author.Name ::= Smith]]; within an XHTML file, it is <span property="Author.Name">Smith</span>. Dotted-names reduce annotations to named-value pairs, and are more intuitive and more manageable than hierarchical elements, while bringing greater expressivity to wiki syntax. Property-names typical of RDF, e.g., <span property="hasName">Smith</span>, are not being used. Rather, it is highly recommended that predicate verbs and predicate nouns be segregated from one another by the dot-operator whenever the verb is functionally or technically necessary to be explicitly specified, e.g., <span property="Has.Name">Smith</span>. Predicate (helping) verbs should be only a small (but important) part of ontologies aimed at markup of legal documents [note 6]. The alternative (that is, concatenated predicate verb-nouns) yields ontologies that are difficult to manage, for instance, one having hasName, hadName, willHaveName, etc.

Note 5: Dotted-names, a constrained form of XPath, are a key design requirement for the Legal-RDF ontology.
Note 6: The Legal-RDF ontology features a small set of predicate verbs as properties of a ‘statement’; every other property in the ontology is a predicate noun.
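To see why dotted-names avoid nesting, compare a rough nested RDF/A rendering with the flat dotted-name form (the element structure, resource addresses, and property names here are invented for illustration, and RDF/A details follow the W3 working drafts of the time):

    <!-- nested RDF/A: the author relationship and the name property
         sit on separate, hierarchically related elements -->
    <div about="/docs/brief-1" rel="author">
      <span about="/contacts/smith" property="name">Smith</span>
    </div>

    <!-- dotted-name: one flat named-value pair -->
    <span property="Author.Name">Smith</span>

The wiki-syntax equivalent of the flat form, [[Author.Name ::= Smith]], requires no knowledge of XHTML at all – which is the point of the proposal.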

Summary

Document wikis can be established as the repository of record for all emails, documents, and single-page articles flowing between the parties to a legal matter, that is, a case or contract. The case or contract wiki can then easily function as a ‘portal’ by which parties synchronize their own local wikis with the contents of a shared wiki, yielding new, measurable efficiencies. Adoption of semantic markup to date has been sparse, for good reasons. However, semantic wikis are the right tool at the right time for tapping into a new, more receptive audience.

