A Semantic Web Approach To Everyday Overlapping Markup
Angelo Di Iorio, Department of Computer Science, University of Bologna
Silvio Peroni, Department of Computer Science, University of Bologna
Fabio Vitali, Department of Computer Science, University of Bologna
Overlapping structures in XML are not the symptoms of a misunderstanding of the intrinsic characteristics of a text document, nor the evidence of extreme scholarly requirements far beyond those needed by the most common XML-based applications. On the contrary, overlaps have started to appear in a large number of incredibly popular applications hidden under the guise of syntactical tricks to the basic hierarchy of the XML data format. Unfortunately, syntactical tricks have the drawback that the affected structures require complicated workarounds to support even the simplest query or usage. In this paper we present EARMARK, an approach to overlapping markup that simplifies and streamlines the management of multiple hierarchies on the same content, and provides an approach to sophisticated queries and usages over such structures without the need of ad-hoc applications, simply by using Semantic Web tools and languages. We compare how relevant tasks (e.g., the identification of the contribution of an author in a Word Processor document) are of some substantial complexity when using the original data format, and become more or less trivial when using EARMARK. We finally evaluate positively the memory and disk requirements of EARMARK documents in comparison to Open Office and Microsoft Word XML-based formats. Categories and Subject Descriptors: I.7.2 [Document And Text Processing]: Document Preparation— Markup languages; I.2.4 [Artificial Intelligence]: Knowledge Representation Formalisms and Methods— Representation languages General Terms: Languages Additional Key Words and Phrases: EARMARK, OWL, RDF, XML, overlapping
1. INTRODUCTION
The overwhelming consensus among XML practitioners is that documents are trees, the hierarchy is the fundamental data structure, and violations of the hierarchy are errors or unnecessary complications. Therefore, overlapping markup has received ambivalent, almost schizoid considerations in the field of markup languages. Traditionally, overlaps were the hallmarks of bad HTML coders and naive HTML page editors, taking advantage of the unjustified benevolence in web browsers that would display basically any HTML regardless of proper nesting. At the same time, far from the awareness of the general public, overlaps have been a fringe, almost esoteric discipline of scholars in the humanities, competently used for arcane specifications of linguistic annotations and literary analysis. But although the first type of overlap was judged with scorn and the second with awe, they both fundamentally represent a situation that is more common than thought, and the scholars were only more aware of, and not more justified in, the need to represent overlaps.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected]. © YYYY ACM 0000-0000/YYYY/01-ARTA $10.00 DOI 10.1145/0000000.0000000 http://doi.acm.org/10.1145/0000000.0000000

ACM Journal Name, Vol. V, No. N, Article A, Publication date: January YYYY.
Generally, overlap is needed when multiple independent items refer to the same segment, whether in textual markup documents or in multimedia structures [Salembier & Benitez 2007]. In documents with markup, we need overlap whenever multiple markup elements need to be applied over the same content, and these elements happen to be independent of each other. In some (rather frequent) situations, this independence means that the content referred to by some elements is partially but not completely the same as the content referred to by other elements. This situation is more frequent than it may appear: not only bad HTML code and arcane linguistic annotations use overlap, but many more mainstream and mundane examples exist. For instance, change tracking in an office document is often at odds with the underlying structure of the text; microformats [Allsopp 2007] and RDFa [Adida et al. 2008] annotations may need to refer to concepts that span multiple XML elements; complex data structures (e.g. biological data) force graphs into trees and hide multiple parentage as internal references. And the list could go on.

Unlike SGML, which is able to handle some overlapping scenarios through the CONCUR notation [Goldfarb 1990], XML grammatically imposes a strict hierarchy of containment, generating a single mathematical tree of the document where no overlap is allowed. This requirement has been turned into an intrinsic characteristic of the documents XML was meant to represent, rather than a syntactical and conceptual constraint into which these documents need to fit. Thus, whenever authors needed to cope with independent markup elements, they managed either by naively ignoring the hierarchical limitation (and therefore creating invalid documents), or by creating careful workarounds within the syntactical constraint, or even by inventing completely new markup languages that allow some types of overlap.
But while new multi-hierarchical markup languages such as TexMecs [Huitfeldt & Sperberg-McQueen 2003] and LMNL [Tennison & Piez 2002] have a small number of adepts and applications, and while bad HTML coders and bad HTML page editors are disappearing from the market, the careful workarounds within the XML syntax [TEI Consortium 2005], such as segmentation, milestones or standoff markup, are to this day ubiquitous. All workarounds share the same approach of hiding structural information about a secondary hierarchy under the guise of something else: split individual elements, empty boundary elements, indirect references, etc. The result is that the secondary structural information is hidden or its importance is lessened, so as not to break or obfuscate the main hierarchy expressed in the visible XML structure. But this comes at a price: structures specified through workarounds are more difficult to find, identify and act upon than the structures in the main XML hierarchy. Thus, trivial searches that should amount to a short XPath in a more direct situation end up being multiple lines long, fairly basic visualizations require incredibly complex XSLT stylesheets, specific choices of the main markup hierarchy actually prevent some features of the secondary markup from existing at all, etc. So, although workarounds exist and can be used, hierarchies expressed through them are second class citizens that cannot fully exploit the sophisticated tools that the XML language provides. In this paper we show how EARMARK (Extremely Annotational RDF Markup), our proposal for managing overlapping markup, does not generate first and second class hierarchies, and allows existing, sophisticated tools to be used on all markup even in the presence of overlaps.
Rather than creating a completely new language requiring completely new tools and competencies, EARMARK uses Semantic Web technologies and Semantic Web tools to obtain many of the results obtainable with usual XML tools. EARMARK defines markup vocabularies by means of OWL ontologies [W3C OWL Working Group 2009]. Since each individual markup item is an independent assertion over some content or some other assertions, overlap of content is not a problem, and neither are the issues connected to physical embedding and containment, such as contiguity and document order. Furthermore, by using standard Semantic Web technologies, fairly sophisticated functionalities can be provided over EARMARK documents. Through EARMARK, operations that were previously very hard or impossible, precisely because of the interference of the multiple hierarchies or of the workarounds they employed, now become fundamentally trivial, since no syntactical tricks are employed and the different hierarchies do not interfere with each other. Thus, for instance, identifying the individual contributions in a multi-authored MS Word or Open Office document is quite hard on their original XML formats, and becomes trivial when the same documents are converted into EARMARK.

This paper is an extended version of previous works on EARMARK ([Peroni & Vitali 2009] and [Di Iorio et al. 2010]). In those papers we focused on identifying workarounds for overlapping data existing in real XML documents and translating them into EARMARK assertions. In those papers we also sketched the EARMARK ontology and presented a simple implementation of EARMARK-aware tools. This paper follows and extends them and provides some novel contributions:
- A systematic analysis of the EARMARK model, with particular attention to data typing and overlapping structures.
- A discussion of further applications for the ontological EARMARK approach. In particular, we show how EARMARK can be used to improve the content filtering and reversion mechanisms of wikis.
- A brief description of a process, called ROCCO, for generating EARMARK documents from existing XML documents (even ones that use workarounds for overlapping structures).
- An evaluation of EARMARK efficiency when dealing with multiple hierarchies in comparison with the XML structures used by popular XML-based formats such as Open Office and MS Word.
The paper is structured as follows: in Section 2 we provide a brief overview of existing approaches to overlap using workarounds in XML or ad-hoc markup metalanguages, and in Section 3 we provide a few examples of situations where overlaps are used today, sometimes in rather mainstream settings. In Section 4 we present the EARMARK model and its rules. In Section 5 we discuss some use cases that are meant to demonstrate the advantages of the EARMARK approach over a traditional XML format, especially when overlaps come into question. Section 6 goes into the details of the generation of EARMARK documents by converting legacy documents. Section 7 contains an initial evaluation of the efficiency of EARMARK compared to popular XML-based data formats such as ODT and OOXML, leading to our conclusions in Section 8.

2. EXISTING APPROACHES TO OVERLAPPING
The need for multiple overlapping structures over documents using markup syntaxes such as XML and SGML is an age-old issue, and a large amount of literature exists about techniques, languages and tools that allow users to create multiple entangled hierarchies over the same content. A good review can be found in [DeRose 2004]. Some of this research proposes to use plain hierarchical markup (i.e., XML) and employ specially tailored elements or attributes to express the semantics of overlapping in an implicit way. The TEI Guidelines [TEI Consortium 2005] present a number of different techniques that use SGML/XML constructs to force multiple hierarchies into a single one, including:
- milestones (the overlapping structures are expressed through empty elements that mark the boundaries of the "content"),
- fragmentation (the overlapping structures are split into individual non-overlapping elements that may even be linked through id-idref pairs) and
- standoff markup (the overlapping structures are placed elsewhere and indirectly refer to their would-be locations through pointers, locators and/or id-idref pairs).

Given the large number of techniques to deal with overlapping structures in XML, in [Marinelli et al. 2008] we presented a number of algorithms to convert XML documents with overlapping structures from and to the most common approaches, as well as a prototype implementation. In [Riggs 2002] the author introduces a slightly different technique for fragmentation within XML structures. In this proposal, floating elements, i.e., those elements that do not fall in a proper or meaningful hierarchical order, are created using the name of the element followed by an index referring to its semantically-related parent element. For example, the floating element John means that John is semantically a child of the second occurrence of the element person, even though the floating element is not structurally contained by its logical parent.

Other research even proposes to get rid of the theory of trees at the base of XML/SGML altogether, and to use different underlying models and newly invented XML-like languages that allow the expression of overlaps through some kind of syntactical flourish. For instance, GODDAG [Sperberg-McQueen & Huitfeldt 2004] is a family of graph-theoretical data structures to handle overlapping markup. A GODDAG is a Directed Acyclic Graph whose nodes represent markup elements and text. Arcs are used to explicitly represent containment and parent-child relations. Since multiple arcs can be directed to the same node, overlapping structures can be straightforwardly represented in GODDAG.
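The way a directed acyclic graph accommodates overlap can be sketched in a few lines of Python. This is only an illustration of the idea that a text node may have more than one parent, not the formal GODDAG definition: the `Node` class, the `parents_of` helper and the element names `line` and `phrase` are all assumptions introduced here.

```python
class Node:
    """A node of a simplified GODDAG-like graph: an element or a text leaf."""

    def __init__(self, label, text=None):
        self.label = label      # element name, or "#text" for leaves
        self.value = text       # character data, only set on text leaves
        self.children = []      # ordered outgoing arcs (containment)

    def add(self, *children):
        self.children.extend(children)
        return self

    def text(self):
        """Concatenate the text leaves reachable from this node, in order."""
        if self.value is not None:
            return self.value
        return "".join(child.text() for child in self.children)


def parents_of(nodes, target):
    """Return the labels of all nodes that have an arc pointing to target."""
    return [n.label for n in nodes if any(c is target for c in n.children)]


# Three text leaves; the middle one is shared by two overlapping elements.
t1 = Node("#text", "The beginning ")
t2 = Node("#text", "and also ")
t3 = Node("#text", "the end.")
line = Node("line").add(t1, t2)
phrase = Node("phrase").add(t2, t3)
root = Node("doc").add(line, phrase)

print(line.text())
print(phrase.text())
print(parents_of([root, line, phrase], t2))
```

Because `t2` is reachable from both `line` and `phrase`, neither element has to be split or replaced by empty boundary markers, which is exactly what a single tree cannot express.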
Full GODDAGs cannot be linearized in any form using embedded markup, but restricted GODDAGs, a subset thereof, can be and have been linearized into TexMecs [Huitfeldt & Sperberg-McQueen 2003], a multi-hierarchical markup language that also allows full GODDAGs through appropriate non-embedding workarounds, such as standoff markup.

LMNL [Tennison & Piez 2002] is a general data model based on the idea of layered text fragments and ranges, where multiple types of overlap can be modeled using concepts drawn from the mathematical theory of intervals. Multiple serializations of LMNL exist, such as CLIX and LMNL-syntax.

XConcur [Schonefeld & Witt 2006] is a similar solution based on the representation of multiple hierarchies within the same document through layers. Closely related to its predecessor CONCUR, as included in SGML, XConcur was developed in conjunction with the validation language XConcur-CL to handle relationships and constraints between multiple hierarchies.

The variant graph approach [Schmidt & Colomb 2009] is also based on graph theory. Developed to deal with textual variations, which generate multiple versions of the same document with multiple overlapping hierarchies, this theory proposes a new data model to represent literary documents and a graph linearization (based on lists) that scales well even with a large number of versions. The same authors recently presented an extension of their theory that also allows users to merge multiple variants into one document [Schmidt 2009].

In [Portier & Calabretto 2009] a detailed survey of overlapping approaches was presented, also discussing the MultiX data model, which uses W3C standard languages such as XInclude to link and fetch text fragments within overlapping structures, and a prototype editor for the creation of multi-structured documents.
In [Tummarello et al. 2005] RDF was proposed as a standoff notation for overlapping structures of XML documents. Since this proposal has many affinities with the one we are presenting in this paper, in Section 4.5 we discuss its characteristics in depth and compare it with ours.

3. MORE FREQUENT THAN ONE MAY THINK: OVERLAPPING IN THE WILD
Overlapping structures have often been considered appropriate only in highly specific contexts and basically for scholars: the solutions proposed in the literature were complex because their complexity was considered to be grounded in the intrinsic complexity of the topics themselves. Yet, overlapping structures can be found in many more fields than these, and even mainstream applications generate and use markup with overlapping structures. While the complexity of overlapping is hidden from the final user, applications that consume such data may very well find it rather difficult to handle such information. In the following we discuss three very different contexts where overlap already exists and fairly relevant information is encoded in multiple independent structures, leaving to special code the task of managing the complexity.

3.1. Change Tracking in Office Document Format
Word processors such as Microsoft Word and Open Office provide users with powerful tools for tracking changes, allowing each individual modification by individual authors to be identified, highlighted, and acted upon (e.g. by accepting or discarding them). The intuitiveness of the relevant interfaces actually hides the complexity of the data format and of the algorithms necessary to handle such information. For instance, the standard ODT format [JTC1/SC34 WG 6. 2006] used by Open Office, when saving change tracking information, relies on two specific constructs for insertions and deletions that may overlap with the structural markup. Adding a few words within a paragraph is not in itself complex, as it does not require breaking the fundamental structural hierarchy. Changes that affect the structure itself, however (e.g. the split of one paragraph in two by the insertion of a return character, or vice versa the joining of two paragraphs by the elimination of the intermediate return character), require that annotations be associated with the end of a paragraph and the beginning of the next, in an unavoidably overlapping pattern. ODT uses milestones and standoff markup for insertions and deletions respectively, and also relies on standoff markup for annotations about the authorship and date of the change. For instance, the insertion of a return character and a few characters in a paragraph creates a structure as follows:

    <text:tracked-changes>
      <text:changed-region text:id="S1">
        <text:insertion>
          <office:change-info>
            <dc:creator>John Smith</dc:creator>
            <dc:date>2009-10-27T18:45:00</dc:date>
          </office:change-info>
        </text:insertion>
      </text:changed-region>
      [... other changes ...]
    </text:tracked-changes>
    [... content ...]
    <text:p>The beginning and <text:change-start text:change-id="S1"/></text:p>
    <text:p>also <text:change-end text:change-id="S1"/> the end.</text:p>
The empty elements <text:change-start> and <text:change-end> are milestones marking respectively the beginning and the end of the range that constitutes the insertion, while the element <text:tracked-changes>, before the beginning of the document content, is standoff markup for the metadata about the change (author and date information).
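To give a feeling for what even a simple question costs under this encoding, the following Python sketch recovers the text inserted by one change from a milestone pair like the one described above. It is a minimal illustration under stated assumptions: the wrapper element `doc`, the helper names `walk` and `inserted_text`, and the reduced sample document are ours, and only the `text` namespace URI and the `change-start`/`change-end` element names come from ODT.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# Reduced sample: two paragraphs with a milestone pair for change "S1".
# The <doc> wrapper is a hypothetical stand-in for the real document body.
SAMPLE = (
    f'<doc xmlns:text="{TEXT_NS}">'
    '<text:p>The beginning and <text:change-start text:change-id="S1"/></text:p>'
    '<text:p>also <text:change-end text:change-id="S1"/> the end.</text:p>'
    '</doc>'
)


def walk(elem):
    """Yield elements and text chunks in true document order."""
    yield ("elem", elem)
    if elem.text:
        yield ("text", elem.text)
    for child in elem:
        yield from walk(child)
        if child.tail:          # text that follows the child's end tag
            yield ("text", child.tail)


def inserted_text(xml_source, change_id):
    """Collect the text lying between the two milestones of one change."""
    start_tag = f"{{{TEXT_NS}}}change-start"
    end_tag = f"{{{TEXT_NS}}}change-end"
    id_attr = f"{{{TEXT_NS}}}change-id"
    inside = False
    pieces = []
    for kind, value in walk(ET.fromstring(xml_source)):
        if kind == "elem" and value.get(id_attr) == change_id:
            if value.tag == start_tag:
                inside = True
            elif value.tag == end_tag:
                inside = False
        elif kind == "text" and inside:
            pieces.append(value)
    return "".join(pieces)


print(repr(inserted_text(SAMPLE, "S1")))
```

The range crosses a paragraph boundary, so no single XPath step selects it: the code has to linearize the tree and keep its own state, and it still ignores the standoff metadata that says who made the change and when.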
Similarly, a deletion creates a structure as follows:

    <text:tracked-changes>
      <text:changed-region text:id="S2">
        <text:deletion>
          <office:change-info>
            <dc:creator>John Smith</dc:creator>
            <dc:date>2009-10-27T18:46:00</dc:date>
          </office:change-info>
          <text:p/>
          <text:p/>
        </text:deletion>
      </text:changed-region>
      [... other changes ...]
    </text:tracked-changes>
    [... content ...]
    <text:p>The beginning and <text:change text:change-id="S2"/> also the end.</text:p>
The element <text:change> represents a milestone of the location where the deletion took place in the content, and the corresponding standoff markup annotation contains not only the metadata about the change, but also the text that was deleted. The OOXML format [JTC1/SC34 WG 4. 2008] (the XML-based format used by Microsoft Office 2007 and standardized by ISO in 2008), on the other hand, uses a form of segmentation to store change-tracking information across all previous elements involved.

    <w:rPr> [...] </w:rPr>
    <w:t>The beginning and</w:t>
    [...]
    <w:t>also</w:t>
    [...]
    <w:t>the end.</w:t>
This heavily simplified version of an OOXML document shows two separate changes: the first is the insertion of a return character and the second is the insertion of a word. These modifications are not considered as a single change, and therefore the segments are not connected to each other, but simply created as needed to fit the underlying structure. In fact, change tracking in OOXML is a fairly complex proposition. Although it provides more complete coverage of special cases and situations than ODT, dealing with its intricacies is not for the casual programmer. Even a simple XSLT stylesheet to show inserted text in a different color and hide deleted text may run to several hundred lines of code (see http://OOXMLdeveloper.org/archive/2006/09/07/625.aspx).

3.2. Overlapping with Microformats

Microformats [Allsopp 2007] add semantic markup to web documents by using common structures of the HTML language itself, in particular the class attribute. The HTML code is annotated using microformats so as to provide new semantic, machine-processable assertions. In the following example, a plain HTML table is enriched with metadata about events (HCalendar) and people (HCard):
    <body>
      <span class="summary">WWW 2010 Conference:</span>
      <abbr class="dtstart" title="2010-04-26">April 26</abbr>-<abbr class="dtend" title="2010-04-30">30</abbr>,
      <span class="location">Raleigh, NC, USA</span>.
      <table>
        <tr> <th>Name</th> <th>Role</th> </tr>
        <tr class="vcard"> <td class="fn">Juliana Freire</td> <td class="role">Program Committee Chair</td> </tr>
        <tr class="vcard"> <td class="fn">Michael Rappa</td> <td class="role">Conference Chair</td> </tr>
        <tr class="vcard"> <td class="fn">Paul Jones</td> <td class="role">Conference Chair</td> </tr>
        <tr class="vcard"> <td class="fn">Soumen Chakrabarti</td> <td class="role">Program Committee Chair</td> </tr>
      </table>
    </body>
The table was enriched with additional data declaring it to be an event (a conference), and data about the event itself (the url, summary and location) and about four relevant individuals (with their names and roles within the conference) were associated where necessary with the actual content of the table. So far, so good, and no overlap to speak of. Things change dramatically, though, when the overall structure of the main hierarchy (the HTML table) is at odds with the intrinsic hierarchy of the microformat data, for instance if the people are organized in columns rather than rows:

    <table>
      <tr>
        <td>Program Committee Chair</td> <td>Conference Chair</td>
        <td>Conference Chair</td> <td>Program Committee Chair</td>
      </tr>
      <tr>
        <td>Juliana Freire</td> <td>Michael Rappa</td>
        <td>Paul Jones</td> <td>Soumen Chakrabarti</td>
      </tr>
    </table>
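The difference between the two layouts can be made concrete with a small Python sketch. It is not a microformats parser: `extract_vcards` is a hypothetical helper that only looks for an element carrying the `vcard` class and collects the `fn` and `role` values found inside it, and the two sample tables are shortened versions of the ones above.

```python
import xml.etree.ElementTree as ET

# Row-oriented layout: each person is one <tr>, so one element can carry "vcard".
ROW_LAYOUT = """<table>
  <tr><th>Name</th><th>Role</th></tr>
  <tr class="vcard"><td class="fn">Juliana Freire</td><td class="role">Program Committee Chair</td></tr>
  <tr class="vcard"><td class="fn">Michael Rappa</td><td class="role">Conference Chair</td></tr>
</table>"""

# Column-oriented layout: the name and the role of a person sit in different rows,
# so no single element encloses both and nothing can be marked as a vcard.
COLUMN_LAYOUT = """<table>
  <tr><td class="role">Program Committee Chair</td><td class="role">Conference Chair</td></tr>
  <tr><td class="fn">Juliana Freire</td><td class="fn">Michael Rappa</td></tr>
</table>"""


def extract_vcards(xhtml):
    """Return one dict per element whose class attribute contains 'vcard'."""
    cards = []
    for holder in ET.fromstring(xhtml).iter():
        if "vcard" not in holder.get("class", "").split():
            continue
        card = {}
        for prop in holder.iter():
            for name in prop.get("class", "").split():
                if name in ("fn", "role"):
                    card[name] = "".join(prop.itertext()).strip()
        cards.append(card)
    return cards


print(extract_vcards(ROW_LAYOUT))
print(extract_vcards(COLUMN_LAYOUT))
```

In the row layout every person is a subtree and the extraction is a plain traversal. In the column layout the same information is present, but the containment that the vcard hierarchy needs does not exist in the HTML tree.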
Unfortunately, vcards are a hierarchy themselves, and if the hierarchy of vcards is organized differently from the hierarchy of the HTML table, as in the latter case, it is simply impossible to define the four vcards for the four people organizing the conference. Thus in plain HTML the choice of one of two possible presentation models for the main hierarchy of content makes the existence of the second hierarchy either trivial or completely impossible. A possible and partial solution to express vcard hierarchies in the latter example is RDFa [Adida et al. 2008], a W3C recommendation. It describes a mechanism to embed RDF statements into HTML documents by using some HTML attributes (href, rel, rev, content) in combination with other ad hoc attributes (property, about, typeof) proposed in the recommendation itself.

    <table xmlns:vc="http://www.w3.org/2006/vcard/ns#"
           xmlns:my="http://www.essepuntato.it/2010/05/myVCard#">
      <tr>
        <td about="my:pcc" typeof="vc:Role">Program Committee Chair</td>
        <td about="my:cc" typeof="vc:Role">Conference Chair</td>
        <td about="my:cc" property="vc:hasName">Conference Chair</td>
        <td about="my:pcc" property="vc:hasName">Program Committee Chair</td>
      </tr>
      <tr>
        <td about="my:jf" rel="vc:role" resource="my:pcc"><span property="vc:fn">Juliana Freire</span></td>
        <td about="my:mr" rel="vc:role" resource="my:cc"><span property="vc:fn">Michael Rappa</span></td>
        <td about="my:pj" rel="vc:role" resource="my:cc"><span property="vc:fn">Paul Jones</span></td>
        <td about="my:sc" rel="vc:role" resource="my:pcc"><span property="vc:fn">Soumen Chakrabarti</span></td>
      </tr>
    </table>
Since all attributes live in the context of elements, the price to pay is that, to assert everything we want to assert, we often need to add to the current markup hierarchy of a document some structurally unnecessary elements whose only purpose is to carry the RDF statements (e.g., the span elements in the example above). Even if that does not represent a significant problem for strict Semantic Web theorists, document architects and markup experts see this as a kludge and an inelegant compromise.

3.3. Wikis: no overlapping where some should be
The strength of wikis lies in their allowing users to modify content at any time. The mechanisms of change-tracking and rollback that are characteristic of all wikis, in fact, promote users' contributions and make "malicious attacks" pointless in the long run, since previous versions can be easily restored. A number of tools exist that automatically discover "wiki vandalisms" and provide users with powerful interfaces to surf changes, diff subsequent versions and revert content. For instance, Huggle (http://en.wikipedia.org/wiki/Wikipedia:Huggle) is an application dealing with vandalism in Wikipedia, based on a proxy architecture and .NET technologies. A straightforward interface allows users to access any version of a page, highlights contributions of a specific user and reverts the content to old versions. Even client-side tools, meant to be installed as browser extensions or bookmarklets, exist to extend the rollback mechanisms of Wikipedia, giving users more flexibility and control over (vandalistic) changes. For instance, Lupin5 is a set of JavaScript scripts that check a wiki page against a list of forbidden terms so that authors can identify undesirable modifications and restore previous (good) versions without continuously monitoring the full content of the page; similarly, Twinkle6 provides users with powerful rollback functions and includes a full library of batch deletion functions, automatic reporting of vandals, and user notification functions.

These tools are successful in highlighting vandalism and in identifying versions created by malicious users. However, although it is possible to revert the page to any previous version, all changes (even acceptable ones) that were subsequent to the malicious version cannot be automatically inherited by the restored page. For instance, let us consider versions V1, V2, and V3 of a wiki page, where version V1 contains a baseline (acceptable) content, V2 is identified as a partial vandalism and is agreed to be removed, but V3 contains (possibly in a completely different section than the target of the malicious attack) relevant and useful content that was added before the vandalistic version V2 was declared as such. Removing the modifications of version V2 while maintaining (whatever is possible of) version V3 is a difficult, error-prone and time-consuming task if done manually, yet there is no tool we are aware of that automatically filters contributions from multiple versions and merges them into a new one (or, equivalently, removes only selected intermediate versions).

Yet, it is possible to characterize the interdependencies between subsequent changes to a document in a theoretical way. In fact, literature has existed for a long time on exactly these themes (see for instance [Durand 2008] [Durand 1994]). A detailed discussion of abstract models of interconnected changes is out of scope for this paper; details and authoritative references can be found in the above-mentioned works. What is relevant in this discussion is that these models happen to assume a hierarchical form that is frequently at odds with the hierarchical structure of the content of the document, and as such most issues derive from the data structures in which content is stored and from the model for manipulating these structures. For instance, the fact that in the wiki perspective each version is an independent unit that shares no content (even unchanged content) with the other versions prevents considering multiple versions as overlapping structures coexisting on the same document. If we were able to make these hierarchies explicit we would be able to create models and tools to manipulate these documents in a more powerful way and to exploit the existing interconnections between the overlapping hierarchies.

4. INTRODUCTION TO EARMARK AND ITS SUPPORT FOR OVERLAPPING FEATURES
The presence of hidden overlapping structures – transparent to users but very difficult to handle by applications – is the common denominator for the scenarios described in the previous section. More than the overlap itself – that cannot be ignored as it does exist and carries important meanings – the problem we face lies in the way applications store such overlapping structures. In the XML world, in fact, the only way to do so is through the use of (complex) workarounds that force the multiple hierarchies into one hierarchy of a XML document. That makes very tricky to perform sophisticated analysis and searches. This section discusses a different approach to metamarkup, called EARMARK (Extremely Annotational RDF Markup) [Di Iorio et al. 2009] [Peroni & Vitali 2009] [Di Iorio et al. 2010] based on ontologies and Semantic Web technologies. The basic idea is to model EARMARK documents as collections of addressable text fragments, and to associate such text content with OWL assertions that describe structural features as well as semantic properties of (parts of) that content. As a result EARMARK allows not only documents with single hierarchies (as with XML) but also multiple overlapping hierarchies where the textual content within the markup items belongs to some hierarchies but not to others. Moreover EAMARK makes it possible to add semantic annotations to the content though assertions that may overlap with existing ones. One of the advantages of using EARMARK is the capability to access and query documents by using well-known and widely supported tools for Semantic Web. In fact, EARMARK assertions are simply RDF assertions, while EARMARK documents are modeled through OWL ontologies. The consequence is that query languages (such as SPARQL [Garlik & Seaborne 2010]) and actual existing tools, such as Jena7 and Pel7 http://jena.sourceforge.net
ACM Journal Name, Vol. V, No. N, Article A, Publication date: January YYYY.
A:10
Angelo Di Iorio et al.
let8 ) can be directly used to deal with even incredibly complicated overlapping structures. What is very difficult (or impossible) to do with traditional XML technologies becomes much easier with these technologies under the EARMARK approach. In the rest of this section we give a brief overview of the EARMARK model, while in Section 5 we describe how EARMARK can be used to deal with the issues presented earlier. The model itself is defined through an OWL document9 , summarized in Fig. 1, specifying classes and relationships. We distinguish between ghost classes - that define the general model - and shell classes - that are actually used to create EARMARK instances.
Fig. 1. A UML-like representation of the EARMARK ontology.
4.1. Ghost classes
The ghost classes describe three disjoint base concepts (docuverses, ranges and markup items) through three different and disjoint OWL classes (footnote 10).

Footnote 8: http://pellet.owldl.com
Footnote 9: http://www.essepuntato.it/2008/12/earmark
Footnote 10: All our OWL samples are presented using the Manchester Syntax [Horridge & Patel-Schneider 2009], which is one of the standard linearization syntaxes of OWL. The prefixes rdfs and xsd refer respectively to the RDF Schema and XML Schema namespaces, while the empty prefix refers to the EARMARK ontology URI plus “#”. Moreover, we use the prefix c to indicate entities taken from an imported ontology made for the SWAN project [Ciccarese et al. 2008], available at http://swan.mindinformatics.org/spec/1.2/collections.html.
The textual content of an EARMARK document is conceptually separated from its annotations, and is referred to through the Docuverse class (footnote 11). The individuals of this class represent the object of discourse, i.e. all the containers of text of an EARMARK document.

    Class: Docuverse

    DatatypeProperty: hasContent
        Characteristics: FunctionalProperty
        Domain: Docuverse
        Range: rdfs:Literal
Any individual of the Docuverse class, commonly called a docuverse (lowercase to distinguish it from the class), specifies its actual content with the property hasContent. We then define the class Range for any text lying between two locations of a docuverse. A range, i.e., an individual of the class Range, is defined by a starting and an ending location (any literal) of a specific docuverse, through the properties begins, ends and refersTo respectively.

    Class: Range
        EquivalentTo:
            refersTo some Docuverse
            and begins some rdfs:Literal
            and ends some rdfs:Literal

    ObjectProperty: refersTo
        Characteristics: FunctionalProperty
        Domain: Range
        Range: Docuverse

    DatatypeProperty: begins
        Characteristics: FunctionalProperty
        Domain: Range
        Range: rdfs:Literal

    DatatypeProperty: ends
        Characteristics: FunctionalProperty
        Domain: Range
        Range: rdfs:Literal
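As a rough illustration of how a range selects text from a docuverse, the following Python sketch models the two classes as plain data objects. It is not part of the EARMARK ontology: the class and field names are hypothetical, locations are assumed to be integer character offsets (the model itself accepts any literal), and reading the text backwards when begins is greater than ends is one possible interpretation of a reversed range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Docuverse:
    """A container of text; `content` plays the role of hasContent."""
    content: str


@dataclass(frozen=True)
class Range:
    """Text lying between two locations of a docuverse."""
    refers_to: Docuverse
    begins: int
    ends: int

    def text(self) -> str:
        """Return the text selected by the range.

        Location n is assumed to sit just before the character at index n.
        When begins > ends, the selected text is read in reverse order.
        """
        low, high = sorted((self.begins, self.ends))
        selected = self.refers_to.content[low:high]
        return selected if self.begins <= self.ends else selected[::-1]


doc = Docuverse("desserts")
forward = Range(doc, 0, 8)    # document order
backward = Range(doc, 8, 0)   # same locations, swapped roles
print(forward.text())   # desserts
print(backward.text())  # stressed
```

Keeping the docuverse immutable and letting ranges only point into it mirrors the separation between content and annotations described above.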
There is no restriction on the locations used for the begins and ends properties. That is very useful: it allows us to define ranges that follow or reverse the text order of the docuverse they refer to. For instance, the string “desserts” can be read either in document order, with the begins location lower than the ends location, or in the opposite order, forming “stressed” (footnote 12). Thus, the values of the properties begins and ends define the way a range must be read.

The class MarkupItem is the superclass defining artifacts to be interpreted as markup (such as elements and attributes).

    Class: MarkupItem
        SubClassOf:
            (c:Set that c:element only (Range or MarkupItem)) or
            (c:Bag that c:item only (c:itemContent only (Range or MarkupItem)))

    DatatypeProperty: hasGeneralIdentifier
        Characteristics: FunctionalProperty
        Domain: MarkupItem
        Range: xsd:string
Footnote 11: This class (and its name) is based on the concept introduced by Ted Nelson in his Xanadu Project [Nelson 1980] to refer to the collection of text fragments that can be interconnected to each other and transcluded into new documents.
Footnote 12: http://en.wikipedia.org/wiki/Palindrome#Semordnilaps
    DatatypeProperty: hasNamespace
        Characteristics: FunctionalProperty
        Domain: MarkupItem
        Range: xsd:anyURI

    Class: c:Collection
    Class: c:Set
        SubClassOf: c:Collection
    Class: c:Bag
        SubClassOf: c:Collection
    Class: c:List
        SubClassOf: c:Bag
    Class: c:Item
    Class: c:ListItem
        SubClassOf: c:Item

    ObjectProperty: c:element
        Domain: c:Collection

    ObjectProperty: c:item
        SubPropertyOf: c:element
        Domain: c:Bag
        Range: c:Item

    ObjectProperty: c:firstItem
        SubPropertyOf: c:item
        Domain: c:List

    ObjectProperty: c:itemContent
        Characteristics: FunctionalProperty
        Domain: c:Item
        Range: not c:Item

    ObjectProperty: c:nextItem
        Characteristics: FunctionalProperty
        Domain: c:ListItem
        Range: c:ListItem
A markup item individual is a collection (c:Set, c:Bag or c:List, where the latter is a subclass of c:Bag and all of them are subclasses of c:Collection) of individuals belonging to the classes MarkupItem and Range. Through these collections it is possible to define a markup item as a set, a bag or a list of other markup items, using the properties c:element (for sets), c:item and c:itemContent (for bags and lists). Thus it becomes possible to define elements containing nested elements or text, or attributes containing values, as well as overlapping and complex structures. Note also that handling collections directly in OWL allows us to reason about content models for markup items, which would not be possible if we had used the corresponding constructs in RDF (footnote 13).

A markup item may also have a name, specified through the functional property hasGeneralIdentifier (recalling the SGML term for the name of elements [Goldfarb 1990]), and a namespace, specified through the functional property hasNamespace. Note that we can have anonymous markup items, as is possible in LMNL [Tennison & Piez 2002] and GODDAG [Sperberg-McQueen & Huitfeldt 2004], by simply asserting that the item belongs to the class of all those markup items that do not have a general identifier (i.e., hasGeneralIdentifier exactly 0).

4.2. Shell classes
The ghost classes discussed so far give us an abstract picture of the EARMARK framework. We need to specialize our model by defining a concrete description of our classes. These new shell subclasses apply specific restrictions to the ghost classes. First of all, the class Docuverse is restricted to be either a StringDocuverse (the content is specified by a string) or a URIDocuverse (the actual content is located at the URI specified).

    Class: StringDocuverse
        DisjointWith: URIDocuverse
        SubClassOf: Docuverse, hasContent some xsd:string

    Class: URIDocuverse
        SubClassOf: Docuverse, hasContent some xsd:anyURI
Depending on the particular scenario or on the kind of docuverse we are dealing with (it may be plain text, XML, LaTeX, a picture, etc.), we need to use different kinds of ranges. Therefore, the class Range has three different subclasses:

- PointerRange defines a range by counting characters. In that case, the values of the properties begins and ends must be non-negative integers that identify unambiguous positions in the character stream, remembering that the value 0 refers to the location immediately before the 1st character, the value 1 refers to the location after the 1st character and before the 2nd one, and so on. By using the hasKey OWL property, we also assert that two pointer ranges having equal docuverse, begin and end locations are the same range;
- XPathRange defines a range considering the whole docuverse or a particular context of it, specified through an XPath expression [Berglund et al. 2007] as the value of the property hasXPathContext. Note that, by using these ranges, we implicitly assume that the docuverse they refer to is an XML structure. Moreover, the properties begins and ends have to be applied to the string value obtained by juxtaposing all the text nodes identified by the XPath. By using the hasKey OWL property, we also assert that two XPath ranges having equal docuverse, XPath context, begin and end locations are the same range;
- XPathPointerRange is an XPathRange in which the values of the properties begins and ends must be non-negative integers that identify unambiguous positions in the character stream, as described for the class PointerRange.
    Class: PointerRange
        HasKey: refersTo begins ends
        SubClassOf: Range,
            begins some xsd:nonNegativeInteger
            and ends some xsd:nonNegativeInteger

    Class: XPathRange
        SubClassOf: Range
        EquivalentTo: hasXPathContext some rdfs:Literal
        HasKey: refersTo begins ends hasXPathContext

    Class: XPathPointerRange
        SubClassOf: XPathRange,
            begins some xsd:nonNegativeInteger
            and ends some xsd:nonNegativeInteger

    DatatypeProperty: hasXPathContext
        Characteristics: FunctionalProperty
        Domain: XPathRange
        Range: rdfs:Literal
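To make the XPath-based ranges more concrete, the sketch below resolves an XPathPointerRange over a small XML docuverse with Python's standard xml.etree.ElementTree module. Everything in it is an assumption made for illustration: the class name and fields are hypothetical, the sample XML is invented, the XPath context is limited to the subset that ElementTree supports, and the context is taken to select elements whose text is concatenated (ElementTree cannot select text nodes directly). The frozen dataclass only loosely mirrors the HasKey axiom: two ranges with the same docuverse, context, begin and end locations compare equal.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass(frozen=True)
class XPathPointerRange:
    """Hypothetical stand-in for an XPathPointerRange individual."""
    refers_to: str       # content of the XML docuverse
    xpath_context: str   # value of hasXPathContext
    begins: int
    ends: int

    def text(self) -> str:
        root = ET.fromstring(self.refers_to)
        # Juxtapose the text found under every node selected by the context.
        joined = "".join(
            "".join(node.itertext())
            for node in root.findall(self.xpath_context)
        )
        # Location 0 is immediately before the first character.
        return joined[self.begins:self.ends]


xml_docuverse = "<doc><p>Bob was farming</p><p> carrots</p></doc>"
first = XPathPointerRange(xml_docuverse, ".//p", 4, 7)
across = XPathPointerRange(xml_docuverse, ".//p", 8, 23)
same_as_first = XPathPointerRange(xml_docuverse, ".//p", 4, 7)

print(first.text())            # text between locations 4 and 7
print(across.text())           # text spanning both p elements
print(first == same_as_first)  # equal keys, hence the same range
```

Note that the second range crosses the boundary between the two p elements, which is possible because locations are counted on the juxtaposed string rather than inside a single element.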
MarkupItem is specialized into three disjoint subclasses, Element, Attribute and Comment, which allow a more precise characterization of markup items.

    Class: Element
        SubClassOf: MarkupItem
    Class: Attribute
        SubClassOf: MarkupItem
    Class: Comment
        SubClassOf: MarkupItem

    DisjointClasses: Element, Attribute, Comment

4.3. Range and markup item overlap
The presence of overlap in EARMARK is worth discussing in more detail. Different types of overlap exist, according to the subset of items involved, and different strategies are needed to detect them. In particular, there is a clear distinction between overlapping ranges and overlapping markup items.
By definition, overlapping ranges are two ranges that refer to the same docuverse and such that at least one of the locations of the first range is contained in the interval described by the locations of the second range (excluding its terminal points). Totally overlapping ranges have the locations of the first range completely contained in the interval of the second range or vice versa, while partially overlapping ranges have either exactly one location inside the interval and the other outside, or identical terminal points in reversed roles. Thus, if we consider the following excerpt:

    Individual: r1
        Types: PointerRange
        Facts: refersTo aDocuverse,
               begins "0"^^xsd:nonNegativeInteger,
               ends "7"^^xsd:nonNegativeInteger

    Individual: r2
        Types: PointerRange
        Facts: refersTo aDocuverse,
               begins "4"^^xsd:nonNegativeInteger,
               ends "9"^^xsd:nonNegativeInteger
we can infer, through a reasoner such as Pellet, that these two ranges overlap by using the following rules:

    begins(x, b1) ^ ends(x, e1) ^ begins(y, b2) ^ ends(y, e2) ^
    refersTo(x, d) ^ refersTo(y, d) ^ DifferentFrom(x, y) ^ P
        -> overlapWith(x, y)
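Outside of a reasoner, the same check can be approximated procedurally. The Python sketch below is a minimal illustration, not the rules of the EARMARK overlapping ontology: the names are hypothetical, and the comparisons on the begin and end locations stand in for the condition P of the rule above, which is assumed here to be "some location of one range lies strictly inside the interval of the other". Boundary cases (for example two ranges sharing one terminal point) are resolved by strict comparison, which is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PointerRange:
    docuverse: str
    begins: int
    ends: int


def strictly_inside(location: int, rng: PointerRange) -> bool:
    """True if the location falls inside the range, terminal points excluded."""
    low, high = sorted((rng.begins, rng.ends))
    return low < location < high


def classify_overlap(x: PointerRange, y: PointerRange) -> str:
    """Return 'total', 'partial' or 'none' for two pointer ranges."""
    if x.docuverse != y.docuverse:
        return "none"  # ranges over different docuverses never overlap
    x_in_y = [strictly_inside(loc, y) for loc in (x.begins, x.ends)]
    y_in_x = [strictly_inside(loc, x) for loc in (y.begins, y.ends)]
    if all(x_in_y) or all(y_in_x):
        return "total"
    if any(x_in_y) or any(y_in_x):
        return "partial"
    # Identical terminal points in reversed roles.
    if x.begins != x.ends and (x.begins, x.ends) == (y.ends, y.begins):
        return "partial"
    return "none"


r1 = PointerRange("aDocuverse", 0, 7)
r2 = PointerRange("aDocuverse", 4, 9)
print(classify_overlap(r1, r2))  # partial
```

For r1 and r2, location 7 of r1 lies inside the interval of r2 while location 0 lies outside, so the two ranges overlap partially.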
The case of overlapping markup items is slightly more complicated. We say that two markup items A and B overlap when at least one of the following statements holds:

(1) [overlap by range] A contains a range that overlaps with another range contained by B;
(2) [overlap by content hierarchy] A and B contain at least one range in common;
(3) [overlap by markup hierarchy] A and B contain at least one markup item in common.

The three possible scenarios for such item overlap are summarized in Fig. 2 (footnote 14). The EARMARK ontology, in fact, is complemented by another ontology (footnote 15) that models all overlapping scenarios, either for ranges or for markup items, and includes rules for inferring overlaps automatically through a reasoner.

4.4. EARMARK as a standoff notation
If we ignore for a moment the semantic implications of using EARMARK and concentrate on its syntactical aspects only, it is easy to observe that EARMARK is nothing but yet another standoff notation, where the markup specifications point to, rather than contain, the relevant substructure and text fragments.

Footnote 14: The EARMARK documents describing these three overlapping scenarios and all the other ones presented in the following sections are available at http://www.essepuntato.it/2011/jasist/examples.
Footnote 15: http://www.essepuntato.it/2011/05/overlapping
Fig. 2. Three EARMARK examples of overlapping between elements p.
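The three kinds of markup item overlap defined in Section 4.3 and depicted in Fig. 2 can also be sketched procedurally. The Python fragment below is only an illustration under stated assumptions: the class and function names are hypothetical, "contains" is taken to mean direct containment, the sample items are invented rather than taken from Fig. 2, and range overlap is reduced to a simplified test (intervals that share more than a point and do not have identical extent).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rng:
    docuverse: str
    begins: int
    ends: int


@dataclass(eq=False)  # markup items are compared by identity
class Item:
    name: str
    content: list = field(default_factory=list)  # Rng or Item members


def ranges_overlap(x: Rng, y: Rng) -> bool:
    """Simplified test: the two intervals intersect but do not coincide."""
    if x.docuverse != y.docuverse:
        return False
    x_low, x_high = sorted((x.begins, x.ends))
    y_low, y_high = sorted((y.begins, y.ends))
    same_extent = (x_low, x_high) == (y_low, y_high)
    return x_low < y_high and y_low < x_high and not same_extent


def overlap_kinds(a: Item, b: Item) -> set:
    """Return which of the three overlap scenarios hold between a and b."""
    a_ranges = [c for c in a.content if isinstance(c, Rng)]
    b_ranges = [c for c in b.content if isinstance(c, Rng)]
    a_items = {c for c in a.content if isinstance(c, Item)}
    b_items = {c for c in b.content if isinstance(c, Item)}
    kinds = set()
    if any(ranges_overlap(x, y) for x in a_ranges for y in b_ranges):
        kinds.add("range")
    if set(a_ranges) & set(b_ranges):
        kinds.add("content hierarchy")
    if a_items & b_items:
        kinds.add("markup hierarchy")
    return kinds


shared_range = Rng("doc", 0, 7)
shared_item = Item("em", [Rng("doc", 2, 4)])

by_range = overlap_kinds(Item("p", [Rng("doc", 0, 7)]), Item("p", [Rng("doc", 4, 9)]))
by_content = overlap_kinds(Item("p", [shared_range]), Item("p", [shared_range, Rng("doc", 7, 9)]))
by_markup = overlap_kinds(Item("p", [shared_item]), Item("p", [shared_item]))
print(by_range, by_content, by_markup)
```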
Standoff notations, also known in the literature as out-of-line notations [TEI Consortium 2005], are hardly new, but they never really caught on for a number of reasons, most having to do with their perceived fragility under desynchronized modifications to the text. In [Georg et al. 2010] and [Baski 2010] we can find a pair of recent and substantially complete analyses of their merits and demerits. In particular, according to , “standoff annotation has [...] quite a few disadvantages:

(1) very difficult to read for humans
(2) the information, although included, is difficult to access using generic methods
(3) limited software support as standard parsing or editing software cannot be employed
(4) standard document grammars can only be used for the level which contains both markup and textual data
(5) new layers require a separate interpretation
(6) layers, although separate, often depend on each other” (footnote 16).

Footnote 16: In order to individually address the issues, we edited the original bullets into a numbered list.

And yet, although EARMARK is in practice a standoff notation, it provides workarounds for most of the above-mentioned issues. Firstly, since EARMARK is based on OWL and can be linearized in any of the large number of OWL linearization syntaxes, it follows that 1) readability, 2) access and 3) software support for it are exactly those existing for well-known, widespread and important W3C standards such as RDF and OWL. Being able to employ common RDF and OWL tools such as Jena and SPARQL for EARMARK documents was in fact a major motivation for it.

Issue 4 should be examined beyond the mere validation against document grammars and towards a general evaluation of the level of compliance of the markup with some formally specified expectations. EARMARK documents, while being subject to no document grammar in the stricter XML sense, allow the specification of any number of constraints, expressed either directly in OWL, or in SWRL [Horrocks et al. 2004], or even in SPARQL, that trigger or generate validity evaluations. In [Di Iorio et al. 2011] we tried to show that a large number of requirements, from hierarchical well-formedness in the XML sense, to validation requirements in terms of XML DTDs, to adherence to design patterns, can be expressed satisfactorily using these technologies.

Issue 5 regards the difficulty standoff notations have in providing inter-layer analysis on XML structures. Separate interpretation of markup layers is easy, but identification and validation of overlapping situations is more complex: standoff markup is mainly composed of pointers to content, and does not have any direct way to determine overlap locations without some kind of pointer arithmetic to compute them. Validation of contexts allowing overlaps, as describable using rabbit-duck grammars [Sperberg-McQueen 2006], is also not trivial. In this regard EARMARK provides yet again a solution that does not require special tools: although OWL does not allow direct pointer arithmetic, SWRL does, as shown in Section 4.3, where we described a batch of (SWRL-implementable) rules that determine overlapping locations on EARMARK documents with good efficiency.

Finally, issue 6 refers to the fact that the evolution of separate markup annotation layers needs to take place synchronously, lest one of them become misaligned with the new state of the document. This is, in summary, the fragility of pointers, which can be considered the fundamental weakness of standoff, as well as of any notation that keeps markup separate from its content: if a modification occurs to the underlying (probably text-based) source, all standoff pointers that could not be updated at the time of the change become outdated and possibly wrong. All standoff notations fall prey to this weakness, and there is no way to completely get rid of it.
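The fragility of character-based pointers is easy to reproduce. The Python sketch below is an illustration only: the sample strings and the remap helper are hypothetical, and the repair strategy (recomputing locations from a diff between the old and the new text, here through the standard difflib module) is just one possible way of updating pointers when the modification is computable.

```python
from difflib import SequenceMatcher


def remap(location: int, old: str, new: str) -> int:
    """Map a character location in `old` to the matching location in `new`.

    Locations inside unchanged text are shifted by the edits that precede
    them; a location that falls inside changed text is moved to the start
    of the corresponding replacement.
    """
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, old, new).get_opcodes():
        if i1 <= location <= i2:
            return j1 + (location - i1) if tag == "equal" else j1
    return len(new)


old_text = "The beginning and the end"
new_text = "The beginning and also the end"
begins, ends = 22, 25                 # pointer range over "end" in old_text

stale = new_text[begins:ends]         # the pointers were not updated
repaired = new_text[remap(begins, old_text, new_text):remap(ends, old_text, new_text)]
print(repr(old_text[begins:ends]), repr(stale), repr(repaired))
```

The stale range no longer selects the intended word once text is inserted before it, while the locations recomputed from the diff select it again.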
What is possible is to identify exactly the conditions under which this weakness manifests itself, and to see whether there is a way to reduce the frequency of such events. In fact, in order for a standoff pointer to become outdated, several conditions must hold at the same time:

- the standoff notation must be used as a storage format, rather than just as a processing format;
- the source document must make sense even without the additional standoff markup (i.e., the standoff notation contains no information that is necessary for at least some types of document modifications);
- the source document must be editable (and, in fact, must be edited) on its own;
- the standoff pointers must rely on positions that change when the source is edited (e.g., character-based locations);
- editing must be done in contexts and with tools that cannot or do not update the standoff pointers;
- there must be no computable way to determine the modifications of the document (e.g., via a diff between the old and the new version).

Of course, no standoff notation can rule out that these conditions occur on its documents. But it is worth pointing out that all six of them must occur for standoff pointers to become outdated. EARMARK is not safe from these occurrences either but, at least for the use cases described here, one or more of these conditions simply do not apply. EARMARK is mostly used as a processing format, with no need to save it on disk (conversion from the source formats, e.g. MS Word, is described in Section 6 and does not require special storage). Moreover, the source data is either in a very specific format (such as MS Word or ODT) that already handles its data changes internally and requires the overlapping data exactly for this purpose, or is the result of a diff action on successive versions of a document (as in the case of the wiki pages). Finally, EARMARK allows references to relatively stable fragment ids of the documents (by using XPath ranges without explicitly specifying begin and end locations), rather than to the extremely fragile character locations, further reducing the chances of outdated pointers. For this reason, although we cannot completely rule out the possibility of standoff pointers going wrong, we tend to consider it a very small risk, at least for the use cases described here.

4.5. Using OWL vs. RDF for standoff notations
EARMARK is strongly based on OWL 2 DL [W3C OWL Working Group 2009] to express multiple markup layers with possible overlapping ranges over the same content. OWL 2 DL is not the only possible choice for expressing standoff notations via Semantic Web technologies. In fact, RDF is another valid and effective model for dealing with the same issue, as shown in [Tummarello et al. 2005] by means of the open-source API RDF Textual Encoding Framework (RDFTef). This API was created to demonstrate a plausible way of handling overlapping markup within documents and of identifying the textual content of a document as a set of independent RDF resources that can be linked to one another and to other parent resources.

Besides making it possible to define multiple structural markup hierarchies over the same text content, the use of RDF as the language for encoding markup makes it possible to specify semantic data on textual content as well. But the main advantage of using RDF is the possibility of using particular built-in resources, defined on purpose in the RDF syntax specification [Beckett 2004], for describing and dealing with different kinds of containers, either ordered (rdf:Seq) or unordered (rdf:Bag). Thus, RDF resources can be used to represent every printable element in the text (words, punctuation, characters, typographical symbols, and so on), while RDF containers can be used to combine such fragments and other containers.

RDF alone is not sufficient to define a formal vocabulary for structural markup: does a given resource represent an element, an attribute, a comment or a text node? In which way is a resource of a certain type related to others? The specification of an RDFS [Brickley & Guha 2004] or of an OWL layer can successfully address these issues. Hybrid solutions obtained by mixing different models, even when they are built one upon another, may seem elegant but are not necessarily the best choice.
In fact, there exist well-known interoperability limits between OWL 2 DL and RDF that prevent the correct use of Semantic Web tools and technologies. In particular:

- any markup document made using RDF containers (e.g., to describe what markup items contain and in which order) and OWL ontologies (e.g., to define classes of markup entities and their semantics) results in a set of axioms that end up outside of OWL DL and well within OWL Full, which limits the applicability of the most frequently used Semantic Web tools, which are usually built upon the (computationally tractable) description logic underlying OWL 2 DL;
- the individual analysis of each language may not be applicable when we have to check particular properties that lie between the RDF and OWL layers. For example, verifying the validity of a markup document against a particular schema, which is one of the most common activities with markup, needs to work with both markup item structures (that would be defined in RDF) and logical constraints about classes of markup items (e.g., elements only, attributes only, the element “p”, all the elements of a particular namespace, etc., all of them definable in OWL).

Being able to express everything we need directly in OWL addresses both issues quite straightforwardly. The well-known absence of containers and sequences in OWL can be overcome by modeling classes in specific ways using specific design patterns such as [Ciccarese et al. 2008] and [Drummond et al. 2006].
5. USING EARMARK
There are multiple applications for the EARMARK approach. The most interesting for this paper is its capability of dealing with overlapping structures in an elegant and straightforward manner. Under EARMARK such structures do not need to be specified through complex workarounds as with XML; they are explicit and can be easily described and accessed. Sophisticated searches and content manipulations become very simple when using this ontological model. The goal of this section is to demonstrate the soundness and applicability of EARMARK by discussing how the use cases presented in Section 3 are addressed. Notice that throughout the section we investigate multiple EARMARK data structures and documents, focussing on the feasibility and potential of such an ontological representation.

5.1. Looking for authorial changes in Office Documents
The discussion in Section 3.1 showed that both ODT (OpenOffice format) and OOXML (Microsoft Word format) use complex data structures to store overlaps generated by change-tracking functionalities. These structures make it very difficult to search and manipulate the content when using XML languages and tools. Even very simple edits generate a rather tangled set of overlapping elements. Let us recall the example mentioned in Section 3.1, where the user “John Smith” splits a single paragraph into two. The ODT representation is:

    <text:tracked-changes>
      <text:changed-region text:id="S1">
        <text:insertion>
          <office:change-info>
            <dc:creator>John Smith</dc:creator>
            <dc:date>2009-10-27T18:45:00</dc:date>
          </office:change-info>
        </text:insertion>
      </text:changed-region>
      [... other changes ...]
    </text:tracked-changes>
    [... content ...]
    <text:p>The beginning and <text:change-start text:change-id="S1"/></text:p>
    <text:p>also <text:change-end text:change-id="S1"/>the end.</text:p>
The OOXML representation, as shown in Section 3.1, is even more complex. In fact, these formats rely on a large scale either on (tangled) fragmentation (OOXML) or on milestones and stand-off markup (ODT) to deal with overlaps. EARMARK, on the other hand, stores overlapping data in a direct and streamlined manner that does not require tools to rebuild information from the twists of a tree-based XML structure. The information is already available and expressed through consistent RDF and OWL statements.

Fig. 3 graphically shows the corresponding EARMARK document. The original paragraph content and the new string “also” are now encoded as two docuverses, over which the ranges r1, r2 and r3 are defined. The original paragraph is then composed of the (content of) ranges r1 and r2, while the paragraphs resulting after the (text and carriage return) insertion now comprise range r1 and ranges r2 and r3 respectively. Metadata about the author and the modification date are encoded as further RDF statements.

    Individual: doc1
        Types: StringDocuverse
        Facts: hasContent "The beginning and the end"

    Individual: doc2
        Types: StringDocuverse
        Facts: hasContent "also "
Fig. 3. Encoding in EARMARK the ODT change-tracking example.
    Individual: r1
        Types: PointerRange
        Facts: refersTo doc1,
               begins "0"^^xsd:nonNegativeInteger,
               ends "17"^^xsd:nonNegativeInteger

    Individual: r2
        Types: PointerRange
        Facts: refersTo doc1,
               begins "17"^^xsd:nonNegativeInteger,
               ends "25"^^xsd:nonNegativeInteger

    Individual: r3
        Types: PointerRange, insJS
        Facts: refersTo doc2,
               begins "0"^^xsd:nonNegativeInteger,
               ends "5"^^xsd:nonNegativeInteger

    Individual: p-b
        Types: Element
        Facts: c:firstItem p-b-i1
    Individual: p-b-i1
        Facts: c:itemContent r1, c:nextItem p-b-i2
    Individual: p-b-i2
        Facts: c:itemContent r2

    Individual: p-m
        Types: Element, insJS
        Facts: c:firstItem p-m-i1
    Individual: p-m-i1
        Facts: c:itemContent r3, c:nextItem p-m-i2
    Individual: p-m-i2
        Facts: c:itemContent r2

    Individual: insJS
        Types: Insertion
        Facts: dc:creator "John Smith",
               dc:date "2009-10-27T18:45:00"

    Individual: p-t
        Types: Element
        Facts: c:firstItem p-t-i
    Individual: p-t-i
        Facts: c:itemContent r1
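To give an idea of how directly this structure can be navigated, the Python sketch below mirrors the individuals of the listing above as plain dictionaries and rebuilds the text of each paragraph. It is an illustration, not an EARMARK API: the dictionary layout and the function names are hypothetical, the c:firstItem / c:nextItem chains are flattened into ordered lists, the association of insJS with p-m and r3 is recorded in a separate mapping, and the trailing space in the content of doc2 is an assumption made so that range r3 (locations 0 to 5) fits its docuverse.

```python
docuverses = {
    "doc1": "The beginning and the end",
    "doc2": "also ",
}

# range name -> (docuverse, begins, ends)
ranges = {
    "r1": ("doc1", 0, 17),
    "r2": ("doc1", 17, 25),
    "r3": ("doc2", 0, 5),
}

# element name -> ordered range names (flattened item chains)
elements = {
    "p-b": ["r1", "r2"],
    "p-m": ["r3", "r2"],
    "p-t": ["r1"],
}

# change metadata and the individuals associated with each change
changes = {
    "insJS": {"kind": "Insertion", "creator": "John Smith",
              "date": "2009-10-27T18:45:00"},
}
associated_changes = {"p-m": ["insJS"], "r3": ["insJS"]}


def range_text(name: str) -> str:
    """Text selected by a pointer range over its docuverse."""
    docuverse, begins, ends = ranges[name]
    return docuverses[docuverse][begins:ends]


def element_text(name: str) -> str:
    """Concatenate the text of the ranges an element contains, in order."""
    return "".join(range_text(r) for r in elements[name])


def paragraphs_inserted_by(author: str) -> dict:
    """Text of every element associated with an insertion by `author`."""
    return {
        name: element_text(name)
        for name in elements
        if any(
            changes[c]["kind"] == "Insertion" and changes[c]["creator"] == author
            for c in associated_changes.get(name, [])
        )
    }


print(element_text("p-b"))
print(paragraphs_inserted_by("John Smith"))
```

No traversal of milestones or tracked-change regions is needed: the membership of each range in each paragraph is stated explicitly, so selecting the paragraphs of a given author reduces to a filter over the metadata.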
The advantages of streamlining overlaps become apparent if we consider tasks a little beyond mere display. For instance, the query for “the textual content of all paragraphs inserted by John Smith” ends up rather entangled if we use XPath on the ODT structure. The process for finding that textual content needs to browse the current version of the document and look for all the text:change-start/text:change-end pairs that refer to an insertion made by John Smith involving the creation of a new paragraph (i.e., text:change-start is in a first paragraph while its pair, text:change-end, is in the following one), and that are either currently present in the document body or hidden behind a subsequent deletion made by someone else. Once the paragraphs are identified, we need to retrieve the content that was originally contained there, i.e., the text fragments that are still within those boundaries or that may have been deleted in subsequent versions. The following XPath represents an implementation of the above process:

    for $cr in (//text:changed-region),
        $date in ($cr/text:insertion//(@office:chg-date-time|dc:date))
    return $cr[.//text:insertion[
        (.//@office:chg-author = 'John Smith' and count($cr//text:p) = 2) or
        (.//dc:creator = 'John Smith' and
          (//text:change-start[@text:change-id = $cr/@text:id]/following::text:p
           intersect
           //text:change-end[@text:change-id = $cr/@text:id]/(ancestor::text:p)))]]
      /root()//((text:change-start[@text:change-id = $cr/@text:id]/(following::text:p//((text()|
        (for $tc in (text:change) return
          //text:changed-region[@text:id = $tc/@text:change-id and
            not(text:insertion//(@office:chg-date-time|dc:date) > $date)]//text:p[1]//text()))
        except
        ((for $tc in (text:change) return
          $tc[count(//text:changed-region[@text:id = $tc/@text:change-id and
            not(text:insertion//(@office:chg-date-time|dc:date) > $date)]//text:p) = 2]/following::text())
         union
         (//text:changed-region/text:deletion[.//dc:date [...] $date)]//text:p[1]//text()))
        except
        ((for $tc in (following::text:change) return
          $tc[count(//text:changed-region[@text:id = $tc/@text:change-id and
            not(text:insertion//(@office:chg-date-time|dc:date) > $date)]//text:p) = 2]/following::text())
         union
         (//text:changed-region/text:deletion[.//dc:date [...] $date]|ancestor::w:del[@w:date [...]

    [...] <tbody> <tr valign="top"> [...]

Footnote 18: http://www.mediawiki.org
Footnote 19: For the sake of clarity we removed all markup irrelevant to our discussion.
Table I. All the versions of a wiki page modified by different authors.

    Version  Author           Content
    V1       151.61.3.122     Bob was farming carrots and tomatoes
    V2       Angelo Di Iorio  Bob was farming carrots, tomatoes and beans
    V3       Silvio Peroni    Bob was farming carrots, tomatoes and green beans. They were all tasteful.
    V4       Fabio Vitali     Bob was farming carrots, tomatoes and green beans. [new paragraph] They were all tasteful.