2012 IEEE 36th International Conference on Computer Software and Applications Workshops
Robust XML Watermarking Using Fuzzy Queries
Tchokpon Romaric/Ernesto Damiani
Nadia Bennani
Università degli studi di Milano Dipartimento di Tecnologie dell’Informazione via Bramante 65 26013 Crema (CR), Italy {romaric-ange.tchokpon, ernesto.damiani}@unimi.it
Université de Lyon, CNRS, INSA-Lyon 20 Avenue Albert Einstein, Villeurbanne, France
[email protected]
extension requires tackling several challenges. The first issue is to find a suitable location for the watermark; another is to define a suitable algorithm for watermark embedding. Another crucial issue is defining a way to evaluate the robustness of the algorithm with respect to alterations and other kinds of attacks. Some early attempts were promising, but research is still at a preliminary stage. The contributions of these articles [1] [2] [4] [5] are presented in the related work section of this paper.
Abstract— XML is today the most widely used data interchange format for business-to-business applications. Indeed, an increasing amount of data is created and published over the Internet every day in XML format. Moreover, organizations increasingly need to share sets of XML documents, usually managed via a common XML repository. XML copyright protection and source tracking have become strong requirements for collaborative environments. In the literature, XML-specific fingerprinting mechanisms have been proposed, inspired by similar work on relational data. However, their robustness is impaired by the fact that an XML file can undergo a set of updates that change both the file structure and content. In this paper, we propose a novel watermarking scheme for XML files. This solution is based first on an adequate selection of XML locators, i.e. document fragments selected to carry the watermark. Watermark retrieval is achieved thanks to a set of personalized fuzzy queries that reconstruct the locators that contain the watermark. We show theoretically the watermark robustness against possible XML file transformations; then, we present some initial experiments that validate the approach.
In our proposal, we use a targeted embedding technique to choose watermark locations according to given user profiles. Then, we use fuzzy queries to reconstruct the data corresponding to our locations within the XML file, in order to tolerate modifications. This paper is organized as follows. In Section II, we present existing results in XML watermarking and fuzzy XML querying that are relevant to our work. In Section III, we present the architecture of our XML watermarking framework. Section IV presents the results of some preliminary experiments. II.
Keywords: XML watermarking; copyright protection
I.
A. XML watermarking
Copyright protection has become a major requirement for many applications, and digital watermarking is a widely used technique to achieve this goal. Originally, digital watermarking was applied to multimedia documents, mainly images, video and audio files. Research on applying digital watermarking to other types of data started some years ago. The first attempt to watermark data in a database was proposed in [1] and relates to the relational model. In that paper, the authors present a watermarking algorithm that modifies some bits of numerical attributes of some relational tuples. A major contribution of that paper is an algorithm to choose the attributes to be watermarked. Once these attributes (called locators) have been selected, the authors change a certain number of them in such a way that changes in a few of their values do not affect the application. The proposal is robust against some attacks, such as data updates, but is also limited, as the proposed technique is applicable only to numerical attributes. This article inspired other researchers to propose the notion of watermarking for the XML data model as well.
INTRODUCTION
Today, many applications use the XML format to exchange information. However, data exchanged over the network have different levels of sensitivity. Our proposal provides a way to fingerprint XML documents using digital watermarking, so that the ownership of an XML file can be verified using fuzzy queries. Let us consider a scenario where a user has created an XML file and wants to publish it on the network. We want to provide a way to mark this document transparently and robustly. This mark should resist basic transformations such as re-arranging the file structure and copying portions of the document. The choice of digital watermarking to achieve this goal is motivated by research results obtained for multimedia data. Originally, digital watermarking is the process of embedding information into a digital signal that may be used to verify its authenticity or the identity of its owner. Several works on digital watermarking for multimedia data (audio, video, images) demonstrate its advantages [9] [10]. Some years ago, researchers started to consider the watermarking of relational and XML data, but such
978-0-7695-4758-9/12 $26.00 © 2012 IEEE DOI 10.1109/COMPSACW.2012.82
RELATED WORK
Securing XML documents is not a recent research topic [10]. Many proposals exist, among them standards such as XML Signature and XML Encryption [11], which rely on cryptography and are used to guarantee integrity and confidentiality respectively. These concepts are fundamental to XML and Web services security but are not sufficient to verify the ownership of a document. Indeed, signing a document proves the signer’s identity but not ownership. This is the role of digital watermarking. Zhou et al. in [3] extended the solution dedicated to databases [1] to XML files. The locators in their solution are XML elements instead of relational database attributes. Considering the tree structure of an XML file, the solution targeted only numeric and textual leaf nodes as locators. In [3], a locator selection algorithm was proposed to identify a set of query templates to increase data usability. The paper presented three major properties that a watermarking scheme must achieve: imperceptibility means that the watermark should not alter data usability; resilience means that attacks do not succeed in deleting the watermark; and credibility means that the detection stage identifies the owner with sufficient confidence. Besides, [2] extends the locator selection to non-leaf nodes. Each selected locator receives one bit of the bit sequence to be embedded in the file. The solution was designed to be resilient mainly to data reorganization. To cope with this, the verification step attempts to establish a mapping between the initial structure of an XML file and the structure of the file to verify. Then the set of chosen query templates is rewritten to comply with the new structure in order to retrieve the watermark. The rewriting phase is necessary because it allows the locators to be found in a specified order so that the watermark can be recognized, which limits flexibility and is costly. Shingo et al. in [4] propose some techniques to embed a binary sequence of bits, called a pattern, in an XML file. One example of these techniques is the order in which attributes appear in an element. Assume an element elt with two attributes att1 and att2. The appearance of att1 then att2 in the element elt corresponds to embedding ‘1’, whereas the order att2 then att1 corresponds to a ‘0’. Another technique embeds a ‘0’ by representing an element elt with one form of its structure, whereas embedding ‘1’ corresponds to another form of the structure of the element elt. This solution is not applicable to professional use of XML files where a strict DTD or XML schema has to be enforced. Besides, this solution has weak robustness against reorganization attacks and data updates, as can be seen for example 2. Nevertheless, this paper is relevant to our approach regarding pattern definition. Finally, Mir et al. in [5] propose a specific watermarking technique for web content. The XML-based HTML file is watermarked by replacing some words with their synonyms or acronyms, stored in a list. The browser performs the verification stage using the synonym or acronym list. Although the technique itself is not applicable in contexts other than web content, it is interesting in the sense that it involves the browser in the verification stage before applying the original file. B.
Fuzzy queries
A major challenge in XML watermarking is the extraction of watermarked elements after modifications. As the structure of the XML file can be modified easily, we need a way to find locators in a restructured file. The previous works [1][2][3][4][5] proposed several methods to detect the presence of the watermark or to extract it from the XML file. These methods are based on a predefined set of modifications and attacks. However, XML files can be modified so easily that it is almost impossible to anticipate all possible modifications. Finding a standard technique to extract specific information is therefore not an easy task. Many proposals exist in the field of querying XML files regardless of their structure and modifications. Fuzzy queries are one of these proposals. In [6] [7], the authors introduced a query language that takes into consideration the fuzziness of the XML file structure. The proposed language uses fuzzy predicates to retrieve information from an XML dataset. This language is very interesting because, regardless of modifications to the dataset structure, one can extract the desired information with some approximation.
In this paper, we present a novel pattern-based framework for XML file watermarking. After an adequate locator selection procedure is applied, the owner pattern is embedded in every selected locator, and a fuzzy query is assigned to it and used every time the ownership has to be verified. Thanks to fuzzy queries, the target locators are retrieved even in case of reorganization or locator name changes. Embedding the complete pattern in each locator reduces the threat of reorganization attacks, as only one remaining watermarked element is needed to prove the file ownership. This also increases the watermark robustness against locator deletion attacks. In this paper we mainly focus on the benefit of using fuzzy queries to retrieve the watermark. III.
OUR SOLUTION
Our solution is aimed at solving three main issues: the choice of the locators, the watermarking process to use, and the use of fuzzy queries to extract the locators. Locator selection is a major issue in watermarking XML files. Many criteria could be used to choose the locators. One is the relevance of the tag for the usability of the data [2]: removing such a tag from the file would break its meaning. Another possibility is to define a metric that evaluates the frequency of each tag in the document and then to choose the least or the most frequent tag: the least frequent because the probability that a malicious user modifies this tag is low, as it has little importance in the file, and the most frequent because the probability that tags of this kind
will be completely deleted is very low. In the literature, many techniques exist to select relevant elements in an XML file [2][3]. We propose to use tags that are relevant to the user profile. This means that, for an XML file used by many users with different interests, we choose the tags that are most used by each profile. In a sample file containing, for example, data for the financial service, the customer service, and the human resources service, each service will query certain tags more than others. In our proposal we define a pattern per user and per service.
The watermarking process described in Figure 1 is also an important issue, mainly the algorithms and the information to embed. We propose here to define a unique pattern for each user. The challenges at this step concern the representation of the watermark and the algorithm to use. In our framework, we propose to embed a series of randomly generated bits. A function is used to insert this watermark in the selected tag, and a reverse function extracts the watermark from these tags. Each bit is represented by an invisible character in the locator text. We choose to use a space character to represent “1”. We then insert all the positions of the pattern in each occurrence of the locator. The redundancy of the watermark increases its robustness. The advantage is that we can recover the entire watermark from any watermarked tag. The system is thus robust to reorganization of the tags, as the insertion sequence of the bits is conserved in each watermarked tag.
Our architecture is composed of three main components. The processes describing the working scenario are illustrated in Figure 1 and Figure 2. The next section describes each component involved in these processes.
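The insertion function and its reverse can be illustrated with a minimal sketch. It is not the authors' program: the names `embed` and `extract` are hypothetical, and the bit convention follows the worked example of Section IV, where each '0' of the pattern inserts one mark character and each '1' keeps the next original character (swapping the two roles is a one-line change). The sketch assumes the locator text has at least as many characters as the pattern has '1' bits, and that those leading characters are not themselves the mark character.

```python
def embed(text: str, pattern: str, mark: str = " ") -> str:
    """Interleave the bit pattern with the leading characters of text."""
    chars = iter(text)
    out = []
    for bit in pattern:
        # '0' inserts the mark; '1' keeps the next original character.
        out.append(mark if bit == "0" else next(chars, ""))
    out.extend(chars)  # the remainder of the text is left untouched
    return "".join(out)


def extract(text: str, length: int, mark: str = " ") -> str:
    """Read the first `length` positions of a locator back as bits."""
    return "".join("0" if c == mark else "1" for c in text[:length])


# '*' stands in for the space character so that the mark is visible.
marked = embed("Stevens W.", "111001", mark="*")
print(marked)                    # Ste**vens W.
print(extract(marked, 6, "*"))   # 111001
```

Because the whole pattern sits at the start of every locator, any single surviving locator is enough to read the pattern back, whatever the order of the tags in the file.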
Figure 1. Process of watermarking an XML file
B. The Extraction Unit
The Extraction Unit answers watermark verification queries for a given XML file f and a specified user identity. Once requested, the Extraction Unit first obtains the fuzzy query previously created for the file f from the local database and applies it to f. Thanks to the NEAR predicate, the fuzzy query retrieves tags with both exact values and synonym values. The user pattern obtained from the pattern manager is then applied to the selected tags to verify the watermark extraction rate.
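As a rough illustration of this verification step, the sketch below selects locators by exact tag name or by synonym and then computes the fraction of them that still carry the user pattern. Everything named here is an assumption made for the example: the `SYNONYMS` table, the functions `select_locators` and `extraction_rate`, the sample document, and the convention that a '0' bit is a space character at the corresponding position.

```python
import xml.etree.ElementTree as ET

# Hypothetical synonym table; a real system would load it from a resource.
SYNONYMS = {"author": {"author", "writer", "creator"}}


def select_locators(root, name):
    """Return elements whose tag equals `name` or one of its synonyms."""
    accepted = SYNONYMS.get(name, {name})
    return [el for el in root.iter() if el.tag in accepted]


def extraction_rate(elements, pattern, mark=" "):
    """Fraction of the selected elements whose text starts with the pattern."""
    if not elements:
        return 0.0

    def bits(text):
        return "".join("0" if c == mark else "1" for c in text[:len(pattern)])

    hits = sum(bits(el.text or "") == pattern for el in elements)
    return hits / len(elements)


doc = ET.fromstring(
    "<bib>"
    "<book><author>Ste  vens W.</author></book>"
    "<book><writer>Ger  barg Darcy</writer></book>"
    "<book><author>Suciu Dan</author></book>"
    "</bib>"
)
found = select_locators(doc, "author")
rate = extraction_rate(found, "111001")
print(len(found), round(rate, 2))
```

In this sample, the renamed `writer` tag is still retrieved through the synonym table, and two of the three retrieved locators carry the pattern.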
A. The Embedding Unit
1) Locator Selection Manager (LSM)
This component is in charge of selecting the locators from the XML file given as input. It extracts the tags of the document by analyzing the most frequently executed queries. If the file has never been queried, the LSM defines default locators in which to embed the watermark. The selected locators are then stored in a local database with the name of the target file, and will be used later in the extraction phase. Meanwhile, the file name and the user ID are sent to the pattern generator, which assigns a personalized pattern to the user and stores it securely in a database.
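One possible reading of the LSM's selection rule is sketched below: count how often each tag is the target (last step) of a logged path query and keep the most frequent ones, falling back to default locators when the log is empty. The function name `choose_locators`, the query log format, and the default list are all assumptions made for illustration; the paper does not specify them.

```python
import re
from collections import Counter

# Hypothetical fallback used when the file has never been queried.
DEFAULT_LOCATORS = ("author",)


def choose_locators(query_log, top_k=1, defaults=DEFAULT_LOCATORS):
    """Pick the tags most often targeted by the logged path queries."""
    counts = Counter()
    for query in query_log:
        # Tag names are taken to be the identifiers that follow a '/' step;
        # the last one is treated as the tag the query actually targets.
        steps = re.findall(r"/([A-Za-z_][\w.-]*)", query)
        if steps:
            counts[steps[-1]] += 1
    if not counts:
        return list(defaults)
    return [tag for tag, _ in counts.most_common(top_k)]


log = [
    "/bib/book/author",
    "/bib/book/title",
    "//book/author",
    "/bib/book/publisher",
]
print(choose_locators(log))   # ['author']
print(choose_locators([]))    # ['author'] (defaults)
```

Keeping one such log per user profile would yield one locator set per profile, in line with the per-user, per-service patterns described earlier.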
Figure 2. The watermark extraction process
IV.
EXPERIMENTS
A. Validation methodology
To validate our model we proceed in three steps. First, we define an XML sample file and the locators that will contain our watermark. Then we create the watermark and embed it in our file using our program. Once it is inserted, we simulate some attacks on the watermarked file: node insertion, node deletion, file reorganization, and copy-and-paste modifications. The objective is to extract the inserted watermark using the fuzzy queries even if the file has been modified. We customize an implementation [9] of fuzzy queries in Java. This code implements the ‘NEAR’ predicate to compare the given statement to all the values found. This comparison is defined by multiple functions, depending on the type of the value to compare.
2) Watermark-Embedding Unit
The Watermark-Embedding Unit receives both the user pattern and the list of selected locators, and embeds the pattern in them.
Then, in a third step, we measure the proportion of the watermark still present in the updated file.
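The three-step methodology can be mimicked on a toy document. The sketch below is only an assumption-laden illustration, not the authors' Java program: the helper names, the sample authors, the use of '*' as a visible stand-in for the space character, and the definition of the surviving rate (fraction of `author` elements that still start with the pattern) are all choices made for the example.

```python
import copy
import random
import xml.etree.ElementTree as ET

PATTERN = "111001"
MARK = "*"  # visible stand-in for the space character


def build(authors):
    """Build a small <bib> document with one <book><author> per name."""
    root = ET.Element("bib")
    for name in authors:
        book = ET.SubElement(root, "book")
        ET.SubElement(book, "author").text = name
    return root


def surviving_rate(root, tag="author"):
    """Fraction of `tag` elements whose text still starts with PATTERN."""
    locators = list(root.iter(tag))
    if not locators:
        return 0.0

    def bits(text):
        return "".join("0" if c == MARK else "1" for c in text[:len(PATTERN)])

    return sum(bits(el.text or "") == PATTERN for el in locators) / len(locators)


def insert_nodes(root, n):
    for i in range(n):  # added books carry no watermark
        book = ET.SubElement(root, "book")
        ET.SubElement(book, "author").text = f"Unmarked Author {i}"


def delete_nodes(root, n):
    for child in list(root)[:n]:
        root.remove(child)


def reorganize(root, seed=0):
    children = list(root)
    random.Random(seed).shuffle(children)
    root[:] = children  # reorder the books in place


original = build(["Ste**vens W.", "Abi**teboul Serge",
                  "Bun**eman Peter", "Ger**barg Darcy"])
attacks = {
    "insertion": lambda r: insert_nodes(r, 2),
    "deletion": lambda r: delete_nodes(r, 2),
    "reorganization": reorganize,
}
results = {}
for name, attack in attacks.items():
    victim = copy.deepcopy(original)  # each attack starts from a clean copy
    attack(victim)
    results[name] = surviving_rate(victim)
print(results)
```

Under this definition, deletion and reorganization leave the rate at 1.0 because every remaining locator holds the full pattern, while insertion lowers it only because unmarked locators are added to the denominator.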
TCP/IP Illustrated Ste**vens** W. 10ab7 Addison-Wesley
Advanced Programming in the Unix environment Ste**vens** W. 13cd7 Addison-Wesley
Data on the Web Abi**tebo**ul S**erge Bun**eman** Pet**er Suc**iu D**an 19aqd8 Morgan Kaufmann Publishers
The Economics of Technology and Content for Digital TV Ger**barg** Dar**cy 11abw Kluwer Academic Publishers
B. Test configuration
We used the file illustrated in Figure 3 for the experiments. This file contains information about books, such as the author, the publication authorization id, and the publisher. We created a Java program that allows the user to select the XML file to watermark. In this first version, the locators are chosen manually by specifying node names. We choose the author tag as a sample tag, as we consider it essential to a book description. The author pattern used in the test is 111001. The watermarked file is shown in Figure 4.
In this example, we choose to embed the pattern in the content of the selected tags as follows: each ‘0’ in the pattern is embedded by adding a space character to the content. In Figure 4, we replace the space character by a ‘*’ to make it visible.
TCP/IP Illustrated Stevens W. 10ab7 Addison-Wesley
Advanced Programming in the Unix environment Stevens W. 13cd7 Addison-Wesley
Data on the Web Abiteboul Serge Buneman Peter Suciu Dan 19aqd8 Morgan Kaufmann Publishers
The Economics of Technology and Content for Digital TV Gerbarg Darcy 11abw Kluwer Academic Publishers
Figure 4. wbook.xml
To test the robustness of our proposal, our program simulates some updates of the watermarked XML file according to [7]. We then execute a node addition, a node deletion, and a file reorganization. In future work we plan to cover a larger set of updates. Finally, we evaluate for each modification the percentage of the watermark that remains in the file. An important step in the detection of the watermark is the creation of fuzzy queries. This step will be implemented in a future version of this work. In this preliminary version, we use the NEAR predicate to extract all the tags whose labels contain the text “author”. The query executed is the following:
/bib{/#}[{../tagName() NEAR 'author'}]
{/#} is used to select all the parents and the children of the context node. tagName() is used to extract the label of each tag. At this step, we query the file wbook.xml in Figure 4. The output of the query is an XML file containing all the locators that are similar to the one specified. Figure 5 is the output of our query. The results of the simulated updates are presented in the next part of the paper.
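The effect of such a query can be approximated outside the fuzzy query engine. The sketch below is a stand-in, not the language of [6] [7] or the customized Java implementation: it replaces the NEAR predicate with a substring test plus a string-similarity ratio from Python's `difflib`, and the names `near` and `fuzzy_select`, the 0.7 threshold, and the sample tags are assumptions.

```python
from difflib import SequenceMatcher
import xml.etree.ElementTree as ET


def near(label, target, threshold=0.7):
    """Approximate NEAR: the label contains the target or closely resembles it."""
    label, target = label.lower(), target.lower()
    if target in label:
        return True
    return SequenceMatcher(None, label, target).ratio() >= threshold


def fuzzy_select(root, target):
    """Return every element, at any depth, whose tag name is NEAR the target."""
    return [el for el in root.iter() if near(el.tag, target)]


doc = ET.fromstring(
    "<bib><book>"
    "<author>Ste  vens W.</author>"
    "<co-author>Ger  barg Darcy</co-author>"
    "<autor>Abi  teboul Serge</autor>"
    "<publisher>Addison-Wesley</publisher>"
    "</book></bib>"
)
matches = fuzzy_select(doc, "author")
print([el.tag for el in matches])   # ['author', 'co-author', 'autor']
```

Scanning all elements regardless of their position mirrors the role of {/#} in the query, so a locator that has been renamed slightly or moved elsewhere in the tree is still returned.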
Figure 3. Book.xml
C. Results
Using the file book.xml (see Figure 3) and the pattern 111001, we obtain as output the watermarked file in Figure 4. Then we execute the fuzzy query defined previously to extract the locators whose values are similar to “author”. The output of our fuzzy query is the file results.xml in Figure 5.
It is still possible to prove who is the owner of this file because the extracted watermark value is the same for all locators. The number of nodes added has no effect on the process, but this case is worth examining because it raises the issue of distinguishing the original nodes from the added ones.
Update 2: node deletion
• We delete a number n with N representing