On Evaluating an Approach for Balancing the Trade-Off on XML Schema Design Rebeca Schroeder Univ. Fed. do Paraná Curitiba, Brazil
Denio Duarte Univ. Fed. da Fronteira Sul Chapecó, Brazil
Ronaldo S. Mello Univ. Fed. de St. Catarina Florianópolis, Brazil
Abstract

Purpose – Designing efficient XML schemas is essential for XML applications that manage semi-structured data. On generating XML schemas, there are two opposite goals: (i) to avoid redundancy, and (ii) to provide connected structures in order to achieve good query performance. In general, highly connected XML structures allow data redundancy, while redundancy-free schemas generate disconnected XML structures. The purpose of this paper is to describe, and evaluate by experiments, an approach that balances this trade-off through a workload analysis. Additionally, we identify the most accessed data based on the workload and suggest indexes to improve access performance.

Design/methodology/approach – We apply and evaluate a workload-aware methodology that provides indexing and highly connected structures for data which are intensively accessed through the paths traversed by the workload.

Findings – We present benchmarking results on a set of design approaches for XML schemas and demonstrate that the XML schemas generated by our approach provide high query performance at a low cost of data redundancy when balancing the trade-off in XML schema design.

Research limitations/implications – Although an XML benchmark is applied in our experiments, further experiments on a real-world application are expected.

Practical implications – The proposed approach may be applied in a real-world process for designing new XML databases, as well as in a reverse engineering process to improve XML schemas from legacy databases.

Originality/value – Unlike related work, the reported approach integrates the two opposite goals of XML schema design and generates suitable schemas according to a workload. An experimental evaluation shows that the proposed methodology is promising.

Keywords – XML schema, logical design, workload, redundancy, indexing
Article classification – Research paper
1. Introduction

XML is a well-known model for data exchange and data representation [Bird et al., 2000; Routledge et al., 2002; Moro et al., 2007]. It has been widely applied in several contexts [Moro et al., 2009; Brantner, 2009]. Today, there are many XML database management systems [Bradford et al., 2011; Schöning, 2011] as well as Web XML repositories [Ley, 2011; Miklau, 2012]. In response to the widespread adoption of XML-based applications, designing efficient XML schemas has become a natural problem: XML schemas must represent aspects of a data domain as a logical model. In order to provide a good understanding of the data domain, an appropriate approach is to generate XML schemas from high-level schemas produced in a conceptual modeling phase [Mok et al., 2006]. In other words, we may consider the XML design process in the context of the traditional database design approach, a three-phase process starting with conceptual modeling and followed by the logical and physical modeling phases. In this paper, we focus on the logical phase, assuming that a conceptual schema is given from which suitable logical schemas for XML data are generated. In general, a conceptual schema does not include information about the application workload, on the assumption that performance issues should be considered only at the physical level. However, most domain expert users are aware of the main workload and are able to estimate it at the conceptual level. Our methodology considers workload information at the conceptual level in order to make better design choices at the logical level. In this way, we avoid deferring workload information to the physical modeling phase, where it could force modifications to logical and even physical schemas already defined.
Translating conceptual schemas into XML schemas is not a straightforward process, given that conceptual schemas are complex graphs to be represented by XML trees. In a previous work [Schroeder et al., 2011], we addressed a trade-off found in this translation process. Indeed, there are two opposite goals when generating XML schemas: (i) to generate highly connected XML structures, or (ii) to generate redundancy-free schemas. In general, connected structures provide higher query performance due to the tight coupling between XML elements. On the other hand, redundancy-free schemas ensure normalized databases and avoid update anomalies. The goals are opposite because, in general, highly connected structures are achieved by XML schemas that allow redundancy. In this paper, we define a complete approach to balance this trade-off by extending our previous solution [Schroeder et al., 2011]. Our design method defines criteria for tolerating data redundancy in order to represent the main workload by highly connected structures whenever possible. We have shown that our approach can improve overall system throughput by providing better query-processing results than approaches focused on only one of the conflicting goals. Although indexing is not applied by our previous solution, we have identified that it can improve query response time in a native XML database by roughly 100% on average [Schroeder et al., 2011]. The extended approach presented in this paper incorporates indexing strategies on XML elements that are relevant for the workload. Our contribution is supported by benchmarking results comparing our method to related work. Therefore, this extended process is able to design XML documents that can process the main workload efficiently.

The remainder of this paper is organized as follows. Section 2 describes the models and workload information considered by our design approach.
In the following, Section 3 presents our extended process for mapping conceptual constructs to suitable structures in the XML logical model. In Section 4, experiments on an XML database show that our approach provides better query-processing results than related approaches. Section 5 discusses related work, and Section 6 concludes the paper by outlining future work.
2. Preliminaries

This section presents definitions for the models and workload data applied by our design method. Our purpose here is to characterize the workload on conceptual schemas in order to define suitable schemas following an XML logical model.

2.1. Conceptual Model

Our approach converts conceptual schemas into XML logical schemas, where conceptual schemas are defined by the Extended Entity-Relationship (EER) model [Batini et al., 1992]. We do not consider conceptual models based on the XML model [Embley et al., 2004; Mani, 2004] because they mix conceptual and logical (XML) constructs in the same model. We argue that a high-level abstraction of a data domain must be provided through a pure conceptual model, as in traditional database design. Other conceptual models, such as UML, could be considered; we adopt EER because it contains the essential constructs used for conceptual modeling. Without loss of generality, a conceptual schema is defined as a tuple Ɛ = (T, l, G, R), where (i) T is a set of entity and relationship types t1, t2, ..., tn; (ii) l is a function that assigns a label to each type in T; (iii) G is a set of generalization relationships; and (iv) R is a set of aggregation relationships. From now on, type means both entity and relationship types. A generalization relationship is defined as a tuple γ = (super, sub), where super is a function that returns the label of the entity type which acts as superclass, and sub is a function that returns the labels of the entity types which act as subclasses. We refer to G = {g1, g2, ..., gn} as the set of generalization relationships of Ɛ, where each gi ∈ G (1 ≤ i ≤ n) is defined by γ. Figure 1 shows a conceptual schema with three generalization relationships. We name gA the relationship stated by the superclass A with the subclasses
B and C. Besides, there are the relationships gC and gD established by the superclasses C and D, respectively.
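The generalization part of the tuple Ɛ = (T, l, G, R) can be sketched in Python. This is a minimal illustration under our own encoding: the dictionary representation, the function names `super_` and `sub`, and the subset of types are assumptions, not the paper's notation.

```python
# A minimal sketch (assumed encoding) of the conceptual schema tuple
# E = (T, l, G, R) from Section 2.1, restricted to generalizations.

# T: entity and relationship type identifiers; l assigns a label to each.
T = {"A", "B", "C", "D", "E", "F"}   # a hypothetical subset of Figure 1
l = {t: t for t in T}                # here every type is labeled by its own name

# A generalization gamma = (super, sub): one superclass, a set of subclasses.
# gA relates the superclass A to the subclasses B and C, as in the text.
G = [{"super": "A", "sub": {"B", "C"}}]  # gA

def super_(g):
    """Label of the superclass of generalization g."""
    return l[g["super"]]

def sub(g):
    """Labels of the subclasses of generalization g."""
    return {l[t] for t in g["sub"]}

gA = G[0]
print(super_(gA), sorted(sub(gA)))   # A ['B', 'C']
```

The same encoding extends directly to gC and gD by adding further entries to G.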
Figure 1. Conceptual Schema.

An aggregation relationship is defined as a tuple ρ = (tr, agg, min, max), where (i) tr is a function that returns the label l(tr), tr being the relationship type which defines the aggregation relationship; (ii) agg is a function that returns a set of labels l(t1), ..., l(tn), where each ti is an entity type related by tr; (iii) min is the minimum occurrence of an entity type ti in tr; and (iv) max is the maximum occurrence of an entity type ti in tr. We refer to R = {r1, r2, ..., rn} as the set of aggregation relationships of Ɛ, where each ri ∈ R (1 ≤ i ≤ n) is defined by ρ. The aggregation relationships in Figure 1 are defined by the relationship types G, I, K and L. In rG, tr(rG) = G, agg(rG) = {D, H}, min(rG,D) = 1, min(rG,H) = 0, max(rG,D) = 1 and max(rG,H) = n. For recursive relationships, such as the one stated by K, roles are assigned to the related entity types. For example, tr(rK) = K, agg(rK) = {J', J''}, min(rK,J') = 0, min(rK,J'') = 0, max(rK,J') = 1 and max(rK,J'') = 1.

2.2. Workload Characterization

We assume that the conceptual schema and workload information are the input to our approach. Workload information corresponds to the data load expected for an XML-based application. Notice that we are not concerned here with how to obtain conceptual schemas and workload information. Instead, we focus on using such information to generate XML schemas that result in good performance for query and update operations on XML data. Our workload-aware approach identifies critical types of Ɛ in order to represent them through suitable structures in the logical model. Critical types are the concepts frequently accessed by transactions; they must be modeled through appropriate XML structures in order to provide good performance. We propose two measures for identifying critical types: General Access Frequency (GAF) and General Update Frequency (GUF).
Before defining GAF and GUF, we consider the volume of data and the application load expected on each type. Such information is provided by applying the workload modeling methodology defined in [Batini et al., 1992].

Definition 1. Volume of Data. Given an EER schema Ɛ, the volume of data is defined by V = {N(t), Avg(Γ,r)}, where N(t) is the average number of occurrences of a type t (t ∈ T) and, given Γ = agg(r) as the list of entity types associated through a relationship type r, Avg(Γ,r) is the average cardinality among the entities in Γ.
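The aggregation tuple ρ from Section 2.1 and the volume of data V from Definition 1 can be sketched together. The class name, the dictionary encodings, the "n" marker for unbounded occurrences, and the keys used for Avg are our assumptions; the concrete values for rG, rK, N(A) and N(B) are the ones given in the text.

```python
# Hedged sketch: aggregation rho = (tr, agg, min, max) and volume of data V.
# Encodings and names are ours; the rG and rK values come from the text.

class Aggregation:
    def __init__(self, tr, agg, min_occ, max_occ):
        self.tr = tr          # label of the defining relationship type
        self.agg = agg        # labels of the related entity types
        self.min = min_occ    # minimum occurrence per entity type
        self.max = max_occ    # maximum occurrence per entity type ("n" = many)

# tr(rG)=G, agg(rG)={D,H}, min(rG,D)=1, min(rG,H)=0, max(rG,D)=1, max(rG,H)=n
rG = Aggregation("G", {"D", "H"}, {"D": 1, "H": 0}, {"D": 1, "H": "n"})

# Recursive relationships use roles, e.g. rK over J' and J''.
rK = Aggregation("K", {"J'", "J''"}, {"J'": 0, "J''": 0}, {"J'": 1, "J''": 1})

# Volume of data V = {N(t), Avg(Gamma, r)}: average instance counts per type
# and average cardinalities per relationship, as annotated in Figure 1.
N = {"A": 300, "B": 100}                       # N(A)=300, N(B)=100
Avg = {("D->H", "G"): 1, ("H->D", "G"): 10}    # assumed (direction, rel) keys

print(rG.max["H"], N["A"], Avg[("H->D", "G")])   # n 300 10
```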
Figure 1 presents a conceptual schema enhanced with the volume of data. The average number of instances N expected for each type is represented in the type shape, and the average cardinality (Avg) is presented on the associations. For the sake of clarity, some Avg parameters are omitted. Thus, we have N(A) = 300, N(B) = 100 and so on. For the average cardinality, we have Avg(D→H, G) = 1, Avg(H→D, G) = 10 and so on. The cardinality direction is given as follows: the average number of instances of D related to H through G (i.e., Avg(D→H, G)) is 1, and it appears next to H. In the same way, the value of Avg(H→D, G) appears next to D.

Definition 2. Application Load. Consider a conceptual schema Ɛ and a set of operations O = {o1, o2, ..., on} over Ɛ such that each oi ∈ O is applied over a list of types Toi = (t1, t2, ..., tp) with Toi ⊆ T. The application load on Ɛ is defined by a set of query and update operations and is composed of three functions: f(oi), the average frequency of oi in a period of time; at(oi,tj), the operation type (query or update) of oi over a tj ∈ Toi; and v(oi,tj), the volume of instances of tj accessed by oi. This volume is given for each tj ∈ Toi, respecting the access order imposed by oi. v(oi,tj) is defined as:
v(oi,tj) = f(oi), if j = 1
v(oi,tj) = v(oi,tk) · ω, otherwise

where (i) tk is the type accessed by oi before tj, and (ii) ω is 1 if tj is an entity type or Avg(

GAF(C). For each generalization relationship (line 3), we consider two conversion strategies: (i) by the traditional rules that generate XML elements and parent-child relationships (line 8); and (ii) by references among XML elements (line 4). In short, our process verifies whether the application of the standard rules is
possible; otherwise, it converts by references. It is important to clarify that the reasoning for choosing the suitable rule is defined in [Schroeder and Mello, 2008]. A marked subclass s is an entity type that has already been processed and is represented in the content model of an XML element, either as an attribute or as a subelement of that element. This occurs when a subclass has been processed through one of its superclasses in a multiple-inheritance case. Thus, in order to avoid data redundancy, the association of s with its other superclasses should be represented by references. However, we check the GAF of the superclass to identify whether the accesses to s through that superclass are relevant for the workload. If its GAF is lower than MAF, we allow the representation by reference, because the association is irrelevant according to the workload. Otherwise, we check whether the GUF of s is lower than MUF, and if so we allow data redundancy through the application of the traditional rules. In this case, the parent-child association can improve query performance on this relevant entity type, compared with the negative effect generated by references. Finally, if both the GAF of the superclass and the GUF of s are relevant, we process the association by references, because the extra costs of managing data replication could affect system performance. In the example, F is first processed and marked by the superclass D.
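The three-way decision just described for a marked subclass s can be sketched as a small function. GAF, GUF, MAF and MUF are the workload measures and thresholds from Section 2; the function name, parameter names and the example values are hypothetical.

```python
# Hedged sketch of the marked-subclass decision described in the text.
# gaf_super: access frequency of the superclass; guf_sub: update frequency
# of the already-marked subclass s; maf/muf: the workload thresholds.

def convert_marked_subclass(gaf_super, guf_sub, maf, muf):
    """Choose how to associate a marked subclass s with a further superclass."""
    if gaf_super < maf:
        # Access through this superclass is irrelevant to the workload:
        # a reference is cheap enough and avoids redundancy.
        return "reference"
    if guf_sub < muf:
        # Access is relevant and updates on s are rare: tolerate redundancy
        # and nest s under the superclass via the traditional rules.
        return "redundant hierarchy"
    # Both access and update frequencies are relevant: keep a reference,
    # since managing replicated data could hurt performance; the algorithm
    # additionally indexes the superclass in this case.
    return "reference + index on superclass"

# Hypothetical figures: a superclass rarely accessed relative to MAF = 954.
print(convert_marked_subclass(gaf_super=500, guf_sub=100, maf=954, muf=954))
# -> reference
```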
When the generalization relationship defined by C is evaluated, the algorithm establishes a reference association between C and F, given that GAF(C) is lower than MAF.

5   if (GAF(super(gi)) > MAF) then
6       index super(gi)
7   else
8       generate a hierarchy to represent gi
9       mark each type in sub(gi)
10  end if
11  end for
12  Sort R in descending order of GAF(tr(ri)) with ri ∈ R
13  for all ri ∈ R do
14      ej := an entity type where ej ∈ agg(ri)
15      if (min(ej) == 0 or (marked(ej) and (GAF(ri) < MAF or GUF(ej) > MUF))) then
16          establish a reference on tr(ri) to each ek ∈ agg(ri)
17          if (GAF(ri) > MAF) then
18              index ej
19      else
20          generate a hierarchy to represent ri
21          mark ej
22      end if
23  end for
24  define the root element
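The relationship part of the listing (lines 12-23) can be sketched in Python. The line-15 predicate is only partly legible in our copy, so the version below (optional participation, or an already-marked entity whose access is workload-irrelevant or update-heavy, forces references) is our reading of it, not necessarily the paper's exact condition; all identifiers and the GUF(D) value are ours.

```python
# Sketch of algorithm lines 12-23 (relationship conversion). The line-15
# predicate is reconstructed from the surrounding prose and is an assumption.

def convert_relationships(R, gaf, guf, min_occ, maf, muf, marked):
    plan = []  # (element, action) pairs in conversion order
    # line 12: most frequently accessed relationships are converted first
    for r in sorted(R, key=lambda r: gaf[r["tr"]], reverse=True):
        ej = r["agg"][0]                     # line 14: a related entity type
        # line 15 (reconstructed): references instead of a hierarchy
        if min_occ[(r["tr"], ej)] == 0 or (
                ej in marked and (gaf[r["tr"]] < maf or guf[ej] > muf)):
            plan.append((r["tr"], "references"))      # line 16
            if gaf[r["tr"]] > maf:                    # line 17
                plan.append((ej, "index"))            # line 18
        else:
            plan.append((r["tr"], "hierarchy"))       # line 20
            marked.add(ej)                            # line 21
    return plan

# The worked example from the text: GAF(G)=20000 nests D under H first, while
# L (GAF=9500 > MAF=954, GUF(D) > MUF) gets references plus an index on D.
# GUF(D)=1000 is a hypothetical value consistent with GUF(D) > MUF.
marked = set()
R = [{"tr": "G", "agg": ["D", "H"]}, {"tr": "L", "agg": ["D", "M"]}]
plan = convert_relationships(
    R,
    gaf={"G": 20000, "L": 9500},
    guf={"D": 1000},
    min_occ={("G", "D"): 1, ("L", "D"): 1},
    maf=954, muf=954, marked=marked)
print(plan)   # [('G', 'hierarchy'), ('L', 'references'), ('D', 'index')]
```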
As demonstrated in [Schroeder et al., 2011], references generate extra costs in query processing because, in general, they require value-based comparisons among several XML elements. However, we have observed that the costs of such comparisons can be reduced by defining text indexes on the referenced elements. To this end, in a multiple-inheritance case where the GUF of a subclass and the GAF of the superclass are both relevant, the algorithm assigns an index to the element which represents the superclass. Thus, value joins on the reference relationship between the subclasses and the superclass are supported by indexing.
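The saving that the text attributes to indexing can be illustrated with a toy value index; a Python dict stands in for the database's text index, which it only loosely resembles.

```python
# Toy illustration of why indexing referenced elements helps: resolving a
# reference by scanning compares the value against every candidate element,
# while a value index answers in a single lookup. (A dict is a stand-in for
# the database's text index, not its actual implementation.)

elements = [{"id": "s%d" % i, "name": "super%d" % i} for i in range(1000)]

def resolve_by_scan(ref):
    """Value-based join without an index: O(N) comparisons."""
    return [e for e in elements if e["id"] == ref]

index = {e["id"]: e for e in elements}   # built once over the referenced ids

def resolve_by_index(ref):
    """The same join through the index: a single lookup."""
    return index.get(ref)

assert resolve_by_scan("s42")[0] is resolve_by_index("s42")
```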
The rules for converting relationship types also follow the logical design of traditional data models. We adapt two common rules for translating relationship types into XML structures: (i) Relationship Modeled by an Entity: only one XML element is generated to represent the relationship type and its related entities; and (ii) Relationship Modeled by a Hierarchy: XML elements are created to represent each participating entity type of the relationship. The first rule is suitable for representing 1:1 relationships, where one of the entity types is represented by attributes in the XML element created to represent the other one. In the second rule, the elements are connected by a parent-child relationship, with one of the entity types placed at the top of the XML hierarchy. We sort the set of relationships so that the conversion starts with the types that have the highest GAF according to the workload (line 12). For each type, the algorithm decides which rule can be applied to each relationship, such as generating a hierarchy to represent it (line 20). As for generalization types, there are a few cases regarding relationships that require the use of references (line 16). In order to avoid data redundancy, relationships with N:M cardinality cannot be represented by a parent-child relationship. Besides, redundancy-free hierarchies cannot be generated for a relationship ri where one of the entities is marked. This occurs when an entity ej related by ri has already been processed and is represented in the sub-hierarchy of an element which is not related by ri. In this case, references must be defined between ej and the other elements of ri. To exemplify a relationship conversion, consider the conceptual schema presented in Figure 1 and the operations of Table 1. We can verify the dependency of D on the entities H and M, given the participation of D in the relationships G and L.
Such dependency enables D to be represented as a subelement of H or M, according to the second traditional rule. Checking Table 2, we verify that the GAF values for the types G and L are 20000 and 9500, respectively. Thus, the algorithm converts G first, nesting D under H (line 20), because the higher GAF of G indicates that D is more frequently accessed through H than through M. On the conversion of the relationship type L, we could also nest D as a subelement of M, generating data redundancy. In this case, the process preserves D as a subelement of H and evaluates whether data redundancy can be allowed by nesting D under M as well. First, we verify whether L is relevant for the workload by evaluating the GAF of L against MAF (line 15). Given that GAF(L) = 9500 is higher than MAF = 954, we identify that this relationship is relevant for the application. The second verification point is related to the update frequency of D. Assuming that MUF = MAF, the relationship L must be represented by references to avoid data redundancy on D, given that GUF(D) > MUF. However, GAF(L) > MAF and, thus, an index is defined for D in order to reduce the extra costs of value-based comparisons between L and D. The relationship I has N:M cardinality and GAF(I) > MAF. As GUF(H)