Automating Relational Database Schema Design

HKU CS Tech Report TR-2013-02

Automating Relational Database Schema Design for Very Large Semantic Datasets
Thomas Y. Lee, David W. Cheung, Jimmy Chiu, S.D. Lee, Hailey Zhu, Patrick Yee, Wenjun Yuan
Department of Computer Science, The University of Hong Kong, Pokfulam, Hong Kong

Abstract Many semantic datasets or RDF datasets are very large but have no pre-defined data structures. Triple stores are commonly used as RDF databases yet they cannot achieve good query performance for large datasets owing to excessive self-joins. Recent research work proposed to store RDF data in column-based databases. Yet, some study has shown that such an approach is not scalable to the number of predicates. The third common approach is to organize an RDF data set in different tables in a relational database. Multiple “correlated” predicates are maintained in the same table called property table so that table-joins are not needed for queries that involve only the predicates within the table. The main challenge for the property table approach is that it is infeasible to manually design good schemas for the property tables of a very large RDF dataset. We propose a novel data-mining technique called Attribute Clustering by Table Load (ACTL) that clusters a given set of attributes into correlated groups, so as to automatically generate the property table schemas. While ACTL is an NP-complete problem, we propose an agglomerative clustering algorithm with several effective pruning techniques to approximate the optimal solution. Experiments show that our algorithm can efficiently mine huge datasets (e.g., Wikipedia Infobox data) to generate good property table schemas, with which queries generally run faster than with triple stores and column-based databases. Keywords: RDF database schema design, attribute clustering, Wikipedia Infobox, semantic web

1. Introduction In conventional information systems, data are usually very structured and can be organized neatly in pre-designed tables. Simply speaking, a database consists of a set of transactions. A transaction has a set of attributes and an attribute can be associated with a value. Generally, transactions that represent the same kind of objects or concepts are regarded as in the same class. For example, teacher and course data are two different classes of transactions. Transactions in the same class tend to share the same set of attributes, e.g., code and teacher attributes for course transactions. Considering the attributes for the same transaction class to be "correlated," a database designer seeks to group them in

the same table. Since a query typically involves the attributes in only one or a few classes, storing correlated attributes in the same table can often avoid excessive table-joins. To design a good database schema, the designer must have a priori domain knowledge on the structure of the data and the pattern of the queries. However, many semantic web datasets have neither well-defined data structures nor well-known query patterns. Some datasets are so huge that it is not feasible to manually design a good database schema by analyzing the data patterns. This makes conventional database design methodologies not applicable in these cases. Semantic web data are commonly represented by RDF[1] triplets, each comprising a subject, a predicate, and an object, i.e., (s p o). Although storing RDF data in a single 3-column table, a.k.a. triple store, is straightforward and flexible, this structure is not optimized for query processing. A typical query involving multiple predicates requires expensive self-joins of the triplet table. This makes query

processing on a huge RDF dataset slow. To address this performance problem, the vertical, or column-based, approach was introduced.[2] This approach splits the large triplet table into many smaller 2-column tables, each storing the subjects and objects only for a single predicate. This way, self-joins of a large 3-column table are translated into table-joins between multiple small 2-column tables. However, some study[3] has shown that a column-based database sometimes performs even slower than a triple store when the number of predicates is large. To avoid expensive table-joins, property tables can be used.[4] In this approach, a database is organized as different property tables, each storing the objects for a distinct group of correlated predicates. Suppose p1, p2, p3 are correlated predicates maintained in property table C. C has four columns, one for the subjects, and the others for the objects of p1, p2, p3. If a query involves only the predicates among p1, p2, p3, no table-join is needed. To store the triplets (s p1 o1) and (s p3 o3), a row is created in C where the subject cell stores s, the p1 and p3 cells store o1 and o3 respectively, while the p2 cell is null. In general, subjects semantically in the same class (e.g., course) often share the same predicates (e.g., course code and teacher). However, not many RDF datasets contain the class information. Therefore, it is crucial to have an effective automated schema design technique in order to make the property table approach useful for huge loosely structured RDF datasets.

1.1. Our Contributions In this paper, we propose a new clustering problem called Attribute Clustering by Table Load (ACTL) to automate the schema design of property tables. Intuitively, given a set of predicates, which we call attributes in this paper, ACTL aims to cluster them into disjoint clusters of attributes; each attribute cluster is used to create a property table. If the attributes maintained in table C are highly correlated, then the subjects, which we call transactions in this paper, stored in the table should share most of the attributes maintained in C. Hence, most of the cells in the table should be non-null. Conversely, when a large portion of the cells in C are null, we know that the correlation of some attributes is low. Therefore, C should be further split into sub-tables so that each maintains only the attributes with a sufficient degree of correlation. We use the proportion of non-null cells in a table, which we call the load factor, to measure the degree

of correlation between the predicates. ACTL seeks to group correlated attributes together into as few tables as possible while keeping the load factor of every table high enough. On the one hand, the fewer tables there are, the "wider" the tables become, and the less likely table-joins are needed. On the other hand, the higher the load factors are, the higher the attribute correlation within the tables, and the higher the chance that queries can be answered with fewer joins. In addition, a higher load factor implies higher storage efficiency for a table. Recognizing that the ACTL problem is NP-complete, we have developed an agglomerative clustering algorithm with pruning to approximate the optimal solution. We conducted experiments with huge real-life datasets, Wikipedia Infobox data[5] and Barton Libraries data[6]. Experimental results have demonstrated the following. Firstly, our algorithm can efficiently generate "good" schemas for these datasets. Secondly, the performance of running some common queries on the Wikipedia dataset using the property tables automatically designed by our technique is generally higher than that using the triple store and the vertical database. 1.2. Organization of This Paper This paper is organized as follows. Sect. 1 introduces the motivation of our research. Sect. 2 compares four popular RDF storage schemes and reviews the related work. Sect. 3 formulates the ACTL problem. Sect. 4 presents a basic version of our agglomerative clustering algorithm and shows that this version of the algorithm has high time and space complexity. To cope with the high complexity, we propose three pruning techniques to enhance the performance of the algorithm in Sect. 5. Sect. 6 discusses the concept of attribute connectivity, which is used to prevent uncorrelated attributes from being put in the same cluster. Sect. 7 introduces two metrics to measure the "fitness" of the schema of property tables against their actual storage. Sect. 8 compares our approach to other related approaches. Sect. 9 analyzes the results of three experiments. Lastly, Sect. 10 sums up our work.

2. Preliminaries
An RDF file is essentially a collection of triplets. Each triplet is in the form of "subject predicate object ."[7] The subject is a resource represented by its URI. The predicate defines a property of the subject and is represented by a URI. The object is the value of the predicate, which can be a resource or a literal (constant value). For example, the triple ⟨Sam⟩ ⟨title⟩ "Professor" . states that the resource ⟨Sam⟩ has the value "Professor" for the property ⟨title⟩.
Fig. 1 shows an RDF file which serves as the motivating example in this paper. For simplicity, the full URIs of resources and predicates are abbreviated and angle-bracketed, while literals are double-quoted. In the following, we review four mainstream RDF storage schemes, namely triple store, horizontal database, vertical database, and property tables, and study their pros and cons.
2.1. Triple Store
The triple store approach[8] stores RDF triplets in a 3-column table. Fig. 1 can be considered as a triple store. A variant[9] of this scheme uses a symbol table to represent each different resource, predicate, or literal by a unique system ID. Despite the flexibility of this scheme, it is commonly recognized to be slow in processing queries[2] for large datasets. This is because a query would likely require many self-joins of a huge triplet table. For instance, the SPARQL[10] query shown in Fig. 2 is translated into the SQL statement for the triple store shown in Fig. 3, which requires two self-joins.

subject   predicate      object
⟨Tom⟩     ⟨degree⟩       "PhD"
⟨May⟩     ⟨degree⟩       "MPhil"
⟨Roy⟩     ⟨degree⟩       "BSc"
⟨May⟩     ⟨enrolls⟩      ⟨Db⟩
⟨Roy⟩     ⟨enrolls⟩      ⟨Db⟩
⟨Roy⟩     ⟨enrolls⟩      ⟨Web⟩
⟨Sam⟩     ⟨title⟩        "Professor"
⟨Tom⟩     ⟨title⟩        "Instructor"
⟨Kat⟩     ⟨title⟩        "Professor"
⟨Tom⟩     ⟨supervisor⟩   ⟨Sam⟩
⟨May⟩     ⟨supervisor⟩   ⟨Sam⟩
⟨Sam⟩     ⟨interest⟩     "Datamining"
⟨Sam⟩     ⟨interest⟩     "Database"
⟨Kat⟩     ⟨interest⟩     "Security"
⟨Db⟩      ⟨teacher⟩      ⟨Sam⟩
⟨Net⟩     ⟨teacher⟩      ⟨Sam⟩
⟨Web⟩     ⟨teacher⟩      ⟨Tom⟩
⟨Db⟩      ⟨code⟩         "C123"
⟨Net⟩     ⟨code⟩         "C246"
⟨Web⟩     ⟨code⟩         "C135"

Figure 1: Motivating RDF example (triples)

SELECT ?s
WHERE { ?s ⟨degree⟩     "MPhil" .
        ?s ⟨enrolls⟩    ⟨Db⟩ .
        ?s ⟨supervisor⟩ ⟨Sam⟩ . }

Figure 2: SPARQL query example involving three predicates

SELECT a.subject
FROM triples a, triples b, triples c
WHERE a.subject = b.subject AND b.subject = c.subject
  AND a.predicate = "⟨degree⟩"     AND a.object = "MPhil"
  AND b.predicate = "⟨enrolls⟩"    AND b.object = "⟨Db⟩"
  AND c.predicate = "⟨supervisor⟩" AND c.object = "⟨Sam⟩";

Figure 3: SQL statement translated from Fig. 2 for triple store

2.2. Horizontal Database
The horizontal database approach[11] uses a single "universal table" where each row represents a subject and each column stores the objects of a distinct predicate. Note that a cell may store multiple values because of the multi-valued nature of RDF properties. Fig. 4 shows the content of the horizontal database for the motivating example. This scheme can save table-joins for a query involving multiple predicates. For example, the SPARQL query in Fig. 2 is translated into the SQL query in Fig. 5 for the horizontal database. However, unless the RDF dataset has a fixed structure and a small number of predicates, the row-based horizontal database is not practical for the following reasons. First, the table is very sparse; for instance, in Fig. 4, 38 out of 56 property cells are null. Second, a horizontal database is not scalable to the number of predicates. For example, the Wikipedia dataset needs a table with over 50,000 columns, which is impractical for implementation. Third, this scheme does not handle data of dynamic structures well. When an RDF statement with a new predicate is added, the table has to be expanded, which requires costly data restructuring operations.

subject | degree | enrolls      | title      | supervisor | interest             | teacher | code
⟨Tom⟩   | PhD    |              | Instructor | ⟨Sam⟩      |                      |         |
⟨May⟩   | MPhil  | ⟨Db⟩         |            | ⟨Sam⟩      |                      |         |
⟨Roy⟩   | BSc    | ⟨Db⟩, ⟨Web⟩  |            |            |                      |         |
⟨Sam⟩   |        |              | Professor  |            | Database, Datamining |         |
⟨Kat⟩   |        |              | Professor  |            | Security             |         |
⟨Db⟩    |        |              |            |            |                      | ⟨Sam⟩   | C123
⟨Net⟩   |        |              |            |            |                      | ⟨Sam⟩   | C246
⟨Web⟩   |        |              |            |            |                      | ⟨Tom⟩   | C135

Figure 4: Horizontal database for Fig. 1

SELECT subject FROM horizontal_table
WHERE degree = "MPhil" AND enrolls = "Db" AND supervisor = "Sam";

Figure 5: SQL statement for horizontal database

operations. 2.3. Vertical Database Abadi et al.[2] proposed to store all objects of each different predicate in a separate vertical table. In this vertical partitioning approach, Fig. 1 requires 7 tables as shown in Fig. 6.1 The advantages of this scheme are as follows. (1) It is simple to implement. (2) It does not waste space on null values (but it does waste space replicating the subject columns in all tables). (3) It does not store multiple values in one cell while not every RDBMS handles multi-valued fields. (4) Some column-based databases (e.g., C-Store[12] and MonetDB[13]), which are optimized for column-based storage, can be used. degree table subject property hTomi PhD hMayi MPhil hRoyi BSc code table subject property hDbi C123 hNeti C246 hWebi C135 enrolls table subject property hMayi hDbi hRoyi hDbi hRoyi hWebi supervisor table subject property hTomi hSami hMayi hSami

degree table          enrolls table         title table
subject  property     subject  property     subject  property
⟨Tom⟩    PhD          ⟨May⟩    ⟨Db⟩         ⟨Tom⟩    Instructor
⟨May⟩    MPhil        ⟨Roy⟩    ⟨Db⟩         ⟨Sam⟩    Professor
⟨Roy⟩    BSc          ⟨Roy⟩    ⟨Web⟩        ⟨Kat⟩    Professor

supervisor table      interest table        teacher table         code table
subject  property     subject  property     subject  property     subject  property
⟨Tom⟩    ⟨Sam⟩        ⟨Sam⟩    Database     ⟨Db⟩     ⟨Sam⟩        ⟨Db⟩     C123
⟨May⟩    ⟨Sam⟩        ⟨Sam⟩    Datamining   ⟨Net⟩    ⟨Sam⟩        ⟨Net⟩    C246
                      ⟨Kat⟩    Security     ⟨Web⟩    ⟨Tom⟩        ⟨Web⟩    C135

Figure 6: Vertical database for Fig. 1

However, this scheme also has some disadvantages. (1) To process queries that involve multiple predicates, table-joins are always needed. For example, the SPARQL query in Fig. 2 is translated into the SQL query in Fig. 7 on the vertical database, which requires 2 joins. This scheme neglects any data and query patterns, which may suggest that organizing some related predicates in one table can save table-joins. For example, resources ⟨Db⟩, ⟨Net⟩, and ⟨Web⟩ about courses all have predicates ⟨teacher⟩ and ⟨code⟩, so they can be stored together in the same table. (2) Column-based databases are not as mature and popular as row-based RDBMSs. (3) Sidirourgos et al.[3] pointed out the scalability issue of this scheme. They found that the triple store approach outperformed the vertical partitioning approach on a row-based RDBMS in answering queries. Although the vertical database performed much faster on a column store than on an RDBMS, the query performance decreased rapidly as the number of predicates increased. The triple store could still outperform the vertical database on a column store when the number of predicates was large enough, e.g., around 200 predicates.

SELECT a.subject
FROM degree a, enrolls b, supervisor c
WHERE a.subject = b.subject AND b.subject = c.subject
  AND a.value = "MPhil" AND b.value = "Db" AND c.value = "Sam";

Figure 7: SQL statement for vertical database


2.4. Property Tables
The property table approach proposed by the Jena project[4] strikes a balance between the horizontal approach and the vertical approach. Correlated predicates are grouped into a property table. Each subject with any of the predicates of this property table has one row in the table. As shown in Fig. 8, 7 predicates are clustered into 3 property tables: student, staff, and course.² In the course table, all the resources ⟨Db⟩, ⟨Net⟩, ⟨Web⟩ have both ⟨teacher⟩ and ⟨code⟩ predicates. However, in the student table, 2 out of 3 subjects do not have all the predicates maintained by the table. Also note that the subject ⟨Tom⟩ appears in both the student and staff tables. Obviously, the proportion of null cells (only 3 nulls out of 21 predicate cells) is largely reduced when compared to the horizontal database. In contrast to the vertical database, no table-join is needed when a query only involves the predicates within the same table. For example, the SPARQL query in Fig. 2 is translated into the SQL statement in Fig. 9, which needs no table-join. However, if a query involves the predicates from different property tables, table-joins cannot be avoided. In general, some table-joins can often be saved compared with the vertical database. Oracle also provides a property-table-like facility called the subject-property matrix table.[14]

² Jena actually uses a resource table and a literal table to reference the subjects and objects by system IDs in property tables.

staff table
subject | title      | interest
⟨Tom⟩   | Instructor |
⟨Sam⟩   | Professor  | Database, Datamining
⟨Kat⟩   | Professor  | Security

student table
subject | degree | enrolls     | supervisor
⟨Tom⟩   | PhD    |             | ⟨Sam⟩
⟨May⟩   | MPhil  | ⟨Db⟩        | ⟨Sam⟩
⟨Roy⟩   | BSc    | ⟨Db⟩, ⟨Web⟩ |

course table
subject | teacher | code
⟨Db⟩    | ⟨Sam⟩   | C123
⟨Net⟩   | ⟨Sam⟩   | C246
⟨Web⟩   | ⟨Tom⟩   | C135

Figure 8: Property tables for motivating example

SELECT subject FROM student
WHERE degree = "MPhil" AND enrolls = "Db" AND supervisor = "Sam";

Figure 9: SQL statement for property tables

2.5. Problems of Property Table Approach
Despite the above advantages of the property table scheme, it poses two technical challenges as described below.
Multi-valued properties: A single predicate representing a one-to-many (or many-to-many) relationship between a subject and multiple objects (e.g., ⟨interest⟩ and ⟨enrolls⟩) can result in multiple objects stored in a single cell. Abadi et al. argued that this is a drawback of the property table approach compared to the vertical approach because multi-valued fields were poorly supported by RDBMSs.[2] Jena avoided this problem by not clustering multi-valued predicates in any property table but by storing them in a fallback triple store instead.[15] Another solution is to use the proprietary support of multi-valued fields by some RDBMS systems, e.g., PostgreSQL and Oracle.
Lack of good predicate clustering algorithms: The benefit of fewer joins can only be capitalized on when the predicates are well clustered so that the transactions for a subject are contained within only one or a few property tables. This is the major drawback pointed out by Abadi.[2] On the one hand, the schemas of the property tables can be designed manually. However, this requires the database designer to have a priori domain knowledge of the data and query patterns, which is often not feasible. For instance, Wikipedia adopts a liberal editing, collaborative authoring model where there are no governing hard schemas, but only soft guidelines[16]; therefore, the data pattern is hardly comprehensible by humans. On the other hand, the predicates can also be "somehow" clustered by a suitable data mining algorithm based on the data or query patterns to generate the property table schemas.
2.5.1. Data Pattern Analysis for Schema Design
Our ACTL approach aims to analyze the data patterns in order to automate schema design. Two related approaches can be applied for the same purpose. They are the classical frequent pattern mining approach suggested by Ding and Wilkinson[15, 17] and the HoVer approach proposed by Cui et al.[18] We compare these two approaches with ACTL in Sect. 8.

2.5.2. Query Pattern Analysis for Schema Design
Another approach to aid schema design is to analyze the query patterns. Ding and Wilkinson proposed to mine frequent predicate patterns from query logs to form candidate property tables.[15] However, this technique suffers from the same problems discussed in Sect. 8 about the usefulness of frequent pattern mining results. Many research works[19-22] proposed different techniques to evaluate the query costs of database designs against a given query workload for database optimizations, e.g., index selection and table partitioning. However, in many RDF applications, query patterns are unknown and dynamic, which makes these techniques not applicable.

3. Problem Definition
This section models the property table design problem as the following attribute clustering problem. Given a set of attributes (predicates) and a set of transactions (subjects) associated with the attribute set, we intend to partition the attributes into clusters, each comprising the columns of one table, so that the proportion of non-null cells in each table is not less than a given threshold while the number of tables (clusters) is minimized. Definition 1 formally defines the terminology used in this paper.
Definition 1. A database D = (A, T) consists of a set of attributes A and a set of transactions T. Each transaction t ∈ T is associated with a group

of attributes α(t) ⊆ A. A database schema Π for D is a partition of A, i.e., (1) ⋃_{C∈Π} C = A, (2) C ∩ C′ = ∅ for any distinct C, C′ ∈ Π, and (3) C ≠ ∅ for any C ∈ Π. Each element C in Π is called an attribute cluster, for which the following properties are defined:

The transaction group of C (a subset of T):

    τ(C) = {t ∈ T : α(t) ∩ C ≠ ∅}                         (1)

The table load of C (an integer):

    LD(C) = Σ_{t∈τ(C)} |α(t) ∩ C|                          (2)

The table load factor of C (a rational number):

    LF(C) = LD(C) / (|C| × |τ(C)|)                         (3)

In our example, 8 subjects (i.e., ⟨Tom⟩, ⟨May⟩, ⟨Roy⟩, ⟨Sam⟩, ⟨Kat⟩, ⟨Db⟩, ⟨Net⟩, and ⟨Web⟩) form the set of transactions while 7 predicates (i.e., ⟨degree⟩, ⟨enrolls⟩, ⟨title⟩, ⟨supervisor⟩, ⟨interest⟩, ⟨teacher⟩, and ⟨code⟩) form the set of attributes. Fig. 4 and Fig. 8 organize these attributes in two different database schemas, where the attributes are organized in a single table (cluster) and in 3 tables respectively. In Fig. 8, the student cluster contains 3 attributes: ⟨degree⟩, ⟨enrolls⟩, and ⟨supervisor⟩. The transaction group of student is { ⟨Tom⟩, ⟨May⟩, ⟨Roy⟩ }. The table load of an attribute cluster is the number of non-null attribute cells in a table, so the table load of student is 7. The table load factor of an attribute cluster is its table load divided by the total number of attribute cells, so the table load factor of student is 7/9 = 0.778. Definition 2 formally defines the Attribute Clustering by Table Load (ACTL) problem. Essentially, it is how to partition a given set of attributes into the fewest clusters such that each cluster has a table load factor not less than a given threshold θ. We have proved that ACTL is NP-complete (Theorem 1), as shown in Appendix A. Fig. 8 shows an optimal solution to the example for θ = 2/3.
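To make the definitions concrete, the following Python sketch (an illustration, not part of the paper) computes τ(C), LD(C), and LF(C) on the Fig. 1 data; the dictionary alpha transcribes each subject's predicate set.

# Toy data transcribed from Fig. 1: each transaction (subject) maps to its attribute (predicate) set.
alpha = {
    "Tom": {"degree", "title", "supervisor"},
    "May": {"degree", "enrolls", "supervisor"},
    "Roy": {"degree", "enrolls"},
    "Sam": {"title", "interest"},
    "Kat": {"title", "interest"},
    "Db":  {"teacher", "code"},
    "Net": {"teacher", "code"},
    "Web": {"teacher", "code"},
}

def transaction_group(cluster):
    """tau(C): transactions having at least one attribute of the cluster."""
    return {t for t, attrs in alpha.items() if attrs & cluster}

def table_load(cluster):
    """LD(C): number of non-null cells in the cluster's property table."""
    return sum(len(alpha[t] & cluster) for t in transaction_group(cluster))

def load_factor(cluster):
    """LF(C) = LD(C) / (|C| * |tau(C)|)."""
    return table_load(cluster) / (len(cluster) * len(transaction_group(cluster)))

student = {"degree", "enrolls", "supervisor"}
print(table_load(student), round(load_factor(student), 3))   # 7 0.778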

Definition 2. Given (1) a database (A, T), and (2) a load factor threshold θ, where 0 ≤ θ ≤ 1, find a partition (schema) Π of A such that:
1. for each cluster C ∈ Π, LF(C) ≥ θ, and
2. for any possible partition Π′ of A where LF(C′) ≥ θ for any C′ ∈ Π′, |Π| ≤ |Π′|.

Theorem 1. The ACTL problem is NP-complete.

4. Agglomerative Clustering
This section introduces a basic agglomerative algorithm to approximate the optimal solution to the ACTL problem, and shows that this algorithm has high time and space complexity. Algorithm 1 shows the agglomerative ACTL algorithm. At first, each attribute forms a distinct attribute cluster. A priority queue is used to maintain all unordered pairs {C, C′} of clusters C, C′ ∈ Π where (1) the combined load factor LF(C ∪ C′) (Definition 3) is not below the given load factor threshold θ, and (2) the cluster pair with the highest combined load factor is placed at the top of the queue. Then, the cluster pair from the queue with the highest combined load factor is taken for merging into a new cluster. All existing cluster pairs containing either of the old clusters that have just been merged are removed from the queue. The new cluster is then used to generate new cluster pairs with the existing clusters. Among these new cluster pairs, those with a combined load factor greater than or equal to θ are added into the queue. Iteratively, the next cluster pair is taken from the top of the queue to repeat the above steps until the queue is empty.

Definition 3. The combined load factor of two clusters C and C′, where C ∩ C′ = ∅, is the load factor of the cluster C″ = C ∪ C′:

    LF(C″) = (LD(C) + LD(C′)) / (|τ(C) ∪ τ(C′)| × (|C| + |C′|))

Algorithm 1. BasicACTL
Input: database (A, T), where |A| = n, |T| = m
Input: load factor threshold 0 ≤ θ ≤ 1
Output: partition Π of A such that LF(C) ≥ θ for any C ∈ Π
1: initialize Π = {{a1}, {a2}, . . . , {an}}
2: create an empty priority queue Q, where each element is an unordered pair of clusters {C, C′} and the pair with the largest LF(C ∪ C′) is placed at the top
3: for all {C, C′} where C, C′ ∈ Π do
4:   compute LF(C ∪ C′) /* O(m) */
5:   if LF(C ∪ C′) ≥ θ then
6:     add {C, C′} to Q /* O(log |Q|) */
7:   end if
8: end for /* O(n²(m + log n)) */
9: while Q is not empty do /* let |Π| = p (i.e., O(n)) and |Q| = q (i.e., O(n²)) */
10:   pop {C, C′} from Q /* O(1) */
11:   C″ ← C ∪ C′ /* O(n) */
12:   delete C, C′ from Π
13:   delete all pairs with C or C′ from Q /* O(p log q) */
14:   for all Ĉ ∈ Π do
15:     if LF(C″ ∪ Ĉ) ≥ θ then /* O(m) */
16:       add {C″, Ĉ} to Q /* O(log q) */
17:     end if
18:   end for /* O(p(m + log q)) */
19:   add C″ to Π
20: end while /* O(n²m + n² log n) */
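Below is a compact Python sketch (an illustration, not the paper's Java implementation) of the same greedy procedure; for clarity it recomputes combined load factors by brute force instead of maintaining Algorithm 1's priority queue, but it merges pairs in the same highest-combined-load-factor-first order.

from itertools import combinations

def actl_basic(alpha, theta):
    """Greedy agglomerative ACTL: repeatedly merge the pair of clusters with the
    highest combined load factor, while that factor is still >= theta.
    alpha maps each transaction (subject) to its set of attributes (predicates)."""
    tau = {}                                    # transaction group of each single attribute
    for t, attrs in alpha.items():
        for a in attrs:
            tau.setdefault(a, set()).add(t)
    clusters = [frozenset([a]) for a in tau]    # start with one cluster per attribute

    def load_factor(cluster):
        group = set().union(*(tau[a] for a in cluster))
        load = sum(len(alpha[t] & cluster) for t in group)
        return load / (len(cluster) * len(group))

    while len(clusters) > 1:
        lf, c1, c2 = max(((load_factor(a | b), a, b)
                          for a, b in combinations(clusters, 2)), key=lambda x: x[0])
        if lf < theta:
            break
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 | c2)
    return clusters

# Using the `alpha` dictionary from the previous sketch and theta = 2/3, this returns
# three clusters: {teacher, code}, {title, interest}, and {degree, enrolls, supervisor}.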

Here, we run Algorithm 1 on Fig. 1 with θ = 2/3. Initially, there are 7 clusters, each containing one attribute and having a load factor of 1. The clusters {⟨teacher⟩} and {⟨code⟩} are first merged into the new cluster { ⟨teacher⟩, ⟨code⟩ } because their combined load factor is 1. Then, either { ⟨title⟩, ⟨interest⟩ }, { ⟨degree⟩, ⟨enrolls⟩ }, or { ⟨degree⟩, ⟨supervisor⟩ } is created because they all have a combined load factor of 5/6. Suppose { ⟨title⟩, ⟨interest⟩ } and { ⟨degree⟩, ⟨enrolls⟩ } are created. Finally, { ⟨degree⟩, ⟨enrolls⟩, ⟨supervisor⟩ } is created by merging { ⟨degree⟩, ⟨enrolls⟩ } and { ⟨supervisor⟩ } while the combined load factor 7/9 is still not below θ = 2/3. After that, no bigger cluster can be formed, as merging any two clusters would make the combined load factor drop below θ.

4.1. Core Data Structure
To facilitate the computation of the table load factor, ACTL is conducted using an attribute-oriented data structure rather than a transaction-oriented one. The transactions, attributes, and clusters are assigned unique system IDs for all manipulations in clustering. For each cluster C, (1) a sorted list of the transaction IDs for the transaction group, and (2) the integer table load are maintained. Initially, since each cluster contains one distinct attribute, the IDs of all transactions containing that attribute are sorted into the transaction list for that cluster, and the table load for each cluster is the size of the list. (Note that the table load factors for all initial clusters are 1.) When two clusters C and C′ are combined into a new cluster C″ = C ∪ C′, the transaction group of C″ is the union of the transaction groups of C and C′, i.e., τ(C ∪ C′) = τ(C) ∪ τ(C′). Since the transaction lists for C and C′ are sorted, they can be merged into the sorted transaction list for C″ in O(k) time, where k = max{|τ(C)|, |τ(C′)|}. The table load for C″ is the sum of the table loads for C and C′, i.e., LD(C ∪ C′) = LD(C) + LD(C′). The table load factor for C″ can then be easily computed by Eq. 3. Therefore, the time complexity to compute the combined load factor for two clusters from their transaction groups and table loads is O(m), where m is the number of transactions.
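A minimal sketch of this bookkeeping (illustrative only; the tuple representation of a cluster as attribute count, sorted transaction-ID list, and table load is assumed here, not taken from the paper):

def merge_sorted(xs, ys):
    """Union of two sorted, duplicate-free transaction-ID lists, kept sorted (tau(C) ∪ tau(C'))."""
    out, i, j = [], 0, 0
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            out.append(xs[i]); i += 1; j += 1
        elif xs[i] < ys[j]:
            out.append(xs[i]); i += 1
        else:
            out.append(ys[j]); j += 1
    out.extend(xs[i:]); out.extend(ys[j:])
    return out

def combine(cluster1, cluster2):
    """Each cluster is (attribute count, sorted transaction IDs, table load).
    Returns the merged cluster and its load factor, per Definition 3."""
    n1, t1, ld1 = cluster1
    n2, t2, ld2 = cluster2
    union = merge_sorted(t1, t2)
    merged = (n1 + n2, union, ld1 + ld2)
    return merged, (ld1 + ld2) / ((n1 + n2) * len(union))

# {degree} and {enrolls} from Fig. 1, with Tom, May, Roy numbered 1, 2, 3:
print(combine((1, [1, 2, 3], 3), (1, [2, 3], 2))[1])   # 0.8333... = 5/6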

4.2. Complexity Analysis
The time complexity of the non-trivial steps in Algorithm 1 is specified in the comments. The total time complexity of the algorithm is O(n²m + n² log n), where n is the number of attributes and m is the number of transactions. Clearly, the time complexity is more sensitive to the number of attributes than to the number of transactions. The space complexity of this algorithm is determined by (1) the storage for the transaction lists and attributes for each cluster, which is nm, and (2) the priority queue size, which is n(n − 1)/2 in the worst case. The total space complexity is O(nm + n²).

5. Pruning Techniques
We can see that Algorithm 1 may not be time and space efficient enough to handle large sets of attributes and transactions. The Wikipedia dataset we used in our experiments has around 50,000 attributes and 480,000 transactions, and was too large for the above algorithm to cluster in an acceptable time period. Our implementation of this basic algorithm failed with an insufficient memory error after processing the dataset for 3 hours, as the priority queue had grown to exhaust the 12GB memory we used. In this section, we discuss several pruning techniques that significantly improve the time and space efficiency of the algorithm. Our implementation of the algorithm enhanced with these pruning techniques succeeded in clustering the dataset in ∼42 minutes.

5.1. Transaction Group Equality Test
In real-life datasets, there are many clusters of attributes which share identical transaction groups. In Fig. 1, the attributes ⟨teacher⟩ and ⟨code⟩ have the same transaction group: { ⟨Db⟩, ⟨Net⟩, ⟨Web⟩ }. Attributes with the same transaction group will surely be merged into the same cluster because their combined load factor is always 1. We can use an efficient transaction group equality test to merge every set of attributes with the same transaction group into one cluster at the very beginning

in order to reduce the total number of initial clusters. Since testing the equality of two sets has a non-trivial computation cost, we can use a simple hash function on transaction groups to efficiently find all attributes potentially having the same transaction group. The hash value of an attribute is computed as the sum of the IDs of all transactions in its transaction group modulo the maximum hash value: HASH(a) = (Σ_{t∈τ(a)} ID(t)) mod M. To further process each set of attributes with the same hash value, we divide them into subsets where all attributes in the same subset have the same number of transactions. The hash value and size comparisons on transaction groups can effectively filter out most of the false positives. Nevertheless, we still need to exercise the actual equality test on the transaction groups for all attributes in each subset. Since the transaction groups are represented by sorted ID lists, the time to confirm whether two lists are equal is linear in the list size. Algorithm 2 performs the above process.

Algorithm 2. ClusterAttrsWithEqualTrxGrps
Input: set of attributes A
Output: list L of attribute sets where all attributes in each set share the same transaction group
1: initialize L to be an empty list
2: create a hash table H of attribute sets, where H(h) = {a : HASH(a) = h}
3: for all attributes a ∈ A do /* compute H */
4:   add a to the attribute set H(HASH(a))
5: end for
6: for all attribute sets A′ in H do
7:   create a list L̄ of attribute sets
8:   for all attributes a′ ∈ A′ do
9:     done ← false
10:    for all attribute sets Ā ∈ L̄ do
11:      let ā be some attribute in Ā
12:      if |τ(ā)| = |τ(a′)| then /* try to avoid comparing τ(ā) = τ(a′) */
13:        if τ(ā) = τ(a′) then
14:          put a′ into Ā
15:          done ← true
16:          break the for-loop
17:        end if
18:      end if
19:    end for
20:    if done = false then
21:      create a new set A″ in L̄ and add a′ to A″
22:    end if
23:  end for
24:  append L̄ to L
25: end for
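A small Python sketch of the same idea (illustrative; M is an arbitrary hash modulus as in the text): attributes are first bucketed by (hash value, group size), and the exact list comparison is performed only within a bucket.

from collections import defaultdict

def group_equal_transaction_groups(tau, M=2**31 - 1):
    """Group attributes whose transaction groups are identical.
    tau: dict mapping attribute -> sorted list of transaction IDs."""
    buckets = defaultdict(list)                      # (hash value, group size) -> attributes
    for a, txns in tau.items():
        buckets[(sum(txns) % M, len(txns))].append(a)
    groups = []
    for candidates in buckets.values():              # exact check only within a bucket
        reps = []                                    # list of (transaction list, [attributes])
        for a in candidates:
            for txns, members in reps:
                if txns == tau[a]:
                    members.append(a)
                    break
            else:
                reps.append((tau[a], [a]))
        groups.extend(members for _, members in reps)
    return groups

print(group_equal_transaction_groups({"teacher": [5, 6, 7], "code": [5, 6, 7], "degree": [1, 2, 3]}))
# [['teacher', 'code'], ['degree']]  (order may vary)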

5.2. Maximum Combined Load Factor
In Lines 14-19 of Algorithm 1, when a new cluster C″ is created, we need to find all existing clusters C ∈ Π such that LF(C″ ∪ C) ≥ θ and put {C, C″} into the priority queue. It is very time-consuming if we compute LF(C″ ∪ C) for all clusters C ∈ Π, as each list-merge on τ(C″) ∪ τ(C) costs O(m) time. We have implemented a mechanism to prune all clusters C ∈ Π for which LF(C″ ∪ C) cannot reach θ, without computing τ(C″) ∪ τ(C).
5.2.1. Bounds on Transaction Group Size to Attain the Maximum Combined Load Factor
When we examine a newly created cluster C″, we know its member attributes and transaction group size. For the combined load factor of C″ and some existing cluster C (i.e., LF(C″ ∪ C)), we can derive an upper bound before LF(C) is known and before τ(C″) ∪ τ(C) is computed. Note that LF(C″ ∪ C) attains its maximum value when (1) C is fully loaded and (2) τ(C″) ∪ τ(C) is the smallest, which equals max{|τ(C″)|, |τ(C)|}.


    LF(C″ ∪ C) = (LD(C″) + LD(C)) / (|τ(C″) ∪ τ(C)| × (|C″| + |C|))                        (4)
               ≤ (LD(C″) + |τ(C)| × |C|) / (max{|τ(C″)|, |τ(C)|} × (|C″| + |C|))            (5)

Therefore, if we require LF(C″ ∪ C) ≥ θ, we must have:

    (LD(C″) + |τ(C)| × |C|) / (max{|τ(C″)|, |τ(C)|} × (|C″| + |C|)) ≥ θ                     (6)

Case 1: if 0 < |τ(C)| ≤ |τ(C″)| then

    (LD(C″) + |τ(C)| × |C|) / (|τ(C″)| × (|C″| + |C|)) ≥ θ                                  (7)
    |τ(C)| ≥ (|τ(C″)| × (|C″| + |C|) × θ − LD(C″)) / |C|                                    (8)

Since |τ(C)| is a positive integer, we have:

    |τ(C)| ≥ ⌈(|τ(C″)| × (|C″| + |C|) × θ − LD(C″)) / |C|⌉                                  (9)

Case 2: if |τ(C)| > |τ(C″)| then

    θ ≤ (LD(C″) + |τ(C)| × |C|) / (|τ(C)| × (|C″| + |C|))                                   (10)
    |τ(C)| × ((|C″| + |C|) × θ − |C|) ≤ LD(C″)                                              (11)

Case 2a: if |τ(C)| > |τ(C″)| and θ > |C| / (|C″| + |C|) then

    |τ(C)| ≤ LD(C″) / ((|C″| + |C|) × θ − |C|)                                              (12)

Case 2b: if |τ(C)| > |τ(C″)| and θ ≤ |C| / (|C″| + |C|), then Eq. 11 always holds since LD(C″) is positive. Therefore, we only need to consider Case 2a. Since |τ(C)| is a positive integer, we have:

    |τ(C)| ≤ ⌊LD(C″) / ((|C″| + |C|) × θ − |C|)⌋                                            (13)

For example, suppose the newly created cluster C″ has 4 attributes, a load factor of 0.9, and a transaction group of size 1,000, while θ = 0.8. Among all existing clusters with 5 attributes, we only need to consider those with transaction group size from ⌈(1000 × (4 + 5) × 0.8 − 1000 × 4 × 0.9) / 5⌉ = 720 to ⌊(1000 × 4 × 0.9) / ((4 + 5) × 0.8 − 5)⌋ = 1636 for merging with C″.
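The bounds of Eqs. 9 and 13 are straightforward to compute; the helper below (an illustration, not the paper's code, and slightly simplified relative to Lines 3-5 of Algorithm 3) reproduces the 720 and 1636 of the example above.

import math

def candidate_group_size_bounds(n_new, tau_new, ld_new, n_existing, theta):
    """Bounds (Eqs. 9 and 13) on |tau(C)| for an existing cluster C with n_existing
    attributes to possibly reach LF(C'' ∪ C) >= theta with the new cluster C''."""
    lower = math.ceil((tau_new * (n_new + n_existing) * theta - ld_new) / n_existing)
    denom = (n_new + n_existing) * theta - n_existing
    upper = math.floor(ld_new / denom) if denom > 0 else None   # Case 2b: no finite upper bound
    return max(lower, 1), upper

# The example from Sect. 5.2.1: C'' has 4 attributes, load factor 0.9, |tau(C'')| = 1000, theta = 0.8.
print(candidate_group_size_bounds(4, 1000, 1000 * 4 * 0.9, 5, 0.8))   # (720, 1636)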

5.2.2. Cluster Sorter Data Structure
We have designed a data structure called the cluster sorter to sort each cluster C using (1) its number of attributes |C| as the primary key, and (2) the size of its transaction group |τ(C)| as the secondary key. This two-level sorting is supported by the data structure shown in Fig. 5.2.2. The aim of this data structure is to quickly find, given integers n, m_lower, and m_upper, all clusters C with n attributes and a transaction group of size between m_lower and m_upper. The first-level sorting is designed to return a list of all clusters with a specified number of attributes in O(1) time. This is implemented as an array of pointers where each pointer points to a list of cluster data nodes. The n-th array entry points to the list of all clusters with n attributes. For example, the second list in Fig. 5.2.2 records all clusters with 2 attributes. The second-level sorting is designed to find all clusters whose transaction group sizes are within a specified range. It is supported by a list of data nodes for all clusters with a particular number of attributes. Each node stores the transaction group size and the ID of a different cluster. On the list, the nodes are sorted by the transaction group sizes of the corresponding clusters. To efficiently return a sub-list of clusters with a specified range of transaction group sizes, each list is implemented as a red-black tree, which can locate the sub-list in O(log n) time. For example, in the second list of the example, two clusters with IDs 37 and 34 are returned if the requested range is between 2 and 6 inclusive.

[Fig. 5.2.2: the cluster sorter, an array indexed by the number of attributes |C|, where each entry points to a list of (cluster ID, |τ(C)|) nodes sorted by transaction group size.]
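A toy Python stand-in for the cluster sorter (illustrative only; it uses a sorted list with binary search via bisect rather than a red-black tree, and the example values are made up):

import bisect
from collections import defaultdict

class ClusterSorter:
    """Clusters keyed by attribute count; each bucket kept sorted by transaction group size."""
    def __init__(self):
        self.buckets = defaultdict(list)          # |C| -> sorted list of (|tau(C)|, cluster id)

    def add(self, cluster_id, num_attrs, group_size):
        bisect.insort(self.buckets[num_attrs], (group_size, cluster_id))

    def remove(self, cluster_id, num_attrs, group_size):
        self.buckets[num_attrs].remove((group_size, cluster_id))

    def query(self, num_attrs, m_lower, m_upper):
        """All cluster IDs with `num_attrs` attributes and m_lower <= |tau(C)| <= m_upper."""
        bucket = self.buckets[num_attrs]
        lo = bisect.bisect_left(bucket, (m_lower, float("-inf")))
        hi = bisect.bisect_right(bucket, (m_upper, float("inf")))
        return [cid for _, cid in bucket[lo:hi]]

s = ClusterSorter()
s.add(37, 2, 2); s.add(34, 2, 2); s.add(76, 2, 8)
print(s.query(2, 2, 6))   # [34, 37]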

5.2.3. Algorithm to Find Candidate Clusters for Merging
Based on Sect. 5.2.1 and Sect. 5.2.2, we have developed Algorithm 3 to efficiently find all existing clusters C to merge with the newly created cluster C″ such that the combined load factor LF(C″ ∪ C) ≥ θ.

Algorithm 3. FindCandidateClusters
Input: cluster sorter S, where S(n, m_lower, m_upper) returns a list of existing clusters C ∈ Π with n attributes and transaction groups of size m, where m_lower ≤ m ≤ m_upper
Input: newly created cluster C″
Input: load factor threshold θ
Output: list L of existing clusters C where LF(C″ ∪ C) ≥ θ
1: initialize L to be an empty list
2: for all n ← 1 . . . maximum number of attributes of existing clusters in Π do
3:   m_lower ← ⌈min{ (|τ(C″)| × (|C″| + n) × θ − LD(C″)) / n , |τ(C″)| }⌉
4:   if n / (|C″| + n) < θ then
5:     m_upper ← ⌊max{ LD(C″) / ((|C″| + n) × θ − n) , |τ(C″)| + 1 }⌋
6:   else
7:     m_upper ← ∞
8:   end if
9:   append S(n, m_lower, m_upper) to L /* O(log n) */
10: end for
11: for all C ∈ L do
12:   if (LD(C″) + LD(C)) / (max{|τ(C″)|, |τ(C)|} × (|C″| + |C|)) < θ then
13:     remove C from L
14:   else
15:     compute τ(C″) ∪ τ(C) /* O(log m) */
16:     if (LD(C″) + LD(C)) / (|τ(C″) ∪ τ(C)| × (|C″| + |C|)) < θ then
17:       remove C from L
18:     end if
19:   end if
20: end for

5.3. Iterative Clustering
When we cluster a large number of attributes using Algorithm 1, we face the insufficient memory problem when too many cluster pairs are stored in the priority queue. For example, there are over 50,000 attributes in the Wikipedia dataset, which can produce up to 1.25 billion possible attribute pairs. Initially, if all possible pairs of clusters, each having one attribute, are candidates for merging, the priority queue would need tens of gigabytes of memory space. This could not only run out of memory but also significantly slow down the insertion of new cluster pairs into, and the deletion of invalid cluster pairs from, the priority queue. Each of these operations costs O(log q) time, where q is the queue size. It is easy to see that the higher the load factor threshold is, the more effective Algorithm 3 is in pruning unqualified cluster pairs for merging, and the fewer candidate cluster pairs attaining the threshold need to be added to the priority queue. Since Algorithm 1 merges the clusters from the highest combined load factor down to the required load factor threshold, we can do the clustering step by step. However, this iterative clustering approach does not always improve the performance. The reason is that the combined load factor of some candidate cluster pairs may need to be recomputed for those pairs which cannot be pruned by Algorithm 3. For example, if clusters C and C′ have a combined load factor LF(C ∪ C′) = 0.75, there is a chance that the same LF(C ∪ C′) is recomputed in all three iterations at θ′ = 0.9, 0.8, 0.7. Nevertheless, this approach allows our clustering algorithm to be tunable for speeding up performance as well as to be scalable for adapting to different data sizes.

5.4. Modified Clustering Algorithm
Combining the techniques discussed in Sect. 5.1, 5.2, and 5.3, Algorithm 1 is modified to Algorithm 4.

Algorithm 4. ModifiedACTL
Input: database (A, T)
Input: load factor threshold 0 ≤ θ ≤ 1
Input: threshold step 0 ≤ δ ≤ 1
Output: partition Π such that LF(C) ≥ θ for any C ∈ Π
1: Π ← ClusterAttrsWithEqualTrxGrps(A)
2: create a cluster sorter S and add all clusters C ∈ Π to S
3: create an empty priority queue Q, where each element is an unordered pair of clusters {C, C′} and the pair with the largest LF(C ∪ C′) is placed at the top of Q
4: for all θ′ = 1 − δ, 1 − 2δ, . . . , 1 − kδ, θ where kδ < 1 − θ for k = 1, 2, . . . do
5:   for all C ∈ Π do
6:     L ← FindCandidateClusters(S, C, θ′)
7:     add {C′, C} to Q for all C′ ∈ L and C′ ≠ C
8:   end for
9:   while Q is not empty do
10:    pop {C, C′} from Q
11:    C″ ← C ∪ C′
12:    delete C, C′ from both Π and S
13:    delete all pairs containing C or C′ from Q
14:    L ← FindCandidateClusters(S, C″, θ′)
15:    add {C′, C″} to Q for all C′ ∈ L
16:    add C″ to both Π and S
17:  end while
18: end for
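For reference, a tiny helper (illustrative only) showing the decreasing threshold schedule θ′ = 1 − δ, 1 − 2δ, . . . , θ that drives the outer loop of Algorithm 4:

def threshold_schedule(theta, delta):
    """Decreasing thresholds 1-δ, 1-2δ, ... used by the outer loop of Algorithm 4,
    ending with the target threshold θ itself."""
    steps = []
    k = 1
    while 1 - k * delta > theta:
        steps.append(round(1 - k * delta, 10))
        k += 1
    steps.append(theta)
    return steps

print(threshold_schedule(0.7, 0.1))   # [0.9, 0.8, 0.7]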

6. Attribute Connectivity
Some applications may require only "connected" attributes to be allocated to the same cluster. Intuitively, when two disjoint sets of attributes are not connected, there is no transaction that contains some attributes from one set and some from the other. In that case, putting two disconnected sets of attributes together in the same cluster will not save any table-joins but only waste storage on null values. For example, in Fig. 10, two clusters C1 and C2 may be produced from clustering 7 attributes over 6 transactions given a load factor threshold θ = 0.5. We can say that attributes a1 and a2 are connected because transaction t1 contains both attributes, and it is natural to put them in the same cluster C1. Although there is no transaction containing both a1 and a3, they are both connected to a2, so a1 and a3 are indirectly connected. Similarly, a4 is also connected to a1, a2, and a3. Therefore, it is natural to put all of them in the single cluster C1. Conversely, for cluster C2, it seems not very useful to put a7 with a5 and a6 because a7 is "orthogonal" to the others, even though the load factor of C2 attains the threshold. Definition 4 formally defines this attribute connectivity relation. In fact, ∗-connectivity is the transitive closure of 1-connectivity. We can consider the attributes as nodes in a graph where there is an edge between two attributes if they are 1-connected. It is easy to observe that all the nodes in each connected component of the graph form an equivalence class of ∗-connectivity, as ∗-connectivity is an equivalence relation (Theorem 2).

     | a1 | a2 | a3 | a4 | a5 | a6 | a7
t1   | x  | x  |    |    |    |    |
t2   |    | x  | x  |    |    |    |
t3   |    |    | x  | x  |    |    |
t4   |    |    |    |    | x  | x  |
t5   |    |    |    |    | x  | x  |
t6   |    |    |    |    |    |    | x

C1 = {a1, a2, a3, a4}, LF(C1) = 1/2          C2 = {a5, a6, a7}, LF(C2) = 5/9

Figure 10: Example on attribute connectivity

Definition 4. Two attributes a and a′ are said to be 1-connected if τ(a) ∩ τ(a′) ≠ ∅. Two attributes a and a′ are said to be k-connected if there exists another attribute a″ such that a and a″ are (k − 1)-connected and a″ and a′ are 1-connected. Two attributes a and a′ are said to be connected or ∗-connected if there exists some integer k such that a and a′ are k-connected.

Theorem 2. ∗-connectivity is an equivalence relation.

We call each equivalence class on ∗-connectivity a connectivity class. If attribute connectivity is required, we can modify Algorithm 1 or 4 as follows. We assign each connectivity class a unique ID (CCID) and associate each attribute with its CCID. Then, we add a new condition into the algorithm to restrict that only two clusters with the same CCID can be merged. The CCID assignment algorithm is listed in Algorithm 5.

Algorithm 5. AssignCCID
Input: a database D = (A, T)
Output: a mapping CCID[a] that returns the connectivity class ID for attribute a
1: initialize CCID[a] ← 0 for all a ∈ A
2: create a mapping K : Z⁺ → 2^A and initialize K[i] = ∅ for all 1 ≤ i ≤ n
3: j ← 1
4: for all t ∈ T do
5:   A_t ← α(t)
6:   if CCID[a] = 0 for all a ∈ A_t then
7:     CCID[a] ← j for all a ∈ A_t
8:     K[j] ← A_t
9:     j ← j + 1
10:  else if there exist non-empty sets A′_t, A″_t where A′_t ∪ A″_t = A_t such that all attributes a′ ∈ A′_t have CCID[a′] = 0 and all attributes a″ ∈ A″_t share the same CCID[a″] = d > 0 then
11:    CCID[a′] ← d for all a′ ∈ A′_t
12:    K[d] ← K[d] ∪ A′_t
13:  else if there exist a, a′ ∈ A_t such that both CCID[a], CCID[a′] > 0 and CCID[a] ≠ CCID[a′] then /* attributes in A_t have two or more different CCIDs */
14:    for all a ∈ A_t do
15:      if CCID[a] = 0 then
16:        CCID[a] ← j
17:        K[j] ← K[j] ∪ {a}
18:      else if CCID[a] ≠ j then
19:        d ← CCID[a]
20:        CCID[a″] ← j for all a″ ∈ K[d]
21:        K[j] ← K[j] ∪ K[d]
22:        K[d] ← ∅
23:      end if
24:    end for
25:    j ← j + 1
26:  end if
27: end for
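Algorithm 5 merges class IDs while scanning the transactions once. An equivalent, more compact way to obtain the same connectivity classes (shown only as an illustration, not the paper's formulation) is a union-find over the attributes that co-occur in each transaction:

def connectivity_classes(alpha):
    """Union-find over attributes: attributes co-occurring in a transaction are unioned,
    so the resulting classes are exactly the *-connectivity equivalence classes."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for attrs in alpha.values():
        attrs = list(attrs)
        for a in attrs:
            find(a)                          # register every attribute
        for a in attrs[1:]:
            union(attrs[0], a)
    classes = {}
    for a in parent:
        classes.setdefault(find(a), set()).add(a)
    return list(classes.values())

# Fig. 10 example: yields the classes {a1, a2, a3, a4}, {a5, a6}, and {a7}.
print(connectivity_classes({
    "t1": {"a1", "a2"}, "t2": {"a2", "a3"}, "t3": {"a3", "a4"},
    "t4": {"a5", "a6"}, "t5": {"a5", "a6"}, "t6": {"a7"},
}))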

7. Measuring Schema Fitness
This section proposes several metrics to measure the fitness of an ACTL-mined schema. Definition 5 formally defines a storage configuration as well as the mappings ACLUSTER and TCLUSTERS. ACLUSTER returns the cluster to which attribute a belongs, while TCLUSTERS returns the set of clusters covered by a given transaction.

Definition 5. Let (D, Π) be a storage configuration, where D = (A, T ) is a database and Π is a partition of A.


Define the mapping ACLUSTER : A → Π so that ACLUSTER(a) = C if a ∈ C. Define the mapping TCLUSTERS : T → 2^Π so that TCLUSTERS(t) = {ACLUSTER(a) : a ∈ α(t)}.

A single transaction may spread across multiple tables (e.g., transaction ⟨Tom⟩ in Fig. 8). The average number of clusters per transaction (ACPT) (Definition 6) is the total number of rows in all tables (or the sum of the numbers of clusters storing each transaction) divided by the total number of transactions. An ACPT value is inclusively between 1 and the total number of clusters. For example, ACPT = 2.3 means each transaction spreads across 2.3 tables on average. Therefore, if the ACPT is small, then the number of table-joins required for a query is potentially small. The more structured the data pattern is, the smaller the ACPT is. The ACPT for Fig. 8 is 9/8 = 1.125.

Definition 6. The average number of clusters per transaction (ACPT) of a storage configuration is defined as follows:

    ACPT(D, Π) = ( Σ_{t∈T} |TCLUSTERS(t)| ) / |T| ,   where D = (A, T)

The aggregate load factor (ALF) (Definition 7) measures the aggregate storage efficiency of all tables. It is the total number of non-null cells in all tables divided by the total number of cells of the attribute columns in all tables. An ALF is inclusively between the load factor threshold and 1. The higher the ALF is, the less storage space is wasted on storing null values. The ALF for Fig. 8 is 18/21 = 0.857.

Definition 7. The aggregate load factor (ALF) of a storage configuration is defined as follows:

    ALF(D, Π) = ( Σ_{C∈Π} LD(C) ) / ( Σ_{C∈Π} |C| × |τ(C)| ) ,   where D = (A, T)

Since good clustering is indicated by a low ACPT and a high ALF, we may use the ratio ALF/ACPT to find a suitable load factor threshold that produces a schema with balanced query performance and storage efficiency.
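The two metrics are easy to compute directly from the data; the sketch below (illustrative, not from the paper) reproduces ACPT = 1.125 and ALF ≈ 0.857 for the schema of Fig. 8.

def acpt_and_alf(alpha, schema):
    """ACPT and ALF (Definitions 6 and 7) for a partition `schema` of the attributes.
    alpha: transaction -> set of attributes; schema: list of attribute clusters."""
    total_clusters_per_txn = 0
    for attrs in alpha.values():
        total_clusters_per_txn += sum(1 for c in schema if attrs & c)
    acpt = total_clusters_per_txn / len(alpha)

    total_load = total_cells = 0
    for c in schema:
        group = [t for t, attrs in alpha.items() if attrs & c]
        total_load += sum(len(alpha[t] & c) for t in group)
        total_cells += len(c) * len(group)
    return acpt, total_load / total_cells

alpha = {                                   # Fig. 1 transcribed
    "Tom": {"degree", "title", "supervisor"}, "May": {"degree", "enrolls", "supervisor"},
    "Roy": {"degree", "enrolls"}, "Sam": {"title", "interest"}, "Kat": {"title", "interest"},
    "Db": {"teacher", "code"}, "Net": {"teacher", "code"}, "Web": {"teacher", "code"},
}
schema = [{"degree", "enrolls", "supervisor"}, {"title", "interest"}, {"teacher", "code"}]
print(acpt_and_alf(alpha, schema))          # (1.125, 0.857...): matches Fig. 8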

8. Other Clustering Approaches
As discussed in Sect. 2.5.1, some frequent pattern mining algorithms and the HoVer (horizontal representation over vertically partitioned subspaces) algorithm[18] can be used to cluster attributes for designing property tables. However, both algorithms have some shortcomings when they are used for this purpose.


8.1. Frequent Patterns Mining
Ding and Wilkinson[15, 17] suggested that frequent pattern mining techniques (e.g., Apriori[23] and FP-Growth[24]) could be used to aid property table design. To verify this suggestion, we ran the FP-Growth algorithm[24] to find the frequent attribute patterns given different support values for the Barton Libraries dataset. The experimental result is given in Sect. 9.3.1, which shows that the algorithm cannot generate proper schema designs for the following reasons:

1. The frequent patterns are overlapping attribute subsets instead of disjoint clusters. Also, a frequent pattern only counts those transactions where all attributes in the pattern always occur together. Therefore, it is difficult to set a suitable support value, and the frequent attribute subsets with the highest support may not be good table schemas. This problem can be illustrated with Fig. 8. The attribute subset {degree, enrolls, supervisor} in the student table does not receive the highest support (which is only 1, because of the subject ⟨May⟩). However, we tend to group these attributes together because all three subjects in the table share at least two of the predicates.
2. There is no guarantee that the union of all frequent attribute subsets covers the entire set of attributes. Even when a very low support value of 2 transactions was used to mine the frequent patterns from the Barton dataset, all the frequent attribute subsets covered only 149 out of 285 attributes.
3. For the above reasons, the frequent attribute subsets cannot be used to create property tables without manual selection and adjustments. We doubt how much the mining result can help design the schemas. It is infeasible for humans to analyze so many generated frequent patterns. When the support value was set as high as 100,000, 10,285 frequent patterns

(covering only 25 attributes) from the Barton dataset were still discovered. In contrast, ACTL, by definition, produces disjoint attribute clusters covering the whole attribute set. This means that the produced clusters can be readily used as a schema design for property tables. The choice of the load factor threshold only affects the number of property tables to be built and the storage efficiency of those tables when data are populated. In this sense, the ACTL approach is more suitable than the frequent pattern mining approach for mining property table schemas from data patterns.

was only ∼3 and ∼15,000 clusters remained at very small thresholds. In contrast, ACTL progressively grew large clusters from merging small ones as the threshold was decreasing.

9. Experiments We have conducted three experiments to show the applicability and effectiveness of our schema mining approach compared with other techniques. Experiment 1 performed ACTL (Algorithm 4) on two datasets, namely Wikipedia Infobox and Barton Libraries, for analyzing the clustering performance and the schema fitness. Experiment 2 performed other clustering techniques (i.e., FP-Growth and HoVer) on the Wikipedia dataset for comparison with ACTL. Experiment 3 stored the Wikipedia dataset using (1) property tables based on a schema mined from ACTL, (2) property tables based on a schema mined from HoVer, (3) a triple store, and (4) a vertical database. These experiments were conducted on a Debian Linux (amd64) server with two Quad-Core Xeon Pro E5420 2.50GHz CPUs and 16GB RAM. Algorithm 4 was implemented as a single-threaded Java program, running on the Java HotSpot 64-Bit Server Virtual Machine initialized with 12GB heap-size.

8.2. HoVer Clustering Cui et al. proposed an approach called horizontal representation over vertically partitioned subspaces (HoVer) to solve a similar problem to that addressed by ACTL.[18] HoVer is a basic hierarchical clustering algorithm on attributes using a new distance function, called the correlated degree. The correlated degree between two attributes (called dimensions) a and a′ is the size of the intersection of their transaction groups (called active tuples) divided by the size of the union of their transaction groups, i.e., |τ({a}) ∩ τ({a′})| / |τ({a}) ∪ τ({a′})|. A correlation degree threshold is given to the algorithm; HoVer seeks to group attributes into clusters (called subspaces) where, for every pair of attributes in each cluster, the correlation degree is not less than the threshold. This algorithm poses the following problems when used to generate property table schemas:

9.1. Datasets Wikipedia Infobox dataset. This dataset was extracted from the Wikipedia database[5] (20071018 English version). (The data extraction toolkit has been open-sourced.[25]) Many Wikipedia pages contain an infobox created from a list of name-value pairs (e.g., bridge_name = Tsing Ma Bridge). Fig. 11 shows the infobox on the Wikipedia page "Tsing Ma Bridge." Every page with an infobox was treated as a subject resource, and each infobox field name qualified by the infobox name as a predicate. The infobox field value produced the objects for the predicate. If the field value did not have links, the object was a literal (e.g., "HK$30 (cars)" in the toll field). If the field value contained some links to other pages, each linked page was treated as one object resource, and the whole text value was treated as a literal. For example, in Fig. 11, the field locale generates the following 3 triplets:

1. The correlation degree is merely a heuristic function without an obvious physical meaning like the load factor. It is difficult to use it to control the schema fitness.
2. The algorithm favors only small clusters but not large ones. In order for an unclassified attribute to be grouped into an existing cluster, HoVer requires the correlation degree between that attribute and every attribute in that cluster to attain the correlation degree threshold. In other words, an unclassified attribute cannot be grouped into that cluster when merely a single attribute in that cluster is not sufficiently correlated with it. Therefore, it is difficult to grow large clusters. Sect. 9.3.2 shows that, in our experiment where HoVer was used to cluster the Wikipedia dataset, the average cluster size