Translating Web Data
Ronald Fagin, Mauricio A. Hernandez, Lucian Popa
IBM Almaden Research Center
Renee J. Miller, Yannis Velegrakis (speaker)
University of Toronto
Our Motivation
• We address the problem of data translation between schemas.
  - This is an old but recurrent problem (see the Express project at IBM in the 1970s).
  - People usually write complex queries to transform data:
    - Time consuming
    - Requires query experts
    - Even more so when the data and schemas are XML
• Our approach emphasizes automation of the task of data translation:
  - Given two schemas (XML and/or relational) and a high-level specification of a mapping between them, we generate queries (XQuery/XSLT/SQL).
  - The user does not have to know XQuery/XSLT/SQL.
• Major challenges that we address here:
  - Reason about the schemas and the mapping to infer the “right” queries
  - Guarantee that the translated data will comply with the target schema
Schema Mapping & Data Translation
[Diagram: a source schema S and a target schema T (XML Schema, DTD, or relational), each with data that “conforms to” it. A high-level mapping between S and T is given to the Mapping Compiler, which produces a low-level mapping (translation program or queries) between the source data and the target data. The user wants data from S, understands T, and may not understand S.]
• Our approach can be applied in both target materialization and query unfolding.
Mapping is not Schema Integration
[Diagram: source schemas S1 … Sn, each with data that “conforms to” it, feeding one integrated schema.]
• Design problem: the integrated schema
  - Is designed to match the sources
  - Has no semantics of its own
Applications
• Data migration
• Schema evolution
• Data exchange on the Web (P2P applications)
• Software engineering
Challenges in Schema Mapping
• Goal: interoperability between independent data sources
• Support nested structures
  - Nested relational model
  - Nested constraints
• Element correspondences
  - Human friendly
  - Automatic discovery
• Capture the user’s intentions
• Preserve data meaning
  - Discover associations
  - Use constraints & schema
• Create new target values
• Produce correct grouping
• and …
… Generated Transformation (XQuery)

    ,
    distinct (
      FOR $x0 IN $doc/expenseDB/grant,
          $x1 IN $doc/expenseDB/company
      WHERE $x1/cid/text() = $x0/cid/text()
      RETURN
        $x0/cid/text() ,
        $x1/cname/text() ,
        distinct (
          FOR $x0L1 IN $doc/expenseDB/grant,
              $x1L1 IN $doc/expenseDB/company
          WHERE $x1L1/cid/text() = $x0L1/cid/text()
            AND $x1/cname/text() = $x1L1/cname/text()
            AND $x0/cid/text() = $x0L1/cid/text()
          RETURN
            "Sk35(", $x0L1/amount/text(), ", ", $x1L1/cname/text(), ", ", $x0L1/cid/text(), ")" ,
            "Sk36(", $x0L1/amount/text(), ", ", $x1L1/cname/text(), ", ", $x0L1/cid/text(), ")" ,
            "Sk32(", $x0L1/amount/text(), ", ", $x1L1/cname/text(), ", ", $x0L1/cid/text(), ")"
        )
    ),
    distinct (
      FOR $x0 IN $doc/expenseDB/grant,
          $x1 IN $doc/expenseDB/company
      WHERE $x1/cid/text() = $x0/cid/text()
      RETURN
        "Sk32(", $x0/amount/text(), ", ", $x1/cname/text(), ", ", $x0/cid/text(), ")" ,
        $x0/amount/text()
    )
… Generated Transformation (XSLT)
[XSLT listing: markup not recoverable. Surviving Skolem terms: Sk_statisticsDB(), Sk_statisticsDB_0_1(Sk_statisticsDB()), Sk_statisticsDB_0_2(Sk_statisticsDB()), Sk_statisticsDB_0_1_0_2(, , Sk_statisticsDB_0_1(Sk_statisticsDB())), Sk35(, , ) …]
Mapping Algorithm
[Diagram: element-to-element correspondences connect source schema S to target schema T. Query (source): unnest, join. Query (target): nest, split, create new values.]
• Step 1. Intra-schema association discovery
• Step 2. Logical mapping generation
• Step 3. Query generation
Association Discovery
[Diagram: source schema S with its source associations (logical views); target schema T with its target associations (logical views).]
• Step 1. Discover intra-schema associations between schema elements
  - Relational views that contain maximal groups of related elements
  - Each represents a different category of data that may exist in the database
Associations
  - Groups of elements that are semantically associated
  - Chase with intra-schema constraints

Source schema:
  expenseDB: Rcd
    companies: Set of Rcd
      company: Rcd
        cid
        name
        city
    grants: Set of Rcd
      grant: Rcd
        cid
        gid
        amount
        project
        sponsor

Source associations:
  expenseDB.companies
  expenseDB.grants ⋈ expenseDB.companies

Target schema:
  statDB: Set of Rcd
    cityStat: Rcd
      city
      orgs: Set of Rcd
        org: Rcd
          orgid
          oname
          fundings: Set of Rcd
            funding: Rcd
              gid
              proj
              aid
      financials: Set of Rcd
        financial: Rcd
          aid
          date
          amount

Target associations:
  statDB.cityStat.orgs
  statDB [ cityStat.orgs.org.fundings ⋈ cityStat.financials ]
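The chase step can be pictured with the minimal Python sketch below. It is an illustration under stated assumptions, not Clio's implementation: the tuple encoding of constraints and the name `discover_associations` are hypothetical, and the sketch assumes at most one referential constraint between any pair of sets. Starting from each set, it follows referential constraints and joins in the referenced sets until nothing new is added.

```python
# Hypothetical encoding of the source schema from the slide.
# Each constraint: (referencing set, field, referenced set, field).
CONSTRAINTS = [
    ("expenseDB.grants", "cid", "expenseDB.companies", "cid"),
]

SETS = ["expenseDB.companies", "expenseDB.grants"]


def discover_associations(sets, constraints):
    """Chase each set with the constraints; return one association per set.

    An association is a pair (joined sets, join conditions).
    """
    associations = []
    for start in sets:
        members = [start]
        joins = []
        changed = True
        while changed:  # repeat until the chase adds nothing new
            changed = False
            for src, src_field, dst, dst_field in constraints:
                if src in members and dst not in members:
                    members.append(dst)
                    joins.append(f"{src}.{src_field} = {dst}.{dst_field}")
                    changed = True
        associations.append((members, joins))
    return associations


if __name__ == "__main__":
    for members, joins in discover_associations(SETS, CONSTRAINTS):
        print(" JOIN ".join(members), "ON", joins if joins else "(no join)")
```

On the slide's source schema this yields the two associations listed above: `companies` alone, and `grants` joined with `companies` on `cid`.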
Logical Mapping Generation
[Diagram: element-to-element correspondences between source schema S and target schema T; logical mappings connect the source associations (logical views) to the target associations (logical views).]
• Step 2. Logical mapping generation:
  - Map each source association to each target association
    - Based on all correspondences that are relevant
  - By construction, logical mappings preserve associations between elements
Logical Mappings
  - Inter-schema constraints

Source schema:
  expenseDB: Rcd
    companies: Set of Rcd
      company: Rcd
        cid
        name
        city
    grants: Set of Rcd
      grant: Rcd
        cid
        gid
        amount
        project
        sponsor

Target schema:
  statDB: Set of Rcd
    cityStat: Rcd
      city
      orgs: Set of Rcd
        org: Rcd
          orgid
          oname
          fundings: Set of Rcd
            funding: Rcd
              gid
              proj
              aid
      financials: Set of Rcd
        financial: Rcd
          aid
          date
          amount

M1: ∏name (expenseDB.companies) ⊆ ∏name (statDB.cityStat.orgs)
M2: ∏name, gid, amount (expenseDB.grants ⋈ expenseDB.companies) ⊆ ∏name, gid, amount (statDB [ cityStat.orgs.org.fundings ⋈ cityStat.financials ])
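As a rough illustration of Step 2, the Python sketch below pairs every source association with every target association and keeps the correspondences that both sides cover. The dictionary encoding, the element names such as `company.name`, and the function `logical_mappings` are assumptions made for this example. The sketch keeps every covered pair and leaves out any pruning of redundant pairs.

```python
from itertools import product

# Element-to-element correspondences (hypothetical encoding; names from the slide).
CORRESPONDENCES = [
    ("company.name", "org.oname"),
    ("grant.gid", "funding.gid"),
    ("grant.amount", "financial.amount"),
]

# Each association is named and lists the elements it covers.
SOURCE_ASSOCIATIONS = {
    "companies": {"company.cid", "company.name", "company.city"},
    "grants JOIN companies": {
        "grant.cid", "grant.gid", "grant.amount", "grant.project",
        "grant.sponsor", "company.cid", "company.name", "company.city",
    },
}
TARGET_ASSOCIATIONS = {
    "orgs": {"cityStat.city", "org.orgid", "org.oname"},
    "fundings JOIN financials": {
        "cityStat.city", "org.orgid", "org.oname", "funding.gid",
        "funding.proj", "funding.aid", "financial.aid", "financial.date",
        "financial.amount",
    },
}


def logical_mappings(sources, targets, correspondences):
    """Pair each source association with each target association.

    A pair is kept only if at least one correspondence is covered by both.
    """
    result = {}
    for (s_name, s_elems), (t_name, t_elems) in product(
        sources.items(), targets.items()
    ):
        covered = [
            (s, t) for s, t in correspondences if s in s_elems and t in t_elems
        ]
        if covered:
            result[(s_name, t_name)] = covered
    return result


if __name__ == "__main__":
    for pair, covered in logical_mappings(
        SOURCE_ASSOCIATIONS, TARGET_ASSOCIATIONS, CORRESPONDENCES
    ).items():
        print(pair, "->", covered)
```

The pair (`companies`, `orgs`) corresponds to M1 on the slide, and the pair (`grants JOIN companies`, `fundings JOIN financials`) to M2.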
Query Generation
[Diagram: element-to-element correspondences between source schema S and target schema T; a logical mapping connects the source associations (logical views) to the target associations (logical views). Query (source): unnest, join. Query (target): nest, split, create new values.]
• Step 3. Generate queries. Each query:
  - Performs the nest and split operations into multiple target elements
  - May need to create new values (unmapped) in the target
Query Generation Issues
• Translation of the logical mappings into queries
  - Flat representation of how schemas correspond
  - Not all target attributes are determined by the source
  - We need to materialize the nested target
• Skolemization algorithm: the heart of query generation
  - Achieves a good nesting (grouping)
  - Generates new values (ids)
• Skolem functions control the creation of the unknown elements:
  - Atomic values (this enforces the integrity of the target), and
  - Sets (this controls how we group elements in the target)
• Skolem functions are automatically generated.
Skolemization Algorithm

Source instance (expenseDB):

  companies
  cid    name       city
  ibm1   IBM        Almaden
  ibm2   IBM        Yorktown
  ms1    Microsoft  Redmond

  grants
  cid    gid   amount   project
  ibm1   r1    80K      DB2
  ibm1   u1    40K      XQuery
  ms1    u1    40K      XSL

Target schema with Skolem functions (mapping M2):
  statDB: Set of Rcd
    cityStat: Rcd
      city = Sk0[]
      orgs: Set of Rcd = Sk1[]
        org: Rcd
          orgid = Sk2[name]
          oname
          fundings: Set of Rcd = Sk3[name]
            funding: Rcd
              gid
              proj
              aid = Sk4[name,gid,amt]
      financials: Set of Rcd
        financial: Rcd
          aid = Sk4[name,gid,amt]
          date
          amount

Generated target instance:
  statDB
    cityStat
      city Sk0[]
      orgs
        org
          orgid Sk2[IBM]
          oname IBM
          fundings
            funding: gid r1, aid Sk4[IBM, r1, 80]
            funding: gid u1, aid Sk4[IBM, u1, 40]
        org
          orgid Sk2[MSFT]
          oname MSFT
          fundings
            funding: gid u1, aid Sk4[MSFT, u1, 40]
      financials
        financial: aid Sk4[IBM, r1, 80], amount 80
        financial: aid Sk4[IBM, u1, 40], amount 40
        financial: aid Sk4[MSFT, u1, 40], amount 40
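The effect of the Skolem functions on this example can be reproduced with a small Python sketch. It is an illustration only, under these assumptions: Skolem terms are modeled as strings, the helper names `sk` and `translate` are hypothetical, amounts are plain numbers, and the company name is used as stored in the source (the slide abbreviates Microsoft to MSFT in the target). Equal arguments always give the same term, which is what makes the two IBM grants land under a single org and ties each funding to its financial record through a shared `aid`.

```python
def sk(fn, *args):
    """Build a Skolem term as a string; equal arguments give equal terms."""
    return f"Sk{fn}[{', '.join(str(a) for a in args)}]"


COMPANIES = [
    {"cid": "ibm1", "name": "IBM", "city": "Almaden"},
    {"cid": "ibm2", "name": "IBM", "city": "Yorktown"},
    {"cid": "ms1", "name": "Microsoft", "city": "Redmond"},
]
GRANTS = [
    {"cid": "ibm1", "gid": "r1", "amount": 80, "project": "DB2"},
    {"cid": "ibm1", "gid": "u1", "amount": 40, "project": "XQuery"},
    {"cid": "ms1", "gid": "u1", "amount": 40, "project": "XSL"},
]


def translate(companies, grants):
    """Apply mapping M2: join grants with companies, then nest by Skolem ids."""
    orgs = {}        # keyed by orgid = Sk2[name]
    financials = {}  # keyed by aid = Sk4[name, gid, amount]
    for g in grants:
        for c in companies:
            if g["cid"] != c["cid"]:
                continue
            orgid = sk(2, c["name"])
            aid = sk(4, c["name"], g["gid"], g["amount"])
            org = orgs.setdefault(
                orgid, {"orgid": orgid, "oname": c["name"], "fundings": {}}
            )
            # The fundings set depends only on name (Sk3[name]), so all
            # grants of the same company name share one fundings set.
            org["fundings"][aid] = {"gid": g["gid"], "aid": aid}
            financials[aid] = {"aid": aid, "amount": g["amount"]}
    return {"city": sk(0), "orgs": orgs, "financials": financials}


if __name__ == "__main__":
    target = translate(COMPANIES, GRANTS)
    for orgid, org in target["orgs"].items():
        print(orgid, org["oname"], sorted(org["fundings"]))
    print(sorted(target["financials"]))
```

Keying the dictionaries by Skolem term is one simple way to get the grouping; a generated query would achieve the same effect with `distinct` and nested loops.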
Features
• Produce all the semantically meaningful queries.
  - Finds all the associations that exist in the schemas
  - Each query maps from a source association to a target association
  - Allows the user to select a subset of them
• Target schema constraints are taken into consideration
  - To infer the user's intention
  - To guarantee that the generated data satisfies the structure and the constraints of the target schema
• The target instance is guaranteed to be in Partition Normal Form (PNF)
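For readers unfamiliar with PNF: a nested relation is in Partition Normal Form when no two records in the same set agree on all atomic fields, and every nested set is itself in PNF. The Python sketch below checks and restores that property. The representation (records as dicts, nested sets as lists of dicts) and the names `is_pnf` and `to_pnf` are assumptions for illustration; the slides do not show how Clio enforces PNF beyond its use of Skolem functions.

```python
def atomic_key(rec):
    """Return the atomic (non-list) fields of a record as a hashable key."""
    return tuple(sorted((k, v) for k, v in rec.items() if not isinstance(v, list)))


def is_pnf(records):
    """Return True if a list of records is in Partition Normal Form."""
    seen = set()
    for rec in records:
        key = atomic_key(rec)
        if key in seen:  # two records share all atomic values
            return False
        seen.add(key)
        for value in rec.values():
            if isinstance(value, list) and not is_pnf(value):
                return False
    return True


def to_pnf(records):
    """Merge records that agree on all atomic fields by unioning nested sets."""
    merged = {}
    for rec in records:
        key = atomic_key(rec)
        target = merged.setdefault(key, dict(key))
        for field, value in rec.items():
            if isinstance(value, list):
                target.setdefault(field, []).extend(value)
    # Recurse so that the merged nested sets are in PNF as well.
    return [
        {k: to_pnf(v) if isinstance(v, list) else v for k, v in rec.items()}
        for rec in merged.values()
    ]


if __name__ == "__main__":
    orgs = [
        {"oname": "IBM", "fundings": [{"gid": "r1"}]},
        {"oname": "IBM", "fundings": [{"gid": "u1"}]},
    ]
    print(is_pnf(orgs))          # the two IBM records violate PNF
    print(to_pnf(orgs))          # one IBM record with both fundings
    print(is_pnf(to_pnf(orgs)))
```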
Multiple Associations

Source schema:
  expenseDB: Rcd
    companies: Set of Rcd
      company: Rcd
        cid
        name
        city
    grants: Set of Rcd
      grant: Rcd
        cid
        gid
        amount
        project
        sponsor

Target schema:
  statDB: Set of Rcd
    cityStat: Rcd
      city
      orgs: Set of Rcd
        org: Rcd
          cid
          name
          fundings: Set of Rcd
            funding: Rcd
              gid
              proj
              aid
      financials: Set of Rcd
        financial: Rcd
          aid
          date
          amount

• Grants may be associated with companies in multiple ways
• Association 1: grants ⋈ companies, join on cid = cid
• Association 2: grants ⋈ companies, join on sponsor = cid
• We do not make the one flavor assumption (URA)
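The two associations give different answers on the same data, which is why both are kept. A minimal Python sketch, with made-up rows (the `sponsor` values are invented for illustration) and a hypothetical helper `associate`:

```python
COMPANIES = [
    {"cid": "ibm1", "name": "IBM"},
    {"cid": "ms1", "name": "Microsoft"},
]
GRANTS = [
    # The sponsor values below are invented for this example.
    {"gid": "r1", "cid": "ibm1", "sponsor": "ms1"},
    {"gid": "u1", "cid": "ms1", "sponsor": "ibm1"},
]


def associate(grants, companies, grant_field):
    """Join grants with companies on grants.<grant_field> = companies.cid."""
    return [
        (g["gid"], c["name"])
        for g in grants
        for c in companies
        if g[grant_field] == c["cid"]
    ]


via_cid = associate(GRANTS, COMPANIES, "cid")          # Association 1
via_sponsor = associate(GRANTS, COMPANIES, "sponsor")  # Association 2

if __name__ == "__main__":
    print(via_cid)
    print(via_sponsor)
```

Each join pairs a grant with a different company, so a mapping built on one association is not interchangeable with a mapping built on the other.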
Mapping Algorithm Implementation
• Clio System
  - IBM Almaden Research Center / University of Toronto
    - http://www.cs.toronto.edu/db/clio
  - Advanced GUI
    - Constraints
    - Mappings
    - Generated queries
• Experiments with real-world schemas

  Schema    Nesting Depth   Attributes   Constraints   Queries w/o constraints   Queries w/ constraints
  DBLP      1/4             52/12        0/1           11                        13
  TPC-H     1/3             34/10        9/1           7                         14
  GeneX     1/3             65/63        9/3           4                         4
  Mondial   1/4             102/90       15/21         4                         5
  Amalgam   1/1             53/101       26/14         9                         10
Conclusion
• Schema mapping framework
  - For data with schemas
  - Covers relational/XML schemas with nested constraints
  - Output: XQuery, XSLT, SQL
  - Builds “data transformations” (queries with Skolem functions)
• Future extensions
  - Semantics of ordering in mappings
  - Key constraints
  - Recursion of more than 1 level