Translating Web Data

5 downloads 87 Views 243KB Size Report
7 … Generated Transformation (XQuery). ... $x1L1 IN $doc/expenseDB/company. WHERE .... aid Sk4[MSFT, u1,40].
Translating Web Data Ronald Fagin, Mauricio A. Hernandez, Lucian Popa, IBM Almaden Research Center

Renee. J. Miller, Yannis Velegrakis

(speaker)

University of Toronto

1

Our Motivation l We address the problem of data translation between schemas.

w This is an old but recurrent problem (see the old Express project from IBM – 70’s) w People usually write complex queries to transform data ü Time consuming ü Requires query experts ü Even more when the data and schemas are XML

l Our approach emphasizes automation of the task of data translation:

w Given two schemas (XML and/or relational), and a high-level specification of a mapping between schemas, we generate queries (XQuery/XSLT/SQL)

w The user does not have to know XQuery/XSLT/SQL l Major challenges that we address here:

w Reason about schemas and mapping to infer the “right” queries w Guarantee that the translated data will comply with the target schema 2

Schema Mapping & Data Translation •Wants data from S •Understands T •May not understand S •XML Schema •DTD •Relational

Source schema S

“conforms to”

High-level mapping

Mapping Compiler

Target schema T

“conforms to”

data Low-level mapping (translation program or queries) Our approach can be applied in both target materialization and query unfolding 3

Mapping is not Schema Integration

Integrated schema

Source schema S1

“conforms to”

Source schema Sn

“conforms to”

data

Design Problem: integrated schema • Designed to match sources • Has no semantics of its own

data

4

Applications l Data Migration l Schema Evolution l Data Exchange on the Web (P2P applications) l Software engineering

5

Challenges in Schema Mapping l

Goal: interoperability between independent data sources

l

Support Nested Structures w Nested Relational Model w Nested Constraints

l

Element correspondences w Human friendly w Automatic discovery

l

Capture User’s Intentions

l

Preserve data meaning w Discover associations w Use constraints & schema

l

Create New Target Values

l

Produce Correct Grouping

l

and …

6

… Generated Transformation (XQuery) , distinct ( FOR $x0 IN $doc/expenseDB/grant, $x1 IN $doc/expenseDB/company WHERE $x1/cid/text() = $x0/cid/text() RETURN $x0/cid/text() , $x1/cname/text() , distinct ( FOR $x0L1 IN $doc/expenseDB/grant, $x1L1 IN $doc/expenseDB/company WHERE $x1L1/cid/text() = $x0L1/cid/text() AND $x1/cname/text() = $x1L1/cname/text() AND $x0/cid/text() = $x0L1/cid/text() RETURN "Sk35(", $x0L1/amount/text(), ", ", $x1L1/cname/text(), ", ", $x0L1/cid/text(), ")" , "Sk36(", $x0L1/amount/text(), ", ", $x1L1/cname/text(), ", ", $x0L1/cid/text(), ")" , "Sk32(", $x0L1/amount/text(), ", ", $x1L1/cname/text(), ", ", $x0L1/cid/text(), ")" ) ), distinct ( FOR $x0 IN $doc/expenseDB/grant, $x1 IN $doc/expenseDB/company WHERE $x1/cid/text() = $x0/cid/text() RETURN "Sk32(", $x0/amount/text(), ", ", $x1/cname/text(), ", ", $x0/cid/text(), ")" , $x0/amount/text() )

7

… Generated Transformation (XSLT) true Sk_statisticsDB() Sk_statisticsDB() Sk_statisticsDB_0_1(Sk_statisticsDB()) Sk_statisticsDB_0_2(Sk_statisticsDB()) Sk_statisticsDB_0_1(Sk_statisticsDB()) Sk_statisticsDB_0_1_0_2(, , Sk_statisticsDB_0_1(Sk_statisticsDB())) Sk_statisticsDB_0_1_0_2(, , Sk_statisticsDB_0_1(Sk_statisticsDB())) Sk35(, , ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

8

Mapping Algorithm Element - Element correspondences Source schema S

Query (source): - unnest - join

Target schema T

Query (target): - nest - split - create new values

l Step 1. Intra-schema associations discovery l Step 2. Logical mapping generation l Step 3. Query generation 9

Association Discovery Source schema S

Source Associations (logical views)

Target schema T

Target Associations (logical views)

l Step 1. Discover intra-schema associations between schema elements

w relational views that contain maximal groups of related elements w Each represents a different category of data that may exist in the database 10

Associations w Groups of elements that are semantically associated w Chase with intra-schema constraints expenseDB: Rcd companies: Set of Rcd company: Rcd cid name city grants: Set of Rcd grant: Rcd cid gid amount project sponsor

expenseDB.companies expenseDB.grants ? expenseDB.companies

statDB: Set of Rcd cityStat: Rcd city orgs: Set of Rcd org: Rcd orgid oname fundings: Set of Rcd funding: Rcd gid proj aid financials: Set of Rcd financial: Rcd aid date amount statDB.cityStat.orgs statDB [ cityStat.orgs.org.fundings ? cityStat.financials ] 11

Logical Mapping Generation Element - Element correspondences Source schema S

Target schema T

Logical Mappings Source Associations (logical views)

Target Associations (logical views)

l Step 2. Logical mapping generation:

w each source association -> each target association ü based on all correspondences that are relevant

w By construction, logical mappings preserve associations between elements 12

Logical Mappings w Inter-schema constraints expenseDB: Rcd companies: Set of Rcd company: Rcd cid name city grants: Set of Rcd grant: Rcd cid gid amount project sponsor

M1

∏name expenseDB.companies ∏name gid amount

expenseDB.grants ? expenseDB.companies

statDB: Set of Rcd cityStat: Rcd city orgs: Set of Rcd org: Rcd orgid oname fundings: Set of Rcd funding: Rcd gid proj aid financials: Set of Rcd financial: Rcd aid date amount

M2



∏name statDB.cityStat.orgs



∏name gid amount

statDB [ cityStat.orgs.org.fundings ? cityStat.financials ] 13

Query Generation Element - Element correspondences Source schema S

Target schema T

Query (source): - unnest - join Logical Mapping Source Associations (logical views)

Query (target): - nest - split - create new values Target Associations (logical views)

l Step 3. Generate queries. Each query: w Performs the nest and split operation into multiple target elements w May need to create new values (unmapped) in the target

14

Query Generation Issues l Translation of the logical mappings into queries

w flat representation of how schemas correspond w not all target attributes are determined by the source w we need to materialize the nested target l

Skolemization algorithm: the heart of query generation

w Achieves a good nesting (grouping) w Generates new values (ids) l Skolem functions control the creation of the unknown elements:

w atomic values (this enforces the integrity of the target) , and w sets (this controls how we group elements in the target) l Skolem functions are automatically generated. 15

Skolemization Algorithm statDB

expenseDB: Rcd companies: Set of Rcd company: Rcd cid name city grants: Set of Rcd grant: Rcd cid gid amount project sponsor cid ibm1 ibm2 ms1

name IBM IBM Microsoft

statDB: Set of Rcd cityStat cityStat: Rcd city Sk 0[ ] city = Sk0[] orgs orgs: Set of Rcd = Sk1[] org M2 org: Rcd Sk 2[IBM] orgid orgid = Sk2[name] oname IBM oname fundings fundings: Set of Rcd = Sk3[name] funding gid r1 funding: Rcd aid Sk 4[IBM, r1,80] gid proj funding = Sk4[name,gid,amt] gid u1 aid aid Sk 4[IBM, u1,40] financials: Set of Rcd org financial: Rcd orgid Sk 2[MSFT] = Sk [name,gid,amt] 4 aid oname MSFT fundings date funding amount

gid u1 aid Sk 4[MSFT, u1,40]

city Almaden Yorktown Redmond

cid

gid

amount

project

Ibm1

r1

80K

DB2

Ibm1

u1

40K

XQuery

ms1

u1

40K

XSL

financials financial aid Sk 4[IBM, r1,80] amount 80 financial aid Sk 4[IBM, u1,40] amount 40 financial 16 aid Sk 4[MSFT, u1,40] amount 40

Features l Produce all the semantically meaningful queries.

w Finds all the associations that exist in the schemas w Each one maps from a source association to a target association w Allows user to select a subset of them l Target schema constraints are taken into consideration

w To infer the user intention w To guarantee that the generated data satisfies the structure and the constraints of the target schema

l Target Instance is guaranteed to be in Partition Normal

Form (PNF) 17

Multiple Associations expenseDB: Rcd companies: Set of Rcd company: Rcd cid name city grants: Set of Rcd grant: Rcd cid gid amount project sponsor

statDB: Set of Rcd city cityStat: Rcd orgs: Set of Rcd org: Rcd cid name fundings: Set of Rcd funding: Rcd gid proj aid financials: Set of Rcd financial: Rcd aid date amount

n

Grants may be associated with companies in multiple ways

n

Association 1: grants ? companies join on cid = cid

n

Association 2: grants ? companies join on sponsor = cid

n

We do not make the one flavor assumption (URA) 18

Mapping Algorithm Implementation l Clio System w IBM Almaden Research Center / University of Toronto ü http://www.cs.toronto.edu/db/clio

w Advanced GUI

ü Constraints ü Mappings ü Generated queries

l Experiments with real-world schemas DBLP

Nesting Depth 1/4

TPC-H

1/3

34 / 10

9/1

7

14

GeneX

1/3

65 / 63

9/3

4

4

Mondial

1/4

102 / 90

15 / 21

4

5

Amalgam

1/1

53 / 101

26 / 14

9

10

Schema

Attributes

Constraints

52 / 12

0/1

Queries w/o Queries w/ constraints constraints 11 13

19

Conclusion l Schema mapping framework w For data with schemas w Covers relational/XML schemas with nested constraints w Output: XQuery, XSLT, SQL w Build “data transformations” (Queries with Skolem functions) l Future extensions w Semantics of ordering in doing mapping w Key constraints w Recursion of more than 1 level

20

Suggest Documents