Sep 15, 2001 - product type=âdigital cameraâ name=âCoolpix880â mpix=â3.34â price=â1799.00 ... signature sp
A Framework for Generic Integration of XML Sources
Wolfgang May Institut fur ¨ Informatik ¨ Freiburg Universitat Germany
KRDB Workshop, Rome 15.9.2001
Generic Integration of XML Sources
OVERVIEW Integration operations (Warehouse strategy) Integration strategies Language: XPathLog Implementation: L O PiX Conclusion
Overview
1
Generic Integration of XML Sources
TOPICS OVERVIEW Considerations on a Data Model for XML with updates/integration: FMLDO/FMII’01 independent from the programming language XPathLog as an XML Database Programming Language: DBPL’01 Implementation: LoPiX VLDB Demonstration Track Application for Data Integration objects of different sources represent the same real-world object
Fusing objects, merging their properties different names and structure
Result views as projections Strategies for “intelligent” data-driven integration KRDB’01
Overview
2
Generic Integration of XML Sources
D OCUMENT VS . DATABASE integration of documents: tree, ordered “integration follows structure” integration of databases: graph, non-ordered semantics-driven integration process
DATABASE I NTEGRATION objects of different sources represent the same real-world object
Fusing objects, merging their properties synonyms, ontologies not compatible with XML Data Model (DOM, XML Query Data Model) requires a powerful language experiences with F-Logic for semi-structured data and data integration
Introduction
3
Generic Integration of XML Sources
S CENARIO Different autonomous sources, describing the same application area e.g., catalogs of Digital Photography overlapping, but potentially incomplete and inconsistent No fixed integration mapping known
Data-Driven integration strategies stepwise generation of an integrated database Warehouse vs. virtual approach
Introduction
4
Generic Integration of XML Sources
producer name=“Nikon” product type=“digital camera” name=“Coolpix880” mpix=“3.34” price=“1799.00” zoom external focallength=“8 20” digital factor=“4”
/zoom
accessory type=“lens” name=“WC-E24”/ accessory type=“lens” name=“TC-E2”/ accessory type=“lens” name=“TC-E3”/ /product product
type=“wide angle adapter” name=“WC-E24” factor=“0.66” price=“219.00”
product
type=“teleconverter” name=“TC-E2” factor=“2” price=“259.00”
product
/product
/product
type=“teleconverter” name=“TC-E3” factor=“3” price=“589.00”
/product
.. . /producer Introduction
5
Generic Integration of XML Sources
store name=“shop1” digitalcamera
producer=“Nikon” type=“Coolpix880” price=“1699.00”/
digitalcamera
producer=“Nikon” type=“Coolpix990” price=“2399.00”/
digitalaccessory
producer=“Nikon” type=“WC-E24” price=“199.00”/
digitalaccessory
producer=“Nikon” type=“TC-E2” price=“269.00”/
digitalcamera
producer=“Olympus” type=“C3000” price=“1599.00”/
.. . /store
other resellers pages test reports etc
Introduction
6
Generic Integration of XML Sources
I NTEGRATION : “T HREE - LEVEL” MODEL access multiple sources “basic” layer: source(s) provide tree structures, optionally with namespaces nikon: producer’s tree shop1, shop2 etc: resellers trees merge data from different sources Abstract Operations fuse elements/merge subtrees introduce synonyms for properties connect elements and tree fragments from several sources by links generate elements “internal” data model: XTreeGraph overlapping trees multiple parents references “export” layer: result trees views defined as projections Integration
7
Generic Integration of XML Sources
I NTEGRATION : F USING E LEMENTS AND S UBTREES
Situation elements represent the same real-world entity in different sources fuse elements into a unified element:
½
¾
Resulting element 1. globally replace 2.
¾
in all properties by
½.
is then an element of both source trees, i.e., positive queries against the original tree using the original namespace still yield at least the original answers, ½
3.
½
collects the attributes of both original elements.
4.
½
collects the subelements of both original elements.
Integration
8
Generic Integration of XML Sources
S YNONYMS identify properties with the same semantics name½
name¾
take properties from the (namespaced) sources are completely added to the result – with another name namespace:name½
name¾
does not introduce new children or attribute nodes, “only” defines an alternative navigation path, does not change order of children
Integration
9
Generic Integration of XML Sources
R ESULT V IEWS
Projection by Signatures Given: Database with (overlapping) trees signature specification (derivable from DTD or XML Schema), a root node . Tree view rooted in : the root node , attributes and subelements (recursively) filtered according to the signature Integrity Constraints: result may contain dangling references. result may be cyclic/infinite.
Integration
10
Generic Integration of XML Sources
S TRATEGIES use a reference tree describe keys (names, codes, titles etc.) Identify corresponding concepts and elements in different trees (element types, attribute names etc.) candidate sets of corresponding elements products in the nikon tree and in the reseller’s trees verify correspondence and fuse elements identify corresponding properties by values based on “known” identical elements – properties which have to be identified – properties which correspond, but have to be compared – technical data vs. prices of different resellers generalize to all elements of a given type from nikon products to olympus products in the reseller tree analogies: semantics of related element types detect mappings between properties (price in DM, Dollars; lengths in cm, inch) generalize relationships between sources Strategies
11
Generic Integration of XML Sources
S TRATEGIES , D ETAILS , E XTENSIONS Well-founded semantics for detecting sets of corresponding elements in “graph” databases interfering, negative dependencies between candidate sets, deep-equality etc. Statistical methods/Data Mining for handling inconsistencies property coincides for 95% of all identified objects perhaps consider another input database for these data confidence measures include meta-knowledge – ontologies – domain-dependent knowledge (units, currencies, taxes, languages)
requires a powerful language
Strategies
12
Generic Integration of XML Sources
WAREHOUSE VS . V IRTUAL REVISITED Apply strategies to an excerpt of the database in the warehouse approach derive a mapping apply mapping to complete databases in a virtual integration strategy
Strategies
13
Generic Integration of XML Sources
L ANGUAGE P ROPOSAL : XPATH L OG
Design Decisions experiences with F-Logic for semi-structured data and data integration declarative rule-based language with bottom-up semantics extend XPath with variable-bindings XTreeGraph data model support for 3-level integration approach
XPathLog
14
Generic Integration of XML Sources
XPATH L OG BY E XAMPLES Pure XPath expressions ?- //producer[@name = “Nikon”] //product[@name=“Coolpix 880”]/@mpix. true Output Result Set ?- //producer[...]//product[...]/@mpixM. M/3.34 Additional Variables ?- //producer[...]//product[@name=N]/@priceP. N/“Coolpix 880”
P/1799.00
N/“WC-E24” .. .
P/219.00
Dereferencing ?- //producer[...]//product/accessory/@name/@priceP. Schema Querying ?- //product[@type=“camera”]/@Prop. Prop/name Prop/mpix Prop/price XPathLog
15
Generic Integration of XML Sources
XPATH L OG RULES head(V½ ,. . . ,V ) :- body(V½ ,. . . ,V ) Constructive semantics of XPath expressions Definite XPathLog atoms: – use only the child, sibling, and attribute axes – no negation, function applications, aggregation, and proximity position predicates “/” and “[. . . ]” act as constructors:
modifies
of the form
–
–
–
–
–
unambiguous insertions
XPathLog: Rule Heads
16
Generic Integration of XML Sources
I MPLEMENTATION : L O P I X developed using major components from F LORID
System Commands
Pretty Printer Bindings/XML
interactive Output
XPathLog Parser (Programs and Queries)
Logic Evaluation
XML output
(Bottom-up) Algebraic
Algebraic
Evaluation ( )
Insertion
OM Access
WebAccess DTD XML Parser Parser
Internet
Evaluation
Execution
User Interface
XML url½
Storage
DTD url¾ Internet in- and output Object Manager
inserts to internal storage internal information flow querying internal storage
Implementation
17
Generic Integration of XML Sources
C ONCLUSION specialized integration operations for XML data: fusing, linking, synonyms not compatible with DOM/XML Query Data Model: unique-parent graph data model suitable & necessary for updates and integration 3-level integration process manually written integration programs vs. high-level, generic, heuristics-based strategies
Conclusion
18
Generic Integration of XML Sources
Contents Overview
1
1 Overview
1
2 Topics Overview
2
3 Document vs. Database
3
3 Database Integration
3
4 Scenario
4
7 Integration: “Three-level” model
7
8 Integration: Fusing Elements and Subtrees
8
9 Synonyms
9
10 Result Views
10
11 Strategies
11
12 Strategies, Details, Extensions
12
13 Warehouse vs. Virtual revisited
13
14 Language Proposal: XPathLog
14
XPathLog 15 XPathLog by Examples Rules
15 15 16
16 XPathLog Rules
16
Implementation
17
17 Implementation: LoPiX
17
Results
18
18 Conclusion
18
Conclusion
19