An Adaptive Query Execution Engine for Data Integration

An Adaptive Query Execution Engine for Data Integration
Zachary Ives, Daniela Florescu, Marc Friedman, Alon Levy, Daniel S. Weld
University of Washington
Presented by Peng [email protected]

1

Outline
• The Background of Data Integration Systems
• Tukwila Architecture
• Interleaving of planning and execution
• Adaptive Query Operators: Collector & Double Pipelined Join
• Performance

2

Background (Data Integration Systems)

Data Integration System

Multiple autonomous (can’t affect the behavior of sources), heterogeneous (different models and schemas) data sources

3

The key goal of DISs:
• “Free users from having to locate the sources relevant to their query, interact with each source independently, and manually combine the data from the different sources.”

4

The main challenges in the design of DISs:
• Query reformulation
• The construction of wrapper programs
• Query optimizers and efficient query execution engines

5

Motivations for an adaptive approach:
• Little information for cost estimates
• Unpredictable data transfer rates
• Unreliable, overlapping sources
• Want initial results quickly
• Network bandwidth generally constrains the data sources to be smaller than in traditional DB applications

6

Tukwila Architecture

1. a semantic description of the contents of the data sources;
2. overlap information about pairs of data sources;
3. key statistics about the data, such as the cost of accessing each source, and so on

7

Novel Features of Tukwila
• Interleaving of planning and execution
  – Compensates for lack of information
• Handles event-condition-action rules
  – When and how to modify the implementation of certain operators at runtime, if needed
  – Detect opportunities for re-optimization
• Manages overlapping data sources (collectors)
• Tolerant of latency (double-pipelined join)
  – Returns initial results quickly

8


Interleaving of planning and execution
The non-traditional characteristics of Tukwila are as follows:
– The optimizer can only create a partial plan if essential statistics are missing or uncertain.
– The optimizer generates not only operator trees but also the appropriate event-condition-action rules.
– The optimizer conserves the state of its search space when it calls the execution engine.

9


Overview of the query plan structure
• A plan includes a partially-ordered set of fragments and a set of global rules.
• A fragment consists of a fully pipelined tree of physical operators and a set of local rules.
• Why does the system need the fragment structure? The fragment structure is the key mechanism for implementing the adaptive property: at the end of each fragment, the rest of the plan can be re-optimized or rescheduled.

10

Rules
• Re-optimization: if the optimizer’s cardinality estimate for the fragment’s result is significantly different from the actual size, re-invoke the optimizer.
• Contingent planning: the execution engine checks properties of the result to select the next fragment.
• Rescheduling: reschedule if a source times out.
• Adaptive operators

11

RAP2

Rule format
  When event if condition then actions
Example:
  When closed(frag1) if card(join1) > 2 * est_card(join1) then replan
An event triggers a rule, causing it to check its condition. If the condition is true, the rule fires, executing the action(s).
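To make the event-condition-action mechanism concrete, here is a minimal Python sketch of a rule engine; the Rule/RuleEngine classes and the replan action are illustrative names, not Tukwila's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Rule:
    """A simple event-condition-action rule (illustrative, not Tukwila's API)."""
    event: str                            # e.g. "closed(frag1)"
    condition: Callable[[dict], bool]     # predicate over runtime statistics
    actions: List[Callable[[dict], None]]

class RuleEngine:
    def __init__(self) -> None:
        self.rules: Dict[str, List[Rule]] = {}

    def register(self, rule: Rule) -> None:
        self.rules.setdefault(rule.event, []).append(rule)

    def signal(self, event: str, stats: dict) -> None:
        """An event triggers each matching rule; if its condition holds, it fires."""
        for rule in self.rules.get(event, []):
            if rule.condition(stats):
                for action in rule.actions:
                    action(stats)

# Hypothetical action: re-invoke the optimizer on the remaining fragments.
def replan(stats: dict) -> None:
    print("re-invoking the optimizer on the remaining fragments")

# The rule from the slide: when fragment 1 closes, replan if the join produced
# more than twice as many tuples as the optimizer estimated.
engine = RuleEngine()
engine.register(Rule(
    event="closed(frag1)",
    condition=lambda s: s["card(join1)"] > 2 * s["est_card(join1)"],
    actions=[replan],
))
engine.signal("closed(frag1)", {"card(join1)": 5000, "est_card(join1)": 2000})
```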

12


Events:
• open, closed: fragment/operator starts or completes
• error: operator failure, e.g., unable to contact source
• timeout(n): data source has not responded in n msec
• out-of-memory: join has insufficient memory
Conditions:
• state(operator): the operator’s current state
• card(operator): the number of tuples produced so far
• time(operator): the time waiting since last tuple
• memory(operator): the memory used so far
Actions:
• set the overflow method for a double pipelined join
• alter a memory allotment
• deactivate an operator or fragment, which stops its execution and deactivates its associated rules
• reschedule the query operator tree
• re-optimize the plan
• return an error to the user

13

Group Discussion
• For one of the following motivating situations of Tukwila:
  – Absence of statistics
  – Unpredictable data arrival characteristics
  – Overlap and redundancy among sources
  – Optimizing the time to initial answers
• Q1: Can you give some examples where the chosen topic matters?
• Q2: If you were a member of the Tukwila team, what rules or policies would you use to deal with the problem?
  – To help discussion, more specific situations will be given
  – But you may assume any problem or situation
• Discussion
  – Form 8 groups (3~4 people per group, two teams per topic)
  – Discuss Q1 and Q2 for one topic (5~7 minutes)

14

Examples

Orders:
OrderNo   TrackNo
1234      01-23-45
1235      02-90-85
1399      02-90-85
1500      03-99-10

UPS:
TrackNo    Status
01-23-45   In Transit
02-90-85   Delivered
03-99-10   Delivered
04-08-30   Undeliverable

Join[Orders.TrackNo = UPS.TrackNo](Orders, UPS):
OrderNo   TrackNo    Status
1234      01-23-45   In Transit
1235      02-90-85   Delivered
1399      02-90-85   Delivered
1500      03-99-10   Delivered

15

Query Plan Execution
Query plan represented as a data-flow tree (“Show which orders have been delivered”):

  Join[Orders.TrackNo = UPS.TrackNo]
    Read Orders
    Select[Status = “Delivered”]
      Read UPS

• Control flow
  – Iterator (top-down)
    • Most common database model
    • Easier to implement
  – Data-driven (bottom-up)
    • Threads or external scheduling
    • Better concurrency

A minimal sketch of the iterator model follows below.
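As a rough illustration of the iterator (top-down) control flow, here is a minimal Python sketch using generators; the scan/select/hash_join functions and the tiny Orders/UPS relations are illustrative, and the join shown is the conventional asymmetric hash join (build left, probe right), not Tukwila's operator.

```python
# Minimal iterator-style (pull-based) operators as Python generators.
# Each operator pulls tuples from its child on demand, top-down.

def scan(relation):
    """Leaf operator: yield tuples from an in-memory relation."""
    yield from relation

def select(child, predicate):
    """Yield only the child tuples satisfying the predicate."""
    for tup in child:
        if predicate(tup):
            yield tup

def hash_join(left, right, left_key, right_key):
    """Classic (asymmetric) hash join: build on the left, probe with the right."""
    table = {}
    for tup in left:                       # build phase: consume left entirely
        table.setdefault(tup[left_key], []).append(tup)
    for tup in right:                      # probe phase: stream the right input
        for match in table.get(tup[right_key], []):
            yield {**match, **tup}

orders = [{"OrderNo": 1234, "TrackNo": "01-23-45"},
          {"OrderNo": 1235, "TrackNo": "02-90-85"}]
ups = [{"TrackNo": "01-23-45", "Status": "In Transit"},
       {"TrackNo": "02-90-85", "Status": "Delivered"}]

# "Show which orders have been delivered"
plan = hash_join(scan(orders),
                 select(scan(ups), lambda t: t["Status"] == "Delivered"),
                 "TrackNo", "TrackNo")
for row in plan:
    print(row)
```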

16

Tukwila Plans & Execution
• Multiple fragments ending at materialization points
• Rules triggered by events
  – Re-optimize remainder if necessary
  – Return statistics

Example plan fragments:
  (3) Join[Orders.TrackNo = UPS.TrackNo]
        (1) Read Orders
        (2) Select[Status = “Delivered”] over Read UPS

Example rule:
  When(closed(1)): if size_of(Orders) > 1000 then reoptimize {2, 3}

17


Performance Evaluation: Interleaving Planning and Execution

Tukwila’s strategy of interleaving planning and execution can slash the total time spent processing a query, with a total speedup of 1.42 over the pipelined strategy and 1.69 over the naïve materialization strategy.

18


Adaptive Query Operators: Collectors
• Overlap issues: data integration systems need to perform a union over a large number of overlapping sources. However, a standard union operator has no mechanism for handling errors or for deciding to ignore slow mirror data sources once it has obtained the full data set.
• Illustration: a collector computing QA ∪ QB ∪ QC over sources A, B, C, and Mirror_C.

19


Collectors (cont.)
• Collectors can deal with these problems by using policies!
• A collector operator = a set of children (wrapper calls, local data, and so on) + a policy for contacting them (see the sketch below)
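As a rough illustration of "a set of children plus a policy," here is a Python sketch of a collector that contacts its primary children and falls back to a mirror only when a source fails; the fallback policy, the fetch function, and the source names are hypothetical, not one of Tukwila's actual policies.

```python
def collector(children, mirrors, fetch):
    """Union tuples from overlapping sources under a simple, hypothetical policy:
    try each primary child; on error, fall back to its mirror (if any);
    drop duplicates, since the sources overlap."""
    seen = set()
    for name in children:
        try:
            tuples = fetch(name)
        except IOError:
            mirror = mirrors.get(name)
            if mirror is None:
                continue                     # no mirror: ignore the failed source
            tuples = fetch(mirror)           # fall back to the mirror
        for tup in tuples:
            if tup not in seen:              # overlapping sources: deduplicate
                seen.add(tup)
                yield tup

# Hypothetical wrapper calls for sources A, B, C and a mirror of C.
data = {"A": [("01-23-45",)], "B": [("02-90-85",)], "Mirror_C": [("03-99-10",)]}

def fetch(name):
    if name == "C":
        raise IOError("source C unreachable")
    return data[name]

print(list(collector(["A", "B", "C"], {"C": "Mirror_C"}, fetch)))
```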

20


Collectors (cont.)
• A complex policy example (diagram): Tukwila contacting sources A and B according to the collector’s policy.

21

Adaptive Query Operators: Double Pipelined Join
Conventional joins:
• Sort-merge joins and indexed joins cannot be pipelined.
• Nested loops joins and hash joins follow an asymmetric execution model:
  – For nested loops joins, we must wait for the entire inner table to be transmitted before pipelining begins.
  – For hash joins, we must load the entire inner relation into a hash table before we can pipeline.

22

Double Pipelined Hash Join
• Proposed for parallel main-memory databases (Wilschut 1990)
  – Hash table per source
  – As a tuple comes in, add to hash table and probe opposite table
• Evaluation:
  – Results as soon as tuples received
  – Symmetric
  – Requires memory for two hash tables
• But data-driven! (see the sketch below)
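A minimal single-threaded sketch of the double pipelined (symmetric) hash join, assuming the two inputs arrive as one interleaved stream of tagged tuples; the function and variable names are illustrative, and the thread/queue plumbing is omitted here.

```python
def double_pipelined_join(tagged_stream, keys):
    """Symmetric hash join: one hash table per source. Each arriving tuple is
    inserted into its own table and immediately probed against the other,
    so results are produced as soon as both matching tuples have arrived."""
    tables = {"L": {}, "R": {}}
    for side, tup in tagged_stream:                 # side is "L" or "R"
        key = tup[keys[side]]
        tables[side].setdefault(key, []).append(tup)
        other = "R" if side == "L" else "L"
        for match in tables[other].get(key, []):    # probe the opposite table
            yield {**tup, **match}

orders = [{"OrderNo": 1234, "TrackNo": "01-23-45"},
          {"OrderNo": 1235, "TrackNo": "02-90-85"}]
ups = [{"TrackNo": "02-90-85", "Status": "Delivered"},
       {"TrackNo": "01-23-45", "Status": "In Transit"}]

# Interleave the two inputs to mimic tuples arriving over the network.
stream = [("L", orders[0]), ("R", ups[0]), ("L", orders[1]), ("R", ups[1])]
for row in double_pipelined_join(stream, {"L": "TrackNo", "R": "TrackNo"}):
    print(row)
```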

23

Illustration: Join[Orders.TrackNo = UPS.TrackNo](Orders, UPS) with one hash table per source. Tuples from Orders (OrderNo, TrackNo: 1234/01-23-45, 1235/02-90-85, 1399/02-90-85, …) and UPS (TrackNo, Status: 01-23-45/In Transit, 02-90-85/Delivered, 03-99-10/Delivered, …) stream in; each arriving tuple is added to its own hash table (e.g., 01-23-45 in the UPS hash table) and probed against the opposite one.

24

Double-Pipelined Join Adapted to Iterator Model
• Use multiple threads with queues (QA and QB, one per child)
  – Each child (A or B) reads tuples until its queue is full, then sleeps & awakens the parent
  – Join sleeps until awakened, then:
    • Joins tuples from QA or QB, returning all matches as output
    • Wakes the owner of the queue
A simplified sketch of this threading scheme follows below.
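A simplified Python sketch of the thread-and-queue adaptation; for brevity it uses one shared tagged queue rather than separate queues QA and QB, and the blocking put/get calls stand in for the sleep/wake protocol described on the slide.

```python
import queue
import threading

def child(side, tuples, out_q):
    """Child thread: push tagged tuples into the bounded queue, then a sentinel."""
    for tup in tuples:
        out_q.put((side, tup))      # blocks ("sleeps") when the queue is full
    out_q.put((side, None))         # end-of-stream marker

def join_parent(out_q, keys, n_children=2):
    """Parent: pull from the queue and run the symmetric hash join incrementally."""
    tables = {"L": {}, "R": {}}
    done = 0
    while done < n_children:
        side, tup = out_q.get()     # blocks until some child produces a tuple
        if tup is None:
            done += 1
            continue
        key = tup[keys[side]]
        tables[side].setdefault(key, []).append(tup)
        other = "R" if side == "L" else "L"
        for match in tables[other].get(key, []):
            print({**tup, **match})

orders = [{"OrderNo": 1234, "TrackNo": "01-23-45"}]
ups = [{"TrackNo": "01-23-45", "Status": "In Transit"}]

q = queue.Queue(maxsize=4)
threading.Thread(target=child, args=("L", orders, q)).start()
threading.Thread(target=child, args=("R", ups, q)).start()
join_parent(q, {"L": "TrackNo", "R": "TrackNo"})
```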

25


Performance Evaluation: Double Pipelined Hash Join

26


Insufficient Memory?
• May not be able to fit the hash tables in RAM
• Strategy for the standard hash join (sketched below):
  – Swap some buckets to overflow files
  – As new tuples arrive for those buckets, write them to the files
  – After the current phase, clear memory and repeat the join on the overflow files
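A minimal Python sketch of this overflow strategy, assuming a Grace-style partitioning of tuples into hash buckets; the bucket counts, file handling, and function name are illustrative, not Tukwila's actual code.

```python
import pickle
import tempfile

def hash_join_with_overflow(left, right, key, buckets=4, in_memory_buckets=2):
    """Hash join with a simple overflow strategy: buckets below
    in_memory_buckets stay in RAM; the rest are swapped to overflow files
    and joined in a later phase, after memory is cleared."""
    spill = {b: (tempfile.TemporaryFile(), tempfile.TemporaryFile())
             for b in range(in_memory_buckets, buckets)}
    table = {}                                   # in-memory build side

    for tup in left:                             # build phase
        b = hash(tup[key]) % buckets
        if b in spill:
            pickle.dump(tup, spill[b][0])        # bucket overflowed: write to file
        else:
            table.setdefault(tup[key], []).append(tup)

    for tup in right:                            # probe phase
        b = hash(tup[key]) % buckets
        if b in spill:
            pickle.dump(tup, spill[b][1])        # matching bucket is on disk
        else:
            for match in table.get(tup[key], []):
                yield {**match, **tup}

    table.clear()                                # clear memory, then repeat the
    for build_f, probe_f in spill.values():      # join on each pair of overflow files
        build_f.seek(0)
        probe_f.seek(0)
        sub_table = {}
        try:
            while True:
                t = pickle.load(build_f)
                sub_table.setdefault(t[key], []).append(t)
        except EOFError:
            pass
        try:
            while True:
                t = pickle.load(probe_f)
                for match in sub_table.get(t[key], []):
                    yield {**match, **t}
        except EOFError:
            pass

orders = [{"OrderNo": n, "TrackNo": f"T{n}"} for n in range(1234, 1244)]
ups = [{"TrackNo": f"T{n}", "Status": "Delivered"} for n in range(1234, 1244)]
print(sum(1 for _ in hash_join_with_overflow(orders, ups, "TrackNo")))  # -> 10
```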

27

Conclusions
• General Tukwila architecture
• Non-conventional characteristics of Tukwila
• Interleaving of optimization and execution
• Main idea of the collector operator
• Double pipelined hash join

28