Final Exam Possible Questions about Last Few Lectures How To ...

Final Exam •  Wednesday, Dec 12, 8:30am-10:20am, in class

Introduction to Data Management CSE 344

•  Content: Everything –  But focus on Lectures 14 through 26

Lecture 26: Data Integration •  Open books and open notes –  You can bring any notes that you want including slides, assignments, old exams, stuff from the web, … –  BUT NO laptops, phones, or other devices Magda Balazinska - CSE 344, Fall 2012

1

Magda Balazinska - CSE 344, Fall 2012

Possible Questions about Last Few Lectures

How To Study

•  NoSQL: You can expect some sort of qualitative question that will ask you to –  Explain the motivation for NoSQL systems –  Discuss some similarities and differences between relational and NoSQL systems –  Something else along those lines

•  Data integration and data cleaning –  You can similarly expect a qualitative question on these topics Magda Balazinska - CSE 344, Fall 2012

2

•  •  •  • 

Go over the lecture notes Read the book Go over the homeworks Practice –  Practice web quiz –  Finals from past 344 –  Look at both midterms and finals from 444 past years: be careful because several questions do not apply to us! –  Questions in the book

•  Ask course staff questions! •  The goal of the final is to help you learn! 3

Today: Data Integration


4

Data Integration Challenges (1/2) •  Each database could be in a different type of DBMS (different data model, query language, etc.)

•  Goal: –  Data is available in multiple distinct databases –  Want to ask queries across all these databases

–  Relational, semi-structured, NoSQL

•  Schema heterogeneity

•  When is data integration needed? –  Two companies merge –  Want to get legacy databases to talk to each other –  Want to analyze data produced by different sources –  Want to combine data from different websites

–  S1: Employee(ID, name, address, position, salary) –  S2: Worker(EID, name, address) Position(PID,salary,from,until)

•  Data type heterogeneity –  Employee ID could be a string or an integer

•  Value heterogeneity –  The “cashier” position could be called “cashier” or “associate”


5


6

1

Data Integration Challenges (2/2)

Data Integration Approaches •  Federated databases

•  Semantic heterogeneity –  Most difficult to manage –  E.g., salary is hourly salary before tax –  Or salary is net, weekly salary with lunch allowance

•  Data integration is a very, very, very difficult pb!

–  Each source remains independently administered –  One source can call on others to supply info

•  Centralized warehouse –  Data from source is extracted-transformed-loaded into a single, centralized database –  Data is refreshed periodically (so data is not 100% up-to-date)

•  Mediator –  Virtual database on top of others –  Takes query as input and rewrites it in terms of queries over the other databases, then synthesizes the answer


7

Creating a Data Warehouse


8

Executing Queries over Mediator

•  Extract data from distributed operational databases

•  Wrappers more complex than with warehouses

–  Can do this by running a query over the data source

•  Clean to minimize errors and fill in missing information –  We will discuss cleaning next lecture

•  Transform to reconcile semantic (and other) mismatches –  Performed by defining views over the data sources

–  Need to execute all sorts of queries, not just extract –  One approach is to define templates –  A template is a parameterized query •  select * from EmployeeMed where position=‘$p’

•  Query optimization at the mediator is a challenge –  Wrapped data sources can be seen as views –  How can I answer the given query using these views?

•  Load to materialize the above defined views –  Build indexes and additional materialized views

•  Refresh to propagate updates to warehouse periodically –  Update the warehouse incrementally Magda Balazinska - CSE 344, Fall 2012

9

•  Data sources can exhibit bad and variable performance •  May want a more dynamic query plan: process data as it arrives Magda Balazinska - CSE 344, Fall 2012

10

Dirty Data •  Another challenge with data integration –  Often hard to decide if two records represent the same entity or not –  E.g., John Doe from 1234 56th ave NE, Seattle –  Vs. J. R. Doe from 1234 56th ave NE, Seattle –  Vs John Doe from 789 108th St., Bellevue

•  Even without data integration, data often dirty –  Missing values, duplicates, odd characters, etc. Magda Balazinska - CSE 344, Fall 2012

11

2

Final Exam Possible Questions about Last Few Lectures How To ...

Final Exam Possible Questions about Last Few Lectures How To ...

Suggest Documents

Physics 390 ”Final” Exam Practice Questions These are a few ...

Old Final Exam Questions

phil 93515: Final exam questions

A Few Questions About Consciousness ... - NeuroQuantology

Sedimentary Basins (few lectures)

EPN Examples of possible exam questions

Possible Final Exam for Career Choices - CareerChoices.com

Final Exam OOP Questions Solved - yimg.com

AMS 501 Sample Questions for Final Exam

Final Exam Review Sheet and Sample Questions

Final Exam Practice Questions Workings.pdf - Google Drive

Calculus 1: Sample Questions, Final Exam, Solutions

Frequently asked questions about exam results - Cambridge ...

Frequently asked questions about exam results - Cambridge ...

Final Exam Practice Questions for General Chemistry NOTICE TO ...

Gurukripa's Guideline Answers to May 2016 Exam Questions CA Final ...

Gurukripa's Guideline Answers to May 2015 Exam Questions CA Final ...

Gurukripa's Guideline Answers to May 2012 Exam Questions Final ...

Gurukripa's Guideline Answers to May 2016 Exam Questions CA Final ...

Gurukripa's Guideline Answers to May 2016 Exam Questions CA Final ...

Solutions to Final Exam

Review Lectures The Final Exam Paper The Structure Topics

Last - Layout Final-Final

Physics 390 ”Final” Exam Practice Solutions These are a few ...