IV Distributed Databases - Motivation & Introduction -



• Motivation
• Expected benefits
• Technical issues
• Types of distributed DBS
• 12 rules of C. Date
• Parallel vs. distributed DBS

References
• M. T. Özsu, P. Valduriez: Principles of Distributed Database Systems, 2nd edition, Prentice-Hall, 1999
• E. Rahm: Mehrrechner-Datenbanksysteme, Addison-Wesley, 1994
• G. Weikum, G. Vossen: Transactional Information Systems: Theory, Algorithms, and the Practice of Concurrency Control and Recovery, Morgan Kaufmann, 2001, ISBN 1558605088
• J. Gray, A. Reuter: Transaction Processing: Concepts and Techniques, Morgan Kaufmann, San Mateo, 1993
• P. A. Bernstein, V. Hadzilacos, N. Goodman: Concurrency Control and Recovery in Database Systems, Addison-Wesley, 1987 (pdf)
• P. A. Bernstein, E. Newcomer: Principles of Transaction Processing, Morgan Kaufmann, San Mateo, 1997

Material used from B. Kemme (McGill), H. Garcia-Molina (Stanford), A. Zaslavsky et al. (Monash), G. Alonso (ETH)

hs / FUB dbsII-03-10DDBIntro-2


Motivation
• Application: data "naturally" distributed
  - Companies with different branches
  - Airlines
  - Financial business
  - University / faculties
  - Any organization with a decentralized organizational structure
• Technology: network infrastructure, processors, RAM
• Economy: hardware cost
• Software supporting distributed processing, e.g. RPC
⇒ Huge number of interconnected systems
Recent challenge: Web-based computing ⇒ e-commerce

Goals: Improvement of non-functional characteristics
• Performance
  - The more computing power, the better
  - Primary goal for parallel DBS, not necessarily for distributed DBS
• Reliability
  - Substitute faulty components (hardware, software ... and network) seamlessly
  - Fault tolerance: the ability to hide failures from users
  - Related to higher availability. Is 95.8 % too low? Definitely: it means 1 hour of downtime per day!
• Scalability
  - Upscale / downscale your system incrementally
  - Central components and algorithms are counterproductive ⇒ distributed algorithms
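The availability figure on this slide can be checked with a few lines of arithmetic. The sketch below is only an illustration; the function name `downtime_per_day_hours` is made up for this example. It converts an availability percentage into expected downtime per 24-hour day.

```python
def downtime_per_day_hours(availability_percent: float) -> float:
    """Expected downtime in hours per 24-hour day for a given availability."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * 24.0


if __name__ == "__main__":
    # 95.8 % is the figure from the slide; the other values are for comparison.
    for availability in (95.8, 99.0, 99.9):
        minutes = downtime_per_day_hours(availability) * 60.0
        print(f"{availability} % available -> {minutes:.1f} min downtime per day")
```

For 95.8 % the result is 0.042 * 24 h, i.e. about one hour per day, which is the slide's point.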


The dark side of distribution
• Systems often less reliable
  - "You will never make a system of unreliable components more reliable by adding more unreliable components"
  - However: hot standby
  - But: data copies must be kept consistent, complex software, unreliable network
• Scalability
  - Distributed systems (DS) are inherently complex
  - High development cost -> middleware efforts
  - High administration cost
  ⇒ lack of flexibility

The dark side ...
• Performance
  - Double resources do not guarantee double performance
  - Network performance? Transfer time depends not only on bandwidth

Transfer of a 4 KB page:

  Distance    Latency   Bandwidth   Transfer time
  100 m       0.5 µs    10 Mbps     5 ms
  100 m       0.5 µs    100 Mbps    0.5 ms
  1 km        5 µs      100 Mbps    0.5 ms
  100 km      0.5 ms    100 Mbps    1 ms
  1000 km     5 ms      100 Mbps    5.5 ms
  10000 km    50 ms     1 Gbps      50 ms

  - Distance > 100 km ⇒ signal propagation time dominates
  - Compare mean disk access time: ~ 5 ms
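The relation between distance, bandwidth and transfer time can be approximated with a simple model: propagation delay plus transmission time. The sketch below is an idealized illustration, not the calculation behind the slide's figures. It assumes a signal speed of 2 * 10^8 m/s (which reproduces the latency values above) and ignores protocol overhead, so its transfer times come out somewhat lower than the rounded slide values. All names are hypothetical.

```python
PAGE_BITS = 4 * 1024 * 8          # one 4 KB page, in bits
SIGNAL_SPEED_M_PER_S = 2e8        # assumption: roughly 2/3 of the speed of light


def propagation_ms(distance_m: float) -> float:
    """Time for the signal to travel the distance, in milliseconds."""
    return distance_m / SIGNAL_SPEED_M_PER_S * 1000.0


def transmission_ms(bandwidth_bps: float) -> float:
    """Time to push all bits of one page onto the link, in milliseconds."""
    return PAGE_BITS / bandwidth_bps * 1000.0


def transfer_ms(distance_m: float, bandwidth_bps: float) -> float:
    """Idealized one-way transfer time of a page: propagation + transmission."""
    return propagation_ms(distance_m) + transmission_ms(bandwidth_bps)


if __name__ == "__main__":
    # (distance in m, bandwidth in bit/s) pairs taken from the table
    rows = [(100, 10e6), (100, 100e6), (1_000, 100e6),
            (100_000, 100e6), (1_000_000, 100e6), (10_000_000, 1e9)]
    for distance, bandwidth in rows:
        print(f"{distance:>10} m  {bandwidth / 1e6:>6.0f} Mbps  "
              f"{transfer_ms(distance, bandwidth):8.3f} ms")
```

In this model the propagation term is already larger than the transmission term at 100 km and 100 Mbps, which is the effect the slide summarizes as "signal propagation time dominates".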


What is a Distributed Database?
• A distributed database (DDB) is a collection of multiple, logically interrelated databases distributed over a computer network.
• A distributed database management system (D-DBMS) is the software that manages the DDB and provides an access mechanism that makes this distribution transparent to the users.
• Distributed database system (DDBS) = DDB + D-DBMS

Definition by P. Valduriez, M. T. Özsu

Example (1)
• Transparency of distribution: one logical DB

UPDATE empl SET sal = sal*1.1
WHERE id IN (SELECT ass.eid FROM ass, proj
             WHERE proj.id = ass.pid AND proj.dur > 12)

Sites (connected via net):
• Berlin: all projects, Berlin employees, all assignments
• Munich: Munich (Muc) projects, Munich employees, Munich assignments
• New York: NY employees

Example by B. Kemme
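To make the idea of distribution transparency in Example (1) more concrete, here is a minimal sketch that simulates the three sites with independent in-memory SQLite databases. The user calls one function, and a thin "global" layer decides which site to ask and which sites to update. Everything here is an assumption made for illustration: the column layout of `empl`, `proj` and `ass`, the sample rows, and the helper names `make_site`, `build_example_sites` and `raise_salaries`. A real D-DBMS would also have to make the update atomic across sites, which this sketch does not attempt.

```python
import sqlite3


def make_site(employees, projects=(), assignments=()):
    """Create one site as an in-memory database with its local fragment."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE empl (id INTEGER PRIMARY KEY, sal REAL)")
    db.execute("CREATE TABLE proj (id INTEGER PRIMARY KEY, dur INTEGER)")
    db.execute("CREATE TABLE ass (eid INTEGER, pid INTEGER)")
    db.executemany("INSERT INTO empl VALUES (?, ?)", employees)
    db.executemany("INSERT INTO proj VALUES (?, ?)", projects)
    db.executemany("INSERT INTO ass VALUES (?, ?)", assignments)
    db.commit()
    return db


def build_example_sites():
    """Hypothetical sample data following the slide's placement of fragments."""
    projects = [(1, 24), (2, 6)]                 # (id, duration in months)
    assignments = [(10, 1), (20, 1), (30, 2)]    # (employee id, project id)
    return {
        "Berlin": make_site([(10, 1000.0)], projects, assignments),
        "Munich": make_site([(20, 2000.0)], [(1, 24)], [(20, 1)]),
        "New York": make_site([(30, 3000.0)]),
    }


def raise_salaries(sites, catalog_site, factor=1.1, min_duration=12):
    """Global update: the caller never names the site where an employee lives."""
    # Berlin stores all projects and all assignments, so the set of eligible
    # employees can be computed there without contacting the other sites.
    rows = sites[catalog_site].execute(
        "SELECT DISTINCT ass.eid FROM ass JOIN proj ON proj.id = ass.pid "
        "WHERE proj.dur > ?", (min_duration,)).fetchall()
    updates = [(factor, eid) for (eid,) in rows]
    # Each site applies the update to whatever employees it stores locally.
    for db in sites.values():
        db.executemany("UPDATE empl SET sal = sal * ? WHERE id = ?", updates)
        db.commit()


if __name__ == "__main__":
    sites = build_example_sites()
    raise_salaries(sites, "Berlin")
    for name, db in sites.items():
        print(name, db.execute("SELECT id, sal FROM empl").fetchall())
```

The design choice to compute the eligible employees in Berlin follows from the data placement on the slide; with a different placement the global layer would need a different strategy.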


Example (2)
• Cooperation: autonomous DBs cooperating on particular tasks

SELECT * FROM flights
WHERE departure = 'Montreal' AND arrival = 'Munich'
  AND date = '12/9/2002' AND price < 800   -- price in $

Sites (connected via net):
• Travel-overland.com
• lufthansa.com
• air-canada.com

Example (3)
• Autonomous, heterogeneous systems, logically identical data types

UPDATE empl SET sal = sal*0.9 WHERE jobTitle = 'product manager'

Sites (connected via net):
• Daimler / Stuttgart: only Stuttgart data, IBM DB2
• Chrysler / Detroit: only Detroit data, Oracle 9i
• Daimler / Bremen: only Bremen data, MySQL


Example (4)
• Sophisticated client / server computing

[Figure: clients connected to Application Server A and Application Server B; possible R/W conflict]

Classification criteria
• Distribution
  - Physically independent systems
  - Peer-to-peer: data distribution and sharing
  - Client / server: function distribution, e.g. parsing in the client
• Heterogeneity
  - DBMS software
  - Database schema (types) and languages (SQL variants)
• Autonomy
  - No global control
  - Local DBS operations may not be influenced by global operations (e.g. of a global transaction)
  - Note: subsumes completely independent or semi-autonomous systems, see scenarios


Classification cube (by P. Valduriez, M. T. Özsu)
• Distributed DB: looks like one DB
• Federated DB: more autonomy, but not independent (Example 3)
• Multi DB: independent, cooperative (Example 2)

Scenarios and common problems
• Not just one kind of distributed database system, but indefinitely many
• Understand common problems, e.g. how to guarantee one state for replicated data from the user's point of view
• Solve them by developing distributed algorithms, e.g. transaction commit
• Main issue: partial failure

Any unsolvable problems? Example: Internet marriage
• Participants: priest, bride, groom
• Distributed transaction: YES or NO, that is the question
• All participants and communication are unreliable
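The slide does not name a commit protocol. As one standard way to reach a common YES or NO decision, the sketch below shows the voting structure of two-phase commit, with the priest as coordinator and bride and groom as participants. This is a simplified illustration under stated assumptions: the class `Participant` and the function `two_phase_commit` are made up for this example, an unreachable participant is treated as a NO vote, and coordinator failure is not modelled at all.

```python
from enum import Enum


class Vote(Enum):
    YES = "yes"
    NO = "no"


class Participant:
    """A participant that votes in phase 1 and learns the outcome in phase 2."""

    def __init__(self, name, will_agree, reachable=True):
        self.name = name
        self.will_agree = will_agree
        self.reachable = reachable
        self.state = "undecided"

    def prepare(self):
        # Simulates an unreliable network: the request may never arrive.
        if not self.reachable:
            raise ConnectionError(f"{self.name} did not answer")
        return Vote.YES if self.will_agree else Vote.NO

    def finish(self, commit):
        # An unreachable participant cannot be informed in this sketch.
        if self.reachable:
            self.state = "committed" if commit else "aborted"


def two_phase_commit(participants):
    """Return True if the transaction commits, False if it aborts."""
    # Phase 1 (voting): the coordinator asks everyone; silence counts as NO.
    votes = []
    for participant in participants:
        try:
            votes.append(participant.prepare())
        except ConnectionError:
            votes.append(Vote.NO)
    decision = all(vote is Vote.YES for vote in votes)
    # Phase 2 (decision): everyone reachable is told the common outcome.
    for participant in participants:
        participant.finish(decision)
    return decision


if __name__ == "__main__":
    couple = [Participant("bride", True), Participant("groom", True)]
    print("married" if two_phase_commit(couple) else "not married")
```

The sketch covers only the failure-free path and lost votes. If the coordinator fails after the participants have voted YES, they cannot decide on their own, which is the kind of partial-failure problem the slide alludes to.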


12 + 1 rules for DDBS (C. Date)
Rule 0: A DDB looks like a central DB to users
Rule 1: Sites should be as independent as possible (local autonomy)
Rule 2: There should not be a central master that all sites depend on (no reliance on a central site)
Rule 3: Never a need for complete shutdown (continuous operation)
Rule 4: Users should not need to know where data are stored (location transparency / independence)
Rule 5: If data are split (e.g. columns of one relation) and distributed over several sites, users should not be aware of it (fragmentation transparency)

12 rules ...
Rule 6: Users should not be aware of replicated data (replication independence)
Rule 7: Efficient distributed query processing
Rule 8: Global concurrency control and recovery (distributed transaction management)
Rule 9: Hardware independence
Rule 10: OS independence
Rule 11: Network independence
Rule 12: DBMS independence


Parallel versus Distributed Databases
• More similarities than differences
• Similar to the parallel / distributed processing distinction
• Parallel DBS
  - Not geographically distributed
  - Goal: high performance
  - Homogeneous software
  - Fast interconnect
• Transparency
• Distributed DBS
  - Data geographically distributed
  - Goal: data sharing
  - Disconnected operation possible -> autonomy

Parallel / distributed DBS
• Query processing in parallel DBS
  Distribute operators (sort, filter, ...) and data over processors to make complex processing fast, e.g. join on a shared-disk multiprocessor (MP) system

[Figure: processors P with memories M1 ... Mn]

Join (R, S) {   // |R| >> |S|
  1. Split R into n-1 partitions Ri and assign them to Mi / Pi;
     assign S to processor / memory Pn / Mn;
  2. Sort Ri and S;   // n sorts in parallel
  3. Join (n-1) + 1 streams
}
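The three steps of the Join pseudocode can be sketched in Python as a partitioned sort-merge join. This is an illustration under assumptions, not the slide's implementation: relations are lists of tuples joined on their first field, R is split round-robin, and a thread pool stands in for the n processors. In standard CPython the threads mainly show the structure; a parallel DBS would run the sorts on separate processors. The name `parallel_sort_merge_join` is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from heapq import merge
from operator import itemgetter


def parallel_sort_merge_join(r, s, n, key=itemgetter(0)):
    """Equi-join of tuple lists r and s on `key`, using n >= 2 workers."""
    if n < 2:
        raise ValueError("need at least one worker for R and one for S")

    # Step 1: split R into n-1 partitions; S is the n-th unit of work.
    partitions = [r[i::n - 1] for i in range(n - 1)]

    # Step 2: sort the n-1 partitions of R and S, all at the same time.
    with ThreadPoolExecutor(max_workers=n) as pool:
        *sorted_parts, sorted_s = pool.map(
            lambda rows: sorted(rows, key=key), partitions + [list(s)])

    # Step 3: merge the n-1 sorted R streams and merge-join them with S.
    result = []
    j = 0  # index of the first S row whose key is not smaller than the R key
    for r_row in merge(*sorted_parts, key=key):
        k = key(r_row)
        while j < len(sorted_s) and key(sorted_s[j]) < k:
            j += 1
        i = j
        while i < len(sorted_s) and key(sorted_s[i]) == k:
            result.append(r_row + sorted_s[i])
            i += 1
    return result


if __name__ == "__main__":
    R = [(3, "c"), (1, "a"), (2, "b"), (1, "x")]
    S = [(1, "S1"), (3, "S3")]
    for row in parallel_sort_merge_join(R, S, n=3):
        print(row)
```

Keeping the smaller relation S in one piece, as the pseudocode does, means only R has to be partitioned and merged again before the final join.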


Parallel / distributed DBS
• Distributed query processing (QP)
  Given a data distribution, find a strategy to evaluate the query with minimal cost, in particular communication cost

[Figure: three sites holding |S| = 100000 records, |R| = 10000 records and |T| = 1000 records, connected by links of 10000 km and 100 km]

Compute with minimal cost (time): R ⋈ S ⋈ T
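As a first intuition for why communication cost drives the strategy, the sketch below compares the naive plans that ship all relations to one site and join them there. It is deliberately crude and rests on assumptions not stated on the slide: each relation lives at its own site, cost is counted only as the number of records shipped, and link distances, record sizes and intermediate result sizes are ignored. The function names are hypothetical.

```python
# Relation sizes in records, taken from the slide.
RELATION_SIZES = {"S": 100_000, "R": 10_000, "T": 1_000}


def records_shipped(target, sizes=RELATION_SIZES):
    """Records sent over the network if all other relations go to target's site."""
    return sum(size for name, size in sizes.items() if name != target)


def cheapest_target(sizes=RELATION_SIZES):
    """Site (named after its relation) that minimizes the records shipped."""
    return min(sizes, key=lambda name: records_shipped(name, sizes))


if __name__ == "__main__":
    for name in RELATION_SIZES:
        print(f"join at site of {name}: ship {records_shipped(name)} records")
    print("cheapest:", cheapest_target())
```

Under this cost measure the join should run where the largest relation already resides. A real optimizer would also weigh the 100 km and 10000 km links and consider shipping intermediate join results instead of whole relations.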

Important terms
• Motivation: technology, application, economy
• Expected benefits: scalability, reliability, performance
• Data / function distribution
• Fault tolerance in case of partial failures
• Autonomy, multi database, federated DB
• Distribution transparency
• Parallel versus distributed DBS
