Data storage and indexing

Storage and Indexing

(Appendix D, Chapter 10B – Kroenke)

1

Database Design Process     

Requirements analysis Conceptual design  data model Logical design Schema refinement: Normalization Physical tuning

Ramakrishnan, Gehrke: Database Management Systems

2

1

Goals  Query Execution  Indexing


3

Disks and Files  Basic data abstraction - File - collection of records  DBMS store data on (“hard”) disks  Why not main memory?  Why not tapes?

 Data is stored and retrieved in units called disk blocks or pages.  Unlike RAM, time to retrieve a disk page varies depending upon location on disk.  Therefore, relative placement of pages on disk has major impact on DBMS performance! Ramakrishnan, Gehrke: Database Management Systems

4

2

Class Exercise  Consider a disk with average I/O time 20msec and page size = 1024 bytes  Table: 200,000 rows of 100 bytes each, no row spans 2 pages  Find:  Number of pages needed to store the table  Time to read all rows sequentially  Time to read all rows in some random order Ramakrishnan, Gehrke: Database Management Systems

5

Class Exercise Solution  Number of pages for table:  Nb rows per page = floor(page size/ row size) = floor(1024/100) = 10  Nb pages to store all rows = ceiling(nb rows/nb rows per page) = ceiling(200000/10) = 20000 pages

 Time to read all rows sequentially  nb pages*time per page = 20000*20msec = 400sec ~6.6min

 Time to read all rows in some random order:  Note: Since memory is limited, in the worst case, every time we need to read a row, the page holding the row needs to be brought in from disk to memory  Worst case time: nb rows * time per page = 4000 sec ~66.6 min Ramakrishnan, Gehrke: Database Management Systems

6

3

Queries  Equality queries: SELECT * FROM Product WHERE BarCode = 10002121  Range queries: SELECT * FROM Product WHERE Price BETWEEN 5 and 15

 Assume: 200,000 rows in table – 20000 pages on disk  Need indexes to allow fast access to data Ramakrishnan, Gehrke: Database Management Systems

7

Indexes  An index on a file speeds up selections on the search key columns  Any subset of the columns of a table can be the search key for an index on the table


8

4

Hash Index Constant search time Equality queries only


9

B+ Tree Index

O(logdN) search time d – fan-out (~150) N – number of data entries Supports range queries Ramakrishnan, Gehrke: Database Management Systems

10

5

Example B+ Tree

 Find 28*? 29*? All > 15* and < 30*  Insert/delete: Find data entry in leaf, then change it. Need to adjust parent sometimes.  Change sometimes bubbles up the tree Ramakrishnan, Gehrke: Database Management Systems

11

Index Classification  Clustered vs. unclustered: If order of rows on hard-disk is the same as order of data entries, then called clustered index.  A file can be clustered on at most one search key.  Cost of retrieving data records through index varies greatly based on whether index is clustered or not!


12

6

Clustered vs. Unclustered


13

CREATE INDEX in MySQL  CREATE [UNIQUE] INDEX index_name [USING index_type] ON tbl_name (col_name,...)

index_type BTREE | HASH Example: CREATE INDEX I_ItemPrice USING BTREE ON Items (Price)

SELECT * FROM Product WHERE Price between 5 and 10 SELECT * FROM Product WHERE BarCode = 100111 Ramakrishnan, Gehrke: Database Management Systems

14

7

Use Indexes – Decisions to Make  What indexes should we create?  Which tables should have indexes? What column(s) should be the search key? Should we build several indexes?  For each index, what kind of an index should it be?  Clustered? Hash/tree?


15

ICE: Compute query time 

SELECT * FROM Product WHERE Price between 5 and 10 1. Is a hash index useful in this case? Why? 2. Compute time needed to evaluate query assuming 20% of data satisfies condition, disk with average I/O time 20msec, page size = 1024 bytes,200,000 rows of 100 bytes each, no row spans 2 pages 1. If no index exists 2. Clustered B+tree index on Price exists 3. Not-clustered B+tree index on Price exists

8

Index Selection Guidelines  Columns in WHERE clause are candidates for index keys.  Exact match condition suggests hash index.  Range query suggests tree index.

 Try to choose indexes that benefit as many queries as possible.  At most one clustered index per table! Think of trade-offs before creating an index! Ramakrishnan, Gehrke: Database Management Systems

17

Examples


18

9

Class Exercise What index would you construct? 1. SELECT * FROM Mids WHERE Company = 6 2. SELECT CourseID, Count(*) FROM StudentsEnroll WHERE Company = 6 GROUP BY CourseID Ramakrishnan, Gehrke: Database Management Systems

19

Summary  Indexes are used to speed up queries  They can slow down inserts/deletes/updates

 Can have several indexes on a given table, each with a different search key.  Indexes can be  Hash-based vs. Tree-based  Clustered vs. unclustered


20

10