Lecture3 - ECS 165B: Database System Implementa6on Lecture...

Info iconThis preview shows page 1. Sign up to view the full content.

View Full Document Right Arrow Icon
This is the end of the preview. Sign up to access the rest of the document.

Unformatted text preview: ECS 165B: Database System Implementa6on Lecture 3 UC Davis April 2, 2010 Acknowledgements: design of course project for this class borrowed from CS 346 @ Stanford's RedBase project, developed by Jennifer Widom, and used with permission. Slides based on earlier ones by Raghu Ramakrishnan, Johannes Gehrke, Jennifer Widom, Bertram Ludaescher, and Michael Gertz. Class Agenda Last 6me: Finish file and buffer management review File and buffer management in DavisDB Today: Quick look at DavisDB Record Manager component Start review of indexing Reading: Chapter 8 of Ramarkrishnan & Gehkre (or Chapter 12 of Silberschatz et al.) Announcements Project Part I has been posted; due Sunday, 4/11 at 11:59pm hcp://www.cs.ucdavis.edu/~green/courses/ecs165b/recordManager.html Please read all documenta/on carefully, and start early! Teams have been finalized; s6ll wai6ng on support@ for subversion repositories Code distribu6on available from /home/cs165b/DavisDB NOTE: some updates made there this agernoon DavisDB Extra Credit Opportunity We've already found and fixed a bug in the page file component since pos6ng the code last night There will undoubtedly be more (immature codebase...) EXTRA CREDIT OPPORTUNITY: 5% boost to your team's score for Part 1 for each new bug in the page file component you discover and fix yourself! (Up to 10% boost / team; first team to the bug gets the credit) Email your bug reports and fixes to the class mailing list Quick Tour of Record Manager Component Provides classes and methods for managing files of records (aka heap files) Built on top of Page File component, described last 6me You have to implement four main classes, supplied with the code distribu6on: RecordManager RecordFileHandle Record RecordFileScan Don't change any methods in the interface we've given (you can add new methods) Changing the interface will break automated tests Coding Tip: Don't Forget to Mark Pages Dirty! Be diligent about genng this right from the beginning, else you risk introducing tough-to-track-down bugs FileHandle* file;! PageHandle page;! ReturnCode code = file->getFirstPage(&page);! if (code == RC_OK) {! !// ... modify contents of page ...! !file->markDirty(page.pageNo);! !file->unpinPage(page.pageNo);! }! Coding Tip: Asser6ons are Very Useful #include <assert.h>! void computeSomething() {! !assert(... preconditions ...);! !... do some work ...! !assert(... postconditions ...);! }! assert will cause a crash if the condi6on is not sa6sfied. This is exactly what you want to happen! Coding Tip: Asser6ons in Page File Manager Page file manager makes heavy use of run6me asser6ons; some of these will catch your bugs! ReturnCode PageFileManager::allocateBlock(FileHandle* fileHandle, int pageNo, char** data) {! !// first, look for a free block, while also computing! !// the LRU unpinned block to use as backup! !int iLru = -1;! !long epochLru = LONG_MAX;! !for (uint i = 0; i < PF_BUFFER_SIZE; i++) {! ! !assert(pageBlocks_[i].isConsistent());! !...! !}! If this asser6on fires, it means !...! }! your code wrote past the end of a page block! Coding Tip: Asser6ons in Page File Manager page block of size PF_PAGE_SIZE = 4096 bytes ! !assert(pageBlocks_[i].isConsistent());! isConsistent() checks for guard bytes modifica6on of the guard bytes following the page block Review: Indexing Reading: Chapter 8 of Ramarkrishnan & Gehkre (or Chapter 12 of Silberschatz et al.) Alterna6ve File Organiza6ons Many alterna6ves exist, each ideal for some situa6ons, and not so good for others Unordered heap files (aka record files in DavisDB): suitable when typical access is a file scan retrieving all records Sorted files: best if records must be retrieved in some order, or only a range of records is needed Indices: data structures to organize records via trees or hashing Like sorted files, they speed up searches for a subset of records, based on values in certain search key fields Updates are much faster than in sorted files Indices An index on a file speeds up selec6ons on the search key fields for the index Any subset of the fields of a rela6on can be the search key for an index on the rela6on Search key is not the same as key (minimal set of fields that uniquely iden6fy a record in a rela6on) An index contains a collec6on of data entries, and supports efficient retrieval of all data entries k* with a given key value k Alterna6ves for Data Entry k* in Index Three alterna6ves: 1. The actual data record with key value k 2. <k, id of record with search key value k> 3. <k, list of ids of records with search key value k> Choice of alterna6ve for data entries is orthogonal to the indexing technique used to locate data entries with a given key value k Examples of indexing techniques: B+ trees (DavisDB, part 2), hash- based structures Typically, index contains auxiliary informa6on that directs searches to the desired data entries Alterna6ves for Data Entries (Contd.) Alterna6ve 1: the record itself If used, index structure is really a file organiza6on for data records (instead of a heap file or sorted file) At most one index on a given collec6on of data records can use Alterna6ve 1. (Otherwise, data records are duplicated, leading to redundant storage and poten6al inconsistency.) If data records are very large, # of pages containing data entries is high. Implies size of auxiliary informa6on in the index is also large, typically. Alterna6ves for Data Entries (Contd.) Alterna6ves 2 and 3 (record id / list of record ids) Data entries typically much smaller than data records. So, becer than Alterna6ve 1 with large data records, especially if search keys are small. Alterna6ve 3 more compact than Alterna6ve 3, but leads to variable- sized data entries even if search keys are of fixed length Index Classifica6on Primary vs. secondary: if search key contains primary key, then called primary index Clustered vs. unclustered: if order of data records is the same as (or "close to") the order of data entries, then index is called clustered Alterna6ve 1 is always a clustered index; in prac6ce, converse usually holds too (since sorted files are rare) A file can be clustered on at most one search key Cost of retrieving data records through index varies greatly based on whether index is clustered or not! Suppose that Alternative (2) is used for data entries, and that the data records are data entries, and that file. Suppose Alterna6ve 2 is used for stored in a Heap the Clustered vs.vUnclusteredIndex Clustered s. Unclustered Index data build clustered index,hfirst file the Heap file (with To records are stored in a eap sort To uild space on each page for future inserts). somebfreeclustered index, first sort the heap file (with some free space on each page for future inser6ons) Overflow pages may be needed for inserts. (Thus, order of Overflow pages may be needed for inser6ons. (Thus, order of data data recs iis"close to", but not inot identicalsto, order.) order.) `close to', but den6cal to, the ort the sort records s Index entries direct search for data entries CLUSTERED UNCLUSTERED Data entries Data entries (Index File) (Data file) Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke Data Records Data Records 9 Hash-Based Indices Good for equality selec6ons Index is a collec6on of buckets. Bucket = primary page plus zero or more overflow pages Hash func@on h: h(r) = bucket in which record r belongs. h looks at the search key fields of r. If Alterna6ve 1 is used, the buckets contain the data records themselves; otherwise, they contain <key, record id> or <key, record id list> pairs B+ T B+ Tree Indexes ree Indices Non-leaf Pages B+ Tree Indexes Non-leaf Pages Leaf Leaf Pages Pages Leaf pages contain data entries, and are chained (prev & next) pages contain data entries, and nd chained (prev & next) LeafLeaf pages contain data entries, aareare chained (prev + next) Non-leaf pages contain index entries and direct searches: Non-leaf pages contain index entries andnd direct searches Non-leaf pages contain index entries a direct searches: index entry P0 0 K 1 index entry K 1 P P1 K 2 P 2 K 2 P 1 2 K m Pm Pm P K m Database Management Systems R. Ramakrishnan and J. Gehrke Database Management Systems 3ed,3ed, R. Ramakrishnan and J. Gehrke 11 11 ExampleExample of a B+ Tree B+ Tree Root 17 Entries <= 17 < 5 13 Entries > 17 27 30 2* 3* 5* 7* 8* 14* 16* 22* 24* 27* 29* 33* 34* 38* 39* Find 28*? 29*? All > 15* and < 30* Find 28*? 29*? (> 15 and < 30)*? Insert/delete: Find data entry in leaf, then Insert/delete: find data entry in leaf, then change it. change it. Need to adjust parent sometimes. Need to adjust parent some6mes And change sometimes bubbles up the tree Change some6mes bubbles up the tree Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 12 Costs and Benefits of Different Schemes Cost model: ignore CPU costs, for simplicity B: number of data pages R: number of records per page D: (average) 6me to read or write a disk page Measuring number of page I/Os ignores gains of pre-fetching a sequence of pages; thus, even I/O cost is only approximated Average-case analysis; based on several simplis6c assump6ons But, good enough to show the overall trends! Comparing File Organiza6ons Heap files (random order; insert at end-of-file) Sorted files, sorted on <age, sal> Clustered B+ tree file, Alterna6ve 1, search key <age, sal> Heap file with unclustered B+ tree index on search key <age, sal> Heap file with unclustered hash index on search key <age, sal> Opera6ons to Compare Scan: fetch all records from disk Equality search Range selec6on Insert a record Delete a record Assump6ons in Our Analysis Heap files: Equality selec6on on key; exactly one match Sorted files: Files compacted ager dele6ons Indices: Alterna6ves 2, 3: data entry size = 10% size of record Hash: no overflow buckets B+ tree: 67% occupancy (this is typical) Implies file size = 1.5 data size Cost of Operations (a) Scan (1) Heap (2) Sorted (3) Clustered (4) Unclustered Tree index (5) Unclustered Hash index (b) Equality Cost of Opera6ons (c ) Range (d) Insert (e) Delete Several assump@ons underlie these (rough) es@mates! Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke * Several assumptions underlie these (rough) estimates! 17 Cost of Operations (a) Scan (1) Heap BD (b) Equality Cost of Opera6ons (c ) Range (d) Insert (e) Delete Search +D (2) Sorted BD Dlog 2B Dlog 2 B + Search Search # matches + BD +BD (3) Clustered 1.5BD Dlog F 1.5B Dlog F 1.5B Search Search + # matches + D +D (4) Unclustered BD(R+0.15) D(1 + Dlog F 0.15B D(3 + Search Tree index log F 0.15B) + # matches log F 0.15B) + 2D (5) Unclustered BD(R+0.1 2D BD 4D Search Hash index 25) + 2D 0.5BD BD 2D B = # data pages; R = # of records per page; D = (average) 6me to read or write assumptions underlie these (rough) estimates! * Several a disk page Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 18 ...
View Full Document

This note was uploaded on 04/29/2010 for the course ECS 152 taught by Professor Mr. during the Spring '10 term at University of Great Falls.

Ask a homework question - tutors are online