data representation and external sorting.pdf - Data Representation External Sorting INF 551 Wensheng Wu 1 Outline \u2022 Representing data \u2013 How are

data representation and external sorting.pdf - Data...

This preview shows page 1 out of 22 pages.

You've reached the end of your free preview.

Want to read all 22 pages?

Unformatted text preview: Data Representation & External Sorting INF 551 Wensheng Wu 1 Outline • Representing data – How are tables stored on storage devices? • External Sorting – How to sort 1TB data using 1GB of memory? Sorting: #Buffer pages available M = 1GB = 1024 runs = 1K runs Merging: #ways of merging = M - 1 buffer block (page) = 1GB/(4KB-1) = 256K - 1 ways of merging but need to merge only 1k run 2 Representing Data Elements • Relational database elements: CREATE TABLE Product ( pid INT PRIMARY KEY, name CHAR(20), description VARCHAR(200), maker CHAR(10) REFERENCES Company(name)) tinyint = 1bytes smallint = 2 bytes mediumint = 3 bytes int = 4 bytes largeint = 8 bytes which attribute has variable length? - description • A tuple is stored as a "record" 3 Record Formats: Fixed Length F1 F2 F3 F4 L1 L2 L3 L4 Base address (B) Address = B+L1+L2 • Information about field types is the same for all records in a file; stored in system catalogs. • Note the importance of schema information! 4 Record Header To schema length F1 F2 F3 F4 L1 L2 L3 L4 header timestamp Header: • Pointer to schema: help finding fields • Length: so we know where the record ends w/o consulting schema • Timestamp: time when record last modified or read 5 Variable Length Records Other header information header F1 F2 F3 F4 L1 L2 L3 L4 length beginning of F3 is the end of F1 + F2 and since they are fixed length, we can determine F3 without pointer. Place the fixed fields first: F1, F2 Then the variable length fields: F3, F4 Note: actually no need for pointer to F3, why? 6 Storing Records in Blocks • Blocks have fixed size (typically 4KB) – But records may have variable-length Offset table (slot directory) BLOCK inserting R5 - insert a pointer + the data. this is better than keeping the free space at the end because we don’t where the records end. We just know where they start. So, difficult to find the free space while inserting new record. Here, just the total length will let you know. R4 R3 R2 R1 Why are records placed from the end? 7 Problem with this design? • Records right after slot directory • Free space after all records Offset table (slot directory) Here, there is no space for adding pointer to the new records being add after R4. That’s why it is better to have free space in front. even with fixed record sizes, it gets messy with free space at the end. R1 R2 R3 R4 while getting record, you get block id and then slot id (for pointer) in offset table. So, you have pointer even when a record is deleted. Confirm this. 8 Outline • Representing data – How are tables stored on storage devices? • External Sorting – How to sort 1TB data using 1GB of memory? 9 The I/O Model of Computation • In main memory algorithms: – we care about CPU time • In databases – time is dominated by I/O cost I/O: cost of reading and writing • Assumption: cost is given only by I/O • Consequence: need to redesign certain algorithms, e.g. sorting 10 Notes • A block on storage devices loaded into a page in main memory – We sometimes interchange page with block • Buffer pages – Often refer to pages in main memory used to store input, output, and intermediate data for an algorithm • Run: a sorted sublist of input data 11 Notes • Make a pass through data: – Loading the entire data from disk once 12 Sorting • Illustrates the difference in algorithm design when your data is not in main memory: – Problem: sort 1TB of data with 1GB of RAM • Arises in many places in database systems: – – – – – Data requested in sorted order (ORDER BY) Needed for grouping operations First step in sort-merge join algorithm Duplicate removal Bulk loading technique for creating B+-tree indexes 13 Pass 0: Pass 1: there will be two buffers: 3,4 —> input 1: 3,4 2,6 —> input 1: 2, 6 output: 2 then 3 and then 4. At this time input 1 will be empty. So, we need to check if the buffer is empty or not before we push 6 to the output. 2-Way Merge-sort: Requires 3 Buffers least amount of memory needed to perform external sorting: no. of disks read and write, no, of pages (blocks needed) in place sorting: doesn’t need additional memory than what it was assigned. Quicksort does that. • Pass 0: Read a page, sort it, write it – only one buffer page is used • Pass 1, 2, …, etc.: merging two runs at a time – three buffer pages used. Pass 2: Buffer 1: 2, 3|4, 6 —> Buffer 2: 4, 7|8, 9 —> INPUT 1 Input 1: 2, 3 -> 2, 3 Input 2: 4, 7 Check if there is anything left in buffer 1 Input 1: 4, 6 -> 4, 4 Input 2: 4, 7 Check if there is anything left in buffer 1 Check if there is anything left in buffer 2 Input 1: 6 -> 6, 7 Input 2: 7, 8 Input 1: Input 2: 8, 9 -> 8, 9 OUTPUT INPUT 2 Disk Main memory buffers Disk 14 Pass 0 7 pages. Each page is read and then sorted and then written back. So, computation is done one block at a time. How many buffer pages needed for pass 0? 1 page only because computation done one page at a time. Pass 1 - Merging In this we need two input buffer and output is the merged sorted version. computation done on 2 page. So, 2 pages needed. after each pass, the pages are merged and then read again and then sorted and then merged again until all pages become one. Two-Way External Merge Sort • Each pass we read + write write it back sorted each page in file. • N pages in the file => the number of passes log 2 N 1 9,4 8,7 5,6 3,1 2 3,4 2,6 4,9 7,8 5,6 1,3 2 4,7 2,3 4,6 1,3 5,6 8,9 2 Input file PASS 0 1-page runs PASS 1 2-page runs 2,3 4,4 6,7 2 N log 2 N 1 1,2 3,5 6 8,9 4-page runs PASS 3 multiple merging phases because we use only two input pairs in one go. Each input is one page. So, max 2 pages are used for every run for inputs. This is why we have multiple merging phases. • Sort 4MB with buffer page size = 4KB: needs11 passes Number of merging pairs: Pass 0: 1; Pass 1: 2; Pass 2: 4 …. So, Number of merging pairs = 1 + 2^ (number of passes, k) In this example, N = 7, k = 3 #Merging pairs = 1 * 2^k >= N k = ceiling(log_2(N)) k = number of passes/merging steps = ceiling(log_2(N)) 6,2 PASS 2 • So total cost is: 3,4 size of run = number of pages/blocks needed for that run. After every pass, the size of run doubles. 1,2 2,3 3,4 8-page runs 4,5 6,6 7,8 9 15 Least # of pages required. For sorting, atleast 1 page is required for external merge sort . For merging, atleast 2 pages is required. Can We Do Better ? • We have more main memory • Should use it to improve performance given the following, how can we do it better when we have been given more memory? • M: # of blocks (i.e., pages) in main memory • B(R): # of blocks of relation R 16 B(R) = 2000; M = 100 Loading 100 blocks at a time. Number of runs = B(R)/M = 2000/100 = 20 atmost how many blocks for input block? M-1 = 99 So, M-1ways to merge External Merge-Sort • Pass 0: load M blocks in memory, sort – Result: B(R)/M sorted sublists of size M – Each sorted sublist is a run R ... Disk M blocks M blocks of main memory ... B(R)/M sorted sublists Disk 17 before merging = M after merging = M * (M-1) in 2 way merge, every time we had he merging pases doubled. in this case, it is M-1 ways to merge, so the size increases after every run is by M-1. Size of Run/#blocks = after k merging pases, we will have M (M-1)^k blocks (size of run). Question: Do we stop after k runs? Yes, when M(M-1)^k > = B(R) and, k = ceil(log_(M-1) B(R)/M), where B(R)/M = # of runs k = ceil(log_(M-1) #Runs) Pass One • Merge M – 1 runs into a new run • Result: each run has now M (M – 1) blocks 2000 blocks loaded 100 at a time —> 20 runs so, even when we have 2080 blocks, we will have 21 runs with just less blocks being loaded in the last run. Input 1 M-1 sublists ... Disk Input 2 Output .... Input M-1 Disk M blocks of main memory 18 Cost of Two-Pass, Multiway Merge Sort • Pass 0: sort B/M sublists of size M, write – Cost: 2B(R) Read + Write Cost • Pass 1: merge B/M sublists, write – Cost: 2B(R) Read + Write Cost • Total cost: 4B(R) • Assumption: B(R) <= M2 – B/M <= M – 1 or – B <= M(M-1) ~ M2 19 Generalized to k Passs • Merge every M – 1 runs into a new run • Result: each run has now M (M – 1)k blocks Input 1 M-1 sublists ... Disk Input 2 Output .... Input M-1 Disk M blocks of main memory 20 If k is the last pass • Merge M – 1 runs into a single run • We must have M (M – 1)k >= B k log M 1 B / M Input 1 M-1 sublists ... Disk Input 2 Output .... Input M-1 Disk M blocks of main memory 21 M = 5; B(R) = 108 Pass 0: Sort, load all 5 pages at a time for 108 pages. So, #runs = ceil(108/5) = 22 runs with last run loading only 3 pages. Pass 1: We need to keep going through passes until we have only 1 run left. This will be done through merging using the memory available. Buffer pages available to merge = M- 1 = 4. So, using 4 pages to merge 22 runs. Therefore, #runs_2 = ceil(22/4) = 6 runs_1 (5 runs_1 with 4 pages (runs) from pass 0 and each (run) page in pass 0 had 5 pages -> so, with 5*4 = 20 pages and 1 runs_1 with 2 runs from pass 0 and each run in pass 0 had 5 pages -> so, with 2 * 5 = 10 pages, but we only have 8 more pages left out of 108 -> so, last run will have 8 pages.) Cost of External Merge Sort Pass 2: still merging. So, 4 pages at a time and 6 runs_1 from pass 1, So, total runs_2 = ceil(6/4) = 2 runs_2. Now, we have 4 pages for each runs_2 and we have individually from pass 1, they have 20 pages. So, first run will have 80 pages and next run can have 80 pages too but need only 28 (because total 108). the 1 is added for sorting. 1 log M 1 B / M • Number of passes: • Cost = 2B * (# of passes) • E.g., with 5 buffer pages, to sort 108 page file: – – – – Pass 0: produces 108/5 = 22 runs (21 sorted runs of 5 pages each + last run of only 3 pages) Pass 1: 22/4 = 6 (5 sorted runs of 20 pages each + last run or only 8 pages) Pass 2: 2 sorted runs, 80 pages and 28 pages Pass 3: Sorted file of 108 pages Pass 0: For each run: 1 1 1 1 1 pages = 5 pages. Therefore, 21 * 5 = 105 pages + 1 * 5 = 5 pages (but need only 3) Pass 1: For each run: 1 1 1 1 pages = 4 pages. Each page: 1 1 1 1 1 pages from pass 0. So, 4 * 5 = 20 pages for each page. First 5, completely used and hence, 5 * 20 = 100 pages. Last page, not completely used - only 2 pages used. So, 1 1 pages —> each 1 1 1 1 1 pages => 2 * 5 pages = 10 pages. But need only 8 pages. Pass 2: For each run: 1 1 1 1 pages = 4 pages. Each page: 1 1 1 1 = 4 pages from pass 1. Each page in pass 1: 1 1 1 1 1 pages from pass 0. So, total each run in pass 2 has 4 * 4 * 5 = 80 pages. First page in pass 2 completely used. Second, only 28 required our of 80. 22 ...
View Full Document

  • Fall '14

What students are saying

  • Left Quote Icon

    As a current student on this bumpy collegiate pathway, I stumbled upon Course Hero, where I can find study resources for nearly all my courses, get online help from tutors 24/7, and even share my old projects, papers, and lecture notes with other students.

    Student Picture

    Kiran Temple University Fox School of Business ‘17, Course Hero Intern

  • Left Quote Icon

    I cannot even describe how much Course Hero helped me this summer. It’s truly become something I can always rely on and help me. In the end, I was not only able to survive summer classes, but I was able to thrive thanks to Course Hero.

    Student Picture

    Dana University of Pennsylvania ‘17, Course Hero Intern

  • Left Quote Icon

    The ability to access any university’s resources through Course Hero proved invaluable in my case. I was behind on Tulane coursework and actually used UCLA’s materials to help me move forward and get everything together on time.

    Student Picture

    Jill Tulane University ‘16, Course Hero Intern

Stuck? We have tutors online 24/7 who can help you get unstuck.
A+ icon
Ask Expert Tutors You can ask You can ask You can ask (will expire )
Answers in as fast as 15 minutes