10_external - !  …
we have been assuming that the data collections we have been manipulating were stored entirely in memory.
• … In practice, this is not always a reasonable assumption.
• What if we were asked to search the records of all Canadians for a particular Canadian (search key -> lastName)?
   ▪ How many records?
   ▪ Problem?
class Canadian {
    private String lastName;
    private String firstName;
    private String middleName;
    private String SIN;
    …
}
• What if we were asked to search the records of all Canadians for a particular Canadian (search key -> lastName)?
   ▪ How many records?
   ▪ How much space?
   ▪ 35 million records * 100 strings(?) per record * 20 bytes per string = approx. 70GB
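The space estimate above can be checked with a few lines of C++. This is only a sketch of the slide's arithmetic: the function name `estimatedBytes` is made up for illustration, and the three factors (35 million records, about 100 strings per record, about 20 bytes per string) are the slide's guesses, not measured values.

```cpp
#include <cstdint>

// Back-of-the-envelope size of the data set, in bytes.
// The three factors are the slide's guesses, not measured values.
constexpr std::uint64_t estimatedBytes(std::uint64_t records,
                                       std::uint64_t stringsPerRecord,
                                       std::uint64_t bytesPerString) {
    return records * stringsPerRecord * bytesPerString;
}

// 35 million records * 100 strings * 20 bytes = 70,000,000,000 bytes (about 70 GB).
static_assert(estimatedBytes(35000000, 100, 20) == 70000000000ULL,
              "the slide's estimate is about 70 GB");
```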
• Some large databases, in which records are kept in files stored on external storage such as a hard disk, cannot be read entirely into main memory.
• We refer to such data as disk-bound data.


• Hence, big datasets cannot fit in memory
• Need to keep them on hard disk ("on disk")
• Just read what we need at one time into memory
• Challenge: memory and disk access are not created equal
• Time efficiency of search for Canadian? Important factors:
• Accessing data stored in a file kept on the hard disk is extremely slow compared to accessing data in memory
   -> order of milliseconds (10^-3 s)
• In contrast, accessing data in memory is fast
   -> order of nanoseconds (10^-9 s)

• Given the million-to-1 ratio of disk access time versus memory access time, to search our 30M records efficiently, we will need to devise a way that minimizes the number of disk accesses performed.
[Slide figure: data sheets for two hard disk drive series. Recoverable performance figures:]
• 7,200 RPM 2.5-inch hard disk drives (MK8054GSY 80GB, MK1254GSY 120GB, MK1654GSY 160GB, MK2554GSY 250GB, MK3254GSY 320GB)
   ▪ Drive interface: Serial ATA Revision 2.6 / ATA-8; transfer rate to host: 3 Gb/sec
   ▪ Track-to-track seek: 1ms
   ▪ Average seek time: 10.5ms (read), 12ms (write)
   ▪ Rotational speed: 7,200 RPM
   ▪ Buffer size: 16MB
• 15,000 RPM 3.5-inch enterprise hard disk drives (MBA30732 73.5GB, MBA31472 147GB, MBA33002 300GB)
   ▪ Transfer rate to host: SAS 3 Gb/sec, SCSI 320 MB/sec, FCAL 4 Gb/sec
   ▪ Track-to-track seek: 0.2ms (read), 0.4ms (write)
   ▪ Average seek time: 3.4ms (read), 3.9ms (write)
   ▪ Rotational speed: 15,000 RPM
   ▪ Average latency: 2ms
   ▪ Buffer size: SCSI 8MB, SAS/FC 16MB
• What do those numbers mean?
• Search in a red-black tree with 35 million records?
   ▪ log2 n = approx. 25
• If dataset fits in memory
   ▪ Hundreds of nanoseconds per search
   ▪ Can handle millions of searches per second
• If dataset doesn't
   ▪ Hundreds of milliseconds per search
   ▪ Can handle only a few searches per second
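To make the two cases concrete, here is a small sketch that turns "about log2 n node visits per search" into searches per second. The function names are made up for illustration, and the per-access costs used in the note below (10 ns for memory, 10 ms for disk) are assumed round numbers in line with the orders of magnitude quoted earlier, not measurements.

```cpp
#include <cmath>
#include <cstdint>

// Approximate number of nodes visited by one search in a balanced binary tree.
// Assumes n >= 2.
inline double treeLevels(std::uint64_t n) {
    return std::log2(static_cast<double>(n));
}

// Searches per second if every node visit costs `secondsPerAccess`.
inline double searchesPerSecond(std::uint64_t n, double secondsPerAccess) {
    return 1.0 / (treeLevels(n) * secondsPerAccess);
}
```

With n = 35,000,000, `searchesPerSecond(n, 10e-9)` comes out at roughly 4 million, while `searchesPerSecond(n, 10e-3)` comes out at roughly 4.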
• Disk access is the most time-consuming operation when elements are stored in external storage (disk)
• Compared to 10 milliseconds, compute time is irrelevant
   ▪ How many operations can a CPU do in 10 milliseconds?
   ▪ @3GHz, a lot: about 30 million clock cycles
• Block: the basic unit written to/read from external storage (disk)
• If you're going to read, don't just read 1 bit
• Block size varies on each system
   ▪ E.g. could be 8KB, could be 1MB
Nodes of a binary tree can be located in different blocks on a disk.
• Random access file
   ▪ Linear data collection (like an array)
• Sequential access file
   ▪ Linear data collection (like a linked list)
Go to external_read example in Eclipse C++
• We have records for ~30M Canadians
• Assume we can't store them in memory
   ▪ So we keep them on disk
• Now we want to search for one Canadian
• How should we do it?
• We could store our 30M Canadian records in a disk file which we access randomly
• Assume each block on disk contains only 1 record
• Time efficiency to search for that Canadian?
   ▪ If our records are not sorted: linear search -> O(n)
• How fast is this in seconds?
   ▪ 30M * ___ milliseconds = ___ seconds
• We could store our 30M Canadian records in a disk file which we access randomly.
• Sort the records within the disk file (A. Aaronson at beginning of file, Z. Zygmund at end)
• Assume each block on disk contains only 1 record
• Time efficiency to search for that Canadian
   ▪ If our records are sorted: binary search -> O(log2 n)
• How fast is this in seconds?
   ▪ log2(30M) * ___ milliseconds = hundreds of milliseconds
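The two estimates (unsorted file with linear search, sorted file with binary search) can be compared with a short sketch. The helper names are hypothetical, and the 10 ms per disk access used in the note below is an assumed round figure; one record per block is the slides' own simplification.

```cpp
#include <cstdint>

// Worst-case disk accesses for a linear scan with one record per block.
inline std::uint64_t linearAccesses(std::uint64_t n) { return n; }

// Worst-case disk accesses for binary search over a sorted file with one
// record per block: floor(log2 n) + 1, computed by repeated halving.
inline std::uint64_t binaryAccesses(std::uint64_t n) {
    std::uint64_t accesses = 0;
    while (n > 0) {
        ++accesses;
        n /= 2;
    }
    return accesses;
}

// Total time in seconds, given an assumed cost per disk access in milliseconds.
inline double seconds(std::uint64_t accesses, double msPerAccess) {
    return static_cast<double>(accesses) * msPerAccess / 1000.0;
}
```

For 30,000,000 records at an assumed 10 ms per access, the linear scan needs up to 30,000,000 accesses (about 300,000 seconds, or roughly three and a half days), while binary search needs at most 25 accesses (about a quarter of a second).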
• Still need to do many disk accesses
   ▪ Array is sorted, so log(n) disk accesses
   ▪ Disk accesses are really slow
• Let's try to reduce them even further
• Main idea: split data into two files on disk
• DATA file
   ▪ Holds all information about all Canadians (our 70GB of data)
• INDEX file
   ▪ A smaller file that tells me where to find data about each Canadian
   ▪ Remember the seekg command, random access to DATA file
• INDEX file should hold entries <key, file byte>
   ▪ key is name of Canadian (or SIN)
   ▪ file byte is offset into DATA file of where the record for this Canadian starts
[Figure: INDEX file entries … <G Mori, 504> <H Mori, 206> <G Jensen, 7> <R Henderson, 1083> … pointing into the DATA file. The DATA file excerpt shows bytes 500 to 507 holding "t", "6", "4", "2", "M", "o", "r", "i", so the entry <G Mori, 504> points at the byte where "Mori" starts.]
• Index file will be smaller than data file
   ▪ File size will be?
   ▪ # Canadians * (key size + file byte size)
   ▪ Much smaller than data file if record for each Canadian is large
• In order to find data about an individual, need to find that individual's entry in index file
• So what should we do to the index file?
• Sort it, e.g. into a tree data structure
[Figure: INDEX file entries … <G Mori, 504> <H Mori, 206> <G Jensen, 7> <R Henderson, 1083> …]
• Let's assume 30 million * (key size + file byte size) is not "too big"
   ▪ I.e. it fits in memory
• Build a tree structure to store the contents of the index file in memory
   ▪ Can build it / read it from disk when the program starts
   ▪ Make it a balanced tree (e.g. red-black)
• Time efficiency to search for a record will be:
   O(log2 n) comparisons (worst case)
   (for searching the index tree and finding the desired key, hence block #)
   + 1 disk access to fetch the block, in the data file, that contains the desired record (using block # found above)
• Time efficiency to search for a particular Canadian will be:
   about 25 comparisons + 1 disk access
   ▪ Just a few milliseconds
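A sketch of the in-memory index, assuming keys are names and values are byte offsets into the DATA file. `std::map` is an ordered container with logarithmic lookup and is typically implemented as a red-black tree, so it stands in for the balanced tree here. The alias `Index` and the function `findOffset` are illustrative names only.

```cpp
#include <cstdint>
#include <map>
#include <string>

// In-memory index: key -> byte offset of the record in the DATA file.
using Index = std::map<std::string, std::int64_t>;

// Returns the record's offset, or -1 if the key is not in the index.
inline std::int64_t findOffset(const Index& index, const std::string& key) {
    auto it = index.find(key);   // O(log n) comparisons, no disk access
    return it == index.end() ? -1 : it->second;
}
```

Only after `findOffset` returns a valid offset is the DATA file touched, and then exactly once.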
• Wait a minute, 30 million * (key size + file byte size) isn't that much smaller than the data file!!
• Hmm… we can use a similar trick on the index file
• If the entire tree stored in the Index file cannot be loaded into main memory:
• Each of its nodes, stored in a block, will contain as the "location of this node's left and right subtrees" the block # of the block in the Index file containing the root of the left/right subtree.
• I.e. instead of a tree in memory with child pointers, a tree in the file with child block #s
• To perform a search:
   ▪ the block containing the root of the tree is first accessed from the Index file
   ▪ the tree search algorithm is performed on the node contained in that block
   ▪ the block # of the next tree node (block in Index file) is determined and the block containing that node is accessed
   ▪ the above two steps are repeated until the desired key is found or the bottom of the tree is reached (i.e., key not found)
   ▪ if the key is found, the data file block containing the matching record is accessed using the block # of the <key, block #> pair
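The search procedure can be sketched as follows. To keep the example self-contained, a `std::vector` stands in for the Index file and indexing into it stands in for "read block b from disk"; the struct layout and all names are hypothetical, not taken from the course code.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One Index-file block holding one binary-tree node (hypothetical layout).
struct IndexBlock {
    std::string key;
    long dataBlock;   // block # of the matching record in the data file
    long left;        // Index-file block # of the left child, -1 if none
    long right;       // Index-file block # of the right child, -1 if none
};

struct SearchResult {
    long dataBlock;        // -1 if the key was not found
    int indexBlocksRead;   // simulated disk accesses in the Index file
};

// indexFile[b] stands in for "read block b from the Index file".
inline SearchResult searchIndex(const std::vector<IndexBlock>& indexFile,
                                long rootBlock, const std::string& key) {
    SearchResult result{-1, 0};
    long block = rootBlock;
    while (block >= 0) {
        const IndexBlock& node = indexFile[static_cast<std::size_t>(block)];
        ++result.indexBlocksRead;               // one disk access per node visited
        if (key == node.key) {
            result.dataBlock = node.dataBlock;  // one more access fetches the record
            return result;
        }
        block = (key < node.key) ? node.left : node.right;
    }
    return result;                              // reached the bottom: not found
}
```

The counter makes the cost visible: every node visited is a block read, which is why the number of levels matters so much.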
• Time efficiency to search for a record will be:
   O(log2 n) disk accesses (worst case)
   + 1 disk access to fetch the block, in the data file, that contains the desired record
• Time efficiency to search for a particular Canadian will be:
   about 25 disk accesses + 1 disk access
• Wait, 25 disk accesses sounds familiar
   ▪ That was the case for good old binary search on the data file
• Let's (again) try to do better
• How can we improve search performance?
• In order to minimize the number of disk accesses, we need to minimize the number of levels in our search tree, i.e., we need to flatten our tree.
• This can be achieved by increasing the number of records each node of our search tree can deal with.
• A B Tree can help …
• Definition: m-way search tree T is a tree of order m, in which each node can have at most m children
• Binary search trees generalize directly to m-way search trees
• Purpose of m-way search tree: efficient search (hence retrieval)
• Other names given to m-way search trees are
   ▪ m-ary search trees
   ▪ multiway search trees
   ▪ n-way search trees
   ▪ n-ary search trees
• Definition: An m-way search tree T is an m-way tree (a tree of order m) such that:
   ▪ T is either empty, or
   ▪ each non-leaf node of T has at most m children (subtrees) T_0, T_1, …, T_{m-1} and m - 1 key values in ascending order: K_1 < K_2 < … < K_{m-1}
   ▪ for every key value V in subtree T_i (rules of construction):
        V < K_1                for i = 0
        K_i < V < K_{i+1}      for 1 <= i <= m-2
        V > K_{m-1}            for i = m-1
   ▪ every subtree T_i is also an m-way search tree
Example: The following is a 3-way search tree:
   root:                [16 18]
   children of root:    [4 6]   [17]   [22 26]
   children of [22 26]: [20]   [24]   [28 30]
• Search for the spot where the new element is to be inserted (using its search key) until you reach an empty subtree
• Insert the new element into the parent of the empty subtree, if there is room in the node.
• Insert the new element into the empty subtree (as a new node), if there is no room in its parent.
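The insertion rules above can be sketched in C++ as follows. The class and member names are illustrative only, keys are plain `int`s, and duplicates are simply rejected (the slides do not say how duplicates are handled). The sketch relies on an invariant of this insertion scheme: a node only acquires children once it is full, so "there is room in this node" and "we have reached an empty subtree below a node with room" coincide. It assumes m >= 2.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <vector>

// Node of an m-way search tree: up to m-1 sorted keys and m subtree slots.
struct MWayNode {
    std::vector<int> keys;                            // ascending, at most m-1
    std::vector<std::unique_ptr<MWayNode>> children;  // m slots; null = empty subtree
    explicit MWayNode(std::size_t m) : children(m) {}
};

class MWayTree {
public:
    explicit MWayTree(std::size_t m) : m_(m) {}       // assumes m >= 2

    // Returns false if the key is already present.
    bool insert(int key) {
        std::unique_ptr<MWayNode>* slot = &root_;
        while (*slot) {
            MWayNode& node = **slot;
            auto pos = std::lower_bound(node.keys.begin(), node.keys.end(), key);
            if (pos != node.keys.end() && *pos == key) return false;
            if (node.keys.size() < m_ - 1) {          // room: insert in sorted order
                node.keys.insert(pos, key);
                return true;
            }
            // Node is full: descend into subtree T_i, where i is the number of keys < key.
            slot = &node.children[static_cast<std::size_t>(pos - node.keys.begin())];
        }
        slot->reset(new MWayNode(m_));                // empty subtree: create a new node
        (*slot)->keys.push_back(key);
        return true;
    }

    bool contains(int key) const {
        const MWayNode* node = root_.get();
        while (node) {
            auto pos = std::lower_bound(node->keys.begin(), node->keys.end(), key);
            if (pos != node->keys.end() && *pos == key) return true;
            node = node->children[static_cast<std::size_t>(pos - node->keys.begin())].get();
        }
        return false;
    }

    const MWayNode* root() const { return root_.get(); }

private:
    std::size_t m_;
    std::unique_ptr<MWayNode> root_;
};
```

Inserting 18, 16, 6, 22, 26, 4, 28, 24, 20, 30, 17 into an `MWayTree(3)` yields a root holding 16 and 18, with subtrees holding 4 and 6, 17, and 22 and 26.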


• Let's construct the m-way search tree shown on the previous slide where m=3
• To do so, we shall insert the following search keys: 18, 16, 6, 22, 26, 4, 28, 24, 20, 30, 17
• Remember: the search keys (and their associated elements) are inserted in ascending sorted order in a node
• Let's begin by inserting 18:
   ▪ since the m-way tree is empty, we create the first node, i.e., the root, and insert 18

   [18]
(Trees below are written as node -> (T0, T1, T2), with - marking an empty subtree. For every insertion we search for the spot where the new element is to be inserted, using its search key, until we reach an empty subtree.)
• Let's insert 16:
   ▪ There is room in the parent of the empty subtree (the root), so 16 is inserted into it in the proper sorted order.
   [18] becomes [16 18]
• Let's insert 6:
   ▪ There is no room in the parent node, so 6 is inserted into the empty subtree.
   [16 18] becomes [16 18] -> ([6], -, -)
• Let's insert 22:
   ▪ There is no room in the parent node, so 22 is inserted into the empty subtree.
   [16 18] -> ([6], -, -) becomes [16 18] -> ([6], -, [22])
• Let's insert 26:
   ▪ There is room in the parent of the empty subtree (node [22]), so 26 is inserted into it in the proper sorted order.
   [16 18] -> ([6], -, [22]) becomes [16 18] -> ([6], -, [22 26])
• Let's insert 4:
   ▪ There is room in the parent of the empty subtree (node [6]), so 4 is inserted into it in the proper sorted order.
   [16 18] -> ([6], -, [22 26]) becomes [16 18] -> ([4 6], -, [22 26])
• Let's insert 28:
   ▪ There is no room in the parent node [22 26], so 28 is inserted into the empty subtree.
   [16 18] -> ([4 6], -, [22 26]) becomes [16 18] -> ([4 6], -, [22 26] -> (-, -, [28]))
• Let's insert 24:
   ▪ There is no room in the parent node [22 26], so 24 is inserted into the empty subtree.
   becomes [16 18] -> ([4 6], -, [22 26] -> (-, [24], [28]))
• Let's insert 20:
   ▪ There is no room in the parent node [22 26], so 20 is inserted into the empty subtree.
   becomes [16 18] -> ([4 6], -, [22 26] -> ([20], [24], [28]))
• Let's insert 30:
   ▪ There is room in the parent of the empty subtree (node [28]), so 30 is inserted into it in the proper sorted order.
   becomes [16 18] -> ([4 6], -, [22 26] -> ([20], [24], [28 30]))
• Let's insert 17:
   ▪ There is no room in the parent node (the root), so 17 is inserted into the empty subtree.
   becomes [16 18] -> ([4 6], [17], [22 26] -> ([20], [24], [28 30]))
• Definition:

A B Tree is a data collection that organizes its blocks (B) into an m-way search tree, and in addition
   ▪ the root of a B Tree has at least 2 children (unless it is a leaf node)
   ▪ and its other non-leaf nodes have at least ⌈m / 2⌉ children.
• A B Tree is built from the leaves up, rather than from the root down, and so all leaf nodes in a B Tree are on the same level.
• Hence, a B Tree is a balanced m-way tree, just as red-black trees are balanced binary search trees
• Each block contains a tree node
   ▪ m-1 <key, data file block #> pairs in a node + index file block #s as links to children/subtrees
Example of B Tree
[Figure: B-Tree of order 5 (m = 5) in which every node (except the root and the leaves) has
   •  at least ⌈5 / 2⌉ = 3 children, and
   •  no more than 5 children
Each node holds <key, block #> pairs; its children are given as block #s in the index file.]
Example: The following is a B Tree with m=4 (such B Trees are called 2-3-4 search trees)
   root:    [7 12]
   level 2: [5]   [9]   [17 20]
   leaves:  [1 3 4] [6]  |  [8] [10 11]  |  [15 16] [18] [22 23]
• Let's construct the B Tree shown on the previous slide where m=4
• To do so, we shall insert the following search keys: 12, 1, 7, 23, 20, 6, 18, 5, 4, 22, 10, 15, 8, 3, 9, 17, 11, 16
• Remember: the search keys (and their associated elements) are inserted in ascending sorted order in a node
• Actually, that B Tree is an example of a 2-3-4 search tree
• Let's begin by inserting 12:
   ▪ since the m-way tree is empty, we create the first node, i.e., the root, and insert 12
   [12]
(Trees below are written as node -> (children), listed left to right.)
• Insert 1:
   ▪ compare each key found in the root with the key 1 and since 1 < 12, move 12 over, then insert 1
   [1 12]
• Insert 7:
   ▪ compare each key found in the root with the key 7 and since 1 < 7 < 12, move 12 over, then insert 7
   [1 7 12]
• Insert 23:
   ▪ starting at the root [1 7 12], right away we encounter a full node so we split it as follows:
      create a new node (parent) and move the middle key into it
      create a sibling and move the key > 7 into it
      link the subtrees to the newly formed parent node
   [7] -> ([1], [12])
• Insert 23 (cont'd):
   ▪ starting at the root, since 7 < 23, 23 is inserted into its right subtree
   ▪ considering the root of its right subtree, since its only key 12 < 23, insert 23 after 12
   [7] -> ([1], [12 23])
• Insert 20:
   ▪ starting at the root, since 7 < 20, 20 is inserted into its right subtree
   ▪ moving on to the root of its right subtree, since 12 < 20 < 23, move 23 over, then insert 20
   [7] -> ([1], [12 20 23])
• Let's pick up the pace now…
• Insert 6:
   [7] -> ([1], [12 20 23]) becomes [7] -> ([1 6], [12 20 23])
• Insert 18:
   ▪ on our way to insert 18 we encounter a full node so we split it first:
      we move its middle key into the parent node
      we create a sibling and move the key > 20 into it
      link the newly formed rightmost subtree to the parent node
   [7 20] -> ([1 6], [12], [23])
• Insert 18 (cont'd):
   [7 20] -> ([1 6], [12 18], [23])
• Insert 5:
   [7 20] -> ([1 5 6], [12 18], [23])
• Insert 4:
   ▪ on our way to insert 4 we encounter a full node, so we split it first
   [5 7 20] -> ([1], [6], [12 18], [23])
   ▪ then insert 4
   [5 7 20] -> ([1 4], [6], [12 18], [23])
• Insert 22:
   ▪ on our way to insert 22, right away we encounter a full node so we split it first, hence creating another level
   [7] -> ([5] -> ([1 4], [6]), [20] -> ([12 18], [23]))
• Insert 22 (cont'd):
   ▪ then insert 22:
   [7] -> ([5] -> ([1 4], [6]), [20] -> ([12 18], [22 23]))
• Insert 10:
   [7] -> ([5] -> ([1 4], [6]), [20] -> ([10 12 18], [22 23]))
• Insert 15:
   ▪ on our way to insert 15, we encounter a full node, so we split it first
   [7] -> ([5] -> ([1 4], [6]), [12 20] -> ([10], [18], [22 23]))
   ▪ then insert 15:
   [7] -> ([5] -> ([1 4], [6]), [12 20] -> ([10], [15 18], [22 23]))
• Insert 8, 3, 9 and 17:
   [7] -> ([5] -> ([1 3 4], [6]), [12 20] -> ([8 9 10], [15 17 18], [22 23]))
• Insert 11:
   ▪ on our way to insert 11, we encounter a full node, so we split it first, then we insert 11
   [7] -> ([5] -> ([1 3 4], [6]), [9 12 20] -> ([8], [10 11], [15 17 18], [22 23]))
• And finally, we insert 16:
   ▪ on our way to insert 16, we encounter 2 full nodes which we split before inserting 16.
   ▪ first split, of [9 12 20]:
   [7 12] -> ([5] -> ([1 3 4], [6]), [9] -> ([8], [10 11]), [20] -> ([15 17 18], [22 23]))
   ▪ second split, of [15 17 18]:
   [7 12] -> ([5] -> ([1 3 4], [6]), [9] -> ([8], [10 11]), [17 20] -> ([15], [18], [22 23]))
   ▪ then insert 16:
   [7 12] -> ([5] -> ([1 3 4], [6]), [9] -> ([8], [10 11]), [17 20] -> ([15 16], [18], [22 23]))
• Et voilà!
• Ok, don't worry, that won't be on the exam
• Summary: another balanced tree
   ▪ But it's not binary, it's an m-way tree
   ▪ Will have far fewer levels than a binary tree
• Has similar balancing properties to red-black
   ▪ Number of levels similar to best case log(n)
• Access block from index file containing the root
• Linearly search for key in accessed block
   ▪ If found -> done!
   ▪ If not found & node (block) is leaf -> not there!
   ▪ Otherwise, determine which index file block # to access next based on rules of construction of m-way search tree
   ▪ Access that block from index file
   ▪ Repeat above step "Linearly search for key in accessed block"
• If found desired key: determine its matching block # and access that block from data file
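The search steps can be sketched as follows, again with a `std::vector` standing in for the index file so that the example is self-contained. The struct layout and names are hypothetical; a real index file would store each node in a fixed-size disk block. The sketch assumes `rootBlock` is a valid block # and that non-leaf nodes have exactly one more child than keys.

```cpp
#include <cstddef>
#include <vector>

// One index-file block = one B Tree node (hypothetical in-memory stand-in).
struct BTreeBlock {
    std::vector<int> keys;         // ascending search keys
    std::vector<long> dataBlocks;  // dataBlocks[i]: data file block # for keys[i]
    std::vector<long> children;    // index file block #s; empty for a leaf
};

struct Lookup {
    long dataBlock;   // -1 if the key was not found
    int blocksRead;   // simulated index-file disk accesses
};

inline Lookup bTreeSearch(const std::vector<BTreeBlock>& indexFile,
                          long rootBlock, int key) {
    Lookup out{-1, 0};
    long block = rootBlock;
    for (;;) {
        const BTreeBlock& node = indexFile[static_cast<std::size_t>(block)];
        ++out.blocksRead;                                // "access block from index file"
        std::size_t i = 0;
        while (i < node.keys.size() && node.keys[i] < key) ++i;   // linear search in block
        if (i < node.keys.size() && node.keys[i] == key) {
            out.dataBlock = node.dataBlocks[i];          // found: data file block # is known
            return out;
        }
        if (node.children.empty()) return out;           // leaf reached: not there
        block = node.children[i];                        // subtree chosen by rules of construction
    }
}
```

The linear scan inside a block costs no disk access; only moving to another block does, so `blocksRead` never exceeds the number of levels.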
• Assuming the entire Index file (B Tree) cannot be loaded into main memory.
• In analyzing the search time efficiency, we need to know how many levels a B Tree (accommodating 30M records) has.
• Answer:
   ▪ Assuming we are using a B Tree of order 4 to store our 30M keys (and matching block #'s) and that each node of the B Tree is filled (i.e., each node contains 3 key pairs) and that every level of our B Tree is filled, then our B Tree contains:
        (4^L - 1) key pairs, where L is the number of levels.
   ▪ Hence a data collection containing 30,000,000 data records will have
        log2(30,000,001) / log2(4), or _____ levels!
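The level count can be worked out with a small loop instead of logarithms. This sketch assumes, as the slide does, a completely filled tree: a filled B Tree of order m with L levels has (m^L - 1)/(m - 1) nodes of m - 1 key pairs each, i.e. m^L - 1 key pairs. The function name is made up, and it assumes m >= 2.

```cpp
#include <cstdint>

// Smallest number of levels L such that a completely filled B Tree of order m
// holds at least n key pairs, i.e. the smallest L with m^L - 1 >= n.
// Assumes m >= 2.
inline int levelsNeeded(std::uint64_t n, std::uint64_t m) {
    int levels = 0;
    std::uint64_t power = 1;   // m^levels
    while (power - 1 < n) {
        power *= m;
        ++levels;
    }
    return levels;
}
```

For 30,000,000 key pairs this gives 13 levels with m = 4 (log2(30,000,001) / log2(4) is about 12.4, rounded up), 25 levels with m = 2, and only 4 levels with m = 100.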
• In this example, we could increase the value of m, which would decrease the number of levels in our B Tree, hence further reduce the number of disk accesses performed during a search of our data collection containing 30M Canadians
• Good for disk-bound data
   ▪ When n is large, m can be set to a large number, which keeps the number of levels low
   ▪ Since the number of disk accesses is proportional to the number of levels in a tree, a small # of levels translates into a small number of disk accesses, and hence good time efficiency for search/insert/remove operations
   ▪ In practice, commercial databases use specialized versions of these search trees where m is of the order of 100
• Assume we inserted our 30M Canadian records into a random access disk file.
• How can we sort these records?
• Let's look at our favourite algorithms
   ▪ QuickSort
   ▪ HeapSort
   ▪ MergeSort

• QuickSort
   ▪ Find pivot
   ▪ Walk data, swapping entries greater than / less than pivot
   ▪ Is this going to work well if data are stored on disk?
• HeapSort
   ▪ Heapify data
   ▪ Call bubbleUp repeatedly
   ▪ Remove data from heap
   ▪ Is this going to work well if data are stored on disk?
• The simplest algorithm that can be used to sort disk-bound data, and one that turns out to be quite efficient, is Merge Sort.
• Recall the internal Merge Sort algorithm:
   ▪ divide the data collection into two sections of approx. equal size
   ▪ recursively apply the algorithm to sort each of the smaller sections -> sorting is done on adjacent records
   ▪ merge the sorted sections back together
• Suppose we're trying to sort 32 million records
• Suppose disk blocks hold 1 million records
   ▪ I.e. reading 1 million records is roughly as fast as reading 1
• Suppose we only have enough memory to hold 3 million records in memory at a time
• Let's see how we can MergeSort under these constraints
• Phase 1:
   ▪ Divide 32 million records into groups of 1 million
   ▪ Read each 1 million into memory in turn
   ▪ For each group i, sort and write back to disk as R_i (sorted)
   ▪ This phase can be done under our constraint
• Phase 2:
   ▪ Merge sorted groups R_1,…,R_32
   ▪ Let's see why this can be done under our constraint
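Phase 1 can be sketched like this, with a `std::vector` standing in for the unsorted disk file and a vector of vectors standing in for the sorted files R_1, R_2, … The names are illustrative, and `runSize` plays the role of "1 million records"; it is assumed to be at least 1.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using Run = std::vector<int>;   // stands in for one sorted file R_i on disk

// Phase 1 sketch: cut the "file" into groups of at most runSize records,
// sort each group in memory, and return the sorted runs R_1, R_2, ...
// Assumes runSize >= 1.
inline std::vector<Run> makeSortedRuns(const std::vector<int>& file,
                                       std::size_t runSize) {
    std::vector<Run> runs;
    for (std::size_t start = 0; start < file.size(); start += runSize) {
        std::size_t end = std::min(file.size(), start + runSize);
        Run run(file.begin() + static_cast<std::ptrdiff_t>(start),
                file.begin() + static_cast<std::ptrdiff_t>(end));
        std::sort(run.begin(), run.end());   // one group fits in memory by assumption
        runs.push_back(run);
    }
    return runs;
}
```

Only one group is held in memory at a time, which is why this phase fits within the 3 million record limit.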
• Recall constraint: only 3 million records in memory at a time, blocks are 1 million
• We need to merge 32 sorted files R_1,…,R_32, each with 1 million records
• First level merge is easy:
   ▪ Merge R_1 and R_2 into R_{1,2}, R_3 and R_4 into R_{3,4}, …
   ▪ Each merge only requires 2 million records
• What about the second level merges?
   ▪ That is, merging R_{1,2} and R_{3,4}
• Suppose we are merging one sorted 8 million record file with another
• Only need memory for 3 million records!
   ▪ Read 1 million records (a block) from file 1
   ▪ Read 1 million records (a block) from file 2
   ▪ Allocate memory for 1 million records for output
   ▪ Start merging
   ▪ Once the output is full, write it to disk
   ▪ Once a file input block is finished, read another
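The merge step can be sketched as follows. Vectors stand in for the two sorted input files and the output file; `blockSize` plays the role of "1 million records" and is assumed to be at least 1. At any moment the function holds at most one block from each input plus one output block, which is the three-block budget described above. All names are illustrative.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using Run = std::vector<int>;   // stands in for a sorted file on disk

// Reads the next block of up to blockSize records from `file`, starting at `pos`.
inline Run readBlock(const Run& file, std::size_t& pos, std::size_t blockSize) {
    std::size_t end = std::min(file.size(), pos + blockSize);
    Run block(file.begin() + static_cast<std::ptrdiff_t>(pos),
              file.begin() + static_cast<std::ptrdiff_t>(end));
    pos = end;
    return block;
}

// Merges two sorted runs while holding at most three blocks in memory:
// one input block per run plus one output block. Assumes blockSize >= 1.
inline Run mergeRuns(const Run& a, const Run& b, std::size_t blockSize) {
    Run merged;                       // stands in for the output file
    Run out;                          // output buffer (one block)
    std::size_t posA = 0, posB = 0;   // positions in the input files
    Run bufA = readBlock(a, posA, blockSize);
    Run bufB = readBlock(b, posB, blockSize);
    std::size_t i = 0, j = 0;         // positions inside the input buffers

    auto flush = [&]() {              // "write the output block to disk"
        merged.insert(merged.end(), out.begin(), out.end());
        out.clear();
    };
    auto emit = [&](int value) {
        out.push_back(value);
        if (out.size() == blockSize) flush();
    };

    while (i < bufA.size() || j < bufB.size()) {
        bool takeA = j >= bufB.size() || (i < bufA.size() && bufA[i] <= bufB[j]);
        if (takeA) {
            emit(bufA[i++]);
            if (i == bufA.size()) { bufA = readBlock(a, posA, blockSize); i = 0; }
        } else {
            emit(bufB[j++]);
            if (j == bufB.size()) { bufB = readBlock(b, posB, blockSize); j = 0; }
        }
    }
    flush();                          // write the last, partially filled block
    return merged;
}
```

The same function handles every merge level: the input files grow from level to level, but the memory in use stays at three blocks.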
• QuickSort is O(n log n)
   ▪ But how many disk reads will it require?
   ▪ O(n log n)
• External MergeSort is O(n log n)
   ▪ But how many disk reads will it require?
   ▪ O((n/B) log(n/B))
   ▪ Where B is the number of records in a block
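A rough count of block reads for the external MergeSort described earlier, under stated assumptions: two-way merging, initial sorted runs of exactly one block each, phase 1 reading every block once, and every merge pass reading every block once. The function name is made up for illustration; it assumes `recordsPerBlock` is at least 1.

```cpp
#include <cstdint>

// Block reads for a two-way external MergeSort whose initial sorted runs are
// one block each: n/B reads in phase 1, plus n/B reads per merge pass,
// with about log2(n/B) merge passes. Assumes recordsPerBlock >= 1.
inline std::uint64_t mergeSortBlockReads(std::uint64_t n,
                                         std::uint64_t recordsPerBlock) {
    std::uint64_t blocks = (n + recordsPerBlock - 1) / recordsPerBlock;  // ceil(n / B)
    std::uint64_t passes = 0;
    for (std::uint64_t runs = blocks; runs > 1; runs = (runs + 1) / 2) {
        ++passes;                     // each pass halves the number of runs
    }
    return blocks + passes * blocks;  // phase 1 + one full read per merge pass
}
```

For 32 million records with 1 million records per block, this gives 32 reads in phase 1 plus 5 merge passes of 32 reads each, 192 block reads in total.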
• How to handle big datasets?
   ▪ Big = do not fit in memory
• Disk access is slow
   ▪ Minimize number of disk accesses algorithms perform
• Searching
   ▪ Index files and data files
   ▪ Can access index file from disk too if it's too big
• Sorting
   ▪ Use MergeSort
• C++: Ch. 14
• Java: Ch. 15

This note was uploaded on 04/17/2010 for the course CMPT 11151 taught by Professor Gregorymori during the Spring '10 term at Simon Fraser.
