friend ostream& operator << (ostream&, BTree&);
protected:
    int order;                // order of tree
    Page<Key, Data> *root;    // root of the tree
    Page<Key, Data> *bufP;    // buffer page for distribution/merging
    virtual Item<Key, Data>* SearchAux (Page<Key, Data> *tree, Key key);
bandwidth needed by all these units. As mentioned above, the Emotion Engine
designers chose many dedicated memories. The CPU has a 16-KB scratchpad memory
(SPRAM) in addition to a 16-KB instruction cache and an 8-KB data cache. VPU0 has
a 4-KB instruction
Given these goals, what should be the size of the caches? The
SPEC2000 results in Figure 5.17 on page 413 suggest miss rates of 0.5% for a
1-MB data cache, with infinitesimal instruction cache miss rates at those sizes. It would
seem that a 1-
In addition to prefetching data, UltraSPARC III has a small instruction prefetch buffer
of 32 bytes that tries to stay one block ahead of the instruction decoder. On an instruction
cache miss, two blocks are requested: one for the instruction cache and
Figure 5.53 shows the cumulative average instruction misses per thousand instructions for five inputs to a single SPEC2000 program. For these inputs, the average
miss rate for the first 1.9 billion instructions is very different from their average
Even the venerable 80x86 line is showing danger signs, with Intel justifying
migration to IA-64 in part to provide a larger flat address space than 32 bits, and AMD
proposing its own 64-bit address extension called x86-64.
As we expected, by this third ed
Pitfall: Delivering high memory bandwidth in a cache-based system.
Caches help with average memory latency but may not deliver high memory
bandwidth to an application that needs it. Figure 5.55 shows the top ten results from
the Stream benchmark a
case diagnosePack: /* ... */ break;
class DimsDontMatch {};
class BadRow {};
class BadCol {};
class Matrix {
// do a binary search on items of a page
// returns true if successful and false otherwise
template <class Key, class Data>
Bool Page<Key, Data>::BinarySearch (Key key, int &idx)
{
    int low = 0;
    int high = used - 1;
    int mid = (low + high) / 2;
// recursively print a page and its subtrees
template <class Key, class Data>
void Page<Key, Data>::PrintPage (ostream& os, const int margin)
{
    // build the margin string:
    for (int i = 0; i <= margin; ++i)
        margBuf[i] = ' ';
    margBuf[margin + 1] = '\0';
if (page->BinarySearch(item->KeyOf(), idx))
    return 0;                            // already in tree
if ((child = page->Right(idx)) != 0)
    item = InsertAux(item, child);       // child is not a leaf
if (item != 0) {                         // page is a leaf, or passed up
    if (page->Used() < 2 * order) {
Underflow(page, child, idx, underflow);
// delete an item and deal with underflows by borrowing
// items from neighboring pages or merging two pages
template <class Key, class Data>
void BTree<Key, Data>::DeleteAux2 (Page<Key, Data> *parent,
became time to grow page sizes with later Alphas, the operating system designers
balked and the virtual memory system was revised to grow the address space while
maintaining the 8-KB page.
Architects of other computers noticed very high TLB miss rates, an
16-KB two-way set-associative unified cache using write back.
32-KB direct-mapped unified cache using write back.
Assume the memory latency is 40 clocks, the transfer rate is 4 bytes per clock cycle, and that 50% of
the transfers are dirty. There are 3
cache using direct mapping could consistently outperform one using fully associative mapping with
 <5.8> Explain why this would be possible. (Hint: You can't explain this with the three C's
model because it ignores replacement policy.)
tions in the expression above are by powers of two, they can be replaced by binary shifts
(a very fast operation).
The address is now small enough to find the modulo by looking it up in a read-only
memory (ROM) to get the bank number.
Finally, we are r
there is a potential conflict in the addresses.
Assume a 64-KB direct-mapped cache for data and a 64-KB direct-mapped cache for instructions with a block size of 32 bytes. The CPI of the CPU is 1.5 with a perfect memory system
and it takes 14 clocks on a
processor performance growth, driven by the microprocessor, was at its highest rate since the first
transistorized computers in the late 1950s and early 1960s.
On balance, though, your authors believe that parallel processors will definitely have a bigg
A Taxonomy of Parallel Architectures
We begin this chapter with a taxonomy so that you can appreciate both the breadth of design
alternatives for multiprocessors and the context that has led to the development of the
dominant form of multiprocessors. We b