Today's Lecture
Thursday Jan 22, 2008
Dependability: MTTF, etc.
Quantitative Principles of Computer Design
Taking Advantage of Parallelism
Principle of Locality
Focus on the Common Case
Amdahl's Law
The Processor Performance Equation
Measuring Performan
The numbers in this template are fabricated. They are here only to fill the tables.
Execution time of Matrix multiplication
Reference C implementation (single thread): 10.123456 sec
OpenCL on CPU: Normal kernel, Tiled kernel
OpenCL on GPU: Normal kernel, Tiled
CSE 420 Computer Architecture Spring 2017
Programming assignment 3 (100 points)
In this assignment, you will study and implement two dynamic branch predictors: a tournament predictor and
an O-GEHL predictor. You also have to compare them with existing predictor i
Branch Prediction in Simplescalar
Spring 2017
Computer Science & Engineering Department
Arizona State University
Tempe, AZ 85287
Dr. Yann-Hang Lee
yhlee@asu.edu
(480) 727-7507
CSE 420, Spring 2017
Simulator Suite
Sim-Fast
300 lines
functional
No timing
Si
The miss-rate numbers in this template are fabricated. They are here only to fill the tables and graphs.
You may change the style of the graphs and tables as you see fit.
Q1
DESCRIPTION OF CACHE IMPLEMENTATION IN SIMPLESCALAR
DESCRIPTION OF CACHE IMPLEMEN
CSE 420 Computer Architecture Spring 2017
Programming assignment 4 (100 points)
In this assignment, you will execute matrix multiplication on different devices (CPU or GPU) using
OpenCL. The goal of this assignment is to compare the performance of single
CSE 420 Computer Architecture Spring 2017
Programming assignment 2 (100 points)
In this assignment, you are asked to add two cache replacement policies to the SimpleScalar simulator. The
two policies are partial LRU and SRRIP. Here are the steps you need to
2017 SPRING/CSE 420: Matrix Multiplication Using AVX and SSE
1. Brief Introduction of AVX and SSE
This exercise is intended to help you understand a recent instruction set extension called Advanced
Vector Extensions, or AVX, that makes the CP
2.11 Consider the usage of critical word first and early restart on L2 cache misses. Assume a 1 MB L2
cache with 64-byte blocks and a refill path that is 16 bytes wide. Assume that the L2 can be written with
16 bytes every 4 processor cycles, the time to
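As a rough illustration of the figures stated before the exercise text cuts off, the sketch below counts how many refill-path transfers one block needs and how many cycles the L2 takes to absorb them. The function names are hypothetical, and the remaining parameters of the exercise are deliberately left out because they are not shown here.

```cpp
#include <cassert>

// Number of refill-path transfers needed to move one cache block.
// Assumption: the block size is an exact multiple of the refill width.
int transfersPerBlock(int blockBytes, int refillBytes) {
    assert(refillBytes > 0 && blockBytes % refillBytes == 0);
    return blockBytes / refillBytes;
}

// Cycles the L2 needs to write one whole block, given the cycles per transfer.
int cyclesToWriteBlock(int blockBytes, int refillBytes, int cyclesPerTransfer) {
    return transfersPerBlock(blockBytes, refillBytes) * cyclesPerTransfer;
}
```

With the values given above, transfersPerBlock(64, 16) is 4 and cyclesToWriteBlock(64, 16, 4) is 16.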
Q1: 1.7(a, b)
Your company's internal studies show that a single-core system is sufficient for the demand on your processing power;
however, you are exploring whether you could save power by using two cores.
a. <1.9> Assume your application is 80% parallel
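The rest of part (a) is cut off, so the following is only a sketch of the Amdahl's Law speedup bound for a given parallel fraction and core count; it does not attempt the power question itself. The function name amdahlSpeedup is hypothetical.

```cpp
#include <cassert>

// Amdahl's Law: speedup = 1 / ((1 - p) + p / n),
// where p is the parallel fraction and n is the number of cores.
double amdahlSpeedup(double parallelFraction, int cores) {
    assert(parallelFraction >= 0.0 && parallelFraction <= 1.0 && cores >= 1);
    return 1.0 / ((1.0 - parallelFraction) + parallelFraction / cores);
}
```

For an 80% parallel application on two cores, amdahlSpeedup(0.8, 2) evaluates to 1 / (0.2 + 0.4), which is about 1.67.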
CSE 420 Computer Architecture Spring 2017
Programming assignment 1 (100 points)
Task 1: memory hierarchy performance measurement
Write a C program running on Linux to access (read, write) memory in different patterns (linear,
random), and measure the avera
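One possible shape for such a measurement is sketched below. It is written in C++ rather than the C the task asks for, covers reads only, and every name in it (averageReadNs, linearOrder, randomOrder) is hypothetical. It is an illustration under those assumptions, not the assignment's required method.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Index order 0, 1, ..., n-1 (linear pattern).
std::vector<std::size_t> linearOrder(std::size_t n) {
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    return order;
}

// The same indices shuffled with a fixed seed (random pattern).
std::vector<std::size_t> randomOrder(std::size_t n, unsigned seed) {
    std::vector<std::size_t> order = linearOrder(n);
    std::mt19937 rng(seed);
    std::shuffle(order.begin(), order.end(), rng);
    return order;
}

// Average time per read, in nanoseconds, when visiting `data` in the
// sequence given by `order`. Every index in `order` must be < data.size().
double averageReadNs(const std::vector<int>& data,
                     const std::vector<std::size_t>& order) {
    assert(!order.empty());
    volatile long long sink = 0;  // keeps the compiler from removing the loop
    auto start = std::chrono::steady_clock::now();
    for (std::size_t idx : order) {
        sink = sink + data[idx];
    }
    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(order.size());
}
```

Reading the index vector adds memory traffic of its own, so the figures are only useful for comparing the two patterns over arrays much larger than the caches, not as absolute latencies.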
3.7 <2.1> Computers spend most of their time in loops, so multiple loop iterations are great places to speculatively find
more work to keep CPU resources busy. Nothing is ever easy, though; the compiler emitted only one copy of that loop's
code, so even th
Introduction to SimpleScalar
SimpleScalar (http://www.simplescalar.com) is an open source computer
architecture simulator. SimpleScalar is a set of tools that model a virtual computer
system with CPU, Cache and Memory Hierarchy. Using the SimpleScalar tool
4.13 <4.4> Assume a GPU architecture that contains 10 SIMD processors. Each SIMD instruction has a width of 32
and each SIMD processor contains 8 lanes for single-precision arithmetic and load/store instructions, meaning
that each non-diverged SIMD instru
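The relationship between SIMD width and lane count stated above can be sketched as follows. The function name is hypothetical, and the sketch assumes each lane processes one element per cycle, which the truncated text does not confirm.

```cpp
#include <cassert>

// Cycles needed to issue one non-diverged SIMD instruction when `width`
// elements are processed `lanes` at a time (rounded up).
// Assumption: each lane handles one element per cycle.
int cyclesPerSimdInstruction(int width, int lanes) {
    assert(width > 0 && lanes > 0);
    return (width + lanes - 1) / lanes;
}
```

With a width of 32 and 8 lanes, cyclesPerSimdInstruction(32, 8) is 4.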
Examples of Tomasulo's Algorithm
Spring 2017
Computer Science & Engineering Department
Arizona State University
Tempe, AZ 85287
Dr. Yann-Hang Lee
yhlee@asu.edu
(480) 727-7507
This set of slides is based on the lecture material at
https://people.eecs.berkele
The miss-rate and IPC numbers in this template are fabricated. They are here only to fill the tables and
graphs. You may add more tables if they help your analysis of branch predictor and
benchmark characteristics, such as branch lookup cou
#include <iostream>
using namespace std;

int *readNumbers(int n) {
    int *num = new int[n];
    for (int i = 0; i < n; i++) {
        cin >> num[i];
    }
    return num;
}

void printNumbers(int *numbers, int length) {
    for (int i = 0; i < length; i++) {
        cout << i << " " << numbers[i] << endl;
    }
}
bool pa
Chapter 5
Large and Fast: Exploiting Memory Hierarchy
5.1 Introduction
Memory Technology
Static RAM (SRAM)
0.5ns - 2.5ns, $2000 - $5000 per GB
50ns - 70ns, $20 - $75 per GB
5ms - 20ms, $0.20 - $2 per GB
Access time of SRAM
Capacity and cost/GB of disk
Dynamic RAM (DR
CSE 420/598 Exams are
FINAL EXAM REVIEW
Preliminary Version
3 May 2009
SPRING 2009
3pm section: Thursday May 7, 2009, 12:10-2:00 pm, BYAC 150
6pm section: Thursday May 7, 2009, 2:30-4:20 pm, BYAC 150
Closed book except one 8 x 5" colored card of handwritten no
EECS 252 Graduate Computer Architecture Lec 4 Memory Hierarchy Review
David Patterson
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~pattrsn
http://www-inst.eecs.berkeley.edu/~cs252
Review from
CSE 420/598 FALL 1994
FINAL EXAM
NAME
Put Name on Every Page
1. (15) Performance
Assume the 5-stage pipeline with a structural hazard shown below. Suppose that data references constitute 25% of the mix and that ideal CPI of the pipelined machine, ignoring
CSE 420/598 SPRING 1992
FINAL EXAM
NAME
Put Name on Every Page
Questions 1 thru 10 are worth 3 points each. Pick the best answer for each question based on the material covered in class and the text. Hardware exploits parallelism dynamically and software
CSE 420/598 SPRING 2009
MIDTERM 1 3PM CLASS
NAME
Put Name on Every Page
5. (25) Control Hazards (Relevant question)
The figure below is from your text on scheduling the branch-delay slot. The top picture in each pair shows the code before scheduling, and th
CSE 420/598 SPRING 2009
MIDTERM 1 6PM CLASS
NAME
Put Name on Every Page
1. (20) Processor Performance Equation and CPI (Not a relevant question)
Given that the CPIs for the instruction types are:
ALU operation: 1
Load/store: 3
Branch: 4
Use the data from the two ta
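The tables this question refers to are not shown here, so the sketch below only illustrates how per-type CPIs combine into an average CPI. The instruction mix used in the usage note is invented for illustration and is not the exam's data; the function name is hypothetical.

```cpp
#include <cassert>

// Average CPI as the mix-weighted sum of the per-type CPIs given in the
// question: ALU operation = 1, load/store = 3, branch = 4.
// The three fractions are expected to sum to 1.
double averageCpi(double aluFraction, double loadStoreFraction, double branchFraction) {
    assert(aluFraction >= 0.0 && loadStoreFraction >= 0.0 && branchFraction >= 0.0);
    return 1.0 * aluFraction + 3.0 * loadStoreFraction + 4.0 * branchFraction;
}
```

For a hypothetical mix of 50% ALU, 30% load/store and 20% branch instructions, averageCpi(0.5, 0.3, 0.2) is 0.5 + 0.9 + 0.8, or about 2.2. Execution time then follows from the processor performance equation: instruction count times CPI times clock cycle time.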
#include <iostream>
using namespace std;

int *readNumbers(int n) {
    int *num = new int[n];
    for (int i = 0; i < n; i++) {
        cin >> num[i];
    }
    return num;
}

void printNumbers(int *numbers, int length) {
    for (int i = 0; i < length; i++) {
        cout << i << " " << numbers[i] << endl;
    }
}
int sum
#include <iostream>
using namespace std;

int *readNumbers(int n) {
    int *num = new int[n];
    for (int i = 0; i < n; i++) {
        cin >> num[i];
    }
    return num;
}

void printNumbers(int *numbers, int length) {
    for (int i = 0; i < length; i++) {
        cout << i << " " << numbers[i] << endl;
    }
}
bool pa
#include <iostream>
using namespace std;

int *readNumbers(int n) {
    int *num = new int[n];
    for (int i = 0; i < n; i++) {
        cin >> num[i];
    }
    return num;
}

void printNumbers(int *numbers, int length) {
    for (int i = 0; i < length; i++) {
        cout << i << " " << numbers[i] << endl;
    }
}
delete n