CSE 721
Winter 2009
Sample Problems for Final Examination
1. Consider 4096-processor systems with the following topologies: i) 3D torus (with wraparound, bidirectional links), ii) hypercube (with bidirectional links). Order the systems with respect to
CSE721HW2
S M Faisal January 26, 2011
Answer to question number 1 (Problem 4.3)
(a) The standard ring algorithm with p processors runs in p - 1 steps, where each processor sends a message to its neighbor in a fixed direction (clockwise or anticlock wi
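The p - 1 step count quoted above can be checked with a small simulation. The following is a minimal sketch in C, not the author's solution: the names `ring_all_to_all`, `ring_complete`, and `MAX_P` are hypothetical, and messages are represented only by the index of the processor that originated them.

```c
#include <assert.h>
#include <string.h>

#define MAX_P 64  /* hypothetical upper bound chosen for this sketch */

/* Simulates all-to-all broadcast on a ring of p processors.
 * have[i][k] becomes 1 once processor i holds the message that
 * originated at processor k. Returns the number of steps taken. */
int ring_all_to_all(int p, unsigned char have[MAX_P][MAX_P])
{
    int outgoing[MAX_P];  /* origin of the message each processor forwards next */
    int incoming[MAX_P];  /* origin of the message each processor receives */
    int steps = 0;

    assert(p >= 1 && p <= MAX_P);
    memset(have, 0, sizeof(unsigned char) * MAX_P * MAX_P);
    for (int i = 0; i < p; i++) {
        have[i][i] = 1;   /* every processor starts with its own message */
        outgoing[i] = i;
    }
    for (int s = 0; s < p - 1; s++) {
        /* Each processor sends to its neighbor in one fixed direction. */
        for (int i = 0; i < p; i++)
            incoming[(i + 1) % p] = outgoing[i];
        /* Each processor stores what it received and forwards it next step. */
        for (int i = 0; i < p; i++) {
            have[i][incoming[i]] = 1;
            outgoing[i] = incoming[i];
        }
        steps++;
    }
    return steps;
}

/* Returns 1 if every processor holds every message, 0 otherwise. */
int ring_complete(int p, unsigned char have[MAX_P][MAX_P])
{
    for (int i = 0; i < p; i++)
        for (int k = 0; k < p; k++)
            if (!have[i][k])
                return 0;
    return 1;
}
```

Calling `ring_all_to_all(8, have)` on a `MAX_P` x `MAX_P` array returns 7, after which `ring_complete(8, have)` reports that all processors hold all eight messages.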
CSE 721
Programming Assignment 1
Due 2/28/2011, 3:00 PM. For this assignment, you are to create versions of the following two codes using SSE intrinsics. Submit via Carmen; be sure to include source code listings and report on performance. A single file mus
10/11/10
Agenda
Intel SSE SIMD Instructions, Administrivia, Technology Break, SSE in C
Fall 2010  Lecture #18
Single Instruction/Multiple Data Stream
Single Instruction, Multiple Data streams (SIMD)
Computer that exploits multiple data strea
Floating Point Operations and Streaming SIMD Extensions
Advanced Topics Spring 2009 Prof. Robert van Engelen
SIMD Short Vector Extensions
Using SIMD short vector extensions can result in large performance gains
Instruction set extensions execute fast Ne

Arash Ashari 200105422 CSE721 Programming Assignment #1 Winter 2011
1) The following shows my code, in which I first unrolled the loops and then used SSE: static inline void mul4x4sse(float *A, float *B, float *C) { int i, j; __m128 AA, BB, CC1,
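Since the listing above is cut off, the following is a separate minimal sketch of a 4x4 single-precision multiply with SSE intrinsics. It is not the author's `mul4x4sse`: the name `mul4x4_sketch`, the row-major layout, and the scalar fallback for compilers without SSE are all assumptions made for illustration.

```c
#include <assert.h>

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_mul_ps, ... */
#define HAVE_SSE_SKETCH 1
#else
#define HAVE_SSE_SKETCH 0
#endif

/* C = A * B for 4x4 row-major float matrices.
 * Assumption of this sketch: C does not alias A or B. */
void mul4x4_sketch(const float *A, const float *B, float *C)
{
#if HAVE_SSE_SKETCH
    /* Load the four rows of B once; unaligned loads avoid alignment requirements. */
    __m128 b0 = _mm_loadu_ps(B + 0);
    __m128 b1 = _mm_loadu_ps(B + 4);
    __m128 b2 = _mm_loadu_ps(B + 8);
    __m128 b3 = _mm_loadu_ps(B + 12);

    for (int i = 0; i < 4; i++) {
        /* Row i of C is a linear combination of the rows of B,
         * weighted by the four elements of row i of A. */
        __m128 row = _mm_mul_ps(_mm_set1_ps(A[4 * i + 0]), b0);
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(A[4 * i + 1]), b1));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(A[4 * i + 2]), b2));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(A[4 * i + 3]), b3));
        _mm_storeu_ps(C + 4 * i, row);
    }
#else
    /* Scalar fallback with the same result, used when SSE is unavailable. */
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            float sum = 0.0f;
            for (int k = 0; k < 4; k++)
                sum += A[4 * i + k] * B[4 * k + j];
            C[4 * i + j] = sum;
        }
    }
#endif
}
```

Broadcasting one element of A with `_mm_set1_ps` and multiplying it by a whole row of B produces four output elements per multiply, which is one common way to vectorize a small row-major product; other layouts (for example, transposing B first) are equally plausible.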
Ohio State University
CSE 721 Programming Assignment 1: SSE Intrinsics
Author: Josh Mahaffey
February 27, 2011
Answer to Question 1:
Matrix-matrix multiplication typically requires a double-nested for-loop. Consider the multiplication of two input matric
CSE 721
Winter 2011
Term Project
Due 3/15/2011. For the term project, please identify a topic in one of the following four classes. Comparative performance study of two (or more if you wish) systems, using a number of benchmark kernels and/or applications.
Performance of Parallel Systems
The execution time of a parallel program is influenced by many factors
communication latency, idle times, load imbalance, synchronization overhead, ...
Measuring performance on parallel systems is not trivial
simplest mea
Comments NOTE: 1. The answers to Question 4.6 in both "standard answers" are not totally correct. Here is my answer to this question. If you have any questions, please tell me: I. There are two kinds of answers: (ts + m * tw)(p - 1) or (ts + m * tw * p / 2) *
Basic communication operations
Possible variants
# of nodes involved
Point-to-point vs. collective operation
routing scheme
Store-and-forward (S&F) and cut-through (CT)
Usually point-to-point is implemented in hardware, collective in software. Many
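The difference between the two routing schemes can be made concrete with the usual textbook cost models. The sketch below is illustrative only: `ts` and `tw` follow the startup and per-word notation used elsewhere in these documents, while the per-hop time `th`, the hop count `hops`, and the function names are assumptions introduced here.

```c
#include <assert.h>

/* Store-and-forward: every intermediate node receives the whole m-word
 * message before forwarding it, so the per-word cost is paid on each hop.
 *   T = ts + hops * (m * tw + th) */
double sf_time(double ts, double th, double tw, double m, int hops)
{
    assert(hops >= 0);
    return ts + hops * (m * tw + th);
}

/* Cut-through: the message is pipelined through the intermediate nodes,
 * so the per-word cost is paid only once.
 *   T = ts + hops * th + m * tw */
double ct_time(double ts, double th, double tw, double m, int hops)
{
    assert(hops >= 0);
    return ts + hops * th + m * tw;
}
```

With ts = 10, th = 1, tw = 2, m = 4 words, and 3 hops, store-and-forward costs 10 + 3 * (8 + 1) = 37 time units, while cut-through costs 10 + 3 + 8 = 21.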
Dense matrix algorithms
We will first study algorithms involving dense matrices (as opposed to sparse matrices) A very important issue is how to map a matrix onto processors
the combination of proper mapping and efficient algorithm is performance critic
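As one concrete example of a mapping, the sketch below assigns contiguous blocks of matrix rows to processors (a 1-D block distribution). This is only one of several possible mappings and is not taken from the slides; the function names and the choice of a row-wise distribution are assumptions.

```c
#include <assert.h>

/* First row owned by processor `rank` when n rows are split over p processors. */
long block_low(long rank, long p, long n)
{
    assert(p > 0 && rank >= 0 && rank <= p);
    return rank * n / p;
}

/* Number of rows owned by processor `rank`; sizes differ by at most one. */
long block_size(long rank, long p, long n)
{
    return block_low(rank + 1, p, n) - block_low(rank, p, n);
}

/* Processor that owns row `i`, consistent with block_low above. */
long block_owner(long i, long p, long n)
{
    assert(n > 0 && i >= 0 && i < n);
    return (p * (i + 1) - 1) / n;
}
```

For n = 10 rows and p = 4 processors, the blocks start at rows 0, 2, 5, and 7, so the processors hold 2, 3, 2, and 3 rows respectively.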
Overview of the Global Arrays Parallel Software Development Toolkit
Bruce Palmer, Manoj Kumar Krishnan,
Sriram Krishnamoorthy, Abhinav Vishnu, Daniel Chavarria, Patrick Nichols, Jeff Daily
Distributed Data vs Shared Memory
Distributed Data:
Data is explici
CSE 721
Winter 2011
Homework 1
Due 9:00am, Tuesday, 1/18/2011 (submit via Carmen) 1. (10 points) Problem 2.3 from text 2. (20 points) Problem 2.10 from text 3. (15 points) Problem 2.11 from text 4. (15 points) Consider an Omega network with P = 2^p inputs
CSE721 Winter 2011 Homework 2 Naser Sedaghati (200116698)
Problem 1: 4.3 from text
1) Run time for case a): Standard all-to-all broadcast time on the ring: T = (P - 1)(ts + m * tw) 2) Run time for case b): Considering all-to-all algorithm of hypercube implem
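The case a) formula can be evaluated directly. The helper below is a minimal sketch with a hypothetical name; it covers only the ring expression stated above and makes no attempt to reconstruct the truncated case b).

```c
#include <assert.h>

/* All-to-all broadcast time on a ring of p processors:
 *   T = (p - 1) * (ts + m * tw)
 * ts: message startup time, tw: per-word transfer time, m: message size in words. */
double ring_all_to_all_time(int p, double ts, double tw, double m)
{
    assert(p >= 1);
    return (p - 1) * (ts + m * tw);
}
```

For example, with p = 8, ts = 10, tw = 2, and m = 4, the time is 7 * (10 + 8) = 126 time units.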
CSE 721
Winter 2011
Homework 2
Due 1/26/2011, 3:30pm (Submit via Carmen) 1. (10 points) Problem 4.3 from text 2. (15 points) Problem 4.5 from text 3. (15 points) Problem 4.6 from text 4. (15 points) Problem 4.7 from text 5. (15 points) Problem 4.16 from t
CSE721 Homework 3
Due on Tuesday, 1/18/2011
Thanks to Xin Huo, S M Faisal
Problem 1 (5.3)
1. Compute the degree of concurrency. Degree of concurrency is the maximum number of tasks that can be executed in parallel. a) 2^(n-1) b) 2^(n-1) c) n d) n 2. Compute t
CSE 721
Winter 2011
Homework 3
Due 2/2/2011, 3:30pm (Submit via Carmen) 1. (25 points) Problem 5.3 from text 2. (25 points) Problem 5.5 from text 3. (25 points) Problem 5.9 from text 4. (25 points) Problem 8.3 from text