


CSEP 524: Parallel Computation, Winter 2013 (Chamberlain)

Slide 15 (partial; the preview begins mid-slide):

    ... = write reg        (end of Task 1's code)

    Task 2
    reg = read A[1]
    reg = -reg
    A[1] = write reg

Slide 16: Example: Competition For Memory

array negation using a cyclic distribution:

    [diagram: cache line size shown spanning the tasks' reg and L1/L2 caches above a shared L3]

    Task 1                 Task 2
    reg = read A[0]        reg = read A[1]
    reg = -reg             reg = -reg
    A[0] = write reg       A[1] = write reg

Slide 17: Definition: False Sharing

False Sharing: when cache lines must be invalidated not because two tasks are accessing the same data, but because they are accessing data on the same cache line
  - in reality, the data is truly independent, hence "false"
  - the granularity at which data is stored within the hardware is what causes the interdependence ("sharing")
  - NOTE: on cache-coherent architectures, this is a performance issue, not a correctness issue
  - ("true sharing" might be the term for when two tasks actually access the same shared variable/data)

Slide 18: False Sharing Implications for Assignment #1

- Writing to an array using a cyclic distribution can result in performance impacts due to false sharing
  - possible fixes:
    - have each task start its cyclic iteration from a skewed position
      - e.g., have task t start from element t + t*n/p
      - but this results in more complex loop idioms due to the need to wrap around
    - use padding/alignment pragmas to spread out the array data
      - but this results in wasted space

Slide 19: Performance Gotcha #1: Memory

Issue #2: Memory is a bottleneck
  - typically, processors increase in speed faster than memory does
  - having multiple processors share memory doesn't help
    - there are only so many wires to access memory
    - cache coherence protocols also add overhead/complexity

    [diagram: a node's memory hierarchy: memory, per-socket L3, and per-core L2 and L1 caches]
Slide 20: Performance Gotcha #1: Memory

Issue #2: Memory is a bottleneck
  - algorithms with more computational intensity can better amortize these memory overheads

    [diagram: a node's memory hierarchy: memory, per-socket L3, and per-core L2 and L1 caches]

Slides 21-22: Definition: Computational Intensity

Computational Intensity: how much computation is performed per memory access
  - high computational intensity: lots of OPS per load/store => memory performance is less of an issue
  - low computational intensity: few OPS per load/store => memory performance is more of an issue

Slide 23: Mem. Performance Implications for Assignment #1

- Computations with greater computational intensity should result in better speedup
  - e.g., factorial should speed up better than negation

Slide 24: Performance Gotcha #2: Load Balance

Negation + Ramp: Computational Inten...