ass3 - ACA Assignment 3 Name Siva Maddali SUID 213328832...


ACA Assignment 3
Name: Siva Maddali
SUID: 213328832

Reading Assignment

As processors have advanced over time, the demand for better utilization of processor resources has increased, and many techniques have been proposed to meet it. Simultaneous multithreading (SMT) is one of the most promising of these techniques. Given the enormous transistor budget expected in the next computer era, the authors believe SMT provides an efficient base technology that can be used in many ways to extract improved performance. In SMT processors, thread-level parallelism can come from either multithreaded, parallel programs or individual, independent programs in a multiprogramming workload.

Simultaneous multithreading combines hardware features of wide-issue superscalars and multithreaded processors. From superscalars, it inherits the ability to issue multiple instructions each cycle; like multithreaded processors, it contains hardware state for several programs (or threads). The paper compares results for three processor types: superscalar, multithreaded, and simultaneous multithreading.

After instruction decoding, the register-renaming logic maps the architectural registers to the hardware renaming registers to remove false dependencies. Instructions are then fed to either the integer or floating-point dispatch queues. When their operands become available, instructions are issued from these queues to their corresponding functional units. Only two components, the instruction fetch unit and the processor pipeline, were redesigned to benefit from SMT's multithread instruction issue. The dynamic scheduling core is similar to that of the MIPS R10000.

SMT's implementation parameters will, of course, change as technologies shrink. In their simulations, the authors targeted an implementation that they expected to be realizable approximately three years from the time of writing:
• an eight-instruction fetch/decode width;
• six integer units, four of which can load (store) from (to) memory;
• four floating-point units;
• 32-entry integer and floating-point dispatch queues;
• hardware contexts for eight threads;
• 100 additional integer renaming registers;
• 100 additional floating-point renaming registers; and
• retirement of up to 12 instructions per cycle.

The larger (workload-wide) working set of the multiprogramming workload should stress the shared structures in an SMT processor (for example, the caches, TLB, and branch prediction hardware) more than the largely identical threads of the parallel programs, which share both instructions and data. The authors chose the programs in the multiprogramming workload from the Spec95 and Splash2 benchmark suites. The paper's comparative study across different workloads shows that SMT throughput increases as the number of threads increases. It also shows that SMT has large performance advantages over the other processor configurations evaluated in the paper, such as MP4.

A)
a) When an unconditional branch is found in the buffer, the penalty in clock cycles is -1.

b) When the buffer stores only the target address, a fetch is still required to get the new instruction, so a hit costs 0 cycles and a miss costs 2 cycles:
CPI contribution of the branch = 5% × (90% × 0 + 10% × 2) = 0.01
When the buffer stores the target instruction, a hit has the penalty of -1 cycle from part a) and a miss still costs 2 cycles:
CPI contribution of the branch = 5% × (90% × (-1) + 10% × 2) = -0.035
Comparing the two results, we can conclude that storing the target instruction reduces the CPI (by 0.045).
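The two branch CPI terms in part A(b) follow one formula: branch frequency × (hit rate × hit penalty + miss rate × miss penalty). The following is a minimal sketch of that arithmetic, assuming the 5% branch frequency, 90% hit rate, and 2-cycle miss penalty used above, a 0-cycle hit penalty when the buffer stores the target address, and the -1-cycle hit penalty from part a) when it stores the target instruction. The function and variable names are illustrative only.

```python
def branch_cpi_contribution(branch_freq, hit_rate, hit_penalty, miss_penalty):
    """CPI added (or removed, if negative) by one class of branches.

    branch_freq  -- fraction of instructions that are such branches
    hit_rate     -- fraction of those branches found in the buffer
    hit_penalty  -- cycles lost on a buffer hit (may be negative)
    miss_penalty -- cycles lost on a buffer miss
    """
    miss_rate = 1.0 - hit_rate
    return branch_freq * (hit_rate * hit_penalty + miss_rate * miss_penalty)


# Assumed case 1: buffer stores the target address, so a hit costs 0 cycles.
address_buffer = branch_cpi_contribution(0.05, 0.90, 0, 2)

# Assumed case 2: buffer stores the target instruction, so a hit costs -1 cycle.
instruction_buffer = branch_cpi_contribution(0.05, 0.90, -1, 2)

# Difference between the two designs in CPI.
improvement = address_buffer - instruction_buffer
```

Under these assumptions the first term is positive and the second is negative, so the gain from storing the target instruction is the difference between the two.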
B)
a) To normalize the loop (i.e., to start the index at 1 and increment it by 1), start the index at 1, make the condition i<=50, and increment by 1. The answer is therefore:

    for (int i = 1; i <= 50; i++)
    {
        a[2*i] = a[100*i+1];
    }

Now apply the GCD test to the above loop.

GCD Test: To decide whether there is a loop-carried dependence (two array references access the same memory location and one of them is a write operation) between a[x*i+k] and a[y*i+m], one usually needs to solve the equation x*i1 + k = y*i2 + m (or x*i1 - y*i2 = m - k), where 0 <= i1, i2 < n and i1 != i2. If GCD(x,y) divides (m-k), then a dependence may exist in the loop statement.

Here GCD(2,100) = 2 and m-k = 1. Since 2 does not divide 1, the GCD test fails, and there is no dependence in the above loop.

b)
    for (i = 2; i <= n; i += 2)
        a[i] = a[i] + a[i + n/2];

Normalizing the above loop:

    for (i = 1; i <= n/2; i++)
    {
        a[2*i] = a[2*i] + a[2*i+n/2];
    }

Testing for loop-carried dependence: GCD(2,2) = 2 and m-k = n/2, so 2 divides m-k when n is a multiple of 4. In that case the loop may have a loop-carried dependence between a[2*i] and a[2*i+n/2]. The two a[2*i] references, by contrast, always refer to the same element regardless of the value of n, but they do so within the same iteration, so that dependence is not loop-carried.

D) Speculation and prediction are of less value in embedded computers. Both involve guessing whether or not a branch will be taken, which requires additional hardware support and substantial resources such as memory. Because embedded processors are low-cost, these resources cannot be provided. Desktop computers can support speculation and prediction because they have access to far more resources, such as larger amounts of memory. With speculation, the processor guesses the outcome of an event rather than waiting for the outcome to be known. If it misspeculates, it pays a penalty, because the affected instructions must be executed again, which is a costly operation.
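Returning to the GCD test used in part B, the check can be sketched in a few lines. This is a minimal illustration, not a full dependence analyzer: `gcd_test` and `has_loop_carried_conflict` are hypothetical helper names, and the brute-force function exists only to compare the conservative GCD answer against an exhaustive search over an assumed inclusive iteration range.

```python
from math import gcd


def gcd_test(x, k, y, m):
    """Return True if a[x*i + k] and a[y*i + m] may touch the same element.

    The test is conservative: True means a dependence is possible,
    False means x*i1 + k == y*i2 + m has no integer solution at all.
    Assumes x and y are not both zero.
    """
    return (m - k) % gcd(x, y) == 0


def has_loop_carried_conflict(x, k, y, m, lo, hi):
    """Exhaustively search iterations lo..hi (inclusive) with i1 != i2."""
    return any(
        x * i1 + k == y * i2 + m
        for i1 in range(lo, hi + 1)
        for i2 in range(lo, hi + 1)
        if i1 != i2
    )


# Part a): a[2*i] written, a[100*i + 1] read, i = 1..50.
part_a_possible = gcd_test(2, 0, 100, 1)

# Part b) with an assumed n = 8: a[2*i] written, a[2*i + 4] read, i = 1..4.
part_b_possible = gcd_test(2, 0, 2, 4)
part_b_actual = has_loop_carried_conflict(2, 0, 2, 4, 1, 4)
```

For part b) with an assumed n that is not a multiple of 4, such as n = 6, the call `gcd_test(2, 0, 2, 3)` rules the dependence out, which matches the divisibility argument above.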
E)
Processor     | Microarchitecture                                       | Fetch/Issue/Execute | Func. Units | Clock Rate (GHz) | Transistors and die size          | Power
Intel Core i7 | Nehalem; it has SMT, Hyper-Threading, Atom architecture | 3/3/4               | 6 int, 3 FP | 3.0              | 731 million transistors / 263 mm sq | 130 W
IBM Power 7   | RISC architecture; it has SMT                           | 3/3/4               | 6 int, 4 FP | 4.4              | 1.2 billion / 567 mm sq chip      | 80 W (estimated)

References: -8-core-power7-twice-the-muscle-half-the-transistors.ars

F) The advantage of a superscalar implementation is that it executes multiple instructions per clock cycle with a standard instruction set. It exploits instruction-level parallelism (ILP) by building multiple pipelines, and it achieves higher performance without the problems of superpipelining. The disadvantage of a superscalar implementation without multithreading support is that the use of issue slots is limited by a lack of ILP. With a superscalar implementation it is also difficult to decide which operations can be done concurrently and which ones have to wait and be done sequentially after others.

Advantages of the simultaneous multithreading approach:
--> it minimizes the architectural impact on the conventional superscalar design
--> it has minimal performance impact on a single thread executing alone
--> it achieves significant throughput gains when running multiple threads

Disadvantages:
--> Area increases quadratically with the core's complexity.
--> Cycle time increases because of interconnect delays; wire delay dominates the delay of the CPU's critical path.
--> It is possible to make simpler clusters, but this results in a deeper pipeline and an increased branch misprediction penalty.
--> Design verification cost is high, due to complexity and the single processor.
--> Large demand on the memory system.
C) Given

    int A[80], B[80], C[80], D[80];
    for (i = 0 to 40)
    {
        A[i] = B[i] * D[2*i];      // s1
        C[i] = C[i] + B[2*i];      // s2
        D[i] = 2*B[2*i];           // s3
        A[i+40] = C[2*i] + B[i];   // s4
    }

In order to write the above code as two threads that can execute in parallel, we split the set of statements inside the loop. Statements s1 and s3 are dependent on each other because both access array D: s3 writes D[i] and s1 reads D[2*i] to calculate A[i]. Statements s2 and s4 are dependent on each other because both access array C: s2 writes C[i] and s4 reads C[2*i] to calculate A[i+40]. Neither pair touches the other pair's array, so we can split the given loop into two loops that can be executed in parallel as two different threads, i.e.

    for (i = 0 to 40)
    {
        A[i] = B[i] * D[2*i];      // s1
        D[i] = 2*B[2*i];           // s3
    }

    for (i = 0 to 40)
    {
        C[i] = C[i] + B[2*i];      // s2
        A[i+40] = C[2*i] + B[i];   // s4
    }
...
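The loop split in part C can be checked with a small sketch that runs the original loop sequentially, runs the two split loops as separate threads, and compares the arrays. Several things here are assumptions rather than part of the assignment: the bound "0 to 40" is treated as exclusive (i = 0..39) so that D[2*i] and A[i+40] stay inside the 80-element arrays, the input values are arbitrary test data, and all function names are hypothetical.

```python
import threading

N = 40  # assumed exclusive upper bound, keeps every index below 80


def original_loop(A, B, C, D):
    """Run the four statements in their original interleaved order."""
    for i in range(N):
        A[i] = B[i] * D[2 * i]       # s1
        C[i] = C[i] + B[2 * i]       # s2
        D[i] = 2 * B[2 * i]          # s3
        A[i + 40] = C[2 * i] + B[i]  # s4


def loop_s1_s3(A, B, D):
    """First thread: the statements that access D (writes A[0..39], D[0..39])."""
    for i in range(N):
        A[i] = B[i] * D[2 * i]       # s1
        D[i] = 2 * B[2 * i]          # s3


def loop_s2_s4(A, B, C):
    """Second thread: the statements that access C (writes C[0..39], A[40..79])."""
    for i in range(N):
        C[i] = C[i] + B[2 * i]       # s2
        A[i + 40] = C[2 * i] + B[i]  # s4


def run_split(A, B, C, D):
    """Execute the two split loops as two threads and wait for both."""
    t1 = threading.Thread(target=loop_s1_s3, args=(A, B, D))
    t2 = threading.Thread(target=loop_s2_s4, args=(A, B, C))
    t1.start()
    t2.start()
    t1.join()
    t2.join()


def make_inputs():
    """Arbitrary illustrative data for A, B, C, D (80 elements each)."""
    return (
        [0] * 80,
        list(range(80)),
        [3 * k + 1 for k in range(80)],
        [7 - k for k in range(80)],
    )
```

Calling `original_loop` and `run_split` on separate copies from `make_inputs()` should leave A, C, and D identical, because the two threads write disjoint elements and neither reads anything the other writes.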

  • Fall '09
  • nj
  • Central processing unit, Simultaneous multithreading, GCD test, floating-point dispatch queues, loop GCD Test, multiprogramming workload
