Course: CMPMSI 08, Fall 2009
School: Utah
Extending the Scalability of Single Chip Stream Processors with On-chip Caches

Ali Bakhoda and Tor M. Aamodt
University of British Columbia, Vancouver, BC, Canada
{bakhoda,aamodt}@ece.ubc.ca

Abstract

As semiconductor scaling continues, more transistors can be put onto the same chip despite growing challenges in clock frequency scaling. Stream processor architectures can make effective use of these additional resources for appropriate applications. However, it is important that programmer effort be amortized across future generations of stream processor architectures. Current industry projections suggest a single chip may be able to integrate several thousand 64-bit floating-point ALUs within the next decade. Future designs will require significantly larger, scalable on-chip interconnection networks, which will likely increase memory access latency. While the capacity of the explicitly managed local store of current stream processor architectures could be enlarged to tolerate the added latency, existing stream processing software may require significant programmer effort to leverage such modifications. In this paper we propose a scalable stream processing architecture that addresses this issue. In our design, each stream processor has an explicitly managed local store model backed by an on-chip cache hierarchy. We evaluate our design using several parallel benchmarks to show the trade-offs of various cache and DRAM configurations. We show that addition of a 256KB L2 cache per memory controller increases the performance of our 16, 64 and 121 node stream processor designs (containing 128, 896, and 1760 ALUs, respectively) by 14.5%, 54.9% and 82.3% on average, respectively. We find that even those applications that utilize the local store in our study benefit significantly from the addition of L2 caches.

1 Introduction

Despite challenges to clock frequency scaling, continued reductions in process technology feature sizes are enabling manufacturers to put ever more transistors on a single chip. This combination has led the semiconductor industry towards architectures that expose greater amounts of parallelism to software. Stream processor architectures [18, 4, 1, 12] dedicate more chip area to ALUs than superscalar processors and are very effective for workloads that contain large amounts of data-level parallelism. Assuming that current scaling trends continue, in less than ten years it will be possible to put thousands of 64-bit floating point ALUs on a single chip. Even for applications with abundant parallelism, keeping thousands of ALUs busy will be a challenging task, as we show in this paper.

[Figure 1. Potential performance assuming zero latency memory system compared to naive scaling of cores. Y-axis: IPC (0 to 1600); series: naive scaling, potential performance; configurations: 4x4 (90nm), 8x8 (32nm), 11x11 (22nm).]

Figure 1 shows that, as process technology scales and therefore the number of ALUs per chip increases, the gap between potential IPC and the IPC achieved via naive scaling widens. The potential IPC presented in the figure was measured assuming a perfect memory system incurring zero penalty cycles to access memory (along with the hardware configuration shown in Table 1). The naive scaling numbers were measured assuming scaling is achieved by simply replicating stream processor cores similar to those employed in contemporary hardware [12] (footnote 1) and connecting them using a mesh interconnection network without any L2 caches for non local-store memory accesses.

Footnote 1: We consider stream processor cores of the granularity of a Streaming Multiprocessor (SM) in the GeForce 8 Series [12].

Although stream processors utilize silicon more effectively, they will only outperform traditional superscalar processors if the application contains plentiful data-level parallelism and is rewritten in a streaming model language.
The significant effort this entails implies that maintaining backward compatibility among future generations of stream processors is essential to amortize programmer effort.

Recent graphics processing units (GPUs) are among the most successful and widely available stream processors. NVIDIA's Compute Unified Device Architecture (CUDA) and ATI's Close To the Metal (CTM) are two programming models that enable users to run data parallel kernels on recent generations of GPUs without the need to employ graphics programming interfaces, as was typical for early approaches to General-Purpose computation on GPUs (GPGPU). Although GPUs are primarily designed for graphics applications, their shader cores resemble streaming processors. Both NVIDIA and ATI allow users to run compute kernels on the GPU's shader cores. Writing a functionally correct application is relatively easy in these environments. The challenging part is optimizing the application to take advantage of all the potential computing power the GPU has to offer. Even beyond the well known challenge of finding parallelism, there are others. For example, on the NVIDIA GeForce 8 Series, the programmer must manually change the application to avoid bank conflicts when accessing the local store, since conflicting accesses can degrade performance. Achieving this can require significant programmer effort (the CUDA development environment helps by providing tools for identifying such conflicting accesses [15]).

In this paper we study the scalability issues of a SIMD stream architecture. We use a mesh network as the interconnection fabric and then use an on-chip cache hierarchy to improve the performance. The contributions of this paper are:

- We propose a multi-level cache design specially suited for stream processors.
- We quantify the effects of cache size, DRAM bandwidth and other parameters on stream processor performance.

The rest of this paper is organized as follows.
In Section 2 we discuss background information about interconnection network design and memory controller limitations in stream processors. In Section 3 we discuss our proposed memory hierarchy and scalable stream processor architecture. Our experimental methodology is described in Section 4, and Section 5 presents and analyzes results. Section 6 reviews related work and Section 7 concludes the paper.

2 Background

Single chip stream processors typically contain several stream processing cores as well as several memory controllers integrated on a single chip. In this paper we call the stream processing cores shader cores, named after GPU shader cores (e.g., an SM on GeForce 8 Series hardware [12]), which are SIMD stream processors. Shader cores and memory controllers are connected using an interconnection network that is a critical component since all memory accesses go through it. In this section, we first discuss several trade-offs for designing on-chip interconnection networks. Then we discuss the limitations of on-chip memory controllers and naive SIMD pipeline width scaling.

2.1 On-chip interconnection networks

There are various options to design the interconnection network of a stream processor. Here we briefly discuss full crossbar, ring and mesh networks. A full crossbar provides high bandwidth, low latency and minimum latency variation. A crossbar's cost and area increase quadratically as a function of the number of nodes connected by the crossbar. Although a crossbar's cost and area are reasonably economical in small configurations, they quickly become prohibitive as the number of nodes increases [6]. A ring interconnect is used in the Cell processor [17] and ATI R600 [13]. Although a ring interconnect is not as area demanding as a crossbar, its throughput and latency degrade as the number of nodes increases. The simulations in [2] also confirm that a 2D mesh provides higher throughput and lower latency than a ring when a network has multiple hot spot nodes (nodes with higher traffic demands). Memory controllers are the hot spots [24] in a stream processor chip with integrated memory controllers. Mesh networks are inherently scalable, but they result in variable and long latencies compared to crossbars. Since our primary goal in this paper is scalability, we opt for a mesh network. To cope with the side effects of the mesh network such as increased latency, we consider applying microarchitecture techniques to keep all the processing units busy by trying to ensure they are not starved for data. As we will show, addition of caches helps by reducing the load on the interconnect. We leave a more detailed comparison with other network topologies for future work.

2.2 Available memory controllers

Another restriction for all future processors is the number of available memory controllers per chip. According to ITRS projections [9] the number of pins per chip will not increase at the same rate as the number of transistors per chip (pin count is projected to increase at a rate of roughly 10% per year). Additionally, the number of memory controllers (total off-chip memory bandwidth) has a direct and non-negligible effect on total system cost. Inevitably, the ratio of processing units to memory controllers is going to increase in future stream processors. Consequently, pressure on the memory system will increase, necessitating techniques to both increase off-chip bandwidth per pin and to reduce the average number of off-chip accesses per ALU operation.
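The scaling argument of Section 2.1 can be made concrete with a small sketch. It is not taken from the paper's simulator: the crosspoint count used as a stand-in for crossbar cost, and the function names, are assumptions chosen for illustration. The sketch compares quadratic crossbar growth with the link count and average hop distance of a k x k mesh.

```python
def crossbar_crosspoints(nodes: int) -> int:
    """Assumed cost proxy: a full crossbar needs one crosspoint per
    (input, output) pair, so cost grows quadratically with node count."""
    return nodes * nodes


def mesh_links(k: int) -> int:
    """Bidirectional links between adjacent routers in a k x k 2D mesh:
    k rows of (k - 1) horizontal links plus k columns of (k - 1) vertical links."""
    return 2 * k * (k - 1)


def mesh_average_hops(k: int) -> float:
    """Average Manhattan distance over all ordered pairs of distinct nodes,
    i.e. the mean hop count under uniform random traffic."""
    coords = [(x, y) for x in range(k) for y in range(k)]
    total = sum(
        abs(ax - bx) + abs(ay - by)
        for (ax, ay) in coords
        for (bx, by) in coords
    )
    pairs = len(coords) * (len(coords) - 1)
    return total / pairs


if __name__ == "__main__":
    # The three mesh sizes studied in the paper: 4x4, 8x8 and 11x11.
    for k in (4, 8, 11):
        n = k * k
        print(k, n, crossbar_crosspoints(n), mesh_links(k),
              round(mesh_average_hops(k), 2))
```

Under these assumptions the crossbar proxy grows with the square of the node count while the mesh link count grows roughly linearly, at the price of a hop count that rises with k. This is the latency trade-off the section describes.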
2.3 SIMD Width

One way to scale the total number of ALUs in future process technologies is to increase the number of processing elements in each shader core's SIMD pipeline (in other words, designing fatter shader cores). This approach has drawbacks. SIMD stream processors, such as NVIDIA's GeForce 8 series, group threads into warps for scheduling purposes [15]. A warp is a collection of threads that execute together in SIMD fashion on the hardware. The number of threads in a warp is equal to a multiple of the number of processing elements in the SIMD pipeline in the shader core. A warp is formed when a compute kernel is dispatched to the stream processor, and afterwards all the threads in a warp execute together. The advantage of grouping scalar threads into warps is the reduction in area associated with SIMD hardware. Increasing the number of threads grouped together, or warp size, exposes a major performance limitation of current graphics hardware, which is branch divergence [7]. As shown by Fung et al., increasing warp size from 8 to 16 using contemporary approaches decreases throughput by roughly 30% [7] on a set of non-graphics applications (similar to those we study in this paper). Furthermore, although it is possible to write applications that are aware of the warp size of the hardware they are executing upon (which is necessary on current hardware when synchronization operations are employed [15]), backward compatibility could be maintained trivially in future designs by increasing the number of shader cores instead of increasing the warp size. However, this option increases the number of nodes in the interconnection network, thus requiring a scalable interconnect such as a mesh.
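The interaction between warp size and branch divergence can be illustrated with a toy model. This is a sketch under stated assumptions, not the mechanism evaluated in [7]: it assumes that a warp whose threads disagree on a branch issues the instruction once per distinct outcome, with every lane occupied on each pass. The function name and the example branch pattern are hypothetical.

```python
def lane_utilization(outcomes, warp_size):
    """Fraction of issued SIMD lane slots that do useful work for one
    branch-dependent instruction.

    outcomes  -- per-thread branch outcome (True = taken), in thread-ID order
    warp_size -- number of consecutive threads grouped into one warp

    Assumption: a warp with mixed outcomes is serialized into one pass per
    distinct outcome, and each pass occupies all warp_size lanes.
    """
    useful = 0
    issued = 0
    for start in range(0, len(outcomes), warp_size):
        warp = outcomes[start:start + warp_size]
        passes = len(set(warp))      # 1 if uniform, 2 if the warp diverges
        issued += passes * warp_size
        useful += len(warp)          # each thread does useful work once
    return useful / issued


if __name__ == "__main__":
    # Hypothetical pattern: 8 threads take the branch, the next 8 do not.
    pattern = [True] * 8 + [False] * 8
    print(lane_utilization(pattern, 8))   # warps of 8 never diverge here
    print(lane_utilization(pattern, 16))  # one warp of 16 diverges
```

With this pattern, warps of 8 stay fully utilized while a single warp of 16 spends half of its lane slots idle. It shows in miniature why growing the warp size can reduce throughput.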
[Figure 2. Streaming processor architecture overview. Shader cores (each with a SIMD pipeline, local store and L1 cache) connect through the interconnection network to memory controllers, each with an L2 cache and off-chip GDDR3 DRAM.]

3 Design and Implementation

In this section we first describe our baseline architecture, which scales to future process technologies by simply increasing the number of shader cores and using a scalable mesh interconnection network. Then we describe our modified architecture, which incorporates second level caches at the memory controllers to enhance scalability.

3.1 Baseline Architecture

Figure 2 depicts a high level view of the stream processor architecture we consider. The portions labeled L2 $ correspond to the proposed hardware changes. The stream processor is treated as a co-processor onto which a CPU can offload highly data-parallel compute kernels. The stream processor consists of several compute nodes (labeled Shader Core in Figure 2) and memory nodes (labeled Memory Controller). Each shader core has a warp size and SIMD width of 16, and uses a seven-stage, in-order pipeline. The processing elements in a shader core share a low latency 16KB local store that is explicitly managed by the programmer. Each shader core also has a private L1 cache to back up the local store. This cache is implicitly managed by hardware, similar to a traditional L1 cache. Each on-chip memory controller interfaces a single GDDR3 DRAM chip module with 4 banks (footnote 2). The on-chip interconnection network can be designed in various ways. As discussed in Section 2, we use a mesh network to achieve scalability. We show that it performs reasonably well for the massively parallel benchmarks that we study.

Thread scheduling is performed with zero overhead on a fine-grained, per-cycle basis. Each cycle, a warp that is ready for execution is selected by the warp scheduler and issued to the SIMD pipelines.
We use a scheduling policy called DFIFO [7], which is basically a round robin technique, but if there is a cache miss in a particular warp, then that warp is not considered for scheduling and the next ready warp is selected (i.e., the round robin order can change). All the threads in a given warp execute the same instruction with different data values simultaneously in all pipelines. Whenever any thread inside a warp faces a long latency operation such as a cache miss, all the threads in the warp are taken out of the scheduling pool until the long latency operation is over. Meanwhile, other threads that are not waiting are sent to the pipeline for execution. Since there are many threads running in the same shader core, long latency operations can be tolerated to some extent.

Footnote 2: GDDR3 stands for Graphics Double Data Rate 3 [21]. Graphics DRAM is typically optimized to provide higher peak data bandwidth.

We use the immediate post-dominator (PDOM) [7] mechanism to allow for diverging control flow among threads within a given warp. This mechanism employs a stack to allow separate traversal of different control flow paths when the threads in a single warp wish to take different paths following a conditional branch. Fung et al. [7] describe the PDOM mechanism in more detail.

It must be noted that using the local store is crucial to achieve high performance on GPUs. Using the local store alleviates the DRAM bandwidth bottleneck that many applications generally face [20]. In our designs the local store is backed up by an on-chip cache hierarchy to mitigate the effects of increased latencies incurred by the interconnection network and the extra pressure on the memory system. Each shader core also includes a 32 KB data cache for memory accesses to the non local-store global address space (footnote 3). We find this baseline architecture scales well for applications with very high arithmetic intensity (ratio of arithmetic operations to memory accesses), but less well for others.
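The warp issue policy described above can be sketched as a round robin queue that skips warps waiting on a long latency operation. This is a simplified illustration of the behaviour stated in the text, not the DFIFO implementation of [7]; the class and method names are hypothetical.

```python
from collections import deque


class WarpScheduler:
    """Round robin warp selection that skips stalled warps.

    A simplification: a real scheduler would also track per-warp readiness
    due to pipeline hazards, which this sketch ignores.
    """

    def __init__(self, num_warps: int):
        self.order = deque(range(num_warps))  # current round robin order
        self.stalled = set()                  # warps waiting on a cache miss

    def stall(self, warp_id: int) -> None:
        """Remove a warp from the scheduling pool (e.g. on a cache miss)."""
        self.stalled.add(warp_id)

    def wake(self, warp_id: int) -> None:
        """Return a warp to the scheduling pool once its data has arrived."""
        self.stalled.discard(warp_id)

    def issue(self):
        """Pick the next ready warp for this cycle, or None if all are stalled."""
        for _ in range(len(self.order)):
            warp_id = self.order[0]
            self.order.rotate(-1)             # move the head to the back
            if warp_id not in self.stalled:
                return warp_id
        return None
```

For example, with four warps, stalling warp 1 after warp 0 has issued makes the scheduler issue warps 2, 3 and 0 next. Warp 1 is picked up again only after `wake(1)` is called.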
Next we consider how to add additional cache capacity to reduce interconnection network bandwidth requirements.

Footnote 3: These memory accesses correspond to global memory access in CUDA, e.g., ld.global and st.global in the PTX instruction set [16].

3.2 Extending Stream Processor Scalability

Increasing cache capacity can reduce memory bandwidth requirements for applications that contain sufficient locality. We propose to achieve this by adding a second level, shared cache to the design described in Section 3.1. A shared cache can be advantageous in that some threads may not require as much capacity at any given time as others, and furthermore there may be inter-thread temporal locality among threads since they are from the same application. For our study, none of the benchmarks require communication between threads operating on different shader cores, and all inter-thread communication for threads scheduled on a particular shader core occurs via the local store. Hence we do not model any cache coherence protocol for our L1 or L2 caches (we leave this to future work). We locate each bank of the L2 cache with a memory controller. The L2 caches are simple single-port set associative caches. Each L2 cache only caches the data that maps to its corresponding memory controller and DRAM. The trade-off of L2 cache addition will be discussed in Section 5. We will show that relatively small L2 caches can substantially improve streaming application performance. Note that since L2 caches are located on the memory controller side of the network, accesses to L2 caches must always traverse the network.

[Figure 3. Layout of 8x8 configuration (shaded areas are memory controllers)]

4 Methodology

To model the architecture described in this paper we extended the simulator used by Fung et al. [7], which models various aspects of a massively parallel architecture with highly programmable pipelines similar to contemporary GPU architectures. The version of this simulator used in this study employs the SimpleScalar PISA [3] instruction set to simulate scalar threads. The timing simulator accounts for how these would be grouped into warps by the hardware we model. Table 1 shows the simulator's configuration. Rows that have multiple entries show different configurations that we have simulated. We used a detailed interconnection network model to simulate the mesh network. The interconnection network of the simulator is an extension of the simulator introduced in [6] and is highly configurable. Table 2 shows the interconnection configuration used in our simulations.

The benchmark applications used for this study were selected from SPLASH-2 [23] and several CUDA programs published by NVIDIA [14]. Six benchmarks (BIN, CON, IMD, MAT, SCN and SOB) use the local store and three benchmarks (BLK, BIT and LU) do not use the local store.

The simulator's programming model is similar to that of CUDA (footnote 4). A computing kernel is invoked by a spawn instruction, which signals the SimpleScalar out-of-order core to launch a predetermined number of threads for parallel execution on the GPU simulator. Note that all the spawned threads for a specific kernel are running the same group of instructions on different data elements, although they might take different control paths along the way. Our simulator supports thread blocks and synchronization instructions inside the blocks. Every time a compute kernel is spawned, all the threads inside that kernel are divided into groups of threads called blocks (footnote 5). All the threads in a block are assigned to a single shader core for running. Communication and synchronization among different blocks are not supported, but threads inside each block can communicate either through the fast local store or main memory with the aid of barrier instructions (footnote 6). All the threads that reach a barrier instruction wait for the rest of the threads in their block to catch up and, when all of the threads reach the barrier, they resume execution again.

Footnote 4: Currently we must undergo several manual steps to prepare the benchmarks for running in our simulator, which limits the number of benchmarks used in this study. In this study, we convert CUDA applications to C with meta information placed in a configuration file used to simulate the effect of launching a CUDA grid onto a GPU (this approach was used in [7] as well).

Footnote 5: The user specifies the number of threads in each block and hardware sets a maximum number of threads per block.

Table 1. Hardware Configuration

  Number of Shader SIMD Cores          8 / 56 / 110
  Warp Size                            16
  Number of Threads per Shader Core    256
  Local Storage per Shader Core        16KB
  Number of Memory Channels            8 / 8 / 11
  GDDR3 Memory Timing                  tCL=9, tRP=13, tRC=34, tRAS=21, tRCD=12, tRRD=8
  Bandwidth per Memory Module          8 / 16 / 32 (Bytes/Cycle)
  DRAM queue capacity                  32 requests
  Memory Controller                    out of order (FR-FCFS) [19]
  L1 Data Cache Size (per core)        32KB, 2-way set assoc., LRU
  L1 Data Cache Hit Latency            3 cycle latency (pipelined, 1 access/cycle/SIMD pipeline)
  Branch Handling Method               Post Dominator [7]
  Warp Issue Heuristic                 DFIFO among ready warps [7]
  L2 Cache Size (per Mem Controller)   256KB / 512KB / 1MB
  L2 Cache Parameters                  8-way set assoc., 64B lines, LRU

Table 2. Interconnect Configuration

  Virtual channels                     4
  Virtual channel buffers              16
  Virtual channel allocator            islip
  Alloc iters                          1
  Credit delay                         1
  Routing delay                        1
  VC alloc delay                       1
  Input speedup                        2
  Flit size                            32

Table 3. Benchmark abbreviations

  Benchmark                            Abr.
  Black Scholes option pricing         BLK
  Binomial Options                     BIN
  Bitonic Sort                         BIT
  Convolution Separable                CON
  Image Denoising                      IMD
  LU                                   LU
  Matrix Multiply                      MAT
  Scan Large Array                     SCN
  Sobel Filter                         SOB

5 Performance Evaluation

To evaluate how well the designs introduced in Section 3 scale, we simulated 3 different configurations. The first configuration consists of 8 shader cores and 8 memory controllers: a configuration which is intended to represent contemporary architectures.
The second configuration has 56 shader cores and 8 memory controllers. This configuration has 896 ALUs (or streaming processors [12]), which can be integrated on a single chip by 2013 according to ITRS [9] projections. Our third configuration has 110 shader cores and 11 memory controllers. This would incorporate 1760 ALUs (streaming processors) and could be built by 2016. According to the ITRS semiconductor roadmap [9], process technology should reach 32nm and 22nm by 2013 and 2016, respectively. The number of ALUs for future generation GPUs is based on a linear extrapolation of the number of ALUs on an NVIDIA 8800 series GPU, assuming constant area and overheads per shader core.

Footnote 6: Barrier instructions are specified by the user in the body of the kernel and operate similarly to CUDA's __syncthreads() operation [15].

In our design the memory controllers are physically distributed over the chip. Figure 3 shows the physical layout of the memory controllers in our 8x8 configuration as shaded areas. A similar layout is applied for the 4x4 and 11x11 configurations. These layouts are based on the assumption that it is possible to place pads all over the chip area, so that there is no need for extra wires to connect the memory controllers to the chip's circumference.

Table 3 shows the benchmarks we used for simulations along with the abbreviations that are used for each benchmark in the figures. We ran BLK and BIT for 100 million instructions and the rest of the benchmarks for 1 billion instructions.

Figure 4 shows the effect of increasing the number of cores on IPC. The first bar shows the maximum IPC for each benchmark given perfect memory (all memory accesses hit in the L1 cache). The NoL2 bar shows the performance of the device without any L2 cache, and L2 shows the IPC results when a 256KB L2 cache is added per memory controller. Addition of the L2 cache increases performance by 14.5%, 54.9% and 82.3% on average for the 4x4, 8x8, and 11x11 designs, respectively.
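The node and ALU counts quoted for the three configurations follow from the shader core and memory controller counts together with the SIMD width of 16 given in Section 3.1. The short sketch below reproduces that arithmetic; the dictionary layout and function name are illustrative assumptions, and only the numbers come from the text.

```python
SIMD_WIDTH = 16  # processing elements per shader core (Section 3.1)

# Mesh name -> (shader cores, memory controllers), as listed in the text.
CONFIGS = {
    "4x4": (8, 8),
    "8x8": (56, 8),
    "11x11": (110, 11),
}


def summarize(name: str) -> dict:
    """Derive mesh node count, ALU count and cores per memory controller."""
    cores, controllers = CONFIGS[name]
    side = int(name.split("x")[0])
    # Every mesh node is either a shader core or a memory controller.
    assert cores + controllers == side * side
    return {
        "nodes": cores + controllers,
        "alus": cores * SIMD_WIDTH,
        "cores_per_controller": cores / controllers,
    }


if __name__ == "__main__":
    for name in CONFIGS:
        print(name, summarize(name))
```

The cores-per-controller ratio rises from 1 to 7 to 10 across the three designs. This is one way to read the growing pressure on each memory controller as the chip scales.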
For the applications that use the local store (BIN, CON, IMD, MAT, SCN and SOB), the harmonic mean speedup is 5.7%, 53%, and 50% for the 4x4, 8x8, and 11x11 designs, respectively; for applications that do not use the local store (BLK, BIT, and LU) the harmonic mean speedup is 23.6%, 64%, and 105% for the 4x4, 8x8, and 11x11 designs, respectively.

[Figure 4. IPC for 4x4, 8x8 and 11x11 configurations with 256KB L2 cache per memory controller. Panels: (a) 4x4 (90nm), (b) 8x8 (32nm), (c) 11x11 (22nm). Bars: PerfMem, NoL2, L2. Y-axis: IPC (0 to 1600); x-axis: BLK, BIN, BIT, CON, IMD, LU, MAT, SCN, SOB, HM.]

Figure 5 shows L1 miss rates for all the NoL2 and L2 configurations in Figure 4. For some benchmarks, L1 miss rates with an L2 cache are higher compared to when there is no L2. Our detailed investigation revealed that this behavior is related to a complex relationship between cache replacement (LRU) and the scheduling policy that we use to issue warps to the SIMD pipeline. When the L2 cache is added, the order in which memory requests return to the shader core changes dramatically; some of the requests hit in the L2 and return much faster than the ones that miss in L2 and are serviced by DRAM. This reordering effect combined with the warp scheduling policy creates this counterintuitive behavior.

[Figure 5. L1 miss rates for 4x4, 8x8 and 11x11 configurations without L2 and with 256K L2 per memory controller. Y-axis: L1 miss rate (0% to 40%); x-axis: BLK, BIN, BIT, CON, IMD, LU, MAT, SCN, SOB.]

Figure 6 shows the L2 miss rates. As the number of shader cores increases from 8 (4x4) to 56 (8x8), the L2 miss rate increases for all the benchmarks. In this configuration the L2 cache size remains the same as in the configuration with 8 shader cores while the number of shader cores increases (footnote 7).

Footnote 7: We hold the size of the L2 per memory controller constant, consistent with the assumption that the area of a memory controller block is kept at a fixed ratio with respect to a shader core to minimize the impact on layout as we scale the architecture.

[Figure 6. L2 miss rates for 4x4, 8x8 and 11x11 configurations with 256K L2 per memory controller. Y-axis: L2 miss rate (0% to 80%); x-axis: BLK, BIN, BIT, CON, IMD, LU, MAT, SCN, SOB.]

As the number of shader cores increases to 110 (11x11), some benchmarks behave differently and their cache miss rate does not increase despite the higher ratio of shader cores to cache capacity. This configuration has 11 memory controllers, and since we add an L2 cache to each memory controller, this configuration also has a higher aggregate L2 cache capacity. Our version of the LU benchmark does not have enough threads to use all 110 shader cores and therefore its L2 miss rate drops. For the SOB and BIT benchmarks the working set fits into the L2 cache in this configuration. These three benchmarks are also not sensitive to L2 cache size increase, as shown in Figure 8.

We measured the DRAM utilization of the various hardware configurations we model, and this data is shown in Figure 7. The utilization numbers are calculated by counting the number of DRAM responses each cycle, even when there is no outstanding request to DRAM. Therefore, either few DRAM requests or poor DRAM scheduling might result in low utilization. These cases can be separated by considering the number of requests coming from the memory controllers. The first three bars for each benchmark in Figure 7 show the data for the NoL2 configurations and the next three bars show the data when an L2 cache is included.
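The utilization metric described above can be sketched as follows. The sketch assumes at most one DRAM response per cycle and represents a run as per-cycle lists; both choices, and the function names, are illustrative assumptions rather than details of the simulator. The second function shows one way the two causes of low utilization could be told apart.

```python
def dram_utilization(responses_per_cycle):
    """Fraction of all simulated cycles in which DRAM returned a response.

    Idle cycles with no outstanding request still count in the denominator,
    matching the definition in the text.
    """
    if not responses_per_cycle:
        return 0.0
    busy = sum(1 for r in responses_per_cycle if r > 0)
    return busy / len(responses_per_cycle)


def utilization_while_pending(responses_per_cycle, pending_per_cycle):
    """Same ratio, restricted to cycles in which a request was outstanding.

    A low overall utilization combined with a high value here suggests few
    requests; low values for both suggest poor DRAM scheduling.
    """
    active = [r for r, p in zip(responses_per_cycle, pending_per_cycle) if p]
    if not active:
        return 0.0
    return sum(1 for r in active if r > 0) / len(active)


if __name__ == "__main__":
    responses = [1, 0, 0, 1]
    pending = [True, False, False, True]
    print(dram_utilization(responses))
    print(utilization_while_pending(responses, pending))
```

In the example trace DRAM is busy in half of all cycles but in every cycle where a request is pending, so the low overall figure reflects a lack of requests rather than scheduling inefficiency.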
As shown in the figure, utilization for the second three bars is lower than for the first three, highlighting the effectiveness of the L2 cache in reducing the number of requests sent to DRAM (recall that performance increases as we add the L2 cache). Figure 7 is also consistent with Figure 6: DRAM utilization follows the L2 cache miss rate, decreasing for the benchmarks whose miss rate decreases (and, e.g., for BLK, the miss rate increases going from 4x4 to 8x8 to 11x11, as does the DRAM utilization).

[Figure 7. DRAM Utilization for 4x4, 8x8 and 11x11 configurations without L2 and with 256K L2 per memory controller. Y-axis: DRAM utilization (0% to 100%); x-axis: BLK, BIN, BIT, CON, IMD, LU, MAT, SCN, SOB.]

5.1 Sensitivity to Cache Size

Figure 8 shows the effects of cache size increase for the 11x11 core design. Most benchmarks experience substantial performance boosts when the amount of cache per memory controller is increased from 256KB to 512KB. The improvement from increasing cache capacity from 512KB to 1MB is not as dramatic. One of the interesting observations in this graph is that the benchmarks that benefit from increasing the L2 cache capacity include those benchmarks that use the local store. BIN, CON, IMD, MAT and SCN all utilize the local store and all experience a substantial performance boost as L2 cache size increases from NoL2 to 1MB.

[Figure 8. Sensitivity to L2 size for 11x11 configuration. Bars: NoL2, 256K, 512K, 1M. Y-axis: IPC (0 to 1600); x-axis: BLK, BIN, BIT, CON, IMD, LU, MAT, SCN, SOB, HM.]

[Figure 9. Sensitivity to DRAM Bandwidth for 8x8 configuration. Bars: NoL2 B 4, NoL2 B 8, NoL2 B 16, L2 B 4, L2 B 8, L2 B 16. Y-axis: IPC (0 to 1000); x-axis: BLK, BIN, BIT, CON, IMD, LU, MAT, SCN, SOB, HM.]
We believe the reason for this is that benchmarks that are able to easily utilize the local store have a lot of data sharing and locality; that is why we could write them in such a way that they can use the local store in the first place.

5.2 Sensitivity to DRAM bandwidth

The effects of increasing DRAM bandwidth on performance are presented in this section. In order to simulate a higher bandwidth we changed the DRAM burst length while keeping the duration of an entire burst transfer (from start to finish) constant with respect to the shader core clock. For the 8x8 design with 256KB L2 caches, increasing the burst length from 4 (i.e., 8 Bytes/cycle) to 8 and 16 improves performance by less than 15.8% and 17.1%, respectively. The same burst length increases result in 45.6% and 58% performance increases when there is no L2 cache. Figure 9 shows that the IPC of the configuration with no L2 cache and burst length 16 (third bar) is only 1% higher than that of the design with an L2 cache and burst length 4 (fourth bar). We suspect the reason increasing burst length is not as helpful as increasing cache capacity is that not all data brought in by larger burst lengths is used before being evicted from the on-chip caches.

5.3 GPU to DRAM Clock Ratio Effects

[Figure 10. GPU to DRAM clock ratio effects for 8x8. Bars: NoL2 1:1, NoL2 3:2, L2 1:1, L2 3:2. Y-axis: IPC (0 to 800); x-axis: BLK, BIN, BIT, CON, IMD, LU, MAT, SCN, SOB, HM.]

Figure 10 shows the effects of changing the GPU to DRAM clock ratio from 1:1 to 3:2 (for every 3 GPU cycles there are 2 DRAM cycles) for the 8x8 configuration. When there is no L2 cache on chip, IPC drops by more than 37% when the clock ratio is increased to 3:2, but when a 256KB L2 cache is added to each memory controller, IPC only decreases by 21%. The only counter-intuitive phenomenon in Figure 10 is that LU's IPC increases by about 1% when the DRAM is slowed down.
This is again due to our L1 cache replacement policy. Close inspection reveals that victim L1 cache lines are selected when the memory request comes back to L1; while the new cache line is on its way, new requests can still hit in the cache, changing the LRU line.

7 Conclusions and Future Work

In this paper, we argued that it is important that programmer effort be amortized across future generations of stream processor architectures. We focused on stream processors similar to GPUs and used a programming model similar to CUDA. We explored the benefits that applications written using CUDA-like models will obtain as chip resources scale. We proposed to use a mesh topology as a scalable on-chip intercon...

Utah - WEBAPPS - 2
Utah Emergency Management Association Conference & Annual Meeting People, Partnerships, and Professionalism Emergency Management in Profile January 8, 2009 South Towne Expo Center To Register Now!: https:/webapps2.utah.edu/conferences/uema/
Utah - WEBAPPS - 2
Loews Denver Hotel1-800-345-9172 http:/www.loewshotels.com/en/Hotels/Denver-Hotel/Overview.aspxJW Marriott at Cherry Creek1-800-228-9290 http:/www.marriott.com/hotels/travel/denjw-jw-marriott-denver-at-cherry-creek/Courtyard-by Marriott1-800-2
Utah - BIO - 5410
Lecture Notes on Gene GenealogiesAlan R. Rogers1 All rights reserved. March 8, 20021Department of Anthropology, University of Utah, Salt Lake City, UT 84112Bibliography[1] A.M. Bowcock, A. Ruiz-Linares, J. Tomfohrde, E. Minch, J.R. Kidd, and
Utah - BIO - 5410
Lecture Notes on Gene GenealogiesAlan R. Rogers1 All rights reserved. March 22, 20041Department of Anthropology, University of Utah, Salt Lake City, UT 84112Lecture 6The Site Frequency Spectrum6.1 The empirical site frequency spectrumIn a
Utah - BIO - 5410
Lecture Notes on Gene GenealogiesAlan R. Rogers1 All rights reserved. March 8, 20021Department of Anthropology, University of Utah, Salt Lake City, UT 84112Lecture 3Neutral SubstitutionsThe rate at which neutral mutants are substituted into
Utah - ANT - 1050
Outline of Lecture on NepotismNepotism and Kin SelectionThe theory of kin selection Alan R. Rogers Evidence from Beldings ground squirrels Evidence from Japanese macaques Evidence from human homicides February 4, 2009What is altruism?Why altru
Utah - BIO - 5410
Lecture Notes on Gene GenealogiesAlan R. Rogers1 All rights reserved. March 10, 20041Department of Anthropology, University of Utah, Salt Lake City, UT 84112Lecture 4Gene GenealogiesThe coalescent process [13, 8] describes the ancestry of a
Utah - ANT - 5221
Lecture Notes on Gene Genealogies1Alan R. Rogers2 January 22, 2009c 2009, Alan R. Rogers. Anyone is allowed to make verbatim copies of this document and also to distribute such copies to other people, provided that this copyright notice stays inta
Utah - ANT - 5221
Eects of Drift and Mutation on Gene Diversity (and Gene Identity) LECTURE: Decay of Heterozygosity with Mutation Adding mutation to the equation describing decay of gene diversity Example: Prehistoric elk from the Emeryville Shellmound12How
Utah - BIO - 5410
Lecture Notes on Gene GenealogiesAlan R. Rogers1 All rights reserved. April 4, 20041Department of Anthropology, University of Utah, Salt Lake City, UT 84112Lecture 9Microsatellites9.1 Repeat polymorphisms: Nomenclaturetandem repeat polymo
Utah - ANT - 1050
Evolution of Hominin Brain Size History of brain size Are big brains a side eect of big bodies? Why did selection favor big brains? (1) Unpredictable food resource, (2) Unpredictable climate, (3) Expensive tissue hypothesis, (4) Social brain hypot
Utah - ANT - 1050
Anthro 1050, University of Utah Evolution of Human Nature Study Guide for Exam 2Alan Rogers March 24, 2009Exam 2 is not cumulative; it will not cover the material on Exam 1. This study guide does not cover the lectures, because you can review all o
Utah - ANT - 5221
The dominance variance, and schizophrenia as a quantitative traitWhat does dominance do to the variances and covariances?Anthro/Biol 5221, 5 December 2008One locus, two alleles, no environmental variance A1A1 A1A2 A2A2 +1 1-2h -112Is schiz
University of Hawaii - Hilo - BANQUET - 05
The 17th Annual CTAHR Awards Banquet: Growing Together College of Tropical Agriculture and Human Resources/CTAHR Alumni Association Mr./Mrs./Ms. _ (Please list attendee names on the reverse side of this card) Organizational Affiliation _ CTAHR Alumnu
University of Hawaii - Hilo - TUESDAYF - 2005
Working with linguistic data Training session 30 Oct 2005, University of Hawai'i Session provided by the Pacific And Regional Archive for Digital Sources in Endangered Cultures (PARADISEC) and the Resource Network for Linguistic Diversity (RNLD). Cri
Utah - FCG - 2
i iiiContentsPreface 1 Introduction1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 Graphics Areas . . . . Major Applications . . Graphics APIs . . . . 3D Geometric Models Graphics Pipeline . . . Numerical Issues . . . Efciency . . . . . . . Software Enginee
Utah - WCONF - 07
Do Managers Time Securitization Transactions to Obtain Accounting Benefits?*Patricia M. Dechow The Carleton H. Griffin Deloitte & Touche LLP Collegiate Professor of Accounting Stephen M. Ross School of Business University of Michigan Ann Arbor, MI 4
Utah - WCONF - 07
The Timing of Earnings Announcements: An Examination of the Strategic Disclosure HypothesisJeffrey T. Doyle College of Business Utah State University 3540 Old Main Hill Logan, UT 84322 jeffrey.doyle@usu.eduMatthew Magilke David Eccles School of B
Utah - WCONF - 05
The Persistence and Pricing of the Cash Component of Earnings*Patricia M. Dechow Carleton H. Griffin Deloitte and Touche LLP Professor of Accounting Ross School of Business, University of Michigan dechow@umich.edu Scott A. Richardson Assistant Profe
Utah - WCONF - 07
Adopting a Label: Heterogeneity in the Economic Consequences of IFRS Adoptions*Holger Daske The Wharton School University of Pennsylvania Luzi Hail The Wharton School University of Pennsylvania Christian Leuz The Graduate School of Business Universi
Utah - WCONF - 06
How High Is US CEO Pay? A Comparison with UK CEO PayMartin J. Conyon University of Pennsylvania, USAconyon@wharton.upenn.eduJohn E. Core University of Pennsylvania, USAjcore@wharton.upenn.eduWayne R. Guay University of Pennsylvania, USAguay@w
Utah - WCONF - 05
IPO Failure Risk: Determinants and Pricing ConsequencesElizabeth Demers Wm. E. Simon School of Business University of Rochester Rochester, NY 14627 Phone: (585) 273-1650 lizdemers@simon.rochester.eduPhilip Joos Wm. E. Simon School of Business Uni
Utah - WCONF - 05
Disclosure Quality and Information AsymmetryStephen Brown# Stephen A. HillegeistDecember 2003Abstract: We examine the association between firms disclosure quality and information asymmetry using a three-stage least squares estimation procedure th
Utah - WCONF - 05
Altering Investment Decisions to Manage Financial Reporting Outcomes: Asset Backed Commercial Paper Conduits and FIN 46Daniel A. Bens daniel.bens@ChicagoGSB.edu University of Chicago Graduate School of Business 5807 South Woodlawn Avenue Chicago, IL
Utah - WCONF - 07
The Effects of Financial Statement and Informational Complexity on Cash Flow ForecastsLeslie Hodder Patrick E. Hopkins David A. Wood Kelley School of Business Indiana University Bloomington, IN 47405-1701December 23, 2006 Email: lhodder@indiana.e
Utah - WCONF - 06
A Re-examination of the Nave-Investor Hypothesis in Accruals Mispricing: The Role of Cash FlowsGerhard J. Barone Assistant Professor University of Texas at Austin Austin, TX 78712 Ph: (512) 471-5619 Fax: (512) 471-3904 gerhard.barone@mccombs.utexas
Utah - WCONF - 06
How Do Managers Deal With Bad News? Management Forecasts around Negative Earnings Surprises Jonathan L. Rogers University of Chicago Graduate School of Business 5807 S. Woodlawn Ave Chicago, IL 60637 (773) 834-0161 jrogers2@chicagogsb.edu Andrew Van
Utah - WCONF - 07
Investor Myopia and CEO Horizon IncentivesBrian Cadman Kellogg School of Management Northwestern University Evanston, IL 60208 b-cadman@kellogg.northwestern.edu (847) 491-2662Jayanthi Sunder Kellogg School of Management Northwestern University Ev
Utah - WCONF - 07
Conservatism, Growth, and Return on InvestmentMadhav V. Rajan Stefan Reichelstein February 11, 2007 Mark T. SolimanAbstract Return on Investment (ROI) is widely regarded as a key measure of firm profitability. The accounting literature has long re
Utah - WCONF - 06
How Disaggregated Forecasts Enhance the Credibility of Management Earnings ForecastsD. Eric Hirst Lisa Koonce Shankar Venkataraman Department of Accounting McCombs School of Business The University of Texas CBA 4M.202 Austin, TX 78712First DraftC
Utah - WCONF - 06
Optimal "Pay-Performance-Sensitivity" in the presence of Exogenous Risk Thomas Hemmer January, 2006Abstract. I study the relation between the level of exogenous risk and pay-for-performance sensitivity (P P S) of optimal contracts. I rst show that n
Utah - WCONF - 06
Bringing it home: A study of the incentives surrounding the repatriation of foreign earnings under the American Jobs Creation Act of 2004Jennifer Blouin The Wharton School Linda Krull University of Texas at AustinFebruary 2006 (First Version)Abs
Utah - WCONF - 05
Trading Incentives to Meet Earnings Thresholds* Sarah McVay New York University 44 West Fourth Street New York, NY 10012 smcvay@stern.nyu.edu Venky Nagar University of Michigan 701 Tappan Street Ann Arbor, MI 48109 venky@umich.edu Vicki (Wei) Tang Un
Utah - WCONF - 02
Participant List Second Annual Utah Winter Financial Accounting Conference Name Abarbanell Bens* Bhattacharya Bishop Botosan# Bradshaw* Bryant* Carter Cushing Easton# Eining Hayes Hirst Hodder* Hopkins* Jorgensen Kirschenheiter Leuz* Lynch Mercer Mi
University of Hawaii - Hilo - GG - 612
Geo 612January 27, 20041ROCK STRUCTURE ("FRACTURES AND FOLDS") (2) I Main Topics A Planar geologic structures (mostly fractures) B Folds C Fabrics: grain-scale structure Planar geologic structures (mostly fractures) A Fractures/classification:
University of Hawaii - Hilo - GG - 303
GG303 Lab 39/7/051MAPS AND CROSS SECTIONS (I) I Main Topics A Three point problems B Rule of vees C Map interpretation and cross sections II Three point problems (see handout) A Three points define a plane B The line of strike is given by the b
University of Hawaii - Hilo - GG - 303
GG303 Lecture 288/19/051JOINTS I Main Topics A Scientific method B Why are joints important? C Observations of joints D Formulation and testing of hypotheses: The origin of joints II Scientific method (See Fig. 1.1) III Why are fractures import
University of Hawaii - Hilo - GG - 454
GG 454March 8, 20021CASE HISTORIES: TRANSLATIONAL LANDSLIDES (22) I Main Topics A Frank Landslide B Love Creek Landslide I I Frank Landslidehttp:/www.frankslide.com/boulder.html http:/www.frankslide.com/ http:/cgrg.geog.uvic.ca/abstracts/ReadG
University of Hawaii - Hilo - LIS - 670
Database structure & file organizationLIS 670 Bair-MundyElectronic databasesERICGovernmental databasesEBSCO HostCommercial databases Specialty databasesOPACOnline Public Access CatalogTrust TerritoryTitle Publisher The online Date of
Utah - P - 419
The Photographs Collection of Frank Rasmussen Photograph Collection ( P0419 ) Number and types of photographs: 14, 5x7 black & white copy printsDates
Utah - P - 503
The Photographs of Al Brain Photograph Collection (P0503) Number and types of photographs: 40 photographs in original collection,Addendum added
Utah - P - 954
The Photographs of Nate Christensen Photograph Collection (P0954) Number and types of photographs: 1066 digital scans taken from 35mm slidesDates o
Utah - P - 933
The Photographs of Alton Crane MelvillePhotograph Collection (P0933)Number and types of photographs: 1 B&W Photograph, 22 Digital ScansDates of photographs: 1950s 1960sCollection Processed by: Jamie ColtonRegister Prepared by: Mary Ann
University of Hawaii - Hilo - BULL - 20
University of Hawaii - Hilo - BULL - 25
University of Hawaii - Hilo - BULL - 29
University of Hawaii - Hilo - BULL - 18
University of Hawaii - Hilo - BULL - 27
University of Hawaii - Hilo - BOTANY - 159
TABLE 5. RARE, THREATENED AND ENDANGERED PLANTS HISTORICALLY KNOWN FROM MOLOKAI (with excerpts of type collection (Wagner et. al. 1999) (Note: place names given where known from type)FAMILY: AMARANTHACEAE Achyranthes splendens Mart. ex Moq. var. ro
University of Hawaii - Hilo - MATH - 414
Math 414 Lecture 35MARRIAGE PROBLEM. There are 4 boys {A, B, C, D} and 5 girls {M, N, O, P, Q}. X marks couples who will dance together. Find a matching with the most dance couples.A M N O P Q X X X X X X X B C X X X DDraw a graph which represent
University of Hawaii - Hilo - MATH - 414
Math 414 Lecture 35The shortest path problem Given: a connected undirected weighted graph with an origin node. Here, the weight of an edge is its length. Goal: find the shortest path from each node to the origin. The distance between two nodes or be