JUNE 1993

WRL Technical Note TN-36

Combining Branch Predictors

Scott McFarling

digital Western Research Laboratory, 250 University Avenue, Palo Alto, California 94301 USA

The Western Research Laboratory (WRL) is a computer systems research group that was founded by Digital Equipment Corporation in 1982. Our focus is computer science research relevant to the design and application of high performance scientific computers. We test our ideas by designing, building, and using real systems. The systems we build are research prototypes; they are not intended to become products.

There is a second research laboratory located in Palo Alto, the Systems Research Center (SRC). Other Digital research groups are located in Paris (PRL) and in Cambridge, Massachusetts (CRL).

Our research is directed towards mainstream high-performance computer systems. Our prototypes are intended to foreshadow the future computing environments used by many Digital customers. The long-term goal of WRL is to aid and accelerate the development of high-performance uni- and multi-processors. The research projects within WRL will address various aspects of high-performance computing.

We believe that significant advances in computer systems do not come from any single technological advance. Technologies, both hardware and software, do not all advance at the same pace. System design is the art of composing systems which use each level of technology in an appropriate balance. A major advance in overall system performance will require reexamination of all aspects of the system.

We do work in the design, fabrication and packaging of hardware; language processing and scaling issues in system software design; and the exploration of new applications areas that are opening up with the advent of higher performance systems. Researchers at WRL cooperate closely and move freely among the various levels of system design. This allows us to explore a wide range of tradeoffs to meet system goals.
We publish the results of our work in a variety of journals, conferences, research reports, and technical notes. This document is a technical note. We use this form for rapid distribution of technical material. Usually this represents research in progress. Research reports are normally accounts of completed research and may include material from earlier technical notes.

Research reports and technical notes may be ordered from us. You may mail your order to:

    Technical Report Distribution
    DEC Western Research Laboratory, WRL-2
    250 University Avenue
    Palo Alto, California 94301 USA

Reports and notes may also be ordered by electronic mail. Use one of the following addresses:

    Digital E-net:  DECWRL::WRL-TECHREPORTS
    Internet:       WRL-Techreports@decwrl.dec.com
    UUCP:           decwrl!wrl-techreports

To obtain more details on ordering by electronic mail, send a message to one of these addresses with the word help in the Subject line; you will receive detailed instructions.

Abstract

One of the key factors determining computer performance is the degree to which the implementation can take advantage of instruction-level parallelism. Perhaps the most critical limit to this parallelism is the presence of conditional branches that determine which instructions need to be executed next. To increase parallelism, several authors have suggested ways of predicting the direction of conditional branches with hardware that uses the history of previous branches. The different proposed predictors take advantage of different observed patterns in branch behavior. This paper presents a method of combining the advantages of these different types of predictors. The new method uses a history mechanism to keep track of which predictor is most accurate for each branch so that the most accurate predictor can be used.
In addition, this paper describes a method of increasing the usefulness of branch history by hashing it together with the branch address. Together, these new techniques are shown to outperform previously known approaches both in terms of maximum prediction accuracy and the prediction accuracy for a given predictor size. Specifically, prediction accuracy reaches 98.1% correct versus 97.1% correct for the most accurate previously known approach. Also, this new approach is typically at least a factor of two smaller than other schemes for a given prediction accuracy. Finally, this new approach allows predictors with a single level of history array access to outperform schemes with multiple levels of history for all but the largest predictor sizes.

Copyright 1993 Digital Equipment Corporation

1 Introduction

In the search for ever higher levels of performance, recent machine designs have made use of increasing degrees of instruction level parallelism. For example, both superscalar and superpipelining techniques are becoming increasingly popular. With both these techniques, branch instructions are increasingly important in determining overall machine performance. This trend is likely to continue as the use of superscalar and superpipelining increases, especially if speculative execution becomes popular. Moreover, some of the compiler assisted techniques for minimizing branch cost in early RISC designs are becoming less appropriate. In particular, delayed branches are decreasingly effective as the number of delay slots to fill increases. Also, multiple implementations of an architecture with different superscalar or superpipelining choices make the use of delay slots problematic[Sit93]. Together, these trends increase the importance of hardware methods of reducing branch cost.

The branch performance problem can be divided into two subproblems. First, a prediction of the branch direction is needed.
Second, for taken branches, the instructions from the branch target must be available for execution with minimal delay. One way to provide the target instructions quickly is to use a Branch Target Buffer, which is a special instruction cache designed to store the target instructions. This paper focuses on predicting branch directions. The alternatives available for providing target instructions will not be discussed. The reader is referred to Lee and Smith[LS84] for more information.

Hardware branch prediction strategies have been studied extensively. The most well known technique, referred to here as bimodal branch prediction, makes a prediction based on the direction the branch went the last few times it was executed. More recent work has shown that significantly more accurate predictions can be made by utilizing more branch history. One method considers the history of each branch independently and takes advantage of repetitive patterns. Since the histories are independent, we will refer to it as local branch prediction. Another technique uses the combined history of all recent branches in making a prediction. This technique will be referred to as global branch prediction.

Each of these different branch prediction strategies has distinct advantages. The bimodal technique works well when each branch is strongly biased in a particular direction. The local technique works well for branches with simple repetitive patterns. The global technique works particularly well when the direction taken by sequentially executed branches is highly correlated.

This paper introduces a new technique that allows the distinct advantages of different branch predictors to be combined. The technique uses multiple branch predictors and selects the one which is performing best for each branch. This approach is shown to provide more accurate predictions than any one predictor alone.
This paper also shows a method of increasing the utility of branch history by hashing it together with the branch address.

The organization of this paper is as follows. First, Section 2 discusses previous work in branch prediction. Later sections describe in detail the prediction methods found useful in combination and will evaluate them quantitatively to provide a basis for evaluating the new techniques. Sections 3, 4, and 5 review the bimodal, local, and global predictors. Section 6 discusses predictors indexed by both global history and branch address information. Section 7 discusses hashing global history and branch address information before indexing the predictor. Section 8 describes the technique for combining multiple predictors. Section 9 gives some concluding remarks. Section 10 gives some suggestions for future work. Finally, Appendix A presents some additional comparisons to variations of the local prediction method.

2 Related Work

Branch performance issues have been studied extensively. J. E. Smith[Smi81] presented several hardware schemes for predicting branch directions including the bimodal scheme that will be described in Section 3. Lee and A. J. Smith[LS84] evaluated several branch prediction schemes. In addition, they showed how branch target buffers can be used to reduce the pipeline delays normally encountered when branches are taken. McFarling and Hennessy[MH86] compared various hardware and software approaches to reducing branch cost including using profile information. Hwu, Conte, and Chang[HCC89] performed a similar study for a wider range of pipeline lengths. Fisher and Freudenberger[FF92] studied the stability of profile information across separate runs of a program. Both the local and global branch prediction schemes were described by Yeh and Patt[YP92, YP93]. Pan, So, and Rahmeh[PSR92] described how both global history and branch address information can be used in one predictor.
Ball and Larus[BL93] describe several techniques for guessing the most common branch directions at compile time using static information. Several studies[Wal91, JW89, LW93] have looked at the implications of branches on available instruction level parallelism. These studies show that branch prediction errors are a critical factor determining the amount of local parallelism that can be exploited.

3 Bimodal Branch Prediction

The behavior of typical branches is far from random. Most branches are either usually taken or usually not taken. Bimodal branch prediction takes advantage of this bimodal distribution of branch behavior and attempts to distinguish usually taken from usually not-taken branches. There are a number of ways this can be done. Perhaps the simplest approach is shown in Figure 1. The figure shows a table of counters indexed by the low order address bits in the program counter. Each counter is two bits long. For each taken branch, the appropriate counter is incremented. Likewise, for each not-taken branch, the appropriate counter is decremented. In addition, the counter is saturating. In other words, the counter is not decremented past zero, nor is it incremented past three. The most significant bit determines the prediction. Repeatedly taken branches will be predicted to be taken, and repeatedly not-taken branches will be predicted to be not-taken. By using a 2-bit counter, the predictor can tolerate a branch going an unusual direction one time and keep predicting the usual branch direction.

[Figure 1: Bimodal Predictor Structure]

For large counter tables, each branch will map to a unique counter. For smaller tables, multiple branches may share the same counter, resulting in degraded prediction accuracy. One alternate implementation is to store a tag with each counter and use a set-associative lookup to match counters with branches. For a fixed number of counters, a set-associative table has better performance.
However, once the size of tags is accounted for, a simple array of counters often has better performance for a given predictor size. This would not be the case if the tags were already needed to support a branch target buffer.

To compare various branch prediction strategies, we will use the SPEC89 benchmarks [SPE90] shown in Figure 2. These benchmarks include a mix of symbolic and numeric applications. However, to limit simulation time, only the first 10 million instructions from each benchmark were simulated. Execution traces were obtained on a DECstation 5000 using the pixie tracing facility[Kil86, Smi91]. Finally, all counters are initially set as if all previous branches were taken.

    benchmark   description
    doduc       Monte Carlo simulation
    eqntott     conversion from equation to truth table
    espress     minimization of boolean functions
    fpppp       quantum chemistry calculations
    gcc         GNU C compiler
    li          lisp interpreter
    mat300      matrix multiplication
    nasa7       NASA Ames FORTRAN Kernels
    spice       circuit simulation
    tomcatv     vectorized mesh generation

Figure 2: SPEC Benchmarks Used for Evaluation

Figure 3 shows the average conditional branch prediction accuracy of bimodal prediction. The number plotted is the average accuracy across the SPEC89 benchmarks with each benchmark simulated for 10 million instructions. The accuracy increases with predictor size since fewer branches share counters as the number of counters increases. However, prediction accuracy saturates at 93.5% correct once each branch maps to a unique counter. A set-associative predictor would saturate at the same accuracy.

[Figure 3: Bimodal Predictor Performance. Conditional branch prediction accuracy (%) versus predictor size (bytes), 32 bytes to 64K bytes; curve: bimodal.]

4 Local Branch Prediction

One way to improve on bimodal prediction is to recognize that many branches execute repetitive patterns.
Perhaps the most common example of this behavior is loop control branches. Consider the following example:

    for (i=1; i<=4; i++) { }

If the loop test is done at the end of the body, the corresponding branch will execute the pattern (1110)^n, where 1 and 0 represent taken and not taken respectively, and n is the number of times the loop is executed. Clearly, if we knew the direction this branch had gone on the previous three executions, then we would always be able to predict the next branch direction.

A branch prediction method close to one developed by Yeh and Patt[YP92] that can take advantage of this type of pattern is shown in Figure 4. The figure shows a branch predictor that uses two tables. The first table records the history of recent branches. Many different organizations are possible for this table. In this paper, we will assume that it is simply an array indexed by the low-order bits of the branch address. Yeh and Patt assumed a set-associative branch history table. As with bimodal prediction, a simple array avoids the need to store tags but does suffer from degraded performance when multiple branches map to the same table entry, especially with smaller table sizes. Each history table entry records the direction taken by the most recent n branches whose addresses map to this entry, where n is the length of the entry in bits. The second table is an array of 2-bit counters identical to those used for bimodal branch prediction. However, here they are indexed by the branch history stored in the first table. In this paper, this approach is referred to as local branch prediction because the history used is local to the current branch. In Yeh and Patt's nomenclature this method is referred to as a per-address scheme.

Consider again the simple loop example above. Let's assume that this is the only branch in the program.
In this case, there will be a history table entry that stores the history of this branch only and the counter table will reflect solely the behavior of this branch. With 3 bits of history and 2^3 counters, the local branch predictor will be able to determine the current iteration and always make the correct prediction after some initial settling of the counter values.

If there are more branches in the program, a local predictor can suffer from two kinds of contention. First, the branch history may reflect a mix of histories of all the branches that map to each history entry. Second, since there is only one counter array for all branches, there may be conflict between patterns. For example, if there is another branch that typically executes the pattern (0110)^n instead of (1110)^n, there will be contention when the branch history is (110). However, with 4 bits of history and 2^4 counters, this contention can be avoided. Note, however, that if the first pattern is executed a large number of times followed by a large number of executions of the second pattern, then only 3 bits of history are needed since the counters dynamically adjust to the more recent patterns.

[Figure 4: Local History Predictor Structure]

Figure 5 shows the performance of local branch prediction as a function of the predictor size. For simplicity, we assume that the number of history and count array entries are the same. See Appendix A for a discussion of some alternatives. For very small predictors, the local scheme is actually worse than the bimodal scheme. If there is excessive contention for history entries, then storing this history is of no value. However, above about 128 bytes, the local predictor has significantly better performance. For large predictors, the accuracy approaches 97.1% correct, with less than half as many mispredictions as the bimodal scheme.
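The bimodal and local schemes of Sections 3 and 4 can be sketched in a few lines of Python. This is only an illustrative model, not the simulator used for the measurements: the class and function names, the zero-initialized history entries, and the independent sizing of the history and counter tables are assumptions made for the sketch. Counters start at 3, following the statement that all counters are initially set as if previous branches were taken.

```python
def update_counter(counter, taken):
    """2-bit saturating counter: up on taken, down on not-taken, clamped to 0..3."""
    return min(counter + 1, 3) if taken else max(counter - 1, 0)


class BimodalPredictor:
    """Counters indexed directly by low-order address bits (Section 3)."""

    def __init__(self, table_bits):
        self.mask = (1 << table_bits) - 1
        self.counters = [3] * (1 << table_bits)  # start as "strongly taken"

    def predict(self, pc):
        # The most significant counter bit gives the prediction.
        return self.counters[pc & self.mask] >= 2

    def update(self, pc, taken):
        i = pc & self.mask
        self.counters[i] = update_counter(self.counters[i], taken)


class LocalPredictor:
    """Two tables: per-address history, then counters indexed by that history."""

    def __init__(self, history_bits, table_bits):
        self.history_mask = (1 << history_bits) - 1
        self.table_mask = (1 << table_bits) - 1
        self.histories = [0] * (1 << table_bits)    # assumption: histories start at 0
        self.counters = [3] * (1 << history_bits)   # start as "strongly taken"

    def predict(self, pc):
        history = self.histories[pc & self.table_mask]
        return self.counters[history] >= 2

    def update(self, pc, taken):
        slot = pc & self.table_mask
        history = self.histories[slot]
        self.counters[history] = update_counter(self.counters[history], taken)
        # Shift the newest outcome into the low-order bit of the history.
        self.histories[slot] = ((history << 1) | int(taken)) & self.history_mask


def run_pattern(predictor, pc, repeats, pattern=(True, True, True, False)):
    """Feed a repeating taken/not-taken pattern; return per-branch correctness."""
    results = []
    for _ in range(repeats):
        for taken in pattern:
            results.append(predictor.predict(pc) == taken)
            predictor.update(pc, taken)
    return results
```

Under these assumptions, feeding the 1110 loop pattern to a local predictor with 3 history bits yields mispredictions only while the counter for history 111 settles, whereas the bimodal sketch mispredicts every loop exit.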
5 Global Branch Prediction

In the local branch prediction scheme, the only patterns considered are those of the current branch. Another scheme proposed by Yeh and Patt[YP92] is able to take advantage of other recent branches to make a prediction. One implementation of such an approach is shown in Figure 6. A single shift register, GR, records the direction taken by the most recent n conditional branches. Since the branch history is global to all branches, this strategy is called global branch prediction in this paper.

[Figure 5: Local History Predictor Performance. Conditional branch prediction accuracy (%) versus predictor size (bytes), 32 bytes to 64K bytes; curves: local, bimodal.]

[Figure 6: Global History Predictor Structure]

Global branch prediction is able to take advantage of two types of patterns. First, the direction taken by the current branch may depend strongly on other recent branches. Consider the example below:

    if (x<1) ...
    if (x>1) ...

Using global history prediction, we are able to base the prediction for the second if on the direction taken by the first if. If (x<1), we know that the second if will not be taken. If (x>=1), then we don't know conclusively which way this branch will be taken, but the probability may well be skewed one direction or the other. If so, we should be able to make a better prediction than if we had no information about the value of x. Pan, So, and Rahmeh[PSR92] showed several examples of neighboring branches in the SPEC benchmarks with conditions that are correlated in this way.

A second way that global branch prediction can be effective is by duplicating the behavior of local branch prediction. This can occur when the global history includes all the local history needed to make an accurate prediction.
Consider the example:

    for (i=0; i<100; i++)
        for (j=0; j<3; j++)

After the initial startup time, the conditional branches have the following behavior, assuming GR is shifted to the left:

    test     value   GR     result
    j<3      j=1     1101   taken
    j<3      j=2     1011   taken
    j<3      j=3     0111   not taken
    i<100            1110   usually taken

Here the global history is able to both distinguish which of the two branches is being executed and what the current value of j is. Thus, the prediction accuracy here would be as good as that of local prediction.

Figure 7 compares the performance of the global prediction with local and bimodal branch prediction. The plot shows that the global scheme is significantly less effective than the local scheme for a fixed size predictor. It is only better than the bimodal scheme above 1KB.

We can understand this behavior intuitively by looking at the information content of the counter table index. For small predictors, the bimodal scheme is relatively good. Here, the branch address bits used in the bimodal scheme efficiently distinguish different branches. As the number of counters doubles, roughly half as many branches will share the same counter. Informally, we can say that the information content of the address bits is high. For large counter tables, this is no longer true. As more counters are added, eventually each frequent branch will map to a unique counter. Thus, the information content in each additional address bit declines to zero for increasingly large counter tables.

The information content of the global history register begins relatively small, but it continues to grow for larger sizes. To understand why, consider the history one might expect when a particular branch is executed. Since over 90% of the time each branch goes the same direction, the sequence of previous branches and the directions taken by these branches will tend to be highly repetitive for any one branch, but perhaps very different for other branches.
This behavior allows a global predictor to identify different branches. However, as Figure 7 suggests, the global history is less efficient at this than the branch address. On the other hand, the global history register can capture more information than just identifying which branch is current, and thus for sufficiently large predictors it does better than bimodal prediction.

[Figure 7: Global History Predictor Performance. Conditional branch prediction accuracy (%) versus predictor size (bytes), 32 bytes to 64K bytes; curves: global, local, bimodal.]

6 Global Predictor with Index Selection

As discussed in the previous section, global history information is less efficient at identifying the current branch than simply using the branch address. This suggests that a more efficient prediction might be made using both the branch address and the global history. Such a scheme was proposed by Pan, So, and Rahmeh[PSR92]. Their approach is shown in Figure 8. Here the counter table is indexed with a concatenation of global history and branch address bits.

[Figure 8: Global History Predictor with Index Selection]

The performance of global prediction with selected address bits (gselect) is shown in Figure 9. With the bit selection approach, there is a tradeoff between using more history bits or more address bits. For a predictor table with 2^K counters, we could use anywhere from 1 to (K-1) address bits. Rather than show all these possibilities, Figure 9 only shows the performance of the predictor of the given size with the best accuracy across the benchmarks (gselect-best). As we would expect, gselect-best performs better than either bimodal or global prediction since both are essentially degenerate cases. For small sizes, gselect-best parallels the performance of bimodal prediction. However, once there are enough address bits to identify most branches, more global history bits are used, resulting in significantly better prediction results than the bimodal scheme. The gselect-best method also significantly outperforms simple global prediction for most predictor sizes because the branch address bits more efficiently identify the branch. For predictor sizes less than 1KB, gselect-best also outperforms local prediction. The global schemes have the advantage that the storage space required for global history is negligible. Moreover, even for larger predictors, the accuracies are close. This is especially interesting since gselect requires only a single array access whereas local prediction requires two array accesses in sequence. This suggests that a gselect predictor should have less delay and be easier to pipeline than a local predictor.

[Figure 9: Global History with Index Selection Performance. Conditional branch prediction accuracy (%) versus predictor size (bytes), 32 bytes to 64K bytes; curves: gselect-best, global, local, bimodal.]

7 Global History with Index Sharing

In the discussion of global prediction, we described how global history information weakly identifies the current branch. This suggests that there is a lot of redundancy in the counter index used by gselect. If there are enough address bits to identify the branch, we can expect the frequent global history combinations to be rather sparse. We can take advantage of this effect by hashing the branch address and the global history together. In particular, we can expect the exclusive OR of the branch address with the global history to have more information than either component alone. Moreover, since more address bits and global history bits are in use, there is reason to expect better predictions than gselect.
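The two ways of forming a counter index discussed in Sections 6 and 7, concatenation and exclusive OR, can be sketched as follows. The function names and parameters are hypothetical, and placing the address bits above the history bits in the concatenated index is an assumed field order. When fewer history bits than index bits are used, the sketch aligns the history with the high-order index bits, as this section describes for the exclusive-OR scheme.

```python
def gselect_index(pc, history, addr_bits, hist_bits):
    """Concatenate low-order address bits with low-order global history bits."""
    addr = pc & ((1 << addr_bits) - 1)
    hist = history & ((1 << hist_bits) - 1)
    # Assumed field order: address bits in the upper part of the index.
    return (addr << hist_bits) | hist


def xor_index(pc, history, index_bits, hist_bits):
    """Exclusive OR of the branch address with the global history."""
    addr = pc & ((1 << index_bits) - 1)
    hist = history & ((1 << hist_bits) - 1)
    # A shorter history is XORed with the higher-order address bits.
    return addr ^ (hist << (index_bits - hist_bits))


# Four (branch address, global history) pairs, each 8 bits wide.
cases = [
    (0b00000000, 0b00000001),
    (0b00000000, 0b00000000),
    (0b11111111, 0b00000000),
    (0b11111111, 0b10000000),
]
concat = [gselect_index(pc, h, addr_bits=4, hist_bits=4) for pc, h in cases]
hashed = [xor_index(pc, h, index_bits=8, hist_bits=8) for pc, h in cases]
```

Both variants produce an 8-bit index here. The concatenated index maps the last two cases to the same counter because it discards the upper four history bits, while the exclusive-OR index keeps all four cases apart.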
Consider the following simple example where there are only two branches and each branch has only two common global histories:

    Branch Address   Global History   gselect 4/4   gshare 8/8
    00000000         00000001         00000001      00000001
    00000000         00000000         00000000      00000000
    11111111         00000000         11110000      11111111
    11111111         10000000         11110000      01111111

Strategy gselect 4/4 concatenates the low order 4 bits of both the branch address and the global history. We will call the strategy of exclusive ORing branch address and global history gshare. Strategy gshare 8/8 uses the bit-wise exclusive OR of all 8 bits of both the branch address and the global history. Comparing gshare 8/8 and gselect 4/4 shows that only gshare is able to separate all four cases. The gselect predictor can't take advantage of the distinguishing history in the upper four bits.

As with gselect, we can choose to use fewer global history bits than branch address bits. In this case, the global history bits are exclusive ORed with the higher order address bits. Typically, the higher order address bits will be more sparse than the lower order bits.

Figure 10 shows the gshare predictor structure. Figure 11 compares the performance of gshare with gselect. Figure 11 only shows the gshare predictor among the various history length choices that has the best performance across the benchmarks (gshare-best). For predictor sizes of 256 bytes and over, gshare-best outperforms gselect-best by a small margin. For smaller predictors, gshare underperforms gselect because there is already too much contention for counters between different branches and adding global information just makes it worse.

[Figure 10: Global History Predictor with Index Sharing]

8 Combining Branch Predictors

The different branch prediction schemes we have presented have different advantages. A natural question is whether the different advantages can be combined in a new branch prediction method with better prediction accuracy.
One such method is shown in Figure 12. This combined predictor contains two predictors P1 and P2 that could be one of the predictors discussed in the previous sections or indeed any kind of branch prediction method. In addition, the combined predictor contains an additional counter array which serves to select the best predictor to use. As before, we will use 2-bit up/down saturating counters. Each counter keeps track of which predictor is more accurate for the branches that share that counter. Specifically, using the notation P1c and P2c to denote whether predictors P1 and P2 are correct respectively, the counter is incremented or decremented by P1c-P2c as shown below:

    P1c   P2c   P1c-P2c
    0     0      0 (no change)
    0     1     -1 (decrement counter)
    1     0      1 (increment counter)
    1     1      0 (no change)

[Figure 11: Global History with Index Sharing Performance. Conditional branch prediction accuracy (%) versus predictor size (bytes), 32 bytes to 64K bytes; curves: gshare-best, gselect-best, global.]

[Figure 12: Combined Predictor Structure]

[Figure 13: Combined Predictor Performance by Benchmark. Conditional branch prediction accuracy (%) for bimodal, gshare, and bimodal/gshare on doduc, eqntott, espress, fpppp, gcc, li, mat300, nasa7, spice, tomcatv, and the average.]

One combination of branch predictors that is useful is bimodal/gshare. In this combination, global information can be used if it is worthwhile, otherwise the usual branch direction as predicted by the bimodal scheme can be used. Here, we assume gshare uses the same number of history and address bits. This assumption maximizes the amount of global information. Diluting the branch address information is less of a concern because the bimodal prediction can always be used. Similarly, gshare performs significantly better here than gselect since it uses more global information.
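The selection mechanism can be sketched in Python as below. This is a minimal model under stated assumptions rather than the hardware described here: the class names, the simplified bimodal and gshare components, and the initial selector value of 2 (weakly preferring P1) are all choices made for illustration. The selector update follows the P1c-P2c rule given above.

```python
def saturate(value, delta):
    """Apply delta to a 2-bit counter and clamp the result to 0..3."""
    return max(0, min(3, value + delta))


class Bimodal:
    def __init__(self, index_bits):
        self.mask = (1 << index_bits) - 1
        self.counters = [3] * (1 << index_bits)

    def predict(self, pc):
        return self.counters[pc & self.mask] >= 2

    def update(self, pc, taken):
        i = pc & self.mask
        self.counters[i] = saturate(self.counters[i], 1 if taken else -1)


class Gshare:
    def __init__(self, index_bits):
        self.mask = (1 << index_bits) - 1
        self.counters = [3] * (1 << index_bits)
        self.history = 0  # global history register, assumed to start at 0

    def _index(self, pc):
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        self.counters[i] = saturate(self.counters[i], 1 if taken else -1)
        self.history = ((self.history << 1) | int(taken)) & self.mask


def selector_delta(p1_correct, p2_correct):
    """P1c - P2c: +1 if only P1 was right, -1 if only P2 was right, else 0."""
    return int(p1_correct) - int(p2_correct)


class Combined:
    def __init__(self, p1, p2, index_bits):
        self.p1, self.p2 = p1, p2
        self.mask = (1 << index_bits) - 1
        self.choice = [2] * (1 << index_bits)  # assumption: weakly prefer P1

    def predict(self, pc):
        use_p1 = self.choice[pc & self.mask] >= 2
        return self.p1.predict(pc) if use_p1 else self.p2.predict(pc)

    def update(self, pc, taken):
        # Score both components before either one is updated.
        p1c = self.p1.predict(pc) == taken
        p2c = self.p2.predict(pc) == taken
        i = pc & self.mask
        self.choice[i] = saturate(self.choice[i], selector_delta(p1c, p2c))
        self.p1.update(pc, taken)
        self.p2.update(pc, taken)


def simulate(predictor, pc, outcomes):
    """Return, per branch execution, whether the prediction was correct."""
    results = []
    for taken in outcomes:
        results.append(predictor.predict(pc) == taken)
        predictor.update(pc, taken)
    return results
```

With a single branch repeating the 1110 pattern, the selector in this sketch moves toward the gshare component once gshare has learned the loop exit, after which the combined prediction stops missing the exit that the bimodal component keeps mispredicting.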
Figure 13 shows how the bimodal/gshare combination works on the SPEC89 benchmarks. Here, all the benchmarks were run to completion. Also, each predictor array has 1K counters. Thus, the combined predictor is actually 3 times as large. As the graph shows, the combined predictor always does better than either predictor alone. For example, with eqntott, gshare is much more effective than bimodal and bimodal/gshare matches the performance of gshare.

Figure 14 shows how often each predictor was used in the bimodal/gshare combined predictor on these same runs. For these sizes, the bimodal predictor is typically used more often. However, for eqntott, the gshare predictor is more often used. Again, the choice of predictors is made branch by branch. In any one benchmark, many branches may use the bimodal prediction while other branches use gshare.

[Figure 14: bimodal/gshare Predictor Performance by Benchmark. Fraction of predictions from bimodal (%) for doduc, eqntott, espress, fpppp, gcc, li, mat300, nasa7, spice, tomcatv, and the average.]

Figure 15 shows how using bimodal/gshare affects the number of instructions between mispredicted branches. The combination increases this measure significantly for some of the benchmarks, especially some of the less predictable benchmarks like gcc.

[Figure 15: Instructions between Mispredicted Branches. Instructions (log scale, 10 to 10000) for bimodal, gshare, and bimodal/gshare on doduc, eqntott, espress, fpppp, gcc, li, mat300, nasa7, spice, and tomcatv.]

Figure 16 shows the combined predictor accuracy for a range of predictor sizes. As earlier, only the average accuracy across the SPEC89 benchmarks run for 10M instructions is shown. In this chart, we choose to display a bimodal/gshare predictor where the gshare array contains twice as many counters (bimodalN/gshareN+1).
This allows a more direct comparison to gselect-best since the total predictor size is an integral number of bytes. Predictor bimodalN/gshareN+1 also has slightly better performance since the predictor selection array cost is amortized over more predictor counters. The performance of bimodalN/gshareN+1 is significantly better than gselect-best. The 1KB combined predictor has nearly the same performance as a 16KB gselect-best predictor.

Figure 16 also shows the performance of a combined local/gshare predictor where it outperforms bimodal/gshare. For this plot, all the local/gshare arrays have the same number of entries. For sizes of 2KB and larger, the local/gshare predictor has better accuracy than bimodalN/gshareN+1. For large arrays this accuracy approaches 98.1% correct. This result is as we would expect since large local predictors subsume the information available to a bimodal predictor.

[Figure 16: Combined Predictor Performance by Size. Conditional branch prediction accuracy (%) versus predictor size (bytes); curves: bimodalN/gshareN+1, local/gshare, gselect-best, bimodal.]

9 Conclusions

In this paper, we have presented two new methods for improving branch prediction performance. First, we showed that using the bit-wise exclusive OR of the global branch history and the branch address to access predictor counters results in better performance for a given counter array size. We also showed that the advantages of multiple branch predictors can be combined by providing multiple predictors and by keeping track of which predictor is more accurate for the current branch. These methods permit construction of predictors that are more accurate for a given size than previously known methods. Also, combined predictors using local and global branch information reach a prediction accuracy of 98.1% as compared to 97.1% for the previously most accurate known scheme.
The approaches presented here should be increasingly useful as machine designers attempt to take advantage of instruction-level parallelism and mispredicted branches become a critical performance bottleneck.

10 Suggestions for Future Work

There are a number of ways this study could be extended to possibly find more accurate and less costly branch predictors. First, there are a large number of parameters, such as sizes, associativities, and pipeline costs, that were not fully explored here. Careful exploration of this space might yield better predictors. Second, other sources of information, such as whether the branch target is forward or backward, might be usefully added to increase accuracy. Third, the typically sparse branch history might be compressed to reduce the number of counters needed.

Finally, a compiler with profile support might be able to reduce or eliminate the need for branch predictors as described here. For example, previous work has shown that using profile information to set a likely-taken bit in the branch results in accuracy close to that of the bimodal scheme. Thus, for code optimized in this way, the bimodal predictor in the bimodal/gshare scheme might be unnecessary. More elaborate optimization might eliminate the need for the gshare predictor as well. This might be done with either more careful inspection of branch conditions or more elaborate profiling of typical branch patterns. For example, branches with correlated conditions might be detected with semantic analysis or by more elaborate profiling that could detect branch correlation dynamically. This information might then be used to duplicate or restructure the branches so that a simpler branch prediction method could take advantage of the correlation. Furthermore, branch patterns caused by loops might be exploited by careful unrolling that takes advantage of the typical iteration count, detected either semantically or with more elaborate profiling.
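As a rough illustration of the likely-taken bit mentioned above, the following Python sketch derives one bit per branch from a profile and measures how well those static bits predict a trace. The profile format (a list of address/outcome pairs), the function names, and the tie-breaking rule are hypothetical; the sketch is not taken from the cited previous work.

```python
from collections import Counter


def likely_taken_bits(profile):
    """Return {branch_address: bit} from (branch_address, taken) pairs.

    A branch gets bit True when it was taken in at least half of its
    profiled executions (ties go to "taken", an arbitrary assumption).
    """
    taken = Counter()
    total = Counter()
    for addr, was_taken in profile:
        total[addr] += 1
        taken[addr] += int(was_taken)
    return {addr: 2 * taken[addr] >= total[addr] for addr in total}


def static_accuracy(bits, trace, default=True):
    """Fraction of branches in `trace` predicted correctly by the static bits.

    Branches that never appeared in the profile fall back to `default`.
    """
    correct = sum(bits.get(addr, default) == was_taken
                  for addr, was_taken in trace)
    return correct / len(trace)


if __name__ == "__main__":
    profile = [(0x10, True)] * 9 + [(0x10, False)] + [(0x20, False)] * 3
    bits = likely_taken_bits(profile)
    print(bits, static_accuracy(bits, profile))
```

In practice the profile and the evaluation trace would come from different runs of the program; the demonstration reuses one list only to keep the sketch short.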
A Appendix

The local branch prediction scheme has a number of variations that were not discussed in Section 4. In this appendix, we discuss two variations and show that the combined predictor has better performance than these alternatives.

First, as in the gselect scheme, it is possible to index the counter array with both the branch address and the local history. Again, there are a large number of possibilities. Figure 17 shows the family of performance curves in which the number of history bits used to index the counter array is held constant. For example, local-2h means that 2 history bits are used to index the counter array and the remaining index bits come from the branch address. We keep the assumption that the history array and the counter array have the same number of entries. As the figure shows, reducing the number of history bits can improve performance. This is mainly due to the reduction in the size of the history array itself. The figure also shows that the bimodal/gshare predictor performance is still significantly better than that of the different local predictor variations. In addition, the bimodal/gshare predictor requires only a single level of array access.

[Figure 17: Local Predictor Performance with Address Bits. Vertical axis as labelled in the source: Miss Rate (%), ticks 88 to 98. Horizontal axis: Predictor Size (bytes), 32 to 64K. Series: bimodal/gshare, local, local-2h, local-4h, local-6h, local-8h, local-10h, local-12h.]

Another variation of the local scheme is to change the number of history entries. Figure 18 shows the resulting performance. The notation local-64HR signifies that there are 64 history table entries. As the figure shows, using the same number of history table entries as counters is usually a good choice.

[Figure 18: Local Predictor Performance with Varying Number of History Registers. Vertical axis: Conditional Branch Prediction Accuracy (%), 88 to 97. Horizontal axis: Predictor Size (bytes), 32 to 64K. Series: local-64HR, local-256HR, local-1KHR, local-4KHR, local-16KHR, local.]
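The address-plus-history indexing used by the local-Nh variants in the appendix can be sketched in Python as follows. The function name and the bit layout (address bits in the high part of the index, history bits in the low part) are assumptions for illustration; the text only states how many bits come from each source.

```python
def local_nh_index(pc, local_history, index_bits, history_bits):
    """Form a counter-array index from address bits and local history bits.

    `history_bits` of the index come from the branch's local history
    (newest outcomes) and the remaining `index_bits - history_bits`
    come from the low bits of the branch address `pc`.
    """
    if not 0 <= history_bits <= index_bits:
        raise ValueError("history_bits must be between 0 and index_bits")
    addr_bits = index_bits - history_bits
    addr_part = pc & ((1 << addr_bits) - 1)
    hist_part = local_history & ((1 << history_bits) - 1)
    # Assumed layout: address bits above, history bits below.
    return (addr_part << history_bits) | hist_part


if __name__ == "__main__":
    # A local-2h style index into a 1K-entry counter array.
    print(local_nh_index(pc=0xB6, local_history=0b1101,
                         index_bits=10, history_bits=2))
```

With `history_bits=0` the index depends on the address alone, and with `history_bits=index_bits` it depends on the local history alone, so the family of curves in Figure 17 corresponds to sweeping this one parameter.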
