Computer Organization And Design cis501

1 Fundamentals of Computer Design

And now for something completely different.
Monty Python’s Flying Circus

1.1 Introduction
1.2 The Task of a Computer Designer
1.3 Technology and Computer Usage Trends
1.4 Cost and Trends in Cost
1.5 Measuring and Reporting Performance
1.6 Quantitative Principles of Computer Design
1.7 Putting It All Together: The Concept of Memory Hierarchy
1.8 Fallacies and Pitfalls
1.9 Concluding Remarks
1.10 Historical Perspective and References
Exercises

1.1 Introduction

Computer technology has made incredible progress in the past half century. In 1945, there were no stored-program computers. Today, a few thousand dollars will purchase a personal computer that has more performance, more main memory, and more disk storage than a computer bought in 1965 for $1 million. This rapid rate of improvement has come both from advances in the technology used to build computers and from innovation in computer design. While technological improvements have been fairly steady, progress arising from better computer architectures has been much less consistent. During the first 25 years of electronic computers, both forces made a major contribution; but beginning in about 1970, computer designers became largely dependent upon integrated circuit technology. During the 1970s, performance continued to improve at about 25% to 30% per year for the mainframes and minicomputers that dominated the industry.

The late 1970s saw the emergence of the microprocessor. The ability of the microprocessor to ride the improvements in integrated circuit technology more closely than the less integrated mainframes and minicomputers led to a higher rate of improvement: roughly 35% growth per year in performance. This growth rate, combined with the cost advantages of a mass-produced microprocessor, led to an increasing fraction of the computer business being based on microprocessors.
In addition, two significant changes in the computer marketplace made it easier than ever before to be commercially successful with a new architecture. First, the virtual elimination of assembly language programming reduced the need for object-code compatibility. Second, the creation of standardized, vendor-independent operating systems, such as UNIX, lowered the cost and risk of bringing out a new architecture. These changes made it possible to successfully develop a new set of architectures, called RISC architectures, in the early 1980s. Since the RISC-based microprocessors reached the market in the mid 1980s, these machines have grown in performance at an annual rate of over 50%. Figure 1.1 shows this difference in performance growth rates.

[Figure 1.1: line chart of SPECint rating (0 to 350) versus year (1984 to 1995), with trend lines of 1.58x per year and 1.35x per year. Labeled machines: SUN4, MIPS R2000, MIPS R3000, IBM Power1, HP 9000, IBM Power2, DEC Alpha.]

FIGURE 1.1 Growth in microprocessor performance since the mid 1980s has been substantially higher than in earlier years. This chart plots the performance as measured by the SPECint benchmarks. Prior to the mid 1980s, microprocessor performance growth was largely technology driven and averaged about 35% per year. The increase in growth since then is attributable to more advanced architectural ideas. By 1995 this growth leads to more than a factor of five difference in performance. Performance for floating-point-oriented calculations has increased even faster.

The effect of this dramatic growth rate has been twofold. First, it has significantly enhanced the capability available to computer users. As a simple example, consider the highest-performance workstation announced in 1993, an IBM Power-2 machine.
Compared with a CRAY Y-MP supercomputer introduced in 1988 (probably the fastest machine in the world at that point), the workstation offers comparable performance on many floating-point programs (the performance for the SPEC floating-point benchmarks is similar) and better performance on integer programs for a price that is less than one-tenth of the supercomputer! Second, this dramatic rate of improvement has led to the dominance of microprocessor-based computers across the entire range of the computer design. Workstations and PCs have emerged as major products in the computer industry. Minicomputers, which were traditionally made from off-the-shelf logic or from gate arrays, have been replaced by servers made using microprocessors. Mainframes are slowly being replaced with multiprocessors consisting of small numbers of off-the-shelf microprocessors. Even high-end supercomputers are being built with collections of microprocessors. Freedom from compatibility with old designs and the use of microprocessor technology led to a renaissance in computer design, which emphasized both architectural innovation and efficient use of technology improvements. This renaissance is responsible for the higher performance growth shown in Figure 1.1—a rate that is unprecedented in the computer industry. This rate of growth has compounded so that by 1995, the difference between the highest-performance microprocessors and what would have been obtained by relying solely on technology is more than a factor of five. This text is about the architectural ideas and accompanying compiler improvements that have made this incredible growth rate possible. At the center of this dramatic revolution has been the development of a quantitative approach to computer design and analysis that uses empirical observations of programs, experimentation, and simulation as its tools. It is this style and approach to computer design that is reflected in this text. 
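The factor-of-five gap quoted above follows from compounding the two growth rates shown in Figure 1.1. The short Python sketch below illustrates the arithmetic; the function name is made up for this example, and the 11-year span (1984 to 1995) is an assumption about the baseline year, since the text only says "mid 1980s".

```python
def compounded_gap(fast_rate: float, slow_rate: float, years: int) -> float:
    """Ratio between two quantities that start equal and then grow at
    different annual multipliers for the given number of years."""
    return (fast_rate / slow_rate) ** years


# 1.58x and 1.35x per year are the trend lines quoted in Figure 1.1.
# The 11-year span is an assumed baseline, not a figure from the text.
gap = compounded_gap(fast_rate=1.58, slow_rate=1.35, years=11)
print(f"performance gap after 11 years: {gap:.1f}x")
```

With these inputs the ratio lands between 5 and 6, in line with the "more than a factor of five" stated in the text. A shorter assumed span gives a smaller gap, which is why the baseline year matters.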
Sustaining the recent improvements in cost and performance will require continuing innovations in computer design, and the authors believe such innovations will be founded on this quantitative approach to computer design. Hence, this book has been written not only to document this design style, but also to stimulate you to contribute to this progress.

1.2 The Task of a Computer Designer

The task the computer designer faces is a complex one: Determine what attributes are important for a new machine, then design a machine to maximize performance while staying within cost constraints. This task has many aspects, including instruction set design, functional organization, logic design, and implementation. The implementation may encompass integrated circuit design, packaging, power, and cooling. Optimizing the design requires familiarity with a very wide range of technologies, from compilers and operating systems to logic design and packaging.

In the past, the term computer architecture often referred only to instruction set design. Other aspects of computer design were called implementation, often insinuating that implementation is uninteresting or less challenging. The authors believe this view is not only incorrect, but is even responsible for mistakes in the design of new instruction sets. The architect’s or designer’s job is much more than instruction set design, and the technical hurdles in the other aspects of the project are certainly as challenging as those encountered in doing instruction set design. This is particularly true at the present when the differences among instruction sets are small (see Appendix C). In this book the term instruction set architecture refers to the actual programmer-visible instruction set. The instruction set architecture serves as the boundary between the software and hardware, and that topic is the focus of Chapter 2.
The implementation of a machine has two components: organization and hardware. The term organization includes the high-level aspects of a computer’s design, such as the memory system, the bus structure, and the internal CPU (central processing unit—where arithmetic, logic, branching, and data transfer are implemented) design. For example, two machines with the same instruction set architecture but different organizations are the SPARCstation-2 and SPARCstation-20. Hardware is used to refer to the specifics of a machine. This would include the detailed logic design and the packaging technology of the machine. Often a line of machines contains machines with identical instruction set architectures and nearly identical organizations, but they differ in the detailed hardware implementation. For example, two versions of the Silicon Graphics Indy differ in clock rate and in detailed cache structure. In this book the word architecture is intended to cover all three aspects of computer design—instruction set architecture, organization, and hardware. Computer architects must design a computer to meet functional requirements as well as price and performance goals. Often, they also have to determine what the functional requirements are, and this can be a major task. The requirements may be specific features, inspired by the market. Application software often drives the choice of certain functional requirements by determining how the machine will be used. If a large body of software exists for a certain instruction set architecture, the architect may decide that a new machine should implement an existing instruction set. The presence of a large market for a particular class of applications might encourage the designers to incorporate requirements that would make the machine competitive in that market. Figure 1.2 summarizes some requirements that need to be considered in designing a new machine. Many of these requirements and features will be examined in depth in later chapters. 
| Functional requirements | Typical features required or supported |
|---|---|
| Application area | Target of computer |
| - General purpose | Balanced performance for a range of tasks (Ch 2,3,4,5) |
| - Scientific | High-performance floating point (App A,B) |
| - Commercial | Support for COBOL (decimal arithmetic); support for databases and transaction processing (Ch 2,7) |
| Level of software compatibility | Determines amount of existing software for machine |
| - At programming language | Most flexible for designer; need new compiler (Ch 2,8) |
| - Object code or binary compatible | Instruction set architecture is completely defined (little flexibility), but no investment needed in software or porting programs |
| Operating system requirements | Necessary features to support chosen OS (Ch 5,7) |
| - Size of address space | Very important feature (Ch 5); may limit applications |
| - Memory management | Required for modern OS; may be paged or segmented (Ch 5) |
| - Protection | Different OS and application needs: page vs. segment protection (Ch 5) |
| Standards | Certain standards may be required by marketplace |
| - Floating point | Format and arithmetic: IEEE, DEC, IBM (App A) |
| - I/O bus | For I/O devices: VME, SCSI, Fiberchannel (Ch 7) |
| - Operating systems | UNIX, DOS, or vendor proprietary |
| - Networks | Support required for different networks: Ethernet, ATM (Ch 6) |
| - Programming languages | Languages (ANSI C, Fortran 77, ANSI COBOL) affect instruction set (Ch 2) |

FIGURE 1.2 Summary of some of the most important functional requirements an architect faces. The left-hand column describes the class of requirement, while the right-hand column gives examples of specific features that might be needed. The right-hand column also contains references to chapters and appendices that deal with the specific issues.

Once a set of functional requirements has been established, the architect must try to optimize the design. Which design choices are optimal depends, of course, on the choice of metrics. The most common metrics involve cost and performance.
Given some application domain, the architect can try to quantify the performance of the machine by a set of programs that are chosen to represent that application domain. Other measurable requirements may be important in some markets; reliability and fault tolerance are often crucial in transaction processing environments. Throughout this text we will focus on optimizing machine cost/performance.

In choosing between two designs, one factor that an architect must consider is design complexity. Complex designs take longer to complete, prolonging time to market. This means a design that takes longer will need to have higher performance to be competitive. The architect must be constantly aware of the impact of his design choices on the design time for both hardware and software.

In addition to performance, cost is the other key parameter in optimizing cost/performance. In addition to cost, designers must be aware of important trends in both the implementation technology and the use of computers. Such trends not only impact future cost, but also determine the longevity of an architecture. The next two sections discuss technology and cost trends.

1.3 Technology and Computer Usage Trends

If an instruction set architecture is to be successful, it must be designed to survive changes in hardware technology, software technology, and application characteristics. The designer must be especially aware of trends in computer usage and in computer technology. After all, a successful new instruction set architecture may last decades; the core of the IBM mainframe has been in use since 1964. An architect must plan for technology changes that can increase the lifetime of a successful machine.

Trends in Computer Usage

The design of a computer is fundamentally affected both by how it will be used and by the characteristics of the underlying implementation technology.
Changes in usage or in implementation technology affect the computer design in different ways, from motivating changes in the instruction set to shifting the payoff from important techniques such as pipelining or caching. Trends in software technology and how programs will use the machine have a long-term impact on the instruction set architecture. One of the most important software trends is the increasing amount of memory used by programs and their data. The amount of memory needed by the average program has grown by a factor of 1.5 to 2 per year! This translates to a consumption of address bits at a rate of approximately 1/2 bit to 1 bit per year. This rapid rate of growth is driven both by the needs of programs as well as by the improvements in DRAM technology that continually improve the cost per bit. Underestimating address-space growth is often the major reason why an instruction set architecture must be abandoned. (For further discussion, see Chapter 5 on memory hierarchy.) Another important software trend in the past 20 years has been the replacement of assembly language by high-level languages. This trend has resulted in a larger role for compilers, forcing compiler writers and architects to work together closely to build a competitive machine. Compilers have become the primary interface between user and machine. In addition to this interface role, compiler technology has steadily improved, taking on newer functions and increasing the efficiency with which a program can be run on a machine. This improvement in compiler technology has included traditional optimizations, which we discuss in Chapter 2, as well as transformations aimed at improving pipeline behavior (Chapters 3 and 4) and memory system behavior (Chapter 5). How to balance the responsibility for efficient execution in modern processors between the compiler and the hardware continues to be one of the hottest architecture debates of the 1990s. 
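The link between memory growth and address-bit consumption mentioned above is a base-2 logarithm: if memory needs grow by a factor g per year, the address must widen by log2(g) bits per year. A minimal Python sketch of that conversion follows; the function name is an illustration, not something from the text.

```python
import math


def address_bits_per_year(memory_growth_factor: float) -> float:
    """Extra address bits needed each year when memory use grows by
    the given multiplicative factor per year."""
    return math.log2(memory_growth_factor)


# Growth factors of 1.5x and 2x per year are the range quoted in the text.
for factor in (1.5, 2.0):
    bits = address_bits_per_year(factor)
    print(f"{factor}x per year -> {bits:.2f} address bits per year")
```

A factor of 2 per year costs exactly one bit per year, and a factor of 1.5 costs a little more than half a bit, which matches the "approximately 1/2 bit to 1 bit per year" stated in the text.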
Improvements in compiler technology played a major role in making vector machines (Appendix B) successful. The development of compiler technology for parallel machines is likely to have a large impact in the future.

Trends in Implementation Technology

To plan for the evolution of a machine, the designer must be especially aware of rapidly occurring changes in implementation technology. Three implementation technologies, which change at a dramatic pace, are critical to modern implementations:

- Integrated circuit logic technology: Transistor density increases by about 50% per year, quadrupling in just over three years. Increases in die size are less predictable, ranging from 10% to 25% per year. The combined effect is a growth rate in transistor count on a chip of between 60% and 80% per year. Device speed increases nearly as fast; however, metal technology used for wiring does not improve, causing cycle times to improve at a slower rate. We discuss this further in the next section.
- Semiconductor DRAM: Density increases by just under 60% per year, quadrupling in three years. Cycle time has improved very slowly, decreasing by about one-third in 10 years. Bandwidth per chip increases as the latency decreases. In addition, changes to the DRAM interface have also improved the bandwidth; these are discussed in Chapter 5. In the past, DRAM (dynamic random-access memory) technology has improved faster than logic technology. This difference has occurred because of reductions in the number of transistors per DRAM cell and the creation of specialized technology for DRAMs. As the improvement from these sources diminishes, the density growth in logic technology and memory technology should become comparable.
- Magnetic disk technology: Recently, disk density has been improving by about 50% per year, almost quadrupling in three years. Prior to 1990, density increased by about 25% per year, doubling in three years.
It appears that disk technology will continue the faster density growth rate for some time to come. Access time has improved by one-third in 10 years. This technology is central to Chapter 6.

These rapidly changing technologies impact the design of a microprocessor that may, with speed and technology enhancements, have a lifetime of five or more years. Even within the span of a single product cycle (two years of design and two years of production), key technologies, such as DRAM, change sufficiently that the designer must plan for these changes. Indeed, designers often design for the next technology, knowing that when a product begins shipping in volume that next technology may be the most cost-effective or may have performance advantages. Traditionally, cost has decreased very closely to the rate at which density increases.

These technology changes are not continuous but often occur in discrete steps. For example, DRAM sizes are always increased by factors of four because of the basic design structure. Thus, rather than doubling every 18 months, DRAM technology quadruples every three years. This stepwise change in technology leads to thresholds that can enable an implementation technique that was previously impossible. For example, when MOS technology reached the point where it could put between 25,000 and 50,000 transistors on a single chip in the early 1980s, it became possible to build a 32-bit microprocessor on a single chip. By eliminating chip crossings within the processor, a dramatic increase in cost/performance was possible. This design was simply infeasible until the technology reached a certain point. Such technology thresholds are not rare and have a significant impact on a wide variety of design decisions.

1.4 Cost and Trends in Cost

Although there are computer designs where costs tend to be ignored (specifically supercomputers), cost-sensitive designs are of growing importance.
Indeed, in the past 15 years, the use of technology improvements to achieve lower cost, as well as increased performance, has been a major theme in the computer industry. Textbooks often ignore the cost half of cost/performance because costs change, thereby dating books, and because the issues are complex. Yet an understanding of cost and its factors is essential for designers to be able to make intelligent decisions about whether or not a new feature should be included in designs where cost is an issue. (Imagine architects designing skyscrapers without any information on costs of steel beams and concrete.) This section focuses on cost, specifically on the components of cost and the major trends. The Exercises and Examples use specific cost data that will change over time, though the basic determinants of cost are less time sensitive. Entire books are written about costing, pricing strategies, and the impact of volume. This section can only introduce you to these topics by discussing some of the major factors that influence cost of a computer design and how these factors are changing over time. The Impact of Time, Volume, Commodization, and Packaging The cost of a manufactured computer component decreases over time even without major improvements in the basic implementation technology. The underlying principle that drives costs down is the learning curve—manufacturing costs decrease over time. The learning curve itself is best measured by change in yield— the percentage of manufactured devices that survives the testing procedure. Whether it is a chip, a board, or a system, designs that have twice the yield will have basically half the cost. Understanding how the learning curve will improve yield is key to projecting costs over the life of the product. As an example of the learning curve in action, the cost per megabyte of DRAM drops over the long term by 40% per year. 
A more dramatic version of the same information is shown in Figure 1.3, where the cost of a new DRAM chip is depicted over its lifetime. Between the start of a project and the shipping of a product, say two years, the cost of a new DRAM drops by a factor of between five and 10 in constant dollars. Since not all component costs change at the same rate, designs based on projected costs result in different cost/performance trade-offs than those using current costs. The caption of Figure 1.3 discusses some of the long-term trends in DRAM cost.

[Figure 1.3: line chart of dollars per DRAM chip (0 to 80) versus year (1978 to 1995), with one curve per generation (16 KB, 64 KB, 256 KB, 1 MB, 4 MB, 16 MB) and an annotation marking the final chip cost.]

FIGURE 1.3 Prices of four generations of DRAMs over time in 1977 dollars, showing the learning curve at work. A 1977 dollar is worth about $2.44 in 1995; most of this inflation occurred in the period of 1977–82, during which the value changed to $1.61. The cost of a megabyte of memory has dropped incredibly during this period, from over $5000 in 1977 to just over $6 in 1995 (in 1977 dollars)! Each generation drops in constant dollar price by a factor of 8 to 10 over its lifetime. The increasing cost of fabrication equipment for each new generation has led to slow but steady increases in both the starting price of a technology and the eventual, lowest price. Periods when demand exceeded supply, such as 1987–88 and 1992–93, have led to temporary higher pricing, which shows up as a slowing in the rate of price decrease.

Volume is a second key factor in determining cost. Increasing volumes affect cost in several ways. First, they decrease the time needed to get down the learning curve, which is partly proportional to the number of systems (or chips) manufactured.
Second, volume decreases cost, since it increases purchasing and manufacturing efficiency. As a rule of thumb, some designers have estimated that cost decreases about 10% for each doubling of volume. Also, volume decreases the amount of development cost that must be amortized by each machine, thus allowing cost and selling price to be closer. We will return to the other factors influencing selling price shortly. Commodities are products that are sold by multiple vendors in large volumes and are essentially identical. Virtually all the products sold on the shelves of grocery stores are commodities, as are standard DRAMs, small disks, monitors, and keyboards. In the past 10 years, much of the low end of the computer business has become a commodity business focused on building IBM-compatible PCs. There are a variety of vendors that ship virtually identical products and are highly competitive. Of course, this competition decreases the gap between cost and selling price, but it also decreases cost. This occurs because a commodity market has both volume and a clear product definition. This allows multiple suppliers to compete in building components for the commodity product. As a result, the overall product cost is lower because of the competition among the suppliers of the components and the volume efficiencies the suppliers can achieve. Cost of an Integrated Circuit Why would a computer architecture book have a section on integrated circuit costs? In an increasingly competitive computer marketplace where standard parts—disks, DRAMs, and so on—are becoming a significant portion of any system’s cost, integrated circuit costs are becoming a greater portion of the cost that varies between machines, especially in the high-volume, cost-sensitive portion of the market. Thus computer designers must understand the costs of chips to understand the costs of current computers. We follow here the U.S. accounting approach to the costs of chips. 
While the costs of integrated circuits have dropped exponentially, the basic procedure of silicon manufacture is unchanged: A wafer is still tested and chopped into dies that are packaged (see Figures 1.4 and 1.5). Thus the cost of a packaged integrated circuit is

$$\text{Cost of integrated circuit} = \frac{\text{Cost of die} + \text{Cost of testing die} + \text{Cost of packaging and final test}}{\text{Final test yield}}$$

In this section, we focus on the cost of dies, summarizing the key issues in testing and packaging at the end. A longer discussion of the testing costs and packaging costs appears in the Exercises.

FIGURE 1.4 Photograph of an 8-inch wafer containing Intel Pentium microprocessors. The die size is 480.7 mm² and the total number of dies is 63. (Courtesy Intel.)

FIGURE 1.5 Photograph of an 8-inch wafer containing PowerPC 601 microprocessors. The die size is 122 mm². The number of dies on the wafer is 200 after subtracting the test dies (the odd-looking dies that are scattered around). (Courtesy IBM.)

To learn how to predict the number of good chips per wafer requires first learning how many dies fit on a wafer and then learning how to predict the percentage of those that will work. From there it is simple to predict cost:

$$\text{Cost of die} = \frac{\text{Cost of wafer}}{\text{Dies per wafer} \times \text{Die yield}}$$

The most interesting feature of this first term of the chip cost equation is its sensitivity to die size, shown below. The number of dies per wafer is basically the area of the wafer divided by the area of the die. It can be more accurately estimated by

$$\text{Dies per wafer} = \frac{\pi \times (\text{Wafer diameter}/2)^2}{\text{Die area}} - \frac{\pi \times \text{Wafer diameter}}{\sqrt{2 \times \text{Die area}}}$$

The first term is the ratio of wafer area ($\pi r^2$) to die area.
The second compensates for the “square peg in a round hole” problem: rectangular dies near the periphery of round wafers. Dividing the circumference ($\pi d$) by the diagonal of a square die is approximately the number of dies along the edge. For example, a wafer 20 cm (≈ 8 inch) in diameter produces $3.14 \times 100 - (3.14 \times 20 / 1.41) = 269$ 1-cm dies.

EXAMPLE Find the number of dies per 20-cm wafer for a die that is 1.5 cm on a side.

ANSWER The total die area is 2.25 cm². Thus

$$\text{Dies per wafer} = \frac{\pi \times (20/2)^2}{2.25} - \frac{\pi \times 20}{\sqrt{2 \times 2.25}} = \frac{314}{2.25} - \frac{62.8}{2.12} = 110$$

But this only gives the maximum number of dies per wafer. The critical question is, what fraction of the dies on a wafer are good, that is, what is the die yield? A simple empirical model of integrated circuit yield, which assumes that defects are randomly distributed over the wafer and that yield is inversely proportional to the complexity of the fabrication process, leads to the following:

$$\text{Die yield} = \text{Wafer yield} \times \left(1 + \frac{\text{Defects per unit area} \times \text{Die area}}{\alpha}\right)^{-\alpha}$$

where wafer yield accounts for wafers that are completely bad and so need not be tested. For simplicity, we’ll just assume the wafer yield is 100%. Defects per unit area is a measure of the random and manufacturing defects that occur. In 1995, these values typically range between 0.6 and 1.2 per square centimeter, depending on the maturity of the process (recall the learning curve, mentioned earlier). Lastly, $\alpha$ is a parameter that corresponds roughly to the number of masking levels, a measure of manufacturing complexity, critical to die yield. For today’s multilevel metal CMOS processes, a good estimate is $\alpha = 3.0$.

EXAMPLE Find the die yield for dies that are 1 cm on a side and 1.5 cm on a side, assuming a defect density of 0.8 per cm².
ANSWER The total die areas are 1 cm² and 2.25 cm². For the smaller die the yield is

$$\text{Die yield} = \left(1 + \frac{0.8 \times 1}{3}\right)^{-3} = 0.49$$

For the larger die, it is

$$\text{Die yield} = \left(1 + \frac{0.8 \times 2.25}{3}\right)^{-3} = 0.24$$

The bottom line is the number of good dies per wafer, which comes from multiplying dies per wafer by die yield. The examples above predict 132 good 1-cm² dies from the 20-cm wafer and 26 good 2.25-cm² dies. Most high-end microprocessors fall between these two sizes, with some being as large as 2.75 cm² in 1995. Low-end processors are sometimes as small as 0.8 cm², while processors used for embedded control (in printers, automobiles, etc.) are often just 0.5 cm². (Figure 1.22 on page 63 in the Exercises shows the die size and technology for several current microprocessors.) Occasionally dies become pad limited: the amount of die area is determined by the perimeter rather than the logic in the interior. This may lead to a higher yield, since defects in empty silicon are less serious!

Processing a 20-cm-diameter wafer in a leading-edge technology with 3–4 metal layers costs between $3000 and $4000 in 1995. Assuming a processed wafer cost of $3500, the cost of the 1-cm² die is around $27, while the cost per die of the 2.25-cm² die is about $140, or slightly over 5 times the cost for a die that is 2.25 times larger.

What should a computer designer remember about chip costs? The manufacturing process dictates the wafer cost, wafer yield, $\alpha$, and defects per unit area, so the sole control of the designer is die area. Since $\alpha$ is typically 3 for the advanced processes in use today, die costs are proportional to the fourth (or higher) power of the die area:

$$\text{Cost of die} = f(\text{Die area}^4)$$

The computer designer affects die size, and hence cost, both by what functions are included on or excluded from the die and by the number of I/O pins.
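The dies-per-wafer, die-yield, and cost-of-die relations from this section can be chained together in a few lines of Python. The sketch below is an illustration rather than part of the original text: the function names and default arguments are assumptions chosen here, and wafer yield defaults to 100% as the text assumes. Because it carries full precision through every step, its dollar figures come out close to, but not identical with, the rounded values quoted above.

```python
import math


def dies_per_wafer(wafer_diameter_cm: float, die_area_cm2: float) -> float:
    """Wafer area over die area, minus the dies lost along the round edge."""
    wafer_area = math.pi * (wafer_diameter_cm / 2) ** 2
    edge_loss = math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2)
    return wafer_area / die_area_cm2 - edge_loss


def die_yield(defects_per_cm2: float, die_area_cm2: float,
              alpha: float = 3.0, wafer_yield: float = 1.0) -> float:
    """Empirical yield model from the text; alpha = 3.0 is its CMOS estimate."""
    return wafer_yield * (1 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)


def cost_per_die(wafer_cost: float, wafer_diameter_cm: float,
                 die_area_cm2: float, defects_per_cm2: float) -> float:
    """Wafer cost divided by the expected number of good dies."""
    good_dies = (dies_per_wafer(wafer_diameter_cm, die_area_cm2)
                 * die_yield(defects_per_cm2, die_area_cm2))
    return wafer_cost / good_dies


# Inputs taken from the worked examples: 20-cm wafer, 0.8 defects per cm^2,
# $3500 per processed wafer, dies of 1 cm^2 and 2.25 cm^2.
for area in (1.0, 2.25):
    n = dies_per_wafer(20, area)
    y = die_yield(0.8, area)
    c = cost_per_die(3500, 20, area, 0.8)
    print(f"{area} cm^2: {n:.1f} dies, yield {y:.2f}, "
          f"{n * y:.1f} good dies, ${c:.2f} per die")
```

Running it shows how quickly cost climbs with area: the die that is 2.25 times larger costs roughly five times as much, because it both yields fewer candidates per wafer and fails more often.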
Before we have a part that is ready for use in a computer, the part must be tested (to separate the good dies from the bad), packaged, and tested again after packaging. These steps all add costs. These processes and their contribution to cost are discussed and evaluated in Exercise 1.8.

Distribution of Cost in a System: An Example

To put the costs of silicon in perspective, Figure 1.6 shows the approximate cost breakdown for a color desktop machine in the late 1990s. While costs for units like DRAMs will surely drop over time from those in Figure 1.6, costs for units whose prices have already been cut, like displays and cabinets, will change very little. Furthermore, we can expect that future machines will have larger memories and disks, meaning that prices drop more slowly than the technology improvement. The processor subsystem accounts for only 6% of the overall cost. Although in a mid-range or high-end design this number would be larger, the overall breakdown across major subsystems is likely to be similar.

| System | Subsystem | Fraction of total |
|---|---|---|
| Cabinet | Sheet metal, plastic | 1% |
| | Power supply, fans | 2% |
| | Cables, nuts, bolts | 1% |
| | Shipping box, manuals | 0% |
| | Subtotal | 4% |
| Processor board | Processor | 6% |
| | DRAM (64 MB) | 36% |
| | Video system | 14% |
| | I/O system | 3% |
| | Printed circuit board | 1% |
| | Subtotal | 60% |
| I/O devices | Keyboard and mouse | 1% |
| | Monitor | 22% |
| | Hard disk (1 GB) | 7% |
| | DAT drive | 6% |
| | Subtotal | 36% |

FIGURE 1.6 Estimated distribution of costs of the components in a low-end, late 1990s color desktop workstation assuming 100,000 units. Notice that the largest single item is memory! Costs for a high-end PC would be similar, except that the amount of memory might be 16–32 MB rather than 64 MB. This chart is based on data from Andy Bechtolsheim of Sun Microsystems, Inc. Touma [1993] discusses workstation costs and pricing.
Cost Versus Price: Why They Differ and By How Much

Costs of components may confine a designer’s desires, but they are still far from representing what the customer must pay. But why should a computer architecture book contain pricing information? Cost goes through a number of changes before it becomes price, and the computer designer should understand how a design decision will affect the potential selling price. For example, changing cost by $1000 may change price by $3000 to $4000. Without understanding the relationship of cost to price the computer designer may not understand the impact on price of adding, deleting, or replacing components.

The relationship between price and volume can increase the impact of changes in cost, especially at the low end of the market. Typically, fewer computers are sold as the price increases. Furthermore, as volume decreases, costs rise, leading to further increases in price. Thus, small changes in cost can have a larger than obvious impact. The relationship between cost and price is a complex one with entire books written on the subject. The purpose of this section is to give you a simple introduction to what factors determine price and typical ranges for these factors.

The categories that make up price can be shown either as a tax on cost or as a percentage of the price. We will look at the information both ways. These differences between price and cost also depend on where in the computer marketplace a company is selling. To show these differences, Figures 1.7 and 1.8 on page 16 show how the difference between cost of materials and list price is decomposed, with the price increasing from left to right as we add each type of overhead.

Direct costs refer to the costs directly related to making a product. These include labor costs, purchasing components, scrap (the leftover from yield), and warranty, which covers the costs of systems that fail at the customer’s site during the warranty period.
Direct cost typically adds 20% to 40% to component cost. Service or maintenance costs are not included because the customer typically pays those costs, although a warranty allowance may be included here or in gross margin, discussed next. The next addition is called the gross margin, the company’s overhead that cannot be billed directly to one product. This can be thought of as indirect cost. It includes the company’s research and development (R&D), marketing, sales, manufacturing equipment maintenance, building rental, cost of financing, pretax profits, and taxes. When the component costs are added to the direct cost and gross margin, we reach the average selling price—ASP in the language of MBAs—the money that comes directly to the company for each product sold. The gross margin is typically 20% to 55% of the average selling price, depending on the uniqueness of the product. Manufacturers of low-end PCs generally have lower gross margins for several reasons. First, their R&D expenses are lower. Second, their cost of sales is lower, since they use indirect distribution (by mail, phone order, or retail store) rather than salespeople. Third, because their products are less unique, competition is more intense, thus forcing lower prices and often lower profits, which in turn lead to a lower gross margin. List price and average selling price are not the same. One reason for this is that companies offer volume discounts, lowering the average selling price. Also, if the product is to be sold in retail stores, as personal computers are, stores want to keep 40% to 50% of the list price for themselves. Thus, depending on the distribution system, the average selling price is typically 50% to 75% of the list price. 
| Element | Component costs only | Add 33% for direct costs | Add 100% for gross margin (average selling price) | Add 50% for average discount (list price) |
|---|---|---|---|---|
| Component costs | 100% | 75% | 37.5% | 25% |
| Direct costs | | 25% | 12.5% | 8.3% |
| Gross margin | | | 50% | 33.3% |
| Average discount | | | | 33.3% |

FIGURE 1.7 The components of price for a mid-range product in a workstation company. Each increase is shown along the bottom as a tax on the prior price. The percentages of the new price for all elements are shown on the left of each column.

| Element | Component costs only | Add 33% for direct costs | Add 33% for gross margin (average selling price) | Add 80% for average discount (list price) |
|---|---|---|---|---|
| Component costs | 100% | 75% | 56% | 31% |
| Direct costs | | 25% | 19% | 10% |
| Gross margin | | | 25% | 14% |
| Average discount | | | | 45% |

FIGURE 1.8 The components of price for a desktop product in a personal computer company. A larger average discount is used because of indirect selling, and a lower gross margin is required.

As we said, pricing is sensitive to competition: A company may not be able to sell its product at a price that includes the desired gross margin. In the worst case, the price must be significantly reduced, lowering gross margin until profit becomes negative! A company striving for market share can reduce price and profit to increase the attractiveness of its products. If the volume grows sufficiently, costs can be reduced. Remember that these relationships are extremely complex and to understand them in depth would require an entire book, as opposed to one section in one chapter. For example, if a company cuts prices, but does not obtain a sufficient growth in product volume, the chief impact will be lower profits.
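The way each overhead is applied as a "tax" on the prior total in Figures 1.7 and 1.8 can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, the component cost of 100 is an arbitrary starting value, and the "add 33%" steps are modeled as exactly one third.

```python
def price_breakdown(component_cost, direct_markup, margin_markup, discount_markup):
    """Apply each markup to the running total, as in Figures 1.7 and 1.8."""
    direct = component_cost * direct_markup
    cost_with_direct = component_cost + direct
    gross_margin = cost_with_direct * margin_markup
    average_selling_price = cost_with_direct + gross_margin
    average_discount = average_selling_price * discount_markup
    return {
        "component costs": component_cost,
        "direct costs": direct,
        "gross margin": gross_margin,
        "average discount": average_discount,
        "list price": average_selling_price + average_discount,
    }


def shares_of_list_price(breakdown):
    """Express every element as a fraction of the list price."""
    list_price = breakdown["list price"]
    return {k: v / list_price for k, v in breakdown.items() if k != "list price"}


# Markups read off the two figures (1/3 stands in for "add 33%").
workstation = price_breakdown(100.0, 1 / 3, 1.0, 0.5)
pc = price_breakdown(100.0, 1 / 3, 1 / 3, 0.8)

for name, b in [("workstation", workstation), ("PC", pc)]:
    shares = {k: f"{v:.1%}" for k, v in shares_of_list_price(b).items()}
    print(name, f"list price = {b['list price']:.0f}", shares)
```

With these markups the list price comes out at 4.0 times the component cost for the workstation and 3.2 times for the PC, which is in line with the earlier remark that a $1000 change in cost may change price by $3000 to $4000.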
Many engineers are surprised to find that most companies spend only 4% (in the commodity PC business) to 12% (in the high-end server business) of their income on R&D, which includes all engineering (except for manufacturing and field engineering). This is a well-established percentage that is reported in companies’ annual reports and tabulated in national magazines, so this percentage is unlikely to change over time.

The information above suggests that a company uniformly applies fixed-overhead percentages to turn cost into price, and this is true for many companies. But another point of view is that R&D should be considered an investment. Thus an investment of 4% to 12% of income means that every $1 spent on R&D should lead to $8 to $25 in sales. This alternative point of view then suggests a different gross margin for each product depending on the number sold and the size of the investment. Large, expensive machines generally cost more to develop; a machine costing 10 times as much to manufacture may cost many times as much to develop. Since large, expensive machines generally do not sell as well as small ones, the gross margin must be greater on the big machines for the company to maintain a profitable return on its investment. This investment model places large machines in double jeopardy (because there are fewer sold and they require larger R&D costs) and gives one explanation for a higher ratio of price to cost versus smaller machines.

The issue of cost and cost/performance is a complex one. There is no single target for computer designers. At one extreme, high-performance design spares no cost in achieving its goal. Supercomputers have traditionally fit into this category. At the other extreme is low-cost design, where performance is sacrificed to achieve lowest cost. Computers like the IBM PC clones belong here. Between these extremes is cost/performance design, where the designer balances cost versus performance.
Most of the workstation manufacturers operate in this region. In the past 10 years, as computers have downsized, both low-cost design and cost/performance design have become increasingly important. Even the supercomputer manufacturers have found that cost plays an increasing role. This section has introduced some of the most important factors in determining cost; the next section deals with performance.

1.5 Measuring and Reporting Performance

When we say one computer is faster than another, what do we mean? The computer user may say a computer is faster when a program runs in less time, while the computer center manager may say a computer is faster when it completes more jobs in an hour. The computer user is interested in reducing response time (the time between the start and the completion of an event), also referred to as execution time. The manager of a large data processing center may be interested in increasing throughput, the total amount of work done in a given time.

In comparing design alternatives, we often want to relate the performance of two different machines, say X and Y. The phrase “X is faster than Y” is used here to mean that the response time or execution time is lower on X than on Y for the given task. In particular, “X is n times faster than Y” will mean

$$\frac{\text{Execution time}_Y}{\text{Execution time}_X} = n$$

Since execution time is the reciprocal of performance, the following relationship holds:

$$n = \frac{\text{Execution time}_Y}{\text{Execution time}_X} = \frac{1/\text{Performance}_Y}{1/\text{Performance}_X} = \frac{\text{Performance}_X}{\text{Performance}_Y}$$

The phrase “the throughput of X is 1.3 times higher than Y” signifies here that the number of tasks completed per unit time on machine X is 1.3 times the number completed on Y.
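As a small illustration of the definition of "n times faster", the sketch below computes n from two execution times and confirms that the ratio of performances gives the same value. The function names and the two execution times (10 and 15 seconds) are hypothetical values chosen for this example, not measurements from the text.

```python
import math


def times_faster(exec_time_x, exec_time_y):
    """Return n such that 'X is n times faster than Y'."""
    return exec_time_y / exec_time_x


def performance(exec_time):
    """Performance is taken as the reciprocal of execution time."""
    return 1.0 / exec_time


# Hypothetical execution times, in seconds, for the same task.
time_x, time_y = 10.0, 15.0

n_from_times = times_faster(time_x, time_y)
n_from_performance = performance(time_x) / performance(time_y)

print(f"X is {n_from_times:.2f} times faster than Y")
print("ratios agree:", math.isclose(n_from_times, n_from_performance))
```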
Because performance and execution time are reciprocals, increasing performance decreases execution time. To help avoid confusion between the terms increasing and decreasing, we usually say “improve performance” or “improve execution time” when we mean increase performance and decrease execution time. Whether we are interested in throughput or response time, the key measurement is time: The computer that performs the same amount of work in the least time is the fastest. The difference is whether we measure one task (response time) or many tasks (throughput).

Unfortunately, time is not always the metric quoted in comparing the performance of computers. A number of popular measures have been adopted in the quest for an easily understood, universal measure of computer performance, with the result that a few innocent terms have been shanghaied from their well-defined environment and forced into a service for which they were never intended. The authors’ position is that the only consistent and reliable measure of performance is the execution time of real programs, and that all proposed alternatives to time as the metric or to real programs as the items measured have eventually led to misleading claims or even mistakes in computer design. The dangers of a few popular alternatives are shown in Fallacies and Pitfalls, section 1.8.

Measuring Performance

Even execution time can be defined in different ways depending on what we count. The most straightforward definition of time is called wall-clock time, response time, or elapsed time, which is the latency to complete a task, including disk accesses, memory accesses, input/output activities, and operating system overhead: everything. With multiprogramming the CPU works on another program while waiting for I/O and may not necessarily minimize the elapsed time of one program. Hence we need a term to take this activity into account.
CPU time recognizes this distinction and means the time the CPU is computing, not including the time waiting for I/O or running other programs. (Clearly the response time seen by the user is the elapsed time of the program, not the CPU time.) CPU time can be further divided into the CPU time spent in the program, called user CPU time, and the CPU time spent in the operating system performing tasks requested by the program, called system CPU time.

These distinctions are reflected in the UNIX time command, which returns four measurements when applied to an executing program:

90.7u 12.9s 2:39 65%

User CPU time is 90.7 seconds, system CPU time is 12.9 seconds, elapsed time is 2 minutes and 39 seconds (159 seconds), and the percentage of elapsed time that is CPU time is (90.7 + 12.9)/159 or 65%. More than a third of the elapsed time in this example was spent waiting for I/O or running other programs or both. Many measurements ignore system CPU time because of the inaccuracy of operating systems’ self-measurement (the above inaccurate measurement came from UNIX) and the inequity of including system CPU time when comparing performance between machines with differing system codes. On the other hand, system code on some machines is user code on others, and no program runs without some operating system running on the hardware, so a case can be made for using the sum of user CPU time and system CPU time.

In the present discussion, a distinction is maintained between performance based on elapsed time and that based on CPU time. The term system performance is used to refer to elapsed time on an unloaded system, while CPU performance refers to user CPU time on an unloaded system. We will concentrate on CPU performance in this chapter.

Choosing Programs to Evaluate Performance

Dhrystone does not use floating point.
Typical programs don’t …
Rick Richardson, Clarification of Dhrystone (1988)

This program is the result of extensive research to determine the instruction mix of a typical Fortran program. The results of this program on different machines should give a good indication of which machine performs better under a typical load of Fortran programs. The statements are purposely arranged to defeat optimizations by the compiler.
H. J. Curnow and B. A. Wichmann [1976], Comments in the Whetstone Benchmark

A computer user who runs the same programs day in and day out would be the perfect candidate to evaluate a new computer. To evaluate a new system the user would simply compare the execution time of her workload, the mixture of programs and operating system commands that users run on a machine. Few are in this happy situation, however. Most must rely on other methods to evaluate machines and often other evaluators, hoping that these methods will predict performance for their usage of the new machine. There are four levels of programs used in such circumstances, listed below in decreasing order of accuracy of prediction.

1. Real programs: While the buyer may not know what fraction of time is spent on these programs, she knows that some users will run them to solve real problems. Examples are compilers for C, text-processing software like TeX, and CAD tools like Spice. Real programs have input, output, and options that a user can select when running the program.

2. Kernels: Several attempts have been made to extract small, key pieces from real programs and use them to evaluate performance. Livermore Loops and Linpack are the best known examples. Unlike real programs, no user would run kernel programs, for they exist solely to evaluate performance. Kernels are best used to isolate performance of individual features of a machine to explain the reasons for differences in performance of real programs.

3. Toy benchmarks: Toy benchmarks are typically between 10 and 100 lines of code and produce a result the user already knows before running the toy program. Programs like Sieve of Eratosthenes, Puzzle, and Quicksort are popular because they are small, easy to type, and run on almost any computer. The best use of such programs is beginning programming assignments.

4. Synthetic benchmarks: Similar in philosophy to kernels, synthetic benchmarks try to match the average frequency of operations and operands of a large set of programs. Whetstone and Dhrystone are the most popular synthetic benchmarks. A description of these benchmarks and some of their flaws appears in section 1.8 on page 44. No user runs synthetic benchmarks, because they don’t compute anything a user could want. Synthetic benchmarks are, in fact, even further removed from reality because kernel code is extracted from real programs, while synthetic code is created artificially to match an average execution profile. Synthetic benchmarks are not even pieces of real programs, while kernels might be.

Because computer companies thrive or go bust depending on price/performance of their products relative to others in the marketplace, tremendous resources are available to improve performance of programs widely used in evaluating machines. Such pressures can skew hardware and software engineering efforts to add optimizations that improve performance of synthetic programs, toy programs, kernels, and even real programs. The advantage of the last of these is that adding such optimizations is more difficult in real programs, though not impossible. This fact has caused some benchmark providers to specify the rules under which compilers must operate, as we will see shortly.

Benchmark Suites

Recently, it has become popular to put together collections of benchmarks to try to measure the performance of processors with a variety of applications.
Of course, such suites are only as good as the constituent individual benchmarks. Nonetheless, a key advantage of such suites is that the weakness of any one benchmark is lessened by the presence of the other benchmarks. This is especially true if the methods used for summarizing the performance of the benchmark suite reflect the time to run the entire suite, as opposed to rewarding performance increases on programs that may be defeated by targeted optimizations. In the remainder of this section, we discuss the strengths and weaknesses of different methods for summarizing performance.

Benchmark suites are made of collections of programs, some of which may be kernels, but many of which are typically real programs. Figure 1.9 describes the programs in the popular SPEC92 benchmark suite used to characterize performance in the workstation and server markets. The programs in SPEC92 vary from collections of kernels (nasa7) to small program fragments (tomcatv, ora, alvinn, swm256) to applications of varying size (spice2g6, gcc, compress). We will see data on many of these programs throughout this text. In the next subsection, we show how a SPEC92 report describes the machine, compiler, and OS configuration, while in section 1.8 we describe some of the pitfalls that have occurred in attempting to develop the benchmark suite and to prevent the benchmark circumvention that makes the results not useful for comparing performance among machines.

| Benchmark | Source | Lines of code | Description |
|---|---|---|---|
| espresso | C | 13,500 | Minimizes Boolean functions. |
| li | C | 7,413 | A lisp interpreter written in C that solves the 8-queens problem. |
| eqntott | C | 3,376 | Translates a Boolean equation into a truth table. |
| compress | C | 1,503 | Performs data compression on a 1-MB file using Lempel-Ziv coding. |
| sc | C | 8,116 | Performs computations within a UNIX spreadsheet. |
| gcc | C | 83,589 | Consists of the GNU C compiler converting preprocessed files into optimized Sun-3 machine code. |
| spice2g6 | FORTRAN | 18,476 | Circuit simulation package that simulates a small circuit. |
| doduc | FORTRAN | 5,334 | A Monte Carlo simulation of a nuclear reactor component. |
| mdljdp2 | FORTRAN | 4,458 | A chemical application that solves equations of motion for a model of 500 atoms. This is similar to modeling a structure of liquid argon. |
| wave5 | FORTRAN | 7,628 | A two-dimensional electromagnetic particle-in-cell simulation used to study various plasma phenomena. Solves equations of motion on a mesh involving 500,000 particles on 50,000 grid points for 5 time steps. |
| tomcatv | FORTRAN | 195 | A mesh generation program, which is highly vectorizable. |
| ora | FORTRAN | 535 | Traces rays through optical systems of spherical and plane surfaces. |
| mdljsp2 | FORTRAN | 3,885 | Same as mdljdp2, but single precision. |
| alvinn | C | 272 | Simulates training of a neural network. Uses single precision. |
| ear | C | 4,483 | An inner ear model that filters and detects various sounds and generates speech signals. Uses single precision. |
| swm256 | FORTRAN | 487 | A shallow water model that solves shallow water equations using finite difference equations with a 256 × 256 grid. Uses single precision. |
| su2cor | FORTRAN | 2,514 | Computes masses of elementary particles from Quark-Gluon theory. |
| hydro2d | FORTRAN | 4,461 | An astrophysics application program that solves hydrodynamical Navier Stokes equations to compute galactical jets. |
| nasa7 | FORTRAN | 1,204 | Seven kernels do matrix manipulation, FFTs, Gaussian elimination, vortices creation. |
| fpppp | FORTRAN | 2,718 | A quantum chemistry application program used to calculate two electron integral derivatives. |

FIGURE 1.9 The programs in the SPEC92 benchmark suites. The top six entries are the integer-oriented programs, from which the SPECint92 performance is computed. The bottom 14 are the floating-point-oriented benchmarks from which the SPECfp92 performance is computed. The floating-point programs use double precision unless stated otherwise.
The amount of nonuser CPU activity varies from none (for most of the FP benchmarks) to significant (for programs like gcc and compress). In the performance measurements in this text, we use the five integer benchmarks (excluding sc) and five FP benchmarks: doduc, mdljdp2, ear, hydro2d, and su2cor.

Reporting Performance Results

The guiding principle of reporting performance measurements should be reproducibility: list everything another experimenter would need to duplicate the results. Compare descriptions of computer performance found in refereed scientific journals to descriptions of car performance found in magazines sold at supermarkets. Car magazines, in addition to supplying 20 performance metrics, list all optional equipment on the test car, the types of tires used in the performance test, and the date the test was made. Computer journals may have only seconds of execution labeled by the name of the program and the name and model of the computer: spice takes 187 seconds on an IBM RS/6000 Powerstation 590. Left to the reader’s imagination are program input, version of the program, version of compiler, optimizing level of compiled code, version of operating system, amount of main memory, number and types of disks, and version of the CPU, all of which make a difference in performance. In other words, car magazines have enough information about performance measurements to allow readers to duplicate results or to question the options selected for measurements, but computer journals often do not!

A SPEC benchmark report requires a fairly complete description of the machine, the compiler flags, as well as the publication of both the baseline and optimized results. As an example, Figure 1.10 shows portions of the SPECfp92 report for an IBM RS/6000 Powerstation 590.
In addition to hardware, software, and baseline tuning parameter descriptions, a SPEC report contains the actual performance times, shown both in tabular form and as a graph. The importance of performance on the SPEC benchmarks motivated vendors to add many benchmark-specific flags when compiling SPEC programs; these flags often caused transformations that would be illegal on many programs or would slow down performance on others. To restrict this process and increase the significance of the SPEC results, the SPEC organization created a baseline performance measurement in addition to the optimized performance measurement. Baseline performance restricts the vendor to one compiler and one set of flags for all the programs in the same language (C or FORTRAN). Figure 1.10 shows the parameters for the baseline performance; in section 1.8, Fallacies and Pitfalls, we’ll see the tuning parameters for the optimized performance runs on this machine. Comparing and Summarizing Performance Comparing performance of computers is rarely a dull event, especially when the designers are involved. Charges and countercharges fly across the Internet; one is accused of underhanded tactics and the other of misleading statements. Since careers sometimes depend on the results of such performance comparisons, it is understandable that the truth is occasionally stretched. But more frequently discrepancies can be explained by differing assumptions or lack of information. 
| Hardware | | Software | |
|---|---|---|---|
| Model number | Powerstation 590 | O/S and version | AIX version 3.2.5 |
| CPU | 66.67 MHz POWER2 | Compilers and version | C SET++ for AIX C/C++ version 2.1; XL FORTRAN/6000 version 3.1 |
| FPU | Integrated | Other software | See below |
| Number of CPUs | 1 | File system type | AIX/JFS |
| Primary cache | 32KBI+256KBD off chip | System state | Single user |
| Secondary cache | None | | |
| Other cache | None | | |
| Memory | 128 MB | | |
| Disk subsystem | 2x2.0 GB | | |
| Other hardware | None | | |

SPECbase_fp92 tuning parameters/notes/summary of changes:
FORTRAN flags: -O3 -qarch=pwrx -qhsflt -qnofold -bnso -BI:/lib/syscalss.exp
C flags: -O3 -qarch=pwrx -Q -qtune=pwrx -qhssngl -bnso -bI:/lib/syscalls.exp

FIGURE 1.10 The machine, software, and baseline tuning parameters for the SPECfp92 report on an IBM RS/6000 Powerstation 590. SPECfp92 means that this is the report for the floating-point (FP) benchmarks in the 1992 release (the earlier release was renamed SPEC89). The top part of the table describes the hardware and software. The bottom describes the compiler and options used for the baseline measurements, which must use one compiler and one set of flags for all the benchmarks in the same language. The tuning parameters and flags for the tuned SPEC92 performance are given in Figure 1.18 on page 49. Data from SPEC [1994].

We would like to think that if we could just agree on the programs, the experimental environments, and the definition of faster, then misunderstandings would be avoided, leaving the networks free for scholarly discourse. Unfortunately, that’s not the reality. Once we agree on the basics, battles are then fought over what is the fair way to summarize relative performance of a collection of programs. For example, two articles on summarizing performance in the same journal took opposing points of view. Figure 1.11, taken from one of the articles, is an example of the confusion that can arise.
| | Computer A | Computer B | Computer C |
|---|---|---|---|
| Program P1 (secs) | 1 | 10 | 20 |
| Program P2 (secs) | 1000 | 100 | 20 |
| Total time (secs) | 1001 | 110 | 40 |

FIGURE 1.11 Execution times of two programs on three machines. Data from Figure I of Smith [1988].

Using our definition of faster than, the following statements hold:

A is 10 times faster than B for program P1.
B is 10 times faster than A for program P2.
A is 20 times faster than C for program P1.
C is 50 times faster than A for program P2.
B is 2 times faster than C for program P1.
C is 5 times faster than B for program P2.

Taken individually, any one of these statements may be of use. Collectively, however, they present a confusing picture: the relative performance of computers A, B, and C is unclear.

Total Execution Time: A Consistent Summary Measure

The simplest approach to summarizing relative performance is to use total execution time of the two programs. Thus

B is 9.1 times faster than A for programs P1 and P2.
C is 25 times faster than A for programs P1 and P2.
C is 2.75 times faster than B for programs P1 and P2.

This summary tracks execution time, our final measure of performance. If the workload consisted of running programs P1 and P2 an equal number of times, the statements above would predict the relative execution times for the workload on each machine. An average of the execution times that tracks total execution time is the arithmetic mean

$$\frac{1}{n} \sum_{i=1}^{n} \text{Time}_i$$

where $\text{Time}_i$ is the execution time for the $i$th program of a total of $n$ in the workload. If performance is expressed as a rate, then the average that tracks total execution time is the harmonic mean

$$\frac{n}{\sum_{i=1}^{n} \dfrac{1}{\text{Rate}_i}}$$

where $\text{Rate}_i$ is a function of $1/\text{Time}_i$, the execution time for the $i$th of $n$ programs in the workload.

Weighted Execution Time

The question arises: What is the proper mixture of programs for the workload?
Are programs P1 and P2 in fact run equally in the workload as assumed by the arithmetic mean? If not, then there are two approaches that have been tried for summarizing performance. The first approach when given an unequal mix of programs in the workload is to assign a weighting factor $w_i$ to each program to indicate the relative frequency of the program in that workload. If, for example, 20% of the tasks in the workload were program P1 and 80% of the tasks in the workload were program P2, then the weighting factors would be 0.2 and 0.8. (Weighting factors add up to 1.) By summing the products of weighting factors and execution times, a clear picture of performance of the workload is obtained. This is called the weighted arithmetic mean:

$$\sum_{i=1}^{n} \text{Weight}_i \times \text{Time}_i$$

where $\text{Weight}_i$ is the frequency of the $i$th program in the workload and $\text{Time}_i$ is the execution time of that program. Figure 1.12 shows the data from Figure 1.11 with three different weightings, each proportional to the execution time of a workload with a given mix. The weighted harmonic mean of rates will show the same relative performance as the weighted arithmetic means of execution times. The definition is

$$\frac{1}{\sum_{i=1}^{n} \dfrac{\text{Weight}_i}{\text{Rate}_i}}$$

| | A | B | C | W(1) | W(2) | W(3) |
|---|---|---|---|---|---|---|
| Program P1 (secs) | 1.00 | 10.00 | 20.00 | 0.50 | 0.909 | 0.999 |
| Program P2 (secs) | 1000.00 | 100.00 | 20.00 | 0.50 | 0.091 | 0.001 |
| Arithmetic mean: W(1) | 500.50 | 55.00 | 20.00 | | | |
| Arithmetic mean: W(2) | 91.91 | 18.19 | 20.00 | | | |
| Arithmetic mean: W(3) | 2.00 | 10.09 | 20.00 | | | |

FIGURE 1.12 Weighted arithmetic mean execution times using three weightings. W(1) equally weights the programs, resulting in a mean (row 3) that is the same as the unweighted arithmetic mean. W(2) makes the mix of programs inversely proportional to the execution times on machine B; row 4 shows the arithmetic mean for that weighting. W(3) weights the programs in inverse proportion to the execution times of the two programs on machine A; the arithmetic mean is given in the last row.
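The weighted means in Figure 1.12 can be reproduced with a few lines of code. The sketch below is illustrative only: the function and variable names are made up for this example, the execution times and three-digit weights are copied from the figure, and `inverse_time_weights` is one straightforward way to build weights that are inversely proportional to the execution times on a chosen machine.

```python
# Execution times in seconds for (P1, P2), from Figure 1.11.
TIMES = {"A": (1.0, 1000.0), "B": (10.0, 100.0), "C": (20.0, 20.0)}

# Weightings as printed in Figure 1.12 (rounded to three digits there).
WEIGHTS = {"W(1)": (0.50, 0.50), "W(2)": (0.909, 0.091), "W(3)": (0.999, 0.001)}


def weighted_arithmetic_mean(weights, times):
    """Sum of weight * time over all programs in the workload."""
    return sum(w * t for w, t in zip(weights, times))


def inverse_time_weights(times):
    """Weights inversely proportional to the times, scaled to add up to 1."""
    inverses = [1.0 / t for t in times]
    total = sum(inverses)
    return [x / total for x in inverses]


for label, weights in WEIGHTS.items():
    means = {m: round(weighted_arithmetic_mean(weights, t), 2) for m, t in TIMES.items()}
    print(label, means)

print("weights from machine B:", [round(w, 3) for w in inverse_time_weights(TIMES["B"])])
print("weights from machine A:", [round(w, 3) for w in inverse_time_weights(TIMES["A"])])
```

Feeding the unrounded weights back into the mean instead of the three-digit values shifts the W(2) results slightly (for example, to about 91.82 on machine A), so small differences in the last digit come from rounding of the weights rather than from the method.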
The net effect of the second and third weightings is to “normalize” the weightings to the execution times of programs running on that machine, so that the running time will be spent evenly between each program for that machine. For a set of $n$ programs each taking $\text{Time}_i$ on one machine, the equal-time weightings on that machine are

$$w_i = \frac{1}{\text{Time}_i \times \sum_{j=1}^{n} \dfrac{1}{\text{Time}_j}}$$

Normalized Execution Time and the Pros and Cons of Geometric Means

A second approach to unequal mixture of programs in the workload is to normalize execution times to a reference machine and then take the average of the normalized execution times. This is the approach used by the SPEC benchmarks, where a base time on a VAX-11/780 is used for reference. This measurement gives a warm fuzzy feeling, because it suggests that performance of new programs can be predicted by simply multiplying this number times its performance on the reference machine.

Average normalized execution time can be expressed as either an arithmetic or geometric mean. The formula for the geometric mean is

$$\sqrt[n]{\prod_{i=1}^{n} \text{Execution time ratio}_i}$$

where $\text{Execution time ratio}_i$ is the execution time, normalized to the reference machine, for the $i$th program of a total of $n$ in the workload. Geometric means also have a nice property for two samples $X_i$ and $Y_i$:

$$\frac{\text{Geometric mean}(X_i)}{\text{Geometric mean}(Y_i)} = \text{Geometric mean}\left(\frac{X_i}{Y_i}\right)$$

As a result, taking either the ratio of the means or the mean of the ratios yields the same result. In contrast to arithmetic means, geometric means of normalized execution times are consistent no matter which machine is the reference. Hence, the arithmetic mean should not be used to average normalized execution times. Figure 1.13 shows some variations using both arithmetic and geometric means of normalized times.
                   Normalized to A          Normalized to B          Normalized to C
                   A      B       C         A       B      C         A        B       C
Program P1         1.0    10.0    20.0      0.1     1.0    2.0       0.05     0.5     1.0
Program P2         1.0    0.1     0.02      10.0    1.0    0.2       50.0     5.0     1.0
Arithmetic mean    1.0    5.05    10.01     5.05    1.0    1.1       25.03    2.75    1.0
Geometric mean     1.0    1.0     0.63      1.0     1.0    0.63      1.58     1.58    1.0
Total time         1.0    0.11    0.04      9.1     1.0    0.36      25.03    2.75    1.0

FIGURE 1.13 Execution times from Figure 1.11 normalized to each machine. The arithmetic mean performance varies depending on which is the reference machine: in column 2, B’s execution time is five times longer than A’s, while the reverse is true in column 4. In column 3, C is slowest, but in column 9, C is fastest. The geometric means are consistent independent of normalization: A and B have the same performance, and the execution time of C is 0.63 of A or B (1/1.58 is 0.63). Unfortunately, the total execution time of A is about 9 times longer than that of B (9.1 in the total time row), and B in turn is about 3 times longer than C. As a point of interest, the relationship between the means of the same set of numbers is always harmonic mean ≤ geometric mean ≤ arithmetic mean.

Because the weightings in weighted arithmetic means are set proportionate to execution times on a given machine, as in Figure 1.12, they are influenced not only by frequency of use in the workload, but also by the peculiarities of a particular machine and the size of program input. The geometric mean of normalized execution times, on the other hand, is independent of the running times of the individual programs, and it doesn’t matter which machine is used to normalize. If a situation arose in comparative performance evaluation where the programs were fixed but the inputs were not, then competitors could rig the results of weighted arithmetic means by making their best performing benchmark have the largest input and therefore dominate execution time. In such a situation the geometric mean would be less misleading than the arithmetic mean.
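The reference-machine behavior summarized in Figure 1.13 can be checked with a small sketch. The execution times are those listed in Figure 1.12; the helper names are illustrative assumptions. The sketch normalizes each machine's times to a chosen reference and compares the arithmetic and geometric means.

```python
import math

# Execution times in seconds for programs P1 and P2 (values from Figure 1.12).
TIMES = {
    "A": [1.0, 1000.0],
    "B": [10.0, 100.0],
    "C": [20.0, 20.0],
}


def normalized(times, reference):
    """Execution time ratios: each program's time divided by the reference time."""
    return [t / r for t, r in zip(times, reference)]


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    """n-th root of the product of the n values."""
    return math.prod(values) ** (1.0 / len(values))


def summarize(reference_name):
    """Map machine -> (arithmetic mean, geometric mean) of normalized times."""
    reference = TIMES[reference_name]
    return {
        machine: (
            arithmetic_mean(normalized(times, reference)),
            geometric_mean(normalized(times, reference)),
        )
        for machine, times in TIMES.items()
    }


if __name__ == "__main__":
    for ref in TIMES:
        print("normalized to", ref)
        for machine, (am, gm) in summarize(ref).items():
            print(f"  {machine}: arithmetic={am:.2f} geometric={gm:.2f}")
```

Comparing the output for different references shows the arithmetic-mean ranking of A and B flipping with the reference, while the ratio between any two machines' geometric means stays the same.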
The strong drawback to geometric means of normalized execution times is that they violate our fundamental principle of performance measurement—they do not predict execution time. The geometric means from Figure 1.13 suggest that for programs P1 and P2 the performance of machines A and B is the same, yet this would only be true for a workload that ran program P1 100 times for every occurrence of program P2 (see Figure 1.12 on page 26). The total execution time for such a workload suggests that machines A and B are about 50% faster than machine C, in contrast to the geometric mean, which says machine C is faster than A and B! In general there is no workload for three or more machines that will match the performance predicted by the geometric means of normalized execution times. Our original reason for examining geometric means of normalized performance was to avoid giving equal emphasis to the programs in our workload, but is this solution an improvement? An additional drawback of using geometric mean as a method for summarizing performance for a benchmark suite (as SPEC92 does) is that it encourages hardware and software designers to focus their attention on the benchmarks where performance is easiest to improve rather than on the benchmarks that are slowest. For example, if some hardware or software improvement can cut the running time for a benchmark from 2 seconds to 1, the geometric mean will reward those designers with the same overall mark that it would give to designers that improve the running time on another benchmark in the suite from 10,000 seconds to 5000 seconds. Of course, everyone interested in running the second program thinks of the second batch of designers as their heroes and the first group as useless. Small programs are often easier to “crack,” obtaining a large but unrepresentative performance improvement, and the use of geometric mean rewards such behavior more than a measure that reflects total running time. 
The ideal solution is to measure a real workload and weight the programs according to their frequency of execution. If this can’t be done, then normalizing so that equal time is spent on each program on some machine at least makes the relative weightings explicit and will predict execution time of a workload with that mix. The problem above of unspecified inputs is best solved by specifying the inputs when comparing performance. If results must be normalized to a specific machine, first summarize performance with the proper weighted measure and then do the normalizing.

1.6 Quantitative Principles of Computer Design

Now that we have seen how to define, measure, and summarize performance, we can explore some of the guidelines and principles that are useful in design and analysis of computers. In particular, this section introduces some important observations about designing for performance and cost/performance, as well as two equations that we can use to evaluate design alternatives.

Make the Common Case Fast

Perhaps the most important and pervasive principle of computer design is to make the common case fast: In making a design trade-off, favor the frequent case over the infrequent case. This principle also applies when determining how to spend resources, since the impact of making some occurrence faster is higher if the occurrence is frequent. Improving the frequent event, rather than the rare event, will obviously help performance, too. In addition, the frequent case is often simpler and can be done faster than the infrequent case. For example, when adding two numbers in the CPU, we can expect overflow to be a rare circumstance and can therefore improve performance by optimizing the more common case of no overflow. This may slow down the case when overflow occurs, but if that is rare, then overall performance will be improved by optimizing for the normal case.
We will see many cases of this principle throughout this text. In applying this simple principle, we have to decide what the frequent case is and how much performance can be improved by making that case faster. A fundamental law, called Amdahl’s Law, can be used to quantify this principle.

Amdahl’s Law

The performance gain that can be obtained by improving some portion of a computer can be calculated using Amdahl’s Law. Amdahl’s Law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used. Amdahl’s Law defines the speedup that can be gained by using a particular feature. What is speedup? Suppose that we can make an enhancement to a machine that will improve performance when it is used. Speedup is the ratio

$$\text{Speedup} = \frac{\text{Performance for entire task using the enhancement when possible}}{\text{Performance for entire task without using the enhancement}}$$

Alternatively,

$$\text{Speedup} = \frac{\text{Execution time for entire task without using the enhancement}}{\text{Execution time for entire task using the enhancement when possible}}$$

Speedup tells us how much faster a task will run using the machine with the enhancement as opposed to the original machine. Amdahl’s Law gives us a quick way to find the speedup from some enhancement, which depends on two factors:

1. The fraction of the computation time in the original machine that can be converted to take advantage of the enhancement. For example, if 20 seconds of the execution time of a program that takes 60 seconds in total can use an enhancement, the fraction is 20/60. This value, which we will call $\text{Fraction}_{\text{enhanced}}$, is always less than or equal to 1.

2. The improvement gained by the enhanced execution mode; that is, how much faster the task would run if the enhanced mode were used for the entire program. This value is the time of the original mode over the time of the enhanced mode: If the enhanced mode takes 2 seconds for some portion of the program that can completely use the mode, while the original mode took 5 seconds for the same portion, the improvement is 5/2. We will call this value, which is always greater than 1, $\text{Speedup}_{\text{enhanced}}$.

The execution time using the original machine with the enhanced mode will be the time spent using the unenhanced portion of the machine plus the time spent using the enhancement:

$$\text{Execution time}_{\text{new}} = \text{Execution time}_{\text{old}} \times \left( (1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right)$$

The overall speedup is the ratio of the execution times:

$$\text{Speedup}_{\text{overall}} = \frac{\text{Execution time}_{\text{old}}}{\text{Execution time}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$

EXAMPLE Suppose that we are considering an enhancement that runs 10 times faster than the original machine but is only usable 40% of the time. What is the overall speedup gained by incorporating the enhancement?

ANSWER

$$\text{Fraction}_{\text{enhanced}} = 0.4 \qquad \text{Speedup}_{\text{enhanced}} = 10$$

$$\text{Speedup}_{\text{overall}} = \frac{1}{0.6 + \dfrac{0.4}{10}} = \frac{1}{0.64} \approx 1.56$$

Amdahl’s Law expresses the law of diminishing returns: The incremental improvement in speedup gained by an additional improvement in the performance of just a portion of the computation diminishes as improvements are added. An important corollary of Amdahl’s Law is that if an enhancement is only usable for a fraction of a task, we can’t speed up the task by more than the reciprocal of 1 minus that fraction.
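The overall speedup formula and its corollary can be sketched directly in code. The function names are illustrative; the inputs in the usage section are the values from the Example above (an enhancement usable 40% of the time that runs 10 times faster).

```python
import math


def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup: 1 / ((1 - F) + F / S), per Amdahl's Law."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


def speedup_limit(fraction_enhanced):
    """Upper bound on overall speedup: the reciprocal of 1 minus the fraction."""
    if fraction_enhanced >= 1.0:
        return math.inf
    return 1.0 / (1.0 - fraction_enhanced)


if __name__ == "__main__":
    # Values from the Example: usable 40% of the time, 10 times faster.
    print(f"overall speedup: {amdahl_speedup(0.4, 10):.4f}")
    # Diminishing returns: a much faster enhanced mode approaches the limit
    # but never exceeds it.
    for s in (10, 100, 1000):
        print(s, f"{amdahl_speedup(0.4, s):.4f}", "limit", f"{speedup_limit(0.4):.4f}")
```

Raising the enhanced-mode speedup from 10 to 1000 moves the overall speedup only from about 1.56 toward the bound of 1/(1 − 0.4), which illustrates the diminishing returns described above.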
A common mistake in applying Amdahl’s Law is to confuse “fraction of time converted to use an enhancement” and “fraction of time after enhancement is in use.” If, instead of measuring the time that we could use the enhancement in a computation, we measure the time after the enhancement is in use, the results will be incorrect! (Try Exercise 1.2 to see how wrong.)

Amdahl’s Law can serve as a guide to how much an enhancement will improve performance and how to distribute resources to improve cost/performance. The goal, clearly, is to spend resources proportional to where time is spent. We can also use Amdahl’s Law to compare two design alternatives, as the following Example shows.

EXAMPLE Implementations of floating-point (FP) square root vary significantly in performance. Suppose FP square root (FPSQR) is responsible for 20% of the execution time of a critical benchmark on a machine. One proposal is to add FPSQR hardware that will speed up this operation by a factor of 10. The other alternative is just to try to make all FP instructions run faster; FP instructions are responsible for a total of 50% of the execution time. The design team believes that they can make all FP instructions run two times faster with the same effort as required for the fast square root. Compare these two design alternatives.

ANSWER We can compare these two alternatives by comparing the speedups:

$$\text{Speedup}_{\text{FPSQR}} = \frac{1}{(1 - 0.2) + \dfrac{0.2}{10}} = \frac{1}{0.82} \approx 1.22$$

$$\text{Speedup}_{\text{FP}} = \frac{1}{(1 - 0.5) + \dfrac{0.5}{2.0}} = \frac{1}{0.75} \approx 1.33$$

Improving the performance of the FP operations overall is better because of the higher frequency.

In the above Example, we needed to know the time consumed by the new and improved FP operations; often it is difficult to measure these times directly.
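The comparison in the FPSQR Example can be expressed as a short sketch. The dictionary keys and function name are illustrative; the fractions and speedups are the ones stated in the Example.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup: 1 / ((1 - F) + F / S), per Amdahl's Law."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# (fraction of execution time affected, speedup of that portion), from the Example.
ALTERNATIVES = {
    "faster FPSQR hardware": (0.2, 10.0),
    "all FP instructions 2x faster": (0.5, 2.0),
}

SPEEDUPS = {name: amdahl_speedup(f, s) for name, (f, s) in ALTERNATIVES.items()}
BEST = max(SPEEDUPS, key=SPEEDUPS.get)

if __name__ == "__main__":
    for name, speedup in SPEEDUPS.items():
        print(f"{name}: {speedup:.3f}")
    print("better alternative:", BEST)
```

The smaller per-operation gain wins here because it applies to a larger fraction of the execution time.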
In the next section, we will see another way of doing such comparisons based on the use of an equation that decomposes the CPU execution time into three separate components. If we know how an alternative affects these three components, we can determine its overall performance effect. Furthermore, it is often possible to build simulators that measure these components before the hardware is actually designed.

The CPU Performance Equation

Most computers are constructed using a clock running at a constant rate. These discrete time events are called ticks, clock ticks, clock periods, clocks, cycles, or clock cycles. Computer designers refer to the time of a clock period by its duration (e.g., 2 ns) or by its rate (e.g., 500 MHz). CPU time for a program can then be expressed two ways:

$$\text{CPU time} = \text{CPU clock cycles for a program} \times \text{Clock cycle time}$$

or

$$\text{CPU time} = \frac{\text{CPU clock cycles for a program}}{\text{Clock rate}}$$

In addition to the number of clock cycles needed to execute a program, we can also count the number of instructions executed: the instruction path length or instruction count (IC). If we know the number of clock cycles and the instruction count, we can calculate the average number of clock cycles per instruction (CPI):

$$\text{CPI} = \frac{\text{CPU clock cycles for a program}}{\text{IC}}$$

This CPU figure of merit provides insight into different styles of instruction sets and implementations, and we will use it extensively in the next four chapters. By transposing instruction count in the above formula, clock cycles can be defined as IC × CPI.
This allows us to use CPI in the execution time formula:

$$\text{CPU time} = \text{IC} \times \text{CPI} \times \text{Clock cycle time}$$

or

$$\text{CPU time} = \frac{\text{IC} \times \text{CPI}}{\text{Clock rate}}$$

Expanding the first formula into the units of measure shows how the pieces fit together:

$$\frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}} = \frac{\text{Seconds}}{\text{Program}} = \text{CPU time}$$

As this formula demonstrates, CPU performance is dependent upon three characteristics: clock cycle (or rate), clock cycles per instruction, and instruction count. Furthermore, CPU time is equally dependent on these three characteristics: A 10% improvement in any one of them leads to a 10% improvement in CPU time.

Unfortunately, it is difficult to change one parameter in complete isolation from others because the basic technologies involved in changing each characteristic are also interdependent:

- Clock cycle time: Hardware technology and organization
- CPI: Organization and instruction set architecture
- Instruction count: Instruction set architecture and compiler technology

Luckily, many potential performance improvement techniques primarily improve one component of CPU performance with small or predictable impacts on the other two. Sometimes it is useful in designing the CPU to calculate the number of total CPU clock cycles as

$$\text{CPU clock cycles} = \sum_{i=1}^{n} \text{CPI}_i \times \text{IC}_i$$

where $\text{IC}_i$ represents the number of times instruction i is executed in a program and $\text{CPI}_i$ represents the average number of clock cycles for instruction i.
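The relationships above can be sketched in a few lines. The instruction classes, counts, and CPI values in `MIX` are hypothetical numbers chosen only for illustration, and the 2 ns clock cycle time echoes the example duration mentioned earlier; none of these are measurements from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstructionClass:
    name: str
    count: int    # IC_i: number of times this instruction class is executed
    cpi: float    # CPI_i: average clock cycles for this instruction class


def cpu_clock_cycles(mix):
    """CPU clock cycles = sum over i of CPI_i * IC_i."""
    return sum(c.cpi * c.count for c in mix)


def overall_cpi(mix):
    """Overall CPI = CPU clock cycles / instruction count."""
    instruction_count = sum(c.count for c in mix)
    return cpu_clock_cycles(mix) / instruction_count


def cpu_time_seconds(mix, clock_cycle_time_s):
    """CPU time = CPU clock cycles * clock cycle time."""
    return cpu_clock_cycles(mix) * clock_cycle_time_s


# Hypothetical instruction mix (illustrative values, not from the text).
MIX = [
    InstructionClass("alu", 600, 1.0),
    InstructionClass("load_store", 300, 2.0),
    InstructionClass("branch", 100, 3.0),
]

if __name__ == "__main__":
    print("clock cycles:", cpu_clock_cycles(MIX))
    print("overall CPI:", overall_cpi(MIX))
    print("CPU time (s) at a 2 ns cycle:", cpu_time_seconds(MIX, 2e-9))
```

Multiplying the total instruction count by the overall CPI and the clock cycle time gives the same CPU time, which is the IC × CPI × Clock cycle time form of the equation.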
The per-instruction form of CPU clock cycles can be used to express CPU time as

$$\text{CPU time} = \left( \sum_{i=1}^{n} \text{CPI}_i \times \text{IC}_i \right) \times \text{Clock cycle time}$$

and overall CPI as

$$\text{CPI} = \frac{\sum_{i=1}^{n} \text{CPI}_i \times \text{IC}_i}{\text{Instruction count}} = \sum_{i=1}^{n} \text{CPI}_i \times \frac{\text{IC}_i}{\text{Instruction count}}$$

The latter form of the CPI calculation multiplies each individual $\text{CPI}_i$ by the fraction of occurrences of that instruction in a program. $\text{CPI}_i$ should be measured and not just calculated from a table in the back of a reference manual since it must include cache misses and any other memory system inefficiencies.

Consider our earlier example, here modified to use measurements of the frequency of the instructions and of the instruction CPI values, which, in practice, are easier to obtain.

EXAMPLE Suppose we have made the following measurements:

Frequency of FP operations = 25%
Average CPI of FP operations = 4.0
Average CPI of other instructions = 1.33
Frequency of FPSQR = 2%
CPI of FPSQR = 20

Assume that the two design alternatives are to reduce the CPI of FPSQR to 2 or to reduce the average CPI of all FP operations to 2. Compare these two design alternatives using the CPU performance equation.

ANSWER First, observe that only the CPI changes; the clock rate and instruction count remain identical. We start by finding the original CPI with neither enhancement:

$$\text{CPI}_{\text{original}} = \sum_{i=1}^{n} \text{CPI}_i \times \frac{\text{IC}_i}{\text{Instruction count}} = (4 \times 25\%) + (1.33 \times 75\%) \approx 2.0$$

We can compute the CPI for the enhanced FPSQR by subtracting the cycles saved from the original CPI:

$$\text{CPI}_{\text{with new FPSQR}} = \text{CPI}_{\text{original}} - 2\% \times (\text{CPI}_{\text{old FPSQR}} - \text{CPI}_{\text{of new FPSQR only}}) = 2.0 - 2\% \times (20 - 2) = 1.64$$

We can compute the CPI for the enhancement of all FP instructions the same way or by summing the FP and non-FP CPIs.
Using the latter gives us

$$\text{CPI}_{\text{new FP}} = (75\% \times 1.33) + (25\% \times 2.0) \approx 1.5$$

Since the CPI of the overall FP enhancement is lower, its performance will be better. Specifically, the speedup for the overall FP enhancement is

$$\text{Speedup}_{\text{new FP}} = \frac{\text{CPU time}_{\text{original}}}{\text{CPU time}_{\text{new FP}}} = \frac{\text{IC} \times \text{Clock cycle} \times \text{CPI}_{\text{original}}}{\text{IC} \times \text{Clock cycle} \times \text{CPI}_{\text{new FP}}} = \frac{\text{CPI}_{\text{original}}}{\text{CPI}_{\text{new FP}}} = \frac{2.00}{1.5} \approx 1.33$$

Happily, this is the same speedup we obtained using Amdahl’s Law on page 31.

It is often possible to measure the constituent parts of the CPU performance equation. This is a key advantage for using the CPU performance equation versus Amdahl’s Law in the above example. In particular, it may be difficult to measure things such as the fraction of execution time for which a set of instructions is responsible. In practice this would probably be computed by summing the product of the instruction count and the CPI for each of the instructions in the set. Since the starting point is often individual instruction count and CPI measurements, the CPU performance equation is incredibly useful.

Measuring the Components of CPU Performance

To use the CPU performance equation to determine performance, we need measurements of the individual components of the equation. Building and using tools to measure aspects of a design is a large part of a designer’s job, at least for designers who base their decisions on quantitative principles! To determine the clock cycle, we need only determine one number. Of course, this is easy for an existing CPU, but estimating the clock cycle time of a design in progress is very difficult. Low-level tools, called timing estimators or timing verifiers, are used to analyze the clock cycle time for a completed design.
It is much more difficult to estimate the clock cycle time for a design that is not completed, or for an alternative for which no design exists. In practice, designers determine a target cycle time and estimate the impact on cycle time by examining what they believe to be the critical paths in a design. The difficulty is that control, rather than the data path of a processor, often turns out to be the critical path, and control is often the last thing to be done and the hardest to estimate timing for. So, designers rely heavily on estimates and on their experience and then do whatever is needed to try to make their clock cycle target. This sometimes means changing the organization so that the CPI of some instructions increases. Using the CPU performance equation, the impact of this trade-off can be calculated. The other two components of the CPU performance equation are easier to measure. Measuring the instruction count for a program can be done if we have a compiler for the machine together with tools that measure the instruction set behavior. Of course, compilers for existing instruction set architectures are not a problem, and even changes to the architecture can be explored using modern compiler organizations that provide the ability to retarget the compiler easily. For new instruction sets, developing the compiler early is critical to making intelligent decisions in the design of the instruction set. Once we have a compiled version of a program that we are interested in measuring, there are two major methods we can apply to obtain instruction count information. In most cases, we want to know not only the total instruction count, but also the frequency of different classes of instructions (called the instruction mix). The first way to obtain such data is an instruction set simulator that interprets the instructions. 
The major drawbacks of this approach are speed (since emulating the instruction set is slow) and the possible need to implement substantial infrastructure, since to handle large programs the simulator will need to provide support for operating system functions. One advantage of an instruction set simulator is that it can measure almost any aspect of instruction set behavior accurately and can also potentially simulate systems programs, such as the operating system. Typical instruction set simulators run from 10 to 1000 times slower than the program might, with the performance depending both on how carefully the simulator is written and on the relationship between the architectures of the simulated machine and host machine.

The alternative approach uses execution-based monitoring. In this approach, the binary program is modified to include instrumentation code, such as a counter in every basic block. The program is run and the counter values are recorded. It is then simple to determine the instruction distribution by examining the static version of the code and the values of the counters, which tell us how often each instruction is executed. This technique is obviously very fast, since the program is executed, rather than interpreted. Typical instrumentation code increases the execution time by 1.1 to 2.0 times. This technique is even usable when the architectures of the machine being simulated and the machine being used for the simulator differ. In such a case, the program that instruments the code does a simple translation between the instruction sets. This translation need not be very efficient; even a sloppy translation will usually lead to a much faster measurement system than complete simulation of the instruction set.

Measuring the CPI is more difficult, since it depends on the detailed processor organization as well as the instruction stream.
For very simple processors, it may be possible to compute a CPI for every instruction from a table and simply multiply these values by the number of instances of each instruction type. However, this simplistic approach will not work with most modern processors. Since these processors were built using techniques such as pipelining and memory hierarchies, instructions do not have simple cycle counts but instead depend on the state of the processor when the instruction is executed. Designers often use average CPI values for instructions, but these average CPIs are computed by measuring the effects of the pipeline and cache structure.

To determine the CPI for an instruction in a modern processor, it is often useful to separate the component arising from the memory system and the component determined by the pipeline, assuming a perfect memory system. This is useful both because the simulation techniques for evaluating these contributions are different and because the memory system contribution is added as an average to all instructions, while the processor contribution is more likely to be instruction specific. Thus, we can compute the CPI for each instruction, i, as

$$\text{CPI}_i = \text{Pipeline CPI}_i + \text{Memory system CPI}_i$$

In the next section, we’ll see how memory system CPI can be computed, at least for simple memory hierarchies. Chapter 5 discusses more sophisticated memory hierarchies and performance modeling.

The pipeline CPI is typically modeled by simulating the pipeline structure using the instruction stream. For simple pipelines, it may be sufficient to model the performance of each basic block individually, ignoring the cross basic block interactions. In such cases, the performance of each basic block, together with the frequency counts for each basic block, can be used to determine the overall CPI as well as the CPI for each instruction. In Chapter 3, we will examine simple pipeline structures where this approximation is valid.
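The basic-block approximation can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration: the block names, instruction counts, per-execution cycle counts (which would come from modeling each block once in a pipeline model), the frequency counts, and the average memory system CPI of 0.3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BasicBlock:
    name: str
    instructions: int   # static number of instructions in the block
    cycles: float       # pipeline cycles for one execution (modeled once per block)
    executions: int     # frequency count recorded for the block


def pipeline_cpi(blocks):
    """Overall pipeline CPI from per-block performance and frequency counts."""
    total_cycles = sum(b.cycles * b.executions for b in blocks)
    total_instructions = sum(b.instructions * b.executions for b in blocks)
    if total_instructions == 0:
        raise ValueError("no instructions were executed")
    return total_cycles / total_instructions


def total_cpi(blocks, memory_system_cpi):
    """CPI = pipeline CPI + memory system CPI (added as an average)."""
    return pipeline_cpi(blocks) + memory_system_cpi


# Hypothetical blocks and counts (illustrative values, not measurements).
BLOCKS = [
    BasicBlock("setup", 12, 12.0, 1),
    BasicBlock("loop_body", 8, 10.0, 1000),
    BasicBlock("loop_exit", 4, 6.0, 10),
]

if __name__ == "__main__":
    print(f"pipeline CPI: {pipeline_cpi(BLOCKS):.4f}")
    print(f"total CPI with 0.3 memory system CPI: {total_cpi(BLOCKS, 0.3):.4f}")
```

Each block's cycle count is computed once and then scaled by its frequency count, so the cost of the estimate grows with the number of distinct blocks rather than with the number of executed instructions.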
Because the pipeline behavior of each basic block is simulated only once, the basic-block approach is much faster than a full simulation of every instruction execution. Unfortunately, in our exploration of advanced pipelining in Chapter 4, we’ll see that full simulations of the program are necessary to estimate the performance of sophisticated pipelines.

Using the CPU Performance Equations: More Examples

The real measure of computer performance is time. Changing the instruction set to lower the instruction count, for example, may lead to an organization with a slower clock cycle time that offsets the improvement in instruction count. When comparing two machines, you must look at all three components to understand relative performance.

EXAMPLE Suppose we are considering two alternatives for our conditional branch instructions, as follows:

CPU A: A condition code is set by a compare instruction and followed by a branch that tests the condition code.
CPU B: A compare is included in the branch.

On both CPUs, the conditional branch instruction takes 2 cycles, and all other instructions take 1 clock cycle. On CPU A, 20% of all instructions executed are conditional branches; since every branch needs a compare, another 20% of the instructions are compares. Because CPU A does not have the compare included in the branch, assume that its clock cycle time is 1.25 times faster than that of CPU B. Which CPU is faster? Suppose CPU A’s clock cycle time was only 1.1 times faster?

ANSWER Since we are ignoring all systems issues, we can use the CPU performance formula:

$$\text{CPI}_A = 0.20 \times 2 + 0.80 \times 1 = 1.2$$

since 20% are branches taking 2 clock cycles and the rest of the instructions take 1 cycle each. The performance of CPU A is then

$$\text{CPU time}_A = \text{IC}_A \times 1.2 \times \text{Clock cycle time}_A$$

$\text{Clock cycle time}_B$ is $1.25 \times \text{Clock cycle time}_A$, since A has a clock rate that is 1.25 times higher.
Compares are not executed in CPU B, so 20%/80% or 25% of the instructions are now branches taking 2 clock cycles, and the remaining 75% of the instructions take 1 cycle. Hence,

$$\text{CPI}_B = 0.25 \times 2 + 0.75 \times 1 = 1.25$$

Because CPU B doesn’t execute compares, $\text{IC}_B = 0.8 \times \text{IC}_A$. Hence, the performance of CPU B is

$$\text{CPU time}_B = \text{IC}_B \times \text{CPI}_B \times \text{Clock cycle time}_B = 0.8 \times \text{IC}_A \times 1.25 \times (1.25 \times \text{Clock cycle time}_A) = 1.25 \times \text{IC}_A \times \text{Clock cycle time}_A$$

Under these assumptions, CPU A, with the shorter clock cycle time, is faster than CPU B, which executes fewer instructions (a factor of 1.2 versus 1.25 on $\text{IC}_A \times \text{Clock cycle time}_A$). If CPU A were only 1.1 times faster, then $\text{Clock cycle time}_B$ is $1.10 \times \text{Clock cycle time}_A$, and the performance of CPU B is

$$\text{CPU time}_B = \text{IC}_B \times \text{CPI}_B \times \text{Clock cycle time}_B = 0.8 \times \text{IC}_A \times 1.25 \times (1.10 \times \text{Clock cycle time}_A) = 1.10 \times \text{IC}_A \times \text{Clock cycle time}_A$$

With this smaller clock cycle advantage for CPU A, CPU B, which executes fewer instructions, is now faster (1.10 versus 1.2).

Locality of Reference

While Amdahl’s Law is a theorem that applies to any system, other important fundamental observations come from properties of programs. The most important program property that we regularly exploit is locality of reference: Programs tend to reuse data and instructions they have used recently. A widely held rule of thumb is that a program spends 90% of its execution time in only 10% of the code. An implication of locality is that we can predict with reasonable accuracy what instructions and data a program will use in the near future based on its accesses in the recent past. To examine locality, 10 application programs in the SPEC92 benchmark suite were measured to determine what percentage of the instructions were responsible for 80% and for 90% of the instructions executed. The data are shown in Figure 1.14. Locality of reference also applies to data accesses, though not as strongly as to code accesses. Two different types of locality have been observed.
Temporal locality states that recently accessed items are likely to be accessed in the near future. Figure 1.14 shows one effect of temporal locality. Spatial locality says that items whose addresses are near one another tend to be referenced close together in time. We will see these principles applied in the next section.

[Figure 1.14: bar chart. Vertical axis: fraction of the program, 0% to 60%. Horizontal axis: SPEC benchmark (mdljdp, su2cor, hydro2d, ear, li, doduc, gcc, espresso, eqntott, compress).]

FIGURE 1.14 This plot shows what percentage of the instructions are responsible for 80% and for 90% of the instruction executions. The total bar height indicates the fraction of the instructions that account for 90% of the instruction executions while the dark portion indicates the fraction of the instructions responsible for 80% of the instruction executions. For example, in compress about 9% of the code accounts for 80% of the executed instructions and 16% accounts for 90% of the executed instructions. On average, 90% of the instruction executions come from 10% of the instructions in the integer programs and 14% of the instructions in the FP programs. The programs are described in more detail in Figure 1.9 on page 22.

1.7 Putting It All Together: The Concept of Memory Hierarchy

In the Putting It All Together sections that appear near the end of every chapter, we show real examples that use the principles in that chapter. In this first chapter, we discuss a key idea in memory systems that will be the sole focus of our attention in Chapter 5. To begin, let’s look at a simple axiom of hardware design: Smaller is faster. Smaller pieces of hardware will generally be faster than larger pieces. This simple principle is particularly applicable to memories built from the same technology for two reasons. First, in high-speed machines, signal propagation is a major cause of delay; larger memories have more signal delay and require more levels to decode addresses.
Second, in most technologies we can obtain smaller memories that are faster than larger memories. This is primarily because the designer can use more power per memory cell in a smaller design. The fastest memories are generally available in smaller numbers of bits per chip at any point in time, and they cost substantially more per byte.

The important exception to the smaller-is-faster rule arises from differences in power consumption. Designs with higher power consumption will be faster and also usually larger. Such power differences can come from changes in technology, such as the use of ECL versus CMOS, or from a change in the design, such as the use of static memory cells rather than dynamic memory cells. If the power increase is sufficient, it can overcome the disadvantage arising from the size increase. Thus, the smaller-is-faster rule applies only when power differences do not exist or are taken into account.

Increasing memory bandwidth and decreasing the time to access memory are both crucial to system performance, and many of the organizational techniques we discuss will focus on these two metrics. How can we improve these two measures? The answer lies in combining the principles we discussed in this chapter together with the rule that smaller is faster. The principle of locality of reference says that the data most recently used is very likely to be accessed again in the near future. Making the common case fast suggests that favoring accesses to such data will improve performance. Thus, we should try to keep recently accessed items in the fastest memory. Because smaller memories will be faster, we want to use smaller memories to try to hold the most recently accessed items close to the CPU and successively larger (and slower) memories as we move farther away from the CPU.
Furthermore, we can also employ more expensive and higher-powered memory technologies for those memories closer to the CPU, because they are much smaller and the cost and power impact is lessened by the small size of the memories. This type of organization is called a memory hierarchy. Figure 1.15 shows a multilevel memory hierarchy, including typical sizes and speeds of access. Two important levels of the memory hierarchy are the cache and virtual memory.

Level | Reference type | Size | Speed
CPU registers | Register reference | 200 B | 5 ns
Cache | Cache reference | 64 KB | 10 ns
Memory (reached over the memory bus) | Memory reference | 32 MB | 100 ns
I/O devices (reached over the I/O bus) | Disk memory reference | 2 GB | 5 ms

FIGURE 1.15 These are the levels in a typical memory hierarchy. As we move farther away from the CPU, the memory in the level becomes larger and slower. The sizes and access times are typical for a low- to mid-range desktop machine in late 1995. Figure 1.16 shows the wider range of values in use.

A cache is a small, fast memory located close to the CPU that holds the most recently accessed code or data. When the CPU finds a requested data item in the cache, it is called a cache hit. When the CPU does not find a data item it needs in the cache, a cache miss occurs. A fixed-size block of data, called a block, containing the requested word is retrieved from the main memory and placed into the cache. Temporal locality tells us that we are likely to need this word again in the near future, so placing it in the cache where it can be accessed quickly is useful. Because of spatial locality, there is high probability that the other data in the block will be needed soon. The time required for the cache miss depends on both the latency of the memory and its bandwidth, which determines the time to retrieve the entire block. A cache miss, which is handled by hardware, usually causes the CPU to pause, or stall, until the data are available.
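The hit, miss, and block behavior described above can be illustrated with a toy simulation. What follows is a minimal sketch, assuming a direct-mapped cache (each memory block can occupy exactly one cache slot), which is a placement policy the text has not specified. The function name, the cache geometry, and the address trace are all hypothetical and chosen only to make locality visible.

```python
def simulate_direct_mapped(addresses, num_blocks=4, block_size=16):
    """Count (hits, misses) for a toy direct-mapped cache.

    Assumptions (not from the text): byte addresses, `num_blocks` cache
    slots, `block_size` bytes per block, and an initially empty cache.
    """
    tags = [None] * num_blocks      # memory block currently held in each slot
    hits = misses = 0
    for addr in addresses:
        block = addr // block_size  # memory block that contains this byte
        slot = block % num_blocks   # the only slot this block may occupy
        if tags[slot] == block:
            hits += 1               # cache hit: block is already resident
        else:
            misses += 1             # cache miss: fetch the whole block
            tags[slot] = block
    return hits, misses


# Sixteen sequential 4-byte word accesses covering 64 bytes.
sequential = list(range(0, 64, 4))

# Spatial locality: one miss brings in a 16-byte block, and the next
# three word accesses hit in that same block.
print(simulate_direct_mapped(sequential))      # (12, 4)

# Temporal locality: repeating the trace finds every block still cached.
print(simulate_direct_mapped(sequential * 2))  # (28, 4)
```

With a trace that jumps between addresses mapping to the same slot, the same function would report far more misses, which is one reason real cache organizations are studied in more detail in Chapter 5.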
Likewise, not all objects referenced by a program need to reside in main memory. If the computer has virtual memory, then some objects may reside on disk. The address space is usually broken into fixed-size blocks, called pages. At any time, each page resides either in main memory or on disk. When the CPU references an item within a page that is not present in the cache or main memory, a page fault occurs, and the entire page is moved from the disk to main memory. Since page faults take so long, they are handled in software and the CPU is not stalled. The CPU usually switches to some other task while the disk access occurs. The cache and main memory have the same relationship as the main memory and disk.

Figure 1.16 shows the range of sizes and access times of each level in the memory hierarchy for machines ranging from low-end desktops to high-end servers. Chapter 5 focuses on memory hierarchy design and contains a detailed example of a real system hierarchy.

Level | 1 | 2 | 3 | 4
Called | Registers | Cache | Main memory | Disk storage
Typical size | < 1 KB | < 4 MB | < 4 GB | > 1 GB
Implementation technology | Custom memory with multiple ports, CMOS or BiCMOS | On-chip or off-chip CMOS SRAM | CMOS DRAM | Magnetic disk
Access time (in ns) | 2–5 | 3–10 | 80–400 | 5,000,000
Bandwidth (in MB/sec) | 4000–32,000 | 800–5000 | 400–2000 | 4–32
Managed by | Compiler | Hardware | Operating system | Operating system/user
Backed by | Cache | Main memory | Disk | Tape

FIGURE 1.16 The typical levels in the hierarchy slow down and get larger as we move away from the CPU. Sizes are typical for a large workstation or small server. The implementation technology shows the typical technology used for these functions. The access time is given in nanoseconds for typical values in 1995; these times will decrease over time. Bandwidth is given in megabytes per second, assuming 64- to 256-bit paths between levels in the memory hierarchy.
As we move to lower levels of the hierarchy, the access times increase, making it feasible to manage the transfer less responsively.

Performance of Caches: The Basics

Because of locality and the higher speed of smaller memories, a memory hierarchy can substantially improve performance. There are several ways that we can look at the performance of a memory hierarchy and its impact on CPU performance. Let’s start with an example that uses Amdahl’s Law to compare a system with and without a cache.

EXAMPLE Suppose a cache is 10 times faster than main memory, and suppose that the cache can be used 90% of the time. How much speedup do we gain by using the cache?

ANSWER This is a simple application of Amdahl’s Law.

$$\text{Speedup} = \frac{1}{(1 - \text{\% of time cache can be used}) + \dfrac{\text{\% of time cache can be used}}{\text{Speedup using cache}}}$$

$$\text{Speedup} = \frac{1}{(1 - 0.9) + \dfrac{0.9}{10}}$$

$$\text{Speedup} = \frac{1}{0.19} \approx 5.3$$

Hence, we obtain a speedup from the cache of about 5.3 times.

In practice, we do not normally use Amdahl’s Law for evaluating memory hierarchies. Most machines will include a memory hierarchy, and the key issue is really how to design that hierarchy, which depends on more detailed measurements. An alternative method is to expand our CPU execution time equation to account for the number of cycles during which the CPU is stalled waiting for a memory access, which we call the memory stall cycles.
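As a quick numerical check of the Amdahl’s Law cache example above, the calculation can be sketched in a few lines. This is only an illustration under the example’s stated assumptions (a cache 10 times faster than main memory, usable 90% of the time); the function and parameter names are hypothetical.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only `fraction_enhanced` of the original
    execution time benefits from a speedup of `speedup_enhanced`."""
    remaining = 1.0 - fraction_enhanced               # time not helped at all
    improved = fraction_enhanced / speedup_enhanced   # helped time, shortened
    return 1.0 / (remaining + improved)


# The cache can be used 90% of the time and is 10 times faster.
print(round(amdahl_speedup(0.9, 10), 2))  # 5.26, i.e. about 5.3
```

Trying other values of the first argument shows how strongly the result depends on the fraction of time the cache can be used, which is why more detailed measurements matter when designing the hierarchy.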
The performance is then the product of the clock cycle time and the sum of the CPU cycles and the memory stall cycles:

$$\text{CPU execution time} = (\text{CPU clock cycles} + \text{Memory stall cycles}) \times \text{Clock cycle}$$

This equation assumes that the CPU clock cycles include the time to handle a cache hit, and that the CPU is stalled during a cache miss. In Chapter 5, we will analyze memory hierarchies in more detail, examining both these assumptions.

The number of memory stall cycles depends on both the number of misses and the cost per miss, which is called the miss penalty:

$$\begin{aligned}
\text{Memory stall cycles} &= \text{Number of misses} \times \text{Miss penalty} \\
&= \text{IC} \times \text{Misses per instruction} \times \text{Miss penalty} \\
&= \text{IC} \times \text{Memory references per instruction} \times \text{Miss rate} \times \text{Miss penalty}
\end{aligned}$$

The advantage of the last form is that the components can be easily measured: We already know how to measure IC (instruction count), and measuring the number of memory references per instruction can be done in the same fashion, since each instruction requires an instruction access and we can easily decide if it requires a data access. The component Miss rate is simply the fraction of cache accesses that result in a miss (i.e., number of accesses that miss divided by number of accesses). Miss rates are typically measured with cache simulators that take a trace of the instruction and data references, simulate the cache behavior to determine which references hit and which miss, and then report the hit and miss totals. The miss rate is one of the most important measures of cache design, but, as we will see in Chapter 5, not the only measure.

EXAMPLE Assume we have a machine where the CPI is 2.0 when all memory accesses hit in the cache. The only data accesses are loads and stores, and these total 40% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the machine be if all memory accesses were cache hits?

ANSWER
First compute the performance for the machine that always hits:

$$\begin{aligned}
\text{CPU execution time} &= (\text{CPU clock cycles} + \text{Memory stall cycles}) \times \text{Clock cycle} \\
&= (\text{IC} \times \text{CPI} + 0) \times \text{Clock cycle} \\
&= \text{IC} \times 2.0 \times \text{Clock cycle}
\end{aligned}$$

Now for the machine with the real cache, first we compute memory stall cycles:

$$\begin{aligned}
\text{Memory stall cycles} &= \text{IC} \times \text{Memory references per instruction} \times \text{Miss rate} \times \text{Miss penalty} \\
&= \text{IC} \times (1 + 0.4) \times 0.02 \times 25 \\
&= \text{IC} \times 0.7
\end{aligned}$$

where the middle term (1 + 0.4) represents one instruction access and 0.4 data accesses per instruction. The total performance is thus

$$\text{CPU execution time}_{\text{cache}} = (\text{IC} \times 2.0 + \text{IC} \times 0.7) \times \text{Clock cycle} = 2.7 \times \text{IC} \times \text{Clock cycle}$$

The performance ratio is the inverse of the execution times:

$$\frac{\text{CPU execution time}_{\text{cache}}}{\text{CPU execution time}} = \frac{2.7 \times \text{IC} \times \text{Clock cycle}}{2.0 \times \text{IC} \times \text{Clock cycle}} = 1.35$$

The machine with no cache misses is 1.35 times faster.

1.8 Fallacies and Pitfalls

The purpose of this section, which will be found in every chapter, is to explain some commonly held misbeliefs or misconceptions that you should avoid. We call such misbeliefs fallacies. When discussing a fallacy, we try to give a counterexample. We also discuss pitfalls, which are easily made mistakes. Often pitfalls are generalizations of principles that are true in a limited context. The purpose of these sections is to help you avoid making these errors in machines that you design.

Fallacy: MIPS is an accurate measure for comparing performance among computers.

One alternative to time as the metric is MIPS, or million instructions per second. For a given program, MIPS is simply

$$\text{MIPS} = \frac{\text{Instruction count}}{\text{Execution time} \times 10^6} = \frac{\text{Clock rate}}{\text{CPI} \times 10^6}$$

Some find this rightmost form convenient since clock rate is fixed for a machine and CPI is usually a small number, unlike instruction count or execution time.
Relating MIPS to time,

$$\text{Execution time} = \frac{\text{Instruction count}}{\text{MIPS} \times 10^6}$$

Since MIPS is a rate of operations per unit time, performance can be specified as the inverse of execution time, with faster machines having a higher MIPS rating. The good news about MIPS is that it is easy to understand, especially by a customer, and faster machines mean bigger MIPS, which matches intuition. The problem with using MIPS as a measure for comparison is threefold:

• MIPS is dependent on the instruction set, making it difficult to compare MIPS of computers with different instruction sets.
• MIPS varies between programs on the same computer.
• Most importantly, MIPS can vary inversely to performance!

The classic example of the last case is the MIPS rating of a machine with optional floating-point hardware. Since it generally takes more clock cycles per floating-point instruction than per integer instruction, floating-point programs using the optional hardware instead of software floating-point routines take less time but have a lower MIPS rating. Software floating point executes simpler instructions, resulting in a higher MIPS rating, but it executes so many more that overall execution time is longer.

We can even see such anomalies with optimizing compilers.

EXAMPLE Assume we build an optimizing compiler for the load-store machine for which the measurements in Figure 1.17 have been made. The compiler discards 50% of the arithmetic logic unit (ALU) instructions, although it cannot reduce loads, stores, or branches. Ignoring systems issues and assuming a 2-ns clock cycle time (500-MHz clock rate) and 1.57 unoptimized CPI, what is the MIPS rating for optimized code versus unoptimized code? Does the ranking of MIPS agree with the ranking of execution time?

Instruction type | Frequency | Clock cycle count
ALU ops | 43% | 1
Loads | 21% | 2
Stores | 12% | 2
Branches | 24% | 2

FIGURE 1.17 Measurements of the load-store machine.

ANSWER
We know that CPI_unoptimized = 1.57, so

$$\text{MIPS}_{\text{unoptimized}} = \frac{500\ \text{MHz}}{1.57 \times 10^6} = 318.5$$

The performance of unoptimized code is

$$\text{CPU time}_{\text{unoptimized}} = \text{IC}_{\text{unoptimized}} \times 1.57 \times (2 \times 10^{-9}) = 3.14 \times 10^{-9} \times \text{IC}_{\text{unoptimized}}$$

For optimized code:

$$\text{CPI}_{\text{optimized}} = \frac{(0.43/2) \times 1 + 0.21 \times 2 + 0.12 \times 2 + 0.24 \times 2}{1 - (0.43/2)} = 1.73$$

since half the ALU instructions are discarded (0.43/2) and the instruction count is reduced by the missing ALU instructions. Thus,

$$\text{MIPS}_{\text{optimized}} = \frac{500\ \text{MHz}}{1.73 \times 10^6} = 289.0$$

The performance of optimized code is

$$\text{CPU time}_{\text{optimized}} = (0.785 \times \text{IC}_{\text{unoptimized}}) \times 1.73 \times (2 \times 10^{-9}) = 2.72 \times 10^{-9} \times \text{IC}_{\text{unoptimized}}$$

The optimized code is 3.14/2.72 = 1.15 times faster, but its MIPS rating is lower: 289 versus 318!

As examples such as this one show, MIPS can fail to give a true picture of performance because it does not track execution time.

Fallacy: MFLOPS is a consistent and useful measure of performance.

Another popular alternative to execution time is million floating-point operations per second, abbreviated megaFLOPS or MFLOPS but always pronounced “megaflops.” The formula for MFLOPS is simply the definition of the acronym:

$$\text{MFLOPS} = \frac{\text{Number of floating-point operations in a program}}{\text{Execution time in seconds} \times 10^6}$$

Clearly, a MFLOPS rating is dependent on the machine and on the program. Since MFLOPS is intended to measure floating-point performance, it is not applicable outside that range. Compilers, as an extreme example, have a MFLOPS rating near nil no matter how fast the machine, since compilers rarely use floating-point arithmetic.

This term is less innocent than MIPS.
Based on operations rather than instructions, MFLOPS is intended to be a fair comparison between different machines. The belief is that the same program running on different computers would execute a different number of instructions but the same number of floating-point operations. Unfortunately, MFLOPS is not dependable because the set of floating-point operations is not consistent across machines. For example, the Cray C90 has no divide instruction, while the Intel Pentium has divide, square root, sine, and cosine. Another perceived problem is that the MFLOPS rating changes not only with the mixture of integer and floating-point operations but also with the mixture of fast and slow floating-point operations. For example, a program with 100% floating-point adds will have a higher rating than a program with 100% floating-point divides. (We discuss a proposed solution to this problem in Exercise 1.15 (b).)

Furthermore, like any other performance measure, the MFLOPS rating for a single program cannot be generalized to establish a single performance metric for a computer. Since MFLOPS is really just a constant divided by execution time for a specific program and specific input, MFLOPS is redundant to execution time, our principal measure of performance. And unlike execution time, it is tempting to characterize a machine with a single MIPS or MFLOPS rating without naming the program, specifying the I/O, or describing the versions of the OS and compilers.

Fallacy: Synthetic benchmarks predict performance for real programs.

The best known examples of such benchmarks are Whetstone and Dhrystone. These are not real programs and, as such, may not reflect program behavior for factors not measured. Compiler and hardware optimizations can artificially inflate performance of these benchmarks but not of real programs.
The other side of the coin is that because these benchmarks are not natural programs, they don’t reward optimizations of behaviors that occur in real programs. Here are some examples:

• Optimizing compilers can discard 25% of the Dhrystone code; examples include loops that are only executed once, making the loop overhead instructions unnecessary. To address these problems the authors of the benchmark “require” both optimized and unoptimized code to be reported. In addition, they “forbid” the practice of inline-procedure expansion optimization, since Dhrystone’s simple procedure structure allows elimination of all procedure calls at almost no increase in code size.

• Most Whetstone floating-point loops execute small numbers of times or include calls inside the loop. These characteristics are different from many real programs. As a result Whetstone underrewards many loop optimizations and gains little from techniques such as multiple issue (Chapter 4) and vectorization (Appendix B).

• Compilers can optimize a key piece of the Whetstone loop by noting the relationship between square root and exponential, even though this is very unlikely to occur in real programs. For example, one key loop contains the following FORTRAN code:

    X = SQRT(EXP(ALOG(X)/T1))

  It could be compiled as if it were

    X = EXP(ALOG(X)/(2×T1))

  since

$$\mathrm{SQRT}(\mathrm{EXP}(X)) = \sqrt{e^{X}} = e^{X/2} = \mathrm{EXP}(X/2)$$

  It would be surprising if such optimizations were ever invoked except in this synthetic benchmark. (Yet one reviewer of this book found several compilers that performed this optimization!) This single change converts all calls to the square root function in Whetstone into multiplies by 2, surely improving performance, if Whetstone is your measure.

Fallacy: Benchmarks remain valid indefinitely.

Several factors influence the usefulness of a benchmark as a predictor of real performance and some of these may change over time.
A big factor influencing the usefulness of a benchmark is the ability of the benchmark to resist “cracking,” also known as benchmark engineering or “benchmarksmanship.” Once a benchmark becomes standardized and popular, there is tremendous pressure to improve performance by targeted optimizations or by aggressive interpretation of the rules for running the benchmark. Small kernels or programs that spend their time in a very small number of lines of code are particularly vulnerable. For example, despite the best intentions, the initial SPEC89 benchmark suite included a small kernel, called matrix300, which consisted of eight different 300 × 300 matrix multiplications. In this kernel, 99% of the execution time was in a single line (see SPEC [1989]). Optimization of this inner loop by the compiler (using an idea called blocking, discussed in Chapter 5) for the IBM Powerstation 550 resulted in performance improvement by a factor of more than 9 over an earlier version of the compiler! This benchmark tested compiler performance and was not, of course, a good indication of overall performance, nor of this particular optimization. Even after the elimination of this benchmark, vendors found methods to tune the performance of individual benchmarks by the use of different compilers or preprocessors, as well as benchmark-specific flags. While the baseline performance measurements restrict this (the rules for baseline tuning appear on pages 57–58), the tuned or optimized performance does not. In fact, benchmark-specific flags are allowed, even if they are illegal and lead to incorrect compilation in general! This has resulted in long lists of options, as Figure 1.18 shows. This incredible list of impenetrable options used in the tuned measurements for an IBM Powerstation 590, which is not significantly different from the option lists used by other vendors, makes it clear why the baseline measurements were needed. 
The performance difference between the baseline and tuned numbers can be substantial. For the SPECfp92 benchmarks on the Powerstation 590, the overall performance (which by SPEC92 rules is summarized by geometric mean) is 1.2 times higher for the optimized programs. For some benchmarks, however, the difference is considerably larger: For the nasa7 kernels, the optimized performance is 2.1 times higher than the baseline!

SPECfp92 Tuning parameters/Notes/Summary of changes:
Software: KAP for IBM FORTRAN Ver. 3.1 Beta, VAST-2 for XL FORTRAN Ver. 4.03 Beta, KAP for IBM C, Ver. 1.3
all: -O3 -qarch=pwrx -BI:/lib/syscalls.exp
013: -qnosave -P -Wp,-ea478,-Iindxx:dcsol,-Sv01.f:v06.f -lblas
015: -P -Wp,-ea478,-fz,-Isi:coeray,-Ssi.f:coeray.f -lblas
039: -Pk -Wp,-r=3,-inline,-ur=8,-ur2=2 00,-ind=2,-in11=2
034: -Pk -Wp,-r=3,-inline,-ur=4
047: -Q-Pk -Wp,-r=3,-o=4,-ag=a
048: -Pk -Wp,-inline,-r=3,-ur=2,-ur=10 0
052: -Q -Q-input-hidden -qhsflt -Dfloat=double -qassert-typeptr -qproclocal -qmaxmem=9999999 +K4 +Kargs=ur2=1
056: -qproclocal -Dfloat=double -qunroll=2 -qhsflt -qmaxmem=999999 +K4 -Kargs=-ar1=2:-ur2=5000
077: -O3 -qstrict -qarch=ppc -qmaxmem=-1 -Pk -Wp,-inline,-r=3,-ur=2,-ur2=9999
078: -qhsflt -P -Wp,-ea278,-fz,-me -qhot
089: -qnosave -qhssngl -Pk -Wp,-inline=trngv,-r=3,-ur=2,-ur2=9999
090: -P -Wp,-ea,-f1 -qhot
093: -DTIMES -P -Wp,-eaj78,-Rvpetst:vpenta:fftst -qfloat=nosqrt -lesslp2
094: -P -Wp,-ea78 -lesslp2

FIGURE 1.18 The tuning parameters for the SPECfp92 report on an IBM RS/6000 Powerstation 590. This is the portion of the SPEC report for the tuned performance corresponding to that in Figure 1.10 on page 24. These parameters describe the compiler and preprocessor (two versions of KAP and a version of VAST-2) as well as the options used for the tuned SPEC92 numbers. Each line shows the option used for one of the SPECfp92 benchmarks. The benchmarks are identified by number but appear in the same order as given in Figure 1.9 on page 22. Data from SPEC [1994].

Benchmark engineering is sometimes applied to the runtime libraries. For example, SPEC92 added a spreadsheet to the SPEC92 integer benchmarks (called sc). Like any spreadsheet, sc spends a great deal of its time formatting data for the screen, a function that is handled in a UNIX runtime library. Normally such screen I/O is synchronous: each I/O is completed before the next one is done. This increases the runtime substantially. Several companies observed that when the benchmark is run, its output goes to a file, in which case the I/O need not be synchronous. Instead the I/O can be done to a memory buffer that is flushed to disk after the program completes, thus taking the I/O time out of the measurement loop. One company even went a step farther, realizing that the file is never read, and tossed the I/O completely. If the benchmark was meant to indicate real performance of a spreadsheet-like program, these “optimizations” have defeated such goals. Perhaps even worse than the fact that this creative engineering makes the program perform differently is that it makes it impossible to compare among vendors’ machines, which was the key reason SPEC was created.

Ongoing improvements in technology can also change what a benchmark measures.
Consider the benchmark gcc, considered one of the most realistic and challenging of the SPEC92 benchmarks. Its performance is a combination of CPU time and real system time. Since the input remains fixed and real system time is limited by factors, including disk access time, that improve slowly, an increasing amount of the runtime is system time rather than CPU time. This may be appropriate. On the other hand, it may be appropriate to change the input over time, reflecting the desire to compile larger programs. In fact, the SPEC92 input was changed to include four copies of each input file used in SPEC89; while this increases runtime, it may or may not reflect the way compilers are actually being used. Over a long period of time, these changes may make even a well-chosen benchmark obsolete.

Fallacy: Peak performance tracks observed performance.

One definition of peak performance is performance a machine is “guaranteed not to exceed.” The gap between peak performance and observed performance is typically a factor of 10 or more in supercomputers. (See Appendix B on vectors for an explanation.) Since the gap is so large, peak performance is not useful in predicting observed performance unless the workload consists of small programs that normally operate close to the peak.

As an example of this fallacy, a small code segment using long vectors ran on the Hitachi S810/20 at 236 MFLOPS and on the Cray X-MP at 115 MFLOPS. Although this suggests the S810 is 2.05 times faster than the X-MP, the X-MP runs a program with more typical vector lengths 1.97 times faster than the S810. These data are shown in Figure 1.19.
Measurement | Cray X-MP | Hitachi S810/20 | Performance
A(i)=B(i)*C(i)+D(i)*E(i) (vector length 1000 done 100,000 times) | 2.6 secs | 1.3 secs | Hitachi 2.05 times faster
Vectorized FFT (vector lengths 64,32,…,2) | 3.9 secs | 7.7 secs | Cray 1.97 times faster

FIGURE 1.19 Measurements of peak performance and actual performance for the Hitachi S810/20 and the Cray X-MP. Data from pages 18–20 of Lubeck, Moore, and Mendez [1985]. Also see Fallacies and Pitfalls in Appendix B.

While the use of peak performance has been rampant in the supercomputer business, its use in the microprocessor business is just as misleading. For example, in 1994 DEC announced a version of the Alpha microprocessor capable of executing 1.2 billion instructions per second at its 300-MHz clock rate. The only way this processor can achieve this performance is for two integer instructions and two floating-point instructions to be executed each clock cycle. This machine has a peak performance that is almost 50 times the peak performance of the fastest microprocessor reported in the first SPEC benchmark report in 1989 (a MIPS M/2000, which had a 25-MHz clock). The overall SPEC92 number of the DEC Alpha processor, however, is only about 15 times higher on integer and 25 times higher on FP. This rate of performance improvement is still spectacular, even if peak performance is not a good indicator.

The authors hope that peak performance can be quarantined to the supercomputer industry and eventually eradicated from that domain. In any case, approaching supercomputer performance is not an excuse for adopting dubious supercomputer marketing habits.

1.9 Concluding Remarks

This chapter has introduced a number of concepts that we will expand upon as we go through this book. The major ideas in instruction set architecture and the alternatives available will be the primary subjects of Chapter 2.
Not only will we see the functional alternatives, we will also examine quantitative data that enable us to understand the trade-offs. The quantitative principle, Make the common case fast, will be a guiding light in this next chapter, and the CPU performance equation will be our major tool for examining instruction set alternatives. Chapter 2 concludes with a hypothetical instruction set, called DLX, which is designed on the basis of observations of program behavior that we will make in the chapter. In Chapter 2, we will include a section, Crosscutting Issues, that specifically addresses interactions between topics addressed in different chapters. In that section within Chapter 2, we focus on the interactions between compilers and instruction set design. This Crosscutting Issues section will appear in all future chapters, with the exception of Chapter 4 on advanced pipelining. In later chapters, the Crosscutting Issues sections describe interactions between instruction sets and implementation techniques. In Chapters 3 and 4 we turn our attention to pipelining, the most common implementation technique used for making faster processors. Pipelining overlaps the execution of instructions and thus can achieve lower CPIs and/or lower clock cycle times. As in Chapter 2, the CPU performance equation will be our guide in the evaluation of alternatives. Chapter 3 starts with a review of the basics of machine organization and control and moves through the basic ideas in pipelining, including the control of more complex floating-point pipelines. The chapter concludes with an examination and analysis of the R4000. At the end of Chapter 3, you will be able to understand the pipeline design of almost every processor built before 1990. Chapter 4 is an extensive examination of advanced pipelining techniques that attempt to get higher performance by exploiting more overlap among instructions than the simple pipelines in use in the 1980s. 
This chapter begins with an extensive discussion of basic concepts that will prepare you not only for the wide range of ideas examined in Chapter 4, but also to understand and analyze new techniques that will be introduced in the coming years. Chapter 4 uses examples that span about 20 years, drawing from the first modern supercomputers (the CDC 6600 and IBM 360/91) to the latest processors that first reached the market in 1995. Throughout Chapters 3 and 4, we will repeatedly look at techniques that rely either on clever hardware techniques or on sophisticated compiler technology. These alternatives are an exciting aspect of pipeline design, likely to continue through the decade of the 1990s. In Chapter 5 we turn to the all-important area of memory system design. The Putting It All Together section in this chapter serves as a basic introduction. We will examine a wide range of techniques that conspire to make memory look infinitely large while still being as fast as possible. The simple equations we develop in this chapter will serve as a starting point for the quantitative evaluation of the many techniques used for memory system design. As in Chapters 3 and 4, we will see that hardware-software cooperation has become a key to high-performance memory systems, just as it has to high-performance pipelines. In Chapters 6 and 7, we move away from a CPU-centric view and discuss issues in storage systems and in system interconnect. We apply a similar quantitative approach, but one based on observations of system behavior and using an end-to-end approach to performance analysis. Chapter 6 addresses the important issue of how to efficiently store and retrieve data using primarily lower-cost magnetic storage technologies. As we saw earlier, such technologies offer better cost per bit by a factor of 50–100 over DRAM.
Magnetic storage is likely to remain advantageous wherever cost or nonvolatility (it keeps the information after the power is turned off) are important. In Chapter 6, our focus is on examining the performance of magnetic storage systems for typical I/O-intensive workloads, which are the counterpart to the CPU benchmarks we saw in this chapter. We extensively explore the idea of RAID-based systems, which use many small disks, arranged in a redundant fashion to achieve both high performance and high availability. Chapter 7 also discusses the primary interconnection technology used for I/O devices, namely buses. This chapter explores the topic of system interconnect more broadly, including large-scale MPP interconnects and networks used to allow separate computers to communicate. We put special emphasis on the emerging new networking standards developing around ATM. Our final chapter returns to the issue of achieving higher performance through the use of multiple processors, or multiprocessors. Instead of using parallelism to overlap individual instructions, it uses parallelism to allow multiple instruction streams to be executed simultaneously on different processors. Our focus is on the dominant form of multiprocessors, shared-memory multiprocessors, though we introduce other types as well and discuss the broad issues that arise in any multiprocessor. Here again, we explore a variety of techniques, focusing on the important ideas first introduced in the 1980s as well as those that are developing as this book goes to press. We conclude this book with a variety of appendices that introduce you to important topics not covered in the eight chapters. Appendix A covers the topic of floating-point arithmetic—a necessary ingredient for any high-performance machine. 
The incorrect implementation of floating-point divide in the Intel Pentium processor, which led to an estimated impact in excess of $300 million, should serve as a clear reminder about the importance of floating point! Appendix B covers the topic of vector machines. In the scientific market, such machines are a viable alternative to the multiprocessors discussed in Chapter 8. Although vector machines do not dominate supercomputing the way they did in the 1980s, they still include many important concepts in pipelining, parallelism, and memory systems that are useful in different machine organizations. Appendix C surveys the most popular RISC instruction set architectures and contrasts the differences among them, using DLX as a starting point. Appendix D examines the popular 80x86 instruction set, the most heavily used instruction set architecture in existence. Appendix D compares the design of the 80x86 instruction set with that of the RISC machines described in Chapter 2 and in Appendix C. Finally, Appendix E discusses implementation issues in coherence protocols.

1.10 Historical Perspective and References

If... history... teaches us anything, it is that man in his quest for knowledge and progress, is determined and cannot be deterred.
John F. Kennedy, Address at Rice University (1962)

A section of historical perspectives closes each chapter in the text. This section provides historical background on some of the key ideas presented in the chapter. The authors may trace the development of an idea through a series of machines or describe significant projects. If you’re interested in examining the initial development of an idea or machine or interested in further reading, references are provided at the end of the section.

The First General-Purpose Electronic Computers

J. Presper Eckert and John Mauchly at the Moore School of the University of Pennsylvania built the world’s first electronic general-purpose computer.
This machine, called ENIAC (Electronic Numerical Integrator and Calculator), was funded by the U.S. Army and became operational during World War II, but it was not publicly disclosed until 1946. ENIAC was used for computing artillery firing tables. The machine was enormous (100 feet long, 8 1/2 feet high, and several feet wide), far beyond the size of any computer built today. Each of the 20 10-digit registers was 2 feet long. In total, there were 18,000 vacuum tubes. While the size was three orders of magnitude bigger than the size of machines built today, it was more than five orders of magnitude slower, with an add taking 200 microseconds. The ENIAC provided conditional jumps and was programmable, which clearly distinguished it from earlier calculators. Programming was done manually by plugging up cables and setting switches and required from a half-hour to a whole day. Data were provided on punched cards. The ENIAC was limited primarily by a small amount of storage and tedious programming. In 1944, John von Neumann was attracted to the ENIAC project. The group wanted to improve the way programs were entered and discussed storing programs as numbers; von Neumann helped crystallize the ideas and wrote a memo proposing a stored-program computer called EDVAC (Electronic Discrete Variable Automatic Computer). Herman Goldstine distributed the memo and put von Neumann’s name on it, much to the dismay of Eckert and Mauchly, whose names were omitted. This memo has served as the basis for the commonly used term von Neumann computer. The authors and several early inventors in the computer field believe that this term gives too much credit to von Neumann, who wrote up the ideas, and too little to the engineers, Eckert and Mauchly, who worked on the machines. For this reason, this term will not appear in this book.
In 1946, Maurice Wilkes of Cambridge University visited the Moore School to attend the latter part of a series of lectures on developments in electronic computers. When he returned to Cambridge, Wilkes decided to embark on a project to build a stored-program computer named EDSAC, for Electronic Delay Storage Automatic Calculator. The EDSAC became operational in 1949 and was the world’s first full-scale, operational, stored-program computer [Wilkes, Wheeler, and Gill 1951; Wilkes 1985, 1995]. (A small prototype called the Mark I, which was built at the University of Manchester and ran in 1948, might be called the first operational stored-program machine.) The EDSAC was an accumulator-based architecture. This style of instruction set architecture remained popular until the early 1970s. (Chapter 2 starts with a brief summary of the EDSAC instruction set.) In 1947, Eckert and Mauchly applied for a patent on electronic computers. The dean of the Moore School, by demanding the patent be turned over to the university, may have helped Eckert and Mauchly conclude they should leave. Their departure crippled the EDVAC project, which did not become operational until 1952. Goldstine left to join von Neumann at the Institute for Advanced Study at Princeton in 1946. Together with Arthur Burks, they issued a report based on the 1944 memo [1946]. The paper led to the IAS machine built by Julian Bigelow at Princeton’s Institute for Advanced Study. It had a total of 1024 40-bit words and was roughly 10 times faster than ENIAC. The group thought about uses for the machine, published a set of reports, and encouraged visitors. These reports and visitors inspired the development of a number of new computers. The paper by Burks, Goldstine, and von Neumann was incredible for the period. Reading it today, you would never guess this landmark paper was written 50 years ago, as most of the architectural concepts seen in modern computers are discussed there.
Recently, there has been some controversy about John Atanasoff, who built a small-scale electronic computer in the early 1940s [Atanasoff 1940]. His machine, designed at Iowa State University, was a special-purpose computer that was never completely operational. Mauchly briefly visited Atanasoff before he built ENIAC. The presence of the Atanasoff machine, together with delays in filing the ENIAC patents (the work was classified and patents could not be filed until after the war) and the distribution of von Neumann’s EDVAC paper, were used to break the Eckert-Mauchly patent [Larson 1973]. Though controversy still rages over Atanasoff’s role, Eckert and Mauchly are usually given credit for building the first working, general-purpose, electronic computer [Stern 1980]. Atanasoff, however, demonstrated several important innovations included in later computers. One of the most important was the use of a binary representation for numbers. Atanasoff deserves much credit for his work, and he might fairly be given credit for the world’s first special-purpose electronic computer. Another early machine that deserves some credit was a special-purpose machine built by Konrad Zuse in Germany in the late 1930s and early 1940s. This machine was electromechanical and, because of the war, never extensively pursued. In the same time period as ENIAC, Howard Aiken was designing an electromechanical computer called the Mark-I at Harvard. The Mark-I was built by a team of engineers from IBM. Aiken followed the Mark-I with a relay machine, the Mark-II, and a pair of vacuum tube machines, the Mark-III and Mark-IV. The Mark-III and Mark-IV were being built after the first stored-program machines. Because they had separate memories for instructions and data, the machines were regarded as reactionary by the advocates of stored-program computers. The term Harvard architecture was coined to describe this type of machine.
Though clearly different from the original sense, this term is used today to apply to machines with a single main memory but with separate instruction and data caches. The Whirlwind project [Redmond and Smith 1980] began at MIT in 1947 and was aimed at applications in real-time radar signal processing. While it led to several inventions, its overwhelming innovation was the creation of magnetic core memory, the first reliable and inexpensive memory technology. Whirlwind had 2048 16-bit words of magnetic core. Magnetic cores served as the main memory technology for nearly 30 years. Commercial Developments In December 1947, Eckert and Mauchly formed Eckert-Mauchly Computer Corporation. Their first machine, the BINAC, was built for Northrop and was shown in August 1949. After some financial difficulties, the Eckert-Mauchly Computer Corporation was acquired by Remington-Rand, where they built the UNIVAC I, designed to be sold as a general-purpose computer. First delivered in June 1951, the UNIVAC I sold for $250,000 and was the first successful commercial computer—48 systems were built! Today, this early machine, along with many other fascinating pieces of computer lore, can be seen at the Computer Museum in Boston, Massachusetts. IBM, which earlier had been in the punched card and office automation business, didn’t start building computers until 1950. The first IBM computer, the IBM 701, shipped in 1952 and eventually sold 19 units. In the early 1950s, many people were pessimistic about the future of computers, believing that the market and opportunities for these “highly specialized” machines were quite limited. Several books describing the early days of computing have been written by the pioneers [Wilkes 1985, 1995; Goldstine 1972]. There are numerous independent histories, often built around the people involved [Slater 1987], as well as a journal, Annals of the History of Computing, devoted to the history of computing. 
The history of some of the computers invented after 1960 can be found in Chapter 2 (the IBM 360, the DEC VAX, the Intel 80x86, and the early RISC machines), Chapters 3 and 4 (the pipelined processors, including Stretch and the CDC 6600), and Appendix B (vector processors including the TI ASC, CDC Star, and Cray processors).

Development of Quantitative Performance Measures: Successes and Failures

In the earliest days of computing, designers set performance goals: ENIAC was to be 1000 times faster than the Harvard Mark-I, and the IBM Stretch (7030) was to be 100 times faster than the fastest machine in existence. What wasn’t clear, though, was how this performance was to be measured. In looking back over the years, it is a consistent theme that each generation of computers obsoletes the performance evaluation techniques of the prior generation. The original measure of performance was time to perform an individual operation, such as addition. Since most instructions took the same execution time, the timing of one gave insight into the others. As the execution times of instructions in a machine became more diverse, however, the time for one operation was no longer useful for comparisons. To take these differences into account, an instruction mix was calculated by measuring the relative frequency of instructions in a computer across many programs. The Gibson mix [Gibson 1970] was an early popular instruction mix. Multiplying the time for each instruction by its weight in the mix and summing the products gave the user the average instruction execution time. (If measured in clock cycles, average instruction execution time is the same as average CPI.) Since instruction sets were similar, this was a more accurate comparison than add times. From average instruction execution time, then, it was only a small step to MIPS (as we have seen, the one is the inverse of the other).
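The step from an instruction mix to an average instruction time, and from there to MIPS, can be sketched in a few lines of Python. This is only an illustration: the three instruction classes, their weights, and their times are made-up values (not the Gibson mix), and the function names are hypothetical.

```python
def average_instruction_time(mix):
    """Weighted average instruction time for a mix of (weight, time) pairs.

    Weights are relative frequencies; they are normalized here, so they
    need not sum to 1. The result is in the same unit as the times.
    """
    total_weight = sum(weight for weight, _ in mix)
    if total_weight <= 0:
        raise ValueError("mix must have positive total weight")
    return sum(weight * time for weight, time in mix) / total_weight


def mips_from_average_time(avg_time_us):
    """MIPS is the inverse of the average instruction time in microseconds."""
    if avg_time_us <= 0:
        raise ValueError("average instruction time must be positive")
    return 1.0 / avg_time_us


# Hypothetical mix (assumed values, not measured data):
# (relative frequency, execution time in microseconds)
EXAMPLE_MIX = [
    (0.5, 2.0),   # add-like instructions
    (0.3, 4.0),   # loads and stores
    (0.2, 10.0),  # multiplies
]

if __name__ == "__main__":
    avg = average_instruction_time(EXAMPLE_MIX)
    print(f"average instruction time: {avg:.2f} microseconds")
    print(f"native MIPS: {mips_from_average_time(avg):.3f}")
```

If the times are given in clock cycles instead of microseconds, the same weighted average yields the average CPI.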
MIPS has the virtue of being easy for the layman to understand, hence its popularity. As CPUs became more sophisticated and relied on memory hierarchies and pipelining, there was no longer a single execution time per instruction; MIPS could not be calculated from the mix and the manual. The next step was benchmarking using kernels and synthetic programs. Curnow and Wichmann [1976] created the Whetstone synthetic program by measuring scientific programs written in Algol 60. This program was converted to FORTRAN and was widely used to characterize scientific program performance. An effort with similar goals to Whetstone, the Livermore FORTRAN Kernels, was made by McMahon [1986] and researchers at Lawrence Livermore Laboratory in an attempt to establish a benchmark for supercomputers. These kernels, however, consisted of loops from real programs. As it became clear that using MIPS to compare architectures with different instruction sets would not work, a notion of relative MIPS was created. When the VAX-11/780 was ready for announcement in 1977, DEC ran small benchmarks that were also run on an IBM 370/158. IBM marketing referred to the 370/158 as a 1-MIPS computer, and since the programs ran at the same speed, DEC marketing called the VAX-11/780 a 1-MIPS computer. Relative MIPS for a machine M was defined based on some reference machine as

\[ \text{MIPS}_{M} = \frac{\text{Performance}_{M}}{\text{Performance}_{\text{reference}}} \times \text{MIPS}_{\text{reference}} \]

The popularity of the VAX-11/780 made it a popular reference machine for relative MIPS, especially since relative MIPS for a 1-MIPS computer is easy to calculate: If a machine was five times faster than the VAX-11/780, for that benchmark its rating would be 5 relative MIPS. The 1-MIPS rating was unquestioned for four years, until Joel Emer of DEC measured the VAX-11/780 under a timesharing load. He found that the VAX-11/780 native MIPS rating was 0.5.
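As a small sketch of the relative MIPS definition, the Python function below takes performance to be the inverse of execution time, so the performance ratio becomes a ratio of run times. The function name, parameter names, and the 10-second and 2-second run times are assumptions chosen for illustration.

```python
def relative_mips(time_on_reference, time_on_machine, mips_reference=1.0):
    """Relative MIPS of a machine against a reference machine.

    Performance is taken as 1 / execution time, so
    Performance_M / Performance_reference = time_on_reference / time_on_machine.
    """
    if time_on_reference <= 0 or time_on_machine <= 0:
        raise ValueError("execution times must be positive")
    performance_ratio = time_on_reference / time_on_machine
    return performance_ratio * mips_reference


if __name__ == "__main__":
    # Hypothetical benchmark: 10 s on the reference machine, 2 s on machine M,
    # so M is five times faster.
    print(relative_mips(10.0, 2.0))                      # 1-MIPS reference
    print(relative_mips(10.0, 2.0, mips_reference=0.5))  # measured native rating
```

With the conventional 1-MIPS rating for the reference, a machine five times faster rates 5 relative MIPS; plugging in the measured native rating of 0.5 instead would give 2.5, which shows how much the result depends on the rating assigned to the reference machine.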
Subsequent VAXes that run 3 native MIPS for some benchmarks were therefore called 6-MIPS machines because they run six times faster than the VAX-11/780. By the early 1980s, the term MIPS was almost universally used to mean relative MIPS. The 1970s and 1980s marked the growth of the supercomputer industry, which was defined by high performance on floating-point-intensive programs. Average instruction time and MIPS were clearly inappropriate metrics for this industry, hence the invention of MFLOPS. Unfortunately, customers quickly forget the program used for the rating, and marketing groups decided to start quoting peak MFLOPS in the supercomputer performance wars. SPEC (System Performance and Evaluation Cooperative) was founded in the late 1980s to try to improve the state of benchmarking and make a more valid basis for comparison. The group initially focused on workstations and servers in the UNIX marketplace, and that remains the primary focus of these benchmarks today. The first release of SPEC benchmarks, now called SPEC89, was a substantial improvement in the use of more realistic benchmarks. SPEC89 was replaced by SPEC92. This release enlarged the set of programs, made the inputs to some benchmarks bigger, and specified new run rules. To reduce the large number of benchmark-specific compiler flags and the use of targeted optimizations, in 1994 SPEC introduced rules for compilers and compilation switches to be used in determining the SPEC92 baseline performance:

1. The optimization options are safe: it is expected that they could generally be used on any program.
2. The same compiler and flags are used for all the benchmarks.
3. No assertion flags, which would tell the compiler some fact it could not derive, are allowed.
4. Flags that allow inlining of library routines normally considered part of the language are allowed, though other such inlining hints are disallowed by rule 5.
5. No program names or subroutine names are allowed in flags.
6. Feedback-based optimization is not allowed.
7. Flags that change the default size of a data item (for example, single precision to double precision) are not allowed.

Specifically permitted are flags that direct the compiler to compile for a particular implementation and flags that allow the compiler to relax certain numerical accuracy requirements (such as left-to-right evaluation). The intention is that the baseline results are what a casual user could achieve without extensive effort. SPEC also has produced system-oriented benchmarks that can be used to benchmark a system including I/O and OS functions, as well as a throughput-oriented measure (SPECrate), suitable for servers. What has become clear is that maintaining the relevance of these benchmarks in an area of rapid performance improvement will be a continuing investment.

Implementation-Independent Performance Analysis

As the distinction between architecture and implementation pervaded the computing community in the 1970s, the question arose whether the performance of an architecture itself could be evaluated, as opposed to an implementation of the architecture. Many of the leading people in the field pursued this notion. One of the ambitious studies of this question performed at Carnegie Mellon University is summarized in Fuller and Burr [1977]. Three quantitative measures were invented to scrutinize architectures:

S: Number of bytes for program code
M: Number of bytes transferred between memory and the CPU during program execution for code and data (S measures size of code at compile time, while M is memory traffic during program execution.)
R: Number of bytes transferred between registers in a canonical model of a CPU

Once these measures were taken, a weighting factor was applied to them to determine which architecture was “best.” The VAX architecture was designed at the height of popularity of the Carnegie Mellon study, and by those measures it does very well. Architectures created since 1985, however, have poorer measures than the VAX using these metrics, yet their implementations do well against the VAX implementations. For example, Figure 1.20 compares S, M, and CPU time for the VAXstation 3100, which uses the VAX instruction set, and the DECstation 3100, which doesn’t. The DECstation 3100 is about three to five times faster, even though its S measure is 35% to 70% worse and its M measure is 5% to 15% worse. The attempt to evaluate architecture independently of implementation was a valiant, if not successful, effort.

Program | S, VAX 3100 (bytes) | S, DEC 3100 (bytes) | M, VAX 3100 (MB) | M, DEC 3100 (MB) | CPU time, VAX 3100 (secs) | CPU time, DEC 3100 (secs)
Gnu C Compiler | 409,600 | 688,128 | 18 | 21 | 291 | 90
Common TeX | 158,720 | 217,088 | 67 | 78 | 449 | 95
spice | 223,232 | 372,736 | 99 | 106 | 352 | 94

FIGURE 1.20 Code size and CPU time of the VAXstation 3100 and DECstation 3100 for Gnu C Compiler, TeX, and spice. S is code size in bytes; M is megabytes of code plus data transferred. Both machines were announced the same day by the same company, and they run the same operating system and similar technology. The difference is in the instruction sets, compilers, clock cycle time, and organization.

References

AMDAHL, G. M. [1967]. “Validity of the single processor approach to achieving large scale computing capabilities,” Proc. AFIPS 1967 Spring Joint Computer Conf. 30 (April), Atlantic City, N.J., 483–485. ATANASOFF, J. V. [1940]. “Computing machine for the solution of large systems of linear equations,” Internal Report, Iowa State University, Ames. BELL, C. G. [1984]. “The mini and micro industries,” IEEE Computer 17:10 (October), 14–30. BELL, C.
G., J. C. MUDGE, AND J. E. MCNAMARA [1978]. A DEC View of Computer Engineering, Digital Press, Bedford, Mass. BURKS, A. W., H. H. GOLDSTINE, AND J. VON NEUMANN [1946]. “Preliminary discussion of the logical design of an electronic computing instrument,” Report to the U.S. Army Ordnance Department, p. 1; also appears in Papers of John von Neumann, W. Aspray and A. Burks, eds., MIT Press, Cambridge, Mass., and Tomash Publishers, Los Angeles, Calif., 1987, 97–146. CURNOW, H. J. AND B. A. WICHMANN [1976]. “A synthetic benchmark,” The Computer J., 19:1. FLEMMING, P. J. AND J. J. WALLACE [1986]. “How not to lie with statistics: The correct way to summarize benchmarks results,” Comm. ACM 29:3 (March), 218–221. FULLER, S. H. AND W. E. BURR [1977]. “Measurement and evaluation of alternative computer architectures,” Computer 10:10 (October), 24–35. GIBSON, J. C. [1970]. “The Gibson mix,” Rep. TR. 00.2043, IBM Systems Development Division, Poughkeepsie, N.Y. (Research done in 1959.) GOLDSTINE, H. H. [1972]. The Computer: From Pascal to von Neumann, Princeton University Press, Princeton, N.J. JAIN, R. [1991]. The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling, Wiley, New York. LARSON, E. R. [1973]. “Findings of fact, conclusions of law, and order for judgment,” File No. 4–67, Civ. 138, Honeywell v. Sperry Rand and Illinois Scientific Development, U.S. District Court for the State of Minnesota, Fourth Division (October 19). LUBECK, O., J. MOORE, AND R. MENDEZ [1985]. “A benchmark comparison of three supercomputers: Fujitsu VP-200, Hitachi S810/20, and Cray X-MP/2,” Computer 18:12 (December), 10–24. MCMAHON, F. M. [1986]. “The Livermore FORTRAN kernels: A computer test of numerical performance range,” Tech. Rep. UCRL-55745, Lawrence Livermore National Laboratory, Univ. of California, Livermore (December). REDMOND, K. C. AND T. M. SMITH [1980]. 
Project Whirlwind: The History of a Pioneer Computer, Digital Press, Boston. SHURKIN, J. [1984]. Engines of the Mind: A History of the Computer, W. W. Norton, New York. SLATER, R. [1987]. Portraits in Silicon, MIT Press, Cambridge, Mass. SMITH, J. E. [1988]. “Characterizing computer performance with a single number,” Comm. ACM 31:10 (October), 1202–1206. SPEC [1989]. SPEC Benchmark Suite Release 1.0, October 2, 1989. SPEC [1994]. SPEC Newsletter (June). STERN, N. [1980]. “Who invented the first electronic digital computer,” Annals of the History of Computing 2:4 (October), 375–376. TOUMA, W. R. [1993]. The Dynamics of the Computer Industry: Modeling the Supply of Workstations and Their Components, Kluwer Academic, Boston. WEICKER, R. P. [1984]. “Dhrystone: A synthetic systems programming benchmark,” Comm. ACM 27:10 (October), 1013–1030. WILKES, M. V. [1985]. Memoirs of a Computer Pioneer, MIT Press, Cambridge, Mass. WILKES, M. V. [1995]. Computing Perspectives, Morgan Kaufmann, San Francisco. WILKES, M. V., D. J. WHEELER, AND S. GILL [1951]. The Preparation of Programs for an Electronic Digital Computer, Addison-Wesley, Cambridge, Mass.

EXERCISES

Each exercise has a difficulty rating in square brackets and a list of the chapter sections it depends on in angle brackets. See the Preface for a description of the difficulty scale.

1.1 [20/10/10/15] <1.6> In this exercise, assume that we are considering enhancing a machine by adding a vector mode to it. When a computation is run in vector mode it is 20 times faster than the normal mode of execution. We call the percentage of time that could be spent using vector mode the percentage of vectorization. Vectors are discussed in Appendix B, but you don’t need to know anything about how they work to answer this question!
a. [20] <1.6> Draw a graph that plots the speedup as a percentage of the computation performed in vector mode.
Label the y axis “Net speedup” and label the x axis “Percent vectorization.”
b. [10] <1.6> What percentage of vectorization is needed to achieve a speedup of 2?
c. [10] <1.6> What percentage of vectorization is needed to achieve one-half the maximum speedup attainable from using vector mode?
d. [15] <1.6> Suppose you have measured the percentage of vectorization for programs to be 70%. The hardware design group says they can double the speed of the vector rate with a significant additional engineering investment. You wonder whether the compiler crew could increase the use of vector mode as another approach to increasing performance. How much of an increase in the percentage of vectorization (relative to current usage) would you need to obtain the same performance gain? Which investment would you recommend?

1.2 [15/10] <1.6> Assume, as in the Amdahl’s Law Example on page 30, that we make an enhancement to a computer that improves some mode of execution by a factor of 10. Enhanced mode is used 50% of the time, measured as a percentage of the execution time when the enhanced mode is in use. Recall that Amdahl’s Law depends on the fraction of the original, unenhanced execution time that could make use of enhanced mode. Thus, we cannot directly use this 50% measurement to compute speedup with Amdahl’s Law.
a. [15] <1.6> What is the speedup we have obtained from fast mode?
b. [10] <1.6> What percentage of the original execution time has been converted to fast mode?

1.3 [15] <1.6> Show that the problem statements in the Examples on page 31 and page 33 are the same.

1.4 [15] <1.6> Suppose we are considering a change to an instruction set. The base machine initially has only loads and stores to memory, and all operations work on the registers. Such machines are called load-store machines (see Chapter 2). Measurements of the load-store machine showing the instruction mix and clock cycle counts per instruction are given in Figure 1.17 on page 45.
Let’s assume that 25% of the arithmetic logic unit (ALU) operations directly use a loaded operand that is not used again. We propose adding ALU instructions that have one source operand in memory. These new register-memory instructions have a clock cycle count of 2. Suppose that the extended instruction set increases the clock cycle count for branches by 1, but it does not affect the clock cycle time. (Chapter 3, on pipelining, explains why adding register-memory instructions might slow down branches.) Would this change improve CPU performance?

1.5 [15] <1.7> Assume that we have a machine that with a perfect cache behaves as given in Figure 1.17. With a cache, we have measured that instructions have a miss rate of 5%, data references have a miss rate of 10%, and the miss penalty is 40 cycles. Find the CPI for each instruction type with cache misses and determine how much faster the machine is with no cache misses versus with cache misses.

1.6 [20] <1.6> After graduating, you are asked to become the lead computer designer at Hyper Computers, Inc. Your study of usage of high-level language constructs suggests that procedure calls are one of the most expensive operations. You have invented a scheme that reduces the loads and stores normally associated with procedure calls and returns. The first thing you do is run some experiments with and without this optimization. Your experiments use the same state-of-the-art optimizing compiler that will be used with either version of the computer. These experiments reveal the following information:
- The clock rate of the unoptimized version is 5% higher.
- Thirty percent of the instructions in the unoptimized version are loads or stores.
- The optimized version executes two-thirds as many loads and stores as the unoptimized version. For all other instructions the dynamic execution counts are unchanged.
- All instructions (including load and store) take one clock cycle.
Which is faster? Justify your decision quantitatively.
1.7 [15/15/8/12] <1.6,1.8> The Whetstone benchmark contains 195,578 basic floating-point operations in a single iteration, divided as shown in Figure 1.21.

Operation | Count
Add | 82,014
Subtract | 8,229
Multiply | 73,220
Divide | 21,399
Convert integer to FP | 6,006
Compare | 4,710
Total | 195,578

FIGURE 1.21 The frequency of floating-point operations in the Whetstone benchmark.

Whetstone was run on a Sun 3/75 using the F77 compiler with optimization turned on. The Sun 3/75 is based on a Motorola 68020 running at 16.67 MHz, and it includes a floating-point coprocessor. The Sun compiler allows the floating point to be calculated with the coprocessor or using software routines, depending on compiler flags. A single iteration of Whetstone took 1.08 seconds using the coprocessor and 13.6 seconds using software. Assume that the CPI using the coprocessor was measured to be 10, while the CPI using software was measured to be 6.
a. [15] <1.6,1.8> What is the MIPS rating for both runs?
b. [15] <1.6> What is the total number of instructions executed for both runs?
c. [8] <1.6> On the average, how many integer instructions does it take to perform a floating-point operation in software?
d. [12] <1.8> What is the MFLOPS rating for the Sun 3/75 with the floating-point coprocessor running Whetstone? (Assume all the floating-point operations in Figure 1.21 count as one operation.)

1.8 [15/10/15/15/15] <1.3,1.4> This exercise estimates the complete packaged cost of a microprocessor using the die cost equation and adding in packaging and testing costs. We begin with a short description of testing cost and follow with a discussion of packaging issues.
Testing is the second term of the chip cost equation:

\[ \text{Cost of integrated circuit} = \frac{\text{Cost of die} + \text{Cost of testing die} + \text{Cost of packaging}}{\text{Final test yield}} \]

Testing costs are determined by three components:

\[ \text{Cost of testing die} = \frac{\text{Cost of testing per hour} \times \text{Average die test time}}{\text{Die yield}} \]

Since bad dies are discarded, die yield is in the denominator in the equation: the good must shoulder the costs of testing those that fail. (In practice, a bad die may take less time to test, but this effect is small, since moving the probes on the die is a mechanical process that takes a large fraction of the time.) Testing costs about $50 to $500 per hour, depending on the tester needed. High-end designs with many high-speed pins require the more expensive testers. For higher-end microprocessors testing would run $300 to $500 per hour. Die tests take about 5 to 90 seconds on average, depending on the simplicity of the die and the provisions to reduce testing time included in the chip. The cost of a package depends on the material used, the number of pins, and the die area. The cost of the material used in the package is in part determined by the ability to dissipate power generated by the die. For example, a plastic quad flat pack (PQFP) dissipating less than 1 watt, with 208 or fewer pins, and containing a die up to 1 cm on a side costs $2 in 1995. A ceramic pin grid array (PGA) can handle 300 to 600 pins and a larger die with more power, but it costs $20 to $60. In addition to the cost of the package itself is the cost of the labor to place a die in the package and then bond the pads to the pins, which adds from a few cents to a dollar or two to the cost. Some good dies are typically lost in the assembly process, thereby further reducing yield. For simplicity we assume the final test yield is 1.0; in practice it is at least 0.95.
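The two cost equations above can be written down directly. The Python sketch below is illustrative only: the function names are hypothetical, the test time is converted from seconds to hours to match the hourly tester rate, and the sample inputs (a $20 die, 50% die yield, a $300-per-hour tester, a 10-second test) are assumed values rather than data from the figures. Only the $2 PQFP package price comes from the text.

```python
def cost_of_testing_die(test_cost_per_hour, avg_test_time_secs, die_yield):
    """Testing cost per good die: good dies carry the cost of testing bad ones."""
    if not 0 < die_yield <= 1:
        raise ValueError("die yield must be in (0, 1]")
    test_time_hours = avg_test_time_secs / 3600.0
    return test_cost_per_hour * test_time_hours / die_yield


def cost_of_integrated_circuit(cost_of_die, cost_of_testing, cost_of_packaging,
                               final_test_yield=1.0):
    """Packaged part cost: (die + testing + packaging) / final test yield."""
    if not 0 < final_test_yield <= 1:
        raise ValueError("final test yield must be in (0, 1]")
    return (cost_of_die + cost_of_testing + cost_of_packaging) / final_test_yield


if __name__ == "__main__":
    # Assumed inputs for illustration only.
    testing = cost_of_testing_die(test_cost_per_hour=300.0,
                                  avg_test_time_secs=10.0,
                                  die_yield=0.5)
    total = cost_of_integrated_circuit(cost_of_die=20.0,
                                       cost_of_testing=testing,
                                       cost_of_packaging=2.0)
    print(f"testing cost per good die: ${testing:.2f}")
    print(f"cost of integrated circuit: ${total:.2f}")
```

Passing `final_test_yield=0.95` instead of the default 1.0 shows how much the simplifying assumption in the text understates the final cost.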
We also ignore the cost of the final packaged test. This exercise requires the information provided in Figure 1.22.

Microprocessor | Die area (mm²) | Pins | Technology | Estimated wafer cost ($) | Package
MIPS 4600 | 77 | 208 | CMOS, 0.6µ, 3M | 3200 | PQFP
PowerPC 603 | 85 | 240 | CMOS, 0.6µ, 4M | 3400 | PQFP
HP 71x0 | 196 | 504 | CMOS, 0.8µ, 3M | 2800 | Ceramic PGA
Digital 21064A | 166 | 431 | CMOS, 0.5µ, 4.5M | 4000 | Ceramic PGA
SuperSPARC/60 | 256 | 293 | BiCMOS, 0.6µ, 3.5M | 4000 | Ceramic PGA

FIGURE 1.22 Characteristics of microprocessors. The technology entry is the process type, line width, and number of interconnect levels.

a. [15] <1.4> For each of the microprocessors in Figure 1.22, compute the number of good chips you would get per 20-cm wafer using the model on page 12. Assume a defect density of one defect per cm², a wafer yield of 95%, and assume α = 3.
b. [10] <1.4> For each microprocessor in Figure 1.22, compute the cost per projected good die before packaging and testing. Use the number of good dies per wafer from part (a) of this exercise and the wafer cost from Figure 1.22.
c. [15] <1.3> Both package cost and test cost are proportional to pin count. Using the additional assumptions shown in Figure 1.23, compute the cost per good, tested, and packaged part using the costs per good die from part (b) of this exercise.
d. [15] <1.3> There are wide differences in defect densities between semiconductor manufacturers. Find the costs for the largest processor in Figure 1.22 (total cost including packaging), assuming defect densities are 0.6 per cm² and assuming that defect densities are 1.2 per cm².

Package type | Pin count | Package cost ($) | Test time (secs) | Test cost per hour ($)
PQFP | <220 | 12 | 10 | 300
PQFP | <300 | 20 | 10 | 320
Ceramic PGA | <300 | 30 | 10 | 320
Ceramic PGA | <400 | 40 | 12 | 340
Ceramic PGA | <450 | 50 | 13 | 360
Ceramic PGA | <500 | 60 | 14 | 380
Ceramic PGA | >500 | 70 | 15 | 400

FIGURE 1.23 Package and test characteristics.

e. [15] <1.3> The parameter α depends on the complexity of the process.
Additional metal levels result in increased complexity. For example, α might be approximated by the number of interconnect levels. For the Digital 21064a with 4.5 levels of interconnect, estimate the cost of working, packaged, and tested die if α = 3 and if α = 4.5. Assume a defect density of 0.8 defects per cm2. 1.9 [12] <1.5> One reason people may incorrectly average rates with an arithmetic mean is that it always gives an answer greater than or equal to the geometric mean. Show that for any two positive integers, a and b, the arithmetic mean is always greater than or equal to the geometric mean. When are the two equal? 1.10 [12] <1.5> For reasons similar to those in Exercise 1.9, some people use arithmetic instead of the harmonic mean. Show that for any two positive rates, r and s, the arithmetic mean is always greater than or equal to the harmonic mean. When are the two equal? 1.11 [15/15] <1.5> Some of the SPECfp92 performance results from the SPEC92 Newsletter of June 1994 [SPEC 94] are shown in Figure 1.24. The SPECratio is simply the runtime for a benchmark divided into the VAX 11/780 time for that benchmark. The SPECfp92 number is computed as the geometric mean of the SPECratios. Let’s see how a weighted arithmetic mean compares. a. [15] <1.5> Calculate the weights for a workload so that running times on the VAX11/780 will be equal for each of the 14 benchmarks (given in Figure 1.24). b. [15] <1.5> Using the weights computed in part (a) of this exercise, calculate the weighted arithmetic means of the execution times of the 14 programs in Figure 1.24. 1.12 [15/15/15] <1.6,1.8> Three enhancements with the following speedups are proposed for a new architecture: Speedup1 = 30 Speedup2 =20 Speedup3 = 10 Only one enhancement is usable at a time. 
Program name     VAX-11/780 Time   DEC 3000 Model 800 SPECratio   IBM Powerstation 590 SPECratio   Intel Xpress Pentium 815\100 SPECratio
spice2g6         23,944            97                             128                              64
doduc            1,860             137                            150                              84
mdljdp2          7,084             154                            206                              98
wave5            3,690             123                            151                              57
tomcatv          2,650             221                            465                              74
ora              7,421             165                            181                              97
alvinn           7,690             385                            739                              157
ear              25,499            617                            546                              215
mdljsp2          3,350             76                             96                               48
swm256           12,696            137                            244                              43
su2cor           12,898            259                            459                              57
hydro2d          13,697            210                            225                              83
nasa7            16,800            265                            344                              61
fpppp            9,202             202                            303                              119
Geometric mean   8,098             187                            256                              81

FIGURE 1.24 SPEC92 performance for SPECfp92. The DEC 3000 uses a 200-MHz Alpha microprocessor (21064) and a 2-MB off-chip cache. The IBM Powerstation 590 uses a 66.67-MHz Power-2. The Intel Xpress uses a 100-MHz Pentium with a 512-KB off-chip secondary cache. Data from SPEC [1994].

a. [15] <1.6> If enhancements 1 and 2 are each usable for 30% of the time, what fraction of the time must enhancement 3 be used to achieve an overall speedup of 10?

b. [15] <1.6,1.8> Assume the distribution of enhancement usage is 30%, 30%, and 20% for enhancements 1, 2, and 3, respectively. Assuming all three enhancements are in use, for what fraction of the reduced execution time is no enhancement in use?

c. [15] <1.6> Assume for some benchmark, the fraction of use is 15% for each of enhancements 1 and 2 and 70% for enhancement 3. We want to maximize performance. If only one enhancement can be implemented, which should it be? If two enhancements can be implemented, which should be chosen?

1.13 [15/10/10/12/10] <1.6,1.8> Your company has a benchmark that is considered representative of your typical applications. One of the older-model workstations does not have a floating-point unit and must emulate each floating-point instruction by a sequence of integer instructions. This older-model workstation is rated at 120 MIPS on this benchmark. A third-party vendor offers an attached processor that is intended to give a “mid-life kicker” to your workstation.
That attached processor executes each floating-point instruction on a dedicated processor (i.e., no emulation is necessary). The workstation/attached processor rates 80 MIPS on the same benchmark. The following symbols are used to answer parts (a)-(e) of this exercise.

I: Number of integer instructions executed on the benchmark.
F: Number of floating-point instructions executed on the benchmark.
Y: Number of integer instructions to emulate a floating-point instruction.
W: Time to execute the benchmark on the workstation alone.
B: Time to execute the benchmark on the workstation/attached processor combination.

a. [15] <1.6,1.8> Write an equation for the MIPS rating of each configuration using the symbols above. Document your equation.

b. [10] <1.6> For the configuration without the coprocessor, we measure that F = 8 × 10^6, Y = 50, and W = 4. Find I.

c. [10] <1.6> What is the value of B?

d. [12] <1.6,1.8> What is the MFLOPS rating of the system with the attached processor board?

e. [10] <1.6,1.8> Your colleague wants to purchase the attached processor board even though the MIPS rating for the configuration using the board is less than that of the workstation alone. Is your colleague’s evaluation correct? Defend your answer.

1.14 [15/15/10] <1.5,1.8> Assume the two programs in Figure 1.11 on page 24 each execute 100 million floating-point operations during execution.

a. [15] <1.5,1.8> Calculate the MFLOPS rating of each program.

b. [15] <1.5,1.8> Calculate the arithmetic, geometric, and harmonic means of MFLOPS for each machine.

c. [10] <1.5,1.8> Which of the three means matches the relative performance of total execution time?

1.15 [10/12] <1.8,1.6> One problem cited with MFLOPS as a measure is that not all FLOPS are created equal. To overcome this problem, normalized or weighted MFLOPS measures were developed.
Figure 1.25 shows how the authors of the “Livermore Loops” benchmark calculate the number of normalized floating-point operations per program according to the operations actually found in the source code. Thus, the native MFLOPS rating is not the same as the normalized MFLOPS rating reported in the supercomputer literature, which has come as a surprise to a few computer designers.

Real FP operations                Normalized FP operations
Add, Subtract, Compare, Multiply  1
Divide, Square root               4
Functions (Exp, Sin, ...)         8

FIGURE 1.25 Real versus normalized floating-point operations. The number of normalized floating-point operations per real operation in a program used by the authors of the Livermore FORTRAN Kernels, or “Livermore Loops,” to calculate MFLOPS. A kernel with one Add, one Divide, and one Sin would be credited with 13 normalized floating-point operations. Native MFLOPS won’t give the results reported for other machines on that benchmark.

Let’s examine the effects of this weighted MFLOPS measure. The spice program runs on the DECstation 3100 in 94 seconds. The numbers of floating-point operations executed in that program are listed in Figure 1.26.

Floating-point operation   Times executed
addD                       25,999,440
subD                       18,266,439
mulD                       33,880,810
divD                       15,682,333
compareD                   9,745,930
negD                       2,617,846
absD                       2,195,930
convertD                   1,581,450
Total                      109,970,178

FIGURE 1.26 Floating-point operations in spice.

a. [10] <1.8,1.6> What is the native MFLOPS for spice on a DECstation 3100?

b. [12] <1.8,1.6> Using the conversions in Figure 1.25, what is the normalized MFLOPS?

1.16 [30] <1.5,1.8> Devise a program in C that gets the peak MIPS rating for a computer. Run it on two machines to calculate the peak MIPS. Now run the SPEC92 gcc on both machines. How well do peak MIPS predict performance of gcc?

1.17 [30] <1.5,1.8> Devise a program in C or FORTRAN that gets the peak MFLOPS rating for a computer. Run it on two machines to calculate the peak MFLOPS.
Now run the SPEC92 benchmark spice on both machines. How well do peak MFLOPS predict performance of spice?

1.18 [Discussion] <1.5> What is an interpretation of the geometric means of execution times? What do you think are the advantages and disadvantages of using total execution times versus weighted arithmetic means of execution times using equal running time on the VAX-11/780 versus geometric means of ratios of speed to the VAX-11/780?

2 Instruction Set Principles and Examples

A n   Add the number in storage location n into the accumulator.
E n   If the number in the accumulator is greater than or equal to zero execute next the order which stands in storage location n; otherwise proceed serially.
Z     Stop the machine and ring the warning bell.

Wilkes and Renwick
Selection from the List of 18 Machine Instructions for the EDSAC (1949)

2.1 Introduction 69
2.2 Classifying Instruction Set Architectures 70
2.3 Memory Addressing 73
2.4 Operations in the Instruction Set 80
2.5 Type and Size of Operands 85
2.6 Encoding an Instruction Set 87
2.7 Crosscutting Issues: The Role of Compilers 89
2.8 Putting It All Together: The DLX Architecture 96
2.9 Fallacies and Pitfalls 108
2.10 Concluding Remarks 111
2.11 Historical Perspective and References 112
Exercises 118

2.1 Introduction

In this chapter we concentrate on instruction set architecture: the portion of the machine visible to the programmer or compiler writer. This chapter introduces the wide variety of design alternatives available to the instruction set architect. In particular, this chapter focuses on four topics. First, we present a taxonomy of instruction set alternatives and give some qualitative assessment of the advantages and disadvantages of various approaches. Second, we present and analyze some instruction set measurements that are largely independent of a specific instruction set. Third, we address the issue of languages and compilers and their bearing on instruction set architecture.
Finally, the Putting It All Together section shows how these ideas are reflected in the DLX instruction set, which is typical of recent instruction set architectures. The appendices add four examples of these recent architectures (MIPS, PowerPC, Precision Architecture, SPARC) and one older architecture, the 80x86.

Before we discuss how to classify architectures, we need to say something about instruction set measurement. Throughout this chapter, we examine a wide variety of architectural measurements. These measurements depend on the programs measured and on the compilers used in making the measurements. The results should not be interpreted as absolute, and you might see different data if you did the measurement with a different compiler or a different set of programs. The authors believe that the measurements shown in these chapters are reasonably indicative of a class of typical applications. Many of the measurements are presented using a small set of benchmarks, so that the data can be reasonably displayed and the differences among programs can be seen. An architect for a new machine would want to analyze a much larger collection of programs to make his architectural decisions. All the measurements shown are dynamic; that is, the frequency of a measured event is weighted by the number of times that event occurs during execution of the measured program.

We begin by exploring how instruction set architectures can be classified and analyzed.

2.2 Classifying Instruction Set Architectures

The type of internal storage in the CPU is the most basic differentiation, so in this section we will focus on the alternatives for this portion of the architecture. The major choices are a stack, an accumulator, or a set of registers.
Operands may be named explicitly or implicitly: The operands in a stack architecture are implicitly on the top of the stack, in an accumulator architecture one operand is implicitly the accumulator, and general-purpose register architectures have only explicit operands, either registers or memory locations. The explicit operands may be accessed directly from memory or may need to be first loaded into temporary storage, depending on the class of instruction and choice of specific instruction.

Figure 2.1 shows how the code sequence C = A + B would typically appear on these three classes of instruction sets. As Figure 2.1 shows, there are really two classes of register machines. One can access memory as part of any instruction, called register-memory architecture, and one can access memory only with load and store instructions, called load-store or register-register architecture. A third class, not found in machines shipping today, keeps all operands in memory and is called a memory-memory architecture.

Stack     Accumulator   Register (register-memory)   Register (load-store)
Push A    Load A        Load R1,A                    Load R1,A
Push B    Add B         Add R1,B                     Load R2,B
Add       Store C       Store C,R1                   Add R3,R1,R2
Pop C                                                Store C,R3

FIGURE 2.1 The code sequence for C = A + B for four instruction sets. It is assumed that A, B, and C all belong in memory and that the values of A and B cannot be destroyed.

Although most early machines used stack or accumulator-style architectures, virtually every machine designed after 1980 uses a load-store register architecture. The major reasons for the emergence of general-purpose register (GPR) machines are twofold. First, registers, like other forms of storage internal to the CPU, are faster than memory. Second, registers are easier for a compiler to use and can be used more effectively than other forms of internal storage.
For example, on a register machine the expression (A*B) - (C*D) - (E*F) may be evaluated by doing the multiplications in any order, which may be more efficient because of the location of the operands or because of pipelining concerns (see Chapter 3). But on a stack machine the expression must be evaluated left to right, unless special operations or swaps of stack positions are done.

More importantly, registers can be used to hold variables. When variables are allocated to registers, the memory traffic reduces, the program speeds up (since registers are faster than memory), and the code density improves (since a register can be named with fewer bits than can a memory location). Compiler writers would prefer that all registers be equivalent and unreserved. Older machines compromise this desire by dedicating registers to special uses, effectively decreasing the number of general-purpose registers. If the number of truly general-purpose registers is too small, trying to allocate variables to registers will not be profitable. Instead, the compiler will reserve all the uncommitted registers for use in expression evaluation.

How many registers are sufficient? The answer of course depends on how they are used by the compiler. Most compilers reserve some registers for expression evaluation, use some for parameter passing, and allow the remainder to be allocated to hold variables.

Two major instruction set characteristics divide GPR architectures. Both characteristics concern the nature of operands for a typical arithmetic or logical instruction (ALU instruction). The first concerns whether an ALU instruction has two or three operands. In the three-operand format, the instruction contains a result and two source operands. In the two-operand format, one of the operands is both a source and a result for the operation. The second distinction among GPR architectures concerns how many of the operands may be memory addresses in ALU instructions.
The number of memory operands supported by a typical ALU instruction may vary from none to three. Combinations of these two attributes are shown in Figure 2.2, with examples of machines. Although there are seven possible combinations, three serve to classify nearly all existing machines. As we mentioned earlier, these three are register-register (also called load-store), register-memory, and memory-memory.

Number of memory addresses   Maximum number of operands allowed   Examples
0                            3                                    SPARC, MIPS, Precision Architecture, PowerPC, ALPHA
1                            2                                    Intel 80x86, Motorola 68000
2                            2                                    VAX (also has three-operand formats)
3                            3                                    VAX (also has two-operand formats)

FIGURE 2.2 Possible combinations of memory operands and total operands per typical ALU instruction with examples of machines. Machines with no memory reference per ALU instruction are called load-store or register-register machines. Instructions with multiple memory operands per typical ALU instruction are called register-memory or memory-memory, according to whether they have one or more than one memory operand.

The advantages and disadvantages of each of these alternatives are shown in Figure 2.3. Of course, these advantages and disadvantages are not absolutes: They are qualitative and their actual impact depends on the compiler and implementation strategy. A GPR machine with memory-memory operations can easily be subsetted by the compiler and used as a register-register machine. One of the most pervasive architectural impacts is on instruction encoding and the number of instructions needed to perform a task. We will see the impact of these architectural alternatives on implementation approaches in Chapters 3 and 4.

Register-register (0,3)
  Advantages: Simple, fixed-length instruction encoding. Simple code-generation model. Instructions take similar numbers of clocks to execute (see Ch 3).
  Disadvantages: Higher instruction count than architectures with memory references in instructions. Some instructions are short and bit encoding may be wasteful.

Register-memory (1,2)
  Advantages: Data can be accessed without loading first. Instruction format tends to be easy to encode and yields good density.
  Disadvantages: Operands are not equivalent since a source operand in a binary operation is destroyed. Encoding a register number and a memory address in each instruction may restrict the number of registers. Clocks per instruction varies by operand location.

Memory-memory (3,3)
  Advantages: Most compact. Doesn’t waste registers for temporaries.
  Disadvantages: Large variation in instruction size, especially for three-operand instructions. Also, large variation in work per instruction. Memory accesses create memory bottleneck.

FIGURE 2.3 Advantages and disadvantages of the three most common types of general-purpose register machines. The notation (m, n) means m memory operands and n total operands. In general, machines with fewer alternatives make the compiler’s task simpler since there are fewer decisions for the compiler to make. Machines with a wide variety of flexible instruction formats reduce the number of bits required to encode the program. A machine that uses a small number of bits to encode the program is said to have good instruction density: a smaller number of bits do as much work as a larger number on a different architecture. The number of registers also affects the instruction size.

Summary: Classifying Instruction Set Architectures

Here and in subsections at the end of sections 2.3 to 2.7 we summarize those characteristics we would expect to find in a new instruction set architecture, building the foundation for the DLX architecture introduced in section 2.8. From this section we should clearly expect the use of general-purpose registers. Figure 2.3, combined with the following chapter on pipelining, leads to the expectation of a register-register (also called load-store) architecture.
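The four code sequences for C = A + B in Figure 2.1 can be made concrete with a toy interpreter. The following is only a sketch: the opcode names (add_s, load_a, add_rm, and so on) and the use of Python dictionaries for memory and registers are assumptions made for illustration, not part of any real instruction set.

```python
# Hypothetical opcode names; each program mirrors one column of Figure 2.1.
STACK = [("push", "A"), ("push", "B"), ("add_s",), ("pop", "C")]
ACCUMULATOR = [("load_a", "A"), ("add_a", "B"), ("store_a", "C")]
REGISTER_MEMORY = [("load", "R1", "A"), ("add_rm", "R1", "B"), ("store", "C", "R1")]
LOAD_STORE = [("load", "R1", "A"), ("load", "R2", "B"),
              ("add_rr", "R3", "R1", "R2"), ("store", "C", "R3")]


def run(program, mem):
    """Execute a program on `mem`; return (instruction count, data memory references)."""
    regs, stack, acc, refs = {}, [], 0, 0
    for op, *a in program:
        if op == "push":            # stack: memory -> top of stack
            stack.append(mem[a[0]]); refs += 1
        elif op == "pop":           # stack: top of stack -> memory
            mem[a[0]] = stack.pop(); refs += 1
        elif op == "add_s":         # stack: both operands are implicit
            stack.append(stack.pop() + stack.pop())
        elif op == "load_a":        # accumulator: memory -> accumulator
            acc = mem[a[0]]; refs += 1
        elif op == "add_a":         # accumulator: one implicit, one memory operand
            acc += mem[a[0]]; refs += 1
        elif op == "store_a":       # accumulator -> memory
            mem[a[0]] = acc; refs += 1
        elif op == "load":          # register machines: memory -> register
            regs[a[0]] = mem[a[1]]; refs += 1
        elif op == "add_rm":        # register-memory: two operands, one in memory
            regs[a[0]] += mem[a[1]]; refs += 1
        elif op == "add_rr":        # load-store: three register operands
            regs[a[0]] = regs[a[1]] + regs[a[2]]
        elif op == "store":         # register -> memory
            mem[a[0]] = regs[a[1]]; refs += 1
        else:
            raise ValueError(f"unknown opcode {op}")
    return len(program), refs


PROGRAMS = {"stack": STACK, "accumulator": ACCUMULATOR,
            "register-memory": REGISTER_MEMORY, "load-store": LOAD_STORE}

for name, program in PROGRAMS.items():
    mem = {"A": 2, "B": 5, "C": 0}
    count, refs = run(program, mem)
    print(f"{name:16s} instructions={count} data memory refs={refs} C={mem['C']}")
```

Under these assumptions every sequence makes three data-memory references and leaves A and B intact; the stack and load-store versions take four instructions and the other two take three. The difference between the classes therefore lies in how operands are named and encoded, which is the point Figure 2.3 makes qualitatively.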
With the class of architecture covered, the next topic is addressing operands.

2.3 Memory Addressing

Independent of whether the architecture is register-register or allows any operand to be a memory reference, it must define how memory addresses are interpreted and how they are specified. We deal with these two topics in this section. The measurements presented here are largely, but not completely, machine independent. In some cases the measurements are significantly affected by the compiler technology. These measurements have been made using an optimizing compiler, since compiler technology is playing an increasing role.

Interpreting Memory Addresses

How is a memory address interpreted? That is, what object is accessed as a function of the address and the length? All the instruction sets discussed in this book are byte addressed and provide access for bytes (8 bits), half words (16 bits), and words (32 bits). Most of the machines also provide access for double words (64 bits).

There are two different conventions for ordering the bytes within a word. Little Endian byte order puts the byte whose address is “x...x00” at the least-significant position in the word (the little end). Big Endian byte order puts the byte whose address is “x...x00” at the most-significant position in the word (the big end). In Big Endian addressing, the address of a datum is the address of the most-significant byte; while in Little Endian, the address of a datum is the address of the least-significant byte. When operating within one machine, the byte order is often unnoticeable; only programs that access the same locations as both words and bytes can notice the difference. Byte order is a problem when exchanging data among machines with different orderings, however. Little Endian ordering also fails to match normal ordering of words when strings are compared. Strings appear “SDRAWKCAB” in the registers.

In many machines, accesses to objects larger than a byte must be aligned.
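The two byte-order conventions described above can be made concrete with a short sketch. It assumes 32-bit words and uses Python's built-in int.to_bytes and int.from_bytes; the helper names store_word and load_word are made up for illustration.

```python
def store_word(value, byteorder):
    """Return the 4 bytes of a 32-bit word in increasing address order."""
    return value.to_bytes(4, byteorder)


def load_word(memory_bytes, byteorder):
    """Interpret 4 bytes (in increasing address order) as a 32-bit word."""
    return int.from_bytes(memory_bytes, byteorder)


WORD = 0x0A0B0C0D
little = store_word(WORD, "little")
big = store_word(WORD, "big")

# The byte at the word's own address (offset 0) differs between conventions.
print(hex(little[0]))   # least-significant byte of WORD
print(hex(big[0]))      # most-significant byte of WORD

# Exchanging data between machines with different orderings scrambles a word.
print(hex(load_word(little, "big")))

# A string loaded as a Little Endian word looks reversed when the register
# is written out most-significant byte first.
register = load_word(b"BACK", "little")
print(register.to_bytes(4, "big"))
```

Within one convention, storing and then loading a word is lossless, which is why byte order is usually invisible on a single machine; the mismatch only shows when the same bytes are reinterpreted under the other convention or viewed a byte at a time.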
An access to an object of size s bytes at byte address A is aligned if A mod s = 0. Figure 2.4 shows the addresses at which an access is aligned or misaligned.

Object addressed   Aligned at byte offsets   Misaligned at byte offsets
Byte               0,1,2,3,4,5,6,7           Never
Half word          0,2,4,6                   1,3,5,7
Word               0,4                       1,2,3,5,6,7
Double word        0                         1,2,3,4,5,6,7

FIGURE 2.4 Aligned and misaligned accesses of objects. The byte offsets are specified for the low-order three bits of the address.

Why would someone design a machine with alignment restrictions? Misalignment causes hardware complications, since the memory is typically aligned on a word or double-word boundary. A misaligned memory access will, therefore, take multiple aligned memory references. Thus, even in machines that allow misaligned access, programs with aligned accesses run faster.

Even if data are aligned, supporting byte and half-word accesses requires an alignment network to align bytes and half words in registers. Depending on the instruction, the machine may also need to sign-extend the quantity. On some machines a byte or half word does not affect the upper portion of a register. For stores only the affected bytes in memory may be altered. (Although all the machines discussed in this book permit byte and half-word accesses to memory, only the Intel 80x86 supports ALU operations on register operands with a size shorter than a word.)

Addressing Modes

We now know what bytes to access in memory given an address. In this subsection we will look at addressing modes: how architectures specify the address of an object they will access. In GPR machines, an addressing mode can specify a constant, a register, or a location in memory. When a memory location is used, the actual memory address specified by the addressing mode is called the effective address. Figure 2.5 shows all the data-addressing modes that have been used in recent machines.
Immediates or literals are usually considered memory-addressing modes (even though the value they access is in the instruction stream), although registers are often separated. We have kept addressing modes that depend on the program counter, called PC-relative addressing, separate. PC-relative addressing is used primarily for specifying code addresses in control transfer instructions. The use of PC-relative addressing in control instructions is discussed in section 2.4.

Figure 2.5 shows the most common names for the addressing modes, though the names differ among architectures. In this figure and throughout the book, we will use an extension of the C programming language as a hardware description notation. In this figure, only one non-C feature is used: The left arrow (←) is used for assignment. We also use the array Mem as the name for main memory and the array Regs for registers. Thus, Mem[Regs[R1]] refers to the contents of the memory location whose address is given by the contents of register 1 (R1). Later, we will introduce extensions for accessing and transferring data smaller than a word.

Register
  Example instruction: Add R4,R3
  Meaning: Regs[R4] ← Regs[R4] + Regs[R3]
  When used: When a value is in a register.

Immediate
  Example instruction: Add R4,#3
  Meaning: Regs[R4] ← Regs[R4] + 3
  When used: For constants.

Displacement
  Example instruction: Add R4,100(R1)
  Meaning: Regs[R4] ← Regs[R4] + Mem[100 + Regs[R1]]
  When used: Accessing local variables.

Register deferred or indirect
  Example instruction: Add R4,(R1)
  Meaning: Regs[R4] ← Regs[R4] + Mem[Regs[R1]]
  When used: Accessing using a pointer or a computed address.

Indexed
  Example instruction: Add R3,(R1 + R2)
  Meaning: Regs[R3] ← Regs[R3] + Mem[Regs[R1] + Regs[R2]]
  When used: Sometimes useful in array addressing: R1 = base of array; R2 = index amount.

Direct or absolute
  Example instruction: Add R1,(1001)
  Meaning: Regs[R1] ← Regs[R1] + Mem[1001]
  When used: Sometimes useful for accessing static data; address constant may need to be large.

Memory indirect or memory deferred
  Example instruction: Add R1,@(R3)
  Meaning: Regs[R1] ← Regs[R1] + Mem[Mem[Regs[R3]]]
  When used: If R3 is the address of a pointer p, then mode yields *p.

Autoincrement
  Example instruction: Add R1,(R2)+
  Meaning: Regs[R1] ← Regs[R1] + Mem[Regs[R2]]; Regs[R2] ← Regs[R2] + d
  When used: Useful for stepping through arrays within a loop. R2 points to start of array; each reference increments R2 by size of an element, d.

Autodecrement
  Example instruction: Add R1,-(R2)
  Meaning: Regs[R2] ← Regs[R2] - d; Regs[R1] ← Regs[R1] + Mem[Regs[R2]]
  When used: Same use as autoincrement. Autodecrement/increment can also act as push/pop to implement a stack.

Scaled
  Example instruction: Add R1,100(R2)[R3]
  Meaning: Regs[R1] ← Regs[R1] + Mem[100 + Regs[R2] + Regs[R3]*d]
  When used: Used to index arrays. May be applied to any indexed addressing mode in some machines.

FIGURE 2.5 Selection of addressing modes with examples, meaning, and usage. The extensions to C used in the hardware descriptions are defined above. In autoincrement/decrement and scaled addressing modes, the variable d designates the size of the data item being accessed (i.e., whether the instruction is accessing 1, 2, 4, or 8 bytes); this means that these addressing modes are only useful when the elements being accessed are adjacent in memory. In our measurements, we use the first name shown for each mode.

Addressing modes have the ability to significantly reduce instruction counts; they also add to the complexity of building a machine and may increase the average CPI (clock cycles per instruction) of machines that implement those modes. Thus, the usage of various addressing modes is quite important in helping the architect choose what to include.

Figure 2.6 shows the results of measuring addressing mode usage patterns in three programs on the VAX architecture. We use the VAX architecture for a few measurements in this chapter because it has the fewest restrictions on memory addressing. For example, it supports all the modes shown in Figure 2.5. Most measurements in this chapter, however, will use the more recent load-store architectures to show how programs use instruction sets of current machines.
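The operand calculations listed in Figure 2.5 translate almost directly into executable form. The sketch below models Regs and Mem as Python dictionaries; the function name operand and its keyword parameters (r for the base register, x for the index register, c for the constant in the instruction, d for the element size in bytes) are assumptions chosen for illustration rather than anything defined by a particular machine.

```python
def operand(mode, regs, mem, r=None, x=None, c=0, d=4):
    """Return the source operand selected by one addressing mode of Figure 2.5.

    Autoincrement and autodecrement also update the base register as a side effect.
    """
    if mode == "register":
        return regs[r]
    if mode == "immediate":
        return c
    if mode == "displacement":
        return mem[c + regs[r]]
    if mode == "register deferred":
        return mem[regs[r]]
    if mode == "indexed":
        return mem[regs[r] + regs[x]]
    if mode == "direct":
        return mem[c]
    if mode == "memory indirect":
        return mem[mem[regs[r]]]
    if mode == "autoincrement":
        value = mem[regs[r]]      # use the address first ...
        regs[r] += d              # ... then step to the next element
        return value
    if mode == "autodecrement":
        regs[r] -= d              # step back first ...
        return mem[regs[r]]       # ... then use the new address
    if mode == "scaled":
        return mem[c + regs[r] + regs[x] * d]
    raise ValueError(f"unknown addressing mode {mode}")


# Made-up memory image: a three-element word array at address 200,
# and a pointer to that array stored at address 300.
mem = {200: 10, 204: 20, 208: 30, 300: 200}

# Summing the array with autoincrement: R2 walks the array, R1 accumulates.
regs = {"R1": 0, "R2": 200}
for _ in range(3):
    regs["R1"] += operand("autoincrement", regs, mem, r="R2", d=4)
print(regs)
```

The loop shows why autoincrement suits stepping through arrays: one mode both fetches the element and advances the pointer, work that would otherwise need a separate add instruction on a machine with only displacement addressing.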
As Figure 2.6 shows, immediate and displacement addressing dominate addressing mode usage. Let’s look at some properties of these two heavily used modes.

Frequency of the addressing mode:

Addressing mode     TeX    spice   gcc
Memory indirect     1%     6%      1%
Scaled              0%     16%     6%
Register deferred   24%    3%      11%
Immediate           43%    17%     39%
Displacement        32%    55%     40%

FIGURE 2.6 Summary of use of memory addressing modes (including immediates). The data were taken on a VAX using three programs from SPEC89. Only the addressing modes with an average frequency of over 1% are shown. The PC-relative addressing modes, which are used almost exclusively for branches, are not included. Displacement mode includes all displacement lengths (8, 16, and 32 bit). Register modes, which are not counted, account for one-half of the operand references, while memory addressing modes (including immediate) account for the other half. The memory indirect mode on the VAX can use displacement, autoincrement, or autodecrement to form the initial memory address; in these programs, almost all the memory indirect references use displacement mode as the base. Of course, the compiler affects what addressing modes are used; we discuss this further in section 2.7. These major addressing modes account for all but a few percent (0% to 3%) of the memory accesses.

Displacement Addressing Mode

The major question that arises for a displacement-style addressing mode is that of the range of displacements used. Based on the use of various displacement sizes, a decision of what sizes to support can be made. Choosing the displacement field sizes is important because they directly affect the instruction length. Measurements taken on the data access on a load-store architecture using our benchmark programs are shown in Figure 2.7.
We will look at branch offsets in the next section; data accessing patterns and branches are so different that little is gained by combining them.

[Plot: percentage of displacements (y axis, 0% to 30%) against the number of bits needed for a displacement value (x axis, 0 to 15), with one series for the integer average and one for the floating-point average.]

FIGURE 2.7 Displacement values are widely distributed. The x axis is log2 of the displacement; that is, the size of a field needed to represent the magnitude of the displacement. These data were taken on the MIPS architecture, showing the average of five programs from SPECint92 (compress, espresso, eqntott, gcc, li) and the average of five programs from SPECfp92 (doduc, ear, hydro2d, mdljdp2, su2cor). Although there are a large number of small values in this data, there are also a fair number of large values. The wide distribution of displacement values is due to multiple storage areas for variables and different displacements used to access them. The different storage areas and their access patterns are discussed further in section 2.7. The graph shows only the magnitude of the displacement and not the sign, which is heavily affected by the storage layout. The entry corresponding to 0 on the x axis shows the percentage of displacements of value 0. The vast majority of the displacements are positive, but a majority of the largest displacements (14+ bits) are negative. Again, this is due to the overall addressing scheme used by the compiler and might change with a different compilation scheme. Since this data was collected on a machine with 16-bit displacements, it cannot tell us anything about accesses that might want to use a longer displacement. Such accesses are broken into two separate instructions, the first of which loads the upper 16 bits of a base register.
By counting the frequency of these “load high immediate” instructions, which have limited use for other purposes, we can bound the number of accesses with displacements potentially larger than 16 bits. Such an analysis indicates that we may actually require a displacement longer than 16 bits for about 1% of immediates on SPECint92 and 1% of those for SPECfp92. Relating this data to the graph above, if it were widened to 32 bits we would see 1% of immediates collectively between sizes 16 and 31 for both SPECint92 and SPECfp92. And if the displacement is larger than 15 bits, it is likely to be quite a bit larger since such constants are large, as shown in Figure 2.9 on page 79. To evaluate the choice of displacement length, we might also want to examine a cumulative distribution, as shown in Exercise 2.1 (see Figure 2.32 on page 119). In summary, 12 bits of displacement would capture about 75% of the full 32-bit displacements and 16 bits should capture about 99%.

Immediate or Literal Addressing Mode

Immediates can be used in arithmetic operations, in comparisons (primarily for branches), and in moves where a constant is wanted in a register. The last case occurs for constants written in the code, which tend to be small, and for address constants, which can be large. For the use of immediates it is important to know whether they need to be supported for all operations or for only a subset. The chart in Figure 2.8 shows the frequency of immediates for the general classes of integer operations in an instruction set.

[Bar chart: percentage of operations that use immediates (0% to 100%) for loads, compares, ALU operations, and all instructions, with one bar each for the integer average and the floating-point average.]

FIGURE 2.8 We see that for ALU operations about one-half to three-quarters of the operations have an immediate operand, while 75% to 85% of compare operations use an immediate operand.
(For ALU operations, shifts by a constant amount are included as operations with immediate operands.) For loads, the load immediate instructions load 16 bits into either half of a 32-bit register. These load immediates are not loads in a strict sense because they do not reference memory. In some cases, a pair of load immediates may be used to load a 32-bit constant, but this is rare. The compares include comparisons against zero that are done in conditional branches based on this comparison. These measurements were taken on the DLX architecture with full compiler optimization (see section 2.7). The compiler attempts to use simple compares against zero for branches whenever possible, because these branches are efficiently supported in the architecture. Note that the bottom bars show that integer programs use immediates in about one-third of the instructions, while floating-point programs use immediates in about one-tenth of the instructions. Floating-point programs have many data transfers and operations on floating-point data that do not have immediate forms in the DLX instruction set. (These percentages are the averages of the same 10 programs as in Figure 2.7 on page 77.)

Another important instruction set measurement is the range of values for immediates. Like displacement values, the sizes of immediate values affect instruction lengths. As Figure 2.9 shows, immediate values that are small are most heavily used. Large immediates are sometimes used, however, most likely in addressing calculations. The data in Figure 2.9 were taken on a VAX because, unlike recent load-store architectures, it supports 32-bit long immediates. For these measurements the VAX has the drawback that many of its instructions have zero as an implicit operand. These include instructions to compare against zero and to store zero into a word.
Because of the use of these instructions, the measurements show less frequent use of zero than on architectures without such instructions.

[Figure 2.9: line chart with one curve each for gcc, TeX, and spice. x axis: number of bits needed for an immediate value (0 to 32); y axis: 0% to 60%.]

FIGURE 2.9 The distribution of immediate values is shown. The x axis shows the number of bits needed to represent the magnitude of an immediate value; 0 means the immediate field value was 0. The vast majority of the immediate values are positive: overall, less than 6% of the immediates are negative. These measurements were taken on a VAX, which supports a full range of immediates and sizes as operands to any instruction. The measured programs are gcc, spice, and TeX. Note that 50% to 70% of the immediates fit within 8 bits and 75% to 80% fit within 16 bits.

Summary: Memory Addressing

First, because of their popularity, we would expect a new architecture to support at least the following addressing modes: displacement, immediate, and register deferred. Figure 2.6 on page 76 shows they represent 75% to 99% of the addressing modes used in our measurements.

Second, we would expect the size of the address for displacement mode to be at least 12 to 16 bits, since the caption in Figure 2.7 on page 77 suggests these sizes would capture 75% to 99% of the displacements.

Third, we would expect the size of the immediate field to be at least 8 to 16 bits. As the caption in Figure 2.9 suggests, these sizes would capture 50% to 80% of the immediates.
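As a rough illustration of how such field-size estimates could be checked against a list of constants, the following Python sketch computes the number of bits needed for the magnitude of an immediate (the quantity on the x axis of Figure 2.9) and the fraction of a sample that fits in a given field width. The function names and the sample values are hypothetical; they are not measurements from the text.

```python
def bits_needed(value: int) -> int:
    """Bits needed to represent the magnitude of value; 0 for the value 0."""
    return abs(value).bit_length()


def coverage(immediates, field_bits: int) -> float:
    """Fraction of immediates whose magnitude fits in field_bits bits."""
    if not immediates:
        return 0.0
    fits = sum(1 for v in immediates if bits_needed(v) <= field_bits)
    return fits / len(immediates)


# Made-up sample of immediate values, for illustration only.
sample = [0, 1, 4, 255, 256, -3, 70000, 1 << 20]

for width in (8, 16, 32):
    print(f"{width:2d}-bit field covers {coverage(sample, width):.1%} of the sample")
```

Running the same two functions over the immediates extracted from a real instruction trace would give a cumulative distribution comparable to the one the summary relies on.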
Operator type            Examples
Arithmetic and logical   Integer arithmetic and logical operations: add, and, subtract, or
Data transfer            Loads-stores (move instructions on machines with memory addressing)
Control                  Branch, jump, procedure call and return, traps
System                   Operating system call, virtual memory management instructions
Floating point           Floating-point operations: add, multiply
Decimal                  Decimal add, decimal multiply, decimal-to-character conversions
String                   String move, string compare, string search
Graphics                 Pixel operations, compression/decompression operations

FIGURE 2.10 Categories of instruction operators and examples of each. All machines generally provide a full set of operations for the first three categories. The support for system functions in the instruction set varies widely among architectures, but all machines must have some instruction support for basic system functions. The amount of support in the instruction set for the last four categories may vary from none to an extensive set of special instructions. Floating-point instructions will be provided in any machine that is intended for use in an application that makes much use of floating point. These instructions are sometimes part of an optional instruction set. Decimal and string instructions are sometimes primitives, as in the VAX or the IBM 360, or may be synthesized by the compiler from simpler instructions. Graphics instructions typically operate on many smaller data items in parallel; for example, performing eight 8-bit additions on two 64-bit operands.

2.4 Operations in the Instruction Set

The operators supported by most instruction set architectures can be categorized, as in Figure 2.10. One rule of thumb across all architectures is that the most widely executed instructions are the simple operations of an instruction set.
For example, Figure 2.11 shows 10 simple instructions that account for 96% of instructions executed for a collection of integer programs running on the popular Intel 80x86. Hence the implementor of these instructions should be sure to make these fast, as they are the common case.

Because the measurements of branch and jump behavior are fairly independent of other measurements, we examine the use of control-flow instructions next.

Instructions for Control Flow

There is no consistent terminology for instructions that change the flow of control. In the 1950s they were typically called transfers. Beginning in 1960 the name branch began to be used. Later, machines introduced additional names. Throughout this book we will use jump when the change in control is unconditional and branch when the change is conditional.

Rank   80x86 instruction         Integer average (% total executed)
1      load                      22%
2      conditional branch        20%
3      compare                   16%
4      store                     12%
5      add                       8%
6      and                       6%
7      sub                       5%
8      move register-register    4%
9      call                      1%
10     return                    1%
       Total                     96%

FIGURE 2.11 The top 10 instructions for the 80x86. These percentages are the average of the same five SPECint92 programs as in Figure 2.7 on page 77.

We can distinguish four different types of control-flow change:

1. Conditional branches
2. Jumps
3. Procedure calls
4. Procedure returns

We want to know the relative frequency of these events, as each event is different, may use different instructions, and may have different behavior. The frequencies of these control-flow instructions for a load-store machine running our benchmarks are shown in Figure 2.12.

[Figure 2.12: bar chart. x axis: frequency of branch classes (0% to 100%).]

                      Integer average   Floating-point average
Call/return                 13%                  11%
Jump                         6%                   4%
Conditional branch          81%                  86%

FIGURE 2.12 Breakdown of control flow instructions into three classes: calls or returns, jumps, and conditional branches. Each type is counted in one of three bars. Conditional branches clearly dominate.
The programs and machine used to collect these statistics are the same as those in Figure 2.7.

The destination address of a control flow instruction must always be specified. This destination is specified explicitly in the instruction in the vast majority of cases (procedure return being the major exception), since for return the target is not known at compile time. The most common way to specify the destination is to supply a displacement that is added to the program counter, or PC. Control flow instructions of this sort are called PC-relative. PC-relative branches or jumps are advantageous because the target is often near the current instruction, and specifying the position relative to the current PC requires fewer bits. Using PC-relative addressing also permits the code to run independently of where it is loaded. This property, called position independence, can eliminate some work when the program is linked and is also useful in programs linked during execution.

To implement returns and indirect jumps in which the target is not known at compile time, a method other than PC-relative addressing is required. Here, there must be a way to specify the target dynamically, so that it can change at runtime. This dynamic address may be as simple as naming a register that contains the target address; alternatively, the jump may permit any addressing mode to be used to supply the target address. These register indirect jumps are also useful for three other important features: case or switch statements found in many programming languages (which select among one of several alternatives), dynamically shared libraries (which allow a library to be loaded only when it is actually invoked by the program), and virtual functions in object-oriented languages like C++ (which allow different routines to be called depending on the type of the data).
In all three cases the target address is not known at compile time, and hence is usually loaded from memory into a register before the register indirect jump.

As branches generally use PC-relative addressing to specify their targets, a key question concerns how far branch targets are from branches. Knowing the distribution of these displacements will help in choosing what branch offsets to support and thus will affect the instruction length and encoding. Figure 2.13 shows the distribution of displacements for PC-relative branches in instructions. About 75% of the branches are in the forward direction.

Since most changes in control flow are branches, deciding how to specify the branch condition is important. The three primary techniques in use and their advantages and disadvantages are shown in Figure 2.14.

One of the most noticeable properties of branches is that a large number of the comparisons are simple equality or inequality tests, and a large number are comparisons with zero. Thus, some architectures choose to treat these comparisons as special cases, especially if a compare and branch instruction is being used. Figure 2.15 shows the frequency of different comparisons used for conditional branching. The data in Figure 2.8 said that a large percentage of the comparisons had an immediate operand, and while not shown, 0 was the most heavily used immediate. When we combine this with the data in Figure 2.15, we can see that a significant percentage (over 50%) of the integer compares in branches are simple tests for equality with 0.

[Figure 2.13: distribution curves for the integer average and the floating-point average. x axis: bits of branch displacement (0 to 15); y axis: 0% to 40%.]

FIGURE 2.13 Branch distances in terms of number of instructions between the target and the branch instruction. The most frequent branches in the integer programs are to targets that are four to seven instructions away.
This tells us that short displacement fields often suffice for branches and that the designer can gain some encoding density by having a shorter instruction with a smaller branch displacement. These measurements were taken on a load-store machine (DLX architecture). An architecture that requires fewer instructions for the same program, such as a VAX, would have shorter branch distances. Similarly, the number of bits needed for the displacement may change if the machine allows instructions to be arbitrarily aligned. A cumulative distribution of this branch displacement data is shown in Exercise 2.1 (see Figure 2.32 on page 119). The programs and machine used to collect these statistics are the same as those in Figure 2.7.

Condition code (CC)
  How condition is tested: Special bits are set by ALU operations, possibly under program control.
  Advantages: Sometimes condition is set for free.
  Disadvantages: CC is extra state. Condition codes constrain the ordering of instructions since they pass information from one instruction to a branch.

Condition register
  How condition is tested: Test arbitrary register with the result of a comparison.
  Advantages: Simple.
  Disadvantages: Uses up a register.

Compare and branch
  How condition is tested: Compare is part of the branch. Often compare is limited to subset.
  Advantages: One instruction rather than two for a branch.
  Disadvantages: May be too much work per instruction.

FIGURE 2.14 The major methods for evaluating branch conditions, their advantages, and their disadvantages. Although condition codes can be set by ALU operations that are needed for other purposes, measurements on programs show that this rarely happens. The major implementation problems with condition codes arise when the condition code is set by a large or haphazardly chosen subset of the instructions, rather than being controlled by a bit in the instruction. Machines with compare and branch often limit the set of compares and use a condition register for more complex compares.
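The PC-relative scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the displacement is a signed count of instructions, instructions are a fixed 4 bytes, and the displacement is taken relative to the address of the branch itself (real machines often use the address of the following instruction instead). The function names are hypothetical.

```python
def fits_signed(displacement: int, bits: int) -> bool:
    """True if displacement is representable as a two's complement field of `bits` bits."""
    low = -(1 << (bits - 1))
    high = (1 << (bits - 1)) - 1
    return low <= displacement <= high


def branch_target(pc: int, displacement: int, instr_bytes: int = 4) -> int:
    """Target address of a PC-relative branch.

    Assumes the displacement counts instructions and is relative to the
    branch's own address; both are assumptions made for this sketch.
    """
    return pc + displacement * instr_bytes


# A branch at 0x1000 that jumps three instructions backward.
print(hex(branch_target(0x1000, -3)))

# Which distances (in instructions) fit in an 8-bit displacement field?
for distance in (7, 100, -100, 128):
    print(distance, fits_signed(distance, 8))
```

Because only the distance is encoded, the same instruction bits work wherever the code is loaded, which is the position-independence property noted in the text.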
Often, different techniques are used for branches based on floating-point comparison versus those based on integer comparison. This is reasonable since the number of branches that depend on floating-point comparisons is much smaller than the number depending on integer comparisons.

[Figure 2.15: bar chart. x axis: frequency of comparison types in branches (0% to 100%).]

                                    Integer average   Floating-point average
Less than/greater than or equal            7%                  40%
Greater than/less than or equal            7%                  23%
Equal/not equal                           86%                  37%

FIGURE 2.15 Frequency of different types of compares in conditional branches. This includes both the integer and floating-point compares in branches. Remember that earlier data in Figure 2.8 indicate that most integer comparisons are against an immediate operand. The programs and machine used to collect these statistics are the same as those in Figure 2.7.

Procedure calls and returns include control transfer and possibly some state saving; at a minimum the return address must be saved somewhere. Some architectures provide a mechanism to save the registers, while others require the compiler to generate instructions. There are two basic conventions in use to save registers. Caller saving means that the calling procedure must save the registers that it wants preserved for access after the call. Callee saving means that the called procedure must save the registers it wants to use. There are times when caller save must be used because of access patterns to globally visible variables in two different procedures. For example, suppose we have a procedure P1 that calls procedure P2, and both procedures manipulate the global variable x. If P1 had allocated x to a register it must be sure to save x to a location known by P2 before the call to P2.
A compiler’s ability to discover when a called procedure may access register-allocated quantities is complicated by the possibility of separate compilation and situations where P2 may not touch x but can call another procedure, P3, that may access x. Because of these complications, most compilers will conservatively caller save any variable that may be accessed during a call.

In the cases where either convention could be used, some programs will do better with callee save and some will do better with caller save. As a result, the most sophisticated compilers use a combination of the two mechanisms, and the register allocator may choose which register to use for a variable based on the convention. Later in this chapter we will examine the mismatch between sophisticated instructions for automatically saving registers and the needs of the compiler.

Summary: Operations in the Instruction Set

From this section we see the importance and popularity of simple instructions: load, store, add, subtract, move register-register, and, shift, compare equal, compare not equal, branch, jump, call, and return. Although there are many options for conditional branches, we would expect branch addressing in a new architecture to be able to jump to about 100 instructions either above or below the branch, implying a PC-relative branch displacement of at least 8 bits. We would also expect to see register-indirect and PC-relative addressing for jump instructions to support returns as well as many other features of current systems.

2.5 Type and Size of Operands

How is the type of an operand designated? There are two primary alternatives: First, the type of an operand may be designated by encoding it in the opcode; this is the method used most often. Alternatively, the data can be annotated with tags that are interpreted by the hardware. These tags specify the type of the operand, and the operation is chosen accordingly.
Machines with tagged data, however, can only be found in computer museums.

Usually the type of an operand (for example, integer, single-precision floating point, character) effectively gives its size. Common operand types include character (1 byte), half word (16 bits), word (32 bits), single-precision floating point (also 1 word), and double-precision floating point (2 words). Characters are almost always in ASCII and integers are almost universally represented as two’s complement binary numbers. Until the early 1980s, most computer manufacturers chose their own floating-point representation. Almost all machines since that time follow the same standard for floating point, the IEEE standard 754. The IEEE floating-point standard is discussed in detail in Appendix A.

Some architectures provide operations on character strings, although such operations are usually quite limited and treat each byte in the string as a single character. Typical operations supported on character strings are comparisons and moves.

For business applications, some architectures support a decimal format, usually called packed decimal or binary-coded decimal: 4 bits are used to encode the values 0–9, and 2 decimal digits are packed into each byte. Numeric character strings are sometimes called unpacked decimal, and operations (called packing and unpacking) are usually provided for converting back and forth between them.

Our benchmarks use byte or character, half word (short integer), word (integer), and floating-point data types. Figure 2.16 shows the dynamic distribution of the sizes of objects referenced from memory for these programs. The frequency of access to different data types helps in deciding what types are most important to support efficiently. Should the machine have a 64-bit access path, or would taking two cycles to access a double word be satisfactory?
How important is it to support byte accesses as primitives, which, as we saw earlier, require an alignment network? In Figure 2.16, memory references are used to examine the types of data being accessed. In some architectures, objects in registers may be accessed as bytes or half words. However, such access is very infrequent: on the VAX, it accounts for no more than 12% of register references, or roughly 6% of all operand accesses in these programs. The successor to the VAX not only removed operations on data smaller than 32 bits, it also removed data transfers on these smaller sizes: the first implementations of the Alpha required multiple instructions to read or write bytes or half words.

Note that Figure 2.16 was measured on a machine with 32-bit addresses: on a 64-bit address machine the 32-bit addresses would be replaced by 64-bit addresses. Hence as 64-bit address architectures become more popular, we would expect that double-word accesses will be popular for integer programs as well as floating-point programs.

[Figure 2.16: bar chart. x axis: frequency of reference by size (0% to 80%).]

               Integer average   Floating-point average
Double word           0%                  69%
Word                 74%                  31%
Half word            19%                   0%
Byte                  7%                   0%

FIGURE 2.16 Distribution of data accesses by size for the benchmark programs. Access to the major data type (word or double word) clearly dominates each type of program. Half words are more popular than bytes because one of the five SPECint92 programs (eqntott) uses half words as the primary data type, and hence they are responsible for 87% of the data accesses (see Figure 2.31 on page 110). The double-word data type is used solely for double-precision floating-point in floating-point programs. These measurements were taken on the memory traffic generated on a 32-bit load-store architecture.
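The packed decimal format mentioned earlier in this section (two decimal digits per byte, 4 bits each) can be illustrated with a short Python sketch. It is a simplified assumption-laden example: the function names are hypothetical, an odd number of digits is padded with a leading zero, and no sign nibble is modeled, although real packed decimal formats typically carry one.

```python
DIGITS = "0123456789"


def pack_decimal(digits: str) -> bytes:
    """Pack an unsigned string of decimal digits, two digits per byte."""
    if any(ch not in DIGITS for ch in digits):
        raise ValueError("only the characters 0-9 are allowed")
    if len(digits) % 2:
        digits = "0" + digits  # pad so the digits fill whole bytes
    return bytes(
        (int(digits[i]) << 4) | int(digits[i + 1])
        for i in range(0, len(digits), 2)
    )


def unpack_decimal(packed: bytes) -> str:
    """Expand packed decimal back into a string of digit characters."""
    return "".join(f"{byte >> 4}{byte & 0x0F}" for byte in packed)


packed = pack_decimal("1234")
print(packed.hex())            # each hex digit is one decimal digit
print(unpack_decimal(packed))
```

Packing halves the storage needed for a numeric character string, which is why the conversion operations are worth providing on machines aimed at business applications.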
Summary: Type and Size of Operands

From this section we would expect a new 32-bit architecture to support 8-, 16-, and 32-bit integers and 64-bit IEEE 754 floating-point data; a new 64-bit address architecture would need to support 64-bit integers as well. The level of support for decimal data is less clear, and it is a function of the intended use of the machine as well as the effectiveness of the decimal support.

2.6 Encoding an Instruction Set

Clearly the choices mentioned above will affect how the instructions are encoded into a binary representation for execution by the CPU. This representation affects not only the size of the compiled program but also the implementation of the CPU, which must decode this representation to quickly find the operation and its operands. The operation is typically specified in one field, called the opcode. As we shall see, the important decision is how to encode the addressing modes with the operations.

This decision depends on the range of addressing modes and the degree of independence between opcodes and modes. Some machines have one to five operands with 10 addressing modes for each operand (see Figure 2.5 on page 75). For such a large number of combinations, typically a separate address specifier is needed for each operand: the address specifier tells what addressing mode is used to access the operand. At the other extreme is a load-store machine with only one memory operand and only one or two addressing modes; obviously, in this case, the addressing mode can be encoded as part of the opcode.

When encoding the instructions, the number of registers and the number of addressing modes both have a significant impact on the size of instructions, since the addressing mode field and the register field may appear many times in a single instruction. In fact, for most instructions many more bits are consumed in encoding addressing modes and register fields than in specifying the opcode.
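To make the last point concrete, the following Python sketch tallies the bits in a hypothetical fixed-length, three-register instruction. The 6-bit opcode, 32 registers, and 32-bit instruction width are illustrative assumptions chosen for the example, not a format defined in this section, and the function name is made up.

```python
def field_budget(opcode_bits: int, num_registers: int,
                 register_operands: int, instruction_bits: int = 32) -> dict:
    """Split an instruction's bits among opcode, register fields, and the rest."""
    if num_registers < 1:
        raise ValueError("need at least one register")
    # Bits needed to name one of num_registers registers.
    bits_per_register = (num_registers - 1).bit_length()
    register_bits = bits_per_register * register_operands
    used = opcode_bits + register_bits
    if used > instruction_bits:
        raise ValueError("fields do not fit in the instruction")
    return {
        "opcode": opcode_bits,
        "registers": register_bits,
        "remaining": instruction_bits - used,
    }


# Hypothetical three-operand format: 6-bit opcode, 32 registers.
print(field_budget(opcode_bits=6, num_registers=32, register_operands=3))

# Doubling the register count costs one more bit in every register field.
print(field_budget(opcode_bits=6, num_registers=64, register_operands=3))
```

In the first case the three register fields take 15 bits against 6 for the opcode, and each doubling of the register file takes 3 more bits away from whatever else the instruction must hold.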
The architect must balance several competing forces when encoding the instruction set:

1. The desire to have as many registers and addressing modes as possible.
2. The impact of the size of the register and addressing mode fields on the average instruction size and hence on the average program size.
3. A desire to have instructions encode into lengths that will be easy to handle in the implementation. As a minimum, the architect wants instructions to be in multiples of bytes, rather than an arbitrary length. Many architects have chosen to use a fixed-length instruction to gain implementation benefits while sacrificing average code size.

Since the addressing modes and register fields make up such a large percentage of the instruction bits, their encoding will significantly affect how easy it is for an implementation to decode the instructions. The importance of having easily decoded instructions is discussed in Chapter 3.

Figure 2.17 shows three popular choices for encoding the instruction set. The first we call variable, since it allows virtually all addressing modes to be used with all operations. This style is best when there are many addressing modes and operations. The second choice we call fixed, since it combines the operation and the addressing mode into the opcode. Often fixed encoding will have only a single size for all instructions; it works best when there are few addressing modes and operations. The trade-off between variable encoding and fixed encoding is size of programs versus ease of decoding in the CPU. Variable tries to use as few bits as possible to represent the program, but individual instructions can vary widely in both size and the amount of work to be performed. For example, the VAX integer add can vary in size between 3 and 19 bytes and vary between 0 and 6 in data memory references. Given these two poles of instruction set design, the third alternative immediately springs to mind: Reduce the variability in size and work of the variable architecture but provide multiple instruction lengths so as to reduce code size. This hybrid approach is the third encoding alternative.

(a) Variable (e.g., VAX)
  [Operation & no. of operands] [Address specifier 1] [Address field 1] ... [Address specifier n] [Address field n]

(b) Fixed (e.g., DLX, MIPS, Power PC, Precision Architecture, SPARC)
  [Operation] [Address field 1] [Address field 2] [Address field 3]

(c) Hybrid (e.g., IBM 360/70, Intel 80x86)
  [Operation] [Address specifier] [Address field]
  [Operation] [Address specifier 1] [Address specifier 2] [Address field]
  [Operation] [Address specifier] [Address field 1] [Address field 2]

FIGURE 2.17 Three basic variations in instruction encoding. The variable format can support any number of operands, with each address specifier determining the addressing mode for that operand. The fixed format always has the same number of operands, with the addressing modes (if options exist) specified as part of the opcode (see also Figure C.3 on page C-4). Although the fields tend not to vary in their location, they will be used for different purposes by different instructions. The hybrid approach will have multiple formats specified by the opcode, adding one or two fields to specify the addressing mode and one or two fields to specify the operand address (see also Figure D.7 on page D-12).

To make these general classes more specific, this book contains several examples. Fixed formats of five machines can be seen in Figure C.3 on page C-4 and the hybrid formats of the Intel 80x86 can be seen in Figure D.8 on page D-13.

Let’s look at a VAX instruction to see an example of the variable encoding:

addl3 r1,737(r2),(r3)

The name addl3 means a 32-bit integer add instruction with three operands, and this opcode takes 1 byte.
A VAX address specifier is 1 byte, generally with the first 4 bits specifying the addressing mode and the second 4 bits specifying the register used in that addressing mode. The first operand specifier, r1, indicates register addressing using register 1, and this specifier is 1 byte long. The second operand specifier, 737(r2), indicates displacement addressing. It has two parts: The first part is a byte that specifies the 16-bit indexed addressing mode and base register (r2); the second part is the 2-byte-long displacement (737). The third operand specifier, (r3), specifies register indirect addressing mode using register 3. Thus, this instruction has two data memory accesses, and the total length of the instruction is

1 + (1) + (1+2) + (1) = 6 bytes

The length of VAX instructions varies between 1 and 53 bytes.

Summary: Encoding the Instruction Set

Decisions made in the components of instruction set design discussed in prior sections determine whether or not the architect has the choice between variable and fixed instruction encodings. Given the choice, the architect more interested in code size than performance will pick variable encoding, and the one more interested in performance than code size will pick fixed encoding. In Chapters 3 and 4, the impact of variability on performance of the CPU will be discussed further.

We have almost finished laying the groundwork for the DLX instruction set architecture that will be introduced in section 2.8. But before we do that, it will be helpful to take a brief look at recent compiler technology and its effect on program properties.

2.7 Crosscutting Issues: The Role of Compilers

Today almost all programming is done in high-level languages. This development means that since most instructions executed are the output of a compiler, an instruction set architecture is essentially a compiler target. In earlier times, architectural decisions were often made to ease assembly language programming.
Because performance of a computer will be significantly affected by the compiler, understanding compiler technology today is critical to designing and efficiently implementing an instruction set. In earlier days it was popular to try to isolate the compiler technology and its effect on hardware performance from the architecture and its performance, just as it was popular to try to separate an architecture from its implementation. This separation is essentially impossible with today’s compilers and machines. Architectural choices affect the quality of the code that can be generated for a machine and the complexity of building a good compiler for it. Isolating the compiler from the hardware is likely to be misleading. In this section we will discuss the critical goals in the instruction set primarily from the compiler viewpoint. What features will lead to high-quality code? What makes it easy to write efficient compilers for an architecture?

The Structure of Recent Compilers

To begin, let’s look at what optimizing compilers are like today. The structure of recent compilers is shown in Figure 2.18.

Front-end per language
  Dependencies: Language dependent; machine independent
  Function: Transform language to common intermediate form

(Intermediate representation)

High-level optimizations
  Dependencies: Somewhat language dependent, largely machine independent
  Function: For example, procedure inlining and loop transformations

Global optimizer
  Dependencies: Small language dependencies; machine dependencies slight (e.g., register counts/types)
  Function: Including global and local optimizations + register allocation

Code generator
  Dependencies: Highly machine dependent; language independent
  Function: Detailed instruction selection and machine-dependent optimizations; may include or be followed by assembler

FIGURE 2.18 Current compilers typically consist of two to four passes, with more highly optimizing compilers having more passes.
A pass is simply one phase in which the compiler reads and transforms the entire program. (The term phase is often used interchangeably with pass.) The optimizing passes are designed to be optional and may be skipped when faster compilation is the goal and lower quality code is acceptable. This structure maximizes the probability that a program compiled at various levels of optimization will produce the same output when given the same input. Because the optimizing passes are also separated, multiple languages can use the same optimizing and code-generation passes. Only a new front end is required for a new language. The high-level optimization mentioned here, procedure inlining, is also called procedure integration.

A compiler writer’s first goal is correctness: all valid programs must be compiled correctly. The second goal is usually speed of the compiled code. Typically, a whole set of other goals follows these two, including fast compilation, debugging support, and interoperability among languages. Normally, the passes in the compiler transform higher-level, more abstract representations into progressively lower-level representations, eventually reaching the instruction set. This structure helps manage the complexity of the transformations and makes writing a bug-free compiler easier.

The complexity of writing a correct compiler is a major limitation on the amount of optimization that can be done. Although the multiple-pass structure helps reduce compiler complexity, it also means that the compiler must order and perform some transformations before others. In the diagram of the optimizing compiler in Figure 2.18, we can see that certain high-level optimizations are performed long before it is known what the resulting code will look like in detail. Once such a transformation is made, the compiler can’t afford to go back and revisit all steps, possibly undoing transformations.
This would be prohibitive, both in compilation time and in complexity. Thus, compilers make assumptions about the ability of later steps to deal with certain problems. For example, compilers usually have to choose which procedure calls to expand inline before they know the exact size of the procedure being called. Compiler writers call this problem the phase-ordering problem.

How does this ordering of transformations interact with the instruction set architecture? A good example occurs with the optimization called global common subexpression elimination. This optimization finds two instances of an expression that compute the same value and saves the value of the first computation in a temporary. It then uses the temporary value, eliminating the second computation of the expression. For this optimization to be significant, the temporary must be allocated to a register. Otherwise, the cost of storing the temporary in memory and later reloading it may negate the savings gained by not recomputing the expression. There are, in fact, cases where this optimization actually slows down code when the temporary is not register allocated. Phase ordering complicates this problem, because register allocation is typically done near the end of the global optimization pass, just before code generation. Thus, an optimizer that performs this optimization must assume that the register allocator will allocate the temporary to a register.

Optimizations performed by modern compilers can be classified by the style of the transformation, as follows:

1. High-level optimizations are often done on the source with output fed to later optimization passes.
2. Local optimizations optimize code only within a straight-line code fragment (called a basic block by compiler people).
3. Global optimizations extend the local optimizations across branches and introduce a set of transformations aimed at optimizing loops.
4. Register allocation.
5. Machine-dependent optimizations attempt to take advantage of specific architectural knowledge.

Because of the central role that register allocation plays, both in speeding up the code and in making other optimizations useful, it is one of the most important optimizations, if not the most important. Recent register allocation algorithms are based on a technique called graph coloring. The basic idea behind graph coloring is to construct a graph representing the possible candidates for allocation to a register and then to use the graph to allocate registers. Although the problem of coloring a graph is NP-complete, there are heuristic algorithms that work well in practice.

Graph coloring works best when there are at least 16 (and preferably more) general-purpose registers available for global allocation for integer variables and additional registers for floating point. Unfortunately, graph coloring does not work very well when the number of registers is small because the heuristic algorithms for coloring the graph are likely to fail. The emphasis in the approach is to achieve 100% allocation of active variables.

It is sometimes difficult to separate some of the simpler optimizations (local and machine-dependent optimizations) from transformations done in the code generator. Examples of typical optimizations are given in Figure 2.19. The last column of Figure 2.19 indicates the frequency with which the listed optimizing transforms were applied to the source program. The effect of various optimizations on instructions executed for two programs is shown in Figure 2.20.

The Impact of Compiler Technology on the Architect’s Decisions

The interaction of compilers and high-level languages significantly affects how programs use an instruction set architecture. There are two important questions: How are variables allocated and addressed? How many registers are needed to allocate variables appropriately?
To address these questions, we must look at the three separate areas in which current high-level languages allocate their data:

- The stack is used to allocate local variables. The stack is grown and shrunk on procedure call or return, respectively. Objects on the stack are addressed relative to the stack pointer and are primarily scalars (single variables) rather than arrays. The stack is used for activation records, not as a stack for evaluating expressions. Hence values are almost never pushed or popped on the stack.
- The global data area is used to allocate statically declared objects, such as global variables and constants. A large percentage of these objects are arrays or other aggregate data structures.
- The heap is used to allocate dynamic objects that do not adhere to a stack discipline. Objects in the heap are accessed with pointers and are typically not scalars.

Optimization name | Explanation | Percentage of the total number of optimizing transforms
High-level | At or near the source level; machine-independent |
  Procedure integration | Replace procedure call by procedure body | N.M.
Local | Within straight-line code |
  Common subexpression elimination | Replace two instances of the same computation by single copy | 18%
  Constant propagation | Replace all instances of a variable that is assigned a constant with the constant | 22%
  Stack height reduction | Rearrange expression tree to minimize resources needed for expression evaluation | N.M.
Global | Across a branch |
  Global common subexpression elimination | Same as local, but this version crosses branches | 13%
  Copy propagation | Replace all instances of a variable A that has been assigned X (i.e., A = X) with X | 11%
  Code motion | Remove code from a loop that computes same value each iteration of the loop | 16%
  Induction variable elimination | Simplify/eliminate array-addressing calculations within loops | 2%
Machine-dependent | Depends on machine knowledge |
  Strength reduction | Many examples, such as replace multiply by a constant with adds and shifts | N.M.
  Pipeline scheduling | Reorder instructions to improve pipeline performance | N.M.
  Branch offset optimization | Choose the shortest branch displacement that reaches target | N.M.

FIGURE 2.19 Major types of optimizations and examples in each class. The third column lists the static frequency with which some of the common optimizations are applied in a set of 12 small FORTRAN and Pascal programs. The percentage is the portion of the static optimizations that are of the specified type. These data tell us about the relative frequency of occurrence of various optimizations. There are nine local and global optimizations done by the compiler included in the measurement. Six of these optimizations are covered in the figure, and the remaining three account for 18% of the total static occurrences. The abbreviation N.M. means that the number of occurrences of that optimization was not measured. Machine-dependent optimizations are usually done in a code generator, and none of those was measured in this experiment. Data from Chow [1983] (collected using the Stanford UCODE compiler).

Program and compiler optimization level | Percent of unoptimized instructions executed
hydro level 3 | 26%
hydro level 2 | 26%
hydro level 1 | 36%
hydro level 0 | 100%
li level 3 | 73%
li level 2 | 75%
li level 1 | 89%
li level 0 | 100%

(Bar chart; each bar is divided into branches/calls, FLOPs, loads-stores, and integer ALU instructions.)

FIGURE 2.20 Change in instruction count for the programs hydro2d and li from the SPEC92 as compiler optimization levels vary. Level 0 is the same as unoptimized code. These experiments were performed on the MIPS compilers. Level 1 includes local optimizations, code scheduling, and local register allocation. Level 2 includes global optimizations, loop transformations (software pipelining), and global register allocation. Level 3 adds procedure integration.
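As a rough illustration of the graph-coloring idea described earlier in this section, the sketch below builds an interference graph from live ranges and colors it greedily with k registers. It is a minimal sketch under stated assumptions, not the allocator of any particular compiler: the function names, the live-range representation (half-open instruction intervals), and the highest-degree-first heuristic are all choices made for this example.

```python
def interference_graph(live_ranges):
    """Build an interference graph from {name: (start, end)} live ranges.

    Two variables interfere when their half-open ranges overlap, which
    means they cannot share a register. (Hypothetical representation.)
    """
    graph = {name: set() for name in live_ranges}
    names = list(live_ranges)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            a_start, a_end = live_ranges[a]
            b_start, b_end = live_ranges[b]
            if a_start < b_end and b_start < a_end:
                graph[a].add(b)
                graph[b].add(a)
    return graph


def color(graph, k):
    """Greedy heuristic coloring with k registers, highest degree first.

    Returns (assignment, spilled): a register number per colored variable
    and the list of variables that could not be given a register.
    """
    assignment, spilled = {}, []
    for node in sorted(graph, key=lambda n: (-len(graph[n]), n)):
        used = {assignment[m] for m in graph[node] if m in assignment}
        free = [r for r in range(k) if r not in used]
        if free:
            assignment[node] = free[0]
        else:
            spilled.append(node)  # must live in memory instead
    return assignment, spilled


ranges = {"a": (0, 4), "b": (1, 3), "c": (2, 6), "d": (5, 8)}
graph = interference_graph(ranges)
print(color(graph, 3))  # every variable gets a register
print(color(graph, 2))  # one variable is spilled
```

With three registers all four variables are allocated; with two, the heuristic has to spill one of them, which is the small-register-count failure mode the text warns about.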
Register allocation is much more effective for stack-allocated objects than for global variables, and register allocation is essentially impossible for heap-allocated objects because they are accessed with pointers. Global variables and some stack variables are impossible to allocate because they are aliased, which means that there are multiple ways to refer to the address of a variable, making it illegal to put it into a register. (Most heap variables are effectively aliased for today’s compiler technology.) For example, consider the following code sequence, where & returns the address of a variable and * dereferences a pointer:

    p = &a      -- gets address of a in p
    a = ...     -- assigns to a directly
    *p = ...    -- uses p to assign to a
    ...a...     -- accesses a

The variable a could not be register allocated across the assignment to *p without generating incorrect code. Aliasing causes a substantial problem because it is often difficult or impossible to decide what objects a pointer may refer to. A compiler must be conservative; many compilers will not allocate any local variables of a procedure in a register when there is a pointer that may refer to one of the local variables.

How the Architect Can Help the Compiler Writer

Today, the complexity of a compiler does not come from translating simple statements like A = B + C. Most programs are locally simple, and simple translations work fine. Rather, complexity arises because programs are large and globally complex in their interactions, and because the structure of compilers means that decisions must be made about what code sequence is best one step at a time. Compiler writers often are working under their own corollary of a basic principle in architecture: Make the frequent cases fast and the rare case correct.
That is, if we know which cases are frequent and which are rare, and if generating code for both is straightforward, then the quality of the code for the rare case may not be very important, but it must be correct!

Some instruction set properties help the compiler writer. These properties should not be thought of as hard and fast rules, but rather as guidelines that will make it easier to write a compiler that will generate efficient and correct code.

1. Regularity: Whenever it makes sense, the three primary components of an instruction set (the operations, the data types, and the addressing modes) should be orthogonal. Two aspects of an architecture are said to be orthogonal if they are independent. For example, the operations and addressing modes are orthogonal if for every operation to which a certain addressing mode can be applied, all addressing modes are applicable. This helps simplify code generation and is particularly important when the decision about what code to generate is split into two passes in the compiler. A good counterexample of this property is restricting what registers can be used for a certain class of instructions. This can result in the compiler finding itself with lots of available registers, but none of the right kind!

2. Provide primitives, not solutions: Special features that “match” a language construct are often unusable. Attempts to support high-level languages may work only with one language, or do more or less than is required for a correct and efficient implementation of the language. Some examples of how these attempts have failed are given in section 2.9.

3. Simplify trade-offs among alternatives: One of the toughest jobs a compiler writer has is figuring out what instruction sequence will be best for every segment of code that arises. In earlier days, instruction counts or total code size might have been good metrics, but, as we saw in the last chapter, this is no longer true. With caches and pipelining, the trade-offs have become very complex. Anything the designer can do to help the compiler writer understand the costs of alternative code sequences would help improve the code. One of the most difficult instances of complex trade-offs occurs in a register-memory architecture in deciding how many times a variable should be referenced before it is cheaper to load it into a register. This threshold is hard to compute and, in fact, may vary among models of the same architecture.

4. Provide instructions that bind the quantities known at compile time as constants: A compiler writer hates the thought of the machine interpreting at runtime a value that was known at compile time. Good counterexamples of this principle include instructions that interpret values that were fixed at compile time. For instance, the VAX procedure call instruction (calls) dynamically interprets a mask saying what registers to save on a call, but the mask is fixed at compile time. However, in some cases it may not be known by the caller whether separate compilation was used.

Summary: The Role of Compilers

This section leads to several recommendations. First, we expect a new instruction set architecture to have at least 16 general-purpose registers, not counting separate registers for floating-point numbers, to simplify allocation of registers using graph coloring. The advice on orthogonality suggests that all supported addressing modes apply to all instructions that transfer data. Finally, the last three pieces of advice of the last subsection (provide primitives instead of solutions, simplify trade-offs between alternatives, don’t bind constants at runtime) all suggest that it is better to err on the side of simplicity. In other words, understand that less is more in the design of an instruction set.
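The aliasing hazard discussed earlier in this section (the sequence p = &a; a = ...; *p = ...; ...a...) can be made concrete with a small simulation. This is only a sketch under assumptions: memory is modeled as a Python dictionary, the "register" is an ordinary local variable, and the function name run is invented for the example.

```python
def run(keep_a_in_register):
    """Simulate the aliased sequence, optionally caching a in a register."""
    mem = {"a": 0}   # simulated memory; the key "a" stands for a's address
    p = "a"          # p = &a     (p now aliases a)
    mem["a"] = 1     # a = ...    (direct assignment to a)
    reg_a = mem["a"] if keep_a_in_register else None  # a copied to a register
    mem[p] = 2       # *p = ...   (assignment to a through the pointer)
    # ...a...: read a from the register if it was allocated, else from memory
    return reg_a if keep_a_in_register else mem["a"]


print(run(keep_a_in_register=False))  # reads the value written through p
print(run(keep_a_in_register=True))   # reads a stale register copy
```

The second call returns the value from before the store through p, which is exactly the incorrect code a conservative compiler avoids by refusing to register-allocate a.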
2.8 Putting It All Together: The DLX Architecture

In many places throughout this book we will have occasion to refer to a computer’s “machine language.” The machine we use is a mythical computer called “MIX.” MIX is very much like nearly every computer in existence, except that it is, perhaps, nicer … MIX is the world’s first polyunsaturated computer. Like most machines, it has an identifying number, the 1009. This number was found by taking 16 actual computers which are very similar to MIX and on which MIX can be easily simulated, then averaging their number with equal weight: (360 + 650 + 709 + 7070 + U3 + SS80 + 1107 + 1604 + G20 + B220 + S2000 + 920 + 601 + H800 + PDP-4 + II)/16 = 1009. The same number may be obtained in a simpler way by taking Roman numerals.
Donald Knuth, The Art of Computer Programming, Volume I: Fundamental Algorithms

In this section we will describe a simple load-store architecture called DLX (pronounced “Deluxe”). The authors believe DLX to be the world’s second polyunsaturated computer: the average of a number of recent experimental and commercial machines that are very similar in philosophy to DLX. Like Knuth, we derived the name of our machine from an average expressed in Roman numerals: (AMD 29K, DECstation 3100, HP 850, IBM 801, Intel i860, MIPS M/120A, MIPS M/1000, Motorola 88K, RISC I, SGI 4D/60, SPARCstation-1, Sun-4/110, Sun-4/260) / 13 = 560 = DLX.

The instruction set architecture of DLX and its ancestors was based on observations similar to those covered in the last sections. (In section 2.11 we discuss how and why these architectures became popular.) Reviewing our expectations from each section:

- Section 2.2: Use general-purpose registers with a load-store architecture.
- Section 2.3: Support these addressing modes: displacement (with an address offset size of 12 to 16 bits), immediate (size 8 to 16 bits), and register deferred.
- Section 2.4: Support these simple instructions, since they will dominate the number of instructions executed: load, store, add, subtract, move register-register, and, shift, compare equal, compare not equal, branch (with a PC-relative address at least 8 bits long), jump, call, and return.
- Section 2.5: Support these data sizes and types: 8-, 16-, and 32-bit integers and 64-bit IEEE 754 floating-point numbers.
- Section 2.6: Use fixed instruction encoding if interested in performance and use variable instruction encoding if interested in code size.
- Section 2.7: Provide at least 16 general-purpose registers plus separate floating-point registers, be sure all addressing modes apply to all data transfer instructions, and aim for a minimalist instruction set.

We introduce DLX by showing how it follows these recommendations. Like most recent machines, DLX emphasizes

- A simple load-store instruction set
- Design for pipelining efficiency, including a fixed instruction set encoding (discussed in Chapter 3)
- Efficiency as a compiler target

DLX provides a good architectural model for study, not only because of the recent popularity of this type of machine, but also because it is an easy architecture to understand. We will use this architecture again in Chapters 3 and 4, and it forms the basis for a number of exercises and programming projects.

Registers for DLX

DLX has 32 32-bit general-purpose registers (GPRs), named R0, R1, …, R31. Additionally, there is a set of floating-point registers (FPRs), which can be used as 32 single-precision (32-bit) registers or as even-odd pairs holding double-precision values. Thus, the 64-bit floating-point registers are named F0, F2, ..., F28, F30. Both single- and double-precision floating-point operations (32-bit and 64-bit) are provided. The value of R0 is always 0. We shall see later how we can use this register to synthesize a variety of useful operations from a simple instruction set.
A few special registers can be transferred to and from the integer registers. An example is the floating-point status register, used to hold information about the results of floating-point operations. There are also instructions for moving between an FPR and a GPR.

Data types for DLX

The data types are 8-bit bytes, 16-bit half words, and 32-bit words for integer data and 32-bit single precision and 64-bit double precision for floating point. Half words were added to the minimal set of recommended data types because they are found in languages like C and are popular in some programs, such as operating systems, that are concerned about the size of data structures. They will also become more popular as Unicode becomes more widely used. Single-precision floating-point operands were added for similar reasons. (Remember the early warning that you should measure many more programs before designing an instruction set.)

The DLX operations work on 32-bit integers and 32- or 64-bit floating point. Bytes and half words are loaded into registers with either zeros or the sign bit replicated to fill the 32 bits of the registers. Once loaded, they are operated on with the 32-bit integer operations.

Addressing modes for DLX data transfers

The only data addressing modes are immediate and displacement, both with 16-bit fields. Register deferred is accomplished simply by placing 0 in the 16-bit displacement field, and absolute addressing with a 16-bit field is accomplished by using register 0 as the base register. This gives us four effective modes, although only two are supported in the architecture.

DLX memory is byte addressable in Big Endian mode with a 32-bit address. As it is a load-store architecture, all memory references are through loads or stores between memory and either the GPRs or the FPRs. Supporting the data types mentioned above, memory accesses involving the GPRs can be to a byte, to a half word, or to a word.
The FPRs may be loaded and stored with single-precision or double-precision words (using a pair of registers for DP). All memory accesses must be aligned.

DLX Instruction Format

Since DLX has just two addressing modes, these can be encoded into the opcode. Following the advice on making the machine easy to pipeline and decode, all instructions are 32 bits with a 6-bit primary opcode. Figure 2.21 shows the instruction layout. These formats are simple while providing 16-bit fields for displacement addressing, immediate constants, or PC-relative branch addresses.

I-type instruction
  Fields (width in bits): Opcode (6) | rs1 (5) | rd (5) | Immediate (16)
  Encodes:
    Loads and stores of bytes, words, half words
    All immediates (rd ← rs1 op immediate)
    Conditional branch instructions (rs1 is register, rd unused)
    Jump register, jump and link register (rd = 0, rs1 = destination, immediate = 0)

R-type instruction
  Fields (width in bits): Opcode (6) | rs1 (5) | rs2 (5) | rd (5) | func (11)
  Encodes:
    Register-register ALU operations: rd ← rs1 func rs2
    Function encodes the data path operation: Add, Sub, . . .
    Read/write special registers and moves

J-type instruction
  Fields (width in bits): Opcode (6) | Offset added to PC (26)
  Encodes:
    Jump and jump and link
    Trap and return from exception

FIGURE 2.21 Instruction layout for DLX. All instructions are encoded in one of three types.

DLX Operations

DLX supports the list of simple operations recommended above plus a few others. There are four broad classes of instructions: loads and stores, ALU operations, branches and jumps, and floating-point operations. Any of the general-purpose or floating-point registers may be loaded or stored, except that loading R0 has no effect. Single-precision floating-point numbers occupy a single floating-point register, while double-precision values occupy a pair. Conversions between single and double precision must be done explicitly. The floating-point format is IEEE 754 (see Appendix A).
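The three instruction layouts of Figure 2.21 can be unpacked with a few shifts and masks. The sketch below is illustrative only: it assumes the fields are packed left to right starting at the most-significant bit (opcode in the top 6 bits), the function names are invented here, and the opcode numbers used in the usage lines are placeholders rather than real DLX encodings.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def decode_i(word):
    """I-type: opcode(6) | rs1(5) | rd(5) | immediate(16)."""
    return {"opcode": (word >> 26) & 0x3F, "rs1": (word >> 21) & 0x1F,
            "rd": (word >> 16) & 0x1F, "imm": sign_extend(word, 16)}


def decode_r(word):
    """R-type: opcode(6) | rs1(5) | rs2(5) | rd(5) | func(11)."""
    return {"opcode": (word >> 26) & 0x3F, "rs1": (word >> 21) & 0x1F,
            "rs2": (word >> 16) & 0x1F, "rd": (word >> 11) & 0x1F,
            "func": word & 0x7FF}


def decode_j(word):
    """J-type: opcode(6) | offset(26), offset treated as signed."""
    return {"opcode": (word >> 26) & 0x3F, "offset": sign_extend(word, 26)}


def encode_i(opcode, rs1, rd, imm):
    """Pack an I-type word; the inverse of decode_i."""
    return ((opcode & 0x3F) << 26) | ((rs1 & 0x1F) << 21) | \
           ((rd & 0x1F) << 16) | (imm & 0xFFFF)


# Placeholder opcode 35 with rs1 = R2, rd = R1, displacement 30.
word = encode_i(35, 2, 1, 30)
print(decode_i(word))
```

Because every format keeps the opcode and rs1 in the same bit positions, a decoder can read those fields before it knows which of the three types it is looking at.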
Figure 2.22 gives examples of the load and store instructions. A complete list of the instructions appears in Figure 2.25 (page 104).

Example instruction | Instruction name | Meaning
LW R1,30(R2) | Load word | Regs[R1] ←_32 Mem[30+Regs[R2]]
LW R1,1000(R0) | Load word | Regs[R1] ←_32 Mem[1000+0]
LB R1,40(R3) | Load byte | Regs[R1] ←_32 (Mem[40+Regs[R3]]_0)^24 ## Mem[40+Regs[R3]]
LBU R1,40(R3) | Load byte unsigned | Regs[R1] ←_32 0^24 ## Mem[40+Regs[R3]]
LH R1,40(R3) | Load half word | Regs[R1] ←_32 (Mem[40+Regs[R3]]_0)^16 ## Mem[40+Regs[R3]] ## Mem[41+Regs[R3]]
LF F0,50(R3) | Load float | Regs[F0] ←_32 Mem[50+Regs[R3]]
LD F0,50(R2) | Load double | Regs[F0] ## Regs[F1] ←_64 Mem[50+Regs[R2]]
SW R3,500(R4) | Store word | Mem[500+Regs[R4]] ←_32 Regs[R3]
SF F0,40(R3) | Store float | Mem[40+Regs[R3]] ←_32 Regs[F0]
SD F0,40(R3) | Store double | Mem[40+Regs[R3]] ←_32 Regs[F0]; Mem[44+Regs[R3]] ←_32 Regs[F1]
SH R3,502(R2) | Store half | Mem[502+Regs[R2]] ←_16 Regs[R3]_{16..31}
SB R2,41(R3) | Store byte | Mem[41+Regs[R3]] ←_8 Regs[R2]_{24..31}

FIGURE 2.22 The load and store instructions in DLX. All use a single addressing mode and require that the memory value be aligned. Of course, both loads and stores are available for all the data types shown.

To understand these figures we need to introduce a few additional extensions to our C description language:

- A subscript is appended to the symbol ← whenever the length of the datum being transferred might not be clear. Thus, ←_n means transfer an n-bit quantity. We use x, y ← z to indicate that z should be transferred to x and y.
- A subscript is used to indicate selection of a bit from a field. Bits are labeled from the most-significant bit starting at 0. The subscript may be a single digit (e.g., Regs[R4]_0 yields the sign bit of R4) or a subrange (e.g., Regs[R3]_{24..31} yields the least-significant byte of R3).
- The variable Mem, used as an array that stands for main memory, is indexed by a byte address and may transfer any number of bytes.
- A superscript is used to replicate a field (e.g., 0^24 yields a field of zeros of length 24 bits).
- The symbol ## is used to concatenate two fields and may appear on either side of a data transfer.

A summary of the entire description language appears on the back inside cover. As an example, assuming that R8 and R10 are 32-bit registers:

    Regs[R10]_{16..31} ←_16 (Mem[Regs[R8]]_0)^8 ## Mem[Regs[R8]]

means that the byte at the memory location addressed by the contents of R8 is sign-extended to form a 16-bit quantity that is stored into the lower half of R10. (The upper half of R10 is unchanged.)

All ALU instructions are register-register instructions. The operations include simple arithmetic and logical operations: add, subtract, AND, OR, XOR, and shifts. Immediate forms of all these instructions, with a 16-bit sign-extended immediate, are provided. The operation LHI (load high immediate) loads the top half of a register, while setting the lower half to 0. This allows a full 32-bit constant to be built in two instructions, or a data transfer using any constant 32-bit address in one extra instruction.

As mentioned above, R0 is used to synthesize popular operations. Loading a constant is simply an add immediate where one of the source operands is R0, and a register-register move is simply an add where one of the sources is R0. (We sometimes use the mnemonic LI, standing for load immediate, to represent the former and the mnemonic MOV for the latter.)

There are also compare instructions, which compare two registers (=, ≠, <, >, ≤, ≥). If the condition is true, these instructions place a 1 in the destination register (to represent true); otherwise they place the value 0. Because these operations “set” a register, they are called set-equal, set-not-equal, set-less-than, and so on. There are also immediate forms of these compares. Figure 2.23 gives some examples of the arithmetic/logical instructions.
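To see how R0 lets a small instruction set synthesize load-immediate and move, the following sketch models a handful of the ALU operations just described. It is a simplified model, not a DLX simulator: the Regs class and function names are invented for this example, overflow trapping is ignored, and the two-instruction constant uses a value whose low half has its top bit clear so that the sign extension of the ADDI immediate does not matter.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(imm):
    """Sign-extend a 16-bit immediate field."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def to_signed32(x):
    """Interpret a 32-bit register value as a signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


class Regs:
    """32 general-purpose registers; R0 always reads as 0."""

    def __init__(self):
        self._r = [0] * 32

    def __getitem__(self, i):
        return self._r[i]

    def __setitem__(self, i, value):
        if i != 0:                      # writes to R0 have no effect
            self._r[i] = value & MASK32


def add(r, rd, rs1, rs2):
    r[rd] = r[rs1] + r[rs2]


def addi(r, rd, rs1, imm):
    r[rd] = r[rs1] + sign_extend16(imm)


def lhi(r, rd, imm):
    r[rd] = (imm & 0xFFFF) << 16        # imm ## 0^16


def slt(r, rd, rs1, rs2):
    r[rd] = 1 if to_signed32(r[rs1]) < to_signed32(r[rs2]) else 0


r = Regs()
addi(r, 1, 0, 42)        # LI  R1,#42   (ADDI R1,R0,#42)
add(r, 2, 1, 0)          # MOV R2,R1    (ADD  R2,R1,R0)
lhi(r, 3, 0x12)          # upper half of the constant 0x00123456
addi(r, 3, 3, 0x3456)    # lower half added in a second instruction
slt(r, 4, 0, 1)          # R4 = (0 < 42)
print(r[1], r[2], hex(r[3]), r[4])
```

Neither LI nor MOV needs its own opcode here; both fall out of add with R0 as one source, which is the point the text makes about synthesizing operations.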
Example instruction | Instruction name | Meaning
ADD R1,R2,R3 | Add | Regs[R1] ← Regs[R2]+Regs[R3]
ADDI R1,R2,#3 | Add immediate | Regs[R1] ← Regs[R2]+3
LHI R1,#42 | Load high immediate | Regs[R1] ← 42 ## 0^16
SLLI R1,R2,#5 | Shift left logical immediate | Regs[R1] ← Regs[R2]<<5
SLT R1,R2,R3 | Set less than | if (Regs[R2]<Regs[R3]) Regs[R1] ← 1 else Regs[R1] ← 0

FIGURE 2.23 Examples of arithmetic/logical instructions on DLX, both with and without immediates.

Control is handled through a set of jumps and a set of branches. The four jump instructions are differentiated by the two ways to specify the destination address and by whether or not a link is made. Two jumps use a 26-bit signed offset added to the program counter (of the instruction sequentially following the jump) to determine the destination address; the other two jump instructions specify a register that contains the destination address. There are two flavors of jumps: plain jump, and jump and link (used for procedure calls). The latter places the return address (the address of the next sequential instruction) in R31.

All branches are conditional. The branch condition is specified by the instruction, which may test the register source for zero or nonzero; the register may contain a data value or the result of a compare. The branch target address is specified with a 16-bit signed offset that is added to the program counter, which is pointing to the next sequential instruction. Figure 2.24 gives some typical branch and jump instructions. There is also a branch to test the floating-point status register for floating-point conditional branches, described below.
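The target-address rules above (a signed offset added to the address of the next sequential instruction, with jump and link saving that address) can be sketched as follows. The function names are invented for this example, the offset is taken to be a byte offset, and instructions are assumed to be 4 bytes long as stated for the DLX format.

```python
MASK32 = 0xFFFFFFFF


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def branch_target(pc, offset16):
    """Target of a taken branch: (PC + 4) plus the signed 16-bit offset."""
    return (pc + 4 + sign_extend(offset16, 16)) & MASK32


def beqz(pc, reg_value, offset16):
    """Next PC after BEQZ: the target if the register is zero, else PC + 4."""
    if reg_value == 0:
        return branch_target(pc, offset16)
    return (pc + 4) & MASK32


def jal(pc, offset26):
    """Jump and link: returns (target, value saved in R31)."""
    link = (pc + 4) & MASK32
    target = (pc + 4 + sign_extend(offset26, 26)) & MASK32
    return target, link


print(hex(beqz(0x1000, 0, 8)))              # taken branch
print(hex(beqz(0x1000, 7, 8)))              # not taken, falls through
print(hex(branch_target(0x1000, 0xFFFC)))   # offset of -4 bytes
print([hex(v) for v in jal(0x2000, 0x100)])
```

Adding the offset to PC + 4 rather than to PC is why an encoded offset of -4 lands back on the branch itself.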
Example instruction | Instruction name | Meaning
J name | Jump | PC ← name; ((PC+4) - 2^25) ≤ name < ((PC+4) + 2^25)
JAL name | Jump and link | Regs[R31] ← PC+4; PC ← name; ((PC+4) - 2^25) ≤ name < ((PC+4) + 2^25)
JALR R2 | Jump and link register | Regs[R31] ← PC+4; PC ← Regs[R2]
JR R3 | Jump register | PC ← Regs[R3]
BEQZ R4,name | Branch equal zero | if (Regs[R4]==0) PC ← name; ((PC+4) - 2^15) ≤ name < ((PC+4) + 2^15)
BNEZ R4,name | Branch not equal zero | if (Regs[R4]!=0) PC ← name; ((PC+4) - 2^15) ≤ name < ((PC+4) + 2^15)

FIGURE 2.24 Typical control-flow instructions in DLX. All control instructions, except jumps to an address in a register, are PC-relative. If the register operand is R0, BEQZ will always branch, but the compiler will usually prefer to use a jump with a longer offset over this “unconditional branch.”

Floating-point instructions manipulate the floating-point registers and indicate whether the operation to be performed is single or double precision. The operations MOVF and MOVD copy a single-precision (MOVF) or double-precision (MOVD) floating-point register to another register of the same type. The operations MOVFP2I and MOVI2FP move data between a single floating-point register and an integer register; moving a double-precision value to two integer registers requires two instructions. Integer multiply and divide that work on 32-bit floating-point registers are also provided, as are conversions from integer to floating point and vice versa.

The floating-point operations are add, subtract, multiply, and divide; a suffix D is used for double precision and a suffix F is used for single precision (e.g., ADDD, ADDF, SUBD, SUBF, MULTD, MULTF, DIVD, DIVF). Floating-point compares set a bit in the special floating-point status register that can be tested with a pair of branches: BFPT and BFPF, branch floating-point true and branch floating-point false.

One slightly unusual DLX characteristic is that it uses the floating-point unit for integer multiplies and divides.
As we shall see in Chapters 3 and 4, the control for the slower floating-point operations is much more complicated than for integer addition and subtraction. Since the floating-point unit already handles floating point multiply and divide, it is not much harder for it to perform the relatively slow operations of integer multiply and divide. Hence DLX requires that operands to be multiplied or divided be placed in floating-point registers. Figure 2.25 contains a list of all DLX operations and their meaning. To give an idea which instructions are popular, Figure 2.26 shows the frequency of instructions and instruction classes for five SPECint92 programs and Figure 2.27 shows the same data for five SPECfp92 programs. To give a more intuitive feeling, Figures 2.28 and 2.29 show the data graphically for all instructions that are responsible on average for more than 1% of the instructions executed. Effectiveness of DLX It would seem that an architecture with simple instruction formats, simple address modes, and simple operations would be slow, in part because it has to execute more instructions than more sophisticated designs. The performance equation from the last chapter reminds us that execution time is a function of more than just instruction count: CPU time = Instruction count × CPI × Clock cycle time To see whether reduction in instruction count is offset by increases in CPI or clock cycle time, we need to compare DLX to a sophisticated alternative. One example of a sophisticated instruction set architecture is the VAX. In the mid 1970s, when the VAX was designed, the prevailing philosophy was to create instruction sets that were close to programming languages to simplify compilers. 
For example, because programming languages had loops, instruction sets should have loop instructions, not just simple conditional branches; they needed call instructions that saved registers, not just simple jump and links; they needed case instructions, not just jump indirect; and so on. Following similar arguments, the VAX provided a large set of addressing modes and made sure that all addressing modes worked with all operations. Another prevailing philosophy was to minimize code size. Recall that DRAMs have grown in capacity by a factor of four every three years; thus in the mid 1970s DRAM chips contained less than 1/1000 the capacity of today’s DRAMs, so code space was also critical. Code space was de-emphasized in fixed-length instruction sets like DLX. For example, DLX address fields always use 16 bits, even when the address is very small. In contrast, the VAX allows instructions to be a variable number of bytes, so there is little wasted space in address fields.

Instruction type/opcode             Instruction meaning

Data transfers                      Move data between registers and memory, or between the integer and FP or special registers; only memory address mode is 16-bit displacement + contents of a GPR
  LB, LBU, SB                       Load byte, load byte unsigned, store byte
  LH, LHU, SH                       Load half word, load half word unsigned, store half word
  LW, SW                            Load word, store word (to/from integer registers)
  LF, LD, SF, SD                    Load SP float, load DP float, store SP float, store DP float
  MOVI2S, MOVS2I                    Move from/to GPR to/from a special register
  MOVF, MOVD                        Copy one FP register or a DP pair to another register or pair
  MOVFP2I, MOVI2FP                  Move 32 bits from/to FP registers to/from integer registers
Arithmetic/logical                  Operations on integer or logical data in GPRs; signed arithmetic trap on overflow
  ADD, ADDI, ADDU, ADDUI            Add, add immediate (all immediates are 16 bits); signed and unsigned
  SUB, SUBI, SUBU, SUBUI            Subtract, subtract immediate; signed and unsigned
  MULT, MULTU, DIV, DIVU            Multiply and divide, signed and unsigned; operands must be FP registers; all operations take and yield 32-bit values
  AND, ANDI                         And, and immediate
  OR, ORI, XOR, XORI                Or, or immediate, exclusive or, exclusive or immediate
  LHI                               Load high immediate; loads upper half of register with immediate
  SLL, SRL, SRA, SLLI, SRLI, SRAI   Shifts: both immediate (S__I) and variable form (S__); shifts are shift left logical, right logical, right arithmetic
  S__, S__I                         Set conditional: “__” may be LT, GT, LE, GE, EQ, NE
Control                             Conditional branches and jumps; PC-relative or through register
  BEQZ, BNEZ                        Branch GPR equal/not equal to zero; 16-bit offset from PC+4
  BFPT, BFPF                        Test comparison bit in the FP status register and branch; 16-bit offset from PC+4
  J, JR                             Jumps: 26-bit offset from PC+4 (J) or target in register (JR)
  JAL, JALR                         Jump and link: save PC+4 in R31, target is PC-relative (JAL) or a register (JALR)
  TRAP                              Transfer to operating system at a vectored address
  RFE                               Return to user code from an exception; restore user mode
Floating point                      FP operations on DP and SP formats
  ADDD, ADDF                        Add DP, SP numbers
  SUBD, SUBF                        Subtract DP, SP numbers
  MULTD, MULTF                      Multiply DP, SP floating point
  DIVD, DIVF                        Divide DP, SP floating point
  CVTF2D, CVTF2I, CVTD2F, CVTD2I, CVTI2F, CVTI2D
                                    Convert instructions: CVTx2y converts from type x to type y, where x and y are I (integer), D (double precision), or F (single precision). Both operands are FPRs.
  __D, __F                          DP and SP compares: “__” = LT, GT, LE, GE, EQ, NE; sets bit in FP status register

FIGURE 2.25 Complete list of the instructions in DLX. The formats of these instructions are shown in Figure 2.21. SP = single precision; DP = double precision. This list can also be found on the page preceding the back inside cover.

Instruction       compress  eqntott   espresso  gcc (cc1) li        Integer average
load              19.8%     30.6%     20.9%     22.8%     31.3%     26%
store             5.6%      0.6%      5.1%      14.3%     16.7%     9%
add               14.4%     8.5%      23.8%     14.6%     11.1%     14%
compare           15.4%     26.5%     8.3%      12.4%     5.4%      14%
load imm          8.1%      1.5%      1.3%      6.8%      2.4%      4%
cond branch       17.4%     24.0%     15.0%     11.5%     14.6%     17%
jump              1.5%      0.9%      0.5%      1.3%      1.8%      1%
call              0.1%      0.5%      0.4%      1.1%      3.1%      1%
return, jmp ind   0.1%      0.5%      0.5%      1.5%      3.5%      1%
shift             6.5%      0.3%      7.0%      6.2%      0.7%      4%
and               2.1%      0.1%      9.4%      1.6%      2.1%      3%
or                6.0%      5.5%      4.8%      4.2%      6.2%      5%
load FP                                                             0%
store FP                                                            0%
add FP                                                              0%
sub FP                                                              0%
mul FP                                                              0%
div FP                                                              0%
compare FP                                                          0%
mov reg-reg FP                                                      0%
other FP                                                            0%

Sparse rows, entries in the order they survive (column positions not recoverable from this copy): sub 1.8% 0.3% 0.5%; mul 0% 0.1% 0%; div 0%; other (xor, not) 1.0% 2.0% 0.5% 0.1% 1%.

FIGURE 2.26 DLX instruction mix for five SPECint92 programs. Note that integer register-register move instructions are included in the add instruction. Blank entries have the value 0.0%.

Designers of VAX machines later performed a quantitative comparison of VAX and a DLX-like machine for implementations with comparable organizations. Their choices were the VAX 8700 and the MIPS M2000.
The differing goals for VAX and MIPS have led to very different architectures. The VAX goals, simple compilers and code density, led to powerful addressing modes, powerful instructions, efficient instruction encoding, and few registers. The MIPS goals were high performance via pipelining, ease of hardware implementation, and compatibility with highly optimizing compilers. These goals led to simple instructions, simple addressing modes, fixed-length instruction formats, and a large number of registers.

Figure 2.30 shows the ratio of the number of instructions executed, the ratio of CPIs, and the ratio of performance measured in clock cycles. Since the organizations were similar, clock cycle times were assumed to be the same. MIPS executes about twice as many instructions as the VAX, while the CPI for the VAX is about six times larger than that for the MIPS. Hence the MIPS M2000 has almost three times the performance of the VAX 8700. Furthermore, much less hardware is needed to build the MIPS CPU than the VAX CPU. This cost/performance gap is the reason the company that used to make the VAX has dropped it and is now making a machine similar to DLX.

Columns: Instruction, doduc, ear, hydro2d, mdljdp2, su2cor, FP average. Table entries in the order they survive (row and column alignment not recoverable from this copy):
load 1.4% 0.2% 0.1% 1.1% 3.6% 1% 0.1% 1.3% 1% 10.9% 4.7% 9.7% 11% 0.7% 0% store 1.3% 0.1% add 13.6% 13.6% sub 0.3% 0.2% mul 0% div 0% compare 3.2% 3.1% 1.2% 0.3% 1.3% 2% load imm 2.2% cond branch 0.2% 2.2% 0.9% 1% 8.0% 10.1% 11.7% 9.3% 2.6% 8% jump 0.9% 0.4% 0.4% 0.1% 0% call 0.5% 1.9% 0.3% 1% return, jmp ind 0.6% 1.9% shift 2.0% 0.2% and 0.4% 0.3% 0.1% 0.1% 0.1% 1% 2.3% 2% 0.3% 1.3% 0.2% or 2.4% 0% 0.1% 0% 21.6% 23% other (xor, not) 0% load FP 23.3% 19.8% 24.1% 25.9% store FP 5.7% 11.4% 9.9% 10.0% 9.8% 9% add FP 8.8% 7.3% 3.6% 8.5% 12.4% 8% sub FP 3.8% 3.2% 7.9% 10.4% 5.9% 6% mul FP 12.0% 9.6% 9.4% 13.9% 21.6% 13% div FP 2.3% 1.6% 0.9% 0.7% 1% compare FP 4.2% 6.4% 10.4% 9.3% 0.8% 6% mov reg-reg FP 2.1% 1.8% 5.2% 0.9% 1.9% 2% other FP 2.4% 8.4% 0.2% 0.2% 1.2% 2%

FIGURE 2.27 DLX instruction mix for five programs from SPECfp92. Note that integer register-register move instructions are included in the add instruction. Blank entries have the value 0.0%.

[Bar chart. Horizontal axis: total dynamic count, 0% to 25%. One bar per program: compress, eqntott, espresso, gcc, li. Labeled averages: load int 26%, conditional branch 16%, add int 14%, compare int 13%, store int 9%, or 5%, shift 4%, and 3%.]

FIGURE 2.28 Graphical display of instructions executed of the five programs from SPECint92 in Figure 2.26. These instruction classes collectively are responsible on average for 92% of instructions executed.

[Bar chart. Horizontal axis: total dynamic count, 0% to 25%. One bar per program: doduc, ear, hydro2d, mdljdp2, su2cor. Labeled averages: load FP 23%, mul FP 13%, add int 11%, store FP 9%, conditional branch 8%, add FP 8%, sub FP 6%, compare FP 6%, mov reg FP 2%, shift 2%.]

FIGURE 2.29 Graphical display of instructions executed of the five programs from SPECfp92 in Figure 2.27. These instruction classes collectively are responsible on average for just under 90% of instructions executed.

[Line plot. Vertical axis: MIPS/VAX ratio, 0.0 to 4.0. Horizontal axis: SPEC89 benchmarks spice, matrix, nasa7, fpppp, tomcatv, doduc, espresso, eqntott, li. Curves: performance ratio, instructions executed ratio, CPI ratio.]

FIGURE 2.30 Ratio of MIPS M2000 to VAX 8700 in instructions executed and performance in clock cycles using SPEC89 programs. On average, MIPS executes a little over twice as many instructions as the VAX, but the CPI for the VAX is almost six times the MIPS CPI, yielding almost a threefold performance advantage. (Based on data from Bhandarkar and Clark [1991].)
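The threefold advantage quoted for the MIPS M2000 follows from the CPU time relation (CPU time = instruction count × CPI × clock cycle time). The sketch below works through that arithmetic under stated assumptions: it uses the rounded ratios from the text (about 2 for instructions executed, about 6 for CPI) and equal clock cycle times, and the function name and parameters are illustrative rather than taken from the book.

```python
def performance_ratio(instr_ratio, cpi_ratio, cycle_ratio=1.0):
    """Return the performance of machine A relative to machine B.

    instr_ratio: instructions executed on A divided by those on B
    cpi_ratio:   CPI of B divided by CPI of A
    cycle_ratio: clock cycle time of B divided by that of A

    CPU time = instruction count * CPI * clock cycle time, so
    time_B / time_A = cpi_ratio * cycle_ratio / instr_ratio.
    """
    return cpi_ratio * cycle_ratio / instr_ratio


# Rounded inputs from the text: MIPS executes about 2x the instructions,
# the VAX CPI is about 6x the MIPS CPI, and cycle times are assumed equal.
mips_vs_vax = performance_ratio(instr_ratio=2.0, cpi_ratio=6.0)
print(f"MIPS M2000 / VAX 8700 performance: about {mips_vs_vax:.1f}x")
```

With the measured averages (a little over 2 and almost 6) the result lands slightly below 3, which matches the "almost three times" wording.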
2.9 Fallacies and Pitfalls

Time and again architects have tripped on common, but erroneous, beliefs. In this section we look at a few of them.

Pitfall: Designing a “high-level” instruction set feature specifically oriented to supporting a high-level language structure.

Attempts to incorporate high-level language features in the instruction set have led architects to provide powerful instructions with a wide range of flexibility. But often these instructions do more work than is required in the frequent case, or they don’t exactly match the requirements of the language. Many such efforts have been aimed at eliminating what in the 1970s was called the semantic gap. Although the idea is to supplement the instruction set with additions that bring the hardware up to the level of the language, the additions can generate what Wulf [1981] has called a semantic clash:

    ... by giving too much semantic content to the instruction, the machine designer made it possible to use the instruction only in limited contexts. [p. 43]

More often the instructions are simply overkill: they are too general for the most frequent case, resulting in unneeded work and a slower instruction. Again, the VAX CALLS is a good example. CALLS uses a callee-save strategy (the registers to be saved are specified by the callee) but the saving is done by the call instruction in the caller. The CALLS instruction begins with the arguments pushed on the stack, and then takes the following steps:

1. Align the stack if needed.
2. Push the argument count on the stack.
3. Save the registers indicated by the procedure call mask on the stack (as mentioned in section 2.7). The mask is kept in the called procedure’s code; this permits callee save to be done by the caller even with separate compilation.
4. Push the return address on the stack, then push the top and base of stack pointers for the activation record.
5. Clear the condition codes, which sets the trap enables to a known state.
6. Push a word for status information and a zero word on the stack.
7. Update the two stack pointers.
8. Branch to the first instruction of the procedure.

The vast majority of calls in real programs do not require this amount of overhead. Most procedures know their argument counts, and a much faster linkage convention can be established using registers to pass arguments rather than the stack. Furthermore, the CALLS instruction forces two registers to be used for linkage, while many languages require only one linkage register. Many attempts to support procedure call and activation stack management have failed to be useful, either because they do not match the language needs or because they are too general and hence too expensive to use.

The VAX designers provided a simpler instruction, JSB, that is much faster since it only pushes the return PC on the stack and jumps to the procedure. However, most VAX compilers use the more costly CALLS instruction. The call instructions were included in the architecture to standardize the procedure linkage convention. Other machines have standardized their calling convention by agreement among compiler writers and without requiring the overhead of a complex, very general procedure call instruction.

Fallacy: There is such a thing as a typical program.

Many people would like to believe that there is a single “typical” program that could be used to design an optimal instruction set. For example, see the synthetic benchmarks discussed in Chapter 1. The data in this chapter clearly show that programs can vary significantly in how they use an instruction set. For example, Figure 2.31 shows the mix of data transfer sizes for four of the SPEC92 programs: It would be hard to say what is typical from these four programs. The variations are even larger on an instruction set that supports a class of applications, such as decimal instructions, that are unused by other applications.
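The same point can be checked against the instruction mixes earlier in the chapter. As a rough sketch, the snippet below takes the store and compare rows for the five SPECint92 programs from Figure 2.26 and reports the mean alongside the range; the helper name `summarize` is made up for this illustration.

```python
# Store and compare frequencies (percent of instructions executed) for the
# five SPECint92 programs, copied from Figure 2.26 in the order
# compress, eqntott, espresso, gcc, li.
STORE = [5.6, 0.6, 5.1, 14.3, 16.7]
COMPARE = [15.4, 26.5, 8.3, 12.4, 5.4]


def summarize(values):
    """Return (mean, minimum, maximum) of a list of percentages."""
    mean = sum(values) / len(values)
    return mean, min(values), max(values)


for name, values in (("store", STORE), ("compare", COMPARE)):
    mean, low, high = summarize(values)
    print(f"{name}: mean {mean:.1f}%, range {low:.1f}% to {high:.1f}%")
```

Store frequency alone spans 0.6% to 16.7% across these programs, so a design tuned to the mean would fit none of them particularly well.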
[Bar chart. Horizontal axis: frequency of reference by size, 0% to 100%. Sizes: double word, word, half word, byte. One bar per program: hydro2d, ear, eqntott, compress.]

FIGURE 2.31 Data reference size of four programs from SPEC92. Although you can calculate an average size, it would be hard to claim the average is typical of programs.

Fallacy: An architecture with flaws cannot be successful.

The 80x86 provides a dramatic example: The architecture is one only its creators could love (see Appendix D). Succeeding generations of Intel engineers have tried to correct unpopular architectural decisions made in designing the 80x86. For example, the 80x86 supports segmentation, whereas all others picked paging; the 80x86 uses extended accumulators for integer data, but other machines use general-purpose registers; and it uses a stack for floating-point data when everyone else abandoned execution stacks long before. Despite these major difficulties, the 80x86 architecture, because of its selection as the microprocessor in the IBM PC, has been enormously successful.

Fallacy: You can design a flawless architecture.

All architecture design involves trade-offs made in the context of a set of hardware and software technologies. Over time those technologies are likely to change, and decisions that may have been correct at the time they were made look like mistakes. For example, in 1975 the VAX designers overemphasized the importance of code-size efficiency, underestimating how important ease of decoding and pipelining would be 10 years later. Almost all architectures eventually succumb to the lack of sufficient address space. However, avoiding this problem in the long run would probably mean compromising the efficiency of the architecture in the short run.

2.10 Concluding Remarks

The earliest architectures were limited in their instruction sets by the hardware technology of that time.
As soon as the hardware technology permitted, architects began looking for ways to support high-level languages. This search led to three distinct periods of thought about how to support programs efficiently. In the 1960s, stack architectures became popular. They were viewed as being a good match for high-level languages, and they probably were, given the compiler technology of the day. In the 1970s, the main concern of architects was how to reduce software costs. This concern was met primarily by replacing software with hardware, or by providing high-level architectures that could simplify the task of software designers. The result was both the high-level-language computer architecture movement and powerful architectures like the VAX, which has a large number of addressing modes, multiple data types, and a highly orthogonal architecture. In the 1980s, more sophisticated compiler technology and a renewed emphasis on machine performance saw a return to simpler architectures, based mainly on the load-store style of machine.

Today, there is widespread agreement on instruction set design. However, in the next decade we expect to see change in the following areas:

• The 32-bit address instruction sets are being extended to 64-bit addresses, expanding the width of the registers (among other things) to 64 bits. Appendix C gives three examples of architectures that have gone from 32 bits to 64 bits.
• Given the popularity of software for the 80x86 architecture, many companies are looking to see if changes to load-store instruction sets can significantly improve performance when emulating the 80x86 architecture.
• In the next two chapters we will see that conditional branches can limit the performance of aggressive computer designs. Hence there is interest in replacing conditional branches with conditional completion of operations, such as conditional move (see Chapter 4).
• Chapter 5 explains the increasing role of memory hierarchy in performance of machines, with a cache miss on some machines taking almost as many instruction times as page faults took on earlier machines. Hence there are investigations into hiding the cost of cache misses by prefetching and by allowing caches and CPUs to proceed while servicing a miss (see Chapter 5).
• Appendix A describes new operations to enhance floating-point performance, such as operations that perform a multiply and an add. Support for quadruple precision, at least for data transfer, may also be coming down the line.

Between 1970 and 1985 many thought the primary job of the computer architect was the design of instruction sets. As a result, textbooks of that era emphasize instruction set design, much as computer architecture textbooks of the 1950s and 1960s emphasized computer arithmetic. The educated architect was expected to have strong opinions about the strengths and especially the weaknesses of the popular machines. The importance of binary compatibility in quashing innovations in instruction set design was unappreciated by many researchers and textbook writers, giving the impression that many architects would get a chance to design an instruction set.

The definition of computer architecture today has been expanded to include design and evaluation of the full computer system, not just the definition of the instruction set, and hence there are plenty of topics for the architect to study. (You may have guessed this the first time you lifted this book.) Hence the bulk of this book is on design of computers versus instruction sets. Readers interested in instruction set architecture may be satisfied by the appendices: Appendix C compares four popular load-store machines with DLX. Appendix D describes the most widely used instruction set, the Intel 80x86, and compares instruction counts for it with that of DLX for several programs.
2.11 Historical Perspective and References

    One’s eyebrows should rise whenever a future architecture is developed with a stack- or register-oriented instruction set. [p. 20]
    Meyers [1978]

The earliest computers, including the UNIVAC I, the EDSAC, and the IAS machines, were accumulator-based machines. The simplicity of this type of machine made it the natural choice when hardware resources were very constrained. The first general-purpose register machine was the Pegasus, built by Ferranti, Ltd. in 1956. The Pegasus had eight general-purpose registers, with R0 always being zero. Block transfers loaded the eight registers from the drum.

In 1963, Burroughs delivered the B5000. The B5000 was perhaps the first machine to seriously consider software and hardware-software trade-offs. Barton and the designers at Burroughs made the B5000 a stack architecture (as described in Barton [1961]). Designed to support high-level languages such as ALGOL, this stack architecture used an operating system (MCP) written in a high-level language. The B5000 was also the first machine from a U.S. manufacturer to support virtual memory. The B6500, introduced in 1968 (and discussed in Hauck and Dent [1968]), added hardware-managed activation records. In both the B5000 and B6500, the top two elements of the stack were kept in the CPU and the rest of the stack was kept in memory. The stack architecture yielded good code density, but only provided two high-speed storage locations. The authors of both the original IBM 360 paper [Amdahl, Blaauw, and Brooks 1964] and the original PDP-11 paper [Bell et al. 1970] argue against the stack organization. They cite three major points in their arguments against stacks:

1. Performance is derived from fast registers, not the way they are used.
2. The stack organization is too limiting and requires many swap and copy operations.
3. The stack has a bottom, and when placed in slower memory there is a performance loss.

Stack-based machines fell out of favor in the late 1970s and, except for the Intel 80x86 floating-point architecture, essentially disappeared. For example, except for the 80x86, none of the machines listed in the SPEC reports uses a stack.

The term computer architecture was coined by IBM in the early 1960s. Amdahl, Blaauw, and Brooks [1964] used the term to refer to the programmer-visible portion of the IBM 360 instruction set. They believed that a family of machines of the same architecture should be able to run the same software. Although this idea may seem obvious to us today, it was quite novel at that time. IBM, even though it was the leading company in the industry, had five different architectures before the 360. Thus, the notion of a company standardizing on a single architecture was a radical one. The 360 designers hoped that six different divisions of IBM could be brought together by defining a common architecture. Their definition of architecture was

    ... the structure of a computer that a machine language programmer must understand to write a correct (timing independent) program for that machine.

The term “machine language programmer” meant that compatibility would hold, even in assembly language, while “timing independent” allowed different implementations.

The IBM 360 was the first machine to sell in large quantities with both byte addressing using 8-bit bytes and general-purpose registers. The 360 also had register-memory and limited memory-memory instructions.

In 1964, Control Data delivered the first supercomputer, the CDC 6600. As Thornton [1964] discusses, he, Cray, and the other 6600 designers were the first to explore pipelining in depth. The 6600 was the first general-purpose, load-store machine.
In the 1960s, the designers of the 6600 realized the need to simplify architecture for the sake of efficient pipelining. This interaction between architectural simplicity and implementation was largely neglected during the 1970s by microprocessor and minicomputer designers, but it was brought back in the 1980s.

In the late 1960s and early 1970s, people realized that software costs were growing faster than hardware costs. McKeeman [1967] argued that compilers and operating systems were getting too big and too complex and taking too long to develop. Because of inferior compilers and the memory limitations of machines, most systems programs at the time were still written in assembly language. Many researchers proposed alleviating the software crisis by creating more powerful, software-oriented architectures.

Tanenbaum [1978] studied the properties of high-level languages. Like other researchers, he found that most programs are simple. He then argued that architectures should be designed with this in mind and should optimize program size and ease of compilation. Tanenbaum proposed a stack machine with frequency-encoded instruction formats to accomplish these goals. However, as we have observed, program size does not translate directly to cost/performance, and stack machines faded out shortly after this work.

Strecker’s article [1978] discusses how he and the other architects at DEC responded to this by designing the VAX architecture. The VAX was designed to simplify compilation of high-level languages. Compiler writers had complained about the lack of complete orthogonality in the PDP-11. The VAX architecture was designed to be highly orthogonal and to allow the mapping of a high-level-language statement into a single VAX instruction. Additionally, the VAX designers tried to optimize code size because compiled programs were often too large for available memories.

The VAX-11/780 was the first machine announced in the VAX series.
It is one of the most successful and heavily studied machines ever built. The cornerstone of DEC’s strategy was a single architecture, VAX, running a single operating system, VMS. This strategy worked well for over 10 years. The large number of papers reporting instruction mixes, implementation measurements, and analysis of the VAX make it an ideal case study [Wiecek 1982; Clark and Levy 1982]. Bhandarkar and Clark [1991] give a quantitative analysis of the disadvantages of the VAX versus a RISC machine, essentially a technical explanation for the demise of the VAX.

While the VAX was being designed, a more radical approach, called high-level-language computer architecture (HLLCA), was being advocated in the research community. This movement aimed to eliminate the gap between high-level languages and computer hardware, what Gagliardi [1973] called the “semantic gap,” by bringing the hardware “up to” the level of the programming language. Meyers [1982] provides a good summary of the arguments and a history of high-level-language computer architecture projects.

HLLCA never had a significant commercial impact. The increase in memory size on machines and the use of virtual memory eliminated the code-size problems arising from high-level languages and operating systems written in high-level languages. The combination of simpler architectures together with software offered greater performance and more flexibility at lower cost and lower complexity.

In the early 1980s, the direction of computer architecture began to swing away from providing high-level hardware support for languages. Ditzel and Patterson [1980] analyzed the difficulties encountered by the high-level-language architectures and argued that the answer lay in simpler architectures. In another paper [Patterson and Ditzel 1980], these authors first discussed the idea of reduced instruction set computers (RISC) and presented the argument for simpler architectures.
Their proposal was rebutted by Clark and Strecker [1980].

The simple load-store machines from which DLX is derived are commonly called RISC architectures. The roots of RISC architectures go back to machines like the 6600, where Thornton, Cray, and others recognized the importance of instruction set simplicity in building a fast machine. Cray continued his tradition of keeping machines simple in the CRAY-1. However, DLX and its close relatives are built primarily on the work of three research projects: the Berkeley RISC processor, the IBM 801, and the Stanford MIPS processor. These architectures have attracted enormous industrial interest because of claims of a performance advantage of anywhere from two to five times over other machines using the same technology.

Begun in 1975, the IBM project was the first to start but was the last to become public. The IBM machine was designed as an ECL minicomputer, while the university projects were both MOS-based microprocessors. John Cocke is considered to be the father of the 801 design. He received both the Eckert-Mauchly and Turing awards in recognition of his contribution. Radin [1982] describes the highlights of the 801 architecture. The 801 was an experimental project that was never designed to be a product. In fact, to keep down cost and complexity, the machine was built with only 24-bit registers.

In 1980, Patterson and his colleagues at Berkeley began the project that was to give this architectural approach its name (see Patterson and Ditzel [1980]). They built two machines called RISC-I and RISC-II. Because the IBM project was not widely known or discussed, the role played by the Berkeley group in promoting the RISC approach was critical to the acceptance of the technology. The Berkeley group went on to build RISC machines targeted toward Smalltalk, described by Ungar et al. [1984], and LISP, described by Taylor et al. [1986].
In 1981, Hennessy and his colleagues at Stanford published a description of the Stanford MIPS machine. Efficient pipelining and compiler-assisted scheduling of the pipeline were both key aspects of the original MIPS design.

These early RISC machines (the 801, RISC-II, and MIPS) had much in common. Both university projects were interested in designing a simple machine that could be built in VLSI within the university environment. All three machines used a simple load-store architecture, fixed-format 32-bit instructions, and emphasized efficient pipelining. Patterson [1985] describes the three machines and the basic design principles that have come to characterize what a RISC machine is. Hennessy [1984] provides another view of the same ideas, as well as other issues in VLSI processor design.

In 1985, Hennessy published an explanation of the RISC performance advantage and traced its roots to a substantially lower CPI: under 2 for a RISC machine and over 10 for a VAX-11/780 (though not with identical workloads). A paper by Emer and Clark [1984] characterizing VAX-11/780 performance was instrumental in helping the RISC researchers understand the source of the performance advantage seen by their machines.

Since the university projects finished up, in the 1983–84 time frame, the technology has been widely embraced by industry. Many manufacturers of the early computers (those made before 1986) claimed that their products were RISC machines. However, these claims were often born more of marketing ambition than of engineering reality.

In 1986, the computer industry began to announce processors based on the technology explored by the three RISC research projects. Moussouris et al. [1986] describe the MIPS R2000 integer processor, while Kane’s book [1986] is a complete description of the architecture. Hewlett-Packard converted their existing minicomputer line to RISC architectures; the HP Precision Architecture is described by Lee [1989].
IBM never directly turned the 801 into a product. Instead, the ideas were adopted for a new, low-end architecture that was incorporated in the IBM RT-PC and described in a collection of papers [Waters 1986]. In 1990, IBM announced a new RISC architecture (the RS 6000), which is the first superscalar RISC machine (see Chapter 4). In 1987, Sun Microsystems began delivering machines based on the SPARC architecture, a derivative of the Berkeley RISC-II machine; SPARC is described in Garner et al. [1988]. The PowerPC joined the forces of Apple, IBM, and Motorola. Appendix C summarizes several RISC architectures. Prior to the RISC architecture movement, the major trend had been highly microcoded architectures aimed at reducing the semantic gap. DEC, with the VAX, and Intel, with the iAPX 432, were among the leaders in this approach. Today it is hard to find a computer company without a RISC product. With the 1994 announcement that Hewlett Packard and Intel will eventually have a common architecture, the end of the 1970s architectures draws near. 2.11 Historical Perspective and References 117 References AMDAHL, G. M., G. A. BLAAUW, AND F. P. BROOKS, JR. [1964]. “Architecture of the IBM System 360,” IBM J. Research and Development 8:2 (April), 87–101. BARTON, R. S. [1961]. “A new approach to the functional design of a computer,” Proc. Western Joint Computer Conf., 393–396. BELL, G., R. CADY, H. MCFARLAND, B. DELAGI, J. O’LAUGHLIN, R. NOONAN, AND W. WULF [1970]. “A new architecture for mini-computers: The DEC PDP-11,” Proc. AFIPS SJCC, 657–675. BHANDARKAR, D., AND D. W. CLARK [1991]. “Performance from architecture: Comparing a RISC and a CISC with similar hardware organizations,” Proc. Fourth Conf. on Architectural Support for Programming Languages and Operating Systems, IEEE/ACM (April), Palo Alto, Calif., 310–19. CHOW, F. C. [1983]. A Portable Machine-Independent Global Optimizer—Design and Measurements, Ph.D. Thesis, Stanford Univ. (December). CLARK, D. AND H. 
LEVY [1982]. “Measurement and analysis of instruction set use in the VAX-11/ 780,” Proc. Ninth Symposium on Computer Architecture (April), Austin, Tex., 9–17. CLARK, D. AND W. D. STRECKER [1980]. “Comments on ‘the case for the reduced instruction set computer’,” Computer Architecture News 8:6 (October), 34–38. CRAWFORD, J. AND P. GELSINGER [1988]. Programming the 80386, Sybex Books, Alameda, Calif. DITZEL, D. R. AND D. A. PATTERSON [1980]. “Retrospective on high-level language computer architecture,” in Proc. Seventh Annual Symposium on Computer Architecture, La Baule, France (June), 97–104. EMER, J. S. AND D. W. CLARK [1984]. “A characterization of processor performance in the VAX-11/ 780,” Proc. 11th Symposium on Computer Architecture (June), Ann Arbor, Mich., 301–310. GAGLIARDI, U. O. [1973]. “Report of workshop 4–Software-related advances in computer hardware,” Proc. Symposium on the High Cost of Software, Menlo Park, Calif., 99–120. GARNER, R., A. AGARWAL, F. BRIGGS, E. BROWN, D. HOUGH, B. JOY, S. KLEIMAN, S. MUNCHNIK, M. NAMJOO, D. PATTERSON, J. PENDLETON, AND R. TUCK [1988]. “Scalable processor architecture (SPARC),” COMPCON, IEEE (March), San Francisco, 278–283. HAUCK, E. A., AND B. A. DENT [1968]. “Burroughs’ B6500/B7500 stack mechanism,” Proc. AFIPS SJCC, 245–251. HENNESSY, J. [1984]. “VLSI processor architecture,” IEEE Trans. on Computers C-33:11 (December), 1221–1246. HENNESSY, J. [1985]. “VLSI RISC processors,” VLSI Systems Design VI:10 (October), 22–32. HENNESSY, J., N. JOUPPI, F. BASKETT, AND J. GILL [1981]. “MIPS: A VLSI processor architecture,” Proc. CMU Conf. on VLSI Systems and Computations (October), Computer Science Press, Rockville, Md. KANE, G. [1986]. MIPS R2000 RISC Architecture, Prentice Hall, Englewood Cliffs, N.J. LEE, R. [1989]. “Precision architecture,” Computer 22:1 (January), 78–91. LEVY, H. AND R. ECKHOUSE [1989]. Computer Programming and Architecture: The VAX, Digital Press, Boston. LUNDE, A. [1977]. 
“Empirical evaluation of some features of instruction set processor architecture,” Comm. ACM 20:3 (March), 143–152. MCKEEMAN, W. M. [1967]. “Language directed computer design,” Proc. 1967 Fall Joint Computer Conf., Washington, D.C., 413–417. MEYERS, G. J. [1978]. “The evaluation of expressions in a storage-to-storage architecture,” Computer Architecture News 7:3 (October), 20–23. 118 Chapter 2 Instruction Set Principles and Examples MEYERS, G. J. [1982]. Advances in Computer Architecture, 2nd ed., Wiley, New York. MOUSSOURIS, J., L. CRUDELE, D. FREITAS, C. HANSEN, E. HUDSON, S. PRZYBYLSKI, T. RIORDAN, AND C. ROWEN [1986]. “A CMOS RISC processor with integrated system functions,” Proc. COMPCON, IEEE (March), San Francisco, 191. PATTERSON, D. [1985]. “Reduced instruction set computers,” Comm. ACM 28:1 (January), 8–21. PATTERSON, D. A. AND D. R. DITZEL [1980]. “The case for the reduced instruction set computer,” Computer Architecture News 8:6 (October), 25–33. RADIN, G. [1982]. “The 801 minicomputer,” Proc. Symposium Architectural Support for Programming Languages and Operating Systems (March), Palo Alto, Calif., 39–47. STRECKER, W. D. [1978]. “VAX-11/780: A virtual address extension of the PDP-11 family,” Proc. AFIPS National Computer Conf. 47, 967–980. TANENBAUM, A. S. [1978]. “Implications of structured programming for machine architecture,” Comm. ACM 21:3 (March), 237–246. TAYLOR, G., P. HILFINGER, J. LARUS, D. PATTERSON, AND B. ZORN [1986]. “Evaluation of the SPUR LISP architecture,” Proc. 13th Symposium on Computer Architecture (June), Tokyo. THORNTON, J. E. [1964]. “Parallel operation in Control Data 6600,” Proc. AFIPS Fall Joint Computer Conf. 26, part 2, 33–40. UNGAR, D., R. BLAU, P. FOLEY, D. SAMPLES, AND D. PATTERSON [1984]. “Architecture of SOAR: Smalltalk on a RISC,” Proc. 11th Symposium on Computer Architecture (June), Ann Arbor, Mich., 188–197. WAKERLY, J. [1989]. Microcomputer Architecture and Programming, J. Wiley, New York. WATERS, F., ED. [1986]. 
IBM RT Personal Computer Technology, IBM, Austin, Tex., SA 23-1057.
WIECEK, C. [1982]. “A case study of the VAX 11 instruction set usage for compiler execution,” Proc. Symposium on Architectural Support for Programming Languages and Operating Systems (March), IEEE/ACM, Palo Alto, Calif., 177–184.
WULF, W. [1981]. “Compilers and computer architecture,” Computer 14:7 (July), 41–47.

EXERCISES

2.1 [20/15/10] <2.3,2.8> We are designing instruction set formats for a load-store architecture and are trying to decide whether it is worthwhile to have multiple offset lengths for branches and memory references. We have decided that both branch and memory references can have only 0-, 8-, and 16-bit offsets. The length of an instruction would be equal to 16 bits + offset length in bits. ALU instructions will be 16 bits. Figure 2.32 contains the data in cumulative form. Assume an additional bit is needed for the sign on the offset. For instruction set frequencies, use the data for DLX from the average of the five benchmarks for the load-store machine in Figure 2.26. Assume that the miscellaneous instructions are all ALU instructions that use only registers.

a. [20] <2.3,2.8> Suppose offsets were permitted to be 0, 8, or 16 bits in length, including the sign bit. What is the average length of an executed instruction?

b. [15] <2.3,2.8> Suppose we wanted a fixed-length instruction and we chose a 24-bit instruction length (for everything, including ALU instructions). For every offset of longer than 8 bits, an additional instruction is required.
Determine the number of instruction bytes fetched in this machine with fixed instruction size versus those fetched with a byte-variable-sized instruction as defined in part (a).

c. [10] <2.3,2.8> Now suppose we use a fixed offset length of 16 bits so that no additional instruction is ever required. How many instruction bytes would be required? Compare this result to your answer to part (b), which used 8-bit fixed offsets that used additional instruction words when larger offsets were required.

Offset bits   Cumulative data references   Cumulative branches
 0             17%                           0%
 1             17%                           0%
 2             23%                          24%
 3             32%                          49%
 4             40%                          64%
 5             48%                          79%
 6             54%                          87%
 7             57%                          93%
 8             60%                          98%
 9             61%                          99%
10             69%                         100%
11             71%                         100%
12             75%                         100%
13             78%                         100%
14             80%                         100%
15            100%                         100%

FIGURE 2.32 The second and third columns contain the cumulative percentage of the data references and branches, respectively, that can be accommodated with the corresponding number of bits of magnitude in the displacement. These are the average distances of all 10 programs in Figure 2.7.

2.2 [15/10] <2.2> Several researchers have suggested that adding a register-memory addressing mode to a load-store machine might be useful. The idea is to replace sequences of

    LOAD R1,0(Rb)
    ADD R2,R2,R1

by

    ADD R2,0(Rb)

Assume the new instruction will cause the clock cycle to increase by 10%. Use the instruction frequencies for the gcc benchmark on the load-store machine from Figure 2.26. The new instruction affects only the clock cycle and not the CPI.

a. [15] <2.2> What percentage of the loads must be eliminated for the machine with the new instruction to have at least the same performance?

b. [10] <2.2> Show a situation in a multiple instruction sequence where a load of R1 followed immediately by a use of R1 (with some type of opcode) could not be replaced by a single instruction of the form proposed, assuming that the same opcode exists.
2.3 [20] <2.2> Your task is to compare the memory efficiency of four different styles of instruction set architectures. The architecture styles are

1. Accumulator: All operations occur between a single register and a memory location.
2. Memory-memory: All three operands of each instruction are in memory.
3. Stack: All operations occur on top of the stack. Only push and pop access memory; all other instructions remove their operands from stack and replace them with the result. The implementation uses a stack for the top two entries; accesses that use other stack positions are memory references.
4. Load-store: All operations occur in registers, and register-to-register instructions have three operands per instruction. There are 16 general-purpose registers, and register specifiers are 4 bits long.

To measure memory efficiency, make the following assumptions about all four instruction sets:

• The opcode is always 1 byte (8 bits).
• All memory addresses are 2 bytes (16 bits).
• All data operands are 4 bytes (32 bits).
• All instructions are an integral number of bytes in length.

There are no other optimizations to reduce memory traffic, and the variables A, B, C, and D are initially in memory. Invent your own assembly language mnemonics and write the best equivalent assembly language code for the high-level-language fragment given. Write the four code sequences for

    A = B + C;
    B = A + C;
    D = A - B;

Calculate the instruction bytes fetched and the memory-data bytes transferred. Which architecture is most efficient as measured by code size? Which architecture is most efficient as measured by total memory bandwidth required (code + data)?

2.4 [Discussion] <2.2–2.9> What are the economic arguments (i.e., more machines sold) for and against changing instruction set architecture?

2.5 [25] <2.1–2.5> Find an instruction set manual for some older machine (libraries and private bookshelves are good places to look).
Summarize the instruction set with the discriminating characteristics used in Figure 2.2. Write the code sequence for this machine for the statements in Exercise 2.3. The size of the data need not be 32 bits as in Exercise 2.3 if the word size is smaller in the older machine.

2.6 [20] <2.8> Consider the following fragment of C code:

    for (i=0; i<=100; i++) {A[i] = B[i] + C;}

Assume that A and B are arrays of 32-bit integers, and C and i are 32-bit integers. Assume that all data values and their addresses are kept in memory (at addresses 0, 5000, 1500, and 2000 for A, B, C, and i, respectively) except when they are operated on. Assume that values in registers are lost between iterations of the loop. Write the code for DLX; how many instructions are required dynamically? How many memory-data references will be executed? What is the code size in bytes?

2.7 [20] <App. D> Repeat Exercise 2.6, but this time write the code for the 80x86.

2.8 [20] <2.8> For this question use the code sequence of Exercise 2.6, but put the scalar data (the value of i, the value of C, and the addresses of the array variables, but not the actual array) in registers and keep them there whenever possible. Write the code for DLX; how many instructions are required dynamically? How many memory-data references will be executed? What is the code size in bytes?

2.9 [20] <App. D> Make the same assumptions and answer the same questions as the prior exercise, but this time write the code for the 80x86.

2.10 [15] <2.8> When designing memory systems it becomes useful to know the frequency of memory reads versus writes and also accesses for instructions versus data. Using the average instruction-mix information for DLX in Figure 2.26, find

• the percentage of all memory accesses for data
• the percentage of data accesses that are reads
• the percentage of all memory accesses that are reads

Ignore the size of a datum when counting accesses.
2.11 [18] <2.8> Compute the effective CPI for DLX using Figure 2.26. Suppose we have made the following measurements of average CPI for instructions:

Instruction               Clock cycles
All ALU instructions      1.0
Loads-stores              1.4
Conditional branches
  Taken                   2.0
  Not taken               1.5
Jumps                     1.2

Assume that 60% of the conditional branches are taken and that all instructions in the miscellaneous category of Figure 2.26 are ALU instructions. Average the instruction frequencies of gcc and espresso to obtain the instruction mix.

2.12 [20/10] <2.3,2.8> Consider adding a new index addressing mode to DLX. The addressing mode adds two registers and an 11-bit signed offset to get the effective address. Our compiler will be changed so that code sequences of the form

    ADD R1, R1, R2
    LW Rd, 100(R1) (or store)

will be replaced with a load (or store) using the new addressing mode. Use the overall average instruction frequencies from Figure 2.26 in evaluating this addition.

a. [20] <2.3,2.8> Assume that the addressing mode can be used for 10% of the displacement loads and stores (accounting for both the frequency of this type of address calculation and the shorter offset). What is the ratio of instruction count on the enhanced DLX compared to the original DLX?

b. [10] <2.3,2.8> If the new addressing mode lengthens the clock cycle by 5%, which machine will be faster and by how much?

2.13 [25/15] <2.7> Find a C compiler and compile the code shown in Exercise 2.6 for one of the machines covered in this book. Compile the code both optimized and unoptimized.

a. [25] <2.7> Find the instruction count, dynamic instruction bytes fetched, and data accesses done for both the optimized and unoptimized versions.

b. [15] <2.7> Try to improve the code by hand and compute the same measures as in part (a) for your hand-optimized version.

2.14 [30] <2.8> Small synthetic benchmarks can be very misleading when used for measuring instruction mixes.
This is particularly true when these benchmarks are optimized. In this exercise and Exercises 2.15–2.17, we want to explore these differences. These programming exercises can be done with any load-store machine. Compile Whetstone with optimization. Compute the instruction mix for the top 20 most frequently executed instructions. How do the optimized and unoptimized mixes compare? How does the optimized mix compare to the mix for spice on the same or a similar machine?

2.15 [30] <2.8> Follow the same guidelines as the prior exercise, but this time use Dhrystone and compare it with TeX.

2.16 [30] <2.8> Many computer manufacturers now include tools or simulators that allow you to measure the instruction set usage of a user program. Among the methods in use are machine simulation, hardware-supported trapping, and a compiler technique that instruments the object-code module by inserting counters. Find a processor available to you that includes such a tool. Use it to measure the instruction set mix for one of TeX, gcc, or spice. Compare the results to those shown in this chapter.

2.17 [30] <2.3,2.8> DLX has only three operand formats for its register-register operations. Many operations might use the same destination register as one of the sources. We could introduce a new instruction format into DLX called R2 that has only two operands and is a total of 24 bits in length. By using this instruction type whenever an operation had only two different register operands, we could reduce the instruction bandwidth required for a program. Modify the DLX simulator to count the frequency of register-register operations with only two different register operands. Using the benchmarks that come with the simulator, determine how much more instruction bandwidth DLX requires than DLX with the R2 format.

2.18 [25] <App. C> How much do the instruction set variations among the RISC machines discussed in Appendix C affect performance?
Choose at least three small programs (e.g., a sort), and code these programs in DLX and two other assembly languages. What is the resulting difference in instruction count?

3 Pipelining

It is quite a three-pipe problem.
Sir Arthur Conan Doyle, The Adventures of Sherlock Holmes

3.1 What Is Pipelining?
3.2 The Basic Pipeline for DLX
3.3 The Major Hurdle of Pipelining: Pipeline Hazards
3.4 Data Hazards
3.5 Control Hazards
3.6 What Makes Pipelining Hard to Implement?
3.7 Extending the DLX Pipeline to Handle Multicycle Operations
3.8 Crosscutting Issues: Instruction Set Design and Pipelining
3.9 Putting It All Together: The MIPS R4000 Pipeline
3.10 Fallacies and Pitfalls
3.11 Concluding Remarks
3.12 Historical Perspective and References
Exercises

3.1 What Is Pipelining?

Pipelining is an implementation technique whereby multiple instructions are overlapped in execution. Today, pipelining is the key implementation technique used to make fast CPUs.

A pipeline is like an assembly line. In an automobile assembly line, there are many steps, each contributing something to the construction of the car. Each step operates in parallel with the other steps, though on a different car. In a computer pipeline, each step in the pipeline completes a part of an instruction. Like the assembly line, different steps are completing different parts of different instructions in parallel. Each of these steps is called a pipe stage or a pipe segment. The stages are connected one to the next to form a pipe: instructions enter at one end, progress through the stages, and exit at the other end, just as cars would in an assembly line.

In an automobile assembly line, throughput is defined as the number of cars per hour and is determined by how often a completed car exits the assembly line. Likewise, the throughput of an instruction pipeline is determined by how often an instruction exits the pipeline.
Because the pipe stages are hooked together, all the stages must be ready to proceed at the same time, just as we would require in an assembly line. The time required between moving an instruction one step down the pipeline is a machine cycle. Because all stages proceed at the same time, the length of a machine cycle is determined by the time required for the slowest pipe stage, just as in an auto assembly line, the longest step would determine the time between advancing the line. In a computer, this machine cycle is usually one clock cycle (sometimes it is two, rarely more), although the clock may have multiple phases.

The pipeline designer’s goal is to balance the length of each pipeline stage, just as the designer of the assembly line tries to balance the time for each step in the process. If the stages are perfectly balanced, then the time per instruction on the pipelined machine, assuming ideal conditions, is equal to

$$\frac{\text{Time per instruction on unpipelined machine}}{\text{Number of pipe stages}}$$

Under these conditions, the speedup from pipelining equals the number of pipe stages, just as an assembly line with n stages can ideally produce cars n times as fast. Usually, however, the stages will not be perfectly balanced; furthermore, pipelining does involve some overhead. Thus, the time per instruction on the pipelined machine will not have its minimum possible value, yet it can be close.

Pipelining yields a reduction in the average execution time per instruction. Depending on what you consider as the base line, the reduction can be viewed as decreasing the number of clock cycles per instruction (CPI), as decreasing the clock cycle time, or as a combination. If the starting point is a machine that takes multiple clock cycles per instruction, then pipelining is usually viewed as reducing the CPI. This is the primary view we will take.
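As a rough illustration of the ideal formula, and of how imbalance and overhead erode it, the sketch below computes the time per instruction with and without pipelining. The stage latencies and the 1 ns overhead are made-up values, and the model itself (pipelined cycle = slowest stage plus a fixed overhead, unpipelined time = sum of the stage times) is a simplifying assumption rather than a measurement from this chapter.

```python
def pipeline_speedup(stage_ns, overhead_ns=0.0):
    """Return (unpipelined time, pipelined time, speedup) per instruction.

    stage_ns    -- hypothetical latency of each pipe stage, in ns
    overhead_ns -- hypothetical per-stage pipelining overhead, in ns
    """
    unpipelined = sum(stage_ns)            # one instruction passes through every stage
    cycle = max(stage_ns) + overhead_ns    # the slowest stage sets the machine cycle
    return unpipelined, cycle, unpipelined / cycle


# Perfectly balanced stages, no overhead: speedup equals the number of stages.
balanced = pipeline_speedup([10, 10, 10, 10, 10])

# Unbalanced stages plus overhead: speedup falls below the stage count.
unbalanced = pipeline_speedup([10, 8, 10, 10, 7], overhead_ns=1.0)
```

Under these assumed numbers the balanced case gives a speedup equal to the five stages, while the unbalanced case with overhead gives 45/11, a little over 4.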
If the starting point is a machine that takes one (long) clock cycle per instruction, then pipelining decreases the clock cycle time.

Pipelining is an implementation technique that exploits parallelism among the instructions in a sequential instruction stream. It has the substantial advantage that, unlike some speedup techniques (see Chapter 8 and Appendix B), it is not visible to the programmer. In this chapter we will first cover the concept of pipelining using DLX and a simple version of its pipeline. We use DLX because its simplicity makes it easy to demonstrate the principles of pipelining. In addition, to simplify the diagrams we do not include the jump instructions of DLX; adding them does not involve new concepts, only bigger diagrams. The principles of pipelining in this chapter apply to more complex instruction sets than DLX or its RISC relatives, although the resulting pipelines are more complex. Using the DLX example, we will look at the problems pipelining introduces and the performance attainable under typical situations. Section 3.9 examines the MIPS R4000 pipeline, which is similar to other recent machines with extensive pipelining. Chapter 4 looks at more advanced pipelining techniques being used in the highest-performance processors.

Before we proceed to basic pipelining, we need to review a simple implementation of an unpipelined version of DLX.

A Simple Implementation of DLX

To understand how DLX can be pipelined, we need to understand how it is implemented without pipelining. This section shows a simple implementation where every instruction takes at most five clock cycles. We will extend this basic implementation to a pipelined version, resulting in a much lower CPI. Our unpipelined implementation is not the most economical or the highest-performance implementation without pipelining. Instead, it is designed to lead naturally to a pipelined implementation.
We will indicate where the implementation could be improved later in this section. Implementing the instruction set requires the introduction of several temporary registers that are not part of the architecture; these are introduced in this section to simplify pipelining. In sections 3.1–3.5 we focus on a pipeline for an integer subset of DLX that consists of load-store word, branch, and integer ALU operations. Later in the chapter, we will incorporate the basic floating-point operations. Although we discuss only a subset of DLX, the basic principles can be extended to handle all the instructions.

Every DLX instruction can be implemented in at most five clock cycles. The five clock cycles are as follows.

1. Instruction fetch cycle (IF):

    IR ← Mem[PC]
    NPC ← PC + 4

Operation: Send out the PC and fetch the instruction from memory into the instruction register (IR); increment the PC by 4 to address the next sequential instruction. The IR is used to hold the instruction that will be needed on subsequent clock cycles; likewise the register NPC is used to hold the next sequential PC.

2. Instruction decode/register fetch cycle (ID):

    A ← Regs[IR_{6..10}];
    B ← Regs[IR_{11..15}];
    Imm ← ((IR_{16})^{16} ## IR_{16..31})

Operation: Decode the instruction and access the register file to read the registers. The outputs of the general-purpose registers are read into two temporary registers (A and B) for use in later clock cycles. The lower 16 bits of the IR are also sign-extended and stored into the temporary register Imm, for use in the next cycle.

Decoding is done in parallel with reading registers, which is possible because these fields are at a fixed location in the DLX instruction format (see Figure 2.21 on page 99). This technique is known as fixed-field decoding. Note that we may read a register we don’t use, which doesn’t help but also doesn’t hurt.
Because the immediate portion of an instruction is located in an identical place in every DLX format, the sign-extended immediate is also calculated during this cycle in case it is needed in the next cycle.

3. Execution/effective address cycle (EX):

The ALU operates on the operands prepared in the prior cycle, performing one of four functions depending on the DLX instruction type.

• Memory reference:

    ALUOutput ← A + Imm;

Operation: The ALU adds the operands to form the effective address and places the result into the register ALUOutput.

• Register-Register ALU instruction:

    ALUOutput ← A func B;

Operation: The ALU performs the operation specified by the function code on the value in register A and on the value in register B. The result is placed in the temporary register ALUOutput.

• Register-Immediate ALU instruction:

    ALUOutput ← A op Imm;

Operation: The ALU performs the operation specified by the opcode on the value in register A and on the value in register Imm. The result is placed in the temporary register ALUOutput.

• Branch:

    ALUOutput ← NPC + Imm;
    Cond ← (A op 0)

Operation: The ALU adds the NPC to the sign-extended immediate value in Imm to compute the address of the branch target. Register A, which has been read in the prior cycle, is checked to determine whether the branch is taken. The comparison operation op is the relational operator determined by the branch opcode; for example, op is “==” for the instruction BEQZ.

The load-store architecture of DLX means that effective address and execution cycles can be combined into a single clock cycle, since no instruction needs to simultaneously calculate a data address, calculate an instruction target address, and perform an operation on the data. The other integer instructions not included above are jumps of various forms, which are similar to branches.

4.
Memory access/branch completion cycle (MEM):

The PC is updated for all instructions:

    PC ← NPC;

• Memory reference:

    LMD ← Mem[ALUOutput] or
    Mem[ALUOutput] ← B;

Operation: Access memory if needed. If instruction is a load, data returns from memory and is placed in the LMD (load memory data) register; if it is a store, then the data from the B register is written into memory. In either case the address used is the one computed during the prior cycle and stored in the register ALUOutput.

• Branch:

    if (cond) PC ← ALUOutput

Operation: If the instruction branches, the PC is replaced with the branch destination address in the register ALUOutput.

5. Write-back cycle (WB):

• Register-Register ALU instruction:

    Regs[IR_{16..20}] ← ALUOutput;

• Register-Immediate ALU instruction:

    Regs[IR_{11..15}] ← ALUOutput;

• Load instruction:

    Regs[IR_{11..15}] ← LMD;

Operation: Write the result into the register file, whether it comes from the memory system (which is in LMD) or from the ALU (which is in ALUOutput); the register destination field is also in one of two positions depending on the function code.

Figure 3.1 shows how an instruction flows through the datapath. At the end of each clock cycle, every value computed during that clock cycle and required on a later clock cycle (whether for this instruction or the next) is written into a storage device, which may be memory, a general-purpose register, the PC, or a temporary register (i.e., LMD, Imm, A, B, IR, NPC, ALUOutput, or Cond). The temporary registers hold values between clock cycles for one instruction, while the other storage elements are visible parts of the state and hold values between successive instructions.

[Figure 3.1: datapath diagram divided into instruction fetch, instruction decode/register fetch, execute/address calculation, memory access, and write back; the labeled elements are the PC, the adder (+4), NPC, instruction memory, IR, registers, A, B, sign extend (16 to 32 bits), Imm, ALU, Zero?, Cond, branch taken, ALU output, data memory, LMD, and multiplexers.]

FIGURE 3.1 The implementation of the DLX datapath allows every instruction to be executed in four or five clock cycles. Although the PC is shown in the portion of the datapath that is used in instruction fetch and the registers are shown in the portion of the datapath that is used in instruction decode/register fetch, both of these functional units are read as well as written by an instruction. Although we show these functional units in the cycle corresponding to where they are read, the PC is written during the memory access clock cycle and the registers are written during the write back clock cycle. In both cases, the writes in later pipe stages are indicated by the multiplexer output (in memory access or write back) that carries a value back to the PC or registers. These backward-flowing signals introduce much of the complexity of pipelining, and we will look at them more carefully in the next few sections.

In this implementation, branch and store instructions require four cycles and all other instructions require five cycles. Assuming the branch frequency of 12% and a store frequency of 5% from the last chapter, this leads to an overall CPI of 4.83. This implementation, however, is not optimal either in achieving the best performance or in using the minimal amount of hardware given the performance level. The CPI could be improved without affecting the clock rate by completing ALU instructions during the MEM cycle, since those instructions are idle during that cycle. Assuming ALU instructions occupy 47% of the instruction mix, as we measured in Chapter 2, this improvement would lead to a CPI of 4.35, or an improvement of 4.83/4.35 ≈ 1.1. Beyond this simple change, any other attempts to decrease the CPI may increase the clock cycle time, since such changes would need to put more activity into a clock cycle.
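The CPI figures above are weighted averages of the cycle counts, which a few lines of arithmetic can reproduce. The sketch below assumes the rounded frequencies quoted in the text (12% branches, 5% stores, 47% ALU operations) and lumps the remaining 36% into a hypothetical "other" class that always takes five cycles; with these rounded inputs it gives roughly 4.83 and 4.36, the second being within rounding of the value in the text.

```python
def multicycle_cpi(mix, cycles):
    """Weighted-average CPI: sum of frequency * cycles over instruction classes."""
    return sum(mix[kind] * cycles[kind] for kind in mix)


# Rounded frequencies from the text; "other" (loads etc.) is the assumed remainder.
mix = {"branch": 0.12, "store": 0.05, "alu": 0.47, "other": 0.36}

# Base design: branches and stores finish in 4 cycles, everything else in 5.
base_cycles = {"branch": 4, "store": 4, "alu": 5, "other": 5}

# Improved design: ALU instructions also complete during the MEM cycle.
improved_cycles = dict(base_cycles, alu=4)

base_cpi = multicycle_cpi(mix, base_cycles)
improved_cpi = multicycle_cpi(mix, improved_cycles)
improvement = base_cpi / improved_cpi
```

Because the clock cycle is unchanged by this modification, the ratio of the two CPIs is also the performance ratio, about 1.1.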
Of course, it may still be beneficial to trade an increase in the clock cycle time for a decrease in the CPI, but this requires a detailed analysis and is unlikely to produce large improvements, especially if the initial distribution of work among the clock cycles is reasonably balanced. Although all machines today are pipelined, this multicycle implementation is a reasonable approximation of how most machines would have been implemented in earlier times. A simple finite-state machine could be used to implement the control following the five-cycle structure shown above. For a much more complex machine, microcode control could be used. In either event, an instruction sequence like that above would determine the structure of the control. In addition to these CPI improvements, there are some hardware redundancies that could be eliminated in this multicycle implementation. For example, there are two ALUs: one to increment the PC and one used for effective address and ALU computation. Since they are not needed on the same clock cycle, we could merge them by adding additional multiplexers and sharing the same ALU. Likewise, instructions and data could be stored in the same memory, since the data and instruction accesses happen on different clock cycles. Rather than optimize this simple implementation, we will leave the design as it is in Figure 3.1, since this provides us with a better base for the pipelined implementation. As an alternative to the multicycle design discussed in this section, we could also have implemented the machine so that every instruction takes one long clock cycle. In such cases, the temporary registers would be deleted, since there would not be any communication across clock cycles within an instruction. Every instruction would execute in one long clock cycle, writing the result into the data memory, registers, or PC at the end of the clock cycle. The CPI would be one for such a machine. 
However, the clock cycle would be roughly equal to five times the clock cycle of the multicycle machine, since every instruction would need to traverse all the functional units. Designers would never use this single-cycle implementation for two reasons. First, a single-cycle implementation would be very inefficient for most machines that have a reasonable variation among the amount of work, and hence in the clock cycle time, needed for different instructions. Second, a single-cycle implementation requires the duplication of functional units that could be shared in a multicycle implementation. Nonetheless, this single-cycle datapath allows us to illustrate how pipelining can improve the clock cycle time, as opposed to the CPI, of a machine.

3.2 The Basic Pipeline for DLX

We can pipeline the datapath of Figure 3.1 with almost no changes by starting a new instruction on each clock cycle. (See why we chose that design!) Each of the clock cycles from the previous section becomes a pipe stage: a cycle in the pipeline. This results in the execution pattern shown in Figure 3.2, which is the typical way a pipeline structure is drawn. While each instruction takes five clock cycles to complete, during each clock cycle the hardware will initiate a new instruction and will be executing some part of the five different instructions.

                      Clock number
Instruction number    1    2    3    4    5    6    7    8    9
Instruction i         IF   ID   EX   MEM  WB
Instruction i + 1          IF   ID   EX   MEM  WB
Instruction i + 2               IF   ID   EX   MEM  WB
Instruction i + 3                    IF   ID   EX   MEM  WB
Instruction i + 4                         IF   ID   EX   MEM  WB

FIGURE 3.2 Simple DLX pipeline. On each clock cycle, another instruction is fetched and begins its five-cycle execution. If an instruction is started every clock cycle, the performance will be up to five times that of a machine that is not pipelined.
The names for the stages in the pipeline are the same as those used for the cycles in the implementation on pages 127–129: IF = instruction fetch, ID = instruction decode, EX = execution, MEM = memory access, and WB = write back.

Your instinct is right if you find it hard to believe that pipelining is as simple as this, because it’s not. In this and the following sections, we will make our DLX pipeline “real” by dealing with problems that pipelining introduces.

To begin with, we have to determine what happens on every clock cycle of the machine and make sure we don’t try to perform two different operations with the same datapath resource on the same clock cycle. For example, a single ALU cannot be asked to compute an effective address and perform a subtract operation at the same time. Thus, we must ensure that the overlap of instructions in the pipeline cannot cause such a conflict. Fortunately, the simplicity of the DLX instruction set makes resource evaluation relatively easy. Figure 3.3 shows a simplified version of the DLX datapath drawn in pipeline fashion. As you can see, the major functional units are used in different cycles and hence overlapping the execution of multiple instructions introduces relatively few conflicts. There are three observations on which this fact rests.

[Figure 3.3: five instructions in program execution order, each drawn as the datapath sequence IM, Reg, ALU, DM, Reg, shifted by one clock cycle per instruction across CC 1 through CC 9 (time in clock cycles).]

FIGURE 3.3 The pipeline can be thought of as a series of datapaths shifted in time. This shows the overlap among the parts of the datapath, with clock cycle 5 (CC 5) showing the steady state situation. Because the register file is used as a source in the ID stage and as a destination in the WB stage, it appears twice. We show that it is read in one stage and written in another by using a solid line, on the right or left, respectively, and a dashed line on the other side. The abbreviation IM is used for instruction memory, DM for data memory, and CC for clock cycle.

First, the basic datapath of the last section already used separate instruction and data memories, which we would typically implement with separate instruction and data caches (discussed in Chapter 5). The use of separate caches eliminates a conflict for a single memory that would arise between instruction fetch and data memory access. Notice that if our pipelined machine has a clock cycle that is equal to that of the unpipelined version, the memory system must deliver five times the bandwidth. This is one cost of higher performance.

Second, the register file is used in the two stages: for reading in ID and for writing in WB. These uses are distinct, so we simply show the register file in two places. This does mean that we need to perform two reads and one write every clock cycle. What if a read and write are to the same register? For now, we ignore this problem, but we will focus on it in the next section.

Third, Figure 3.3 does not deal with the PC. To start a new instruction every clock, we must increment and store the PC every clock, and this must be done during the IF stage in preparation for the next instruction. The problem arises when we consider the effect of branches, which changes the PC also, but not until the MEM stage. This is not a problem in our multicycle, unpipelined datapath, since the PC is written once in the MEM stage. For now, we will organize our pipelined datapath to write the PC in IF and write either the incremented PC or the value of the branch target of an earlier branch. This introduces a problem in how branches are handled that we will explain in the next section and explore in detail in section 3.5.
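The overlap pattern of Figures 3.2 and 3.3 can be generated mechanically, which makes it easy to see which instruction occupies which stage on a given clock. The sketch below assumes an ideal pipeline with no stalls; the function names and the dictionary layout are illustrative choices, not anything defined in the text.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_schedule(num_instructions):
    """Map each clock cycle (1-based) to {stage name: instruction index}.

    Assumes an ideal pipeline: instruction k enters IF on clock k + 1
    and advances one stage per clock with no stalls.
    """
    total_cycles = num_instructions + len(STAGES) - 1
    schedule = {}
    for cc in range(1, total_cycles + 1):
        active = {}
        for depth, stage in enumerate(STAGES):
            instr = cc - 1 - depth          # instruction currently in this stage
            if 0 <= instr < num_instructions:
                active[stage] = instr
        schedule[cc] = active
    return schedule


def format_diagram(num_instructions):
    """Return a text diagram with one row per instruction, one column per clock."""
    total_cycles = num_instructions + len(STAGES) - 1
    rows = []
    for instr in range(num_instructions):
        cells = []
        for cc in range(1, total_cycles + 1):
            depth = cc - 1 - instr
            cells.append(STAGES[depth] if 0 <= depth < len(STAGES) else "")
        rows.append(f"i+{instr}: " + " ".join(f"{c:>3}" for c in cells))
    return "\n".join(rows)


schedule = pipeline_schedule(5)
diagram = format_diagram(5)
```

In this sketch `schedule[5]` shows all five stages busy with five different instructions, the steady state of clock cycle 5, and each stage name appears at most once per clock, which is the "no two instructions on one resource" condition discussed above. Printing `diagram` lays the same information out in the staircase form of Figure 3.2.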
Because every pipe stage is active on every clock cycle, all operations in a pipe stage must complete in one clock cycle and any combination of operations must be able to occur at once. Furthermore, pipelining the datapath requires that values passed from one pipe stage to the next must be placed in registers. Figure 3.4 shows the DLX pipeline with the appropriate registers, called pipeline registers or pipeline latches, between each pipeline stage. The registers are labeled with the names of the stages they connect. Figure 3.4 is drawn so that connections through the pipeline registers from one stage to another are clear.

FIGURE 3.4 The datapath is pipelined by adding a set of registers, one between each pair of pipe stages. The registers serve to convey values and control information from one stage to the next. We can also think of the PC as a pipeline register, which sits before the IF stage of the pipeline, leading to one pipeline register for each pipe stage. Recall that the PC is an edge-triggered register written at the end of the clock cycle; hence there is no race condition in writing the PC. The selection multiplexer for the PC has been moved so that the PC is written in exactly one stage (IF). If we didn’t move it, there would be a conflict when a branch occurred, since two instructions would try to write different values into the PC. Most of the datapaths flow from left to right, which is from earlier in time to later. The paths flowing from right to left (which carry the register write-back information and PC information on a branch) introduce complications into our pipeline, which we will spend much of this chapter overcoming.
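The role of the pipeline registers can be illustrated with a very small model. The following Python sketch is only an illustration under simplifying assumptions: the names `STAGES` and `simulate` are hypothetical, each stage holds just an instruction label rather than real data and control fields, and there are no hazards or stalls. It shows every stage handing its contents to the next stage on each clock edge, so that five instructions occupy the five stages in clock cycle 5.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def simulate(program, n_cycles):
    """Return one dict per clock cycle mapping stage name -> instruction label (or None)."""
    occupancy = [None] * len(STAGES)   # instruction currently held by each stage
    pending = list(program)            # instructions not yet fetched
    history = []
    for _ in range(n_cycles):
        fetched = pending.pop(0) if pending else None
        # Build the complete next state from the current state and swap it in
        # at once, mimicking edge-triggered pipeline registers.
        occupancy = [fetched] + occupancy[:-1]
        history.append(dict(zip(STAGES, occupancy)))
    return history

history = simulate(["i1", "i2", "i3", "i4", "i5"], n_cycles=9)
for cc, row in enumerate(history, start=1):
    print(f"CC {cc}:", " ".join(f"{s}={row[s] or '-'}" for s in STAGES))
```

In this model the fifth row has every stage busy, which corresponds to the steady state of an ideal pipeline; the later rows show the pipeline draining.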
All of the registers needed to hold values temporarily between clock cycles within one instruction are subsumed into these pipeline registers. The fields of the instruction register (IR), which is part of the IF/ID register, are labeled when they are used to supply register names. The pipeline registers carry both data and control from one pipeline stage to the next. Any value needed on a later pipeline stage must be placed in such a register and copied from one pipeline register to the next, until it is no longer needed. If we tried to just use the temporary registers we had in our earlier unpipelined datapath, values could be overwritten before all uses were completed. For example, the field of a register operand used for a write on a load or ALU operation is supplied from the MEM/WB pipeline register rather than from the IF/ID register. This is because we want a load or ALU operation to write the register designated by that operation, not the register field of the instruction currently transitioning from IF to ID! This destination register field is simply copied from one pipeline register to the next, until it is needed during the WB stage.

Any instruction is active in exactly one stage of the pipeline at a time; therefore, any actions taken on behalf of an instruction occur between a pair of pipeline registers. Thus, we can also look at the activities of the pipeline by examining what has to happen on any pipeline stage depending on the instruction type. Figure 3.5 shows this view. Fields of the pipeline registers are named so as to show the flow of data from one stage to the next. Notice that the actions in the first two stages are independent of the current instruction type; they must be independent because the instruction is not decoded until the end of the ID stage. The IF activity depends on whether the instruction in EX/MEM is a taken branch.
If so, then the branch target address of the branch instruction in EX/MEM is written into the PC at the end of IF; otherwise the incremented PC will be written back. (As we said earlier, this effect of branches leads to complications in the pipeline that we deal with in the next few sections.) The fixed-position encoding of the register source operands is critical to allowing the registers to be fetched during ID.

To control this simple pipeline we need only determine how to set the control for the four multiplexers in the datapath of Figure 3.4. The two multiplexers in the ALU stage are set depending on the instruction type, which is dictated by the IR field of the ID/EX register. The top ALU input multiplexer is set by whether the instruction is a branch or not, and the bottom multiplexer is set by whether the instruction is a register-register ALU operation or any other type of operation. The multiplexer in the IF stage chooses whether to use the value of the incremented PC or the value of EX/MEM.ALUOutput (the branch target) to write into the PC. This multiplexer is controlled by the field EX/MEM.cond. The fourth multiplexer is controlled by whether the instruction in the WB stage is a load or an ALU operation. In addition to these four multiplexers, there is one additional multiplexer needed that is not drawn in Figure 3.4, but whose existence is clear from looking at the WB stage of an ALU operation. The destination register field is in one of two different places depending on the instruction type (register-register ALU versus either ALU immediate or load). Thus, we will need a multiplexer to choose the correct portion of the IR in the MEM/WB register to specify the register destination field, assuming the instruction writes a register.

Stage IF (any instruction):
    IF/ID.IR ← Mem[PC];
    IF/ID.NPC, PC ← (if ((EX/MEM.opcode == branch) & EX/MEM.cond) {EX/MEM.ALUOutput} else {PC+4});

Stage ID (any instruction):
    ID/EX.A ← Regs[IF/ID.IR6..10]; ID/EX.B ← Regs[IF/ID.IR11..15];
    ID/EX.NPC ← IF/ID.NPC; ID/EX.IR ← IF/ID.IR;
    ID/EX.Imm ← (IF/ID.IR16)^16 ## IF/ID.IR16..31;

Stage EX:
    ALU instruction:
        EX/MEM.IR ← ID/EX.IR;
        EX/MEM.ALUOutput ← ID/EX.A func ID/EX.B; or
        EX/MEM.ALUOutput ← ID/EX.A op ID/EX.Imm;
        EX/MEM.cond ← 0;
    Load or store instruction:
        EX/MEM.IR ← ID/EX.IR;
        EX/MEM.ALUOutput ← ID/EX.A + ID/EX.Imm;
        EX/MEM.cond ← 0;
        EX/MEM.B ← ID/EX.B;
    Branch instruction:
        EX/MEM.ALUOutput ← ID/EX.NPC + ID/EX.Imm;
        EX/MEM.cond ← (ID/EX.A op 0);

Stage MEM:
    ALU instruction:
        MEM/WB.IR ← EX/MEM.IR;
        MEM/WB.ALUOutput ← EX/MEM.ALUOutput;
    Load or store instruction:
        MEM/WB.IR ← EX/MEM.IR;
        MEM/WB.LMD ← Mem[EX/MEM.ALUOutput]; or
        Mem[EX/MEM.ALUOutput] ← EX/MEM.B;

Stage WB:
    ALU instruction:
        Regs[MEM/WB.IR16..20] ← MEM/WB.ALUOutput; or
        Regs[MEM/WB.IR11..15] ← MEM/WB.ALUOutput;
    Load instruction only:
        Regs[MEM/WB.IR11..15] ← MEM/WB.LMD;

FIGURE 3.5 Events on every pipe stage of the DLX pipeline. Let’s review the actions in the stages that are specific to the pipeline organization. In IF, in addition to fetching the instruction and computing the new PC, we store the incremented PC both into the PC and into a pipeline register (NPC) for later use in computing the branch target address. This structure is the same as the organization in Figure 3.4, where the PC is updated in IF from one of two sources. In ID, we fetch the registers, extend the sign of the lower 16 bits of the IR, and pass along the IR and NPC. During EX, we perform an ALU operation or an address calculation; we pass along the IR and the B register (if the instruction is a store). We also set the value of cond to 1 if the instruction is a taken branch. During the MEM phase, we cycle the memory, write the PC if needed, and pass along values needed in the final pipe stage. Finally, during WB, we update the register field from either the ALU output or the loaded value.
For simplicity we always pass the entire IR from one stage to the next, though as an instruction proceeds down the pipeline, less and less of the IR is needed.

Basic Performance Issues in Pipelining

Pipelining increases the CPU instruction throughput (the number of instructions completed per unit of time), but it does not reduce the execution time of an individual instruction. In fact, it usually slightly increases the execution time of each instruction due to overhead in the control of the pipeline. The increase in instruction throughput means that a program runs faster and has lower total execution time, even though no single instruction runs faster!

The fact that the execution time of each instruction does not decrease puts limits on the practical depth of a pipeline, as we will see in the next section. In addition to limitations arising from pipeline latency, limits arise from imbalance among the pipe stages and from pipelining overhead. Imbalance among the pipe stages reduces performance since the clock can run no faster than the time needed for the slowest pipeline stage. Pipeline overhead arises from the combination of pipeline register delay and clock skew. The pipeline registers add setup time, which is the time that a register input must be stable before the clock signal that triggers a write occurs, plus propagation delay to the clock cycle. Clock skew, which is the maximum delay between when the clock arrives at any two registers, also contributes to the lower limit on the clock cycle. Once the clock cycle is as small as the sum of the clock skew and latch overhead, no further pipelining is useful, since there is no time left in the cycle for useful work.

EXAMPLE Consider the unpipelined machine in the previous section. Assume that it has 10-ns clock cycles and that it uses four cycles for ALU operations and branches and five cycles for memory operations. Assume that the relative frequencies of these operations are 40%, 20%, and 40%, respectively. Suppose that due to clock skew and setup, pipelining the machine adds 1 ns of overhead to the clock. Ignoring any latency impact, how much speedup in the instruction execution rate will we gain from a pipeline?

ANSWER The average instruction execution time on the unpipelined machine is

\[
\begin{aligned}
\text{Average instruction execution time} &= \text{Clock cycle} \times \text{Average CPI} \\
&= 10\ \text{ns} \times ((40\% + 20\%) \times 4 + 40\% \times 5) \\
&= 10\ \text{ns} \times 4.4 \\
&= 44\ \text{ns}
\end{aligned}
\]

In the pipelined implementation, the clock must run at the speed of the slowest stage plus overhead, which will be 10 + 1 or 11 ns; this is the average instruction execution time. Thus, the speedup from pipelining is

\[
\text{Speedup from pipelining} = \frac{\text{Average instruction time unpipelined}}{\text{Average instruction time pipelined}} = \frac{44\ \text{ns}}{11\ \text{ns}} = 4\ \text{times}
\]

The 1-ns overhead essentially establishes a limit on the effectiveness of pipelining. If the overhead is not affected by changes in the clock cycle, Amdahl's Law tells us that the overhead limits the speedup.

Alternatively, if our base machine already has a CPI of 1 (with a longer clock cycle), then pipelining will enable us to have a shorter clock cycle. The datapath of the previous section can be made into a single-cycle datapath by simply removing the latches and letting the data flow from one cycle of execution to the next. How would the speedup of the pipelined version compare to the single-cycle machine?

EXAMPLE Assume that the times required for the five functional units, which operate in each of the five cycles, are as follows: 10 ns, 8 ns, 10 ns, 10 ns, and 7 ns. Assume that pipelining adds 1 ns of overhead. Find the speedup versus the single-cycle datapath.

ANSWER Since the unpipelined machine executes all instructions in a single clock cycle, its average time per instruction is simply the clock cycle time.
The clock cycle time is equal to the sum of the times for each step in the execution:

\[
\text{Average instruction execution time} = 10 + 8 + 10 + 10 + 7 = 45\ \text{ns}
\]

The clock cycle time on the pipelined machine must be the largest time for any stage in the pipeline (10 ns) plus the overhead of 1 ns, for a total of 11 ns. Since the CPI is 1, this yields an average instruction execution time of 11 ns. Thus,

\[
\text{Speedup from pipelining} = \frac{\text{Average instruction time unpipelined}}{\text{Average instruction time pipelined}} = \frac{45\ \text{ns}}{11\ \text{ns}} = 4.1\ \text{times}
\]

Pipelining can be thought of as improving the CPI, which is what we typically do, as increasing the clock rate (especially compared to another pipelined machine), or sometimes as doing both.

Because the latches in a pipelined design can have a significant impact on the clock speed, designers have looked for latches that permit the highest possible clock rate. The Earle latch (invented by J. G. Earle [1965]) has three properties that make it especially useful in pipelined machines. First, it is relatively insensitive to clock skew. Second, the delay through the latch is always a constant two-gate delay, avoiding the introduction of skew in the data passing through the latch. Finally, two levels of logic can be performed in the latch without increasing the latch delay time. This means that two levels of logic in the pipeline can be overlapped with the latch, so the overhead from the latch can be hidden. We will not be analyzing the pipeline designs in this chapter at this level of detail. The interested reader should see Kunkel and Smith [1986].

The pipeline we now have for DLX would function just fine for integer instructions if every instruction were independent of every other instruction in the pipeline. In reality, instructions in the pipeline can depend on one another; this is the topic of the next section.
The complications that arise in the floating-point pipeline will be treated in section 3.7, and section 3.9 will look at a complete real pipeline.

3.3 The Major Hurdle of Pipelining: Pipeline Hazards

There are situations, called hazards, that prevent the next instruction in the instruction stream from executing during its designated clock cycle. Hazards reduce the performance from the ideal speedup gained by pipelining. There are three classes of hazards:

1. Structural hazards arise from resource conflicts when the hardware cannot support all possible combinations of instructions in simultaneous overlapped execution.
2. Data hazards arise when an instruction depends on the results of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline.
3. Control hazards arise from the pipelining of branches and other instructions that change the PC.

Hazards in pipelines can make it necessary to stall the pipeline. In Chapter 1, we mentioned that the processor could stall on an event such as a cache miss. Stalls arising from hazards in pipelined machines are more complex than the simple stall for a cache miss. Eliminating a hazard often requires that some instructions in the pipeline be allowed to proceed while others are delayed. For the pipelines we discuss in this chapter, when an instruction is stalled, all instructions issued later than the stalled instruction (and hence not as far along in the pipeline) are also stalled. Instructions issued earlier than the stalled instruction (and hence farther along in the pipeline) must continue, since otherwise the hazard will never clear. As a result, no new instructions are fetched during the stall. In contrast to this process of stalling only a portion of the pipeline, a cache miss stalls all the instructions in the pipeline both before and after the instruction causing the miss.
(For the simple pipelines of this chapter there is no advantage in selectively stalling instructions on a cache miss, but in future chapters we will examine pipelines and caches that reduce cache miss costs by selectively stalling on a cache miss.) We will see several examples of how pipeline stalls operate in this section; don’t worry, they aren’t as complex as they might sound!

Performance of Pipelines with Stalls

A stall causes the pipeline performance to degrade from the ideal performance. Let’s look at a simple equation for finding the actual speedup from pipelining, starting with the formula from the previous section.

\[
\begin{aligned}
\text{Speedup from pipelining} &= \frac{\text{Average instruction time unpipelined}}{\text{Average instruction time pipelined}} \\
&= \frac{\text{CPI unpipelined} \times \text{Clock cycle unpipelined}}{\text{CPI pipelined} \times \text{Clock cycle pipelined}} \\
&= \frac{\text{CPI unpipelined}}{\text{CPI pipelined}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}}
\end{aligned}
\]

Remember that pipelining can be thought of as decreasing the CPI or the clock cycle time. Since it is traditional to use the CPI to compare pipelines, let’s start with that assumption. The ideal CPI on a pipelined machine is almost always 1.
Hence, we can compute the pipelined CPI:

\[
\begin{aligned}
\text{CPI pipelined} &= \text{Ideal CPI} + \text{Pipeline stall clock cycles per instruction} \\
&= 1 + \text{Pipeline stall clock cycles per instruction}
\end{aligned}
\]

If we ignore the cycle time overhead of pipelining and assume the stages are perfectly balanced, then the cycle time of the two machines can be equal, leading to

\[
\text{Speedup} = \frac{\text{CPI unpipelined}}{1 + \text{Pipeline stall cycles per instruction}}
\]

One important simple case is where all instructions take the same number of cycles, which must also equal the number of pipeline stages (also called the depth of the pipeline). In this case, the unpipelined CPI is equal to the depth of the pipeline, leading to

\[
\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall cycles per instruction}}
\]

If there are no pipeline stalls, this leads to the intuitive result that pipelining can improve performance by the depth of the pipeline.

Alternatively, if we think of pipelining as improving the clock cycle time, then we can assume that the CPI of the unpipelined machine, as well as that of the pipelined machine, is 1.
This leads to

\[
\begin{aligned}
\text{Speedup from pipelining} &= \frac{\text{CPI unpipelined}}{\text{CPI pipelined}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}} \\
&= \frac{1}{1 + \text{Pipeline stall cycles per instruction}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}}
\end{aligned}
\]

In cases where the pipe stages are perfectly balanced and there is no overhead, the clock cycle on the pipelined machine is smaller than the clock cycle of the unpipelined machine by a factor equal to the pipeline depth:

\[
\text{Clock cycle pipelined} = \frac{\text{Clock cycle unpipelined}}{\text{Pipeline depth}}
\qquad
\text{Pipeline depth} = \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}}
\]

This leads to the following:

\[
\begin{aligned}
\text{Speedup from pipelining} &= \frac{1}{1 + \text{Pipeline stall cycles per instruction}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}} \\
&= \frac{1}{1 + \text{Pipeline stall cycles per instruction}} \times \text{Pipeline depth}
\end{aligned}
\]

Thus, if there are no stalls, the speedup is equal to the number of pipeline stages, matching our intuition for the ideal case.

Structural Hazards

When a machine is pipelined, the overlapped execution of instructions requires pipelining of functional units and duplication of resources to allow all possible combinations of instructions in the pipeline. If some combination of instructions cannot be accommodated because of resource conflicts, the machine is said to have a structural hazard.
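Before going further into structural hazards, the speedup relations from the preceding subsection can be checked numerically. The following Python sketch is only an illustration; the function name `pipeline_speedup` and its parameter names are hypothetical, and it simply evaluates the ratio of CPI times clock cycle for the two machines, with the pipelined CPI taken as the ideal CPI plus stall cycles per instruction.

```python
def pipeline_speedup(cpi_unpipelined, cycle_unpipelined, cycle_pipelined,
                     stall_cycles_per_instruction=0.0, ideal_cpi=1.0):
    """Speedup = (CPI x clock cycle) unpipelined / (CPI x clock cycle) pipelined."""
    cpi_pipelined = ideal_cpi + stall_cycles_per_instruction
    return (cpi_unpipelined * cycle_unpipelined) / (cpi_pipelined * cycle_pipelined)

# Multicycle base machine: 10-ns clock, 60% four-cycle and 40% five-cycle
# instructions, pipelined clock of 10 + 1 ns.
multicycle = pipeline_speedup(0.6 * 4 + 0.4 * 5, 10, 11)

# Single-cycle base machine: one 45-ns cycle per instruction, pipelined clock of 11 ns.
single_cycle = pipeline_speedup(1, 45, 11)

# Balanced five-stage pipeline with equal cycle times and no stalls.
ideal_depth_5 = pipeline_speedup(5, 1, 1)

print(round(multicycle, 2), round(single_cycle, 2), round(ideal_depth_5, 2))
```

The first two calls use the numbers from the two examples in the previous section, and the third is the ideal case in which the speedup equals the pipeline depth. Passing a nonzero `stall_cycles_per_instruction` shows how quickly stalls erode that ideal.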
The most common instances of structural hazards arise when some functional unit is not fully pipelined. Then a sequence of instructions using that unpipelined unit cannot proceed at the rate of one per clock cycle. Another common way that structural hazards appear is when some resource has not been duplicated enough to allow all combinations of instructions in the pipeline to execute. For example, a machine may have only one register-file write port, but under certain circumstances, the pipeline might want to perform two writes in a clock cycle. This will generate a structural hazard. When a sequence of instructions encounters this hazard, the pipeline will stall one of the instructions until the required unit is available. Such stalls will increase the CPI from its usual ideal value of 1.

Some pipelined machines have shared a single-memory pipeline for data and instructions. As a result, when an instruction contains a data-memory reference, it will conflict with the instruction reference for a later instruction, as shown in Figure 3.6. To resolve this, we stall the pipeline for one clock cycle when the data memory access occurs. Figure 3.7 shows our pipeline datapath figure with the stall cycle added. A stall is commonly called a pipeline bubble or just bubble, since it floats through the pipeline taking space but carrying no useful work. We will see another type of stall when we talk about data hazards.

Rather than draw the pipeline datapath every time, designers often just indicate stall behavior using a simpler diagram with only the pipe stage names, as in Figure 3.8. The form of Figure 3.8 shows the stall by indicating the cycle when no action occurs and simply shifting instruction 3 to the right (which delays its execution start and finish by one cycle). The effect of the pipeline bubble is actually to occupy the resources for that instruction slot as it travels through the pipeline, just as Figure 3.7 shows. Although Figure 3.7 shows how the stall is actually implemented, the performance impact indicated by the two figures is the same: Instruction 3 does not complete until clock cycle 9, and no instruction completes during clock cycle 8.

FIGURE 3.6 A machine with only one memory port will generate a conflict whenever a memory reference occurs. In this example the load instruction uses the memory for a data access at the same time instruction 3 wants to fetch an instruction from memory.

FIGURE 3.7 The structural hazard causes pipeline bubbles to be inserted. The effect is that no instruction will finish during clock cycle 8, when instruction 3 would normally have finished. Instruction 1 is assumed to not be a load or store; otherwise, instruction 3 cannot start execution.
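The timing of this one-memory-port stall can be reproduced with a few lines of code. The sketch below is a simplified model rather than a description of any real control logic: the function name `schedule_fetches` is hypothetical, and it assumes a five-stage pipeline in which MEM occurs three cycles after IF, a single memory port shared by instruction fetch and data access, and no other source of stalls.

```python
def schedule_fetches(uses_data_memory):
    """Return the clock cycle (starting at 1) in which each instruction is fetched.

    uses_data_memory[k] is True when instruction k is a load or store.
    """
    mem_busy = set()      # cycles in which the single port serves a data access
    fetch_cycles = []
    cycle = 1
    for is_mem in uses_data_memory:
        while cycle in mem_busy:
            cycle += 1    # stall: the fetch waits for the memory port
        fetch_cycles.append(cycle)
        if is_mem:
            mem_busy.add(cycle + 3)   # MEM stage is three cycles after IF
        cycle += 1
    return fetch_cycles

# A load followed by four instructions that make no data-memory reference.
fetches = schedule_fetches([True, False, False, False, False])
completions = [f + 4 for f in fetches]   # WB is four cycles after IF
print(fetches)       # [1, 2, 3, 5, 6]
print(completions)   # [5, 6, 7, 9, 10]
```

In this run the fourth instruction is fetched in cycle 5 instead of cycle 4 and completes in cycle 9, and nothing completes in cycle 8, which agrees with the behavior described above.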
                    Clock cycle number
Instruction         1      2      3      4      5      6      7      8      9      10
Load instruction    IF     ID     EX     MEM    WB
Instruction i + 1          IF     ID     EX     MEM    WB
Instruction i + 2                 IF     ID     EX     MEM    WB
Instruction i + 3                        stall  IF     ID     EX     MEM    WB
Instruction i + 4                               IF     ID     EX     MEM    WB
Instruction i + 5                                      IF     ID     EX     MEM
Instruction i + 6                                             IF     ID     EX

FIGURE 3.8 A pipeline stalled for a structural hazard: a load with one memory port. As shown here, the load instruction effectively steals an instruction-fetch cycle, causing the pipeline to stall; no instruction is initiated on clock cycle 4 (which normally would initiate instruction i + 3). Because the instruction being fetched is stalled, all other instructions in the pipeline before the stalled instruction can proceed normally. The stall cycle will continue to pass through the pipeline, so that no instruction completes on clock cycle 8. Sometimes these pipeline diagrams are drawn with the stall occupying an entire horizontal row and instruction 3 being moved to the next row; in either case, the effect is the same, since instruction 3 does not begin execution until cycle 5. We use the form above, since it takes less space.

EXAMPLE Let’s see how much the load structural hazard might cost. Suppose that data references constitute 40% of the mix, and that the ideal CPI of the pipelined machine, ignoring the structural hazard, is 1. Assume that the machine with the structural hazard has a clock rate that is 1.05 times higher than the clock rate of the machine without the hazard. Disregarding any other performance losses, is the pipeline with or without the structural hazard faster, and by how much?

ANSWER There are several ways we could solve this problem. Perhaps the simplest is to compute the average instruction time on the two machines:

\[
\text{Average instruction time} = \text{CPI} \times \text{Clock cycle time}
\]

Since it has no stalls, the average instruction time for the ideal machine is simply the Clock cycle time_ideal.
The average instruction time for the machine with the structural hazard is

\[
\begin{aligned}
\text{Average instruction time} &= \text{CPI} \times \text{Clock cycle time} \\
&= (1 + 0.4 \times 1) \times \frac{\text{Clock cycle time}_{\text{ideal}}}{1.05} \\
&\approx 1.3 \times \text{Clock cycle time}_{\text{ideal}}
\end{aligned}
\]

Clearly, the machine without the structural hazard is faster; we can use the ratio of the average instruction times to conclude that the machine without the hazard is 1.3 times faster.

As an alternative to this structural hazard, the designer could provide a separate memory access for instructions, either by splitting the cache into separate instruction and data caches, or by using a set of buffers, usually called instruction buffers, to hold instructions. Both the split cache and instruction buffer ideas are discussed in Chapter 5.

If all other factors are equal, a machine without structural hazards will always have a lower CPI. Why, then, would a designer allow structural hazards? There are two reasons: to reduce cost and to reduce the latency of the unit. Pipelining all the functional units, or duplicating them, may be too costly. For example, machines that support both an instruction and a data cache access every cycle (to prevent the structural hazard of the above example) require twice as much total memory bandwidth and often have higher bandwidth at the pins. Likewise, fully pipelining a floating-point multiplier consumes lots of gates. If the structural hazard would not occur often, it may not be worth the cost to avoid it. It is also usually possible to design an unpipelined unit, or one that isn’t fully pipelined, with a somewhat shorter total delay than a fully pipelined unit. The shorter latency comes from the lack of pipeline registers that introduce overhead. For example, both the CDC 7600 and the MIPS R2010 floating-point unit choose shorter latency (fewer clocks per operation) versus full pipelining.
As we will see shortly, reducing latency has other performance benefits and may overcome the disadvantage of the structural hazard.

EXAMPLE Many recent machines do not have fully pipelined floating-point units. For example, suppose we had an implementation of DLX with a floating-point multiply unit but no pipelining. Assume the multiplier could accept a new multiply operation every five clock cycles. (This rate is called the repeat or initiation interval.) Will this structural hazard have a large or small performance impact on mdljdp2 running on DLX? For simplicity, assume that the floating-point multiplies are uniformly distributed.

ANSWER From Chapter 2 we find that floating-point multiply has a frequency of 14% in mdljdp2. Our proposed pipeline can handle up to a 20% frequency of floating-point multiplies, one every five clock cycles. This means that the performance benefit of fully pipelining the floating-point multiply on mdljdp2 is likely to be limited, as long as the floating-point multiplies are not clustered but are distributed uniformly. In the best case, multiplies are overlapped with other operations, and there is no performance penalty at all. In the worst case, the multiplies are all clustered with no intervening instructions, and 14% of the instructions take 5 cycles each. Assuming a base CPI of 1, this amounts to an increase of 0.7 in the CPI.

In practice, examining the performance of mdljdp2 on a machine with a five-cycle-deep FP multiply pipeline shows that this structural hazard increases execution time by less than 3%. One reason this loss is so low is that data hazards (the topic of the next section) cause the pipeline to stall, preventing multiply instructions that might cause structural hazards from being initiated. Of course, other benchmarks make heavier use of floating-point multiply or have fewer data hazards, and thus would show a larger impact.
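The best-case claim in this example, that evenly spaced multiplies cost nothing, can be checked with a small counting model. The sketch below is an illustration under stated assumptions, not a model of any real machine: the function name `multiply_stall_cycles` is hypothetical, instructions issue in order at one per cycle, the only stalls counted are waits for the unpipelined multiplier, and data hazards are ignored.

```python
def multiply_stall_cycles(is_multiply, initiation_interval=5):
    """Count cycles lost waiting for a multiplier that accepts one operation
    every `initiation_interval` cycles."""
    cycle = 0
    next_free = 0     # earliest cycle at which the multiplier accepts a new operation
    stalls = 0
    for mult in is_multiply:
        if mult:
            if cycle < next_free:
                stalls += next_free - cycle   # wait until the multiplier is free
                cycle = next_free
            next_free = cycle + initiation_interval
        cycle += 1
    return stalls

# Roughly 14% multiplies, evenly spaced: one multiply in every 7 instructions.
spaced = [k % 7 == 0 for k in range(700)]
print(multiply_stall_cycles(spaced))          # 0

# Two multiplies back to back: the second waits for the multiplier.
print(multiply_stall_cycles([True, True]))    # 4
```

With one multiply in every seven instructions the multiplier is always free when needed, so the count is zero. Stalls in this model appear only when multiplies arrive closer together than the initiation interval.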
In the rest of this chapter we will examine the contributions of these different types of stalls in the DLX pipeline.

3.4 Data Hazards

A major effect of pipelining is to change the relative timing of instructions by overlapping their execution. This introduces data and control hazards. Data hazards occur when the pipeline changes the order of read/write accesses to operands so that the order differs from the order seen by sequentially executing instructions on an unpipelined machine. Consider the pipelined execution of these instructions:

    ADD   R1,R2,R3
    SUB   R4,R1,R5
    AND   R6,R1,R7
    OR    R8,R1,R9
    XOR   R10,R1,R11

All the instructions after the ADD use the result of the ADD instruction. As shown in Figure 3.9, the ADD instruction writes the value of R1 in the WB pipe stage, but the SUB instruction reads the value during its ID stage. This problem is called a data hazard. Unless precautions are taken to prevent it, the SUB instruction will read the wrong value and try to use it. In fact, the value used by the SUB instruction is not even deterministic: Though we might think it logical to assume that SUB would always use the value of R1 that was assigned by an instruction prior to ADD, this is not always the case. If an interrupt should occur between the ADD and SUB instructions, the WB stage of the ADD will complete, and the value of R1 at that point will be the result of the ADD. This unpredictable behavior is obviously unacceptable.

The AND instruction is also affected by this hazard. As we can see from Figure 3.9, the write of R1 does not complete until the end of clock cycle 5. Thus, the AND instruction that reads the registers during clock cycle 4 will receive the wrong results. The XOR instruction operates properly, because its register read occurs in clock cycle 6, after the register write. The OR instruction can also be made to operate without incurring a hazard by a simple implementation technique, implied in our pipeline diagrams.
The technique is to perform the register file reads in the second half of the cycle and the writes in the first half. This technique, which is hinted at in earlier figures by placing the dashed box around the register file, allows the OR instruction in the example in Figure 3.9 to execute correctly. The next subsection discusses a technique to eliminate the stalls for the hazard involving the SUB and AND instructions.

FIGURE 3.9 The use of the result of the ADD instruction in the next three instructions causes a hazard, since the register is not written until after those instructions read it.

Minimizing Data Hazard Stalls By Forwarding

The problem posed in Figure 3.9 can be solved with a simple hardware technique called forwarding (also called bypassing and sometimes short-circuiting). The key insight in forwarding is that the result is not really needed by the SUB until after the ADD actually produces it. If the result can be moved from where the ADD produces it, the EX/MEM register, to where the SUB needs it, the ALU input latches, then the need for a stall can be avoided. Using this observation, forwarding works as follows:

1. The ALU result from the EX/MEM register is always fed back to the ALU input latches.
2. If the forwarding hardware detects that the previous ALU operation has written the register corresponding to a source for the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file.

Notice that with forwarding, if the SUB is stalled, the ADD will be completed and the bypass will not be activated. This is also true for the case of an interrupt between the two instructions.
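The selection made by the forwarding control logic can be sketched in a few lines. The following Python fragment is a hedged illustration rather than the DLX control logic itself: the function name `select_alu_input` and the string labels are hypothetical, only ALU-to-ALU forwarding is modeled, and the check against the MEM/WB register is an added assumption that extends the two steps above to a result produced two instructions earlier.

```python
def select_alu_input(src_reg, ex_mem_dest, mem_wb_dest):
    """Choose where one ALU source operand should come from.

    src_reg     -- register number read by the instruction now in EX
    ex_mem_dest -- destination register of the instruction one ahead (None if it writes none)
    mem_wb_dest -- destination register of the instruction two ahead (None if it writes none)
    """
    if ex_mem_dest is not None and ex_mem_dest == src_reg:
        return "EX/MEM"     # most recent producer takes priority
    if mem_wb_dest is not None and mem_wb_dest == src_reg:
        return "MEM/WB"     # older producer, result not yet in the register file
    return "REGFILE"        # no pending write to this register

# ADD R1,R2,R3 ; SUB R4,R1,R5 ; AND R6,R1,R7
# SUB in EX: ADD's result sits in EX/MEM.
print(select_alu_input(src_reg=1, ex_mem_dest=1, mem_wb_dest=None))   # EX/MEM
# AND in EX: SUB (dest R4) is in EX/MEM, ADD (dest R1) is in MEM/WB.
print(select_alu_input(src_reg=1, ex_mem_dest=4, mem_wb_dest=1))      # MEM/WB
print(select_alu_input(src_reg=7, ex_mem_dest=4, mem_wb_dest=1))      # REGFILE
```

Checking the EX/MEM register first matters when two earlier instructions write the same register, since the younger of the two holds the value that sequential execution would have produced.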
As the example in Figure 3.9 shows, we need to forward results not only from the immediately previous instruction, but possibly from an instruction that started two cycles earlier. Figure 3.10 shows our example with the bypass paths in place and highlighting the timing of the register read and writes. This code sequence can be executed without stalls.

Forwarding can be generalized to include passing a result directly to the functional unit that requires it: A result is forwarded from the output of one unit to the input of another, rather than just from the result of a unit to the input of the same unit. Take, for example, the following sequence:

    ADD  R1,R2,R3
    LW   R4,0(R1)
    SW   12(R1),R4

To prevent a stall in this sequence, we would need to forward the values of R1 and R4 from the pipeline registers to the ALU and data memory inputs. Figure 3.11 shows all the forwarding paths for this example.

In DLX, we may require a forwarding path from any pipeline register to the input of any functional unit. Because the ALU and data memory both accept operands, forwarding paths are needed to their inputs from both the EX/MEM and MEM/WB registers. In addition, DLX uses a zero detection unit that operates during the EX cycle, and forwarding to that unit will be needed as well. Later in this section we will explore all the necessary forwarding paths and the control of those paths.

[Figure 3.10: pipeline diagram, time in clock cycles (CC 1 to CC 6) against program execution order, for ADD R1,R2,R3; SUB R4,R1,R5; AND R6,R1,R7; OR R8,R1,R9; XOR R10,R1,R11, with the bypass paths drawn.]

FIGURE 3.10 A set of instructions that depend on the ADD result use forwarding paths to avoid the data hazard. The inputs for the SUB and AND instructions forward from the EX/MEM and the MEM/WB pipeline registers, respectively, to the first ALU input. The OR receives its result by forwarding through the register file, which is easily accomplished by reading the registers in the second half of the cycle and writing in the first half, as the dashed lines on the registers indicate. Notice that the forwarded result can go to either ALU input; in fact, both ALU inputs could use forwarded inputs from either the same pipeline register or from different pipeline registers. This would occur, for example, if the AND instruction was AND R6,R1,R4.

[Figure 3.11: pipeline diagram, time in clock cycles (CC 1 to CC 6) against program execution order, for ADD R1,R2,R3; LW R4,0(R1); SW 12(R1),R4, with the forwarding paths drawn.]

FIGURE 3.11 Stores require an operand during MEM, and forwarding of that operand is shown here. The result of the load is forwarded from the memory output in MEM/WB to the memory input to be stored. In addition, the ALU output is forwarded to the ALU input for the address calculation of both the load and the store (this is no different than forwarding to another ALU operation). If the store depended on an immediately preceding ALU operation (not shown above), the result would need to be forwarded to prevent a stall.

Data Hazard Classification

A hazard is created whenever there is a dependence between instructions, and they are close enough that the overlap caused by pipelining would change the order of access to an operand. Our example hazards have all been with register operands, but it is also possible for a pair of instructions to create a dependence by writing and reading the same memory location. In our DLX pipeline, however, memory references are always kept in order, preventing this type of hazard from arising. Cache misses could cause the memory references to get out of order if we allowed the processor to continue working on later instructions, while an earlier instruction that missed the cache was accessing memory.
For the DLX pipeline we stall the entire pipeline on a cache miss, effectively making the instruction that contained the miss run for multiple clock cycles. In the next chapter, we will discuss machines that allow loads and stores to be executed in an order different from that in the program, which will introduce new problems. All the data hazards discussed in this chapter involve registers within the CPU.

Data hazards may be classified as one of three types, depending on the order of read and write accesses in the instructions. By convention, the hazards are named by the ordering in the program that must be preserved by the pipeline. Consider two instructions i and j, with i occurring before j. The possible data hazards are

- RAW (read after write): j tries to read a source before i writes it, so j incorrectly gets the old value. This is the most common type of hazard and the kind that we used forwarding to overcome in Figures 3.10 and 3.11.

- WAW (write after write): j tries to write an operand before it is written by i. The writes end up being performed in the wrong order, leaving the value written by i rather than the value written by j in the destination. This hazard is present only in pipelines that write in more than one pipe stage (or allow an instruction to proceed even when a previous instruction is stalled). The DLX integer pipeline writes a register only in WB and avoids this class of hazards. If we made two changes to the DLX pipeline, WAW hazards would be possible. First, we could move write back for an ALU operation into the MEM stage, since the data value is available by then. Second, suppose that the data memory access took two pipe stages.
Here is a sequence of two instructions showing the execution in this revised pipeline, highlighting the pipe stage that writes the result:

    LW   R1,0(R2)    IF    ID    EX    MEM1  MEM2  WB
    ADD  R1,R2,R3          IF    ID    EX    WB

Unless this hazard is avoided, execution of this sequence on this revised pipeline will leave the result of the first write (the LW) in R1, rather than the result of the ADD! Allowing writes in different pipe stages introduces other problems, since two instructions can try to write during the same clock cycle. When we discuss the DLX FP pipeline (section 3.7), which has both writes in different stages and different pipeline lengths, we will deal with both write conflicts and WAW hazards in detail.

- WAR (write after read): j tries to write a destination before it is read by i, so i incorrectly gets the new value. This cannot happen in our example pipeline because all reads are early (in ID) and all writes are late (in WB). This hazard occurs when there are some instructions that write results early in the instruction pipeline, and other instructions that read a source late in the pipeline. Because of the natural structure of a pipeline, which typically reads values before it writes results, such hazards are rare. Pipelines for complex instruction sets that support autoincrement addressing and require operands to be read late in the pipeline could create a WAR hazard. If we modified the DLX pipeline as in the above example and also read some operands late, such as the source value for a store instruction, a WAR hazard could occur. Here is the pipeline timing for such a potential hazard, highlighting the stage where the conflict occurs:

    SW   0(R1),R2    IF    ID    EX    MEM1  MEM2  WB
    ADD  R2,R3,R4          IF    ID    EX    WB

If the SW reads R2 during the second half of its MEM2 stage and the ADD writes R2 during the first half of its WB stage, the SW will incorrectly read and store the value produced by the ADD.
In the DLX pipeline, reading all operands from the register file during ID avoids this hazard; however, in the next chapter, we will see how these hazards occur more easily when instructions are executed out of order.

Note that the RAR (read after read) case is not a hazard.

Data Hazards Requiring Stalls

Unfortunately, not all potential data hazards can be handled by bypassing. Consider the following sequence of instructions:

    LW   R1,0(R2)
    SUB  R4,R1,R5
    AND  R6,R1,R7
    OR   R8,R1,R9

The pipelined datapath with the bypass paths for this example is shown in Figure 3.12. This case is different from the situation with back-to-back ALU operations. The LW instruction does not have the data until the end of clock cycle 4 (its MEM cycle), while the SUB instruction needs to have the data by the beginning of that clock cycle. Thus, the data hazard from using the result of a load instruction cannot be completely eliminated with simple hardware. As Figure 3.12 shows, such a forwarding path would have to operate backward in time, a capability not yet available to computer designers! We can forward the result immediately to the ALU from the MEM/WB registers for use in the AND operation, which begins two clock cycles after the load. Likewise, the OR instruction has no problem, since it receives the value through the register file. For the SUB instruction, the forwarded result arrives too late: at the end of a clock cycle, when it is needed at the beginning.

[Figure 3.12: pipeline diagram, time in clock cycles (CC 1 to CC 5) against program execution order, for LW R1,0(R2); SUB R4,R1,R5; AND R6,R1,R7; OR R8,R1,R9, with the bypass paths drawn.]

FIGURE 3.12 The load instruction can bypass its results to the AND and OR instructions, but not to the SUB, since that would mean forwarding the result in “negative time.”

The load instruction has a delay or latency that cannot be eliminated by forwarding alone.
Instead, we need to add hardware, called a pipeline interlock, to preserve the correct execution pattern. In general, a pipeline interlock detects a hazard and stalls the pipeline until the hazard is cleared. In this case, the interlock stalls the pipeline, beginning with the instruction that wants to use the data until the source instruction produces it. This pipeline interlock introduces a stall or bubble, just as it did for the structural hazard in section 3.3. The CPI for the stalled instruction increases by the length of the stall (one clock cycle in this case).

The pipeline with the stall and the legal forwarding is shown in Figure 3.13. Because the stall causes the instructions starting with the SUB to move one cycle later in time, the forwarding to the AND instruction now goes through the register file, and no forwarding at all is needed for the OR instruction. The insertion of the bubble causes the number of cycles to complete this sequence to grow by one. No instruction is started during clock cycle 4 (and none finishes during cycle 6).

[Figure 3.13: pipeline diagram, time in clock cycles (CC 1 to CC 6) against program execution order, for LW R1,0(R2); SUB R4,R1,R5; AND R6,R1,R7; OR R8,R1,R9, with a bubble inserted for the SUB, AND, and OR.]

FIGURE 3.13 The load interlock causes a stall to be inserted at clock cycle 4, delaying the SUB instruction and those that follow by one cycle. This delay allows the value to be successfully forwarded on the next clock cycle.

Figure 3.14 shows the pipeline before and after the stall using a diagram containing only the pipeline stages. We will make extensive use of this more concise form for showing interlocks and stalls in this chapter and the next.
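The concise stage-only form just described can also be produced mechanically. The following Python sketch is a simplified model, not the DLX control logic: it assumes a five-stage in-order pipeline in which the only stall is the one-cycle load interlock, and the names Instr, pipeline_rows, and render are hypothetical.

```python
from typing import List, NamedTuple, Tuple

STAGES = ["IF", "ID", "EX", "MEM", "WB"]
EX = STAGES.index("EX")


class Instr(NamedTuple):
    text: str
    is_load: bool
    dest: str                 # register written
    sources: Tuple[str, ...]  # registers needed at the start of EX


def pipeline_rows(instrs: List[Instr]) -> List[Tuple[str, int, List[str]]]:
    """Return (text, first cycle, per-cycle cells) for each instruction."""
    stall_cycles = set()      # cycles in which the front of the pipeline is frozen
    rows = []
    prev, prev_if, prev_ex = None, 0, 0
    for ins in instrs:
        start = prev_if + 1   # earliest cycle this instruction could be fetched
        cycle, stage, cells = start, 0, []
        if_cycle = ex_cycle = 0
        while stage < len(STAGES):
            # Load interlock: the consumer would enter EX right behind the load.
            load_use = (stage == EX and prev is not None and prev.is_load
                        and prev.dest in ins.sources and cycle == prev_ex + 1)
            if stage <= EX and cycle in stall_cycles:
                cells.append("stall")        # held up by an older instruction's stall
            elif load_use:
                stall_cycles.add(cycle)
                cells.append("stall")        # this instruction causes the bubble
            else:
                if stage == 0:
                    if_cycle = cycle
                if stage == EX:
                    ex_cycle = cycle
                cells.append(STAGES[stage])
                stage += 1
            cycle += 1
        rows.append((ins.text, start, cells))
        prev, prev_if, prev_ex = ins, if_cycle, ex_cycle
    return rows


def render(rows: List[Tuple[str, int, List[str]]]) -> str:
    """Format the rows as a stage diagram, one column per clock cycle."""
    lines = []
    for text, start, cells in rows:
        padded = [""] * (start - 1) + cells
        lines.append(f"{text:<16}" + "".join(f"{c:<6}" for c in padded).rstrip())
    return "\n".join(lines)


SEQUENCE = [
    Instr("LW  R1,0(R2)", True, "R1", ("R2",)),
    Instr("SUB R4,R1,R5", False, "R4", ("R1", "R5")),
    Instr("AND R6,R1,R7", False, "R6", ("R1", "R7")),
    Instr("OR  R8,R1,R9", False, "R8", ("R1", "R9")),
]
rows = pipeline_rows(SEQUENCE)
print(render(rows))
```

For the load sequence used in this section, the model places the stall in clock cycle 4 for the SUB, AND, and OR, which is the pattern the text describes.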
    LW   R1,0(R2)    IF    ID    EX    MEM   WB
    SUB  R4,R1,R5          IF    ID    EX    MEM   WB
    AND  R6,R1,R7                IF    ID    EX    MEM   WB
    OR   R8,R1,R9                      IF    ID    EX    MEM   WB

    LW   R1,0(R2)    IF    ID    EX    MEM   WB
    SUB  R4,R1,R5          IF    ID    stall EX    MEM   WB
    AND  R6,R1,R7                IF    stall ID    EX    MEM   WB
    OR   R8,R1,R9                      stall IF    ID    EX    MEM   WB

FIGURE 3.14 In the top half, we can see why a stall is needed: the MEM cycle of the load produces a value that is needed in the EX cycle of the SUB, which occurs at the same time. This problem is solved by inserting a stall, as shown in the bottom half.

EXAMPLE Suppose that 30% of the instructions are loads, and half the time the instruction following a load instruction depends on the result of the load. If this hazard creates a single-cycle delay, how much faster is the ideal pipelined machine (with a CPI of 1) that does not delay the pipeline than the real pipeline? Ignore any stalls other than pipeline stalls.

ANSWER The ideal machine will be faster by the ratio of the CPIs. The CPI for an instruction following a load is 1.5, since it stalls half the time. Because loads are 30% of the mix, the effective CPI is (0.7 × 1 + 0.3 × 1.5) = 1.15. This means that the ideal machine is 1.15 times faster.

In the next subsection we consider compiler techniques to reduce these penalties. After that, we look at how to implement hazard detection, forwarding, and interlocks.

Compiler Scheduling for Data Hazards

Many types of stalls are quite frequent. The typical code-generation pattern for a statement such as A = B + C produces a stall for a load of the second data value (C). Figure 3.15 shows that the store of A need not cause another stall, since the result of the addition can be forwarded to the data memory for use by the store. Rather than just allow the pipeline to stall, the compiler could try to schedule the pipeline to avoid these stalls by rearranging the code sequence to eliminate the hazard.
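The stall in the A = B + C pattern and the CPI arithmetic from the preceding example can both be checked with a short Python sketch. It assumes that the only stalls are one-cycle load interlocks between adjacent instructions; the names Instr, count_load_use_stalls, and effective_cpi are hypothetical, and the register assignment for A = B + C is one plausible choice.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    text: str
    is_load: bool
    dest: str                 # register written ("" if none)
    sources: Tuple[str, ...]  # registers the ALU needs at the start of EX


def count_load_use_stalls(seq: List[Instr]) -> int:
    """Count instructions that use a load result in the very next slot."""
    stalls = 0
    for prev, cur in zip(seq, seq[1:]):
        if prev.is_load and prev.dest in cur.sources:
            stalls += 1
    return stalls


def effective_cpi(load_fraction: float, dependent_fraction: float,
                  stall_cycles: int = 1, ideal_cpi: float = 1.0) -> float:
    """Ideal CPI plus the average load-interlock stall cycles per instruction."""
    return ideal_cpi + load_fraction * dependent_fraction * stall_cycles


# A = B + C. The store's data operand (R3) is forwarded to the data memory,
# so it is not listed as an ALU source here.
A_EQ_B_PLUS_C = [
    Instr("LW  R1,B", True, "R1", ()),
    Instr("LW  R2,C", True, "R2", ()),
    Instr("ADD R3,R1,R2", False, "R3", ("R1", "R2")),
    Instr("SW  A,R3", False, "", ()),
]

stalls = count_load_use_stalls(A_EQ_B_PLUS_C)
cpi = effective_cpi(load_fraction=0.3, dependent_fraction=0.5)
print(stalls, round(cpi, 2))
```

The sequence incurs one stall, caused by the load of C, and the formula reproduces the effective CPI of 1.15 from the example, since 0.7 × 1 + 0.3 × 1.5 equals 1 + 0.3 × 0.5.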
For example, the compiler could try to avoid generating code with a load followed by the immediate use of the load destination register. This technique, called pipeline scheduling or instruction scheduling, was first used in the 1960s and became an area of major interest in the 1980s, as pipelined machines became more widespread.

    LW   R1,B        IF    ID    EX    MEM   WB
    LW   R2,C              IF    ID    EX    MEM   WB
    ADD  R3,R1,R2                IF    ID    stall EX    MEM   WB
    SW   A,R3                          IF    stall ID    EX    MEM   WB

FIGURE 3.15 The DLX code sequence for A = B + C. The ADD instruction must be stalled to allow the load of C to complete. The SW need not be delayed further because the forwarding hardware passes the result from the MEM/WB directly to the data memory input for storing.

EXAMPLE Generate DLX code that avoids pipeline stalls for the following sequence:

    a = b + c;
    d = e - f;

Assume loads have a latency of one clock cycle.

ANSWER Here is the scheduled code:

    LW   Rb,b
    LW   Rc,c
    LW   Re,e       ; swap instructions to avoid stall
    ADD  Ra,Rb,Rc
    LW   Rf,f
    SW   a,Ra       ; store/load exchanged to avoid stall
    SUB  Rd,Re,Rf
    SW   d,Rd

Both load interlocks (LW Rc,c to ADD Ra,Rb,Rc and LW Rf,f to SUB Rd,Re,Rf) have been eliminated. There is a dependence between the ALU instruction and the store, but the pipeline structure allows the result to be forwarded. Notice that the use of different registers for the first and second statements was critical for this schedule to be legal. In particular, if the variable e was loaded into the same register as b or c, this schedule would be illegal. In general, pipeline scheduling can increase the register count required. In the next chapter, we will see that this increase can be substantial for machines that can issue multiple instructions in one clock.

Many modern compilers try to use instruction scheduling to improve pipeline performance. In the simplest algorithms, the compiler simply schedules using other instructions in the same basic block.
A basic block is a straight-line code sequence with no transfers in or out, except at the beginning or end. Scheduling such code sequences is easy, since we know that every instruction in the block is executed if the first one is. We can simply make a graph of the dependences among the instructions and order the instructions so as to minimize the stalls. For a simple pipeline like the DLX integer pipeline with only short latencies (the only delay is one cycle on loads), a scheduling strategy focusing on basic blocks is adequate. Figure 3.16 shows the frequency that stalls are required for load results, assuming a single-cycle delay for loads. As you can see, this process is more effective for floating-point programs that have significant amounts of parallelism among instructions. As pipelining becomes more extensive and the effective pipeline latencies grow, more ambitious scheduling schemes are needed; these are discussed in detail in the next chapter.

[Figure 3.16: bar chart; vertical axis: fraction of loads that cause a stall (0% to 45%); horizontal axis: benchmark.]

FIGURE 3.16 Percentage of the loads that result in a stall with the DLX pipeline. This chart shows the frequency of stalls remaining in scheduled code that was globally optimized before scheduling. Global optimization actually makes scheduling relatively harder because there are fewer candidates for scheduling into delay slots, as we discuss in Fallacies and Pitfalls. The pipeline slot after a load is often called the load delay or delay slot. In general, it is easier to schedule the delay slots in FP programs, since they are more regular and the analysis is easier. Hence fewer loads stall in the FP programs: an average of 13% of the loads versus 25% on the integer programs. The actual performance impact depends on the load frequency, which varies from 19% to 34% with an average of 24%. The contribution to CPI runs from 0.01 cycles per instruction to 0.15 cycles per instruction.

Implementing the Control for the DLX Pipeline

The process of letting an instruction move from the instruction decode stage (ID) into the execution stage (EX) of this pipeline is usually called instruction issue; an instruction that has made this step is said to have issued. For the DLX integer pipeline, all the data hazards can be checked during the ID phase of the pipeline. If a data hazard exists, the instruction is stalled before it is issued. Likewise, we can determine what forwarding will be needed during ID and set the appropriate controls then. Detecting interlocks early in the pipeline reduces the hardware complexity because the hardware never has to suspend an instruction that has updated the state of the machine, unless the entire machine is stalled. Alternatively, we can detect the hazard or forwarding at the beginning of a clock cycle that uses an operand (EX and MEM for this pipeline). To show the differences in these two approaches, we will show how the interlock for a RAW hazard with the source coming from a load instruction (called a load interlock) can be implemented by a check in ID, while the implementation of forwarding paths to the ALU inputs can be done during EX. Figure 3.17 lists the variety of circumstances that we must handle.

Situation: No dependence
Example code sequence:
    LW   R1,45(R2)
    ADD  R5,R6,R7
    SUB  R8,R6,R7
    OR   R9,R6,R7
Action: No hazard possible because no dependence exists on R1 in the immediately following three instructions.

Situation: Dependence requiring stall
Example code sequence:
    LW   R1,45(R2)
    ADD  R5,R1,R7
    SUB  R8,R6,R7
    OR   R9,R6,R7
Action: Comparators detect the use of R1 in the ADD and stall the ADD (and SUB and OR) before the ADD begins EX.

Situation: Dependence overcome by forwarding
Example code sequence:
    LW   R1,45(R2)
    ADD  R5,R6,R7
    SUB  R8,R1,R7
    OR   R9,R6,R7
Action: Comparators detect use of R1 in SUB and forward result of load to ALU in time for SUB to begin EX.

Situation: Dependence with accesses in order
Example code sequence:
    LW   R1,45(R2)
    ADD  R5,R6,R7
    SUB  R8,R6,R7
    OR   R9,R1,R7
Action: No action required because the read of R1 by OR occurs in the second half of the ID phase, while the write of the loaded data occurred in the first half.

FIGURE 3.17 Situations that the pipeline hazard detection hardware can see by comparing the destination and sources of adjacent instructions. This table indicates that the only comparison needed is between the destination and the sources on the two instructions following the instruction that wrote the destination. In the case of a stall, the pipeline dependences will look like the third case once execution continues. Of course hazards that involve R0 can be ignored since the register always contains 0, and the test above could be extended to do this.

Let’s start with implementing the load interlock. If there is a RAW hazard with the source instruction being a load, the load instruction will be in the EX stage when an instruction that needs the load data will be in the ID stage. Thus, we can describe all the possible hazard situations with a small table, which can be directly translated to an implementation. Figure 3.18 shows a table that detects all load interlocks when the instruction using the load result is in the ID stage.

Opcode field of ID/EX (ID/EX.IR0..5) | Opcode field of IF/ID (IF/ID.IR0..5) | Matching operand fields
Load | Register-register ALU | ID/EX.IR11..15 == IF/ID.IR6..10
Load | Register-register ALU | ID/EX.IR11..15 == IF/ID.IR11..15
Load | Load, store, ALU immediate, or branch | ID/EX.IR11..15 == IF/ID.IR6..10

FIGURE 3.18 The logic to detect the need for load interlocks during the ID stage of an instruction requires three comparisons.
Lines 1 and 2 of the table test whether the load destination register is one of the source registers for a register-register operation in ID. Line 3 of the table determines if the load destination register is a source for a load or store effective address, an ALU immediate, or a branch test. Remember that the IF/ID register holds the state of the instruction in ID, which potentially uses the load result, while ID/EX holds the state of the instruction in EX, which is the potential load instruction.

Once a hazard has been detected, the control unit must insert the pipeline stall and prevent the instructions in the IF and ID stages from advancing. As we said in section 3.2, all the control information is carried in the pipeline registers. (Carrying the instruction along is enough, since all control is derived from it.) Thus, when we detect a hazard we need only change the control portion of the ID/EX pipeline register to all 0s, which happens to be a no-op (an instruction that does nothing, such as ADD R0,R0,R0). In addition, we simply recirculate the contents of the IF/ID registers to hold the stalled instruction. In a pipeline with more complex hazards, the same ideas would apply: We can detect the hazard by comparing some set of pipeline registers and shift in no-ops to prevent erroneous execution.

Implementing the forwarding logic is similar, though there are more cases to consider. The key observation needed to implement the forwarding logic is that the pipeline registers contain both the data to be forwarded as well as the source and destination register fields. All forwarding logically happens from the ALU or data memory output to the ALU input, the data memory input, or the zero detection unit. Thus, we can implement the forwarding by a comparison of the destination registers of the IR contained in the EX/MEM and MEM/WB stages against the source registers of the IR contained in the ID/EX and EX/MEM registers.
Figure 3.19 shows the comparisons and possible forwarding operations where the destination of the forwarded result is an ALU input for the instruction currently in EX. The Exercises ask you to add the entries when the result is forwarded to the data memory. The last possible forwarding destination is the zero detect unit, whose forwarding paths look the same as those that are needed when the destination instruction is an ALU immediate. In addition to the comparators and combinational logic that we need to determine when a forwarding path needs to be enabled, we also need to enlarge the multiplexers at the ALU inputs and add the connections from the pipeline registers that are used to forward the results. Figure 3.20 shows the relevant segments of the pipelined datapath with the additional multiplexers and connections in place. For DLX, the hazard detection and forwarding hardware is reasonably simple; we will see that things become somewhat more complicated when we extend this pipeline to deal with floating point. Before we do that, we need to handle branches. 
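As a recap of the load interlock described earlier in this subsection, the three comparisons of Figure 3.18 and the stall action can be written as a minimal Python sketch. It assumes the instruction class has already been decoded from IR0..5; the names IR, load_interlock, issue, and NOP are hypothetical, and the R0 special case mentioned in the text is deliberately left out.

```python
from typing import NamedTuple, Tuple

REG_REG_ALU = "register-register ALU"
ONE_SOURCE_KINDS = {"load", "store", "ALU immediate", "branch"}


class IR(NamedTuple):
    """Hypothetical decoded view of an instruction held in a pipeline register."""
    kind: str      # instruction class decoded from IR0..5
    ir_6_10: int   # first source register
    ir_11_15: int  # second source (register-register ALU) or destination (load)


NOP = IR("nop", 0, 0)  # stands in for the all-0s control word


def load_interlock(id_ex: IR, if_id: IR) -> bool:
    """Apply the three comparisons: ID/EX holds the potential load,
    IF/ID holds the instruction in ID that may use its result."""
    if id_ex.kind != "load":
        return False
    load_dest = id_ex.ir_11_15
    if if_id.kind == REG_REG_ALU:
        # Lines 1 and 2: either source of a register-register operation.
        return load_dest in (if_id.ir_6_10, if_id.ir_11_15)
    if if_id.kind in ONE_SOURCE_KINDS:
        # Line 3: address base, ALU immediate source, or branch test register.
        return load_dest == if_id.ir_6_10
    return False


def issue(id_ex: IR, if_id: IR) -> Tuple[IR, bool]:
    """Return the next ID/EX contents and whether IF/ID is recirculated."""
    if load_interlock(id_ex, if_id):
        return NOP, True    # bubble enters EX; the stalled instruction stays in ID
    return if_id, False


lw = IR("load", 2, 1)                    # LW  R1,45(R2)
add_dependent = IR(REG_REG_ALU, 1, 7)    # ADD R5,R1,R7
add_independent = IR(REG_REG_ALU, 6, 7)  # ADD R5,R6,R7
print(issue(lw, add_dependent), issue(lw, add_independent))
```

The dependent ADD is held in ID while a no-op moves into EX; the independent ADD issues normally. A store whose only dependence on the load is its data operand is not stalled by these comparisons, which is consistent with forwarding the load result to the data memory input.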
Pipeline register containing source instruction | Opcode of source instruction | Pipeline register containing destination instruction | Opcode of destination instruction | Destination of the forwarded result | Comparison (if equal then forward)
EX/MEM | Register-register ALU | ID/EX | Register-register ALU, ALU immediate, load, store, branch | Top ALU input | EX/MEM.IR16..20 == ID/EX.IR6..10
EX/MEM | Register-register ALU | ID/EX | Register-register ALU | Bottom ALU input | EX/MEM.IR16..20 == ID/EX.IR11..15
MEM/WB | Register-register ALU | ID/EX | Register-register ALU, ALU immediate, load, store, branch | Top ALU input | MEM/WB.IR16..20 == ID/EX.IR6..10
MEM/WB | Register-register ALU | ID/EX | Register-register ALU | Bottom ALU input | MEM/WB.IR16..20 == ID/EX.IR11..15
EX/MEM | ALU immediate | ID/EX | Register-register ALU, ALU immediate, load, store, branch | Top ALU input | EX/MEM.IR11..15 == ID/EX.IR6..10
EX/MEM | ALU immediate | ID/EX | Register-register ALU | Bottom ALU input | EX/MEM.IR11..15 == ID/EX.IR11..15
MEM/WB | ALU immediate | ID/EX | Register-register ALU, ALU immediate, load, store, branch | Top ALU input | MEM/WB.IR11..15 == ID/EX.IR6..10
MEM/WB | ALU immediate | ID/EX | Register-register ALU | Bottom ALU input | MEM/WB.IR11..15 == ID/EX.IR11..15
MEM/WB | Load | ID/EX | Register-register ALU, ALU immediate, load, store, branch | Top ALU input | MEM/WB.IR11..15 == ID/EX.IR6..10
MEM/WB | Load | ID/EX | Register-register ALU | Bottom ALU input | MEM/WB.IR11..15 == ID/EX.IR11..15

FIGURE 3.19 Forwarding of data to the two ALU inputs (for the instruction in EX) can occur from the ALU result (in EX/MEM or in MEM/WB) or from the load result in MEM/WB. There are 10 separate comparisons needed to tell whether a forwarding operation should occur. The top and bottom ALU inputs refer to the inputs corresponding to the first and second ALU source operands, respectively, and are shown explicitly in Figure 3.1 on page 130 and in Figure 3.20 on page 161.
Remember that the pipeline latch for destination instruction in EX is ID/EX, while the source values come from the ALUOutput portion of EX/MEM or MEM/WB or the LMD portion of MEM/WB. There is one complication not addressed by this logic: dealing with multiple instructions that write the same register. For example, during the code sequence ADD R1,R2,R3; ADDI R1,R1,#2; SUB R4,R3,R1, the logic must ensure that the SUB instruction uses the result of the ADDI instruction rather than the result of the ADD instruction. The logic shown above can be extended to handle this case by simply testing that forwarding from MEM/WB is enabled only when forwarding from EX/MEM is not enabled for the same input. Because the ADDI result will be in EX/MEM, it will be forwarded, rather than the ADD result in MEM/WB.

[Figure 3.20: datapath segment showing the ID/EX, EX/MEM, and MEM/WB pipeline registers, the zero detection unit, the two ALU input multiplexers, the ALU, and the data memory, with the forwarding connections.]

FIGURE 3.20 Forwarding of results to the ALU requires the addition of three extra inputs on each ALU multiplexer and the addition of three paths to the new inputs. The paths correspond to a bypass of (1) the ALU output at the end of the EX, (2) the ALU output at the end of the MEM stage, and (3) the memory output at the end of the MEM stage.

3.5 Control Hazards

Control hazards can cause a greater performance loss for our DLX pipeline than do data hazards. When a branch is executed, it may or may not change the PC to something other than its current value plus 4. Recall that if a branch changes the PC to its target address, it is a taken branch; if it falls through, it is not taken, or untaken. If instruction i is a taken branch, then the PC is normally not changed until the end of MEM, after the completion of the address calculation and comparison, as shown in Figure 3.4 (page 134) and Figure 3.5 (page 136). The simplest method of dealing with branches is to stall the pipeline as soon as we detect the branch until we reach the MEM stage, which determines the new PC.
Of course, we do not want to stall the pipeline until we know that the instruction is a branch; thus, the stall does not occur until after the ID stage, and the pipeline behavior looks like that shown in Figure 3.21. This control hazard stall must be implemented differently from a data hazard stall, since the IF cycle of the instruction following the branch must be repeated as soon as we know the branch outcome. Thus, the first IF cycle is essentially a stall, because it never performs useful work. This stall can be implemented by setting the IF/ID register to zero for the three cycles. You may have noticed that if the branch is untaken, then the repetition of the IF stage is unnecessary since the correct instruction was indeed fetched. We will develop several schemes to take advantage of this fact shortly, but first, let’s examine how we could reduce the worst-case branch penalty.

    Branch instruction      IF    ID    EX    MEM   WB
    Branch successor              IF    stall stall IF    ID    EX    MEM   WB
    Branch successor + 1                                  IF    ID    EX    MEM   WB
    Branch successor + 2                                        IF    ID    EX    MEM
    Branch successor + 3                                              IF    ID    EX
    Branch successor + 4                                                    IF    ID
    Branch successor + 5                                                          IF

FIGURE 3.21 A branch causes a three-cycle stall in the DLX pipeline: One cycle is a repeated IF cycle and two cycles are idle. The instruction after the branch is fetched, but the instruction is ignored, and the fetch is restarted once the branch target is known. It is probably obvious that if the branch is not taken, the second IF for branch successor is redundant. This will be addressed shortly.

Three clock cycles wasted for every branch is a significant loss. With a 30% branch frequency and an ideal CPI of 1, the machine with branch stalls achieves only about half the ideal speedup from pipelining! Thus, reducing the branch penalty becomes critical. The number of clock cycles in a branch stall can be reduced by two steps:

1. Find out whether the branch is taken or not taken earlier in the pipeline.

2. Compute the taken PC (i.e., the address of the branch target) earlier.

To optimize the branch behavior, both of these must be done; it doesn’t help to know the target of the branch without knowing whether the next instruction to execute is the target or the instruction at PC + 4. Both steps should be taken as early in the pipeline as possible.

In DLX, the branches (BEQZ and BNEZ) require testing a register for equality to zero. Thus, it is possible to complete this decision by the end of the ID cycle by moving the zero test into that cycle. To take advantage of an early decision on whether the branch is taken, both PCs (taken and untaken) must be computed early. Computing the branch target address during ID requires an additional adder because the main ALU, which has been used for this function so far, is not usable until EX. Figure 3.22 shows the revised pipelined datapath. With the separate adder and a branch decision made during ID, there is only a one-clock-cycle stall on branches. Although this reduces the branch delay to one cycle, it means that an ALU instruction followed by a branch on the result of the instruction will incur a data hazard stall. Figure 3.23 shows the branch portion of the revised pipeline table from Figure 3.5 (page 136).

[Figure 3.22: revised pipelined datapath showing the PC, instruction memory, the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers, the register file, the zero test and a separate adder in ID, the sign extender, the ALU with its input multiplexers, and the data memory.]

FIGURE 3.22 The stall from branch hazards can be reduced by moving the zero test and branch target calculation into the ID phase of the pipeline. Notice that we have made two important changes, each of which removes one cycle from the three cycle stall for branches. The first change is to move both the branch address target calculation and the branch condition decision to the ID cycle. The second change is to write the PC of the instruction in the IF phase, using either the branch target address computed during ID or the incremented PC computed during IF. In comparison, Figure 3.4 obtained the branch target address from the EX/MEM register and wrote the result during the MEM clock cycle. As mentioned in Figure 3.4, the PC can be thought of as a pipeline register (e.g., as part of ID/IF), which is written with the address of the next instruction at the end of each IF cycle.

In some machines, branch hazards are even more expensive in clock cycles than in our example, since the time to evaluate the branch condition and compute the destination can be even longer. For example, a machine with separate decode and register fetch stages will probably have a branch delay (the length of the control hazard) that is at least one clock cycle longer. The branch delay, unless it is dealt with, turns into a branch penalty. Many older machines that implement more complex instruction sets have branch delays of four clock cycles or more, and large, deeply pipelined machines often have branch penalties of six or seven. In general, the deeper the pipeline, the worse the branch penalty in clock cycles. Of course, the relative performance effect of a longer branch penalty depends on the overall CPI of the machine. A high CPI machine can afford to have more expensive branches because the percentage of the machine’s performance that will be lost from branches is less.

Pipe stage | Branch instruction
IF | IF/ID.IR ← Mem[PC]; IF/ID.NPC,PC ← (if ((IF/ID.opcode == branch) & (Regs[IF/ID.IR6..10] op 0)) {IF/ID.NPC + (IF/ID.IR16)^16 ## IF/ID.IR16..31} else {PC+4});
ID | ID/EX.A ← Regs[IF/ID.IR6..10]; ID/EX.B ← Regs[IF/ID.IR11..15]; ID/EX.IR ← IF/ID.IR; ID/EX.Imm ← (IF/ID.IR16)^16 ## IF/ID.IR16..31
EX | (unused)
MEM | (unused)
WB | (unused)

FIGURE 3.23 This revised pipeline structure is based on the original in Figure 3.5, page 136. It uses a separate adder, as in Figure 3.22, to compute the branch target address during ID.
The operations that are new or have changed are in bold. Because the branch target address addition happens during ID, it will happen for all instructions; the branch condition (Regs[IF/ID.IR6..10] op 0) will also be done for all instructions. The selection of the sequential PC or the branch target PC still occurs during IF, but it now uses values from the ID/EX register, which correspond to the values set by the previous instruction. This change reduces the branch penalty by two cycles: one from evaluating the branch target and condition earlier and one from controlling the PC selection on the same clock rather than on the next clock. Since the value of cond is set to 0, unless the instruction in ID is a taken branch, the machine must decode the instruction before the end of ID. Because the branch is done by the end of ID, the EX, MEM, and WB stages are unused for branches. An additional complication arises for jumps that have a longer offset than branches. We can resolve this by using an additional adder that sums the PC and lower 26 bits of the IR. Before talking about methods for reducing the pipeline penalties that can arise from branches, let’s take a brief look at the dynamic behavior of branches. Branch Behavior in Programs Because branches can dramatically affect pipeline performance, we should look at their behavior to get some ideas about how the penalties of branches and jumps might be reduced. We already know something about branch frequencies from our programs in Chapter 2. Figure 3.24 reviews the overall frequency of controlflow operations for our SPEC subset on DLX and gives the breakdown between branches and jumps. Conditional branches are also broken into forward and backward branches. The integer benchmarks show conditional branch frequencies of 14% to 16%, with much lower unconditional branch frequencies (though li has a large number because of its high procedure call frequency). 
For the FP benchmarks, the behavior is much more varied with a conditional branch frequency of 3% up to 12%, but an overall average for both conditional branches and unconditional branches that is lower than for the integer benchmarks. Forward branches dominate backward branches by about 3.7 to 1 on average.

Percentage of instructions executed, by benchmark:

Benchmark   Forward conditional   Backward conditional   Unconditional
compress    11%                   3%                     3%
eqntott     22%                   2%                     2%
espresso    11%                   4%                     1%
gcc         12%                   3%                     4%
li          11%                   4%                     8%
doduc        6%                   2%                     2%
ear          6%                   4%                     4%
hydro2d     10%                   2%                     0%
mdljdp2      9%                   0%                     0%
su2cor       2%                   1%                     1%

FIGURE 3.24 The frequency of instructions (branches, jumps, calls, and returns) that may change the PC. The unconditional column includes unconditional branches and jumps (these differ in how the target address is specified), procedure calls, and returns. In all the cases except li, the number of unconditional PC changes is roughly equally divided between those that are for calls or returns and those that are unconditional jumps. In li, calls and returns outnumber jumps and unconditional branches by a factor of 3 (6% versus 2%). Since the compiler uses loop unrolling (described in detail in Chapter 4) as an optimization, the backward conditional branch frequency will be lower, especially for the floating-point programs. Overall, the integer programs average 13% forward conditional branches, 3% backward conditional branches, and 4% unconditional branches. The FP programs average 7%, 2%, and 1%, respectively.

Since the performance of pipelining schemes for branches may depend on whether or not branches are taken, this data becomes critical. Figure 3.25 shows the frequency of forward and backward branches that are taken as a fraction of all conditional branches. Totaling the two columns shows that 67% of the conditional branches are taken on average.
By combining the data in Figures 3.24 and 3.25, we can compute the fraction of forward branches that are taken, which is the probability that a forward branch will be taken. Since backward branches often form loops, we would expect that the probability of a backward branch being taken is higher than the probability of a forward branch being taken. Indeed, the data, when combined, show that 60% of the forward branches are taken on average and 85% of the backward branches are taken.

[Figure 3.25: bar chart, not reproduced. Vertical axis: fraction of all conditional branches. Horizontal axis: benchmark. Series: forward taken, backward taken.]

FIGURE 3.25 Together the forward and backward taken branches account for an average of 67% of all conditional branches. Although the backward branches are outnumbered, they are taken with a frequency that is almost 1.5 times higher, contributing substantially to the taken branch frequency. On average, 62% of the branches are taken in the integer programs and 70% in the FP programs. Note the wide disparity in behavior between a program like su2cor and mdljdp2; these variations make it challenging to predict the branch behavior very accurately. As in Figure 3.24, the use of loop unrolling affects this data since it removes backward branches that had a high probability of being taken.

Reducing Pipeline Branch Penalties

There are many methods for dealing with the pipeline stalls caused by branch delay; we discuss four simple compile-time schemes in this subsection. In these four schemes the actions for a branch are static: they are fixed for each branch during the entire execution. The software can try to minimize the branch penalty using knowledge of the hardware scheme and of branch behavior.
After discussing these schemes, we examine compile-time branch prediction, since these branch optimizations all rely on such technology. In the next chapter, we look both at more powerful compile-time schemes (such as loop unrolling) that reduce the frequency of loop branches and at dynamic hardware-based prediction schemes. The simplest scheme to handle branches is to freeze or flush the pipeline, holding or deleting any instructions after the branch until the branch destination is known. The attractiveness of this solution lies primarily in its simplicity both for hardware and software. It is the solution used earlier in the pipeline shown in Figure 3.21. In this case the branch penalty is fixed and cannot be reduced by software. A higher performance, and only slightly more complex, scheme is to treat every branch as not taken, simply allowing the hardware to continue as if the branch were not executed. Here, care must be taken not to change the machine state until the branch outcome is definitely known. The complexity that arises from having to know when the state might be changed by an instruction and how to “back out” a change might cause us to choose the simpler solution of flushing the pipeline in machines with complex pipeline structures. In the DLX pipeline, this predict-not-taken or predict-untaken scheme is implemented by continuing to fetch instructions as if the branch were a normal instruction. The pipeline looks as if nothing out of the ordinary is happening. If the branch is taken, however, we need to turn the fetched instruction into a no-op (simply by clearing the IF/ID register) and restart the fetch at the target address. Figure 3.26 shows both situations. 
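The predict-not-taken behavior described above can be sketched in a few lines of Python. This is only an illustrative model, not part of DLX: the helper names `pipeline_rows` and `show` and the row layout are assumptions made for this sketch, and it assumes the branch outcome is known at the end of ID, as in the revised datapath.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_rows(branch_taken, followers=3):
    """Model a predict-not-taken fetch policy (hypothetical helper).

    Returns (label, start_cycle, stages) tuples; start_cycle is the
    zero-based clock cycle in which the instruction is fetched.
    """
    rows = [("branch", 0, STAGES)]
    if not branch_taken:
        # Prediction was right: the fall-through instructions just continue.
        for k in range(1, followers + 1):
            rows.append((f"i+{k}", k, STAGES))
        return rows
    # Prediction was wrong: i+1 was already fetched, so it is turned into
    # a no-op (IF/ID cleared) and fetch restarts at the branch target.
    rows.append(("i+1 (no-op)", 1, ["IF"] + ["idle"] * 4))
    for k in range(followers):
        label = "target" if k == 0 else f"target+{k}"
        rows.append((label, 2 + k, STAGES))
    return rows

def show(rows):
    """Print one row per instruction, indented by its fetch cycle."""
    for label, start, stages in rows:
        cells = ["    "] * start + [f"{s:<4}" for s in stages]
        print(f"{label:<12} " + " ".join(cells))

show(pipeline_rows(branch_taken=False))
print()
show(pipeline_rows(branch_taken=True))
```

In the taken case the first useful instruction after the branch is fetched in cycle 2 rather than cycle 1, which is the one-cycle penalty of this scheme in the model.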
Untaken branch instruction        IF   ID   EX   MEM  WB
Instruction i + 1                      IF   ID   EX   MEM  WB
Instruction i + 2                           IF   ID   EX   MEM  WB
Instruction i + 3                                IF   ID   EX   MEM  WB
Instruction i + 4                                     IF   ID   EX   MEM  WB

Taken branch instruction          IF   ID   EX   MEM  WB
Instruction i + 1                      IF   idle idle idle idle
Branch target                               IF   ID   EX   MEM  WB
Branch target + 1                                IF   ID   EX   MEM  WB
Branch target + 2                                     IF   ID   EX   MEM  WB

FIGURE 3.26 The predict-not-taken scheme and the pipeline sequence when the branch is untaken (top) and taken (bottom). When the branch is untaken, determined during ID, we have fetched the fall-through and just continue. If the branch is taken during ID, we restart the fetch at the branch target. This causes all instructions following the branch to stall one clock cycle.

An alternative scheme is to treat every branch as taken. As soon as the branch is decoded and the target address is computed, we assume the branch to be taken and begin fetching and executing at the target. Because in our DLX pipeline we don't know the target address any earlier than we know the branch outcome, there is no advantage in this approach for DLX. In some machines, especially those with implicitly set condition codes or more powerful (and hence slower) branch conditions, the branch target is known before the branch outcome, and a predict-taken scheme might make sense.

In either a predict-taken or predict-not-taken scheme, the compiler can improve performance by organizing the code so that the most frequent path matches the hardware's choice. Our fourth scheme provides more opportunities for the compiler to improve performance.

A fourth scheme in use in some machines is called delayed branch. This technique is also used in many microprogrammed control units. In a delayed branch, the execution cycle with a branch delay of length n is

    branch instruction
    sequential successor_1
    sequential successor_2
    ........
    sequential successor_n
    branch target if taken

The sequential successors are in the branch-delay slots.
These instructions are executed whether or not the branch is taken. The pipeline behavior of the DLX pipeline, which would have one branch-delay slot, is shown in Figure 3.27. In practice, all machines with delayed branch have a single instruction delay, and we focus on that case.

Untaken branch instruction        IF   ID   EX   MEM  WB
Branch-delay instruction (i + 1)       IF   ID   EX   MEM  WB
Instruction i + 2                           IF   ID   EX   MEM  WB
Instruction i + 3                                IF   ID   EX   MEM  WB
Instruction i + 4                                     IF   ID   EX   MEM  WB

Taken branch instruction          IF   ID   EX   MEM  WB
Branch-delay instruction (i + 1)       IF   ID   EX   MEM  WB
Branch target                               IF   ID   EX   MEM  WB
Branch target + 1                                IF   ID   EX   MEM  WB
Branch target + 2                                     IF   ID   EX   MEM  WB

FIGURE 3.27 The behavior of a delayed branch is the same whether or not the branch is taken. The instructions in the delay slot (there is only one delay slot for DLX) are executed. If the branch is untaken, execution continues with the instruction after the branch-delay instruction; if the branch is taken, execution continues at the branch target. When the instruction in the branch-delay slot is also a branch, the meaning is unclear: if the branch is not taken, what should happen to the branch in the branch-delay slot? Because of this confusion, architectures with delayed branches often disallow putting a branch in the delay slot.

The job of the compiler is to make the successor instructions valid and useful. A number of optimizations are used. Figure 3.28 shows the three ways in which the branch delay can be scheduled. Figure 3.29 shows the different constraints for each of these branch-scheduling schemes, as well as situations in which they win.
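The execution order of a delayed branch with n delay slots can be written down directly. The following is a minimal sketch, assuming instructions are represented only by descriptive labels; the function name `delayed_branch_order` is hypothetical and not taken from the text.

```python
def delayed_branch_order(n_slots, taken):
    """Execution order around a delayed branch with n_slots delay slots."""
    order = ["branch"]
    # The sequential successors in the delay slots always execute.
    order += [f"successor{k}" for k in range(1, n_slots + 1)]
    # Only after the delay slots does control reach the chosen path.
    order.append("branch target" if taken else f"successor{n_slots + 1}")
    return order

# DLX has a single branch-delay slot.
print(delayed_branch_order(1, taken=True))
print(delayed_branch_order(1, taken=False))
```

The instruction in the delay slot appears in both lists, which is why the compiler must make sure it is valid, and preferably useful, on both paths.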
(a) From before

    Before scheduling:          After scheduling:
    ADD R1, R2, R3              if R2 = 0 then
    if R2 = 0 then                  ADD R1, R2, R3   (delay slot)
        [delay slot]

(b) From target

    Before scheduling:          After scheduling:
    SUB R4, R5, R6  (target)    SUB R4, R5, R6  (target)
    ADD R1, R2, R3              ADD R1, R2, R3
    if R1 = 0 then              if R1 = 0 then
        [delay slot]                SUB R4, R5, R6   (delay slot)

(c) From fall through

    Before scheduling:          After scheduling:
    ADD R1, R2, R3              ADD R1, R2, R3
    if R1 = 0 then              if R1 = 0 then
        [delay slot]                OR R7, R8, R9    (delay slot)
    OR R7, R8, R9               SUB R4, R5, R6
    SUB R4, R5, R6

FIGURE 3.28 Scheduling the branch-delay slot. The top box in each pair shows the code before scheduling; the bottom box shows the scheduled code. In (a) the delay slot is scheduled with an independent instruction from before the branch. This is the best choice. Strategies (b) and (c) are used when (a) is not possible. In the code sequences for (b) and (c), the use of R1 in the branch condition prevents the ADD instruction (whose destination is R1) from being moved after the branch. In (b) the branch-delay slot is scheduled from the target of the branch; usually the target instruction will need to be copied because it can be reached by another path. Strategy (b) is preferred when the branch is taken with high probability, such as a loop branch. Finally, the branch may be scheduled from the not-taken fall through as in (c). To make this optimization legal for (b) or (c), it must be OK to execute the moved instruction when the branch goes in the unexpected direction. By OK we mean that the work is wasted, but the program will still execute correctly. This is the case, for example in case (b), if R4 were an unused temporary register when the branch goes in the unexpected direction.

Scheduling strategy | Requirements | Improves performance when?
(a) From before | Branch must not depend on the rescheduled instructions. | Always.
(b) From target | Must be OK to execute rescheduled instructions if branch is not taken. May need to duplicate instructions. | When branch is taken. May enlarge program if instructions are duplicated.
(c) From fall through | Must be OK to execute instructions if branch is taken. | When branch is not taken.

FIGURE 3.29 Delayed-branch scheduling schemes and their requirements. The origin of the instruction being scheduled into the delay slot determines the scheduling strategy. The compiler must enforce the requirements when looking for instructions to schedule the delay slot. When the slots cannot be scheduled, they are filled with no-op instructions. In strategy (b), if the branch target is also accessible from another point in the program (as it would be if it were the head of a loop), the target instructions must be copied and not just moved.

The limitations on delayed-branch scheduling arise from (1) the restrictions on the instructions that are scheduled into the delay slots and (2) our ability to predict at compile time whether a branch is likely to be taken or not. Shortly, we will see how we can better predict branches statically at compile time.

To improve the ability of the compiler to fill branch delay slots, most machines with conditional branches have introduced a cancelling or nullifying branch. In a cancelling branch, the instruction includes the direction that the branch was predicted. When the branch behaves as predicted, the instruction in the branch-delay slot is simply executed as it would normally be with a delayed branch. When the branch is incorrectly predicted, the instruction in the branch-delay slot is simply turned into a no-op. Figure 3.30 shows the behavior of a predicted-taken cancelling branch, both when the branch is taken and untaken.
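The rule for when a delay-slot instruction does useful work can be stated as a small predicate. This is a sketch under the assumption that a branch is described by three booleans; the name `delay_slot_executes` and its parameters are hypothetical.

```python
def delay_slot_executes(cancelling, predicted_taken, actually_taken):
    """Return True if the delay-slot instruction is executed."""
    if not cancelling:
        # A regular delayed branch always executes its delay slot.
        return True
    # A cancelling branch turns the slot into a no-op on a misprediction.
    return predicted_taken == actually_taken

# Predicted-taken cancelling branch: the slot runs only if the branch is taken.
print(delay_slot_executes(True, predicted_taken=True, actually_taken=True))
print(delay_slot_executes(True, predicted_taken=True, actually_taken=False))
```

Because the hardware discards the slot on a misprediction, the compiler no longer has to prove that the moved instruction is harmless on the unexpected path.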
Untaken branch instruction        IF   ID   EX   MEM  WB
Branch-delay instruction (i + 1)       IF   idle idle idle idle
Instruction i + 2                           IF   ID   EX   MEM  WB
Instruction i + 3                                IF   ID   EX   MEM  WB
Instruction i + 4                                     IF   ID   EX   MEM  WB

Taken branch instruction          IF   ID   EX   MEM  WB
Branch-delay instruction (i + 1)       IF   ID   EX   MEM  WB
Branch target                               IF   ID   EX   MEM  WB
Branch target + 1                                IF   ID   EX   MEM  WB
Branch target + 2                                     IF   ID   EX   MEM  WB

FIGURE 3.30 The behavior of a predicted-taken cancelling branch depends on whether the branch is taken or not. The instruction in the delay slot is executed only if the branch is taken and is otherwise made into a no-op.

The advantage of cancelling branches is that they eliminate the requirements on the instruction placed in the delay slot, enabling the compiler to use scheduling schemes (b) and (c) of Figure 3.28 without meeting the requirements shown for these schemes in Figure 3.29. Most machines with cancelling branches provide both a noncancelling form (i.e., a regular delayed branch) and a cancelling form, usually cancel if not taken. This combination gains most of the advantages, but does not allow scheduling scheme (c) to be used unless the requirements of Figure 3.29 are met.

Figure 3.31 shows the effectiveness of the branch scheduling in DLX with a single branch-delay slot and both a noncancelling branch and a cancel-if-untaken form. The compiler uses a standard delayed branch whenever possible and then opts for a cancel-if-not-taken branch (also called branch likely). The second column shows that almost 20% of the branch delay slots are filled with no-ops. These occur when it is not possible to fill the delay slot, either because the potential candidates are unknown (e.g., for a jump register that will be used in a case statement) or because the successors are also branches. (Branches are not allowed in branch-delay slots because of the confusion in semantics.)
Benchmark | % conditional branches | % conditional branches with empty slots | % conditional branches that are cancelling | % cancelling branches that are cancelled | % branches with cancelled delay slots | Total % branches with empty or cancelled delay slot
compress | 14% | 18% | 31% | 43% | 13% | 31%
eqntott | 24% | 24% | 50% | 24% | 12% | 36%
espresso | 15% | 29% | 19% | 21% | 4% | 33%
gcc | 15% | 16% | 33% | 34% | 11% | 27%
li | 15% | 20% | 55% | 48% | 26% | 46%
Integer average | 17% | 21% | 38% | 34% | 13% | 35%
doduc | 8% | 33% | 12% | 62% | 7% | 40%
ear | 10% | 37% | 36% | 14% | 5% | 42%
hydro2d | 12% | 0% | 69% | 24% | 17% | 17%
mdljdp2 | 9% | 0% | 86% | 10% | 9% | 9%
su2cor | 3% | 7% | 17% | 57% | 10% | 17%
FP average | 8% | 16% | 44% | 33% | 10% | 25%
Overall average | 12% | 18% | 41% | 34% | 12% | 30%

FIGURE 3.31 Delayed and cancelling delay branches for DLX allow branch hazards to be hidden 70% of the time on average for these 10 SPEC benchmarks. Empty delay slots cannot be filled at all (most often because the branch target is another branch) in 18% of the branches. Just under half the conditional branches use a cancelling branch, and most of these are not cancelled (65%). The behavior varies widely across benchmarks. When the fraction of conditional branches is added in, the contribution to CPI varies even more widely.

The table shows that the remaining 80% of the branch delay slots are filled nearly equally by standard delayed branches and by cancelling branches. Most of the cancelling branches are not cancelled and hence contribute to useful computation. Figure 3.32 summarizes the performance of the combination of delayed branch and cancelling branch. Overall, 70% of the branch delays are usefully filled, reducing the stall penalty to 0.3 cycles per conditional branch.
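The 0.3-cycle figure follows from adding the empty and cancelled delay slots. A minimal sketch of that arithmetic, using the overall averages reported in Figure 3.31; the function name `wasted_slot_fraction` is hypothetical, and the one-cycle cost per wasted slot assumes the single delay slot of DLX.

```python
def wasted_slot_fraction(empty_pct, cancelled_pct):
    """Fraction of branch-delay slots that do no useful work."""
    return (empty_pct + cancelled_pct) / 100

# Overall averages from Figure 3.31: 18% empty slots, 12% cancelled slots.
wasted = wasted_slot_fraction(18, 12)

# With one delay slot per branch, each wasted slot costs one clock cycle.
stall_cycles_per_branch = wasted * 1

print(f"wasted slots: {wasted:.0%}")
print(f"usefully filled: {1 - wasted:.0%}")
print(f"stall cycles per conditional branch: {stall_cycles_per_branch:.1f}")
```

The per-group averages in the table are rounded independently, so summing the integer or FP columns by hand may differ from the printed totals by a percentage point.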
[Figure 3.32: bar chart, not reproduced. Vertical axis: percentage of conditional branches. Horizontal axis: benchmark. Series: empty slot, cancelled delay slots.]

FIGURE 3.32 The performance of delayed and cancelling branches is summarized by showing the fraction of branches either with empty delay slots or with a cancelled delay slot. On average 30% of the branch delay slots are wasted. The integer programs are, on average, worse, wasting an average of 35% of the slots versus 25% for the FP programs. Notice, though, that two of the FP programs waste more branch delay slots than four of the five integer programs.

Delayed branches are an architecturally visible feature of the pipeline. This is the source both of their primary advantage (allowing the use of simple compiler scheduling to reduce branch penalties) and their primary disadvantage (exposing an aspect of the implementation that is likely to change). In the early RISC machines with single-cycle branch delays, the delayed branch approach was attractive, since it yielded good performance with minimal hardware costs. More recently, with deeper pipelines and longer branch delays, a delayed branch approach is less useful since it cannot easily hide the longer delays. With these longer branch delays, most architects have found it necessary to include more powerful hardware schemes for branch prediction (which we will explore in the next chapter), making the delayed branch superfluous. This has led to recent RISC architectures that include both delayed and nondelayed branches or that include only nondelayed branches, relying on hardware prediction.

There is a small additional hardware cost for delayed branches. For a single-cycle delayed branch, the only case that exists in practice, a single extra PC is needed.
To understand why an extra PC is needed for the single-cycle delay case, consider when the interrupt occurs for the instruction in the branch-delay slot. If the branch was taken, then the instruction in the delay slot and the branch target have addresses that are not sequential. Thus, we need to save the PCs of both instructions and restore them after the interrupt to restart the pipeline. The two PCs can be kept with the control in the pipeline latches and passed along with the instruction. This makes saving and restoring them easy.

Performance of Branch Schemes

What is the effective performance of each of these schemes? The effective pipeline speedup with branch penalties, assuming an ideal CPI of 1, is

\[ \text{Pipeline speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall cycles from branches}} \]

Because of the following:

\[ \text{Pipeline stall cycles from branches} = \text{Branch frequency} \times \text{Branch penalty} \]

we obtain

\[ \text{Pipeline speedup} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}} \]

The branch frequency and branch penalty can have a component from both unconditional and conditional branches. However, the latter dominate since they are more frequent.

Using the DLX measurements in this section, Figure 3.33 shows several hardware options for dealing with branches, along with their performances given as branch penalty and as CPI (assuming a base CPI of 1).

                     Branch penalty per    Penalty per      Average branch       Effective CPI with
                     conditional branch    unconditional    penalty per branch   branch stalls
Scheduling scheme    Integer    FP         branch           Integer    FP        Integer    FP
Stall pipeline       1.00       1.00       1.00             1.00       1.00      1.17       1.15
Predict taken        1.00       1.00       1.00             1.00       1.00      1.17       1.15
Predict not taken    0.62       0.70       1.0              0.69       0.74      1.12       1.11
Delayed branch       0.35       0.25       0.0              0.30       0.21      1.06       1.03

FIGURE 3.33 Overall costs of a variety of branch schemes with the DLX pipeline. These data are for our DLX pipeline using the average measured branch frequencies from Figure 3.24 on page 165, the measurements of taken/untaken frequencies from Figure 3.25 on page 166, and the measurements of delay-slot filling from Figure 3.31 on page 171. Shown are both the penalties per branch and the resulting overall CPI including only the effect of branch stalls and assuming a base CPI of 1.

Remember that the numbers in this section are dramatically affected by the length of the pipeline delay and the base CPI. A longer pipeline delay will cause an increase in the penalty and a larger percentage of wasted time. A delay of only one clock cycle is small; the R4000 pipeline, which we examine in section 3.9, has a conditional branch delay of three cycles. This results in a much higher penalty.

EXAMPLE For an R4000-style pipeline, it takes three pipeline stages before the branch target address is known and an additional cycle before the branch condition is evaluated, assuming no stalls on the registers in the conditional comparison. This leads to the branch penalties for the three simplest prediction schemes listed in Figure 3.34.

Branch scheme      Penalty unconditional   Penalty untaken   Penalty taken
Flush pipeline     2                       3                 3
Predict taken      2                       3                 2
Predict untaken    2                       0                 3

FIGURE 3.34 Branch penalties for the three simplest prediction schemes for a deeper pipeline.

Find the effective addition to the CPI arising from branches for this pipeline, using the data from the 10 SPEC benchmarks in Figures 3.24 and 3.25.

ANSWER We find the CPIs by multiplying the relative frequency of unconditional, conditional untaken, and conditional taken branches by the respective penalties. These frequencies for the 10 SPEC programs are 4%, 6%, and 10%, respectively. The results are shown in Figure 3.35.
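The calculation in the answer can be checked with a short script. This is a sketch only: the dictionary and function names are hypothetical, while the frequencies (4%, 6%, 10%) and the per-event penalties are the ones stated in the example and in Figure 3.34.

```python
# Event frequencies in percent of all instructions (from the example).
FREQ_PCT = {"unconditional": 4, "untaken": 6, "taken": 10}

# Penalties in clock cycles per event, as listed in Figure 3.34.
PENALTY = {
    "Flush pipeline":  {"unconditional": 2, "untaken": 3, "taken": 3},
    "Predict taken":   {"unconditional": 2, "untaken": 3, "taken": 2},
    "Predict untaken": {"unconditional": 2, "untaken": 0, "taken": 3},
}

def cpi_addition(scheme):
    """Stall cycles per instruction contributed by branches."""
    # Work in whole percentages first to avoid floating-point noise.
    cycles_per_100 = sum(FREQ_PCT[e] * PENALTY[scheme][e] for e in FREQ_PCT)
    return cycles_per_100 / 100

for scheme in PENALTY:
    print(f"{scheme:<16} {cpi_addition(scheme):.2f}")
```

Each result is the sum of frequency times penalty over the three event classes, so a scheme improves only on the events where its penalty is lower.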
                      Addition to the CPI
Branch scheme         Unconditional branches   Untaken conditional branches   Taken conditional branches   All branches
Frequency of event    4%                       6%                             10%                          20%
Stall pipeline        0.08                     0.18                           0.30                         0.56
Predict taken         0.08                     0.18                           0.20                         0.46
Predict untaken       0.08                     0.00                           0.30                         0.38

FIGURE 3.35 CPI penalties for three branch-prediction schemes and a deeper pipeline.

The differences among the schemes are substantially increased with this longer delay. If the base CPI was 1 and branches were the only source of stalls, the ideal pipeline would be 1.56 times faster than a pipeline that used the stall-pipeline scheme. The predict-untaken scheme would be 1.13 times better than the stall-pipeline scheme under the same assumptions.

As we will see in section 3.9, the R4000 uses a mixed strategy with a one-cycle, cancelling delayed branch for the first cycle of the branch penalty. For an unconditional branch, a single-cycle stall is always added. For conditional branches, the remaining two cycles of the branch penalty use a predict-not-taken scheme. We will see measurements of the effective branch penalties for this strategy later.

Static Branch Prediction: Using Compiler Technology

Delayed branches are a technique that exposes a pipeline hazard so that the compiler can reduce the penalty associated with the hazard. As we saw, the effectiveness of this technique partly depends on whether we correctly guess which way a branch will go. Being able to accurately predict a branch at compile time is also helpful for scheduling data hazards. Consider the following code segment:

        LW      R1,0(R2)
        SUB     R1,R1,R3
        BEQZ    R1,L
        OR      R4,R5,R6
        ADD     R10,R4,R3
L:      ADD     R7,R8,R9

The dependence of the SUB and BEQZ on the LW instruction means that a stall will be needed after the LW. Suppose we knew that this branch was almost always taken and that the value of R7 was not needed on the fall-through path.
Then we could increase the speed of the program by moving the instruction ADD R7,R8,R9 to the position after the LW. Correspondingly, if we knew the branch was rarely taken and that the value of R4 was not needed on the taken path, then we could contemplate moving the OR instruction after the LW. In addition, we can also use the information to better schedule any branch delay, since choosing how to schedule the delay depends on knowing the branch behavior.

To perform these optimizations, we need to predict the branch statically when we compile the program. In the next chapter, we will examine the use of dynamic prediction based on runtime program behavior. We will also look at a variety of compile-time methods for scheduling code; these techniques require static branch prediction and thus the ideas in this section are critical.

There are two basic methods we can use to statically predict branches: by examination of the program behavior and by the use of profile information collected from earlier runs of the program. We saw in Figure 3.25 (page 166) that most branches were taken for both forward and backward branches. Thus, the simplest scheme is to predict a branch as taken. This scheme has an average misprediction rate for the 10 programs in Figure 3.25 of the untaken branch frequency (34%). Unfortunately, the misprediction rate ranges from not very accurate (59%) to highly accurate (9%).

Another alternative is to predict on the basis of branch direction, choosing backward-going branches to be taken and forward-going branches to be not taken. For some programs and compilation systems, the frequency of forward taken branches may be significantly less than 50%, and this scheme will do better than just predicting all branches as taken. In our SPEC programs, however, more than half of the forward-going branches are taken. Hence, predicting all branches as taken is the better approach.
Even for other benchmarks or compilers, direction-based prediction is unlikely to generate an overall misprediction rate of less than 30% to 40%.

A more accurate technique is to predict branches on the basis of profile information collected from earlier runs. The key observation that makes this worthwhile is that the behavior of branches is often bimodally distributed; that is, an individual branch is often highly biased toward taken or untaken. Figure 3.36 shows the success of branch prediction using this strategy. The same input data were used for runs and for collecting the profile; other studies have shown that changing the input so that the profile is for a different run leads to only a small change in the accuracy of profile-based prediction.

[Figure 3.36: bar chart, not reproduced. Vertical axis: misprediction rate. Horizontal axis: benchmark.]

FIGURE 3.36 Misprediction rate for a profile-based predictor varies widely but is generally better for the FP programs, which have an average misprediction rate of 9% with a standard deviation of 4%, than for the integer programs, which have an average misprediction rate of 15% with a standard deviation of 5%. The actual performance depends on both the prediction accuracy and the branch frequency, which varies from 3% to 24% in Figure 3.31 (page 171); we will examine the combined effect in Figure 3.37.

While we can derive the prediction accuracy of a predict-taken strategy and measure the accuracy of the profile scheme, as in Figure 3.36, the wide range of frequency of conditional branches in these programs, from 3% to 24%, means that the overall frequency of a mispredicted branch varies widely. Figure 3.37 shows the number of instructions executed between mispredicted branches for both a profile-based and a predict-taken strategy.
The number varies widely, both because of the variation in accuracy and the variation in branch frequency. On average, the predict-taken strategy has 20 instructions per mispredicted branch and the profile-based strategy has 110. However, these averages are very different for integer and FP programs, as the data in Figure 3.37 show.

[Figure 3.37: bar chart on a log scale, not reproduced. Vertical axis: instructions between mispredictions. Horizontal axis: benchmark. Series: predict taken, profile based.]

FIGURE 3.37 Accuracy of a predict-taken strategy and a profile-based predictor as measured by the number of instructions executed between mispredicted branches and shown on a log scale. The average number of instructions between mispredictions is 20 for the predict-taken strategy and 110 for the profile-based prediction; however, the standard deviations are large: 27 instructions for the predict-taken strategy and 85 instructions for the profile-based scheme. This wide variation arises because programs such as su2cor have both low conditional branch frequency (3%) and predictable branches (85% accuracy for profiling), while eqntott has eight times the branch frequency with branches that are nearly 1.5 times less predictable. The difference between the FP and integer benchmarks as groups is large. For the predict-taken strategy, the average distance between mispredictions for the integer benchmarks is 10 instructions, while it is 30 instructions for the FP programs. With the profile scheme, the distance between mispredictions for the integer benchmarks is 46 instructions, while it is 173 instructions for the FP benchmarks.

Summary: Performance of the DLX Integer Pipeline

We close this section on hazard detection and elimination by showing the total distribution of idle clock cycles for our integer benchmarks when run on the DLX pipeline with software for pipeline scheduling.
(After we examine the DLX FP pipeline in section 3.7, we will examine the overall performance of the FP benchmarks.) Figure 3.38 shows the distribution of clock cycles lost to load and branch delays, which is obtained by combining the separate measurements shown in Figures 3.16 (page 157) and 3.31 (page 171). Overall the integer programs exhibit an average of 0.06 branch stalls per instruction and 0.05 load stalls per instruction, leading to an average CPI from pipelining (i.e., assuming a perfect memory system) of 1.11. Thus, with a perfect memory system and no clock overhead, pipelining could improve the performance of these five integer SPECint92 benchmarks by 5/1.11 or 4.5 times.

[Figure 3.38: bar chart, not reproduced. Vertical axis: percentage of all instructions that stall. Horizontal axis: benchmark (compress, eqntott, espresso, gcc, li). Series: branch stalls, load stalls.]

FIGURE 3.38 Percentage of the instructions that cause a stall cycle. This assumes a perfect memory system; the clock-cycle count and instruction count would be identical if there were no integer pipeline stalls. It also assumes the availability of both a basic delayed branch and a cancelling delayed branch, both with one cycle of delay. According to the graph, from 8% to 23% of the instructions cause a stall (or a cancelled instruction), leading to CPIs from pipeline stalls that range from 1.09 to 1.23. The pipeline scheduler fills load delays before branch delays, and this affects the distribution of delay cycles.

3.6 What Makes Pipelining Hard to Implement?

Now that we understand how to detect and resolve hazards, we can deal with some complications that we have avoided so far. The first part of this section considers the challenges of exceptional situations where the instruction execution order is changed in unexpected ways. In the second part of this section, we discuss some of the challenges raised by different instruction sets.
Dealing with Exceptions

Exceptional situations are harder to handle in a pipelined machine because the overlapping of instructions makes it more difficult to know whether an instruction can safely change the state of the machine. In a pipelined machine, an instruction is executed piece by piece and is not completed for several clock cycles. Unfortunately, other instructions in the pipeline can raise exceptions that may force the machine to abort the instructions in the pipeline before they complete. Before we discuss these problems and their solutions in detail, we need to understand what types of situations can arise and what architectural requirements exist for supporting them.

Types of Exceptions and Requirements

The terminology used to describe exceptional situations where the normal execution order of instructions is changed varies among machines. The terms interrupt, fault, and exception are used, though not in a consistent fashion. We use the term exception to cover all these mechanisms, including the following:

- I/O device request
- Invoking an operating system service from a user program
- Tracing instruction execution
- Breakpoint (programmer-requested interrupt)
- Integer arithmetic overflow
- FP arithmetic anomaly (see Appendix A)
- Page fault (not in main memory)
- Misaligned memory accesses (if alignment is required)
- Memory-protection violation
- Using an undefined or unimplemented instruction
- Hardware malfunctions
- Power failure

When we wish to refer to some particular class of such exceptions, we will use a longer name, such as I/O interrupt, floating-point exception, or page fault. Figure 3.39 shows the variety of different names for the common exception events above.

Exception event | IBM 360 | VAX | Motorola 680x0 | Intel 80x86
I/O device request | Input/output interruption | Device interrupt | Exception (Level 0...7 autovector) | Vectored interrupt
Invoking the operating system service from a user program | Supervisor call interruption | Exception (change mode supervisor trap) | Exception (unimplemented instruction), on Macintosh | Interrupt (INT instruction)
Tracing instruction execution | Not applicable | Exception (trace fault) | Exception (trace) | Interrupt (single-step trap)
Breakpoint | Not applicable | Exception (breakpoint fault) | Exception (illegal instruction or breakpoint) | Interrupt (breakpoint trap)
Integer arithmetic overflow or underflow; FP trap | Program interruption (overflow or underflow exception) | Exception (integer overflow trap or floating underflow fault) | Exception (floating-point coprocessor errors) | Interrupt (overflow trap or math unit exception)
Page fault (not in main memory) | Not applicable (only in 370) | Exception (translation not valid fault) | Exception (memory-management unit errors) | Interrupt (page fault)
Misaligned memory accesses | Program interruption (specification exception) | Not applicable | Exception (address error) | Not applicable
Memory protection violations | Program interruption (protection exception) | Exception (access control violation fault) | Exception (bus error) | Interrupt (protection exception)
Using undefined instructions | Program interruption (operation exception) | Exception (opcode privileged/reserved fault) | Exception (illegal instruction or breakpoint/unimplemented instruction) | Interrupt (invalid opcode)
Hardware malfunctions | Machine-check interruption | Exception (machine-check abort) | Exception (bus error) | Not applicable
Power failure | Machine-check interruption | Urgent interrupt | Not applicable | Nonmaskable interrupt

FIGURE 3.39 The names of common exceptions vary across four different architectures. Every event on the IBM 360 and 80x86 is called an interrupt, while every event on the 680x0 is called an exception. VAX divides events into interrupts or exceptions. Adjectives device, software, and urgent are used with VAX interrupts, while VAX exceptions are subdivided into faults, traps, and aborts.

Although we use the name exception to cover all of these events, individual events have important characteristics that determine what action is needed in the hardware. The requirements on exceptions can be characterized on five semi-independent axes:

1. Synchronous versus asynchronous: If the event occurs at the same place every time the program is executed with the same data and memory allocation, the event is synchronous. With the exception of hardware malfunctions, asynchronous events are caused by devices external to the processor and memory. Asynchronous events usually can be handled after the completion of the current instruction, which makes them easier to handle.

2. User requested versus coerced: If the user task directly asks for it, it is a user-request event. In some sense, user-requested exceptions are not really exceptions, since they are predictable. They are treated as exceptions, however, because the same mechanisms that are used to save and restore the state are used for these user-requested events. Because the only function of an instruction that triggers this exception is to cause the exception, user-requested exceptions can always be handled after the instruction has completed. Coerced exceptions are caused by some hardware event that is not under the control of the user program. Coerced exceptions are harder to implement because they are not predictable.

3. User maskable versus user nonmaskable: If an event can be masked or disabled by a user task, it is user maskable. This mask simply controls whether the hardware responds to the exception or not.

4. Within versus between instructions: This classification depends on whether the event prevents instruction completion by occurring in the middle of execution, no matter how short, or whether it is recognized between instructions. Exceptions that occur within instructions are usually synchronous, since the instruction triggers the exception. It is harder to implement exceptions that occur within instructions than those between instructions, since the instruction must be stopped and restarted. Asynchronous exceptions that occur within instructions arise from catastrophic situations (e.g., hardware malfunction) and always cause program termination.

5. Resume versus terminate: If the program's execution always stops after the interrupt, it is a terminating event. If the program's execution continues after the interrupt, it is a resuming event. It is easier to implement exceptions that terminate execution, since the machine need not be able to restart execution of the same program after handling the exception.

Figure 3.40 classifies the examples from Figure 3.39 according to these five categories. The difficult task is implementing interrupts occurring within instructions where the instruction must be resumed. Implementing such exceptions requires that another program must be invoked to save the state of the executing program, correct the cause of the exception, and then restore the state of the program before the instruction that caused the exception can be tried again. This process must be effectively invisible to the executing program. If a pipeline provides the ability for the machine to handle the exception, save the state, and restart without affecting the execution of the program, the pipeline or machine is said to be restartable. While early supercomputers and microprocessors often lacked this property, almost all machines today support it, at least for the integer pipeline, because it is needed to implement virtual memory (see Chapter 5).

Exception type | Synchronous vs. asynchronous | User request vs. coerced | User maskable vs. nonmaskable | Within vs. between instructions | Resume vs. terminate
I/O device request | Asynchronous | Coerced | Nonmaskable | Between | Resume
Invoke operating system | Synchronous | User request | Nonmaskable | Between | Resume
Tracing instruction execution | Synchronous | User request | User maskable | Between | Resume
Breakpoint | Synchronous | User request | User maskable | Between | Resume
Integer arithmetic overflow | Synchronous | Coerced | User maskable | Within | Resume
Floating-point arithmetic overflow or underflow | Synchronous | Coerced | User maskable | Within | Resume
Page fault | Synchronous | Coerced | Nonmaskable | Within | Resume
Misaligned memory accesses | Synchronous | Coerced | User maskable | Within | Resume
Memory-protection violations | Synchronous | Coerced | Nonmaskable | Within | Resume
Using undefined instructions | Synchronous | Coerced | Nonmaskable | Within | Terminate
Hardware malfunctions | Asynchronous | Coerced | Nonmaskable | Within | Terminate
Power failure | Asynchronous | Coerced | Nonmaskable | Within | Terminate

FIGURE 3.40 Five categories are used to define what actions are needed for the different exception types shown in Figure 3.39. Exceptions that must allow resumption are marked as resume, although the software may often choose to terminate the program. Synchronous, coerced exceptions occurring within instructions that can be resumed are the most difficult to implement. We might expect that memory protection access violations would always result in termination; however, modern operating systems use memory protection to detect events such as the first attempt to use a page or the first write to a page. Thus, processors should be able to resume after such exceptions.
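
The five-axis classification in Figure 3.40 can be treated as a small data structure. The following Python sketch is only an illustration: the class name, field names, and the helper function are hypothetical and do not come from the text, while the table entries are transcribed from Figure 3.40. It selects the exceptions the caption singles out as most difficult (synchronous, coerced, within an instruction, and resumable).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExceptionClass:
    """One row of Figure 3.40 (field names are illustrative)."""
    synchronous: bool         # False means asynchronous
    user_request: bool        # False means coerced
    user_maskable: bool       # False means nonmaskable
    within_instruction: bool  # False means between instructions
    resume: bool              # False means terminate


# Entries transcribed from Figure 3.40.
FIGURE_3_40 = {
    "I/O device request": ExceptionClass(False, False, False, False, True),
    "Invoke operating system": ExceptionClass(True, True, False, False, True),
    "Tracing instruction execution": ExceptionClass(True, True, True, False, True),
    "Breakpoint": ExceptionClass(True, True, True, False, True),
    "Integer arithmetic overflow": ExceptionClass(True, False, True, True, True),
    "Floating-point arithmetic overflow or underflow": ExceptionClass(True, False, True, True, True),
    "Page fault": ExceptionClass(True, False, False, True, True),
    "Misaligned memory accesses": ExceptionClass(True, False, True, True, True),
    "Memory-protection violations": ExceptionClass(True, False, False, True, True),
    "Using undefined instructions": ExceptionClass(True, False, False, True, False),
    "Hardware malfunctions": ExceptionClass(False, False, False, True, False),
    "Power failure": ExceptionClass(False, False, False, True, False),
}


def hardest_to_implement(table):
    """Return the synchronous, coerced, within-instruction, resumable exceptions."""
    return [
        name
        for name, c in table.items()
        if c.synchronous and not c.user_request and c.within_instruction and c.resume
    ]


print(hardest_to_implement(FIGURE_3_40))
```

Under these assumptions the filter returns five rows: integer overflow, floating-point overflow or underflow, page fault, misaligned access, and memory-protection violation.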

Stopping and Restarting Execution

As in unpipelined implementations, the most difficult exceptions have two properties: (1) they occur within instructions (that is, in the middle of the instruction execution corresponding to EX or MEM pipe stages), and (2) they must be restartable. In our DLX pipeline, for example, a virtual memory page fault resulting from a data fetch cannot occur until sometime in the MEM stage of the instruction. By the time that fault is seen, several other instructions will be in execution. A page fault must be restartable and requires the intervention of another process, such as the operating system. Thus, the pipeline must be safely shut down and the state saved so that the instruction can be restarted in the correct state.

Restarting is usually implemented by saving the PC of the instruction at which to restart. If the restarted instruction is not a branch, then we will continue to fetch the sequential successors and begin their execution in the normal fashion. If the restarted instruction is a branch, then we will reevaluate the branch condition and begin fetching from either the target or the fall through. When an exception occurs, the pipeline control can take the following steps to save the pipeline state safely:

1. Force a trap instruction into the pipeline on the next IF.

2. Until the trap is taken, turn off all writes for the faulting instruction and for all instructions that follow in the pipeline; this can be done by placing zeros into the pipeline latches of all instructions in the pipeline, starting with the instruction that generates the exception, but not those that precede that instruction. This prevents any state changes for instructions that will not be completed before the exception is handled.

3. After the exception-handling routine in the operating system receives control, it immediately saves the PC of the faulting instruction. This value will be used to return from the exception later.

When we use delayed branches, as mentioned in the last section, it is no longer possible to re-create the state of the machine with a single PC because the instructions in the pipeline may not be sequentially related. So we need to save and restore as many PCs as the length of the branch delay plus one. This is done in the third step above.

After the exception has been handled, special instructions return the machine from the exception by reloading the PCs and restarting the instruction stream (using the instruction RFE in DLX). If the pipeline can be stopped so that the instructions just before the faulting instruction are completed and those after it can be restarted from scratch, the pipeline is said to have precise exceptions. Ideally, the faulting instruction would not have changed the state, and correctly handling some exceptions requires that the faulting instruction have no effects. For other exceptions, such as floating-point exceptions, the faulting instruction on some machines writes its result before the exception can be handled. In such cases, the hardware must be prepared to retrieve the source operands, even if the destination is identical to one of the source operands. Because floating-point operations may run for many cycles, it is highly likely that some other instruction may have written the source operands (as we will see in the next section, floating-point operations often complete out of order). To overcome this, many recent high-performance machines have introduced two modes of operation. One mode has precise exceptions and the other (fast or performance mode) does not. Of course, the precise exception mode is slower, since it allows less overlap among floating-point instructions. In some high-performance machines, including Alpha 21064, Power-2, and MIPS R8000, the precise mode is often much slower (>10 times) and thus useful only for debugging of codes.
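
The three shutdown steps listed earlier in this subsection, together with the rule for delayed branches, can be sketched in a few lines of Python. This is a toy model under stated assumptions, not DLX hardware: the `InFlight` record, the single `writes_enabled` flag (standing in for the write-control bits that are zeroed in the pipeline latches), and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class InFlight:
    """One instruction in the pipeline (hypothetical model)."""
    pc: int
    stage: str
    writes_enabled: bool = True  # stands in for register/memory write controls


def shut_down_pipeline(pipeline: List[InFlight], faulting: int,
                       trap_pc: int) -> Tuple[int, int]:
    """Apply the three steps; pipeline is ordered oldest instruction first.

    Returns (next_fetch_pc, saved_pc).
    """
    # Step 1: the next IF fetches the trap instead of the sequential successor.
    next_fetch_pc = trap_pc
    # Step 2: turn off writes for the faulting instruction and everything
    # younger; older instructions (lower indices) are left alone to complete.
    for instr in pipeline[faulting:]:
        instr.writes_enabled = False
    # Step 3: the handler saves the PC of the faulting instruction for restart.
    saved_pc = pipeline[faulting].pc
    return next_fetch_pc, saved_pc


def pcs_to_save(branch_delay: int) -> int:
    """With delayed branches, save as many PCs as the branch delay plus one."""
    return branch_delay + 1


pipe = [InFlight(100, "WB"), InFlight(104, "MEM"), InFlight(108, "EX"),
        InFlight(112, "ID"), InFlight(116, "IF")]
print(shut_down_pipeline(pipe, faulting=1, trap_pc=0x80))
print([i.writes_enabled for i in pipe])
```

In this example the instruction in MEM faults, so only the older instruction in WB keeps its writes enabled, and the saved PC is that of the faulting instruction.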

Supporting precise exceptions is a requirement in many systems, while in others it is “just” valuable because it simplifies the operating system interface. At a minimum, any machine with demand paging or IEEE arithmetic trap handlers must make its exceptions precise, either in the hardware or with some software support. For integer pipelines, the task of creating precise exceptions is easier, and accommodating virtual memory strongly motivates the support of precise exceptions for memory references. In practice, these reasons have led designers and architects to always provide precise exceptions for the integer pipeline. In this section we describe how to implement precise exceptions for the DLX integer pipeline. We will describe techniques for handling the more complex challenges arising in the FP pipeline in section 3.7.

Exceptions in DLX

Figure 3.41 shows the DLX pipeline stages and which “problem” exceptions might occur in each stage. With pipelining, multiple exceptions may occur in the same clock cycle because there are multiple instructions in execution. For example, consider this instruction sequence:

LW  | IF | ID | EX | MEM | WB |
ADD |    | IF | ID | EX  | MEM | WB

This pair of instructions can cause a data page fault and an arithmetic exception at the same time, since the LW is in the MEM stage while the ADD is in the EX stage. This case can be handled by dealing with only the data page fault and then restarting the execution. The second exception will reoccur (but not the first, if the software is correct), and when the second exception occurs, it can be handled independently.

In reality, the situation is not as straightforward as this simple example. Exceptions may occur out of order; that is, an instruction may cause an exception before an earlier instruction causes one. Consider again the above sequence of instructions, LW followed by ADD.
The LW can get a data page fault, seen when the instruction is in MEM, and the ADD can get an instruction page fault, seen when the ADD instruction is in IF. The instruction page fault will actually occur first, even though it is caused by a later instruction! Since we are implementing precise exceptions, the pipeline is required to handle the exception caused by the LW instruction first.

Pipeline stage | Problem exceptions occurring
IF | Page fault on instruction fetch; misaligned memory access; memory-protection violation
ID | Undefined or illegal opcode
EX | Arithmetic exception
MEM | Page fault on data fetch; misaligned memory access; memory-protection violation
WB | None

FIGURE 3.41 Exceptions that may occur in the DLX pipeline. Exceptions raised from instruction or data-memory access account for six out of eight cases.

To explain how this works, let's call the instruction in the position of the LW instruction i, and the instruction in the position of the ADD instruction i + 1. The pipeline cannot simply handle an exception when it occurs in time, since that will lead to exceptions occurring out of the unpipelined order. Instead, the hardware posts all exceptions caused by a given instruction in a status vector associated with that instruction. The exception status vector is carried along as the instruction goes down the pipeline. Once an exception indication is set in the exception status vector, any control signal that may cause a data value to be written is turned off (this includes both register writes and memory writes). Because a store can cause an exception during MEM, the hardware must be prepared to prevent the store from completing if it raises an exception. When an instruction enters WB (or is about to leave MEM), the exception status vector is checked.
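
The posting-and-checking scheme can be modeled with a short Python sketch. Everything here is an illustrative assumption rather than the DLX implementation: the `Instr` class, the representation of the status vector as a list of stage names, and the `first_exception` helper are hypothetical. The sketch reproduces the LW/ADD case: the ADD posts its instruction page fault earlier in time, yet the check made in program order as instructions reach WB reports the LW fault first.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Stages that can raise exceptions, in pipeline order (WB raises none).
STAGES = ["IF", "ID", "EX", "MEM"]


@dataclass
class Instr:
    name: str
    status: List[str] = field(default_factory=list)  # exception status vector
    writes_enabled: bool = True

    def post(self, stage: str) -> None:
        """Record an exception seen in `stage` and disable all writes."""
        self.status.append(stage)
        self.writes_enabled = False


def first_exception(program_order: List[Instr]) -> Optional[Tuple[str, str]]:
    """Check vectors in the order instructions reach WB (program order).

    Returns the earliest instruction with a posted exception, and the
    earliest pipe stage in which that instruction posted one.
    """
    for instr in program_order:
        if instr.status:
            return instr.name, min(instr.status, key=STAGES.index)
    return None


lw, add = Instr("LW"), Instr("ADD")
add.post("IF")   # instruction page fault: occurs first in time
lw.post("MEM")   # data page fault: occurs later in time
print(first_exception([lw, add]))
```

Because both instructions have writes disabled once they post, neither can change machine state before the handler runs.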
If any exceptions are posted, they are handled in the order in which they would occur in time on an unpipelined machine: the exception corresponding to the earliest instruction (and usually the earliest pipe stage for that instruction) is handled first. This guarantees that all exceptions will be seen on instruction i before any are seen on i + 1. Of course, any action taken in earlier pipe stages on behalf of instruction i may be invalid, but since writes to the register file and memory were disabled, no state could have been changed. As we will see in section 3.7, maintaining this precise model for FP operations is much harder.

In the next subsection we describe problems that arise in implementing exceptions in the pipelines of machines with more powerful, longer-running instructions.

Instruction Set Complications

No DLX instruction has more than one result, and our DLX pipeline writes that result only at the end of an instruction's execution. When an instruction is guaranteed to complete it is called committed. In the DLX integer pipeline, all instructions are committed when they reach the end of the MEM stage (or beginning of WB) and no instruction updates the state before that stage. Thus, precise exceptions are straightforward. Some machines have instructions that change the state in the middle of the instruction execution, before the instruction and its predecessors are guaranteed to complete. For example, autoincrement addressing modes on the VAX cause the update of registers in the middle of an instruction execution. In such a case, if the instruction is aborted because of an exception, it will leave the machine state altered. Although we know which instruction caused the exception, without additional hardware support the exception will be imprecise because the instruction will be half finished. Restarting the instruction stream after such an imprecise exception is difficult.
Alternatively, we could avoid updating the state before the instruction commits, but this may be difficult or costly, since there may be dependences on the updated state: Consider a VAX instruction that autoincrements the same register multiple times. Thus, to maintain a precise exception model, most machines with such instructions have the ability to back out any state changes made before the instruction is committed. If an exception occurs, the machine uses this ability to reset the state of the machine to its value before the interrupted instruction started. In the next section, we will see that a more powerful DLX floating-point pipeline can introduce similar problems, and the next chapter introduces techniques that substantially complicate exception handling.

A related source of difficulties arises from instructions that update memory state during execution, such as the string copy operations on the VAX or 360. To make it possible to interrupt and restart these instructions, the instructions are defined to use the general-purpose registers as working registers. Thus the state of the partially completed instruction is always in the registers, which are saved on an exception and restored after the exception, allowing the instruction to continue. In the VAX an additional bit of state records when an instruction has started updating the memory state, so that when the pipeline is restarted, the machine knows whether to restart the instruction from the beginning or from the middle of the instruction. The 80x86 string instructions also use the registers as working storage, so that saving and restoring the registers saves and restores the state of such instructions.

A different set of difficulties arises from odd bits of state that may create additional pipeline hazards or may require extra hardware to save and restore. Condition codes are a good example of this. Many machines set the condition codes implicitly as part of the instruction.
This approach has advantages, since condition codes decouple the evaluation of the condition from the actual branch. However, implicitly set condition codes can cause difficulties in scheduling any pipeline delays between setting the condition code and the branch, since most instructions set the condition code and cannot be used in the delay slots between the condition evaluation and the branch.

Additionally, in machines with condition codes, the processor must decide when the branch condition is fixed. This involves finding out when the condition code has been set for the last time before the branch. In most machines with implicitly set condition codes, this is done by delaying the branch condition evaluation until all previous instructions have had a chance to set the condition code.

Of course, architectures with explicitly set condition codes allow the delay between condition test and the branch to be scheduled; however, pipeline control must still track the last instruction that sets the condition code to know when the branch condition is decided. In effect, the condition code must be treated as an operand that requires hazard detection for RAW hazards with branches, just as DLX must do on the registers.

A final thorny area in pipelining is multicycle operations. Imagine trying to pipeline a sequence of VAX instructions such as this:

MOVL   R1,R2
ADDL3  42(R1),56(R1)+,@(R1)
SUBL2  R2,R3
MOVC3  @(R1)[R2],74(R2),R3

These instructions differ radically in the number of clock cycles they will require, from as low as one up to hundreds of clock cycles. They also require different numbers of data memory accesses, from zero to possibly hundreds. The data hazards are very complex and occur both between and within instructions.
The simple solution of making all instructions execute for the same number of clock cycles is unacceptable, because it introduces an enormous number of hazards and bypass conditions and makes an immensely long pipeline. Pipelining the VAX at the instruction level is difficult, but a clever solution was found by the VAX 8800 designers. They pipeline the microinstruction execution: a microinstruction is a simple instruction used in sequences to implement a more complex instruction set. Because the microinstructions are simple (they look a lot like DLX), the pipeline control is much easier. While it is not clear that this approach can achieve quite as low a CPI as an instruction-level pipeline for the VAX, it is much simpler, possibly leading to a shorter clock cycle.

In comparison, load-store machines have simple operations with similar amounts of work and pipeline more easily. If architects realize the relationship between instruction set design and pipelining, they can design architectures for more efficient pipelining. In the next section we will see how the DLX pipeline deals with long-running instructions, specifically floating-point operations.

3.7 Extending the DLX Pipeline to Handle Multicycle Operations

We now want to explore how our DLX pipeline can be extended to handle floating-point operations. This section concentrates on the basic approach and the design alternatives, closing with some performance measurements of a DLX floating-point pipeline.

It is impractical to require that all DLX floating-point operations complete in one clock cycle, or even in two. Doing so would mean accepting a slow clock, or using enormous amounts of logic in the floating-point units, or both. Instead, the floating-point pipeline will allow for a longer latency for operations. This is easier to grasp if we imagine the floating-point instructions as having the same pipeline as the integer instructions, with two important changes.
First, the EX cycle may be repeated as many times as needed to complete the operation; the number of repetitions can vary for different operations. Second, there may be multiple floating-point functional units. A stall will occur if the instruction to be issued will either cause a structural hazard for the functional unit it uses or cause a data hazard.

For this section, let's assume that there are four separate functional units in our DLX implementation:

1. The main integer unit that handles loads and stores, integer ALU operations, and branches.

2. FP and integer multiplier.

3. FP adder that handles FP add, subtract, and conversion.

4. FP and integer divider.

If we also assume that the execution stages of these functional units are not pipelined, then Figure 3.42 shows the resulting pipeline structure. Because EX is not pipelined, no other instruction using that functional unit may issue until the previous instruction leaves EX. Moreover, if an instruction cannot proceed to the EX stage, the entire pipeline behind that instruction will be stalled.

[Figure 3.42: block diagram. IF and ID feed one of four parallel EX units (integer unit, FP/integer multiply, FP adder, FP/integer divider), all of which feed MEM and then WB.]

FIGURE 3.42 The DLX pipeline with three additional unpipelined, floating-point, functional units. Because only one instruction issues on every clock cycle, all instructions go through the standard pipeline for integer operations. The floating-point operations simply loop when they reach the EX stage. After they have finished the EX stage, they proceed to MEM and WB to complete execution.

In reality, the intermediate results are probably not cycled around the EX unit as Figure 3.42 suggests; instead, the EX pipeline stage has some number of clock delays larger than 1. We can generalize the structure of the FP pipeline shown in Figure 3.42 to allow pipelining of some stages and multiple ongoing operations.
To describe such a pipeline, we must define both the latency of the functional units and also the initiation interval or repeat interval. We define latency the same way we defined it earlier: the number of intervening cycles between an instruction that produces a result and an instruction that uses the result. The initiation or repeat interval is the number of cycles that must elapse between issuing two operations of a given type. For example, we will use the latencies and initiation intervals shown in Figure 3.43.

Functional unit | Latency | Initiation interval
Integer ALU | 0 | 1
Data memory (integer and FP loads) | 1 | 1
FP add | 3 | 1
FP multiply (also integer multiply) | 6 | 1
FP divide (also integer divide) | 24 | 25

FIGURE 3.43 Latencies and initiation intervals for functional units.

With this definition of latency, integer ALU operations have a latency of 0, since the results can be used on the next clock cycle, and loads have a latency of 1, since their results can be used after one intervening cycle. Since most operations consume their operands at the beginning of EX, the latency is usually the number of stages after EX that an instruction produces a result: for example, zero stages for ALU operations and one stage for loads. The primary exception is stores, which consume the value being stored one cycle later. Hence the latency to a store for the value being stored, but not for the base address register, will be one cycle less. Pipeline latency is essentially equal to one cycle less than the depth of the execution pipeline, which is the number of stages from the EX stage to the stage that produces the result. Thus, for the example pipeline just above, the number of stages in an FP add is four, while the number of stages in an FP multiply is seven. To achieve a higher clock rate, designers need to put fewer logic levels in each pipe stage, which makes the number of pipe stages required for more complex operations larger.
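
The definitions of latency and initiation interval translate directly into stall counts. The Python sketch below is a minimal illustration that assumes full forwarding and ignores every other hazard; the dictionary keys and both function names are hypothetical, while the numbers are those of Figure 3.43.

```python
# (latency, initiation interval) per functional unit, from Figure 3.43.
FIGURE_3_43 = {
    "int_alu": (0, 1),
    "load": (1, 1),
    "fp_add": (3, 1),
    "fp_mul": (6, 1),
    "fp_div": (24, 25),
}


def raw_stalls(producer: str, distance: int = 1,
               consumer_is_store_data: bool = False) -> int:
    """Stall cycles for a consumer issued `distance` instructions after producer.

    Latency is the number of intervening cycles needed between producer and
    consumer; a store needs the value being stored one cycle later, so its
    effective latency for that operand is one cycle less.
    """
    latency, _ = FIGURE_3_43[producer]
    if consumer_is_store_data:
        latency = max(0, latency - 1)
    intervening = distance - 1  # instructions already between the two
    return max(0, latency - intervening)


def structural_stalls(unit: str, cycles_since_last_issue: int) -> int:
    """Stall cycles before `unit` can accept another operation."""
    _, interval = FIGURE_3_43[unit]
    return max(0, interval - cycles_since_last_issue)


print(raw_stalls("load"))                                  # dependent op right after a load
print(raw_stalls("fp_mul"))                                # dependent op right after an FP multiply
print(raw_stalls("fp_add", consumer_is_store_data=True))   # store of an FP add result
print(structural_stalls("fp_div", 1))                      # back-to-back divides
```

With these numbers, an instruction that immediately follows an FP multiply and uses its result waits six cycles, while a store of an FP add result waits two rather than three. Structural conflicts at MEM or WB are not modeled here.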
The penalty for the faster clock rate is thus longer latency for operations.

The example pipeline structure in Figure 3.43 allows up to four outstanding FP adds, seven outstanding FP/integer multiplies, and one FP divide. Figure 3.44 shows how this pipeline can be drawn by extending Figure 3.42. The repeat interval is implemented in Figure 3.44 by adding additional pipeline stages, which will be separated by additional pipeline registers. Because the units are independent, we name the stages differently. The pipeline stages that take multiple clock cycles, such as the divide unit, are further subdivided to show the latency of those stages. Because they are not complete stages, only one operation may be active.

The pipeline structure can also be shown using the familiar diagrams from earlier in the chapter, as Figure 3.45 shows for a set of independent FP operations and FP loads and stores. Naturally, the longer latency of the FP operations increases the frequency of RAW hazards and resultant stalls, as we will see later in this section.

[Figure 3.44: block diagram. IF and ID feed one of four parallel paths, all of which feed MEM and then WB. Integer unit: EX. FP/integer multiply: M1, M2, M3, M4, M5, M6, M7. FP adder: A1, A2, A3, A4. FP/integer divider: DIV.]

FIGURE 3.44 A pipeline that supports multiple outstanding FP operations. The FP multiplier and adder are fully pipelined and have a depth of seven and four stages, respectively. The FP divider is not pipelined, but requires 24 clock cycles to complete. The latency in instructions between the issue of an FP operation and the use of the result of that operation without incurring a RAW stall is determined by the number of cycles spent in the execution stages. For example, the fourth instruction after an FP add can use the result of the FP add. For integer ALU operations, the depth of the execution pipeline is always one and the next instruction can use the results.
Both FP loads and integer loads complete during MEM, which means that the memory system must provide either 32 or 64 bits in a single clock.

MULTD | IF | ID | M1 | M2 | M3 | M4 | M5 | M6 | M7 | MEM | WB
ADDD | IF | ID | A1 | A2 | A3 | A4 | MEM | WB
LD | IF | ID | EX | MEM | WB
SD | IF | ID | EX | MEM | WB

FIGURE 3.45 The pipeline timing of a set of independent FP operations. The stages in italics show where data is needed, while the stages in bold show where a result is available. FP loads and stores use a 64-bit path to memory so that the pipelining timing is just like an integer load or store.

The structure of the pipeline in Figure 3.44 requires the introduction of the additional pipeline registers (e.g., A1/A2, A2/A3, A3/A4) and the modification of the connections to those registers. The ID/EX register must be expanded to connect ID to EX, DIV, M1, and A1; we can refer to the portion of the register associated with one of the next stages with the notation ID/EX, ID/DIV, ID/M1, or ID/A1. The pipeline register between ID and all the other stages may be thought of as logically separate registers and may, in fact, be implemented as separate registers. Because only one operation can be in a pipe stage at a time, the control information can be associated with the register at the head of the stage.

Hazards and Forwarding in Longer Latency Pipelines

There are a number of different aspects to the hazard detection and forwarding for a pipeline like that in Figure 3.44:

1. Because the divide unit is not fully pipelined, structural hazards can occur. These will need to be detected and issuing instructions will need to be stalled.

2. Because the instructions have varying running times, the number of register writes required in a cycle can be larger than 1.

3. WAW hazards are possible, since instructions no longer reach WB in order. Note that WAR hazards are not possible, since the register reads always occur in ID.

4. Instructions can complete in a different order than they were issued, causing problems with exceptions; we deal with this in the next subsection.

5. Because of longer latency of operations, stalls for RAW hazards will be more frequent.

The increase in stalls arising from longer operation latencies is fundamentally the same as that for the integer pipeline. Before describing the new problems that arise in this FP pipeline and looking at solutions, let's examine the potential impact of RAW hazards. Figure 3.46 shows a typical FP code sequence and the resultant stalls. At the end of this section, we'll examine the performance of this FP pipeline for our SPEC subset.

Now look at the problems arising from writes, described as (2) and (3) in the list above. If we assume the FP register file has one write port, sequences of FP operations, as well as an FP load together with FP operations, can cause conflicts for the register write port. Consider the pipeline sequence shown in Figure 3.47. In clock cycle 11, all three instructions will reach WB and want to write the register file. With only a single register file write port, the machine must serialize the instruction completion. This single register port represents a structural hazard. We could increase the number of write ports to solve this, but that solution may be unattractive since the additional write ports would be used only rarely. This is because the maximum steady state number of write ports needed is 1. Instead, we choose to detect and enforce access to the write port as a structural hazard.

Clock cycle number
Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17
LD F4,0(R2) | IF | ID | EX | MEM | WB | | | | | | | | | | | |
MULTD F0,F4,F6 | | IF | ID | stall | M1 | M2 | M3 | M4 | M5 | M6 | M7 | MEM | WB | | | |
ADDD F2,F0,F8 | | | IF | stall | ID | stall | stall | stall | stall | stall | stall | A1 | A2 | A3 | A4 | MEM |
SD 0(R2),F2 | | | | | IF | stall | stall | stall | stall | stall | stall | ID | EX | stall | stall | stall | MEM

FIGURE 3.46 A typical FP code sequence showing the stalls arising from RAW hazards.
The longer pipeline substantially raises the frequency of stalls versus the shallower integer pipeline. Each instruction in this sequence is dependent on the previous and proceeds as soon as data are available, which assumes the pipeline has full bypassing and forwarding. The SD must be stalled an extra cycle so that its MEM does not conflict with the ADDD. Extra hardware could easily handle this case.

Clock cycle number
Instruction       1   2   3   4   5   6   7   8   9   10  11
MULTD F0,F4,F6    IF  ID  M1  M2  M3  M4  M5  M6  M7  MEM WB
...                   IF  ID  EX  MEM WB
...                       IF  ID  EX  MEM WB
ADDD F2,F4,F6                 IF  ID  A1  A2  A3  A4  MEM WB
...                               IF  ID  EX  MEM WB
...                                   IF  ID  EX  MEM WB
LD F2,0(R2)                               IF  ID  EX  MEM WB

FIGURE 3.47 Three instructions want to perform a write back to the FP register file simultaneously, as shown in clock cycle 11. This is not the worst case, since an earlier divide in the FP unit could also finish on the same clock. Note that although the MULTD, ADDD, and LD all are in the MEM stage in clock cycle 10, only the LD actually uses the memory, so no structural hazard exists for MEM.

There are two different ways to implement this interlock. The first is to track the use of the write port in the ID stage and to stall an instruction before it issues, just as we would for any other structural hazard. Tracking the use of the write port can be done with a shift register that indicates when already-issued instructions will use the register file. If the instruction in ID needs to use the register file at the same time as an instruction already issued, the instruction in ID is stalled for a cycle. On each clock the reservation register is shifted one bit. This implementation has an advantage: It maintains the property that all interlock detection and stall insertion occurs in the ID stage. The cost is the addition of the shift register and write conflict logic. We will assume this scheme throughout this section.
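The shift-register scheme can be illustrated with a small simulation. The following Python sketch is not taken from the text: the class name, the `schedule` helper, and the ID-to-WB distances (counted from the stage depths in Figure 3.44) are assumptions made for illustration, and the intervening instructions of Figure 3.47 are ignored.

```python
class WritePortReservation:
    """Shift register for the single FP register-file write port.

    Bit d is set when an already-issued instruction will write the
    register file d cycles from now.
    """

    def __init__(self):
        self.bits = 0

    def is_free(self, distance):
        return ((self.bits >> distance) & 1) == 0

    def reserve(self, distance):
        self.bits |= 1 << distance

    def shift(self):
        # Called once per clock: every reservation moves one cycle closer.
        self.bits >>= 1


# Cycles from ID to WB (assumed from Figure 3.44: EX/A1..A4/M1..M7, MEM, WB).
ID_TO_WB = {"load": 3, "fp_add": 6, "fp_mul": 9}


def schedule(instructions):
    """Return the WB cycle of each (name, kind, earliest_id_cycle) entry,
    issuing in program order and stalling in ID on a write-port conflict."""
    port = WritePortReservation()
    cycle = 0
    wb_cycle = {}
    for name, kind, earliest_id in instructions:
        while cycle < earliest_id:         # wait until the instruction reaches ID
            port.shift()
            cycle += 1
        distance = ID_TO_WB[kind]
        while not port.is_free(distance):  # stall in ID one cycle at a time
            port.shift()
            cycle += 1
        port.reserve(distance)
        wb_cycle[name] = cycle + distance
        port.shift()                       # the next instruction is in ID a cycle later
        cycle += 1
    return wb_cycle


figure_3_47 = [("MULTD", "fp_mul", 2), ("ADDD", "fp_add", 5), ("LD", "load", 8)]
print(schedule(figure_3_47))
```

Under these assumptions the three writes that collide in clock cycle 11 of Figure 3.47 are spread over cycles 11, 12, and 13: the ADDD stalls one cycle in ID and the LD stalls two.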
An alternative scheme is to stall a conflicting instruction when it tries to enter either the MEM or WB stage. If we wait to stall the conflicting instructions until they want to enter the MEM or WB stage, we can choose to stall either instruction. A simple, though sometimes suboptimal, heuristic is to give priority to the unit with the longest latency, since that is the one most likely to have caused another instruction to be stalled for a RAW hazard. The advantage of this scheme is that it does not require us to detect the conflict until the entrance of the MEM or WB stage, where it is easy to see. The disadvantage is that it complicates pipeline control, as stalls can now arise from two places. Notice that stalling before entering MEM will cause the EX, A4, or M7 stage to be occupied, possibly forcing the stall to trickle back in the pipeline. Likewise, stalling before WB would cause MEM to back up.

Our other problem is the possibility of WAW hazards. To see that these exist, consider the example in Figure 3.47. If the LD instruction were issued one cycle earlier and had a destination of F2, then it would create a WAW hazard, because it would write F2 one cycle earlier than the ADDD. Note that this hazard only occurs when the result of the ADDD is overwritten without any instruction ever using it! If there were a use of F2 between the ADDD and the LD, the pipeline would need to be stalled for a RAW hazard, and the LD would not issue until the ADDD was completed. We could argue that, for our pipeline, WAW hazards only occur when a useless instruction is executed, but we must still detect them and make sure that the result of the LD appears in F2 when we are done. (As we will see in section 3.10, such sequences sometimes do occur in reasonable code.)

There are two possible ways to handle this WAW hazard. The first approach is to delay the issue of the load instruction until the ADDD enters MEM.
The second approach is to stamp out the result of the ADDD by detecting the hazard and changing the control so that the ADDD does not write its result. Then, the LD can issue right away. Because this hazard is rare, either scheme will work fine: you can pick whatever is simpler to implement. In either case, the hazard can be detected during ID when the LD is issuing. Then stalling the LD or making the ADDD a no-op is easy. The difficult situation is to detect that the LD might finish before the ADDD, because that requires knowing the length of the pipeline and the current position of the ADDD. Luckily, this code sequence (two writes with no intervening read) will be very rare, so we can use a simple solution: If an instruction in ID wants to write the same register as an instruction already issued, do not issue the instruction to EX. In the next chapter, we will see how additional hardware can eliminate stalls for such hazards.

First, let's put together the pieces for implementing the hazard and issue logic in our FP pipeline. In detecting the possible hazards, we must consider hazards among FP instructions, as well as hazards between an FP instruction and an integer instruction. Except for FP loads-stores and FP-integer register moves, the FP and integer registers are distinct. All integer instructions operate on the integer registers, while the floating-point operations operate only on their own registers. Thus, we need only consider FP loads-stores and FP register moves in detecting hazards between FP and integer instructions. This simplification of pipeline control is an additional advantage of having separate register files for integer and floating-point data. (The main advantages are a doubling of the number of registers, without making either set larger, and an increase in bandwidth without adding more ports to either set.
The main disadvantage, beyond the need for an extra register file, is the small cost of occasional moves needed between the two register sets.)

Assuming that the pipeline does all hazard detection in ID, there are three checks that must be performed before an instruction can issue:

1. Check for structural hazards: Wait until the required functional unit is not busy (this is only needed for divides in this pipeline) and make sure the register write port is available when it will be needed.
2. Check for a RAW data hazard: Wait until the source registers are not listed as pending destinations in a pipeline register that will not be available when this instruction needs the result. A number of checks must be made here, depending on both the source instruction, which determines when the result will be available, and the destination instruction, which determines when the value is needed. For example, if the instruction in ID is an FP operation with source register F2, then F2 cannot be listed as a destination in ID/A1, A1/A2, or A2/A3, which correspond to FP add instructions that will not be finished when the instruction in ID needs a result. (ID/A1 is the portion of the output register of ID that is sent to A1.) Divide is somewhat trickier if we want to allow the last few cycles of a divide to be overlapped, since a divide that is close to finishing must then be handled as a special case. In practice, designers might ignore this optimization in favor of a simpler issue test.
3. Check for a WAW data hazard: Determine if any instruction in A1, ..., A4, D, M1, ..., M7 has the same register destination as this instruction. If so, stall the issue of the instruction in ID.

Although the hazard detection is more complex with the multicycle FP operations, the concepts are the same as for the DLX integer pipeline. The same is true for the forwarding logic.
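The three issue checks can be sketched as a single function evaluated in ID. This Python fragment is an illustration only: the record types, field names, and returned strings are hypothetical. It uses the simpler divide test mentioned above (no overlap with an unfinished divide) and assumes a multiply result is forwardable on the same terms as the add example, that is, once the producer is in its last execution stage.

```python
from dataclasses import dataclass

# Execution depth of the pipelined FP units in Figure 3.44.
DEPTH = {"add": 4, "mul": 7}


@dataclass
class InFlight:
    unit: str    # "add", "mul", or "div"
    stage: int   # execution stage currently occupied (1 means A1 or M1)
    dest: str    # destination FP register


@dataclass
class Candidate:
    unit: str    # "add", "mul", "div", or "load"
    dest: str    # destination register
    srcs: tuple  # source registers


def result_ready_next_cycle(op):
    """True if the producer is in its last execution stage, so its result
    can be forwarded to an instruction that leaves ID this cycle."""
    if op.unit == "div":
        return False  # simple test: never overlap with an unfinished divide
    return op.stage == DEPTH[op.unit]


def issue_check(instr, in_flight, write_port_free=True):
    """Return "issue" or the kind of hazard that stalls the instruction in ID."""
    # 1. Structural hazards: the unpipelined divider and the write port.
    if instr.unit == "div" and any(op.unit == "div" for op in in_flight):
        return "structural"
    if not write_port_free:
        return "structural"
    # 2. RAW hazard: a source is still pending in an earlier stage.
    for op in in_flight:
        if op.dest in instr.srcs and not result_ready_next_cycle(op):
            return "RAW"
    # 3. WAW hazard: any in-flight FP operation writes the same register.
    if any(op.dest == instr.dest for op in in_flight):
        return "WAW"
    return "issue"


adds_f2 = Candidate("add", "F4", ("F2", "F6"))
print([issue_check(adds_f2, [InFlight("add", s, "F2")]) for s in (1, 2, 3, 4)])
```

With an FP add that writes F2 in A1, A2, or A3, the dependent instruction stalls for a RAW hazard; once the add is in A4 the instruction can issue, matching the ID/A1, A1/A2, A2/A3 example in check 2.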
The forwarding can be implemented by checking if the destination register in any of EX/MEM, A4/MEM, M7/MEM, D/MEM, or MEM/WB registers is one of the source registers of a floating-point instruction. If so, the appropriate input multiplexer will have to be enabled so as to choose the forwarded data. In the Exercises, you will have the opportunity to specify the logic for the RAW and WAW hazard detection as well as for forwarding.

Multicycle FP operations also introduce problems for our exception mechanisms, which we deal with next.

Maintaining Precise Exceptions

Another problem caused by these long-running instructions can be illustrated with the following sequence of code:

    DIVF    F0,F2,F4
    ADDF    F10,F10,F8
    SUBF    F12,F12,F14

This code sequence looks straightforward; there are no dependences. A problem arises, however, because an instruction issued early may complete after an instruction issued later. In this example, we can expect ADDF and SUBF to complete before the DIVF completes. This is called out-of-order completion and is common in pipelines with long-running operations. Because hazard detection will prevent any dependence among instructions from being violated, why is out-of-order completion a problem? Suppose that the SUBF causes a floating-point arithmetic exception at a point where the ADDF has completed but the DIVF has not. The result will be an imprecise exception, something we are trying to avoid. It may appear that this could be handled by letting the floating-point pipeline drain, as we do for the integer pipeline. But the exception may be in a position where this is not possible. For example, if the DIVF decided to take a floating-point-arithmetic exception after the add completed, we could not have a precise exception at the hardware level. In fact, because the ADDF destroys one of its operands, we could not restore the state to what it was before the DIVF, even with software help.
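The out-of-order completion in this sequence can be checked with a few lines of arithmetic. The sketch below is illustrative: it assumes the three instructions are fetched in consecutive cycles with no stalls, that SUBF uses the four-stage FP adder, and that the divide occupies its unit for the 24 cycles quoted for Figure 3.44.

```python
# Execution cycles per opcode (assumed: 24-cycle divide, four-stage adder).
EXEC_CYCLES = {"DIVF": 24, "ADDF": 4, "SUBF": 4}


def write_back_cycle(fetch_cycle, opcode):
    # IF, then ID, then the execution cycles, then MEM, then WB.
    return fetch_cycle + 1 + EXEC_CYCLES[opcode] + 2


program = ["DIVF", "ADDF", "SUBF"]
wb = {op: write_back_cycle(i + 1, op) for i, op in enumerate(program)}
completion_order = sorted(program, key=wb.get)

print(wb)
print(completion_order)
```

Under these assumptions the ADDF and SUBF write back in cycles 9 and 10, long before the DIVF reaches WB in cycle 28, so an exception raised late by the DIVF finds F10 already overwritten.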
This problem arises because instructions are completing in a different order than they were issued. There are four possible approaches to dealing with out-of-order completion. The first is to ignore the problem and settle for imprecise exceptions. This approach was used in the 1960s and early 1970s. It is still used in some supercomputers, where certain classes of exceptions are not allowed or are handled by the hardware without stopping the pipeline. It is difficult to use this approach in most machines built today because of features such as virtual memory and the IEEE floating-point standard, which essentially require precise exceptions through a combination of hardware and software. As mentioned earlier, some recent machines have solved this problem by introducing two modes of execution: a fast, but possibly imprecise mode and a slower, precise mode. The slower precise mode is implemented either with a mode switch or by insertion of explicit instructions that test for FP exceptions. In either case the amount of overlap and reordering permitted in the FP pipeline is significantly restricted so that effectively only one FP instruction is active at a time. This solution is used in the DEC Alpha 21064 and 21164, in the IBM Power-1 and Power-2, and in the MIPS R8000.

A second approach is to buffer the results of an operation until all the operations that were issued earlier are complete. Some machines actually use this solution, but it becomes expensive when the difference in running times among operations is large, since the number of results to buffer can become large. Furthermore, results from the queue must be bypassed to continue issuing instructions while waiting for the longer instruction. This requires a large number of comparators and a very large multiplexer.

There are two viable variations on this basic approach. The first is a history file, used in the CYBER 180/990. The history file keeps track of the original values of registers.
When an exception occurs and the state must be rolled back earlier than some instruction that completed out of order, the original value of the register can be restored from the history file. A similar technique is used for autoincrement and autodecrement addressing on machines like VAXes. Another approach, the future file, proposed by J. Smith and A. Pleszkun [1988], keeps the newer value of a register; when all earlier instructions have completed, the main register file is updated from the future file. On an exception, the main register file has the precise values for the interrupted state. In the next chapter (section 4.6), we will see extensions of this idea, which are used in processors such as the PowerPC 620 and MIPS R10000 to allow overlap and reordering while preserving precise exceptions.

A third technique in use is to allow the exceptions to become somewhat imprecise, but to keep enough information so that the trap-handling routines can create a precise sequence for the exception. This means knowing what operations were in the pipeline and their PCs. Then, after handling the exception, the software finishes any instructions that precede the latest instruction completed, and the sequence can restart. Consider the following worst-case code sequence:

Instruction 1: A long-running instruction that eventually interrupts execution.
Instruction 2, ..., Instruction n-1: A series of instructions that are not completed.
Instruction n: An instruction that is finished.

Given the PCs of all the instructions in the pipeline and the exception return PC, the software can find the state of instruction 1 and instruction n. Because instruction n has completed, we will want to restart execution at instruction n+1. After handling the exception, the software must simulate the execution of instruction 1, ..., instruction n-1. Then we can return from the exception and restart at instruction n+1.
The complexity of executing these instructions properly by the handler is the major difficulty of this scheme. There is an important simplification for simple DLX-like pipelines: If instruction 2, ..., instruction n are all integer instructions, then we know that if instruction n has completed, all of instruction 2, ..., instruction n-1 have also completed. Thus, only floating-point operations need to be handled. To make this scheme tractable, the number of floating-point instructions that can be overlapped in execution can be limited. For example, if we only overlap two instructions, then only the interrupting instruction need be completed by software. This restriction may reduce the potential throughput if the FP pipelines are deep or if there is a significant number of FP functional units. This approach is used in the SPARC architecture to allow overlap of floating-point and integer operations.

The final technique is a hybrid scheme that allows the instruction issue to continue only if it is certain that all the instructions before the issuing instruction will complete without causing an exception. This guarantees that when an exception occurs, no instructions after the interrupting one will be completed and all of the instructions before the interrupting one can be completed. This sometimes means stalling the machine to maintain precise exceptions. To make this scheme work, the floating-point functional units must determine if an exception is possible early in the EX stage (in the first three clock cycles in the DLX pipeline), so as to prevent further instructions from completing. This scheme is used in the MIPS R2000/3000, the R4000, and the Intel Pentium. It is discussed further in Appendix A.
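Of the approaches in this subsection, the history file is the easiest to illustrate in software. The following Python sketch is a simplified model, not a description of the CYBER 180/990 hardware: the class, its method names, and the use of program-order sequence numbers are all assumptions, and it relies on the pipeline's WAW interlock so that writes to any one register occur in program order.

```python
class HistoryFile:
    """Register file with a log of overwritten values (hypothetical model)."""

    def __init__(self, registers):
        self.registers = dict(registers)
        self.log = []  # (sequence number, register, old value), in write order

    def write(self, seq, reg, value):
        # Save the original value before the register is overwritten.
        self.log.append((seq, reg, self.registers[reg]))
        self.registers[reg] = value

    def retire_through(self, seq):
        # Entries of instructions known to be exception free are discarded.
        self.log = [entry for entry in self.log if entry[0] > seq]

    def roll_back_to(self, seq):
        # Undo, newest first, every write made by instruction `seq` or later.
        for entry_seq, reg, old in reversed(self.log):
            if entry_seq >= seq:
                self.registers[reg] = old
        self.log = [entry for entry in self.log if entry[0] < seq]


# DIVF is instruction 0, ADDF F10,F10,F8 is instruction 1.
state = HistoryFile({"F8": 2.0, "F10": 1.0})
state.write(1, "F10", state.registers["F10"] + state.registers["F8"])
state.roll_back_to(0)  # the DIVF raises an exception after the ADDF completed
print(state.registers["F10"])
```

In this model the ADDF's destroyed operand is recovered because the old value of F10 was logged when the ADDF wrote its result, which is the case the text describes as unrecoverable without extra hardware.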
Performance of a DLX FP Pipeline

The DLX FP pipeline of Figure 3.44 on page 190 can generate both structural stalls for the divide unit and stalls for RAW hazards (it also can have WAW hazards, but this rarely occurs in practice). Figure 3.48 shows the number of stall cycles for each type of floating-point operation on a per instance basis (i.e., the first bar for each FP benchmark shows the number of FP result stalls for each FP add, subtract, or convert). As we might expect, the stall cycles per operation track the latency of the FP operations, varying from 46% to 59% of the latency of the functional unit.

Figure 3.49 gives the complete breakdown of integer and floating-point stalls for the five FP SPEC benchmarks we are using. There are four classes of stalls shown: FP result stalls, FP compare stalls, load and branch delays, and floating-point structural delays. The compiler tries to schedule both load and FP delays before it schedules branch delays. The total number of stalls per instruction varies from 0.65 to 1.21.

Number of stalls per FP operation
Benchmark   Add/subtract/convert   Compares   Multiply   Divide   Divide structural
doduc       1.7                    1.7        3.7        15.4     2.0
ear         1.6                    2.0        2.5        12.4     0.0
hydro2d     2.3                    2.5        3.2        0.4      0.0
mdljdp      2.1                    1.2        2.9        24.5     0.0
su2cor      0.7                    1.5        1.6        18.6     0.6

FIGURE 3.48 Stalls per FP operation for each major type of FP operation. Except for the divide structural hazards, these data do not depend on the frequency of an operation, only on its latency and the number of cycles before the result is used. The number of stalls from RAW hazards roughly tracks the latency of the FP unit. For example, the average number of stalls per FP add, subtract, or convert is 1.7 cycles, or 56% of the latency (3 cycles). Likewise, the average numbers of stalls for multiplies and divides are 2.8 and 14.2, respectively, or 46% and 59% of the corresponding latency. Structural hazards for divides are rare, since the divide frequency is low.
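The percentages quoted in the caption of Figure 3.48 can be reproduced from the per-benchmark bar values. The short Python calculation below is a check, not part of the original data analysis. It assumes that the bar values are assigned to operation types as listed, that the multiply latency is 6 cycles (one less than its seven-stage depth, as with the add), and that the divide latency is 24 cycles.

```python
# Stall cycles per operation, one value per benchmark
# (doduc, ear, hydro2d, mdljdp, su2cor), as read from Figure 3.48.
stalls = {
    "add/subtract/convert": [1.7, 1.6, 2.3, 2.1, 0.7],
    "multiply": [3.7, 2.5, 3.2, 2.9, 1.6],
    "divide": [15.4, 12.4, 0.4, 24.5, 18.6],
}

# Functional unit latencies in cycles (multiply and divide are assumptions).
latency = {"add/subtract/convert": 3, "multiply": 6, "divide": 24}


def mean(values):
    return sum(values) / len(values)


share = {op: mean(values) / latency[op] for op, values in stalls.items()}

for op in stalls:
    print(f"{op}: {mean(stalls[op]):.2f} stalls, {100 * share[op]:.0f}% of latency")
```

The averages come out at about 1.7, 2.8, and 14.3 stall cycles, which are 56%, 46%, and 59% of the respective latencies, in line with the caption.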
Number of stalls per instruction
Benchmark   FP result stalls   FP compare stalls   Branch/load stalls   FP structural
doduc       0.98               0.07                0.08                 0.08
ear         0.52               0.09                0.07                 0.00
hydro2d     0.54               0.22                0.04                 0.00
mdljdp      0.88               0.10                0.03                 0.00
su2cor      0.61               0.02                0.01                 0.01

FIGURE 3.49 The stalls occurring for the DLX FP pipeline for the five FP SPEC benchmarks. The total number of stalls per instruction ranges from 0.65 for su2cor to 1.21 for doduc, with an average of 0.87. FP result stalls dominate in all cases, with an average of 0.71 stalls per instruction or 82% of the stalled cycles. Compares generate an average of 0.1 stalls per instruction and are the second largest source. The divide structural hazard is only significant for doduc.

3.8 Crosscutting Issues: Instruction Set Design and Pipelining

For many years the interaction between instruction sets and implementations was believed to be small, and implementation issues were not a major focus in designing instruction sets. In the 1980s it became clear that the difficulty and inefficiency of pipelining could both be increased by instruction set complications. Here are some examples, many of which are mentioned earlier in the chapter:

• Variable instruction lengths and running times can lead to imbalance among pipeline stages, causing other stages to back up. They also severely complicate hazard detection and the maintenance of precise exceptions. Of course, sometimes the advantages justify the added complexity. For example, caches cause instruction running times to vary when they miss; however, the performance advantages of caches make the added complexity acceptable. To minimize the complexity, most machines freeze the pipeline on a cache miss. Other machines try to continue running parts of the pipeline; though this is complex, it may overcome some of the performance losses from cache misses.
• Sophisticated addressing modes can lead to different sorts of problems. Addressing modes that update registers, such as post-autoincrement, complicate hazard detection. They also slightly increase the complexity of instruction restart. Other addressing modes that require multiple memory accesses substantially complicate pipeline control and make it difficult to keep the pipeline flowing smoothly.
• Architectures that allow writes into the instruction space (self-modifying code), such as the 80x86, can cause trouble for pipelining (as well as for cache designs). For example, if an instruction in the pipeline can modify another instruction, we must constantly check whether the address being written by an instruction corresponds to the address of an instruction that follows the writing instruction in the pipeline. If so, the pipeline must be flushed or the instruction in the pipeline somehow updated.
• Implicitly set condition codes increase the difficulty of finding when a branch has been decided and the difficulty of scheduling branch delays. The former problem occurs when the condition-code setting is not uniform, making it difficult to decide which instruction assigns the condition code last. The latter problem occurs when the condition code is unconditionally set by almost every instruction. This makes it hard to find instructions that can be scheduled between the condition evaluation and the branch. Most older architectures (the IBM 360, the DEC VAX, and the Intel 80x86, for example) have one or both of these problems. Many newer architectures avoid condition codes or set them explicitly under the control of a bit in the instruction. Either approach dramatically reduces pipelining difficulties.

As a simple example, suppose the DLX instruction format were more complex, so that a separate decode pipe stage were required before register fetch. This would increase the branch delay to two clock cycles.
At best, the second branch-delay slot would be wasted at least as often as the first. Gross [1983] found that a second delay slot was only used half as often as the first. This would lead to a performance penalty for the second delay slot of more than 0.1 clock cycles per instruction.

Another example comes from a comparison of the pipeline efficiencies of a VAX 8800 and a MIPS R3000. Although these two machines have many similarities in organization, the VAX instruction set was not designed with pipelining in mind. As a result, on the SPEC89 benchmarks, the MIPS R3000 is faster by between two times and four times, with a mean performance advantage of 2.7 times.

3.9 Putting It All Together: The MIPS R4000 Pipeline

In this section we look at the pipeline structure and performance of the MIPS R4000 processor family. The MIPS-3 instruction set, which the R4000 implements, is a 64-bit instruction set similar to DLX. The R4000 uses a deeper pipeline than that of our DLX model both for integer and FP programs. This deeper pipeline allows it to achieve higher clock rates (100–200 MHz) by decomposing the five-stage integer pipeline into eight stages. Because cache access is particularly time critical, the extra pipeline stages come from decomposing the memory access. This type of deeper pipelining is sometimes called superpipelining.

Figure 3.50 shows the eight-stage pipeline structure using an abstracted version of the datapath. Figure 3.51 shows the overlap of successive instructions in the pipeline. Notice that although the instruction and data memory occupy multiple cycles, they are fully pipelined, so that a new instruction can start on every clock. In fact, the pipeline uses the data before the cache hit detection is complete; Chapter 5 discusses how this can be done in more detail.
[Figure 3.50 diagram: abstracted datapath with the stages IF, IS, RF, EX, DF, DS, TC, and WB, showing the instruction memory, register file, ALU, and data memory.]

FIGURE 3.50 The eight-stage pipeline structure of the R4000 uses pipelined instruction and data caches. The pipe stages are labeled and their detailed function is described in the text. The vertical dashed lines represent the stage boundaries as well as the location of pipeline latches. The instruction is actually available at the end of IS, but the tag check is done in RF, while the registers are fetched. Thus, we show the instruction memory as operating through RF. The TC stage is needed for data memory access, since we cannot write the data into the register until we know whether the cache access was a hit or not.

The function of each stage is as follows:

• IF: First half of instruction fetch; PC selection actually happens here, together with initiation of instruction cache access.
• IS: Second half of instruction fetch, complete instruction cache access.
• RF: Instruction decode and register fetch, hazard checking, and also instruction cache hit detection.
• EX: Execution, which includes effective address calculation, ALU operation, and branch target computation and condition evaluation.
• DF: Data fetch, first half of data cache access.
• DS: Second half of data fetch, completion of data cache access.
• TC: Tag check, determine whether the data cache access hit.
• WB: Write back for loads and register-register operations.

In addition to substantially increasing the amount of forwarding required, this longer latency pipeline increases both the load and branch delays. Figure 3.51 shows that load delays are two cycles, since the data value is available at the end of DS. Figure 3.52 shows the shorthand pipeline schedule when a use immediately follows a load. It shows that forwarding is required for the result of a load instruction to a destination that is three or four cycles later.
[Figure 3.51 diagram: overlapped execution of LW R1, Instruction 1, Instruction 2, and ADD R2,R1 over clock cycles CC 1 to CC 11, each passing through instruction memory, register read, ALU, data memory, and register write.]

FIGURE 3.51 The structure of the R4000 integer pipeline leads to a two-cycle load delay. A two-cycle delay is possible because the data value is available at the end of DS and can be bypassed. If the tag check in TC indicates a miss, the pipeline is backed up a cycle, when the correct data are available.

Clock number
Instruction       1     2     3     4     5     6     7     8     9
LW R1,...         IF    IS    RF    EX    DF    DS    TC    WB
ADD R2,R1,...           IF    IS    RF    stall stall EX    DF    DS
SUB R3,R1,...                 IF    IS    stall stall RF    EX    DF
OR R4,R1,...                        IF    stall stall IS    RF    EX

FIGURE 3.52 A load instruction followed by an immediate use results in a two-cycle stall. Normal forwarding paths can be used after two cycles, so the ADD and SUB get the value by forwarding after the stall. The OR instruction gets the value from the register file. Since the two instructions after the load could be independent and hence not stall, the bypass can be to instructions that are three or four cycles after the load.

Figure 3.53 shows that the basic branch delay is three cycles, since the branch condition is computed during EX. The MIPS architecture has a single-cycle delayed branch. The R4000 uses a predict-not-taken strategy for the remaining two cycles of the branch delay. As Figure 3.54 shows, untaken branches are simply one-cycle delayed branches, while taken branches have a one-cycle delay slot followed by two idle cycles. The instruction set provides a branch likely instruction, which we described earlier and which helps in filling the branch delay slot.
Pipeline interlocks enforce both the two-cycle branch stall penalty on a taken branch and any data hazard stall that arises from use of a load result.

[Figure 3.53 diagram: overlapped execution of BEQZ, Instruction 1, Instruction 2, Instruction 3, and the branch target over clock cycles CC1 to CC11.]

FIGURE 3.53 The basic branch delay is three cycles, since the condition evaluation is performed during EX.

Taken branch:
Clock number
Instruction          1     2     3     4     5     6     7     8     9
Branch instruction   IF    IS    RF    EX    DF    DS    TC    WB
Delay slot                 IF    IS    RF    EX    DF    DS    TC    WB
Stall                            stall stall stall stall stall stall stall
Stall                                  stall stall stall stall stall stall
Branch target                                IF    IS    RF    EX    DF

Untaken branch:
Clock number
Instruction              1     2     3     4     5     6     7     8     9
Branch instruction       IF    IS    RF    EX    DF    DS    TC    WB
Delay slot                     IF    IS    RF    EX    DF    DS    TC    WB
Branch instruction + 2               IF    IS    RF    EX    DF    DS    TC
Branch instruction + 3                     IF    IS    RF    EX    DF    DS

FIGURE 3.54 A taken branch, shown in the top portion of the figure, has a one-cycle delay slot followed by a two-cycle stall, while an untaken branch, shown in the bottom portion, has simply a one-cycle delay slot. The branch instruction can be an ordinary delayed branch or a branch-likely, which cancels the effect of the instruction in the delay slot if the branch is untaken.

In addition to the increase in stalls for loads and branches, the deeper pipeline increases the number of levels of forwarding for ALU operations. In our DLX five-stage pipeline, forwarding between two register-register ALU instructions could happen from the EX/MEM or the MEM/WB registers. In the R4000 pipeline, there are four possible sources for an ALU bypass: EX/DF, DF/DS, DS/TC, and TC/WB. The Exercises ask you to explore all the possible forwarding conditions for the DLX instruction set using an R4000-style pipeline.
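The load and branch delays of the two pipelines follow directly from the stage positions. The Python sketch below is an illustration with hypothetical function names. It assumes that a consumer needs its operand at the start of EX, that load data is available at the end of MEM in DLX and at the end of DS in the R4000, and that DLX resolves branches in ID.

```python
DLX = ["IF", "ID", "EX", "MEM", "WB"]
R4000 = ["IF", "IS", "RF", "EX", "DF", "DS", "TC", "WB"]


def load_use_stalls(stages, data_ready_stage, distance):
    """Stall cycles for an ALU instruction `distance` instructions after a load.

    The load's data is available at the end of `data_ready_stage`; the
    consumer needs it at the start of its own EX stage.
    """
    gap = stages.index(data_ready_stage) - stages.index("EX")
    return max(0, gap + 1 - distance)


def branch_delay(stages, resolve_stage):
    """Cycles between fetching a branch and fetching from its resolved target."""
    return stages.index(resolve_stage) - stages.index("IF")


print([load_use_stalls(R4000, "DS", d) for d in (1, 2, 3)])
print(load_use_stalls(DLX, "MEM", 1))
print(branch_delay(R4000, "EX"), branch_delay(DLX, "ID"))
```

For the R4000 this gives two stall cycles when the use immediately follows the load, as in Figure 3.52, and a three-cycle branch delay, which the R4000 covers with one architectural delay slot plus two predicted-not-taken cycles.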
The Floating-Point Pipeline

The R4000 floating-point unit consists of three functional units: a floating-point divider, a floating-point multiplier, and a floating-point adder. As in the R3000, the adder logic is used on the final step of a multiply or divide. Double-precision FP operations can take from two cycles (for a negate) up to 112 cycles for a square root. In addition, the various units have different initiation rates. The floating-point functional unit can be thought of as having eight different stages, listed in Figure 3.55.

Stage   Functional unit   Description
A       FP adder          Mantissa ADD stage
D       FP divider        Divide pipeline stage
E       FP multiplier     Exception test stage
M       FP multiplier     First stage of multiplier
N       FP multiplier     Second stage of multiplier
R       FP adder          Rounding stage
S       FP adder          Operand shift stage
U                         Unpack FP numbers

FIGURE 3.55 The eight stages used in the R4000 floating-point pipelines.

There is a single copy of each of these stages, and various instructions may use a stage zero or more times and in different orders. Figure 3.56 shows the latency, initiation rate, and pipeline stages used by the most common double-precision FP operations.

FP instruction   Latency   Initiation interval   Pipe stages
Add, subtract    4         3                     U,S+A,A+R,R+S
Multiply         8         4                     U,E+M,M,M,M,N,N+A,R
Divide           36        35                    U,A,R,D^27,D+A,D+R,D+A,D+R,A,R
Square root      112       111                   U,E,(A+R)^108,A,R
Negate           2         1                     U,S
Absolute value   2         1                     U,S
FP compare       3         2                     U,A,R

FIGURE 3.56 The latencies and initiation intervals for the FP operations both depend on the FP unit stages that a given operation must use. The latency values assume that the destination instruction is an FP operation; the latencies are one cycle less when the destination is a store. The pipe stages are shown in the order in which they are used for any operation. The notation S+A indicates a clock cycle in which both the S and A stages are used.
The notation D^28 indicates that the D stage is used 28 times in a row.

From the information in Figure 3.56, we can determine whether a sequence of different, independent FP operations can issue without stalling. If the timing of the sequence is such that a conflict occurs for a shared pipeline stage, then a stall will be needed. Figures 3.57, 3.58, 3.59, and 3.60 show four common possible two-instruction sequences: a multiply followed by an add, an add followed by a multiply, a divide followed by an add, and an add followed by a divide. The figures show all the interesting starting positions for the second instruction and whether that second instruction will issue or stall for each position. Of course, there could be three instructions active, in which case the possibilities for stalls are much higher and the figures more complex.

                         Clock cycle
Operation  Issue/stall   0    1    2    3    4    5    6    7    8    9    10   11   12
Multiply   Issue         U    M    M    M    M    N    N+A  R
Add        Issue              U    S+A  A+R  R+S
           Issue                   U    S+A  A+R  R+S
           Issue                        U    S+A  A+R  R+S
           Stall                             U    S+A  A+R  R+S
           Stall                                  U    S+A  A+R  R+S
           Issue                                       U    S+A  A+R  R+S
           Issue                                            U    S+A  A+R  R+S

FIGURE 3.57 An FP multiply issued at clock 0 is followed by a single FP add issued between clocks 1 and 7. The second column indicates whether an instruction of the specified type stalls when it is issued n cycles later, where n is the clock cycle number in which the U stage of the second instruction occurs. The stage or stages that cause a stall are highlighted. Note that this table deals with only the interaction between the multiply and one add issued between clocks 1 and 7. In this case, the add will stall if it is issued four or five cycles after the multiply; otherwise, it issues without stalling. Notice that the add will be stalled for two cycles if it issues in cycle 4 since on the next clock cycle it will still conflict with the multiply; if, however, the add issues in cycle 5, it will stall for only one clock cycle, since that will eliminate the conflicts.

                         Clock cycle
Operation  Issue/stall   0    1    2    3    4    5    6    7    8    9    10   11   12
Add        Issue         U    S+A  A+R  R+S
Multiply   Issue              U    M    M    M    M    N    N+A  R
           Issue                   U    M    M    M    M    N    N+A  R

FIGURE 3.58 A multiply issuing after an add can always proceed without stalling, since the shorter instruction clears the shared pipeline stages before the longer instruction reaches them.

                         Clock cycle
Operation                25   26   27   28   29   30   31   32   33   34   35
Divide (issued at 0)     D    D    D    D    D    D+A  D+R  D+A  D+R  A    R

[Add rows of Figure 3.59 not reproduced: each row is an add (U, S+A, A+R, R+S) beginning in a successive clock cycle and marked Issue or Stall.]

FIGURE 3.59 An FP divide can cause a stall for an add that starts near the end of the divide. The divide starts at cycle 0 and completes at cycle 35; the last 10 cycles of the divide are shown. Since the divide makes heavy use of the rounding hardware needed by the add, it stalls an add that starts in any of cycles 28 to 33. Notice the add starting in cycle 28 will be stalled until cycle 34. If the add started right after the divide it would not conflict, since the add could complete before the divide needed the shared stages, just as we saw in Figure 3.58 for a multiply and add. As in the earlier figure, this example assumes exactly one add that reaches the U stage between clock cycles 26 and 35.

                         Clock cycle
Operation  Issue/stall   0    1    2    3    4    5    6    7    8    9    10   11   12
Add        Issue         U    S+A  A+R  R+S
Divide     Stall              U    A    R    D    D    D    D    D    D    D    D    D
           Issue                   U    A    R    D    D    D    D    D    D    D    D
           Issue                        U    A    R    D    D    D    D    D    D    D

FIGURE 3.60 A double-precision add is followed by a double-precision divide. If the divide starts one cycle after the add, the divide stalls, but after that there is no conflict.
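The issue-or-stall decisions in Figures 3.57 through 3.60 follow mechanically from the pipe-stage lists of Figure 3.56. Below is a minimal sketch of that check, under two simplifying assumptions that are ours rather than the text's: only two instructions are in flight, and a stalled instruction simply retries one cycle later. The function names and the `*` repeat notation (standing in for the superscript in D^27) are hypothetical.

```python
def expand(spec):
    """Turn a stage list such as 'U,S+A,A+R,R+S' into one set of stages per cycle.

    An item like 'D*27' means the D stage is used for 27 consecutive cycles.
    """
    cycles = []
    for item in spec.split(","):
        stages, _, repeat = item.partition("*")
        cycles.extend([frozenset(stages.split("+"))] * int(repeat or 1))
    return cycles


# Stage usage copied from Figure 3.56.
ADD = expand("U,S+A,A+R,R+S")
MULTIPLY = expand("U,E+M,M,M,M,N,N+A,R")
DIVIDE = expand("U,A,R,D*27,D+A,D+R,D+A,D+R,A,R")


def conflicts(first, second, n):
    """True if `second`, starting n cycles after `first`, needs a busy stage."""
    for i, stages in enumerate(second):
        t = n + i  # clock cycle, counted from the issue of `first`
        if t < len(first) and first[t] & stages:
            return True
    return False


def stall_cycles(first, second, n):
    """Cycles `second` must wait if it tries to issue n cycles after `first`."""
    waited = 0
    while conflicts(first, second, n + waited):
        waited += 1
    return waited


if __name__ == "__main__":
    # Multiply at clock 0, add attempted at clocks 1 to 7 (compare Figure 3.57).
    print([stall_cycles(MULTIPLY, ADD, n) for n in range(1, 8)])
    # prints [0, 0, 0, 2, 1, 0, 0]
```

A table of sets per cycle is the same idea as the pipeline reservation tables mentioned in the historical section; it keeps the conflict test to a single set intersection per cycle.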
Performance of the R4000 Pipeline

In this section we examine the stalls that occur for the SPEC92 benchmarks when running on the R4000 pipeline structure. There are four major causes of pipeline stalls or losses:

1. Load stalls: Delays arising from the use of a load result one or two cycles after the load.
2. Branch stalls: Two-cycle stall on every taken branch plus unfilled or cancelled branch delay slots.
3. FP result stalls: Stalls because of RAW hazards for an FP operand.
4. FP structural stalls: Delays because of issue restrictions arising from conflicts for functional units in the FP pipeline.

Figure 3.61 shows the pipeline CPI breakdown for the R4000 pipeline for the 10 SPEC92 benchmarks. Figure 3.62 shows the same data but in tabular form.

[Figure 3.61: stacked bar chart of pipeline CPI (y-axis, 0.00 to 3.00) for each SPEC92 benchmark (x-axis), divided into base, load stalls, branch stalls, FP result stalls, and FP structural stalls.]

FIGURE 3.61 The pipeline CPI for 10 of the SPEC92 benchmarks, assuming a perfect cache. The pipeline CPI varies from 1.2 to 2.8. The leftmost five programs are integer programs, and branch delays are the major CPI contributor for these. The rightmost five programs are FP, and FP result stalls are the major contributor for these.

Benchmark         Pipeline CPI   Load stalls   Branch stalls   FP result stalls   FP structural stalls
compress          1.20           0.14          0.06            0.00               0.00
eqntott           1.88           0.27          0.61            0.00               0.00
espresso          1.42           0.07          0.35            0.00               0.00
gcc               1.56           0.13          0.43            0.00               0.00
li                1.64           0.18          0.46            0.00               0.00
Integer average   1.54           0.16          0.38            0.00               0.00
doduc             2.84           0.01          0.22            1.39               0.22
mdljdp2           2.66           0.01          0.31            1.20               0.15
ear               2.17           0.00          0.46            0.59               0.12
hydro2d           2.53           0.00          0.62            0.75               0.17
su2cor            2.18           0.02          0.07            0.84               0.26
FP average        2.48           0.01          0.33            0.95               0.18
Overall average   2.00           0.10          0.36            0.46               0.09

FIGURE 3.62 The total pipeline CPI and the contributions of the four major sources of stalls are shown.
The major contributors are FP result stalls (both for branches and for FP inputs) and branch stalls, with loads and FP structural stalls adding less.

From the data in Figures 3.61 and 3.62, we can see the penalty of the deeper pipelining. The R4000’s pipeline has much longer branch delays than the five-stage DLX-style pipeline. The longer branch delay substantially increases the cycles spent on branches, especially for the integer programs with a higher branch frequency. An interesting effect for the FP programs is that the latency of the FP functional units leads to more stalls than the structural hazards, which arise both from the initiation interval limitations and from conflicts for functional units from different FP instructions. Thus, reducing the latency of FP operations should be the first target, rather than more pipelining or replication of the functional units. Of course, reducing the latency would probably increase the structural stalls, since many potential structural stalls are hidden behind data hazards.

3.10 Fallacies and Pitfalls

Pitfall: Unexpected execution sequences may cause unexpected hazards.

At first glance, WAW hazards look like they should never occur because no compiler would ever generate two writes to the same register without an intervening read. But they can occur when the sequence is unexpected. For example, the first write might be in the delay slot of a taken branch when the scheduler thought the branch would not be taken. Here is the code sequence that could cause this:

        BNEZ    R1,foo
        DIVD    F0,F2,F4    ; moved into delay slot
                            ; from fall through
        .....
        .....
foo:    LD      F0,qrs

If the branch is taken, then before the DIVD can complete, the LD will reach WB, causing a WAW hazard. The hardware must detect this and may stall the issue of the LD. Another way this can happen is if the second write is in a trap routine.
This occurs when an instruction that traps and is writing results continues and completes after an instruction that writes the same register in the trap handler. The hardware must detect and prevent this as well.

Pitfall: Extensive pipelining can impact other aspects of a design, leading to overall worse cost/performance.

The best example of this phenomenon comes from two implementations of the VAX, the 8600 and the 8700. When the 8600 was initially delivered, it had a cycle time of 80 ns. Subsequently, a redesigned version, called the 8650, with a 55-ns clock was introduced. The 8700 has a much simpler pipeline that operates at the microinstruction level, yielding a smaller CPU with a faster clock cycle of 45 ns. The overall outcome is that the 8650 has a CPI advantage of about 20%, but the 8700 has a clock rate that is about 20% faster. Thus, the 8700 achieves the same performance with much less hardware.

Fallacy: Increasing the number of pipeline stages always increases performance.

Two factors combine to limit the performance improvement gained by pipelining. First, limited parallelism in the instruction stream means that increasing the number of pipeline stages, called the pipeline depth, will eventually increase the CPI, due to dependences that require stalls. Second, clock skew and latch overhead combine to limit the decrease in clock period obtained by further pipelining. Figure 3.63 shows the trade-off between the number of pipeline stages and performance for the first 14 of the Livermore Loops. The performance flattens out when the number of pipeline stages reaches 4 and actually drops when the execution portion is pipelined 16 deep. Although this study is limited to a small set of FP programs, the trade-off of increasing CPI versus increasing clock rate by more pipelining arises constantly.

Pitfall: Evaluating a compile-time scheduler on the basis of unoptimized code.
Unoptimized code, containing redundant loads, stores, and other operations that might be eliminated by an optimizer, is much easier to schedule than “tight” optimized code. This holds for scheduling both control delays (with delayed branches) and delays arising from RAW hazards. In gcc running on an R3000, which has a pipeline almost identical to that of DLX, the frequency of idle clock cycles increases by 18% from the unoptimized and scheduled code to the optimized and scheduled code. Of course, the optimized program is much faster, since it has fewer instructions. To fairly evaluate a scheduler you must use optimized code, since in the real system you will derive good performance from other optimizations in addition to scheduling.

[Figure 3.63: plot of relative performance (y-axis, 0.0 to 3.0) against pipeline depth (x-axis: 1, 2, 4, 8, 16).]

FIGURE 3.63 The depth of pipelining versus the speedup obtained. The x-axis shows the number of stages in the EX portion of the floating-point pipeline. A single-stage pipeline corresponds to 32 levels of logic, which might be appropriate for a single FP operation. Data based on Table 2 in Kunkel and Smith [1986].

3.11 Concluding Remarks

Pipelining has been and is likely to continue to be one of the most important techniques for enhancing the performance of processors. Improving performance via pipelining was the key focus of many early computer designers in the late 1950s through the mid 1960s. In the late 1960s through the late 1970s, the attention of computer architects was focused on other things, including the dramatic improvements in cost, size, and reliability that were achieved by the introduction of integrated circuit technology. In this period pipelining played a secondary role in many designs. Since pipelining was not a primary focus, many instruction sets designed in this period made pipelining overly difficult and reduced its payoff. The VAX architecture is perhaps the best example.
In the late 1970s and early 1980s several researchers realized that instruction set complexity and implementation ease, particularly ease of pipelining, were related. The RISC movement led to a dramatic simplification in instruction sets that allowed rapid progress in the development of pipelining techniques. As we will see in the next chapter, these techniques have become extremely sophisticated. The sophisticated implementation techniques now in use in many designs would have been extremely difficult with the more complex architectures of the 1970s.

In this chapter, we introduced the basic ideas in pipelining and looked at some simple compiler strategies for enhancing performance. The pipelined microprocessors of the 1980s relied on these strategies, with the R4000-style machine representing one of the most advanced of the “simple” pipeline organizations. To further improve performance in this decade most microprocessors have introduced schemes such as hardware-based pipeline scheduling, dynamic branch prediction, the ability to issue more than one instruction in a cycle, and the use of more powerful compiler technology. These more advanced techniques are the subject of the next chapter.

3.12 Historical Perspective and References

This section describes some of the major advances in pipelining and ends with some of the recent literature on high-performance pipelining.

The first general-purpose pipelined machine is considered to be Stretch, the IBM 7030. Stretch followed the IBM 704 and had a goal of being 100 times faster than the 704. The goal was a stretch from the state of the art at that time, hence the nickname. The plan was to obtain a factor of 1.6 from overlapping fetch, decode, and execute, using a four-stage pipeline. Bloch [1959] and Bucholtz [1962] describe the design and engineering trade-offs, including the use of ALU bypasses.
The CDC 6600, developed in the early 1960s, also introduced several enhancements in pipelining; these innovations and the history of that design are discussed in the next chapter.

A series of general pipelining descriptions that appeared in the late 1970s and early 1980s provided most of the terminology and described most of the basic techniques used in simple pipelines. These surveys include Keller [1975], Ramamoorthy and Li [1977], Chen [1980], and Kogge’s book [1981], devoted entirely to pipelining. Davidson and his colleagues [1971, 1975] developed the concept of pipeline reservation tables as a design methodology for multicycle pipelines with feedback (also described in Kogge [1981]). Many designers use a variation of these concepts, as we did in sections 3.2 and 3.3.

The RISC machines were originally designed with ease of implementation and pipelining in mind. Several of the early RISC papers, published in the early 1980s, attempt to quantify the performance advantages of the simplification in instruction set. The best analysis, however, is a comparison of a VAX and a MIPS implementation published by Bhandarkar and Clark in 1991, 10 years after the first published RISC papers. After 10 years of arguments about the implementation benefits of RISC, this paper convinced even the most skeptical designers of the advantages of a RISC instruction set architecture.

The RISC machines refined the notion of compiler-scheduled pipelines in the early 1980s, though earlier work on this topic is described at the end of the next chapter. The concepts of delayed branches and delayed loads, common in microprogramming, were extended into the high-level architecture. The Stanford MIPS architecture made the pipeline structure purposely visible to the compiler and allowed multiple operations per instruction.
Simple schemes for scheduling the pipeline in the compiler were described by Sites [1979] for the Cray, by Hennessy and Gross [1983] (and in Gross’s thesis [1983]), and by Gibbons and Muchnik [1986]. More advanced techniques will be described in the next chapter. Rymarczyk [1982] describes the interlock conditions that programmers should be aware of for a 360-like machine; this paper also shows the complex interaction between pipelining and an instruction set not designed to be pipelined. Static branch prediction by profiling has been explored by McFarling and Hennessy [1986] and by Fisher and Freudenberger [1992].

J. E. Smith and his colleagues have written a number of papers examining instruction issue, exception handling, and pipeline depth for high-speed scalar machines. Kunkel and Smith [1986] evaluate the impact of pipeline overhead and dependences on the choice of optimal pipeline depth; they also have an excellent discussion of latch design and its impact on pipelining. Smith and Pleszkun [1988] evaluate a variety of techniques for preserving precise exceptions. Weiss and Smith [1984] evaluate a variety of hardware pipeline scheduling and instruction-issue techniques.

The MIPS R4000, in addition to being one of the first deeply pipelined microprocessors, was the first true 64-bit architecture. It is described by Killian [1991] and by Heinrich [1993]. The initial Alpha implementation (the 21064) has a similar instruction set and similar integer pipeline structure, with more pipelining in the floating-point unit.

References

BHANDARKAR, D. AND D. W. CLARK [1991]. “Performance from architecture: Comparing a RISC and a CISC with similar hardware organizations,” Proc. Fourth Conf. on Architectural Support for Programming Languages and Operating Systems, IEEE/ACM (April), Palo Alto, Calif., 310–319.
BLOCH, E. [1959]. “The engineering design of the Stretch computer,” Proc. Fall Joint Computer Conf., 48–59.
BUCHOLTZ, W. [1962]. Planning a Computer System: Project Stretch, McGraw-Hill, New York.
CHEN, T. C. [1980]. “Overlap and parallel processing,” in Introduction to Computer Architecture, H. Stone, ed., Science Research Associates, Chicago, 427–486.
CLARK, D. W. [1987]. “Pipelining and performance in the VAX 8800 processor,” Proc. Second Conf. on Architectural Support for Programming Languages and Operating Systems, IEEE/ACM (March), Palo Alto, Calif., 173–177.
DAVIDSON, E. S. [1971]. “The design and control of pipelined function generators,” Proc. Conf. on Systems, Networks, and Computers, IEEE (January), Oaxtepec, Mexico, 19–21.
DAVIDSON, E. S., A. T. THOMAS, L. E. SHAR, AND J. H. PATEL [1975]. “Effective control for pipelined processors,” COMPCON, IEEE (March), San Francisco, 181–184.
EARLE, J. G. [1965]. “Latched carry-save adder,” IBM Technical Disclosure Bull. 7 (March), 909–910.
EMER, J. S. AND D. W. CLARK [1984]. “A characterization of processor performance in the VAX-11/780,” Proc. 11th Symposium on Computer Architecture (June), Ann Arbor, Mich., 301–310.
FISHER, J. AND S. FREUDENBERGER [1992]. “Predicting conditional branch directions from previous runs of a program,” Proc. Fifth Conf. on Architectural Support for Programming Languages and Operating Systems, IEEE/ACM (October), Boston, 85–95.
GIBBONS, P. B. AND S. S. MUCHNIK [1986]. “Efficient instruction scheduling for a pipelined processor,” SIGPLAN ’86 Symposium on Compiler Construction, ACM (June), Palo Alto, Calif., 11–16.
GROSS, T. R. [1983]. Code Optimization of Pipeline Constraints, Ph.D. Thesis (December), Computer Systems Lab., Stanford Univ.
HEINRICH, J. [1993]. MIPS R4000 User’s Manual, Prentice Hall, Englewood Cliffs, N.J.
HENNESSY, J. L. AND T. R. GROSS [1983]. “Postpass code optimization of pipeline constraints,” ACM Trans. on Programming Languages and Systems 5:3 (July), 422–448.
IBM [1990]. “The IBM RISC System/6000 processor” (collection of papers), IBM J. of Research and Development 34:1 (January).
KELLER, R. M. [1975]. “Look-ahead processors,” ACM Computing Surveys 7:4 (December), 177–195.
KILLIAN, E. [1991]. “MIPS R4000 technical overview–64 bits/100 MHz or bust,” Hot Chips III Symposium Record (August), Stanford University, 1.6–1.19.
KOGGE, P. M. [1981]. The Architecture of Pipelined Computers, McGraw-Hill, New York.
KUNKEL, S. R. AND J. E. SMITH [1986]. “Optimal pipelining in supercomputers,” Proc. 13th Symposium on Computer Architecture (June), Tokyo, 404–414.
MCFARLING, S. AND J. L. HENNESSY [1986]. “Reducing the cost of branches,” Proc. 13th Symposium on Computer Architecture (June), Tokyo, 396–403.
RAMAMOORTHY, C. V. AND H. F. LI [1977]. “Pipeline architecture,” ACM Computing Surveys 9:1 (March), 61–102.
RYMARCZYK, J. [1982]. “Coding guidelines for pipelined processors,” Proc. Symposium on Architectural Support for Programming Languages and Operating Systems, IEEE/ACM (March), Palo Alto, Calif., 12–19.
SITES, R. [1979]. Instruction Ordering for the CRAY-1 Computer, Tech. Rep. 78-CS-023 (July), Dept. of Computer Science, Univ. of Calif., San Diego.
SMITH, J. E. AND A. R. PLESZKUN [1988]. “Implementing precise interrupts in pipelined processors,” IEEE Trans. on Computers 37:5 (May), 562–573.
WEISS, S. AND J. E. SMITH [1984]. “Instruction issue logic for pipelined supercomputers,” Proc. 11th Symposium on Computer Architecture (June), Ann Arbor, Mich., 110–118.

EXERCISES

3.1 [15/15/15] <3.4,3.5> Use the following code fragment:

Loop:   LW      R1,0(R2)
        ADDI    R1,R1,#1
        SW      0(R2),R1
        ADDI    R2,R2,#4
        SUB     R4,R3,R2
        BNEZ    R4,Loop

Assume that the initial value of R3 is R2 + 396. Throughout this exercise use the DLX integer pipeline and assume all memory accesses are cache hits.

a. [15] <3.4,3.5> Show the timing of this instruction sequence for the DLX pipeline without any forwarding or bypassing hardware but assuming a register read and a write in the same clock cycle “forwards” through the register file, as in Figure 3.10. Use a pipeline timing chart like Figure 3.14 or 3.15. Assume that the branch is handled by flushing the pipeline. If all memory references hit in the cache, how many cycles does this loop take to execute?

b. [15] <3.4,3.5> Show the timing of this instruction sequence for the DLX pipeline with normal forwarding and bypassing hardware. Use a pipeline timing chart like Figure 3.14 or 3.15. Assume that the branch is handled by predicting it as not taken. If all memory references hit in the cache, how many cycles does this loop take to execute?

c. [15] <3.4,3.5> Assuming the DLX pipeline with a single-cycle delayed branch and normal forwarding and bypassing hardware, schedule the instructions in the loop including the branch-delay slot. You may reorder instructions and modify the individual instruction operands, but do not undertake other loop transformations that change the number or opcode of the instructions in the loop (that’s for the next chapter!). Show a pipeline timing diagram and compute the number of cycles needed to execute the entire loop.

3.2 [15/15/15] <3.4,3.5,3.7> Use the following code fragment:

Loop:   LD      F0,0(R2)
        LD      F4,0(R3)
        MULTD   F0,F0,F4
        ADDD    F2,F0,F2
        ADDI    R2,R2,#8
        ADDI    R3,R3,#8
        SUB     R5,R4,R2
        BNEZ    R5,Loop

Assume that the initial value of R4 is R2 + 792. For this exercise assume the standard DLX integer pipeline (as shown in Figure 3.10) and the standard DLX FP pipeline as described in Figures 3.43 and 3.44. If structural hazards are due to write-back contention, assume the earliest instruction gets priority and other instructions are stalled.

a. [15] <3.4,3.5,3.7> Show the timing of this instruction sequence for the DLX FP pipeline without any forwarding or bypassing hardware but assuming a register read and a write in the same clock cycle “forwards” through the register file, as in Figure 3.10. Use a pipeline timing chart like Figure 3.14 or 3.15. Assume that the branch is handled by flushing the pipeline. If all memory references hit in the cache, how many cycles does this loop take to execute?

b. [15] <3.4,3.5,3.7> Show the timing of this instruction sequence for the DLX FP pipeline with normal forwarding and bypassing hardware. Use a pipeline timing chart like Figure 3.14 or 3.15. Assume that the branch is handled by predicting it as not taken. If all memory references hit in the cache, how many cycles does this loop take to execute?

c. [15] <3.4,3.5,3.7> Assuming the DLX FP pipeline with a single-cycle delayed branch and full bypassing and forwarding hardware, schedule the instructions in the loop including the branch-delay slot. You may reorder instructions and modify the individual instruction operands, but do not undertake other loop transformations that change the number or opcode of the instructions in the loop (that’s for the next chapter!). Show a pipeline timing diagram and compute the time needed in cycles to execute the entire loop.

3.3 [12/13/20/20/15/15] <3.2,3.4,3.5> For these problems, we will explore a pipeline for a register-memory architecture. The architecture has two instruction formats: a register-register format and a register-memory format. There is a single memory addressing mode (offset + base register). There is a set of ALU operations with format:

        ALUop Rdest, Rsrc1, Rsrc2

or

        ALUop Rdest, Rsrc1, MEM

where the ALUop is one of the following: Add, Subtract, And, Or, Load (Rsrc1 ignored), Store. Rsrc or Rdest are registers. MEM is a base register and offset pair. Branches use a full compare of two registers and are PC-relative.
Assume that this machine is pipelined so that a new instruction is started every clock cycle. The following pipeline structure, similar to that used in the VAX 8700 micropipeline (Clark [1987]), is

IF    RF    ALU1  MEM   ALU2  WB
      IF    RF    ALU1  MEM   ALU2  WB
            IF    RF    ALU1  MEM   ALU2  WB
                  IF    RF    ALU1  MEM   ALU2  WB
                        IF    RF    ALU1  MEM   ALU2  WB
                              IF    RF    ALU1  MEM   ALU2  WB

The first ALU stage is used for effective address calculation for memory references and branches. The second ALU cycle is used for operations and branch comparison. RF is both a decode and register-fetch cycle. Assume that when a register read and a register write of the same register occur in the same clock the write data is forwarded.

a. [12] <3.2> Find the number of adders needed, counting any adder or incrementer; show a combination of instructions and pipe stages that justify this answer. You need only give one combination that maximizes the adder count.

b. [13] <3.2> Find the number of register read and write ports and memory read and write ports required. Show that your answer is correct by showing a combination of instructions and pipeline stage indicating the instruction and the number of read ports and write ports required for that instruction.

c. [20] <3.4> Determine any data forwarding for any ALUs that will be needed. Assume that there are separate ALUs for the ALU1 and ALU2 pipe stages. Put in all forwarding among ALUs needed to avoid or reduce stalls. Show the relationship between the two instructions involved in forwarding using the format of the table in Figure 3.19 but ignoring the last two columns. Be careful to consider forwarding across an intervening instruction, e.g.,

        ADD R1, ...
        any instruction
        ADD ..., R1, ...

d. [20] <3.4> Show all data forwarding requirements needed to avoid or reduce stalls when either the source or destination unit is not an ALU. Use the same format as Figure 3.19, again ignoring the last two columns. Remember to forward to and from memory references.

e. [15] <3.4> Show all the remaining hazards that involve at least one unit other than an ALU as the source or destination unit. Use a table like that in Figure 3.18, but listing the length of hazard in place of the last column.

f. [15] <3.5> Show all control hazard types by example and state the length of the stall. Use a format like Figure 3.21, labeling each example.

3.4 [10] <3.2> Consider the example on page 137 that compares the unpipelined and pipelined machine. Assume that 1 ns overhead is fixed and that each pipe stage is balanced and takes 10 ns in the five-stage pipeline. Plot the speedup of the pipelined machine versus the unpipelined machine as the number of pipeline stages is increased from five stages to 20 stages, considering only the impact of the pipelining overhead and assuming that the work can be evenly divided as the stages are increased (which is not generally true). Also plot the “perfect” speedup that would be obtained if there was no overhead.

3.5 [12] <3.1–3.5> A machine is called “underpipelined” if additional levels of pipelining can be added without changing the pipeline-stall behavior appreciably. Suppose that the DLX integer pipeline was changed to four stages by merging EX and MEM and lengthening the clock cycle by 50%. How much faster would the conventional DLX pipeline be versus the underpipelined DLX on integer code only? Make sure you include the effect of any change in pipeline stalls using the data for gcc in Figure 3.38 (page 178).

3.6 [20] <3.4> Add the forwarding entries for stores and for the zero detect unit (for branches) to the table in Figure 3.19. Hint: Remember the tricky case:

        ADD R1, ...
        any instruction
        SW ..., R1

How is the forwarding handled for this case?

3.7 [20] <3.4,3.9> Create a table showing the forwarding logic for the R4000 integer pipeline using the same format as that in Figure 3.19. Include only the DLX instructions we considered in Figure 3.19.
3.8 [15] <3.4,3.9> Create a table showing the R4000 integer hazard detection using the same format as that in Figure 3.18. Include only the instructions in the DLX subset that we considered in section 3.4.

3.9 [15] <3.5> Suppose the branch frequencies (as percentages of all instructions) are as follows:

Conditional branches     20%
Jumps and calls          5%
Conditional branches     60% are taken

We are examining a four-deep pipeline where the branch is resolved at the end of the second cycle for unconditional branches and at the end of the third cycle for conditional branches. Assuming that only the first pipe stage can always be done independent of whether the branch goes and ignoring other pipeline stalls, how much faster would the machine be without any branch hazards?

3.10 [20/20] <3.4> Suppose that we have the pipeline layout shown in Figure 3.64.

Stage   Function
1       Instruction fetch
2       Operand decode
3       Execution or memory access (branch resolution)

FIGURE 3.64 Pipeline stages.

All data dependences are between the register written in stage 3 of instruction i and a register read in stage 2 of instruction i + 1, before instruction i has completed. The probability of such an interlock occurring is 1/p.

We are considering a change in the machine organization that would write back the result of an instruction during an effective fourth pipe stage. This would decrease the length of the clock cycle by d (i.e., if the length of the clock cycle was T, it is now T – d). The probability of a dependence between instruction i and instruction i + 2 is p^(–2). (Assume that the value of p^(–1) excludes instructions that would interlock on i + 2.) The branch would also be resolved during the fourth stage.

a. [20] <3.4> Assume that we add no additional forwarding hardware for the four-stage pipeline. Considering only the data hazard, find the lower bound on d that makes this a profitable change. Assume that each result has exactly one use and that the basic clock cycle has length T.

b. [20] <3.4> Now assume that we have used forwarding to eliminate the extra hazard introduced by the change. That is, for all data hazards the pipeline length is effectively 3. This design may still not be worthwhile because of the impact of control hazards coming from a four-stage versus a three-stage pipeline. Assume that only stage 1 of the pipeline can be safely executed before we decide whether a branch goes or not. We want to know the impact of branch hazards before this longer pipeline does not yield high performance. Find an upper bound on the percentages of conditional branches in programs in terms of the ratio of d to the original clock-cycle time, so that the longer pipeline has better performance. If d is a 10% reduction, what is the maximum percentage of conditional branches before we lose with this longer pipeline? Assume the taken-branch frequency for conditional branches is 60%.

3.11 [20] <3.4,3.7> Construct a table like Figure 3.18 that shows the data hazard stalls for the DLX FP pipeline as shown in Figure 3.44. Consider both integer-FP and FP-FP interactions but ignore divides (FP and integer).

3.12 [20] <3.4,3.7> Construct the forwarding table for the DLX FP pipeline of Figure 3.44 as we did in Figure 3.19. Consider both FP to FP forwarding and forwarding of FP loads to the FP units but ignore FP and integer divides.

3.13 [25] <3.4,3.7> Suppose DLX had only one register set. Construct the forwarding table for the FP and integer instructions using the format of Figure 3.19. Assume the DLX pipeline in Figure 3.44. Ignore FP and integer divides.

3.14 [15] <3.4,3.7> Construct a table like Figure 3.18 to check for WAW stalls in the DLX FP pipeline of Figure 3.44. Do not consider integer or FP divides.

3.15 [20] <3.4,3.7> Construct a table like Figure 3.18 that shows the structural stalls for the R4000 FP pipeline.
3.16 [35] <3.2–3.7> Change the DLX instruction simulator to be pipelined. Measure the frequency of empty branch-delay slots, the frequency of load delays, and the frequency of FP stalls for a variety of integer and FP programs. Also, measure the frequency of forwarding operations. Determine the performance impact of eliminating forwarding and stalling.

3.17 [35] <3.7> Using a DLX simulator, create a DLX pipeline simulator. Explore the impact of lengthening the FP pipelines, assuming both fully pipelined and unpipelined FP units. How does clustering of FP operations affect the results? Which FP units are most susceptible to changes in the FP pipeline length?

3.18 [40] <3.3–3.5> Write an instruction scheduler for DLX that works on DLX assembly language. Evaluate your scheduler using either profiles of programs or a pipeline simulator. If the DLX C compiler does optimization, evaluate your scheduler’s performance both with and without optimization.

4 Advanced Pipelining and Instruction-Level Parallelism

“Who’s first?” “America.” “Who’s second?” “Sir, there is no second.”

Dialog between two observers of the sailing race later named “The America’s Cup” and run every few years. This quote was the inspiration for John Cocke’s naming of the IBM research processor as “America.” This processor was the precursor to the RS/6000 series and the first superscalar microprocessor.
4.1  Instruction-Level Parallelism: Concepts and Challenges
4.2  Overcoming Data Hazards with Dynamic Scheduling
4.3  Reducing Branch Penalties with Dynamic Hardware Prediction
4.4  Taking Advantage of More ILP with Multiple Issue
4.5  Compiler Support for Exploiting ILP
4.6  Hardware Support for Extracting More Parallelism
4.7  Studies of ILP
4.8  Putting It All Together: The PowerPC 620
4.9  Fallacies and Pitfalls
4.10 Concluding Remarks
4.11 Historical Perspective and References
     Exercises

4.1 Instruction-Level Parallelism: Concepts and Challenges

In the last chapter we saw how pipelining can overlap the execution of instructions when they are independent of one another. This potential overlap among instructions is called instruction-level parallelism (ILP) since the instructions can be evaluated in parallel. In this chapter, we look at a wide range of techniques for extending the pipelining ideas by increasing the amount of parallelism exploited among instructions. We start by looking at techniques that reduce the impact of data and control hazards and then turn to the topic of increasing the ability of the processor to exploit parallelism. We discuss the compiler technology used to increase the ILP and examine the results of a study of available ILP. The Putting It All Together section covers the PowerPC 620, which supports most of the advanced pipelining techniques described in this chapter.

In this section, we discuss features of both programs and processors that limit the amount of parallelism that can be exploited among instructions. We conclude the section by looking at simple compiler techniques for enhancing the exploitation of pipeline parallelism by a compiler.
The CPI of a pipelined machine is the sum of the base CPI and all contributions from stalls:

    Pipeline CPI = Ideal pipeline CPI + Structural stalls + RAW stalls + WAR stalls + WAW stalls + Control stalls

The ideal pipeline CPI is a measure of the maximum performance attainable by the implementation. By reducing each of the terms of the right-hand side, we minimize the overall pipeline CPI and thus increase the instruction throughput per clock cycle. While the focus of the last chapter was on reducing the RAW stalls and the control stalls, in this chapter we will see that the techniques we introduce to further reduce the RAW and control stalls, as well as reduce the ideal CPI, can increase the importance of dealing with structural, WAR, and WAW stalls. The equation above allows us to characterize the various techniques we examine in this chapter by what component of the overall CPI a technique reduces. Figure 4.1 shows some of the techniques we examine and how they affect the contributions to the CPI.

    Technique                                   Reduces                       Section
    Loop unrolling                              Control stalls                4.1
    Basic pipeline scheduling                   RAW stalls                    4.1 (also Chapter 3)
    Dynamic scheduling with scoreboarding       RAW stalls                    4.2
    Dynamic scheduling with register renaming   WAR and WAW stalls            4.2
    Dynamic branch prediction                   Control stalls                4.3
    Issuing multiple instructions per cycle     Ideal CPI                     4.4
    Compiler dependence analysis                Ideal CPI and data stalls     4.5
    Software pipelining and trace scheduling    Ideal CPI and data stalls     4.5
    Speculation                                 All data and control stalls   4.6
    Dynamic memory disambiguation               RAW stalls involving memory   4.2, 4.6

FIGURE 4.1 The major techniques examined in this chapter are shown together with the component of the CPI equation that the technique affects. Data stalls are stalls arising from any type of data hazard, namely RAW (read after write), WAR (write after read), or WAW (write after write).
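As a small illustration of the CPI equation above, the following sketch adds per-category stall contributions to an ideal CPI. The function name pipeline_cpi and the stall values are hypothetical and chosen only for illustration; they are not measurements from the text.

```python
def pipeline_cpi(ideal_cpi, stalls):
    """Return the pipeline CPI: the ideal CPI plus every stall contribution.

    `stalls` maps a stall category to its average stall cycles per
    instruction. The category names mirror the terms of the equation.
    """
    return ideal_cpi + sum(stalls.values())


# Hypothetical stall contributions in cycles per instruction (not measured data).
example_stalls = {
    "structural": 0.0,
    "raw": 0.25,
    "war": 0.0,
    "waw": 0.0,
    "control": 0.25,
}

print(pipeline_cpi(1.0, example_stalls))  # 1.5
```

Under these assumed numbers, removing the control stalls entirely would bring the CPI down to 1.25, which is the sense in which each technique in Figure 4.1 targets one term of the sum.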
Before we examine these techniques in detail, we need to define the concepts on which these techniques are built. These concepts, in the end, determine the limits on how much parallelism can be exploited.

Instruction-Level Parallelism

All the techniques in this chapter exploit parallelism among instruction sequences. As we stated above, this type of parallelism is called instruction-level parallelism or ILP. The amount of parallelism available within a basic block (a straight-line code sequence with no branches in except to the entry and no branches out except at the exit) is quite small. For example, in Chapter 3 we saw that the average dynamic branch frequency in integer programs was about 15%, meaning that between six and seven instructions execute between a pair of branches. Since these instructions are likely to depend upon one another, the amount of overlap we can exploit within a basic block is likely to be much less than six. To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks.

The simplest and most common way to increase the amount of parallelism available among instructions is to exploit parallelism among iterations of a loop. This type of parallelism is often called loop-level parallelism. Here is a simple example of a loop, which adds two 1000-element arrays, that is completely parallel:

    for (i=1; i<=1000; i=i+1)
        x[i] = x[i] + y[i];

Every iteration of the loop can overlap with any other iteration, although within each loop iteration there is little opportunity for overlap.

There are a number of techniques we will examine for converting such loop-level parallelism into instruction-level parallelism. Basically, such techniques work by unrolling the loop either statically by the compiler or dynamically by the hardware. We will look at a detailed example of loop unrolling later in this section.
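The claim that every iteration of the array-addition loop can overlap with any other can be checked informally: if the iterations are independent, running them in any order gives the same result. The sketch below is only an illustration under assumptions of its own (0-based indices, short lists, and a hypothetical helper named add_arrays); it is not a dependence test of the kind a compiler would use.

```python
def add_arrays(x, y, order):
    """Apply x[i] = x[i] + y[i] for each index i, visiting indices in `order`."""
    result = list(x)  # work on a copy so the caller's list is unchanged
    for i in order:
        result[i] = result[i] + y[i]
    return result


x = [1, 2, 3, 4]
y = [10, 20, 30, 40]

forward = add_arrays(x, y, range(4))
backward = add_arrays(x, y, reversed(range(4)))

print(forward)              # [11, 22, 33, 44]
print(forward == backward)  # True
```

Each iteration reads and writes only its own element, so no iteration consumes a value produced by another, and the visiting order does not matter.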
An important alternative method for exploiting loop-level parallelism is the use of vector instructions. Essentially, a vector instruction operates on a sequence of data items. For example, the above code sequence could execute in four instructions on a typical vector processor: two instructions to load the vectors x and y from memory, one instruction to add the two vectors, and an instruction to store back the result vector. Of course, these instructions would be pipelined and have relatively long latencies, but these latencies may be overlapped. Vector instructions and the operation of vector processors are described in detail in Appendix B. Although the development of the vector ideas preceded most of the techniques we examine in this chapter for exploiting parallelism, processors that exploit ILP are replacing the vector-based processors; the reasons for this technology shift are discussed in more detail later in this chapter and in the historical perspectives at the end of the chapter.

Basic Pipeline Scheduling and Loop Unrolling

To keep a pipeline full, parallelism among instructions must be exploited by finding sequences of unrelated instructions that can be overlapped in the pipeline. To avoid a pipeline stall, a dependent instruction must be separated from the source instruction by a distance in clock cycles equal to the pipeline latency of that source instruction. A compiler’s ability to perform this scheduling depends both on the amount of ILP available in the program and on the latencies of the functional units in the pipeline. Throughout this chapter we will assume the FP unit latencies shown in Figure 4.2, unless different latencies are explicitly stated. We assume the standard DLX integer pipeline, so that branches have a delay of one clock cycle.
We assume that the functional units are fully pipelined or replicated (as many times as the pipeline depth), so that an operation of any type can be issued on every clock cycle and there are no structural hazards.

    Instruction producing result   Instruction using result   Latency in clock cycles
    FP ALU op                      Another FP ALU op          3
    FP ALU op                      Store double               2
    Load double                    FP ALU op                  1
    Load double                    Store double               0

FIGURE 4.2 Latencies of FP operations used in this chapter. The first column shows the originating instruction type. The second column is the type of the consuming instruction. The last column is the number of intervening clock cycles needed to avoid a stall. These numbers are similar to the average latencies we would see on an FP unit, like the one we described for DLX in the last chapter. The major change versus the DLX FP pipeline was to reduce the latency of FP multiply; this helps keep our examples from becoming unwieldy. The latency of a floating-point load to a store is zero, since the result of the load can be bypassed without stalling the store. We will continue to assume an integer load latency of 1 and an integer ALU operation latency of 0.

In this subsection, we look at how the compiler can increase the amount of available ILP by unrolling loops. This example serves both to illustrate an important technique as well as to motivate the definitions and program transformations described in the rest of this section. Our example uses a simple loop that adds a scalar value to an array in memory. Here is a typical version of the source:

    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;

We can see that this loop is parallel by noticing that the body of each iteration is independent. We will formalize this notion later in this section and describe how we can test whether loop iterations are independent later in the chapter.
First, let’s work through this simple example, showing how we can use the parallelism to improve its performance for a DLX-like pipeline with the latencies shown above. The first step is to translate the above segment to DLX assembly language. In the following code segment, R1 is initially the address of the element in the array with the highest address, and F2 contains the scalar value, s. For simplicity, we assume that the element (x[1]) with the lowest address is at 8; if it were located elsewhere, the loop would require one additional integer instruction to perform the comparison with R1.

The straightforward DLX code, not scheduled for the pipeline, looks like this:

    Loop: LD    F0,0(R1)    ;F0=array element
          ADDD  F4,F0,F2    ;add scalar in F2
          SD    0(R1),F4    ;store result
          SUBI  R1,R1,#8    ;decrement pointer
                            ;8 bytes (per DW)
          BNEZ  R1,Loop     ;branch R1!=zero

Let’s start by seeing how well this loop will run when it is scheduled on a simple pipeline for DLX with the latencies from Figure 4.2.

EXAMPLE  Show how the loop would look on DLX, both scheduled and unscheduled, including any stalls or idle clock cycles. Schedule for both delays from floating-point operations and from the delayed branch.

ANSWER  Without any scheduling the loop will execute as follows:

                            Clock cycle issued
    Loop: LD    F0,0(R1)    1
          stall             2
          ADDD  F4,F0,F2    3
          stall             4
          stall             5
          SD    0(R1),F4    6
          SUBI  R1,R1,#8    7
          stall             8
          BNEZ  R1,Loop     9
          stall             10

This requires 10 clock cycles per iteration: one stall for the LD, two for the ADDD, one for the SUBI (since a branch reads the operand in ID), and one for the delayed branch. We can schedule the loop to obtain only one stall:

    Loop: LD    F0,0(R1)
          SUBI  R1,R1,#8
          ADDD  F4,F0,F2
          stall
          BNEZ  R1,Loop     ;delayed branch
          SD    8(R1),F4    ;altered & interchanged with SUBI

Execution time has been reduced from 10 clock cycles to 6. The stall after ADDD is for the use by the SD.
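The two cycle counts in this example can be reproduced with a very small model of a single-issue, in-order pipeline. The sketch below is an illustration under stated assumptions, not a DLX simulator: the tuple encoding of instructions, the kind names, and the function cycles_per_iteration are all hypothetical, latencies come from Figure 4.2 plus the one-cycle SUBI-to-branch delay noted in the text, and any producer/consumer pair not listed is assumed to need no intervening cycles. The model reports only the total; it may place a stall in a different slot than the listings do.

```python
# Intervening cycles required between a producer and a dependent consumer.
# FP entries follow Figure 4.2; unlisted pairs are assumed to need 0 cycles.
LATENCY = {
    ("fpalu", "fpalu"): 3,
    ("fpalu", "store"): 2,
    ("load", "fpalu"): 1,
    ("load", "store"): 0,
    ("intalu", "branch"): 1,  # the branch reads its operand in ID
}


def cycles_per_iteration(instrs):
    """Count clock cycles for one pass through `instrs`.

    Each instruction is a tuple (kind, dest, sources): `dest` is the register
    written (or None) and `sources` lists the registers read.
    """
    producer = {}  # register -> (kind, issue cycle) of its most recent writer
    cycle = 0
    for kind, dest, sources in instrs:
        earliest = cycle + 1  # in-order, at most one issue per cycle
        for reg in sources:
            if reg in producer:
                p_kind, p_cycle = producer[reg]
                gap = LATENCY.get((p_kind, kind), 0)
                earliest = max(earliest, p_cycle + 1 + gap)
        cycle = earliest
        if dest is not None:
            producer[dest] = (kind, cycle)
    # A branch at the end leaves its delay slot empty: one more cycle.
    if instrs and instrs[-1][0] == "branch":
        cycle += 1
    return cycle


unscheduled = [
    ("load", "F0", ["R1"]),          # LD   F0,0(R1)
    ("fpalu", "F4", ["F0", "F2"]),   # ADDD F4,F0,F2
    ("store", None, ["F4", "R1"]),   # SD   0(R1),F4
    ("intalu", "R1", ["R1"]),        # SUBI R1,R1,#8
    ("branch", None, ["R1"]),        # BNEZ R1,Loop
]

scheduled = [
    ("load", "F0", ["R1"]),          # LD   F0,0(R1)
    ("intalu", "R1", ["R1"]),        # SUBI R1,R1,#8
    ("fpalu", "F4", ["F0", "F2"]),   # ADDD F4,F0,F2
    ("branch", None, ["R1"]),        # BNEZ R1,Loop
    ("store", None, ["F4", "R1"]),   # SD   8(R1),F4 fills the delay slot
]

print(cycles_per_iteration(unscheduled))  # 10
print(cycles_per_iteration(scheduled))    # 6
```

The model treats only read-after-write dependences within one pass of the loop body, which is enough for these two listings because neither contains a hazard of another kind.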
Notice that to schedule the delayed branch, the compiler had to determine that it could swap the SUBI and SD by changing the address to which the SD stored: the address was 0(R1) and is now 8(R1). This is not a trivial observation, since most compilers would see that the SD instruction depends on the SUBI and would refuse to interchange them. A smarter compiler could figure out the relationship and perform the interchange. The chain of dependent instructions from the LD to the ADDD and then to the SD determines the clock cycle count for this loop. This chain must take at least 6 cycles because of dependencies and pipeline latencies.

In the above example, we complete one loop iteration and store back one array element every 6 clock cycles, but the actual work of operating on the array element takes just 3 (the load, add, and store) of those 6 clock cycles. The remaining 3 clock cycles consist of loop overhead (the SUBI and BNEZ) and a stall. To eliminate these 3 clock cycles we need to get more operations within the loop relative to the number of overhead instructions. A simple scheme for increasing the number of instructions relative to the branch and overhead instructions is loop unrolling. This is done by simply replicating the loop body multiple times, and adjusting the loop termination code.

Loop unrolling can also be used to improve scheduling. Because it eliminates the branch, it allows instructions from different iterations to be scheduled together. In this case, we can eliminate the load delay stall by creating additional independent instructions within the loop body. The compiler can then schedule these instructions into the load delay slot. If we simply replicated the instructions when we unrolled the loop, the resulting use of the same registers could prevent us from effectively scheduling the loop.
Thus, we will want to use different registers for each iteration, increasing the required register count.

EXAMPLE  Show our loop unrolled so that there are four copies of the loop body, assuming R1 is initially a multiple of 32, which means that the number of loop iterations is a multiple of 4. Eliminate any obviously redundant computations and do not reuse any of the registers.

ANSWER  Here is the result after merging the SUBI instructions and dropping the unnecessary BNEZ operations that are duplicated during unrolling.

    Loop: LD    F0,0(R1)
          ADDD  F4,F0,F2
          SD    0(R1),F4       ;drop SUBI & BNEZ
          LD    F6,-8(R1)
          ADDD  F8,F6,F2
          SD    -8(R1),F8      ;drop SUBI & BNEZ
          LD    F10,-16(R1)
          ADDD  F12,F10,F2
          SD    -16(R1),F12    ;drop SUBI & BNEZ
          LD    F14,-24(R1)
          ADDD  F16,F14,F2
          SD    -24(R1),F16
          SUBI  R1,R1,#32
          BNEZ  R1,Loop

We have eliminated three branches and three decrements of R1. The addresses on the loads and stores have been compensated to allow the SUBI instructions on R1 to be merged. Without scheduling, every operation is followed by a dependent operation and thus will cause a stall. This loop will run in 28 clock cycles (each LD has 1 stall, each ADDD 2, the SUBI 1, the branch 1, plus 14 instruction issue cycles), or 7 clock cycles for each of the four elements. Although this unrolled version is currently slower than the scheduled version of the original loop, this will change when we schedule the unrolled loop. Loop unrolling is normally done early in the compilation process, so that redundant computations can be exposed and eliminated by the optimizer.

In real programs we do not usually know the upper bound on the loop. Suppose it is n, and we would like to unroll the loop to make k copies of the body. Instead of a single unrolled loop, we generate a pair of consecutive loops. The first executes (n mod k) times and has a body that is the original loop. The second is the unrolled body surrounded by an outer loop that iterates (n/k) times.
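The pair of consecutive loops described above can be sketched in a high-level language. This is a minimal illustration, assuming an unroll factor of k = 4, a Python list standing in for the array, and a hypothetical function name add_scalar_unrolled; it shows the structure of the transformation rather than what any particular compiler emits.

```python
K = 4  # unroll factor; an assumption made for this sketch


def add_scalar_unrolled(x, s):
    """Add s to every element of x in place using a loop unrolled K times."""
    n = len(x)
    i = 0
    # First loop: the original body, executed (n mod K) times.
    for _ in range(n % K):
        x[i] += s
        i += 1
    # Second loop: K copies of the body, executed (n / K) times.
    for _ in range(n // K):
        x[i] += s
        x[i + 1] += s
        x[i + 2] += s
        x[i + 3] += s
        i += K  # one merged index update per K elements
    return x


print(add_scalar_unrolled([0, 1, 2, 3, 4, 5], 10))  # [10, 11, 12, 13, 14, 15]
```

With six elements, the first loop handles two of them and the unrolled loop makes a single pass over the remaining four, so the unrolled body never runs past the end of the array.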
In the above example, unrolling improves the performance of this loop by eliminating overhead instructions, although it increases code size substantially. What will happen to the performance increase when the loop is scheduled on DLX?

EXAMPLE  Show the unrolled loop in the previous example after it has been scheduled on DLX.

ANSWER

    Loop: LD    F0,0(R1)
          LD    F6,-8(R1)
          LD    F10,-16(R1)
          LD    F14,-24(R1)
          ADDD  F4,F0,F2
          ADDD  F8,F6,F2
          ADDD  F12,F10,F2
          ADDD  F16,F14,F2
          SD    0(R1),F4
          SD    -8(R1),F8
          SUBI  R1,R1,#32
          SD    16(R1),F12
          BNEZ  R1,Loop
          SD    8(R1),F16      ;8-32 = -24

The execution time of the unrolled loop has dropped to a total of 14 clock cycles, or 3.5 clock cycles per element, compared with 7 cycles per element before scheduling and 6 cycles when scheduled but not unrolled.

The gain from scheduling on the unrolled loop is even larger than on the original loop. This is because unrolling the loop exposes more computation that can be scheduled to minimize the stalls; the code above has no stalls. Scheduling the loop in this fashion necessitates realizing that the loads and stores are independent and can be interchanged.

Loop unrolling is a simple but useful method for increasing the size of straight-line code fragments that can be scheduled effectively. This transformation is useful in a variety of processors, from simple pipelines like those in DLX to the pipelines described in section 4.4 that issue more than one instruction per cycle.

Summary of the Loop Unrolling and Scheduling Example

Throughout this chapter we will look at both hardware and software techniques that allow us to take advantage of instruction-level parallelism to fully utilize the potential of the functional units in a processor. The key to most of these techniques is to know when and how the ordering among instructions may be changed. In our example we made many such changes, which to us, as human beings, were obviously allowable.
In practice, this process must be performed in a methodical fashion either by a compiler or by hardware. To obtain the final unrolled code we had to make the following decisions and transformations:

1. Determine that it was legal to move the SD after the SUBI and BNEZ, and find the amount to adjust the SD offset.

2. Determine that unrolling the loop would be useful by finding that the loop iterations were independent, except for the loop maintenance code.

3. Use different registers to avoid unnecessary constraints that would be forced by using the same registers for different computations.

4. Eliminate the extra tests and branches and adjust the loop maintenance code.

5. Determine that the loads and stores in the unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent. This requires analyzing the memory addresses and finding that they do not refer to the same address.

6. Schedule the code, preserving any dependences needed to yield the same result as the original code.

The key requirement underlying all of these transformations is an understanding of how an instruction depends on another and how the instructions can be changed or reordered given the dependences. The next subsection defines these ideas and describes the restrictions that any hardware or software system must maintain.

Dependences

Determining how one instruction depends on another is critical not only to the scheduling process we used in the earlier example but also to determining how much parallelism exists in a program and how that parallelism can be exploited. In particular, to exploit instruction-level parallelism we must determine which instructions can be executed in parallel. If two instructions are parallel, they can execute simultaneously in a pipeline without causing any stalls, assuming the pipeline has sufficient resources (and hence no structural hazards exist).
Two instructions that are dependent are not parallel. Likewise, two instructions that are dependent cannot be reordered. Instructions that can be reordered are parallel and vice versa. The key in both cases is to determine whether an instruction is dependent on another instruction.

Data Dependences

There are three different types of dependences: data dependences, name dependences, and control dependences. An instruction j is data dependent on instruction i if either of the following holds:

- Instruction i produces a result that is used by instruction j, or
- Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.

The second condition simply states that one instruction is dependent on another if there exists a chain of dependences of the first type between the two instructions. This dependence chain can be as long as the entire program.

In our example, the sequences

    Loop: LD    F0,0(R1)    ;F0=array element
          ADDD  F4,F0,F2    ;add scalar in F2
          SD    0(R1),F4    ;store result

and

          SUBI  R1,R1,#8    ;decrement pointer
                            ;8 bytes (per DW)
          BNEZ  R1,Loop     ;branch R1!=zero

are both dependent sequences, as shown by the arrows, with each instruction depending on the previous one. The arrows here and in following examples show the order that must be preserved for correct execution. The arrow points from an instruction that must precede the instruction that the arrowhead points to.

If two instructions are data dependent they cannot execute simultaneously or be completely overlapped. The dependence implies that there would be a chain of one or more RAW hazards between the two instructions. Executing the instructions simultaneously will cause a processor with pipeline interlocks to detect a hazard and stall, thereby reducing or eliminating the overlap.
In a processor without interlocks that relies on compiler scheduling, the compiler cannot schedule dependent instructions in such a way that they completely overlap, since the program will not execute correctly. The presence of a data dependence in an instruction sequence reflects a data dependence in the source code from which the instruction sequence was generated. The effect of the original data dependence must be preserved. Dependences are a property of programs. Whether a given dependence results in an actual hazard being detected and whether that hazard actually causes a stall are properties of the pipeline organization. This difference is critical to understanding how instruction-level parallelism can be exploited. In our example, there is a data dependence between the SUBI and the BNEZ; this dependence causes a stall because we moved the branch test for the DLX pipeline to the ID stage. Had the branch test stayed in EX, this dependence would not cause a stall. (Of course, the branch delay would then still be 2 cycles, rather than 1.) The presence of the dependence indicates the potential for a hazard, but the actual hazard and the length of any stall is a property of the pipeline. The importance of the data dependences is that a dependence (1) indicates the possibility of a hazard, (2) determines the order in which results must be calculated, and (3) sets an upper bound on how much parallelism can possibly be exploited. Such limits are explored in section 4.7. Since a data dependence can limit the amount of instruction-level parallelism we can exploit, a major focus of this chapter is overcoming the limitations. This is done in two different ways: maintaining the dependence but avoiding a hazard, and eliminating a dependence by transforming the code. Scheduling the code is the primary method used to avoid a hazard without altering the dependence. 
We used this technique in several places in our example both before and after unrolling; the dependence LD, ADDD, SD was scheduled to avoid hazards, but the dependence remains in the code. We will see techniques for implementing scheduling of code both in hardware and in software. In our earlier example, we also eliminated dependences, though we did not show this step explicitly.

EXAMPLE  Show how the process of optimizing the loop overhead by unrolling the loop actually eliminates data dependences. In this example and those used in the remainder of this chapter, we use nondelayed branches for simplicity; it is easy to extend the examples to use delayed branches.

ANSWER  Here is the unrolled but unoptimized code with the extra SUBI instructions, but without the branches. (Eliminating the branches is another type of transformation, since it involves control rather than data.) The arrows show the data dependences that are within the unrolled body and involve the SUBI instructions:

    Loop: LD    F0,0(R1)
          ADDD  F4,F0,F2
          SD    0(R1),F4
          SUBI  R1,R1,#8       ;drop BNEZ
          LD    F6,0(R1)
          ADDD  F8,F6,F2
          SD    0(R1),F8
          SUBI  R1,R1,#8       ;drop BNEZ
          LD    F10,0(R1)
          ADDD  F12,F10,F2
          SD    0(R1),F12
          SUBI  R1,R1,#8       ;drop BNEZ
          LD    F14,0(R1)
          ADDD  F16,F14,F2
          SD    0(R1),F16
          SUBI  R1,R1,#8
          BNEZ  R1,LOOP

As the arrows show, the SUBI instructions form a dependent chain that involves the SUBI, LD, and SD instructions. This forces the body to execute in order, as well as making the SUBI instructions necessary, which increases the instruction count. The compiler removes this dependence by symbolically computing the intermediate values of R1 and folding the computation into the offset of the LD and SD instructions and by changing the final SUBI into a decrement by 32. This makes the three SUBI unnecessary, and the compiler can remove them. There are other types of dependences in this code, but we will deal with them shortly.
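The symbolic folding of the intermediate R1 values into the load and store offsets can be sketched as a single pass that keeps a running total of the decrements seen so far. The tuple encoding (opcode, register, value) and the function name fold_decrements are assumptions made for this illustration, and the ADDD instructions are left out because they do not reference R1.

```python
def fold_decrements(body):
    """Fold every SUBI on R1 into the offsets of the later LD and SD.

    `body` is a list of (opcode, register, value) tuples: for LD and SD the
    value is the offset from R1, for SUBI it is the decrement amount.
    Returns the rewritten memory operations and the single merged decrement.
    """
    running = 0  # total amount R1 would have been decremented so far
    folded = []
    for opcode, register, value in body:
        if opcode == "SUBI":
            running += value  # absorb the decrement instead of emitting it
        else:
            folded.append((opcode, register, value - running))
    return folded, running


# The unrolled body from the listing, without the ADDD instructions.
body = []
for load_reg, store_reg in [("F0", "F4"), ("F6", "F8"),
                            ("F10", "F12"), ("F14", "F16")]:
    body += [("LD", load_reg, 0), ("SD", store_reg, 0), ("SUBI", "R1", 8)]

folded, total = fold_decrements(body)
for opcode, register, offset in folded:
    print(opcode, register, offset)
print("SUBI R1,R1,#%d" % total)  # SUBI R1,R1,#32
```

The resulting offsets are 0, -8, -16, and -24 for the four copies, followed by one decrement of 32, which matches the merged form of the unrolled loop shown earlier in this section.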
Removing a real data dependence, as we did in the example above, requires knowledge of the global structure of the program and typically a fair amount of analysis. Thus, techniques for doing such optimizations are carried out by compilers, in contrast to the avoidance of hazards by scheduling, which can be performed both in hardware and software.

A data value may flow between instructions either through registers or through memory locations. When the data flow occurs in a register, detecting the dependence is reasonably straightforward since the register names are fixed in the instructions, although it gets more complicated when branches intervene. Dependences that flow through memory locations are more difficult to detect since two addresses may refer to the same location, but look different: for example, 100(R4) and 20(R6) may be identical. In addition, the effective address of a load or store may change from one execution of the instruction to another (so that 20(R4) and 20(R4) will be different), further complicating the detection of a dependence. In this chapter, we examine both hardware and software techniques for detecting data dependences that involve memory locations. The compiler techniques for detecting such dependences are critical in uncovering loop-level parallelism, as we will see shortly.

Name Dependences

The second type of dependence is a name dependence. A name dependence occurs when two instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name. There are two types of name dependences between an instruction i that precedes instruction j in program order:

1. An antidependence between instruction i and instruction j occurs when instruction j writes a register or memory location that instruction i reads and instruction i is executed first.
An antidependence corresponds to a WAR hazard, and the hazard detection for WAR hazards forces the ordering of an antidependent instruction pair.

2. An output dependence occurs when instruction i and instruction j write the same register or memory location. The ordering between the instructions must be preserved. Output dependences are preserved by detecting WAW hazards.

Both antidependences and output dependences are name dependences, as opposed to true data dependences, since there is no value being transmitted between the instructions. This means that instructions involved in a name dependence can execute simultaneously or be reordered, if the name (register number or memory location) used in the instructions is changed so the instructions do not conflict. This renaming can be more easily done for register operands and is called register renaming. Register renaming can be done either statically by a compiler or dynamically by the hardware.

EXAMPLE  Unroll our example loop, eliminating the excess loop overhead, but using the same registers in each loop copy. Indicate both the data and name dependences within the body. Show how renaming eliminates name dependences that reduce parallelism.

ANSWER  Here’s the loop unrolled but with the same registers in use for each copy. The data dependences are shown with gray arrows and the name dependences with black arrows. As in earlier examples, the direction of the arrow indicates the ordering that must be preserved for correct execution of the code:

    Loop: LD    F0,0(R1)
          ADDD  F4,F0,F2
          SD    0(R1),F4       ;drop SUBI & BNEZ
          LD    F0,-8(R1)
          ADDD  F4,F0,F2
          SD    -8(R1),F4      ;drop SUBI & BNEZ
          LD    F0,-16(R1)
          ADDD  F4,F0,F2
          SD    -16(R1),F4
          LD    F0,-24(R1)
          ADDD  F4,F0,F2
          SD    -24(R1),F4
          SUBI  R1,R1,#32
          BNEZ  R1,LOOP

The name dependences force the instructions in the loop to be almost completely ordered, allowing only the order of the LD following each SD to be interchanged.
When the registers used for each copy of the loop body are renamed only the true dependences within each body remain:

    Loop: LD    F0,0(R1)
          ADDD  F4,F0,F2
          SD    0(R1),F4       ;drop SUBI & BNEZ
          LD    F6,-8(R1)
          ADDD  F8,F6,F2
          SD    -8(R1),F8      ;drop SUBI & BNEZ
          LD    F10,-16(R1)
          ADDD  F12,F10,F2
          SD    -16(R1),F12
          LD    F14,-24(R1)
          ADDD  F16,F14,F2
          SD    -24(R1),F16
          SUBI  R1,R1,#32
          BNEZ  R1,LOOP

With the renaming, the copies of each loop body become independent and can be overlapped or executed in parallel. This renaming process can be performed either by the compiler or in hardware. In fact, we will see how the entire unrolling and renaming process can be done in the hardware.

Control Dependences

The last type of dependence is a control dependence. A control dependence determines the ordering of an instruction with respect to a branch instruction so that the non-branch instruction is executed only when it should be. Every instruction, except for those in the first basic block of the program, is control dependent on some set of branches, and, in general, these control dependences must be preserved. One of the simplest examples of a control dependence is the dependence of the statements in the “then” part of an if statement on the branch. For example, in the code segment:

    if p1 {
        S1;
    };
    if p2 {
        S2;
    }

S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.

There are two constraints on control dependences:

1. An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch. For example, we cannot take an instruction from the then portion of an if statement and move it before the if statement.

2. An instruction that is not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch. For example, we cannot take a statement before the if statement and move it into the then portion.
It is sometimes possible to violate these constraints and still have a correct execution. Before we examine this further, let's see how control dependences limit parallelism in our example.

EXAMPLE Show the unrolled code sequence before the loop overhead is optimized away. Indicate the control dependences. How are the control dependences removed?

ANSWER Here is the unrolled code sequence with the branches still in place. The branches for the first three loop iterations have the conditions complemented, because we want the fall-through case (when the branch is untaken) to execute another loop iteration. Within the unrolled body, the instructions that follow each intermediate branch are control dependent on that branch.

    Loop:   LD      F0,0(R1)
            ADDD    F4,F0,F2
            SD      0(R1),F4
            SUBI    R1,R1,#8
            BEQZ    R1,exit         ;complement of BNEZ
            LD      F6,0(R1)
            ADDD    F8,F6,F2
            SD      0(R1),F8
            SUBI    R1,R1,#8
            BEQZ    R1,exit         ;complement of BNEZ
            LD      F10,0(R1)
            ADDD    F12,F10,F2
            SD      0(R1),F12
            SUBI    R1,R1,#8
            BEQZ    R1,exit         ;complement of BNEZ
            LD      F14,0(R1)
            ADDD    F16,F14,F2
            SD      0(R1),F16
            SUBI    R1,R1,#8
            BNEZ    R1,LOOP
    exit:

The presence of the intermediate branches (BEQZ instructions) prevents the overlapping of iterations for scheduling, since moving the instructions would require changing the control dependences. Furthermore, the presence of the intermediate branches prevents the removal of the SUBI instructions, since the value computed by each SUBI is used in the branch. Hence the first goal is to remove the intermediate branches. Removing the branches changes the control dependences. In this case, we know that the content of R1 is a multiple of 32 and that the number of loop iterations is a multiple of 4. This insight allows us to determine that the three intermediate BEQZ instructions will never be taken. Since they are never taken, the branches are no-ops and no instructions are control dependent on the branches.
After removing the branches, we can then optimize the data dependences involving the SUBI instructions, as we did in the example on page 230.

Control dependence is preserved by two properties in simple pipelines, such as that of Chapter 3. First, instructions execute in order. This ensures that an instruction that occurs before a branch is executed before the branch. Second, the detection of control or branch hazards ensures that an instruction that is control dependent on a branch is not executed until the branch direction is known.

Although preserving control dependence is a useful and simple way to help preserve program correctness, the control dependence in itself is not the fundamental performance limit. In the above example, the compiler removed some control dependences. In other cases, we may be willing to execute instructions that should not have been executed, thereby violating the control dependences, if we can do so without affecting the correctness of the program. Control dependence is not the critical property that must be preserved. Instead, the two properties critical to program correctness, and normally preserved by control dependence, are the exception behavior and the data flow.

Preserving the exception behavior means that any changes in the ordering of instruction execution must not change how exceptions are raised in the program. Often this is relaxed to mean that the reordering of instruction execution must not cause any new exceptions in the program. A simple example shows how maintaining the control dependences can prevent such situations. Consider this code sequence, recalling that we are using nondelayed branches:

            BEQZ    R2,L1
            LW      R1,0(R2)
    L1:

In this case, if we ignore the control dependence and move the load instruction before the branch, the load instruction may cause a memory protection exception.
Notice that no data dependence prevents us from interchanging the BEQZ and the LW; it is only the control dependence. A similar situation could arise with an FP instruction that could raise an exception. In either case, if the branch is taken, such an exception would not occur if the instruction were not hoisted above the branch. To allow us to reorder the instructions, we would like to just ignore the exception when the branch is taken. In section 4.6, we will look at two techniques, speculation and conditional instructions, that allow us to overcome this exception problem.

The second property preserved by maintenance of control dependences is the data flow. The data flow is the actual flow of data among instructions that produce results and those that consume them. Branches make the data flow dynamic, since they allow the source of data for a given instruction to come from many points. Consider the following code fragment:

            ADD     R1,R2,R3
            BEQZ    R4,L
            SUB     R1,R5,R6
    L:      OR      R7,R1,R8

In this example, the value of R1 used by the OR instruction depends on whether the branch is taken or not. Data dependence alone is not sufficient to preserve correctness, since it deals only with the static ordering of reads and writes. Thus while the OR instruction is data dependent on both the ADD and SUB instructions, this is insufficient for correct execution. Instead, when the instructions execute, the data flow must be preserved: If the branch is not taken, then the value of R1 computed by the SUB should be used by the OR, and if the branch is taken, the value of R1 computed by the ADD should be used by the OR. By preserving the control dependence of the SUB on the branch, we prevent an illegal change to the data flow. Speculation and conditional instructions, which help with the exception problem, allow us to change the control dependence while still maintaining the data flow, as we will see in section 4.6.
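To make the data-flow argument concrete, the following is a minimal sketch in Python that interprets the four-instruction fragment above with ordinary integers standing in for registers. The function name run and the hoist_sub flag are hypothetical, introduced only for this illustration; with hoist_sub set, the SUB is moved above the BEQZ, which is the illegal reordering the text warns about.

```python
def run(r2, r3, r4, r5, r6, r8, hoist_sub=False):
    """Return the value OR writes to R7 for the ADD/BEQZ/SUB/OR fragment.

    hoist_sub=True models moving SUB above the branch, which breaks
    the control dependence of SUB on BEQZ.
    """
    r1 = r2 + r3                  # ADD  R1,R2,R3
    if hoist_sub:
        r1 = r5 - r6              # SUB executed unconditionally (illegal move)
    taken = (r4 == 0)             # BEQZ R4,L
    if not taken and not hoist_sub:
        r1 = r5 - r6              # SUB  R1,R5,R6 (only on the fall-through path)
    return r1 | r8                # L: OR R7,R1,R8


# Branch taken (R4 == 0): the OR must see the ADD result.
correct = run(1, 2, 0, 10, 3, 0)
broken = run(1, 2, 0, 10, 3, 0, hoist_sub=True)
```

When the branch is taken, the two runs disagree, because the hoisted SUB has overwritten the value that the OR should have received from the ADD. When the branch is not taken, both orders give the same result, which is why the error shows up only on one path.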
Sometimes we can determine that violating the control dependence cannot affect either the exception behavior or the data flow. Consider the following code sequence:

                ADD     R1,R2,R3
                BEQZ    R12,skipnext
                SUB     R4,R5,R6
                ADD     R5,R4,R9
    skipnext:   OR      R7,R8,R9

Suppose we knew that the register destination of the SUB instruction (R4) was unused after the instruction labeled skipnext. (The property of whether a value will be used by an upcoming instruction is called liveness.) If R4 were unused, then changing the value of R4 just before the branch would not affect the data flow, since R4 would be dead (rather than live) in the code region after skipnext. Thus, if R4 were not live and the SUB instruction could not generate an exception, we could move the SUB instruction before the branch, since the program result could not be affected by this change. If the branch is taken, the SUB instruction will execute and will be useless, but it will not affect the program results. This type of code scheduling is sometimes called speculation, since the compiler is basically betting on the branch outcome; in this case, that the branch is usually not taken. More ambitious compiler speculation mechanisms are discussed in section 4.5.

Control dependence is preserved by implementing control hazard detection that causes control stalls. Control stalls can be eliminated or reduced by a variety of hardware and software techniques. Delayed branches, for example, can reduce the stalls arising from control hazards. Loop unrolling reduces control dependences, as we have seen. Other techniques for reducing the control hazard stalls and the impact of control dependences are converting branches into conditionally executed instructions and compiler-based and hardware speculation. Sections 4.5 and 4.6 examine these techniques.

In this subsection, we have defined the three types of dependences that can exist among instructions and examined examples of each in code sequences.
Because parallelism exists naturally in loops, it is useful to extend our techniques for detecting dependences to loops. The next subsection describes how we can use the concept of a dependence to determine whether an entire loop can be executed in parallel.

Loop-Level Parallelism: Concepts and Techniques

Loop-level parallelism is normally analyzed at the source level or close to it, while most analysis of ILP is done once instructions have been generated by the compiler. Loop-level analysis involves determining what dependences exist among the operands in the loop across the iterations of the loop. For now, we will consider only data dependences, which arise when an operand is written at some point and read at a later point. Name dependences also exist and may be removed by renaming techniques like those we used earlier.

The analysis of loop-level parallelism focuses on determining whether data accesses in later iterations are data dependent on data values produced in earlier iterations. Our earlier example is loop-level parallel. The computational work in each iteration is independent of previous iterations. To easily see this, we really want to look at the source representation:

    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;

There is a dependence in the loop body between the two uses of x[i], but this dependence is within a single iteration. There is no dependence between instructions in different iterations. Thus, the loop is parallel. Of course, once this loop is translated to assembly language, the loop implementation creates a loop-carried dependence, involving the register used for addressing and decrementing (R1 in our code). For this reason, loop-level parallelism is usually analyzed at or near the source level, with loops still represented in high-level form. Let's look at a more complex example.
EXAMPLE Consider a loop like this one:

    for (i=1; i<=100; i=i+1) {
        A[i+1] = A[i] + C[i];      /* S1 */
        B[i+1] = B[i] + A[i+1];    /* S2 */
    }

Assume that A, B, and C are distinct, nonoverlapping arrays. (In practice, the arrays may sometimes be the same or may overlap. Because the arrays may be passed as parameters to a procedure, which includes this loop, determining whether arrays overlap or are identical requires sophisticated, interprocedural analysis of the program.) What are the data dependences among the statements S1 and S2 in the loop?

ANSWER There are two different dependences:

1. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1], which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1].

2. S2 uses the value, A[i+1], computed by S1 in the same iteration.

These two dependences are different and have different effects. To see how they differ, let's assume that only one of these dependences exists at a time. Consider the dependence of statement S1 on an earlier iteration of S1. This dependence is a loop-carried dependence, meaning that the dependence exists between different iterations of the loop. Furthermore, since the statement S1 is dependent on itself, successive iterations of statement S1 must execute in order.

The second dependence above (S2 depending on S1) is within an iteration and not loop-carried. Thus, if this were the only dependence, multiple iterations of the loop could execute in parallel, as long as each pair of statements in an iteration were kept in order. This is the same type of dependence that exists in our initial example, in which we can fully exploit the parallelism present in the loop through unrolling. It is also possible to have a loop-carried dependence that does not prevent parallelism, as the next example shows.
EXAMPLE Consider a loop like this one:

    for (i=1; i<=100; i=i+1) {
        A[i] = A[i] + B[i];        /* S1 */
        B[i+1] = C[i] + D[i];      /* S2 */
    }

What are the dependences between S1 and S2? Is this loop parallel? If not, show how to make it parallel.

ANSWER Statement S1 uses the value assigned in the previous iteration by statement S2, so there is a loop-carried dependence between S2 and S1. Despite this loop-carried dependence, this loop can be made parallel. Unlike the earlier loop, this dependence is not circular: Neither statement depends on itself, and while S1 depends on S2, S2 does not depend on S1. A loop is parallel if it can be written without a cycle in the dependences, since the absence of a cycle means that the dependences give a partial ordering on the statements.

Although there are no circular dependences in the above loop, it must be transformed to conform to the partial ordering and expose the parallelism. Two observations are critical to this transformation:

1. There is no dependence from S1 to S2. If there were, then there would be a cycle in the dependences and the loop would not be parallel. Since this other dependence is absent, interchanging the two statements will not affect the execution of S2.

2. On the first iteration of the loop, statement S1 depends on the value of B[1] computed prior to initiating the loop.

These two observations allow us to replace the loop above with the following code sequence:

    A[1] = A[1] + B[1];
    for (i=1; i<=99; i=i+1) {
        B[i+1] = C[i] + D[i];
        A[i+1] = A[i+1] + B[i+1];
    }
    B[101] = C[100] + D[100];

The dependence between the two statements is no longer loop-carried, so that iterations of the loop may be overlapped, provided the statements in each iteration are kept in order. There are a variety of such transformations that restructure loops to expose parallelism, as we will see in section 4.5.
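One way to gain confidence in a restructuring like the one above is to run both versions on the same data and compare the results. The sketch below does this in Python; the function names original and transformed, the array size of 102 (so that index 101 is valid), and the random test data are assumptions made for this illustration and do not come from the text.

```python
import random


def original(A, B, C, D):
    """The loop as first written: S1 reads B[i] produced one iteration earlier."""
    A, B = A[:], B[:]                     # work on copies
    for i in range(1, 101):               # i = 1 .. 100
        A[i] = A[i] + B[i]                # S1
        B[i + 1] = C[i] + D[i]            # S2
    return A, B


def transformed(A, B, C, D):
    """The restructured loop: the S2-to-S1 dependence is now within an iteration."""
    A, B = A[:], B[:]
    A[1] = A[1] + B[1]                    # peeled first S1
    for i in range(1, 100):               # i = 1 .. 99
        B[i + 1] = C[i] + D[i]
        A[i + 1] = A[i + 1] + B[i + 1]
    B[101] = C[100] + D[100]              # peeled last S2
    return A, B


rng = random.Random(0)                    # fixed seed so the check is repeatable
A, B, C, D = ([rng.randint(-50, 50) for _ in range(102)] for _ in range(4))
same = original(A, B, C, D) == transformed(A, B, C, D)
```

A matching result on one data set is evidence rather than proof, but it catches off-by-one mistakes in the peeled first and last statements, which are the easiest part of this transformation to get wrong.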
The key focus of the rest of this chapter is on techniques that exploit instruction-level parallelism. The data dependences in a compiled program act as a limit on how much ILP can be exploited. The challenge is to approach that limit by trying to minimize the actual hazards and associated stalls that arise. The techniques we examine become ever more sophisticated in an attempt to exploit all the available parallelism while maintaining the necessary true data dependences in the code. Both the compiler and the hardware have a role to play: The compiler tries to eliminate or minimize dependences, while the hardware tries to prevent dependences from becoming stalls.

4.2 Overcoming Data Hazards with Dynamic Scheduling

In Chapter 3 we assumed that our pipeline fetches an instruction and issues it, unless there is a data dependence between an instruction already in the pipeline and the fetched instruction that cannot be hidden with bypassing or forwarding. Forwarding logic reduces the effective pipeline latency so that certain dependences do not result in hazards. If there is a data dependence that cannot be hidden, then the hazard detection hardware stalls the pipeline (starting with the instruction that uses the result). No new instructions are fetched or issued until the dependence is cleared. We also examined compiler techniques for scheduling the instructions so as to separate dependent instructions and minimize the number of actual hazards and resultant stalls. This approach, which has been called static scheduling, was first used in the 1960s and became popular in the 1980s as pipelining became widespread. Several early processors used another approach, called dynamic scheduling, whereby the hardware rearranges the instruction execution to reduce the stalls.
Dynamic scheduling offers several advantages: It enables handling some cases when dependences are unknown at compile time (e.g., because they may involve a memory reference), and it simplifies the compiler. Perhaps most importantly, it also allows code that was compiled with one pipeline in mind to run efficiently on a different pipeline. As we will see, these advantages are gained at a cost of a significant increase in hardware complexity.

While a dynamically scheduled processor cannot remove true data dependences, it tries to avoid stalling when dependences are present. In contrast, static pipeline scheduling, like that we have already seen, tries to minimize stalls by separating dependent instructions so that they will not lead to hazards. Of course, static scheduling can also be used on code destined to run on a processor with a dynamically scheduled pipeline. We will examine two different schemes, with the second one extending the ideas of the first to attack WAW and WAR hazards as well as RAW stalls.

Dynamic Scheduling: The Idea

A major limitation of the pipelining techniques we have used so far is that they all use in-order instruction issue: If an instruction is stalled in the pipeline, no later instructions can proceed. Thus, if there is a dependence between two closely spaced instructions in the pipeline, a stall will result. If there are multiple functional units, these units could lie idle. If instruction j depends on a long-running instruction i, currently in execution in the pipeline, then all instructions after j must be stalled until i is finished and j can execute. For example, consider this code:

    DIVD    F0,F2,F4
    ADDD    F10,F0,F8
    SUBD    F12,F8,F14

The SUBD instruction cannot execute because the dependence of ADDD on DIVD causes the pipeline to stall; yet SUBD is not data dependent on anything in the pipeline. This is a performance limitation that can be eliminated by not requiring instructions to execute in order.
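The cost of in-order execution in the three-instruction sequence above can be estimated with a very small timing model. The Python sketch below is only an illustration under stated assumptions: the latencies (40 cycles for divide, 2 for add and subtract) are chosen for the example, one instruction is fetched per cycle, a result is usable at start plus latency, and structural, WAR, and WAW hazards are ignored (this fragment has none). The names LATENCY, PROGRAM, and start_cycles are hypothetical.

```python
LATENCY = {"DIVD": 40, "ADDD": 2, "SUBD": 2}   # assumed EX latencies in cycles

# (opcode, destination, sources), in program order
PROGRAM = [
    ("DIVD", "F0", ("F2", "F4")),
    ("ADDD", "F10", ("F0", "F8")),
    ("SUBD", "F12", ("F8", "F14")),
]


def start_cycles(program, in_order):
    """Return the cycle in which each instruction begins execution."""
    ready = {}                  # register -> cycle when its pending result is usable
    starts = []
    prev_start = 0
    for n, (op, dest, srcs) in enumerate(program):
        earliest = n + 1                                  # fetched one per cycle
        operands = max(ready.get(s, 0) for s in srcs)     # wait for RAW producers
        start = max(earliest, operands)
        if in_order:
            # An instruction may not begin before the one ahead of it has begun.
            start = max(start, prev_start + 1)
        ready[dest] = start + LATENCY[op]
        starts.append(start)
        prev_start = start
    return starts


in_order = start_cycles(PROGRAM, in_order=True)
out_of_order = start_cycles(PROGRAM, in_order=False)
```

Under these assumptions the ADDD starts at the same cycle in both models, since it truly depends on the DIVD, while the SUBD is held behind the ADDD only in the in-order model. That gap is the stall dynamic scheduling aims to remove.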
In the DLX pipeline developed in the last chapter, both structural and data hazards were checked during instruction decode (ID): When an instruction could execute properly, it was issued from ID. To allow us to begin executing the SUBD in the above example, we must separate the issue process into two parts: checking the structural hazards and waiting for the absence of a data hazard. We can still check for structural hazards when we issue the instruction; thus, we still use in-order instruction issue. However, we want the instructions to begin execution as soon as their data operands are available. Thus, the pipeline will do out-of-order execution, which implies out-of-order completion.

Out-of-order completion creates major complications in handling exceptions. In the dynamically scheduled processors addressed in this section, exceptions are imprecise, since instructions may complete before an instruction issued earlier raises an exception. Thus, it is difficult to restart after an interrupt. Rather than address these problems in this section, we will discuss a solution for precise exceptions in the context of a processor with speculation in section 4.6. The approach discussed in section 4.6 can be used to solve the simpler problem that arises in these dynamically scheduled processors. For floating-point exceptions other solutions may be possible, as discussed in Appendix A.

In introducing out-of-order execution, we have essentially split the ID pipe stage into two stages:

1. Issue: Decode instructions, check for structural hazards.

2. Read operands: Wait until no data hazards, then read operands.

An instruction fetch stage precedes the issue stage and may fetch either into a single-entry latch or into a queue; instructions are then issued from the latch or queue. The EX stage follows the read operands stage, just as in the DLX pipeline.
As in the DLX floating-point pipeline, execution may take multiple cycles, depending on the operation. Thus, we may need to distinguish when an instruction begins execution and when it completes execution; between the two times, the instruction is in execution. This allows multiple instructions to be in execution at the same time. In addition to these changes to the pipeline structure, we will also change the functional unit design by varying the number of units, the latency of operations, and the functional unit pipelining, so as to better explore these more advanced pipelining techniques.

Dynamic Scheduling with a Scoreboard

In a dynamically scheduled pipeline, all instructions pass through the issue stage in order (in-order issue); however, they can be stalled or bypass each other in the second stage (read operands) and thus enter execution out of order. Scoreboarding is a technique for allowing instructions to execute out of order when there are sufficient resources and no data dependences; it is named after the CDC 6600 scoreboard, which developed this capability.

Before we see how scoreboarding could be used in the DLX pipeline, it is important to observe that WAR hazards, which did not exist in the DLX floating-point or integer pipelines, may arise when instructions execute out of order. Suppose in the earlier example, the SUBD destination is F8, so that the code sequence is

    DIVD    F0,F2,F4
    ADDD    F10,F0,F8
    SUBD    F8,F8,F14

Now there is an antidependence between the ADDD and the SUBD: If the pipeline executes the SUBD before the ADDD, it will violate the antidependence, yielding incorrect execution. Likewise, to avoid violating output dependences, WAW hazards (e.g., as would occur if the destination of the SUBD were F10) must also be detected. As we will see, both these hazards are avoided in a scoreboard by stalling the later instruction involved in the antidependence.
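The effect of violating the antidependence in the sequence above can be seen by running the three instructions in two different orders on concrete values. The Python sketch below is a simplified model: each instruction reads its operands and writes its result in one step, the initial register values are made up for the illustration, and the names PROGRAM and execute are hypothetical.

```python
# name -> (destination, source 1, source 2, operation)
PROGRAM = {
    "DIVD": ("F0", "F2", "F4", lambda a, b: a / b),
    "ADDD": ("F10", "F0", "F8", lambda a, b: a + b),
    "SUBD": ("F8", "F8", "F14", lambda a, b: a - b),
}


def execute(order):
    """Run the instructions in the given order and return the final registers."""
    regs = {"F2": 8.0, "F4": 2.0, "F8": 5.0, "F14": 1.0}   # assumed initial values
    for name in order:
        dest, src1, src2, fn = PROGRAM[name]
        regs[dest] = fn(regs[src1], regs[src2])
    return regs


program_order = execute(["DIVD", "ADDD", "SUBD"])
war_violated = execute(["DIVD", "SUBD", "ADDD"])    # SUBD overwrites F8 too early
```

In program order the ADDD reads the old value of F8; when the SUBD runs first, the ADDD picks up the new F8 and F10 comes out different. F8 itself ends with the same value in both runs, which shows that the damage is confined to the instruction that should have read F8 before it was overwritten.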
The goal of a scoreboard is to maintain an execution rate of one instruction per clock cycle (when there are no structural hazards) by executing an instruction as early as possible. Thus, when the next instruction to execute is stalled, other instructions can be issued and executed if they do not depend on any active or stalled instruction. The scoreboard takes full responsibility for instruction issue and execution, including all hazard detection. Taking advantage of out-of-order execution requires multiple instructions to be in their EX stage simultaneously. This can be achieved with multiple functional units, with pipelined functional units, or with both. Since these two capabilities—pipelined functional units and multiple functional units—are essentially equivalent for the purposes of pipeline control, we will assume the processor has multiple functional units. The CDC 6600 had 16 separate functional units, including 4 floating-point units, 5 units for memory references, and 7 units for integer operations. On DLX, scoreboards make sense primarily on the floating-point unit since the latency of the other functional units is very small. Let’s assume that there are two multipliers, one adder, one divide unit, and a single integer unit for all memory references, branches, and integer operations. Although this example is simpler than the CDC 6600, it is sufficiently powerful to demonstrate the principles without having a mass of detail or needing very long examples. Because both DLX and the CDC 6600 are load-store architectures, the techniques are nearly identical for the two processors. Figure 4.3 shows what the processor looks like. Every instruction goes through the scoreboard, where a record of the data dependences is constructed; this step corresponds to instruction issue and replaces part of the ID step in the DLX pipeline. The scoreboard then determines when the instruction can read its operands and begin execution. 
If the scoreboard decides the instruction cannot execute immediately, it monitors every change in the hardware and decides when the instruction can execute. The scoreboard also controls when an instruction can write its result into the destination register. Thus, all hazard detection and resolution is centralized in the scoreboard. We will see a picture of the scoreboard later (Figure 4.4 on page 247), but first we need to understand the steps in the issue and execution segment of the pipeline.

FIGURE 4.3 The basic structure of a DLX processor with a scoreboard. The scoreboard's function is to control instruction execution (vertical control lines). All data flows between the register file and the functional units over the buses (the horizontal lines, called trunks in the CDC 6600). There are two FP multipliers, an FP divider, an FP adder, and an integer unit. One set of buses (two inputs and one output) serves a group of functional units. The details of the scoreboard are shown in Figures 4.4–4.7.

Each instruction undergoes four steps in executing. (Since we are concentrating on the FP operations, we will not consider a step for memory access.) Let's first examine the steps informally and then look in detail at how the scoreboard keeps the necessary information that determines when to progress from one step to the next. The four steps, which replace the ID, EX, and WB steps in the standard DLX pipeline, are as follows:

1. Issue: If a functional unit for the instruction is free and no other active instruction has the same destination register, the scoreboard issues the instruction to the functional unit and updates its internal data structure. This step replaces a portion of the ID step in the DLX pipeline.
By ensuring that no other active functional unit wants to write its result into the destination register, we guarantee that WAW hazards cannot be present. If a structural or WAW hazard exists, then the instruction issue stalls, and no further instructions will issue until these hazards are cleared. When the issue stage stalls, it causes the buffer between instruction fetch and issue to fill; if the buffer is a single entry, instruction fetch stalls immediately. If the buffer is a queue with multiple instructions, it stalls when the queue fills; later we will see how a queue is used in the PowerPC 620 to connect fetch and issue.

2. Read operands: The scoreboard monitors the availability of the source operands. A source operand is available if no earlier issued active instruction is going to write it. When the source operands are available, the scoreboard tells the functional unit to proceed to read the operands from the registers and begin execution. The scoreboard resolves RAW hazards dynamically in this step, and instructions may be sent into execution out of order. This step, together with issue, completes the function of the ID step in the simple DLX pipeline.

3. Execution: The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution. This step replaces the EX step in the DLX pipeline and takes multiple cycles in the DLX FP pipeline.

4. Write result: Once the scoreboard is aware that the functional unit has completed execution, the scoreboard checks for WAR hazards and stalls the completing instruction, if necessary. A WAR hazard exists if there is a code sequence like our earlier example with ADDD and SUBD that both use F8. In that example we had the code

    DIVD    F0,F2,F4
    ADDD    F10,F0,F8
    SUBD    F8,F8,F14

ADDD has a source operand F8, which is the same register as the destination of SUBD.
But ADDD actually depends on an earlier instruction. The scoreboard will still stall the SUBD in its write result stage until ADDD reads its operands. In general, then, a completing instruction cannot be allowed to write its results when

- there is an instruction that has not read its operands that precedes (i.e., in order of issue) the completing instruction, and

- one of the operands is the same register as the result of the completing instruction.

If this WAR hazard does not exist, or when it clears, the scoreboard tells the functional unit to store its result to the destination register. This step replaces the WB step in the simple DLX pipeline.

At first glance, it might appear that the scoreboard will have difficulty separating RAW and WAR hazards. Exercise 4.6 will help you understand how the scoreboard distinguishes these two cases and thus knows when to prevent a WAR hazard by stalling an instruction that is ready to write its results.

Because the operands for an instruction are read only when both operands are available in the register file, this scoreboard does not take advantage of forwarding. This is not as large a penalty as you might initially think. Unlike our simple pipeline of Chapter 3, instructions will write their result into the register file as soon as they complete execution (assuming no WAR hazards), rather than wait for a statically assigned write slot that may be several cycles away. The effect is reduced pipeline latency, which recovers some of the benefit of forwarding. There is still one additional cycle of latency that arises since the write result and read operand stages cannot overlap. We would need additional buffering to eliminate this overhead.

Based on its own data structure, the scoreboard controls the instruction progression from one step to the next by communicating with the functional units.
There is a small complication, however. There are only a limited number of source operand buses and result buses to the register file, which represents a structural hazard. The scoreboard must guarantee that the number of functional units allowed to proceed into steps 2 and 4 does not exceed the number of buses available. We will not go into further detail on this, other than to mention that the CDC 6600 solved this problem by grouping the 16 functional units together into four groups and supplying a set of buses, called data trunks, for each group. Only one unit in a group could read its operands or write its result during a clock.

Now let's look at the detailed data structure maintained by a DLX scoreboard with five functional units. Figure 4.4 shows what the scoreboard's information looks like part way through the execution of this simple sequence of instructions:

    LD      F6,34(R2)
    LD      F2,45(R3)
    MULTD   F0,F2,F4
    SUBD    F8,F6,F2
    DIVD    F10,F0,F6
    ADDD    F6,F8,F2

There are three parts to the scoreboard:

1. Instruction status: Indicates which of the four steps the instruction is in.

2. Functional unit status: Indicates the state of the functional unit (FU). There are nine fields for each functional unit:

    Busy: Indicates whether the unit is busy or not.
    Op: Operation to perform in the unit (e.g., add or subtract).
    Fi: Destination register.
    Fj, Fk: Source-register numbers.
    Qj, Qk: Functional units producing source registers Fj, Fk.
    Rj, Rk: Flags indicating when Fj, Fk are ready and not yet read. Set to No after operands are read.

3. Register result status: Indicates which functional unit will write each register, if an active instruction has the register as its destination. This field is set to blank whenever there are no pending instructions that will write that register.
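The functional unit status and register result status described above map naturally onto a record with nine fields and a dictionary. The following Python sketch is an illustration, not the hardware's actual organization: the class name FunctionalUnit, the helper try_issue, and the rule used to initialize Qj, Qk, Rj, and Rk at issue are assumptions derived from the field definitions and from the issue conditions stated in step 1 (free unit, no active instruction with the same destination).

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FunctionalUnit:
    busy: bool = False
    op: Optional[str] = None     # operation to perform
    fi: Optional[str] = None     # destination register
    fj: Optional[str] = None     # source registers
    fk: Optional[str] = None
    qj: Optional[str] = None     # units producing Fj, Fk (None if not pending)
    qk: Optional[str] = None
    rj: bool = False             # Fj, Fk ready and not yet read
    rk: bool = False


def try_issue(units: Dict[str, FunctionalUnit], result: Dict[str, str],
              name: str, op: str, fi: str,
              fj: Optional[str] = None, fk: Optional[str] = None) -> bool:
    """Issue to unit `name`; refuse on a structural or WAW hazard."""
    if units[name].busy or fi in result:
        return False
    qj, qk = result.get(fj), result.get(fk)
    units[name] = FunctionalUnit(True, op, fi, fj, fk, qj, qk,
                                 fj is not None and qj is None,
                                 fk is not None and qk is None)
    result[fi] = name            # register result status
    return True


units = {n: FunctionalUnit() for n in ("Integer", "Mult1", "Mult2", "Add", "Divide")}
result: Dict[str, str] = {}
# The first LD has already written F6, so only the second LD is still active.
try_issue(units, result, "Integer", "Load", "F2", "R3")
units["Integer"].rj = False      # the load has already read R3
try_issue(units, result, "Mult1", "Mult", "F0", "F2", "F4")
try_issue(units, result, "Add", "Sub", "F8", "F6", "F2")
try_issue(units, result, "Divide", "Div", "F10", "F0", "F6")
addd_issued = try_issue(units, result, "Add", "Add", "F6", "F8", "F2")
```

Issuing the second LD, MULTD, SUBD, and DIVD in order leaves the add unit busy with the SUBD, so the attempt to issue the ADDD is refused: the structural hazard on the add unit. Keeping the register result status as a dictionary that simply lacks an entry for registers with no pending writer mirrors the "blank" convention in the description.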
Instruction status

    Instruction         Issue   Read operands   Execution complete   Write result
    LD    F6,34(R2)     √       √               √                    √
    LD    F2,45(R3)     √       √               √
    MULTD F0,F2,F4      √
    SUBD  F8,F6,F2      √
    DIVD  F10,F0,F6     √
    ADDD  F6,F8,F2

Functional unit status

    Name      Busy   Op     Fi    Fj   Fk   Qj        Qk        Rj    Rk
    Integer   Yes    Load   F2    R3                            No
    Mult1     Yes    Mult   F0    F2   F4   Integer             No    Yes
    Mult2     No
    Add       Yes    Sub    F8    F6   F2             Integer   Yes   No
    Divide    Yes    Div    F10   F0   F6   Mult1               No    Yes

Register result status

          F0      F2        F4   F6   F8    F10      F12   ...   F30
    FU    Mult1   Integer             Add   Divide

FIGURE 4.4 Components of the scoreboard. Each instruction that has issued or is pending issue has an entry in the instruction status table. There is one entry in the functional-unit status table for each functional unit. Once an instruction issues, the record of its operands is kept in the functional-unit status table. Finally, the register-result table indicates which unit will produce each pending result; the number of entries is equal to the number of registers. The instruction status table says that (1) the first LD has completed and written its result, and (2) the second LD has completed execution but has not yet written its result. The MULTD, SUBD, and DIVD have all issued but are stalled, waiting for their operands. The functional-unit status says that the first multiply unit is waiting for the integer unit, the add unit is waiting for the integer unit, and the divide unit is waiting for the first multiply unit. The ADDD instruction is stalled because of a structural hazard; it will clear when the SUBD completes. If an entry in one of these scoreboard tables is not being used, it is left blank. For example, the Rk field is not used on a load and the Mult2 unit is unused, hence their fields have no meaning. Also, once an operand has been read, the Rj and Rk fields are set to No. Figure 4.7 and Exercise 4.6 show why this last step is crucial.

Now let's look at how the code sequence begun in Figure 4.4 continues execution.
After that, we will be able to examine in detail the conditions that the scoreboard uses to control execution.

EXAMPLE   Assume the following EX cycle latencies (chosen to illustrate the behavior and not representative) for the floating-point functional units: Add is 2 clock cycles, multiply is 10 clock cycles, and divide is 40 clock cycles. Using the code segment in Figure 4.4 and beginning with the point indicated by the instruction status in Figure 4.4, show what the status tables look like when MULTD and DIVD are each ready to go to the write-result state.

ANSWER   There are RAW data hazards from the second LD to MULTD and SUBD, from MULTD to DIVD, and from SUBD to ADDD. There is a WAR data hazard between DIVD and ADDD. Finally, there is a structural hazard on the add functional unit for ADDD. What the tables look like when MULTD and DIVD are ready to write their results is shown in Figures 4.5 and 4.6, respectively.

Instruction status

Instruction          Issue   Read operands   Execution complete   Write result
LD     F6,34(R2)       √           √                 √                  √
LD     F2,45(R3)       √           √                 √                  √
MULTD  F0,F2,F4        √           √                 √
SUBD   F8,F6,F2        √           √                 √                  √
DIVD   F10,F0,F6       √
ADDD   F6,F8,F2        √           √                 √

Functional unit status

Name      Busy   Op     Fi    Fj    Fk    Qj      Qk    Rj    Rk
Integer   No
Mult1     Yes    Mult   F0    F2    F4                  No    No
Mult2     No
Add       Yes    Add    F6    F8    F2                  No    No
Divide    Yes    Div    F10   F0    F6    Mult1         No    Yes

Register result status

      F0      F2    F4    F6    F8    F10      F12   ...   F30
FU    Mult1               Add         Divide

FIGURE 4.5 Scoreboard tables just before the MULTD goes to write result. The DIVD has not yet read either of its operands, since it has a dependence on the result of the multiply. The ADDD has read its operands and is in execution, although it was forced to wait until the SUBD finished to get the functional unit. ADDD cannot proceed to write result because of the WAR hazard on F6, which is used by the DIVD. The Q fields are only relevant when a functional unit is waiting for another unit.
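The hazards named in the answer can be checked mechanically from the register names alone. The sketch below is a hypothetical helper, not part of the scoreboard hardware: `Instr` and `find_hazards` are illustrative names, and the function does a purely static pairwise comparison. Because it ignores which instructions have already completed, it also reports dependences on the first LD (for example the WAW pair on F6), which the answer omits because that load has already written its result.

```python
from typing import List, NamedTuple, Set, Tuple


class Instr(NamedTuple):
    text: str
    dest: str
    srcs: Tuple[str, ...]


PROGRAM: List[Instr] = [
    Instr("LD F6,34(R2)", "F6", ("R2",)),
    Instr("LD F2,45(R3)", "F2", ("R3",)),
    Instr("MULTD F0,F2,F4", "F0", ("F2", "F4")),
    Instr("SUBD F8,F6,F2", "F8", ("F6", "F2")),
    Instr("DIVD F10,F0,F6", "F10", ("F0", "F6")),
    Instr("ADDD F6,F8,F2", "F6", ("F8", "F2")),
]


def find_hazards(program: List[Instr]) -> Set[Tuple[str, int, int, str]]:
    """Return (kind, earlier_index, later_index, register) for each pair."""
    hazards = set()
    for j, later in enumerate(program):
        for i in range(j):
            earlier = program[i]
            between = program[i + 1:j]
            # RAW: later reads the value earlier produces, with no
            # intervening instruction overwriting that register.
            if earlier.dest in later.srcs and all(
                    m.dest != earlier.dest for m in between):
                hazards.add(("RAW", i, j, earlier.dest))
            # WAR: later overwrites a register that earlier still reads.
            if later.dest in earlier.srcs:
                hazards.add(("WAR", i, j, later.dest))
            # WAW: both instructions write the same register.
            if later.dest == earlier.dest:
                hazards.add(("WAW", i, j, later.dest))
    return hazards


hazards = find_hazards(PROGRAM)
for kind, i, j, reg in sorted(hazards):
    print(f"{kind} on {reg}: {PROGRAM[i].text} -> {PROGRAM[j].text}")
```

Indices are zero-based, so the DIVD/ADDD antidependence appears as the tuple `("WAR", 4, 5, "F6")`.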
Instruction status

Instruction          Issue   Read operands   Execution complete   Write result
LD     F6,34(R2)       √           √                 √                  √
LD     F2,45(R3)       √           √                 √                  √
MULTD  F0,F2,F4        √           √                 √                  √
SUBD   F8,F6,F2        √           √                 √                  √
DIVD   F10,F0,F6       √           √                 √
ADDD   F6,F8,F2        √           √                 √                  √

Functional unit status

Name      Busy   Op    Fi    Fj    Fk    Qj    Qk    Rj    Rk
Integer   No
Mult1     No
Mult2     No
Add       No
Divide    Yes    Div   F10   F0    F6                No    No

Register result status

      F0    F2    F4    F6    F8    F10      F12   ...   F30
FU                                  Divide

FIGURE 4.6 Scoreboard tables just before the DIVD goes to write result. ADDD was able to complete as soon as DIVD passed through read operands and got a copy of F6. Only the DIVD remains to finish.

Now we can see how the scoreboard works in detail by looking at what has to happen for the scoreboard to allow each instruction to proceed. Figure 4.7 shows what the scoreboard requires for each instruction to advance and the bookkeeping action necessary when the instruction does advance. The scoreboard, like a number of other structures that we examine in this chapter, records operand specifier information, such as register numbers. For example, we must record the source registers when an instruction is issued. Because we refer to the contents of a register as Regs[D] where D is a register name, there is no ambiguity. For example, Fj[FU]← S1 causes the register name S1 to be placed in Fj[FU], rather than the contents of register S1.
Issue
   Wait until:   Not Busy[FU] and not Result[D]
   Bookkeeping:  Busy[FU]← yes; Op[FU]← op; Fi[FU]← D; Fj[FU]← S1; Fk[FU]← S2; Qj← Result[S1]; Qk← Result[S2]; Rj← not Qj; Rk← not Qk; Result[D]← FU;

Read operands
   Wait until:   Rj and Rk
   Bookkeeping:  Rj← No; Rk← No; Qj← 0; Qk← 0

Execution complete
   Wait until:   Functional unit done

Write result
   Wait until:   ∀f((Fj[f] ≠ Fi[FU] or Rj[f] = No) & (Fk[f] ≠ Fi[FU] or Rk[f] = No))
   Bookkeeping:  ∀f(if Qj[f]=FU then Rj[f]← Yes); ∀f(if Qk[f]=FU then Rk[f]← Yes); Result[Fi[FU]]← 0; Busy[FU]← No

FIGURE 4.7 Required checks and bookkeeping actions for each step in instruction execution. FU stands for the functional unit used by the instruction, D is the destination register name, S1 and S2 are the source register names, and op is the operation to be done. To access the scoreboard entry named Fj for functional unit FU we use the notation Fj[FU]. Result[D] is the value of the result register field for register D. The test on the write-result case prevents the write when there is a WAR hazard, which exists if another instruction has this instruction’s destination (Fi[FU]) as a source (Fj[f] or Fk[f]) and if some other instruction has written the register (Rj = Yes or Rk = Yes). The variable f is used for any functional unit.

The costs and benefits of scoreboarding are interesting considerations. The CDC 6600 designers measured a performance improvement of 1.7 for FORTRAN programs and 2.5 for hand-coded assembly language. However, this was measured in the days before software pipeline scheduling, semiconductor main memory, and caches (which lower memory-access time). The scoreboard on the CDC 6600 had about as much logic as one of the functional units, which is surprisingly low. The main cost was in the large number of buses, about four times as many as would be required if the processor only executed instructions in order (or if it only initiated one instruction per execute cycle).
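The wait conditions and bookkeeping of Figure 4.7 translate almost line for line into code. What follows is a rough Python sketch under simplifying assumptions: the names (`FU`, `can_issue`, `write_result`, and so on) are illustrative, `None` stands in for a blank or zero entry, execution latency is not modeled, and bus limits are ignored. The short driver at the end replays the WAR interaction between DIVD and ADDD described for Figures 4.5 and 4.6.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FU:
    """One functional-unit status row (hypothetical Python model)."""
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None
    fj: Optional[str] = None
    fk: Optional[str] = None
    qj: Optional[str] = None
    qk: Optional[str] = None
    rj: bool = False
    rk: bool = False


def can_issue(fus: Dict[str, FU], result: Dict[str, str], fu: str, d: str) -> bool:
    # Not Busy[FU] (structural hazard) and not Result[D] (WAW hazard).
    return not fus[fu].busy and result.get(d) is None


def issue(fus, result, fu, op, d, s1, s2):
    u = fus[fu]
    u.busy, u.op, u.fi, u.fj, u.fk = True, op, d, s1, s2
    u.qj, u.qk = result.get(s1), result.get(s2)
    u.rj, u.rk = u.qj is None, u.qk is None   # Rj <- not Qj; Rk <- not Qk
    result[d] = fu


def can_read_operands(fus, fu) -> bool:
    return fus[fu].rj and fus[fu].rk


def read_operands(fus, fu):
    u = fus[fu]
    u.rj = u.rk = False
    u.qj = u.qk = None


def can_write_result(fus, fu) -> bool:
    # WAR check: no unit may still need to read the old value of Fi[FU].
    fi = fus[fu].fi
    return all((f.fj != fi or not f.rj) and (f.fk != fi or not f.rk)
               for f in fus.values())


def write_result(fus, result, fu):
    for f in fus.values():
        if f.qj == fu:
            f.rj = True
        if f.qk == fu:
            f.rk = True
    result[fus[fu].fi] = None
    fus[fu] = FU()   # Busy[FU] <- No; other fields cleared for tidiness


fus = {name: FU() for name in ("Integer", "Mult1", "Mult2", "Add", "Divide")}
result: Dict[str, str] = {}
issue(fus, result, "Mult1", "Mult", "F0", "F2", "F4")
issue(fus, result, "Divide", "Div", "F10", "F0", "F6")
issue(fus, result, "Add", "Add", "F6", "F8", "F2")
add_unit_free = can_issue(fus, result, "Add", "F12")   # structural hazard
read_operands(fus, "Add")
war_stall = not can_write_result(fus, "Add")           # DIVD has not read F6
read_operands(fus, "Mult1")
write_result(fus, result, "Mult1")                     # F0 now available
divide_ready = can_read_operands(fus, "Divide")
read_operands(fus, "Divide")
war_cleared = can_write_result(fus, "Add")
```

In this replay the ADDD is held at write result until the DIVD has read F6, which is the behavior the figure captions describe.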
The recently increasing interest in dynamic scheduling is motivated by attempts to issue more instructions per clock (so the cost of more buses must be paid anyway) and by ideas like speculation (explored in section 4.6) that naturally build on dynamic scheduling.

A scoreboard uses the available ILP to minimize the number of stalls arising from the program’s true data dependences. In eliminating stalls, a scoreboard is limited by several factors:

1. The amount of parallelism available among the instructions: This determines whether independent instructions can be found to execute. If each instruction depends on its predecessor, no dynamic scheduling scheme can reduce stalls. If the instructions in the pipeline simultaneously must be chosen from the same basic block (as was true in the 6600), this limit is likely to be quite severe.

2. The number of scoreboard entries: This determines how far ahead the pipeline can look for independent instructions. The set of instructions examined as candidates for potential execution is called the window. The size of the scoreboard determines the size of the window. In this section, we assume a window does not extend beyond a branch, so the window (and the scoreboard) always contains straight-line code from a single basic block. Section 4.6 shows how the window can be extended beyond a branch.

3. The number and types of functional units: This determines the importance of structural hazards, which can increase when dynamic scheduling is used.

4. The presence of antidependences and output dependences: These lead to WAR and WAW stalls.

This entire chapter focuses on techniques that attack the problem of exposing and better utilizing available ILP. The second and third factors can be attacked by increasing the size of the scoreboard and the number of functional units; however, these changes have cost implications and may also affect cycle time.
WAW and WAR hazards become more important in dynamically scheduled processors, because the pipeline exposes more name dependences. WAW hazards also become more important if we use dynamic scheduling with a branch prediction scheme that allows multiple iterations of a loop to overlap. The next subsection looks at a technique called register renaming that dynamically eliminates name dependences so as to avoid WAR and WAW hazards. Register renaming does this by replacing the register names (such as those kept in the scoreboard) with the names of a larger set of virtual registers. The register renaming scheme also is the basis for implementing forwarding.

Another Dynamic Scheduling Approach: The Tomasulo Approach

Another approach to allow execution to proceed in the presence of hazards was used by the IBM 360/91 floating-point unit. This scheme was invented by Robert Tomasulo and is named after him. Tomasulo’s scheme combines key elements of the scoreboarding scheme with the introduction of register renaming. There are many variations on this scheme, though the key concept of renaming registers to avoid WAR and WAW hazards is the most common characteristic.

The IBM 360/91 was completed about three years after the CDC 6600, just before caches appeared in commercial processors. IBM’s goal was to achieve high floating-point performance from an instruction set and from compilers designed for the entire 360 computer family, rather than from specialized compilers for the high-end processors. The 360 architecture had only four double-precision floating-point registers, which limits the effectiveness of compiler scheduling; this fact was another motivation for the Tomasulo approach. In addition, the IBM 360/91 had long memory accesses and long floating-point delays, which Tomasulo’s algorithm was designed to overcome. At the end of the section, we will see that Tomasulo’s algorithm can also support the overlapped execution of multiple iterations of a loop.
We explain the algorithm, which focuses on the floating-point unit, in the context of a pipelined, floating-point unit for DLX. The primary difference between DLX and the 360 is the presence of register-memory instructions in the latter processor. Because Tomasulo’s algorithm uses a load functional unit, no significant changes are needed to add register-memory addressing modes. The primary addition is another bus. The IBM 360/91 also had pipelined functional units, rather than multiple functional units. The only difference between these is that a pipelined unit can start at most one operation per clock cycle. Since there are really no fundamental differences, we describe the algorithm as if there were multiple functional units. The IBM 360/91 could accommodate three operations for the floating-point adder and two for the floating-point multiplier. In addition, up to six floating-point loads, or memory references, and up to three floating-point stores could be outstanding. Load data buffers and store data buffers are used for this function. Although we will not discuss the load and store units, we do need to include the buffers for operands. Tomasulo’s scheme shares many ideas with the scoreboard scheme, so we assume that you understand the scoreboard thoroughly. In the last section, we saw how a compiler could rename registers to avoid WAW and WAR hazards. In Tomasulo’s scheme this functionality is provided by the reservation stations, which buffer the operands of instructions waiting to issue, and by the issue logic. The basic idea is that a reservation station fetches and buffers an operand as soon as it is available, eliminating the need to get the operand from a register. In addition, pending instructions designate the reservation station that will provide their input. Finally, when successive writes to a register appear, only the last one is actually used to update the register.
As instructions are issued, the register specifiers for pending operands are renamed to the names of the reservation station in a process called register renaming. This combination of issue logic and reservation stations provides renaming and eliminates WAW and WAR hazards. This additional capability is the major conceptual difference between scoreboarding and Tomasulo’s algorithm. Since there can be more reservation stations than real registers, the technique can eliminate hazards that could not be eliminated by a compiler. As we explore the components of Tomasulo’s scheme, we will return to the topic of register renaming and see exactly how the renaming occurs and how it eliminates hazards. In addition to the use of register renaming, there are two other significant differences in the organization of Tomasulo’s scheme and scoreboarding. First, hazard detection and execution control are distributed: The reservation stations at each functional unit control when an instruction can begin execution at that unit. This function is centralized in the scoreboard. Second, results are passed directly to functional units from the reservation stations where they are buffered, rather than going through the registers. This is done with a common result bus that allows all units waiting for an operand to be loaded simultaneously (on the 360/91 this is called the common data bus, or CDB). In comparison, the scoreboard writes results into registers, where waiting functional units may have to contend for them. The number of result buses in either the scoreboard or Tomasulo’s scheme can be varied. In the actual implementations, the CDC 6600 had multiple completion buses (two in the floating-point unit), while the IBM 360/91 had only one. Figure 4.8 shows the basic structure of a Tomasulo-based floating-point unit for DLX; none of the execution control tables are shown.
The reservation stations hold instructions that have been issued and are awaiting execution at a functional unit, the operands for that instruction if they have already been computed or the source of the operands otherwise, as well as the information needed to control the instruction once it has begun execution at the unit. The load buffers and store buffers hold data or addresses coming from and going to memory. The floating-point registers are connected by a pair of buses to the functional units and by a single bus to the store buffers. All results from the functional units and from memory are sent on the common data bus, which goes everywhere except to the load buffer. All the buffers and reservation stations have tag fields, employed by hazard control.

[Figure 4.8: block diagram. Labeled elements: floating-point operation queue (from instruction unit), FP registers, load buffers 1 to 6 (from memory), store buffers 1 to 3 (to memory), operand buses, operation bus, reservation stations 1 to 3 for the FP adders and 1 to 2 for the FP multipliers, and the common data bus (CDB).]

FIGURE 4.8 The basic structure of a DLX FP unit using Tomasulo’s algorithm. Floating-point operations are sent from the instruction unit into a queue when they are issued. The reservation stations include the operation and the actual operands, as well as information used for detecting and resolving hazards. There are load buffers to hold the results of outstanding loads that are waiting for the CDB. Similarly, store buffers are used to hold the destination memory addresses of outstanding stores waiting for their operands. All results from either the FP units or the load unit are put on the CDB, which goes to the FP register file as well as to the reservation stations and store buffers. The FP adders implement addition and subtraction, while the FP multipliers do multiplication and division.
Before we describe the details of the reservation stations and the algorithm, let’s look at the steps an instruction goes through, just as we did for the scoreboard. Since operands are transmitted differently than in a scoreboard, there are only three steps:

1. Issue: Get an instruction from the floating-point operation queue. If the operation is a floating-point operation, issue it if there is an empty reservation station, and send the operands to the reservation station if they are in the registers. If the operation is a load or store, it can issue if there is an available buffer. If there is not an empty reservation station or an empty buffer, then there is a structural hazard and the instruction stalls until a station or buffer is freed. This step also performs the process of renaming registers.

2. Execute: If one or more of the operands is not yet available, monitor the CDB while waiting for it to be computed. When an operand becomes available, it is placed into the corresponding reservation station. When both operands are available, execute the operation. This step checks for RAW hazards.

3. Write result: When the result is available, write it on the CDB and from there into the registers, into any reservation stations waiting for this result, and to any waiting store buffers.

Although these steps are fundamentally similar to those in the scoreboard, there are three important differences. First, there is no checking for WAW and WAR hazards; these are eliminated when the register operands are renamed during issue. Second, the CDB is used to broadcast results rather than waiting on the registers. Third, the loads and stores are treated as basic functional units. The data structures used to detect and eliminate hazards are attached to the reservation stations, the register file, and the load and store buffers.
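The three steps can be sketched in a few dozen lines of Python. This is an illustrative model only: the class `TomasuloSketch`, its method names, and the register values are assumptions made for the example, load and store buffers are left out, and the CDB is modeled as a simple broadcast loop with no arbitration or timing. The driver shows the ADDD writing F6 before the DIVD executes, with the DIVD still using the old value of F6 that it captured at issue.

```python
import operator
from dataclasses import dataclass
from typing import Dict, Iterable, Optional

OPS = {"ADD": operator.add, "SUB": operator.sub,
       "MULT": operator.mul, "DIV": operator.truediv}


@dataclass
class Station:
    """One reservation station; a tag of None means 'value already in V'."""
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None
    vk: Optional[float] = None
    qj: Optional[str] = None
    qk: Optional[str] = None


class TomasuloSketch:
    def __init__(self, stations: Iterable[str], regs: Dict[str, float]):
        self.rs = {name: Station() for name in stations}
        self.regs = dict(regs)
        self.qi = {name: None for name in regs}   # register status tags

    def issue(self, r: str, op: str, d: str, s1: str, s2: str) -> bool:
        if self.rs[r].busy:
            return False                           # structural hazard: stall
        qj, qk = self.qi[s1], self.qi[s2]
        # Copy available operands now; otherwise remember the producer's tag.
        self.rs[r] = Station(True, op,
                             None if qj else self.regs[s1],
                             None if qk else self.regs[s2], qj, qk)
        self.qi[d] = r                             # rename the destination
        return True

    def ready(self, r: str) -> bool:
        st = self.rs[r]
        return st.busy and st.qj is None and st.qk is None

    def execute(self, r: str) -> float:
        st = self.rs[r]
        return OPS[st.op](st.vj, st.vk)

    def write_result(self, r: str, value: float) -> None:
        # Broadcast: registers and stations waiting on tag r pick up the value.
        for reg in self.regs:
            if self.qi[reg] == r:
                self.regs[reg] = value
                self.qi[reg] = None
        for st in self.rs.values():
            if st.qj == r:
                st.vj, st.qj = value, None
            if st.qk == r:
                st.vk, st.qk = value, None
        self.rs[r] = Station()


m = TomasuloSketch(["Add1", "Mult1", "Mult2"],
                   {"F0": 0.0, "F2": 2.0, "F4": 3.0,
                    "F6": 12.0, "F8": 1.0, "F10": 0.0})
m.issue("Mult1", "MULT", "F0", "F2", "F4")
m.issue("Mult2", "DIV", "F10", "F0", "F6")    # captures the old F6 (12.0)
m.issue("Add1", "ADD", "F6", "F8", "F2")
divide_waiting = not m.ready("Mult2")
m.write_result("Add1", m.execute("Add1"))     # ADDD finishes first
m.write_result("Mult1", m.execute("Mult1"))
m.write_result("Mult2", m.execute("Mult2"))
```

Because the divide captured its F6 operand at issue, the early write of F6 by the add does not disturb it.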
Although different information is attached to different objects, everything except the load buffers contains a tag field per entry. These tags are essentially names for an extended set of virtual registers used in renaming. In this example, the tag field is a four-bit quantity that denotes one of the five reservation stations or one of the six load buffers; as we will see this produces the equivalent of eleven registers that can be designated as result registers (as opposed to the four double-precision registers that the 360 architecture contains). In a processor with more real registers, we would want renaming to provide an even larger set of virtual registers. The tag field describes which reservation station contains the instruction that will produce a result needed as a source operand. Once an instruction has issued and is waiting for a result, it refers to the operand by the reservation station number, rather than by the number of the destination register written by the instruction producing the value. Unused values, such as zero, indicate that the operand is already available in the registers. Because there are more reservation stations than actual register numbers, WAW and WAR hazards are eliminated by renaming results using reservation station numbers. Although in Tomasulo’s scheme the reservation stations are used as the extended virtual registers, other approaches could use a register set with additional registers or a structure like the reorder buffer, which we will see in section 4.6. In describing the operation of this scheme, scoreboard terminology is used wherever this will not lead to confusion. The terminology used by the IBM 360/91 is also shown, for historical reference. It is important to remember that the tags in the Tomasulo scheme refer to the buffer or unit that will produce a result; the register names are discarded when an instruction issues to a reservation station.
Each reservation station has six fields:

Op: The operation to perform on source operands S1 and S2.

Qj, Qk: The reservation stations that will produce the corresponding source operand; a value of zero indicates that the source operand is already available in Vj or Vk, or is unnecessary. (The IBM 360/91 calls these SINKunit and SOURCEunit.)

Vj, Vk: The value of the source operands. These are called SINK and SOURCE on the IBM 360/91. Note that only one of the V field or the Q field is valid for each operand.

Busy: Indicates that this reservation station and its accompanying functional unit are occupied.

The register file and store buffer each have a field, Qi:

Qi: The number of the reservation station that contains the operation whose result should be stored into this register or into memory. If the value of Qi is blank (or 0), no currently active instruction is computing a result destined for this register or buffer. For a register, this means the value is simply the register contents.

The load and store buffers each require a busy field, indicating when a buffer is available because of completion of a load or store assigned there; the register file will have a blank Qi field when it is not busy.

Before we examine the algorithm in detail, let’s see what the information tables look like for the following code sequence:

    1.  LD      F6,34(R2)
    2.  LD      F2,45(R3)
    3.  MULTD   F0,F2,F4
    4.  SUBD    F8,F6,F2
    5.  DIVD    F10,F0,F6
    6.  ADDD    F6,F8,F2

We saw what the scoreboard looked like for this program when only the first load had written its result. Figure 4.9 depicts the reservation stations and the register tags. The numbers appended to the names add, mult, and load stand for the tag for that reservation station; Add1 is the tag for the result from the first add unit. In addition we have included an instruction status table. This table is included only to help you understand the algorithm; it is not actually a part of the hardware.
Instead, the state of each operation that has issued is kept in a reservation station.

Instruction status

Instruction          Issue   Execute   Write result
LD     F6,34(R2)       √        √           √
LD     F2,45(R3)       √        √
MULTD  F0,F2,F4        √
SUBD   F8,F6,F2        √
DIVD   F10,F0,F6       √
ADDD   F6,F8,F2        √

Reservation stations

Name    Busy   Op     Vj                  Vk                  Qj      Qk
Add1    Yes    SUB    Mem[34+Regs[R2]]                                Load2
Add2    Yes    ADD                                            Add1    Load2
Add3    No
Mult1   Yes    MULT                       Regs[F4]            Load2
Mult2   Yes    DIV                        Mem[34+Regs[R2]]    Mult1

Register status

Field   F0      F2      F4    F6     F8     F10     F12   ...   F30
Qi      Mult1   Load2         Add2   Add1   Mult2

FIGURE 4.9 Reservation stations and register tags. All of the instructions have issued, but only the first load instruction has completed and written its result to the CDB. The instruction status table is not actually present, but the equivalent information is distributed throughout the hardware. The Vj and Vk fields show the value of an operand in our hardware description language. The load and store buffers are not shown. Load buffer 2 is the only busy load buffer and it is performing on behalf of instruction 2 in the sequence, loading from memory address R3 + 45. Remember that an operand is specified by either a Q field or a V field at any time.

There are two important differences from scoreboards that are immediately observable in these tables. First, the value of an operand is stored in the reservation station in one of the V fields as soon as it is available; it is not read from the register file nor from a reservation station once the instruction has issued. Second, the ADDD instruction, which was blocked in the scoreboard by a WAR hazard at the WB stage, has issued and could complete before the DIVD initiates.

The major advantages of the Tomasulo scheme are (1) the distribution of the hazard detection logic, and (2) the elimination of stalls for WAW and WAR hazards. The first advantage arises from the distributed reservation stations and the use of the CDB.
If multiple instructions are waiting on a single result, and each instruction already has its other operand, then the instructions can be released simultaneously by the broadcast on the CDB. In the scoreboard the waiting instructions must all read their results from the registers when register buses are available. WAW and WAR hazards are eliminated by renaming registers using the reservation stations, and by the process of storing operands into the reservation station as soon as they are available. For example, in our code sequence in Figure 4.9 we have issued both the DIVD and the ADDD, even though there is a WAR hazard involving F6. The hazard is eliminated in one of two ways. First, if the instruction providing the value for the DIVD has completed, then Vk will store the result, allowing DIVD to execute independent of the ADDD (this is the case shown). On the other hand, if the LD had not completed, then Qk would point to the Load1 reservation station, and the DIVD instruction would be independent of the ADDD. Thus, in either case, the ADDD can issue and begin executing. Any uses of the result of the DIVD would point to the reservation station, allowing the ADDD to complete and store its value into the registers without affecting the DIVD. We’ll see an example of the elimination of a WAW hazard shortly. But let’s first look at how our earlier example continues execution. EXAMPLE Assume the same latencies for the floating-point functional units as we did for Figure 4.6: Add is 2 clock cycles, multiply is 10 clock cycles, and divide is 40 clock cycles. With the same code segment, show what the status tables look like when the MULTD is ready to write its result. ANSWER The result is shown in the three tables in Figure 4.10. Unlike the example with the scoreboard, ADDD has completed since the operands of DIVD are copied, thereby overcoming the WAR hazard. 
Notice that even if the load of F6 was delayed, the add into F6 could be executed without triggering a WAW hazard.

Instruction status

Instruction          Issue   Execute   Write result
LD     F6,34(R2)       √        √           √
LD     F2,45(R3)       √        √           √
MULTD  F0,F2,F4        √        √
SUBD   F8,F6,F2        √        √           √
DIVD   F10,F0,F6       √
ADDD   F6,F8,F2        √        √           √

Reservation stations

Name    Busy   Op     Vj                  Vk                  Qj      Qk
Add1    No
Add2    No
Add3    No
Mult1   Yes    MULT   Mem[45+Regs[R3]]    Regs[F4]
Mult2   Yes    DIV                        Mem[34+Regs[R2]]    Mult1

Register status

Field   F0      F2    F4    F6    F8    F10     F12   ...   F30
Qi      Mult1                           Mult2

FIGURE 4.10 Multiply and divide are the only instructions not finished. This is different from the scoreboard case, because the elimination of WAR hazards allowed the ADDD to finish right after the SUBD on which it depended.

Figure 4.11 gives the steps that each instruction must go through. Loads and stores are only slightly special. A load can execute as soon as it is available. When execution is completed and the CDB is available, a load puts its result on the CDB like any functional unit. Stores receive their values from the CDB or from the register file and execute autonomously; when they are done they turn the busy field off to indicate availability, just like a load buffer or reservation station.

To understand the full power of eliminating WAW and WAR hazards through dynamic renaming of registers, we must look at a loop.
Consider the following simple sequence for multiplying the elements of an array by a scalar in F2:

    Loop:   LD      F0,0(R1)
            MULTD   F4,F0,F2
            SD      0(R1),F4
            SUBI    R1,R1,#8
            BNEZ    R1,Loop     ; branches if R1≠0

Issue
   Wait until:             Station or buffer empty
   Action or bookkeeping:  if (Register[S1].Qi ≠ 0) {RS[r].Qj← Register[S1].Qi} else {RS[r].Vj← S1; RS[r].Qj← 0};
                           if (Register[S2].Qi ≠ 0) {RS[r].Qk← Register[S2].Qi} else {RS[r].Vk← S2; RS[r].Qk← 0};
                           RS[r].Busy← yes; Register[D].Qi← r;

Execute
   Wait until:             (RS[r].Qj=0) and (RS[r].Qk=0)
   Action or bookkeeping:  None; operands are in Vj and Vk

Write result
   Wait until:             Execution completed at r and CDB available
   Action or bookkeeping:  ∀x(if (Register[x].Qi=r) {Fx← result; Register[x].Qi← 0});
                           ∀x(if (RS[x].Qj=r) {RS[x].Vj← result; RS[x].Qj← 0});
                           ∀x(if (RS[x].Qk=r) {RS[x].Vk← result; RS[x].Qk← 0});
                           ∀x(if (Store[x].Qi=r) {Store[x].V← result; Store[x].Qi← 0});
                           RS[r].Busy← No

FIGURE 4.11 Steps in the algorithm and what is required for each step. For the issuing instruction, D is the destination, S1 and S2 are the source register numbers, and r is the reservation station or buffer that D is assigned to. RS is the reservation-station data structure. The value returned by a reservation station or by the load unit is called result. Register is the register data structure (not the register file), while Store is the store-buffer data structure. When an instruction is issued, the destination register has its Qi field set to the number of the buffer or reservation station to which the instruction is issued. If the operands are available in the registers, they are stored in the V fields. Otherwise, the Q fields are set to indicate the reservation station that will produce the values needed as source operands. The instruction waits at the reservation station until both its operands are available, indicated by zero in the Q fields.
The Q fields are set to zero either when this instruction is issued, or when an instruction on which this instruction depends completes and does its write back. When an instruction has finished execution and the CDB is available, it can do its write back. All the buffers, registers, and reservation stations whose value of Qj or Qk is the same as the completing reservation station update their values from the CDB and mark the Q fields to indicate that values have been received. Thus, the CDB can broadcast its result to many destinations in a single clock cycle, and if the waiting instructions have their operands, they can all begin execution on the next clock cycle. There is a subtle timing difficulty that arises in Tomasulo’s algorithm; we discuss this in Exercise 4.24.

If we predict that branches are taken, using reservation stations will allow multiple executions of this loop to proceed at once. This advantage is gained without unrolling the loop; in effect, the loop is unrolled dynamically by the hardware. In the 360 architecture, the presence of only four FP registers would severely limit the use of unrolling, since we would generate many WAW and WAR hazards. As we saw earlier on page 227, when we unroll a loop and schedule it to avoid interlocks, many more registers are required. Tomasulo’s algorithm supports the overlapped execution of multiple copies of the same loop with only a small number of registers used by the program. The reservation stations extend the real register set via the renaming process.

Let’s assume we have issued all the instructions in two successive iterations of the loop, but none of the floating-point loads-stores or operations has completed. The reservation stations, register-status tables, and load and store buffers at this point are shown in Figure 4.12. (The integer ALU operation is ignored, and it is assumed the branch was predicted as taken.)
Once the system reaches this state, two copies of the loop could be sustained with a CPI close to 1.0 provided the multiplies could complete in four clock cycles. If we ignore the loop overhead, which is not reduced in this scheme, the performance level achieved matches what we would obtain with compiler unrolling and scheduling, assuming we had enough registers. An additional element that is critical to making Tomasulo’s algorithm work is shown in this example. The load instruction from the second loop iteration could easily complete before the store from the first iteration, although the normal sequential order is different. The load and store can safely be done in a different order, provided the load and store access different addresses. This is checked by examining the addresses in the store buffer whenever a load is issued. If the load address matches the store-buffer address, we must stop and wait until the store buffer gets a value; we can then access it or get the value from memory. This dynamic disambiguation of addresses is an alternative to the techniques that a compiler would use when interchanging a load and store. This dynamic scheme can yield very high performance, provided the cost of branches can be kept small, an issue we address in the next section. The major drawback of this approach is the complexity of the Tomasulo scheme, which requires a large amount of hardware. In particular, there are many associative stores that must run at high speed, as well as complex control logic. Lastly, the performance gain is limited by the single completion bus (CDB). While additional CDBs can be added, each CDB must interact with all the pipeline hardware, including the reservation stations. In particular, the associative tag-matching hardware would need to be duplicated at each station for each CDB. 
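The address check described above for dynamic memory disambiguation can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `StoreEntry` record, the `resolve_load` function, the dictionary standing in for memory, and the choice to forward the buffered value once it has arrived (the text allows either forwarding it or reading it from memory after the store completes). The addresses 1000 and 992 stand in for Regs[R1] and Regs[R1]-8.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class StoreEntry:
    """One store-buffer entry (hypothetical model)."""
    address: int
    qi: Optional[str] = None       # tag of the producer; None once value is present
    value: Optional[float] = None
    busy: bool = True


def resolve_load(address: int,
                 store_buffer: List[StoreEntry],
                 memory: Dict[int, float]) -> Tuple[str, Optional[float]]:
    """Decide how a newly issued load may proceed.

    Returns ("memory", v) when no pending store targets the address,
    ("stall", None) when a matching store still waits for its value,
    and ("forward", v) when a matching store already holds its value.
    """
    matches = [e for e in store_buffer if e.busy and e.address == address]
    if not matches:
        return ("memory", memory[address])
    youngest = matches[-1]          # buffer is kept in program order
    if youngest.qi is not None:
        return ("stall", None)
    return ("forward", youngest.value)


memory = {1000: 1.5, 992: 2.5}
# Store from iteration 1: SD 0(R1),F4 waiting on Mult1.
store_buffer = [StoreEntry(address=1000, qi="Mult1")]

different_address = resolve_load(992, store_buffer, memory)    # iteration 2 load
same_address = resolve_load(1000, store_buffer, memory)

# The multiply completes and the store buffer receives its value.
store_buffer[0].value, store_buffer[0].qi = 4.5, None
after_value = resolve_load(1000, store_buffer, memory)
```

In the loop example the second iteration's load uses a different address from the first iteration's store, so it takes the "memory" path and can complete ahead of the store.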
In Tomasulo’s scheme two different techniques are combined: the renaming of registers to a larger virtual set of registers and the buffering of source operands from the register file. Source operand buffering resolves WAR hazards that arise when the operand is available in the registers. As we will see later, it is also possible to eliminate WAR hazards by the renaming of a register together with the buffering of a result until no outstanding references to the earlier version of the register remain. This approach will be used when we discuss hardware speculation.

Tomasulo’s scheme is appealing if the designer is forced to pipeline an architecture for which it is difficult to schedule code or that has a shortage of registers. On the other hand, the advantages of the Tomasulo approach versus compiler scheduling for an efficient single-issue pipeline are probably fewer than the costs of implementation. But, as processors become more aggressive in their issue capability and designers are concerned with the performance of difficult-to-schedule code (such as most nonnumeric code), techniques such as register renaming and dynamic scheduling will become more important. Later in this chapter, we will see that they are one important component of most schemes for incorporating hardware speculation.

The key components for enhancing ILP in Tomasulo’s algorithm are dynamic scheduling, register renaming, and dynamic memory disambiguation. It is difficult to assess the value of these features independently. When we examine the studies of ILP in section 4.7, we will look at how these features affect the amount of parallelism discovered.

Corresponding to the dynamic hardware techniques for scheduling around data dependences are dynamic techniques for handling branches efficiently. These techniques are used for two purposes: to predict whether a branch will be taken and to find the target more quickly.
Hardware branch prediction, the name for these techniques, is the next topic we discuss.

Instruction status
Instruction        From iteration   Issue   Execute   Write result
LD    F0,0(R1)     1                √       √
MULTD F4,F0,F2     1                √
SD    0(R1),F4     1                √
LD    F0,0(R1)     2                √       √
MULTD F4,F0,F2     2                √
SD    0(R1),F4     2                √

Reservation stations
Name    Busy   Op     Vj   Vk         Qj      Qk
Add1    No
Add2    No
Add3    No
Mult1   Yes    MULT        Regs[F2]   Load1
Mult2   Yes    MULT        Regs[F2]   Load2

Register status
Field   F0      F2   F4      F6   F8   F10   F12   ...   F30
Qi      Load2        Mult2

Load buffers
Field     Load 1     Load 2       Load 3
Address   Regs[R1]   Regs[R1]-8
Busy      Yes        Yes          No

Store buffers
Field     Store 1    Store 2      Store 3
Qi        Mult1      Mult2
Busy      Yes        Yes          No
Address   Regs[R1]   Regs[R1]-8

FIGURE 4.12 Two active iterations of the loop with no instruction yet completed. Load and store buffers are included, with addresses to be loaded from and stored to. The loads are in the load buffer; entries in the multiplier reservation stations indicate that the outstanding loads are the sources. The store buffers indicate that the multiply destination is their value to store.

4.3 Reducing Branch Penalties with Dynamic Hardware Prediction

The previous section describes techniques for overcoming data hazards. The frequency of branches and jumps demands that we also attack the potential stalls arising from control dependences. Indeed, as the amount of ILP we attempt to exploit grows, control dependences rapidly become the limiting factor. Although schemes in this section are helpful in processors that try to maintain one instruction issue per clock, for two reasons they are crucial to any processor that tries to issue more than one instruction per clock. First, branches will arrive up to n times faster in an n-issue processor and providing an instruction stream will probably require that we predict the outcome of branches. Second, Amdahl’s Law reminds us that the relative impact of the control stalls will be larger with the lower potential CPI in such machines.
In the last chapter, we examined a variety of static schemes for dealing with branches; these schemes are static since the action taken does not depend on the dynamic behavior of the branch. We also examined the delayed branch scheme, which allows software to optimize the branch behavior by scheduling it at compile time. This section focuses on using hardware to dynamically predict the outcome of a branch; the prediction will change if the branch changes its behavior while the program is running.

We start with a simple branch prediction scheme and then examine approaches that increase the accuracy of our branch prediction mechanisms. After that, we look at more elaborate schemes that try to find the instruction following a branch even earlier. The goal of all these mechanisms is to allow the processor to resolve the outcome of a branch early, thus preventing control dependences from causing stalls.

The effectiveness of a branch prediction scheme depends not only on the accuracy, but also on the cost of a branch when the prediction is correct and when the prediction is incorrect. These branch penalties depend on the structure of the pipeline, the type of predictor, and the strategies used for recovering from misprediction. Later in this chapter we will look at some typical examples.

Basic Branch Prediction and Branch-Prediction Buffers

The simplest dynamic branch-prediction scheme is a branch-prediction buffer or branch history table. A branch-prediction buffer is a small memory indexed by the lower portion of the address of the branch instruction. The memory contains a bit that says whether the branch was recently taken or not. This is the simplest sort of buffer; it has no tags and is useful only to reduce the branch delay when it is longer than the time to compute the possible target PCs. We don’t know, in fact, if the prediction is correct: it may have been put there by another branch that has the same low-order address bits. But this doesn’t matter.
The prediction is a hint that is assumed to be correct, and fetching begins in the predicted direction. If the hint turns out to be wrong, the prediction bit is inverted and stored back. Of course, this buffer is effectively a cache where every access is a hit, and, as we will see, the performance of the buffer depends on both how often the prediction is for the branch of interest and how accurate the prediction is when it matches. We can use all the caching techniques to improve the accuracy of finding the prediction matching this branch, as we will see shortly. Before we do that, it is useful to make a small, but important, improvement in the accuracy of the branch prediction scheme.

This simple one-bit prediction scheme has a performance shortcoming: Even if a branch is almost always taken, we will likely predict incorrectly twice, rather than once, when it is not taken. The following example shows this.

EXAMPLE Consider a loop branch whose behavior is taken nine times in a row, then not taken once. What is the prediction accuracy for this branch, assuming the prediction bit for this branch remains in the prediction buffer?

ANSWER The steady-state prediction behavior will mispredict on the first and last loop iterations. Mispredicting the last iteration is inevitable since the prediction bit will say taken (the branch has been taken nine times in a row at that point). The misprediction on the first iteration happens because the bit is flipped on prior execution of the last iteration of the loop, since the branch was not taken on that iteration. Thus, the prediction accuracy for this branch that is taken 90% of the time is only 80% (two incorrect predictions and eight correct ones). In general, for branches used to form loops (a branch is taken many times in a row and then not taken once), a one-bit predictor will mispredict at twice the rate that the branch is not taken.
Ideally, the accuracy of the predictor would match the taken branch frequency for these highly regular branches.

To remedy this, two-bit prediction schemes are often used. In a two-bit scheme, a prediction must miss twice before it is changed. Figure 4.13 shows the finite-state machine for a two-bit prediction scheme.

The two-bit scheme is actually a specialization of a more general scheme that has an n-bit saturating counter for each entry in the prediction buffer. With an n-bit counter, the counter can take on values between 0 and $2^n - 1$: when the counter is greater than or equal to one half of its maximum value ($2^{n-1}$), the branch is predicted as taken; otherwise, it is predicted untaken. As in the two-bit scheme, the counter is incremented on a taken branch and decremented on an untaken branch. Studies of n-bit predictors have shown that the two-bit predictors do almost as well, and thus most systems rely on two-bit branch predictors rather than the more general n-bit predictors.

[Figure 4.13: state diagram with two "predict taken" states and two "predict not taken" states, connected by "taken" and "not taken" transitions.]

FIGURE 4.13 The states in a two-bit prediction scheme. By using two bits rather than one, a branch that strongly favors taken or not taken (as many branches do) will be mispredicted only once. The two bits are used to encode the four states in the system.

A branch-prediction buffer can be implemented as a small, special “cache” accessed with the instruction address during the IF pipe stage, or as a pair of bits attached to each block in the instruction cache and fetched with the instruction. If the instruction is decoded as a branch and if the branch is predicted as taken, fetching begins from the target as soon as the PC is known. Otherwise, sequential fetching and executing continue. If the prediction turns out to be wrong, the prediction bits are changed as shown in Figure 4.13.
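The loop-branch example can be checked with a short simulation of the n-bit saturating counter described above. This is a minimal sketch, not the hardware: the function name `steady_state_accuracy` and the warm-up parameter are hypothetical, the counter is assumed to start at 0, and the two-bit case follows the saturating-counter description, which may differ in detail from the state diagram of Figure 4.13.

```python
def steady_state_accuracy(pattern, n_bits, warmup_repeats=10):
    """Accuracy of one n-bit saturating counter on a repeating branch pattern.

    pattern is a list of booleans (True = taken). The pattern is replayed
    warmup_repeats times to reach steady state, then measured once.
    """
    max_value = (1 << n_bits) - 1      # counter range is 0 .. 2^n - 1
    threshold = 1 << (n_bits - 1)      # predict taken when counter >= 2^(n-1)
    counter = 0                        # assumption: counter starts at 0
    correct = 0
    for repeat in range(warmup_repeats + 1):
        measuring = repeat == warmup_repeats
        for taken in pattern:
            predicted_taken = counter >= threshold
            if measuring and predicted_taken == taken:
                correct += 1
            # Increment on taken, decrement on untaken, saturating at both ends.
            if taken:
                counter = min(counter + 1, max_value)
            else:
                counter = max(counter - 1, 0)
    return correct / len(pattern)


if __name__ == "__main__":
    loop_branch = [True] * 9 + [False]  # taken nine times, then not taken once
    for bits in (1, 2):
        print(bits, "bit(s):", steady_state_accuracy(loop_branch, bits))
```

With one bit the counter is simply the last outcome, so both the first and the last iteration of the loop are mispredicted; with two bits only the loop exit is.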
While this scheme is useful for most pipelines, the DLX pipeline finds out both whether the branch is taken and what the target of the branch is at roughly the same time, assuming no hazard in accessing the register specified in the conditional branch. (Remember that this is true for the DLX pipeline because the branch does a compare of a register against zero during the ID stage, which is when the effective address is also computed.) Thus, this scheme does not help for the simple DLX pipeline; we will explore a scheme that can work for DLX a little later. First, let’s see how well branch prediction works in general.

What kind of accuracy can be expected from a branch-prediction buffer using two bits per entry on real applications? For the SPEC89 benchmarks a branch-prediction buffer with 4096 entries results in a prediction accuracy ranging from over 99% to 82%, or a misprediction rate of 1% to 18%, as shown in Figure 4.14. To show the differences more clearly, we plot misprediction frequency rather than prediction frequency. A 4K-entry buffer, like that used for these results, is considered very large; smaller buffers would have worse results.

SPEC89 benchmark   Frequency of mispredictions
nasa7              1%
matrix300          0%
tomcatv            1%
doduc              5%
spice              9%
fpppp              9%
gcc                12%
espresso           5%
eqntott            18%
li                 10%

FIGURE 4.14 Prediction accuracy of a 4096-entry two-bit prediction buffer for the SPEC89 benchmarks. The misprediction rate for the integer benchmarks (gcc, espresso, eqntott, and li) is substantially higher (average of 11%) than that for the FP programs (average of 4%). Even omitting the FP kernels (nasa7, matrix300, and tomcatv) still yields a higher accuracy for the FP benchmarks than for the integer benchmarks.
These data, as well as the rest of the data in this section, are taken from a branch prediction study done using the IBM Power architecture and optimized code for that system. See Pan et al. [1992].

Knowing just the prediction accuracy, as shown in Figure 4.14, is not enough to determine the performance impact of branches, even given the branch costs and penalties for misprediction. We also need to take into account the branch frequency, since the importance of accurate prediction is larger in programs with higher branch frequency. For example, the integer programs (li, eqntott, espresso, and gcc) have higher branch frequencies than those of the more easily predicted FP programs.

As we try to exploit more ILP, the accuracy of our branch prediction becomes critical. As we can see in Figure 4.14, the accuracy of the predictors for integer programs, which typically also have higher branch frequencies, is lower than for the loop-intensive scientific programs. We can attack this problem in two ways: by increasing the size of the buffer and by increasing the accuracy of the scheme we use for each prediction. A buffer with 4K entries is already quite large and, as Figure 4.15 shows, performs quite comparably to an infinite buffer. The data in Figure 4.15 make it clear that the hit rate of the buffer is not the limiting factor.

Frequency of mispredictions
SPEC89 benchmark   4096 entries: 2 bits per entry   Unlimited entries: 2 bits per entry
nasa7              1%                               0%
matrix300          0%                               0%
tomcatv            1%                               0%
doduc              5%                               5%
spice              9%                               9%
fpppp              9%                               9%
gcc                12%                              11%
espresso           5%                               5%
eqntott            18%                              18%
li                 10%                              10%

FIGURE 4.15 Prediction accuracy of a 4096-entry two-bit prediction buffer versus an infinite buffer for the SPEC89 benchmarks.

As we mentioned above, increasing the number of bits per predictor also has little impact.
These two-bit predictor schemes use only the recent behavior of a branch to predict the future behavior of that branch. It may be possible to improve the prediction accuracy if we also look at the recent behavior of other branches rather than just the branch we are trying to predict. Consider a small code fragment from the SPEC92 benchmark eqntott (the worst case for the two-bit predictor):

    if (aa==2)
        aa=0;
    if (bb==2)
        bb=0;
    if (aa!=bb) {

Here is the DLX code that we would typically generate for this code fragment assuming that aa and bb are assigned to registers R1 and R2:

            SUBUI   R3,R1,#2
            BNEZ    R3,L1       ;branch b1 (aa!=2)
            ADD     R1,R0,R0    ;aa=0
    L1:     SUBUI   R3,R2,#2
            BNEZ    R3,L2       ;branch b2 (bb!=2)
            ADD     R2,R0,R0    ;bb=0
    L2:     SUBU    R3,R1,R2    ;R3=aa-bb
            BEQZ    R3,L3       ;branch b3 (aa==bb)

Let’s label these branches b1, b2, and b3. The key observation is that the behavior of branch b3 is correlated with the behavior of branches b1 and b2. Clearly, if branches b1 and b2 are both not taken (i.e., the if conditions both evaluate to true and aa and bb are both assigned 0), then b3 will be taken, since aa and bb are clearly equal. A predictor that uses only the behavior of a single branch to predict the outcome of that branch can never capture this behavior.

Branch predictors that use the behavior of other branches to make a prediction are called correlating predictors or two-level predictors. To see how such predictors work, let’s choose a simple hypothetical case. Consider the following simplified code fragment (chosen for illustrative purposes):

    if (d==0)
        d=1;
    if (d==1)

Here is the typical code sequence generated for this fragment, assuming that d is assigned to R1:

            BNEZ    R1,L1       ;branch b1 (d!=0)
            ADDI    R1,R0,#1    ;d==0, so d=1
    L1:     SUBUI   R3,R1,#1
            BNEZ    R3,L2       ;branch b2 (d!=1)
            ...
    L2:

The branches corresponding to the two if statements are labeled b1 and b2.
The possible execution sequences for an execution of this fragment, assuming d has values 0, 1, and 2, are shown in Figure 4.16. To illustrate how a correlating predictor works, assume the sequence above is executed repeatedly and ignore other branches in the program (including any branch needed to cause the above sequence to repeat).

Initial value of d   d==0?   b1          Value of d before b2   d==1?   b2
0                    Yes     Not taken   1                      Yes     Not taken
1                    No      Taken       1                      Yes     Not taken
2                    No      Taken       2                      No      Taken

FIGURE 4.16 Possible execution sequences for a code fragment.

From Figure 4.16, we see that if b1 is not taken, then b2 will be not taken. A correlating predictor can take advantage of this, but our standard predictor cannot. Rather than consider all possible branch paths, consider a sequence where d alternates between 2 and 0. A one-bit predictor initialized to not taken has the behavior shown in Figure 4.17. As the figure shows, all the branches are mispredicted!

d=?   b1 prediction   b1 action   New b1 prediction   b2 prediction   b2 action   New b2 prediction
2     NT              T           T                   NT              T           T
0     T               NT          NT                  T               NT          NT
2     NT              T           T                   NT              T           T
0     T               NT          NT                  T               NT          NT

FIGURE 4.17 Behavior of a one-bit predictor initialized to not taken. T stands for taken, NT for not taken.

Alternatively, consider a predictor that uses one bit of correlation. The easiest way to think of this is that every branch has two separate prediction bits: one prediction assuming the last branch executed was not taken and another prediction that is used if the last branch executed was taken. Note that, in general, the last branch executed is not the same instruction as the branch being predicted, though this can occur in simple loops consisting of a single basic block (since there are no other branches in the loops).
We write the pair of prediction bits together, with the first bit being the prediction if the last branch in the program is not taken and the second bit being the prediction if the last branch in the program is taken. The four possible combinations and the meanings are listed in Figure 4.18.

Prediction bits   Prediction if last branch not taken   Prediction if last branch taken
NT/NT             Not taken                             Not taken
NT/T              Not taken                             Taken
T/NT              Taken                                 Not taken
T/T               Taken                                 Taken

FIGURE 4.18 Combinations and meaning of the taken/not taken prediction bits. T stands for taken, NT for not taken.

The action of the one-bit predictor with one bit of correlation, when initialized to NT/NT, is shown in Figure 4.19.

d=?   b1 prediction   b1 action   New b1 prediction   b2 prediction   b2 action   New b2 prediction
2     [NT]/NT         T           T/NT                NT/[NT]         T           NT/T
0     T/[NT]          NT          T/NT                [NT]/T          NT          NT/T
2     [T]/NT          T           T/NT                NT/[T]          T           NT/T
0     T/[NT]          NT          T/NT                [NT]/T          NT          NT/T

FIGURE 4.19 The action of the one-bit predictor with one bit of correlation, initialized to not taken/not taken. T stands for taken, NT for not taken. The prediction used is shown in brackets.

In this case, the only misprediction is on the first iteration, when d = 2. The correct prediction of b1 is because of the choice of values for d, since b1 is not obviously correlated with the previous prediction of b2. The correct prediction of b2, however, shows the advantage of correlating predictors. Even if we had chosen different values for d, the predictor for b2 would correctly predict the case when b1 is not taken on every execution of b2 after one initial incorrect prediction.

The predictor in Figures 4.18 and 4.19 is called a (1,1) predictor since it uses the behavior of the last branch to choose from among a pair of one-bit branch predictors. In the general case an (m,n) predictor uses the behavior of the last m branches to choose from $2^m$ branch predictors, each of which is an n-bit predictor for a single branch.
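The behavior traced in Figure 4.19 can be reproduced with a small simulation of the (1,1) predictor on the b1/b2 fragment. This is a sketch under stated assumptions: the function name `run_fragment` is hypothetical, both prediction pairs start as NT/NT, and the one-bit global history is assumed to start as "not taken".

```python
def run_fragment(d_values):
    """Simulate branches b1 and b2 of the fragment with a (1,1) predictor.

    Each branch owns two one-bit predictions: index 0 is used when the last
    branch executed was not taken, index 1 when it was taken (True = taken).
    """
    predictors = {"b1": [False, False], "b2": [False, False]}  # NT/NT
    last_taken = False  # assumption: history starts as "not taken"
    mispredictions = []
    for step, d in enumerate(d_values):
        # b1 is BNEZ on d: taken when d != 0; the fall-through path sets d = 1.
        b1_taken = d != 0
        if not b1_taken:
            d = 1
        # b2 is BNEZ on d - 1: taken when d != 1.
        b2_taken = d != 1
        for name, taken in (("b1", b1_taken), ("b2", b2_taken)):
            slot = int(last_taken)            # global history selects the bit
            if predictors[name][slot] != taken:
                mispredictions.append((step, name))
                predictors[name][slot] = taken  # one-bit update on a miss
            last_taken = taken
    return mispredictions, predictors


if __name__ == "__main__":
    misses, final_state = run_fragment([2, 0, 2, 0])
    print("mispredictions:", misses)
    print("final prediction bits:", final_state)
```

For d alternating between 2 and 0, the only misses occur in the first pass through the fragment, and the prediction pairs settle at T/NT for b1 and NT/T for b2.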
The attraction of this type of correlating branch predictor is that it can yield higher prediction rates than the two-bit scheme and requires only a trivial amount of additional hardware. The simplicity of the hardware comes from a simple observation: The global history of the most recent m branches can be recorded in an m-bit shift register, where each bit records whether the branch was taken or not taken. The branch-prediction buffer can then be indexed using a concatenation of the low-order bits from the branch address with the m-bit global history. For example, Figure 4.20 shows a (2,2) predictor and how the prediction is accessed.

There is one subtle effect in this implementation. Because the prediction buffer is not a cache, the counters indexed by a single value of the global predictor may in fact correspond to different branches at some point in time. This is no different from our earlier observation that the prediction may not correspond to the current branch. In Figure 4.20 we draw the buffer as a two-dimensional object to ease understanding. In reality, the buffer can simply be implemented as a linear memory array that is two bits wide; the indexing is done by concatenating the global history bits and the number of required bits from the branch address. For the example in Figure 4.20, a (2,2) buffer with 64 total entries, the four low-order address bits of the branch (word address) and the two global bits form a six-bit index that can be used to index the 64 counters.

[Figure 4.20: four bits of the branch address select a row of two-bit per-branch predictors; the two-bit global branch history selects one predictor within the row to produce the prediction.]

FIGURE 4.20 A (2,2) branch-prediction buffer uses a two-bit global history to choose from among four predictors for each branch address. Each predictor is in turn a two-bit predictor for that particular branch.
The branch-prediction buffer shown here has a total of 64 entries; the branch address is used to choose four of these entries and the global history is used to choose one of the four. The two-bit global history can be implemented as a shift register that simply shifts in the behavior of a branch as soon as it is known.

How much better do the correlating branch predictors work when compared with the standard two-bit scheme? To compare them fairly, we must compare predictors that use the same number of state bits. The number of bits in an (m,n) predictor is

$2^m \times n \times \text{Number of prediction entries selected by the branch address}$

A two-bit predictor with no global history is simply a (0,2) predictor.

EXAMPLE How many bits are in the (0,2) branch predictor we examined earlier? How many bits are in the branch predictor shown in Figure 4.20?

ANSWER The earlier predictor had 4K entries selected by the branch address. Thus the total number of bits is $2^0 \times 2 \times 4\text{K} = 8\text{K}$. The predictor in Figure 4.20 has $2^2 \times 2 \times 16 = 128$ bits.

To compare the performance of a correlating predictor with that of our simple two-bit predictor examined in Figure 4.14, we need to determine how many entries we should assume for the correlating predictor.

EXAMPLE How many branch-selected entries are in a (2,2) predictor that has a total of 8K bits in the prediction buffer?

ANSWER We know that $2^2 \times 2 \times \text{Number of prediction entries selected by the branch} = 8\text{K}$. Hence Number of prediction entries selected by the branch = 1K.

Figure 4.21 compares the performance of the earlier two-bit simple predictor with 4K entries and a (2,2) predictor with 1K entries. As you can see, this predictor not only outperforms a simple two-bit predictor with the same total number of state bits, it often outperforms a two-bit predictor with an unlimited number of entries.
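The state-bit arithmetic and the indexing of an (m,n) predictor can be written out as a short sketch. The function names are hypothetical, and the order of concatenation in `buffer_index` (address bits in the high positions, history bits in the low positions) is an assumption; the text only says that the two fields are concatenated.

```python
def predictor_bits(m, n, branch_entries):
    """Total state bits of an (m,n) predictor: 2^m * n * branch-selected entries."""
    return (2 ** m) * n * branch_entries


def buffer_index(branch_word_address, global_history, m, address_bits):
    """Index into the linear counter array of an (m,n) predictor.

    Assumption: low-order address bits form the high part of the index and
    the m-bit global history forms the low part.
    """
    low_bits = branch_word_address & ((1 << address_bits) - 1)
    history = global_history & ((1 << m) - 1)
    return (low_bits << m) | history


if __name__ == "__main__":
    print(predictor_bits(0, 2, 4096))   # (0,2) predictor with 4K entries
    print(predictor_bits(2, 2, 16))     # the (2,2) predictor of Figure 4.20
    print(8 * 1024 // (2 ** 2 * 2))     # branch-selected entries for 8K bits
    print(buffer_index(0b1011, 0b10, m=2, address_bits=4))
```

With four address bits and two history bits the index ranges over 0 to 63, matching the 64 counters of the (2,2) buffer in the example.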
There is a wide spectrum of correlating predictors, with the (0,2) and (2,2) predictors being among the most interesting. The Exercises ask you to explore the performance of a third extreme: a predictor that does not rely on the branch address. For example, a (12,2) predictor that has a total of 8K bits does not use the branch address in indexing the predictor, but instead relies solely on the global branch history. Surprisingly, this degenerate case can outperform a noncorrelating two-bit predictor if enough global history is used and the table is large enough!

Further Reducing Control Stalls: Branch-Target Buffers

To reduce the branch penalty on DLX, we need to know from what address to fetch by the end of IF. This means we must know whether the as-yet-undecoded instruction is a branch and, if so, what the next PC should be. If the instruction is a branch and we know what the next PC should be, we can have a branch penalty of zero. A branch-prediction cache that stores the predicted address for the next instruction after a branch is called a branch-target buffer or branch-target cache.

Frequency of mispredictions
SPEC89 benchmark   4096 entries: 2 bits per entry   Unlimited entries: 2 bits per entry   1024 entries (2,2)
nasa7              1%                               0%                                    1%
matrix300          0%                               0%                                    0%
tomcatv            1%                               0%                                    1%
doduc              5%                               5%                                    5%
spice              9%                               9%                                    5%
fpppp              9%                               9%                                    5%
gcc                12%                              11%                                   11%
espresso           5%                               5%                                    4%
eqntott            18%                              18%                                   6%
li                 10%                              10%                                   5%

FIGURE 4.21 Comparison of two-bit predictors. A noncorrelating predictor with 4096 entries is first, followed by a noncorrelating two-bit predictor with unlimited entries and a two-bit predictor with two bits of global history and a total of 1024 entries.
For the standard DLX pipeline, a branch-prediction buffer is accessed during the ID cycle, so that at the end of ID we know the branch-target address (since it is computed during ID), the fall-through address (computed during IF), and the prediction. Thus, by the end of ID we know enough to fetch the next predicted instruction. For a branch-target buffer, we access the buffer during the IF stage using the instruction address of the fetched instruction, a possible branch, to index the buffer. If we get a hit, then we know the predicted instruction address at the end of the IF cycle, which is one cycle earlier than for a branch-prediction buffer.

Because we are predicting the next instruction address and will send it out before decoding the instruction, we must know whether the fetched instruction is predicted as a taken branch. Figure 4.22 shows what the branch-target buffer looks like. If the PC of the fetched instruction matches a PC in the buffer, then the corresponding predicted PC is used as the next PC. In Chapter 5 we will discuss caches in much more detail; we will see that the hardware for this branch-target buffer is essentially identical to the hardware for a cache.

[Figure 4.22: the PC of the instruction to fetch is looked up in a table whose entries hold a branch PC, a predicted PC, and an optional "branch predicted taken or untaken" field. No match: the instruction is not predicted to be a branch, proceed normally. Match: the instruction is a branch and the predicted PC should be used as the next PC.]

FIGURE 4.22 A branch-target buffer. The PC of the instruction being fetched is matched against a set of instruction addresses stored in the first column; these represent the addresses of known branches. If the PC matches one of these entries, then the instruction being fetched is a taken branch, and the second field, predicted PC, contains the prediction for the next PC after the branch. Fetching begins immediately at that address.
The third field, which is optional, may be used for extra prediction state bits.

If a matching entry is found in the branch-target buffer, fetching begins immediately at the predicted PC. Note that (unlike a branch-prediction buffer) the entry must be for this instruction, because the predicted PC will be sent out before it is known whether this instruction is even a branch. If we did not check whether the entry matched this PC, then the wrong PC would be sent out for instructions that were not branches, resulting in a slower processor. We only need to store the predicted-taken branches in the branch-target buffer, since an untaken branch follows the same strategy (fetch the next sequential instruction) as a nonbranch. Complications arise when we are using a two-bit predictor, since this requires that we store information for both taken and untaken branches. One way to resolve this is to use both a target buffer and a prediction buffer, which is the solution used by the PowerPC 620, the topic of section 4.8. We assume that the buffer only holds PC-relative conditional branches, since this makes the target address a constant; it is not hard to extend the mechanism to work with indirect branches.

Figure 4.23 shows the steps followed when using a branch-target buffer and where these steps occur in the pipeline. From this we can see that there will be no branch delay if a branch-prediction entry is found in the buffer and is correct. Otherwise, there will be a penalty of at least two clock cycles. In practice, this penalty could be larger, since the branch-target buffer must be updated. We could assume that the instruction following a branch or at the branch target is not a branch, and do the update during that instruction time; however, this does complicate the control. Instead, we will take a two-clock-cycle penalty when the branch is not correctly predicted or when we get a miss in the buffer.
Dealing with the mispredictions and misses is a significant challenge, since we typically will have to halt instruction fetch while we rewrite the buffer entry. Thus, we would like to make this process fast to minimize the penalty.

To evaluate how well a branch-target buffer works, we first must determine the penalties in all possible cases. Figure 4.24 contains this information.

EXAMPLE Determine the total branch penalty for a branch-target buffer assuming the penalty cycles for individual mispredictions from Figure 4.24. Make the following assumptions about the prediction accuracy and hit rate:

- prediction accuracy is 90%
- hit rate in the buffer is 90%

ANSWER Using a 60% taken branch frequency, this yields the following:

Branch penalty = Percent buffer hit rate × Percent incorrect predictions × 2 + (1 – Percent buffer hit rate) × Taken branches × 2
Branch penalty = (90% × 10% × 2) + (10% × 60% × 2)
Branch penalty = 0.18 + 0.12 = 0.30 clock cycles

This compares with a branch penalty for delayed branches, which we evaluated in section 3.5 of the last chapter, of about 0.5 clock cycles per branch. Remember, though, that the improvement from dynamic branch prediction will grow as the branch delay grows; in addition, better predictors will yield a larger performance advantage.

IF:  Send PC to memory and branch-target buffer. Entry found in branch-target buffer?
     No:  continue to ID.
          ID:  Is instruction a taken branch?
               No:  Normal instruction execution.
               Yes: (EX) Enter branch instruction address and next PC into branch-target buffer.
     Yes: Send out predicted PC.
          ID/EX:  Taken branch?
               No:  Mispredicted branch, kill fetched instruction; restart fetch at other target; delete entry from target buffer.
               Yes: Branch correctly predicted; continue execution with no stalls.

FIGURE 4.23 The steps involved in handling an instruction with a branch-target buffer.
If the PC of an instruction is found in the buffer, then the instruction must be a branch that is predicted taken; thus, fetching immediately begins from the predicted PC in ID. If the entry is not found and it subsequently turns out to be a taken branch, it is entered in the buffer along with the target, which is known at the end of ID. If the entry is found, but the instruction turns out not to be a taken branch, it is removed from the buffer. If the instruction is a branch, is found, and is correctly predicted, then execution proceeds with no delays. If the prediction is incorrect, we suffer a one-clock-cycle delay fetching the wrong instruction and restart the fetch one clock cycle later, leading to a total mispredict penalty of two clock cycles. If the branch is not found in the buffer and the instruction turns out to be a branch, we will have proceeded as if the instruction were not a branch, which amounts to an assume-not-taken strategy. The penalty will differ depending on whether the branch is actually taken or not.

Instruction in buffer   Prediction   Actual branch   Penalty cycles
Yes                     Taken        Taken           0
Yes                     Taken        Not taken       2
No                                   Taken           2
No                                   Not taken       0

FIGURE 4.24 Penalties for all possible combinations of whether the branch is in the buffer and what it actually does, assuming we store only taken branches in the buffer. There is no branch penalty if everything is correctly predicted and the branch is found in the target buffer. If the branch is not correctly predicted, the penalty is equal to one clock cycle to update the buffer with the correct information (during which an instruction cannot be fetched) and one clock cycle, if needed, to restart fetching the next correct instruction for the branch. If the branch is not found and taken, a two-cycle penalty is encountered, during which time the buffer is updated.
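The branch-penalty calculation from the example above, using the penalties of Figure 4.24, can be expressed as a small function. This is a sketch only: the function and parameter names are hypothetical, and the two-cycle defaults simply restate the penalties assumed in the text.

```python
def btb_branch_penalty(hit_rate, prediction_accuracy, taken_fraction,
                       mispredict_cycles=2, miss_taken_cycles=2):
    """Average branch penalty in clock cycles for a branch-target buffer.

    Buffer hits pay mispredict_cycles when the prediction is wrong; buffer
    misses pay miss_taken_cycles only when the branch turns out to be taken.
    """
    hit_penalty = hit_rate * (1 - prediction_accuracy) * mispredict_cycles
    miss_penalty = (1 - hit_rate) * taken_fraction * miss_taken_cycles
    return hit_penalty + miss_penalty


if __name__ == "__main__":
    # Values from the example: 90% hit rate, 90% accuracy, 60% taken branches.
    penalty = btb_branch_penalty(0.90, 0.90, 0.60)
    print(f"{penalty:.2f} clock cycles per branch")
```

Varying `mispredict_cycles` shows the point made in the example: the advantage of accurate prediction grows as the branch delay grows.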
One variation on the branch-target buffer is to store one or more target instructions instead of, or in addition to, the predicted target address. This variation has two potential advantages. First, it allows the branch-target buffer access to take longer than the time between successive instruction fetches. This could allow a larger branch-target buffer. Second, buffering the actual target instructions allows us to perform an optimization called branch folding. Branch folding can be used to obtain zero-cycle unconditional branches, and sometimes zero-cycle conditional branches. Consider a branch-target buffer that buffers instructions from the predicted path and is being accessed with the address of an unconditional branch. The only function of the unconditional branch is to change the PC. Thus, when the branch-target buffer signals a hit and indicates that the branch is unconditional, the pipeline can simply substitute the instruction from the branch-target buffer in place of the instruction that is returned from the cache (which is the unconditional branch). If the processor is issuing multiple instructions per cycle, then the buffer will need to supply multiple instructions to obtain the maximum benefit. In some cases, it may be possible to eliminate the cost of a conditional branch when the condition codes are preset; we will see how this scheme can be used in the IBM PowerPC processor in the Putting It All Together section.

Another method that designers have studied and are including in the most recent processors is a technique for predicting indirect jumps, that is, jumps whose destination address varies at runtime. While high-level language programs will generate such jumps for indirect procedure calls, select or case statements, and FORTRAN-computed gotos, the vast majority of the indirect jumps come from procedure returns. For example, for the SPEC benchmarks procedure returns account for 85% of the indirect jumps on average.
Thus, focusing on procedure returns seems appropriate. Though procedure returns can be predicted with a branch-target buffer, the accuracy of such a prediction technique can be low if the procedure is called from multiple sites and the calls from one site are not clustered in time. To overcome this problem, the concept of a small buffer of return addresses operating as a stack has been proposed. This structure caches the most recent return addresses: pushing a return address on the stack at a call and popping one off at a return. If the cache is sufficiently large (i.e., as large as the maximum call depth), it will predict the returns perfectly. Figure 4.25 shows the performance of such a return buffer with 1–16 elements for a number of the SPEC benchmarks. We will use this type of return predictor when we examine the studies of ILP in section 4.7.

4.3 Reducing Branch Penalties with Dynamic Hardware Prediction 277

Branch prediction schemes are limited both by prediction accuracy and by the penalty for misprediction. As we have seen, typical prediction schemes achieve prediction accuracy in the range of 80–95% depending on the type of program and the size of the buffer. In addition to trying to increase the accuracy of the predictor, we can try to reduce the penalty for misprediction. This is done by fetching from both the predicted and unpredicted direction. This requires that the memory system be dual-ported, have an interleaved cache, or fetch from one path and then the other. While this adds cost to the system, it may be the only way to reduce branch penalties below a certain point. Caching addresses or instructions from multiple paths in the target buffer is another alternative that some processors have used.

[Figure 4.25 plot: misprediction rate (vertical axis, up to 50%) versus number of entries in the return stack (horizontal axis: 1, 2, 4, 8, 16) for gcc, fpppp, espresso, doduc, li, and tomcatv.]

FIGURE 4.25 Prediction accuracy for a return address buffer operated as a stack.
The accuracy is the fraction of return addresses predicted correctly. Since call depths are typically not large, with some exceptions, a modest buffer works well. On average, returns account for 81% of the indirect jumps in these six benchmarks.

We have seen a variety of software-based static schemes and hardware-based dynamic schemes for trying to boost the performance of our pipelined processor. These schemes attack both the data dependences (discussed in the previous subsections) and the control dependences (discussed in this subsection). Our focus to date has been on sustaining the throughput of the pipeline at one instruction per clock. In the next section we will look at techniques that attempt to exploit more parallelism by issuing multiple instructions in a clock cycle.

4.4 Taking Advantage of More ILP with Multiple Issue

Processors are being produced with the potential for very many parallel operations on the instruction level. ...Far greater extremes in instruction-level parallelism are on the horizon.
J. Fisher [1981], in the paper that inaugurated the term “instruction-level parallelism”

The techniques of the previous two sections can be used to eliminate data and control stalls and achieve an ideal CPI of 1. To improve performance further we would like to decrease the CPI to less than one. But the CPI cannot be reduced below one if we issue only one instruction every clock cycle. The goal of the multiple-issue processors discussed in this section is to allow multiple instructions to issue in a clock cycle. Multiple-issue processors come in two flavors: superscalar processors and VLIW (very long instruction word) processors. Superscalar processors issue varying numbers of instructions per clock and may be either statically scheduled by the compiler or dynamically scheduled using techniques based on scoreboarding and Tomasulo’s algorithm.
In this section, we examine simple versions of both a statically scheduled superscalar and a dynamically scheduled superscalar. VLIWs, in contrast, issue a fixed number of instructions formatted either as one large instruction or as a fixed instruction packet. VLIW processors are inherently statically scheduled by the compiler. Section 4.5 explores compiler technology useful for scheduling both VLIWs and superscalars.

To explain and compare the techniques in this section we will assume the pipeline latencies we used earlier in section 4.1 (Figure 4.2) and the same example code segment, which adds a scalar to an array in memory:

    Loop: LD    F0,0(R1)    ;F0=array element
          ADDD  F4,F0,F2    ;add scalar in F2
          SD    0(R1),F4    ;store result
          SUBI  R1,R1,#8    ;decrement pointer
                            ;8 bytes (per DW)
          BNEZ  R1,Loop     ;branch R1!=zero

We begin by looking at a simple superscalar processor.

A Superscalar Version of DLX

In a typical superscalar processor, the hardware might issue from one to eight instructions in a clock cycle. Usually, these instructions must be independent and will have to satisfy some constraints, such as no more than one memory reference issued per clock. If some instruction in the instruction stream is dependent or doesn’t meet the issue criteria, only the instructions preceding that one in sequence will be issued, hence the variability in issue rate. In contrast, in VLIWs, the compiler has complete responsibility for creating a package of instructions that can be simultaneously issued, and the hardware does not dynamically make any decisions about multiple issue. Thus, we say that a superscalar processor has dynamic issue capability, while a VLIW processor has static issue capability. Superscalar processors may also be statically or dynamically scheduled; for now, we assume static scheduling, but we will explore the use of dynamic scheduling in conjunction with speculation in section 4.6.
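The variable issue rate of a superscalar, where only the instructions ahead of the first one that fails the issue criteria are issued, can be illustrated with a short sketch. This is a simplified model rather than any particular processor's issue logic: the `Instr` record, the `issue_count` function, and the single memory-reference limit as a parameter are assumptions made for the example, and hazards against instructions issued in earlier cycles are ignored.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Instr:
    op: str
    dest: Optional[str]        # register written, or None
    srcs: Tuple[str, ...]      # registers read
    is_mem: bool = False       # True for loads and stores


def issue_count(window: Sequence[Instr], max_mem_refs: int = 1) -> int:
    """Number of instructions at the head of `window` that issue together."""
    written = set()            # registers written by instructions issued this cycle
    mem_refs = 0
    issued = 0
    for ins in window:
        depends = any(s in written for s in ins.srcs) or ins.dest in written
        too_many_mem = ins.is_mem and mem_refs >= max_mem_refs
        if depends or too_many_mem:
            break              # in-order issue: this and all later instructions wait
        if ins.dest is not None:
            written.add(ins.dest)
        mem_refs += ins.is_mem
        issued += 1
    return issued
```

For the example loop, a window starting at `LD F0,0(R1)` issues only the load, because the following `ADDD` reads `F0`; a window starting at `SUBI` followed by an independent `ADDD` issues both.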
What would the DLX processor look like as a superscalar? Let’s assume two instructions can be issued per clock cycle. One of the instructions can be a load, store, branch, or integer ALU operation, and the other can be any floating-point operation. As we will see, issue of an integer operation in parallel with a floating-point operation is much simpler and less demanding than arbitrary dual issue. This configuration is, in fact, very close to the organization used in the HP 7100 processor.

Issuing two instructions per cycle will require fetching and decoding 64 bits of instructions. To keep the decoding simple, we could require that the instructions be paired and aligned on a 64-bit boundary, with the integer portion appearing first. The alternative is to examine the instructions and possibly swap them when they are sent to the integer or FP datapath; however, this introduces additional requirements for hazard detection. In either case, the second instruction can be issued only if the first instruction can be issued. Remember that the hardware makes this decision dynamically, issuing only the first instruction if the conditions are not met. Figure 4.26 shows how the instructions look as they go into the pipeline in pairs. This table does not address how the floating-point operations extend the EX cycle, but it is no different in the superscalar case than it was for the ordinary DLX pipeline; the concepts of section 3.7 apply directly.

With this pipeline, we have substantially boosted the rate at which we can issue floating-point instructions. To make this worthwhile, however, we need either pipelined floating-point units or multiple independent units. Otherwise, the floating-point datapath will quickly become the bottleneck, and the advantages gained by dual issue will be small.
Instruction type       Pipe stages
Integer instruction    IF   ID   EX   MEM  WB
FP instruction         IF   ID   EX   MEM  WB
Integer instruction         IF   ID   EX   MEM  WB
FP instruction              IF   ID   EX   MEM  WB
Integer instruction              IF   ID   EX   MEM  WB
FP instruction                   IF   ID   EX   MEM  WB
Integer instruction                   IF   ID   EX   MEM  WB
FP instruction                        IF   ID   EX   MEM  WB

FIGURE 4.26 Superscalar pipeline in operation. The integer and floating-point instructions are issued at the same time, and each executes at its own pace through the pipeline. This scheme will only improve the performance of programs with a fair fraction of floating-point operations.

By issuing an integer and a floating-point operation in parallel, the need for additional hardware, beyond the usual hazard detection logic, is minimized: integer and floating-point operations use different register sets and different functional units on load-store architectures. Furthermore, enforcing the issue restriction as a structural hazard (which it is, since only specific pairs of instructions can issue) requires only looking at the opcodes. The only difficulties that arise are when the integer instruction is a floating-point load, store, or move. This creates contention for the floating-point register ports and may also create a new RAW hazard when the floating-point operation that could be issued in the same clock cycle is dependent on the first instruction of the pair.

The register port problem could be solved by requiring the FP loads and stores to issue by themselves. This solution treats the case of an FP load, store, or move that is paired with an FP operation as a structural hazard. This is easy to implement, but it has substantial performance drawbacks. This hazard could instead be eliminated by providing two additional ports, a read and a write, on the floating-point register file.
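The issue check for this two-issue DLX can be sketched in a few lines. This is a hedged illustration, not the actual control logic: the opcode sets list only a hypothetical subset of DLX instructions, registers are classified by their leading letter, and the `extra_fp_ports` flag stands for the choice between treating an FP load or store on the integer side as a structural hazard and adding the two extra register-file ports.

```python
from typing import Optional, Tuple

# (opcode, destination register or None, source registers)
Op = Tuple[str, Optional[str], Tuple[str, ...]]

INT_SIDE = {"LD", "SD", "ADDI", "SUBI", "BNEZ"}   # hypothetical opcode subset
FP_SIDE = {"ADDD", "SUBD", "MULTD", "DIVD"}       # hypothetical opcode subset


def touches_fp(regs) -> bool:
    """True if any named register belongs to the FP register file."""
    return any(r is not None and r.startswith("F") for r in regs)


def can_dual_issue(first: Op, second: Op, extra_fp_ports: bool = False) -> bool:
    """Decide whether an aligned (integer, FP) pair may issue in one cycle."""
    op1, dest1, srcs1 = first
    op2, _dest2, srcs2 = second
    # Issue restriction: checked from the opcodes alone.
    if op1 not in INT_SIDE or op2 not in FP_SIDE:
        return False
    # An FP load or store on the integer side competes for FP register ports.
    if touches_fp((dest1,) + tuple(srcs1)):
        if not extra_fp_ports:
            return False      # treated as a structural hazard
        if dest1 is not None and dest1 in srcs2:
            return False      # RAW: the FP operation needs the value being loaded
    return True
```

With the extra ports assumed, `LD F6,-8(R1)` pairs with `ADDD F4,F0,F2`, while `LD F0,0(R1)` does not, since the add reads `F0`.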
When the fetched instruction pair consists of an FP load and an FP operation that is dependent on it, we must detect the hazard and avoid issuing the FP operation. Except for this case, other possible hazards are essentially the same as for our single-issue pipeline. We will, however, need some additional bypass paths to prevent unnecessary stalls.

There is another difficulty that may limit the effectiveness of a superscalar pipeline. In our simple DLX pipeline, loads had a latency of one clock cycle, which prevented one instruction from using the result without stalling. In the superscalar pipeline, the result of a load instruction cannot be used on the same clock cycle or on the next clock cycle. This means that the next three instructions cannot use the load result without stalling. The branch delay also becomes three instructions, since a branch must be the first instruction of a pair. To effectively exploit the parallelism available in a superscalar processor, more ambitious compiler or hardware scheduling techniques, as well as more complex instruction decoding, will be needed.

Let’s see how well loop unrolling and scheduling work on a superscalar version of DLX with the delays in clock cycles from Figure 4.2 on page 224.

EXAMPLE Below is the loop we unrolled and scheduled earlier in section 4.1. How would it be scheduled on a superscalar pipeline for DLX?

    Loop: LD    F0,0(R1)    ;F0=array element
          ADDD  F4,F0,F2    ;add scalar in F2
          SD    0(R1),F4    ;store result
          SUBI  R1,R1,#8    ;decrement pointer
                            ;8 bytes (per DW)
          BNEZ  R1,Loop     ;branch R1!=zero

ANSWER To schedule it without any delays, we will need to unroll the loop to make five copies of the body. After unrolling, the loop will contain five each of LD, ADDD, and SD; one SUBI; and one BNEZ. The unrolled and scheduled code is shown in Figure 4.27.
        Integer instruction    FP instruction      Clock cycle
Loop:   LD   F0,0(R1)                              1
        LD   F6,-8(R1)                             2
        LD   F10,-16(R1)       ADDD F4,F0,F2       3
        LD   F14,-24(R1)       ADDD F8,F6,F2       4
        LD   F18,-32(R1)       ADDD F12,F10,F2     5
        SD   0(R1),F4          ADDD F16,F14,F2     6
        SD   -8(R1),F8         ADDD F20,F18,F2     7
        SD   -16(R1),F12                           8
        SUBI R1,R1,#40                             9
        SD   16(R1),F16                            10
        BNEZ R1,Loop                               11
        SD   8(R1),F20                             12

FIGURE 4.27 The unrolled and scheduled code as it would look on a superscalar DLX.

This unrolled superscalar loop now runs in 12 clock cycles per iteration, or 2.4 clock cycles per element, versus 3.5 for the scheduled and unrolled loop on the ordinary DLX pipeline. In this example, the performance of the superscalar DLX is limited by the balance between integer and floating-point computation. Every floating-point instruction is issued together with an integer instruction, but there are not enough floating-point instructions to keep the floating-point pipeline full. When scheduled, the original loop ran in 6 clock cycles per iteration. We have improved on that by a factor of 2.5, more than half of which came from loop unrolling. Loop unrolling took us from 6 to 3.5 (a factor of 1.7), while superscalar execution gave us a factor of 1.5 improvement.

Ideally, our superscalar processor will pick up two instructions and issue them both if the first is an integer and the second is a floating-point instruction. If they do not fit this pattern, which can be quickly detected, then they are issued sequentially. This points to two of the major advantages of a superscalar processor over a VLIW processor. First, there is little impact on code density, since the processor detects whether the next instruction can issue, and we do not need to lay out the instructions to match the issue capability. Second, even unscheduled programs, or those compiled for older implementations, can be run. Of course, such programs may not run well; one way to overcome this is to use dynamic scheduling.
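The speedup figures quoted for the unrolled superscalar loop of Figure 4.27 can be checked with a few lines of arithmetic. The variable and function names below are illustrative only; the cycle counts are the ones given in the text.

```python
def cycles_per_element(cycles_per_iteration: float, elements: int) -> float:
    """Clock cycles spent per array element for one loop iteration."""
    return cycles_per_iteration / elements


scheduled = cycles_per_element(6, 1)       # scheduled original loop
unrolled = 3.5                             # unrolled and scheduled, single issue (from the text)
superscalar = cycles_per_element(12, 5)    # Figure 4.27: 12 cycles for 5 elements

total_speedup = scheduled / superscalar    # about 2.5
from_unrolling = scheduled / unrolled      # about 1.7
from_dual_issue = unrolled / superscalar   # about 1.46, quoted as 1.5
```

The two partial factors multiply to the total, which is why the text can attribute more than half of the improvement to unrolling.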
Multiple Instruction Issue with Dynamic Scheduling

Multiple instruction issue can also be applied to dynamically scheduled processors. We could start with either the scoreboard scheme or Tomasulo’s algorithm. Let’s assume we want to extend Tomasulo’s algorithm to support issuing two instructions per clock cycle, one integer and one floating point. We do not want to issue instructions to the reservation stations out of order, since this makes the bookkeeping extremely complex. Rather, by employing separate data structures for the integer and floating-point registers, we can simultaneously issue a floating-point instruction and an integer instruction to their respective reservation stations, as long as the two issued instructions do not access the same register set. Unfortunately, this approach bars issuing two instructions with a dependence in the same clock cycle, such as a floating-point load (an integer instruction) and a floating-point add. Of course, we cannot execute these two instructions in the same clock, but we would like to issue them to the reservation stations where they will later be serialized.

In the superscalar processor of the previous section, the compiler is responsible for finding independent instructions to issue. If a hardware-scheduling scheme cannot find a way to issue two dependent instructions in the same clock, there will be little advantage to a hardware-scheduled scheme versus a compiler-based scheme.

Luckily, there are two approaches that can be used to achieve dual issue. The first assumes that the register renaming portion of instruction-issue logic can be made to run in one-half of a clock. This permits two instructions to be processed in one clock cycle, so that they can begin executing on the same clock cycle.
The second approach is based on the observation that with the issue restrictions assumed, it will only be FP loads and moves from the GP to the FP registers that will create dependences among instructions that we can issue together. If we had a more complex set of issue capabilities, there would be additional possible dependences that we would need to handle.

The need for reservation tables for loads and moves can be eliminated by using queues for the result of a load or a move. Queues can also be used to allow stores to issue early and wait for their operands, just as they did in Tomasulo’s algorithm. Since dynamic scheduling is most effective for data moves, while static scheduling is highly effective in register-register code sequences, we could use static scheduling to eliminate reservation stations completely and rely only on the queues for loads and stores. This style of processor organization, where the load-store units have queues to allow slippage with respect to other functional units, has been called a decoupled architecture. Several machines have used variations on this idea.

A processor that dynamically schedules loads and stores may cause loads and stores to be reordered. This may result in violating a data dependence through memory and thus requires some detection hardware for this potential hazard. We can detect such hazards with the same scheme we used for the single-issue version of Tomasulo’s algorithm: We dynamically check whether the memory source address specified by a load is the same as the target address of an outstanding, uncompleted store. If there is such a match, we can stall the load instruction until the store completes. Since the address of the store has already been computed and resides in the store buffer, we can use an associative check (possibly with only a subset of the address bits) to determine whether a load conflicts with a store in the buffer.
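The load-versus-store check just described can be sketched as a small software model. This is only an illustration of the idea: the `StoreBuffer` class, its method names, and the `compare_bits` parameter are hypothetical, and real hardware would perform the comparison associatively in parallel rather than by scanning a list.

```python
from typing import List, Optional


class StoreBuffer:
    """Addresses of outstanding, uncompleted stores."""

    def __init__(self, compare_bits: Optional[int] = None) -> None:
        self.pending: List[int] = []
        # Compare all address bits by default, or only the low `compare_bits`.
        self.mask = -1 if compare_bits is None else (1 << compare_bits) - 1

    def add_store(self, address: int) -> None:
        self.pending.append(address)

    def complete_store(self, address: int) -> None:
        self.pending.remove(address)

    def load_must_stall(self, address: int) -> bool:
        """True if the load may conflict with a store still in the buffer."""
        return any((a & self.mask) == (address & self.mask)
                   for a in self.pending)
```

Comparing only a subset of the bits is conservative: it can stall a load that does not really conflict, but it never misses a true match.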
There is also the possibility of WAW and WAR hazards through memory, which must be prevented, although they are much less likely than a true data dependence. (In contrast to these dynamic techniques for detecting memory dependences, we will discuss compiler-based approaches in the next section.)

For simplicity, let us assume that we have pipelined the instruction issue logic so that we can issue two operations that are dependent but use different functional units. Let’s see how this would work with the same code sequence we used earlier.

EXAMPLE Consider the execution of our simple loop on a DLX pipeline extended with Tomasulo’s algorithm and with multiple issue. Assume that both a floating-point and an integer operation can be issued on every clock cycle, even if they are related, provided the integer instruction is the first instruction. Assume one integer functional unit and a separate FP functional unit for each operation type. The number of cycles of latency per instruction is the same. Assume that issue and write results take one cycle each and that there is dynamic branch-prediction hardware. Create a table showing when each instruction issues, begins execution, and writes its result to the CDB for the first two iterations of the loop. Here is the original loop:

    Loop: LD    F0,0(R1)
          ADDD  F4,F0,F2
          SD    0(R1),F4
          SUBI  R1,R1,#8
          BNEZ  R1,Loop

ANSWER The loop will be dynamically unwound and, whenever possible, instructions will be issued in pairs. The result is shown in Figure 4.28. The loop runs in 4 clock cycles per result, assuming no stalls are required on loop exit.
Iteration   Instructions      Issues at     Executes at   Memory access at   Writes result at
number                        clock-cycle   clock-cycle   clock-cycle        clock-cycle
                              number        number        number             number
1           LD F0,0(R1)       1             2             3                  3
1           ADDD F4,F0,F2     1             4                                6
1           SD 0(R1),F4       2             3             7
1           SUBI R1,R1,#8     3             4                                5
1           BNEZ R1,Loop      4             5
2           LD F0,0(R1)       5             6             8                  8
2           ADDD F4,F0,F2     5             9                                11
2           SD 0(R1),F4       6             7             12
2           SUBI R1,R1,#8     7             8                                9
2           BNEZ R1,Loop      8             9

FIGURE 4.28 The time of issue, execution, and writing result for a dual-issue version of our Tomasulo pipeline. The write-result stage does not apply to either stores or branches, since they do not write any registers. We assume a result is written to the CDB at the end of the clock cycle it is available in. This also assumes a wider CDB. For LD and SD, the execution is effective address calculation. We assume one memory pipeline.

The number of dual issues is small because there is only one floating-point operation per iteration. The relative number of dual-issued instructions would be helped by the compiler partially unwinding the loop to reduce the instruction count by eliminating loop overhead. With that transformation, the loop would run as fast as scheduled code on a superscalar processor. We will return to this transformation in the Exercises. Alternatively, if the processor were “wider,” that is, could issue more integer operations per cycle, larger improvements would be possible.

The VLIW Approach

With a VLIW we can reduce the amount of hardware needed to implement a multiple-issue processor, and the potential savings in hardware increases as we increase the issue width. For example, our two-issue superscalar processor requires that we examine the opcodes of two instructions and the six register specifiers and that we dynamically determine whether one or two instructions can issue and dispatch them to the appropriate functional units.
Although the hardware required for a two-issue processor is modest and we could extend the mechanisms to handle three or four instructions (or more if the issue restrictions were chosen carefully), it becomes increasingly difficult to determine whether a significant number of instructions can all issue simultaneously without knowing both the order of the instructions before they are fetched and what dependencies might exist among them.

An alternative is an LIW (long instruction word) or VLIW (very long instruction word) architecture. VLIWs use multiple, independent functional units. Rather than attempting to issue multiple, independent instructions to the units, a VLIW packages the multiple operations into one very long instruction, hence the name. Since the burden for choosing the instructions to be issued simultaneously falls on the compiler, the hardware in a superscalar to make these issue decisions is unneeded. Since this advantage of a VLIW increases as the maximum issue rate grows, we focus on a wider-issue processor.

A VLIW instruction might include two integer operations, two floating-point operations, two memory references, and a branch. An instruction would have a set of fields for each functional unit, perhaps 16 to 24 bits per unit, yielding an instruction length of between 112 and 168 bits.

To keep the functional units busy, there must be enough parallelism in a straight-line code sequence to fill the available operation slots. This parallelism is uncovered by unrolling loops and scheduling code across basic blocks using a global scheduling technique. In addition to eliminating branches by unrolling loops, global scheduling techniques allow the movement of instructions across branch points. In the next section, we will discuss trace scheduling, one of these techniques developed specifically for VLIWs; the references also provide pointers to other approaches.
For now, let’s assume we have a technique to generate long, straight-line code sequences for building up VLIW instructions and examine how well these processors operate.

EXAMPLE Suppose we have a VLIW that could issue two memory references, two FP operations, and one integer operation or branch in every clock cycle. Show an unrolled version of the array sum loop for such a processor. Unroll as many times as necessary to eliminate any stalls. Ignore the branch-delay slot.

ANSWER The code is shown in Figure 4.29. The loop has been unrolled to make seven copies of the body, which eliminates all stalls (i.e., completely empty issue cycles), and runs in 9 cycles. This yields a running rate of seven results in 9 cycles, or 1.29 cycles per result.

Limitations in Multiple-Issue Processors

What are the limitations of a multiple-issue approach? If we can issue five operations per clock cycle, why not 50? The difficulty in expanding the issue rate comes from three areas:

1. Inherent limitations of ILP in programs
2. Difficulties in building the underlying hardware
3. Limitations specific to either a superscalar or VLIW implementation.

Memory reference 1   Memory reference 2   FP operation 1     FP operation 2     Integer operation/branch
LD F0,0(R1)          LD F6,-8(R1)
LD F10,-16(R1)       LD F14,-24(R1)
LD F18,-32(R1)       LD F22,-40(R1)       ADDD F4,F0,F2      ADDD F8,F6,F2
LD F26,-48(R1)                            ADDD F12,F10,F2    ADDD F16,F14,F2
                                          ADDD F20,F18,F2    ADDD F24,F22,F2
SD 0(R1),F4          SD -8(R1),F8         ADDD F28,F26,F2
SD -16(R1),F12       SD -24(R1),F16
SD -32(R1),F20       SD -40(R1),F24                                             SUBI R1,R1,#56
SD 8(R1),F28                                                                    BNEZ R1,Loop

FIGURE 4.29 VLIW instructions that occupy the inner loop and replace the unrolled sequence. This code takes nine cycles assuming no branch delay; normally the branch delay would also need to be scheduled. The issue rate is 23 operations in nine clock cycles, or 2.5 operations per cycle.
The efficiency, the percentage of available slots that contained an operation, is about 60%. To achieve this issue rate requires a larger number of registers than DLX would normally use in this loop. The VLIW code sequence above requires at least eight FP registers, while the same code sequence for the base DLX processor can use as few as two FP registers or as many as five when unrolled and scheduled. In the superscalar example in Figure 4.27, six registers were needed. Limits on available ILP are the simplest and most fundamental. For example, in a statically scheduled processor, unless loops are unrolled very many times, there may not be enough operations to fill the available instruction issue slots. At first glance, it might appear that five instructions that could execute in parallel would be sufficient to keep our example VLIW completely busy. This, however, is not the case. Several of these functional units—the memory, the branch, and the floating-point units—will be pipelined and have a multicycle latency, requiring a larger number of operations that can execute in parallel to prevent stalls. For example, if the floating-point pipeline has a latency of five clocks, and if we want to schedule both FP pipelines without stalling, there must be 10 FP operations that are independent of the most recently issued FP operation. In general, we need to find a number of independent operations roughly equal to the average pipeline depth times the number of functional units. This means that roughly 15 to 20 operations could be needed to keep a multiple-issue processor with five functional units busy. The second cost, the hardware resources for a multiple-issue processor, arises from the hardware needed both to issue and to execute multiple instructions per cycle. The hardware for executing multiple operations per cycle seems quite straightforward: duplicating the floating-point and integer functional units is easy and cost scales linearly. 
However, there is a large increase in the memory bandwidth and register-file bandwidth. For example, even with a split floating-point and integer register file, our VLIW processor will require six read ports (two for each load-store and two for the integer part) and three write ports (one for each non-FP unit) on the integer register file and six read ports (one for each load-store and two for each FP) and four write ports (one for each load-store or FP) on the floating-point register file. This bandwidth cannot be supported without an increase in the silicon area of the register file and possible degradation of clock speed. Our five-unit VLIW also has two data memory ports, which are substantially more expensive than register ports. If we wanted to expand the number of issues further, we would need to continue adding memory ports. Adding only arithmetic units would not help, since the processor would be starved for memory bandwidth.

As the number of data memory ports grows, so does the complexity of the memory system. To allow multiple memory accesses in parallel, we could break the memory into banks containing different addresses with the hope that the operations in a single instruction do not have conflicting accesses, or the memory may be truly dual-ported, which is substantially more expensive. Yet another approach is used in the IBM Power-2 design: The memory is accessed twice per clock cycle, but even with an aggressive memory system, this approach may be too slow for a high-speed processor. These memory system alternatives are discussed in more detail in the next chapter. The complexity and access time penalties of a multiported memory hierarchy are probably the most serious hardware limitations faced by any type of multiple-issue processor, whether VLIW or superscalar.

The hardware needed to support instruction issue varies significantly depending on the multiple-issue approach.
At one end of the spectrum are the dynamically scheduled superscalar processors that have a substantial amount of hardware involved in implementing either scoreboarding or Tomasulo’s algorithm. In addition to the silicon that such mechanisms consume, dynamic scheduling substantially complicates the design, making it more difficult to achieve high clock rates, as well as significantly increasing the task of verifying the design. At the other end of the spectrum are VLIWs, which require little or no additional hardware for instruction issue and scheduling, since that function is handled completely by the compiler. Between these two extremes lie most existing superscalar processors, which use a combination of static scheduling by the compiler with the hardware making the decision of how many of the next n instructions to issue. Depending on what restrictions are made on the order of instructions and what types of dependences must be detected among the issue candidates, statically scheduled superscalars will have issue logic either closer to that of a VLIW or more like that of a dynamically scheduled processor. Much of the challenge in designing multiple-issue processors lies in assessing the costs and performance advantages of a wide spectrum of possible hardware mechanisms versus the compiler-driven alternatives.

Finally, there are problems that are specific to either the superscalar or VLIW model. We have already discussed the major challenge for a superscalar processor, namely the instruction issue logic. For the VLIW model, there are both technical and logistical problems. The technical problems are the increase in code size and the limitations of lock-step operation. Two different elements combine to increase code size substantially for a VLIW. First, generating enough operations in a straight-line code fragment requires ambitiously unrolling loops, which increases code size.
Second, whenever instructions are not full, the unused functional units translate to wasted bits in the instruction encoding. In Figure 4.29, we saw that only about 60% of the functional units were used, so almost half of each instruction was empty. To combat this problem, clever encodings are sometimes used. For example, there may be only one large immediate field for use by any functional unit. Another technique is to compress the instructions in main memory and expand them when they are read into the cache or are decoded. Because a VLIW is statically scheduled and operates lock-step, a stall in any functional unit pipeline must cause the entire processor to stall, since all the functional units must be kept synchronized. While we may be able to schedule the deterministic functional units to prevent stalls, predicting which data accesses will encounter a cache stall and scheduling them is very difficult. Hence, a cache miss must cause the entire processor to stall. As the issue rate and number of memory references becomes large, this lock-step structure makes it difficult to effectively use a data cache, thereby increasing memory complexity and latency. Binary code compatibility is the major logistical problem for VLIWs. This problem exists within a generation of processors, even though the processors may implement the same basic instructions. The problem is that different numbers of issues and functional unit latencies require different versions of the code. Thus, migrating between successive implementations or even between implementations with different issue widths is more difficult than it may be for a superscalar design. Of course, obtaining improved performance from a new superscalar design may require recompilation. Nonetheless, the ability to run old binary files is a practical advantage for the superscalar approach. One possible solution to this problem, and the problem of binary code compatibility in general, is object-code translation or emulation. 
This technology is developing quickly and could play a significant role in future migration schemes.

The major challenge for all multiple-issue processors is to try to exploit large amounts of ILP. When the parallelism comes from unrolling simple loops in FP programs, the original loop probably could have been run efficiently on a vector processor (described in Appendix B). It is not clear that a multiple-issue processor is preferred over a vector processor for such applications; the costs are similar, and the vector processor is typically the same speed or faster. The potential advantages of a multiple-issue processor versus a vector processor are twofold. First, a multiple-issue processor has the potential to extract some amount of parallelism from less regularly structured code, and, second, it has the ability to use a less expensive memory system. For these reasons it appears clear that multiple-issue approaches will be the primary method for taking advantage of instruction-level parallelism, and vectors will primarily be an extension to these processors.

4.5 Compiler Support for Exploiting ILP

In this section we discuss compiler technology for increasing the amount of parallelism that we can exploit in a program. We begin by examining techniques to detect dependences and eliminate name dependences.

Detecting and Eliminating Dependences

Finding the dependences in a program is an important part of three tasks: (1) good scheduling of code, (2) determining which loops might contain parallelism, and (3) eliminating name dependences. The complexity of dependence analysis arises because of the presence of arrays and pointers in languages like C. Since scalar variable references explicitly refer to a name, they can usually be analyzed quite easily, with aliasing because of pointers and reference parameters causing some complications and uncertainty in the analysis.
Our analysis needs to find all dependences and determine whether there is a cycle in the dependences, since that is what prevents us from running the loop in parallel. Consider the following example:

    for (i=1;i<=100;i=i+1) {
        A[i] = B[i] + C[i];
        D[i] = A[i] * E[i];
    }

Because the dependence involving A is not loop-carried, we can unroll the loop and find parallelism; we just cannot exchange the two references to A. If a loop has loop-carried dependences but no circular dependences (recall the Example in Section 4.1), we can transform the loop to eliminate the dependence and then unrolling will uncover parallelism. In many parallel loops the amount of parallelism is limited only by the number of unrollings, which is limited only by the number of loop iterations. Of course, in practice, to take advantage of that much parallelism would require many functional units and possibly an enormous number of registers. The absence of a loop-carried dependence simply tells us that we have a large amount of parallelism available.

The code fragment above illustrates another opportunity for improvement. The second reference to A need not be translated to a load instruction, since we know that the value is computed and stored by the previous statement; hence, the second reference to A can simply be a reference to the register into which A was computed. Performing this optimization requires knowing that the two references are always to the same memory address and that there is no intervening access to the same location. Normally, data dependence analysis only tells that one reference may depend on another; a more complex analysis is required to determine that two references must be to the exact same address. In the example above, a simple version of this analysis suffices, since the two references are in the same basic block.
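The load-elimination opportunity just described can be sketched in C. This is only an illustration under stated assumptions: the function names add_mul_naive and add_mul_forwarded, the temporary t, and the zero-based indexing are not from the text, and a real compiler performs this transformation on its intermediate code rather than on the source.

```c
#include <stddef.h>

/* Direct transcription of the loop body: the second statement
   rereads A[i] from memory right after the first statement stores it. */
void add_mul_naive(double *A, double *D, const double *B,
                   const double *C, const double *E, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        A[i] = B[i] + C[i];
        D[i] = A[i] * E[i];
    }
}

/* Same computation with the value of A[i] held in a local variable
   (standing in for a register), so the second reference to A needs
   no load. This is valid only because both references are known to
   name the same address with no intervening write to it. */
void add_mul_forwarded(double *A, double *D, const double *B,
                       const double *C, const double *E, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        double t = B[i] + C[i];   /* value computed for A[i] */
        A[i] = t;                 /* the store is still required */
        D[i] = t * E[i];          /* reuse t instead of reloading A[i] */
    }
}
```

Both functions leave the same values in A and D; the second simply makes explicit the register reuse that the analysis permits.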
Often loop-carried dependences are in the form of a recurrence:

    for (i=2;i<=100;i=i+1) {
        Y[i] = Y[i-1] + Y[i];
    }

A recurrence is when a variable is defined based on the value of that variable in an earlier iteration, often the one immediately preceding, as in the above fragment. Detecting a recurrence can be important for two reasons: Some architectures (especially vector computers) have special support for executing recurrences, and some recurrences can be the source of a reasonable amount of parallelism. To see how the latter can be true, consider this loop:

    for (i=6;i<=100;i=i+1) {
        Y[i] = Y[i-5] + Y[i];
    }

On the iteration i, the loop references element i - 5. The loop is said to have a dependence distance of 5. Many loops with carried dependences have a dependence distance of 1. The larger the distance, the more potential parallelism can be obtained by unrolling the loop. For example, if we unroll the first loop, with a dependence distance of 1, successive statements are dependent on one another; there is still some parallelism among the individual instructions, but not much. If we unroll the loop that has a dependence distance of 5, there is a sequence of five instructions that have no dependences, and thus much more ILP. Although many loops with loop-carried dependences have a dependence distance of 1, cases with larger distances do arise, and the longer distance may well provide enough parallelism to keep a processor busy.

How does the compiler detect dependences in general? Nearly all dependence analysis algorithms work on the assumption that array indices are affine. In simplest terms, a one-dimensional array index is affine if it can be written in the form a × i + b, where a and b are constants, and i is the loop index variable. The index of a multidimensional array is affine if the index in each dimension is affine.
Determining whether there is a dependence between two references to the same array in a loop is thus equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop. For example, suppose we have stored to an array element with index value a × i + b and loaded from the same array with index value c × i + d, where i is the for-loop index variable that runs from m to n. A dependence exists if two conditions hold:

1. There are two iteration indices, j and k, both within the limits of the for loop. That is, m ≤ j ≤ n, m ≤ k ≤ n.

2. The loop stores into an array element indexed by a × j + b and later fetches from that same array element when it is indexed by c × k + d. That is, a × j + b = c × k + d.

In general, we cannot determine whether a dependence exists at compile time. For example, the values of a, b, c, and d may not be known (they could be values in other arrays), making it impossible to tell if a dependence exists. In other cases, the dependence testing may be very expensive but decidable at compile time. For example, the accesses may depend on the iteration indices of multiple nested loops. Many programs, however, contain primarily simple indices where a, b, c, and d are all constants. For these cases, it is possible to devise reasonable compile-time tests for dependence.

As an example, a simple and sufficient test for the absence of a dependence is the greatest common divisor, or GCD, test. It is based on the observation that if a loop-carried dependence exists, then GCD(c,a) must divide (d - b). (Remember that an integer, x, divides another integer, y, if there is no remainder when we do the division y/x and get an integer quotient.)
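The GCD test and the two dependence conditions above can be sketched in C as follows. This is a minimal sketch, not a production dependence analyzer: the function names gcd, gcd_test_may_depend, and exact_depend are illustrative assumptions, all coefficients are taken to be small known integers, and the exhaustive check is practical only for small loop bounds.

```c
#include <stdlib.h>

/* Greatest common divisor of |x| and |y| (Euclid's algorithm). */
static int gcd(int x, int y)
{
    x = abs(x);
    y = abs(y);
    while (y != 0) {
        int r = x % y;
        x = y;
        y = r;
    }
    return x;
}

/* GCD test for a store to X[a*i+b] and a load from X[c*i+d].
   Returns 0 when a dependence is ruled out, 1 when one may exist. */
int gcd_test_may_depend(int a, int b, int c, int d)
{
    int g = gcd(a, c);
    if (g == 0)               /* a == c == 0: the indices are just b and d */
        return b == d;
    return (d - b) % g == 0;  /* a dependence requires g to divide d - b */
}

/* Literal check of the two conditions in the text: is there any pair
   j, k with m <= j, k <= n and a*j + b == c*k + d? */
int exact_depend(int a, int b, int c, int d, int m, int n)
{
    for (int j = m; j <= n; j++)
        for (int k = m; k <= n; k++)
            if (a * j + b == c * k + d)
                return 1;
    return 0;
}
```

With a = 2, b = 3, c = 2, d = 0, gcd_test_may_depend returns 0, so no dependence is possible. With a = 1, b = 200, c = 1, d = 0 and bounds 1 to 100, the GCD test returns 1 while exact_depend returns 0: the test ignores the loop bounds, so it can report a possible dependence that never occurs.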
EXAMPLE Use the GCD test to determine whether dependences exist in the following loop:

    for (i=1; i<=100; i=i+1) {
        X[2*i+3] = X[2*i] * 5.0;
    }

ANSWER Given the values a = 2, b = 3, c = 2, and d = 0, then GCD(a,c) = 2, and d - b = -3. Since 2 does not divide -3, no dependence is possible.

The GCD test is sufficient to guarantee that no dependence exists (you can show this in the Exercises); however, there are cases where the GCD test succeeds but no dependence exists. This can arise, for example, because the GCD test does not take the loop bounds into account.

In general, determining whether a dependence actually exists is NP-complete. In practice, however, many common cases can be analyzed precisely at low cost. Recently, approaches using a hierarchy of exact tests increasing in generality and cost have been shown to be both accurate and efficient. (A test is exact if it precisely determines whether a dependence exists. Although the general case is NP-complete, there exist exact tests for restricted situations that are much cheaper.)

In addition to detecting the presence of a dependence, a compiler wants to classify the types of dependence. This allows a compiler to recognize name dependences and eliminate them at compile time by renaming and copying.

EXAMPLE The following loop has multiple types of dependences. Find all the true dependences, output dependences, and antidependences, and eliminate the output dependences and antidependences by renaming.

    for (i=1; i<=100; i=i+1) {
        Y[i] = X[i] / c; /*S1*/
        X[i] = X[i] + c; /*S2*/
        Z[i] = Y[i] + c; /*S3*/
        Y[i] = c - Y[i]; /*S4*/
    }

ANSWER The following dependences exist among the four statements:

1. There are true dependences from S1 to S3 and from S1 to S4 because of Y[i]. These are not loop carried, so they do not prevent the loop from being considered parallel. These dependences will force S3 and S4 to wait for S1 to complete.

2. There is an antidependence from S1 to S2, based on X[i].

3. There is an antidependence from S3 to S4 for Y[i].

4. There is an output dependence from S1 to S4, based on Y[i].

The following version of the loop eliminates these false (or pseudo) dependences.

    for (i=1; i<=100; i=i+1) {
        /* Y renamed to T to remove output dependence */
        T[i] = X[i] / c;
        /* X renamed to X1 to remove antidependence */
        X1[i] = X[i] + c;
        /* Y renamed to T to remove antidependence */
        Z[i] = T[i] + c;
        Y[i] = c - T[i];
    }

After the loop the variable X has been renamed X1. In code that follows the loop, the compiler can simply replace the name X by X1. In this case, renaming does not require an actual copy operation but can be done by substituting names or by register allocation. In other cases, however, renaming will require copying.

Dependence analysis is a critical technology for exploiting parallelism. At the instruction level it provides information needed to interchange memory references when scheduling, as well as to determine the benefits of unrolling a loop. For detecting loop-level parallelism, dependence analysis is the basic tool. Effectively compiling programs to either vector computers or multiprocessors depends critically on this analysis. In addition, it is useful in scheduling instructions to determine whether memory references are potentially dependent. The major drawback of dependence analysis is that it applies only under a limited set of circumstances, namely among references within a single loop nest and using affine index functions.
Thus, there are a wide variety of situations in which dependence analysis cannot tell us what we might want to know, including

- when objects are referenced via pointers rather than array indices;

- when array indexing is indirect through another array, which happens with many representations of sparse arrays;

- when a dependence may exist for some value of the inputs, but does not exist in actuality when the code is run since the inputs never take on certain values;

- when an optimization depends on knowing more than just the possibility of a dependence, but needs to know on which write of a variable a read of that variable depends.

The rapid progress in dependence analysis algorithms has led us to a situation where we are often limited by the lack of applicability of the analysis rather than a shortcoming in dependence analysis per se.

Software Pipelining: Symbolic Loop Unrolling

We have already seen that one compiler technique, loop unrolling, is useful to uncover parallelism among instructions by creating longer sequences of straight-line code. There are two other important techniques that have been developed for this purpose: software pipelining and trace scheduling.

Software pipelining is a technique for reorganizing loops such that each iteration in the software-pipelined code is made from instructions chosen from different iterations of the original loop. This is most easily understood by looking at the scheduled code for the superscalar version of DLX, which appeared in Figure 4.27 on page 281. The scheduler in this example essentially interleaves instructions from different loop iterations, so as to separate the dependent instructions that occur within a single loop iteration. A software-pipelined loop interleaves instructions from different iterations without unrolling the loop, as illustrated in Figure 4.30. This technique is the software counterpart to what Tomasulo’s algorithm does in hardware.
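The idea can be sketched at the source level in C for a loop that adds a scalar to every element of an array. This is a hand-written illustration under stated assumptions, not compiler output: the names add_scalar, add_scalar_swp, loaded, and sum are hypothetical, the array is traversed upward, and each steady-state pass performs the store for element i, the add for element i+1, and the load for element i+2.

```c
#include <stddef.h>

/* Original loop: load, add, and store for each element in turn. */
void add_scalar(double *x, size_t n, double s)
{
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] + s;
}

/* Software-pipelined form of the same loop. */
void add_scalar_swp(double *x, size_t n, double s)
{
    if (n < 2) {                  /* too short to fill the pipeline */
        add_scalar(x, n, s);
        return;
    }

    /* Start-up code: loads for elements 0 and 1, add for element 0. */
    double loaded = x[0];
    double sum = loaded + s;
    loaded = x[1];

    size_t i;
    for (i = 0; i + 2 < n; i++) {
        x[i] = sum;               /* store for element i   */
        sum = loaded + s;         /* add for element i+1   */
        loaded = x[i + 2];        /* load for element i+2  */
    }

    /* Finish-up code: here i == n - 2. */
    x[i] = sum;                   /* store for element n-2 */
    x[i + 1] = loaded + s;        /* add and store for element n-1 */
}
```

Within one pass of the pipelined loop, the store, the add, and the load belong to three different original iterations, so none of them waits on a result produced in the same pass.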
The software-pipelined loop for the earlier example would contain one load, one add, and one store, each from a different iteration. There is also some start-up code that is needed before the loop begins as well as code to finish up after the loop is completed. We will ignore these in this discussion, for simplicity; the topic is addressed in the Exercises.

FIGURE 4.30 (diagram omitted; labels: Iteration 0 through Iteration 4, software-pipelined iteration) A software-pipelined loop chooses instructions from different loop iterations, thus separating the dependent instructions within one iteration of the original loop. The start-up and finish-up code will correspond to the portions above and below the software-pipelined iteration.

EXAMPLE Show a software-pipelined version of this loop, which increments all the elements of an array whose starting address is in R1 by the contents of F2:

    Loop: LD   F0,0(R1)
          ADDD F4,F0,F2
          SD   0(R1),F4
          SUBI R1,R1,#8
          BNEZ R1,Loop

You may omit the start-up and clean-up code.

ANSWER Software pipelining symbolically unrolls the loop and then selects instructions from each iteration. Since the unrolling is symbolic, the loop overhead instructions (the SUBI and BNEZ) need not be replicated.
Here’s the body of the unrolled loop without overhead instructions, marking the instructions taken from each iteration:

    Iteration i:    LD   F0,0(R1)
                    ADDD F4,F0,F2
                    SD   0(R1),F4    ; selected
    Iteration i+1:  LD   F0,0(R1)
                    ADDD F4,F0,F2    ; selected
                    SD   0(R1),F4
    Iteration i+2:  LD   F0,0(R1)    ; selected
                    ADDD F4,F0,F2
                    SD   0(R1),F4

The selected instructions are then put together in the loop with the loop control instructions:

    Loop: SD   16(R1),F4    ;stores into M[i]
          ADDD F4,F0,F2     ;adds to M[i-1]
          LD   F0,0(R1)     ;loads M[i-2]
          SUBI R1,R1,#8
          BNEZ R1,Loop

This loop can be run at a rate of 5 cycles per result, ignoring the start-up and clean-up portions, and assuming that SUBI is scheduled after the ADDD and the LD instruction, with an adjusted offset, is placed in the branch delay slot. Because the load and store are separated by offsets of 16 (two iterations), the loop should run for two fewer iterations. (We address this and the start-up and clean-up portions in Exercise 4.18.) Notice that the reuse of registers (e.g., F4, F0, and R1) requires the hardware to avoid the WAR hazards that would occur in the loop. This should not be a problem in this case, since no data-dependent stalls should occur.

By looking at the unrolled version we can see what the start-up code and finish code will need to be. For start-up, we will need to execute any instructions that correspond to iteration 1 and 2 that will not be executed. These instructions are the LD for iterations 1 and 2 and the ADDD for iteration 1. For the finish code, we need to execute any instructions that will not be executed in the final two iterations. These include the ADDD for the last iteration and the SD for the last two iterations.

Register management in software-pipelined loops can be tricky. The example above is not too hard since the registers that are written on one loop iteration are read on the next.
In other cases, we may need to increase the number of iterations between when we issue an instruction and when the result is used. This occurs when there are a small number of instructions in the loop body and the latencies are large. In such cases, a combination of software pipelining and loop unrolling is needed. An example of this is shown in the Exercises.

Software pipelining can be thought of as symbolic loop unrolling. Indeed, some of the algorithms for software pipelining use loop-unrolling algorithms to figure out how to software pipeline the loop. The major advantage of software pipelining over straight loop unrolling is that software pipelining consumes less code space. Software pipelining and loop unrolling, in addition to yielding a better scheduled inner loop, each reduce a different type of overhead. Loop unrolling reduces the overhead of the loop, that is, the branch and counter-update code. Software pipelining reduces the time when the loop is not running at peak speed to once per loop at the beginning and end. If we unroll a loop that does 100 iterations a constant number of times, say 4, we pay the overhead 100/4 = 25 times, once every time the inner unrolled loop is initiated. Figure 4.31 shows this behavior graphically. Because these techniques attack two different types of overhead, the best performance can come from doing both.

FIGURE 4.31 (plots omitted; both panels show the number of overlapped operations against time; panel (a), software pipelining, marks the start-up code and wind-down code; panel (b), loop unrolling, marks the overlap between unrolled iterations, proportional to the number of unrolls) The execution pattern for (a) a software-pipelined loop and (b) an unrolled loop. The shaded areas are the times when the loop is not running with maximum overlap or parallelism among instructions. This occurs once at the beginning and once at the end for the software-pipelined loop. For the unrolled loop it occurs m/n times if the loop has a total of m iterations and is unrolled n times. Each block represents an unroll of n iterations. Increasing the number of unrollings will reduce the start-up and clean-up overhead. The overhead of one iteration overlaps with the overhead of the next, thereby reducing the impact. The total area under the polygonal region in each case will be the same, since the total number of operations is just the execution rate multiplied by the time.

Trace Scheduling: Using Critical Path Scheduling

The other technique used to generate additional parallelism is trace scheduling. Trace scheduling extends loop unrolling with a technique for finding parallelism across conditional branches other than loop branches. Trace scheduling is useful for processors with a very large number of issues per clock where loop unrolling may not be sufficient by itself to uncover enough ILP to keep the processor busy. Trace scheduling is a combination of two separate processes. The first process, called trace selection, tries to find a likely sequence of basic blocks whose operations will be put together into a smaller number of instructions; this sequence is called a trace. Loop unrolling is used to generate long traces, since loop branches are taken with high probability. Additionally, by using static branch prediction, other conditional branches are also chosen as taken or not taken, so that the resultant trace is a straight-line sequence resulting from concatenating many basic blocks. Once a trace is selected, the second process, called trace compaction, tries to squeeze the trace into a small number of wide instructions. Trace compaction attempts to move operations as early as it can in a sequence (trace), packing the operations into as few wide instructions (or issue packets) as possible.
Trace compaction is global code scheduling, where we want to compact the code into the shortest possible sequence that preserves the data and control dependences. The data dependences force a partial order on operations, while the control dependences dictate instructions across which code cannot be easily moved. Data dependences are overcome by unrolling and using dependence analysis to determine if two references refer to the same address. Control dependences are also reduced by unrolling. The major advantage of trace scheduling over simpler pipeline-scheduling techniques is that it provides a scheme for reducing the effect of control dependences by moving code across conditional nonloop branches using the predicted behavior of the branch. While such movements cannot guarantee speedup, if the prediction information is accurate, the compiler can determine whether such code movement is likely to lead to faster code.

Figure 4.32 shows a code fragment, which may be thought of as an iteration of an unrolled loop, and the trace selected.

FIGURE 4.32 (flowchart omitted: the block A[i] = A[i] + B[i] leads to the test A[i] = 0?; the true (T) branch leads to the assignment to B[i], the false (F) branch leads to X; both paths join at the assignment to C[i]) A code fragment and the trace selected shaded with gray. This trace would be selected first if the probability of the true branch being taken were much higher than the probability of the false branch being taken. The branch from the decision (A[i]=0) to X is a branch out of the trace, and the branch from X to the assignment to C is a branch into the trace. These branches are what make compacting the trace difficult.

Once the trace is selected as shown in Figure 4.32, it must be compacted so as to fill the processor’s resources. Compacting the trace involves moving the assignments to variables B and C up to the block before the branch decision. The movement of the code associated with B is speculative: it will speed the computation up only when the path containing the code would be taken.
Any global scheduling scheme, including trace scheduling, performs such movement under a set of constraints. In trace scheduling, branches are viewed as jumps into or out of the selected trace, which is assumed to be the most probable path. When code is moved across such trace entry and exit points, additional bookkeeping code may be needed on the entry or exit point. The key assumption is that the selected trace is the most probable event; otherwise, the cost of the bookkeeping code may be excessive. This movement of code alters the control dependences, and the bookkeeping code is needed to maintain the correct dynamic data dependence. In the case of moving the code associated with C, the bookkeeping costs are the only cost, since C is executed independent of the branch. For a code movement that is speculative, like that associated with B, we must not introduce any new exceptions. Compilers avoid changing the exception behavior by not moving certain classes of instructions, such as memory references, that can cause exceptions. In the next section, we will see how hardware support can ease the process of speculative code motion as well as remove control dependences.

What is involved in moving the assignments to B and C? The computation of and assignment to B is control-dependent on the branch, while the computation of C is not. Moving these statements can only be done if they either do not change the control and data dependences or if the effect of the change is not visible and thus does not affect program execution. To see what’s involved, let’s look at a typical code generation sequence for the flowchart in Figure 4.32. Assuming that the addresses for A, B, C are in R1, R2, and R3, respectively, here is such a sequence:

              LW    R4,0(R1)      ; load A
              LW    R5,0(R2)      ; load B
              ADD   R4,R4,R5      ; Add to A
              SW    0(R1),R4      ; Store A
              ...
              BNEZ  R4,elsepart   ; Test A
              ...                 ; then part
              SW    0(R2),...     ; Stores to B
              j     join          ; jump over else
    elsepart: ...                 ; else part
              X                   ; code for X
              ...
    join:     ...                 ; after if
              SW    0(R3),...     ; store C[i]

Let’s first consider the problem of moving the assignment to B to before the BNEZ instruction. Since B is control-dependent on that branch before it is moved but not after, we must ensure the execution of the statement cannot cause any exception, since that exception would not have been raised in the original program if the else part of the statement were selected. The movement of B must also not affect the data flow, since that will result in changing the value computed.

Moving B will change the data flow of the program, if B is referenced before it is assigned either in X or after the if statement. In either case moving the assignment to B will cause some instruction, i (either in X or later in the program), to become data-dependent on the moved version of the assignment to B rather than on an earlier assignment to B that occurs before the loop and on which i originally depended. One could imagine more clever schemes to allow B to be moved even when the value is used: for example, in the first case, we could make a shadow copy of B before the if statement and use that shadow copy in X. Such schemes are generally not used, both because they are complex to implement and because they will slow down the program if the trace selected is not optimal and the operations end up requiring additional instructions to execute.

Moving the assignment to C up to before the first branch requires two steps. First, the assignment is moved over the join point of the else part into the trace (a trace entry) in the portion corresponding to the then part. This makes the instructions for C control-dependent on the branch and means that they will not execute if the else path, which is not on the trace, is chosen. Hence, instructions that were data-dependent on the assignment to C, and which execute after this code fragment, will be affected.
To ensure the correct value is computed for such instructions, a copy is made of the instructions that compute and assign to C on the branch into the trace, that is, at the end of X on the else path. Second, we can move C from the then case of the branch across the branch condition, if it does not affect any data flow into the branch condition. If C is moved to before the if test, the copy of C in the else branch can be eliminated, since it will be redundant.

Loop unrolling, software pipelining, and trace scheduling all aim at trying to increase the amount of ILP that can be exploited by a processor issuing more than one instruction on every clock cycle. The effectiveness of each of these techniques and their suitability for various architectural approaches are among the hottest topics being actively pursued by researchers and designers of high-speed processors.

4.6 Hardware Support for Extracting More Parallelism

Techniques such as loop unrolling, software pipelining, and trace scheduling can be used to increase the amount of parallelism available when the behavior of branches is fairly predictable at compile time. When the behavior of branches is not well known, compiler techniques alone may not be able to uncover much ILP. This section introduces several techniques that can help overcome such limitations. The first is an extension of the instruction set to include conditional or predicated instructions. Such instructions can be used to eliminate branches and to assist in allowing the compiler to move instructions past branches. As we will see, conditional or predicated instructions enhance the amount of ILP, but still have significant limitations. To exploit more parallelism, designers have explored an idea called speculation, which allows the execution of an instruction before the processor knows that the instruction should execute (i.e., it avoids control dependence stalls).
We discuss two different approaches to speculation. The first is static speculation performed by the compiler with hardware support. In such schemes, the compiler chooses to make an instruction speculative and the hardware helps by making it easier to ignore the outcome of an incorrectly speculated instruction. Conditional instructions can also be used to perform limited speculation. Speculation can also be done dynamically by the hardware using branch prediction to guide the speculation process; such schemes are the subject of the third portion of this section.

Conditional or Predicated Instructions

The concept behind conditional instructions is quite simple: An instruction refers to a condition, which is evaluated as part of the instruction execution. If the condition is true, the instruction is executed normally; if the condition is false, the execution continues as if the instruction were a no-op. Many newer architectures include some form of conditional instructions. The most common example of such an instruction is conditional move, which moves a value from one register to another if the condition is true. Such an instruction can be used to completely eliminate the branch in simple code sequences.

EXAMPLE Consider the following code:

    if (A==0) {S=T;}

Assuming that registers R1, R2, and R3 hold the values of A, S, and T, respectively, show the code for this statement with the branch and with the conditional move.

ANSWER The straightforward code using a branch for this statement is (remember that we are assuming normal rather than delayed branches)

        BNEZ R1,L
        MOV  R2,R3
    L:

Using a conditional move that performs the move only if the third operand is equal to zero, we can implement this statement in one instruction:

        CMOVZ R2,R3,R1

The conditional instruction allows us to convert the control dependence present in the branch-based code sequence to a data dependence.
(This transformation is also used for vector computers, where it is called if-conversion.) For a pipelined processor, this moves the place where the dependence must be resolved from near the front of the pipeline, where it is resolved for branches, to the end of the pipeline where the register write occurs.

One use for conditional move is to implement the absolute value function: A = abs (B), which is implemented as if (B<0) {A=-B;} else {A=B;}. This if statement can be implemented as a pair of conditional moves, or as one unconditional move (A = B) and one conditional move (A = -B).

In the example above or in the compilation of absolute value, conditional moves are used to change a control dependence into a data dependence. This enables us to eliminate the branch and possibly improve the pipeline behavior. Conditional instructions can also be used to improve scheduling in superscalar or VLIW processors by the use of speculation. A conditional instruction can be used to speculatively move an instruction that is time-critical.

EXAMPLE Here is a code sequence for a two-issue superscalar that can issue a combination of one memory reference and one ALU operation, or a branch by itself, every cycle:

    First instruction slot     Second instruction slot
    LW   R1,40(R2)             ADD  R3,R4,R5
                               ADD  R6,R3,R7
    BEQZ R10,L
    LW   R8,20(R10)
    LW   R9,0(R8)

This sequence wastes a memory operation slot in the second cycle and will incur a data dependence stall if the branch is not taken, since the second LW after the branch depends on the prior load. Show how the code can be improved using a conditional form of LW.

ANSWER Call the conditional version load word LWC and assume the load occurs unless the third operand is 0.
The LW immediately following the branch can be converted to a LWC and moved up to the second issue slot: 302 Chapter 4 Advanced Pipelining and Instruction-Level Parallelism First instruction slot Second instruction slot LW R1,40(R2) ADD R3,R4,R5 LWC R8,20(R10),R10 ADD R6,R3,R7 BEQZ R10,L LW R9,0(R8) This improves the execution time by several cycles since it eliminates one instruction issue slot and reduces the pipeline stall for the last instruction in the sequence. Of course, if the compiler mispredicts the branch, the conditional instruction will have no effect and will not improve the running time. This is why the transformation is speculative. s To use a conditional instruction successfully in examples like this, we must ensure that the speculated instruction does not introduce an exception. Thus the semantics of the conditional instruction must define the instruction to have no effect if the condition is not satisfied. This means that the instruction cannot write the result destination nor cause any exceptions if the condition is not satisfied. The property of not causing exceptions is quite critical, as the Example above shows: If register R10 contains zero, the instruction LW R8,20(R10) executed unconditionally is likely to cause a protection exception, and this exception should not occur. It is this property that prevents a compiler from simply moving the load of R8 across the branch. Of course, if the condition is satisfied, the LW may still cause a legal and resumable exception (e.g., a page fault), and the hardware must take the exception when it knows that the controlling condition is true. Conditional instructions are certainly helpful for implementing short alternative control flows. Nonetheless, the usefulness of conditional instructions is significantly limited by several factors: s Conditional instructions that are annulled (i.e., whose conditions are false) still take execution time. 
Therefore, moving an instruction across a branch and making it conditional will slow the program down whenever the moved instruction would not have been normally executed. An important exception to this occurs when the cycles used by the moved instruction when it is not performed would have been idle anyway (as in the superscalar example above). Moving an instruction across a branch is essentially speculating on the outcome of the branch. Conditional instructions make this easier but do not eliminate the execution time taken by an incorrect guess. In simple cases, where we trade a conditional move for a branch and a move, using conditional moves is almost always better. When longer code sequences are made conditional, the benefits are more limited.

• Conditional instructions are most useful when the condition can be evaluated early. If the condition and branch cannot be separated (because of data dependences in determining the condition), then a conditional instruction will help less, though it may still be useful since it delays the point when the condition must be known till nearer the end of the pipeline.

• The use of conditional instructions is limited when the control flow involves more than a simple alternative sequence. For example, moving an instruction across multiple branches requires making it conditional on both branches, which requires two conditions to be specified, an unlikely capability, or requires additional instructions to compute the “and” of the conditions.

• Conditional instructions may have some speed penalty compared with unconditional instructions. This may show up as a higher cycle count for such instructions or a slower clock rate overall. If conditional instructions are more expensive, they will need to be used judiciously.
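The requirement stated earlier, that an annulled conditional instruction must neither write its destination nor raise an exception, can be made concrete with a small model. The following is only a sketch under stated assumptions: the dictionary-based register file and memory, the `ProtectionFault` class, and the helper names `load_word` and `lwc` are hypothetical and are not part of DLX or of any real simulator.

```python
class ProtectionFault(Exception):
    """Stand-in for a terminating memory protection exception (assumed name)."""


def load_word(mem, addr):
    """Unconditional load: faults on an address outside the toy memory."""
    if addr not in mem:
        raise ProtectionFault(f"illegal address {addr}")
    return mem[addr]


def lwc(regs, mem, dest, offset, base, cond):
    """Model LWC dest,offset(base),cond: the load occurs unless regs[cond] is 0."""
    if regs[cond] == 0:
        return False          # annulled: no register write, no exception
    regs[dest] = load_word(mem, regs[base] + offset)
    return True


mem = {120: 7}                                        # one mapped word at address 120

taken = {"R8": -1, "R10": 100}
performed = lwc(taken, mem, "R8", 20, "R10", "R10")   # R10 != 0, so the load runs

null = {"R8": -1, "R10": 0}
annulled = lwc(null, mem, "R8", 20, "R10", "R10")     # R10 == 0, so nothing happens

try:
    load_word(mem, null["R10"] + 20)                  # the same load done unconditionally
    faulted = False
except ProtectionFault:
    faulted = True
```

In this model the unconditional load with R10 equal to zero faults, while the conditional form leaves R8 untouched and raises nothing, which is the property that lets the compiler hoist the load above the branch.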
For these reasons, many architectures have included a few simple conditional instructions (with conditional move being the most frequent), but few architectures include conditional versions for the majority of the instructions. Figure 4.33 shows the conditional operations available in a variety of recent architectures.

    Alpha     Conditional move
    HP PA     Any register-register instruction can nullify the following instruction, making it conditional.
    MIPS      Conditional move
    SPARC     Conditional move

FIGURE 4.33 Conditional instructions available in four different RISC architectures. Conditional move was one of the few user instructions added to the Intel P6 processor.

Compiler Speculation with Hardware Support

As we saw in Chapter 3, many programs have branches that can be accurately predicted at compile time either from the program structure or by using a profile. In such cases, the compiler may want to speculate either to improve the scheduling or to increase the issue rate. Conditional instructions provide some limited ability to speculate, but they are really more useful when control dependences can be completely eliminated, such as in an if-then with a small then body. In trying to speculate, the compiler would like to not only make instructions control independent, it would also like to move them so that the speculated instructions execute before the branch! In moving instructions across a branch the compiler must ensure that exception behavior is not changed and that the dynamic data dependence remains the same. We have already seen, in examining trace scheduling, how the compiler can move instructions across branches and how to compensate for such speculation so that the data dependences are properly maintained. In addition to determining which register values are unneeded, the compiler can rename registers so that the speculated code will not destroy data values when they are needed.
The challenge is in avoiding the unintended changes in exception behavior when speculating. In the simplest case, the compiler is conservative about what instructions it speculatively moves, and the exception behavior is unaffected. This limitation, however, is very constraining. In particular, since all memory reference instructions and most FP instructions can cause exceptions, this limitation will produce only small benefits. The key observation for any scheme is that the results of a speculated sequence that is mispredicted will not be used in the final computation. There are three methods that have been investigated for supporting more ambitious speculation without introducing erroneous exception behavior:

1. The hardware and operating system cooperatively ignore exceptions for speculative instructions.
2. A set of status bits, called poison bits, is attached to the result registers written by speculated instructions when the instructions cause exceptions. The poison bits cause a fault when a normal instruction attempts to use the register.
3. A mechanism is provided to indicate that an instruction is speculative and the hardware buffers the instruction result until it is certain that the instruction is no longer speculative.

To explain these schemes, we need to distinguish between exceptions that indicate a program error and would normally cause termination, such as a memory protection violation, and those that are handled and normally resumed, such as a page fault. Exceptions that can be resumed can be accepted and processed for speculative instructions just as if they were normal instructions. If the speculative instruction should not have been executed, handling the unneeded exception may have some negative performance effects. Handling these resumable exceptions, however, cannot cause incorrect execution; furthermore, the performance losses are probably minor, so we ignore them.
Exceptions that indicate a program error should not occur in correct programs, and the result of a program that gets such an exception is not well defined, except perhaps when the program is running in a debugging mode. If such exceptions arise in speculated instructions, we cannot take the exception until we know that the instruction is no longer speculative.

Hardware-Software Cooperation for Speculation

In the simplest case, the hardware and the operating system simply handle all resumable exceptions when the exception occurs and simply return an undefined value for any exception that would cause termination. If the instruction generating the terminating exception was not speculative, then the program is in error. Note that instead of terminating the program, the program is allowed to continue, though it will almost certainly generate incorrect results. If the instruction generating the terminating exception is speculative, then the program may be correct and the speculative result will simply be unused; thus, returning an undefined value for the instruction cannot be harmful. This scheme can never cause a correct program to fail, no matter how much speculation is done. An incorrect program, which formerly might have received a terminating exception, will get an incorrect result. This is probably acceptable, assuming the compiler can also generate a normal version of the program, which does not speculate and can receive a terminating exception.

EXAMPLE Consider the following code fragment from an if-then-else statement of the form

    if (A==0) A = B; else A = A+4;

where A is at 0(R3) and B is at 0(R2):

          LW    R1,0(R3)    ;load A
          BNEZ  R1,L1       ;test A
          LW    R1,0(R2)    ;if clause
          J     L2          ;skip else
    L1:   ADDI  R1,R1,#4    ;else clause
    L2:   SW    0(R3),R1    ;store A

Assume the then clause is almost always executed. Compile the code using compiler-based speculation. Assume R14 is unused and available.
ANSWER Here is the new code:

          LW    R1,0(R3)    ;load A
          LW    R14,0(R2)   ;speculative load B
          BEQZ  R1,L3       ;other branch of the if
          ADDI  R14,R1,#4   ;the else clause
    L3:   SW    0(R3),R14   ;nonspeculative store

The then clause is completely speculated. We introduce a temporary register to avoid destroying R1 when B is loaded. After the entire code segment is executed, A will be in R14. The else clause could have also been compiled speculatively with a conditional move, but if the branch is highly predictable and low cost, this might slow the code down, since two extra instructions would always be executed as opposed to one branch.

In such a scheme, it is not necessary to know that an instruction is speculative. Indeed, it is helpful only when a program is in error and receives a terminating exception on a normal instruction; in such cases, if the instruction were not marked as speculative, the program could be terminated. In such a scheme, as in the next one, renaming will often be needed to prevent speculative instructions from destroying live values. Renaming is usually restricted to register values. Because of this restriction, the targets of stores cannot be destroyed and stores cannot be speculative. The small number of registers and the cost of spilling will act as one constraint on the amount of speculation. Of course, the major constraint remains the cost of executing speculative instructions when the compiler’s branch prediction is incorrect.

Speculation with Poison Bits

The use of poison bits allows compiler speculation with less change to the exception behavior. In particular, incorrect programs that caused termination without speculation will still cause exceptions when instructions are speculated. The scheme is simple: A poison bit is added to every register and another bit is added to every instruction to indicate whether the instruction is speculative.
The poison bit of the destination register is set whenever a speculative instruction results in a terminating exception; all other exceptions are handled immediately. If a speculative instruction uses a register with a poison bit turned on, the destination register of the instruction simply has its poison bit turned on. If a normal instruction attempts to use a register source with its poison bit turned on, the instruction causes a fault. In this way, any program that would have generated an exception still generates one, albeit at the first instance where a result is used by an instruction that is not speculative. Since poison bits exist only on register values and not memory values, stores are not speculative and thus trap if either operand is “poison.”

EXAMPLE Consider the code fragment from page 305 and show how it would be compiled with speculative instructions and poison bits. Show where an exception for the speculative memory reference would be recognized. Assume R14, R15 are unused and available.

ANSWER Here is the code (an “*” on the opcode indicates a speculative instruction):

          LW    R1,0(R3)    ;load A
          LW*   R14,0(R2)   ;speculative load B
          BEQZ  R1,L3       ;
          ADDI  R14,R1,#4   ;
    L3:   SW    0(R3),R14   ;exception for speculative LW

If the speculative LW* generates a terminating exception, the poison bit of R14 will be turned on. When the nonspeculative SW instruction occurs, it will raise an exception if the poison bit for R14 is on.

One complication that must be overcome is how the OS can save the user registers if the poison bit is set. A special instruction is needed to save and reset the state of the poison bits to avoid this problem.

Speculative Instructions with Renaming

The main disadvantages of the two previous schemes are the need to introduce copies to deal with register renaming and the possibility of exhausting the registers.
The former problem reduces efficiency, while the latter can make speculation not worthwhile. An alternative is to move instructions past branches, flagging them as speculative, and providing renaming and buffering in the hardware, much as Tomasulo’s algorithm does. This concept has been called boosting, and it is closely related to the full hardware-based scheme we consider next.

A boosted instruction is executed speculatively based on a future branch. The results of the instruction are forwarded to and used by other boosted instructions. When the branch following the boosted instruction is reached, if the boosted instruction contains a correct prediction of the branch, then results are committed to the registers; otherwise, the results are discarded. Boosted instructions allow the execution of an instruction that is dependent on a branch before the branch is resolved, but the final action to commit the instruction is taken only after the branch is resolved. It is possible to support boosting of instructions across multiple branches, but we consider only the case of boosting across one branch.

EXAMPLE Consider the code fragment from page 305 and show how it would be compiled with boosted instructions. Show where the instruction would finally commit. Can the sequence be compiled without needing any additional registers?

ANSWER We use a “+” after the opcode to indicate that the instruction is boosted across the next branch and predicts the branch as taken. Here is the new code:

          LW    R1,0(R3)    ;load A
          LW+   R1,0(R2)    ;boosted load B
          BEQZ  R1,L3       ;other branch of the if
          ADDI  R1,R1,#4    ;the else clause
    L3:   SW    0(R3),R1    ;nonspeculative store

The extra register is no longer necessary, since if the branch is not taken, the result of the speculative load is never written to R1, so R1 can be used in the code sequence. Remember that the result of the boosted instruction is not written into R1 until after the branch.
Hence, the branch uses the value of R1 produced by the first, nonboosted load. Other boosted instructions could use the results of the boosted load.

Boosting can be implemented by one of several techniques that are closely related to the techniques needed to implement hardware-based speculation, the topic of the next section.

Hardware-Based Speculation

Hardware-based speculation combines three key ideas: dynamic branch prediction to choose which instructions to execute, speculation to allow the execution of instructions before the control dependences are resolved, and dynamic scheduling to deal with the scheduling of different combinations of basic blocks. Hardware-based speculation uses the dynamic data dependences to choose when to execute instructions. This method of executing programs is essentially a dataflow execution: operations execute as soon as their operands are available.

The advantages of hardware-based speculation versus software-based speculation include the following:

1. To speculate extensively, we must be able to disambiguate memory references. This is difficult to do at compile time for integer programs that contain pointers. In a hardware-based scheme, dynamic runtime disambiguation of memory addresses is done using the techniques we saw earlier for Tomasulo’s algorithm. This allows us to move loads past stores at runtime.
2. Hardware-based speculation is better when hardware-based branch prediction is superior to software-based branch prediction done at compile time. This is true for many integer programs. For example, a profile-based static predictor has a misprediction rate of about 16% for four of the five integer SPEC programs we use, while a hardware predictor has a misprediction rate of about 11%. Because speculated instructions may slow down the computation when the prediction is incorrect, this difference is significant.
3. Hardware-based speculation maintains a completely precise exception model even for speculated instructions.
4. Hardware-based speculation does not require compensation or bookkeeping code.
5. Hardware-based speculation with dynamic scheduling does not require different code sequences to achieve good performance for different implementations of an architecture. Compiler-based speculation and scheduling often requires code sequences tuned to the machine, and older or different code sequences can result in much lower performance. In contrast, while hardware speculation and scheduling can benefit from scheduling and tuning processors, using the hardware-based approaches is expected to work well even with older or different code sequences. While this advantage is the hardest to quantify, it may be the most important in the long run. Interestingly, this was one of the motivations for the IBM 360/91.

Against these advantages stands a major disadvantage: supporting speculation in hardware is complex and requires substantial hardware resources.

The approach we examine here, and the one implemented in a number of processors (PowerPC 620, MIPS R10000, Intel P6, and AMD K5), is to combine speculative execution with dynamic scheduling based on Tomasulo’s algorithm. The 360/91 did this to a certain extent since it could use branch prediction to fetch instructions and assign them to reservation stations. Speculation involves going further and actually executing the instructions as well as executing other instructions dependent on the speculated instructions. Just as with Tomasulo’s algorithm, we explain hardware speculation in the context of the floating-point unit, but the ideas are easily applicable to the integer unit, as we will see in the Putting It All Together section.

The hardware that implements Tomasulo’s algorithm can be extended to support speculation.
To do so, we must separate the bypassing of results among instructions, which is needed to execute an instruction speculatively, from the actual completion of an instruction. By making this separation, we can allow an instruction to execute and to bypass its results to other instructions, without allowing the instruction to perform any updates that cannot be undone, until we know that the instruction is no longer speculative. Using the bypass is like performing a speculative register read, since we do not know whether the instruction providing the source register value is providing the correct result until the instruction is no longer speculative. When an instruction is no longer speculative, we allow it to update the register file or memory; we call this additional step in the instruction execution sequence instruction commit. The key idea behind implementing speculation is to allow instructions to execute out of order but to force them to commit in order and to prevent any irrevocable action (such as updating state or taking an exception) until an instruction commits. In the simple single-issue DLX pipeline we could ensure that instructions committed in order, and only after any exceptions for that instruction had been detected, simply by moving writes to the end of the pipeline. When we add speculation, we need to separate the process of completing execution from instruction commit, since instructions may finish execution considerably before they are ready to commit. Adding this commit phase to the instruction execution sequence requires some changes to the sequence as well as an additional set of hardware buffers that hold the results of instructions that have finished execution but have not committed. This hardware buffer, which we call the reorder buffer, is also used to pass results among instructions that may be speculated. 
The reorder buffer provides additional virtual registers in the same way as the reservation stations in Tomasulo’s algorithm extend the register set. The reorder buffer holds the result of an instruction between the time the operation associated with the instruction completes and the time the instruction commits. Hence, the reorder buffer is a source of operands for instructions, just as the reservation stations provide operands in Tomasulo’s algorithm. The key difference is that in Tomasulo’s algorithm, once an instruction writes its result, any subsequently issued instructions will find the result in the register file. With speculation, the register file is not updated until the instruction commits (and we know definitively that the instruction should execute); thus, the reorder buffer supplies operands in the interval between completion of execution and instruction commit. The reorder buffer is not unlike the store buffer in Tomasulo’s algorithm, and we integrate the function of the store buffer into the reorder buffer for simplicity. Since the reorder buffer is responsible for holding results until they are stored into the registers, it also replaces the function of the load buffers. Each entry in the reorder buffer contains three fields: the instruction type, the destination field, and the value field. The instruction type field indicates whether the instruction is a branch (and has no destination result), a store (which has a memory address destination), or a register operation (ALU operation or load, which have register destinations). The destination field supplies the register number (for loads and ALU operations) or the memory address (for stores), where the instruction result should be written. The value field is used to hold the value of the instruction result until the instruction commits. We will see an example of reorder buffer entries shortly.
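The three fields of a reorder buffer entry described above map naturally onto a small record type. The sketch below is illustrative only: the names `InstrType` and `ReorderBufferEntry` are hypothetical, and the extra `ready` flag is an assumption added so that the model can tell a pending result from a finished one; it is not one of the three fields named in the text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Union


class InstrType(Enum):
    BRANCH = "branch"            # has no destination result
    STORE = "store"              # destination is a memory address
    REGISTER_OP = "register op"  # ALU operation or load; destination is a register


@dataclass
class ReorderBufferEntry:
    instr_type: InstrType
    destination: Optional[Union[str, int]] = None   # register name or memory address
    value: Optional[float] = None                   # result held until commit
    ready: bool = False                             # assumed flag: result has been written

    def write_result(self, value):
        """Record the result; architectural state is untouched until commit."""
        self.value = value
        self.ready = True


multd = ReorderBufferEntry(InstrType.REGISTER_OP, destination="F0")
store = ReorderBufferEntry(InstrType.STORE, destination=1000)
branch = ReorderBufferEntry(InstrType.BRANCH)

multd.write_result(6.0)   # the value waits in the entry; F0 itself is not yet updated
```

Keeping the result in the entry rather than in the register file is what allows a later instruction to read it as an operand while the producing instruction is still speculative.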
Figure 4.34 shows the hardware structure of the processor including the reorder buffer. The reorder buffer completely replaces the load and store buffers. Although the renaming function of the reservation stations is replaced by the reorder buffer, we still need a place to buffer operations (and operands) between the time they issue and the time they begin execution. This function is still provided by the reservation stations. Since every instruction has a position in the reorder buffer until it commits (and the results are posted to the register file), we tag a result using the reorder buffer entry number rather than using the reservation station number. This requires that the reorder buffer assigned for an instruction must be tracked in the reservation stations. In section 4.8, we will explore an alternative implementation that uses extra registers for renaming and the reorder buffer only to track when instructions can commit.

[Figure 4.34 diagram. Labeled elements: reorder buffer; floating-point operation queue (from instruction unit); FP registers (register no.); operand buses; operation bus; reservation stations; FP adders; FP multipliers; common data bus; data and address to memory; load results from memory.]

FIGURE 4.34 The basic structure of a DLX FP unit using Tomasulo’s algorithm and extended to handle speculation. Comparing this to Figure 4.8 on page 253, which implemented Tomasulo’s algorithm, the major changes are the addition of the reorder buffer and the elimination of the load and store buffers (their functions are subsumed by the reorder buffer). This mechanism can be extended to multiple issue by making the CDB (common data bus) wider to allow for multiple completions per clock.

Here are the four steps involved in instruction execution:

1. Issue: Get an instruction from the floating-point operation queue. Issue the instruction if there is an empty reservation station and an empty slot in the reorder buffer, send the operands to the reservation station if they are in the registers or the reorder buffer, and update the control entries to indicate the buffers are in use. The number of the reorder buffer allocated for the result is also sent to the reservation station, so that the number can be used to tag the result when it is placed on the CDB. If either all reservation stations are full or the reorder buffer is full, then instruction issue is stalled until both have available entries. This stage is sometimes called dispatch in a dynamically scheduled machine.

2. Execute: If one or more of the operands is not yet available, monitor the CDB (common data bus) while waiting for the register to be computed. This step checks for RAW hazards. When both operands are available at a reservation station, execute the operation. Some dynamically scheduled processors call this step issue, but we use the terminology based on the CDC 6600.

3. Write result: When the result is available, write it on the CDB (with the reorder buffer tag sent when the instruction issued) and from the CDB into the reorder buffer, as well as to any reservation stations waiting for this result. (It is also possible to read results from the reorder buffer, rather than from the CDB, just as the scoreboard reads results from the registers rather than from a completion bus. The trade-offs are similar to those that exist in a central scoreboard scheme versus a broadcast scheme using a CDB.) Mark the reservation station as available.

4. Commit: When an instruction, other than a branch with incorrect prediction, reaches the head of the reorder buffer and its result is present in the buffer, update the register with the result (or perform a memory write if the operation is a store) and remove the instruction from the reorder buffer.
When a branch with incorrect prediction reaches the head of the reorder buffer, it indicates that the speculation was wrong. The reorder buffer is flushed and execution is restarted at the correct successor of the branch. If the branch was correctly predicted, the branch is finished. Some machines call this completion or graduation.

Once an instruction commits, its entry in the reorder buffer is reclaimed and the register or memory destination is updated, eliminating the need for the reorder buffer entry. To avoid changing the reorder buffer numbers as instructions commit, we implement the reorder buffer as a circular queue, so that positions in the reorder buffer change only when an instruction is committed. If the reorder buffer fills, we simply stop issuing instructions until an entry is made free. Now, let’s examine how this scheme would work with the same example we used for Tomasulo’s algorithm.

EXAMPLE Assume the same latencies for the floating-point functional units as in earlier examples: Add is 2 clock cycles, multiply is 10 clock cycles, and divide is 40 clock cycles. Using the code segment below, the same one we used earlier, show what the status tables look like when the MULTD is ready to go to commit.

    LD     F6,34(R2)
    LD     F2,45(R3)
    MULTD  F0,F2,F4
    SUBD   F8,F6,F2
    DIVD   F10,F0,F6
    ADDD   F6,F8,F2

ANSWER The result is shown in the three tables in Figure 4.35. Note that although the SUBD instruction has completed execution, it does not commit until the MULTD commits. Note that all tags in the Qj and Qk fields as well as in the register status fields have been replaced with reorder buffer numbers, and the Dest field designates the reorder buffer number that is the destination for the result.

The above Example illustrates the key difference between a processor with speculation and a processor with dynamic scheduling.
Compare the content of Figure 4.35 with that of Figure 4.10 (page 258), which shows the same code sequence in operation on a processor with Tomasulo’s algorithm.

Reservation stations

    Name   Busy  Op     Vj                Vk                Qj  Qk  Dest
    Add1   No
    Add2   No
    Add3   No
    Mult1  No    MULTD  Mem[45+Regs[R3]]  Regs[F4]                  #3
    Mult2  Yes   DIVD                     Mem[34+Regs[R2]]  #3      #5

Reorder buffer

    Entry  Busy  Instruction       State         Destination  Value
    1      No    LD     F6,34(R2)  Commit        F6           Mem[34+Regs[R2]]
    2      No    LD     F2,45(R3)  Commit        F2           Mem[45+Regs[R3]]
    3      Yes   MULTD  F0,F2,F4   Write result  F0           #2 × Regs[F4]
    4      Yes   SUBD   F8,F6,F2   Write result  F8           #1 - #2
    5      Yes   DIVD   F10,F0,F6  Execute       F10
    6      Yes   ADDD   F6,F8,F2   Write result  F6           #4 + #2

FP register status

    Field      F0   F2   F4   F6   F8   F10  F12  ...  F30
    Reorder #  3              6    4    5
    Busy       Yes  No   No   Yes  Yes  Yes  No   ...  No

FIGURE 4.35 Only the two LD instructions have committed, though several others have completed execution. The SUBD and ADDD instructions will not commit until the MULTD instruction commits, though the results of the instructions are available and can be used as sources for other instructions. The value column indicates the value being held; the format #X is used to refer to a value field of reorder buffer entry X.

The key difference is that in the example above, no instruction after the earliest uncompleted instruction (MULTD above) is allowed to complete. In contrast, in Figure 4.10 the SUBD and ADDD instructions have also completed. One implication of this difference is that the processor with the reorder buffer can dynamically execute code while maintaining a precise interrupt model. For example, if the MULTD instruction caused an interrupt, we could simply wait until it reached the head of the reorder buffer and take the interrupt, flushing any other pending instructions. Because instruction commit happens in order, this yields a precise exception.
By contrast, in the example using Tomasulo’s algorithm, the SUBD and ADDD instructions could both complete before the MULTD raised the exception. The result is that the registers F8 and F6 (destinations of the SUBD and ADDD instructions) could be overwritten, and the interrupt would be imprecise. Some users and architects have decided that imprecise floating-point exceptions are acceptable in high-performance processors, since the program will likely terminate; see Appendix A for further discussion of this topic. Other types of exceptions, such as page faults, are much more difficult to accommodate if they are imprecise, since the program must transparently resume execution after handling such an exception. The use of a reorder buffer with in-order instruction commit provides precise exceptions, in addition to supporting speculative execution, as the next Example shows.

EXAMPLE Consider the code example used earlier for Tomasulo’s algorithm and shown in Figure 4.12 on page 261 in execution:

    Loop: LD     F0,0(R1)
          MULTD  F4,F0,F2
          SD     0(R1),F4
          SUBI   R1,R1,#8
          BNEZ   R1,Loop    ;branches if R1≠0

Assume that we have issued all the instructions in the loop twice. Let’s also assume that the LD and MULTD from the first iteration have committed and all other instructions have completed execution. In an implementation that uses dynamic scheduling for both the integer and floating-point units, the store would wait in the reorder buffer for both the effective address operand (R1 in this example) and the value (F4 in this example); however, since we are only considering the floating-point resources, assume the effective address for the store is computed by the time the instruction is issued.

ANSWER The result is shown in the three tables in Figure 4.36.
Reservation stations

    Name   Busy  Op     Vj               Vk        Qj  Qk  Dest
    Mult1  No    MULTD  Mem[0+Regs[R1]]  Regs[F2]          #2
    Mult2  No    MULTD  Mem[0+Regs[R1]]  Regs[F2]          #7

Reorder buffer

    Entry  Busy  Instruction      State         Destination  Value
    1      No    LD     F0,0(R1)  Commit        F0           Mem[0+Regs[R1]]
    2      No    MULTD  F4,F0,F2  Commit        F4           #1 × Regs[F2]
    3      Yes   SD     0(R1),F4  Write result  0+Regs[R1]   #2
    4      Yes   SUBI   R1,R1,#8  Write result  R1           Regs[R1] - 8
    5      Yes   BNEZ   R1,Loop   Write result
    6      Yes   LD     F0,0(R1)  Write result  F0           Mem[#4]
    7      Yes   MULTD  F4,F0,F2  Write result  F4           #6 × Regs[F2]
    8      Yes   SD     0(R1),F4  Write result  0+#4         #7
    9      Yes   SUBI   R1,R1,#8  Write result  R1           #4 - 8
    10     Yes   BNEZ   R1,Loop   Write result

FP register status

    Field      F0   F2   F4   F6   F8   F10  F12  ...  F30
    Reorder #  6         7
    Busy       Yes  No   Yes  No   No   No   No   ...  No

FIGURE 4.36 Only the LD and MULTD instructions have committed, though all the others have completed execution. The remaining instructions will be committed as fast as possible.

Because neither the register values nor any memory values are actually written until an instruction commits, the processor can easily undo its speculative actions when a branch is found to be mispredicted. Suppose that in the above example (Figure 4.36), the branch BNEZ is not taken the first time. The instructions prior to the branch will simply commit when each reaches the head of the reorder buffer; when the branch reaches the head of that buffer, the buffer is simply cleared and the processor begins fetching instructions from the other path.

In practice, machines that speculate try to recover as early as possible after a branch is mispredicted. This can be done by clearing the reorder buffer for all entries that appear after the mispredicted branch, allowing those that are before the branch in the reorder buffer to continue, and restarting the fetch at the correct branch successor.
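The two recovery policies just described, clearing the buffer when the mispredicted branch reaches the head and clearing only the younger entries as soon as the misprediction is known, can be sketched with a queue. This is a toy model under stated assumptions: the `ReorderBuffer` class, its dictionary entries, and the field names `done` and `mispredicted` are hypothetical, a Python `deque` stands in for the circular queue, and capacity limits and reservation stations are left out.

```python
from collections import deque


class ReorderBuffer:
    """Toy reorder buffer: results update state only at in-order commit."""

    def __init__(self):
        self.queue = deque()   # leftmost entry is the head (oldest instruction)
        self.regs = {}         # committed register state
        self.mem = {}          # committed memory state

    def issue(self, kind, dest=None):
        entry = {"kind": kind, "dest": dest, "value": None,
                 "done": False, "mispredicted": False}
        self.queue.append(entry)
        return entry

    def flush_after(self, branch):
        """Early recovery: drop every entry younger than the mispredicted branch."""
        if not any(e is branch for e in self.queue):
            return 0
        dropped = 0
        while self.queue[-1] is not branch:
            self.queue.pop()
            dropped += 1
        return dropped

    def commit(self):
        """Commit finished entries from the head, in order; return how many committed."""
        committed = 0
        while self.queue and self.queue[0]["done"]:
            entry = self.queue.popleft()
            committed += 1
            if entry["kind"] == "register":
                self.regs[entry["dest"]] = entry["value"]
            elif entry["kind"] == "store":
                self.mem[entry["dest"]] = entry["value"]
            elif entry["mispredicted"]:
                self.queue.clear()   # wrong-path work behind the branch is discarded
                break
        return committed


rob = ReorderBuffer()
sd1 = rob.issue("store", dest=1000)        # SD from the first iteration (address assumed)
subi1 = rob.issue("register", dest="R1")   # SUBI from the first iteration
bnez1 = rob.issue("branch")                # first BNEZ
ld2 = rob.issue("register", dest="F0")     # second-iteration LD (speculative)
multd2 = rob.issue("register", dest="F4")  # second-iteration MULTD (speculative)
for entry, value in [(sd1, 3.0), (subi1, 992), (bnez1, None), (ld2, 1.5), (multd2, 4.5)]:
    entry["value"], entry["done"] = value, True
bnez1["mispredicted"] = True               # assumed outcome, for illustration

dropped = rob.flush_after(bnez1)           # the speculative LD and MULTD are discarded
committed = rob.commit()                   # SD, SUBI, BNEZ commit in program order
```

Because nothing younger than the branch ever reached `regs` or `mem`, discarding the entries is all the recovery this model needs; only the instructions that precede the branch leave any trace in the committed state.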
In speculative processors, performance is more sensitive to the branch prediction mechanisms, since the impact of a misprediction will be higher. Thus, all the aspects of handling branches (prediction accuracy, misprediction detection, and misprediction recovery) increase in importance.

Exceptions are handled by not recognizing the exception until it is ready to commit. If a speculated instruction raises an exception, the exception is recorded in the reorder buffer. If a branch misprediction arises and the instruction should not have been executed, the exception is flushed along with the instruction when the reorder buffer is cleared. If the instruction reaches the head of the reorder buffer, then we know it is no longer speculative and the exception should really be taken. We can also try to handle exceptions as soon as they arise, but this is more challenging for exceptions than for branch mispredictions.

Figure 4.37 shows the steps of execution for an instruction, as well as the conditions that must be satisfied to proceed to the step and the actions taken. We show the case where mispredicted branches are not resolved until commit.

Although this explanation of speculative execution has focused on floating point, the techniques easily extend to the integer registers and functional units. Indeed, speculation may be more useful in integer programs, since such programs tend to have code where the branch behavior is less predictable. Additionally, these techniques can be extended to work in a multiple-issue processor by allowing multiple instructions to issue and commit every clock. Indeed, speculation is probably most interesting in such processors, since less ambitious techniques can probably exploit sufficient ILP within basic blocks when assisted by a compiler using unrolling.
A speculative processor can be extended to multiple issue (see the Exercises) using the same techniques we employed when extending a Tomasulo-based processor in section 4.4. The same techniques for implementing the instruction issue unit can be used: We process multiple instructions per clock assigning reservation stations and reorder buffers to the instructions. The challenge here is in deciding what instructions can issue and in performing the renaming within the allowable clock period. We also need to widen the CDB to allow multiple instructions to complete within a clock cycle. The challenge lies in monitoring the multiple completion buses for operands without impacting the clock cycle. In section 4.7 we will examine the importance of speculation on the amount of ILP that can be extracted. Section 4.8 examines a speculative multiple-issue machine, the PowerPC 620, and its performance. The alternative to hardware-based speculation is compiler-based speculation. Such approaches are useful when branches cannot be eliminated by techniques such as loop unrolling but are statically predictable, so that the compiler can choose how to speculate. Whether speculation will be supported primarily in hardware or primarily in software is a point of current debate. Of course, all the techniques described in the last chapter and in this one cannot take advantage of more parallelism than is provided by the application. The question of how much parallelism is available has been hotly debated and is the topic of the next section. 
Instruction status: Issue
  Wait until: Reservation station (r) and reorder buffer (b) both available
  Action or bookkeeping:
    if (Register[S1].Busy) /* an executing instruction writes S1 */
      {h ← Register[S1].Reorder;
       if (Reorder[h].Ready) /* instruction has completed already */
         {RS[r].Vj ← Reorder[h].Value; RS[r].Qj ← 0;}
       else /* wait for instruction */
         {RS[r].Qj ← h;}
      }
    else /* data must be in registers */
      {RS[r].Vj ← Regs[S1]; RS[r].Qj ← 0;};
    if (Register[S2].Busy) /* an executing instruction writes S2 */
      {h ← Register[S2].Reorder;
       if (Reorder[h].Ready) /* instruction has completed already */
         {RS[r].Vk ← Reorder[h].Value; RS[r].Qk ← 0;}
       else /* wait for instruction */
         {RS[r].Qk ← h;}
      }
    else /* data must be in registers */
      {RS[r].Vk ← Regs[S2]; RS[r].Qk ← 0;};
    /* assign tracking fields of reservation station, register data structure, and reorder buffer */
    RS[r].Busy ← Yes; RS[r].Dest ← b;
    Register[D].Reorder ← b; Register[D].Busy ← Yes;
    Reorder[b].Instruction ← opcode; Reorder[b].Dest ← D; Reorder[b].Ready ← No;

Instruction status: Execute
  Wait until: (RS[r].Qj=0) and (RS[r].Qk=0)
  Action or bookkeeping: None; operands are in Vj and Vk

Instruction status: Write result
  Wait until: Execution completed at r and CDB available; value is result (for a store, there are two results: dest is the store’s destination address in memory, while result is the value to be stored)
  Action or bookkeeping:
    b ← RS[r].Dest;
    /* if x waiting for this reorder buffer, update it */
    ∀x(if (RS[x].Qj=b) {RS[x].Vj ← result; RS[x].Qj ← 0});
    ∀x(if (RS[x].Qk=b) {RS[x].Vk ← result; RS[x].Qk ← 0});
    /* free reservation station; update reorder buffer */
    RS[r].Busy ← No;
    Reorder[b].Value ← result; Reorder[b].Ready ← Yes;
    if (Reorder[b].Instruction=Store) {Reorder[b].Address ← dest;};

Instruction status: Commit
  Wait until: Instruction is at the head of the reorder buffer (entry h) and instruction has completed Write result
  Action or bookkeeping:
    r ← Reorder[h].Dest; /* register dest, if it exists */
    if (Reorder[h].Instruction==Branch)
      {if (branch is mispredicted)
         {clear reorder buffer and Register status; fetch correct branch successor;};}
    else if (Reorder[h].Instruction==Store) /* perform the store operation */
      {Mem[Reorder[h].Address] ← Reorder[h].Value;}
    else /* put the result in the register destination */
      {Regs[r] ← Reorder[h].Value;};
    Reorder[h].Busy ← No; /* free up reorder buffer entry */
    /* free up dest register if no one else writing it */
    if (Register[r].Reorder==h) {Register[r].Busy ← No;};

FIGURE 4.37 Steps in the algorithm and what is required for each step. For the issuing instruction, D is the destination, S1 and S2 are the sources, and r is the reservation station allocated and b is the assigned reorder buffer entry. RS is the reservation-station data structure. The value returned by a reservation station is called the result. Register is the register data structure, Regs represents the actual registers, while Reorder is the reorder buffer data structure. Just as in Tomasulo’s algorithm there is a subtle timing problem; see Exercise 4.24 for further discussion. Similarly, some of the details in handling stores have been simplified; as an exercise, the reader should consider the implication of the fact that stores have two input operands, but that the operands are not needed at the same time.

4.7 Studies of ILP

Exploiting ILP to increase performance began with the first pipelined processors in the 1960s. In the 1980s and 1990s, these techniques were key to achieving rapid performance improvements. The question of how much ILP exists is critical to our long-term ability to enhance performance at a rate that exceeds the increase in speed of the base integrated-circuit technology. On a shorter scale, the critical question of what is needed to exploit more ILP is crucial to both computer designers and compiler writers.
The data in this section also provide us with a way to examine the value of ideas that we have introduced in this chapter, including memory disambiguation, register renaming, and speculation. In this section we review one of the studies done of these questions. The historical section describes several studies, including the source for the data in this section. All these studies of available parallelism operate by making a set of assumptions and seeing how much parallelism is available under those assumptions. The data we examine here are from a study that makes the fewest assumptions; in fact, the ultimate hardware model is completely unrealizable. Nonetheless, all such studies assume a certain level of compiler technology and some of these assumptions could affect the results, despite the use of incredibly ambitious hardware. In the future, advances in compiler technology together with significantly new and different hardware techniques may be able to overcome some limitations assumed in these studies; however, it is unlikely that such advances when coupled with realistic hardware will overcome these limits in the near future. Instead, developing new hardware and software techniques to overcome the limits seen in these studies will continue to be one of the most important challenges in computer design.

The Hardware Model

To see what the limits of ILP might be, we first need to define an ideal processor. An ideal processor is one where all artificial constraints on ILP are removed. The only limits on ILP in such a processor are true data dependences either through registers or memory. The assumptions made for an ideal or perfect processor are as follows:

1. Register renaming: There are an infinite number of virtual registers available and hence all WAW and WAR hazards are avoided.

2. Branch prediction: Branch prediction is perfect. All conditional branches are predicted exactly.

3. Jump prediction: All jumps (including jump register used for return and computed jumps) are perfectly predicted. When combined with perfect branch prediction, this is equivalent to having a processor with perfect speculation and an unbounded buffer of instructions available for execution.

4. Memory-address alias analysis: All memory addresses are known exactly and a load can be moved before a store provided that the addresses are not identical.

Initially, we examine a processor that can issue an unlimited number of instructions at once looking arbitrarily far ahead in the computation. For all the processor models we examine, there are no restrictions on what types of instructions can execute in a cycle. For the unlimited-issue case, this means there may be an unlimited number of loads or stores issuing in one clock cycle. In addition, all functional unit latencies are assumed to be one cycle, so that any sequence of dependent instructions can issue on successive cycles. Latencies longer than one cycle would decrease the number of issues per cycle, although not the number of instructions under execution at any point. (The instructions in execution at any point are often referred to as in-flight.)

Of course, this processor is completely unrealizable. For example, the HP 8000 is one of the widest superscalar processors announced to date. The 8000 issues up to six instructions per clock (with significant restrictions on the instruction types, including at most two memory references), supports limited renaming, has multicycle latencies, and uses branch prediction. After looking at the parallelism available for the perfect processor, we will examine the impact of restricting various features.

To measure the available parallelism, a set of programs were compiled and optimized with the standard MIPS optimizing compilers. The programs were instrumented and executed to produce a trace of the instruction and data references.
Every instruction in the trace is then scheduled as early as possible, limited only by the data dependences. Since a trace is used, perfect branch prediction and perfect alias analysis are easy to do. With these mechanisms, instructions may be scheduled much earlier than they would otherwise, moving across large numbers of instructions on which they are not data dependent, including branches, since branches are perfectly predicted.

Figure 4.38 shows the average amount of parallelism available for six of the SPEC92 benchmarks. Throughout this section the parallelism is measured by the average instruction issue rate (remember that all instructions have a one-cycle latency). Three of these benchmarks (fpppp, doduc, and tomcatv) are floating-point intensive, while the other three are integer programs. Two of the floating-point benchmarks (fpppp and tomcatv) have extensive parallelism, which could be exploited by a vector computer or by a multiprocessor. The doduc program has extensive parallelism, but the parallelism does not occur in simple parallel loops as it does in fpppp and tomcatv. The program li is a LISP interpreter that has many short dependences.

    SPEC benchmark   Instruction issues per cycle
    gcc               54.8
    espresso          62.6
    li                17.9
    fpppp             75.2
    doduc            118.7
    tomcatv          150.1

FIGURE 4.38 ILP available in a perfect processor for six of the SPEC benchmarks. The first three programs are integer programs, while the last three are floating-point programs. The floating-point programs are loop-intensive and have large amounts of loop-level parallelism.

In the next few sections, we restrict various aspects of this processor to show what the effects of various assumptions are before looking at some ambitious but realizable processors.
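The trace-scheduling measurement described above can be sketched in a few lines of Python. This is only an illustration under the ideal-processor assumptions (unit latency, unlimited issue, renaming that removes WAR and WAW hazards, and exact addresses taken from the trace); the function name `ideal_ilp`, the tuple-based trace format, and the example addresses M1000 and M992 are assumptions made here, not part of the study.

```python
def ideal_ilp(trace):
    """Average issues per cycle when only true data dependences constrain issue.

    trace is a list of (dests, srcs) pairs in program order. Each name is a
    register or an exact memory address taken from the trace.
    """
    ready = {}   # name -> cycle in which its most recent producer issued
    last = 0     # latest cycle used by any instruction
    for dests, srcs in trace:
        # Issue one cycle after the latest producer of any source operand.
        cycle = 1 + max((ready.get(s, 0) for s in srcs), default=0)
        for d in dests:
            ready[d] = cycle   # later readers depend on this producer only
        last = max(last, cycle)
    return len(trace) / last if trace else 0.0


# Two iterations of the LD/MULTD/SD/SUBI/BNEZ loop with hypothetical addresses.
loop_trace = [
    (["F0"], ["R1", "M1000"]),   # LD    F0,0(R1)
    (["F4"], ["F0", "F2"]),      # MULTD F4,F0,F2
    (["M1000"], ["F4", "R1"]),   # SD    0(R1),F4
    (["R1"], ["R1"]),            # SUBI  R1,R1,#8
    ([], ["R1"]),                # BNEZ  R1,Loop
    (["F0"], ["R1", "M992"]),    # second iteration
    (["F4"], ["F0", "F2"]),
    (["M992"], ["F4", "R1"]),
    (["R1"], ["R1"]),
    ([], ["R1"]),
]
print(ideal_ilp(loop_trace))
```

Because the trace is walked in program order, looking up the most recent writer of each source captures exactly the true (RAW) dependences. Name dependences never delay an instruction, which is the effect of the unbounded renaming assumption.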
Limitations on the Window Size and Maximum Issue Count

To build a processor that even comes close to perfect branch prediction and perfect alias analysis requires extensive dynamic analysis, since static compile-time schemes cannot be perfect. Of course, most realistic dynamic schemes will not be perfect, but the use of dynamic schemes will provide the ability to uncover parallelism that cannot be analyzed by static compile-time analysis. Thus, a dynamic processor might be able to more closely match the amount of parallelism uncovered by our ideal processor.

How close could a real dynamically scheduled, speculative processor come to the ideal processor? To gain insight into this question, consider what the perfect processor must do:

1. Look arbitrarily far ahead to find a set of instructions to issue.

2. Rename all register uses to avoid WAR and WAW hazards.

3. Determine which instructions can issue and which must wait because of a register dependence.

4. Determine if any memory dependences exist and prevent dependent instructions from issuing.

5. Predict all branches.

6. Provide enough replicated functional units to allow all the ready instructions to issue.

Obviously, this analysis is quite complicated. For example, to determine whether n instructions have any register dependences among them, assuming all instructions are register-register and the total number of registers is unbounded, requires

$$(2n - 2) + (2n - 4) + \cdots + 2 = 2\sum_{i=1}^{n-1} i = 2\,\frac{(n-1)\,n}{2} = n^2 - n$$

comparisons. Thus, to detect dependences among the next 2000 instructions (the default size we assume in several figures) requires almost four million comparisons! Even examining only 50 instructions requires 2450 comparisons. This obviously limits the number of instructions that can be considered for issue at once. In practice, things are not quite so bad, since we need only detect dependence pairs.
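The comparison count above can be checked numerically. The short Python sketch below assumes, as the text does, that every instruction has two register sources and one destination, so an instruction with i earlier instructions in the group needs 2i comparisons; the function name `dependence_comparisons` is introduced here only for illustration.

```python
def dependence_comparisons(n):
    """Comparisons needed to check n register-register instructions at once.

    An instruction preceded by i others in the group compares each of its
    two sources against their i destinations, giving 2*i comparisons.
    """
    total = sum(2 * i for i in range(1, n))
    assert total == n * n - n   # agrees with the closed form n^2 - n
    return total


for window in (32, 50, 2000):
    print(window, dependence_comparisons(window))
```

The quadratic growth is the point: going from 50 to 2000 instructions multiplies the group size by 40 but the comparison count by roughly 1600.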
For a smaller number of registers we can build a structure that detects reuse of registers rather than comparing all instructions. Of course, if we serialize the instruction issue, the number of comparisons drops. In particular, this large number of comparisons is only needed to simultaneously issue a group of instructions; it is not necessarily needed if the instructions are overlapped.

The set of instructions examined for simultaneous execution is called the window. Since each instruction in the window must be kept in the processor and the number of comparisons required to execute any instruction in the window grows quadratically in the window size, real window sizes are likely to be small. To date, the window size has been in the range of 4 to 32, which at the upper end requires roughly 1000 comparisons (32² − 32 = 992), but probably not larger. As we will see in the next section, recent machines actually have several smaller windows (2–8) used for different instruction types. This limits the issue capability somewhat, but is much simpler to build.

The window size limits the number of instructions considered for issue and thus implicitly the maximum number of instructions that may issue. In addition to the cost in dependence checking and renaming hardware, real processors will have a limited number of functional units available and limited copies of each functional unit. Thus, the maximum number of instructions that may issue in a real processor might be smaller than the window size.

Issuing large numbers of instructions will almost certainly lengthen the clock cycle. For example, in the early 1990s, the processors with the most powerful multiple-issue capabilities typically had clock cycles that were 1.5 to 3 times longer than the processors with the simplest pipelines that were designed to emphasize a high clock rate. This does not mean the multiple-issue processors had lower performance, since they typically had CPIs that were 2 to 3 times lower.
Several examples of such comparisons appear later in the chapter.

[Figure 4.39: line graph of instruction issues per cycle (0 to 160) versus window size (Infinite, 2K, 512, 128, 32, 8, 4) for gcc, espresso, li, fpppp, doduc, and tomcatv.]

FIGURE 4.39 The effects of reducing the size of the window. The window is the group of instructions from which an instruction can issue. The start of the window is the earliest uncompleted instruction, while the last instruction in the window is determined by the window size. The instructions in the window are obtained by perfectly predicting branches and selecting instructions until the window is full.

Figures 4.39 and 4.40 show the effects of restricting the size of the window from which an instruction can issue; the only difference in the two graphs is the format; the data are identical. As we can see in Figure 4.39, the amount of parallelism uncovered falls sharply with decreasing window size. Even a window of 32, which would be ambitious in 1995 technology, achieves about one-fifth of the average issue rate of an infinite window. As we can see in Figure 4.40, the integer programs do not contain nearly as much parallelism as the floating-point programs. This is to be expected. Looking at how the parallelism drops off in Figure 4.40 makes it clear that the parallelism in the floating-point cases is coming from loop-level parallelism. The fact that the amount of parallelism at low window sizes is not that different among the floating-point and integer programs implies a structure where there are non-loop-carried dependences within loop bodies, but few loop-carried dependences in programs such as tomcatv. At small window sizes, the processors simply cannot see the instructions in the next loop iteration that could be issued in parallel with instructions from the current iteration.
This is an example of where better compiler technology could uncover higher amounts of ILP, since it can find the loop-level parallelism and schedule the code to take advantage of it, even with small window sizes. Software pipelining, for example, could do this.

We know that large window sizes are impractical, and the data in Figures 4.39 and 4.40 tell us that issue rates will be considerably reduced with realistic windows; thus we will assume a base window size of 2K entries and a maximum issue capability of 64 instructions for the rest of this analysis. As we will see in the next few sections, when the rest of the processor is not perfect, a 2K window and a 64-issue limitation do not constrain the processor.

    Instruction issues per cycle, by window size
    Benchmark   Infinite  512  128  32  8  4
    gcc               55   10   10   8  4  3
    espresso          63   15   13   8  4  3
    li                18   12   11   9  4  3
    fpppp             75   49   35  14  5  3
    doduc            119   16   15   9  4  3
    tomcatv          150   45   34  14  6  3

FIGURE 4.40 The effect of window size shown by each application by plotting the average number of instruction issues per clock cycle. The most interesting observation is that at modest window sizes, the amount of parallelism found in the integer and floating-point programs is similar.

The Effects of Realistic Branch and Jump Prediction

Our ideal processor assumes that branches can be perfectly predicted: The outcome of any branch in the program is known before the first instruction is executed! Of course, no real processor can ever achieve this. Figures 4.41 and 4.42 show the effects of more realistic prediction schemes in two different formats.

[Figure 4.41: line graph of instruction issues per cycle (0 to 60) versus branch prediction scheme (Perfect, Selective predictor, Standard 2-bit, Static, None) for gcc, espresso, li, fpppp, doduc, and tomcatv.]

FIGURE 4.41 The effect of branch-prediction schemes.
This graph shows the impact of going from a perfect model of branch prediction (all branches predicted correctly arbitrarily far ahead) to various dynamic predictors (selective and two-bit), to compile time, profile-based prediction, and finally to using no predictor. The predictors are described precisely in the text.

Our data is for several different branch-prediction schemes varying from perfect to no predictor. We assume a separate predictor is used for jumps. Jump predictors are important primarily with the most accurate branch predictors, since the branch frequency is higher and the accuracy of the branch predictors dominates. The five levels of branch prediction shown in these figures are

1. Perfect: All branches and jumps are perfectly predicted at the start of execution.

2. Selective history predictor: The prediction scheme uses a correlating two-bit predictor and a noncorrelating two-bit predictor together with a selector, which chooses the best predictor for each branch. The prediction buffer contains 2^13 (8K) entries, each consisting of three two-bit fields, two of which are predictors and the third is a selector. The correlating predictor is indexed using the exclusive-or of the branch address and the global branch history. The noncorrelating predictor is the standard two-bit predictor indexed by the branch address. The selector table is also indexed by the branch address and specifies whether the correlating or noncorrelating predictor should be used. The selector is incremented or decremented just as we would for a standard two-bit predictor. This predictor, which uses a total of 48K bits, outperforms both the correlating and noncorrelating predictors, achieving an accuracy of at least 97% for these six SPEC benchmarks. Jump prediction is done with a pair of 2K-entry predictors, one organized as a circular buffer for predicting returns and one organized as a standard predictor and used for computed jumps (as in case statement or computed gotos). These jump predictors are nearly perfect.

3. Standard two-bit predictor with 512 two-bit entries: In addition, we assume a 16-entry buffer to predict returns.

4. Static: A static predictor uses the profile history of the program and predicts that the branch is always taken or always not taken based on the profile, as we discussed in the last chapter.

5. None: No branch prediction is used, though jumps are still predicted. Parallelism is largely limited to within a basic block.

    Instruction issues per cycle, by branch predictor
    Benchmark   Perfect  Selective predictor  Standard 2-bit  Static  None
    gcc              35                    9               6       6     2
    espresso         41                   12               7       6     2
    li               16                   10               6       7     2
    fpppp            61                   48              46      45    29
    doduc            58                   15              13      14     4
    tomcatv          60                   46              45      45    19

FIGURE 4.42 The effect of branch-prediction schemes sorted by application. This graph highlights the differences among the programs with extensive loop-level parallelism (tomcatv and fpppp) and those without (the integer programs and doduc).

Since we do not charge additional cycles for a mispredicted branch, the effect of varying the branch prediction is to vary the amount of parallelism that can be exploited across basic blocks by speculation. Figure 4.42 shows that the branch behavior of two of the floating-point programs is much simpler than the other programs, primarily because these two programs have many fewer branches and the few branches that exist are highly predictable. This allows significant amounts of parallelism to be exploited with realistic prediction schemes. In contrast, for all the integer programs and for doduc, the FP benchmark with the least loop-level parallelism, even the difference between perfect branch prediction and the ambitious selective predictor is dramatic.
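To make the selective history predictor in the list above concrete, here is a minimal Python sketch. Several details are assumptions rather than facts from the study: the counters start in the weakly-not-taken state, the global history register is as wide as the table index, and the selector is moved only when the two component predictors disagree. The names `bump` and `SelectivePredictor` are hypothetical.

```python
def bump(counter, up):
    """Saturating two-bit counter update (values 0 to 3)."""
    return min(3, counter + 1) if up else max(0, counter - 1)


class SelectivePredictor:
    def __init__(self, index_bits=13):
        self.entries = 1 << index_bits          # 2^13 = 8K entries by default
        self.mask = self.entries - 1
        self.simple = [1] * self.entries        # noncorrelating two-bit counters
        self.correlating = [1] * self.entries   # indexed by address XOR history
        self.selector = [1] * self.entries      # >= 2 means trust the correlating one
        self.history = 0                        # global branch history (assumed width)

    def storage_bits(self):
        return self.entries * 3 * 2             # three two-bit fields per entry

    def predict_and_update(self, pc, taken):
        """Predict the branch at pc, train on the outcome, report correctness."""
        i_simple = pc & self.mask
        i_corr = (pc ^ self.history) & self.mask
        p_simple = self.simple[i_simple] >= 2
        p_corr = self.correlating[i_corr] >= 2
        prediction = p_corr if self.selector[i_simple] >= 2 else p_simple
        if p_simple != p_corr:
            # Assumed policy: move toward whichever component was right.
            self.selector[i_simple] = bump(self.selector[i_simple], p_corr == taken)
        self.simple[i_simple] = bump(self.simple[i_simple], taken)
        self.correlating[i_corr] = bump(self.correlating[i_corr], taken)
        self.history = ((self.history << 1) | int(taken)) & self.mask
        return prediction == taken


# A branch that alternates taken / not taken defeats a plain two-bit counter
# started in a weak state, but the history-indexed component learns it.
predictor = SelectivePredictor()
outcomes = [i % 2 == 0 for i in range(1000)]
hits = sum(predictor.predict_and_update(0x40, t) for t in outcomes)
accuracy = hits / len(outcomes)
print(predictor.storage_bits(), accuracy)
```

With the default 13 index bits, `storage_bits()` gives 8K entries of three two-bit fields, matching the 48K bits quoted in the text. The alternating pattern is only a toy workload; it says nothing about accuracy on the SPEC programs.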
Like the window size data, these figures tell us that to achieve significant amounts of parallelism in integer programs, the processor must select and execute instructions that are widely separated. When branch prediction is not highly accurate, the mispredicted branches become a barrier to finding the parallelism.

As we have seen, branch prediction is critical, even with a window size of 2K instructions and an issue limit of 64. For the rest of the studies, in addition to the window and issue limit, we assume as a base an ambitious predictor that uses two levels of prediction and a total of 8K entries. This predictor, which requires more than 150K bits of storage, slightly outperforms the selective predictor described above (by about 0.5–1%). We also assume a pair of 2K jump and return predictors, as described above.

The Effects of Finite Registers

Our ideal processor eliminates all name dependences among register references using an infinite set of virtual registers. While several processors have used register renaming for this purpose, most have only a few extra virtual registers. For example, the PowerPC 620 provides 12 extra FP registers and eight extra integer registers in addition to the 32 FP and 32 integer registers provided for in the architecture; these renaming registers are also used to hold speculative results in the 620, but not in these experiments where speculation is perfect. Figures 4.43 and 4.44 show the effect of reducing the number of registers available for renaming, again using the same data in two different forms. Both the FP and GP registers are increased by the number of registers shown on the axis or in the legend.

[Figure 4.43: line graph of instruction issues per cycle (0 to 60) versus number of registers available for renaming (Infinite, 256, 128, 64, 32, None) for gcc, espresso, li, fpppp, doduc, and tomcatv.]

FIGURE 4.43 The effect of finite numbers of registers available for renaming. Both the number of FP registers and the number of GP registers are increased by the number shown on the x axis. The effect is most dramatic on the FP programs, although having only 32 extra GP and 32 extra FP registers has a significant impact on all the programs. As stated earlier, we assume a window size of 2K entries and a maximum issue width of 64 instructions. Recall that DLX supplies 31 integer registers and 16 FP registers (the base number provided under “None”).

At first, the results in these figures might seem somewhat surprising: you might expect that name dependences should only slightly reduce the parallelism available. Remember though, exploiting large amounts of parallelism requires evaluating many independent threads of execution. Thus, many registers are needed to hold live variables from these threads. Figure 4.43 shows that the impact of having only a finite number of registers is significant if extensive parallelism exists. Although these graphs show a large impact on the floating-point programs, the impact on the integer programs is small primarily because the limitations in window size and branch prediction have limited the ILP substantially, making renaming less valuable. In addition, notice that the reduction in available parallelism is significant even if 32 additional registers are available for renaming, which is more than the number of registers available on any existing processor as of 1995.

While register renaming is obviously critical to performance, an infinite number of registers is obviously not practical. Thus, for the next section, we assume that there are 256 registers available for renaming, far more than any anticipated processor has.
    Instruction issues per cycle, by number of renaming registers
    Benchmark   Infinite  256  128  64  32  None
    gcc               11   10   10   9   5     4
    espresso          15   15   13  10   5     4
    li                12   12   12  11   6     5
    fpppp             59   49   35  20   5     4
    doduc             29   16   15  11   5     5
    tomcatv           54   45   44  28   7     5

FIGURE 4.44 The reduction in available parallelism is significant when fewer than an unbounded number of renaming registers are available. For the integer programs, the impact of having more than 64 registers is not seen here. To use more than 64 registers requires uncovering lots of parallelism, which for the integer programs requires essentially perfect branch prediction.

The Effects of Imperfect Alias Analysis

Our optimal model assumes that it can perfectly analyze all memory dependences, as well as eliminate all register name dependences. Of course, perfect alias analysis is not possible in practice: The analysis cannot be perfect at compile time, and it requires a potentially unbounded number of comparisons at runtime.

[Figure 4.45: line graph of instruction issues per cycle (0 to 60) versus alias analysis technique (Perfect, Global/stack perfect, Inspection, None) for gcc, espresso, li, fpppp, doduc, and tomcatv.]

FIGURE 4.45 The effect of various alias analysis techniques on the amount of ILP. Anything less than perfect analysis has a dramatic impact on the amount of parallelism found in the integer programs, while global/stack analysis is perfect (and unrealizable) for the FORTRAN programs. As we said earlier, we assume a maximum issue width of 64 instructions and a window of 2K instructions.

Figures 4.45 and 4.46 show the impact of three other models of memory alias analysis, in addition to perfect analysis. The three models are

1. Global/stack perfect: This model does perfect predictions for global and stack references and assumes all heap references conflict. This represents an idealized version of the best compiler-based analysis schemes currently in production.
Recent and ongoing research on alias analysis for pointers should improve the handling of pointers to the heap. 2. Inspection—This model examines the accesses to see if they can be determined not to interfere at compile time. For example, if an access uses R10 as a base register with an offset of 20, then another access that uses R10 as a base register with an offset of 100 cannot interfere. In addition, addresses based on registers that point to different allocation areas (such as the global area and the stack area) are assumed never to alias. This analysis is similar to that performed by many existing commercial compilers, though newer compilers can do better through the use of dependence analysis, at least for loop-oriented programs. 3. None—All memory references are assumed to conflict. As one might expect, for the FORTRAN programs (where no heap references exist), there is no difference between perfect and global/stack perfect analysis. The global/stack perfect analysis is optimistic, since no compiler could ever find 330 Chapter 4 Advanced Pipelining and Instruction-Level Parallelism 10 7 gcc 4 3 15 7 espresso 5 5 12 9 li 4 3 Benchmarks 49 49 fpppp 4 3 16 16 doduc 6 4 45 45 tomcatv 5 4 0 5 10 15 20 25 30 35 40 45 50 Instruction issues per cycle Alias analysis Perfect FIGURE 4.46 Global/stack perfect Inspection None The effect of varying levels of alias analysis on individual programs. all array dependences exactly. The fact that perfect analysis of global and stack references is still a factor of two better than inspection indicates that either sophisticated compiler analysis or dynamic analysis on the fly will be required to obtain much parallelism. ILP for Realizable Processors In this section we look at the performance of processors with realistic levels of hardware support that might be attainable in the next five to 10 years. In particular we assume the following fixed attributes: 1. Up to 64 instruction issues per clock with no issue restrictions. 2. 
A selective predictor with 1K entries and a 16-entry return predictor.
3. Perfect disambiguation of memory references done dynamically; this is ambitious but perhaps attainable for small window sizes.
4. Register renaming with 64 additional integer and 64 additional FP registers.

Figures 4.47 and 4.48 show the result for this configuration as we vary the window size. This configuration is still substantially more complex and expensive than existing implementations in 1995. Nonetheless, it gives a useful bound on what future implementations might yield. The data in these figures is likely to be very optimistic for another reason. There are no issue restrictions among the 64 instructions: they may all be memory references. No one would even contemplate this capability in a single processor at this time. Unfortunately, it is quite difficult to bound the performance of a processor with reasonable issue restrictions; not only is the space of possibilities quite large, but the existence of issue restrictions requires that the parallelism be evaluated with an accurate instruction scheduler, making the cost of studying processors with large numbers of issues very expensive.

In addition, remember that in interpreting these results, cache misses and nonunit latencies have not been taken into account, and both these effects will have significant impact (see the Exercises).

[Line chart: instruction issues per cycle (0 to 60) versus window size (Infinite, 256, 128, 64, 32, 16, 8, 4) for gcc, espresso, li, fpppp, doduc, and tomcatv.]

FIGURE 4.47 The amount of parallelism available for a wide variety of window sizes and a fixed implementation with up to 64 issues per clock.

Figure 4.47 shows the parallelism versus window size. The most startling observation is that with the realistic processor constraints listed above, the effect of the window size for the integer programs is not so severe as for FP programs. This points to the key difference between these two types of programs: The
availability of loop-level parallelism in two of the FP programs means that the amount of ILP that can be exploited is higher, but that for integer programs other factors (such as branch prediction, register renaming, and less parallelism to start with) are all important limitations. As we will see in the next section, for today’s speculative machines the actual performance levels are much lower than those shown in Figure 4.47.

[Bar chart: instruction issues per cycle for each benchmark, by window size.]

Benchmark   Infinite   256   128    64    32    16     8     4
gcc               10    10    10     9     8     6     4     3
espresso          15    15    13    10     8     6     4     2
li                12    12    11    11     9     6     4     3
fpppp             52    47    35    22    14     8     5     3
doduc             17    16    15    12     9     7     4     3
tomcatv           56    45    34    22    14     9     6     3

FIGURE 4.48 The amount of parallelism available versus the window size for a variety of integer and floating-point programs with up to 64 arbitrary instruction issues per clock.

Given the difficulty of increasing the instruction rates with realistic hardware designs, designers face a challenge in deciding how best to use the limited resources available on an integrated circuit. One of the most interesting trade-offs is between simpler processors with larger caches and higher clock rates versus more emphasis on instruction-level parallelism with a slower clock and smaller caches. The following Example illustrates the challenges.

EXAMPLE Consider the following three hypothetical, but not atypical, processors, which we run with the SPEC gcc benchmark:

1. A simple DLX pipe running with a clock rate of 300 MHz and achieving a pipeline CPI of 1.1. This processor has a cache system that yields 0.03 misses per instruction.
2. A deeply pipelined version of DLX with slightly smaller caches and a 400 MHz clock rate. The pipeline CPI of the processor is 1.5, and the smaller caches yield 0.035 misses per instruction on average.
3. A speculative superscalar with a 32-entry window.
It achieves 75% of the ideal issue rate measured for this window size. (Use the data in Figure 4.47 on page 331.) This processor has the smallest caches, which leads to 0.05 misses per instruction. This processor has a 200 MHz clock.

Assume that the main memory time (which sets the miss penalty) is 200 ns. Determine the relative performance of these three processors.

ANSWER First, we use the miss penalty and miss rate information to compute the contribution to CPI from cache misses for each configuration. We do this with the following formula:

$$\text{Cache CPI} = \text{Misses per instruction} \times \text{Miss penalty}$$

We need to compute the miss penalties for each system:

$$\text{Miss penalty} = \frac{\text{Memory access time}}{\text{Clock cycle}}$$

The clock cycle times for the processors are 3.33 ns, 2.5 ns, and 5 ns, respectively. Hence, the miss penalties are

$$\text{Miss penalty}_1 = \frac{200\ \text{ns}}{3.33\ \text{ns}} = 60\ \text{cycles}$$

$$\text{Miss penalty}_2 = \frac{200\ \text{ns}}{2.5\ \text{ns}} = 80\ \text{cycles}$$

$$\text{Miss penalty}_3 = \frac{200\ \text{ns}}{5\ \text{ns}} = 40\ \text{cycles}$$

Applying this for each cache:

$$\text{Cache CPI}_1 = 0.03 \times 60 = 1.8$$

$$\text{Cache CPI}_2 = 0.035 \times 80 = 2.8$$

$$\text{Cache CPI}_3 = 0.05 \times 40 = 2.0$$

We know the pipeline CPI contribution for everything but processor 3; its pipeline CPI is given by

$$\text{Pipeline CPI}_3 = \frac{1}{\text{Issue rate}} = \frac{1}{8 \times 0.75} = \frac{1}{6} = 0.167$$

Now we can find the CPI for each processor by adding the pipeline and cache CPI contributions.
$$\text{CPI}_1 = 1.1 + 1.8 = 2.9$$

$$\text{CPI}_2 = 1.5 + 2.8 = 4.3$$

$$\text{CPI}_3 = 0.167 + 2.0 = 2.167$$

Since this is the same architecture we can compare instruction execution rates to determine relative performance, where CR is the clock rate:

$$\text{Instruction execution rate} = \frac{\text{CR}}{\text{CPI}}$$

$$\text{Instruction execution rate}_1 = \frac{300\ \text{MHz}}{2.9} = 103\ \text{MIPS}$$

$$\text{Instruction execution rate}_2 = \frac{400\ \text{MHz}}{4.3} = 93\ \text{MIPS}$$

$$\text{Instruction execution rate}_3 = \frac{200\ \text{MHz}}{2.167} = 92\ \text{MIPS}$$

So the simplest design is the fastest. Of course, the designer building either system 2 or system 3 will probably be alarmed by the large fraction of the system performance lost to cache misses. In the next chapter we’ll see the most common solution to this problem: adding another level of caches.

Before we move to the next chapter, let’s see how some of the advanced ideas in this chapter are put to use in a real processor.

4.8 Putting It All Together: The PowerPC 620

The PowerPC 620 is an implementation of the 64-bit version of the PowerPC architecture; this implementation embodies many of the ideas discussed in section 4.6, including out-of-order execution and speculation. It is very similar to several other processors that provide this facility, including the MIPS R10000 and the HP PA 8000, and somewhat more ambitious in organization than other multiple-issue processors, such as the Alpha 21164 and UltraSPARC. The PowerPC 620 and 604 are very similar. The 604 implements only the 32-bit instruction set and provides fewer buffers; its overall organization, however, is essentially identical to that of the 620.

The structure of the PowerPC 620 is shown in Figure 4.49. The 620 can fetch, issue, and complete up to four instructions per clock. There are six separate execution units, each of which can initiate execution independently from its own reservation stations.
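The arithmetic in the three-processor example that closes section 4.7 can be replayed in a few lines of code. The following is a minimal sketch in Python that uses only the clock rates, pipeline CPIs, miss rates, and 200 ns memory time stated in that example; the names `PROCESSORS`, `mips`, and `RESULTS` are illustrative and do not come from the text.

```python
# Replays the three-processor comparison from the example in section 4.7.
# All numeric inputs are the values stated in the text; the names are illustrative.
MEMORY_TIME_NS = 200.0

# name: (clock rate in MHz, pipeline CPI, misses per instruction)
PROCESSORS = {
    "simple DLX": (300.0, 1.1, 0.03),
    "deeply pipelined DLX": (400.0, 1.5, 0.035),
    # Pipeline CPI = 1 / (ideal issue rate of 8 * 75% achieved)
    "speculative superscalar": (200.0, 1.0 / (8 * 0.75), 0.05),
}


def mips(clock_mhz, pipeline_cpi, misses_per_instr, memory_ns=MEMORY_TIME_NS):
    """Instruction execution rate in MIPS, including cache-miss stalls."""
    cycle_ns = 1000.0 / clock_mhz               # clock cycle time
    miss_penalty = memory_ns / cycle_ns         # miss penalty in cycles
    cache_cpi = misses_per_instr * miss_penalty
    total_cpi = pipeline_cpi + cache_cpi
    return clock_mhz / total_cpi                # MHz / CPI = MIPS


RESULTS = {name: mips(*params) for name, params in PROCESSORS.items()}

if __name__ == "__main__":
    for name, rate in RESULTS.items():
        print(f"{name}: {rate:.0f} MIPS")
```

Run as a script, this prints 103, 93, and 92 MIPS, matching the hand calculation. Changing `MEMORY_TIME_NS` shows how sensitive the ranking is to the miss penalty.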
The six units are as follows:

■ Two simple integer units, XSU0 and XSU1, which handle simple integer operations, such as add, subtract, and simple logical operations. All operations here take a single cycle.

■ One complex integer function unit, MCFXU, which handles integer multiply and divide. Operations in this unit have a latency of 3 to 20 clock cycles and provide early availability for multiplies with short input values. The operations in this unit vary from being fully pipelined (for multiplies with short integer values) to unpipelined (for integer divide).

■ One load-store unit, LSU, which handles loads and stores and has an execution latency for integer loads of 1 cycle and for FP loads of 2 cycles. The LSU is fully pipelined and has its own effective address adder. The LSU contains both load and store buffers and allows loads to bypass pending stores by checking for address conflicts once the effective address of both instructions is known. The load and store buffers hold requests once the effective address calculation is completed. The load buffer simply holds the effective address until the cache access can be completed, whereupon the result is written to the GP or FP result buses. The store buffer is actually two separate queues: one that holds the effective address of the target until the data are available, and a second that holds both the effective address and the data until the store is ready to commit, which happens in order. When the store is ready to commit, the store buffer sends the data to the cache and frees the buffer entry. The cache has two banks so that a load and a store to separate banks can proceed in parallel. When a load causes a cache miss, the load is moved to a single-entry buffer that holds the pending load until the miss is handled. Other loads and stores can be processed at this point, and if the requests hit in the cache, the instructions can complete execution.
Because there is a single-entry buffer, when a second instruction misses, it is returned to the reservation station. This allows up to four cache misses to occur (one in the buffer and three in the reservation stations) before the load-store unit is completely stalled. This capability, called nonblocking, is described in more detail in Chapter 5.

■ One floating-point unit, FPU, which has a latency for use of its result by another floating-point operation of 2 cycles for multiply, add, or multiply-add and 31 clock cycles for DP FP divide. The FPU is fully pipelined except for divide.

[Block diagram: fetch unit, instruction cache, dispatch unit with 8-entry instruction queue, completion unit with reorder buffer, GP and FP register files, reservation stations feeding XSU0, XSU1, MCFXU, LSU, FPU, and BPU, the data cache, and the instruction dispatch, operand, result, and result status buses.]

FIGURE 4.49 The PowerPC 620 has six different functional units, each with its own reservation stations and a 16-entry reorder buffer, contained in the instruction completion unit. Renaming is implemented for both the integer and floating-point registers, and the additional renaming registers are part of the respective register files. The condition register used for branches (see Appendix C for a description of conditional branches in the PowerPC architecture) is a 32-bit register grouped as a set of eight 4-bit fields. The BPU provides an additional 16 rename buffers that can each rename one 4-bit field. The condition register and rename buffers are inside the BPU and hence are not shown separately. All the major data flows in the processor are shown here, but not all the control signals. The load and store buffers are not shown, but are present inside the LSU.
■ One branch unit, BPU, which completes branches and informs the fetch unit of mispredictions. The branch unit includes the condition register, used for conditional branches in the PowerPC architecture. The branch unit allows branches to be evaluated independently of the rest of the instructions. In particular, branches do not take issue slots or cycles in the other functional units. When condition registers are set early enough, conditional branches can be executed in parallel with no additional delay.

The 620 operates much like the speculative processor we saw in section 4.6, with one major extension: The register set is extended with a set of renaming registers. These are used to hold speculative results until the instruction commits, at which time the result is written from the renaming registers to the standard integer or floating-point registers. The reorder buffer, which is part of the completion unit, does not contain the speculative results, but only the information needed to complete the instruction when it commits. The primary advantage of this scheme, which is similar to the one used in the MIPS R10000, is that all the operands are available from a single location: the extended register file, consisting of the architectural plus renaming registers.

In the 620, there are eight extra integer and 12 extra FP registers. When an instruction issues, it is allocated a rename register; when execution completes, the result is written into the rename register; and when it commits, the result is moved from the rename register to one of the architected registers. With the available rename registers, at most eight integer and 12 FP instructions can be in flight.
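The life cycle of a rename register described above (allocated at issue, written when execution completes, copied to an architected register and freed at commit) can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the 620's actual control logic: the class `RenamePool` and its method names are hypothetical, and only the pool size of eight integer rename registers comes from the text.

```python
from collections import deque


class RenamePool:
    """Toy rename-register pool (hypothetical; not the 620's real logic)."""

    def __init__(self, size):
        self.free = deque(range(size))  # rename registers available at issue
        self.in_flight = deque()        # allocated entries, in program order

    def issue(self, arch_reg):
        """Allocate a rename register; return None if issue must stall."""
        if not self.free:
            return None
        reg = self.free.popleft()
        self.in_flight.append(
            {"reg": reg, "arch": arch_reg, "value": None, "done": False}
        )
        return reg

    def complete(self, reg, value):
        """Execution finished: write the result into the rename register."""
        for entry in self.in_flight:
            if entry["reg"] == reg:
                entry["value"], entry["done"] = value, True
                return
        raise KeyError(f"rename register {reg} is not in flight")

    def commit(self, arch_regs):
        """Commit the oldest instruction, in order, if it has completed."""
        if not self.in_flight or not self.in_flight[0]["done"]:
            return False
        entry = self.in_flight.popleft()
        arch_regs[entry["arch"]] = entry["value"]  # move to architected state
        self.free.append(entry["reg"])             # register can be reused
        return True


pool = RenamePool(size=8)                 # eight extra integer registers
tags = [pool.issue(r) for r in range(9)]  # the ninth issue gets None (stall)
```

The ninth `issue` call returns `None` because all eight rename registers are held by uncommitted instructions, which is the in-flight limit the paragraph describes. Committing the oldest completed instruction frees a register and lets issue proceed again.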
Operands are still read into reservation stations, as soon as they are available, either from the register file when the instruction is dispatched or from the result buses, the counterpart to the CDB (the Common Data Bus used in Tomasulo’s scheme), when the operand is produced.

The instructions flow through a pipeline that varies from five to seven clock cycles in typical cases and much longer for operations like divide, which are not pipelined. All instructions pass through the following pipe stages:

1. Fetch: Loads the decode queue with instructions from the cache and determines the address of the next instruction. A 256-entry two-way set-associative branch-target buffer is used as the first source for predicting the next fetch address. There is also a 2048-entry branch-prediction buffer used when the branch-target buffer does not hit but a branch is present in the instruction stream. Both the target and prediction buffers are updated, if necessary, when the instruction completes using information from the BPU. In addition, there is a stack of return address registers used to predict subroutine returns.

2. Instruction decode: Instructions are decoded and prepared for issue. All time-critical portions of decode are done here. The next four instructions are passed to the next pipeline stage.

3. Instruction issue: Issues the instructions to the appropriate reservation station. Operands are read from the register file in this stage, either into the functional unit or into the reservation stations. A rename register is allocated to hold the result of the instruction and a reorder buffer entry is allocated to ensure in-order completion. In some speculative and dynamically scheduled machines, this process is called dispatch, rather than issue. We use the term issue, since the process corresponds to the issue process of the CDC 6600, the first dynamically scheduled machine.

4.
Execution: This stage proceeds when the operands are all available in a reservation station. One of six functional units executes the instruction. The simple integer units XSU0 and XSU1 have a one-stage execution pipeline. The MCFXU has a pipeline depth of between one and three, though integer divide is not fully pipelined and takes more clock cycles (a total of 20 cycles). The FPU has a three-stage pipeline, while the LSU has a two-cycle pipeline. At the end of execution, the result is written into the appropriate result bus and from there into any reservation stations that are waiting for the result, as well as into the rename buffer allocated for this instruction. The completion unit is notified that the instruction has completed. If the instruction is a branch, and the branch was mispredicted, the instruction fetch unit and completion unit are notified, causing instruction fetch to restart at the corrected address and causing the completion unit to discard the speculated instructions and free the rename buffers holding speculated results.

When an instruction moves to the functional unit, we say that it has initiated execution; some machines use the term issue for this transition. An instruction frees up the reservation station when it initiates execution, allowing another instruction to issue to that station. If the instruction is ready to execute when it first issues to the reservation station, it can initiate on the next clock cycle, freeing up the reservation station. In such cases, the instruction effectively spends no time in the reservation station: it acts simply as a latch between stages. When an instruction has finished execution and is ready to move to the next stage, we say it has completed execution.

5. Commit: This occurs when all previous instructions have been committed. Up to four instructions may complete per cycle. The results in the rename register are written into the register file and the rename buffer freed.
Upon completion of a store instruction, the LSU is also notified, so that the corresponding store buffer may be sent to the cache. Some machines use the term instruction completion for this stage. In a small number of cases, an extra stage may be added for write backs that cannot complete during commit because of a shortage of write ports.

Figure 4.50 shows the basic structure of the PowerPC 620 pipeline and how the stages are connected by buffers, allowing one stage to slip with respect to another. When an instruction commits, all information about that instruction is removed and its results are written into architecturally visible state (registers, PC, or memory).

[Pipeline diagram: the fetch, issue, execute, and commit stages, linked by the instruction buffer, the reservation stations (one set per functional unit: XSU0, XSU1, MCFXU, LSU, FPU, BPU), the reorder buffer, and the rename registers.]

FIGURE 4.50 The pipeline stages of the 620 are linked with a set of buffers, which are shown in grey. These buffers allow slippage between stages of the pipeline. For example, the fetch stage places instructions in the instruction buffer where they are removed by the issue stage. The buffers limit the slippage: If the buffer fills, the stage filling the buffer must stall; if the buffer empties, the stage emptying the buffer must stall. The reservation stations, each of which is associated with a particular functional unit, and the reorder buffer link issue with the rest of the pipeline. The rename registers are used for results by the execute stage, until the commit unit writes the renamed register to an architectural register. The data cache is essentially part of the load-store unit (LSU). Unless a stall occurs, instructions spend at most one cycle in each stage, except for execute.

Performance of the PowerPC 620 Pipeline

In this section we look at the performance characteristics of the PowerPC 620, examining the critical factors that determine performance.
We use seven of the SPEC92 benchmarks in this evaluation: compress, eqntott, espresso, li, alvinn, hydro2d, and tomcatv.

Before we start, we need to understand what it means to stall a multiple-issue processor with dynamic scheduling and speculation. Let’s start with the multiple-issue part. In a simple single-issue pipeline, the number of instructions completing in a clock cycle is 0 or 1, and the instruction portion of the CPI ratio for a given clock cycle either increases by 0, in which case a stall occurred, or increases by 1, in which case a stall did not occur. In a multiple-issue processor, the pipeline may be partially stalled, completing fewer instructions than its maximum capability. For example, in the 620 up to four instructions may be completed per clock cycle. Thus a stall is no longer binary: the contribution to the denominator of the CPI for a given clock cycle may vary from 0 to 4. Clearly, the processor is stalled when the contribution is 0, and not stalled when the contribution is 4; in between, the processor is partly stalled since the CPI corresponding to that cycle cannot reach its ideal value of 0.25. To keep this clear, we will focus on what fraction of the instruction slots are empty. If 50% of the instruction slots are empty at instruction commit in a given clock cycle, then two instructions commit that clock cycle, and the CPI for that clock cycle is 0.5. For multiple-issue machines, it is convenient to use IPC (instructions per clock) as the metric, rather than its reciprocal, CPI. We follow this practice in the measurements.

As a further complication, the dynamic scheduling in the pipeline means that we cannot simply track empty instruction slots down the pipeline in a rigid fashion. Once instructions reach the execution stage, dynamic scheduling can reorder the instructions.
The instruction throughput, however, cannot increase as instructions proceed down the pipeline: If the issue unit processes only two instructions in a given cycle, at least two empty instruction slots will appear in a later clock cycle at the instruction commit stage. These commit slots may come up at different times, but they must appear. No stage can exceed the instruction processing rate achieved by earlier stages. In fact, because of the imposition of additional constraints as instructions flow down the pipeline, we can expect that each stage of execution will somewhat decrease the throughput. For the 620, this effect appears minor, primarily because the execute stage is wider than the issue stage and because instruction commit has few constraints. Because of the buffering between stages, the performance is limited by the stage that falls behind on any given clock cycle. This means that empty instruction slots created by a given unit that would decrease performance can actually be hidden by the presence of stalls that create additional slots in another unit. For example, the instruction fetch unit may provide only three instructions on a given cycle, leaving one instruction slot empty. If, however, the issue unit processes only two instructions, then the lost slot generated by the fetch unit is essentially hidden by the issue unit. From a designer’s viewpoint, we would like to place the burden for the partial stall on the issue unit. Notice that in doing so, we cannot conclude that eliminating the empty slots by redesigning the issue unit will improve the CPI by the corresponding amount. Instead, it will expose the empty slots in the fetch unit. Such interactions are a major source of complexity in the design and performance analysis of pipelines that provide buffering between stages. As a final complication, remember that the buffers are finite. 
As a result, if a given stage is stalled sufficiently, it also affects the earlier stages, since they will have to stall when the buffers are full. The buffers provide for limited slippage between stages. The goal is that the total number of empty instruction slots is less than the sum of the number of empty slots generated by each unit.

In looking at the 620 performance data, we will focus on the instruction throughput of the issue stage as the critical performance measurement. Focusing on issue makes sense for two reasons. First, it is a good measure of steady-state performance, since in equilibrium instructions cannot issue faster than they execute or commit. Second, the issue stage is the location of some key bottlenecks that are common in many dynamically scheduled machines. Although we will focus on the issue stage, both the fetch stage and the execute stage affect the performance of instruction issue since the fetch and execute stages are responsible for filling the input buffer and emptying the output buffer, respectively, of the issue stage. Thus, we examine the ability of the fetch and execute stages to prevent a stall in the issue stage. Figure 4.51 gives a preview of the pipeline performance, showing how the difference between the ideal IPC (4) and the actual IPC (1.2–1.3) is distributed to the various pipeline stages. We investigate this difference and its causes in more detail in this section.

[Bar chart: instructions per clock (0.0 to 4.0) at each pipe stage (Ideal IPC, Fetch IPC, Issue IPC, Execute IPC, Commit IPC), shown for the integer average and the FP average.]

FIGURE 4.51 An overview of the performance of the 620 pipeline showing the IPC at each pipe stage. The ideal IPC is 4. Losses occurring in fetch, primarily due to branch mispredict, bring the IPC down to 3.6 on average.
The issue stage incurs stalls from both limitations in the issue structure and a mismatch between functional unit capacity and need. After issue the IPC is about 1.8. Losses occurring due to a lack of ILP and finite buffers cause the execute stage to back up. This eventually leads to a stall in the issue stage, but we count it in the execute stage. By the end of execute, the IPC is between 1.2 and 1.3. More detailed versions of these data appear throughout this section.

In a machine with speculation, the processor can be active doing work that is later discarded. In examining the performance we ignore such instructions, since they do not contribute to useful work. In particular, we charge the fetch stage for mispredicted branches and do not count stalls for such instructions in the later stages. Notice that incorrect speculation can reduce performance by competing for resources against instructions that must be completed, but we do not expect such effects to be large in well-designed machines. This downside to speculation puts increased importance on the accuracy of branch prediction.

After looking at the performance of the various stages we summarize the overall processor performance. The data examined in this section all come from measurements made on a PowerPC 620 simulator described by Diep, Nelson, and Shen [1995].

Performance of the Fetch Stage

The instruction fetch stage fetches up to four instructions per cycle and places them into the eight-entry instruction buffer. This stage can become a bottleneck whenever it cannot keep at least four instructions in the instruction buffer. Notice that if the instruction buffer has at least four instructions, then the failure of the fetch stage to return four instructions cannot be seen as a stall. In fact, if the buffer is completely full, then downstream pipe stages must be stalled and the fetch unit can do nothing but wait.
On average, the fetch stage is able to keep 5.2 instruction buffers full and is often not a limit on performance. Fetch does limit performance whenever it does not have four instructions available for issue. There are three circumstances under which the fetch unit can fail to keep the instruction buffer full:

1. A branch misprediction: No useful instructions are added to the buffer; the fetch unit is effectively stalled. In reality, instructions are added to the buffer, but since the instructions come from the wrong path, we count this as a stall, treating it as if no instructions were placed in the buffer. This is the dominant cause of a complete stall in the instruction fetch unit, since it can lead to an effectively empty instruction buffer. Branch mispredict is a much more serious problem for the selected integer programs, where the mispredict rate is 10%, than for the selected FP programs, where the rate is about 3%. This effect shows up clearly in Figure 4.51, where it is the dominant cause of the difference in throughput of the fetch stage for the integer and FP programs. In fact, 75% of the loss in fetch for the integer programs arises from having an empty buffer.

2. An instruction cache miss: No instructions are added to the buffer; the fetch unit is completely stalled. With the programs chosen and the assumption about a perfect off-chip cache, I-cache misses are not a serious problem.

3. Partial cache line fill: The next group of four instructions crosses a cache block, and only the instructions on the same cache block, which is 8 words long, are fetched. This effect can be significant when branch targets are in the middle of cache blocks. It is a major contributor to having 1–3 buffers full and is responsible for most of the throughput loss in the fetch stage of the FP programs.

Figure 4.52 shows the contribution of these factors to the total effective loss of instruction slots by the fetch unit.
On average the integer benchmarks lose 15% (0.6 out of 4.0) of their peak performance, while the FP benchmarks lose 5% (0.2 out of 4.0).

[Stacked bar chart: components of IPC at fetch (0.0 to 4.0) for compress, eqntott, espresso, li, alvinn, hydro2d, and tomcatv, divided into effective IPC delivered, mispredict loss, cache miss loss, and partial line loss.]

FIGURE 4.52 The average number of instructions that the fetch unit can provide to the issue unit varies between 3.2 and 4, with an average of 3.4 for the integer benchmarks and 3.8 for the FP benchmarks. This means that the fetch stage loses about 0.6 IPC for integer programs and 0.2 IPC for FP programs. These data are computed by determining how often the instruction buffer has 0 through 3 instructions and weighting the frequency by the issue potential that is lost, which is the difference between 4 and the number of entries in the buffer. The portion of the ideal IPC of 4 lost to each of the three causes is shown; these data make the assumption that the timing of one of these events is independent of the state of the instruction buffer. All of the measurements in this section include the effects of the on-chip cache misses, assuming the presence of another level of cache off-chip with a 100% hit rate. The miss penalty to the off-chip cache is 8 cycles, which is probably slightly optimistic. Multilevel cache structures are discussed in detail in the next chapter; the assumption of 100% hits in the next level has only a small effect on the SPEC benchmarks.

Instruction Issue

Instruction issue tries to send four instructions to the appropriate reservation stations on every clock cycle. We measure and analyze the performance of the 620 by focusing on the instruction issue stage. Instruction issue can fail to process four instructions for two primary reasons.
First, there are limitations in the instruction issue processing stage where certain combinations of instructions cannot simultaneously issue. Second, lack of progress in the execution and completion stages leads to the unavailability of buffers that are required to issue an instruction.

Because instruction issue is in order, the first event that prevents issuing an instruction terminates the issue packet. Thus, if the conflicts that prevent issue were uniformly distributed across the four potentially issuing instructions, the average number of instruction issues would be given by $p + p^2 + p^3 + p^4$, where $p$ is the probability that any one instruction can issue. If the probability of not issuing is significant, the average number of issues per clock drops quickly, as shown in Figure 4.53.

Probability (cannot issue     Probability (issue a       Average number of
a given instruction) = 1 – p  given instruction) = p     instruction issues
0.1                           0.9                        3.1
0.2                           0.8                        2.4
0.3                           0.7                        1.8
0.4                           0.6                        1.3
0.5                           0.5                        0.9

FIGURE 4.53 Number of instruction issues out of four possible issues. p is the probability that any one instruction can issue. The first instruction to stall ends the issue packet.

This clearly shows the importance of preventing unnecessary stalls in instruction issue. Five possible conflicts can prevent an instruction from being issued:

1. No reservation station available: There is no reservation station of the appropriate type available.
2. No rename registers are available.
3. Reorder buffer is full.
4. Two operations to the same functional unit: The reservation stations in front of each functional unit share a single write port, so only one operation can issue to the reservation stations for a unit in a clock cycle.
5. Miscellaneous conflicts: Includes shortages of read ports for the registers, conflicts that occur when special registers are accessed, and serialization imposed by special instructions.
The last of the miscellaneous conflicts (serialization imposed by special instructions) is quite rare and essentially never occurs in the SPEC benchmarks (less than 0.01%). The use of special registers is significant only in li, while register port shortages occur for both tomcatv and hydro2d. These three classes of stalls are combined, but only one of the two primary types is significant in the benchmarks. The first three of the five conflicts arise because the execution or completion stages have not processed instructions that were previously issued; the last two conflicts are internal limitations in the implementation. Figure 4.54 shows the reduction in IPC because of these cases.

[Figure 4.54: bar chart of IPC at the issue stage (0.0 to 2.5) for compress, eqntott, espresso, li, alvinn, hydro2d, and tomcatv. Legend: issue stage IPC, no reservation station, no rename registers, full reorder buffer, FU conflict, miscellaneous conflict.]

FIGURE 4.54 The IPC throughput rate for the issue stage is arrived at by subtracting stalls that arise in issue from the IPC rate sustained by the fetch stage. The top of each bar shows the IPC from the fetch stage, while the bottom section of each bar shows the effective IPC after the issue stage. The difference is divided between two classes of stalls, those that arise because the later stages have not freed up buffers and those that arise from an implementation limitation in the issue stage (the FU and miscellaneous conflicts). Multiple potential stalls can arise in the same clock cycle for the same instruction. We count the stall as arising from the first cause in the following order: miscellaneous stalls, no reservation station, no rename buffers, no reorder buffer entries, and FU conflict. Because instruction issue is in order, the first of these conflicts that occurs when examining the instructions in order limits the instruction issue count for that clock cycle.
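The in-order rule just described is what produces the expression behind Figure 4.53, and it can be checked numerically. The following is a minimal Python sketch, assuming (as that figure does) that each of the four issue slots can issue independently with the same probability p; the function name `expected_issues` and its `width` parameter are illustrative and do not come from the text. The k-th instruction in a packet issues only if it and every instruction ahead of it can issue, which happens with probability p^k, so the expected count is the sum of p^k for k = 1 to 4.

```python
def expected_issues(p: float, width: int = 4) -> float:
    """Expected instructions issued per clock under in-order issue.

    Assumption: every slot can issue independently with probability p,
    and the first slot that cannot issue ends the issue packet.
    """
    # Slot k issues only if slots 1..k can all issue: probability p**k.
    return sum(p ** k for k in range(1, width + 1))


if __name__ == "__main__":
    # Same stall probabilities as the rows of Figure 4.53.
    for q in (0.1, 0.2, 0.3, 0.4, 0.5):
        p = 1.0 - q
        print(f"cannot issue = {q:.1f}  can issue = {p:.1f}  "
              f"average issues = {expected_issues(p):.1f}")
```

Rounded to one decimal place, the values agree with the last column of Figure 4.53. Real conflicts are not uniformly distributed across the four slots, so this is only the idealized model, not a prediction for the 620.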
Figure 4.55 shows the same data as Figure 4.54, but organized to show how often various events are responsible for issuing fewer than four instructions per cycle. More than one of these events can occur for a given instruction in a clock cycle. The data in Figure 4.54 assume that the cause is associated with the first event that triggers the stall in the order given above. The dominant cause of stalls is lack of available buffers to issue to (with an average of 54%, this occurs on slightly more than one-half of the cycles), with reservation stations accounting for the largest cause of shortage (33% of the cycles).

Benchmark      No stalls:   Stall: no      Stall: no       Stall: no        Total stalls:   Stall: 2        Misc.    Total issue
               4 issues     res. station   rename buffer   reorder buffer   no buffers      instrs. to FU   stall    limit stalls
compress       24%          36%            24%             6%               66%             10%             0%       10%
eqntott        41%          22%            8%              4%               34%             21%             4%       25%
espresso       33%          32%            14%             2%               48%             18%             1%       19%
li             31%          34%            17%             4%               55%             11%             3%       14%
alvinn         31%          23%            1%              21%              45%             24%             0%       24%
hydro2d        17%          43%            17%             8%               68%             12%             3%       15%
tomcatv        6%           37%            34%             9%               80%             7%              7%       14%
Integer avg.   32%          31%            16%             4%               51%             15%             2%       17%
FP avg.        34%          28%            10%             8%               46%             19%             2%       21%
Total avg.     28%          33%            12%             9%               54%             16%             2%       18%

FIGURE 4.55 The sources of all stalls in the issue unit are shown in three broad groups. The first category shows the frequency that four instructions are issued, i.e., no stalls are incurred. The second group shows the frequency of stalls due to full buffers, with the last column totaling the frequency of full buffer stalls. This group arises because the execution and commit stages have failed to complete instructions, which would free up buffers. We will examine the reasons for lack of progress in the execute stage in the next section. The last group consists of stalls due to restrictions in the issue stage, and the last column sums the two types of these stalls.
As in Figure 4.54, there may be multiple reasons for stalling an instruction, so the stall is counted according to the guidelines in Figure 4.54. Notice that the number of cycles where no stalls occur varies widely from 6% to 41%; likewise, in many cases (35% on average) zero instructions issue. This frequency also varies widely from 18% for alvinn to 45% for li and hydro2d.

Performance of the Execution Stage

Once instructions have issued, they wait at the assigned reservation station until the functional unit and the operands are available, whereupon the instruction initiates execution at a functional unit. There are six different functional units allowing up to six initiations per clock. Until an instruction in a reservation station initiates execution, the buffers for the instruction are occupied, potentially causing a stall in the issue stage. An instruction at a reservation station may be delayed for four reasons:

1. Source operand unavailable: One of the source operands is not yet ready.
2. Functional unit unavailable: Another instruction is using the functional unit. For fully pipelined units, this happens only when two instructions are ready to initiate on the same clock cycle, but for unpipelined units (integer multiply/divide, FP divide), the functional unit blocks further initiation until the operation completes.
3. Out-of-order disallowed: Both the branch unit and the FP unit require that instructions initiate in order. Thus, an instruction may be stalled until its predecessor initiates. This is a limitation of the execution unit.
4. Serialization: A few instructions require totally in-order execution. For example, instructions that access the non-renamed special registers must execute totally in order. Such instructions wait at the reservation station for all prior instructions to commit.

Full buffer stalls in the issue stage are responsible for a loss of 1.6 IPC for the integer programs and 2.0 IPC in the FP programs.
We can decompose this loss into the four components above, if we make the assumption that initiating execution for any instruction will free an equivalent number of buffers, allowing issue to continue. Figure 4.56 shows this distribution.

[Figure 4.56: bar chart of total buffer-full stalls in issue (0.0 to 2.5) for compress, eqntott, espresso, li, alvinn, hydro2d, and tomcatv. Legend: source not ready, FU busy, in-order forced, serialization.]

FIGURE 4.56 The stalls in the issue stage because of full buffers (reservation station, rename registers, or reorder buffer) can be attributed to lack of progress in the execution unit. An occupied reservation station fails to begin execution for one of the four reasons shown above. The frequency of these events is used to attribute the total number of full buffer stalls from the issue stage.

When the issue stage stalls and an instruction in a reservation station fails to initiate for one of the four reasons shown above, a designer could contemplate one of three possible reasons:

1. If the source operand is not available and issue stalls because the buffers are full, this indicates that the amount of instruction-level parallelism available, given the limited window size dictated by the buffering, is insufficient. If the code had more parallelism, then fewer reservation stations would have to wait for results, leading to more initiations and a need for fewer buffers. Alternatively, the designer could increase the number of buffers, leading to a larger window and possibly increased instruction-level parallelism.
2. If the instruction in a reservation station does not initiate because the FU is in use and issue is also stalled, then the basic problem is that the FU capacity is not sufficient to handle the dynamic instruction distribution, at least at that point in the execution.
Increasing the number of functional units of different classes would help, as would increasing the pipelining in the unpipelined units. For example, from Figure 4.56 and a knowledge of the instruction distribution, we can see the load-store FU is overcommitted in compress.
3. The final two reasons for a reservation station not initiating (out-of-order disallowed and serialization) are both execution-stage implementation choices, which could be eliminated by reorganizing the execution unit, though an alternative structure might have other drawbacks.

Performance of Instruction Commit

Instruction commit is totally stalled only when the instruction at the head of the reorder buffer has not completed execution. The failure to commit instructions during the cycle can eventually lead to a full reorder buffer, which in turn stalls the instruction issue stage. Instruction commit is basically limited by instruction issue and execute. In some infrequent situations, a lack of write-back ports (there are four integer write ports and two FP write ports) can also lead to a partial stall in instruction commit. Like execution stalls, a completion stall leads to not freeing up rename registers and reorder buffer entries, which can lead to a stall in the issue stage. Completion stalls, however, are very infrequent: on average, execution stalls are seven times more frequent for the FP programs and 100 times more frequent for the integer programs. As a result, instruction commit is not a bottleneck.

Summary: Overall Performance

From the data in earlier figures, we can determine that the IPC runs from just under 1 to just under 1.8 for these benchmarks. The gap between the effective IPC and the ideal IPC of 4.0 can be viewed as three parts:

1. The limitation caused by the functional units: This limitation arises because the 620 does not have four copies of each functional unit. For these benchmarks the bottleneck is the load-store unit.
This loss counts only the average shortage of FU capacity for the entire program. Short-term higher demands for a functional unit are counted as ILP/finite buffer stalls.
2. Losses in specific stages: Fetch, issue, and execute all have losses associated specifically with that stage.
3. Limited instruction-level parallelism and finite buffering: Stalls that arise because of lack of parallelism or insufficient buffering. Cache misses that actually result in stall cycles are counted here; cache misses may occur without generating any stalls.

Figure 4.57 shows how the peak IPC of 4 is divided between the actual IPC (1.0 to 1.8) and the various possible stalls.

[Figure 4.57: stacked bar chart dividing the ideal IPC of 4.0 for compress, eqntott, espresso, li, alvinn, hydro2d, and tomcatv. Legend: actual IPC, FU capacity, fetch limitations, issue limitations, execution limitations, ILP/finite buffer limitations.]

FIGURE 4.57 The breakdown of the ideal IPC of 4.0 into its components. The actual IPC averages 1.2 for the integer programs and 1.3 for the FP programs. The largest difference is the IPC loss due to the functional unit balance not matching the frequency of instructions. Losses in fetch, issue, and execution are the next largest components. ILP and limitations of finite buffering are last. The limits are calculated in this same order, so that the shortage of load-store execution slots is counted as a FU capacity loss, rather than as an ILP/finite buffer loss. Although the ILP/finite buffering limitations are small overall, this arises largely because the other limitations prevent the lack of ILP or finite buffering from becoming overly constraining.

4.9 Fallacies and Pitfalls

Fallacy: Processors with lower CPIs will always be faster.

Although a lower CPI is certainly better, sophisticated pipelines typically have slower clock rates than processors with simple pipelines.
In applications with limited ILP or where the parallelism cannot be exploited by the hardware resources, the faster clock rate often wins. The IBM Power-2 is a machine designed for high-performance FP and capable of sustaining four instructions per clock, including two FP and two load-store instructions; its clock rate was a modest 71.5 MHz. The DEC Alpha 21064 is a dual-issue machine with one load-store or FP operation per clock, but an aggressive 200-MHz clock rate. Comparing the low CPI Power-2 against the high CPI 21064 shows that on a few benchmarks, including some FP programs, the fast clock rate of Alpha leads to better performance (see Figure 4.58). Of course, this fallacy is nothing more than a restatement of a pitfall from Chapter 2 about comparing processors using only one part of the performance equation.

[Figure 4.58: bar chart of SPEC ratio (0 to 900) per SPEC benchmark for the 200-MHz Alpha and the 71.5-MHz Power-2.]

FIGURE 4.58 The performance of the low-CPI Power-2 design versus the high-CPI Alpha 21064. Overall, the 21064 is about 1.1 times faster on integer and 1.4 times faster on FP, indicating that the CPI for the 21064 is 2 to 2.5 times higher than for the Power-2, assuming instruction counts are identical.

Pitfall: Emphasizing a reduction in CPI by increasing issue rate while sacrificing clock rate can lead to lower performance.

The TI SuperSPARC design is a flexible multiple-issue processor capable of issuing up to three instructions per cycle. It had a 1994 clock rate of 60 MHz. The HP PA 7100 processor is a simple dual-issue processor (integer and FP combination) with a 99-MHz clock rate in 1994. The HP processor is faster on all the SPEC benchmarks except two of the integer benchmarks and one FP benchmark, as shown in Figure 4.59.
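The CPI ratio quoted in the caption of Figure 4.58 follows directly from the performance equation (execution time = instruction count × CPI / clock rate). The following is a minimal Python sketch of that arithmetic, assuming identical instruction counts as the caption does; the function name `implied_cpi_ratio` and its parameters are illustrative and not from the text.

```python
def implied_cpi_ratio(clock_fast_mhz: float,
                      clock_slow_mhz: float,
                      speedup_of_fast: float) -> float:
    """CPI of the faster-clocked machine divided by CPI of the slower one.

    Assumption: both machines execute the same number of instructions, so
    time = IC * CPI / clock and
    speedup = (CPI_slow / clock_slow) / (CPI_fast / clock_fast).
    Solving for CPI_fast / CPI_slow gives (clock_fast / clock_slow) / speedup.
    """
    return (clock_fast_mhz / clock_slow_mhz) / speedup_of_fast


if __name__ == "__main__":
    # Alpha 21064 (200 MHz) versus Power-2 (71.5 MHz), speedups from Figure 4.58.
    print("integer:", round(implied_cpi_ratio(200.0, 71.5, 1.1), 2))
    print("FP:     ", round(implied_cpi_ratio(200.0, 71.5, 1.4), 2))
```

The two results come out at roughly 2.5 and 2.0, matching the "2 to 2.5 times higher" CPI stated in the caption. The same arithmetic applies to the 99-MHz PA 7100 and the 60-MHz SuperSPARC in this pitfall, whose clock ratio is 1.65.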
On average, the two processors are close on integer, but the HP processor is about 1.5 times faster on the FP benchmarks. Of course, differences in compiler technology, as well as the processor, could contribute to the performance differences.

[Figure 4.59: bar chart of SPEC ratio (0 to 300) per SPEC benchmark for the HP PA 7100 and the TI SuperSPARC.]

FIGURE 4.59 The performance of a 99-MHz HP PA 7100 processor versus a 60-MHz SuperSPARC. The comparison is based on 1994 measurements.

The potential of multiple-issue techniques has caused many designers to focus on reducing CPI while possibly not focusing adequately on the trade-off in cycle time incurred when implementing these sophisticated techniques. This inclination arises at least partially because it is easier with good simulation tools to evaluate the impact of enhancements that affect CPI than it is to evaluate the cycle time impact. There are two factors that lead to this outcome. First, it is difficult to know the clock rate impact of an approach until the design is well underway, and then it may be too late to make large changes in the organization. Second, the design simulation tools available for determining and improving CPI are generally better than those available for determining and improving cycle time. In understanding the complex interaction between cycle time and various organizational approaches, the experience of the designers seems to be one of the most valuable factors.

Pitfall: Improving only one aspect of a multiple-issue processor and expecting overall performance improvement.

This is simply a restatement of Amdahl’s Law.
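As a reminder of what Amdahl's Law implies here, the following minimal Python sketch computes the overall speedup when only one component of execution time is improved. The 20% fraction is a purely hypothetical number chosen for illustration, not a measurement from this chapter, and the function name `amdahl_speedup` is likewise illustrative.

```python
def amdahl_speedup(fraction_enhanced: float, enhancement_factor: float) -> float:
    """Overall speedup when a fraction of execution time is sped up.

    fraction_enhanced: share of the original time the improvement touches.
    enhancement_factor: how many times faster that share becomes.
    """
    unaffected = 1.0 - fraction_enhanced
    return 1.0 / (unaffected + fraction_enhanced / enhancement_factor)


if __name__ == "__main__":
    # Hypothetical: 20% of execution time is attributable to one mechanism.
    fraction = 0.20
    for factor in (2.0, 10.0, 1e9):
        print(f"component {factor:g}x faster -> overall "
              f"{amdahl_speedup(fraction, factor):.3f}x")
```

Under that assumption, even removing the component's cost almost entirely cannot push the overall speedup past 1 / (1 - 0.2) = 1.25, which is why improving a single stage in isolation tends to disappoint.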
A designer might simply look at a design, see a poor branch prediction mechanism and improve it, expecting to see significant performance improvements. The difficulty is that many factors limit the performance of multiple-issue machines, and improving one aspect of a processor often exposes some other aspect that previously did not limit performance. We can see examples of this in the data on ILP. For example, looking just at the effect of branch prediction in Figure 4.42 on page 325, we can see that going from a standard two-bit predictor to a selective predictor significantly improves the parallelism in espresso (from an issue rate of 7 to an issue rate of 12). However, if the processor provides only 32 registers for renaming, the amount of parallelism is limited to 5 issues per clock cycle, even with a branch prediction scheme better than either alternative. Likewise, improving the performance of the fetch stage of the PowerPC 620, for example, will probably have little impact on the SPEC benchmarks, since the issue and execute stages are significant bottlenecks and the stalls in those stages would probably increase to capture most of the benefit obtained by improving fetch.

4.10 Concluding Remarks

The tremendous interest in multiple-issue organizations came about because of an interest in improving performance without affecting the standard uniprocessor programming model. While taking advantage of ILP is conceptually simple, the design problems are amazingly complex in practice. It is extremely difficult to achieve the performance you might expect from a simple first-level analysis. The trade-offs between increasing clock speed and decreasing CPI through multiple issue are extremely hard to quantify. Although you might expect that it is possible to build an advanced multiple-issue processor with a high clock rate, a factor of 1.5 to 2 in clock rate has consistently separated the highest clock rate processors and the most sophisticated multiple-issue processors.
It is simply too early to tell whether this difference is due to fundamental implementation trade-offs, or to the difficulty of dealing with the complexities in multiple-issue processors, or simply a lack of experience in implementing such processors. One insight that is clear is that the peak-to-sustained performance ratios for multiple-issue processors are often quite large and typically grow as the issue rate grows. Thus, increasing the clock rate by X is almost always a better choice than increasing the issue width by X, though often the clock rate increase may rely largely on deeper pipelining, substantially narrowing the advantage. On the other hand, a simple two-way superscalar that issues FP instructions in parallel with integer instructions can probably be built with little impact on clock rate and should perform better on FP applications and suffer little or no degradation on integer applications.

Whether approaches based primarily on faster clock rates, simpler hardware, and more static scheduling or approaches using more sophisticated hardware to achieve lower CPI will win out is difficult to say and may depend on the benchmarks. At the present, both approaches seem capable of delivering similar performance. Pragmatic issues, such as code quality for existing binaries, may turn out to be the deciding factor.

What will happen to multiple-issue processors in the long term? The basic trends in integrated circuit technology lead to an important insight: The number of devices available on a chip will grow faster than the device speed. This means that designs that obtain performance with more transistors rather than just raw gate speed are a more promising direction. Three other factors limit how far we can exploit this trend, however. One is the increasing delay of interconnections compared with gates, which means that bigger designs will have longer cycle times.
The second factor is the diminishing returns seen when trying to exploit ILP. The last factor is the potential impact of increased complexity on either the clock rate or the design time. Combined, these effects may serve as effective limits to how much performance can be gained by exploiting ILP within a single processor.

The alternative to trying to continue to push uniprocessors to exploit ILP is to look toward multiprocessors, the topic of Chapter 8. Looking toward multiprocessors to take advantage of parallelism overcomes a fundamental problem in ILP processors: building a cost-effective memory system. A multiprocessor memory system is inherently multiported and, as we will see, can even be distributed in a larger processor.

Using multiprocessors to exploit parallelism encounters two difficulties. First, it is likely that the software model will need to change. Second, MP approaches may have difficulty in exploiting fine-grained, low-level parallelism. While it appears clear that using a large number of processors requires new programming approaches, using a smaller number of processors efficiently could be based on compiler approaches. Exploiting the type of fine-grained parallelism that a compiler can easily uncover can be quite difficult in a multiprocessor, since the processors are relatively far apart.

To date, computer architects do not know how to design processors that can effectively exploit ILP in a multiprocessor configuration. Existing high-performance designs are either tightly integrated uniprocessors or loosely coupled multiprocessors. Around the end of this century, it should be possible to place two fully configured processors on a single die. Perhaps this capability will inspire the design of a new type of architecture that allows processors to be more tightly coupled than before, but also separates them sufficiently so that the design can be partitioned and each processor can individually achieve very high performance.
4.11 Historical Perspective and References

This section describes some of the major advances in compiler technology and advanced pipelining and ends with some of the recent literature on multiple-issue processors. The basic concepts, data dependence and its limitation in exploiting parallelism, are old ideas that were studied in the 1960s. Ideas such as data flow computation derived from observations that programs were limited by data dependence. Loop unrolling is a similarly old idea, practiced by early computer programmers on processors with very expensive branches.

The Introduction of Dynamic Scheduling

In 1964 CDC delivered the first CDC 6600. The CDC 6600 was unique in many ways. In addition to introducing scoreboarding, the CDC 6600 was the first processor to make extensive use of multiple functional units. It also had peripheral processors that used a time-shared pipeline. The interaction between pipelining and instruction set design was understood, and the instruction set was kept simple to promote pipelining. The CDC 6600 also used an advanced packaging technology. Thornton [1964] describes the pipeline and I/O processor architecture, including the concept of out-of-order instruction execution. Thornton’s book [1970] provides an excellent description of the entire processor, from technology to architecture, and includes a foreword by Cray. (Unfortunately, this book is currently out of print.) The CDC 6600 also has an instruction scheduler for the FORTRAN compilers, described by Thorlin [1967].

The IBM 360/91 introduced many new concepts, including tagging of data, register renaming, dynamic detection of memory hazards, and generalized forwarding. Tomasulo’s algorithm is described in his 1967 paper. Anderson, Sparacio, and Tomasulo [1967] describe other aspects of the processor, including the use of branch prediction.
Many of the ideas in the 360/91 faded from use for nearly 25 years before being broadly employed in the 1990s.

Branch Prediction Schemes

Basic dynamic hardware branch prediction schemes are described by J. E. Smith [1981] and by A. Smith and Lee [1984]. Ditzel and McLellan [1987] describe a novel branch-target buffer for CRISP, which implements branch folding. McFarling and Hennessy [1986] did a quantitative comparison of a variety of compile-time and runtime branch prediction schemes. Fisher and Freudenberger [1992] evaluated a range of compile-time branch prediction schemes using the metric of distance between mispredictions. The correlating predictor we examine was described by Pan, So, and Rameh in 1992. Yeh and Patt [1992, 1993] have written several papers on multilevel predictors that use branch histories for each branch. McFarling’s competitive prediction scheme is described in his 1993 technical report.

The Development of Multiple-Issue Processors

The concept of multiple-issue designs has been around for a while, though most early processors followed an LIW or VLIW design approach. Charlesworth [1981] reports on the Floating Point Systems AP-120B, one of the first wide-instruction processors containing multiple operations per instruction. Floating Point Systems applied the concept of software pipelining in both a compiler and by hand-writing assembly language libraries to use the processor efficiently. Since the processor was an attached processor, many of the difficulties of implementing multiple issue in general-purpose processors, for example, virtual memory and exception handling, could be ignored. The Stanford MIPS processor had the ability to place two operations in a single instruction, though this capability was dropped in commercial variants of the architecture, primarily for performance reasons.
Along with his colleagues at Yale, Fisher [1983] proposed creating a processor with a very wide instruction (512 bits), and named this type of processor a VLIW. Code was generated for the processor using trace scheduling, which Fisher [1981] had developed originally for generating horizontal microcode. The implementation of trace scheduling for the Yale processor is described by Fisher et al. [1984] and by Ellis [1986]. The Multiflow processor (see Colwell et al. [1987]) was based on the concepts developed at Yale, although many important refinements were made to increase the practicality of the approach. Among these was a controllable store buffer that provided support for a form of speculation. Although more than 100 Multiflow processors were sold, a variety of problems, including the difficulties of introducing a new instruction set from a small company and the competition provided from RISC microprocessors that changed the economics in the minicomputer market, led to failure of Multiflow as a company.

Around the same time, Cydrome was founded to build a VLIW-style processor (see Rau et al. [1989]), which was also unsuccessful commercially. Dehnert, Hsu, and Bratt [1989] explain the architecture and performance of the Cydrome Cydra 5, a processor with a wide-instruction word that provides dynamic register renaming and additional support for software pipelining. The Cydra 5 is a unique blend of hardware and software, including conditional instructions, aimed at extracting ILP. Cydrome relied on more hardware than the Multiflow processor and achieved competitive performance primarily on vector-style codes. In the end, Cydrome suffered from problems similar to those of Multiflow and was not a commercial success.
Both Multiflow and Cydrome, though unsuccessful as commercial entities, produced a number of people with extensive experience in exploiting ILP as well as advanced compiler technology; many of those people have gone on to incorporate their experience and the pieces of the technology in newer processors. Recently, Fisher and Rau [1993] edited a comprehensive collection of papers covering the hardware and software of these two important processors.

Rau had also developed a scheduling technique called polycyclic scheduling, which is a basis for most software pipelining schemes (see Rau, Glaeser, and Picard [1982]). Rau’s work built on earlier work by Davidson and his colleagues on the design of optimal hardware schedulers for pipelined processors. Other LIW processors have included the Apollo DN 10000 and the Intel i860, both of which could dual issue FP and integer operations.

One of the interesting approaches used in early VLIW processors, such as the AP-120B and i860, was the idea of a pipeline organization that requires operations to be “pushed through” a functional unit and the results to be caught at the end of the pipeline. In such processors, operations advance only when another operation pushes them from behind (in sequence). Furthermore, an instruction specifies the destination for an instruction issued earlier that will be pushed out of the pipeline when this new operation is pushed in. Such an approach has the advantage that it does not specify a result destination when an operation first issues but only when the result register is actually written. This eliminates the need to detect WAW and WAR hazards in the hardware. The disadvantage is that it increases code size since no-ops may be needed to push results out when there is a dependence on an operation that is still in the pipeline and no other operations of that type are immediately needed.
Instead of the “push-and-catch” approach used in these two processors, almost all designers have chosen to use self-draining pipelines that specify the destination in the issuing instruction and in which an issued instruction will complete without further action. The advantages in code density and simplifications in code generation seem to outweigh the advantages of the more unusual structure.

IBM did pioneering work on multiple issue. In the 1960s, a project called ACS was underway. It included multiple-issue concepts, but never reached product stage. John Cocke made a subsequent proposal for a superscalar processor that dynamically makes issue decisions; he described the key ideas in several talks in the mid-1980s and coined the name superscalar. He called the design America; it is described by Agerwala and Cocke [1987]. The IBM Power-1 architecture (the RS/6000 line) is based on these ideas (see Bakoglu et al. [1989]).

J. E. Smith [1984] and his colleagues at Wisconsin proposed the decoupled approach that included multiple issue with limited dynamic pipeline scheduling. A key feature of this processor is the use of queues to maintain order among a class of instructions (such as memory references) while allowing it to slip behind or ahead of another class of instructions. The Astronautics ZS-1 described by Smith et al. [1987] embodies this approach with queues to connect the load-store unit and the operation units. The Power-2 design uses queues in a similar fashion. J. E. Smith [1989] also describes the advantages of dynamic scheduling and compares that approach to static scheduling.

The concept of speculation has its roots in the original 360/91, which performed a very limited form of speculation. The approach used in recent processors combines the dynamic scheduling techniques of the 360/91 with a buffer to allow in-order commit. J. E.
Smith and Pleszkun [1988] explored the use of buffering to maintain precise interrupts and described the concept of a reorder buffer. Sohi [1990] describes adding renaming and dynamic scheduling, making it possible to use the mechanism for speculation. Patt and his colleagues have described another approach, called HPSm, that is also an extension of Tomasulo’s algorithm [Hwu and Patt 1986] and supports speculative-like execution.

The use of speculation as a technique in multiple-issue processors was evaluated by Smith, Johnson, and Horowitz [1989] using the reorder buffer technique; their goal was to study available ILP in nonscientific code using speculation and multiple issue. In a subsequent book, M. Johnson [1990] describes the design of a speculative superscalar processor.

What is surprising about the development of multiple-issue processors is that many of the early processors were not successful. Recent superscalars with modest issue capabilities (e.g., the DEC 21064 or HP 7100), however, have shown that the techniques can be used together with aggressive clock rates to build very fast processors, and designs like the Power-2 and TFP [Hsu 1994] processor show that very high issue-rate processors can be successful in the FP domain.

Compiler Technology

Loop-level parallelism and dependence analysis was developed primarily by D. Kuck and his colleagues at the University of Illinois in the 1970s. They also coined the commonly used terminology of antidependence and output dependence and developed several standard dependence tests, including the GCD and Banerjee tests. The latter test was named after Utpal Banerjee and comes in a variety of flavors. Recent work on dependence analysis has focused on using a variety of exact tests ending with an algorithm called Fourier-Motzkin, which is a linear programming algorithm. D. Maydan and W. Pugh both showed that the sequences of exact tests were a practical solution.
In the area of uncovering and scheduling ILP, much of the early work was connected to the development of VLIW processors, described earlier. Lam [1988] developed algorithms for software pipelining and evaluated their use on Warp, a wide-instruction-word processor designed for special-purpose applications. Weiss and J. E. Smith [1987] compare software pipelining versus loop unrolling as techniques for scheduling code on a pipelined processor. Recently several groups have been looking at techniques for scheduling code for processors with conditional and speculative execution, but without full support for dynamic hardware scheduling. For example, Smith, Horowitz, and Lam [1992] created a concept called boosting that contains a hardware facility for supporting speculation but relies on compiler scheduling of speculated instructions. The sentinel concept, developed by Hwu and his colleagues [Mahlke et al. 1992], is a more general form of this idea.

Studies of ILP

A series of early papers, including Tjaden and Flynn [1970] and Riseman and Foster [1972], concluded that only small amounts of parallelism could be available at the instruction level without investing an enormous amount of hardware. These papers dampened the appeal of multiple instruction issue for more than ten years. Nicolau and Fisher [1984] published a paper based on their work with trace scheduling and asserted the presence of large amounts of potential ILP in scientific programs.

Since then there have been many studies of the available ILP. Such studies have been criticized since they presume some level of both hardware support and compiler technology. Nonetheless, the studies are useful to set expectations as well as to understand the sources of the limitations. Wall has participated in several such studies, including Jouppi and Wall [1989], Wall [1991], and Wall [1993].
While the early studies were criticized as being conservative (e.g., they didn’t include speculation), the latest study is by far the most ambitious study of ILP to date and the basis for the data in section 4.8. Sohi and Vajapeyam [1989] give measurements of available parallelism for wide-instruction-word processors. Smith, Johnson, and Horowitz [1989] also used a speculative superscalar processor to study ILP limits. At the time of their study, they anticipated that the processor they specified was an upper bound on reasonable designs. Recent and upcoming processors, however, are likely to be at least as ambitious as their processor. Most recently, Lam and Wilson [1992] have looked at the limitations imposed by speculation and shown that additional gains are possible by allowing processors to speculate in multiple directions, which requires more than one PC. Such ideas represent one possible alternative for future processor architectures, since they represent a hybrid organization between a conventional uniprocessor and a conventional multiprocessor.

Recent Advanced Microprocessors

The years 1994–95 saw the announcement of a wide superscalar processor (3 or more issues per clock) by every major processor vendor: Intel P6, AMD K5, Sun UltraSPARC, Alpha 21164, MIPS R10000, PowerPC 604/620, and HP 8000. In 1995, the trade-offs between processors with more dynamic issue and speculation and those with more static issue and higher clock rates remain unclear. In practice, many factors, including the implementation technology, the memory hierarchy, the skill of the designers, and the type of applications benchmarked, all play a role in determining which approach is best. Figure 4.60 shows some of the most interesting recent processors, their characteristics, and suggested references. What is clear is that some level of multiple issue is here to stay and will be included in all processors in the foreseeable future.
                                                                     |------- Issue capabilities -------|
Processor        Year shipped  Initial clock  Issue      Scheduling  Maximum  Load-  Integer  FP  Branch  SPEC (measure
                 in systems    rate (MHz)     structure                       store  ALU                  or estimate)
IBM Power-1      1991          66             Dynamic    Static      4        1      1        1   1       60 int, 80 FP
HP 7100          1992          100            Static     Static      2        1      1        1   1       80 int, 150 FP
DEC Alpha 21064  1992          150            Dynamic    Static      2        1      1        1   1       100 int, 150 FP
SuperSPARC       1993          50             Dynamic    Static      3        1      1        1   1       75 int, 85 FP
IBM Power-2      1994          67             Dynamic    Static      6        2      2        2   2       95 int, 270 FP
MIPS TFP         1994          75             Dynamic    Static      4        2      2        2   1       100 int, 310 FP
Intel Pentium    1994          66             Dynamic    Static      2        2      2        1   1       65 int, 65 FP
DEC Alpha 21164  1995          300            Static     Static      4        2      2        2   1       330 int, 500 FP
Sun UltraSPARC   1995          167            Dynamic    Static      4        1      1        1   1       275 int, 305 FP
Intel P6         1995          150            Dynamic    Dynamic     3        1      2        1   1       > 200 int
AMD K5           1995          100            Dynamic    Dynamic     4        2      2        1   1       130
HaL R1           1995          154            Dynamic    Dynamic     4        1      2        1   1       255 int, 330 FP
PowerPC 620      1995          133            Dynamic    Dynamic     4        1      2        1   1       225 int, 300 FP
MIPS R10000      1996          200            Dynamic    Dynamic     4        1      2        2   1       300 int, 600 FP
HP 8000          1996          200            Dynamic    Static      4        2      2        2   1       > 360 int, > 550 FP

FIGURE 4.60 Recent high-performance processors and their characteristics and suggested references. For the last seven systems (starting with the UltraSPARC), the SPEC numbers are estimates, since no system has yet shipped. Issue structure refers to whether the hardware (dynamic) or compiler (static) is responsible for arranging instructions into issue packets; scheduling similarly describes whether the hardware dynamically schedules instructions or not. To read more about these processors the following references are useful: IBM Journal of Research and Development (contains issues on Power and PowerPC designs), the Digital Technical Journal (contains issues on various Alpha processors), and Proceedings of the Hot Chips Symposium (annual meeting at Stanford, which reviews the newest microprocessors).
References

AGERWALA, T. AND J. COCKE [1987]. “High performance reduced instruction set processors,” IBM Tech. Rep. (March).
ANDERSON, D. W., F. J. SPARACIO, AND R. M. TOMASULO [1967]. “The IBM 360 Model 91: Processor philosophy and instruction handling,” IBM J. Research and Development 11:1 (January), 8–24.
BAKOGLU, H. B., G. F. GROHOSKI, L. E. THATCHER, J. A. KAHLE, C. R. MOORE, D. P. TUTTLE, W. E. MAULE, W. R. HARDELL, D. A. HICKS, M. NGUYEN PHU, R. K. MONTOYE, W. T. GLOVER, AND S. DHAWAN [1989]. “IBM second-generation RISC processor organization,” Proc. Int’l Conf. on Computer Design, IEEE (October), Rye, N.Y., 138–142.
CHARLESWORTH, A. E. [1981]. “An approach to scientific array processing: The architecture design of the AP-120B/FPS-164 family,” Computer 14:9 (September), 18–27.
COLWELL, R. P., R. P. NIX, J. J. O’DONNELL, D. B. PAPWORTH, AND P. K. RODMAN [1987]. “A VLIW architecture for a trace scheduling compiler,” Proc. Second Conf. on Architectural Support for Programming Languages and Operating Systems, IEEE/ACM (March), Palo Alto, Calif., 180–192.
DEHNERT, J. C., P. Y.-T. HSU, AND J. P. BRATT [1989]. “Overlapped loop support on the Cydra 5,” Proc. Third Conf. on Architectural Support for Programming Languages and Operating Systems (April), IEEE/ACM, Boston, 26–39.
DIEP, T. A., C. NELSON, AND J. P. SHEN [1995]. “Performance evaluation of the PowerPC 620 microarchitecture,” Proc. 22nd Symposium on Computer Architecture (June), Santa Margherita, Italy.
DITZEL, D. R. AND H. R. MCLELLAN [1987]. “Branch folding in the CRISP microprocessor: Reducing the branch delay to zero,” Proc. 14th Symposium on Computer Architecture (June), Pittsburgh, 2–7.
ELLIS, J. R. [1986]. Bulldog: A Compiler for VLIW Architectures, MIT Press, Cambridge, Mass.
FISHER, J. A. [1981]. “Trace scheduling: A technique for global microcode compaction,” IEEE Trans. on Computers 30:7 (July), 478–490.
FISHER, J. A. [1983].
“Very long instruction word architectures and ELI-512,” Proc. Tenth Symposium on Computer Architecture (June), Stockholm, 140–150.
FISHER, J. A., J. R. ELLIS, J. C. RUTTENBERG, AND A. NICOLAU [1984]. “Parallel processing: A smart compiler and a dumb processor,” Proc. SIGPLAN Conf. on Compiler Construction (June), Palo Alto, Calif., 11–16.
FISHER, J. A. AND S. M. FREUDENBERGER [1992]. “Predicting conditional branches from previous runs of a program,” Proc. Fifth Conf. on Architectural Support for Programming Languages and Operating Systems, IEEE/ACM (October), Boston, 85–95.
FISHER, J. A. AND B. R. RAU [1993]. Journal of Supercomputing (January), Kluwer.
FOSTER, C. C. AND E. M. RISEMAN [1972]. “Percolation of code to enhance parallel dispatching and execution,” IEEE Trans. on Computers C-21:12 (December), 1411–1415.
HSU, P. Y.-T. [1994]. “Designing the TFP microprocessor,” IEEE Micro 14:2, 23–33.
HWU, W.-M. AND Y. PATT [1986]. “HPSm, a high performance restricted data flow architecture having minimum functionality,” Proc. 13th Symposium on Computer Architecture (June), Tokyo, 297–307.
IBM [1990]. “The IBM RISC System/6000 processor,” collection of papers, IBM J. Research and Development 34:1 (January), 119 pages.
JOHNSON, M. [1990]. Superscalar Microprocessor Design, Prentice Hall, Englewood Cliffs, N.J.
JOUPPI, N. P. AND D. W. WALL [1989]. “Available instruction-level parallelism for superscalar and superpipelined processors,” Proc. Third Conf. on Architectural Support for Programming Languages and Operating Systems, IEEE/ACM (April), Boston, 272–282.
LAM, M. [1988]. “Software pipelining: An effective scheduling technique for VLIW processors,” SIGPLAN Conf. on Programming Language Design and Implementation, ACM (June), Atlanta, Ga., 318–328.
LAM, M. S. AND R. P. WILSON [1992]. “Limits of control flow on parallelism,” Proc. 19th Symposium on Computer Architecture (May), Gold Coast, Australia, 46–57.
MAHLKE, S. A., W.
Y. CHEN, W.-M. HWU, B. R. RAU, AND M. S. SCHLANSKER [1992]. “Sentinel scheduling for VLIW and superscalar processors,” Proc. Fifth Conf. on Architectural Support for Programming Languages and Operating Systems (October), Boston, IEEE/ACM, 238–247.
MCFARLING, S. [1993]. “Combining branch predictors,” WRL Technical Note TN-36 (June), Digital Western Research Laboratory, Palo Alto, Calif.
MCFARLING, S. AND J. HENNESSY [1986]. “Reducing the cost of branches,” Proc. 13th Symposium on Computer Architecture (June), Tokyo, 396–403.
NICOLAU, A. AND J. A. FISHER [1984]. “Measuring the parallelism available for very long instruction word architectures,” IEEE Trans. on Computers C-33:11 (November), 968–976.
PAN, S.-T., K. SO, AND J. T. RAHMEH [1992]. “Improving the accuracy of dynamic branch prediction using branch correlation,” Proc. Fifth Conf. on Architectural Support for Programming Languages and Operating Systems, IEEE/ACM (October), Boston, 76–84.
RAU, B. R., C. D. GLAESER, AND R. L. PICARD [1982]. “Efficient code generation for horizontal architectures: Compiler techniques and architectural support,” Proc. Ninth Symposium on Computer Architecture (April), 131–139.
RAU, B. R., D. W. L. YEN, W. YEN, AND R. A. TOWLE [1989]. “The Cydra 5 departmental supercomputer: Design philosophies, decisions, and trade-offs,” IEEE Computers 22:1 (January), 12–34.
RISEMAN, E. M. AND C. C. FOSTER [1972]. “Percolation of code to enhance parallel dispatching and execution,” IEEE Trans. on Computers C-21:12 (December), 1411–1415.
SMITH, A. AND J. LEE [1984]. “Branch prediction strategies and branch-target buffer design,” Computer 17:1 (January), 6–22.
SMITH, J. E. [1981]. “A study of branch prediction strategies,” Proc. Eighth Symposium on Computer Architecture (May), Minneapolis, 135–148.
SMITH, J. E. [1984]. “Decoupled access/execute computer architectures,” ACM Trans. on Computer Systems 2:4 (November), 289–308.
SMITH, J. E. [1989].
“Dynamic instruction scheduling and the Astronautics ZS-1,” Computer 22:7 (July), 21–35.
SMITH, J. E. AND A. R. PLESZKUN [1988]. “Implementing precise interrupts in pipelined processors,” IEEE Trans. on Computers 37:5 (May), 562–573. This paper is based on an earlier paper that appeared in Proc. 12th Symposium on Computer Architecture, June 1985.
SMITH, J. E., G. E. DERMER, B. D. VANDERWARN, S. D. KLINGER, C. M. ROZEWSKI, D. L. FOWLER, K. R. SCIDMORE, AND J. P. LAUDON [1987]. “The ZS-1 central processor,” Proc. Second Conf. on Architectural Support for Programming Languages and Operating Systems, IEEE/ACM (March), Palo Alto, Calif., 199–204.
SMITH, M. D., M. HOROWITZ, AND M. S. LAM [1992]. “Efficient superscalar performance through boosting,” Proc. Fifth Conf. on Architectural Support for Programming Languages and Operating Systems (October), Boston, IEEE/ACM, 248–259.
SMITH, M. D., M. JOHNSON, AND M. A. HOROWITZ [1989]. “Limits on multiple instruction issue,” Proc. Third Conf. on Architectural Support for Programming Languages and Operating Systems, IEEE/ACM (April), Boston, 290–302.
SOHI, G. S. [1990]. “Instruction issue logic for high-performance, interruptible, multiple functional unit, pipelined computers,” IEEE Trans. on Computers 39:3 (March), 349–359.
SOHI, G. S. AND S. VAJAPEYAM [1989]. “Tradeoffs in instruction format design for horizontal architectures,” Proc. Third Conf. on Architectural Support for Programming Languages and Operating Systems, IEEE/ACM (April), Boston, 15–25.
THORLIN, J. F. [1967]. “Code generation for PIE (parallel instruction execution) computers,” Proc. Spring Joint Computer Conf. 27.
THORNTON, J. E. [1964]. “Parallel operation in the Control Data 6600,” Proc. AFIPS Fall Joint Computer Conf., Part II, 26, 33–40.
THORNTON, J. E. [1970]. Design of a Computer, the Control Data 6600, Scott, Foresman, Glenview, Ill.
TJADEN, G. S. AND M. J. FLYNN [1970].
“Detection and parallel execution of independent instructions,” IEEE Trans. on Computers C-19:10 (October), 889–895.
TOMASULO, R. M. [1967]. “An efficient algorithm for exploiting multiple arithmetic units,” IBM J. Research and Development 11:1 (January), 25–33.
WALL, D. W. [1991]. “Limits of instruction-level parallelism,” Proc. Fourth Conf. on Architectural Support for Programming Languages and Operating Systems (April), Santa Clara, Calif., IEEE/ACM, 248–259.
WALL, D. W. [1993]. Limits of Instruction-Level Parallelism, Research Rep. 93/6, Western Research Laboratory, Digital Equipment Corp. (November).
WEISS, S. AND J. E. SMITH [1984]. “Instruction issue logic for pipelined supercomputers,” Proc. 11th Symposium on Computer Architecture (June), Ann Arbor, Mich., 110–118.
WEISS, S. AND J. E. SMITH [1987]. “A study of scalar compilation techniques for pipelined supercomputers,” Proc. Second Conf. on Architectural Support for Programming Languages and Operating Systems (March), IEEE/ACM, Palo Alto, Calif., 105–109.
WEISS, S. AND J. E. SMITH [1994]. Power and PowerPC, Morgan Kaufmann, San Francisco.
YEH, T. AND Y. N. PATT [1992]. “Alternative implementations of two-level adaptive branch prediction,” Proc. 19th Symposium on Computer Architecture (May), Gold Coast, Australia, 124–134.
YEH, T. AND Y. N. PATT [1993]. “A comparison of dynamic branch predictors that use two levels of branch history,” Proc. 20th Symposium on Computer Architecture (May), San Diego, 257–266.

EXERCISES

4.1 [15] <4.1> List all the dependences (output, anti, and true) in the following code fragment. Indicate whether the true dependences are loop-carried or not. Show why the loop is not parallel.

    for (i=2;i<100;i=i+1) {
        a[i] = b[i] + a[i];      /* S1 */
        c[i-1] = a[i] + d[i];    /* S2 */
        a[i-1] = 2 * b[i];       /* S3 */
        b[i+1] = 2 * b[i];       /* S4 */
    }

4.2 [15] <4.1> Here is an unusual loop. First, list the dependences and then rewrite the loop so that it is parallel.

    for (i=1;i<100;i=i+1) {
        a[i] = b[i] + c[i];      /* S1 */
        b[i] = a[i] + d[i];      /* S2 */
        a[i+1] = a[i] + e[i];    /* S3 */
    }

4.3 [10] <4.1> For the following code fragment, list the control dependences. For each control dependence, tell whether the statement can be scheduled before the if statement based on the data references. Assume that all data references are shown, that all values are defined before use, and that only b and c are used again after this segment. You may ignore any possible exceptions.

    if (a>c) {
        d = d + 5;
        a = b + d + e;}
    else {
        e = e + 2;
        f = f + 2;
        c = c + f;
    }
    b = a + f;

4.4 [15] <4.1> Assuming the pipeline latencies from Figure 4.2, unroll the following loop as many times as necessary to schedule it without any delays, collapsing the loop overhead instructions. Assume a one-cycle delayed branch. Show the schedule. The loop computes Y[i] = a × X[i] + Y[i], the key step in a Gaussian elimination.

    loop: LD     F0,0(R1)
          MULTD  F0,F0,F2
          LD     F4,0(R2)
          ADDD   F0,F0,F4
          SD     0(R2),F0
          SUBI   R1,R1,8
          SUBI   R2,R2,8
          BNEZ   R1,loop

4.5 [15] <4.1> Assume the pipeline latencies from Figure 4.2 and a one-cycle delayed branch. Unroll the following loop a sufficient number of times to schedule it without any delays. Show the schedule after eliminating any redundant overhead instructions. The loop is a dot product (assuming F2 is initially 0) and contains a recurrence. Despite the fact that the loop is not parallel, it can be scheduled with no delays.

    loop: LD     F0,0(R1)
          LD     F4,0(R2)
          MULTD  F0,F0,F4
          ADDD   F2,F0,F2
          SUBI   R1,R1,#8
          SUBI   R2,R2,#8
          BNEZ   R1,loop

4.6 [20] <4.2> It is critical that the scoreboard be able to distinguish RAW and WAR hazards, since a WAR hazard requires stalling the instruction doing the writing until the instruction reading an operand initiates execution, while a RAW hazard requires delaying the reading instruction until the writing instruction finishes, which is just the opposite.
For example, consider the sequence:

    MULTD  F0,F6,F4
    SUBD   F8,F0,F2
    ADDD   F2,F10,F2

The SUBD depends on the MULTD (a RAW hazard) and thus the MULTD must be allowed to complete before the SUBD; if the MULTD were stalled for the SUBD due to the inability to distinguish between RAW and WAR hazards, the processor would deadlock. This sequence contains a WAR hazard between the ADDD and the SUBD, and the ADDD cannot be allowed to complete until the SUBD begins execution. The difficulty lies in distinguishing the RAW hazard between MULTD and SUBD, and the WAR hazard between the SUBD and ADDD. Describe how the scoreboard for a machine with two multiply units and two add units avoids this problem and show the scoreboard values for the above sequence assuming the ADDD is the only instruction that has completed execution (though it has not written its result). (Hint: Think about how WAW hazards are prevented and what this implies about active instruction sequences.)

4.7 [12] <4.2> A shortcoming of the scoreboard approach occurs when multiple functional units that share input buses are waiting for a single result. The units cannot start simultaneously, but must serialize. This is not true in Tomasulo’s algorithm. Give a code sequence that uses no more than 10 instructions and shows this problem. Assume the hardware configuration from Figure 4.3, for the scoreboard, and Figure 4.8, for Tomasulo’s scheme. Use the FP latencies from Figure 4.2 (page 224). Indicate where the Tomasulo approach can continue, but the scoreboard approach must stall.

4.8 [15] <4.2> Tomasulo’s algorithm also has a disadvantage versus the scoreboard: only one result can complete per clock, due to the CDB. Use the hardware configuration from Figures 4.3 and 4.8 and the FP latencies from Figure 4.2 (page 224).
Find a code sequence of no more than 10 instructions where the scoreboard does not stall, but Tomasulo’s algorithm must due to CDB contention. Indicate where this occurs in your sequence.

4.9 [45] <4.2> One benefit of a dynamically scheduled processor is its ability to tolerate changes in latency or issue capability without requiring recompilation. This was a primary motivation behind the 360/91 implementation. The purpose of this programming assignment is to evaluate this effect. Implement a version of Tomasulo’s algorithm for DLX to issue one instruction per clock; your implementation should also be capable of in-order issue. Assume fully pipelined functional units and the latencies shown in Figure 4.61.

    Unit          Latency
    Integer       7
    Branch        9
    Load-store    11
    FP add        13
    FP mult       15
    FP divide     17

FIGURE 4.61 Latencies for functional units.

A one-cycle latency means that the unit and the result are available for the next instruction. Assume the processor takes a one-cycle stall for branches, in addition to any data-dependent stalls shown in the above table. Choose 5–10 small FP benchmarks (with loops) to run; compare the performance with and without dynamic scheduling. Try scheduling the loops by hand and see how close you can get with the statically scheduled processor to the dynamically scheduled results.

Change the processor to the configuration shown in Figure 4.62.

    Unit          Latency
    Integer       19
    Branch        21
    Load-store    23
    FP add        25
    FP mult       27
    FP divide     29

FIGURE 4.62 Latencies for functional units, configuration 2.

Rerun the loops and compare the performance of the dynamically scheduled processor and the statically scheduled processor.

4.10 [15] <4.3> Suppose we have a deeply pipelined processor, for which we implement a branch-target buffer for the conditional branches only. Assume that the misprediction penalty is always 4 cycles and the buffer miss penalty is always 3 cycles. Assume 90% hit rate and 90% accuracy, and 15% branch frequency.
How much faster is the processor with the branch-target buffer versus a processor that has a fixed 2-cycle branch penalty? Assume a base CPI without branch stalls of 1.

4.11 [10] <4.3> Determine the improvement from branch folding for unconditional branches. Assume a 90% hit rate, a base CPI without unconditional branch stalls of 1, and an unconditional branch frequency of 5%. How much improvement is gained by this enhancement versus a processor whose effective CPI is 1.1?

4.12 [30] <4.4> Implement a simulator to evaluate the performance of a branch-prediction buffer that does not store branches that are predicted as untaken. Consider the following prediction schemes: a one-bit predictor storing only predicted taken branches, a two-bit predictor storing all the branches, a scheme with a target buffer that stores only predicted taken branches and a two-bit prediction buffer. Explore different sizes for the buffers keeping the total number of bits (assuming 32-bit addresses) the same for all schemes. Determine what the branch penalties are, using Figure 4.24 as a guideline. How do the different schemes compare both in prediction accuracy and in branch cost?

4.13 [30] <4.4> Implement a simulator to evaluate various branch prediction schemes. You can use the instruction portion of a set of cache traces to simulate the branch-prediction buffer. Pick a set of table sizes (e.g., 1K bits, 2K bits, 8K bits, and 16K bits). Determine the performance of both (0,2) and (2,2) predictors for the various table sizes. Also compare the performance of the degenerate predictor that uses no branch address information for these table sizes. Determine how large the table must be for the degenerate predictor to perform as well as a (0,2) predictor with 256 entries.

4.14 [20/22/22/22/22/25/25/25/20/22/22] <4.1,4.2,4.4> In this exercise, we will look at how a common vector loop runs on a variety of pipelined versions of DLX.
The loop is the so-called SAXPY loop (discussed extensively in Appendix B) and the central operation in Gaussian elimination. The loop implements the vector operation Y = a × X + Y for a vector of length 100. Here is the DLX code for the loop:

    foo: LD     F2,0(R1)      ;load X(i)
         MULTD  F4,F2,F0      ;multiply a*X(i)
         LD     F6,0(R2)      ;load Y(i)
         ADDD   F6,F4,F6      ;add a*X(i) + Y(i)
         SD     0(R2),F6      ;store Y(i)
         ADDI   R1,R1,#8      ;increment X index
         ADDI   R2,R2,#8      ;increment Y index
         SGTI   R3,R1,done    ;test if done
         BEQZ   R3,foo        ;loop if not done

For (a)–(e), assume that the integer operations issue and complete in one clock cycle (including loads) and that their results are fully bypassed. Ignore the branch delay. You will use the FP latencies shown in Figure 4.2 (page 224). Assume that the FP unit is fully pipelined.

a. [20] <4.1> For this problem use the standard single-issue DLX pipeline with the pipeline latencies from Figure 4.2. Show the number of stall cycles for each instruction and what clock cycle each instruction begins execution (i.e., enters its first EX cycle) on the first iteration of the loop. How many clock cycles does each loop iteration take?

b. [22] <4.1> Unroll the DLX code for SAXPY to make four copies of the body and schedule it for the standard DLX integer pipeline and a fully pipelined FPU with the FP latencies of Figure 4.2. When unwinding, you should optimize the code as we did in section 4.1. Significant reordering of the code will be needed to maximize performance. How many clock cycles does each loop iteration take?

c. [22] <4.2> Using the DLX code for SAXPY above, show the state of the scoreboard tables (as in Figure 4.4) when the SGTI instruction reaches write result. Assume that issue and read operands each take a cycle. Assume that there is one integer functional unit that takes only a single execution cycle (the latency to use is 0 cycles, including loads and stores).
Assume the FP unit configuration of Figure 4.3 with the FP latencies of Figure 4.2. The branch should not be included in the scoreboard.

d. [22] <4.2> Use the DLX code for SAXPY above and a fully pipelined FPU with the latencies of Figure 4.2. Assume Tomasulo’s algorithm for the hardware with one integer unit taking one execution cycle (a latency of 0 cycles to use) for all integer operations. Show the state of the reservation stations and register-status tables (as in Figure 4.9) when the SGTI writes its result on the CDB. Do not include the branch.

e. [22] <4.2> Using the DLX code for SAXPY above, assume a scoreboard with the FP functional units described in Figure 4.3, plus one integer functional unit (also used for load-store). Assume the latencies shown in Figure 4.63. Show the state of the scoreboard (as in Figure 4.4) when the branch issues for the second time. Assume the branch was correctly predicted taken and took one cycle. How many clock cycles does each loop iteration take? You may ignore any register port/bus conflicts.

f. [25] <4.2> Use the DLX code for SAXPY above. Assume Tomasulo’s algorithm for the hardware using one fully pipelined FP unit and one integer unit. Assume the latencies shown in Figure 4.63. Show the state of the reservation stations and register status tables (as in Figure 4.9) when the branch is executed for the second time. Assume the branch was correctly predicted as taken. How many clock cycles does each loop iteration take?

    Instruction producing result          Instruction using result    Latency in clock cycles
    FP multiply                           FP ALU op                   6
    FP add                                FP ALU op                   4
    FP multiply                           FP store                    5
    FP add                                FP store                    3
    Integer operation (including load)    Any                         0

FIGURE 4.63 Pipeline latencies where latency is number of cycles between producing and consuming instruction.

g. [25] <4.1,4.4> Assume a superscalar architecture that can issue any two independent operations in a clock cycle (including two integer operations).
Unwind the DLX code for SAXPY to make four copies of the body and schedule it assuming the FP latencies of Figure 4.2. Assume one fully pipelined copy of each functional unit (e.g., FP adder, FP multiplier) and two integer functional units with latency to use of 0. How many clock cycles will each iteration on the original code take? When unwinding, you should optimize the code as in section 4.1. What is the speedup versus the original code?

h. [25] <4.4> In a superpipelined processor, rather than have multiple functional units, we would fully pipeline all the units. Suppose we designed a superpipelined DLX that had twice the clock rate of our standard DLX pipeline and could issue any two unrelated instructions in the same time that the normal DLX pipeline issued one operation. If the second instruction is dependent on the first, only the first will issue. Unroll the DLX SAXPY code to make four copies of the loop body and schedule it for this superpipelined processor, assuming the FP latencies of Figure 4.63. Also assume the load-to-use latency is 1 cycle, but other integer unit latencies are 0 cycles. How many clock cycles does each loop iteration take? Remember that these clock cycles are half as long as those on a standard DLX pipeline or a superscalar DLX.

i. [20] <4.4> Start with the SAXPY code and the processor used in Figure 4.29. Unroll the SAXPY loop to make four copies of the body, performing simple optimizations (as in section 4.1). Assume all integer unit latencies are 0 cycles and the FP latencies are given in Figure 4.2. Fill in a table like Figure 4.28 for the unrolled loop. How many clock cycles does each loop iteration take?

j. [22] <4.2,4.6> Using the DLX code for SAXPY above, assume a speculative processor with the functional unit organization used in section 4.6 and a single integer functional unit. Assume the latencies shown in Figure 4.63. Show the state of the processor (as in Figure 4.35) when the branch issues for the second time.
Assume the branch was correctly predicted taken and took one cycle. How many clock cycles does each loop iteration take?

k. [22] <4.2,4.6> Using the DLX code for SAXPY above, assume a speculative processor like Figure 4.34 that can issue one load-store, one integer operation, and one FP operation each cycle. Assume the latencies in clock cycles of Figure 4.63. Show the state of the processor (as in Figure 4.35) when the branch issues for the second time. Assume the branch was correctly predicted taken and took one cycle. How many clock cycles does each loop iteration take?

4.15 [15] <4.5> Here is a simple code fragment:

    for (i=2;i<=100;i+=2)
        a[i] = a[50*i+1];

To use the GCD test, this loop must first be “normalized,” that is, written so that the index starts at 1 and increments by 1 on every iteration. Write a normalized version of the loop (change the indices as needed), then use the GCD test to see if there is a dependence.

4.16 [15] <4.1,4.5> Here is another loop:

    for (i=2;i<=100;i+=2)
        a[i] = a[i-1];

Normalize the loop and use the GCD test to detect a dependence. Is there a loop-carried, true dependence in this loop?

4.17 [25] <4.5> Show that if for two array elements A(a × i + b) and A(c × i + d) there is a true dependence, then GCD(c,a) divides (d – b).

4.18 [15] <4.5> Rewrite the software pipelining loop shown in the Example on page 294 in section 4.5, so that it can be run by simply decrementing R1 by 16 before the loop starts. After rewriting the loop, show the start-up and finish-up code. Hint: To get the loop to run properly when R1 is decremented, the SD should store the result of the original first iteration. You can achieve this by adjusting load-store offsets.

4.19 [20] <4.5> Consider the loop that we software pipelined on page 294 in section 4.5. Suppose the latency of the ADDD was five cycles. The software pipelined loop now has a stall.
Show how this loop can be written using both software pipelining and loop unrolling to eliminate any stalls. The loop should be unrolled as few times as possible (once is enough). You need not show loop start-up or clean-up.

4.20 [15/15] <4.6> Consider our speculative processor from section 4.6. Since the reorder buffer contains a value field, you might think that the value field of the reservation stations could be eliminated.

a. [15] <4.6> Show an example where this is the case and an example where the value field of the reservation stations is still needed. Use the speculative machine shown in Figure 4.34. Show DLX code for both examples. How many value fields are needed in each reservation station?

b. [15] <4.6> Find a modification to the rules for instruction commit that allows elimination of the value fields in the reservation station. What are the negative side effects of such a change?

4.21 [20] <4.6> Our implementation of speculation uses a reorder buffer and introduces the concept of instruction commit, delaying commit and the irrevocable updating of the registers until we know an instruction will complete. There are two other possible implementation techniques, both originally developed as a method for preserving precise interrupts when issuing out of order. One idea introduces a future file that keeps future values of a register; this idea is similar to the reorder buffer. An alternative is to keep a history buffer that records values of registers that have been speculatively overwritten. Design a speculative processor like the one in section 4.6 but using a history buffer. Show the state of the processor, including the contents of the history buffer, for the example in Figure 4.36. Show the changes needed to Figure 4.37 for a history buffer implementation. Describe exactly how and when entries in the history buffer are read and written, including what happens on an incorrect speculation.
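
The history buffer mentioned in Exercise 4.21 can be pictured with a small software model before attempting the full design. What follows is only a minimal sketch in C under stated assumptions, not the processor the exercise asks for: the names Machine, HBEntry, spec_write, commit_oldest, and squash are hypothetical, as are the 32-entry register file and the 16-entry buffer, and the model assumes that register writes reach the buffer in program order.

```c
#include <assert.h>

#define NUM_REGS 32   /* assumed register-file size */
#define HB_SIZE  16   /* assumed history-buffer capacity */

typedef struct {
    int  reg;         /* register that was speculatively overwritten */
    long old_value;   /* value it held before the overwrite */
} HBEntry;

typedef struct {
    long    regs[NUM_REGS];
    HBEntry hb[HB_SIZE];   /* circular queue, oldest entry at head */
    int     head;          /* index of the oldest uncommitted entry */
    int     count;         /* number of valid entries */
} Machine;

/* Speculatively write a register, logging the old value first.
   Returns 0 (the writer must stall) if the history buffer is full. */
int spec_write(Machine *m, int reg, long value)
{
    assert(reg >= 0 && reg < NUM_REGS);
    if (m->count == HB_SIZE)
        return 0;
    int tail = (m->head + m->count) % HB_SIZE;
    m->hb[tail].reg = reg;
    m->hb[tail].old_value = m->regs[reg];
    m->regs[reg] = value;
    m->count++;
    return 1;
}

/* Commit the oldest write: its saved old value is no longer needed. */
void commit_oldest(Machine *m)
{
    assert(m->count > 0);
    m->head = (m->head + 1) % HB_SIZE;
    m->count--;
}

/* Undo the n youngest writes after an incorrect speculation,
   restoring old values youngest-first so earlier values win. */
void squash(Machine *m, int n)
{
    assert(n >= 0 && n <= m->count);
    while (n-- > 0) {
        int last = (m->head + m->count - 1) % HB_SIZE;
        m->regs[m->hb[last].reg] = m->hb[last].old_value;
        m->count--;
    }
}
```

In this model the register file always holds the newest (possibly speculative) values, so commit is cheap and recovery is the expensive path; a reorder buffer makes the opposite choice. The sketch leaves out memory state, exceptions, and out-of-order completion, all of which the exercise expects the design to address.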
4.22 [30/30] <4.8> This exercise involves a programming assignment to evaluate what types of parallelism might be expected in more modest, and more realistic, processors than those studied in section 4.7. These studies can be done using traces available with this text or obtained from other tracing programs. For simplicity, assume perfect caches. For a more ambitious project, assume a real cache. To simplify the task, make the following assumptions:

■ Assume perfect branch and jump prediction: hence you can use the trace as the input to the window, without having to consider branch effects; the trace is perfect.
■ Assume there are 64 spare integer and 64 spare floating-point registers; this is easily implemented by stalling the issue of the processor whenever there are more live registers required.
■ Assume a window size of 64 instructions (the same for alias detection). Use greedy scheduling of instructions in the window. That is, at any clock cycle, pick for execution the first n instructions in the window that meet the issue constraints.

a. [30] <4.8> Determine the effect of limited instruction issue by performing the following experiments:

■ Vary the issue count from 4–16 instructions per clock.
■ Assuming eight issues per clock, determine what the effect of restricting the processor to two memory references per clock is.

b. [30] <4.8> Determine the impact of latency in instructions. Assume the following latency models for a processor that issues up to 16 instructions per clock:

■ Model 1: All latencies are one clock.
■ Model 2: Load latency and branch latency are one clock; all FP latencies are two clocks.
■ Model 3: Load and branch latency is two clocks; all FP latencies are five clocks.

Remember that with limited issue and a greedy scheduler, the impact of latency effects will be greater.
4.23 [Discussion] <4.3,4.6> Dynamic instruction scheduling requires a considerable investment in hardware. In return, this capability allows the hardware to run programs that could not be run at full speed with only compile-time, static scheduling. What trade-offs should be taken into account in trying to decide between a dynamically and a statically scheduled implementation? What situations in either hardware technology or program characteristics are likely to favor one approach or the other? Most speculative schemes rely on dynamic scheduling; how does speculation affect the arguments in favor of dynamic scheduling?

4.24 [Discussion] <4.3> There is a subtle problem that must be considered when implementing Tomasulo’s algorithm. It might be called the “two ships passing in the night problem.” What happens if an instruction is being passed to a reservation station during the same clock period as one of its operands is going onto the common data bus? Before an instruction is in a reservation station, the operands are fetched from the register file; but once it is in the station, the operands are always obtained from the CDB. Since the instruction and its operand tag are in transit to the reservation station, the tag cannot be matched against the tag on the CDB. So there is a possibility that the instruction will then sit in the reservation station forever waiting for its operand, which it just missed. How might this problem be solved? You might consider subdividing one of the steps in the algorithm into multiple parts. (This intriguing problem is courtesy of J. E. Smith.)

4.25 [Discussion] <4.4-4.6> Discuss the advantages and disadvantages of a superscalar implementation, a superpipelined implementation, and a VLIW approach in the context of DLX. What levels of ILP favor each approach? What other concerns would you consider in choosing which type of processor to build?
How does speculation affect the results?

5 Memory-Hierarchy Design

Ideally one would desire an indefinitely large memory capacity such that any particular . . . word would be immediately available. . . . We are . . . forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible.

A. W. Burks, H. H. Goldstine, and J. von Neumann, Preliminary Discussion of the Logical Design of an Electronic Computing Instrument (1946)

5.1 Introduction
5.2 The ABCs of Caches
5.3 Reducing Cache Misses
5.4 Reducing Cache Miss Penalty
5.5 Reducing Hit Time
5.6 Main Memory
5.7 Virtual Memory
5.8 Protection and Examples of Virtual Memory
5.9 Crosscutting Issues in the Design of Memory Hierarchies
5.10 Putting It All Together: The Alpha AXP 21064 Memory Hierarchy
5.11 Fallacies and Pitfalls
5.12 Concluding Remarks
5.13 Historical Perspective and References
Exercises

5.1 Introduction

Computer pioneers correctly predicted that programmers would want unlimited amounts of fast memory. An economical solution to that desire is a memory hierarchy, which takes advantage of locality and cost/performance of memory technologies. The principle of locality, presented in the first chapter, says that most programs do not access all code or data uniformly (see section 1.6, page 38). This principle, plus the guideline that smaller hardware is faster, led to the hierarchy based on memories of different speeds and sizes. Since fast memory is expensive, a memory hierarchy is organized into several levels, each smaller, faster, and more expensive per byte than the next level. The goal is to provide a memory system with cost almost as low as the cheapest level of memory and speed almost as fast as the fastest level.
The levels of the hierarchy usually subset one another; all data in one level is also found in the level below, and all data in that lower level is found in the one below it, and so on until we reach the bottom of the hierarchy. Note that each level maps addresses from a larger memory to a smaller but faster memory higher in the hierarchy. As part of address mapping, the memory hierarchy is given the responsibility of address checking; hence protection schemes for scrutinizing addresses are also part of the memory hierarchy.

The importance of the memory hierarchy has increased with advances in performance of processors. For example, in 1980 microprocessors were often designed without caches, while in 1995 they often come with two levels of caches. As noted in Chapter 1, microprocessor performance improved 55% per year since 1987, and 35% per year until 1986. Figure 5.1 plots CPU performance projections against the historical performance improvement in main memory access time. Clearly there is a processor-memory performance gap that computer architects must try to close.

[Figure 5.1: plot of Performance (logarithmic axis, 1 to 10,000) against Year (1980 to 2000), with one curve for Memory and one for CPU.]

FIGURE 5.1 Starting with 1980 performance as a baseline, the performance of memory and CPUs is plotted over time. The memory baseline is 64-KB DRAM in 1980, with three years to the next generation and a 7% per year performance improvement in latency (see Figure 5.30 on page 429). The CPU line assumes a 1.35 improvement per year until 1986, and a 1.55 improvement thereafter. Note that the vertical axis must be on a logarithmic scale to record the size of the CPU-DRAM performance gap.
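The growth rates quoted in the Figure 5.1 caption can be turned into a small calculation of the gap. The following is a minimal sketch, not part of the original text: the function names are hypothetical, and it assumes the rates compound once per year from the 1980 baseline and ignores the three-year DRAM generation steps.

```python
def cpu_performance(year):
    """CPU performance relative to 1980: 1.35x per year until 1986, 1.55x after."""
    early_years = min(year, 1986) - 1980
    late_years = max(year - 1986, 0)
    return 1.35 ** early_years * 1.55 ** late_years


def memory_performance(year):
    """DRAM latency improvement relative to 1980: 7% per year (assumed smooth)."""
    return 1.07 ** (year - 1980)


for year in (1980, 1986, 1995, 2000):
    gap = cpu_performance(year) / memory_performance(year)
    print(f"{year}: CPU/memory performance ratio is about {gap:.1f}")
```

Because both curves are exponentials, their ratio is also exponential, which is why the caption insists on a logarithmic vertical axis.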
In addition to giving us the trends that highlight the importance of the memory hierarchy, Chapter 1 gives us a formula to evaluate the effectiveness of the memory hierarchy:

Memory stall cycles = Instruction count × Memory references per instruction × Miss rate × Miss penalty

where Miss rate is the fraction of accesses that are not in the cache and Miss penalty is the additional clock cycles to service the miss. Recall that a block is the minimum unit of information that can be present in the cache (hit in the cache) or not (miss in the cache).

This chapter uses a related formula to evaluate many examples of using the principle of locality to improve performance while keeping the memory system affordable. This common principle allows us to pose four questions about any level of the hierarchy:

Q1: Where can a block be placed in the upper level? (Block placement)
Q2: How is a block found if it is in the upper level? (Block identification)
Q3: Which block should be replaced on a miss? (Block replacement)
Q4: What happens on a write? (Write strategy)

The answers to these questions help us understand the different trade-offs of memories at different levels of a hierarchy; hence we ask these four questions on every example. To put these abstract ideas into practice, throughout the chapter we show examples from the four levels of the memory hierarchy in a computer using the Alpha AXP 21064 microprocessor. Toward the end of the chapter we evaluate the impact of these levels on performance using the SPEC92 benchmark programs.

5.2 The ABCs of Caches

Cache: a safe place for hiding or storing things.

Webster’s New World Dictionary of the American Language, Second College Edition (1976)

Cache is the name generally given to the first level of the memory hierarchy encountered once the address leaves the CPU.
Since the principle of locality applies at many levels, and taking advantage of locality to improve performance is so popular, the term cache is now applied whenever buffering is employed to reuse commonly occurring items; examples include file caches, name caches, and so on. We start our description of caches by answering the four common questions for the first level of the memory hierarchy; you’ll see similar questions and answers later.

Q1: Where can a block be placed in a cache?

Figure 5.2 shows that the restrictions on where a block is placed create three categories of cache organization:

[Figure 5.2: three panels, each showing an eight-frame cache (block frames 0 to 7) above a 32-block memory (blocks 0 to 31). Fully associative: block 12 can go anywhere. Direct mapped: block 12 can go only into block 4 (12 mod 8). Set associative: block 12 can go anywhere in set 0 (12 mod 4); the cache frames are grouped into sets 0 to 3.]

FIGURE 5.2 This example cache has eight block frames and memory has 32 blocks. Real caches contain hundreds of block frames and real memories contain millions of blocks. The set-associative organization has four sets with two blocks per set, called two-way set associative. Assume that there is nothing in the cache and that the block address in question identifies lower-level block 12. The three options for caches are shown left to right. In fully associative, block 12 from the lower level can go into any of the eight block frames of the cache. With direct mapped, block 12 can only be placed into block frame 4 (12 modulo 8). Set associative, which has some of both features, allows the block to be placed anywhere in set 0 (12 modulo 4). With two blocks per set, this means block 12 can be placed either in block 0 or block 1 of the cache.

■ If each block has only one place it can appear in the cache, the cache is said to be direct mapped.
The mapping is usually

(Block address) MOD (Number of blocks in cache)

■ If a block can be placed anywhere in the cache, the cache is said to be fully associative.
■ If a block can be placed in a restricted set of places in the cache, the cache is said to be set associative. A set is a group of blocks in the cache. A block is first mapped onto a set, and then the block can be placed anywhere within that set. The set is usually chosen by bit selection; that is,

(Block address) MOD (Number of sets in cache)

If there are n blocks in a set, the cache placement is called n-way set associative.

The range of caches from direct mapped to fully associative is really a continuum of levels of set associativity: Direct mapped is simply one-way set associative and a fully associative cache with m blocks could be called m-way set associative; equivalently, direct mapped can be thought of as having m sets and fully associative as having one set. The vast majority of processor caches today are direct mapped, two-way set associative, or four-way set associative, for reasons we shall see shortly.

Q2: How is a block found if it is in the cache?

Caches have an address tag on each block frame that gives the block address. The tag of every cache block that might contain the desired information is checked to see if it matches the block address from the CPU. As a rule, all possible tags are searched in parallel because speed is critical.

There must be a way to know that a cache block does not have valid information. The most common procedure is to add a valid bit to the tag to say whether or not this entry contains a valid address. If the bit is not set, there cannot be a match on this address.

Before proceeding to the next question, let’s explore the relationship of a CPU address to the cache. Figure 5.3 shows how an address is divided. The first division is between the block address and the block offset.
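The placement rules for Q1 can be sketched in a few lines of Python. This is a minimal illustration rather than anything from the original text: the function name `candidate_frames` is hypothetical, and it assumes the block frames of a set are numbered contiguously, as Figure 5.2 draws them.

```python
def candidate_frames(block_address, num_frames, ways):
    """Return the cache block frames in which a memory block may be placed.

    ways == 1 is direct mapped, ways == num_frames is fully associative,
    and anything in between is ways-way set associative.
    """
    if num_frames % ways != 0:
        raise ValueError("ways must divide the number of block frames")
    num_sets = num_frames // ways
    set_index = block_address % num_sets      # (Block address) MOD (Number of sets)
    first_frame = set_index * ways            # assumption: sets occupy adjacent frames
    return list(range(first_frame, first_frame + ways))


# Block 12 in the eight-frame cache of Figure 5.2:
print(candidate_frames(12, 8, 1))   # direct mapped
print(candidate_frames(12, 8, 2))   # two-way set associative
print(candidate_frames(12, 8, 8))   # fully associative
```

Treating direct mapped and fully associative as the two ends of one `ways` parameter mirrors the continuum of set associativity described above.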
The block frame address can be further divided into the tag field and the index field. The block offset field selects the desired data from the block, the index field selects the set, and the tag field is compared against it for a hit. While the comparison could be made on more of the address than the tag, there is no need because of the following:

■ Checking the index would be redundant, since it was used to select the set to be checked; an address stored in set 0, for example, must have 0 in the index field or it couldn’t be stored in set 0.
■ The offset is unnecessary in the comparison since the entire block is present or not, and hence all block offsets must match.

[Figure 5.3: an address divided into the block address (tag field, then index field) followed by the block offset.]

FIGURE 5.3 The three portions of an address in a set-associative or direct-mapped cache. The tag is used to check all the blocks in the set and the index is used to select the set. The block offset is the address of the desired data within the block.

If the total cache size is kept the same, increasing associativity increases the number of blocks per set, thereby decreasing the size of the index and increasing the size of the tag. That is, the tag-index boundary in Figure 5.3 moves to the right with increasing associativity, with the end case of fully associative caches having no index field.

Q3: Which block should be replaced on a cache miss?

When a miss occurs, the cache controller must select a block to be replaced with the desired data. A benefit of direct-mapped placement is that hardware decisions are simplified; in fact, so simple that there is no choice: Only one block frame is checked for a hit, and only that block can be replaced. With fully associative or set-associative placement, there are many blocks to choose from on a miss. There are two primary strategies employed for selecting which block to replace:

■ Random: To spread allocation uniformly, candidate blocks are randomly selected.
Some systems generate pseudorandom block numbers to get reproducible behavior, which is particularly useful when debugging hardware.
■ Least-recently used (LRU): To reduce the chance of throwing out information that will be needed soon, accesses to blocks are recorded. The block replaced is the one that has been unused for the longest time. LRU makes use of a corollary of locality: If recently used blocks are likely to be used again, then the best candidate for disposal is the least-recently used block.

A virtue of random replacement is that it is simple to build in hardware. As the number of blocks to keep track of increases, LRU becomes increasingly expensive and is frequently only approximated. Figure 5.4 shows the difference in miss rates between LRU and random replacement.

Associativity   Two-way            Four-way           Eight-way
Size            LRU      Random    LRU      Random    LRU      Random
16 KB           5.18%    5.69%     4.67%    5.29%     4.39%    4.96%
64 KB           1.88%    2.01%     1.54%    1.66%     1.39%    1.53%
256 KB          1.15%    1.17%     1.13%    1.13%     1.12%    1.12%

FIGURE 5.4 Miss rates comparing least-recently used versus random replacement for several sizes and associativities. These data were collected for a block size of 16 bytes using one of the VAX traces containing user and operating system code. There is little difference between LRU and random for larger-size caches in this trace. Although not included in the table, a first-in, first-out order replacement policy is worse than random or LRU.

Q4: What happens on a write?

Reads dominate processor cache accesses. All instruction accesses are reads, and most instructions don’t write to memory. Figure 2.26 on page 105 in Chapter 2 suggests a mix of 9% stores and 26% loads for DLX programs, making writes 9%/(100% + 26% + 9%) or about 7% of the overall memory traffic and 9%/(26% + 9%) or about 25% of the data cache traffic.
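The two write fractions quoted for DLX follow directly from the instruction mix. As a quick check (a sketch only; the variable names are illustrative, and it assumes exactly one instruction fetch per instruction, as the 100% term in the text implies):

```python
# Memory references per 100 instructions, from the DLX mix quoted in the text.
instruction_fetches = 100
loads = 26
stores = 9

writes_of_all_traffic = stores / (instruction_fetches + loads + stores)
writes_of_data_traffic = stores / (loads + stores)

print(f"writes as a share of all memory traffic: {writes_of_all_traffic:.1%}")
print(f"writes as a share of data cache traffic: {writes_of_data_traffic:.1%}")
```

The results land close to the rounded figures of about 7% and about 25% given above.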
Making the common case fast means optimizing caches for reads, especially since processors traditionally wait for reads to complete but need not wait for writes. Amdahl’s Law (section 1.6, page 29) reminds us, however, that high-performance designs cannot neglect the speed of writes.

Fortunately, the common case is also the easy case to make fast. The block can be read from cache at the same time that the tag is read and compared, so the block read begins as soon as the block address is available. If the read is a hit, the requested part of the block is passed on to the CPU immediately. If it is a miss, there is no benefit, but also no harm; just ignore the value read.

Such is not the case for writes. Modifying a block cannot begin until the tag is checked to see if the address is a hit. Because tag checking cannot occur in parallel, writes normally take longer than reads. Another complexity is that the processor also specifies the size of the write, usually between 1 and 8 bytes; only that portion of a block can be changed. In contrast, reads can access more bytes than necessary without fear.

The write policies often distinguish cache designs. There are two basic options when writing to the cache:

■ Write through (or store through): The information is written to both the block in the cache and to the block in the lower-level memory.
■ Write back (also called copy back or store in): The information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.

To reduce the frequency of writing back blocks on replacement, a feature called the dirty bit is commonly used. This status bit indicates whether the block is dirty (modified while in the cache) or clean (not modified). If it is clean, the block is not written on a miss, since the lower level has identical information to the cache.

Both write back and write through have their advantages.
With write back, writes occur at the speed of the cache memory, and multiple writes within a block require only one write to the lower-level memory. Since some writes don’t go to memory, write back uses less memory bandwidth, making write back attractive in multiprocessors. With write through, read misses never result in writes to the lower level, and write through is easier to implement than write back. Write through also has the advantage that the next lower level has the most current copy of the data. This is important for I/O and for multiprocessors, which we examine in Chapters 6 and 8. As we shall see, I/O and multiprocessors are fickle: they want write back for processor caches to reduce the memory traffic and write through to keep the cache consistent with lower levels of the memory hierarchy.

When the CPU must wait for writes to complete during write through, the CPU is said to write stall. A common optimization to reduce write stalls is a write buffer, which allows the processor to continue as soon as the data is written to the buffer, thereby overlapping processor execution with memory updating. As we shall see shortly, write stalls can occur even with write buffers.

Since the data are not needed on a write, there are two common options on a write miss:

■ Write allocate (also called fetch on write): The block is loaded on a write miss, followed by the write-hit actions above. This is similar to a read miss.
■ No-write allocate (also called write around): The block is modified in the lower level and not loaded into the cache.

Although either write-miss policy could be used with write through or write back, write-back caches generally use write allocate (hoping that subsequent writes to that block will be captured by the cache) and write-through caches often use no-write allocate (since subsequent writes to that block will still have to go to memory).
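To make the interaction of the write-hit and write-miss policies concrete, here is a minimal sketch of a direct-mapped cache model that only counts traffic to the lower level. Everything in it is an assumption for illustration: the class `TinyCache`, its fields, and the eight-frame size are hypothetical, data values are not modeled, and no write buffer is included.

```python
class TinyCache:
    """Direct-mapped cache model that counts lower-level reads and writes."""

    def __init__(self, num_frames, write_back, write_allocate):
        self.num_frames = num_frames
        self.write_back = write_back          # False means write through
        self.write_allocate = write_allocate  # False means no-write allocate
        self.tags = [None] * num_frames       # block address held by each frame
        self.dirty = [False] * num_frames     # dirty bits (used by write back)
        self.memory_reads = 0                 # block fetches from the lower level
        self.memory_writes = 0                # writes sent to the lower level

    def _fill(self, block):
        frame = block % self.num_frames
        if self.tags[frame] is not None and self.dirty[frame]:
            self.memory_writes += 1           # write the dirty victim back first
        self.tags[frame] = block
        self.dirty[frame] = False
        self.memory_reads += 1

    def read(self, block):
        if self.tags[block % self.num_frames] != block:
            self._fill(block)

    def write(self, block):
        frame = block % self.num_frames
        hit = self.tags[frame] == block
        if not hit and self.write_allocate:
            self._fill(block)                 # fetch on write, then treat as a hit
            hit = True
        if hit and self.write_back:
            self.dirty[frame] = True          # defer the memory update
        else:
            self.memory_writes += 1           # write through, or write around


def run(write_back, write_allocate):
    cache = TinyCache(8, write_back, write_allocate)
    for _ in range(4):
        cache.write(3)                        # four writes to the same block
    cache.read(11)                            # block 11 maps to the same frame as 3
    return cache.memory_reads, cache.memory_writes


print("write back + write allocate:      ", run(True, True))
print("write through + no-write allocate:", run(False, False))
```

In this trace the write-back cache sends one block write to memory (when the dirty block is replaced), while the write-through cache sends all four writes, which illustrates why multiple writes within a block favor write back.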
An Example: The Alpha AXP 21064 Data Cache and Instruction Cache

To give substance to these ideas, Figure 5.5 shows the organization of the data cache in the Alpha AXP 21064 microprocessor that is found in the DEC 3000 Model 800 workstation. The cache contains 8192 bytes of data in 32-byte blocks with direct-mapped placement, write through with a four-block write buffer, and no-write allocate on a write miss.

Let’s trace a cache hit through the steps of a hit as labeled in Figure 5.5. (The four steps are shown as circled numbers.) As we shall see later (Figure 5.41), the 21064 microprocessor presents a 34-bit physical address to the cache for tag comparison. The address coming into the cache is divided into two fields: the 29-bit block address and 5-bit block offset. The block address is further divided into an address tag and cache index. Step 1 shows this division.

[Figure 5.5: block diagram. The CPU address is split into a block address (tag <21>, index <8>) and a block offset <5>. The index selects one of 256 blocks, each holding a valid bit <1>, a tag <21>, and data <256>. A comparator (=?) checks the tag, a 4:1 mux selects the data out, and a write buffer sits between the cache and lower-level memory. Circled numbers 1 to 4 mark the steps of a read hit.]

FIGURE 5.5 The organization of the data cache in the Alpha AXP 21064 microprocessor. The 8-KB cache is direct mapped with 32-byte blocks. It has 256 blocks selected by the 8-bit index. The four steps of a read hit, shown as circled numbers in order of occurrence, label this organization. Although we show a 4:1 multiplexer to select the desired 8 bytes, in reality the data RAM is organized 8 bytes wide and the multiplexer is unnecessary: 2 bits of the block offset join the index to supply the RAM address to select the proper 8 bytes (see Figure 5.8). Although not exercised in this example, the line from memory to the cache is used on a miss to load the cache.

The cache index selects the tag to be tested to see if the desired block is in the cache. The size of the index depends on cache size, block size, and set associativity.
The 21064 cache is direct mapped, so set associativity is set to one, and we calculate the index as follows:

$$2^{\text{index}} = \frac{\text{Cache size}}{\text{Block size} \times \text{Set associativity}} = \frac{8192}{32 \times 1} = 256 = 2^{8}$$

Hence the index is 8 bits wide, and the tag is 29 - 8 or 21 bits wide.

Index selection is step 2 in Figure 5.5. Remember that direct mapping allows the data to be read and sent to the CPU in parallel with the tag being read and checked.

After the tag is read from the cache, it is compared to the tag portion of the block address from the CPU. This is step 3 in the figure. To be sure the tag contains valid information, the valid bit must be set or else the results of the comparison are ignored.

Assuming the tag does match, the final step is to signal the CPU to load the data from the cache. The 21064 allows two clock cycles for these four steps, so the instructions in the following two clock cycles would stall if they tried to use the result of the load.

Handling writes is more complicated than handling reads in the 21064, as it is in any cache. If the word to be written is in the cache, the first three steps are the same. After the tag comparison indicates a hit, the data are written. (Section 5.5 shows how the 21064 avoids the extra time on write hits that this description implies.)

Since this is a write-through cache, the write process isn’t yet over. The data are also sent to a write buffer that can contain up to four blocks that each can hold four 64-bit words. If the write buffer is empty, the data and the full address are written in the buffer, and the write is finished from the CPU’s perspective; the CPU continues working while the write buffer prepares to write the word to memory.
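The index and tag widths derived for the 21064 data cache earlier in this passage can be reproduced, and an address split into its three fields, with the sketch below. The function names and the sample address `0x12345` are hypothetical, chosen only for illustration; the code assumes cache size, block size, and associativity are all powers of two.

```python
def cache_geometry(cache_bytes, block_bytes, ways, address_bits):
    """Return (offset_bits, index_bits, tag_bits) for a power-of-two cache."""
    num_sets = cache_bytes // (block_bytes * ways)
    offset_bits = block_bytes.bit_length() - 1   # log2 of a power of two
    index_bits = num_sets.bit_length() - 1
    tag_bits = address_bits - index_bits - offset_bits
    return offset_bits, index_bits, tag_bits


def split_address(address, offset_bits, index_bits):
    """Split a physical address into (tag, index, block offset)."""
    offset = address & ((1 << offset_bits) - 1)
    index = (address >> offset_bits) & ((1 << index_bits) - 1)
    tag = address >> (offset_bits + index_bits)
    return tag, index, offset


# 8-KB direct-mapped cache, 32-byte blocks, 34-bit physical address.
offset_bits, index_bits, tag_bits = cache_geometry(8192, 32, 1, 34)
print(offset_bits, index_bits, tag_bits)

# An arbitrary address, used only to show the field extraction.
print(split_address(0x12345, offset_bits, index_bits))
```

Raising `ways` while holding the cache size fixed shrinks `index_bits` and grows `tag_bits`, which is the movement of the tag-index boundary described for Figure 5.3.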
If the buffer contains other modified blocks, the addresses are checked to see if the address of this new data matches the address of the valid write buffer entry; if so, the new data are combined with that entry, called write merging. Without this optimization, four stores to sequential addresses would fill the buffer, even though these four words easily fit within a single block of the write buffer when merged. Figure 5.6 shows a write buffer with and without write merging. If the buffer is full and there is no address match, the cache (and CPU) must wait until the buffer has an empty entry. So far we have assumed the common case of a cache hit. What happens on a miss? On a read miss, the cache sends a stall signal to the CPU telling it to wait, and 32 bytes are read from the next level of the hierarchy. The path to the next lower level is 16 bytes wide in the DEC 3000 model 800 workstation, one of several models that use the 21064. That takes 5 clock cycles per transfer, or 10 clock cycles for all 32 bytes. Since the data cache is direct mapped, there is no choice on which block to replace. Replacing a block means updating the data, the address tag, and the valid bit. On a write miss, the CPU writes “around” the cache to lower-level memory and does not affect the cache; that is, the 21064 follows the no-write-allocate rule. We have seen how it works, but the data cache cannot supply all the memory needs of the processor: the processor also needs instructions. Although a single cache could try to supply both, it can be a bottleneck. For example, when a load or store instruction is executed, the pipelined processor will simultaneously request both a data word and an instruction word. Hence a single cache would present a structural hazard for loads and stores, leading to stalls. One simple way to conquer this problem is to divide it: one cache is dedicated to instructions and another to data. 
Separate caches are found in most recent processors, including the Alpha AXP 21064. It has an 8-KB instruction cache that is nearly identical to its 8-KB data cache in Figure 5.5.

Without write merging:

Write address   V   V   V   V
100             1   0   0   0
104             1   0   0   0
108             1   0   0   0
112             1   0   0   0

With write merging:

Write address   V   V   V   V
100             1   1   1   1
(empty)         0   0   0   0
(empty)         0   0   0   0
(empty)         0   0   0   0

FIGURE 5.6 To illustrate write merging, the write buffer on top does not use it while the write buffer on the bottom does. Each buffer has four entries, and each entry holds four 64-bit words. The address for each entry is on the left, with valid bits (V) indicating whether or not the next sequential four bytes are occupied in this entry. The four writes are merged into a single buffer entry with write merging; without it, all four entries are used. Without write merging, the blocks to the right in the upper drawing would only be used for instructions that wrote multiple words at the same time. (The Alpha is a 64-bit architecture so its buffer is really 8 bytes per word.)

The CPU knows whether it is issuing an instruction address or a data address, so there can be separate ports for both, thereby doubling the bandwidth between the memory hierarchy and the CPU. Separate caches also offer the opportunity of optimizing each cache separately: different capacities, block sizes, and associativities may lead to better performance. (In contrast to the instruction caches and data caches of the 21064, the terms unified or mixed are applied to caches that can contain either instructions or data.)

Figure 5.7 shows that instruction caches have lower miss rates than data caches. Separating instructions and data removes misses due to conflicts between instruction blocks and data blocks, but the split also fixes the cache space devoted to each type. Which is more important to miss rates? A fair comparison of separate instruction and data caches to unified caches requires the total cache size to be the same.
For example, a separate 1-KB instruction cache and 1-KB data cache should be compared to a 2-KB unified cache. Calculating the average miss rate with separate instruction and data caches necessitates knowing the percentage of memory references to each cache. Figure 2.26 on page 105 suggests the split is 100%/(100% + 26% + 9%) or about 75% instruction references to (26% + 9%)/(100% + 26% + 9%) or about 25% data references. Splitting affects performance beyond what is indicated by the change in miss rates, as we shall see in a little bit.

Size      Instruction cache   Data cache   Unified cache
1 KB      3.06%               24.61%       13.34%
2 KB      2.26%               20.57%       9.78%
4 KB      1.78%               15.94%       7.24%
8 KB      1.10%               10.19%       4.57%
16 KB     0.64%               6.47%        2.87%
32 KB     0.39%               4.82%        1.99%
64 KB     0.15%               3.77%        1.35%
128 KB    0.02%               2.88%        0.95%

FIGURE 5.7 Miss rates for instruction, data, and unified caches of different sizes. The data are for a direct-mapped cache with 32-byte blocks for an average of SPEC92 benchmarks on the DECstation 5000 [Gee et al. 1993]. The percentage of instruction references is about 75%.

Cache Performance

Because instruction count is independent of the hardware, it is tempting to evaluate CPU performance using that number. As we saw in Chapter 1, however, such indirect performance measures have waylaid many a computer designer. The corresponding temptation for evaluating memory-hierarchy performance is to concentrate on miss rate, because it, too, is independent of the speed of the hardware. As we shall see, miss rate can be just as misleading as instruction count. A better measure of memory-hierarchy performance is the average time to access memory:

Average memory access time = Hit time + Miss rate × Miss penalty

where Hit time is the time to hit in the cache; we have seen the other two terms before.
The components of average access time can be measured either in absolute time (say, 2 nanoseconds on a hit) or in the number of clock cycles that the CPU waits for the memory (such as a miss penalty of 50 clock cycles). Remember that average memory access time is still an indirect measure of performance; although it is a better measure than miss rate, it is not a substitute for execution time. This formula can help us decide between split caches and a unified cache.

EXAMPLE Which has the lower miss rate: a 16-KB instruction cache with a 16-KB data cache or a 32-KB unified cache? Use the miss rates in Figure 5.7 to help calculate the correct answer. Assume a hit takes 1 clock cycle and the miss penalty is 50 clock cycles, and a load or store hit takes 1 extra clock cycle on a unified cache since there is only one cache port to satisfy two simultaneous requests. Using the pipelining terminology of the previous chapter, the unified cache leads to a structural hazard. What is the average memory access time in each case? Assume write-through caches with a write buffer and ignore stalls due to the write buffer.

ANSWER As stated above, about 75% of the memory accesses are instruction references. Thus, the overall miss rate for the split caches is

(75% × 0.64%) + (25% × 6.47%) = 2.10%

According to Figure 5.7, a 32-KB unified cache has a slightly lower miss rate of 1.99%.
The average memory access time formula can be divided into instruction and data accesses:

Average memory access time = % instructions × (Hit time + Instruction miss rate × Miss penalty) + % data × (Hit time + Data miss rate × Miss penalty)

So the time for each organization is

Average memory access time (split) = 75% × (1 + 0.64% × 50) + 25% × (1 + 6.47% × 50)
= (75% × 1.32) + (25% × 4.235) = 0.990 + 1.059 = 2.05

Average memory access time (unified) = 75% × (1 + 1.99% × 50) + 25% × (1 + 1 + 1.99% × 50)
= (75% × 1.995) + (25% × 2.995) = 1.496 + 0.749 = 2.24

Hence the split caches in this example, which offer two memory ports per clock cycle and thereby avoid the structural hazard, have a better average memory access time than the single-ported unified cache even though their effective miss rate is higher.

In Chapter 1 we saw another formula for the memory hierarchy:

CPU time = (CPU execution clock cycles + Memory stall clock cycles) × Clock cycle time

To simplify evaluation of cache alternatives, sometimes designers assume that all memory stalls are due to cache misses since the memory hierarchy typically dominates other reasons for stalls, such as contention due to I/O devices using memory. We use this simplifying assumption here, but it is important to account for all memory stalls when calculating final performance!

The CPU time formula above raises the question whether the clock cycles for a cache hit should be considered part of CPU execution clock cycles or part of memory stall clock cycles. Although either convention is defensible, the most widely accepted is to include hit clock cycles in CPU execution clock cycles.
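The split-versus-unified comparison in the example above can be checked numerically. The sketch below simply re-evaluates the average memory access time formula with the numbers from the example; the helper name `amat` and the variable names are illustrative only.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = Hit time + Miss rate x Miss penalty."""
    return hit_time + miss_rate * miss_penalty


instr_fraction, data_fraction = 0.75, 0.25
miss_penalty = 50            # clock cycles

# Miss rates from Figure 5.7: 16-KB instruction, 16-KB data, 32-KB unified.
i_miss, d_miss, u_miss = 0.0064, 0.0647, 0.0199

split_miss_rate = instr_fraction * i_miss + data_fraction * d_miss
split_amat = (instr_fraction * amat(1, i_miss, miss_penalty)
              + data_fraction * amat(1, d_miss, miss_penalty))
# Loads and stores pay one extra cycle on the single-ported unified cache.
unified_amat = (instr_fraction * amat(1, u_miss, miss_penalty)
                + data_fraction * amat(2, u_miss, miss_penalty))

print(f"split miss rate:   {split_miss_rate:.2%}")
print(f"split AMAT:        {split_amat:.3f} clock cycles")
print(f"unified AMAT:      {unified_amat:.3f} clock cycles")
```

Carried to three decimals the unified figure comes out near 2.245; the text's 2.24 reflects rounding of the intermediate terms. Either way the split organization wins on access time despite its higher miss rate.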
Memory stall clock cycles can then be defined in terms of the number of memory accesses per program, miss penalty (in clock cycles), and miss rate for reads and writes:

Memory stall clock cycles = Reads × Read miss rate × Read miss penalty + Writes × Write miss rate × Write miss penalty

We often simplify the complete formula by combining the reads and writes and finding the average miss rates and miss penalty for reads and writes:

Memory stall clock cycles = Memory accesses × Miss rate × Miss penalty

This formula is an approximation since the miss rates and miss penalties are often different for reads and writes. Factoring instruction count (IC) from execution time and memory stall cycles, we now get a CPU time formula that includes memory accesses per instruction, miss rate, and miss penalty:

$$\text{CPU time} = \text{IC} \times \left( \text{CPI}_{\text{execution}} + \frac{\text{Memory accesses}}{\text{Instruction}} \times \text{Miss rate} \times \text{Miss penalty} \right) \times \text{Clock cycle time}$$

Some designers prefer measuring miss rate as misses per instruction rather than misses per memory reference:

$$\frac{\text{Misses}}{\text{Instruction}} = \frac{\text{Memory accesses}}{\text{Instruction}} \times \text{Miss rate}$$

The advantage of this measure is that it is independent of the hardware implementation. For example, the 21064 instruction prefetch unit can make repeated references to a single word (see section 5.10), which can artificially reduce the miss rate if measured as misses per memory reference rather than per instruction executed. The drawback is that misses per instruction is architecture dependent; for example, the average number of memory accesses per instruction may be very different for an 80x86 versus DLX. Thus misses per instruction is most popular with architects working with a single computer family.
They then use this version of the CPU time formula:

$$\text{CPU time} = \text{IC} \times \left( \text{CPI}_{\text{execution}} + \frac{\text{Memory stall clock cycles}}{\text{Instruction}} \right) \times \text{Clock cycle time}$$

We can now explore the impact of caches on performance.

EXAMPLE
Let’s use a machine similar to the Alpha AXP as a first example. Assume the cache miss penalty is 50 clock cycles, and all instructions normally take 2.0 clock cycles (ignoring memory stalls). Assume the miss rate is 2%, and there is an average of 1.33 memory references per instruction. What is the impact on performance when behavior of the cache is included?

ANSWER

$$\text{CPU time} = \text{IC} \times \left( \text{CPI}_{\text{execution}} + \frac{\text{Memory stall clock cycles}}{\text{Instruction}} \right) \times \text{Clock cycle time}$$

The performance, including cache misses, is

$$\text{CPU time}_{\text{with cache}} = \text{IC} \times (2.0 + (1.33 \times 2\% \times 50)) \times \text{Clock cycle time} = \text{IC} \times 3.33 \times \text{Clock cycle time}$$

The clock cycle time and instruction count are the same, with or without a cache, so CPU time increases with CPI from 2.0 for a “perfect cache” to 3.33 with a cache that can miss. Hence, including the memory hierarchy in the CPI calculations stretches the CPU time by a factor of 1.67. Without any memory hierarchy at all the CPI would increase to 2.0 + 50 × 1.33 or 68.5, a factor of over 30 times longer!

As this example illustrates, cache behavior can have enormous impact on performance. Furthermore, cache misses have a double-barreled impact on a CPU with a low CPI and a fast clock:

1. The lower the CPI_execution, the higher the relative impact of a fixed number of cache miss clock cycles.

2. When calculating CPI, the cache miss penalty is measured in CPU clock cycles for a miss. Therefore, even if memory hierarchies for two computers are identical, the CPU with the higher clock rate has a larger number of clock cycles per miss and hence the memory portion of CPI is higher.
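The CPI arithmetic of the example above is easy to reproduce. The following Python sketch is only an illustration of the formula CPI = CPI_execution + (memory accesses per instruction) × miss rate × miss penalty; the helper name `cpi_with_stalls` is our own, and the "no cache" case simply models every access as a miss.

```python
def cpi_with_stalls(cpi_execution, accesses_per_instr, miss_rate, miss_penalty):
    """Effective CPI once memory stall cycles per instruction are added in."""
    return cpi_execution + accesses_per_instr * miss_rate * miss_penalty


CPI_EXECUTION = 2.0      # clock cycles per instruction, ignoring memory stalls
ACCESSES_PER_INSTR = 1.33
MISS_PENALTY = 50        # clock cycles

cpi_perfect = CPI_EXECUTION
cpi_real = cpi_with_stalls(CPI_EXECUTION, ACCESSES_PER_INSTR, 0.02, MISS_PENALTY)
cpi_no_cache = cpi_with_stalls(CPI_EXECUTION, ACCESSES_PER_INSTR, 1.0, MISS_PENALTY)

# Instruction count and clock cycle time are unchanged, so CPU time scales with CPI.
slowdown = cpi_real / cpi_perfect

print(f"CPI with a 2% miss rate:  {cpi_real:.2f}")
print(f"CPI with no cache at all: {cpi_no_cache:.1f}")
print(f"slowdown versus a perfect cache: {slowdown:.3f}")
```

Lowering `CPI_EXECUTION` while holding the other inputs fixed makes `slowdown` grow, which is the first of the two effects listed above.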
The importance of the cache for CPUs with low CPI and high clock rates is thus greater, and, consequently, greater is the danger of neglecting cache behavior in assessing performance of such machines. Amdahl’s Law strikes again!

Although minimizing average memory access time is a reasonable goal and we will use it in much of this chapter, keep in mind that the final goal is to reduce CPU execution time. The next example shows how these two can differ.

EXAMPLE
What is the impact of two different cache organizations on the performance of a CPU? Assume that the CPI with a perfect cache is 2.0 and the clock cycle time is 2 ns, that there are 1.3 memory references per instruction, and that the size of both caches is 64 KB and both have a block size of 32 bytes. One cache is direct mapped and the other is two-way set associative. Figure 5.8 shows that for set-associative caches we must add a multiplexer to select between the blocks in the set depending on the tag match. Since the speed of the CPU is tied directly to the speed of a cache hit, assume the CPU clock cycle time must be stretched 1.10 times to accommodate the selection multiplexer of the set-associative cache. To the first approximation, the cache miss penalty is 70 ns for either cache organization. (In practice it must be rounded up or down to an integer number of clock cycles.) First, calculate the average memory access time, and then CPU performance. Assume the hit time is one clock cycle. Assume that the miss rate of a direct-mapped 64-KB cache is 1.4%, and the miss rate for a two-way set-associative cache of the same size is 1.0%.

[Figure 5.8: block diagram. The CPU address is divided into a block address (tag <22>, index <7>) and a block offset <5>; each way holds a valid bit <1>, a tag <22>, and data <64>, with two tag comparators feeding a 2:1 multiplexer, and a write buffer leading to lower-level memory.]

FIGURE 5.8 A two-way set-associative version of the 8-KB cache of Figure 5.5, showing the extra multiplexer in the path.
Unlike the prior figure, the data portion of the cache is drawn more realistically, with the two leftmost bits of the block offset combined with the index to address the desired 64-bit word in memory, which is then sent to the CPU.

ANSWER
Average memory access time is

Average memory access time = Hit time + Miss rate × Miss penalty

Thus, the time for each organization is

$$\begin{aligned}
\text{Average memory access time}_{\text{1-way}} &= 2.0 + (0.014 \times 70) = 2.98\ \text{ns} \\
\text{Average memory access time}_{\text{2-way}} &= 2.0 \times 1.10 + (0.010 \times 70) = 2.90\ \text{ns}
\end{aligned}$$

The average memory access time is better for the two-way set-associative cache. CPU performance is

$$\begin{aligned}
\text{CPU time} &= \text{IC} \times \left( \text{CPI}_{\text{execution}} + \frac{\text{Misses}}{\text{Instruction}} \times \text{Miss penalty} \right) \times \text{Clock cycle time} \\
&= \text{IC} \times \left[ \left( \text{CPI}_{\text{execution}} \times \text{Clock cycle time} \right) + \left( \frac{\text{Memory accesses}}{\text{Instruction}} \times \text{Miss rate} \times \text{Miss penalty} \times \text{Clock cycle time} \right) \right]
\end{aligned}$$

Substituting 70 ns for (Miss penalty × Clock cycle time), the performance of each cache organization is

$$\begin{aligned}
\text{CPU time}_{\text{1-way}} &= \text{IC} \times (2 \times 2.0 + (1.3 \times 0.014 \times 70)) = 5.27 \times \text{IC} \\
\text{CPU time}_{\text{2-way}} &= \text{IC} \times (2 \times 2.0 \times 1.10 + (1.3 \times 0.010 \times 70)) = 5.31 \times \text{IC}
\end{aligned}$$

and relative performance is

$$\frac{\text{CPU time}_{\text{2-way}}}{\text{CPU time}_{\text{1-way}}} = \frac{5.31 \times \text{Instruction count}}{5.27 \times \text{Instruction count}} = \frac{5.31}{5.27} = 1.01$$

In contrast to the results of average memory access time comparison, the direct-mapped cache leads to slightly better average performance because the clock cycle is stretched for all instructions for the two-way case, even if there are fewer misses. Since CPU time is our bottom-line evaluation, and since direct mapped is simpler to build, the preferred cache is direct mapped in this example.

Improving Cache Performance

The increasing gap between CPU and main memory speeds shown in Figure 5.1 has attracted the attention of many architects.
A bibliographic search for the years 1989–95 revealed more than 1600 research papers on the subject of caches. Your authors’ job was to survey all 1600 papers, decide what is and is not worthwhile, translate the results into a common terminology, reduce the results to their essence, write in an intriguing fashion, and provide just the right amount of detail! Fortunately, the average memory access time formula gave us a framework to present cache optimizations as well as the techniques for improving caches:

Average memory access time = Hit time + Miss rate × Miss penalty

Hence we organize 15 cache optimizations into three categories:

- Reducing the miss rate (Section 5.3)
- Reducing the miss penalty (Section 5.4)
- Reducing the time to hit in the cache (Section 5.5)

Figure 5.29 on page 427 concludes with a summary of the implementation complexity and the performance benefits of the 15 techniques presented.

5.3 Reducing Cache Misses

Most cache research has concentrated on reducing the miss rate, so that is where we start our exploration. To gain better insights into the causes of misses, we start with a model that sorts all misses into three simple categories:

- Compulsory: The very first access to a block cannot be in the cache, so the block must be brought into the cache. These are also called cold start misses or first reference misses.
- Capacity: If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur because of blocks being discarded and later retrieved.
- Conflict: If the block placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. These are also called collision misses or interference misses.
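The three categories above can be made concrete with a small trace-driven classifier. The Python sketch below is an illustration under stated assumptions, not the measurement method behind any figure in this chapter: it assumes LRU replacement, takes a trace of block addresses, treats a miss as compulsory if the block has never been referenced, as a capacity miss if a fully associative cache of the same total size would also have missed, and as a conflict miss otherwise. The class and function names are hypothetical.

```python
from collections import OrderedDict


class LRUSet:
    """One cache set with LRU replacement, holding at most `ways` blocks."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()   # least recently used block is first

    def access(self, block):
        """Return True on a hit; on a miss, insert the block and evict the LRU one."""
        if block in self.blocks:
            self.blocks.move_to_end(block)
            return True
        if len(self.blocks) >= self.ways:
            self.blocks.popitem(last=False)
        self.blocks[block] = True
        return False


def classify_misses(trace, num_blocks, ways):
    """Count compulsory, capacity, and conflict misses for a trace of block addresses."""
    num_sets = num_blocks // ways
    sets = [LRUSet(ways) for _ in range(num_sets)]
    fully_assoc = LRUSet(num_blocks)   # same capacity, no placement restriction
    seen = set()
    counts = {"compulsory": 0, "capacity": 0, "conflict": 0}
    for block in trace:
        hit = sets[block % num_sets].access(block)
        fa_hit = fully_assoc.access(block)
        if not hit:
            if block not in seen:
                counts["compulsory"] += 1
            elif not fa_hit:
                counts["capacity"] += 1
            else:
                counts["conflict"] += 1
        seen.add(block)
    return counts


# Blocks 0 and 4 map to the same line of a 4-block direct-mapped cache.
ping_pong = [0, 4, 0, 4, 0, 4]
direct_mapped = classify_misses(ping_pong, num_blocks=4, ways=1)
two_way = classify_misses(ping_pong, num_blocks=4, ways=2)

# Five distinct blocks do not fit in four block frames, whatever the placement.
too_big = classify_misses([0, 1, 2, 3, 4, 0], num_blocks=4, ways=4)

print(direct_mapped, two_way, too_big)
```

In the ping-pong trace every miss after the first two disappears once the two blocks can share a set, which is why they are counted as conflict misses; the last reference in the third trace misses even with full associativity and so is charged to capacity.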
Figure 5.9 shows the relative frequency of cache misses, broken down by the “three C’s.” Figure 5.10 presents the same data graphically. The top graph shows absolute miss rates; the bottom graph plots percentage of all the misses by type of miss as a function of cache size. To show the benefit of associativity, conflict misses are divided into misses caused by each decrease in associativity. Here are the four divisions:

- Eight-way: conflict misses due to going from fully associative (no conflicts) to eight-way associative
- Four-way: conflict misses due to going from eight-way associative to four-way associative
- Two-way: conflict misses due to going from four-way associative to two-way associative
- One-way: conflict misses due to going from two-way associative to one-way associative (direct mapped)

Miss rate components: absolute miss rate and relative percent (Sum = 100% of total miss rate)

Cache    Degree       Total
size     associative  miss rate  Compulsory    Capacity      Conflict
1 KB     1-way        0.133      0.002   1%    0.080  60%    0.052  39%
1 KB     2-way        0.105      0.002   2%    0.080  76%    0.023  22%
1 KB     4-way        0.095      0.002   2%    0.080  84%    0.013  14%
1 KB     8-way        0.087      0.002   2%    0.080  92%    0.005   6%
2 KB     1-way        0.098      0.002   2%    0.044  45%    0.052  53%
2 KB     2-way        0.076      0.002   2%    0.044  58%    0.030  39%
2 KB     4-way        0.064      0.002   3%    0.044  69%    0.018  28%
2 KB     8-way        0.054      0.002   4%    0.044  82%    0.008  14%
4 KB     1-way        0.072      0.002   3%    0.031  43%    0.039  54%
4 KB     2-way        0.057      0.002   3%    0.031  55%    0.024  42%
4 KB     4-way        0.049      0.002   4%    0.031  64%    0.016  32%
4 KB     8-way        0.039      0.002   5%    0.031  80%    0.006  15%
8 KB     1-way        0.046      0.002   4%    0.023  51%    0.021  45%
8 KB     2-way        0.038      0.002   5%    0.023  61%    0.013  34%
8 KB     4-way        0.035      0.002   5%    0.023  66%    0.010  28%
8 KB     8-way        0.029      0.002   6%    0.023  79%    0.004  15%
16 KB    1-way        0.029      0.002   7%    0.015  52%    0.012  42%
16 KB    2-way        0.022      0.002   9%    0.015  68%    0.005  23%
16 KB    4-way        0.020      0.002  10%    0.015  74%    0.003  17%
16 KB    8-way        0.018      0.002  10%    0.015  80%    0.002   9%
32 KB    1-way        0.020      0.002  10%    0.010  52%    0.008  38%
32 KB    2-way        0.014      0.002  14%    0.010  74%    0.002  12%
32 KB    4-way        0.013      0.002  15%    0.010  79%    0.001   6%
32 KB    8-way        0.013      0.002  15%    0.010  81%    0.001   4%
64 KB    1-way        0.014      0.002  14%    0.007  50%    0.005  36%
64 KB    2-way        0.010      0.002  20%    0.007  70%    0.001  10%
64 KB    4-way        0.009      0.002  21%    0.007  75%    0.000   3%
64 KB    8-way        0.009      0.002  22%    0.007  78%    0.000   0%
128 KB   1-way        0.010      0.002  20%    0.004  40%    0.004  40%
128 KB   2-way        0.007      0.002  29%    0.004  58%    0.001  14%
128 KB   4-way        0.006      0.002  31%    0.004  61%    0.001   8%
128 KB   8-way        0.006      0.002  31%    0.004  62%    0.000   7%

FIGURE 5.9 Total miss rate for each size cache and percentage of each according to the “three C’s.” Compulsory misses are independent of cache size, while capacity misses decrease as capacity increases, and conflict misses decrease as associativity increases. Gee et al. [1993] calculated the average D-cache miss rate for the SPEC92 benchmark suite with 32-byte blocks and LRU replacement on a DECstation 5000. Figure 5.10 shows the same information graphically. The compulsory rate was calculated as the miss rate of a fully associative 1-MB cache. Note that the 2:1 cache rule of thumb (inside front cover) is supported by the statistics in this table: a direct-mapped cache of size N has about the same miss rate as a 2-way set-associative cache of size N/2.

[Figure 5.10: two plots of miss rate per type versus cache size (1 KB to 128 KB), with bands for 1-way, 2-way, 4-way, and 8-way conflict misses, capacity misses, and compulsory misses. Top: absolute miss rate, 0 to 0.14. Bottom: share of the total miss rate, 0% to 100%.]

FIGURE 5.10 Total miss rate (top) and distribution of miss rate (bottom) for each size cache according to three C’s for the data in Figure 5.9. The top diagram is the actual D-cache miss rates, while the bottom diagram is scaled to the direct-mapped miss ratios.

As we can see from the figures, the compulsory miss rate of the SPEC92 programs is very small, as it is for many long-running programs.
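The 2:1 cache rule of thumb mentioned in the caption of Figure 5.9 can be checked directly against that figure's total miss rates. The Python sketch below copies the 1-way and 2-way columns of the figure and compares each direct-mapped cache of size N with the two-way cache of size N/2; the dictionary and variable names are ours.

```python
# Total miss rates from Figure 5.9, keyed by cache size in KB.
DIRECT_MAPPED = {1: 0.133, 2: 0.098, 4: 0.072, 8: 0.046,
                 16: 0.029, 32: 0.020, 64: 0.014, 128: 0.010}
TWO_WAY = {1: 0.105, 2: 0.076, 4: 0.057, 8: 0.038,
           16: 0.022, 32: 0.014, 64: 0.010, 128: 0.007}

# Pair each direct-mapped cache of size N with the two-way cache of size N/2.
comparison = {
    size: (DIRECT_MAPPED[size], TWO_WAY[size // 2])
    for size in DIRECT_MAPPED
    if size // 2 in TWO_WAY
}

for size, (dm_rate, half_size_two_way_rate) in comparison.items():
    print(f"{size:>3} KB direct mapped: {dm_rate:.3f}   "
          f"{size // 2:>2} KB two-way: {half_size_two_way_rate:.3f}")

# Largest absolute disagreement between the paired miss rates.
max_gap = max(abs(dm - tw) for dm, tw in comparison.values())
```

For these data the paired miss rates never differ by more than about one percentage point, and they coincide at the two largest sizes, which is the sense in which the table supports the rule.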
Having identified the three C’s, what can a computer designer do about them? Conceptually, conflicts are the easiest: Fully associative placement avoids all conflict misses. Full associativity is expensive in hardware, however, and may slow the processor clock rate (see the example above), leading to lower overall performance.

There is little to be done about capacity except to enlarge the cache. If the upper-level memory is much smaller than what is needed for a program, and a significant percentage of the time is spent moving data between two levels in the hierarchy, the memory hierarchy is said to thrash. Because so many replacements are required, thrashing means the machine runs close to the speed of the lower-level memory, or maybe even slower because of the miss overhead.

Another approach to improving the three C’s is to make blocks larger to reduce the number of compulsory misses, but, as we shall see, large blocks can increase other kinds of misses.

The three C’s give insight into the cause of misses, but this simple model has its limits; it gives you insight into average behavior but may not explain an individual miss. For example, changing cache size changes conflict misses as well as capacity misses, since a larger cache spreads out references to more blocks. Thus, a miss might move from a capacity miss to a conflict miss as cache size changes. Note that the three C’s also ignore replacement policy, since it is difficult to model and since, in general, it is less significant. In specific circumstances the replacement policy can actually lead to anomalous behavior, such as poorer miss rates for larger associativity, which is contradictory to the three C’s model.

Alas, many of the techniques that reduce miss rates also increase hit time or miss penalty. The desirability of reducing miss rates using the seven techniques presented in the rest of this section must be balanced against the goal of making the whole system fast.
This first example shows the importance of a balanced perspective.

First Miss Rate Reduction Technique: Larger Block Size

The simplest way to reduce miss rate is to increase the block size. Figure 5.11 shows the trade-off of block size versus miss rate for a set of programs and cache sizes. Larger block sizes will reduce compulsory misses. This reduction occurs because the principle of locality has two components: temporal locality and spatial locality. Larger blocks take advantage of spatial locality.

At the same time, larger blocks increase the miss penalty. Since they reduce the number of blocks in the cache, larger blocks may increase conflict misses and even capacity misses if the cache is small. Clearly there is little reason to increase the block size to such a size that it increases the miss rate, but there is also no benefit to reducing miss rate if it increases the average memory access time; the increase in miss penalty may outweigh the decrease in miss rate.

[Figure 5.11: line plot of miss rate (0% to 25%) versus block size (16 to 256 bytes), one line per cache size: 1k, 4k, 16k, 64k, and 256k.]

FIGURE 5.11 Miss rate versus block size for five different-sized caches. Each line represents a cache of different size. Figure 5.12 shows the data used to plot these lines. This graph is based on the same measurements found in Figure 5.10.

              Cache size
Block size    1K        4K       16K      64K      256K
16            15.05%    8.57%    3.94%    2.04%    1.09%
32            13.34%    7.24%    2.87%    1.35%    0.70%
64            13.76%    7.00%    2.64%    1.06%    0.51%
128           16.64%    7.78%    2.77%    1.02%    0.49%
256           22.01%    9.51%    3.29%    1.15%    0.49%

FIGURE 5.12 Actual miss rate versus block size for five different-sized caches in Figure 5.11. Note that for a 1-KB cache, 64-byte, 128-byte, and 256-byte blocks have a higher miss rate than 32-byte blocks. In this example, the cache would have to be 256 KB in order for a 256-byte block to decrease misses.

EXAMPLE
Figure 5.12 shows the actual miss rates plotted in Figure 5.11.
Assume the memory system takes 40 clock cycles of overhead and then delivers 16 bytes every 2 clock cycles. Thus, it can supply 16 bytes in 42 clock cycles, 32 bytes in 44 clock cycles, and so on. Which block size has the minimum average memory access time for each cache size in Figure 5.12?

ANSWER
Average memory access time is

Average memory access time = Hit time + Miss rate × Miss penalty

If we assume the hit time is one clock cycle independent of block size, then the access time for a 16-byte block in a 1-KB cache is

Average memory access time = 1 + (15.05% × 42) = 7.321 clock cycles

and for a 256-byte block in a 256-KB cache the average memory access time is

Average memory access time = 1 + (0.49% × 72) = 1.353 clock cycles

Figure 5.13 shows the average memory access time for all block and cache sizes between those two extremes. The entries marked with an asterisk show the fastest block size for a given cache size: 32 bytes for 1-KB, 4-KB, and 16-KB caches and 64 bytes for the larger caches. These sizes are, in fact, popular block sizes for processor caches today.

                            Cache size
Block size   Miss penalty   1K        4K       16K      64K      256K
16           42             7.321     4.599    2.655    1.857    1.458
32           44             6.870*    4.186*   2.263*   1.594    1.308
64           48             7.605     4.360    2.267    1.509*   1.245*
128          56             10.318    5.357    2.551    1.571    1.274
256          72             16.847    7.847    3.369    1.828    1.353

FIGURE 5.13 Average memory access time versus block size for five different-sized caches in Figure 5.11. The smallest average time per cache size is marked with an asterisk.

As in all of these techniques, the cache designer is trying to minimize both the miss rate and the miss penalty. The selection of block size depends on both the latency and bandwidth of the lower-level memory: high latency and high bandwidth encourage large block size since the cache gets many more bytes per miss for a small increase in miss penalty.
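The block-size example above lends itself to a short calculation. The Python sketch below takes the miss rates of Figure 5.12 and the memory parameters stated in the example (40 clock cycles of overhead, then 16 bytes every 2 clock cycles, and a 1-cycle hit) and searches for the block size with the lowest average memory access time for each cache size. The names `miss_penalty`, `amat`, and `best_block` are ours, chosen for illustration.

```python
# Miss rates in percent from Figure 5.12: block size (bytes) -> cache size -> rate.
MISS_RATE = {
    16:  {"1K": 15.05, "4K": 8.57, "16K": 3.94, "64K": 2.04, "256K": 1.09},
    32:  {"1K": 13.34, "4K": 7.24, "16K": 2.87, "64K": 1.35, "256K": 0.70},
    64:  {"1K": 13.76, "4K": 7.00, "16K": 2.64, "64K": 1.06, "256K": 0.51},
    128: {"1K": 16.64, "4K": 7.78, "16K": 2.77, "64K": 1.02, "256K": 0.49},
    256: {"1K": 22.01, "4K": 9.51, "16K": 3.29, "64K": 1.15, "256K": 0.49},
}
CACHE_SIZES = ["1K", "4K", "16K", "64K", "256K"]


def miss_penalty(block_bytes, overhead=40, bytes_per_transfer=16, cycles_per_transfer=2):
    """Clock cycles to fetch one block: fixed latency plus transfer time."""
    return overhead + cycles_per_transfer * (block_bytes // bytes_per_transfer)


def amat(block_bytes, cache_size, hit_time=1):
    """Average memory access time in clock cycles for one table entry."""
    rate = MISS_RATE[block_bytes][cache_size] / 100
    return hit_time + rate * miss_penalty(block_bytes)


# Block size that minimizes average memory access time for each cache size.
best_block = {
    size: min(MISS_RATE, key=lambda block: amat(block, size))
    for size in CACHE_SIZES
}

for size in CACHE_SIZES:
    block = best_block[size]
    print(f"{size:>4} cache: best block = {block:>3} bytes, "
          f"AMAT = {amat(block, size):.3f} cycles")
```

Changing `overhead` or `cycles_per_transfer` in `miss_penalty` is a quick way to see how the latency and bandwidth of the lower-level memory shift the preferred block size.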
Conversely, low latency and low bandwidth encourage smaller block sizes since there is little time saved from a larger block (twice the miss penalty of a small block may be close to the penalty of a block twice the size) and the larger number of small blocks may reduce conflict misses.

After seeing the positive and negative impact of larger block size on compulsory and capacity misses, we next look at the potential of higher associativity to reduce conflict misses.

Second Miss Rate Reduction Technique: Higher Associativity

Figures 5.9 and 5.10 above show how miss rates improve with higher associativity. There are two general rules of thumb that can be gleaned from these figures. The first is that eight-way set associative is for practical purposes as effective in reducing misses for these sized caches as fully associative. The second observation, called the 2:1 cache rule of thumb and found on the front inside cover, is that a direct-mapped cache of size N has about the same miss rate as a 2-way set-associative cache of size N/2.

Like many of these examples, improving one aspect of the average memory access time comes at the expense of another. Increasing block size reduced miss rate while increasing miss penalty, and greater associativity can come at the cost of increased hit time. Hill [1988] found about a 10% difference in hit times for TTL or ECL board-level caches and a 2% difference for custom CMOS caches for direct-mapped caches versus two-way set-associative caches. Hence the pressure of a fast processor clock cycle encourages simple cache designs, but the increasing miss penalty rewards associativity, as the following example suggests.
EXAMPLE
Assume that going to higher associativity would increase the clock cycle as suggested below:

$$\begin{aligned}
\text{Clock cycle time}_{\text{2-way}} &= 1.10 \times \text{Clock cycle time}_{\text{1-way}} \\
\text{Clock cycle time}_{\text{4-way}} &= 1.12 \times \text{Clock cycle time}_{\text{1-way}} \\
\text{Clock cycle time}_{\text{8-way}} &= 1.14 \times \text{Clock cycle time}_{\text{1-way}}
\end{aligned}$$

Assume that the hit time is 1 clock cycle, that the miss penalty for the direct-mapped case is 50 clock cycles, and that the miss penalty need not be rounded to an integral number of clock cycles. Using Figure 5.9 for miss rates, for which cache sizes are each of these three statements true?

$$\begin{aligned}
\text{Average memory access time}_{\text{8-way}} &< \text{Average memory access time}_{\text{4-way}} \\
\text{Average memory access time}_{\text{4-way}} &< \text{Average memory access time}_{\text{2-way}} \\
\text{Average memory access time}_{\text{2-way}} &< \text{Average memory access time}_{\text{1-way}}
\end{aligned}$$

ANSWER
Average memory access time for each associativity is

$$\begin{aligned}
\text{Average memory access time}_{\text{8-way}} &= \text{Hit time}_{\text{8-way}} + \text{Miss rate}_{\text{8-way}} \times \text{Miss penalty}_{\text{1-way}} = 1.14 + \text{Miss rate}_{\text{8-way}} \times 50 \\
\text{Average memory access time}_{\text{4-way}} &= 1.12 + \text{Miss rate}_{\text{4-way}} \times 50 \\
\text{Average memory access time}_{\text{2-way}} &= 1.10 + \text{Miss rate}_{\text{2-way}} \times 50 \\
\text{Average memory access time}_{\text{1-way}} &= 1.00 + \text{Miss rate}_{\text{1-way}} \times 50
\end{aligned}$$

The miss penalty is the same time in each case, so we leave it as 50 clock cycles. For example, the average memory access time for a 1-KB direct-mapped cache is

$$\text{Average memory access time}_{\text{1-way}} = 1.00 + (0.133 \times 50) = 7.65$$

and the time for a 128-KB, eight-way set-associative cache is

$$\text{Average memory access time}_{\text{8-way}} = 1.14 + (0.006 \times 50) = 1.44$$

Using these formulas and the miss rates from Figure 5.9, Figure 5.14 shows the average memory access time for each cache and associativity. The figure shows that the formulas in this example hold for caches less than or equal to 16 KB. Starting with 32 KB, the average memory access time of four-way is less than two-way, and two-way is less than one-way, but the eight-way cache is not less than four-way.
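The two worked values above, and the ordering reported for the 32-KB cache, follow from one line of arithmetic per entry. The Python sketch below is a minimal illustration that includes only the Figure 5.9 miss rates needed for those checks (it is not a full transcription of the figure); the names `CLOCK_STRETCH`, `MISS_RATE`, and `amat` are ours.

```python
# Hit time in direct-mapped clock cycles, stretched as assumed in the example.
CLOCK_STRETCH = {1: 1.00, 2: 1.10, 4: 1.12, 8: 1.14}
MISS_PENALTY = 50  # direct-mapped clock cycles, the same for every associativity

# A subset of the total miss rates in Figure 5.9: (cache size in KB, ways) -> rate.
MISS_RATE = {
    (1, 1): 0.133,
    (32, 1): 0.020, (32, 2): 0.014, (32, 4): 0.013, (32, 8): 0.013,
    (128, 8): 0.006,
}


def amat(size_kb, ways):
    """Average memory access time in direct-mapped clock cycles."""
    return CLOCK_STRETCH[ways] + MISS_RATE[(size_kb, ways)] * MISS_PENALTY


print(f"1 KB, 1-way:   {amat(1, 1):.2f}")
print(f"128 KB, 8-way: {amat(128, 8):.2f}")
for ways in (1, 2, 4, 8):
    print(f"32 KB, {ways}-way:  {amat(32, ways):.2f}")
```

At 32 KB the four-way and eight-way miss rates are equal, so the extra hit-time stretch of the eight-way cache is pure cost, which is why the chain of inequalities breaks there.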
Note that we did not account for the slower clock rate on the rest of the program in this example, thereby understating the advantage of direct-mapped cache.

                   Associativity
Cache size (KB)    One-way   Two-way   Four-way   Eight-way
1                  7.65      6.60      6.22       5.44
2                  5.90      4.90      4.62       4.09
4                  4.60      3.95      3.57       3.19
8                  3.30      3.00      2.87       2.59
16                 2.45      2.20      2.12       2.04
32                 2.00      1.80      1.77       1.79*
64                 1.70      1.60      1.57       1.59*
128                1.50      1.45      1.42       1.44*

FIGURE 5.14 Average memory access time using miss rates in Figure 5.9 for parameters in the example. An asterisk means that this time is higher than the number to the left; that is, higher associativity increases average memory access time.

Third Miss Rate Reduction Technique: Victim Caches

Larger block size and higher associativity are two classic techniques to reduce miss rates that have been considered by architects since the earliest caches. Starting with this subsection, we see more recent inventions to reduce miss rate without affecting the clock cycle time or the miss penalty.

One solution that reduces conflict misses without impairing clock rate is to add a small, fully associative cache between a cache and its refill path. Figure 5.15 shows the organization. This victim cache contains only blocks that are discarded from a cache because of a miss (“victims”) and are checked on a miss to see if they have the desired data before going to the next lower-level memory. If it is found there, the victim block and cache block are swapped.

[Figure 5.15: block diagram showing the CPU address and data paths, the cache tag and data arrays with their comparator, the victim cache with its own tag comparator, and the write buffer leading to lower-level memory.]

FIGURE 5.15 Placement of victim cache in the memory hierarchy.

Jouppi [1990] found that victim caches of one to five entries are effective at reducing conflict misses, especially for small, direct-mapped data caches. Depending on the program, a four-entry victim cache removed 20% to 95% of the conflict misses in a 4-KB direct-mapped data cache.
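The victim-cache mechanism described above can be sketched in a few lines of Python. This is a simplified model under our own assumptions, not Jouppi's design: it tracks only block addresses, uses LRU replacement inside the victim cache (the text does not specify a policy), and counts a block found in the victim cache separately from a true miss. The class name and counters are hypothetical.

```python
from collections import OrderedDict


class VictimCachedDirectMapped:
    """Direct-mapped cache backed by a small, fully associative victim cache."""

    def __init__(self, num_lines, victim_entries):
        self.num_lines = num_lines
        self.lines = [None] * num_lines   # block address held by each cache line
        self.victim_entries = victim_entries
        self.victims = OrderedDict()      # oldest victim first (LRU is an assumption)
        self.hits = 0
        self.victim_hits = 0
        self.misses = 0

    def access(self, block):
        index = block % self.num_lines
        resident = self.lines[index]
        if resident == block:
            self.hits += 1
            return "hit"
        if block in self.victims:
            # Found among the victims: it will be swapped with the resident block.
            del self.victims[block]
            self.victim_hits += 1
            outcome = "victim hit"
        else:
            # Not found anywhere: fetch from the next lower level.
            self.misses += 1
            outcome = "miss"
        self.lines[index] = block
        if resident is not None:
            self._add_victim(resident)
        return outcome

    def _add_victim(self, block):
        if self.victim_entries == 0:
            return
        if len(self.victims) >= self.victim_entries:
            self.victims.popitem(last=False)
        self.victims[block] = True


# Blocks 0 and 4 conflict in a 4-line direct-mapped cache.
trace = [0, 4, 0, 4, 0, 4]

plain = VictimCachedDirectMapped(num_lines=4, victim_entries=0)
with_victim = VictimCachedDirectMapped(num_lines=4, victim_entries=1)
for block in trace:
    plain.access(block)
    with_victim.access(block)

print("misses without victim cache:", plain.misses)
print("misses with one victim entry:", with_victim.misses,
      "victim hits:", with_victim.victim_hits)
```

On this deliberately adversarial trace a single victim entry turns every conflict miss into a victim hit; real programs fall somewhere within the 20% to 95% range quoted above.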
Fourth Miss Rate Reduction Technique: Pseudo-Associative Caches

Another approach to getting the miss rate of set-associative caches and the hit speed of direct mapped is called pseudo-associative or column associative. A cache access proceeds just as in the direct-mapped cache for a hit. On a miss, however, before going to the next lower level of the memory hierarchy, another cache entry is checked to see if it matches there. A simple way is to invert the most significant bit of the index field to find the other block in the “pseudo set.”

Pseudo-associative caches then have one fast and one slow hit time, corresponding to a regular hit and a pseudo hit, in addition to the miss penalty. Figure 5.16 shows the relative times. The danger is that if many of the fast hit times of the direct-mapped cache became slow hit times in the pseudo-associative cache, the performance would be degraded by this optimization. Hence it is important to be able to indicate for each set which block should be the fast hit and which should be the slow one; one way is simply to swap the contents of the blocks.

[Figure 5.16: time line showing the hit time, the longer pseudo hit time, and the still longer miss penalty.]

FIGURE 5.16 Relationship between a regular hit time, pseudo hit time, and miss penalty.

Let’s do an example to see how well pseudo-associativity works.

EXAMPLE
Assume that it takes two extra cycles to find the entry in the alternative location if it is not found in the direct-mapped location: one cycle to check and one cycle to swap. Using the parameters from the previous example, which of direct-mapped, two-way set-associative, and pseudo-associative organizations is fastest for 2-KB and 128-KB sizes?

ANSWER
The average memory access time for pseudo-associative caches starts with the standard formula:

$$\text{Average memory access time}_{\text{pseudo}} = \text{Hit time}_{\text{pseudo}} + \text{Miss rate}_{\text{pseudo}} \times \text{Miss penalty}_{\text{pseudo}}$$

Let’s start with the last part of the equation.
The pseudo miss penalty is one cycle more than a normal miss penalty, to account for the time to check the alternative location. To determine the miss rate we need to see when misses occur. As long as we invert the most significant bit of the index to find the other block, the two blocks in the “pseudo set” are selected using the same index that would be used in a two-way set-associative cache and hence have the same miss rates. Thus the last part of the equation is

$$\text{Miss rate}_{\text{pseudo}} \times \text{Miss penalty}_{\text{pseudo}} = \text{Miss rate}_{\text{2-way}} \times (\text{Miss penalty}_{\text{1-way}} + 1)$$

Returning to the beginning of the equation, the hit time for a pseudo-associative cache is the time to hit in a direct-mapped cache plus the fraction of accesses that are found in the pseudo-associative search times the extra time it takes to find the hit:

$$\text{Hit time}_{\text{pseudo}} = \text{Hit time}_{\text{1-way}} + \text{Alternate hit rate}_{\text{pseudo}} \times 2$$

The hit rate for the pseudo-associative search is the difference between the hits that would occur in a two-way set-associative cache and the number of hits in a direct-mapped cache:

$$\begin{aligned}
\text{Alternate hit rate}_{\text{pseudo}} &= \text{Hit rate}_{\text{2-way}} - \text{Hit rate}_{\text{1-way}} \\
&= (1 - \text{Miss rate}_{\text{2-way}}) - (1 - \text{Miss rate}_{\text{1-way}}) \\
&= \text{Miss rate}_{\text{1-way}} - \text{Miss rate}_{\text{2-way}}
\end{aligned}$$

But it is slightly more complex. The direct-mapped miss rate to use is that of a direct-mapped cache half the size, since half of the cache is reserved for alternate locations, while the whole cache has the contents of a two-way set-associative cache.
Putting the pieces back together:

$$\text{Average memory access time}_{\text{pseudo}} = \text{Hit time}_{\text{1-way}} + (\text{Miss rate}_{\text{1-way}} - \text{Miss rate}_{\text{2-way}}) \times 2 + \text{Miss rate}_{\text{2-way}} \times (\text{Miss penalty}_{\text{1-way}} + 1)$$

Figure 5.9 supplies the values we need to plug into our formulas, taking the direct-mapped miss rate from the cache of half the size (1 KB and 64 KB, respectively):

$$\begin{aligned}
\text{Average memory access time}_{\text{pseudo, 2 KB}} &= 1 + (0.133 - 0.076) \times 2 + (0.076 \times (50 + 1)) \\
&= 1 + 0.114 + 3.876 = 4.990 \\
\text{Average memory access time}_{\text{pseudo, 128 KB}} &= 1 + (0.014 - 0.007) \times 2 + (0.007 \times (50 + 1)) \\
&= 1 + 0.014 + 0.357 = 1.371
\end{aligned}$$

From Figure 5.14 in the last example we know these results for 2-KB caches:

$$\begin{aligned}
\text{Average memory access time}_{\text{1-way}} &= 5.90\ \text{clock cycles} \\
\text{Average memory access time}_{\text{2-way}} &= 4.90\ \text{clock cycles}
\end{aligned}$$

For 128-KB caches the times are

$$\begin{aligned}
\text{Average memory access time}_{\text{1-way}} &= 1.50\ \text{clock cycles} \\
\text{Average memory access time}_{\text{2-way}} &= 1.45\ \text{clock cycles}
\end{aligned}$$

The pseudo-associative cache is fastest for the 128-KB cache while the two-way set associative is fastest for the 2-KB cache.

Although an attractive idea on paper, variable hit times can complicate a pipelined CPU design. Hence the authors expect the most likely use of pseudo-associativity is with caches further from the processor (see the description of second-level caches in the next section).

Fifth Miss Rate Reduction Technique: Hardware Prefetching of Instructions and Data

Victim caches and pseudo-associativity both promise to improve miss rates without affecting the processor clock rate. A third way is to prefetch items before they are requested by the processor. Both instructions and data can be prefetched,
If the requested block is present in the instruction stream buffer, the original cache request is canceled, the block is read from the stream buffer, and the next prefetch request is issued. There is never more than one 32-byte block in the 21064 instruction stream buffer.

Jouppi [1990] found that a single instruction stream buffer would catch 15% to 25% of the misses from a 4-KB direct-mapped instruction cache with 16-byte blocks. With 4 blocks in the instruction stream buffer the hit rate improves to about 50%, and with 16 blocks to 72%.

A similar approach can be applied to data accesses. Jouppi found that a single data stream buffer caught about 25% of the misses from the 4-KB direct-mapped cache. Instead of having a single stream, there could be multiple stream buffers beyond the data cache, each prefetching at different addresses. Jouppi found that four data stream buffers increased the data hit rate to 43%. Palacharla and Kessler [1994] looked at a set of scientific programs and considered stream buffers that could handle either instructions or data. They found that eight stream buffers could capture 50% to 70% of all misses from a processor with two 64-KB four-way set-associative caches, one for instructions and the other for data.

EXAMPLE
What is the effective miss rate of the Alpha AXP 21064 using instruction prefetching? How much bigger an instruction cache would be needed in the Alpha AXP 21064 to match the average access time if prefetching were removed?

ANSWER
We assume it takes 1 extra clock cycle if the instruction misses the cache but is found in the prefetch buffer. Here is our revised formula:

$$\text{Average memory access time}_{\text{prefetch}} = \text{Hit time} + \text{Miss rate} \times \text{Prefetch hit rate} \times 1 + \text{Miss rate} \times (1 - \text{Prefetch hit rate}) \times \text{Miss penalty}$$

Let's assume the prefetch hit rate is 25%. Figure 5.7 on page 384 gives the miss rate for an 8-KB instruction cache as 1.10%.
Using the parameters from the Example on page 386, the hit time is 2 clock cycles, and the miss penalty is 50 clock cycles:

$$\text{Average memory access time}_{\text{prefetch}} = 2 + (1.10\% \times 25\% \times 1) + (1.10\% \times (1 - 25\%) \times 50) = 2 + 0.00275 + 0.413 = 2.415$$

To find the effective miss rate with the equivalent performance, we start with the original formula and solve for the miss rate:

$$\text{Average memory access time} = \text{Hit time} + \text{Miss rate} \times \text{Miss penalty}$$

$$\text{Miss rate} = \frac{\text{Average memory access time} - \text{Hit time}}{\text{Miss penalty}}$$

$$\text{Miss rate} = \frac{2.415 - 2}{50} = \frac{0.415}{50} = 0.83\%$$

Our calculation suggests that the effective miss rate of prefetching with an 8-KB cache is 0.83%. Figure 5.7 on page 384 gives the miss rate of a 16-KB instruction cache as 0.64%, so 8 KB with prefetching is midway between the 1.10% and 0.64% miss rates of the 8-KB and 16-KB caches.

Prefetching relies on utilizing memory bandwidth that otherwise would be unused, and can actually lower performance if it interferes with demand misses. Help from compilers can reduce useless prefetching.

Sixth Miss Rate Reduction Technique: Compiler-Controlled Prefetching

An alternative to hardware prefetching is for the compiler to insert prefetch instructions to request the data before they are needed. There are several flavors of prefetch:

- Register prefetch will load the value into a register.
- Cache prefetch loads data only into the cache and not the register.

Either of these can be faulting or nonfaulting; that is, the address does or does not cause an exception for virtual address faults and protection violations. Using this terminology, a normal load instruction could be considered a "faulting register prefetch instruction." Nonfaulting prefetches simply turn into no-ops if they would normally result in an exception.
The most effective prefetch is "semantically invisible" to a program: it doesn't change the contents of registers or memory and it cannot cause virtual memory faults. This section assumes nonfaulting cache prefetch, also called nonbinding prefetch.

Prefetching makes sense only if the processor can proceed while the prefetched data are being fetched; that is, the caches continue to supply instructions and data while waiting for the prefetched data to return. Such a nimble cache is called a nonblocking cache or lockup-free cache; we'll discuss it in more detail later.

Like hardware-controlled prefetching, the goal is to overlap execution with the prefetching of data. Loops are the key targets, as they lend themselves to prefetch optimizations. If the miss penalty is small, the compiler just unrolls the loop once or twice and schedules the prefetches with the execution. If the miss penalty is large, it uses software pipelining (page 290 in Chapter 4) or unrolls many times to prefetch data for a future iteration. Issuing prefetch instructions incurs an instruction overhead, however, so care must be taken to ensure that such overheads do not exceed the benefits. By concentrating on references that are likely to be cache misses, programs can avoid unnecessary prefetches while improving average memory access time significantly.

EXAMPLE
For the code below, determine which accesses are likely to cause data cache misses. Next, insert prefetch instructions to reduce misses. Finally, calculate the number of prefetch instructions executed and the misses avoided due to prefetching. Let's assume we have an 8-KB direct-mapped data cache with 16-byte blocks, it is a write-back cache that does write allocate, and that the elements of a and b are 8 bytes long as they are double-precision floating-point arrays with 3 rows and 100 columns for a and 101 rows and 3 columns for b. Let's also assume they are not in the cache at the start of the program.
    for (i = 0; i < 3; i = i+1)
        for (j = 0; j < 100; j = j+1)
            a[i][j] = b[j][0] * b[j+1][0];

ANSWER
The compiler will first determine which accesses are likely to cause cache misses; otherwise, we will waste time on issuing prefetch instructions for data that would be hits. Elements of a are written in the order that they are stored in memory, so a will benefit from spatial locality: the even values of j will miss and the odd values will hit. Since a has 3 rows and 100 columns, its accesses will lead to $\frac{3 \times 100}{2}$ or 150 misses.

The array b does not benefit from spatial locality since the accesses are not in the order it is stored. The array b does benefit twice from temporal locality: the same elements are accessed for each iteration of i, and each iteration of j uses the same value of b as the last iteration. Ignoring potential conflict misses, the misses due to b will be for b[j+1][0] accesses when i = 0, and also the first access to b[j][0] when j = 0. Since j goes from 0 to 99 when i = 0, accesses to b lead to 100 + 1 or 101 misses.

Thus this loop will miss the data cache approximately 150 + 101 or 251 times. To simplify our optimization, we will not worry about prefetching the first accesses of the loop nor suppressing the prefetches at the end of the loop; if these were faulting prefetches, we could not take this luxury. Given our analysis of misses, we split the loop so the first loop will prefetch b as well as a, and the second loop will just prefetch a, since b will have already been prefetched. Let's assume that the miss penalty is so large we need to prefetch at least seven iterations in advance.
    for (j = 0; j < 100; j = j+1) {
        prefetch(b[j+7][0]);   /* b(j,0) for 7 iterations later */
        prefetch(a[0][j+7]);   /* a(0,j) for 7 iterations later */
        a[0][j] = b[j][0] * b[j+1][0];
    }
    for (i = 1; i < 3; i = i+1)
        for (j = 0; j < 100; j = j+1) {
            prefetch(a[i][j+7]);   /* a(i,j) for 7 iterations later */
            a[i][j] = b[j][0] * b[j+1][0];
        }

This revised code prefetches a[i][7] through a[i][99] and b[7][0] through b[99][0], reducing the number of nonprefetched misses to

$$\left\lceil \frac{3 \times 7}{2} \right\rceil + 8 = 11 + 8 = 19$$

The cost of avoiding 232 cache misses is executing 400 prefetch instructions, very likely a good trade-off.

EXAMPLE
Calculate the time saved in the example above. Ignore instruction cache misses and assume there are no conflict or capacity misses in the data cache. Assume that prefetches can overlap with each other and with cache misses, thereby transferring at the maximum memory bandwidth. Here are the key loop times ignoring cache misses: the original loop takes 7 clock cycles per iteration, the first prefetch loop takes 9 clock cycles per iteration, and the second prefetch loop takes 8 clock cycles per iteration (including the overhead of the outer for loop). A miss takes 50 clock cycles.

ANSWER
The original doubly nested loop executes the multiply 3 × 100 or 300 times. Since the loop takes 7 clock cycles per iteration, the total is 300 × 7 or 2100 clock cycles plus cache misses. Cache misses add 251 × 50 or 12,550 clock cycles, giving a total of 14,650 clock cycles. The first prefetch loop iterates 100 times; at 9 clock cycles per iteration the total is 900 clock cycles plus cache misses. They add 11 × 50 or 550 clock cycles for cache misses, giving a total of 1450. The second loop executes 2 × 100 or 200 times, and at 8 clock cycles per iteration it takes 1600 clock cycles plus 8 × 50 or 400 clock cycles for cache misses. This gives a total of 2000 clock cycles.
From the prior example we know that this code executes 400 prefetch instructions during the 1450 + 2000 or 3450 clock cycles to execute these two loops. If we assume that the prefetches are completely overlapped with the rest of the execution, then the prefetch code is 14,650/3450 or 4.2 times faster.

Seventh Miss Rate Reduction Technique: Compiler Optimizations

Thus far our techniques to reduce misses have required changes to or additions to the hardware: larger blocks, higher associativity, pseudo-associativity, hardware prefetching, or prefetch instructions. This final technique reduces miss rates without any hardware changes! This magical reduction comes from optimized software, the hardware designer's favorite solution. The increasing performance gap between processors and main memory has inspired compiler writers to scrutinize the memory hierarchy to see if compile-time optimizations can improve performance. Once again research is split between improvements in instruction misses and improvements in data misses.

Code can easily be rearranged without affecting correctness; for example, reordering the procedures of a program might reduce instruction miss rates by reducing conflict misses. McFarling [1989] looked at using profiling information to determine likely conflicts between groups of instructions, and reordered the instructions to reduce misses by 50% for a 2-KB direct-mapped instruction cache with 4-byte blocks, and by 75% in an 8-KB cache. McFarling got the best performance when it was possible to prevent some instructions from ever entering the cache, but even without that feature, optimized programs on a direct-mapped cache had lower miss rates than unoptimized programs on an eight-way set-associative cache of the same size.

Data have even fewer restrictions on location than code. The goal of such transformations is to try to improve the spatial and temporal locality of the data.
For example, array calculations can be changed to operate on all the data in a cache block rather than blindly striding through arrays in the order the programmer happened to place the loop.

To give a feeling of this type of optimization, we will show four examples, transforming the C code by hand to reduce cache misses. Figure 5.17 shows the performance improvement in using these optimizations on a subset of the SPEC92 floating-point benchmarks.

[Figure 5.17: bar chart of performance improvement (scale 1 to 3) from merged arrays, loop interchange, loop fusion, and blocking for vpenta (nasa7), gmty (nasa7), tomcatv, btrix (nasa7), mxm (nasa7), spice, cholesky (nasa7), and compress.]

FIGURE 5.17 Lebeck and Wood [1994] performed the four optimizations in this section by hand on three SPEC92 programs and five separate portions of the nasa7 benchmark.

Merging Arrays

This first technique reduces misses by improving spatial locality. Some programs reference multiple arrays in the same dimension with the same indices at the same time. The danger is that these accesses will interfere with each other, leading to conflict misses. This danger is removed by combining these independent matrices into a single compound array so that a single cache block can contain the desired elements.

    /* Before */
    int val[SIZE];
    int key[SIZE];

    /* After */
    struct merge {
        int val;
        int key;
    };
    struct merge merged_array[SIZE];

An interesting characteristic of this example is that the proper coding practice of using an array of records would achieve the same benefits as this optimization.

Loop Interchange

Some programs have nested loops that access data in memory in nonsequential order. Simply exchanging the nesting of the loops can make the code access the data in the order it is stored. Like the prior example, this technique reduces misses by improving spatial locality; reordering maximizes use of data in a cache block before it is discarded.
    /* Before */
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];

    /* After */
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];

The original code would skip through memory in strides of 100 words, while the revised version accesses all the words in the cache block before going to the next one. This optimization improves cache performance without affecting the number of instructions executed, unlike the prior example.

Loop Fusion

Some programs have separate sections of code that access the same arrays with the same loops, performing different computations on the common data. By "fusing" the code into a single loop, the data that are fetched into the cache can be used repeatedly before being swapped out. Hence, in contrast to our first two techniques, the target of this optimization is reducing misses via improved temporal locality.

    /* Before */
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1)
            a[i][j] = 1/b[i][j] * c[i][j];
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1)
            d[i][j] = a[i][j] + c[i][j];

    /* After */
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1) {
            a[i][j] = 1/b[i][j] * c[i][j];
            d[i][j] = a[i][j] + c[i][j];
        }

The original code would take all the misses to access arrays a and c twice, once in the first loop and then again in the second. In the fused loop, the second statement freeloads on the cache accesses of the first statement.

Blocking

This optimization, perhaps the most famous of the cache optimizations, again tries to reduce misses via improved temporal locality. We are again dealing with multiple arrays, with some arrays accessed by rows and some by columns. Storing the arrays row by row (row major order) or column by column (column major order) does not solve the problem because both rows and columns are used in every iteration of the loop.
Such orthogonal accesses mean the earlier transformations, such as loop interchange, are not helpful.

Instead of operating on entire rows or columns of an array, blocked algorithms operate on submatrices or blocks. The goal is to maximize accesses to the data loaded into the cache before the data are replaced. The code example below, which performs matrix multiplication, helps motivate the optimization:

    /* Before */
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1) {
            r = 0;
            for (k = 0; k < N; k = k+1) {
                r = r + y[i][k]*z[k][j];
            }
            x[i][j] = r;
        }

The two inner loops read all N by N elements of z, access the same N elements in a row of y repeatedly, and write one row of N elements of x. Figure 5.18 gives a snapshot of the accesses to the three arrays, with a dark shade indicating a recent access, a light shade indicating an older access, and white meaning not yet accessed.

[Figure 5.18: three 6 by 6 grids for the arrays x (indexed by i and j), y (indexed by i and k), and z (indexed by k and j), shaded by age of access.]

FIGURE 5.18 A snapshot of the three arrays x, y, and z when i = 1. The age of accesses to the array elements is indicated by shade: white means not yet touched, light means older accesses and dark means newer accesses. The variables i, j, and k are shown along the rows or columns used to access the arrays.

The number of capacity misses clearly depends on N and the size of the cache. If it can hold all three N by N matrices, then all is well, provided there are no cache conflicts. If the cache can hold one N by N matrix and one row of N, then at least the ith row of y and the array z may stay in the cache. Less than that and misses may occur for both x and z. In the worst case, there would be $2N^3 + N^2$ words read from memory for $N^3$ operations.
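The worst-case figure quoted above can be reconstructed with a short tally. This is only a sketch under an idealized assumption that is not spelled out in the text: every array reference goes to memory and moves exactly one word, so cache block size plays no role.

```latex
\begin{align*}
\text{accesses to } y &: \; N^2 \text{ elements of } x \times N \text{ reads each} = N^3 \\
\text{accesses to } z &: \; N^2 \text{ elements of } x \times N \text{ reads each} = N^3 \\
\text{accesses to } x &: \; N^2 \\
\text{total} &= N^3 + N^3 + N^2 = 2N^3 + N^2
\end{align*}
```

The inner loop over k performs one multiply-add per iteration, which gives the $N^3$ operations that the count is compared against.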
To ensure that the elements being accessed can fit in the cache, the original code is changed to compute on a submatrix of size B by B by having the two inner loops compute in steps of size B rather than going from beginning to end of x and z. B is called the blocking factor. (Assume x is initialized to zero.)

    /* After */
    for (jj = 0; jj < N; jj = jj+B)
        for (kk = 0; kk < N; kk = kk+B)
            for (i = 0; i < N; i = i+1)
                for (j = jj; j < min(jj+B,N); j = j+1) {
                    r = 0;
                    for (k = kk; k < min(kk+B,N); k = k+1) {
                        r = r + y[i][k]*z[k][j];
                    }
                    x[i][j] = x[i][j] + r;
                }

Figure 5.19 illustrates the accesses to the three arrays using blocking. Looking only at capacity misses, the total number of memory words accessed is $2N^3/B + N^2$, which is an improvement by about a factor of B. Thus blocking exploits a combination of spatial and temporal locality, since y benefits from spatial locality and z benefits from temporal locality.

Although we have aimed at reducing cache misses, blocking can also be used to help register allocation. By taking a small blocking size such that the block can be held in registers, we can minimize the number of loads and stores in the program.

Traditionally blocking has been aimed at reducing capacity misses, under the simplifying assumption that conflict misses are either not significant or can be removed by more associative caches. Since blocking reduces the number of words that are active in a cache at a given time, choosing a blocking size smaller than capacity can also reduce conflict misses. Figure 5.20 gives a qualitative view of this trade-off.

These last two subsections have concentrated on the potential benefit of cache-aware compilers and programs. Given the increasing gap between processor speed and memory access time, this benefit will only increase in importance over time.
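To make the blocked loop nest concrete, here is a minimal sketch in C that places the unblocked and blocked matrix multiplications side by side so their results can be compared. The dimension `N`, the blocking factor `B`, and the names `min_sz`, `matmul_naive`, and `matmul_blocked` are illustrative assumptions, not values or names from the text. `N` is deliberately not a multiple of `B`, so the `min(...)` bounds on the last partial block are exercised.

```c
#include <assert.h>
#include <stddef.h>

#define N 37  /* hypothetical matrix dimension, not a multiple of B */
#define B 8   /* hypothetical blocking factor */

/* Smaller of two sizes; stands in for the min() used in the text. */
static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Unblocked multiply: x = y * z, one dot product per element of x. */
void matmul_naive(double x[N][N], double y[N][N], double z[N][N]) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            double r = 0.0;
            for (size_t k = 0; k < N; k++)
                r += y[i][k] * z[k][j];
            x[i][j] = r;
        }
}

/* Blocked multiply: x = y * z, accumulated one B-wide strip of k at a time.
   The text assumes x starts at zero; this sketch zeroes it explicitly. */
void matmul_blocked(double x[N][N], double y[N][N], double z[N][N]) {
    assert(B > 0);
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            x[i][j] = 0.0;

    for (size_t jj = 0; jj < N; jj += B)
        for (size_t kk = 0; kk < N; kk += B)
            for (size_t i = 0; i < N; i++)
                for (size_t j = jj; j < min_sz(jj + B, N); j++) {
                    double r = 0.0;
                    /* Partial dot product over the current block of k. */
                    for (size_t k = kk; k < min_sz(kk + B, N); k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;
                }
}
```

Calling both routines on the same inputs should give the same product. With small integer-valued entries the match is exact; with general floating-point data the blocked version adds the terms in a different order, so the last bits may differ.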
[Figure 5.19: three 6 by 6 grids for the arrays x (indexed by i and j), y (indexed by i and k), and z (indexed by k and j), shaded by age of access.]

FIGURE 5.19 The age of accesses to the arrays x, y, and z. Note in contrast to Figure 5.18 the smaller number of elements accessed.

[Figure 5.20: miss rate (0% to 10%) versus blocking factor (0 to 150) for a direct-mapped cache and a fully associative cache.]

FIGURE 5.20 The impact of conflict misses in caches that aren't fully associative on block size. For example, Lam, Rothberg, and Wolf [1991] found one case where a blocking factor of 24 had a fifth the number of misses of a blocking factor of 48, despite both fitting into the cache.

Now that we have spent more than 20 pages on techniques that reduce cache misses, it is time to look at reducing the next component of average memory access time.

5.4 Reducing Cache Miss Penalty

Reducing cache misses has been the traditional focus of cache research, but the cache performance formula assures us that improvements in miss penalty can be just as beneficial as improvements in miss rate. Moreover, Figure 5.1 shows that technology trends have improved the speed of processors faster than DRAMs, making the relative cost of miss penalties increase over time. We give five optimizations here to address this problem. Perhaps the most interesting optimization is the final one, which adds another level of cache to reduce miss penalty.

First Miss Penalty Reduction Technique: Giving Priority to Read Misses over Writes

With a write-through cache the most important improvement is a write buffer (page 380) of the proper size (see the pitfall on page 470 in section 5.11). Write buffers, however, do complicate memory accesses in that they might hold the updated value of a location needed on a read miss.
EXAMPLE
Look at this code sequence:

    SW 512(R0),R3     ; M[512] ← R3 (cache index 0)
    LW R1,1024(R0)    ; R1 ← M[1024] (cache index 0)
    LW R2,512(R0)     ; R2 ← M[512] (cache index 0)

Assume a direct-mapped, write-through cache that maps 512 and 1024 to the same block, and a four-word write buffer. Will the value in R2 always be equal to the value in R3?

ANSWER
Using the terminology from Chapter 3, this is a read-after-write data hazard in memory. Let's follow a cache access to see the danger. The data in R3 are placed into the write buffer after the store. The following load uses the same cache index and is therefore a miss. The second load instruction tries to put the value in location 512 into register R2; this also results in a miss. If the write buffer hasn't completed writing to location 512 in memory, the read of location 512 will put the old, wrong value into the cache block, and then into R2. Without proper precautions, R3 would not be equal to R2!

The simplest way out of this dilemma is for the read miss to wait until the write buffer is empty. A write buffer of a few words in a write-through cache will almost always have data in the buffer on a miss, thereby increasing the read miss penalty. The designers of the MIPS M/1000 estimated that waiting for a four-word buffer to empty would have increased the average read miss penalty by a factor of 1.5. The alternative is to check the contents of the write buffer on a read miss, and if there are no conflicts and the memory system is available, let the read miss continue.

The cost of writes by the processor in a write-back cache can also be reduced. Suppose a read miss will replace a dirty memory block. Instead of writing the dirty block to memory, and then reading memory, we could copy the dirty block to a buffer, then read memory, and then write memory. This way the CPU read, for which the processor is probably waiting, will finish sooner.
Similar to the situation above, if a read miss occurs, the processor can either stall until the buffer is empty or check the addresses of the words in the buffer for conflicts.

Second Miss Penalty Reduction Technique: Sub-block Placement for Reduced Miss Penalty

Suppose you are designing a cache that must fit on the chip. You may find that your tags are too large, either because they don't fit on the chip or because they are too slow. A simple solution is to go to large blocks, which reduces tag storage without decreasing the amount of information you can store in the cache. Of course the miss rate will likely improve, but the increase in miss penalty could make the larger blocks a bad decision.

One solution is called sub-block placement. A valid bit is added to units smaller than the full block, called sub-blocks. Only a single sub-block need be read on a miss. The valid bits specify some parts of the block as valid and some parts as invalid, so a match of the tag doesn't mean the word is necessarily in the cache, as the valid bit for that word must also be on. Figure 5.21 gives an example.

Clearly sub-blocks will have a smaller miss penalty than full blocks. Figure 5.21 shows the reduction in tag storage; if the valid bits had to be replaced by full tags, there would be much more memory dedicated to tags, which is the reason sub-block placement was invented.

Third Miss Penalty Reduction Technique: Early Restart and Critical Word First

The first two techniques require extra hardware to reduce miss penalty, but not this third technique. It is based on the observation that the CPU needs just one word of the block at a time. This strategy is impatience: Don't wait for the full block to be loaded before sending the requested word and restarting the CPU. Here are two specific strategies:

    Tag    Valid bits (one per sub-block)
    100    1 1 1 1
    300    1 1 0 0
    200    0 1 0 1
    204    0 0 0 0

FIGURE 5.21 In this example there are four sub-blocks per block in a direct-mapped cache.
Sub-blocks can be thought of as an extra level of addressing beyond the address tag. In the first block (top), all the valid bits are on, equivalent to the valid bit being on for a block in a normal cache. In the last block (bottom), the opposite is true; no valid bits are on. In the second block, locations 300 and 301 are valid and will be hits, while locations 302 and 303 will be misses. For the third block, locations 201 and 203 are hits. If, instead of this organization, there were 16 blocks the size of the sub-block, 16 tags would be needed instead of 4. Note that for caches with sub-block placement, a block can no longer be defined as the minimum unit transferred between cache and memory. For such caches a block is defined as the unit of information associated with an address tag.

- Early restart: As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
- Critical word first: Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Critical-word-first fetch is also called wrapped fetch and requested word first.

Generally these techniques pay off only in designs with large cache blocks; with small blocks the benefit is low.

EXAMPLE
Let's assume a machine has a 32-byte cache block and the memory system takes five clock cycles to fetch bytes over a 16-byte wide path to memory, as in the case of the Alpha AXP 21064. Calculate the average miss penalty for critical word first, assuming that there will be no other accesses to the other half of the block until it is completely fetched. Then calculate assuming the following instruction reads data from the other half of the block.

ANSWER
The average miss penalty is five clock cycles for critical word first.
For back-to-back reads of both halves of the cache block, only one cycle is saved since the pipeline will only move one instruction further until it must stall on the missing data.

As this example illustrates, the benefits of critical word first and early restart depend on the size of the block and the likelihood of another access to the portion of the block that has not yet been fetched. The next technique takes overlap between the CPU and cache miss penalty even further to reduce the average miss penalty.

Fourth Miss Penalty Reduction Technique: Nonblocking Caches to Reduce Stalls on Cache Misses

Early restart still waits for the requested word to arrive before the CPU can continue execution. For pipelined machines that allow out-of-order completion using a scoreboard or Tomasulo-style control (section 4.2 in Chapter 4), the CPU need not stall on a cache miss. For example, the CPU could continue fetching instructions from the instruction cache while waiting for the data cache to return the missing data. A nonblocking cache or lockup-free cache escalates the potential benefits of such a scheme by allowing the data cache to continue to supply cache hits during a miss. This "hit under miss" optimization reduces the effective miss penalty by being helpful during a miss instead of ignoring the requests of the CPU. A subtle and complex option is that the cache may further lower the effective miss penalty if it can overlap multiple misses: a "hit under multiple miss" or "miss under miss" optimization. The second option is beneficial only if the memory system can service multiple misses (see page 434).

Be aware that hit under miss significantly increases the complexity of the cache controller as there can be multiple outstanding memory accesses. Figure 5.22 shows the average time in clock cycles for cache misses for an 8-KB data cache as the number of outstanding misses is varied.
Floating-point programs benefit from increasing complexity, while integer programs get almost all of the benefit from a simple hit-under-one-miss scheme.

[Figure 5.22: ratio of the average memory stall time (0% to 100%) for each of 18 SPEC92 benchmarks under three schemes: hit under 1 miss, hit under 2 misses, and hit under 64 misses.]

FIGURE 5.22 Ratio of the average memory stall time for a blocking cache to hit-under-miss schemes as the number of outstanding misses is varied for 18 SPEC92 programs. The hit-under-64-misses line allows one miss for every register in the machine. The first 14 programs are floating-point programs: the average for hit under 1 miss is 76%, for 2 misses is 51%, and for 64 misses is 39%. The final four are integer programs, and the three averages are 81%, 78%, and 78%, respectively. These data were collected for an 8-KB direct-mapped data cache with 32-byte blocks and a 16-clock-cycle miss penalty. These data were generated using the VLIW Multiflow Compiler, which scheduled loads away from use [Farkas and Jouppi 1994].

EXAMPLE
For the cache described in Figure 5.22, which is more important for floating-point programs: two-way set associativity or hit under one miss? What about for integer programs? Assume the following average miss rates for 8-KB data caches: 11.4% for floating-point programs with a direct-mapped cache, 10.7% for these programs with a two-way set-associative cache, 7.4% for integer programs with a direct-mapped cache, and 6.0% for integer programs with a two-way set-associative cache. Assume the average memory stall time is just the product of the miss rate and the miss penalty.

ANSWER
The numbers for Figure 5.22 were based on a miss penalty of 16 clock cycles.
Although this is low for a miss penalty, let's stick with it for consistency. For floating-point programs the average memory stall times are

$$\text{Miss rate}_{\text{DM}} \times \text{Miss penalty} = 11.4\% \times 16 = 1.82$$

$$\text{Miss rate}_{\text{2-way}} \times \text{Miss penalty} = 10.7\% \times 16 = 1.71$$

The memory stalls of two-way are thus 1.71/1.82 or 94% of direct-mapped cache. The caption of Figure 5.22 says hit under one miss reduces the average memory stall time to 76% of a blocking cache, so for floating-point programs the direct-mapped data cache supporting hit under one miss gives better performance than a two-way set-associative cache that blocks on a miss.

For integer programs the calculation is

$$\text{Miss rate}_{\text{DM}} \times \text{Miss penalty} = 7.4\% \times 16 = 1.18$$

$$\text{Miss rate}_{\text{2-way}} \times \text{Miss penalty} = 6.0\% \times 16 = 0.96$$

The memory stalls of two-way are thus 0.96/1.18 or 81% of direct-mapped cache. The caption of Figure 5.22 says hit under one miss reduces the average memory stall time to 81% of a blocking cache, so the two options give about the same performance for integer programs. One potential advantage of hit under miss is that it cannot affect the hit time, as associativity can.

Fifth Miss Penalty Reduction Technique: Second-Level Caches

The first four techniques to reduce miss penalty have impact on the CPU. This final technique ignores the CPU, concentrating on the interface between the cache and main memory.

The performance gap between processors and memory leads the architect to this question: Should I make the cache faster to keep pace with the speed of CPUs, or make the cache larger to overcome the widening gap between the CPU and main memory? One answer is, both. By adding another level of cache between the original cache and memory, the first-level cache can be small enough to match the clock cycle time of the fast CPU, while the second-level cache can be large enough to capture many accesses that would go to main memory, thereby lessening the effective miss penalty.
While the concept of adding another level in the hierarchy is straightforward, it complicates performance analysis. Definitions for a second level of cache are not always straightforward. Let's start with the definition of average memory access time for a two-level cache. Using the subscripts L1 and L2 to refer, respectively, to a first-level and a second-level cache, the original formula is

$$\text{Average memory access time} = \text{Hit time}_{L1} + \text{Miss rate}_{L1} \times \text{Miss penalty}_{L1}$$

and

$$\text{Miss penalty}_{L1} = \text{Hit time}_{L2} + \text{Miss rate}_{L2} \times \text{Miss penalty}_{L2}$$

so

$$\text{Average memory access time} = \text{Hit time}_{L1} + \text{Miss rate}_{L1} \times (\text{Hit time}_{L2} + \text{Miss rate}_{L2} \times \text{Miss penalty}_{L2})$$

In this formula, the second-level miss rate is measured on the leftovers from the first-level cache. To avoid ambiguity, these terms are adopted here for a two-level cache system:

- Local miss rate: The number of misses in the cache divided by the total number of memory accesses to this cache; this is $\text{Miss rate}_{L2}$ above for the second-level cache.
- Global miss rate: The number of misses in the cache divided by the total number of memory accesses generated by the CPU; using the terms above, the global miss rate of the second-level cache is $\text{Miss rate}_{L1} \times \text{Miss rate}_{L2}$.

This local miss rate is large because the first-level cache skims the cream of the memory accesses, and this is why the global miss rate is the more useful measure: it indicates what fraction of the memory accesses that leave the CPU go all the way to memory.

EXAMPLE
Suppose that in 1000 memory references there are 40 misses in the first-level cache and 20 misses in the second-level cache. What are the various miss rates?

ANSWER
The miss rate (either local or global) for the first-level cache is 40/1000 or 4%. The local miss rate for the second-level cache is 20/40 or 50%. The global miss rate of the second-level cache is 20/1000 or 2%.

Note that these formulas are for combined reads and writes, assuming a write-back first-level cache.
Obviously, a write-through first-level cache will send all writes to the second level, not just the misses, and a write buffer would be used.

[Figure 5.23: two plots of miss rate versus cache size (KB), from 4 KB to 4096 KB, one on a linear scale and one on a log scale, each showing the local miss rate, the single cache miss rate, and the global miss rate.]

FIGURE 5.23 Miss rates versus cache size for reads and writes. The top graph shows the results plotted on a linear scale as we have done with earlier figures, while the bottom graph shows the results plotted on a log scale. As miss rates shrink, the log scale makes the differences easier to follow. The miss rate of a single-level cache versus size is plotted against the local miss rate and global miss rate of a second-level cache using a 32-KB first-level cache. Second-level caches smaller than the 32-KB first level make little sense, as reflected in the high miss rates. After 256 KB the single cache and global miss rates are virtually identical. Przybylski [1990] used four traces from the VAX system and four user programs from the MIPS R2000 that were randomly interleaved to duplicate the effect of process switches.

Figures 5.23 and 5.24 show how miss rates and relative execution time change with the size of a second-level cache for one design. From these figures we can gain two insights. The first is that the global cache miss rate is very similar to the single cache miss rate of the second-level cache, provided that the second-level cache is much larger than the first-level cache. Hence our intuition and knowledge about the first-level caches apply.
The second insight is that the local cache miss rate is not a good measure of secondary caches; it is a function of the miss rate of the first-level cache, and hence can vary by changing the first-level cache. Thus, the global cache miss rate should be used when evaluating second-level caches.

With these definitions in place, we can consider the parameters of second-level caches. The foremost difference between the two levels is that the speed of the first-level cache affects the clock rate of the CPU, while the speed of the second-level cache only affects the miss penalty of the first-level cache. Thus, we can consider many alternatives in the second-level cache that would be ill chosen for the first-level cache. There are but two questions for the design of the second-level cache: Will it lower the average memory access time portion of the CPI, and how much does it cost?

The initial decision is the size of a second-level cache. Since everything in the first-level cache is likely to be in the second-level cache, the second-level cache should be much bigger than the first. If second-level caches are just a little bigger, the local miss rate will be high. This observation inspires design of huge second-level caches, the size of main memory in older computers! Large size means that the second-level cache may have practically no capacity misses, leaving a few compulsory and conflict misses for our attention. One question is whether set associativity makes more sense for second-level caches.

[Figure 5.24: bar chart of relative execution time versus level two cache size (64 KB to 4096 KB), with one bar per size for a level two hit of 4 clock cycles and one for a level two hit of 8 clock cycles.]

FIGURE 5.24 Relative execution time by second-level cache size. Przybylski [1990] collected these data using a 32-KB first-level write-back cache, varying the size of the second-level cache.
The two bars are for different clock cycles for a level two cache hit. The reference execution time of 1.00 is for a 4096-KB second-level cache with a one-clock-cycle latency on a second-level hit. These data were collected the same way as in Figure 5.23.

EXAMPLE Given the data below, what is the impact of second-level cache associativity on the miss penalty?

• Two-way set associativity increases hit time by 10% of a CPU clock cycle
• $\text{Hit time}_{L2}$ for direct mapped = 10 clock cycles
• $\text{Local miss rate}_{L2}$ for direct mapped = 25%
• $\text{Local miss rate}_{L2}$ for two-way set associative = 20%
• $\text{Miss penalty}_{L2}$ = 50 clock cycles

ANSWER For a direct-mapped second-level cache, the first-level cache miss penalty is

$$\text{Miss penalty}_{\text{1-way L2}} = 10 + 25\% \times 50 = 22.5 \text{ clock cycles}$$

Adding the cost of associativity increases the hit cost only 0.1 clock cycles, making the new first-level cache miss penalty

$$\text{Miss penalty}_{\text{2-way L2}} = 10.1 + 20\% \times 50 = 20.1 \text{ clock cycles}$$

In reality, second-level caches are almost always synchronized with the first-level cache and CPU. Accordingly, the second-level hit time must be an integral number of clock cycles. If we are lucky, we can shave the second-level hit time to 10 cycles; if not, we can round up to 11 cycles. Either choice is an improvement over the direct-mapped second-level cache:

$$\text{Miss penalty}_{\text{2-way L2}} = 10 + 20\% \times 50 = 20.0 \text{ clock cycles}$$
$$\text{Miss penalty}_{\text{2-way L2}} = 11 + 20\% \times 50 = 21.0 \text{ clock cycles}$$

Now we can reduce the miss penalty by reducing the miss rate of the second-level caches using techniques from section 5.3. Higher associativity or pseudoassociativity (page 398) are worth considering because they have small impact on the second-level hit time and because so much of the average access time is due to misses in the second-level cache.
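The arithmetic in the associativity example above can be checked with a minimal Python sketch. The function name `l1_miss_penalty` is illustrative; every number comes from the example's data.

```python
def l1_miss_penalty(hit_time_l2, local_miss_rate_l2, miss_penalty_l2):
    """First-level miss penalty in clock cycles: L2 hit time plus L2 miss cost."""
    return hit_time_l2 + local_miss_rate_l2 * miss_penalty_l2


MISS_PENALTY_L2 = 50  # clock cycles, from the example

direct_mapped = l1_miss_penalty(10, 0.25, MISS_PENALTY_L2)
two_way_fractional = l1_miss_penalty(10.1, 0.20, MISS_PENALTY_L2)

# With an integral number of clock cycles for the second-level hit time:
two_way_10_cycles = l1_miss_penalty(10, 0.20, MISS_PENALTY_L2)
two_way_11_cycles = l1_miss_penalty(11, 0.20, MISS_PENALTY_L2)

for label, value in [
    ("direct mapped", direct_mapped),
    ("two-way, 10.1-cycle hit", two_way_fractional),
    ("two-way, 10-cycle hit", two_way_10_cycles),
    ("two-way, 11-cycle hit", two_way_11_cycles),
]:
    print(f"{label:26s} {value:5.1f} clock cycles")
```

Even with the hit time rounded up to 11 cycles, the two-way organization keeps a lower first-level miss penalty than the direct-mapped one, because the 5-point drop in local miss rate saves 2.5 cycles on average.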
Although the larger size of the second-level cache reduces conflict misses by distributing data over more blocks, it also eliminates most of the capacity misses; thus the percentage of conflict misses is still significant in direct-mapped second-level caches.

Another approach to reducing misses is increasing block size in second-level caches. Increasing block size can increase conflict misses in small caches, since there may not be enough places to put data, and therefore increase the miss rate. Because this is not an issue in large second-level caches, and because memory access time is relatively longer, block sizes of 64 bytes, 128 bytes, and even occasionally 256 bytes are popular. Figure 5.25 shows the variation in execution time as the second-level block size changes for a relatively narrow memory bus of 32 bits.

Another consideration concerns whether all data in the first-level cache are always in the second-level cache. If so, the second-level cache is said to have the multilevel inclusion property. Inclusion is desirable because consistency between I/O and caches (or between caches in a multiprocessor) can be determined just by checking the second-level cache (see section 8.7).

The drawback to this natural inclusion is that lower average memory access times can suggest smaller blocks for the smaller first-level cache and larger blocks for the larger second-level cache. Inclusion can still be maintained with more work on a second-level miss: The second-level cache must invalidate all first-level blocks that map onto the second-level block to be replaced, causing a slightly higher first-level miss rate. It can also cause unneeded cache invalidates. Inclusion escalates in complexity when combined with performance optimizations, such as a nonblocking secondary cache.
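The extra work needed to maintain inclusion when block sizes differ can be illustrated with a small Python sketch. It is a simplified model under stated assumptions, not a description of any real cache controller: the first-level cache is represented only as a set of valid block start addresses, and the block sizes (16-byte first-level blocks, 64-byte second-level blocks) and all function names are hypothetical.

```python
def l1_blocks_covered(addr, l1_block_size, l2_block_size):
    """Start addresses of the L1-sized blocks inside the L2 block holding addr."""
    if l2_block_size % l1_block_size != 0:
        raise ValueError("L2 block size must be a multiple of L1 block size")
    base = (addr // l2_block_size) * l2_block_size  # align down to the L2 block
    return [base + off for off in range(0, l2_block_size, l1_block_size)]


def evict_l2_block(l1_valid, addr, l1_block_size=16, l2_block_size=64):
    """Invalidate every L1 block under the replaced L2 block.

    l1_valid is a set of valid L1 block start addresses (a toy model).
    Returns the list of L1 blocks that were actually invalidated.
    """
    invalidated = []
    for block in l1_blocks_covered(addr, l1_block_size, l2_block_size):
        if block in l1_valid:
            l1_valid.remove(block)
            invalidated.append(block)
    return invalidated


# Toy state: three valid L1 blocks, two of them inside L2 block 0x1000-0x103F.
l1_valid = {0x1000, 0x1010, 0x2000}
invalidated = evict_l2_block(l1_valid, 0x1020)

print("invalidated:", [hex(b) for b in invalidated])
print("still valid:", [hex(b) for b in sorted(l1_valid)])
```

Replacing one second-level block here removes two first-level blocks, which is the source of the slightly higher first-level miss rate described above.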
Finally, although a novice might design the first- and second-level caches independently, the designer of the first-level cache has a simpler job given a second-level cache to back up the first. It is less of a gamble to use a write-through first-level cache, for example, if there is a write-back cache at the next level to act as a backstop for repeated writes.

[Figure 5.25: plot of relative CPU execution time versus block size of the second-level cache, for block sizes from 16 to 512 bytes.]

FIGURE 5.25 Relative execution time by block size for a two-level cache. Przybylski [1990] collected these data using a 512-KB second-level cache. These data were collected the same way as in Figure 5.23. The path to memory was basically 32 bits wide in this study: one clock cycle to send the address, six clock cycles to access the data, and one word per clock cycle to transfer the data.

Summarizing the second-level cache considerations, the essence of cache design is balancing fast hits and few misses. Most optimizations that help one hurt the other. For second-level caches, there are many fewer hits than in the first-level cache, so the emphasis shifts to fewer misses. This insight leads to larger caches with higher associativity and larger blocks.

5.5 Reducing Hit Time

Now that we have examined ways to improve cache performance by reducing misses (in section 5.3) and by reducing miss penalty (in section 5.4), we are ready to reduce the third component of the average memory access time. Hit time is critical because it affects the clock rate of the processor; on many machines today the cache access time limits the clock cycle rate, even for machines that take multiple clock cycles to access the cache. Hence a fast hit time is multiplied in importance beyond the average memory access time formula because it helps everything. This section gives two general techniques and then one optimization for write hits.
First Hit Time Reduction Technique: Small and Simple Caches

A time-consuming portion of a cache hit is using the index portion of the address to read the tag memory and then compare it to the address. Our guideline from Chapter 1 suggests that smaller hardware is faster, and a small cache certainly helps the hit time. It is also critical to keep the cache small enough to fit on the same chip as the processor to avoid the time penalty of going off-chip. Some designs strike a compromise by keeping the tags on-chip and the data off-chip, promising a fast tag check, yet providing the greater capacity of separate memory chips. The second suggestion is to keep the cache simple, such as using direct mapping (see page 396). A main benefit of direct-mapped caches is that the designer can overlap the tag check with the transmission of the data. This effectively reduces hit time. Hence the pressure of a fast clock cycle encourages small and simple cache designs for first-level caches.

Second Hit Time Reduction Technique: Avoiding Address Translation During Indexing of the Cache

Even a small and simple cache must cope with the translation of a virtual address from the CPU to a physical address to access memory. As described below in section 5.7, processors treat main memory as just another level of the memory hierarchy, and thus the address of the virtual memory that exists on disk must be mapped onto the main memory.

The guideline of making the common case fast suggests that we use virtual addresses for the cache, since hits are much more common than misses. Such caches are termed virtual caches, with physical cache used to identify the traditional cache that uses physical addresses. Virtual addressing eliminates address translation time from a cache hit. Then why doesn’t everyone build virtually addressed caches?
One reason is that every time a process is switched, the virtual addresses refer to different physical addresses, requiring the cache to be flushed. Figure 5.26 shows the impact on miss rates of this flushing. One solution is to increase the width of the cache address tag with a process-identifier tag (PID). If the operating system assigns these tags to processes, it only need flush the cache when a PID is recycled; that is, the PID distinguishes whether or not the data in the cache are for this program. Figure 5.26 shows the improvement in miss rates by using PIDs to avoid cache flushes.

[Figure 5.26: plot of miss rate (0% to 20%) versus cache size (2K to 1024K) for three cases: Uniprocess, PIDs, and Purge.]

FIGURE 5.26 Miss rate versus virtually addressed cache size of a program measured three ways: without process switches (uniprocess), with process switches using a process-identifier tag (PIDs), and with process switches but without PIDs (purge). PIDs increase the uniprocess absolute miss rate by 0.3% to 0.6% and save 0.6% to 4.3% over purging. Agarwal [1987] collected these statistics for the Ultrix operating system running on a VAX, assuming direct-mapped caches with a block size of 16 bytes. Note that the miss rate goes up from 128K to 256K. Such nonintuitive behavior can occur in caches because changing size changes the mapping of memory blocks onto cache blocks, which can change the conflict miss rate.

Another reason why virtual caches are not more popular is that operating systems and user programs may use two different virtual addresses for the same physical address. These duplicate addresses, called synonyms or aliases, could result in two copies of the same data in a virtual cache; if one is modified, the other will have the wrong value.
With a physical cache this wouldn’t happen, since the accesses would first be translated to the same physical cache block. Hardware solutions, called anti-aliasing, guarantee every cache block a unique physical address.

Software can make this problem much easier by forcing aliases to share some address bits. The version of UNIX from Sun Microsystems, for example, requires all aliases to be identical in the last 18 bits of their addresses; this restriction is called page coloring. Note that page coloring is simply set-associative mapping applied to virtual memory: the 4-KB ($2^{12}$) pages are mapped using 64 ($2^{6}$) sets to ensure that the physical and virtual addresses match in the last 18 bits. This restriction means a direct-mapped cache that is $2^{18}$ (256K) bytes or smaller can never have duplicate physical addresses for blocks.

The final area of concern with virtual addresses is I/O. I/O typically uses physical addresses and thus would require mapping to virtual addresses to interact with a virtual cache. (The impact of I/O on caches is further discussed below in section 5.9.)

Another technique to get fast hits is to break address translation and cache access into separate pipeline stages, giving fast cycle time and slow hits. This increases the number of pipeline stages for a memory access, leading to greater penalty on mispredicted branches and more clock cycles between the issue of the load and the use of the data (see section 3.9).

One alternative to get the best of both virtual and physical caches is to use the page offset (the part unaffected by address translation) to index the cache while sending the virtual part to be translated. This alternative allows the comparison to be with physical addresses and yet overlap the time to read the tags with address translation. The limitation of this virtually indexed, physically tagged alternative is that a direct-mapped cache can be no bigger than the page size.
This is an advantage of the 8-KB caches of the Alpha AXP 21064; the minimum page size is 8 KB, so the 8-bit index can be taken from the physical part of the address.

One way to keep the index small enough to be taken from the physical part of the address and still have a large cache is to use high associativity. Recall that the size of the index is controlled by this formula:

$$2^{\text{index}} = \frac{\text{Cache size}}{\text{Block size} \times \text{Set associativity}}$$

The IBM 3033 cache, as an extreme example, is 16-way set associative, even though studies show there is little benefit to miss rates above eight-way set associativity. This high associativity allows a 64-KB cache to be addressed with a physical index despite the limitation of 4-KB pages in the IBM architecture. Figure 5.27 shows the relationship of index to page offset.

[Figure 5.27: address layout. Bits 31 to 12 hold the page address (the address tag); bits 11 to 0 hold the page offset (the index and block offset).]

FIGURE 5.27 Relationship of index field and page offset in the IBM 3033 cache. The 4-KB page means the last 12 bits of the address are not translated, and hence some of it can be used to index the cache.

One alternative to higher associativity is for the operating system to implement page coloring by guaranteeing that the