E&CE 427: Digital Systems Engineering
Lecture Slides

Instructors: Farzad Khalvati and Muhammad Nummer
Notes by: Mark Aagaard
2007t1 Winter
University of Waterloo
Dept of Electrical and Computer Engineering
January 9, 2007

Contents

I Lecture Notes

1 VHDL
  1.1 Introduction to VHDL
    1.1.1 Levels of Abstraction
    1.1.2 VHDL Origins and History
    1.1.3 Semantics
    1.1.4 Synthesis of a Simulation-Based Language
    1.1.5 Solution to Synthesis Sanity
    1.1.6 Standard Logic 1164
  1.2 Comparison of VHDL to Other Hardware Description Languages
  1.3 Overview of Syntax
    1.3.1 Syntactic Categories
    1.3.2 Library Units
    1.3.3 Entities and Architecture
    1.3.4 Concurrent Statements
    1.3.5 Component Declaration and Instantiations
    1.3.6 Processes
    1.3.7 Sequential Statements
    1.3.8 A Few More Miscellaneous VHDL Features
  1.4 Concurrent vs Sequential Statements
    1.4.1 Concurrent Assignment vs Process
    1.4.2 Conditional Assignment vs If Statements
    1.4.3 Selected Assignment vs Case Statement
    1.4.4 Coding Style
  1.5 Overview of Processes
    1.5.1 Combinational Process vs Clocked Process
    1.5.2 Latch Inference
  1.6 Details of Process Execution
    1.6.1 Temporal Granularities of Simulation
    1.6.2 Intuition Behind Delta-Cycle Simulation
    1.6.3 Definitions and Algorithm
      1.6.3.1 Process Modes
      1.6.3.2 Simulation Algorithm
      1.6.3.3 Delta-Cycle Definitions
    1.6.4 Example 1: Process Execution (Bamboozle)
    1.6.5 Example 2: Process Execution (Flummox)
    1.6.6 Ex: Need for Provisional Asn
    1.6.7 Delta-Cycle Simulations of Flip-Flops
  1.7 Register-Transfer Level Simulation
    1.7.1 Technique for Register-Transfer Level Simulation
    1.7.2 Examples of RTL Simulation
  1.8 VHDL and Hardware Building Blocks
    1.8.1 Basic Building Blocks
    1.8.2 Deprecated Building Blocks for RTL
    1.8.3 Hardware and Code for Flops
      1.8.3.1 Flops with Waits and Ifs
      1.8.3.2 Flops with Synchronous Reset
      1.8.3.3 Flop with Chip-Enable and Mux on Input
      1.8.3.4 Flops with Chip-Enable, Muxes, and Reset
    1.8.4 An Example Sequential Circuit
  1.9 Arrays and Vectors
  1.10 Arithmetic
    1.10.1 Arithmetic Packages
    1.10.2 Shift and Rotate Operations
    1.10.3 Overloading of Arithmetic
    1.10.4 Different Widths and Arithmetic
    1.10.5 Overloading of Comparisons
    1.10.6 Different Widths and Comparisons
    1.10.7 Type Conversion
  1.11 Synthesizable vs Non-Synthesizable Code
    1.11.1 Unsynthesizable Code
      1.11.1.1 Initial Values
      1.11.1.2 Wait For
      1.11.1.3 Different Wait Conditions
      1.11.1.4 Multiple if rising edges in Same Process
      1.11.1.5 if rising edge and wait in Same Process
      1.11.1.6 if rising edge with else Clause
      1.11.1.7 if rising edge Inside a for Loop
      1.11.1.8 wait Inside of a for loop
  1.12 Synthesizable VHDL Coding Guidelines

2 RTL Design with VHDL
  2.1 Prelude to Chapter
  2.2 FPGA Background and Coding Guidelines
  2.3 Design Flow
  2.4 Algorithms and High-Level Models
  2.5 Finite State Machines in VHDL
    2.5.1 Introduction to State-Machine Design
      2.5.1.1 Mealy vs Moore State Machines
      2.5.1.2 Introduction to State Machines and VHDL
      2.5.1.3 Explicit vs Implicit State Machines
    2.5.2 Implementing a Simple Moore Machine
      2.5.2.1 Implicit Moore State Machine
      2.5.2.2 Explicit Moore with Flopped Output
      2.5.2.3 Explicit Moore with Combinational Outputs
      2.5.2.4 Explicit-Current+Next Moore with Concurrent Assignment
      2.5.2.5 E-C+N Moore with Comb Proc
    2.5.3 Implementing a Simple Mealy Machine
    2.5.4 Reset
    2.5.5 State Encoding
  2.6 Dataflow Diagrams
    2.6.1 Dataflow Diagrams Overview
    2.6.2 Dataflow Diagrams, Hardware, and Behaviour
    2.6.3 Dataflow Diagram Execution
    2.6.4 Performance Estimation
    2.6.5 Area Estimation
    2.6.6 Design Analysis
    2.6.7 Area / Performance Tradeoffs
  2.7 Memory Arrays and RTL Design
    2.7.1 Memory Arrays in VHDL
    2.7.2 Data Dependencies
    2.7.3 Memory Arrays and Dataflow Diagrams
    2.7.4 Ex: Mem Array and Dataflow Diagram
  2.8 Input / Output Protocols
  2.9 Design Example: Massey
  2.10 Design Example: Vanier
    2.10.1 Requirements
    2.10.2 Algorithm
    2.10.3 Initial Dataflow Diagram
    2.10.4 Reschedule to Meet Requirements
    2.10.5 Optimize Resources
    2.10.6 Assign Names to Registered Values
    2.10.7 Input/Output Allocation
    2.10.8 Tangent: Combinational Outputs
    2.10.9 Register Allocation
    2.10.10 Datapath Allocation
    2.10.11 Hardware Block Diagram and State Machine
      2.10.11.1 Control for Registers
      2.10.11.2 Control for Datapath Components
      2.10.11.3 Control for State
      2.10.11.4 Complete State Machine Table
    2.10.12 VHDL Code with Explicit State Machine
    2.10.13 Peephole Optimizations
    2.10.14 Notes and Observations
  2.11 Design Example: Stack
  2.12 Optimization Techniques
    2.12.1 Strength Reduction
      2.12.1.1 Arithmetic Strength Reduction
      2.12.1.2 Boolean Strength Reduction
    2.12.2 Replication and Sharing
      2.12.2.1 Mux-Pushing
      2.12.2.2 Common Subexpression Elimination
      2.12.2.3 Computation Replication
    2.12.3 Arithmetic
    2.12.4 Pipelining

3 Functional Verification
  3.1 Overview
    3.1.1 Terminology: Validation / Verification / Testing
    3.1.2 The Difficulty of Designing Correct Chips
      3.1.2.1 Notes from Kenn Heinrich (UW E&CE grad)
      3.1.2.2 Notes from Aart de Geus (Chairman and CEO of Synopsys)
  3.2 Test Cases and Coverage
    3.2.1 Coverage
    3.2.2 Floating Point Divider Example
  3.3 Testbenches
    3.3.1 Overview of Test Benches
    3.3.2 Reference Model Style Testbench
    3.3.3 Relational Style Testbench
    3.3.4 Coding Structure of a Testbench
    3.3.5 Datapath vs Control
    3.3.6 Verification Tips
  3.4 Functional Verification for Datapath Circuits
    3.4.1 A Spec-Less Testbench
    3.4.2 Use an Array for Test Vectors
    3.4.3 Build Spec into Stimulus
    3.4.4 Have Separate Specification Entity
    3.4.5 Generate Test Vectors Automatically
    3.4.6 Relational Specification
  3.5 Functional Verification of Control Circuits
    3.5.1 Overview of Queues in Hardware
    3.5.2 VHDL Coding
      3.5.2.1 Package
      3.5.2.2 Other VHDL Coding
    3.5.3 Code Structure for Verification
    3.5.4 Instrumentation Code
    3.5.5 Assertions
    3.5.6 VHDL Coding Tips
    3.5.7 Queue Specification
    3.5.8 Queue Testbench

4 Performance Analysis and Optimization
  4.1 Introduction
  4.2 Defining Performance
  4.3 Comparing Performance
    4.3.1 General Equations
    4.3.2 Example: Performance of Printers
  4.4 Clock Speed, CPI, Program Length, and Performance
    4.4.1 Mathematics
    4.4.2 Example: CISC vs RISC and CPI
    4.4.3 Effect of Instruction Set on Performance
    4.4.4 Effect of Time to Market on Relative Performance
    4.4.5 Summary of Equations
  4.5 Performance Analysis and Dataflow Diagrams
    4.5.1 Dataflow Diagrams, CPI, and Clock Speed
    4.5.2 Examples of Dataflow Diagrams for Two Instructions
      4.5.2.1 Scheduling of Operations for Different Clock Periods
      4.5.2.2 Performance Computation for Different Clock Periods
      4.5.2.3 Example: Two Instructions Taking Similar Time
      4.5.2.4 Example: Same Total Time, Different Order for A
    4.5.3 Example: From Algorithm to Optimized Dataflow

5 Optimization
  5.1 Pipelining
    5.1.1 Introduction to Pipelining
    5.1.2 Partially Pipelined
    5.1.3 Pipelined Version of InstP
    5.1.4 Pipelined Version of InstP/InstQ

6 Timing Analysis
  6.1 Delays and Definitions
    6.1.1 Background Definitions
    6.1.2 Clock-Related Timing Definitions
      6.1.2.1 Clock Skew
      6.1.2.2 Clock Latency
      6.1.2.3 Clock Jitter
    6.1.3 Storage Related Timing Definitions
      6.1.3.1 Setup Time
      6.1.3.2 Hold Time
      6.1.3.3 Clock-to-Q Time
    6.1.4 Propagation Delays
    6.1.5 Summary of Delay Factors
    6.1.6 Timing Constraints
      6.1.6.1 Minimum Clock Period
      6.1.6.2 Hold Constraint
      6.1.6.3 Example Timing Violations
  6.2 Timing Analysis of Latches and Flip Flops
    6.2.1 Review: Latch, Flip-Flop, Setup, Hold, Clock-to-Q
    6.2.2 Simple Multiplexer Latch
      6.2.2.1 Structure and Behaviour of Multiplexer Latch
      6.2.2.2 Strategy for Timing Analysis of Storage Devices
      6.2.2.3 Clock-to-Q Time of a Multiplexer Latch
      6.2.2.4 Setup Timing of a Multiplexer Latch
      6.2.2.5 Hold Time of a Multiplexer Latch
      6.2.2.6 Example of a Bad Latch
  6.3 Critical Paths and False Paths
    6.3.1 Introduction to Critical and False Paths
      6.3.1.1 Example of Critical Path in Full Adder
      6.3.1.2 Preliminaries for Critical Paths
      6.3.1.3 Longest Path and Critical Path
    6.3.2 Longest Path
    6.3.3 Detecting a False Path
      6.3.3.1 Preliminaries
      6.3.3.2 Almost-Correct Algorithm to Detect a False Path
      6.3.3.3 Examples of Detecting False Paths
    6.3.4 Finding the Next Candidate Path
      6.3.4.1 Algorithm to Find Next Candidate Path
      6.3.4.2 Examples of Finding Next Candidate Path
    6.3.5 Correct Algorithm to Find Critical Path
      6.3.5.1 Algorithm
      6.3.5.2 Examples
  6.4 Analog Timing Model
    6.4.1 Timing Model
      6.4.1.1 Equation for Output Voltage
  6.5 Elmore Delay Model
    6.5.1 Elmore Time Constant
    6.5.2 Interconnect with Single Fanout
    6.5.3 Interconnect with Multiple Gates in Fanout
  6.6 Practical Usage of Timing Analysis
    6.6.1 Speed Binning
      6.6.1.1 FPGAs, Interconnect, and Synthesis
    6.6.2 Worst Case Timing
      6.6.2.1 Fanout delay
      6.6.2.2 Derating Factors

7 Power Analysis and Power-Aware Design
  7.1 Overview
    7.1.1 Importance of Power and Energy
    7.1.2 Industrial Names and Products
    7.1.3 Power vs Energy
    7.1.4 Batteries, Power and Energy
      7.1.4.1 Do Batteries Store Energy or Power?
      7.1.4.2 Battery Life and Efficiency
      7.1.4.3 Battery Life and Power
  7.2 Power Equations
    7.2.1 Switching Power
    7.2.2 Short-Circuited Power
    7.2.3 Leakage Power
    7.2.4 Glossary
    7.2.5 Note on Power Equations
  7.3 Overview of Power Reduction Techniques
  7.4 Voltage Reduction for Power Reduction
  7.5 Data Encoding for Power Reduction
    7.5.1 How Data Encoding Can Reduce Power
    7.5.2 Example Problem: Sixteen Pulser
      7.5.2.1 Problem Statement
      7.5.2.2 Additional Information
      7.5.2.3 Answer
  7.6 Clock Gating
    7.6.1 Introduction to Clock Gating
    7.6.2 Implementing Clock Gating
    7.6.3 Design Process
    7.6.4 Effectiveness of Clock Gating
    7.6.5 Example: Reduced Activity Factor with Clock Gating
    7.6.6 Clock Gating with Valid-Bit Protocol
      7.6.6.1 Valid-Bit Protocol
      7.6.6.2 How Many Clock Cycles for Module?
      7.6.6.3 Adding Clock-Gating Circuitry
    7.6.7 Example: Pipelined Circuit with Clock-Gating

8 Fault Testing and Testability
  8.1 Faults and Testing
    8.1.1 Overview of Faults and Testing
      8.1.1.1 Faults (Smith 14.3)
      8.1.1.2 Causes of Faults (Smith 14.3)
      8.1.1.3 Testing (Smith 14)
      8.1.1.4 Burn In (Smith 14.3.1)
      8.1.1.5 Bin Sorting (Smith 5.1.6)
      8.1.1.6 Testing Techniques (Smith 14)
      8.1.1.7 Design for Testability (DFT) (Smith 14.6)
    8.1.2 Example Problem: Economics of Testing (Smith 14.1)
    8.1.3 Physical Faults (Smith 14.3.3)
      8.1.3.1 Types of Physical Faults
      8.1.3.2 Locations of Faults
      8.1.3.3 Layout Affects Locations
      8.1.3.4 Naming Fault Locations
    8.1.4 Detecting a Fault
      8.1.4.1 Which Test Vectors will Detect a Fault?
    8.1.5 Mathematical Models of Faults (Smith 14.3.4)
      8.1.5.1 Single Stuck-At Fault Model
    8.1.6 Generate Test Vector to Find a Mathematical Fault (Smith 14.4)
      8.1.6.1 Algorithm
      8.1.6.2 Example of Finding a Test Vector
    8.1.7 Undetectable Faults
      8.1.7.1 Redundant Circuitry
      8.1.7.2 Curious Circuitry and Fault Detection
  8.2 Test Generation
    8.2.1 A Small Example
    8.2.2 Choosing Test Vectors
      8.2.2.1 Fault Domination
      8.2.2.2 Fault Equivalence
      8.2.2.3 Gate Collapsing
      8.2.2.4 Node Collapsing
      8.2.2.5 Fault Collapsing Summary
    8.2.3 Fault Coverage
    8.2.4 Test Vector Generation and Fault Detection
    8.2.5 Generate Test Vectors for 100% Coverage
      8.2.5.1 Collapse the Faults
      8.2.5.2 Check for Fault Domination
      8.2.5.3 Required Test Vectors
      8.2.5.4 Faults Not Covered by Required Test Vectors
      8.2.5.5 Order to Run Test Vectors
      8.2.5.6 Summary of Technique to Find and Order Test Vectors
    8.2.6 One Fault Hiding Another
  8.3 Scan Testing in General
    8.3.1 Structure and Behaviour of Scan Testing
    8.3.2 Scan Chains
      8.3.2.1 Circuitry in Normal and Scan Mode
      8.3.2.2 Scan in Operation
      8.3.2.3 Scan in Operation with Example Circuit
    8.3.3 Summary of Scan Testing
    8.3.4 Time to Test a Chip
      8.3.4.1 Example: Time to Test a Chip
  8.4 Boundary Scan and JTAG
    8.4.1 Scan Instructions
  8.5 Built In Self Test
    8.5.1 Block Diagram
      8.5.1.1 Components
      8.5.1.2 Linear Feedback Shift Register (LFSR)
      8.5.1.3 Maximal-Length LFSR
    8.5.2 Test Generator
    8.5.3 Signature Analyzer
    8.5.4 Result Checker
    8.5.5 Arithmetic over Binary Fields
    8.5.6 Shift Registers and Characteristic Polynomials
      8.5.6.1 Circuit Multiplication
    8.5.7 Bit Streams and Characteristic Polynomials
    8.5.8 Division
    8.5.9 Signature Analysis: Math and Circuits
  8.6 Scan vs Self Test

9 Review
  9.1 Overview of the Term
  9.2 VHDL
    9.2.1 VHDL Topics
    9.2.2 VHDL Example Problems
  9.3 RTL Design Techniques
    9.3.1 Design Topics
    9.3.2 Design Example Problems
  9.4 Functional Verification
    9.4.1 Verification Topics
    9.4.2 Verification Example Problems
  9.5 Performance Analysis and Optimization
    9.5.1 Performance Topics
    9.5.2 Performance Example Problems
  9.6 Timing Analysis
    9.6.1 Timing Topics
    9.6.2 Timing Example Problems
  9.7 Power
    9.7.1 Power Topics
    9.7.2 Power Example Problems
  9.8 Testing
    9.8.1 Testing Topics
    9.8.2 Testing Example Problems
  9.9 Formulas to be Given on Final Exam

Part I Lecture Notes

Chapter 1 VHDL: The Language
1.1 Introduction to VHDL

1.1.1 Levels of Abstraction

Transistor: Signal values and time are continuous (analog). Each transistor is modeled by a resistor-capacitor network.
Switch: Time is continuous, but voltage may be either continuous or discrete. Linear equations are used.
Gate: Transistors are grouped together into gates. Voltages are discrete values such as 0 and 1.
Register transfer level: Hardware is modeled as assignments to registers and combinational signals. Basic unit of time is one clock cycle.
Transaction level: A transaction is an operation such as transferring data across a bus. Building blocks are processors, controllers, etc. Languages: VHDL, SystemC, or SystemVerilog.
Electronic-system level: Looks at an entire electronic system, with both hardware and software.

1.1.2 VHDL Origins and History

VHDL = VHSIC Hardware Description Language
VHSIC = Very High Speed Integrated Circuit

"The VHSIC Hardware Description Language (VHDL) is a formal notation intended for use in all phases of the creation of electronic systems. Because it is both machine readable and human readable, it supports the development, verification, synthesis and testing of hardware designs, the communication of hardware design data, and the maintenance, modification, and procurement of hardware."
    Language Reference Manual (IEEE Design Automation Standards Committee, 1993a)

VHDL is a lot more than synthesis of digital hardware.

1.1.3 Semantics

The original goal of VHDL was to simulate circuits. The semantics of the language define circuit behaviour.

[Figure: simulation of c <= a AND b; with signals a, b, and c]

But now, VHDL is used in both simulation and synthesis.

Synthesis

Synthesis is a computer-aided design (CAD) technique that transforms a designer's concise, high-level description of a circuit into a structural description of a circuit.

Synthesis is concerned with the structure of the circuit.
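To make the simulation side of this concrete, the fragment c <= a AND b; can be wrapped in a complete design unit and driven by a small testbench. The following is a minimal sketch, not code from the slides: the names and_gate and and_gate_tb, the 10 ns settle time, and the use of VHDL-93 direct entity instantiation are all assumptions made for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical wrapper around the single assignment from the slide.
entity and_gate is
  port (
    a, b : in  std_logic;
    c    : out std_logic
  );
end and_gate;

architecture main of and_gate is
begin
  c <= a AND b;
end main;

library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench: simulation-only code with no ports.
entity and_gate_tb is
end and_gate_tb;

architecture sim of and_gate_tb is
  signal a, b, c : std_logic;
begin
  -- Direct entity instantiation (VHDL-93) of the design under test.
  dut : entity work.and_gate
    port map (a => a, b => b, c => c);

  stimulus : process
  begin
    -- Apply each input combination, wait an arbitrary 10 ns, check c.
    a <= '0'; b <= '0'; wait for 10 ns;
    assert c = '0' report "0 AND 0 should be 0" severity error;
    a <= '0'; b <= '1'; wait for 10 ns;
    assert c = '0' report "0 AND 1 should be 0" severity error;
    a <= '1'; b <= '0'; wait for 10 ns;
    assert c = '0' report "1 AND 0 should be 0" severity error;
    a <= '1'; b <= '1'; wait for 10 ns;
    assert c = '1' report "1 AND 1 should be 1" severity error;
    wait;  -- suspend forever so the simulation can end
  end process;
end sim;
```

The testbench relies on wait for, which has meaning only in simulation. Only and_gate would be handed to a synthesis tool, which is one way to see that the same language serves two different purposes.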
Synthesis converts one type of description (behavioural) into another, lower-level description (usually a netlist).

[Figure: synthesis of c <= a AND b; into an AND gate with inputs a and b and output c]

CAD Tools

CAD tools allow designers to automate lower-level design processes in implementing the desired functionality of a system.

NOTE: EDA = Electronic Design Automation. In digital hardware design, EDA = CAD.

Synthesis vs Simulation

For synthesis, we want the code we write to define the structure of the hardware that is generated.

[Figure: synthesis of c <= a AND b; into an AND gate with inputs a and b and output c]

The VHDL semantics define the behaviour of the hardware that is generated, not the structure of the hardware.

[Figure: c <= a AND b; synthesized two ways: same behaviour in simulation, different structure]

1.1.4 Synthesis of a Simulation-Based Language

This section reserved for your reading pleasure.

1.1.5 Solution to Synthesis Sanity

Pick a high-quality synthesis tool and study its documentation thoroughly.
Learn the idioms of the tool.
Different VHDL code with the same behaviour can result in very different circuits.
Be careful if you have to port VHDL code from one tool to another.
KISS: Keep It Simple Stupid.
VHDL examples will illustrate reliable coding techniques for the synthesis tools from Synopsys, Mentor Graphics, Altera, Xilinx, and most other companies as well.
Follow the coding guidelines and examples from lecture.
As you write VHDL, think about the hardware you expect to get.
Note: If you can't predict the hardware, then the hardware probably won't be very good (small, fast, correct, etc).

1.1.6 Standard Logic 1164

std_logic_1164: IEEE standard for signal values in VHDL.
    'U'   uninitialized
    'X'   strong unknown
    '0'   strong 0
    '1'   strong 1
    'Z'   high impedance
    'W'   weak unknown
    'L'   weak 0
    'H'   weak 1
    '-'   don't care

The most common values are: 'U', 'X', '0', '1'. If you see 'X' in a simulation, it usually means that there is a mistake in your code.

1.2 Comparison of VHDL to Other Hardware Description Languages

This section reserved for your reading pleasure

1.3 Overview of Syntax

1.3.1 Syntactic Categories

This section reserved for your reading pleasure

1.3.2 Library Units

This section reserved for your reading pleasure

1.3.3 Entities and Architecture

Each hardware module is described with an Entity/Architecture pair

[Figure: Entity and Architecture]

Entity

    library ieee;
    use ieee.std_logic_1164.all;
    entity and_or is
      port (
        a, b, c : in  std_logic;
        z       : out std_logic
      );
    end and_or;

Example of an entity

Architecture

    architecture main of and_or is
      signal x : std_logic;
    begin
      x <= a AND b;
      z <= x OR (a AND c);
    end main;

Example of architecture

1.3.4 Concurrent Statements

Architectures contain concurrent statements
Concurrent statements execute in parallel (Figure 1.4)
Concurrent statements make VHDL fundamentally different from most software languages.
Hardware (gates) naturally execute in parallel
VHDL mimics the behaviour of real hardware.
At each infinitesimally small moment of time, each gate:
1. samples its inputs
2. computes the value of its output
3. drives the output
Concurrent Statements

    architecture main of bowser is
    begin
      x1 <= a AND b;
      x2 <= NOT x1;
      z  <= NOT x2;
    end main;

    architecture main of bowser is
    begin
      z  <= NOT x2;
      x2 <= NOT x1;
      x1 <= a AND b;
    end main;

[Figure: circuit with inputs a, b, internal signals x1, x2, and output z]

The order of concurrent statements doesn't matter

Types of Concurrent Statements

conditional assignment: similar to conventional if-then-else

    c <= a+b when sel='1' else a+c when sel='0' else "0000";

selected assignment: similar to conventional case/switch

    with color select d <= "00" when red, "01" when . . . ;

component instantiation: use a hardware module/component

    add1 : adder port map( a => f, b => g, s => h, co => i);

for-generate: create multiple pieces of hardware

    bgen: for i in 1 to 7 generate
      b(i) <= a(7-i);
    end generate;

if-generate: conditionally create some hardware

    okgen : if optgoal /= fast generate
      result <= ((a and b) or (d and not e)) or g;
    end generate;
    fastgen : if optgoal = fast generate
      result <= '1';
    end generate;

process: description of complex behaviour (Section 1.3.6)

1.3.5 Component Declaration and Instantiations

This section reserved for your reading pleasure

1.3.6 Processes

Processes are used to describe complex and potentially unsynthesizable behaviour
A process is a concurrent statement (Section 1.3.4).
The body of a process contains sequential statements (Section 1.3.7)
Processes are the most complex and difficult to understand part of VHDL (Sections 1.5 and 1.6)

Example Process with Sensitivity List

    process (a, b, c) begin
      y <= a AND b;
      if (a = '1') then
        z1 <= b AND c;
        z2 <= NOT c;
      else
        z1 <= b OR c;
        z2 <= c;
      end if;
    end process;
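The two bowser architectures in Section 1.3.4 are shown without an entity or signal declarations. A minimal self-contained sketch is given here, assuming std_logic ports for a, b, z and internal signals x1, x2 (the port list and declarations are assumptions, not taken from the slides). The statements are deliberately written in "reverse" order to show that the circuit described is unchanged.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity for the bowser example; the port list is assumed.
entity bowser is
  port (
    a, b : in  std_logic;
    z    : out std_logic
  );
end bowser;

architecture main of bowser is
  -- Internal signals connecting the three gates (assumed declarations).
  signal x1, x2 : std_logic;
begin
  -- Concurrent statements: textual order has no effect on behaviour.
  z  <= NOT x2;       -- second inverter
  x2 <= NOT x1;       -- first inverter
  x1 <= a AND b;      -- AND gate
end main;
```

Simulating this version and the "forward" version with the same stimulus should give identical waveforms on z, because each statement behaves as an independent piece of hardware.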
Example Process with Wait Statements

    process begin
      y <= a AND b;
      z <= '0';
      wait until rising_edge(clk);
      if (a = '1') then
        z <= '1';
        y <= '0';
        wait until rising_edge(clk);
      else
        y <= a OR b;
      end if;
    end process;

Sensitivity Lists and Wait Statements

Processes must have either a sensitivity list or at least one wait statement on each execution path through the process.
Processes cannot have both a sensitivity list and a wait statement.

Sensitivity List

The sensitivity list contains the signals that are read in the process.
A process is executed when a signal in its sensitivity list changes value.
An important coding guideline to ensure consistent synthesis and simulation results is to include all signals that are read in the sensitivity list. There is one exception to this rule: for a process that implements a flip-flop with an if rising_edge statement, it is acceptable to include only the clock signal in the sensitivity list; other signals may be included, but are not needed.

1.3.7 Sequential Statements

Used inside processes and functions.

    wait                wait until . . . ;
    signal assignment   . . . <= . . . ;
    if-then-else        if . . . then . . . elsif . . . end if;
    case                case . . . is when . . . | . . . => . . . ; when . . . => . . . ; end case;
    loop                loop . . . end loop;
    while loop          while . . . loop . . . end loop;
    for loop            for . . . in . . . loop . . . end loop;
    next                next . . . ;

The most commonly used sequential statements

1.3.8 A Few More Miscellaneous VHDL Features

This section reserved for your reading pleasure

1.4 Concurrent vs Sequential Statements

All concurrent assignments can be translated into sequential statements. But not all sequential statements can be translated into concurrent statements.

1.4.1 Concurrent Assignment vs Process

The two code fragments below have identical behaviour:

    architecture main of tiny is
    begin
      b <= a;
    end main;

    architecture main of tiny is
    begin
      process (a) begin
        b <= a;
      end process;
    end main;

1.4.2 Conditional Assignment vs If Statements

The two code fragments below have identical behaviour:

Concurrent Statements

    t <= <val1> when <cond> else <val2>;

Sequential Statements

    if <cond> then
      t <= <val1>;
    else
      t <= <val2>;
    end if;

1.4.3 Selected Assignment vs Case Statement

The two code fragments below have identical behaviour:

Concurrent Statements

    with <expr> select
      t <= <val1> when <choices1>,
           <val2> when <choices2>,
           <val3> when <choices3>;

Sequential Statements

    case <expr> is
      when <choices1> => t <= <val1>;
      when <choices2> => t <= <val2>;
      when <choices3> => t <= <val3>;
    end case;

1.4.4 Coding Style

Code that's easy to write with sequential statements, but difficult with concurrent:

    case <expr> is
      when <choice1> =>
        if <cond> then
          o <= <expr1>;
        else
          o <= <expr2>;
        end if;
      when <choice2> =>
        ...
    end case;

1.5 Overview of Processes

Processes are the most difficult VHDL construct to understand. This section gives an overview of processes. Section 1.6 gives the details of the semantics of processes.
Within a process, statements are executed almost sequentially
Among processes, execution is done in parallel
Remember: a process is a concurrent statement!
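The equivalence of a conditional assignment and an if statement (Section 1.4.2) can be made concrete with a small design that drives two outputs from the same inputs, one in each style. This is only a sketch: the entity name mux2 and the signal names sel, d0, d1, t_conc, t_seq are hypothetical and chosen for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical two-input multiplexer, written twice.
entity mux2 is
  port (
    sel, d0, d1 : in  std_logic;
    t_conc      : out std_logic;  -- driven by the concurrent form
    t_seq       : out std_logic   -- driven by the sequential form
  );
end mux2;

architecture main of mux2 is
begin
  -- Concurrent form: conditional assignment.
  t_conc <= d1 when sel = '1' else d0;

  -- Sequential form: if statement inside a combinational process.
  -- All signals that are read appear in the sensitivity list.
  process (sel, d0, d1) begin
    if sel = '1' then
      t_seq <= d1;
    else
      t_seq <= d0;
    end if;
  end process;
end main;
```

In simulation, t_conc and t_seq should carry the same waveform for any stimulus, and a synthesis tool would be expected to produce a multiplexer for each.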
Process Semantics

VHDL mimics hardware
Hardware (gates) execute in parallel
Processes execute in parallel with each other
All possible orders of executing processes must produce the same simulation results (waveforms)
If a signal is not assigned a value, then it holds its previous value

    architecture
      procA: process
        stmtA1;
        stmtA2;
        stmtA3;
      end process;
      procB: process
        stmtB1;
        stmtB2;
      end process;

[Figure: three execution sequences of A1 A2 A3 and B1 B2. Single threaded: procA before procB. Single threaded: procB before procA. Multithreaded: procA and procB in parallel. Caption: Different process execution sequences]

All orders of executing concurrent statements must produce the same waveforms

1.5.1 Combinational Process vs Clocked Process

Each well-written synthesizable process is either combinational or clocked.

Combinational process:
Executing the process takes part of one clock cycle
Target signals are outputs of combinational circuitry
A combinational process must have a sensitivity list
A combinational process must not have any wait statements
A combinational process must not have any rising_edges or falling_edges
The hardware for a combinational process is just combinational circuitry

Clocked process:
Executing the process takes one (or more) clock cycles
Target signals are outputs of flops
Process contains one or more wait or if rising_edge statements
Hardware contains combinational circuitry and flip-flops

Note: Clocked processes are sometimes called sequential processes, but this can be easily confused with sequential statements, so in E&CE 427 we'll refer to synthesizable processes as either combinational or clocked.

Combinational or Clocked Process? (1)

    process (a, b, c) begin
      p1 <= a;
      if (b = c) then
        p2 <= b;
      else
        p2 <= a;
      end if;
    end process;

Combinational or Clocked Process? (2)

    process begin
      wait until rising_edge(clk);
      b <= a;
    end process;

Combinational or Clocked Process? (3)

    process (clk) begin
      if rising_edge(clk) then
        b <= a;
      end if;
    end process;

Combinational or Clocked Process? (4)

    process (clk) begin
      a <= clk;
    end process;

Combinational or Clocked Process? (5)

    process begin
      wait until rising_edge(a);
      c <= b;
    end process;

1.5.2 Latch Inference

The semantics of VHDL require that if a signal is assigned a value on some passes through a process and not on other passes, then on a pass through the process when the signal is not assigned a value, it must maintain its value from the previous pass.

    process (a, b, c) begin
      if (a = '1') then
        z1 <= b;
        z2 <= b;
      else
        z1 <= c;
      end if;
    end process;

[Figure: Example of latch inference, with inputs a, b, c and outputs z1, z2]

Latch Inference

When a signal's value must be stored, VHDL infers a latch or a flip-flop in the hardware to store the value.
If you want a latch or a flip-flop for the signal, then latch inference is good.
If you want combinational circuitry, then latch inference is bad.

Loop, Latch, Flop

[Figure: three circuits, each with inputs a, b and output z: a combinational loop, a latch (EN), and a flip-flop (D, Q)]

1.6 Details of Process Execution

1.6.1 Temporal Granularities of Simulation

This section reserved for your reading pleasure

1.6.2 Intuition Behind Delta-Cycle Simulation

In zero-delay simulation, a sequence of dependent events must appear to happen instantaneously (in zero time).
In particular, the effect of an event must propagate instantaneously through combinational circuitry.

Two fundamental rules for zero-delay simulation:
1. events appear to propagate through combinational circuitry instantaneously.
2. all of the gates appear to operate in parallel

Question: Write VHDL code for each of the circuits in the Loop, Latch, Flop figure (Section 1.5.2).

Intuition for Delta Cycles

To make it appear that events propagate instantaneously, VHDL introduces an artificial unit of time, the delta cycle, to represent an infinitesimally small amount of time. In each delta cycle, every gate in the circuit will sample its inputs, compute its result, and drive its output signal with the result.
Simulators simulate one gate at a time, but the waveforms make it appear that all of the gates were run in parallel.
In each delta cycle, the simulator executes all gates whose inputs changed.
To preserve the illusion that the gates ran in parallel, the effect of simulating a gate remains invisible until the end of the delta cycle.

1.6.3 Definitions and Algorithm

1.6.3.1 Process Modes

[Figure: process-mode state diagram. postponed to active (activate), active to suspended (suspend), suspended to postponed (resume)]

Suspended
Nothing to currently execute
A process stays suspended until the event that it is waiting for occurs: either a change in a signal on its sensitivity list or the condition in a wait statement

Postponed
Wants to execute, but not currently active
A process stays postponed until the simulator chooses it from the pool of postponed processes

Active
Currently executing
A process stays active until it hits a wait statement or sensitivity list, at which point it suspends

1.6.3.2 Simulation Algorithm

The algorithm presented here is a simplification of the actual algorithm in the VHDL Standard. This algorithm does not support delayed assignments; for example: (a <= b after 2 ns;).

A somewhat ironic note: only six of the two hundred pages in the VHDL Standard are devoted to the semantics of executing processes.

The Algorithm

Simulations start at step 1 with all processes postponed and all signals with a default value (e.g., 'U' for std_logic).
1. While there are postponed processes:
   (a) Pick one or more postponed processes to execute (become active).
   (b) Provisionally execute assignments (new values become visible at step 3)
   (c) A process executes until it hits its sensitivity list or a wait statement, at which point it suspends.
   (d) Processes that become suspended stay suspended until there are no more postponed or active processes.
2. Each process checks its sensitivity list or wait condition to see if it should resume
3. Update signals with their provisional values
4. If no postponed processes, then increment simulation time to next event.

Notes on Simulation Algorithm

At a wait statement, the process will suspend even if the condition is true in the current simulation cycle. The process will resume when the condition changes to true.
In n-threaded execution, at most n processes are active at a time

1.6.3.3 Delta-Cycle Definitions

Definition simulation step: Executing one sequential assignment or process mode change.
Definition simulation cycle: The operations that occur in one iteration of the simulation algorithm.
Definition delta cycle: A simulation cycle that does not advance simulation time.
Definition simulation round: A sequence of simulation cycles that all have the same simulation time.

1.6.4 Example 1: Process Execution (Bamboozle)

This section reserved for your reading pleasure
1.6.5 Example 2: Process Execution (Flummox)

This example is a variation of the Bamboozle example from Section 1.6.4.

    proc1: process (a, b, c) begin
      c <= a AND b;
      d <= NOT c;
    end process;

    proc2: process (b, d) begin
      e <= b AND d;
    end process;

    proc3: process begin
      a <= '1';
      b <= '0';
      wait for 3 ns;
      b <= '1';
      wait for 99 ns;
    end process;

[Figure: blank delta-cycle simulation worksheet starting at 0 ns, with rows for sim round, sim cycle, delta cycle, proc1, proc2, proc3, a, b, c, d, e. All processes start postponed (P) and all signals start at U. Legend: process mode (S = suspended, P = postponed, A = active), simulation-step pointer (one per process), visible-assignment value, provisional-assignment value, initial values, simulation step]

Summary of the algorithm used on the worksheet:
1. While there are postponed processes:
   (a) Pick process(es) to activate
   (b) Execute active processes, record provisional assignments
   (c) Suspend at sensitivity list or wait statement
   (d) Once suspended, stay suspended
2. Check sensitivity lists and wait conditions for changes
3. Update signals with provisional values
4. If no postponed processes, increment time

From Delta-Time to Real Time

[Figure: two blank waveform worksheets for a, b, c, d, e. The first uses a delta-time axis (0ns, +1, +2, +3, 3ns, +1, +2, +3, 102ns) with rows for sim round, sim cycle, delta cycle, proc1, proc2, proc3. The second uses a real-time axis (0ns, 1ns, 2ns, 3ns, 4ns, ..., 100ns, 101ns, 102ns)]

Note and Questions

Note: If a signal is updated with the same value it had in the previous simulation cycle, then it does not change, and therefore does not trigger processes to resume.
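The Flummox processes above are shown without an enclosing design unit. To check a hand simulation against a simulator, they can be wrapped as in the sketch below. The entity name flummox and the signal declarations are assumptions; the three processes are taken from the example, with the std_logic quotes restored.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical top-level wrapper with no ports: all signals are internal.
entity flummox is
end flummox;

architecture main of flummox is
  -- std_logic signals default to 'U', matching the worksheet's initial values.
  signal a, b, c, d, e : std_logic;
begin
  proc1 : process (a, b, c) begin
    c <= a AND b;
    d <= NOT c;      -- reads the value of c visible at the start of this pass
  end process;

  proc2 : process (b, d) begin
    e <= b AND d;
  end process;

  -- Timed process: drives the stimulus and repeats every 102 ns.
  proc3 : process begin
    a <= '1';
    b <= '0';
    wait for 3 ns;
    b <= '1';
    wait for 99 ns;
  end process;
end main;
```

When proc3 wraps around at 102 ns, it assigns '1' to a, which already holds '1'. Per the note above, that assignment alone does not cause any process to resume; only the change on b does.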
Question: What are the different granularities of time that occur when doing delta-cycle simulation?

Question: What is the order of granularity, from finest to coarsest, amongst the different granularities related to delta-cycle simulation?

1.6.6 Ex: Need for Provisional Asn

    architecture main of swindle is
    begin
      p_c: process (a, b) begin
        c <= a AND b;
      end process;
      p_d: process (a, c) begin
        d <= a XOR c;
      end process;
    end main;

Question: Draw the circuit.

Circuit to illustrate need for provisional assignments

1. Start with all signals at 0.
2. Simultaneously change to a = 1 and b = 1.

With Provisional Assignments, c Before d

If assignments are not visible within same simulation cycle (correct: i.e. provisional assignments are used)

[Figure: delta-cycle worksheet for p_c and p_d with signals a, b, c, d, all starting at 0; p_c is activated before p_d]

If p_c is scheduled before p_d, then d will have a 1 pulse.

With Provisional Assignments, d Before c

If assignments are not visible within same simulation cycle (correct: i.e. provisional assignments are used)

[Figure: delta-cycle worksheet for p_c and p_d with signals a, b, c, d, all starting at 0; p_d is activated before p_c]

If p_d is scheduled before p_c, then d will have a 1 pulse.

Without Prov. Assignments, c Before d

If assignments are visible within same simulation cycle (incorrect)

[Figure: delta-cycle worksheet for p_c and p_d with signals a, b, c, d, all starting at 0; p_c is activated before p_d]

If p_c is scheduled before p_d, then d will stay constant 0.

Without Prov. Assignments, d Before c

If assignments are visible within same simulation cycle (incorrect)

[Figure: delta-cycle worksheet for p_c and p_d with signals a, b, c, d, all starting at 0; p_d is activated before p_c]

If p_d is scheduled before p_c, then d will have a 1 pulse.

Need for Provisional Assignment

With provisional assignments, both orders of scheduling processes result in the same behaviour on all signals.
Without provisional assignments, different scheduling orders result in different behaviour.

1.6.7 Delta-Cycle Simulations of Flip-Flops

    p_a : process begin
      a <= '0';
      wait for 15 ns;
      a <= '1';
      wait for 20 ns;
    end process;

    p_clk : process begin
      clk <= '0';
      wait for 10 ns;
      clk <= '1';
      wait for 10 ns;
    end process;

    flop : process ( clk ) begin
      if rising_edge( clk ) then
        q <= a;
      end if;
    end process;

[Figure: delta-cycle worksheet starting at 0 ns with rows for sim round, sim cycle, delta cycle, p_a, p_clk, flop, a, clk, q. All processes start postponed (P); a, clk, q start at U]

Redraw with Normal Time Scale

[Figure: waveform worksheet for a, clk, q on a real-time axis from 0ns to 35ns in 5 ns steps]

Back-to-Back Flops

    p_a : process begin
      a <= '0';
      wait for 15 ns;
      a <= '1';
      wait for 20 ns;
    end process;

    p_clk : process begin
      clk <= '0';
      wait for 10 ns;
      clk <= '1';
      wait for 10 ns;
    end process;

    flops : process ( clk ) begin
      if rising_edge( clk ) then
        q1 <= a;
        q2 <= q1;
      end if;
    end process;

[Figure: delta-cycle worksheet with simulation rounds at 0ns, 10ns, 15ns, 20ns, 30ns, 35ns and rows for sim round, sim cycle, delta cycle, p_a, p_clk, flops, a, clk, q1, q2]

Redraw with Normal Time Scale

[Figure: waveform worksheet on a real-time axis from 0ns to 35ns in 5 ns steps]

Testbenches and Clock Phases

    env : process begin
      a <= '1';
      clk <= '0';
      wait for 10 ns;
      a <= '0';
      clk <= '1';
      wait for 10 ns;
    end process;

    flop : process ( clk ) begin
      if rising_edge( clk ) then
        q1 <= a;
      end if;
    end process;

[Figure: delta-cycle worksheet starting at 0 ns with rows for sim round, sim cycle, delta cycle, env, flop, a, clk, q1]
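In the env process above, a and clk change at exactly the same simulation time. One common way to avoid depending on that coincidence is to drive the clock and the data from separate timed processes and to offset the data changes from the clock edges. The sketch below assumes a 20 ns clock period and a 5 ns offset; the entity name tb_phase and both timing values are illustrative assumptions, not taken from the slides.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench: stimulus changes are offset from clock edges.
entity tb_phase is
end tb_phase;

architecture main of tb_phase is
  signal clk, a, q1 : std_logic;
begin
  -- Clock: edges at multiples of 10 ns (rising edges at 10, 30, 50, ... ns).
  p_clk : process begin
    clk <= '0';
    wait for 10 ns;
    clk <= '1';
    wait for 10 ns;
  end process;

  -- Stimulus: a changes at 5, 25, 45, 65, ... ns, never on a clock edge.
  -- a remains 'U' for the first 5 ns.
  p_a : process begin
    wait for 5 ns;
    a <= '1';
    wait for 20 ns;
    a <= '0';
    wait for 15 ns;
  end process;

  -- Flop under test, as in the example above.
  flop : process (clk) begin
    if rising_edge(clk) then
      q1 <= a;
    end if;
  end process;
end main;
```

With this arrangement the value of a is stable for 5 ns on either side of every clock edge, so the value captured by the flop does not depend on how simultaneous events are ordered.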
Redraw with Normal Time Scale

[Figure: waveform worksheet for a, clk, q1 on a real-time axis from 0ns to 20ns]

Warning Note: Testbench signals

For consistent results across different simulators, simulation scripts vs test benches, and timing simulation vs zero-delay simulation, do not change signals in your testbench or script at the same time as the clock changes.

[Figure: three waveform worksheets for a, clk, q1 on a real-time axis from 0ns to 60ns in 10 ns steps, all signals starting at U. Panel titles: "a is output of clocked or combinational process"; "a is output of timed process (testbench or environment): POOR DESIGN"; "a is output of timed process (testbench or environment): GOOD DESIGN"]

1.7 Register-Transfer Level Simulation

1.7.1 Technique for Register-Transfer Level Simulation

Temporally coarser than delta cycle
Columns in waveform diagrams correspond to real time: clock cycles, nanoseconds, etc.
Can simulate both synthesizable and unsynthesizable code
Cannot simulate combinational loops
Same values as delta-cycle at end of simulation round

RTL Simulation Technique

1. Pre-processing
   (a) Separate processes into combinational and non-combinational (clocked and timed)
   (b) Decompose each combinational process into separate processes with one target signal per process
   (c) Sort processes into topological order based on dependencies
2. For each clock cycle or unit of time:
   (a) Run non-combinational processes in any order. Non-combinational assignments read from earlier clock cycle / time step.
   (b) Run combinational processes in topological order. Combinational assignments read from current clock cycle / time step.

1.7.2 Examples of RTL Simulation
Combinational Process Decomposition

Original code

    process (a, b, c) begin
      if a = '1' then
        d <= b;
        e <= c;
      else
        d <= not b;
        e <= b and c;
      end if;
    end process;

After decomposition into separate processes for d and e

    process (a, b, c) begin
      if a = '1' then
        d <= b;
      else
        d <= not b;
      end if;
    end process;

    process (a, b, c) begin
      if a = '1' then
        e <= c;
      else
        e <= b and c;
      end if;
    end process;

RTL Simulation Example

Revisit an earlier example, but do register-transfer-level simulation, rather than delta-cycle simulation.

Original code

    proc1: process (a, b, c) begin
      c <= a AND b;
      d <= NOT c;
    end process;

    proc2: process (b, d) begin
      e <= b AND d;
    end process;

    proc3: process begin
      a <= '1';
      b <= '0';
      wait for 3 ns;
      b <= '1';
      wait for 99 ns;
    end process;

Decompose combinational processes

    proc1c: process (a, b) begin
      c <= a AND b;
    end process;

    proc1d: process (c) begin
      d <= NOT c;
    end process;

    proc2: process (b, d) begin
      e <= b AND d;
    end process;

Combinational processes are already in topological order, because each signal is assigned a value before it is read.

Run Timed Processes

Run timed process (proc3) until suspend at wait for 3 ns;
The signal a gets '1' from 0 to 3 ns.
The signal b gets '0' from 0 to 3 ns.
Run proc1c: The signal c gets a AND b ('1' AND '0' = '0') from 0 to 3 ns.
Run proc1d: The signal d gets NOT c (NOT '0' = '1') from 0 to 3 ns.
Run proc2: The signal e gets b AND d ('0' AND '1' = '0') from 0 to 3 ns.
Run the timed process until suspend at wait for 99 ns;, which takes us from 3ns to 102ns.
Run combinational processes in topological order to calculate values on c, d, e from 3ns to 102ns.

[Figure: blank RTL waveform worksheet for a, b, c, d, e with time marks at 0ns, 3ns, 102ns; all signals start at U]

RTL vs Delta-cycle Simulation

Question: Draw the RTL waveforms that correspond to the delta-cycle waveform below.
[Figure: completed delta-cycle waveform for the Flummox example, with simulation rounds at 0ns (+1, +2), 3ns (+1, +2, +3) and 102ns, rows for sim round, sim cycle, delta cycle, proc1, proc2, proc3, a, b, c, d, e, followed by a blank real-time axis (0ns, 1ns, 2ns, ...) for the RTL waveforms]

Example: Communicating State Machines

Note: It is easier to do a simulation by hand if you start your clock at 0 and use the first clock phase in the waveform diagram for the first values that your VHDL code assigns to signals

Simulate If-Then-Else, Wait Until

    huey: process begin
      clk <= '0';
      wait for 10 ns;
      clk <= '1';
      wait for 10 ns;
    end process;

    dewey: process begin
      a <= to_unsigned(0,4);
      wait until re(clk);
      while (a < 4) loop
        a <= a + 1;
        wait until re(clk);
      end loop;
    end process;

    louie: process begin
      d <= '1';
      wait until re(clk);
      if (a < 2) then
        d <= '0';
        wait until re(clk);
      end if;
    end process;

[Figure: blank waveform worksheet for clk, a, d]

1.8 VHDL and Hardware Building Blocks

1.8.1 Basic Building Blocks

Different classes of building blocks: Boolean, Conditional, Arithmetic, Storage

Basic Building Blocks: Boolean

    VHDL    Description          (schematic symbols not reproduced)
    and     AND gate
    or      OR gate
    not     inverter
    nand    NAND gate
    nor     NOR gate
    xor     exclusive-or gate

Basic Building Blocks: Conditional

    VHDL                                           Description
    if-then-else, when-else, with-select, case     Multiplexer

Basic Building Blocks: Arithmetic

    VHDL        Description
    +           adder
    -           subtracter
    sla, sll    left shifter
    sra, srl    right shifter

Basic Building Blocks: Storage

[Figure: flip-flop schematic with pins D, CE, S, R, Q]

1.8.2 Deprecated Building Blocks for RTL

Some of the common gates you have encountered in previous courses should be avoided when synthesizing register-transfer-level hardware, particularly if FPGAs are the implementation technology.
Latches: Use flops, not latches
T, JK, SR, etc. flip-flops: Limit yourself to D-type flip-flops
Tri-State Buffers: Use multiplexers, not tri-state buffers

Note: Unfortunately and surprisingly, PalmChip has been awarded a US patent for using uni-directional busses (i.e. multiplexers) for system-on-chip designs. The patent was filed in 2000, so all fourth-year design projects completed after that date will need to pay royalties to PalmChip

Basic Building Blocks: Storage (continued)

    VHDL                Description
    clocked process     flip-flop
    memory component    single-port memory (pins WE, A, DI, DO)
    memory component    dual-port memory (pins WE, A0, DI0, DO0, A1, DO1)

What is This?

    process (a) begin
      if rising_edge(a) then
        c <= b;
      end if;
    end process;

1.8.3 Hardware and Code for Flops

1.8.3.1 Flops with Waits and Ifs

    process (clk) begin
      if rising_edge(clk) then
        q <= d;
      end if;
    end process;

VHDL Code for Flip-Flop: Wait-Style

    process begin
      wait until rising_edge(clk);
      q <= d;
    end process;

1.8.3.2 Flops with Synchronous Reset

    process (clk) begin
      if rising_edge(clk) then
        if (reset = '1') then
          q <= '0';
        else
          q <= d;
        end if;
      end if;
    end process;

Flop with Synchronous Reset: Wait-Style

    process begin
      wait until rising_edge(clk);
      if (reset = '1') then
        q <= '0';
      else
        q <= d0;
      end if;
    end process;

Question: Variation on a Floppy Theme

Synchronous or asynchronous reset?

    process (clk, reset) begin
      if (reset = '1') then
        q <= '0';
      else
        if rising_edge(clk) then
          q <= d;
        end if;
      end if;
    end process;

Variated Flop of a Theme

Question: Synchronous or asynchronous reset?

    process begin
      if (reset = '1') then
        q <= '0';
      else
        q <= d0;
      end if;
      wait until rising_edge(clk);
    end process;

Flop with Chip-Enable

    process (clk) begin
      if rising_edge(clk) then
        if (ce = '1') then
          q <= d;
        end if;
      end if;
    end process;

Wait-style flop with chip-enable included in course notes
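The synchronous-reset and chip-enable idioms above can be combined in a single clocked process. The following is a minimal sketch, assuming that reset takes priority over ce; the entity name flop_ce_rst, the port list, and the priority choice are all assumptions made for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical D flip-flop with synchronous reset and chip enable.
entity flop_ce_rst is
  port (
    clk, reset, ce, d : in  std_logic;
    q                 : out std_logic
  );
end flop_ce_rst;

architecture main of flop_ce_rst is
begin
  -- A single "if rising_edge" statement, with only clk in the sensitivity list.
  process (clk) begin
    if rising_edge(clk) then
      if reset = '1' then
        q <= '0';        -- synchronous reset wins (assumed priority)
      elsif ce = '1' then
        q <= d;          -- load new data only when enabled
      end if;            -- otherwise q holds its value
    end if;
  end process;
end main;
```

Because reset is tested inside the rising_edge branch, it only takes effect on a clock edge. When neither reset nor ce is asserted, q is not assigned, so the flop keeps its previous value.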
Q: Flop with a Mux on the Input?

[Figure: multiplexer selecting between d0 and d1 under sel, feeding the D input of a flop clocked by clk, with output q]

Q: Flops with a Mux on the Output?

[Figure: two flops clocked by clk, with inputs d0 and d1 and outputs q0 and q1, feeding a multiplexer controlled by sel, with output q]

1.8.3.3 Flop with Chip-Enable and Mux on Input

Hint: Chip Enable

    process (clk) begin
      if rising_edge(clk) then
        if (ce = '1') then
          q <= d;
        end if;
      end if;
    end process;

1.8.3.4 Flops with Chip-Enable, Muxes, and Reset

This section reserved for your reading pleasure

1.8.4 An Example Sequential Circuit

This section reserved for your reading pleasure

1.9 Arrays and Vectors

This section reserved for your reading pleasure

1.10 Arithmetic

VHDL includes all of the common arithmetic and logical operators.
Use the VHDL arithmetic operators and let the synthesis tool choose the best implementation for you. It is almost impossible for a hand-coded implementation to beat vendor-supplied arithmetic libraries.

1.10.1 Arithmetic Packages

Rushton Ch-7 covers arithmetic packages. Rushton Appendix A.5 has the code listing for the numeric_std package.
To do arithmetic with signals, use the numeric_std package. This package defines types signed and unsigned, which are std_logic vectors on which you can do signed or unsigned arithmetic.
numeric_std supersedes earlier arithmetic packages, such as std_logic_arith.
Use only one arithmetic package, otherwise the different definitions will clash and you can get strange error messages.
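As an illustration of arithmetic with numeric_std, the sketch below adds two 4-bit unsigned values and keeps the carry by widening both operands first. The entity name add4, the port names, and the widths are assumptions chosen for the example; resize and the "+" operator on unsigned are provided by numeric_std.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;   -- the only arithmetic package in use

-- Hypothetical 4-bit adder with a 5-bit result.
entity add4 is
  port (
    x, y : in  unsigned(3 downto 0);
    s    : out unsigned(4 downto 0)
  );
end add4;

architecture main of add4 is
begin
  -- The result of "+" is as wide as the wider operand, so both operands
  -- are extended to 5 bits to keep the carry out of the 4-bit addition.
  s <= resize(x, 5) + resize(y, 5);
end main;
```

The choice of adder structure is left to the synthesis tool, in line with the advice above to use the built-in operators rather than hand-coded arithmetic.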
1.10.2 Shift and Rotate Operations

This section reserved for your reading pleasure

1.10.3 Overloading of Arithmetic

This section reserved for your reading pleasure

1.10.4 Different Widths and Arithmetic

This section reserved for your reading pleasure

1.10.5 Overloading of Comparisons

This section reserved for your reading pleasure

Overloading of Comparison Operations (=, /=, >=, >, <)

    src1/2      src2/1     result
    unsigned    integer    OK
    signed      integer    OK
    unsigned    signed     fails in analysis

1.10.6 Different Widths and Comparisons

This section reserved for your reading pleasure

1.10.7 Type Conversion

The functions unsigned, signed, to_integer, to_unsigned and to_signed are used to convert between integers, std_logic vectors, signed vectors and unsigned vectors.
If you convert between two types of the same width, then no additional hardware will be generated.
The listing below summarizes the types of these functions.

    unsigned( val : std_logic_vector ) return unsigned;
    signed( val : std_logic_vector ) return signed;
    to_integer( val : signed ) return integer;
    to_integer( val : unsigned ) return integer;
    to_unsigned( val : integer; width : natural ) return unsigned;
    to_signed( val : integer; width : natural ) return signed;

1.11 Synthesizable vs Non-Synthesizable Code

Synthesis is done by matching VHDL code against templates or patterns.
It's important to use idioms that your synthesis tools recognize.
Think like hardware: when you write VHDL, you should know what hardware you expect to be produced by the synthesizer.
More details in course notes
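The conversion functions listed in Section 1.10.7 can be chained to move a value from std_logic_vector to unsigned to integer and back. The sketch below is illustrative only: the entity name conv_demo and the signal names are assumptions, and the final std_logic_vector(...) type conversion is standard VHDL but is not part of the listing above.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical pass-through design that exercises the conversions.
entity conv_demo is
  port (
    v : in  std_logic_vector(3 downto 0);
    w : out std_logic_vector(3 downto 0)
  );
end conv_demo;

architecture main of conv_demo is
  signal u : unsigned(3 downto 0);
  signal n : integer range 0 to 15;
begin
  u <= unsigned(v);                          -- std_logic_vector -> unsigned
  n <= to_integer(u);                        -- unsigned -> integer
  w <= std_logic_vector(to_unsigned(n, 4));  -- integer -> unsigned -> std_logic_vector
end main;
```

All three signals are four bits wide (the integer is constrained to 0 to 15), so, following the remark in Section 1.10.7 about same-width conversions, this design would be expected to synthesize to plain wires.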
1.11.1 Unsynthesizable Code

1.11.1.1 Initial Values

Initial values on signals (UNSYNTHESIZABLE)

    signal bad_signal : std_logic := '0';

Reason: At powerup, the values on signals are random (except for some FPGAs).

1.11.1.2 Wait For

Wait for length of time (UNSYNTHESIZABLE)

    wait for 10 ns;

Reason: Delays through circuits are dependent upon both the circuit and its operating environment, particularly supply voltage and temperature. For example, imagine trying to build an AND gate that will have exactly a 2ns delay in all environments.

1.11.1.3 Different Wait Conditions

wait statements with different conditions in a process (UNSYNTHESIZABLE)

    -- different clock edges
    process begin
      wait until rising_edge(clk);
      x <= a;
      wait until falling_edge(clk);
      x <= a;
    end process;

Reason: Would require flip-flop to be sensitive to different clock edges at different times.

    -- different clock signals
    process begin
      wait until rising_edge(clk1);
      x <= a;
      wait until rising_edge(clk2);
      x <= a;
    end process;

Reason: Would require the flip-flops to use different clock signals at different times.

1.11.1.4 Multiple if rising_edges in Same Process

Multiple if rising_edge statements in a process (UNSYNTHESIZABLE)

    process (clk) begin
      if rising_edge(clk) then
        q0 <= d0;
      end if;
      if rising_edge(clk) then
        q1 <= d1;
      end if;
    end process;

Reason: The idioms for synthesis tools generally expect just a single if rising_edge statement in each process. The simpler the VHDL code is, the easier it is to synthesize hardware. Programmers of synthesis tools make idiomatic (idiotic?) restrictions to make their jobs simpler.
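A likely synthesizable rewrite of the example in Section 1.11.1.4 is to place both assignments under one if rising_edge statement. The sketch below assumes a hypothetical entity two_flops with std_logic ports; whether a particular tool accepts a given idiom should still be checked against its documentation.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity holding two independent D flip-flops.
entity two_flops is
  port (
    clk, d0, d1 : in  std_logic;
    q0, q1      : out std_logic
  );
end two_flops;

architecture main of two_flops is
begin
  -- One process, one "if rising_edge" statement, two target signals.
  process (clk) begin
    if rising_edge(clk) then
      q0 <= d0;
      q1 <= d1;
    end if;
  end process;
end main;
```

The simulation behaviour is the same as the two-if version, and the expected hardware is one flop per target signal. Splitting the code into two separate processes, each with its own if rising_edge statement, would be another option.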
1.11.1.5 if rising_edge and wait in Same Process

An if rising_edge statement and a wait statement in the same process (UNSYNTHESIZABLE)

    process (clk) begin
      if rising_edge(clk) then
        q0 <= d0;
      end if;
      wait until rising_edge(clk);
      q0 <= d1;
    end process;

Reason: The idioms for synthesis tools generally expect just a single type of flop-generating statement in each process.

1.11.1.6 if rising_edge with else Clause

The if statement has a rising_edge condition and an else clause (UNSYNTHESIZABLE).

    process (clk) begin
      if rising_edge(clk) then
        q0 <= d0;
      else
        q0 <= d1;
      end if;
    end process;

Reason: Generally, an if-then-else statement synthesizes to a multiplexer.

1.11.1.7 if rising_edge Inside a for Loop

An if rising_edge statement in a for-loop (UNSYNTHESIZABLE-Synopsys)

    process (clk) begin
      for i in 0 to 7 loop
        if rising_edge(clk) then
          q(i) <= d;
        end if;
      end loop;
    end process;

Reason: just an idiom of the synthesis tool.

Some loop statements are synthesizable (Rushton Section 8.7). For-loops in general are described in Ashenden.

Synthesizable Alternative

A synthesizable alternative to an if rising_edge statement in a for-loop is to put the if rising_edge outside of the for loop.

    process (clk) begin
      if rising_edge(clk) then
        for i in 0 to 7 loop
          q(i) <= d;
        end loop;
      end if;
    end process;

1.11.1.8 wait Inside of a for Loop

wait statements in a for loop (UNSYNTHESIZABLE)

    process begin
      for i in 0 to 7 loop
        wait until rising_edge(clk);
        x <= to_unsigned(i,4);
      end loop;
    end process;

Reason: Unknown. Clocked for-loops are generally unsynthesizable, but while-loops with the same behaviour are synthesizable.

Synthesizable Alternative to Wait-Inside-For

while loop (synthesizable)

This is the synthesizable alternative to the wait statement in a for loop above.
    process begin
      -- output values from 0 to 4 on i
      -- sending one value out each clock cycle
      i <= to_unsigned(0,4);
      wait until rising_edge(clk);
      while (4 > i) loop
        i <= i + 1;
        wait until rising_edge(clk);
      end loop;
    end process;

Note: Combinational for-loops
Combinational for-loops are usually synthesizable. They are often used to build a combinational circuit for each element of an array.

Note: Clocked for-loops
Clocked for-loops are not synthesizable, but are very useful in simulation, particularly to generate test vectors for test benches.

1.12 Synthesizable VHDL Coding Guidelines
This section reserved for your reading pleasure

Chapter 2
RTL Design with VHDL: From Requirements to Optimized Code

2.1 Prelude to Chapter
This section reserved for your reading pleasure

2.2 FPGA Background and Coding Guidelines
This section reserved for your reading pleasure

2.3 Design Flow
This section reserved for your reading pleasure

2.4 Algorithms and High-Level Models
This section reserved for your reading pleasure

2.5 Finite State Machines in VHDL

2.5.1 Introduction to State-Machine Design

2.5.1.1 Mealy vs Moore State Machines

Moore Machines
    Outputs are dependent upon only the state
    No combinational paths from inputs to outputs
    [Figure: Moore state diagram with states s0/0, s1/1, s2/0, s3/0 and transitions labelled a and !a]

Mealy Machines
    Outputs are dependent upon both the state and the inputs
    Combinational paths from inputs to outputs
    [Figure: Mealy state diagram with states s0, s1, s2, s3 and transitions labelled a/1, !a/0, /0, /0]

2.5.1.2 Introduction to State Machines and VHDL

A state machine is generally written as a single clocked process, or as a pair of processes, where one is clocked and one is combinational.

VHDL Constructs for State Machines
The following VHDL control constructs are useful to steer the transition from state to state:
    loop
    if ... then ... else
    case
    next
    for ... loop
    exit
    while ... loop

Design Decisions
    Moore vs Mealy (Sections 2.5.2 and 2.5.3)
    Implicit vs Explicit (Section 2.5.1.3)
    State values in explicit state machines: Enumerated type vs constants (Section 2.5.5)
    State values for constants: encoding scheme (binary, gray, one-hot, ...) (Section 2.5.5)

2.5.1.3 Explicit vs Implicit State Machines

There are two styles of writing state machines in VHDL: explicit and implicit.

Explicit
    State signal appears explicitly in VHDL code
    At most one wait statement per process
    Two sub-categories of explicit state machines
        Explicit-Current
            State signal represents current state
            Next-state computation done in a clocked process
        Explicit-Current+Next
            Two state signals: current state and next state
            Next-state computation done in a combinational process
            Current-state <= next-state is registered assignment

Implicit
    Use multiple wait statements in a process to describe state machine implicitly

Implicit State Machines
For the implicit style of writing state machines, the synthesis program adds an implicit register to hold the state signal and combinational circuitry to update the state signal. In Synopsys synthesis tools, the state signal defined by the synthesizer is named multiple_wait_state_reg. In Mentor Graphics, the state signal is named STATE_VAR.

We can think of implicit state machines as having 0 state signals, explicit-current state machines as having 1 state signal, and explicit-current+next state machines as having 2 state signals.

State Machine Tradeoffs

Explicit-Current+Next
    Most detailed, closest to hardware
    Greatest opportunity for manual optimization
    Most labour-intensive
    Susceptible to small, subtle, hard-to-find bugs

Explicit-Current
    Almost as much opportunity for manual optimization as Explicit-Current+Next
    Easier to write than Explicit-Current+Next
    Less susceptible to subtle bugs

Implicit
    Taught infrequently
    Least detailed, furthest from actual hardware
    Rely on synthesis for optimization
    Usually least labour to write, shortest code
    Easiest to write correctly (But must understand VHDL synthesis!)

Limitation of Implicit State Machines
Because implicit state machines are written with loops, if-then-elses, cases, etc., it is difficult to write some state machines with complicated control flows in an implicit style. The following example illustrates the point.

    [Figure: Moore state diagram with states s0/0, s1/1, s2/0, s3/0, two transitions labelled a and two labelled !a]

Terminology
Note: The terminology of explicit and implicit is somewhat standard, in that some descriptions of processes with multiple wait statements describe the processes as having implicit state machines. There is no standard terminology to distinguish between the two explicit styles: explicit-current+next and explicit-current.

2.5.2 Implementing a Simple Moore Machine

    [Figure: Moore state diagram with states s0/0, s1/1, s2/0, s3/0 and transitions labelled a and !a out of s0]

    entity simple is
      port (
        a, clk : in std_logic;
        z : out std_logic
      );
    end simple;

2.5.2.1 Implicit Moore State Machine

    architecture moore_implicit_v1a of simple is
    begin
      process begin
        z <= '0';
        wait until rising_edge(clk);
        if (a = '1') then
          z <= '1';
        else
          z <= '0';
        end if;
        wait until rising_edge(clk);
        z <= '0';
        wait until rising_edge(clk);
      end process;
    end moore_implicit_v1a;

    [Table: Flops, Gates, Delay for the implicit Moore state machine; values not given]
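A minimal testbench sketch for the simple entity, using a clocked for-loop of the kind described earlier as unsynthesizable but useful for generating test vectors. The testbench name simple_tb, the stimulus pattern, and the 10 ns clock period are all hypothetical, and the sketch assumes that entity simple and architecture moore_implicit_v1a have already been compiled into library work.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench: simulation only (wait for, clocked for-loop).
entity simple_tb is
end simple_tb;

architecture main of simple_tb is
  signal clk  : std_logic := '0';
  signal a, z : std_logic;
  signal done : boolean := false;
begin
  -- Assumes entity simple / architecture moore_implicit_v1a in library work.
  uut : entity work.simple(moore_implicit_v1a)
    port map (a => a, clk => clk, z => z);

  -- Free-running clock, stopped once the stimulus is finished.
  clk_gen : process begin
    while not done loop
      clk <= '0'; wait for 5 ns;
      clk <= '1'; wait for 5 ns;
    end loop;
    wait;
  end process;

  -- Clocked for-loop: one value of a is applied per clock cycle.
  stim : process
    constant a_vals : std_logic_vector(0 to 5) := "101100";
  begin
    for i in a_vals'range loop
      a <= a_vals(i);
      wait until rising_edge(clk);
    end loop;
    done <= true;
    wait;
  end process;
end main;
```

The same testbench can be pointed at any of the explicit architectures by changing the architecture name in the instantiation, which makes it easy to compare their waveforms for z.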
2.5.2.2 Explicit Moore with Flopped Output

    architecture moore_explicit_v1 of simple is
      type state_ty is (s0, s1, s2, s3);
      signal state : state_ty;
    begin
      process (clk) begin
        if rising_edge(clk) then
          case state is
            when s0 =>
              if (a = '1') then
                state <= s1;
                z <= '1';
              else
                state <= s2;
                z <= '0';
              end if;
            when s1 | s2 =>
              state <= s3;
              z <= '0';
            when s3 =>
              state <= s0;
              z <= '0';  -- output of s0 is 0 in the state diagram
          end case;
        end if;
      end process;
    end moore_explicit_v1;

    [Table: Flops, Gates, Delay for the explicit Moore machine with flopped outputs; values not given]

2.5.2.3 Explicit Moore with Combinational Outputs

    architecture moore_explicit_v2 of simple is
      type state_ty is (s0, s1, s2, s3);
      signal state : state_ty;
    begin
      process (clk) begin
        if rising_edge(clk) then
          case state is
            when s0 =>
              if (a = '1') then
                state <= s1;
              else
                state <= s2;
              end if;
            when s1 | s2 =>
              state <= s3;
            when s3 =>
              state <= s0;
          end case;
        end if;
      end process;
      z <= '1' when (state = s1)
      else '0';
    end moore_explicit_v2;

    [Table: Flops, Gates, Delay for the explicit Moore machine with combinational outputs; values not given]

2.5.2.4 Explicit-Current+Next Moore with Concurrent Assignment

    architecture moore_explicit_v3 of simple is
      type state_ty is (s0, s1, s2, s3);
      signal state, state_nxt : state_ty;
    begin
      process (clk) begin
        if rising_edge(clk) then
          state <= state_nxt;
        end if;
      end process;
      state_nxt <= s1 when (state = s0) and (a = '1')
              else s2 when (state = s0) and (a = '0')
              else s3 when (state = s1) or (state = s2)
              else s0;
      z <= '1' when (state = s1)
      else '0';
    end moore_explicit_v3;

    [Table: Flops, Gates, Delay for the explicit-current+next Moore machine with concurrent assignment; values not given]

The hardware synthesized from this architecture is the same as that synthesized from moore_explicit_v2, which is written in the explicit-current style.
2.5.2.5 E-C+N Moore with Comb Proc

Explicit-Current+Next Moore with Combinational Process

Change the concurrent assignment to state_nxt into a combinational process using a case statement.

    architecture moore_explicit_v4 of simple is
      type state_ty is (s0, s1, s2, s3);
      signal state, state_nxt : state_ty;
    begin
      process (clk) begin
        if rising_edge(clk) then
          state <= state_nxt;
        end if;
      end process;
      process (state, a) begin
        case state is
          when s0 =>
            if (a = '1') then
              state_nxt <= s1;
            else
              state_nxt <= s2;
            end if;
          when s1 | s2 =>
            state_nxt <= s3;
          when s3 =>
            state_nxt <= s0;
        end case;
      end process;
      z <= '1' when (state = s1)
      else '0';
    end moore_explicit_v4;

    [Table: Flops, Gates, Delay; values not given]

Same hardware as moore_explicit_v2 and v3.

2.5.3 Implementing a Simple Mealy Machine

Mealy machines have a combinational path from inputs to outputs, which often violates good coding guidelines for hardware. Thus, Moore machines are much more common. You should know how to write a Mealy machine if needed, but most of the state machines that you design will be Moore machines.

This section reserved for your reading pleasure

2.5.4 Reset

All circuits should have a reset signal that puts the circuit back into a good initial state. However, not all flip-flops within the circuit need to be reset. In a circuit that has a datapath and a state machine, the state machine will probably need to be reset, but the datapath may not need to be reset.

There are standard ways to add a reset signal to both explicit and implicit state machines.

It is important that reset is tested on every clock cycle, otherwise a reset might not be noticed, or your circuit will be slow to react to reset and could generate illegal outputs after reset is asserted.
Reset with Implicit State Machine
    Insert a loop
    Test for reset after each wait

Example from section 2.5.2.1:

    architecture moore_implicit of simple is
    begin
      process begin
        init : loop                        -- outermost loop
          z <= '0';
          wait until rising_edge(clk);
          next init when (reset = '1');    -- test for reset
          if (a = '1') then
            z <= '1';
          else
            z <= '0';
          end if;
          wait until rising_edge(clk);
          next init when (reset = '1');    -- test for reset
          z <= '0';
          wait until rising_edge(clk);
          next init when (reset = '1');    -- test for reset
        end loop init;
      end process;
    end moore_implicit;

Reset with Explicit State Machine

Reset is often easier to include in an explicit state machine, because we need only put a test for reset = '1' in the clocked process for the state.

The pattern for an explicit-current style of machine is:

    process (clk) begin
      if rising_edge(clk) then
        if reset = '1' then
          state <= S0;
        else
          if ... then
            state <= ...;
          elsif ... then
            ... -- more tests and assignments to state
          end if;
        end if;
      end if;
    end process;

Applying this pattern to the explicit Moore machine from section 2.5.2.3 produces:

    architecture moore_explicit_v2 of simple is
      type state_ty is (s0, s1, s2, s3);
      signal state : state_ty;
    begin
      process (clk) begin
        if rising_edge(clk) then
          if (reset = '1') then
            state <= s0;
          else
            case state is
              ...
            end case;
          end if;
        end if;
      end process;
      z <= '1' when (state = s1)
      else '0';
    end moore_explicit_v2;

Reset with Explicit-Current+Next

The pattern for an explicit-current+next style is:

    process (clk) begin
      if rising_edge(clk) then
        if reset = '1' then
          state_cur <= <reset state>;
        else
          state_cur <= state_nxt;
        end if;
      end if;
    end process;

2.5.5 State Encoding
This section reserved for your reading pleasure

2.6 Dataflow Diagrams

2.6.1 Dataflow Diagrams Overview

Dataflow diagrams are data-dependency graphs where the computation is divided into clock cycles.

Purpose:
    Provide a disciplined approach for designing datapath-centric circuits
    Guide the design from algorithm, through high-level models, and finally to register transfer level code for the datapath and control circuitry.
    Estimate area and performance
    Make tradeoffs between different design options

Background
    Based on techniques from high-level synthesis tools
    Some similarity between high-level synthesis and software compilation
    Each dataflow diagram corresponds to a basic block in software compiler terminology.

Data-Dependency Graph
    [Figure: Data-dependency graph for z = a + b + c + d + e + f, a chain of adders with inputs a to f, intermediate values x1 to x4, and output z]

Dataflow Diagrams
    [Figure: Dataflow diagram for z = a + b + c + d + e + f]

Clock Cycle Boundaries
    [Figure: the dataflow diagram with horizontal lines marking clock cycle boundaries]

Latency
    [Figure: the dataflow diagram with clock cycles numbered 1 to 6; Latency = 6 clock cycles]

Latency
    [Figure: the dataflow diagram with clock cycles numbered 1 to 4; Latency = 4 clock cycles]
    Question: Why would a good hardware engineer find this design dissatisfying?

Flip Flops
    [Figure: the dataflow diagram annotated: signals crossing clock boundaries are flip-flops]
Registered Inputs and Outputs
    [Figure: dataflow diagram for z = a + b + c + d + e + f with flops on both inputs and outputs]

Registered Inputs, Combinational Outputs
    [Figure: the same dataflow diagram with flops on inputs, but not outputs (Latency = 5)]

Datapath Components
    [Figure: the dataflow diagram annotated: blocks in clock cycles are datapath components]

Inputs
    [Figure: the dataflow diagram annotated: unconnected signal tails are inputs]

Outputs
    [Figure: the dataflow diagram annotated: unconnected signal heads are outputs]

Summary
    Unconnected signal tails are inputs
    Horizontal lines mark clock cycle boundaries
    Signals crossing clock boundaries are flip-flops
    Blocks in clock cycles are datapath components
    Unconnected signal heads are outputs

2.6.2 Dataflow Diagrams, Hardware, and Behaviour

Primary Input
    [Figure: dataflow diagram, hardware, and behaviour (waveform of clk, i, x) for a primary input i]

Register Input
    [Figure: dataflow diagram, hardware, and behaviour (waveform of clk, i, x) for a registered input i]
Register Signal
    [Figure: dataflow diagram, hardware, and behaviour (waveform of clk, i1, i2, x) for a registered signal x driven by an adder with inputs i1 and i2]

Combinational-Component Output
    [Figure: dataflow diagram, hardware, and behaviour (waveform of clk, i1, i2, x) for a combinational output x driven by an adder with inputs i1 and i2]

Read of Memory with Registered Inputs
    [Figure: dataflow diagram with mem(rd), hardware (memory M with ports WE, A, DI, DO), and behaviour (waveform of clk, we, a, M(a), do)]

Write to Memory with Registered Inputs
    [Figure: dataflow diagram with mem(wr), hardware (memory M with ports WE, A, DI, DO), and behaviour (waveform of clk, we, a, di, M(a), do)]

Dual-Port Memory with Registered Inputs
    [Figure: dataflow diagram with mem(wr) and mem(rd), hardware (memory M with ports WE, A0, DI0, DO0, A1, DO1), and behaviour (waveform of clk, we, a0, di0, a1, M(a), do0, do1)]

Sequence of Memory Operations
    [Figure: dataflow diagram with one mem(wr) followed by several mem(rd) operations on the dual-port memory, with the corresponding waveform]

2.6.3 Dataflow Diagram Execution

Execution with Registers on Both Inputs and Outputs
    [Figure sequence: the dataflow diagram for inputs a to f and output z executed one clock cycle at a time, beside a waveform of clk, a, x1 to x5, and z over cycles 0 to 6]
Execution with Registers on Both Inputs and Outputs
    [Figure sequence, final steps: the dataflow diagram executed through clock cycles 4, 5, and 6, beside a waveform of clk, a, x1 to x5, and z]

Execution Without Output Registers
    [Figure: the same dataflow diagram without output registers, executed through clock cycles 0 to 5, beside a waveform of clk, a, x1 to x5, and z]

2.6.4 Performance Estimation

Performance Equations

    $\text{Performance} \propto \dfrac{1}{\text{TimeExec}}$

    $\text{TimeExec} = \text{Latency} \times \text{ClockPeriod}$

Latency = Number of clock cycles from inputs to outputs

Performance of Dataflow Diagrams
    Latency: count horizontal lines in diagram
    Min clock period (Max clock speed) limited by longest path in a clock cycle

2.6.5 Area Estimation

    Maximum number of blocks in a clock cycle is total number of that component that are needed
    Maximum number of signals that cross a cycle boundary is total number of registers that are needed
    Maximum number of unconnected signal tails in a clock cycle is total number of inputs that are needed
    Maximum number of unconnected signal heads in a clock cycle is total number of outputs that are needed

These estimates give lower bounds. Other constraints might force you to use more components.
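The estimation rules in Sections 2.6.4 and 2.6.5 can be applied to a small hypothetical comparison. Assume two schedules of the same chain of additions: schedule 1 places one add in each clock cycle and has a latency of 6, while schedule 2 places two adds in each clock cycle and has a latency of 4. The delays $t_{\text{flop}}$ and $t_{\text{add}}$ are assumed symbols, not values from the notes, and the clock period is taken to be one flop delay plus the adds on the longest path in a cycle.

```latex
\begin{align*}
\text{TimeExec}_1 &= 6\,(t_{\text{flop}} + t_{\text{add}}) \\
\text{TimeExec}_2 &= 4\,(t_{\text{flop}} + 2\,t_{\text{add}}) \\[4pt]
\text{TimeExec}_2 < \text{TimeExec}_1
  &\iff 4\,t_{\text{flop}} + 8\,t_{\text{add}} < 6\,t_{\text{flop}} + 6\,t_{\text{add}} \\
  &\iff t_{\text{add}} < t_{\text{flop}}
\end{align*}
```

With illustrative values $t_{\text{flop}} = 1$ ns and $t_{\text{add}} = 3$ ns, schedule 1 takes $6 \times 4 = 24$ ns and schedule 2 takes $4 \times 7 = 28$ ns, so the shorter latency does not pay off in this case. On the area side, the lower-bound rule gives at least one adder for schedule 1 and at least two adders for schedule 2, since those are the maximum numbers of adds in any one clock cycle.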
    Implementation-technology factors, such as the relative size of registers, multiplexers, and datapath components, might force you to make tradeoffs that increase the number of datapath components to decrease the overall area of the circuit.
    With some FPGA chips, a 2:1 multiplexer has the same area as an adder.
    With some FPGA chips, a 2:1 multiplexer can be combined with an adder into one FPGA cell per bit.
    In FPGAs, registers are usually free, in that the area consumed by a circuit is limited by the amount of combinational logic, not the number of flip-flops.

2.6.6 Design Analysis

    [Figure: dataflow diagram for inputs a to f and output z, to be analysed for: num inputs, num outputs, num registers, num adders, min clock period, latency]

Design Analysis (Contd)

    [Figure: a second dataflow diagram with intermediate values x1 to x5, to be analysed for the same six quantities]

2.6.7 Area / Performance Tradeoffs

    [Figure: dataflow diagram with one add per clock cycle, cycles 0 to 6, with waveform]

Two Adds per Clock Cycle

    [Figure: dataflow diagram with two adds per clock cycle, cycles 0 to 4, with waveform]

Note: In the two-add design, half of the last clock cycle is wasted.

Design Comparison

                    One add per clock cycle    Two adds per clock cycle
    inputs          6                          6
    outputs         1                          1
    registers       6                          6
    adders          1                          2
    clock period    flop + 1 add               flop + 2 add
    latency         6                          4

Question: Under what circumstances would each design option be fastest?

2.7 Memory Arrays and RTL Design

2.7.1 Memory Arrays in VHDL
This section reserved for your reading pleasure

2.7.2 Data Dependencies

Definition of Three Types of Dependencies

    Read after Write      Write after Write     Write after Read
    (True dependency)     (Load dependency)     (Anti dependency)
    M[i] := ...           M[i] := ...           ...  := M[i]
    ...  := M[i]          M[i] := ...           M[i] := ...

Instructions in a program can be reordered, so long as the data dependencies are preserved.

Purpose of Dependencies

    W0    R3 := ......
    W1    R3 := ......            (producer)
    R1    ... := ... R3 ...       (consumer)
    W2    R3 := ......

    WAW ordering prevents W0 from happening after W1
    RAW ordering prevents R1 from happening before W1
    WAR ordering prevents W2 from happening before R1

Each of the three types of memory dependencies (RAW, WAW, and WAR) serves a specific purpose in ensuring that producer-consumer relationships are preserved.

Ordering of Memory Operations

    Initial memory: M[3] = 30, M[2] = 20, M[1] = 10, M[0] = 0

    Initial Program
      M[2] := 21
      M[3] := 31
      A    := M[2]
      B    := M[0]
      M[3] := 32
      M[0] := 01
      C    := M[3]

    [Figure: the initial program beside a reordered version labelled Valid Modification]

Data Dependencies (Contd)

    [Figure: the initial program beside a second reordered version labelled Valid (or Bad?) Modification]

2.7.3 Memory Arrays and Dataflow Diagrams

Legend for Dataflow Diagrams
    Input port
    Output port
    State signal
    Array read: name(rd)
    Array write: name(wr)
Basic Memory Operations

    Memory Read:     data := mem[addr];
    Memory Write:    mem[addr] := data;

    [Figure: dataflow diagrams for mem(rd) (inputs mem and addr, output data) and mem(wr) (inputs mem, data, and addr, output mem), with an arrow labelled (anti-dependency)]

Dataflow diagrams show the dependencies between operations. The basic memory operations are similar, in that each arrow represents a data dependency.

Dataflow Diagrams and Data Dependencies

Read after Write Dependencies

    Algo: mem[wr_addr] := data_in;
          data_out := mem[rd_addr];

    [Figure: Read after Write: mem(wr) feeding mem(rd)]

Read after Write Optimization

    Algo: mem[wr_addr] := data_in;
          data_out := mem[rd_addr];

    [Figure: mem(rd) and mem(wr) both reading the original mem]
    Optimization when rd_addr ≠ wr_addr

Write after Write Dependencies

    Algo: mem[wr1_addr] := data1;
          mem[wr2_addr] := data2;

    [Figure: Write after Write: first mem(wr) feeding second mem(wr)]

Write after Write Scheduling Option

    Algo: mem[wr1_addr] := data1;
          mem[wr2_addr] := data2;

    [Figure: the two mem(wr) operations in the opposite order]
    Scheduling option when wr1_addr ≠ wr2_addr

Write after Read Dependencies

    Algo: rd_data := mem[rd_addr];
          mem[wr_addr] := wr_data;

    [Figure: Write after Read: mem(rd) followed by mem(wr)]

Write after Read Optimization

    Algo: rd_data := mem[rd_addr];
          mem[wr_addr] := wr_data;

    [Figure: mem(rd) and mem(wr) without an ordering between them]
    Optimization when rd_addr ≠ wr_addr

2.7.4 Ex: Mem Array and Dataflow Diagram
    1    M[2] := 21
    2    M[3] := 31
    3    A    := M[2]
    4    B    := M[0]
    5    M[3] := 32
    6    M[0] := 01
    7    C    := M[3]

    [Figure: dataflow diagram of the seven operations above as a chain of M(wr) and M(rd) operations on memory M]

Dependencies for Known Addresses
    [Figure: the dataflow diagram with dependencies drawn only between operations on the same address]

Anti-Dependencies for Known Addresses
    [Figure: the dataflow diagram with anti-dependencies added]

Minimal Dependencies
    [Figure: Memory array with minimal dependencies]

Memory Array with Orderings
    [Figure: Memory array with orderings]

Place Operations in Clock Cycles
    [Figure: the operations placed into clock cycles 1 to 4]

Final Dataflow Diagram
    [Figure: Final version of DFD]

2.8 Input / Output Protocols
This section reserved for your reading pleasure

2.9 Design Example: Massey
This section reserved for your reading pleasure

2.10 Design Example: Vanier

We'll go through the following artifacts:
    1. requirements
    2. algorithm
    3. dataflow diagram
    4. high-level models
    5. hardware block diagram
    6. RTL code for datapath
    7. state machine
    8. RTL code for control

Design Process
    1. Scheduling (allocate operations to clock cycles)
    2. I/O allocation
    3. First high-level model
    4. Register allocation
    5. Datapath allocation
    6. Connect datapath components, insert muxes where needed
    7. Design implicit state machine
    8. Optimize
    9. Design explicit-current state machine
    10. Optimize

2.10.1 Requirements

Functional requirements: compute the following formula:

    output = (a × d) + c + (d × b) + b

Performance requirement:
    Max clock period: flop plus (2 adds or 1 multiply)
    Max latency: 4

Cost requirements
    Maximum of two adders
    Maximum of two multipliers
    Unlimited registers
    Maximum of three inputs and one output
    Maximum of 5000 student-minutes of design effort

Registered inputs and outputs

2.10.2 Algorithm

    output = (a × d) + c + (d × b) + b

Create a data-dependency graph for the algorithm.

    [Figure: data-dependency graph with inputs a, d, b, c, two multiplies, three adds, and output z]

2.10.3 Initial Dataflow Diagram

Schedule operations into clock cycles.

    [Figure: initial dataflow diagram with inputs a, d, b, c and output z]

2.10.4 Reschedule to Meet Requirements

    [Figure: rescheduled dataflow diagram]

Fix Clock Period Violation

    [Figure: dataflow diagram before and after fixing the clock period violation]

2.10.5 Optimize Resources

    [Figure: dataflow diagram after resource optimization]

Analysis

    [Figure: dataflow diagram under analysis]

Question: Should we move the second addition from the third clock cycle to the second?

Define Entity

Having finalized our input/output scheduling, we can write our entity. Note: we will add a reset signal later, when we design the state machine to control the datapath.

    entity vanier is
      port (
        clk : in std_logic;
        i_1, i_2 : in std_logic_vector(15 downto 0);
        o_1 : out std_logic_vector(15 downto 0)
      );
    end vanier;

2.10.6 Assign Names to Registered Values

    [Figure: dataflow diagram with registered values named x1 to x8]

Question: Why do we not need to assign names to combinational signals?

Question: Why do we not need to assign a new name to x1, x2, and x4 the second time they cross a clock cycle boundary?

2.10.7 Input/Output Allocation

    [Figure: dataflow diagram with inputs allocated to ports]

VHDL Code!
    architecture hlm_v1 of vanier is
      signal x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8 : unsigned(15 downto 0);
    begin
      process begin
        wait until rising_edge(clk);
        x_1 <= unsigned(i_1);
        x_2 <= unsigned(i_2);
        wait until rising_edge(clk);
        x_3 <= unsigned(i_1);
        x_4 <= x_1(7 downto 0) * x_2(7 downto 0);
        x_5 <= unsigned(i_2);
        wait until rising_edge(clk);
        x_6 <= x_3(7 downto 0) * x_1(7 downto 0);
        x_7 <= x_2 + x_5;
        wait until rising_edge(clk);
        x_8 <= x_6 + (x_4 + x_7);
      end process;
      o_1 <= std_logic_vector(x_8);
    end hlm_v1;

2.10.8 Tangent: Combinational Outputs

    architecture hlm_v1c of vanier is
      signal x_1, x_2, x_3, x_4, x_5, x_6, x_7 : unsigned(15 downto 0);
    begin
      process begin
        wait until rising_edge(clk);
        x_1 <= unsigned(i_1);
        x_2 <= unsigned(i_2);
        wait until rising_edge(clk);
        x_3 <= unsigned(i_1);
        x_4 <= x_1(7 downto 0) * x_2(7 downto 0);
        x_5 <= unsigned(i_2);
        wait until rising_edge(clk);
        x_6 <= x_3(7 downto 0) * x_1(7 downto 0);
        x_7 <= x_2 + x_5;
      end process;
      o_1 <= std_logic_vector(x_6 + (x_4 + x_7));
    end hlm_v1c;

    [Figure: dataflow diagram with combinational output: i1 carries d then a, i2 carries b then c; registered values x1 to x7; output z on o1]

2.10.9 Register Allocation

    [Figure: dataflow diagram with i1 carrying d then a, i2 carrying b then c, and registered values x1 to x8]

    [Figure: the same diagram with registers allocated: x1 in r1, x2 in r2, x3 in r3, x4 in r4, x5 in r5, x6 in r2, x7 in r5, x8 in r5]

New VHDL Code!

    architecture hlm_v2 of vanier is
      signal r_1, r_2, r_3, r_4, r_5 : unsigned(15 downto 0);
    begin
      process begin
        wait until rising_edge(clk);
        r_1 <= unsigned(i_1);
        r_2 <= unsigned(i_2);
        wait until rising_edge(clk);
        r_3 <= unsigned(i_1);
        r_4 <= r_1(7 downto 0) * r_2(7 downto 0);
        r_5 <= unsigned(i_2);
        wait until rising_edge(clk);
        r_2 <= r_3(7 downto 0) * r_1(7 downto 0);
        r_5 <= r_2 + r_5;
        wait until rising_edge(clk);
        r_5 <= r_2 + (r_4 + r_5);
      end process;
      o_1 <= std_logic_vector(r_5);
    end hlm_v2;
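A hedged testbench sketch for checking a high-level model against the formula output = (a × d) + c + (d × b) + b. It assumes that entity vanier and architecture hlm_v2 are compiled into library work (with the ieee.std_logic_1164 and ieee.numeric_std context clauses they need), and that inputs arrive in the order used by the model: d and b in the first cycle, then a and c. The testbench name vanier_tb, the test values, and the 10 ns clock period are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical testbench: simulation only.
entity vanier_tb is
end vanier_tb;

architecture main of vanier_tb is
  -- Test values must fit in 8 bits, since the model multiplies
  -- only the low 8 bits of its operands.
  constant a : natural := 3;
  constant b : natural := 5;
  constant c : natural := 6;
  constant d : natural := 4;
  signal clk      : std_logic := '0';
  signal i_1, i_2 : std_logic_vector(15 downto 0);
  signal o_1      : std_logic_vector(15 downto 0);
  signal done     : boolean := false;
begin
  -- Assumes entity vanier / architecture hlm_v2 in library work.
  uut : entity work.vanier(hlm_v2)
    port map (clk => clk, i_1 => i_1, i_2 => i_2, o_1 => o_1);

  -- Free-running clock, stopped once the check is finished.
  clk_gen : process begin
    while not done loop
      clk <= '0'; wait for 5 ns;
      clk <= '1'; wait for 5 ns;
    end loop;
    wait;
  end process;

  stim : process begin
    -- First clock edge samples d on i_1 and b on i_2.
    i_1 <= std_logic_vector(to_unsigned(d, 16));
    i_2 <= std_logic_vector(to_unsigned(b, 16));
    wait until rising_edge(clk);
    -- Second clock edge samples a on i_1 and c on i_2.
    i_1 <= std_logic_vector(to_unsigned(a, 16));
    i_2 <= std_logic_vector(to_unsigned(c, 16));
    -- Three more edges: the result is registered on the fourth edge.
    for k in 1 to 3 loop
      wait until rising_edge(clk);
    end loop;
    wait for 1 ns;  -- let the registered output settle
    assert unsigned(o_1) = to_unsigned(a*d + c + d*b + b, 16)
      report "output does not match (a*d) + c + (d*b) + b"
      severity error;
    done <= true;
    wait;
  end process;
end main;
```

Because the model's process has no reset, the testbench relies on both processes starting at time zero so that the first rising edge lines up with the first wait statement in the model.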
2.10.10 Datapath Allocation

    [Figure: dataflow diagram with registers r1 to r5 and datapath components allocated to the operations]

2.10.11 Hardware Block Diagram and State Machine

    1. Calculate number of states that are needed
    2. Control signals for registers
        Chip enable
        Mux select on input
    3. Control signals for datapath components
        Instruction (e.g. add/sub for ALU)
        Mux select on inputs

For our example: Use four states: S0..S3, one for each clock cycle.

2.10.11.1 Control for Registers

    [Figure: dataflow diagram divided into states S0 to S3, with registers r1 to r5, multiplier m1, and adders a1 and a2]

Build a table with one row per state, one column per register.

Chip enable: a register holds a value for multiple clock cycles.
Mux: a register loads values from multiple sources.

          r1         r2         r3         r4         r5
          ce  d      ce  d      ce  d      ce  d      ce  d
    S0    1   i1     1   i2
    S1    0          0          1   i1     1   m1     1   i2
    S2               1   m1                0          1   a1
    S3                                                1   a1

Optimize chip enables and muxes

Optimized Chip Enables and Muxes

          r1=i1    r2         r3=i1    r4=m1    r5
          ce       ce  d               ce       d
    S0    1        1   i2
    S1    0        0                   1        i2
    S2             1   m1              0        a1
    S3                                          a1

2.10.11.2 Control for Datapath Components

Table for datapath components.
    One row per state.
    One column per datapath component.
    Sub-columns for sources and instructions (e.g. add/sub for ALU).

          a1             a2             m1
          src1  src2     src1  src2     src1  src2
    S0
    S1                                  r1    r2
    S2    r2    r5                      r3    r1
    S3    r2    a2       r4    r5

Optimize Datapath Control Table

          a1             a2             m1
          src1  src2     src1  src2     src1  src2
    S0
    S1                                  r1    r2
    S2    r2    r5                      r1    r3
    S3    r2    a2       r4    r5

2.10.11.3 Control for State

We need to control the transition from one state to the next.
For this example, the transitions are very simple: each state transitions to its successor, S0 -> S1 -> S2 -> S3 -> S0.

2.10.11.4 Complete State Machine Table

Don't Cares ("-" marks a don't-care entry)

         r1 ce  r2 ce  r2 sel  r4 ce  r5 sel  a1 src2 sel  m1 src2 sel  next state
    S0    1      1      i2      -      -       -            -            S1
    S1    0      0      -       1      i2      -            r2           S2
    S2    -      1      m1      0      a1      r5           r3           S3
    S3    -      -      -       -      a1      a2           -            S0

Instantiations

         r1 ce  r2 ce  r2 sel  r4 ce  r5 sel  a1 src2 sel  m1 src2 sel  next state
    S0    1      1      i2      0      a1      a2           r3           S1
    S1    0      0      m1      1      i2      a2           r2           S2
    S2    1      1      m1      0      a1      r5           r3           S3
    S3    1      1      m1      0      a1      a2           r3           S0

Question: What values should we use for don't cares?

2.10.12 VHDL Code with Explicit State Machine

We chose a one-hot encoding of the state, which usually results in small and fast hardware for state machines with sixteen or fewer states.

The architecture body starts with the processes for r_1 and r_2. The architecture declarations and the remaining processes are listed after them.

    begin
      ----------------------- r_1
      process (clk) begin
        if rising_edge(clk) then
          if state /= S1 then
            r_1 <= i_1;
          end if;
        end if;
      end process;
      ----------------------- r_2
      process (clk) begin
        if rising_edge(clk) then
          if state /= S1 then
            if state = S0 then
              r_2 <= i_2;
            else
              r_2 <= m_1;
            end if;
          end if;
        end if;
      end process;
Declarations for the architecture (these precede the begin keyword and the r_1 and r_2 processes):

    architecture explicit_v1 of vanier is
      signal r_1, r_2, r_3, r_4, r_5 : std_logic_vector(15 downto 0);
      subtype state_ty is std_logic_vector(3 downto 0);
      constant s0 : state_ty := "0001";
      constant s1 : state_ty := "0010";
      constant s2 : state_ty := "0100";
      constant s3 : state_ty := "1000";
      signal state : state_ty;

Remaining processes in the architecture body:

      ----------------------- r_3
      process (clk) begin
        if rising_edge(clk) then
          r_3 <= i_1;
        end if;
      end process;
      ----------------------- r_4
      process (clk) begin
        if rising_edge(clk) then
          if state = S1 then
            r_4 <= m_1;
          end if;
        end if;
      end process;
      ----------------------- r_5
      process (clk) begin
        if rising_edge(clk) then
          if state = S1 then
            r_5 <= i_2;
          else
            r_5 <= a_1;
          end if;
        end if;
      end process;
      ----------------------- combinational datapath
      with state select
        a1_src2 <= r_5 when S2,
                   a_2 when others;
      with state select
        m1_src2 <= r_2 when S1,
                   r_3 when others;
      -- a1 src1 is r_2 in both S2 and S3 (see the datapath control table)
      a_1 <= r_2 + a1_src2;
      a_2 <= r_4 + r_5;
      m_1 <= r_1 * m1_src2;
      o_1 <= r_5;
      ----------------------- state machine
      process (clk) begin
        if rising_edge(clk) then
          if reset = '1' then
            state <= S0;
          else
            case state is
              when S0     => state <= S1;
              when S1     => state <= S2;
              when S2     => state <= S3;
              when S3     => state <= S0;
              -- one-hot encoding: the other 12 vector values are unused
              when others => state <= S0;
            end case;
          end if;
        end if;
      end process;
      ----------------------
    end explicit_v1;

Hardware Block Diagram

[Figure: hardware block diagram. Inputs i1 and i2 feed registers r1..r5; multiplier m1 and adders a1 and a2 form the datapath; output o1 is driven from r5. Shown next to the dataflow diagram annotated with states S0..S3.]
2.10.13 Peephole Optimizations

    -- r_1
    process (clk) begin
      if rising_edge(clk) then
        if state /= S1 then
          r_1 <= i_1;
        end if;
      end if;
    end process;

    -- r_1 (optimized)
    process (clk) begin
      if rising_edge(clk) then
        if ... then  -- condition left blank in the slides
          r_1 <= i_1;
        end if;
      end if;
    end process;

Peephole Optimizations

    -- r_2
    process (clk) begin
      if rising_edge(clk) then
        if state /= S1 then
          if state = S0 then
            r_2 <= i_2;
          else
            r_2 <= m_1;
          end if;
        end if;
      end if;
    end process;

    -- r_2 (optimized)
    process (clk) begin
      if rising_edge(clk) then
        if state(1) = '0' then
          if state(0) = '1' then
            r_2 <= i_2;
          else
            r_2 <= m_1;
          end if;
        end if;
      end if;
    end process;

Peephole Optimizations

    -- state machine
    process (clk) begin
      if rising_edge(clk) then
        if reset = '1' then
          state <= S0;
        else
          case state is
            when S0     => state <= S1;
            when S1     => state <= S2;
            when S2     => state <= S3;
            when S3     => state <= S0;
            -- one-hot encoding: the other 12 vector values are unused
            when others => state <= S0;
          end case;
        end if;
      end if;
    end process;

    -- state machine (optimized)
    -- NOTE: "st" = "state"
    process (clk) begin
      if rising_edge(clk) then
        if reset = '1' then
          st <= S0;
        else
          for i in 0 to 3 loop
            st( (i+1) mod 4 ) <= st( i );
          end loop;
        end if;
      end if;
    end process;

2.10.14 Notes and Observations

Our functional requirements were written as:

    output = (a × d) + (d × b) + b + c

Alternatively, we could have achieved exactly the same functionality with the functional requirements written as (the two statements are mathematically equivalent):

    output = (a × d) + b + (d × b) + c

Data Dependency Graphs: Clean vs Ugly

The naive data dependency graph for the alternative formulation is much messier than the data dependency graph for the original formulation:

[Figure: two data dependency graphs side by side. Original: (a × d) + (d × b) + b + c. Alternative: (a × d) + c + (d × b) + b.]

2.11 Design Example: Stack

This section reserved for your reading pleasure
2.12 Optimization Techniques

2.12.1 Strength Reduction

Strength reduction replaces one operation with another that is simpler.

2.12.1.1 Arithmetic Strength Reduction

    Operation                                 Replacement
    Multiply by a constant power of two       wired shift logical left
    Multiply by a power of two                shift logical left
    Divide by a constant power of two         wired shift logical right
    Divide by a power of two                  shift logical right
    Multiply by 3                             wired shift and addition

2.12.1.2 Boolean Strength Reduction

Boolean tests that can be implemented as wires:
- is odd, is even
- is neg, is pos

By choosing your encodings carefully, you can sometimes reduce a vector comparison to a wire. For example, if your state uses a one-hot encoding, then the comparison state = S3 reduces to state(3) = '1'. You might expect a reasonable logic-synthesis tool to do this reduction automatically, but most tools do not.

When using encodings other than one-hot, Karnaugh maps can be useful tools for optimizing vector comparisons. By carefully choosing our state assignments, when we use a full binary encoding for 8 states, the comparison

    (state = S0 or state = S3 or state = S4) = '1'

can be reduced to a single-bit comparison, such as state(2) = '1'.

2.12.2 Replication and Sharing

2.12.2.1 Mux-Pushing

Pushing multiplexors into the fanin of a signal can reduce area.

Before:

    z <= a + b when (w = '1')
         else a + c;

After:

    tmp <= b when (w = '1')
           else c;
    z <= a + tmp;

The first circuit will have two adders, while the second will have one adder. Some synthesis tools will perform this optimization automatically, particularly if all of the signals are combinational.

2.12.2.2 Common Subexpression Elimination

Introduce new signals to capture subexpressions that occur in multiple places in the code.

Before:

    y <= a + b + c when (w = '1')
         else d;
    z <= a + c + d when (w = '1')
         else e;

After:

    tmp <= a + c;
    y <= b + tmp when (w = '1')
         else d;
    z <= d + tmp when (w = '1')
         else e;
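As a rough illustration of the arithmetic and boolean strength reductions in 2.12.1, the sketch below replaces multipliers and comparators with wiring and a single adder. The entity strength_demo, its ports, and the one-hot position of S3 are assumptions made up for this example; they are not part of the vanier design.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity for illustration only.
entity strength_demo is
  port (
    a       : in  unsigned(7 downto 0);
    state   : in  std_logic_vector(3 downto 0);  -- assumed one-hot, S3 = "1000"
    times_4 : out unsigned(9 downto 0);
    times_3 : out unsigned(9 downto 0);
    is_odd  : out std_logic;
    in_s3   : out std_logic
  );
end strength_demo;

architecture main of strength_demo is
begin
  -- multiply by a constant power of two: append zeros, wires only
  times_4 <= a & "00";
  -- multiply by 3: (a shifted left by one) + a, one adder and no multiplier
  times_3 <= shift_left(resize(a, 10), 1) + resize(a, 10);
  -- is odd: the least-significant bit is the answer
  is_odd  <= a(0);
  -- with a one-hot encoding, state = S3 is a single wire
  in_s3   <= state(3);
end main;
```

Whether a given synthesis tool already performs these reductions on its own varies, so comparing the area reports of both codings is a reasonable check.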
Subexpression Elimination

Note: Clocked subexpressions. Care must be taken when doing common subexpression elimination in a clocked process. Putting the temporary signal in the clocked process will add a clock cycle to the latency of the computation, because the tmp signal will be a flip-flop. The tmp signal must be combinational to preserve the behaviour of the circuit.

2.12.2.3 Computation Replication

To improve performance: if the same result is needed at two very distant locations and wire delays are significant, it might improve performance (increase clock speed) to replicate the hardware.

To reduce area: if the same result is needed at two different times that are widely separated, it might be cheaper to reuse the hardware component to repeat the computation than to store the result in a register.

Note: Muxes are not free. Each time a component is reused, multiplexors are added to inputs and/or outputs. Too much sharing of a component can cost more area in additional multiplexors than would be spent in replicating the component.

2.12.3 Arithmetic

VHDL is left-associative. The expression a + b + c + d is interpreted as (((a + b) + c) + d). You can use parentheses to suggest parallelism.

Perform arithmetic on the minimum number of bits needed. If you only need the lower 12 bits of a result, but your input signals are 16 bits wide, trim your inputs to 12 bits. This results in a smaller and faster design than computing all 16 bits of the result and trimming the result to 12 bits.

2.12.4 Pipelining

You can turn a dataflow diagram into a pipeline by making each clock cycle of the dataflow diagram a separate pipe stage. However, this can be complicated and error-prone. You need to worry about data hazards if you have state-holding registers in your algorithm.
You need to worry about structural hazards if different instructions have different latencies.

A rough description of the technique to turn a dataflow diagram into a pipeline:
- Group one or more consecutive clock cycles of computation for all instructions into each stage.
- Each stage becomes a single module.
- Hardware is not shared between stages. So, moving from a non-pipelined implementation to a pipelined implementation will increase the area of the design.

For pipelines, the most important measure of performance is usually throughput, which is the inverse of the number of clock cycles that are grouped into a single stage. For example, if each clock cycle becomes a single stage, then the throughput (as measured in clock cycles) is 1 parcel/clock-cycle. As another example, if two clock cycles are grouped into a single stage, then a new parcel can enter the pipeline once every two clock cycles.

Chapter 3 Functional Verification

3.1 Overview

3.1.1 Terminology: Validation / Verification / Testing

3.1.2 The Difficulty of Designing Correct Chips

3.1.2.1 Notes from Kenn Heinrich (UW E&CE grad)

"Everyone should get a lecture on why their first industrial design won't work in the field."

Note: There are six reasons in your notes.

3.1.2.2 Notes from Aart de Geus (Chairman and CEO of Synopsys)

More than 60% of the ASIC designs that are fabricated have at least one error, issue, or problem whose severity forced the design to be reworked.

Note: There is a pretty picture in your notes.

3.2 Test Cases and Coverage

3.2.1 Coverage

To be absolutely certain that an implementation is correct, we must check every combination of values. This includes both input values and internal state (flip-flops).
If we have $n_i$ bits of inputs and $n_s$ bits in flip-flops, we have to test $2^{n_i + n_s}$ different cases when doing functional verification.

Question: If we have $n_c$ combinational signals, why don't we have to test $2^{n_i + n_s + n_c}$ different cases?

3.2.2 Floating Point Divider Example

This example illustrates the difficulty of achieving significant coverage on realistic circuits. Consider doing the functional simulation for a double precision (64-bit) floating-point divider.

Given Information

    Data width                                                        64 bits
    Number of gates in circuit                                        10 000
    Number of assembly-language instructions to simulate
      one gate for one test case                                      100
    Number of clock cycles required to execute one assembly-language
      instruction on the computer that is running the simulation      0.5
    Clock speed of the computer that is running the simulation        1 Gigahertz

Number of Cases

Question: How many cases must be considered?

width=64b, gates=10 000, instrs/gate=100, cycles/instr=0.5, cycles/sec=10^9

Simulation Run Time

Question: How long will it take to simulate all of the different possible cases using a single computer?

width=64b, gates=10 000, instrs/gate=100, cycles/instr=0.5, cycles/sec=10^9

Coverage

Question: If you can run simulations non-stop for one year on ten computers, what coverage will you achieve?

width=64b, gates=10 000, instrs/gate=100, cycles/instr=0.5, cycles/sec=10^9

Simulation vs the Real World

From "Validating the Intel(R) Pentium(R) Microprocessor" by Bob Bentley, Design Automation Conference 2001. (Link on E&CE 427 web page.)

- Simulating the Pentium 4 Processor on a Pentium 3 Processor ran at about 15 MHz.
- By tapeout, over 200 billion simulation cycles had been run on a network of computers.
- All of these simulations represent less than two minutes of running a real processor.
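As a rough worked sketch of the three floating-point divider questions in 3.2.2, assume (these are assumptions, not stated in the slides) that the divider takes two 64-bit operands, that it has no internal state to enumerate, and that one year is about $3.15 \times 10^{7}$ seconds:

```latex
\begin{align*}
N_{\text{cases}} &= 2^{64+64} = 2^{128} \approx 3.4 \times 10^{38} \\
t_{\text{case}} &= \frac{10^{4}\ \text{gates} \times 100\ \tfrac{\text{instrs}}{\text{gate}}
                   \times 0.5\ \tfrac{\text{cycles}}{\text{instr}}}
                  {10^{9}\ \tfrac{\text{cycles}}{\text{s}}}
                 = 5 \times 10^{-4}\ \text{s} \\
t_{\text{total}} &= N_{\text{cases}} \cdot t_{\text{case}}
                 \approx 1.7 \times 10^{35}\ \text{s}
                 \approx 5.4 \times 10^{27}\ \text{years} \\
\text{coverage} &= \frac{10 \times (3.15 \times 10^{7}\ \text{s}) / t_{\text{case}}}{N_{\text{cases}}}
                 \approx \frac{6.3 \times 10^{11}}{3.4 \times 10^{38}}
                 \approx 1.9 \times 10^{-27}
\end{align*}
```

Under these assumptions, a year of simulation on ten computers covers a vanishingly small fraction of the input space, which is why exhaustive simulation is not a realistic verification strategy for a circuit like this.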
3.3 Testbenches

3.3.1 Overview of Test Benches

[Figure: a testbench containing stimulus, specification, check, and the implementation.]

Implementation: Circuit that you're checking for bugs. Also known as: "design under test" or "unit under test".
Stimulus: Generates test vectors.
Specification: Describes desired behaviour of implementation.
Check: Checks whether implementation obeys specification.

3.3.2 Reference Model Style Testbench

[Figure: reference model testbench. The stimulus drives both the specification and the implementation; the check compares their outputs.]

3.3.3 Relational Style Testbench

[Figure: relational testbench. The stimulus drives the implementation; the check examines the implementation's inputs and outputs directly.]

3.3.4 Coding Structure of a Testbench

    architecture main of athabasca_tb is
      component declaration for implementation;
      other declarations
    begin
      implementation instantiation;
      stimulus process;
      specification process (or component instantiation);
      check process;
    end main;

3.3.5 Datapath vs Control

Datapath and control circuits tend to use different styles of testbenches.

3.3.6 Verification Tips

Suggested order of simulation for functional verification:

1. Write high-level model.
2. Simulate high-level model until it has correct functionality and latency.
3. Write synthesizable model.
4. Use zero-delay simulation (uw-sim) to check behaviour of synthesizable model against high-level model.
5. Optimize the synthesizable model.
6. Use zero-delay simulation (uw-sim) to check behaviour of optimized model against high-level model.
7. Use timing-simulation (uw-timsim) to check behaviour of optimized model against high-level model.
Section 3.4 describes a series of testbenches that are particularly useful for debugging datapath circuits in the early phases of the design cycle.

3.4 Functional Verification for Datapath Circuits

In this section we will incrementally develop a testbench for a very simple circuit: an AND gate.

Implementation

    entity and2 is
      port (
        a, b : in  std_logic;
        c    : out std_logic
      );
    end and2;

    architecture main of and2 is
    begin
      c <= '1' when (a = '1' AND b = '1')
           else '0';
    end main;

3.4.1 A Spec-Less Testbench

First, use a waveform viewer to check that the implementation generates reasonable outputs for a small set of inputs.

    entity and2_tb is
    end and2_tb;

    architecture main_tb of and2_tb is
      component and2 ... end component;
      signal ta, tb, tc_impl : std_logic;
      signal ok : boolean;
    begin
      --------------------------------------------
      impl : and2 port map (a => ta, b => tb, c => tc_impl);
      --------------------------------------------
      stimulus : process
      begin
        ta <= '0'; tb <= '0';
        wait for 10 ns;
        ta <= '1'; tb <= '1';
        wait for 10 ns;
      end process;
      --------------------------------------------
    end main_tb;

3.4.2 Use an Array for Test Vectors

    architecture main_tb of and2_tb is
      ...
    begin
      ...
      stimulus : process
        type test_datum_ty is record
          ra, rb : std_logic;
        end record;
        type test_vectors_ty is array(natural range <>) of test_datum_ty;
        constant test_vectors : test_vectors_ty :=
          --  a    b
          ( ( '0', '0'),
            ( '1', '1')
          );
      begin
        for i in test_vectors'low to test_vectors'high loop
          ta <= test_vectors(i).ra;
          tb <= test_vectors(i).rb;
          wait for 10 ns;
        end loop;
      end process;
    end main_tb;
3.4.3 Build Spec into Stimulus

    stimulus : process
      type test_datum_ty is record
        ra, rb, rc : std_logic;
      end record;
      type test_vectors_ty is array(natural range <>) of test_datum_ty;
      constant test_vectors : test_vectors_ty :=
        -- a, b: inputs
        -- c   : expected output
        --  a    b    c
        ( ( '0', '0', '0'),
          ( '0', '1', '0'),
          ( '1', '1', '1')
        );
    begin
      for i in test_vectors'low to test_vectors'high loop
        ta <= test_vectors(i).ra;
        tb <= test_vectors(i).rb;
        tc_spec <= test_vectors(i).rc;
        wait for 10 ns;
      end loop;
    end process;

Build Spec into Stimulus (Cont'd)

    stimulus : process
      ...
    begin
      for i in test_vectors'low to test_vectors'high loop
        ta <= test_vectors(i).ra;
        tb <= test_vectors(i).rb;
        tc_spec <= test_vectors(i).rc;
        wait for 10 ns;
      end loop;
    end process;
    -----------------------------------------
    check : process (tc_impl, tc_spec)
    begin
      ok <= (tc_impl = tc_spec);
    end process;
    -----------------------------------------
    end main_tb;

3.4.4 Have Separate Specification Entity

    entity and2_spec is
      ...(same as and2 entity)...
    end and2_spec;

    architecture spec of and2_spec is
    begin
      c <= a AND b;
    end spec;

Testbench for Separate Specification

    architecture main_tb of and2_tb is
      component and2 ...;
      component and2_spec ...;
      signal ta, tb, tc_impl, tc_spec : std_logic;
      signal ok : boolean;
    begin
      -----------------------------------------
      impl : and2 port map (a => ta, b => tb, c => tc_impl);
      spec : and2_spec port map (a => ta, b => tb, c => tc_spec);
      -----------------------------------------
      stimulus process...
      check process...
    end

Testbench for Separate Spec (Cont'd)

    stimulus : process
      ...
      constant test_vectors : test_vectors_ty :=
        --  a    b
        ( ( '0', '0'),
          ( '1', '1')
        );
    begin
      for i in test_vectors'low to test_vectors'high loop
        ta <= test_vectors(i).ra;
        tb <= test_vectors(i).rb;
        wait for 10 ns;
      end loop;
    end process;
    -----------------------------------------
    check : process (tc_impl, tc_spec)
    begin
      ok <= (tc_impl = tc_spec);
    end process;
    -----------------------------------------
    end main_tb;

3.4.5 Generate Test Vectors Automatically

    architecture main_tb of and2_tb is
      ...
    begin
      ...
      stimulus : process
        subtype std_test_ty is std_logic range '0' to '1';
      begin
        for va in std_test_ty'low to std_test_ty'high loop
          for vb in std_test_ty'low to std_test_ty'high loop
            ta <= va;
            tb <= vb;
            wait for 10 ns;
          end loop;
        end loop;
      end process;
      ...
    end main_tb;

3.4.6 Relational Specification

Sometimes we want to check a relationship between the output and the input, rather than check that the output has a specific value. To do this, we drop the spec process, and put the brains into the check process.

    architecture main_tb of and2_tb is
      ...
    begin
      -----------------------------------------
      impl : and2 port map (a => ta, b => tb, c => tc_impl);
      -----------------------------------------
      stimulus : process
        ...
      end process;
      -----------------------------------------
      check : process (ta, tb, tc_impl)
      begin
        ok <= NOT (tc_impl = '1' AND (ta = '0' OR tb = '0'));
      end process;
      -----------------------------------------
    end main_tb;

3.5 Functional Verification of Control Circuits

Control circuits are often more challenging to verify than datapath circuits. In this section, we will explore the functional verification of state machines via a First-In First-Out queue.

3.5.1 Overview of Queues in Hardware

[Figure: structure of a queue, with a write port and a read port.]

[Figure: write sequence. Empty queue; after Write 1 the queue holds A; after Write 2 it holds A, B.]
[Figure: read sequence. Queue holding A, B; Read 1 returns A, leaving B; Read 2 returns B.]

[Figure: a second example, showing a write example on a queue holding B..J.]

[Figure: write illustrating index wrap. Queue holding B..J; Write 1 adds K in the wrapped position.]

[Figure: write illustrating a full queue.]

[Figure: queue signals. Memory mem with write enable WE driven by do_wr, address A0 driven by wr_idx, data input DI0 driven by data_wr, address A1 driven by rd_idx, data output DO1 driving data_rd; do_rd and empty are also shown. Control circuitry not shown.]

[Figure: incomplete queue blocks.]

3.5.2 VHDL Coding

3.5.2.1 Package

    package queue_pkg is
      subtype data is std_logic_vector(3 downto 0);
      function to_data(i : integer) return data;
    end queue_pkg;

    package body queue_pkg is
      function to_data(i : integer) return data is
      begin
        return std_logic_vector(to_unsigned(i, 4));
      end to_data;
    end queue_pkg;

3.5.2.2 Other VHDL Coding

This section reserved for your reading pleasure

3.5.3 Code Structure for Verification

Verification things to notice in queue implementation:

1. instrumentation code
2. coverage monitors
3. assertions

Code Structure for Verification

    architecture ... is
      ...
    begin
      ... normal implementation ...
      process (clk) begin
        if rising_edge(clk) then
          ... instrumentation code ...
          prev_signame <= signame;
        end if;
      end process;
      ... assertions ...
      ... coverage monitors ...
    end;
3.5.4 Instrumentation Code

- Added to implementation to support verification
- Usually keeps track of previous values of signals
- Does not create hardware (optimized away during synthesis)
  - Does not feed any output signals
- Must use synthesizable subset of VHDL

    process (clk) begin
      if rising_edge(clk) then
        prev_rd_idx <= rd_idx;
        prev_wr_idx <= wr_idx;
        prev_do_rd  <= do_rd;
        prev_do_wr  <= do_wr;
      end if;
    end process;

Coverage Events for Queue

Question: What events should we monitor to estimate the coverage of our functional tests?

Coverage Monitor Template

    process (signals read)
    begin
      if (condition) then
        report "coverage: message";
      elsif (condition) then
        report "coverage: message";
      else
        report "error: case fall through on message"
          severity warning;
      end if;
    end process;

Coverage Monitor Code

Events related to rd_idx equals wr_idx.

    process (prev_rd_idx, prev_wr_idx, rd_idx, wr_idx)
    begin
      if (rd_idx = wr_idx) then
        if ( prev_rd_idx = prev_wr_idx ) then
          report "coverage: read = write both moved";
        elsif ( rd_idx /= prev_rd_idx ) then
          report "coverage: Read caught write";
        elsif ( wr_idx /= prev_wr_idx ) then
          report "coverage: Write caught read";
        else
          report "error: case fall through on rd/wr catching"
            severity warning;
        end if;
      end if;
    end process;

Coverage Monitor Code

Events related to rd_idx wrapping.

    process (rd_idx)
    begin
      if (rd_idx = low_idx) then
        report "coverage: rd mv to low";
      elsif (rd_idx = high_idx) then
        report "coverage: rd mv to high";
      else
        report "coverage: rd mv normal";
      end if;
    end process;

3.5.5 Assertions

Assertions for Queue

1. If rd_idx changes, then it increments or wraps.
2. If rd_idx changes, then do_rd was '1', or reset is '1'.
3. If wr_idx changes, then it increments or wraps.
4.
If wr_idx changes, then do_wr was '1', or reset is '1'.
5. And many others....

Assertion Template

    process (signals read)
    begin
      assert (required condition)
        report "error: message" severity warning;
    end process;

Assertions: Read Index

    process (rd_idx)
    begin
      assert ((rd_idx > prev_rd_idx) or (rd_idx = low_idx))
        report "error: rd inc" severity warning;
      assert ((prev_do_rd = '1') or (reset = '1'))
        report "error: rd imp do_rd" severity warning;
    end process;

Assertions: Write Index

    process (wr_idx)
    begin
      assert ((wr_idx > prev_wr_idx) or (wr_idx = low_idx))
        report "error: wr inc" severity warning;
      assert ((prev_do_wr = '1') or (reset = '1'))
        report "error: wr imp do_wr" severity warning;
    end process;

3.5.6 VHDL Coding Tips

Vector Type Declaration

    type data_array_ty is array(natural range <>) of data;
    signal data_array : data_array_ty(7 downto 0);

Functions

    function to_idx (i : natural range data_array'low to data_array'high)
      return idx_ty
    is
    begin
      return to_unsigned(i, idx_ty'length);
    end to_idx;

Conversion to Index

    Without function:  rd_idx <= to_unsigned(5, 3);
    With function:     rd_idx <= to_idx(5);

The function code is verbose, but is very maintainable, because neither the function itself nor uses of the function need to know the width of the index vector.

Attributes

    function inc_idx (idx : idx_ty) return idx_ty is
    begin
      if idx < data_array'high then
        return (idx + 1);
      else
        return (to_idx(data_array'low));
      end if;
    end inc_idx;

Feedback Loops, and Functions

Coding guideline: use functions. Don't use procedures.

    inc as fun:   wr_idx <= inc_idx(wr_idx);
    inc as proc:  inc_idx(wr_idx);

Functions clearly distinguish between reading from a signal and writing to a signal. By examining the use of a procedure, you cannot tell which signals are read from and which are written to.
You must examine the declaration or implementation of the procedure to determine modes of signals.

Modifying a signal within a procedure results in a tri-state signal. This is bad.

File I/O (textio package)

TEXTIO defines read, write, readline, writeline functions.

Described in: http://www.eng.auburn.edu/department/ee/mgc/vhdl.html#textio

These functions can be used to read test vectors from a file and write results to a file.

3.5.7 Queue Specification

Most bugs in queues are related to the queue becoming full, becoming empty, and/or wrap of indices.

Specification should be obviously correct.

Avoid bugs in the specification by making the specification queue larger than the maximum number of writes that we will do in the test suite. Thus, the specification queue will never become full or wrap. However, the implementation queue will become full and wrap.

Write Index Update in Specification

We increment the write index on every write; we never wrap.

    process (clk) begin
      if rising_edge(clk) then
        if (reset = '1') then
          wr_idx <= 0;
        elsif (do_wr = '1') then
          wr_idx <= wr_idx + 1;
        end if;
      end if;
    end process;

Things to Notice

Things to notice in queue specification:

1. don't care conditions ('-')
2. uninitialized data (hint: what is the value of rd_data when we do more reads than writes?)

Don't Care

    rd_data <= data_array(rd_idx) when (do_rd = '1')
               else (others => '-');

3.5.8 Queue Testbench

Things to notice in queue testbench:

1. running multiple test sequences
2. uninitialized data 'U'
3. std_match to compare spec and impl data

[Table: values that compare as equal under "=" versus under std_match. Under std_match, '0' matches '0' and 'L', '1' matches '1' and 'H', and '-' matches everything.]

With equality, '-' /= '1', but we want to use '-' to mean "don't care" in the specification.

The solution is to use std_match, rather than "=", to check implementation signals against the specification.
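A minimal simulation sketch of the difference between "=" and std_match described above. The entity std_match_demo and the two constants are hypothetical names chosen for this illustration; std_match itself is declared in ieee.numeric_std.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;  -- declares std_match

-- Hypothetical demo entity; names are illustrative only.
entity std_match_demo is
end std_match_demo;

architecture sim of std_match_demo is
  -- specification value with don't cares in bits 2 and 0
  constant spec_data : std_logic_vector(3 downto 0) := "1-0-";
  -- implementation value that agrees on every specified bit
  constant impl_data : std_logic_vector(3 downto 0) := "1100";
begin
  process
  begin
    -- "=" compares element by element, so '-' is equal only to '-'
    assert not (impl_data = spec_data)
      report "unexpected: = treated '-' as a wildcard" severity error;
    -- std_match treats '-' as matching any value
    assert std_match(impl_data, spec_data)
      report "unexpected: std_match rejected a don't-care bit" severity error;
    wait;
  end process;
end sim;
```

Neither assertion should fire in simulation: the plain comparison is false because of the '-' positions, while std_match reports a match.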
Stimulus Process Structure

The stimulus process runs multiple test vectors in a single simulation run.

    stimulus : process
      type test_datum_ty is record
        r_reset, ... normal fields ...
      end record;
      type test_vectors_ty is array(natural range <>) of test_datum_ty;
      constant test_vectors : test_vectors_ty :=
        ( -- reset ...
          ( '1', normal fields),
          ( '0', normal fields),
          ...
          -- wr_idx passes rd_idx (overwrite entries)
          -- reset ...
          ( '1', normal fields),
          ( '0', normal fields),
          ...
        );
    begin
      for i in test_vectors'range loop
        if (test_vectors(i).r_reset = '1') then
          ... reset code ...
        end if;
        reset <= '0';
        ... normal sequence ...
        wait until rising_edge(clk);
      end loop;
    end process;

Reset

After reset is asserted, set signals to 'U'.

Chapter 4 Performance Analysis and Optimization

4.1 Introduction

Hennessy and Patterson's Computer Architecture: A Quantitative Approach (textbook for E&CE 429) has good information on performance. We will use some of the same definitions and formulas as Hennessy and Patterson, but we will move away from generic definitions of performance for computer systems and focus on performance for digital circuits.

4.2 Defining Performance

    Performance = Work / Time

You can double your performance by:
- doing twice the work in the same amount of time, OR
- doing the same amount of work in half the time

Benchmarking

    Performance = Work / Time

Measuring time is easy, but how do we accurately measure work?

SPEC Benchmarks

The SPEC benchmarks are among the most respected and accurate predictions of real-world performance.

The game of "benchmarketing" is finding a definition of work that makes your system appear to get the most work done in the least amount of time.
    Measure of Work       Measure of Performance
    clock cycle           MHz
    instruction           MIPs
    synthetic program     Whetstone, Dhrystone, D-MIPs (Dhrystone MIPs)
    real program          SPEC
    travel 1/4 mile       drag race

Definition SPEC: Standard Performance Evaluation Corporation. MISSION: "To establish, maintain, and endorse a standardized set of relevant benchmarks and metrics for performance evaluation of modern computer systems" (http://www.spec.org).

The SPEC organization has different benchmarks for integer software, floating-point software, web-serving software, etc.

4.3 Comparing Performance

4.3.1 General Equations

Equation for "Big is n% greater than Small":

    $n\% = \dfrac{\mathit{Big} - \mathit{Small}}{\mathit{Small}}$

Using the "n% greater" formula, the phrase "The performance of A is n% greater than the performance of B" is:

    $n\% = \dfrac{\mathit{Performance}_A - \mathit{Performance}_B}{\mathit{Performance}_B}$

Performance is inversely proportional to time:

    $\mathit{Performance} = \dfrac{1}{\mathit{Time}}$

Substituting the above equation into the equation for "the performance of A is n% greater than the performance of B" gives:

    $n\% = \dfrac{\mathit{Time}_B - \mathit{Time}_A}{\mathit{Time}_A}$

In general, the equation for a fast system to be n% faster than a slow system is:

    $n\% = \dfrac{T_{\mathit{Slow}} - T_{\mathit{Fast}}}{T_{\mathit{Fast}}}$

Another useful formula is the average time to do one of k different tasks, each of which happens $\%_i$ of the time and takes an amount of time $T_i$ each time it is done:

    $T_{\mathit{Avg}} = \sum_{i=1}^{k} \%_i \, T_i$

We can measure the performance of practically anything (cars, computers, vacuum cleaners, printers....)

4.3.2 Example: Performance of Printers

This section reserved for your reading pleasure

4.4 Clock Speed, CPI, Program Length, and Performance

4.4.1 Mathematics

    CPI           Cycles per instruction
    NumInsts      Number of instructions
    ClockSpeed    Clock speed
    ClockPeriod   Clock period

    $\mathit{Time} = \mathit{NumInsts} \times \mathit{CPI} \times \mathit{ClockPeriod}$

    $\mathit{Time} = \dfrac{\mathit{NumInsts} \times \mathit{CPI}}{\mathit{ClockSpeed}}$
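A small numerical illustration of the "n% faster" formula from 4.3.1 combined with the time equation from 4.4.1. The two systems A and B and all of their numbers are made up for this sketch: both run the same $10^{9}$-instruction program at 1 GHz, A with a CPI of 1.5 and B with a CPI of 2.0.

```latex
\begin{align*}
\mathit{Time}_A &= \frac{\mathit{NumInsts} \times \mathit{CPI}_A}{\mathit{ClockSpeed}}
                 = \frac{10^{9} \times 1.5}{10^{9}\ \text{Hz}} = 1.5\ \text{s} \\
\mathit{Time}_B &= \frac{10^{9} \times 2.0}{10^{9}\ \text{Hz}} = 2.0\ \text{s} \\
n\% &= \frac{\mathit{Time}_B - \mathit{Time}_A}{\mathit{Time}_A}
     = \frac{2.0 - 1.5}{1.5} \approx 33\%
\end{align*}
```

Note that the denominator is the time of the faster system: A is about 33% faster than B, even though A's run time is only 25% shorter than B's.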
4.4.2 Example: CISC vs RISC and CPI

                        Clock Speed   SPECint
    AMD Athlon          1.1 GHz       409
    Fujitsu SPARC64     675 MHz       443

The AMD Athlon is a CISC microprocessor (it uses the IA-32 instruction set). The Fujitsu SPARC64 is a RISC microprocessor (it uses Sun's Sparc instruction set). Assume that it requires 20% more instructions to write a program in the Sparc instruction set than the same program requires in IA-32.

SPECint and Performance

Question: Which of the two processors has higher performance?

Relative CPI

Question: What is the ratio between the CPIs of the two microprocessors?

Absolute CPI

Question: Can you determine the absolute (actual) CPI of either microprocessor?

4.4.3 Effect of Instruction Set on Performance

Example: Changing Instruction Set and Performance

Your group designs a microprocessor and you are considering adding a fused multiply-accumulate to the instruction set. (A fused multiply-accumulate is a single instruction that does both a multiply and an addition. It is often used in digital signal processing.)

Your studies have shown that, on average, half of the multiply operations are followed by an add instruction that could be done with a fused multiply-add.

Additionally, you know:

             CPI               % of instructions
    ADD      0.8 × CPIavg      15%
    MUL      1.2 × CPIavg       5%
    Other    1.0 × CPIavg      80%

Options

You have three options, listed after the introduction to 4.4.4.

4.4.4 Effect of Time to Market on Relative Performance

Assume that performance of the average product in your market segment doubles every 18 months. You are considering an optimization that will improve the performance of your product by 7%.
Options for the fused multiply-accumulate example (Section 4.4.3):
  option 1: no change
  option 2: add the MAC instruction, increase the clock period by 20%, and MAC has the same CPI as MUL.
  option 3: add the MAC instruction, keep the clock period the same, and the CPI of a MAC is 50% greater than that of a multiply.

Question: Which option will result in the highest overall performance?

Question (time to market, Section 4.4.4): If you add the optimization, how much can you allow your schedule to slip before the delay hurts your relative performance compared to not doing the optimization and launching the product according to your current schedule?

4.4.5 Summary of Equations

Time to perform a task:
  $Time = \frac{NumInsts \times CPI}{ClockSpeed}$

Speedup:
  $Speedup = \frac{T_{Slow}}{T_{Fast}}$

Average time to do one of k different tasks:
  $T_{Avg} = \sum_{i=1}^{k} \%_i \times T_i$

TFast is n% faster than TSlow:
  $n\%\ faster = \frac{T_{Slow} - T_{Fast}}{T_{Fast}}$

Performance:
  $Performance = \frac{Work}{Time}$

Performance at time t if performance increases by a factor of k every n units of time:
  $Perf(t) = Perf(0) \times k^{t/n}$

4.5 Performance Analysis and Dataflow Diagrams

4.5.1 Dataflow Diagrams, CPI, and Clock Speed

One of the challenges in designing a circuit is to choose the clock speed. Choosing a clock period affects many aspects of the design, not just the overall performance.

Some goals will push you toward a short clock period.
Some goals will push you toward a long clock period.

  Goal                                                      Action    Affect
  Minimize area
  Increase scheduling flexibility
  Decrease percentage of clock cycle spent in flops
    (overhead time in flops is not doing useful work)
  Decrease time to execute an instruction
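The equations summarised in Section 4.4.5 can be tried out numerically. The following is a minimal Python sketch, not part of the original slides. The function names are hypothetical, and `cpi_ratio` assumes that a SPECint score is proportional to performance on the same amount of work.

```python
import math


def exec_time(num_insts, cpi, clock_hz):
    """Time = NumInsts * CPI / ClockSpeed."""
    return num_insts * cpi / clock_hz


def speedup(t_slow, t_fast):
    """Speedup = TSlow / TFast."""
    return t_slow / t_fast


def percent_faster(t_slow, t_fast):
    """n% faster = (TSlow - TFast) / TFast, returned as a percentage."""
    return 100.0 * (t_slow - t_fast) / t_fast


def perf_at(perf0, k, t, n):
    """Perf(t) = Perf(0) * k**(t/n)."""
    return perf0 * k ** (t / n)


def cpi_ratio(clock_a, perf_a, clock_b, perf_b, insts_b_over_a):
    """CPI_a / CPI_b, using Perf proportional to ClockSpeed / (NumInsts * CPI).

    Assumption: both performance scores measure the same work.
    """
    return (clock_a / clock_b) * insts_b_over_a * (perf_b / perf_a)


def allowable_slip(gain, k, n):
    """Delay t at which the market growth k**(t/n) equals the gain."""
    return n * math.log(gain) / math.log(k)


# Section 4.4.2 data: Athlon (1.1GHz, 409) vs SPARC64 (675MHz, 443),
# with SPARC needing 20% more instructions.
athlon_over_sparc_cpi = cpi_ratio(1.1e9, 409, 675e6, 443, 1.2)

# Section 4.4.4 data: 7% gain, performance doubles every 18 months.
slip_months = allowable_slip(1.07, 2, 18)
```

Under these assumptions the Athlon's CPI comes out at roughly 2.1 times that of the SPARC64, and the schedule can slip by roughly 1.76 months before the 7% gain is cancelled out.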
Outline to Choose Clock Period

Outline of plan to find optimal latency and clock period for maximum performance:

1. Start with smallest possible clock period.
2. Allocate operations to clock cycles.
3. Calculate average time to execute an instruction.
4. If latency > 1, then: increase the clock period until the latency is reduced; return to Step 2.
   Else (latency = 1): choose the clock period and dataflow diagram that resulted in the highest performance.
5. Optimize dataflow diagram to reduce area.

4.5.2 Examples of Dataflow Diagrams for Two Instructions

Circuit supports two instructions, A and B.
Each instruction occurs 50% of the time.
The delay through a register is 5ns.
Find the clock period and dataflow diagram to maximize overall performance.

  Instruction A    Instruction B
  f (30ns)         i (40ns)
  g (50ns)         g (50ns)
  h (20ns)
  g (50ns)

4.5.2.1 Scheduling of Operations for Different Clock Periods

55ns Clock Period
[Figure: schedule of Instr A (f, g, h, g) and Instr B (i, g) with a 55ns clock period; the figure carries 15 ns and 25 ns annotations.]

Scheduling (1), Scheduling (2), Scheduling (3)
[Figures: schedules for further clock periods, each with 15 ns and 25 ns annotations.]

4.5.2.2 Performance Computation for Different Clock Periods

Question: Which clock speed will result in the highest overall performance?

  Clock Period    CPIA    CPIB    Tavg
  55ns
  75ns
  85ns
  95ns
  155ns

4.5.2.3 Example: Two Instructions Taking Similar Time

Question: For the flow below, which clock speed will result in the highest overall performance?

  A       B
  30ns    40ns
  50ns    50ns
  20ns    40ns
  50ns

  Clock Period    CPIA    CPIB    Tavg
  ns
  ns
  ns
  ns
  ns
  ns
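Steps 2 and 3 of the outline in Section 4.5.2 (allocate operations to clock cycles, then compute the average time per instruction) can be sketched in Python. This is an illustrative sketch under stated assumptions, not the method from the slides. It assumes each instruction is a serial chain of operations in the order read from the figure (A: f, g, h, g; B: i, g), that every clock cycle loses one 5ns register delay, and that operations are packed greedily into cycles. All names are hypothetical.

```python
REGISTER_DELAY = 5  # ns, from the problem statement


def cycles_needed(op_delays, clock_period, register_delay=REGISTER_DELAY):
    """Greedily pack a serial chain of operations into clock cycles.

    Returns the number of cycles (the CPI of the instruction).
    """
    budget = clock_period - register_delay
    if max(op_delays) > budget:
        raise ValueError("clock period too short for the slowest operation")
    cycles, used = 1, 0
    for delay in op_delays:
        if used + delay > budget:
            # The operation does not fit in this cycle: start a new one.
            cycles += 1
            used = 0
        used += delay
    return cycles


def average_time(instructions, clock_period):
    """instructions: list of (fraction, op_delays); returns T_avg in ns."""
    return sum(
        fraction * cycles_needed(ops, clock_period) * clock_period
        for fraction, ops in instructions
    )


# Assumed reading of the figure in Section 4.5.2.
INSTR_A = [30, 50, 20, 50]  # f, g, h, g
INSTR_B = [40, 50]          # i, g
MIX = [(0.5, INSTR_A), (0.5, INSTR_B)]

results = {period: average_time(MIX, period) for period in (55, 75, 85, 95, 155)}
best_period = min(results, key=results.get)
```

With this reading of the figure, each candidate clock period in the table equals the sum of a group of consecutive operations plus the register delay, and `best_period` is the candidate with the lowest average time.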
4.5.2.4 Example: Same Total Time, Different Order for A

Question: For the flow below, which clock speed will result in the highest overall performance?

  A       B
  30ns    40ns
  20ns    50ns
  50ns    40ns
  50ns

  Clock Period    CPIA    CPIB    Tavg
  ns
  ns
  ns
  ns

4.5.3 Example: From Algorithm to Optimized Dataflow

This question involves doing some of the design work for a circuit that implements InstP and InstQ using the components described below.

  Instruction    Algorithm          Frequency of Occurrence
  InstP          a b a b b d e      75%
  InstQ          i j k l m          25%

  Component Delays
  2-input Mult   40ns
  2-input Add    25ns
  Register        5ns

NOTES
There is a resource limitation of a maximum of 3 input ports. (There are no other resource limitations.)
You must put registers on your inputs; you do not need to register your outputs.
The environment will directly connect your outputs (its inputs) to registers.
Each input value (a, b, c, d, e, i, j, k, l, m) can be input only once. If you need to use a value in multiple clock cycles, you must store it in a register.

Questions
Question: What clock period will result in the best overall performance?
Question: Find a minimal set of resources that will achieve the performance you calculated.

Chapter 5 Optimization

5.1 Pipelining

5.1.1 Introduction to Pipelining

Pipelining is an optimization that increases performance by overlapping the execution of multiple parcels (instructions).
The cost is an increase in area, because we cannot reuse datapath components, registers, inputs, or outputs.

[Figures: Execution of dataflow diagram; Pipelined Execution.]
[Figures: unpipelined and pipelined execution of the dataflow diagram (inputs a to f, output z, adders add1 to add5) over clock cycles 0 to 13.]

Question (unpipelined execution): How soon can we start to execute the next parcel?
Question (pipelined execution): How soon can we start to execute the next parcel?

Sequential (Unpipelined) Hardware
[Figure: inputs i1 and i2, registers r1 and r2, adder add1, output o1, with a state machine (reset, State(0) to State(4)).]

Pipelined Hardware
[Figure: inputs i1 to i6, registers r1 to r10, adders add1 to add5, output o1.]

Pipelined VHDL Code

  process begin
    wait until rising_edge(clk);
    r1  <= i1;
    r2  <= i2;
    r3  <= r1 + r2;
    r4  <= i3;
    r5  <= r3 + r4;
    r6  <= i4;
    r7  <= r5 + r6;
    r8  <= i5;
    r9  <= r7 + r8;
    r10 <= i6;
  end process;
  o1 <= r9 + r10;

5.1.2 Partially Pipelined

The previous section illustrated a fully pipelined circuit, which means that the circuit could accept a new parcel every clock cycle. Sometimes we want to sacrifice performance (throughput) in order to reduce area. We can do this by having a throughput that is less than one parcel per clock cycle and reusing some hardware.

[Figure: partially pipelined execution of the dataflow diagram (inputs a to f, output z, adders add1 to add3) over clock cycles 0 to 13.]

Definition Bubble: When a pipe stage is empty (contains invalid data), it is said to contain a bubble.

Question: How do we execute one parcel followed by the next?
Question: How do we know whether the output of the pipeline is a bubble or is valid data?
Hardware for Partially Pipelined
[Figure: inputs i1 and i2, registers r1 to r6, adders add1 to add3, output o1, with a state machine (reset, State(0), State(1)).]

5.1.3 Pipelined Version of InstP

This example is based on the InstP/InstQ circuit from Section 4.5.3.

Dataflow Graph
[Figure: dataflow graph for InstP with inputs a, b, d, and e, built from multipliers and adders.]

  Component Delays
  2-input Mult   40ns
  2-input Add    25ns
  Register        5ns

Behaviour of Unpipelined and Pipelined
[Figure: timing diagrams (clk, input, output) for the unpipelined and the pipelined circuit, with a time axis from 0 to 300.]

Pipelined Hardware
[Figure: two pipeline stages. Stage 1: registers r1, r2, r3 (inputs a, b, d), multipliers m1 and m2, adder a1, valid bit v1. Stage 2: registers r4, r5, r6 (r6 holds e), adder a2, multiplier m3, valid bit v2.]

  process begin
    wait until rising_edge(clk);
    r1 <= i1;
    r2 <= i2;
    r3 <= i3;
    r4 <= m1;
    r5 <= a1;
    r6 <= i4;
  end process;
  m1 <= r1 * r2;
  m2 <= r2 * r3;
  a1 <= m1 + m2;   -- a1 is an adder in the dataflow graph
  a2 <= r5 + r6;   -- a2 is an adder in the dataflow graph
  m3 <= r4 * a2;
  o_valid <= v2;
  v1 <=
  v2 <=
  output

5.1.4 Pipelined Version of InstP/InstQ

Dataflow Graph
[Figure: dataflow graph with inputs a, b, d, e in registers r1 to r3, multipliers m1 and m2, and adders a1 and a2.]

Behaviour of Unpiped and Pipelined
[Figure: timing diagrams (clk, input, m1, m2, a1, a2, output) with time axes from 0 to 300.]

Resources and Performance

Chapter 6 Timing Analysis

6.1 Delays and Definitions

In this section we will look at the different timing parameters of circuits. Our focus will be on those parameters that limit the maximum clock speed at which a circuit will work correctly.

6.1.1 Background Definitions

This section reserved for your reading pleasure.

6.1.2 Clock-Related Timing Definitions

6.1.2.1 Clock Skew

[Figure: clock signals clk1 to clk4 arriving at different times; the difference is labelled skew.]

Definition Clock Skew: The difference in arrival times for the same clock edge at different flip-flops.

Clock skew is caused by the difference in interconnect delays to different points on the chip.
Clock Tree Design

Clock tree design is critical in high-performance designs to minimize clock skew.
Sophisticated synthesis tools put lots of effort into clock tree design, and the techniques for clock tree design still generate PhD theses.

6.1.2.2 Clock Latency

[Figure: master clock, intermediate clock, and final clock waveforms; the offset between them is labelled latency.]

Definition Clock Latency: The difference in arrival times for the same clock edge at different levels of interconnect along the clock tree. (Intuitively: different points in the clock generation circuitry.)

Note: Clock latency
Clock latency does not affect the limit on the minimum clock period.

6.1.2.3 Clock Jitter

[Figure: ideal clock and clock with jitter; the difference is labelled jitter.]

Definition Clock Jitter: Difference between actual clock period and ideal clock period.

Causes of Clock Jitter
Clock jitter is caused by:
  temperature and voltage variations over time
  temperature and voltage variations across different locations on a chip
  manufacturing variations between different parts
  etc.

6.1.3 Storage Related Timing Definitions

[Figure: waveforms of d, clk, and q showing the Setup, Hold, and Clock-to-Q intervals. Caption: Setup, hold, and clock-to-Q times for a flip flop.]

Setup and hold define the window in which input data must be held constant in order to guarantee that the storage device will store data correctly.

Note: Require / Guarantee
Setup and hold times are requirements that the storage device imposes upon its environment. Clock-to-Q is a guarantee that the storage device provides its environment. If the environment satisfies the setup and hold times, then the storage device guarantees that it will satisfy the clock-to-Q time.
6.1.3.1 Setup Time

Definition Setup Time ($T_{SUD}$): Latest time before arrival of clock edge (flip flop), or deasserting of enable line (latch), that input data is required to be stable in order for storage device to work correctly.

If setup time is violated, current input data will not be stored; input data from previous clock cycle might remain stored.

[Figure: waveforms of d, clk, and q showing Setup, Hold, and Clock-to-Q.]

6.1.3.2 Hold Time

Definition Hold Time ($T_{HO}$): Latest time after arrival of clock edge (flip flop), or deasserting of enable line (latch), that input data is required to remain stable in order for storage device to work correctly.

If hold time is violated, current input data will not be stored; input data from next clock cycle might slip through and be stored.

[Figure: waveforms of d, clk, and q showing Setup, Hold, and Clock-to-Q.]

6.1.3.3 Clock-to-Q Time

Definition Clock-to-Q Time ($T_{CO}$): Earliest time after arrival of clock edge (flip flop), or asserting of enable line (latch), when output data is guaranteed to be stable.

[Figure: waveforms of d, clk, and q showing Setup, Hold, and Clock-to-Q.]

6.1.4 Propagation Delays

Propagation delay: time it takes a signal to travel from the source (driving) flop to the destination flop.

  propagation delay = load delay + interconnect delay

Load delay: combinational gates between the flops.
Interconnect delay: wires between gates and flops.
6.1.5 Summary of Delay Factors

  Name            Symbol       Definition
  Skew                         Difference in arrival times for different clock signals
  Jitter                       Difference in clock period over time
  Clock-to-Q      $T_{CO}$     Delay from clock signal to Q output of flop
  Setup           $T_{SUD}$    Length of time prior to clock/enable that data must be stable
  Hold            $T_{HO}$     Length of time after clock/enable that data must be stable
  Load                         Delay due to load (fanout/consumers/readers)
  Interconnect                 Delay along wire

  Summary of delay factors

6.1.6 Timing Constraints

Margin

Definition Margin: The difference between the required value of a timing parameter and the actual value.

A negative margin means that there is a timing violation.
A margin of zero means that the timing parameter is just satisfied: changing the timing of the signals (which would affect the actual value of the parameter) could violate the timing parameter.
A positive margin means that the constraint for the timing parameter is more than satisfied: the timing of the signals could be changed at least a little bit without violating the timing parameter.

Note: Margin is often called slack. Both terms are used commonly.

6.1.6.1 Minimum Clock Period

[Figure: signals a, clk1, clk2, b over one clock period, divided into skew, jitter, clock-to-Q, propagation (interconnect + load), setup, and slack. Legend: signal is stable, signal may change, signal may rise, signal may fall.]

  $ClockPeriod \geq Skew + Jitter + T_{CO} + Interconnect + Load + T_{SUD}$

Note: The minimum clock period is independent of hold time.

6.1.6.2 Hold Constraint

[Figure: signals a, clk1, clk2, b showing skew, jitter, hold, clock-to-Q, propagation, and slack. Legend: signal is stable, signal may change, signal may rise, signal may fall.]

  $Skew + Jitter + T_{HO} \leq T_{CO} + Interconnect + Load$

6.1.6.3 Example Timing Violations

[Figures: Good Timing and Setup Violation. Waveforms of a, clk, b, c, d with the Clock-to-Q, Prop, Setup, and Hold intervals marked; violated values are shown as ???.]
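The two constraints of Sections 6.1.6.1 and 6.1.6.2 can be turned into margin (slack) calculations. The Python sketch below is illustrative only: the function names and the example delay values (in ns) are hypothetical, and the constraints are taken in the non-strict form, so that a margin of zero counts as just satisfied.

```python
def min_clock_period(skew, jitter, t_co, interconnect, load, t_sud):
    """Smallest clock period that satisfies the setup-side constraint."""
    return skew + jitter + t_co + interconnect + load + t_sud


def setup_margin(clock_period, skew, jitter, t_co, interconnect, load, t_sud):
    """Margin on the clock period; negative means a timing violation."""
    return clock_period - min_clock_period(
        skew, jitter, t_co, interconnect, load, t_sud
    )


def hold_margin(skew, jitter, t_ho, t_co, interconnect, load):
    """Margin on the hold constraint; note that it does not depend on the clock period."""
    return (t_co + interconnect + load) - (skew + jitter + t_ho)


# Hypothetical delays in ns, chosen only to exercise the formulas.
delays = dict(skew=1, jitter=1, t_co=2, interconnect=3, load=10)

period_limit = min_clock_period(t_sud=2, **delays)
slack_at_20ns = setup_margin(20, t_sud=2, **delays)
hold_slack = hold_margin(t_ho=1, **delays)
```

Lengthening the clock period increases the setup margin but leaves the hold margin unchanged, which matches the note that the minimum clock period is independent of hold time.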
Hold Violation
[Figure: waveforms of a, clk, b, c, d with the Clock-to-Q, Prop, and Hold intervals marked; the violated value is shown as ???.]

6.2 Timing Analysis of Latches and Flip Flops

In this section, we show how to find the clock-to-Q, setup, and hold times for latches, flip-flops, and other storage elements.

6.2.1 Review: Latch, Flip-Flop, Setup, Hold, Clock-to-Q

[Figures: Latch Behaviour and Flop Behaviour, each showing waveforms of clk, d, and q.]

Review: Timing Parameters

Setup: Time before arrival of clock edge (flip flop), or deasserting of enable line (latch), that input data is required to start being stable.
Hold: Time after arrival of clock edge (flip flop), or deasserting of enable line (latch), that input data is required to remain stable.
Clock-to-Q: Time after arrival of clock edge (flip flop), or asserting of enable line (latch), when output data is guaranteed to start being stable.

6.2.2 Simple Multiplexer Latch

6.2.2.1 Structure and Behaviour of Multiplexer Latch

Two modes for storage devices:
  loading data
    loads input data into storage circuitry
    input data passes through to output
  using stored data
    input signal is disconnected from output
    storage circuitry drives output

[Figure: Schematic of the latch with signals clk, i, and o.]

Two Modes for Latch
[Figures: Loading / pass-through mode (multiplexer select = 1) and Storage mode (multiplexer select = 0).]

Unfold Multiplexer to Simple Gates
[Figure: Multiplexer: symbol and implementation (inputs a and b, select, output o).]
[Figure: Latch implementation (inputs d and clk, output o).]

Loading and Storing Values
[Figure: gate-level values in the latch for d=0 with clk=1 and for d=1 with clk=1.]

Latch Glitching
[Figure: latch with inputs d and clk and output o.]

Note: inverters on clk
Both of the inverters on the clk signal are needed. Together, they prevent a glitch on the OR gate when clk is deasserted. If there was only one inverter, a glitch would occur.
For more on this, see Section 6.2.2.6.

[Figure panels: Loading 0 and Loading 1; Storing 0 (clk=0, o=0) and Storing 1 (clk=0, o=1).]

6.2.2.2 Strategy for Timing Analysis of Storage Devices

The key to calculating setup and hold times of a latch, flop, etc. is to identify:
1. how the data is stored when not connected to the input (often a pair of inverters in a loop)
2. the gate(s) that the clock uses to cause the stored data to drive the output (often a transmission gate or multiplexor)
3. the gate(s) that the clock uses to cause the input to drive the output (often a transmission gate or multiplexor)

Clock-to-Q

Note: Clock-to-Q for latches
For latches, clock-to-Q times are measured with respect to the clock edge that connects the data input to the output. For active-high latches, this is a rising edge.

Setup and Hold

Setup and hold timing constraints ensure that, when the storage device transitions from load mode to store mode, the input data is stored correctly in the storage device. Thus, the setup and hold timing constraints come into play when the storage device transitions from load mode to store mode.

Note: Setup and hold time for latches
For latches, hold time and setup time are measured with respect to the clock edge that disconnects the data input from the output. For active-high latches, this is a falling edge.

Setup time is concerned with the previous data value still being in the storage circuitry when the input is disconnected.
Hold time is concerned with the next data value sneaking in before the latch goes into storage mode.

6.2.2.3 Clock-to-Q Time of a Multiplexer Latch

[Figure: six animation frames of the latch circuit with signals d, clk, cn, c2, l1, l2, s1, s2, qn, and q.]
Assume that the input is stable, and then the clock signal transitions to cause the circuit to move from storage mode to load mode.
Calculate the clock-to-Q time by finding the delay of the critical path from where the clock signal enters the storage circuit to where q exits the storage circuit.

6.2.2.4 Setup Timing of a Multiplexer Latch

Step-by-step animation of latch transitioning from load to store mode.

[Figure frames, in order:]
  Circuit is stable in load mode
  t=0: Clk transitions from load to store
  t=1: Clk transitions from load to store
  t=2: s1 propagates to s2, because cn turns on AND gate
  t=3: l2 is set to 0, because c2 turns off AND gate
  t=4: the value from the store path propagates to q
  t=5: the value from the store path completes the cycle

Good Timing
[Figure: latch circuit and waveforms of d, l1, l2, qn, q, s1, s2, clk, cn, c2.]

Timing Violation
[Figure: latch circuit and waveforms of d, l1, l2, qn, q, s1, s2, clk, cn, c2.]

The value on s1 at t=1 will propagate from the store loop to the output and back through the store loop.
At t=1, s1 must have the value that we want to store. Or, equivalently, the value to store must have saturated the store loop by t=1.
It takes 5 time units for a value on the input d to propagate to s1 (d, l1, l2, qn, q, s1).
The setup time is the difference between the delay from d to s1 and the delay from clk to cn: 5 - 1 = 4, so the setup time for this latch is 4 time units.

Minimum Setup Time
[Figure: latch circuit and waveforms of d, l1, l2, qn, q, s1, s2, clk, cn, c2, with the setup interval marked.]

When cn is asserted, the value to be stored must be at s1. Otherwise, the previous value will affect the storage circuitry when the data input is disconnected.

6.2.2.5 Hold Time of a Multiplexer Latch

[Figure: latch circuit with signals d, clk, cn, c2, l1, l2, s1, s2, qn, and q.]
Hold Time Behaviour
[Figure: animation frames of the latch circuit with signals d, clk, cn, c2, l1, l2, s1, s2, qn, and q.]

6.2.2.6 Example of a Bad Latch

[Figure: latch circuit and waveforms of d, l1, l2, qn, q, s1, s2, clk, c2, cn.]

6.3 Critical Paths and False Paths

6.3.1 Introduction to Critical and False Paths

The algorithm that we present comes from McGeer and Brayton in the DAC 198? paper.

The algorithm to find the critical path through a circuit is presented in several parts.
1. Section 6.3.2: Find the longest path ignoring the possibility of false paths.
2. Section 6.3.3: Almost-correct algorithm to test whether a candidate critical path is a false path.
3. Section 6.3.4: If a candidate path is a false path, then find the next candidate path, and repeat the false-path detection algorithm.
4. Section 6.3.5: Correct, complete, and complex algorithm to find the critical path in a circuit.

Note: The analysis of critical paths and false paths assumes that all inputs change values at exactly the same time. Timing differences between inputs are modelled by the skew parameter in timing analysis.

Note: To exercise a path, only one input needs to change. Stated another way, if a path cannot be exercised by toggling one input, then the path cannot be exercised by toggling more than one input.

Throughout our discussion of critical paths, we will use the delay values for gates shown in the table below.

  gate    delay
  NOT     2
  AND     4
  OR      4
  XOR     6

6.3.1.1 Example of Critical Path in Full Adder

Question: Find the critical path through the full-adder circuit shown below.

[Figure: full adder with inputs ci, a, b, internal signals i, j, k, and outputs co and s.]

Alternative Excitation
Question: Do the input values of ci=0, a= , b=1 exercise the critical path?
[Figure: full adder with inputs ci, a, b, internal signals i, j, k, and outputs co and s.]

6.3.1.2 Preliminaries for Critical Paths

Definition critical path: The slowest path on the chip between flops or flops and pins. The critical path limits the maximum clock speed.

6.3.1.3 Longest Path and Critical Path

The longest path through the circuit might not be the critical path, because the behaviour of the gates might prevent an edge (0 to 1 or 1 to 0) from travelling along the path.

Definition false path: a path along which an edge cannot travel from beginning to end.

Example False Path
Question: Determine whether the longest path in the circuit below is a false path.

[Figure: example circuit with inputs a and b and output y, repeated for four input cases.]

Question: How can we determine analytically that this is a false path?

Preview of Complete Example
Question: Find the critical path through the circuit below.

[Figure: circuit with signals a, b, c, d, e, f, g.]

6.3.2 Longest Path

Outline of Algorithm to Find Longest Path

Start at destination signals and traverse through fanin to source signals, annotating each intermediate signal with the maximum delay from the intermediate signal to the destination signals.
The source signal with the maximum delay is the start of the longest path. The delay annotation of this signal is the delay of the longest path.
The longest path is found by working from the source signal to the destination signals, picking the fanout signal with the maximum delay at each step.

6.3.3 Detecting a False Path

6.3.3.1 Preliminaries

Controlling Value

The controlling value of a gate is the value such that if one of the inputs has this value, the output can be determined independently of the other inputs.
The controlled output value of a gate is the value produced by the controlling input value.

  Gate    Controlling Value    Controlled Output
  AND
  OR
  NAND
  NOR
  XOR
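The longest-path outline of Section 6.3.2 can be sketched in Python as follows. This is a minimal sketch, not the slides' worked solution. It uses the gate delays from the table in Section 6.3.1, but the netlist is an assumed textbook full adder: the mapping of the signal names i, j, and k to particular gates is a guess, since the figure is not available. False paths are ignored here, as in Section 6.3.2.

```python
GATE_DELAY = {"NOT": 2, "AND": 4, "OR": 4, "XOR": 6}

# Assumed full-adder netlist: output signal -> (gate type, input signals).
CIRCUIT = {
    "i": ("XOR", ["a", "b"]),
    "j": ("AND", ["a", "b"]),
    "k": ("AND", ["i", "ci"]),
    "s": ("XOR", ["i", "ci"]),
    "co": ("OR", ["j", "k"]),
}


def annotate_delays(circuit):
    """Annotate every signal with its maximum delay to any destination signal."""
    fanout = {}
    for out, (_, inputs) in circuit.items():
        for sig in inputs:
            fanout.setdefault(sig, []).append(out)

    delays = {}

    def delay_from(sig):
        if sig not in delays:
            delays[sig] = max(
                (GATE_DELAY[circuit[out][0]] + delay_from(out)
                 for out in fanout.get(sig, [])),
                default=0,  # destination signals have no fanout
            )
        return delays[sig]

    for sig in set(circuit) | set(fanout):
        delay_from(sig)
    return delays, fanout


def longest_path(circuit):
    """Return (path, delay) for the longest path, ignoring false paths."""
    delays, fanout = annotate_delays(circuit)
    sources = sorted(sig for sig in delays if sig not in circuit)
    path = [max(sources, key=lambda sig: delays[sig])]
    while fanout.get(path[-1]):
        # Step through the fanout gate that keeps the remaining delay maximal.
        path.append(max(
            fanout[path[-1]],
            key=lambda out: GATE_DELAY[circuit[out][0]] + delays[out],
        ))
    return path, delays[path[0]]


path, delay = longest_path(CIRCUIT)
```

For this assumed netlist, inputs a and b tie for the largest annotation; the sketch breaks the tie alphabetically.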
Path Input, Side Input

Definition path input: For a gate on a path (either a candidate critical path, or a real critical path), the path input is the input signal that is on the path.

Definition side input: For a gate on a path (either a candidate critical path, or a real critical path), the side inputs are the input signals that are not on the path.

Reconvergent Fanout

Definition reconvergent fanout: There are paths from signals in the fanout of a gate that reconverge at another gate.

[Figure: circuit with signals a, b, c, d, e, f, g, h and outputs y and z.]

If a candidate path has reconvergent fanout, then the rising or falling edge on the input to the path might cause a side input along the path to have a rising or falling edge, rather than a stable 0 or 1.

Rules for Propagating an Edge Along a Path

[Figure: edge-propagation rules for NOT, AND (side input 1), OR (side input 0), and XOR (side input 0 or 1).]

Missing Rules?
Question: Why do the rules not have falling edges for AND gates or rising edges for OR gates on the side input?

6.3.3.2 Almost-Correct Algorithm to Detect a False Path

1. Annotate each side input along the path with its non-controlling value. These annotations are the constraints that must be satisfied for the candidate path to be exercised.
2. Propagate the constraints backward from the side inputs of the path to the inputs of the circuit under consideration.
3. If there is a contradiction amongst the constraints, then the candidate path is a false path.
4. If there is no contradiction, then the constraints on the inputs give the conditions under which an edge will traverse along the candidate path from input to output.

6.3.3.3 Examples of Detecting False Paths

False-Path Example 1
Question: Determine if the longest path in the circuit below is a false path.

[Figure: circuit with signals a to k, annotated with delays to the outputs (a 16, b 12, c 10, d 14, e 8, f 12, g 8, h 4, i 4, j 0, k 0).]

  side input    non-controlling value    constraint
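The idea behind the almost-correct algorithm of Section 6.3.3.2 can be illustrated with a small Python sketch. Two simplifications are made and should be kept in mind: instead of propagating constraints backward, the sketch simply enumerates all values of the other primary inputs, and it requires every side input to stay at its non-controlling value for both values of the path input (so it does not model edges on side inputs caused by reconvergent fanout). The circuit and all names are hypothetical.

```python
from itertools import product

# Non-controlling values; XOR and NOT impose no side-input constraint.
NON_CONTROLLING = {"AND": 1, "NAND": 1, "OR": 0, "NOR": 0}

GATE_FN = {
    "NOT": lambda v: 1 - v[0],
    "AND": lambda v: int(all(v)),
    "OR": lambda v: int(any(v)),
    "NAND": lambda v: 1 - int(all(v)),
    "NOR": lambda v: 1 - int(any(v)),
    "XOR": lambda v: v[0] ^ v[1],
}

# Hypothetical circuit in topological order: output -> (gate, inputs).
CIRCUIT = {
    "nb": ("NOT", ["b"]),
    "g": ("AND", ["a", "b"]),
    "y": ("AND", ["g", "nb"]),
    "z": ("OR", ["g", "c"]),
}


def evaluate(circuit, inputs):
    """Compute every signal value from the primary input values."""
    values = dict(inputs)
    for out, (gate, ins) in circuit.items():
        values[out] = GATE_FN[gate]([values[sig] for sig in ins])
    return values


def side_input_constraints(circuit, path):
    """(side input, non-controlling value) for each gate along the path."""
    constraints = []
    for path_input, out in zip(path, path[1:]):
        gate, ins = circuit[out]
        if gate in NON_CONTROLLING:
            constraints += [(sig, NON_CONTROLLING[gate])
                            for sig in ins if sig != path_input]
    return constraints


def sensitizing_inputs(circuit, primary_inputs, path):
    """Assignments to the other primary inputs that satisfy all constraints.

    An empty result means the path is false under this simplified check.
    """
    constraints = side_input_constraints(circuit, path)
    free = [sig for sig in primary_inputs if sig != path[0]]
    found = []
    for bits in product([0, 1], repeat=len(free)):
        assignment = dict(zip(free, bits))
        if all(evaluate(circuit, {**assignment, path[0]: edge})[sig] == value
               for edge in (0, 1) for sig, value in constraints):
            found.append(assignment)
    return found


false_case = sensitizing_inputs(CIRCUIT, ["a", "b", "c"], ["a", "g", "y"])
true_case = sensitizing_inputs(CIRCUIT, ["a", "b", "c"], ["a", "g", "z"])
```

In this hypothetical circuit the path a, g, y needs b = 1 at gate g and NOT b = 1 at gate y, which is a contradiction, so `false_case` is empty. The path a, g, z only needs b = 1 and c = 0, so `true_case` contains that single assignment.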
6.3.4 Finding the Next Candidate Path

If the longest path is a false path, we need to find the next longest path in the circuit, which will be our next candidate critical path.
If this candidate fails, we continue to find the next longest of the remaining paths, ad infinitum.

6.3.4.1 Algorithm to Find Next Candidate Path

1. Initialize path table with primary inputs, their potential delay, and fanout.
2. Sort path table by potential delay.
3. If the partial path with the maximum delay has just one unused fanout signal, then extend the partial path with this signal. Otherwise:
   (a) Extend path through unused fanout with max delay.
   (b) Delete this fanout signal from the list of unused fanout signals.
4. Compute constraint that side input has non-controlling value.
5. If the new constraint does not cause a contradiction, then return to step 3. Otherwise:
   (a) Mark this partial path as false.
   (b) For each partial path that is a prefix of the false path: recalculate potential delay of path.
   (c) Return to step 2.

6.3.4.2 Examples of Finding Next Candidate Path

Next-Path Example 1
Question: Starting from the initial delay calculation and longest path, find the next candidate path and test if it is a false path.

[Figure: circuit with signals a to k, annotated with delays to the outputs (a 16, b 12, c 10, d 14, e 8, f 12, g 8, h 4, i 4, j 0, k 0).]

  potential delay    unused fanout    path

  side input    non-controlling value    constraint

6.3.5 Correct Algorithm to Find Critical Path

In this section, we remove the assumption that values on side inputs always arrive earlier than the value on the path input.

6.3.5.1 Algorithm

If a contradiction is found on the path, check for side inputs that are on previously discovered false paths.
If a side input to the candidate path is on a previously discovered false path, and the primary input of the candidate path is the same signal as the primary input of the false path, then the side input defines a prefix of a false path that is a late-arriving side input.
Compute the constraint to excite the prefix (this is called the viability constraint of the prefix).
To the row of the late-arriving side input in the constraint table, add as a disjunction the constraint that the prefix is viable and the path input has a controlling value.

6.3.5.2 Examples

Complete Example 1
Question: Find the critical path in the circuit below.

[Figure: circuit with signals a, b, c, d, e, f, g.]

  potential delay    unused fanout    path

  side input    non-controlling value    constraint

Complete Example 2a
Question: Find the critical path in the circuit below.

[Figure: circuit with signals a to j.]

  potential delay    unused fanout    path

  side input    non-controlling value    constraint

Complete Example 2b
Question: Find the critical path in the circuit below.

[Figure: circuit with signals a to m and x.]

  potential delay    unused fanout    path

  side input    non-controlling value    constraint

Modified Example 3b
Question: Modify circuit to illustrate late side input.
Make j a very slow inverter with delay of 5.
Pick up the example after determining that the path c, x, f, g, i, l, m is false.

[Figure: circuit with signals a to m and x, annotated with delays to the output.]

  potential delay    unused fanout    path

  side input    non-controlling value    constraint
6.4 Analog Timing Model

[Figure: P-transistor at mask level (source, gate, drain, poly, contact, p-diff), transistor level, and switch level, with a cross-section of the fabricated transistor.]
[Figure: N-transistor at mask level (source, gate, drain, poly, contact, n-diff), transistor level, and switch level, with a cross-section of the fabricated transistor.]

Different Levels of Abstraction for Inverter
[Figure: inverter from a to b at transistor level (VDD, GND), gate level, and mask level (contact, poly, p-diff, n-diff, metal).]

A Pair of Inverters
[Figure: inverters a to b to c at transistor level, gate level, and mask level.]
[Figure: RC-Network for Timing Analysis, with Rpu, Rpd, Cp, and CL between VDD and GND.]

A Pair of Inverters (Cont'd)
[Figure: mask level and RC-Network for Timing Analysis, with Rpu, Rpd, Cp, CL, RW, CW, and RV.]
[Figure: RC-Network for Timing Analysis (trimmed), with Rpu, Rpd, Cp, RW, CW, RV, and CL.]

A Circuit with Fanout
[Figure: gate a driving b, which fans out to c and d, at gate level, gate level (physical layout), transistor level, and mask level.]

A Circuit with Fanout (Cont'd)
[Figure: mask level and RC-Network for Timing Analysis, with Rpu, Rpd, Cp, CL, RV, RW1 to RW3, and CW1 to CW3.]
[Figure: RC-Network for Timing Analysis (trimmed).]
6.4.1 Timing Model

[Figure: Timing model. Input Vi, output Vo, with Rpu, Rpd, Cp, and Cout.]

  Rpu     pull up resistor in p-tran
  Rpd     pull down resistor in n-tran
  Cp      parasitic capacitance
  Cout    load capacitance

Measuring Delay Through an Inverter

[Figure: inverters a to b to c at gate level and as an RC-network (analog level), with waveforms of a and b.]

How do we use the analog waveforms to determine the discrete delay through the inverter?

6.4.1.1 Equation for Output Voltage

Output voltage when Vo discharges through Rpd:

  $V_o = V_{DD} \, e^{-t / (R_{pd}(C_p + C_{out}))}$

Trip Points

To measure delay through an inverter, what voltage levels should we use?

Definition Trip Points: A high or '1' trip point is the voltage level where an upwards transition means the signal represents a '1'. A low or '0' trip point is the voltage level where a downwards transition means the signal represents a '0'.

Pick the trip points to simplify the delay equation.
Pick trip points of 0.35/0.65:
  low-voltage ('0') trip point of 0.35 Vdd
  high-voltage ('1') trip point of 0.65 Vdd

Trip Points and Delay Equation

Set up the delay equation for TPD to be the time for Vo to fall from VDD to the low trip point of 0.35 VDD:

  Original equation:     $V_o = V_{DD} \, e^{-t / (R_{pd}(C_p + C_{out}))}$
  0.35 VDD trip point:   $0.35 V_{DD} = V_{DD} \, e^{-T_{PD} / (R_{pd}(C_p + C_{out}))}$

TPD represents the propagation delay, which is the sum of the interconnect and load delays.

Solving for TPD, using $\ln(1/0.35) \approx 1$, and doing some more approximations:

  $T_{PD} \approx R_{pd}(C_p + C_{out})$

Some Rough Intuition

  $T_{PD} \approx R_{pd}(C_p + C_{out})$

A larger transistor has a lower resistance, but a higher capacitance.
Resistance affects timing of source (driving) signals.
Capacitance affects (mostly) timing of destination (load) signals.
Decreasing resistance increases the current through drivers.
Increasing capacitance slows down (dis)charging of load capacitors.

6.5 Elmore Delay Model

6.5.1 Elmore Time Constant

  Original equation:     $V_o = V_{DD} \, e^{-t / (R_{pd}(C_p + C_{out}))}$
  0.35 VDD trip point:    $0.35\,V_{DD} = V_{DD}\, e^{-T_{PD} / (R_{pd}(C_p + C_{out}))}$, which gives $T_{PD} \approx R_{pd}(C_p + C_{out})$

  Introduce the Elmore-delay constant $D_i$:    $0.35\,V_{DD} = V_{DD}\, e^{-T_{PD} / D_i}$

Summary of Elmore's Initials

  $V_i(t)$     The voltage on node i (capacitor i) at time t (normalized to VDD): $V_i(t) = e^{-t / D_i}$
  $D_i$        Elmore time constant for node i: $D_i = \sum_{k=1}^{n} ER_{k,i}\, C_k$ (n is the number of nodes in the circuit)
  $ER_{k,i}$   Resistance along the path from node i to the source-ground node that is also on the path from node k to the source-ground node (source ground is the ground node below the pull-down resistor of the source)

6.5.2 Interconnect with Single Fanout

[Figure: gate G1 driving gate G2 through wire segments and antifuses, and the equivalent RC network. Labels: Vi, Rpu, Rpd, Cp, Ra1 to Ra4, Rw1 to Rw3, C1 to C3, CG2.]

Question: Calculate the delay from gate 1 to gate 2.

  G*    gate
  C*    capacitance on wire
  Ra*   resistance through antifuse
  Rw*   resistance through wire

Doubling Antifuses

Question: If you double the number of antifuses and wires needed to connect two gates, what will be the approximate effect on the wire delay between the gates?

6.5.3 Interconnect with Multiple Gates in Fanout

[Figure: gate G1 driving gates G2 and G3 through programmable interconnect.]

Question: Assuming that wire resistance is much less than antifuse resistance and that all antifuses have equal resistance, calculate the delay from the source inverter (G1) to G2.

Answer:

1. There are a total of 7 nodes in the circuit (n = 7).

2. Label interconnect with resistance and capacitance identifiers.
   [Figure: interconnect labelled with R1 to R6 and C1 to C7 between G1, G2, and G3.]

3. Draw RC tree.
   [Figure: RC tree with source Vi (G1: Rpu, Rpd, Cp), resistors R1 to R6, nodes n1 to n7, capacitors C1 to C7, G2 at n5, and G3 on the other branch.]

4. G2 is node 5 in the circuit (i = 5).

5. Elmore delay equations:
   $D_5 = \sum_{k=1}^{7} ER_{k,5}\, C_k = ER_{1,5} C_1 + ER_{2,5} C_2 + ER_{3,5} C_3 + ER_{4,5} C_4 + ER_{5,5} C_5 + ER_{6,5} C_6 + ER_{7,5} C_7$

6. Elmore resistances:

   $ER_{1,5} = R_1 = R$
   $ER_{2,5} = R_1 + R_2 = 2R$
   $ER_{3,5} = R_1 + R_2 = 2R$
   $ER_{4,5} = R_1 + R_2 + R_3 = 3R$
   $ER_{5,5} = R_1 + R_2 + R_3 + R_4 = 4R$
   $ER_{6,5} = R_1 + R_2 = 2R$
   $ER_{7,5} = R_1 + R_2 = 2R$

7. Plug resistances into delay equations:

   $D_5 = R C_1 + 2R C_2 + 2R C_3 + 3R C_4 + 4R C_5 + 2R C_6 + 2R C_7$

Delay to G2 vs G3

[Figure: the same interconnect and RC tree as in the previous question, with G1 driving G2 (node n5) and G3.]

Question: Assuming all wire segments at the same level have roughly the same capacitance, which is greater, the delay to G2 or the delay to G3?

6.6 Practical Usage of Timing Analysis

Speed Grading

  Fabs sort chips according to their speed (sorting is known as speed grading or speed binning).
  Faster chips are more expensive.
  In FPGAs, sorting is usually based on propagation delay through an FPGA cell. As wires become a larger portion of delay, some analysis of wire delays is also being done.
  Propagation delay is the average of the rising and falling propagation delays.
  Typical speed grades for FPGAs:
    Std   standard speed grade
    1     15% faster than Std
    2     25% faster than Std
    3     35% faster than Std

Worst-Case Timing

  Minimum voltage
  Maximum temperature
  Slow-slow conditions (process variation/corner that results in slow p-channel and slow n-channel transistors).
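Returning to the Elmore delay of Section 6.5: the sum D_i = sum over k of ER_{k,i} C_k can be computed mechanically from an RC tree. The sketch below is an illustration under stated assumptions, not the course's own code. The tree encoding (a `parent` map plus the resistor feeding each node), the function names, and the small four-node example tree are hypothetical. As in the worked answer, the driver's own Rpd and Cp are left out. The last lines reuse the Elmore resistances listed in the worked answer (R, 2R, 2R, 3R, 4R, 2R, 2R) and assume, for the numeric check only, that all seven capacitances are equal.

```python
def path_to_source(node, parent):
    """Nodes whose feeding resistor lies on the path from the source to `node`."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path


def elmore_delay(i, parent, r_in, cap):
    """D_i = sum_k ER_{k,i} * C_k, where ER_{k,i} is the resistance shared by
    the source-to-i path and the source-to-k path."""
    on_path_to_i = set(path_to_source(i, parent))
    total = 0.0
    for k, c_k in cap.items():
        shared_r = sum(r_in[n] for n in path_to_source(k, parent) if n in on_path_to_i)
        total += shared_r * c_k
    return total


# Hypothetical tree: source -R-> 1 -R-> 2, then 2 -R-> 3 and 2 -R-> 4.
R = 1.0   # normalized resistance
C = 1.0   # normalized capacitance
parent = {1: None, 2: 1, 3: 2, 4: 2}
r_in = {n: R for n in parent}
cap = {n: C for n in parent}

d3 = elmore_delay(3, parent, r_in, cap)   # R*C + 2R*C + 3R*C + 2R*C = 8 RC
d1 = elmore_delay(1, parent, r_in, cap)   # every path shares only R1: 4 RC

# Elmore resistances to node 5 from the worked answer, in units of R.
ER_TO_NODE5 = [1, 2, 2, 3, 4, 2, 2]
# Assumption for this check only: C1..C7 all equal C, so D5 is a multiple of RC.
d5_in_rc = sum(ER_TO_NODE5)

print(d3, d1, d5_in_rc)
```

With equal capacitances, the worked answer's D5 comes to 16 RC. With unequal capacitances, each multiplier weights its own C_k as in step 7.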
Besides slow-slow, we could also have fast-fast, slow-fast, and fast-slow process corners.

Increasing temperature increases delay:
  Higher temperature means more electron vibration.
  More electron vibration means more collisions with the electrons carrying the current.
  More collisions mean higher resistivity.
  Higher resistivity means more delay.

Increasing supply voltage decreases delay:
  Higher supply voltage means more current.
  More current means a shorter load capacitor charge time.
  A shorter load capacitor charge time means a lower total delay.

Derating factor is a number used to adjust timing numbers to account for voltage and temperature conditions.

Maximum Delay in CMOS. When?

ASIC manufacturers' classes, based on variety of environments (temperature is specified as TA, ambient temp, or TC, case temp):

  Class        VDD          Temperature
  Commercial   5 V ± 5%     0 to +70 C
  Industrial   5 V ± 10%    -40 to +85 C
  Military     5 V ± 10%    -55 to +125 C

What is important is the transistor temperature inside the chip, TJ (junction temperature).

Overclocking: running a chip at a clock speed faster than what it is rated for (and hoping that your software crashes more frequently than your over-stressed hardware will).

6.6.1 Speed Binning

Speed binning is the process of testing each manufactured part to determine the maximum clock speed at which it will run reliably.

Manufacturers sell chips off of the same manufacturing line at different prices based on how fast they will run.

A speed bin is the clock speed that chips will be labeled with when sold.

6.6.1.1 FPGAs, Interconnect, and Synthesis

On FPGAs, 40-60% of the clock cycle is consumed by interconnect.

When synthesizing, increasing the effort (number of iterations) of place and route can significantly reduce the clock period on large designs.

6.6.2 Worst Case Timing

6.6.2.1 Fanout delay

In Smith's book, Table 5.2 (Fanout delay) combines two separate parameters:
  capacitive load delay
  interconnect delay
into a single parameter (fanout). This is common, and fine.
But, when reading a table such as this, you need to know whether fanout delay combines both capacitive load delay and interconnect delay, or is just capacitive load delay.

6.6.2.2 Derating Factors

Delays are dependent upon supply voltage and temperature.
  Higher temperature: higher delay.
  Higher supply voltage: lower delay.

Temperature

  Higher temperature: higher delay.
  Higher temperature: higher resistivity of wires.
  As temperature goes up, atoms vibrate more, and so have a greater probability of colliding with electrons flowing with the current.

Supply Voltage

  Higher supply voltage: lower delay.
  Higher supply voltage: more current (V = IR).
  More current: less time to charge load capacitors to the threshold voltage.

Derating Factor

Definition: A derating factor is a number used to adjust timing numbers to account for different temperature and voltage conditions.

Excerpt from Table 5.3 in Smith's book (Actel ACT 3 derating factors):

  Derating factor   1.17     1.00     0.63
  Temp              125 C    70 C     -55 C
  Vdd               4.5 V    5.0 V    5.5 V

Chapter 7
Power Analysis and Power-Aware Design

7.1 Overview

7.1.1 Importance of Power and Energy

  Laptops, PDAs, cell phones, etc.: obvious!
  For microprocessors in personal computers, every watt above 40 W adds $1 to manufacturing cost.
  Approximately 25% of the operating expense of a server farm goes to energy bills.
  (Dis)comfort of Unix labs in E2.
  Sandia Labs had to build a special sub-station when they took delivery of the Teraflops massively parallel supercomputer (over 9000 Pentium Pros).
  High-speed microprocessors today can run so hot that they will damage themselves: Athlon reliability problems, Pentium 4 processor thermal throttling.
  In 2000, information technology consumed 8% of total power in the US.
  Future power viruses: cell phone viruses that cause the cell phone to run in full-power mode and consume the battery very quickly; PC viruses that cause the CPU to melt down batteries.

7.1.2 Industrial Names and Products

Note: Lots of links from E&CE 427 web pages under Documentation.

7.1.3 Power vs Energy

Most people talk about power reduction, but sometimes they mean power and sometimes energy.
  Power minimization is usually about heat removal.
  Energy minimization is usually about battery life or energy costs.

  Type     Units    Equivalent Types   Equations
  Energy   Joules   Work               $\text{Volts} \times \text{Coulombs}$;  $\frac{1}{2}\, C \times \text{Volts}^2$
  Power    Watts    Energy / Time      $\text{Volts} \times I$;  Joules / sec

7.1.4 Batteries, Power and Energy

7.1.4.1 Do Batteries Store Energy or Power?

  $\text{Energy} = \text{Volts} \times \text{Coulombs}$
  $\text{Power} = \text{Energy} / \text{Time}$

Batteries are rated in Amp-hours at a voltage.

  $\text{battery} = \text{Amps} \times \text{Seconds} \times \text{Volts} = \frac{\text{Coulombs}}{\text{Seconds}} \times \text{Seconds} \times \text{Volts} = \text{Coulombs} \times \text{Volts} = \text{Energy}$

Batteries store energy.

7.1.4.2 Battery Life and Efficiency

To extend battery life, we want to increase the amount of work done and/or decrease the energy consumed.

Work and energy have the same units, therefore to extend battery life, we truly want to improve efficiency.

Power efficiency of microprocessors is normally measured in MIPS/Watt. Is this a real measure of efficiency?

  $\frac{\text{MIPS}}{\text{Watts}} = \frac{\text{millions of instructions} / \text{Seconds}}{\text{Energy} / \text{Seconds}} = \frac{\text{millions of instructions}}{\text{Energy}}$

Both instructions executed and energy are measures of work, so MIPS/Watt is a measure of efficiency.

7.1.4.3 Battery Life and Power

Question: Running a VHDL simulation requires executing an average of 1 million instructions per simulation step. My computer runs at 700 MHz, has a CPI of 1.0, and burns 70 W of power. My battery is rated at 10 V and 2.5 AH. Assuming all of my computer's clock cycles go towards running VHDL simulations, how many simulation steps can I run on one battery charge?
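One way to work through the battery question above is to convert the battery rating to joules, divide by the power draw to get run time, and then count instructions. The sketch below is an illustration, not the course's official solution. The function names are hypothetical, and it assumes an ideal battery that delivers its full rated energy at a constant 70 W.

```python
def battery_energy_joules(volts, amp_hours):
    """Energy = Volts * Coulombs, with 1 amp-hour = 3600 coulombs."""
    return volts * amp_hours * 3600.0


def simulation_steps_per_charge(clock_hz, cpi, power_w,
                                battery_v, battery_ah, instr_per_step):
    """Simulation steps that fit in one battery charge.

    Assumes an ideal battery and constant power draw (a simplification).
    """
    energy_j = battery_energy_joules(battery_v, battery_ah)
    runtime_s = energy_j / power_w               # Time = Energy / Power
    instructions = runtime_s * clock_hz / cpi    # instructions executed
    return instructions / instr_per_step


energy = battery_energy_joules(10.0, 2.5)        # 90000 J
steps = simulation_steps_per_charge(
    clock_hz=700e6, cpi=1.0, power_w=70.0,
    battery_v=10.0, battery_ah=2.5, instr_per_step=1e6,
)                                                # about 900000 steps

print(f"battery energy = {energy:.0f} J, steps per charge = {steps:.0f}")
```

The result depends only on the energy per instruction (power divided by instruction rate), which is the same efficiency idea as MIPS/Watt.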
Question: What is the weakness of this analysis (MIPS/Watt as a measure of efficiency)?

Battery Life and Power

Question: If I use the SpeedStep feature of my computer, my computer runs at 600 MHz with 60 W of power. With SpeedStep activated, how much longer can I keep the computer running on one battery?

Question: With SpeedStep activated, how many more simulation steps can I run on one battery?

7.2 Power Equations

  $\text{Power} = \text{SwitchPower} + \text{ShortPower} + \text{LeakagePower}$
  $\text{DynamicPower} = \text{SwitchPower} + \text{ShortPower}$
  $\text{StaticPower} = \text{LeakagePower}$

  Dynamic Power: dependent upon clock speed
    Switching Power: useful; charges up transistors
    Short-Circuit Power: not useful; both N and P transistors are on
  Static Power: independent of clock speed
    Leakage Power: not useful; leaks around transistor

Dynamic Power

  Dynamic power is proportional to how often signals change their value (switch).
  Roughly 20% of signals switch during a clock cycle.
  Need to take glitches into account when calculating the activity factor. Glitches increase the activity factor.
  Equations for dynamic power contain clock speed and activity factor.

7.2.1 Switching Power

[Figure: inverter charging CapLoad (input 1->0, output 0->1) and discharging CapLoad (input 0->1, output 1->0). Captions: Charging a capacitor, Discharging a capacitor.]

When a capacitor C is charged to a voltage V, the energy stored in the capacitor is $\frac{1}{2} C V^2$.

The energy required to charge the capacitor from 0 to V is $C V^2$. Half of the energy ($\frac{1}{2} C V^2$) is dissipated as heat through the pullup resistance. Half of the energy is transferred to the capacitor.

When the capacitor discharges from V to 0, the energy stored in the capacitor ($\frac{1}{2} C V^2$) is dissipated as heat through the pulldown resistance.

  Energy to (dis)charge capacitor: $\frac{1}{2} \times \text{CapLoad} \times \text{VoltSup}^2$
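The energy bookkeeping for charging and discharging a load capacitor can be written out as a small sketch. This is an illustration only: the function names and the example values (10 fF load, 5 V supply) are hypothetical. It shows that one full charge-discharge cycle draws C V^2 from the supply, half dissipated in the pullup while charging and half in the pulldown while discharging.

```python
def charge_energy(cap_load, volt_sup):
    """Energies (joules) for charging the load from 0 to volt_sup.

    Returns (drawn_from_supply, stored_on_capacitor, dissipated_in_pullup).
    """
    drawn = cap_load * volt_sup ** 2
    stored = 0.5 * cap_load * volt_sup ** 2
    return drawn, stored, drawn - stored


def discharge_energy(cap_load, volt_sup):
    """Energy (joules) dissipated in the pulldown when the load discharges to 0."""
    return 0.5 * cap_load * volt_sup ** 2


# Hypothetical example values.
CAP_LOAD = 10e-15   # farads
VOLT_SUP = 5.0      # volts

drawn, stored, pullup_heat = charge_energy(CAP_LOAD, VOLT_SUP)
pulldown_heat = discharge_energy(CAP_LOAD, VOLT_SUP)

# Over one full cycle, all energy drawn from the supply ends up as heat.
cycle_heat = pullup_heat + pulldown_heat

print(f"drawn = {drawn:.3e} J, stored = {stored:.3e} J, heat per cycle = {cycle_heat:.3e} J")
```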
Switching Power

  f: frequency at which the inverter goes through a complete charge-discharge cycle (eqn 15.4 in Smith)

  $\text{average switching power} = f \times \text{CapLoad} \times \text{VoltSup}^2$

  ClockSpeed: clock speed
  ActFact: average number of times that the signal switches from 0->1 or from 1->0 during a clock cycle

  $\text{average switching power} = \frac{1}{2} \times \text{ActFact} \times \text{ClockSpeed} \times \text{CapLoad} \times \text{VoltSup}^2$

7.2.2 Short-Circuited Power

[Figure: inverter (Vi, Vo) with short-circuit current IShort flowing from VoltSup to GND, and a plot of gate voltage showing the interval TimeShort, between VoltThresh and VoltSup - VoltThresh, during which both the P-transistor and the N-transistor are on.]

  $\text{PwrShort} = \text{ActFact} \times \text{ClockSpeed} \times \text{TimeShort} \times \text{IShort} \times \text{VoltSup}$

7.2.3 Leakage Power

[Figure: cross section of inverter showing parasitic diode (N and P regions, N-substrate), and the diode I-V curve with ILeak.]

Leakage current through parasitic diode:

  $\text{PwrLk} = \text{ILeak} \times \text{VoltSup}$
  $\text{ILeak} \propto e^{-q \times \text{VoltThresh} / (kT)}$

7.2.4 Glossary

This section reserved for your reading pleasure.

7.2.5 Note on Power Equations

This section reserved for your reading pleasure.

7.3 Overview of Power Reduction Techniques

We can divide power reduction techniques into two classes: analog and digital.

Analog Parameters

Power reduction parameters at the analog level.
  capacitance: for example, Silicon on Insulator (SOI)
  resistance: for example, copper wires
  voltage: low-voltage circuits

Analog Techniques

Power reduction techniques at the analog level.
  dual-VDD: Two different supply voltages: high voltage for performance-critical portions of the design, low voltage for the remainder of the circuit. Alternatively, can vary voltage over time: high voltage when running performance-critical software and low voltage when running software that is less sensitive to performance.
  dual-Vt: Two different threshold voltages: transistors with low threshold voltage for performance-critical portions of the design (can switch more quickly, but more leakage power), transistors with high threshold voltage for the remainder of the circuit (switch more slowly, but reduce leakage power).
  exotic circuits: Special flops, latches, and combinational circuitry that run at a high frequency while minimizing power.
  adiabatic circuits: Special circuitry that consumes power on 0->1 transitions, but not 1->0 transitions. These sacrifice performance for reduced power.
  clock trees: Up to 30% of total power can be consumed in clock generation and the clock tree.

Digital Parameters

Power-reduction parameters at the digital level.
  capacitance (number of gates)
  activity factor
  clock frequency

Digital Techniques

Power-reduction techniques at the digital level.
  multiple clocks: Put a high-speed clock in performance-critical parts of the design and a low-speed clock in the remainder of the circuit.
  clock gating: Turn off the clock to portions of a chip when they are not being used.
  data encoding: Gray coding vs one-hot vs fully encoded vs ...
  glitch reduction: Adjust circuit delays or add redundant circuitry to reduce or eliminate glitches.
  asynchronous circuits: Get rid of clocks altogether.

Additional low-power design techniques for RTL from a Qualis engineer: http://home.europa.com/celiac/lowpower.html

7.4 Voltage Reduction for Power Reduction

If our goal is to reduce power, the most promising approach is to reduce the supply voltage, because, from:

  $\text{Power} = \left(\frac{1}{2} \times \text{ActFact} \times \text{ClockSpeed} \times \text{CapLoad} \times \text{VoltSup}^2\right) + \left(\text{ActFact} \times \text{ClockSpeed} \times \text{TimeShort} \times \text{IShort} \times \text{VoltSup}\right) + \left(\text{ILeak} \times \text{VoltSup}\right)$

we observe:

  $\text{Power} \propto \text{VoltSup}^2$

Reducing Difference Between Supply and Threshold Voltage

As the supply voltage decreases, it takes longer to charge up the capacitive load, which increases the load delay of a circuit.

In the chapter on timing analysis, we saw that increasing the supply voltage will decrease the delay through a circuit. (From V = IR, increasing V causes an increase in I, which causes the capacitive load to charge more quickly.) However, it is more accurate to take into account both the value of the supply voltage and the difference between the supply voltage and the threshold voltage:

  $\text{MaxClockSpeed} \propto \frac{(\text{VoltSup} - \text{VoltThresh})^2}{\text{VoltSup}}$

Effect of Decreasing Supply Voltage on Delay

Question: If the delay along the critical path of a circuit is 20 ns, the supply voltage is 2.8 V, and the threshold voltage is 0.7 V, calculate the critical path delay if the supply voltage is dropped to 2.2 V.

Reducing Threshold Voltage Increases Leakage Current

If we reduce the supply voltage, we want to also reduce the threshold voltage, so that we do not increase the delay through the circuit.

However, as the threshold voltage drops, the leakage current increases:

  $\text{ILeak} \propto e^{-q \times \text{VoltThresh} / (kT)}$

And increasing the leakage current increases the power:

  $\text{Power} \propto \text{ILeak}$

So, we need to strike a balance between reducing VoltSup (which has a quadratic effect on reducing power) and increasing ILeak (which has a linear effect on increasing power).

7.5 Data Encoding for Power Reduction

7.5.1 How Data Encoding Can Reduce Power

  Decimal   Gray   Binary
  0         0000   0000
  1         0001   0001
  2         0011   0010
  3         0010   0011
  4         0110   0100
  5         0111   0101
  6         0101   0110
  7         0100   0111
  8         1100   1000
  9         1101   1001
  10        1111   1010
  11        1110   1011
  12        1010   1100
  13        1011   1101
  14        1001   1110
  15        1000   1111

Data encoding is a technique that chooses data values so that normal execution will have a low activity factor.
The most common example is Gray coding, where exactly one bit changes value each clock cycle when counting.

8-bit Counter

Question: For an eight-bit counter, how much more power will a binary counter consume than a Gray-code counter?

Random Data

Question: For completely random eight-bit data, how much more power will a binary circuit consume than a Gray-code circuit?

7.5.2 Example Problem: Sixteen Pulser

7.5.2.1 Problem Statement

Your task is to do the power analysis for a circuit that should send out a one-clock-cycle pulse on the done signal once every 16 clock cycles. (That is, done is 0 for 15 clock cycles, then 1 for one cycle, then repeat with 15 cycles of 0 followed by a 1, etc.)

[Figure: required behaviour. Waveforms of clk and done over clock cycles 1 to 33, with done pulsing once every 16 cycles.]

You have been asked to consider three different types of counters: a binary counter, a Gray-code counter, and a one-hot counter. (The table below shows the values from 0 to 15 for the different encodings.)

Question: What is the relative amount of power consumption for the different options?

7.5.2.2 Additional Information

Your implementation technology is an FPGA where each cell has a programmable combinational circuit and a flip-flop. The combinational circuit has 4 inputs and 1 output. The capacitive load of the combinational circuit is twice that of the flip-flop.

[Figure: PLA cell.]

1. You may neglect power associated with clocks.
2. You may assume that all counters:
   (a) are implemented on the same fabrication process
   (b) run at the same clock speed
   (c) have negligible leakage and short-circuit currents
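As a rough way to compare the switching activity of the three encodings named in the problem statement, the sketch below counts how often each state bit toggles over one full counting cycle, wrap-around included. It is an illustration only, not the course's solution: the function names are hypothetical, the Gray code is generated as i XOR (i >> 1) (the standard reflected Gray sequence), and only the counter state bits are considered, not the done signal or any capacitance.

```python
def binary_code(i):
    return i


def gray_code(i):
    # Standard reflected Gray code: adjacent values differ in exactly one bit.
    return i ^ (i >> 1)


def one_hot_code(i):
    return 1 << i


def toggles_per_bit(encode, n_states, width):
    """Count 0->1 and 1->0 transitions of each bit over one full counting
    cycle of n_states states, including the wrap back to state 0."""
    counts = [0] * width
    for i in range(n_states):
        diff = encode(i) ^ encode((i + 1) % n_states)
        for bit in range(width):
            counts[bit] += (diff >> bit) & 1
    return counts


# Sixteen-state counters (index 0 is the least significant bit).
binary4 = toggles_per_bit(binary_code, 16, 4)      # [16, 8, 4, 2]
gray4 = toggles_per_bit(gray_code, 16, 4)          # [8, 4, 2, 2]
one_hot16 = toggles_per_bit(one_hot_code, 16, 16)  # 2 toggles per bit

# Eight-bit counters: total toggles per 256 clock cycles.
binary8_total = sum(toggles_per_bit(binary_code, 256, 8))  # 510
gray8_total = sum(toggles_per_bit(gray_code, 256, 8))      # 256

print(binary4, gray4, one_hot16)
print(binary8_total, gray8_total)
```

Dividing each count by the number of clock cycles in the counting cycle gives a per-bit activity factor. Turning that into power still requires the capacitance of the circuitry that each bit drives.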
Data Encoding

  Decimal   Gray   One-Hot            Binary
  0         0000   0000000000000001   0000
  1         0001   0000000000000010   0001
  2         0011   0000000000000100   0010
  3         0010   0000000000001000   0011
  4         0110   0000000000010000   0100
  5         0111   0000000000100000   0101
  6         0101   0000000001000000   0110
  7         0100   0000000010000000   0111
  8         1100   0000000100000000   1000
  9         1101   0000001000000000   1001
  10        1111   0000010000000000   1010
  11        1110   0000100000000000   1011
  12        1010   0001000000000000   1100
  13        1011   0010000000000000   1101
  14        1001   0100000000000000   1110
  15        1000   1000000000000000   1111

7.5.2.3 Answer

Sketch the Circuitry

Name the output done and the count digits d().

Capacitance

[Blank table to fill in. Columns: cap, number, subtotal cap. Rows for each of Gray, 1-Hot, and Binary: d() PLAs, d() Flops, done PLAs, done Flops.]

Activity Factors

[Figure: waveforms of clk, d(0) to d(3), and done for Gray coding, one-hot coding, and binary coding.]

  Gray coding activity factors:     d(0) 8/16,  d(1) 4/16, d(2) 2/16, d(3) 2/16, done 2/16
  One-hot activity factors:         d(0), d(1), d(2), ... 2/16 each, done 2/16
  Binary coding activity factors:   d(0) 16/16, d(1) 8/16, d(2) 4/16, d(3) 2/16, done 2/16

Putting it all Together

[Blank table to fill in. Columns: subtotal cap, act fact, power. Rows for each of Gray, 1-Hot, and Binary: d() PLAs, d() Flops, done PLAs, done Flops, Total.]

7.6 Clock Gating

The basic idea of clock gating is to reduce power by turning off the clock when a circuit isn't needed. This reduces the activity factor.

7.6.1 Introduction to Clock Gating

Examples of Clock Gating

  Condition                                             Circuitry turned off
  O/S in standby mode                                   Everything except core state (PC, registers, caches, etc)
  No floating point instructions for k clock cycles     Floating point circuitry
  Instruction cache miss                                Instruction decode circuitry
  No instruction in pipe stage i                        Pipe stage i

7.6.2 Implementing Clock Gating

Clock gating is implemented by adding a component that disables the clock when the circuit isn't needed.

[Figure: without clock gating, a circuit with i_data, i_valid, clk, o_data, o_valid. With clock gating, the same circuit clocked by cool_clk, with a Clock Enable State Machine that takes i_wakeup and clk and produces clk_en.]

7.6.3 Design Process

7.6.4 Effectiveness of Clock Gating

Parameters to characterize the effectiveness of clock gating:
  Eff        effectiveness of clock gating
  PctValid   percentage of clock cycles with valid data in the circuit (the clock must be toggling)
  PctClk     percentage of clock cycles that the clock toggles

Effectiveness measures the percentage of clock cycles with invalid data in which the clock is turned off.

Equation for effectiveness of clock gating:

  $\text{Eff} = \frac{\text{PctClkOff}}{\text{PctInvalid}} = \frac{1 - \text{PctClk}}{1 - \text{PctValid}}$

Clock Gating Effectiveness Questions

Question: What is the effectiveness if the clock toggles only when there is valid data?

Question: What is the effectiveness of a clock that always toggles?

Question: What does it mean for a clock gating scheme to be 75% effective?

Question: What happens if PctClk < PctValid?
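The effectiveness equation can be explored numerically for the cases raised in the questions above. The sketch below is an illustration only: the function name is hypothetical, percentages are passed as fractions between 0 and 1, and the 70% valid-data figure is an arbitrary example value.

```python
def gating_effectiveness(pct_clk, pct_valid):
    """Eff = (1 - PctClk) / (1 - PctValid), with both arguments as fractions.

    pct_valid must be below 1: if data is always valid there are no
    invalid cycles to gate, and the ratio is undefined.
    """
    if not 0.0 <= pct_clk <= 1.0:
        raise ValueError("pct_clk must be between 0 and 1")
    if not 0.0 <= pct_valid < 1.0:
        raise ValueError("pct_valid must be at least 0 and below 1")
    return (1.0 - pct_clk) / (1.0 - pct_valid)


PCT_VALID = 0.70  # arbitrary example: data is valid 70% of the time

# Clock toggles only when data is valid: every invalid cycle is gated.
eff_ideal = gating_effectiveness(pct_clk=0.70, pct_valid=PCT_VALID)    # 1.0

# Clock always toggles: no invalid cycle is gated.
eff_none = gating_effectiveness(pct_clk=1.00, pct_valid=PCT_VALID)     # 0.0

# Clock toggles less often than data is valid: Eff exceeds 1, which means
# the clock is off during some cycles that carry valid data.
eff_broken = gating_effectiveness(pct_clk=0.60, pct_valid=PCT_VALID)   # above 1

print(eff_ideal, eff_none, eff_broken)
```

An effectiveness above 1 is not a better scheme: it signals that valid data is being dropped.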
Effect of Effectiveness

We can see the effect of the effectiveness of a clock-gating scheme on the activity factor:

[Figure: plot of the new activity factor A' against Eff, falling linearly from A at Eff = 0 to PctValid * A at Eff = 1.]

The new activity factor with a clock gating scheme is:

  $A' = A - (1 - \text{PctValid}) \times \text{Eff} \times A$

7.6.5 Example: Reduced Activity Factor with Clock Gating

Question: How much power will be saved in the following clock-gating scheme?
  70% of the time the main circuit has valid data
  clock gating circuit is 90% effective (90% of the time that the circuit has invalid data, the clock is off)
  clock gating circuit has 10% of the area of the main circuit
  clock gating circuit has the same activity factor as the main circuit
  neglect short-circuiting and leakage power

7.6.6 Clock Gating with Valid-Bit Protocol

7.6.6.1 Valid-Bit Protocol

Need a mechanism to tell the circuit when to pay attention to data inputs.

[Figure: circuit with clk, i_valid, i_data, o_valid, o_data, and waveforms of these signals.]

Valid-Bit Protocol

  i_valid: high when i_data has valid data; signifies whether the circuit should pay attention to or ignore the data.
  o_valid: high when o_data has valid data; signifies whether the environment should pay attention to the output of the circuit.

For more on circuit protocols, see section 2.8.

Microscopic Analysis

Which clock edges are needed?

[Figure: waveforms of clk, i_valid, and o_valid.]

7.6.6.2 How Many Clock Cycles for Module?

Given a module with latency Lat, if the module receives a stream of NumPcls consecutive valid parcels, how many clock cycles must the clock-enable signal be asserted?
  $t_{i1}$       time of first i_valid
  $t_{o1}$       time of first o_valid
  $t_{ik}$       time of last i_valid
  $t_{ok}$       time of last o_valid
  $t_{first}$    first clock cycle with clock enabled
  $t_{last}$     last clock cycle with clock enabled

Initial equations to describe relationships between different points in time:

  $t_{o1} = t_{i1} + \text{Lat}$
  $t_{ok} = t_{o1} + \text{NumPcls} - 1$
  $t_{first} = t_{i1} + 1$
  $t_{last} = t_{ok} + 1$

To understand the -1 in the equation for $t_{ok}$, examine the situation when NumPcls = 1. With just one parcel going through the system, the first and last outputs coincide ($t_{ok} = t_{o1}$, where $t_{o1} = t_{i1} + \text{Lat}$), so we have: $t_{ok} = t_{o1} + 1 - 1$.

In the equation for $t_{last}$, we need the +1 to clear the last valid bit.

Solve for the length of time that the clock must be enabled. The +1 at the end of this equation is because if $t_{last} = t_{first}$, we would have the clock enabled for 1 clock cycle.

  $\text{ClkEnLen} = t_{last} - t_{first} + 1$
  $\phantom{\text{ClkEnLen}} = (t_{ok} + 1) - (t_{i1} + 1) + 1$
  $\phantom{\text{ClkEnLen}} = t_{ok} - t_{i1} + 1$
  $\phantom{\text{ClkEnLen}} = (t_{o1} + \text{NumPcls} - 1) - t_{i1} + 1$
  $\phantom{\text{ClkEnLen}} = t_{o1} + \text{NumPcls} - t_{i1}$
  $\phantom{\text{ClkEnLen}} = t_{i1} + \text{Lat} + \text{NumPcls} - t_{i1}$
  $\phantom{\text{ClkEnLen}} = \text{Lat} + \text{NumPcls}$

We are left with the formula that the number of clock cycles that the module's clock must be enabled is the latency through the module plus the number of consecutive parcels.

7.6.6.3 Adding Clock-Gating Circuitry

[Figure: before clock gating, circuitry with data_in, valid_in, clk, data_out, valid_out. After clock gating, the circuitry is clocked by cool_clk, with a Clock Enable State Machine that takes hot_clk and wakeup_in and produces clk_en and wakeup_out. Waveforms of clk, valid_in, data_in, valid_out, data_out, with legend: don't care, uninitialized.]

  hot_clk: clock that always toggles
  cool_clk: gated clock; sometimes toggles, sometimes stays low
  wakeup: alerts circuit that ...

UMass (Amherst) - ECE - 242
ECE 242 Data StructuresLecture 19Avoiding RecursionECE242 L19: Avoiding RecursionOctober 15, 2008Overview Recursion is easy to use for many problems Always possible to use iteration instead Recursion makes heavy use of the call stack, wh
UMass (Amherst) - ECE - 242
ECE 242 Data StructuresLecture 37 Topological SortingECE242 L37: Memory ManagementDecember 3, 2008Topological Sorting Topological sort Is an ordering in which the tasks can be performed without violating any of the prerequisites.ECE242 L37
UMass (Amherst) - ECE - 242
ECE 242 Data StructuresLecture 20Binary TreesECE242 L20: Binary TreesOctober 17, 2008Overview Problem: How do we represent non-linear structures? Binary Tree Similar to a linked list except each node has two children Useful for many data
Wayne State University - MATH - 5700
Wayne State University Department of MathematicsInformation SheetCourse InformationTitle: Introduction to Probability Theory Course: MAT 570 Section: 75909 Semester: Fall, 1995 Room: 44 Rachkam Time: MTWF 10:40 AM - 11:35 AMInstructor Informat
East Los Angeles College - POLF - 0109
JOHANNES LINDVALLPersonal Details Address: Department of Politics, Manor Road, Oxford, OX1 3UQ, United Kingdom. E-mail: johannes.lindvall@politics.ox.ac.uk. Website: http:/users.ox.ac.uk/~polf0109/. Date of Birth: February 8, 1975. Nationality: Swe
UCSB - ECE - 242
Signal Compression (ECE 242) Gibson Homework 3 SolutionJanuary 22 20081. (a) Considering uniformly distributed case 1 axb ba fX (x) = 0 elsewhere We have 2 = E [X 2 ] E [X ]2 = Dmin 1 = 6N 230(b a)2 . 12 3 (b a)2 fX (x)dx = . 6(128)2
East Los Angeles College - POLF - 0109
T HE POLITICS OF PURPOSE Swedish Economic Policy After the Golden AgeJohannes Lindvall Department of Political Science Gteborg University Box 711, 405 30 Gteborg, Sweden E-mail: Johannes.Lindvall@pol.gu.seForthcoming in Comparative Politics. Fina
East Los Angeles College - POLF - 0109
JOHANNES LINDVALLT HE POLITICS OF PURPOSESWEDISH MACROECONOMIC POLICY AFTER THE GOLDEN AGEDEPARTMENT OF POLITICAL SCIENCE GTEBORG UNIVERSITY 2004Distribution Johannes Lindvall Department of Political Science Gteborg University P.O. Box 711 405
East Los Angeles College - POLF - 0109
A MODEL OF PROTESTSJOHANNES LINDVALL, UNIVERSITY OF OXFORD1. Introduction This paper develops a theoretical analysis of conicts between governments and pressure groups. The papers main claim is that protests are most likely to occur in political s
UCSB - ECE - 242
ECE 242 Gibson Midterm Exam Solutions 1. Consider some estimate Y and the corresponding errorWinter Quarter 2008 03/06/08 2 (Y ) = E[(Y Y )t W (Y Y )] = E[(Y Y + Y Y )t W (Y Y + Y Y )] = E Y Y E Y Y2 W 2 W + 2 E[(Y Y )t W (Y Y )
UCSB - ECE - 242
Signal Compression (ECE 242) Gibson Course Project January 8, 2008 Handout #2Each student is required to submit an individual course project, consisting of a written report and an oral presentation, describing a detailed examination of a signal c
UCSB - ECE - 242
Signal Compression (ECE 242) Gibson Homework No. 1 Due: January 15, 2008January 8, 2008 Handout #31. Given two independent random variables X and Y, form Z=X+Y.2 2 (a) If X and Y are Gaussian with means X and Y and variances X and Y , respect
UCSB - ECE - 242
Signal Compression (ECE 242) Gibson Homework No. 7 Due: March 13, 2008 1. Problem 9.8 on page 303 of the text. 2. Problem 9.9 on page 303 of the text. 3. Problem 9.10 on page 303 of the text.March 6, 2008
UCSB - ECE - 242
UNIVERSITY OF CALIFORNIA, SANTA BARBARA Department of Electrical and Computer Engineering Signal Compression ECE 242 Winter 2009 Instructor: Ken RoseTime and Place: Mondays, Wednesdays 10 am, Phelps 1431. Oce Hours: TBD Tentative High-Level Outline
UCSB - ECE - 242
UCSB - ECE - 242
UNIVERSITY OF CALIFORNIA, SANTA BARBARA Department of Electrical and Computer Engineering ECE 242 Winter 2009 Instructor: K. RoseHomework Assignment #1 (Due on Wednesday 1/21/2009)Reading: review Chapters 2 and 5. Problem # 1. Text, Prob. 2.1. Pr
UCSB - ECE - 242
UNIVERSITY OF CALIFORNIA, SANTA BARBARA Department of Electrical and Computer Engineering ECE 242 Winter 2009 Instructor: K. RoseHomework Assignment #3 (Due on Wednesday 2/4/2009)Reading: Review Chapter 7. Problem # 1. Consider the optimal estima
UCSB - ECE - 242
UCSB - ECE - 242
UNIVERSITY OF CALIFORNIA, SANTA BARBARA Department of Electrical and Computer Engineering ECE 242 Winter 2009 Instructor: K. RoseHomework Assignment #6 (Due on Wednesday 3/11/2009)Reading: Review Chapters 10 and 11. Problem # 1. Text, Prob. 10.2.
UCSB - ECE - 242
UNIVERSITY OF CALIFORNIA, SANTA BARBARA Department of Electrical and Computer Engineering ECE 242 Winter 2009 Instructor: K. RoseHomework Assignment #2 (Due on Wednesday 1/28/2009)Reading: Review Chapter 5, and Section 6.3. Problem # 1. Text, Pro
UCSB - ECE - 242
UCSB - ECE - 242
UNIVERSITY OF CALIFORNIA, SANTA BARBARA Department of Electrical and Computer Engineering ECE 242 Winter 2009 Instructor: K. RoseHomework Assignment #5 (Due on Wednesday 3/4/2009)Reading: Review Chapters 8 and 9. Problem # 1. Construct a probabil
UCSB - ECE - 242
UNIVERSITY OF CALIFORNIA, SANTA BARBARA Department of Electrical and Computer Engineering ECE 242 Winter 2009 Instructor: Ken RoseHomework Assignment #4 (Due on Wednesday 2/18/2009)2 Problem # 1. Consider a source with variance x and autocorrelat
Penn State - AJS - 394
THE PENNSYLVANIA STATE UNIVERSITY CHEMISTRY BUILDINGUNIVERSITY PARK, PAADAM J. SENK MECHANICAL OPTION www.arche.psu.edu/thesis/eportfolio/ current/portfolios/ajs394/BUILDING: 5 Occupied stories, Basement, Mechanical Penthouse SIZE: 181,890 Sq Ft C
Penn State - AJS - 394
ADAM J. SENKPENNSYLVANIA STATE UNIVERSITY CHEMISTRY BUILDINGTHESIS PROPOSAL MECHANICAL OPTIONEXECUTIVE SUMMARY This report contains proposed changes redesign topics for the Pennsylvania State University Chemistry Building located at the Universit
Penn State - AJS - 394
SENIOR THESIS PROPOSALTHE PENNSYLVANIA STATE UNIVERSITY CHEMISTRY BUILDINGUNIVERSITY PARK, PENNSYLVANIAPREPARED FOR: DR. SREBRIC ASSISTANT PROFESSOR THE PENNSYLVANIA STATE UNIVERSITY OF ARCHITECTURAL ENGINEERING BY: ADAM J. SENK MECHANICAL OPTION