636 Pages

notes-up1

Course: ECE 427, Fall 2009
School: W. Alabama

427: E&CE Digital Systems Engineering Course Notes (with Solutions) Instructors: Farzad Khalvati and Muhammad Nummer Notes by: Mark Aagaard 2007t1Winter University of Waterloo Dept of Electrical and Computer Engineering January 9, 2007 Contents I Course Notes 1 VHDL 1.1 Introduction to VHDL . . . . . . . . . . . . . . . . . . . . . . . . 1.1.1 Levels of Abstraction . . . . . . . . . . . . . . . . . . . . . 1.1.2 VHDL Origins and History . . . . . . . . . . . . . . . . . . 1.1.3 Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1.4 Synthesis of a Simulation-Based Language . . . . . . . . . 1.1.5 Solution to Synthesis Sanity . . . . . . . . . . . . . . . . . 1.1.6 Standard Logic 1164 . . . . . . . . . . . . . . . . . . . . . 1.2 Comparison of VHDL to Other Hardware Description Languages . 1.2.1 VHDL Disadvantages . . . . . . . . . . . . . . . . . . . . 1.2.2 VHDL Advantages . . . . . . . . . . . . . . . . . . . . . . 1.2.3 VHDL and Other Languages . . . . . . . . . . . . . . . . . 1.2.3.1 VHDL vs Verilog . . . . . . . . . . . . . . . . . 1.2.3.2 VHDL vs SystemC . . . . . . . . . . . . . . . . 1.2.3.3 VHDL vs Other Hardware Description Languages 1.2.3.4 Summary of VHDL Evaluation . . . . . . . . . . 1.3 Overview of Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . 1.3.1 Syntactic Categories . . . . . . . . . . . . . . . . . . . . . 1.3.2 Library Units . . . . . . . . . . . . . . . . . . . . . . . . . 1.3.3 Entities and Architecture . . . . . . . . . . . . . . . . . . . 1.3.4 Concurrent Statements . . . . . . . . . . . . . . . . . . . . 1.3.5 Component Declaration and Instantiations . . . . . . . . . . 1.3.6 Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.3.7 Sequential Statements . . . . . . . . . . . . . . . . . . . . 1.3.8 A Few More Miscellaneous VHDL Features . . . . . . . . 1.4 Concurrent vs Sequential Statements . . . . . . . . . . . . . . . . . 1.4.1 Concurrent Assignment vs Process . . . . . . . . . . . . . . 
1.4.2 Conditional Assignment vs If Statements . . . . . . . . . . 1.4.3 Selected Assignment vs Case Statement . . . . . . . . . . . 1.4.4 Coding Style . . . . . . . . . . . . . . . . . . . . . . . . . 1.5 Overview of Processes . . . . . . . . . . . . . . . . . . . . . . . . 1.5.1 Combinational Process vs Clocked Process . . . . . . . . . 1.5.2 Latch Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 3 3 3 4 6 7 7 8 9 9 9 10 10 10 10 11 11 11 11 12 14 16 16 17 18 18 18 18 19 19 20 22 23 i ii CONTENTS 1.5.3 Combinational vs Flopped Signals . . . . . . . . . . . 1.6 Details of Process Execution . . . . . . . . . . . . . . . . . . 1.6.1 Temporal Granularities of Simulation . . . . . . . . . 1.6.2 Intuition Behind Delta-Cycle Simulation . . . . . . . 1.6.3 Denitions and Algorithm . . . . . . . . . . . . . . . 1.6.3.1 Process Modes . . . . . . . . . . . . . . . . 1.6.3.2 Simulation Algorithm . . . . . . . . . . . . 1.6.3.3 Delta-Cycle Denitions . . . . . . . . . . . 1.6.4 Example 1: Process Execution (Bamboozle) . . . . . . 1.6.5 Example 2: Process Execution (Flummox) . . . . . . 1.6.6 Example: Need for Provisional Assignments . . . . . 1.6.7 Delta-Cycle Simulations of Flip-Flops . . . . . . . . . 1.7 Register-Transfer Level Simulation . . . . . . . . . . . . . . . 1.7.1 Technique for Register-Transfer Level Simulation . . . 1.7.2 Examples of RTL Simulation . . . . . . . . . . . . . . 1.8 VHDL and Hardware Building Blocks . . . . . . . . . . . . . 
1.8.1 Basic Building Blocks . . . . . . . . . . . . . . . . . 1.8.2 Deprecated Building Blocks for RTL . . . . . . . . . 1.8.2.1 An Aside on Flip-Flops and Latches . . . . 1.8.2.2 Deprecated Hardware . . . . . . . . . . . . 1.8.3 Hardware and Code for Flops . . . . . . . . . . . . . 1.8.3.1 Flops with Waits and Ifs . . . . . . . . . . . 1.8.3.2 Flops with Synchronous Reset . . . . . . . 1.8.3.3 Flops with Chip-Enable . . . . . . . . . . . 1.8.3.4 Flop with Chip-Enable and Mux on Input . . 1.8.3.5 Flops with Chip-Enable, Muxes, and Reset . 1.8.4 An Example Sequential Circuit . . . . . . . . . . . . 1.9 Arrays and Vectors . . . . . . . . . . . . . . . . . . . . . . . 1.10 Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.10.1 Arithmetic Packages . . . . . . . . . . . . . . . . . . 1.10.2 Shift and Rotate Operations . . . . . . . . . . . . . . 1.10.3 Overloading of Arithmetic . . . . . . . . . . . . . . . 1.10.4 Different Widths and Arithmetic . . . . . . . . . . . . 1.10.5 Overloading of Comparisons . . . . . . . . . . . . . . 1.10.6 Different Widths and Comparisons . . . . . . . . . . . 1.10.7 Type Conversion . . . . . . . . . . . . . . . . . . . . 1.11 Synthesizable vs Non-Synthesizable Code . . . . . . . . . . . 1.11.1 Unsynthesizable Code . . . . . . . . . . . . . . . . . 1.11.1.1 Initial Values . . . . . . . . . . . . . . . . . 1.11.1.2 Wait For . . . . . . . . . . . . . . . . . . . 1.11.1.3 Different Wait Conditions . . . . . . . . . . 1.11.1.4 Multiple if rising edges in Same Process . 1.11.1.5 if rising edge and wait in Same Process 1.11.1.6 if rising edge with else Clause . . . . . 1.11.1.7 if rising edge Inside a for Loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 24 24 25 26 26 26 28 29 38 40 42 45 45 46 51 51 52 52 52 53 53 53 54 54 55 55 59 60 61 61 61 62 62 62 63 64 65 65 65 65 66 66 67 67 CONTENTS iii 1.11.1.8 wait Inside of a for loop . . . . . . . . . 1.11.2 Synthesizable, but Bad Coding Practices . . . . . . . . . 1.11.2.1 Asynchronous Reset . . . . . . . . . . . . . . 1.11.2.2 Combinational if-then Without else . . . 1.11.2.3 Bad Form of Nested Ifs . . . . . . . . . . . . 1.11.2.4 Deeply Nested Ifs . . . . . . . . . . . . . . . 1.11.3 Synthesizable, but Unpredictable Hardware . . . . . . . 1.12 Synthesizable VHDL Coding Guidelines . . . . . . . . . . . . . 1.12.1 Signal Declarations . . . . . . . . . . . . . . . . . . . . 1.12.2 Flip-Flops and Latches . . . . . . . . . . . . . . . . . . 1.12.3 Inputs and Outputs . . . . . . . . . . . . . . . . . . . . 1.12.4 Multiplexors and Tri-State Signals . . . . . . . . . . . . 1.12.5 Processes . . . . . . . . . . . . . . . . . . . . . . . . . 1.12.6 State Machines . . . . . . . . . . . . . . . . . . . . . . 1.12.7 Reset . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.13 VHDL Problems . . . . . . . . . . . . . . . . . . . . . . . . . P1.1 IEEE 1164 . . . . . . . . . . . . . . . . . . . . . . . . P1.2 VHDL Syntax . . . . . . . . . . . . . . . . . . . . . . 
P1.3 Flops, Latches, and Combinational Circuitry . . . . . . P1.4 Counting Clock Cycles . . . . . . . . . . . . . . . . . . P1.5 Arithmetic Overow . . . . . . . . . . . . . . . . . . . P1.6 Delta-Cycle Simulation: Pong . . . . . . . . . . . . . . P1.7 Delta-Cycle Simulation: Baku . . . . . . . . . . . . . . P1.8 Clock-Cycle Simulation . . . . . . . . . . . . . . . . . P1.9 VHDL VHDL Behavioural Comparison: Teradactyl . P1.10 VHDL VHDL Behavioural Comparison: Ichtyostega P1.11 Waveform VHDL Behavioural Comparison . . . . . P1.12 Hardware VHDL Comparison . . . . . . . . . . . . P1.13 8-Bit Register . . . . . . . . . . . . . . . . . . . . . . . P1.13.1 Asynchronous Reset . . . . . . . . . . . . . . P1.13.2 Discussion . . . . . . . . . . . . . . . . . . . P1.13.3 Testbench for Register . . . . . . . . . . . . . P1.14 Synthesizable VHDL and Hardware . . . . . . . . . . . P1.15 Datapath Design . . . . . . . . . . . . . . . . . . . . . P1.15.1 Correct Implementation? . . . . . . . . . . . P1.15.2 Smallest Area . . . . . . . . . . . . . . . . . P1.15.3 Shortest Clock Period . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
68 69 69 70 70 70 71 71 71 72 72 72 73 73 74 76 76 76 78 79 81 82 82 84 85 86 88 90 91 91 91 91 92 94 94 97 97 iv CONTENTS 2 RTL Design with VHDL 2.1 Prelude to Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.1.1 A Note on EDA for FPGAs and ASICs . . . . . . . . . . . . . . . . 2.2 FPGA Background and Coding Guidelines . . . . . . . . . . . . . . . . . . . 2.2.1 Generic FPGA Hardware . . . . . . . . . . . . . . . . . . . . . . . . 2.2.1.1 Generic FPGA Cell . . . . . . . . . . . . . . . . . . . . . 2.2.2 Area Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2.2.1 Interconnect for Generic FPGA . . . . . . . . . . . . . . . 2.2.2.2 Blocks of Cells for Generic FPGA . . . . . . . . . . . . . 2.2.2.3 Clocks for Generic FPGAs . . . . . . . . . . . . . . . . . 2.2.2.4 Special Circuitry in FPGAs . . . . . . . . . . . . . . . . . 2.2.3 Generic-FPGA Coding Guidelines . . . . . . . . . . . . . . . . . . . 2.2.4 Altera APEX20K Information and Coding Guidelines . . . . . . . . 2.3 Design Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.1 Generic Design Flow . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.2 Implementation Flows . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.3 Design Flow: Datapath vs Control vs Storage . . . . . . . . . . . . . 2.3.3.1 Classes of Hardware . . . . . . . . . . . . . . . . . . . . . 2.3.3.2 Datapath-Centric Design Flow . . . . . . . . . . . . . . . 2.3.3.3 Control-Centric Design Flow . . . . . . . . . . . . . . . . 2.3.3.4 Storage-Centric Design Flow . . . . . . . . . . . . . . . . 2.4 Algorithms and High-Level Models . . . . . . . . . . . . . . . . . . . . . . 2.4.1 Flow Charts and State Machines . . . . . . . . . . . . . . . . . . . . 2.4.2 Data-Dependency Graphs . . . . . . . . . . . . . . . . . . . . . . . 2.4.3 High-Level Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.5 Finite State Machines in VHDL . . . . . . . . . . . . . . . . . . . . . 
. . . 2.5.1 Introduction to State-Machine Design . . . . . . . . . . . . . . . . . 2.5.1.1 Mealy vs Moore State Machines . . . . . . . . . . . . . . 2.5.1.2 Introduction to State Machines and VHDL . . . . . . . . . 2.5.1.3 Explicit vs Implicit State Machines . . . . . . . . . . . . . 2.5.2 Implementing a Simple Moore Machine . . . . . . . . . . . . . . . . 2.5.2.1 Implicit Moore State Machine . . . . . . . . . . . . . . . . 2.5.2.2 Explicit Moore with Flopped Output . . . . . . . . . . . . 2.5.2.3 Explicit Moore with Combinational Outputs . . . . . . . . 2.5.2.4 Explicit-Current+Next Moore with Concurrent Assignment 2.5.2.5 Explicit-Current+Next Moore with Combinational Process 2.5.3 Implementing a Simple Mealy Machine . . . . . . . . . . . . . . . . 2.5.3.1 Implicit Mealy State Machine . . . . . . . . . . . . . . . . 2.5.3.2 Explicit Mealy State Machine . . . . . . . . . . . . . . . . 2.5.3.3 Explicit-Current+Next Mealy . . . . . . . . . . . . . . . . 2.5.4 Reset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.5.5 State Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.5.5.1 Constants vs Enumerated Type . . . . . . . . . . . . . . . 2.5.5.2 Encoding Schemes . . . . . . . . . . . . . . . . . . . . . . 2.6 Dataow Diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 99 99 100 100 100 100 104 104 106 106 107 108 109 109 110 111 111 112 113 113 113 114 114 115 116 116 116 116 117 118 119 120 121 122 123 124 125 126 127 128 130 130 131 132 CONTENTS v 2.6.1 Dataow Diagrams Overview . . . . . . . . . . . . . . 2.6.2 Dataow Diagrams, Hardware, and Behaviour . . . . . 2.6.3 Dataow Diagram Execution . . . . . . . . . . . . . . . 
2.6.4 Performance Estimation . . . . . . . . . . . . . . . . . 2.6.5 Area Estimation . . . . . . . . . . . . . . . . . . . . . 2.6.6 Design Analysis . . . . . . . . . . . . . . . . . . . . . 2.6.7 Area / Performance Tradeoffs . . . . . . . . . . . . . . 2.7 Memory Arrays and RTL Design . . . . . . . . . . . . . . . . . 2.7.1 Memory Arrays in VHDL . . . . . . . . . . . . . . . . 2.7.1.1 Using a Two-Dimensional Array for Memory 2.7.1.2 Memory Arrays in Hardware . . . . . . . . . 2.7.1.3 VHDL Code for Single-Port Memory Array . 2.7.1.4 Using Library Components for Memory . . . 2.7.1.5 Build Memory from Slices . . . . . . . . . . 2.7.1.6 Dual-Ported Memory . . . . . . . . . . . . . 2.7.2 Data Dependencies . . . . . . . . . . . . . . . . . . . . 2.7.3 Memory Arrays and Dataow Diagrams . . . . . . . . . 2.7.4 Example: Memory Array and Dataow Diagram . . . . 2.8 Input / Output Protocols . . . . . . . . . . . . . . . . . . . . . . 2.9 Design Example: Massey . . . . . . . . . . . . . . . . . . . . . 2.9.1 Requirements . . . . . . . . . . . . . . . . . . . . . . . 2.9.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 2.9.3 Initial Dataow Diagram . . . . . . . . . . . . . . . . . 2.9.4 Dataow Diagram Scheduling . . . . . . . . . . . . . . 2.9.5 Optimize Inputs and Outputs . . . . . . . . . . . . . . . 2.9.6 Input/Output Allocation . . . . . . . . . . . . . . . . . 2.9.7 Register Allocation . . . . . . . . . . . . . . . . . . . . 2.9.8 Datapath Allocation . . . . . . . . . . . . . . . . . . . 2.9.9 Datapath for DP+Ctrl Model . . . . . . . . . . . . . . . 2.10 Design Example: Vanier . . . . . . . . . . . . . . . . . . . . . 2.10.1 Requirements . . . . . . . . . . . . . . . . . . . . . . . 2.10.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 2.10.3 Initial Dataow Diagram . . . . . . . . . . . . . . . . . 2.10.4 Reschedule to Meet Requirements . . . . . . . . . . . . 2.10.5 Optimize Resources . . . . . . . . . . . . . . . . . . . 
2.10.6 Assign Names to Registered Values . . . . . . . . . . . 2.10.7 Input/Output Allocation . . . . . . . . . . . . . . . . . 2.10.8 Tangent: Combinational Outputs . . . . . . . . . . . . . 2.10.9 Register Allocation . . . . . . . . . . . . . . . . . . . . 2.10.10 Datapath Allocation . . . . . . . . . . . . . . . . . . . 2.10.11 Hardware Block Diagram and State Machine . . . . . . 2.10.11.1 Control for Registers . . . . . . . . . . . . . 2.10.11.2 Control for Datapath Components . . . . . . . 2.10.11.3 Control for State . . . . . . . . . . . . . . . . 2.10.11.4 Complete State Machine Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 135 138 139 139 140 140 143 143 143 143 144 145 146 148 148 150 153 155 156 156 157 158 158 160 162 164 165 166 171 171 172 172 172 174 175 176 177 178 180 180 180 181 182 182 vi CONTENTS 2.10.12 VHDL Code with Explicit State Machine . . . . . . . . 2.10.13 Peephole Optimizations . . . . . . . . . . . . . . . . . 2.10.14 Notes and Observations . . . . . . . . . . . . . . . . . 2.11 Design Example: Stack . . . . . . . . . . . . . . . . . . . . . . 2.11.1 Stack: Requirements . 
. . . . . . . . . . . . . . . . . . 2.11.1.1 Entity . . . . . . . . . . . . . . . . . . . . . 2.11.1.2 Instructions . . . . . . . . . . . . . . . . . . 2.11.1.3 Instruction Encoding . . . . . . . . . . . . . 2.11.1.4 Miscellaneous Requirements . . . . . . . . . 2.11.2 Stack: Algorithm . . . . . . . . . . . . . . . . . . . . . 2.11.3 Stack: Dataow Diagram . . . . . . . . . . . . . . . . . 2.11.3.1 Data-Dependency Graphs . . . . . . . . . . . 2.11.3.2 Partition into Clock Cycles . . . . . . . . . . 2.11.4 Stack: High-Level Model . . . . . . . . . . . . . . . . . 2.11.5 Stack: Block Diagram . . . . . . . . . . . . . . . . . . 2.11.5.1 Individual Block Diagrams . . . . . . . . . . 2.11.5.2 Complete Block Diagram . . . . . . . . . . . 2.11.6 Stack: Register Transfer Level . . . . . . . . . . . . . . 2.11.6.1 Stack: Separate Control, Datapath and Storage 2.11.6.2 Stack: Datapath Operations . . . . . . . . . . 2.11.6.3 Stack: Explicit State Machine . . . . . . . . . 2.12 Optimization Techniques . . . . . . . . . . . . . . . . . . . . . 2.12.1 Strength Reduction . . . . . . . . . . . . . . . . . . . . 2.12.1.1 Arithmetic Strength Reduction . . . . . . . . 2.12.1.2 Boolean Strength Reduction . . . . . . . . . . 2.12.2 Replication and Sharing . . . . . . . . . . . . . . . . . 2.12.2.1 Mux-Pushing . . . . . . . . . . . . . . . . . 2.12.2.2 Common Subexpression Elimination . . . . . 2.12.2.3 Computation Replication . . . . . . . . . . . 2.12.3 Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . 2.12.4 Pipelining . . . . . . . . . . . . . . . . . . . . . . . . . 2.13 Design Problems . . . . . . . . . . . . . . . . . . . . . . . . . P2.1 Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . P2.1.1 Data Structures . . . . . . . . . . . . . . . . P2.1.2 Own Code vs Libraries . . . . . . . . . . . . P2.2 Design Guidelines . . . . . . . . . . . . . . . . . . . . P2.3 Dataow Diagram Optimization . . . . . . . . . . . . . P2.3.1 Resource Usage . . . . . . . . . . . . . . 
. . P2.3.2 Optimization . . . . . . . . . . . . . . . . . . P2.4 Dataow Diagram Design . . . . . . . . . . . . . . . . P2.4.1 Maximum Performance . . . . . . . . . . . . P2.4.2 Minimum area . . . . . . . . . . . . . . . . . P2.5 Michener: Design and Optimization . . . . . . . . . . . P2.6 Dataow Diagrams with Memory Arrays . . . . . . . . P2.6.1 Algorithm 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183 186 189 190 190 190 190 190 191 191 193 193 194 195 198 198 200 201 201 206 208 211 211 211 211 212 212 212 212 213 213 214 214 214 214 214 215 215 216 216 216 217 217 217 218 CONTENTS vii P2.7 P2.8 3 P2.6.2 Algorithm 2 . 2-bit adder . . . . . . . P2.7.1 Generic Gates P2.7.2 FPGA . . . . Sketches of Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
218 218 218 219 219 221 221 221 221 222 223 223 223 224 224 225 226 228 229 230 230 231 231 232 232 234 235 236 237 239 239 240 240 244 244 245 245 246 246 249 249 251 252 254 Functional Verication 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.1.1 Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2.1 Terminology: Validation / Verication / Testing . . . . . . . . . . . . 3.2.2 The Difculty of Designing Correct Chips . . . . . . . . . . . . . . . 3.2.2.1 Notes from Kenn Heinrich (UW E&CE grad) . . . . . . . 3.2.2.2 Notes from Aart de Geus (Chairman and CEO of Synopsys) 3.3 Test Cases and Coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.3.1 Test Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.3.2 Coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.3.3 Floating Point Divider Example . . . . . . . . . . . . . . . . . . . . 3.4 Testbenches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.4.1 Overview of Test Benches . . . . . . . . . . . . . . . . . . . . . . . 3.4.2 Reference Model Style Testbench . . . . . . . . . . . . . . . . . . . 3.4.3 Relational Style Testbench . . . . . . . . . . . . . . . . . . . . . . . 3.4.4 Coding Structure of a Testbench . . . . . . . . . . . . . . . . . . . . 3.4.5 Datapath vs Control . . . . . . . . . . . . . . . . . . . . . . . . . . 3.4.6 Verication Tips . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.5 Functional Verication for Datapath Circuits . . . . . . . . . . . . . . . . . . 3.5.1 A Spec-Less Testbench . . . . . . . . . . . . . . . . . . . . . . . . . 3.5.2 Use an Array for Test Vectors . . . . . . . . . . . . . . . . . . . . . 3.5.3 Build Spec into Stimulus . . . . . . . . . . . . . . . . . . . . . . . . 3.5.4 Have Separate Specication Entity . . . . . . . . . . . . . . . . . . . 
3.5.5 Generate Test Vectors Automatically . . . . . . . . . . . . . . . . . . 3.5.6 Relational Specication . . . . . . . . . . . . . . . . . . . . . . . . 3.6 Functional Verication of Control Circuits . . . . . . . . . . . . . . . . . . . 3.6.1 Overview of Queues in Hardware . . . . . . . . . . . . . . . . . . . 3.6.2 VHDL Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.6.2.1 Package . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.6.2.2 Other VHDL Coding . . . . . . . . . . . . . . . . . . . . 3.6.3 Code Structure for Verication . . . . . . . . . . . . . . . . . . . . . 3.6.4 Instrumentation Code . . . . . . . . . . . . . . . . . . . . . . . . . . 3.6.5 Coverage Monitors . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.6.6 Assertions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.6.7 VHDL Coding Tips . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.6.8 Queue Specication . . . . . . . . . . . . . . . . . . . . . . . . . . 3.6.9 Queue Testbench . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.7 Functional Verication Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii CONTENTS P3.1 P3.2 P3.3 P3.4 P3.5 4 Carry Save Adder . . . . . . . . . . . . . Trafc Light Controller . . . . . . . . . . P3.2.1 Functionality . . . . . . . . . . P3.2.2 Boundary Conditions . . . . . P3.2.3 Assertions . . . . . . . . . . . State Machines and Verication . . . . . P3.3.1 Three Different State Machines P3.3.2 State Machines in General . . . Test Plan Creation . . . . . . . . . . . . P3.4.1 Early Tests . . . . . . . . . . . P3.4.2 Corner Cases . . . . . . . . . . Sketches of Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254 254 254 254 254 255 255 256 256 257 257 257 259 259 259 260 260 261 265 265 265 267 269 270 271 271 272 273 273 274 275 276 284 284 285 285 285 285 285 286 286 286 286 287 Performance Analysis and Optimization 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2 Dening Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.3 Comparing Performance . . . . . . . . . . . . . . . . . . . . . . . . . 4.3.1 General Equations . . . . . . . . . . . . . . . . . . . . . . . . 4.3.2 Example: Performance of Printers . . . . . . . . . . . . . . . . 4.4 Clock Speed, CPI, Program Length, and Performance . . . . . . . . . . 4.4.1 Mathematics . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.4.2 Example: CISC vs RISC and CPI . . . . . . . . . . . . . . . . 4.4.3 Effect of Instruction Set on Performance . . . . . . . . . . . . . 4.4.4 Effect of Time to Market on Relative Performance . . . . . . . 4.4.5 Summary of Equations . . . . . . . . . . . . . . . . . . . . . . 4.5 Performance Analysis and Dataow Diagrams . . . . . . . . . . . . . . 4.5.1 Dataow Diagrams, CPI, and Clock Speed . . . . . . . . . . . 4.5.2 Examples of Dataow Diagrams for Two Instructions . . . . . . 4.5.2.1 Scheduling of Operations for Different Clock Periods 4.5.2.2 Performance Computation for Different Clock Periods 4.5.2.3 Example: Two Instructions Taking Similar Time . . . 4.5.2.4 Example: Same Total Time, Different Order for A . . 4.5.3 Example: From Algorithm to Optimized Dataow . . . . . . . 4.6 Performance Analysis and Optimization Problems . . . . . . . . . . . . P4.1 Farmer . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . P4.2 Network and Router . . . . . . . . . . . . . . . . . . . . . . . P4.2.1 Maximum Throughput . . . . . . . . . . . . . . . . . P4.2.2 Packet Size and Performance . . . . . . . . . . . . . P4.3 Performance Short Answer . . . . . . . . . . . . . . . . . . . . P4.4 Microprocessors . . . . . . . . . . . . . . . . . . . . . . . . . P4.4.1 Average CPI . . . . . . . . . . . . . . . . . . . . . . P4.4.2 Why not you too? . . . . . . . . . . . . . . . . . . . P4.4.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . P4.5 Dataow Diagram Optimization . . . . . . . . . . . . . . . . . P4.6 Performance Optimization with Memory Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . CONTENTS ix P4.7 Multiply Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288 P4.7.1 Highest Performance . . . . . . . . . . . . . . . . . . . . . . . 288 P4.7.2 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . 289 291 291 291 294 295 298 5 Optimization 5.1 Pipelining . . . . . . . . . . . . . . . . 5.1.1 Introduction to Pipelining . . . 5.1.2 Partially Pipelined . . . . . . . 5.1.3 Pipelined Version of InstP . . . 5.1.4 Pipelined Version of InstP/InstQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 Timing Analysis 301 6.1 Delays and Denitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301 6.1.1 Background Denitions . . . . . . . . . . . . . . . . . . . . . . . . . . . 
301 6.1.2 Clock-Related Timing Denitions . . . . . . . . . . . . . . . . . . . . . . 302 6.1.2.1 Clock Skew . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302 6.1.2.2 Clock Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . 303 6.1.2.3 Clock Jitter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303 6.1.3 Storage Related Timing Denitions . . . . . . . . . . . . . . . . . . . . . 304 6.1.3.1 Setup Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304 6.1.3.2 Hold Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305 6.1.3.3 Clock-to-Q Time . . . . . . . . . . . . . . . . . . . . . . . . . . 305 6.1.4 Propagation Delays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305 6.1.4.1 Load Delays . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305 6.1.4.2 Interconnect Delays . . . . . . . . . . . . . . . . . . . . . . . . 306 6.1.5 Summary of Delay Factors . . . . . . . . . . . . . . . . . . . . . . . . . . 306 6.1.6 Timing Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306 6.1.6.1 Minimum Clock Period . . . . . . . . . . . . . . . . . . . . . . 307 6.1.6.2 Hold Constraint . . . . . . . . . . . . . . . . . . . . . . . . . . 308 6.1.6.3 Example Timing Violations . . . . . . . . . . . . . . . . . . . . 308 6.2 Timing Analysis of Latches and Flip Flops . . . . . . . . . . . . . . . . . . . . . . 310 6.2.1 Review: Latch, Flip-Flop, Setup, Hold, Clock-to-Q . . . . . . . . . . . . . 310 6.2.2 Simple Multiplexer Latch . . . . . . . . . . . . . . . . . . . . . . . . . . 311 6.2.2.1 Structure and Behaviour of Multiplexer Latch . . . . . . . . . . 311 6.2.2.2 Strategy for Timing Analysis of Storage Devices . . . . . . . . . 312 6.2.2.3 Clock-to-Q Time of a Multiplexer Latch . . . . . . . . . . . . . 313 6.2.2.4 Setup Timing of a Multiplexer Latch . . . . . . . . . . . . . . . 315 6.2.2.5 Hold Time of a Multiplexer Latch . . . . . . . . . . . . . . . . . 
6.2.2.6 Example of a Bad Latch
6.2.3 Timing Analysis of Transmission-Gate Latch
6.2.3.1 Structure and Behaviour of a Transmission Gate (Smith 2.4.3)
6.2.3.2 Structure and Behaviour of Transmission-Gate Latch (Smith 2.5.1)
6.2.3.3 Clock-to-Q Delay for Transmission-Gate Latch
6.2.3.4 Setup and Hold Times for Transmission-Gate Latch
6.2.4 Falling Edge Flip Flop (Smith 2.5.2)
6.2.4.1 Structure and Behaviour of Flip-Flop
6.2.4.2 Clock-to-Q of Flip-Flop
6.2.4.3 Setup of Flip-Flop
6.2.4.4 Hold of Flip-Flop
6.2.5 Timing Analysis of FPGA Cells (Smith 5.1.5)
6.2.5.1 Standard Timing Equations
6.2.5.2 Hierarchical Timing Equations
6.2.5.3 Actel Act 2 Logic Cell
6.2.5.4 Timing Analysis of Actel Sequential Module
6.2.6 Exotic Flop
6.3 Critical Paths and False Paths
6.3.1 Introduction to Critical and False Paths
6.3.1.1 Example of Critical Path in Full Adder
6.3.1.2 Preliminaries for Critical Paths
6.3.1.3 Longest Path and Critical Path
6.3.1.4 Timing Simulation vs Static Timing Analysis
6.3.2 Longest Path
6.3.3 Detecting a False Path
6.3.3.1 Preliminaries for Detecting a False Path
6.3.3.2 Almost-Correct Algorithm to Detect a False Path
6.3.3.3 Examples of Detecting False Paths
6.3.4 Finding the Next Candidate Path
6.3.4.1 Algorithm to Find Next Candidate Path
6.3.4.2 Examples of Finding Next Candidate Path
6.3.5 Correct Algorithm to Find Critical Path
6.3.5.1 Algorithm
6.3.5.2 Examples
6.3.6 Further Extensions
6.4 Analog Timing Model
6.4.1 Timing Model
6.4.1.1 Equation for Output Voltage
6.4.1.2 Extrinsic / Intrinsic Delays
6.5 Elmore Delay Model
6.5.1 Elmore Time Constant
6.5.2 Interconnect with Single Fanout
6.5.3 Interconnect with Multiple Gates in Fanout
6.6 Practical Usage of Timing Analysis
6.6.1 Speed Binning
6.6.1.1 FPGAs, Interconnect, and Synthesis
6.6.2 Worst Case Timing
6.6.2.1 Fanout delay
6.6.2.2 Derating Factors
6.7 Timing Analysis Problems
P6.1 Terminology
P6.2 Hold Time Violations
P6.2.1 Cause
P6.2.2 Behaviour
P6.2.3 Rectification
P6.3 Latch Analysis
P6.4 Critical Path and False Path
P6.5 Critical Path
P6.5.1 Longest Path
P6.5.2 Delay
P6.5.3 Missing Factors
P6.5.4 Critical Path or False Path?
P6.6 Timing Models
P6.7 Short Answer
P6.7.1 Wires in FPGAs
P6.7.2 Age and Time
P6.7.3 Temperature and Delay
P6.8 Worst Case Conditions and Derating Factor
P6.8.1 Worst-Case Commercial
P6.8.2 Worst-Case Industrial
P6.8.3 Worst-Case Industrial, Non-Ambient Junction Temperature
7 Power Analysis and Power-Aware Design
7.1 Overview
7.1.1 Importance of Power and Energy
7.1.2 Industrial Names and Products
7.1.3 Power vs Energy
7.1.4 Batteries, Power and Energy
7.1.4.1 Do Batteries Store Energy or Power?
7.1.4.2 Battery Life and Efficiency
7.1.4.3 Battery Life and Power
7.2 Power Equations
7.2.1 Switching Power
7.2.2 Short-Circuited Power
7.2.3 Leakage Power
7.2.4 Glossary
7.2.5 Note on Power Equations
7.3 Overview of Power Reduction Techniques
7.4 Voltage Reduction for Power Reduction
7.5 Data Encoding for Power Reduction
7.5.1 How Data Encoding Can Reduce Power
7.5.2 Example Problem: Sixteen Pulser
7.5.2.1 Problem Statement
7.5.2.2 Additional Information
7.5.2.3 Answer
7.6 Clock Gating
7.6.1 Introduction to Clock Gating
7.6.2 Implementing Clock Gating
7.6.3 Design Process
7.6.4 Effectiveness of Clock Gating
7.6.5 Example: Reduced Activity Factor with Clock Gating
7.6.6 Clock Gating with Valid-Bit Protocol
7.6.6.1 Valid-Bit Protocol
7.6.6.2 How Many Clock Cycles for Module?
7.6.6.3 Adding Clock-Gating Circuitry
7.6.7 Example: Pipelined Circuit with Clock-Gating
7.7 Power Problems
P7.1 Short Answers
P7.1.1 Power and Temperature
P7.1.2 Leakage Power
P7.1.3 Clock Gating
P7.1.4 Gray Coding
P7.2 VLSI Gurus
P7.2.1 Effect on Power
P7.2.2 Critique
P7.3 Advertising Ratios
P7.4 Vary Supply Voltage
P7.5 Clock Speed Increase Without Power Increase
P7.5.1 Supply Voltage
P7.5.2 Supply Voltage
P7.6 Power Reduction Strategies
P7.6.1 Supply Voltage
P7.6.2 Transistor Sizing
P7.6.3 Adding Registers to Inputs
P7.6.4 Gray Coding
P7.7 Power Consumption on New Chip
P7.7.1 Hypothesis
P7.7.2 Experiment
P7.7.3 Reality
8 Fault Testing and Testability
8.1 Faults and Testing
8.1.1 Overview of Faults and Testing
8.1.1.1 Faults (Smith 14.3)
8.1.1.2 Causes of Faults (Smith 14.3)
8.1.1.3 Testing (Smith 14)
8.1.1.4 Burn In (Smith 14.3.1)
8.1.1.5 Bin Sorting (Smith 5.1.6)
8.1.1.6 Testing Techniques (Smith 14)
8.1.1.7 Design for Testability (DFT) (Smith 14.6)
8.1.2 Example Problem: Economics of Testing (Smith 14.1)
8.1.3 Physical Faults (Smith 14.3.3)
8.1.3.1 Types of Physical Faults
8.1.3.2 Locations of Faults
8.1.3.3 Layout Affects Locations
8.1.3.4 Naming Fault Locations
8.1.4 Detecting a Fault
8.1.4.1 Which Test Vectors will Detect a Fault?
8.1.5 Mathematical Models of Faults (Smith 14.3.4)
8.1.5.1 Single Stuck-At Fault Model
8.1.6 Generate Test Vector to Find a Mathematical Fault (Smith 14.4)
8.1.6.1 Algorithm
8.1.6.2 Example of Finding a Test Vector
8.1.7 Undetectable Faults
8.1.7.1 Redundant Circuitry
8.1.7.2 Curious Circuitry and Fault Detection
8.2 Test Generation
8.2.1 A Small Example
8.2.2 Choosing Test Vectors
8.2.2.1 Fault Domination
8.2.2.2 Fault Equivalence
8.2.2.3 Gate Collapsing
8.2.2.4 Node Collapsing
8.2.2.5 Fault Collapsing Summary
8.2.3 Fault Coverage
8.2.4 Test Vector Generation and Fault Detection
8.2.5 Generate Test Vectors for 100% Coverage
8.2.5.1 Collapse the Faults
8.2.5.2 Check for Fault Domination
8.2.5.3 Required Test Vectors
8.2.5.4 Faults Not Covered by Required Test Vectors
8.2.5.5 Order to Run Test Vectors
8.2.5.6 Summary of Technique to Find and Order Test Vectors
8.2.5.7 Complete Analysis
8.2.6 One Fault Hiding Another
8.3 Scan Testing in General
8.3.1 Structure and Behaviour of Scan Testing
8.3.2 Scan Chains
8.3.2.1 Circuitry in Normal and Scan Mode
8.3.2.2 Scan in Operation
8.3.2.3 Scan in Operation with Example Circuit
8.3.3 Summary of Scan Testing
8.3.4 Time to Test a Chip
8.3.4.1 Example: Time to Test a Chip
8.4 Boundary Scan and JTAG
8.4.1 Boundary Scan History
8.4.2 JTAG Scan Pins
8.4.3 Scan Registers and Cells
8.4.4 Scan Instructions
8.4.5 TAP Controller
8.4.6 Other descriptions of JTAG/IEEE 1149.1
8.5 Built In Self Test
8.5.1 Block Diagram
8.5.1.1 Components
8.5.1.2 Linear Feedback Shift Register (LFSR)
8.5.1.3 Maximal-Length LFSR
8.5.2 Test Generator
8.5.3 Signature Analyzer
8.5.4 Result Checker
8.5.5 Arithmetic over Binary Fields
8.5.6 Shift Registers and Characteristic Polynomials
8.5.6.1 Circuit Multiplication
8.5.7 Bit Streams and Characteristic Polynomials
8.5.8 Division
8.5.9 Signature Analysis: Math and Circuits
8.5.10 Summary
8.6 Scan vs Self Test
8.7 Problems on Faults, Testing, and Testability
P8.1 Based on Smith q14.9: Testing Cost
P8.2 Testing Cost and Total Cost
P8.3 Minimum Number of Faults
P8.4 Smith q14.10: Fault Collapsing
P8.5 Mathematical Models and Reality
P8.6 Undetectable Faults
P8.7 Test Vector Generation
P8.7.1 Choice of Test Vectors
P8.7.2 Number of Test Vectors
P8.8 Time to do a Scan Test
P8.9 BIST
P8.9.1 Characteristic Polynomials
P8.9.2 Test Generation
P8.9.3 Signature Analyzer
P8.9.4 Probability of Catching a Fault
P8.9.5 Probability of Catching a Fault
P8.9.6 Detecting a Specific Fault
P8.9.7 Time to Run Test
P8.10 Power and BIST
P8.11 Timing Hazards and Testability
P8.12 Testing Short Answer
P8.12.1 Are there any physical faults that are detectable by scan testing but not by built-in self testing?
P8.12.2 Are there any physical faults that are detectable by built-in self testing but not by scan testing?
P8.13 Fault Testing
P8.13.1 Design test generator
P8.13.2 Design signature analyzer
P8.13.3 Determine if a fault is detectable
P8.13.4 Testing time
9 Review
9.1 Overview of the Term
9.2 VHDL
9.2.1 VHDL Topics
9.2.2 VHDL Example Problems
9.3 RTL Design Techniques
9.3.1 Design Topics
9.3.2 Design Example Problems
9.4 Functional Verification
9.4.1 Verification Topics
9.4.2 Verification Example Problems
9.5 Performance Analysis and Optimization
9.5.1 Performance Topics
9.5.2 Performance Example Problems
9.6 Timing Analysis
9.6.1 Timing Topics
9.6.2 Timing Example Problems
9.7 Power
9.7.1 Power Topics
9.7.2 Power Example Problems
9.8 Testing
9.8.1 Testing Topics
9.8.2 Testing Example Problems
9.9 Formulas to be Given on Final Exam
II Solutions to Assignment Problems
1 VHDL Problems
P1.1 IEEE 1164
P1.2 VHDL Syntax
P1.3 Flops, Latches, and Combinational Circuitry
P1.4 Counting Clock Cycles
P1.5 Arithmetic Overflow
P1.6 Delta-Cycle Simulation: Pong
P1.7 Delta-Cycle Simulation: Baku
P1.8 Clock-Cycle Simulation
P1.9 VHDL VHDL Behavioural Comparison: Teradactyl
P1.10 VHDL VHDL Behavioural Comparison: Ichtyostega
P1.11 Waveform VHDL Behavioural Comparison
P1.12 Hardware VHDL Comparison
P1.13 8-Bit Register
P1.13.1 Asynchronous Reset
P1.13.2 Discussion
P1.13.3 Testbench for Register
P1.14 Synthesizable VHDL and Hardware
P1.15 Datapath Design
P1.15.1 Correct Implementation?
P1.15.2 Smallest Area
P1.15.3 Shortest Clock Period
2 Design Problems
P2.1 Synthesis
P2.1.1 Data Structures
P2.1.2 Own Code vs Libraries
P2.2 Design Guidelines
P2.3 Dataflow Diagram Optimization
P2.3.1 Resource Usage
P2.3.2 Optimization
P2.4 Dataflow Diagram Design
P2.4.1 Maximum Performance
P2.4.2 Minimum area
P2.5 Michener: Design and Optimization
P2.6 Dataflow Diagrams with Memory Arrays
P2.6.1 Algorithm 1
P2.6.2 Algorithm 2
P2.7 2-bit adder
P2.7.1 Generic Gates
P2.7.2 FPGA
P2.8 Sketches of Problems
3 Functional Verification Problems
P3.1 Carry Save Adder
P3.2 Traffic Light Controller
P3.2.1 Functionality
P3.2.2 Boundary Conditions
P3.2.3 Assertions
P3.3 State Machines and Verification
P3.3.1 Three Different State Machines
P3.3.1.1 Number of Test Scenarios
P3.3.1.2 Length of Test Scenario
P3.3.1.3 Number of Flip Flops
P3.3.2 State Machines in General
P3.4 Test Plan Creation
P3.4.1 Early Tests
P3.4.2 Corner Cases
P3.5 Sketches of Problems
4 Performance Analysis and Optimization Problems
P4.1 Farmer
P4.2 Network and Router
P4.2.1 Maximum Throughput
P4.2.2 Packet Size and Performance
P4.3 Performance Short Answer
P4.4 Microprocessors
P4.4.1 Average CPI
P4.4.2 Why not you too?
P4.4.3 Analysis
P4.5 Dataflow Diagram Optimization
P4.6 Performance Optimization with Memory Arrays
P4.7 Multiply Instruction
P4.7.1 Highest Performance
P4.7.2 Performance Metrics
5 Timing Analysis Problems
P5.1 Terminology
P5.2 Hold Time Violations
P5.2.1 Cause
P5.2.2 Behaviour
P5.2.3 Rectification
P5.3 Latch Analysis
P5.4 Critical Path and False Path
P5.5 Critical Path
P5.5.1 Longest Path
P5.5.2 Delay
P5.5.3 Missing Factors
P5.5.4 Critical Path or False Path?
P5.6 Timing Models
P5.7 Short Answer
P5.7.1 Wires in FPGAs
P5.7.2 Age and Time
P5.7.3 Temperature and Delay
P5.8 Worst Case Conditions and Derating Factor
P5.8.1 Worst-Case Commercial
P5.8.2 Worst-Case Industrial
P5.8.3 Worst-Case Industrial, Non-Ambient Junction Temperature
6 Power Problems
P6.1 Short Answers
P6.1.1 Power and Temperature
P6.1.2 Leakage Power
P6.1.3 Clock Gating
P6.1.4 Gray Coding
P6.2 VLSI Gurus
P6.2.1 Effect on Power
P6.2.2 Critique
P6.3 Advertising Ratios
P6.4 Vary Supply Voltage
P6.5 Clock Speed Increase Without Power Increase
P6.5.1 Supply Voltage
P6.5.2 Supply Voltage
P6.6 Power Reduction Strategies
P6.6.1 Supply Voltage
P6.6.2 Transistor Sizing
P6.6.3 Adding Registers to Inputs
P6.6.4 Gray Coding
P6.7 Power Consumption on New Chip
P6.7.1 Hypothesis
P6.7.2 Experiment
P6.7.3 Reality
7 Problems on Faults, Testing, and Testability
P7.1 Based on Smith q14.9: Testing Cost
P7.2 Testing Cost and Total Cost
P7.3 Minimum Number of Faults
P7.4 Smith q14.10: Fault Collapsing
P7.5 Mathematical Models and Reality
P7.6 Undetectable Faults
P7.7 Test Vector Generation
P7.7.1 Choice of Test Vectors
P7.7.2 Number of Test Vectors
P7.8 Time to do a Scan Test
P7.9 BIST
P7.9.1 Characteristic Polynomials
P7.9.2 Test Generation
P7.9.3 Signature Analyzer
P7.9.4 Probability of Catching a Fault
P7.9.5 Probability of Catching a Fault
P7.9.6 Detecting a Specific Fault
P7.9.7 Time to Run Test
  P7.10 Power and BIST
  P7.11 Timing Hazards and Testability
  P7.12 Testing Short Answer
    P7.12.1 Are there any physical faults that are detectable by scan testing but not by built-in self testing?
    P7.12.2 Are there any physical faults that are detectable by built-in self testing but not by scan testing?
  P7.13 Fault Testing
    P7.13.1 Design test generator
    P7.13.2 Design signature analyzer
    P7.13.3 Determine if a fault is detectable
    P7.13.4 Testing time

Part I
Course Notes

Chapter 1
VHDL: The Language

1.1 Introduction to VHDL

1.1.1 Levels of Abstraction

There are many different levels of abstraction for working with hardware:

Quantum: Schrödinger's equations describe the movement of electrons and holes through material.

Energy band: 2-dimensional diagrams that capture the essential features of Schrödinger's equations. Energy-band diagrams are commonly used in nano-scale engineering.

Transistor: Signal values and time are continuous (analog). Each transistor is modeled by a resistor-capacitor network. Overall behaviour is defined by differential equations in terms of the resistors and capacitors. Spice is a typical simulation tool.

Switch: Time is continuous, but voltage may be either continuous or discrete.
Linear equations are used, rather than differential equations. A rising edge may be modeled as a linear rise over some range of time, or the time between a definite low value and a definite high value may be modeled as having an undefined or rising value.

Gate: Transistors are grouped together into gates (e.g. AND, OR, NOT). Voltages are discrete values such as pure Boolean (0 or 1) or IEEE Standard Logic 1164, which has representations for different types of unknown or undefined values. Time may be continuous or may be discrete. If discrete, a common unit is the delay through a single inverter (e.g. a NOT gate has a delay of 1 and an AND gate has a delay of 2).

Register transfer level: The essential characteristic of the register transfer level is that the behaviour of hardware is modeled as assignments to registers and combinational signals. Equations are written where a register signal is a function of other signals (e.g. c = a and b;). The assignments may be either combinational or registered. Combinational assignments happen instantaneously and registered assignments take exactly one clock cycle. There are variations on the pure register-transfer level. For example, time may be measured in clock phases rather than clock cycles, so as to allow assignments on either the rising or falling edge of a clock. Another variation is to have multiple clocks that run at different speeds; for example, a clock on a bus might run at half the speed of the primary clock for the chip.

Transaction level: The basic unit of computation is a transaction, such as executing an instruction on a microprocessor, transferring data across a bus, or accessing memory. Time is usually measured as an estimate (e.g. a memory write requires 15 clock cycles, or a bus transfer requires 250 ns). The building blocks of the transaction level are processors, controllers, memory arrays, busses, and intellectual property (IP) blocks (e.g. UARTs).
The behaviour of the building blocks is described with software-like models, often written in behavioural VHDL, SystemC, or SystemVerilog. The transaction level has many similarities to a software model of a distributed system.

Electronic-system level: Looks at an entire electronic system, with both hardware and software.

In this course, we will focus on the register-transfer level. In the second half of the course, we will look at how analog phenomena, such as timing and power, affect the register-transfer level. In these chapters we will occasionally dip down into the transistor, switch, and gate levels.

1.1.2 VHDL Origins and History

VHDL = VHSIC Hardware Description Language
VHSIC = Very High Speed Integrated Circuit

    "The VHSIC Hardware Description Language (VHDL) is a formal notation intended for use in all phases of the creation of electronic systems. Because it is both machine readable and human readable, it supports the development, verification, synthesis and testing of hardware designs, the communication of hardware design data, and the maintenance, modification, and procurement of hardware."
    Language Reference Manual (IEEE Design Automation Standards Committee, 1993a)

VHDL is a lot more than synthesis of digital hardware: the definition above covers development, verification, synthesis, testing, communication, maintenance, modification, and procurement of hardware designs.

VHDL History

- Developed by the United States Department of Defense as part of the very high speed integrated circuit (VHSIC) program in the early 1980s.
- The Department of Defense intended VHDL to be used for the documentation, simulation and verification of electronic systems.
- Goals:
    - improve design process over schematic entry
    - standardize design descriptions amongst multiple vendors
    - portable and extensible
- Inspired by the Ada programming language
    - large: 97 keywords, 94 syntactic rules
    - verbose (designed by committee)
    - static type checking, overloading
    - complicated syntax: parentheses are used for both expression grouping and array indexing
      Example:
          a <= b * (3 + c);   -- integer
          a <= (3 + c);       -- 1-element array of integers
- Standardized by IEEE in 1987 (IEEE 1076-1987), revised in 1993, 2000.
- In 1993 the IEEE standard VHDL package for model interoperability, STD_LOGIC_1164 (IEEE Standard 1164-1993), was developed.
    - std_logic_1164 defines 9 different values for signals
- In 1997 the IEEE standard packages for arithmetic over std_logic and bit signals were defined (IEEE Standard 1076.3-1997).
    - numeric_std defines arithmetic over std_logic vectors and integers.
      Note: This is the package that you should use for arithmetic. Don't use std_logic_arith: it has less uniform support for mixed integer/signal arithmetic and has a greater tendency for differences between tools.
    - numeric_bit defines arithmetic over bit vectors and integers. We won't use bit signals in this course, so you don't need to worry about this package.

1.1.3 Semantics

The original goal of VHDL was to simulate circuits. The semantics of the language define circuit behaviour.

[Figure: c <= a AND b; is simulated to produce waveforms for a, b, and c]

But now, VHDL is used in simulation and synthesis. Synthesis is concerned with the structure of the circuit.

Synthesis: converts one type of description (behavioural) into another, lower level, description (usually a netlist).

[Figure: c <= a AND b; is synthesized into an AND gate with inputs a and b and output c]

Synthesis is a computer-aided design (CAD) technique that transforms a designer's concise, high-level description of a circuit into a structural description of a circuit.

CAD Tools
CAD tools allow designers to automate lower-level design processes in implementing the desired functionality of a system.

NOTE: EDA = Electronic Design Automation. In digital hardware design EDA = CAD.

Synthesis vs Simulation

For synthesis, we want the code we write to define the structure of the hardware that is generated.

[Figure: c <= a AND b; is synthesized into an AND gate with inputs a and b and output c]

The VHDL semantics define the behaviour of the hardware that is generated, not the structure of the hardware. The scenario below complies with the semantics of VHDL, because the two synthesized circuits produce the same behaviour. If the two synthesized circuits had different behaviour, then the scenario would not comply with the VHDL Standard.

[Figure: c <= a AND b; is synthesized into two circuits with different structure; simulating either circuit gives the same behaviour for a, b, and c]

1.1.4 Synthesis of a Simulation-Based Language

- Not all of VHDL is synthesizable
      c <= a AND b;               (synthesizable)
      c <= a AND b AFTER 2 ns;    (NOT synthesizable)
    - how do you build a circuit with exactly 2 ns of delay through an AND gate?
    - more examples of non-synthesizable code are in section 1.11
- Different synthesis tools support different subsets of VHDL
- Some tools generate erroneous hardware for some code
    - behaviour of hardware differs from VHDL semantics
- Some tools generate unpredictable hardware (hardware that has the correct behaviour, but undesirable or weird structure).
- There is an IEEE standard (1076.6) for a synthesizable subset of VHDL, but tool vendors don't yet conform to it. (Most vendors still don't have full support for the 1993 extensions to VHDL!) For more info, see http://www.vhdl.org/siwg/.
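As a rough illustration of the distinction above, the sketch below places a synthesizable assignment and a delayed, simulation-only assignment side by side. The entity name and_delay_demo and the port names c_syn and c_sim are made up for this example. How a particular synthesis tool reacts to the AFTER clause (ignoring it with a warning, or rejecting the code) varies between tools and is not claimed here.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity, for illustration only.
entity and_delay_demo is
  port (
    a, b  : in  std_logic;
    c_syn : out std_logic;  -- behaviour only, no timing: synthesizable
    c_sim : out std_logic   -- models a 2 ns gate delay: simulation only
  );
end and_delay_demo;

architecture main of and_delay_demo is
begin
  -- Describes what the gate computes; the tool chooses the structure.
  c_syn <= a and b;

  -- Describes what the gate computes and exactly when the result appears.
  -- A simulator can honour the delay; a synthesis tool cannot build
  -- "exactly 2 ns" out of real gates.
  c_sim <= a and b after 2 ns;
end main;
```

In simulation, c_sim follows c_syn 2 ns later (input pulses shorter than 2 ns are filtered out by VHDL's default inertial delay model).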
1.1.5 Solution to Synthesis Sanity

- Pick a high-quality synthesis tool and study its documentation thoroughly
- Learn the idioms of the tool
    - Different VHDL code with the same behaviour can result in very different circuits
- Be careful if you have to port VHDL code from one tool to another
- KISS: Keep It Simple Stupid
    - VHDL examples will illustrate reliable coding techniques for the synthesis tools from Synopsys, Mentor Graphics, Altera, Xilinx, and most other companies as well.
    - Follow the coding guidelines and examples from lecture
- As you write VHDL, think about the hardware you expect to get.
  Note: If you can't predict the hardware, then the hardware probably won't be very good (small, fast, correct, etc.)

1.1.6 Standard Logic 1164

At the core of VHDL is a package named STANDARD that defines a type named bit with values of '0' and '1'. For simulation, it is helpful to have additional values, such as "undefined" and "high impedance". Many companies created their own (incompatible) definitions of signal types for simulation. To regain compatibility amongst packages from different companies, the IEEE defined std_logic_1164 to be the standard type for signal values in VHDL simulation.

    'U'   uninitialized
    'X'   strong unknown
    '0'   strong 0
    '1'   strong 1
    'Z'   high impedance
    'W'   weak unknown
    'L'   weak 0
    'H'   weak 1
    '-'   don't care

The most common values are: 'U', 'X', '0', '1'.

If you see 'X' in a simulation, it usually means that there is a mistake in your code.

Every VHDL file that you write should begin with:

    library ieee;
    use ieee.std_logic_1164.all;

Note: std_logic vs boolean. The std_logic values '1' and '0' are not the same as the boolean values true and false. For example, you must write if a = '1' then .... The code if a then ... will not typecheck if a is of type std_logic.

From a VLSI perspective, a weak value will come from a smaller gate. One aspect of VHDL that we don't touch on in ece427 is resolution, which describes how to determine the value of a signal if the signal is driven by more than one process.
(In ece427, we restrict ourselves to having each signal be driven by (be the target of) exactly one process.) The std_logic_1164 library provides a resolution function to deal with situations where different processes drive the same signal with different values. In this situation, a strong value (e.g. '1') will overpower a weak value (e.g. 'L'). If two processes drive the signal with different strong values (e.g. '1' and '0') the signal resolves to a strong unknown ('X'). If a signal is driven with two different weak values (e.g. 'H' and 'L'), the signal resolves to a weak unknown ('W').

1.2 Comparison of VHDL to Other Hardware Description Languages

1.2.1 VHDL Disadvantages

- Some VHDL programs cannot be synthesized
- Different tools support different subsets of VHDL.
- Different tools generate different circuits for the same code
- VHDL is verbose
    - Many characters to say something simple
- VHDL is complicated and confusing
    - Many different ways of saying the same thing
    - Constructs that have similar purpose have very different syntax (case vs. select)
    - Constructs that have similar syntax have very different semantics (variables vs signals)
- Hardware that is synthesized is not always obvious (when is a signal a flip-flop vs latch vs combinational)
    - The infamous latch inference problem (see section 1.5.2 for more information)

1.2.2 VHDL Advantages

- VHDL supports unsynthesizable constructs that are useful in writing high-level models, testbenches and other non-hardware or non-synthesizable artifacts that we need in hardware design. VHDL can be used throughout a large portion of the design process in different capacities, from specification to implementation to verification.
- VHDL has static typechecking: many errors can be caught before synthesis and/or simulation. (In this respect, it is more similar to Java than to C.)
- VHDL has a rich collection of datatypes
- VHDL is a full-featured language with a good module system (libraries and packages).
- VHDL has a well-defined standard.

1.2.3 VHDL and Other Languages

1.2.3.1 VHDL vs Verilog

- Verilog is a simpler language: smaller language, simple circuits are easier to write
- VHDL has more features than Verilog
    - richer set of data types and strong type checking
    - VHDL offers more flexibility and expressivity for constructing large systems.
- The VHDL Standard is more standard than the Verilog Standard
    - VHDL and Verilog have simulation-based semantics
    - Simulation vendors generally conform to the VHDL standard
    - Some Verilog constructs don't simulate the same in different tools
- VHDL is used more than Verilog in Europe and Japan
- Verilog is used more than VHDL in North America
- South-East Asia, India, South America: ?????

1.2.3.2 VHDL vs SystemC

- SystemC looks like C: familiar syntax
- C is often used in algorithmic descriptions of circuits, so why not try to use it for synthesizable code as well?
- If you think VHDL is hard to synthesize, try C....
- SystemC simulation is slower than advertised

1.2.3.3 VHDL vs Other Hardware Description Languages

Superlog: A proposed language that was based on Verilog and C. Basic core comes from Verilog. C-like extensions included to make the language more expressive and powerful. Developed by the Co-Design company, but no longer under active development. Superlog has been superseded by SystemVerilog, see below.

SystemVerilog: A language originally proposed by Co-Design and now being standardized by Accellera, an organization aimed at standardizing EDA languages. SystemVerilog is inspired by Verilog, Superlog, and SystemC. SystemVerilog is a superset of Verilog aimed to support both high-level design and verification.

Esterel: A language evolving from academia to commercial viability. Very clean semantics. Aimed at state machines, limited support for datapath operations.
1.2.3.4 Summary of VHDL Evaluation

- VHDL is far from perfect and has lots of annoying characteristics
- VHDL is a better language for education than Verilog because the static typechecking enforces good software engineering practices
- The richness of VHDL will be useful in creating concise high-level models and powerful testbenches

1.3 Overview of Syntax

This section is just a brief overview of the syntax of VHDL, focusing on the constructs that are most commonly used. For more information, read a book on VHDL and use online resources. (Look for VHDL under the Documentation tab in the E&CE 427 web pages.)

1.3.1 Syntactic Categories

There are five major categories of syntactic constructs. (There are many, many minor categories and subcategories of constructs.)

Library units (section 1.3.2)
    Top-level constructs (packages, entities, architectures)
Concurrent statements (section 1.3.4)
    Statements executed at the same time (in parallel)
Sequential statements (section 1.3.7)
    Statements executed in series (one after the other)
Expressions
    Arithmetic (section 1.10), Boolean, vectors, etc.
Declarations
    Components, signals, variables, types, functions, ....

1.3.2 Library Units

Library units are the top-level syntactic constructs in VHDL. They are used to define and include libraries, declare and implement interfaces, define packages of declarations and otherwise bind together VHDL code.

Package body
    define the contents of a library
Packages
    determine which parts of the library are externally visible
Use clause
    use a library in an entity/architecture or another package
    technically, use clauses are part of entities and packages, but they precede the entity/package keyword, so we list them as top-level constructs
Entity (section 1.3.3)
    define interface to circuit
Architecture (section 1.3.3)
    define internal signals and gates of circuit

1.3.3 Entities and Architecture

Each hardware module is described with an Entity/Architecture pair.

Figure 1.1: Entity and Architecture

Entity: interface
    names, modes (in / out), types of externally visible signals of circuit
Architecture: internals
    structure and behaviour of module

    library ieee;
    use ieee.std_logic_1164.all;

    entity and_or is
      port (
        a, b, c : in  std_logic ;
        z       : out std_logic
      );
    end and_or;

Figure 1.2: Example of an entity

The syntax of VHDL is defined using a variation on Backus-Naur forms (BNF).

    [ use_clause ]
    entity ENTITYID is
      [ port (
          SIGNALID : (in | out) TYPEID [ := expr ] ;
        ); ]
      [ declaration ]
      [ begin
          concurrent_statement ]
    end [ entity ] ENTITYID ;

Figure 1.3: Simplified grammar of entity

    architecture main of and_or is
      signal x : std_logic;
    begin
      x <= a AND b;
      z <= x OR (a AND c);
    end main;

Figure 1.4: Example of architecture

    [ use_clause ]
    architecture ARCHID of ENTITYID is
      [ declaration ]
    begin
      [ concurrent_statement ]
    end [ architecture ] ARCHID ;

Figure 1.5: Simplified grammar of architecture

1.3.4 Concurrent Statements

- Architectures contain concurrent statements
- Concurrent statements execute in parallel (Figure 1.6)
    - Concurrent statements make VHDL fundamentally different from most software languages.
    - Hardware (gates) naturally execute in parallel; VHDL mimics the behaviour of real hardware.
    - At each infinitesimally small moment of time, each gate:
        1. samples its inputs
        2. computes the value of its output
        3. drives the output

    architecture main of bowser is
    begin
      x1 <= a AND b;
      x2 <= NOT x1;
      z  <= NOT x2;
    end main;

    architecture main of bowser is
    begin
      z  <= NOT x2;
      x2 <= NOT x1;
      x1 <= a AND b;
    end main;

Figure 1.6: The order of concurrent statements doesn't matter (both architectures describe the same circuit with signals a, b, x1, x2, z)

conditional assignment
    syntax:   ... <= ... when ... else ... ;
    notes:    normal assignment (<=); if-then-else style (uses when)
    example:  c <= a+b when sel='1' else a+c when sel='0' else "0000";

selected assignment
    syntax:   with ... select ... <= ... when ... | ... , ... when ... | ... , ... when ... | ... ;
    notes:    case/switch style assignment
    example:  with color select d <= "00" when red , "01" when ...;

component instantiation
    syntax:   ... : ... port map ( ... => ... , ... );
    notes:    use an existing circuit; section 1.3.5
    example:  add1 : adder port map( a => f, b => g, s => h, co => i);

for-generate
    syntax:   ... : for ... in ... generate ... end generate;
    notes:    replicate some hardware
    example:  bgen : for i in 1 to 7 generate
                b(i) <= a(7-i);
              end generate;

if-generate
    syntax:   ... : if ... generate ... end generate;
    notes:    conditionally create some hardware
    example:  okgen : if optgoal /= fast generate
                result <= ((a and b) or (d and not e)) or g;
              end generate;
              fastgen : if optgoal = fast generate
                result <= '1';
              end generate;

process
    syntax:   process ... begin ... end process;
    notes:    the body of a process is executed sequentially; Sections 1.3.6, 1.6

Figure 1.7: The most commonly used concurrent statements

1.3.5 Component Declaration and Instantiations

There are two different syntaxes for component declaration and instantiation. The VHDL-93 syntax is much more concise than the VHDL-87 syntax. Not all tools support the VHDL-93 syntax. For E&CE 427, some of the tools that we use do not support the VHDL-93 syntax, so we are stuck with the VHDL-87 syntax.

1.3.6 Processes

- Processes are used to describe complex and potentially unsynthesizable behaviour
- A process is a concurrent statement (Section 1.3.4).
- The body of a process contains sequential statements (Section 1.3.7)
- Processes are the most complex and difficult to understand part of VHDL (Sections 1.5 and 1.6)

    process
    begin
      y <= a AND b;
      z <= '0';
      wait until rising_edge(clk);
      if (a = '1') then
        z <= '1';
        y <= '0';
        wait until rising_edge(clk);
      else
        y <= a OR b;
      end if;
    end process;

    process (a, b, c)
    begin
      y <= a AND b;
      if (a = '1') then
        z1 <= b AND c;
        z2 <= NOT c;
      else
        z1 <= b OR c;
        z2 <= c;
      end if;
    end process;

Figure 1.8: Examples of processes

- Processes must have either a sensitivity list or at least one wait statement on each execution path through the process.
- Processes cannot have both a sensitivity list and a wait statement.

Sensitivity List

The sensitivity list contains the signals that are read in the process.

A process is executed when a signal in its sensitivity list changes value.

An important coding guideline to ensure consistent synthesis and simulation results is to include all signals that are read in the sensitivity list. If you forget some signals, you will either end up with unpredictable hardware and simulation results (different results from different programs) or undesirable hardware (latches where you expected purely combinational hardware). For more on this topic, see sections 1.5.2 and 1.6.

There is one exception to this rule: for a process that implements a flip-flop with an if rising_edge statement, it is acceptable to include only the clock signal in the sensitivity list; other signals may be included, but are not needed.

    [ PROCLAB : ] process ( sensitivity_list )
      [ declaration ]
    begin
      sequential_statement
    end process [ PROCLAB ] ;

Figure 1.9: Simplified grammar of process

1.3.7 Sequential Statements

Used inside processes and functions.

    wait                wait until ... ;
    signal assignment   ... <= ... ;
    if-then-else        if ... then ... elsif ... end if;
    case                case ... is
                          when ... | ... => ... ;
                          when ... => ... ;
                        end case;
    loop                loop ... end loop;
    while loop          while ... loop ... end loop;
    for loop            for ... in ... loop ... end loop;
    next                next ... ;

Figure 1.10: The most commonly used sequential statements

1.3.8 A Few More Miscellaneous VHDL Features

Some constructs that are useful and will be described in later chapters and sections:

report: print a message on stderr while simulating
assert: assertions about behaviour of signals, very useful with report statements.
generics: parameters to an entity that are defined at elaboration time.
attributes: predefined functions for different datatypes. For example: high and low indices of a vector.

1.4 Concurrent vs Sequential Statements

All concurrent assignments can be translated into sequential statements. But, not all sequential statements can be translated into concurrent statements.

1.4.1 Concurrent Assignment vs Process

The two code fragments below have identical behaviour:

    architecture main of tiny is
    begin
      b <= a;
    end main;

    architecture main of tiny is
    begin
      process (a) begin
        b <= a;
      end process;
    end main;

1.4.2 Conditional Assignment vs If Statements

The two code fragments below have identical behaviour:

Concurrent Statements

    t <= <val1> when <cond>
         else <val2>;

Sequential Statements

    if <cond> then
      t <= <val1>;
    else
      t <= <val2>;
    end if;

1.4.3 Selected Assignment vs Case Statement

The two code fragments below have identical behaviour:

Concurrent Statements

    with <expr> select
      t <= <val1> when <choices1>,
           <val2> when <choices2>,
           <val3> when <choices3>;

Sequential Statements

    case <expr> is
      when <choices1> => t <= <val1>;
      when <choices2> => t <= <val2>;
      when <choices3> => t <= <val3>;
    end case;

1.4.4 Coding Style

Code that's easy to write with sequential statements, but difficult with concurrent:

Sequential Statements

    case <expr> is
      when <choice1> =>
        if <cond> then
          o <= <expr1>;
        else
          o <= <expr2>;
        end if;
      when <choice2> => ...
    end case;

Concurrent Statements

Overall structure:

    with <expr> select
      t <= ... when <choice1>,
           ... when <choice2>;

Failed attempt:

    with <expr> select
      t <= -- want to write:
           --   <val1> when <cond>
           --   else <val2>
           -- but conditional assignment
           -- is illegal here
           when c1,
           ...
           when c2;

Concurrent statement with correct behaviour, but messy:

    t <= <expr1> when (expr = <choice1> AND <cond>)
         else <expr2> when (expr = <choice1> AND NOT <cond>)
         else . . . ;

1.5 Overview of Processes

Processes are the most difficult VHDL construct to understand. This section gives an overview of processes. Section 1.6 gives the details of the semantics of processes.

- Within a process, statements are executed almost sequentially
- Among processes, execution is done in parallel
- Remember: a process is a concurrent statement!

    entity ENTITYID is
      interface declarations
    end ENTITYID;

    architecture ARCHID of ENTITYID is
    begin
      concurrent statements
      process begin
        sequential statements
      end process;
      concurrent statements
    end ARCHID;

Figure 1.11: Sequential statements in a process

Key concepts in VHDL semantics for processes:

- VHDL mimics hardware
- Hardware (gates) execute in parallel
- Processes execute in parallel with each other
- All possible orders of executing processes must produce the same simulation results (waveforms)
- If a signal is not assigned a value, then it holds its previous value

All orders of executing concurrent statements must produce the same waveforms.

It doesn't matter whether you are running on a single-threaded operating system, on a multithreaded operating system, on a massively parallel supercomputer, or on a special hardware emulator with one FPGA chip per VHDL process: all simulations must be the same.

These concepts are the motivation for the semantics of executing processes in VHDL (Section 1.6) and lead to the phenomenon of latch-inference (Section 1.5.2).
    architecture
      procA: process
        stmtA1;
        stmtA2;
        stmtA3;
      end process;
      procB: process
        stmtB1;
        stmtB2;
      end process;

    Execution sequences:
      single threaded, procA before procB:         A1 A2 A3 B1 B2
      single threaded, procB before procA:         B1 B2 A1 A2 A3
      multithreaded, procA and procB in parallel:  A1 A2 A3 alongside B1 B2

Figure 1.12: Different process execution sequences

Figure 1.13: All execution orders must have same behaviour

Sections 1.5.1 to 1.5.3 discuss the hardware generated by processes.

Sections 1.6 to 1.6.6 discuss the behaviour and execution of processes.

1.5.1 Combinational Process vs Clocked Process

Each well-written synthesizable process is either combinational or clocked. Some synthesizable processes that do not conform to our coding guidelines are both combinational and clocked. For example, in a flip-flop with an asynchronous reset, the output is a combinational function of the reset signal and a clocked function of the data input signal. We will deal only with processes that follow our coding conventions, and so we will continue to say that each process is either combinational xor clocked.

Combinational process:

- Executing the process takes part of one clock cycle
- Target signals are outputs of combinational circuitry
- A combinational process must have a sensitivity list
- A combinational process must not have any wait statements
- A combinational process must not contain any calls to rising_edge or falling_edge
- The hardware for a combinational process is just combinational circuitry
Clocked process:

- Executing the process takes one (or more) clock cycles
- Target signals are outputs of flops
- Process contains one or more wait or if rising_edge statements
- Hardware contains combinational circuitry and flip-flops

Note: Clocked processes are sometimes called "sequential processes", but this can be easily confused with "sequential statements", so in E&CE 427 we'll refer to synthesizable processes as either combinational or clocked.

Example Processes

Combinational Process

    process (a,b,c)
    begin
      p1 <= a;
      if (b = c) then
        p2 <= b;
      else
        p2 <= a;
      end if;
    end process;

Clocked Processes

    process
    begin
      wait until rising_edge(clk);
      b <= a;
    end process;

    process (clk)
    begin
      if rising_edge(clk) then
        b <= a;
      end if;
    end process;

1.5.2 Latch Inference

The semantics of VHDL require that if a signal is assigned a value on some passes through a process and not on other passes, then on a pass through the process when the signal is not assigned a value, it must maintain its value from the previous pass.

    process (a, b, c)
    begin
      if (a = '1') then
        z1 <= b;
        z2 <= b;
      else
        z1 <= c;
      end if;
    end process;

Figure 1.14: Example of latch inference (z2 is not assigned in the else branch, so it must hold its value)

When a signal's value must be stored, VHDL infers a latch or a flip-flop in the hardware to store the value. If you want a latch or a flip-flop for the signal, then latch inference is good. If you want combinational circuitry, then latch inference is bad.

Loop, Latch, Flop

[Figure: three circuits with inputs a and b and output z: a combinational loop, a latch (with enable EN), and a flip-flop (D, Q)]

Question: Write VHDL code for each of the above circuits.

Causes of Latch Inference

Usually, latch inference refers to the unintentional creation of latches.

The most common cause of unintended latch inference is missing assignments to signals in if-then-else and case statements.
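One common way to avoid the unintended latch on z2 in Figure 1.14 is to give every target signal a value on every path through the process, for instance with a default assignment at the top of the process. The sketch below is a minimal illustration of that idea; the entity name no_latch and the default value '0' for z2 are assumptions made for this example, not part of the original figure.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity wrapping the process from Figure 1.14.
entity no_latch is
  port (
    a, b, c : in  std_logic;
    z1, z2  : out std_logic
  );
end no_latch;

architecture main of no_latch is
begin
  process (a, b, c)
  begin
    -- Default assignment (assumed value): z2 now gets a value on
    -- every pass through the process, so nothing has to be stored.
    z2 <= '0';
    if (a = '1') then
      z1 <= b;
      z2 <= b;   -- overrides the default on this path
    else
      z1 <= c;   -- z2 keeps the default '0' on this path
    end if;
  end process;
end main;
```

Because a later assignment in the same pass overrides an earlier one, z2 takes the value of b when a = '1' and the default otherwise, so both outputs are purely combinational functions of a, b, and c.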
Latch inference happens during elaboration. When using the Synopsys tools, look for "Inferred memory devices" in the output or log files.

1.5.3 Combinational vs Flopped Signals

- Signals assigned to in combinational processes are combinational.
- Signals assigned to in clocked processes are outputs of flip-flops.

1.6 Details of Process Execution

1.6.1 Temporal Granularities of Simulation

There are several different granularities of time to analyze VHDL behaviour. In this course, we will discuss three major granularities: clock cycles, timing simulation, and delta cycles.

clock-cycle
    - smallest unit of time is a clock cycle
    - combinational logic has zero delay
    - flip-flops have a delay of one clock cycle
    - used for simulation early in the design cycle
    - fastest simulation run times

timing simulation
    - smallest unit of time is a nano, pico, or femto second
    - combinational logic and wires have delay as computed by timing analysis tools
    - flip-flops have setup, hold, and clock-to-Q timing parameters
    - used for simulation when fine-tuning design and confirming that timing constraints are satisfied
    - slow simulation times for large circuits

delta cycles
    - units of time are artifacts of VHDL semantics and simulation software
    - simulation cycles, delta cycles, and simulation steps are infinitesimally small amounts of time
    - VHDL semantics are defined in terms of these concepts

In assignments and exams, you will need to be able to simulate VHDL code at each of the three different levels of temporal granularity. In the laboratories and project, you will use simulation programs for both clock-cycle simulation and timing simulation. We don't have access to a program that will produce delta-cycle waveforms, but if anyone is looking for a challenging co-op job or fourth-year design project....

For the remainder of section 1.6, we'll look at only the delta cycle view of the world.
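Delta cycles can be observed indirectly in an ordinary simulator: a wait for 0 ns statement suspends a process until the next simulation cycle without advancing simulated time. The self-checking sketch below applies this to the three-gate chain of Figure 1.6. The entity name delta_demo, the initial values, and the 10 ns settling time are arbitrary choices made for this example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench; it has no ports.
entity delta_demo is
end delta_demo;

architecture sim of delta_demo is
  signal a, b, x1, x2, z : std_logic := '0';
begin
  -- Same three dependent gates as in Figure 1.6.
  x1 <= a and b;
  x2 <= not x1;
  z  <= not x2;

  stimulus : process
  begin
    wait for 10 ns;   -- settle: x1 = '0', x2 = '1', z = '0'
    a <= '1';
    b <= '1';

    wait for 0 ns;    -- delta 1: a and b are updated, x1 is still old
    assert a = '1' and b = '1' and x1 = '0'
      report "unexpected values after delta 1" severity error;

    wait for 0 ns;    -- delta 2: x1 is updated, x2 is still old
    assert x1 = '1' and x2 = '1'
      report "unexpected values after delta 2" severity error;

    wait for 0 ns;    -- delta 3: x2 is updated, z is still old
    assert x2 = '0' and z = '0'
      report "unexpected values after delta 3" severity error;

    wait for 0 ns;    -- delta 4: z is updated
    assert z = '1'
      report "unexpected value after delta 4" severity error;

    -- All four delta cycles took place at the same simulated time.
    assert now = 10 ns
      report "simulated time advanced unexpectedly" severity error;

    wait;             -- suspend forever
  end process;
end sim;
```

The change on a and b ripples through one gate per delta cycle, yet a waveform viewer with nanosecond resolution would show x1, x2, and z all changing at 10 ns.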
1.6.2 Intuition Behind Delta-Cycle Simulation

Zero-delay simulation might appear to be simpler than simulation with delays through gates (timing simulation), but in reality, zero-delay simulation algorithms are more complicated than algorithms for timing simulation. The reason is that in zero-delay simulation, a sequence of dependent events must appear to happen instantaneously (in zero time). In particular, the effect of an event must propagate instantaneously through the combinational circuitry.

Two fundamental rules for zero-delay simulation:
1. Events appear to propagate through combinational circuitry instantaneously.
2. All of the gates appear to operate in parallel.

To make it appear that events propagate instantaneously, VHDL introduces an artificial unit of time, the delta cycle, to represent an infinitesimally small amount of time. In each delta cycle, every gate in the circuit will sample its inputs, compute its result, and drive its output signal with the result.

Because software executes serially, a simulator cannot run/simulate multiple gates in parallel. Instead, the simulator must simulate the gates one at a time, but make the waveforms appear as if all of the gates were simulated in parallel. In each delta cycle, the simulator will simulate any gate whose input changed in the previous delta cycle. To preserve the illusion that the gates ran in parallel, the effect of simulating a gate remains invisible until the end of the delta cycle.

1.6.3 Definitions and Algorithm

1.6.3.1 Process Modes

An architecture contains a set of processes. Each process is in one of the following modes: active, suspended, or postponed.

Note ("postponed"): This use of the word "postponed" differs from that in the VHDL Standard. We won't be using postponed processes as defined in the Standard.

Note ("postponed"): "Postponed" in VHDL terminology is a synonym for some operating-systems usage of "ready" to describe a process that is ready to execute.
Suspended
- Nothing to currently execute
- A process stays suspended until the event that it is waiting for occurs: either a change in a signal on its sensitivity list or the condition in a wait statement

Postponed
- Wants to execute, but not currently active
- A process stays postponed until the simulator chooses it from the pool of postponed processes

Active
- Currently executing
- A process stays active until it hits a wait statement or sensitivity list, at which point it suspends

[State diagram: postponed --activate--> active --suspend--> suspended --resume--> postponed]

Figure 1.15: Process modes

1.6.3.2 Simulation Algorithm

The algorithm presented here is a simplification of the actual algorithm in Section 12.6 of the VHDL Standard. The most significant simplification is that this algorithm does not support delayed assignments. To support delayed assignments, each signal's provisional value would be generalized to an event wheel, which is a list containing the times and values for multiple provisional assignments in the future.

A somewhat ironic note: only six of the two hundred pages in the VHDL Standard are devoted to the semantics of executing processes.

The Algorithm

Simulations start at step 1 with all processes postponed and all signals with a default value (e.g., 'U' for std_logic).

1. While there are postponed processes:
   (a) Pick one or more postponed processes to execute (become active).
   (b) As a process executes, assignments to signals are provisional: new values do not become visible until step 3.
   (c) A process executes until it hits its sensitivity list or a wait statement, at which point it suspends. At a wait statement, the process will suspend even if the condition is true during the current simulation cycle.
   (d) Processes that become suspended stay suspended until there are no more postponed or active processes.
2. Each process looks at signals that changed value (provisional value differs from visible value) and at the simulation time. If a signal in a process's sensitivity list changed value, or if the wait condition on which a process is suspended became true, then the process resumes (becomes postponed).
3. Each signal that changed value is updated with its provisional value (the provisional value becomes visible).
4. If there are no postponed processes, then increment simulation time to the next scheduled event.

Note (parallel execution): In n-threaded execution, at most n processes are active at a time.

1.6.3.3 Delta-Cycle Definitions

Definition, simulation step: Executing one sequential assignment or process mode change.

Definition, simulation cycle: The operations that occur in one iteration of the simulation algorithm.

Definition, delta cycle: A simulation cycle that does not advance simulation time. Equivalently: a simulation cycle with zero-delay assignments where the assignment causes a process to resume.

Definition, simulation round: A sequence of simulation cycles that all have the same simulation time. Equivalently: a contiguous sequence of zero or more delta cycles followed by a simulation cycle that increments time (i.e., the simulation cycle is not a delta cycle).

Note (official and unofficial terminology): Simulation cycle and delta cycle are official definitions in the VHDL Standard. Simulation step and simulation round are not standard definitions. They are used in E&CE 427 because we need words to associate with the concepts that they describe.

1.6.4 Example 1: Process Execution (Bamboozle)

This example (Bamboozle) and the next example (Flummox, section 1.6.5) are very similar. The VHDL code for the circuit is slightly different, but the hardware that is generated is the same. The stimulus for signals a and b also differs.
    library ieee;
    use ieee.std_logic_1164.all;

    entity bamboozle is
    begin
    end bamboozle;

    architecture main of bamboozle is
      signal a, b, c, d, e : std_logic;
    begin
      procA : process (a, b) begin
        c <= a AND b;
      end process;

      procB : process (b, c, d) begin
        d <= NOT c;
        e <= b AND d;
      end process;

      procC : process begin
        a <= '0';
        b <= '1';
        wait for 10 ns;
        a <= '1';
        wait for 2 ns;
        b <= '0';
        wait for 3 ns;
        a <= '0';
        wait for 20 ns;
      end process;
    end main;

Figure 1.16: Example bamboozle circuit for process execution

The steps of the simulation are listed below in order. Most steps are shown in the slides, not in the notes. Steps marked [figure] have a figure in the notes that shows the code beside a table of process modes (procA, procB, procC) and signal values (a, b, c, d, e) against sim round, sim cycle, and delta cycle.

Initial conditions
Step 1(a): Activate procA [figure]
Step 1(c): Suspend procA
Step 1(a): Activate procC
Step 1(b): Provisional assignment to a
Step 1(b): Provisional assignment to b [figure]
Step 1(a): Activate procB
Step 1(b): Provisional assignment to d
Step 1(b): Provisional assignment to e
Step 1(c): Suspend procB [figure]
All processes suspended [figure]
Step 3: Update signal values [figure]
Step 4: Simulation time remains at 0 ns (delta cycle) [figure]

Step 1(a): Activate procA
Step 1(b): Provisional assignment to c
Step 1(c): Suspend procA
Step 1(a): Activate procB
Step 1(b): Provisional assignment to d
Step 1(b): Provisional assignment to e
Step 1(c): Suspend procB
All processes suspended [figure]
Step 3: Update signal values
Step 4: Simulation time remains at 0 ns (delta cycle)
Compact simulation cycle

Begin next simulation cycle
Step 1(a): Activate procB
Step 1(b): Provisional assignment to d
Step 1(b): Provisional assignment to e
Step 1(c): Suspend procB
All processes suspended [figure]
Step 3: Update signal values [figure]
Compact simulation cycle

Begin next simulation cycle
Step 1(a): Activate procB
Step 1(b): Provisional assignment to d
Step 1(b): Provisional assignment to e
Step 1(c): Suspend procB [figure]
Step 3: Update signal values [figure]
Compact simulation cycle

Begin next simulation cycle
Step 1: No postponed processes [figure]
Compact simulation cycle

Begin next simulation cycle
Step 1(a): Activate procC
Step 1(b): Provisional assignment to a
Step 1(c): Suspend procC
Step 2: Check sensitivity list; resume processes
Step 3: Update signal values [figure]
Compact simulation cycle

1.6.5 Example 2: Process Execution (Flummox)

This example is a variation of the Bamboozle example from section 1.6.4.
    library ieee;
    use ieee.std_logic_1164.all;

    entity flummox is
    begin
    end flummox;

    architecture main of flummox is
      signal a, b, c, d, e : std_logic;
    begin
      proc1 : process (a, b, c) begin
        c <= a AND b;
        d <= NOT c;
      end process;

      proc2 : process (b, d) begin
        e <= b AND d;
      end process;

      proc3 : process begin
        a <= '1';
        b <= '0';
        wait for 3 ns;
        b <= '1';
        wait for 99 ns;
      end process;
    end main;

Figure 1.17: Example flummox circuit for process execution

[Figure: delta-cycle table for flummox with rows sim round, sim cycle, delta cycle, proc1, proc2, proc3, a, b, c, d, e; simulation rounds at 0ns, 3ns, and 102ns]

To get a more natural view of the behaviour of the signals, we draw just the waveforms and use a timescale of nanoseconds plus delta cycles:

[Waveform: a, b, c, d, e against 0ns, +1, +2, +3, 3ns, +1, +2, +3, 102ns]

Finally, we draw the behaviour of the signals using the standard time scale of nanoseconds. Notice that the delta cycles within a simulation round all collapse to the left, so the signals change value exactly at the nanosecond boundaries. Also, the glitch on e disappears.

Answer:

[Waveform: a, b, c, d, e against 0ns, 1ns, 2ns, 3ns, 4ns, 100ns, 101ns, 102ns]

Note and Questions

Note: If a signal is updated with the same value it had in the previous simulation cycle, then it does not change, and therefore does not trigger processes to resume.

Question: What are the different granularities of time that occur when doing delta-cycle simulation?

Answer: simulation step, delta cycle, simulation cycle, simulation round

Question: What is the order of granularity, from finest to coarsest, amongst the different granularities related to delta-cycle simulation?

Answer: Same order as listed just above.
Note: Delta cycles have a finer granularity than simulation cycles, because delta cycles do not advance time, while simulation cycles that are not delta cycles do advance time.

1.6.6 Example: Need for Provisional Assignments

This is an example of processes where updating signals during a simulation cycle leads to different results for different process execution orderings.

    architecture main of swindle is
    begin
      p_c : process (a, b) begin
        c <= a AND b;
      end process;

      p_d : process (a, c) begin
        d <= a XOR c;
      end process;
    end main;

Figure 1.18: Circuit to illustrate need for provisional assignments (inputs a, b; internal signal c; output d)

1. Start with all signals at 0.
2. Simultaneously change to a = 1 and b = 1.

If assignments are not visible within the same simulation cycle (correct: i.e., provisional assignments are used):

[Figure: two delta-cycle tables with rows p_c, p_d, a, b, c, d, one for each scheduling order]

- If p_c is scheduled before p_d, then d will have a 1 pulse.
- If p_d is scheduled before p_c, then d will have a 1 pulse.

If assignments are visible within the same simulation cycle (incorrect):

[Figure: two delta-cycle tables with rows p_c, p_d, a, b, c, d, one for each scheduling order]

- If p_c is scheduled before p_d, then d will stay constant 0.
- If p_d is scheduled before p_c, then d will have a 1 pulse.

With provisional assignments, both orders of scheduling processes result in the same behaviour on all signals. Without provisional assignments, different scheduling orders result in different behaviour.

1.6.7 Delta-Cycle Simulations of Flip-Flops

This example illustrates the delta-cycle simulation of a flip-flop. Notice how the delta-cycle simulation captures the expected behaviour of the flip-flop: the signal q changes at the same time (10ns) as the rising edge on the clock.
    p_clk : process begin
      clk <= '0';
      wait for 10 ns;
      clk <= '1';
      wait for 10 ns;
    end process;

    flop : process (clk) begin
      if rising_edge(clk) then
        q <= a;
      end if;
    end process;

    p_a : process begin
      a <= '0';
      wait for 15 ns;
      a <= '1';
      wait for 20 ns;
    end process;

[Figure: delta-cycle table with rows sim round, sim cycle, delta cycle, p_a, p_clk, flop, a, clk, q; simulation rounds at 0ns, 0ns+1, 10ns, 15ns, 20ns, 30ns, 35ns]

Redraw with Normal Time Scale

To clarify the behaviour, we redraw the same simulation using a normal time scale.

[Waveform: a, clk, q against 0ns to 35ns in 5ns steps]

Back-to-Back Flops

In the previous simulation, the input to the flip-flop (a) changed several nanoseconds before the rising edge on the clock. In zero-delay simulation, the output of a flip-flop changes exactly on the rising edge of the clock. This means that the input to the next flip-flop will change at exactly the same time as a rising edge. This example illustrates how delta-cycle simulation handles the situation correctly.

    p_clk : process begin
      clk <= '0';
      wait for 10 ns;
      clk <= '1';
      wait for 10 ns;
    end process;

    flops : process (clk) begin
      if rising_edge(clk) then
        q1 <= a;
        q2 <= q1;
      end if;
    end process;

    p_a : process begin
      a <= '0';
      wait for 15 ns;
      a <= '1';
      wait for 20 ns;
    end process;

[Figure: delta-cycle table with rows sim round, sim cycle, delta cycle, p_a, p_clk, flops, a, clk, q1, q2; simulation rounds at 10ns, 15ns, 20ns, 30ns, 35ns]

Redraw with Normal Time Scale

To clarify the behaviour, we redraw the same simulation using a normal time scale.
[Waveform: a, clk, q1, q2 against 0ns to 35ns in 5ns steps]

Testbenches and Clock Phases

    env : process begin
      a <= '1';
      clk <= '0';
      wait for 10 ns;
      a <= '0';
      clk <= '1';
      wait for 10 ns;
    end process;

    flop : process (clk) begin
      if rising_edge(clk) then
        q1 <= a;
      end if;
    end process;

[Figure: delta-cycle table with rows sim round, sim cycle, delta cycle, env, flop1, flop2, a, clk, q1; simulation rounds at 0ns, 0ns+1, 10ns, 20ns]

Redraw with Normal Time Scale

[Waveform: a, clk, q1 against 0ns, 10ns, 20ns]

Note (testbench signals): For consistent results across different simulators, simulation scripts vs test benches, and timing simulation vs zero-delay simulation, do not change signals in your testbench or script at the same time as the clock changes.

[Waveform, 0ns to 60ns, signals a, clk, q1: a is output of clocked or combinational process]

[Waveform, 0ns to 60ns, signals a, clk, q1: a is output of timed process (testbench or environment): POOR DESIGN]

[Waveform, 0ns to 60ns, signals a, clk, q1: a is output of timed process (testbench or environment): GOOD DESIGN]

1.7 Register-Transfer Level Simulation

1.7.1 Technique for Register-Transfer Level Simulation

The register-transfer level is a coarser level of temporal abstraction than the delta-cycle level. In delta-cycle simulation, many delta cycles can elapse without an increment in real time (e.g., nanoseconds). In register-transfer-level simulation, all of the events that take place in the same moment of real time take place at the same moment in the simulation. In other words, all of the events that take place at the same time are drawn in the same column of the waveform diagram.
Register-transfer-level simulation can be done for legal VHDL code, either synthesizable or unsynthesizable, so long as the code does not contain combinational loops. For any piece of VHDL code without combinational loops, the register-transfer-level simulation and the delta-cycle simulation will have the same value for each signal at the end of each simulation round.

A combinational loop is a circuit that contains a cyclic path through the circuit that includes only combinational gates. Combinational loops can cause signals to oscillate, which, in delta-cycle simulation with zero-delay assignments, corresponds to an infinite sequence of delta cycles.

RTL Simulation Technique

1. Pre-processing
   (a) Separate processes into combinational and non-combinational (clocked and timed)
   (b) Decompose each combinational process into separate processes with one target signal per process
   (c) Sort processes into topological order based on dependencies
2. For each clock cycle or unit of time:
   (a) Run non-combinational processes in any order. Non-combinational assignments read from the earlier clock cycle / time step.
   (b) Run combinational processes in topological order. Combinational assignments read from the current clock cycle / time step.

1.7.2 Examples of RTL Simulation

Combinational Process Decomposition

Original code:

    process (a, b, c) begin
      if a = '1' then
        d <= b;
        e <= c;
      else
        d <= not b;
        e <= b and c;
      end if;
    end process;

After decomposition into separate processes for d and e:

    process (a, b, c) begin
      if a = '1' then
        d <= b;
      else
        d <= not b;
      end if;
    end process;

    process (a, b, c) begin
      if a = '1' then
        e <= c;
      else
        e <= b and c;
      end if;
    end process;

RTL Simulation Example

Revisit an earlier example, but do register-transfer-level simulation, rather than delta-cycle simulation.

1. Original code:

    proc1 : process (a, b, c) begin
      c <= a AND b;
      d <= NOT c;
    end process;

    proc2 : process (b, d) begin
      e <= b AND d;
    end process;

    proc3 : process begin
      a <= '1';
      b <= '0';
      wait for 3 ns;
      b <= '1';
      wait for 99 ns;
    end process;

2. Decompose combinational processes into single-target processes:

    proc1c : process (a, b) begin
      c <= a AND b;
    end process;

    proc1d : process (c) begin
      d <= NOT c;
    end process;

    proc2 : process (b, d) begin
      e <= b AND d;
    end process;

    proc3 : process begin
      a <= '1';
      b <= '0';
      wait for 3 ns;
      b <= '1';
      wait for 99 ns;
    end process;

3. Combinational processes are already in topological order, because each signal is assigned a value before it is read.
4. Run the timed process (proc3) until it suspends at "wait for 3 ns;". The signal a gets 1 from 0 to 3 ns. The signal b gets 0 from 0 to 3 ns.
5. Run proc1c. The signal c gets a AND b (1 AND 0 = 0) from 0 to 3 ns.
6. Run proc1d. The signal d gets NOT c (NOT 0 = 1) from 0 to 3 ns.
7. Run proc2. The signal e gets b AND d (0 AND 1 = 0) from 0 to 3 ns.
8. Run the timed process until it suspends at "wait for 99 ns;", which takes us from 3ns to 102ns.
9. Run combinational processes in topological order to calculate values on c, d, e from 3ns to 102ns.

Question: Draw the RTL waveforms that correspond to the delta-cycle waveform below.

[Figure: delta-cycle table for flummox with rows sim round, sim cycle, delta cycle, proc1, proc2, proc3, a, b, c, d, e; columns 0ns, 0ns+1, 0ns+2, 3ns, 3ns+1, 3ns+2, 3ns+3, 102ns]

Answer (all signals are U before 0ns):

    signal   0ns to 3ns   3ns to 102ns
    a        1            1
    b        0            1
    c        0            1
    d        1            0
    e        0            0

Example: Communicating State Machines
Note: It is easier to do a simulation by hand if you start your clock at 0 and use the first clock phase in the waveform diagram for the first values that your VHDL code assigns to signals.

Simulate If-Then-Else, Wait Until

    huey : process begin
      clk <= '0';
      wait for 10 ns;
      clk <= '1';
      wait for 10 ns;
    end process;

    dewey : process begin
      a <= to_unsigned(0,4);
      wait until re(clk);
      while (a < 4) loop
        a <= a + 1;
        wait until re(clk);
      end loop;
    end process;

    louie : process begin
      d <= '1';
      wait until re(clk);
      if (a < 2) then
        d <= '0';
        wait until re(clk);
      end if;
    end process;

[Blank waveform to fill in: clk, a, d]

A Related Simulation

Small changes to the code can cause significant changes to the behaviour.

    riri : process begin
      clk <= '1';
      wait for 10 ns;
      clk <= '0';
      wait for 10 ns;
    end process;

    fifi : process begin
      a <= to_unsigned(0,4);
      wait until re(clk);
      while (a < 4) loop
        a <= a + 1;
        wait until re(clk);
      end loop;
    end process;

    loulou : process begin
      wait until re(clk);
      d <= '1';
      if (a < 2) then
        d <= '0';
        wait until re(clk);
      end if;
    end process;

[Blank waveform to fill in: clk, a, d against a time axis from 0 to 120]

1.8 VHDL and Hardware Building Blocks

This section outlines the building blocks for register-transfer-level design and how to write VHDL code for the building blocks.

1.8.1 Basic Building Blocks

[Figure: symbols for a 2:1 mux (also: n-to-1 muxes), a flip-flop (pins D, CE, R, S, Q), a single-port memory (WE, A, DI, DO), and a dual-port memory (WE, A0, DI0, DO0, A1, DO1)]

    Hardware                               VHDL
    AND, OR, NAND, NOR, XOR, XNOR          and, or, nand, nor, xor, xnor
    multiplexer                            if-then-else, case statement, selected assignment, conditional assignment
    adder, subtracter, negater             +, -
    shifter, rotater                       sll, srl, sla, sra, rol, ror
    flip-flop                              wait until, if-then-else, rising_edge
    memory array, register file, queue     2-d array or library component

Figure 1.19: RTL Building Blocks
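The multiplexer row of Figure 1.19 lists four VHDL constructs. As a sketch, the same 2:1 mux can be written in each of the four styles; the entity mux_styles and its port names are hypothetical and chosen only for this illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity for illustration only: one 2:1 mux, four ways.
entity mux_styles is
  port (
    sel    : in  std_logic;
    d0, d1 : in  std_logic;
    y_if, y_case, y_sel, y_cond : out std_logic
  );
end mux_styles;

architecture main of mux_styles is
begin
  -- Style 1: if-then-else inside a combinational process
  process (sel, d0, d1)
  begin
    if sel = '1' then
      y_if <= d1;
    else
      y_if <= d0;
    end if;
  end process;

  -- Style 2: case statement inside a combinational process
  process (sel, d0, d1)
  begin
    case sel is
      when '1'    => y_case <= d1;
      when others => y_case <= d0;
    end case;
  end process;

  -- Style 3: selected assignment (concurrent statement)
  with sel select
    y_sel <= d1 when '1',
             d0 when others;

  -- Style 4: conditional assignment (concurrent statement)
  y_cond <= d1 when sel = '1' else d0;
end main;
```

Each output is assigned on every path, so all four styles should describe purely combinational hardware with no inferred latches.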
1.8.2 Deprecated Building Blocks for RTL

Some of the common gates you have encountered in previous courses should be avoided when synthesizing register-transfer-level hardware, particularly if FPGAs are the implementation technology.

1.8.2.1 An Aside on Flip-Flops and Latches

flip-flop: Edge sensitive: output only changes on rising (or falling) edge of clock
latch: Level sensitive: output changes whenever clock is high (or low)

A common implementation of a flip-flop is a pair of latches (master/slave flop).

Latches are sometimes called "transparent latches", because they are transparent (input directly connected to output) when the clock is high.

The clock to a latch is sometimes called the "enable" line.

There is more information in the course notes on timing analysis for storage devices (Section 6.2).

1.8.2.2 Deprecated Hardware

Latches
- Use flops, not latches
- Latch-based designs are susceptible to timing problems
- The transparent phase of a latch can let a signal "leak" through a latch, causing the signal to affect the output one clock cycle too early
- It's possible for a latch-based circuit to simulate correctly, but not work in real hardware, because the timing delays on the real hardware don't match those predicted in synthesis

T, JK, SR, etc flip-flops
- Limit yourself to D-type flip-flops
- Some FPGA and ASIC cell libraries include only D-type flip-flops. Others, such as Altera's APEX FPGAs, can be configured as D, T, JK, or SR flip-flops.
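As a sketch of how behaviour normally associated with a T flip-flop can be obtained while still writing only D-style clocked code, the process below toggles its output when t is asserted. The entity toggle_from_d, the input t, and the internal signal q_int are hypothetical names introduced for this illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity for illustration only.
entity toggle_from_d is
  port (
    clk, reset, t : in  std_logic;
    q             : out std_logic
  );
end toggle_from_d;

architecture main of toggle_from_d is
  -- Internal copy of the output, because an out port cannot be
  -- read inside the architecture in VHDL-93.
  signal q_int : std_logic;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        q_int <= '0';          -- synchronous reset
      elsif t = '1' then
        q_int <= not q_int;    -- toggle
      end if;                  -- otherwise hold the stored value
    end if;
  end process;

  q <= q_int;
end main;
```

The code describes what value is stored on each rising edge, which is all that a D-type flop with some combinational logic on its input needs; whether a particular tool maps this onto a dedicated T-type cell is left to the synthesis tool.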
Tri-State Buffers
- Use multiplexers, not tri-state buffers
- Tri-state designs are susceptible to stability and signal integrity problems
- Getting tri-state designs to simulate correctly is difficult; some library components don't support tri-state signals
- Tri-state designs rely on the code never letting two signals drive the bus at the same time
- It can be difficult to check that bus arbitration will always work correctly
- Manufacturing and environmental variability can make real hardware not work correctly even if it simulates correctly
- Typical industrial practice is to avoid use of tri-state signals on a chip, but allow tri-state signals at the board level

Note: Unfortunately and surprisingly, PalmChip has been awarded a US patent for using uni-directional busses (i.e., multiplexers) for system-on-chip designs. The patent was filed in 2000, so all fourth-year design projects since 2000 that use muxes on FPGAs will need to pay royalties to PalmChip.

1.8.3 Hardware and Code for Flops

1.8.3.1 Flops with Waits and Ifs

The two code fragments below synthesize to identical hardware (flops).

If:

    process (clk) begin
      if rising_edge(clk) then
        q <= d;
      end if;
    end process;

Wait:

    process begin
      wait until rising_edge(clk);
      q <= d;
    end process;

1.8.3.2 Flops with Synchronous Reset

The two code fragments below synthesize to identical hardware (flops with synchronous reset). Notice that the synchronous reset is really nothing more than an AND gate on the input.

If:

    process (clk) begin
      if rising_edge(clk) then
        if (reset = '1') then
          q <= '0';
        else
          q <= d;
        end if;
      end if;
    end process;

Wait:

    process begin
      wait until rising_edge(clk);
      if (reset = '1') then
        q <= '0';
      else
        q <= d;
      end if;
    end process;

1.8.3.3 Flops with Chip-Enable

The two code fragments below synthesize to identical hardware (flops with chip-enable lines).
If:

    process (clk) begin
      if rising_edge(clk) then
        if (ce = '1') then
          q <= d;
        end if;
      end if;
    end process;

Wait:

    process begin
      wait until rising_edge(clk);
      if (ce = '1') then
        q <= d;
      end if;
    end process;

1.8.3.4 Flop with Chip-Enable and Mux on Input

The two code fragments below synthesize to identical hardware (flops with chip-enable lines and muxes on inputs).

If:

    process (clk) begin
      if rising_edge(clk) then
        if (ce = '1') then
          if (sel = '1') then
            q <= d1;
          else
            q <= d0;
          end if;
        end if;
      end if;
    end process;

Wait:

    process begin
      wait until rising_edge(clk);
      if (ce = '1') then
        if (sel = '1') then
          q <= d1;
        else
          q <= d0;
        end if;
      end if;
    end process;

1.8.3.5 Flops with Chip-Enable, Muxes, and Reset

The two code fragments below synthesize to identical hardware (flops with chip-enable lines, muxes on inputs, and synchronous reset). Notice that the synchronous reset is really nothing more than a mux, or an AND gate on the input.

Note: The specific combination and order of tests is important to guarantee that the circuit synthesizes to a flop with a chip enable, as opposed to a level-sensitive latch testing the chip enable and/or reset followed by a flop.

Note: The chip-enable pin on the flop is connected to both ce and reset. If the chip-enable pin was not connected to reset, then the flop would ignore reset unless chip-enable was asserted.

If:

    process (clk) begin
      if rising_edge(clk) then
        if (ce = '1' or reset = '1') then
          if (reset = '1') then
            q <= '0';
          elsif (sel = '1') then
            q <= d1;
          else
            q <= d0;
          end if;
        end if;
      end if;
    end process;

Wait:

    process begin
      wait until rising_edge(clk);
      if (ce = '1' or reset = '1') then
        if (reset = '1') then
          q <= '0';
        elsif (sel = '1') then
          q <= d1;
        else
          q <= d0;
        end if;
      end if;
    end process;

1.8.4 An Example Sequential Circuit

There are many ways to write VHDL code that synthesizes to the schematic in figure 1.20. The major choices are:

1. Categories of signals
   (a) All signals are outputs of flip-flops or inputs (no combinational signals)
   (b) Signals include both flopped and combinational
2. Number of flopped signals per process
   (a) All flopped signals in a single process
   (b) Some processes with multiple flopped signals
   (c) Each flopped signal in its own process
3. Style of flop code
   (a) Flops use if statements
   (b) Flops use wait statements

Some examples of these different options are shown in figures 1.21-1.24.

[Schematic: inputs sel, reset, clk; two flip-flops with R and S pins; internal signal a; output c]

    entity and_not_reg is
      port (
        reset, clk, sel : in  std_logic;
        c               : out std_logic
      );
    end;

Schematic and entity for examples of different code organizations in Figures 1.21-1.24

Figure 1.20: Schematic and entity for and_not_reg

One Process, Flops, Wait

    architecture one_proc of and_not_reg is
      signal a : std_logic;
    begin
      process begin
        wait until rising_edge(clk);
        if (reset = '1') then
          a <= '0';
        elsif (sel = '1') then
          a <= NOT a;
        else
          a <= a;
        end if;
        c <= NOT a;
      end process;
    end one_proc;

Figure 1.21: Implementation of Figure 1.20: all signals are flops, all flops in one process, flops use waits

Two Processes, Flops, Wait

    architecture two_proc_wait of and_not_reg is
      signal a : std_logic;
    begin
      process begin
        wait until rising_edge(clk);
        if (reset = '1') then
          a <= '0';
        elsif (sel = '1') then
          a <= NOT a;
        else
          a <= a;
        end if;
      end process;

      process begin
        wait until rising_edge(clk);
        c <= NOT a;
      end process;
    end two_proc_wait;

Figure 1.22: Implementation of Figure 1.20: all signals are flops, one flop per process, flops use waits

Two Processes with If-Then-Else
    architecture two_proc_if of and_not_reg is
      signal a : std_logic;
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          if (reset = '1') then
            a <= '0';
          elsif (sel = '1') then
            a <= NOT a;
          else
            a <= a;
          end if;
        end if;
      end process;
      process (clk)
      begin
        if rising_edge(clk) then
          c <= NOT a;
        end if;
      end process;
    end two_proc_if;

Figure 1.23: Implementation of Figure 1.20: all signals are flops, one flop per process, flops use if-then-else

Concurrent Statements

    architecture comb of and_not_reg is
      signal a, b, d : std_logic;
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          if (reset = '1') then
            a <= '0';
          else
            a <= d;
          end if;
        end if;
      end process;
      process (clk)
      begin
        if rising_edge(clk) then
          c <= NOT a;
        end if;
      end process;
      d <= b when (sel = '1') else a;
      b <= NOT a;
    end comb;

Figure 1.24: Implementation of Figure 1.20: flopped and combinational signals, one flop per process, flops use if-then-else

1.9 Arrays and Vectors

VHDL supports multidimensional arrays over elements of any type. The most common array is an array of std_logic signals, which has a predefined type: std_logic_vector. Throughout the rest of this section, we will discuss only std_logic_vector, but the rules apply to arrays of any type.

VHDL supports reading from and assigning to slices (aka discrete subranges) of vectors. The rules for working with slices of vectors are listed below and illustrated in Figure 1.25.

1. The ranges on both sides of the assignment must have the same length.
2. The direction (downto or to) of each slice must match the direction of the signal declaration.
3. The direction of the target and expression may be different.
Declarations

    ----------------------------------------------------
    a, b       : in  std_logic_vector(15 downto 0);
    c, d, e    : out std_logic_vector(15 downto 0);
    ----------------------------------------------------
    ax, bx     : in  std_logic_vector(0 to 15);
    cx, dx, ex : out std_logic_vector(0 to 15);
    ----------------------------------------------------
    m, n       : in  unsigned(15 downto 0);
    p, q, r    : out unsigned(15 downto 0);
    ----------------------------------------------------
    w, x       : in  signed(15 downto 0);
    y, z       : out signed(15 downto 0)
    ----------------------------------------------------

Legal code

    c(3 downto 0) <= a(15 downto 12);
    cx(0 to 3)    <= a(15 downto 12);
    (e(3), e(4))  <= bx(12 to 13);
    (e(5), e(6))  <= b(13 downto 12);

Illegal code

    d(0 to 3)     <= a(15 to 12);            -- slice dirs must be same as decl
    e(3) & e(2)   <= b(12 to 13);            -- syntax error on &
    p(3 downto 0) <= (m + n)(3 downto 0);    -- syntax error on )(
    z(3 downto 0) <= m(15 downto 12);        -- types on lhs and rhs must match

Figure 1.25: Illustration of Rules for Slices of Vectors

1.10 Arithmetic

VHDL includes all of the common arithmetic and logical operators.

Use the VHDL arithmetic operators and let the synthesis tool choose the better implementation for you. It is almost impossible for a hand-coded implementation to beat vendor-supplied arithmetic libraries.

To use the operators, you must choose which arithmetic package you wish to use (Section 1.10.1). The arithmetic operators are overloaded, and you can usually use any mixture of constants and signals of different types that you need (Section 1.10.3). However, you might need to convert a signal from one type (e.g. std_logic_vector) to another type (e.g. integer) (Section 1.10.7).

1.10.1 Arithmetic Packages

Rushton Ch-7 covers arithmetic packages. Rushton Appendix A.5 has the code listing for the numeric_std package.

To do arithmetic with signals, use the numeric_std package.
This package defines types signed and unsigned, which are std_logic vectors on which you can do signed or unsigned arithmetic.

numeric_std supersedes earlier arithmetic packages, such as std_logic_arith.

Use only one arithmetic package, otherwise the different definitions will clash and you can get strange error messages.

1.10.2 Shift and Rotate Operations

Shift and rotate operations are described with three-character acronyms:

    (shift / rotate) (left / right) (arithmetic / logical)

The shift right arithmetic (sra) operation preserves the sign of the operand, by copying the most significant bit into lower bit positions. The shift left arithmetic (sla) does the analogous operation, except that the least significant bit is copied.

1.10.3 Overloading of Arithmetic

The arithmetic operators +, -, and * are overloaded on signed vectors, unsigned vectors, and integers. Tables 1.1 to 1.4 show the different combinations of target and source types and widths that can be used.

Table 1.1: Overloading of Arithmetic Operations (+, -)

    target     src1/2     src2/1     result
    --------   --------   --------   -----------------
    unsigned   unsigned   integer    OK
    -          unsigned   signed     fails in analysis

In these tables "-" means don't care. Also, src1/2 and src2/1 mean first or second operand, and respectively second or first operand. The first line of the table means that either the first operand is unsigned and the second is an integer, or the second operand is unsigned and the first is an integer. Or, more concisely: one of the operands is unsigned and the other is integer.
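The overloading rules in Table 1.1 can be sketched as follows. This is a minimal illustration, assuming the numeric_std package; the entity name add_demo and all of its port names are hypothetical and do not come from the notes. The shift uses the numeric_std function shift_right, which on a signed operand replicates the sign bit, rather than the sra operator.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity, for illustration only.
entity add_demo is
  port (
    m, n : in  unsigned(7 downto 0);
    w    : in  signed(7 downto 0);
    p, q : out unsigned(7 downto 0);
    y, r : out signed(7 downto 0)
  );
end add_demo;

architecture main of add_demo is
begin
  p <= m + n;              -- unsigned + unsigned: OK
  q <= m + 1;              -- unsigned + integer: OK
  y <= w - 3;              -- signed - integer: OK
  -- q <= m + w;           -- unsigned + signed: no matching "+", fails in analysis
  r <= shift_right(w, 2);  -- arithmetic shift right: sign bit is copied in
end main;
```

All targets here have the same width as the vector operands, so the width rules of the next section do not come into play.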
1.10.4 Different Widths and Arithmetic

Table 1.2: Different Vector Widths and Arithmetic Operations (+, -)

    target   src1/2   src2/1   result
    ------   ------   ------   --------------------
    narrow   wide     -        fails in elaboration
    wide     narrow   int      fails in elaboration
    wide     wide     -        OK
    narrow   narrow   narrow   OK
    narrow   narrow   int      OK

Example vectors

    wide     unsigned(7 downto 0)
    narrow   unsigned(4 downto 0)

1.10.5 Overloading of Comparisons

Table 1.3: Overloading of Comparison Operations (=, /=, >=, >, <)

    src1/2     src2/1     result
    --------   --------   -----------------
    unsigned   integer    OK
    signed     integer    OK
    unsigned   signed     fails in analysis

1.10.6 Different Widths and Comparisons

Table 1.4: Different Vector Widths and Comparison Operations (=, /=, >=, >, <)

    src1/2   src2/1   result
    ------   ------   ------
    wide     -        OK
    narrow   -        OK

1.10.7 Type Conversion

The functions unsigned, signed, to_integer, to_unsigned and to_signed are used to convert between integers, std_logic vectors, signed vectors and unsigned vectors.

If you convert between two types of the same width, then no additional hardware will be generated.

The listing below summarizes the types of these functions.

    unsigned( val : std_logic_vector )               return unsigned;
    signed( val : std_logic_vector )                 return signed;
    to_integer( val : signed )                       return integer;
    to_integer( val : unsigned )                     return integer;
    to_unsigned( val : integer; width : natural )    return unsigned;
    to_signed( val : integer; width : natural )      return signed;

The most common need to convert between two types arises when using a signal as an index into an array. To use a signal as an index into an array, you must convert the signal into an integer using the function to_integer (Figure 1.26).

    signal i : unsigned( 3 downto 0);
    signal a : std_logic_vector(15 downto 0);
    ...
    ... a(i) ...                  -- BAD: won't typecheck
    ... a( to_integer(i) ) ...    -- OK

Avoid (or at least take care when) converting a signal into an integer and then performing arithmetic on the signal. The default size for integers is 32 bits, so sometimes when a signal is converted into an integer, the resulting signals will be 32 bits wide.
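One way to follow the advice about integer arithmetic is to do the arithmetic on the vector type and convert to integer only at the point of indexing. The sketch below is an illustration under that assumption; the entity name index_demo and its signals are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity, for illustration only.
entity index_demo is
  port (
    base, offset : in  unsigned(3 downto 0);
    vec          : in  std_logic_vector(15 downto 0);
    bit_out      : out std_logic
  );
end index_demo;

architecture main of index_demo is
  signal idx : unsigned(3 downto 0);
begin
  -- The addition is done on 4-bit unsigned vectors, so the result is
  -- 4 bits wide and wraps around modulo 16.
  idx <= base + offset;

  -- The conversion to integer happens only where the index is needed.
  -- idx is always in the range 0 to 15, which matches vec's range.
  bit_out <= vec( to_integer(idx) );
end main;
```

Writing the index as to_integer(base) + to_integer(offset) instead would perform the addition on integers, and the sum could fall outside the range of vec.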
    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;
    ...
    signal bit_sig : std_logic;
    signal uns_sig : unsigned(7 downto 0);
    signal vec_sig : std_logic_vector(255 downto 0);
    ...
    bit_sig <= vec_sig( to_integer(uns_sig) );
    ...

Figure 1.26: Using an unsigned signal as an index to array

To convert a std_logic_vector signal into an integer, you must first say whether the signal should be interpreted as signed or unsigned. As illustrated in Figure 1.27, this is done by:

1. Convert the std_logic_vector signal to signed or unsigned, using the function signed or unsigned
2. Convert the signed or unsigned signal into an integer, using to_integer

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;
    ...
    signal bit_sig : std_logic;
    signal std_sig : std_logic_vector(7 downto 0);
    signal vec_sig : std_logic_vector(255 downto 0);
    ...
    bit_sig <= vec_sig( to_integer( unsigned( std_sig ) ) );
    ...

Figure 1.27: Using a std_logic_vector as an index to array

1.11 Synthesizable vs Non-Synthesizable Code

Synthesis is done by matching VHDL code against templates or patterns. It's important to use idioms that your synthesis tool recognizes. If you aren't careful, you could write code that has the same behaviour as one of the idioms, but which results in inefficient or incorrect hardware. Section 1.8 described common idioms and the resulting hardware.

Most synthesis tools agree on a large set of idioms, and will reliably generate hardware for these idioms. This section is based on the idioms that Synopsys, Xilinx, Altera, and Mentor Graphics are able to synthesize. One exception is that Altera's Quartus does not support implicit state machines (as of v5.0).

Section 1.11.1 gives rules for unsynthesizable VHDL code. Section 1.11.2 gives rules for code that is synthesizable, but violates the ece427 guidelines for good practices. The ece427 coding guidelines are designed to produce circuits suitable for FPGAs.
Bad code for FPGAs produces circuits with the following features:

- latches
- asynchronous resets
- combinational loops
- multiple drivers for a signal
- tri-state buffers

We limit our definition of bad practice to code that produces undesirable hardware. Coding styles that lead to inefficient hardware might be useful in the early stages of the design process, when the focus is on functionality and not optimality. As such, inefficient code is not considered bad practice. Poor coding styles that do not affect the hardware, for example, including extraneous signals in a sensitivity list, should certainly be avoided, but fall into the general realm of programming guidelines and will not be discussed.

1.11.1 Unsynthesizable Code

1.11.1.1 Initial Values

Initial values on signals (UNSYNTHESIZABLE)

    signal bad_signal : std_logic := '0';

Reason: In most implementation technologies, when a circuit powers up, the values on signals are completely random. Some FPGAs are an exception to this. For some FPGAs, when a chip is powered up, all flip flops will be '0'. For other FPGAs, the initial values can be programmed.

1.11.1.2 Wait For

Wait for length of time (UNSYNTHESIZABLE)

    wait for 10 ns;

Reason: Delays through circuits are dependent upon both the circuit and its operating environment, particularly supply voltage and temperature.

1.11.1.3 Different Wait Conditions

wait statements with different conditions in a process (UNSYNTHESIZABLE)

    -- different clock signals
    process
    begin
      wait until rising_edge(clk1);
      x <= a;
      wait until rising_edge(clk2);
      x <= a;
    end process;

    -- different clock edges
    process
    begin
      wait until rising_edge(clk);
      x <= a;
      wait until falling_edge(clk);
      x <= a;
    end process;

Reason: Processes with multiple wait statements are turned into finite state machines. The wait statements denote transitions between states. The target signals in the process are outputs of flip flops.
Using different wait conditions would require the flip flops to use different clock signals at different times. Multiple clock signals for a single flip flop would be difficult to synthesize, inefficient to build, and fragile to operate.

1.11.1.4 Multiple if rising_edge in Same Process

Multiple if rising_edge statements in a process (UNSYNTHESIZABLE)

    process (clk)
    begin
      if rising_edge(clk) then
        q0 <= d0;
      end if;
      if rising_edge(clk) then
        q1 <= d1;
      end if;
    end process;

Reason: The idioms for synthesis tools generally expect just a single if rising_edge statement in each process. The simpler the VHDL code is, the easier it is to synthesize hardware. Programmers of synthesis tools make idiomatic restrictions to make their jobs simpler.

1.11.1.5 if rising_edge and wait in Same Process

An if rising_edge statement and a wait statement in the same process (UNSYNTHESIZABLE)

    process (clk)
    begin
      if rising_edge(clk) then
        q0 <= d0;
      end if;
      wait until rising_edge(clk);
      q0 <= d1;
    end process;

Reason: The idioms for synthesis tools generally expect just a single type of flop-generating statement in each process.

1.11.1.6 if rising_edge with else Clause

The if statement has a rising_edge condition and an else clause (UNSYNTHESIZABLE).

    process (clk)
    begin
      if rising_edge(clk) then
        q0 <= d0;
      else
        q0 <= d1;
      end if;
    end process;

Reason: Generally, an if-then-else statement synthesizes to a multiplexer. The condition that is tested in the if-then-else becomes the select signal for the multiplexer. In an if rising_edge with else, the select signal would need to detect a rising edge on clk, which isn't feasible to synthesize.

1.11.1.7 if rising_edge Inside a for Loop

An if rising_edge statement in a for-loop (UNSYNTHESIZABLE-Synopsys)

    process (clk)
    begin
      for i in 0 to 7 loop
        if rising_edge(clk) then
          q(i) <= d;
        end if;
      end loop;
    end process;

Reason: Just an idiom of the synthesis tool.

Some loop statements are synthesizable (Rushton Section 8.7).
For-loops in general are described in Ashenden. Examples of for loops in E&CE 427 will appear when describing testbenches for functional verification (Chapter 3).

Synthesizable Alternative

A synthesizable alternative to an if rising_edge statement in a for-loop is to put the if rising_edge outside of the for loop.

    process (clk)
    begin
      if rising_edge(clk) then
        for i in 0 to 7 loop
          q(i) <= d;
        end loop;
      end if;
    end process;

1.11.1.8 wait Inside of a for loop

wait statements in a for loop (UNSYNTHESIZABLE)

    process
    begin
      for i in 0 to 7 loop
        wait until rising_edge(clk);
        x <= to_unsigned(i,4);
      end loop;
    end process;

Reason: Unknown. Clocked for-loops are generally unsynthesizable, but while-loops with the same behaviour are synthesizable.

Note: Combinational for-loops. Combinational for-loops are usually synthesizable. They are often used to build a combinational circuit for each element of an array.

Note: Clocked for-loops. Clocked for-loops are not synthesizable, but are very useful in simulation, particularly to generate test vectors for test benches.

Synthesizable Alternative to Wait-Inside-For

while loop (synthesizable)

This is the synthesizable alternative to the wait statement in a for loop above.

    process
    begin
      -- output values from 0 to 4 on i
      -- sending one value out each clock cycle
      i <= to_unsigned(0,4);
      wait until rising_edge(clk);
      while (4 > i) loop
        i <= i + 1;
        wait until rising_edge(clk);
      end loop;
    end process;

1.11.2 Synthesizable, but Bad Coding Practices

Note: The results in this section are highly dependent upon the synthesis tool that you use and the target technology library.

1.11.2.1 Asynchronous Reset

In an asynchronous reset, the test for reset occurs outside of the test for the clock edge.
    process (reset, clk)
    begin
      if (reset = '1') then
        q <= '0';
      elsif rising_edge(clk) then
        q <= d1;
      end if;
    end process;

Asynchronous resets are bad, because if a reset occurs very close to a clock edge, some parts of the circuit might be reset in one clock cycle and some in the subsequent clock cycle. This can lead the circuit to be out of sync as it goes through the reset sequence, potentially causing erroneous internal state and output values.

1.11.2.2 Combinational if-then Without else

    process (a, b)
    begin
      if (a = '1') then
        c <= b;
      end if;
    end process;

Reason: This code synthesizes c to be a latch, and latches are undesirable.

1.11.2.3 Bad Form of Nested Ifs

if rising_edge statement inside another if (BAD HARDWARE)

In Synopsys, with some target libraries, this design results in a level-sensitive latch whose input is a flop.

    process (ce, clk)
    begin
      if (ce = '1') then
        if rising_edge(clk) then
          q <= d1;
        end if;
      end if;
    end process;

1.11.2.4 Deeply Nested Ifs

Deeply chained if-then-else statements can lead to long chains of dependent gates, rather than checking different cases in parallel.

Slow (maybe):

    if cond1 then
      stmts1
    elsif cond2 then
      stmts2
    elsif cond3 then
      stmts3
    elsif cond4 then
      stmts4
    end if;

Fast (hopefully): If only one of the conditions can be true at a time, then try using a case statement or some other technique that allows the conditions to be evaluated in parallel.

1.11.3 Synthesizable, but Unpredictable Hardware

Some coding styles are synthesizable and might produce desirable hardware with a particular synthesis tool, but are either unsynthesizable or produce undesirable hardware with another tool.
- variables
- level-sensitive wait statements
- missing signals in sensitivity list

If you are using a single synthesis tool for an extended period of time, and want to get the full power of the tool, then it can be advantageous to write your code in a way that works for your tool, but might produce undesirable results with other tools.

1.12 Synthesizable VHDL Coding Guidelines

This section gives guidelines for building robust, portable, and synthesizable VHDL code. Portability is both for different simulation and synthesis tools and for different implementation technologies.

Remember, there is a world of difference between getting a design to work in simulation and getting it to work on a real FPGA. And there is also a huge difference between getting a design to work in an FPGA for a few minutes of testing and getting thousands of products to work for months at a time in thousands of different environments around the world. The coding guidelines here are designed to help you get both your E&CE 427 project and all of your subsequent industrial designs to work.

Finally, note that there are exceptions to every rule. You might find yourself in a circumstance where your particular situation (e.g. choice of tool, target technology, etc) would benefit from bending or breaking a guideline here. Within E&CE 427, of course, there won't be any such circumstances.

1.12.1 Signal Declarations

- Use signals, do not use variables.
  reason: The intention of the creators of VHDL was for signals to be wires and variables to be just for simulation. Some synthesis tools allow some uses of variables, but when using variables, it is easy to create a design that works in simulation but not in real hardware.

- Use std_logic signals, do not use bit or Boolean.
  reason: std_logic is the most commonly used signal type across synthesis tools, simulation tools, and cell libraries.

- Use in or out, do not use inout.
  reason: inout signals are tri-state.
  note: If you have an output signal that you also want to read from, you might be tempted to declare the mode of the signal to be inout. A better solution is to create a new, internal, signal that you both read from and write to. Then, your output signal can just read from the internal signal.

- Declare the primary inputs and outputs of chips as either std_logic or std_logic_vector. Do not use signed or unsigned for primary inputs or outputs.
  reason: Both the Altera tool Quartus and the Xilinx tool ngd2vhdl convert signed and unsigned vectors in entities into std_logic_vectors. If you want your same testbench to work for both functional simulation and timing simulation, you must not use signed or unsigned signals in the top-level entity of your chip.
  note: Signed and unsigned signals are fine inside testbenches, for non-top-level entities, and inside architectures. It is only the top-level entity that should not use signed or unsigned signals.

1.12.2 Flip-Flops and Latches

- Use flops, not latches (see Section 1.8.2).

- Use D-flops, not T, JK, etc (see Section 1.8.2).

- For every signal in your design, know whether it should be a flip-flop or combinational. Before simulating your design, examine the log file (e.g. LOG/dc_shell.log) to see if the flip flops in your circuit match your expectations, and to check that you don't have any latches in your design.

- Do not assign a signal to itself (e.g. a <= a; is bad). If the signal is a flop, use a chip enable to cause the signal to hold its value. If the signal is combinational, then assigning a signal to itself will cause combinational loops, which are bad.

1.12.3 Inputs and Outputs

- Put flip flops on primary inputs and outputs of a chip.
  reason: Creates more robust implementations. Signal delays between chips are unpredictable. Signal integrity can be a problem (remember transmission lines from E&CE 324?). Putting flip flops on inputs and outputs of a chip provides clean boundaries between circuits.
  note: This only applies to primary inputs and outputs of a chip (the signals in the top-level entity). Within a chip, you should adopt a standard of putting flip-flops on either inputs or outputs of modules. Within a chip, you do not need to put flip-flops on both inputs and outputs.

1.12.4 Multiplexors and Tri-State Signals

- Use multiplexors, not tri-state buffers (see Section 1.8.2).

1.12.5 Processes

- For a combinational process, the sensitivity list should contain all of the signals that are read in the process.
  reason: Gives consistent results across different tools. Many synthesis tools will implicitly include all signals that a process reads in its sensitivity list. This differs from the VHDL Standard. A tool that adheres to the standard will introduce latches if not all signals that are read from are included in the sensitivity list.
  exception: In a clocked process using an if rising_edge, it is acceptable to have only the clock in the sensitivity list.

- For a combinational process, every signal that is assigned to must be assigned to in every branch of if-then and case statements.
  reason: If a signal is not assigned a value in a path through a combinational process, then that signal will be a latch.
  note: For a clocked process, if a signal is not assigned a value in a clock cycle, then the flip-flop for that signal will have a chip-enable pin. Chip-enable pins are fine; they are available on flip-flops in essentially every cell library.

- Each signal should be assigned to in only one process.
  reason: Multiple processes driving the same signal is the same as having multiple gates driving the same wire. This can cause contention, short circuits, and other bad things.
  exception: Multiple drivers are acceptable for tri-state busses or if your implementation technology has wired-ANDs or wired-ORs. FPGAs don't have wired-ANDs or wired-ORs.
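The three process guidelines above can be sketched in a single combinational process. This is a minimal illustration, not code from the notes; the entity name comb_demo and its ports are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity, for illustration only.
entity comb_demo is
  port (
    sel, a, b : in  std_logic;
    y, z      : out std_logic
  );
end comb_demo;

architecture main of comb_demo is
begin
  -- Every signal that the process reads (sel, a, b) is in the
  -- sensitivity list.
  process (sel, a, b)
  begin
    if (sel = '1') then
      y <= a;
      z <= a and b;
    else
      y <= b;
      z <= '0';  -- z is assigned in every branch, so no latch is implied
    end if;
  end process;
  -- y and z are driven by this process only.
end main;
```

Removing the assignment to z in the else branch would leave a path through the process where z keeps its old value, which is the latch case that the second guideline warns about.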
- Separate unrelated signals into different processes.
  reason: Grouping assignments to unrelated signals into a single process can complicate the control circuitry for that process. Each branch in a case statement or if-then-else adds a multiplexor or chip-enable circuitry.
  reason: Synthesis tools generally optimize each process individually; the larger a process is, the longer it will take the synthesis program to optimize the process. Also, larger processes tend to be more complicated and can cause synthesis programs to miss helpful optimizations that they would notice in smaller processes.

1.12.6 State Machines

- In a state machine, illegal and unreachable states should transition to the reset state.
  reason: Creates more robust implementations. In the field, your circuit will be subjected to illegal inputs, voltage spikes, temperature fluctuations, clock speed variations, etc. At some point in time, something weird will happen that will cause it to jump into an illegal state. Having a system reset and reboot is much better than having it generate incorrect outputs that aren't detected.

- If your state machine has fewer than 16 states, use a one-hot encoding.
  reason: For n states, a one-hot encoding uses n flip-flops, while a binary encoding uses log2 n flip-flops. One-hot signals are simpler to decode, because only one bit must be checked to determine if the circuit is in a particular state. For small values of n, a one-hot signal results in a smaller and faster circuit. For large values of n, the number of signals required for a one-hot design is too great of a penalty to compensate for the simplicity of the decoding circuitry.
  note: Using an enumerated type for states allows the synthesis tool to choose state encodings that it thinks will work well to balance area and clock speed. Quartus uses a modified one-hot encoding, where the bit that denotes the reset state is inverted.
That is, when the reset bit is 0, the system is in the reset state and when the reset bit is 1 the system is not in the reset state. The other bits have the normal polarity. The result is that when the system is in the reset state, all bits are 0 and when the system is in a non-reset state, two bits are 1.
  note: Using your own encoding allows you to leverage knowledge about your design that the synthesis tool might not be able to deduce.

1.12.7 Reset

- Include a reset signal in all clocked circuits.
  reason: For most implementation technologies, when you power up the circuit, you do not know what state it will start in. You need a reset signal to get the circuit into a known state.
  reason: If something goes wrong while the circuit is running, you need a way to get it into a known state.

- For implicit state machines (Section 2.5.1.3), check for reset after every wait statement.
  reason: Missing the reset check after a wait statement means that your circuit might not notice a reset signal, or different signals could reset in different clock cycles, causing your circuit to get out of sync.

- Connect reset to the important control signals in the design, such as the state signal. Do not reset every flip flop.
  reason: Using reset adds area and delay to a circuit. The fewer signals that need reset, the faster and smaller your design will be.
  note: Connect the reset signal to critical flip-flops, such as the state signal. Datapath signals rarely need to be reset. You do not need to reset every signal.

- Use synchronous, not asynchronous, reset.
  reason: Creates more robust implementations. Signal propagation delays mean that asynchronous resets cause different parts of the circuit to be reset at different times. This can lead to glitches, which then might cause the circuit to move to an illegal state.

Covering All Cases
When writing case statements or selected assignments that test the value of std_logic signals, you will get an error unless you include a provision for values other than '1' and '0'.

For example:

    signal t : std_logic;
    ...
    case t is
      when '1' => ...
      when '0' => ...
    end case;

will result in an error message about missing cases. You must provide for t being 'H', 'U', etc. The simplest thing to do is to make the last test when others.

1.13 VHDL Problems

P1.1 IEEE 1164

For each of the values in the list below, answer whether or not it is defined in the ieee.std_logic_1164 library. If it is part of the library, write a 2-3 word description of the value.

Values: '-', '#', '0', '1', 'A', 'h', 'H', 'L', 'Q', 'X', 'Z'.

P1.2 VHDL Syntax

Answer whether each of the VHDL code fragments q2a through q2f is legal VHDL code.

NOTES:
1) "..." represents a fragment of legal VHDL code.
2) For full marks, if the code is illegal, you must explain why.
3) The code has been written so that, if it is illegal, then it is illegal for both simulation and synthesis.

q2a:

    architecture main of anchiceratops is
      signal a, b, c : std_logic;
    begin
      process begin
        wait until rising_edge(c);
        a <= if (b = '1') then
               ...
             else
               ...
             end if;
      end process;
    end main;

q2b:

    architecture main of tulerpeton is
    begin
      lab: for i in 15 downto 0 loop
        ...
      end loop;
    end main;

q2c:

    architecture main of metaxygnathus is
      signal a : std_logic;
    begin
      lab: if (a = '1') generate
        ...
      end generate;
    end main;

q2d:

    architecture main of temnospondyl is
      component compa
        port (
          a : in  std_logic;
          b : out std_logic
        );
      end component;
      signal p, q : std_logic;
    begin
      coma_1 : compa
        port map (a => p, b => q);
      ...
    end main;

q2e:

    architecture main of pachyderm is
      function inv(a : std_logic)
        return std_logic is
      begin
        return(NOT a);
      end inv;
      signal p, b : std_logic;
    begin
      p <= inv(b => a);
      ...
end main; architecture main of apatosaurus is type state_ty is (S0, S1, S2); signal st : state_ty; signal p : std_logic; begin q2f case st is when S0 | S1 => p <= 0; when others => p <= 1; end case; end main; 78 CHAPTER 1. VHDL P1.3 Flops, Latches, and Combinational Circuitry For each of the signals p...z in the architecture main of montevido, answer whether the signal is a latch, combinational gate, or ip-op. entity montevido is port ( a, b0, b1, c0, c1, d0, d1, e0, e1 : in std_logic; l : in std_logic_vector (1 downto 0); p, q, r, s, t, u, v, w, x, y, z : out std_logic ); end montevido; architecture main of montevido is signal i, j : std_logic; begin i <= c0 XOR c1; j <= c0 XOR c1; process (a, i, j) begin if (a = 1) then p <= i AND j; else p <= NOT i; end if; end process; process (a, b0, b1) begin if rising_edge(a) then q <= b0 AND b1; end if; end process; process (a, c0, c1, d0, d1, e0, e1) begin if (a = 1) then r <= c0 OR c1; s <= d0 AND d1; else r <= e0 XOR e1; end if; end process; process begin wait until rising_edge(a); t <= b0 XOR b1; u <= NOT t; v <= NOT x; end process; process begin case l is when "00" => wait until rising_edge(a); w <= b0 AND b1; x <= 0; when "01" => wait until rising_edge(a); w <= -; x <= 1; when "1-" => wait until rising_edge(a); w <= c0 XOR c1; x <= -; end case; end process; y <= c0 XOR c1; z <= x XOR w; end main; P1.4 Counting Clock Cycles 79 P1.4 Counting Clock Cycles This question refers to the VHDL code shown below. NOTES: 1. ... represents a legal fragment of VHDL code 2. assume all signals are properly declared 3. the VHDL code is intendend to be legal, synthesizable code 4. all signals are initially U 80 CHAPTER 1. VHDL architecture main of tinyckt is component bigckt ( ... ); signal ... 
: std_logic; begin p0 : process begin wait until rising_edge(clk); p0_a <= i; entity bigckt is wait until rising_edge(clk); port ( end process; a, b : in std_logic; p1 : process begin c : out std_logic wait until rising_edge(clk); ); p1_b <= p1_d; end bigckt; p1_c <= p1_b; p1_d <= s2_k; architecture main of bigckt is end process; begin process (a, b) begin if (a = 0) then c <= 0; else if (b = 1) then c <= 1 else c <= 0; end if; end if; end process; end main; entity tinyckt is port ( clk : in std_logic; i : in std_logic; o : out std_logic ); end tinyckt; p2 : process (p1_c, p3_h, p4_i, clk) begin if rising_edge(clk) then p2_e <= p3_h; p2_f <= p1_c = p4_i; end if; end process; p3 : process (i, s4_m) begin p3_g <= i; p3_h <= s4_m; end process; p4 : process (clk, i) begin if (clk = 1) then p4_i <= i; else p4_i <= 0; end if; end process; huge : bigckt (a => p2_e, b => p1_d, c => h_y); s1_j <= s3_l; s2_k <= p1_b XOR i; s3_l <= p2_f; s4_m <= p2_f; end main; For each of the pairs of signals below, what is the minimum length of time between when a change occurs on the source signal and when that change affects the destination signal? P1.5 Arithmetic Overow 81 src i i i i i i i s4 p1 p2 p2 m b f f dst p0 a p1 b p1 b p1 c p2 e p3 g p4 i hy p1 d s1 j s2 k Num clock cycles P1.5 Arithmetic Overow Implement a circuit to detect overow in 8-bit signed addition. An overow in addition happens when the carry into the most signicant bit is different from the carry out of the most signicant bit. When performing addition, for overow to happen, both operands must have the same sign. Positive overow occurs when adding two positive operands results in a negative sum. Negative overow occurs when adding two negative operands results in a positive sum. 82 CHAPTER 1. VHDL P1.6 Delta-Cycle Simulation: Pong Perform a delta-cycle simulation of the following VHDL code by drawing a waveform diagram. INSTRUCTIONS: 1. The simulation is to be done at the granularity of simulation-steps. 2. 
Show all changes to process modes and signal values.
3. Each column of the timing diagram corresponds to a simulation step that changes a signal or process.
4. Clearly show the beginning and end of each simulation cycle, delta cycle, and simulation round by writing in the appropriate row a B at the beginning and an E at the end of the cycle or round.
5. End your simulation just before 20 ns.

architecture main of pong_machine is
  signal ping_i, ping_n, pong_i, pong_n : std_logic;
begin
  next_proc: process (clk) begin
    if rising_edge(clk) then
      ping_n <= ping_i;
      pong_n <= pong_i;
    end if;
  end process;
  comb_proc: process (pong_n, ping_n, reset) begin
    if (reset = '1') then
      ping_i <= '1';
      pong_i <= '0';
    else
      ping_i <= pong_n;
      pong_i <= ping_n;
    end if;
  end process;
end main;

reset_proc: process begin
  reset <= '1';
  wait for 10 ns;
  reset <= '0';
  wait for 100 ns;
end process;

clk_proc: process begin
  clk <= '0';
  wait for 10 ns;
  clk <= '1';
  wait for 10 ns;
end process;

P1.7 Delta-Cycle Simulation: Baku

Perform a delta-cycle simulation of the following VHDL code by drawing a waveform diagram.

INSTRUCTIONS:
1. The simulation is to be done at the granularity of simulation-steps.
2. Show all changes to process modes and signal values.
3. Each column of the timing diagram corresponds to a simulation step.
4. Clearly show the beginning and end of each simulation cycle, delta cycle, and simulation round by writing in the appropriate row a B at the beginning and an E at the end of the cycle or round.
5. Write t=5ns and t=10ns at the top of columns where time advances to 5 ns and 10 ns.
6. Begin your simulation at 5 ns (i.e. after the initial simulation cycles that initialize the signals have completed).
7.
End your simulation just before 15 ns.

entity baku is
  port (
    clk, a, b : in std_logic;
    f : out std_logic
  );
end baku;

architecture main of baku is
  signal c, d, e : std_logic;
begin
  proc_clk: process begin
    clk <= '0';
    wait for 10 ns;
    clk <= '1';
    wait for 10 ns;
  end process;
  proc_extern : process begin
    a <= '0';
    b <= '0';
    wait for 5 ns;
    a <= '1';
    b <= '1';
    wait for 15 ns;
  end process;
  proc_1 : process (a, b, c) begin
    c <= a and b;
    d <= a xor c;
  end process;
  proc_2 : process begin
    e <= d;
    wait until rising_edge(clk);
  end process;
  proc_3 : process (c, e) begin
    f <= c xor e;
  end process;
end main;

P1.8 Clock-Cycle Simulation

Given the VHDL code for anapurna and waveform diagram below, answer what the values of the signals y, z, and p will be at the given times.

entity anapurna is
  port (
    clk, reset, sel : in std_logic;
    a, b : in unsigned(15 downto 0);
    p : out unsigned(15 downto 0)
  );
end anapurna;

architecture main of anapurna is
  type state_ty is (mango, guava, durian, papaya);
  signal y, z : unsigned(15 downto 0);
  signal state : state_ty;
begin
  proc_herzog: process begin
    top_loop: loop
      wait until (rising_edge(clk));
      next top_loop when (reset = '1');
      state <= durian;
      wait until (rising_edge(clk));
      state <= papaya;
      while y < z loop
        wait until (rising_edge(clk));
        if sel = '1' then
          wait until (rising_edge(clk));
          next top_loop when (reset = '1');
          state <= mango;
        end if;
        state <= papaya;
      end loop;
    end loop;
  end process;
  proc_hillary: process (clk) begin
    if rising_edge(clk) then
      if (state = durian) then
        z <= a;
      else
        z <= z + 2;
      end if;
    end if;
  end process;
  y <= b;
  p <= y + z;
end main;

P1.9 VHDL-VHDL Behavioural Comparison: Teradactyl

For each of the VHDL architectures q3a through q3c, does the signal v have the same behaviour as it does in the main architecture of teradactyl?

NOTES:
1) For full marks, if the code has different behaviour, you must explain why.
2) Ignore any differences in behaviour in the first few clock cycles that are caused by initialization of flip-flops, latches, and registers.
3) All code fragments in this question are legal, synthesizable VHDL code.

entity teradactyl is
  port (
    a : in std_logic;
    v : out std_logic
  );
end teradactyl;

architecture main of teradactyl is
  signal m : std_logic;
begin
  m <= a;
  v <= m;
end main;

architecture q3a of teradactyl is
  signal b, c, d : std_logic;
begin
  b <= a;
  c <= b;
  d <= c;
  v <= d;
end q3a;

architecture q3b of teradactyl is
  signal m : std_logic;
begin
  process (a, m) begin
    v <= m;
    m <= a;
  end process;
end q3b;

architecture q3c of teradactyl is
  signal m : std_logic;
begin
  process (a) begin
    m <= a;
  end process;
  process (m) begin
    v <= m;
  end process;
end q3c;

P1.10 VHDL-VHDL Behavioural Comparison: Ichthyostega

For each of the VHDL architectures q4a through q4c, does the signal v have the same behaviour as it does in the main architecture of ichthyostega?

NOTES:
1) For full marks, if the code has different behaviour, you must explain why.
2) Ignore any differences in behaviour in the first few clock cycles that are caused by initialization of flip-flops, latches, and registers.
3) All code fragments in this question are legal, synthesizable VHDL code.
entity ichthyostega is
  port (
    clk : in std_logic;
    b, c : in signed(3 downto 0);
    v : out signed(3 downto 0)
  );
end ichthyostega;

architecture main of ichthyostega is
  signal bx, cx : signed(3 downto 0);
begin
  process begin
    wait until (rising_edge(clk));
    bx <= b;
    cx <= c;
  end process;
  process begin
    wait until (rising_edge(clk));
    if (cx > 0) then
      v <= bx;
    else
      v <= to_signed(-1, 4);
    end if;
  end process;
end main;

architecture q4a of ichthyostega is
  signal bx, cx : signed(3 downto 0);
begin
  process begin
    wait until (rising_edge(clk));
    bx <= b;
    cx <= c;
  end process;
  process begin
    if (cx > 0) then
      wait until (rising_edge(clk));
      v <= bx;
    else
      wait until (rising_edge(clk));
      v <= to_signed(-1, 4);
    end if;
  end process;
end q4a;

architecture q4b of ichthyostega is
  signal bx, cx : signed(3 downto 0);
begin
  process begin
    wait until (rising_edge(clk));
    bx <= b;
    cx <= c;
    wait until (rising_edge(clk));
    if (cx > 0) then
      v <= bx;
    else
      v <= to_signed(-1, 4);
    end if;
  end process;
end q4b;

architecture q4c of ichthyostega is
  signal bx, cx, dx : signed(3 downto 0);
begin
  process begin
    wait until (rising_edge(clk));
    bx <= b;
    cx <= c;
  end process;
  process begin
    wait until (rising_edge(clk));
    v <= dx;
  end process;
  dx <= bx when (cx > 0)
        else to_signed(-1, 4);
end q4c;

P1.11 Waveform-VHDL Behavioural Comparison

Answer whether each of the VHDL code fragments q3a through q3d has the same behaviour as the timing diagram.

NOTES:
1) "Same behaviour" means that the signals a, b, and c have the same values at the end of each clock cycle in steady-state simulation (ignore any irregularities in the first few clock cycles).
2) For full marks, if the code does not match, you must explain why.
3) Assume that all signals, constants, variables, types, etc. are properly defined and declared.
4) All of the code fragments are legal, synthesizable VHDL code.
[Timing diagram: waveforms for clk, a, b, c; not recoverable from the extraction]

q3a
architecture q3a of q3 is
begin
  process begin
    a <= '1';
    loop
      wait until rising_edge(clk);
      a <= NOT a;
    end loop;
  end process;
  b <= NOT a;
  c <= NOT b;
end q3a;

q3b
architecture q3b of q3 is
begin
  process begin
    b <= '0';
    a <= '1';
    wait until rising_edge(clk);
    a <= b;
    b <= a;
    wait until rising_edge(clk);
  end process;
  c <= a;
end q3b;

q3c
architecture q3c of q3 is
begin
  process begin
    a <= '0';
    b <= '1';
    wait until rising_edge(clk);
    b <= a;
    a <= b;
    wait until rising_edge(clk);
  end process;
  c <= NOT b;
end q3c;

q3d
architecture q3d of q3 is
begin
  process (b, clk) begin
    a <= NOT b;
  end process;
  process (a, clk) begin
    b <= NOT a;
  end process;
  c <= NOT b;
end q3d;

q3e
architecture q3e of q3 is
begin
  process begin
    b <= '0';
    a <= '1';
    wait until rising_edge(clk);
    a <= c;
    b <= a;
    wait until rising_edge(clk);
  end process;
  c <= not b;
end q3e;

q3f
architecture q3f of q3 is
begin
  process begin
    a <= '1';
    b <= '0';
    c <= '1';
    wait until rising_edge(clk);
    a <= c;
    b <= a;
    c <= NOT b;
    wait until rising_edge(clk);
  end process;
end q3f;

P1.12 Hardware-VHDL Comparison

entity q2 is
  port (
    a, clk, reset : in std_logic;
    d : out std_logic
  );
end q2;

architecture main of q2 is
  signal b, c : std_logic;
begin
  b <= '0' when (reset = '1') else a;
  process (clk) begin
    if rising_edge(clk) then
      c <= b;
      d <= c;
    end if;
  end process;
end main;

For each of the circuits q2a-q2d, answer whether the signal d has the same behaviour as it does in the main architecture of q2.

[Circuit diagrams q2a, q2b, q2c, q2d: schematics with inputs a, clk, reset, constant 0 inputs, and output d; not recoverable from the extraction]

P1.13 8-Bit Register

Implement an 8-bit register that has:
- clock signal clk
- input data vector d
- output data vector q
- synchronous active-high input reset
- synchronous active-high input enable

P1.13.1 Asynchronous Reset

Modify your design so that the reset signal is asynchronous, rather than synchronous.
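As a starting point for P1.13 and P1.13.1, the two reset styles can be sketched side by side. This is only one possible sketch, not the official solution: the entity name reg8, the architecture names, the reset value of all zeroes, and the choice to give reset priority over enable are assumptions made for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity name; port names follow the problem statement.
entity reg8 is
  port (
    clk    : in  std_logic;
    reset  : in  std_logic;
    enable : in  std_logic;
    d      : in  std_logic_vector(7 downto 0);
    q      : out std_logic_vector(7 downto 0)
  );
end reg8;

-- Synchronous reset: reset is sampled only on the rising clock edge.
architecture sync_reset of reg8 is
begin
  process (clk) begin
    if rising_edge(clk) then
      if reset = '1' then            -- assumed: reset overrides enable
        q <= (others => '0');        -- assumed: reset value is zero
      elsif enable = '1' then
        q <= d;
      end if;
    end if;
  end process;
end sync_reset;

-- Asynchronous reset: reset acts immediately, independent of the clock.
architecture async_reset of reg8 is
begin
  process (clk, reset) begin
    if reset = '1' then
      q <= (others => '0');
    elsif rising_edge(clk) then
      if enable = '1' then
        q <= d;
      end if;
    end if;
  end process;
end async_reset;
```

The only structural differences are the sensitivity list and whether the reset test sits inside or outside the rising_edge(clk) branch.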
P1.13.2 Discussion

Describe the tradeoffs in using synchronous versus asynchronous reset in a circuit implemented on an FPGA.

P1.13.3 Testbench for Register

Write a test bench to validate the functionality of the 8-bit register with synchronous reset.

P1.14 Synthesizable VHDL and Hardware

For each of the fragments of VHDL q4a...q4f, answer whether the code is synthesizable. If the code is synthesizable, draw the circuit most likely to be generated by synthesizing the datapath of the code. If the code is not synthesizable, explain why.

q4a
process begin
  wait until rising_edge(a);
  e <= d;
  wait until rising_edge(b);
  e <= NOT d;
end process;

q4b
process begin
  while (c /= '1') loop
    if (b = '1') then
      wait until rising_edge(a);
      e <= d;
    else
      e <= NOT d;
    end if;
  end loop;
  e <= b;
end process;

q4c
process (a, d) begin
  e <= d;
end process;
process (a, e) begin
  if rising_edge(a) then
    f <= NOT e;
  end if;
end process;

q4d
process (a) begin
  if rising_edge(a) then
    if b = '1' then
      e <= '0';
    else
      e <= d;
    end if;
  end if;
end process;

q4e
process (a,b,c,d) begin
  if rising_edge(a) then
    e <= c;
  else
    if (b = '1') then
      e <= d;
    end if;
  end if;
end process;

q4f
process (a,b,c) begin
  if (b = '1') then
    e <= '0';
  else
    if rising_edge(a) then
      e <= c;
    end if;
  end if;
end process;

P1.15 Datapath Design

Each of the three VHDL fragments q4a-q4c is intended to be the datapath for the same circuit. The circuit is intended to perform the following sequence of operations (not all operations are required to use a clock cycle):
- read in source and destination addresses from i_src1, i_src2, i_dst
- read operands op1 and op2 from memory
- compute sum of operands sum
- write sum to memory at destination address dst
- write sum to output o_result

[Figure: block with inputs clk, i_src1, i_src2, i_dst and output o_result]

P1.15.1 Correct Implementation?

For each of the three fragments of VHDL q4a-q4c, answer whether it is a correct implementation of the datapath.
If the datapath is not correct, explain why. If the datapath is correct, answer in which cycle you need load='1'.

NOTES:
1. You may choose the number of clock cycles required to execute the sequence of operations.
2. The cycle in which the addresses are on i_src1, i_src2, and i_dst is cycle #0.
3. The control circuitry that controls the datapath will output a signal load, which will be '1' when the sum is to be written into memory.
4. The code fragment with the signal declarations, connections for inputs and outputs, and the instantiation of memory is to be used for all three code fragments q4a-q4c.
5. The memory has registered inputs and combinational (unregistered) outputs.
6. All of the VHDL is legal, synthesizable code.

-- This code is to be used for
-- all three code fragments q4a--q4c.
signal state : std_logic_vector(3 downto 0);
signal src1, src2, dst, op1, op2, sum,
       mem_in_a, mem_out_a, mem_out_b,
       mem_addr_a, mem_addr_b
       : unsigned(7 downto 0);
...
process (clk) begin
  if rising_edge(clk) then
    src1 <= i_src1;
    src2 <= i_src2;
    dst <= i_dst;
    o_result <= sum;
  end if;
end process;
mem : ram256x16d
  port map (clk => clk,
            i_addr_a => mem_addr_a,
            i_addr_b => mem_addr_b,
            i_we_a => mem_we,
            i_data_a => mem_in_a,
            o_data_a => mem_out_a,
            o_data_b => mem_out_b);

q4a
op1 <= mem_out_a when state = "0010" else (others => '0');
op2 <= mem_out_b when state = "0010" else (others => '0');
sum <= op1 + op2 when state = "0100" else (others => '0');
mem_in_a <= sum when state = "1000" else (others => '0');
mem_addr_a <= dst when state = "1000" else src1;
mem_we <= '1' when state = "1000" else '0';
mem_addr_b <= src2;
process (clk) begin
  if rising_edge(clk) then
    if (load = '1') then
      state <= "1000";
    else
      -- rotate state vector one bit to left
      state <= state(2 downto 0) & state(3);
    end if;
  end if;
end process;

q4b
process (clk) begin
  if rising_edge(clk) then
    op1 <= mem_out_a;
    op2 <= mem_out_b;
  end if;
end process;
sum <= op1 + op2;
mem_in_a <= sum;
mem_we <= load;
mem_addr_a <= dst when load = '1' else src1;
mem_addr_b <= src2;

q4c
process begin
  wait until rising_edge(clk);
  op1 <= mem_out_a;
  op2 <= mem_out_b;
  sum <= op1 + op2;
  mem_in_a <= sum;
end process;
process (load, dst, src1) begin
  if load = '1' then
    mem_addr_a <= dst;
  else
    mem_addr_a <= src1;
  end if;
end process;
mem_addr_b <= src2;

P1.15.2 Smallest Area

Of all of the circuits (q4a-q4c), including both correct and incorrect circuits, predict which will have the smallest area. If you don't have sufficient information to predict the relative areas, explain what additional information you would need to predict the area prior to synthesizing the designs.

P1.15.3 Shortest Clock Period

Of all of the circuits (q4a-q4c), including both correct and incorrect circuits, predict which will have the shortest clock period. If you don't have sufficient information to predict the relative periods, explain what additional information you would need to predict the period prior to performing any synthesis or timing analysis of the designs.

Chapter 2
RTL Design with VHDL: From Requirements to Optimized Code

2.1 Prelude to Chapter

2.1.1 A Note on EDA for FPGAs and ASICs

The following is from John Cooley's column The Industry Gadfly from 2003/04/30.
The title of this article is: "The FPGA EDA Slums".

For 2001, Dataquest reported that the ASIC market was US$16.6 billion while the FPGA market was US$2.6 billion. What's more interesting is that the 2001 ASIC EDA market was US$2.2 billion while the FPGA EDA market was US$91.1 million. Nope, that's not a mistake. It's ASIC EDA and billion versus FPGA EDA and million. Do the math and you'll see that for every dollar spent on an ASIC project, roughly 12 cents of it goes to an EDA vendor. For every dollar spent on a FPGA project, roughly 3.4 cents goes to an EDA vendor. Not good.

"It's the old free milk and a cow story," according to Gary Smith, the Senior EDA Analyst at Dataquest. "Altera and Xilinx have fouled their own nest. Their free tools spoil the FPGA EDA market," says Gary. "EDA vendors know that there's no money to be made in FPGA tools."

2.2 FPGA Background and Coding Guidelines

2.2.1 Generic FPGA Hardware

2.2.1.1 Cell

Generic FPGA Cell
  = Logic Element (LE) in Altera
  = Configurable Logic Block (CLB) in Xilinx

[Figure: generic FPGA cell containing a combinational block (comb) and a flip-flop (D, CE, S, R, Q), with ports comb_data_in, comb_data_out, flop_data_in, flop_data_out, ctrl_in, carry_in, carry_out]

2.2.2 Area Estimation

We estimate the number of FPGA cells required for a design by counting the number of flip-flops and primary inputs that are in the fanin of each flip-flop. Only flip-flops count, because combinational signals are collapsed into the circuitry within an FPGA cell.

The circuitry for any flip-flop signal with up to four source flip-flops can be implemented on a single FPGA cell. If a flip-flop signal is dependent upon five source flip-flops, then two FPGA cells are required.

Source flops/inputs   Minimum cells
        1                   1
        2                   1
        3                   1
        4                   1
        5                   2
        6                   2
        7                   2
        8                   3
        9                   3
       10                   3
       11                   4

For a single target signal, this technique gives a lower bound on the number of cells needed. For example, some functions of seven inputs require more than two cells. As a particular example, a four-to-one multiplexer has six inputs and requires three cells.
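The lower-bound table above can be captured in a small function. The sketch below is an illustration only: the package name area_est and function name min_cells are hypothetical, and the closed form assumes the reading that the first cell absorbs four sources and every further cell adds three more (one input being taken by the previous cell's output), which reproduces every row of the table but is not stated explicitly in the notes.

```vhdl
-- Hypothetical helper package for the single-target lower bound.
package area_est is
  function min_cells (n_sources : positive) return positive;
end package area_est;

package body area_est is
  function min_cells (n_sources : positive) return positive is
  begin
    if n_sources <= 4 then
      return 1;                       -- one cell covers up to four sources
    else
      -- ceiling of (n_sources - 1) / 3, written with integer division
      return (n_sources + 1) / 3;
    end if;
  end function min_cells;
end package body area_est;

use work.area_est.all;

-- Simulation-only check of the function against rows of the table.
entity area_est_check is
end entity area_est_check;

architecture sim of area_est_check is
begin
  process begin
    assert min_cells(1) = 1 and min_cells(4) = 1
      report "rows 1..4 should need 1 cell" severity failure;
    assert min_cells(5) = 2 and min_cells(7) = 2
      report "rows 5..7 should need 2 cells" severity failure;
    assert min_cells(8) = 3 and min_cells(10) = 3
      report "rows 8..10 should need 3 cells" severity failure;
    assert min_cells(11) = 4
      report "row 11 should need 4 cells" severity failure;
    wait;
  end process;
end architecture sim;
```

As the text notes, this is only a lower bound: the six-input four-to-one multiplexer gives min_cells(6) = 2 but actually needs three cells.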
When dealing with multiple target signals, this technique might be an overestimate, because a single cell can drive several other cells (common subexpression elimination).

PLA and Flop for Different Functions
[Figure: generic FPGA cell with the comb block and the flip-flop used for different functions]

PLA and Flop for Same Function
[Figure: generic FPGA cell with the comb block and the flip-flop used for the same function]

PLA and Flop for Same Function
[Figure: generic FPGA cell with the comb block and the flip-flop used for the same function]

Estimate Area for Circuit

Question: Map the combinational circuits below onto generic FPGA cells.

[Figures: example circuits with inputs a through i and signals w, x, y, z, each shown with its mapping onto generic FPGA cells; not recoverable from the extraction]

2.2.2.1 Interconnect for Generic FPGA

Note: In these slides, the space between tightly grouped wires sometimes disappears, making a group of wires appear to be a single large wire.

There are two types of wires that connect a cell to the rest of the chip:
- General purpose interconnect (configurable, slow)
- Carry chains and cascade chains (vertically adjacent cells, fast)

2.2.2.2 Blocks of Cells for Generic FPGA

Cells are organized into blocks. There is a great deal of interconnect (wires) between cells within a single block. In large FPGAs, the blocks are organized into larger blocks. These large blocks might themselves be organized into even larger blocks. Think of an FPGA as a bunch of nested for-generate statements that replicate a single component (cell) hundreds of thousands of times.
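To make the for-generate analogy concrete, here is a minimal sketch of two nested generate loops replicating one cell. Everything in it is hypothetical: the entities cell and fabric, the generics n_blocks and n_cells, and the trivial XOR-plus-flop contents of the cell are placeholders, not a model of any real FPGA.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical stand-in for one FPGA cell: a small function plus a flop.
entity cell is
  port (
    clk  : in  std_logic;
    a, b : in  std_logic;
    q    : out std_logic
  );
end cell;

architecture main of cell is
begin
  process (clk) begin
    if rising_edge(clk) then
      q <= a xor b;
    end if;
  end process;
end main;

library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical "chip": blocks of cells built by nested generate loops.
entity fabric is
  generic (
    n_blocks : positive := 4;    -- blocks per chip (illustrative value)
    n_cells  : positive := 10    -- cells per block (illustrative value)
  );
  port (
    clk  : in  std_logic;
    a, b : in  std_logic_vector(n_blocks * n_cells - 1 downto 0);
    q    : out std_logic_vector(n_blocks * n_cells - 1 downto 0)
  );
end fabric;

architecture main of fabric is
begin
  blocks : for i in 0 to n_blocks - 1 generate
    cells : for j in 0 to n_cells - 1 generate
      -- one copy of the cell for each (block, cell) position
      u_cell : entity work.cell
        port map (
          clk => clk,
          a   => a(i * n_cells + j),
          b   => b(i * n_cells + j),
          q   => q(i * n_cells + j)
        );
    end generate cells;
  end generate blocks;
end main;
```

A real device adds more levels of nesting and the interconnect between levels; the sketch only shows the replication pattern.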
Cells not used for computation can be used as wires to shorten the length of the path between cells.

2.2.2.3 Clocks for Generic FPGAs

Characteristics of clock signals:
- High fanout (drive many gates)
- Long wires (destination gates scattered all over chip)

Characteristics of FPGAs:
- Very few gates that are large (strong) enough to support a high fanout.
- Very few wires that traverse the entire chip and can be connected to every flip-flop.

2.2.2.4 Special Circuitry in FPGAs

Memory

For more than five years, FPGAs have had special circuits for RAM and ROM. In Altera FPGAs, these circuits are called ESBs (Embedded System Blocks). These special circuits are possible because many FPGAs are fabricated on the same processes as SRAM chips. So, the FPGAs simply contain small chunks of SRAM.

Microprocessors

A new feature to appear in FPGAs in 2001 and 2002 is hardwired microprocessors on the same chip as programmable hardware.

                        Hard                            Soft
Altera                  Arm 922T with 200 MIPs          Nios with ?? MIPs
Xilinx: Virtex-II Pro   Power PC 405 with 420 D-MIPs    Microblaze with 100 D-MIPs

The Virtex-II Pro has 4 Power PCs and enough programmable hardware to implement the first-generation Intel Pentium microprocessor.

Arithmetic Circuitry

A new feature to appear in FPGAs in 2001 and 2002 is hardwired circuits for multipliers and adders.

Altera: Mercury          16 x 16 at 130MHz
Xilinx: Virtex-II Pro    18 x 18 at ???MHz

Using these resources can significantly improve both the area and performance of a design.

Input / Output
Recently, high-end FPGAs have started to include special circuits to increase the bandwidth of communication with the outside world.

          Product
Altera    True-LVDS (1 Gbps)
Xilinx    Rocket I/O (3 Gbps)

2.2.3 Generic-FPGA Coding Guidelines

Flip-flops are almost free in FPGAs
  reason: In FPGAs, the area consumed by a design is usually determined by the amount of combinational circuitry, not by the number of flip-flops.

Aim for using 80-90% of the cells on a chip.
  reason: If you use more than 90% of the cells on a chip, then the place-and-route program might not be able to route the wires to connect the cells.
  reason: If you use less than 80% of the cells, then probably:
    - there are optimizations that will increase performance and still allow the design to fit on the chip; or
    - you spent too much human effort on optimizing for low area; or
    - you could use a smaller (cheaper!) chip.
  exception: In E&CE 427 (unlike in real life), the mark is based on the actual number of cells used.

Use just one clock signal
  reason: If all flip-flops use the same clock, then the clock does not impose any constraints on where the place-and-route tool puts flip-flops and gates. If different flip-flops used different clocks, then flip-flops that are near each other would probably be required to use the same clock.

Use only one edge of the clock signal
  reason: There are two ways to use both rising and falling edges of a clock signal: have rising-edge and falling-edge flip-flops, or have two different clock signals that are inverses of each other. Most FPGAs have only rising-edge flip-flops. Thus, using both edges of a clock signal is equivalent to having two different clock signals, which is deprecated by the preceding guideline.

2.2.4 Altera APEX20K Information and Coding Guidelines

APEX20K Block Hierarchy
Chip
  52 Mega Logic Array Blocks (MegaLABs)
    1 Embedded System Block (ESB)
      Memory and wide combinational functions
    16 Logic Array Blocks (LABs)
      10 Logic Elements (LEs)
        4-input lookup table
        Carry and cascade
        Flip-flop

Each level of hierarchy has its own interconnect (wires).

LE Computation and Storage
- 4-input lookup table (LUT)
- Carry-chain computation circuitry
- Cascade-chain computation circuitry
- Flip-flop with load, clear, clock-enable

LE Interconnect
- 4 data inputs
- 2 data outputs
- Carry in, carry out
- Cascade in, cascade out
- Clock, clock-enable
- Async clear, synch set (load), synch clear (reset)
- Global reset

Initialization

The Altera APEX20K chips initialize all flip-flops to '0' at startup. To mimic this behaviour in simulation, you should put an initial value of '0' on all flip-flops. If you are doing your own encoding for a state machine, choose the reset state to be encoded as all zeroes. You should not put initial values on inputs or combinational signals.

2.3 Design Flow

2.3.1 Generic Design Flow

Most people agree on the general terminology and process for a digital hardware design flow. However, each book and course has its own particular way of presenting the ideas. Here we will lay out the consistent set of definitions that we will use in E&CE 427. This might be different from what you have seen in other courses or on a work term. Focus on the ideas and you will be fine both now and in the future.

The design flow presented here focuses on the artifacts that we work with, rather than the operations that are performed on the artifacts. This is because the same operations can be performed at different points in the design flow, while the artifacts each have a unique purpose.

[Figure: chain of artifacts Requirements, Algorithm, High-Level Model, DP+Ctrl Code (dp/ctrl specific), Opt. RTL Code, Implementation, Hardware, with Modify and Analyze steps at each artifact]
Figure 2.1: Generic Design Flow

Table 2.1: Artifacts in the Design Flow

Requirements
  Description of what the customer wants.
Algorithm
  Functional description of computation. Probably not synthesizable. Could be a flowchart, software, diagram, mathematical equation, etc.
High-Level Model
  HDL code that is not necessarily synthesizable, but divides the algorithm into signals and clock cycles. Possibly mixes datapath and control. In VHDL, could be a single process that captures the behaviour of the algorithm. Usually synthesizable; resulting hardware is usually big and slow compared to optimized RTL code.
Dataflow Diagram
  A picture that depicts the datapath computation over time, clock-cycle by clock-cycle (Section 2.6).
Hardware Block Diagram
  A picture that depicts the structure of the datapath: the components and the connections between the components (e.g., netlist or schematic).
State Machine
  A picture that depicts the behaviour of the control circuitry over time (Section 2.5).
DP+Ctrl RTL Code
  Synthesizable HDL code that separates the datapath and control into separate processes and assignments.
Optimized RTL Code
  HDL code that has been written to meet design goals (high performance, low power, small, etc.).
Implementation Code
  A collection of files that include all of the information needed to build the circuit: HDL program targeted for a particular implementation technology (e.g. a specific FPGA chip), constraint files, script files, etc.

Note: Recommendation
Spend the time up front to plan a good design on paper. Use dataflow diagrams and state machines to predict performance and area. The E&CE 427 project might appear to be sufficiently small and simple that you can go straight to RTL code. However, you will probably produce a more optimal design with less effort if you explore high-level optimizations with dataflow diagrams and state machines.
2.3.2 Implementation Flows

Synopsys Design Compiler and FPGA Compiler are general-purpose synthesis programs. They have very few, if any, technology-specific algorithms. Instead, they rely on libraries to describe technology-specific parameters of the primitive building blocks (e.g. the delay and area of individual gates, PLAs, CLBs, flops, memory arrays).

Mentor Graphics' product Leonardo Spectrum, Cadence's product BuildGates, and Synplicity's product Synplify are similar. In comparison, Avant! (now owned by Synopsys) and Cadence sell separate tools that do place-and-route and other low-level (physical design) tasks.

These general-purpose synthesis tools do not (generally) do the final stages of the design, such as place-and-route and timing analysis, which are very specific to a given implementation technology. The implementation-technology-specific tools generally also produce a VHDL file that accurately models the chip. We will refer to this file as the implementation VHDL code.

With Synopsys and the Altera tool Quartus, we compile the VHDL code into an EDIF file for the netlist and a TCL file for the commands to Quartus. Quartus then generates a sof (SRAM Object File), which can be downloaded to an Altera SRAM-based FPGA. The extension of the implementation VHDL file is often .vho, for VHDL output.

With the Synopsys and Xilinx tools, we compile VHDL code into a Xilinx-specific design file (xnf, Xilinx netlist file). We then use the Xilinx tools to generate a bit file, which can be downloaded to a Xilinx FPGA. The name of the implementation VHDL file is often suffixed with routed.vhd.

Terminology: Behavioural and Structural

Note: behavioural and structural models
The phrases "behavioural model" and "structural model" are commonly used for what we'll call high-level models and synthesizable models.
In most cases, what people call structural code contains both structural and behavioural code. The technically correct definition of a structural model is an HDL program that contains only component instantiations and generate statements. Thus, even a program with c <= a AND b; is, strictly speaking, behavioural.

2.3.3 Design Flow: Datapath vs Control vs Storage

2.3.3.1 Classes of Hardware

Each circuit tends to be dominated by either its datapath, control (state machine) or storage (memory).

Datapath
- Purpose: compute output data based on input data
- Each parcel of input produces one parcel of output
- Examples: arithmetic, decoders

Storage
- Purpose: hold data for future use
- Data is not modified while stored
- Examples: register files, FIFO queues

Control
- Purpose: modify internal state based on inputs, compute outputs from state and inputs
- Mostly individual signals, few data (vectors)
- Examples: bus arbiters, memory-controllers

All three classes of circuits (datapath, control, and storage) follow the same generic design flow (Figure 2.1) and use dataflow diagrams, hardware block diagrams, and state machines. The differences in the design flows appear in the relative amount of effort spent on each type of description and the order in which the different descriptions are used. The differences are most pronounced in the transition from the high-level model to the model that separates the datapath and control circuitry.

2.3.3.2 Datapath-Centric Design Flow

[Figure: High-Level Model, Dataflow, Block Diagram, State Machine, DP+Ctrl RTL Code, with Modify and Analyze steps]
Figure 2.2: Datapath-Centric Design Flow

2.3.3.3 Control-Centric Design Flow

[Figure: High-Level Model, State Machine, Dataflow Diagram, Block Diagram, DP+Ctrl RTL Code, with Modify and Analyze steps]
Figure 2.3: Control-Centric Design Flow

2.3.3.4 Storage-Centric Design Flow

In E&CE 427, we won't be discussing storage-centric design.
Storage-centric design differs from datapath- and control-centric design in that storage-centric design focusses on building many replicated copies of small cells.

Storage-centric designs include a wide range of circuits, from simple memory arrays to complicated circuits such as register files, translation lookaside buffers, and caches. The complicated circuits can contain large and very intricate state machines, which would benefit from some of the techniques for control-centric circuits.

2.4 Algorithms and High-Level Models

For designs with significant control flow, algorithms can be described in software languages, flowcharts, abstract state machines, algorithmic state machines, etc.

For designs with trivial control flow (e.g. every parcel of input data undergoes the same computation), data-dependency graphs (Section 2.4.2) are a good way to describe the algorithm.

For designs with a small amount of control flow (e.g. a microprocessor, where a single decision is made based upon the opcode), a set of data-dependency graphs is often a good choice.

Software executes in series; hardware executes in parallel.

When creating an algorithmic description of your hardware design, think about how you can represent parallelism in the algorithmic notation that you are using, and how you can exploit parallelism to improve the performance of your design.

2.4.1 Flow Charts and State Machines

Flow charts and various flavours of state machines are covered well in many courses. Generally, everything that you've learned about these forms of description is also applicable in hardware design. In addition, you can exploit parallelism in state machine design to create communicating finite state machines. A single complex state machine can be factored into multiple simple state machines that operate in parallel and communicate with each other.
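As a small illustration of communicating finite state machines, the sketch below splits one job between a controller and a timer that talk through the signals start and done. It is a hypothetical example, not taken from the course material: the entity two_fsms, its ports, the state names, and the three-cycle delay are all assumptions.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example: a controller FSM and a timer FSM running in parallel.
entity two_fsms is
  port (
    clk, reset, go : in  std_logic;
    busy           : out std_logic
  );
end two_fsms;

architecture main of two_fsms is
  type ctrl_ty  is (idle, waiting);
  type timer_ty is (stopped, t1, t2, t3);
  signal ctrl        : ctrl_ty  := idle;
  signal timer       : timer_ty := stopped;
  signal start, done : std_logic;
begin
  -- Controller: decides when work begins and when it has finished.
  process (clk) begin
    if rising_edge(clk) then
      if reset = '1' then
        ctrl <= idle;
      else
        case ctrl is
          when idle    => if go = '1'   then ctrl <= waiting; end if;
          when waiting => if done = '1' then ctrl <= idle;    end if;
        end case;
      end if;
    end if;
  end process;
  start <= '1' when (ctrl = idle and go = '1') else '0';
  busy  <= '1' when ctrl = waiting else '0';

  -- Timer: after start, steps through three states and then reports done.
  process (clk) begin
    if rising_edge(clk) then
      if reset = '1' then
        timer <= stopped;
      else
        case timer is
          when stopped => if start = '1' then timer <= t1; end if;
          when t1      => timer <= t2;
          when t2      => timer <= t3;
          when t3      => timer <= stopped;
        end case;
      end if;
    end if;
  end process;
  done <= '1' when timer = t3 else '0';
end main;
```

Each machine stays simple on its own; changing the delay only touches the timer, and changing the start condition only touches the controller.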
2.4.2 Data-Dependency Graphs

In software, the expression (((((a + b) + c) + d) + e) + f) takes the same amount of time to execute as (a + b) + (c + d) + (e + f).

But, remember: hardware runs in parallel. In algorithmic descriptions, parentheses can guide parallel vs serial execution. Data-dependency graphs capture algorithms of datapath-centric designs. Datapath-centric designs have few, if any, control decisions: every parcel of input data undergoes the same computation.

Serial: (((((a+b)+c)+d)+e)+f)
  5 adders on longest path (slower)
  5 adders used (equal area)

Parallel: (a+b)+(c+d)+(e+f)
  3 adders on longest path (faster)
  5 adders used (equal area)

[Figures: data-dependency graphs of the serial and parallel expressions over inputs a, b, c, d, e, f]

2.4.3 High-Level Models

There are many different types of high-level models, depending upon the purpose of the model and the characteristics of the design that the model describes. Some models may capture power consumption, others performance, others data functionality.

High-level models are used to estimate the most important design metrics very early in the design cycle. If power consumption is more important than performance, then you might write high-level models that can predict the power consumption of different design choices, but which have no information about the number of clock cycles that a computation takes, or which predict the latency inaccurately. Conversely, if performance is important, you might write clock-cycle-accurate high-level models that do not contain any information about power consumption.

Conventionally, performance has been the primary design metric. Hence, high-level models that predict performance are more prevalent and better understood than other types of high-level models. There are many research and entrepreneurial opportunities for people who can develop tools and/or languages for high-level models for estimating power, area, maximum clock speed, etc.
In E&CE 427 we will limit ourselves to the well-understood area of high-level models for performance prediction.

2.5 Finite State Machines in VHDL

2.5.1 Introduction to State-Machine Design

2.5.1.1 Mealy vs Moore State Machines

Moore Machines
  Outputs are dependent upon only the state
  No combinational paths from inputs to outputs
  [State diagram: states s0/0, s1/1, s2/0, s3/0; the transitions out of s0 are labelled a and !a]

Mealy Machines
  Outputs are dependent upon both the state and the inputs
  Combinational paths from inputs to outputs
  [State diagram: states s0, s1, s2, s3; transition labels a/1, !a/0, /0, /0]

2.5.1.2 Introduction to State Machines and VHDL

A state machine is generally written as a single clocked process, or as a pair of processes, where one is clocked and one is combinational.

Design Decisions
  Moore vs Mealy (Sections 2.5.2 and 2.5.3)
  Implicit vs Explicit (Section 2.5.1.3)
  State values in explicit state machines: enumerated type vs constants (Section 2.5.5.1)
  State values for constants: encoding scheme (binary, gray, one-hot, ...) (Section 2.5.5)

VHDL Constructs for State Machines

The following VHDL control constructs are useful to steer the transition from state to state:
  if ... then ... else
  case
  for ... loop
  while ... loop
  loop
  next
  exit

2.5.1.3 Explicit vs Implicit State Machines

There are two broad styles of writing state machines in VHDL: explicit and implicit. Explicit and implicit refer to whether there is an explicit state signal in the VHDL code. Explicit state machines have a state signal in the VHDL code. Implicit state machines do not contain a state signal.
Instead, they use VHDL processes with multiple wait statements to control the execution. In the explicit style of writing state machines, each process has at most one wait statement.

For the explicit style of writing state machines, there are two sub-categories: current state and current+next state.

In the explicit-current style of writing state machines, the state signal represents the current state of the machine and the signal is assigned its next value in a clocked process.

In the explicit-current+next style, there is a signal for the current state and another signal for the next state. The next-state signal is assigned its value in a combinational process or concurrent statement and is dependent upon the current state and the inputs. The current-state signal is assigned its value in a clocked process and is just a flopped copy of the next-state signal.

For the implicit style of writing state machines, the synthesis program adds an implicit register to hold the state signal and combinational circuitry to update the state signal. In Synopsys synthesis tools, the state signal defined by the synthesizer is named multiple_wait_state_reg. In Mentor Graphics, the state signal is named STATE_VAR.

We can think of implicit state machines as having 0 state signals, explicit-current state machines as having 1 state signal, and explicit-current+next state machines as having 2 state signals.

As with all topics in E&CE 427, there are tradeoffs between these different styles of writing state machines. Most books teach only the explicit-current+next style. This style is the closest to the hardware, which means that it is more amenable to optimization through human intervention, rather than relying on a synthesis tool for optimization. The advantage of the implicit style is that it is concise and readable for control flows consisting of nested loops and branches (e.g. the type of control flow that appears in software).
For control flows that have less structure, it can be difficult to write an implicit state machine. Very few books or synthesis manuals describe multiple-wait statement processes, but they are relatively well supported among synthesis tools.

Because implicit state machines are written with loops, if-then-elses, cases, etc., it is difficult to write some state machines with complicated control flows in an implicit style. The following example illustrates the point.

[State diagram: states s0/0, s1/1, s2/0, s3/0 with transitions labelled a, !a, !a, a]

Note: The terminology of explicit and implicit is somewhat standard, in that some descriptions of processes with multiple wait statements describe the processes as having implicit state machines. There is no standard terminology to distinguish between the two explicit styles: explicit-current+next and explicit-current.

2.5.2 Implementing a Simple Moore Machine

[State diagram: s0/0 goes to s1/1 on a and to s2/0 on !a; s1 and s2 go to s3/0; s3 returns to s0]

    entity simple is
      port (
        a, clk : in  std_logic;
        z      : out std_logic
      );
    end simple;

2.5.2.1 Implicit Moore State Machine

    architecture moore_implicit_v1a of simple is
    begin
      process begin
        z <= '0';
        wait until rising_edge(clk);
        if (a = '1') then
          z <= '1';
        else
          z <= '0';
        end if;
        wait until rising_edge(clk);
        z <= '0';
        wait until rising_edge(clk);
      end process;
    end moore_implicit_v1a;

    Flops   3
    Gates   2
    Delay   1 gate

2.5.2.2 Explicit Moore with Flopped Output

    architecture moore_explicit_v1 of simple is
      type state_ty is (s0, s1, s2, s3);
      signal state : state_ty;
    begin
      process (clk) begin
        if rising_edge(clk) then
          case state is
            when s0 =>
              if (a = '1') then
                state <= s1;
                z     <= '1';
              else
                state <= s2;
                z     <= '0';
              end if;
            when s1 | s2 =>
              state <= s3;
              z     <= '0';
            when s3 =>
              state <= s0;
              z     <= '0';  -- the output in s0 is 0
          end case;
        end if;
      end process;
    end moore_explicit_v1;

    Flops   3
    Gates   10
    Delay   3 gates

2.5.2.3 Explicit Moore with Combinational Outputs

    architecture moore_explicit_v2 of simple is
      type state_ty is (s0, s1, s2, s3);
      signal state : state_ty;
    begin
      process (clk) begin
        if rising_edge(clk) then
          case state is
            when s0 =>
              if (a = '1') then
                state <= s1;
              else
                state <= s2;
              end if;
            when s1 | s2 =>
              state <= s3;
            when s3 =>
              state <= s0;
          end case;
        end if;
      end process;
      z <= '1' when (state = s1) else '0';
    end moore_explicit_v2;

    Flops   2
    Gates   7
    Delay   4 gates

2.5.2.4 Explicit-Current+Next Moore with Concurrent Assignment

    architecture moore_explicit_v3 of simple is
      type state_ty is (s0, s1, s2, s3);
      signal state, state_nxt : state_ty;
    begin
      process (clk) begin
        if rising_edge(clk) then
          state <= state_nxt;
        end if;
      end process;
      state_nxt <= s1 when (state = s0) and (a = '1') else
                   s2 when (state = s0) and (a = '0') else
                   s3 when (state = s1) or (state = s2) else
                   s0;
      z <= '1' when (state = s1) else '0';
    end moore_explicit_v3;

    Flops   2
    Gates   7
    Delay   4

The hardware synthesized from this architecture is the same as that synthesized from moore_explicit_v2, which is written in the explicit-current style.
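One way to check in simulation that an explicit Moore architecture follows the state diagram is a small self-checking testbench. The sketch below is an assumption-laden illustration rather than part of the course material: the testbench entity simple_tb, the done signal, and the 10 ns clock are invented for the example, and the architecture moore_explicit_v2 is repeated so that the listing is self-contained. Because the entity has no reset, the sketch relies on the simulator initializing state to s0, the leftmost value of the enumerated type.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity simple is
  port ( a, clk : in  std_logic;
         z      : out std_logic );
end simple;

architecture moore_explicit_v2 of simple is
  type state_ty is (s0, s1, s2, s3);
  signal state : state_ty;  -- simulation default is s0 (leftmost value)
begin
  process (clk) begin
    if rising_edge(clk) then
      case state is
        when s0      => if a = '1' then state <= s1; else state <= s2; end if;
        when s1 | s2 => state <= s3;
        when s3      => state <= s0;
      end case;
    end if;
  end process;
  z <= '1' when state = s1 else '0';
end moore_explicit_v2;

library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench: simple_tb, done, and the timing are assumptions.
entity simple_tb is
end simple_tb;

architecture sim of simple_tb is
  signal a    : std_logic := '0';
  signal clk  : std_logic := '0';
  signal z    : std_logic;
  signal done : boolean := false;
begin
  uut : entity work.simple(moore_explicit_v2)
    port map ( a => a, clk => clk, z => z );

  -- 10 ns clock that stops toggling once the stimulus has finished
  clk <= not clk after 5 ns when not done else '0';

  process begin
    a <= '1';
    wait until rising_edge(clk);  -- s0 -> s1
    wait for 1 ns;
    assert z = '1' report "expected z = 1 in s1" severity error;
    a <= '0';
    wait until rising_edge(clk);  -- s1 -> s3
    wait for 1 ns;
    assert z = '0' report "expected z = 0 in s3" severity error;
    wait until rising_edge(clk);  -- s3 -> s0
    wait until rising_edge(clk);  -- s0 -> s2, because a = 0
    wait for 1 ns;
    assert z = '0' report "expected z = 0 in s2" severity error;
    done <= true;
    wait;                         -- suspend forever
  end process;
end sim;
```

Binding the unit under test to a different architecture name in the instantiation (for example moore_explicit_v3, if it is compiled into the same library) would reuse the same checks for another coding style.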
2.5.2.5 Explicit-Current+Next Moore with Combinational Process

    architecture moore_explicit_v4 of simple is
      type state_ty is (s0, s1, s2, s3);
      signal state, state_nxt : state_ty;
    begin
      process (clk) begin
        if rising_edge(clk) then
          state <= state_nxt;
        end if;
      end process;
      process (state, a) begin
        case state is
          when s0 =>
            if (a = '1') then
              state_nxt <= s1;
            else
              state_nxt <= s2;
            end if;
          when s1 | s2 =>
            state_nxt <= s3;
          when s3 =>
            state_nxt <= s0;
        end case;
      end process;
      z <= '1' when (state = s1) else '0';
    end moore_explicit_v4;

    Flops   2
    Gates   7
    Delay   4

For this architecture, we change the conditional assignment to state_nxt into a combinational process using a case statement. The hardware synthesized from this architecture is the same as that synthesized from moore_explicit_v2 and moore_explicit_v3.

2.5.3 Implementing a Simple Mealy Machine

Mealy machines have a combinational path from inputs to outputs, which often violates good coding guidelines for hardware. Thus, Moore machines are much more common. You should know how to write a Mealy machine if needed, but most of the state machines that you design will be Moore machines.

This is the same entity as for the simple Moore state machine. The behaviour of the Mealy machine is the same as the Moore machine, except for the timing relationship between the output (z) and the input (a).

[State diagram: s0 goes to s1 on a/1 and to s2 on !a/0; s1 and s2 go to s3 with output 0]

    entity simple is
      port (
        a, clk : in  std_logic;
        z      : out std_logic
      );
    end simple;

2.5.3.1 Implicit Mealy State Machine

Note: An implicit Mealy state machine is nonsensical. In an implicit state machine, we do not have a state signal. But, as the example below illustrates, to create a Mealy state machine we must have a state signal.

An implicit style is a nonsensical choice for Mealy state machines. Because the output is dependent upon the input in the current clock cycle, the output cannot be a flop.
For the output to be combinational and dependent upon both the current state and the current input, we must create a state signal that we can read in the assignment to the output. Creating a state signal obviates the advantages of using an implicit style of state machine.

    architecture mealy_implicit of simple is
      type state_ty is (s0, s1, s2, s3);
      signal state : state_ty;
    begin
      process begin
        state <= s0;
        wait until rising_edge(clk);
        if (a = '1') then
          state <= s1;
        else
          state <= s2;
        end if;
        wait until rising_edge(clk);
        state <= s3;
        wait until rising_edge(clk);
      end process;
      z <= '1' when (state = s0) and (a = '1') else '0';
    end mealy_implicit;

    Flops   4
    Gates   8
    Delay   2 gates

2.5.3.2 Explicit Mealy State Machine

    architecture mealy_explicit of simple is
      type state_ty is (s0, s1, s2, s3);
      signal state : state_ty;
    begin
      process (clk) begin
        if rising_edge(clk) then
          case state is
            when s0 =>
              if (a = '1') then
                state <= s1;
              else
                state <= s2;
              end if;
            when s1 | s2 =>
              state <= s3;
            when others =>
              state <= s0;
          end case;
        end if;
      end process;
      z <= '1' when (state = s0) and (a = '1') else '0';
    end mealy_explicit;

    Flops   2
    Gates   7
    Delay   3

2.5.3.3 Explicit-Current+Next Mealy

    architecture mealy_explicit_v2 of simple is
      type state_ty is (s0, s1, s2, s3);
      signal state, state_nxt : state_ty;
    begin
      process (clk) begin
        if rising_edge(clk) then
          state <= state_nxt;
        end if;
      end process;
      state_nxt <= s1 when (state = s0) and (a = '1') else
                   s2 when (state = s0) and (a = '0') else
                   s3 when (state = s1) or (state = s2) else
                   s0;
      z <= '1' when (state = s0) and (a = '1') else '0';
    end mealy_explicit_v2;

    Flops   2
    Gates   4
    Delay   3

For the Mealy machine, the explicit-current+next style is smaller than the explicit-current style. In contrast, for the Moore machine, the two styles produce exactly the same hardware.

2.5.4 Reset

All circuits should have a reset signal that puts the circuit back into a good initial state.
However, not all flip-flops within the circuit need to be reset. In a circuit that has a datapath and a state machine, the state machine will probably need to be reset, but the datapath may not need to be reset.

There are standard ways to add a reset signal to both explicit and implicit state machines. It is important that reset is tested on every clock cycle, otherwise a reset might not be noticed, or your circuit will be slow to react to reset and could generate illegal outputs after reset is asserted.

Reset with Implicit State Machine

With an implicit state machine, we need to insert a loop in the process and test for reset after each wait statement. Here is the implicit Moore machine from section 2.5.2.1 with reset code added (marked with comments).

    architecture moore_implicit of simple is
    begin
      process begin
        init : loop                          -- outermost loop
          z <= '0';
          wait until rising_edge(clk);
          next init when (reset = '1');      -- test for reset
          if (a = '1') then
            z <= '1';
          else
            z <= '0';
          end if;
          wait until rising_edge(clk);
          next init when (reset = '1');      -- test for reset
          z <= '0';
          wait until rising_edge(clk);
          next init when (reset = '1');      -- test for reset
        end loop init;
      end process;
    end moore_implicit;

Reset with Explicit State Machine

Reset is often easier to include in an explicit state machine, because we need only put a test for reset = '1' in the clocked process for the state.

The pattern for an explicit-current style of machine is:

    process (clk) begin
      if rising_edge(clk) then
        if reset = '1' then
          state <= S0;
        else
          if ... then
            state <= ...;
          elsif ... then
            ... -- more tests and assignments to state
          end if;
        end if;
      end if;
    end process;

Applying this pattern to the explicit Moore machine from section 2.5.2.3 produces:

    architecture moore_explicit_v2 of simple is
      type state_ty is (s0, s1, s2, s3);
      signal state : state_ty;
    begin
      process (clk) begin
        if rising_edge(clk) then
          if (reset = '1') then
            state <= s0;
          else
            case state is
              when s0 =>
                if (a = '1') then
                  state <= s1;
                else
                  state <= s2;
                end if;
              when s1 | s2 =>
                state <= s3;
              when s3 =>
                state <= s0;
            end case;
          end if;
        end if;
      end process;
      z <= '1' when (state = s1) else '0';
    end moore_explicit_v2;

The pattern for an explicit-current+next style is:

    process (clk) begin
      if rising_edge(clk) then
        if reset = '1' then
          state_cur <= <reset state>;
        else
          state_cur <= state_nxt;
        end if;
      end if;
    end process;

2.5.5 State Encoding

When working with explicit state machines, we must address the issue of state encoding: what bit-vector value to associate with each state?

With implicit state machines, we do not need to worry about state encoding. The synthesis program determines the number of states and the encoding for each state.

2.5.5.1 Constants vs Enumerated Type

Using an enumerated type, the synthesis tool chooses the encoding:

    type state_ty is (s0, s1, s2, s3);
    signal state : state_ty;

Using constants, we choose the encoding:

    subtype state_ty is std_logic_vector(1 downto 0);
    constant s0 : state_ty := "11";
    constant s1 : state_ty := "10";
    constant s2 : state_ty := "00";
    constant s3 : state_ty := "01";
    signal state : state_ty;

Providing Encodings for Enumerated Types

Many synthesizers allow the user to provide hints on how to encode the states, or allow the user to provide explicitly the desired encoding. These hints are done either through VHDL attributes or special comments in the code.

Simulation
When doing functional simulation with enumerated types, simulators often display waveforms with pretty-printed values rather than bits (e.g. s0 and s1 rather than 11 and 10). However, when simulating a design that has been mapped to gates, the enumerated type disappears and you are left with just bits. If you don't know the encoding that the synthesis tool chose, it can be very difficult to debug the design.

However, this opens you up to potential bugs if the enumerated type you are testing grows to include more values, which then end up unintentionally executing your when others branch, rather than having a special branch of their own in the case statement.

Unused Values

If the number of values you have in your datatype is not a power of two, then you will have some unused values that are representable. For example:

    subtype state_ty is std_logic_vector(2 downto 0);
    constant s0 : state_ty := "011";
    constant s1 : state_ty := "000";
    constant s2 : state_ty := "001";
    constant s3 : state_ty := "100";
    constant s4 : state_ty := "101";
    signal state : state_ty;

This type only needs five unique values, but can represent eight different values. What should we do with the three representable values that we don't need? The safest thing to do is to code your design so that if an illegal value is encountered, the machine resets or enters an error state.

2.5.5.2 Encoding Schemes

Binary: Conventional binary counter.

One-hot: Exactly one bit is asserted at any time.

Modified one-hot: Altera's Quartus synthesizer generates an almost-one-hot encoding where the bit representing the reset state is inverted. This means that the reset state is all 0s and all other states have two 1s: one for the reset state and one for the current state.

Gray: Transition between adjacent values requires exactly one bit flip.

Custom: Choose encoding to simplify combinational logic for a specific task.

Tradeoffs in Encoding Schemes

Gray is good for low-power applications where consecutive data values typically differ by 1 (e.g. no random jumps).

One-hot usually has less combinational logic and runs faster than binary for machines with up to a dozen or so states. With more than a dozen states, the extra flip-flops required by one-hot encoding become too expensive.

Custom is great if you have lots of time and are incredibly intelligent, or have deep insight into the guts of your design.

Note: Don't care values. When we don't care what the value of a signal is, we assign the signal '-', which is "don't care" in VHDL. This should allow the synthesis tool to use whatever value is most helpful in simplifying the Boolean equations for the signal (e.g. Karnaugh maps). In the past, some groups in E&CE 427 have used '-' quite successfully to decrease the area of their design. However, a few groups found that using '-' increased the size of their design, when they were expecting it to decrease the size. So, if you are tweaking your design to squeeze out the last few unneeded FPGA cells, pay close attention as to whether using '-' hurts or helps.

2.6 Dataflow Diagrams

2.6.1 Dataflow Diagrams Overview

Dataflow diagrams are data-dependency graphs where the computation is divided into clock cycles.

Purpose:
  Provide a disciplined approach for designing datapath-centric circuits
  Guide the design from algorithm, through high-level models, and finally to register transfer level code for the datapath and control circuitry
  Estimate area and performance
  Make tradeoffs between different design options

Background:
  Based on techniques from high-level synthesis tools
  Some similarity between high-level synthesis and software compilation
  Each dataflow diagram corresponds to a basic block in software compiler terminology.
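To make the step from a dataflow diagram to register transfer level code concrete, here is a minimal sketch of the parallel sum (a+b)+(c+d)+(e+f) divided into two clock cycles. It is an illustration under assumptions, not the course's design: the entity name sum6, the 8-bit unsigned operands (so the sum wraps on overflow), and the choice of where to place the cycle boundary are all invented for the example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: sum6, the 8-bit width, and the two-cycle split
-- are illustrative assumptions.
entity sum6 is
  port ( clk              : in  std_logic;
         a, b, c, d, e, f : in  unsigned(7 downto 0);
         z                : out unsigned(7 downto 0) );
end sum6;

architecture main of sum6 is
  -- Signals that cross the clock-cycle boundary become registers.
  signal x1, x2, x3 : unsigned(7 downto 0);
begin
  process (clk) begin
    if rising_edge(clk) then
      -- First clock cycle: three independent additions in parallel.
      x1 <= a + b;
      x2 <= c + d;
      x3 <= e + f;
      -- Second clock cycle: two chained additions on the registered sums.
      z  <= (x1 + x2) + x3;
    end if;
  end process;
end main;
```

In this sketch every signal assigned inside the clocked process is a register, so the design uses five adders and four registers, and the longest combinational path within a clock cycle passes through two adders.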
[Figure: data-dependency graph for z = a + b + c + d + e + f, with intermediate sums x1 to x4]

[Figure: dataflow diagram for z = a + b + c + d + e + f. Horizontal lines mark clock cycle boundaries.]

The use of memory arrays in dataflow diagrams is described in section 2.7.3.

2.6.2 Dataflow Diagrams, Hardware, and Behaviour

[Figures: each of the following cases is drawn as a dataflow diagram, the corresponding hardware, and a waveform of the behaviour]
  Primary Input
  Register Input
  Register Signal
  Combinational-Component Output
  Read of Memory with Registered Inputs
  Write to Memory with Registered Inputs
  Dual-Port Memory with Registered Inputs
  Sequence of Memory Operations

2.6.3 Dataflow Diagram Execution

[Figure: Execution with Registers on Both Inputs and Outputs (dataflow diagram and waveform)]

[Figure: Execution Without Output Registers (dataflow diagram and waveform)]

2.6.4 Performance Estimation

Performance Equations

  $\text{Performance} \propto \dfrac{1}{\text{TimeExec}}$

  $\text{TimeExec} = \text{Latency} \times \text{ClockPeriod}$

  Latency = number of clock cycles from inputs to outputs

There is much more information on performance in chapter 4, which is devoted to performance.

Performance of Dataflow Diagrams

  Latency: count horizontal lines in diagram
  Min clock period (max clock speed) is limited by the longest path in a clock cycle

2.6.5 Area Estimation

  Maximum number of blocks in a clock cycle is the total number of that component that are needed
  Maximum number of signals that cross a cycle boundary is the total number of registers that are needed
  Maximum number of unconnected signal tails in a clock cycle is the total number of inputs that are needed
  Maximum number of unconnected signal heads in a clock cycle is the total number of outputs that are needed

The information above is only for estimating the number of components that are needed. In fact, these estimates give lower bounds. There might be constraints on your design that will force you to use more components (e.g., you might need to read all of your inputs at the same time).
Implementation-technology factors, such as the relative size of registers, multiplexers, and datapath components, might force you to make tradeoffs that increase the number of datapath components to decrease the overall area of the circuit.

Of particular relevance to FPGAs:
  With some FPGA chips, a 2:1 multiplexer has the same area as an adder.
  With some FPGA chips, a 2:1 multiplexer can be combined with an adder into one FPGA cell per bit.
  In FPGAs, registers are usually "free", in that the area consumed by a circuit is limited by the amount of combinational logic, not the number of flip-flops.

In comparison, with ASICs and custom VLSI, 2:1 multiplexers are much smaller than adders, and registers are quite expensive in area.

2.6.6 Design Analysis

[Figure: dataflow diagram for z = a + b + c + d + e + f with one add per clock cycle]

  num inputs         6
  num outputs        1
  num registers      6
  num adders         1
  min clock period   delay through flop and one adder
  latency            6 clock cycles

2.6.7 Area / Performance Tradeoffs

[Figure: two dataflow diagrams side by side, one add per clock cycle (clock cycles 0 to 6) and two adds per clock cycle (clock cycles 0 to 4)]

Note: In the two-add design, half of the last clock cycle is wasted.

Two Adds per Clock Cycle

[Figure: dataflow diagram and waveform for the two-adds-per-clock-cycle design]

Design Comparison

                   One add per clock cycle    Two adds per clock cycle
  inputs           6                          6
  outputs          1                          1
  registers        6                          6
  adders           1                          2
  clock period     flop + 1 add               flop + 2 add
  latency          6                          4

Question: Under what circumstances would each design option be fastest?
Answer: time = latency × clock period. Compare the execution times for both options, where $T_f$ is the delay through a flop and $T_a$ is the delay through an adder:

  $T_1 = 6 \times (T_f + T_a)$
  $T_2 = 4 \times (T_f + 2 T_a)$

One-add will be faster when $T_1 < T_2$:

  $6 \times (T_f + T_a) < 4 \times (T_f + 2 T_a)$
  $6 T_f + 6 T_a < 4 T_f + 8 T_a$
  $2 T_f < 2 T_a$
  $T_f < T_a$

Sanity check: If add is slower than flop, then we want to minimize the number of adds. One-add has fewer adds, so one-add will be faster when add is slower than flop.

2.7 Memory Arrays and RTL Design

2.7.1 Memory Arrays in VHDL

2.7.1.1 Using a Two-Dimensional Array for Memory

A memory array can be written in VHDL as a two-dimensional array:

    subtype data is std_logic_vector(7 downto 0);
    type data_vector is array( natural range <> ) of data;
    signal mem : data_vector(31 downto 0);

These two-dimensional arrays can be useful in high-level models and in specifications. However, it is possible to write code using a two-dimensional array that cannot be synthesized. Also, some synthesis tools (including Synopsys Design Compiler and FPGA Compiler) will synthesize two-dimensional arrays very inefficiently.

The example below illustrates: lack of interface protocol, combinational write, multiple write ports, multiple read ports.

    architecture main of mem_not_hw is
      subtype data is std_logic_vector(7 downto 0);
      type data_vector is array( natural range <> ) of data;
      signal mem : data_vector(31 downto 0);
    begin
      y <= mem( a );             -- comb read
      mem( a ) <= b;             -- comb write
      process (clk) begin
        if rising_edge(clk) then
          mem( c ) <= w;         -- write port #1
        end if;
      end process;
      process (clk) begin
        if rising_edge(clk) then
          mem( d ) <= v;         -- write port #2
        end if;
      end process;
      u <= mem( e );             -- read port #2
    end main;

2.7.1.2 Memory Arrays in Hardware

[Figure: block symbols for a single-ported memory (WE, A, DI, DO) and a dual-ported memory (WE, A0, DI0, DO0, A1, DO1)]

Most simple memory arrays are single- or dual-ported, support just one write operation at a time, and have an interface protocol using a clock and write-enable.

2.7.1.3 VHDL Code for Single-Port Memory Array

    library ieee;
    use ieee.std_logic_1164.all;

    package mem_pkg is
      subtype data is std_logic_vector(7 downto 0);
      type data_vector is array( natural range <> ) of data;
    end;

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;
    use work.mem_pkg.all;

    entity mem is
      port (
        clk : in  std_logic;
        we  : in  std_logic;             -- write enable
        a   : in  unsigned(4 downto 0);  -- address
        di  : in  data;                  -- data_in
        do  : out data                   -- data_out
      );
    end mem;

    architecture main of mem is
      signal mem : data_vector(31 downto 0);
    begin
      do <= mem( to_integer( a ) );
      process (clk) begin
        if rising_edge(clk) then
          if we = '1' then
            mem( to_integer( a ) ) <= di;
          end if;
        end if;
      end process;
    end main;

The VHDL code above is accurate in its behaviour and interface, but might be synthesized as distributed memory (a large number of flip-flops in FPGA cells), which will be very large and very slow in comparison to a block of memory.

Synopsys synthesis tools implement each bit in a two-dimensional array as a flip-flop.

Each FPGA and ASIC vendor supplies libraries of memory arrays that are smaller and faster than a two-dimensional array of flip-flops. These libraries exploit specialized hardware on the chips to implement the memory.

Note: To synthesize a reasonable implementation of a memory array with Synopsys, you must instantiate a vendor-supplied memory component. Some other synthesis tools, such as Xilinx XST, can infer memory arrays from two-dimensional arrays and synthesize efficient implementations.

Recommended Design Process with Memory

  1. high-level model with two-dimensional array
  2. two-dimensional array packaged inside memory entity/architecture
  3. vendor-supplied component

2.7.1.4 Using Library Components for Memory

Altera

Altera uses MegaFunctions to implement RAM in VHDL. A MegaFunction is a black-box description of hardware on the FPGA. There are tools in Quartus to generate VHDL code for RAM components of different sizes.
In E&CE 427 we will provide you with the VHDL code for the RAM components that you will need in Lab-3 and the Project.

The APEX20KE chips that we are using have dedicated SRAM blocks called Embedded System Blocks (ESB). Each ESB can store 2048 bits and can be configured in any of the following sizes:

  Number of Elements    Word Size (bits)
  2048                  1
  1024                  2
  512                   4
  256                   8
  128                   16

Xilinx

Use component instantiation to get these components:

  ram16x1s    16×1 single-ported memory
  ram16x1d    16×1 dual-ported memory

Other sizes are also available; consult the datasheet for your chip.

2.7.1.5 Build Memory from Slices

If the vendor's libraries of memory components do not include one that is the correct size for your needs, you can construct your own component from smaller ones.

[Figure 2.4: An N×2W memory from N×W components. WriteEn, Addr, and Clk drive both components; one component stores DataIn[W-1..0] and produces DataOut[W-1..0], the other stores DataIn[2W-1..W] and produces DataOut[2W-1..W].]

[Figure 2.5: A 2N×W memory from N×W components. Addr[logN-1..0] and DataIn drive both components; Addr[logN] selects which component is written and which output is multiplexed onto DataOut.]

A 16×4 Memory from 16×1 Components

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity ram16x4s is
      port (
        clk, we  : in  std_logic;
        data_in  : in  std_logic_vector(3 downto 0);
        addr     : in  unsigned(3 downto 0);
        data_out : out std_logic_vector(3 downto 0)
      );
    end ram16x4s;
    architecture main of ram16x4s is
      component ram16x1s
        port (
          d              : in  std_logic;  -- data in
          a3, a2, a1, a0 : in  std_logic;  -- address
          we             : in  std_logic;  -- write enable
          wclk           : in  std_logic;  -- write clock
          o              : out std_logic   -- data out
        );
      end component;
    begin
      mem_gen : for i in 0 to 3 generate
        ram : ram16x1s
          port map (
            we   => we,
            wclk => clk,
            a3   => addr(3),
            a2   => addr(2),
            a1   => addr(1),
            a0   => addr(0),
            -- d and o are dependent on i
            d    => data_in(i),
            o    => data_out(i)
          );
      end generate;
    end main;

2.7.1.6 Dual-Ported Memory

Dual-ported memory is similar to single-ported memory, except that it allows two simultaneous reads, or a simultaneous read and write.

When doing a simultaneous read and write to the same address, the read will usually not see the data currently being written.

Question: Why do dual-ported memories usually not support writes on both ports?

Answer: What should your memory do if you write different values to the same address in the same clock cycle?

2.7.2 Data Dependencies

Definition of Three Types of Dependencies

There are three types of data dependencies. The names come from pipeline terminology in computer architecture.

  Read after Write       Write after Write      Write after Read
  (True dependency)      (Load dependency)      (Anti dependency)
  M[i] := ...            M[i] := ...            ...  := M[i]
  ...  := M[i]           M[i] := ...            M[i] := ...

Instructions in a program can be reordered, so long as the data dependencies are preserved.

Purpose of Dependencies

  W0:  R3 := ......
  W1:  R3 := ......           (producer)
  R1:  ... := ... R3 ...      (consumer)
  W2:  R3 := ......

  WAW ordering prevents W0 from happening after W1
  RAW ordering prevents R1 from happening before W1
  WAR ordering prevents W2 from happening before R1
Each of the three types of memory dependencies (RAW, WAW, and WAR) serves a specific purpose in ensuring that producer-consumer relationships are preserved.

Ordering of Memory Operations

Initial memory contents: M[3] = 30, M[2] = 20, M[1] = 10, M[0] = 0.

Initial program with dependencies:

    M[2] := 21
    M[3] := 31
    A    := M[2]
    B    := M[0]
    M[3] := 32
    M[0] := 01
    C    := M[3]

[Figure: three orderings of this program side by side: Initial Program with Dependencies, Valid Modification, and Valid (or Bad?) Modification]

Answer: Bad modification: M[3] := 32 must happen before C := M[3].

2.7.3 Memory Arrays and Dataflow Diagrams

Legend for Dataflow Diagrams

[Figure: legend showing the symbols for an input port, an output port, a state signal, an array read, written name(rd), and an array write, written name(wr)]

Basic Memory Operations

[Figure: Memory Read, data := mem[addr]. The mem(rd) operation consumes mem and addr and produces data, with an anti-dependency arrow producing mem.]

[Figure: Memory Write, mem[addr] := data. The mem(wr) operation consumes mem, data, and addr and produces mem.]

Dataflow diagrams show the dependencies between operations. The basic memory operations are similar, in that each arrow represents a data dependency.

There are a few aspects of the basic memory operations that are potentially surprising:
  The anti-dependency arrow producing mem on a read.
  Reads and writes are dependent upon the entire previous value of the memory array.
  The write operation appears to produce an entire memory array, rather than just updating an individual element of an existing array.

Normally, we think of a memory array as stationary. To do a read, an address is given to the array and the corresponding data is produced. In dataflow diagrams, it may be somewhat surprising to see the read and write operations consuming and producing memory arrays.

Our goal is to support memory operations in dataflow diagrams.
We want to model memory operations similarly to datapath operations. When we do a read, the data that is produced depends upon the contents of the memory array and the address. The apparent dependency on, and production of, an entire memory array arises because we do not know which address in the array will be read from or written to. The anti-dependency for memory reads is related to Write-after-Read dependencies, as discussed in Section 2.7.2.

There are optimizations that can be performed when we know the address (Section 2.7.3).

Dataflow Diagrams and Data Dependencies

Read after Write

    Algo: mem[wr_addr] := data_in;
          data_out := mem[rd_addr];

[Two dataflow diagrams for this algorithm: the Read after Write ordering, where mem(rd) consumes the array produced by mem(wr); and the optimization available when rd_addr ≠ wr_addr.]

Write after Write

    Algo: mem[wr1_addr] := data1;
          mem[wr2_addr] := data2;

[Dataflow diagram: the Write after Write ordering, where the second mem(wr) consumes the array produced by the first.]
[Dataflow diagram for the same algorithm (mem[wr1_addr] := data1; mem[wr2_addr] := data2;): the scheduling option available when wr1_addr ≠ wr2_addr, with the two writes in the opposite order.]

Write after Read

    Algo: rd_data := mem[rd_addr];
          mem[wr_addr] := wr_data;

[Two dataflow diagrams for this algorithm: the Write after Read ordering, where mem(wr) consumes the array passed on by mem(rd); and the optimization available when rd_addr ≠ wr_addr.]

2.7.4 Example: Memory Array and Dataflow Diagram

    1   M[2] := 21
    2   M[3] := 31
    3   A    := M[2]
    4   B    := M[0]
    5   M[3] := 32
    6   M[0] := 01
    7   C    := M[3]

[Initial dataflow diagram: the seven operations above chained in program order as M(wr), M(wr), M(rd), M(rd), M(wr), M(wr), M(rd).]

Figure 2.6: Memory array example code and initial dataflow diagram

The dependency and anti-dependency arrows in the dataflow diagram in Figure 2.6 are based solely upon whether an operation is a read or a write. The arrows do not take into account the address that is read from or written to.

In Figure 2.7, we have used knowledge about which addresses we are accessing to remove unneeded dependencies. These are the real dependencies and match those shown in the code fragment for Figure 2.6. In Figure 2.8 we have placed an ordering on the read operations and an ordering on the write operations. The ordering is derived by obeying data dependencies and then rearranging the operations to perform as many operations in parallel as possible.
[Dataflow diagram of the seven memory operations with only address-based dependencies.]

Figure 2.7: Memory array with minimal dependencies

[Dataflow diagram of the seven memory operations with an ordering on the reads and an ordering on the writes.]

Figure 2.8: Memory array with orderings

[Dataflow diagram of the seven memory operations allocated to clock cycles.]

Figure 2.9: Final version of Figure 2.6

• Put as many parallel operations into the same clock cycle as allowed by resources (one write + one read, two reads, or one write for dual-port RAM).
• Preserve dependencies by putting dependent operations in separate clock cycles.

2.8 Input / Output Protocols

An important aspect of hardware design is choosing an input/output protocol that is easy to implement and suits both your circuit and your environment. Here are a few simple and common protocols.

[Timing diagram; signals: rdy, data, ack]

Figure 2.10: Four phase handshaking protocol

• Used when timing of communication between producer and consumer is unpredictable.
• The disadvantage is that it is cumbersome to implement and slow to execute.

[Timing diagram; signals: clk, valid, data]

Figure 2.11: Valid-bit protocol

• A low overhead (both in area and performance) protocol.
• Consumer must always be able to accept incoming data.
• Often used in pipelined circuits.
• More complicated versions of the protocol can handle pipeline stalls.

[Timing diagram; signals: clk, start, data_in, done, data_out]

Figure 2.12: Start/Done protocol

• A low overhead (both in area and performance) protocol.
• Useful when a circuit works on one piece of data at a time and the time to compute the result is unpredictable.

2.9 Design Example: Massey

We'll go through the following artifacts:
1. requirements
2. algorithm
3. dataflow diagram
4. high-level models
5. hardware block diagram
6. RTL code for datapath
7. state machine
8. RTL code for control

Design Process

1. Scheduling (allocate operations to clock cycles)
2. I/O allocation
3. First high-level model
4. Register allocation
5. Datapath allocation
6. Connect datapath components, insert muxes where needed
7. Design implicit state machine
8. Optimize
9. Design explicit-current state machine
10. Optimize

2.9.1 Requirements

Functional requirements:
• Compute the sum of six 8-bit numbers: output = a + b + c + d + e + f
• Use registers on both inputs and outputs

Performance requirements:
• Maximum clock period: unlimited
• Maximum latency: four

Cost requirements:
• Maximum of two adders
• Small miscellaneous hardware (e.g. muxes) is unlimited
• Maximum of three inputs and one output
• Design effort is unlimited

Note: In reality multiplexers are not free. In FPGAs, a 2:1 mux is more expensive than a full-adder. A 2:1 mux has three inputs while an adder has only two inputs (the carry-in and carry-out signals usually use the special vertical connections on the FPGA cell). In FPGAs, sharing an adder between two signals can be more expensive than having two adders. In a generic-gate technology, a multiplexer contains three two-input gates, while a full-adder contains fourteen two-input gates.

2.9.2 Algorithm

We'll use parentheses to group operations so as to maximize our opportunities to perform the work in parallel:

    z = (a + b) + (c + d) + (e + f)

This results in the following data-dependency graph:

[Data-dependency graph: inputs a, b, c, d, e, f feeding a tree of five adders.]

2.9.3 Initial Dataflow Diagram

[Dataflow diagram: inputs a through f, five adders, output z.]

This dataflow diagram violates the requirement to use at most three inputs.

2.9.4 Dataflow Diagram Scheduling

We can potentially optimize the inputs, outputs, area, and performance of a dataflow diagram by rescheduling the operations, that is, allocating the operations to different clock cycles.
• Parallel algorithms have higher performance and greater scheduling flexibility than serial algorithms.
• Serial algorithms tend to have less area than parallel algorithms.

    Serial:    (((((a+b)+c)+d)+e)+f)
    Parallel:  (a+b)+(c+d)+(e+f)

[Two dataflow diagrams, one per grouping, each with inputs a through f and five adders.]

Scheduling to Optimize Area

[Two dataflow diagrams: the original parallel schedule and the parallel schedule after rescheduling.]

                    Original parallel    Parallel after scheduling
    inputs          6                    4
    outputs         1                    1
    registers       6                    4
    adders          3                    2
    clock period    flop + 1 add         flop + 1 add
    latency         3                    3

Scheduling to Optimize Inputs

Rescheduling the dataflow diagram from the parallel algorithm reduced the area from three adders to two. However, it still violates the restriction of a maximum of three inputs. We can reschedule the operations to keep the same area, but reduce the number of inputs. The tradeoff is that reducing the number of inputs causes an increase in the latency from four to five.

[Dataflow diagram: parallel algorithm rescheduled so that the inputs arrive in pairs (a, b), then (c, d), then (e, f), producing z.]

A latency of five violates the design requirement of a maximum latency of four clock cycles. In comparing the dataflow diagram above with the design requirements, we notice that the requirements allow a clock cycle that includes two additions and three inputs.

It appears that the parallel algorithm will not lead us to a design that satisfies the requirements. We revisit the algorithm and try a serial algorithm:

    z = ((((a + b) + c) + d) + e) + f

[Dataflow diagram for the serial algorithm: inputs a, b, c in the first clock cycle, d and e in the second, f in the third; intermediate sums x1, x2, x3, x4; output z.]

2.9.5 Optimize Inputs and Outputs

When we rescheduled the parallel algorithm, we rescheduled the input values. This requires renegotiating the schedule of input values with our environment.
Sometimes the environment of our circuit will be willing to reschedule the inputs, but in other situations the environment will impose a non-negotiable schedule upon us.

If you are currently storing all inputs and can change the environment's behaviour to delay sending some inputs, then you can reduce the number of inputs and registers. We will illustrate this on both the one-add and the two-add designs.

[Two dataflow diagrams of the one-add design (intermediate sums x1 through x4, output z): before and after I/O optimization.]

                One-add before I/O opt    One-add after I/O opt
    inputs      6                         2
    regs        6                         2

[Two dataflow diagrams of the two-add design (intermediate sums x1 through x4, output z): before and after I/O optimization.]

                Two-add before I/O opt    Two-add after I/O opt
    inputs      6                         3
    regs        6                         3

Design Comparison Between One and Two Add

[Two dataflow diagrams: one-add after I/O optimization and two-add after I/O optimization.]

                    One-add after I/O opt    Two-add after I/O opt
    inputs          2                        3
    outputs         1                        1
    registers       2                        3
    adders          1                        2
    clock period    flop + 1 add             flop + 2 add
    latency         6                        4

Hardware Recipe for Two-Add

We return now to the two-add design, with the dataflow diagram:

[Dataflow diagram of the two-add design: inputs a, b, c, then d, e, then f; intermediate sums x1 through x4; output z.]

Based on the dataflow diagram, we can determine the hardware resources required for the datapath.

Table 2.2: Hardware Recipe for Two-Add

    inputs                                  3
    adders                                  2
    registers                               3
    output                                  1
    registered inputs                       YES
    registered outputs                      YES
    clock cycles from inputs to outputs     4

2.9.6 Input/Output Allocation

Our first step after settling on a hardware recipe is I/O allocation, because that determines the interface between our circuit and the outside world.

From the hardware recipe, we know that we need only three inputs and one output. However, we have six different input values. We need to allocate these input values to input signals before we can write a high-level model that performs the computation of our design.
Based on the input and output information in the hardware recipe, we can define our entity:

    entity massey is
      port (
        clk        : in  std_logic;
        i1, i2, i3 : in  unsigned(7 downto 0);
        o1         : out unsigned(7 downto 0)
      );
    end massey;

[Dataflow diagram and hardware block diagram: a, b, c arrive on i1, i2, i3; d, e on i2, i3; f on i2; z leaves on o1.]

Figure 2.13: Dataflow diagram and hardware block diagram with I/O port allocation

Based upon the dataflow diagram after I/O allocation, we can write our first high-level model (hlm_v1). In the high-level model the entire circuit will be implemented in a single process. For larger circuits it may be beneficial to have separate processes for different groups of signals.

In the high-level model, the code between wait statements describes the work that is done in a clock cycle.

The hlm_v1 architecture uses an implicit state machine.

Because the process is clocked, all of the signals that are assigned to in the process are registers. Combinational signals would need to be described using concurrent assignments or combinational processes.

    architecture hlm_v1 of massey is
      -- ...internal signal decls...
    begin
      process begin
        wait until rising_edge(clk);
        a <= i1;
        b <= i2;
        c <= i3;
        wait until rising_edge(clk);
        x2 <= (a + b) + c;
        d  <= i2;
        e  <= i3;
        wait until rising_edge(clk);
        x4 <= (x2 + d) + e;
        f  <= i2;
        wait until rising_edge(clk);
        z <= (x4 + f);
      end process;
      o1 <= z;
    end hlm_v1;

2.9.7 Register Allocation

The next step after I/O allocation could be either register allocation or datapath allocation. The benefit of doing register allocation first is that it is possible to write VHDL code after register allocation is done but before datapath allocation is done, while the inverse (datapath allocation done but register allocation not done) does not make sense if written in a hardware description language. In this example, we will do register allocation before datapath allocation, and show the resulting VHDL code.
[Dataflow diagram and hardware block diagram annotated with I/O ports and registers r1, r2, r3.]

    I/O Allocation            Register Allocation
    i1   a                    r1   a, x2, x4
    i2   b, d, f              r2   b, d, f
    i3   c, e                 r3   c, e
    o1   z

    architecture hlm_v2 of massey is
      -- ...internal signal decls...
    begin
      process begin
        wait until rising_edge(clk);
        r1 <= i1;
        r2 <= i2;
        r3 <= i3;
        wait until rising_edge(clk);
        r1 <= (r1 + r2) + r3;
        r2 <= i2;
        r3 <= i3;
        wait until rising_edge(clk);
        r1 <= (r1 + r2) + r3;
        r2 <= i2;
        wait until rising_edge(clk);
        r3 <= (r1 + r2);
      end process;
      o1 <= r3;
    end hlm_v2;

Figure 2.14: Block diagram after I/O and register allocation

2.9.8 Datapath Allocation

In datapath allocation, we allocate each of the data operations in the dataflow diagram to one of the datapath components in the hardware block diagram.

[Dataflow diagram and hardware block diagram annotated with I/O ports, registers r1, r2, r3, and adders a1, a2.]

    I/O Allocation        Register Allocation       Datapath Allocation
    i1   a                r1   a, x2, x4            a1   x1, x3, z
    i2   b, d, f          r2   b, d, f              a2   x2, x4
    i3   c, e             r3   c, e
    o1   z

    architecture hlm_dp of massey is
      -- ...internal signal decls...
    begin
      process begin
        wait until rising_edge(clk);
        r1 <= i1;
        r2 <= i2;
        r3 <= i3;
        wait until rising_edge(clk);
        r1 <= a2;
        r2 <= i2;
        r3 <= i3;
        wait until rising_edge(clk);
        r1 <= a2;
        r2 <= i2;
        wait until rising_edge(clk);
        r3 <= a1;
      end process;
      a1 <= r1 + r2;
      a2 <= a1 + r3;
      o1 <= r3;
    end hlm_dp;

Figure 2.15: Block diagram after I/O, register, and datapath allocation

2.9.9 Datapath for DP+Ctrl Model

We will now evolve from an implicit state machine to an explicit state machine. The first step is to label the states in the dataflow diagram and then construct tables to find the values for the chip-enable and mux-select signals.
[Dataflow diagram with the clock cycles labelled S0, S1, S2, S3 (then back to S0), and annotated with ports, registers r1 through r3, and adders a1, a2.]

Datapath for DP+Ctrl Model

In the table, "-" marks a don't-care value.

          r1             r2             r3             a1                  a2
    S0    ce=1, d=i1     ce=1, d=i2     ce=1, d=i3     src1=-,  src2=-     src1=-,  src2=-
    S1    ce=1, d=a2     ce=1, d=i2     ce=1, d=i3     src1=r1, src2=r2    src1=a1, src2=r3
    S2    ce=1, d=a2     ce=1, d=i2     ce=-, d=-      src1=r1, src2=r2    src1=a1, src2=r3
    S3    ce=-, d=-      ce=-, d=-      ce=1, d=a1     src1=r1, src2=r2    src1=-,  src2=-

Choose Don't-Care Values

          r1             r2             r3             a1                  a2
    S0    ce=1, d=i1     ce=1, d=i2     ce=1, d=i3     src1=r1, src2=r2    src1=a1, src2=r3
    S1    ce=1, d=a2     ce=1, d=i2     ce=1, d=i3     src1=r1, src2=r2    src1=a1, src2=r3
    S2    ce=1, d=a2     ce=1, d=i2     ce=1, d=i3     src1=r1, src2=r2    src1=a1, src2=r3
    S3    ce=1, d=a2     ce=1, d=i2     ce=1, d=a1     src1=r1, src2=r2    src1=a1, src2=r3

Simplify

          r1       r2 = i2    r3       a1                  a2
    S0    d=i1                d=i3     src1=r1, src2=r2    src1=a1, src2=r3
    S1    d=a2                d=i3
    S2    d=a2                d=i3
    S3    d=a2                d=a1

VHDL Code

    architecture explicit_v1 of massey is
      subtype state_ty is std_logic_vector(3 downto 0);
      constant s0 : state_ty := "0001";
      constant s1 : state_ty := "0010";
      constant s2 : state_ty := "0100";
      constant s3 : state_ty := "1000";
      signal state : state_ty;
    begin

      ----------------------- r_1
      process (clk)
      begin
        if rising_edge(clk) then
          if state = S0 then
            r_1 <= i_1;
          else
            r_1 <= a_2;
          end if;
        end if;
      end process;

      ----------------------- r_2
      process (clk)
      begin
        if rising_edge(clk) then
          r_2 <= i_2;
        end if;
      end process;

      ----------------------- r_3
      process (clk)
      begin
        if rising_edge(clk) then
          if state = S3 then
            r_3 <= a_1;
          else
            r_3 <= i_3;
          end if;
        end if;
      end process;
      ----------------------- combinational datapath
      a_1 <= r_1 + r_2;
      a_2 <= a_1 + r_3;
      o_1 <= r_3;

      ----------------------- state machine
      process (clk)
      begin
        if rising_edge(clk) then
          if reset = '1' then
            state <= S0;
          else
            case state is
              when S0 => state <= S1;
              when S1 => state <= S2;
              when S2 => state <= S3;
              when S3 => state <= S0;
              -- the choices of a case statement must cover every value of state_ty
              when others => state <= S0;
            end case;
          end if;
        end if;
      end process;
    end explicit_v1;

Peephole Optimizations

Peephole optimizations are localized optimizations to code, in that they affect only a few lines of code. In hardware design, peephole optimizations are usually done to decrease the clock period, although some optimizations might also decrease area. There are many different types of optimizations, and many optimizations that designers do by hand are things that you might expect a synthesis tool to do automatically.

In a comparison such as state = S0, when we use a one-hot state encoding, we need to compare only one of the bits of the state. The comparison can be simplified to state(0) = '1'. Without this optimization, many synthesis tools will produce hardware that tests all of the bits of the state signal. This increases the area, because more bits are required as inputs to the comparison, and increases the clock period, because the wider comparison leads to a tree-like structure of combinational logic, or an increased number of FPGA cells.

In this example, we will take advantage of our state encoding to optimize the code for r_1, r_3, and the state machine.

    -- r_1
    process (clk)
    begin
      if rising_edge(clk) then
        if state = S0 then
          r_1 <= i_1;
        else
          r_1 <= a_2;
        end if;
      end if;
    end process;

    -- r_1 (optimized)
    process (clk)
    begin
      if rising_edge(clk) then
        if state(0) = '1' then
          r_1 <= i_1;
        else
          r_1 <= a_2;
        end if;
      end if;
    end process;

The code for r_2 remains unchanged.
    -- r_3
    process (clk)
    begin
      if rising_edge(clk) then
        if state = S3 then
          r_3 <= a_1;
        else
          r_3 <= i_3;
        end if;
      end if;
    end process;

    -- r_3 (optimized)
    process (clk)
    begin
      if rising_edge(clk) then
        if state(3) = '1' then
          r_3 <= a_1;
        else
          r_3 <= i_3;
        end if;
      end if;
    end process;

    -- state machine
    process (clk)
    begin
      if rising_edge(clk) then
        if reset = '1' then
          state <= S0;
        else
          case state is
            when S0 => state <= S1;
            when S1 => state <= S2;
            when S2 => state <= S3;
            when S3 => state <= S0;
            -- the choices of a case statement must cover every value of state_ty
            when others => state <= S0;
          end case;
        end if;
      end if;
    end process;

    -- state machine (optimized)
    -- NOTE: "st" = "state"
    process (clk)
    begin
      if rising_edge(clk) then
        if reset = '1' then
          st <= S0;
        else
          -- rotate the one-hot bit: st(0) -> st(1) -> st(2) -> st(3) -> st(0)
          for i in 0 to 3 loop
            st((i + 1) mod 4) <= st(i);
          end loop;
        end if;
      end if;
    end process;

The hardware block diagram that corresponds to the tables and VHDL code is:

[Hardware block diagram: reset and a four-bit one-hot state register State(0) through State(3); inputs i1, i2, i3; registers r1, r2, r3; adders a1 and a2; output o1.]

2.10 Design Example: Vanier

We'll go through the following artifacts:
1. requirements
2. algorithm
3. dataflow diagram
4. high-level models
5. hardware block diagram
6. RTL code for datapath
7. state machine
8. RTL code for control

Design Process

1. Scheduling (allocate operations to clock cycles)
2. I/O allocation
3. First high-level model
4. Register allocation
5. Datapath allocation
6. Connect datapath components, insert muxes where needed
7. Design implicit state machine
8. Optimize
9. Design explicit-current state machine
10. Optimize

2.10.1 Requirements

Functional requirements:
• Compute the following formula: output = (a × d) + c + (d × b) + b

Performance requirements:
• Max clock period: flop plus (2 adds or 1 multiply)
• Max latency: 4

Cost requirements:
• Maximum of two adders
• Maximum of two multipliers
• Unlimited registers
• Maximum of three inputs and one output
• Maximum of 5000 student-minutes of design effort
• Registered inputs and outputs

2.10.2 Algorithm

Create a data-dependency graph for the algorithm.

Note: if you draw the data-dependency graph with the inputs in alphabetical order, it's ugly. The lesson is to think about layout, and possibly redo the layout to make it simple and easy to understand, before proceeding.

[Data-dependency graph: inputs a, d, b, c; two multipliers (a × d and d × b); three adders; output z.]

2.10.3 Initial Dataflow Diagram

Schedule operations into clock cycles. Use an "as soon as possible" schedule, obeying the performance requirement of a maximum clock period of one multiply or two additions. In this initial diagram, we ignore the resource requirements. This allows us to establish a lower bound on the latency, which gives us the maximum performance that we can hope to achieve.

[Dataflow diagram: inputs a, d, b, c in the first clock cycle; two multipliers; three adders; output z.]

2.10.4 Reschedule to Meet Requirements

We have four inputs, but the requirements allow a maximum of three. We need to move one input into the second clock cycle. We want to choose an input that can be delayed by one clock cycle without violating a requirement and with minimal degradation of performance (clock period and latency).

If delaying an input by a clock cycle causes a requirement to be violated, we can often reschedule the operations to remove the violation. So, we sometimes create an intermediate dataflow diagram that violates a requirement, then reschedule the operations to bring the dataflow diagram back into compliance.

The critical path is from d and b, through a multiplier, the middle adder, the final adder, and then out through z. Because the inputs d and b are on the critical path, it would be preferable to choose another input (either a or c) as the input to move into the second clock cycle.

If we move c, we will move the first addition into the second clock cycle, which will force us to use three adders, which violates our resource requirement of a maximum of two adders.
By process of elimination, we have settled on a as our input to be delayed. This causes one of the multiply operations to be moved into the second clock cycle, which is good because it reduces our resources from two multipliers to just one.

[Dataflow diagram: inputs d, b, c in the first clock cycle; input a in the second.]

Moving a into the second clock cycle has caused a clock period violation, because our clock period is now a register, a multiply, and an add. This forces us to add an additional clock cycle, which gives us a latency of four.

[Dataflow diagram: the same schedule with one additional clock cycle.]

2.10.5 Optimize Resources

We can exploit the additional clock cycle to reschedule our operations to reduce the number of inputs from three to two. The disadvantage is that we have increased the number of registers from four to five.

[Dataflow diagram: inputs d, b in the first clock cycle; inputs a, c in the second; output z.]

Two side comments:

• Moving the second addition from the third clock cycle to the second will not improve the performance or the area. The number of adders will remain at two, the number of registers will remain at five, and the clock period will remain at the maximum of a multiply or two additions.

• In hindsight, if we had chosen originally to move c, rather than a, into the second clock cycle, we would likely have produced this same dataflow diagram. After moving c, we would see the resource violation of three adders in the second clock cycle. This violation would cause us to add a third clock cycle, and give us an opportunity to move a into the second clock cycle. The lesson is that there are usually several different ways to approach a design problem, and it is infeasible to predict which approach will result in the best design. At best, we have many heuristics, or rules of thumb, that give us guidelines for techniques that usually work well.

Having finalized our input/output scheduling, we can write our entity. Note: we will add a reset signal later, when we design the state machine to control the datapath.
    entity vanier is
      port (
        clk      : in  std_logic;
        i_1, i_2 : in  std_logic_vector(15 downto 0);
        o_1      : out std_logic_vector(15 downto 0)
      );
    end vanier;

2.10.6 Assign Names to Registered Values

We must assign a name to each registered value. Optionally, we may also assign names to combinational values. Registers require names, because in VHDL each register (except implicit state registers) is associated with a named signal. Combinational signals do not require names, because VHDL allows anonymous (unnamed) combinational signals. For example, in the expression (a+b)+c we do not need to provide a name for the sum of a and b.

If a single value spans multiple clock cycles, it only needs to be named once. In our example x_1, x_2, and x_4 each cross two boundaries.

[Dataflow diagram with the registered values labelled x_1 through x_8; x_8 is the output z.]

2.10.7 Input/Output Allocation

Now that we have names for all of our registered signals, we can allocate input and output ports to signals.

After the input and output ports have been allocated to signals, we can write our first model. We use an implicit state machine and define only the registered values. In each state, we define the registered values that are computed in that state.
[Dataflow diagram with port allocation: d and b arrive on i1 and i2 in the first clock cycle, a and c arrive on i1 and i2 in the second, and z leaves on o1.]

    architecture hlm_v1 of vanier is
      signal x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8 : unsigned(15 downto 0);
    begin
      process begin
        ------------------------------
        wait until rising_edge(clk);
        ------------------------------
        x_1 <= unsigned(i_1);
        x_2 <= unsigned(i_2);
        ------------------------------
        wait until rising_edge(clk);
        ------------------------------
        x_3 <= unsigned(i_1);
        x_4 <= x_1(7 downto 0) * x_2(7 downto 0);
        x_5 <= unsigned(i_2);
        ------------------------------
        wait until rising_edge(clk);
        ------------------------------
        x_6 <= x_3(7 downto 0) * x_1(7 downto 0);
        x_7 <= x_2 + x_5;
        ------------------------------
        wait until rising_edge(clk);
        ------------------------------
        x_8 <= x_6 + (x_4 + x_7);
      end process;
      o_1 <= std_logic_vector(x_8);
    end hlm_v1;

The model hlm_v1 is synthesizable. If we are happy with the clock speed and area, we can stop now! The remaining steps of the design process seek to optimize the design by reducing the area and clock period. For area, we will reduce the number of registers, datapath components, and multiplexers. Reducing the clock period will occur as we reduce the number of multiplexers and potentially perform peephole (localized) optimizations, such as Boolean simplification.

2.10.8 Tangent: Combinational Outputs

To demonstrate a high-level model where the output is combinational, we modify hlm_v1 so that the output is combinational, rather than a register (see hlm_v1c). To make the output (x_8) combinational, we move the assignment to x_8 out of the main clocked process and into a concurrent statement.
    architecture hlm_v1c of vanier is
      signal x_1, x_2, x_3, x_4, x_5, x_6, x_7 : unsigned(15 downto 0);
    begin
      process begin
        ------------------------------
        wait until rising_edge(clk);
        ------------------------------
        x_1 <= unsigned(i_1);
        x_2 <= unsigned(i_2);
        ------------------------------
        wait until rising_edge(clk);
        ------------------------------
        x_3 <= unsigned(i_1);
        x_4 <= x_1(7 downto 0) * x_2(7 downto 0);
        x_5 <= unsigned(i_2);
        ------------------------------
        wait until rising_edge(clk);
        ------------------------------
        x_6 <= x_3(7 downto 0) * x_1(7 downto 0);
        x_7 <= x_2 + x_5;
      end process;
      o_1 <= std_logic_vector(x_6 + (x_4 + x_7));
    end hlm_v1c;

[Dataflow diagram for hlm_v1c: the same as for hlm_v1, except that the final sum drives z and o1 directly instead of passing through register x_8.]

2.10.9 Register Allocation

Our previous model (hlm_v1) uses eight registers (x_1 ... x_8). However, our analysis of the dataflow diagrams says that we can implement the diagram with just five registers. Also, the code for hlm_v1 contains two occurrences of the multiplication symbol (*) and three occurrences of the addition symbol (+). Our analysis of the dataflow diagram showed that we need only one multiply and two adds. In hlm_v1 we are relying on the synthesis tool to recognize that even though the code contains two multiplies and three adds, the hardware needs only one multiply and two adds.

Register allocation is the task of assigning each of our registered values to a register signal. Datapath allocation is the task of assigning each datapath operation to a datapath component. Only high-level synthesis tools (and software compilers) do register allocation. So, as hardware designers, we are stuck with the task of doing register allocation ourselves if we want to further optimize our design. Some register-transfer-level synthesis tools do datapath allocation.
If your synthesis tool does datapath allocation, it is important to learn the idioms and limitations of the tool so that you can write your code in a style that allows the tool to do a good job of allocation and optimization. In most cases where area or clock speed are important design metrics, design engineers do datapath allocation by hand or with ad-hoc software and spreadsheets.

We will now step through the tasks of register allocation and datapath allocation. In our eight-register model, each register holds a unique value; we do not reuse registers. To reduce the number of registers from eight to five, we will need to reuse registers, so that a register potentially holds different values in different clock cycles.

When doing register allocation, we assign a register to each signal that crosses a clock cycle boundary.

When creating the hardware block diagram, we will need to add multiplexers to the inputs of modules that are connected to multiple registers. To reduce the number of multiplexers, we try to allocate the same registers to the same inputs of the same type of module. For example, x_7 is an input to an adder, so we allocate r_5 to x_7, because r_5 was also an input to an adder in another clock cycle. Also in the third clock cycle, we allocate r_2 to x_6, because in the second clock cycle, the inputs to an adder were r_2 and r_5. In the last clock cycle, we allocate r_5 to x_8, because previously r_5 was used as the output of r_2 + r_5.

We update our model to reflect register allocation, by replacing the signals for registered values (x_1 ... x_8) with the registers r_1 ... r_5.
[Dataflow diagram with register allocation: x_1 in r1, x_2 in r2, x_3 in r3, x_4 in r4, x_5 in r5, x_6 in r2, x_7 in r5, x_8 in r5.]

    architecture hlm_v2 of vanier is
      signal r_1, r_2, r_3, r_4, r_5 : unsigned(15 downto 0);
    begin
      process begin
        ------------------------------
        wait until rising_edge(clk);
        ------------------------------
        r_1 <= unsigned(i_1);
        r_2 <= unsigned(i_2);
        ------------------------------
        wait until rising_edge(clk);
        ------------------------------
        r_3 <= unsigned(i_1);
        r_4 <= r_1(7 downto 0) * r_2(7 downto 0);
        r_5 <= unsigned(i_2);
        ------------------------------
        wait until rising_edge(clk);
        ------------------------------
        r_2 <= r_3(7 downto 0) * r_1(7 downto 0);
        r_5 <= r_2 + r_5;
        ------------------------------
        wait until rising_edge(clk);
        ------------------------------
        r_5 <= r_2 + (r_4 + r_5);
      end process;
      o_1 <= std_logic_vector(r_5);
    end hlm_v2;

Both of our models so far (hlm_v1 and hlm_v2) have used implicit state machines. The optimization from hlm_v1 to hlm_v2 was done to reduce the number of registers by performing register allocation. Most of the remaining optimizations require an explicit state machine. We will construct an explicit state machine using a methodical procedure that gradually adds more information to the dataflow diagram.

The first step in this procedure is datapath allocation, which is similar to register allocation, except that we allocate datapath components to datapath operations, rather than allocate registers to names.

To control the datapath, we need to provide the following signals for registers and datapath components:

    registers             chip-enable and mux-select signals
    datapath components   instruction (e.g. add, sub, etc. for ALUs) and mux-select

After we determine the chip-enable, mux-select, and instruction signals, and then calculate what value each signal needs in each clock cycle, we can build the explicit state machine to control the datapath.

After we build the state machine, we will add a reset to the design.
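The two register control signals named above can be pictured with a small stand-alone sketch. It is an illustration only, not part of the Vanier design: the entity name reg_ce_mux and the ports ce, sel, d0, d1, and q are assumptions made for this example. The chip enable decides whether the register loads or holds, and the mux select decides which source is loaded.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example entity: one register with a chip enable
-- and a 2:1 multiplexer on its data input.
entity reg_ce_mux is
  port (
    clk : in  std_logic;
    ce  : in  std_logic;              -- chip enable: '1' = load, '0' = hold
    sel : in  std_logic;              -- mux select: '0' = d0, '1' = d1
    d0  : in  unsigned(15 downto 0);
    d1  : in  unsigned(15 downto 0);
    q   : out unsigned(15 downto 0)
  );
end reg_ce_mux;

architecture main of reg_ce_mux is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- When ce = '0' no assignment happens, so q holds its value.
      if ce = '1' then
        if sel = '1' then
          q <= d1;
        else
          q <= d0;
        end if;
      end if;
    end if;
  end process;
end main;
```

A register that loads in every clock cycle needs no ce, and a register with a single data source needs no sel; removing such unneeded signals is what makes the control tables worth simplifying.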
2.10.10 Datapath Allocation

In datapath allocation, we allocate an adder (either a1 or a2) to each addition operation and a multiplier (either m1 or m2) to each multiplication operation. As with register allocation, we attempt to reduce the number of multiplexers that will be required by connecting the same datapath component to the same register in multiple clock cycles.

[Dataflow diagram with register and datapath allocation: both multiplications use m1; the addition producing x_7 uses a1; in the last clock cycle a2 feeds a1, which produces x_8.]

2.10.11 Hardware Block Diagram and State Machine

To build an explicit state machine, we first determine what states we need. In this circuit, we need four states, one for each clock cycle in the dataflow diagram. If our algorithmic description had included control flow, such as loops and branches, then it would be more difficult to determine the states that are needed.

We will use four states: S0..S3, where S0 corresponds to the first clock cycle (during which the input is read) and S3 corresponds to the last clock cycle.

2.10.11.1 Control for Registers

To determine the chip enable and mux select signals for the registers, we build a table where each state corresponds to a row and each register corresponds to a column. For each register and each state, we note whether the register loads in a new value (ce) and what signal is the source of the loaded data (d). In the table, "-" marks an entry with no required value.

          r1         r2         r3         r4         r5
          ce  d      ce  d      ce  d      ce  d      ce  d
    S0    1   i1     1   i2     -   -      -   -      -   -
    S1    0   -      0   -      1   i1     1   m1     1   i2
    S2    -   -      1   m1     -   -      0   -      1   a1
    S3    -   -      -   -      -   -      -   -      1   a1

Eliminate unnecessary chip enables and muxes.

• A chip enable is needed if a register must hold a single value for multiple clock cycles (ce=0).
• A multiplexer is needed if a register loads in values from different sources in different clock cycles.

The register simplifications are as follows:

    r1  Chip-enable, because S1 has ce=0. No multiplexer, because i1 is the only input.
    r2  Chip-enable, because S1 has ce=0. Multiplexer to choose between i2 and m1.
    r3  No chip enable, no multiplexer.
      The register r3 simplifies to just r3=i1, without a multiplexer or chip-enable, because there is only one state where we care about its behaviour (S1); all of the other states are don't-cares for both chip enable and mux.
  r4  Chip-enable, because S2 has ce=0. No multiplexer, because m1 is the only input.
  r5  No chip-enable, because no state has ce=0. Multiplexer between i2 and a1.

The simplified register table is shown below. For registers that do not have multiplexers, we show their input on the top row. For registers that need neither a chip enable nor a mux (e.g. r3), we write the assignment in the first row and leave the other rows blank. A dash marks a don't-care entry.

        r1=i1   r2          r3=i1   r4=m1   r5
        ce      ce   d              ce      d
  S0    1       1    i2             -       -
  S1    0       0    -              1       i2
  S2    -       1    m1             0       a1
  S3    -       -    -              -       a1

The chip-enable and mux-select signals that are needed for the registers are: r1_ce, r2_ce, r2_sel, r4_ce, and r5_sel.

2.10.11.2 Control for Datapath Components

Analogous to the table for registers, we build a table for the datapath components. Each of our components has two inputs (src1 and src2). Each component performs a single operation (either addition or multiplication), so we do not need to define operation or instruction signals for the datapath components.

        a1            a2            m1
        src1  src2    src1  src2    src1  src2
  S0    -     -       -     -       -     -
  S1    -     -       -     -       r1    r2
  S2    r2    r5      -     -       r3    r1
  S3    r2    a2      r4    r5      -     -

Based on the table above, the adder a1 will need a multiplexer for src2. The multiplier m1 will need two multiplexers: one for each input.

Note that addition and multiplication are commutative, so we can choose which signal goes to src1 and which to src2 so as to minimize the need for multiplexers.

We notice that for m1, we can reduce the number of multiplexers from 2 to 1 by swapping the operands in the second clock cycle in which m1 is used (S2). This makes r1 the only source of operands for the src1 input. This optimization is reflected in the table below.
        a1            a2            m1
        src1  src2    src1  src2    src1  src2
  S0    -     -       -     -       -     -
  S1    -     -       -     -       r1    r2
  S2    r2    r5      -     -       r1    r3
  S3    r2    a2      r4    r5      -     -

The mux-select signals for the datapath components are: a1_src2_sel and m1_src2_sel.

2.10.11.3 Control for State

We need to control the transition from one state to the next. For this example, the transition is very simple: each state transitions to its successor, S0 → S1 → S2 → S3 → S0.

2.10.11.4 Complete State Machine Table

The state machine table is shown below. Note that the state signal is a register; the table shows the next value of the signal. A dash marks a don't-care entry.

        r1_ce  r2_ce  r2_sel  r4_ce  r5_sel  a1_src2_sel  m1_src2_sel  state
  S0    1      1      i2      -      -       -            -            S1
  S1    0      0      -       1      i2      -            r2           S2
  S2    -      1      m1      0      a1      r5           r3           S3
  S3    -      -      -       -      a1      a2           -            S0

We now choose instantiations for the don't-care values so as to simplify the circuitry. Different state encodings will lead to different simplifications. For fully-encoded states, Karnaugh maps are helpful in doing simplifications. For a one-hot state encoding, it is usually better to create situations where conditions are based upon a single state. The reason for this heuristic with one-hot encodings will be clear when we get to explicit_v2.

  r1_ce  We first choose 0 as the don't-care instantiation, because that leaves just one state where we need to load. Additionally, it is conceptually cleaner to do an assignment in just the one clock cycle where we care about the value, rather than not do an assignment in the one clock cycle where we must hold the value. (At the end of the don't-care allocation, we'll revisit this decision and change our mind.)
  r2_ce  We choose 1 for S3, so that we have just one state where we do not do a load. If we had chosen 0 for r2_ce in S3, we would have two states where we do a load and two where we do not load. If we were using fully-encoded states, this even separation might have left us with a very nice Karnaugh map; or it might have left us with a Karnaugh map that has a checkerboard pattern, which would not simplify.
         This helps illustrate why state encoding is a difficult problem.
  r2_sel  We choose m1 arbitrarily. The choice of i2 would have also resulted in three assignments from one signal and one assignment from the other signal.
  r4_ce  We choose 0, as we did for r1_ce.
  r5_sel  Choose a1, so that we have three assignments from the same signal and just one assignment from the other signal.
  a1_src2  Choose a2 arbitrarily.
  m1_src2  Choose r3 arbitrarily.
  r1_ce (again)  We examine r1_ce and r2_ce and see that if we choose 1 for the don't-care instantiation of r1_ce, we will have the same choices for both chip enables. This will simplify our state machine. Also, r4_ce is the negation of r2_ce, so we can use just an inverter to control r4_ce.

        r1_ce  r2_ce  r2_sel  r4_ce  r5_sel  a1_src2_sel  m1_src2_sel  state
  S0    1      1      i2      0      a1      a2           r3           S1
  S1    0      0      m1      1      i2      a2           r2           S2
  S2    1      1      m1      0      a1      r5           r3           S3
  S3    1      1      m1      0      a1      a2           r3           S0

2.10.12 VHDL Code with Explicit State Machine

VHDL code can be written directly from the tables and the dataflow diagram that shows register allocation, input allocation, and datapath allocation.

As a simplification, rather than write explicit signals for the chip-enable and mux-select signals, we use select and conditional assignment statements that test the state in the condition.

We chose a one-hot encoding of the state, which usually results in small and fast hardware for state machines with sixteen or fewer states.
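For contrast with the simplification just described, here is a minimal sketch of what the explicit-signal style could look like for register r2 alone. The entity name r2_ctrl_sketch, the signal names r2_ce and r2_sel, and the encoding of r2_sel as a single bit are assumptions made for illustration; the values follow the r2_ce and r2_sel columns of the final state machine table.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical stand-alone wrapper around register r2 (illustration only).
entity r2_ctrl_sketch is
  port (
    clk   : in  std_logic;
    state : in  std_logic_vector(3 downto 0);   -- one-hot: S0 = "0001" .. S3 = "1000"
    i_2   : in  std_logic_vector(15 downto 0);
    m_1   : in  std_logic_vector(15 downto 0);
    r_2   : out std_logic_vector(15 downto 0)
  );
end r2_ctrl_sketch;

architecture rtl of r2_ctrl_sketch is
  constant s0 : std_logic_vector(3 downto 0) := "0001";
  constant s1 : std_logic_vector(3 downto 0) := "0010";
  signal r2_ce  : std_logic;
  signal r2_sel : std_logic;  -- assumed encoding: '0' selects i_2, '1' selects m_1
begin
  -- control: explicit signals decoded from the state
  r2_ce  <= '0' when state = s1 else '1';   -- table column r2_ce:  1, 0, 1, 1
  r2_sel <= '0' when state = s0 else '1';   -- table column r2_sel: i2, m1, m1, m1

  -- datapath: register with chip enable and input mux
  process (clk)
  begin
    if rising_edge(clk) then
      if r2_ce = '1' then
        if r2_sel = '0' then
          r_2 <= i_2;
        else
          r_2 <= m_1;
        end if;
      end if;
    end if;
  end process;
end rtl;
```

Testing the state directly inside each register process, as the notes do, removes the intermediate r2_ce and r2_sel signals while describing the same behaviour.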
architecture explicit_v1 of vanier is
  signal r_1, r_2, r_3, r_4, r_5 : std_logic_vector(15 downto 0);
  subtype state_ty is std_logic_vector(3 downto 0);
  constant s0 : state_ty := "0001";
  constant s1 : state_ty := "0010";
  constant s2 : state_ty := "0100";
  constant s3 : state_ty := "1000";
  signal state : state_ty;
begin
  ----------------------- r_1
  process (clk) begin
    if rising_edge(clk) then
      if state /= S1 then
        r_1 <= i_1;
      end if;
    end if;
  end process;
  ----------------------- r_2
  process (clk) begin
    if rising_edge(clk) then
      if state /= S1 then
        if state = S0 then
          r_2 <= i_2;
        else
          r_2 <= m_1;
        end if;
      end if;
    end if;
  end process;
  ----------------------- r_3
  process (clk) begin
    if rising_edge(clk) then
      r_3 <= i_1;
    end if;
  end process;
  ----------------------- r_4
  process (clk) begin
    if rising_edge(clk) then
      if state = S1 then
        r_4 <= m_1;
      end if;
    end if;
  end process;
  ----------------------- r_5
  process (clk) begin
    if rising_edge(clk) then
      if state = S1 then
        r_5 <= i_2;
      else
        r_5 <= a_1;
      end if;
    end if;
  end process;
  ----------------------- combinational datapath
  with state select
    a1_src2 <= r_5 when S2,
               a_2 when others;
  with state select
    m1_src2 <= r_2 when S1,
               r_3 when others;
  a_1 <= a_2 + a1_src2;
  a_2 <= r_4 + r_5;
  m_1 <= r_1 * m1_src2;
  o_1 <= r_5;
  ----------------------- state machine
  process (clk) begin
    if rising_edge(clk) then
      if reset = '1' then
        state <= S0;
      else
        case state is
          when S0 => state <= S1;
          when S1 => state <= S2;
          when S2 => state <= S3;
          when S3 => state <= S0;
          -- a case on std_logic_vector must cover every value
          when others => state <= S0;
        end case;
      end if;
    end if;
  end process;
  ----------------------
end explicit_v1;

The hardware-block diagram that corresponds to the tables and VHDL code is:
[Figure: hardware block diagram alongside the dataflow diagram. Inputs i1 and i2 feed registers r1 to r5; multiplier m1 and adders a1 and a2 form the combinational datapath; r5 drives o1. The dataflow diagram is annotated with states S0 to S3.]

2.10.13 Peephole Optimizations

We will illustrate several peephole optimizations that take advantage of our state encoding.

-- r_1
process (clk) begin
  if rising_edge(clk) then
    if state /= S1 then
      r_1 <= i_1;
    end if;
  end if;
end process;

-- r_1 (optimized)
process (clk) begin
  if rising_edge(clk) then
    if state(1) = '0' then
      r_1 <= i_1;
    end if;
  end if;
end process;

Analogous optimizations can be used when comparing against multiple states:

-- r_2
process (clk) begin
  if rising_edge(clk) then
    if state /= S1 then
      if state = S0 then
        r_2 <= i_2;
      else
        r_2 <= m_1;
      end if;
    end if;
  end if;
end process;

-- r_2 (optimized)
process (clk) begin
  if rising_edge(clk) then
    if state(1) = '0' then
      if state(0) = '1' then
        r_2 <= i_2;
      else
        r_2 <= m_1;
      end if;
    end if;
  end if;
end process;

Next-state assignment for a one-hot state machine can be done with a simple shift register:

-- state machine
process (clk) begin
  if rising_edge(clk) then
    if reset = '1' then
      state <= S0;
    else
      case state is
        when S0 => state <= S1;
        when S1 => state <= S2;
        when S2 => state <= S3;
        when S3 => state <= S0;
        -- a case on std_logic_vector must cover every value
        when others => state <= S0;
      end case;
    end if;
  end if;
end process;

-- state machine (optimized)
-- NOTE: "st" = "state"
process (clk) begin
  if rising_edge(clk) then
    if reset = '1' then
      st <= S0;
    else
      for i in 0 to 3 loop
        st( (i+1) mod 4 ) <= st( i );
      end loop;
    end if;
  end if;
end process;

The resulting optimized code is shown below.
architecture explicit_v2 of vanier is
  signal r_1, r_2, r_3, r_4, r_5 : std_logic_vector(15 downto 0);
  subtype state_ty is std_logic_vector(3 downto 0);
  constant s0 : state_ty := "0001";
  constant s1 : state_ty := "0010";
  constant s2 : state_ty := "0100";
  constant s3 : state_ty := "1000";
  signal state : state_ty;
begin
  ----------------------- r_1
  process (clk) begin
    if rising_edge(clk) then
      if state(1) = '0' then
        r_1 <= i_1;
      end if;
    end if;
  end process;
  ----------------------- r_2
  process (clk) begin
    if rising_edge(clk) then
      if state(1) = '0' then
        if state(0) = '1' then
          r_2 <= i_2;
        else
          r_2 <= m_1;
        end if;
      end if;
    end if;
  end process;
  ----------------------- r_3
  process (clk) begin
    if rising_edge(clk) then
      r_3 <= i_1;
    end if;
  end process;
  ----------------------- r_4
  process (clk) begin
    if rising_edge(clk) then
      if state(1) = '1' then
        r_4 <= m_1;
      end if;
    end if;
  end process;
  ----------------------- r_5
  process (clk) begin
    if rising_edge(clk) then
      if state(1) = '1' then
        r_5 <= i_2;
      else
        r_5 <= a_1;
      end if;
    end if;
  end process;
  ----------------------- combinational datapath
  a1_src2 <= r_5 when state(2) = '1' else a_2;
  m1_src2 <= r_2 when state(1) = '1' else r_3;
  a_1 <= a_2 + a1_src2;
  a_2 <= r_4 + r_5;
  m_1 <= r_1 * m1_src2;
  o_1 <= r_5;
  ----------------------- state machine
  process (clk) begin
    if rising_edge(clk) then
      if reset = '1' then
        state <= S0;
      else
        for i in 0 to 3 loop
          state( (i+1) mod 4) <= state(i);
        end loop;
      end if;
    end if;
  end process;
  ----------------------
end explicit_v2;

2.10.14 Notes and Observations

Our functional requirements were written as:

  output = (a × d) + (d × b) + b + c

Alternatively, we could have achieved exactly the same functionality with the functional requirements written as (the two statements are mathematically equivalent):

  output = (a × d) + b + (d × b) + c

The naive data dependency graph for the alternative formulation is much messier than the data dependency graph for the original formulation:

  Original: (a × d) + (d × b) + b + c
  Alternative: (a × d) + c + (d × b) + b

[Figure: data dependency graphs for the original and alternative formulations, each with inputs a, d, b, c and output z.]

An observation: it can be helpful to explore several equivalent formulations of the mathematical equations while constructing the data dependency graph. A mathematical formulation that places occurrences of the same identifier close to each other often results in a simpler data dependency graph. The simpler the data dependency graph, the easier it will be to identify helpful optimizations and efficient schedules.

2.11 Design Example: Stack

The purpose of the stack example is to illustrate the design techniques on a slightly larger example than Vanier and Massey. There are no new concepts in this section.

2.11.1 Stack: Requirements

2.11.1.1 Entity

VHDL entity for the stack:

entity stack is
  port (
    reset, clk : in  std_logic;
    inp        : in  std_logic_vector(3 downto 0);
    outp       : out std_logic_vector(3 downto 0)
  );
end stack;

The input signal inp is used for both instructions and data.

2.11.1.2 Instructions

  push  put a new piece of data onto the top of the stack
  pop   remove the top piece of data from the stack
  swap  swap the top two pieces of data
  tos   output the current data on the top of the stack

2.11.1.3 Instruction Encoding

VHDL package defining stack instructions:

package stack_instr is
  constant pop  : std_logic_vector(3 downto 0) := "0001";
  constant push : std_logic_vector(3 downto 0) := "0010";
  constant tos  : std_logic_vector(3 downto 0) := "0100";
  constant swap : std_logic_vector(3 downto 0) := "1000";
end stack_instr;

2.11.1.4 Miscellaneous Requirements

- The stack shall have 16 elements.
- The inputs shall be registered.
- When a push operation is done, in the clock cycle following the push instruction, inp shall have the data that is to be pushed onto the stack.
- Popping from an empty stack or pushing onto a full stack results in undefined behaviour.
- When doing a tos or pop operation, the output outp shall have the tos data in the clock cycle after the tos instruction is input. At all other times the output is unconstrained.
- In the clock cycle following reset being asserted (set to '1'), the stack shall be empty.

2.11.2 Stack: Algorithm

A simple Perl program to implement an algorithmic description of the stack.

Note: You don't need to know Perl in E&CE 427. Perl is just one example of the many different software programming languages that can be used to create algorithmic descriptions of circuits.

Usage of Perl Stack (lines appear in order; input and output are interleaved):

  push
  3
  tos
  3
  push
  4
  tos
  4
  pop
  4
  tos
  3

Stack Algorithm

#! /usr/bin/perl -Wall
local ($line, @stack, $stack, $tmp);
$tos = 0;
while ($line = <STDIN>) {
  chop( $line );
  if ( $line eq "tos") {
    print( $stack[$tos] );
  } elsif ( $line eq "pop") {
    print( $stack[$tos] );
    $tos = $tos - 1;
  } elsif ( $line eq "push" ) {
    $tos = $tos + 1;
    $line = <STDIN>;
    chop( $line );
    $stack[$tos] = $line;
  } elsif ( $line eq "swap" ) {
    $tmp = $stack[$tos];
    $stack[$tos] = $stack[$tos-1];
    $stack[$tos-1] = $tmp;
  }
}

2.11.3 Stack: Dataflow Diagram

2.11.3.1 Data-Dependency Graphs

Do one data-dependency graph for each operation. Convert each data-dependency graph into a dataflow diagram by adding clock-cycle boundaries.

[Figure: data-dependency graphs for Pop, Tos, Push, and Swap, built from stack, tos, data_in, data_out, +1, -1, stack(rd), and stack(wr).]

Note: scheduling decision and anti-dependency arrows

2.11.3.2 Partition into Clock Cycles

Note: The memory array used in this example supports combinational reads, hence read operations can be done in the middle of a clock cycle. For the Altera memory arrays used in E&CE 427 the read operations are registered.
[Figure: dataflow diagrams for each operation, partitioned into clock cycles. Resource usage annotated on the figure:]

  Tos                         2 registers (stack, tos)
  Pop                         2 registers (stack, tos), 1 ALU
  Push                        3 registers (stack, tos, data_in), 1 ALU
  Swap version 1              5 registers (stack, tos, stack[tos], stack[tos-1], tos-1), 1 ALU
  Swap version 2 (optimized)  4 registers (stack, tos, stack[tos], stack[tos-1]), 1 ALU; eliminated one register

2.11.4 Stack: High-Level Model

This high-level model is taken directly from the dataflow diagrams and block diagrams.

There is one process that combines control, datapath, and storage; the exception is the output (outp), which is done with a concurrent assignment statement.

Notice that there is a next init when (reset = '1'); after every wait statement. This is needed to get the circuit back to its initial state in the next clock cycle when reset is asserted.

First, we'll see the overall structure of the hlm architecture, and then the gory details.

architecture hlm of stack is
  ...declarations...
begin
  -----------------------------------------------
  process begin
    init : loop
      ...reset assignments...
      loop
        --------------------------------
        wait until rising_edge(clk);
        next init when (reset = '1');
        --------------------------------
        case inp is
          when pop    => ...pop code...
          when push   => ...push code...
          when swap   => ...swap code...
          when tos    => ...tos code...
          when others => next init;
        end case;
      end loop;
    end loop;
  end process;
  -----------------------------------------------
  outp <= stack(to_integer(tos));
  -----------------------------------------------
end hlm;

Now for the actual code.
architecture hlm of stack is
  -----------------------------------------------
  subtype data_ty is std_logic_vector(3 downto 0);
  type stack_ty is array (15 downto 0) of data_ty;
  -----------------------------------------------
  signal tos        : unsigned(3 downto 0);
  signal tmp1, tmp2 : data_ty;
  signal stack      : stack_ty;
  signal empty      : std_logic;
  -----------------------------------------------
begin
  ---------------------------------------------------------
  process begin
    init : loop
      --------------------------------
      tos   <= to_unsigned(0,4);
      empty <= '1';
      --------------------------------
      loop
        --------------------------------
        wait until rising_edge(clk);
        next init when (reset = '1');
        --------------------------------
        case inp is
          when pop =>
            tos <= tos - 1;
          when push =>
            if (empty = '0') then
              tos <= tos + 1;
            end if;
            --------------------------------
            wait until rising_edge(clk);
            next init when (reset = '1');
            --------------------------------
            stack(to_integer(tos)) <= inp;
            empty <= '0';
          when swap =>
            tmp1 <= stack(to_integer(tos-1));
            --------------------------------
            wait until rising_edge(clk);
            next init when (reset = '1');
            --------------------------------
            tmp2 <= stack(to_integer(tos));
            --------------------------------
            wait until rising_edge(clk);
            next init when (reset = '1');
            --------------------------------
            stack(to_integer(tos-1)) <= tmp2;
            --------------------------------
            wait until rising_edge(clk);
            next init when (reset = '1');
            --------------------------------
            stack(to_integer(tos)) <= tmp1;
          when tos =>
            null;
          when others =>
            next init;
        end case;
      end loop;
    end loop;
  end process;
  -----------------------------------------------
  outp <= stack(to_integer(tos));
  -----------------------------------------------
end hlm;

The high-level model is synthesizable, but might be large and slow. It uses a 2-d array for the stack, rather than specialized memory components from the library. We are relying on the synthesis tool to build a state machine to drive the datapath.
Sometimes, by writing code that is closer to gate-level hardware, we can improve performance and/or area.

2.11.5 Stack: Block Diagram

2.11.5.1 Individual Block Diagrams

Build one block diagram for each operation.

[Figure: dataflow diagram and block diagram for Pop. The tos register, an adder with a -1 operand, and the stack memory (ports we, a, di, do) with we = 0; do drives outp.]

[Figure: dataflow diagram and block diagram for Push. Control, the tos register (d, ce, q), an adder with a +1 operand, and the stack memory with we = 1 and di driven by inp.]

[Figure: dataflow diagram and block diagram for Tos. The tos register and the stack memory with we = 0; do drives outp.]

[Figure: dataflow diagram and block diagram for Swap. Control, the tos register, an adder with a -1 operand, the stack memory, and the registers tmp1 and tmp2 (d, ce, q).]

2.11.5.2 Complete Block Diagram

Merge all of the block diagrams together, reusing components wherever possible.

[Figure: complete block diagram for all operations. The control block drives tos_inc_dec_sel, tos_ce, tmp1_ce, tmp2_ce, stack_addr_sel, stack_data_sel, and stack_we. The datapath contains the tos register (with reset), an adder with operands -1 and 1, the stack memory (we, a, di, do), the registers tmp1 and tmp2, a registered inp, and outp.]

2.11.6 Stack: Register Transfer Level

Structuring RTL Code

There are four different ways to structure your RTL code:

  Single process
  Separate datapath
  Separate control, storage, and datapath
  Fully disassembled

[Figure: for each of the four structures, the grouping of control (including next-state functions), storage, and datapath into processes.]

Section 1.8.4 described a variety of options for coding the individual modules in the above diagram. For example: whether to use both flopped and combinational signals, the number of target signals per process, and whether to use if or wait statements for flip-flops.

Stack RTL
To write the RTL code for the stack, consider the following options:

- Replacing the stack as an array with a component instantiation of a memory array from the FPGA libraries
- Defining a state machine and signals to control the datapath (e.g. define a state type and a signal of type state and do assignments to current and next-state signals)

Question to ponder: does an explicit state machine result in better hardware?

2.11.6.1 Stack: Separate Control, Datapath and Storage

This design is derived directly from the hardware block diagram.

We separate the state machine and datapath using the control signals that drive the datapath (mux select lines, chip enables, etc).

The state machine drives signals that control the datapath. The state machine is very similar to that in the high-level model. In every state we assign values to the signals that control the datapath.

The datapath is done with concurrent statements. By using concurrent statements, rather than processes, for the datapath, we eliminate the need for the datapath assignments to have sensitivity lists, which simplifies the code.

This style works best when there are a large number of states and a small number of datapath components.

The outline of the code is:

architecture sepfsm of stack is
  ...declarations...
begin
  ...component instantiation for memory...
  ...clocked process for state machine...
  ...clocked process for tmp1...
  ...clocked process for tmp2...
  ...clocked process for tos...
  ...concurrent assignment for tos_adj...
  ...concurrent assignment for stack_addr...
  ...concurrent assignment for stack_data_in...
end sepfsm;

We now step through the code in detail, beginning with signal declarations:

architecture sepfsm of stack is
  signal tos, tos_adj, stack_addr
    : unsigned(3 downto 0);
  signal inp_intern, stack_data_in, stack_data_out, tmp1, tmp2
    : std_logic_vector(3 downto 0);
  signal synch_reset, empty, tos_inc_dec_sel, stack_addr_sel,
         tos_ce, stack_we, tmp1_ce, tmp2_ce
    : std_logic;
  signal stack_data_sel
    : std_logic_vector(1 downto 0);
  ...ram component instantiation...

NOTE: difference from HLM is test for empty before popping

  process begin
    init : loop
      --------------------------------
      empty           <= '1';
      tos_inc_dec_sel <= '-';
      stack_addr_sel  <= '-';
      tos_ce          <= '1';
      stack_we        <= '0';
      stack_data_sel  <= "--";
      tmp1_ce         <= '-';
      tmp2_ce         <= '-';
      --------------------------------
      loop
        --------------------------------
        wait until rising_edge(clk);
        next init when (reset = '1');
        --------------------------------
        case inp is
          when pop =>
            tos_inc_dec_sel <= '0';
            stack_addr_sel  <= '1';
            tos_ce          <= '1';
            stack_we        <= '0';
            stack_data_sel  <= "--";
            tmp1_ce         <= '-';
            tmp2_ce         <= '-';
          when push =>
            if (empty = '1') then
              tos_inc_dec_sel <= '-';
              stack_addr_sel  <= '0';
              tos_ce          <= '0';
            else
              tos_inc_dec_sel <= '1';
              stack_addr_sel  <= '1';
              tos_ce          <= '1';
            end if;
            stack_data_sel <= "--";
            stack_we       <= '0';
            tmp1_ce        <= '-';
            tmp2_ce        <= '-';
            --------------------------------
            wait until rising_edge(clk);
            next init when (reset = '1');
            --------------------------------
            empty <= '0';
            ...more assignments...
          when swap =>
            ...
        end case;
      end loop;
    end loop;
  end process;
  ------------------------------------------------------
  process (clk) begin
    if rising_edge(clk) then
      if (tmp1_ce = '1') then
        tmp1 <= stack_data_out;
      end if;
    end if;
  end process;
  ... tmp2 assignment ...
  ------------------------------------------------------
  process (clk) begin
    if rising_edge(clk) then
      if (reset = '1') then
        tos <= to_unsigned(0, 4);
      elsif (tos_ce = '1') then
        tos <= tos_adj;
      end if;
    end if;
  end process;
  ------------------------------------------------------
  tos_adj <= tos + 1 when (tos_inc_dec_sel = '1')
             else tos - 1;
  ...
  ...tos_adj, stack_addr, and stack_data_in...
end sepfsm;

2.11.6.2 Stack: Datapath Operations

The state machine in Section 2.11.6.1 controlled each datapath component individually. An alternative style is for the state machine to tell the datapath what state it is in, or what global collection of operations to perform; each part of the datapath then decodes this and takes the appropriate action. This style works best when there are a small number of states and a large number of datapath components.

architecture dp_op of stack is
  ---------------------------------------------------
  -- define the states
  type dp_op_ty is
    (init_op, pop_op, push1_op, push2_op,
     swap_wr_tmp1_op, swap_wr_tmp2_op,
     swap_rd_tmp1_op, swap_rd_tmp2_op,
     nop_op );
  signal dp_op : dp_op_ty;
  signal tos, tos_adj, stack_addr
    : unsigned(3 downto 0);
  signal inp_intern, stack_data_in, stack_data_out, tmp1, tmp2
    : std_logic_vector(3 downto 0);
  signal empty, stack_we : std_logic;
begin
  ---------------------------------------------------------
  process begin
    init : loop
      --------------------------------
      empty <= '1';
      dp_op <= init_op;
      loop
        --------------------------------
        wait until rising_edge(clk);
        next init when (reset = '1');
        --------------------------------
        case inp is
          when pop =>
            dp_op <= pop_op;
          when push =>
            dp_op <= push1_op;
            --------------------------------
            wait until rising_edge(clk);
            next init when (reset = '1');
            --------------------------------
            -- stack(to_integer(tos)) <= inp;
            dp_op <= push2_op;
            empty <= '0';
          when swap =>
            ...
        end case;
      end loop;
    end loop;
  end process;
  -----------------------------------------------------
  process (clk) begin
    if rising_edge(clk) then
      inp_intern <= inp;
    end if;
  end process;
  ------------------------------------------------------
  process (clk) begin
    if rising_edge(clk) then
      if (dp_op = init_op) then
        tos <= to_unsigned(0,4);
      elsif ( (dp_op = pop_op)
              OR (dp_op = push1_op and (empty = '0')) ) then
        tos <= tos_adj;
      end if;
    end if;
  end process;
  ------------------------------------------------------
  tos_adj <= tos + to_unsigned(1,3) when (dp_op = push1_op)
             else tos - to_unsigned(1,3);
  ------------------------------------------------------
  stack_addr <= tos_adj when ( (dp_op = pop_op)
                               OR ((dp_op = push1_op) AND (empty = '0'))
                               OR (dp_op = swap_wr_tmp1_op)
                               OR (dp_op = swap_rd_tmp2_op) )
                else tos;
  ...stack_data_in, stack_we, out, ram ...
end dp_op;

2.11.6.3 Stack: Explicit State Machine

Here we drop the loop ... wait ... style of implicit state machines and build an explicit state machine with current and next state signals.

Notice that the stack is such a simple design that each datapath operation in the Dp-Op architecture is used in only one state. This is a sign that the Dp-Op style is not well-suited to the stack.

This example also illustrates the use of a function to capture common code. The function is used here to determine which state to go to next when a new input instruction arrives.

architecture state of stack is
  type state_ty is
    (init_st, pop_st, push1_st, push2_st,
     swap_wr_tmp1_st, swap_wr_tmp2_st,
     swap_rd_tmp1_st, swap_rd_tmp2_st,
     nop_st );
  signal state, state_n : state_ty;
  ...
  ...
  -------------------------------------------------------
  function restart (inp : std_logic_vector(3 downto 0))
    return state_ty
  is
  begin
    case inp is
      when pop    => return(pop_st);
      when push   => return(push1_st);
      when swap   => return(swap_wr_tmp1_st);
      when others => return(nop_st);
    end case;
  end restart;
begin
  ------------------------------------------------------
  process (clk) begin
    if rising_edge(clk) then
      if (reset = '1') then
        state   <= init_st;
        empty_n <= '1';
      else
        state   <= state_n;
        empty_n <= empty;
      end if;
    end if;
  end process;
  ------------------------------------------------------
  process (state, inp) begin
    case state is
      when init_st | pop_st | push2_st | swap_wr_tmp2_st | nop_st =>
        state_n <= restart(inp);
      when push1_st =>
        state_n <= push2_st;
      when swap_rd_tmp1_st =>
        state_n <= swap_rd_tmp2_st;
      when swap_rd_tmp2_st =>
        state_n <= swap_wr_tmp1_st;
      when swap_wr_tmp1_st =>
        state_n <= swap_wr_tmp2_st;
    end case;
  end process;
  ...
  process (clk) begin
    if rising_edge(clk) then
      if (state = init_st) then
        tos <= to_unsigned(0,4);
      elsif ( (state = pop_st)
              OR (state = push1_st and (empty = '0')) ) then
        tos <= tos_adj;
      end if;
    end if;
  end process;
  ------------------------------------------------------
  tos_adj <= tos + to_unsigned(1,3) when (state = push1_st)
             else tos - to_unsigned(1,3);
  ------------------------------------------------------
  stack_addr <= tos_adj when ( (state = pop_st)
                               OR ((state = push1_st) AND (empty = '0'))
                               OR (state = swap_wr_tmp1_st)
                               OR (state = swap_rd_tmp2_st) )
                else tos;
  ...
end state;

2.12 Optimization Techniques

2.12.1 Strength Reduction

Strength reduction replaces one operation with another that is simpler.
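As one concrete illustration of strength reduction, the sketch below replaces a multiplication by 3 with a wired shift and an addition. The entity name times3, the 8-bit input width, and the 10-bit output width are assumptions chosen for the example; only the standard ieee.numeric_std package is used.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example entity: z = 3 * a without a multiplier.
entity times3 is
  port (
    a : in  unsigned(7 downto 0);
    z : out unsigned(9 downto 0)    -- 3 * 255 = 765 fits in 10 bits
  );
end times3;

architecture strength_reduced of times3 is
  signal a_x1, a_x2 : unsigned(9 downto 0);
begin
  a_x1 <= resize(a, 10);      -- a, zero-extended to 10 bits
  a_x2 <= '0' & a & '0';      -- 2*a: a shift by a constant is only wiring
  z    <= a_x2 + a_x1;        -- 3*a = 2*a + a: one adder instead of a multiplier
end strength_reduced;
```

The shift costs no logic because the shift amount is a constant, so the only hardware left is one adder. Whether a synthesis tool would derive this from a literal multiplication by 3 depends on the tool.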
2.12.1.1 Arithmetic Strength Reduction

  Multiply by a constant power of two    wired shift logical left
  Multiply by a power of two             shift logical left
  Divide by a constant power of two      wired shift logical right
  Divide by a power of two               shift logical right
  Multiply by 3                          wired shift and addition

2.12.1.2 Boolean Strength Reduction

Boolean tests that can be implemented as wires:

  is_odd, is_even : least significant bit
  is_neg, is_pos  : most significant bit

NOTE: use is_odd(a) rather than a(0)

By choosing your encodings carefully, you can sometimes reduce a vector comparison to a wire. For example, if your state uses a one-hot encoding, then the comparison state = S3 reduces to state(3) = '1'. You might expect a reasonable logic-synthesis tool to do this reduction automatically, but most tools do not.

When using encodings other than one-hot, Karnaugh maps can be useful tools for optimizing vector comparisons. By carefully choosing our state assignments, when we use a full binary encoding for 8 states, the comparison:

  (state = S0 or state = S3 or state = S4) = '1'

can be reduced to a single-bit comparison, such as state(2) = '1'.

2.12.2 Replication and Sharing

2.12.2.1 Mux-Pushing

Pushing multiplexors into the fanin of a signal can reduce area.

Before:

  z <= a + b when (w = '1')
       else a + c;

After:

  tmp <= b when (w = '1')
         else c;
  z   <= a + tmp;

The first circuit will have two adders, while the second will have one adder. Some synthesis tools will perform this optimization automatically, particularly if all of the signals are combinational.

2.12.2.2 Common Subexpression Elimination

Introduce new signals to capture subexpressions that occur in multiple places in the code.

Before:

  y <= a + b + c when (w = '1')
       else d;
  z <= a + c + d when (w = '1')
       else e;

After:

  tmp <= a + c;
  y   <= b + tmp when (w = '1')
         else d;
  z   <= d + tmp when (w = '1')
         else e;

Note: Clocked subexpressions. Care must be taken when doing common subexpression elimination in a clocked process.
Putting the temporary signal in the clocked process will add a clock cycle to the latency of the computation, because the tmp signal will be a flip-flop. The tmp signal must be combinational to preserve the behaviour of the circuit.

2.12.2.3 Computation Replication

- To improve performance: if the same result is needed at two very distant locations and wire delays are significant, it might improve performance (increase clock speed) to replicate the hardware.
- To reduce area: if the same result is needed at two different times that are widely separated, it might be cheaper to reuse the hardware component to repeat the computation than to store the result in a register.

Note: Muxes are not free. Each time a component is reused, multiplexors are added to inputs and/or outputs. Too much sharing of a component can cost more area in additional multiplexors than would be spent in replicating the component.

2.12.3 Arithmetic

VHDL is left-associative. The expression a + b + c + d is interpreted as (((a + b) + c) + d). You can use parentheses to suggest parallelism, for example (a + b) + (c + d).

Perform arithmetic on the minimum number of bits needed. If you only need the lower 12 bits of a result, but your input signals are 16 bits wide, trim your inputs to 12 bits. This results in a smaller and faster design than computing all 16 bits of the result and trimming the result to 12 bits.

2.12.4 Pipelining

You can turn a dataflow diagram into a pipeline by making each clock cycle of the dataflow diagram a separate pipe stage. However, this can be complicated and error-prone. You need to worry about data hazards if you have state-holding registers in your algorithm. You need to worry about structural hazards if different instructions have different latencies.

A rough description of the technique to turn a dataflow diagram into a pipeline: group one or more consecutive clock cycles of computation for all instructions into each stage. Each stage becomes a single module. Hardware is not shared between stages.
So, moving from a non-pipelined implementation to a pipelined implementation will increase the area of the design.

For pipelines, the most important measure of performance is usually throughput, which is the inverse of the number of clock cycles that are grouped into a single stage. For example, if each clock cycle becomes a single stage, then the throughput (as measured in clock cycles) is 1 parcel/clock cycle. As another example, if two clock cycles are grouped into a single stage, then a new parcel can enter the pipeline once every two clock cycles.

2.13 Design Problems

P2.1 Synthesis

This question is about using VHDL to implement memory structures on FPGAs.

P2.1.1 Data Structures

If you have to write your own code (i.e. you do not have a library of memory components or a special component generation tool such as LogiBlox or CoreGen), what data structures in VHDL would you use when creating a register file?

P2.1.2 Own Code vs Libraries

When using VHDL for an FPGA, under what circumstances is it better to write your own VHDL code for memory, rather than instantiate memory components from a library?

P2.2 Design Guidelines

While you are grocery shopping you encounter your co-op supervisor from last year. She's now forming a startup company in Waterloo that will build digital circuits. She's writing up the design guidelines that all of their projects will follow. She asks for your advice on some potential guidelines.

What is your response to each question? What is your justification for your answer? What are the tradeoffs between the two options?

0. Sample: Should all projects use silicon chips, or should all use biological chips, or should each project choose its own technique?
   Answer: All projects should use silicon-based chips, because biological chips don't exist yet. The tradeoff is that if biological chips existed, they would probably consume less power than silicon chips.

1.
Should all projects use an asynchronous reset signal, or should all use a synchronous reset signal, or should each project choose its own technique?

2. Should all projects use latches, or should all projects use flip-flops, or should each project choose its own technique?

3. Should all chips have registers on the inputs and outputs or should chips have the inputs and outputs directly connected to combinational circuitry, or should each project choose its own technique? By register we mean either flip-flops or latches, based upon your answer to the previous question. If your answer is different for inputs and outputs, explain why.

4. Should all circuit modules on all chips have flip-flops on the inputs and outputs or should chips have the inputs and outputs directly connected to combinational circuitry, or should each project choose its own technique? By register we mean either flip-flops or latches, based upon your answer to the previous question. If your answer is different for inputs and outputs, explain why.

5. Should all projects use tri-state buffers, or should all projects use multiplexors, or should each project choose its own technique?

P2.3 Dataflow Diagram Optimization

Use the dataflow diagram below to answer problems P2.3.1 and P2.3.2.

[Figure: dataflow diagram with signals a, b, c, d, e and components f and g]

P2.3.1 Resource Usage

List the number of items for each resource used in the dataflow diagram.

P2.3.2 Optimization

Draw an optimized dataflow diagram that improves the performance and produces the same output values. Or, if the performance cannot be improved, describe the limiting factor on the performance.
NOTES:
- you may change the times when signals are read from the environment
- you may not increase the resource usage (input ports, registers, output ports, f components, g components)
- you may not increase the clock period

P2.4 Dataflow Diagram Design

Your manager has given you the task of implementing the following pseudocode in an FPGA:

  if is_odd(a + d)
    p = (a + d)*2 + ((b + c) - 1)/4;
  else
    p = (b + c)*2 + d;

NOTES:
1) You must use registers on all input and output ports.
2) p, a, b, c, and d are to be implemented as 8-bit signed signals.
3) A 2-input 8-bit ALU that supports both addition and subtraction takes 1 clock cycle.
4) A 2-input 8-bit multiplier or divider takes 4 clock cycles.
5) A small amount of additional circuitry (e.g. a NOT gate, an AND gate, or a MUX) can be squeezed into the same clock cycle(s) as an ALU operation, multiply, or divide.
6) You can require that the environment provides the inputs in any order and that it holds the input signals at the same value for multiple clock cycles.

P2.4.1 Maximum Performance

What is the minimum number of clock cycles needed to implement the pseudocode with a circuit that has two input ports?

What is the minimum number of ALUs, multipliers, and dividers needed to achieve the minimum number of clock cycles that you just calculated?

P2.4.2 Minimum Area

What is the minimum number of datapath storage registers (8, 6, 4, and 1 bit) and clock cycles needed to implement the pseudocode if the circuit can have at most one ALU, one multiplier, and one divider?

P2.5 Michener: Design and Optimization

Design a circuit named michener that performs the following operation:

  z = (a+d) + ((b c) - 1)

NOTES:
1. Optimize your design for area.
2. You may schedule the inputs to arrive at any time.
3. You may do algebraic transformations of the specification.
P2.6 Dataflow Diagrams with Memory Arrays

  Component                     Delay
  Register                      5 ns
  Adder                         25 ns
  Subtracter                    30 ns
  ALU with [?], AND, XOR        40 ns
  Memory read                   60 ns
  Memory write                  60 ns
  Multiplication                65 ns
  2:1 Multiplexor               5 ns

NOTES:
1. The inputs of the algorithms are a and b.
2. The outputs of the algorithms are p and q.
3. You must register both your inputs and outputs.
4. You may choose to read your input data values at any time and produce your outputs at any time. For your inputs, you may read each value only once (i.e. the environment will not send multiple copies of the same value).
5. Execution time is measured from when you read your first input until the later of producing your last output or completing the write of a result to memory.
6. M is an internal memory array, which must be implemented as dual-ported memory with one read/write port and one read port.
7. M supports synchronous write and asynchronous read.
8. Assume all memory address and other arithmetic calculations are within the range of representable numbers (i.e. no overflows occur).
9. If you need a circuit not on the list above, assume that its delay is 30 ns.
10. You may sacrifice area efficiency to achieve high performance, but marks will be deducted for extra hardware that does not contribute to performance.

P2.6.1 Algorithm 1

  q = M[b];
  M[a] = b;
  p = M[b+1] * a;

Assuming a [?] b, draw a dataflow diagram that is optimized for the fastest overall execution time.

P2.6.2 Algorithm 2

  q = M[b];
  M[a] = q;
  p = (M[b-1] * b) + M[b];

Assuming a [?] b, draw a dataflow diagram that is optimized for the fastest overall execution time.

P2.7 2-bit Adder

This question compares an FPGA implementation and a generic-gates implementation of a 2-bit full adder.

P2.7.1 Generic Gates

Show the implementation of a 2-bit adder using NAND, NOR, and NOT gates.

P2.7.2 FPGA

Show the implementation of a 2-bit adder using generic FPGA cells; show the equations for the lookup tables.
[Figure: two generic FPGA cells, each a combinational lookup table (comb) feeding a flip-flop with D, CE, R, S, and Q pins; signals c_in, a[0], b[0], sum[0], carry_1, a[1], b[1], sum[1], c_out]

P2.8 Sketches of Problems

1. calculate resource usage for a dataflow diagram (input ports, output ports, registers, datapath components)
2. calculate performance data for a dataflow diagram (clock period and number of cycles to execute (CPI))
3. given a dataflow diagram, calculate the clock period that will result in the optimum performance
4. given an algorithm, design a dataflow diagram
5. given a dataflow diagram, design the datapath and finite state machine
6. optimize a dataflow diagram to improve performance or reduce resource usage
7. given an FSM diagram, pick the VHDL code that best implements the diagram (correct behaviour; simple, fast hardware), or critique the hardware

Chapter 3 Functional Verification

3.1 Introduction

3.1.1 Purpose

The purpose of this chapter is to illustrate techniques to quickly and reliably detect bugs in datapath and control circuits.

Section 3.5 discusses verification of datapath circuits and introduces the notions of testbench, specification, and implementation. In Section 3.6 we discuss techniques that are useful for debugging control circuits.

The verification guild website, http://www.janick.bergeron.com/guild/default.htm, is a good source of information on functional verification.

3.2 Overview

The purpose of functional verification is to detect and correct errors that cause a system to produce erroneous results. The terminology for validation, verification, and testing differs somewhat from discipline to discipline. In this section we outline some of the terminology differences and describe the terminology used in E&CE 427. We then describe some of the reasons that chips tend to work incorrectly.

3.2.1 Terminology: Validation / Verification / Testing

functional validation: Comparing the behaviour of a design against the customer's expectations. In validation, the specification is the customer.
There is no specification that can be used to evaluate the correctness of the design (implementation).

functional verification: Comparing the behaviour of a design (e.g. RTL code) against a specification (e.g. high-level model) or collection of properties.
- usually treats combinational circuitry as having zero delay
- usually done by simulating the circuit with test vectors
- big challenges are simulation speed and test generation

formal verification: Checking that a design has the correct behaviour for every possible input and internal state.
- uses mathematics to reason about the circuit, rather than checking individual vectors of 1s and 0s
- capacity problems: only usable on detailed models of small circuits or abstract models of large circuits
- mostly a research topic, but some practical applications have been demonstrated
- tools include model checking and theorem proving
- formal verification is not a guarantee that the circuit will work correctly

performance validation: Checking that the implementation has (at least) the desired performance.

power validation: Checking that the implementation has (at most) the desired power.

equivalence verification (checking): Checking that the design generated by a synthesis tool has the same behaviour as the RTL code.

timing verification: Checking that all of the paths in a circuit meet the timing constraints.

Hardware vs Software Terminology

Note: In software, testing refers to running programs with specific inputs and checking if the program does the right thing. In hardware, testing usually means manufacturing testing, which is checking the circuits that come off of the manufacturing line.

3.2.2 The Difficulty of Designing Correct Chips

3.2.2.1 Notes from Kenn Heinrich (UW E&CE grad)

Everyone should get a lecture on why their first industrial design won't work in the field.
Here are a few reasons why getting a single system to work correctly for a few minutes in a university lab is much easier than getting thousands of systems to work correctly for months at a time in dozens of countries around the world.

1. You forgot to make your unreachable states transition to the initial (reset) state. Clock glitches, power surges, etc. will occasionally cause your system to jump to a state that isn't defined or produce an illegal data value. When this happens, your design should reset itself, rather than crash or generate illegal outputs.

2. You have internal registers that you can't access or test. If you can set a register, you must have some way of reading the register from outside the chip.

3. Another chip controls your chip, and the other chip is buggy. All of your external control lines should be able to be disabled, so that you can isolate the source of problems.

4. Not enough decoupling capacitors on your board. The analog world is cruel and unusual. Voltage spikes, current surges, crosstalk, etc. can all corrupt the integrity of digital signals. Trying to save a few cents on decoupling capacitors can cause headaches and significant financial costs in the future.

5. You only tested your system in the lab, not in the real world. As a product, systems will need to run for months in the field; simulation and simple lab testing won't catch all of the weirdness of the real world.

6. You didn't adequately test the corner cases and boundary conditions. Every corner case is as important as the main case. Even if some weird event happens only once every six months, if you do not handle it correctly, the bug can still make your system unusable and unsellable.

3.2.2.2 Notes from Aart de Geus (Chairman and CEO of Synopsys)

More than 60% of the ASIC designs that are fabricated have at least one error, issue, or problem whose severity forced the design to be reworked.
Even experienced designers have difficulty building chips that function correctly on the first pass (Figure 3.1).

61% of new chip designs require at least one re-spin:

  Problem                               Designs affected
  At least one error/issue/problem      61%
  Functional logic error                43%
  Analog tuning issue                   20%
  Signal integrity issue                17%
  Clock scheme error                    14%
  Reliability issue                     12%
  Mixed-signal problem                  11%
  Uses too much power                   11%
  Timing issue (slow paths)             10%
  Timing issue (fast paths)             10%
  IR drop issues                        7%
  Firmware error                        4%
  Other problem                         3%

Source: Aart de Geus, Chairman and CEO of Synopsys. Keynote address. Synopsys Users Group Meeting, Sep 9 2003, Boston USA.

Figure 3.1: Problems found on first-spins of new chip designs

3.3 Test Cases and Coverage

3.3.1 Test Terminology

Test case / test vector: A combination of inputs and internal state values. Represents one possible test of the system.

Boundary conditions / corner cases: A test case that represents an unusual situation on input and/or internal state signals. Corner cases are likely to contain bugs.

Test scenario: A sequence of test vectors that, together, exercise a particular situation (scenario) on a circuit. For example, a scenario for an elevator controller might include a sequence of button pushes and movements between floors.

Test suite: A collection of test vectors that are run on a circuit.

3.3.2 Coverage

To be absolutely certain that an implementation is correct, we must check every combination of values. This includes both input values and internal state (flip-flops). If we have $n_i$ bits of inputs and $n_s$ bits in flip-flops, we have to test $2^{n_i + n_s}$ different cases when doing functional verification.

Question: If we have $n_c$ combinational signals, why don't we have to test $2^{n_i + n_s + n_c}$ different cases?

Answer: The value of each combinational signal is determined by the flip-flops and inputs in its fanin.
Once the values of the inputs and flip-flops are known, the value of each combinational signal can be calculated. Thus, the combinational signals do not add additional cases that we need to consider.

Definition (Coverage): The coverage that a suite of tests achieves on a circuit is the percentage of cases that are simulated by the tests. 100% coverage means that the circuit has been simulated for all combinations of values for input signals and internal signals.

Note: Coverage Terminology
There are many different types of coverage, which measure everything from the percentage of cases that are exercised to the number of output values that are exercised.

There are many different commercial software programs that measure code and other types of coverage.

  Company         Tool                         Coverage
  Cadence         Affirma Coverage Analyzer    code, expressions, fsm
  Cadence         DAI Coverscan                code, expressions, fsm
  Cadence         Codecover                    code
  Fintronic       FinCov                       code, events, variables
  Summit Design   HDLScore                     code
  Synopsys        CoverMeter                   coverage (dead?)
  TransEDA        Verification Navigator       code and fsm
  Verisity        SureCov                      code, block, values, fsm
  Veritools       Express VCT, VeriCover       code, branch
  Aldec           Riviera                      code, block

3.3.3 Floating Point Divider Example

This example illustrates the difficulty of achieving significant coverage on realistic circuits. Consider doing the functional simulation for a double precision (64-bit) floating-point divider.

Given Information

  Data width                                                     64 bits
  Number of gates in circuit                                     10 000
  Number of assembly-language instructions to simulate
    one gate for one test case                                   100
  Number of clock cycles required to execute one assembly
    language instruction on the computer that is running
    the simulation                                               0.5
  Clock speed of computer that is running the simulation         1 Gigahertz

Number of Cases

Question: How many cases must be considered?
Answer:

  item   bits   num values
  src1   64     $2^{64} \approx 1.8 \times 10^{19}$
  src2   64     $2^{64} \approx 1.8 \times 10^{19}$

$$\mathrm{NumTestsTot} = \mathrm{NumInputCases} \times \mathrm{NumStateCases} = (2^{64} \times 2^{64}) \times 2^{0} \approx 3.4 \times 10^{38}\ \text{cases}$$

Simulation Run Time

Question: How long will it take to simulate all of the different possible cases using a single computer?

Answer:

1. Calculate the number of seconds to simulate one test case:

$$\mathrm{TestTime}_{1:1} = 10000\ \text{gates} \times 100\ \tfrac{\text{instrs}}{\text{gate}} \times 0.5\ \tfrac{\text{cycles}}{\text{instr}} \times 10^{-9}\ \tfrac{\text{secs}}{\text{cycle}} = 5 \times 10^{-4}\ \text{secs}$$

2. Number of tests per year:

$$\mathrm{NumTests}_{:1} = \frac{60\ \tfrac{\text{secs}}{\text{min}} \times 60\ \tfrac{\text{mins}}{\text{hour}} \times 24\ \tfrac{\text{hours}}{\text{day}} \times 365.25\ \tfrac{\text{days}}{\text{year}}}{\mathrm{TestTime}_{1:1}} \approx \frac{3.16 \times 10^{7}\ \text{secs/year}}{5 \times 10^{-4}\ \text{secs}} \approx 6.3 \times 10^{10}\ \tfrac{\text{cases}}{\text{year}}$$

3. Number of years to test all cases:

$$\mathrm{TestTimeTot} = \frac{\mathrm{NumTestsTot}}{\mathrm{NumTests}_{:1}} = \frac{3.4 \times 10^{38}\ \text{cases}}{6.3 \times 10^{10}\ \text{cases/year}} \approx 5.4 \times 10^{27}\ \text{years}$$

Coverage

Question: If you can run simulations non-stop for one year on ten computers, what coverage will you achieve?

Answer:

1. Number of tests per year using ten computers:

$$\mathrm{NumTests}_{:10} = 10 \times \mathrm{NumTests}_{:1} = 10 \times 6.3 \times 10^{10} = 6.3 \times 10^{11}\ \text{cases}$$

2. Calculate the coverage achieved by running tests on ten computers for one year:

$$\mathrm{Covg} = \frac{\mathrm{NumTestsRun}}{\mathrm{NumTestsTot}} = \frac{\mathrm{NumTests}_{:10}}{\mathrm{NumTestsTot}} = \frac{6.3 \times 10^{11}}{3.4 \times 10^{38}} \approx 1.9 \times 10^{-27} \approx 1.9 \times 10^{-25}\,\%$$

The message is that, even with large amounts of computing resources, it is difficult to achieve numerically significant coverage for realistic circuits. An effective functional verification plan requires carefully chosen test cases, so that even the minuscule amount of coverage that is realistically achievable catches most (all?!?!) of the bugs in the design.

Simulation vs the Real World

From "Validating the Intel(R) Pentium(R) 4 Microprocessor" by Bob Bentley, Design Automation Conference 2001. (Link on E&CE 427 web page.)

Simulating the Pentium 4 Processor on a Pentium 3 Processor ran at about 15 MHz.
By tapeout, over 200 billion simulation cycles had been run on a network of computers. All of these simulations represent less than two minutes of running a real processor.

3.4 Testbenches

A test bench (also known as a test rig, test harness, or test jig) is a collection of code used to simulate a circuit and check if it works correctly.

Testbenches are not synthesized. You do not need to restrict yourself to the synthesizable subset of VHDL. Use the full power of VHDL to make your testbenches concise and powerful.

3.4.1 Overview of Test Benches

[Figure: testbench containing stimulus, specification, implementation, and check blocks]

Implementation: Circuit that you're checking for bugs. Also known as: design under test or unit under test.
Stimulus: Generates test vectors.
Specification: Describes the desired behaviour of the implementation.
Check: Checks whether the implementation obeys the specification.

Notes and observations

- Testbenches usually do not have any inputs or outputs. Inputs are generated by stimulus. Outputs are analyzed by check, and relevant information is printed using report statements.
- Different circuits will use different stimuli, specifications, and checks.
- The roles of the specification and check are somewhat flexible. Most circuits will have complex specifications and simple checks. However, some circuits will have simple specifications and complex checks.
- If two circuits are supposed to have the same behaviour, then they can use the same stimuli, specification, and check.
- If two circuits are supposed to have the same behaviour, then one can be used as the specification for the other.
- Testbenches are restricted to stimulating only primary inputs and observing only primary outputs. To check the behaviour of internal signals, use assertions.
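The four roles described above can be sketched in a few lines. This is only an illustration under stated assumptions: the entity name inv_tb, the signals t_in, t_impl, and t_spec, and the constant NUM_VECTORS are hypothetical, and a one-line inverter stands in for a real component instantiation. It shows a check process that compares implementation against specification and prints results with report statements.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench for an inverter; all names are illustrative only.
-- Like most testbenches, the entity has no inputs or outputs.
entity inv_tb is
end inv_tb;

architecture main of inv_tb is
  constant NUM_VECTORS : natural := 2;   -- must match the stimulus process
  signal t_in   : std_logic := '0';
  signal t_impl : std_logic;
  signal t_spec : std_logic;
begin
  -- implementation (stands in for a component instantiation)
  t_impl <= not t_in;

  -- specification: an independent description of the desired behaviour
  t_spec <= '1' when t_in = '0' else '0';

  -- stimulus: generates the test vectors, one every 10 ns
  stimulus : process
  begin
    t_in <= '0'; wait for 10 ns;
    t_in <= '1'; wait for 10 ns;
    wait;                                -- no more vectors
  end process;

  -- check: samples at the end of each vector, when outputs have settled
  check : process
  begin
    for i in 1 to NUM_VECTORS loop
      wait for 10 ns;
      assert t_impl = t_spec
        report "vector " & integer'image(i) & ": impl and spec differ"
        severity error;
      report "vector " & integer'image(i) &
             ": impl=" & std_logic'image(t_impl) &
             " spec=" & std_logic'image(t_spec);
    end loop;
    wait;                                -- all vectors checked
  end process;
end main;
```

Sampling at the end of each 10 ns window, rather than on every signal change, is one way to avoid false mismatches while the implementation and specification outputs are still settling.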
3.4.2 Reference Model Style Testbench

[Figure: reference model testbench containing stimulus, specification, and implementation blocks]

- Specification has the same inputs and outputs as the implementation.
- Specification is a clock-cycle accurate description of the desired behaviour of the implementation.
- Check is an equality test between the outputs of the specification and the implementation.

Examples

- Execution modules: output is sum, difference, product, quotient, etc. of inputs
- DSP filters
- Instruction decoders

Note: Functional specification vs Reference model
"Functional specification" and "reference model" are often used interchangeably.

3.4.3 Relational Style Testbench

[Figure: relational testbench containing stimulus, implementation, and check blocks]

Relational testbenches, or relational specifications, are used when we do not want to specify the specific output values that the implementation must produce. Instead, we want to check that some relationship holds between the output and the input, or that some relationship holds amongst the output values (independent of the values of the input signals).

- Specification is usually just wires to feed the input signals to the check.
- Check is the brains and encodes the desired behaviour of the circuit.

Examples

- Carry-save adders: the two outputs are the sum of the three inputs, but do not specify exact values of each individual output.
- Arbiters: every request is eventually granted, but do not specify in which order requests are granted.
- One-hot encoding: exactly one bit of the vector is a 1, but do not specify which bit is a 1.

Note: Relational specification vs relational testbench
"Relational specification" and "relational testbench" are often used interchangeably.
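As a sketch of the one-hot example above, the check below tests a relationship (exactly one bit is '1') without saying which bit must be set. The entity onehot_check_tb, the function is_one_hot, and the signal names are hypothetical; the stimulus process drives the vector directly and stands in for the outputs of a real implementation.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical relational check for a one-hot vector; names are illustrative.
entity onehot_check_tb is
end onehot_check_tb;

architecture main of onehot_check_tb is
  signal state : std_logic_vector(3 downto 0) := "0001";
  signal ok    : boolean;

  -- Returns true when exactly one bit of v is '1'.
  function is_one_hot(v : std_logic_vector) return boolean is
    variable count : natural := 0;
  begin
    for i in v'range loop
      if v(i) = '1' then
        count := count + 1;
      end if;
    end loop;
    return count = 1;
  end function;
begin
  -- Stands in for the implementation outputs: walks a single '1'
  -- through the vector, one position every 10 ns.
  stimulus : process
  begin
    for i in state'range loop
      state    <= (others => '0');
      state(i) <= '1';       -- later assignment overrides bit i
      wait for 10 ns;
    end loop;
    wait;
  end process;

  -- Relational check: the property holds for any one-hot value,
  -- so no expected output vector is needed.
  check : process (state)
  begin
    ok <= is_one_hot(state);
  end process;
end main;
```

Because the check encodes only the relationship, the same process works unchanged for any state assignment the implementation happens to use.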
3.4.4 Coding Structure of a Testbench

  architecture main of athabasca_tb is
    component declaration for implementation;
    other declarations
  begin
    implementation instantiation;
    stimulus process;
    specification process (or component instantiation);
    check process;
  end main;

3.4.5 Datapath vs Control

Datapath and control circuits tend to use different styles of testbenches.

Datapath circuits tend to be well-suited to reference-model style testbenches:
- Each set of inputs generates one set of outputs.
- Each set of outputs is a function of just one set of inputs.

Control circuits often pose problems for testbenches:
- Many more internal signals than outputs.
- The behaviour of the outputs provides a view into only a fragment of the current state of the circuit.
- It may take many clock cycles from when a bug is exercised inside the circuit until it generates a deviation from the correct behaviour on the outputs.
- When the deviation on the outputs is observed, it is very difficult to pinpoint the precise cause of the deviation (the root cause of the bug).

Assertions can be used to check the behaviour of internal signals. Control circuits tend to use assertions to check correctness and rely on testbenches only to stimulate inputs.

3.4.6 Verification Tips

Suggested order of simulation for functional verification:

1. Write high-level model.
2. Simulate high-level model until it has the correct functionality and latency.
3. Write synthesizable model.
4. Use zero-delay simulation (uw-sim) to check behaviour of synthesizable model against high-level model.
5. Optimize the synthesizable model.
6. Use zero-delay simulation (uw-sim) to check behaviour of optimized model against high-level model.
7. Use timing simulation (uw-timsim) to check behaviour of optimized model against high-level model.

Section 3.5 describes a series of testbenches that are particularly useful for debugging datapath circuits in the early phases of the design cycle.
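To make the remark in Section 3.4.5 about assertions on internal signals concrete, here is a minimal sketch. The entity mod3_counter and its signals are hypothetical and are not taken from the course code. The assertion sits inside the architecture, where it can see the internal signal count even though a testbench can observe only the done output.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical control circuit: a counter that cycles 0, 1, 2.
entity mod3_counter is
  port ( clk, reset : in  std_logic;
         done       : out std_logic );
end mod3_counter;

architecture main of mod3_counter is
  signal count : unsigned(1 downto 0) := "00";   -- internal signal
begin
  -- normal implementation
  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' or count = 2 then
        count <= "00";
      else
        count <= count + 1;
      end if;
    end if;
  end process;

  done <= '1' when count = 2 else '0';

  -- Assertion on an internal signal: the value 3 should be unreachable.
  -- Checking on the clock edge samples a settled value of count.
  process (clk)
  begin
    if rising_edge(clk) then
      assert count /= 3
        report "mod3_counter: count reached illegal value 3"
        severity error;
    end if;
  end process;
end main;
```

If a bug ever drove count to 3, the assertion would fire in the cycle where the illegal value appears, rather than many cycles later when the deviation finally reaches an output.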
3.5 Functional Verification for Datapath Circuits

In this section we will incrementally develop a testbench for a very simple circuit: an AND gate. Although the example circuit is trivial in size, the process scales well to very large circuits. The process allows verification to begin as soon as a circuit is simulatable, even before a complete specification has been written.

Implementation

  library ieee;
  use ieee.std_logic_1164.all;

  entity and2 is
    port (
      a, b : in  std_logic;
      c    : out std_logic
    );
  end and2;

  architecture main of and2 is
  begin
    c <= '1' when (a = '1' AND b = '1') else '0';
  end main;

3.5.1 A Spec-Less Testbench

(NOTE: this code has been reviewed manually but has not been simulated. The concepts are illustrated correctly, but there might be typographical errors in the code.)

First, use a waveform viewer to check that the implementation generates reasonable outputs for a small set of inputs.

  library ieee;
  use ieee.std_logic_1164.all;

  entity and2_tb is
  end and2_tb;

  architecture main_tb of and2_tb is
    component and2
      port (
        a, b : in  std_logic;
        c    : out std_logic
      );
    end component;
    signal ta, tb, tc_impl : std_logic;
    signal ok : boolean;
  begin
    ---------------------------------------------
    impl : and2 port map (a => ta, b => tb, c => tc_impl);
    ---------------------------------------------
    stimulus : process
    begin
      ta <= '0'; tb <= '0';
      wait for 10 ns;
      ta <= '1'; tb <= '1';
      wait for 10 ns;
    end process;
    ---------------------------------------------
  end main_tb;

Use the spec-less testbench until the implementation generates solid Boolean values (no X or U data) and you have checked that a few simple test cases generate correct outputs.

3.5.2 Use an Array for Test Vectors

Writing code to drive inputs and repetitively typing wait for 10 ns; can get tedious, so code up test vectors in an array.

(NOTE: this code has not been checked for correctness)

  architecture main_tb of and2_tb is
    ...
  begin
    ...
    stimulus : process
      type test_datum_ty is record
        ra, rb : std_logic;
      end record;
      type test_vectors_ty is array(natural range <>) of test_datum_ty;
      constant test_vectors : test_vectors_ty :=
        --   a    b
        ( ( '0', '0'),
          ( '1', '1')
        );
    begin
      for i in test_vectors'low to test_vectors'high loop
        ta <= test_vectors(i).ra;
        tb <= test_vectors(i).rb;
        wait for 10 ns;
      end loop;
    end process;
  end main_tb;

Use this testbench until checking the correctness of the outputs by hand using a waveform viewer becomes difficult.

3.5.3 Build Spec into Stimulus

(NOTE: this code has not been checked for correctness)

After a few test vectors appear to be working correctly (via a manual check of waveforms on simulation), begin automatically checking that outputs are correct.

- Add expected result to stimulus
- Add check process

  architecture main_tb of and2_tb is
    ...   -- declarations as before, plus: signal tc_spec : std_logic;
  begin
    ------------------------------------------
    impl : and2 port map (a => ta, b => tb, c => tc_impl);
    ------------------------------------------
    stimulus : process
      type test_datum_ty is record
        ra, rb, rc : std_logic;
      end record;
      type test_vectors_ty is array(natural range <>) of test_datum_ty;
      constant test_vectors : test_vectors_ty :=
        -- a, b: inputs
        -- c   : expected output
        --   a    b    c
        ( ( '0', '0', '0'),
          ( '0', '1', '0'),
          ( '1', '1', '1')
        );
    begin
      for i in test_vectors'low to test_vectors'high loop
        ta      <= test_vectors(i).ra;
        tb      <= test_vectors(i).rb;
        tc_spec <= test_vectors(i).rc;
        wait for 10 ns;
      end loop;
    end process;
    ------------------------------------------
    check : process (tc_impl, tc_spec)
    begin
      ok <= (tc_impl = tc_spec);
    end process;
    ------------------------------------------
  end main_tb;

Use this testbench until it becomes tedious to calculate manually the correct result for each test case.

3.5.4 Have Separate Specification Entity

Rather than write the specification as part of stimulus, create a separate specification entity/architecture. The specification component then calculates the expected output values.
(NOTE: if your simulation tool supports configurations, the spec and impl can share the same entity; we'll see this in Section 3.6.)

  entity and2_spec is
    ...(same as and2 entity)...
  end and2_spec;

  architecture spec of and2_spec is
  begin
    c <= a AND b;
  end spec;

  architecture main_tb of and2_tb is
    component and2 ...;
    component and2_spec ...;
    signal ta, tb, tc_impl, tc_spec : std_logic;
    signal ok : boolean;
  begin
    ------------------------------------------
    impl : and2      port map (a => ta, b => tb, c => tc_impl);
    spec : and2_spec port map (a => ta, b => tb, c => tc_spec);
    ------------------------------------------
    stimulus : process
      type test_datum_ty is record
        ra, rb : std_logic;
      end record;
      type test_vectors_ty is array(natural range <>) of test_datum_ty;
      constant test_vectors : test_vectors_ty :=
        --   a    b
        ( ( '0', '0'),
          ( '1', '1')
        );
    begin
      for i in test_vectors'low to test_vectors'high loop
        ta <= test_vectors(i).ra;
        tb <= test_vectors(i).rb;
        wait for 10 ns;
      end loop;
    end process;
    ------------------------------------------
    check : process (tc_impl, tc_spec)
    begin
      ok <= (tc_impl = tc_spec);
    end process;
    ------------------------------------------
  end main_tb;

3.5.5 Generate Test Vectors Automatically

When it becomes tedious to write out each test vector by hand, we can automatically compute them. This example uses a pair of nested for loops to generate all four combinations of input values for two signals.

  architecture main_tb of and2_tb is
    ...
  begin
    ...
    stimulus : process
      -- restrict std_logic to the two values '0' and '1'
      subtype std_test_ty is std_logic range '0' to '1';
    begin
      for va in std_test_ty'low to std_test_ty'high loop
        for vb in std_test_ty'low to std_test_ty'high loop
          ta <= va;
          tb <= vb;
          wait for 10 ns;
        end loop;
      end loop;
    end process;
    ...
  end main_tb;

3.5.6 Relational Specification

  architecture main_tb of and2_tb is
    ...
  begin
    ------------------------------------------
    impl : and2 port map (a => ta, b => tb, c => tc_impl);
    ------------------------------------------
    stimulus : process
      ...
    end process;
    ------------------------------------------
    -- the output must not be '1' when either input is '0'
    check : process (tc_impl, ta, tb)
    begin
      ok <= NOT (tc_impl = '1' AND (ta = '0' OR tb = '0'));
    end process;
    ------------------------------------------
  end main_tb;

3.6 Functional Verification of Control Circuits

Control circuits are often more challenging to verify than datapath circuits.

- Control circuits have many internal signals.
- Testbenches are unable to access key information about the behaviour of a control circuit.
- Many clock cycles can elapse between when a bug causes an internal signal to have an incorrect value and when an output signal shows the effect of the bug.

In this section, we will explore the functional verification of state machines via a First-In First-Out queue.

The VHDL code for the queue is on the web at: http://www.ece.uwaterloo.ca/ece427/exs/queue

3.6.1 Overview of Queues in Hardware

[Figure 3.2: Structure of queue (write and read ports)]
[Figure 3.3: Write Sequence]
[Figure 3.4: A Second Example Write]
[Figure 3.5: Example Read Sequence]
[Figure 3.6: Write Illustrating Index Wrap]
[Figure 3.7: Write Illustrating Full Queue]
[Figure 3.8: Queue Signals (do_rd, do_wr, rd_idx, wr_idx, data_rd, data_wr, mem, empty)]
[Figure 3.9: Incomplete Queue Blocks (memory with WE, A0, DI0, DO0, A1, DO1 ports; control circuitry not shown)]

3.6.2 VHDL Coding

3.6.2.1 Package

Things to notice in queue package:

1.
separation of package and body

    package queue_pkg is
      subtype data is std_logic_vector(3 downto 0);
      function to_data(i : integer) return data;
    end queue_pkg;

    package body queue_pkg is
      function to_data(i : integer) return data is
      begin
        return std_logic_vector(to_unsigned(i, 4));
      end to_data;
    end queue_pkg;

3.6.2.2 Other VHDL Coding

VHDL coding techniques to notice in queue implementation:

1. type declaration for vectors
2. attributes
   (a) 'low, 'high, 'length
3. functions (reduce overall implementation and maintenance effort)
   (a) reduce redundant code
   (b) hide implementation details
   (c) (just like software engineering....)

3.6.3 Code Structure for Verification

Verification things to notice in queue implementation:

1. instrumentation code
2. coverage monitors
3. assertions

    architecture ... is
      ...
    begin
      ... normal implementation ...
      process (clk) begin
        if rising_edge(clk) then
          ... instrumentation code ...
          prev_signame <= signame;
        end if;
      end process;
      ... assertions ...
      ... coverage monitors ...
    end;

3.6.4 Instrumentation Code

- Added to implementation to support verification
- Usually keeps track of previous values of signals
- Does not create hardware (optimized away during synthesis)
  - Does not feed any output signals
- Must use synthesizable subset of VHDL

    process (clk) begin
      if rising_edge(clk) then
        prev_rd_idx <= rd_idx;
        prev_wr_idx <= wr_idx;
        prev_do_rd  <= do_rd;
        prev_do_wr  <= do_wr;
      end if;
    end process;

Note: Naming convention for instrumentation. For assertions, signals are named prev_signame and signame, rather than next_signame and signame as is done for state machines. This is because for assertions we use the prev signals as history signals, to keep track of past events. In contrast, for state machines, we name the signals next, because the state machine computes the next values of signals.
3.6.5 Coverage Monitors

The goal of a coverage monitor is to check if a certain event is exercised in a simulation run. If a test suite does not trigger a coverage monitor, then we probably want to add a test vector that will trigger the monitor. For example, for a circuit used in a microwave oven controller, we might want to make sure that we simulate the situation when the door is opened while the power is on.

1. Identify important events, conditions, transitions
2. Write instrumentation code to detect each event
3. Use report to write a message when the event happens
4. When the simulation is run, report statements will print when a coverage condition is detected
5. Pipe simulation results to a log file
6. Examine the log file and coverage monitors to find cases and transitions not tested by existing test vectors
7. Add test vectors to exercise missing cases
8. Idea: automate detection of missing cases using a Perl script to find coverage messages in the VHDL code that aren't in the log file
9. Real world: most commercial synthesis tools come with add-on packages that provide different types of coverage analysis
10. Research/entrepreneurial idea: based on missing coverage cases, find new test vectors to exercise each case

Coverage Events for Queue

[Figure: three Prev/Now diagrams showing the relative positions of the rd and wr indices]

Question: What events should we monitor to estimate the coverage of our functional tests?

Answer:

- wr_idx and rd_idx are far apart
- wr_idx and rd_idx are equal
- wr_idx catches rd_idx
- rd_idx catches wr_idx
- rd_idx wraps
- wr_idx wraps

Coverage Monitor Template
    process (signals read)
    begin
      if (condition) then
        report "coverage: message";
      elsif (condition) then
        report "coverage: message";
      else
        report "error: case fall through on message"
          severity warning;
      end if;
    end process;

Coverage Monitor Code

Events related to rd_idx equals wr_idx.

    process (prev_rd_idx, prev_wr_idx, rd_idx, wr_idx)
    begin
      if (rd_idx = wr_idx) then
        if ( prev_rd_idx = prev_wr_idx ) then
          report "coverage: read = write both moved";
        elsif ( rd_idx /= prev_rd_idx ) then
          report "coverage: Read caught write";
        elsif ( wr_idx /= prev_wr_idx ) then
          report "coverage: Write caught read";
        else
          report "error: case fall through on rd/wr catching"
            severity warning;
        end if;
      end if;
    end process;

Events related to rd_idx wrapping.

    process (rd_idx)
    begin
      if (rd_idx = low_idx) then
        report "coverage: rd mv to low";
      elsif (rd_idx = high_idx) then
        report "coverage: rd mv to high";
      else
        report "coverage: rd mv normal";
      end if;
    end process;

3.6.6 Assertions

Assertions for Queue

1. If rd_idx changes, then it increments or wraps.
2. If rd_idx changes, then do_rd was '1', or reset is '1'.
3. If wr_idx changes, then it increments or wraps.
4. If wr_idx changes, then do_wr was '1', or reset is '1'.
5. And many others....

Assertion Template

    process (signals read)
    begin
      assert (required condition)
        report "error: message" severity warning;
    end process;

Assertions: Read Index

    process (rd_idx)
    begin
      assert ((rd_idx > prev_rd_idx) or (rd_idx = low_idx))
        report "error: rd inc" severity warning;
      assert ((prev_do_rd = '1') or (reset = '1'))
        report "error: rd imp do_rd" severity warning;
    end process;

Assertions: Write Index
    process (wr_idx)
    begin
      assert ((wr_idx > prev_wr_idx) or (wr_idx = low_idx))
        report "error: wr inc" severity warning;
      assert ((prev_do_wr = '1') or (reset = '1'))
        report "error: wr imp do_wr" severity warning;
    end process;

3.6.7 VHDL Coding Tips

Vector Type Declaration

    type data_array_ty is array(natural range <>) of data;
    signal data_array : data_array_ty(7 downto 0);

Functions

    function to_idx
      (i : natural range data_array'low to data_array'high)
      return idx_ty
    is
    begin
      return to_unsigned(i, idx_ty'length);
    end to_idx;

Conversion to Index

    Without Function                  With Function
    rd_idx <= to_unsigned(5, 3);      rd_idx <= to_idx(5);

The function code is verbose, but is very maintainable, because neither the function itself nor uses of the function need to know the width of the index vector.

Attributes

    function inc_idx (idx : idx_ty) return idx_ty is
    begin
      if idx < data_array'high then
        return (idx + 1);
      else
        return (to_idx(data_array'low));
      end if;
    end inc_idx;

Feedback Loops, and Functions

Coding guideline: use functions. Don't use procedures.

    inc as fun                        inc as proc
    wr_idx <= inc_idx(wr_idx);        inc_idx(wr_idx);

Functions clearly distinguish between reading from a signal and writing to a signal. By examining the use of a procedure, you cannot tell which signals are read from and which are written to. You must examine the declaration or implementation of the procedure to determine the modes of signals.

Modifying a signal within a procedure results in a tri-state signal. This is bad.

File I/O (textio package)

TEXTIO defines read, write, readline, writeline functions.
Described in: http://www.eng.auburn.edu/department/ee/mgc/vhdl.html#textio

These functions can be used to read test vectors from a file and write results to a file.

3.6.8 Queue Specification

Most bugs in queues are related to the queue becoming full, becoming empty, and/or wrap of indices.

The specification should be obviously correct. Avoid bugs in the specification by making the specification queue larger than the maximum number of writes that we will do in the test suite. Thus, the specification queue will never become full or wrap. However, the implementation queue will become full and wrap.

Write Index Update in Specification

We increment the write index on every write; we never wrap.

    process (clk)
    begin
      if rising_edge(clk) then
        if (reset = '1') then
          wr_idx <= 0;
        elsif (do_wr = '1') then
          wr_idx <= wr_idx + 1;
        end if;
      end if;
    end process;

Things to Notice

Things to notice in queue specification:

1. don't care conditions ('-')
2. uninitialized data (hint: what is the value of rd_data when we do more reads than writes?)

Don't Care

    rd_data <= data_array(rd_idx) when (do_rd = '1')
               else (others => '-');

3.6.9 Queue Testbench

Things to notice in queue testbench:

1. running multiple test sequences
2. uninitialized data 'U'
3. std_match to compare spec and impl data

With =, a value equals only itself: '0' = '0', '1' = '1', and everything else is a mismatch. With std_match:

- '0' matches '0' and 'L'
- '1' matches '1' and 'H'
- '-' matches everything, and everything matches '-'

With equality, '-' /= '1', but we want to use '-' to mean "don't care" in the specification. The solution is to use std_match, rather than =, to check implementation signals against the specification.

Stimulus Process Structure

The stimulus process runs multiple test vectors in a single simulation run.

    stimulus : process
      type test_datum_ty is record
        r_reset, ... normal fields ...
      end record;
      type test_vectors_ty is array(natural range <>) of test_datum_ty;
      constant test_vectors : test_vectors_ty :=
        ( -- reset ...
          ( '1', normal fields),
          ( '0', normal fields),
          ...
          -- wr_idx passes rd_idx (overwrite entries)
          -- reset ...
          ( '1', normal fields),
          ( '0', normal fields),
          ...
        );
    begin
      for i in test_vectors'range loop
        if (test_vectors(i).r_reset = '1') then
          ... reset code ...
        end if;
        reset <= '0';
        ... normal sequence ...
        wait until rising_edge(clk);
      end loop;
    end process;

After reset is asserted, set signals to 'U'.

3.7 Functional Verification Problems

P3.1 Carry Save Adder

1. Functionality: Briefly describe the functionality of a carry-save adder.
2. Testbench: Write a testbench for a 16-bit combinational carry save adder.
3. Testbench Maintenance: Modify your testbench so that it is easy to change the width of the adder and the latency of the computation.
   NOTES:
   (a) You do not need to support pipelined adders.
   (b) VHDL generics might be useful.

P3.2 Traffic Light Controller

P3.2.1 Functionality

Briefly describe the functionality of a traffic-light controller that has sensors to detect the presence of cars.

P3.2.2 Boundary Conditions

Make a list of boundary conditions to check for your traffic light controller.

P3.2.3 Assertions

Make a list of assertions to check for your traffic light controller.

P3.3 State Machines and Verification

P3.3.1 Three Different State Machines

[Figure 3.10: A very simple machine]
[Figure 3.11: A very big machine]
[Figure 3.12: A concurrent machine]
[Figure 3.13: Legend (edge labels are input/output; * = don't care)]

Answer each of the following questions for the three state machines in figures 3.10 to 3.12.

Number of Test Scenarios: How many test scenarios (sequences of test vectors) would you need to fully validate the behaviour of the state machine?
Length of Test Scenario: What is the maximum length (number of test vectors) in a test scenario for the state machine?

Number of Flip Flops: Assuming that neither the inputs nor the outputs are registered, what is the minimum number of flip-flops needed to implement the state machine?

P3.3.2 State Machines in General

If a circuit has i signals of 1-bit each that are inputs, f 1-bit signals that are outputs of flip-flops, and c 1-bit signals that are the outputs of combinational circuitry, what is the maximum number of states that the circuit can have?

P3.4 Test Plan Creation

You're on the functional verification team for a chip that will control a simple portable CD player. Your task is to create a plan for the functional verification of the signals in the entity cd_digital.

You've been told that the player behaves "just like all of the other CD players out there". If your test plan requires knowledge about any potential non-standard features or behaviour, you'll need to document your assumptions.

[Figure: CD player front panel with a display (track, min, sec) and buttons (prev, stop, play, next, pwr)]

    entity cd_digital is
      port (
        ----------------------------------------------------
        -- buttons
        -- (next and open are reserved words in VHDL; rename them,
        --  e.g. next_btn and door_open, before compiling)
        prev, stop, play, next, pwr : in std_logic;
        ----------------------------------------------------
        -- detect if player door is open
        open : in std_logic;
        ----------------------------------------------------
        -- output display information
        track : out std_logic_vector(3 downto 0);
        min   : out unsigned(6 downto 0);
        sec   : out unsigned(5 downto 0)
      );
    end cd_digital;

P3.4.1 Early Tests

Describe five tests that you would run as soon as the VHDL code is simulatable. For each test, describe your specification, stimulus, and check. Summarize why your collection of tests should be the first tests that are run.

P3.4.2 Corner Cases

Describe five corner cases or boundary conditions, and explain the role of corner cases and boundary conditions in functional verification.

NOTES:

1. 
You may reference your answer for problem P3.4.1 in this question.
2. If you do not know what a corner case or boundary condition is, you may earn partial credit by checking this box and explaining five things that you would do in functional verification.

P3.5 Sketches of Problems

1. Given a circuit, VHDL code, or circuit size info, calculate the simulation run time needed to achieve n% coverage.
2. Given a fragment of VHDL code, list things to do to make it more robust, e.g. illegal data and states go to the initial state.
3. Smith Problem 13.29

Chapter 4 Performance Analysis and Optimization

4.1 Introduction

Hennessy and Patterson's Computer Architecture: A Quantitative Approach (textbook for E&CE 429) has good information on performance. We will use some of the same definitions and formulas as Hennessy and Patterson, but we will move away from generic definitions of performance for computer systems and focus on performance for digital circuits.

4.2 Defining Performance

$\text{Performance} = \dfrac{\text{Work}}{\text{Time}}$

You can double your performance by:

- doing twice the work in the same amount of time, OR
- doing the same amount of work in half the time

Benchmarking

Measuring time is easy, but how do we accurately measure work? The game of "benchmarketing" is finding a definition of work that makes your system appear to get the most work done in the least amount of time.

    Measure of Work      Measure of Performance
    clock cycle          MHz
    instruction          MIPs
    synthetic program    Whetstone, Dhrystone, D-MIPs (Dhrystone MIPs)
    real program         SPEC
    travel 1/4 mile      drag race

The SPEC benchmarks are among the most respected and accurate predictors of real-world performance.
Definition SPEC: Standard Performance Evaluation Corporation. MISSION: "To establish, maintain, and endorse a standardized set of relevant benchmarks and metrics for performance evaluation of modern computer systems" (http://www.spec.org).

The SPEC organization has different benchmarks for integer software, floating-point software, web-serving software, etc.

4.3 Comparing Performance

4.3.1 General Equations

Equation for "Big is n% greater than Small":

$n\% = \dfrac{\text{Big} - \text{Small}}{\text{Small}}$

For the above equation, it can be difficult to remember whether the denominator is the larger number or the smaller number. To see why Small is the only sensible choice, consider the situation where a is 100% greater than b. This means that the difference between a and b is 100% of something. Our only variables are a and b. It would be nonsensical for the difference to be a, because that would mean $a - b = a$. However, if $a - b = b$, then for a to be 100% greater than b simply means that $a = 2b$.

Using the "n% greater" formula, the phrase "The performance of A is n% greater than the performance of B" is:

$n\% = \dfrac{\text{Performance}_A - \text{Performance}_B}{\text{Performance}_B}$

Performance is inversely proportional to time:

$\text{Performance} \propto \dfrac{1}{\text{Time}}$

Substituting the above equation into the equation for "the performance of A is n% greater than the performance of B" gives:

$n\% = \dfrac{\text{Time}_B - \text{Time}_A}{\text{Time}_A}$

In general, the equation for a fast system to be n% faster than a slow system is:

$n\% = \dfrac{T_{Slow} - T_{Fast}}{T_{Fast}}$

Another useful formula is the average time to do one of k different tasks, each of which happens $\%_i$ of the time and takes an amount of time $T_i$ each time it is done:

$T_{Avg} = \sum_{i=1}^{k} \%_i \, T_i$

We can measure the performance of practically anything (cars, computers, vacuum cleaners, printers....)

4.3.2 Example: Performance of Printers

                Black and White    Colour
    printer1    9 ppm              6 ppm
    printer2    12 ppm             4 ppm

Question: Which printer is faster at B&W and how much faster is it?

Answer:

BW Performance
$n\%\ \text{faster} = \dfrac{T_{Slow} - T_{Fast}}{T_{Fast}}$

$BW_1 = \dfrac{1}{9\ \text{ppm}} = 0.1111\ \text{min/page}$

$BW_2 = \dfrac{1}{12\ \text{ppm}} = 0.0833\ \text{min/page}$

$BW_{Faster} = \dfrac{BW_1 - BW_2}{BW_2} = \dfrac{0.1111 - 0.0833}{0.0833} = 33\%\ \text{faster}$

Performance for Different Tasks

Question: If the average workload is 90% BW and 10% Colour, which printer is faster and how much faster is it?

Answer:

$T_{Avg1} = \%_{BW} \, BW_1 + \%_C \, C_1 = 0.90 \times 0.1111 + 0.10 \times 0.1667 = 0.1167\ \text{min/page}$

$T_{Avg2} = \%_{BW} \, BW_2 + \%_C \, C_2 = 0.90 \times 0.0833 + 0.10 \times 0.2500 = 0.1000\ \text{min/page}$

$Avg_{Faster} = \dfrac{T_{Avg1} - T_{Avg2}}{T_{Avg2}} = \dfrac{0.1167 - 0.1000}{0.1000} = 16.7\%\ \text{faster}$

Optimizing Performance

Question: If we want to optimize printer1 to match the performance of printer2, should we optimize BW or Colour printing?

Answer: Colour printing is slower, so it appears that we can save more time by optimizing colour printing. However, look at the extreme case of optimizing colour printing to be instantaneous for P1:

[Figure: bar chart of average time per page for P1 and P2, vertical axis from 0.000 to 0.150 min/page]

Even if we made colour printing instantaneous for printer 1 and kept it the same for printer 2, printer 1 would not be measurably faster.

Amdahl's law: "Make the common case fast". Optimizations need to take into account both run time and frequency of occurrence.

We should optimize black and white printing.

Question: If you have to fire all of the engineers because your stock price plummeted, how can you get printer1 to be faster than printer2?

Note: This question was actually humorous during the high-tech bubble of 2000...

Answer: Hire more marketing people! Notice that colour printing on printer 1 is faster than on printer 2. So, marketing suggests that people are increasing the percentage of printing that is done in colour.
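To see how a shift in workload alone can reverse the ranking, here is a worked check using the page times computed above. The 50% colour mix is an illustrative assumption, not a figure from the notes:

```latex
\begin{align*}
T_{Avg1} &= 0.50 \times 0.1111 + 0.50 \times 0.1667 = 0.1389 \text{ min/page} \\
T_{Avg2} &= 0.50 \times 0.0833 + 0.50 \times 0.2500 = 0.1667 \text{ min/page} \\
n\% &= \frac{T_{Avg2} - T_{Avg1}}{T_{Avg1}}
     = \frac{0.1667 - 0.1389}{0.1389} \approx 20\%
\end{align*}
```

Under this assumed mix, printer1 would come out roughly 20% faster than printer2 without any change to the hardware.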
Question: Revised question: what percentage of printing must be done in colour for printer1 to beat printer2?

Answer: Set the two average times equal and solve for $\%_C$, using $\%_{BW} = 1 - \%_C$:

$T_{Avg1} = \%_{BW} \, BW_1 + \%_C \, C_1$

$T_{Avg2} = \%_{BW} \, BW_2 + \%_C \, C_2$

$(1 - \%_C) \, BW_1 + \%_C \, C_1 = (1 - \%_C) \, BW_2 + \%_C \, C_2$

$BW_1 - BW_2 = \%_C \, (BW_1 - BW_2 + C_2 - C_1)$

$\%_C = \dfrac{BW_1 - BW_2}{BW_1 - BW_2 + C_2 - C_1} = \dfrac{0.1111 - 0.0833}{0.1111 - 0.0833 + 0.2500 - 0.1667} = 0.25$

4.4 Clock Speed, CPI, Program Length, and Performance

4.4.1 Mathematics

    CPI          Cycles per instruction
    NumInsts     Number of instructions
    ClockSpeed   Clock speed
    ClockPeriod  Clock period

$\text{Time} = \text{NumInsts} \times \text{CPI} \times \text{ClockPeriod}$

$\text{Time} = \dfrac{\text{NumInsts} \times \text{CPI}}{\text{ClockSpeed}}$

4.4.2 Example: CISC vs RISC and CPI

                      Clock Speed    SPECint
    AMD Athlon        1.1 GHz        409
    Fujitsu SPARC64   675 MHz        443

The AMD Athlon is a CISC microprocessor (it uses the IA-32 instruction set). The Fujitsu SPARC64 is a RISC microprocessor (it uses Sun's Sparc instruction set). Assume that it requires 20% more instructions to write a program in the Sparc instruction set than the same program requires in IA-32.

Question: Which of the two processors has higher performance?

Answer: SPECint, SPECfp, and SPEC are measures of performance. Therefore, the higher the SPEC number, the higher the performance. The Fujitsu SPARC64 has higher performance.

Question: What is the ratio between the CPIs of the two microprocessors?

Answer: We will use A as the subscript for the Athlon and S as the subscript for the Sparc.

$\text{Time} = \dfrac{\text{NumInsts} \times \text{CPI}}{\text{ClockSpeed}}$

$\text{CPI} = \dfrac{\text{Time} \times \text{ClockSpeed}}{\text{NumInsts}} = \dfrac{\text{ClockSpeed}}{\text{Perf} \times \text{NumInsts}}$

$\dfrac{\text{CPI}_A}{\text{CPI}_S} = \dfrac{\text{ClockSpeed}_A}{\text{Perf}_A \times \text{NumInsts}_A} \times \dfrac{\text{Perf}_S \times \text{NumInsts}_S}{\text{ClockSpeed}_S} = \dfrac{1.1}{0.675} \times \dfrac{443}{409} \times \dfrac{1.20 \, \text{NumInsts}_A}{\text{NumInsts}_A} = 2.1$

Executing the average Athlon instruction requires about 2.1 times as many clock cycles as executing the average Sparc instruction (that is, about 110% more by the n% formula of section 4.3.1).

Question: Can you determine the absolute (actual) CPI of either microprocessor?
Answer: To determine the absolute CPI, we would need to know the actual number of instructions executed by at least one of the processors.

4.4.3 Effect of Instruction Set on Performance

Your group designs a microprocessor and you are considering adding a fused multiply-accumulate to the instruction set. (A fused multiply-accumulate is a single instruction that does both a multiply and an addition. It is often used in digital signal processing.)

Your studies have shown that, on average, half of the multiply operations are followed by an add instruction that could be done with a fused multiply-add.

Additionally, you know:

             cpi              %
    ADD      0.8 CPIavg       15%
    MUL      1.2 CPIavg       5%
    Other    1.0 CPIavg       80%

You have three options:

option 1: no change
option 2: add the MAC instruction, increase the clock period by 20%, and MAC has the same CPI as MUL.
option 3: add the MAC instruction, keep the clock period the same, and the CPI of a MAC is 50% greater than that of a multiply.

Question: Which option will result in the highest overall performance?

Answer:

$\text{Time} = \dfrac{\text{NumInsts} \times \text{CPI}}{\text{ClockSpeed}}$

$\text{Perf} = \dfrac{\text{ClockSpeed}}{\text{NumInsts} \times \text{CPI}}$

We need to find NumInsts, CPI, and ClockSpeed for each of the three options. Option 1 is the baseline, so we will define values for variables in Options 2 and 3 in terms of the Option 1 variables.

Options 2 and 3 will have the same number of instructions. Half of the multiply instructions are followed by an add that can be fused.

In questions that involve changing both CPI and NumInsts, it is often easiest to work with the product of CPI and NumInsts, which represents the total number of clock cycles needed to execute the program. Additionally, set the problem up with an imaginary program of 100 instructions on the baseline system.

$\text{NumMAC}_2 = 0.5 \times \text{NumMUL}_1 = 0.5 \times 5 = 2.5$

$\text{NumMUL}_2 = 0.5 \times \text{NumMUL}_1 = 0.5 \times 5 = 2.5$

$\text{NumADD}_2 = \text{NumADD}_1 - 0.5 \times \text{NumMUL}_1 = 15 - 0.5 \times 5 = 12.5$

Find the total number of clock cycles for each option.
$\text{Cycles}_1 = \text{NumMUL}_1 \, \text{CPI}_{MUL} + \text{NumADD}_1 \, \text{CPI}_{ADD} + \text{NumOth}_1 \, \text{CPI}_{Oth} = 5 \times 1.2 + 15 \times 0.8 + 80 \times 1.0 = 98$

$\text{Cycles}_2 = \text{NumMAC}_2 \, \text{CPI}_{MAC} + \text{NumMUL}_2 \, \text{CPI}_{MUL} + \text{NumADD}_2 \, \text{CPI}_{ADD} + \text{NumOth}_2 \, \text{CPI}_{Oth} = 2.5 \times 1.2 + 2.5 \times 1.2 + 12.5 \times 0.8 + 80 \times 1.0 = 96$

$\text{Cycles}_3 = \text{NumMAC}_3 \, \text{CPI}_{MAC} + \text{NumMUL}_3 \, \text{CPI}_{MUL} + \text{NumADD}_3 \, \text{CPI}_{ADD} + \text{NumOth}_3 \, \text{CPI}_{Oth} = 2.5 \times (1.5 \times 1.2) + 2.5 \times 1.2 + 12.5 \times 0.8 + 80 \times 1.0 = 97.5$

Calculate performance for each option using the formula:

$\text{Performance} = \dfrac{1}{\text{Cycles} \times \text{ClockPeriod}}$

$\text{Performance}_1 = \dfrac{1}{98 \times 1} = \dfrac{1}{98}$

$\text{Performance}_2 = \dfrac{1}{96 \times 1.2} = \dfrac{1}{115}$

$\text{Performance}_3 = \dfrac{1}{97.5 \times 1} = \dfrac{1}{97.5}$

The third option is the fastest.

4.4.4 Effect of Time to Market on Relative Performance

Assume that the performance of the average product in your market segment doubles every 18 months. You are considering an optimization that will improve the performance of your product by 7%.

Question: If you add the optimization, how much can you allow your schedule to slip before the delay hurts your relative performance compared to not doing the optimization and launching the product according to your current schedule?

Answer: Performance at time t (in months):

$P(t) = P_0 \times 2^{t/18}$

From the problem statement:

$P(t) = 1.07 \times P_0$

Equate the two equations for $P(t)$, then solve for t, using $\log_b x = \dfrac{\log x}{\log b}$:

$P_0 \times 2^{t/18} = 1.07 \times P_0$

$2^{t/18} = 1.07$

$t = 18 \times \log_2 1.07 = 18 \times \dfrac{\log 1.07}{\log 2} = 1.76\ \text{months}$

4.4.5 Summary of Equations

Time to perform a task:

$\text{Time} = \dfrac{\text{NumInsts} \times \text{CPI}}{\text{ClockSpeed}}$

Average time to do one of k different tasks:

$T_{Avg} = \sum_{i=1}^{k} \%_i \, T_i$

Performance:

$\text{Performance} = \dfrac{\text{Work}}{\text{Time}}$

Speedup:

$\text{Speedup} = \dfrac{T_{Slow}}{T_{Fast}}$

$T_{Fast}$ is n% faster than $T_{Slow}$:

$n\%\ \text{faster} = \dfrac{T_{Slow} - T_{Fast}}{T_{Fast}}$

Performance at time t if performance increases by a factor of k every n units of time:

$\text{Perf}(t) = \text{Perf}(0) \times k^{t/n}$

4.5 Performance Analysis and Dataflow Diagrams

4.5.1 Dataflow Diagrams, CPI, and Clock Speed

One of the challenges in designing a circuit is to choose the clock speed. Increasing the clock speed of a circuit might not improve its performance.
In this section we will work through several example dataflow diagrams to pick a clock speed for the circuit and schedule operations into clock cycles.

When partitioning dataflow diagrams into clock cycles, we need to choose a clock period. Choosing a clock period affects many aspects of the design, not just the overall performance. Different design goals might put conflicting pressure on the clock period: some goals will tend toward short clock periods and some goals will tend toward long clock periods. For performance, not only is clock period a poor indicator of the relative performance of two different systems, even for the same system decreasing the clock period might not increase the performance.

    Goal: Minimize area
    Action: decrease clock period
    Effect: fewer operations per clock cycle, so fewer datapath components and more opportunities to reuse hardware

    Goal: Increase scheduling flexibility
    Action: increase clock period
    Effect: more flexibility in grouping operations in clock cycles

    Goal: Decrease percentage of clock cycle spent in flops (overhead: time in flops is not doing useful work)
    Action: increase clock period
    Effect: decreases number of flops that data traverses through

    Goal: Decrease time to execute an instruction
    Action: ????
    Effect: depends on dataflow diagram

Our general plan to find the clock period for maximum performance is:

1. Pick the clock period to be the delay through the slowest component + the delay through a flop.
2. For each instruction, for each operation, schedule the operation in the earliest clock cycle possible without violating clock-period timing constraints.
3. Calculate the average time to execute an instruction. Combine $\text{Time} = \dfrac{\text{NumInsts} \times \text{CPI}}{\text{ClockSpeed}}$ and $\text{CPI}_{avg} = \sum_{i=1}^{k} \%_i \, \text{CPI}_i$ to derive $\text{Time} = \dfrac{\text{NumInsts} \times \sum_{i=1}^{k} \%_i \, \text{CPI}_i}{\text{ClockSpeed}}$.
4. If the maximum latency through the dataflow diagram is greater than 1, then increase the clock period by the minimum amount needed to decrease the latency by one clock period and return to Step 2.
5. 
If the maximum latency through the dataflow diagram is 1, then the clock period for highest performance is the clock period resulting in the fastest Time.
6. If possible, adjust the schedule of operations to reduce the maximum number of occurrences of a component per instruction per clock cycle without increasing latency for any instruction.

4.5.2 Examples of Dataflow Diagrams for Two Instructions

The circuit supports two instructions, A and B (e.g. multiply and divide). At any point in time, the circuit is doing either A or B; it does not need to support doing A and B simultaneously. The diagrams below show the flow for each instruction and the delay through the components (f, g, h, i) that the instructions use. The delay through a register is 5 ns. Each operation (A and B) occurs 50% of the time.

Our goal is to find a clock period and dataflow diagram for the circuit that will give us the highest overall performance.

    Instruction A: f (30 ns) -> g (50 ns) -> h (20 ns) -> g (50 ns)
    Instruction B: i (40 ns) -> g (50 ns)

4.5.2.1 Scheduling of Operations for Different Clock Periods

[Figure: schedules of Instr A and Instr B into clock cycles for clock periods of 55 ns, 75 ns, 85 ns, 95 ns, and 155 ns]

4.5.2.2 Performance Computation for Different Clock Periods

Question: Which clock speed will result in the highest overall performance?

Answer:

    Clock Period   CPI_A   CPI_B   Tavg
    55 ns          4       2       55 x (0.5 x 4 + 0.5 x 2)  = 165 ns
    75 ns          3       2       75 x (0.5 x 3 + 0.5 x 2)  = 187.5 ns
    85 ns          2       2       85 x (0.5 x 2 + 0.5 x 2)  = 170 ns
    95 ns          2       1       95 x (0.5 x 2 + 0.5 x 1)  = 143 ns
    155 ns         1       1       155 x (0.5 x 1 + 0.5 x 1) = 155 ns

4.5.2.3 Example: Two Instructions Taking Similar Time

Question: For the flow below, which clock speed will result in the highest overall performance?

    A: 30 ns -> 50 ns -> 20 ns -> 50 ns
    B: 40 ns -> 50 ns -> 40 ns

Answer:

[Figure: schedules of A (f, g, h, g) and B (i, g, i) into clock cycles for clock periods of 55 ns, 75 ns, 85 ns, 95 ns, 105 ns, 135 ns, and 155 ns. Should skip 105 ns, because it has the same latency as 95 ns.]

    Clock Period   CPI_A   CPI_B   Tavg
    55 ns          4       3       193
    75 ns          3       3       225
    85 ns          2       3       213
    95 ns          2       2       190
    105 ns         2       2       NO GAIN
    135 ns         2       1       203
    155 ns         1       1       155

A clock period of 155 ns results in the highest performance.

For a clock period of 105 ns, we did not calculate the performance, because we could see that it would be worse than the performance with a clock period of 95 ns. The dataflow diagram with a 105 ns clock period has the same latency as the diagram with a clock period of 95 ns. If the dataflow diagram with the longer clock period has the same latency as the diagram with the shorter clock period, then the diagram with the longer clock period will have lower performance.

4.5.2.4 Example: Same Total Time, Different Order for A

Question: For the flow below, which clock speed will result in the highest overall performance?

    A: 30 ns -> 20 ns -> 50 ns -> 50 ns
    B: 40 ns -> 50 ns -> 40 ns

Answer:

    Clock Period   CPI_A   CPI_B   Tavg
    55 ns          3       3       165 ns
    95 ns          3       2       238 ns
    105 ns         2       2       210 ns
    135 ns         2       1       203 ns
    155 ns         1       1       155 ns

A clock period of 155 ns results in the lowest average execution time, and hence the highest performance. This is the same answer as the previous problem, but the total times for higher clock frequencies differ significantly between the two problems.

4.5.3 Example: From Algorithm to Optimized Dataflow

This question involves doing some of the design work for a circuit that implements InstP and InstQ using the components described below.

    Instruction   Algorithm                               Frequency of Occurrence
    InstP         (a x b) x ((a x b) + (b x d) + e)       75%
    InstQ         (i + j + k + l) x m                     25%

    Component       Delay
    2-input Mult    40 ns
    2-input Add     25 ns
    Register        5 ns

NOTES

- There is a resource limitation of a maximum of 3 input ports. (There are no other resource limitations.)
- You must put registers on your inputs; you do not need to register your outputs. The environment will directly connect your outputs (its inputs) to registers.
- Each input value (a, b, c, d, e, i, j, k, l, m) can be input only once. If you need to use a value in multiple clock cycles, you must store it in a register.

Question: What clock period will result in the best overall performance?

Answer:

Algorithm Answers (InstP)

[Figure: InstP data-dependency graph]
[Figure: InstP: common subexpression elimination, sharing a*b]
[Figure: InstP: alternative data-dependency graph, computing (b*d) + e before adding a*b]
[Figure: InstP: clock=50ns, lat=4, T=200]

Both options have a critical path of 2 mults + 2 adds. The first option allows three operations to be done with just three inputs (a, b, d).
The second option requires all four inputs to do three operations.

[Figures: dataflow diagrams for InstP. Panels: "InstP: clock=55ns, lat=3, T=165ns", "InstP: clock=70ns, lat=2, T=140ns", "InstP: dataflow diagram with alternative data-dep graph", and "InstP: illegal: 4 inputs"]

The dataflow diagram with the alternative data-dep graph adds a third clock cycle without any gain in clock speed. From the diagram, it is clear that it is better to put a*b in the first clock cycle and e in the second, because a*b can be done in parallel with b*d.

Fastest option for InstP is a 70ns clock, which gives a total execution time of 140 ns.

Algorithm Answers (InstQ)

[Figures: data-dependency graphs for InstQ. Panels: "InstQ: data-dep graph with max parallelism" and "InstQ: alternative data-dep graph"]

The alternative data-dep graph is able to do two operations with three inputs, while the first data-dep graph required four inputs to do two operations. We are limited to three inputs, so choose the alternative data-dep graph for the dataflow diagrams.

[Figures: dataflow diagrams for InstQ. Panels: "InstQ: clock=50ns, lat=4, T=200ns", "InstQ: clock=55ns, lat=3, T=165ns", "InstQ: clock=70ns, lat=2, T=140ns", "InstQ: irrelevant: lat did not decrease", and "InstQ: clock=120ns, lat=1, T=120ns" (a single-cycle schedule needs all five inputs at once, which exceeds the three-input-port limit)]

Fastest option for InstQ is a 70ns clock, which gives a total execution time of 140 ns.

Both InstP and InstQ need a 70ns clock period to maximize their performance. So, use a 70ns clock, which gives a latency of 2 clock cycles for both instructions.
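As a quick check of this choice, the weighted average execution time can be written out for the three legal clock periods, using the latencies from the diagrams above and the 75%/25% instruction mix. This only restates numbers already given in the example; the notation T_avg(clock) is introduced here for convenience.

```latex
\begin{align*}
T_{avg}(50\,\text{ns}) &= 0.75\,(4 \times 50) + 0.25\,(4 \times 50) = 200\ \text{ns} \\
T_{avg}(55\,\text{ns}) &= 0.75\,(3 \times 55) + 0.25\,(3 \times 55) = 165\ \text{ns} \\
T_{avg}(70\,\text{ns}) &= 0.75\,(2 \times 70) + 0.25\,(2 \times 70) = 140\ \text{ns}
\end{align*}
```

Because both instructions have the same latency at each clock period, the instruction mix does not change the ranking here; it would matter if the two latencies differed.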
Fastest execution time   140ns
Clock period             70ns

Question: Find a minimal set of resources that will achieve the performance you calculated.

Answer:

Final dataflow graphs for InstP and InstQ

[Figures: final dataflow graphs. Panels: "InstP: clock=70ns, lat=2, T=140" and "InstQ"]

Need to do only one of InstP and InstQ at any time, so simply take the max of each resource.

              InstP   InstQ   System
Inputs        3       3       3
Outputs       1       1       1
Registers     3       3       3
Adders        2       2       2
Multipliers   2       1       2

Question: Design the datapath and state machine for your design.

Answer:

[Figures: dataflow diagrams annotated with states S0 and S1, inputs i1 to i3, registers r1 to r3, multipliers m1 and m2, adders a1 and a2, and output o1. Panels: "InstP: clock=70ns, lat=2, T=140ns" and "InstQ: clock=70ns, lat=2, T=140ns"]

Control Tables

A dash marks a don't-care entry.

           r1        r2        r3        m1          m2          a1          a2
           ce  mux   ce  mux   ce  mux   src1 src2   src1 src2   src1 src2   src1 src2
InstP S0   1   i1    1   i2    1   i3    -    -      r3   a1     r1   r2     -    -
InstP S1   1   a2    1   i2    1   m1    r1   r2     r2   r3     -    -      m1   m2
InstQ S0   1   i1    1   i2    1   i3    -    -      a1   r3     r1   r2     -    -
InstQ S1   1   a2    1   i2    1   i3    -    -      -    -      r1   r2     a1   r3

Optimize Control Table

           r1    r2    r3    m1          m2          a1          a2
           mux   mux   mux   src1 src2   src1 src2   src1 src2   src1 src2
InstP S0   i1    i2    i3    r1   r2     a1   r3     r1   r2     m1   m2
InstP S1   a2    i2    m1    r1   r2     r2   r3     r1   r2     m1   m2
InstQ S0   i1    i2    i3    r1   r2     a1   r3     r1   r2     a1   r3
InstQ S1   a2    i2    i3    r1   r2     r2   r3     r1   r2     a1   r3

Write VHDL Code

Use the optimized control table as the basis for the VHDL code.
  -- r1: loads input i1 in S0 and adder result a2 in S1
  process (clk) begin
    if rising_edge(clk) then
      if state = S0 then
        r1 <= i1;
      else
        r1 <= a2;
      end if;
    end if;
  end process;

  -- r2: always loads input i2
  process (clk) begin
    if rising_edge(clk) then
      r2 <= i2;
    end if;
  end process;

  -- r3: loads m1 in InstP S1 (per the optimized control table), otherwise input i3
  process (clk) begin
    if rising_edge(clk) then
      if inst = instP and state = S1 then
        r3 <= m1;
      else
        r3 <= i3;
      end if;
    end if;
  end process;

  m1 <= r1 * r2;
  -- m2 src1 is a1 in S0 and r2 in S1 (per the optimized control table)
  m2_src1 <= a1 when state = S0 else r2;
  m2 <= m2_src1 * r3;
  a1 <= r1 + r2;

  -- a2 sources depend only on the instruction
  process (inst, m1, m2, a1, r3) begin
    if inst = instP then
      a2_src1 <= m1;
      a2_src2 <= m2;
    else
      a2_src1 <= a1;
      a2_src2 <= r3;
    end if;
  end process;
  a2 <= a2_src1 + a2_src2;

4.6 Performance Analysis and Optimization Problems

P4.1 Farmer

A farmer is trying to decide which of his two trucks to use to transport his apples from his orchard to the market.

Facts:

                                    big truck   small truck
capacity of truck                   12 tonnes   6 tonnes
speed when loaded with apples       15kph       30kph
speed when unloaded (no apples)     38kph       70kph

distance to market   120 km
amount of apples     85 tonnes

NOTES:
1. All of the loads of apples must be carried using the same truck.
2. Elapsed time is counted from beginning to deliver the first load to returning to the orchard after the last load.
3. Ignore time spent loading and unloading apples, coffee breaks, refueling, etc.
4. For each trip, a truck travels at either its fully loaded or empty speed.

Question: Which truck will take the least amount of time and what percentage faster will the truck be?

Question: In planning ahead for next year, is there anything the farmer could do to decrease his delivery time with little or no additional expense? If so, what is it? If not, explain.

P4.2 Network and Router

In this question there is a network that runs a protocol called BigLan. You are designing a router called the DataChopper that routes packets over the network running BigLan (i.e. they're BigLan packets).

The BigLan network protocol runs at a data rate of 160 Mbps (Mega bits per second).
Each BigLan packet contains 100 Bytes of routing information and 1000 Bytes of data.

You are working on the DataChopper router, which has the following performance numbers:

75MHz   clock speed
4       cycles for a byte of either data or header
500     number of additional clock cycles to process the routing information for a packet

P4.2.1 Maximum Throughput

Which has a higher maximum throughput (as measured in data bits per second; that is, only the payload bits count as useful work), the network or your router, and how much faster is it?

P4.2.2 Packet Size and Performance

Explain the effect of an increase in packet length on the performance of the DataChopper (as measured in the maximum number of bits per second that it can process) assuming the header remains constant at 100 bytes.

P4.3 Performance Short Answer

If performance doubles every two years, by what percentage does performance go up every month? This question is similar to compound growth from your economics class.

P4.4 Microprocessors

The Yme microprocessor is very small and inexpensive. One performance sacrifice the designers have made is to not include a multiply instruction. Multiplies must be written in software using loops of shifts and adds.

The Yme currently ships at a clock frequency of 200MHz and has an average CPI of 4.

A competitor sells the Y!v1 microprocessor, which supports exactly the same instructions as the Yme. The Y!v1 runs at 150MHz, and the average program is 10% faster on the Yme than it is on the Y!v1.

P4.4.1 Average CPI

Question: What is the average CPI for the Y!v1? If you don't have enough information to answer this question, explain what additional information you need and how you would use it.

A new version of the Y!, the Y!u2, has just been announced. The Y!u2 includes a multiply instruction and runs at 180MHz.
The Y!u2 publicity brochures claim that using their multiply instruction, rather than shift/add loops, can eliminate 10% of the instructions in the average program. The brochures also claim that the average performance of the Y!u2 is 30% better than that of the Y!v1.

P4.4.2 Why not you too?

Question: Assuming the advertising claims are true, what is the average CPI for the Y!u2? If you don't have enough information to answer this question, explain what additional information you need and how you would use it.

P4.4.3 Analysis

Question: Which of the following do you think is most likely and why?
1. the Y!u2 is basically the same as the Y!v1 except for the multiply
2. the Y!u2 designers made performance sacrifices in their design in order to include a multiply instruction
3. the Y!u2 designers performed other significant optimizations in addition to creating a multiply instruction

P4.5 Dataflow Diagram Optimization

Draw an optimized dataflow diagram that improves the performance and produces the same output values. Or, if the performance cannot be improved, describe the limiting factor on the performance.

NOTES:
  you may change the times when signals are read from the environment
  you may not increase the resource usage (input ports, registers, output ports, f components, g components)
  you may not increase the clock period

[Figures: dataflow diagram with inputs a, b, c, d, e and components f and g. Panels: "Before Optimization" and "After Optimization"]

P4.6 Performance Optimization with Memory Arrays

This question deals with the implementation and optimization for the algorithm and library of circuit components shown below.

Algorithm

  q = M[b];
  if (a > b) then
    M[a] = b;
    p = (M[b-1] * b) + M[b];
  else
    M[a] = b;
    p = M[b+1] * a;
  end;

Component                  Delay
Register                   5 ns
Adder                      25 ns
Subtracter                 30 ns
ALU with ..., AND, XOR     40 ns
Memory read                60 ns
Memory write               60 ns
Multiplication             65 ns
2:1 Multiplexor            5 ns

NOTES:
1. 25% of the time, a > b.
2. The inputs of the algorithm are a and b.
3. The outputs of the algorithm are p and q.
4. You must register both your inputs and outputs.
5. You may choose to read your input data values at any time and produce your outputs at any time. For your inputs, you may read each value only once (i.e. the environment will not send multiple copies of the same value).
6. Execution time is measured from when you read your first input until the latter of producing your last output or the completion of writing a result to memory.
7. M is an internal memory array, which must be implemented as dual-ported memory with one read/write port and one write port.
8. Assume all memory address and other arithmetic calculations are within the range of representable numbers (i.e. no overflows occur).
9. If you need a circuit not on the list above, assume that its delay is 30 ns.
10. Your dataflow diagram must include circuitry for computing a > b and using the result to choose the value for p.

Draw a dataflow diagram for each operation that is optimized for the fastest overall execution time.

NOTE: You may sacrifice area efficiency to achieve high performance, but marks will be deducted for extra hardware that does not contribute to performance.

P4.7 Multiply Instruction

You are part of the design team for a microprocessor implemented on an FPGA. You currently implement your multiply instruction completely on the FPGA. You are considering using a specialized multiply chip to do the multiplication. Your task is to evaluate the performance and optimality tradeoffs between keeping the multiply circuitry on the FPGA or using the external multiplier chip.

If you use the multiplier chip, it will reduce the CPI of the multiply instruction, but will not change the CPI of any other instruction. Using the multiplier chip will also force the FPGA to run at a slower clock speed.
                                   FPGA option   FPGA + MULT option
average CPI                        5             ???
% of instrs that are multiplies    10%           10%
CPI of multiply                    20            6
Clock speed                        200 MHz       160 MHz

P4.7.1 Highest Performance

Which option, FPGA or FPGA+MULT, gives the higher performance (as measured in MIPs), and what percentage faster is the higher-performance option?

P4.7.2 Performance Metrics

Explain whether MIPs is a good choice for the performance metric when making this decision.

Chapter 5
Optimization

5.1 Pipelining

5.1.1 Introduction to Pipelining

Execution of dataflow diagram (Review of section 2.6.3)

[Figure: sequential dataflow diagram that adds inputs a to f using registers r1 and r2 and a single adder add1 over clock cycles 0 to 5, producing output z; the waveform panel shows clk cycles 0 to 13]

Pipelined Execution

Pipelining is an optimization that increases performance by overlapping the execution of multiple parcels (instructions). The cost is an increase in area, because we cannot reuse datapath components, registers, inputs, or outputs.

[Figure: pipelined dataflow diagram that adds inputs a to f using registers r1 to r10 and adders add1 to add5 over clock cycles 0 to 5, producing output z; the waveform panel shows clk cycles 0 to 13]

Sequential (Unpipelined) Hardware

[Figure: state machine with reset and states State(0) to State(4); datapath with inputs i1 and i2, registers r1 and r2, adder add1, and output o1]

Pipelined Hardware

[Figure: datapath with inputs i1 to i6, registers r1 to r10, adders add1 to add5, and output o1]

Pipelined VHDL Code
  process begin
    wait until rising_edge(clk);
    r1  <= i1;
    r2  <= i2;
    r3  <= r1 + r2;
    r4  <= i3;
    r5  <= r3 + r4;
    r6  <= i4;
    r7  <= r5 + r6;
    r8  <= i5;
    r9  <= r7 + r8;
    r10 <= i6;
  end process;
  o1 <= r9 + r10;

Definition Bubble: When a pipe stage is empty (contains invalid data), it is said to contain a bubble.

Question: How do we know whether the output of the pipeline is a bubble or is valid data?

Answer: Add one register per stage to hold a valid bit. If valid=0, then the pipe stage contains a bubble.

5.1.2 Partially Pipelined

The previous section illustrated a fully pipelined circuit, which means that the circuit could accept a new parcel every clock cycle. Sometimes we want to sacrifice performance (throughput) in order to reduce area. We can do this by having a throughput that is less than one parcel per clock cycle and reusing some hardware.

[Figure: partially pipelined dataflow diagram that adds inputs a to f using registers r1 to r6 and adders add1 to add3 over clock cycles 0 to 5, producing output z; the waveform panel shows clk cycles 0 to 13]

Hardware for Partially Pipelined

[Figure: state machine with reset and states State(0) and State(1); datapath with inputs i1 and i2, registers r1 to r6, adders add1 to add3, and output o1]

5.1.3 Pipelined Version of InstP

This example is based on the InstP/InstQ circuit from section 4.5.3.

Dataflow Graph

[Figure: InstP dataflow graph with inputs a, b, d, e, three multipliers, and two adders]

Component      Delay
2-input Mult   40ns
2-input Add    25ns
Register       5ns

Behaviour of Unpipelined and Pipelined

[Figure: waveforms over 0 to 250 ns. Unpipelined: clk, input, output. Pipelined: clk, input, stage1, stage2, output]

Pipelined Hardware
[Figure: pipelined InstP hardware. Stage 1: inputs a, b, d in registers r1, r2, r3 with valid bit v1, multipliers m1 and m2, adder a1. Stage 2: registers r4, r5, r6 (input e) with valid bit v2, adder a2, multiplier m3]

  process begin
    wait until rising_edge(clk);
    -- stage 1 registers and valid bit
    r1 <= i1;
    r2 <= i2;
    r3 <= i3;
    v1 <= i_valid;
    -- stage 2 registers and valid bit
    r4 <= m1;
    r5 <= a1;
    r6 <= i4;
    v2 <= v1;
  end process;
  m1 <= r1 * r2;
  m2 <= r2 * r3;
  a1 <= m1 + m2;
  a2 <= r5 + r6;
  m3 <= r4 * a2;
  o_valid <= v2;

5.1.4 Pipelined Version of InstP/InstQ

The unpipelined version of this example is in section 4.5.3.

Dataflow Graph

[Figure: dataflow graph with inputs a, b, d, e, multipliers, and adders; waveforms over 0 to 300 ns showing clk, input, and output]

Behaviour of Unpiped and Pipelined

[Figure: waveforms over 0 to 300 ns showing clk, input, and output; datapath with registers r1, r2, r3, multipliers m1 and m2, and adders a1 and a2]

Chapter 6
Timing Analysis

6.1 Delays and Definitions

In this section we will look at the different timing parameters of circuits. Our focus will be on those parameters that limit the maximum clock speed at which a circuit will work correctly.

6.1.1 Background Definitions

Definition fanin: The fanin of a gate or signal x are all of the gates or signals y where an input of x is connected to an output of y.

Definition fanout: The fanout of a gate or signal x are all of the gates or signals y where an output of x is connected to an input of y.

Figure 6.1: Immediate Fanin of x
Figure 6.2: Immediate Fanout of x

Definition immediate fanin/fanout: The phrases immediate fanout and immediate fanin mean that there is a direct connection between the gates.

Figure 6.3: Transitive Fanin
Figure 6.4: Transitive Fanout

Definition transitive fanin/fanout: The phrases transitive fanout and transitive fanin mean that there is either a direct or indirect connection between the gates.
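To make these definitions concrete, here is a minimal sketch of a three-gate circuit. The entity name fan_example and the signal names a, b, c, x, y, z are hypothetical, chosen only for this illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical three-gate circuit used only to illustrate fanin and fanout.
entity fan_example is
  port ( a, b, c : in  std_logic;
         z       : out std_logic );
end entity fan_example;

architecture main of fan_example is
  signal x, y : std_logic;
begin
  y <= a and b;  -- y is driven by a and b
  x <= y or c;   -- immediate fanin of x is {y, c}
  z <= not x;    -- immediate fanout of x is {z}
  -- transitive fanin of x:  {y, c, a, b}  (a and b reach x through y)
  -- transitive fanout of a: {y, x, z}     (a reaches z through y and x)
end architecture main;
```

The immediate sets contain only directly connected signals, while the transitive sets follow every path through intermediate gates.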
Note: Immediate vs Transitive fanin and fanout
Be careful to distinguish between immediate fan(in/out) and transitive fan(in/out). If fanin or fanout are not qualified with immediate or transitive, be sure to determine whether immediate or transitive is meant. In E&CE 427, fan(in/out) will mean immediate fan(in/out).

6.1.2 Clock-Related Timing Definitions

6.1.2.1 Clock Skew

[Figure: clock distribution to clk1, clk2, clk3, clk4 and waveforms showing the skew between them]

Definition Clock Skew: The difference in arrival times for the same clock edge at different flip-flops.

Clock skew is caused by the difference in interconnect delays to different points on the chip.

Clock tree design is critical in high-performance designs to minimize clock skew. Sophisticated synthesis tools put lots of effort into clock tree design, and the techniques for clock tree design still generate PhD theses.

6.1.2.2 Clock Latency

[Figure: master clock, intermediate clock, and final clock along the clock tree, with waveforms showing latency and jitter]

Definition Clock Latency: The difference in arrival times for the same clock edge at different levels of interconnect along the clock tree. (Intuitively, different points in the clock generation circuitry.)

Note: Clock latency
Clock latency does not affect the limit on the minimum clock period.

6.1.2.3 Clock Jitter

[Figure: waveforms of an ideal clock and a clock with jitter]

Definition Clock Jitter: Difference between actual clock period and ideal clock period.

Clock jitter is caused by:
  temperature and voltage variations over time
  temperature and voltage variations across different locations on a chip
  manufacturing variations between different parts
  etc.

6.1.3 Storage Related Timing Definitions

Storage devices (latches, flip-flops, memory arrays, etc) define setup, hold and clock-to-Q times.
Figure 6.5: Setup, hold, and clock-to-Q times for a flip flop

Setup and hold define the window in which input data must be held constant in order to guarantee that the storage device will store data correctly. Setup defines the beginning of the window. Hold defines the end of the window.

Both setup and hold are measured with respect to the clock edge at which the device transitions from load mode to store mode.

Setup is assumed to happen before the clock edge and hold is assumed to happen after the edge. If the end of the time window constraint occurs before the clock edge, then the hold constraint is negative.

Note: Require / Guarantee
Setup and hold times are requirements that the storage device imposes upon its environment. Clock-to-Q is a guarantee that the storage device provides its environment. If the environment satisfies the setup and hold times, then the storage device guarantees that it will satisfy the clock-to-Q time.

In this section, we will use the definitions of setup, hold and clock-to-Q. Section 6.2 will show how to calculate setup, hold, and clock-to-Q times for flip flops, latches, and other storage devices.

6.1.3.1 Setup Time

Definition Setup Time ($T_{SUD}$): Latest time before arrival of clock edge (flip flop), or deasserting of enable line (latch), that input data is required to be stable in order for the storage device to work correctly.

If setup time is violated, current input data will not be stored; input data from the previous clock cycle might remain stored.

6.1.3.2 Hold Time

Definition Hold Time ($T_{HO}$): Latest time after arrival of clock edge (flip flop), or deasserting of enable line (latch), that input data is required to remain stable in order for the storage device to work correctly.

If hold time is violated, current input data will not be stored; input data from the next clock cycle might slip through and be stored.
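The setup and hold definitions together describe a window around the clock edge. The short calculation below illustrates that window for a flip flop. The edge time and the two timing values are hypothetical numbers picked for the example, not figures from the notes.

```latex
\begin{align*}
&\text{Assume: } t_{edge} = 10\ \text{ns}, \quad T_{SUD} = 2\ \text{ns}, \quad T_{HO} = 1\ \text{ns} \\
&\text{Data must be stable for } t \in [\, t_{edge} - T_{SUD},\ t_{edge} + T_{HO} \,] = [\, 8\ \text{ns},\ 11\ \text{ns} \,] \\
&\text{Window width} = T_{SUD} + T_{HO} = 3\ \text{ns}
\end{align*}
```

With these values, an input that changes at 9 ns violates setup and an input that changes at 10.5 ns violates hold. If the hold time were negative, the window would close before the clock edge.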
6.1.3.3 Clock-to-Q Time

Definition Clock-to-Q Time ($T_{CO}$): Earliest time after arrival of clock edge (flip flop), or asserting of enable line (latch), when output data is guaranteed to be stable.

6.1.4 Propagation Delays

Propagation delay is the time it takes a signal to travel from the source (driving) flop to the destination flop. The two factors that contribute to propagation delay are the load of the combinational gates between the flops and the delay along the interconnect (wires) between the gates.

6.1.4.1 Load Delays

Load delay is proportional to load capacitance.

Timing of a simple inverter with a load:

[Figure: inverter schematic with input Vi and output Vo. Panels: "Input 1 -> 0: Charge output cap" and "Input 0 -> 1: Discharge output cap"]

Load capacitance is dependent on the fanout (how many other gates a gate drives) and how big the other gates are.

Section 6.4.1 goes into more detail on timing models and equations for load delay.

6.1.4.2 Interconnect Delays

Wires, also known as interconnect, have resistance, and there is a capacitance between parallel wires. Both of these factors increase delay.

  Wire resistance is dependent upon the material and geometry of the wire.
  Wire capacitance is dependent on wire geometry, geometry of neighboring wires, and materials.
  Shorter wires are faster.
  Fatter wires are faster.
  FPGAs have special routing resources for long wires.
  CMOS processes use higher metal layers for long wires; these layers have wires with much larger cross sections than lower levels of metal.

More on this in section 6.5.
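A rough first-order model shows why shorter and fatter wires are faster. This is a simplified sketch, not the model developed later in the notes. It treats the wire as a single lumped resistor and capacitor, and the symbols rho (resistivity), L (length), W (width), H (thickness), and c (capacitance per unit length) are assumptions introduced here.

```latex
\begin{align*}
R_{wire} &= \frac{\rho\, L}{W H} \\
C_{wire} &\approx c\, L \\
\text{delay} &\propto R_{wire}\, C_{wire} = \frac{\rho\, c\, L^{2}}{W H}
\end{align*}
```

Under these assumptions the delay grows with the square of the wire length, and a larger cross section (W times H) lowers the resistance. The sketch ignores the fact that c itself changes with wire width and with the spacing to neighboring wires.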
6.1.5 Summary of Delay Factors

Name           Symbol      Definition
Skew                       Difference in arrival times for different clock signals
Jitter                     Difference in clock period over time
Clock-to-Q     $T_{CO}$    Delay from clock signal to Q output of flop
Setup          $T_{SUD}$   Length of time prior to clock/enable that data must be stable
Hold           $T_{HO}$    Length of time after clock/enable that data must be stable
Load                       Delay due to load (fanout/consumers/readers)
Interconnect               Delay along wire

Table 6.1: Summary of delay factors

6.1.6 Timing Constraints

For a circuit to operate correctly, the clock period must be longer than the sum of the delays shown in table 6.1.

Definition Margin: The difference between the required value of a timing parameter and the actual value. A negative margin means that there is a timing violation. A margin of zero means that the timing parameter is just satisfied: changing the timing of the signals (which would affect the actual value of the parameter) could violate the timing parameter. A positive margin means that the constraint for the timing parameter is more than satisfied: the timing of the signals could be changed at least a little bit without violating the timing parameter.

Note: Margin is often called slack. Both terms are used commonly.

6.1.6.1 Minimum Clock Period

[Figure: two flops clocked by clk1 and clk2 with signals a and b. The waveform legend distinguishes "signal is stable", "signal may change", "signal may rise", and "signal may fall". The clock period is divided into skew, jitter, clock-to-Q, propagation (interconnect + load), setup, and slack]

$\mathit{ClockPeriod} \geq \mathit{Skew} + \mathit{Jitter} + T_{CO} + \mathit{Interconnect} + \mathit{Load} + T_{SUD}$

Note: The minimum clock period is independent of hold time.

6.1.6.2 Hold Constraint

[Figure: waveforms of clk1, clk2, a, and b showing skew, jitter, hold, clock-to-Q, propagation, and slack]

$\mathit{Skew} + \mathit{Jitter} + T_{HO} \leq T_{CO} + \mathit{Interconnect} + \mathit{Load}$

6.1.6.3 Example Timing Violations

The figures below illustrate correct timing behaviour of a circuit and then two types of violations: setup violation and hold violation. In the figures, the black rectangles identify the point where the violation happens.

[Figure: circuit with signals a, b, c, d and clock clk]

Figure 6.6: Good Timing (waveforms of a, clk, b, c, d with Clock-to-Q, Prop, Setup, and Hold marked)

Figure 6.7: Setup Violation (waveforms of a, clk, b, c, d with Clock-to-Q, Prop, and Setup marked; the stored value is unknown)

Figure 6.8: Hold Violation (waveforms of a, clk, b, c, d with Clock-to-Q, Prop, and Hold marked; the stored value is unknown)

6.2 Timing Analysis of Latches and Flip Flops

In this section, we show how to find the clock-to-Q, setup, and hold times for latches, flip-flops, and other storage elements.

6.2.1 Review: Latch, Flip-Flop, Setup, Hold, Clock-to-Q

[Figures: waveforms of clk, d, q. Panels: "Latch Behaviour" and "Flop Behaviour"]

Review: Timing Parameters

Setup: Time before arrival of clock edge (flip flop), or deasserting of enable line (latch), that input data is required to start being stable.

Hold: Time after arrival of clock edge (flip flop), or deasserting of enable line (latch), that input data is required to remain stable.

Clock-to-Q: Time after arrival of clock edge (flip flop), or asserting of enable line (latch), when output data is guaranteed to start being stable.

6.2.2 Simple Multiplexer Latch

We begin our study of timing analysis for storage devices with a simple latch built from an inverter ring and multiplexer. There are many better ways to build latches, primarily by doing the design at the transistor level. However, the simplicity of this design makes it ideal for illustrating timing analysis.

6.2.2.1 Structure and Behaviour of Multiplexer Latch

Two modes for storage devices:
  loading data
    loads input data into storage circuitry
    input data passes through to output
  using stored data
    input signal is disconnected from output
    storage circuitry drives output

[Figures: multiplexer latch with clock clk, input i, and output o. Panels: "Schematic", "Loading / pass-through mode", and "Storage mode"]

Unfold Multiplexer to Simple Gates
[Figures: "Multiplexer: symbol and implementation" (select s, inputs a and b, output o) and "Latch implementation" (inputs d and clk, output o)]

Note: inverters on clk
Both of the inverters on the clk signal are needed. Together, they prevent a glitch on the OR gate when clk is deasserted. If there was only one inverter, a glitch would occur. For more on this, see section 6.2.2.6.

[Figures: gate-level values in the latch. Panels: "Loading 0" (d=0, clk=1), "Loading 1" (d=1, clk=1), "Storing 0" (clk=0, o=0), and "Storing 1" (clk=0, o=1)]

6.2.2.2 Strategy for Timing Analysis of Storage Devices

The key to calculating setup and hold times of a latch, flop, etc is to identify:
1. how the data is stored when not connected to the input (often a pair of inverters in a loop)
2. the gate(s) that the clock uses to cause the stored data to drive the output (often a transmission gate or multiplexor)
3. the gate(s) that the clock uses to cause the input to drive the output (often a transmission gate or multiplexor)

[Figures: latch with clk=0 (storage mode) and with clk=1 (load mode)]

Note: Clock-to-Q for latches
For latches, clock-to-Q times are measured with respect to the clock edge that connects the data input to the output. For active-high latches, this is a rising edge.

Setup and hold timing constraints ensure that, when the storage device transitions from load mode to store mode, the input data is stored correctly in the storage device. Thus, the setup and hold timing constraints come into play when the storage device transitions from load mode to store mode.

Note: Setup and hold time for latches
For latches, hold time and setup time are measured with respect to the clock edge that disconnects the data input from the output. For active-high latches, this is a falling edge.

Hold time is concerned with the next data value sneaking in before the latch goes into storage mode. Setup time is concerned with the previous data value still being in the storage circuitry when the input is disconnected.

Note: Storage devices vs. Signals
We can talk about the setup and hold time of a signal or of a storage device. For a storage device, the setup and hold times are requirements that it imposes upon all environments in which it operates. For an individual signal in a circuit, there is a setup and hold time, which is the amount of time that the signal is stable before and after a clock edge.

6.2.2.3 Clock-to-Q Time of a Multiplexer Latch

Figure 6.9: Latch for Clock-to-Q analysis (signals d, clk, cn, c2, l1, l2, qn, q, s1, s2)

Figure 6.10: Waveforms of latch showing Clock-to-Q timing

Assume that the input is stable, and then the clock signal transitions to cause the circuit to move from storage mode to load mode.

Calculate the clock-to-Q time by finding the delay of the critical path from where the clock signal enters the storage circuit to where q exits the storage circuit.

The path is: clk -> cn -> c2 -> l2 -> qn -> q, which has a delay of 5 (assuming each gate has a delay of exactly one time unit).

6.2.2.4 Setup Timing of a Multiplexer Latch

Storage device transitions from load mode to store mode. Setup is the time that the input must be stable before the clock changes.

Figure 6.11: Latch for Setup Analysis (signals d, clk, cn, c2, l1, l2, qn, q, s1, s2)

Figure 6.12: Setup with margin: goal is to store …

Step-by-step animation of latch transitioning from load to store mode:

[Figures: gate-level snapshots of the latch. Panels:
  "Circuit is stable in load mode"
  "t=0: Clk transitions from load to store"
  "t=1: Clk transitions from load to store"
  "t=2: s1 propagates to s2, because cn turns on AND gate"
  "t=3: l2 is set to 0, because c2 turns off AND gate"
  "t=4: … from store path propagates to q"
  "t=5: … from store path completes cycle"]

The value on s1 at t=1 will propagate from the store loop to the output and back through the store loop.
At t=1, s1 must have the value that we want to store. Or, equivalently, the value to store must have saturated the store loop by t=1.

It takes 5 time units for a value on the input d to propagate to s1 (d -> l1 -> l2 -> qn -> q -> s1).

The setup time is the difference between the delay from d to s1 and the delay from clk to cn: 5 - 1 = 4, so the setup time for this latch is 4 time units.

Figure 6.13: Setup Violation (setup with negative margin)

Figure 6.14: Minimum Setup Time

When cn is asserted, … must be at s1. Otherwise, … will affect the storage circuitry when the data input is disconnected.

6.2.2.5 Hold Time of a Multiplexer Latch

Figure 6.15: Latch for Hold Analysis (signals d, clk, cn, c2, l1, l2, qn, q, s1, s2)

Figure 6.16: Hold OK: goal is to store … (hold + margin)

[Figures: gate-level snapshots of the latch. Panels:
  "Circuit is stable in load mode"
  "t=0: Clk transitions from load to store"
  "t=5: Clk transition propagates to cn"
  "t=6: Clk transition propagates to c2, l1 may change now without affecting storage device"
  "t=7: Clk transition propagates to l2"]

Figure 6.17: Animation of hold analysis

It takes 6 time units for a change on the clock signal to propagate to the input of the AND gate that controls the load path. It takes 1 time unit for a change on d to propagate to its input to this AND gate. The data input must remain stable for 6 - 1 = 5 time units after the clock transitions from load to store mode, or else the new data value will slip into the storage loop and corrupt the value that we are trying to store.

Figure 6.18: Hold violation: … slips through to q (hold with negative margin)

Figure 6.19: Minimum Hold Time

Can't let … affect l1 before c2 deasserts.
Hold time is the difference between the path from clk to c2 and the path from d to l1.

6.2.2.6 Example of a Bad Latch

This latch is very similar to the one from section 6.2.2.5; however, this one does not work correctly. The difference between this latch and the one from section 6.2.2.5 is the location of the inverter that determines whether l2 or s2 is enabled.

When the clock signal is deasserted, c2 turns off the AND gate l2 before the AND gate s2 turns on. In this interval when both l2 and s2 are turned off, a glitch is allowed to enter the feedback loop.

The glitch on the feedback loop is independent of the timing of the signals d and clk.

[Figures: bad latch schematic (signals d, clk, c2, cn, l1, l2, qn, q, s1, s2) and its waveforms]

6.2.3 Timing Analysis of Transmission-Gate Latch

The latch that we now examine is more realistic than the simple multiplexer-based latch. We replace the multiplexer with a transmission gate.

6.2.3.1 Structure and Behaviour of a Transmission Gate (Smith 2.4.3)

[Figures: transmission gate with control s, input i, and output o. Panels: "Symbol", "Implementation", "Open", "Closed", "Transmit 1", and "Transmit 0"; together labelled "Transmission gate as switch"]

6.2.3.2 Structure and Behaviour of Transmission-Gate Latch (Smith 2.5.1)

[Figures: transmission-gate latch with inputs d and clk and output q. Panels: "Loading data into latch" and "Using stored data from latch"]

6.2.3.3 Clock-to-Q Delay for Transmission-Gate Latch

[Figure: transmission-gate latch with the clock-to-Q path from clk to q highlighted]

6.2.3.4 Setup and Hold Times for Transmission-Gate Latch

[Figures: transmission-gate latch with path1 and path2 highlighted. Panels: "Setup time for latch" (Setup time = path1 - path2) and "Hold time for latch" (Hold time = path1 - path2)]

6.2.4 Falling Edge Flip Flop (Smith 2.5.2)

We combine two active-high latches to create a falling-edge, master-slave flip flop. The analysis of the master-slave flip-flop illustrates how to do timing analysis for hierarchical storage devices. Here, we use the timing information for the active-high latch to compute the timing information of the flip-flop.
We do not need to know the primitive structure of the latch in order to derive the timing information for the flip-flop.

6.2.4.1 Structure and Behaviour of Flip-Flop

[Figure: master-slave flip-flop built from two latches with enable inputs (EN), and its waveforms. Signals: d, clk, clk_b, m, q. The waveforms are annotated with TInv, Tmd, Latch Setup, and Latch Clock-Q.]

  TInv   delay through an inverter
  Tmd    propagation delay from m to d

6.2.4.2 Clock-to-Q of Flip-Flop

[Figure: flip-flop schematic and waveforms showing Tinv, Latch Clock-to-Q, and Flop Clock-to-Q. Signals: d, clk, clk_b, m, q.]

$T_{CO}^{Flop} = T_{Inv} + T_{CO}^{Latch}$

6.2.4.3 Setup of Flip-Flop

[Figure: flip-flop schematic and waveforms showing Latch Setup and Flop Setup. Signals: d, clk, clk_b, m, q.]

$T_{SUD}^{Flop} = T_{SUD}^{Latch}$

The setup time of the flip-flop is the same as the setup time of the master latch. This is because, once the data is stored in the master latch, it will be held for the slave latch.

6.2.4.4 Hold of Flip-Flop

[Figure: flip-flop schematic and waveforms showing hold time for latch and hold time for flop. Signals: d, clk, clk_b, m, q.]

$T_{HO}^{Flop} = T_{HO}^{Latch}$

The hold time of the flip-flop is the same as the hold time of the master latch. This is because, once the data is stored in the master latch, it will be held for the slave latch.

6.2.5 Timing Analysis of FPGA Cells (Smith 5.1.5)

We can apply hierarchical analysis to structures that include both datapath and storage circuitry. We use an Actel FPGA cell to illustrate. The description of the Actel FPGA cell in the course notes is incomplete; refer to Smith's book for additional material.
6.2.5.1 Standard Timing Equations

  $T_{PD}$     delay from D-inputs to storage element
  $T_{CLKD}$   delay from clk-input to storage element
  $T_{OUT}$    delay from storage element to output

  setup time:      $T_{SUD}$ = slowest D path − fastest clk path = $T_{PD,Max} - T_{CLKD,Min}$
  hold time:       $T_{HO}$ = slowest clk path − fastest D path = $T_{CLKD,Max} - T_{PD,Min}$
  delay clk to Q:  $T_{CO}$ = clk path + output path = $T_{CLKD} + T_{OUT}$

6.2.5.2 Hierarchical Timing Equations

Add combinational logic to inputs, clock, and outputs of storage element.

[Figure: storage element with parameters $t_{SUD}$, $t_{HO}$, $t_{CO}$; combinational logic with delay $t_{PD}$ on the data inputs, $t_{CLKD}$ on clk, and $t_{OUT}$ on the output q.]

  $T_{SUD} = t_{SUD} + t_{PD,Max} - t_{CLKD,Min}$
  $T_{HO} = t_{HO} + t_{CLKD,Max} - t_{PD,Min}$
  $T_{CO} = t_{CO} + t_{CLKD,Max} + t_{OUT,Max}$

6.2.5.3 Actel Act 2 Logic Cell

Timing analysis of Actel Act 2 logic cell (Smith 5.1.5).

Actel ACT
- Basic logic cells are called Logic Modules.
- ACT 1 family: one type of Logic Module (see Figure 5.1, Smith p. 192).
- ACT 2 and ACT 3 families: use two different types of Logic Module (see Figure 5.4, Smith p. 198).
  - C-Module (Combinatorial Module): combinational logic similar to the ACT 1 Logic Module, but capable of implementing a five-input logic function.
  - S-Module (Sequential Module): C-Module + Sequential Element (SE) that can be configured as a flip-flop.

Actel Timing
- ACT family: see Figure 5.5, Smith p. 200.
- Simple. Why? Only logic inside the chip.
- Not exact delay (no place and route or physical layout, hence not accounting for interconnection delay).
- Non-deterministic Actel architecture.
- All primed parameters inside the S-Module are assumed.
- Calculate tSUD, tH, and tCO.
- Of the combinational logic delay of 3 ns, 0.4 ns went into increasing the setup time, tSUD, and 2.6 ns went into increasing the clock-to-output delay, tCO. From outside we can say that the combinational logic delay is buried in the flip-flop setup time.

[Figures: Simple Actel-style latch (d, clk, q); Actel latch with active-low clear (d, clk, clr, q); Actel flop with active-low clear (d, clk, clr, m, q).]
[Figure: Actel sequential module. C-Module inputs: d00, d01, d10, d11, a1, b1, a0, b0. SE-Module signals: m, se_clk, se_clk_n, q, clk, clr.]

6.2.5.4 Timing Analysis of Actel Sequential Module

Other given timing parameters:

  C-Module delay ($t_{PD}$)                             3 ns
  $t_{CLKD}$ (from clk to se_clk and se_clk_n)          2.6 ns

Timing parameters for Actel latch with active-low clear:

  $T_{SUD}$   0.4 ns
  $T_{HO}$    0.9 ns
  $T_{CO}$    0.4 ns

Question: What are the setup, hold, and $T_{CO}$ times for the entire Actel sequential module?

Answer: See Smith p. 199. Use Smith's eqn 5.15 and 5.16, and assume $t_{CLKD}$ = 2.6 ns:

  $T_{SUD}$ = 0.4 + 3.0 − 2.6 = 0.8 ns
  $T_{HO}$  = 0.9 + 2.6 − 3.0 = 0.5 ns
  $T_{CO}$  = 0.4 + 2.6 = 3.0 ns

6.2.6 Exotic Flop

As a contrast to the gate-level implementations of latches that we looked at previously, the figure below is the schematic for a state-of-the-art high-performance latch circa 2001.

[Figure: exotic flop schematic. Labels: two precharge nodes, each with a keeper; inverter chain; signals d, clk, q.]

The inverter chain creates an evaluation window in time when clock has just risen and the p transistors are turned on. When clock is 0, the left precharge node charges to 1 and the right precharge node discharges to 0. If d is 1 during the evaluation window, the left precharge node discharges to 0. The left precharge node goes through an inverter to the second precharge node, which will charge from 0 to 1, resulting in a 0 on q. If d is 0 during the evaluation window, the left precharge node stays at the precharge value of 1. The left precharge node goes through an inverter to the second precharge node, which will stay at 0, resulting in a 1 on q. The two inverter loops are keepers, which provide energy to keep the precharge nodes at their values after the evaluation window has passed and the clock is still 1.

6.3 Critical Paths and False Paths

6.3.1 Introduction to Critical and False Paths

In this section we describe how to find the critical path through the circuit: the path that limits the maximum clock speed at which the circuit will work correctly. A complicating factor in finding the
critical path is the existence of false paths: paths through the circuit that appear to be the critical path, but in fact will not limit the clock speed of the circuit. The reason that a path is false is that the behaviour of the gates prevents a transition (either 0→1 or 1→0) from travelling along the path from the source node to the destination node.

To confirm that a path is a true critical path, and not a false path, we must find a pair of input vectors that exercise the critical path. The two input vectors differ only in their value for the input signal on the critical path. The change on this signal (either 0→1 or 1→0) must propagate along the candidate critical path from the input to the output.

Usually the two input vectors will produce different output values. However, a critical path might produce a glitch (0→1→0 or 1→0→1) on the output, in which case the path is still the critical path, but the two input vectors both result in the same value on the output signal. Glitches should not be ignored, because they may result in setup violations. If the glitching value is inside the destination flop or latch at the end of the clock period, then the storage element will not store a stable value.

The algorithm that we present comes from McGeer and Brayton in the DAC 198? paper.

The algorithm to find the critical path through a circuit is presented in several parts.

1. Section 6.3.2: Find the longest path ignoring the possibility of false paths.
2. Section 6.3.3: Almost-correct algorithm to test whether a candidate critical path is a false path.
3. Section 6.3.4: If a candidate path is a false path, then find the next candidate path, and repeat the false-path detection algorithm.
4. Section 6.3.5: Correct, complete, and complex algorithm to find the critical path in a circuit.

Note: The analysis of critical paths and false paths assumes that all inputs change values at exactly the same time.
Timing differences between inputs are modelled by the skew parameter in timing analysis.

Note: To exercise a path, only one input needs to change. Stated another way, if a path cannot be exercised by toggling one input, then the path cannot be exercised by toggling more than one input.

Throughout our discussion of critical paths, we will use the delay values for gates shown in the table below.

  gate    delay
  NOT     2
  AND     4
  OR      4
  XOR     6

6.3.1.1 Example of Critical Path in Full Adder

Question: Find the critical path through the full-adder circuit shown below.

[Figure: full adder with inputs ci, a, b; internal signals i, j, k; outputs s, co.]

Answer:

Annotate with Max Distance to Destination

[Figure: full adder with each signal annotated with its maximum delay to an output.]

Find Candidate Critical Path

[Figure: annotated full adder with the candidate path highlighted.]

There are two paths of length 14: a→co and b→co. We arbitrarily choose a→co.

Test if Candidate is Critical Path

[Figure: full adder with side inputs set so that an edge on a propagates to co.]

Yes, the candidate path is the critical path.

The assignment of ci=1, a=0, b=0 followed in the next clock cycle by ci=1, a=1, b=0 will exercise the critical path. As a shortcut, we write the pair of assignments as: ci=1, a=rising edge, b=0.

Question: Do the input values of ci=0, a=edge, b=1 exercise the critical path?

Answer:

[Figure: full adder with ci=0, b=1 and an edge on a.]

The alternative does not exercise the critical path. Instead, the alternative excitation follows a shorter path, so the output stabilizes sooner.

Lesson: not all transitions on the inputs will exercise the critical path. Using timing simulation to find the maximum clock speed of a circuit might overestimate the clock speed, because the input values that you simulate might not exercise the critical path.
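The lesson above can be checked with a small Python sketch. It is only an illustration under stated assumptions: the netlist below (i = a XOR b, s = i XOR ci, j = i AND ci, k = a AND b, co = j OR k) is inferred from the signal names and the delay annotations and is not confirmed by the notes, the function names are hypothetical, and the simulation uses a simple integer-time model in which all inputs switch at t=0. The gate delays come from the table in this section.

```python
# Gate delays from the table in the notes.
DELAY = {"NOT": 2, "AND": 4, "OR": 4, "XOR": 6}
GATE = {
    "AND": lambda x, y: x & y,
    "OR": lambda x, y: x | y,
    "XOR": lambda x, y: x ^ y,
}

# ASSUMED full-adder netlist: signal -> (gate type, inputs), in topological order.
NETLIST = {
    "i": ("XOR", ("a", "b")),
    "s": ("XOR", ("i", "ci")),
    "j": ("AND", ("i", "ci")),
    "k": ("AND", ("a", "b")),
    "co": ("OR", ("j", "k")),
}
INPUTS = ("ci", "a", "b")


def max_delay_to_outputs():
    """Static analysis: annotate each signal with its maximum delay to an output."""
    dist = {sig: 0 for sig in (*INPUTS, *NETLIST)}
    # Walk from the outputs back toward the inputs (reverse topological order).
    for sig in reversed(list(NETLIST)):
        kind, ins = NETLIST[sig]
        for x in ins:
            dist[x] = max(dist[x], dist[sig] + DELAY[kind])
    return dist


def stable_values(inputs):
    """Evaluate the circuit once the given input values have settled."""
    values = dict(inputs)
    for sig, (kind, (x, y)) in NETLIST.items():
        values[sig] = GATE[kind](values[x], values[y])
    return values


def settle_time(before, after, output, horizon=40):
    """Timing simulation: time of the last change on `output` when the
    inputs switch from `before` to `after` at t=0."""
    old = stable_values(before)
    wave = {sig: [after[sig]] * horizon for sig in INPUTS}

    def at(sig, t):
        # Before t=0 every signal holds its old stable value.
        return old[sig] if t < 0 else wave[sig][t]

    for sig, (kind, (x, y)) in NETLIST.items():
        d = DELAY[kind]
        wave[sig] = [GATE[kind](at(x, t - d), at(y, t - d)) for t in range(horizon)]

    last, prev = 0, old[output]
    for t, value in enumerate(wave[output]):
        if value != prev:
            last = t
        prev = value
    return last


if __name__ == "__main__":
    print(max_delay_to_outputs())
    # ci=1, b=0, rising edge on a: exercises the 14-unit path a -> i -> j -> co.
    print(settle_time({"ci": 1, "a": 0, "b": 0}, {"ci": 1, "a": 1, "b": 0}, "co"))
    # ci=0, b=1, rising edge on a: co settles after only 8 units (a -> k -> co).
    print(settle_time({"ci": 0, "a": 0, "b": 1}, {"ci": 0, "a": 1, "b": 1}, "co"))
```

Under the assumed netlist, the static annotation gives 14 for a and b and 8 for ci, while a simulation that only ever applies the second pair of vectors would suggest that co settles in 8 time units.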
6.3.1.2 Preliminaries for Critical Paths

Definition critical path: The slowest path on the chip between flops, or between flops and pins. The critical path limits the maximum clock speed.

There are three classes of paths on a chip:

- entry path: from an input to a flop
  - Quartus does not report this by default. When Quartus reports this path, it is reported as the period associated with System fmax.
  - In Xilinx timing reports, this is reported as Maximum Delay.
- stage path: from one flop to another flop
  - In Quartus timing reports, this is reported as the period associated with Internal fmax.
  - In Xilinx timing reports, this is reported as Clock to Setup and Maximum Frequency.
- exit path: from a flop to an output
  - Quartus does not report this by default. When Quartus reports this path, it is reported as the period associated with System fmax.
  - In Xilinx timing reports, this is reported as Maximum Delay.

6.3.1.3 Longest Path and Critical Path

The longest path through the circuit might not be the critical path, because the behaviour of the gates might prevent an edge (0→1 or 1→0) from travelling along the path.

Definition false path: a path along which an edge cannot travel from beginning to end.

Example False Path

Question: Determine whether the longest path in the circuit below is a false path.

[Figure: circuit with inputs a, b and output y.]

Answer:

For this example, we use a very naive approach simply to illustrate the phenomenon of false paths. Sections 6.3.2 to 6.3.5 present a better algorithm to detect false paths and find the real critical path.

In the circuit above, the longest path is from b to y.

The four possible scenarios for the inputs are:

  a=0, b: 0→1
  a=0, b: 1→0
  a=1, b: 0→1
  a=1, b: 1→0

[Figure: the circuit annotated with signal values for each of the four scenarios.]

In each of the four scenarios, the edge is blocked at either the AND gate or the OR gate.
None of the four scenarios result in an edge on the output y, so the path from b to y is a false path.

Question: How can we determine analytically that this is a false path?

Answer: The value on a will always force either the AND gate to be a 0 (when a is 0) or the OR gate to be a 1 (when a is 1). For both a=0 and a=1, a change on b will be unable to propagate to y. The algorithm to detect false paths is based upon this type of analysis.

Preview of Complete Example

This example illustrates all of the concepts in analysing critical paths. Here, we explore the circuit informally. In section 6.3.5, we will revisit this circuit and analyse it according to the complete, correct, and complex algorithm.

Question: Find the critical path through the circuit below.

[Figure: circuit with signals a, b, c, d, e, f, g.]

Answer:

Even though the equation for this circuit reduces to false, the output signal (g) is not a constant 0. Instead, glitches can occur on g.

To explore the behaviour of the circuit, we will stimulate the circuit first with a falling edge, then a rising edge.

Stimulate the circuit with a falling edge and see which path the edge follows.

[Figure: circuit with a falling edge applied, annotated with the time at which each signal changes.]

The longest path through the circuit is the middle path. At g, the side input a has a controlling value before the falling edge arrives on the path input e. Thus, a falling edge is unable to excite the longest path through the circuit.

Stimulate the circuit with a rising edge and see which path the edge follows.

[Figure: circuit with a rising edge applied, annotated with the time at which each signal changes.]

At f, the side input c has a controlling value before the falling edge arrives on the path input e. Thus, a rising edge is unable to excite the longest path through the circuit.

Of the two scenarios, the falling edge follows a longer path through the circuit than the rising edge. The critical path is the lower path through the circuit.
When we develop our first algorithm to detect false paths (section 6.3.3), we will assume that at each gate, the input that is on the critical path will arrive after the other inputs. Not all circuits satisfy the assumption. At f, when a is a falling edge, the path input c arrives before the side input e.

6.3.1.4 Timing Simulation vs Static Timing Analysis

The delay through a component is usually dependent upon the values on signals. This is because different paths in the circuit have different delays and some input values will prevent some paths from being exercised. Here are two simple examples:

- In a ripple-carry adder, if a carry out of the MSB is generated from the least significant bit, then it will take longer for the output to stabilize than if no carries are generated at all.
- In a state machine using a one-hot state encoding, false paths might exist when more than one state bit is a 1.

Because of these effects, static timing analysis might be overly conservative and predict a delay that is greater than you will experience in practice. Conversely, a timing simulation may not demonstrate the actual slowest behaviour of your circuit: if you don't ever generate a carry from LSB to MSB, then you'll never exercise the critical path in your adder. The most accurate delay analysis requires looking at the complete set of actual data values that will occur in practice.

6.3.2 Longest Path

The following is an algorithm to find the longest path from a set of source signals to a set of destination signals. We first provide a high-level, intuitive description, and then present the actual algorithm.

Outline of Algorithm to Find Longest Path

Start at destination signals and traverse through fanin to source signals, annotating each intermediate signal with the maximum delay from the intermediate signal to the destination signals. The source signal with the maximum delay is the start of the longest path.
The delay annotation of this signal is the delay of the longest path. The longest path is found by working from the source signal to the destination signals, picking the fanout signal with the maximum delay at each step.

Algorithm to Find Longest Path

1. Set current time to 0.
2. Start at destination signals.
3. For each input to a gate that drives a destination signal, annotate the input with the current time plus the delay through the gate.
4. For each gate that has times on all of its fanout but not a time for itself,
   (a) annotate each input to the gate with the maximum time on the fanout plus the delay through the gate
   (b) go to step 4
5. To find the longest path, start at the source node that has the maximum delay. Work forward through the fanout. For signals that fan out to multiple signals, choose the fanout signal with the maximum delay.

Longest Path Example

Question: Find the longest path through the circuit below.

[Figure: circuit with signals a, b, c, d, e, f, g, h, i, j, k, l, m.]

Answer:

Annotate signals with the maximum delay to an output:

[Figure: circuit with each signal annotated with its maximum delay to an output; the primary inputs have a: 16, b: 12, c: 10.]

Find longest path:

[Figure: annotated circuit with the longest path highlighted.]

The path from a to y has a delay of 16.

6.3.3 Detecting a False Path

In this section, we will explore a simple and almost-correct algorithm to determine if a path is a false path. The simple algorithm in this section sometimes gives incorrect results if the candidate path intersects false paths. For all of the example circuits in this section, the algorithm gives the correct result.
The purpose of presenting this almost-correct algorithm is that it is relatively easy to understand and introduces one of the key concepts used in the complicated, correct, and complete algorithm for finding the critical path in section 6.3.5.

6.3.3.1 Preliminaries for Detecting a False Path

Controlling Value

The controlling value of a gate is the value such that if one of the inputs has this value, the output can be determined independently of the other inputs.

For an AND gate, the controlling value is 0, because when one of the inputs is a 0, we know that the output will be 0 regardless of the values of the other inputs.

The controlled output value of a gate is the value produced by the controlling input value.

  Gate    Controlling Value    Controlled Output
  AND     0                    0
  OR      1                    1
  NAND    0                    1
  NOR     1                    0
  XOR     none                 none

Path Input, Side Input

Definition path input: For a gate on a path (either a candidate critical path, or a real critical path), the path input is the input signal that is on the path.

Definition side input: For a gate on a path (either a candidate critical path, or a real critical path), the side inputs are the input signals that are not on the path.

The key idea behind the almost-correct algorithm is that, for an edge to propagate along a path, the side inputs to each gate on the path must have non-controlling values. The complete, correct, and complicated algorithm generalizes this constraint to handle circuits where the side inputs are on false paths.

Reconvergent Fanout

Definition reconvergent fanout: There are paths from signals in the fanout of a gate that reconverge at another gate.
Most of the difficulties both with critical paths and with testing circuits for manufacturing faults (Chapter 8) are caused by reconvergent fanout.

[Figure: circuit with signals a, b, c, d, e, f, g, h and outputs y, z.]

There are two sets of reconvergent paths in the circuit above. One set of reconvergent paths goes from a to y and one set goes from d to z.

If a candidate path has reconvergent fanout, then the rising or falling edge on the input to the path might cause a side input along the path to have a rising or falling edge, rather than a stable 0 or 1.

To support reconvergent fanout, we extend the rule for side inputs having non-controlling values to say that side inputs must have either non-controlling values or have edges that stabilize in non-controlling values.

Rules for Propagating an Edge Along a Path

These rules assume that side inputs arrive before path inputs. Section 6.3.5 relaxes this constraint.

[Figure: allowed side-input values for NOT, AND, OR, and XOR gates. For an AND gate the side input is a constant 1 or a rising edge; for an OR gate it is a constant 0 or a falling edge.]

Question: Why do the rules not have falling edges for AND gates or rising edges for OR gates on the side input?

Answer:

[Figure: waveforms for an AND gate with inputs a, b and output c.]

For an AND gate, a falling edge on a side input will force the output to change and prevent the path input from affecting the output. This is because the final value of a falling edge is the controlling value for an AND gate. Similarly, for an OR gate, the final value of a rising edge is the controlling value for the gate.

Analyzing Rules for Propagating Edges

The pictures below show all combinations of output edge (rising or falling) and input values (constant 1, constant 0, rising edge, falling edge) for AND and OR gates. These pictures assume that the side input arrives before the path input. The pictures that are crossed out illustrate situations that prevent the path input from affecting the output.
In these situations the inputs cause either a constant value on the output or the side input affects the output but the path input does not. The pictures that are not crossed out correspond to the rules above for pushing edges through AND and OR gates.

[Figure: side-input and path-input combinations for AND and OR gates. Crossed-out cases are labelled "constant 0 output", "constant 1 output", "0 is controlling", or "1 is controlling".]

6.3.3.2 Almost-Correct Algorithm to Detect a False Path

The rules above for propagating an edge along a candidate path assume that the values on side inputs always arrive before the value on the path input. This is always true when the candidate path is the longest path in the circuit. However, if the longest path is a false path, then when we are testing subsequent candidate paths, there is the possibility that a side input will be on a false path and the side input value will arrive later than the value from the path input.

This almost-correct algorithm assumes that values on side inputs always arrive before values on path inputs. The correct, complex, and complete critical path algorithm in section 6.3.5 extends the almost-correct algorithm to remove this assumption.

To determine if a path through a circuit is a false path:

1. Annotate each side input along the path with its non-controlling value. These annotations are the constraints that must be satisfied for the candidate path to be exercised.
2. Propagate the constraints backward from the side inputs of the path to the inputs of the circuit under consideration.
3. If there is a contradiction amongst the constraints, then the candidate path is a false path.
4. If there is no contradiction, then the constraints on the inputs give the conditions under which an edge will traverse along the candidate path from input to output.

6.3.3.3 Examples of Detecting False Paths
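Before working through the examples by hand, the four steps of the almost-correct algorithm in section 6.3.3.2 can be sketched in Python. This is a minimal sketch, not the method used in the notes: instead of propagating constraints backward, it substitutes exhaustive enumeration of the primary inputs, which gives the same answer on small circuits. The netlist format, the function names, and the two test circuits are hypothetical, and the sketch inherits the assumption that side inputs arrive before path inputs.

```python
from itertools import product

GATES = {
    "AND": lambda v: int(all(v)),
    "OR": lambda v: int(any(v)),
    "NOT": lambda v: 1 - v[0],
    "XOR": lambda v: sum(v) % 2,
}
# Non-controlling values; XOR and NOT impose no constraint on side inputs.
NON_CONTROLLING = {"AND": 1, "OR": 0}


def evaluate(netlist, inputs):
    """Compute every signal; the netlist dict must be in topological order."""
    values = dict(inputs)
    for sig, (kind, ins) in netlist.items():
        values[sig] = GATES[kind]([values[s] for s in ins])
    return values


def side_constraints(netlist, path):
    """Step 1: each side input along the path must be non-controlling."""
    constraints = []
    for prev, sig in zip(path, path[1:]):
        kind, ins = netlist[sig]
        if kind in NON_CONTROLLING:
            constraints += [(s, NON_CONTROLLING[kind]) for s in ins if s != prev]
    return constraints


def exercising_inputs(netlist, primary_inputs, path):
    """Steps 2 to 4, done by brute force: return every assignment of final
    input values that satisfies all side-input constraints.
    An empty list means the path is a false path."""
    constraints = side_constraints(netlist, path)
    hits = []
    for bits in product((0, 1), repeat=len(primary_inputs)):
        assignment = dict(zip(primary_inputs, bits))
        values = evaluate(netlist, assignment)
        if all(values[s] == v for s, v in constraints):
            hits.append(assignment)
    return hits


if __name__ == "__main__":
    # Hypothetical circuit in the spirit of the a/b/y example: a feeds both gates.
    false_circuit = {"x": ("AND", ("a", "b")), "y": ("OR", ("x", "a"))}
    print(exercising_inputs(false_circuit, ["a", "b"], ["b", "x", "y"]))

    # Same shape, but the OR gate has an independent side input c.
    true_circuit = {"x": ("AND", ("a", "b")), "y": ("OR", ("x", "c"))}
    print(exercising_inputs(true_circuit, ["a", "b", "c"], ["b", "x", "y"]))
```

In the first circuit the AND gate needs a=1 and the OR gate needs a=0, so no assignment survives and the path from b to y is false. In the second circuit both final values of b appear in the result, which corresponds to both rising and falling edges being able to propagate.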
False-Path Example 1

Question: Determine if the longest path in the circuit below is a false path.

[Figure: circuit with each signal annotated with its maximum delay to an output; the primary inputs have a: 16, b: 12, c: 10.]

Answer:

Compute constraints for side inputs to have non-controlling values:

[Figure: circuit with the side inputs along the longest path annotated with their non-controlling values; the figure marks the contradictory values.]

  side input    non-controlling value    constraint
  g[b]          1                        $b$
  i[e]          0                        $c$
  k[h]          1                        $\overline{b}$

Found contradiction between g[b] needing $b$ and k[h] needing $\overline{b}$, therefore the candidate path is a false path.

Analyze cause of contradiction:

[Figure: circuit with the two contradictory side inputs highlighted.]

These side inputs will always have opposite values. Both side inputs feed the same type of gate (AND), so it will always be the case that one of the side inputs will be a controlling value (0).

False-Path Example 2

Question: Determine if the longest path through the circuit below is a critical path. If the longest path is a critical path, find a pair of input vectors that will exercise the path.

[Figure: circuit with signals a, b, c, d, e, f, g, h.]

Answer:

[Figure: circuit with the side inputs along the longest path annotated with their non-controlling values.]

  side input    non-controlling value    constraint
  e[a]          1                        $a$
  g[b]          0                        $\overline{b}$
  h[f]          1                        ab

The complete constraint is the conjunction of the three constraints, which reduces to false. Therefore, the candidate path is a false path.

False-Path Example 3

This example illustrates a candidate path that is a true path.

Question: Determine if the longest path through the circuit below is a critical path. If the longest path is a critical path, find a pair of input vectors that will exercise the path.

[Figure: circuit with signals a, b, c, d, e, f, g, h.]

Answer:

Find longest path; label side inputs with non-controlling values:

[Figure: circuit with the side inputs along the longest path annotated with their non-controlling values.]

Table of side inputs, non-controlling values, and constraints on primary inputs:

  side input    non-controlling value    constraint
  e[a]          0                        $\overline{a}$
  g[b]          0                        $\overline{b}$
  h[b]          1                        ab

The complete constraint is the conjunction of the three constraints, which reduces to $\overline{a}\,\overline{b}$.
Thus, for an edge to propagate along the path, a must be 0 and b must be 0. The primary input to the path (c) does not appear in the constraint, thus both rising and falling edges will propagate along the path. If the primary input to the path appears with a positive polarity (e.g., $c$) in the constraint, then only a rising edge will propagate. Conversely, if the primary input appears negated (e.g., $\overline{c}$), then only a falling edge will propagate.

  Critical path    c, e, g, h
  Delay            14
  Input vector     a=0, b=0, c=rising edge

Illustration of rising edge propagating along path:

[Figure: circuit with a=0, b=0 and a rising edge on c propagating through e, g, h.]

Illustration of falling edge propagating along path:

[Figure: circuit with a=0, b=0 and a falling edge on c propagating through e, g, h.]

False-Path Example 4

This example illustrates reconvergent fanout.

Question: Determine if the longest path through the circuit below is a critical path. If the longest path is a critical path, find a pair of input vectors that will exercise the path.

[Figure: circuit with signals a, b, c, d, e, f, g.]

Answer:

[Figure: circuit with the side inputs along the longest path annotated with their non-controlling values.]

  side input    non-controlling value    constraint
  e[b]          1                        $b$
  g[d]          1                        $a$

The complete constraint is $ab$.

The constraint includes the input to the path (a), which indicates that not all edges will propagate along the path. The polarity of the path input indicates the final value of the edge. In this case, the constraint of $a$ means that we need a rising edge.

  Critical path    a, c, e, f, g
  Delay            12
  Input vector     a=rising edge, b=1

Illustration of rising edge propagating along path:

[Figure: circuit with b=1 and a rising edge on a propagating through c, e, f, g.]

If we try to propagate a falling edge along the path, the falling edge on the side input d forces the output g to fall before the arrival of the falling edge on the path input f. Thus, the edge does not propagate along the candidate path.

[Figure: circuit with b=1 and a falling edge on a; the edge on side input d reaches g first.]

Patterns in False Paths
After analyzing these examples, you might have begun to observe some patterns in how false paths arise. There are several patterns in the types of reconvergent fanout that lead to false paths. For example, if the candidate path has an OR gate and an AND gate that are both controlled by the same signal, and the candidate has an even number of inverters between these gates, then the candidate path is almost certainly a false path. The reason is the same as illustrated in the first example of a false path. The side input will always have a controlling value for either the OR gate or the AND gate.

6.3.4 Finding the Next Candidate Path

If the longest path is a false path, we need to find the next longest path in the circuit, which will be our next candidate critical path. If this candidate fails, we continue to find the next longest of the remaining paths, ad infinitum.

6.3.4.1 Algorithm to Find Next Candidate Path

To find the next candidate path, we use a path table, which keeps track of the partial paths that we have explored, their maximum potential delay, and the signals that we can follow to extend a partial path toward the outputs. We keep the path table sorted by the maximum potential delay of the paths. We delete a path from the table if we discover that it is a false path.

The key to the path table is how to update the potential delay of the partial paths after we discover a false path. All partial paths that are prefixes of the false path will need to have their potential delay values recomputed. The updated delay is found by following the unexplored signals in the fanout of the end of the partial path.

1. Initialize path table with primary inputs, their potential delay, and fanout.
2. Sort path table by potential delay (path with greatest potential delay at bottom of table).
3. If the partial path with the maximum potential delay has just one unused fanout signal, then extend the partial path with this signal.
   Otherwise:
   (a) Create a new entry in the path table for the partial path extended by the unused fanout signal with the maximum potential delay.
   (b) Delete this fanout signal from the list of unused fanout signals for the partial path.
4. Compute the constraint that the side input of the new signal does not have a controlling value, and update the constraint table.
5. If the new constraint does not cause a contradiction, then return to step 3.
   Otherwise:
   (a) Mark this partial path as false.
   (b) For each partial path that is a prefix of the false path: reduce the potential delay of the path by the difference between the potential delay of the fanout that was followed and the unused fanout with next greatest delay value.
   (c) Return to step 2.

6.3.4.2 Examples of Finding Next Candidate Path

Next-Path Example 1

Question: Starting from the initial delay calculation and longest path, find the next candidate path and test if it is a false path.

[Figure: circuit with each signal annotated with its maximum delay to an output; the primary inputs have a: 16, b: 12, c: 10.]

Answer:

Initial state of path table:

  potential delay    unused fanout    path
  10                 e                c
  12                 h, g             b
  16                 d                a

Extend the path with maximum potential delay until we find a contradiction or reach the end of the path. Add an entry in the path table for each intermediate path with multiple signals in the fanout.

Path table and constraint table after detecting that the longest path is a false path:

  potential delay    unused fanout    path
  10                 e                c
  12                 h, g             b
  16                 j, i             a, d, f, g
  false                               a, d, f, g, i, k

  side input    non-controlling value    constraint
  g[b]          1                        $b$
  i[e]          0                        $c$
  k[h]          1                        $\overline{b}$

The longest path is a false path. Recompute the potential delay of all paths in the path table that are prefixes of the false path.

The one path that is a prefix of the false path is: a, d, f, g. The remaining unused fanout of this path is j, which has a potential delay on its input of 2.
The previous potential delay of g was 8, thus the potential delay of the prefix reduces by $8 - 2 = 6$, giving the path a potential delay of $16 - 6 = 10$.

Path table after updating with new potential delays:

  potential delay    unused fanout    path
  false                               a, d, f, g, i, k
  10                 e                c
  10                 i                a, d, f, g
  12                 h, g             b

Extend b through g, because g has greater potential delay than the other fanout signal (h).

  potential delay    unused fanout    path
  false                               a, d, f, g, i, k
  10                 e                c
  10                 i                a, d, f, g
  12                 h, g             b
  12                 i, j             b, g

  side input    non-controlling value    constraint
  g[a]          1                        $a$

From g, we will follow i, because it has greater potential delay than j.

  potential delay    unused fanout    path
  false                               a, d, f, g, i, k
  10                 e                c
  10                 i                a, d, f, g
  12                 h, g             b
  12                 i, j             b, g
  12                                  b, g, i, k

  side input    non-controlling value    constraint
  g[a]          1                        $a$
  i[e]          0                        $c$
  k[h]          1                        $\overline{b}$

We have reached an output without encountering a contradiction in our constraints. The complete constraint is $a\,\overline{b}\,c$.

  Critical path    b, g, i, k
  Delay            12
  Input vector     a=1, b=falling edge, c=1

Illustrate the propagation of a falling edge:

[Figure: circuit with a=1, c=1 and a falling edge on b propagating through g, i, k.]

At k, the rising edge on the side input (h) arrives before the falling edge on the path input (i). For a brief moment in time, both the side input and path input are 1, which produces a glitch on k.

Next-Path Example 2

Question: Find the critical path in the circuit below.

[Figure: circuit with inputs a, b, c, d, e; internal signals f, g, h, i, j, l; outputs k, m.]

Answer:

Find the longest path:

[Figure: circuit with each signal annotated with its maximum delay to an output.]

Initial state of path table:

  potential delay    unused fanout    path
  4                  k                e
  10                 j, l             a
  14                 i                b
  20                 g                c
  22                 f                d

Extend the path with maximum potential delay until we find a contradiction or reach the end of the path. Add an entry in the path table for each intermediate path with multiple fanout signals.
potential delay   unused fanout   path
4                 k               e
10                j, l            a
14                i               b
20                g               c
22                j, k            d, f, g, h, i
false                             d, f, g, h, i, j, l

side input   non-controlling value   constraint
g[c]         1                       $c$
i[b]         0                       $\bar{b}$
j[a]         0                       $\bar{a}$
l[a]         1                       $a$

Contradiction between j[a] and l[a], therefore the path d, f, g, h, i, j, l is a false path. Any path that extends this path is also false.

To find the next candidate, begin by recomputing delays along the candidate path. The second gate in the contradiction is l. The last intermediate path before l with unused fanout is i. Cut the candidate path at this signal. The remaining initial part of the candidate path is d, f, g, h, i. The only unused fanout of this path is k.

We now calculate the new maximum potential delay of d, f, g, h, i, taking into account the false path that we just discovered. The delay from i along the candidate path j, l, m is 10 and the maximum potential delay along the remaining unused fanout (k) is 4. The difference is 10 - 4 = 6, and so the potential delay of d, f, g, h, i is reduced to 22 - 6 = 16.

After updating the partial delay of d, f, g, h, i, the partial path with the maximum potential delay is c. The new critical path candidate will be c, g, h, i, j, l, m.

Update the path table with the delay of 16 for the previous candidate path. Extend c along the path with maximum potential delay until we find a contradiction or reach the end of the path. Add an entry in the path table for each intermediate path with multiple fanout signals.

potential delay   unused fanout   path
false                             d, f, g, h, i, j, l
4                 k               e
10                j, l            a
14                i               b
16                k               d, f, g, h, i
20                k               c, f, g, h, i
false                             c, f, g, h, i, j, l

We encounter the same contradiction as with the previous candidate, and so we have another false path. We could have detected this false path without working through the path table, if we had recognized that our current candidate path overlaps with the section (j, l) of the previous candidate that caused the false path.
As with the previous candidate, we reduce the potential delay of the current candidate path up through i by 6, giving us a potential delay of 20 - 6 = 14 for c, f, g, h, i. The next candidate path is d, f, g, h, i, k with a delay of 16.

potential delay   unused fanout   path
false                             d, f, g, h, i, j, l
false                             c, f, g, h, i, j, l
4                 k               e
10                j, l            a
14                i               b
14                k               c, f, g, h, i
16                k               d, f, g, h, i

We extend the path through k and compute the constraint table.

side input   non-controlling value   constraint
g[c]         1                       $c$
i[b]         0                       $\bar{b}$
k[e]         0                       $\bar{e}$

The complete constraint is $\bar{b}c\bar{e}$. There is no constraint on a, and d may be either a rising edge or a falling edge.

Critical path   d, f, g, h, i, k
Delay           16
Input vector    a=0, b=0, c=1, d=rising edge, e=0

Next-Path Example 3

Question: Find the critical path in the circuit below.

[Figure: circuit with primary inputs a, b, c, d and signals e through p]

Answer:

[Figure: the same circuit annotated with potential delays (a = 12, b = 14, c = 16, d = 8)]

Initial state of path table:

potential delay   unused fanout   path
8                 n, o            d
12                j, k            a
14                e               b
16                f               c

Extend c through f:

potential delay   unused fanout   path
8                 n, o            d
12                j, k            a
14                e               b
16                m, n            c, f, g, h, i
false                             c, f, g, h, i, n, p

side input   non-controlling value   constraint
n[d]         1                       $d$
p[o]         1                       $\bar{d}$

The first candidate is a false path. Recompute the potential delay of c, f, g, h, i, which reduces it from 16 to 12.

potential delay   unused fanout   path
false                             c, f, g, h, i, n, p
8                 n, o            d
12                j, k            a
12                m               c, f, g, h, i
14                e               b

Extend b through e:

potential delay   unused fanout   path
false                             c, f, g, h, i, n, p
8                 n, o            d
12                j, k            a
12                m               c, f, g, h, i
false                             b, e, k, l

side input   non-controlling value   constraint
k[a]         1                       $a$
l[j]         1                       $\bar{a}$

The second candidate is a false path. There is no unused fanout signal from l for the path b, e, k, l, so this partial path is a false path and there is no new delay information to compute.

There are two paths with a potential delay of 12.
Choose c, f, g, h, i, because the end of the path is closer to an output, so there will be less work to do in analyzing the path.

potential delay   unused fanout   path
false                             c, f, g, h, i, n, p
false                             b, e, k, l
8                 n, o            d
12                j, k            a
12                                c, f, g, h, i, m

side input   non-controlling value   constraint
m[l]         0                       a ab

Critical path   c, f, g, h, i, m
Delay           12
Input vector    a=0, b=1, c=rising edge, d=0

6.3.5 Correct Algorithm to Find Critical Path

In this section, we remove the assumption that values on side inputs always arrive earlier than the value on the path input.

6.3.5.1 Algorithm

If we find a contradiction on a path, check for side inputs that are on previously discovered false paths.

If a side input to the candidate path is on a previously discovered false path, and the primary input of the candidate path is the same signal as the primary input of the false path, then the side input defines a prefix of a false path that is a late-arriving side input.

Compute the constraint to excite the prefix (this is called the viability constraint of the prefix).

To the row of the late-arriving side input in the constraint table, add as a disjunction the constraint that the prefix is viable and the path input has a controlling value.

6.3.5.2 Examples

Complete Example 1

Question: Find the critical path in the circuit below.

[Figure: circuit with primary input a and signals b through g]

Answer:

[Figure: the same circuit annotated with potential delays (a = 14)]

potential delay   unused fanout   path
14                g, b, c         a
false                             a, b, d, e, f, g

side input   non-controlling value   constraint
f[c]         1                       $\bar{a}$
g[a]         1                       $a$

First false path, pursue next candidate.

potential delay   unused fanout   path
false                             a, b, d, e, f, g
10                g, c            a
10                                a, c, f, g

side input   non-controlling value   constraint
f[e]         1                       $\bar{a}$
g[a]         1                       $a$

At first, this path appears to be false, but the side input f[e] is on the prefix of the false path a, b, d, e, f, g and the start of the false path is the primary input to the current candidate.
Thus, f[e] is a late-arriving side input. The candidate path will be a true path if the side input arrives late and the path input has a controlling value.

The viability condition for the path a, b, d, e is true. The constraint for the path input (c) to have a controlling value for f is $a$. Together, the viability constraint of true and the controlling value constraint of $a$ give us a late-side constraint of $a$.

Updating the constraint table with the late-arriving side input constraint gives us:

side input   non-controlling value   constraint
f[e]         1                       $\bar{a} + a \cdot \mathrm{true}$
g[a]         1                       $a$

The constraint reduces to $a$. A rising edge will exercise the path.

Critical path   a, c, f, g
Delay           10
Input vector    a=rising edge

Illustration of rising edge exercising the critical path:

[Figure: rising edge propagating along a, c, f, g]

Complete Example 2a

Question: Find the critical path in the circuit below.

[Figure: circuit with primary inputs a, b, c and signals d through j]

Answer:

Find longest path:

[Figure: the same circuit annotated with potential delays (a = 8, b = 18, c = 12)]

Explore longest path:

potential delay   unused fanout   path
8                 f               a
12                h               c
18                f, g            b, d, e
18                h, i            b, d, e, g
false                             b, d, e, g, h, i, j

side input   non-controlling value   constraint
h[c]         0                       $\bar{c}$
i[g]         0                       $b$
j[f]         0                       $\bar{a}\bar{b}$

Contradiction. First false path, find next candidate.

Changes in potential delays:

Signal / path            old   new
g on b, d, e, g          12    8
b, d, e, g               18    14
g[e] on b, d, e          14    10
e on b, d, e             14    10
b, d, e                  18    14

potential delay   unused fanout   path
false                             b, d, e, g, h, i, j
8                 f               a
12                h               c
14                f, g            b, d, e
14                                b, d, e, g, i, j

[Figure: the circuit annotated with the updated potential delays (b = 14)]

side input   non-controlling value   constraint
h[c]         0                       $\bar{c}$
i[h]         0                       $\bar{c}b$
j[f]         0                       $\bar{a}\bar{b}$

Initially, we found a contradiction, but b, d, e, g, h is a prefix of a false path with the same input as the candidate path, and i[h] is a side input to the candidate path. We have a late side input.

The viability constraint for this prefix is $\bar{c}$.
The constraint for the path input (i[g]) to have a controlling value of 1 is $\bar{b}$. Combining the two constraints together gives us a constraint for the late side input of i[h] of $\bar{b}\bar{c}$.

Adding the constraint of the late side input to the condition table gives us:

side input   non-controlling value   constraint
h[c]         0                       $\bar{c}$
i[h]         0                       $b\bar{c} + \bar{b}\bar{c} = \bar{c}$
j[f]         0                       $\bar{a}\bar{b}$

The constraints reduce to $\bar{a}\bar{b}\bar{c}$. A falling edge will exercise the path.

Critical path   b, d, e, g, i, j
Delay           14
Input vector    a=0, b=falling edge, c=0

Illustration of falling edge exercising the critical path:

[Figure: falling edge on b propagating along b, d, e, g, i, j, with a = 0 and c = 0]

Complete Example 2b

Question: Find the critical path in the circuit below.

[Figure: circuit with primary inputs a, b, c, d, e and signals x, f through m]

Answer:

Find longest path:

[Figure: the same circuit annotated with potential delays (a = 8, b = 18, c = 20, d = 12, e = 14)]

Explore longest path:

potential delay   unused fanout   path
8                 k               a
12                i               d
14                h               e
18                f               b
20                f, h            c, x
20                i, k            c, x, f, g
false                             c, x, f, g, i, l, m

side input   non-controlling value   constraint
f[b]         0                       $\bar{b}$
i[d]         1                       $d$
l[j]         1                       $c\bar{e}$
m[k]         0                       $\bar{a}(b + \bar{c})$

Contradiction. First false path, find next candidate.

Changes in potential delays:

Signal / path            old   new
g on c, x, f, g          12    8
c, x, f, g               20    16
f[x] on c, x             18    14
x on c, x                18    14
c, x                     20    16

potential delay   unused fanout   path
false                             c, x, f, g, i, l, m
8                 k               a
12                i               d
14                h               e
16                f, h            c, x
16                k               c, x, f, g
18                f               b

Pursue b, f.

potential delay   unused fanout   path
false                             c, x, f, g, i, l, m
8                 k               a
12                i               d
14                h               e
16                f, h            c, x
16                k               c, x, f, g
18                i, k            b, f, g
18                                b, f, g, i, l, m

side input   non-controlling value   constraint
f[x]         0                       $c$
i[d]         1                       $d$
l[j]         1                       $c\bar{e}$
m[k]         0                       $\bar{a}(b + \bar{c})$

The constraint simplifies to $\bar{a}bcd\bar{e}$.

Critical path   b, f, g, i, l, m
Delay           18
Input vector    a=0, b=rising edge, c=1, d=1, e=0

Demonstrate excitation of path:

[Figure: rising edge on b propagating along b, f, g, i, l, m, reaching m at time 18]
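The potential-delay updates used throughout these examples (step 5(b) of the algorithm in Section 6.3.4.1) come down to one subtraction. The following is a minimal Python sketch of that step only; the function name and argument names are illustrative and do not come from the notes, and the sketch does not maintain the path table or the constraint table.

```python
def reduced_prefix_delay(prefix_delay, followed_delay, next_best_delay):
    """Potential delay of a prefix of a false path after the update.

    prefix_delay    -- potential delay of the prefix before the update
    followed_delay  -- potential delay of the fanout that was followed
                       (the one that led to the false path)
    next_best_delay -- potential delay of the unused fanout with the
                       next greatest delay
    """
    return prefix_delay - (followed_delay - next_best_delay)


# Next-Path Example 1: prefix a,d,f,g; followed fanout had potential
# delay 8, the remaining fanout j has potential delay 2.
example_1 = reduced_prefix_delay(16, 8, 2)

# Next-Path Example 2: prefix d,f,g,h,i; followed j,l,m (10),
# remaining fanout k (4).
example_2 = reduced_prefix_delay(22, 10, 4)

# Complete Example 2b: prefix c,x,f,g; g drops from 12 to 8.
example_2b = reduced_prefix_delay(20, 12, 8)

print(example_1, example_2, example_2b)  # 10 16 16
```

The three calls reproduce the updated table entries 10, 16, and 16 from the worked examples.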
Modified Example 3b

Question: Modify the circuit to illustrate a late side input. Make j a very slow inverter with a delay of 5.

Answer:

Pick up the example after determining that c, x, f, g, i, l, m is false.

[Figure: the circuit annotated with potential delays for the slow inverter j (a = 8, b = 18, c = 19, d = 12, e = 17)]

potential delay   unused fanout   path
false                             c, x, f, g, i, l, m
8                 k               a
12                i               d
14                h               e
18                f               b
19                                c, x, h, j, l, m

side input   non-controlling value   constraint
h[e]         0                       $\bar{e}$
l[i]         1                       $\bar{b}cd$
m[k]         0                       $\bar{a}(b + \bar{c})$

Initially, we found a contradiction, but c, x, f, g, i is a prefix of a false path with the same primary input as the candidate path, and l[i] is a side input to the candidate path. We have a late side input.

The viability constraint for this prefix is $\bar{b}d$. The constraint for the path input (l[j]) to have a controlling value of 0 is $\bar{c} + e$. Combining the two constraints together gives us a constraint for the late side input of l[i] of $\bar{b}\bar{c}d + \bar{b}de$.

Adding the constraint of the late side input to the condition table gives us:

side input   non-controlling value   constraint
h[e]         0                       $\bar{e}$
l[i]         1                       $\bar{b}cd + \bar{b}\bar{c}d + \bar{b}de = \bar{b}d$
m[k]         0                       $\bar{a}(b + \bar{c})$

The constraints reduce to $\bar{a}\bar{b}\bar{c}d\bar{e}$.

Critical path   c, x, h, j, l, m
Delay           19
Input vector    a=0, b=0, c=falling edge, d=1, e=0

Illustration of falling edge exercising the critical path:

[Figure: falling edge on c propagating along c, x, h, j, l, m]

Complete Example 3

Question: Find the critical path in the circuit below.

[Figure: circuit with primary inputs a, b, c and signals d through k]

Answer:

[Figure: the same circuit annotated with potential delays (a = 12, b = 16, c = 14)]

potential delay   unused fanout   path
12                h, k            a
14                e               c
false                             b, d, f, h, j, k

side input   non-controlling value   constraint
h[a]         1                       $a$
j[i]         0                       $c$
k[a]         0                       $\bar{a}$

First false path, pursue next candidate.
potential delay   unused fanout   path
false                             b, d, f, h, j, k
12                h, k            a
14                                c, e, g, i, j, k

side input   non-controlling value   constraint
j[h]         0                       $\bar{a} + \bar{b}$
k[a]         0                       $\bar{a}$

The constraint reduces to $\bar{a}$.

Because the minimum delay from an input to the side input h is greater than the delay to the path input i, we might be tempted (incorrectly!) to treat h as a late-arriving side input to j. This would be a mistake. The primary input to the path (c) does not fan out to h, thus h will have a stable value. (Remember, to detect whether a candidate path is false, the only input to the circuit that changes value is the primary input to the critical path.) Late-arriving side inputs are relevant only to signals that are affected by the primary input to the path.

Complete Example 4

Question: Find the critical path in the circuit below.

[Figure: circuit diagram not recoverable from the extraction]

Answer:

potential delay   unused fanout   path
12                g, h            a
16                e, j            c
16                d, e            b
false             e, d            b, e, g, i

side input   non-controlling value   constraint
e[c]         0                       $\bar{c}$
g[a]         1                       $a$
i[a]         0                       $\bar{a}$

First false path, pursue next candidate.

potential delay   unused fanout   path
false             e, d            b, e, g, i
12                g, h            a
14                d               b
16                e, j            c
false                             c, e, g, i

Second false path, pursue next candidate.

potential delay   unused fanout   path
false             e, d            b, e, g, i
false                             c, e, g, i
8                 j               c
12                g, h            a
14                d               b

side input   non-controlling value   constraint
j[c]         1                       $c$
k[i]         0                       a

Third candidate is a true path.

If the initial analysis suggested that the candidate was a false path, we might be tempted to use k[i] as a late side input. However, this is not a late side input, because b-i and c-i are false paths. There is no true path to i that has a delay longer than the delay of the current candidate path c,d,f,h,j.
If we did not see immediately that k[i] is not a late side input, we would discover this when we tried but failed to construct the viability condition for the paths b-i and c-i.

Although the path a-i is a true path, it does not contribute to making k[i] a late side input, because the delay from a to i is less than the delay along the candidate path.

6.3.6 Further Extensions

McGeer and Brayton's paper includes two extensions to the critical path algorithm presented here that we will not cover:
- gates with more than two inputs
- finding all input values that will exercise the critical path

6.4 Analog Timing Model

There are many different models used to describe the timing of circuits. In the section on critical paths, we used a timing model that was based on the size of the gate. The timing model ignored interconnect delays and treated all gates as if they had the same fanout. For example, the delay through an AND gate was 4, independent of how many gates were in its immediate fanout.

In this section and the next (section 6.5) we discuss two timing models. In this section, we discuss the detailed analog timing model, which reflects quite accurately the actual voltages on different nodes. The SPICE simulation program uses very detailed analog models of transistors (dozens of parameters to describe a single transistor). In the next section, we describe the Elmore delay model, which achieves greater simplicity than the analog model, but at a loss of accuracy.

[Figure: P-type and N-type transistors shown at transistor level, mask level, and switch level, and as a cross-section of the fabricated transistor (source, gate, drain, poly, diffusion, contacts, substrate)]

Different Levels of Abstraction for Inverter
[Figure: inverter at mask level, transistor level, and gate level]

RC-Network for Timing Analysis

[Figure: RC network of the inverter between VDD and GND, with $R_{pu}$, $R_{pd}$, $C_p$, and $C_L$]

- Contacts (vias) have resistance ($R_V$).
- Metal areas (wires) have resistance ($R_W$) and capacitance ($C_W$). The resistance is dependent upon the geometry of the wire. The capacitance is dependent upon the geometry of the wire and the other wires adjacent to it.
- For most circuits, the via resistance is much greater than the wire resistance ($R_V \gg R_W$).
- To reduce area, modern wires tend to have tall and narrow cross sections. When wires are packed close together (e.g. a signal that is an array or vector), the wires act like capacitors.

A Pair of Inverters

[Figure: a pair of inverters (signals a, b, c) at transistor level, gate level, and mask level]

A Pair of Inverters (Cont'd)

[Figure: mask level and RC network for timing analysis of the pair of inverters, with $R_{pu}$, $R_{pd}$, $C_p$, $C_L$ for each inverter and $R_V$, $R_W$, $C_W$ on the connection at b]

A Circuit with Fanout

[Figure: circuit with fanout (signals a, b, c, d) at gate level, gate level (physical layout), transistor level, and mask level]

RC-Network for Timing Analysis

[Figure: RC network of the circuit with fanout, with wire segments $R_{W1}$/$C_{W1}$, $R_{W2}$/$C_{W2}$, $R_{W3}$/$C_{W3}$ and vias $R_V$]

6.4.1 Timing Model

[Figure: timing model with $V_i$, $R_{pu}$, $R_{pd}$, $C_p$, $V_o$, and $C_{out}$]

Timing model:

$R_{pu}$    pull-up resistor in p-tran
$R_{pd}$    pull-down resistor in n-tran
$C_p$       parasitic capacitance
$C_{out}$   load capacitance

6.4.1.1 Equation for Output Voltage

Output voltage when $V_o$ discharges through $R_{pd}$:
$V_o(t) = V_{DD}\, e^{-t / (R_{pd}(C_p + C_{out}))}$

Measuring Delay Through an Inverter

[Figure: two inverters (signals a, b, c) at gate level and as an RC network (analog level), with analog waveforms on a and b]

How do we use the analog waveforms to determine the discrete delay through the inverter?

Trip Points

To measure delay through an inverter, what voltage levels should we use?

Definition Trip Points: A high or 1 trip point is the voltage level where an upwards transition means the signal represents a 1. A low or 0 trip point is the voltage level where a downwards transition means the signal represents a 0.

In the figure below the gray line represents the actual voltage on a signal. The black line is the digital discretization of the analog signal.

[Figure: analog and discretized waveforms for signals a and b]

We need to pick our trip points; these then determine the start and stop times for measuring delay.

Pick the trip points to simplify the delay equation. Pick trip points of 0.35/0.65:
- low-voltage (0) trip point of $0.35\,V_{DD}$
- high-voltage (1) trip point of $0.65\,V_{DD}$

Set up the delay equation for $T_{PD}$ to be the time for $V_o$ to fall from $V_{DD}$ to the low trip point of $0.35\,V_{DD}$:

Original equation:            $V_o(t) = V_{DD}\, e^{-t / (R_{pd}(C_p + C_{out}))}$
$0.35\,V_{DD}$ trip point:    $0.35\,V_{DD} = V_{DD}\, e^{-T_{PD} / (R_{pd}(C_p + C_{out}))}$

$T_{PD}$ represents the propagation delay, which is the sum of the interconnect and load delays.

Solving for $T_{PD}$, using $\ln(1/0.35) \approx 1$, and doing some more approximations:

$T_{PD} \approx R_{pd}(C_p + C_{out})$

Some Rough Intuition

- A larger transistor has a lower resistance, but a higher capacitance.
- Resistance affects timing of source (driving) signals.
- Capacitance affects (mostly) timing of destination (load) signals.
- Decreasing resistance increases the current through drivers.
- Increasing capacitance slows down (dis)charging of load capacitors.

6.4.1.2 Extrinsic / Intrinsic Delays

Definition intrinsic delay: Delay resulting from pull-(up/down) resistor and parasitic capacitance.

Definition extrinsic delay: Delay resulting from load capacitance.

6.5 Elmore Delay Model

The Elmore delay model is an appealing tradeoff between the cumbersome detail of the accurate analog delay model and the simplistic inaccuracy of models that use average interconnect and load delays.

6.5.1 Elmore Time Constant

Elmore time constants are used to analyze interconnect and load delay with intermediate connections and/or fanout.

Original equation:                $V_o(t) = V_{DD}\, e^{-t / (R_{pd}(C_p + C_{out}))}$
$0.35\,V_{DD}$ trip point:        $0.35\,V_{DD} = V_{DD}\, e^{-T_{PD} / (R_{pd}(C_p + C_{out}))}$
Introduce Elmore-delay constant:  $0.35\,V_{DD} = V_{DD}\, e^{-T_{PD} / D_i}$

$V_i(t)$      The voltage on node i (capacitor i) at time t: $V_i(t) = e^{-t / D_i}$
$D_i$         Elmore time constant for node i: $D_i = \sum_{k=1}^{n} ER_{k,i}\, C_k$ (n is the number of nodes in the circuit)
$ER_{k,i}$    Resistance along the path from node i to the source-ground node that is also on the path from node k to the source-ground node (source ground is the ground node below the pull-down resistor of the source)

If we:
- approximate $V_i(t)$ as an exponential waveform, and
- use 0.35/0.65 trip points
then the delay from the source to node i is $D_i$ seconds.

6.5.2 Interconnect with Single Fanout

[Figure: gates G1 and G2 connected through antifuses $R_{a1}$ to $R_{a4}$ and wire segments $R_{w1}$ to $R_{w3}$ with capacitances $C_1$ to $C_3$, and the corresponding RC network with $V_i$, $R_{pu}$, $R_{pd}$, $C_p$, and $C_{G2}$]

G*    gate
C*    capacitance on wire
Ra*   resistance through antifuse
Rw*   resistance through wire

Question: Calculate the delay from gate 1 to gate 2.

Answer: Gate 2 represents node 4 on the RC tree.
$D_4 = \sum_{k=1}^{4} ER_{k,4}\, C_k$
$\quad = ER_{1,4} C_1 + ER_{2,4} C_2 + ER_{3,4} C_3 + ER_{4,4} C_4$
$\quad = (R_{a1} + R_{w1} + R_{a2} + R_{w2} + R_{a3} + R_{w3} + R_{a4})\, C_{G2}$
$\qquad + (R_{a1} + R_{w1} + R_{a2} + R_{w2} + R_{a3} + R_{w3})\, C_3$
$\qquad + (R_{a1} + R_{w1} + R_{a2} + R_{w2})\, C_2$
$\qquad + (R_{a1} + R_{w1})\, C_1$

Approximate $R_a \gg R_w$:

$D_4 \approx R_{a1} C_1 + (R_{a1} + R_{a2})\, C_2 + (R_{a1} + R_{a2} + R_{a3})\, C_3 + (R_{a1} + R_{a2} + R_{a3} + R_{a4})\, C_{G2}$

Approximate $R_{ai} = R_{aj}$:

$D_4 \approx 4 R_a C_{G2} + 3 R_a C_3 + 2 R_a C_2 + R_a C_1$

Question: If you double the number of antifuses and wires needed to connect two gates, what will be the approximate effect on the wire delay between the gates?

Answer:

$D_i = \sum_{k=1}^{n} ER_{k,i}\, C_k$

Assume all resistances and capacitances are the same values (R and C), and assume that all intermediate nodes are along the path between the two gates of interest.

$ER_{k,i} = kR$

$D_i = \sum_{k=1}^{n} kRC$

Using the mathematical theorem:

$\sum_{i=1}^{n} i = \frac{n(n+1)}{2} \approx \frac{n^2}{2}$

We simplify the delay equation:

$D_i = \sum_{k=1}^{n} kRC \approx \frac{n^2}{2} RC$

We see that the delay is proportional to the square of the number of antifuses along the path, so doubling the number of antifuses roughly quadruples the wire delay.

6.5.3 Interconnect with Multiple Gates in Fanout

[Figure: gate G1 driving gates G2 and G3, shown as a schematic and as a layout]

Question: Assuming that wire resistance is much less than antifuse resistance and that all antifuses have equal resistance, calculate the delay from the source inverter (G1) to G2.

Answer:

1. There are a total of 7 nodes in the circuit ($n = 7$).

2. Label interconnect with resistance and capacitance identifiers.

[Figure: layout labeled with resistances $R_1$ to $R_6$ and capacitances $C_1$ to $C_7$]

3. Draw RC tree.

[Figure: RC tree with driver G1 ($R_{pu}$, $R_{pd}$, $C_p$), nodes n1 to n7, resistances $R_1$ to $R_6$, and capacitances $C_1$ to $C_7$; G2 is at n5 and G3 is at n7]

4. G2 is node 5 in the circuit ($i = 5$).

5. Elmore delay equations

$D_5 = \sum_{k=1}^{7} ER_{k,5}\, C_k$
$\quad = ER_{1,5} C_1 + ER_{2,5} C_2 + ER_{3,5} C_3 + ER_{4,5} C_4 + ER_{5,5} C_5 + ER_{6,5} C_6 + ER_{7,5} C_7$

6. Elmore resistances

$ER_{1,5} = R_1 = R$
$ER_{2,5} = R_1 + R_2 = 2R$
$ER_{3,5} = R_1 + R_2 = 2R$
$ER_{4,5} = R_1 + R_2 + R_3 = 3R$
$ER_{5,5} = R_1 + R_2 + R_3 + R_4 = 4R$
$ER_{6,5} = R_1 + R_2 = 2R$
$ER_{7,5} = R_1 + R_2 = 2R$

7. Plug resistances into delay equations
$D_5 = RC_1 + 2RC_2 + 2RC_3 + 3RC_4 + 4RC_5 + 2RC_6 + 2RC_7$

Delay from G1 to G3

Question: Assuming that wire resistance is much less than antifuse resistance and that all antifuses have equal resistance, calculate the delay from the source inverter (G1) to G3.

Answer:

1. G3 is node 7 in the circuit ($i = 7$).

2. Elmore delay equations

$D_i = \sum_{k=1}^{n} ER_{k,i}\, C_k$
$D_7 = \sum_{k=1}^{7} ER_{k,7}\, C_k$
$\quad = ER_{1,7} C_1 + ER_{2,7} C_2 + ER_{3,7} C_3 + ER_{4,7} C_4 + ER_{5,7} C_5 + ER_{6,7} C_6 + ER_{7,7} C_7$

3. Elmore resistances

$ER_{1,7} = R_1 = R$
$ER_{2,7} = R_1 + R_2 = 2R$
$ER_{3,7} = R_1 + R_2 = 2R$
$ER_{4,7} = R_1 + R_2 = 2R$
$ER_{5,7} = R_1 + R_2 = 2R$
$ER_{6,7} = R_1 + R_2 + R_5 = 3R$
$ER_{7,7} = R_1 + R_2 + R_5 + R_6 = 4R$

4. Plug resistances into delay equations

$D_7 = RC_1 + 2RC_2 + 2RC_3 + 2RC_4 + 2RC_5 + 3RC_6 + 4RC_7$

Delay to G2 vs G3

Question: Assuming all wire segments at the same level have roughly the same capacitance, which is greater, the delay to G2 or the delay to G3?

Answer:

1. Equations for delay to G2 ($D_5$) and G3 ($D_7$)

$D_5 = RC_1 + 2RC_2 + 2RC_3 + 3RC_4 + 4RC_5 + 2RC_6 + 2RC_7$
$D_7 = RC_1 + 2RC_2 + 2RC_3 + 2RC_4 + 2RC_5 + 3RC_6 + 4RC_7$

2. Difference in delays

$D_5 - D_7 = RC_4 + 2RC_5 - RC_6 - 2RC_7$

3. Compare capacitances

$C_4 \approx C_6$
$C_5 \approx C_7$

4. Conclusion: delays are approximately equal.

6.6 Practical Usage of Timing Analysis

Speed Grading

- Fabs sort chips according to their speed (sorting is known as speed grading or speed binning).
- Faster chips are more expensive.
- In FPGAs, sorting is usually based on propagation delay through an FPGA cell. As wires become a larger portion of delay, some analysis of wire delays is also being done.
- Propagation delay is the average of the rising and falling propagation delays.
- Typical speed grades for FPGAs:

  Std   standard speed grade
  1     15% faster than Std
  2     25% faster than Std
  3     35% faster than Std

Worst-Case Timing

Maximum Delay in CMOS. When?
- Minimum voltage
- Maximum temperature
- Slow-slow conditions (process variation/corner which results in slow p-channel and slow n-channel). We could also have fast-fast, slow-fast, and fast-slow process corners.

Increasing temperature increases delay:
  Temp ↑ ⇒ electron vibration ↑ ⇒ colliding with current electrons ↑ ⇒ resistivity ↑ ⇒ delay ↑

Increasing supply voltage decreases delay:
  supply voltage ↑ ⇒ current ↑ ⇒ load capacitor charge time ↓ ⇒ total delay ↓

A derating factor is a number used to adjust timing numbers to account for voltage and temperature conditions.

ASIC manufacturers' classes, based on variety of environments:

              VDD        TA (ambient temp)   TC (case temp)
Commercial    5V ± 5%    0 to +70°C
Industrial    5V ± 10%   -40 to +85°C
Military      5V ± 10%                       -55 to +125°C

What is important is the transistor temperature inside the chip, TJ (junction temperature).

6.6.1 Speed Binning

Speed binning is the process of testing each manufactured part to determine the maximum clock speed at which it will run reliably.

Manufacturers sell chips off of the same manufacturing line at different prices based on how fast they will run.

A speed bin is the clock speed that chips will be labeled with when sold.

Overclocking: running a chip at a clock speed faster than what it is rated for (and hoping that your software crashes more frequently than your over-stressed hardware will).

6.6.1.1 FPGAs, Interconnect, and Synthesis

On FPGAs, 40-60% of the clock cycle is consumed by interconnect.

When synthesizing, increasing the effort (number of iterations) of place and route can significantly reduce the clock period on large designs.

6.6.2 Worst Case Timing

6.6.2.1 Fanout delay

In Smith's book, Table 5.2 (Fanout delay) combines two separate parameters:
- capacitive load delay
- interconnect delay
into a single parameter (fanout). This is common, and fine.
But, when reading a table such as this, you need to know whether fanout delay is combining both capacitive load delay and interconnect delay, or is just capacitive load.

6.6.2.2 Derating Factors

Delays are dependent upon supply voltage and temperature.

  Temp ↑ ⇒ Delay ↑
  Supply voltage ↑ ⇒ Delay ↓

Temperature

- Temp ↑ ⇒ Delay ↑
- Temp ↑ ⇒ Resistivity of wires ↑
- As temp goes up, atoms vibrate more, and so have greater probability of colliding with electrons flowing with current.

Supply voltage

- Supply voltage ↑ ⇒ Delay ↓
- Supply voltage ↑ ⇒ current ↑ (V = IR)
- current ↑ ⇒ time to charge load capacitors to threshold voltage ↓

Derating Factor

Definition: A derating factor is a number to adjust timing numbers to account for different temperature and voltage conditions.

Excerpt from table 5.3 in Smith's book (Actel Act 3 derating factors):

Derating factor   Temp     Vdd
1.17              125°C    4.5V
1.00              70°C     5.0V
0.63              -55°C    5.5V

6.7 Timing Analysis Problems

P6.1 Terminology

For each of the terms: clock skew, clock period, setup time, hold time, and clock-to-q, answer which time periods (one or more of t1-t9 or NONE) are examples of the term.

NOTES:
1. The timing diagram shows the limits of the allowed times (either minimum or maximum).
2. All timing parameters are non-negative.
3. The signal a is the input to a rising-edge flop and b is the output. The clock is clk1.

[Figure: timing diagram of clk1, clk2, a, and b with intervals t1 through t11, marking where a signal may change and where it is stable]

clock skew
clock period
setup time
hold time

P6.2 Hold Time Violations

P6.2.1 Cause

What is the cause of a hold time violation?

P6.2.2 Behaviour

What is the bad behaviour that results if a hold time violation occurs?

P6.2.3 Rectification

If a circuit has a hold time violation, how would you correct the problem with minimal effort?
P6.3 Latch Analysis

Does the circuit below behave like a latch? If not, explain why not. If so, calculate the clock-to-Q, setup, and hold times; and answer whether it is active-high or active-low.

[Figure: circuit with inputs d and en and output q]

Gate Delays
AND   4
OR    2
NOT   1

P6.4 Critical Path and False Path

Find the critical path through the following circuit:

[Figure: circuit with signals a through m]

P6.5 Critical Path

[Figure: circuit with signals a through m]

gate    delay
NOT     2
AND     4
OR      4
XOR     6

Assume all delay and timing factors other than combinational logic delay are negligible.

P6.5.1 Longest Path

List the signals in the longest path through this circuit.

P6.5.2 Delay

What is the combinational delay along the longest path?

P6.5.3 Missing Factors

What factors that affect the maximum clock speed does your analysis for parts 1 and 2 not take into account?

P6.5.4 Critical Path or False Path?

Is the longest path that you found a real critical path, or a false path? If it is a false path, find the real critical path. If it is a critical path, find a set of assignments to the primary inputs that exercises the critical path.

P6.6 Timing Models

In your next job, you have been told to use a fanout timing model, which states that the delay through a gate increases linearly with the number of gates in the immediate fanout. You dimly recall that a long time ago you learned about a timing model named Elmo, Elmwood, Elmore, El-Morre, or something like that.

For the circuit shown below as a schematic and as a layout, answer whether the fanout timing model closely matches the delay values predicted by the Elmore delay model.

[Figure: gate G1 driving gates G2, G3, G4, and G5, shown as a schematic and as a layout]

Symbol description      Capacitance   Resistance
Gate                    Cg            0
Interconnect level 2    Cx            0
Interconnect level 1    Cy            0
Antifuse                0             R

Assumptions:
- The capacitance of a node on a wire is independent of where the node is located on the wire.
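For reference when working this problem, the Elmore sum $D_i = \sum_k ER_{k,i} C_k$ from Section 6.5.1 can be sketched in Python. This is a minimal sketch under assumptions: the function names and the dictionary encoding of the RC tree are illustrative, not part of the notes. The tree encoded below is the seven-node example of Section 6.5.3, as read from its table of Elmore resistances (node 3 is taken to hang from node 2 with no resistance). It is not the circuit of this problem.

```python
def path_to_source(parent, node):
    """Nodes on the path from `node` back to the driver, including `node`."""
    nodes = []
    while node is not None:
        nodes.append(node)
        node = parent[node]
    return nodes


def elmore_resistances(parent, resistance, target):
    """ER[k, target] for every node k: the resistance shared by the
    driver-to-k path and the driver-to-target path."""
    on_target_path = set(path_to_source(parent, target))
    return {
        k: sum(resistance[n] for n in path_to_source(parent, k)
               if n in on_target_path)
        for k in parent
    }


def elmore_delay(parent, resistance, capacitance, target):
    """D_target = sum over k of ER[k, target] * C[k]."""
    er = elmore_resistances(parent, resistance, target)
    return sum(er[k] * capacitance[k] for k in parent)


R = 1.0  # every antifuse has the same resistance; wire resistance ignored
# parent[k] is the node that k hangs from (None means the driver G1).
parent = {1: None, 2: 1, 3: 2, 4: 3, 5: 4, 6: 2, 7: 6}
# resistance[k] is the resistance between parent[k] and k
# (assumed encoding: R1 -> node 1, R2 -> node 2, R3 -> node 4,
#  R4 -> node 5, R5 -> node 6, R6 -> node 7).
resistance = {1: R, 2: R, 3: 0.0, 4: R, 5: R, 6: R, 7: R}

er_g2 = elmore_resistances(parent, resistance, 5)  # G2 is node 5
er_g3 = elmore_resistances(parent, resistance, 7)  # G3 is node 7

unit_caps = {k: 1.0 for k in parent}  # hypothetical equal capacitances
d5 = elmore_delay(parent, resistance, unit_caps, 5)
d7 = elmore_delay(parent, resistance, unit_caps, 7)
print(er_g2, er_g3, d5, d7)
```

With these inputs the coefficients match the R, 2R, 2R, 3R, 4R, 2R, 2R and R, 2R, 2R, 2R, 2R, 3R, 4R rows of Section 6.5.3, and with equal capacitances the two delays come out the same.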
P6.7 Short Answer

P6.7.1 Wires in FPGAs

In an FPGA today, what percentage of the clock period is typically consumed by wire delay?

P6.7.2 Age and Time

If you were to compare a typical digital circuit from 5 years ago with a typical digital circuit today, would you find that the percentage of the total clock period consumed by capacitive load has increased, stayed the same, or decreased?

P6.7.3 Temperature and Delay

As temperature increases, does the delay through a typical combinational circuit increase, stay the same, or decrease?

P6.8 Worst Case Conditions and Derating Factor

Assume that we have a Std speed grade Actel A1415 (an ACT 3 part) Logic Module that drives 4 other Logic Modules:

P6.8.1 Worst-Case Commercial

Estimate the delay under worst-case commercial conditions (assume that the junction temperature is the same as the ambient temperature).

P6.8.2 Worst-Case Industrial

Find the derating factor for worst-case industrial conditions and calculate the delay (assume that the junction temperature is the same as the ambient temperature).

P6.8.3 Worst-Case Industrial, Non-Ambient Junction Temperature

Estimate the delay under the worst-case industrial conditions (assuming that the junction temperature is 105°C).

Chapter 7 Power Analysis and Power-Aware Design

7.1 Overview

7.1.1 Importance of Power and Energy

- Laptops, PDAs, cell phones, etc.: obvious!
- For microprocessors in personal computers, every watt above 40W adds $1 to manufacturing cost.
- Approx 25% of the operating expense of a server farm goes to energy bills.
- (Dis)Comfort of Unix labs in E2.
- Sandia Labs had to build a special sub-station when they took delivery of the Teraflops massively parallel supercomputer (over 9000 Pentium Pros).
- High-speed microprocessors today can run so hot that they will damage themselves: Athlon reliability problems, Pentium 4 processor thermal throttling.
- In 2000, information technology consumed 8% of total power in the US.
- Future power viruses: cell phone viruses that cause the cell phone to run in full power mode and consume the battery very quickly; PC viruses that cause the CPU to melt down batteries.

7.1.2 Industrial Names and Products

All of the articles and papers below are linked to from the Documentation page on the E&CE 427 web site.

Overview white paper by Intel: "PC Energy-Efficiency Trends and Technologies"
  An 8-page overview of energy and power trends, written in 2002. Available from the web at an intolerably long URL.

AMD's Athlon PowerNow!
  Reduces power consumption in laptops when running on battery by allowing software to reduce clock speed and supply voltage when performance is less important than battery life.

Intel SpeedStep
  Reduces power consumption in laptops when running on battery by reducing clock speed to 70-80% of normal.

Intel X-Scale
  An ARM5-compatible microprocessor for low-power systems:
  http://developer.intel.com/design/intelxscale/

Synopsys PowerMill
  A simulator that estimates power consumption of the circuit as it is simulated:
  http://www.synopsys.com/products/etg/powermill_ds.html

DEC / Compaq / HP Itsy
  A tiny but powerful PDA-style computer running Linux and X-windows. Itsy was created in 1998 by DEC's Western Research Laboratory to be an experimental platform in low-power, energy-efficient computing. Itsy led to the iPAQ PocketPC.
  www.hpl.hp.com/techreports/Compaq-DEC/WRL-2000-6.html
  www.hpl.hp.com/research/papers/2003/handheld.html

Satellites
  Satellites run on solar power and batteries. They travel great distances doing very little, then have a brief period of very intense activity as they pass by an astronomical object of interest. Satellites need efficient means to gather and store energy while they are flying through space. Satellites need powerful, but energy efficient, computing and communication devices to gather, process, and transmit data.
  Designing computing devices for satellites is an active area of research and business.

7.1.3 Power vs Energy

Most people talk about power reduction, but sometimes they mean power and sometimes energy.

- Power minimization is usually about heat removal.
- Energy minimization is usually about battery life or energy costs.

  Type     Units    Equivalent Types   Equations
  Energy   Joules   Work               $\text{Volts} \times \text{Coulombs}$;  $\frac{1}{2} C \times \text{Volts}^2$
  Power    Watts    Energy / Time      $\text{Volts} \times I$;  $\text{Joules} / \text{sec}$

7.1.4 Batteries, Power and Energy

7.1.4.1 Do Batteries Store Energy or Power?

\[ \text{Energy} = \text{Volts} \times \text{Coulombs} \qquad \text{Power} = \frac{\text{Energy}}{\text{Time}} \]

Batteries are rated in Amp-hours at a voltage.

\[ \text{battery} = \text{Amps} \times \text{Seconds} \times \text{Volts} = \frac{\text{Coulombs}}{\text{Seconds}} \times \text{Seconds} \times \text{Volts} = \text{Coulombs} \times \text{Volts} = \text{Energy} \]

Batteries store energy.

7.1.4.2 Battery Life and Efficiency

To extend battery life, we want to increase the amount of work done and/or decrease energy consumed. Work and energy have the same units, therefore to extend battery life, we truly want to improve efficiency.

Power efficiency of microprocessors is normally measured in MIPS/Watt. Is this a real measure of efficiency?

\[ \frac{\text{MIPS}}{\text{Watts}} = \frac{\text{millions of instructions} / \text{Seconds}}{\text{Energy} / \text{Seconds}} = \frac{\text{millions of instructions}}{\text{Energy}} \]

Both instructions executed and energy are measures of work, so MIPS/Watt is a measure of efficiency. (This assumes that all instructions perform the same amount of work!)

7.1.4.3 Battery Life and Power

Question: Running a VHDL simulation requires executing an average of 1 million instructions per simulation step. My computer runs at 700 MHz, has a CPI of 1.0, and burns 70 W of power. My battery is rated at 10 V and 2.5 AH. Assuming all of my computer's clock cycles go towards running VHDL simulations, how many simulation steps can I run on one battery charge?

Answer:

Outline of approach:
1. Calculate amount of energy stored in battery
2. Calculate energy consumed by each simulation step
3. Calculate number of simulation steps that can be run

Energy stored in battery:

\[ E_{batt} = \text{AmpHours} \times V_{batt} = 2.5\,\text{AH} \times 10\,\text{V} = 25\,\text{Watt-Hours} \]
\[ E_{batt} = 25\,\text{Watt-Hours} \times 3600\,\text{Sec/Hour} = 90\,000\,\text{Watt-Secs} = 90\,000\,\text{Joules} \]

Energy per simulation step:

\[ E_{step} = 70\,\text{Watts} \times \frac{1}{700 \times 10^{6}\,\text{cyc/sec}} \times 1.0\,\text{cyc/instr} \times 10^{6}\,\text{instr/step} = 0.1\,\text{Watt-Secs/step} = 0.1\,\text{Joules/step} \]

Number of steps:

\[ \text{NumSteps} = \frac{E_{batt}}{E_{step}} = \frac{90\,000}{0.1} = 900\,000\ \text{steps} \]

Question: If I use the SpeedStep feature of my computer, my computer runs at 600 MHz with 60 W of power. With SpeedStep activated, how much longer can I keep the computer running on one battery?

Answer:

Approach:
1. Calculate uptime with SpeedStep turned off (high power)
2. Calculate uptime with SpeedStep turned on (low power)
3. Calculate difference in uptimes

High-power uptime:

\[ T_H = \frac{E_{batt}}{P_H} = \frac{90\,000\,\text{Watt-Secs}}{70\,\text{Watts}} = 1285\,\text{Secs} \approx 21\ \text{minutes} \]

Low-power uptime:

\[ T_L = \frac{E_{batt}}{P_L} = \frac{90\,000\,\text{Watt-Secs}}{60\,\text{Watts}} = 1500\,\text{Secs} = 25\ \text{minutes} \]

Difference in uptimes:

\[ T_{diff} = T_L - T_H = 25 - 21 = 4\ \text{minutes} \]

Analysis: This question is based on data from a typical laptop. So, why are the predicted uptimes so much shorter than those experienced in reality?

Answer: The power consumption figures are the maximum peak power consumption of the laptop: disk spinning, fan blowing, bus active, all peripherals active, all modules on the CPU turned on. In reality, laptops almost never experience their maximum power consumption.

Question: With SpeedStep activated, how many more simulation steps can I run on one battery?

Answer: Clock speed is proportional to power consumption. In both high-power and low-power modes, the system runs the same number of clock cycles on the energy stored in the battery. So, we run the same number of simulation steps both with and without SpeedStep activated.

Analysis: In reality, with SpeedStep activated, I am able to run more simulation steps. Why does the theoretical calculation disagree with reality?
Answer: In reality, the processor does not use 100% of the clock cycles for running the simulator. Many clock cycles are wasted while waiting for I/O from the disk, user, etc. When the clock speed is reduced, fewer clock cycles are wasted as idle clock cycles.

7.2 Power Equations

\[ \text{Power} = \underbrace{\text{SwitchPower} + \text{ShortPower}}_{\text{DynamicPower}} + \underbrace{\text{LeakagePower}}_{\text{StaticPower}} \]

Dynamic Power: dependent upon clock speed
  Switching Power: useful, charges up transistors
  Short Circuit Power: not useful, both N and P transistors are on
Static Power: independent of clock speed
  Leakage Power: not useful, leaks around transistor

- Dynamic power is proportional to how often signals change their value (switch).
- Roughly 20% of signals switch during a clock cycle.
- Need to take glitches into account when calculating activity factor. Glitches increase the activity factor.
- Equations for dynamic power contain clock speed and activity factor.

7.2.1 Switching Power

[Figure: two panels. "Charging a capacitor": input $1 \to 0$, output $0 \to 1$, load CapLoad. "Discharging a capacitor": input $0 \to 1$, output $1 \to 0$, load CapLoad.]

\[ \text{energy to (dis)charge capacitor} = \frac{1}{2}\,\text{CapLoad} \times \text{VoltSup}^2 \]

When a capacitor $C$ is charged to a voltage $V$, the energy stored in the capacitor is $\frac{1}{2}CV^2$.

The energy required to charge the capacitor from 0 to $V$ is $CV^2$. Half of the energy ($\frac{1}{2}CV^2$) is dissipated as heat through the pullup resistance. Half of the energy is transferred to the capacitor.

When the capacitor discharges from $V$ to 0, the energy stored in the capacitor ($\frac{1}{2}CV^2$) is dissipated as heat through the pulldown resistance.

$f$: frequency at which the inverter goes through a complete charge-discharge cycle (eqn 15.4 in Smith).

\[ \text{average switching power} = f \times \text{CapLoad} \times \text{VoltSup}^2 \]
ClockSpeed = clock speed
ActFact = average number of times that a signal switches from $0 \to 1$ or from $1 \to 0$ during a clock cycle

\[ \text{average switching power} = \frac{1}{2}\,\text{ActFact} \times \text{ClockSpeed} \times \text{CapLoad} \times \text{VoltSup}^2 \]

7.2.2 Short-Circuited Power

[Figure: gate voltage versus time, marking the levels GND, VoltThresh, VoltSup - VoltThresh, and VoltSup, the regions where the P-transistor and N-transistor are on, and the interval TimeShort during which IShort flows]

\[ \text{PwrShort} = \text{ActFact} \times \text{ClockSpeed} \times \text{TimeShort} \times \text{IShort} \times \text{VoltSup} \]

7.2.3 Leakage Power

[Figure: cross section of inverter showing parasitic diode, with leakage current ILeak]

Leakage current through parasitic diode:

\[ \text{PwrLk} = \text{ILeak} \times \text{VoltSup} \]
\[ \text{ILeak} \propto e^{-\frac{q\,\text{VoltThresh}}{kT}} \]

7.2.4 Glossary

ClockSpeed
  def: clock speed
  aka: $f$

ActFact
  def: activity factor
  aka: $A$
  $= \dfrac{\text{NumTransitions}}{\text{NumSignals} \times \text{NumClockCycles}}$
  Per signal: percentage of clock cycles when the signal changes value.
  Per clock cycle: percentage of signals that change value per clock cycle.
  Note: When measuring per circuit, sometimes approximate by looking only at flops, rather than every single signal.

TimeShort
  def: short circuit time
  Time that both N and P transistors are turned on when a signal changes value.

MaxClockSpeed
  def: maximum clock speed that an implementation technology can support
  aka: $f_{max}$
  $\propto \dfrac{(\text{VoltSup} - \text{VoltThresh})^2}{\text{VoltSup}}$

VoltSup
  def: supply voltage
  aka: $V$

VoltThresh
  def: threshold voltage
  aka: $V_{th}$
  Voltage at which P transistors turn on.

ILeak
  def: leakage current
  $\propto I_S\, e^{-\frac{q\,\text{VoltThresh}}{kT}}$, where $I_S$ is the reverse bias saturation current

IShort
  def: short circuit current
  aka: $I_{short}$
  Current that goes through the transistor network while both N and P transistors are turned on.

CapLoad
  def: load capacitance
  aka: $C_L$

PwrSw
  def: switching power (dynamic)
  $= \frac{1}{2}\,\text{ActFact} \times \text{ClockSpeed} \times \text{CapLoad} \times \text{VoltSup}^2$

PwrShort
  def: short-circuit power (dynamic)
  $= \text{ActFact} \times \text{ClockSpeed} \times \text{TimeShort} \times \text{IShort} \times \text{VoltSup}$

PwrLk
  def: leakage power (static)
  $= \text{ILeak} \times \text{VoltSup}$

Power
  def: total power
  $= \text{PwrSw} + \text{PwrShort} + \text{PwrLk}$
q
  def: electron charge
  $= 1.60218 \times 10^{-19}$ C

k
  def: Boltzmann's constant
  $= 1.38066 \times 10^{-23}$ J/K

T
  def: temperature in Kelvin

7.2.5 Note on Power Equations

The power equation:

\[
\begin{aligned}
\text{Power} &= \text{DynamicPower} + \text{StaticPower} \\
&= \text{PwrSw} + \text{PwrShort} + \text{PwrLk} \\
&= \left(\tfrac{1}{2}\,\text{ActFact} \times \text{ClockSpeed} \times \text{CapLoad} \times \text{VoltSup}^2\right) \\
&\quad + \left(\text{ActFact} \times \text{ClockSpeed} \times \text{TimeShort} \times \text{IShort} \times \text{VoltSup}\right) \\
&\quad + \left(\text{ILeak} \times \text{VoltSup}\right)
\end{aligned}
\]

is for an individual signal.

To calculate dynamic power for $n$ signals with different CapLoad, TimeShort, and IShort:

\[
\begin{aligned}
\text{DynamicPower} &= \sum_{i=1}^{n} \left(\text{ActFact}_i \times \tfrac{1}{2}\,\text{CapLoad}_i \times \text{ClockSpeed} \times \text{VoltSup}^2\right) \\
&\quad + \sum_{i=1}^{n} \left(\text{ActFact}_i \times \text{ClockSpeed} \times \text{TimeShort}_i \times \text{IShort}_i \times \text{VoltSup}\right)
\end{aligned}
\]

If we know the average CapLoad, TimeShort, and IShort for a collection of $n$ signals, then the above formula simplifies to:

\[
\begin{aligned}
\text{DynamicPower} &= n \left(\text{ActFact}_{AVG} \times \tfrac{1}{2}\,\text{CapLoad}_{AVG} \times \text{ClockSpeed} \times \text{VoltSup}^2\right) \\
&\quad + n \left(\text{ActFact}_{AVG} \times \text{ClockSpeed} \times \text{TimeShort}_{AVG} \times \text{IShort}_{AVG} \times \text{VoltSup}\right)
\end{aligned}
\]

If capacitances and short-circuit parameters don't have an even distribution, then don't average them. If high-capacitance signals have high activity factors, then averaging the equations will result in erroneously low predictions for power.

7.3 Overview of Power Reduction Techniques

We can divide power reduction techniques into two classes: analog and digital.

analog

  Parameters to work with:
  - capacitance: for example, Silicon on Insulator (SOI)
  - resistance: for example, copper wires
  - voltage: low-voltage circuits

  Techniques:
  - dual-VDD: Two different supply voltages: high voltage for performance-critical portions of the design, low voltage for the remainder of the circuit. Alternatively, can vary voltage over time: high voltage when running performance-critical software and low voltage when running software that is less sensitive to performance.
  - dual-Vt: Two different threshold voltages: transistors with low threshold voltage for performance-critical portions of the design (can switch more quickly, but more leakage power), transistors with high threshold voltage for the remainder of the circuit (switches more slowly, but reduces leakage power).
  - exotic circuits: Special flops, latches, and combinational circuitry that run at a high frequency while minimizing power.
  - adiabatic circuits: Special circuitry that consumes power on $0 \to 1$ transitions, but not $1 \to 0$ transitions. These sacrifice performance for reduced power.
  - clock trees: Up to 30% of total power can be consumed in clock generation and the clock tree.

digital

  Parameters to work with:
  - capacitance (number of gates)
  - activity factor
  - clock frequency

  Techniques:
  - multiple clocks: Put a high speed clock in performance-critical parts of the design and a low speed clock for the remainder of the circuit.
  - clock gating: Turn off the clock to portions of a chip when it's not being used.
  - data encoding: Gray coding vs one-hot vs fully encoded vs ...
  - glitch reduction: Adjust circuit delays or add redundant circuitry to reduce or eliminate glitches.
  - asynchronous circuits: Get rid of clocks altogether....

Additional low-power design techniques for RTL from a Qualis engineer: http://home.europa.com/celiac/lowpower.html

7.4 Voltage Reduction for Power Reduction

If our goal is to reduce power, the most promising approach is to reduce the supply voltage, because, from:

\[
\begin{aligned}
\text{Power} &= \left(\text{ActFact} \times \text{ClockSpeed} \times \tfrac{1}{2}\,\text{CapLoad} \times \text{VoltSup}^2\right) \\
&\quad + \left(\text{ActFact} \times \text{ClockSpeed} \times \text{TimeShort} \times \text{IShort} \times \text{VoltSup}\right) \\
&\quad + \left(\text{ILeak} \times \text{VoltSup}\right)
\end{aligned}
\]

we observe:

\[ \text{Power} \propto \text{VoltSup}^2 \]

Reducing Difference Between Supply and Threshold Voltage

As the supply voltage decreases, it takes longer to charge up the capacitive load, which increases the load delay of a circuit.

In the chapter on timing analysis, we saw that increasing the supply voltage will decrease the delay through a circuit. (From $V = IR$, increasing $V$ causes an increase in $I$, which causes the capacitive load to charge more quickly.) However, it is more accurate to take into account both the value of the supply voltage, and the difference between the supply voltage and the threshold voltage.
\[ \text{MaxClockSpeed} \propto \frac{(\text{VoltSup} - \text{VoltThresh})^2}{\text{VoltSup}} \]

Question: If the delay along the critical path of a circuit is 20 ns, the supply voltage is 2.8 V, and the threshold voltage is 0.7 V, calculate the critical path delay if the supply voltage is dropped to 2.2 V.

Answer:

  $d$    = 20 ns    current delay along critical path
  $d'$   = ??       new delay along critical path
  $V$    = 2.8 V    current supply voltage
  $V'$   = 2.2 V    new supply voltage
  $V_t$  = 0.7 V    threshold voltage

\[ \text{MaxClockSpeed} \propto \frac{1}{d} \qquad \text{MaxClockSpeed} \propto \frac{(\text{VoltSup} - \text{VoltThresh})^2}{\text{VoltSup}} \]
\[ d \propto \frac{V}{(V - V_t)^2} \]
\[ \frac{d'}{d} = \frac{V'}{(V' - V_t)^2} \times \frac{(V - V_t)^2}{V} \]
\[ d' = d \times \frac{V'}{(V' - V_t)^2} \times \frac{(V - V_t)^2}{V} = 20\,\text{ns} \times \frac{2.2\,\text{V}}{(2.2\,\text{V} - 0.7\,\text{V})^2} \times \frac{(2.8\,\text{V} - 0.7\,\text{V})^2}{2.8\,\text{V}} = 31\,\text{ns} \]

Reducing Threshold Voltage Increases Leakage Current

If we reduce the supply voltage, we want to also reduce the threshold voltage, so that we do not increase the delay through the circuit. However, as threshold voltage drops, leakage current increases:

\[ \text{ILeak} \propto e^{-\frac{q\,\text{VoltThresh}}{kT}} \]

And increasing the leakage current increases the power:

\[ \text{Power} \propto \text{ILeak} \]

So, we need to strike a balance between reducing VoltSup (which has a quadratic effect on reducing power), and increasing ILeak, which has a linear effect on increasing power.

7.5 Data Encoding for Power Reduction

7.5.1 How Data Encoding Can Reduce Power

Data encoding is a technique that chooses data values so that normal execution will have a low activity factor.

The most common example is Gray coding, where exactly one bit changes value each clock cycle when counting.

  Decimal   Gray   Binary
   0        0000   0000
   1        0001   0001
   2        0011   0010
   3        0010   0011
   4        0110   0100
   5        0111   0101
   6        0101   0110
   7        0100   0111
   8        1100   1000
   9        1101   1001
  10        1111   1010
  11        1110   1011
  12        1010   1100
  13        1011   1101
  14        1001   1110
  15        1000   1111

Question: For an eight-bit counter, how much more power will a binary counter consume than a Gray-code counter?
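One rough way to approach the eight-bit counter question is to count bit transitions over one full counting period. The sketch below is only an estimate under assumptions that are not stated in the notes: every counter bit has the same capacitance, power is proportional to the number of transitions on the counter outputs, and the next-state logic and glitches are ignored. The helper names `to_gray` and `transitions_per_period` are made up for illustration.

```python
def to_gray(n: int) -> int:
    """Standard reflected Gray code of n."""
    return n ^ (n >> 1)


def transitions_per_period(bits: int, encode) -> int:
    """Count output bit flips over one full period of a wrap-around counter.

    `encode` maps the count value to the bit pattern that is actually stored.
    """
    period = 1 << bits
    total = 0
    for i in range(period):
        cur = encode(i)
        nxt = encode((i + 1) % period)  # include the wrap from max back to 0
        total += bin(cur ^ nxt).count("1")  # bits that differ between states
    return total


# Assumption: equal capacitance per bit, so power scales with flip count.
binary_flips = transitions_per_period(8, lambda n: n)
gray_flips = transitions_per_period(8, to_gray)
ratio = binary_flips / gray_flips

print(binary_flips, gray_flips, ratio)
```

Under these assumptions the binary counter makes 510 transitions per 256-cycle period against 256 for the Gray counter, so it would consume roughly twice the power.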
Question: For completely random eight-bit data, how much more power will a binary circuit consume than a Gray-code circuit?

Answer: If the data is completely random, then the Gray code loses its feature that consecutive data will differ in only one bit position. In fact, the activity factor for Gray code and binary code will be the same. There will not be any power saving by using Gray code. A binary circuit will consume the same power as a Gray-code circuit.

On average, half of the bits will be 1 and half will be 0. For each bit, there are four possible transitions: $0 \to 0$, $0 \to 1$, $1 \to 0$, and $1 \to 1$. Of these four transitions, two cause a change in value and two do not. Half of the transitions result in a change in value, therefore for random data the activity factor will be 0.5, independent of data encoding or the number of bits.

7.5.2 Example Problem: Sixteen Pulser

7.5.2.1 Problem Statement

Your task is to do the power analysis for a circuit that should send out a one-clock-cycle pulse on the done signal once every 16 clock cycles. (That is, done is 0 for 15 clock cycles, then 1 for one cycle, then repeat with 15 cycles of 0 followed by a 1, etc.)

[Figure: required behaviour. Waveforms of clk and done over clock cycles 1 to 33, with done pulsing at cycles 16 and 32]

You have been asked to consider three different types of counters: a binary counter, a Gray-code counter, and a one-hot counter. (The table below shows the values from 0 to 15 for the different encodings.)

Question: What is the relative amount of power consumption for the different options?

7.5.2.2 Additional Information

Your implementation technology is an FPGA where each cell has a programmable combinational circuit and a flip-flop. The combinational circuit has 4 inputs and 1 output. The capacitive load of the combinational circuit is twice that of the flip-flop.

[Figure: PLA cell]

1. You may neglect power associated with clocks.
2. You may assume that all counters:
   (a) are implemented on the same fabrication process
   (b) run at the same clock speed
   (c) have negligible leakage and short-circuit currents

7.5.2.3 Answer

Outline of Thinking

Factors to consider that distinguish the options: capacitance and activity factor.

Capacitance is dependent upon the number of signals, and whether a signal is combinational or a flop. Sketch out the circuitry to evaluate capacitance.

Sketch the Circuitry

Name the output done and the count digits d().

[Figure: block diagram for Gray and Binary counters. Four PLA cells produce d(0) to d(3), and one PLA produces done]

[Figure: block diagram for One-Hot. Sixteen cells produce d(0) to d(15), and d(15) is used as done]

Observation: The Gray and Binary counters have the same design, and the Gray counter will have the lower activity factor. Therefore, the Gray counter will have lower power than the Binary counter. However, we don't know how much lower the power of the Gray counter will be, and we don't know how much power the One-Hot counter will consume.

Capacitance

  Counter   Signal   Type    cap   number   subtotal cap
  Gray      d()      PLAs    2     4        8
                     Flops   1     4        4
            done     PLAs    2     1        2
                     Flops   1     0        0
  1-Hot     d()      PLAs    2     0        0
                     Flops   1     16       16
            done     PLAs    2     0        0
                     Flops   1     0        0
  Binary    d()      PLAs    2     4        8
                     Flops   1     4        4
            done     PLAs    2     1        2
                     Flops   1     0        0

Activity Factors

[Figure: waveforms of clk, d(), and done for each encoding, annotated with transitions per 16 clock cycles]

  Encoding   d(0)    d(1)   d(2)   d(3)   done
  Gray       8/16    4/16   2/16   2/16   2/16
  One-hot    2/16    2/16   2/16   2/16   2/16   (every d(i) is 2/16)
  Binary     16/16   8/16   4/16   2/16   2/16

  Counter   Signal   Type    act fact
  Gray      d()      PLAs    1/4 of signals in each clock cycle
                     Flops   1/4 of signals in each clock cycle
            done     PLAs    2 transitions / 16 clock cycles
  1-Hot     d()      Flops   2 transitions / 16 clock cycles
  Binary    d()      PLAs    (16 + 8 + 4 + 2 transitions) / (4 signals × 16 clock cycles) = 0.47
                     Flops   (16 + 8 + 4 + 2 transitions) / (4 signals × 16 clock cycles) = 0.47
            done     PLAs    2 transitions / 16 clock cycles

Note: Activity factor for One-Hot counter. Because all signals have the same capacitance, and all clock cycles have the same number of transitions for the One-Hot counter, we could have calculated the activity factor as two transitions per sixteen signals.

Putting it all Together

  Counter   Signal   Type    subtotal cap   act fact   power
  Gray      d()      PLAs    8              1/4        2
                     Flops   4              1/4        1
            done     PLAs    2              2/16       4/16
                     Flops   0                         0
            Total                                      3.25
  1-Hot     d()      PLAs    0                         0
                     Flops   16             2/16       2
            done     PLAs    0                         0
                     Flops   0                         0
            Total                                      2
  Binary    d()      PLAs    8              0.47       3.76
                     Flops   4              0.47       1.88
            done     PLAs    2              2/16       0.25
                     Flops   0                         0
            Total                                      5.87

If we choose Binary counting as the baseline, then the relative amounts of power, computed from the totals above, are:

  Gray      ≈ 55%
  One-Hot   ≈ 34%
  Binary    100%

If we choose One-Hot counting as the baseline, then the relative amounts of power are:

  Gray      ≈ 163%
  One-Hot   100%
  Binary    ≈ 294%

7.6 Clock Gating

The basic idea of clock gating is to reduce power by turning off the clock when a circuit isn't needed. This reduces the activity factor.

7.6.1 Introduction to Clock Gating

Examples of Clock Gating

  Condition                                             Circuitry turned off
  O/S in standby mode                                   Everything except core state (PC, registers, caches, etc)
  No floating point instructions for k clock cycles     Floating point circuitry
  Instruction cache miss                                Instruction decode circuitry
  No instruction in pipe stage i                        Pipe stage i
Design Tradeoffs

- Can significantly reduce activity factor (Synopsys PowerCompiler claims that it can cut power to 50-80% of the ungated level)
- Increases design complexity: more design effort, more bugs!
- Increases area
- Increases clock skew

Functional Validation and Clock Gating

It's a functional bug to turn a clock off when it's needed for valid data.

It's functionally ok, but wasteful, to turn a clock on when it's not needed.

(About 5% of the bugs caught on Willamette (Intel Pentium 4 Processor) were related to clock gating.)

Nicolas Mokhoff. EE Times. June 27, 2001. http://www.edtn.com/story/OEG20010621S0080

7.6.2 Implementing Clock Gating

Clock gating is implemented by adding a component that disables the clock when the circuit isn't needed.

[Figure: without clock gating. Signals i_data, i_valid, clk, o_data, o_valid]

[Figure: with clock gating. Signals i_data, i_valid, clk, o_data, o_valid, cool_clk, clk_en, i_wakeup, and a Clock Enable State Machine]

The total power of a circuit with clock gating is the sum of the power of the main circuit with a reduced activity factor and the power of the clock gating state machine with its activity factor.

The clock-gating state machine must always be on, so that it will detect the wakeup signal. Do not make the mistake of gating the clock to your clock gating circuit!

7.6.3 Design Process

Design Decisions

- What level of granularity for gated clocks?
  - entire module?
  - individual pipe stages?
  - something in between?
- When should the clocks turn off?
- When should the clocks turn on?
- Protocol for incoming wakeup signal?
- Protocol for outgoing wakeup signal?

Wakeup Protocol
Designers negotiate the incoming and outgoing wakeup protocol with the environment.

An example wakeup protocol:
- wakeup_in will arrive 1 clock cycle before valid data
- wakeup_in will stay high until there have been at least 3 cycles of invalid data

Design Strategy

When designing clock gating circuitry, consider the two extreme cases:
- a constant stream of valid data
- circuit is turned off and receives a single parcel of valid data

For a constant stream of valid data, the key is to not incur a large overhead in design complexity, area, or clock period when clocks will always be toggling.

For a single parcel of valid data, the key is to make sure that the clocks are toggling so that data can percolate through the circuit. Also, we want to turn off the clock as soon as possible after data leaves.

7.6.4 Effectiveness of Clock Gating

We can measure the effectiveness of clock gating by comparing the percentage of clock cycles when the clock is not toggling to the percentage of clock cycles that the circuit does not have valid data (i.e. the clock does not need to toggle).

The most ineffective clock gating scheme is to never turn off the clock (let the clock always toggle). The most effective clock gating scheme is to turn off the clock whenever the circuit is not processing valid data.

Parameters to characterize effectiveness of clock gating:

  Eff        = effectiveness of clock gating
  PctValid   = percentage of clock cycles with valid data in the circuit (the clock must be toggling)
  PctClk     = percentage of clock cycles that the clock toggles

Effectiveness measures the percentage of clock cycles with invalid data in which the clock is turned off.
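The verbal definition above can be sketched as a small calculation. This is an illustrative sketch only: the function name `clock_gating_effectiveness` and the choice to express the percentages as fractions between 0 and 1 are assumptions made here, not part of the notes.

```python
def clock_gating_effectiveness(pct_valid: float, pct_clk: float) -> float:
    """Fraction of invalid-data cycles in which the clock is turned off.

    Both arguments are fractions in [0, 1] (assumed convention):
    pct_valid is the share of cycles with valid data in the circuit,
    pct_clk is the share of cycles in which the gated clock toggles.
    """
    pct_invalid = 1.0 - pct_valid
    if pct_invalid <= 0.0:
        raise ValueError("undefined when the circuit always has valid data")
    pct_clk_off = 1.0 - pct_clk
    return pct_clk_off / pct_invalid


# Hypothetical operating points for a circuit with valid data 70% of the time.
never_gated = clock_gating_effectiveness(pct_valid=0.7, pct_clk=1.0)
ideal = clock_gating_effectiveness(pct_valid=0.7, pct_clk=0.7)
partial = clock_gating_effectiveness(pct_valid=0.7, pct_clk=0.73)

print(never_gated, ideal, partial)
```

With valid data 70% of the time, a clock that toggles 73% of the time is off for 0.27 of the 0.30 invalid fraction, which is about 90% effective.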
Equation for effectiveness of clock gating:

\[ \text{Eff} = \frac{\text{PctClkOff}}{\text{PctInvalid}} = \frac{1 - \text{PctClk}}{1 - \text{PctValid}} \]

Question: What is the effectiveness if the clock toggles only when there is valid data?

Answer: PctClk = PctValid and the effectiveness should be 1:

\[ \text{Eff} = \frac{1 - \text{PctClk}}{1 - \text{PctValid}} = \frac{1 - \text{PctValid}}{1 - \text{PctValid}} = 1 \]

Question: What is the effectiveness of a clock that always toggles?

Answer: If the clock is always toggling, then PctClk = 100% and the effectiveness should be 0.

\[ \text{Eff} = \frac{1 - \text{PctClk}}{1 - \text{PctValid}} = \frac{1 - 1}{1 - \text{PctValid}} = 0 \]

Question: What does it mean for a clock gating scheme to be 75% effective?

Answer: 75% of the time that there is invalid data, the clock is off.

Question: What happens if PctClk is less than PctValid?

Answer: If PctClk is less than PctValid, then:

\[ 1 - \text{PctClk} \gt 1 - \text{PctValid} \]

so the effectiveness will be greater than 100%. In some sense, it makes sense that the answer would be nonsense, because a clock gating scheme that is more than 100% effective is too effective: it is turning off the clock sometimes when it shouldn't!

We can see the effect of the effectiveness of a clock-gating scheme on the activity factor:

[Figure: plot of the new activity factor $A'$ against Eff, falling linearly from $A$ at Eff = 0 to $\text{PctValid} \times A$ at Eff = 1]

When the effectiveness is zero, the new activity factor is the same as the original activity factor. For a 100% effective clock gating scheme, the activity factor is $A \times \text{PctValid}$. Between 0% and 100% effectiveness, the activity factor decreases linearly.

The new activity factor with a clock gating scheme is:

\[ A' = \left(1 - (1 - \text{PctValid}) \times \text{Eff}\right) A \]

7.6.5 Example: Reduced Activity Factor with Clock Gating

Question: How much power will be saved in the following clock-gating scheme?
- 70% of the time the main circuit has valid data
- clock gating circuit is 90% effective (90% of the time that the circuit has invalid data, the clock is off)
- clock gating circuit has 10% of the area of the main circuit
- clock gating circuit has same activity factor as main circuit
- neglect short-circuiting and leakage power

Answer:

1. Set up main equations

  $\text{PwrMain}$     = power for main circuit without clock gating
  $\text{Pwr}'_{Main}$ = power for main circuit with clock gating
  $\text{PwrClkFsm}$   = power for clock enable state machine

\[ \text{Pwr}'_{Tot} = \text{Pwr}'_{Main} + \text{PwrClkFsm} \]
\[ \text{Pwr} = \text{PwrSw} + \text{PwrLk} + \text{PwrShort} \]
\[ \text{PwrSw} = \tfrac{1}{2} A C V^2 \qquad \text{PwrLk} = \text{negligible} \qquad \text{PwrShort} = \text{negligible} \]
\[ \text{Pwr} = \tfrac{1}{2} A C V^2 \]

\[ \frac{\text{Pwr}'_{Tot}}{\text{PwrTot}} = \frac{\tfrac{1}{2} A'_{Main} C_{Main} V^2 + \tfrac{1}{2} A_{ClkFsm} C_{ClkFsm} V^2}{\tfrac{1}{2} A_{Main} C_{Main} V^2} \]

With $A_{Main} = A$, $C_{Main} = C$, $A'_{Main} = A'$, $A_{ClkFsm} = A$, and $C_{ClkFsm} = 0.1C$:

\[ \frac{\text{Pwr}'_{Tot}}{\text{PwrTot}} = \frac{\tfrac{1}{2} A' C V^2 + \tfrac{1}{2} A (0.1C) V^2}{\tfrac{1}{2} A C V^2} = \frac{A' + 0.1A}{A} \]

2. Find new activity factor for main circuit ($A'$):

\[ A' = \left(1 - \text{Eff}\,(1 - \text{PctValid})\right) A = \left(1 - 0.9\,(1 - 0.7)\right) A = 0.73A \]

3. Find ratio of new total power to previous total power:

\[ \frac{\text{Pwr}'_{Tot}}{\text{PwrTot}} = \frac{A' + 0.1A}{A} = \frac{0.73A + 0.1A}{A} = 0.83 \]

4. Final answer: new power is 83% of original power.

7.6.6 Clock Gating with Valid-Bit Protocol

A common technique to determine when a circuit has valid data is to use a valid-bit protocol. In section 7.6.6.1 we review the valid-bit protocol and then in section 7.6.6.3 we add clock-gating circuitry to a circuit that uses the valid-bit protocol.

7.6.6.1 Valid-Bit Protocol

Need a mechanism to tell the circuit when to pay attention to data inputs, e.g. when is it supposed to decode and execute an instruction, or write data to a memory array?

[Figure: block with inputs clk, i_valid, i_data and outputs o_valid, o_data, with waveforms of clk, i_valid, i_data, o_valid, and o_data]

i_valid: high when i_data has valid data. Signifies whether the circuit should pay attention to or ignore the data.

o_valid: high when o_data has valid data. Signifies whether the environment should pay attention to the output of the circuit.

For more on circuit protocols, see section 2.8.

Microscopic Analysis

Which clock edges are needed?

[Figure: block with i_valid, clk, and o_valid, with waveforms of clk, i_valid, and o_valid]

7.6.6.2 How Many Clock Cycles for Module?
Given a module with latency Lat, if the module receives a stream of NumPcls consecutive valid parcels, how many clock cycles must the clock-enable signal be asserted? ti1 to1 tik tok tstart tlast time of rst i valid time of rst o valid time of last i valid time of last o valid rst clock cycle with clock enabled last clock cycle with clock enabled Initial equations to describe relationships between different points in time: to1 tok trst ti1 1 tlast tok 1 ti1 Lat to1 NumPcls 1 To understand the 1 in the equation for tok, examine the situation when NumPcls one parcel going through the system to1 ti1 Lat, so we have: tok to1 1 1. In the equation for tlast , we need the 1 to clear the last valid bit. 1. With just Solve for the length of time that the clock must be enabled. The 1 at the end of this equation is becuase if tlast trst , we would have the clock enabled for 1 clock cycle. ClkEnLen tlast trst 1 tok 1 ti1 1 1 tok ti1 1 to1 NumPcls 1 ti1 1 to1 NumPcls ti1 ti1 Lat NumPcls ti1 Lat NumPcls 418 CHAPTER 7. POWER ANALYSIS AND POWER-AWARE DESIGN We are left with the formula that the number of clock cycles that the modules clock must be enabled is the latency through the module plus the number of consecutive parcels. 7.6.6.3 Adding Clock-Gating Circuitry ................................................................... data_out valid_out Before Clock Gating data_in valid_in clk clk valid_in data_in valid_out data_out dont care uninitialized After Clock Gating: Circuitry data_in valid_in ........................................................ . 
data_out valid_out hot_clk clk_en wakeup_in Clock Enable State Machine cool_clk wakeup_out hot clk: clock that always toggles cool clk: gated clock sometimes toggles, sometimes stays low wakeup: alerts circuit that valid data will be arriving soon clk en: turns on cool clk 7.6.6 Clock Gating with Valid-Bit Protocol 419 After Clock Gating: New Signals hot_clk wakeup_in valid_in data_in clk_en cool_clk valid_out data_out wakeup_out . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 420 CHAPTER 7. POWER ANALYSIS AND POWER-AWARE DESIGN 7.6.7 Example: Pipelined Circuit with Clock-Gating Design a clock enable state machine for the pipelined component described below. capacitance of pipelined component = 200 latency varies from 5 to 10 clock cycles, even distribution of latencies contains a maximum of 6 instructions (parcels of data). 60% of incomming parcels are valid average length of continuous sequence of valid parcels is 80 use input and output valid bits for wakeup leakage current is negligible short-circuit current is negligible Capacitance of building blocks (per bit) for state machine eq comparator increment increment / reset increment / decrement le,lt,eq comparator increment / decrement / reset ip-op 2 3 4 5 5 6 4 The two factors affecting power are activity factor and capacitance. 1. Scenario: turned off and get one parcel. (a) Need to turn on and stay on until parcel departs (b) idea #1 (parcel count): count number of parcels inside module keep clocks toggling if have non-zero parcels. (c) idea #2 (cycle count): count number of clock cycles since last valid parcel entered module once hit 10 clock cycles without any valid parcels entering, know that all parcels have exited. keep clocks toggling if counter is less than 10 2. 
Scenario: constant stream of parcels
   (a) Parcel count would require looking at the input and output streams and conditionally incrementing or decrementing the counter.
   (b) Cycle count would keep resetting the counter.

Waveforms

[Figure: waveforms over clock cycles 1 to 24. Parcel-count design: i_valid, o_valid, parcel_count, parcel_clk_en. Cycle-count design: i_valid, o_valid, cycle_count, cycle_clk_en.]

Outline:
1. sketch out circuitry for parcel count and cycle count state machine
2. estimate capacitance of each state machine
3. estimate activity factor of main circuit, based on behaviour

Parcel Count Design

Need to count (0..6) parcels, therefore need 3 bits for the counter. The counter must be able to increment and decrement. Equations for counter action (increment/decrement/no-change):

  i_valid   o_valid   action
  0         0         no change
  0         1         decrement
  1         0         increment
  1         1         no change

To keep the clock enabled for an additional clock cycle to clear the valid bit, add an extra flop to hold a delayed version of o_valid. Use this delayed o_valid to decrement the counter.

In addition to the increment/decrement counter, we need an equality test on parcel count for zero, so that we know whether the clock should be on or off.

Each bit of the counter needs:

  component    cap
  flip-flop    4
  inc/dec      5
  eq-comp      2
  total        11

Total capacitance is $3 \times 11 + 4 = 37$ (three counter bits plus the extra flop for the delayed o_valid).

Cycle Count Design

The maximum latency is 10 clock cycles.
We need to keep the clock enabled for 11 clock cycles, so that we can clear the valid bit. We will count from 0 to 11; when the counter reaches 11, we saturate the counter and turn off the clock.

Need to count (0..11), therefore need 4 bits for the counter. The counter must be able to increment, saturate, and reset.

  i_valid   saturated   action
  0         0           increment
  0         1           no change
  1         0           reset
  1         1           reset

Use an equality comparator to detect when saturated. The clock is enabled whenever the counter is not saturated, so we can use a single comparator for both detecting saturation and enabling the clock.

Each bit of the counter needs:

  component    cap
  flip-flop    4
  inc/reset    4
  eq-comp      2
  total        10

Total capacitance is $4 \times 10 = 40$.

Capacitance result:
  parcel count : capacitance = 37
  cycle count  : capacitance = 40

Behavioural Analysis

Question: Without further detailed analysis, can we determine which design option is better?

Answer: If a parcel leaves after 5 clock cycles, cycle count will continue to power the circuit for another 5 cycles (wasting power!). So, parcel count wins on both capacitance and activity factor.

If we needed only to determine which option was better, we could stop now. This analysis has approximated that the activity factors for the clock enabling circuit will be the same for both options. For these state machines and implementation technology, estimating the activity factor would be very complicated.

Question: Which design option has lower power and how much lower is it?

Answer: Goal: determine what percentage of time cool_clk is toggling for each of the two design options.

1. Assume that all three of the circuits in question (main circuit without clock gating, and the two clock enable state machines) have the same activity factor.
2. Construct average waveform for cool clock.
   (a) 60% of incoming data are valid
   (b) average length of valid data is 80 instructions
   (c) length of window for average data is:
       $\mathit{WindowLength} = \mathit{ValidLength} / \mathit{PctValid} = 80 / 0.6 = 133$ cycles
   [Figure: average window of 133 clock cycles containing 80 valid parcels.]
3. Calculate percentage of clock cycles that the parcel count circuit is powered.
   (a) Clock will run for: 80 clock cycles + (average latency $-$ 1) + 1 cycle to clear out the last parcel. The first clock cycle of the last parcel is counted in the 80 clock cycles; hence, we take the average latency $-$ 1, rather than the average latency. The last clock cycle clears out the last valid parcel by flopping in an invalid parcel. See section 7.6.6.1.
   (b) Minimum latency is 5, max is 10, distribution is even. Therefore average latency is 7.5.
   (c) Clock will run for: $80 + 7.5 - 1 + 1 = 87.5$ cycles.
   (d) Percentage clocking is $87.5 / 133 = 65.8\%$.
4. Calculate percentage of clock cycles that the cycle count circuit is powered.
   (a) Clock will run for: 80 clock cycles + (10 $-$ 1) for powering the last parcel + 1 cycle to clear out the last parcel = 90.0 clock cycles.
   (b) Percentage clocking is $90.0 / 133 = 67.7\%$.
5. Total power consumption

                         Parcel Count   Cycle Count
   Main capacitance      200            200
   FSM capacitance       37             40
   Percentage clocking   65.8%          67.7%

   Use $A$ for the activity factor without clock gating.

   \begin{align*}
   P_{tot} &= P_{main} + P_{fsm} \\
           &= (\mathit{PctClk} \times \mathit{ActFact} \times C_{main}) + (\mathit{ActFact} \times C_{fsm}) \\
   P_{pcl} &= (65.8\% \times A \times 200) + (A \times 37) = 168.6A \\
   P_{cyc} &= (67.7\% \times A \times 200) + (A \times 40) = 175.4A
   \end{align*}

   Parcel count consumes less power.
6. How much more power does the cycle count design consume?

   \[
   n\%\ \text{more power} = \frac{\mathit{CycPwr} - \mathit{PclPwr}}{\mathit{PclPwr}} = \frac{175.4 - 168.6}{168.6} = 4\%
   \]

7.7 Power Problems

P7.1 Short Answers

P7.1.1 Power and Temperature

As temperature increases, does the power consumed by a typical combinational circuit increase, stay the same, or decrease?
P7.1.2 Leakage Power

The new vice president of your company has set up a contest for ideas to reduce leakage power in the next generation of chips that the company fabricates. The prize for the person who submits the suggestion that makes the best tradeoff between leakage power and other design goals is to have a door installed on their cube. What is your door-winning idea, and what tradeoffs will your idea require in order to achieve the reduction in leakage power?

P7.1.3 Clock Gating

In what situations could adding clock-gating to a circuit increase power consumption?

P7.1.4 Gray Coding

What are the tradeoffs in implementing a program counter for a microprocessor using Gray coding?

P7.2 VLSI Gurus

The VLSI gurus at your company have come up with a way to decrease the average rise and fall time (0-to-1 and 1-to-0 transitions) for signals. The current value is 1ns. With their fabrication tweaks, they can decrease this to 0.85ns.

P7.2.1 Effect on Power

If you implement their suggestions, and make no other changes, what effect will this have on power? (NOTE: Based on the information given, be as specific as possible.)

P7.2.2 Critique

A group of wannabe performance gurus claim that the above optimization can be used to improve performance by at least 15%. Briefly outline what their plan probably is, critique the merits of their plan, and describe any effect their performance optimization will have on power.

P7.3 Advertising Ratios

One day you are strolling the hallways in search of inspiration, when you bump into a person from the marketing department. The marketing department has been out surfing the web and has noticed that companies are advertising the MIPs/mm², MIPs/Watt, and Watts/cm³ of their products. This wide variety of different metrics has confused them. Explain whether each metric is a reasonable metric for customers to use when choosing a system. If the metric is reasonable, say whether bigger is better (e.g.
500 MIPs/mm² is better than 20 MIPs/mm²) or smaller is better (e.g. 20 MIPs/mm² is better than 500 MIPs/mm²), and which one type of product (cell phone, desktop computer, or compute server) the metric is most relevant to.

- MIPs/mm²
- MIPs/Watt
- Watts/cm³

P7.4 Vary Supply Voltage

As the supply voltage is scaled down (reduced in value), the maximum clock speed that the circuit can run at decreases. The scaling down of supply voltage is a popular technique for minimizing power. The maximum clock speed is related to the supply voltage by the following equation:

\[
\mathit{MaxClockSpeed} \propto \frac{(\mathit{VoltSup} - \mathit{VoltThresh})^2}{\mathit{VoltSup}}
\]

where VoltSup is the supply voltage and VoltThresh is the threshold voltage.

With a supply voltage of 3V and a threshold voltage of 0.8V, the maximum clock speed is measured to be 200MHz. What will the maximum clock speed be with a supply voltage of 1.5V?

P7.5 Clock Speed Increase Without Power Increase

The following are given:
- You need to increase the clock speed of a chip by 10%
- You must not increase its dynamic power consumption
- The only design parameter you can change is supply voltage
- Assume that short-circuiting current is negligible

P7.5.1 Supply Voltage

How much do you need to decrease the supply voltage by to achieve this goal?

P7.5.2 Supply Voltage

What problems will you encounter if you continue to decrease the supply voltage?

P7.6 Power Reduction Strategies

In each low power approach described below, identify which component(s) of the power equation is (are) being minimized and/or maximized:

P7.6.1 Supply Voltage

Designers scaled down the supply voltage of their ASIC.

P7.6.2 Transistor Sizing

The transistors were made larger.

P7.6.3 Adding Registers to Inputs

All inputs to functional units are registered.

P7.6.4 Gray Coding

Gray coding of signals is used for address signals.
P7.7 Power Consumption on New Chip

While you are eating lunch at your regular table in the company cafeteria, a vice president sits down and starts to talk about the difficulties with a new chip. The chip is a slight modification of an existing design that has been ported to a new fabrication process. Earlier that day, the first sample chips came back from fabrication. The good news is that the chips appear to function correctly. The bad news is that they consume about 10% more power than had been predicted. The vice president explains that the extra power consumption is a very serious problem, because power is the most important design metric for this chip. The vice president asks you if you have any idea of what might cause the chips to consume more power than predicted.

P7.7.1 Hypothesis

Hypothesize a likely cause for the surprisingly large power consumption, and justify why your hypothesis is likely to be correct.

P7.7.2 Experiment

Briefly describe how to determine if your hypothesized cause is the real cause of the surprisingly large power consumption.

P7.7.3 Reality

The vice president wants to get the chips out to market quickly and asks you if you have any ideas for reducing their power without changing the design or fabrication process. Describe your ideas, or explain why her suggestion is infeasible.

Chapter 8 Fault Testing and Testability

8.1 Faults and Testing

8.1.1 Overview of Faults and Testing

8.1.1.1 Faults (Smith 14.3)

During manufacturing, faults can occur that make the physical product behave incorrectly.

Definition: A fault is a manufacturing defect that causes a wire, poly, diffusion, or via to either break or connect to something it shouldn't.
[Figure: good wires, shorted wires, and an open wire.]

8.1.1.2 Causes of Faults (Smith 14.3)

Fabrication process (initial construction is bad):
- chemical mix
- impurities
- dust

Manufacturing process (damage during construction):
- handling
- probing
- cutting
- mounting
- materials
- corrosion
- adhesion failure
- cracking
- peeling

8.1.1.3 Testing (Smith 14)

Definition Testing: the process of checking that the manufactured wafer/chip/board/system has the same functionality as the simulations.

8.1.1.4 Burn In (Smith 14.3.1)

Some chips that come off the manufacturing line will work for a short period of time and then fail.

Definition Burn-in: the process of subjecting chips to extreme conditions (high and low temperatures, high and low voltages, high and low clock speeds) before and during testing.

The purpose is to cause (and catch) failures in chips that would pass a normal test, but fail in early use by customers.

[Figure: a soon-to-break wire.]

The hope is that the extreme conditions will cause chips to break that would otherwise have broken in the customer's system soon after arrival. The trick is to create conditions that are extreme enough that bad chips will break, but not so extreme as to cause good chips to break.

8.1.1.5 Bin Sorting (Smith 5.1.6)

Each chip (or wafer) is run at a variety of clock speeds. The chips are grouped and labeled (binned) by the maximum clock frequency at which they will work reliably. For example, chips coming off of the same production line might be labelled as 800MHz, 900MHz, and 1000MHz.

Overclocking is taking a chip rated at $n$ MHz and running it at $(1 + x) \times n$ MHz. (Sure, your computer often crashes and loses your assignment, but just think how much more productive you are when it is working...)
8.1.1.6 Testing Techniques (Smith 14)

Scan Testing or Boundary Scan Testing (BST, JTAG) (Smith 14.2, 14.6):
- Load test vector from tester into chip
- Run chip on test data
- Unload result data from chip to tester
- Compare results from chip against those produced by simulation
- If results are different, then the chip was not manufactured correctly

Built In Self Test (BIST) (Smith 14.7): Build circuitry on chip that generates tests and compares actual and expected results.

IDDQ Testing (Smith 14.3.6): Measure the quiescent current between VDD and GND. Variations from expected values indicate faults.

Challenges

The challenges in testing:
- test circuitry consumes chip area
- test circuitry reduces performance
- decrease fault escapee rate of product that ships while having minimal impact on production cost and chip performance
- external tester can only look at I/O pins
- ratio of internal signals to I/O pins is increasing
- some faults will only manifest themselves at high clock frequencies

"The crux of testing is to use yesterday's technology to find faults in tomorrow's chips." Agilent engineer at ARVLSI 2001.

8.1.1.7 Design for Testability (DFT) (Smith 14.6)

Scan testing and self-testing require adding extra circuitry to chips. Design for test is the process of adding this circuitry in a disciplined and correct manner. A hot area of research, which is becoming mainstream practice, is developing synthesis tools to automatically add the testing circuitry.

8.1.2 Example Problem: Economics of Testing (Smith 14.1)

Given information:
- The ACHIP costs $10 without any testing
- Each board uses one ACHIP (plus lots of other chips that we don't care about)
- 68% of the manufactured ACHIPs do not have any faults
- For the ACHIP, it costs $1 per chip to catch half of the faults
- Each 50% reduction in fault escapees doubles the cost of testing (intuition: doubles the number of tests that are run)
- If board-level testing detects a bad ACHIP, it costs $200 to replace the ACHIP
- Board-level testing will detect 100% of the faults in an ACHIP

Question: What escapee fault rate will minimize the cost of the ACHIP?

Answer:

\[
\mathit{TotCost} = \mathit{NoTestCost} + \mathit{TestCost} + \mathit{EscapeeProb} \times \mathit{ReplaceCost}
\]

  NoTestCost   TestCost   EscapeeProb   EscapeeProb × ReplaceCost   TotCost
  $10          $0         32%           200 × 0.32  = $64           $74
  $10          $1         16%           200 × 0.16  = $32           $43
  $10          $2         8%            200 × 0.08  = $16           $28
  $10          $4         4%            200 × 0.04  = $8            $22
  $10          $8         2%            200 × 0.02  = $4            $22
  $10          $16        1%            200 × 0.01  = $2            $28
  $10          $32        0.5%          200 × 0.005 = $1            $43

The lowest total cost is $22. There are two options with a total cost of $22: $4 of testing and $8 of testing. Economically, we can choose either option.

For high-volume, small-area chips, testing can consume more than 50% of the total cost.

8.1.3 Physical Faults (Smith 14.3.3)

8.1.3.1 Types of Physical Faults

[Figure: a good circuit on signals a, b, c, d, and bad circuits showing each type of physical fault:
- open
- wired-AND bridging short
- wired-OR bridging short
- stronger-wins bridging short (b is stronger)
- short to VDD
- short to GND]

8.1.3.2 Locations of Faults

Each segment of wire, poly, diffusion, via, etc. is a potential fault location. Different segments affect different gates in the fanout. A potential fault location is a segment or segments where a fault at any position affects the same set of gates in the same way.

[Figure: three different locations for potential faults on signal b, each marking which fanout gates are BAD and which are OK.]

When working with faults, we work with wire segments, not signals. In the circuit below, there are 8 different wire segments (L1 to L8). Each wire segment corresponds to a logically distinct fault location. All physical faults on a segment affect the same set of signals, so they are grouped together into a logical fault.

If a signal has a fanout of 1, then there is one wire segment. A signal with a fanout of $n$, where $n > 1$, has at least $n + 1$ wire segments: one for the source signal and one for each gate of fanout. As shown in section 8.1.3.3, the layout of the circuit can have more than $n + 1$ segments.

[Figure: circuit with inputs a, b, c, output z, and wire segments L1 to L8.]

8.1.3.3 Layout Affects Locations

[Figure: schematic with signals a, b, c, d, e, f, g, h, i, and two layouts of signal b feeding e, g, and h, one with segments L1 to L4 and one with segments L1 to L5.]

For the signal b in the schematic above, we can have either four or five different locations for potential faults, depending upon how the circuit is laid out.

8.1.3.4 Naming Fault Locations

Two ways to name a fault location:
- pin-fault model: Faults are modelled as occurring on input and output pins of gates.
- net-fault model: Faults are modelled as occurring on segments of wires.

In E&CE 427, we'll use the net-fault model, because it is simpler to work with and is closer to what actually happens in hardware.

8.1.4 Detecting a Fault

To detect a fault, we compare the actual output of the circuit against the expected value.

To find a test vector that will detect a fault:
1. build Boolean equation (or Karnaugh map) of correct circuit
2. build Boolean equation (or Karnaugh map) of faulty circuit
3. compare equations (or Karnaugh maps); regions of difference represent test vectors that will detect the fault

8.1.4.1 Which Test Vectors will Detect a Fault?

Question: For the good circuit and faulty circuit shown below, which test vectors will detect the fault?
[Figure: good circuit and faulty circuit, each with inputs a, b, c and internal signals d, e.]

Answer:

  a   b   c   good   faulty
  0   0   0   0      0
  0   0   1   1      1
  0   1   0   0      0
  0   1   1   1      1
  1   0   0   0      0
  1   0   1   1      1
  1   1   0   1      0
  1   1   1   1      1

The only test vector that will detect the fault in the circuit is 110.

Sometimes multiple test vectors will catch the same fault. Sometimes a single test vector can catch multiple faults.

[Figure: another fault in the same circuit.]

  a   b   c   good   faulty
  1   1   0   1      0

The test vector 110 can catch both this fault and the previous one.

With testing, we are primarily concerned with determining whether a circuit works correctly or not (detecting whether there is a fault). If the circuit has a fault, we usually do not care where the fault is (diagnosing the fault). To detect the two faults above, the test vector 110 is sufficient, because if either of the two faults is present, 110 will detect that the circuit does not work correctly.

Note: Detect vs. diagnose. Testing detects faults. Testing does not diagnose which fault occurred.

If we have a higher-than-expected failure rate for a chip, we might want to investigate the cause of the failures, and so would need to diagnose the faults. In this case, we might do more exhaustive analysis to see which test vectors pass and which fail. We might also need to examine the chip physically with probes to test a few individual wires or transistors. This is done by removing the top layers of the chip and using very small and very sensitive probes, analogous to how we use a multimeter to test a circuit on a breadboard.

8.1.5 Mathematical Models of Faults (Smith 14.3.4)

Goal: develop a reliable and predictable technique for detecting faults in circuits.

Observations:
- The possible faults in a circuit are dependent upon the physical layout of the circuit.
- A very wide variety of possible faults
- A single test vector can catch many different faults

Need: a mathematical model for faults that is abstracted from the complexities of circuit layout and the plethora of possible faults, yet still detects most or all possible faults.

8.1.5.1 Single Stuck-At Fault Model

Although there are many different bad behaviours that faults can lead to, the simple model of single stuck-at faults has proven very capable of finding real faults in real circuits.

Two simplifying assumptions:
1. A maximum of one fault per tested circuit (hence "single")
2. All faults are either:
   (a) stuck-at 1: short to VDD
   (b) stuck-at 0: short to GND
   (hence "stuck-at")

Example of Stuck-At Faults

[Figure: circuit with inputs a, b, c, d, output i, and fault locations L1 to L12.]

12 fault locations $\times$ 2 types of faults $=$ 24 possible faults.

If we restrict ourselves to the single stuck-at fault model, then we have 24 faulty circuits to consider.

If we allowed multiple faults, then the circuit above could have up to 12 different faults at once. How many faulty circuits would need to be considered? Each of the 12 locations has three possible values: good, stuck-at-1, stuck-at-0. Therefore, $3^{12} \approx 5.3 \times 10^{5}$ different circuits would need to be considered!

If we allowed multiple faults of 4 different types at 12 different locations, then we would have $5^{12} \approx 2.4 \times 10^{8}$ different faulty circuits to consider!

There are $2^{2^4} \approx 6.6 \times 10^{4}$ different Boolean functions of four inputs. (A K-map of four variables is a grid of $2^4$ squares; each square is either 0 or 1, which gives $2^{2^4}$ different combinations.)

There are $6.6 \times 10^{4}$ possible equations for circuits with four inputs and one output. This is much less than the number of faulty circuit models that would be generated by the simultaneous-faults-at-every-location models. So both of the simultaneous-faults-at-every-location models are too extreme.
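The counts in this comparison are simple enough to check directly. The following is a minimal sketch rather than part of the original notes; the function names `count_circuit_models` and `count_boolean_functions` are hypothetical, chosen only for illustration.

```python
def count_circuit_models(locations: int, fault_types: int) -> dict:
    """Count circuit models for a given number of fault locations and fault types."""
    # Single-fault model: exactly one location is faulty, with one fault type.
    single = locations * fault_types
    # Multiple-fault model: each location is independently either good or has
    # one of the fault types. This count includes the one fault-free circuit.
    multiple = (fault_types + 1) ** locations
    return {"single": single, "multiple": multiple}


def count_boolean_functions(inputs: int) -> int:
    """Number of distinct single-output Boolean functions of `inputs` variables."""
    # A truth table has 2**inputs rows, and each row is either 0 or 1.
    return 2 ** (2 ** inputs)


stuck_at = count_circuit_models(locations=12, fault_types=2)
four_types = count_circuit_models(locations=12, fault_types=4)

print(stuck_at["single"])          # 24
print(stuck_at["multiple"])        # 531441
print(four_types["multiple"])      # 244140625
print(count_boolean_functions(4))  # 65536
```

Both multiple-fault counts exceed the 65536 distinct four-input functions, which is another way of seeing that many of those faulty-circuit models must be behaviourally identical to one another.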
8.1.6 Generate Test Vector to Find a Mathematical Fault (Smith 14.4)

Faults are detected by stimulating circuits (real, manufactured circuits, not simulations!) with test vectors and checking that the real circuit gives the correct output.

Standard practice in testing is to test circuits for single stuck-at faults. Mathematics and empirical evidence demonstrate that if a circuit appears to be free of single stuck-at faults, then it is probably also free of other types of faults. That is, testing a circuit for single stuck-at faults will also detect many other types of faults and will often detect multiple faults.

8.1.6.1 Algorithm

1. compute Karnaugh map for correct circuit
2. compute Karnaugh map for faulty circuit
3. find region of disagreement
4. any assignment in the region of disagreement is a test vector that will detect the fault
5. any assignment outside of the region of disagreement will result in the same output on both the correct and the faulty circuit

8.1.6.2 Example of Finding a Test Vector

[Figure: good circuit and faulty circuit with inputs a, b, c and internal signals d, e; Karnaugh maps of the good circuit and the faulty circuit; Karnaugh map of the difference between the good and faulty circuits.]

8.1.7 Undetectable Faults

Not all faults are detectable.

1. If a circuit is irredundant, then all single stuck-at faults can be detected. A redundant circuit is one where one or more gates can be removed without affecting the functional behaviour.
2. If we are not trying to find all of the faults in a circuit, then a fault that we aren't looking for can mask a fault that we are looking for.

8.1.7.1 Redundant Circuitry

Some faults are undetectable. Undetectable stuck-at faults are located in redundant parts of a circuit.

Timing Hazards

[Figure: static hazard.]

Timing hazards are often removed by adding redundant circuitry.
[Figure: dynamic hazard.]

Redundant Circuitry

[Figure: irredundant circuit with inputs a, b, c, internal signals d, e, f, and output g; illustration of the timing hazard showing the signal values before and after the transition.]

The glitch on g is caused because the AND gate for e turns off before f turns on.

Question: Add one or more gates to the circuit so that the static hazard is guaranteed to be prevented, independent of the delay values through the gates.

In this sum-of-products style circuit, each AND gate corresponds to a cube in the Karnaugh map.

[Figure: Karnaugh map of the irredundant circuit.]

We can prevent this transition from causing a glitch by adding a cube that covers the two squares of the transition from 111 to 101. This cube is 1-1, which is the black cube in the Karnaugh map below and the signal h in the redundant circuit below.

[Figure: Karnaugh map with the added cube; redundant circuit with the added AND gate h, whose output segment is L1. No more timing hazards.]

Question: Has the redundant circuitry introduced any undetectable faults? If so, identify an undetectable fault.

L1@0 is undetectable.

  Irredundant circuit:              $ab + \bar{b}c$
  Redundant circuit (fault-free):   $ab + \bar{b}c + ac$
  Faulty circuit with L1@0 ($ac$ forced to 0): $ab + \bar{b}c + 0 = ab + \bar{b}c$

The added cube $ac$ is covered by $ab + \bar{b}c$, so the faulty circuit computes the same function as the correct circuit.

A stuck-at fault in redundant circuitry will not affect the steady-state behaviour of the circuit, but could allow timing glitches to occur.

8.1.7.2 Curious Circuitry and Fault Detection

The two circuits below have the same steady-state behaviour.

[Figure: two circuits with inputs a, b, c and output z; the first contains XOR gates and the fault locations L1, L2, L3.]

Because the two circuits have the same behaviour, it might appear that the leftmost two XOR gates are redundant. However, these gates are not redundant. In the test for redundancy, when we remove a gate, we delete it; we do not replace it with other circuitry.

Curiously, the stuck-at fault at L1 is undetectable, but faults at either L2 or L3 are detectable.

[Table: faulty equation, Karnaugh map, and difference from the correct circuit for the faults L2@0 and L2@1.]

8.2 Test Generation

8.2.1 A Small Example

Throughout this section we will use the circuit below:

[Figure: circuit computing z = ab + bc from inputs a, b, c, with fault locations L2, L4, and L5 marked, and its Karnaugh map.]

At first, we will consider only the following faults: L2@1, L4@1, L5@1.

  fault      eqn      test vectors
  1) L2@1    a + c    101, 001, 100
  2) L4@1    a + bc   101, 100
  3) L5@1    ab + c   101, 001

[Figure columns: Karnaugh map of each faulty circuit and its difference with the correct circuit.]

Choose Test Vector

If we choose 101, we can detect all three faults. Choosing either 001 or 100 will miss one of the three faults.

8.2.2 Choosing Test Vectors

The goal of test vector generation is to find the smallest set of test vectors that will detect the faults of interest.

Test vector generation requires analyzing the faults. We can simplify the task of fault analysis by reducing the number of faults that we have to analyze. Smith has examples of this in Figures 14.13 and 14.14.

8.2.2.1 Fault Domination

  fault      eqn      test vectors
  1) L5@1    ab + c   101, 001
  2) L6@1    1        101, 001, 100, 010, 000

[Figure columns: Karnaugh map of each faulty circuit and its difference with the correct circuit.]

Any test vector that detects L5@1 will also detect L6@1: L5@1 is detected by 101 and 001, each of which will detect L6@1.

L6@1 does not dominate L5@1, because there is at least one test vector that detects L6@1 but does not detect L5@1 (e.g. each of 100, 010, 000 detects L6@1 but not L5@1).

Definition dominates: f1 dominates f2 if any test vector that detects f1 will also detect f2.

When choosing test vectors, we can ignore the dominated fault, but must keep the dominant fault.

L5@1 dominates L6@1. When choosing test vectors we can ignore L6@1 and just include L5@1.

Question: To detect both L5@1 and L6@1, can we ignore one of the faults?

Answer: We can ignore L6@1, because L5@1 dominates L6@1: each test vector that detects L5@1 also detects L6@1.

Question: What would happen if we ignored the wrong fault?
Answer: If we ignore L5@1, but keep L6@1, we can choose any of the 5 test vectors that detect L6@1. If we chose 100, 010, or 000 as our test vector to detect L6@1, then we would not detect L5@1.

8.2.2.2 Fault Equivalence

  fault      eqn
  1) L1@1    b
  2) L3@1    b

[Figure columns: Karnaugh map of each faulty circuit and its difference with the correct circuit.]

The two faults above are equivalent.

Definition fault equivalence: f1 is equivalent to f2 if f1 and f2 are detected by exactly the same set of test vectors. That is, all of the test vectors that detect f1 will also detect f2, and vice versa.

When choosing test vectors we can ignore one of the faults and just include the other.

8.2.2.3 Gate Collapsing

A controlling value on an input to a gate forces the output to be the controlled value. If a stuck-at fault on the input causes the input to have a controlling value, then that fault is equivalent to the output having a stuck-at fault at the controlled value.

For example, a 1 on the input to an OR gate will force the output to be 1. So, a stuck-at-1 fault on either input to an OR gate is equivalent to a stuck-at-1 fault on the output of the gate, and is equivalent to a stuck-at-1 fault on any other input to the OR gate.

Definition Gate collapsing: the technique of looking at the functionality of a gate and finding equivalent faults between inputs and outputs.

Sets of collapsible faults for common gates:

  AND:  input @0, input @0, output @0
  OR:   input @1, input @1, output @1

Question: What is the set of collapsible faults for a NAND gate?

Answer: To determine the collapsible faults, treat the NAND gate as an AND gate followed by an inverter, then invert the faults on the output of the gate.
  AND + NOT:  input @0, input @0, AND output @0
  NAND:       input @0, input @0, output @1

8.2.2.4 Node Collapsing

Note: Node collapsing is relevant only for the pin-fault model.

When two segments affect the same set of gates (ignoring any gates between the two segments), then faults on the two segments can be collapsed.

With an inverter or buffer, the segment on the input affects the same gates as the output. Therefore, faults on the input and output segments are equivalent.

[Figure: sets of collapsible faults for an inverter: input stuck-at-1 with output stuck-at-0, and input stuck-at-0 with output stuck-at-1.]

With the net-fault model, which is the one we are using in E&CE 427, inverters and buffers are the only gates where node collapsing is relevant.

With the pin-fault model, where faults are modelled as occurring on the pins of gates, there are other instances where node collapsing can be used.

8.2.2.5 Fault Collapsing Summary

When calculating the test vectors to detect a set of faults, apply the fault collapsing techniques of:
- gate collapsing
- node collapsing (if using pin-fault model)
- general fault equivalence (intelligent collapsing)
- fault domination
to reduce the number of faults that you must examine.

Fault collapsing is an optimization. If you skip this step, you will still get the correct answer; it will just take more work, because in each step you will analyze a greater number of faults than if you do fault collapsing.

8.2.3 Fault Coverage

Definition Fault coverage: percentage of detectable faults that are detected by a set of test vectors.

\[
\mathit{FaultCoverage} = \frac{\mathit{DetectedFaults}}{\mathit{DetectableFaults}}
\]

Some people's definition of fault coverage has a denominator of AllPossibleFaults, not just those that are detectable.

If the denominator is AllPossibleFaults and a circuit has 100% single stuck-at fault coverage with a suite of test vectors, then each stuck-at fault in the circuit can be detected by one or more vectors in the suite. This also means that the circuit has no undetectable faults, and hence, no redundant circuitry.
Even if the denominator is AllPossibleFaults, it is possible that achieving 100% coverage for single stuck-at faults will allow defective chips to pass if they have faults that are not stuck-at-1 or stuck-at-0.

I think, but haven't seen a proof, that achieving 100% single stuck-at coverage will detect all combinations of multiple stuck-at faults. But, if you do not achieve 100% coverage, then a stuck-at fault that you aren't testing for can mask (hide) a fault that you are testing for.

NOTE: In Smith's book, undetectable faults don't hurt your coverage. This is not universally true.

8.2.4 Test Vector Generation and Fault Detection

There are two ways to generate vectors and check results: built-in tests and scan testing. Both require:
- generate test vectors
- override normal datapath to send test vectors, rather than normal inputs, as inputs to flops
- compare outputs of flops to expected result

8.2.5 Generate Test Vectors for 100% Coverage

In this section we will find the test vectors to achieve 100% coverage of single stuck-at faults for the circuit of the day. We will use a simple algorithm; there are much more sophisticated algorithms that are more efficient.

The problem of test vector generation is often called Automatic Test Pattern Generation (ATPG) and continues to be an active area of research. A trendy idea is to use Genetic Algorithms (inspired by how DNA works) to generate test vectors that catch the maximum number of faults. The classic algorithm is the D algorithm, invented by Roth in 1966 (Smith 14.5.1, 14.5.2). An enhanced version is Path-Oriented Decision Making (PODEM), which supports reconvergent fanout and was developed by Goel in 1981 (Smith 14.5.3).
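For a circuit as small as the one used in this section, test vectors can also be found by brute force: simulate the good circuit and each faulty circuit on all eight input vectors and keep the vectors where the outputs differ. The sketch below is an illustration only, not the D algorithm or PODEM. It assumes the circuit z = ab + bc with the net-fault segment naming L1 = a, L2 = b (stem), L3 = c, L4 and L5 = the two branches of b, L6 and L7 = the AND outputs, and L8 = z; the function names are hypothetical.

```python
from itertools import product

SEGMENTS = ["L1", "L2", "L3", "L4", "L5", "L6", "L7", "L8"]


def simulate(a, b, c, fault=None):
    """Evaluate z = ab + bc, optionally with one stuck-at fault.

    `fault` is a (segment, stuck_value) pair such as ("L4", 1), or None
    for the fault-free circuit. The segment naming is an assumption.
    """
    def net(name, value):
        # A faulty segment carries its stuck value regardless of its driver.
        if fault is not None and fault[0] == name:
            return fault[1]
        return value

    l1 = net("L1", a)
    l2 = net("L2", b)        # stem of b
    l3 = net("L3", c)
    l4 = net("L4", l2)       # branch of b into the AND gate with a
    l5 = net("L5", l2)       # branch of b into the AND gate with c
    l6 = net("L6", l1 & l4)
    l7 = net("L7", l5 & l3)
    return net("L8", l6 | l7)


def detecting_vectors(fault):
    """All vectors 'abc' on which the faulty and good circuits disagree."""
    return {f"{a}{b}{c}" for a, b, c in product((0, 1), repeat=3)
            if simulate(a, b, c) != simulate(a, b, c, fault)}


faults = [(seg, value) for seg in SEGMENTS for value in (0, 1)]
tests = {fault: detecting_vectors(fault) for fault in faults}

# A vector is required if it is the only vector that detects some fault.
required = {next(iter(vs)) for vs in tests.values() if len(vs) == 1}

# Greedily add vectors until every detectable fault is covered.
suite = set(required)
uncovered = {f for f, vs in tests.items() if vs and not (vs & suite)}
while uncovered:
    candidates = sorted(set().union(*(tests[f] for f in uncovered)))
    best = max(candidates, key=lambda v: sum(v in tests[f] for f in uncovered))
    suite.add(best)
    uncovered = {f for f in uncovered if best not in tests[f]}

print(sorted(required))  # ['010', '011', '110']
print(sorted(suite))     # ['010', '011', '101', '110']
```

Exhaustive simulation grows exponentially with the number of inputs, which is why fault collapsing and the more sophisticated ATPG algorithms matter for real circuits.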
[Figure 8.1: Example Circuit with Fault Locations and Karnaugh Map (inputs a, b, c; fault locations L1 to L8; internal signals ab and bc; output z)]

8.2.5.1 Collapse the Faults

Initial circuit with potential faults: L1@0,1, L2@0,1, L3@0,1, L4@0,1, L5@0,1, L6@0,1, L7@0,1, L8@0,1.

Gate collapsing

Sets of equivalent faults found by gate collapsing:

- L1@0, L4@0, L6@0
- L3@0, L5@0, L7@0
- L6@1, L7@1, L8@1

Node Collapsing

Node collapsing: none applicable (no inverters or buffers).

Remaining faults: L1@1, L2@0,1, L3@1, L4@1, L5@1, L6@0, L7@0, L8@0,1.

Intelligent Collapsing

Sometimes, after the regular forms of fault collapsing have been done, there will still be some sets of equivalent faults in the circuit. It is usually beneficial to quickly look for patterns or symmetries in the circuit that will indicate a set of potentially equivalent faults.

- L2@0, L8@0: both result in the equation 0.
- L1@1, L3@1: both result in the equation b.

Remaining faults: L2@1, L3@1, L4@1, L5@1, L6@0, L7@0, L8@0,1.

8.2.5.2 Check for Fault Domination

The "Diff w/ ckt" column lists the input vectors abc for which the faulty circuit differs from the correct circuit (z = ab + bc).

    fault   eqn     Diff w/ ckt (abc)          notes
1)  L2@1    a+c     001, 100, 101              dominated by L4@1, L5@1
2)  L3@1    b       010
3)  L4@1    a+bc    100, 101
4)  L5@1    ab+c    001, 101
5)  L6@0    bc      110
6)  L7@0    ab      011
7)  L8@0    0       011, 110, 111              dominated by L6@0, L7@0
8)  L8@1    1       000, 001, 010, 100, 101    dominated by L2@1, L3@1, L4@1, L5@1

Remove dominated faults

Current faults: L2@1, L3@1, L4@1, L5@1, L6@0, L7@0, L8@0,1.

Dominated faults: (L2@1, L8@0, L8@1).

    fault   eqn     Diff w/ ckt (abc)
1)  L3@1    b       010
2)  L4@1    a+bc    100, 101
3)  L5@1    ab+c    001, 101
4)  L6@0    bc      110
5)  L7@0    ab      011

[Figure: circuit with the remaining faults L3@1, L4@1, L5@1, L6@0, L7@0]

8.2.5.3 Required Test Vectors

If we have any faults that are detected by just one test vector, then we must include that test vector in our suite.

Definition (required test vector): A test vector tv is required if there is a fault for which tv is the only test vector that will detect the fault.

Required vectors:

    L3@1    010
    L6@0    110
    L7@0    011

8.2.5.4 Faults Not Covered by Required Test Vectors

    fault   eqn     Diff w/ ckt (abc)
1)  L4@1    a+bc    100, 101
2)  L5@1    ab+c    001, 101

The intersection of the two difference regions is 101. Choosing 101 detects both L4@1 and L5@1. Add 101 to the suite of test vectors.

Final set of test vectors is: 010, 110, 011, 101.

8.2.5.5 Order to Run Test Vectors

The order in which the test vectors are run is important because it can affect how long a faulty chip stays in the tester before the chip's fault is detected. The first vector to run should be the one that detects the most faults.

Build a table showing which faults each test vector will detect.

                     Test Vector
     fault    110   010   011   101
 1)  L1@0      x
 2)  L1@1            x
 3)  L2@0      x           x
 4)  L2@1                        x
 5)  L3@0                  x
 6)  L3@1            x
 7)  L4@0      x
 8)  L4@1                        x
 9)  L5@0                  x
10)  L5@1                        x
11)  L6@0      x
12)  L6@1            x           x
13)  L7@0                  x
14)  L7@1            x           x
15)  L8@0      x           x
16)  L8@1            x           x
     Faults
     detected  5     5     5     6

101 detects the most faults, so we should run it first. This reduces the faults found by 010 from 5 to 2 (because L6@1, L7@1, and L8@1 will be found by 101). This leaves 110 and 011 with 5 faults each; we can run them in either order, then run 010.

We settle on a final order for our test suite of: 101, 011, 110, 010.
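The last three steps (required vectors, covering the leftover faults, ordering) can be sketched as a few set operations. This is only an illustration under stated assumptions: the detection sets in `DETECTS` were derived by hand from the fault equations for z = ab + bc, the greedy choice is one simple strategy rather than a guaranteed-minimal one, and all function names are hypothetical.

```python
# Detecting vectors (abc) per fault, derived by hand from the fault equations
# for z = ab + bc; treat these sets as assumptions of this sketch.
DETECTS = {
    "L1@0": {"110"}, "L1@1": {"010"},
    "L2@0": {"011", "110", "111"}, "L2@1": {"001", "100", "101"},
    "L3@0": {"011"}, "L3@1": {"010"},
    "L4@0": {"110"}, "L4@1": {"100", "101"},
    "L5@0": {"011"}, "L5@1": {"001", "101"},
    "L6@0": {"110"}, "L6@1": {"000", "001", "010", "100", "101"},
    "L7@0": {"011"}, "L7@1": {"000", "001", "010", "100", "101"},
    "L8@0": {"011", "110", "111"}, "L8@1": {"000", "001", "010", "100", "101"},
}

def required_vectors(detects):
    """Vectors that are the only way to detect at least one fault."""
    return {next(iter(vs)) for vs in detects.values() if len(vs) == 1}

def complete_suite(detects, suite):
    """Greedily add vectors until every detectable fault is covered."""
    suite = set(suite)
    uncovered = {f for f, vs in detects.items() if vs and not vs & suite}
    while uncovered:
        candidates = sorted(set().union(*(detects[f] for f in uncovered)))
        best = max(candidates,
                   key=lambda v: sum(v in detects[f] for f in uncovered))
        suite.add(best)
        uncovered = {f for f in uncovered if best not in detects[f]}
    return suite

def order_suite(detects, suite):
    """Run first the vector that detects the most not-yet-detected faults."""
    remaining, found, order = sorted(suite), set(), []
    while remaining:
        best = max(remaining, key=lambda v: sum(
            v in vs and f not in found for f, vs in detects.items()))
        order.append(best)
        remaining.remove(best)
        found |= {f for f, vs in detects.items() if best in vs}
    return order

required = required_vectors(DETECTS)
suite = complete_suite(DETECTS, required)
order = order_suite(DETECTS, suite)
print(sorted(required), sorted(suite), order)
```

Under these assumptions the sketch finds the required vectors 010, 110 and 011, adds 101, and orders the suite as 101, 011, 110, 010. The tie between 011 and 110 is broken alphabetically here; as noted above, either order is equally good.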
8.2.5.6 Summary of Technique to Find and Order Test Vectors

1. identify all possible faults
2. gate collapsing
3. node collapsing
4. intelligent collapsing
5. fault domination
6. determine required test vectors
7. choose minimal set of test vectors to detect remaining faults
8. order test vectors based on number of faults detected (NOTE: when iterating through this step, you need to take into account faults detected by earlier test vectors)

8.2.5.7 Complete Analysis

In case you don't trust the fault collapsing analysis, here's the complete analysis.

     fault   eqn     Diff w/ ckt (abc)          notes
 1)  L1@0    bc      110
 2)  L1@1    b       010
 3)  L2@0    0       011, 110, 111              dominated by 1, 5
 4)  L2@1    a+c     001, 100, 101              dominated by 8, 10
 5)  L3@0    ab      011
 6)  L3@1    b       010                        same as 2
 7)  L4@0    bc      110                        same as 1
 8)  L4@1    a+bc    100, 101
 9)  L5@0    ab      011                        same as 5
10)  L5@1    ab+c    001, 101
11)  L6@0    bc      110                        same as 1
12)  L6@1    1       000, 001, 010, 100, 101    dominated by 8, 10
13)  L7@0    ab      011                        same as 5
14)  L7@1    1       000, 001, 010, 100, 101    same as 12
15)  L8@0    0       011, 110, 111              same as 3
16)  L8@1    1       000, 001, 010, 100, 101    same as 12

8.2.6 One Fault Hiding Another

[Figure: example circuit with fault locations L1 to L8]

Assume that we are not trying to detect all faults: L1 is viewed as not being at risk for faults, but L3 is at risk for faults.

[Figures: circuit with fault location L3 marked; circuit with fault locations L1 and L3 marked]

Problem: If L1 is stuck-at 1, the test vectors that normally detect L3@0 will not detect L3@0.

In the presence of other faults, the set of test vectors to detect a fault will change.

    fault(s)        eqn    Diff w/ ckt (abc)
    L3@0            ab     011
    L1@1, L3@0      b      010

8.3 Scan Testing in General

Scan testing is based on the techniques described in section 8.2.5. The generation of test vectors and the checking of the result are done off-chip. In comparison, built-in self test (section 8.5) does test-vector generation and result checking on chip.

Scan testing has the advantage of flexibility and reduced on-chip hardware, but increases the length of time required to run a test.
In scan testing, we want to individually drive and read every flop in the circuit. Even without using any I/O pins for testing purposes, chips are already I/O bound, so scan testing must be very frugal in its use of pins. Flops are connected together in scan chains with one input pin and one output pin.

8.3.1 Structure and Behaviour of Scan Testing

[Figure: Normal Circuit (circuit under test with signals data_in(3..0) and zeta_in(3..0), connected to other circuits)]

[Figure: Circuit with Scan Chains Added (scan chain 0 with mode0, scan_in0, scan_out0; scan chain 1 with mode1, scan_in1, scan_out1)]

8.3.2 Scan Chains

8.3.2.1 Circuitry in Normal and Scan Mode

[Figure: Normal Mode]

[Figure: Scan Mode]

8.3.2.2 Scan in Operation

[Figure: Circuit under test with scan chains]

Sequence of load; test; unload:

- Load Test Vector (1 cycle per bit)
- Run Test Vector Through Circuit
- Unload Result (1 cycle per bit)

Unload and Load at the Same Time:

- Unload Prev Result, Load Cur Test Vector (1 cycle per bit)
- Run Cur Test Vector Through Circuit
- Unload Cur Result, Load New Test Vector (1 cycle per bit)

[Timing diagram: Sequence of load; run; unload. Signals: clk, mode0, scan_out0, scan_in0, scan_out1, scan_in1. scan_out0 and scan_out1 carry the previous results, then the current results; scan_in0 and scan_in1 carry the current vector, then the next test vector.]
8.3.2.3 Scan in Operation with Example Circuit

[Figure: Circuit under test (signals a, b, c, d, y, z)]

[Figure: Circuit under test with scan test circuitry (mode0, scan_in0, scan_out0, mode1, scan_in1, scan_out1)]

[Figure sequence, one frame per clock cycle, each showing clk and mode0:
- Start Loading Test Vector (Load)
- Load
- Load
- Load
- Run Test Vector
- Test Values Propagate
- Flop-In Result, Start (Un)loading Test Vector
- Continue (Un)loading Test Vector
- Finish (Un)loading Test Vector
- Run Next Test Vector]

8.3.3 Summary of Scan Testing

Adding scan circuitry

1. Registers around circuit to be tested are grouped into scan chains
2. Replace each flop with mux + flop
3. Flops and muxes wired together into scan chains
4. Each scan chain is connected to dedicated I/O pins for loading and unloading test vectors

Running test vectors

1. Put scan chain in scan mode
2. Load in test vector (one element of vector per clock cycle)
3. Put scan chain in normal mode
4. Run circuit for one clock cycle: load result of test into flops
5. Unload results of current test vector while simultaneously loading in next test vector (one element of vector per clock cycle)

8.3.4 Time to Test a Chip

If the length (number of flops) of a scan chain is n, then it takes 2n + 1 clock cycles to run a single test: n clock cycles to scan in the test vector, 1 clock cycle to execute the test vector, and n cycles to scan out the results.

Once the results are scanned out, they can be compared to the expected results for a correctly working circuit.

If we run 2 or more tests (and chips generally are subjected to hundreds of thousands of tests), then we speed things up by scanning in the next test vector while we scan out the previous result.

    ScanLength = number of flip flops in a scan chain
    NumVectors = number of test vectors in test suite
    TimeScan   = number of clock cycles to run test suite

$$\mathit{TimeScan} = \mathit{NumVectors} \times (\mathit{ScanLength} + 1) + \mathit{ScanLength}$$

8.3.4.1 Example: Time to Test a Chip

An 800 MHz chip has scan chains of length 20,000 bits, 18,000 bits, 21,000 bits, 22,000 bits, and two of 15,000 bits. 500,000 test vectors are used for each scan chain. The tests are run at 80% of full speed.

Question: Calculate the total test time.

Answer: We can load and unload all of the scan chains at the same time, so time will be limited by the longest (22,000 bits).

For the first test vector, we have to load it in, run the circuit for one clock cycle, then unload the result. Loading the second test vector is done while unloading the first.

$$\mathit{TimeTot} = \mathit{ClockPeriod} \times (\mathit{MaxLengthVec} + \mathit{NumVecs} \times (\mathit{MaxLengthVec} + 1))$$

$$= \frac{1}{0.80 \times 800 \times 10^{6}} \times (22\,000 + 500\,000 \times (22\,000 + 1))$$

$$\approx 17 \text{ secs}$$

8.4 Boundary Scan and JTAG

Boundary scan originated as a technique to test wires on printed circuit boards (PCBs).

The goal was to replace bed-of-nails style testing with a technique that would work for high-density PCBs (lots of small wires close together).

It is now used to test both boards and chip internals.

It is used both on boundaries (I/O pins) and internal flops.

Boundary Scan with JTAG
Standardized by IEEE (1149.1) and previously by JTAG:

- 4 required signals (Scan Pins: TDI, TDO, TCK, TMS)
- 1 optional signal (Scan Pin: TRST)
- protocol to connect circuit under test to tester and other circuits
- state machine to drive test circuitry on chip
- Boundary Scan Description Language (BSDL): structural language used to describe which features of JTAG a circuit supports

JTAG circuitry is now commonly built into FPGAs and ASICs, or is part of a cell library. Rarely is a JTAG circuit custom-built as part of a larger part. So, you'll probably be choosing and using JTAG circuits, not constructing new ones.

Using JTAG circuitry is usually done by giving a description of your printed circuit board (PCB) and the JTAG components on each chip (in BSDL) to test generation software. The software then generates a sequence of JTAG commands and data that can be used to test the wires on the circuit board for opens and shorts.

8.4.1 Boundary Scan History

    1985    JETAG: Joint European Test Action Group
    1986    JTAG (North American companies joined)
    1990    JTAG 2.0 formed basis for IEEE 1149.1 "Test access port and boundary scan architecture"

8.4.2 JTAG Scan Pins

    TDI     test data input: input test vector to chip
    TDO     test data output: output result of test
    TCK     test clock: clock signal that test runs on
    TMS     test mode select: controls scan state machine
    TRST    test reset (optional): resets the scan state machine

[Figure: High-level view (chip with boundary scan cells (BSC) forming a boundary scan register (BSR) around the circuit under test; pins TDI, TCK, TMS, TDO and control)]

[Figure: Detailed view (normal input pins, circuit under test, normal output pins; scan registers, BR, IDCODE, IR, instruction decoder, TAP controller; pins TDI, TCK, TMS, TDO)]

8.4.3 Scan Registers and Cells

Basic Building Blocks
    TDR                 Test data register. The boundary scan registers on a chip.
    DR      Fig 14.2    Data register cell. Often used as a boundary scan cell (BSC).

JTAG Components

                 Fig 14.8      Top level diagram.
    BSR          Fig 14.5      Boundary scan register. A chain of boundary scan cells (BSCs).
    BSC          Fig 14.2      Boundary scan cell. Connects external input and scan signal to internal circuit. Acts as a wire between external input and internal circuit in normal mode.
    BR           Fig 14.3      Bypass-register cell. Allows direct connection from TDI to TDO. Acts as a wire when executing the BYPASS instruction.
    IDCODE                     Device identification register. Data register to hold the manufacturer's name and chip identifier. Used in the IDCODE instruction.
    IR cell      Fig 14.4      Instruction register cell. Cells are combined together as a shift register to form an instruction register (IR).
    IR           Fig 14.6      Instruction register. Two or more IR cells in a row. Holds data that is shifted in on TDI and sends this data in parallel to the instruction decoder.
    IDecode      Table 14.4    Instruction decoder. Reads the instruction stored in the instruction register (IR) and sends control signals to the bypass register (BR) and boundary scan register (BSR).
                 Fig 14.7      TAP Controller. State machine that, together with the instruction decoder, controls the scan circuitry.

8.4.4 Scan Instructions

This is the set of required instructions; other instructions are optional.

    EXTEST     Test board-level interconnect. Drive output pins of chip with hardcoded test vector. Sample results on inputs.
    SAMPLE     Sample result data.
    PRELOAD    Load test vector.
    BYPASS     Directly connect TDI to TDO. This is used when several chips are daisy-chained together, to skip loading data into some chips.
    IDCODE     Output manufacturer and part number.

8.4.5 TAP Controller

The TAP controller is required to have 16 states and obey the state machine shown in Fig 14.7 of Smith.
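A 16-state machine driven by a single input is easy to write down as a transition table keyed on TMS. The table below follows the TAP state diagram as it is commonly documented for IEEE 1149.1; the state-name spellings and the `step` helper are this sketch's own choices, so check the table against Fig 14.7 of Smith before relying on it.

```python
# Next state for each TAP state as (TMS=0, TMS=1), sampled on the rising
# edge of TCK. State-name spellings are an assumption of this sketch.
TAP = {
    "Test-Logic-Reset": ("Run-Test/Idle", "Test-Logic-Reset"),
    "Run-Test/Idle":    ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-DR-Scan":   ("Capture-DR", "Select-IR-Scan"),
    "Capture-DR":       ("Shift-DR", "Exit1-DR"),
    "Shift-DR":         ("Shift-DR", "Exit1-DR"),
    "Exit1-DR":         ("Pause-DR", "Update-DR"),
    "Pause-DR":         ("Pause-DR", "Exit2-DR"),
    "Exit2-DR":         ("Shift-DR", "Update-DR"),
    "Update-DR":        ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-IR-Scan":   ("Capture-IR", "Test-Logic-Reset"),
    "Capture-IR":       ("Shift-IR", "Exit1-IR"),
    "Shift-IR":         ("Shift-IR", "Exit1-IR"),
    "Exit1-IR":         ("Pause-IR", "Update-IR"),
    "Pause-IR":         ("Pause-IR", "Exit2-IR"),
    "Exit2-IR":         ("Shift-IR", "Update-IR"),
    "Update-IR":        ("Run-Test/Idle", "Select-DR-Scan"),
}

def step(state, tms_bits):
    """Advance the controller by one TCK cycle per TMS bit."""
    for tms in tms_bits:
        state = TAP[state][tms]
    return state

# From reset, the TMS sequence 0,1,0,0 walks to the data-register shift state.
print(step("Test-Logic-Reset", [0, 1, 0, 0]))
```

One useful property of this table is that holding TMS at 1 for five clock cycles returns the controller to Test-Logic-Reset from any state, which is why the TRST pin can be optional.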
8.4.6 Other descriptions of JTAG/IEEE 1149.1

Texas Instrument...
