03_ARM_Processor_Architecture

ARM Processor Architecture
Adopted from National Chiao-Tung University IP Core Design, SOC Consortium Course Material

Outline
– ARM Processor Core
– Memory Hierarchy
– Software Development
– Summary

ARM Processor Core

3-Stage Pipeline ARM Organization
[Datapath diagram: register bank with PC, barrel shifter, ALU, multiplier, address register and incrementer, data in/out registers, instruction decoder; address bus A[31:0], data bus D[31:0]]
– Register bank: 2 read ports and 1 write port, with access to any register; 1 additional read port and 1 additional write port for r15 (PC)
– Barrel shifter: shifts or rotates an operand by any number of bits
– ALU
– Address register and incrementer
– Data registers: hold data passing to and from memory
– Instruction decoder and control

3-Stage Pipeline (1/2)
– Fetch: the instruction is fetched from memory and placed in the instruction pipeline
– Decode: the instruction is decoded and the datapath control signals are prepared for the next cycle
– Execute: the register bank is read, an operand is shifted, and the ALU result is generated and written back into the destination register

3-Stage Pipeline (2/2)
– At any time slice, 3 different instructions may occupy these stages, so the hardware in each stage has to be capable of independent operation
– When the processor is executing data processing instructions, the latency = 3 cycles and the throughput = 1 instruction/cycle

Multi-Cycle Instructions
– Memory is accessed in every cycle (fetch, data transfer)
– The datapath is used in every cycle (execute, address calculation, data transfer)
– Decode logic generates the control signals for the datapath to use in the next cycle (decode, address calculation)

Data Processing Instructions
[Datapath diagrams: (a) register-register operations, (b) register-immediate operations using the bottom 8 bits [7:0] of the instruction]
– All operations take place in a single clock cycle

Data Transfer Instructions
[Datapath diagrams: (a) 1st cycle - compute address from the base register and 12-bit offset [11:0], (b) 2nd cycle - store data and auto-index]
– Computes a memory address in a manner similar to a data processing instruction
– A load instruction follows a similar pattern, except that the data from memory only gets as far as the 'data in' register on the 2nd cycle, and a 3rd cycle is needed to transfer the data from there to the destination register

Branch Instructions
[Datapath diagrams: (a) 1st cycle - compute the branch target from the 24-bit offset [23:0] shifted left by 2, (b) 2nd cycle - save the return address to r14]
– The third cycle, which is required to complete the pipeline refill, is also used to make a small correction to the value stored in the link register so that it points directly at the instruction that follows the branch

Branch Pipeline Example
– Breaking the pipeline
– Note that the core is executing in the ARM state

5-Stage Pipeline ARM Organization
Tprog = Ninst * CPI / fclk
– Tprog: the time taken to execute a given program
– Ninst: the number of ARM instructions executed in the program => compiler dependent
– CPI: average number of clock cycles per instruction => hazards cause pipeline stalls
– fclk: clock frequency
Separate instruction and data memories => 5-stage pipeline
Used in the ARM9TDMI

5-Stage Pipeline Organization (1/2)
[Pipeline diagram: fetch (I-cache, pc+4), decode (instruction decode, register read, immediate fields), execute (shift, ALU, load/store address calculation), buffer/data (D-cache), write-back (register write), with forwarding paths]
– Fetch: the instruction is fetched from memory and placed in the instruction pipeline
– Decode: the instruction is decoded and register operands are read from the register file. There are 3 operand read ports in the register file, so most ARM instructions can source all their operands in one cycle
– Execute: an operand is shifted and the ALU result generated. If the instruction is a load or store, the memory address is computed in the ALU

5-Stage Pipeline Organization (2/2)
– Buffer/Data: data memory is accessed if required. Otherwise the ALU result is simply buffered for one cycle
– Write-back: the results generated by the instruction are written back to the register file, including any data loaded from memory

Pipeline Hazards
There are situations, called hazards, that prevent the next instruction in the instruction stream from executing during its designated clock cycle. Hazards reduce the performance from the ideal speedup gained by pipelining. There are three classes of hazards:
– Structural hazards: arise from resource conflicts when the hardware cannot support all possible combinations of instructions in simultaneous overlapped execution
– Data hazards: arise when an instruction depends on the result of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline
– Control hazards: arise from the pipelining of branches and other instructions that change the PC

Structural Hazards
When a machine is pipelined, the overlapped execution of instructions requires pipelining of functional units and duplication of resources to allow all possible combinations of instructions in the pipeline. If some combination of instructions cannot be accommodated because of a resource conflict, the machine is said to have a structural hazard.

Example
A machine has a single memory pipeline shared between data and instructions.
As a result, when an instruction contains a data-memory reference (the load), it conflicts with the instruction fetch of a later instruction (Instr 3):

                   Clock cycle number
              1    2    3    4    5    6    7    8
  load        IF   ID   EX   MEM  WB
  Instr 1          IF   ID   EX   MEM  WB
  Instr 2               IF   ID   EX   MEM  WB
  Instr 3                    IF   ID   EX   MEM  WB

Solution (1/2)
To resolve this, we stall the pipeline for one clock cycle when the data-memory access occurs. The effect of the stall is to occupy the resources for that instruction slot. The following table shows how the stalls are actually implemented:

                   Clock cycle number
              1    2    3    4     5    6    7    8    9
  load        IF   ID   EX   MEM   WB
  Instr 1          IF   ID   EX    MEM  WB
  Instr 2               IF   ID    EX   MEM  WB
  Instr 3                    stall IF   ID   EX   MEM  WB

Solution (2/2)
Another solution is to use separate instruction and data memories. The 5-stage ARM (e.g. the ARM9TDMI) is a Harvard architecture, so it does not suffer from this hazard.

Data Hazards
Data hazards occur when the pipeline changes the order of read/write accesses to operands so that the order differs from the order seen by sequentially executing instructions on the unpipelined machine.

                      Clock cycle number
                  1    2    3     4     5     6     7     8     9
  ADD R1,R2,R3    IF   ID   EX    MEM   WB
  SUB R4,R5,R1         IF   IDsub EX    MEM   WB
  AND R6,R1,R7              IF    IDand EX    MEM   WB
  OR  R8,R1,R9                    IF    IDor  EX    MEM   WB
  XOR R10,R1,R11                        IF    IDxor EX    MEM   WB

Forwarding
The problem with data hazards introduced by this sequence of instructions can be solved with a simple hardware technique called forwarding.

                      Clock cycle number
                  1    2    3     4     5     6     7
  ADD R1,R2,R3    IF   ID   EX    MEM   WB
  SUB R4,R5,R1         IF   IDsub EX    MEM   WB
  AND R6,R1,R7              IF    IDand EX    MEM   WB

Forwarding Architecture
[Pipeline diagram as before, with the forwarding paths from the intermediate result registers back to the ALU inputs highlighted]
Forwarding works as follows:
– The ALU result from the EX/MEM register is always fed back to the ALU input latches
– If the forwarding hardware detects that the previous ALU operation has written the register corresponding to a source for the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file

Forward Data

                      Clock cycle number
                  1    2    3      4       5      6     7
  ADD R1,R2,R3    IF   ID   EXadd  MEMadd  WB
  SUB R4,R5,R1         IF   ID     EXsub   MEM    WB
  AND R6,R1,R7              IF     ID      EXand  MEM   WB

The first forwarding is of the value of R1, from EXadd to EXsub. The second forwarding is also of the value of R1, from MEMadd to EXand. This code can now be executed without stalls.

Forwarding can be generalized to include passing a result directly to the functional unit that requires it: a result is forwarded from the output of one unit to the input of another, rather than just from the output of a unit to the input of the same unit.
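The forwarding selection just described (take the EX/MEM result if the previous instruction wrote a source register, otherwise the older MEM/WB result, otherwise the register-file value) can be sketched in a few lines of Python. This is a toy model for illustration only; the function name and data layout are invented, not the lecture's hardware:

```python
# Toy sketch of forwarding-path selection: decide, for one source operand,
# whether the EX stage takes the register-file value or an in-flight result.
def select_operand(src, regfile, ex_mem, mem_wb):
    """ex_mem / mem_wb are (dest_reg, value) results still in the pipeline."""
    if ex_mem is not None and ex_mem[0] == src:   # newest result, EX/MEM latch
        return ex_mem[1]
    if mem_wb is not None and mem_wb[0] == src:   # older result, MEM/WB latch
        return mem_wb[1]
    return regfile[src]                           # no hazard: register file

regs = {"R1": 0, "R2": 7, "R3": 5, "R5": 20}
# ADD R1,R2,R3 has just left EX: R1 = 12 sits in the EX/MEM latch
ex_mem = ("R1", regs["R2"] + regs["R3"])
# SUB R4,R5,R1 now needs R1 -- the forwarded value, not the stale regfile 0
op_b = select_operand("R1", regs, ex_mem, None)
print(op_b)  # 12
```

With the ADD/SUB pair from the table above, the fragment selects the in-flight value 12 instead of the stale register-file value 0, which is exactly the situation the EX/MEM forwarding path handles.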
Without Forward

                      Clock cycle number
                  1    2    3      4      5      6      7    8    9
  ADD R1,R2,R3    IF   ID   EX     MEM    WB
  SUB R4,R5,R1         IF   stall  stall  IDsub  EX     MEM  WB
  AND R6,R1,R7              stall  stall  IF     IDand  EX   MEM  WB

Data Forwarding
– Data dependency arises when an instruction needs to use the result of one of its predecessors before that result has returned to the register file => pipeline hazards
– Forwarding paths allow results to be passed between stages as soon as they are available
– The 5-stage pipeline requires each of the three source operands to be forwarded from any of the intermediate result registers
– Still one load stall:
    LDR rN, […]
    ADD r2, r1, rN    ; use rN immediately
  – One stall
  – Compiler rescheduling can avoid it

Stalls are Required

                      Clock cycle number
                  1    2    3     4      5      6     7    8
  LDR R1,@(R2)    IF   ID   EX    MEM    WB
  SUB R4,R1,R5         IF   ID    EXsub  MEM    WB
  AND R6,R1,R7              IF    ID     EXand  MEM   WB
  OR  R8,R1,R9                    IF     ID     EX    MEM  WB

The load instruction has a delay, or latency, that cannot be eliminated by forwarding alone: R1 does not return from memory until the end of cycle 4, the same cycle in which EXsub would need it.

The Pipeline with One Stall

                      Clock cycle number
                  1    2    3     4      5      6    7    8    9
  LDR R1,@(R2)    IF   ID   EX    MEM    WB
  SUB R4,R1,R5         IF   ID    stall  EXsub  MEM  WB
  AND R6,R1,R7              IF    stall  ID     EX   MEM  WB
  OR  R8,R1,R9                    stall  IF     ID   EX   MEM  WB

The only forwarding necessary is for R1, from MEM to EXsub.
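The one-cycle load-use interlock above, and CPI figures like those quoted in the surrounding slides, can be approximated with a small Python sketch. It assumes the one-stall model just described (a load result used by the immediately following instruction costs one cycle) and ignores pipeline fill; the tuple encoding of instructions is mine:

```python
# Rough model of the load-use interlock: an LDR whose result is used by the
# very next instruction costs one stall cycle; everything else issues 1/cycle.
def count_cycles(program):
    """program: list of (op, dest, sources). Returns (cycles, CPI)."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev[0] == "LDR" and prev[1] in cur[2]:
            stalls += 1                      # one-cycle load-use interlock
    cycles = len(program) + stalls
    return cycles, cycles / len(program)

prog = [("LDR", "R1", ["R2"]),
        ("SUB", "R4", ["R1", "R5"]),         # uses R1 immediately -> 1 stall
        ("AND", "R6", ["R1", "R7"]),
        ("OR",  "R8", ["R1", "R9"])]
cycles, cpi = count_cycles(prog)
print(cycles, cpi)  # 5 1.25
```

Moving an independent instruction between the LDR and the SUB (compiler rescheduling) removes the stall and brings the CPI back to 1, which is the point of the "Optimal Pipelining" slide that follows.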
LDR Interlock
– In this example, it takes 7 clock cycles to execute 6 instructions, a CPI of 1.2
– An LDR instruction immediately followed by a data operation that uses the same register causes an interlock

Optimal Pipelining
– In this example, it takes 6 clock cycles to execute 6 instructions, a CPI of 1
– The LDR instruction does not cause the pipeline to interlock

LDM Interlock (1/2)
– In this example, it takes 8 clock cycles to execute 5 instructions, a CPI of 1.6
– During the LDM there are parallel memory and write-back cycles

LDM Interlock (2/2)
– In this example, it takes 9 clock cycles to execute 5 instructions, a CPI of 1.8
– The SUB incurs a further cycle of interlock because it uses the highest register specified in the LDM instruction

ARM7TDMI Processor Core
– Current low-end ARM core for applications like digital mobile phones
– TDMI:
  – T: Thumb, 16-bit compressed instruction set
  – D: on-chip Debug support, enabling the processor to halt in response to a debug request
  – M: enhanced Multiplier, yielding a full 64-bit result; high performance
  – I: EmbeddedICE hardware
– Von Neumann architecture
– 3-stage pipeline, CPI ~ 1.9

ARM7TDMI Block Diagram
[Block diagram: processor core (A[31:0]; D[31:0] split by a bus splitter into Din[31:0] and Dout[31:0]; control signals opc, r/w, mreq, trans, mas[1:0]), EmbeddedICE, JTAG TAP controller (TCK, TMS, TRST, TDI, TDO), scan chains 0-2]

ARM7TDMI Core Diagram
[Core diagram omitted]

ARM7TDMI Interface Signals (1/4)
[Signal diagram, grouped by function: clock control (mclk, wait, eclk); configuration (bigend); interrupts (irq, fiq, isync); initialization (reset); bus control (enin, enout, enouti, abe, ale, ape, dbe, tbe, busen, highz, busdis, ecapclk); debug (dbgrq, breakpt, dbgack, exec, extern0, extern1, dbgen, rangeout0, rangeout1, dbgrqi, commrx, commtx); coprocessor interface (opc, cpi, cpa, cpb); power (Vdd, Vss); memory interface (A[31:0], D[31:0], Din[31:0], Dout[31:0], bl[3:0], r/w, mas[1:0], mreq, seq, lock); MMU interface (trans, mode[4:0], abort); state (Tbit); TAP information and boundary scan extension; JTAG controls (TRST, TCK, TMS, TDI, TDO)]
ARM7TDMI Interface Signals (2/4)
– Clock control
  – All state changes within the processor are controlled by mclk, the memory clock
  – Internal clock = mclk AND \wait
  – The eclk clock output reflects the clock used by the core
– Memory interface
  – 32-bit address A[31:0]; bidirectional data bus D[31:0]; separate data out Dout[31:0] and data in Din[31:0]
  – \mreq indicates a cycle that requires memory access; \seq indicates that the memory address will be sequential to that used in the previous cycle

      mreq  seq   Cycle   Use
      0     0     N       Non-sequential memory access
      0     1     S       Sequential memory access
      1     0     I       Internal cycle - bus and memory inactive
      1     1     C       Coprocessor register transfer - memory inactive

ARM7TDMI Interface Signals (3/4)
  – lock indicates that the processor should keep the bus to ensure the atomicity of the read and write phases of a SWP instruction
  – \r/w: read or write
  – mas[1:0]: encodes the memory access size - byte, half-word or word
  – bl[3:0]: externally controlled enables on the latches on each of the 4 bytes of the data input bus
– MMU interface
  – \trans (translation control): 0 = user mode, 1 = privileged mode
  – \mode[4:0]: bottom 5 bits of the CPSR (inverted)
  – abort: disallows the access
– State
  – T bit: whether the processor is currently executing ARM or Thumb instructions
– Configuration
  – bigend: selects big-endian or little-endian operation

ARM7TDMI Interface Signals (4/4)
– Interrupts
  – \fiq: fast interrupt request, higher priority
  – \irq: normal interrupt request
  – isync: allows the interrupt synchronizer to be bypassed
– Initialization
  – \reset: starts the processor from a known state, executing from address 0x00000000

ARM7TDMI characteristics
  Process       0.35 um   Transistors  74,209     MIPS     60
  Metal layers  3         Core area    2.1 mm^2   Power    87 mW
  Vdd           3.3 V     Clock        0-66 MHz   MIPS/W   690

Memory Access
– The ARM7 is a Von Neumann, load/store architecture, i.e.
  – Only a 32-bit data bus for both instructions and data
  – Only the load/store instructions (and SWP) access memory
– Memory is addressed as a 32-bit address space
– Data types can be 8-bit bytes, 16-bit half-words or 32-bit words, and memory may be seen as a byte line folded into 4-byte words
– Words must be aligned to 4-byte boundaries, and half-words to 2-byte boundaries
– Always ensure that the memory controller supports all three access sizes

ARM Memory Interface
– Sequential (S cycle): (nMREQ, SEQ) = (0, 1)
  – The ARM core requests a transfer to or from an address which is either the same as, or one word or one half-word greater than, the preceding address
– Non-sequential (N cycle): (nMREQ, SEQ) = (0, 0)
  – The ARM core requests a transfer to or from an address which is unrelated to the address used in the preceding cycle
– Internal (I cycle): (nMREQ, SEQ) = (1, 0)
  – The ARM core does not require a transfer, as it is performing an internal function, and no useful prefetching can be performed at the same time
– Coprocessor register transfer (C cycle): (nMREQ, SEQ) = (1, 1)
  – The ARM core wishes to use the data bus to communicate with a coprocessor, but does not require any action by the memory system
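The four (nMREQ, SEQ) encodings above map directly onto the cycle types. As a quick sanity check, a minimal Python lookup; the dictionary layout and function name are mine, the encodings are from the slides:

```python
# Decode the (nMREQ, SEQ) signal pair into the four ARM7TDMI bus cycle types.
CYCLE_TYPES = {
    (0, 0): ("N", "non-sequential memory access"),
    (0, 1): ("S", "sequential memory access"),
    (1, 0): ("I", "internal cycle - bus and memory inactive"),
    (1, 1): ("C", "coprocessor register transfer - memory inactive"),
}

def cycle_type(nmreq, seq):
    """Return (letter, description) for one sampled (nMREQ, SEQ) pair."""
    return CYCLE_TYPES[(nmreq, seq)]

print(cycle_type(0, 1)[0])  # S
```

A memory controller can use exactly this classification: S cycles may exploit fast sequential access modes (e.g. DRAM page mode), while I and C cycles require no memory activity at all.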
Cached ARM7TDMI Macrocells
– ARM710T
  – 8K unified write-through cache
  – Full memory management unit supporting virtual memory
  – Write buffer
– ARM720T
  – As ARM710T but with WinCE support
– ARM740T
  – 8K unified write-through cache
  – Memory protection unit
  – Write buffer

ARM8
– Higher performance than ARM7
  – By increasing the clock rate
  – By reducing the CPI
    – Higher memory bandwidth: 64-bit wide memory (ARM8)
    – Separate memories for instruction and data accesses (ARM9TDMI, ARM10TDMI)

ARM8 Core Organization
[Diagram: double-bandwidth memory feeding a prefetch unit (addresses, PC, instructions), which feeds the integer unit; coprocessor instruction and data connections]
– The prefetch unit is responsible for fetching instructions from memory and buffering them (exploiting the double-bandwidth memory)
– It is also responsible for branch prediction, using static prediction based on the branch direction (backward: predicted 'taken'; forward: predicted 'not taken')

Pipeline Organization
5-stage; the prefetch unit occupies the 1st stage, and the integer unit occupies the remainder
(1) Instruction prefetch (prefetch unit)
(2) Instruction decode and register read
(3) Execute (shift and ALU)
(4) Data memory access
(5) Write back results (integer unit)

Integer Unit Organization
[Diagram: instruction decode, register read, multiplier, ALU/shifter, memory access and register write, with forwarding paths]

ARM8 Macrocell
– ARM810
  – 8 Kbyte unified instruction and data cache (double-bandwidth)
  – Copy-back
  – MMU
  – Coprocessor (CP15)
  – Write buffer
[Diagram: virtual address from the ARM8 prefetch and integer units into the cache and MMU; physical address, address buffer, data in/out, JTAG]

ARM9TDMI
– Harvard architecture
  – Increases available memory bandwidth
    – Instruction memory interface
    – Data memory interface
  – Simultaneous accesses to instruction and data memory can be achieved
– 5-stage pipeline
– Changes implemented to
  – Improve CPI to ~1.5
  – Improve maximum clock frequency

ARM9TDMI Organization
[Pipeline diagram: fetch (I-cache), decode (instruction decode, register read, immediate fields), execute (shift, ALU, load/store address calculation), buffer/data (D-cache), write-back (register write), with forwarding paths]

ARM9TDMI Pipeline Operations (1/2)
– ARM7TDMI: Fetch (instruction fetch); Decode (Thumb decompress, ARM decode, register read); Execute (shift/ALU, register write)
– ARM9TDMI: Fetch (instruction fetch); Decode (decode, register read); Execute (shift/ALU); Memory (data memory access); Write (register write)
– There is not sufficient slack time to translate Thumb instructions into ARM instructions and then decode; instead, the hardware decodes both ARM and Thumb instructions directly

ARM9TDMI Pipeline Operations (2/2)
– Coprocessor support
  – Coprocessors: floating-point, digital signal processing, special-purpose hardware accelerators
– On-chip debugger
  – Additional features compared to the ARM7TDMI
    – Hardware single-stepping
    – Breakpoints can be set on exceptions

ARM9TDMI characteristics
  Process       0.25 um   Transistors  110,000    MIPS     220
  Metal layers  3         Core area    2.1 mm^2   Power    150 mW
  Vdd           2.5 V     Clock        0-200 MHz  MIPS/W   1500

ARM9TDMI Macrocells (1/2)
– ARM920T
  – 2 x 16K caches
  – Full memory management unit supporting virtual addressing and memory protection
  – Write buffer
[Diagram: instruction and data caches with instruction and data MMUs, external coprocessor interface, CP15, EmbeddedICE and JTAG, write buffer, AMBA interface]

ARM9TDMI Macrocells (2/2)
– ARM940T
  – 2 x 4K caches
  – Memory protection unit
  – Write buffer
[Diagram: instruction and data caches, protection unit, external coprocessor interface, EmbeddedICE and JTAG, write buffer, AMBA interface]

ARM9E-S Family Overview
– ARM9E-S is based on an ARM9TDMI with the following extensions:
  – Single-cycle 32x16 multiplier implementation
  – EmbeddedICE logic RT
  – Improved ARM/Thumb interworking
  – New 32x16 and 16x16 multiply instructions
  – New count leading zeros instruction
  – New saturated math instructions
  – Architecture v5TE
– ARM946E-S
  – ARM9E-S core
  – Instruction and data caches, selectable sizes
  – Instruction and data RAMs, selectable sizes
  – Protection unit
  – AHB bus interface

ARM10TDMI (1/2)
– Current high-end ARM processor core
– Performance on the same IC process: the ARM10TDMI is about 2x the ARM9TDMI, which is about 2x the ARM7TDMI
– 300 MHz, 0.25 um CMOS
– Increased clock rate
[Pipeline diagram: Fetch (branch prediction, instruction fetch); Issue (decode); Decode (register read, address calculation); Execute (shift/ALU, multiply); Memory (data memory access, multiplier partials add); Write (data write, register write)]

ARM10TDMI (2/2)
– Reduce CPI
  – Branch prediction
  – Non-blocking load and store execution
  – 64-bit data memory → transfers 2 registers in each cycle

ARM1020T Overview
– Architecture v5T
  – The ARM1020E will be v5TE
– CPI ~ 1.3
– 6-stage pipeline
– Static branch prediction
– 32KB instruction and 32KB data caches
  – 'hit under miss' support
– 64 bits per cycle LDM/STM operations
– EmbeddedICE logic RT-II
– Support for the new VFPv1 architecture
– ARM10200 test chip
  – ARM1020T
  – VFP10
  – SDRAM memory interface
  – PLL

Summary (1/2)
– ARM7TDMI
  – Von Neumann architecture
  – 3-stage pipeline
  – CPI ~ 1.9
– ARM9TDMI, ARM9E-S
  – Harvard architecture
  – 5-stage pipeline
  – CPI ~ 1.5
– ARM10TDMI
  – Harvard architecture
  – 6-stage pipeline
  – CPI ~ 1.3

Summary (2/2)
– Cache
  – Direct-mapped cache
  – Set-associative cache
  – Fully associative cache
– Software Development
  – CodeWarrior
  – AXD

References
[1] http://twins.ee.nctu.edu.tw/courses/ip_core_02/index.html
[2] S. Furber, ARM System-on-Chip Architecture, Addison-Wesley Longman, ISBN 0-201-67519-6
[3] www.arm.com