02_ARM_Processor_Core_and_Instruction_Sets

Chapter 2 ARM Processor Core and Instruction Sets
Prof. Tian-Sheuan Chang
Institute of Electronics, National Chiao Tung University

Outline
• Processor programming model
• 32-bit instruction set
• 16-bit instruction set
• ARM processor core
• Software development

Processor Programming Model

ARM Ltd
• ARM was originally developed at Acorn Computers Limited of Cambridge, England, between 1983 and 1985
  – 1980, RISC concept at Stanford and Berkeley universities
  – first RISC processor for commercial use
• Nov 1990, ARM Ltd was founded
• ARM cores are licensed to partners who fabricate and sell to customers
• Supporting technologies assist design-in of ARM-based applications
  – software tools, boards, debug hardware, application software, bus architectures, peripherals, etc.

ARM Architecture vs. Berkeley RISC
• Features used
  – load/store architecture
  – fixed-length 32-bit instructions
  – 3-address instruction formats
• Features unused
  – register windows ⇒ costly; ARM uses shadow registers instead
  – delayed branch ⇒ not well suited to superscalar implementations; interacts badly with branch prediction
  – single-cycle execution of all instructions ⇒ most ARM instructions are single-cycle
    • memory accesses take multiple cycles when there are no separate data and instruction memories
    • ARM supports auto-indexing addressing modes

Data Size and Instruction Set
• ARM processor is a 32-bit architecture
• Most ARMs implement two instruction sets
  – 32-bit ARM instruction set
  – 16-bit Thumb instruction set

Data Types
• ARM processor supports 6 data types
  – 8-bit signed and unsigned bytes
  – 16-bit signed and unsigned half-words, aligned on 2-byte boundaries
  – 32-bit signed and unsigned words, aligned on 4-byte boundaries
• ARM instructions are all 32-bit words, word-aligned; Thumb instructions are half-words, aligned on 2-byte boundaries
• Internally all ARM operations are on 32-bit operands; the shorter data types are only supported by data transfer instructions. When a byte is loaded from memory, it is zero- or sign-extended to 32 bits
• ARM coprocessors support floating-point values

Programming Model
• Each instruction can be viewed as performing a defined transformation of the state
  – visible registers
  – invisible registers
  – system memory
  – user memory

Processor Modes
• ARM has seven basic operating modes
• Mode changes by software control or external interrupts

  CPSR[4:0]  Mode    Use                                         Registers
  10000      User    Normal user code                            user
  10001      FIQ     Processing fast interrupts                  _fiq
  10010      IRQ     Processing standard interrupts              _irq
  10011      SVC     Processing software interrupts (SWIs)       _svc
  10111      Abort   Processing memory faults                    _abt
  11011      Undef   Handling undefined instruction traps        _und
  11111      System  Running privileged operating system tasks   user

Privileged Modes
• Most programs operate in user mode.
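As a rough illustration of the Processor Modes table, the following Python sketch decodes CPSR[4:0] into a mode name. The bit patterns are copied from the table; the names `MODES`, `decode_mode` and `is_privileged` are hypothetical helpers invented for this example and do not belong to any ARM tool.

```python
# Mode encodings copied from the Processor Modes table (CPSR[4:0]).
MODES = {
    0b10000: "User",
    0b10001: "FIQ",
    0b10010: "IRQ",
    0b10011: "SVC",
    0b10111: "Abort",
    0b11011: "Undef",
    0b11111: "System",
}


def decode_mode(cpsr: int) -> str:
    """Return the mode name for a 32-bit CPSR value (hypothetical helper)."""
    # Only the low five bits select the mode; other encodings are not in the table.
    return MODES.get(cpsr & 0x1F, "reserved")


def is_privileged(cpsr: int) -> bool:
    """Every mode listed in the table except User is privileged."""
    return decode_mode(cpsr) not in ("User", "reserved")
```

For instance, a CPSR value of 0xD3 has low bits 10011, so `decode_mode(0xD3)` reports SVC mode.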
• ARM has other privileged operating modes, which are used to handle exceptions and supervisor calls (software interrupts), and a system mode
• Privileged modes have more access rights to memory systems and coprocessors
• Current operating mode is defined by CPSR[4:0]

Supervisor Mode
• Has some protected privileges
• System-level functions (transactions with the outside world) can be accessed through specified supervisor calls
• Usually implemented by software interrupt (SWI)

The Registers
• ARM has 37 registers, all of which are 32 bits long
  – 1 dedicated program counter
  – 1 dedicated current program status register
  – 5 dedicated saved program status registers
  – 30 general purpose registers
• The current processor mode governs which bank is accessible; each mode can access
  – a particular set of r0-r12 registers
  – a particular r13 (stack pointer, SP) and r14 (link register, LR)
  – the program counter, r15 (PC)
  – the current program status register, CPSR
• Privileged modes (except System) can also access
  – a particular SPSR (saved program status register)

Register Banking
[Figure: registers currently visible in User mode (r0-r12, r13 (sp), r14 (lr), r15 (pc), cpsr) next to the registers banked out for the other modes: FIQ has r8-r12, r13 (sp), r14 (lr) and spsr; IRQ, SVC, Undef and Abort each have r13 (sp), r14 (lr) and spsr]

Registers Organization Summary

  Mode    Shared with User mode    Banked registers
  User    r0-r14, r15 (pc), cpsr   none
  FIQ     r0-r7, r15, cpsr         r8-r12, r13 (sp), r14 (lr), spsr
  IRQ     r0-r12, r15, cpsr        r13 (sp), r14 (lr), spsr
  SVC     r0-r12, r15, cpsr        r13 (sp), r14 (lr), spsr
  Undef   r0-r12, r15, cpsr        r13 (sp), r14 (lr), spsr
  Abort   r0-r12, r15, cpsr        r13 (sp), r14 (lr), spsr

Thumb state: low registers are r0-r7, high registers are r8-r15
Note: System mode uses the User mode register set

Program Counter (r15)
• When the processor is executing in ARM state:
  – all instructions are 32 bits wide
  – all instructions must be word-aligned
  – therefore the PC value is stored in bits [31:2] with bits [1:0] undefined (as instructions cannot be halfword- or byte-aligned)
• When the processor is executing in Thumb state:
  – all instructions are 16 bits wide
  – all instructions must be halfword-aligned
  – therefore the PC value is stored in bits [31:1] with bit [0] undefined (as instructions cannot be byte-aligned)

Program Status Registers (CPSR)

  Bits    31 30 29 28 27   26 ... 8    7  6  5   4 ... 0
  Field   N  Z  C  V  Q    undefined   I  F  T   mode

  (f, s, x and c label the four byte fields [31:24], [23:16], [15:8] and [7:0])

• Condition code flags
  – N: Negative result from ALU
  – Z: Zero result from ALU
  – C: ALU operation Carried out
  – V: ALU operation oVerflowed
• Sticky overflow flag (Q flag)
  – architecture 5TE only
  – indicates if saturation has occurred during certain operations
• Interrupt disable bits
  – I = 1, disables the IRQ
  – F = 1, disables the FIQ
• T bit
  – architecture xT only
  – T = 0, processor in ARM state
  – T = 1, processor in Thumb state
• Mode bits
  – specify the processor mode

SPSRs
• Each privileged mode (except system mode) has associated with it a Saved Program Status Register, or SPSR
• This SPSR is used to save the state of the CPSR (Current Program Status Register) when the privileged mode is entered, so that the user state can be fully restored when the user process is resumed
• Often the SPSR may be untouched from the time the privileged mode is entered to the time it is used to restore the CPSR, but if the privileged software is to be re-entrant (for example, supervisor code making supervisor calls to itself) then the SPSR must be copied into a general register and saved

Exceptions
• Exceptions are usually used to handle unexpected events which arise during the execution of a program, such as interrupts or memory faults; the term also covers software interrupts, undefined instruction traps, and the system reset
• Three groups:
  1. generated as the direct effect of executing an instruction: software interrupts, undefined instructions, prefetch aborts (memory fault)
  2. generated as the side-effect of an instruction: data aborts
  3. generated externally: reset, IRQ, FIQ

Exception Entry (1/2)
• When an exception arises, ARM completes the current instruction as best it can (except that a reset exception terminates the current instruction immediately) and then departs from the current instruction sequence to handle the exception, which starts from a specific location (exception vector)
• Processor performs the following sequence
  – change to the operating mode corresponding to the particular exception
  – save the address of the instruction following the exception entry instruction in r14 of the new mode
  – save the old value of CPSR in the SPSR of the new mode
  – disable IRQs by setting bit 7 (I) of the CPSR, and if the exception is a fast interrupt, disable further fast interrupts by setting bit 6 (F) of the CPSR

Exception Entry (2/2)
  – force the PC to begin executing at the relevant vector address

  Exception                                         Mode    Vector address
  Reset                                             SVC     0x00000000
  Undefined instruction                             UND     0x00000004
  Software interrupt (SWI)                          SVC     0x00000008
  Prefetch abort (instruction fetch memory fault)   Abort   0x0000000C
  Data abort (data access memory fault)             Abort   0x00000010
  IRQ (normal interrupt)                            IRQ     0x00000018
  FIQ (fast interrupt)                              FIQ     0x0000001C

• Normally the vector address contains a branch to the relevant routine, though the FIQ code can start immediately
• Two banked registers in each of the privileged modes are used to hold the return address and stack pointer

Exception Return (1/3)
• Once the exception has been handled, the user task is normally resumed
• The sequence is
  – any modified user registers must be restored from the handler's stack
  – CPSR must be restored from the appropriate SPSR
  – PC must be changed back to the relevant instruction address
• The last two steps happen atomically as part of a single instruction

Exception Return (2/3)
• When the return address has been kept in the banked r14
  – to return from a SWI or undefined instruction trap
      MOVS PC,r14
  – to return from an IRQ, FIQ or prefetch abort
      SUBS PC,r14,#4
  – to return from a data abort to retry the data access
      SUBS PC,r14,#8
  – ‘S’ signifies the special form of the instruction when the destination register is the PC (the CPSR is restored from the SPSR)

Exception Return (3/3)
• When the return address has been saved onto a stack
      LDMFD r13!,{r0-r3,PC}^ ;restore and return
  – ‘^’ indicates that this is a special form of the instruction: the CPSR is restored at the same time that the PC is loaded from memory, which will always be the last item transferred from memory since the registers are loaded in increasing order

Exception Priorities
• Priority order
  1. reset (highest priority)
  2. data abort
  3. FIQ
  4. IRQ
  5. prefetch abort
  6. SWI, undefined instruction

Memory Organization
(a) Little-endian memory organization (byte addresses; bit 31 at the left, bit 0 at the right)

  19 18 17 16   word16
  15 14 13 12   half-word14, half-word12
  11 10  9  8   word8
   7  6  5  4   byte6 (address 6), half-word4 (addresses 5-4)
   3  2  1  0   byte3, byte2, byte1, byte0

(b) Big-endian memory organization (byte addresses; bit 31 at the left, bit 0 at the right)

  16 17 18 19   word16
  12 13 14 15   half-word12, half-word14
   8  9 10 11   word8
   4  5  6  7   byte5 (address 5), half-word6 (addresses 6-7)
   0  1  2  3   byte0, byte1, byte2, byte3

• Word and half-word alignment (address ends in xxxx00 or xxxxx0)
• ARM can be set up to access data in either little-endian or big-endian format

Features of the ARM Instruction Set
• Load-store architecture
  – process values which are in registers
  – load, store instructions for memory data accesses
• 3-address data processing instructions
• Conditional execution of every instruction
• Load and store multiple registers
• Shift, ALU operation in a single instruction
• Open instruction set extension through the coprocessor instructions
• Very dense 16-bit compressed instruction set (Thumb)

Coprocessors
[Figure: ARM core, Coprocessor X and Coprocessor Y, each with F, D, E pipeline stages, sharing the data bus and linked by handshaking signals]
• Up to 16 coprocessors can be defined
• Expands the ARM instruction set
• ARM uses them for “internal functions” so as not to enforce a particular memory map (e.g. cp15 is the ARM cache controller)
• Usually better for system designers to use memory-mapped peripherals, which are easier to implement

Thumb
• Thumb is a 16-bit instruction set
  – optimized for code density from C code
  – improved performance from narrow memory
  – subset of the functionality of the ARM instruction set
• Core has two execution states, ARM and Thumb
  – switch between them using the BX instruction
• Example: the 32-bit ARM instruction ADDS r2,r2,#1 corresponds to the 16-bit Thumb instruction ADD r2,#1
• For most instructions generated by the compiler:
  – conditional execution is not used
  – source and destination registers are identical
  – only low registers are used
  – constants are of limited size
  – inline barrel shifter is not used

Average Thumb Code Sizes
[Figure not included in this text]

ARM and Thumb Performance
[Figure not included in this text]

I/O System
• ARM handles input/output peripherals as memory-mapped devices with interrupt support
• Internal registers in I/O devices appear as addressable locations within ARM's memory map, read and written using load-store instructions
• Interrupt by normal interrupt (IRQ) or fast interrupt (FIQ, higher priority); the input signals are level-sensitive and maskable
• May include Direct Memory Access (DMA) hardware

ARM Exceptions
• Supports interrupts, traps, supervisor calls
• When an exception occurs, the ARM:
  – copies CPSR into SPSR_<mode>
  – sets appropriate CPSR bits
    • if core currently in Thumb state then ARM state is entered
    • mode field bits
    • interrupt disable bits (if appropriate)
  – stores the return address in LR_<mode>
  – sets PC to vector address
• To return, exception handler needs to:
  – restore CPSR from SPSR_<mode>
  – restore PC from LR_<mode>
  This can only be done in ARM state
• Vector table

  0x1C   FIQ
  0x18   IRQ
  0x14   (Reserved)
  0x10   Data Abort
  0x0C   Prefetch Abort
  0x08   Software Interrupt
  0x04   Undefined Instruction
  0x00   Reset

  Vector table can be at 0xFFFF0000 on ARM720T and on ARM9/10 family devices

ARM Exceptions
• Exception handlers use r13_<mode>, which will normally have been initialized to point to a dedicated stack in memory, to save some user registers for use as work registers

ARM Processor Cores
• ARM processor core + cache + MMU → ARM CPU cores
• ARM6 → ARM7 (3 V operation, 50-100 MHz for 0.25 µm or 0.18 µm)
  – T: Thumb 16-bit compressed instruction set
  – D: on-chip Debug support, enabling the processor to halt in response to a debug request
  – M: enhanced Multiplier, 64-bit result
  – I: EmbeddedICE hardware, giving on-chip breakpoint and watchpoint support

ARM Processor Cores
• ARM8 → ARM9 → ARM10
• ARM9
  – 5-stage pipeline (130 MHz or 200 MHz)
  – using separate instruction and data memory ports
• ARM10 (Oct. 1998)
  – high performance, 300 MHz
  – multimedia digital consumer applications
  – optional vector floating-point unit

ARM Architecture Versions (1/5)
• Version 1
  – the first ARM processor, developed at Acorn Computers Limited 1983-1985
  – 26-bit addressing, no multiply or coprocessor support
• Version 2
  – sold in volume in the Acorn Archimedes
  – 26-bit addressing, including 32-bit result multiply and coprocessor support
• Version 2a
  – coprocessor 15 as the system control coprocessor to manage cache
  – adds the atomic load-store (SWP) instruction

ARM Architecture Versions (2/5)
• Version 3
  – first ARM processor designed by ARM Limited (1990)
  – ARM6 (macrocell), ARM60 (stand-alone processor), ARM600 (an integrated CPU with on-chip cache, MMU, write buffer), ARM610 (used in Apple Newton)
  – 32-bit addressing, separate CPSR and SPSRs
  – adds the undefined and abort modes to allow coprocessor emulation and virtual memory support in supervisor mode
• Version 3M
  – introduces the signed and unsigned multiply and multiply-accumulate instructions that generate the full 64-bit result

ARM Architecture Versions (3/5)
• Version 4
  – adds the signed and unsigned half-word and signed byte load and store instructions
  – reserves some of the SWI space for architecturally defined operations
  – system mode is introduced
• Version 4T
  – 16-bit Thumb compressed form of the instruction set is introduced

ARM Architecture Versions (4/5)
• Version 5T
  – introduced recently, a superset of version 4T adding the BLX, CLZ and BRK instructions
• Version 5TE
  – adds the signal processing instruction set extension

ARM Architecture Versions (5/5)

  Core                                   Architecture
  ARM1                                   v1
  ARM2                                   v2
  ARM2as, ARM3                           v2a
  ARM6, ARM600, ARM610                   v3
  ARM7, ARM700, ARM710                   v3
  ARM7TDMI, ARM710T, ARM720T, ARM740T    v4T
  StrongARM, ARM8, ARM810                v4
  ARM9TDMI, ARM920T, ARM940T             v4T
  ARM9E-S                                v5TE
  ARM10TDMI, ARM1020E                    v5TE

32-bit Instruction Set

• ARM assembly language program
  – ARM development board or ARM emulator
• ARM instruction set
  – standard ARM instruction set
  – a compressed form of the instruction set: a subset of the full ARM instruction set is encoded into 16-bit instructions (the Thumb instruction set)
  – some ARM cores support instruction set extensions to enhance signal processing capabilities

Instructions
• Data processing instructions
• Data transfer instructions
• Control flow instructions

ARM Instruction Set Summary (1/4)

  Mnemonic  Instruction                   Action
  ADC       Add with carry                Rd:=Rn+Op2+Carry
  ADD       Add                           Rd:=Rn+Op2
  AND       AND                           Rd:=Rn AND Op2
  B         Branch                        R15:=address
  BIC       Bit Clear                     Rd:=Rn AND NOT Op2
  BL        Branch with Link              R14:=R15, R15:=address
  BX        Branch and Exchange           R15:=Rn, T bit:=Rn[0]
  CDP       Coprocessor Data Processing   (Coprocessor-specific)
  CMN       Compare Negative              CPSR flags:=Rn+Op2
  CMP       Compare                       CPSR flags:=Rn-Op2

ARM Instruction Set Summary (2/4)

  Mnemonic  Instruction                                      Action
  EOR       Exclusive OR                                     Rd:=Rn EOR Op2
  LDC       Load Coprocessor from memory                     (Coprocessor load)
  LDM       Load multiple registers                          Stack manipulation (Pop)
  LDR       Load register from memory                        Rd:=(address)
  MCR       Move CPU register to coprocessor register        cRn:=rRn{<op>cRm}
  MLA       Multiply Accumulate                              Rd:=(Rm*Rs)+Rn
  MOV       Move register or constant                        Rd:=Op2
  MRC       Move from coprocessor register to CPU register   rRn:=cRn{<op>cRm}
  MRS       Move PSR status/flags to register                Rn:=PSR
  MSR       Move register to PSR status/flags                PSR:=Rm

ARM Instruction Set Summary (3/4)

  Mnemonic  Instruction                            Action
  MUL       Multiply                               Rd:=Rm*Rs
  MVN       Move negative register                 Rd:=~Op2
  ORR       OR                                     Rd:=Rn OR Op2
  RSB       Reverse Subtract                       Rd:=Op2-Rn
  RSC       Reverse Subtract with Carry            Rd:=Op2-Rn-1+Carry
  SBC       Subtract with Carry                    Rd:=Rn-Op2-1+Carry
  STC       Store coprocessor register to memory   address:=cRn
  STM       Store Multiple                         Stack manipulation (Push)

ARM Instruction Set Summary (4/4)

  Mnemonic  Instruction                 Action
  STR       Store register to memory    <address>:=Rd
  SUB       Subtract                    Rd:=Rn-Op2
  SWI       Software Interrupt          OS call
  SWP       Swap register with memory   Rd:=[Rn], [Rn]:=Rm
  TEQ       Test bitwise equality       CPSR flags:=Rn EOR Op2
  TST       Test bits                   CPSR flags:=Rn AND Op2

ARM Instruction Set Format
[Figure not included in this text]

Data Processing Instructions
• Consist of
  – arithmetic (ADD, SUB, RSB)
  – logical (BIC, AND)
  – compare (CMP, TST)
  – register movement (MOV, MVN)
• All operands are 32 bits wide; they come from registers or are specified as literals in the instruction itself
• Second operand sent to ALU via barrel shifter
• 32-bit result placed in register; long multiply instructions produce a 64-bit result
• 3-address instruction format

Conditional Execution
• Most instruction sets only allow branches to be executed conditionally.
• However, by reusing the condition evaluation hardware, ARM effectively increases the number of instructions.
  – all instructions contain a condition field which determines whether the CPU will execute them
  – non-executed instructions still take up 1 cycle
    • to allow other stages in the pipeline to complete
• This reduces the number of branches which would stall the pipeline
  – allows very dense in-line code
  – the time penalty of not executing several conditional instructions is frequently less than the overhead of the branch or subroutine call that would otherwise be needed

Conditional Execution
• The cond field occupies bits [31:28] of the instruction
• Each of the 16 values causes the instruction to be executed or skipped according to the N, Z, C, V flags in the CPSR

Using and Updating the Condition Field
• To execute an instruction conditionally, simply postfix it with the appropriate condition:
  – for example, an add instruction takes the form:
      ADD r0,r1,r2 ;r0:=r1+r2 (ADDAL)
  – to execute this only if the zero flag is set:
      ADDEQ r0,r1,r2 ;r0:=r1+r2 iff zero flag is set
• By default, data processing operations do not affect the condition flags
  – with comparison instructions this is the only effect
• To cause the condition flags to be updated, the S bit of the instruction needs to be set by postfixing the instruction (and any condition code) with an “S”.
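The slides do not list the sixteen condition codes themselves. As a hedged sketch, the Python model below evaluates the standard ARM condition mnemonics against the N, Z, C and V flags. The mnemonic-to-flag mapping comes from the ARM architecture rather than from these slides, and `CONDITIONS` and `condition_passed` are illustrative names, not a real API.

```python
# Standard ARM condition mnemonics, indexed by the 4-bit cond field [31:28].
CONDITIONS = ["EQ", "NE", "CS", "CC", "MI", "PL", "VS", "VC",
              "HI", "LS", "GE", "LT", "GT", "LE", "AL", "NV"]


def condition_passed(cond: int, n: bool, z: bool, c: bool, v: bool) -> bool:
    """Decide whether an instruction with this cond field would execute."""
    checks = {
        "EQ": z,               "NE": not z,        # equal / not equal
        "CS": c,               "CC": not c,        # carry set / clear
        "MI": n,               "PL": not n,        # negative / positive or zero
        "VS": v,               "VC": not v,        # overflow / no overflow
        "HI": c and not z,     "LS": (not c) or z, # unsigned higher / lower or same
        "GE": n == v,          "LT": n != v,       # signed >= / <
        "GT": (not z) and n == v,
        "LE": z or n != v,
        "AL": True,                                # always (the default)
        "NV": False,  # reserved "never" encoding; treated as skipped in this sketch
    }
    return checks[CONDITIONS[cond & 0xF]]
```

With this model, ADDEQ (cond = 0b0000) executes only when Z is set, while a plain ADD (ADDAL, cond = 0b1110) always executes.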
• For example, to add two numbers and set the condition flags:
      ADDS r0,r1,r2 ;r0:=r1+r2 and set flags

Data Processing Instructions
• Simple register operands
• Immediate operands
• Shifted register operands
• Multiply

Simple Register Operands (1/2)
• Arithmetic operations
      ADD r0,r1,r2 ;r0:=r1+r2
      ADC r0,r1,r2 ;r0:=r1+r2+C
      SUB r0,r1,r2 ;r0:=r1-r2
      SBC r0,r1,r2 ;r0:=r1-r2+C-1
      RSB r0,r1,r2 ;r0:=r2-r1, reverse subtraction
      RSC r0,r1,r2 ;r0:=r2-r1+C-1
  – by default, data processing operations do not affect the condition flags
• Bit-wise logical operations
      AND r0,r1,r2 ;r0:=r1 AND r2
      ORR r0,r1,r2 ;r0:=r1 OR r2
      EOR r0,r1,r2 ;r0:=r1 XOR r2
      BIC r0,r1,r2 ;r0:=r1 AND (NOT r2), bit clear

Simple Register Operands (2/2)
• Register movement operations
  – omit the 1st source operand from the format
      MOV r0,r2 ;r0:=r2
      MVN r0,r2 ;r0:=NOT r2, move 1's complement
• Comparison operations
  – do not produce a result; omit the destination from the format
  – just set the condition code bits (N, Z, C and V) in CPSR
      CMP r1,r2 ;set cc on r1-r2, compare
      CMN r1,r2 ;set cc on r1+r2, compare negated
      TST r1,r2 ;set cc on r1 AND r2, bit test
      TEQ r1,r2 ;set cc on r1 XOR r2, test equal

Immediate Operands
• Replace the second source operand with an immediate operand, which is a literal constant, preceded by “#”
      ADD r3,r3,#1 ;r3:=r3+1
      AND r8,r7,#&FF ;r8:=r7[7:0], &: hexadecimal
• Immediate = (0~255) * 2^(2n), where n is a 4-bit value (0-15)

Shifted Register Operands
• ADD r3,r2,r1,LSL #3 ;r3:=r2+8*r1
  – a single instruction executed in a single cycle
• LSL: Logical Shift Left by 0 to 31 places, 0 filled at the lsb end
• LSR, ASL (Arithmetic Shift Left), ASR, ROR (Rotate Right), RRX (Rotate Right eXtended by 1 place)
• ADD r5,r5,r3,LSL r2 ;r5:=r5+r3*2^r2
• MOV r12,r4,ROR r3 ;r12:=r4 rotated right by value of r3

Using the Barrel Shifter: the 2nd Operand
[Figure: Operand 1 goes directly to the ALU; Operand 2 passes through the barrel shifter before reaching the ALU, which produces the result]
• Register, optionally with shift operation applied
  – shift value can be either:
    • 5-bit unsigned integer
    • specified in bottom byte of another register
  – used for multiplication by constant
• Immediate value
  – 8-bit number, with a range of 0-255
    • rotated right through even number of positions
  – allows increased range of 32-bit constants to be loaded directly into registers

Multiply
• Multiply
      MUL r4,r3,r2 ;r4:=(r3*r2)[31:0]
• Multiply-Accumulate
      MLA r4,r3,r2,r1 ;r4:=(r3*r2+r1)[31:0]

Multiplication by a Constant
• Multiplication by a constant equal to ((power of 2) +/- 1) can be done in a single cycle
  – using MOV, ADD or RSB with an inline shift
• Example: r0=r1*5, i.e. r0=r1+(r1*4)
      ADD r0,r1,r1,LSL #2
• Can combine several instructions to carry out other multiplies
• Example: r2=r3*119 = r3*17*7 = r3*(16+1)*(8-1)
      ADD r2,r3,r3,LSL #4 ;r2:=r3*17
      RSB r2,r2,r2,LSL #3 ;r2:=r2*7

Data Processing Instructions (1/3)
• <op>{<cond>}{S} Rd,Rn,#<32-bit immediate>
• <op>{<cond>}{S} Rd,Rn,Rm,{<shift>}
  – omit Rn when the instruction is monadic (MOV, MVN)
  – omit Rd when the instruction is a comparison, producing only condition code outputs (CMP, CMN, TST, TEQ)
  – <shift> specifies the shift type (LSL, LSR, ASL, ASR, ROR or RRX) and, in all cases but RRX, the shift amount, which may be a 5-bit immediate (#<#shift>) or a register Rs
• 3-address format
  – 2 source operands and 1 destination register
  – one source is always a register, the second may be a register, a shifted register or an immediate value

Data Processing Instructions (2/3)
[Figure not included in this text]

Data Processing Instructions (3/3)
• Allows direct control of whether or not the condition codes are affected, via the S bit (condition codes unchanged when S=0)
  – N=1 if the result is negative; 0 otherwise (i.e. N=bit 31 of the result)
  – Z=1 if the result is zero; 0 otherwise
  – C=carry out from the ALU for ADD, ADC, SUB, SBC, RSB, RSC, CMP, CMN; otherwise carry out from the shifter
  – V=1 if overflow from bit 30 to bit 31; 0 if no overflow (V is preserved in non-arithmetic operations)
• PC may be used as a source operand (address of the instruction plus 8) except when a register-specified shift amount is used
• PC may be specified as the destination register; the instruction is then a form of branch (e.g. return from a subroutine)

Examples
• ADD r5,r1,r3
• ADD Rs,PC,#offset ;PC is ADD address+8
• Decrement r2 and check for zero
      SUBS r2,r2,#1 ;dec r2 and set cc
      BEQ LABEL
      …
• Multiply r0 by 5
      ADD r0,r0,r0,LSL #2
• A subroutine to multiply r0 by 10
              MOV r0,#3
              BL TIMES10
              ……
      TIMES10 MOV r0,r0,LSL #1 ;*2
              ADD r0,r0,r0,LSL #2 ;*5
              MOV PC,r14 ;return
• Add a 64-bit integer in r1, r0 to one in r3, r2
      ADDS r2,r2,r0
      ADC r3,r3,r1

Multiply Instructions (1/2)
• 32-bit product (least significant)
      MUL{<cond>}{S} Rd,Rm,Rs
      MLA{<cond>}{S} Rd,Rm,Rs,Rn
• 64-bit product
      <mul>{<cond>}{S} RdLo,RdHi,Rm,Rs
      <mul> is (UMULL, UMLAL, SMULL, SMLAL)

Multiply Instructions (2/2)
• Accumulation is denoted by “+=”
• Example: form a scalar product of two vectors
              MOV r11,#20 ;initialize loop counter
              MOV r10,#0 ;initialize total
      Loop    LDR r0,[r8],#4 ;get first component
              LDR r1,[r9],#4 ;get second component
              MLA r10,r0,r1,r10 ;accumulate product
              SUBS r11,r11,#1 ;decrement loop counter
              BNE Loop

Count Leading Zeros (CLZ-V5T only)
• CLZ{<cond>} Rd,Rm
  – set Rd to the number of zero bits above the most significant 1 in Rm (the count of leading zeros); if Rm is zero, Rd=32
  – useful for renormalizing numbers
• Example
      MOV r0,#&100
      CLZ r1,r0 ;r1:=23
• Example
      CLZ r1,r2
      MOVS r2,r2,LSL r1

Data Transfer Instructions
• Three basic forms to move data between ARM registers and memory
  – single register load and store instruction
    • a byte, a 16-bit half-word, a 32-bit word
  – multiple register load and store instruction
    • to save or restore workspace registers for procedure entry and exit
    • to copy blocks of data
  – single register swap instruction
    • a value in a register to be exchanged with a value in memory
    • to implement semaphores to ensure mutual exclusion on accesses

Single Register Data Transfer
• Word transfer: LDR / STR
• Byte transfer: LDRB / STRB
• Halfword transfer: LDRH / STRH
• Load signed byte or halfword (load value and sign-extend to 32 bits): LDRSB / LDRSH
• All of these can be conditionally executed by inserting the appropriate condition code after STR/LDR, e.g. LDREQB

Addressing
• Register-indirect addressing
• Base-plus-offset addressing
  – base register: r0-r15
  – offset, add or subtract an unsigned number:
    • immediate
    • register (not PC)
    • scaled register (only available for word and unsigned byte instructions)
• Stack addressing
• Block-copy addressing

Register-indirect Addressing
• Use a value in one register (base register) as a memory address
      LDR r0,[r1] ;r0:=mem32[r1]
      STR r0,[r1] ;mem32[r1]:=r0
• Other forms
  – adding immediate or register offsets to the base address

Initializing an Address Pointer
• A small offset to the program counter, r15
  – ARM assembler has a “pseudo” instruction, ADR
• As an example, a program which must copy data from TABLE1 to TABLE2, both of which are near to the code
      COPY    ADR r1,TABLE1 ;r1 points to TABLE1
              ADR r2,TABLE2 ;r2 points to TABLE2
              …
      TABLE1  … ;<source>
      TABLE2  … ;<destination>

Single Register Load and Store
• A base register, an offset which may be another register or an immediate value
      Copy    ADR r1,TABLE1
              ADR r2,TABLE2
      Loop    LDR r0,[r1]
              STR r0,[r2]
              ADD r1,r1,#4
              ADD r2,r2,#4
              ???
              …
      TABLE1  …
      TABLE2  …

Base-plus-offset Addressing (1/3)
• Pre-indexing
      LDR r0,[r1,#4] ;r0:=mem32[r1+4]
  – offset up to 4K, added or subtracted (#-4)
• Post-indexing
      LDR r0,[r1],#4 ;r0:=mem32[r1], r1:=r1+4
  – equivalent to a simple register-indirect load followed by an add, but faster and with less code space
• Auto-indexing
      LDR r0,[r1,#4]! ;r0:=mem32[r1+4], r1:=r1+4
  – no extra time, auto-indexing performed while the data is being fetched from memory

Base-plus-offset Addressing (2/3)
[Figure: with r0=0x5 and base register r1=0x200, pre-indexed STR r0,[r1,#12] stores 0x5 at 0x20c and leaves r1=0x200 (the auto-update form STR r0,[r1,#12]! sets r1 to 0x20c); post-indexed STR r0,[r1],#12 stores 0x5 at 0x200 and then updates r1 to 0x20c]

Base-plus-offset Addressing (3/3)
• Copy
      Copy    ADR r1,TABLE1
              ADR r2,TABLE2
      Loop    LDR r0,[r1],#4
              STR r0,[r2],#4
              ???
              …
      TABLE1  …
      TABLE2  …
• A single unsigned byte load
      LDRB r0,[r1] ;r0:=mem8[r1]
  – signed bytes and 16-bit half-words are also supported

Loading Constants (1/2)
• No single ARM instruction can load a 32-bit immediate constant directly into a register
  – all ARM instructions are 32 bits long
  – ARM instructions do not use the instruction stream as data
• The data processing instruction format has 12 bits available for operand 2
  – if used directly, this would only give a range of 4096
• Instead it is used to store 8-bit constants, giving a range of 0~255
• These 8 bits can then be rotated right through an even number of positions (i.e. RORs by 0,2,4,…,30)
• This gives a much larger range of constants that can be directly loaded, though some constants will still need to be loaded from memory

Loading Constants (2/2)
• This gives us:
  – 0~255 [0-0xff]
  – 256,260,264,…,1020 [0x100-0x3fc, step 4, 0x40-0xff ror 30]
  – 1024,1040,…,4080 [0x400-0xff0, step 16, 0x40-0xff ror 28]
  – 4096,4160,…,16320 [0x1000-0x3fc0, step 64, 0x40-0xff ror 26]
• To load a constant, simply move the required value into a register; the assembler will convert to the rotate form for us
  – MOV r0,#4096 ;MOV r0,#0x1000 (0x40 ror 26)
• The bitwise complements can also be formed using MVN:
  – MOV r0,#0xFFFFFFFF ;MVN r0,#0
• Values that cannot be generated in this way will cause an error

Loading 32-bit Constants
• To allow larger constants to be loaded, the assembler offers a pseudo-instruction:
      LDR rd,=const
• This will either:
  – produce a MOV or MVN instruction to generate the value (if possible), or
  – generate an LDR instruction with a PC-relative address to read the constant from a literal pool (constant data area embedded in the code)
• For example
      LDR r0,=0xFF ;MOV r0,#0xFF
      LDR r0,=0x55555555 ;LDR r0,[PC,#Imm10]
• As this mechanism will always generate the best instruction for a given case, it is the recommended way of loading constants.

Multiple Register Data Transfer (1/2)
• The load and store multiple instructions (LDM/STM) allow between 1 and 16 registers to be transferred to or from memory
  – order of register transfer cannot be specified; order in the list is insignificant
  – lowest register number is always transferred to/from lowest memory location accessed
• The transferred registers can be either
  – any subset of the current bank of registers (default)
  – any subset of the user mode bank of registers when in a privileged mode (postfix instruction with a “^”)
• Base register used to determine where memory access should occur
  – 4 different addressing modes
  – base register can be optionally updated following the transfer (using “!”)
• These instructions are very efficient for
  – moving blocks of data around memory
  – saving and restoring context (stack)

Multiple Register Data Transfer (2/2)
• Allow any subset (or all, r0 to r15) of the 16 registers to be transferred with a single instruction
      LDMIA r1,{r0,r2,r5} ;r0:=mem32[r1]
                          ;r2:=mem32[r1+4]
                          ;r5:=mem32[r1+8]

Stack Processing
• A stack is usually implemented as a linear data structure which grows up (an ascending stack) or down (a descending stack) memory
• A stack pointer holds the address of the current top of the stack, either by pointing to the last valid data item pushed onto the stack (a full stack), or by pointing to the vacant slot where the next data item will be placed (an empty stack)
• ARM multiple register transfer instructions support all four forms of stacks
  – full ascending: grows up; base register points to the highest address containing a valid item
  – empty ascending: grows up; base register points to
the first empty location above the stack
  – full descending: grows down; base register points to the lowest address containing a valid item
  – empty descending: grows down; base register points to the first empty location below the stack

Block Copy Addressing (1/2)
• Addressing modes
    [Figure: STMIA, STMIB, STMDA and STMDB r9!,{r0,r1,r5} with r9 = 0x100c before the transfer; r9' = 0x1018 after the incrementing forms and 0x1000 after the decrementing forms]
• Block-copy and stack names for the same addressing modes
    Increment Before:  STMIB = STMFA (full ascending)     LDMIB = LDMED (empty descending)
    Increment After:   STMIA = STMEA (empty ascending)    LDMIA = LDMFD (full descending)
    Decrement Before:  STMDB = STMFD (full descending)    LDMDB = LDMEA (empty ascending)
    Decrement After:   STMDA = STMED (empty descending)   LDMDA = LDMFA (full ascending)

Block Copy Addressing (2/2)
• Copy 8 words from the location r0 points to, to the location r1 points to
    LDMIA r0!,{r2-r9}
    STMIA r1,{r2-r9}
  – r0 increased by 32, r1 unchanged
• If r2 to r9 contained useful values, preserve them by pushing them onto a stack
    STMFD r13!,{r2-r9}
    LDMIA r0!,{r2-r9}
    STMIA r1,{r2-r9}
    LDMFD r13!,{r2-r9}
  – FD postfix: full descending stack addressing mode

Memory Block Copy
• The direction that the base pointer moves through memory is given by the postfix to the STM/LDM instruction
  – STMIA/LDMIA: Increment After
  – STMIB/LDMIB: Increment Before
  – STMDA/LDMDA: Decrement After
  – STMDB/LDMDB: Decrement Before
• For example
            ;r12 points to start of source data
            ;r14 points to end of source data
            ;r13 points to start of destination data
    Loop    LDMIA r12!,{r0-r11}     ;load 48 bytes
            STMIA r13!,{r0-r11}     ;and store them
            CMP r12,r14             ;check for the end
            BNE Loop                ;and loop until done
  – this loop transfers 48 bytes in 31 cycles
  – over 50Mbytes/sec at 33MHz
    [Figure: source block from r12 to r14 copied to the destination at r13, memory addresses increasing]

Single Word and Unsigned Byte Data Transfer Instructions
• Pre-indexed form
    LDR|STR{<cond>}{B} Rd,[Rn,<offset>]{!}
• Post-indexed form
    LDR|STR{<cond>}{B}{T} Rd,[Rn],<offset>
• PC-relative form
    LDR|STR{<cond>}{B} Rd,LABEL
  – LDR: ‘load register’; STR: ‘store register’
  – ‘B’: unsigned byte transfer; the default is word
  – <offset> may be # +/-<12-bit immediate> or +/- Rm{,shift}
  – ‘!’: auto-indexing
  – the T flag selects the user view of the memory translation and protection system

Example
• Store a byte in r0 to a peripheral
            LDR r1,UARTADD
            STRB r0,[r1]            ;store data to UART
            …
    UARTADD &  &10000000

Half-word and Signed Byte Data Transfer Instructions
• Pre-indexed form
    LDR|STR{<cond>}H|SH|SB Rd,[Rn,<offset>]{!}
• Post-indexed form
    LDR|STR{<cond>}H|SH|SB Rd,[Rn],<offset>
  – <offset> is # +/-<8-bit immediate> or +/-Rm
  – H|SH|SB selects the data type: unsigned half-word, signed half-word or signed byte. Word and unsigned byte transfers use the single word and unsigned byte forms instead

Example
• Expand an array of signed half-words into an array of words
            ADR r1,ARRAY1           ;half-word array start
            ADR r2,ARRAY2           ;word array start
            ADR r3,ENDARR1          ;ARRAY1 end+2
    Loop    LDRSH r0,[r1],#2        ;get signed half-word
            STR r0,[r2],#4          ;save word
            CMP r1,r3               ;check for end of array
            BLT Loop                ;if not finished, loop

Multiple Register Transfer Instructions
• LDM|STM{<cond>}<add mode> Rn{!}, <registers>
  – <add mode> specifies one of the addressing modes; ‘!’: auto-indexing; <registers> a list of registers, e.g.
{r0,r3-r7,pc}
• In non-user mode, the CPSR may be restored by
    LDM{<cond>}<add mode> Rn{!},<registers+PC>^
• In non-user mode, the user registers may be saved or restored by
    LDM|STM{<cond>}<add mode> Rn,<registers-PC>^
  – the register list must not contain PC and write-back is not allowed

Example
• Save 3 work registers and the return address upon entering a subroutine (assume r13 has been initialized for use as a stack pointer)
    STMFD r13!,{r0-r2,r14}
• Restore the work registers and return
    LDMFD r13!,{r0-r2,pc}

Swap Memory and Register Instructions
• SWP{<cond>}{B} Rd,Rm,[Rn]
• Rd <- [Rn], [Rn] <- Rm
• Combines a load and a store of a word or an unsigned byte in a single instruction
• Example
    ADR r0,SEMAPHORE
    SWPB r1,r1,[r0]         ;exchange byte

Status Register to General Register Transfer Instructions
• MRS{<cond>} Rd,CPSR|SPSR
• The CPSR or the current mode SPSR is copied into the destination register. All 32 bits are copied.
• Example
    MRS r0,CPSR
    MRS r3,SPSR

General Register to Status Register Transfer Instructions
• MSR{<cond>} CPSR_<field>|SPSR_<field>,#<32-bit immediate>
  MSR{<cond>} CPSR_<field>|SPSR_<field>,Rm
  – <field> is one of
    • c - the control field PSR[7:0]
    • x - the extension field PSR[15:8]
    • s - the status field PSR[23:16]
    • f - the flag field PSR[31:24]
• Example
  – set N, Z, C, V flags
    MSR CPSR_f,#&f0000000
  – set just C, preserving N, Z, V
    MRS r0,CPSR
    ORR r0,r0,#&20000000    ;set bit29 of r0
    MSR CPSR_f,r0

Control Flow Instructions
• Branch instructions
• Conditional branches
• Conditional execution
• Branch and link instructions
• Subroutine return instructions
• Supervisor calls
• Jump tables

Branch Instructions
            B LABEL
            …
    LABEL   …
  – LABEL comes after or before the branch instruction

Conditional Branches
• The branch has a condition associated with it and it is only executed if the condition codes have the correct value, so the branch is either taken or not taken
            MOV r0,#0               ;initialize counter
    Loop    …
            ADD r0,r0,#1            ;increment loop counter
            CMP r0,#10              ;compare with limit
            BNE Loop                ;repeat if not equal
                                    ;else fall through

Conditional Branch

Conditional Execution
• An unusual feature of the ARM instruction set is that conditional execution applies not only to branches but to all ARM instructions
            CMP r0,#5               ;if(r0!=5)
            BEQ Bypass              ;{r1=r1+r0-r2}
            ADD r1,r1,r0
            SUB r1,r1,r2
    Bypass  …
  can be replaced by
            CMP r0,#5
            ADDNE r1,r1,r0
            SUBNE r1,r1,r2
• Whenever the conditional sequence is 3 instructions or fewer it is better (smaller and faster) to exploit conditional execution than to use a branch
            ;if((a==b)&&(c==d)) e++;
            CMP r0,r1
            CMPEQ r2,r3
            ADDEQ r4,r4,#1
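
The conditional-execution sequence above can be traced with a small model. The following is a minimal Python sketch, not ARM tooling: the names cmp_flags, CONDITIONS and increment_if_both_equal are hypothetical, and only a few condition codes are modelled. It assumes CMP sets N, Z, C, V as a 32-bit subtraction would, and that a conditional instruction whose condition fails leaves registers and flags untouched.

```python
# Illustrative model of ARM condition evaluation; all names are hypothetical.
MASK32 = 0xFFFFFFFF


def cmp_flags(a, b):
    """Return the N, Z, C, V flags that CMP a,b would produce (32-bit a - b)."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    return {
        "N": result >> 31,                       # sign bit of the result
        "Z": int(result == 0),                   # result is zero
        "C": int(a >= b),                        # carry set means no borrow
        "V": ((a ^ b) & (a ^ result)) >> 31,     # signed overflow
    }


# A few of the condition codes, expressed as predicates on the flags.
CONDITIONS = {
    "EQ": lambda f: f["Z"] == 1,
    "NE": lambda f: f["Z"] == 0,
    "LT": lambda f: f["N"] != f["V"],
    "GE": lambda f: f["N"] == f["V"],
    "AL": lambda f: True,
}


def increment_if_both_equal(r0, r1, r2, r3, r4):
    """Mirror CMP r0,r1 / CMPEQ r2,r3 / ADDEQ r4,r4,#1 and return the new r4."""
    flags = cmp_flags(r0, r1)                # CMP r0,r1
    if CONDITIONS["EQ"](flags):              # CMPEQ executes only when Z is set
        flags = cmp_flags(r2, r3)
    if CONDITIONS["EQ"](flags):              # ADDEQ executes only when Z is set
        r4 = (r4 + 1) & MASK32
    return r4
```

If the first comparison fails, the skipped CMPEQ leaves Z clear, so ADDEQ is skipped as well; this is how the three instructions implement the && in the C expression without any branch.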
Branch and Link Instructions
• Perform a branch and save the address following the branch in the link register, r14
            BL SUBR                 ;branch to SUBR
            …                       ;return here
    SUBR    …                       ;subroutine entry point
            MOV PC,r14              ;return
• For nested subroutines, push r14 and any work registers that must be saved onto a stack in memory
            BL SUB1
            …
    SUB1    STMFD r13!,{r0-r2,r14}  ;save work and link regs
            BL SUB2
            …
    SUB2    …

Subroutine Return Instructions
    SUB     …
            MOV PC,r14              ;copy r14 into r15 to return
• Where the return address has been pushed onto a stack
    SUB1    STMFD r13!,{r0-r2,r14}  ;save work regs and link
            BL SUB2
            …
            LDMFD r13!,{r0-r2,PC}   ;restore work regs & return

Supervisor Calls
• The supervisor is a program which operates at a privileged level, which means that it can do things that a user-level program cannot do directly (e.g. input or output)
• SWI instruction
  – software interrupt or supervisor call
    SWI SWI_WriteC          ;output r0[7:0]
    SWI SWI_Exit            ;return to monitor program

Jump Tables
• To call one of a set of subroutines, the choice depending on a value computed by the program
            BL JUMPTAB
            …
    JUMPTAB CMP r0,#0
            BEQ SUB0
            CMP r0,#1
            BEQ SUB1
            CMP r0,#2
            BEQ SUB2
            …
  or, using a table of addresses:
            BL JUMPTAB
            …
    JUMPTAB ADR r1,SUBTAB           ;r1->SUBTAB
            CMP r0,#SUBMAX          ;check for overrun
            LDRLS PC,[r1,r0,LSL#2]  ;if OK,table jump
            B ERROR
    SUBTAB  DCD SUB0
            DCD SUB1
            DCD SUB2
            …
• The ‘DCD’ directive instructs the assembler to reserve a word of store and to initialize it to the value of the expression to the right, which in these cases is just the address of the label.

Branch and Branch with Link (B,BL)
• B{L}{<cond>} <target address>
  – <target address> is normally a label in the assembler code.
  – encoding: cond [31:28], 101 [27:25], L [24], 24-bit signed word offset [23:0]
  – target address = 24-bit offset, sign-extended, shifted left 2 places, + PC (address of branch instruction + 8)

Examples
• Unconditional jump
            B LABEL
            …
    LABEL   …
• Loop ten times
            MOV r0,#10
    Loop    …
            SUBS r0,#1
            BNE Loop
            …
• Call a subroutine
            BL SUB
            …
    SUB     …
            MOV PC,r14
• Conditional subroutine call
            CMP r0,#5
            BLLT SUB1               ; if r0<5, call SUB1
            BLGE SUB2               ; else call SUB2

Branch, Branch with Link and eXchange
• B{L}X{<cond>} Rm
  – the branch target is specified in a register, Rm
  – bit[0] of Rm is copied into the T bit in CPSR; bit[31:1] is moved into PC
  – if Rm[0] is 1, the processor switches to execute Thumb instructions and begins executing at the address in Rm aligned to a half-word boundary by clearing the bottom bit
  – if Rm[0] is 0, the processor continues executing ARM instructions and begins executing at the address in Rm aligned to a word boundary by clearing Rm[1]
• BLX <target address>
  – call Thumb subroutine from ARM
  – the H bit (bit 24) is also added into bit 1 of the resulting address, allowing an odd half-word address to be selected for the target instruction, which will always be a Thumb instruction

Example
• A call to a Thumb subroutine
            CODE32
            …
            BLX TSUB                ;call Thumb subroutine
            …
            CODE16                  ;start of Thumb code
    TSUB    …
            BX r14                  ;return to ARM code

Software Interrupt (SWI)
• SWI{<cond>} <24-bit immediate>
  – used for calls to the operating system and is often called a “supervisor call”
  – it puts the processor into supervisor mode and begins executing instructions from address 0x08
    • Save the address of the instruction after SWI in r14_svc
    • Save the CPSR in SPSR_svc
    • Enter supervisor mode and disable IRQs by setting CPSR[4:0] to binary 10011 and CPSR[7] to 1
    • Set PC to 0x08
and begin executing the instruction there
  – the 24-bit immediate does not influence the operation of the instruction but may be interpreted by the system code

Examples
• Output the character ‘A’
            MOV r0,#’A’
            SWI SWI_WriteC
• Finish executing the user program and return to the monitor
            SWI SWI_Exit
• A subroutine to output a text string
            BL STROUT
            = “Hello World”,&0a,&0d,0
            …
    STROUT  LDRB r0,[r14],#1        ;get character
            CMP r0,#0               ;check for end marker
            SWINE SWI_WriteC        ;if not end, print
            BNE STROUT              ; … , loop
            ADD r14,#3              ;align to next word
            BIC r14,#3
            MOV PC,r14              ;return

Coprocessor Instructions
• Extend the instruction set through the addition of coprocessors
  – System Coprocessor: controls on-chip functions such as the cache and memory management unit
  – Floating-point Coprocessor
  – Application-Specific Coprocessor
• Coprocessors have their own private register sets and their state is controlled by instructions that mirror the instructions that control ARM registers

Coprocessor Data Operations
• CDP{<cond>} <CP#>,<Cop1>,CRd,CRn,CRm{,<Cop2>}
  – encoding: cond [31:28], 1110 [27:24], Cop1 [23:20], CRn [19:16], CRd [15:12], CP# [11:8], Cop2 [7:5], 0 [4], CRm [3:0]
• Used to control internal operations on data in coprocessor registers
• CP# identifies the coprocessor number
• Cop1, Cop2: operation
• Examples
    CDP P2,3,C0,C1,C2
    CDPEQ P3,6,C1,C5,C7,4

Coprocessor Data Transfers
• Pre-indexed form
    LDC|STC{<cond>}{L} <CP#>,CRd,[Rn,<offset>]{!}
• Post-indexed form
    LDC|STC{<cond>}{L} <CP#>,CRd,[Rn],<offset>
  – L flag, if present, selects the long data type
  – <offset> is # +/-<8-bit immediate>
  – encoding: cond [31:28], 110 [27:25], P [24] pre-/post-index, U [23] up/down, N [22] data size (coprocessor dependent), W [21] write-back (auto-index), L [20] load/store, CRn [19:16] base register, CRd [15:12] source/destination register, CP# [11:8], 8-bit offset [7:0]
  – address calculated within ARM; number of words transferred controlled by the coprocessor
• Examples
    LDC P6,c0,[r1]
    STCEQL P5,c1,[r0],#4

Coprocessor Register Transfers
• Move to ARM register from coprocessor
    MRC{<cond>} <CP#>,<Cop1>,Rd,CRn,CRm{,<Cop2>}
• Move to coprocessor from ARM register
    MCR{<cond>} <CP#>,<Cop1>,Rd,CRn,CRm{,<Cop2>}
  – encoding: cond [31:28], 1110 [27:24], Cop1 [23:21], L [20] load from coprocessor/store to coprocessor, CRn [19:16], Rd [15:12], CP# [11:8], Cop2 [7:5], 1 [4], CRm [3:0]
• Examples
    MCR P14,3,r0,c1,c2
    MRCCS P2,4,r3,c3,c4,6

Breakpoint Instructions (BKPT-v5T only)
• BKPT <16-bit immediate>
• Used for software debugging purposes; they cause the processor to break from normal instruction execution and enter appropriate debugging procedures
• BKPT is unconditional
• Handled by an exception handler installed on the prefetch abort vector

Unused Instruction Space
• Unused Arithmetic Instructions
• Unused Control Instructions
• Unused Load/Store Instructions
• Unused Coprocessor Instructions
• Undefined Instruction Space
    [Figure: 32-bit encoding diagram for each of the five groups above]

16-bit Instruction Set

Thumb Instruction Set (1/3)
    Mnemonic  Instruction
    ADC       Add with carry
    ADD       Add
    AND       AND
    ASR       Arithmetic Shift Right
    B         Branch
    Bxx       Conditional Branch
    BIC       Bit Clear
    BL        Branch with Link
    BX        Branch and Exchange
    CMN       Compare Negative
    CMP       Compare
    EOR       EOR
    LDMIA     Load Multiple
    LDR       Load Word
    [Columns marking Lo register, Hi register and condition code use: marks not recoverable]

Thumb Instruction Set (2/3)
    Mnemonic  Instruction
    LDRB      Load Byte
    LDRH      Load Halfword
    LSL       Logical Shift Left
    LDSB      Load Signed Byte
    LDSH      Load Signed Halfword
    LSR       Logical Shift Right
    MOV       Move Register
    MUL       Multiply
    MVN       Move Negative Register
    NEG       Negate
    ORR       OR
    POP       Pop Registers
    PUSH      Push Registers
    ROR       Rotate Right
    [Columns marking Lo register, Hi register and condition code use: marks not recoverable]

Thumb Instruction Set (3/3)
    Mnemonic  Instruction
    SBC       Subtract with Carry
    STMIA     Store Multiple
    STR       Store Word
    STRB      Store Byte
    STRH      Store Halfword
    SWI       Software Interrupt
    SUB       Subtract
    TST       Test Bits
    [Columns marking Lo register, Hi register and condition code use: marks not recoverable]

Thumb Instruction Format

Thumb-ARM Difference
• The Thumb instruction set is a subset of the ARM instruction set and the instructions operate on a restricted view of the ARM registers
• Most Thumb instructions are executed unconditionally (all ARM instructions are executed conditionally)
• Many Thumb data processing instructions use a 2-address format, i.e.
the destination register is the same as one of the source registers (ARM data processing instructions, with the exception of the 64-bit multiplies, use a 3-address format)
• Thumb instruction formats are less regular than ARM instruction formats => dense encoding

Register Access in Thumb
• Not all registers are directly accessible in Thumb
• Low registers r0~r7: fully accessible
• High registers r8~r12: only accessible with MOV, ADD, CMP; only CMP sets the condition code flags
• SP (stack pointer), LR (link register) & PC (program counter): limited accessibility; certain instructions have implicit access to these
• CPSR: only indirect access
• SPSR: no access

Thumb Accessible Registers

Branches
• Thumb defines three PC-relative branch instructions, each of which has a different offset range
  – the offset depends upon the number of available bits
• Conditional Branches
  – B<cond> label
  – 8-bit offset: range of -128 to 127 instructions (+/-256 bytes)
  – the only conditional Thumb instructions
• Unconditional Branches
  – B label
  – 11-bit offset: range of -1024 to 1023 instructions (+/- 2Kbytes)
• Long Branches with Link
  – BL subroutine
  – implemented as a pair of instructions
  – 22-bit offset: range of -2097152 to 2097151 instructions (+/- 4Mbytes)

Data Processing Instructions
• Subset of the ARM data processing instructions
• Separate shift instructions (e.g.
LSL, ASR, LSR, ROR)
    LSL Rd,Rs,#Imm5         ;Rd:=Rs <shift> #Imm5
    ASR Rd,Rs               ;Rd:=Rd <shift> Rs
• Two operands for data processing instructions
  – act on low registers
    BIC Rd,Rs               ;Rd:=Rd AND NOT Rs
    ADD Rd,#Imm8            ;Rd:=Rd+#Imm8
  – also three operand forms of add, subtract and shifts
    ADD Rd,Rs,#Imm3         ;Rd:=Rs+#Imm3
• Condition code always set by low register operations

Load or Store Register
• Two pre-indexed addressing modes
  – base register+offset register
  – base register+5-bit offset, where offset scaled by
    • 4 for word accesses (range of 0-124 bytes / 0-31 words)
      – STR Rd,[Rb,#Imm7]
    • 2 for halfword accesses (range of 0-62 bytes / 0-31 halfwords)
      – LDRH Rd,[Rb,#Imm6]
    • 1 for byte accesses (range of 0-31 bytes)
      – LDRB Rd,[Rb,#Imm5]
• Special forms:
  – load with PC as base with 1Kbyte immediate offset (word aligned)
    • used for loading a value from a literal pool
  – load and store with SP as base with 1Kbyte immediate offset (word aligned)
    • used for accessing local variables on the stack

Block Data Transfers
• Memory copy, incrementing base pointer after transfer
  – STMIA Rb!, {Low Reg list}
  – LDMIA Rb!, {Low Reg list}
• Full descending stack operations
  – PUSH {Low Reg list}
  – PUSH {Low Reg list, LR}
  – POP {Low Reg list}
  – POP {Low Reg list, PC}
• The optional addition of the LR/PC provides support for subroutine entry/exit.
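
The full descending PUSH/POP pair above can be sketched as follows. This is a minimal Python model under stated assumptions, not a description of any real tool: push, pop and REG_ORDER are hypothetical names, memory is a dictionary keyed by word address, and the register list is stored with the lowest-numbered register at the lowest address (LR and PC count as the highest).

```python
# Hypothetical model of Thumb PUSH/POP on a full descending stack.
# Register order used to lay out the list in memory (lowest register, lowest address).
REG_ORDER = ["r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7", "lr", "pc"]


def push(regs, memory, reg_list):
    """PUSH {reg_list}: decrement SP first, then store the registers upwards."""
    ordered = sorted(reg_list, key=REG_ORDER.index)
    sp = regs["sp"] - 4 * len(ordered)       # full descending: SP moves down
    for i, name in enumerate(ordered):
        memory[sp + 4 * i] = regs[name]
    regs["sp"] = sp                          # SP points at the last item pushed


def pop(regs, memory, reg_list):
    """POP {reg_list}: load the registers upwards, then increment SP."""
    ordered = sorted(reg_list, key=REG_ORDER.index)
    sp = regs["sp"]
    for i, name in enumerate(ordered):
        regs[name] = memory[sp + 4 * i]
    regs["sp"] = sp + 4 * len(ordered)


# Subroutine entry and exit: PUSH {r4,r5,LR} ... POP {r4,r5,PC}
regs = {"sp": 0x1000, "r4": 4, "r5": 5, "lr": 0x8004, "pc": 0x9000}
memory = {}
push(regs, memory, ["r4", "r5", "lr"])       # entry: save work registers and return address
regs["r4"], regs["r5"] = 0, 0                # the subroutine body clobbers r4 and r5
pop(regs, memory, ["r4", "r5", "pc"])        # exit: restore and return in one instruction
```

Because LR is pushed into the slot that PC is later popped from, the exit POP both restores the work registers and performs the return, which is the subroutine entry/exit support mentioned above.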
Miscellaneous
• Thumb SWI instruction format
  – same effect as ARM, but SWI number limited to 0~255
  – syntax:
    • SWI <SWI number>
  – encoding: 11011111 [15:8], SWI number [7:0]
• Indirect access to CPSR and no access to SPSR, so no MRS or MSR instructions
• No coprocessor instruction space

Thumb Instruction Entry and Exit
• T bit, bit 5 of CPSR
  – if T=1, the processor interprets the instruction stream as 16-bit Thumb instructions
  – if T=0, the processor interprets it as standard ARM instructions
• Thumb Entry
  – ARM cores start up, after reset, executing ARM instructions
  – executing a Branch and Exchange instruction (BX)
    • sets the T bit if the bottom bit of the specified register was set
    • switches the PC to the address given in the remainder of the register
• Thumb Exit
  – executing a Thumb BX instruction

The Need for Interworking
• The code density of Thumb and its performance from narrow memory make it ideal for the bulk of C code in many systems. However there is still a need to change between ARM and Thumb state within most applications:
  – ARM code provides better performance from wide memory
    • therefore ideal for speed-critical parts of an application
  – some functions can only be performed with ARM instructions, e.g.
    • access to CPSR (to enable/disable interrupts & to change mode)
    • access to coprocessors
  – exception handling
    • ARM state is automatically entered for exception handling, but the system specification may require usage of Thumb code for the main handler
  – simple standalone Thumb programs will also need an ARM assembler header to change state and call the Thumb routine

Interworking Instructions
• Interworking is achieved using the Branch Exchange instructions
  – in Thumb state
    BX Rn
  – in ARM state (on Thumb-aware cores only)
    BX<condition> Rn
  where Rn can be any register (r0 to r15)
• This performs a branch to an absolute address in the 4GB address space by copying Rn to the program counter
• Bit 0 of Rn specifies the state to change to

Switching between States
• BX Rn: bits [31:1] of Rn give the destination address; bit 0 is the ARM/Thumb selection
  – 0: ARM state
  – 1: Thumb state

Example
                                    ;start off in ARM state
            CODE32
            ADR r0,Into_Thumb+1     ;generate branch target
                                    ;address & set bit 0,
                                    ;hence arrive Thumb state
            BX r0                   ;branch exchange to Thumb
            …
            CODE16                  ;assemble subsequent as
                                    ;Thumb
    Into_Thumb …
            ADR r5,Back_to_ARM      ;generate branch target to
                                    ;word-aligned address,
                                    ;hence bit 0 is cleared.
            BX r5                   ;branch exchange to ARM
            …
            CODE32                  ;assemble subsequent as
                                    ;ARM
    Back_to_ARM …

ARM Processor Core

3-Stage Pipeline ARM Organization
• Register Bank
  – 2 read ports, 1 write port, access any register
  – 1 additional read port, 1 additional write port for r15 (PC)
• Barrel Shifter
  – shift or rotate the operand by any number of bits
• ALU
• Address register and incrementer
• Data Registers
  – hold data passing to and from memory
• Instruction Decoder and Control

Data Processing Instructions
• All operations take place in a single clock cycle

3-Stage Pipeline (1/2)
    [Figure: three successive instructions overlapped in the pipeline, each passing through fetch, decode and execute; axes: instruction vs. time]
• Fetch
  – the instruction is fetched from memory and placed in the instruction pipeline
• Decode
  – the instruction is decoded and the datapath control signals prepared for the next cycle
• Execute
  – the register bank is read, an operand shifted, the ALU result generated and written back into a destination register

3-Stage Pipeline (2/2)
• At any time slice, 3 different instructions may occupy each of these stages, so the hardware in each stage has to be capable of independent operation
• When the processor is executing data processing instructions, the latency = 3 cycles and the throughput = 1 instruction/cycle
• When accessing r15 (PC), r15 = address of current instruction + 8

Data Transfer Instructions
• The memory address is computed in a similar way to a data processing instruction
• Load instructions follow a similar pattern, except that the data from memory only gets as far as the ‘data in’ register on the 2nd cycle and a third cycle is needed to
transfer the data from there to the destination register

Multi-cycle Instruction
    [Figure: pipeline timing over cycles 1-5 for the sequence ADD, STR, ADD, ADD, ADD; the STR passes through fetch, decode, calc. addr. and data xfer; axes: instruction vs. time]
• Memory access (fetch, data transfer) in every cycle
• Datapath used in every cycle (execute, address calculation, data transfer)
• Decode logic generates the control signals for the datapath to use in the next cycle (decode, address calculation)

Branch Instructions
• The third cycle, which is required to complete the pipeline refilling, is also used to make a small correction to the value stored in the link register so that it points directly at the instruction which follows the branch

Branch Pipeline Example
    [Figure: pipeline timing over cycles 1-5. Operations by address: 0x8000 BL, 0x8004 X, 0x8008 XX, 0x8FEC ADD, 0x8FF0 SUB, 0x8FF4 MOV. The BL passes through fetch, decode, execute, linkret and adjust.]
• Breaking the pipeline
• Note that the core is executing in ARM state

Interrupt Pipeline Example
    [Figure: pipeline timing over cycles 1-8 with an IRQ arriving. Operations by address: 0x8000 ADD, 0x8004 SUB, 0x8008 MOV, 0x800C X, 0x0018 B (to 0xAF00), 0x001C XX, 0x0020 XXX, 0xAF00 STMFD, 0xAF04 MOV, 0xAF08 LDR. The interrupted instruction is followed by IRQ, linkret and adjust cycles.]
• IRQ interrupt minimum latency = 7 cycles

5-Stage Pipelined ARM Organization
• Tprog=Ninst*CPI*cycle_time
  – Ninst, compiler dependent
  – CPI, hazard => pipeline stalls
  – cycle_time, frequency
• Separate instruction and data memories => 5 stage pipeline
• Used in ARM9TDMI
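
As a worked illustration of the Tprog relation above, the following minimal Python sketch compares two CPI values at the same clock. The instruction count and the 40 MHz clock are assumptions chosen only for illustration; the CPI values of about 1.9 and 1.5 are the approximate figures these slides quote for the ARM7TDMI and ARM9TDMI. The function name program_time is hypothetical.

```python
def program_time(n_inst, cpi, clock_hz):
    """Tprog = Ninst * CPI * cycle_time, where cycle_time = 1 / clock frequency."""
    return n_inst * cpi / clock_hz


# Assumed workload and clock, for illustration only.
N_INST = 1_000_000
CLOCK_HZ = 40e6

t_3stage = program_time(N_INST, cpi=1.9, clock_hz=CLOCK_HZ)   # CPI ~1.9 (3-stage)
t_5stage = program_time(N_INST, cpi=1.5, clock_hz=CLOCK_HZ)   # CPI ~1.5 (5-stage)
speedup = t_3stage / t_5stage

print(f"3-stage: {t_3stage * 1e3:.2f} ms, 5-stage: {t_5stage * 1e3:.2f} ms, "
      f"speedup from CPI alone: {speedup:.2f}x")
```

With the clock held constant, the speedup is just the ratio of the CPIs; any increase in clock frequency from the deeper pipeline multiplies on top of that.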
ARM9TDMI 5-stage Pipeline Organization
• Fetch
  – The instruction is fetched from memory and placed in the instruction pipeline
• Decode
  – The instruction is decoded and register operands read from the register file. There are 3 operand read ports in the register file so most ARM instructions can source all their operands in one cycle
• Execute
  – An operand is shifted and the ALU result generated. If the instruction is a load or store, the memory address is computed in the ALU
• Buffer/Data
  – Data memory is accessed if required. Otherwise the ALU result is simply buffered for one cycle
• Write Back
  – The results generated by the instruction are written back to the register file, including any data loaded from memory

Data Forwarding
• A data dependency arises when an instruction needs to use the result of one of its predecessors before the result has returned to the register file => pipeline hazards
• Forwarding paths allow results to be passed between stages as soon as they are available
• The 5-stage pipeline requires each of the three source operands to be forwarded from any of the intermediate result registers
• Still one load stall
    LDR rN,[…]
    ADD r2,r1,rN            ;use rN immediately
  – one stall
  – compiler rescheduling

ARM7TDMI Processor Core
• Current low-end ARM core for applications like digital mobile phones
• TDMI
  – T: Thumb, 16-bit compressed instruction set
  – D: on-chip Debug support, enabling the processor to halt in response to a debug request
  – M: enhanced Multiplier, yielding a full 64-bit result, high performance
  – I: Embedded ICE hardware
• Von Neumann architecture
• 3-stage pipeline, CPI ~1.9

ARM7TDMI Block Diagram

ARM7TDMI Core Diagram

ARM7TDMI Interface Signals (1/4)

ARM7TDMI Interface Signals (2/4)
• Clock control
  – all state changes within the processor are controlled by mclk, the memory clock
  – internal clock = mclk AND \wait
  – eclk clock output reflects the clock used by the core
• Memory interface
  – 32-bit address A[31:0], bidirectional data bus D[31:0], separate data out Dout[31:0], data in Din[31:0]
  – \mreq indicates a processor cycle which requires a memory access
  – seq indicates that the memory address will be sequential to that used in the previous cycle

ARM7TDMI Interface Signals (3/4)
  – lock indicates that the processor should keep the bus to ensure the atomicity of the read and write phases of a SWAP instruction
  – \r/w, read or write
  – mas[1:0], encodes the memory access size: byte, half-word or word
  – bl[3:0], externally controlled enables on latches on each of the 4 bytes on the data input bus
• MMU interface
  – \trans (translation control), 0: user mode, 1: privileged mode
  – \mode[4:0], bottom 5 bits of the CPSR (inverted)
  – abort, disallow access
• State
  – T bit, whether the processor is currently executing ARM or Thumb instructions
• Configuration
  – bigend, big-endian or little-endian

ARM7TDMI Interface Signals (4/4)
• Interrupt
  – \fiq, fast interrupt request, higher priority
  – \irq, normal interrupt request
  – isync, allows the interrupt synchronizer to be bypassed
• Initialization
  – \reset, starts the processor from a known state, executing from address 0x00000000
• ARM7TDMI characteristics

External Address Generation

Memory Access

ARM Memory Interface
Instruction Execution Cycles (1/2)
Instruction | Qualifier | Cycle count
Any unexecuted | Condition codes fail | +S
Data processing | Single-cycle | +S
Data processing | Register-specified shift | +I +S
Data processing | R15 destination | +N +2S
Data processing | R15, register-specified shift | +I +N +2S
MUL | | +(m)I +S
MLA | | +I +(m)I +S
MULL | | +(m)I +I +S
MLAL | | +I +(m)I +I +S
B, BL | | +N +2S
LDR | Non-R15 destination | +N +I +S

Instruction Execution Cycles (2/2)
Instruction | Qualifier | Cycle count
LDR | R15 destination | +N +I +N +2S
STR | | +N +N
SWP | | +N +N +I +S
LDM | Non-R15 destination | +N +(n-1)S +I +S
LDM | R15 destination | +N +(n-1)S +I +N +2S
STM | | +N +(n-1)S +I +N
MSR, MRS | | +S
SWI, trap | | +N +2S
CDP | | +(b)I +S
MCR | | +(b)I +C +N
MRC | | +(b)I +C +I +S
LDC, STC | | +(b)I +N +(n-1)S +N

Effect of T bit [figure]

Cached ARM7TDMI Macrocells [figure]

ARM 8
• Higher performance than ARM7
  – by increasing the clock rate
  – by reducing the CPI
    • higher memory bandwidth, 64-bit wide memory (ARM8)
    • separate memories for instruction and data accesses (ARM9TDMI, ARM10TDMI)
• Core organization
  – the prefetch unit is responsible for fetching instructions from memory and buffering them (exploiting the double bandwidth memory)
  – it is also responsible for branch prediction and uses static prediction based on the branch direction (backward: predicted ‘taken’, forward: predicted ‘not taken’)

Pipeline Organization
• 5-stage, prefetch unit occupies the 1st stage, integer unit occupies the remainder
  (1) Instruction prefetch
  (2) Instruction decode and register read
  (3) Execute (shift and ALU)
  (4) Data memory access
  (5) Write back results

Integer Unit Organization [figure]
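The Instruction Execution Cycles table earlier in this section can be turned into a rough cycle estimate. The sketch below is an illustration, not part of the slides: it assumes the usual ARM7TDMI reading of the symbols (N = non-sequential memory cycle, S = sequential memory cycle, I = internal cycle, n = number of registers transferred, m = multiplier cycles) and further assumes that every N, S and I cycle takes exactly one clock, which holds only for zero-wait-state memory. The dictionary keys are made-up names for a few table rows.

```python
# (N, S, I) cycle counts for a few rows of the table; keys are hypothetical labels.
COSTS = {
    "data":           lambda n, m: (0, 1, 0),  # +S
    "data_reg_shift": lambda n, m: (0, 1, 1),  # +I +S
    "branch":         lambda n, m: (1, 2, 0),  # +N +2S
    "ldr":            lambda n, m: (1, 1, 1),  # +N +I +S
    "str":            lambda n, m: (2, 0, 0),  # +N +N
    "ldm":            lambda n, m: (1, n, 1),  # +N +(n-1)S +I +S
    "mul":            lambda n, m: (0, 1, m),  # +(m)I +S
}

def cycle_counts(program):
    """Sum the (N, S, I) cycles over a list of (kind, n, m) entries."""
    totals = [0, 0, 0]
    for kind, n, m in program:
        for i, count in enumerate(COSTS[kind](n, m)):
            totals[i] += count
    return tuple(totals)

# A load, two ALU operations, a store and a taken branch.
program = [("ldr", 1, 1), ("data", 1, 1), ("data", 1, 1),
           ("str", 1, 1), ("branch", 1, 1)]

n_cycles, s_cycles, i_cycles = cycle_counts(program)
total = n_cycles + s_cycles + i_cycles
print(n_cycles, s_cycles, i_cycles)  # 4 5 1
print(total / len(program))          # 2.0 cycles per instruction
```

With slower memory, N and S cycles would each cost more than one clock, so the three counts are kept separate rather than summed inside the function.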
ARM9TDMI
• Harvard architecture
  – increases available memory bandwidth
    • instruction memory interface
    • data memory interface
  – simultaneous accesses to instruction and data memory can be achieved
• 5-stage pipeline
• Changes implemented to
  – improve CPI to ~1.5
  – improve maximum clock frequency

ARM9TDMI Organization [figure]

ARM9TDMI Pipeline Operations (1/2)
• There is not sufficient slack time to translate Thumb instructions into ARM instructions and then decode them; instead the hardware decodes both ARM and Thumb instructions directly

ARM9TDMI Datapath (1/2) [figure]

ARM9TDMI Datapath (2/2) [figure]

LDR Interlock [figure]

Optimal Pipelining [figure]

LDM Interlock (1/2) [figure]

LDM Interlock (2/2) [figure]

Example ARM9TDMI System [figure]

Cached ARM9TDMI Macrocell [figure]

ARM9TDMI Pipeline Operations (2/2)
• Coprocessor support
  – coprocessors: floating-point, digital signal processing, special-purpose hardware accelerator
• On-chip debug
  – additional features compared to ARM7TDMI
    • hardware single stepping
    • breakpoint can be set on exceptions
• ARM9TDMI characteristics

ARM9E-S Family Overview
• ARM9E is based on an ARM9TDMI with the following extensions
  – single cycle 32*16 multiplier implementation
  – EmbeddedICE Logic RT
  – improved ARM/Thumb interworking
  – new 32*16 and 16*16 multiply instructions
  – new count leading zeros instruction
  – new saturated maths instructions
• Architecture v5TE
• ARM946E-S
  – ARM9E-S core
  – instruction and data caches, selectable sizes
  – instruction and data RAMs, selectable sizes
  – protection unit
  – AHB bus interface
• ARM966E-S
  – similar to ARM946E-S, but with no cache

ARM1020T Overview
• Architecture v5T
  – ARM1020E will be v5TE
• CPI ~1.3
• 6-stage pipeline
• Static branch prediction
• 32KB instruction and 32KB data caches
  – ‘hit under miss’ support
• 64 bits per cycle LDM/STM operations
• EmbeddedICE Logic RT-II
• Support for new VFPv1 architecture
• ARM10200 test chip
  – ARM1020T
  – VFP10
  – SDRAM memory interface
  – PLL

ARM10TDMI (1/2)
• Current high-end ARM processor core
• Performance on the same IC process: ARM10TDMI is about 2 × ARM9TDMI, which is about 2 × ARM7TDMI
• 300MHz, 0.25 µm CMOS
• Increase clock rate

Software Development

• ARM software development - ADS
• ARM system development - ICE and trace
• ARM-based SoC development – modeling, tools, design flow
[Figure: tool flow. C source, C libraries and asm source pass through the C compiler and assembler to .aof files; the linker combines these with object libraries into an .aif file; ARMsd debugs it on the system model (ARMulator) or a development board]

ARM Development Suite (ADS), ARM Software Development Toolkit (SDT) (1/3)
• Develop and debug C, C++ or assembly language programs
  – armcc: ARM C compiler
  – armcpp: ARM C++ compiler
  – tcc: Thumb C compiler
  – tcpp: Thumb C++ compiler
  – armasm: ARM and Thumb assembler
  – armlink: ARM linker
    - combines the contents of one or more object files with selected parts of one or more object libraries to produce an executable program
    - the ARM linker creates ELF executable images
  – armsd: ARM and Thumb symbolic debugger
    - can single-step through C or assembly language sources, set breakpoints and watchpoints, and examine program variables or memory

ARM Development Suite (ADS), ARM Software Development Toolkit (SDT) (2/3)
• .aof: ARM object format file
  .aif: ARM image format file
• The .aif file can be built to include the debug tables => ARM symbolic debugger, ARMsd
• ARMsd can load, run and debug programs either on hardware such as the ARM development board or using the software emulation of the ARM (ARMulator)
• AxD (ADW, ADU)
  – ARM debugger for Windows and Unix with a graphical user interface
  – debug C, C++, and assembly language source
• CodeWarrior IDE
  – project management tool for Windows

ARM Development Suite (ADS), ARM Software Development Toolkit (SDT) (3/3)
• Utilities
  – armprof: ARM profiler
  – Flash downloader: downloads binary images to Flash memory on a development board
• Supporting software
  – ARMulator: ARM core simulator
    - provides instruction accurate simulation of ARM processors and enables ARM and Thumb executable programs to be run on non-native hardware
    - integrated with the ARM debugger
  – Angel: ARM debug monitor
    - runs on target development hardware and enables you to develop and debug applications on ARM-based hardware

ARM C Compiler
• Compiler is compliant with the ANSI standard for C
• Supported by the appropriate library of functions
• Uses the ARM Procedure Call Standard, APCS, for all external functions
  – for procedure entry and exit
• May produce assembly source output
  – can be inspected, hand optimized and then assembled subsequently
• Can also produce Thumb code

Linker
• Takes one or more object files and combines them
• Resolves symbolic references between the object files and extracts the object modules from libraries
• Normally the linker includes debug tables in the output file

ARM Symbolic Debugger
• A front-end interface to debug a program running either under emulation (on the ARMulator) or remotely on an ARM development board (via a serial line or through the JTAG test interface)
• ARMsd allows an executable program to be loaded into the ARMulator or a development board and run. It allows the setting of
  – breakpoints, addresses in the code
  – watchpoints, memory addresses that are watched for data accesses
  => cause an exception to halt execution so that the processor state can be examined

ARM Emulator
• ARMulator is a suite of programs that models the behavior of various ARM processor cores in software on a host system
• It operates at various levels of accuracy
  – instruction accurate
  – cycle accurate
  – timing accurate
  => instruction count or number of cycles can be measured for a program
  => performance analysis
• The timing accurate model is used for cache, memory management unit analysis, and so on

ARM Development Board
• A circuit board including an ARM core (e.g. ARM7TDMI), memory components, I/O and electrically programmable devices
• It can support both hardware and software development before the final application-specific hardware is available

Writing Assembly Language Programs
            AREA    HelloW,CODE,READONLY    ;declare code area
SWI_WriteC  EQU     &0                      ;output character in r0
SWI_Exit    EQU     &11                     ;finish program
            ENTRY                           ;code entry point
START       ADR     r1,TEXT                 ;r1-> "Hello World"
LOOP        LDRB    r0,[r1],#1              ;get next byte
            CMP     r0,#0                   ;check for text end
            SWINE   SWI_WriteC              ;if not end print …
            BNE     LOOP                    ;… and loop back
            SWI     SWI_Exit                ;end of execution
TEXT        =       "Hello World",&0a,&0d,0
            END                             ;end of program source
• The following tools are needed
  – a text editor to type the program into
  – an assembler to translate the program into ARM binary code
  – an ARM system or emulator to execute the binary code
  – a debugger to see what is happening inside the code

Program Design
• Start with understanding the requirements, and translate the requirements into an unambiguous specification
• Define a program structure, the data structures and the algorithms that are used to perform the required operations on the data
• The algorithms may be expressed in pseudo-code
• Individual modules should be coded, tested and documented
• Nearly all programming is based on high-level languages; however, it may be necessary to develop small software components in assembly language to get the best performance

System Architecture (1/2)
• ARM processor, memory system, buses, and the ARM reference peripheral specification
• The reference peripheral specification defines a basic set of components, providing a framework within which an operating system can run but leaving full scope for application-specific system development
• Components include
  – a memory map
  – an interrupt controller
  – a counter timer
  – a reset controller with defined boot behavior, power-on reset detection, a “wait for interrupt” pause mode
• The system must define
  – the base address of the interrupt controller (ICBase)
  – the base address of the counter-timer (CTBase)
  – the base address of the reset and pause controller (RPCBase)
• All the addresses of the registers are relative to one of the base addresses

System Architecture (2/2)
• Interrupt controller provides a way of enabling, disabling (by mask) and examining the status of up to 32 level-sensitive IRQ sources and one FIQ source
• Two 16-bit counter-timers, controlled by registers. The counters operate from the system clock with selectable pre-scaling
• Reset and pause controller includes some registers
  – the readable registers give identification and reset status information
  – the writable registers can set or clear the reset status, clear the reset map and put the system into pause mode where it uses minimal power until an interrupt wakes it up again
• Add application-specific peripherals

Hardware System Prototype
• Verify that the hardware blocks and the software modules (still under development) function correctly and that speed performance is acceptable
• Simulating the system using software tools => slower, can’t verify the full system
• Hardware prototyping
  – building a hardware platform from pre-existing or still-developing components for system verification and software development
  – “ARM Integrator” or “Rapid Silicon Prototyping”

ARM Integrator
• A motherboard with some extensions to support the development of applications
• Provides core modules, logic modules (Xilinx Virtex FPGA), OS, input/output resources, bus arbitration, interrupt handling
[Figure: ARM Integrator block diagram. Labels: System Bus, External Bus Interface, Peripheral Input/Output, Core Module Connectors, Logic Module Connectors, System Controller FPGA, PCI Host Bridge, FLASH, SRAM, GPIO, Boot ROM, three Standard PCI Slots, PCI-PCI Bridge, Compact PCI]

Rapid Silicon Prototyping (VLSI Tech. Inc.)
• Specially developed reference chips + off-chip extensions

ARMulator (1/2)
• ARMulator is a collection of programs that emulate the instruction sets and architecture of various ARM processors (it is an instruction set simulator)
• ARMulator is suited to software development and benchmarking ARM-targeted software. It models the instruction set and counts cycles.
• ARMulator supports a C library to allow complete C programs to run on the simulated system
• Software is run on the ARMulator through the ARM symbolic debugger or the ARM GUI debugger, AxD

ARMulator (2/2)
• It includes
  – processor core models which can emulate any ARM core
  – a memory interface which allows the characteristics of the target memory system to be modeled
  – a coprocessor interface that supports custom coprocessor models
  – an OS interface that allows individual system calls to be handled
• The processor core model incorporates the remote debug interface, so the processor and the system state are visible from the ARM symbolic debugger
• ARMulator => a cycle accurate model of a system including a cache, MMU, physical memory, peripheral devices, OS, software
• Once the design is OK, hardware -> design or synthesis by CAD software -> still use the ARMulator model, but instruction accurate

JTAG Boundary Scan (1/2)
• IEEE 1149.1, Standard Test Access Port and Boundary Scan Architecture, also called JTAG boundary scan (after the Joint Test Action Group)

JTAG Boundary Scan (2/2)
• Test signals
  – \TRST: a test reset input
  – TCK: test clock which controls the timing of the test interface independently from any system clock
  – TMS: test mode select which controls the operation of the test interface state machine
  – TDI: test data input line
  – TDO: test data output line
• TAP controller (Test Access Port)
  – a state machine whose state transitions are controlled by TMS

TAP Controller (1/2)
[Figure: TAP controller state diagram, transitions labeled TMS=0 and TMS=1. States: test logic reset, run test/idle, select DR scan, capture DR, shift DR, exit1 DR, pause DR, exit2 DR, update DR, select IR scan, capture IR, shift IR, exit1 IR, pause IR, exit2 IR, update IR]

TAP Controller (2/2)
• Test instruction selects various data registers
  – device ID register, bypass register, boundary scan register
• Some public instructions
  – BYPASS: connect TDI to TDO with 1-clock delay
  – EXTEST: test the board-level connectivity, boundary scan register is connected
    • capture DR: captured by the boundary scan register
    • shift DR: shift out via TDO
    • update DR: new data applied to the boundary scan register via TDI
  – INTEST: test the core logic
  – IDCODE: ID register is connected

Macrocell Testing
• A system chip is composed of pre-designed macrocells with application-specific custom logic
• Various approaches to test the macrocells
  – test mode provided which multiplexes the signals in turn onto the chip
  – on-chip bus may support direct test access to macrocell pins
  – each macrocell may have a boundary scan path using JTAG extensions

ARM Debug Architecture (1/2)
• Two basic approaches to debug
  – from the outside, use a logic analyzer
  – from the inside, tools supporting single stepping, breakpoint setting
• Breakpoint: replacing an instruction with a call to the debugger
• Watchpoint: a memory address which halts execution if it is accessed as a data transfer address
• Debug Request: through ICEBreaker programming or by the DBGRQ pin asynchronously

ARM Debug Architecture (2/2)
• In debug state, the core’s internal state and the system’s external state may be examined. Once examination is complete, the core and system state may be restored and program execution is resumed.
• The internal state is examined via a JTAG-style serial interface, which allows instructions to be serially inserted into the core’s pipeline without using the external data bus.
• When in debug state, a store-multiple (STM) could be inserted into the instruction pipeline and this would dump the contents of ARM’s registers.

Debugger (1/2)
• A debugger is software that enables you to make use of a debug agent in order to examine and control the execution of software running on a debug target
• Different forms of the debug target
  – early stage of product development, software
  – prototype, on a PCB including one or more processors
  – final product
• The debugger issues instructions that can
  – load software into memory on the target
  – start and stop execution of that software
  – display the contents of memory, registers, and variables
  – allow you to change stored values
• A debug agent performs the actions requested by the debugger, such as
  – setting breakpoints
  – reading from / writing to memory

Debugger (2/2)
• Examples of debug agents
  – Multi-ICE
  – Embedded ICE
  – ARMulator
  – BATS
  – Angel
• Remote Debug Interface (RDI) is an open ARM standard procedural interface between a debugger and the debug agent
[Figure: the ARM debugger (AxD) connects through RDI to software targets (ARMulator, BATS: target emulated in software) and to hardware targets (Multi-ICE, Angel via Remote_A: ARM development board)]

In Circuit Emulator (ICE)
• The processor in the target system is removed and replaced by a connection to an emulator
• The emulator may be based around the same processor chip, or a variant with more pins, but it will also incorporate buffers to copy the bus activity to a “trace buffer” and various hardware resources which can watch for particular events, such as execution passing through a breakpoint

Multi-ICE and Embedded ICE
• Multi-ICE and Embedded ICE are JTAG-based debugging systems for ARM processors
• They provide the interface between a debugger and an ARM core embedded within an ASIC
• They provide
  – real time address-dependent and data-dependent breakpoints
  – single stepping
  – full access to, and control of, the ARM core
  – full access to the ASIC system
  – full memory access (read and write)
  – full I/O system access (read and write)

Basic Debug Requirements
• Control of program execution
  – set watchpoints on interesting data accesses
  – set breakpoints on interesting instructions
  – single step through code
• Examine and change processor state
  – read and write register values
• Examine and change system state
  – access to system memory
    • download initial code

Debugging with Multi-ICE
• The system being debugged may be the final system
• Third party protocol converters are also available at http://www.arm.com/DevSupp/ICE_Analyz/

ICEBreaker (EmbeddedICE macrocell)
• ICEBreaker is programmed in a serial fashion using the TAP controller
• It consists of 2 real-time watchpoint units, together with a control and status register
• Either watchpoint unit can be configured to be a watchpoint or a breakpoint
[Figure: Processor, ICEBreaker and TAP blocks. Signal labels: DBGRQI, EXTERN1, EXTERN0, A[31:0], D[31:0], nOPC, nRW, TBIT, RANGEOUT0, RANGEOUT1, MAS[1:0], nTRANS, DBGACK, BREAKPT, DBGRQ, DBGACKI, BREAKPTI, DBGEN, IFEN, ECLK, nMREQ, SDIN, SDOUT, TCK, nTRST, TMS, TDI, TDO]

Real-Time Trace (1/2)
• Debugging uses breakpoints and single-stepping to run application code to a given point, then stops the processor to examine or change memory or register contents, and then steps or restarts the code
• Some bugs occur only while the system is running at full clock speed => need non-intrusive trace of instruction flow and data accesses
• Using the Trace Debug Tool (TDT), you can set up the trace filter facility to collect trace data only during the interrupt routine, and use a trigger to stop tracing

Real-Time Trace (2/2)
[Figure: ADW and TDT running on the host connect through a JTAG unit (5-wire JTAG) to the JTAG port of the ASIC, which contains the ARM CPU macrocell and the Embedded Trace Macrocell; the trace port feeds a Trace Port Analyzer]
• Embedded trace macrocell
  – monitors the ARM core buses and passes compressed information through the trace port to the Trace Port Analyzer (TPA)
  – the on-chip cell contains the trigger and filter logic
• Trace port analyzer
  – an external device that stores the information from the trace port
• Trace debug tool
  – sets up the trigger and filter logic, retrieves the data from the analyzer and reconstructs a historical view of processor activity

ARM10TDMI (2/2)
• Reduce CPI
  – branch prediction
  – non-blocking load and store execution
  – 64-bit data memory => transfer 2 registers in each cycle
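ICEBreaker is programmed serially through the TAP controller, whose 16 states are listed on the TAP Controller slide. As a rough illustration, the state machine can be written as a next-state table indexed by TMS. The transitions below follow the standard IEEE 1149.1 state diagram rather than anything spelled out in the slide text, and the names `TAP` and `step` are made up for this sketch.

```python
# Next-state table: state -> (next state if TMS=0, next state if TMS=1),
# evaluated on each rising edge of TCK.
TAP = {
    "test-logic-reset": ("run-test/idle", "test-logic-reset"),
    "run-test/idle":    ("run-test/idle", "select-DR-scan"),
    "select-DR-scan":   ("capture-DR", "select-IR-scan"),
    "capture-DR":       ("shift-DR", "exit1-DR"),
    "shift-DR":         ("shift-DR", "exit1-DR"),
    "exit1-DR":         ("pause-DR", "update-DR"),
    "pause-DR":         ("pause-DR", "exit2-DR"),
    "exit2-DR":         ("shift-DR", "update-DR"),
    "update-DR":        ("run-test/idle", "select-DR-scan"),
    "select-IR-scan":   ("capture-IR", "test-logic-reset"),
    "capture-IR":       ("shift-IR", "exit1-IR"),
    "shift-IR":         ("shift-IR", "exit1-IR"),
    "exit1-IR":         ("pause-IR", "update-IR"),
    "pause-IR":         ("pause-IR", "exit2-IR"),
    "exit2-IR":         ("shift-IR", "update-IR"),
    "update-IR":        ("run-test/idle", "select-DR-scan"),
}

def step(state, tms_bits):
    """Apply a sequence of TMS values (one per TCK edge) and return the final state."""
    for tms in tms_bits:
        state = TAP[state][tms]
    return state

# From reset, TMS = 0,1,0,0 reaches shift-DR, where data moves from TDI towards TDO.
print(step("test-logic-reset", [0, 1, 0, 0]))     # shift-DR
# TMS = 0,1,1,0,0 reaches shift-IR, where a test instruction is shifted in.
print(step("test-logic-reset", [0, 1, 1, 0, 0]))  # shift-IR
# Holding TMS high for five clocks returns to test-logic-reset from any state.
print(all(step(s, [1] * 5) == "test-logic-reset" for s in TAP))  # True
```

The last property is why a debugger can resynchronize with the TAP controller without knowing its current state, even when the optional \TRST input is not wired.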