ARM.SoC.Architecture

# If the addition in the accumulation in smla and smlaw


... any of the operand or result registers. If the saturating addition or subtraction overflows, or doubling Rn (where specified) causes overflow, the Q bit in the CPSR is set.

## v5TE code example

To illustrate the use of these instructions, consider the problem of generating the inner ('dot') product of two vectors of 16-bit signed numbers held in memory on an ARM9E-S core, which supports the architecture v5TE extensions, and an ARM9TDMI core, which does not. Computing an inner product is a very common procedure in signal processing applications. To minimize errors, saturating arithmetic should be used. The v5TE code for the central loop is as follows:

    loop    SMULBB  r3,r1,r2        ; 16x16 multiply
            SUBS    r4,r4,#2        ; decrement loop counter
            QDADD   r5,r5,r3        ; saturating x2 & accumulate
            SMULTT  r3,r1,r2        ; 16x16 multiply
            LDR     r1,[r6],#4      ; get next two multipliers
            QDADD   r5,r5,r3        ; saturating x2 & accumulate
            LDR     r2,[r7],#4      ; get next two multiplicands
            BNE     loop

This code example illustrates several important points:

- The instructions are 'scheduled' to avoid pipeline stalls. On an ARM9E-S this means that the result of a load or 16-bit multiply should not be used in the following cycle.
- Although the operands are 16-bit halfwords, they are loaded in pairs as 32-bit words. This is a more efficient way to use ARM's 32-bit memory interface than using halfword loads, and the v5TE multiply instructions can access the individual 16-bit operands directly from the registers.
- The saturating 'double and accumulate' instructions are used to scale the product before accumulation. This is useful because the fixed-point arithmetic used in signal processing generally assumes operands in the range -1 to +1, but certain algorithms need coefficients greater than 1. The doubling operation gives an effective range from -2 to +2, which is sufficient for most algorithms.
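The behaviour of the loop can be modelled in portable C. The following is only a sketch under stated assumptions, not the way the core implements these instructions: the names `q_flag`, `sat32`, `qdadd`, `smulbb`, `smultt` and `dot_q15` are hypothetical, the sticky Q bit is represented by a plain global variable, and each 32-bit word is assumed to hold two packed 16-bit signed operands.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the sticky Q bit in the CPSR. */
static int q_flag = 0;

/* Clamp a 64-bit intermediate to the signed 32-bit range.
   Sets q_flag whenever clamping was needed. */
static int32_t sat32(int64_t x) {
    if (x > INT32_MAX) { q_flag = 1; return INT32_MAX; }
    if (x < INT32_MIN) { q_flag = 1; return INT32_MIN; }
    return (int32_t)x;
}

/* Model of QDADD: saturate 2 * prod, then saturate acc + that value. */
static int32_t qdadd(int32_t acc, int32_t prod) {
    int32_t doubled = sat32((int64_t)prod * 2);
    return sat32((int64_t)acc + doubled);
}

/* Model of SMULBB: signed product of the two bottom halfwords.
   The cast to int16_t assumes the usual two's-complement wraparound. */
static int32_t smulbb(uint32_t a, uint32_t b) {
    return (int32_t)(int16_t)(a & 0xFFFFu) * (int32_t)(int16_t)(b & 0xFFFFu);
}

/* Model of SMULTT: signed product of the two top halfwords. */
static int32_t smultt(uint32_t a, uint32_t b) {
    return (int32_t)(int16_t)(a >> 16) * (int32_t)(int16_t)(b >> 16);
}

/* Inner product of two vectors stored as packed pairs of halfwords.
   n_words is the number of 32-bit words, i.e. half the element count. */
int32_t dot_q15(const uint32_t *x, const uint32_t *y, size_t n_words) {
    int32_t acc = 0;
    for (size_t i = 0; i < n_words; i++) {
        acc = qdadd(acc, smulbb(x[i], y[i]));  /* bottom halfwords */
        acc = qdadd(acc, smultt(x[i], y[i]));  /* top halfwords */
    }
    return acc;
}
```

For example, with the single words `0x00030002` and `0x00050004` the sketch accumulates 2 x (2 x 4) + 2 x (3 x 5) = 46 and leaves `q_flag` clear, whereas `qdadd(0, 0x40000000)` saturates at the doubling step, returns `INT32_MAX` and sets `q_flag`.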

## Performance comparison

The single-cycle 32 x 16 multiplier on the ARM9E-S enables it to complete the above loop in 10 clock cy...

## This document was uploaded on 10/30/2011 for the course CSE 378 380 at SUNY Buffalo.
