Xia et al. / J Zhejiang Univ Sci A
New method for high performance multiply-accumulator design
[…], Peng LIU, Qing-dong YAO
Department of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, China
E-mail: firstname.lastname@example.org; email@example.com
Received July 27, 2008;
Revision accepted Oct. 28, 2008; Crosschecked Apr. 27, 2009
Abstract: This study presents a new four-stage pipelined, high-performance split multiply-accumulator (MAC) architecture for media processors that supports multiple precisions. To further increase speed, a novel partial product compression circuit based on interleaved adders and a modified hybrid partial product reduction tree (PPRT) scheme are proposed. The MAC can perform 1-way 32-bit and 4-way 16-bit signed/unsigned multiply or multiply-accumulate operations, as well as 2-way parallel multiply-add (PMADD) operations, at 1.25 GHz under worst-case conditions and 1.67 GHz under typical-case conditions. Compared with the MAC in the 32-bit MIPS (microprocessor without interlocked pipeline stages) processor, the proposed design is markedly faster, and throughput improves by up to 32%. The MAC has been fabricated in Taiwan Semiconductor Manufacturing Company (TSMC) 90-nm CMOS standard-cell technology and has passed functional testing.
Key words: Multiply-accumulator (MAC), Pipeline, Compressor, Partial product reduction tree (PPRT), Split structure
The multiply-accumulate operation is one of the basic arithmetic operations used extensively in modern digital signal processing (DSP). Many DSP algorithms, such as digital filtering, convolution, and the fast Fourier transform (FFT), require high-performance multiply-accumulate operations. The multiply-accumulator (MAC) unit typically lies on the critical path that determines the speed of the overall hardware system. Therefore, a high-speed MAC capable of supporting multiple precisions and parallel operations is highly desirable.
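As a minimal illustration of why the MAC operation dominates such kernels, the sketch below writes a direct-form FIR filter so that each inner-loop step is exactly one multiply-accumulate, acc ← acc + h[k]·x[n−k]. The function name `fir_filter` and the sample data are hypothetical and are not taken from the design described here.

```python
def fir_filter(h, x):
    """Direct-form FIR filter: y[n] = sum over k of h[k] * x[n - k]."""
    y = []
    for n in range(len(x)):
        acc = 0  # models the accumulator register
        for k in range(len(h)):
            if n - k >= 0:
                acc += h[k] * x[n - k]  # one multiply-accumulate per tap
        y.append(acc)
    return y


# Hypothetical 3-tap filter applied to a short input sequence.
y = fir_filter([1, 2, 3], [1, 0, 0, 4])
print(y)
```

An N-tap filter of this form issues up to N multiply-accumulate operations per output sample, so the latency and throughput of the MAC directly bound the achievable sample rate.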
The existing MAC implementation methods in the literature can generally be classified into three categories. The first category is the recursive MAC ([…], 2001; Liao and Roberts, 2002), which builds wider vector elements out of several narrower ones and then adds the multiple results together. It operates iteratively, feeding data back through the unit over more than one clock cycle. This method saves hardware resources but requires several clock cycles per operation. The second category is the parallel MAC (Perri […], 2006; MIPS Technologies Inc., 2006; 2007), which unrolls the iterative loop of the recursive MAC method and achieves high speed at the cost of additional hardware resources.
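The difference between the first two categories can be sketched in software, assuming a 32-bit unsigned multiply-accumulate built from 16×16-bit partial products. Here `mac32_recursive` models a single narrow multiplier reused over four iterations (one per modeled clock cycle), while `mac32_parallel` models the unrolled form, in which four multipliers work side by side and their outputs are summed in one pass. The function names, the 16-bit split, and the cycle count are illustrative assumptions, not the circuits of the cited works.

```python
MASK16 = 0xFFFF


def split16(v):
    """Split a 32-bit unsigned value into (low, high) 16-bit halves."""
    return v & MASK16, (v >> 16) & MASK16


def mac32_recursive(acc, a, b):
    """Reuse one 16x16 multiplier; returns (result, modeled cycles)."""
    a_parts = split16(a)
    b_parts = split16(b)
    cycles = 0
    for i in range(2):
        for j in range(2):
            # One narrow multiply per modeled cycle, shifted to its weight.
            acc += (a_parts[i] * b_parts[j]) << (16 * (i + j))
            cycles += 1
    return acc, cycles


def mac32_parallel(acc, a, b):
    """Unrolled form: four 16x16 products computed independently."""
    a_lo, a_hi = split16(a)
    b_lo, b_hi = split16(b)
    p0 = a_lo * b_lo
    p1 = a_lo * b_hi
    p2 = a_hi * b_lo
    p3 = a_hi * b_hi
    # Sum the partial products at their respective bit weights.
    return acc + p0 + ((p1 + p2) << 16) + (p3 << 32)


# Hypothetical operands for a quick comparison of the two models.
a, b, acc0 = 0x89ABCDEF, 0x12345678, 1000
rec, cycles = mac32_recursive(acc0, a, b)
par = mac32_parallel(acc0, a, b)
print(rec == par, cycles)
```

Both models return the same value. The recursive form needs four modeled cycles with one multiplier, whereas the unrolled form needs four multipliers, which mirrors the area-versus-speed trade-off described above.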