Zhang92a - IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR...

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 2, NO. 3, SEPTEMBER 1992

Motion-Compensated Wavelet Transform Coding for Color Video Compression

Ya-Qin Zhang, Member, IEEE, and Sohail Zafar, Student Member, IEEE

Abstract—A video compression scheme based on the wavelet representation and multiresolution motion compensation (MRMC) is presented in this paper. The multiresolution/multifrequency nature of the discrete wavelet transform is an ideal tool for representing images and video signals. The wavelet transform decomposes a video frame into a set of subframes with different resolutions corresponding to different frequency bands. These multiresolution frames also provide a representation of the global motion structure of the video signal at different scales. The motion activities for a particular subframe at different resolutions are different but highly correlated, since they actually specify the same motion structure at different scales. In the multiresolution motion compensation approach, motion vectors at higher resolution are predicted by the motion vectors at lower resolution and are refined at each step. In this paper, we propose a variable block-size MRMC scheme in which the size of a block is adapted to its level in the wavelet pyramid. This scheme not only considerably reduces the searching and matching time but also provides a meaningful characterization of the intrinsic motion structure. The variable block-size MRMC approach also avoids the drawback of the constant-size MRMC in describing small-object motion activities. After wavelet decomposition, each scaled subframe tends to have different statistical properties. An adaptive truncation process was implemented, and a bit allocation scheme similar to that used in transform coding is examined by adapting to the local variance distribution in each scaled subframe.
Based on the wavelet representation, the variable block-size MRMC approach, and a uniform quantization scheme, four variations of the proposed motion-compensated wavelet video compression system are identified. It is shown that the motion-compensated wavelet transform coding approach has superior performance in terms of the peak-to-peak signal-to-noise ratio as well as subjective quality.

I. INTRODUCTION

THE discrete wavelet transform (DWT) has recently received considerable attention in the context of image processing due to its flexibility in representing nonstationary image signals and its ability to adapt to human visual characteristics. Its relationships to the Gabor transform, the windowed Fourier transform, and other intermediate spatial-frequency representations have been studied [1]-[5]. The wavelet representation provides a multiresolution/multifrequency expression of a signal with localization in both time and frequency. This property is very desirable in image and video coding applications. First, real-world image and video signals are nonstationary in nature. A wavelet transform decomposes a nonstationary signal into a set of multiscaled wavelets where each component becomes relatively more stationary and hence easier to code. Also, coding schemes and parameters can be adapted to the statistical properties of each wavelet, and hence coding each stationary component is more efficient than coding the whole nonstationary signal.

Manuscript received September 5, 1991; revised February 13, 1992. This paper was presented in part at the SPIE Visual Communications and Image Processing Conference [21], Nov. 10-15, 1991, Boston, MA. Paper recommended by Associate Editor Yrjo Neuvo. Y.-Q. Zhang is with GTE Laboratories, Inc., Waltham, MA 02254; author to whom correspondence should be addressed. S. Zafar is with the Dept. of Electrical Engineering, University of Maryland, College Park, MD 20742. IEEE Log Number 9201709.
In addition, the wavelet representation matches the spatially tuned, frequency-modulated properties of early human vision, as indicated by research results in psychophysics and physiology [6].

The discrete wavelet theory is closely related to the frameworks of multiresolution analysis and subband decomposition, which have been used successfully in image processing for a decade [7]-[10]. In multiresolution analysis, an image is represented as a limit of successive approximations, each of which is a smoothed version of the image at a given resolution. All the smoothed versions of the image at different resolutions form a pyramid structure. An example is the so-called Gaussian pyramid, in which a Gaussian function is used as the smoothing filter at each step. However, there exist redundancies among the different levels of the pyramid. A Laplacian pyramid is formed to reduce this redundancy by taking the difference between successive layers of the Gaussian pyramid [7]. The Laplacian pyramid representation results in considerable compression, although the image size actually expands after the decomposition. In subband coding, the frequency band of an image signal is decomposed into a number of subbands by a bank of bandpass filters. Each subband is then translated to baseband by down-sampling and is encoded separately. For reconstruction, the subband signals are decoded and up-sampled back to the original frequency band by interpolation. The signals are then summed to give a close replica of the original signal. The subband coding approach provides a signal-to-noise ratio comparable to the transform coding approach and yields superior subjective perception due to the absence of the "blocking effect" [9]. The multiresolution representation and the subband approach have recently been integrated into the framework of wavelet theory [2], [3].
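The Gaussian/Laplacian pyramid construction just described can be sketched in a few lines of NumPy. This is a minimal illustration, not the filters of [7]: a separable 1-2-1 binomial kernel stands in for the Gaussian smoother, nearest-neighbor upsampling is used for the layer differences, and all function names are our own.

```python
import numpy as np

def smooth(img):
    """Separable 1-2-1 binomial blur (stand-in for the Gaussian smoothing filter)."""
    k = np.array([0.25, 0.5, 0.25])
    blur = lambda r: np.convolve(np.pad(r, 1, mode='edge'), k, 'valid')
    return np.apply_along_axis(blur, 1, np.apply_along_axis(blur, 0, img))

def up2(img):
    """Nearest-neighbor 2x upsampling."""
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

def laplacian_pyramid(img, levels):
    """Gaussian pyramid by blur + 2:1 decimation; each Laplacian layer is the
    difference between a Gaussian layer and the upsampled next-coarser layer."""
    gauss = [img]
    for _ in range(levels):
        gauss.append(smooth(gauss[-1])[::2, ::2])
    lap = [g - up2(g_next) for g, g_next in zip(gauss[:-1], gauss[1:])]
    return lap, gauss[-1]

def reconstruct(lap, top):
    """Reverse the differencing: add each layer back onto the upsampled result."""
    img = top
    for layer in reversed(lap):
        img = layer + up2(img)
    return img
```

Note that the layers together hold more samples than the input (256 + 64 + 16 for a 16 x 16 image with two levels), which is exactly the expansion mentioned above; the critically sampled wavelet decomposition of Section II avoids it.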
Wavelet theory provides a systematic way to construct a set of filter banks with a regularity condition and compact support [1].

© 1992 IEEE

In the wavelet representation, the overall number of image samples is conserved after the decomposition due to the orthogonality of the wavelet basis at different scales (this is referred to as a critically sampled system). Wavelet theory has been applied to image coding in a way similar to the subband coding approach; schemes using different types of wavelets and quantization schemes have been proposed [2], [3], [11], [12]. There appear to have been few efforts to apply wavelet theory to real-time video compression. In [13], the Laplacian pyramid, subband coding, and wavelet decomposition were compared and used to decompose a sequence of images, but no coding results were reported. This paper applies wavelet theory to compress full-motion color video signals using a variable block-size multiresolution motion-compensated prediction.

Video signals are nonstationary in nature. In video coding, some type of interframe prediction is often used to remove the interframe redundancy, and motion-compensated prediction has been used as an efficient scheme for temporal prediction. After motion compensation, the residual video signal still tends to be highly nonstationary. In the transform coding approach, such as in the CCITT H.261 recommendation and the MPEG proposal [14], [15], the residual video signals are divided into many small rectangular blocks. The reason is that a small block size makes hardware implementation feasible and advantageous, and coding parameters can be adapted to each locally stationary block. A detailed and excellent coverage of the transform-based coding approach for video signals can be found in [16].
The block transform coding approach suffers from the "blocking effect" in low bit rate applications. Wavelet decomposition provides an alternative approach to representing nonstationary video signals and the residual signals after prediction. Compared to transform coding, the wavelet representation is more flexible and can be easily adapted to the nature of the human visual system. It is also free from blocking artifacts because of its global decomposition.

In Section II, the dyadic wavelet theory is reviewed and its extension to the two-dimensional case is briefly described. Video coding often involves some kind of format conversion by subsampling and interpolation; the generalized wavelet-based subsampling and interpolation procedure is discussed in Section III. The wavelet transform decomposes a video frame into a set of subframes with different resolutions corresponding to different frequency bands. These multiresolution frames also provide a representation of the motion structure at different scales. The motion activities for a particular subframe at different resolutions are hence highly correlated, since they actually specify the same motion structure at different scales. In the multiresolution motion compensation (MRMC) scheme described in Section IV, motion vectors at higher resolution are predicted by the motion vectors at lower resolution and are refined at each step. We propose a variable block-size MRMC scheme in which the size of a block is adapted to its scale. This scheme not only considerably reduces the searching and matching time but also provides a meaningful characterization of the intrinsic motion structure. The variable-size MRMC approach also avoids the drawback of the constant-size MRMC in describing small-object motion activities. The MRMC scheme described here can also be well adapted to motion-compensated interpolation.
After wavelet decomposition, each scaled subframe tends to have different statistical properties. An adaptive truncation process similar to [17] is implemented, and a bit allocation scheme similar to that used in transform coding is examined by adapting to the local variance distribution in each scaled subframe. Based on the wavelet representation, the variable-size MRMC approach, and a uniform quantization scheme, four variations of the proposed motion-compensated wavelet video compression system are identified in Section VI. Comparative results for the four variations are presented in Section VII.

II. WAVELET DECOMPOSITION AND RECONSTRUCTION

In this section, a special class of the discrete orthonormal wavelet transform with a resolution step of 2, i.e., the discrete dyadic wavelet transform, is briefly introduced for image decomposition and reconstruction. Dyadic wavelets are a set of functions generated from one single basis wavelet w(t) by dilations and translations [2]:

    w_mn(t) = 2^{-m/2} w(2^{-m} t - n),    (m, n) ∈ Z².

For any square-integrable function f(t) ∈ L²(R), its wavelet transform Wf(m, n) is defined as

    Wf(m, n) = ⟨f(t), w_mn(t)⟩ = ∫ f(t) w_mn(t) dt

which gives an approximation of f(t) at the resolution (or scale) 2^m at location n. Conversely, any square-integrable function f(t) ∈ L²(R) can be represented in terms of a set of wavelet bases covering all scales at every location:

    f(t) = Σ_{m,n} Wf(m, n) w_mn(t).

In other words, any function can be decomposed into a set of wavelets at various scales, and it can also be reconstructed by the superposition of all the scaled wavelets. The condition for perfect reconstruction is

    Σ_{m=-∞}^{∞} |W(2^m ω)|² = 1    (1)

which ensures that the wavelet transform provides a complete representation covering the whole frequency axis, where W(ω) is the Fourier transform of w(t). It has been shown that the wavelet basis can be constructed from the multiresolution analysis procedure.
In multiresolution analysis, a scaling function φ(t) is introduced. Let U_m denote the vector space spanned by φ_mn(t), which is generated by the dilation and translation of the scaling function φ(t):

    φ_mn(t) = 2^{-m/2} φ(2^{-m} t - n).

Therefore, {U_m}_{m∈Z} represents the successive approximations at resolutions {2^m}_{m∈Z}, and U_m constitutes a subset of U_{m-1} for m ∈ Z. If the scaling function φ(t) and the basic wavelet function w(t) are chosen to satisfy the conditions

    φ(t) = Σ_n c_n φ(2t - n)    and    w(t) = Σ_n (-1)^n c_n φ(2t + n)

then the w_mn(t) are the functions that span the orthogonal complement of U_m in U_{m-1}. Hence, ⟨f, w_mn(t)⟩ represents the difference of information between the resolutions 2^{m-1} and 2^m, which is the "new" information conveyed between the successive approximations.

In practice, the input signal f(t) is measured at a finite resolution. A finite dyadic wavelet transform of a given function f(t) is introduced between the scales {2^m; m = 1, 2, ..., M}. Imposing the condition

    |Φ(ω)|² = Σ_{m=M+1}^{∞} |W(2^m ω)|²

ensures that the condition in (1) holds true. We denote the wavelet at scale 2^m by W_{2^m} f = {Wf(m, n); n ∈ Z}, or simply W_{2^m} when there is no ambiguity. Therefore, a finite wavelet transform of f(t) between the scales 2^1 and 2^M can be represented as

    {S_{2^M} f, W_{2^M} f, W_{2^{M-1}} f, ..., W_{2^1} f}

where

    S_{2^M} f = ⟨f(t), φ_{2^M}(t)⟩ = ∫ f(t) φ_{2^M}(t) dt

is the smoothed version of f(t) spanned by the scaling function at the resolution 2^M. Relating the wavelet to the multiresolution analysis results in a fast computation algorithm that has long been used in signal processing applications. The algorithm is as follows:

    m = 0
    while (m < M) {
        W_{2^{m+1}} f = S_{2^m} f * G_m
        S_{2^{m+1}} f = S_{2^m} f * H_m
        m = m + 1
    }

where S_1 f = f is the original signal. The filter pair H and G corresponds to the expansion of the scaling function and the wavelet function, respectively.
The coefficients of an orthonormal wavelet transform satisfy the conditions

    Σ_n h(n) = √2,    Σ_n g(n) = 0,    g(n) = (-1)^n h(1 - n).    (2)

The reconstruction basically reverses the decomposition procedure:

    m = M
    while (m > 0) {
        S_{2^{m-1}} f = W_{2^m} f * G̃_{m-1} + S_{2^m} f * H̃_{m-1}
        m = m - 1
    }

where H̃ and G̃ are the conjugate filters of H and G, respectively.

The conditions in (2) are also the requirements for a class of perfect reconstruction filters, namely quadrature mirror filters (QMF), which have been used extensively in subband image coding applications [8], [9]. Wavelet theory provides a systematic way to construct QMF's, and it also explicitly imposes a regularity condition on the QMF coefficients [1]. The regularity condition corresponds to the degree of differentiability of the wavelet functions, which is determined by the number of zeros of the wavelet filters at ω = π. In practical applications, it is desirable to have a continuous and smooth wavelet representation, which is guaranteed by the regularity condition.

In this paper, we use a set of orthonormal bases with compactly supported wavelets developed in [1]. Compact support implies a finite length for the filters H and G. There is a compromise between the degree of compactness and the degree of regularity: the wavelet function becomes more regular as the number of taps in H increases, which results in more computation. The Daubechies-6 coefficient set is used in this work since it shows an adequate energy concentration in the low-frequency subimage. The extension of the 1-D wavelet transform to 2-D is straightforward.
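As a concrete sketch of the decomposition and reconstruction loops above, the following single-level NumPy implementation uses the 4-tap Daubechies filter rather than the paper's Daubechies-6 set, with periodic signal extension; the helper names and the finite-filter form of the alternating-flip relation are our own choices.

```python
import numpy as np

# 4-tap Daubechies low-pass filter H (the paper uses the 6-tap set).
s3 = np.sqrt(3.0)
h = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * np.sqrt(2))
# High-pass G from H by the alternating-flip QMF relation (finite-filter form of (2)).
g = np.array([(-1) ** n * h[len(h) - 1 - n] for n in range(len(h))])

def analyze(s):
    """One decomposition step: circular correlation with h and g, then 2:1 decimation."""
    N = len(s)
    lo = np.array([sum(h[k] * s[(2 * i + k) % N] for k in range(len(h)))
                   for i in range(N // 2)])
    hi = np.array([sum(g[k] * s[(2 * i + k) % N] for k in range(len(g)))
                   for i in range(N // 2)])
    return lo, hi

def synthesize(lo, hi):
    """Reconstruction step: transpose of the (orthonormal) analysis operator."""
    N = 2 * len(lo)
    s = np.zeros(N)
    for i in range(N // 2):
        for k in range(len(h)):
            s[(2 * i + k) % N] += h[k] * lo[i] + g[k] * hi[i]
    return s
```

Because the filter bank is orthonormal, `synthesize(*analyze(x))` recovers `x` exactly (the critically sampled property noted above), and iterating `analyze` on the low-pass branch produces the pyramid {S_{2^M} f, W_{2^m} f}.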
A separable wavelet transform is one whose 2-D scaling function Φ(t1, t2) can be expressed as

    Φ(t1, t2) = φ(t1) φ(t2).

It can easily be shown that the wavelet at a given resolution 2^m can be completely represented by three separable orthogonal wavelet basis functions in L²(R × R):

    W¹_{2^m}(t1, t2) = φ_{2^m}(t1) w_{2^m}(t2)
    W²_{2^m}(t1, t2) = w_{2^m}(t1) φ_{2^m}(t2)
    W³_{2^m}(t1, t2) = w_{2^m}(t1) w_{2^m}(t2).

Therefore, a 2-D dyadic wavelet transform of an image f(x, y) between the scales 2^1 and 2^M can be represented as a sequence of subimages:

    {S_{2^M} f, [W^j_{2^M} f]_{j=1,2,3}, ..., [W^j_{2^1} f]_{j=1,2,3}}.

The 2-D separable wavelet decomposition can be implemented first in columns and then in rows independently. The decomposed image data structure and the corresponding frequency bands are depicted in Figs. 1 and 2, respectively. In Fig. 1, the decomposed image forms a pyramid structure of up to three layers with three subimages in each layer. The resolution decreases by a factor of 4 (2 in the horizontal direction and 2 in the vertical direction) with each added layer. The separable transform is easy to implement but is limited in orientations. Nonseparable extensions can also be used, such as the quincunx pyramid, where the scaling function has the form

    Φ(t1, t2) = φ(L(t1, t2))

where L(t1, t2) = (t1 + t2, t1 - t2) is a linear transform.

III. FORMAT CONVERSION BY GENERALIZED SUBSAMPLING AND INTERPOLATION

The wavelet representation can also be used as a tool for subsampling and interpolation. As illustrated in Fig. 1, an image can be represented in terms of a pyramid structure after wavelet decomposition. The sequence {S_{2^m} f}_{m=1,...,M} represents approximations of a given image at different resolutions. S_{2^m} f gives the optimum representation at resolution 2^m in the sense that it gives the best human visual perception [3]. Video applications often involve some form of format conversion through subsampling and interpolation.
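The column-then-row separable decomposition described above can be sketched as follows. For brevity this uses the 2-tap Haar pair rather than Daubechies-6, and the labeling of subbands to orientations is illustrative; function names are our own.

```python
import numpy as np

hf = np.array([1.0, 1.0]) / np.sqrt(2)   # Haar low-pass (stand-in for Daubechies-6)
gf = np.array([1.0, -1.0]) / np.sqrt(2)  # Haar high-pass

def haar_rows(a, f):
    """Filter each row with the 2-tap filter f and decimate 2:1."""
    return a[:, 0::2] * f[0] + a[:, 1::2] * f[1]

def dwt2(img):
    """One separable 2-D decomposition step: columns first, then rows,
    giving the smoothed subimage S and the three oriented detail subimages."""
    lo = haar_rows(img.T, hf).T   # low-pass along columns
    hi = haar_rows(img.T, gf).T   # high-pass along columns
    S  = haar_rows(lo, hf)        # smooth (phi x phi)
    W1 = haar_rows(lo, gf)        # phi(t1) w(t2)
    W2 = haar_rows(hi, hf)        # w(t1) phi(t2)
    W3 = haar_rows(hi, gf)        # w(t1) w(t2)
    return S, W1, W2, W3
```

Applying `dwt2` recursively to the `S` output builds the three-layer pyramid of Fig. 1; since the basis is orthonormal, the four quarter-size subbands hold exactly the energy (and sample count) of the input.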
For example, in the CCITT H.261 standard, all incoming video signals are converted to the common intermediate format (CIF) or a quarter-CIF (QCIF) format, depending on the available channel rate. In certain MPEG specifications, the CCIR 601 resolution has to be subsampled to meet the target rate of 1.5 Mb/s. There are many subsampling and interpolation schemes. The easiest method is uniform subsampling, in which every other sample (for 2:1 sampling) is discarded. Some nonuniform subsampling techniques have also been developed to discard samples according to local activity. At the receiver, different linear or nonlinear interpolation schemes are used to retrieve the video signal. The drawback of this type of sample discarding is the "aliasing effect," caused by the inadequacy of the sampling rate.

Some filtering techniques developed in image coding and enhancement can also be used as generalized subsampling and interpolation tools. Examples include Burt's pyramids, Watson's cortex transform, and the QMF's used in subband coding [7], [8], [18]. In these schemes, a filtering and decimating process is applied recursively to obtain the desired representation at a given resolution. The process filters out the higher frequency components and confines the frequency distribution to a lower band. It is therefore free from the aliasing effect, since the subsampling rate in the lower frequency subimages satisfies the Nyquist rate provided that the original sampling rate does. The wavelet representation essentially uses the same approach, but the coefficients of the wavelet filters are chosen to satisfy certain constraints. An example of the generalized subsampling and interpolation process is depicted in Fig. 3.

IV. MULTIRESOLUTION MOTION ESTIMATION/COMPENSATION

As Figs. 1 and 2 illustrate, a video frame is decomposed into multiple layers with different resolutions and different frequency bands.
Motion activities at different layers of the pyramid are different but highly correlated, since they characterize the same motion structure at different scales and different frequency ranges. In a multiresolution motion estimation (MRME) scheme, the motion field is first calculated for the lowest resolution subimage, which sits at the top of the pyramid [19]. Then, motion vectors at lower layers of the pyramid are refined using the motion information obtained at higher layers. This scheme can be considered a multiresolution version of the predictive motion estimation scheme proposed in [20]. The motivation for using the MRME approach is the inherent structure of the wavelet representation. MRME schemes significantly reduce the searching and matching time and provide a smooth motion vector field.

A video frame is decomposed up to three levels in our work. A total of 10 subimages is obtained, with 3 subimages at each of the first two levels and 4 at the top, including the subimage S_8 with the lowest frequency band. It is well known that human vision is more sensitive to errors in lower frequencies than to those incurred in higher bands, and tends to be selective in spatial orientation and positioning; e.g., errors in smooth areas are more disturbing to a viewer than those near edges. The subimage S_8 also contains a large percentage of the total energy, though it is only 1/64th of the original video frame size. In addition, errors in higher-layer subimages are propagated and expanded into all subsequent lower-layer subimages. In this section, a variable block-size MRME scheme is proposed to take all these factors into consideration.

The basic multiresolution motion estimation scheme proposed in this paper is similar to the one used in [19]. The main difference is that the size of the blocks varies with resolution in our scheme, whereas the size of the motion blocks is kept constant for all resolutions in [19].
[Fig. 1. The pyramid structure of wavelet decomposition and reconstruction. Fig. 2. Frequency band distribution of wavelet decompositions.]

We use a block size of p · 2^{M-m} for the mth-level subimage; i.e., the lower the resolution (corresponding to a higher level in the pyramid), the smaller the motion block size. The constant p is the size of the block used at the lowest resolution (e.g., p equals one for the pel-recursive case and two for a block size of 2 × 2 for the highest-layer subimages). With this structure, the number of motion blocks is the same for all subimages, because a block at one resolution corresponds to the same position and the same object at another resolution. In other words, all scaled subimages have the same number of motion blocks, which characterize the global motion structure on different grids and frequency ranges. The variable block-size approach appropriately weights the importance of the different layers and matches human visual perception of different frequencies at different resolutions. It can detect motion for small objects at the highest level of the pyramid, whereas the constant-block MRME approach ignores the motion activities (even high motion) of small objects at the higher levels of the pyramid. The variable block-size MRMC approach also requires fewer computations, since no interpolation is needed as the grid refines. In the variable block-size MRME, an accurate characterization of the motion information at the highest-layer subimage produces very low energy in the displaced residual subimages and results in a "cleaner" propagation process for the motion estimation in lower-layer subimages. In a constant-block MRME, "clean" copies of lower-layer subimages are obtained by interpolation and refinement.

Let the value of video frame i at location (x1, y1) be denoted by I_i(x1, y1).
The basic principle of motion compensation is to find the motion vector V_i(x, y) that reconstructs I_i(x1, y1) from I_{i-1}(x1 + x, y1 + y) with minimum error. In other words, we try to find the "best match" for I_i(x1, y1) in the previous frame (i - 1), displaced from the original location (x1, y1) by V_i(x, y). The range of x and y is called the search area and is denoted by Ω. In a block-based matching scheme, the idea is to divide the image into small blocks of size X × Y (2 × 2 for components at the highest level M in our case) and then, for each block in the current frame i, to find the block in the previous frame (i - 1) within a defined search area Ω that minimizes a prespecified distortion function. The process of block-matching motion estimation is shown in Fig. 4.

An example of the proposed variable block-size MRME scheme is illustrated in Fig. 5. First, the motion vectors for the highest-layer subimage S_8 are estimated by full search with a block size of 2 × 2. A pel-recursive scheme with a large number of iterations can also be employed. These motion vectors are then scaled appropriately to be used as initial estimates for motion estimation in the higher-resolution subimages. Using 2^{M-m} times the motion vectors for level M as a bias, the motion vectors for level m are refined by full search but with a relatively small search area Ω. There are several possible variations. Since the motion activities for the subimages W^i_j {i = 1, 2, 3; j = 2, 4, 8} represent the frequency-segmented motion structure of the global motion activity, motion vectors at different layers are highly correlated. An example of exploiting this motion redundancy is illustrated in Fig. 6, where the estimation path is indicated by three directions (horizontal, vertical, and diagonal) corresponding to the directional characteristics of the three 2-D wavelet filters.
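A minimal sketch of full-search block matching with an initial bias, using the mean absolute difference as the distortion function; the function name and argument conventions are our own, not the paper's.

```python
import numpy as np

def full_search(cur, ref, bx, by, B, S, bias=(0, 0)):
    """Find the displacement (dx, dy) within +/-S of `bias` that minimizes the
    mean absolute difference for the BxB block of `cur` at top-left (bx, by)."""
    H, W = cur.shape
    block = cur[by:by + B, bx:bx + B]
    best_err, best_v = np.inf, (0, 0)
    for dy in range(bias[1] - S, bias[1] + S + 1):
        for dx in range(bias[0] - S, bias[0] + S + 1):
            x, y = bx + dx, by + dy
            if x < 0 or y < 0 or x + B > W or y + B > H:
                continue  # candidate block falls outside the reference frame
            err = np.abs(block - ref[y:y + B, x:x + B]).mean()
            if err < best_err:
                best_err, best_v = err, (dx, dy)
    return best_v, best_err
```

In the multiresolution scheme, `bias` would carry the scaled coarse-level vector (e.g., 2^{M-m} times the S_8 vector) so that only the small refinement Δ(δx, δy) is searched at each finer level.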
Let V_{i,j}(x, y) represent the motion vector centered at (x, y) for the subimage W^i_j {i = 1, 2, 3; j = 2, 4, 8}. Then this estimation scheme is given by

    V_{i,j}(x, y) = 2 V_{i,2j}(x, y) + Δ(δx, δy)    for i = 1, 2, 3 and j = 2, 4.

This configuration produces very low energies in all the displaced residual subimages, but with a comparatively large overhead and more computation. This scheme can certainly be simplified by using the independent motion vectors obtained for subimage S_8 as an initial bias and then refining them for all lower-layer subimages using the full-search algorithm with a smaller search area. With this scheme, the block is first displaced by V_{i,j}(x, y) in the previous frame, and the motion searching algorithm is then implemented to find Δ(δx, δy). This is equivalent to finding

    Δ(δx, δy) = arg min_{(δx,δy)∈Ω} (1/XY) Σ_{p=-X/2}^{X/2} Σ_{q=-Y/2}^{Y/2} |I_i(x1 + p, y1 + q) - I_{i-1}(x1 + p + x + δx, y1 + q + y + δy)|.

The motion vectors at level m are then given by

    V_{i,j}(x, y) = 2^{M-m} V_{0,8}(x, y) + Δ(δx, δy)    for {i = 1, 2, 3; j = 2, 4, 8}

where V_{0,8}(x, y) is the motion vector for the subimage S_8 and Δ(δx, δy) is the incremental motion vector found by a full search with a reduced search area. This scheme provides a meaningful characterization of the intrinsic motion structure and gives very smooth motion vectors from block to block. Motion overhead and computation can be reduced dramatically if the motion vectors for each subimage are not refined; this scheme is illustrated in Fig. 5 by making all Δ(δx, δy) identically zero, i.e., δx = δy = 0.

[Fig. 3. An example of generalized subsampling and interpolation (2:1 row sampling, 2:1 column sampling, 2:1 column interpolation). Fig. 4. Block-based motion searching and matching scheme.]

V. BIT ALLOCATION AND QUANTIZATION

Quantization is an important part of a video compression system.
As a matter of fact, in most video coding systems quantization is the only process that introduces distortion, and it is what achieves a data rate far below the entropy limit. An efficient quantizer matches the underlying probability distribution of the coefficients in the displaced residual subimages (DRS) at different scales and different frequency bands. In this section, two schemes are presented to quantize the DRS video signals. The first method uses a bit allocation scheme followed by a uniform quantizer, similar to that used in some existing transform coding and subband coding schemes [9], [16]. The difference here, however, is that a proper weighting factor is applied to each subimage according to its importance in the pyramid. The second scheme is similar to the adaptive truncation process used in the scene adaptive coder [17].

The bit allocation process can be divided into two parts: bits are first assigned among the subimages, and the assigned number of bits is then distributed within each individual subimage. Let {R^k_m; m = 1, ..., M; k = 1, 2, 3} be the number of bits associated with the subimages {W^k_m; m = 1, ..., M; k = 1, 2, 3} and R_M represent the number of bits for the subimage S_M. Then the total number of bits R is

    R = R_M + Σ_{m=1}^{M} Σ_{k=1}^{3} R^k_m.    (3)

The assignment should be done to minimize the overall distortion in the reconstructed image, which is represented as

    D = 2^{2M} D_M + Σ_{m=1}^{M} Σ_{k=1}^{3} 2^{2m} D^k_m    (4)

where {D^k_m; m = 1, ..., M; k = 1, 2, 3} is the distortion associated with the subimages {W^k_m; m = 1, ..., M; k = 1, 2, 3} and D_M represents the distortion introduced in the subimage S_M. The weighting factor 2^{2m} is introduced in the above distortion criterion so that errors incurred in the higher-layer subimages have more impact on the overall distortion. The problem is to minimize (4) subject to the bit constraint in (3).
The constrained problem can be converted to uncon— strained problem by forming the following function: J=D+AR where A is the Lagrangian multiplier. The solution is obtained by taking the derivative of J with respect to R M (4) ZHANG AND ZAFAR: MOTION—COMPENSATED WAVELET TRANSFORM CODING 291 . HE”. IEJIII IELJ III-III..- IIIIII III- " I ' “ Si distortion-rate function given by [8] is 2—r(1+R,*,,) 0,:(R) = 77lflfi<nlm+ldxlm (6) where {fmk(x); m = 1;", M; k = 1,2, 3} is the PDF associ- ated with wavelets {WW/1‘; m = 1,~--, M; k = 1,2,3}. For simplification we let r+ 1 Fig. 6. Variable block-size MRME‘using independent motion estima- K _ k 1/ r+1 tion for (5,,ng {i = 1,2,3). am - flfm(x)] dx . Substituting these values in (5) we have '9 {xwam—A[R— Em,” =0. (7) and {R,’§,; m = 1,-~,M; k = 1,2,3} and setting it to zero. M 3 M 3 6R + 1 J=22MDM+ )3 222m0,:+)t(RM+ 2 2R5). '" m r m=1 k=1 m=1 k=1 Solving (7) obtains the value of Rm: To simplify the notation we will assume that for any X R _ 1 l (r In 2) amZZMfir M 3 m _ r ng A(r + 1) _ k §Xm _ XM + 21 [(21 Xm and Substituting this value of Rm in the constraint equation "I- _ (3), we get the value of A, M 3 1‘le =XM l_l l—lep A = rln22~(r+l(7ReMX3M+5)]) m m=1 k=1 r + 1 Thus, the partial derivative can be written as and, finally, we get the optimal bit allocation for each a wavelet: ER—[D(R) — A{R ~ 21cm” = 0, (5) R _ R _ M(3M + 5) fl ’" m m 3M + 1 r(3M + 1) r If a difference distortion measure with power r is used, 1 a m D(x) =Ix — q(x)|r r 21 + _ logl 1/(3M+1) ' (8) where q(x) is the quantization of x. The asymptotic ] 292 The result is quite intuitive, as the bit allocation is nearly uniform among all subimages. Since the size of higher layer subimages is much smaller than that of lower layer subimages, this means that more bits are assigned to the higher layer subimages in terms of average bits per pixel. This is consistent with the inherit structure of the wavelet pyramid shown in Fig. 1. 
Bit allocation within each subimage is the same as the conventional scheme used in transform coding and will not be elaborated here [16].

The second quantization technique is based on the adaptive truncation scheme [17]. This scheme involves only a floating-to-integer conversion process and is very simple to implement. It was originally used for quantizing discrete cosine transform coefficients; we use it here by adjusting the normalization factor to the wavelet pyramid. The scheme consists of three steps. The first is to apply a threshold to all subimages $\{S_M, W_m^k;\ m = 1,\ldots,M;\ k = 1,2,3\}$ to reduce the number of coefficients to be quantized, i.e., to set to zero all coefficients below a defined value. It should be pointed out that the dynamic range of the values in different subimages of the DRS varies and depends strongly on the motion activity and the accuracy of the motion estimation scheme associated with each subimage. Therefore, the threshold could be chosen in terms of the dynamic range and the level in the pyramid. In this paper, a fixed threshold $T$ is used for all subimages for the sake of simplicity. The threshold is then subtracted from the remaining nonzero coefficients:

$$TW_m^k(i,j) = \begin{cases} W_m^k(i,j) - T & \text{if } W_m^k(i,j) > T \\ 0 & \text{if } W_m^k(i,j) \le T \end{cases}$$

where $0 \le i \le X/2^m - 1$, $0 \le j \le Y/2^m - 1$, and $X$ and $Y$ are the video frame dimensions.

The next step is to scale the coefficients by a normalizing factor $D_m$ based on their levels in the pyramid. The choice of $D_m$ is based on the same principle stated in Section IV: a larger value of $D_m$ corresponds to a coarser quantization. In our work, $D_m = D_M 2^{M-m}$ is chosen, where $D_M$ is the normalization factor for $\{S_M, W_M^1, W_M^2, W_M^3\}$:

$$NTW_m^k(i,j) = \frac{TW_m^k(i,j)}{D_M 2^{M-m}}.$$

After normalization, the values are rounded to the nearest integers by

$$RNTW_m^k(i,j) = \mathrm{integer}\{NTW_m^k(i,j) + 0.5\}.$$

Then $RNTW_m^k(i,j)$ is entropy coded and transmitted.
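The three steps (threshold-and-subtract, level-dependent normalization, rounding) and their receiver-side inverse can be sketched as follows. This is an illustration of the process described above, with the $T$ and $D_M$ values chosen arbitrarily:

```python
import numpy as np

def adaptive_truncate(W, m, M, T=4.0, D_M=8.0):
    """Quantize a level-m subimage by adaptive truncation (Section V)."""
    D_m = D_M * 2 ** (M - m)            # larger D_m -> coarser quantization
    TW = np.where(W > T, W - T, 0.0)    # zero at/below T, subtract T above
    return np.floor(TW / D_m + 0.5).astype(int)  # round to nearest integer

def inverse_truncate(Q, m, M, T=4.0, D_M=8.0):
    """Receiver side: inverse normalization, then add back the threshold."""
    D_m = D_M * 2 ** (M - m)
    rec = Q.astype(float) * D_m
    return np.where(Q > 0, rec + T, 0.0)
```

Note that `inverse_truncate` recovers a coefficient only up to the quantization step $D_m$, so the loop mirrors the variable-rate, near-constant-quality behavior described next.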
At the receiver, the decoded values are inversely normalized, added to the threshold, and inversely transformed to reconstruct the image. This simple adaptive truncation process results in a variable bit rate but a nearly constant quality. For constant bit rate output, $D_m$ should be a function of the degree of buffer fullness at the output of the coder. Relating $D_m$ to the variance distribution of the different subimages should also improve the performance.

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 2, NO. 3, SEPTEMBER 1992

VI. DESCRIPTION OF THE VIDEO CODING SCHEMES

The compression scheme we implemented in our work is basically an interframe hybrid DPCM/DWT scheme. Wavelet decomposition can operate either on the original video samples before motion compensation or on the residual video samples after motion compensation. Fig. 7 depicts a typical video coding system in which the original video source $S$ is first decomposed into subimages $\{S_M, W_m^k;\ m = 1,\ldots,M;\ k = 1,2,3\}$. After the variable block-size MRMC, the DRS frames $\{R_M, R_m^k;\ m = 1,\ldots,M;\ k = 1,2,3\}$ are coded and transmitted. The energies of the DRS may be further compacted by a mapping that localizes the energy distribution within each subimage. The mapping can be any conventional transform or a predictive coder; of course, a simple PCM coder followed by entropy coding is also expected to perform well. These two variations are illustrated in Fig. 7, where dashed lines represent a DCT mapping being used. Alternatively, wavelet decomposition can take place on the residual video signal after a conventional motion-compensated prediction scheme, as shown in Fig. 8.
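As a concrete, simplified illustration of the first stage of Fig. 7, the sketch below performs one level of a separable 2-D wavelet decomposition, splitting a frame into a coarse subimage $S$ and three half-resolution detail subimages. The Haar filter pair is used here purely for brevity; the paper itself uses Daubechies filters [1], and the orientation labels follow the $W^1/W^2/W^3$ convention loosely:

```python
import numpy as np

def dwt_level(frame):
    """One level of a separable 2-D wavelet decomposition (Haar filters).

    Returns the low-pass subimage S and three detail subimages, each at
    half the input resolution in both dimensions.
    """
    lo = (frame[0::2, :] + frame[1::2, :]) / 2.0   # low-pass along rows
    hi = (frame[0::2, :] - frame[1::2, :]) / 2.0   # high-pass along rows
    S  = (lo[:, 0::2] + lo[:, 1::2]) / 2.0         # low-low: coarse image
    W1 = (lo[:, 0::2] - lo[:, 1::2]) / 2.0         # low-high detail
    W2 = (hi[:, 0::2] + hi[:, 1::2]) / 2.0         # high-low detail
    W3 = (hi[:, 0::2] - hi[:, 1::2]) / 2.0         # high-high detail
    return S, W1, W2, W3
```

Applying `dwt_level` recursively to $S$ for $M$ levels yields a $\{S_M, W_m^k\}$ pyramid of the kind used throughout the paper.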
Therefore, four variations of the proposed algorithm are identified in terms of the domain of wavelet decomposition and the choice of mapping strategies:

1) Wavelet decomposition → multiresolution motion compensation → multiscale quantization → entropy encoder
2) Wavelet decomposition → multiresolution motion compensation → DCT → uniform quantization → entropy encoder
3) Motion compensation → wavelet decomposition → multiscale quantization → entropy encoder
4) Motion compensation → wavelet decomposition → DCT → uniform quantization → entropy encoder

In scheme 1), the original video frames are first decomposed into wavelets at different scales and resolutions. DRS frames are then formed using the variable block-size MRMC prediction scheme and quantized using the adaptive truncation process (ATP) with a multiscale normalization factor described in Section V. The coding chain of this scheme is expressed as DWT/MRMC/ATP. Scheme 2) also decomposes the original video frame using the wavelet transform; however, the energy of the DRS frames is further compacted using a conventional DCT, and the DCT coefficients are quantized by a uniform quantizer. The coding chain of this scheme is expressed as DWT/MRMC/DCT/UQ. In schemes 3) and 4), the wavelet decomposition is performed on the residual video frame (the displaced frame difference (DFD)) produced by a conventional motion-compensated prediction scheme. A multiscale quantizer is used in scheme 3), whereas in scheme 4) the DCT is applied to all DFD's followed by a uniform quantizer. These strategies can be expressed as MC/DWT/ATP and MC/DWT/DCT/UQ, respectively. In all four cases, motion vectors are DPCM-coded and all quantities are entropy-coded prior to transmission.

[Fig. 8. Motion estimation (ME) and discrete wavelet transform (DWT).]

VII.
TEST RESULTS

The proposed video coding system was implemented in the video compression testbed at GTE Laboratories, Waltham, MA. The testbed includes a digital video recorder that allows real-time acquisition and playback of 25 s of digital video in CCIR 601 format. The recorder is interfaced to a host machine in which the compression software resides. By using software simulation of the compression and decompression algorithms, we can reconstruct video segments and compare them with the original signal via real-time playback. Therefore, compression performance, quality degradation, and computational efficiency can be evaluated for different coding algorithms.

The test sequence "CAR" used in this paper is a full-motion interlaced color video sequence in CCIR 601 format with 720 × 480 pixels per frame and 16 b/pixel. It is a fast camera-panning sequence and ideal for testing various motion compensation schemes. Experimental results were also obtained for other sequences, including the "CHEERLEADERS" and "FOOTBALL" sequences used for MPEG testing. All the results and parameters follow the same pattern, although the actual numbers differ.

Table I shows the energy distribution among different subimages before and after the proposed variable block-size MRMC for a typical video frame in the "CAR" sequence. Four variations of the MRME schemes described in Section V are compared. "S8 only" means that full motion searching is implemented only in S8 and all other subimages use the same or scaled motion vectors obtained in S8 (Δ = 0 in Fig. 5). "S8, W8^i only" means that full motion searching is conducted in layer 3 of the wavelet pyramid and directionally propagated to the lower layer subimages without refinement (see Fig. 6). "S8 + refine" means that full searching is used for S8, and the motion vectors of all other wavelets are predicted based on the motion information obtained for S8 (Fig. 5).
Finally, "S8, W8^i + refine" is an extension of the scheme illustrated in Fig. 6, using the motion vectors of S8 and W8^i as initial predictions.

TABLE I
ENERGY DISTRIBUTION AMONG DIFFERENT SUBIMAGES FOR A TYPICAL FRAME IN THE "CAR" SEQUENCE

Energy             S8       W8^1     W8^2    W8^3    W4^1     W4^2   W4^3   W2^1    W2^2  W2^3
Original signal    4958723  7361.20  452.91  148.47  1391.86  65.89  18.46  203.53  7.48  3.31
S8 only            336.00   848.63   155.74  193.20  428.40   53.07  37.12  110.65  5.26  1.28
S8, W8^i only      336.00   195.58   44.32   28.34   414.99   44.54  21.62  97.47   4.31  0.55
S8 + refine        336.00   186.16   44.56   29.01   138.83   13.58  4.65   39.96   1.02  0.05
S8, W8^i + refine  336.00   195.58   44.32   28.34   139.71   15.14  7.37   38.61   0.78  0.19

The means of all the DRS's are very small, ranging from as low as 0.01 to a maximum of 3.75. The variance, however, depends on the motion activity and the accuracy of the motion prediction algorithm: with high motion, the variances of all the subimages increase, and the accuracy of the motion vectors contributes significantly to the energy in the displaced residual subimages.

It can be seen from Table I that the decomposition compacts most of the energy of the original video signal into S8. After the variable block-size MRMC, the energies or variances in most subimages are considerably reduced. The reduction in the highest layer subimages (especially S8) is rather significant. This is because a very accurate motion estimation scheme is used for this layer, so the variations between successive frames are well compensated. This layer is the most important in terms of visual perception and is appropriately treated in the proposed variable block-size MRMC approach. It can also be easily observed that the "S8 + refine" and "S8, W8^i + refine" schemes produce less residual energy than the other two schemes.
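Table-I-style numbers can be reproduced by measuring, for each subimage, the spread of its coefficients before and after motion compensation. A minimal sketch follows; it is illustrative only, and since the paper does not spell out its exact energy definition, variance about the mean is assumed here:

```python
import numpy as np

def subimage_energy(x):
    """Energy of a subimage, taken here as the coefficient variance."""
    x = np.asarray(x, dtype=float)
    return float(np.mean((x - x.mean()) ** 2))

# A well-compensated residual subimage should show far less energy
# than the corresponding subimage of the original frame.
```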
The distribution of energies among the different subimages in the U and V components of the color signal follows a similar pattern to that in the Y component. The luminance signal contains more than 60% of the total energy of the original signal, while the U and V components each contain less than 20%. The normalizing factor for quantization is therefore set to a lower level for the Y component than for the U and V components.

The four different scenarios 1)-4) of the proposed wavelet video compression scheme described in Section VI were implemented. A classified vector quantization (VQ) scheme was also used to quantize the variable-size MRMC residual video frames (scenario (e) in Fig. 9). For schemes (a) and (b), the "S8, W8^i + refine" MRMC scheme is used to find the motion vectors, since it gives the best performance, as shown in Table I. The conventional full-search MC scheme with a block size of 8 × 8 is employed in schemes (c) and (d). Fig. 9 illustrates the peak-to-peak signal-to-noise ratio (SNR) for these five scenarios at an average bit rate of 3 Mb/s. A normalizing factor $D_M$ of 8.0 for Y and 16.0 for U and V is used for schemes (a) and (c). For schemes (b) and (d), the same quantization tables as those used in the JPEG specification are used to quantize the DCT coefficients.

[Fig. 9. Peak-to-peak signal-to-noise ratios (in dB) versus frame number for schemes (a) through (e) at 3 Mb/s. (a) WT/MRMC/AQ. (b) WT/MRMC/DCT/Q. (c) MC/WT/AQ. (d) MC/WT/DCT/Q. (e) WT/MRMC/VQ.]

Fig. 9 indicates that DWT working on the original video domain, incorporated with the proposed variable block-size MRMC scheme, has a better performance than DWT operating on the DFD compensated by a conventional full-searching scheme.
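The peak SNR plotted in Fig. 9 can be computed per frame as follows. This is a standard definition, assumed here since the paper does not spell out its formula; 255 is taken as the peak value for 8-bit components:

```python
import numpy as np

def peak_snr_db(ref, rec, peak=255.0):
    """Peak SNR in dB between a reference frame and its reconstruction."""
    ref = np.asarray(ref, dtype=float)
    rec = np.asarray(rec, dtype=float)
    mse = np.mean((ref - rec) ** 2)               # mean squared error
    return float("inf") if mse == 0 else 10.0 * np.log10(peak**2 / mse)
```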
We can also see that DCT mapping after DWT does not compact energies as one might expect for original video samples; instead, it has an adverse effect on the overall performance. Scheme (a) clearly outperforms schemes (b), (c), and (d), and VQ gives a further gain in the peak-to-peak signal-to-noise ratio. These observations are also supported by subjective evaluations. We observe some "blocking effect" and other artifacts inherent to the block DCT approach when using schemes (b) and (d), although they are not as pronounced as with a simple interframe block DCT scheme [20]. Schemes (a) and (c) are basically free from the "blocking effect." This is due to the fact that the wavelet decomposition involves a global transform, and hence the distortion is randomly distributed over the whole picture, which is less annoying to human viewers than the periodic "blocking effect." It should be pointed out that appropriate quantization tables (rather than the JPEG default tables used in this paper) for the DCT coefficients in residual subimages may improve the performance of scenarios (b) and (d), since residual subimages have completely different statistical properties from original video samples.

VIII. SUMMARY, CONCLUSIONS, AND FUTURE WORK

In this paper, the application of the discrete wavelet transform (DWT) to full-motion video compression was examined. DWT decomposes a video signal into a pyramid structure with multiple layers that characterize the video signal at different scales and in different frequency ranges. This representation matches the intrinsic properties of the human vision structure in its early stages, as speculated in current research in the field. Based on a set of wavelet coefficients developed by Daubechies [1] and a variable block-size MRMC scheme, a video compression system was presented. A bit-allocation assignment formula was derived based on a weighted distortion criterion.
The adaptive truncation process used in this paper is similar to the scheme used in Chen's scene-adaptive coder, but the normalization factor was appropriately adjusted to match the "importance" level of subimages in the pyramid structure. Four variations of the proposed video compression scheme were implemented and compared in terms of the peak-to-peak signal-to-noise ratio.

Our results indicated that DWT working on the original video domain, incorporated with the proposed variable block-size MRMC scheme, outperforms DWT operating on the DFD compensated by a conventional full-searching scheme. We also observed that DCT mapping after DWT does not compact energies as one might expect for original video samples; instead, it has an adverse effect on the overall performance. These observations are also supported by subjective evaluations. We observed some "blocking effect" and other artifacts inherent to the block DCT approach when using schemes 2) and 4), although they are not as pronounced as with a simple interframe block DCT scheme. Schemes 1) and 3) are basically free from the "blocking effect." This is due to the fact that the wavelet decomposition involves a global transform, and hence the distortion is randomly distributed over the whole picture, which is less annoying to human viewers than the periodic "blocking effect." In addition, a classified VQ scheme was used to quantize the displaced residual subimages after the proposed variable-size MRMC, and considerable gain in the peak-to-peak signal-to-noise ratio was obtained.

Recently, biorthogonal wavelets with linear phase and nonseparable wavelet functions such as the quincunx and hexagonal wavelets have been developed and applied to image coding applications. The extensions to video compression are straightforward by using the proposed motion prediction schemes or other existing methods.
The directional properties of the nonseparable wavelets may further improve the motion prediction performance in the proposed MRMC scheme, as indicated in Fig. 6. Also, comparisons should be made between the 3-D wavelet representation approach and the motion-compensated wavelet approach. The advantages of the 3-D approach are that 1) it is more computationally efficient, since no motion estimation needs to be implemented, and 2) no refreshing frames need to be sent. Its disadvantages are that 1) larger buffers are needed and 2) scene changes are difficult to handle. It is not clear which approach performs better; however, the two approaches suit different scenarios. For example, the 3-D wavelet approach can be used in interactive video communication environments such as video-telephony, since it is symmetric in nature, whereas the MRMC 2-D approach can be used in one-way broadcasting and video-on-demand applications, since its decoder structure is much simpler. Future studies should be directed to these areas. Finally, combining the 3-D wavelet/subband representation with the proposed multiresolution motion compensation, by applying the filter banks along the motion trajectory, should also improve the performance. These topics deserve further study.

ACKNOWLEDGMENT

The authors would like to thank Dr. D. Le Gall of C-Cube Microsystems for providing the original video sequence "CAR." Assistance from Dr. I. Daubechies of AT&T Bell Laboratories is greatly appreciated. Comments from the reviewers were also helpful in improving the presentation of this material. This work was carried out at the Video Techniques and System Engineering Department of GTE Laboratories. The authors would like to acknowledge the consistent support and interest of S. Walker and P. Tweedy at GTE Laboratories.

REFERENCES

[1] I. Daubechies, "Orthonormal bases of compactly supported wavelets," Comm. Pure Appl. Math., vol. XLI, pp. 909-996, 1988.
[2] S. Mallat, "Multifrequency channel decompositions of images and wavelet models," IEEE Trans. Acoust., Speech, Signal Processing, vol. 37, no. 12, pp. 2091-2110, Dec. 1989.
[3] ——, "A theory for multiresolution signal decomposition: The wavelet representation," IEEE Trans. Pattern Anal. Machine Intell., vol. 11, no. 7, pp. 674-693, July 1989.
[4] P. Burt, "Multiresolution techniques for image representation, analysis, and 'smart' transmission," SPIE Visual Communications and Image Processing IV, vol. 1199, Philadelphia, PA, Nov. 1989.
[5] M. Vetterli and C. Herley, "Wavelets and filter banks: Relationships and new results," in Proc. ICASSP'90, Albuquerque, NM, Apr. 3-6, 1990.
[6] D. Marr, Vision. New York: Freeman, 1982.
[7] P. Burt and E. Adelson, "The Laplacian pyramid as a compact image code," IEEE Trans. Commun., vol. COM-31, pp. 532-540, Apr. 1983.
[8] M. Vetterli, "Multidimensional subband coding: Some theory and algorithms," Signal Processing, vol. 6, pp. 97-112, Apr. 1984.
[9] J. Woods and S. O'Neil, "Subband coding of images," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-34, no. 5, pp. 1278-1288, Oct. 1986.
[10] E. Adelson, S. Simoncelli, and R. Hingorani, "Orthogonal pyramid transforms for image coding," SPIE Visual Communications and Image Processing II, vol. 845, Boston, MA, pp. 50-58, Oct. 1987.
[11] W. Zettler, J. Huffman, and D. Linden, "Application of compactly supported wavelets to image compression," SPIE Image Processing Algorithms and Techniques, vol. 1244, Santa Clara, CA, pp. 150-160, Feb. 13-15, 1990.
[12] M. Antonini, M. Barlaud, P. Mathieu, and I. Daubechies, "Image coding using vector quantization in the wavelet transform domain," in Proc. ICASSP'90, Albuquerque, NM, pp. 2297-2300, Apr. 3-6, 1990.
[13] N. Baaziz and C. Labit, "Laplacian pyramid versus wavelet decomposition for image sequence coding," in Proc. ICASSP'90, Albuquerque, NM, Apr. 3-6, 1990.
[14] Draft Revision of Recommendation H.261, Document 572, CCITT SG XV, Working Party XV/1, Specialists Group on Coding for Visual Telephony, 1990.
[15] MPEG Video Draft Proposal, ISO/IEC JTC1/SC2/WG11, Sept. 1990.
[16] A. Netravali and B. Haskell, Digital Pictures—Representation and Compression. New York: Plenum Press, 1988.
[17] W. Chen and W. Pratt, "Scene adaptive coder," IEEE Trans. Commun., vol. COM-32, no. 3, pp. 225-232, Mar. 1984.
[18] A. Watson, "The cortex transform: Rapid computation of simulated neural images," Comput. Vision, Graphics, Image Processing, vol. 39, pp. 311-327, 1987.
[19] K. Uz, M. Vetterli, and D. Le Gall, "Interpolative multiresolution coding of advanced television and compatible subchannels," IEEE Trans. Circuits Syst. Video Technol., vol. 1, no. 1, pp. 86-99, Mar. 1991.
[20] S. Zafar, Y. Zhang, and J. Baras, "Predictive block-matching motion estimation schemes for TV coding—Part I: Inter-block prediction," IEEE Trans. Broadcast., vol. 37, no. 3, pp. 97-105, Sept. 1991.
[21] Y. Zhang and S. Zafar, "Motion-compensated wavelet transform coding for color video compression," SPIE Visual Communications and Image Processing, Boston, MA, pp. 301-316, Nov. 10-15, 1991.

Ya-Qin Zhang was born in 1966 in Taiyuan, China. He received the B.S. and M.S. degrees in electrical engineering from the University of Science and Technology of China (USTC), Hefei, China, in 1983 and 1985, respectively. He received the Doctor of Science (Sc.D.) degree in electrical engineering from George Washington University, Washington, DC, in 1989. He has been a Senior Member of Technical Staff in the Video Techniques and System Engineering Department of GTE Laboratories, Waltham, MA, since May 1991. Previously, he was a Member of Technical Staff at Contel Technology Center, Chantilly, VA. He was on the part-time faculty of George Washington University in 1990.
He has published more than 40 papers in medical imaging and image/video communications. He received the Merwin Ph.D. Award, sponsored by the Industrial Liaison Program, for his academic achievements in 1989.

Sohail Zafar (S'87) was born in Lahore, Pakistan, on November 3, 1960. He received the B.Sc. degree in electrical engineering from the University of Engineering and Technology, Lahore, Pakistan, in 1981, and the M.S. degree from Columbia University, New York, in 1988. Since 1989, he has been a Graduate Research Assistant at the University of Maryland, College Park, MD, where he is pursuing the Ph.D. degree. He worked as a Member of Technical Staff at Contel Technology Center, Chantilly, VA, during the summers of 1989 and 1990, and is currently a summer Member of Technical Staff at GTE Laboratories, Waltham, MA. His research interests include neural networks, parallel processing, and video coding and transmission.