 Open Access
 Total Downloads : 867
 Authors : S.Srikanth, V.Muralidharan, A.Santhoshkumar
 Paper ID : IJERTV1IS7381
 Volume & Issue : Volume 01, Issue 07 (September 2012)
 Published (First Online): 26092012
 ISSN (Online) : 22780181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Survey on Lifting based 2D DWT VLSI Architectures
Survey on Lifting based 2D DWT VLSI Architectures
S.Srikanth Assistant Professor
SNS College of Technology
V.Muralidharan Assistant Professor
Christ the King Engineering College

anthoshkumar Assistant Professor
SNS College of Technology
Abstract In this paper, we review recent developments in VLSI architectures and algorithms for efficient implementation of lifting based 2D Discrete Wavelet Transform (DWT). Lifting scheme technique is used for Fast implementation of DWT. This method is entirely based on a spatial interpretation of the wavelet transform. In this paper, we provide a survey of the architectures for 2dimensional DWT. The architectures are representative of many design styles and range from highly parallel architectures to DSPbased architectures to folded architectures
Keywords Discrete Wavelet Transform, Lifting Scheme, MIMO,

INTRODUCTION
2D DWT has evolved as essential part of modern compression system such as JPEG 2000. This is because the DWT can decompose the signals into different subbands with both time and frequency information and facilitate to arrive a high compression ratio [1]. In addition ,a wavelet based compression system, not only presents superior compression performance over DCT ,but provides four dimension of scalabilities resolution, distortion, spatial and color, which are very difficult to achieve in DCT based compression system. In a compression system, the function of DWT is to decorrelate the original image pixels prior to compression step such that they can be amenable to compression. The computation of DWT can be done either by convolution based scheme or Lifting based scheme. The lifting scheme of computation of DWT has, however, become more popular over the convolutionbased scheme for its lower computational complexity [2] .The main feature of the liftingbased DWT scheme is to break up the high pass and lowpass filters into a sequence of upper and lower triangular matrices and covert the filter implementation into banded matrix multiplications. Such a scheme has several advantages, including inplace computation of DWT, integerto integer wavelet transform, symmetric forward and inverse transform. The popularity of liftingbased DWT has triggered the development of several architectures in recent years. These architectures range from highly parallel architectures to programmable DSPbased architectures to folded architectures.
In this paper we present a survey of these architectures. We provide a systematic derivation of these architectures and comment on their hardware and timing requirements.
The rest of the paper is organized as follows. In Section II, We explained about the lifting based DWT. We also present a Comparison of the hardware and timing complexities of all the Architectures. In Section III, we present the memory configuration for 2dimensional DWT architectures, followed by descriptions of a few representative architectures and a comparison of their hardware and timing complexities.

LIFTING BASED DWT
In traditional convolution (filtering) based approach for computation of the forward DWT, the input signal (x) is filtered separately by a lowpass filter ( h ) and a high pass filter ( g). The two output streams are then sub sampled by simply dropping the alternate output samples in each stream to produce the lowpass (yL) and highpass (yH) subband outputs.
The liftingbased DWT has many advantages over the convolution based approach. Some of them are as follows.
Liftingbased DWT typically requires less computation (up to 50%) compared to the convolution based approach. However the savings depends upon the length of the filters.
During the lifting implementation, no extra memory buffer is required because of the inplace Computation feature of lifting. This is particularly suitable for hardware implementation with limited On chip memory.
The lifting based approach offers integer to integer transformation suitable for lossless image Compression.
In lossless transformation mode, the boundary extension of the input data can be avoided because the original input can be exactly reconstructed by integer to integer lifting transformation.

TWO DIMENSIONAL DWT ARCHITECTURE
Generally, 2D wavelet filters are separable functions. A straightforward approach for 2D implementation is to first apply the 1D DWT rowwise (to produce L and H sub bands) and then columnwise to produce four sub bands LL, LH, HL and HH in each level of decomposition. Obviously, the processor utilization is a concern in direct implementation of this approach because it requires all the rows be filtered before the column wise filtering can begin and thus it requires a size of memory buffer of the order of the image size.
The alternative approach is to begin the columnprocessing as soon as sufficient number of rows has been filtered. The columnwise processing is now performed on these available lines to produce wavelet coefficients rowwise. The overview of the twodimensional architecture for convolution based DWT is shown in Fig1: (a). the row module reads the data from MEM1 performs DWT along the rows and writes the data into MEM2. The column module reads the data from MEM2 performs DWT along the columns and writes LL data to MEM1 and LH, HL, HH data to external memory.

A similar approach can be implemented for the lifting scheme as well. The basic idea of lifting based approach for DWT implementation is to replace the parallel lowpass and highpass filtering of traditional approach by a sequence of alternating smaller filters. The computations in each filter can be partitioned into prediction (dual lifting) and update (primal lifting) stages as shown in Fig1: (b). Here the row module reads the data from MEM1 performs the DWT along the rows (H and L) and writes the data into MEM2. The prediction filter of the column module reads the data from MEM2, performs columnwise DWT along alternate rows (HH and LH) and writes the data into MEM2 in [5] (and into MEM1 in [12]); the update filter of the column module reads the data from MEM2 in [5] (and MEM1 in [12]), performs column wise DWT along the remaining rows, and writes the LL data into MEM1 for higher octave computations and HL data to external memory. Note that this is a generic architectural flow and is the backbone of the existing 2D architectures. An important consideration in the design of 2D architectures is the memory configuration. A tradeoff exists between the size of the internal memory and the frame memory access bandwidth. The size of the internal memory is again a function of the way the frame memory is scanned. In Section A, we describe the existing scanning techniques along the lines of [14], [13].Then we describe three representative 2D DWT architectures, namely, the dedicated architecture for the (4,2) filter [12], the generalized architecture [5] and the recursive architecture [8], and compare them with respect to hardware and timing complexities.

Memory Scan Techniques
The memory scan techniques can be broadly classified into linebased scan, blockbased scan and stripebased scan. Though most of the existing architectures are based on line scan, we describe all three techniques to (possibly) facilitate development of new 2D DWT architectures.
LineBased Scan
In linebased scan, the scan order is raster scan. An internal line buffer of size LN is required, where N is the number of pixels in a row and L is the number of rowsthat are required for that particular filter. A linebased implementation of the traditional convolution based wavelet transform has been discussed in great detail in [10]. For lifting based architectures, the value of L can be determined as in [13] by considering the data dependency graph (Fig2). This is an extension of the 1D data dependency graph (Fig1) with a node now corresponding to a row of pixels. Note that several rows of data corresponding to R2R4 and coefficients corresponding to R1 and R5 have to be stored. When a new row of data
corresponding to R6 is available, another column operation can be initiated. After this column operation, data in R7R9 are stored for the next column operation. According to the implementation in [13], the line buffer needs to store six rows of data. The implementation in [14] as well as that in [11] requires only four rows of data to be stored. A detailed analysis of the memory requirements for line scan implementations of both forward and inverse transforms are presented in [7].
Blockbased Scan
In blockbased scan, the frame memory is scanned blockbyblock and the DWT coefficients are also computed blockbyblock. Figure 3 shows two configurations of block based methods where the blocks are scanned in the row direction first. In the non overlapped configuration, the blocks are not overlapped with each other and in the overlapped configuration, the blocks are overlapped by 2K pixels in the column direction.
Here K = _(L 1)/2_,where L is the number of DWT filter taps. In both cases, intermediate data have to be stored between two adjacent blocks as shown in grey in Fig. 3. The size of the internal buffer for one level for the nonoverlapped case is LN + LBy. The first term, LN, is due to the columnwise intermediate data and the second term, LBy is due to the intermediate data between adjacent blocks in a row.
The size of the internal buffer can be reduced to only LBy if the columnwise intermediate data is not stored and instead the data is read from the frame memory as needed. The size of the internal buffer for the overlapped case can also be reduced to LBy at the expense of increasing the number of frame memory reads to N2By/(By2K) [14]. However, this scheme is not directly applicable to multilevel architectures. The blockbased technique proposed in [11] first performs filtering operation on neighboring data blocks independently and later combines the partial boundary results together. Two boundary postprocessing techniques are proposed – overlapstate sequential which reduces the buffer size for sequential processing and splitandmerge which reduces the interprocessor delay in parallel implementations.
StripeBased Scan
The stripebased scan is equivalent to the linebased scan with Bx = N. In other words, the stripe is a very wide block with width N and height S. As in the case of blockbased scan, there are two categories, namely, the nonoverlapped stripe based scan also referred to as the optimal Zscan in [13] and shown in Fig. 4(a) and the overlapped stripe based scan shown in Fig. 4(b). The nonoverlapped stripebased scan has an internal buffer of size LN +LS and N2 frame memory READ accesses. In contrast, the overlapped stripebased scan has a significantly smaller internal buffer of size LS and N2S/(S 2K) frame memory READ accesses.

(4, 2) Filter Architecture
A dedicated architecture for 2D DWT using the (4, 2) filter from the DeslauriersDubuc family has been proposed by Ferretti and Rizzo in [12]. The architecture is shown in Fig5. It consists of two parallel filters to compute the predict and update values along the rows (Predrow, Updrow), two parallel filters to compute the predict and update values along the columns, and four buffers A, B, C, D, to hold the intermediate data to support the pipelined computations. The buffers are dualported and are organized such that words can be accessed simultaneously.
Each filter consists of multipliers (Lg = 4 for predict filters and Lh = 2 for update filters), adders, shifters and internal buffers (proportional to lg and Lh) to streamline the computations. Predrow computes on Lg = 4 data, Lg 1 of which are stored in its internal buffer. It computes the H values. The Updrow requires LH = 2 H values to compute a L value. It obtains these by reading the last value produced by Predrow and storing the other Lh 1 in internal registers. It picks up the primary input value from the internal buffer in Predrow.
Predcol performs the same basic operations as Predrow, though working on columns. It reads Lg even position H values along the columns. It produces a new row of wavelet coefficients for every two rows produces by Predrow. During the time Predrow produces H values for oddindexed rows, Predcol computes on the L values generated by Updrow. The architecture utilization is only 50% if we only consider the computations in the first level. The higher level computations can thus be easily interspersed with the first level computations using a RPAbased approach. In fact, once the
unit delay for any level is determined, the schedule can be easily obtained.

Generalized 2D DWT Architecture
The architecture proposed by Andra et al. [8] Is more generalized and can compute a large set of filters for both the 2D forward and inverse transforms. It supports two classes of Architectures based on whether lifting is implemented by one or two lifting steps. The M2 architecture corresponds to implementation using one lifting step or two factorization matrices, and the M4 architecture corresponds to implementation by two lifting steps or four factorization matrices. The dataflow of the M2 architecture that is used to implement the wavelet filters (5,3), C(13,7), S(13,7), (2,6), (2, 10) is similar to that in Fig1: (b). A block diagram of the M2 architecture is shown in Fig. 6.
It consists of the row and column computation modules and two memory units, MEM1 and MEM2. The row module consists of two processors RP1 and RP2 along with a register file REG1, and the column module consists of two processors CP1 and CP2 along with a register file REG2. All the four processors RP1, RP2, CP1, CP2 in the proposed architecture consists of 2 adders, 1 multiplier and 1 shifter. For the M2 architecture, RP1 and CP1 are predicting filters and RP2 and CP2 are update filters.
The data access pattern for the (5, 3) filter with N = 5. RP1 calculates the highpass (odd) elements along the rows, y01,y03, while RP2 calculates the lowpass (even) elements along the rows, y00, y02, y04. . . , CP1 calculates the high pass and low pass elements z10, z11. . . z30, z31 . . . along odd rows and CP2 calculates highpass and lowpass elements z00,
z01, . . . ; z20, z21, . . . ; z40, z41, . . . along the even rows. Note that CP1 and CP2 start computations as soon as the required elements are generated by RP1 and RP2.
The memory modules, MEM1 and MEM2, are both dual port with one read and one write port, and support two simultaneous accesses per cycle. MEM1 consists of two banks and MEM2 consists of four banks. The multibank structure increases the memory bandwidth and helps support highly pipelined operation. Details of the memory organization and size, register file, and schedule for the overall architecture with specific details for each constituent filter have been included in [5].
[9].The rowprocessor is a 1D folded architecture and does rowwise computations in the usual fashion. The column processor is responsible for filtering along the columns at the first level and filtering along both the rows and the columns at the higher levels. It does this by employing RPA scheduling and achieves very high utilization. The utilization of the row processor is 100%, and that of the column processor is 83% for 5level decomposition.
Parallel and Pipelined VLSI Architecture for Multilevel Lifting 2D DWT
The dataflow of the M4 architecture that is used to implement the filters (9,), (6,10) is quite different. Since this is a generalized architecture with the hardware in the row and column modules fixed, the computations span two passes. In the first pass, the rowwise computations are performed using both the modules. Module 1 reads the data from MEM1, executes the first two matrix multiplications, and writes the result into MEM2. Module 2 executes the next two matrix multiplications and writes the result into MEM1. In the second pass, the transform is computed along the columns. Once again, Module 1 executes the first two matrix multiplications and Module 2 executes the next two matrix multiplications.
D. 2D Recursive Architecture
The 2D recursive architecture proposed by Liao et al. [8] is built by the 1D recursive architecture proposed by the same authors in [7,8]. As in the 1D case, the computations of all the lifting stages are interleaved to increase the hardware utilization. The column processor and the rowprocessor are similar to the1D recursive architecture processor. The image is input to the rowprocessor in raster scan format. When the rowprocessor processes the even rows, the high and low frequency DWT coefficients of the odd rows are shifted into their corresponding firstin firstout FIFO registers. The use of two FIFOs to separately store high frequency and low frequency components results in lower controller complexity. When the row processor processes the odd lines, the low frequency DWT coefficients of the current line and lines previously stored in the FIFOs are sent to the column processors. The columnprocessor starts calculating the vertical DWT in zigzag scan format after one row delay. The computations are arranged in a way that the DWT coefficients of the first stage are interleaved with the other stages. The arrangement is done with the help of the data arrangement switches at the input to the row and column processors, and the exchange switch.
A mix of the principles of recursive pyramid algorithm (RPA) [6] and folded architecture has been adopted by Jung et to design a 2D architecture for lifting based DWT in
According to the pyramidalgorithm (PA), the lowlow sub band of a given decomposition level is processed further to generate DWT coefficients of the next higher level, and due to downsampling, after each level of decomposition, the computational complexity steadily decrease by a factor of four. The amount of hardware resources required to calculate the DWT coefficients of every higherlevel of decomposition should, therefore, be reduced by a factor of 4, in order to achieve 100% HUE. The scalable parallel structure [15] for multilevel DWT, based on the above view point, is shown in Fig 7. It is comprised of (L+1) PUs, where L= [log4P] The PU (L+1) is a RPA based structure, while all other PUs are simple PA units. In each cycle, PU1 receives a block of samples of an input row and produces a block of component of a particular
row of sub band matrices [C1,D1]or[B1,A1] One row of sub band [A1]is obtained from PU1 after a gap of R cycles.
Architecture
Multiplier
Adder
Onchip memory
Computing time
control
(4,2) filter [17]
10
8
25N+7[L2]
Tm+2Ta+Ts
Moderate
Generalized [5]
4
8
N2+34
Tm
Moderate
Recursive[7]
8
8
4N
4 Tm+8 Ta
Complex
Folded Rpa [9]
9
2
12N
4Tm+4 Ta
Complex
MIMO[16]
4M
8M
10M+4N+
(MN/2)
N2/M
Simple
Each block of sub band output [A1] of PU1 is split into two blocks of (P/4) samples each, and fed to PU2 in each cycle, so that one complete row of [A1] is fed in 2R cycles. The computation of PU j, (for j>1) is similar to that of PU1. Similar to PU1PUj can be designed for the computation of jth level DWT. It consists of (P/22j1) number of PEs and calculates the jth level DWT of an input block of (P/22j2) samples in every cycle, where 1<j<L and L=[log4P]. Structures of PUj are similar to that of PU1, except that the size of each shift register of ID and MD of subcell2 is 2j1 words. Each of the (PU j)s uses a separate input buffer (IB j).The structure (IB j) is shown in fig .It is comprised of two RAM units and two MULTIPLEXERS. (IB j).receives N components in each cycle and complete a row of Aj1 in one cycle and complete one row in 2R cycles.

MIMO VLSI Architecture for 2D LiftingBased
DWT
An efficient multiinput/multioutput VLSI architecture (MIMOA)[16] is constructed as shown in Fig.8, which meets the high processing speed requirement with controlled increase of hardware cost and simple control signals. High processing speed can be achieved when multiple row data samples are processed simultaneously. And time multiplexing technique is adopted to control the increase of the hardware cost for the MIMOA. Furthermore, the control signals are simple, since the regular architecture is a combination of simple single input/singleoutput (SISO) modules and twoinput/twooutput (TITO) modules. It provides a variety of hardware implementations to meet different processing speed requirements by selecting different throughput rates.

Comparison of Performances
A summary of the hardware, memory and timing requirements of a few representative architectures is presented in Fig 9 . The hardware complexity has been described in terms of data path components and internal memory size
We list only the internal memory size since all the architectures require an External memory of size N2 for input data of size NÃ—N. The Timing performance has been compared with respect to the number of clock cycles to compute L levels of decomposition and the clock period.
Of the five architectures, the architecture in [8] has the smallest internal memory. This is because [8] is an RPA based approach that intersperses the computations at the higher levels with those of the lower levels. The architecture in [8], on the other hand, computes all the outputs of one level before starting the computations at the next level and has an internal memory of size O(N2). The data path complexity of the architecture in [8] is by far the lowest.
The control complexity of the architecture in [8] is significantly higher than the others. This is because of the large number of control signals and switches that are used to organize the data before sending to the row and column Computation units.
In terms of the timing performance, the architecture in [5] is pipelined and has the highest throughput (1/Tm). The
architecture in [8, 9] requires the fewest number of cycles since they are RPA based, though the clock periods are significantly higher.
The architecture in [12] is specific to the (4,2) filters while the RPA concept that is applied to the architectures in [8,9] can be applied to a large set of filters (not just (3,5), (9,7),Daub4). The architecture in [9] is essentially a programmable architecture which supports implementation of a large set of filters on the same hardware platform.



CONCLUSION

In this paper, we presented a survey of the existing lifting based implementations of 2dimensional Discrete Wavelet Transform. We briefly described the principles behind the lifting scheme in order to better understand the different implementation styles and structures. We have presented several architectures of different flavors ranging from highly parallel ones to highly folded ones to programmable ones.
REFERENCES
[1]. S. Mallat, A Theory for Multireolution Signal Decomposition: The Wavelet Representation, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 11, no. 7, 1989, pp. 674693.
T. Acharya and P. S. Tsai, JPEG2000 Standard for Image Compression: Concepts, Algorithms and VLSI Architectures.John Wiley & Sons, Hoboken, New Jersey, 2004.

W. Sweldens, The Lifting Scheme: A CustomDesign Construction of Biorthogonal Wavelets, Applied and Computational Harmonic Analysis, vol. 3, no. 15, 1996, pp. 186200.

I. Daubechies and W. Sweldens, Factoring Wavelet Transforms into Lifting Schemes, The J. of Fourier Analysis and Applications, vol. 4,1998, pp. 247269.

K. Andra, C. Chakrabarti, and T. Acharya, A VLSI Architecture for LiftingBased Forward and InverseWavelet Transform, IEEE Trans.of Signal Processing, vol. 50, no. 4, 2002, pp. 966 977.

M. Vishwanath, The Recursive Pyramid Algorithm for the Discrete Wavelet Transform, in IEEE Transactions on Signal Processing, vol. 42, 1994, pp. 673676.

H. Liao, M.K. Mandal, and B.F. Cockburn, Novel Architectures for LiftingBased Discrete Wavelet Transform, in Electronics Letters, vol. 38, no. 18, 2002, pp. 10101012.

H. Liao, M.K. Mandal, and B.F. Cockburn, Efficient Architectures for1D and 2D LiftingBasedWavelet Transform, IEEE Transactions on Signal Processing, vol. 52, no. 5, 2004, pp.13151326.

G.C. Jung, D.Y. Jin and S.M. Park, An Efficient Line Based VLSI Architecture for 2D Lifting DWT, in The
47th IEEE International Midwest Symposium on Circuits and Systems, 2004.

C. Chrysafis and A. Ortega, LineBased, Reduced Memory, Wavelet Image Compression, IEEE Trans. on Image Processing, vol. 9, no. 3, 2000, pp. 378389.

W. Jiang and A. Ortega, Lifting FactorizationBased Discrete Wavelet Transform Architecture Design, IEEE Trans, on Circuits and Systems for Video Technology, vol. 11, 2001, pp. 651 657.

M. Ferretti and D. Rizzo, A Parallel Architecture for the 2D Discrete Wavelet Transform with Integer Lifting Scheme, Journal of VLSI Signal Processing, vol. 28, 2001, pp. 165 185.

M.Y. Chiu, K.B. Lee, and C.W. Jen, Optimal Data Transfer and Buffering Schemes for JPEG 20000 Encoder, in Proceedings of the IEEE Workshop on Design and Implementation of Signal Processing Systems, 2003, pp. 177182.

C.T. Huang, P.C. Tseng, and L.G. Chen, Memory Analysis and Architecture for TwoDimensional Discrete Wavelet Transform, in Proceedings of the IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 2004, pp. V13 V16.

Meher, P. K. Mohanty, B. K. and Patra, J. C.( 2008) Memory Efficient Modular VLSI Architecture for High throughput and LowLatency Implementation of Multilevel Lifting 2D DWT. IEEE Transactions on Signal processing, vol. 59, no. 5, may 2011.

Xin Tian , Lin Wu , YiHua Tan Efficient Multi Input/MultiOutput VLSI Architecture for TwoDimensional LiftingBased Discrete Wavelet Transform, IEEE Transactions on Computers, vol. 60, no. 8, August 2011.

M. Ferretti and D. Rizzo, A Parallel Architecture for the 2D Discrete Wavelet Transform with Integer Lifting Scheme, Journal of VLSI Signal Processing, vol. 28, 2001, pp. 165 185.