 Open Access
 Total Downloads : 301
 Authors : Ajinkya S. Bankar, Bhavika S. Shaha, P. K. Kadbe
 Paper ID : IJERTV2IS50408
 Volume & Issue : Volume 02, Issue 05 (May 2013)
 Published (First Online): 21052013
 ISSN (Online) : 22780181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Interstage Pipeline VLSI Architecture for 2D DWT
Ajinkya S. Bankar1,Bhavika S. Shaha2, P.K. Kadbe3 E&TC Department, Pune University1,2,3
VPCOE Baramati
Abstract
In this paper, a scheme for the design of a highspeed pipeline VLSI architecture for the computation of the 2 D discrete wavelet transform (DWT) is proposed. The main focus in the development of the architecture is on providing a high operating frequency and a small number of clock cycles along with an efficient hardware utilization by maximizing the interstage computational parallelism for the pipeline. The high speed computation is achieved by efficiently distributing the task of the computations of multiple decomposition levels among the stages of the pipeline and by optimally configuring the data and synchronizing the operations of pipeline so as to maximize the interstage computational parallelism. To validate the proposed scheme, an algorithm is designed and implemented in MATLAB for the 2D DWT computation. Then the circuit is simulated and implemented in VHDL.

Introduction
With the rapid progress of VLSI design technologies, many processors based on audio and image signal processing have been developed recently. The twodimensional discrete wavelet transform (2D DWT) plays major role in image/video compression standard. Wavelets decompose the signal at one level of approximation and detail signals at the next level. Thus subsequent levels can add more details to the information content. In addition to audio and image compression, the DWT has important applications in many areas, such as computer graphics, numerical analysis, radar target distinguishing and so forth. DWT is a computationally very intensive process and slow for many realtime applications when implemented in a general purpose computing system. It is essential to develope custom VLSI chips for DWT exploiting the underlying data parallelism to achieve high data rate.
H. Y. Liao et al. [2] have presented an architecture in which each of the row and columnwise filtering operations are decomposed using the so called lifting operations into a cascade of subfiltering operations. The scheme leads to a lowcomplexity architecture with
a large latency. C. Cheng et al. [3] have proposed an architecture in which a number of parallel FIR filters with a polyphase structure are used to improve the processing speed at the expense of increased hardware.

Marino et al. [4] have introduced a twostage pipeline architecture in which the first stage performs the task of the first decomposition level and the second one that of all the remaining levels, and has aimed at providing a short computation time. As the processing units employed in this architecture differ from one another, the complexity of the hardware resources is high and the design of the architecture is complicated.

Benkrid et al. [5] presents an FPGA architecture for the separable 2D Biorthogonal Discrete Wavelet Transform (DWT) decomposition. The architecture is based on the Pyramid Algorithm Analysis, which handles computation along the border efficiently by using the method of symmetric extension. P. McCanny et al. [6] have given, a VLSI architecture for performing the symmetrically extended two dimensional transform is presented. This architecture conforms to the JPEG 2000 standard and is capable of nearoptimal performance when dealing with the image boundaries.The architecture also achieves efficient processor utilization. S. Raghunath et al. [7] have presented an efficient architecture for a multiresolution symmetrically extended 2D 9/7 filter discrete wavelet transform processor is presented. Hardware complexity is greatly reduced with improved performance, due to the proposed combination of lifting scheme and line based architecture. I. S. Uzun et al. [8] have designed the nonseparable 2D discrete biorthogonal wavelet filter architecture which has been derived from modifiedrecursivepyramidalgorithm. MRPA based architecture exploits the downsampling of output subbands and performs the first decomposition level interspersed with all other levels by means of only one processing unit. C. Zhang et al. [9] presents, a scheme for the design of a highspeed pipeline VLSI architecture for the computation of the 2D discrete wavelet transform (DWT). The main focus in the development of the architecture is on providing a high operating frequency and a small number of clock cycles along with an efficient hardware utilization.

In this paper, a nonseparable pipeline architecture for fast computation of the 2D DWT with a reasonable low cost for the hardware resources is proposed. Separable approach is a simple way to compute the 2D DWT. However, separable filters being a special class of 2D filters are not capable to approximate well all arbitrary frequency responses. In this regard, a non separable approach of the 2D computation provides more flexibility. In the nonseparable approach depicted in Fig. 1, the DWT of a 2D signal s(n1, n2) is

Formulation for the Computation of Four Subbands
Let a 2D signal be represented by N0Ã—N0 matrix S(0), with its (m,n)th element denoted by S(0)(m,n)(0 m, n N01), where N0 is chosen to be 2J , J being an integer. Let the coefficients of a 2D FIR filter P (P=HH, HL, LH, LL) be represented by an LÃ—M matrix H(P). The (k,i)th coefficient of the filter P is denoted by H(P)(k, i)(0 k L1; 0 i M1). The decomposition
at a given level j=1, 2, , , J can be expressed as
computed by carrying out four separate 2D filtering
operations using four 2D filters: a highpasshighpass (HH) filter GHH(z1, z2), a highpasslowpass (HL) filter GHL(z1, z2), a lowpasshighpass (LH) filter GLH(z1, z2), and a lowpass lowpass (LL) filter GLL(z1, z2).The output signals of these four filters are then decimated by a factor of two in the horizontal and vertical directions producing, respectively the HH, HL, LH and LL components.
L1
A( j ) (m, n)
k 0
L1
B( j ) (m, n)
k 0
L1
C( j ) (m, n)
k 0
L1
S ( j ) (m, n)
k 0
M 1
H ( HH ) (k,i).S ( j1) (2m k,2n i)
i0
M 1
H ( HL) (k, i).S ( j1) (2m k,2n i)
i0
M 1
H ( LH ) (k, i).S ( j1) (2m k,2n i)
i0
M 1
H ( LL) (k,i).S ( j 1) (2m k,2n i)
i0
(1)
(2)
(3)
(4)
where A(j), B(j), D(j) and S(j), respectively, representing the HH, HL, LH and LL subbands of the 2D input signal at the jth level.
Fig.1 Computation of 1Level 2D DWT based on Non Separable approach


Formulations for the computation of 2D DWT
The 2D DWT is an operation through which a 2D signal is successively decomposed in a spatial multi resolution domain by low pass and highpass FIR filters along each of the two dimensions. The four FIR filters, denoted as highpasshighpass (HH), highpasslowpass (HL), lowpasshighpass (LH) and lowpasslowpass (LL) filters, produce, respectively, the HH, HL, LH and LL subband data of the decomposed signal at a given resolution level. The samples of the four subbands of

Formulation for a FourChannel Filtering Operation
In order to facilitate parallel processing for the 2D DWT computation, the LÃ—M filterig operation needs to be divided into mltichannel operations, each channel processing one part of the 2D data. It is seen from (4) that the even and odd indexed elements are always operated on the even and odd indexed filter coefficients, respectively. The matrix S(j) representing the LL subband at the jth level can, therefore, be divided into four (Nj/2+ L/2) Ã— (Nj/2 + M/2) sub matrices, S(j)ee , S(j)oe ,S(j)eo and S(j)oo , whose (m,n)th (0 m Nj/2 + L/2 – 1, 0 n Nj/2 +M/2 – 1) elements are given by
ee
ee
s( j ) (m, n) s( j ) (2m,2n)
oe
oe
s( j ) (m, n) s( j ) (2m 1,2n)
the decomposed signal at each level are decimated by a factor of two in each of the two dimensions. For the operation at the first level of decomposition, the given
s( j ) (m, n) s( j ) (2m,2n 1)
eo
eo
oo
oo
s( j ) (m, n) s( j ) (2m 1,2n 1)
(5)
2D signal is used as input, whereas for the operations of the succeeding levels of decomposition, the decimated LL subband signal from the previous resolution level is used as input.
taking into consideration the periodic padding samples at the boundary. It is seen from (5) that the data at any resolution level are divided into four channels for processing by first separating the even and odd indexed rows of S(j), and then separating the even and odd indexed columns of the resulting two sub matrices. The
data in each channel can then be computed by an (L/2Ã—M/2)tap filtering operation. In order to facilitate such a 4channel filtering operation, the filter coefficients, as used in (4), need to be decomposed appropriately. Accordingly, the matrix H(P) needs to be decomposed into four (L/2Ã—M/2) submatrices, H(p)ee
,H(p)oe, H(p)eo and H(p)oo , whose (k,i)th (0 k L/21, 0 i M/21) elements are given by respectively.
ee
ee
H ( P) (m, n) H ( P) (2m,2n)
oe
oe
H ( P) (m, n) H ( P) (2m 1,2n)
eo
eo
H ( P) (m, n) H ( P) (2m,2n 1) (6)
oo
oo
H ( P) (m, n) H ( P) (2m 1,2n 1)
By using (5) and (6) in ( 1 4), any of the four subband signals, A(j) ,B(j), C(j) and S(j), at the jth resolution level, can be computed as a sum of four convolutions using (L/2Ã—M/2)tap filters. For example, the LL subband given by (4) can now be expressed as
L / 21 M / 21
Hence, for a structure of the pipeline that uses identical filter units, the number of these filter units would be very large. Further, since the number of such filter units employed by the stages would decrease exponentially from one stage to the next in the pipeline, it will make their synchronization very difficult. The solution to such a difficult synchronization problem, in general, requires more control units, multiplexers and registers, which result in a higher design complexity. A reasonably large value of < 1 would be more attractive for synchronization. In this respect, the parameter can be seen as a measure of design difficulty, with a smaller value of this parameter representing a greater design complexity[9].
The parameter can be increased from its value of 1/4J1 in the onelevel to one stage pipeline structure by dividing the largesize stages into a number of smaller stages or merging the smallsize stages into larger ones. However, dividing a stage of the onelevel to onestage pipeline into multiple stages would require a division of
the task associated with the corresponding resolution
ee
ee
S ( j ) (m, n)
k 0
Hee i0
( LL) (k,i).S
( j1) (m k, n i)
level into subtasks, which in turn, would call for a solution of even a more complex problem of
L / 21 M / 21
synchronization of the subtasks associated with
eo
eo
k 0
Heo i0
( LL) (k, i).S
( j1) (m k, n i)
divided stages. On the other hand, merging multiple smallsize stages of the pipeline into one stage would
L / 21 M / 21
not create any additional synchronization problem. As a
oe
oe
oe
oe
k 0
H
i0
( LL) (k,i).S
( j1) (m k, n i)
matter of fact, such a merger could be used to reduce the overall number of filter units of the pipeline.
L / 21 M / 21
oo
oo
oo
oo
k 0
H
i0
( LL) (k,i).S
( j 1) (m k, n i)
(7)
At any resolution level, the separation of the subband processing corresponding to even and odd indexed data as given by (7) is consistent with the requirement of decimation of the data in each dimension by a factor of two in the DWT computation. It is also seen from (7) that the filtering operations in the four channels are independent and identical, which can be exploited in the design of an efficient pipeline architecture for the 2 D DWT computation.


Pipeline For The 2D DWT Computation
A straightforward mapping of the overall task of the DWT computation to a pipeline is onelevel to one stage mapping, in which the tasks of J resolution levels are distributed to J stages of the pipeline. In this mapping, the amount of hardware resources used by a stage should be onequarter of that used by the preceding stage. Thus, the ratio of the hardware resource used by the last stage to that used by the first stage has a value of 1/4J1. For images of typical size, this parameter would assume a very small value.
Fig.2 Pipeline structure with I stages for Jlevel computation
In view of the above discussion, the synchronization parameter can be increased by merging a number of stages at tail end of the pipeline. Fig. 2 shows the structure of a pipeline in which the stages I to J of the onelevel to onestage pipeline have been merged. In this structure, the tasks of the resolution level from j=1 to j=I – 1 are mapped to stage 1 to I – 1, respectively, whereas those of the resolution levels j=I, , J are mapped all together to the Ith stage. Note that the total amount of computations performed by stage I is less than onehalf of that performed by stage I – 1. Considering the fact that the number of filter units employed by each stage of the pipeline is an integer, it is reasonable to have the ratio of the numbers of filter units used by the last two stages (i.e., stages I – 1 and I)
to be 2:1. The value of the parameter is now increased from 1/4J1 to 1/4I1.5. However, now the resources employed by stage I would not be fully
utilized, which would lower the efficiency of the hardware utilization of the pipeline of Fig. 2
Assume that the parameter represents the hardware utilization efficiency defined as the ratio of the resources used to that employed by the pipeline [9]. The hardware utilization efficiency of the pipeline in Fig.2 can be shown to be equal to (1 – 4J )/(1 + 4I+0.5). Since for images of typical size, 4J is negligibly small compared to one, the expression for can be simplified as 1/(1 + 4I+0.5). As the number of stages I employed by the pipeline increases, the hardware utilization efficiency increases with the parameter approaching unity for a maximum efficiency. On the other hand, the difficulty in synchronizing the stages gets worse as the parameter decreases with increasing value of I. A variation in the value of I results in the values of and that are in conflict from the point of view of stage synchronization and hardware utilization efficiency. Therefore, a value of I needs to be determined that optimizes the values of and jointly.
Considering an example of an image of size 28Ã—28 , in which case J=8 . Table I gives the values of the parameters and for the pipeline structures with I=2,3 and 4.
data resulting from the even or odd numbered rows and even or odd numbered columns of an LÃ—M window of an LL subband data[9]. An LÃ—M window of the raw 2 D input data or that of an LLsubband data mus be decomposed into four distinct L/2Ã—M/2 subwindows in accordance with the four decomposed terms given by the right side of (7). This decomposition of the data in an LÃ—M window can be accomplished by designing for each stage an appropriate data scanning unit (DSU) based on the way the raw input or the LLsubband data is scanned. The stages would also require memory space (buffer) to store the raw input data or the LL subband data prior to scanning. Fig.3 gives the block diagram of the pipeline showing all the components required by the three stages. Note that the data flow shown in this figure comprises only the LL subband data necessary for the operations of the stages
Table 1
Parameter
I=2
I=3
I=4
1/2
1/8
1/32
89%
96%
99%
Parameter
I=2
I=3
I=4
1/2
1/8
1/32
89%
96%
99%
Values of the parameters and
It is seen from this table that the 2stage and 3stage pipelines have acceptable values of , whereas the synchronization of the 4stage pipeline would be very difficult because of its very low value of =1/32. On the other hand, the 3stage and 4stage pipelines have more desirable values of in comparison to that for the 2stage pipeline. Therefore, a 3stage pipeline with an acceptable value for the synchronization parameter and high hardware utilization efficiency would be the best choice of a pipeline

Design Of Stages
In the proposed threestage architecture, stages 1 and 2 perform the computations of levels 1 and 2, respectively, and stage 3 that of all the remaining levels. Since the basic operation of computing each output sample, regardless of the resolution level or the subband, is the same, the computation blocks in the three stages can differ only in the number of identical processing units employed by them depending on the amount of the computations assigned to the stages. As seen from (7), an (LÃ—M)tap filtering operation is decomposed into four independent (L/2Ã—M/2)tap filtering operations, each operating on the 2D L/2Ã—M/2
Fig.3 Block diagram of the threestage architecture

Performance Results
The Pipeline algorithm for decomposition of input data is implemented in MATLAB. Fig.5 shows input image and results of 1st level of decomposition and Fig.6 shows 2nd and Fig.7 shows 3rd level of decomposition.
Input Image 1st Level of Decomposition Fig.5 Results of MATLAB Implementation
2nd Level of Decomposition
Fig.6 Results of MATLAB Implementation
3rd Level of Decomposition
Fig.7 Results of MATLAB Implementation
Stage 1
Stage 1
Stage 2
Stage 2
Stage 3
Stage 3
Fig.8 Modelsim Simulation result of Pipeline Algorithm
Same pipeline algorithm is implemented in VHDL. For this purpose the filter coefficients are scaled and then they are used in the design. This digital design is simulated in Modelsim and its results are shown in Fig.8. For 100MHz of clock signal,three stages of pipeline, three levels of decomposition and image size of 16Ã—16, it requires 16395ns.

Conclusion
To enhance the interstage parallelism, it is most efficient to map the overall task of the DWT computation to only three pipeline stages for performing the computation tasks corresponding to the decomposition level 1, level 2, and all the remaining levels, respectively. Two parameters, one specifying the synchronization of the operations of the stages and
the other representing the utilization of the hardware resources of the pipeline, have been defined. It has been shown that the best combination for the value of these parameters is achieved when the pipeline is chosen to have three stages.

References

S. Mallat, A theory for multiresolution signal decomposition: The wavelet representation IEEE Trans. Pattern Anal.
Mach. Intell., vol.11, no. 7, pp. 674693, Jul.1989

H. Y. Liao, M. K. Mandal, and B. F. Cockburn, Efficient architectures for 1D and 2D liftingbased wavelet transforms, IEEE Trans. Signal Process., vol. 52, no. 5, pp 13151326, May 2004.

C. Cheng and K. K. Parhi, Highspeed VLSI implementation of 2D discrete wavelet transform, IEEE Trans. Signal Process., vol. 56, no.1, pp. 393403, Jan. 2008.

F. Marino, Efficient highspeed lowpower pipelined architecture for the direct 2 D discrete wavelet transform, IEEE Trans. Circuits Syst. II, Analog. Digit. Signal Process., vol. 47, no. 12, pp. 1476 1491, Dec 2000.

A. Benkrid, D. Crookes, and K. Benkrid, Design and implementation of a generic 2D orthogonal discrete wavelet transform on an FPGA, in Proc. IEEE 9th Symp. Field programming Custom Computing Machines (FCCM), Apr. 2001, pp. 190198.

P. McCanny, S. Masud, and J. McCanny, Design and implementation of the symmetrically extended 2D wavelet transform, in Proc. IEEE Int. Conf. Acoustic, Speech, Signal Process.(ICASSP), 2002, vol. 3, pp. 31083111.
S. Raghunath and S. M. Aziz, High speed area efficient multiresolution 2D 9/7 filter DWT processor, Proc. Int.
Conf. Very Large Scale Integration (IFIP), Oct. 2006, vol. 16 18. Pp, 210215.

I. S. Uzun and A. Amira, Rapid prototypingframework for FPGA based discrete biorthogonal wavelet transforms implementation, IEE Vision, Image Signal Process., vol. 153, no. 6, pp. 721734, Dec. 2006.

C. Zhang, C. Wang, and M. O. Ahmad, A Pipeline VLSI Architecture for Fast Computation of the 2D Discrete Wavelet Transform, IEEE Trans. On Circuits and SystemsI, vol. 59, No. 8, August 2012.

M. Alam,W. Badway, V. Dimitrov and G. Jullien, An Efficient Architecture for a Lifted 2D Biorthogonal DWT, Journal of VLSI Signal Processing 40, 335342, 2005.

R.C. Gonzalez,R. Woods, Digital Image Processing,PrenticeHall,3rd Edition, 2007.

Principles of Digital System Design using VHDL, Charles H. Roth, Jr. & Lizy Kurian John, 1998 Cengage Learning publication
.