 Open Access
 Total Downloads : 303
 Authors : Rajalekshmi R, Arya Lekshmi M
 Paper ID : IJERTV4IS070813
 Volume & Issue : Volume 04, Issue 07 (July 2015)
 DOI : http://dx.doi.org/10.17577/IJERTV4IS070813
 Published (First Online): 29072015
 ISSN (Online) : 22780181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
FPGA Implementation of Low Complexity Video Encoder using Optimized 3DDCT
Rajalekshmi R
Embedded Systems
Sree Buddha College of Engineering, Pattoor India
Arya Lekshmi M
Electronics and Communication Engineering Sree Buddha College of Engineering, Pattoor
India
Abstract Discrete Cosine Transform (DCT) is an essential tool of most of the image and video compression standards, because of its better energy compaction properties. As the demands for the twoway video transmission and video messaging over mobile communication systems increasing, the encoding complexity needs to be optimized. The three dimensional discrete cosine transform (3DDCT) and its inverse (3DIDCT) can be used as an alternative to motion compensated transform coding, because it extends the spatial compression property of 2DDCT to spatialtemporal compression of video data. In the proposed architecture, a low complexity video encoder using 3DDCT has been presented. This method converts video data into threedimensional video cube of 8Ã—8Ã—8 pixels and 3DDCT is then performed, followed by quantization, zigzag scanning and entropy encoding. The threedimensional DCT circuit can be realized using few additions and subtractions, thus increasing the area efficiency with low complexity. The proposed architecture is coded in Verilog HDL, synthesized in Xilinx ISE design suite 14.2 and physically realized as a digital prototype circuit using Xilinx Virtex5 FPGA.
Keywords Video Compression; ThreeDimensional Discrete Cosine Transform; Coefficient Quantization; MultiplicationFree Transform; FPGA implementation.

INTRODUCTION
Video compression/decompression are important process in many applications including internet TV, video conferencing etc., which use DCT based transform because of its outstanding energy compaction properties. The DCT transform eliminates duplication of data in frequency domain. Most of the video compression standards to date use 2DDCT based transform for reducing the spatial redundancies by eliminating intercorrelations between each pixels within the video frames and motion estimation/compensation for reducing the temporal redundancies between the video frames. However, these algorithms are very hard to implement in hardware and no symmetry exist between encoder and decoder circuit. Motion estimation includes complex calculations for the computation of motion vector and need more sophisticated algorithms. Hence the implementation becomes more complex
Alternative approach is to use a transformbased approach for the encoding of subsequent frames [2], which is the 3D DCT. The 3DDCT removes not only the spatial redundancy but also the temporal redundancy of the video frames. The video compression standards such as H.264/AVC [3], MPEG,
H.261 etc. uses hybrid coding (i.e., 2DDCT followed by prediction/motion compensation) which can be potentially replaced by the 3DDCT algorithm.
For the fast computation of DCT, numerous efficient algorithms have been proposed. The floating point DCT packs most of the energy to the low frequency region, which is well suited for the image processing applications [8]. But the floating point operations do not meet the real time constraints and also increases the circuit complexity. For the minimization of the number of floating point operations, approximate transforms are used [4], [6], [7], which in turn reduces the computational as well as circuit complexity. Todays most of the compression standards use approximate DCT transforms.
In this context, a large number of approximate DCT have been proposed. R. J Cintra et al. introduced a method for finding DCT by using 22 addition operations, based on the round off function [6]. In this method, the resulting 8point approximation matrix contains only 0s and Â±1s. R. K. Senapati et al. proposed a low complexity 8Ã—8 transform matrix for fast image compression [5]. This transform requires only 14 additions and two shift operations. T. D. Tran et al. introduced binDCT, which is an 8Ã—8 biorthogonal transform. It requires 31 additions and 14 shift operations [8]. The binDCT shows finer approximations to exact DCT. R. J. Cintra et al. proposed the architecture for 1DDCT, which requires only 14 additions [1]. This work has been extended to 2DDCT with the use of a transposition buffer. Several studies were performed for the simplification of 3DDCT, because of the increasing demands of three dimensional applications. The 3D DIF VR algorithm [9] introduced by S. Boussakta et al., for the computation of 3D DCTII. This algorithm reduces the number of multiplications and retains the same number of additions. 3DDIF VR algorithm requires 5568 additions and 1344 multiplication for processing one 8Ã—8Ã—8 video cube.
The objective of the proposed method is to introduce a new optimized 3DDCT approximation that possesses extremely low arithmetic complexity, which requires less number of addition operations, and to develop a video encoder. The paper is organized as follows. Section II explains the 3DDCT in detail. Section III describes the 1DDCT. Section III presents physical architecture of 3DDCT and its realization. Section IV explains the application of the proposed design in video encoder. A comparison of the video encoder with and without motion estimation along with the results is given in Section V.

3DDCT ALGORITHM
The human eyes are highly sensitive to low frequency components; because of the energy compaction property of the DCT, which concentrates most of the information in a few low frequency components. In addition, it is signal independent and can be computed efficiently by fast algorithms. For these reasons, the DCT is widely used in image and video compression. Since the common 3DDCT kernel is separable, the 3DDCT is usually obtained by applying the 1DDCT along each of the three dimensions. The NÃ—NÃ—N 3DDCT can be defined [2] as:
Where, R, C, F is the number of rows, columns, frames respectively in each cube, G(x,y,z) is the DCT domain data, g(r,c,f) is the time domain data, Wn is given by:
And
y z
Where r = 0, 1, 2,R. The same definition is applied on C 2c+1, C 2f+1
The reverse 3DDCT or 3DIDCT can be written as:
For the implementation of (1) needs a lot of hardware and takes a lot of time to complete the computation. If R=C=F=B, then the number of additions and the number of multiplications are B31, B3+B2+B respectively. The complexity per coefficient is O(B3). So the time and hardware needed for mathematical manipulation should be reduced. For that temporalrowcolumn decomposition algorithm has been chosen for the calculation of 3DDCT.
The DCT equation exhibits the separability property, so it can be decomposed into three orthogonal and symmetric functions. The equation 1 can be rewrite as
The equation (5) gives three identical onedimensional equations. So the 3DDCT can be calculated by first computing the 1DDCT of the rows followed by calculating 1DDCT of columns. The resultant DCT coefficients are again subjected to 1DDCT, forming fully transformed DCT cube. This method reduces the number of multiplications and number of additions to R+C+F, R+C+F3 respectively for the computation of one DCT coefficient.
Fig. 1: Computation flow of 3DDCT
The 3DDCT can be computed as follows: the DCT and its inverse can be calculated by means of pseudo 3DDCT, with the intention of reducing the computational complexity. The computational flow of 3DDCT is illustrated in Fig. 1. First, perform 2DDCT of the 8Ã—8 image. Pick the DC or AC coefficients of the block loated in the same position of consecutive channels of the 3D block and then, transform these values to the DCT domain. After performing the step 2, new DC value and AC values are obtained. The 3DDCT is computed by means of three 1DDCT structures. The 1D DCT transform and its architecture is as follows:

1DDCT TRANSFORM
Most of the 8point DCT approximation methods such as BouguezelAhmadSwamy Approximate DCT [7], CB2011 Approximation[6], Modified CB2011 approximation [4] etc. gives the transformation matrix format as, [D] * [LC]
Where D, diagonal matrix contains irrational numbers in the form 1/ k, where k is a small positive integer. The entries of the LC, the lowcomplexity matrix, comprise only powers of two in {0, Â±1/2, Â±1, Â±2}. So that null multiplicative complexity is attained.
For deriving lowcomplexity approximate DCT, a search over the 8Ã—8 matrix space is done to find the candidate matrix with low computation cost. The good candidate matrix is the one which does not require multiplication operations. Thus encounters the following optimization problem:
Mp= arg min_cost(M) (6)
Where Mp is the sought matrix and min_cost(M) returns minimum arithmetic complexity of matrix, M.
Exact 8point DCT matrix is given by,
Where, p = cos(2(p+1)/32), p = 0, 16
The number of retained coefficient is considered as the important parameter in the image compression. In several applications, a small number of coefficients are retained. In this transform it is 10. The 1DDCT approximation is defined as:
Where,
Dp= diagonal ([1/8,1/2,1/2/1/2,1/8,1/2,1/2,1/2])
Matrix Mp has entries in {0, Â±1} and the transform matrix can be written as product of three matrices.
Mp= P4. W12. W11. W1, where
And P4 is the permutation (1) (2 5 6 8 4 3 7). Where, I4 is the identity matrix of order 4.

DIGITAL ARCHITECTURES
The detailed architecture for 1D and 2D approximate 8 point DCT is shown in the Fig. 5 and Fig. 4 respectively and the architecture for 3D approximate DCT is being proposed, shown in the Fig. 2. The introduced architectures were then given to the Xilinx FPGA and the obtained results are included in the section VI.
A. Proposed Architecture for 3DDCT
The proposed architecture 3DDCT uses three parallel realizations of 1DDCT approximation blocks (Fig. 3), requires only 2688 additions. The input of 3DDCT is in a form of 8Ã—8Ã—8 video cube. The architecture is described as: first 1D approximate DCT block instantiation does the row wise transform computation of the input 2D image while the second instantiation provides the columnwise transform computation of the intermediate result. Hence 2DDCT is obtained. Likewise 2DDCT is computed for 8 blocks of size 8Ã—8, by instantiating 2D approximate blocks. After computing the 2DDCT coefficients, again 1Dapproximate DCT block is needed for the calculation of 3DDCT. The r, c, f represents row, column, and frame respectively.
The 1D approximate DCT architecture is detailed in Fig. 4. Detailed architecture of 1DDCT requires only 14 addition/subtraction operations, shown in Fig. 5. Without the use of multipliers, the delay and area can be reduced.
Fig. 2: Proposed 3DDCT architecture
Fig. 3: Serial architecture of the 3DDCT
Fig. 4: Architecture of the 2DDCT
A realtime row parallel transpose matrix buffer circuit is necessary between the 1D approximate DCT blocks, which ensures data ordering for converting the rowtransformed data from the first DCT approximation unit to a transposed format. Then these results are given to the second 1D approximation DCT unit. The architecture for the transpose matrix buffer is shown in the Fig. 6.
Fig. 5: Digital architecture for approximate 1DDCT
Fig. 6: Architecture of transpose matrix buffer

3DDCT BASED VIDEO ENCODER
As compared with the other compression standards such as MPEG, H.26Ã— etc., this algorithm exhibit different principle for compressing the temporal information (motion estimation/compensation). The working of the encoder is as follows. The forward DCT converts 2Dimage from spatial to frequency domain. The 3DDCT based video compression algorithm takes full motion video as input. The input is then divided into groups of 8 frames, which can be considered as a three dimensional image with two spatial dimensions (x and y) and a temporal dimension (z). The Fig. 7 illustrates the formation of 8Ã—8Ã—8 video cubes.
Each 8Ã—8Ã—8 cubes are independently encoded using 3D DCT, quantization, and entropy encoder. The block diagram of video encoder using 3DDCT is illustrated in Fig. 8. Each 8Ã—8Ã—8 cube of 512 pixels is transformed into frequency domain using the forward 3DDCT. The architecture for 3D DCT is explained above. After DCT transformation most of the energy is confined in few low frequency coefficients. Majority of the high frequency coefficients are zero or have negligible value, so only 64 pixel values are needed to be considered for the next section.
In the next block, all the 64 DCT coefficients are quantized using 64 element quantization table. This step is also used for compression, which introduces minimum error while increasing the number of zerovalue coefficients and discards visual information to which human eye is not sensitive. Quantization is performed according to the following equation:
Where, F(x,y) are the elements before quantization, Q(x,y) are the elements from the quantization table, and Qb are the quantized elements.
The result of the quantization step is the collection of small valued coefficients, in which a large number of values are zero. These results are then converted into a compact binary sequence using entropy encoder. The quantized DCT coefficients are reordered in a zigzag manner using zigzag scanning and reordered sequence will be in a form DC, AC1,
AC9 AC63. Then this 1Ddata vector is subjected to the entropy encoder for compression.
Fig. 7: Forming 8Ã—8Ã—8 video cube
Fig. 8: Video encoder using 3DDCT
The entropy encoder consists of runlengthencoding, and Huffman encoder. From the data vector, the run lengths of zero is computed and then to the Huffman encoder. The output of the entropy encoder gives a binary sequence which represents the compressed 8Ã—8Ã—8 block. The video decoder is designed in such a way that all the above said steps are inverted and implemented in reverse manner.

FPGA IMPLEMENTATION AND DISCUSSIONS The proposed architecture is coded in Verilog HDL,
simulated using Xilinx ISE Design suite 14.2 and synthesized in Xilinx Virtex5, XC5VLX110T2FF1136 device. The obtained results are shown below. Fig. 9 shows simulation result of video encoder with 3DDCT. The input is assumed as
an 8bit resolution. The variable out_sig gives the bit stream of the input video cube, when the variable out_rdy is set. The serial architecture of 3DDCT is employed here; so only three 1DDCTs are used, which requires 64 clock cycles to process one 8Ã—8Ã—8 video cube . This architecture uses 659 registers to store the transformed results and the throughput is fclk.
Fig. 9: Simulation result of video encoder with 3DDCT
After synthesizing the Verilog code, it is then routed to the FPGA board, using JTAG based hardware cosimulation. JTAG is a serial communication standard, which enables boundary scan technology providing access to many logic signals of complex ICs.
TABLE I
HARDWARE RESOURCE CONSUMPTION USING XILINX VIRTEX5 XC5VLX110T2FF1136 DEVICE
Video encoder with motion estimation
Video encoder with 3DDCT
LUT
1313[1%]
3460[5%]
Bonded IOBs
6
6
FF
780[21%]
241[2%]
Registers
3065[6%]
659[0%]
The FPGA implemntation can be evaluated by means of configurable LUTs, FF count, delay, maximum operating frequency, gives the hardware complexity as well as the real time performance. 15.190ns, 14.798 are the obtained delay of
the video encoder using motion estimation, 3DDCTbased video encoder respectively. The synthesis report of video encoder with 8Ã—8Ã—8 video cube as input and video encoder using 2DDCT and motion estimation is tabulated in table I. The table I show that the hardware resource utilization of video encoder using 3DDCT is lesser as compared to the other. Hence the area required is less. The proposed 3DDCT architecture uses only adders, the complexity is also less.

CONCLUSION
The video encoder using optimized 3DDCT is proposed and implemented in Xilinx Virtex5 XC5VLX110T2FF1136 device. The proposed 3DDCT saves 75% of additions and 100% of multiplications. The result shows a low complexity and area efficient architecture, which can be used in multimedia portable systems where low power and area minimization are prominent. Hence the architecture is well suited for real time low power applications.
REFERENCES

Uma Sadhvi Potluri, Arjuna Madanyake, Renato J. Cintra, FÃ¡bio M. Bayer, Sunera Kulasekara and Amila Edirisuriya, Improved 8point Approximate DCT for image and Video Compression Requiring Only 14 Additions, IEEE Transactions on circuits and systems, vol. 61,no. 6,June 2014

M. Bakr and A. E. Salama,Implementation of 3DDCT based video encoder/decoder system, Int. Sym. on signals, circuits, and systems, vol. 2, pp. 389392,2003

T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, Overview of the H.264/AVC video coding standard, IEEE Trans. Circuits Syst.
Video Technol., vol. 13, no. 7, pp. 560576, Jul. 2003

F. M. Bayer and R. J. Cintra, DCTLike transform for image compression requires only 14 additions, Electronics Letters, vol.48, no. 15, pp. 919921, 2012

R. K. Senapati, U. C. pati, K. K. Mahapatra,A low complexity orthogonal 8Ã—8 transform matrix for fast image compression, in proc.
Of the Annual IEEE India Conf. (INDICON), pp. 14, 2010

F. M. Bayer and R. J. Cintra, A DCT approximation for image compression, IEEE Signal Processing Letters, vol. 18, no. 10, pp. 579 583, October 2011

S. Bouguezel, M. O. Ahmed, and M. N. S. Swamy, Lowcomplexity 8Ã—8 transform for image compression, Electronics letters, vol. 44, no. 21, pp. 12491250, 2008

T. D. Tran, The binDCT: fast multiplierless approximation of the DCT, IEEE Signal Processing Letters, vol. 7, pp. 141144, June 2000

S. Boussakta and H. O. Alshibami,Fast algorithm for the 3D DCT II, IEEE Trans. on Signal Processing, vol. 52, pp. 9921001, April 2004