 Open Access
 Authors : Kasa Srinivasulu, K.Satya Sujith, C.Laxmana Sudheer, Y.Chamundeswari
 Paper ID : IJERTV2IS90630
 Volume & Issue : Volume 02, Issue 09 (September 2013)
 Published (First Online): 19-09-2013
 ISSN (Online) : 2278-0181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Multiplier Less ROM Free DA based DCT
Kasa Srinivasulu (1), K. Satya Sujith, I.S.T.E. (2), C. Laxmana Sudheer, M.Tech (3), Y. Chamundeswari, M.Tech (4)
(1) Associate Professor; (2), (3), (4) Assistant Professor
(1) Brahmaiah College of Engg.; (2) Audisankara Institute of Technology
Abstract: The discrete cosine transform (DCT) is a widely used tool in image and video compression applications. Recently, high-throughput DCT designs have been adopted to meet the requirements of real-time applications. This paper proposes a distributed arithmetic (DA) based DCT design with an optimized adder tree (OAT). The OAT performs shifting and addition in parallel by unrolling all the words to be computed, which deals with the truncation errors and yields a low-error, high-speed DCT design. Furthermore, an error-compensated circuit alleviates the truncation error for high-accuracy design. Based on the low-error OAT, the DA precision in this work is chosen to be 9 bits instead of the 12 bits used in previous works. Therefore, the hardware size and cost are reduced, and the speed is improved.
Keywords: Adders, discrete cosine transform (DCT), distributed arithmetic (DA), optimized adder tree (OAT).
I. INTRODUCTION
Today, digital networks and the digital representation of images, movies, video, TV, voice, and digital libraries are commonplace, because the digital representation of a signal is more robust than its analog counterpart for processing, manipulation, storage, recovery, and transmission over long distances, even across the globe through communication networks. In recent years, there have been significant advancements in the processing of still image, video, graphics, speech, and audio signals through digital computers in order to accomplish different application challenges. As a result, multimedia information comprising image, video, audio, speech, text, and other data types has the potential to become just another data type. Development of efficient image compression techniques continues to be an important challenge, both in academia and in industry [1].
Multiplier-based DCTs were implemented in [2]; later, ROM-based DA was applied to DCT design to reduce area [3]. DA-based multipliers use ROMs to produce partial products, together with adders that accumulate these partial products, so applying ROM-based DA to the DCT core reduces the required area. In addition, the symmetrical properties of the DCT and a parallel DA architecture can be used to reduce the ROM size [4]. Recently, ROM-free DA architectures were presented [6]-[11]. Shams et al. employed a bit-level sharing scheme to construct an adder-based butterfly matrix called new DA (NEDA) [7]. After compression, the butterfly-adder matrix in [7] used 35 adders and 8 shift-addition elements to replace the ROM. Based on the NEDA architecture, the recursive form and an ALU were applied in DCT design to reduce area cost [8], [9], but speed limitations exist in the serial shifting and addition after the DA computation. In [10] and [11], the partial-product words of the DA-based computation are shifted and added in parallel. However, a large truncation error occurs.
A truncation error is introduced if the least significant part is directly truncated, and this error needs to be reduced. To reduce its effect, several error-compensation bias methods have been presented, based on statistical analysis of the relationship between the partial products and the multiplier-multiplicand. A smaller truncation error in turn allows lower hardware complexity. In general, the truncation part (TP) is truncated to reduce hardware cost in parallel shifting and addition operations, which is known as the direct truncation (Direct-T) method. A large truncation error then occurs because the carry propagation from the TP to the main part (MP) is neglected.
Distributed arithmetic is a bit-level rearrangement of a multiply-accumulate operation that hides the multiplications. It is a powerful technique for reducing the size of a parallel hardware multiply-accumulate unit and is well suited to FPGA designs. The DCT is widely used in digital image processing for image compression, especially in image transform coding. Many DCT algorithms exist; however, although most of them are good software solutions, only a few are really suitable for VLSI implementation. Cyclic convolution plays an important role in digital signal processing because it is easy to implement. Specifically, there exist a number of well-developed convolution algorithms, and they can be easily realized through modular and structured hardware such as distributed arithmetic and systolic arrays.
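The effect of direct truncation described above can be illustrated with a small sketch. It is not the circuit of this paper: it assumes unsigned words for simplicity (the hardware uses two's complement), takes word j to carry weight 2^-j, and uses the hypothetical function names `exact_then_truncate` and `direct_truncation`. The word sizes follow the (P, Q) = (12, 6) example used later in the text.

```python
def exact_then_truncate(words):
    """Add all shifted words at full precision, then keep only the main part (MP)."""
    q = len(words)
    # Word j has weight 2^-j; scaling by 2^(q-1) turns every term into an integer.
    total = sum(w << (q - 1 - j) for j, w in enumerate(words))
    return total >> (q - 1)


def direct_truncation(words):
    """Drop the truncation part (TP) of every shifted word before adding (Direct-T)."""
    return sum(w >> j for j, w in enumerate(words))


if __name__ == "__main__":
    P, Q = 12, 6
    words = [(1 << P) - 1] * Q  # every bit set in every word
    exact = exact_then_truncate(words)
    direct = direct_truncation(words)
    print("full-precision MP:", exact)
    print("direct truncation:", direct)
    print("carries lost from TP to MP:", exact - direct)
```

The difference between the two results is exactly the carry that the discarded TP bits would have propagated into the MP, which is the quantity an error-compensation scheme tries to recover.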

II. MATHEMATICAL DERIVATION OF DISTRIBUTED ARITHMETIC
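The inner product is an important tool in digital signal processing applications. It can be written as follows:

$$Y = A^{T} X = \sum_{i=1}^{L} A_i X_i \tag{1}$$

where $A_i$, $X_i$, and $L$ are the $i$th fixed coefficient, the $i$th input data, and the number of inputs, respectively. Assume that coefficient $A_i$ is a $Q$-bit two's complement binary fraction. Equation (1) can then be expressed as follows:

$$Y = \begin{bmatrix} 2^{0} & 2^{-1} & 2^{-2} & \cdots & 2^{-(Q-1)} \end{bmatrix}
\begin{bmatrix}
A_{1,0} & A_{2,0} & \cdots & A_{L,0} \\
A_{1,1} & A_{2,1} & \cdots & A_{L,1} \\
\vdots & \vdots & \ddots & \vdots \\
A_{1,(Q-1)} & A_{2,(Q-1)} & \cdots & A_{L,(Q-1)}
\end{bmatrix}
\begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_L \end{bmatrix}
= \begin{bmatrix} 2^{0} & 2^{-1} & 2^{-2} & \cdots & 2^{-(Q-1)} \end{bmatrix}
\begin{bmatrix} y_0 \\ y_1 \\ \vdots \\ y_{(Q-1)} \end{bmatrix} \tag{2}$$

where $y_j = \sum_{i=1}^{L} A_{i,j} X_i$, with $A_{i,j} \in \{0, 1\}$ for $j = 1, \ldots, (Q-1)$ and $A_{i,0} \in \{-1, 0\}$. Note that $y_0$ may be 0 or a negative number due to the two's complement representation. In (2), each $y_j$ can be calculated by adding all $X_i$ values for which $A_{i,j}$ is nonzero, and the transform output $Y$ can then be obtained by shifting and adding all nonzero $y_j$ values. Thus the inner product computation in (1) can be implemented by using shifters and adders instead of multipliers. Therefore, low hardware cost can be achieved by using a DA-based architecture.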
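The shift-and-add evaluation of (1) via (2) can be sketched in a few lines. This is a software illustration under stated assumptions, not the hardware of this paper: the helper names `quantize`, `da_bits`, and `da_inner_product` are hypothetical, the inputs are assumed to be integers, and coefficients are assumed to lie in [-1, 1). The products by digits in {-1, 0, 1} stand in for the select-and-add (or subtract) that the hardware performs.

```python
import math


def quantize(coeff, q):
    """Round a coefficient in [-1, 1) to a Q-bit two's complement integer scaled by 2^(Q-1)."""
    scale = 1 << (q - 1)
    return max(-scale, min(scale - 1, round(coeff * scale)))


def da_bits(coeff, q):
    """Digits A_0..A_(Q-1) with weights 2^0..2^-(Q-1); the sign digit A_0 is in {-1, 0}."""
    pattern = quantize(coeff, q) & ((1 << q) - 1)
    bits = [(pattern >> (q - 1 - j)) & 1 for j in range(q)]
    bits[0] = -bits[0]
    return bits


def da_inner_product(coeffs, xs, q):
    """Evaluate sum(A_i * X_i) by forming the y_j words and then shifting and adding them."""
    bit_rows = [da_bits(a, q) for a in coeffs]
    # y_j collects the inputs whose coefficient has a nonzero digit at weight 2^-j.
    y = [sum(row[j] * x for row, x in zip(bit_rows, xs)) for j in range(q)]
    # Shift-and-add: y_j has weight 2^-j, scaled here by 2^(q-1) to stay in integers.
    acc = sum(y_j << (q - 1 - j) for j, y_j in enumerate(y))
    return acc / (1 << (q - 1))


if __name__ == "__main__":
    coeffs = [math.cos(math.pi / 4), -math.cos(math.pi / 8), 0.25]
    xs = [100, -37, 12]
    print("DA result (Q = 9):", da_inner_product(coeffs, xs, 9))
    print("floating-point   :", sum(a * x for a, x in zip(coeffs, xs)))
```

The DA result matches the inner product taken with the quantized coefficients exactly; the remaining gap to the floating-point value is the coefficient quantization error for the chosen Q.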
III. PROPOSED OPTIMIZED ADDER TREE ARCHITECTURE
In general, the shifting and addition computation uses a shift-and-add operator in VLSI implementation in order to reduce hardware cost. However, when the number of words to be shifted and added increases, the computation time also increases. Therefore, the shift-adder tree (SAT) operates shifting and addition in parallel by unrolling all the words to be computed for high-speed applications. However, a large truncation error occurs in the SAT, and an optimized adder tree architecture is proposed in this paper to compensate for the truncation error in high-speed applications.
Fig. 1. Shifting and addition of Q P-bit words in parallel.
In Fig. 1, the Q P-bit words are shifted and added in parallel by unrolling all computations. Furthermore, the operation in Fig. 1 can be divided into two parts: the main part (MP), which holds the most significant bits (MSBs), and the truncation part (TP), which holds the least significant bits (LSBs). A large truncation error occurs because the carry propagation from the TP to the MP is neglected.
The proposed optimized adder tree architecture is illustrated in Fig. 2 for (P, Q) = (12, 6), where block FA indicates a full-adder cell with three inputs (a, b, and c) and two outputs, a sum (s) and a carry-out (co). Block HA indicates a half-adder cell with two inputs (a and b) and two outputs, a sum (s) and a carry-out (co).
Fig. 2. Proposed optimized adder tree architecture.

IV. PROPOSED 8×8 2-D DCT DESIGN
The 1-D DCT employs the DA-based architecture and the proposed optimized adder tree to achieve a high-speed, small-area, and low-error design. The 1-D 8-point DCT can be expressed as follows:

$$Z_n = \frac{1}{2} K_n \sum_{m=0}^{7} x_m \cos\!\left(\frac{(2m+1)\,n\,\pi}{16}\right)$$

where $x_m$ denotes the input data, $Z_n$ denotes the transform output, and $K_n = 1/\sqrt{2}$ for $n = 0$ and $K_n = 1$ otherwise.
By neglecting the scaling factor 1/2, the 1-D 8-point DCT in the above equation can be divided into even and odd parts, $Z_e$ and $Z_o$, as listed in the equations below, respectively:

$$Z_e = \begin{bmatrix} Z_0 \\ Z_2 \\ Z_4 \\ Z_6 \end{bmatrix} =
\begin{bmatrix}
C_4 & C_4 & C_4 & C_4 \\
C_2 & C_6 & -C_6 & -C_2 \\
C_4 & -C_4 & -C_4 & C_4 \\
C_6 & -C_2 & C_2 & -C_6
\end{bmatrix}
\begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{bmatrix}$$

$$Z_o = \begin{bmatrix} Z_1 \\ Z_3 \\ Z_5 \\ Z_7 \end{bmatrix} =
\begin{bmatrix}
C_1 & C_3 & C_5 & C_7 \\
C_3 & -C_7 & -C_1 & -C_5 \\
C_5 & -C_1 & C_7 & C_3 \\
C_7 & -C_5 & C_3 & -C_1
\end{bmatrix}
\begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{bmatrix}$$

where $C_i = \cos(i\pi/16)$, $a_m = x_m + x_{7-m}$, and $b_m = x_m - x_{7-m}$ for $m = 0, \ldots, 3$.
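As a quick numerical check, the even/odd decomposition can be compared against the direct 8-point sum. The sketch below assumes the standard definitions a_m = x_m + x_(7-m) and b_m = x_m - x_(7-m), omits the 1/2 scaling factor as in the text, and uses hypothetical helper names (`dct8_direct`, `dct8_even_odd`); it is a floating-point model, not the DA hardware.

```python
import math


def c(i):
    """C_i = cos(i*pi/16)."""
    return math.cos(i * math.pi / 16)


# Coefficient matrices of the even part (Z0, Z2, Z4, Z6) and odd part (Z1, Z3, Z5, Z7).
EVEN = [[c(4), c(4), c(4), c(4)],
        [c(2), c(6), -c(6), -c(2)],
        [c(4), -c(4), -c(4), c(4)],
        [c(6), -c(2), c(2), -c(6)]]
ODD = [[c(1), c(3), c(5), c(7)],
       [c(3), -c(7), -c(1), -c(5)],
       [c(5), -c(1), c(7), c(3)],
       [c(7), -c(5), c(3), -c(1)]]


def dct8_direct(x):
    """Direct evaluation of the 8-point DCT sum, without the 1/2 scaling factor."""
    out = []
    for n in range(8):
        k_n = 1 / math.sqrt(2) if n == 0 else 1.0
        out.append(k_n * sum(x[m] * math.cos((2 * m + 1) * n * math.pi / 16)
                             for m in range(8)))
    return out


def matvec(mat, vec):
    return [sum(a * v for a, v in zip(row, vec)) for row in mat]


def dct8_even_odd(x):
    """Butterfly stage (sums and differences) followed by two 4x4 products."""
    a = [x[m] + x[7 - m] for m in range(4)]
    b = [x[m] - x[7 - m] for m in range(4)]
    z = [0.0] * 8
    z[0::2] = matvec(EVEN, a)  # Z0, Z2, Z4, Z6
    z[1::2] = matvec(ODD, b)   # Z1, Z3, Z5, Z7
    return z


if __name__ == "__main__":
    x = [52, 55, 61, 66, 70, 61, 64, 73]
    for n, (d, e) in enumerate(zip(dct8_direct(x), dct8_even_odd(x))):
        print(f"Z{n}: direct = {d:.6f}, even/odd = {e:.6f}")
```

The decomposition halves the size of each matrix product, which is why the butterfly stage of sums and differences comes first in the architecture.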
Table 1 shows the bit-level formulation of the transform outputs. Consider the evaluation of Z4.

Weight | Z1 value | Z4 value
2^0    | 0        | A1
2^-1   | B0+B1+B2 | A0
2^-2   | B0+B1    | A1
2^-3   | B0+B3    | A0
2^-4   | B0+B1+B3 | A0
2^-5   | B0+B2    | A1

TABLE 1: Bit-level formulation
where
A0 = (X0 + X7) + (X4 + X3) = a0 + a3
A1 = (X1 + X6) + (X2 + X5) = a1 + a2
B0 = (X0 + X7) - (X4 + X3) = a0 - a3
B1 = (X1 + X6) - (X2 + X5) = a1 - a2
With input data A0 and A1, the transform output Z0 needs only one adder to compute (A0 + A1), and two separate optimized adder trees produce the results of Z0 and Z4. Similarly, the other transform outputs can be implemented in DA-based forms using 10 (= 1 + 9) adders and the corresponding optimized adder trees. Consequently, the proposed 1-D 8-point DCT architecture can be constructed as illustrated in Fig. 3 using a DA-Butterfly-Matrix, which includes two DA even processing elements (DAEs), a DA odd processing element (DAO), and 12 adders/subtractors, together with 8 optimized adder trees (one for each transform output Zn). The eight separate optimized adder trees work simultaneously, enabling high-speed operation. After the data output from the DA-Butterfly-Matrix is complete, the transform output Z is completed within one clock cycle by the proposed optimized adder trees. In contrast, the traditional shift-and-add architecture requires Q clock cycles to complete the transform output Z if the DA precision is Q bits (here, Q = 9).
Fig. 3. Architecture of the proposed 1-D 8-point DCT.
With high-speed considerations in mind, the proposed 2-D DCT is designed using two 1-D DCT cores and one transpose buffer. For accuracy, the DA precision and the transpose buffer word length are chosen to be 9 bits and 12 bits, respectively, so that the system can meet the PSNR requirements outlined in previous works. Moreover, the 2-D DCT core accepts 9-bit image input and produces output with 12-bit precision.

V. RESULTS AND DISCUSSIONS
The test image Lena used to check system accuracy comprises 256×256 pixels, with each pixel represented by 8-bit, 256-gray-level data. After the original test image pixels are input to the proposed 2-D DCT core, the transform output data is captured and fed into MATLAB to compute the inverse DCT using 64-bit double-precision operations. The proposed DCT core has the highest hardware efficiency, defined as follows (based on the accuracy required by the presented standards):

$$\text{hardware efficiency} = \frac{\text{throughput rate}}{\text{gate count}}$$

Furthermore, the proposed 2-D DCT core was synthesized using Xilinx ISE 10.1 and simulated using ModelSim 6.4d. On the Xilinx XC2VP30 FPGA it achieves a throughput rate of 1067 megapixels per second (Mpels/s), which is 7 times that of the previous work [16].
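The accuracy check described above (a forward 2-D DCT built from two 1-D passes and a transpose, followed by a double-precision inverse) can be mimicked in software. The sketch below is only a rough model under stated assumptions: it rounds the output of each 1-D pass to an integer to imitate finite word lengths, it does not model the 9-bit DA coefficient precision or the OAT, it uses a random 8×8 block instead of the Lena image, and it reports PSNR as one common accuracy measure. All function names are hypothetical.

```python
import math
import random


def k(n):
    return 1 / math.sqrt(2) if n == 0 else 1.0


def dct8(x):
    """1-D 8-point DCT including the 1/2 scaling factor (orthonormal form)."""
    return [0.5 * k(n) * sum(x[m] * math.cos((2 * m + 1) * n * math.pi / 16)
                             for m in range(8)) for n in range(8)]


def idct8(z):
    """Inverse of dct8, evaluated in double precision."""
    return [0.5 * sum(k(n) * z[n] * math.cos((2 * m + 1) * n * math.pi / 16)
                      for n in range(8)) for m in range(8)]


def transpose(block):
    return [list(col) for col in zip(*block)]


def apply_rows(block, transform, finish=lambda v: v):
    return [[finish(v) for v in transform(row)] for row in block]


def dct2d(block, finish=round):
    """Row pass, transpose, column pass, transpose back; `finish` models word-length rounding."""
    stage1 = apply_rows(block, dct8, finish)              # first 1-D core
    stage2 = apply_rows(transpose(stage1), dct8, finish)  # transpose buffer, second 1-D core
    return transpose(stage2)


def idct2d(coeffs):
    stage1 = apply_rows(coeffs, idct8)
    return transpose(apply_rows(transpose(stage1), idct8))


def psnr(original, restored, peak=255.0):
    errs = [(a - b) ** 2 for ra, rb in zip(original, restored) for a, b in zip(ra, rb)]
    mse = sum(errs) / len(errs)
    return math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse)


if __name__ == "__main__":
    random.seed(0)
    block = [[random.randrange(256) for _ in range(8)] for _ in range(8)]
    restored = idct2d(dct2d(block))
    print(f"PSNR after integer rounding of both passes: {psnr(block, restored):.1f} dB")
```

Replacing `finish` with a coarser quantizer gives a feel for how the chosen word lengths trade accuracy against hardware cost.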
Fig. 4. Simulation result of the DA-DCT.

VI. CONCLUSION

This paper contributes specific simplifications in the multiplier stage by using the shift-and-add method, which leads to simpler hardware and a speed-up of the overall architecture. The proposed 8×8 2-D DCT core has a latency of 10 clock cycles and operates at 125 MHz. As a result of the 8 parallel outputs, the proposed 2-D DCT core can achieve a throughput rate of 1 Gpixels per second (8 × 125 MHz), meeting the 1080p (1920 × 1080 × 60 pixels/s) high-definition television (HDTV) specification for 200 MHz based low-power operation. Since the maximum throughput rate is 1 Gpels/s, the proposed architecture is suitable for high-compression-rate applications in VLSI designs.
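The throughput figures quoted above follow from simple arithmetic, reproduced here as a short check. The constant names are illustrative only; the values are those stated in the text.

```python
CLOCK_HZ = 125_000_000           # operating frequency of the 2-D DCT core
PARALLEL_OUTPUTS = 8             # transform outputs produced per clock cycle
HDTV_1080P = 1920 * 1080 * 60    # pixels per second required for 1080p at 60 frames/s

throughput = PARALLEL_OUTPUTS * CLOCK_HZ  # pixels per second

if __name__ == "__main__":
    print(f"core throughput  : {throughput / 1e9:.3f} Gpixels/s")
    print(f"1080p requirement: {HDTV_1080P / 1e6:.3f} Mpixels/s")
    print(f"headroom         : {throughput / HDTV_1080P:.2f}x")
```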
REFERENCES

[1] Y. Wang, J. Ostermann, and Y. Zhang, Video Processing and Communications, 1st ed. Englewood Cliffs, NJ: Prentice-Hall, 2002.
[2] Y. Chang and C. Wang, "New systolic array implementation of the 2-D discrete cosine transform and its inverse," IEEE Trans. Circuits Syst. Video Technol., vol. 5, no. 2, pp. 150-157, Apr. 1995.
[3] C. T. Lin, Y. C. Yu, and L. D. Van, "Cost-effective triple-mode reconfigurable pipeline FFT/IFFT/2-D DCT processor," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 8, pp. 1058-1071, Aug. 2008.
[4] S. Uramoto, Y. Inoue, A. Takabatake, J. Takeda, Y. Yamashita, H. Terane, and M. Yoshimoto, "A 100-MHz 2-D discrete cosine transform core processor," IEEE J. Solid-State Circuits, vol. 27, no. 4, pp. 492-499, Apr. 1992.
[5] S. Yu and E. E. Swartzlander, Jr., "DCT implementation with distributed arithmetic," IEEE Trans. Comput., vol. 50, no. 9, pp. 985-991, Sep. 2001.
[6] P. K. Meher, "Unified systolic-like architecture for DCT and DST using distributed arithmetic," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 53, no. 12, pp. 2656-2663, Dec. 2006.
[7] A. M. Shams, A. Chidanandan, W. Pan, and M. A. Bayoumi, "NEDA: A low-power high-performance DCT architecture," IEEE Trans. Signal Process., vol. 54, no. 3, pp. 955-964, Mar. 2006.
[8] M. R. M. Rizk and M. Ammar, "Low power small area high performance 2D-DCT architecture," in Proc. Int. Design Test Workshop, 2007, pp. 120-125.
[9] Y. Chen, X. Cao, Q. Xie, and C. Peng, "An area efficient high performance DCT distributed architecture for video compression," in Proc. Int. Conf. Adv. Comm. Technol., 2007, pp. 238-241.
[10] C. Peng, X. Cao, D. Yu, and X. Zhang, "A 250 MHz optimized distributed architecture of 2D 8×8 DCT," in Proc. Int. Conf. ASIC, 2007, pp. 189-192.
[11] C. Y. Huang, L. F. Chen, and Y. K. Lai, "A high-speed 2-D transform architecture with unique kernel for multi-standard video applications," in Proc. IEEE Int. Symp. Circuits Syst., 2008, pp. 21-24.
[12] S. S. Kidambi, F. El-Guibaly, and A. Antoniou, "Area-efficient multipliers for digital signal processing applications," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 43, no. 2, pp. 90-95, Feb. 1996.
[13] K. J. Cho, K. C. Lee, J. G. Chung, and K. K. Parhi, "Design of low-error fixed-width modified Booth multiplier," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 12, no. 5, pp. 522-531, May 2004.
[14] L. D. Van and C. C. Yang, "Generalized low-error area-efficient fixed-width multipliers," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 52, no. 8, pp. 1608-1619, Aug. 2005.
[15] C. C. Sun, P. Donner, and J. Gotze, "Low-complexity multi-purpose IP core for quantized discrete cosine and integer transform," in Proc. IEEE Int. Symp. Circuits Syst., 2009, pp. 3014-3017.
[16] A. Tumeo, M. Monchiero, G. Palermo, F. Ferrandi, and D. Sciuto, "A pipelined fast 2D-DCT accelerator for FPGA-based SoCs," in Proc. IEEE Comput. Soc. Annu. Symp. VLSI, 2007, pp. 331-336.
[17] S. Ghosh, S. Venigalla, and M. Bayoumi, "Design and implementation of a 2D-DCT architecture using coefficient distributed arithmetic," in Proc. IEEE Comput. Soc. Annu. Symp. VLSI, 2005, pp. 162-166.
[18] Yuan-Ho Chen, Tsin-Yuan Chang, and Chung-Yi Li, "High throughput DA-based DCT with high accuracy error-compensated adder tree," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 4, Apr. 2011.