
DA-Based DCT with Error-Compensated Adder Tree



CHINABABU PANDURU,

II M.Tech, CREC, Tirupati, INDIA panduruchinababu@gmail.com

P. MAHESH KUMAR,

Assistant Professor, ECE, CREC, Tirupati, INDIA maheshpenubaku@gmail.com

Abstract- In this paper, an error-compensated adder-tree (ECAT) that performs shifting and addition in parallel is proposed to deal with truncation errors and to achieve a low-error, high-throughput discrete cosine transform (DCT) design. Instead of the 12 bits used in previous works, a 9-bit distributed arithmetic (DA) precision is chosen for this work while still meeting the peak signal-to-noise ratio (PSNR) requirements. Thus, an area-efficient DCT core is implemented that achieves a throughput rate of 1 Gpels/s with a gate count of 22.2 K for the PSNR requirements outlined in previous works.

Index Terms- Distributed arithmetic (DA), error-compensated adder-tree (ECAT), 2-D discrete cosine transform (DCT).

  1. INTRODUCTION

    Discrete cosine transform (DCT) is a widely used tool in image and video compression applications [1]. Recently, high-throughput DCT designs have been adopted to meet the requirements of real-time applications. The high-throughput shift-adder-tree (SAT) and adder-tree (AT), which unroll the shifting and addition of all words in parallel for DA-based computation, were introduced in [3] and [4], respectively. However, these designs incur a large truncation error. To reduce the effect of the truncation error, several error-compensation bias methods have been presented [5]-[7], based on statistical analysis of the relationship between the partial products and the multiplier-multiplicand. However, the elements of the truncation part in this work are independent, so the previously described compensation methods cannot be applied. This brief addresses a DA-based DCT core with an error-compensated adder-tree (ECAT). The proposed ECAT performs shifting and addition in parallel by unrolling all the words to be computed. Furthermore, the error-compensated circuit alleviates the truncation error for a high-accuracy design.

  2. MATHEMATICAL DERIVATION OF DISTRIBUTED ARITHMETIC

    The inner product is an important tool in digital signal processing applications. It can be written as follows:

    $$Y = A^{T} X = \sum_{i=1}^{L} A_i X_i \tag{1}$$

    where $A_i$, $X_i$, and $L$ are the ith fixed coefficient, the ith input data, and the number of inputs, respectively. Assume that the coefficient $A_i$ is a Q-bit two's complement binary fraction. The inner product computation in (1) can then be implemented with shifts and adders instead of multipliers. Therefore, a low hardware cost can be achieved by using a DA-based architecture.


    Fig. 1. Q P-bit words shifting and addition operations in parallel.
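The replacement of multipliers by shifts and adders can be illustrated with a minimal sketch. It assumes the usual coefficient-bit form of distributed arithmetic, in which each Q-bit two's complement coefficient is written as its sign bit (weight -1) plus fractional bits (weights 2^-j). The function names and the sample values are hypothetical and are not taken from the paper.

```python
from fractions import Fraction


def coefficient_bits(code, q):
    """Return bits a_0..a_{q-1} (sign bit first) of a q-bit two's complement code."""
    return [(code >> (q - 1 - j)) & 1 for j in range(q)]


def coefficient_value(code, q):
    """Value of the code as a fraction: -a_0 + sum over j >= 1 of a_j * 2^-j."""
    bits = coefficient_bits(code, q)
    return -bits[0] + sum(Fraction(b, 2 ** j) for j, b in enumerate(bits) if j > 0)


def da_inner_product(codes, xs, q):
    """Inner product computed with additions and shifts only (no multiplications)."""
    bit_rows = [coefficient_bits(c, q) for c in codes]
    total = Fraction(0)
    for j in range(q):
        # y_j adds the inputs whose coefficient has bit j set.
        y_j = sum(x for bits, x in zip(bit_rows, xs) if bits[j])
        if j == 0:
            y_j = -y_j  # the sign bit carries weight -1
        total += Fraction(y_j, 2 ** j)  # a right shift by j bits
    return total


def direct_inner_product(codes, xs, q):
    """Reference result using ordinary multiplications."""
    return sum(coefficient_value(c, q) * x for c, x in zip(codes, xs))


codes = [0b0101, 0b1110, 0b1001]  # 0.625, -0.25, -0.875 for q = 4 (illustrative)
xs = [3, -2, 5]                   # illustrative input data
print(da_inner_product(codes, xs, 4), direct_inner_product(codes, xs, 4))
```

Both calls give the same value, which shows that the multiplier-free form is an exact rearrangement of (1). Exact fractions are used here so that no truncation occurs.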

  3. ECAT ARCHITECTURE

    The shifting and addition computation can be written as follows:

    $$Y = \sum_{j=0}^{Q-1} y_j \cdot 2^{-j} \tag{2}$$

    In Fig. 1, the Q P-bit words are shifted and added in parallel by unrolling all computations. Furthermore, the operation in Fig. 1 can be divided into two parts: the main part (MP), which includes the P most significant bits (MSBs), and the truncation part (TP), which has the Q least significant bits (LSBs). Then, the shifting and addition output can be expressed as follows:

    $$Y = MP + TP \cdot 2^{-(P-2)} \tag{3}$$
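The split into a main part and a truncation part can be checked numerically. The following is a minimal sketch that assumes integer-valued words, so that both parts are expressed in units of the MP's least significant bit and the scale factor 2^-(P-2) of (3) is left out. The function name and the sample words are hypothetical.

```python
from fractions import Fraction


def split_shift_add(words):
    """Split sum_j words[j] * 2^-j into a kept (main) and a dropped (truncation) part."""
    # Exact value of the parallel shift-and-add, with no bits discarded.
    exact = sum(Fraction(y, 2 ** j) for j, y in enumerate(words))
    # Main part: bits that survive when word j is shifted right by j.
    main_part = sum(y >> j for j, y in enumerate(words))
    # Truncation part: the j low-order bits of word j that fall below the MP's LSB.
    truncation_part = sum(Fraction(y % (2 ** j), 2 ** j) for j, y in enumerate(words))
    return exact, main_part, truncation_part


exact, mp, tp = split_shift_add([5, 7, 6, 3])
print(exact, mp, tp)  # 83/8 9 11/8
```

In this example, dropping the truncation part would lose 11/8 of an LSB, which is the kind of error the compensation scheme aims to reduce.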

    The proposed ECAT is explained as follows.

    1. Proposed Error-Compensated Scheme

      Fig. 2. Proposed ECAT architecture of shifting and addition operators for the (P, Q) = (12, 6) example.

      From Fig. 1, (3) can be approximated as

      $$Y \approx MP + \sigma \cdot 2^{-(P-2)} \tag{4}$$

      where $\sigma$ is the compensated bias from the TP to the MP:

      $$\sigma = \mathrm{Round}(TP_{major} + TP_{minor}) \tag{5}$$

      where Round() rounds to the nearest integer. $TP_{major}$ carries more weight than $TP_{minor}$ in its contribution to $\sigma$. Therefore, the compensated bias can be calculated by obtaining $TP_{major}$ and estimating $TP_{minor}$.

    2. Performance Simulation for an Error-Compensated Circuit

      The absolute average error $\varepsilon$, the maximum absolute error $\varepsilon_{max}$, and the mean square error $\varepsilon_{mse}$ are defined as follows:

      $$\varepsilon = \mathrm{Avg}\{|TP - \sigma|\} \tag{6}$$

      $$\varepsilon_{max} = \max\{|TP - \sigma|\} \tag{7}$$

      $$\varepsilon_{mse} = \mathrm{Avg}\{(TP - \sigma)^2\} \tag{8}$$

      where Avg is the average operator.

      TABLE I
      COMPARISONS OF ABSOLUTE AVERAGE ERROR $\varepsilon$, MAXIMUM ABSOLUTE ERROR $\varepsilon_{max}$, AND MEAN SQUARE ERROR $\varepsilon_{mse}$

      | Error | Method | (P,Q) = (12,3), Case 1 | (12,6), Case 3 | (12,9), Case 2 | (12,12), Case 2 |
      |---|---|---|---|---|---|
      | $\varepsilon$ | Direct-T | 1.0625 | 2.5078 | 4.0010 | 5.5001 |
      | $\varepsilon$ | Proposed | 0.2656 | 0.3789 | 0.3804 | 0.4738 |
      | $\varepsilon$ | Post-T | 0.2500 | 0.2500 | 0.2500 | 0.2500 |
      | $\varepsilon_{max}$ | Direct-T | 2.1250 | 5.0156 | 8.0020 | 11.000 |
      | $\varepsilon_{max}$ | Proposed | 0.6250 | 1.5000 | 2.0020 | 3.0000 |
      | $\varepsilon_{max}$ | Post-T | 0.5000 | 0.5000 | 0.5000 | 0.5000 |
      | $\varepsilon_{mse}$ | Direct-T | 1.3516 | 6.7614 | 16.730 | 31.224 |
      | $\varepsilon_{mse}$ | Proposed | 0.1016 | 0.2184 | 0.2222 | 0.3472 |
      | $\varepsilon_{mse}$ | Post-T | 0.0859 | 0.0834 | 0.0833 | 0.0833 |

    3. Proposed ECAT Architecture

      The proposed ECAT has the highest accuracy with a moderate area-delay product. The shift-and-add [2] method has the smallest area, but its overall computation time is 10.8 (= 1.8 × 6) ns, which is the longest. The SAT [10], which truncates the TP and computes in parallel, takes 3.72 ns to complete the computation and uses 406 gates, which gives the best area-delay product. However, in terms of system accuracy, the SAT is the worst option shown in Table II. Therefore, the ECAT is suitable for high-speed and low-error applications.

      TABLE II
      COMPARISONS OF THE PROPOSED ECAT WITH OTHER ARCHITECTURES FOR A SIX 8-BIT WORDS EXAMPLE

      | | Shift-and-add | SAT | Proposed ECAT |
      |---|---|---|---|
      | Area (gates) | 236 | 406 | 463 |
      | Delay (ns) | 10.8 | 3.72 | 3.89 |
      | Area × delay | 100% | 59.3% | 70.7% |
      | $\varepsilon_{mse}$ | 0.326 | 6.761 | 0.218 |
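The error measures in (6)-(8) can be explored with a small exhaustive simulation. This is a sketch under stated assumptions, not the authors' simulation setup. It assumes that the truncated bits are independent and uniformly distributed, and that word j (j = 1..Q) loses its j low-order bits. Direct-T is read as dropping the TP with no bias and Post-T as rounding the exact TP, since the text does not define them. The compensated bias is modelled by taking TP_major as the most significant TP column and replacing TP_minor by its mean value, which is only one possible reading of (5). All function names are hypothetical.

```python
from itertools import product
from math import floor


def truncated_bit_weights(q):
    """Weights (in MP LSB units) of the bits dropped when word j is shifted by j, j = 1..q."""
    return [2.0 ** -k for j in range(1, q + 1) for k in range(1, j + 1)]


def round_half_up(x):
    return floor(x + 0.5)


def direct_t(bits, weights):
    """Assumed reading of Direct-T: the TP is discarded, so the bias is zero."""
    return 0


def post_t(bits, weights):
    """Assumed reading of Post-T: the exact TP is computed and then rounded."""
    return round_half_up(sum(b * w for b, w in zip(bits, weights)))


def bias_sketch(bits, weights):
    """Assumed reading of (5): exact top TP column plus the mean of the remaining bits."""
    tp_major = sum(b * w for b, w in zip(bits, weights) if w == 0.5)
    tp_minor_mean = sum(w / 2 for w in weights if w < 0.5)
    return round_half_up(tp_major + tp_minor_mean)


def error_metrics(q, bias):
    """Average absolute, maximum absolute, and mean square error over all bit patterns."""
    weights = truncated_bit_weights(q)
    errors = []
    for bits in product((0, 1), repeat=len(weights)):
        tp = sum(b * w for b, w in zip(bits, weights))
        errors.append(tp - bias(bits, weights))
    n = len(errors)
    return (sum(abs(e) for e in errors) / n,
            max(abs(e) for e in errors),
            sum(e * e for e in errors) / n)


for name, bias in [("Direct-T", direct_t), ("Sketch", bias_sketch), ("Post-T", post_t)]:
    print(name, error_metrics(3, bias))
```

For Q = 3 the enumeration gives (1.0625, 2.125, 1.3515625) for Direct-T, (0.265625, 0.625, 0.1015625) for the bias sketch, and (0.25, 0.5, 0.0859375) for Post-T. Rounded to four decimals, these agree with the (12,3) column of Table I. Whether the same reading of (5) carries over to the larger Q cases is not established here.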

  4. PROPOSED 8 × 8 2-D DCT CORE DESIGN

    The 1-D DCT employs the DA-based architecture and the proposed ECAT to achieve a high-speed, small-area, and low-error design. The 1-D 8-point DCT can be expressed as follows:

    $$Z_n = \frac{1}{2} k_n \sum_{m=0}^{7} x_m \cos\left(\frac{(2m+1)\, n \pi}{16}\right) \tag{9}$$

    where $x_m$ denotes the input data, $Z_n$ denotes the transform output, and $k_n = 1/\sqrt{2}$ for $n = 0$ and $k_n = 1$ otherwise. The proposed 2-D DCT is designed using two 1-D DCT cores and one transpose buffer. For accuracy, the DA-precision and the transpose buffer word length are chosen to be 9 bits and 12 bits, respectively, so that the system meets the PSNR requirements outlined in previous works. Moreover, the 2-D DCT core accepts 9-bit image input and produces 12-bit output.
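The row-column structure described above (a 1-D pass, a transpose, and a second 1-D pass) can be sketched in floating point. The sketch assumes the standard 8-point DCT scaling, with k_0 = 1/sqrt(2) and k_n = 1 otherwise, and a factor of pi in the cosine argument. It does not model the 9-bit DA-precision, the 12-bit transpose buffer, or the ECAT itself, and the function names are hypothetical.

```python
import math


def dct_1d(x):
    """8-point 1-D DCT: Z_n = (1/2) k_n sum_m x_m cos((2m+1) n pi / 16)."""
    out = []
    for n in range(8):
        k_n = 1 / math.sqrt(2) if n == 0 else 1.0
        acc = sum(x[m] * math.cos((2 * m + 1) * n * math.pi / 16) for m in range(8))
        out.append(0.5 * k_n * acc)
    return out


def transpose(block):
    """Software stand-in for the transpose buffer between the two 1-D passes."""
    return [list(row) for row in zip(*block)]


def dct_2d(block):
    """Row-column 2-D DCT: 1-D pass on rows, transpose, 1-D pass, transpose back."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d(c) for c in transpose(rows)]
    return transpose(cols)


flat = [[1.0] * 8 for _ in range(8)]  # constant test block
coeffs = dct_2d(flat)
print(round(coeffs[0][0], 6))
```

For a constant block, all of the energy lands in the DC coefficient and the remaining coefficients are zero up to floating-point rounding, which makes a convenient sanity check for any fixed-point version.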

  5. DISCUSSION AND COMPARISONS

    Table III compares the proposed 8 × 8 2-D DCT core with previous 2-D DCT cores. In [3] and [4], the SAT and AT architectures for DA-based DCTs improve the throughput rate of the NEDA method. However, the DA-precision must be chosen as 13 bits to meet the system accuracy, which increases the area overhead. The proposed DCT core uses the low-error ECAT to achieve a high-speed design, and the DA-precision can be chosen as 9 bits while still meeting the PSNR requirements, which reduces hardware cost. The proposed DCT core has the highest hardware efficiency, defined as follows (based on the accuracy required by the presented standards):

    $$\text{Hardware Efficiency}\ (10^{3}\ \text{pels/s per gate}) = \frac{\text{Throughput Rate}}{\text{Gate Counts}} \tag{10}$$
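As a worked check of (10), the figures quoted in this paper for the proposed core (1 Gpels/s and 22.2 K gates) can be substituted directly. The helper name below is hypothetical. The 77 M pels/s figure is the NEDA throughput listed in the comparison table.

```python
def hardware_efficiency(throughput_pels_per_s, gate_count):
    """Throughput per gate, in units of 10^3 pels/s per gate."""
    return throughput_pels_per_s / gate_count / 1e3


proposed = hardware_efficiency(1e9, 22.2e3)  # 1 Gpels/s, 22.2 K gates
speedup = 1e9 / 77e6                         # relative to the 77 M pels/s design
print(round(proposed, 2), round(speedup, 1))  # 45.05 13.0
```

The result is consistent with the hardware efficiency of 45 reported for the proposed core and with a throughput gain of about 13 times over the 77 M pels/s design.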

    TABLE III
    COMPARISONS OF DIFFERENT 2-D DCT ARCHITECTURES WITH THE PROPOSED ARCHITECTURE

    | | Shams et al. [2] | Huang et al. [4] | Proposed |
    |---|---|---|---|
    | Architecture | NEDA | DA-based | DA-based |
    | Technology | 0.18 µm | 0.18 µm | 0.18 µm |
    | Multipliers/ROMs | 0/0 | 0/0 | 0/0 |
    | Adders | 92 | 50 + 16 AT | 46 + 16 ECAT |
    | DA-precision | 12 bits | 13 bits | 9 bits |
    | Throughput rate (pels/s) | 77 M | 400 M | 1 G |
    | Hardware efficiency | 3.42 | 10.05 | 45 |

    77 MHz = 1 GHz/13

    Fig. 3. Architecture of the proposed 1-D 8-point DCT

    Fig. 4. Core layout and characteristics

  6. CONCLUSION

    In this brief, a high-speed and low-error 8 × 8 2-D DCT design with ECAT is proposed. By performing the shifting and addition in parallel, it improves the throughput rate by up to about 13 times at high compression rates. Furthermore, the proposed error-compensated circuit alleviates the truncation error in the ECAT. In this way, the DA-precision can be chosen as 9 bits instead of 12 bits while still meeting the PSNR requirements. Thus, the proposed DCT core has a higher hardware efficiency than those in previous works for the same PSNR requirements. Finally, an area-efficient 2-D DCT core is implemented in a TSMC 0.18-µm process, and the maximum throughput rate is 1 Gpels/s. In summary, the proposed architecture is suitable for high compression rate applications in VLSI designs.

ACKNOWLEDGEMENT

P. Chinababu would like to thank Mr. P. Mahesh Kumar, Assistant Professor in the ECE department, for his guidance throughout the project, for his technical ideas about the paper, and for his motivation to complete the work effectively and successfully.

REFERENCES

  1. Y. Wang, J. Ostermann, and Y. Zhang, Video Processing and Communications, 1st ed. Englewood Cliffs, NJ: Prentice-Hall, 2002.

  2. A. M. Shams, A. Chidanandan, W. Pan, and M. A. Bayoumi, "NEDA: A low-power high-performance DCT architecture," IEEE Trans. Signal Process., vol. 54, no. 3, pp. 955-964, Mar. 2006.

  3. C. Peng, X. Cao, D. Yu, and X. Zhang, "A 250 MHz optimized distributed architecture of 2D 8×8 DCT," in Proc. Int. Conf. ASIC, 2007, pp. 189-192.

  4. C. Y. Huang, L. F. Chen, and Y. K. Lai, "A high-speed 2-D transform architecture with unique kernel for multi-standard video applications," in Proc. IEEE Int. Symp. Circuits Syst., 2008, pp. 21-24.

  5. S. S. Kidambi, F. El-Guibaly, and A. Antoniou, "Area-efficient multipliers for digital signal processing applications," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 43, no. 2, pp. 90-95, Feb. 1996.

  6. K. J. Cho, K. C. Lee, J. G. Chung, and K. K. Parhi, "Design of low-error fixed-width modified Booth multiplier," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 12, no. 5, pp. 522-531, May 2004.

  7. L. D. Van and C. C. Yang, "Generalized low-error area-efficient fixed-width multipliers," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 52, no. 8, pp. 1608-1619, Aug. 2005.
