An Efficient Reduction of Area in Multistandard Transform Core

DOI : 10.17577/IJERTV3IS041193


  1. Shanmuga Priya1, Dr. T. K. Shanthi2

    1PG scholar, Applied Electronics, Department of ECE,

    2Associate Professor, Department of ECE

    Thanthai Periyar Government Institute of Technology, Vellore, Tamilnadu, India.

    Abstract: This paper describes a reduction of area in a multistandard transform (MST) core. Replacing the D flip-flops in the pipeline registers of the architecture with buffer circuits reduces the gate count, and hence the area, of the hardware design. This lowers the hardware cost of the core and improves its hardware efficiency; the improved efficiency is 45%.


      This part of the paper gives an overview of the transform techniques used in various video applications.

      A. Distributed Arithmetic:

      Distributed arithmetic (DA) is an efficient, bit-serial technique for computing a sum of products (vector dot product). Its main advantage lies in data path circuit design: it uses look-up tables and accumulators instead of multipliers to compute the inner product, and it has been used in many DSP applications such as the DCT, the DFT and digital filters.
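The bit-serial idea can be illustrated with a minimal Python sketch. It assumes integer coefficients and two's complement inputs of a fixed word length; the function names `build_lut` and `da_inner_product` are hypothetical and are not taken from the paper.

```python
def build_lut(coeffs):
    """Precompute, for every N-bit address, the sum of the selected coefficients."""
    n = len(coeffs)
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << n)]


def da_inner_product(coeffs, xs, bits=8):
    """Multiplier-free inner product: one LUT read and one shift-add per bit."""
    lut = build_lut(coeffs)
    mask = (1 << bits) - 1
    words = [x & mask for x in xs]            # two's complement encoding
    acc = 0
    for b in range(bits):                     # process one bit plane per step
        addr = 0
        for k, w in enumerate(words):
            addr |= ((w >> b) & 1) << k       # bit b of every input forms the address
        term = lut[addr] << b
        # the most significant bit carries negative weight in two's complement
        acc += -term if b == bits - 1 else term
    return acc


print(da_inner_product([3, -5, 7, 2], [10, -20, 33, -128]))  # 105
```

The look-up table grows as 2^N with the number of inputs, which is the ROM cost that ROM-free DA variants try to avoid.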

B. New Distributed Arithmetic:

Reducing cost metrics is an important concern in ASIC design. There is increasing demand for more efficient DA paradigms that eliminate the need for ROM. The new distributed arithmetic (NEDA) scheme was proposed to realize the inner product of vectors with optimized hardware requirements.

C. Discrete Cosine Transform:

The DCT expresses a finite sequence of data points as a sum of cosine functions oscillating at different frequencies. It is a Fourier-related transform similar to the DFT. DCTs are equivalent to DFTs of roughly twice the length operating on real data with even symmetry, where in some variants the input and/or output data are shifted by half a sample.

Useful properties of DCT in image and video applications:

  1. Energy compaction: concentrating the energy into a small number of coefficients.

  2. De-correlation: minimizing the interdependencies between coefficients.
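Energy compaction can be seen in a small numerical sketch. It assumes the orthonormal DCT-II definition and a smooth 8-sample ramp as input; the helper names `dct2` and `energy_fraction` are illustrative only.

```python
import math


def dct2(x):
    """Orthonormal DCT-II of a real sequence, computed directly from its definition."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                               for i in range(n)))
    return out


def energy_fraction(x, m):
    """Fraction of the signal energy held by the first m DCT coefficients."""
    coeffs = dct2(x)
    total = sum(c * c for c in coeffs)
    return sum(c * c for c in coeffs[:m]) / total


ramp = [10, 11, 12, 13, 14, 15, 16, 17]   # smooth data, typical of image rows
print(energy_fraction(ramp, 2))           # close to 1 for smooth inputs
```

Because the orthonormal transform preserves total energy, a fraction close to 1 means that the remaining six coefficients carry almost nothing and can be coded coarsely.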

D. Factor Sharing:

The factor sharing (FS) method shares the same factor in different coefficients among the same input. The 2-D DCT/IDCT technique is used to find the coefficient matrix. From the IDCT coefficients, the delta matrix and an area-efficient architecture are derived for the unified IDCT circuits.

Section II discusses the previous work on the common sharing distributed arithmetic (CSDA) algorithm. Section III describes the architecture of the multistandard transform core. Section IV discusses the important capacity metrics of FPGAs. Section V compares the buffer and D flip-flop circuits. Section VI gives the performance analysis of the design.


    The common sharing distributed arithmetic (CSDA) algorithm combines factor sharing (FS) and distributed arithmetic (DA). The FS method first finds the shared factors in the coefficient matrix. The DA method is then applied to share the combinations of the inputs among the coefficients.

    The search flow of the CSDA algorithm is designed for better hardware resource sharing in the core design, so CSDA achieves a higher capability for hardware resource sharing.

    The coefficient matrix for CSDA is formed from canonic signed digit (CSD) coefficients. CSD is a number representation that uses the digits -1, 0 and +1 and allows no two adjacent non-zero digits. On average, this encoding contains about 33% fewer non-zero digits than the two's complement form, which leads to efficient implementation of add/subtract networks in hardware digital signal processing.

    The main aim of CSDA is to reduce the number of non-zero elements in the coefficient matrix, which is why the CSD representation is used in the CSDA method.
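A short sketch of CSD encoding shows how the non-zero count drops. It assumes an integer-scaled coefficient and uses the standard non-adjacent-form recoding; `to_csd` is a hypothetical helper name, not part of the paper's design.

```python
def to_csd(n):
    """Return the CSD digits of integer n, least significant digit first.

    Each digit is -1, 0 or +1 and no two adjacent digits are non-zero.
    """
    digits = []
    while n != 0:
        if n & 1:
            d = 2 - (n & 3)   # +1 when n mod 4 == 1, -1 when n mod 4 == 3
            n -= d            # clears the low bit, and the next bit as well
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits


print(to_csd(7))   # [-1, 0, 0, 1], i.e. 7 = 8 - 1
```

Binary 7 is 111 and needs two adders in a shift-add multiplier, while the CSD form 8 - 1 needs a single subtractor.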

    For the mathematical part of CSDA, the general 8-point transform is used. By the symmetry property, the 8-point transform can be divided into two 4-point transforms, one for the even part and one for the odd part. The 4-point transforms can again be divided into two 2-point transforms.
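The even/odd split can be checked numerically. The sketch below assumes an unscaled DCT-II cosine kernel as the 8-point transform (the multistandard core also covers integer transforms, which are not modelled here); the function names are illustrative.

```python
import math


def dct_direct(x):
    """Unscaled N-point DCT-II evaluated directly from its definition."""
    n = len(x)
    return [sum(x[i] * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                for i in range(n))
            for k in range(n)]


def dct8_even_odd(x):
    """8-point transform computed as a butterfly plus two 4-point parts."""
    a = [x[i] + x[7 - i] for i in range(4)]   # butterfly sums feed the even part
    b = [x[i] - x[7 - i] for i in range(4)]   # butterfly differences feed the odd part
    even = dct_direct(a)                      # a 4-point transform gives X[0], X[2], X[4], X[6]
    odd = [sum(b[i] * math.cos((2 * i + 1) * k * math.pi / 16) for i in range(4))
           for k in (1, 3, 5, 7)]             # odd rows of the 8-point kernel
    out = [0.0] * 8
    out[0::2] = even
    out[1::2] = odd
    return out


x = [1, 5, -3, 2, 7, 0, 4, -6]
print(max(abs(p - q) for p, q in zip(dct8_even_odd(x), dct_direct(x))))
```

The printed maximum difference should be at rounding-error level, which confirms that the butterfly followed by the two 4-point parts reproduces the full 8-point result.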


    Fig. 1. 1-D CSDA MST core

    The 1-D multistandard transform core consists of a selected butterfly architecture (SBF), an even-part CSDA, an odd-part CSDA, error correction adder trees (ECATs) and a permutation module.


    The selected butterfly architecture performs the 8-point butterfly with 8 multiplexers. The 8-point transform is divided into two 4-point transforms, the even-part CSDA and the odd-part CSDA.


    The even-part CSDA calculates the even part of the 8-point transform. It consists of a two-stage pipeline architecture that uses D flip-flops as pipeline registers. The first stage executes the 4-point input butterfly matrix circuit, and the second stage shares the hardware resources among the variable video standards.


    Like the even-part CSDA, the odd-part CSDA consists of two pipeline stages with registers, which efficiently share the hardware resources of the odd part of the architecture among the variable video standards.


    Error correction adder trees follow the even-part and odd-part CSDA and add the non-zero CSDA coefficients with corresponding tree-like structures. Permutation is used to minimize the truncation errors.

    Fig. 2. 2-D CSDA MST core

    The 2-D multistandard transform core consists of two 1-D MST cores and a transposed memory built from pipeline registers formed by D flip-flops. As a result, a large number of gates is required in the synthesis of the design.
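The row-column structure of the 2-D core can be sketched in software. The example assumes an unscaled DCT-II kernel for the 1-D stage and models the transposed memory as a plain matrix transpose; the names `transform_1d`, `transpose` and `transform_2d` are hypothetical.

```python
import math


def transform_1d(row):
    """Stand-in for one 1-D core: unscaled DCT-II of a single row."""
    n = len(row)
    return [sum(row[i] * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                for i in range(n))
            for k in range(n)]


def transpose(matrix):
    """Stand-in for the transposed memory: rows become columns."""
    return [list(col) for col in zip(*matrix)]


def transform_2d(block):
    """2-D transform as two 1-D passes separated by a transposition."""
    stage1 = [transform_1d(r) for r in block]               # first 1-D core, row by row
    stage2 = [transform_1d(r) for r in transpose(stage1)]   # second 1-D core on the columns
    return transpose(stage2)                                # restore row/column orientation


block = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
result = transform_2d(block)
```

In this model the transposed memory must hold a full block of intermediate results, so the storage element used for each bit has a direct effect on the total gate count.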


    A field programmable gate array (FPGA) mainly consists of look-up-table-based logic elements in combinational logic blocks, with interconnects that connect the logic elements between the blocks.

    In general, three metrics are used to describe FPGA device capacity: maximum logic gates, maximum memory bits and typical gate range. FPGA manufacturers mainly state device capacity in terms of gate counts because of the logic elements.

    The maximum logic gates metric estimates the maximum number of gates that can be used to implement particular logic functions.

    Table 1: Gate counts for common logic functions.

    2 input NAND gate
    2 to 1 MUX
    3 input XOR
    4 input XOR
    2 bit carry save full adder
    D flip-flop
    D flip-flop with set or reset
    D flip-flop with reset and clock


    Both flip-flops and buffers are used for the same purpose, i.e. holding the circuit data for a specific clock period. For area reduction, however, buffers can be used in place of flip-flops because they need fewer gates.

    Fig. 3. Buffer using NAND gates

    Fig. 4. D flip-flop using NAND gates

    From the two figures above, the buffer requires only two NAND gates, whereas the D flip-flop requires six NAND gates.
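The gate-count argument can be made concrete with a small sketch. It models the two-NAND buffer as two cascaded NAND inverters, which is one common construction and is assumed here rather than taken from Fig. 3, and it uses the per-cell NAND counts stated above. The register width in the usage line is a hypothetical value.

```python
NAND_PER_BUFFER = 2   # buffer built from two NAND gates, as stated in the text
NAND_PER_DFF = 6      # D flip-flop built from six NAND gates, as stated in the text


def nand(a, b):
    """2-input NAND on single-bit values 0 or 1."""
    return 1 - (a & b)


def buffer(a):
    """Non-inverting buffer: two NAND gates, each wired as an inverter."""
    inverted = nand(a, a)
    return nand(inverted, inverted)


def nand_gates_saved(register_bits):
    """NAND gates saved when every bit cell uses a buffer instead of a D flip-flop."""
    return register_bits * (NAND_PER_DFF - NAND_PER_BUFFER)


print(buffer(0), buffer(1))    # 0 1
print(nand_gates_saved(16))    # 64 for a hypothetical 16-bit pipeline register
```

The saving grows linearly with the number of register bits, so the largest effect is expected where registers are most numerous.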


    Table 2: Device utilization summary

    Logic utilization:
    Number of slice flip flop
    Number of 4 input LUT
    Logic distribution:
    Number of occupied
    Number of slices containing only related
    Number of slices containing unrelated logic
    Total number of 4 input
    Number used as logic
    Number used as route
    Number of bonded IOB
    Number of GCLKS
    Total number of gate counts for design

    From the above table, the total gate count is reduced by using the buffer circuit. With the CSDA technique alone the gate count is 30k, but when the buffer circuit is implemented in the design the gate count falls to 23k. The area of the core design is therefore reduced, and the hardware efficiency, which is inversely proportional to the gate count, improves to 45%.
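The two gate counts quoted in the text (30k with CSDA alone, 23k with the buffer circuit) allow a quick arithmetic check. The sketch below computes the relative gate-count reduction and, under the assumption that throughput is unchanged so that efficiency scales as 1/gate count, the resulting efficiency ratio. The function names are illustrative.

```python
GATES_CSDA = 30_000     # gate count with D flip-flop pipeline registers (from the text)
GATES_BUFFER = 23_000   # gate count with buffer-based registers (from the text)


def relative_reduction(before, after):
    """Fraction of gates removed relative to the original design."""
    return (before - after) / before


def efficiency_ratio(before, after):
    """Efficiency gain if throughput is constant and efficiency ~ 1 / gate count."""
    return before / after


print(f"gate reduction:   {relative_reduction(GATES_CSDA, GATES_BUFFER):.1%}")
print(f"efficiency ratio: {efficiency_ratio(GATES_CSDA, GATES_BUFFER):.3f}")
```

The constant-throughput assumption matters: if the buffer-based design runs at a different clock rate, the efficiency ratio changes accordingly.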




