Design of video transform engine by using dct techniques



M.E. VLSI Design (PG Scholar), Srinivasan Engineering College, Perambalur-621212


Assistant Professor

Srinivasan Engineering College, Perambalur-621212

ABSTRACT- The space-time scheduling (STS) strategy combines spatial scheduling and time scheduling to achieve high image resolutions in real-time systems. The spatial scheduling strategy includes the ability to choose the distributed arithmetic (DA) precision bit length. The strategy uses a hardware-sharing architecture that reduces hardware cost and arranges the computations of the different dimensions, so that the first- and second-dimensional calculations run simultaneously in a single 1-D DCT core and reach a hardware utilization of 100%. The hardware-sharing architecture uses a binary signed digit (BSD) DA architecture in which the arithmetic resources are shared during four time slots. The 2-D DCT core achieves high accuracy with a small area and a high throughput rate. With an 8-bit length, the noise in the image can be reduced to a low level compared with a 9-bit length. In addition to the reduction in area, increased accuracy can be obtained.

Key Terms: Binary Signed Digit (BSD), Discrete Cosine Transform (DCT), Distributed Arithmetic (DA), Space-Time Scheduling (STS).


    The discrete cosine transform (DCT) is a widely applied transform engine for image and video compression applications [1]. Visual media has moved towards high-resolution specifications such as high-definition television (HDTV). Accordingly, a high-accuracy, high-throughput-rate component is needed to meet future applications. In addition, to reduce the manufacturing cost of the integrated circuit (IC), a low-hardware-cost design is also required. Therefore, a high-performance video transform engine that combines high accuracy, a small area, and a high throughput rate is desirable for VLSI designs. A novel 8×8 two-dimensional (2-D) discrete cosine transform/inverse discrete cosine transform (DCT/IDCT) architecture has been proposed based on the direct 2-D approach and the rotation technique; its computational complexity is reduced by using the special attributes of complex numbers [3]. The 2-D DCT has been implemented using either direct or indirect methods. The direct methods include fast algorithms that reduce the computational complexity. A design automation environment with parameter configurations has also been provided for designing a 2-D DCT/IDCT core that is suitable for most image and video compression applications [5].


    The inner product for a general matrix multiplication and accumulation is given by

    $$Y = A^{T}X = \sum_{i} A_i X_i \qquad (1)$$

    where $A_i$ is a fixed coefficient and $X_i$ is the input data; the matrix of the coefficients $A_i$ is called the DA coefficient matrix. DA is an efficient technique for computing a sum of products, that is, a multiply-and-accumulate (MAC) operation. The MAC operation is common in digital signal processing applications. DA is used in data path circuit design to save area, and it implements the MAC using the basic building blocks of FPGAs. DA is bit-serial: it is basically a bit-level rearrangement of the multiply-and-accumulate operation. A smaller area can be achieved by using a BSD DA-based architecture.
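The bit-level rearrangement behind DA can be illustrated with a minimal Python sketch. It is not the paper's circuit: it assumes the lookup-table form of DA with integer coefficients and two's-complement inputs, and the function name `da_inner_product` and the `bits` parameter are hypothetical. An adder-based DA would replace the table with an adder network.

```python
def da_inner_product(coeffs, inputs, bits=8):
    """Compute sum(coeffs[i] * inputs[i]) one input bit-plane at a time.

    Assumption: every input fits in `bits`-bit two's complement.
    """
    n = len(coeffs)
    # Lookup table: entry `addr` holds the sum of the coefficients
    # whose address bit is set (2**n entries in total).
    lut = [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
           for addr in range(1 << n)]
    mask = (1 << bits) - 1
    words = [x & mask for x in inputs]  # two's-complement bit patterns
    acc = 0
    for b in range(bits):
        # Gather bit b of every input word into one table address.
        addr = sum(((w >> b) & 1) << k for k, w in enumerate(words))
        partial = lut[addr] * (1 << b)
        # The most significant bit has negative weight in two's complement.
        acc += -partial if b == bits - 1 else partial
    return acc
```

No multiplier appears in the loop: each step is one table read, one shift, and one accumulation, which is the property that DA-based designs exploit.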

    Fig. 1. Adder-based DA

    1. 2-D DCT core design

      The 8×8 2-D DCT is defined as

      $$y_{uv} = \frac{1}{4} K_u K_v \sum_{i=0}^{7} \sum_{j=0}^{7} x_{ij} \cos\frac{(2i+1)u\pi}{16} \cos\frac{(2j+1)v\pi}{16} \qquad (2)$$

      where $K_u, K_v = 1/\sqrt{2}$ for $u, v = 0$ and $K_u, K_v = 1$ for $1 \le u, v \le 7$. The 8-point 1-D DCT is defined as

      $$z_n = \frac{1}{2} K_n \sum_{m=0}^{7} x_m \cos\frac{(2m+1)n\pi}{16} \qquad (3)$$

      where $K_n = 1/\sqrt{2}$ for $n = 0$ and $K_n = 1$ for $n \ne 0$.

      Fig. 2. 8×8 2-D DCT

      A common hardware approach to reducing the computational complexity is row-column decomposition, which performs a row-wise 1-D transform followed by a column-wise 1-D transform with an intermediate transposition. The DA precision bit length of the BSD representation is chosen to achieve the required system accuracy, and the hardware resources are shared in time to reduce the area cost.

    2. Analysis of the co-efficient bits

    The coefficient bits are analyzed by using the seven internal coefficients C1 to C7 of the 2-D DCT transformation. The coefficients are expressed in BSD format to save computation time and hardware cost while meeting the required PSNR. The PSNR is defined as

    $$\mathrm{PSNR} = 10 \log_{10} \frac{255^2}{\mathrm{MSE}} \qquad (4)$$

    where MSE is the mean-square error between the original image and the reconstructed image for each pixel. The accuracy of the 8-bit and 9-bit BSD expressions is compared in Fig. 3. As the BSD bit length increases, the hardware cost also increases. A reduction in hardware cost and an improvement in system accuracy are obtained by means of the 9-bit BSD.

    Fig. 3. Simulation of system PSNR for different BSD bit lengths
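The trade-off between coefficient bit length and accuracy can be explored with a small Python sketch. It rests on several assumptions that the text does not state: the coefficients are taken as C_k = cos(kπ/16), the bit length is interpreted as the number of fractional bits, and the canonical (non-adjacent) signed-digit form is used as one possible BSD encoding. The helper names `to_csd`, `from_csd`, and `psnr` are hypothetical.

```python
import math

def to_csd(value, frac_bits):
    """Signed digits (-1, 0, +1), least significant first, of
    round(value * 2**frac_bits) in canonical (non-adjacent) form."""
    n = round(value * (1 << frac_bits))
    digits = []
    while n != 0:
        if n & 1:
            d = 2 - (n & 3)  # +1 when n % 4 == 1, -1 when n % 4 == 3
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits

def from_csd(digits, frac_bits):
    """Value represented by a signed-digit list."""
    return sum(d * (1 << k) for k, d in enumerate(digits)) / (1 << frac_bits)

def psnr(original, reconstructed, peak=255):
    """PSNR in dB for 8-bit pixel data, following the definition above."""
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    return math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse)

# Assumed coefficient values: C_k = cos(k*pi/16), k = 1..7.
coefficients = {k: math.cos(k * math.pi / 16) for k in range(1, 8)}
for bits in (8, 9):
    nonzero = sum(sum(1 for d in to_csd(c, bits) if d) for c in coefficients.values())
    worst = max(abs(c - from_csd(to_csd(c, bits), bits)) for c in coefficients.values())
    print(f"{bits}-bit: {nonzero} nonzero digits, worst coefficient error {worst:.5f}")
```

The count of nonzero digits is only a rough proxy for the number of adders a shift-and-add realization would need; the paper's PSNR curves come from full image simulations, which this sketch does not reproduce.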


    The video transform engine core architecture is based on two 1-D DCT units coupled through a transpose matrix RAM. The transposition RAM is double buffered (two stages): while the second DCT stage reads data out of transposition memory 1, the first DCT stage can write new data into transposition memory 2. This enables a two-stage global pipeline in which every stage consists of a 1-D DCT unit and a transposition memory. The 1-D DCT units are not internally pipelined; they use parallel distributed arithmetic with butterfly computation, as is common in digital signal processing for video, to compute the DCT values. A design based on distributed arithmetic does not use any multipliers for computing the MAC (multiply and accumulate); instead, it stores precomputed MAC results in ROM and fetches them as needed. The DBUFCTL block is a memory arbiter between the 1-D DCT stages.

    Fig. 4. System architecture.
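The data flow of two 1-D DCT stages coupled through a transposition memory can be checked numerically. The Python sketch below is a floating-point software model only, not the hardware: it applies the 8-point 1-D DCT to the rows, transposes, applies it again, and compares the result with the direct 8×8 2-D DCT definition. The function names are hypothetical.

```python
import math

def dct_1d(x):
    """8-point 1-D DCT with the 1/2 * K_n scaling used in this paper."""
    out = []
    for n in range(8):
        k_n = 1 / math.sqrt(2) if n == 0 else 1.0
        out.append(0.5 * k_n * sum(
            x[m] * math.cos((2 * m + 1) * n * math.pi / 16) for m in range(8)))
    return out

def transpose(block):
    """Software stand-in for the transposition memory."""
    return [list(col) for col in zip(*block)]

def dct_2d_row_column(block):
    stage1 = [dct_1d(row) for row in block]  # first-dimension transform
    tmem = transpose(stage1)                 # written by stage 1, read by stage 2
    stage2 = [dct_1d(row) for row in tmem]   # second-dimension transform
    return transpose(stage2)                 # back to y[u][v] ordering

def dct_2d_direct(block):
    """Direct evaluation of the 8x8 2-D DCT double sum, for comparison."""
    def k(w):
        return 1 / math.sqrt(2) if w == 0 else 1.0
    return [[0.25 * k(u) * k(v) * sum(
                 block[i][j]
                 * math.cos((2 * i + 1) * u * math.pi / 16)
                 * math.cos((2 * j + 1) * v * math.pi / 16)
                 for i in range(8) for j in range(8))
             for v in range(8)]
            for u in range(8)]
```

The row-column form needs 16 one-dimensional transforms per block instead of the 64-term double sum for each of the 64 outputs, which is why the transposition memory is worth its area.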


The hardware shares resources in order to reduce the area cost by using the following modules: the modified two-input butterfly module (MBF2), the process element even (PEE), the process element odd (PEO), the pre-reorder module, and the post-reorder module.

A. Modified butterfly module

Generally, BF2 has a hardware utilization rate of 50% in the adder and subtracter. Additional multiplexers and reorder registers are added to the proposed MBF2 module in order to enable the hardware resources to be shared. The reorder registers consist of four word registers that use control signals to select the input data and enable signals to output the data. The operation of the proposed MBF2 has an eight-clock-cycle period, similar to BF2. In the pre-reorder module, the hardware resources can be shared by reordering the inputs during the four separate time slots.

  1. Process element module

    In the DA-based computation, the even-part and odd-part transformations are implemented using the PEE and PEO, respectively. The hardware shares resources at the bit level. The even adder tree (EAT) and the odd adder tree (OAT) use the error-compensated adder tree. The even-part transformation can be expanded based on the DA computation format.
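The split into a butterfly stage followed by even and odd process elements follows from the symmetry of the cosine basis. The Python sketch below is a floating-point model under stated assumptions: it ignores the DA bit-level sharing, the error-compensated adder trees, and the cycle timing, and the function names are hypothetical. It forms sums and differences of mirrored inputs, evaluates the even and odd outputs as two 4-input sums, and interleaves them.

```python
import math

def butterfly(x):
    """Sums and differences of mirrored inputs (the role of the butterfly stage)."""
    a = [x[i] + x[7 - i] for i in range(4)]  # feeds the even part
    b = [x[i] - x[7 - i] for i in range(4)]  # feeds the odd part
    return a, b

def process_element(half, indices):
    """Evaluate the outputs z_n, n in `indices`, from four butterfly outputs."""
    out = []
    for n in indices:
        k_n = 1 / math.sqrt(2) if n == 0 else 1.0
        out.append(0.5 * k_n * sum(
            half[m] * math.cos((2 * m + 1) * n * math.pi / 16) for m in range(4)))
    return out

def dct_1d_even_odd(x):
    a, b = butterfly(x)
    even = process_element(a, (0, 2, 4, 6))  # even-part transformation
    odd = process_element(b, (1, 3, 5, 7))   # odd-part transformation
    z = [0.0] * 8
    z[0::2] = even  # merge the two streams back into natural order
    z[1::2] = odd
    return z

def dct_1d_direct(x):
    """Reference 8-point 1-D DCT, used only to check the decomposition."""
    return [0.5 * (1 / math.sqrt(2) if n == 0 else 1.0) * sum(
                x[m] * math.cos((2 * m + 1) * n * math.pi / 16) for m in range(8))
            for n in range(8)]
```

Each output now needs a 4-term sum instead of an 8-term sum, which halves the work of the process elements at the cost of the butterfly additions.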

  2. Post-Reorder Module

    In the last stage of the 1-D DCT computation, the data sequences from the PEE and PEO must be merged and re-permuted in the post-reorder module, which permutes the data into sequential order. Two multiplexers select the data that is fed into the different reorder registers.

  3. 2-D DCT Core Architecture

To save hardware cost, the proposed 2-D DCT core is implemented using a single 1-D DCT core and one TMEM. The 1-D DCT core includes an MBF2, a pre-reorder module, a PEE, a PEO, a post-reorder module, and one TMEM. The TMEM is implemented using 64-word 12-bit dual-port registers and has a latency of 52 cycles. Based on the time scheduling strategy, maximum hardware utilization can be achieved.


The 2-D DCT core employs a single 1-D DCT core and one TMEM and occupies a small area. The 8-bit DA precision is chosen to meet the PSNR requirements, and the hardware-sharing architecture shares resources in time so as to reduce the area cost.

The number of adders/subtracters in the 1-D DCT core allows a 74% saving in area over the NEDA architecture for the DA-based DCT design. The system arranges the computation time for each process element. The 1-D core can calculate the first- and second-dimensional transformations simultaneously and achieves a high throughput rate. Thus high accuracy, a small area, and a high throughput rate are achieved using the STS strategy. In future work, an 8-bit DA precision length will be used to reduce the noise in the image.


  1. Y. Wang, J. Ostermann, and Y. Zhang, Video Processing and Communication, 1st ed. Englewood Cliffs, NJ: Prentice-Hall, 2002.

  2. M. Alam, W. Badawy, and G. Jullien, "A new time distributed DCT architecture for MPEG-4 hardware reference model," IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 5, pp. 726-730, May 2005.

  3. Y. P. Lee, T. H. Chen, L. G. Chen, M. J. Chen, and C. W. Ku, "A cost-effective architecture for 8×8 two-dimensional DCT/IDCT using direct method," IEEE Trans. Circuits Syst. Video Technol., vol. 7, no. 3, pp. 459-467, Jun. 1997.

  4. Y. H. Chen, T. Y. Chang, and C. Y. Li, "High throughput DA-based DCT with high accuracy error-compensated adder tree," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 2010.

  5. J. I. Guo, R. C. Ju, and J. W. Chen, "An efficient 2-D DCT/IDCT core design using cyclic convolution and adder-based realization," IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 4, pp. 416-428, Apr. 2004.

  6. E. Feig and S. Winograd, "Fast algorithms for the discrete cosine transform," IEEE Trans. Signal Process., vol. 40, no. 9, pp. 2174-2193, Sep. 1992.

  7. S. Ghosh, S. Venigalla, and M. Bayoumi, "Design and implementation of a 2D-DCT architecture using coefficient distributed arithmetic," in Proc. IEEE Comput. Soc. Annu. Symp. VLSI, 2005, pp. 162-166.

  8. S. C. Hsia and S., "Shift-register-based data transposition for cost-effective discrete cosine transform," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 15, no. 6, pp. 725-728, Jun. 2007.
