An Efficient Reduction of Area in Multistandard Transform Core

A. Shanmuga Priya; Dr. T. K. Shanthi

doi:10.17577/IJERTV3IS041193

Volume 03, Issue 04 (April 2014)


Open Access
Published (First Online): 21-04-2014
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License


Shanmuga Priya1, Dr. T. K. Shanthi2

1PG Scholar, Applied Electronics, Department of ECE,

2Associate Professor, Department of ECE,

Thanthai Periyar Government Institute of Technology, Vellore, Tamilnadu, India.

Abstract: This paper describes a reduction of area in a multistandard transform core. Replacing the D flip-flops in the pipeline registers of the architecture with buffer circuits reduces the gate count, and with it the area of the hardware design. The hardware cost of the core therefore falls and its hardware efficiency improves, to a reported 45%.
1. INTRODUCTION
  
This section gives an overview of the transform techniques used in various video applications.

A.Distributed Arithmetic:

Distributed arithmetic (DA) is an efficient, bit-serial technique for computing a sum of products, or vector dot product. Its main advantage in data path circuit design is that it computes the inner product with look-up tables and accumulators instead of multipliers. DA has been used in many DSP applications such as the DCT, the DFT, and digital filters.
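The look-up-table idea can be sketched in a few lines of Python. This is a minimal illustration and not the hardware described in this paper: the names `build_lut` and `da_dot`, the 8-bit two's complement input width, and the integer coefficients are all assumptions chosen for the example.

```python
def build_lut(coeffs):
    """Precompute, for every input bit pattern, the sum of the selected coefficients."""
    n = len(coeffs)
    return [
        sum(c for i, c in enumerate(coeffs) if (pattern >> i) & 1)
        for pattern in range(1 << n)
    ]


def da_dot(coeffs, xs, bits=8):
    """Dot product of coeffs and xs without multipliers.

    xs are assumed to be signed integers that fit in `bits` bits
    (two's complement). One bit plane of all inputs is processed per step.
    """
    lut = build_lut(coeffs)
    mask = (1 << bits) - 1
    acc = 0
    for b in range(bits):
        # Gather bit b of every input to form the LUT address.
        addr = 0
        for i, x in enumerate(xs):
            addr |= (((x & mask) >> b) & 1) << i
        term = lut[addr] << b  # shift instead of multiply
        # The sign bit carries negative weight in two's complement.
        acc += -term if b == bits - 1 else term
    return acc
```

For example, `da_dot([3, -5, 7, 2], [10, -20, 33, -128])` returns the same value as the direct sum of products, using only table look-ups, shifts and additions.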
B.New Distributed Arithmetic:

Reducing cost metrics is an important concern in ASIC design. There is increasing demand for more efficient DA paradigms that eliminate the need for ROM. The new distributed arithmetic (NEDA) scheme was proposed to realize the inner product of vectors while minimizing the hardware required.

C.Discrete Cosine Transforms:

The DCT expresses a finite sequence of data points as a sum of cosine functions oscillating at different frequencies. It is a Fourier-related transform similar to the DFT. DCTs are equivalent to DFTs of roughly twice the length, operating on real data with even symmetry, where in some variants the input and/or output data are shifted by half a sample.

Useful properties of DCT in image and video applications:

Energy compaction: concentrating the energy into a small number of coefficients.
De-correlation: minimizing the interdependencies between coefficients.
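The energy compaction property can be illustrated with a short Python sketch. It assumes the orthonormal DCT-II and a smooth, hypothetical 8-sample input; the name `dct2` and the sample values are illustrative and do not come from the paper.

```python
import math


def dct2(x):
    """Orthonormal DCT-II of a real sequence x."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * sum(
            x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i in range(n)
        ))
    return out


# Hypothetical smooth pixel row, typical of natural image content.
samples = [100, 102, 104, 107, 110, 112, 115, 118]
coeffs = dct2(samples)

# The orthonormal transform preserves total energy.
total_energy = sum(c * c for c in coeffs)

# Fraction of the energy held by the two largest coefficients.
top2_energy = sum(sorted((c * c for c in coeffs), reverse=True)[:2])
fraction = top2_energy / total_energy
```

For this input, more than 99% of the energy lands in the two largest coefficients, which is why the remaining coefficients can be coded coarsely.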

D.Factor Sharing:

The factor sharing (FS) method shares the same factor among different coefficients applied to the same input. The 2D DCT/IDCT technique is used to find the coefficient matrix. From the IDCT coefficients, the delta matrix and an area-efficient architecture are derived for the unified IDCT circuits.

Section II discusses previous work related to the common sharing distributed arithmetic (CSDA) algorithm. Section III describes the architecture of the multistandard transform core. Section IV discusses the important FPGA capacity metrics. Section V compares the buffer and D flip-flop circuits. Section VI gives the performance analysis of the design.

RELATED WORK

The common sharing distributed arithmetic (CSDA) algorithm combines factor sharing (FS) and distributed arithmetic (DA). The factor sharing method first finds the shared factors in the coefficient matrix. The DA method is then applied to share the combinations of inputs among the coefficients.

The search flow of the CSDA algorithm is designed to improve hardware resource sharing in the core design, so CSDA achieves a higher capability for resource sharing.

To form the coefficient matrix for CSDA, canonic signed digit (CSD) coefficients are used. CSD is a signed-digit number system that encodes a value with the digits -1, 0 and +1 such that no two adjacent digits are non-zero. On average, this encoding contains about 33% fewer non-zero digits than the two's complement form, leading to efficient implementation of add/subtract networks in hardware digital signal processing.

The main aim of CSDA is to reduce the number of non-zero elements in the coefficient matrix, which is why the canonical signed digit representation is used in the CSDA method.
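A CSD encoder is short enough to sketch in Python. This is a generic conversion for integers, offered as an illustration rather than the coefficient-generation flow used in the CSDA work; the helper names `to_csd` and `from_csd` are hypothetical.

```python
def to_csd(value):
    """Return the CSD digits of an integer, least significant digit first.

    Each digit is -1, 0 or +1, and no two adjacent digits are non-zero.
    """
    digits = []
    while value != 0:
        if value & 1:
            # Choose +1 when value % 4 == 1 and -1 when value % 4 == 3,
            # so that the next digit is guaranteed to be zero.
            d = 2 - (value & 3)
            value -= d
        else:
            d = 0
        digits.append(d)
        value >>= 1
    return digits


def from_csd(digits):
    """Rebuild the integer from its CSD digits (least significant first)."""
    return sum(d * (1 << i) for i, d in enumerate(digits))
```

For example, 7 is `111` in binary (three non-zero digits) but `to_csd(7)` gives `[-1, 0, 0, 1]`, that is 8 - 1, with only two non-zero digits and therefore one fewer adder.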

For the mathematical part of CSDA, the general 8-point transform is used. By the symmetry property, the 8-point transform can be divided equally into two 4-point transforms, one for the even part and one for the odd part. The 4-point transforms can in turn be divided into two 2-point transforms.
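The even/odd split can be checked numerically. The sketch below assumes an unnormalized 8-point DCT-II as the transform (the paper's integer coefficient matrices for each video standard are not reproduced here), and the function names are illustrative.

```python
import math


def dct_direct(x):
    """Unnormalized DCT-II computed directly from its definition."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n))
        for k in range(n)
    ]


def dct8_even_odd(x):
    """8-point DCT-II computed as a butterfly followed by two 4-point transforms."""
    # Butterfly stage: sums feed the even part, differences feed the odd part.
    a = [x[i] + x[7 - i] for i in range(4)]
    b = [x[i] - x[7 - i] for i in range(4)]

    # Even part: a 4-point DCT-II of the sums gives X[0], X[2], X[4], X[6].
    even = dct_direct(a)

    # Odd part: a 4-point transform of the differences gives X[1], X[3], X[5], X[7].
    odd = [
        sum(b[i] * math.cos(math.pi * (2 * i + 1) * (2 * m + 1) / 16) for i in range(4))
        for m in range(4)
    ]

    out = [0.0] * 8
    out[0::2] = even
    out[1::2] = odd
    return out
```

Both functions return the same eight outputs, but the split version needs two 4x4 coefficient matrices instead of one 8x8 matrix, which is what makes the even part and odd part separable hardware blocks.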
ARCHITECTURE OF MULTISTANDARD TRANSFORM CORE:

Fig1. 1D CSDA MST core

The 1D multistandard transform core consists of:

Selected Butterfly Architecture (SBF)
Even Part CSDA
Odd Part CSDA
Error Correction Adder Tree (ECAT)
Permutation

A.SBF

The selected butterfly (SBF) module performs the 8-point butterfly operation using 8 multiplexers. The 8-point transform is divided into two 4-point transforms, handled by the even part CSDA and the odd part CSDA.

B.EVEN PART CSDA

The even part CSDA calculates the even part of the 8-point transform. It is a two-stage pipeline architecture that uses D flip-flops as pipeline registers. The first stage executes the 4-point input butterfly matrix circuit, and the second stage shares the hardware resources among the various video standards.

C.ODD PART CSDA

Like the even part CSDA, the odd part CSDA consists of two pipeline stages, and it efficiently shares hardware resources within the odd part of the architecture across the various video standards.

D.ECAT&PERMUTATION

The error correction adder trees follow the CSDA even part and the CSDA odd part; they add the non-zero CSDA coefficients in a tree-like structure. Permutation is used to minimize the truncation errors.

Fig2. 2D CSDA MST core

The 2D multistandard transform core consists of two one-dimensional MST cores and a transposed memory. The transposed memory is made up of pipeline registers built from D flip-flops, so a large number of gates is required when the design is synthesized.
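The data flow of the 2D core (1D transform, transpose, 1D transform) can be sketched in software. The example assumes an unnormalized DCT-II as the 1D transform and models the transposed memory as a plain matrix transpose; the function names are hypothetical and the pipeline timing of the hardware is not modeled.

```python
import math


def dct_1d(x):
    """Unnormalized 1D DCT-II, standing in for one 1D transform core."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n))
        for k in range(n)
    ]


def transpose(matrix):
    """Software stand-in for the transposed memory."""
    return [list(row) for row in zip(*matrix)]


def dct_2d(block):
    """2D transform of a square block using two 1D passes and a transpose."""
    row_pass = [dct_1d(row) for row in block]                 # first 1D core
    col_pass = [dct_1d(row) for row in transpose(row_pass)]   # memory + second 1D core
    return transpose(col_pass)
```

Because every intermediate value of the first pass must be stored before the second pass can read it column-wise, the transposed memory holds a full block of registers, which is why its register type has a visible effect on the gate count.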

IMPORTANCE OF FPGA

A field programmable gate array (FPGA) mainly consists of look-up-table based logic elements in configurable logic blocks, with interconnects linking the logic elements between blocks.

In general, three metrics are used to describe FPGA device capacity: maximum logic gates, maximum memory bits and typical gate range. FPGA manufacturers mainly state device capacity in terms of gate counts because capacity is determined by the logic elements.

The maximum logic gates metric estimates the maximum number of gates that can be used to implement a particular logic function.

Table 1: Gate counts for common logic functions

FUNCTIONS	GATE COUNTS
COMBINATIONAL FUNCTIONS
2 input NAND gate	2
2 to 1 MUX	4
3 input XOR	6
4 input XOR	9
2 bit carry save full adder	9
REGISTER FUNCTIONS
D flip-flop	6
D flip-flop with set or reset	8
D flip-flop with reset and clock enable	12

BUFFER AND D FLIPFLOP:

Both flip-flops and buffers are used for the same purpose, i.e. holding the circuit data for a specific clock period. For area reduction, however, buffers can be used in place of flip-flops because they need fewer gates.

Fig3. Buffer using NAND gates

Fig 4. D flip flop using NAND gates

The two figures above show that the buffer needs only two NAND gates, whereas the D flip-flop needs six NAND gates.
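The gate counts above translate directly into a per-register saving. The sketch below assumes the buffer is wired as two cascaded NAND inverters (the figure is not reproduced here, so this wiring is an assumption), and the helper names are hypothetical.

```python
def nand(a, b):
    """2-input NAND on single bits (0 or 1)."""
    return 1 - (a & b)


def buffer(d):
    """Buffer built from two NAND gates, each wired as an inverter."""
    inverted = nand(d, d)
    return nand(inverted, inverted)


NAND_PER_BUFFER = 2  # gate count stated in the text
NAND_PER_DFF = 6     # gate count stated in the text


def register_gates(width, gates_per_bit):
    """NAND gates needed for a register that is `width` bits wide."""
    return width * gates_per_bit


def gates_saved(width):
    """NAND gates saved by using buffers instead of D flip-flops."""
    return register_gates(width, NAND_PER_DFF) - register_gates(width, NAND_PER_BUFFER)
```

With these counts, every register bit saves four NAND gates, so `gates_saved(8)` is 32 for an 8-bit pipeline register.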

PERFORMANCE ANALYSIS:

Table 2: Device utilization summary

Table 2 shows that the total gate count is reduced by using the buffer circuit. With the CSDA technique and D flip-flop registers, the gate count is about 30k (28,606 in Table 2). When the buffer circuit is used in the design, the gate count falls to about 23k (23,799 in Table 2). The area of the core design is therefore reduced, and the hardware efficiency, which is inversely proportional to the gate count, improves to a reported 45%.

UTILIZATION	USING D FLIP FLOP	USING BUFFER
Logic utilization:
Number of slice flip flop used	1259	704
Number of 4 input LUT	2095	2048
Logic distribution:
Number of occupied slices	1310	1239
Number of slices containing only related logic	1310	1239
Number of slices containing unrelated logic	0	0
Total number of 4 input LUTs	2279	2158
Number used as logic	2095	2048
Number used as route thru	184	110
Number of bonded IOB	188	188
Number of GCLKS	1	1
Total number of gate counts for design	28606	23799
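The relative savings implied by Table 2 can be computed with a few lines of Python. The figures are copied from the table; the dictionary layout and the name `reduction_percent` are only illustrative.

```python
# (using D flip-flop, using buffer), values taken from Table 2
table2 = {
    "slice flip-flops": (1259, 704),
    "4-input LUTs used as logic": (2095, 2048),
    "occupied slices": (1310, 1239),
    "total gate count": (28606, 23799),
}


def reduction_percent(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


reductions = {name: reduction_percent(*pair) for name, pair in table2.items()}

for name, value in reductions.items():
    print(f"{name}: {value:.1f}% lower with buffers")
```

On these figures the total gate count drops by about 16.8% and the number of slice flip-flops by about 44%.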

