 Open Access
 Total Downloads : 20
 Authors : G. Satish Chandra , D. Sai Manikumar
 Paper ID : IJERTV7IS040160
 Volume & Issue : Volume 07, Issue 04 (April 2018)
 DOI : http://dx.doi.org/10.17577/IJERTV7IS040160
 Published (First Online): 16042018
 ISSN (Online) : 22780181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
High Performance Integer DCT Architecture for HEVC
G. Satish Chandra
Electronics and communication SRM Institute of Science & Technology
D. Sai Mani Kumar
Electronics and communication SRM Institute of Science And Technology
Abstract Our proposed system proceeds VLSI architecture for integer Discrete Cosine Transform (integer DCT), which is used in real time High Efficiency Video Coding (HEVC) applications. It has N point 1DInteger DCT architecture, which includes signed configurable carry save adder tree based multiplier unit. So, the depth of the architecture falls within the bounds of O (log2 N). The proposed 1D architecture is used to perform one Npoint or Integer DCTs in parallel. The proposed 1D architecture is used to design 2D folded and parallel designs. The performance results show that the proposed architecture gives better performance compared with existing architectures using 45 nm CMOS TSMC libraries. The proposed 32*32point parallel Integer DCT achieves 59.1% of improvement in worst path delay compared with oddeven decomposition based architecture.
Keywords: Integer DCT, HEVC, 1D DCT Architecture, 2D DCT Architecture
INTRODUCTION
Digital signal processors (DSPs) are very important for the realtime processing of realworld digitized data to do highspeed numeric calculations used for lot of applications from basic consumer electronics to sophisticated industrial instrumentation. The discrete transform is used to change the representation of a signal from one domain to another for reducing the complexity of a particular digital signal processing application.
Discrete cosine transform (DCT) is very powerful transformation used in image compression. The circuit complexity of DCT is greater than integer DCT because DCT is floating point and the integer DCT is fixed point. So, the delay of the multiplier adders used in the adder. The output can be stored at one particular 1*32 Buffer. The outputs of Ith, 1* 32 Buffer are b32i, b16i, b8i, b4i, and b2i, which are the resultants of 32, 16, 8, 4, and 2point Integer DCTs respectively. Each 1*32Buffer is made up of 32 numbers of registers and 2to1 multiplexers with common select line. The DCT application can have many purposes Such as filtering, teleconferencing, highdefinition television (HDTV), speech coding, image coding, data compression, and more. All of these use DCT algorithm for compression and/or filtering purposes. The DCT has the energy packing capabilities and also approaches the statistically optimal transform in decor relating a signal. It was implemented with discrete components at the board level.

LITERATURE SURVEY

High performance Multiplier less DCT Architecture for HEVC, Wenjun Zhao, Takao Onoye, and Tian Song(2015)
There are numerous video compression format for storage or transmission of digital video content. High Efficiency Video Coding (HEVC) is a video compression standard, a successor to H.264/MPEG4 Advanced Video
Coding (AVC). In this paper, we propose an efficient architecture for the computation of 4, 8, 16 and 32 point DCT used in HEVC standard. The architecture uses the Canonical Signed Digit (CSD) representation and Common Sub expression Elimination (CSE) technique to perform the multiplication with shiftadd operation.

A Reconfigurable MultiTransform VLSI Architecture Supporting Video Codec Design Kanwen Wang, Jialin Chen, Wei Cao, Ying Wang, Lingli Wang.
The proposed system for the realtime processing of 1080P HD video, which can support both forward and inverse transforms of MPEG using multi transform VLSI architecture. The MCM (RMCM) algorithm is the multiple constant multiplication algorithm with 2 fusing strategies, which is provided to generate constant multipliers in the matrix calculation blocks.

Multimode parallel and folded VLSI architectures for 1Dfast Fourier transform Mohamed Asan Basiri M and Noor Mahammad Sk.
This paper proposes efficient FFT VLSI architectures using folded/parallel implementation. The folded FFT architecture has number of cycles required to complete the operation is less than single/multipath delay commutator (MDC) architectures. Npoint FFT is implemented by using one N/2point FFT without much extra hardware. Both the proposed architectures are implemented for radix2, 22.


EXISTING SYSTEM
In all the existing architectures, threadshift network based multiplier is used. So, the delay of the multiplier is based on the number of adders used in the addshift network. The existing technique is addshift network. It uses configurable carry save addition
Disadvantages of Existing System:

Multiplication requirement is more.

More delay

High Power


PROPOSED SYSTEM

In the proposed architecture, configurable carry save adder (CSA) tree based multiplier is used. It shows the series of multiplexers used for configurable carry save addition based multiplication in the proposed architecture. The maximum number of values to be added in the configurable carry save addition based 32point Integer DCT is log2N = log232 = 5.IV.
The mathematical representations of the 2D Forward
Y (0)
Y (2)
C4
0
(6) X (7)
0
0
0
0
0 0 0 X (0) X (1) X (2) X (3) X (4) X (5) X
0
0
0
0 0 0 X (0) X (7) X (3) X (4) X (1) X (2) X
0
C2
C6
0
0 0 0 X (0) X (7) X (3) X (4)
0
C
0
0 0 0 X (1) X (6) X (2) X (5)
*

C

C
(5) X (6)
DCT and the 2D IDCT are represented in the following:
4
Y (4)
Y (6)
0
1 0
Formulae
6 2
Y (1)
4 0
0 0 0
C1 C3
C5 C7
X (0) X (7)
Y (5)
0 0
0 0 C3
C7 C1 C5
X (1) X (6)
Y (6)
0 0
0 0 C

C C
C
X (2) X (5)
Forward DCT
5 1 7
3
( N 1) ( N 1)
(2x 1)u
(2 y 1)v
Y (7)
0
0 0 0
C7 C5 C3
C1
X (3) X (4)
F (u,v) C(u)C(v)[ f (x, y) cos 2N
cos ]
2N
The circuit consists of 38 multipliers and 8 adders and
( N 1) ( N 1)
x0
y0
Inverse DCT
(2x 1)u
(2y 1)v
8 subtractor connected in a regular matrix of cells. Bit serial logical adders and subtractor cells have been used and the array multipliers have been implemented. The bit serial multiplier will be pipelined every two cells. By using KMap, The serial
logical adder and subtractor equation is implemented. The
f (x, y) [ C(u)C(v)F(u,v)cos 2N
cos ]
2N
following table showing the equations:
u0 v0
1 1
Where: C(u) = , C(v) =
Sum = A B Cin Difference = A B bin
N
for u,v = 0
2
C(u) =
N
N
2
, C(v) =
N
Carry = AB + ACin +BCin Borrow = AB + Abin + Bbin
Consider the two unsigned biary numbers X and Y that are M and N bits wide respectively. X and Y in a binary representation are as below:
for u,v = 1 through N1; N = 4, 8, or 16
In the design, N = 8. F(u,v) is called the (u,v)th
M 1 i
x xi 2 ,
i0
N 1
y y 2
j
j
j 0
transform coefficient. The above formula shows that the 2D DCT can be computed by applying the 1D DCT to each of the columns of the matrix separately and then applying the 1D DCT to each of the rows separately. This is the reparability property of the 2D DCT. All the 2D DCT processors developed so far have made use of this property of the 2D
With Xi, Yj {0,1}. The multiplication operation is then defined as follow:
x
M N 1
z x * y zk 2
k 0
M 1 i N 1 j
xi 2 y j 2
DCT. In this report, we present the design of the 2D DCT
function under VLSI architecture for image processing. The
i0
M 1 N 1
j0
design layout will be at cells block level, which it does not show in great detail for the entire chip design .

WORKING PRINCIPLE
A. DCT Algorithm
2D DCT Architecture: The two dimensional (2D) Discrete Cosine Transform (DCT) forms the cornerstone of many image processing standards such as JPEG and MPEG. Many proposed solutions are based on row column decomposition implementation which allows the 2D DCT to be implemented by two one dimensional (1D) DCTs separated by a transposition memory.
1D DCT Architecture: The derivation of the lD DCT architecture can be more easily explained by examining the lD DCT in matrix form, given as below:
[Y ] [C]*[X ]x y 2i j
i j
i0 j 0
The multiplicand is consecutively multiplied with every bit of the multiplier, resulting in a number of partial products. These intermediate results are adder after the proper shifting has been applied. Use the algorithms of two binary number multiplications to implement the array multiplier. The array multiplier consists numerous of AND and full adder. This type of multiplier requires Mbit (Multiplicand) x Nbit (Multiplier) number of AND gates and full adders. The transpose component is an array of 8×8, 12bit shift registers. It receives the output from the 1D DCT (row), and transposes the row to the column of the second 1D DCT input. The shift register used to store the bits into registers. Then, connect the metal plate to other shifters input, and shift the each 12 bits arrange as column format.

FLOWCHART
Y (0)
Y (1)
C4 C4 C4
C C C
C4 C4
C C
C4 C4


C C
C4

C
X (0)
X (1)
1
3 5 7
7 5 3
1
Y (2)
Y (3)
C2
1 C
C6 C6

C C

C2

C

C2

C

C6 C6
C C
C2

C
X (2)
X (3)
3
7 1 5
5 1 7
3
Y (4) 4 C4 C4 C4 C4 C4 C4 C4 C4 X (4)
Y (5)
C5 C1 C7
C3 C3 C7
C1 C5
X (5)
Y (6) C C C

C C C
C C X (6)
6
2 2 6 6 2
2 6
Y (7)
C7

C5 C3

C1
C1 C3

C5
C7
X (7)
Where C(k) = Cos(2K/32). As multipliers are m times more complex than adders, the aim is to reduce the number of multiplications at the expense of additions. The sparse matrix approach achieves this by manipulating the terms in the input matrix as shown in equation.
Fig.1 Flow chart of 1D DCT Architecture
Fig.2 Flow chart of 2DDCT Architecture

PROPOSED DCT ARCHITECTURE
Proposed block architecture used for 32point 1D Integer DCT. In 32point 1DInteger DCT, the coefficient matrix is in the size of 32_32. The input signal sample values should be multiplied with the coefficient, which forms the matrixvector multiplier.
In all the existing architectures, the addshift network based multiplier is used. So, the delay of the multiplier is based on the number of adders used in the addshift network. In the proposed architecture, configurable carry save adder (CSA) tree based multiplier is used. Fig. 3(a) shows the series of multiplexers used for configurable carry save addition based multiplication in the proposed architecture.
Fig.3 Proposed Architecture
The maximum number of values to be added in the configurable carry save addition based 32point Integer DCT is log2N = log232 = 5. For example, the multiplication of the co efficient 87 with the input signal sample value xi is equal to 87xi = 64xi +16xi +4xi +2xi +xi. The minimum number of values to be added in the configurable carry save addition based 32point Integer DCT is 1. For example, the multiplication of the coefficient 4 with the input signal sample value xi is equal to 4xi = 4xi + 0xi + 0xi + 0xi + 0xi. So, the corresponding left shifted (power of two) input signal values are sent as the input of the series of multiplexers used in Fig. 3(a), which is named
as Cell.Therefore, five Cells are used in Fig. 3(b). So, the maximum possible levels of the configurable carry save adder (CSA) tree is log25 = 3. The Sum and Carry from the final carry save adder are added. The proposed block architecture (Block) used for 32point 1DInteger DCT with (a) Series of multiplexers used for configurable carry save addition based multiplication (Cell) (b) configurable carry save adder tree based multiplication unit (c) Series of multiplexers used to find the resultant sign bits for the multiplication.
Fig.4 VLSI architectures for proposed 32point 1DInteger DCT
The overall architecture of proposed 32point 1D Integer DCT, where the inputs are from 32 numbers of Blocks as shown in Fig.4. Therefore, log232 = 5 levels of signed fixed point adders are used. Therefore, the critical path depth of the signed adder tree (Tadd; pro delay) used in the Npoint proposed Integer DCT architecture is (log2N) T(add). Here, T (add) represents the critical path depth of the signed adder. The proposed 32point 1D architecture is used to perform one 32 point or two 16point or four 8point or eight 4point or sixteen 2point Integer DCTs in parallel. The 32point Integer DCT output is fou32s; ou32g. Fig.5 shows the 32 X 32Buffer architecture, where 32 numbers of 1*32Buffers are used. The 1*32Buffer inputs are the outputs from the column of 5to1 multiplexers, with select line se. Here, se = 0; 1; 2; 3, and 4 for
32; 16; 8; 4; and 2point Integer DCTs respectively. Each 1*32 Buffer is made up of 32 numbers of registers and 2to1 multiplexers with common select line. The select lines used in the 1*32Buffers 0, 1,…. 30, and 31 are en0, en1,…en30, and en31 respectively. The output from Fig. 4 can be stored at one particular 1*32Buffer with corresponding select line as 1. The 1*32Buffer architecture is shown in Fig. 6. The outputs of ith 1*32Buffer are b32i, b16i, b8i, b4i, and b2i, which are the resultants of 32, 16, 8, 4, and 2point Integer DCTs respectively. Here, eni = 0 to maintain the values (32 values) stored in the buffer and eni = 1 if the the new value.
Fig.5 VLSI architectures for 32 X 32Buffer
Fig.6 Architecture of 1*32 Buffer
1* 32Buffer architecture: The output from Fig. 4(a) can be stored at one particular 1 *32Buffer with corresponding select line as 1. The 1 _ 32Buffer architecture is shown in Fig. 5. The outputs of ith 1 * 32Buffer are b32i, b16i, b8i, b4i, and b2i, which are the resultants of 32, 16, 8, 4, and 2point Integer DCTs respectively. Here, eni = 0 to maintain the values (32 values) stored in the buffer and eni = 1 if the the new value is obtained.
Fig.7 roposed 32 DCT
FORWARD TRANSFORM (DCT):
T1
First Stage of Forward Transform: The first stage of the forward transform consists of multiplication of the result of the D4. The input into the second stage of the forward transform is the output matrix from the first stage of forward transform which is a matrix with only the DC element. The output of multiplication with DT4 will be a matrix with first column elements. Consequently, the scaling required after the first stage of the forward transform for the output to fit within 16 bits is S = 2(BM+9).
T2
Second Stage of Forward Transform: The second stage of the forward transform consists of multiplication of the result of the first transform stage with D4. The input into the second stage of the forward transform is the output from the first stage which is a matrix with all elements in the first row. All other elements will be zero. The output of multiplication with will be a matrix with only a DC value. This implies that the scaling required after the second stage of transform is in S = 2(21B) order for the output to fit within 16 bits.

TECHNOLOGY USED
The multiplier unit used in the latest Npoint Integer DCT architectures is in the form of addshift network, whereas in the proposed architecture, signed configurable carry save adder tree [11] is used. Therefore, the depth of the architecture falls within the bounds of O (log2 N). The proposed 1D architecture is used to perform one Npoint or multiple N 2 , N 4 ,.2point Integer DCTs in parallel. The performance results show that the proposed architecture gives better performance compared with existing , Xilinx 13.2 software is used to check this model. For IC development they have three processes in Xilinx software.

Check syntax

Pin assignment

Implementation
Check syntax is used to check our design having any error. After finishing this process we allocate the input and output pins by using pin assignment. In the final process is implementation. Here we implement the design into our
assigning pins. Then we convert our code into bit file then after we dump this bit file in to FPGA spartan6 and verified it.
Device Utilization Summary 32Point DCT


ADVANTAGES

Better Compression Performance

Computation Performance is good


APPLICATIONS

Used in health department.

Human welfare

It is used to monitor industrial radiation levels.



ACKNOWLEDGEMENT
We would like to thank all those who provide us the possibility to propose this project. A special gratitude to our Project Guide, Mrs.T.Theresal Assistant professor and Project coordinator Mrs.K.Ferents koni jiavana Associate professor, whose contribution in giving suggestions and encouragement helped us for this project to complete.

RESULTS AND CONCLUSION

The multipliers are, at less, twice faster than the conventional design, and consume half of the power. This can be done by ignore the zeros in the multiply constant and the insignificant parts of the answer. The circuit is further reduced. Therefore, it consumes less power. The reduction will more stages, because the scares of resources. This can be countered by doing multistage in one period. The multiplier operation will take one period and one or two adders operations will perform in one period. Then, there will be less power consumption without comprising the speed. The 1 D DCT and Transpose is finished and the simulation is shown above. The area of the 1D DCT chip is .8mm x .6mm. The total delay 258.16ns. The Transpose is 292ns. The area is .2mmx.5mmThe results is the same as the calculation. We dont have time to construct the 2 D DCT. But, it is simple. It just need to connect two 1D DCT to the transpose. The performance results show that the proposed architecture gives good improvement as compared with existing architectures. The Snapshot below gives the clear elaboration of application.
Fig.8 Simulation Snapshot of 32POINT DCT
REFERENCES

Mohamed Asan Basiri M and Noor Mahammad Sk, Multimode Parallel and Folded VLSI Architectures for 1DFast Fourier Transform, Integration, the VLSI Journal, Elsevier, vol. 55, pp. 43 56, Sept. 2016.

Fei Liang, Xiulian Peng, and Jizheng Xu2, A lightweight HEVC encoder for image coding, IEEE International Conference on Visual Communications and Image Processing (VCIP), pp. 15, Nov. 2013.

Pramod Kumar Meher, Sang Yoon Park, Basant Kumar Mohanty, Khoon Seong Lim, and Chuohao Yeo,, Efficient Integer DCT Architectures for HEVC, IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 1, pp. 168 178, Jan. 2014.

PaiTse Chiang and Tian Sheuan Chang, A Reconfigurable Inverse Transform Architecture Design for HEVC Decoder, IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1006 1009, May 2013. 125.

Honggang Qi, Qingming Huang, and Wen Gao, A LowCost Very Large Scale Integration Architecture for Multi Standard Inverse Transform, IEEE Transactions on Circuits and Systems – II, Express Briefs, vol. 57, no. 7, pp. 551555, July 2010.

Khan Wahid, Muhammad Martuza, Mousumi Das, and Carl McCrosky, Resource Shared Architecture of Multiple Transforms for Multiple Video Codecs, IEEE International Canadian Conference on Electrical and Computer Engineering (CCECE), pp. 947950, May 2011.

Kanwen Wang, Jialin Chen, Wei Cao, Ying Wang, Lingli Wang, and Jiarong Tong, A Reconfigurable MultiTransform VLSI Architecture Supporting Video Codec Design, IEEE Transactions on Circuits and Systems – II, Express Briefs, vol. 58, no. 7, pp. 432436, July 2011.

Yao Ziyou, He Weifeng, Hong Liang, He Guanghui, and Mao Zhigang, Area and Throughput Efficient IDCT/IDST Architecture for HEVC Standard, IEEE International Symposium on Circuits and Systems(ISCAS), pp. 25112514, June 2014.

Hong Liang, He Weifeng, Zhu Hu, and Mao Zhigang, A Cost Effective 2D Adaptive Block Size IDCT Architecture for HEVC Standard, IEEE 56th International Midwest Symposium on Circuits and Systems (MWSCAS), pp. 12901293, Aug. 2013.

Wenjun Zhao, Takao Onoye, and Tian Song, HighPerformance Multiplierless Transform Architecture for HEVC, IEEE International Symposium on Circuits and Systems, pp. 16681671, May 2013.

Mohamed Asan Basiri M and Noor Mahammad Sk, An Efficient VLSI Architecture for Discrete Hadamard Transform, IEEE International VLSI Design Conference, pp. 140145, Jan. 2016.

Ricardo Gonzalez, Benjamin M. Gordon, and Mark A. Horowitz, Supply and Threshold Voltage Scaling for Low Power CMOS, IEEE Journal of Solid State Circuits, vol. 32, no. 8, pp. 12101216, Aug. 1997. 126.