 Open Access
 Authors : Bhuvanesh.N
 Paper ID : IJERTCONV1IS06148
 Volume & Issue : ICSEM – 2013 (Volume 1 – Issue 06)
 Published (First Online): 30-07-2018
 ISSN (Online) : 2278-0181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Floating point units in hybrid FPGA using clock gating
BHUVANESH.N
M.E. VLSI Design (PG Scholar), Srinivasan Engineering College, Perambalur-621212, nbhuvi@gmail.com
Abstract: Low power and high speed are two major challenges in VLSI circuit design. Subgraph extraction is a method for optimizing coarse-grained floating point units (FPUs) in a hybrid field-programmable gate array (FPGA), where each FPU consists of a number of interconnected floating point adders/subtracters (FAs), multipliers (FMs), and wordblocks (WBs). The wordblocks consist of registers and lookup tables (LUTs), which can implement fixed point operations efficiently. Common subgraph extraction is used to determine the best mix of blocks within an FPU for increasing the speed of the system, and clock gating is used to reduce power. Finally, an optimized coarse-grained FPU is obtained by considering both architectural and system-level issues, and the power consumption of the floating point units is compared.
Key Terms: Common subgraph extraction, field-programmable gate array (FPGA), floating point.
I. INTRODUCTION
In hybrid FPGAs, coarse-grained elements such as memories and digital signal processors (DSPs) are embedded within a fine-grained programmable fabric. Although coarse-grained elements are less flexible, they are efficient for implementing specific word-level operations. Improved speed and density can be achieved by incorporating embedded floating point units (FPUs) in applications that demand high-performance floating point computation. By using more and faster FPUs, an FPGA can provide higher data throughput. Wordblocks, floating point adders, and floating point multipliers are composed of non-programmable elements, resulting in a more compact block with higher speed but less flexibility than fine-grained logic. The main contributions are:
1) A study of floating point unit architectures over a set of floating point benchmark circuits.

2) An analysis of the benefits of merging different types of floating point units into a larger coarse-grained floating point unit.

3) A methodology to optimize a floating point hybrid FPGA by considering both the internal architecture of floating point units and the types of floating point units incorporated into the system.

II. HYBRID FPGA
A hybrid FPGA includes coarse-grained and fine-grained components, which are connected by routing tracks. Our fine-grained fabric consists of an array of identical configurable logic blocks (CLBs), each containing basic logic elements (BLEs). Each BLE contains LUTs, flip-flops (FFs), support for fast carry chains, internal multiplexers, and XOR gates. An FPGA is an array of fine-grained configurable logic blocks interconnected in a hierarchical fashion. Commercial FPGAs contain coarse-grained blocks such as memories and multipliers for commonly used primitives, to improve efficiency for specific functions. However, FPGA implementations are approximately 20 times larger and 4 times slower than equivalent application-specific integrated circuits (ASICs). To reduce this gap, considerable research has focused on identifying more flexible and efficient coarse-grained blocks, particularly for specialized application domains such as floating point computation. Implementing FP operations in a fine-grained FPGA consumes a large amount of resources, and a number of approaches to optimizing FP operations in FPGAs have been proposed. For this reason, the floating point unit is implemented in a hybrid FPGA, which provides coarse-grained elements.

A. Coarse-Grained Block
Bhuvanesh.N, Balasubramaniyan.A
137
A coarse-grained floating point unit consists of FAs, FMs, and WBs. The floating point adders and floating point multipliers are double precision (64-bit) and fully IEEE 754 compatible, including all four rounding modes. The four rounding modes are:

1) Round to nearest, where ties round to the nearest even value.

2) Round up (toward positive infinity), so negative results round toward zero.

3) Round down (toward negative infinity), so negative results round away from zero.

4) Round toward zero.
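The four rounding directions can be illustrated with a minimal Python sketch. It uses the standard `decimal` module, which rounds decimal rather than binary values, so it only demonstrates the direction of each mode and is not a model of the FA or FM hardware. Round to nearest is shown with ties to even, the IEEE 754 default. The names `MODES` and `round_all_modes` are introduced here for illustration only.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_CEILING, ROUND_FLOOR, ROUND_DOWN

# Map each rounding direction to the matching decimal-module constant.
MODES = {
    "nearest (ties to even)": ROUND_HALF_EVEN,
    "up (toward +infinity)": ROUND_CEILING,
    "down (toward -infinity)": ROUND_FLOOR,
    "toward zero": ROUND_DOWN,
}


def round_all_modes(value: str) -> dict:
    """Round a decimal string to an integer under each of the four modes."""
    return {
        name: int(Decimal(value).quantize(Decimal("1"), rounding=mode))
        for name, mode in MODES.items()
    }


if __name__ == "__main__":
    for v in ("2.5", "-2.5", "3.5"):
        print(v, round_all_modes(v))
```

For a negative tie such as -2.5, rounding up and rounding toward zero both give -2, while rounding down gives -3, which matches the descriptions of modes 2) and 3).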
Each wordblock contains N identical bit blocks, each consisting of two 4-input lookup tables and a reconfigurable register. The value of N depends on the bit width of the FPU. The bit blocks within a wordblock are all controlled by the same configuration bits, so they all perform the same function. A WB can efficiently implement operations such as fixed point addition and multiplexing. As a result, some bit-level operations in FP primitives can be supported inside the FPU, reducing the need for communication between the FPU and the fine-grained fabric. This avoids using the fine-grained routing resources of the FPGA for connections that can be implemented inside the FPU (see Fig. 1).
Fig. 1: Connecting WBs, FAs, and FMs into different coarse-grained FPUs.
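The idea of one shared configuration driving every bit position of a wordblock can be sketched in Python as follows. This is a simplified model under stated assumptions: it uses a single 4-input LUT per bit block (the text describes two LUTs and a register), it ignores carry chains, and the names `lut4`, `make_config`, `word_block`, `MUX`, and `XOR` are hypothetical.

```python
def lut4(config: int, a: int, b: int, c: int, d: int) -> int:
    """Evaluate a 4-input LUT whose 16-bit truth table is `config`."""
    index = a | (b << 1) | (c << 2) | (d << 3)
    return (config >> index) & 1


def make_config(func) -> int:
    """Build the 16-bit truth table for a 4-input Boolean function."""
    config = 0
    for index in range(16):
        a = index & 1
        b = (index >> 1) & 1
        c = (index >> 2) & 1
        d = (index >> 3) & 1
        config |= (func(a, b, c, d) & 1) << index
    return config


def word_block(config: int, n: int, a: int, b: int, c: int = 0, d: int = 0) -> int:
    """Apply the same LUT configuration to each of the n bit positions."""
    out = 0
    for i in range(n):
        bit = lut4(config, (a >> i) & 1, (b >> i) & 1, (c >> i) & 1, (d >> i) & 1)
        out |= bit << i
    return out


# 2:1 multiplexer: output b where select input c is 1, else a (d is unused).
MUX = make_config(lambda a, b, c, d: b if c else a)
# Bitwise XOR of a and b (c and d are unused).
XOR = make_config(lambda a, b, c, d: a ^ b)

if __name__ == "__main__":
    print(hex(word_block(MUX, 8, 0xA5, 0x3C, 0xFF)))
    print(hex(word_block(XOR, 8, 0xA5, 0x3C)))
```

Because the configuration is stored once and reused for all N bits, the word-level function changes by rewriting a single truth table rather than N separate ones.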


B. Interface
Fine-grained blocks can connect to coarse-grained resources in various ways. Coarse-grained blocks are generally large and have high I/O density, which disrupts the routing fabric, similar to the MegaRAM blocks in Altera devices. The traditional column-based hybrid FPGA architecture is not particularly suitable for large coarse-grained blocks; instead, coarse-grained blocks should 1) be square in aspect ratio, 2) be closely packed together, 3) be positioned near the center of the FPGA, and 4) have I/O pins arranged on all four sides of the block.


III. CLOCK GATING
Clock gating is a widely used technique for reducing dynamic power dissipation. It saves power by adding logic to a circuit to prune the clock tree. Pruning the clock disables portions of the circuitry so that the flip-flops in them do not have to switch states. Switching states consumes power; when a flip-flop is not being switched, its switching power consumption goes to zero and only leakage currents are incurred.
Clock gating can be added to a design in several ways: 1) It can be coded into the RTL as enable conditions that synthesis tools automatically translate into clock gating logic. 2) RTL designers can insert it manually (typically as module-level clock gating) by instantiating library-specific integrated clock gating (ICG) cells to gate the clocks of specific modules or registers. 3) Automated clock gating tools can insert it semi-automatically; these tools either insert ICG cells into the RTL or add enable conditions to the RTL code.
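The effect of gating can be illustrated with a small behavioural sketch in Python. It is not RTL and not the design evaluated in this paper: it simply counts how many clock edges reach a register with and without an enable-based gate, assuming that switching power scales with that count. The function name `simulate` and the example traces are hypothetical.

```python
def simulate(enables, data, gated):
    """Return (final register value, clock edges seen by the register).

    enables[i] is 1 when the register should capture data[i] on cycle i.
    """
    value = 0
    clock_edges = 0
    for en, d in zip(enables, data):
        if gated:
            # Gated clock: the edge reaches the flip-flops only when enabled.
            if en:
                clock_edges += 1
                value = d
        else:
            # Free-running clock: every edge reaches the flip-flops, and a
            # feedback multiplexer holds the old value when not enabled.
            clock_edges += 1
            value = d if en else value
    return value, clock_edges


if __name__ == "__main__":
    enables = [1, 0, 0, 1, 0, 0, 0, 0]
    data = [3, 9, 9, 7, 1, 1, 1, 1]
    print(simulate(enables, data, gated=False))
    print(simulate(enables, data, gated=True))
```

Both versions end with the same register contents, but the gated version delivers a clock edge only on the two enabled cycles, which is where the saving in dynamic power comes from.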
IV. FPU OPTIMIZATIONS AND METHODOLOGY
This section describes three types of optimization: optimization of the internal structure of the floating point unit, system-level optimization for density and flexibility, and the merging of FPUs into larger composite structures. The techniques described here are not restricted to our specific architecture; they can be extended to optimize other FPU architectures as well.

A. Internal Optimization of FPUs
An FPU consists of floating point adders and multipliers (FAs and FMs) as well as fixed point coarse-grained WBs. The first optimization determines the exact number of each type of subunit within each FPU, as well as the pattern in which these subunits are connected. Fig. 1 shows an example of two different potential FPU architectures with different internal structures.
The FPU structure can be derived by employing common subgraph extraction. In this methodology, a set of benchmarks is analyzed systematically, and patterns of FP operations that commonly appear in these circuits are extracted. Fig. 2 is an example of a common subgraph of two circuits (the dscg and bfly benchmarks).
Fig. 2: Common subgraph extraction for an FP application.
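As a rough illustration of looking for operation patterns shared by several circuits, the Python sketch below intersects the multisets of (producer type, consumer type) connections found in two dataflow graphs. This is a deliberately simplified stand-in, not the extraction algorithm used in this work: real common subgraph extraction matches connected structures rather than isolated edges. The function names and the two edge lists are hypothetical and are not taken from the actual benchmark circuits.

```python
from collections import Counter


def edge_patterns(edges):
    """Count (producer type, consumer type) pairs in a dataflow graph."""
    return Counter(edges)


def common_patterns(*graphs):
    """Keep the patterns present in every graph, with the minimum count."""
    common = edge_patterns(graphs[0])
    for g in graphs[1:]:
        common &= edge_patterns(g)  # multiset intersection
    return common


# Hypothetical dataflow edges; FM, FA, and WB denote the unit types.
circuit_a = [("FM", "FA"), ("FM", "FA"), ("FA", "WB"), ("WB", "FM")]
circuit_b = [("FM", "FA"), ("FA", "FA"), ("FA", "WB")]

if __name__ == "__main__":
    print(common_patterns(circuit_a, circuit_b))
```

In this toy case the shared patterns are one FM-to-FA connection and one FA-to-WB connection, which hints that a candidate FPU containing an FM feeding an FA feeding a WB would be usable by both circuits.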
A single unit that combines the common FP operations can then be extracted. Creating the FPU therefore requires considering both the fixed point and the floating point functions in the application circuits. In this approach, a multiplexer must be added at the input of the FA or FM in the common subgraph whenever a fixed point operation, such as FF, XOR, or AND, connects to the floating point adder or multiplier in one of the analyzed benchmarks. This multiplexer selects among an internal signal from a wordblock implementing a fixed point operation, an internal signal from an FM or FA implementing a floating point operation, and an external signal. Combining fixed point WBs with FAs and FMs leads to more efficient circuit implementations by reducing the slow communication between the FPU and the fine-grained fabric.

B. System-Level Optimizations
System-level optimization includes optimization for density. Since connections between the components in an FPU can be made locally, an FPU with more computational elements achieves a greater reduction in area. However, larger blocks may require more routing resources for connections to fine-grained blocks, and may reduce flexibility, since they are difficult to reuse in other applications.
Fig. 3: Flow of the selection of the highest-density FPU.
C. Optimization by Merging FPUs
The final type of optimization is the merging of two different types of FPUs into one composite FPU. The more distinct types of FPUs exist in an FPGA, the more placement constraints the architecture imposes, since each subgraph in the circuit may only be implementable in one of the FPU types. By combining FPUs into larger FPUs, these placement constraints may be relaxed, leading to more efficient implementations. Fig. 4 shows an example of merging grapp5 and grapp6 into a larger composite FPU.
Fig. 4: Merging grapp5 and grapp6 into a larger FPU.
There are two important considerations when merging FPUs. First, merging different floating point units into a composite floating point unit may reduce placement constraints and thereby reduce the overall wirelength of the implemented circuit. Because the positions of the various FPUs are fixed, only FPUs of the same type can be swapped during the placement stage. This leads to inflexible placement and may introduce long wires between FPUs. Merging different FPU types into a larger FPU may lead to a better placement and reduce the wirelength. Second, a larger FPU requires more chip area, which may increase the overall wirelength.
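Both considerations can be seen in a toy placement experiment, sketched below in Python. It exhaustively places four subgraphs on a row of four FPU sites and reports the minimum total Manhattan wirelength, first with two distinct FPU types at fixed positions and then with merged FPUs that accept either subgraph type. Everything here is an assumption made for illustration: the site layout, the type names "P" and "Q", the nets, and the `pitch` parameter that stands in for the extra area of a larger FPU.

```python
from itertools import permutations


def best_wirelength(slots, nodes, nets, pitch=1):
    """Exhaustively place nodes on slots and return the minimum wirelength.

    slots: list of ((x, y), set of subgraph types the slot can implement)
    nodes: list of subgraph types, one per subgraph to place
    nets:  list of (i, j) index pairs of connected nodes
    pitch: spacing between neighbouring slots (larger FPUs, larger pitch)
    """
    best = None
    for perm in permutations(range(len(slots)), len(nodes)):
        # Skip placements that put a subgraph on an incompatible FPU type.
        if any(nodes[n] not in slots[s][1] for n, s in enumerate(perm)):
            continue
        length = 0
        for i, j in nets:
            (x1, y1), (x2, y2) = slots[perm[i]][0], slots[perm[j]][0]
            length += pitch * (abs(x1 - x2) + abs(y1 - y2))
        if best is None or length < best:
            best = length
    return best


# Hypothetical row of four FPUs: two of type P on the left, two of type Q.
typed = [((0, 0), {"P"}), ((1, 0), {"P"}), ((2, 0), {"Q"}), ((3, 0), {"Q"})]
# The same row after merging: every FPU can implement P or Q subgraphs.
merged = [((x, 0), {"P", "Q"}) for x in range(4)]

nodes = ["P", "Q", "P", "Q"]  # four subgraphs to place
nets = [(0, 1), (2, 3)]       # each P subgraph feeds one Q subgraph

if __name__ == "__main__":
    print(best_wirelength(typed, nodes, nets))
    print(best_wirelength(merged, nodes, nets))
    print(best_wirelength(merged, nodes, nets, pitch=3))
```

With separate types, connected subgraphs are forced onto opposite sides of the row; with merged FPUs they can sit next to each other. Raising `pitch` shows the opposing effect: if the merged FPU is sufficiently larger, the wirelength advantage disappears.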

V. EVALUATION
This section introduces the FP benchmark circuits and evaluates the area and delay impact of the internal and system-level optimizations of coarse-grained FPUs based on common subgraphs. Finally, it optimizes the systems by merging different FPUs into a larger composite FPU.
A. FP Benchmark Circuits
To explore the design of a hybrid FPGA based on common subgraph extraction and synthesis, a set of FP designs is used as benchmark circuits. They are: 1) a datapath of four digital sine-cosine generators – dscg; 2) the basic computation of the fast Fourier transform, z = x + y, where the inputs x and y and the output z are all complex numbers – bfly; 3) four 4-tap finite impulse response filters – fir; 4) four circuits to solve ordinary differential equations – ode; 5) four 3 x 3 matrix multipliers – mm3; 6) a circuit to compute Monte Carlo simulations of interest rate model derivatives – bgm; 7) a circuit containing 5 FAs and 4 FMs – syn2; and 8) a circuit containing 25 FAs and 25 FMs – syn7. The last two are synthetic benchmark circuits produced by a synthetic benchmark circuit generator. These eight double-precision floating point benchmark circuits cannot be implemented efficiently in fine-grained FPGAs; they are efficient only when implemented with coarse-grained elements.

VI. CONCLUSION


This paper presented a methodology for determining optimized coarse-grained FPUs in hybrid FPGAs based on common subgraph extraction. The internal and system-level optimizations of the FPU were explored. Low power can also be achieved in the FPU by using clock gating. The main findings for the FPU architecture are: 1) the speed of the system is highest for implementations involving only FAs; 2) higher-density subgraphs produce a greater reduction in area; and 3) merging FPUs can improve the speed of hybrid FPGAs.