 Open Access
 Authors : Naga Valli Vegesna, V. Srinivasa Rao
 Paper ID : IJERTCONV1IS03004
 Volume & Issue : AAMT – 2013 (Volume 1 – Issue 03)
 Published (First Online): 30-07-2018
 ISSN (Online) : 2278-0181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Hardware Implementation of Double-Precision Floating-Point Reciprocator on FPGA
Naga Valli Vegesna, Dept. of ECE, Shri Vishnu Engineering College for Women, Email: valli.vegesna@gmail.com
V. Srinivasa Rao, Dept. of ECE, Shri Vishnu Engineering College for Women, Email: vemu1974@gmail.com
Abstract: Floating-point division is generally regarded as a low-frequency, high-latency operation in typical floating-point applications, and as a result it has received relatively little development effort. However, floating-point dividers have become important in many modern applications. Most previous implementations require large area and long latencies. In this paper we present an efficient FPGA implementation of a reciprocator for double-precision floating-point numbers. The method is based on small lookup tables and partial block multipliers. The resulting modules occupy less area and have lower latency.
Keywords: Double precision, floating-point arithmetic, reciprocator, partial block multipliers, FPGA.

INTRODUCTION
Floating-point arithmetic operations such as addition, multiplication, division and square root are widely used in modern applications, mainly in scientific computing and signal processing. The greater dynamic range and the lack of any need to scale the numbers make the development of algorithms much easier. Implementing these operations in hardware, however, is difficult, and division is the most difficult of them. To simplify floating-point division, we use the binomial expansion method.
The IEEE standard for floating point (IEEE 754) defines the format of the numbers and also specifies various rounding modes that determine the accuracy of the result. For many signal processing and graphics applications, it is acceptable to trade off some accuracy for faster and better implementations.
Many algorithms have been developed for division, including subtractive methods, functional iteration, digit recurrence, seed architectures that use multipliers, and faster schemes such as high-radix algorithms. However, these algorithms use large lookup tables along with wide multipliers, which affects area and performance.
Our approach focuses on finding the reciprocal. It is based on the well-known binomial expansion, contains a small lookup table, and uses partial block multipliers, resulting in less area and less delay. We consider only normalized numbers. All exceptional cases are detected and indicated as invalid input/output. Compared with other methods, the binomial expansion method leads to efficient hardware.
We use the Xilinx ISE synthesis tool, the ModelSim 6.4c simulation tool, and an FPGA as our platform.

APPROACH
The format of a floating-point number is shown in Fig. 1 for single precision and double precision.
Fig. 1. Format for single and double precision
In this paper we do not discuss the exponent manipulation, as it is a standard process. The benefits of our implementation are in the computation of the inverse of the mantissa.
Let X be the inverse of the mantissa. Then
$$X = \frac{1}{1.a},$$
where the leading 1 in 1.a is the hidden bit of the mantissa.
We divide the mantissa into two parts, $a_1$ and $a_2$. $a_1$ is used to fetch precalculated data from a lookup table. Now, since
$$X = \frac{1}{a_1 + a_2} = (a_1 + a_2)^{-1} = a_1^{-1} - a_1^{-2} a_2 + a_1^{-3} a_2^{2} - a_1^{-4} a_2^{3} + \cdots \qquad (1)$$
The content of each term of equation (1) is as follows:
$a_1^{-1}$ = 0.xx···xx (full significant bits)
$a_1^{-2} a_2$ = 0.00···00 xx···xx (m zero bits, then significant bits)
$a_1^{-3} a_2^{2}$ = 0.00···00 xx···xx (2m zero bits, then significant bits)
$a_1^{-4} a_2^{3}$ = 0.00···00 xx···xx (3m zero bits, then significant bits)
and so on, where m is the number of bits of $a_1$.
Higher-order terms contribute progressively less to the result. Thus, depending on the required precision and on the value of m, only a limited number of terms of equation (1) need to be taken to calculate the inverse.
For our implementation, based on experiments over a large number of random test cases, we chose the number of terms as follows: the first three terms for single precision and seven terms for double precision. The value of m is 8 in both cases. These values were selected based on the available FPGA. We have simplified the desired terms so that they need less hardware while keeping low latency and good accuracy.
Fig. 2. Architecture for single-precision floating-point reciprocator. [Four-stage datapath from $a_1$ (8 bits) and $a_2$ (15 bits) to the 23-bit mantissa: Block RAM in stage 1, two 17-bit multipliers in stage 2, a 17-bit multiplier and a 30-bit subtractor in stage 3, and a 30-bit adder in stage 4.]
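The choice of three and seven terms can be checked numerically. The Python sketch below is only an illustration and is not the authors' C program: it uses exact rational arithmetic, and it assumes that $a_1$ consists of the hidden bit plus the top m = 8 fraction bits while $a_2$ holds the remaining low bits. The function names (`split_mantissa`, `truncated_reciprocal`, `truncation_error`) are introduced here for illustration.

```python
from fractions import Fraction

M = 8  # number of fraction bits assigned to a1 (value of m chosen in the text)

def split_mantissa(frac, width, m=M):
    """Split the mantissa 1.frac (frac is a width-bit integer) into a1 + a2.

    Assumption: a1 = hidden 1 plus the top m fraction bits, a2 = the low bits.
    """
    a1 = 1 + Fraction(frac >> (width - m), 1 << m)
    a2 = Fraction(frac & ((1 << (width - m)) - 1), 1 << width)
    return a1, a2

def truncated_reciprocal(a1, a2, terms):
    """Sum the first `terms` terms of a1^-1 - a1^-2 a2 + a1^-3 a2^2 - ..."""
    inv_a1 = 1 / a1
    term, total = inv_a1, Fraction(0)
    for _ in range(terms):
        total += term
        term *= -a2 * inv_a1   # next term: multiply by -(a2 / a1)
    return total

def truncation_error(frac, width, terms):
    """Absolute error of the truncated series against the exact reciprocal."""
    a1, a2 = split_mantissa(frac, width)
    return abs(1 / (a1 + a2) - truncated_reciprocal(a1, a2, terms))
```

The truncation error is largest when $a_1 = 1$ and all bits of $a_2$ are set, because that maximises $a_2 / a_1$. For that input, three terms leave an error just below $2^{-24}$ for a 23-bit mantissa and seven terms leave an error just below $2^{-56}$ for a 52-bit mantissa, whereas six terms would leave about $2^{-48}$, which is too coarse for double precision.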

For single precision we take the first three terms:
$$Y = a_1^{-1} - a_1^{-2} a_2 + a_1^{-3} a_2^{2}$$
For double precision we take seven terms and factor them as
$$Y = a_1^{-1} - a_1^{-1}\left[\left(a_1^{-1} a_2 - a_1^{-2} a_2^{2}\right)\left(1 + a_1^{-2} a_2^{2} + a_1^{-4} a_2^{4}\right)\right] \qquad (2)$$
Equation (2) can be simplified a little further, but doing so affects area, latency and accuracy. The accuracy is affected because floating-point operations do not obey the distributive law exactly, i.e. u(v + w) may not be exactly equal to (uv + uw). This is due to the finite number of bits used to represent the numbers.


IMPLEMENTATION
The single-precision and double-precision floating-point reciprocal algorithms are implemented using the binomial expansion method with small lookup tables and partial block multipliers, resulting in less area and less delay. We show the implementations for single precision and double precision separately, as different issues arise in each case.
First, the required decimal number is converted into an IEEE 754 floating-point number using an IEEE 754 decimal-to-floating-point converter, and the mantissa is then inverted.

Single-precision Floating-point
The architecture of the single-precision floating-point reciprocator is shown in Fig. 2. It includes a block memory (BRAM) which contains precalculated values of $a_1^{-1}$ (24 bits), $a_1^{-2}$ (17 bits) and $a_1^{-3}$ (17 bits) in a single 58-bit data word, with the 8 bits of $a_1$ as address bits. The contents of the BRAM were calculated using a separate program written in C, with the float data type for the numbers. The content of $a_1^{-1}$ is extended to 30 bits (by appending the 6 bits 111111 at the least significant bits (LSBs)) for addition/subtraction purposes. The same operation can also be done with only the value of $a_1^{-1}$ stored, but that increases the total operation latency and the size of the multipliers. In both cases only a single BRAM on the FPGA is used, so we prefer the first approach.
The architecture has a latency of four, though the BRAM access can be included in the first stage with a slight loss in maximum operating frequency. By using pipelined multipliers we can approximately double the overall frequency. We show results with a latency of four. Our aim here is only to show how little hardware is needed; the given architecture can be pipelined very easily.
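A minimal fixed-point sketch of the single-precision datapath follows. It is an illustration under stated assumptions rather than the authors' design: the table is built with exact rationals instead of the C float program, the entry for $a_1 = 1$ is saturated to all ones so that it fits the stated widths, the alignment and truncation of intermediate products are chosen here for simplicity, and the stage assignment in the comments is one reading of the labels in Fig. 2. The names `build_bram` and `reciprocal_mantissa` are hypothetical.

```python
from fractions import Fraction

M, W = 8, 23   # a1 address bits and mantissa width (from the text)
Q = 30         # fractional bits of the add/subtract datapath (30-bit adder)

def build_bram():
    """256 words holding a1^-1 (24 bits), a1^-2 (17 bits), a1^-3 (17 bits)."""
    table = []
    for addr in range(1 << M):
        a1 = 1 + Fraction(addr, 1 << M)
        # Truncate to the stated widths; saturate the a1 = 1 entry (assumption).
        inv1 = min(int((1 << 24) / a1), (1 << 24) - 1)
        inv2 = min(int((1 << 17) / a1**2), (1 << 17) - 1)
        inv3 = min(int((1 << 17) / a1**3), (1 << 17) - 1)
        table.append((inv1, inv2, inv3))
    return table

BRAM = build_bram()

def reciprocal_mantissa(frac):
    """Return Y ~ 1/(1.frac) as an integer with Q fractional bits."""
    addr = frac >> (W - M)                # 8 bits of a1: BRAM address
    a2 = frac & ((1 << (W - M)) - 1)      # 15-bit a2, weight 2^-23
    # Stage 1: table lookup; a1^-1 is widened to 30 bits by appending 111111.
    inv1, inv2, inv3 = BRAM[addr]
    inv1_ext = (inv1 << 6) | 0b111111
    # Stage 2: two multipliers.
    p1 = (inv2 * a2) >> (17 + W - Q)      # a1^-2 * a2, rescaled to Q30
    a2_sq = (a2 * a2) >> 13               # a2^2 kept to 17 bits, weight 2^-33
    # Stage 3: one multiplier and the subtractor.
    p2 = (inv3 * a2_sq) >> (17 + 33 - Q)  # a1^-3 * a2^2, rescaled to Q30
    diff = inv1_ext - p1
    # Stage 4: adder.
    return diff + p2
```

In this model the result stays within $2^{-22}$ of the exact reciprocal for every mantissa: the series truncation contributes less than $2^{-24}$, the 24-bit table entry less than $2^{-24}$, and the 17-bit entry for $a_1^{-2}$ less than about $2^{-25}$. The bound applies to this sketch only and says nothing about the bit-exact behaviour of the hardware.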

Double-precision Floating-point
The architecture of the double-precision floating-point reciprocator is shown in Fig. 3. It also includes a single BRAM, which contains precalculated values of only $a_1^{-1}$ (54 bits), with the 8 bits of $a_1$ as address bits. The content of the BRAM was calculated using a C program, with double as the data type of the floating-point numbers. The content of $a_1^{-1}$ is extended to 60 bits (by appending the 6 bits 111111 at the LSBs) for addition/subtraction purposes. This gives a large saving in block memory compared to other methods discussed later.
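The following Python sketch illustrates how the factored form $Y = a_1^{-1} - a_1^{-1}[(a_1^{-1} a_2 - a_1^{-2} a_2^{2})(1 + a_1^{-2} a_2^{2} + a_1^{-4} a_2^{4})]$ can be evaluated from a table that stores only $a_1^{-1}$. It is a sketch under assumptions: the table is generated with exact rationals rather than the C double program, the $a_1 = 1$ entry is saturated to fit 54 bits, all multiplications are exact (the partial multipliers are not modelled), and the mapping of operations to stages in the comments is inferred from the stage labels of the architecture figure. The names `build_bram`, `reciprocal_eq2` and `reciprocal_series` are hypothetical.

```python
from fractions import Fraction

M, W = 8, 52   # a1 address bits and mantissa width (from the text)

def build_bram():
    """256 entries of a1^-1: 54 bits, widened to 60 bits by appending 111111."""
    table = []
    for addr in range(1 << M):
        a1 = 1 + Fraction(addr, 1 << M)
        inv54 = min(int((1 << 54) / a1), (1 << 54) - 1)   # saturate a1 = 1 entry
        table.append((inv54 << 6) | 0b111111)
    return table

BRAM = build_bram()

def operands(frac):
    """Look up a1^-1 and extract a2 for the mantissa 1.frac (52-bit frac)."""
    inv = Fraction(BRAM[frac >> (W - M)], 1 << 60)
    a2 = Fraction(frac & ((1 << (W - M)) - 1), 1 << W)
    return inv, a2

def reciprocal_eq2(frac):
    """Evaluate the factored seven-term form with three products of inputs."""
    inv, a2 = operands(frac)       # stage 1: BRAM lookup
    t = inv * a2                   # stage 2: a1^-1 a2
    t2 = t * t                     # stage 3: a1^-2 a2^2
    t4 = t2 * t2                   # stage 4: a1^-4 a2^4 (and t - t2)
    bracket = (t - t2) * (1 + t2 + t4)   # stages 5 and 6
    return inv - inv * bracket     # stages 7 and 8

def reciprocal_series(frac, terms=7):
    """Same value written as the plain alternating series, for comparison."""
    inv, a2 = operands(frac)
    return sum(inv * (-inv * a2) ** k for k in range(terms))
```

With exact products the factored form and the seven-term series agree identically, and the only errors left are the table quantisation (below $2^{-54}$) and the series truncation (below $2^{-56}$), so the result of this sketch lies within $2^{-53}$ of the exact reciprocal. The table itself needs 256 × 54 bits, which is why a single BRAM suffices.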
Three types of multiplier, all based on the Xilinx MULT18x18 block, are used. The second, third and seventh stages each have a 51-bit partial multiplier, shown in Fig. 4. It uses only six MULT18x18 blocks instead of nine to produce more than 52 bits (MSBs) of correct result, which is all that we need. Stage six is also a 51-bit partial multiplier, but because of the specific nature of its input (17 bits of the first input are 0x10000 in hex), it needs only three MULT18x18 blocks (Fig. 5). The fourth-stage multiplier is a 34-bit full multiplier; instead of using an IP core for it, we designed it using four MULT18x18 blocks (shown in Fig. 6), which takes less glue logic (about 2/3) and is faster than the IP core available from Xilinx. The overall latency of the module is eight, which can be increased further by pipelining, as discussed for the single-precision case, for better performance.
Fig. 3. Architecture for double-precision floating-point reciprocator. [Eight-stage datapath from $a_1$ (8 bits) and $a_2$ (44 bits) to the 52-bit mantissa: Block RAM lookup of $a_1^{-1}$ in stage 1, 51-bit partial block multipliers in stages 2, 3 and 7, a 34-bit full block multiplier and a 60-bit subtractor in stage 4, a 60-bit adder in stage 5, a 51-bit reduced partial block multiplier in stage 6, and a 60-bit subtractor in stage 8.]
Fig. 4. Partial 51-bit multiplier for stages 2, 3 and 7. [Each 51-bit operand A and B is split into three 17-bit blocks; six of the nine 34-bit block products are summed using six MULT18x18 blocks, and the remaining part is ignored.]
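The idea of the 51-bit partial block multiplier can be sketched in Python as follows. This is an illustration only: it assumes that the six retained block products are the six most significant ones and that no further bits of the retained products are discarded, which may differ from the exact wiring of the authors' multiplier. The names `split3`, `partial_mul51` and `dropped_part` are hypothetical.

```python
BLOCK = 17
MASK = (1 << BLOCK) - 1

def split3(x):
    """Split a 51-bit integer into (hi, mid, lo) blocks of 17 bits each."""
    return (x >> (2 * BLOCK)) & MASK, (x >> BLOCK) & MASK, x & MASK

def partial_mul51(a, b):
    """Approximate a * b using six of the nine 17x17 block products."""
    a_hi, a_mid, a_lo = split3(a)
    b_hi, b_mid, b_lo = split3(b)
    acc = (a_hi * b_hi) << (4 * BLOCK)                              # weight 2^68
    acc += (a_hi * b_mid + a_mid * b_hi) << (3 * BLOCK)             # weight 2^51
    acc += (a_hi * b_lo + a_mid * b_mid + a_lo * b_hi) << (2 * BLOCK)  # weight 2^34
    # Ignored: a_mid*b_lo and a_lo*b_mid (weight 2^17) and a_lo*b_lo (weight 1).
    return acc

def dropped_part(a, b):
    """Difference between the full 102-bit product and the partial product."""
    return a * b - partial_mul51(a, b)
```

Each retained product fits one 17 × 17 multiplication, which a MULT18x18 block can perform. In this sketch the ignored products sum to less than $2^{53}$, so the partial product never exceeds the full 102-bit product and differs from it only in the low-order part, at the cost of six block multiplications instead of nine.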


RESULTS
Hardware utilization and performance of both the single-precision and double-precision units are shown in Table I. Since our implementation neglects some of the lower-order bits in the computation, it is important to estimate the impact of this on the overall accuracy of the results. For the error performance, 5 million randomly generated test cases were used to check the errors. The error performance is shown in Table II for both versions of floating-point numbers. The error was obtained by comparing results from the proposed module with the results produced by a C compiler on a workstation. In all cases, the maximum error for single precision was 2 ulp (unit in the last place), while for double precision it was 1 ulp. These errors are without rounding.
Fig. 5. Partial 51-bit multiplier for stage 6. [One 17-bit block of operand A is 0x10000, so its products with B reduce to the shifted values {B1, 0x0000}, {B2, 0x0000} and {B3, 0x0000}; the retained part is summed using only three MULT18x18 blocks.]
Fig. 6. 34-bit block multiplier for stage 4
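The ulp-based error measurement described in the results can be sketched in Python as below. This is only an outline of the methodology, not the authors' test bench: the exact reference is computed with rationals rather than by a C compiler, and `truncated_reciprocal_frac` is a hypothetical stand-in for the unit under test that simply truncates the exact result without rounding. All names are introduced here for illustration.

```python
import random
from fractions import Fraction

def ulp_error(result_frac, x_frac, width):
    """Error in ulps of the result mantissa 1.result_frac, compared with the
    normalised reciprocal 2 / (1.x_frac) of the input mantissa 1.x_frac."""
    scale = 1 << width
    exact = 2 / (1 + Fraction(x_frac, scale))   # lies in (1, 2) for x_frac > 0
    got = 1 + Fraction(result_frac, scale)
    return abs(got - exact) * scale

def truncated_reciprocal_frac(x_frac, width):
    """Hypothetical unit under test: exact reciprocal, truncated, not rounded."""
    scale = 1 << width
    exact = 2 / (1 + Fraction(x_frac, scale))
    return int((exact - 1) * scale)

def max_ulp_error(width, cases, seed=0):
    """Largest ulp error over `cases` random mantissas."""
    rng = random.Random(seed)
    worst = Fraction(0)
    for _ in range(cases):
        x = rng.randrange(1, 1 << width)   # skip 1.0, whose reciprocal is exact
        worst = max(worst, ulp_error(truncated_reciprocal_frac(x, width), x, width))
    return worst
```

Calling `max_ulp_error(23, n)` or `max_ulp_error(52, n)` with the stand-in replaced by the outputs of a real module gives the kind of maximum-error figure reported above; with the truncating stand-in the error always stays below 1 ulp.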

CONCLUSION
We have implemented an efficient reciprocal unit on FPGA for both single- and double-precision floating-point numbers. The method truncates the binomial expansion and neglects the least significant partial products in the block multiplications to reduce the number of multipliers. At the same time, the lookup table requirements are kept to a minimum and are the smallest reported in the literature for a double-precision implementation. The initial latency of our module is also low (4 for single precision and 8 for double precision) at a promising frequency, which can be improved easily by pipelining. The error performance is also within an acceptable range (1 ulp for double precision). The implementation can thus form a useful core for hardware dividers, especially in applications such as signal processing that can tolerate inaccuracies in the least significant bits.
[Result figures: single precision, double precision, and resource comparison]
