FPGA Implementation of Single Precision Floating Point Multiplier using Booth Recoding Algorithm

doi:https://doi.org/10.5281/zenodo.19185374

Volume 15, Issue 03 (March 2026)

Authors : Prince Mishra
Paper ID : IJERTV15IS030860
Volume & Issue : Volume 15, Issue 03 , March – 2026
Published (First Online): 23-03-2026
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License


Prince Mishra

Electronics and Instrumentation Engineering Odisha University of Technology and Research, Bhubaneswar, India

Abstract- This paper presents the implementation of a single precision floating point multiplier conforming to the IEEE 754 standard. By reducing the number of partial product generation and addition units, the implementation offers faster results, lower power consumption and reduced utilization of hardware resources. The implementation also handles multiplication of both signed and unsigned numbers. The paper presents a comparative analysis with a 32-bit multiplier in terms of power consumption and FPGA hardware resource utilization. The proposed 32-bit multiplier is designed in Verilog HDL and implemented using Xilinx Vivado 2025.1 for a Xilinx Virtex-7 FPGA.

Keywords- FPGA; Booth Recoding; Single Precision; signed multiplier; Verilog HDL.

INTRODUCTION

Floating-point multipliers serve as critical computational kernels in high-performance digital signal processing (DSP) architectures, specifically within Finite Impulse Response (FIR) and Infinite Impulse Response (IIR) filters.

FLOATING POINT MULTIPLIER ALGORITHM

In accordance with the IEEE 754 standard, a 32-bit single-precision floating-point datum is partitioned into three distinct functional components: a 1-bit sign (S), an 8-bit biased exponent (E), and a 23-bit fractional mantissa (M). The multiplication of two such operands involves concurrent, independent operations across these fields to derive the product. The steps are as follows.
1. Calculate the sign bit, i.e. SA XOR SB.
2. Calculate the exponent by adding the exponents EA and EB and then subtracting the bias of 127 to get the final exponent, i.e. EA+EB-127.
3. Prepend an implicit leading bit to each 23-bit mantissa to form 24-bit operands. These are processed through a Booth recoding multiplier to generate a 48-bit intermediate product.
4. Normalize the result to get the required 23-bit mantissa, ensuring the output adheres to the standard format.
5. Combine the calculated sign, exponent and mantissa components to get the desired multiplication result.
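The five steps above can be sketched in software as follows. This is a minimal illustration, not the paper's Verilog design: the function names are hypothetical, Python's built-in integer multiply stands in for the Booth recoding multiplier, the mantissa is truncated rather than rounded, and zero, subnormal, infinity, NaN and exponent overflow cases are assumed not to occur.

```python
import struct

BIAS = 127  # exponent bias of IEEE 754 single precision


def float_to_bits(x: float) -> int:
    """Return the 32-bit single precision pattern of x."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def bits_to_float(bits: int) -> float:
    """Return the float encoded by a 32-bit single precision pattern."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def fp32_multiply(a_bits: int, b_bits: int) -> int:
    """Multiply two normal single precision operands given as bit patterns."""
    # Step 1: sign bit is SA XOR SB.
    sign = ((a_bits >> 31) ^ (b_bits >> 31)) & 1
    # Step 2: exponent is EA + EB - 127.
    exponent = ((a_bits >> 23) & 0xFF) + ((b_bits >> 23) & 0xFF) - BIAS
    # Step 3: restore the implicit leading 1 and form the 48-bit product.
    man_a = (a_bits & 0x7FFFFF) | 0x800000
    man_b = (b_bits & 0x7FFFFF) | 0x800000
    product = man_a * man_b  # stand-in for the Booth recoding multiplier
    # Step 4: normalize on bit 47 of the product (truncation, no rounding).
    if (product >> 47) & 1:
        mantissa = (product >> 24) & 0x7FFFFF
        exponent += 1
    else:
        mantissa = (product >> 23) & 0x7FFFFF
    # Step 5: reassemble sign, exponent and mantissa.
    return (sign << 31) | ((exponent & 0xFF) << 23) | mantissa
```

For example, `bits_to_float(fp32_multiply(float_to_bits(1.5), float_to_bits(2.5)))` evaluates to `3.75`; results whose exact product needs more than 24 significant bits may differ from a rounding implementation in the last bit.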

Figure 1. Single precision floating point representation (Sign: bit 31, Exponent: bits 30 to 23, Mantissa: bits 22 to 0)

DESIGN OF BOOTH RECODING MULTIPLIER

The multiplication involves concurrent XOR-based sign processing, biased exponent addition (EA+EB-127), and normalized mantissa multiplication. To address the hardware complexity of partial product summation in high-order filters, the proposed architecture employs the Booth Recoding algorithm. This optimizes the 24-bit mantissa multiplication by reducing partial product rows, significantly enhancing throughput and lowering power consumption.

24×24 bit multiplier

The proposed Booth recoding multiplier architecture accepts two 24-bit inputs, serving as the multiplier and multiplicand. Two control signals are included to specify whether the multiplier and multiplicand are treated as signed or unsigned integers.

The paper is organized as follows. Section II presents the floating point multiplier algorithm. Section III presents the design of the Booth recoding multiplier. Section IV describes the proposed architecture and its implementation. Sections V and VI present partial product generation and addition, respectively. Section VII contains the proposed architecture for the multiplier. Results are presented in Section VIII. Finally, the conclusion in Section IX marks the end of the paper.

Figure 2. 24-bit Booth recoding multiplier top module (inputs mplier, mplier_s_u, mplicand, mplicand_s_u; output prod)

Signal Name	Width	Source	Description
mplier	24	input	Top module multiplier input
mplier_s_u	1	input	1 = multiplier is signed, 0 = multiplier is unsigned
mplicand	24	input	Top module multiplicand input
mplicand_s_u	1	input	1 = multiplicand is signed, 0 = multiplicand is unsigned
prod		output	Output from the multiplier block

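Table 1. Signal description of top module

Mathematical Representation of Signed Number

The 2's complement representation of a 24-bit integer A is

$$A = -2^{23}a_{23} + 2^{22}a_{22} + (2-1)2^{21}a_{21} + 2^{20}a_{20} + (2-1)2^{19}a_{19} + \dots + 2^{2}a_{2} + (2-1)2^{1}a_{1} + 2^{0}a_{0} + 2^{0}a_{-1}$$

where $a_{-1} = 0$. Expanding each $(2-1)$ factor gives

$$\begin{aligned}
A ={}& -2^{23}a_{23} + 2^{22}a_{22} + 2^{22}a_{21} \\
& -2^{21}a_{21} + 2^{20}a_{20} + 2^{20}a_{19} \\
& -2^{19}a_{19} + 2^{18}a_{18} + 2^{18}a_{17} \\
& \;\;\vdots \\
& -2^{3}a_{3} + 2^{2}a_{2} + 2^{2}a_{1} \\
& -2^{1}a_{1} + 2^{0}a_{0} + 2^{0}a_{-1}
\end{aligned}$$

$$\begin{aligned}
A ={}& 2^{22}(-2a_{23} + a_{22} + a_{21}) + 2^{20}(-2a_{21} + a_{20} + a_{19}) + 2^{18}(-2a_{19} + a_{18} + a_{17}) \\
& + \dots + 2^{2}(-2a_{3} + a_{2} + a_{1}) + 2^{0}(-2a_{1} + a_{0} + a_{-1})
\end{aligned}$$

$$A = \sum_{i=0}^{11} 2^{2i} f_{2i} \tag{1}$$

where $f_{2i} = -2a_{2i+1} + a_{2i} + a_{2i-1}$ and $a_{-1} = 0$.

Mathematical representation of unsigned numbers

For an unsigned 24-bit integer A, extended with a 25th bit $a_{24}$ that is zero, and with $a_{-1} = a_{-2} = 0$:

$$\begin{aligned}
A ={}& -2^{24}a_{24} + 2^{23}a_{23} + 2^{23}a_{22} \\
& -2^{22}a_{22} + 2^{21}a_{21} + 2^{21}a_{20} \\
& -2^{20}a_{20} + 2^{19}a_{19} + 2^{19}a_{18} \\
& \;\;\vdots \\
& -2^{2}a_{2} + 2^{1}a_{1} + 2^{1}a_{0} \\
& -2^{0}a_{0} + 2^{-1}a_{-1} + 2^{-1}a_{-2}
\end{aligned}$$

$$\begin{aligned}
A ={}& 2^{23}(-2a_{24} + a_{23} + a_{22}) + 2^{21}(-2a_{22} + a_{21} + a_{20}) \\
& + \dots + 2^{1}(-2a_{2} + a_{1} + a_{0}) + 2^{-1}(-2a_{0} + a_{-1} + a_{-2})
\end{aligned}$$

$$A = \sum_{i=0}^{12} 2^{2i-1} f_{2i-1} \tag{2}$$

where $f_{2i-1} = -2a_{2i} + a_{2i-1} + a_{2i-2}$.

By comparing Equations (1) and (2), it becomes evident that integer A can be preprocessed prior to the recoding of both unsigned and signed integers.

Table 2. Booth recoding truth table

a2i+1	a2i	a2i-1	f2i	F (+/-)	F1 (x 1/0)	F2 (x 2/0)
0	0	0	0	0	0	0
0	0	1	1	0	1	0
0	1	0	1	0	1	0
0	1	1	2	0	1	1
1	0	0	-2	1	1	1
1	0	1	-1	1	1	0
1	1	0	-1	1	1	0
1	1	1	0	0	0	0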
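Equations (1) and (2) can be checked numerically with a short script. This is only a sanity check of the two recoding identities; the helper names are hypothetical, and Equation (2) is accumulated at twice its value so that the $2^{-1}$ weight of the lowest block stays an integer.

```python
def bit(a: int, k: int, width: int = 24) -> int:
    """Bit a_k of a width-bit pattern; positions outside 0..width-1 read as 0."""
    return (a >> k) & 1 if 0 <= k < width else 0


def signed_from_digits(a: int) -> int:
    """Equation (1): sum of 2^(2i) * f_2i for i = 0..11, with a_-1 = 0."""
    total = 0
    for i in range(12):
        f = -2 * bit(a, 2 * i + 1) + bit(a, 2 * i) + bit(a, 2 * i - 1)
        total += (1 << (2 * i)) * f
    return total


def unsigned_from_digits(a: int) -> int:
    """Equation (2): sum of 2^(2i-1) * f_(2i-1) for i = 0..12, with a_24 = 0."""
    doubled = 0  # holds 2 * A so that every weight is an integer
    for i in range(13):
        f = -2 * bit(a, 2 * i) + bit(a, 2 * i - 1) + bit(a, 2 * i - 2)
        doubled += (1 << (2 * i)) * f
    return doubled // 2
```

For the all-ones pattern `0xFFFFFF`, `signed_from_digits` returns -1 (the 2's complement reading) while `unsigned_from_digits` returns 16777215, which is the distinction the 25th bit extension is meant to preserve.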

SUB-MODULES OF IMPLEMENTED BOOTH RECODING MULTIPLIER

25th bit extension unit

The proposed architecture employs a unified signed multiplier core to support both signed and unsigned arithmetic operations. To preserve the full dynamic range during unsigned multiplication and prevent magnitude truncation, each 24-bit operand is extended by one bit at the most significant bit (MSB) position. For unsigned operands, this 25th bit is set to zero. For signed operands, the 25th bit replicates the 24th bit (MSB) to preserve the two's complement value. This preprocessing stage ensures that the subsequent Booth recoding logic can process both number formats using a single, hardware-efficient internal data path.

Table 3. Extended 26 bit extension unit

Extra bit added for block formation	Extended 25 bit number (result of 25th bit extension unit)	0 added in the LSB
a[24]	a[24:0]	0

The preprocessing stage concludes with the operand expanded to a total width of 27 bits. This transformation is achieved through combinational rewiring to form overlapping F blocks, without any additional hardware.

Table 4. F block formation

F0	{a1, a0, 0}
F2	{a3, a2, a1}
F4	{a5, a4, a3}
F6	{a7, a6, a5}
F8	{a9, a8, a7}
F10	{a11, a10, a9}
F12	{a13, a12, a11}
F14	{a15, a14, a13}
F16	{a17, a16, a15}
F18	{a19, a18, a17}
F20	{a21, a20, a19}
F22	{a23, a22, a21}
F24	{a24, a24, a23}

Figure 3. 25th bit extension unit block diagram
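The behaviour of the 25th bit extension unit can be modelled in a few lines. This is a behavioural sketch under the assumption that the unit acts as a 2:1 multiplexer selecting either bit 23 or zero according to the signed/unsigned flag; the function names are hypothetical and do not come from the paper's Verilog source.

```python
def extend_25(x24: int, s_u: int) -> int:
    """Return the 25-bit pattern {ext, x24}: ext = x24[23] if s_u is 1, else 0."""
    msb = (x24 >> 23) & 1
    ext = msb if s_u else 0  # multiplexer controlled by the s_u flag
    return (ext << 24) | (x24 & 0xFFFFFF)


def value_25(x25: int) -> int:
    """Interpret a 25-bit pattern as a two's complement integer."""
    return x25 - (1 << 25) if (x25 >> 24) & 1 else x25
```

With `s_u = 0` the pattern `0xFFFFFF` keeps its unsigned value 16777215, and with `s_u = 1` the same pattern reads as -1, so one signed data path can serve both formats.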

Preprocessing/F block unit formation

The preprocessing unit conditions the operand for the Booth recoding algorithm to minimize the partial product count. In this 24-bit architecture, the 25-bit operand is first extended by appending a logic "0" at the least significant bit (LSB) position, establishing the essential reference bit (a_-1 = 0) for the initial Booth recoding cycle. This modified 26-bit sequence is then partitioned into overlapping three-bit groupings, designated as "F-blocks", where the most significant bit (MSB) of each block serves as the LSB of the subsequent group. To achieve bit-width alignment for the final three-bit grouping, an additional bit is appended at the MSB position, extending the sequence to 27 bits. This final MSB is a direct replication of the 25th bit, a technique that ensures sign integrity and preserves the numerical value for both signed and unsigned formats during the partial product generation phase.
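The rewiring described above can be illustrated as follows. The sketch assumes the 27-bit word is {a[24], a[24:0], 0} and that block F(2i) is the triplet {a(2i+1), a(2i), a(2i-1)}; `f_blocks` is a hypothetical helper name.

```python
from typing import List, Tuple


def f_blocks(x25: int) -> List[Tuple[int, int, int]]:
    """Split a 25-bit operand into the 13 overlapping blocks F0, F2, ..., F24."""
    msb = (x25 >> 24) & 1
    # 27-bit word: replicated 25th bit, the 25-bit operand, and a 0 at the LSB.
    x27 = (msb << 26) | ((x25 & 0x1FFFFFF) << 1)
    blocks = []
    for i in range(13):
        group = (x27 >> (2 * i)) & 0b111  # bits a(2i+1), a(2i), a(2i-1)
        blocks.append(((group >> 2) & 1, (group >> 1) & 1, group & 1))
    return blocks
```

For the operand 6 (binary 110), the first two blocks are F0 = (1, 0, 0) and F2 = (0, 1, 1), i.e. the recoded digits -2 and +2, and -2 + 4 x 2 = 6.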

V. Partial Product Generation

The partial product generation unit uses the 13 designated "F" blocks to derive the corresponding control signals F, F1 and F2, based on the truth table logic defined in Table 2. These signals drive the hardware that manipulates the multiplicand bits to produce the necessary partial product rows. The hardware implementation for F, F1 and F2 is shown in Figure 4.

Figure 4. Hardware realization of F, F1 and F2 (inputs a2i-1, a2i and a2i+1)
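The control signals of Table 2 can be written as Boolean functions of the three block bits. The expressions below are one possible realization that reproduces the truth table; they are an assumption made for illustration and are not necessarily the gate structure drawn in Figure 4.

```python
def booth_signals(a_hi: int, a_mid: int, a_lo: int) -> tuple:
    """Return (F, F1, F2) for one block {a(2i+1), a(2i), a(2i-1)}."""
    F = a_hi & (1 - (a_mid & a_lo))             # negative digit: 100, 101, 110
    F1 = (a_hi ^ a_mid) | (a_mid ^ a_lo)        # non-zero digit: all but 000, 111
    F2 = (a_hi ^ a_mid) & (1 - (a_mid ^ a_lo))  # digit of magnitude 2: 011, 100
    return F, F1, F2


# Table 2 restated as {(a2i+1, a2i, a2i-1): (F, F1, F2)} for comparison.
TABLE_2 = {
    (0, 0, 0): (0, 0, 0),
    (0, 0, 1): (0, 1, 0),
    (0, 1, 0): (0, 1, 0),
    (0, 1, 1): (0, 1, 1),
    (1, 0, 0): (1, 1, 1),
    (1, 0, 1): (1, 1, 0),
    (1, 1, 0): (1, 1, 0),
    (1, 1, 1): (0, 0, 0),
}
```

Checking `booth_signals(*bits) == TABLE_2[bits]` for all eight keys confirms that the expressions agree with the table.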

Note:

1. F is high when f2i is negative, otherwise low.
2. F1 is high for f2i ≠ 0, otherwise low.
3. F2 is high for f2i = ±2, otherwise low.

The Booth recoding hardware processes each F-block concurrently to generate the corresponding control signals F, F1 and F2. For instance, processing the F0 block yields the specific signals F0, F01 and F02, while subsequent blocks F2 through F24 generate their respective control triplets via parallel hardware blocks. By leveraging this recoding algorithm, the architecture minimizes the total count of partial products, requiring only 13 rows for the final summation stage. To generate the ROW#0 partial product, the F0 signal triplet is applied to the 25-bit preprocessed multiplicand "b". This functional mapping is repeated for ROW#2 using the F2 control signals, and the process continues across all remaining rows to complete the partial product generation phase.

VI. PARTIAL PRODUCT ADDITION UNIT

The partial product summation unit performs the addition of all relevant rows, specifically ROW#0 through ROW#24, by utilizing a network of full adders and half adders. To maximize computational throughput, carries are propagated diagonally to the left and downward throughout the array. However, upon reaching the terminal ROW#24, the absence of a succeeding row necessitates a shift to horizontal carry propagation. This horizontal transition is architecturally feasible because the lower augend inputs remain unoccupied, allowing the final stage to complete the summation without a downward path.

In instances where operand b must be negated, the system must calculate its 2's complement to ensure mathematical accuracy. While the structures in Figures 4 and 5 utilize XOR operations for bitwise inversion, the required increment of 1 at the LSB position must still be integrated. To address this, a dedicated ROW#-1 is introduced into the architecture, which adds the "1" bit at the aligned LSB position for each relevant row. This modification ensures the hardware correctly implements signed multiplication without disrupting the primary adder tree.

ROW#-1 = {24'b0, F24, 0, F22, 0, F20, 0, F18, 0, F16, 0, F14, 0, F12, 0, F10, 0, F8, 0, F6, 0, F4, 0, F2, 0, F0}
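The row generation and the ROW#-1 correction can be modelled together as below. This is a behavioural sketch, not the paper's adder array: the working width of 50 bits, the function name, and the Boolean expressions for F, F1 and F2 are assumptions chosen so that the product of two 25-bit signed operands fits. Inverted rows are produced by XOR with all ones, and the missing "+1" of each 2's complement is supplied by the F bit placed at position 2i.

```python
WIDTH = 50                 # assumed working width for the model
MASK = (1 << WIDTH) - 1


def booth_multiply(a: int, b: int) -> int:
    """Multiply two integers that each fit in 25-bit two's complement."""
    # 27-bit multiplier word {a24, a[24:0], 0} used to form the 13 blocks.
    a27 = (((a >> 24) & 1) << 26) | ((a & 0x1FFFFFF) << 1)
    total = 0
    correction = 0         # ROW#-1: F bit of block 2i at bit position 2i
    for i in range(13):
        group = (a27 >> (2 * i)) & 0b111
        a_hi, a_mid, a_lo = (group >> 2) & 1, (group >> 1) & 1, group & 1
        F = a_hi & (1 - (a_mid & a_lo))
        F1 = (a_hi ^ a_mid) | (a_mid ^ a_lo)
        F2 = (a_hi ^ a_mid) & (1 - (a_mid ^ a_lo))
        row = ((b << 1) if F2 else b) if F1 else 0   # 0, b or 2b
        row &= MASK
        if F:
            row ^= MASK                              # XOR-based bitwise inversion
        total += row << (2 * i)                      # align to weight 2^(2i)
        correction |= F << (2 * i)
    total = (total + correction) & MASK
    return total - (1 << WIDTH) if total >> (WIDTH - 1) else total
```

Unsigned 24-bit operands are passed as non-negative integers (25th bit zero) and signed operands as ordinary negative or positive integers, mirroring the 25th bit extension.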

Figure 5. ROW#0 calculation using F0, F01 and F02

Similar structures can be used for the calculation of the other rows.

Figure 6. ROW#2 calculation using F2, F21 and F22

Addition Stage

The initial addition stage integrates ROW#-1, ROW#0, and ROW#2, with the latter being left-shifted by two positions to align with its binary weight. In the subsequent stage, the architecture sums the intermediate result from the first stage with ROW#4, which is left-shifted by four positions, alongside the carry bits generated during the first stage, shifted left by one position. This iterative process continues systematically, where each successive stage accumulates the previous sum and carry results with the next even-indexed row at its respective bit alignment.

To maintain precision, each partial product is shifted to its appropriate power-of-two significance within the array. This Carry Save Adder (CSA) topology defers final carry propagation to the last stage, simultaneously processing the previous sum, previous carry, and new partial product. By minimizing critical path delay compared to ripple-carry methods, this reduction tree ensures the high-speed performance necessary for real-time 32-bit floating-point multiplication.

Figure 7. First stage addition block
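The carry-save accumulation can be sketched as follows. The sketch assumes the rows are already shifted to their binary weights and that at least three rows are supplied; the function names and the 50-bit default width are illustrative choices, not values taken from the paper.

```python
def carry_save_step(x: int, y: int, z: int) -> tuple:
    """One stage of full adders: compress three operands into (sum, carry)."""
    s = x ^ y ^ z
    c = ((x & y) | (y & z) | (x & z)) << 1  # carries shifted left by one position
    return s, c


def csa_accumulate(rows: list, width: int = 50) -> int:
    """Sum pre-shifted rows with carry-save stages and one final carry-propagate add."""
    mask = (1 << width) - 1
    # First stage takes three rows, as ROW#-1, ROW#0 and ROW#2 do in the text.
    s, c = carry_save_step(rows[0] & mask, rows[1] & mask, rows[2] & mask)
    for row in rows[3:]:
        # Each later stage takes the previous sum, previous carry and a new row.
        s, c = carry_save_step(s & mask, c & mask, row & mask)
    return (s + c) & mask  # carry propagation happens only here
```

Because each stage only computes bitwise XOR and majority functions, no carry ripples across the word until the final addition.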
Figure 8. Second stage addition block

Figure 9. Final stage addition block

VII. PROPOSED ARCHITECTURE OF FLOATING POINT MULTIPLIER

The proposed architecture for the Single Precision Floating Point Multiplier, built around a 24-bit multiplier using the Booth recoding algorithm, is given in Fig. 10.

From the calculation perspective, the whole floating-point multiplication is divided into four sections.
1. Sign section
2. Exponent section
3. Mantissa section
4. Normalization section

1. Sign Section

In the sign section, the final result's sign bit is determined through a logical XOR operation applied to the sign bits of both input operands. This logic ensures that if the signs differ, the result is negative, whereas identical signs yield a positive result. The outcomes of this operation are detailed in the truth table provided in Table 5.

Table 5. Sign bit operation

SA	SB	Sign
0	0	0
0	1	1
1	0	1
1	1	0

2. Exponent section

In this section, the two exponents are added using an 8-bit ripple carry adder, as illustrated in Fig. 11. Following the addition, the bias of 127 must be subtracted to obtain the final exponent. This subtraction is executed using the 2's complement method, ensuring the hardware maintains consistent logic for both addition and subtraction.

3. Mantissa section

The mantissa computation represents the core performance bottleneck of the floating-point multiplier, necessitating a high-speed 24×24 bit binary multiplier to process the operands. This design utilizes the Booth recoding procedure to streamline the multiplication of the 23-bit mantissas (plus the implicit leading bit). By appending a sign bit to accommodate signed-number logic, the mantissas are converted into a Booth-encoded format that identifies repeating bit patterns. This encoding minimizes the volume of partial products, condensing multiple operations into larger, collective groups to accelerate the hardware execution. Finally, these partial products are accumulated, carefully maintaining their relative bit positions and signs, to produce the definitive product of the mantissa multiplication.
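The sign and exponent sections can be modelled bit by bit as below. This is a sketch under stated assumptions: the carry out of the 8-bit addition is kept as a ninth bit so that the bias subtraction sees the full sum, and the 9-bit second pass is an illustrative choice that the paper does not specify. All function names are hypothetical.

```python
def ripple_carry_add(a: int, b: int, carry_in: int = 0, width: int = 8) -> tuple:
    """Add two width-bit values one full adder at a time; return (sum, carry_out)."""
    result, carry = 0, carry_in
    for k in range(width):
        x, y = (a >> k) & 1, (b >> k) & 1
        result |= (x ^ y ^ carry) << k          # sum output of the full adder
        carry = (x & y) | (carry & (x ^ y))     # carry output to the next stage
    return result, carry


def result_sign(sa: int, sb: int) -> int:
    """Sign of the product: XOR of the operand sign bits."""
    return sa ^ sb


def biased_exponent(ea: int, eb: int) -> int:
    """Compute EA + EB - 127 by adding the 2's complement of the bias."""
    total, carry = ripple_carry_add(ea, eb)     # 8-bit add, carry kept as bit 8
    wide = (carry << 8) | total
    minus_bias = (~127 + 1) & 0x1FF             # 2's complement of 127 in 9 bits
    diff, _ = ripple_carry_add(wide, minus_bias, width=9)
    return diff & 0xFF
```

For exponents 127 and 128 (operands of magnitude 1.x and 2.x), `biased_exponent` returns 128, before any adjustment made by normalization.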
4. Normalization section

The exponent and mantissa are normalized in the normalization section. Normalization is based on the 47th bit of the product from the 24×24 bit binary multiplier. When the 47th bit is binary one, the mantissa is normalized to 23 bits by taking bit positions 46 down to 24, and the exponent is increased by one. When the 47th bit is binary zero, the mantissa is normalized to 23 bits by taking bit positions 45 down to 23, and the exponent is not increased.
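The two normalization cases reduce to a pair of bit slices. The helper below is a hypothetical illustration that truncates the discarded low bits, since the text does not describe a rounding step.

```python
def normalize(product48: int, exponent: int) -> tuple:
    """Return (23-bit mantissa, exponent) from a 48-bit mantissa product."""
    if (product48 >> 47) & 1:
        # Bit 47 set: keep bits 46..24 and increase the exponent by one.
        return (product48 >> 24) & 0x7FFFFF, exponent + 1
    # Bit 47 clear: bit 46 is the leading one, so keep bits 45..23.
    return (product48 >> 23) & 0x7FFFFF, exponent
```

As an example, the mantissas 0xC00000 (1.5) and 0xA00000 (1.25) give a product with bit 47 clear, so the exponent is unchanged, whereas 0xC00000 squared sets bit 47 and the exponent is incremented.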

Results

The design was successfully synthesized for the Xilinx Virtex-7 FPGA (Device: xc7v585tffg1157-1) using the XST tool within the Xilinx 14.4 environment. Functional verification via a comprehensive test bench confirmed that the Booth recoding architecture streamlines hardware resource utilization by minimizing partial product generation. This reduction in complexity, combined with the handling of both signed and unsigned 32-bit operands according to IEEE 754 standards, makes the implementation a reliable solution for high-speed, power-efficient digital signal processing applications.

Table 6. FPGA hardware utilization

Parameter		Utilization
Bonded IOB		96
Slice	SLICEL	122
Slice	SLICEM	110
LUT as Logic	using O6 output only	628
LUT as Logic	using O5 and O6	157

Figure 10. Proposed architecture of a single precision floating point multiplier (blocks: XOR for the sign, 8-bit ripple carry adder with bias subtraction of 127 for the exponent, 24×24 binary multiplier using the Booth recoding algorithm for the mantissa, and a normalization unit)

Figure 11. Eight bit ripple carry adder

Figure 12. Simulation waveform

Figure 13. RTL schematic of single precision floating point multiplier

Figure 14. On chip power (Dynamic: 65.072 W (91%), comprising Signals 17.855 W (27%), Logic 18.289 W (28%) and I/O 28.928 W (45%); Device Static: 6.610 W (9%))

CONCLUSION

This paper presented the design and implementation of a high-efficiency 32-bit single-precision floating-point multiplier based on the IEEE 754 standard. By integrating the Booth recoding algorithm for mantissa multiplication, the architecture reduced the total number of partial products, leading to a more streamlined addition stage. The performance evaluation confirms that the proposed design optimizes hardware resource utilization while maintaining architectural integrity for floating-point arithmetic. The implemented unit provides a reliable solution for high-speed digital signal processing and embedded engineering applications where balanced power consumption and FPGA area efficiency are critical.

