DOI : https://doi.org/10.5281/zenodo.19185374
- Open Access

- Authors : Prince Mishra
- Paper ID : IJERTV15IS030860
- Volume & Issue : Volume 15, Issue 03 , March – 2026
- Published (First Online): 23-03-2026
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License:
This work is licensed under a Creative Commons Attribution 4.0 International License
FPGA Implementation of Single Precision Floating Point Multiplier using Booth Recoding Algorithm
Prince Mishra
Electronics and Instrumentation Engineering, Odisha University of Technology and Research, Bhubaneswar, India
Abstract – Through this paper, we focus on implementing a Single Precision Floating Point Multiplier conforming to the IEEE 754 standard. By reducing the number of partial products and addition units, this implementation offers benefits such as faster results, reduced power consumption and lower utilization of hardware resources. Moreover, the implementation handles multiplication of both signed and unsigned numbers. The paper presents a comparative analysis with a 32-bit multiplier in terms of power consumption and FPGA hardware resource utilization. The proposed 32-bit multiplier is designed using Verilog HDL and implemented through Xilinx Vivado 2025.1 software for a Xilinx Virtex-7 FPGA.
Keywords Used- FPGA; Booth Recoding; Single Precision; signed multiplier; Verilog HDL.
- INTRODUCTION
Floating-point multipliers serve as critical computational kernels in high-performance digital signal processing (DSP) architectures, specifically within Finite Impulse Response (FIR) and Infinite Impulse Response (IIR) filters.
- FLOATING POINT MULTIPLIER ALGORITHM
In accordance with the IEEE 754 standard, a 32-bit single-precision floating-point datum is partitioned into three distinct functional components: a 1-bit sign (S), an 8-bit biased exponent (E), and a 23-bit fractional mantissa (M). The multiplication of two such operands involves concurrent, independent operations across these fields to derive the product. The steps are described below.
- Calculate the sign bit, i.e. SA XOR SB.
- Calculate the exponent by adding the exponents EA and EB and then subtracting the bias of 127 to get the final exponent, i.e. EA+EB-127.
- Append an implicit leading bit to each 23-bit mantissa to form 24-bit operands. These are processed through a Booth recoding multiplier to generate a 48-bit intermediate product.
- Normalize the result to get the required 23-bit mantissa, ensuring the output adheres to the standard format.
- Combine the calculated sign, exponent and mantissa components to get the desired multiplication result.
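The five steps above can be traced in software. The following Python sketch is only a behavioural illustration under stated assumptions, not the paper's Verilog: the function names are hypothetical, the mantissa product uses Python's built-in multiplication as a stand-in for the Booth recoding multiplier, the result is truncated rather than rounded, and zeros, subnormals, infinities, NaNs and exponent overflow are not handled.

```python
import struct

BIAS = 127  # exponent bias of the IEEE 754 single-precision format


def float_to_bits(x: float) -> int:
    """Return the 32-bit single-precision encoding of x as an integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def bits_to_float(b: int) -> float:
    """Interpret a 32-bit integer as a single-precision value."""
    return struct.unpack(">f", struct.pack(">I", b))[0]


def fp32_multiply(a_bits: int, b_bits: int) -> int:
    """Multiply two normalized single-precision operands (truncating sketch)."""
    # Step 1: sign bit, SA XOR SB
    sign = ((a_bits >> 31) ^ (b_bits >> 31)) & 1
    # Step 2: exponent, EA + EB - 127
    exponent = ((a_bits >> 23) & 0xFF) + ((b_bits >> 23) & 0xFF) - BIAS
    # Step 3: restore the implicit leading 1 and form the 48-bit product
    man_a = (a_bits & 0x7FFFFF) | (1 << 23)
    man_b = (b_bits & 0x7FFFFF) | (1 << 23)
    product = man_a * man_b  # stand-in for the Booth recoding multiplier
    # Step 4: normalize to a 23-bit mantissa
    if (product >> 47) & 1:
        mantissa = (product >> 24) & 0x7FFFFF
        exponent += 1
    else:
        mantissa = (product >> 23) & 0x7FFFFF
    # Step 5: combine sign, exponent and mantissa
    return (sign << 31) | ((exponent & 0xFF) << 23) | mantissa
```

For example, `bits_to_float(fp32_multiply(float_to_bits(1.5), float_to_bits(2.0)))` evaluates to 3.0, because the operands and the product are exactly representable and no rounding is involved.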
Figure 1. Single precision floating point representation (fields: Sign, Exponent, Mantissa)
- DESIGN OF BOOTH RECODING MULTIPLIER
The multiplication involves concurrent XOR-based sign processing, biased exponent addition (EA+EB-127), and normalized mantissa multiplication. To address the hardware complexity of partial product summation in high-order filters, the proposed architecture employs the Booth Recoding algorithm. This optimizes the 24-bit mantissa multiplication by reducing partial product rows, significantly enhancing throughput and lowering power consumption.
- 24×24 bit multiplier
The proposed Booth recoding multiplier architecture accepts two 24-bit inputs, serving as the multiplier and multiplicand. Two control signals are included to specify whether the multiplier and multiplicand are treated as signed or unsigned integers.
The paper is organized as follows. Section II presents the floating point multiplier algorithm. Section III presents the design of the Booth recoding multiplier. Section IV describes the sub-modules of the proposed architecture and their implementation. Sections V and VI present partial product generation and addition, respectively. Section VII contains the proposed architecture for the multiplier. Results are presented in Section VIII. Finally, the conclusion in Section IX marks the end of the paper.
Figure 2. 24-bit Booth recoding multiplier top module (inputs: mplier, mplier_s_u, mplicand, mplicand_s_u; output: prod)
| Signal Name | Width | Source | Description |
| mplier | 24 | input | Top module multiplier input |
| mplier_s_u | 1 | input | 1 = multiplier is signed, 0 = multiplier is unsigned |
| mplicand | 24 | input | Top module multiplicand input |
| mplicand_s_u | 1 | input | 1 = multiplicand is signed, 0 = multiplicand is unsigned |
| prod | 48 | output | Output from the multiplier block |
Table 1. Signal description of top module
- Mathematical representation of unsigned numbers
- Mathematical representation of signed numbers
2's complement representation of A
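The equations belonging to these two subsections are not present in the text as it stands. As a hedged reminder only, the standard positional definitions for an n-bit operand A with bits a_{n-1} ... a_0 (n = 24 for the mantissa operands here) are given below, together with the radix-4 Booth digit f_{2i} that Table 2 tabulates. The symbols n and A are notation assumed for this sketch, not taken from the paper.

```latex
% Unsigned interpretation of an n-bit operand
A_{\mathrm{unsigned}} = \sum_{i=0}^{n-1} a_i \, 2^{i}

% Two's complement (signed) interpretation of the same bits
A_{\mathrm{signed}} = -a_{n-1} \, 2^{n-1} + \sum_{i=0}^{n-2} a_i \, 2^{i}

% Radix-4 Booth digit formed from three adjacent bits, with a_{-1} = 0
f_{2i} = a_{2i-1} + a_{2i} - 2\, a_{2i+1}, \qquad f_{2i} \in \{-2, -1, 0, 1, 2\}
```

The third expression reproduces every row of the Booth recoding truth table: for instance, the bit triplet (1, 0, 0) gives f = 0 + 0 - 2 = -2.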
Table 2. Booth recoding truth table
| a2i+1 | a2i | a2i-1 | f2i | F (+/-) | F1 (×1/0) | F2 (×2/0) |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 1 | 0 | 1 | 0 |
| 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 0 | 1 | 1 | 2 | 0 | 1 | 1 |
| 1 | 0 | 0 | -2 | 1 | 1 | 1 |
| 1 | 0 | 1 | -1 | 1 | 1 | 0 |
| 1 | 1 | 0 | -1 | 1 | 1 | 0 |
| 1 | 1 | 1 | 0 | 0 | 0 | 0 |
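The truth table above can be expressed compactly in software. The Python sketch below is an illustration with a hypothetical function name, `booth_recode`; it derives the Booth digit f from the three input bits and then the control signals F, F1 and F2 following the rules that F is high for a negative digit, F1 for a non-zero digit and F2 for a digit of magnitude two.

```python
def booth_recode(a_hi: int, a_mid: int, a_lo: int):
    """Return (f, F, F1, F2) for the bit triplet (a2i+1, a2i, a2i-1)."""
    f = a_lo + a_mid - 2 * a_hi      # radix-4 Booth digit in {-2, -1, 0, 1, 2}
    F = 1 if f < 0 else 0            # sign of the digit
    F1 = 1 if f != 0 else 0          # digit is non-zero
    F2 = 1 if abs(f) == 2 else 0     # digit has magnitude two
    return f, F, F1, F2


# Rows of the truth table: (a2i+1, a2i, a2i-1) -> (f2i, F, F1, F2)
TRUTH_TABLE = {
    (0, 0, 0): (0, 0, 0, 0),
    (0, 0, 1): (1, 0, 1, 0),
    (0, 1, 0): (1, 0, 1, 0),
    (0, 1, 1): (2, 0, 1, 1),
    (1, 0, 0): (-2, 1, 1, 1),
    (1, 0, 1): (-1, 1, 1, 0),
    (1, 1, 0): (-1, 1, 1, 0),
    (1, 1, 1): (0, 0, 0, 0),
}
```

Every entry of `TRUTH_TABLE` agrees with `booth_recode`, which makes the dictionary usable as a reference when checking a hardware description of the same logic.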
- SUB-MODULES OF IMPLEMENTED BOOTH RECODING MULTIPLIER
- 25th bit extension unit
The proposed architecture employs a unified signed multiplier core to facilitate both signed and unsigned arithmetic operations. To preserve the full dynamic range during unsigned multiplication and prevent magnitude truncation, the 24-bit operands undergo a bit-width expansion at the most significant bit (MSB) position. For unsigned operands, this 25th bit is set to zero. For signed operands, the 25th bit replicates the 24th bit (MSB) to preserve two's complement integrity. This preprocessing stage ensures that the subsequent Booth recoding logic can process both number formats using a single, hardware-efficient internal data path.
Figure 3. 25th bit extension unit block diagram (signals shown: mplier_s_u, mplicand[23], mplicand_s_u, mux out)
- Preprocessing/F block unit formation
The preprocessing unit conditions the operand for the Booth recoding algorithm to effectively minimize the partial product count. In this 24-bit architecture, the 25-bit operand is first extended by appending a logic "0" at the least significant bit (LSB) position, establishing the essential reference bit (a_{-1} = 0) for the initial Booth recoding cycle. This modified 26-bit sequence is then partitioned into overlapping three-bit groupings, designated as "F-blocks", where the most significant bit (MSB) of each block serves as the LSB of the subsequent group. To achieve bit-width alignment for the final three-bit grouping, an additional bit is appended at the MSB position, extending the sequence to 27 bits. This final MSB is a direct replication of the 25th bit, a technique that ensures sign integrity and preserves the numerical value for both signed and unsigned formats during the partial product generation phase.
Table 3. Extended 26 bit extension unit
| Extra bit added for block formation | Extended 25 bit number (result of 25 bit extension unit) | 0 added in the LSB |
| a[24] | a[24:0] | 0 |
The preprocessing stage concludes with the operand expanded to a total width of 27 bits. This transformation is achieved through combinational rewiring to form overlapping F blocks without additional hardware.
Table 4. F block formation
| F block | Bits |
| F0 | {a1, a0, 0} |
| F2 | {a3, a2, a1} |
| F4 | {a5, a4, a3} |
| F6 | {a7, a6, a5} |
| F8 | {a9, a8, a7} |
| F10 | {a11, a10, a9} |
| F12 | {a13, a12, a11} |
| F14 | {a15, a14, a13} |
| F16 | {a17, a16, a15} |
| F18 | {a19, a18, a17} |
| F20 | {a21, a20, a19} |
| F22 | {a23, a22, a21} |
| F24 | {a24, a24, a23} |
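The 25th bit extension and the F-block formation can be modelled in a few lines. The Python sketch below is an assumption-laden illustration rather than the paper's hardware: the names `extend_to_25`, `f_blocks` and `booth_value` are hypothetical, and `booth_value` exists only to confirm that the 13 blocks still encode the original operand.

```python
def extend_to_25(value: int, is_signed: int) -> int:
    """Extend a 24-bit operand to 25 bits.

    Bit 24 copies bit 23 when the operand is signed, and is 0 when unsigned.
    """
    value &= 0xFFFFFF
    msb = (value >> 23) & 1
    return value | ((msb & is_signed) << 24)


def f_blocks(a25: int):
    """Split a 25-bit operand into the 13 overlapping triplets F0, F2, ..., F24.

    Each triplet is returned as (a[2i+1], a[2i], a[2i-1]).
    """
    top = (a25 >> 24) & 1
    # 27-bit word: {copy of a[24], a[24:0], appended 0 at the LSB}
    a27 = (top << 26) | ((a25 & 0x1FFFFFF) << 1)
    return [((a27 >> (j + 2)) & 1, (a27 >> (j + 1)) & 1, (a27 >> j) & 1)
            for j in range(0, 26, 2)]


def booth_value(blocks) -> int:
    """Recombine the Booth digits of all blocks into the operand's value."""
    return sum((lo + mid - 2 * hi) * 4 ** i
               for i, (hi, mid, lo) in enumerate(blocks))
```

As a usage example, `booth_value(f_blocks(extend_to_25(0xFFFFFF, 0)))` gives 16777215 (the unsigned reading), while the same bit pattern with `is_signed=1` gives -1, which is the behaviour the extension bit is meant to provide.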
V. Partial Product Generation
The partial product generation unit uses the 13 designated "F" blocks to derive the corresponding control signals F, F1 and F2 based on the truth table logic defined in Table 2. These signals drive the hardware that manipulates the multiplicand bits to produce the necessary partial product rows. The hardware implementation for F, F1 and F2 is shown in Figure 4.
Figure 4. Hardware realization of F, F1 and F2
Note:
- F is high when f2i is negative, otherwise low.
- F1 is high for f2i ≠ 0, otherwise low.
- F2 is high for f2i = ±2, otherwise low.
The Booth recoding hardware processes each F-block concurrently to generate the corresponding control signals F, F1 and F2. For instance, processing the F0 block yields the specific signals F0, F01 and F02, while subsequent blocks F2 through F24 generate their respective control triplets via parallel hardware blocks. By leveraging this recoding algorithm, the architecture reduces the total count of partial products to only 13 rows for the final summation stage. To generate the ROW#0 partial product, the F0 signal triplet is applied to the 25-bit preprocessed multiplicand "b". This functional mapping is repeated for ROW#2 using the F2 control signals, and the process continues across all remaining rows to complete the partial product generation phase.
VI. PARTIAL PRODUCT ADDITION UNIT
The partial product summation unit performs the addition of all relevant rows, specifically ROW#0 through ROW#24, by using a network of full adders and half adders. To maximize computational throughput, carries are propagated diagonally to the left and downward throughout the array. However, upon reaching the terminal ROW#24, the absence of a succeeding row necessitates a shift to horizontal carry propagation. This horizontal transition is architecturally feasible because the lower augend inputs remain unoccupied, allowing the final stage to complete the summation without a downward path.
In instances where a negative multiple of operand b is required (F is high), the system must calculate its 2's complement to ensure mathematical accuracy. While the structures in Figures 4 and 5 use XOR operations for bitwise inversion, the required increment of 1 at the LSB position must still be integrated. To address this, a dedicated ROW#-1 is introduced into the architecture, which adds the "1" bit at the aligned LSB position for each relevant row. This modification ensures the hardware correctly implements signed multiplication without disrupting the primary adder tree.
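The row generation of Section V and the ROW#-1 correction described above can be checked with a small behavioural model. The Python sketch below is illustrative only: the names `booth_signals`, `partial_products` and `multiply_via_rows` are hypothetical, each row is modelled as a full 48-bit sign-extended word (the hardware rows may be narrower), and the operands are passed as Python integers that already fit the 25-bit extended format.

```python
MASK48 = (1 << 48) - 1


def booth_signals(hi: int, mid: int, lo: int):
    """Control signals (F, F1, F2) for one F-block."""
    f = lo + mid - 2 * hi
    return int(f < 0), int(f != 0), int(abs(f) == 2)


def partial_products(a: int, b: int):
    """Return (rows, row_minus_1) for a 25-bit multiplier a and multiplicand b.

    rows[i] is the partial product of block F(2i), already shifted left by 2i.
    row_minus_1 collects the '+1' bits that complete each 2's complement.
    """
    # 27-bit preprocessed multiplier: {a[24], a[24:0], 0}
    a27 = ((a & 0x1FFFFFF) << 1) | (((a >> 24) & 1) << 26)
    rows, row_minus_1 = [], 0
    for i in range(13):
        hi = (a27 >> (2 * i + 2)) & 1
        mid = (a27 >> (2 * i + 1)) & 1
        lo = (a27 >> (2 * i)) & 1
        F, F1, F2 = booth_signals(hi, mid, lo)
        magnitude = ((b << F2) & MASK48) if F1 else 0   # 0, b or 2b
        row = magnitude ^ (MASK48 if F else 0)          # XOR inversion when F = 1
        rows.append((row << (2 * i)) & MASK48)
        row_minus_1 |= F << (2 * i)                     # aligned LSB correction bit
    return rows, row_minus_1


def multiply_via_rows(a: int, b: int) -> int:
    """Sum all rows plus ROW#-1 and return the 48-bit product pattern."""
    rows, row_minus_1 = partial_products(a, b)
    return (sum(rows) + row_minus_1) & MASK48
```

For instance, `multiply_via_rows(-3, 5)` returns the 48-bit two's complement pattern of -15, and `multiply_via_rows(0xFFFFFF, 0xFFFFFF)` returns the full unsigned product. Leaving out `row_minus_1` would make every negated row too small by one unit at its own bit position, which is the error the dedicated row removes.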
- Addition Stage
The initial addition stage integrates ROW#-1, ROW#0, and ROW#2, with the latter being left-shifted by two positions to align with its binary weight. In the subsequent stage, the architecture sums the intermediate result from the first stage with ROW#4, which is left-shifted by four positions, alongside the carry bits generated during the first stage, shifted left by one position. This iterative process continues systematically, where each successive stage accumulates the previous sum and carry results with the next even-indexed row at its respective bit alignment.
Figure 5. ROW#0 calculation using F0, F01 and F02
Similar structures can be used for the calculation of the other rows.
To maintain precision, each partial product is shifted to its appropriate power-of-two significance within the array. This Carry Save Adder (CSA) topology defers final carry propagation to the last stage, simultaneously processing the previous sum, previous carry, and new partial product. By shortening the critical path compared to ripple-carry methods, this reduction tree supports the high-speed performance necessary for real-time 32-bit floating-point multiplication.
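The carry-save principle described above can be sketched in Python. This is a generic illustration of 3:2 compression, not a model of the exact adder array in the figures: the function names are hypothetical, rows are assumed to be already shifted to their bit positions, and all arithmetic is taken modulo 2^48.

```python
def carry_save_add(x: int, y: int, z: int, width: int = 48):
    """Compress three operands into (sum, carry) without propagating carries.

    The invariant is sum + carry == x + y + z (modulo 2**width).
    """
    mask = (1 << width) - 1
    s = (x ^ y ^ z) & mask                              # bitwise sum
    c = (((x & y) | (y & z) | (x & z)) << 1) & mask     # carries, shifted left by one
    return s, c


def reduce_rows(rows, width: int = 48) -> int:
    """Accumulate at least three rows stage by stage, as in the addition stages."""
    mask = (1 << width) - 1
    # First stage: three rows enter together
    s, c = carry_save_add(rows[0] & mask, rows[1] & mask, rows[2] & mask, width)
    # Later stages: previous sum, previous carry and the next row
    for row in rows[3:]:
        s, c = carry_save_add(s, c, row & mask, width)
    # Only the final stage performs a carry-propagating addition
    return (s + c) & mask
```

With `rows = [5, 12 << 2, 7 << 4, 1 << 6]`, `reduce_rows(rows)` returns 229, the same value as the plain sum, while each intermediate stage only needs one full-adder delay per bit.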
Figure 6. ROW#2 calculation using F2, F21 and F22
Figure 7. First stage addition block
Figure 8. Second stage addition block
Figure 9. Final stage addition block
- Proposed Architecture Of Floating Point Multiplier
The proposed architecture of the single precision floating point multiplier, built around a 24-bit multiplier using the Booth recoding algorithm, is given in Fig. 10.
From the calculation perspective, the whole floating-point multiplication is divided into four sections.
- Sign section
- Exponent section
- Mantissa section
- Normalization section
- Sign Section
In the sign section, the final result's sign bit is determined through a logical XOR operation applied to the sign bits of both input operands. This logic ensures that if the signs differ, the result is negative, whereas identical signs yield a positive result. The outcomes of this operation are detailed in the truth table provided in Table 5.
Table 5. Sign bit operation
| SA | SB | Sign |
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
- Exponent section
In this section, the exponents of the two operands are added by an 8-bit ripple carry adder, as illustrated in Fig. 11. Following the addition, the bias of 127 is subtracted from the result to obtain the final exponent. This subtraction is performed using the 2's complement method, so that the hardware uses the same adder logic for both addition and subtraction.
- Mantissa section
The mantissa computation is the core performance bottleneck of the floating-point multiplier, necessitating a high-speed 24×24 bit binary multiplier to process the operands. This design uses the Booth recoding procedure to streamline the multiplication of the 23-bit mantissas (plus the implicit leading bit). By appending a sign bit to accommodate signed-number logic, the mantissas are converted into a Booth-encoded format that identifies repeating bit patterns. This encoding reduces the number of partial products, condensing multiple operations into larger, collective groups to accelerate the hardware execution. Finally, these partial products are accumulated, carefully maintaining their relative bit positions and signs, to produce the final product of the mantissa multiplication.
- Normalization section
The exponent and mantissa are normalized in the normalization section. Normalization is decided by the 47th bit, the most significant bit of the 48-bit output of the 24×24 bit binary multiplier. When the 47th bit is binary one, the mantissa is normalized to 23 bits by taking bit positions 46 down to 24, and the exponent is increased by one. When the 47th bit is binary zero, the mantissa is normalized to 23 bits by taking bit positions 45 down to 23, and the exponent is left unchanged.
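The selection rule follows from the operand range: two 24-bit mantissas with a leading 1 each represent a value in [1, 2), so their product lies in [1, 4) and the leading 1 of the 48-bit result sits at bit 47 or bit 46. A minimal Python sketch of the rule is given below. The function name `normalize` is hypothetical, and the sketch truncates the discarded low bits, since the text does not describe a rounding step.

```python
def normalize(product48: int, exponent: int):
    """Return (mantissa23, exponent) from the 48-bit mantissa product."""
    if (product48 >> 47) & 1:
        # Product is in [2, 4): keep bits 46..24 and increment the exponent
        return (product48 >> 24) & 0x7FFFFF, exponent + 1
    # Product is in [1, 2): keep bits 45..23, exponent unchanged
    return (product48 >> 23) & 0x7FFFFF, exponent
```

For example, 1.5 × 1.5 corresponds to `0xC00000 * 0xC00000`, whose bit 47 is set, so `normalize` selects bits 46 to 24 (`0x100000`, the fraction of 1.125) and increments the exponent, giving 1.125 × 2 = 2.25.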
- Results
The design was successfully synthesized for the Xilinx Virtex-7 FPGA (Device: xc7v585tffg1157-1) using the XST tool within the Xilinx 14.4 environment. Functional verification via a comprehensive test bench confirmed that the Booth recoding architecture reduces hardware resource utilization by minimizing partial product generation. This reduction in complexity, combined with the efficient handling of both signed and unsigned 32-bit operands according to IEEE 754 standards, makes the implementation a reliable solution for high-speed, power-efficient digital signal processing applications.
Table 6. FPGA hardware utilization
| Parameter | Utilization |
| Bonded IOB | 96 |
| Slice: SLICEL | 122 |
| Slice: SLICEM | 110 |
| LUT as Logic: using O6 output only | 628 |
| LUT as Logic: using O5 and O6 | 157 |
Figure 10. Proposed architecture of a single precision floating point multiplier (blocks shown: XOR for the sign bits; 8-bit ripple carry adder with bias of -127 for the exponents; 24×24 binary multiplier using Booth recoding algorithm for the mantissas; Normalization Unit producing the Sign, Exponent and Mantissa fields of the result)
Figure 11. Eight bit ripple carry adder
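The exponent path built on this ripple carry adder can be modelled bit by bit. The Python sketch below is an illustration under assumptions, not the paper's circuit: the function names are hypothetical, and the bias subtraction is carried out in a 9-bit second stage so that the carry out of the 8-bit addition is not lost. Exponent overflow and underflow are not handled.

```python
def full_adder(a: int, b: int, cin: int):
    """One-bit full adder returning (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_add(x: int, y: int, cin: int = 0, width: int = 8):
    """Chain of full adders; the carry ripples from bit 0 upward."""
    result, carry = 0, cin
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry


def biased_exponent(ea: int, eb: int) -> int:
    """Compute EA + EB - 127 using additions only.

    Subtracting 127 is done by adding its one's complement with a carry-in
    of 1, which is the 2's complement method.
    """
    total, carry = ripple_add(ea, eb)            # 8-bit sum plus carry out
    wide = (carry << 8) | total                  # 9-bit intermediate sum
    minus_bias = (~127) & 0x1FF                  # one's complement of 127, 9 bits
    result, _ = ripple_add(wide, minus_bias, cin=1, width=9)
    return result
```

For the operands 1.5 and 2.0 the biased exponents are 127 and 128, and `biased_exponent(127, 128)` returns 128, the biased exponent of the product 3.0.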
Figure 12. Simulation waveform
Figure 13. RTL schematic of single precision floating point multiplier
Figure 14. On chip power (Dynamic: 65.072 W (91%); Signals: 17.855 W (27%); I/O: 28.928 W (45%))
- CONCLUSION
This paper presented the design and implementation of a high-efficiency 32-bit single-precision floating-point multiplier based on the IEEE 754 standard. By integrating the Booth recoding algorithm for mantissa multiplication, the architecture successfully reduced the total number of partial products, leading to a more streamlined addition stage. The performance evaluation confirms that the proposed design effectively optimizes hardware resource utilization while maintaining architectural integrity for floating-point arithmetic. The implemented unit provides a reliable solution for high-speed digital signal processing and embedded engineering applications where balanced power consumption and FPGA area efficiency are critical.
