
FPGA Implementation of Single Precision Floating Point Multiplier using Booth Recoding Algorithm

DOI : https://doi.org/10.5281/zenodo.19185374

Prince Mishra

Electronics and Instrumentation Engineering Odisha University of Technology and Research, Bhubaneswar, India

Abstract – This paper focuses on implementing a Single Precision Floating Point Multiplier conforming to the IEEE 754 standard. By reducing the number of partial product generation and addition units, this implementation offers faster results, reduced power consumption and lower utilization of hardware resources. Moreover, the implementation handles multiplication of both signed and unsigned numbers. The paper presents a comparative analysis of 32-bit multiplier performance in terms of power consumption and FPGA hardware resource utilization. The proposed 32-bit multiplier is designed using Verilog HDL and implemented through Xilinx Vivado 2025.1 software for Xilinx Virtex-7 FPGA.

Keywords Used- FPGA; Booth Recoding; Single Precision; signed multiplier; Verilog HDL.

  1. INTRODUCTION

    Floating-point multipliers serve as critical computational kernels in high-performance digital signal processing (DSP) architectures, specifically within Finite Impulse Response (FIR) and Infinite Impulse Response (IIR) filters.


  2. FLOATING POINT MULTIPLIER ALGORITHM

    In accordance with the IEEE 754 standard, a 32-bit single-precision floating-point datum is partitioned into three distinct functional components: a 1-bit sign (S), an 8-bit biased exponent (E), and a 23-bit fractional mantissa (M). The multiplication of two such operands involves concurrent, independent operations across these fields to derive the product. The steps are as follows.

    1. The sign bit is calculated as SA XOR SB.
    2. The exponent is calculated by adding the exponents EA and EB and then subtracting the bias of 127 to obtain the final exponent, i.e. EA+EB-127.
    3. An implicit leading bit is prepended to each 23-bit mantissa to form 24-bit operands. These are processed through a Booth Recoding multiplier to generate a 48-bit intermediate product.
    4. The result is normalized to obtain the required 23-bit mantissa, ensuring that the output adheres to the standard format.
    5. The calculated sign, exponent and mantissa components are combined to give the multiplication result.
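The five steps above can be traced in software. The following is a minimal Python sketch, not the paper's Verilog design. The function names (`float_to_bits`, `bits_to_float`, `fp32_multiply`) are hypothetical, Python's built-in integer multiply stands in for the Booth recoding multiplier, and the sketch assumes normalized, finite operands, truncation instead of rounding, and no overflow or underflow handling.

```python
import struct


def float_to_bits(x: float) -> int:
    """Return the 32-bit IEEE 754 single-precision pattern of x."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def bits_to_float(bits: int) -> float:
    """Interpret a 32-bit pattern as an IEEE 754 single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def fp32_multiply(a_bits: int, b_bits: int) -> int:
    # Field extraction: sign (1 bit), biased exponent (8 bits), fraction (23 bits).
    sa, sb = (a_bits >> 31) & 1, (b_bits >> 31) & 1
    ea, eb = (a_bits >> 23) & 0xFF, (b_bits >> 23) & 0xFF
    # Step 3: restore the implicit leading 1 to form 24-bit mantissas.
    ma = (a_bits & 0x7FFFFF) | (1 << 23)
    mb = (b_bits & 0x7FFFFF) | (1 << 23)

    sign = sa ^ sb                 # Step 1: SA XOR SB
    exponent = ea + eb - 127       # Step 2: EA + EB - 127
    product = ma * mb              # Step 3: 48-bit intermediate product

    # Step 4: normalize on bit 47 of the product (truncating the low bits).
    if (product >> 47) & 1:
        mantissa = (product >> 24) & 0x7FFFFF
        exponent += 1
    else:
        mantissa = (product >> 23) & 0x7FFFFF

    # Step 5: reassemble the 32-bit result.
    return (sign << 31) | ((exponent & 0xFF) << 23) | mantissa


result = bits_to_float(fp32_multiply(float_to_bits(1.5), float_to_bits(-2.5)))
print(result)
```

Here the operands 1.5 and -2.5 are exactly representable, so the truncating sketch returns the exact product.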
  3. DESIGN OF BOOTH RECODING MULTIPLIER

    Figure 1. Single precision floating point representation (Sign: bit 31; Exponent: bits 30 to 23; Mantissa: bits 22 to 0)

    The multiplication involves concurrent XOR-based sign processing, biased exponent addition (EA+EB-127), and normalized mantissa multiplication. To address the hardware complexity of partial product summation in high-order filters, the proposed architecture employs the Booth Recoding algorithm. This optimizes the 24-bit mantissa multiplication by reducing the number of partial product rows, which enhances throughput and lowers power consumption.

    1. 24×24 bit multiplier

      The proposed Booth recoding multiplier architecture accepts two 24-bit inputs, serving as the multiplier and multiplicand. Two control signals are included to specify whether the multiplier and multiplicand are treated as signed or unsigned integers.

      The paper is organized as follows. Section II presents the floating point multiplier algorithm. Section III presents the design of the Booth recoding multiplier. Section IV describes the details of the proposed architecture and its implementation. Sections V and VI present partial product generation and addition, respectively. Section VII contains the proposed architecture for the multiplier. Results are presented in Section VIII. Finally, Section IX concludes the paper.

      Figure 2. 24-bit Booth recoding multiplier top module (inputs: mplier, mplier_s_u, mplicand, mplicand_s_u; output: prod)

      Signal Name    Width  Source  Description
      mplier         24     input   Top module multiplier input
      mplier_s_u     1      input   1 = multiplier is signed, 0 = multiplier is unsigned
      mplicand       24     input   Top module multiplicand input
      mplicand_s_u   1      input   1 = multiplicand is signed, 0 = multiplicand is unsigned
      prod           -      output  Output from the multiplier block

      Table 1. Signal description of top module

    C. Mathematical representation of unsigned numbers

    2. Mathematical Representation of Signed Number

2’s complement representation of A

unsigned and signed integers.

Table 2. Booth recoding truth table

a2i+1  a2i  a2i-1  f2i  F (+/-)  F1 (×1/0)  F2 (×2/0)
0      0    0      0    0        0          0
0      0    1      1    0        1          0
0      1    0      1    0        1          0
0      1    1      2    0        1          1
1      0    0      -2   1        1          1
1      0    1      -1   1        1          0
1      1    0      -1   1        1          0
1      1    1      0    0        0          0
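The truth table follows directly from the radix-4 Booth digit f2i = -2·a2i+1 + a2i + a2i-1. A short Python sketch (the function name `booth_recode` is hypothetical, and this models the logic only, not the gate-level realization) derives the digit and the three control signals from one three-bit group:

```python
def booth_recode(a_hi: int, a_mid: int, a_lo: int):
    """Recode one overlapping 3-bit group (a2i+1, a2i, a2i-1).

    Returns (f, F, F1, F2) where
      f  is the Booth digit in {-2, -1, 0, 1, 2},
      F  is 1 when the digit is negative,
      F1 is 1 when the digit is non-zero,
      F2 is 1 when the digit has magnitude 2.
    """
    f = -2 * a_hi + a_mid + a_lo
    F = 1 if f < 0 else 0
    F1 = 1 if f != 0 else 0
    F2 = 1 if abs(f) == 2 else 0
    return f, F, F1, F2


# Print all eight rows in the same column order as the truth table.
for bits in range(8):
    a_hi, a_mid, a_lo = (bits >> 2) & 1, (bits >> 1) & 1, bits & 1
    print(a_hi, a_mid, a_lo, *booth_recode(a_hi, a_mid, a_lo))
```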

 

Extra bit added for block formation   Extended 25-bit number (result of 25-bit extension unit)   0 added at the LSB
a[24]                                 a[24:0]                                                     0

 

  4. SUB-MODULES OF IMPLEMENTED BOOTH RECODING MULTIPLIER
    1. 25th bit extension unit

      The proposed architecture employs a unified signed multiplier core to support both signed and unsigned arithmetic operations. To preserve the full dynamic range during unsigned multiplication and prevent magnitude truncation, the 24-bit operands undergo a bit-width expansion at the most significant bit (MSB) position. For unsigned operands, this 25th bit is set to zero. For signed operands, the 25th bit replicates the 24th bit (the MSB), which preserves the two's complement value. This preprocessing stage ensures that the subsequent Booth recoding logic can process both number formats using a single, hardware-efficient internal data path.
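The extension rule can be illustrated with a small Python model. This is a sketch under stated assumptions: the helper names `extend_to_25_bits` and `as_signed_25` are hypothetical, and the `is_signed` argument plays the role of the mplier_s_u / mplicand_s_u control inputs.

```python
def extend_to_25_bits(value24: int, is_signed: int) -> int:
    """Extend a 24-bit operand to 25 bits.

    Unsigned operands get a 0 in the new MSB; signed operands get a copy
    of bit 23 (sign extension), so one signed datapath serves both cases.
    """
    value24 &= 0xFFFFFF
    msb = (value24 >> 23) & 1
    extra = msb if is_signed else 0
    return (extra << 24) | value24


def as_signed_25(value25: int) -> int:
    """Interpret a 25-bit pattern as a two's complement integer."""
    return value25 - (1 << 25) if (value25 >> 24) & 1 else value25


# The same 24-bit pattern read as unsigned and as signed:
print(as_signed_25(extend_to_25_bits(0xFFFFFF, 0)))  # largest unsigned value
print(as_signed_25(extend_to_25_bits(0xFFFFFF, 1)))  # two's complement -1
```

After extension, both interpretations are ordinary 25-bit two's complement numbers, which is what allows the downstream recoding logic to ignore the signed/unsigned distinction.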

      Table 3. Extended 26 bit extension unit

      The preprocessing stage concludes with the operand expanded to a total width of 27 bits. This transformation is achieved purely through combinational rewiring to form the overlapping F blocks, so it requires no additional hardware.

      Table 4. F block formation

      F0   {a1, a0, 0}
      F2   {a3, a2, a1}
      F4   {a5, a4, a3}
      F6   {a7, a6, a5}
      F8   {a9, a8, a7}
      F10  {a11, a10, a9}
      F12  {a13, a12, a11}
      F14  {a15, a14, a13}
      F16  {a17, a16, a15}
      F18  {a19, a18, a17}
      F20  {a21, a20, a19}
      F22  {a23, a22, a21}
      F24  {a24, a24, a23}

      Figure 3. 25th bit extension unit block diagram

    2. Preprocessing/F block formation unit

      The preprocessing unit conditions the operand for the Booth recoding algorithm to minimize the partial product count. In this 24-bit architecture, the 25-bit operand is first extended by appending a logic "0" at the least significant bit (LSB) position, establishing the reference bit (a_{-1} = 0) for the initial Booth recoding cycle. This modified 26-bit sequence is then partitioned into overlapping three-bit groupings, designated "F-blocks", where the most significant bit (MSB) of each block serves as the LSB of the subsequent group. To complete the final three-bit grouping, an additional bit is appended at the MSB position, extending the sequence to 27 bits. This final MSB is a direct replication of the 25th bit, which ensures sign integrity and preserves the numerical value for both signed and unsigned formats during the partial product generation phase.
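The grouping can be modelled in a few lines of Python. This is an illustrative sketch, not the paper's wiring: `form_f_blocks` is a hypothetical name, and the input is assumed to be the 25-bit output of the extension unit. Each block is returned as the tuple (a_{i+1}, a_i, a_{i-1}), keyed by the even index i.

```python
def form_f_blocks(a25: int) -> dict:
    """Split a 25-bit operand into the 13 overlapping 3-bit F blocks."""
    a25 &= 0x1FFFFFF
    msb = (a25 >> 24) & 1
    # 27-bit word: bit 0 is the reference bit a_{-1} = 0, bits 1..25 hold
    # a0..a24, and bit 26 repeats a24.
    a27 = (msb << 26) | (a25 << 1)
    blocks = {}
    for i in range(0, 25, 2):          # F0, F2, ..., F24
        group = (a27 >> i) & 0b111     # bits a_{i+1}, a_i, a_{i-1}
        blocks[i] = ((group >> 2) & 1, (group >> 1) & 1, group & 1)
    return blocks


blocks = form_f_blocks(0b101)          # operand value 5
print(len(blocks), blocks[0], blocks[2])
```

A useful sanity check is that the Booth digits -2·hi + mid + lo of the blocks, weighted by 2^i, sum back to the signed value of the 25-bit operand.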

  5. PARTIAL PRODUCT GENERATION

    The partial product generation unit uses the 13 designated "F" blocks to derive the corresponding control signals F, F1 and F2, based on the truth table logic defined in Table 2. These signals drive the hardware required for bit manipulation to produce the necessary partial product rows. The hardware implementation for F, F1 and F2 is shown below.

Figure 4. Hardware realization of F, F1 and F2

Note:

  1. F is high when f2i is negative, otherwise low.
  2. F1 is high for f2i ≠ 0, otherwise low.
  3. F2 is high for f2i = ±2, otherwise low.

    The Booth recoding hardware processes each F-block concurrently to generate the corresponding control signals F, F1 and F2. For instance, processing the F0 block yields the signals F0, F01 and F02, while the subsequent blocks F2 through F24 generate their respective control triplets via parallel hardware blocks. By leveraging this recoding algorithm, the architecture reduces the total count of partial products, requiring only 13 rows for the final summation stage. To generate the ROW#0 partial product, the F0 signal triplet is applied to the 25-bit preprocessed multiplicand "b". This functional mapping is repeated for ROW#2 using the F2 control signals, and the process continues across all remaining rows to complete the partial product generation phase.

  6. PARTIAL PRODUCT ADDITION UNIT

    The partial product summation unit performs the addition of all relevant rows, specifically ROW#0 through ROW#24, by using a network of full adders and half adders. To maximize computational throughput, carries are propagated diagonally to the left and downward throughout the array. However, upon reaching the terminal ROW#24, the absence of a succeeding row necessitates a shift to horizontal carry propagation. This horizontal transition is architecturally feasible because the lower augend inputs remain unoccupied, allowing the final stage to complete the summation without a downward path.

    When a row requires a negative multiple of operand b (F = 1), the 2's complement of that multiple must be formed to ensure mathematical accuracy. While the structures in Figures 4 and 5 use XOR operations for bitwise inversion, the required increment of 1 at the LSB position must still be added. To address this, a dedicated ROW#-1 is introduced into the architecture, which adds the "1" bit at the aligned LSB position for each relevant row. This modification ensures that the hardware correctly implements signed multiplication without disrupting the primary adder tree.
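The interaction between the XOR inversion and the ROW#-1 correction can be checked with a behavioural Python model. This is a sketch under assumptions, not the paper's netlist: the names `recode`, `sign_extend` and `partial_products` are hypothetical, rows are held in a 50-bit working width (W) chosen here so that any 25×25-bit signed product fits, and both inputs are assumed to be 25-bit extended operands.

```python
W = 50                      # working width chosen for this sketch
MASK = (1 << W) - 1


def recode(a25: int):
    """Yield (i, F, F1, F2) for the 13 F blocks of a 25-bit operand."""
    a27 = (((a25 >> 24) & 1) << 26) | ((a25 & 0x1FFFFFF) << 1)
    for i in range(0, 25, 2):
        lo, mid, hi = (a27 >> i) & 1, (a27 >> (i + 1)) & 1, (a27 >> (i + 2)) & 1
        f = -2 * hi + mid + lo
        yield i, int(f < 0), int(f != 0), int(abs(f) == 2)


def sign_extend(b25: int) -> int:
    """Sign-extend a 25-bit two's complement value to W bits."""
    b25 &= 0x1FFFFFF
    return b25 | (MASK & ~0x1FFFFFF) if (b25 >> 24) & 1 else b25


def partial_products(a25: int, b25: int):
    """Return the 13 shifted rows and the ROW#-1 correction word."""
    b = sign_extend(b25)
    rows, correction = [], 0
    for i, F, F1, F2 in recode(a25):
        # F1 and F2 select 0, b or 2b.
        magnitude = ((b << 1) & MASK if F2 else b) if F1 else 0
        # F inverts every bit (XOR); the missing +1 goes into ROW#-1.
        row = magnitude ^ (MASK if F else 0)
        rows.append((row << i) & MASK)
        correction |= F << i
    return rows, correction


rows, correction = partial_products(5, 0x1FFFFFD)   # 5 * (-3)
total = (sum(rows) + correction) & MASK
print(len(rows), total == (-15) & MASK)
```

Because the correction bits of different rows sit at distinct even positions, they can share a single extra row, which matches the role the text assigns to ROW#-1.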

    1. Addition Stage

      The initial addition stage sums ROW#-1, ROW#0, and ROW#2, with the latter left-shifted by two positions to align with its binary weight. In the subsequent stage, the architecture sums the intermediate result from the first stage with ROW#4, which is left-shifted by four positions, along with the carry bits generated during the first stage, shifted left by one position. This iterative process continues systematically, where each successive stage accumulates the previous sum and carry results with the next even-indexed row at its respective bit alignment.

      Figure 5. ROW#0 calculation using F0, F01 and F02

      Similar structures can be used to calculate the other rows.

      To maintain precision, each partial product is shifted to its appropriate power-of-two significance within the array. This Carry Save Adder (CSA) topology defers final carry propagation to the last stage, simultaneously processing the previous sum, previous carry, and new partial product. By shortening the critical path relative to ripple-carry methods, this reduction tree supports the high-speed performance needed for real-time 32-bit floating-point multiplication.
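The carry-save principle described above, in which each stage takes the previous sum, the previous carry and one new row, and only the last stage propagates carries, can be sketched as follows. The function names are hypothetical, and the model uses non-negative Python integers in place of fixed-width adder rows.

```python
def carry_save_add(x: int, y: int, z: int):
    """One carry-save stage: three inputs in, sum word and carry word out.

    Every bit position acts as an independent full adder, so no carry
    ripples across the word inside a stage.
    """
    sum_word = x ^ y ^ z
    carry_word = ((x & y) | (y & z) | (x & z)) << 1   # carries move one bit left
    return sum_word, carry_word


def csa_accumulate(rows):
    """Accumulate rows stage by stage, deferring carry propagation."""
    sum_word, carry_word = 0, 0
    for row in rows:
        sum_word, carry_word = carry_save_add(sum_word, carry_word, row)
    # Single carry-propagating addition at the very end.
    return sum_word + carry_word


print(csa_accumulate([3, 5, 7, 9]))
```

The invariant at every stage is that sum_word + carry_word equals the running total, which is why the final addition yields the correct result.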

      Figure 6. ROW#2 calculation using F2, F21 and F22

      Figure 7. First stage addition block

      Figure 8. Second stage addition block

      Figure 9. Final stage addition block

  7. PROPOSED ARCHITECTURE OF FLOATING POINT MULTIPLIER

    The proposed architecture of the single precision floating point multiplier, built around a 24-bit multiplier using the Booth recoding algorithm, is given in Fig. 10.

    From the calculation perspective, the whole floating-point multiplication is divided into four sections:

    1. Sign section
    2. Exponent section
    3. Mantissa section
    4. Normalization section

    1. Sign section

      In the sign section, the sign bit of the final result is determined through a logical XOR operation applied to the sign bits of both input operands. If the signs differ, the result is negative, whereas identical signs yield a positive result. The outcomes of this operation are listed in the truth table in Table 5.

      SA SB Sign
      0 0 0
      0 1 1
      1 0 1
      1 1 0

      Table 5. Sign bit operation

    2. Exponent section

      In this section, the exponents of the two operands are added using an 8-bit ripple carry adder, as illustrated in Fig. 11. Following the addition, the bias of 127 is subtracted to obtain the final exponent. This subtraction is performed using the 2's complement method, so the same adder logic serves both the addition and the subtraction.

    3. Mantissa section

      The mantissa computation is the core performance bottleneck of the floating-point multiplier and requires a high-speed 24×24 bit binary multiplier to process the operands. This design uses the Booth recoding procedure to streamline the multiplication of the 23-bit mantissas (plus the implicit leading bit). By appending a sign bit to accommodate signed-number logic, the mantissas are converted into a Booth-encoded format that identifies repeating bit patterns. This encoding reduces the number of partial products, condensing multiple operations into larger, collective groups to accelerate the hardware execution. Finally, these partial products are accumulated, carefully maintaining their relative bit positions and signs, to produce the final product of the mantissa multiplication.
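The exponent path (ripple carry addition of EA and EB followed by subtraction of the bias of 127 in 2's complement form) can be modelled bit by bit. This Python sketch is illustrative only: the function names are hypothetical, the carry-out of the 8-bit adder is kept as a ninth bit (an assumption made here so the intermediate sum is not lost), and exponent overflow and underflow are not handled.

```python
def full_adder(a: int, b: int, carry_in: int):
    """One-bit full adder returning (sum, carry_out)."""
    s = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return s, carry_out


def ripple_add(x: int, y: int, width: int, carry_in: int = 0):
    """Ripple carry addition of two width-bit words."""
    result, carry = 0, carry_in
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry


def biased_exponent(ea: int, eb: int) -> int:
    """Compute EA + EB - 127 using adders only."""
    total, carry = ripple_add(ea & 0xFF, eb & 0xFF, 8)
    total9 = (carry << 8) | total            # keep the carry as bit 8
    # Subtract 127 by adding its 2's complement: invert the bits and
    # inject the +1 through the carry input.
    inverted_bias = ~127 & 0x1FF
    difference, _ = ripple_add(total9, inverted_bias, 9, carry_in=1)
    return difference


print(biased_exponent(127, 128))   # exponents of 1.x * 2^0 and 1.x * 2^1
```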

    4. Normalization section

The exponent and mantissa are normalized in the Normalization section. Normalization is decided by the 47th bit of the 48-bit product from the 24×24 bit binary multiplier. When the 47th bit is binary one, the 23-bit mantissa is taken from bit positions 46 down to 24 and the exponent is increased by one. When the 47th bit is binary zero, the 23-bit mantissa is taken from bit positions 45 down to 23 and the exponent is left unchanged.
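The bit selection reduces to a single test on bit 47. A minimal Python sketch follows; the function name `normalize` is hypothetical, and the discarded low-order bits are simply truncated, since the text describes no rounding step.

```python
def normalize(product48: int, exponent: int):
    """Select the 23-bit mantissa from a 48-bit product (bits 47..0)."""
    if (product48 >> 47) & 1:
        # Product is in [2, 4): take bits 46..24 and bump the exponent.
        mantissa = (product48 >> 24) & 0x7FFFFF
        exponent += 1
    else:
        # Product is in [1, 2): take bits 45..23, exponent unchanged.
        mantissa = (product48 >> 23) & 0x7FFFFF
    return mantissa, exponent


# 1.5 * 1.5 = 2.25 sets bit 47; 1.5 * 1.25 = 1.875 does not.
print(normalize(0xC00000 * 0xC00000, 127))
print(normalize(0xC00000 * 0xA00000, 127))
```

In both cases the leading 1 of the product is dropped, because it becomes the implicit bit of the result.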

  8. RESULTS

    The design was successfully synthesized for the Xilinx Virtex-7 FPGA (Device: xc7v585tffg1157-1) using the XST tool within the Xilinx 14.4 environment. Functional verification was carried out with a comprehensive test bench. The Booth recoding architecture streamlines hardware resource utilization by minimizing partial product generation. This reduction in complexity, combined with the handling of both signed and unsigned 32-bit operands according to the IEEE 754 standard, makes the implementation a reliable solution for high-speed, power-efficient digital signal processing applications.

    Table 6. FPGA hardware utilization

    Parameter                             Utilization
    Bonded IOB                            96
    Slice: SLICEL                         122
    Slice: SLICEM                         110
    LUT as Logic: using O6 output only    628
    LUT as Logic: using O5 and O6         157

    Figure 10. Proposed architecture of a single precision floating point multiplier (two 32-bit inputs, each with sign at bit 31, exponent at bits 30 to 23 and mantissa at bits 22 to 0; blocks: XOR, 8-bit ripple carry adder with bias of -127, 24×24 binary multiplier using Booth recoding algorithm, normalization unit; 32-bit output in the same sign, exponent and mantissa layout)

    Figure 11. Eight bit ripple carry adder

    Figure 12. Simulation waveform

    Figure 13. RTL schematic of single precision floating point multiplier

    Figure 14. On chip power (Dynamic: 65.072 W, 91%; within the dynamic power, Signals: 17.855 W (27%), I/O: 28.928 W (45%), remainder 28%)

  9. CONCLUSION

This paper presented the design and implementation of a high-efficiency 32-bit single-precision floating-point multiplier based on the IEEE 754 standard. By integrating the Booth recoding algorithm for mantissa multiplication, the architecture reduced the total number of partial products, leading to a more streamlined addition stage. The performance evaluation confirms that the proposed design optimizes hardware resource utilization while maintaining architectural integrity for floating-point arithmetic. The implemented unit provides a reliable solution for high-speed digital signal processing and embedded engineering applications where balanced power consumption and FPGA area efficiency are critical.

REFERENCES

  1. Saha, P., Banerjee, A., Bhattacharyya, P., and Dandapat, A. (2011, January). "High speed ASIC design of complex multiplier using Vedic mathematics". In Students' Technology Symposium (TechSym), 2011 IEEE (pp. 237-241). IEEE.
  2. Ankush Nikam, Swati Salunke, Sweta Bhurse. "Design and Implementation of 32bit Complex Multiplier using Vedic Algorithm", IJERT, March 2015, Vol. 4.
  3. M. Morris Mano, "Digital Design", 5th edition, Prentice Hall, 2002.
  4. Piyush Pati. "FPGA Implementation of Single Cycle Signed Multiplier using Booth Recoding Algorithm", IJERT, March 2023, Vol. 12, Issue 03.
  5. Prity Mishra, "FPGA Realization of Single-Cycle, 32-Bit Booth Recoding Signed Multiplier Enhanced by High-Speed Compressors", IJERT, March 2024, Vol. 13, Issue 03.
  6. IEEE. (2019). IEEE Standard for Floating-Point Arithmetic (IEEE Std 754-2019). IEEE.