 Authors : M. Pravallika, V. Vamsi Mohana Krishna
 Paper ID : IJERTV2IS70701
 Volume & Issue : Volume 02, Issue 07 (July 2013)
Published (First Online): 24-07-2013
ISSN (Online): 2278-0181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Design and Verification of a High-Speed and Efficient Asynchronous Floating-Point Multiplier

M. Pravallika (M.Tech)
Nimra College of Engineering & Technology,
Nimra Nagar, Jupudi, Vijayawada, Krishna (Dist) - 521456

V. Vamsi Mohana Krishna, M.Tech, (Ph.D)
Associate Professor, Nimra College of Engineering & Technology,
Nimra Nagar, Jupudi, Vijayawada, Krishna (Dist) - 521456
Abstract: We present the details of our energy-efficient asynchronous floating-point multiplier (FPM). We discuss design trade-offs of various multiplier implementations. A higher-radix array multiplier design with an operand-dependent carry-propagation adder and a low-handshake-overhead pipeline design is presented, which yields significant energy savings while preserving the average throughput. Our FPM also includes a hardware implementation of the denormal and underflow cases. When compared against a custom synchronous FPM design, our asynchronous FPM consumes 3X less energy per operation while operating at 2.3X higher throughput. To our knowledge, this is the first detailed design of a high-performance asynchronous IEEE-754-compliant double-precision floating-point multiplier.
Keywords: Floating-point arithmetic; asynchronous logic circuits; very-large-scale integration; pipeline processing

Introduction
Energy-efficient floating-point computation is important for a wide range of applications. Traditionally, VLSI designers relied primarily on CMOS technology scaling and voltage scaling to reduce power consumption. With the transistor threshold voltage fixed, VDD has been scaling very slowly, if at all, which means that all performance improvements now come at the cost of increased energy consumption. Furthermore, process variations in the deep-submicron range have made devices far less robust, and are increasingly making it difficult for synchronous designers to overcome the problems associated with clock skew and clock distribution. The findings of a recent in-depth study, conducted to explore and devise ways to further scale supercomputer petaFLOP performance by 1000X, indicate the inadequacy of current design practices and technologies to achieve the desired throughput within a sustainable power budget. This underscores a pressing need for alternate design practices that reduce energy consumption in advanced technology nodes. Floating-point operations also play a very important role in overall processor operation.
At the other end of the spectrum, embedded systems that have traditionally been considered low-performance are demanding higher and higher throughput for the same power budget to support compute-intensive floating-point applications that improve the user experience. Since these applications have to be deployed on portable devices with limited battery life, it is critical that we develop energy-efficient floating-point hardware for these embedded systems, not simply high-performance floating-point hardware.
1.1. IEEE 754 Standard for Floating Point
The IEEE 754 standard for binary floating-point arithmetic provides a precise specification of floating-point number formats, computation operations, and exceptions and their handling. The combination of a vast range of inputs, special cases, and rounding modes makes the hardware implementation of fully IEEE 754 compliant floating-point arithmetic a very challenging task. Ignoring certain aspects of the standard can lead to unexpected consequences in the context of numerical algorithms. The standard specifies two main floating-point formats: single precision and double precision. In this work, we primarily focus on the double-precision format since it is commonly used in most scientific and emerging applications. We introduce a number of microarchitectural and circuit-level optimizations to reduce the power consumption in the floating-point multiplier (FPM) data path. A floating-point multiplier consumes significantly more energy than a floating-point adder (FPA). This, combined with the observation that the frequency of floating-point multiplication operations in emerging applications is similar to that of floating-point additions, makes energy and power optimizations in the FPM data path essential for an efficient full floating-point unit (FPU) design.
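As a concrete illustration of the double-precision format described above, the field split can be sketched in a few lines of Python. This is our own illustrative helper (the name `unpack_double` is not part of the paper's design):

```python
import struct

def unpack_double(x: float):
    """Split an IEEE 754 double into its sign, biased exponent, and mantissa fields."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF       # 11-bit biased exponent (bias 1023)
    mantissa = bits & ((1 << 52) - 1)     # 52-bit fraction (implicit bit not stored)
    return sign, exponent, mantissa

# -1.5 = (-1)^1 * 1.5 * 2^0: sign 1, biased exponent 1023, fraction 0.5
sign, exp, man = unpack_double(-1.5)
```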

Background and Related Work
In terms of microarchitectural complexity, the floating-point multiplier (FPM) data path is simpler than the FPA data path. The FPM data path for the double-precision multiplication operation is shown in Figure 2.1. The double-precision inputs into the data path, A and B, each comprise 1 sign bit, 11 exponent bits, and 52 mantissa bits (the mantissa is also known as the significand).
Fig. 2.1: Floating-Point Multiplier Data Path.
The following summarizes the key steps in an IEEE-compliant FPM data path:
1) The first step in the FPM data path is to unpack the IEEE representation and analyze the sign, exponent, and mantissa bits of each input to determine whether the inputs are standard normalized numbers or are of one of the special types (NaN, infinity, denormal).
2) The mantissa bits are extended with the implicit bit. It is set to one for normal inputs and zero for a denormal input.
3) The 53-bit mantissas of both inputs are used to generate partial products corresponding to the 106-bit product. Since high throughput and low latency are of the essence in floating-point applications, most FPMs use some form of array multiplier, such as a Booth-encoded multiplier as shown in Figure 1, to meet the performance demands. Most array multipliers employ an array of carry-save adders (CSAs) to reduce the large number of partial products to two full-product-length bit streams.
4) The most significant 53 bits of the two output bit streams from the CSA array are summed using a carry-propagation adder (CPA) to generate a 53-bit mantissa. The least significant 53 bits are used to generate the carry input to the CPA as well as to compute the guard, round, and sticky bits used in post-normalization rounding. In parallel, the exponent logic computes the resulting exponent, which is the sum of the exponent values of both inputs minus the bias. The bias has a value of 1023 for double-precision operations. The sign of the final product is also computed.
5) The post-multiplication step includes normalization of the 53-bit mantissa. For normal inputs and non-underflow cases, either the mantissa is already normalized or it may require a right shift by a single bit position, in which case the exponent is adjusted, in parallel, by adding one to it. The guard, round, and sticky bits are updated and are used, along with the rounding mode, to determine whether the product needs to be rounded.
6) In case of rounding, the mantissa is incremented by one. If rounding yields a carry out, the exponent is adjusted by adding one to it and the mantissa is right-shifted by one bit position.
7) The final stage checks for a NaN, infinity, or denormal outcome before outputting the correct result in the IEEE format.
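The arithmetic core of the steps above can be sketched as a small software reference model. This is an illustrative sketch only: it covers steps 2 through 5 for normal, non-underflow inputs, truncates rather than rounds, and omits the special-case handling of steps 1, 6, and 7. The helper name `fpm_core` is ours:

```python
BIAS = 1023  # double-precision exponent bias

def fpm_core(sign_a, exp_a, man_a, sign_b, exp_b, man_b):
    """Bit-level sketch of the FPM data path for normal inputs (no rounding)."""
    # Step 2: prepend the implicit bit (one for normal inputs).
    ma = (1 << 52) | man_a
    mb = (1 << 52) | man_b
    # Step 3: 53x53 multiplication yields a 106-bit product.
    product = ma * mb
    # Step 4: resulting exponent is the sum of exponents minus the bias;
    # the sign is the XOR of the input signs.
    exp = exp_a + exp_b - BIAS
    sign = sign_a ^ sign_b
    # Step 5: the product lies in [2^104, 2^106); at most a single-bit
    # right shift normalizes it, with the exponent adjusted in step.
    if product >> 105:
        product >>= 1
        exp += 1
    mantissa = (product >> 52) & ((1 << 52) - 1)  # drop the implicit bit
    return sign, exp, mantissa
```

For example, 1.5 × 1.5 (both operands sign 0, biased exponent 1023, fraction one-half) yields 2.25, i.e. biased exponent 1024 with fraction 0.125.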
With the normalization step limited to a simple shift of no more than one bit position and the exponent logic comprising only 11-bit arithmetic, the FPM's complexity is largely a function of its 53×53 multiplier, its sticky-bit computation block, and the final carry-propagation adder. We present various structural and circuit-level optimization techniques to reduce the complexity and power consumption footprint of these logic blocks.

Asynchronous Multipliers and Floating-Point Arithmetic
In terms of multiplier design, the variable-delay nature of iterative multipliers makes them a popular choice amongst asynchronous designers. An iterative multiplier utilizes a few functional units repeatedly to produce the result. In a simple iterative n-by-n multiplier implementation, where n is the number of bits, the product is computed after n iterations. Each iteration comprises at minimum an n-bit addition and a serial shift by one bit position. Furber et al. proposed a low-power integer multiplier which exploits the commonly occurring pattern of a low number of significant bits in integer inputs as a means to reduce the total number of iterations. These iterative multiplier designs, though compact in terms of area, are not feasible for floating-point multiplier hardware due to their very high latency and low throughput, and due to the fact that, unlike the inputs in integer arithmetic, the most significant bits of floating-point mantissa inputs are nonzero.
Joel Noche et al. used asynchronous circuits in their design of a single-precision FPU. However, their FPU is completely non-pipelined, does not include any energy optimization techniques, and does not implement rounding logic. Their FPU has many orders of magnitude higher latency compared to all recent floating-point designs from the synchronous domain. Sheikh et al. employed fine-grain asynchronous circuit techniques for various operand-dependent optimizations to reduce average-case power consumption in the FPA data path. However, their work is restricted to FPA design only.

Synchronous Floating-Point Multipliers
There is a large body of work on synchronous FPM design. Ercegovac and Lang provide an overview of the different techniques used to optimize floating-point multiplication. The focus of prior work has been the array multiplier block, which is the single largest logic structure within the FPM data path. Earlier designs have employed various architecture- and circuit-level optimizations to reduce array multiplier latency and increase its throughput [18, 20, 22, 28]. However, there is relatively little work on improving the energy efficiency of the multiplier data path, which is one of our primary contributions.

Floating-Point Multiplier and Power Breakdown
We use quasi-delay-insensitive (QDI) asynchronous circuits for our baseline FPM design. The fine-grain asynchronous precharge-enable-half-buffer (PCeHB) pipelines in our design contain only a small amount of logic (e.g., a two-bit full adder). The actual computation is combined with data latching, which removes the overhead of explicit output registers. This pipeline style has been used in previous high-performance asynchronous designs, including a fully implemented and fabricated asynchronous microprocessor.
Unlike the FPA data path, where total power is distributed roughly evenly amongst a number of different logic blocks, the FPM's complexity is largely a function of its 53×53 multiplier. This is highlighted by the power breakdown estimates of our baseline, fully QDI FPM data path: the Booth-encoded array multiplier accounts for roughly 76% of the total power consumption. Hence, in this work, we primarily focus on reducing the energy and power of the array multiplier block.
The Front-End/Exponent block corresponds to the logic that unpacks the IEEE-format inputs and analyzes the sign, exponent, and mantissa bits of each input to determine whether the inputs are standard normalized numbers or are of one of the special types (NaN, infinity, denormal). It also includes the logic to compute the resultant exponent of the FPM product. The array multiplier outputs two 106-bit streams. The most significant 53 bits of the two output bit streams from the array multiplier are summed using a carry-propagation adder (CPA) to generate a 53-bit mantissa. The least significant 53 bits are used to generate the carry input to the CPA as well as to compute the guard, round, and sticky bits used in post-normalization rounding. The sticky-bit computation block and the final carry-propagation adder are the other power-consuming structures within the FPM data path. In this work, we present various structural and circuit-level optimization techniques to reduce the complexity and power consumption footprint of the aforesaid logic blocks.

Multiplier Design Tradeoffs
The choice of a particular multiplier design depends on a number of factors. These include the desired throughput, overall latency, circuit complexity, and the allowed power budget. Traditionally, high performance has been the key driving factor in multiplier design. However, as power consumption has lately become a major design constraint, a number of low-power multiplier designs have been proposed in both the synchronous and asynchronous domains.

Iterative Multipliers
Iterative multipliers represent a low-complexity design choice. An iterative multiplier utilizes a few functional units repeatedly to produce the result. Iterative multipliers can be used to reduce energy consumption by exploiting input data patterns: stages which add zero to the partial product can be detected in advance and skipped, reducing both delay and energy consumption. Though compact in terms of area, iterative multipliers are not feasible for floating-point multiplier hardware due to their very high latency and low throughput.
Reduction in the total number of partial products is the key goal of all multiplier optimization techniques, as it helps to reduce both latency and energy consumption. Along these lines, Efthymiou et al. [7] proposed an asynchronous multiplier implementation based on the original Booth algorithm [3]. Their design scans the multiplier operand and skips chains of consecutive ones or zeros. This can greatly reduce the number of partial product additions required to produce the product. The downside is that it requires a variable-length shifter to correctly align the multiplicand when generating each partial product row. The effectiveness of this algorithm for high-performance FPM hardware depends on the number of variable-length shifts, which in turn depends on the number of partial product rows that are to be generated.
Our application profiling results for a number of scientific and emerging floating-point applications, using Intel's PIN [14] toolkit, indicate that although the original Booth algorithm is able to reduce the number of partial products from the maximum of 27, a sufficiently large number of partial product rows, more than 18 on average, still need to be generated, each of which requires the use of the variable-length shifter. The latency overhead of such a large number of variable-shift operations is too costly for any high-performance FPM design. Hence, we did not pursue this algorithm any further.
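The chain-skipping idea can be illustrated by counting the non-zero partial products that original (radix-2) Booth recoding produces for a given multiplier operand. The helper below is our own sketch, not the cited implementation; it relies on the fact that the non-zero Booth terms correspond to the 0-to-1 and 1-to-0 transitions of the operand scanned LSB-first:

```python
def booth_term_count(x: int, n: int = 53) -> int:
    """Count the non-zero partial products of original (radix-2) Booth
    recoding, which replaces each run of consecutive ones by one
    subtraction at its start and one addition just past its end."""
    prev = 0
    count = 0
    for i in range(n):
        bit = (x >> i) & 1
        if bit != prev:      # each transition contributes one Booth term
            count += 1
        prev = bit
    if prev:                 # a run of ones reaching the MSB needs a closing term
        count += 1
    return count

# A dense run like 0b0111 (= 8 - 1) needs only two Booth terms,
# while an alternating pattern like 0b1010 needs four.
```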

Array Multipliers
Array multipliers are the common choice for high-throughput and low-latency multiplication in most commercial FPM designs [20, 26]. They produce a predetermined, fixed number of partial products, which greatly minimizes, if not fully eliminates, the opportunities for exploiting data-dependent optimizations. For example, introducing logic to bypass a zero partial product instance may add the same amount of delay as summing the extra term in a carry-save adder (CSA) used to reduce the partial product terms. As array multipliers present very limited opportunities for data-dependent optimizations, there has not been much work on asynchronous array multiplier solutions. The simplest implementation of an n-by-n array multiplier produces n partial products in parallel, which are then summed using CSAs. The large number of partial products makes this simple design infeasible from both latency and power consumption perspectives. As a result, many advanced multiplier implementations from academia [21] and industry [20, 26, 28] use some form of the radix-4 modified Booth algorithm, which cuts the number of partial products to n/2. The reduction in the number of partial products yields significant savings in energy consumption, latency, and total transistor count.
For a 53×53-bit multiplier in an FPM data path, with inputs Y and X, the radix-4 Booth-encoded algorithm produces 27 partial products. Each of the Y and X inputs is in radix-4 format. The multiplier bits, X, are used to generate Booth control signals for each partial product row. One of the big advantages of radix-4 Booth multiplication is the relative simplicity of the logic which generates the partial product rows. The only multiples of the multiplicand that are needed are 0, Y, and 2Y. The partial product term Y is generated by simply assigning it the multiplicand. The 2Y multiple can be generated with relative ease by assigning it the one-bit left-shifted value of the multiplicand. Bitwise inversion is used to generate complemented multiples. To reduce these 27 partial product rows to two rows, a reduction tree comprising 7 stages of 3:2 counters/carry-save adders (CSAs) is usually employed. The energy consumption of the multiplier array is directly correlated with the number of partial product terms: with more partial product terms, more logic is needed, first to produce those terms and then to sum and reduce them using the reduction tree. To further improve energy efficiency, one alternative is a radix-8 Booth-encoded multiplier, which reduces the number of partial product rows from 27 down to 18. The biggest disadvantage of a radix-8 multiplier is that it requires a 3Y multiple, which needs a full-length carry-propagation addition to compute. Since the 3Y multiple must be available before any partial product term is computed, a tree adder topology, such as a hybrid Kogge-Stone carry-select adder, must be used to minimize any latency degradation in a synchronous design.
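A software sketch of the radix-4 Booth recoding described above may help fix the idea. This is our own illustration (the helper name `booth4_digits` is hypothetical): each signed digit in {-2, -1, 0, 1, 2} selects one of the 0, ±Y, ±2Y multiples, and a 53-bit multiplier yields 27 digits, i.e. 27 partial product rows:

```python
def booth4_digits(x: int, n: int = 53):
    """Radix-4 (modified) Booth recoding: scan overlapping 3-bit windows
    of the multiplier and emit signed digits d = b[i] + b[i-1] - 2*b[i+1].
    For n = 53 this yields 27 digits, matching the 27 partial products."""
    x &= (1 << n) - 1
    ext = x << 1                       # implicit 0 below the LSB
    digits = []
    for i in range(0, n + 1, 2):
        w = (ext >> i) & 0b111         # window (b[i+1], b[i], b[i-1])
        d = ((w >> 1) & 1) + (w & 1) - 2 * ((w >> 2) & 1)
        digits.append(d)
    return digits

# Sanity check: the digits reconstruct the operand as sum(d_k * 4^k).
value = sum(d * 4**k for k, d in enumerate(booth4_digits(53)))
```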
Table I compares three different radix implementations of a 53×53-bit multiplication unit in terms of the total partial product bits and the number of logic stages required to reduce the partial product rows to two rows. A radix-8 Booth-encoded implementation produces 62.4% and 31.3% fewer partial product bits than bitwise radix-2 and Booth-encoded radix-4 multipliers, respectively. But in terms of latency, a radix-4 implementation needs only one extra logic stage compared to a radix-8 version, because the partial product terms are summed and reduced using CSAs in a tree structure, which has logarithmic logic depth. This gives a radix-8 multiplier only a single-logic-stage cushion to compute the difficult 3Y multiple. Hence, for any radix-8 Booth multiplier to be considered a viable alternative, it must provide a very low-latency 3Y computation unit with energy consumption significantly lower than the savings attained from using 31.3% fewer partial product bits. The use of power-intensive tree adders greatly diminishes the savings that result from the reduction in the number of partial product terms. As a result, radix-8 multipliers are not commonly used in synchronous FPM implementations.
TABLE I: 53×53-bit Array Multiplier Comparison

Multiplier Type   | Partial Product Bits | Reduction Stages
Radix-2 Bitwise   | 2809                 | 9
Radix-4 Booth     | 1539                 | 7
Radix-8 Booth     | 1056                 | 6


3Y Adder
The highly operand-dependent nature of the 3Y multiple computation makes it a strong potential target for asynchronous circuit optimizations. The application profiling results in Figure 6.1 show that the longest carry chain in a radix-4 3Y ripple-carry addition is limited to 3 ripple positions for over 90% of the operations across most floating-point application benchmarks. The delay of an adder depends on how fast the carry reaches each bit position. For input patterns that yield such small carry-chain lengths on average, we need not resort to an expensive tree adder topology designed for the worst-case input pattern of a carry propagating through all bits.
Fig. 6.1: Radix-4 3Y Adder Longest Carry Length.
The interleaved adder topology provides an energy-efficient solution for computing the bottleneck 3Y multiple terms required in radix-8 Booth multiplication. It comprises two 53-bit radix-4 ripple-carry adders, where each 3Y block shown in Figure 4 computes the 3Y multiple for the corresponding Y input. The first-arriving data tokens, YR, are forwarded to the right 3Y adder. In standard PCeHB reshuffling, the interleave split stage has to wait for the acknowledge signal from the ripple-carry adder before it can enter the neutral state and accept new tokens. However, this would cause the pipeline to stall in the case of a long carry chain. The interleaved adder topology circumvents this problem by instead issuing the next-arriving data tokens to the left 3Y adder. Hence, the two ripple-carry adders can be in operation at the same time on different input operands. The interleave merge stage receives outputs from both right and left adders and forwards them to the next stage in the same interleaved order. With our pipeline cycle time of approximately 18 logic transitions (gate delays), the next data tokens for the right adder are scheduled to arrive 36 transitions after the first ones. This gives ample time for the quite rare inputs with very long carry chains to ripple through without causing any throughput stalls.
For the input patterns observed in our various floating-point application benchmarks, the forward latency of computing the 3Y term using the interleaved adder is less than that attained with the power-intensive tree adders frequently used in synchronous designs to guarantee low-latency computation. Compared to a 53-bit hybrid Kogge-Stone carry-select tree adder implementation, the interleaved adder consumes approximately 68.1% less energy at 8.3% lower latency for the average-case input patterns shown in Figure 6.2. We exploit this data-dependent adder design topology, which is not possible within the synchronous domain, to design an energy-efficient radix-8 Booth-encoded multiplier for our asynchronous FPM data path.
Fig. 6.2: Radix-4 3Y Adder Longest Carry Length.
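The data dependence that the interleaved adder exploits can be illustrated in software: 3Y is simply Y + 2Y, and the profiled quantity is the longest run of positions a carry ripples through during that addition. The helpers below are our own sketch (names hypothetical); they count binary positions, whereas the paper's Figure 6.1 profiles radix-4 positions, roughly half this count:

```python
def three_y(y: int) -> int:
    """The 3Y multiple needed by radix-8 Booth: Y plus its left shift."""
    return y + (y << 1)

def longest_carry_chain(a: int, b: int, n: int = 55) -> int:
    """Longest run of bit positions a single carry ripples through
    when computing a + b with a ripple-carry adder."""
    carry, run, best = 0, 0, 0
    for i in range(n):
        x, y = (a >> i) & 1, (b >> i) & 1
        p, g = x ^ y, x & y
        if carry and p:
            run += 1          # incoming carry propagates through this position
        elif g:
            run = 1           # a new carry is generated here
        else:
            run = 0           # carry chain (if any) is killed
        best = max(best, run)
        carry = g | (p & carry)
    return best

# Longest binary carry chain in the 3Y addition for a sample operand:
chain = longest_carry_chain(7, 7 << 1)
```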

Pipeline Design
Although the radix-8 multiplier reduces the number of partial product bits by 31.3% compared to a radix-4 implementation, it still needs to produce and sum over 1050 partial product bits. As discussed by Sheikh et al. [25], standard PCeHB pipelines, though very robust, consume considerable power in handshake circuitry, which gets worse as the complexity of the PCeHB templates increases with more input and output bits. The handshake overhead in a two-bit full-adder PCeHB pipeline implementation is as high as 69% of the total power consumption. Therefore, for circuits with a large number of inputs and intermediate and final outputs, such as a multiplier array, PCeHB pipelines represent a non-optimal choice from an energy efficiency perspective.
We use N-Inverter pipeline templates, first proposed in prior work, to implement the multiplier array. An N-Inverter pipeline reduces the total handshake overhead by packing multiple stages of logic computation within a single pipeline block, in contrast to a PCeHB template, which contains only one effective logic computation per pipeline stage. The handshake complexity is amortized over a large number of computation stacks within the pipeline stage. Sheikh et al. showed that, compared to a PCeHB pipelined implementation, N-Inverter pipelines can reduce the overall energy consumption by 52.6% while maintaining the same throughput. These improvements come at the cost of some timing assumptions and require the use of a single-track handshake protocol. The design trade-offs associated with N-Inverter templates are discussed extensively in prior work.
The block-level pipeline breakdown of our radix-8 multiplier array is depicted in Figure 7.1. The granularity at which the array is split is critical from both performance and energy efficiency perspectives. The N-Inverter templates allow us to pack considerable logic within each stage, which helps to reduce the handshake-associated power consumption significantly. However, as the number of logic computations within a pipeline block increases, so does the number of outputs. With more outputs, although the number of transitions per pipeline cycle remains the same with the use of wide-NOR completion detection logic, each of these transitions incurs a higher latency. The choice of 8×4 pipeline blocks, with 15 outputs per stage, was made to provide a good balance of low power and high throughput.
Fig. 7.1: Radix-8 Multiplier.
The pipeline block labeled 8×4 Sign is identical to an 8×4 block except that it includes a sign bit for each partial product row. The sign bit acts as an input of one in the least significant position for any of the cases involving a complemented partial product multiple of -Y, -2Y, -3Y, or -4Y. The pipeline blocks labeled 10×4 Sign Ext are similar in design to the common 8×4 block, except that they provide the sign extension bits required for supporting complemented multiples. The 8×2 block is a reduced version of an 8×4 block with only two Booth rows. The similarity between these different pipeline blocks and the frequent use of the 8×4 pipeline block provides great design modularity, which helped to reduce the overall design effort required to optimize the multiplier array for throughput and energy efficiency. Due to this similarity, we only present the details of the 8×4 block. Each 8×4 pipeline block receives Booth-control, Y, and 3Y input tokens. The eight bits of the Y and 3Y inputs are encoded as four 1-of-4 tokens each. Figure 7.2 shows the intermediate and final logic outputs within an 8×4 pipeline, along with the corresponding mapping of these outputs to a simplified circuit-level depiction of an N-Inverter pipeline template. The NMOS stacks in the first stage compute four rows of eight-bit partial product terms in inverted sense. These inverted outputs drive the inverters in the second stage of the pipeline block to produce the corresponding partial product (PP) outputs. The next stage of NMOS stacks implements carry-save addition logic to sum and reduce these four rows of partial products to two rows of inverted sum and carry outputs. These inverted outputs drive the PMOS transistors in the last stage to produce sum and carry outputs, SS and CC, in the correct sense for the following pipeline blocks.
Fig. 7.2: 8×4 Multiply Logic Block.
For array multiplication, all pipeline blocks have to be in operation in parallel. The parallel operation requires multiple copies of the input tokens to be consumed simultaneously by multiple pipeline blocks. For example, each Booth control token is required in seven different pipeline blocks. To facilitate this, we include multiple copy stages prior to initiating the array computation. These copy blocks generate the desired number of copies of each input token. These tokens are then forwarded to the pipeline blocks, which consume them to produce sum and carry outputs.
Since the final carry-propagation adder uses a four-phase handshake protocol, the output tokens from the reduction tree are converted back to the four-phase protocol. We hide the latency of this conversion stage by implementing the final stage of the reduction tree within these conversion templates.
The energy, latency, and throughput estimates of FPM implementations with radix-4 and radix-8 array multipliers are presented in Figure 7.3. The results are normalized to the FPM data path with a radix-4 multiplier. The 31.3% reduction in the number of partial product bits translates into a 19.8% reduction in energy per operation. But this improvement in energy efficiency comes at the cost of a 5.9% increase in FPM latency, because the 3Y partial product needs to be determined prior to initiating the multiplier array logic. A part of the 3Y computation latency is masked within the Booth control token-generation and copy pipelines. Since the radix-4 multiplier requires one extra computation stage in the reduction tree compared to a radix-8 implementation, the latency overhead of the 3Y computation can be further hidden. The 5.9% latency increase is attributed to the part of the 3Y multiple computation which is not masked. Despite the increase in latency, the throughput of both implementations remains the same due to sufficient slack availability within the interleaved 3Y computation block. The choice of a particular multiplier implementation represents a design trade-off. Since our goal was to optimize for energy consumption and throughput, we chose the radix-8 multiplier implementation for our final FPM design.

Sticky-Bit Logic and Carry-Propagation Adder
The multiplier array outputs two rows of 106-bit partial sum and carry terms. The next step is to compute the 53-bit mantissa of the FPM output. This requires the summation of the most significant 53 bits of the two incoming partial sum and carry terms using a carry-propagation adder (CPA). The least significant 53 bits of the partial sum and carry terms are needed to compute the carry input into the CPA as well as the guard, round, and sticky terms required during the rounding step.

Carry Computation and Sticky-Bit Logic
The multiplier array requires relatively few summation steps to produce its least significant output bits. This is because there are fewer partial product terms to be summed: each successive partial product row is skewed by three bit positions from the previous one in radix-8 multiplication. As a result, the least significant bits are available earlier than the rest of the multiplier array outputs. We take advantage of our fine-grain pipelining by initiating the carry computation as soon as the least significant bits arrive. Furthermore, the application profiling results in Figure 8 show that for over 90% of operations across all applications, the longest ripple-carry length to compute the carry input term is less than four radix-4 bit positions. These average-case patterns indicate that the carry term can be computed well in time for the CPA operation, alleviating the need for the speculative CPA implementations usually found in high-performance synchronous FPMs.
Fig. 7.3: Radix-4 Multiplier vs. Radix-8 Multiplier.
The microarchitecture of the carry and sticky-bit computation is depicted in Figure 7.4. It uses the interleaved split and merge pipelines first introduced with the design of the interleaved adder. The inputs A and B in Figure 7.4 are in one-of-four encoded format and correspond to the 52 least significant bits of the partial sum and carry output terms from the multiplier array. The odd data tokens are sent on the output channels labeled with the R prefix, while the next-arriving even data tokens are sent on channels with the L prefix. Each Carry Sticky block computes the carry and sticky-bit terms at that bit position. With carry-chain lengths of less than four, as seen in Figure 8, the final carry term is computed within four logic levels on average. The odd tokens are used to compute the carry term Cin R, used as the carry input in the odd ripple-carry adder of our interleaved CPA, whereas the next-arriving even data tokens compute the carry term Cin L, used as the carry input in the even ripple-carry adder of our interleaved CPA topology.
Fig. 7.4: Interleaved topology to compute sticky bit and carry input.
For the sticky-bit computation, we use a parallel tree topology which combines bitwise sticky-bit values to compute the final sticky bit. A ripple-flow architecture similar to the one used to compute the carry input term was deemed infeasible, as it consistently yielded long ripple chains, which caused throughput degradation. Our interleaved topology prevents throughput degradation only up to ripple lengths of 14 bit positions, and the application profiling results yield ripple lengths of 15 or more quite frequently. The sticky bit is set to one if any of the bits is one, but for it to be set to zero, all prior bits in the sequence must be confirmed zero. This is what causes the long ripple chains and renders a ripple-flow design infeasible.
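For reference, the guard, round, and sticky bits of a 106-bit product whose mantissa occupies the top 53 bits can be sketched as follows. This is our own illustration with the pre-normalization bit positions assumed (guard at bit 52, round at bit 51, sticky as the OR of everything below):

```python
def grs_bits(product: int):
    """Guard, round, and sticky bits of a 106-bit product whose
    mantissa is taken from bits 105..53 (pre-normalization case)."""
    guard = (product >> 52) & 1                      # first bit below the mantissa
    rnd = (product >> 51) & 1                        # next bit down
    sticky = int((product & ((1 << 51) - 1)) != 0)   # OR of all remaining low bits
    return guard, rnd, sticky
```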


53-bit Carry-Propagation Adder
We harness the timing flexibility of our underlying asynchronous circuits by using the interleaved adder topology for the 53-bit carry-propagation adder design. The interleaved adder comprises two ripple-carry adders. The adder topology is identical to the one used earlier for the 3Y multiple computation. Our choice of the interleaved adder was made on the basis of application profiling results, which indicate very small carry-chain lengths on average across all application benchmarks. It yields average throughput similar to that attained with expensive tree adder designs while consuming up to 4X less energy per operation.

Denormal, Underflow, and Zero Input
While discussing the various trade-offs involved in the FPM data path design, we have so far ignored certain special cases specified in the IEEE format [19]. Two of these special cases, denormal numbers and underflow, represent the most difficult operations to implement in an FPM data path. The scenarios under which these two special cases arise, and the tasks that need to be performed, are summarized as follows:
1) One of the FPM inputs is a denormal number, which yields a mantissa with zeroes in its most significant bit positions. If the non-bias exponent for the product is greater than the minimum value of one, the product needs to be left shifted, while decrementing the exponent, until it is normalized or the exponent reaches the value of one. We refer to this scenario as the Denormal case.
2) One of the FPM inputs is a denormal number, or both FPM inputs are very small numbers, and the resulting exponent is less than the minimum value of one. In this case, the mantissa needs to be right shifted. The right-shift amount is equal to the difference between the minimum value and the resulting exponent, or an amount that zeroes out the mantissa, whichever of the two is smaller. We refer to this scenario as the Underflow case.
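The two alignment cases above can be summarized in a small behavioral sketch (a hypothetical helper operating on the raw 106-bit product, not the paper's serial-shifter hardware):

```python
E_MIN = 1  # minimum exponent value for a normalized result

def align_product(mantissa, exp, width=106):
    """Behavioral sketch of denormal/underflow alignment of a raw
    double-precision product. Returns the aligned mantissa/exponent."""
    # Denormal case: leading zeros while the exponent can still be
    # decremented -- left shift until normalized or exp hits E_MIN.
    while exp > E_MIN and not (mantissa >> (width - 1)) & 1:
        mantissa = (mantissa << 1) & ((1 << width) - 1)
        exp -= 1
    # Underflow case: exponent below the minimum -- right shift by the
    # deficit, or until the mantissa zeroes out, whichever is smaller.
    if exp < E_MIN:
        shift = min(E_MIN - exp, width)
        mantissa >>= shift
        exp = E_MIN
    return mantissa, exp
```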
The need for variable left-shift and right-shift logic blocks makes hardware support for the denormal and underflow cases expensive. The infrequent occurrence of these special-case inputs, together with the extensive hardware complexity required to support them, has meant that many FPM designs do not fully support these operations in hardware. Instead, these operations are implemented in software via traps, which yields very long execution times. It also means that the FPM hardware is no longer fully IEEE compliant.
We use serial shifters to provide full hardware support for these special-case inputs. Using conditional split pipelines, the output bits from the CPA are directed to either the Normal or the Denormal/Underflow logic path. The Normal data path includes a single-bit normalization shift block and rounding logic. The Denormal/Underflow unit comprises serial left and right shift blocks and a combined rounding block. For input tokens diverted to the Normal data path, no dynamic power is consumed within the Denormal/Underflow block; likewise, for input tokens headed for the Denormal/Underflow block, there is no dynamic power consumption in the Normal data path. In contrast, a synchronous design requires significant control overhead to attain such fine-grain clock gating.
Once the mantissa has been correctly aligned using the variable left or right shift block, a subsequent rounding operation may be required to increment the 53-bit mantissa by one. We utilize ripple-carry 1-of-4 encoded increment logic to implement rounding. A more expensive increment logic topology would have been futile, since the output from the variable shift blocks arrives in bit-wise fashion. The rounding logic is shared between the Denormal and Underflow data paths, as shown in Figure 10, to further minimize the area overhead of supporting these special-case operations. The Rnd block receives incoming guard, round, sticky, and rounding mode bits from both special-case data paths and selects the correct set of inputs to determine whether or not to increment the mantissa.
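The rounding decision driven by the guard, round, and sticky bits can be sketched as follows. This shows only the round-to-nearest-even mode and uses illustrative names rather than the Rnd block's actual interface:

```python
def round_increment(lsb, guard, round_bit, sticky, mode="nearest_even"):
    """Decide whether to increment the 53-bit mantissa, given the LSB of
    the kept mantissa and the guard/round/sticky bits below it. Only
    round-to-nearest-even is sketched; other IEEE modes are omitted."""
    if mode == "nearest_even":
        if guard and (round_bit or sticky):
            return 1        # discarded part is clearly above halfway
        if guard:
            return lsb      # exact halfway tie: round to even
        return 0            # discarded part is below halfway
    raise NotImplementedError(mode)
```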
Fig. 9.1: Unified rounding hardware for denormal/underflow cases.
Prior to the final Pack pipeline, there is a merge pipeline stage, which selects the output from either the Normal or the Denormal/Underflow data path. Since these special-case inputs occur very infrequently, as shown in Figure 11, the throughput degradation due to the use of serial shifters does not affect the average FPM throughput.

Zero-Input Operands
The operand profile of floating-point multiplication instructions reveals that a few application benchmarks have a significant proportion of zero input operands. These are primarily applications with sparse matrix manipulations, such as 447.dealII and 437.leslie3d, despite their use of specialized sparse matrix libraries. For the other benchmarks, the zero-input percentage varies widely, as shown in Figure 9.1. In most state-of-the-art synchronous FPM designs that we came across, zero input operands flow through the full FPM data path; they incur similar latency and consume the same power as any other nonzero operand computation. This is highly suboptimal: if one or both of the FPM operands are zero, the final zero output could be produced much earlier and at much lower energy consumption by skipping the most compute-intensive, power-consuming logic blocks, such as the multiplier array, carry-propagation adder, normalization, and rounding unit.
Fig. 9.1: Operand profile of floating-point multiplication instructions.
We provide a zero bypass path in the FPM data path to optimize its latency and energy consumption in the case of zero operands. To activate the bypass path, the FPM utilizes the zero flag control output from the Unpack stage, which checks whether either input operand is zero. However, this information is not available before the start of the pipeline stages pertaining to Booth control and 3Y multiple generation. One possible solution was to delay these pipeline stages until the zero flag is computed and then use it to divert the tokens to either the regular or the bypass path. Since this solution incurs a latency hit for nonzero operands, it was discarded. In our design, instead of delaying the multiplier array, we inhibit the flow of tokens much deeper in the data path. As a result, the energy footprint of zero operand computations includes the overhead of computing the Booth control token as well as some parts of the 3Y multiple computation. This still yields roughly an 82% reduction in energy consumption for each zero operand computation, while preserving the same latency and throughput for nonzero operand operations.
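The bypass control can be sketched behaviorally. The zero detection below is the standard IEEE 754 test (all exponent and fraction bits clear); the routing and pipeline names are illustrative, not the paper's circuit interfaces:

```python
SIGN = 1 << 63
EXP_FRAC = SIGN - 1   # exponent + fraction bits of a 64-bit double

def fpm_with_bypass(a, b, multiply_pipeline):
    """Sketch of the zero-bypass idea: a zero flag from the Unpack
    stage routes zero-operand tokens around the multiplier array, CPA,
    normalization, and rounding, emitting a correctly signed zero."""
    sign = (a ^ b) & SIGN            # sign of the product
    if (a & EXP_FRAC) == 0 or (b & EXP_FRAC) == 0:
        return sign                  # bypass path: signed zero result
    return multiply_pipeline(a, b)   # regular data path

one = 0x3FF0000000000000             # IEEE 754 bit pattern of 1.0
neg_zero = 1 << 63
assert fpm_with_bypass(neg_zero, one, lambda a, b: None) == neg_zero
```

In the real data path the flag merely inhibits token flow rather than short-circuiting a function call, but the energy argument is the same: the expensive blocks never switch for zero operands.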


Floating-Point Multiplier


This section presents the SPICE simulation results of our proposed FPM data path. The transistors in the FPM were sized using standard transistor sizing techniques. To meet high performance targets and to minimize charge-sharing problems, each NMOS stack was restricted to a maximum of four transistors in series. Since HSIM/HSPICE simulations do not account for wire capacitances, we included an additional wire load equivalent to a wire length of 8.75 µm in the SPICE file for every gate in the circuit. Our simulations use a 65nm bulk CMOS process at 1V nominal VDD and the typical-typical (TT) process corner.
For nonzero operands, the FPM registers a highest throughput of 1.53 GHz. In applications with a considerable percentage of zero operands, the average FPM throughput rises to as high as 1.78 GHz, since zero-input operations skip the throughput-constraining N-Inverter templates in the multiplier array. The FPM energy-per-operation results across all application benchmarks are shown in Figure 10.1. Applications with considerable zero-input operands consume significantly less energy per operation, as zero-input operations skip various logic blocks.

In Table II, we compare our proposed asynchronous FPM design against a custom FPM design by Quinnell et al. in a 65nm SOI process at 1.3V nominal VDD. The energy, throughput, and latency results include only nonzero operand operations in order to provide the worst-case comparison. Despite using a 65nm bulk process, our FPM design consumes 3X less energy per operation while operating at 2.3X higher throughput.

TABLE II: ASYNCHRONOUS FPM VS. SYNCHRONOUS FPM

Design        | Energy/op | Throughput | Latency @ 1.3V
Proposed FPM  | 92.1 pJ   | 1.53 GHz   | 705 ps
Quinnell FPM  | 280.8 pJ  | 666 MHz    | 701 ps

For the frequently occurring zero-input operations in sparse matrix applications, our proposed FPM yields an even lower latency and energy per operation. The results for zero-input operands are shown in Table III, which highlights the efficacy of the zero bypass paths.

TABLE III: ZERO OPERAND FEATURES

Design        | Energy/op | Latency
Proposed FPM  | 15.8 pJ   | 464 ps @ 1V
Quinnell FPM  | 280.8 pJ  | 701 ps @ 1.3V
Fig. 10.1: FPM energy per operation across various floating-point applications.
Both designs have similar latency at 1.3V. However, the custom FPM latency results do not include any internal pipeline latches, which account for a significant proportion of overall latency, especially in high-throughput designs. Our asynchronous FPM design compares quite favorably against the custom synchronous FPM implementation, despite employing a radix-8 Booth-encoded multiplier, which has on average 5.9% higher latency than a radix-4 Booth-encoded multiplier design.
Since leakage power has become an important design constraint, our simulations model subthreshold and gate leakage effects in detail. The total leakage power of our FPM in idle mode was estimated at 1.62 mW using the typical-typical process corner at 90 °C and a VDD of 1V.
Fig. 10.2: Longest ripple-carry length for computing the CPA carry input.
Conclusion
We presented the detailed design of an asynchronous, high-performance, energy-efficient, IEEE 754 compliant double-precision floating-point multiplier. We provided a thorough analysis of the trade-offs involved in using radix-4 and radix-8 array multiplier designs. The radix-8 design was preferred since it further reduced the total FPM energy consumption by 19.8% while preserving the average throughput. The full FPM data path, with its numerous operand-dependent and pipeline optimizations, was fully quantified using a 65nm bulk process. When compared against the custom synchronous FPM design of Quinnell et al., our FPM consumes 3X less energy per operation while operating at 2.3X higher throughput. We also describe an implementation of a floating-point multiplier that supports the IEEE 754-2008 binary interchange format; this multiplier does not implement rounding and presents the significand multiplication result as-is (48 bits), which gives better precision if the whole 48 bits are utilized in another unit, e.g., a floating-point adder forming a MAC unit. That design has three pipelining stages and, after implementation on a Xilinx Virtex-5 FPGA, achieves 301 MFLOPS.
Acknowledgement
M. Pravallika would like to thank Mr. V. Vamsi Mohana Krishna, Associate Professor in the Department of ECE, who guided her throughout the project, supported her with technical ideas about the paper, and motivated her to complete the work efficiently and successfully.
References

[1] Exascale computing study: Technology challenges. www.science.energy.gov/ascr/Research/CS/DARPAexascalehardware(2008).pdf, 2008.
[2] SPEC benchmark suite. www.spec.org.
[3] A. D. Booth. A signed binary multiplication technique. Quarterly Journal of Mechanics and Applied Mathematics, 4(2):236-240, June 1951.
[4] S. Borkar. Design challenges of technology scaling. IEEE Micro, 19(4), July-August 1999.
[5] B. S. Cherkauer and E. G. Friedman. A hybrid radix-4/radix-8 low power signed multiplier architecture. IEEE Transactions on Circuits and Systems, 44(8), August 1997.
[6] W. J. Dally and J. Poulton. Digital Systems Engineering. Cambridge University Press, Cambridge, UK, 1998.
[7] A. Efthymiou, W. Suntiamorntut, J. Garside, and L. E. M. Brackenbury. An asynchronous, iterative implementation of the original Booth multiplication algorithm. In Proceedings of the International Symposium on Asynchronous Circuits and Systems, 2004.
[8] M. Ercegovac and T. Lang. Digital Arithmetic. Morgan Kaufmann, 2004.
[9] J. Hensley, A. Lastra, and M. Singh. A scalable counterflow-pipelined asynchronous radix-4 Booth multiplier. In Proceedings of the International Symposium on Asynchronous Circuits and Systems, 2005.
[10] M. Horowitz. Scaling, power and the future of CMOS. In Proceedings of the 20th International Conference on VLSI Design, 2007.
[11] Z. Huang and M. D. Ercegovac. High-performance low-power left-to-right array multiplier design. IEEE Transactions on Computers, 54(3), March 2005.
[12] D. Kearney and N. W. Bergmann. Bundled data asynchronous multipliers with data dependent computation times. In Proceedings of Advanced Research in Asynchronous Circuits and Systems, 1997.
[13] Y. Liu and S. Furber. The design of a low power asynchronous multiplier. In Proceedings of the International Symposium on Low Power Electronics and Design, 2004.
[14] C. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood. Pin: Building customized program analysis tools with dynamic instrumentation. In Proceedings of the Conference on Programming Language Design and Implementation, 2005.
[15] A. J. Martin, A. Lines, R. Manohar, M. Nyström, P. Penzes, R. Southworth, U. V. Cummings, and T. K. Lee. The design of an asynchronous MIPS R3000. In Proceedings of the Conference on Advanced Research in VLSI, 1997.
[16] A. Naini, A. Dhablania, W. James, and D. D. Sarma. 1-GHz HAL SPARC64 dual floating point unit with RAS features. In Proceedings of the International Symposium on Computer Arithmetic, 2001.
[17] J. R. Noche and J. C. Araneta. An asynchronous IEEE floating point arithmetic unit. Science Diliman, 19(2), 2007.
[18] S. F. Oberman, H. Al-Twaijry, and M. Flynn. The SNAP project: Design of floating point arithmetic units. In Proceedings of the International Symposium on Computer Arithmetic, 1997.
[19] The Institute of Electrical and Electronics Engineers, Inc. IEEE standard for binary floating-point arithmetic. ANSI/IEEE Std 754, 1985.
[20] N. Ohkubo, M. Suzuki, T. Shinbo, T. Yamanaka, A. Shimizu, K. Sasaki, and Y. Nakagome. A 4.4 ns CMOS 54 x 54-b multiplier using pass-transistor multiplexer. IEEE Journal of Solid-State Circuits, 30(3), March 1995.
[21] E. Quinnell, E. E. Swartzlander, Jr., and C. Lemonds. Floating-point fused multiply-add architectures. In Proceedings of the Fortieth Asilomar Conference on Signals, Systems, and Computers, 2007.
[22] E. M. Schwarz, R. M. Averill, and L. J. Sigal. A radix-8 CMOS S/390 multiplier. In Proceedings of the International Symposium on Computer Arithmetic, 1997.
[23] E. M. Schwarz, M. Schmookler, and S. D. Trong. FPU implementations with denormalized numbers. IEEE Transactions on Computers, 54(7), July 2005.
[24] B. R. Sheikh and R. Manohar. An operand-optimized asynchronous IEEE 754 double-precision floating-point adder. In Proceedings of the IEEE International Symposium on Asynchronous Circuits and Systems, 2010.
[25] B. R. Sheikh and R. Manohar. Energy-efficient pipeline templates for high-performance asynchronous circuits. ACM Journal on Emerging Technologies in Computing Systems, 7(4), November 2011.
[26] S. D. Trong, M. Schmookler, E. M. Schwarz, and M. Kroener. P6 binary floating-point unit. In Proceedings of the International Symposium on Computer Arithmetic, 2007.
[27] N. Weste and D. Harris. CMOS VLSI Design: A Circuits and Systems Perspective. Addison-Wesley, 2004.
[28] R. K. Yu and G. B. Zyner. 167 MHz radix-4 floating point multiplier. In Proceedings of the International Symposium on Computer Arithmetic, 1995.