 Open Access
 Authors : Uday Arun, A. K. Sinha, K.S. Yadav
 Paper ID : IJERTV1IS7514
 Volume & Issue : Volume 01, Issue 07 (September 2012)
Published (First Online): 25-09-2012
ISSN (Online) : 2278-0181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
A Hyper-Pipelining Approach for High-Speed Single Micro-Pipeline Design for Hybrid Multiplier
Uday Arun, A. K. Sinha, K.S. Yadav
Bharath Institute of Science & Technology, Chennai, Tamil Nadu, India
ABES Engg. College, Ghaziabad, Uttar Pradesh, India
Northern India Engg. College, New Delhi, India
Abstract
Power awareness indicates the scalability of the system energy with changing conditions and quality requirements. Multipliers are essential elements used in DSP applications and computer architectures. Although Boolean multipliers are naturally power-aware with respect to changes in input precision, deeply pipelined designs do not have this benefit. A Single-Edge-Triggering (SET) and Double-Edge-Triggering (DET) one-dimensional pipeline gating scheme is proposed to improve the speed, power awareness and area of such designs. In the one-dimensional pipeline, the registers (D flip-flops) are clocked at both edges, i.e. positive as well as negative, along the data flow in the pipeline and in the horizontal direction (within each pipeline stage). This technique needs very little additional area, and the overhead is hardly noticeable. A set of Hybrid Multipliers was designed and tested. Results show that the new 32-bit Hybrid Multiplier using this technique has lower power and higher speed at a lower clock frequency than the other designs.
With increasing clock frequencies and silicon integration, power-aware computing has become a critical concern in the design of embedded processors and systems-on-chip. One of the more effective and widely used methods for power-aware computing is Dynamic Voltage Scaling (DVS). To obtain the maximum power savings from DVS, it is essential to scale the supply voltage as low as possible while ensuring correct operation of the processor. The critical voltage is chosen such that, under a worst-case scenario of process and environmental variations, the processor always operates correctly. However, this approach leads to a very conservative supply voltage, since such a worst-case combination of variabilities is very rare.
With SET and DET D flip-flop pipeline multipliers, the DET-based pipeline Hybrid Multiplier design works at clock frequencies in the range of 50 MHz to 250 MHz, and the SET-based pipeline Hybrid Multiplier design works at clock frequencies in the range of 50 MHz to 500 MHz. We have used the hyper-pipelining approach in the pipeline design by changing the register retiming approach, i.e. combined memory with DET, to further improve the speed as well as the area and power consumption. The various designs and analyses are presented in the result section.

Introduction
The hybrid multiplication mode is a combination of the semi-parallel and sum-of-multiplication modes, where bit sections from two unique input streams are multiplied by two different coefficient values. This mode is useful in applications that require complex multiplication, such as FFTs, where each signal generally has a real and an imaginary component that can be multiplied by two unique coefficient values. The partial products obtained from each bit section within the components are shift-accumulated to obtain the final result.
Real-time signal processing hardware requires efficient multiplier units. However, each application needs a different sample rate; from speech to image or radar, a wide frequency range is covered. Digital Signal Processing (DSP) is used in a wide range of applications such as telephone, radio, video, radar, sonar and satellite communication. The sample rate requirements vary from application to application and can range anywhere from 10 kHz to 400 MHz. Real-time implementation of these systems requires hardware architectures that can process input signal samples as they are received, as opposed to storing them in registers and processing them in batch mode.
In several cases, the multiplier has a significant cost in area and runs faster than the throughput required by the application. Most DSP computations involve multiply-accumulate operations, and therefore the design of fast and efficient multipliers is imperative. Moreover, the demand for portable applications of DSP architectures has dictated the need for low-power design.
Multipliers play an important role in today's digital signal processing and various other applications. With advances in technology, many researchers have tried and are trying to design multipliers that offer high speed, low power consumption, regularity of layout (and hence less area), or a combination of these in one multiplier, making them suitable for various high-speed, low-power and compact VLSI implementations.
The common multiplication method is the add-and-shift algorithm. In parallel multipliers, the number of partial products to be added is the main parameter that determines the performance of the multiplier. To reduce the number of partial products to be added, the Modified Booth algorithm is one of the most popular algorithms. To achieve speed improvements, the Wallace Tree algorithm can be used to reduce the number of sequential adding stages. Further, by combining the Modified Booth algorithm and the Wallace Tree technique, the advantages of both algorithms can be obtained in one multiplier.
However, with increasing parallelism, the number of shifts between the partial products and the intermediate sums to be added increases, which may result in reduced speed, an increase in silicon area due to irregularity of structure, and increased power consumption due to the additional interconnect resulting from complex routing.
On the other hand, serial-parallel multipliers compromise speed to achieve better area and power consumption. So the selection of a parallel or serial multiplier depends on the nature of the application.
In this paper, we compare four parallel multipliers, one serial-parallel multiplier and one cascade multiplier for area, speed, power consumption and different products of these as performance measures, and summarize the results in tabular form.
As the main idea here is the comparison of different multipliers, we need a common base for comparing all of them. For this reason, synthesis was done using the same design constraints and the same target device for all multipliers. As a result, each multiplier on its own may not have optimal performance measures and could be optimized individually.
Figure 1. IEEE Journal Research Paper (System & Circuit Design)
From Fig. 1 we can see that the delay of the Non-Additive (Cascade) multiplier is the least; hence it is the fastest among the six multipliers. Its DP (delay-power) product is also the least, so it is a good choice for this performance measure. The Serial-Parallel multiplier is the best choice when speed is not important but reduced area and power consumption are of more interest; it is also a good choice for the AP (area-power) and AD (area-delay) products. In applications where area and power consumption are strictly restricted and speed is not very important, it is not wise to use parallel multipliers, as they occupy more silicon area and consume more power than serial-parallel multipliers.
We can conclude that each multiplier has its own advantages and disadvantages for different performance measures, and the choice of a specific multiplier depends on the nature of the application and its constraints on area, delay, power or a combination of them. One can select a multiplier from this table depending on which performance criterion is most critical for a particular application. The multipliers compared may be optimized individually by changing the VHDL coding style, adding constraints specific to an architecture during synthesis, changing the target device part number and its speed grade, etc. Individual optimization is not considered here, as it is not the main purpose. So, the main issue in serial/parallel multipliers is to reduce the internal loop delay, which is the bottleneck of throughput.
Shailesh Shah [1] has made a comparison of different 32-bit integer multipliers based on different performance measures, using the same design constraints and the same target device for all multipliers. It is observed that the area of the twin-pipe hybrid multiplier is the least, but it has more delay than the non-additive cascade multiplier. In this twin-pipe hybrid multiplier, the delay is mainly caused by the internal loops in each multiplication stage.
So, the main issue in hybrid multipliers is to reduce the internal loop delay, which is the bottleneck of throughput.
Soojin Kim [2] proposed a pipeline technique for 8-bit and 16-bit modified Booth multipliers to improve performance, but operating at very high frequencies (in the GHz range), which results in more power consumption. The authors discussed that the overall delay of the multiplier circuits decreases as the number of pipeline stages increases, but that this results in an increase in area. They compared the proposed multiplier with others [3,4,5,6,7,8] and concluded that the proposed one shows the best result. However, a single pipeline using the concept of hyper-pipelining and a DET clock pulse has not been designed for 32-bit hybrid multipliers by any researcher; this is presented in this work.
The twin-pipe hybrid multiplier is not implemented in any real application, as it operates on a 120 nm technology device at a 400 MHz clock frequency.
The ARM7TDMI 32-bit controller [14] uses a barrel shifter in the design of the Arithmetic & Logic Unit in the core. IBM & Motorola launched the PowerPC 603e & 603ev processors [13] using the ARM core. This processor works on 90 nm technology, operates at 333 MHz and uses a barrel shifter to obtain a low-power design.
The ARM11™ processor family provides the engine that powers many smartphones in production today. It delivers extremely low power and a range of performance from 350 MHz in small-area designs up to 1 GHz in speed-optimized designs.
The ARM11 32-bit processor core [14] is implemented in the Nokia 6120 Classic with a clock frequency of 350 MHz and is designed on 45-65 nm technology. In the present work, a single-pipeline 32-bit hybrid multiplier with DET memory is designed on a 65 nm technology device at a 200 MHz clock frequency to improve the overall performance.

Related work and Design Analysis
Assuming B is the multiplicand and A is the multiplier, the product B x A is shown as:
A = {a0, a1, a2, a3}, B = {b0, b1, b2, b3} and product P = {p0, p1, p2, ..., p7}
The equation is:
$$P(m+n) = A(m)\,B(n) = \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} a_i\, b_j\, 2^{i+j}$$
The operating method of a multiplier is ANDing, adding and shifting. The ANDing operation is performed in parallel using AND gates to obtain the partial products. Adding and shifting are sequential operations, which are time-consuming. The adding operation is performed using a full adder circuit to obtain the intermediate sum of partial products. The shifting operation requires a group of memory cells (storage elements). Shifting is equivalent to instruction overlapping in the instruction pipeline of a processor.
Using this logical and analytical approach, multiplier performance can be improved by designing a pipeline for the shifting operation. If the adding operation is performed along with the shifting operation in a single pipeline, a reduction in pipeline stages can be achieved and multiplier performance can be further optimized.
A commonly used method to improve performance is to increase the clock frequency. However, the use of a high clock frequency has a number of disadvantages, such as higher power consumption. An alternative strategy relies on storage elements capable of capturing data on both clock edges (rising and falling edge) [9].
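As a minimal sketch of the AND, add and shift steps described in this section, the following Python model forms each partial product with a bitwise AND and accumulates the shifted rows. It is a software illustration only, not the hardware design of this paper; the function names `partial_products` and `shift_add_multiply` are hypothetical.

```python
def partial_products(a: int, b: int, n_bits: int) -> list[int]:
    """AND and shift steps: one row per multiplier bit a_i, weighted by 2**i."""
    rows = []
    for i in range(n_bits):
        a_i = (a >> i) & 1        # i-th bit of the multiplier A
        row = b if a_i else 0     # AND of a_i with every bit of the multiplicand B
        rows.append(row << i)     # shift: weight the row by 2**i
    return rows


def shift_add_multiply(a: int, b: int, n_bits: int = 4) -> int:
    """Add step: accumulate the shifted partial products one after another."""
    acc = 0
    for row in partial_products(a, b, n_bits):
        acc += row
    return acc


if __name__ == "__main__":
    print(shift_add_multiply(0b1101, 0b1011))  # 13 x 11
```

The accumulation loop is the sequential part of the operation, which is the part a pipeline for adding and shifting is meant to speed up.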

Anatomy of Hybrid Multiplier
The hybrid multiplier is a composition of serial and parallel multipliers. The architecture of the hybrid multiplier has three components: a Parallel to Serial Converter (PTSC), a hyper-pipeline multiplier and a Serial to Parallel Converter (STPC). The PTSC and STPC work at the master clock frequency.
Figure 2. Architecture of Hybrid Multiplier

Anatomy of Pipeline Multiplier
Figure 3. Generic Architecture of pipelined Multiplier

Anatomy of Pipeline in Processor Micro Architecture
A pipeline is a combination of pipes. A pipe is a component that consists of a flip-flop, an adder and various control signals. The two main types of pipe are shown below.
Figure 4. Architecture of Pipeline Model
Pipelining is also called pipeline processing. This section deals with the concept of advanced pipelining and superscalar design in processor development. It begins with a discussion of conventional linear pipelines and analyzes their performance. A generalized pipeline model is introduced to include nonlinear interstage connections. Further, super-pipelining and superscalar design techniques are analyzed based on their performance.
A linear pipeline is a cascade of processing stages which are linearly connected to perform a fixed function over a stream of data flowing from one end to the other. A linear pipeline processor is constructed with k processing stages. External inputs (operands) are fed into the pipeline at the first stage S1. The processed results are passed from stage Si to stage Si+1, for all i = 1, 2, ..., k-1. The final result emerges from the pipeline at the last stage Sk.
Linear pipelines are applied for instruction execution, arithmetic computation and memory access operations. Depending on the control of data flow along the pipeline, linear pipelines are modeled into two categories:

Asynchronous model

Synchronous model


Synchronous Linear Pipeline Model
Clocked latches are used to interface between stages. The latches are made with master-slave flip-flops, which can isolate inputs from outputs. Upon the arrival of a clock pulse, all latches transfer data to the next stage simultaneously. Pipeline stages are combinational logic circuits. It is desirable to have approximately equal delays in all stages. These delays determine the clock period and thus the speed of the pipeline.
Figure 5. Architecture of Linear Pipeline Model
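The synchronous model can be illustrated with a small behavioral simulation. The sketch below is only an illustration under simple assumptions: each stage is an arbitrary Python function standing in for combinational logic, and the name `run_pipeline` is hypothetical. On every clock pulse all latches are updated together from their previous contents.

```python
from typing import Callable, List, Optional, Tuple

Stage = Callable[[int], int]


def run_pipeline(stages: List[Stage], inputs: List[int]) -> List[Tuple[int, int]]:
    """Return (clock pulse number, result) pairs for a k-stage linear pipeline."""
    k = len(stages)
    latches: List[Optional[int]] = [None] * k  # latches[i] holds the output of stage i+1
    results = []
    for cycle in range(len(inputs) + k):
        feed = inputs[cycle] if cycle < len(inputs) else None
        # All latches transfer simultaneously: build new contents from the old ones.
        new_latches: List[Optional[int]] = [None] * k
        new_latches[0] = stages[0](feed) if feed is not None else None
        for i in range(1, k):
            prev = latches[i - 1]
            new_latches[i] = stages[i](prev) if prev is not None else None
        latches = new_latches
        if latches[-1] is not None:
            results.append((cycle + 1, latches[-1]))
    return results


if __name__ == "__main__":
    demo_stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
    print(run_pipeline(demo_stages, [1, 2, 3]))
```

With three stages, the first result appears after three clock pulses, and one further result emerges on each following pulse.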

Clocking and Timing Control
Let $\tau_i$ be the time delay of the circuitry in stage $S_i$ and $d$ the time delay of a latch. The clock cycle (period) $\tau$ of a pipeline is then:
$$\tau = \max_{1 \le i \le k}\{\tau_i\} + d = \tau_m + d$$
where $\tau_m$ denotes the maximum stage delay. At the rising edge of the clock pulse, the data is latched into the master flip-flops of each latch register.
The clock pulse has a width equal to $d$. The pipeline frequency $f$ is defined as the inverse of the clock period:
$$f = 1/\tau$$
If one result is expected to come out of the pipeline per cycle, $f$ represents the maximum throughput of the pipeline. The actual throughput may be lower than $f$, depending on the initiation rate of successive tasks entering the pipeline.
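As a worked example of the clock period and frequency definitions above, the sketch below evaluates them for a made-up set of stage delays. The delay values and the function names are assumptions for illustration and are not measurements from this work.

```python
def clock_period_ns(stage_delays_ns, latch_delay_ns):
    """Clock period = maximum stage delay + latch delay."""
    return max(stage_delays_ns) + latch_delay_ns


def max_frequency_mhz(period_ns):
    """Pipeline frequency = 1 / clock period (a period in ns gives MHz as 1000 / period)."""
    return 1000.0 / period_ns


if __name__ == "__main__":
    period = clock_period_ns([3.0, 4.0, 3.5], latch_delay_ns=1.0)  # hypothetical delays
    print(period, max_frequency_mhz(period))
```

Only the slowest stage sets the period, which is why approximately equal stage delays are desirable.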
When clock skew takes effect, the same clock pulse may arrive at different stages with a time offset of $s$. Let $t_{max}$ be the time delay of the longest logic path within a stage and $t_{min}$ that of the shortest logic path within a stage. To avoid a race in two successive stages, we must choose:
$$\tau_m \ge t_{max} + s \quad \text{and} \quad d \le t_{min} - s$$
These constraints translate into the following bounds on the clock period:
$$d + t_{max} + s \le \tau \le \tau_m + t_{min} - s$$
In the ideal case, $s = 0$, $t_{max} = \tau_m$ and $t_{min} = d$. Thus, we have $\tau = \tau_m + d$, consistent with the definition in the above equation without the effect of clock skew.
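The skew constraints can be checked numerically. The following sketch, with hypothetical function names and made-up delay values, evaluates the race-free conditions and the resulting bounds on the clock period.

```python
def is_race_free(tau_m, d, t_max, t_min, skew):
    """Check tau_m >= t_max + s and d <= t_min - s."""
    return tau_m >= t_max + skew and d <= t_min - skew


def clock_period_bounds(tau_m, d, t_max, t_min, skew):
    """Return (lower, upper): d + t_max + s <= tau <= tau_m + t_min - s."""
    lower = d + t_max + skew
    upper = tau_m + t_min - skew
    return lower, upper


if __name__ == "__main__":
    # Ideal case: no skew, t_max = tau_m and t_min = d, so both bounds equal tau_m + d.
    print(clock_period_bounds(tau_m=4.0, d=1.0, t_max=4.0, t_min=1.0, skew=0.0))
    # With skew, the logic paths must leave margin on both sides.
    print(clock_period_bounds(tau_m=4.0, d=1.0, t_max=3.5, t_min=1.5, skew=0.25))
```

In the ideal case the interval collapses to the single value given by the skew-free definition of the clock period.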

Anatomy of Hyperpipelining or Register Retiming
The concept of hyper-pipelining or register retiming [11,12] is used to meet performance requirements that hinge on sequential elements being optimally placed so as to minimize and balance path delays. It works by moving registers across portions of combinational logic such that the worst-case combinational delays on the input and output sides of the register are more balanced. It is an optimization strategy that leverages positive slack on one side of a sequential element to address or balance negative slack on the other. It can lead to either an increase or a decrease in the number of flip-flops in the design, but the number of pipeline stages does not change. It is constrained to operate only in a way that preserves design functionality at the top-level design ports. It results in less area usage and has a beneficial impact on system performance. In this work, hyper-pipelining is used in such a way that registers are placed on top of the sequential logic in the design of the single pipeline, and the registers are then merged to form one combined memory within each stage of the pipeline. It takes a timing-driven approach in which static timing analysis is performed to obtain the time delay and total area of the Hybrid Multiplier.
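The balancing idea behind register retiming can be sketched on a toy example. The model below is a simplification and not the retiming algorithm of any synthesis tool: a single register sits somewhere in a chain of combinational blocks, and `retime` (a hypothetical name) moves it to the position that minimizes the longer of the two register-to-register paths. The block delays are made-up values.

```python
def worst_path_delay(delays, cut):
    """Register placed after the first `cut` blocks; return the longer of the two paths."""
    return max(sum(delays[:cut]), sum(delays[cut:]))


def retime(delays):
    """Try every register position and keep the one with the most balanced paths."""
    best_cut = min(range(1, len(delays)), key=lambda c: worst_path_delay(delays, c))
    return best_cut, worst_path_delay(delays, best_cut)


if __name__ == "__main__":
    block_delays = [2, 3, 1, 4, 2]            # hypothetical combinational delays
    print(worst_path_delay(block_delays, 1))  # register right after the first block
    print(retime(block_delays))               # balanced placement
```

The number of pipeline stages is the same before and after the move; only the position of the register, and therefore the worst-case path delay, changes.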

Anatomy of Power & Speed Analysis
With advances in technology, many researchers have tried and are trying to design multipliers that offer high speed, low power consumption, regularity of layout (and hence less area), or a combination of these in one multiplier.
In applications where area and power consumption are strictly restricted and speed is not very important, it is wiser to use a Hybrid Multiplier than a Parallel Multiplier. P.R. Panda [10] discussed the design of circuits conceived as Finite State Machines (FSMs). Depending on the complexity of the circuit, a large fraction of the power is consumed by the switching of the state register. The objective of low-power approaches is therefore to choose a state encoding that minimizes the switching power of the state register.
Given a state encoding, the power consumption can be modeled as:
$$P = \tfrac{1}{2} V^2 F \times C_{sr} \times E_{sr}$$
where $F$ is the clock frequency, $V$ is the supply voltage, $C_{sr}$ is the effective capacitance of the state register, and $E_{sr}$ is the expected state register switching activity.
If $S$ is the set of all states, we can estimate $E_{sr}$ as:
$$E_{sr} = \sum_{i,j \in S} p_{ij} \times h_{ij}$$
where $p_{ij}$ is the probability of a transition between states $i$ and $j$, and $h_{ij}$ is the Hamming distance between the codes of states $i$ and $j$.
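To make the switching-activity estimate concrete, the sketch below evaluates it for a hypothetical four-state machine that cycles 0, 1, 2, 3, 0 with equal transition probabilities, under two example encodings. The state machine, the encodings, the electrical values and the function names are all assumptions chosen for illustration.

```python
import math


def hamming(a: int, b: int) -> int:
    """Hamming distance between two state codes."""
    return bin(a ^ b).count("1")


def expected_switching(codes, transitions):
    """Sum of p_ij * h_ij over all listed transitions (i, j)."""
    return sum(p * hamming(codes[i], codes[j]) for (i, j), p in transitions.items())


def state_register_power(v, f, c_sr, e_sr):
    """P = 1/2 * V^2 * F * Csr * Esr."""
    return 0.5 * v ** 2 * f * c_sr * e_sr


# Hypothetical 4-state cycle with equal transition probabilities.
TRANSITIONS = {(0, 1): 0.25, (1, 2): 0.25, (2, 3): 0.25, (3, 0): 0.25}
BINARY = {0: 0b00, 1: 0b01, 2: 0b10, 3: 0b11}
GRAY = {0: 0b00, 1: 0b01, 2: 0b11, 3: 0b10}

if __name__ == "__main__":
    for name, codes in (("binary", BINARY), ("gray", GRAY)):
        e_sr = expected_switching(codes, TRANSITIONS)
        # Example values: 2.0 V supply, 100 MHz clock, 1 pF register capacitance.
        print(name, e_sr, state_register_power(2.0, 100e6, 1e-12, e_sr))
```

For this particular cycle the Gray encoding changes one bit per transition, so its estimated switching activity, and with it the modeled power, is lower than for the plain binary encoding.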
To achieve a low-power and high-performance design, the value of Esr needs to be minimized.
This can be done by using the one-hot method [15] of designing circuits, in which only one register is used in each state of the pipeline.
P.R. Panda [10] discussed that the power consumption of any circuit can be divided into three components: dynamic, static (or leakage) and short-circuit power consumption. Thus,
$$P_{total} = P_{dynamic} + P_{short\text{-}circuit} + P_{static}$$
Switching power, which includes both dynamic and short-circuit power, is consumed when signals change their logic state, resulting in the charging and discharging of load capacitors. The dynamic power of a circuit can be estimated as:
$$P = A\,C\,V^2 F$$
where $C$ is the switched capacitance, $V$ is the supply voltage, $F$ is the clock frequency, and $A$ is a constant called the activity factor ($0 \le A \le 1$).
The clock frequency ($F$) is related to the time delay as:
$$F = 1/\text{Time}$$
Thus, an increase in clock frequency shortens the clock period, which increases the speed but also consumes more power.
Therefore, a good hybrid multiplier design requires a trade-off between speed and power consumption through the choice of clock frequency. This can be done by using the hyper-pipelining approach in a single-pipeline hybrid multiplier with DET storage elements. However, most libraries do not provide or support DET memory, so a new library providing dual-edge behavior is coded in VHDL.

Genesis of Double Edge Triggering Memory Element (D Flip-Flop)
Double edge-triggered flip-flops are becoming a popular technique for low-power designs, since they effectively enable a halving of the clock frequency [17][18]. The paper by Hossain et al. [21] showed that while a single edge-triggered flip-flop can be implemented by two transparent latches in series, a double edge-triggered flip-flop can be implemented by two transparent latches in parallel. The clock-related power is one of the most significant components of the dynamic power consumption [20][22].
The total clock-related power dissipation in synchronous VLSI circuits is further divided into three major components: (i) power dissipation in the clock network, (ii) power dissipation in the clock buffers, and (iii) power dissipation in the flip-flops. The total power dissipation of the clock network depends on both the clock frequency and the data rate, and can be computed as:
$$P_{CLK} = V_{dd}^2 \left[ f_{CLK}\,(C_{CLK} + C_{ff,CLK}) + f_D\, C_{ff,D} \right]$$
where
$f_{CLK}$ is the clock frequency;
$f_D$ is the average data rate;
$C_{CLK}$ is the total capacitance seen by the clock network;
$C_{ff,CLK}$ is the capacitance of the clock path seen by the flip-flop;
$C_{ff,D}$ is the capacitance of the data path seen by the flip-flop.
From the above equation, it is clear that the clock power can be reduced if any of the parameters on the right-hand side of the equation is reduced. The reduction of Vdd is already the trend of contemporary design, and it has the strongest impact on the PCLK expression.
By reducing the overall capacitance of the clock network, CCLK, the power dissipation may also be reduced. For instance, the capacitance can be reduced by proper design of clock drivers and buffers. Similarly, power may be reduced by lowering the capacitances inside a flip-flop, Cff,CLK and Cff,D. Furthermore, the clock power dissipation is linearly proportional to the clock frequency. Although the clock frequency is determined by the system specifications, it can be reduced with the use of dual edge-triggered flip-flops (DETFFs). As its name implies, a DETFF responds to both rising and falling clock edges. Hence, it can reduce the clock frequency by half while keeping the same data throughput. As a result, the power consumption of the clock distribution network is reduced, making DETFFs desirable for low-power applications.
Even for high-performance applications, the use of DETFFs offers certain benefits. Since the clock speed is reduced by a factor of two, one does not need to propagate a relatively high-speed clock signal [23].
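The effect of halving the clock frequency can be estimated directly from the clock power expression. The sketch below uses made-up capacitance, voltage and frequency values, and it assumes for simplicity that a DETFF presents the same clock-path and data-path capacitances as a single edge-triggered flip-flop, which real cells need not satisfy.

```python
def clock_power(vdd, f_clk, c_clk, c_ff_clk, f_d, c_ff_d):
    """P_CLK = Vdd^2 * [f_CLK * (C_CLK + C_ff,CLK) + f_D * C_ff,D]."""
    return vdd ** 2 * (f_clk * (c_clk + c_ff_clk) + f_d * c_ff_d)


if __name__ == "__main__":
    # Hypothetical values: 1.0 V, 10 pF clock network, 2 pF and 1 pF flip-flop loads.
    common = dict(vdd=1.0, c_clk=10e-12, c_ff_clk=2e-12, f_d=50e6, c_ff_d=1e-12)
    p_set = clock_power(f_clk=200e6, **common)  # single edge-triggered, full clock rate
    p_det = clock_power(f_clk=100e6, **common)  # dual edge-triggered, half clock rate
    print(p_set, p_det, p_det / p_set)
```

Only the term proportional to the clock frequency is halved; the data-dependent term is unchanged because the data throughput stays the same.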
In the classic configuration, two opposite-polarity level-sensitive latches are connected in parallel, and their outputs are multiplexed at the output stage.
Figure 6. Classical Implementation of DETFF
Figure 7. Gate Level CMOS Implementation
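The classic parallel-latch configuration can be modeled behaviorally in a few lines. The Python sketch below is an idealized illustration, not the gate-level circuit of the figures: it ignores setup, hold and propagation delays, the class and function names are hypothetical, and each call to `step` stands for one half clock period during which the data input is stable.

```python
class DetFlipFlop:
    """Two opposite-polarity latches in parallel with an output multiplexer."""

    def __init__(self):
        self.latch_high = 0  # transparent while clk == 1, holds while clk == 0
        self.latch_low = 0   # transparent while clk == 0, holds while clk == 1

    def step(self, clk: int, d: int) -> int:
        if clk == 1:
            self.latch_high = d    # this latch follows the data input
            return self.latch_low  # mux selects the holding latch (value taken at the rising edge)
        self.latch_low = d
        return self.latch_high     # value taken at the falling edge


def simulate(data):
    """Apply one data value per half clock period, starting with clk = 0."""
    ff = DetFlipFlop()
    return [ff.step(clk=n % 2, d=d) for n, d in enumerate(data)]


if __name__ == "__main__":
    print(simulate([1, 0, 1, 1, 0]))
```

In this model the output reproduces the input delayed by one half period, so a new value is captured on every clock edge rather than once per full clock cycle.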

Single Edge Triggering (SET) Pipeline Component with Separate Memory
A multiplier pipe is a composition of two main components: a full adder and memory elements (storage elements). One pipe requires two memory elements, one for sum storage and the other for carry-out storage. Each memory element consists of a D flip-flop with two control signals. In this memory element design, the clock signal and an asynchronous reset signal are the control signals, provided separately for each memory element. The full adder has two inputs a and b, where b is the partial product of an and bn. The behaviour of a pipe is similar to that of a serial (sequential) adder circuit. The speed of the pipe depends on the speed of the adder circuit (a combinational circuit) and on the internal loop delay of the path from Cout to Cin. The speed of the pipe can be improved further by reducing the delta delay or transport delay of the path from Cout to Cin and the propagation delay of the D flip-flop.
Figure 8. Design Adder and Shifter with SET Separate Memory
Figure 9. RTL Schematic of SET Separate Memory Element
In this design, the area increases because of the separate memory elements, and the path of the clock signal also becomes longer. The chances of clock skew are higher in a pipe based on separate memory elements.
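The serial-adder behaviour of a pipe can be sketched in software. The model below is an illustration under simple assumptions and is not the RTL of this design: one full adder is reused on every clock, the carry-out is stored in a flip-flop and fed back as the carry-in of the next bit, and operands are given least significant bit first. The function names are hypothetical.

```python
def full_adder(a: int, b: int, cin: int):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def to_bits(value: int, width: int):
    """Integer to a list of bits, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def serial_add(a_bits, b_bits):
    """Bit-serial addition: the carry flip-flop closes the Cout-to-Cin loop."""
    carry_ff = 0
    sum_bits = []
    for a, b in zip(a_bits, b_bits):  # one loop iteration per clock
        s, cout = full_adder(a, b, carry_ff)
        sum_bits.append(s)            # value stored in the sum memory element
        carry_ff = cout               # value stored in the carry memory element
    return sum_bits, carry_ff


if __name__ == "__main__":
    print(serial_add(to_bits(6, 4), to_bits(7, 4)))
```

Each iteration depends on the carry stored in the previous one, which is why the Cout-to-Cin loop delay bounds the speed of the pipe.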

Single Edge Triggering Pipeline Component with Combined Memory
To reduce the area of the pipe and the clock delay of the memory elements, a combined memory approach is used. This is known as the register retiming or hyper-pipelining mechanism. This approach also shortens the clock signal path and reduces the chances of clock skew in a pipe based on a combined memory element.
Figure 10. Design of Adder and Shifter with SET Combined Memory
Figure 11. RTL Schematic of SET Combined Memory Element

Double Edge Triggering Pipeline Component with Separate Memory
Figure 12. Design of Double Edge Triggering (DET) Memory

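RTL Modeling of Double Edge Triggering Memory Element of Separate Memory
-- NegClk is assumed here to be the inverted clock (not clk).
-- DFF0: captures d on the rising edge of clk, asynchronous reset.
DFF0: process(clk, rst)
begin
    if rst = '1' then
        s(0) <= '0';
    elsif clk'event and clk = '1' then
        s(0) <= d;
    end if;
end process;

-- DFF1: captures d on the rising edge of NegClk, asynchronous reset.
DFF1: process(NegClk, rst)
begin
    if rst = '1' then
        s(1) <= '0';
    elsif NegClk'event and NegClk = '1' then
        s(1) <= d;
    end if;
end process;

-- Mux2x1: selects the flip-flop that captured the most recent edge.
Mux2x1: process(NegClk, s(0), s(1))
begin
    q <= 'Z';
    if NegClk = '0' then
        q <= s(0);
    else
        q <= s(1);
    end if;
end process;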
Figure 13. Design Adder and Shifter with Double Edge Triggering (DET) Memory
Figure 14. RTL Schematic of Double Edge Triggering Memory Element

RTL Modeling of Double Edge Triggering Memory Element of Combined Memory

-- DFF0: captures d0 and d1 on the rising edge of clk, asynchronous reset.
DFF0: process(d0, d1, clk, rst)
begin
    if rst = '1' then
        s(0) <= '0';
        s(2) <= '0';
    elsif clk'event and clk = '1' then
        s(0) <= d0;
        s(2) <= d1;
    end if;
end process;

-- DFF1: captures d0 and d1 on the rising edge of NegClk, asynchronous reset.
DFF1: process(d0, d1, NegClk, rst)
begin
    if rst = '1' then
        s(1) <= '0';
        s(3) <= '0';
    elsif NegClk'event and NegClk = '1' then
        s(1) <= d0;
        s(3) <= d1;
    end if;
end process;

-- Mux2x1: one multiplexer stage shared by both stored bits.
Mux2x1: process(NegClk, s(0), s(1), s(2), s(3))
begin
    q0 <= 'Z';
    q1 <= 'Z';
    if NegClk = '0' then
        q0 <= s(0);
        q1 <= s(2);
    else
        q0 <= s(1);
        q1 <= s(3);
    end if;
end process;
Figure 15. Design Adder and Shifter with Double Edge Triggering Combined Memory
Figure 16. RTL Schematic of DET Combined Memory Element




Observation
Figure 17. Delay Analysis Chart
Figure 18. Delay Analysis of Singleton Pipelined Hybrid Multiplier
Figure 19. Delay Analysis Graph
Figure 20. Area Analysis of Singleton Pipelined Hybrid Multiplier
Figure 21. Area Analysis Chart
Figure 22. Area Analysis Chart

Result

Conclusion
In the above results, the performance measures of all six multipliers and of the Single Pipeline Hybrid Multiplier are summarized and compared based on the synthesis and timing analysis reports. These results were obtained by synthesizing each architecture for the Xilinx Virtex FPGA xc2vp25fg256. All results are for processing 32 bits of data. It is observed that the delay of the SET and DET Pipeline Hybrid Multipliers is the least; hence they are faster than the six multipliers.
The number of clock pulses is reduced by half in the DET Pipeline Hybrid Multiplier compared with the SET Pipeline Hybrid Multiplier. However, the total area usage is more than that of the Twin-Pipe Serial-Parallel Multiplier, but less than that of all other multipliers. A low frequency is also used in this design, which consumes less power, and speed is improved through the DET memory and hyper-pipelining concept, which reduces the internal loop delay in the single micro-pipeline Hybrid Multiplier.
Prototyping and verification have been done using the following tools and technology, considering various design constraints:

Target Device: xc2vp25fg256

On Board frequency: 50 MHz

Supply Voltage: 3.0 volts

Worst Voltage: 1.4 to 1.6 volts

Temperature: 40 °C to 80 °C

Speed Grade: 5

FPGA Compiler: Xilinx ISE 10.0i

Simulator used: ModelSim 10.0a

Static Timing Analysis Tool: PrimeTime Tools
The limitation of this design is that the maximum frequency at which it operates is 500 MHz for the SET Singleton Pipeline and 200 MHz for the DET Singleton Pipeline.
To further improve the speed, area and power consumption of the Hybrid Multiplier, other variants of the Singleton Micro Pipeline and its DET memory element need to be designed by considering the VLSI design matrix paradigm.


Reference

[1] Shailesh Shah, "Comparison of 32-bit Multipliers for Various Performance Measures", in Tehran 12th Int. Conf. on Microelectronics, M.Eng. Project Report, Concordia University, January 2000.

[2] Soojin Kim, Kyeongsoon Cho, "Design of High-speed Modified Booth Multipliers Operating at GHz Ranges", World Academy of Science, Engg. and Tech. 61, 2010.

[3] A. A. Khatibzadeh, K. Raahemifar and M. Ahmadi, "A 1.8V 1.1GHz Novel Digital Multiplier", Proc. of IEEE CCECE, pp. 686-689, May 2005.

[4] S. Hsu, V. Venkatraman, S. Mathew, H. Kaul, M. Anders, S. Dighe, W. Burleson and R. Krishnamurthy, "A 2GHz 13.6mW 12x9b multiplier for energy efficient FFT accelerators", Proc. of IEEE ESSCIRC, pp. 199-202, Sept. 2005.

[5] Hwang-Cherng Chow and I-Chyn Wey, "A 3.3V 1GHz high speed pipelined Booth multiplier", Proc. of IEEE ISCAS, vol. 1, pp. 457-460, May 2002.

[6] M. Aguirre-Hernandez and M. Linares-Aranda, "Energy-efficient high-speed CMOS pipelined multiplier", Proc. of IEEE CCE, pp. 460-464, Nov. 2008.

[7] Yung-chin Liang, Ching-ji Huang and Wei-bin Yang, "A 320MHz 8bit x 8bit pipelined multiplier in ultra low supply voltage", Proc. of IEEE ASSCC, pp. 73-76, Nov. 2008.

[8] S. B. Tatapudi and J. G. Delgado-Frias, "Designing pipelined systems with a clock period approaching pipeline register delay", Proc. of IEEE MWSCAS, vol. 1, pp. 871-874, Aug. 2005.

[9] Nikola Nedovic, Marko Aleksic, Vojin G. Oklobdzija, "Comparative Analysis of Double-Edge versus Single-Edge Triggered Clocked Storage Elements", IEEE Adv. Comp. Sys. Engg. Lab., University of California, Davis, CA 95616, 2002.

[10] P.R. Panda et al., "Power-efficient System Design", chapter 2, pp. 11-39, Springer Science+Business Media, LLC, 2010.

[11] Darren Zacher, "How to use register retiming to optimize your FPGA designs", Mentor Graphics, Dec. 14, 2005.

[12] Tobias Strauch, "Hyper Pipelining of Multicores and SoC Interconnects", Germany, Oct. 2010.

[13] Seongmoo Heo, "A Low-Power 32-bit Datapath Design", August 2000.

[14] http://www.arm.com/arm_arm.pdf

[15] John P. Hayes, "Computer Architecture and Organization", McGraw Hill, 3rd edition, pp. 308.

[16] G.E. Tellez, A. Farah and M. Sarrafzadeh, "Activity-driven clock design for low power circuits", in Proc. IEEE ICCAD, San Jose, pp. 62-65, Nov. 1995.

[17] S.H. Unger, "Double-edge-triggered flip-flops", IEEE Trans. Computers, vol. 30, no. 6, pp. 447-451, June 1981.

[18] S.L. Lu and M. Ercegovac, "A novel CMOS implementation of double-edge-triggered flip-flops", IEEE J. Solid-State Circuits, vol. 25, pp. 1008-1010, Aug. 1990.

[19] M. Afghahi and J. Yuan, "Double-edge-triggered D flip-flops for high-speed circuits", IEEE J. Solid-State Circuits, vol. 25, pp. 1168-1170, Aug. 1991.

[20] A. Gago, R. Escano and J. A. Hidalgo, "Reduced implementation of D-type DET flip-flops", IEEE J. Solid-State Circuits, vol. 28, no. 3, pp. 400-402, Mar. 1993.

[21] R. Hossain, L. D. Wronski and A. Albicki, "Low power design using double edge triggered flip-flops", IEEE Trans. VLSI Systems, vol. 2, no. 2, pp. 261-265, June 1994.

[22] Massoud Pedram, Qing Wu, Xunwei Wu, "A new design for double edge triggered flip-flops", Dept. of Electrical Engineering Systems, University of Southern California, Los Angeles, CA 90089, USA.

[23] Wai Man Chung, "The Usage of Dual Edge Triggered Flip-flops in Low Power, Low Voltage Applications", a thesis presented to the University of Waterloo, Master of Applied Science in Electrical and Computer Engineering, Waterloo, Ontario, Canada, January 2003.

[24] G. M. Blair, "Low power double edge-triggered flip-flops", Electron. Lett., vol. 33, no. 10, pp. 845-847, May 1997.