 Open Access
 Authors : K. Hemapriya, R. Karthikeyan, Dr. S. Rajkumar
 Paper ID : IJERTV3IS20543
 Volume & Issue : Volume 03, Issue 02 (February 2014)
 Published (First Online): 24-02-2014
 ISSN (Online) : 2278-0181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
A High-Performance Energy-Efficient Architecture for Portable Wireless Devices Based on Folded Tree and Multi-Bit Flip-Flop Merging Technique
K. Hemapriya1 (M.E., AE), R. Karthikeyan2, M.E., Dr. S. Rajkumar3
1Student, Dept. of Applied Electronics; 2Assistant Professor, ECE Dept.; 3Associate Professor, ECE Dept., Arulmigu Meenakshi Amman College of Engineering, Thiruvannamalai Dt., near Kanchipuram, India
Abstract: Wireless data communication, carried out with the help of wireless sensor nodes, is an essential component of mobile computing. Because these nodes draw a limited energy supply from batteries, low-power design has become an inevitable part of today's wireless devices. Power has become a pressing issue in VLSI design, and in modern integrated circuits the power consumed by clocking takes an increasingly dominant share. Reducing power consumption not only extends battery life but also avoids overheating. Employing a more appropriate Processing Element (PE) reduces power consumption significantly. In this paper, a low-power design is achieved by using the Folded Tree Architecture (FTA) and the Multi-Bit Flip-Flop Merging (MBFM) technique for on-the-node data processing in wireless sensor networks, exploiting Parallel Prefix Operations (PPO) and data locality in hardware. Besides power reduction, minimizing area and delay is also considered.
Keywords: Folded Tree Architecture, Parallel Prefix Operation, Multi-Bit Flip-Flop Merging, Processing Element, Wireless Sensor Network.

INTRODUCTION
Power optimization is one of the most important design objectives in modern nanometer integrated circuit design. Especially for Wireless Sensor Networks (WSNs), power optimization has become an inevitable part of today's VLSI design. Power optimization not only extends battery life but also reduces overheating.
Self-configuring wireless sensor networks can be invaluable in many civil and military applications for collecting, processing, and disseminating wide ranges of complex environmental data. Because of this, they have attracted considerable research attention in recent years. Sensor nodes are battery driven and hence operate on an extremely frugal energy budget. Further, they must have a lifetime on the order of months to years, since battery replacement is not an option for networks with thousands of physically embedded nodes. In some cases, these networks may be required to operate solely on energy scavenged from the environment through seismic, photovoltaic or thermal conversion. This makes energy consumption the most important factor determining sensor node lifetime [1].
Another important application of wireless sensor networks is event tracking, which has widespread use in areas such as security surveillance and wildlife habitat monitoring. Tracking involves a significant amount of collaboration between individual sensors to perform complex signal processing algorithms such as Kalman filtering, Bayesian data fusion and coherent beamforming. These applications require more energy for their processing.
In general, Wireless Sensor Networks can operate in four distinct modes of operation: Transmit, Receive, Idle and Sleep. An important observation for most radios is that operating in Idle mode results in significantly high power consumption, almost equal to the power consumed in Receive mode. The data-driven nature of WSN applications requires a specific low-power data processing approach. By employing a more appropriate Processing Element, the power consumption in all four modes of operation can be reduced significantly. In present VLSI technology, reducing power consumption is an important issue. Especially for WSNs, with their limited battery lifetime, low-power VLSI design has become inevitable for wireless communication. The goal of this paper is to design a low-energy architecture for WSN nodes based on the Folded Tree and the Multi-Bit Flip-Flop Merging technique.

RELATED WORKS
In [2], the authors propose a low-energy data processing architecture for WSN nodes using the folded tree method. That paper identifies that many WSN applications employ algorithms that can be solved using parallel prefix sums. Therefore, an alternative architecture is proposed to calculate them energy-efficiently. It consists of several parallel Processing Elements structured as a folded tree. The folded tree method with parallel prefix operations reduces the number of processing elements and the memory bottleneck. However, because the clock must be distributed to more flip-flops, it consumes more clock power, and the parallel prefix operations have high delay.
In [3], a method is proposed for low clock power consumption in WSN nodes. A previously derived clock energy model is briefly reviewed, and a comprehensive framework is presented for estimating system-wide (chip-level) and clock-subsystem power as a function of technology scaling. This framework is used to study and quantify the impact that various intensifying concerns associated with scaling will have on clock energy, and their relative impact on the overall system energy. This technology scaling method reduces clock power consumption (both static and dynamic), but because of the large processing-element area and the inverter chains, the power-delay product increases.

PROPOSED SCHEME
The Folded Tree Architecture with Parallel Prefix Operations is used to reduce the total number of Processing Elements in the VLSI design. Reducing the number of processing elements reduces the total area. Since power scales with area, power consumption is also reduced. During processing and transmission of signals, however, the WSN nodes still consume considerable power; clock distribution alone accounts for nearly 70% of it. To optimize the power spent on clock distribution, a multi-bit flip-flop merging technique based on clock skew is added to the Folded Tree Architecture.

Folded Tree Architecture
A straightforward binary tree implementation of Blelloch's approach costs a significant amount of area, as n inputs require p = n - 1 PEs. To reduce area and power, pipelining can be traded for throughput. With a classic binary tree, as soon as a layer of PEs finishes processing, the results are passed on and new calculations can already recommence independently [8].
Fig 1. Binary tree equivalent to folded tree
The idea presented here is to fold the tree back onto itself to maximally reuse the PEs. In doing so, p becomes proportional to n/2 and the area is cut in half. Since power scales with area, power is also cut roughly in half. Note that the interconnect is reduced as well. This folded tree topology is depicted in Fig 1, where it is functionally equivalent to the binary tree on the left. With the Folded Tree Architecture, power consumption, area and wire length are reduced considerably. For on-the-node data processing in wireless sensor networks, the Folded Tree Architecture, using parallel prefix operations and data locality in hardware, reduces both area and power consumption.
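The area argument can be made concrete with a minimal sketch. Assuming an input size n that is a power of two, the Python below runs a Blelloch-style exclusive prefix sum (up-sweep followed by down-sweep) and compares the PE count of a flat binary tree, n - 1, with the n/2 PEs of the folded tree. The function names are illustrative and do not come from the paper.

```python
def blelloch_exclusive_scan(values):
    """Exclusive prefix sum via up-sweep and down-sweep.

    Assumption: len(values) is a power of two.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    tree = list(values)

    # Up-sweep: each tree level adds pairs of partial sums.
    stride = 1
    while stride < n:
        for i in range(2 * stride - 1, n, 2 * stride):
            tree[i] = tree[i - stride] + tree[i]
        stride *= 2

    # Down-sweep: push prefixes back down towards the leaves.
    tree[n - 1] = 0
    stride = n // 2
    while stride >= 1:
        for i in range(2 * stride - 1, n, 2 * stride):
            left_sum = tree[i - stride]
            tree[i - stride] = tree[i]        # left child inherits parent prefix
            tree[i] = tree[i] + left_sum      # right child adds the left subtree
        stride //= 2
    return tree


def pe_count_binary_tree(n):
    """One PE per internal node of a full binary tree with n leaves."""
    return n - 1


def pe_count_folded_tree(n):
    """The folded tree keeps only the widest layer of PEs and reuses it."""
    return n // 2
```

For n = 8 the flat tree needs 7 PEs while the folded tree needs 4, which is where the "roughly half" area figure comes from.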
TABLE I
Leakage Power and Dynamic Energy for one PE under Normal Conditions
FTA is designed to reuse the PE nodes, cutting the total area in half. It limits the data set by preprocessing with parallel prefix operations. It combines data-flow and control-flow elements to introduce a local distributed memory, which removes the memory bottleneck while retaining sufficient flexibility. Using several processing elements consumes more power, so with FTA the PEs are reused and power is reduced [2].
Fig 2. Folding Architecture
In the folding architecture, the PEs are reused with the help of a counter and a Finite State Machine (FSM). The iteration count held in the counter is the total number of times the specified PE is going to be reused. The FSM enables and resets the iteration count based on the instructions.
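As a rough illustration of PE reuse, the sketch below reduces n inputs on n/2 PEs and keeps a per-PE reuse counter. It assumes a hypothetical mapping in which PE j always performs the j-th addition of each tree level; the paper's actual mapping and FSM encoding are not given, so the loop here only stands in for the control logic.

```python
def folded_reduce(values):
    """Sum `values` on n/2 reusable PEs, one tree level per pass.

    Returns (total, reuse_counter), where reuse_counter[j] is the number
    of times hypothetical PE j was activated.
    Assumption: len(values) is a power of two and at least 2.
    """
    n = len(values)
    if n < 2 or n & (n - 1):
        raise ValueError("length must be a power of two and at least 2")

    reuse_counter = [0] * (n // 2)
    data = list(values)

    # Each pass of this loop plays the role of one FSM iteration.
    while len(data) > 1:
        active = len(data) // 2               # PEs needed at this level
        data = [data[2 * j] + data[2 * j + 1] for j in range(active)]
        for j in range(active):
            reuse_counter[j] += 1             # the iteration counter of PE j
    return data[0], reuse_counter
```

With 8 inputs the four PEs fire 3, 2, 1 and 1 times, for 7 operations in total, the same work a flat tree would spread over 7 separate PEs.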

Parallel Prefix Adder
Adders are a very important component in digital systems because of their extensive use in other basic digital operations such as subtraction, multiplication and division. Hence, improving the performance of the digital adder would greatly advance the execution of binary operations inside a circuit composed of such blocks. The performance of a digital circuit block is gauged by analyzing its power dissipation, layout area and operating speed.
The main idea behind parallel prefix addition is to generate all incoming carries in parallel and avoid waiting until the correct carry propagates from the stage of the adder where it was generated. Parallel prefix adders are constructed out of fundamental carry operators, denoted by $\circ$, as follows:
$$(G'', P'') \circ (G', P') = (G'' + G' \cdot P'',\; P' \cdot P'')$$
where P'' and P' are the propagate signals and G'' and G' are the generate signals.
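A minimal sketch of the carry operator, assuming the usual bit-level definitions g_i = a_i AND b_i and p_i = a_i XOR b_i (these definitions are standard but are not spelled out in the text). Here the operator is applied bit by bit for clarity; a parallel prefix adder applies the same operator in a tree.

```python
def carry_op(left, right):
    """(G'', P'') o (G', P') = (G'' + G'.P'', P'.P'').

    `left` is the more significant (G'', P'') pair, `right` the less
    significant (G', P') pair.
    """
    g_hi, p_hi = left
    g_lo, p_lo = right
    return (g_hi | (g_lo & p_hi), p_lo & p_hi)


def prefix_add(a, b, width):
    """Add two unsigned integers using only the carry operator.

    Returns (sum modulo 2**width, carry_out). Assumes width >= 1.
    """
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]

    total = 0
    carry_in = 0
    prefix = None
    for i in range(width):
        total |= (p[i] ^ carry_in) << i
        # Extend the running group (G, P) by one more significant bit.
        prefix = (g[i], p[i]) if prefix is None else carry_op((g[i], p[i]), prefix)
        carry_in = prefix[0]
    return total, carry_in
```

Because the operator is associative, the groups can be combined in any bracketing, which is what allows the carries to be computed in a tree rather than a chain.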
Fig 3. Carry operator
A parallel prefix adder can be represented as a parallel prefix graph consisting of carry operator nodes. The Ladner-Fischer parallel prefix adder structure has minimum logic depth, but has a large fan-out requirement of up to n/2. The Ladner-Fischer adder has lower delay than other parallel prefix adders. The power-delay product should be low in order to achieve high throughput and speed.
Fig 4. Ladner Fischer Parallel Prefix Adder
The Ladner-Fischer adder constructs a circuit that computes the prefix sums; in this circuit, each node performs an addition of two numbers. With this construction, one can choose a trade-off between the circuit depth and the number of nodes.
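The sketch below builds a divide-and-conquer prefix network with logarithmic depth and fan-out growing up to n/2, the kind of structure usually drawn under the Ladner-Fischer or Sklansky name. It is an illustrative model, not a transcription of the paper's Fig 4, and it assumes the operand width is a power of two.

```python
def carry_op(left, right):
    """Carry operator: `left` is the more significant (G, P) pair."""
    g_hi, p_hi = left
    g_lo, p_lo = right
    return (g_hi | (g_lo & p_hi), p_lo & p_hi)


def divide_and_conquer_prefix(gp):
    """Return (prefixes, depth, operator_count) for a list of (g, p) pairs.

    At level k, every position whose bit k is set combines with the last
    position of the neighbouring lower block, so one node may drive up to
    n/2 others. Assumption: len(gp) is a power of two.
    """
    n = len(gp)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(gp)
    depth = 0
    operator_count = 0
    level = 0
    while (1 << level) < n:
        for i in range(n):
            if (i >> level) & 1:
                j = ((i >> level) << level) - 1   # top of the lower block
                out[i] = carry_op(out[i], out[j])
                operator_count += 1
        depth += 1
        level += 1
    return out, depth, operator_count


def parallel_prefix_add(a, b, width):
    """Add two unsigned integers; returns (sum mod 2**width, carry_out)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    prefixes, _, _ = divide_and_conquer_prefix(list(zip(g, p)))
    carries = [0] + [group_g for group_g, _ in prefixes]
    total = sum((p[i] ^ carries[i]) << i for i in range(width))
    return total, carries[width]
```

For 8 bits this network has depth 3 and uses 12 carry operators, which shows the depth-versus-node-count trade-off in a small case.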

Multi-Bit Flip-Flop Merging Technique
Given a design in which the locations of the cells have been determined, the power consumed by clocking can be reduced further by replacing several flip-flops (FFs) with multi-bit flip-flops. During clock tree synthesis, fewer flip-flops means fewer clock sinks. Thus, the resulting clock network has lower power consumption and uses fewer routing resources.
Besides, once several smaller flip-flops are replaced by larger multi-bit flip-flops, device variations in the corresponding circuit can be effectively reduced. As CMOS technology progresses, the driving capability of an inverter-based clock buffer increases significantly. Fig 5 shows the block diagrams of 2-bit and 4-bit flip-flops. The total power is reduced by merging two 2-bit flip-flops into one 4-bit flip-flop, since the flip-flops share the same clock [6].
Fig 5. Flip-Flop Merging

Identify Mergeable Flip-Flops
Every flip-flop in the processing element has a different clock skew. The clock skew is measured by calculating the delay difference between the points having maximum and minimum delays. The rise skew is measured by calculating the time it takes for a signal to rise from 10% of the supply voltage to 90% of the supply voltage.
$$\text{Clock skew} = \text{Delay}_{\max} - \text{Delay}_{\min} \tag{1}$$
$$\text{Rise skew} = \text{Time}_{v = 0.9 \cdot \text{Supply}} - \text{Time}_{v = 0.1 \cdot \text{Supply}} \tag{2}$$
The fall skew is measured by calculating the time it takes for a signal to fall from 90% of the supply voltage to 10% of the supply voltage.
$$\text{Fall skew} = \text{Time}_{v = 0.1 \cdot \text{Supply}} - \text{Time}_{v = 0.9 \cdot \text{Supply}} \tag{3}$$
Clock skew can be positive or negative. If the clock signals are completely synchronous, the clock skew observed at these registers is zero.
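Equations (1) to (3) can be evaluated on sampled data as in the sketch below. It assumes a piecewise-linear waveform given as parallel lists of times and voltages; the helper `crossing_time` and all sample values are hypothetical and only serve to show the arithmetic.

```python
def clock_skew(delays):
    """Eq. (1): difference between the largest and smallest clock delay."""
    return max(delays) - min(delays)


def crossing_time(times, volts, level):
    """First time a piecewise-linear waveform reaches `level`."""
    for k in range(len(times) - 1):
        t0, t1 = times[k], times[k + 1]
        v0, v1 = volts[k], volts[k + 1]
        if v0 == level:
            return t0
        if (v0 - level) * (v1 - level) < 0:
            # Linear interpolation inside the segment that crosses `level`.
            return t0 + (level - v0) * (t1 - t0) / (v1 - v0)
    if volts[-1] == level:
        return times[-1]
    raise ValueError("waveform never reaches the requested level")


def rise_skew(times, volts, supply):
    """Eq. (2): time at 90% of supply minus time at 10% (rising edge)."""
    return crossing_time(times, volts, 0.9 * supply) - crossing_time(times, volts, 0.1 * supply)


def fall_skew(times, volts, supply):
    """Eq. (3): time at 10% of supply minus time at 90% (falling edge)."""
    return crossing_time(times, volts, 0.1 * supply) - crossing_time(times, volts, 0.9 * supply)
```

For example, arrival delays of 1200 ps, 1500 ps and 900 ps give a clock skew of 600 ps, and an ideal linear ramp from 0 V to the supply over 100 ps gives a rise skew of about 80 ps.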
Fig 6. Multi-Bit Flip-Flop Merging technique
In this approach, flip-flops are merged with the help of clock skew. In the circuit, the clock distributed to each flip-flop has a different skew. Single-bit flip-flops having the same clock skew are merged with each other to form a single multi-bit flip-flop. For simulation we considered 20 regional clock signals, 50 flip-flops and 1 system clock. The 20 regional clock signals are supplied to the 50 flip-flops, and the skew difference between the main clock and each regional clock is identified using an edge detection circuit. Either the rising edge or the falling edge can be used to measure the skew difference; here we used a rising-edge detection circuit. After the rising-edge difference between the system clock and a regional clock is found, the counter counts the number of clock cycles between two clocks having the same skew.

Build a Combination Table
The combination table contains each and every pair of mergeable flip-flops. It records all possible combinations of FFs so that feasible FFs can be found before replacement. Because only one combination of FFs needs to be considered at a time, the search time is reduced greatly. The clock skew difference is used to build the combination table. For simplicity, we use 50 flip-flops and 20 regional clocks. The combination table shows the FFs having the same clock frequency, which can be merged by providing a single clock.
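One way to picture the combination table is as a grouping of flip-flops by their measured skew count. The sketch below assumes each flip-flop already has an integer skew count from the counter circuit; the flip-flop names and counts are made up for illustration and are not the paper's data.

```python
from collections import defaultdict
from itertools import combinations


def build_combination_table(skew_counts):
    """Group flip-flop names by their skew count.

    `skew_counts` maps a flip-flop name to its (hypothetical) counter value.
    """
    table = defaultdict(list)
    for flip_flop, count in skew_counts.items():
        table[count].append(flip_flop)
    return {count: sorted(names) for count, names in sorted(table.items())}


def mergeable_pairs(table):
    """List every pair of flip-flops that share a skew count."""
    pairs = []
    for names in table.values():
        pairs.extend(combinations(names, 2))
    return pairs


example = {"FF0": 3, "FF1": 3, "FF2": 5, "FF3": 3, "FF4": 7}
table = build_combination_table(example)
```

Here FF0, FF1 and FF3 land in the same row and yield three mergeable pairs, while FF2 and FF4 sit alone in their rows and stay single-bit.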

Merge Flip-Flops
Distributing the clock signal to a large number of flip-flops consumes most of the power drawn from the power source. Instead of giving a separate clock signal to each FF in the circuit, a single clock signal is given to groups of flip-flops based on the clock skew values recorded in the combination table. In this step, the combination table is used to combine flip-flops. Depending on the system clock and regional clock values, the flip-flops with the same clock skew are identified. Flip-flops having the same clock skew value are given a single clock signal.
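The effect on the number of clock sinks can be sketched as follows. The code assumes a hypothetical cell library with 1-, 2- and 4-bit flip-flops and packs each same-skew group greedily into the widest cells first; the paper does not state its packing rule, so this is only one plausible choice.

```python
def pack_group(size, widths=(4, 2, 1)):
    """Split a group of `size` same-skew flip-flops into multi-bit cells.

    Greedy: use the widest available cell as often as possible.
    """
    cells = []
    remaining = size
    for width in sorted(widths, reverse=True):
        while remaining >= width:
            cells.append(width)
            remaining -= width
    if remaining:
        raise ValueError("library needs a 1-bit cell to cover every group size")
    return cells


def clock_sinks(group_sizes, widths=(4, 2, 1)):
    """Return (sinks_before, sinks_after) for a list of group sizes."""
    before = sum(group_sizes)
    after = sum(len(pack_group(size, widths)) for size in group_sizes)
    return before, after
```

For instance, a group of 9 flip-flops sharing one clock packs into two 4-bit cells and one 1-bit cell, so 9 clock sinks become 3.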


SIMULATION RESULT
We simulated the MBFM technique in ModelSim using 1 system clock, 20 regional clocks and 50 flip-flops. We supplied the 20 regional clocks to the 50 flip-flops and identified, for each flip-flop separately, the skew difference between its regional clock and the system clock. The skew difference of every flip-flop is recorded in the combination table.
Fig 7. Regional clock signal triggers 50 flip-flops
The clock distribution network triggers 50 flip-flops using 20 regional clock signals.
Fig 8. Edge detection by using differential circuit
Skew is detected using a differential edge detection circuit. We use an exclusive-OR gate to find the edges between the various regional clocks. The synchronous clock signal is applied directly to one input, and the skewed clock is applied to the other input. The gate generates the difference between the system clock and the regional clock. Here the difference between the main clock (MC) and the distributed clock (DC) is identified using an XOR array circuit with output DS: for MC = 0 and DC = 0, DS = 0; for MC = 0 and DC = 1, DS = 1; for MC = 1 and DC = 0, DS = 1; and for MC = 1 and DC = 1, DS = 0.
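The XOR detector can be modelled on sampled waveforms as in the sketch below. The square-wave generator, the sample-based time axis and the function names are assumptions made for illustration; the measurement is only meaningful while the delay is shorter than half a clock period.

```python
def square_wave(period, length, delay=0):
    """Sampled 50% duty-cycle clock that starts high at sample `delay`."""
    half = period // 2
    return [1 if t >= delay and (t - delay) % period < half else 0
            for t in range(length)]


def xor_difference(main_clock, regional_clock):
    """DS = MC XOR DC, sample by sample."""
    return [mc ^ dc for mc, dc in zip(main_clock, regional_clock)]


def skew_in_samples(main_clock, regional_clock):
    """Width of the first DS pulse after the first rising edge of MC."""
    ds = xor_difference(main_clock, regional_clock)
    # Raises StopIteration if the main clock never rises.
    start = next(t for t, level in enumerate(main_clock)
                 if level == 1 and (t == 0 or main_clock[t - 1] == 0))
    width = 0
    for bit in ds[start:]:
        if bit == 0:
            break
        width += 1
    return width


mc = square_wave(period=20, length=100)
dc = square_wave(period=20, length=100, delay=3)
```

With a regional clock delayed by 3 samples, DS stays high for 3 samples after each main-clock edge, and that pulse width is what a counter would record as the skew.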
Fig 9. Counter values between clocks
Rising edges appear at the detector output whenever the main clock and the corresponding regional clock differ. Here we use a difference of 10 cycles between the reference clock and the system clock. Initially the counter variables are cleared by driving reset high; after one cycle period, the counter begins to count whenever edges are detected.
The combination table shows the flip-flops having the same frequency; we merged these flip-flops by providing a single clock signal, so the power consumption of the clock distribution network (CDN) is reduced by half.
Fig 10. Combination Table
The combination table memory is obtained from the ModelSim memory list analyzer. We use 20 skewed clocks distributed across 50 flip-flops. The edge detection and counter circuits identify the number of clock signals operating at the same frequency. From the look-up table values above, Clock 1 is used by 9 flip-flops, which can be merged together. Clocks 4, 5, 6 and 7 are each used by a single flip-flop, which cannot be merged.
The Altera Quartus II 9 IDE is used for power analysis and throughput analysis. In the power analysis, we observed that the proposed method reduces power consumption and area and increases throughput.
Fig 11. Power Analysis Report
The power analysis results in Fig 11 show that the proposed system consumes less power than the existing system. They report the total thermal power, dynamic power and I/O power consumption for the 50 flip-flops after implementing the proposed technique. The area analysis report in Fig 12 shows that the total numbers of processing elements and registers are reduced, so the total area usage is reduced.
Fig 12. Area Analysis Report
Fig 13. Comparison of Existing and Proposed Methods
The proposed MBFM technique reduces power consumption from 128.12 mW to 62.15 mW, area utilization from 2334 logic elements to 1254 logic elements, and delay from 432 ns to 325 ns. The throughput increases from 589 Mbps to 325 Gbps.
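The relative savings implied by the reported power, area and delay figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the text; the throughput figures are left out because they are reported in different units.

```python
def percent_reduction(before, after):
    """Relative reduction in percent, measured against the original value."""
    return 100.0 * (before - after) / before


# (existing, proposed) values as reported in the text
reported = {
    "power (mW)": (128.12, 62.15),
    "area (logic elements)": (2334, 1254),
    "delay (ns)": (432, 325),
}

for name, (before, after) in reported.items():
    print(f"{name}: {percent_reduction(before, after):.1f}% lower")
```

Under these figures power falls by about 51.5%, area by about 46.3% and delay by about 24.8%, which is in line with the roughly halved power stated for the clock distribution network.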

CONCLUSION
This paper presented the Folded Tree Architecture and the Multi-Bit Flip-Flop Merging technique for WSN applications. The design covers many data processing algorithms for WSN applications, along with parallel prefix operations and clock distribution networks. Power is saved with the flip-flop merging technique by providing a single clock signal to mergeable flip-flops with the help of the combination look-up table. This technique can therefore be used effectively in integrated circuits that require low power consumption in the clock distribution network and low-skew clocks. Area is reduced with the folded tree architecture by reusing processing elements. The Ladner-Fischer parallel prefix adder reduces delay and achieves high throughput. The proposed architecture significantly reduces both power and area in WSN nodes and can save up to half of the power of the total sensor node.
REFERENCES

[1] V. Raghunathan, C. Schurgers, S. Park, and M. B. Srivastava, "Energy-aware wireless microsensor networks," IEEE Signal Process. Mag., vol. 19, no. 2, pp. 40-50, Mar. 2002.
[2] C. Walravens and W. Dehaene, "Design of a low-energy data processing architecture for WSN nodes," in Proc. Design, Automat. Test Eur. Conf. Exhibit., Mar. 2012, pp. 570-573.
[3] D. Duarte, V. Narayanan, and M. J. Irwin, "Impact of technology scaling in the clock power," in Proc. IEEE VLSI Comput. Soc. Annu. Symp., Pittsburgh, PA, Apr. 2002, pp. 52-57.
[4] H. Kawaguchi and T. Sakurai, "A reduced clock-swing flip-flop (RCSFF) for 63% clock power reduction," in VLSI Circuits Dig. Tech. Papers Symp., Jun. 1997, pp. 97-98.
[5] Y. Cheon, P.-H. Ho, A. B. Kahng, S. Reda, and Q. Wang, "Power-aware placement," in Proc. Design Autom. Conf., Jun. 2005, pp. 795-800.
[6] Y.-T. Chang, C.-C. Hsu, P.-H. Lin, Y.-W. Tsai, and S.-F. Chen, "Post-placement power optimization with multi-bit flip-flops," in Proc. IEEE/ACM Comput.-Aided Design Int. Conf., San Jose, CA, Nov. 2010, pp. 218-223.
[7] P. Sanders and J. Traff, "Parallel prefix (scan) algorithms for MPI," in Proc. Recent Adv. Parallel Virtual Mach. Message Pass. Interf., 2006, pp. 49-57.
[8] G. Blelloch, "Scans as primitive parallel operations," IEEE Trans. Comput., vol. 38, no. 11, pp. 1526-1538, Nov. 1989.
[9] D. B. Hoang and N. Kamyabpour, "An energy driven architecture for wireless sensor networks," in Int. Conf. Parallel and Distributed Computing, Applications and Technologies, Dec. 2012.
[10] L. Nazhandali, M. Minuth, and T. Austin, "SensBench: Toward an accurate evaluation of sensor network processors," in Proc. IEEE Workload Characterizat. Symp., Oct. 2005.
[11] M. Hempstead, D. Brooks, and G. Wei, "An accelerator-based wireless sensor network processor in 130 nm CMOS," J. Emerg. Select. Topics Circuits Syst., vol. 1, no. 2, pp. 193-202, 2011.
[12] B. A. Warneke and K. S. J. Pister, "An ultra-low energy microcontroller for smart dust wireless sensor networks," in Proc. IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2004, pp. 316-317.
[13] M. Hempstead, M. Welsh, and D. Brooks, "TinyBench: The case for a standardized benchmark suite for TinyOS based wireless sensor network devices," in Proc. IEEE 29th Local Comput. Netw. Conf., Nov. 2004, pp. 585-586.
[14] O. Girard. (2010). OpenMSP430 processor core, available at opencores.org. [Online]. Available: http://opencores.org/project,openmsp430
[15] H. Stone, "Parallel processing with the perfect shuffle," IEEE Trans. Comput., vol. 100, no. 2, pp. 153-161, Feb. 1971.
[16] M. Hempstead, J. M. Lyons, D. Brooks, and G.-Y. Wei, "Survey of hardware systems for wireless sensor networks," J. Low Power Electron., vol. 4, no. 1, pp. 11-29, 2008.
[17] C.-C. Yu, "Design of low-power double edge-triggered flip-flop circuit," in IEEE Conference on Industrial Electronics and Applications, 2007, pp. 2054-2057.
[18] M. Donno, A. Ivaldi, L. Benini, and E. Macii, "Clock tree power optimization based on RTL clock-gating," in Design Automation Conference, 2003, pp. 622-627.
AUTHORS
Ms. K. Hemapriya was born in Tamilnadu in 1991. She received the B.E. degree in Electronics and Communication Engineering from University College of Engineering – Panruti, Anna University, in 2012. She is currently pursuing a Master's degree in Applied Electronics at Arulmigu Meenakshi Amman College of Engineering, Thiruvannamalai Dt., near Kanchipuram, Anna University, India. She has presented papers at national and international conferences. Her current research interests include VLSI back-end design and VLSI architectural strategies for high-speed and low-power DSP applications.
Mr. R. Karthikeyan was born in Tamilnadu in 1985. He received the B.E. degree in ECE from Krishnasamy College of Engineering & Technology, Cuddalore, in 2007 and the M.E. degree in Communication Systems from B.S. Abdur Rahman Crescent Engineering College, Chennai, in 2009. He joined Arulmigu Meenakshi Amman College of Engineering as Assistant Professor in the Department of ECE in 2010. He has 4 years of teaching experience and has guided many UG and PG projects. His research interests are in the areas of wireless communication and networking. He has published in many national and international journals. He is a member of IEEE and ISTE.
Dr. S. Rajkumar was born in Tamilnadu in 1980. He received the B.E. degree in EEE from Periyar University, Salem, in 2002, the M.Tech. degree in VLSI Design from SASTRA University, Thanjavur, in 2004, and the Doctor of Philosophy degree in VLSI from the University of Allahabad, Allahabad. He has 10 years of teaching experience and has guided many UG and PG projects. His research interests are in the areas of VLSI, communication and networking. He has published in many national and international journals. He is a member of the VLSI Society, IEEE and ISTE.