Power Optimization Techniques for NoC

Abstract: The on-chip network has become a significant solution to the communication limitations of the system-on-chip (SoC). Demand has grown for higher bandwidth to sustain high core utilization, together with low power consumption and higher performance. The router is the major circuit block in a network-on-chip (NoC), and it strongly affects power dissipation, latency, and performance. Dynamic power consumption is one of the major components of total power consumption. This paper presents the detailed structure and verification of the router module and various power optimization techniques for NoC based on restructuring the architecture. The router design is coded in Verilog and is synthesized and simulated in the Xilinx ISE Design Suite 19.1 tool.


I. INTRODUCTION
As technology advances, memories and processors are becoming faster, smaller, cheaper, and more energy efficient. This allows computer architects to incorporate more functionality in a single chip [1]. As Moore's Law approaches its limit, the integration of many simple cores is being used to keep improving performance while reducing fabrication costs. Consequently, not only computing power and memory access but also inter-core communication has become a performance bottleneck.
As the semiconductor industry advances towards complex system-on-chip (SoC) designs containing hundreds of IPs, the traditional bus architecture shown in Figure 1, used for communication between multiple cores, introduces considerable latency [3] and area overhead. As technology scales to lower nodes, the need for high performance, higher efficiency, and low-power design becomes the bottleneck. To satisfy these SoC criteria, bus-based communication has lost its significance, and a better solution is needed.
Interconnection networks have emerged to replace traditional buses as the prevailing solution for fast, low-cost, and scalable communication. They are key to successful future digital systems, both chip multiprocessors consisting of identical cores and heterogeneous systems-on-chip. Over the last ten years, there has been an in-depth effort towards optimizing networks-on-chip (NoCs), from low-level physical aspects up to system-level and application-related problems. NoCs have now reached a mature level of development and are integrated as a fundamental component.
NoC architectures differ according to the technology and the target application. There are many NoC topologies [2], such as Chip-Level Integration of Communicating Heterogeneous Elements (CLICHE), Butterfly Fat Tree (BFT), Scalable Programmable Integrated Network (SPIN), and Mesh (Figure 2). Every topology has its own advantages and disadvantages. The mesh topology has the following advantages over other topologies:
a. Each node is connected to 4 neighbours, except for the nodes on the boundary.
b. Easy to lay out on chip: regular, equal-length links.
c. Path diversity: there are many ways to get from one node to another.

A. Flow Control
Flow control is the method by which an upstream router learns whether buffer space is available in the downstream router. The router sending the data packets is called the upstream router, and the router receiving them is called the downstream router. Buffer availability is communicated by exchanging credits between the two routers. If a buffer is available, a packet can be forwarded to the next router after the downstream router sends an acknowledgment signal to the upstream router.
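The credit exchange described above can be illustrated with a minimal behavioural sketch in Python. It is not the Verilog router of this paper; the class name `CreditLink` and its methods are hypothetical, and one credit is assumed to stand for one free buffer slot in the downstream router.

```python
class CreditLink:
    """Upstream-side view of one link with credit-based flow control (illustrative)."""

    def __init__(self, downstream_buffer_slots):
        # Assumption: one credit corresponds to one free downstream buffer slot.
        self.credits = downstream_buffer_slots
        self.downstream_buffer = []

    def try_send(self, flit):
        """Send a flit only if the downstream buffer has a free slot."""
        if self.credits == 0:
            return False  # no buffer space downstream: the upstream router stalls
        self.credits -= 1
        self.downstream_buffer.append(flit)
        return True

    def downstream_forward(self):
        """The downstream router drains one flit and returns a credit upstream."""
        flit = self.downstream_buffer.pop(0)
        self.credits += 1
        return flit


link = CreditLink(downstream_buffer_slots=2)
sent = [link.try_send(f) for f in ("f0", "f1", "f2")]  # third send finds no credit
link.downstream_forward()                               # frees a slot, returns a credit
sent_after_credit = link.try_send("f2")
```

In this sketch the third flit is refused until the downstream side forwards a flit and hands a credit back, which is the stall behaviour the text describes.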

B. Wormhole Flow Control
Data packets are divided into smaller units called flits. A data packet consists of three types of flits: head, body, and tail. Flits are sent across the fabric in wormhole fashion: the body flits follow the head flit, and the tail flit follows the body flits, in a pipelined manner. Figure 3 represents the wormhole flow control mechanism. If the head flit is blocked, the rest of the packet stops as well, since the destination (routing information) is carried only by the head flit. This type of flow control has lower latency, and because buffers are not reserved for the entire packet, it uses buffers efficiently. With this type of flow control, a packet can occupy resources across multiple routers. Wormhole routing has one problem, called head-of-line blocking: if the head flit cannot move because of contention, another packet (worm) queued behind it cannot proceed even though the links it needs may be idle.
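To make head-of-line blocking concrete, the following Python sketch models a single input FIFO. It is a simplification under stated assumptions: the helper names `packetize` and `movable_flit` are hypothetical, and each flit records the output port selected for its head flit as a stand-in for the route that body and tail flits inherit.

```python
from collections import deque


def packetize(packet_id, out_port, num_body_flits):
    """Split a packet into a head flit, body flits and a tail flit (illustrative)."""
    kinds = ["head"] + ["body"] * num_body_flits + ["tail"]
    # Assumption: every flit is tagged with the output port chosen for the head.
    return [{"packet": packet_id, "kind": k, "out": out_port} for k in kinds]


def movable_flit(fifo, busy_outputs):
    """Return the flit that may leave a single input FIFO this cycle, or None."""
    if not fifo:
        return None
    front = fifo[0]
    # Only the front of the FIFO can move; a blocked front stalls everything behind it.
    return None if front["out"] in busy_outputs else front


# Packet A wants East, packet B (queued behind A) wants North.
fifo = deque(packetize("A", "East", 1) + packetize("B", "North", 1))
blocked = movable_flit(fifo, busy_outputs={"East"})  # A is stuck, so B waits too
free = movable_flit(fifo, busy_outputs=set())
```

Packet B is held back even though the North output is idle, because it shares one FIFO with the blocked packet A.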

C. Virtual Channel Flow Control
Virtual channel flow control has been proposed to avoid the head-of-line blocking discussed above. One physical channel is multiplexed among multiple virtual channels: the single FIFO buffer is replaced with multiple buffers, so that a single physical channel terminates at multiple buffers. These buffers are known as virtual channels (VCs). A VC number is allocated to the head flit once at each router, and the remaining flits of the packet inherit the same VC number. Flits of different packets can now be interleaved on the same physical channel.
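A small Python sketch, under the same simplifying assumptions as a single-port behavioural model, shows how separate per-VC buffers let one packet bypass another that is blocked. The class `VirtualChannelPort` and the `out` field on each flit are hypothetical names introduced only for illustration.

```python
from collections import deque


class VirtualChannelPort:
    """One physical input channel that terminates at several VC buffers (illustrative)."""

    def __init__(self, num_vcs):
        self.vcs = [deque() for _ in range(num_vcs)]

    def receive(self, vc_id, flit):
        # Flits of one packet always use the VC number allocated to its head flit.
        self.vcs[vc_id].append(flit)

    def movable_flits(self, busy_outputs):
        """Front flit of every VC whose requested output port is not busy."""
        return [vc[0] for vc in self.vcs
                if vc and vc[0]["out"] not in busy_outputs]


port = VirtualChannelPort(num_vcs=2)
# Flits of packets A and B arrive interleaved on the same physical channel.
arrivals = [
    (0, {"packet": "A", "kind": "head", "out": "East"}),
    (1, {"packet": "B", "kind": "head", "out": "North"}),
    (0, {"packet": "A", "kind": "tail", "out": "East"}),
    (1, {"packet": "B", "kind": "tail", "out": "North"}),
]
for vc_id, flit in arrivals:
    port.receive(vc_id, flit)

ready = port.movable_flits(busy_outputs={"East"})  # A is blocked, B is not
```

With the East output busy, packet A waits in VC 0 while packet B in VC 1 can still advance, which is the situation a single FIFO cannot handle.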

D. Router Microarchitecture
As shown in Figure 4, the router has five input ports: North, East, West, South, and Local. Each port has virtual channels (the number depends on the design). The crossbar connects the input ports to the output ports. The control logic ensures the smooth flow of packets from the input side to the output side. The first function is the buffering of flits: whenever a flit arrives on a channel, it occupies a buffer. The next task is route computation, the process of finding an output port for a packet residing inside a buffer. Route computation is done only for the head flit; the body and tail flits inherit the route assigned to the head flit. The next task is virtual channel allocation, which is based on handshaking between adjacent routers.
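The route computation step can be sketched as follows. The text does not name the routing algorithm, so this example assumes dimension-order (XY) routing, a common choice for 2D meshes. The function names and the coordinate convention (x grows to the East, y grows to the North) are also assumptions.

```python
def xy_route(current, dest):
    """Pick the output port for a head flit using XY routing (assumed algorithm)."""
    cx, cy = current
    dx, dy = dest
    if dx > cx:
        return "East"
    if dx < cx:
        return "West"
    if dy > cy:
        return "North"
    if dy < cy:
        return "South"
    return "Local"  # the flit has reached its destination router


# Coordinate step taken when leaving through each port (assumed convention).
STEP = {"East": (1, 0), "West": (-1, 0), "North": (0, 1), "South": (0, -1)}


def trace_path(src, dest):
    """List the routers visited from src to dest under XY routing."""
    path = [src]
    node = src
    while node != dest:
        step_x, step_y = STEP[xy_route(node, dest)]
        node = (node[0] + step_x, node[1] + step_y)
        path.append(node)
    return path


path = trace_path((0, 0), (2, 1))
```

XY routing first corrects the x offset and then the y offset, so the route depends only on the current and destination coordinates carried by the head flit.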

E. Router Algorithm
The process of reserving a buffer in the downstream router is called virtual channel allocation. The next task is switch allocation: whenever multiple flits require the same output port, the switch allocator chooses which flit proceeds. Once a flit has been chosen, the next task is switch traversal. At most 5 flits can traverse the switch in any given clock cycle. Buffering, route computation, virtual channel allocation, switch allocation, and switch traversal are the 5 logical stages inside the router.
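Switch allocation can be illustrated with a per-output arbiter. The text does not state the arbitration policy, so this Python sketch assumes a round-robin policy; the class `RoundRobinArbiter` and the port ordering in `PORTS` are hypothetical.

```python
PORTS = ["North", "East", "West", "South", "Local"]


class RoundRobinArbiter:
    """Chooses one requesting input per cycle for a single output port (assumed policy)."""

    def __init__(self, num_inputs):
        self.num_inputs = num_inputs
        # Start so that input 0 has the highest priority on the first grant.
        self.last_grant = num_inputs - 1

    def grant(self, requests):
        """requests: set of input indices asking for this output; returns the winner or None."""
        for offset in range(1, self.num_inputs + 1):
            candidate = (self.last_grant + offset) % self.num_inputs
            if candidate in requests:
                self.last_grant = candidate  # the winner gets lowest priority next time
                return candidate
        return None


# Inputs 1 (East) and 3 (South) both request the same output port for three cycles.
arbiter = RoundRobinArbiter(num_inputs=len(PORTS))
grants = [arbiter.grant({1, 3}) for _ in range(3)]
idle = arbiter.grant(set())
```

With one such arbiter per output port, at most one flit is granted per output per cycle, which matches the limit of 5 flits traversing the switch in a cycle.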

III. IMPLEMENTATION OF POWER TECHNIQUES IN NOC
As technology scales down to deep nanometer nodes, the threshold voltage is gradually scaled down as well, which contributes enormously to an increase in subthreshold leakage current and therefore to high power dissipation. In multicore chips, communication between IPs requires more bandwidth and speed because of growing feature requirements. NoC is one solution that allows higher transmission speed between core IPs and supports recent technology trends such as AI and machine learning. NoC provides higher throughput, lower latency, and higher bandwidth. However, NoC suffers from power consumption caused by leakage power and switching activity in multicore circuitry.
Power gating in the NoC helps to overcome this problem. Buffers are used as storage modules for flits; they provide higher bandwidth and help to avoid deadlock. Recent research [7] indicates that a bufferless NoC reduces power and area but trades this off against deadlock, latency, and livelock. In recent designs, power dissipation can be reduced while keeping the buffers, by modifying the design and adding circuitry to the integrated circuit.
Power optimization techniques such as power gating, clock gating, multi-Vt (multiple threshold voltages), and multi-Vdd (multiple supply voltages) are suitable for reducing power dissipation in circuits (Figure 5).

IV. PROPOSED WORK
The present work aims to reduce power consumption in the core architecture and to increase signal transmission speed by implementing a NoC. The proposed design contains several routers connected in a mesh architecture and verified for functionality. Each router is connected to a different IP; here the IPs represent the processor, RAM, ROM, cache memory, sensor unit, DSP, display processor, Wi-Fi, GPU, and so on. To improve device performance, idle routers are turned "off" (deactivated) to save power, as explained in the following sections.

A. Applying Power Gating to routers
As shown in Figure 8, while data flits travel from the upstream (transmitting) router to the downstream (receiving) router, the routers on this path are kept in the active state by turning power gating (PG) "off". The destination router address is carried in the head flit. Depending on the flit transmission path, the routers on that path are activated, and a few neighboring routers are also turned "on" to avoid deadlock and latency problems. For example, data is sent to R33 (router 33) through routers R10, R20, R21, R22, R23, and R33. The path is allocated by route computation on the basis of low latency and deadlock avoidance. If some paths are already transporting data flits, deactivated routers are turned "on" when a new data flit arrives. This methodology saves considerable power, whereas keeping the routers continuously in the active state consumes power. PG activation is controlled by the VCs and the data flits. After all flits have been received, if there is no further data transmission for a few cycles, the routers go into sleep mode. A flit that arrives while a router is in the sleep state can be lost or delayed, because waking the power-gated router in the same cycle increases the latency. To avoid this problem, an early notification is generated for the routers.
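The sleep decision described above can be sketched as a small idle counter. This Python model is behavioural only: the class name `PowerGatedRouter` is hypothetical, the idle threshold of 5 cycles is an assumed value for "a few cycles", and the router is assumed to wake instantly when a flit arrives, so the wake-up latency problem is deliberately left out.

```python
class PowerGatedRouter:
    """Idle-timeout model of a power-gated router (illustrative, no wake-up latency)."""

    def __init__(self, idle_threshold=5):
        self.idle_threshold = idle_threshold  # assumed value for "a few cycles"
        self.idle_cycles = 0
        self.awake = True

    def tick(self, flit_present):
        """Advance one clock cycle and return True if the router is awake afterwards."""
        if flit_present:
            self.awake = True      # simplification: wake-up takes zero cycles
            self.idle_cycles = 0
        elif self.awake:
            self.idle_cycles += 1
            if self.idle_cycles >= self.idle_threshold:
                self.awake = False  # no traffic for long enough: go to sleep
        return self.awake


router = PowerGatedRouter(idle_threshold=5)
traffic = [True, True] + [False] * 6 + [True]
states = [router.tick(flit) for flit in traffic]
```

In this trace the router stays awake through four idle cycles, sleeps from the fifth idle cycle onward, and is active again when the next flit arrives.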

B. Early Notification to PG
To avoid losing data because a router is off (sleep state), the proposed technique turns the router on (active state) before the flit arrives. The next router on a path is notified 5 cycles in advance so that it wakes up in time to capture the data. When the upstream router starts transmitting data to the downstream router, the downstream router is notified to enter the active state. We also propose a method to overcome the drawback of keeping unused routers ON. The routers are divided into two slots: primary routers and secondary routers. A microprocessor core has a few main IPs in regular use, such as the CPU, memory, and execution units, which do not need to be turned off during regular operation. Other units, such as the GPU, DSP, and sensor processing, can be turned OFF. Major power loss comes from supplying power to units that are not in use. Therefore, the main routers act as primary routers, and the remaining ones are slotted as secondary routers. Primary routers are continuously in the active state, and secondary routers are in the sleep state. If a primary router fails completely, one of the secondary routers acts as a primary router and provides complete functionality. We also propose a memory storage table so that a router can easily identify which neighboring routers are in the ON state and which routers have failed. The table holds the status of the router transmission path from the upstream router to the downstream router.
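The timing of the early notification can be illustrated with a short Python sketch. Only the lead time of 5 cycles comes from the text. The function name `wake_schedule`, the per-hop latency of 5 cycles, the start cycle, and the example path are assumptions made for illustration.

```python
def wake_schedule(path, start_cycle, hop_cycles, lead_cycles=5):
    """Map each router on a path to (notify_cycle, arrival_cycle) of the head flit.

    Assumptions: the head flit reaches the first router at start_cycle and
    needs hop_cycles per hop; lead_cycles is the 5-cycle early notification.
    """
    schedule = {}
    for hop, router in enumerate(path):
        arrival = start_cycle + hop * hop_cycles
        notify = max(0, arrival - lead_cycles)  # cannot notify before cycle 0
        schedule[router] = (notify, arrival)
    return schedule


# Illustrative path and timing values.
path = ["R10", "R20", "R21", "R22", "R23", "R33"]
schedule = wake_schedule(path, start_cycle=10, hop_cycles=5)
```

Each router receives its wake-up notification 5 cycles before the head flit arrives, so a router whose wake-up time is at most 5 cycles is active when the flit reaches it.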

D. Route Path Memory Table
The memory table stores the shortest path from the source router to the destination router. In on-chip networks, router failures are possible; in that case, the memory table stores the failed router information and prevents other routers from considering paths through the failed router. The memory table stores the shortest flit transmission path to avoid latency and improve efficiency. In case of a router failure, the upstream router automatically finds a new path to reach the downstream router, and the updated path is stored in the memory table.
The models with and without power gating are compared for mesh topologies of different sizes, for both dynamic and static power. The experimental results show that as the number of routers increases, the buffer count increases, and a higher number of buffers consumes more power. Routers are kept in the active state for a period of cycles during packet transmission between them. Once transmission completes, routers wait idle for 5-6 cycles and then go into sleep mode. Figure 10 is a graphical representation of the total power consumed by the routers. The graph shows the difference between operation without power gating and with power gating applied at the routers, for a deterministic input path. An increase in router count increases power dissipation, measured in µW (microwatts). In comparison, the applied methodology helps to reduce power.
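The route path memory table of Section D can be sketched in Python as a cache of shortest paths that is updated when a router fails. This is one possible model, not the implemented design: the class `RouteMemoryTable` is hypothetical, routers are identified by assumed (x, y) mesh coordinates, and breadth-first search is assumed as the way to find the shortest path.

```python
from collections import deque


class RouteMemoryTable:
    """Stores shortest paths in a mesh and avoids failed routers (illustrative)."""

    def __init__(self, width, height):
        self.width, self.height = width, height
        self.failed = set()  # routers recorded as failed
        self.paths = {}      # (src, dst) -> stored list of routers

    def _neighbours(self, node):
        x, y = node
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            inside = 0 <= nx < self.width and 0 <= ny < self.height
            if inside and (nx, ny) not in self.failed:
                yield (nx, ny)

    def _shortest_path(self, src, dst):
        # Breadth-first search (assumed method) over the working routers.
        parents = {src: None}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            if node == dst:
                path = []
                while node is not None:
                    path.append(node)
                    node = parents[node]
                return path[::-1]
            for nxt in self._neighbours(node):
                if nxt not in parents:
                    parents[nxt] = node
                    queue.append(nxt)
        return None  # destination unreachable

    def lookup(self, src, dst):
        if (src, dst) not in self.paths:
            self.paths[(src, dst)] = self._shortest_path(src, dst)
        return self.paths[(src, dst)]

    def mark_failed(self, router):
        self.failed.add(router)
        # Drop stored paths through the failed router so they are recomputed.
        self.paths = {k: p for k, p in self.paths.items()
                      if p is None or router not in p}


table = RouteMemoryTable(width=4, height=4)
first = table.lookup((0, 0), (3, 3))
table.mark_failed(first[1])              # the first intermediate router fails
rerouted = table.lookup((0, 0), (3, 3))  # a new path is found and stored
```

After the failure is recorded, the next lookup returns a path that excludes the failed router, and that updated path replaces the old entry in the table.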
VI. CONCLUSION
High-performance, low-power NoCs for multicore integrated circuits have become important. We carried out a detailed study of NoC blocks such as the router VC allocator, input buffer, route computation, and switch arbiter. Routers in 2*2, 2*3, 4*4, and 8*8 mesh topologies have been implemented in this paper, and the total power consumption of the routers without power gating is compared with the reduced power obtained by adding power gating. The buffer is a major power-consuming part of the router. Power optimization techniques improve the efficiency of the circuit and reduce power consumption. In addition, slotting the routers as primary and secondary suits high-end applications that need continuous power for some routers while others are turned "off"; in case of a router failure, the system selects an alternative router and excludes the failed router from the path memory list. This helps to avoid the deadlock problem and reduces latency in the interconnect.