FPGA Implementation Of Power Efficient Low Latency Viterbi Decoder

DOI : 10.17577/IJERTV2IS50069


Nirmal Bhatt, Pursuing ME (WCSN), Department of ECE, GEC, Surat, India

Prof. Milind Shah, Associate Professor, Department of ECE, GEC, Surat, India

Prof. Bhavesh Asodariya, Assistant Professor, Department of ECE, SSASIT, Surat, India

Abstract

Convolutional encoding with Viterbi decoding is a powerful method for forward error correction. It has been widely deployed in many wireless communication systems to improve the utilization of the limited capacity of the communication channels. The Viterbi algorithm is the most extensively employed decoding algorithm for convolutional codes. In this paper, we present a field-programmable gate array implementation of a power efficient, low latency Viterbi decoder with a constraint length of 7 and a code rate of 1/2. We also propose a one-pointer traceback implementation of survivor sequence memory management for a low power decoder design.

Keywords – ACS, Butterfly structure, Convolutional code, One Pointer, Register Exchange, Trace Back, Viterbi Algorithm.

  1. Introduction

    In recent years the continued rise of portable data devices such as cell phones, PDAs and laptops has driven enormous growth in the area of wireless communications. Whenever data is sent over a wireless channel, it is subject to degradation due to multipath fading and noise [3]. Depending on the amount of degradation, the effect can be a loss or corruption of the original data during transfer. In order to alleviate this problem and to ensure the reliable transfer of data, the typical solution has been the use of an error correction coding scheme.

    Such coding is also a building block for other wireless devices based on the GSM, Wi-Fi, DAB and DVB standards. The design of a convolutional code (CC) encoder is simple, but much more hardware is required to design the decoder. As the constraint length of the code increases, its error correcting capability improves, but at the cost of higher power and memory requirements [1]. So, our aim is to design an optimum CC decoder for power-efficient and delay-sensitive applications.

    In this paper we discuss a convolutional encoder with coding rate 1/2, constraint length 7 and generator polynomials G = [1111001; 1011011], and for decoding the CC code we propose a Viterbi decoder for the same parameters.

    Verilog HDL is used for coding the design. Xilinx ISE 13.4 foundation series is used for simulation. Spartan 3E XC3S500E FPGA has been used for implementing the design.

  2. Convolutional Encoder

    Convolutional codes differ significantly from block codes in structural form and have much more powerful error correcting capability [3]. Convolutional coding can be applied to a continuous input stream (which cannot be done with block codes), as well as to blocks of data. Convolutional codes are usually characterized by two parameters and by the patterns of the n modulo-2 adders (XOR gates). The two parameters are the code rate and the constraint length. The code rate, k/n, is the ratio of the number of bits into the convolutional encoder (k) to the number of channel symbols output by the convolutional encoder (n) in a given encoder cycle. The constraint length K denotes the "length" of the encoder, i.e. how many bits (the current bit plus the K-1 previous bits held in the shift register) are available to feed the n modulo-2 adders.

    In fact, a convolutional encoder can be viewed as a finite state machine having 2^(K-1) states. Its output is a function of the present state and the present input. The trellis is a time-indexed version of the state diagram: each node corresponds to a state at a given time index, and each branch corresponds to a state transition. Associated with each branch are the input symbol and output symbols corresponding to that state transition. Given a known starting state, every input sequence corresponds to a unique path through the trellis. The convolutional encoder shown in Figure 1 has coding rate 1/2, constraint length 7 and generator polynomials G = [1111001; 1011011].

    Fig. 1 Convolutional Encoder (K=7, rate-1/2)[5]
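
    As an illustration (not taken from the implemented design), a minimal Verilog sketch of this encoder is given below. The module and signal names are ours, and we assume the leftmost bit of each generator taps the current input bit while the rightmost taps the oldest bit of the shift register.

      // Sketch of a rate-1/2, K = 7 convolutional encoder with
      // G0 = 1111001 and G1 = 1011011 (leftmost bit = current input).
      // Module and signal names are illustrative, not from the actual design.
      module conv_encoder (
        input  wire clk,
        input  wire rst,      // synchronous reset clears the shift register
        input  wire data_in,  // one information bit per clock
        output wire g0,       // first coded bit of the output symbol
        output wire g1        // second coded bit of the output symbol
      );
        // six memory elements -> 2^(K-1) = 64 encoder states
        reg [5:0] sr;         // sr[5] = most recent past bit, sr[0] = oldest

        // G0 = 1111001 : taps on the current bit and delays 1, 2, 3, 6
        assign g0 = data_in ^ sr[5] ^ sr[4] ^ sr[3] ^ sr[0];
        // G1 = 1011011 : taps on the current bit and delays 2, 3, 5, 6
        assign g1 = data_in ^ sr[4] ^ sr[3] ^ sr[1] ^ sr[0];

        always @(posedge clk) begin
          if (rst) sr <= 6'b000000;
          else     sr <= {data_in, sr[5:1]};  // shift the new bit in at the MSB
        end
      endmodule

    The two coded bits g0 and g1 form one channel symbol per information bit, which gives the code rate of 1/2.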

  3. Viterbi Decoder

    A Viterbi decoder uses the Viterbi algorithm to decode a bitstream that has been encoded with forward error correction based on a convolutional code. There are other algorithms for decoding a convolutionally encoded stream (for example, the Fano algorithm). The Viterbi algorithm is the most resource-consuming, but it performs maximum-likelihood decoding [6]. Fig. 2 shows the block diagram of the Viterbi decoder. It consists of the following modules:

    Branch Metric Unit, ACS, State Metric Storage, Survivor Path Metric, Traceback and Output Decode Block.

    Fig. 2 Block diagram of Viterbi decoder[1]

    1. Branch Metric Unit

      At the decoder side we receive an analog signal, which is a mixture of the transmitted signal and noise. The received signal is fed to a 3-bit quantizer which, depending on the signal strength, assigns the values shown in Table 1.

      Fig. 3 Branch Metric Calculation [1]

      The BMU then calculates four values, as shown in Fig. 3, based on the received signal:

      BM_00 = G1 + G0
      BM_01 = G1 + (2's complement of G0)
      BM_10 = (2's complement of G1) + G0
      BM_11 = (2's complement of G1) + (2's complement of G0)

      Suppose the encoded symbol 10 (G1, G0) is transmitted, but due to the effect of noise the 1 is received as a weak 1 and the 0 as a strong 0. The weak 1 is assigned the value -1 (100) and the strong 0 the value +3 (000), as shown in Table 1.

      So, as per the above equations, BM_00 = -1 + 3 = 2, BM_01 = -1 + (-3) = -4, BM_10 = 1 + 3 = 4, and BM_11 = 1 + (-3) = -2.

      Table 1 3-Bit Soft Quantization Values

      Range   Value   Meaning
      +1      010     relatively weak 0
      +2      001     relatively strong 0
      +3      000     strongest 0
       0      011     weakest 0
      -1      100     weakest 1
      -2      101     relatively weak 1
      -3      110     relatively strong 1
      -4      111     strongest 1
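
      A possible Verilog sketch of this branch metric unit is shown below. It assumes the Table 1 mapping (soft value = 3 minus the unsigned 3-bit code, so 000 gives +3 and 111 gives -4); the module name, port names and widths are illustrative, not taken from the implemented design.

        // Sketch of the branch metric unit of Fig. 3. The 3-bit quantizer codes
        // q1, q0 are converted to the signed soft values of Table 1 and the four
        // branch metrics are formed as sums/differences of those values.
        module branch_metric (
          input  wire [2:0] q1,            // quantizer code for received bit G1
          input  wire [2:0] q0,            // quantizer code for received bit G0
          output wire signed [4:0] bm_00,  // expected code symbol 00
          output wire signed [4:0] bm_01,  // expected code symbol 01
          output wire signed [4:0] bm_10,  // expected code symbol 10
          output wire signed [4:0] bm_11   // expected code symbol 11
        );
          // assumed Table 1 mapping: soft value = 3 - code (000 -> +3 ... 111 -> -4)
          wire signed [3:0] g1 = 4'sd3 - $signed({1'b0, q1});
          wire signed [3:0] g0 = 4'sd3 - $signed({1'b0, q0});

          assign bm_00 =  g1 + g0;   // BM_00 = G1 + G0
          assign bm_01 =  g1 - g0;   // BM_01 = G1 + (2's complement of G0)
          assign bm_10 =  g0 - g1;   // BM_10 = (2's complement of G1) + G0
          assign bm_11 = -g1 - g0;   // BM_11 = (2's complement of G1) + (2's complement of G0)
        endmodule

      With the example above (q1 = 100, q0 = 000) this sketch gives bm_00 = 2, bm_01 = -4, bm_10 = 4 and bm_11 = -2, matching the hand calculation.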

    2. Add Compare Select Unit (ACS)

      ACS is also known as a butterfly module, as shown in Fig. 4. The trellis diagram for a rate 1/n convolutional encoder consists of butterfly structures. Each structure contains a pair of origin states, a pair of destination states, and four interconnecting branches.

      Fig. 4 ACS Module

      As shown in Fig. 4, S0 and S1 are the present states and S0 & S32 are the next states. With a 64-state convolutional encoder, we have 32 butterfly units. Each butterfly unit generates two survivor decision bits. The ACS unit adds the branch metric (BM) to the partial path metric to form the lower and upper state metrics. It then compares the state metrics of the two incoming states and selects the one with the maximum value. The branch metric to be added to the path metric for state S0 (S1) is the Hamming distance between the expected code symbol 00 (11) and the received input: the received input is XORed with the expected code symbol. The branch metric is added to the partial path metric to calculate the new path metric. The two path metrics, the upper one and the lower one, are compared to select the survivor path, and the resulting metric of the selected path updates the path metric of state S0 (S1). The lower wing is identical to the upper wing except that the expected values differ.

      If the new state metric is updated by the upper state metric, 0 is stored as the decision bit; if it is updated by the lower state metric, 1 is stored. For example, if S0 is greater than S1, S0 is selected as the new state metric and decision bit 0 is stored; otherwise S1 is selected and decision bit 1 is stored.
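
      The following Verilog sketch shows one such add-compare-select operation for a single destination state. The parameter widths, port names and module name are assumptions for illustration, not the ones used in the implemented design.

        // Sketch of one half of an ACS butterfly (Fig. 4): two adders, one
        // comparator and a selector produce the new path metric and the
        // decision bit that is passed to the survivor memory unit.
        module acs_unit #(
          parameter PM_W = 8,   // path metric width (assumed)
          parameter BM_W = 5    // branch metric width (assumed)
        ) (
          input  wire signed [PM_W-1:0] pm_upper,  // path metric of the upper origin state
          input  wire signed [PM_W-1:0] pm_lower,  // path metric of the lower origin state
          input  wire signed [BM_W-1:0] bm_upper,  // branch metric on the upper branch
          input  wire signed [BM_W-1:0] bm_lower,  // branch metric on the lower branch
          output wire signed [PM_W-1:0] pm_new,    // survivor (maximum) path metric
          output wire                   decision   // 0: upper path survives, 1: lower path survives
        );
          wire signed [PM_W-1:0] sum_upper = pm_upper + bm_upper;
          wire signed [PM_W-1:0] sum_lower = pm_lower + bm_lower;

          assign decision = (sum_lower > sum_upper);           // 1 selects the lower path
          assign pm_new   = decision ? sum_lower : sum_upper;  // keep the larger metric
        endmodule

      In the complete decoder, 32 such butterflies (two of these halves each) operate in parallel every trellis stage, and their 64 decision bits are written to the survivor memory unit.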

    3. Path Metric Calculation and Storage

      The path metrics are updated in two steps. First, for each of the two incoming paths, the corresponding branch metric is added to the path metric of the original state. The two sums are then compared; the larger one is stored as the new path metric, and the corresponding path is stored as the best path.

    4. Register Exchange Method for SMU

      The register exchange (RE) method is conceptually the simplest and a commonly used technique. Because of the large power consumption and large area required in VLSI implementations of the RE method [9], the traceback (TB) method is preferred in the design of large constraint length, high performance Viterbi decoders [8]. In the register exchange method, a register assigned to each state contains the information bits of the survivor path from the initial state to the current state [2]. In effect, the register keeps the partially decoded output sequence along the path, as illustrated in Figure 5. The register of state E1 at t = 3 contains '101'; this is the decoded output sequence along the surviving path from the initial state [2].


      Fig. 5 Register Exchange Method for SMU[2]

      The register-exchange method eliminates the need to trace back since the register of the final state contains the decoded output sequence.

      However, this method results in complex hardware because the contents of all the registers in a stage must be copied to the next stage. The survivor path information is applied to the least significant bit of each register, and all the registers perform a left shift at each stage to make room for the next bit. Hence, each register fills in the survivor path information from the least significant bit toward the most significant bit. This scheme is called shift update. The shift update method is simple to implement but causes high switching activity due to the shift operation and, hence, results in high power dissipation [2].
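
      A behavioural Verilog sketch of this shift-update register exchange is given below. The trellis connectivity and bit conventions are assumptions consistent with the butterfly description above (the two predecessors of state s are taken as 2*(s mod 32) and 2*(s mod 32)+1, decision bit 0 selecting the even one, and the information bit released on entering s being the MSB of s); the names, the survivor depth D and the readout from state 0 are illustrative.

        // Sketch of a shift-update register exchange SMU for 64 states.
        // One decision bit per state arrives from the ACS array each stage;
        // every state copies the survivor register of its chosen predecessor
        // and shifts in the newly released information bit.
        module reg_exchange_smu #(
          parameter NS = 64,   // number of states (2^(K-1))
          parameter D  = 64    // survivor register length (assumed)
        ) (
          input  wire          clk,
          input  wire          rst,
          input  wire [NS-1:0] decision,     // one decision bit per state, from the ACS array
          output wire          decoded_bit   // oldest bit of the register of state 0
        );
          reg [D-1:0] path [0:NS-1];   // one survivor register per state
          reg [D-1:0] pred;            // temporary: register of the chosen predecessor
          integer s;

          always @(posedge clk) begin
            if (rst) begin
              for (s = 0; s < NS; s = s + 1) path[s] <= {D{1'b0}};
            end else begin
              for (s = 0; s < NS; s = s + 1) begin
                // decision 0 selects the even (upper) predecessor, 1 the odd (lower) one
                pred = (decision[s] == 1'b0) ? path[2*(s % (NS/2))]
                                             : path[2*(s % (NS/2)) + 1];
                // shift update: copy the predecessor register and append the
                // information bit released by this transition (the MSB of s)
                path[s] <= {pred[D-2:0], (s >= NS/2)};
              end
            end
          end

          // after D trellis stages the oldest bit of (for example) state 0's
          // register is taken as the decoded output
          wire [D-1:0] path0 = path[0];
          assign decoded_bit = path0[D-1];
        endmodule

      Because all 64 registers are rewritten every trellis stage, the switching activity (and hence the power) of this shift-update scheme is high, which motivates the traceback approach described next.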

    5. Traceback Unit

      Traceback begins after the metric update for the last symbol in the frame is complete. For frame-by-frame Viterbi decoding, all that the traceback algorithm needs is the state transitions, and the starting state is state 00. For sliding-window decoding, the starting state is the state with the largest state metric.

      Two basic operations are performed in the traceback SMU (Survivor Memory Unit) [7]. These operations are explained below:

      Writing New Data (WR): The survivor vector generated by the ACS unit needs to be written into the memory. Each individual survivor is written into a memory position that corresponds to its state. The write pointer moves forward (from left to right in Figure 6) through the memory as the ACS unit moves on to each new stage in the code trellis.

      Decode Read (DR): Once D (or more) survivor vectors have been written to memory, the SMU needs to start tracing a path back towards the origin of the trellis in order to decode the output bits. Functionally, the decode read operation is identical to the traceback read operation, except that the pointers read from memory are decoded to form the output of the SMU. Here, we omit the traceback read operation because we know the starting point (state 0) of the traceback, so we decode data directly. The decode read starting point is the end of the frame (state 0). This state has a decision bit, either 0 or 1, in the SMU. This decision bit is extracted and stored into a FILO buffer. Depending on the decision bit we can find the previous state and its decision bit, and the process continues until the start of the frame. The decode read operation frees up memory space for use by the write operation; since the memory size in hardware is limited, the write operation must fill the space freed by the decode read operation.
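
      One step of this decode read can be sketched in Verilog as follows. The state numbering convention is the same assumption as in the earlier sketches (the decision bit is the least significant bit of the predecessor state), and it is this decision bit that is pushed into the FILO buffer.

        // Sketch of a single decode-read step: from the current traceback state
        // and its stored decision bit, derive the previous state and the bit
        // that is pushed into the FILO buffer.
        module decode_read_step (
          input  wire [5:0] state,        // current state in the traceback
          input  wire       decision,     // decision bit stored for this state and stage
          output wire       decoded_bit,  // bit pushed into the FILO buffer
          output wire [5:0] prev_state    // state one trellis stage earlier
        );
          // the decision bit selects which of the two predecessors survived;
          // under the assumed numbering it is also the LSB of that predecessor
          assign prev_state  = {state[4:0], decision};
          assign decoded_bit = decision;
        endmodule

      Iterating this step from state 0 at the end of the frame back to its start, and pushing each decision bit into the FILO buffer, reproduces the decode read operation described above.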

      Several methods have been devised for traceback [7]. One well-known method, used when the termination state is unknown, is to start the traceback from the state that has the best path metric. This method has the fastest convergence rate, requires the shortest traceback memory and has the lowest potential latency. However, finding the best state requires 2^(K-1) path metric comparisons, which would increase the area significantly. The other methods, the k-pointer even and one-pointer algorithms, eliminate the need for a comparator since traceback can be started from any arbitrary state. However, at least D steps of traceback through the survivor memory are needed before decoded bits are produced, where D is called the traceback depth (or traceback length). The k-pointer even algorithm [4] requires 2k memory banks, each of size D / (k-1) columns, and uses k read pointers and 1 write pointer to memory.

      Fig. 6 Traceback operation in survivor memory

      The one-pointer algorithm [4] differs significantly from the k-pointer even algorithm. Instead of utilizing k read pointers to perform the required k reads for every column write operation, a single read pointer is used, but the read operations are accelerated so that every time the write counter advances by one column, k column reads occur. This acceleration of the read operations is made possible by the fact that, among the three operations (writing new data, traceback read and decode read), writing new data is by far the most time consuming. This observation is particularly important as 2^(K-1) bits are written every trellis-stage time, as opposed to only k bits being read.

      3.5.1 One Pointer Algorithm

      In the one-pointer algorithm, 2 memory banks, each D columns long, are required, for a total of 2D columns, as shown in Figure 7. The write and decode read operations both start at time 0, with the decode pointer running at twice the speed of the write pointer. By time D/2, the write pointer has written D/2 columns into memory bank 1 while the decode read pointer has read out all D columns of memory bank 2, which is then free to be written. Between time D/2 and D the decode read pointer goes into sleep mode while the write pointer finishes writing all D columns into memory bank 1. At time D, memory bank 1 has been filled and is ready to be decoded, so the write pointer starts writing into memory bank 2, the decode read pointer starts reading memory bank 1, and the above process repeats. The total latency is 3D/2. After this time the output is stored in a 64-bit FILO register and is read out continuously.

      Fig. 7 One Pointer Algorithm
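
      The pointer schedule of Fig. 7 can be sketched behaviourally as follows. Only the write and decode pointers are modelled; the survivor memory banks, the decode-read datapath and the FILO buffer are omitted, and the parameter and signal names are ours, not those of the implemented design.

        // Sketch of the one-pointer traceback schedule: one write pointer that
        // advances one column per trellis stage and one decode pointer that
        // advances two columns per stage, ping-ponging between two banks of D
        // columns and sleeping once its bank has been read out.
        module one_pointer_ctrl #(
          parameter D  = 64,   // traceback depth (assumed)
          parameter AW = 6     // address width, log2(D)
        ) (
          input  wire           clk,
          input  wire           rst,
          output reg            wr_bank,   // bank currently being written
          output reg  [AW-1:0]  wr_col,    // write column within that bank
          output reg            dc_bank,   // bank currently being decoded
          output reg  [AW-1:0]  dc_col,    // decode-read column (two per stage)
          output reg            dc_active  // low during the sleep phase
        );
          always @(posedge clk) begin
            if (rst) begin
              wr_bank <= 1'b0;  wr_col <= 0;
              dc_bank <= 1'b1;  dc_col <= D - 1;
              dc_active <= 1'b1;
            end else if (wr_col == D - 1) begin
              // write bank full: swap banks and restart the decode read
              wr_col  <= 0;
              wr_bank <= ~wr_bank;
              dc_bank <= wr_bank;      // decode the bank that was just filled
              dc_col  <= D - 1;        // trace back from its most recent column
              dc_active <= 1'b1;
            end else begin
              wr_col <= wr_col + 1;
              if (dc_active) begin
                if (dc_col <= 1) dc_active <= 1'b0;  // bank read out: sleep
                else             dc_col    <= dc_col - 2;
              end
            end
          end
        endmodule

      Only the schedule is modelled here; at start-up the decode pointer sweeps a bank that has not yet been written, so useful output only appears after the 3D/2 latency quoted above.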

  4. FPGA Implementation Results

    Verilog is used for coding the designs and the FPGA flow is used for implementation. The design is targeted to a Spartan 3E FPGA (90 nm process technology, 1.5 V power supply, fg320 device package) using the Xilinx ISE 13.4 foundation series. ChipScope Pro is used to observe the results from the FPGA kit, and XPower Estimator 11.1 is used to estimate the power of the design.

    Fig. 8 Encoder i/p and Viterbi Decoder o/p waveform (latency = 4.44 µs @ 50 MHz)

    Fig. 9 Power by Function

    Fig. 10 Power by Voltage

    Fig. 11 Power by Variance

    Table 2 Power Summary of Viterbi Decoder @ 50 MHz

    Optimization     None
    Data             Production
    Quiescent (W)    0.078
    Dynamic (W)      0.030
    Total (W)        0.108

    Table 3 FPGA Performance Comparison of Viterbi Decoders

    Parameter        Implemented Design     Ref. [11] (TB2F)    Ref. [11] (Norm1)    Ref. [11] (Norm2)
    FPGA Device      Spartan 3E             Virtex-II Pro       Virtex-II Pro        Virtex-II Pro
    Decision         Soft                   Soft                Soft                 Soft
    Area (Slices)    3731                   3478                2136                 2156
    BRAMs            2                      2                   2                    2
    Max. Freq.       97 MHz                 159 MHz             158 MHz              196 MHz
    Max. Data Rate   48.5 Mbps              79.5 Mbps           79 Mbps              98 Mbps
    Min. Latency     2.29 µs @ 97 MHz,      2.22 µs @ 100 MHz   2.468 µs @ 158 MHz   -
                     4.44 µs @ 50 MHz
    Total Power      108 mW @ 50 MHz,       272 mW @ 100 MHz    1068 mW @ 158 MHz    1081 mW @ 196 MHz
                     129 mW @ 97 MHz

  5. Conclusion

    The main design considerations are to decrease power dissipation and latency. The two techniques for recovering the survivor path of a convolutional code are the register exchange method and the traceback method. The traceback method suits larger constraint lengths; it takes more time than register exchange, but its power consumption is much lower, so we designed the Viterbi decoder with the traceback method. To decrease the power further, we made an architectural-level change in the ACS module: a conventional ACS module consumes more power, while the butterfly-structured ACS consumes less. The one-pointer traceback algorithm is not only useful in saving power; by running the read pointer at twice the speed of the write pointer, it also reduces the latency.

  6. Acknowledgment

    We take this opportunity to express our deepest gratitude and appreciation to all those who have helped us directly or indirectly towards the successful completion of this paper.

  7. References

  1. W. Chen, "RTL Implementation of Viterbi Decoder", Master's thesis, Linköping University, pp. 6-25, 2006.

  2. S. Ranpura and D. S. Ha, "Low-Power Viterbi Decoder Design for Wireless Communications Applications", Int. ASIC Conference, Washington, D.C., Sept. 1999.

  3. B. Sklar, Digital Communications: Fundamentals and Applications, 2nd ed., Prentice-Hall, 2002.

  4. G. Feygin and P. G. Gulak, "Survivor Sequence Memory Management in Viterbi Decoders", CSRI Tech. Rep. 262, University of Toronto, Jan. 1991.

  5. Z. M. Patel, "VLSI Implementation of IEEE 802.11a Physical Layer Baseband", M.Tech dissertation, IIT Bombay, Powai, 2009.

  6. A. J. Viterbi, "Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm", IEEE Transactions on Information Theory, vol. 13, no. 2, pp. 260-269, April 1967.

  7. M. Horwitz and R. Braun, "A Generalized Design Technique for Traceback Survivor Memory Management in Viterbi Decoders", Proc. Symposium on Communication and Signal Processing, Grahamstown, South Africa, pp. 63-68, 1997.

  8. R. C. S. Morling and N. Haron, "Novel Low-Latency Low-Power Viterbi Decoder Traceback Memory Architecture", IEEE MELECON, Dubrovnik, Croatia, May 2004.

  9. S.-H. Seo and S.-C. Park, "Low Latency and Power Efficient Viterbi Decoder Using Register Exchange State-Mapping Algorithm", Proceedings of the 9th International Database Engineering & Applications Symposium (IDEAS'05), 2005.

  10. S. W. Shaker, S. H. Elramly and K. A. Shehata, "FPGA Implementation of a Viterbi Decoder for WiMAX Receiver", International Conference on Microelectronics (ICM), Marrakech, 2010.

  11. F. Angarita and M. J. Canet, "Architectures for the Implementation of an OFDM-WLAN Viterbi Decoder", Journal of Signal Processing Systems, vol. 52, pp. 35-44, Springer, 2008.
