 Open Access
 Total Downloads : 129
 Authors : Parshvi Shah, Sapna Prabhu
 Paper ID : IJERTV3IS080701
 Volume & Issue : Volume 03, Issue 08 (August 2014)
 Published (First Online): 30082014
 ISSN (Online) : 22780181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Hybrid LearningBased Branch Predictor
Parshvi Z. Shah
Department of Electronics Engineering Fr.C.R.C.E
Mumbai, India
Sapna U. Prabhu
Department of Electronics Engineering Fr.C.R.C.E
Mumbai, India
Abstract Accurate branch prediction mechanisms contributes to the optimization of processor performance. As the number of pipelines stages in modern processors are steadily increasing, the importance of a good branch prediction logic is rising.
Traditional branch prediction relies on history of branch for prediction. Several researchers have worked on branch prediction accuracy. Modern branch predictors rely heavily on longer global history to produce accurate branch prediction. The entire length of the program is however, not related with recently executed branches. For these parts of the program, the extra information encoded in the global history does more harm than good. This limitation gave us motivation for looking beyond using larger history length. In the Hybrid branch predictor, branch prediction accuracy can be improved by basing prediction on the outcome of neighboring branches (global behavior) and outcome of the branch itself (local behavior).
This paper deals with the implementation of the Hybrid branch predictor with the help of neural networks, which provides higher predictive capabilities than commonly used global branch predictors. The hardware resources required for Learning based predictor scale linearly with the history length, in contrast with other purely dynamic schemes that need exponential memory that permits our predictor to consider longer branch histories.
KeywordsBranch Predictor, Alloyed predictor, Global History, Perceptron , Path History.

INTRODUCTION
The techniques of instruction level parallelism and pipelining have been used well to speed up the execution of instructions. High performance and highaccuracy designs call for deeper pipelines, which lead to costly misprediction penalties. The conditional branches are the critical factor to the effectiveness of a deep pipeline. The question of how to predict the direction of conditional branches has received intense study in recent years. Branch prediction has been a very important method to improve processor performance which tries to solve control dependency issues exposed in program instructions and improve instruction level parallelism, thus enabling deeper processor pipeline for better performance results.
Branch prediction techniques can be classified into two groups: static or dynamic. Static Branch Prediction are simple compiletime schemes in which predictions are Static. They are fixed for each branch during the entire execution, and the predictions are compiling time guesses. In Dynamic Branch Prediction, the hardware influences the prediction
while execution proceeds. Prediction is decided on the computation history of the program.
Many researchers have been focusing on branch direction algorithms and recent improvement on the predictive accuracy in literatures is achieved by increasing the complexity of algorithms.
Traditional branch predictor keeps a separate history buffer for each conditional branch. As correlation between the branches are important, global history buffer have been used. It keeps a shared history of all conditional branches. The advantage of global history is that any correlation between different conditional branches is part of making a prediction. Generally speaking, dynamic branch prediction gives better results than static one at the cost of increased hardware complexity, and has caught a lot of attention in the past few years. There are many proposed algorithms for dynamic branch predictors, such as bimodal, twolevel and hybrid which are extensively implemented in to predict the branch directions. Bimodal prediction uses a table of 2bit saturating counters indexed by branch addresses to predict the most common direction. Twolevel prediction employs two levels of information such as branch history records and pattern history table for the direction prediction. Hybrid prediction takes advantage of combined strategies for the same purpose.
As the length of the application program is steadily increasing, longer global history length needs to be used to provide better prediction accuracy. Several researchers have been reported that larger history length branch prediction gives higher prediction accuracy.
A neural branch prediction which was first proposed in 1999 and is very promising now a days uses more complex algorithms to weight for the possible paths in order to select a definite way for branch direction. Neural branch predictor using perceptron, one of the simplest possible neural networks. Perceptrons are easy to understand, simple to implement, and have several attractive properties that differentiate them from more complex neural networks. Perceptron based branch predictor has shown promise in attaining high prediction accuracy due to linear resource growth compare to the exponential resource growth of traditional predictor.
Larger history length branch predictor using neural network, such as FPB,PLB,TAGE and OGEHL gives better
prediction accuracy than traditional counter parts such as G Share.[1][2][3][4][5].
The objective of the branch predictor is to reduce the probability of making an incorrect decision, to avoid fetching instructions that eventually ought to be discarded. Branch predictors plays an important role in achieving highly effective performance in several modern pipelined microprocessors.

RELATED WORK
Several researchers have worked on improving on branch prediction accuracy.[5][1][9][3][4][7][10][11].
Yeh and Patt [1] introduced the global history register (GHR) in their two level predictor; to keep outcomes of the previously executed branch.GHR provides the correlation of current branch with previously executed branches. The advantages of GHR further used in Gshare/G select predictor. Traditional predictors were using 1015 bits for GHR which provides short distance correlations. Generally, longer branch history enables branch correlation with more distant branches, so recent predictors tend to adopt long history at the cost of complicated hardware.
To boost the accuracy of branch predictor several researchers have worked on combining global history and local history of branch Zhi et al. introduced alloyed predictor, a new hardwarebased twolevel branch predictor that concatenate global and local history with some bits of program counter [9] .
Jimenez et al. [3, 4] introduced neural based perceptron predictor, which uses global history to train a neural network. A perceptron of weights which depends on the length of the global history register. When a branch arrives, predictor will input the has a number branch address to the hash function to generate the index to the perceptron weight table to pick a perceptron. The weights of the selected perceptron are then used with the global branch histories to compute the dot product and accumulate results .The branch prediction is determined by the value of summation, if the value is positive, then the branch is predicted Taken, otherwise Not Taken,. Finally the final branch outcomes will update the perceptron weights. If the branch outcome agrees with the prediction, then weights will be incremented, otherwise decremented accordingly.
Path history should provide better correlation information than pattern history, because path history is a superset of pattern history. Path information includes the branches by which the current branch ws reached, not just the pattern of directions that they went to reach the current branch. Jimenez et al [5] introduced Fast PathBased (FPB) branch predictor is an improved neural perceptron predictor. this predictor improves accuracy by combining path and pattern history to overcome limitations inherent to previous neural based predictors. It provides lower latency than previous neural predictors.
Seznec [7] introduced OGEHL predictor using geometric series principle. up to get the prediction result positive to predict Taken and vice versa. Seznec [8] introduced TAGE predictor combines a traditional predictor such as gshare table, T0, which is default predictor, with some tables of variable history length for branch forecast. Each table has independent hash function using different historical length; the tag field of each selected table entry will be matched against the tag field of incoming hashed index. If there is a match in a higher table, the resulting pred field will be selected and override results from lower tables .If there is no match among higher tables, the prediction from the default predictor will be used.
Daniel [10] introduced scaling based neural branch predictor based on the principle of inverse linearity curve because each branch doesnt contribute equally; unsurprisingly, more recent weights tends to have a stronger correlation with branch outcomes. Each weight of the perceptron table scale with correlation coefficients which can determine from the inverse linear curve. The correlation coefficients were generated using the publicly distributed traces for the CBP competition. By multiplying weights with coefficients proportional to their correlation, the predictor achieves higher accuracy.
Daniel [11] introduced optimized version of scaling based neural network ,hybridized with two simple twolevel adaptive predictors .The predictor achieves higher accuracy by using some tricks such as value of threshold and coefficient vectors are selected empirically during run time of program.

PROPOSED ALGORITHM
Learning based branch predictor can be implemented using CBP framework. The following variables are used by the algorithm:
W: A twodimensional variable array of integers weights for global prediction.7bits are used for weights saturate at +63 and – 64.
h: The global history length. This is an integer, variable in the implementation.
Lh: The local history length. This is an integer, 17 in the implementation.
LW: A twodimensional array of integers weights for local prediction.7bits are used for weights saturate at +63 and – 64.This array has 96 entries.
A: An array of addresses. As branches are executed, their addresses are shifted in to the first position of this array. In the implementation, the elements of the array are the lower
9 bits of the branch address.
C: An array of scaling coefficient. The Coefficients are chosen as c[i]=1/(a+b*i).where a=0.1111 and b=0.037 [10]. sum_global: An integer. This integer is the dot product of a weight vector chosen dynamically from W and the global history register.
sum_local: An integer. This integer is the dot product of a weight vector chosen dynamically from LW and the local history register.
Sum: An integer. This integer is the sum of sum_pos, sum_local and bias weight.
The neural prediction algorithm presented below achieves higher accuracies than previously proposed neural algorithms. Predictors are divided in to two parts one is for global prediction and other is for perbranch prediction i.e. Local prediction. The higher accuracies result from:

Accessing the weights using a function of the PC and the path for global prediction and function of PC for local prediction.

Breaking the weights into a number of independently accessible tables for global prediction.

Scaling the weights by the coefficients as previously described.

Taking the dot product of a global branch history vector and the scaled weights to get the value of sum_global.

Taking the dot product of local branch history which can access using a function of PC and local weights also can access using a function of PC to get value of sum_local.

Finally add the result of local predictor i.e. sum_local, global predictor i.e. sum_global with bias weight to get final result sum.

Once the actual outcomes of the branch become known, the training algorithm uses this outcome and the value of sum to update the weights in local and global weights table. Training algorithm is same as basic perceptron training algorithm with
Threshold value = [1.93h + 14],[7] where h is history length.

Global History based prediction
Fig 1. shows a block diagram of the global prediction algorithm. The two key parameters of the predictor are h, the global history length, with which the dot product is computed, and R the number of rows in each weights table. Other inputs to the global predictor are A, a vector of the loworder 9 bit of each of the past h branch addresses (A is effectively a path vector).
The two components of the dotproduct computation are the history vector and the weights vector. The history vector consists of h, which provides history bits of the most recent h branches. The second component of the dotproduct computation, the weights vector, is obtained by reading eight weights from each of columns. The first 7 columns, containing the weights for the most recent history bits, have
512 entries than other columns because the most recent weights are the most important. Next six columns have 256 entries. Rest of the columns has 128 entries.
For e.g. in the design if h=128 and r = 512, 256, or
128. The columns are partitioned into 16, rather than just one large indexed row of 128 weights (num_col=history length/block size), because the separation reduces aliasing and achieves higher accuracy with low additional complexity. To index each table, an eightbit fraction of the A vector is XORed with the loworder eight bits of the branch PC, resulting in an eightbit index for one of the rows. In the first 7 columns with 512 entries, an extra bit of the branch PC is XORed with a second address bit from the most recent branch to produce a ninebit index. The bias weight table is indexed with 11 lowerorder bits from the branch PC.

Perbranch history prediction
Fig 2.Shows a block diagram of the local branch prediction. To improve the accuracy, combination of global and per branch history is used rather than just global history as outlined in the algorithm above.
Fig. 2. Local branch prediction
Fig. 1.Global prediction Algorithm
A table of perbranch histories is kept and indexed by branch address modulo numbers of histories. Weights for local
Perceptrons are kept separately from global weights. These histories are incorporated into the computations for the prediction and training in the same way as the global histories.

Predictor Update
Updating the predictor consists of three phases, some of which can occur in parallel.

Updating histories:
When the outcome of a branch becomes known, it is shifted into H. The lowestorder bit of the branch address is shifted into A.

Training the predictor:
If the prediction was incorrect, or if the magnitude of the predictor output was under a set threshold ( = [1.93h + 14]), where h is history length, then the predictor invokes its training algorithm. As in previous neural predictors, the weights responsible for the output are incremented if the corresponding history outcome matches the current branch Outcome, and decremented otherwise. The weights use saturating arithmetic.

Updating the training threshold:


An adaptive threshold training algorithm is used to dynamically adjust the threshold at which training will be invoked for a correct prediction.
<>This algorithm is the same as the one used for O GEHL [7].The threshold is increased after a certain number of incorrect predictions, and decreased after a certain number of correct predictions whose outputs were not as large as the current threshold. Seznec observed that good accuracy is achieved when the training algorithm is invoked equally many times after correct and incorrect predictions [9]; this threshold training strategy strives to achieve that balance. This training requires single saturated Threshold counter.(TC).
if ((P!=Out) {TC= TC + 1; if ( is saturated positive)
{=+1; TC=0 ;}}
if ((P == Out) & (s<)) { TC= TC – 1; if ( is saturated negative)
{ =1; TC=0 ;}}
(MPKI) where,
MPKI= (mispredicted branches/number of instructions)*1000

Analysis of Results
Fig. 3 and Table I. shows the average MPKI for all four combinations, with three different history lengths.
TABLE I. COMPARATIVE ANALYSIS
HISTORY
LENGTH= 128
HISTORY
LENGTH= 256
HISTORY
LENGTH= 512
With All
4.93
4.79
4.77
Without
Scaling And With Path
5.14
5.03
5
Without Path And With
Scaling
5.37
5.26
5.26
Without Path
And With Scaling
5.72
5.49
5.44


RESULTS
This section presents the experimental results of the Hybrid learningbased predictor. The experimental data presented in this report were collected using CBP Traces. The Traces are compiled to execute on the Framework model. The framework models a simple outoforder core with 256 entry reorder buffer, 3 schedulers, 4 14 stage wide pipeline and 2 level cache. The trace set includes 40 traces, classified into 5 categories: CLIENT, INT (Integer), MM (Multimedia), SERVER and WS (Workstation).
Experiments have been conducted for the following different combinations:

without path and without scaling,

without path and with scaling

without scaling and with path

with all
With variation of global history length (128,256 and 512 bits) by running 40 traces of 5 different classes. All results are compared in term of Miss Prediction per kilo instructions
Fig. 3. Comparative Analysis

The performance of without path and without scaling with history length 256 is better than with history length 128 and gives improvement of approximately 4.02%.simillarly with history length 512 gives improvement of 0.9% over with history length 256 and 4.89% improvement over history length 128.

The performance of without path and with scaling with history length 256 is better than with history length 128 and gives improvement of approximately 2.04%.simillarly with history length 512 doesnt gives any improvement over with history length 256 and 2.04% improvement over history length 128.

The performance of without scaling and with path with history length 256 is better than with history length 128 and gives improvement of approximately 2.72%.simillarly with history
Source of bits
Quantity of bits for HL=128
Quantity of bits for HL=256
Quantity of bits for HL=512
Local history
384×17=6,528
384×17=6,528
384×17=6,528
Local
weights
7x96x17=11424
7x96x17=11424
7x96x17=11424
Global
weights:
1st 7 columns
7x(8x7x512)=2,
00,704
7x(8x7x512)=2,0
0,704
7x(8x7x512)=2,0
0,704
Next 6
columns
6x(8x7x256)=86
,016
6x(8x7x256)=86,
016
6x(8x7x256)=86,
016
Next 3
columns
3x(8x7x128)=21
504
19x(8x7x128)=1,
36,192
51x(8x7x128)=3,
65,568
Global
history
128
128
128
path history
9×128=1152
9×256=2304
9×512=4608
Threshold
counter
12
12
12
Total No. Of bits
3,27,468 bits
4,43,436 bits
6,75,372 bits
length 512 gives improvement of 0.59% over with history length 256 and 2.72 % improvement over history length 128.

The performance of with all with history length 256 is better than with history length 128 and offers improvement of approximately 2.83%.simillarly with history length 512 provides improvement of 0.41% over with history length 256 and 3.24 % improvement over history length 128.

Figure 3. Shows final output with four combinations and three different history lengths. It shows that combination with all gives better improvement irrespective of the history lengths.


Size of the predictor
The number of bits going to be used by the predictor has importance since there is always a tradeoff between hardware implementation and accuracy of the predictor.
TABLE II. COMPARATIVE ANALYSIS OF NO. OF BITS USING FOR PREDICTOR
.
No. of bits
MPKI for WITH
ALL
HL=128
335872
4.93
HL=256
450560
4.79
HL=512
671744
4.77
TABLE III. COMPARATIVE ANALYSIS OF NO. OF BITS AND MPKI
MPKI
% of improvement
GShare
7.75
0.00%
HL=128
4.93
36.39%
HL=256
4.79
38.19%
HL=512
4.77
38.45%
Table II. shows no of bits used by predictor for three different history lengths. With history length of 128 predictor uses approximately 41KB (kilobits).with history length of
256 predictor uses approximately 55KB and with history length of 512 predictor uses approximately 82KB.
Table III. shows the relation of number of bits with MPKI. Total MPKI can reduce by 2.83% using large history length of 256 with approximately 14 KB increasing in Number of bits. Similarly MPKI can reduce by 0.41% using larger history length of 512 with approximately 27KB increasing in number of bits.
Using a larger history length of 512 with increasing bits does not give much benefit. There for its always trade of between to large hardware and higher accuracy.

Comparision with Gshare predictor
GShare predictor is the traditional Global branch predictor. For the purpose of analysis, global history length 18 is used for Gshare. The approximate size of the Gshare predictor is 64KB.
TABLE IV. ANALYSIS OF MPKI AND PERCENTAGE OF IMPROVEMENT WITH GSHARE
Fig. 4. Analysis of MPKI with GShare
Fig. 5. Analysis of percentage of improvement with GShare
Table IV. And Fig. 4 and Fig.5 shows comparisons of Hybrid learning based predictor with traditional GShare Predictor. It shows that implemented predictor givesimprovement of 36.39% when history length is 128, 38.195% when history length is 256 and 38.45% when history length is 512.
Results show that hybrid learning based predictor implemented achieves better performance than the Gshare predictor.


CONCLUSIONS
Branch predictor continues to evolve and improving branc h prediction a c c u r a c y is still an open problem. The proposed branch predictor uses Neural learning techniquesthe perceptronas basic mechanism with the variation of global history length. It includes the features like scaling of global history length to reduce noise due to larger history length and path history to provide the correlation information amongst the branches. The proposed branch predictor ensures increase in prediction accuracy at the cost of increased in hardware.
The future scope includes investigations of other techniques liked mixed signal branch predictors which can be used to reduce mispredictions in a powerefficient manner.
REFERENCES

Y. Yeh and Y. N. Patt, Twolevel adaptive training branch prediction, in proc. 24th Annual IEEE/ACM International Symposium, 1991, pp. 5161

E.R.E.Kessler and D.A.Webb,The alpha 21264 microprocessor architecture, in Proc.International Conference on Computer Design, 1998, pp. 9095

D. A. J. ÃŒ. nez and C. Lin, Dynamic branch prediction with perceptrons, in Proc.7th International Symposium on High Performance Computer Architecture, January 2001, pp. 197206.

, Neural methods for dynamic branch p red ic ti on , in Proc.ACM Transactions on Computer Systems, November 2002.

D. A. J. ÃŒ. nez, Fast pathbased neural branch prediction, in Proc.36th International Symposium on Micro architecture, 2003 .

, Piecewise linear branch prediction, in Proc. 32nd Annual International C o n ference on Computer Architecture, June 2005, pp. 382393.

A. ÃŒ. Seznec, Analysis of the ogeometric history length branch predictor, in Proc 32nd Annual International Conference on Computer Architecture, June 2005, pp.394 405.

, The ltage predictor, J o u r n a l of Instruction Level Parallelism, April 2007.

M. R. S. K. S. Zhijian Lu, John Lach, Alloyed branch history: Combining global and local branch history for robust performance, International Journal of Parallel Programming, vol. 31, 2003