Implementation of 32-Bit Interlock Collapsing Alu Multipliers in VLSI using VHDL

Pratibha Pandey; Akanksha Awasthi

doi:10.17577/IJERTV7IS050035

Volume 07, Issue 05 (May 2018)


Open Access
Authors : Pratibha Pandey , Akanksha Awasthi
Paper ID : IJERTV7IS050035
Volume & Issue : Volume 07, Issue 05 (May 2018)
Published (First Online): 08-05-2018
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License


Pratibha Pandey

M. Tech Student

Department of Electronics & Communication Engineering Dr. C.V. Raman University Kota, Bilaspur

C.G., India

Akanksha Awasthi

Assistant Professor

Department of Electronics & Communication Engineering Dr. C.V. Raman University Kota, Bilaspur

C.G., India

Abstract – An important area in computer architecture is parallel processing. Machines employing parallel processing are called parallel machines. A parallel machine executes multiple instructions in one cycle. However, parallel machines have a limitation: they cannot execute interlocked instructions concurrently. Such instructions are executed serially, as on any serial machine, so it takes more than one cycle to execute multiple instructions, which degrades performance. In addition, the hardware is underutilized as a result of serial execution in a parallel machine.

The solution requires a special kind of device called the Interlock Collapsing ALU (ICALU). The Interlock Collapsing ALU, unlike conventional 2-1 ALUs, is a 3-1 ALU. The proposed device executes the interlocked instructions in a single instruction cycle, unlike other parallel machines, resulting in high performance. The resulting implementation demonstrates that the proposed 3-1 Interlock Collapsing ALU can be designed to outperform existing schemes for ICALU by a factor of at least two.

The proposed ALU, in addition to collapsing these interlocks, should also be implemented in the same number of stages as conventional ALUs. A functional model of the ICALU is assumed initially. The functional model is optimized by optimizing the model's individual blocks. The design and optimization of each block are discussed in separate chapters.

Finally, two parallel machines, with and without the ICALU, are compared with regard to their execution times. The effect of varying the percentage of interlocks in a given code on the execution times and the percentage speed ratio of the parallel machines is studied.

The ICALU is implemented in VHDL. Its functionality is verified through simulation.


Keywords: ALU, Interlock collapsing, ICALU, parallel processing, computer architecture, parallel machines.

1. INTRODUCTION: BACKGROUND:

Interlocked instructions, that is, instructions with dependencies, cannot be executed concurrently in a parallel machine, which degrades the performance of the machine. The thesis investigates a solution, called interlock collapsing, to execute these interlocked instructions concurrently. The solution requires a special kind of device called the Interlock collapsing ALU. The Interlock collapsing ALU, unlike conventional 2-1 ALUs, is a 3-1 ALU.
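The idea of collapsing an interlock can be illustrated with a small behavioural sketch. It is written in Python rather than VHDL, models only addition, and every function and variable name in it is hypothetical; it is not the authors' implementation. It assumes the interlocked pair ADD R1, R2 followed by ADD R1, R3.

```python
MASK32 = 0xFFFFFFFF  # 32-bit datapath, matching the width in the paper's title


def alu_2to1(a, b):
    """Conventional 2-1 ALU (hypothetical model): one ADD on two operands."""
    return (a + b) & MASK32


def icalu_3to1(a, b, c):
    """3-1 ALU (hypothetical model): evaluates (a + b) + c as one operation."""
    return (a + b + c) & MASK32


# Interlocked pair: ADD R1, R2 followed by ADD R1, R3.
regs = {"R1": 5, "R2": 7, "R3": 11}

# Serial execution: the second ADD has to wait for the first result.
step1 = alu_2to1(regs["R1"], regs["R2"])     # first cycle
serial_result = alu_2to1(step1, regs["R3"])  # second cycle

# Collapsed execution: both results are produced from the original operands,
# so ALU1 and the 3-1 ALU can work side by side.
first_result = alu_2to1(regs["R1"], regs["R2"])                 # ALU1
second_result = icalu_3to1(regs["R1"], regs["R2"], regs["R3"])  # 3-1 ALU

assert second_result == serial_result
```

The 3-1 ALU never reads the result of the first instruction, which is why the dependency no longer forces serial execution in this model.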

THE INTERLOCK COLLAPSING ALU UNIT:

In this chapter all the designed components are put together to implement the ICALU. Also, ALU1 is created using the designed components. Finally, the Interlock collapsing ALU unit is implemented which consists of both ALU1 and ICALU. The chapter also estimates the relative delay.

Reduced ICALU Model :

The design of the various stages in the preceding chapters yields a reduced ICALU: the multiplexers M2 and M3 are eliminated, and the Pre-CLA and Post-CLA Logic Blocks have better implementations. The block diagram is shown in Fig 2.1. The program for the ICALU is in Appendix A.2.

Fig 2.1 : Reduced Dataflow Model of ICALU

ALU1 Model :

Fig 2.2 : Dataflow Model of ALU1

The control signals for the multiplexer are K12 and K13, and are set as follows :
1. CATEGORY 1 ( ARITHMETIC ) : K12 = 1 and K13 = 0 ;
Output of ALU1 = O = A ± B.
2. CATEGORY 2 ( LOGICAL ) : K12 = 0 and K13 = 1 ;
Output of ALU1 = O = A LOP B.

The values of the control signals are summarized in Table 2.3 :

CATEGORY	K12	K13	O
1	1	0	A ± B
2	0	1	A LOP B

Table 2.3 : Output table for ALU1

2.2.2 Implementation :

The ALU1 is implemented using the block diagram above. The components CLA and PREBLK are the adder and the logic block respectively, for ALU1. The program for entity ALU1 is shown in A.3.
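The selection behaviour described for ALU1 can be sketched behaviourally as follows. This is a Python illustration, not the VHDL entity from the appendix: the function names are hypothetical stand-ins for the CLA and PREBLK components, and the set of logical operations (AND, OR, XOR) is an assumption, since the text only writes "A LOP B".

```python
MASK32 = 0xFFFFFFFF  # 32-bit datapath


def cla_add_sub(a, b, subtract=False):
    """Stand-in for the CLA block: 32-bit A + B or A - B (two's complement)."""
    if subtract:
        return (a + ((~b) & MASK32) + 1) & MASK32
    return (a + b) & MASK32


def preblk_logic(a, b, lop):
    """Stand-in for the PREBLK logic block: bitwise A LOP B.

    The available operations are an assumption made for this sketch.
    """
    ops = {"AND": a & b, "OR": a | b, "XOR": a ^ b}
    return ops[lop] & MASK32


def alu1(a, b, k12, k13, subtract=False, lop="AND"):
    """Output multiplexer of ALU1, selected by the control signals K12 and K13."""
    if (k12, k13) == (1, 0):   # category 1: arithmetic, O = A ± B
        return cla_add_sub(a, b, subtract)
    if (k12, k13) == (0, 1):   # category 2: logical, O = A LOP B
        return preblk_logic(a, b, lop)
    raise ValueError("K12/K13 combination not listed in the output table")
```

For example, `alu1(6, 3, 1, 0)` selects the adder path and `alu1(6, 3, 0, 1, lop="XOR")` selects the logic path.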

Interlock Collapsing ALU Unit :

The Interlock collapsing ALU unit consists of ALU1 and the ICALU operating in parallel. The block diagram of the Interlock collapsing unit was shown in Chapter 1, Fig 1.7. The program for entity ICUNIT is shown in A.2.

Estimation Of Relative Delay Between ALU1 And ICALU :

In this section the relative delay between the ALU1 in Fig 2.2 and the ICALU in Fig 2.1 is estimated. The relative delay is the difference between the delay of ALU1 and that of the ICALU. The delay is required to find out the instruction cycle length. The delay of a device can be estimated by taking a logic gate count from the input to the output. Only the delay difference between the two ALUs is considered, because all other stages in their respective paths are identical and therefore have identical delays.

Now, compare Fig 2.1 (ICALU) and Fig 2.2 (ALU1).

By elimination, it is deduced that the ICALU has two additional stages when compared to ALU1, which are :
1. The CSA, and
2. The Post-CLA Logic Block.

The procedure is :
1. The CLA and multiplexers are common to both ALUs. Hence they can be eliminated.
2. The extra stages in the ICALU path are the CSA and the Post-CLA Logic stage.
3. The Pre-CLA Logic stages are not considered because, in the case of ALU1, the Pre-CLA logic is in parallel with the CLA stage and has fewer stages than the CLA, whereas in the case of the ICALU it is in parallel with the CSA and has the same delay as the CSA.

The logic delays of both stages are :
1. LOGIC DELAY OF CSA :
To estimate this, consider (3.13a) and (3.14), which represent the input-output transformations of the CSA sum and carry respectively. Both are evaluated in parallel.

$$\mathrm{SUM} = S_i = A_i \oplus B_i \oplus C_i \tag{3.13a}$$

$$\mathrm{CARRY}_{i+1} = K_2 A_i B_i + K_1 B_i C_i + K_1 A_i C_i + K_3 C_{i+1} \tag{3.14}$$

(3.13a) and (3.14) can each be implemented in one gate delay using custom-built CMOS libraries. A ± 3 × 4 AO gate can serve this purpose (+ represents AND-OR and − represents AND-OR-INVERT). The delay of this gate is assumed to be one gate stage, the same as that of any other gate in the assumed libraries.
2. LOGIC DELAY OF POST-CLA LOGIC BLOCK :
Similarly, (3.9) (shown below) can be implemented in one gate delay by the AO gate.

$$L_i = L_{li} K_{PRE1} + L_{ri} K_{PRE1} + L_{li} L_{ri} K_{PRE2} + L_{li} L_{ri} K_{PRE3} \tag{3.9}$$

Thus, the total relative gate delay of the ICALU over ALU1 = logic delay due to the CSA stage + logic delay due to the Post-CLA Logic stage = 1 + 1 = 2.
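
The role of the CSA stage in the ICALU path can be illustrated with a short behavioural sketch in Python. It assumes the control signals in (3.14) are set for plain addition (K1 = K2 = 1 and K3 = 0, an assumption made only for this example), so the carry reduces to the majority function. The function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF  # 32-bit datapath


def csa(a, b, c):
    """Carry-save adder: reduce three 32-bit words to a sum word and a carry word.

    Assumes K1 = K2 = 1 and K3 = 0, i.e. plain three-operand addition.
    """
    s = a ^ b ^ c                               # per-bit sum: A_i xor B_i xor C_i
    carry = ((a & b) | (b & c) | (a & c)) << 1  # per-bit majority, weighted at bit i+1
    return s & MASK32, carry & MASK32


def three_operand_add(a, b, c):
    """CSA followed by one carry-propagate addition (the role played by the CLA)."""
    s, carry = csa(a, b, c)
    return (s + carry) & MASK32
```

Each output bit of `csa` depends only on the three input bits in its own column, which is consistent with the one-gate-delay estimate for this stage; the carry propagation is left entirely to the adder that follows.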
Determination of Instruction Cycle Lengths of a Machine With And Without ICALU :

The average instruction length is calculated to find out the speed of the machine. Because the instruction cycle length varies from instruction to instruction, an average has to be taken; it is sufficient to average only the frequently executed instructions. The following discussion shows how the instruction length can be calculated for a given instruction. But first, Fig 2.4 is redrawn.

Fig 2.4.5 ( Phases of Instruction execution process )

Fig 4.3 represents the instruction path of a serial machine. The time to execute an instruction is given as I0, or the basic instruction cycle time. The individual stages have been discussed in Chapter 1.

2.4.1 Without ICALU :

For a parallel machine there are two such paths in parallel. Fig 4.4 shows instruction execution (considering the non-interlocked case) in a parallel machine with respect to time.

Fig 4.4 shows the instruction cycle of a parallel machine for the two-operand instruction pair shown below. The upper cycle in the figure represents execution of Instruction 1. The instruction time is the same as the basic instruction cycle time, I0. Execution of Instruction 2 is shown in the lower half. It starts one memory write cycle after the first instruction, because memory cannot be accessed simultaneously, and is therefore shifted to the right by 1MW. Each x in the figure represents an idle cycle.

ADD R1, R2 /* Executed by ALU1 */

ADD R3, R4 /* Executed by ALU2 */

The ID2 is smaller than ID1 by one memory access because we already have R2, fetched by Instruction 1. This compensates for the delay in start of execution of Instruction 2 and thus the execution cycles of both the instructions start at the same time. After the EX cycle is complete, Instruction 2 has to wait for 1MW for Instruction 1 to complete its memory access.

Instruction 2 takes a further 1MW to complete its cycle. Thus from the figure it can clearly be seen that the instruction time of a parallel machine is lengthened by 1MW.

2.4.2 With ICALU :

The instruction cycle in the figure is for the pair given below :

ADD R1, R2

ADD R1, R3

The operation is similar to that of an ordinary parallel machine, except that there is no memory access for ALU1. Hence the memory access starts once the ICALU completes its execution, which takes two more logic (gate) delays than the 2-1 ALU. Hence its instruction cycle time increases to I0 + 2D (D = unit gate delay, or the delay of one gate).

MW can be treated as three gate delays for CMOS memories. Substituting this value, the average instruction length can be calculated.
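
The two instruction-pair times derived above can be written out as a small calculation. The sketch below is in Python and expresses everything in unit gate delays. The value chosen for I0 is purely hypothetical, since the text does not give one; only MW = 3D and the two formulas come from the discussion above.

```python
D = 1        # unit gate delay
MW = 3 * D   # memory write, treated as three gate delays (from the text)
I0 = 20 * D  # HYPOTHETICAL basic instruction cycle time, for illustration only


def pair_time_parallel_no_icalu(i0=I0, mw=MW):
    """Non-interlocked pair on a parallel machine: lengthened by one MW."""
    return i0 + mw


def pair_time_with_icalu(i0=I0, d=D):
    """Interlocked pair on the ICALU machine: two extra gate delays."""
    return i0 + 2 * d
```

With MW = 3D, the collapsed interlocked pair (I0 + 2D) completes one gate delay sooner than a non-interlocked pair on the ordinary parallel machine (I0 + 3D), whatever value I0 takes.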

3. PERFORMANCE ANALYSIS

In this chapter the performance of a non-ICALU parallel machine and that of a parallel machine with the ICALU are compared. Table 5.1 shows the average instruction lengths of a machine with the ICALU and of a non-ICALU parallel machine for the interlocked and non-interlocked categories. The average instruction lengths were calculated by taking the average of the instruction lengths obtained for all possible interlocked and non-interlocked pairs ( See Appendix B ). The average instruction length is the time taken to execute an instruction pair, that is, two consecutive instructions.
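
One way to study how the share of interlocked pairs affects execution time is a weighted average over the two categories. The Python sketch below is an illustration under stated assumptions: the per-pair times are hypothetical placeholders and are not the values of Table 5.1, and the definition of the speed ratio (time without the ICALU divided by time with it, in percent) is assumed here, not taken from the text.

```python
def average_pair_time(frac_interlocked, t_interlocked, t_non_interlocked):
    """Weighted average time per instruction pair for a given interlock fraction."""
    if not 0.0 <= frac_interlocked <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    return (frac_interlocked * t_interlocked
            + (1.0 - frac_interlocked) * t_non_interlocked)


def speed_ratio_percent(t_without_icalu, t_with_icalu):
    """Assumed definition: time without the ICALU over time with it, in percent."""
    return 100.0 * t_without_icalu / t_with_icalu


# HYPOTHETICAL per-pair times in gate delays; NOT the values of Table 5.1.
NON_ICALU = {"interlocked": 40.0, "non_interlocked": 23.0}
ICALU = {"interlocked": 22.0, "non_interlocked": 23.0}


def sweep(fractions=(0.0, 0.5, 1.0)):
    """Return (fraction, time without ICALU, time with ICALU, speed ratio %) rows."""
    rows = []
    for p in fractions:
        t_plain = average_pair_time(p, NON_ICALU["interlocked"],
                                    NON_ICALU["non_interlocked"])
        t_icalu = average_pair_time(p, ICALU["interlocked"],
                                    ICALU["non_interlocked"])
        rows.append((p, t_plain, t_icalu, speed_ratio_percent(t_plain, t_icalu)))
    return rows
```

Replacing the placeholder dictionaries with measured average instruction lengths would give the corresponding curve for a real instruction mix.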

