 Open Access
 Total Downloads : 266
 Authors : Jai Prakash Mishra, Mukesh Maheshwari
 Paper ID : IJERTV3IS10602
 Volume & Issue : Volume 03, Issue 01 (January 2014)
 Published (First Online): 31012014
 ISSN (Online) : 22780181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Simulation & Synthesis of FPGA Based & Resource Efficient Matrix Coprocessor Architecture
Jai Prakash Mishra1, Mukesh Maheshwari2
1M.Tech Scholar, Electronics & Communication Engineering, JNU Jaipur, Rajasthan, India
2Assistant Professor, Electronics & Communication Engineering, JNU Jaipur, Rajasthan, India
Abstract
Due to fast routing and logic resources, embedded multipliers, very high gate densities, block RAM memory modules, and embedded soft processors, modern FPGAs offer an excellent platform for the implementation of efficient Field Programmable Systems on Chip. The latter are largely required in a wide range of applications, like telecommunication, wireless, networking, video, signal processing, robotics and digital control. All the above applications are characterized by an intensive computation of matrix operations like the matrix multiplication. Therefore, the introduction of a coprocessor to support the computation of matrix multiplications alongside a general purpose processor system can be a good solution to reduce the computational time & Resources. However, in order to get the most of their potential performance, FPGAs need to be programmed at the hardware low level and not the application level. This process needs a considerable hardware knowledge, which means that FPGA programming is still reserved to the specialist. The recent explosion in FPGAs resource densities and performance, together with their inherent reprogram ability feature, makes them very attractive as high performance, flexible, implementation platforms for these operations. In this present work a novel architecture has been developed for large matrix multiplication. For high performance applications, this operation will minimize the hardware resources. For this, we use a parallel architecture for the multiplication of two matrices using Field programmable Gate Array (FPGA).
Index Terms: FPGA, Xlinx ISE, Modelsim, Matrix Multiplication, Parallel BlockScheduling.

INTRODUCTION
Matrix Multiplication Architecture is a processor intended to perform matrix multiplication in an efficient manner, with reduced consumption of time and resources. It finds enormous use in real time image processing and computer vision applications, where the images are considered as matrices, and involve various matrix operations of which multiply is highly important and relatively difficult to implement. Matrix multiplication is a core operation in digital signal processing operations with a variety of applications such as image processing, computer graphics, sonar processing and robotics [2][3][4]. For high performance applications, this operation must be realizes in hardware. For this, we use a parallel architecture for the multiplication of two matrices using Field programmable Gate Array (FPGA) [1].
Traditionally, matrix multiplication operation is either realized as software running on fast processors or on dedicated hardware such as Application Specific Integrated Circuits (ASICs). Software based matrix multiplication is slow and can often become a bottle neck in the overall system operation. However, hardware (Field Programmable Gate Array (FPGA)) based design of matrix multiplier provides a significant speedup in computation time and flexibility as compared to software and ASIC based approaches respectively [6].
Modern FPGAs can accommodate multimillion gates on a single chip. During the last decade, the logic density, functionality and speed of FPGA have
improved considerably. Modern FPGAs are now capable of running at speed beyond 500 MHz [7]. Another important Feature of FPGAs is their potential for dynamic reconfiguration [8]; that is, reprogramming part of the device at run time so that resources can be reused through time multiplexing.
T h e matrix multiplication operation is often performed by parallel processing systems which distribute computations over several processors to achieve significant speedup gains. There exists many realization of matrix multiplication. These realizations mainly differ in terms of algorithms or the hardware platforms.
In this work, we consider the problem of implementing large matrix multiplication algorithm [ 10 ] and then mapping the algorithm to parallel architecture on FPGA. The FPGAbased systolic array parallel architecture for the trimatrix multiplication was evaluated for different matrix size [9], but if the size of trimatrix was increased then it required more hardware resources which were the computational complexity of multiplier. Since this design could not be fit into Spartan3 and large size matrices requires more hardware resources, more memory space, more power, RAM and more time requires for complete the computation. For matrix multiplication of large matrices, the memory based architecture is quite efficient whereas, for small and medium sized matrix multiplication.

The PARALLEL BLOCK (PB) Architecture

Algorithm
We separate the sequential block algorithm into two parallel algorithms, called Master and Slave. The Master algorithm is executed on a single processor, and the Slave on multiple processors. In the discussion to follow and Shown in fig1, we refer to the Slave processors as to Processing Elements (PE). The Master sends the data from matrices A and B within messages and loads the results into matrix C ordered as Si by Sj blocks, according to the Parallel Block scheduling algorithm [1]. The data in the messages are correctly ordered by the Master and are delivered in a preserved order to the PE chain. In each PE, the scheduling algorithm is performed in the following steps:
Step 1: The Master processor sends Si elements of one column of array A so that each PE receives Si/P elements.
Step 2: The Master processor sends Sj elements of
Fig1 The PB computational scheme an example for N = 4 and Si = Sj = 2
One row of array B to all PEs. The elements of array A and B are multiplied in each PE and added to the corresponding temporary elements of array C. Results are accumulated into the local PE memory.
Step 3: Repeat N times steps 1 and 2. Finally, the PE local memories will contain Si Ã— Sj elements of C.
Step 4: The Master processor transfers the Si Ã— Sj block of C from the PE local memories to the main memory. If there are unprocessed blocks, go to step 1.

Explanation
The Main Processor (comprises of Matrix A,
Matrix B
and Matrix C) sends a signal for data transfer from Matrix A and Matrix B which is received by Slave processor (comprises of MAC unit) in the form
of data acknowledgement signal.
As the Slave receives the data acknowledgement signal it fetch the data from Matrix A and Matrix B and store it in respective FIFO registers.
As per the functionality described above regarding the parallel block architecture, the data elements are fed to register 1 to register N, depending upon the size of matrix. If the size of matrix is 8×8 then the PEs required is 4. Then the data are multiplied and there after addition is performed, the resulted output is stored in output register from C11_reg to C41_reg. The computed result is fed to Matrix C.
The basic building block of proposed matrix multiplication is shown in figure2. This building block comprises of Matrix A, Matrix B, MAC Unit and Matrix C.
Fig2 Basic block building architecture
MatrixA
Matrix A is having elements of I x K matrix structure. The length of elements is 2nwhere n is in the multiples of 2. A clock signal is provided to it. When the read A signal is hih, Slave processor will fetch data from Matrix A. The address A signal will increment in the multiples of 2.
MatrixB
Matrix B is having elements of K x J matrix structure. The length of elements is 2n where n is in the multiples of 2. A clock signal is provided to it. When the read B signal is high, Slave processor will fetch data from Matrix B. The address B signal will increment in the multiples of 2.
MAC Unit
Basically MAC Unit comprised of Slave Processor which performs all the arithmetic operations. In Slave Processor individual processing elements are there which depends upon the length of matrix. When the reset signal is active high, the Slave Processor comes to reset hence before starting any new task/operation the reset signal should be kept high for one clock period.
Matrix C
Matrix C is of I x J matrix structure. After the results being computed from MAC Unit, Shown in fig


The data is stored in Matrix C according to the address location.
The input signal to both the PE and Slave Processor is same whereas the output of the PE is C1, C2, C3 and C4 whereas the output of the Slave Processor is controlled by 2:1 multiplexer. Data Acknowledgement and Data from Main are the two important signals of Slave Processor Shown in fig3. Until the Data acknowledgement signal is low; FIFO A and FIFO B will not fetch the data from respective matrices. Similarly when Done from Main signal becomes active high it means that all the required data has been fetched.
Fig3 Basic MAC Unit

Algorithm for Main Processor

Reset Main
Reset all counter and reset slave processor.

Main 1
Read 32 bit from Matrix A and Matrix B; increment address A count and address B count by 2.

Main 2
Load data from Matrix A Out to FIFO A and Matrix B Out to FIFO B; send data acknowledgement to Slave Processor; increment N count.

Main 3
Check N count; if N count=7 then go to main 4 else go to check for data request.

Check for data request
If data request =1 then go to Main1 else remain in check for data request.

Main 4
Send done from main signal to Slave Processor; reset N count and increase final count by 1.

Main 5
Load Ist set of data from Slave to Matrix C of Main; go to Main 6.

Main 6
Load IInd set of data from Slave to Matrix C of Main;
go to Main 7.

Main 7
Load IIIrd set of data from Slave to Matrix C of Main; go to Main 8.

Main 8
Load IVth set of data from Slave to Matrix C of Main; go to Main 9.

Main 9
If final count=1 then go to Main 10. If final count=2 then go to Main 11. If final count=3 then go to Main 12.
If final count=4 then go to Reset Main.

Main 10
Set reset address A count and set preset address B count; go to Main 1.

Main 11
Set preset address A count and set preset address B count; go to Main 1.

Main 12
Set preset address A count and set preset address B count; go to Main 1.
Fig4 FSM of Main Controller


Algorithm for Slave Processor

Check for data ack
Wait for data ack from the main processor.

Slave 1
1 cycle is required to store data in registers from FIFO A and FIFO B.

Slave 2
1 cycle is required to compute results; wait for done from main; from the main processor if done from main = 0 then go to check for data ack else go to slave 3.

Slave 4
1 cycle is required to send II set of data to main processor.

Slave 5
1 cycle is required to send III set of data to main processor.

Slave 6
1 cycle is required to send IV set of data to main processor; also data req = 0 is sent to main processor.

SIMULATION RESULTS
The matrix multiplication Architecture is simulated by using Xilinx 10.1i ISE Simulator by writing VHDL hardware description language.
Fig6 Simulation Result of Ist Block of Data
Fig5 FSM for Slave Controller
4) Slave 3
1 cycle is required to send I set of data to main processor.
Fig7 Simulation Result of IInd Block of Data
Fig8 Simulation Result of IIIrd Block of Data
Fig9 Simulation Result of IVrt Block of Data

RTL AND TECHNOLOGY SCHEMATIC

RTL View:
RTL View is a Register Transfer Level graphical representation of Architecture design. This representation (.ngr file produced by Xilinx Synthesis Technology (XST)) is generated by the synthesis tool at earlier stages of a synthesis process when technology mapping is not yet completed. The goal of this view is to be as close as possible to the original HDL code. In the RTL view, the design is represented in terms of macro blocks, such as adders, multipliers, and registers. Standard combinatorial logic is mapped onto logic gates, such as AND, NAND, and OR.
Fig10 RTL Schematic

Technology Schematic
This schematic shows the representation of design in terms of logic elements optimized to the target Xilinx device or technology, for example, in terms of LUTS, carry logic, IOBS and other technology specific components. Viewing this schematic allows seeing a technology level representation of HDL optimized for a specific Xilinx architecture, which may help discover design issues early in the design process.
Fig11 Technology Schematic


COMPARATIVE RESULTS & PERFORMANCE
The proposed architecture was modeled in VHDL hardware description language. The VHDL model was synthesized with Xilinx ISE 10.1i targeted for Spartan
3E (XC3S500e5vq100) FPGA device from Xilinx. In order to show the performance of the proposed architecture, the same FPGA device was used. For illustration of the proposed technique, we designed 8Ã—8 matrix multiplier for 8bit fixed point number. The design implemented uses 4 Processing Elements. The results demonstrate that our design have fewer Utilization of FPGA resources as compared to previous Architecture [10]. Table 1 & Table 2 summarized the performance in terms of Logic Utilization, Peak Memory and Power Consumptions comparison. The main idea of the proposed Architecture is to reuse the resource and input the data in serial. In this way, the hardware cost can be dramatically decreased. There are following Comparative Results between old architecture [10] and proposed architecture.
Table1 Hardware Cost for old Architecture
S.No
Logic Utilization
Hardware Used
%
Use
1
Number of Flip
Flops
365
3%
2
Number of 4 input
LUTs
452
4%
3
Number of occupied Slices
293
6%
4
Total Number of 4
input LUTs
458
4%
5
Number of bonded IOBs
165
71%
6
Number of
BUFGMUXs
1
14%
7
Peak Memory
Usage
172 MB
8
Total Power
0.102W
S.No
Logic Utilization
Logics Used
Available Logics
%
Use
1
Number of
Flip Flops
176
9312
1%
2
Number of 4 input LUTs
282
312
3%
3
Number of
occupied Slices
165
4656
3%
4
Total
Number of 4 input LUTs
298
9312
4%
Table2 Hardware Cost for Proposed Architecture
5
Number of
bonded IOBs
39
66
59%
6
Number of BUFGMUXs
2
24
8%
7
Peak Memory
Usage
116 MB
8
Total Power
0.080W

CONCLUSION
Matrix multiplication can be applied extensively in many areas. Original parallel method of the implementation on Simulink Xilinx 10.1, which is implemented for on Spartan 3E in which Block consumes considerable low hardware resources, which makes it hard to be realized. We provided a new method to improve the module in order to save the resources and consume minimum power. In this way, less hardware is needed to complete the computation. Other methods can be used to optimize the matrix multiplication, such as pipeline technology.
Several design techniques such as parallel processing and pipelining are employed to achieve high performance and efficient hardware realization of the matrix multiplier using FPGA. Implementation results demonstrate the effectiveness of the proposed design technique over previous solutions. In terms of Logic Utilization, the proposed architecture is better than the implementations that are reported in the literature.
REFERENCES

Yong Dou S. Vassiliadis G. K. Kuzmanov G. N. Gaydadjiev, 64bit FloatingPoint FPGA Matrix Multiplication, Computer Engineering, EEMCS, TU Delft, P.O. Box 5031,2600 GA delft, The Neatherlands.

J. D. Foley, A. van Dam, S. K. Feiner, and J. F. Hughes,"Computer Graphics, Principles and Practice", AddisonWesley, second edition, 1996.

P. Graham and B. Nelson, "FPGA based Sonar Processing", Proceedings of the 6th ACM / SIGDA international symposium on FPGAs, pp. 201208, February 1998.

A. Jones, A. Nayak. P. Banerjee, "Parallel Implementation of Matrix and Signal Processing Libraries on FPGAs", Proceedings of the 14th International Conference on Parallel and Distributed Computing Systems, 200. http:Hwww.ece.northwestern.edu/kiones/papers/342 050.pdf

K. Benkrid, D. Crookes, J. Smith, and A. Benkrid, 'High Level Programming for Real Time FPGA Based Video Programming', Proceedings of ICASSP'2000, Istanbul, June 2000. Volume VI, pp. 32273231.

S.M.Qasim, S.A.Abbasi, and B.Almashary, A proposed FPGAbases parallel architecture for matrix multiplication, Circuits and Systems, 2008, APCCAS 2008, Pages 17631766, Nov 2008.

T. J. Todman, G. A. Constantinides, S. J. E. Wilton, O.Mencer, W.Luk, and P. Y. K. Cheung,Reconfigurable computing: architectures and design methods, IEEE Proc. Computer and Digital Techniques,Vol. 152, No. 2, pp 193207, Mar. 2005.

L. Singhal and E. Bozorgzadeh, Multilayer floorplanning for reconfigurable designs,IET: Computers and Digital Techniques, Vol.1, No. 4, pp. 276294, July 2007.

Syed M.Qasim, Ahmed A. Telba and Abdulhameed
Y. AlMazroo Department of Electrical Engineering, FPGA Design and Implementation of Matrix Multiplier Architectures for Image and Signal Processing Applications Centre King Saud University, College of Engineering Riyadh 11421, Saudi Arabia{smanzoor, abbasi, bmashary}@ksu.edu.

Xiaoxiao Jiang1, Jun Tao2, Department of Electrical Engineering, University of Minnesota,Twin Cities, USA,Implementation of Effective Matrix Multiplication on FPGA, Proceeding of IEEE IC BNMT2011.