Development of State-of-the-Art AI Vision Algorithms on the Xilinx Alveo U-200 FPGA Cloud and CPU+GPU Platforms

Abstract: Since the start of modern computing, many new computer vision techniques have been developed. Machine learning algorithms, and in particular the evolution of convolutional neural networks (CNNs), now underpin state-of-the-art object detection, segmentation and classification. These CNNs can achieve human-like results in computer vision applications, although at the expense of greater computation. To meet the hardware requirements of machine learning applications, various AI-accelerated FPGA development kits have been released, along with specialized toolkits aimed at efficient optimization and deployment of models. In theory, FPGA solutions can reach similar accuracy with better inference time and power consumption than GPUs; however, this comes at the cost of limited CNN model support and additional FPGA hardware design complexity. In this thesis, an existing object detection algorithm has been studied and simulated in real time under the Darknet framework, utilizing both CPU and GPU efficiently through Nvidia's CUDA. GoogLeNet and ResNet-50 have been implemented on a cloud-based FPGA platform using the Xilinx Vitis AI toolkit. The tools apply strategies such as model quantization and hardware architecture setup to achieve accuracy within roughly 10% of a GPU. A broad case study of the hardware and software configurations made on the Xilinx Alveo U-200 FPGA for efficient cloud deployment has been carried out. Results from both simulation platforms are compared and discussed with a view to further optimization and development.

INTRODUCTION
Early applications of computer vision involved making what a computer "sees" visible and interpretable to humans, including applications of cameras, X-ray imaging and colour detection. Early experiments in computer vision started in the 1950s, and by the 1970s the field had found commercial application, mainly in distinguishing typed from handwritten text. The term artificial neural network (ANN) was first coined in the late 1950s [1]. An ANN is a computational model of the neurons in animal brains, designed to perform higher computational tasks in which conventional algorithms had little to no success. However, due to the hardware limitations of that time, ANNs did not reach their true potential. It was only in the last decade that the driving force of ANN architecture modelling paved the way for improvements in CNN algorithms for computer vision applications. Instead of using the predefined features of classical ML algorithms, CNNs (a class of DNNs) can take advantage of increasing hardware memory and computational capability to learn from huge training datasets, resulting in notable recognition capability. The turning point came in 2012, when the winning method of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) used a CNN to detect up to 60 classes [2]. Since then, CNNs have been the mainstay of computer vision applications. Thanks to the advancement of deep learning training techniques, AI-powered computer vision can now read and perceive what is seen by the computer and give meaningful output with efficient prediction capabilities. Various CNN architectures have since been developed to meet specific goals in terms of applications, hardware compatibility and computational complexity. Popular ones include LeNet, GoogLeNet, ResNet, Darknet and RCNN [3].
An FPGA can be considered as an array of millions of gates that can be reprogrammed to run a specific algorithm. FPGAs provide massive parallelism targeted at that specific algorithm through their combination of logic gates (lookup tables), digital signal processors for multiplying matrices, static RAM for the temporary storage of computation results, and switching blocks that control the connections between programmable blocks. Most FPGAs used in development come with integrated system-on-chip (SoC) components, including CPUs, PCI Express, DMA connections and Ethernet controllers. These make FPGAs feasible and reliable during development, unlike ASICs [7].
The project involves comparing the architectures of existing CNN algorithms and implementing them effectively on various platforms. In our project, we use Xilinx Alveo U200 accelerator cards via a cloud-based FPGA HPC machine offered by Nimbix, Inc. This is one of the newest and fastest-deploying FPGAs for AI vision applications. It delivers massive integration and ASIC-class capabilities: system-level performance, massive I/O, memory bandwidth, data flow, advanced DSP and packet-processing performance. We used the Xilinx U200 FPGA as our hardware platform [9].

A. Artificial Neural Network:
An ANN is a computational model of a biological neuron, inspired by the early research of D. O. Hebb in the 1940s [17]. Though ANNs are not as accurate or sensitive as biological neurons, they can replicate the same learning behaviour. An ANN performs mathematical calculations to derive an output through activation functions, similar to a biological neuron's action potential.
In the figure above, the input is fed through the dendrites to the nucleus. The nucleus triggers an electrochemical signal that is sent to the synapses (axon terminals). Only when the aggregated electrochemical signal exceeds the synaptic threshold does it result in an electrochemical spike, which moves down the axon to the dendrites of other neurons.

B. Components of ANN:
The components of a typical neural network may be summarized as follows:
Connections: They carry the information signals that are processed by the neurons.
Propagation function and network input: These receive the outputs of connected neurons and multiply them by weight values, converting the vector of input signals into a scalar value. The network input is the output of the propagation function; this signal then moves to the activation function.
Activation functions: Depending on the nature of the model, each neuron in the architecture has to be activated appropriately. This happens when the neuron's input signal crosses a given threshold value, making the neuron sensitive so that it starts to fire. Let j be a neuron with its unique threshold value Θj and net input net_j; its activation can then be written as a_j = f_act(net_j, Θj). The activation function is typically the same for most neurons in an architecture, but with different threshold values. These functions play a major role in deriving the appropriate output from the input data. Some common activation functions are the binary threshold, sigmoid, tanh and ReLU functions.
Output function: The output function of a neuron j calculates the values that are transferred to the other neurons connected to j.
Learning strategy: Every neural network model has a specialized learning strategy: an algorithm that determines how the network is trained to produce the desired output. We will discuss popular learning techniques in detail in the upcoming sections.
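The propagation function and threshold activation described above can be sketched in a few lines (a minimal illustration; the weights and the threshold of 0.5 are made up for the example):

```python
import numpy as np

def propagate(inputs, weights):
    """Propagation function: weighted sum of input signals (vector -> scalar)."""
    return float(np.dot(inputs, weights))

def binary_threshold(net_input, threshold):
    """Activation: the neuron 'fires' (outputs 1) only when the net input
    crosses the neuron's threshold value."""
    return 1.0 if net_input >= threshold else 0.0

# Hypothetical neuron j with threshold 0.5
x = np.array([0.2, 0.8, 0.5])
w = np.array([0.4, 0.3, 0.6])
net_j = propagate(x, w)               # 0.2*0.4 + 0.8*0.3 + 0.5*0.6 = 0.62
out_j = binary_threshold(net_j, 0.5)  # fires, since 0.62 >= 0.5
```

In practice the hard threshold is replaced by a smooth function such as sigmoid or ReLU so that the network can be trained by gradient descent.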
C. Inference vs Training: It is essential to know the two phases when it comes to deploying ML models. These are Inference and Training.
Training: This stage works on annotated datasets and deals with training the network to tune it for the desired output predictions. It is generally implemented with typical deep learning algorithms like back-propagation, which iterates over and updates the parameters of the CNN. These parameters, such as convolution weights, are tuned to improve the prediction power of the network. One can use the available annotation tools to manually annotate the desired input sample images with classes. This phase demands high computational power and generally runs for days or even weeks on high-end HPC machines.
There are many open-source pre-trained weights available on the internet. These weights are trained to be deployed on various state-of-the-art models without having to perform a training phase again.
Inference: This phase is the result of the training phase. The model can now predict the class of completely new inputs that were not used during training. Inference is generally fast and applicable to real-time object detection. Hence, to meet these needs, industries tend to carry out the training phase on cloud machines and deploy the pre-trained weights on a prototype edge machine for inference.

D. CONVOLUTIONAL NEURAL NETWORK:
The idea of the CNN was introduced long ago for hand-written digit recognition by [LeCun et al., 1998]. CNNs, a class of DNNs, are feed-forward ANNs applied to analysing visual data. The mathematical definition of convolution is an operation on two functions that together produce a third function expressing how one signal is modified by the other. Images are high-dimensional vectors; computed in the form of matrices, they would result in a very large number of parameters being trained. To address this, CNN architectures were introduced, composed of batches of computational layers through which the input pixel values pass according to their functionalities. The pixels of the image are processed in the form of a special matrix.
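The convolution operation described above can be sketched for a single-channel image (a minimal "valid" cross-correlation, which is what DL frameworks actually compute as "convolution"; the filter values are illustrative):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation of a single-channel image with one filter."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # each output pixel is the weighted sum of a kh x kw patch
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
k = np.array([[1.0, 0.0], [0.0, -1.0]])   # simple diagonal-difference filter
conv2d(img, k).shape   # (3, 3)
```

Note how the output shrinks from 4x4 to 3x3: a KxK filter applied without padding reduces each spatial dimension by K-1.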

E. Building Blocks of CNN:
In this section, we shall look at the basic building blocks of CNNs in general. There are many state-of-the-art CNN architectures available, which we will discuss in the next chapter. A typical CNN contains the following:

Convolution Layer
This process applies a 2D convolution to the input pixels. The pixels of each input, the filter and its output are mapped as given in the figure below. A filter has the same number of channels as the input volume, and the output volume has the same depth as the number of filters. Filters play a major role in deriving particular features from an image. For instance, one could use filters to detect horizontal or vertical lines lying in a specific range of the matrix, or to suppress pixel colours that are not necessary for the learning process.

Activation Layer
This layer introduces non-linearity to improve the representational power of the model without degrading the computational efficiency of the convolutional layers. Of the many available activation functions, the most favourable one for CNNs is ReLU, as it results in faster training. Leaky ReLU is also used, as it can address the vanishing gradient problem.

Pooling Layer
These layers are commonly added after conv layers in order to minimize the spatial size of the representation. This reduces the number of parameters used in the computation and thereby increases efficiency. We do this by selecting the maximum, average or sum of a batch of pixels, which can also reduce overfitting. Pooling layers apply a non-linear downsampling to activation maps, allowing smaller filter kernel sizes and reduced pixel matrix dimensions.

Fully Connected Layer and Loss Layer
This is a typical neural network architecture which maps the features extracted from the sample images to the desired output. The objective of this layer is to take the result of the conv and pooling layers and map it to a label. The output from the previous layers is flattened into a single vector representing the probability of each feature's property; it is then passed to a softmax to give confidence scores.
The loss layer is an optimization layer used in training. It adjusts the current weights using a cost function, which we will discuss in detail in the section on how neurons are trained.
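The max-pooling step described in the pooling-layer paragraph above can be sketched as follows (non-overlapping 2x2 windows on a toy single-channel map; real layers operate per channel over a batch):

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """Non-overlapping max pooling over a single-channel activation map."""
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # keep only the strongest activation in each window
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

a = np.array([[1., 3., 2., 1.],
              [4., 6., 5., 0.],
              [7., 2., 9., 8.],
              [1., 0., 3., 4.]])
max_pool(a)   # [[6., 5.], [7., 9.]]
```

The 4x4 map collapses to 2x2, quartering the activations that downstream layers must process, which is exactly the parameter/computation saving the text describes.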

II. STATE OF THE ART CNN ALGORITHMS:
Using these building blocks, several efficient, easily deployable state-of-the-art CNN algorithms have been developed for computer vision applications. We will look into a few well-known CNN models and perform simulations on CPU+GPU and FPGA platforms. LeNet, the first state-of-the-art CNN architecture, was introduced by Yann LeCun, Léon Bottou, Yoshua Bengio and Patrick Haffner in 1998 [12]. It was mainly trained to recognize handwritten digits and for document recognition. The LeNet-5 architecture is quite simple and is illustrated in the block diagram below.
The LeNet-5 architecture consists of two sets of convolutional and average pooling layers, followed by a flattening convolutional layer, then two fully connected layers and finally a softmax classifier. Over the last decade, many new efficient CNN models have been developed; the figure below shows the most remarkable architectures.

A. AlexNet:
In 2015, AlexNet was outperformed by Microsoft's very deep CNN with over 100 layers, which won the ImageNet 2015 contest [13]. The AlexNet architecture itself consists of 8 layers: 5 convolution layers, each followed by pooling, and 3 fully connected layers. Some notable features of AlexNet are as follows. The use of the non-linear ReLU activation function showed significant improvement, lowering training error by around 25% relative to the tanh and sigmoid activation functions.
The pooling layer in AlexNet uses overlapping windows, so outputs are shared with neighbouring neurons. This showed less error while training and reduced the chance of overfitting: without overlapping, the outputs from neighbouring regions tend to be too similar, which encourages overfitting. A typical model has over 60 million weight parameters after training. AlexNet also allows the network to be trained using multiple GPUs, which significantly reduces training time and accommodates huge models.
Advantages:
• The use of 3 fully connected layers achieves good final accuracy, as the parameters are finely tuned in the network.
• It can achieve 94.5% accuracy on the ImageNet dataset.
• It is able to detect off-centre objects, and its predicted classes for an inference image are reasonable.
Disadvantages:
• AlexNet's 3 fully connected layers are extremely computation-intensive, comprising almost half of the network's computation. This is because most of the parameter count is concentrated in the fully connected layers that map the features from the convolutional layers, involving too many iterations within the local scope of the algorithm [13].

B. Network in Network:
Traditional CNN architectures use linear filters to perform convolution and extract features from images. These filters assume all latent concepts are linearly separable, which is not the case for many problems. A richer non-linear function approximator can serve as a better feature extractor. This model therefore introduces the concept of a small neural network in place of the convolution filter. This mini network takes the convolution window as input and gives the value of the neuron's activation function as output; it is called a multi-layer perceptron convolution (mlpconv). The layers are sparsely (partially) connected rather than fully connected: every node does not connect to every other node. Effectively, it is a 1x1 MLP layer applied before the next layer.
Advantages:
• Due to the direct mapping between extracted features and class scores, feature categories can be treated with great confidence.
• There are no parameters to be trained in FC layers, hence fewer chances of overfitting.
• Since global average pooling sums are used, it is more robust to spatial translations of the input.
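The 1x1 "mlpconv" idea above can be illustrated as per-pixel channel mixing (a toy NumPy sketch, not the NiN reference implementation; the shapes and weights are made up):

```python
import numpy as np

def one_by_one_conv(fmap, w):
    """A 1x1 convolution acts as a small MLP applied at every pixel:
    it mixes channels without looking at spatial neighbours."""
    h, wd, c_in = fmap.shape
    c_out = w.shape[1]                 # w has shape (c_in, c_out)
    # flatten spatial positions, apply the same channel-mixing weights everywhere
    return fmap.reshape(-1, c_in).dot(w).reshape(h, wd, c_out)

fmap = np.random.rand(4, 4, 8)         # 4x4 feature map with 8 channels
w = np.random.rand(8, 3)               # project 8 channels down to 3
out = one_by_one_conv(fmap, w)
out.shape    # (4, 4, 3)
```

Each output pixel is simply `fmap[i, j] @ w`, which is exactly a tiny fully connected layer shared across all spatial positions.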
C. VGG-16: VGG was developed by the Visual Geometry Group at Oxford University in 2014. The building components of VGG are exactly the same as those of LeNet and AlexNet, except that it is an even deeper network with more convolutional, pooling and dense layers. VGG-16 consists of 16 weight layers: 13 convolution layers, interleaved with pooling layers, and 3 fully connected layers [15]. It showed significant improvement over AlexNet by replacing the large 11x11 and 5x5 kernel filters of the initial stages with stacked 3x3 filters, one after the other [16]. These multiple small-sized non-linear filters enable the network to increase its depth and learn more complex features at lower computational expense. The stacked conv filters are followed by 3 FC layers. The width of the network starts at a small value of 64 channels and increases by a factor of 2 after every subsampling/pooling layer. It can achieve 92.3% accuracy on ImageNet.

Advantages:
• The pre-trained weights for this model show good benchmarking results and are feasible for deployment. • It showed good performance and accuracy at the time of its proposal, though better models now exist.

Disadvantage:
• Very large model size (~533 MB), requiring more memory space while training. • The 3 fully connected layers at the end introduce many parameters and consume huge computation power during training iterations. • It takes longer to complete training of the network.

D. GoogLeNet:
GoogLeNet is an evolution of the LeNet architecture focused on easy deployment on GPUs [8]. It is very inefficient for a feed-forward architecture to manipulate input matrices with 512 channels: a 3x3 kernel filter leads to an extremely dense connection structure when convolving images with that many channels. To address this, GoogLeNet proposed an architecture with sparse connections between activations, effectively pruning unimportant connections among the 512 input channels. The resulting model is popularly known as the Inception architecture. Built for datasets with more than 1000 sample images, it keeps the width and number of conv layers, as well as their kernel sizes, small: 5x5, 3x3 and 1x1. The use of 1x1 filters greatly improved overall model training accuracy. Consider an input with 192 channels. The Inception module typically has 128 filters of 3x3 kernel size and 32 filters of 5x5 kernel size. Convolving filters of these dimensions directly over a 25x32x192 input would make the network far too expensive, so 1x1 convolutions are used to reduce the number of input channels: 16 filters of 1x1 kernel size are introduced before the 5x5 filters.
Advantages:
• Replacing the fully connected layers with a simple global average pooling layer, which averages out the channel values of the 2D feature map after the last convolution layer, reduces the number of parameters and improves memory and computation efficiency.
• It achieves 93.3% accuracy on ImageNet and is faster than VGG.
• The entire model is small (~100 MB), whereas VGG is ~500 MB.

Disadvantage:
• There are no widely reported disadvantages of this model; moreover, newer versions like the Xception network showed further improvements [4].

E. ResNet:
The main objective of CNN architectures is to minimize the gradient of the loss function over the learning period. The gradients are updated during backpropagation. As discussed earlier, backpropagation takes two paths while transmitting the gradient back to the input layer. Identity shortcuts can be used when the input and output have the same dimensions.

y = F(x, {W_i}) + x

where F(x, {W_i}) is the residual mapping to be learned, y is the output and x is the input of the layers considered in the network.
In the case of non-identical dimensions, extra zero padding is introduced in the shortcut to match the dimensions.

y = F(x, {W_i}) + W_s x

Here a linear projection matrix W_s is introduced in place of the extra padding.
Advantages:
• The use of bottleneck layers in this framework has shown significant efficiency gains in performance.
• The model size is small (~102 MB), because the fully connected layers are replaced with global average pooling.
Disadvantage:
• The network is very deep; every percentage point of improvement brings an increase in computation, memory cost and training time.

F. State-of-the-art Object Detection Algorithms:
With the above models, we can easily perform classification of different classes. Computer vision applications involve classification, object detection and multiple-object detection with localization. The latter can be achieved using a single-shot approach that does not involve additional deep learning layers; with deeper architectures we could achieve better accuracy at drastically increased computation. In this thesis, we demonstrate YOLO object detection, which performs single-shot bounding-box prediction and object detection. Other popular models such as SSD, RCNN and Faster RCNN also exist, but we will not discuss them.
But when it comes to drawing bounding boxes around multiple objects within a given image, we need an efficient algorithm, which can be defined as follows.
Yolo's Object Detection Approach
First, the image is divided into equally spaced grid cells (SxS), and for each cell the network predicts bounding boxes and the corresponding classes through an output vector of dimension S*S*(C+5), where C is the number of classes. A single convolutional layer then performs the classification for that cell, with a loss function defined as the error between the label vector and the output activation. All cells are fed through the same conv layers to learn the parameters. To understand the workflow, consider a 3x3 grid capable of detecting 3 classes, namely cars, pedestrians and cycles. The label y is then an eight-dimensional vector.
When an object spans multiple grid cells, the algorithm has to be modified so that it can still detect the object. We introduce anchor boxes that define the ground truth of the object, each with a midpoint. Intersection over Union (IoU) computes the ratio of the intersection to the union of the actual bounding box and the predicted one: IoU = area of the yellow box (intersection) / area of the green box (union). The prediction is considered good enough if the IoU is greater than a threshold. The class entries are responsible for identifying the classes: c1, c2, c3 are the three classes, and n classes can be represented as c1, c2, ..., cn in the table. When a certain class is detected, its entry cn is set to 1. Then, with the output box parameter values, we can define the score for the probability of the class.
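The IoU computation above can be sketched directly for axis-aligned boxes in (x1, y1, x2, y2) form (the example boxes are made up for illustration):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)   # 0 when boxes don't overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

gt = (0, 0, 4, 4)       # ground-truth box, area 16
pred = (2, 2, 6, 6)     # predicted box, area 16, overlapping 2x2 = 4
print(iou(gt, pred))    # 4 / (16 + 16 - 4) ~= 0.143
```

A prediction would be accepted only if this value exceeds the chosen IoU threshold (commonly 0.5).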
This object detection technique is much the same for other models; the only change is in the deep neural network layers used for inference. YOLO has 24 conv layers followed by 2 FC layers. It uses some 1x1 conv layers to reduce the depth of the feature maps (reduction layers). The last 7x7x1024 deep conv layer is flattened and fed into those 2 FC layers, giving a final output of 7x7x30. It uses three loss functions: one for classification, one for localization, defining the error between the predicted bounding box and the ground truth, and a confidence loss for object detection.

Advantages:
• YOLO provides fast real-time inference at about 45 FPS, currently among the fastest available. • The use of one single network pass to predict the bounding boxes leads to lightweight deployment for inference. • It provides easy-to-use pre-trained weight files. Moreover, it is built on the Darknet framework, written in C and CUDA, making it efficient to train the model on a home desktop with full GPU utilization.

III. HARDWARE AS AI ACCELERATORS
CNN networks need large datasets and require huge runtimes, possibly weeks or even months if the network is trained on many classes. To address this, dedicated hardware came into existence. AI accelerators are specialized hardware designed to run algorithms requiring parallel operations, reducing their time complexity. Computers have traditionally relied on special-purpose processors alongside a general CPU; these include GPUs, video cards, TPUs and FPGAs. During training, the process involves massively parallel data flow of floating-point matrix and vector operations.
A. DL Frameworks: A deep learning framework is an interface, library or tool which allows us to build deep learning models more easily and quickly, without getting into the details of the underlying algorithms. Frameworks provide a clear and concise way of defining models using a collection of pre-built and optimized components. They enable users to deploy a large number of layers and parameters and bring parallelism to the algorithm. They are optimized for performance and reliability, easy to understand, and, above all, many of them are open-sourced, so users can code ML algorithms in the programming languages they prefer. In the following sections, we will look at some popular open-source frameworks.

Throughput optimization: The FPGA can process the dataflow in parallel. In throughput optimization, this is exploited to handle mappings of data that are inefficient on a generic systolic array. In short, if the computation of a particular stage of the network is large, it can be mapped to dedicated array memory, reducing the complexity of memory handling and the time taken to process it.
Latency optimization: Some applications require the lowest possible single-image latency. In these situations, the xDNN processing pipeline can be easily configured accordingly. We demonstrate the CNN models' performance in both throughput and latency modes.
The Xilinx DNN engine in Vitis AI makes production deployment easy and reliable. It has specific execution paths for performing convolutions, pooling, padding and other filters for various frameworks. As parallelism is one of our primary motives, a new filter or input can be fed in parallel to the network graph, enabling adaptable parallel processing. Because a conventional model built with CPU+GPU frameworks and languages has to be converted to an optimized FPGA architecture, the Xilinx Vitis AI SDK provides its own compiler stack:
Network Compiler - This produces the network instruction sequence to be executed. It implements data-flow management and tensor-level control of the network.
Model Quantizer - This effectively reduces the computational complexity of the model and provides a significant improvement in performance and data memory management. It converts 32-bit floating-point (FP32) weights and activations to fixed-point INT8. A fixed-point integer data type in the network improves speed and power efficiency by requiring less memory bandwidth. For the TensorFlow framework, it takes the frozen graph and performs pre-processing, such as folding batch normalization and removing unwanted nodes. The activations, weights and biases are then quantized to a given bit width. It must be noted that not all models can be quantized; standard versions of the Caffe and TensorFlow frameworks cannot be quantized directly and have to be configured separately.
AI Compiler - It works like any typical compiler, mapping the model into an instruction set and data flow. It performs sophisticated optimizations such as instruction scheduling, layer fusion and on-chip memory management; it can analyse the data, allocate the optimum memory locations, and enable on-chip memory reuse.
AI Library - This contains high-level libraries and APIs for AI inference with the DPU. It plays a major role in enabling users without prior knowledge of FPGA architecture to implement and debug code. The xDNN runtime provides reusable programs, and its scheduler simplifies communication with and programming of the DPU engine. The runtime also makes cloud-to-edge deployment more feasible.
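The FP32-to-INT8 conversion performed by the model quantizer can be illustrated with a simplified symmetric linear quantization sketch (this is an assumption-laden toy version, not the actual Vitis AI algorithm, which additionally calibrates activations on sample data):

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric linear quantization of FP32 weights to INT8:
    scale maps the largest magnitude onto 127."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate FP32 values from the INT8 codes."""
    return q.astype(np.float32) * scale

w = np.array([0.52, -1.27, 0.003, 0.9], dtype=np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()   # bounded by about scale/2
```

Storing `q` instead of `w` cuts weight memory (and hence bandwidth) by 4x, which is where the speed and power benefits described above come from.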

A. Simulations on CPU+GPU
All the simulations here were performed using IPython notebooks on the Google Colab platform. We used the Fashion-MNIST dataset with 10 classes. The models are capable of detecting more classes, but since that takes more time, we limited ourselves to 10. All models were trained with a batch size of 64, 10 epochs and a learning rate of 0.01. We used 5 conv layers and 3 fully connected layers (represented as dense layers) to build the AlexNet model. The nominal input to the AlexNet model is a 256x256x3 RGB image. The original samples of the Fashion-MNIST dataset are 28x28, so we upscale them to 224x224. The first conv layer contains 96 kernels of size 11x11x3, as defined in our code, and produces 54x54 feature maps. These dimensions are calculated using the formula (W - K + 2P)/S + 1.
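The output-size formula above can be checked in code using the first-layer numbers from the text:

```python
def conv_out(w, k, s=1, p=0):
    """Spatial output size of a conv/pool layer: floor((W - K + 2P) / S) + 1."""
    return (w - k + 2 * p) // s + 1

# AlexNet's first conv layer: 224x224 input, 11x11 kernel, stride 4, no padding
print(conv_out(224, 11, s=4))   # 54
```

The same function applies to pooling layers; for example, a 2x2 window at stride 2 on a 28x28 map yields 14x14.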

1) Alexnet Results
The same calculation applies for all the other layers. The use of multiple kernels of the same size in conv layers can extract very deep features from the image. There are two max-pooling layers, after the first and second conv layers, and we added an overlapping max-pooling layer after the fifth conv layer. We applied ReLU non-linearity after every conv layer, and local normalization after the first two convolution layers. The fully connected layers end with a softmax classification. A detailed overview of the algorithm's design and flow is given in the source code. Our NiN model is quite similar to AlexNet, because it is based on the AlexNet architecture with slight improvements.

2) NiN Results
NiN uses 11x11, 5x5 and 3x3 conv layers, and its channel numbers are identical to AlexNet's. We added a max-pooling layer with stride 2 and a 3x3 kernel. The number of output channels is equal to the number of classes, making the NiN denser, and a global average pooling is performed to derive the logit vector. Since the model is dense, it took more time to finish training for the given hyperparameters. In VGG, the fully connected layers use ReLU activation, and we added 5 sequential blocks, each having consecutive conv layers with 3x3 kernels and 2x2 max-pooling with stride 2 to halve the resolution of the sample. The first conv layer has 64 output channels over a 112x112 map. The number of channels doubles after each block while the spatial size halves, reaching 7x7 before being flattened and passed to the fully connected layers.
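The NiN-style global average pooling mentioned above can be sketched for a single sample (a toy version; real frameworks operate on batches and the channel counts here are illustrative):

```python
import numpy as np

def global_avg_pool(fmap):
    """Collapse each channel's HxW map to a single value, giving one
    logit per class in a NiN-style head (no FC layer needed)."""
    return fmap.mean(axis=(0, 1))

# 2x2 spatial map with 10 'class' channels
fmap = np.arange(2 * 2 * 10, dtype=float).reshape(2, 2, 10)
logits = global_avg_pool(fmap)
logits.shape   # (10,)
```

Because this head has no trainable weights, it removes the millions of FC parameters that dominate AlexNet and VGG, which is the overfitting-reduction argument made for NiN earlier.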

3) VGG-11 Results
The performance of VGG is computation-intensive; however, we obtained the maximum training accuracy for the same hyperparameters used in all the other models.
For GoogLeNet, the first layer uses 64 kernels, with 1x1 and 3x3 conv layers of 24 channels. The other layers are as given in the figure above. We placed 3 inception blocks in our network as shown in the figure in the GoogLeNet introduction section. Each block consists of 4 parallel paths. The first 3 paths have conv layers of dimensions 1x1, 3x3 and 5x5 to extract features at different spatial sizes. The middle paths perform a 1x1 conv to reduce model complexity by shrinking the number of input channels. The fourth path uses 3x3 max-pooling along with a 1x1 conv layer to modify the channels. A detailed explanation of the code design and flow is given in the source code.
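The parameter saving from the 1x1 reduction paths can be checked with quick arithmetic (illustrative filter counts, loosely following the 192-channel Inception example given earlier):

```python
def conv_params(in_ch, out_ch, k):
    """Weight count of a conv layer with k x k kernels (biases ignored)."""
    return k * k * in_ch * out_ch

# 5x5 path applied directly to a 192-channel input, producing 32 feature maps
direct = conv_params(192, 32, 5)                                # 153600 weights
# Same path with a 1x1 bottleneck down to 16 channels first
bottleneck = conv_params(192, 16, 1) + conv_params(16, 32, 5)   # 15872 weights
print(direct, bottleneck)
```

Nearly a 10x reduction in weights for that path, which is why the 1x1 reductions make the four-path block affordable.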

4) GoogLeNet Results
The layer is named a sequential layer because we added a group of layers (the 4 paths) to be executed simultaneously. The final fully connected layer receives the flattened 2D array data and determines the class. ResNet uses a special type of function, given below, as the basis for its residual blocks. For every function f in a function class F there exists a set of weights W; the best fit f* for the target function over the training data (X, y) is found by solving

f* := argmin_f L(X, y, f) subject to f ∈ F.
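The residual mapping y = F(x, {W_i}) + x described in the ResNet section can be sketched as a toy 1D residual block (illustrative weights; a real block uses convolutions and batch normalization rather than dense matrices):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Identity-shortcut residual block: y = relu(F(x) + x), where the
    residual branch F is two weighted transforms with a ReLU in between."""
    f = relu(x.dot(w1)).dot(w2)   # residual branch F(x, {W1, W2})
    return relu(f + x)            # shortcut adds the input unchanged

x = np.ones(4)
w1 = np.zeros((4, 4))             # zero weights make F(x) == 0 ...
w2 = np.zeros((4, 4))
out = residual_block(x, w1, w2)   # ... so the block reduces to the identity
```

The zero-weight case shows why residual blocks are easy to train: the block only has to learn the *deviation* from the identity, and an untrained (near-zero) branch does no harm.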

5) Resnet-18 Results
The first two layers are similar to GoogLeNet, except that ResNet has 4 blocks of residual function modules. Each module contains a 1x1 conv layer. The max-pooling layer here is used to reduce the dimensions of the sample. It can be seen that the width and height of the output from each residual module are halved while the number of kernels is doubled. Due to this, we can achieve low training loss in less time compared to other models. We can easily create other ResNet models such as ResNet-152 or ResNet-50 by changing the residual blocks, making the network deeper to detect more classes effectively. A detailed look at the code design and flow is given in our source code.

6) Comparison of the above Models
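The halve-resolution/double-channels pattern described above can be traced with a short shape calculation. This is a sketch, not the thesis code, and it assumes a 224x224 input and the standard ResNet-18 stage layout (a stride-2 7x7 stem conv plus a stride-2 max-pool, then four residual stages).

```python
def resnet18_stage_shapes(h=224, w=224):
    # Stem: 7x7 stride-2 conv followed by 3x3 stride-2 max-pool -> /4 overall.
    h, w = h // 4, w // 4
    channels = 64
    shapes = [(channels, h, w)]        # after stage 1 (stem + first stage)
    for _ in range(3):                 # stages 2-4: halve H and W, double channels
        h, w = h // 2, w // 2
        channels *= 2
        shapes.append((channels, h, w))
    return shapes

shapes = resnet18_stage_shapes()
# [(64, 56, 56), (128, 28, 28), (256, 14, 14), (512, 7, 7)]
```

Each halving of resolution with a doubling of channels keeps the per-stage computation roughly balanced, which is part of why the network trains efficiently.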

7) Yolo Object Detection Result
We have performed the object detection algorithm on a personal laptop with an Intel i5 CPU, a 2 GB Nvidia MX150 integrated GPU and 8 GB of RAM, and were able to obtain real-time detection. The network has 75 convolution layers, and the remaining 31 layers include pooling, FCN and other layers. A detailed explanation of the code design and flow is given in the source code. Three objects were detected in this image: dog, person and horse. Non-max suppression is used to detect multiple objects and compute the mAP: it takes the list of candidate detections and keeps those whose scores exceed a predefined threshold, and the intersection over union and confidence thresholds are then responsible for drawing the bounding boxes and detecting the objects. We have used the pre-trained weights file available from the YOLO darknet's official website, which is capable of detecting up to 80 classes. We performed a real-time demonstration using the YOLO algorithm and obtained detection at around 20 FPS, which can be further improved by using a better CPU+GPU platform.

We have also performed an inference simulation for a set of sample inputs, and the output is predicted with prediction probabilities. At each instance, the terminal shows the kernel configurations made within the FPGA compiler parameters. The quantized weight parameters have been used for running the model. We performed two simulations for this model with two different datatypes, namely 8-bit and 16-bit integers: an 8-bit memory-management processor carries out instructions at 2^8 per second, while a 16-bit memory-management processor can carry out instructions at 2^16 per second. It is evident from the theory and our results that 16-bit can give faster inference results.
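The confidence-threshold and IoU steps of non-max suppression described above can be sketched as follows. This is a minimal, generic illustration, not the darknet implementation; the 0.5 confidence and 0.45 IoU thresholds are placeholder values.

```python
import numpy as np

def iou(a, b):
    # Boxes as (x1, y1, x2, y2); intersection over union of two boxes.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def nms(boxes, scores, conf_thresh=0.5, iou_thresh=0.45):
    # Keep detections above the confidence threshold, then greedily drop
    # any lower-scoring box that overlaps an already-kept box too much.
    order = [i for i in np.argsort(scores)[::-1] if scores[i] >= conf_thresh]
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)  # the two overlapping boxes collapse to one
```

The second box overlaps the first with IoU ≈ 0.68, so only the higher-scoring one survives, while the distant third box is kept.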
3) Simulation of Yolov3 on ALVEO_U50
We have performed an inference simulation for a set of sample inputs, and the output is predicted with prediction probabilities. A set of 10 sample images has been fed in, and each image may contain multiple objects. The YOLO algorithm can predict all the objects in a given image with a prediction probability. The time taken to complete the inference was not derived; some configurations have to be made in the source code to print the time taken. The Xilinx U50 FPGA was recently launched in May 2020, and we did not have enough time to study the flow of the source code.

V. CONCLUSION, DISCUSSIONS AND FUTURE WORKS:
Thus this project has accomplished its primary objective, which is to run the object detection algorithm on an FPGA. Many reported successes of deep learning algorithms for computer vision tasks have motivated the development of hardware implementations of CNNs. In particular, there has been increased interest in FPGAs as a platform to accelerate the post-training inference computations of CNNs. To achieve high performance and low energy cost, a CNN accelerator must 1) fully utilize the limited computing resources to maximize parallelism, 2) exploit data locality by keeping only the required data in on-chip buffers to minimize the cost of hardware memory management, and 3) manage the data storage patterns in the hardware to increase data reuse.
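The data-locality and reuse principles in points 2) and 3) are the same ones behind blocked (tiled) matrix multiplication, the core kernel of CNN inference. The following NumPy sketch is an illustration of the idea only, not the Vitis-AI accelerator code; the tile size of 32 is an arbitrary stand-in for whatever fits the on-chip buffers.

```python
import numpy as np

def tiled_matmul(a, b, tile=32):
    # Blocked matrix multiply: each (tile x tile) block of A and B is loaded
    # once and reused across a whole tile of the output, which is the
    # software analogue of keeping data resident in on-chip buffers.
    n, k = a.shape
    _, m = b.shape
    c = np.zeros((n, m))
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                c[i:i + tile, j:j + tile] += (
                    a[i:i + tile, p:p + tile] @ b[p:p + tile, j:j + tile]
                )
    return c

a = np.random.rand(64, 48)
b = np.random.rand(48, 40)
c = tiled_matmul(a, b)  # matches a @ b, but with tiled access order
```

On an FPGA the same restructuring determines how often each weight and activation must be re-fetched from off-chip memory, which dominates both latency and energy.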
The state-of-the-art CNN algorithms are studied and demonstrated on different hardware platforms. We have also discussed in detail the hardware architecture of the ALVEO U200/U50 FPGA boards, which provides efficient design methods and tools for optimum performance. We also trained 5 CNN architectures, analyzed their performance and plotted the efficiency graphs using IPython notebooks on the Google Cloud Platform. The dataset used for training is the Fashion-MNIST dataset, which has 10 classes. Furthermore, the design flow to prepare a dataset from scratch, and the considerations needed to develop and train our own CNN model, have been given along with this work, together with newly updated CNN models not covered in this report, their applications and future works to improve our study. Although ASICs provide better inference performance, we can see that the FPGA is still preferred because of its portability and reliability.
Future Works: We were not able to deploy the algorithm on an edge platform due to a version error, lack of support on the FPGA hardware and time constraints. Future works may include deploying it on board and controlling actuators based on the inferred inputs; since FPGAs can be programmed for a wide range of tasks, the design can be readily deployed in mobile devices. A more efficient and robust object detection algorithm called YOLOv4 was released in early 2020, which claims that it can be deployed on mobile phones using PC hardware for inference via wireless connections. This has many prospective applications in education, GPS guidance and even cooking. With the growing 5G technology, real-time inference can reach its true potential, enabling massive wireless data transfer with the grid and providing a robust incremental learning environment. This work can also be used to study the power consumed on each platform, along with its inference time and efficiency values.

VI. REFERENCES