Object Detection using YOLO And Mobilenet SSD: A Comparative Study

Sabina N; Aneesa M. P; Haseena P. V

doi:10.17577/IJERTV11IS060065

Volume 11, Issue 06 (June 2022)

Open Access
Article Download / Views: 4,117
Authors : Sabina N, Aneesa M. P, Haseena P. V
Paper ID : IJERTV11IS060065
Volume & Issue : Volume 11, Issue 06 (June 2022)
Published (First Online): 14-06-2022
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License

Sabina N.

CSE Department

MGM College of Engineering and Pharmaceutical Sciences Valanchery, India

Aneesa M.P.

CSE Department

MGM College of Engineering and Pharmaceutical Sciences Valanchery, India

Haseena P.V.

Assistant Professor, CSE Department

MGM College of Engineering and Pharmaceutical Sciences Valanchery, India

Abstract: Object detection is a computer vision technology that deals with detecting instances of semantic objects of a given category in digital images and videos. Its main purpose is to identify and locate one or more targets in still images or video data. It draws on a wide range of techniques, including image processing, pattern recognition, artificial intelligence, and machine learning. Face detection, self-driving cars, vehicle detection, and several other technologies rely on object detection. Real-time object detection is an active and complex area of computer vision that needs fast computation to identify objects as they appear. The accuracy of object detection has increased tremendously with the advancement of deep learning techniques. In this work, two single-stage object detection models, YOLO and MobileNet SSD, are analysed based on their performance in different scenarios. Both models use convolutional neural networks for object detection. The parameters used to assess detection performance include the loss function (LP), mean average precision (MAP), and frames per second (FPS).

Keywords: Object Detection, CNN, YOLO, MobileNet SSD, Detection Accuracy

INTRODUCTION

Object detection relies on deep neural networks built from artificial neurons, loosely analogous to the networks of neurons in the human brain. Object detection refers to the detection and localization of objects in an image that belong to a predefined set of classes. Tasks like detection, recognition, and localization find widespread applicability in real-world scenarios, making object detection (also referred to as object recognition) a very important subdomain of computer vision.

Generally, object detection can be categorized into two as,

1. Two Stage Detector – where the detection completes in two steps. The first step uses a Region Proposal Network to generate regions of interest that have a high probability of containing an object. The second step performs the final classification and bounding box regression of objects. RCNN, Fast RCNN, SPPNET, and Faster RCNN are some of the two stage detectors.

2. One Stage Detector – where object detection is treated as a simple regression problem that takes an input image and learns the class probabilities and bounding box coordinates. YOLO, YOLO v2, SSD, and RetinaNet come under the one stage detectors.

Object detection is an advanced form of image classification where a neural network predicts objects in an image and points them out in the form of bounding boxes.

The main purpose of our analysis is to compare the operational performance and accuracy of the object detection techniques YOLO and MobileNet SSD in different aspects and to highlight some of the notable features that make this study stand out.

YOLO (YOU ONLY LOOK ONCE)

This architecture takes an input image and resizes it to 448×448, retaining the original aspect ratio and filling the remaining area with padding. The image is then passed to the CNN. This particular model has 24 convolutional layers and 4 max-pooling layers, followed by 2 fully connected layers.

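
The resize-and-pad step described above can be sketched as a small geometry calculation. This is a minimal illustration, assuming the padding is split evenly between opposite sides; the function name `letterbox_geometry` and the returned dictionary layout are hypothetical and not taken from any YOLO implementation.

```python
def letterbox_geometry(width, height, target=448):
    """Compute the resize scale and padding needed to fit an image of
    (width, height) into a target x target square without distortion."""
    # One scale for both axes keeps the aspect ratio unchanged.
    scale = min(target / width, target / height)
    new_w = round(width * scale)
    new_h = round(height * scale)

    # Whatever space is left is filled with padding (assumed centred).
    pad_x = target - new_w
    pad_y = target - new_h
    left, top = pad_x // 2, pad_y // 2
    right, bottom = pad_x - left, pad_y - top

    return {
        "scale": scale,
        "size": (new_w, new_h),
        "pad": (left, top, right, bottom),
    }


if __name__ == "__main__":
    # A 640x480 frame is scaled along its longer side and padded vertically.
    print(letterbox_geometry(640, 480))
```

The same scale and padding values are needed later to map predicted boxes back to the coordinates of the original image.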
MOBILENET SSD
1. What is MobileNet SSD?
  
  Convolutional neural networks are used to build a model consisting of multiple layers that classifies the given objects into one of the defined classes. Objects are detected using higher-resolution feature maps, which recent advances in deep learning for image processing have made possible. MobileNet SSD is an object detection model that computes the bounding box and class of an object from an input image. This Single Shot Detector (SSD) model uses MobileNet as the backbone and achieves fast object detection optimized for mobile devices.
2. SSD
  
  The term SSD stands for Single Shot Detector. The SSD technique is based on a feed-forward convolutional network that generates a fixed-size collection of bounding boxes and scores for the presence of object class instances in those boxes, followed by a non-maximum suppression step to produce the final detections [9]. Boxes contain offset values (cx, cy, w, h) from the default box. Scores contain confidence values for the presence of each of the object categories; the value 0 is reserved for the background.
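
The non-maximum suppression step mentioned above can be illustrated with a greedy, single-class sketch in plain Python. It assumes boxes are given as corner coordinates (x1, y1, x2, y2), and the overlap threshold of 0.45 is an illustrative default; the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.45):
    """Greedy NMS: return indices of the boxes to keep, best score first."""
    # Visit candidates from the highest to the lowest confidence score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(non_max_suppression(boxes, scores))
```

In a multi-class detector this procedure would typically be run separately for each class.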
  
  SSD introduces multi-reference and multi-resolution detection techniques. Multi-reference techniques define a set of anchor boxes of different sizes and aspect ratios at different locations of an image, and then predict the detection box based on these references. Multi-resolution techniques allow detecting objects at several scales and at different layers of the network. An SSD network detects multiple object classes in images by generating confidence scores for the presence of each object category in each default box. It also produces adjustments to the boxes to better match the object shapes. This network is suited for real-time applications since it does not resample features for bounding box hypotheses.
  
  The SSD architecture is CNN-based, and it detects the target classes of objects in two stages: (1) extract the feature maps, and (2) apply convolutional filters to detect the objects. SSD uses VGG16 to extract feature maps. Then, it detects objects using the Conv4_3 layer of VGG16. Each prediction is composed of a bounding box and 21 class scores (including one extra class for no object); the class with the highest score is selected as the one for the bounded object [3]. The major objective during training is to get a high class confidence score, and this can be attained by matching the default boxes with the ground truth boxes.
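
The multi-reference idea can be made concrete by generating the default (anchor) boxes for one square feature map. This is a minimal sketch, assuming normalized coordinates in [0, 1] and box sizes of scale·√ratio by scale/√ratio; the function name, the scale value, and the set of aspect ratios are illustrative assumptions.

```python
import math


def default_boxes(feature_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Generate (cx, cy, w, h) default boxes for a square feature map.

    One box per aspect ratio is centred on every cell of the map.
    """
    boxes = []
    for row in range(feature_size):
        for col in range(feature_size):
            # Centre of the current cell in normalized image coordinates.
            cx = (col + 0.5) / feature_size
            cy = (row + 0.5) / feature_size
            for ratio in aspect_ratios:
                # Width and height vary with the ratio; the area stays scale**2.
                w = scale * math.sqrt(ratio)
                h = scale / math.sqrt(ratio)
                boxes.append((cx, cy, w, h))
    return boxes


if __name__ == "__main__":
    boxes = default_boxes(feature_size=2, scale=0.2)
    print(len(boxes), boxes[0])
```

Running the same function on feature maps of different sizes, with a different scale for each, gives the multi-resolution set of references described above.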
3. MobileNet
MobileNets are a class of efficient models for mobile and embedded vision applications. This class of models is based on a streamlined architecture that uses depthwise separable convolutions to build lightweight deep neural networks. A depthwise separable convolution is a form of factorized convolution: it factorizes a standard convolution into a depthwise convolution and a 1×1 convolution known as a pointwise convolution. In MobileNets, the depthwise convolution applies a single filter to each input channel. The pointwise convolution then applies a 1×1 convolution to combine the outputs of the depthwise convolution. A standard convolution both filters and combines inputs into a new set of outputs in a single step, whereas the depthwise separable convolution splits this into two layers, one for filtering and one for combining. This factorization drastically reduces computation and model size [8].
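
The saving from this factorization can be checked by counting multiply-accumulate operations. The sketch below assumes a square kernel of size k, m input channels, n output channels, and a square output feature map of side f, with stride 1 and no bias terms; the function names and the layer sizes in the example are illustrative.

```python
def standard_conv_cost(k, m, n, f):
    """Multiply-accumulates of a standard k x k convolution."""
    return k * k * m * n * f * f


def separable_conv_cost(k, m, n, f):
    """Multiply-accumulates of a depthwise separable convolution."""
    depthwise = k * k * m * f * f   # one k x k filter per input channel
    pointwise = m * n * f * f       # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    k, m, n, f = 3, 32, 64, 112     # illustrative layer dimensions
    std = standard_conv_cost(k, m, n, f)
    sep = separable_conv_cost(k, m, n, f)
    # The ratio simplifies to 1/n + 1/k**2, independent of m and f.
    print(std, sep, sep / std, 1 / n + 1 / k**2)
```

Because the ratio is 1/n + 1/k², a 3×3 kernel bounds the saving at a factor of nine, which is approached as the number of output channels grows.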

MobileNet models can be applied to various recognition tasks for efficient on-device intelligence.
COMPARATIVE ANALYSIS

Different metrics have been proposed to measure object localization accuracy. The Intersection over Union (IoU), also called the Jaccard Index, is commonly used to evaluate the accuracy of detections. It is calculated as the area of overlap between a predicted detection and its corresponding ground truth divided by the area of their union. For binary or multi-class detection problems, the mean IoU for an image is computed by taking the IoU of each class and averaging them. This can be applied to all the images of the test dataset to obtain an average IoU value. Another related detection metric is the F1-score (also called the Dice Coefficient), which is calculated as twice the area of overlap divided by the total number of pixels contained in the detected and the ground truth regions. This measure can also be expressed in terms of the Precision and Recall metrics. It can likewise be applied to all the target objects present in an image, and the average F1 score can be computed over all images of the test dataset. The IoU and F1-score metrics are related and positively correlated for a given fixed ground truth. This means that if one model is better than another under IoU, it will also be better under the F1 score. When averaging over a set of detections in images, however, the IoU metric tends to penalize single inaccurate detections quantitatively more than the F1-score, even though both indicate that a given object instance is badly detected [3].
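
A short sketch makes the link between the two metrics explicit for a single detection. It assumes axis-aligned boxes given as (x1, y1, x2, y2); the function names and the example boxes are illustrative.

```python
def overlap_and_areas(pred, truth):
    """Return (intersection area, predicted area, ground-truth area)."""
    iw = max(0.0, min(pred[2], truth[2]) - max(pred[0], truth[0]))
    ih = max(0.0, min(pred[3], truth[3]) - max(pred[1], truth[1]))
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_truth = (truth[2] - truth[0]) * (truth[3] - truth[1])
    return iw * ih, area_pred, area_truth


def iou_score(pred, truth):
    """Overlap divided by union (Jaccard Index)."""
    inter, area_pred, area_truth = overlap_and_areas(pred, truth)
    union = area_pred + area_truth - inter
    return inter / union if union > 0 else 0.0


def dice_score(pred, truth):
    """Twice the overlap divided by the sum of both areas (F1 / Dice)."""
    inter, area_pred, area_truth = overlap_and_areas(pred, truth)
    total = area_pred + area_truth
    return 2 * inter / total if total > 0 else 0.0


if __name__ == "__main__":
    pred, truth = (0, 0, 10, 10), (5, 0, 15, 10)
    j, d = iou_score(pred, truth), dice_score(pred, truth)
    # For one detection the two scores are tied by d = 2j / (1 + j).
    print(j, d, 2 * j / (1 + j))
```

Since 2j / (1 + j) increases with j and is never smaller than j on [0, 1], the two scores rank single detections identically, while a poor detection receives a lower IoU than Dice value.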

The standard metrics normally used for analyzing object detection accuracy and speed include recall, precision, F1 score (F1), mean average precision (MAP), and frames per second (FPS). In the target detection process, precision is the ratio of correctly detected targets to the number of all detected targets, and recall is the ratio of correctly detected targets to all targets in the sample set. F1 is the harmonic mean of precision and recall. Average precision (AP) is the precision across all elements of a category.

Numerically, MAP is the average of the AP values across all categories, and this value is used to evaluate the overall performance of the model.
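
These definitions can be written compactly as follows. The notation is an assumption introduced here: TP, FP, and FN denote true positives, false positives, and false negatives, C is the number of categories, and p_c(r) is the precision of category c at recall r. Expressing AP as the area under the precision-recall curve is a common convention and may differ in detail from the formulation intended above.

```latex
\begin{align}
\text{Precision} &= \frac{TP}{TP + FP}, &
\text{Recall} &= \frac{TP}{TP + FN}, \\
F1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}
           {\text{Precision} + \text{Recall}}, \\
AP_c &= \int_0^1 p_c(r)\, dr, &
MAP &= \frac{1}{C} \sum_{c=1}^{C} AP_c .
\end{align}
```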

FPS is an indicator that is commonly used for evaluating the speed of model detection. The number of images that can be processed per second is referred to as FPS.
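
FPS can be estimated by timing a detector over a batch of frames. The sketch below is a minimal illustration in which `process_frame` stands in for any detection call; the function name, the warm-up step, and the dummy workload are assumptions, not part of either model.

```python
import time


def measure_fps(process_frame, frames, warmup=1):
    """Return the average number of frames processed per second."""
    # Run a few frames first so one-off start-up costs are not timed.
    for frame in frames[:warmup]:
        process_frame(frame)

    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)
    elapsed = time.perf_counter() - start

    return len(frames) / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    # Dummy "detector": sums the pixel values of a fake frame.
    fake_frames = [list(range(10_000)) for _ in range(50)]
    print(f"{measure_fps(sum, fake_frames):.1f} FPS")
```

Measured values depend strongly on the hardware, the input resolution, and whether pre- and post-processing are included in the timed call, so comparisons are meaningful only under identical conditions.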

Both detectors can produce acceptable results for different object sizes, illumination conditions, image perspectives, partial occlusion, complex backgrounds, and multiple objects in scenes. One of the major strengths of the SSD model is the near elimination of false positive (FP) cases, which is preferable in analysis-related applications. On the other hand, YOLOv3 produces better average results. YOLO struggles to localize objects precisely, while SSD is faster than earlier single-shot detectors.

For real-time purposes, speed and accuracy are the determining factors for smooth functioning. YOLO variants (especially up to YOLOv3) provide excellent accuracy but require computation-intensive hardware. On such hardware, these models meet the speed requirement. MobileNet-SSD V2 provides a speed somewhat similar to that of YOLOv5s, but it falls short in accuracy. SSD can be the better choice when we are able to run it on video and the accuracy trade-off is very modest. YOLO is the better option when accuracy matters more than running as fast as possible. So, either of the models can be chosen depending on the requirements of the application.

CONCLUSIONS

Real-time object detection and tracking on video streams is a crucial topic for surveillance systems in many field applications. The objective of our paper is to make a comparative study of two CNN-based object recognition systems for identifying objects in images. We studied and analyzed the YOLO object detection model and the MobileNet SSD model for performance evaluation in different scenarios. Each of the compared models has its own unique properties and is successful in its respective applications. YOLO provides better accuracy than MobileNet SSD, while MobileNet SSD provides higher detection speed.

REFERENCES

[1] Ashwani Kumar, Sonam Srivastava, Object Detection System Based on Convolution Neural Networks Using Single Shot Multi-Box Detector, Third International Conference on Computing and Network Communications (CoCoNet19).

[2] Alexey Bochkovskiy, Chien-Yao Wang, Hong-Yuan Mark Liao, YOLOv4: Optimal Speed and Accuracy of Object Detection, arXiv:2004.10934v1, 23 April 2020.

[3] Ángel Morera, Ángel Sánchez, A. Belén Moreno, Ángel D. Sappa and José F. Vélez, SSD vs. YOLO for Detection of Outdoor Urban Advertising Panels under Multiple Variabilities, Sensors 2020, 20, 4587; doi:10.3390/s20164587.

[4] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov and Liang-Chieh Chen, MobileNetV2: Inverted Residuals and Linear Bottlenecks, arXiv:1801.04381v4.

[5] Mohit Phadtare, Varad Choudhari, Rushal Pedram and Sohan Vartak, Comparison between YOLO and SSD Mobile Net for Object Detection in a Surveillance Drone, IJSREM, 2021.

[6] Harshal Honmote, Pranav Katta, Shreyas Gadekar and Prof. Madhavi Kulkarni, Real Time Object Detection and Recognition using MobileNet-SSD with OpenCV, International Journal of Engineering Research & Technology (IJERT), ISSN: 2278-0181, Vol. 11 Issue 01, January-2022.

[7] Lu Tan, Tianran Huangfu, Liyao Wu and Wenying Chen, Comparison of RetinaNet, SSD, and YOLO v3 for real-time pill identification, BMC Medical Informatics and Decision Making (2021) 21:324, https://doi.org/10.1186/s12911-021-01691.

[8] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto and Hartwig Adam, MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications, arXiv:1704.04861v1.

[9] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg, SSD: Single Shot MultiBox Detector, arXiv:1512.02325v5.
