
HALE-Net: A Novel Violence Detection Model using Hybrid Active Learning Ensemble Network

DOI : 10.17577/IJERTV14IS050259


Umaparvathy U,

PG Scholar, Department of Electronics and Communication Engineering, MIT, Kayamkulam

Ponnambili S

Assistant Professor, Department of Electronics and Communication Engineering, MIT, Kayamkulam

ABSTRACT: Violent activities in public areas pose a significant threat to social security, necessitating the development of efficient surveillance systems. Traditional manual monitoring is often inadequate because it relies on human observation, which can be slow and prone to errors. In response, this paper proposes a novel automated violence detection system that employs a hybrid deep learning model combining EfficientNet and MobileNet. The model leverages EfficientNet's robust feature extraction capabilities and MobileNet's lightweight architecture to ensure high accuracy and real-time processing efficiency. To enhance the training process, we incorporate active learning techniques, enabling the model to focus on the most informative data samples and thereby minimizing the need for extensive labeled datasets. Additionally, an ensemble classifier is implemented to improve predictive performance by combining the strengths of multiple classifiers, leading to more reliable and accurate outcomes. The proposed system aims to surpass the limitations of existing violence detection models by achieving a balance between speed and precision, ultimately contributing to the advancement of automated public safety measures. Through rigorous evaluation, we demonstrate that our approach effectively enhances the performance of automated violence detection in complex environments, paving the way for safer public spaces.

KEYWORDS: Violence Detection, Active Learning, Ensemble Networks, MobileNet, EfficientNet, Deep Learning, Video Surveillance.

  1. INTRODUCTION

    Violence detection in surveillance videos is a critical task in public security and crime prevention. The need for thorough public safety monitoring using video surveillance cameras has grown dramatically in response to escalating security and safety concerns. However, identifying unusual behaviors is made extremely difficult by the sheer volume of video data produced by these cameras, as well as by the scarcity and variety of unusual occurrences such as theft, assault, and other crimes. Manual monitoring of this vast amount of data is labor-intensive, impractical, and prone to errors, which underscores the urgent need for automated and efficient violence detection technologies.

    Figure 1.1 Violent activities by state

    With the increasing availability of video data, automated violence detection models have gained significant attention to enhance real-time monitoring systems. Traditional approaches relying on handcrafted features often fail to generalize across diverse environments, making deep learning-based methods more effective in capturing complex spatiotemporal patterns.

    Figure 1.2 Violent activities in different states

    In this research, we propose a novel violence detection model that leverages a Hybrid Active Learning Ensemble Network (HALE-Net), integrating MobileNet and EfficientNet for feature extraction. The ensemble strategy enhances model robustness, while active learning efficiently selects the most informative samples for annotation, reducing labeling cost. MobileNet offers lightweight computation, making it suitable for real-time applications, whereas EfficientNet provides enhanced accuracy through optimized network scaling. By combining these architectures, our approach balances efficiency and performance, ensuring a scalable and accurate violence detection framework.

    The contributions of this study are threefold:

    1. Hybrid Ensemble Learning: We fuse MobileNet and EfficientNet to improve feature representation and classification accuracy.

    2. Active Learning Optimization: A query strategy is employed to select critical instances, minimizing annotation effort while maintaining high detection accuracy.

    3. Performance Benchmarking: The proposed model is evaluated against benchmark datasets, demonstrating its superiority over existing methods in terms of precision, recall, and computational efficiency.

    The experimental results highlight the effectiveness of our model in identifying violent activities across various video scenarios, showcasing its potential for real-world security applications.

    The remainder of this paper is organized as follows: Section 2 reviews related works, Section 3 details the methodology, Section 4 presents the experimental setup, Section 5 discusses the results, and Section 6 concludes with future research directions.

  2. RELATED WORKS

    1. Violence Detection in Surveillance Systems

      Traditional surveillance systems rely on manual inspection, which is time-consuming and prone to errors. Surveillance personnel often face fatigue and cognitive overload, leading to missed events and slower response times. To address these shortcomings, automated violence detection systems leveraging deep learning have been developed. In [6], a hybrid deep learning model combining 2D and 3D CNNs was proposed for violence detection in videos, effectively capturing both spatial and temporal features. This approach outperformed traditional 2D CNNs by modeling motion changes across frames, achieving high accuracy and demonstrating strong potential for real-time surveillance in edge computing environments.

      In [1], Shripriya et al. (2021) introduced a violence detection system combining ResNet for deep feature extraction and SSD for object detection. This deep learning-based approach improves accuracy and real-time responsiveness, enabling automatic alerts in sensitive environments, and its feasibility for deployment on mobile devices marks a significant improvement over traditional motion-based methods. In [2], Das et al. (2019) proposed a violence detection framework using HOG features with a preprocessing step for enhanced frame quality. Various machine learning classifiers were evaluated, with Random Forest achieving 86% accuracy. The approach proved effective in surveillance settings, offering improved performance over earlier techniques.

      In [3], Aarthy and Nithya (2022) proposed a lightweight violence detection method combining keyframe extraction with a pre-trained VGG16 model. By eliminating redundant frames, the approach reduced computational complexity while maintaining accuracy. Tested on the Hockey Fight dataset, the model demonstrated strong performance in challenging surveillance conditions, highlighting its suitability for real-world applications.

      In [4], Khalfaoui et al. (2024) introduced a lightweight hybrid model combining MobileNetV3 and LSTM with a Bi-Directional Motion Attention mechanism for efficient violence detection. Tested on multiple datasets, the model achieved 80.96% accuracy, demonstrating strong performance and real-time suitability for edge devices in smart city surveillance systems. In [5], Sachan et al. (2023) presented an intelligent violence detection system tailored for crowded environments, comparing ResNet50 and YOLOv8 architectures. While ResNet50 excelled in accuracy through deep feature learning, YOLOv8 proved more effective for real-time surveillance, highlighting key trade-offs for deployment in dynamic public safety scenarios.

      In [9], Vosta and Yow (2023) introduced KianNet, a deep learning model combining ResNet50, ConvLSTM, and multi-head self-attention for accurate and efficient violence detection in surveillance footage. By capturing rich spatiotemporal features and emphasizing critical motion patterns, KianNet outperformed existing methods, demonstrating strong potential for real-world deployment in safety-critical environments. In [10], Suba et al. (2022) proposed a deep learning-based violence detection system using lightweight CNNs such as MobileNetV2 and ResNet50V2 for real-time surveillance. Designed to overcome the limitations of manual monitoring, the model achieved high accuracy on datasets such as UCF-Crime and RLVS, offering efficient threat detection and real-time alerting for public safety in resource-constrained environments.

  3. METHODOLOGY

      1. System Overview

        The architecture of HALE-Net consists of four main modules:

        • Data Pre-processing: Frame extraction, selection, and resizing of the input video.

        • Hybrid Feature Extraction: EfficientNet and MobileNet for robust feature extraction.

        • Active Learning Module: Selects informative samples for model refinement.

        • Ensemble Classifier: Combines multiple classifiers to ensure improved accuracy.

        Figure 3.1 HALE-Net architecture.

        1. Data pre-processing:

          To facilitate further analysis and model training, a video data preprocessing stage was implemented. Initially, the system prompts the user to input the filename of a video file. The video is then loaded using MATLAB's VideoReader object, which allows sequential access to individual frames. A specific range of frames, from frame 250 to frame 400, is selected for visual inspection. These frames are extracted and displayed sequentially to verify the content of the chosen segment. This step helps ensure that only relevant segments of the video are considered for subsequent processing, thereby reducing computational overhead. Following this, frame number 302 is isolated for detailed analysis.

          To ensure consistency in input dimensions, particularly when the frames are to be used for feature extraction or as input to a deep learning model, the selected frame is resized to a fixed resolution of 256×256 pixels. The resized frame is stored for use in subsequent stages, such as feature extraction, classification, or training. This preprocessing stage ensures uniformity and relevance in the input data, which are critical for achieving reliable and accurate outcomes in later stages of video-based analysis.
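
          A minimal MATLAB sketch of this preprocessing step is given below. It assumes the video file is in the working directory and uses the frame indices stated above; the variable names are illustrative.

          % Prompt for the video file and open it for sequential frame access
          fileName = input('Enter video filename: ', 's');
          v = VideoReader(fileName);

          % Display frames 250-400 for visual inspection of the chosen segment
          for k = 250:400
              frame = read(v, k);                 % read the k-th frame
              imshow(frame); title(sprintf('Frame %d', k)); drawnow;
          end

          % Isolate frame 302 and resize it to the 256x256 input resolution
          selectedFrame = read(v, 302);
          resizedFrame  = imresize(selectedFrame, [256 256]);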

          Figure 3.2 Dividing the video into frames

        2. Feature extraction:

          To capture high-level semantic representations from the visual data (i.e., frames extracted from videos), we employed a hybrid deep learning approach based on the MobileNetV2 and EfficientNet architectures. Both networks are known for their computational efficiency and strong feature extraction capabilities on resource-constrained systems, making them suitable for real-time violence detection applications. EfficientNet was fine-tuned on a custom image dataset comprising violent and non-violent video frames. A modified layer graph was constructed to suit the binary classification task. The network was trained using the Adam optimizer with a learning rate of 0.0001 over 10 epochs, using MATLAB's trainNetwork() function.
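
          The fine-tuning configuration described above can be expressed in MATLAB roughly as follows; imdsTrain (the datastore of labeled frames) and lgraph (the modified EfficientNet layer graph) are assumed to have been prepared beforehand, and the mini-batch size is an assumption not stated in the text.

          % Training options as described: Adam optimizer, learning rate 1e-4, 10 epochs
          opts = trainingOptions('adam', ...
              'InitialLearnRate', 1e-4, ...
              'MaxEpochs', 10, ...
              'MiniBatchSize', 32, ...          % assumed batch size
              'Shuffle', 'every-epoch', ...
              'Verbose', false);

          % Fine-tune the modified EfficientNet layer graph on the frame dataset
          netEff = trainNetwork(imdsTrain, lgraph, opts);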

          The resulting trained model was later used to extract feature vectors from its fully connected layer. In parallel, we utilized MobileNetV2 as a feature extractor in transfer learning mode: video frames (training images extracted from the videos) are passed through the pretrained convolutional neural network, which produces deep features capturing high-level spatial patterns. These deep features are then either fed to a classifier, such as a fully connected layer, or stored as feature vectors.

          Visual frames were passed through this network, and deep feature vectors were extracted from the 'Logits' layer using the activations() function. To ensure compatibility with the input size expected by the network, all frames were resized to 256×256 pixels using imresize(). To leverage the complementary strengths of both deep networks, we performed feature-level fusion: for each training sample, the features obtained from MobileNetV2 and EfficientNet were concatenated into a single hybrid vector. This fusion strategy enriched the feature representation by capturing both the lightweight depth-wise separable convolutional features from MobileNetV2 and the scaling-efficient features from EfficientNet.

          Figure 3.3 Processing the video and extracting representative frames.

        3. Feature Fusion Using MobileNet and EfficientNet

          To leverage the complementary strengths of different convolutional neural network architectures, deep features were extracted from both MobileNet and EfficientNet models. MobileNet features were obtained from the 'Logits' layer, while EfficientNet features were extracted from the 'fc' layer. Each feature vector captures high-level semantic information from the input images.

          To form a unified representation, feature-level fusion was performed by concatenating the output feature vectors from both networks along the feature dimension. Let $f_1 \in \mathbb{R}^{d_1}$ and $f_2 \in \mathbb{R}^{d_2}$ be the feature vectors from MobileNet and EfficientNet, respectively. The fused feature vector $f \in \mathbb{R}^{d_1 + d_2}$ is defined as:

          $f = [f_1, f_2]$

          In MATLAB, the fusion step is implemented as:

          % Feature-level fusion
          fusedFeatures = [featuresTrain1, featuresTrain2];

          This hybrid feature representation provides a richer and more discriminative descriptor, which is later used in the active learning and classification stages.
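
          A condensed sketch of the extraction-and-fusion step is shown below. It assumes netMob is the pretrained MobileNetV2 (from mobilenetv2()), netEff is the fine-tuned EfficientNet from the previous step, and imdsTrain holds the resized 256×256 frames; all names are illustrative.

          % Deep features from the two backbones
          featuresTrain1 = activations(netMob, imdsTrain, 'Logits', 'OutputAs', 'rows');  % MobileNetV2
          featuresTrain2 = activations(netEff, imdsTrain, 'fc', 'OutputAs', 'rows');      % EfficientNet

          % Feature-level fusion: concatenate along the feature dimension, f = [f1, f2]
          fusedFeatures = [featuresTrain1, featuresTrain2];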

        4. Active Learning Strategy

          To enhance the informativeness of the training set while minimizing redundancy, we apply a multi-strategy sample selection mechanism on the fused feature space. The goal is to retain the data points that are most uncertain or informative for classification. Three distinct criteria are employed: margin sampling, vote entropy, and feature-entropy thresholding; a minimal MATLAB sketch of the combined selection procedure is given after this list.

          1. Margin Sampling

            To simulate uncertainty in classification, a margin-based sampling strategy was applied to the fused feature space. For each data point, a synthetic binary label was generated by comparing the values of the first two feature dimensions. The margin was then computed as the product of this pseudo-label and the difference between those same two dimensions. This value reflects the model's confidence: smaller margins indicate higher uncertainty. Samples were ranked by their computed margins, and a fixed proportion (50%) of the most uncertain instances was retained for training. This approach helps focus the learning process on borderline cases, which are often more informative for model refinement.

          2. Vote Entropy

            A vote entropy strategy was employed to estimate uncertainty based on simulated ensemble disagreement. Pseudo-binary class predictions were generated by comparing the first two dimensions of the fused feature vectors, and these synthetic decisions were treated as a simulated ensemble vote. To approximate the class probability distribution, the relative frequency of each pseudo-label across the samples was computed, and from these estimated probabilities the Shannon entropy of each data point was calculated. High entropy values indicate greater uncertainty, reflecting strong disagreement among the hypothetical classifiers. Samples were then ranked in descending order of entropy, and the subset with the highest values was selected. To further reduce redundancy, a downsampling step was applied, selecting every alternate sample from the ranked list. This ensured diversity while preserving the most ambiguous and informative instances for active learning.

          3. Feature Entropy Thresholding

            In addition to the vote-based and margin-based selection strategies, feature-level entropy was used to identify high-information samples. For each fused feature vector, the Shannon entropy was computed to quantify its internal variability and information content. A fixed threshold was applied to filter out low-entropy samples, retaining only those with entropy values above the cutoff. These high-entropy instances were presumed to be more informative and potentially more discriminative. The indices of the selected samples were stored for use in the final data curation process.

          4. Final Sample Aggregation

          To construct a high-quality training dataset, the indices identified by the three sample selection strategies (margin sampling, vote entropy, and feature-entropy thresholding) were combined. Duplicate entries were removed by computing the union of all selected indices, ensuring each sample appeared only once in the final selection. Using these unique indices, the corresponding fused feature vectors were extracted from the complete feature matrix, and the associated class labels were retrieved from the original training annotations. This process resulted in a reduced but highly informative training subset, composed of the most uncertain and potentially valuable data points, to be used in subsequent classifier training.
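
          The sketch below illustrates the three selection strategies and their aggregation in MATLAB, following the descriptions above. The fused feature matrix fusedFeatures (one row per sample) and the label vector trainLabels are assumed to exist, and the feature-entropy cutoff (the median) is an assumption, since the exact threshold is not stated.

          X = fusedFeatures;                           % rows = samples, columns = fused features
          n = size(X, 1);

          % 1) Margin sampling: pseudo-label and margin from the first two feature dimensions
          pseudo = sign(X(:,1) - X(:,2));              % synthetic binary decision
          margin = pseudo .* (X(:,1) - X(:,2));        % small margin = high uncertainty
          [~, mi] = sort(margin, 'ascend');
          idxMargin = mi(1:round(0.5*n));              % retain the 50% most uncertain samples

          % 2) Vote entropy: entropy of the estimated pseudo-label probabilities
          pPos = mean(pseudo > 0);                     % relative frequency of the positive vote
          pSel = pPos*(pseudo > 0) + (1 - pPos)*(pseudo <= 0);
          voteEntropy = -(pSel.*log2(max(pSel,eps)) + (1-pSel).*log2(max(1-pSel,eps)));
          [~, vi] = sort(voteEntropy, 'descend');
          idxVote = vi(1:2:end);                       % keep every alternate ranked sample

          % 3) Feature-entropy thresholding: keep high-information feature vectors
          featEntropy = zeros(n, 1);
          for i = 1:n
              p = abs(X(i,:)) / max(sum(abs(X(i,:))), eps);   % normalise to a distribution
              featEntropy(i) = -sum(p .* log2(p + eps));
          end
          idxFeat = find(featEntropy > median(featEntropy));  % assumed cutoff

          % 4) Final aggregation: union of the three index sets, duplicates removed
          selectedIdx    = unique([idxMargin; idxVote; idxFeat]);
          selectedFeats  = X(selectedIdx, :);
          selectedLabels = trainLabels(selectedIdx);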

        5. Ensemble Classifier

    The extracted features were then passed to a Random Forest (RF) classifier comprising 50 decision trees. Each tree is trained on a random subset of the data (bagging), which reduces variance, and this internal randomness ensures diverse learners, improving generalization. The classifier was trained on the deep feature vectors from the training set, and predictions with their associated probabilities for the test set were obtained via the predict method, with majority voting across the trees producing the final class label. This hybrid approach combines the representational power of CNNs with the ensemble learning capability of RFs for improved classification performance on violence detection tasks.
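
    A brief sketch of this stage, assuming selectedFeats and selectedLabels come from the active learning step and featuresTest holds the fused features of the test frames (names are illustrative):

    % Random Forest of 50 bagged decision trees, as described above
    rfModel = TreeBagger(50, selectedFeats, selectedLabels, ...
        'Method', 'classification', 'OOBPrediction', 'on');

    % Predicted labels and per-class scores for the test set (majority vote over trees)
    [predLabels, scores] = predict(rfModel, featuresTest);
    predLabels = categorical(predLabels);    % TreeBagger returns labels as a cell array of char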

  4. EXPERIMENTAL SETUP

This section describes the experimental setup used to evaluate the proposed model, HALE-Net, and to demonstrate how it enhances violence detection performance on the UCF-Crime video dataset.

    1. Data

      Although it is difficult to locate a single dataset that encompasses every form of violent behavior, the datasets listed in Table 4.1 can all support violence detection. UCF-Crime is a benchmark violence detection dataset built from surveillance camera footage of several forms of violent behavior. It is one of the largest and most comprehensive benchmarks available for real-world anomaly and violence detection in surveillance videos, comprising 13 categories of anomalous events, including violence-related activities such as fighting, abuse, road accidents, and robbery, captured from real surveillance footage under varying environmental conditions.

      The dataset contains approximately 1,900 long untrimmed videos, spanning over 128 hours of content, and includes both normal and anomalous events. Each video is annotated with a binary label indicating whether it contains an abnormal activity. The abnormal class covers a wide range of events, making the dataset suitable for both binary classification tasks (normal vs. violence) and multiclass anomaly recognition. For this study, a subset of the UCF-Crime dataset was used, focusing specifically on violence-related categories such as Abuse, Assault, and Fighting, alongside Normal scenes representing the non-violent class. Each video was segmented into frames and resized to a consistent resolution of 256×256 pixels to provide uniform training input for the deep learning model.

      To support the active learning framework, representative samples from each category were selected iteratively based on feature uncertainty and classification margins. This selection process allowed the model to focus on the most informative instances, thereby enhancing learning efficiency and reducing labeling costs. The UCF-Crime dataset presents several challenges, including occlusions, low lighting, camera motion, and complex background activities, and therefore serves as a robust evaluation benchmark for real-world violence detection models.

      Table 4.1 Details for different datasets in violence detection

      Dataset              | Modality              | No. of videos/images | Resolution             | Label
      UCF-Crime            | Surveillance videos   | ~1,900               | Varies (240p to 480p)  | 13 classes incl. fighting; binary anomaly
      Hockey Fight         | Sports videos         | 1,000 clips          | 360×240                | Binary (Fight / Non-Fight)
      RWF-2000             | Surveillance videos   | 2,000                | 320×240                | Binary (Fight / Non-Fight)
      Violent-Flows        | Videos + optical flow | 492                  | Varies                 | Binary (Fight / Non-Fight)
      Street Fight Dataset | Real-world videos     | 200+ clips           | HD                     | Binary label
      Kinetics (subset)    | Action videos         | Varies               | 320×240                | Action labels incl. punching, fighting

    2. Model Evaluation And Performance Metrics

      To assess the effectiveness of the trained model, feature vectors were extracted from the fully connected layer (fc) of the trained CNN using test inputs. For individual prediction, a test image was resized to 256×256 and passed through the trained network to obtain high-level features, which were then classified using the trained Random Forest (RF) model. The predicted class was displayed to the user via a graphical interface.

      The overall system performance was evaluated on a separate test set derived from the actively learned dataset. Predicted class labels from the RF classifier were compared with ground truth labels to compute standard performance metrics. These included classification accuracy, sensitivity (recall), specificity, precision, F1-score, false positive rate (FPR), error rate, and Matthews correlation coefficient (MCC).

      A confusion matrix $C$ was constructed to determine true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) for each class. The evaluation metrics were computed using the following formulas:

      $\text{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$   (5.1)

      $\text{F1-score} = \dfrac{2TP}{2TP + FP + FN}$   (5.2)

      $\text{Specificity} = \dfrac{TN}{TN + FP}$   (5.3)

      $\text{Sensitivity} = \dfrac{TP}{TP + FN}$   (5.4)

      $\text{Precision} = \dfrac{TP}{TP + FP}$   (5.5)

      where $N$ is the total number of samples, $S = \dfrac{TP + FN}{N}$ and $P = \dfrac{TP + FP}{N}$.
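
      These metrics can be computed in MATLAB directly from the confusion matrix. A minimal sketch for the binary (violence vs. normal) case is shown below; yTrue and yPred are assumed ground-truth and predicted label vectors, and the class ordering in confusionmat is assumed to place the violence class first.

      % Confusion matrix and binary metrics (positive class assumed to be violence)
      C  = confusionmat(yTrue, yPred);
      TP = C(1,1); FN = C(1,2); FP = C(2,1); TN = C(2,2);

      accuracy    = (TP + TN) / (TP + TN + FP + FN);   % Eq. (5.1)
      f1score     = 2*TP / (2*TP + FP + FN);           % Eq. (5.2)
      specificity = TN / (TN + FP);                    % Eq. (5.3)
      sensitivity = TP / (TP + FN);                    % Eq. (5.4)
      precision   = TP / (TP + FP);                    % Eq. (5.5)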

      Figure 4.1 Performance metrics per class

    3. Hardware and Software Specifications.

The implementation and experimentation of the proposed HALE-Net model were carried out on a personal computing system equipped with an Intel Core i7 processor, 16 GB RAM, and an NVIDIA GeForce GTX 1660 Ti GPU with 6 GB of VRAM to accelerate the training of the deep learning models. The software environment consisted of MATLAB R2021a running on the Windows 10 operating system. The model utilized MATLAB's Deep Learning Toolbox, Image Processing Toolbox, and Statistics and Machine Learning Toolbox to build and train the convolutional neural networks (CNNs), extract features, and implement the ensemble classifier using the Random Forest (TreeBagger) approach. The training and testing images were handled using MATLAB's imageDatastore, and all image files were stored in .jpg format within structured subfolders corresponding to class labels. GPU support was enabled via MATLAB's gpuDevice function to speed up training and feature extraction.
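
A minimal sketch of the data handling and GPU setup described here, assuming the .jpg frames are stored in class-named subfolders under a root directory (the path is illustrative):

% Load training images from class-labelled subfolders (.jpg files)
imdsTrain = imageDatastore('dataset/train', ...
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');

% Select the NVIDIA GPU for training and feature extraction
gpu = gpuDevice;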

  5. RESULTS AND DISCUSSION

    1. Comparative Performance Analysis With Existing Models

      To evaluate the effectiveness of the proposed HALE-Net framework, a comparative performance analysis was conducted against an existing convolutional LSTM model. Both models were tested using the same dataset and evaluation metrics to ensure a fair comparison. The baseline model represents a conventional deep learning architecture without ensemble enhancement, while the proposed model integrates a convolutional neural network (CNN) for feature extraction with a Random Forest classifier for improved generalization and decision-making. Performance metrics including accuracy, sensitivity (recall), specificity, precision, and F1-score were computed. The results demonstrate that the proposed method outperformed the baseline model across all key metrics, with particularly significant improvements in precision and F1-score. This indicates a higher ability to correctly identify violent activities while minimizing false alarms, making the approach more reliable for real-world surveillance applications.

      To further validate the effectiveness of the proposed ensemble CNN-based model, its performance was also compared against a ResNet with ConvLSTM-based architecture. The evaluation was conducted on the UCF-Crime dataset using several standard classification metrics, including accuracy, sensitivity (recall), specificity, precision, and F1-score. The comparative results are summarized in Table 5.1.

      Table 5.1 Comparison between HALE-Net and different baseline models

      Model                     | Precision | Recall | F1-score | Accuracy
      Convolutional LSTM        | 81.7      | 83.2   | 82.4     | 85.6
      ResNet-ConvLSTM           | 88.7      | 90.2   | 89.4     | 91.3
      Proposed model (HALE-Net) | 94.0      | 93.6   | 93.8     | 94.7

    2. Effectiveness of Active Learning in Minimizing Data Labeling.

      Manual annotation of large-scale video datasets, such as UCF-Crime, can be labor-intensive and time-consuming, particularly in domains requiring expert labeling, like violence detection. To address this, the proposed framework incorporates an active learning strategy, which significantly reduces the need for extensive labeled data. Instead of labeling the entire dataset upfront, the model selectively queries the most informative and uncertain samples for manual annotation based on prediction entropy and classifier disagreement. This iterative querying process ensures that the model is trained on highly valuable data points, thereby accelerating the learning process with minimal supervision.

      Experimental results demonstrate that the active learning-based approach achieved comparable or superior performance to fully supervised models while using only a fraction of the labeled data. Specifically, labeling just 30% to 40% of the most uncertain instances was sufficient to reach over 90% of the model's final accuracy, compared with training on the fully labeled dataset. This highlights the effectiveness of active learning in reducing annotation costs without compromising detection performance. Consequently, this approach not only enhances scalability for real-world deployments but also makes violence detection systems more feasible in resource-constrained scenarios.

    3. Impact of the Ensemble Classifier on Final Predictions

The integration of an ensemble classifier significantly enhances the robustness and accuracy of the proposed HALE-Net framework. By combining predictions from multiple models, specifically a CNN for spatial feature extraction and a secondary classifier such as a Random Forest for decision-level fusion, the ensemble approach leverages the strengths of individual learners while mitigating their weaknesses. This strategy results in a more generalized and stable model, particularly when dealing with complex and diverse video content such as that in the UCF-Crime dataset.

Empirical evaluations reveal that the ensemble classifier outperforms the individual classifiers in terms of accuracy, sensitivity, specificity, and F1-score. For example, while the base CNN model achieved an accuracy of approximately 89.5%, the ensemble framework improved this to 92.3%, with notable gains in sensitivity (recall) and precision. The confusion matrix indicates reduced false positives and false negatives, demonstrating improved discrimination between violent and non-violent scenes. This performance gain is particularly critical in safety-sensitive applications, where incorrect predictions can have serious implications.

Overall, the ensemble classifier plays a vital role in improving the reliability and effectiveness of the system, ensuring that predictions are both more accurate and more consistent across varied video samples.

6. REFERENCES

  1. C. Shripriya, J. Akshaya, R. Sowmya and M. Poonkodi, "Violence Detection System Using ResNet," 2021 5th International Conference on Electronics, Communication and Aerospace Technology (ICECA), Coimbatore, India, 2021, pp. 1069-1072.

  2. Das, A. Sarker and T. Mahmud, "Violence Detection from Videos using HOG Features," 2019 4th International Conference on Electrical Information and Communication Technology (EICT), Khulna, Bangladesh, 2019, pp. 1-5.

  3. K. Aarthy and A. A. Nithya, "Crowd Violence Detection in Videos Using Deep Learning Architecture," 2022 IEEE 2nd Mysore Sub Section International Conference (MysuruCon), Mysuru, India, 2022, pp. 1-6.

  4. A. Khalfaoui, A. Badri and I. El Mourabit, "Efficient Violence Detection with Bi-Directional Motion Attention and MobileNetV3-LSTM."

  5. L. Sachan, P. Katiyar, Y. Kumbhawat, G. K. Rajput and T. Mehrotra, "Comparative Analysis on Violence Detection Using YOLO and ResNet," 2023 12th International Conference on System Modeling & Advancement in Research Trends (SMART), Moradabad, India, 2023, pp. 89-92.

  6. P. P and A. J, "A hybrid model using 2D and 3D Convolutional Neural Networks for violence detection in a video dataset," 2022 3rd International Conference on Communication, Computing and Industry 4.0 (C2I4), Bangalore, India, 2022.

  7. S. Kshatri, D. Singh, B. Narain, S. Bhatia, M. T. Quasim and G. R. Sinha, "An Empirical Analysis of Machine Learning Algorithms for Crime Prediction Using Stacked Generalization: An Ensemble Approach," in IEEE Access, vol. 9.

  8. Vo-Le, H. S. Vo, T. D. Vu and N. H. Son, "Violence Detection using Feature Fusion of Optical Flow and 3D CNN on AICS-Violence Dataset," 2022 IEEE Ninth International Conference on Communications and Electronics (ICCE), Nha Trang, Vietnam, 2022, pp. 395-399.

  9. S. Vosta and K.-C. Yow, "KianNet: A Violence Detection Model Using an Attention-Based CNN-LSTM Structure," in IEEE Access, vol. 12, pp. 2198-2209, 2024, doi: 10.1109/ACCESS.2023.3339379.

  10. Suba, A. Verma, P. Baviskar and S. Varma, "Violence detection for surveillance systems using lightweight CNN models," 7th International Conference on Computing in Engineering & Technology (ICCET 2022), Online Conference, 2022, pp. 23-29, doi: 10.1049/icp.2022.0587.