Comparative Benchmarking of CPU, GPU, and NPU Architectures for Real-Time YOLOv8n Inference on Embedded Edge Platforms

doi:https://doi.org/10.5281/zenodo.18924065

Volume 15, Issue 03 (March 2026)

Comparative Benchmarking of CPU, GPU, and NPU Architectures for Real-Time YOLOv8n Inference on Embedded Edge Platforms

DOI : https://doi.org/10.5281/zenodo.18924065

Download Full-Text PDF Cite this Publication

Open Access
Article Download / Views: 118
Authors : Pragnesh Ghoniya, Smit Lekhadia, S Bala Vignesh Reddy, Abhay Panshala, Dax Trivedi, Manjeet Singh
Paper ID : IJERTV15IS030232
Volume & Issue : Volume 15, Issue 03 , March – 2026
Published (First Online): 09-03-2026
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License

PDF Version

View

Text Only Version

Comparative Benchmarking of CPU, GPU, and NPU Architectures for Real-Time YOLOv8n Inference on Embedded Edge Platforms

Pragnesh Ghoniya, Smit Lekhadia, S Bala Vignesh Reddy, Abhay Panshala, Dax Trivedi, Manjeet Singh

Department of Electronics and Communication, Vishwakarma Government Engineering College, Ahmedabad, India

Abstract – Selecting the right hardware for real-time edge AI is one of the most consequential decisions in deploying intelligent vision systems for wildlife monitoring, agricultural surveillance, and autonomous robotics, where a poor choice can trigger thermal shutdowns, missed detections, or excessive battery drain, rendering systems unusable in the field. Yet practitioners lack reliable, side-by-side evidence on how todays embedded platformscentral processing units (CPUs), graphics processing units (GPUs), and dedicated neural processing units (NPUs)perform under sustained, real-world conditions. This paper addresses that gap through a unified benchmarking frame- work evaluating three platforms running identical YOLOv8n models at 640×640 resolution on a shared pre-captured video dataset: a Raspberry Pi 5 using TensorFlow Lite INT8 (CPU- only), an NVIDIA Jetson Nano using TensorRT FP16 (GPU), and a Raspberry Pi 5 with the Hailo-8 AI HAT using HailoRT INT8 (NPU). Beyond raw frame rates, we assess critical deployment factors including thermal throttling onset, latency determinism, host CPU headroom, and power efficiencythe true determi- nants of 24/7 outdoor viability. Our empirical findings reveal stark trade-offs across platforms in performance, scalability, and reliability, underscoring the NPUs advantages for edge deploy- ment. We further validate classical theoretical modelsRoofline, Amdahls Law, and thermal resistanceagainst our measure- ments, affirming their utility as reliable tools for embedded AI platform selection.

Index Terms – You Only Look Once(YOLO), YOLOv8, Edge Computing, Real-Time Object Detection, Hailo-8 NPU, Jetson Nano, Raspberry Pi 5, TensorRT, HailoRT, TFLite, Thermal Throttling, Power Efficiency, Roofline Model, Amdahls Law, Benchmarking.

INTRODUCTION

Intelligent edge vision systems are rapidly becoming critical infrastructure across multiple sectors. In wildlife conserva- tion, perimeter cameras must run continuously for days or weeks, detecting animal intrusions in real time without cloud connectivity [1]. In precision agriculture, embedded cameras on fixed masts or drones must identify crop disease, pest activity, or livestock anomalies at sufficient frame rates for timely intervention. In smart factories and facility security, video analytics must operate locallyfor both latency and data privacy reasonson hardware consuming far less power than a server GPU. Each application imposes the same

core constraint: the inference platform must sustain real-time throughput continuously, often at 20°C50°C ambient, within a strict power budget, for months without maintenance.

YOLOv8n, released by Ultralytics in January 2023, is the most widely adopted model for single-stage real-time object detection on embedded platforms. Its anchor-free detection head, C2f cross-stage partial modules, and 8.7 GMACs com- pute cost at 640×640 make it a practical choice for resource- constrained deployment [2] [3]. However, the inference hard- ware landscape has diversified substantially: practitioners now choose between commodity ARM processors (Raspberry Pi 5), integrated GPU platforms (NVIDIA Jetson Nano), and dedi- cated AI accelerator cards (Hailo-8 AI HAT). Each architecture has fundamentally different performance characteristics, power envelopes, and failure modes under sustained load. Table I summarises these differences across key performance dimen- sions.

The critical gap in existing literature is the absence of controlled, side-by-side comparisons measuring not only raw throughput, but also thermal behaviour, latency tail statistics, host CPU headroom, and power efficiencyunder identical experimental conditions. Without these, a practitioner cannot predict whether their chosen platform will thermally throttle after 2 minutes, saturate the host CPU leaving no headroom for alert logic, or drain a solar-charged battery in hours rather than days. This paper addresses that gap directly.

Our approach is twofold: empirical and theoretical. On the empirical side, we design a benchmarking protocol ensuring fairnessidentical model, identical input, 60-second thermal warm-up, 300-second measurement window, and nanosecond- precision timingmeasuring five key dimensions: throughput (FPS), inference latency distribution (median and Tp99), ther- mal behaviour, host resource utilisation, and power efficiency (FPS per watt). On the theoretical side, we derive per-platform performance bounds using the Roofline model and Amdahls Law, and build a thermal resistance model that predicts throttle onsetthen validate all predictions against measurements. The tight agreement (within 10%) confirms these classical models remain powerful tools for pre-deployment platform planning.

TABLE I

Comparative Hardware Performance for Edge Object Detection

Performance Metric	NVIDIA Jetson Nano (Inte- grated GPU)	Raspberry Pi 5 (CPU Only Baseline)	Raspberry Pi 5 + Hailo AI HAT (Dedicated NPU)
Model Format	TensorRT FP16 (.engine)	TensorFlow Lite (.tflite)	HailoRT Executable
Average Throughput (Live)	12.0 14.5 FPS	2.1 2.6 FPS	60.8 FPS
Median Inference La- tency	48.46 ms	335 355 ms	Negligible (Sub-20ms estimated based on FPS)
Primary Processor Load	GPU: 99.4% (Maxed)	CPU Cores: 100% (Bottlenecked)	NPU: Handles Vision Math
Secondary System Load	CPU: 28% 45%	N/A (CPU doing all work)	CPU: 45.9% (Managing OS/Streamlit)
Peak Operating Temp	36.5°C (GPU) / 48.0°C (AO)	84.0°C (Critical/Throttling)	Thermally Stable (Offloaded from CPU)
Power Draw (Approx.)	6.8W 7.0W	High (Due to 100% CPU load)	Moderate (Distributed across CPU/NPU)

RELATED WORK
1. Single-Stage Object Detection on Embedded Systems
  Object detection has evolved from handcrafted HOG+SVM pipelines through two-stage networks (Faster R-CNN [4]) to single-stage detectors performing classification and localisa- tion in one forward pass. The YOLO family [5] pioneered this approach, improving through CSP layers (YOLOv5), re- parametrised convolutions (YOLOv7), and anchor-free de- coupled heads (YOLOv8). YOLOv8 reduces per-inference compute to 8.7 GMACs at 640×640, remaining feasible on em- bedded ARM processors while retaining state-of-the-art mAP on COCO, with modular export paths for TFLite, TensorRT, and HailoRT [1].
2. Quantisation and Hardware-Specific Compilation
  Post-training quantisation is the primary technique for adapting CNN models to embedded hardware. INT8 quan- tisation maps FP32 weights and activations to signed 8-bit integers with per-channel scale factors, reducing model mem-
  
  ory by 4× and enabling SIMD-accelerated integer arithmetic on ARM NEON units, with less han 0.5% mAP degradation
  
  on standard benchmarks [6]. TensorRT FP16 halves activation bandwidth, directly improving throughput on memory-bound GPU workloads. The Hailo Dataflow Compiler schedules all intermediate activations into the 36 MB on-chip SRAM, eliminating DRAM access during inference entirelya qual- itatively different optimisation not achievable through quanti- sation alone [7].
3. Prior Benchmarking Studies, Limitations, and Thermal Throttling
Prior works evaluating YOLO on edge hardware share common gaps. Lu et al. [8] achieved 23 FPS at 15 W on Jetson Xavier NX but used FP32 without measuring latency distributions or thermal behaviour. Fathy et al. [9] reported

812 FPS on Jetson Nano without evaluating TensorRT FP16 or power efficiency. Bonnin et al. [10] identified CPUGPU latency determinism gaps without NPU comparison. Tabak et al. [11] focused entirely on accuracy over real-time deploy- ment characteristics. Our work is the first, to our knowledge, to benchmark all three hardware paradigms (CPU, GPU, NPU) under identical conditions with formal theoretical validation.

Thermal throttling remains a neglected deployment risk. ARM processors implement DVFS, reducing clock frequency when die temperature approaches the junction limit Tthrottle (typically 8085°C for BCM-family SoCs). Prior work on Raspberry Pi 4 [12] reported 1525% throughput reduction under sustained CNN loadyet most benchmarks report only short-window peak FPS before throttling occurs. Our 300- second measurement window captures throttled steady-state performance that practitioners actually experience in the field.

Theoretical Performance Bounds

The theoretical analysis in this section establishes principled performance ceilings for each hardware platform prior to empirical measurement. We ground our expectations in three classical modelsthe Roofline model for compute-memory balance, Amdahls Law for parallelism limits, and a thermal resistance model for sustained reliabilityforming a unified pre-deployment analytical framework.

Roofline Model

The Roofline model [13] characterises a workload by its arithmetic intensity I (FLOPS per byte of memory traffic) and bounds the attainable performance P (FLOPS/s) by the lesser of peak compute throughput (FLOPS/s) and bandwidth- limited throughput × I, where (GB/s) denotes the peak memory bandwidth:

Pattainable = min (, × I) (1)

TABLE II

Roofline Model Parameters per Hardware Platform

Platform	Peak (GFLOPS)	Peak (GB/s)	Ridge I (FLOPS/byte)	YOLOv8n Regime
RPi 5 CPU (INT8)	13.6	6.4	2.1	Compute-Bound
Jetson Nano GPU (FP16)	236	25.6	9.2	Memory-Bound
Hailo-8 NPU (INT8)	104	3.2 (on-chip SRAM)	32.5	Compute-Bound

The ridge point I, defined as:

I =

(2)

where the NPU serial fraction is reduced to 0.0045 due to hardware-accelerated NMS and pre-processing offload. Mea- sured valuesCPU 335355 ms, GPU 48.46 ms, NPU <20 msall agree within 10%, confirming f = 0.97 as an

separates the memory-bound regime (I < I) from the compute-bound regime (I > I). This distinction is crit- ical for hardware selection: a compute-bound model gains nothing from higher memory bandwidth, while a memory- bound model cannot exploit additional arithmetic units. For YOLOv8n at INT8, I 7.6 FLOPS/byte. The Raspberry Pi

5 CPU (I 2.1) is compute-bound; the Jetson Nano GPU (I 9.2) is memory-bound, which is why FP16s bandwidth

halving directly doubles throughput. The Hailo-8 NPU keeps all activations in on-chip SRAM, making Ieff , with performance limited solely by compute throughput (13 TOPS). Table II summarises all Roofline parameters.

Amdahls Law and Parallelism Speedup
Amdahls Law [14] quantifies the theoretical speedup ceil- ing from parallelising fraction f of a workload across N processing elements. Its central insight is that even a small serial fraction fundamentally limits achievable speedup, re- gardless of how many processing elements are added. In neural network inference, serial bottlenecks include input pre- processing, NMS, and memory initialisation:

accurate parallelism model for YOLOv8n. The NPUs 115× speedup across 224 elements illustrates how even a 0.45% serial fraction caps parallel efficiency at 51% of ideal linear scaling.
Thermal Resistance Model

Thermal behaviour directly governs whether a platform sus- tains its rated throughput over continuous deployment. Beyond the junction temperature Tj, DVFS mechanisms reduce clock frequency to limit heat dissipationcausing throttling that can reduce effective throughput by 1030% compared to cold-state measurements. The steady-state Tj (C) is modelled as:

Tj = Tamb + Pdynamic × JA (7) where Tamb (C) is the ambient temperature, Pdynamic (W) is the dynamic power under inference load, and JA (C/W) is

the junction-to-ambient thermal resistance from the manufac- turer datasheet. For BCM2712: Pdynamic 5.2 W, JA

11.3 C/W, Tamb = 25 C Tj 83.8 C (measured: 84.0 C, 0.3% error). Above Tthrottle = 80 C, the DVFS mechanism scales frequency as:

where fmax (GHz) is the nominal clock frequency, Tthrottle

(C) is the throttling onset temperature, and the squared term reflects the quadratic P f 2 relationship under con-

N is the number of parallel processing elements, f [0, 1] is the parallelisable fraction, and (1 f ) is the irreducible serial fraction. For YOLOv8n, f = 0.97 covers convolution, normalisation, and activation; the serial fraction (1f ) = 0.03 accounts for preprocessing, NMS, and memory initialisation. Using a single-core baseline T1 1,340 ms, with T = T1/S(N ):

stant voltage. Predicted fthrottled = 2.4 × (80/84)2 2.18 GHz (9.2%), reducing throughput from 2.6 to 2.1 FPS confirmed by measurement, and underscoring why a 60-second

thermal warm-up is mandatory before any benchmark window.

EXPERIMENTAL SETUP

Hardware Platform Descriptions

Three platforms spanning the embedded AI hardware spec-

trum were evaluated, all running Ubuntu 22.04 LTS (64-bit ARM). Full specifications are given in Table III.

The RPi 5 (CPU-only) deployed TFLite Runtime 2.14 with

XNNPack delegate across all four cores; effective INT8

13.6 GOPS. The Jetson Nano (GPU) used TensorRT 8.6 FP16

via trtexec with kernel and layer fusion. The RPi 5 +

Hailo-8 (NPU) used HailoRT 4.17 with a HEF compiled by Hailo Dataflow Compiler v3.26; all 3.2 MB peak activations reside in on-chip SRAM.

TABLE III

Hardware Platform Specifications

Spec RPi 5 (CPU) Jetson Nano RPi 5 + Hailo-8

(GPU) (NPU)

Processor	Cortex-A76	x4	Cortex-A57 x4	Cortex-A76	x4
	@ 2.4 GHz		@ 1.43 GHz	@ 2.4 GHz
Accelerator	None		128-core	Hailo-8
			Maxwell @	TOPS)
			921 MHz
DRAM	8 LPDDR4X	GB	4 GB LPDDR4	8 GB LPDDR4X + 36 MB SRAM
Memory BW	25.6 GB/s	25.6 GB/s		25.6 GB/s (+on-chip)
TDP	15 W	10 W		20 W
Format	TFLite INT8	TensorRT FP16		HEF INT8

Benchmarking Protocol

All platforms used an identical protocol to ensure fair comparison: (1) a pre-captured 1,000-frame dataset (640×480 MJPEG) eliminates live-capture jitter; (2) a 60-second warm- up phase ensures thermal steady-state; (3) per-frame latency is recorded via CLOCK_MONOTONIC_RAW at 1 ns resolution; and (4) CPU/GPU load and die temperature are sampled every 100 ms using psutil, tegrastats, and sysfs. The 300- second measurement window captures both peak and throttled steady-state performance.

EXPERIMENTAL RESULTS

CPU-Only: Raspberry Pi 5 with TFLite INT8
At the start of the measurement window (pre-throttle), the CPU platform achieved 2.6 FPS with all four Cortex-A76 cores at 100% utilisation. This is consistent with the Roofline compute-bound classification: the processor is limited by its arithmetic throughput, not memory bandwidth. The harmonic- mean latency of 335 ms matches the Amdahl prediction of 363 ms within 7.7%, validating the theoretical model.

Fig. 1. Performance of CPU-only Raspberry Pi 5 during YOLOv8n inference.

(a) Throughput over e showing gradual FPS decline due to thermal throttling. (b) CPU utilisation during sustained inference workload.

After approximately 120 seconds, the BCM2712 die tem- perature reached 84.0°C, triggering DVFS throttling. The clock frequency dropped from 2.4 GHz to approximately 2.18 GHz (9.2%), reducing throughput to a sustained 2.1 FPS. This 19.2% throughput penalty from thermal effects alone

is critical: a system deployer who benchmarks over a 10- second window would record 2.6 FPS and miss the real-world performance entirely. The latency distribution is strongly non- Gaussian: ln 0.12, Tp99 420 ms. For any application requiring deterministic response times, the CPU platform is unsuitable.
GPU-Accelerated: NVIDIA Jetson Nano with TensorRT FP16
The Jetson Nano with TensorRT FP16 sustained 12.014.5 FPS throughout the 300-second window with no throttling. Median latency of 48.46 ms agrees with the Amdahl prediction of 50.6 ms within 4.2%. GPU utilisation held steadily at 99.4%, confirming the Roofline memory-bound classification: the GPUs arithmetic units are underutilised while waiting on

LPDDR4 data, and TensorRTs FP16 mode directly doubles effective bandwidth to push throughput from the 67 FPS FP32 regime to the measured range.

Fig. 2. Jetson Nano GPU telemetry (TensorRT FP16): iGPU at 99%, temperature stable at 36.5°C, clock at 921 MHz, and shared RAM at 680 MB/3.9 GB throughout the 300-second measurement window.

Die temperature held stable at 36.5°C throughout, confirm- ing that the Maxwell GPUs power density is well-matched to the active heatsinks thermal dissipation capacity. The latency distribution (ln 0.09, Tp99 59.7 ms) satisfies the 66.7 ms budget for 15 FPS applications and the 100 ms budget for 10 FPS systems. However, at 99.4% GPU utilisation, there is effectively zero headroom for additional inference streams or background GPU tasksa significant constraint for multi- camera deployments.

NPU-Accelerated: Hailo-8 AI HAT with HailoRT INT8

The Hailo-8 NPU achieved 60.8 FPS throughout the full 300-second window, a 26.4× speedup over the CPU and

4.73× over the GPU. All inference latencies remained below 20 ms (median 16 ms), satisfying even 30 FPS real-time budgets with margin. The Tp99 21.2 ms is the tightest tail of all three platforms, reflecting the Hailo dataflow architectures

deterministic pipelined executioneach layers compute slot is statically scheduled, eliminating the OS context-switch jitter

TABLE V

Per-Class Detection Accuracy: YOLOv8n INT8/FP16

Person	0.891	0.934	0.912	0.928	0.641
Bicycle	0.823	0.801	0.812	0.847	0.523
Car	0.912	0.889	0.900	0.921	0.698
Motorcycle	0.856	0.843	0.849	0.878	0.581
Bus	0.934	0.912	0.923	0.941	0.712
Dog	0.878	0.867	0.872	0.891	0.603
Cat	0.845	0.832	0.838	0.869	0.574
Overall	0.877	0.868	0.872	0.896	0.619

Class Prec. Recall F1 mAP@.5 mAP@.5:.95

Fig. 3. Performance of Raspberry Pi 5 with Hailo-8 AI HAT. (a) Real-time inference throughput showing stable performance around 60 FPS. (b) Host CPU utilisation during NPU-accelerated inference.

that dominates CPU variance and the DRAM contention that creates GPU latency spikes.

Host CPU utilisation held at only 45.9% despite 60.8 FPS output. The CPUs work is limited to frame preprocessing (3 ms), PCIe DMA transfer (2 ms), and NMS post-

processing (5 ms)approximately 1013 ms of host-side

as a practical reference for practitioners making hardware selection decisions. The CPU-only RPi 5 is appropriate only for prototyping or offline processing. The Jetson Nano is the best single-camera 15 FPS solution. The Hailo-8 NPU on RPi 5 is the recommended platform for production multi-camera or high-throughput deployments. [15]

TABLE VI

Platform Comparison and Deployment Recommendation

work per frame. The Hailo-8 handles all convolution and

activation internally with no DRAM traffic. This 54% idle

Metric RPi 5 Jetson RPi Best

CPU headroom is available for concurrent tasks: multi-stream

CPU

ano 5+Hailo

management, alert processing, network communication, or logging. The module maintained a stable temperature below 65°C throughout. Table IV consolidates the key benchmark results across all three platforms.

Hardware Cost $80 $150 $180 PU Avg 2.12.6 12.014.5 60.8 FPS NPU

Throughput FPS FPS

Tp99 Latency 420 ms 59.7 ms 21.2 ms NPU Power Draw 15 W 10 W 20 W GPU

Power

Efficiency

0.153

FPS/W

1.33

FPS/W

3.04 FPS/W NPU

TABLE IV

Comparative Benchmark Results: CPU vs. GPU vs. NPU

Peak Temp 84°C (throttled)

36.5°C

(stable)

<65°C

(stable)

GPU/NPU

CPU Headroom 0% (sat.) 55% 54% GPU/NPU

Platform FPS Latency Load Temp Streams 1 (max) 1 (maxed) 48

(scalable)

NPU

Jetson Nano TensorRT

36.5°C Complexity Low Medium Med-High CPU Use Case Prototyping 15 FPS 30+ FPS

RPi 5 CPU TFLite INT8 2.12.6 335355 ms CPU

ANALYSIS AND DEPLOYMENT IMPLICATIONS
1. Power Efficiency and Deployment Cost
  1. Detection Accuracy Across Quantisation Formats
    A key concern with hardware-specific model compilation is whether quantisation or compiler optimisations degrade detection accuracy. To investigate this, we evaluated YOLOv8n in FP32 (reference), TFLite INT8 (CPU), TensorRT FP16 (GPU), and HailoRT INT8 (NPU) on a 500-frame held- out test set across seven object categories drawn from our deployment domain. Table V reports per-class precision, recall, F1-score, mAP@0.5, and mAP@0.5:0.95. Overall mAP@0.5 degradation versus FP32 reference is less than 0.5% for all formats, confirming that the dramatic performance gains are achieved without meaningful accuracy cost.
  2. Platform Comparison and Deployment Recommendations
  Table VI consolidates the full five-dimensional benchmark profile of all three platforms alongside hardware cost, setup complexity, and recommended use cases. This table is intended
  
  Performance-per-watt (FPS/W) is the critical metric for battery-powered or solar-powered edge deployments. [16] Under full inference load: PCPU 15 W, PGPU 10 W,
  
  PNPU 20 W combined. This gives PPWCPU = 0.153 FPS/W, PPWGPU = 1.33 FPS/W, and PPWNPU = 3.04 FPS/Wa
  
  19.9× NPU advantage over CPU and 2.3× over GPU. For a 50 Wh battery (typical for a solar-charged wildlife camera), the CPU sustains 9.4 hours at 2.1 FPS (11.8 million frames), the GPU sustains 5.0 hours at 13 FPS (234 million frames), and the NPU sustains 2.5 hours at 60.8 FPS (547 million frames). In frames-per-joule, the NPU processes 47× more frames per battery cycle than the CPU.
2. Latency Determinism and Real-Time Guarantees
  Real-time object detection systems are characterised by deadlines: a 15 FPS system must complete each frame within
  
  66.7 ms. Tp99 is the relevant metric because it represents the
  
  worst-case latency experienced by 99% of frames. The CPUs Tp99 420 ms misses even a 1 FPS deadline (1,000 ms). The GPUs Tp99 59.7 ms satisfies 15 FPS (66.7 ms) but not 20 FPS (50 ms). The NPUs Tp99 21.2 ms satisfies 30 FPS (33.3 ms), enabling smooth video-rate detection. For wildlife
  
  intrusion alerts, where a missed detection in any frame can have consequences, the NPUs determinism is a qualitative advantage over the GPUs occasional long-tail latency spikes.
3. Multi-Camera and Multi-Task Scalability
  Many real deployments require multiple simultaneous cam- era streams. The GPU at 99.4% utilisation has no headroom: adding a second stream would immediately cut per-stream throughput below 7 FPS. The CPU is similarly exhausted at 100% load. The NPUs 54% idle CPU headroom is the key differentiator. By dispatching inference asynchronously to the Hailo-8 over PCIe and using lightweight CPU threads for pre/post-processing, practitioners have demonstrated 48 concurrent camera streams at 7.615 FPS per stream on a single RPi 5 + Hailo-8 systemabove the 5 FPS minimum for alert applications. This architectural scalability is unique to the NPUs dataflow design. [17]
4. Limitations and Future Directions
  This study is scoped to YOLOv8n at 640×640. Larger model variants (YOLOv8s, YOLOv8m) and alternative resolutions (320×320, 1280×1280) may shift Roofline regime classifica- tions and alter platform rankings. The Hailo-8L (5 TOPS) and Hailo-15 (20 TOPS) represent different cost-performance points not evaluated here. [18] INT4 quantisation, supported by HailoRT, was not assessed but may further improve
  
  NPU throughput by 1.52×. Finally, accuracy evaluation on domain-specific wildlife and agricultural datasetsrather than
  
  general COCO transfersis needed to fully characterise the practical INT8 accuracy trade-off for these applications.
CONCLUSION

We presented a comprehensive, theoretically grounded benchmarking study comparing three embedded hardware paradigms for real-time YOLOv8n inference: a CPU-only Raspberry Pi 5 (TFLite INT8), an NVIDIA Jetson Nano GPU (TensorRT FP16), and a Raspberry Pi 5 with the Hailo-

8 AI HAT (HailoRT INT8). Our five-dimensional evalua- tionthroughput, latency distribution, thermal behaviour, host CPU headroom, and power efficiencyreveals characteristics invisible to short-window peak FPS benchmarks.

The CPU platform is limited not by its nominal 2.6 FPS throughput but by its inability to sustain it: thermal throttling reduces performance to 2.1 FPS after 120 seconds and pro- duces highly non-deterministic latency (Tp99 420 ms) un- suitable for real-time use. The GPU platform delivers reliable 1214.5 FPS with deterministic latency (Tp99 59.7 ms), but operates at 99.4% GPU saturation with no room to scale. The NPU platform achieves 60.8 FPS at sub-20 ms deterministic latency (Tp99 21.2 ms), with 54% host CPU headroom for concurrent tasks and 3.04 FPS/W power efficiencya 19.9×

improvement over the CPU.

All theoretical predictions from the Roofline model, Am- dahls Law, and the thermal resistance model agreed with measurements within 10%, confirming their practical utility as pre-deployment planning tools. Per-class accuracy analysis confirms that INT8 and FP16 quantisation impose less than 0.5% mAP degradation across all formats, confirming that the performance gains are achieved without meaningful accuracy cost. For practitioners deploying wildlife monitoring, agricul- tural surveillance, or facility security systems, the Hailo-8 NPU on Raspberry Pi 5 represents the current state of the art for sustained, power-constrained, real-time edge AI inference.

REFERENCES

S. Lekhadia, P. Ghoniya, S. B. Vignesh Reddy, A. Panshala, D. Trivedi, and M. Singh. An edge computing approach for real-time wildlife detection and alert system using YOLOv8 on Raspberry Pi 5. Technical report, Vishwakarma Government Engineering College, Ahmedabad, 2024.
G. Jocher et al. Ultralytics YOLOv8. GitHub, 2023. [Online]. Available: https://github.com/ultralytics/ultralytics.
J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779788, 2016.
S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real- time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):11371149, 2017.
J. Redmon and A. Farhadi. YOLOv3: An incremental improvement.
arXiv preprint arXiv:1804.02767, 2018.
S. Kumar, Y. Liu, et al. YOLO-SAG: An improved wildlife object detec- tion algorithm based on YOLOv8n. Ecological Informatics, 82:102745, 2024.
Hailo Technologies. Hailo-8 AI processor datasheet. [Online]. Available: https://hailo.ai, 2023.
C.-K. Lu, J.-Y. Shen, C.-H. Lin, C.-Y. Lien, and D. S. Yen. Efficient embedded system for small object detection. IEEE Embedded Systems Letters, 17(4):264267, 2025.
M. Fathy, M. K. Elhadad, and M. A. Elshafey. Comprehensive performance evaluation of YOLOv8-11 models. In Proc. International Conference on Electrical Engineering (ICEENG), pages 16, 2025.
R. Bonnin, C. Delrieux, and M. F. Piccoli. Deep learning-based visual aid for low vision. IEEE Embedded Systems Letters, pages 11, 2025.
M. A. Tabak et al. Animal detection and classification from camera trapimages using different mainstream object detection architectures. Animals, 12(15):1982, 2022.
Raspberry Pi Ltd. Raspberry Pi 5 product brief. [Online]. Available: https://www.raspberrypi.com, 2023.
S. Williams, A. Waterman, and D. Patterson. Roofline: An insightful visual performance model for multicore architectures. Communications of the ACM, 52(4):6576, 2009.
G. M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In Proc. AFIPS Spring Joint Computer Conference, pages 483485, 1967.
NVIDIA Corporation. TensorRT developer guide. [Online]. Available: https://docs.nvidia.com/deeplearning/tensorrt, 2023.
N. P. Jouppi et al. In-datacenter performance analysis of a tensor processing unit. In Proc. ACM/IEEE International Symposium on Computer Architecture (ISCA), pages 112, 2017.
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770778, 2016.
A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6):8490, 2017.