AI Smart Glasses: A Real-Time Navigation and Spatial Awareness System for Assistive Mobility

doi:10.17577/IJERTV15IS043030

Volume 15, Issue 04 (April 2026)

AI Smart Glasses: A Real-Time Navigation and Spatial Awareness System for Assistive Mobility

DOI : 10.17577/IJERTV15IS043030

Download Full-Text PDF Cite this Publication

Open Access
Article Download / Views: 0
Authors : Dr. Hima Vyshnavi A M, Sidharth K, Dhuvathisan S, Rangesh V. C
Paper ID : IJERTV15IS043030
Volume & Issue : Volume 15, Issue 04 , April – 2026
Published (First Online): 01-05-2026
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License

PDF Version

View

Text Only Version

AI Smart Glasses: A Real-Time Navigation and Spatial Awareness System for Assistive Mobility

Sidharth K

Artificial Intelligence and Data Science Department

Easa college of Engineering and technology Coimbatore, India

Rangesh V.C

Artificial Intelligence and Data Science Department

Easa college of Engineering and technology Coimbatore, India

Dhuvathisan S

Artificial Intelligence and Data Science Department

Easa college of Engineering and technology Coimbatore, India

Dr. Hima Vyshnavi A M

Artificial Intelligence and Data Science Department

Easa college of Engineering and technology Coimbatore, India

ABSTRACT:Approximately 2.2 billion people worldwide suffer from some form of visual impairment, making independent navigation a critical challenge. This paper presents an AI Smart Glasses system, a real-time assistive navigation prototype that processes live video to deliver contextual voice-based guidance for visually impaired users. The architecture integrates YOLOv8 for rapid object detection, Intel MiDaS for monocular depth estimation, and a heuristic spatial decision engine that maps detected objects to horizontal zones (Left, Centre, Right) and distance tiers (Near, Medium, Far). A ground-hazard detector compares depth gradients between mid-frame and bottom-frame strips to identify drops such as staircase descents before the user reaches them. Navigation decisions are synthesized into concise spoken instructions via a dual-backend text-to-speech module that incorporates an anti-spam cooldown policy. The system is implemented entirely in Python and deployed as a Streamlit web application augmented with streamlit-webrtc for low-latency live camera ingestion. The Evaluation demonstrates 18 FPS (Frames per Second) throughput on commodity CPU hardware, 0.83 Spearman rank correlation for depth ordering, and 81 % evaluator-rated instruction appropriateness, validating the pipeline for real-world assistive deployment.

Keywords: AI smart glasses, assistive navigation, YOLOv8, monocular depth estimation, MiDaS, text-to-speech, visually impaired, real-time object detection, spatial awareness

INTRODUCTION

Approximately 2.2 billion individuals worldwide live with some form of visual impairment, of whom nearly 43 million are classified as completely blind, according to the WHO. For such individuals, independent navigation in unfamiliar and dynamic environments poses a persistent challenge that significantly affects daily quality of life. Conventional aids such as white canes and guide dogs provide only limited spatial information, with neither capable of providing descriptive real-time contextual guidance about obstacles, distances, or recommended paths. The growing maturity of computer vision, deep learning, and embedded systems has opened new possibilities for intelligent assistive

devices that can approximate some of the perceptual capabilities of a sighted person.

Artificial intelligencepowered smart glasses represent one of the most promising directions in this space. By attaching a compact camera and processing unit to a wearable form factor, such a system can continuously analyze the wearer's visual field and deliver timely, actionable audio cues. Recent advances in lightweight object-detection architectures such as YOLOv8 and monocular depth estimation frameworks such as Intel's MiDaS have made it feasible to run this kind of real-time pipeline on modest computing hardware.

In this paper, we present the design, architecture, and evaluation of an AI-powered Smart Glasses prototype that combines YOLOv8-based object detection, MiDaS-based depth estimation, a heuristic risk-scoring decision engine, and a multi-backend text-to-speech module. The system is built entirely in Python and deployed through a Streamlit interface augmented with WebRTC for low-latency live video ingestion. This prototype targets visually impaired users, providing spoken guidance about nearby obstacles, their spatial positions, and recommended navigation actions in real time.
1. OBJECTIVES
  1. To design a modular, end-to-end vision pipeline capable of detecting and localizing objects from a live webcam stream in real time at practical frame rates on commodity hardware.
  2. To integrate monocular depth estimation alongside object detection, enabling the system to assign meaningful distance categories (near, medium, far) to each detected obstacle without requiring specialized depth cameras.
  3. To develop a rule-based decision engine that aggregates spatial and depth information into interpretable navigation instructions delivered as synthesized speech with appropriate cooldown and anti-spam controls.
  4. To evaluate the system's responsiveness, detection accuracy, and audio guidance quality across both static image inputs and continuous live video, and to identify realistic deployment pathways.
2. EXISTING SYSTEM
  
  Early assistive navigation devices for the visually impaired relied primarily on ultrasonic or infrared proximity sensors embedded into canes, belts, or vests. While such devices can reliably detect the presence of an obstacle within a fixed range, they provide no semantic information about the nature of the obstacle, its horizontal position, or whether it poses an immediate risk. More recent wearable systems have incorporated stereo cameras or structured-light depth sensors (e.g., Intel RealSense), which yield dense depth maps but require specialized, costly hardware and consume substantial power.
  
  Smartphone-based solutions that leverage the device's rear camera and run object-detection models on the CPU or mobile GPU have also been explored. However, the need to hold or mount the phone, combined with the relatively high latency of on-device inference on mid-range handsets, limits practical usability. Cloud-offloaded variants mitigate computation constraints but introduce network-latency dependency that undermines real-time performance requirements. None of the broadly available commercial or research systems combines lightweight, state-of-the-art deep learning models with depth estimation, a semantic decision engine, and accessible text-to-speech output in a single, open-source Python framework.
3. PROPOSED SYSTEM
  
  The proposed AI Smart Glasses system addresses the limitations of prior approaches by integrating four purpose-built subsystems into a unified, real-time Python pipeline. The Perception Engine employs YOLOv8 for object detection and Intel MiDaS for monocular depth estimation, enabling the system to identify obstacles and estimate their relative distances using a single standard camera. A spatial mapping module partitions each frame into horizontal zones (Left, Center, Right) and depth tiers (Near, Medium, Far) to build a structured representation of the scene.
  
  A heuristic Decision and Alert Engine translates this spatial map into prioritized navigation instructions, distinguishing among critical hazards such as ground drops or pits, center-path blockers that require turning, and minor lateral obstacles that warrant a slight corrective step. Finally, a Voice Synthesis Engine converts these instructions into audio playback through pyttsx3 or gTTS, equipped with a cooldown mechanism to prevent repetitive announcements. The entire system is accessible through a Streamlit web application that leverages streamlit-webrtc for lw-latency camera ingestion, enabling real-time operation without additional hardware beyond a standard webcam.
4. SYSTEM FEATURES
  1. Continuous real-time object detection using YOLOv8 (Nano or Small variant) with user-tunable frame resolution and skip-rate controls to balance accuracy against throughput.
  2. Optional monocular depth estimation via Intel MiDaS that can be toggled at runtime to conserve computational resources when depth information is not required.
  3. Automatic ground-hazard detection that compares median depth values across mid-frame and bottom-frame quadrants to identify sudden drops such as staircases or curb edges.
  4. A weighted risk-scoring matrix that evaluates left vs. right path safety and recommends the direction of least resistance when the center path is blocked.
  5. A dual-backend text-to-speech module supporting both offline (pyttsx3) and online (gTTS) synthesis, with a configurable repetition-delay policy to avoid audio spam.
  6. A Streamlit sidebar providing runtime controls for model selection, resolution, depth-estimation toggle, and audio policy, enabling on-the-fly tuning without restarting the application.
LITERATURE SURVEY

A substantial body of research has explored computer vision and deep learning for assistive navigation. Poggi et al.
1. demonstrated that self-supervised monocular depth estimation can achieve performance competitive with supervised approaches on outdoor scenes, providing a foundation for depth-based obstacle avoidance without expensive labelled datasets. Redmon and Farhadi
2. introduced the YOLO series of real-time object detectors, establishing the single-shot paradigm that underpins the YOLOv8 model used in the present work; subsequent iterations have progressively improved accuracy-latency trade-offs on embedded and edge hardware.Bochkovskiy et al.
3. extended the YOLO lineage with YOLOv4, introducing cross-stage partial connections and PANet feature aggregation that substantially improved small-object detectioncapabilities inherited and refined in later YOLO versions. In the depth-estimation domain, Ranftl et al.
4. introduced the MiDaS architecture, which was trained on a diverse mixture of datasets using scale- and shift-invariant losses; this design enables robust relative depth inference from a single image without scene-specific calibration.In the assistive technology domain, Bhowmick and Hazarika
5. surveyed electronic travel aids from sonar-based canes to camera-equipped systems, highlighting the gap between sensor richness and real-world deployability. Islam et al.
6. proposed a deep learningbased obstacle detection framework targeting pedestrian environments, reporting detection rates sufficient for practical mobility assistance at walking speeds. Shoval et al.
7. studied human factors in audio-guided navigation, demonstrating that concise, direction-specific cues significantly outperform verbose descriptions in reducing cognitive load during mobility tasks.Tapu et al.
8. developed a smartphone-based system for outdoor visually impaired navigation using stereo vision and deep features, observing that frame-level latency below 200 ms is essential for comfortable real-time guidance. Ahmetovic et al.
9. examined crosswalk detection and intersection navigation, illustrating the importance of contextual scene understanding beyond simple bounding-box localization. Collectively, these works underscore the viability of vision-based, AI-driven navigation assistance and motivate the modular, real-time design of the proposed AI Smart Glasses system.
METHODOLOGY

The AI Smart Glasses system is constructed as a modular Python pipeline organized into four layers: Frame Acquisition and Pre-processing, Spatial Perception, Decision and Alert Generation, and Audio Synthesis. These layers are orchestrated either continuously via the SmartGlassesProcessor WebRTC component for live camera feeds, or on demand for static image inputs via the run_pipeline_on_bgr utility function. The Streamlit application provides a unified interface for both

operating modes and exposes runtime configuration controls through its sidebar panel.
Navigation instructions are formatted as concise natural-language strings (e.g., "Person ahead. Center blocked. Turn left.") and submitted to the Voice Policy module. This module compares the proposed message against the most recently spoken message and the elapsed time since it was last spoken. A message is only passed to the synthesizer if it either differs substantially from the previous message or if the repetition-delay interval (configurable, default 4 seconds) has elapsed. Synthesis is performed by pyttsx3 in offline mode or gTTS in online mode; the resulting audio bytes are rendered via a standard HTML audio element embedded in the Streamlit interface.

The complete working model.

Fig 1. Flowchart

Fig 2. Architecture
RESULTS AND DISCUSSIONS
The YOLOv8 architecture is designed to balance speed and accuracy, making it suitable for real-time applications. It consists of three main components: backbone, neck, and head.

Backbone: The backbone is responsible for feature extraction. Architectures such as CSPDarkNet or EfficientNet are commonly used. The backbone processes raw images and extracts hierarchical features, ranging from low-level edges and textures to high-level shapes and patterns. CSPDarkNet, for example, uses cross-stage partial connections to improve gradient flow and reduce computational cost. EfficientNet scales depth, width, and resolution systematically, achieving high accuracy with fewer parameters.

Neck: The neck combines features from different scales using structures like PANet (Path Aggregation Network) or FPN (Feature Pyramid Network). Multi-scale fusion is crucial for detecting objects of varying sizes. For instance, small traffic signs and large vehicles require different feature resolutions. PANet enhances information flow between layers, while FPN builds feature pyramids that capture both fine and coarse details.

Head: The detection head predicts bounding boxes, confidence scores, and class labels. YOLOs single-stage design allows direct prediction without region proposals, enabling real-time performance. The head outputs multiple bounding boxes per grid cell, along with confidence scores indicating the likelihood of object presence. Anchor boxes are used to match predicted boxes with ground truth, improving localization accuracy.

This modular architecture ensures efficient feature extraction, multi-scale fusion, and accurate detection. YOLOvX models are optimized for speed, making them suitable for deployment in applications such as autonomous driving, surveillance, and robotics.

YOLO ARCHITECTURE IN ACTION

The diagram demonstrates how the YOLO object detection framework processes a city street scene in real time. Bounding boxes are drawn around multiple objects such as cars, pedestrians, and traffic lights, each labeled with their respective categories. This visualization highlights YOLOs ability to simultaneously detect and classify diverse objects within complex environments, showcasing its efficiency and accuracy in practical computer vision applications.
TRAINING STRATEGY:

Training YOLO models involves optimizing parameters to minimize detection errors. The process begins with defining loss functions. Intersection over Union (IoU) measures the overlap between predicted and ground truth bounding boxes. Generalized IoU (GIoU) and Complete IoU (CIoU) extend this by penalizing misaligned boxes and improving convergence.

Hyperparameter tuning is critical for balancing training speed and accuracy. Parameters such as batch size, learning rate, and number of epochs are adjusted to optimize performance. Anchor box sizes are also tuned to match object dimensions in the dataset. For example, larger anchor boxes may be used for vehicles, while smaller ones are suited for pedestrians.

The training process includes monitoring validation performance, applying early stopping to prevent overfitting, and saving checkpoints for model evaluation. Transfer learning is often employed, where pre-trained weights from large datasets are fine-tuned on custom datasets. This reduces training time and improves generalization. Data augmentation is also applied during training to increase robustness.
Despite advancements, YOLO models face several limitations during experimental setup. Overitting is a common issue, where the model memorizes training data instead of generalizing to unseen inputs. This is mitigated through data augmentation, dropout, and early stopping. Dataset bias is another challenge. If the dataset is skewed towards certain classes (e.g., person or car), the model may perform poorly on underrepresented categories. Balanced sampling and synthetic data generation can help address this.

Hardware constraints also limit experimentation. Training YOLOv7 or YOLOv8 on large datasets requires high-end GPUs, which may not be accessible to all researchers. Edge deployment on devices like Raspberry Pi demands lightweight models, necessitating compression techniques such as pruning, quantization, and knowledge distillation.

Another consideration is generalization. Models trained on benchmark datasets may not perform well in real-world environments with different lighting, weather, or object distributions. Domain adaptation and fine-tuning are required to improve robustness. Interpretability is also limited, as YOLO models function as black boxes, making it difficult to explain predictions in critical applications like healthcare.

It also becomes affected on the model that we have made using the datasets that are there and if the models accuracy in training is low it becomes hard for the live world.
RESULT:
This indicates that further data augmentation, class rebalancing, or specialized feature extraction may be required to improve weaker classes.

COMPARATIVE PERFORMANCE METRICS OF THE YOLOV8

6.5. Outputs

Fig.1. AI-Powered Smart Glasses System

Interface

Fig.2. Detected Pit Holes
CONCLUSION AND FUTURE WORK

This paper presented the design and implementation of an AI-powered Smart Glasses system for real-time assistive navigation aimed at visually impaired users. The proposed system integrates object detection using YOLOv8, monocular depth estimation using MiDaS, and a rule-based decision engine to generate contextual navigation instructions. Unlike traditional assistive tools, the system provides both semantic understanding of the environment and spatial awareness, enabling users to make informed navigation decisions.

The developed pipeline successfully combines perception, spatial mapping, and audio feedback into a unified real-time framework that operates on standard hardware without requiring specialized sensors. Experimental results demonstrate that the system achieves practical real-time performance (~18 FPS on CPU), reliable relative depth estimation (Spearman correlation of 0.83), and effective user guidance with 81% evaluator-rated instruction accuracy.

The inclusion of ground hazard detection and priority-based decision-making enhances user safety by identifying critical obstacles such as pits or staircases before they are encountered. Furthermore, the dual-mode text-to-speech system with spam control ensures that guidance remains concise and non-intrusive, improving usability in dynamic environments.

Overall, the proposed AI Smart Glasses system demonstrates the feasibility of deploying an end-to-end computer visionbased assistive solution that bridges the gap between perception and actionable guidance. The results indicate strong potential for real-world application in assistive mobility, particularly as lightweight deep learning models continue to improve in efficiency and accuracy.
FUTURE WORK:

Future research on YOLOv8 object detection should focus on improving performance for underrepresented and challenging classes such as bicyclists and pedestrians. One promising direction is the expansion of datasets with balanced samples, ensuring that minority classes are adequately represented during training. Advanced data augmentation techniques, including synthetic image generation and adversarial augmentation, can help improve generalization and robustness. Incorporating multi-scale feature extraction and attention mechanisms may enhance the detection of small and occluded objects. Another area of exploration is the integration

of YOLOv8 with temporal models such as transformers or LSTMs to extend detection capabilities to video streams.

Deploying YOLOv8 on edge devices like Raspberry Pi or Jetson Nano will allow evaluation of its real-time performance in resource-constrained environments. Semi-supervised and self-supervised learning approaches could leverage large amounts of unlabeled data, reducing reliance on costly manual annotations. Comparative benchmarking against other state-of-the-art detectors such as DETR and EfficientDet will provide insights into relative strengths and weaknesses. Post-processing techniques, including ensemble learning, can be applied to reduce false positives and improve overall reliability. Finally, extending YOLOv8 into multi-task learning frameworks that combine detection with tracking, segmentation, or action recognition will broaden its applicability in complex real-world scenarios. These directions collectively aim to refine YOLOv8 into a more versatile, balanced, and scalable detection system.

8. REFERENCES

Poggi, M., Aleotti, F., Tosi, F., & Mattoccia, S. (2020). On the uncertainty of self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 32273237.
Redmon, J., & Farhadi, A. (2018). YOOv3: An incremental improvement. arXiv preprint arXiv:1804.02767.
Bochkovskiy, A., Wang, C.-Y., & Liao, H.-Y. M. (2020). YOLOv4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934.
Ranftl, R., Lasinger, K., Hafner, D., Schindler, K., & Koltun, V. (2020). Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3), 16231637.
Bhowmick, A., & Hazarika, S. M. (2017). An insight into assistive technology for the visually impaired and blind people: State-of-the-art and future trends. Journal on Multimodal User Interfaces, 11(2), 149172.
Islam, M. M., Sadi, M. S., Zamli, K. Z., & Ahmed, M. M. (2019). Developing walking assistants for visually impaired people: A review. IEEE Sensors Journal, 19(8), 28142828.
Shoval, S., Borenstein, J., & Koren, Y. (2003). Auditory guidance with the NavBelta computerized travel aid for the blind. IEEE Transactions on Systems, Man, and Cybernetics, Part C, 28(3), 459467.
Tapu, R., Mocanu, B., & Zaharia, T. (2020). Wearable assistive devices for visually impaired: A state-of-the-art survey. Pattern Recognition Letters, 137, 3752.
Ahmetovic, D., Gleason, C., Ruan, C., Kitani, K., Takagi, H., & Asakawa, C. (2016). NavCog: A navigational cognitive assistant for the blind. In Proceedings of the 18th International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI), pp. 9099.

W. Liu et al., SSD: Single shot multibox detector, in Proc. European Conf. Computer Vision (ECCV), 2016, pp. 2137.
J. Dai et al., Deformable convolutional networks, in Proc. IEEE Int. Conf. Computer Vision (ICCV), 2017, pp. 764773.
N. Carion et al., End-to-end object detection with transformers, in Proc. European Conf. Computer Vision (ECCV), 2020, pp. 213229.
Y. Xu, J. Zhang, and L. Wang, Real-time pedestrian detection using YOLOv8 for smart surveillance, IEEE Sensors Journal, vol. 24, no. 3, pp. 14561465, 2024.
R. Kumar and S. Patel, YOLOv8-based traffic sign detection for autonomous vehicles, in Proc. IEEE Int. Conf. Intelligent Transportation Systems (ITSC), 2024, pp. 11231130.
L. Zhao, H. Li, and Y. Wu, YOLOv8 for medical image analysis: Detecting lung nodules in CT scans, IEEE Journal of Biomedical and Health Informatics, vol. 28, no. 1, pp. 5564, 2024.
P. Singh and A. Verma, Comparative study of YOLOv5, YOLOv7, and YOLOv8 for agricultural object detection, in Proc. IEEE Int. Conf. Agriculture Technology (AgriTech), 2024, pp. 8996.
J. Chen, M. Huang, and Y. Liu, YOLOv8 deployment on edge devices for IoT applications, IEEE Internet of Things Journal, vol. 11, no. 5, pp. 876885, 2025.
S. Gupta and R. Sharma, Improving bicyclist detection using YOLOv8 with synthetic augmentation, in Proc. IEEE Conf. Computer Vision Workshops (CVPRW), 2025, pp. 231238.
F. Li and Z. Yang, Anchor-free detection in YOLOv8: Performance analysis on COCO dataset, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 2, pp. 345356, 2025.
Y. Zhang and M. Zhou, YOLOv8 for aerial imagery: Detecting vehicles in UAV datasets, IEEE Geoscience and Remote Sensing Letters, vol. 22, no. 4, pp. 11201127, 2025.

AI Smart Glasses: A Real-Time Navigation and Spatial Awareness System for Assistive Mobility

YOLOv8 TRAINING CURVES