TITLE:
A Multi-Modal Approach for Arabic Sign Language Gesture Recognition Using Deep Learning
AUTHORS:
Nouf Alharbi
KEYWORDS:
Arabic Sign Language, Gesture Recognition, Deep Learning, Multi-Modal Feature Extraction, Attention-Based Fusion, CNN, Transformer, Depth-CNN, MLP, KAN
JOURNAL NAME:
Journal of Intelligent Learning Systems and Applications, Vol.18 No.1, January 29, 2026
ABSTRACT: This paper proposes a multi-modal deep learning framework for Arabic Sign Language (ArSL) recognition that handles both static and dynamic gestures. The framework extracts spatial, temporal, and depth features with CNN, Transformer, and Depth-CNN models, respectively, and combines them through an attention-based fusion mechanism. A hierarchical recognition approach first classifies each gesture as static or dynamic, then passes it to a specialized model: MobileNetV3 for dynamic gestures and an MLP-KAN hybrid for static gestures. Evaluated on four datasets (Kaggle ASL, ArSL2018, DArSL50, KSU-ArSL), the system achieves 98.4% overall accuracy with real-time inference times of 0.007 seconds for static gestures and 0.02 seconds for dynamic gestures. Ablation studies confirm the importance of multi-modal fusion, with attention-based fusion improving accuracy by 11% over simple concatenation. The system generalizes well across diverse datasets and conditions, making it suitable for real-world deployment in assistive communication technologies.
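To make the contrast between attention-based fusion and simple concatenation concrete, the following is a minimal sketch and not the paper's implementation. The abstract does not specify the fusion module, so everything here is an assumption for illustration: the function names `attention_fuse` and `concat_fuse`, the scoring vector `score_weights` (standing in for learned parameters), the dot-product scoring rule, and the toy feature values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(features, score_weights):
    """Fuse equal-length modality vectors with softmax attention weights.

    features      -- list of feature vectors, one per modality
    score_weights -- hypothetical learned vector used to score each modality
    Returns the fused vector and the per-modality attention weights.
    """
    # One scalar relevance score per modality (dot product with score_weights).
    scores = [sum(w * x for w, x in zip(score_weights, f)) for f in features]
    alphas = softmax(scores)
    dim = len(features[0])
    # Weighted sum of the modality vectors: output keeps the input dimension.
    fused = [sum(a * f[i] for a, f in zip(alphas, features)) for i in range(dim)]
    return fused, alphas


def concat_fuse(features):
    """Baseline: concatenate all modality vectors with no weighting."""
    return [x for f in features for x in f]


# Toy feature vectors (made-up values, not taken from the paper).
spatial = [0.2, 0.9, 0.1, 0.4]   # e.g. CNN output
temporal = [0.7, 0.1, 0.3, 0.5]  # e.g. Transformer output
depth = [0.3, 0.3, 0.8, 0.2]     # e.g. Depth-CNN output
score_weights = [0.5, -0.2, 0.1, 0.3]  # hypothetical learned parameters

fused, alphas = attention_fuse([spatial, temporal, depth], score_weights)
concatenated = concat_fuse([spatial, temporal, depth])

print("attention weights:", alphas)
print("fused vector:", fused)
print("concatenated length:", len(concatenated))
```

In this sketch the attention weights form a convex combination, so the fused vector keeps the per-modality dimension and lets one modality dominate when its score is higher, whereas concatenation passes all features through unweighted and triples the vector length.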