<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.4 20241031//EN" "JATS-journalpublishing1-4.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article" dtd-version="1.4" xml:lang="en">
  <front>
    <journal-meta>
      <journal-id journal-id-type="publisher-id">jilsa</journal-id>
      <journal-title-group>
        <journal-title>Journal of Intelligent Learning Systems and Applications</journal-title>
      </journal-title-group>
      <issn pub-type="epub">2150-8410</issn>
      <issn pub-type="ppub">2150-8402</issn>
      <publisher>
        <publisher-name>Scientific Research Publishing</publisher-name>
      </publisher>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.4236/jilsa.2026.181002</article-id>
      <article-id pub-id-type="publisher-id">jilsa-149228</article-id>
      <article-categories>
        <subj-group>
          <subject>Article</subject>
        </subj-group>
        <subj-group>
          <subject>Computer Science</subject>
          <subject>Communications</subject>
        </subj-group>
      </article-categories>
      <title-group>
        <article-title>A Multi-Modal Approach for Arabic Sign Language Gesture Recognition Using Deep Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name name-style="western">
            <surname>Alharbi</surname>
            <given-names>Nouf</given-names>
          </name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
      </contrib-group>
      <aff id="aff1"><label>1</label> College of Computer Science and Engineering, Taibah University, Madinah, Saudi Arabia </aff>
      <author-notes>
        <fn fn-type="conflict" id="fn-conflict">
          <p>The author declares no conflicts of interest regarding the publication of this paper.</p>
        </fn>
      </author-notes>
      <pub-date pub-type="epub">
        <day>02</day>
        <month>02</month>
        <year>2026</year>
      </pub-date>
      <pub-date pub-type="collection">
        <month>02</month>
        <year>2026</year>
      </pub-date>
      <volume>18</volume>
      <issue>01</issue>
      <fpage>11</fpage>
      <lpage>21</lpage>
      <history>
        <date date-type="received">
          <day>15</day>
          <month>10</month>
          <year>2025</year>
        </date>
        <date date-type="accepted">
          <day>26</day>
          <month>01</month>
          <year>2026</year>
        </date>
        <date date-type="published">
          <day>29</day>
          <month>01</month>
          <year>2026</year>
        </date>
      </history>
      <permissions>
        <copyright-statement>© 2026 by the authors and Scientific Research Publishing Inc.</copyright-statement>
        <copyright-year>2026</copyright-year>
        <license license-type="open-access">
          <license-p>This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (<ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</ext-link>).</license-p>
        </license>
      </permissions>
      <self-uri content-type="doi" xlink:href="https://doi.org/10.4236/jilsa.2026.181002">https://doi.org/10.4236/jilsa.2026.181002</self-uri>
      <abstract>
        <p>This paper proposes a multi-modal deep learning framework for Arabic Sign Language (ArSL) recognition, addressing the challenges of both static and dynamic gesture recognition. The framework integrates spatial, temporal, and depth features using CNN, Transformer, and Depth-CNN models, combined via an attention-based fusion mechanism. A hierarchical recognition approach first classifies gestures as static or dynamic, then processes them with specialized models: MobileNetV3 for dynamic gestures and an MLP-KAN hybrid for static gestures. Evaluated on four ArSL datasets (Kaggle ASL, ArSL2018, DArSL50, KSU-ArSL), the system achieves 98.4% overall accuracy with real-time inference speeds of 0.007 seconds for static gestures and 0.02 seconds for dynamic gestures. Ablation studies confirm the importance of multi-modal fusion, with attention-based fusion improving accuracy by 11% compared to simple concatenation. The system demonstrates strong generalization across diverse datasets and conditions, making it suitable for real-world deployment in assistive communication technologies.</p>
      </abstract>
      <kwd-group kwd-group-type="author-generated" xml:lang="en">
        <kwd>Arabic Sign Language</kwd>
        <kwd>Gesture Recognition</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Multi-Modal Feature Extraction</kwd>
        <kwd>Attention-Based Fusion</kwd>
        <kwd>CNN</kwd>
        <kwd>Transformer</kwd>
        <kwd>Depth-CNN</kwd>
        <kwd>MLP</kwd>
        <kwd>KAN</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec1">
      <title>1. Introduction</title>
      <p>Sign language is the primary mode of communication for millions of deaf and hard-of-hearing individuals worldwide. According to the World Health Organization (WHO), over 1.5 billion people globally experience some degree of hearing loss, with approximately 430 million requiring rehabilitation for disabling hearing loss [<xref ref-type="bibr" rid="B1">1</xref>]. Within the Middle East and North Africa (MENA) region, the prevalence of hearing disabilities is substantial, with more than 11 million individuals affected [<xref ref-type="bibr" rid="B2">2</xref>].</p>
      <p>Unlike American Sign Language (ASL) and British Sign Language (BSL), which have well-documented linguistic structures and large annotated datasets, Arabic Sign Language (ArSL) presents unique challenges due to dialectal variations, limited datasets, and the complexity of dynamic gestures [<xref ref-type="bibr" rid="B3">3</xref>]. ArSL lacks a standardized form, as different Arab countries have developed their own dialects. This variation makes it difficult to create a unified recognition model that generalizes across multiple regions. Existing ArSL recognition systems focus primarily on static gestures, often neglecting the dynamic movements that are important for understanding continuous sign language communication [<xref ref-type="bibr" rid="B4">4</xref>].</p>
      <p>Several approaches have been explored for Sign Language Recognition (SLR), including traditional computer vision techniques, Deep Learning (DL)-based image classification, and sequence modeling using Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, and Transformers [<xref ref-type="bibr" rid="B5">5</xref>]. Early methods relied on handcrafted feature extraction using edge detection, Histogram of Oriented Gradients (HOG), and optical flow analysis [<xref ref-type="bibr" rid="B6">6</xref>]-[<xref ref-type="bibr" rid="B8">8</xref>]. However, these methods struggled with variations in lighting, background noise, and signer differences. With the advancement of DL, Convolutional Neural Networks (CNNs) have been widely used for spatial feature extraction from sign images [<xref ref-type="bibr" rid="B6">6</xref>][<xref ref-type="bibr" rid="B8">8</xref>][<xref ref-type="bibr" rid="B9">9</xref>]. While CNNs excel in static gesture recognition, they fail to capture temporal dependencies. To address this, RNNs and LSTMs have been employed for sequential modeling, but they suffer from vanishing gradient problems and high computational costs [<xref ref-type="bibr" rid="B5">5</xref>][<xref ref-type="bibr" rid="B10">10</xref>]. More recently, Transformer-based models have gained popularity due to their ability to model long-range dependencies efficiently [<xref ref-type="bibr" rid="B11">11</xref>].</p>
      <p>To address these challenges, this study proposes a multi-modal DL framework for ArSL recognition, incorporating spatial, temporal, and depth features. The key contributions of this work are:</p>
      <p><bold>Hierarchical Recognition Framework</bold>: A novel two-tier approach that first classifies gestures into static or dynamic categories using a CNN classifier, followed by specialized models for precise recognition.</p>
      <p><bold>Attention-Based Multi-Modal Feature Fusion</bold>: A fusion mechanism that integrates spatial (CNN), temporal (Transformer), and depth (Depth-CNN) features to achieve rich feature representation.</p>
      <p><bold>Extensive Benchmarking on Multiple ArSL Datasets</bold>: Four distinct ArSL datasets are used to assess the proposed methodology.</p>
      <p><bold>Efficient Real-Time Processing</bold>: The system achieves over 98% accuracy with an inference time of 0.007 seconds for static gestures and 0.02 seconds for dynamic gestures.</p>
    </sec>
    <sec id="sec2">
      <title>2. Related Works</title>
      <p>SLR has emerged as a critical research area aimed at bridging communication barriers for deaf and mute communities. A number of studies have explored different methodologies for ArSL recognition, employing Machine Learning (ML), DL, and hybrid models to address the inherent challenges in static and dynamic gesture recognition.</p>
      <p>Tharwat, Ahmed, and Bouallegue (2021) proposed a vision-based system for recognizing ArSL alphabets, emphasizing the use of traditional ML techniques [<xref ref-type="bibr" rid="B12">12</xref>]. Their Arabic Alphabet Sign Language Recognition System (AArSLRS) employed a dataset of 9240 images and achieved 99.5% accuracy with KNN under controlled conditions, but was limited to static gestures.</p>
      <p>Duwairi and Halloush (2022) employed transfer learning techniques for ArSL alphabet recognition, utilizing pre-trained models like AlexNet, VGGNet, and GoogleNet [<xref ref-type="bibr" rid="B6">6</xref>]. Using the ArSL2018 dataset comprising 54,049 images of 32 Arabic characters, VGGNet achieved an accuracy of 97%, but focused only on static gestures.</p>
      <p>Noor <italic>et al</italic>. (2024) proposed a hybrid model integrating CNN and LSTM networks for both static and dynamic gestures [<xref ref-type="bibr" rid="B5">5</xref>]. Their framework utilized a custom dataset of 4000 images for static gestures and 500 videos for dynamic sequences, achieving accuracies of 94.4% and 82.7% respectively.</p>
      <p>Ameer <italic>et al</italic>. (2024) extended the focus on dynamic gesture recognition with an attention-based LSTM model [<xref ref-type="bibr" rid="B10">10</xref>]. Their DArSL50 dataset, comprising 50 dynamic gestures across 7500 videos, served as the foundation, achieving accuracies of 85% for individual volunteers.</p>
      <p>Zakariah <italic>et al</italic>. (2022) explored transfer learning for ArSL recognition using the EfficientNetB4 architecture [<xref ref-type="bibr" rid="B13">13</xref>]. Using the ArSL2018 dataset, EfficientNetB4 achieved a testing accuracy of 95%, but relied on single-hand static gestures with high computational requirements.</p>
      <p>Hdioud and Tirari (2023) proposed a DL-based ArSL system designed to recognize Arabic letters [<xref ref-type="bibr" rid="B14">14</xref>]. Their approach combined pre-processing with MediaPipe and a custom CNN architecture, achieving 97.07% validation accuracy, but was limited to static gestures.</p>
      <p>Alharthi and Alzahrani (2023) explored the integration of Vision Transformer (ViT) and transfer learning [<xref ref-type="bibr" rid="B7">7</xref>]. Their study utilized pretrained models like InceptionResNetV2, ViT, and Swin, achieving a maximum accuracy of 98.17% with InceptionResNetV2.</p>
      <p>Al Khuzayem <italic>et al</italic>. (2024) focused on Saudi Sign Language (SSL) recognition with their Efhamni application, which applied a CNN and Bidirectional Long Short-Term Memory (BiLSTM) architecture [<xref ref-type="bibr" rid="B15">15</xref>]. The model was trained on the KSU-SSL dataset, achieving precision of 94.61%, recall of 94.56%, and F1-score of 94.52%.</p>
      <p>Alsolai <italic>et al</italic>. (2024) proposed the SLDC-RSAHDL framework, which utilized MobileNet for feature extraction, coupled with a hybrid DL model integrating CNN and LSTM layers [<xref ref-type="bibr" rid="B8">8</xref>]. Using the ASL alphabet dataset, the system achieved an accuracy of 99.51%.</p>
      <p>While these studies have advanced the state of the art in ArSL recognition, several gaps remain: limited handling of dynamic gestures, small or non-diverse datasets, high computational requirements, single-modality approaches, and lack of attention to dialectal variations [<xref ref-type="bibr" rid="B16">16</xref>][<xref ref-type="bibr" rid="B17">17</xref>].</p>
    </sec>
    <sec id="sec3">
      <title>3. Proposed Methodology</title>
      <sec id="sec3dot1">
        <title>3.1. Overview</title>
        <p><xref ref-type="fig" rid="fig1">Figure 1</xref> illustrates the proposed methodology for ArSL recognition, which comprises five phases: dataset collection, pre-processing, multi-modal feature extraction, attention-based fusion, and hierarchical recognition. Four datasets are collected: Kaggle ASL, ArSL2018, DArSL50, and KSU-ArSL. Together they cover the static and dynamic gestures needed to train the model. Pre-processing involves data normalization, keypoint extraction using MediaPipe, and depth map generation. Feature extraction uses three models: a CNN for spatial features, a Transformer for temporal features, and a Depth-CNN for depth features. The extracted features are fused using an attention-based mechanism. The recognition framework consists of two tiers. Tier 1 classifies gestures as static or dynamic using a CNN classifier. Tier 2 processes dynamic gestures with MobileNetV3 and static gestures with a hybrid MLP and KAN model.</p>
        <fig id="fig1">
          <label>Figure 1</label>
          <graphic xlink:href="https://html.scirp.org/file/9601742-rId15.jpeg?20260129024858" />
        </fig>
        <p><bold>Figure 1.</bold> Overview of the proposed methodology for Arabic SLR.</p>
      </sec>
      <sec id="sec3dot2">
        <title>3.2. Multi-Modal Feature Extraction</title>
        <p>Multi-modal feature extraction plays a critical role in capturing diverse aspects of ArSL gestures. This phase extracts three distinct feature types: spatial, temporal, and depth features. Spatial features are extracted using a CNN, focusing on the image-based representation of static gestures. Temporal features are captured through a Transformer model, which processes the time-dependent characteristics of dynamic gestures. Depth features are obtained through a Depth-CNN, which analyzes the depth information from depth maps.</p>
        <p>3.2.1. Spatial Features</p>
        <p>Spatial features capture essential patterns, shapes, and edges from hand gesture images. The CNN model consists of four convolutional layers, each followed by a max-pooling operation. <xref ref-type="fig" rid="fig2">Figure 2</xref> illustrates the architecture. <bold>Table 1</bold> presents the detailed hyperparameter settings.</p>
        <fig id="fig2">
          <label>Figure 2</label>
          <graphic xlink:href="https://html.scirp.org/file/9601742-rId16.jpeg?20260129024858" />
        </fig>
        <p><bold>Figure 2.</bold> Visual architecture illustration of the CNN model for spatial feature extraction.</p>
        <p><bold>Table 1.</bold> Hyperparameter details for CNN architecture (Spatial Feature Extraction).</p>
        <table-wrap id="tbl1">
          <label>Table 1</label>
          <table>
            <tbody>
              <tr>
                <td>
                  <bold>Hyperparameter</bold>
                </td>
                <td>
                  <bold>Value</bold>
                </td>
              </tr>
              <tr>
                <td>Learning Rate</td>
                <td>0.001</td>
              </tr>
              <tr>
                <td>Batch Size</td>
                <td>32</td>
              </tr>
              <tr>
                <td>Epochs</td>
                <td>50</td>
              </tr>
              <tr>
                <td>Optimizer</td>
                <td>Adam</td>
              </tr>
              <tr>
                <td>Dropout Rate</td>
                <td>0.25</td>
              </tr>
              <tr>
                <td>Filter Sizes</td>
                <td>3 × 3</td>
              </tr>
              <tr>
                <td>Number of Filters</td>
                <td>32 (1st), 64 (2nd), 128 (3rd), 256 (4th)</td>
              </tr>
              <tr>
                <td>Pooling Type</td>
                <td>Max Pooling</td>
              </tr>
              <tr>
                <td>Pooling Window Size</td>
                <td>2 × 2</td>
              </tr>
              <tr>
                <td>Activation Function</td>
                <td>ReLU (hidden layers)</td>
              </tr>
              <tr>
                <td>Fully Connected Layer 1 Units</td>
                <td>512</td>
              </tr>
              <tr>
                <td>Fully Connected Layer 2 Units</td>
                <td>256</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>3.2.2. Temporal Features</p>
        <p>Temporal features capture motion patterns and sequential dependencies within dynamic gestures. The Transformer model processes sequential input to learn contextual relationships.</p>
        <p><bold>Table 2</bold> presents the hyperparameter settings. </p>
        <p><bold>Table 2.</bold> Hyperparameter settings for the Transformer model.</p>
        <table-wrap id="tbl2">
          <label>Table 2</label>
          <table>
            <tbody>
              <tr>
                <td>
                  <bold>Hyperparameter</bold>
                </td>
                <td>
                  <bold>Value</bold>
                </td>
              </tr>
              <tr>
                <td>Embedding Dimension</td>
                <td>512</td>
              </tr>
              <tr>
                <td>Number of Attention Heads</td>
                <td>8</td>
              </tr>
              <tr>
                <td>Number of Transformer Layers</td>
                <td>6</td>
              </tr>
              <tr>
                <td>Feedforward Dimension</td>
                <td>2048</td>
              </tr>
              <tr>
                <td>Dropout Rate</td>
                <td>0.1</td>
              </tr>
              <tr>
                <td>Optimizer</td>
                <td>AdamW</td>
              </tr>
              <tr>
                <td>Learning Rate</td>
                <td>0.0001</td>
              </tr>
              <tr>
                <td>Batch Size</td>
                <td>32</td>
              </tr>
              <tr>
                <td>Number of Epochs</td>
                <td>50</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
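        <p>As a quick consistency check on the values in Table 2, the sketch below derives the per-head dimension and an approximate parameter count for the encoder. It assumes a standard Transformer encoder layer (multi-head self-attention followed by a two-layer feedforward block, each with a layer normalization); the source does not describe the layer layout, and the class and field names are illustrative only.</p>
        <code language="python">
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformerConfig:
    """Values copied from Table 2; the field names are illustrative."""
    embed_dim: int = 512
    num_heads: int = 8
    num_layers: int = 6
    ff_dim: int = 2048
    dropout: float = 0.1

    @property
    def head_dim(self) -> int:
        # Multi-head attention splits the embedding evenly across heads.
        if self.embed_dim % self.num_heads != 0:
            raise ValueError("embed_dim must be divisible by num_heads")
        return self.embed_dim // self.num_heads

    def encoder_layer_params(self) -> int:
        """Trainable parameters in one encoder layer (assumed standard layout)."""
        d, f = self.embed_dim, self.ff_dim
        attention = 4 * (d * d + d)               # Q, K, V, output projections with biases
        feed_forward = (d * f + f) + (f * d + d)  # two dense layers with biases
        layer_norms = 2 * (2 * d)                 # two LayerNorms, each with scale and shift
        return attention + feed_forward + layer_norms

    def encoder_params(self) -> int:
        # Embedding and classification layers are not counted here.
        return self.num_layers * self.encoder_layer_params()


config = TransformerConfig()
print(config.head_dim)                # 64
print(config.encoder_layer_params())  # 3152384
print(config.encoder_params())        # 18914304
```
        </code>
        <p>Under these assumptions each attention head works on 64 dimensions and the six-layer encoder holds roughly 18.9 million parameters, excluding any input embedding or output head.</p>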
        <p>3.2.3. Depth Features</p>
        <p>Depth features capture the three-dimensional structure of hand movements. The Depth-CNN model consists of four convolutional layers followed by max-pooling operations. <xref ref-type="fig" rid="fig3">Figure 3</xref> illustrates the architecture. <bold>Table 3</bold> presents the hyperparameters.</p>
        <p><bold>Table 3.</bold> Hyperparameter settings for the Depth-CNN model.</p>
        <table-wrap id="tbl3">
          <label>Table 3</label>
          <table>
            <tbody>
              <tr>
                <td>
                  <bold>Hyperparameter</bold>
                </td>
                <td>
                  <bold>Value</bold>
                </td>
              </tr>
              <tr>
                <td>Input Depth Map Size</td>
                <td>224 × 224</td>
              </tr>
              <tr>
                <td>Number of Convolutional Layers</td>
                <td>4</td>
              </tr>
              <tr>
                <td>Filter Sizes</td>
                <td>3 × 3</td>
              </tr>
              <tr>
                <td>Number of Filters</td>
                <td>32, 64, 128, 256</td>
              </tr>
              <tr>
                <td>Dropout Rate</td>
                <td>0.3</td>
              </tr>
              <tr>
                <td>Optimizer</td>
                <td>Adam</td>
              </tr>
              <tr>
                <td>Learning Rate</td>
                <td>0.001</td>
              </tr>
              <tr>
                <td>Batch Size</td>
                <td>32</td>
              </tr>
              <tr>
                <td>Number of Epochs</td>
                <td>50</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <fig id="fig3">
          <label>Figure 3</label>
          <graphic xlink:href="https://html.scirp.org/file/9601742-rId17.jpeg?20260129024859" />
        </fig>
        <p><bold>Figure 3.</bold> Visual architecture illustration of the Depth-CNN model for depth feature extraction.</p>
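        <p>To make the Depth-CNN dimensions in Table 3 concrete, the following sketch traces the feature-map size and convolutional parameter count through the four blocks. Several details are assumptions not stated in the source: a single-channel depth map, stride-1 convolutions with "same" padding, and a 2 × 2 max-pooling step after every convolution. The helper name <monospace>conv_pool_stack</monospace> is hypothetical.</p>
        <code language="python">
```python
def conv_pool_stack(input_size, in_channels, filters, kernel=3, pool=2):
    """Trace feature-map size and parameter count through conv + max-pool blocks.

    Assumes 'same' padding and stride 1 for each convolution, so only the
    pooling step shrinks the spatial size. These choices are assumptions.
    """
    size, channels, total_params, trace = input_size, in_channels, 0, []
    for out_channels in filters:
        # Weights (kernel * kernel * in * out) plus one bias per output channel.
        params = kernel * kernel * channels * out_channels + out_channels
        size = size // pool  # non-overlapping max pooling halves each side
        total_params += params
        channels = out_channels
        trace.append((size, size, channels, params))
    return trace, total_params


# Table 3 values: 224 x 224 depth map, four layers, 3 x 3 filters.
# A single input channel is assumed for the depth map.
trace, total = conv_pool_stack(224, 1, [32, 64, 128, 256])
for height, width, channels, params in trace:
    print(f"{height} x {width} x {channels}  ({params} parameters)")
print("flattened features:", trace[-1][0] * trace[-1][1] * trace[-1][2])
print("convolutional parameters:", total)
```
        </code>
        <p>With these assumptions the final feature map is 14 × 14 × 256, which flattens to 50,176 values, and the four convolutional layers together hold 387,840 parameters.</p>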
        <p>3.2.4. Attention-Based Fusion</p>
        <p>An attention-based fusion mechanism is employed to assign adaptive importance to each feature representation. Let the extracted feature vectors be: </p>
        <disp-formula id="FD1">
          <label>(1)</label>
          <mml:math>
            <mml:mrow>
              <mml:msub>
                <mml:mi>F</mml:mi>
                <mml:mi>s</mml:mi>
              </mml:msub>
              <mml:mo>∈</mml:mo>
              <mml:msup>
                <mml:mi>ℝ</mml:mi>
                <mml:mrow>
                  <mml:msub>
                    <mml:mi>d</mml:mi>
                    <mml:mi>s</mml:mi>
                  </mml:msub>
                </mml:mrow>
              </mml:msup>
              <mml:mo>,</mml:mo>
              <mml:mtext>
                 
              </mml:mtext>
              <mml:mtext>
                 
              </mml:mtext>
              <mml:msub>
                <mml:mi>F</mml:mi>
                <mml:mi>t</mml:mi>
              </mml:msub>
              <mml:mo>∈</mml:mo>
              <mml:msup>
                <mml:mi>ℝ</mml:mi>
                <mml:mrow>
                  <mml:msub>
                    <mml:mi>d</mml:mi>
                    <mml:mi>t</mml:mi>
                  </mml:msub>
                </mml:mrow>
              </mml:msup>
              <mml:mo>,</mml:mo>
              <mml:mtext>
                 
              </mml:mtext>
              <mml:mtext>
                 
              </mml:mtext>
              <mml:msub>
                <mml:mi>F</mml:mi>
                <mml:mi>d</mml:mi>
              </mml:msub>
              <mml:mo>∈</mml:mo>
              <mml:msup>
                <mml:mi>ℝ</mml:mi>
                <mml:mrow>
                  <mml:msub>
                    <mml:mi>d</mml:mi>
                    <mml:mi>d</mml:mi>
                  </mml:msub>
                </mml:mrow>
              </mml:msup>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <p>These vectors are concatenated into a single vector <inline-formula><mml:math><mml:mrow><mml:mi>F</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mi>F</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo>;</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>;</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mi>d</mml:mi></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula>, from which the attention weights are computed:</p>
        <disp-formula id="FD2">
          <label>(2)</label>
          <mml:math>
            <mml:mrow>
              <mml:mi>α</mml:mi>
              <mml:mo>=</mml:mo>
              <mml:mtext>softmax</mml:mtext>
              <mml:mrow>
                <mml:mo>(</mml:mo>
                <mml:mrow>
                  <mml:mi>W</mml:mi>
                  <mml:mi>F</mml:mi>
                </mml:mrow>
                <mml:mo>)</mml:mo>
              </mml:mrow>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <p>where <inline-formula><mml:math><mml:mrow><mml:mi>α</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mi>α</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>α</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>α</mml:mi><mml:mi>d</mml:mi></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> holds one weight per modality and <inline-formula><mml:math><mml:mi>W</mml:mi></mml:math></inline-formula> is a learned weight matrix. The weighted fusion is:</p>
        <disp-formula id="FD3">
          <label>(3)</label>
          <mml:math>
            <mml:mrow>
              <mml:msub>
                <mml:mi>F</mml:mi>
                <mml:mrow>
                  <mml:mtext>fusion</mml:mtext>
                </mml:mrow>
              </mml:msub>
              <mml:mo>=</mml:mo>
              <mml:msub>
                <mml:mi>α</mml:mi>
                <mml:mi>s</mml:mi>
              </mml:msub>
              <mml:msub>
                <mml:mi>F</mml:mi>
                <mml:mi>s</mml:mi>
              </mml:msub>
              <mml:mo>+</mml:mo>
              <mml:msub>
                <mml:mi>α</mml:mi>
                <mml:mi>t</mml:mi>
              </mml:msub>
              <mml:msub>
                <mml:mi>F</mml:mi>
                <mml:mi>t</mml:mi>
              </mml:msub>
              <mml:mo>+</mml:mo>
              <mml:msub>
                <mml:mi>α</mml:mi>
                <mml:mi>d</mml:mi>
              </mml:msub>
              <mml:msub>
                <mml:mi>F</mml:mi>
                <mml:mi>d</mml:mi>
              </mml:msub>
            </mml:mrow>
          </mml:math>
        </disp-formula>
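        <p>Equations (2) and (3) can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in this work. It assumes that the three feature vectors have already been brought to a common dimension, since the weighted sum in Equation (3) is only defined in that case and the source does not state how this is done. It further assumes that <monospace>W</monospace> has three rows, one per modality. The function names are hypothetical.</p>
        <code language="python">
```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fusion(f_s, f_t, f_d, weights):
    """Fuse three equal-length feature vectors following Equations (2)-(3).

    weights is a 3 x (3 * d) matrix W (assumed layout): row i scores
    modality i from the concatenated vector F = [F_s; F_t; F_d].
    """
    if not (len(f_s) == len(f_t) == len(f_d)):
        raise ValueError("the weighted sum needs a common feature dimension")
    concatenated = f_s + f_t + f_d
    scores = [sum(w * x for w, x in zip(row, concatenated)) for row in weights]
    alpha = softmax(scores)  # Equation (2): alpha = softmax(W F)
    # Equation (3): element-wise weighted sum of the three modalities.
    fused = [alpha[0] * s + alpha[1] * t + alpha[2] * d
             for s, t, d in zip(f_s, f_t, f_d)]
    return alpha, fused


# Toy example with d = 2 and an all-zero W, which yields uniform weights.
f_s, f_t, f_d = [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]
W = [[0.0] * 6 for _ in range(3)]
alpha, fused = attention_fusion(f_s, f_t, f_d, W)
print(alpha)
print(fused)
```
        </code>
        <p>With an all-zero weight matrix every modality receives the same weight, so the fused vector reduces to the plain average of the three inputs. A trained <monospace>W</monospace> would shift the weights toward the more informative modality for each sample.</p>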
      </sec>
      <sec id="sec3dot3">
        <title>3.3. Hierarchical Recognition Framework</title>
        <p>The framework consists of two tiers. Tier 1 classifies gestures as static or dynamic. Tier 2 processes them with specialized models.</p>
        <fig id="fig4">
          <label>Figure 4</label>
          <graphic xlink:href="https://html.scirp.org/file/9601742-rId28.jpeg?20260129024859" />
        </fig>
        <p><bold>Figure 4.</bold> Architecture of the CNN model used for static vs. dynamic gesture classification.</p>
        <p>3.3.1. Tier 1—Static vs. Dynamic Gesture Classification</p>
        <p>A CNN classifier with two convolutional layers is used. <xref ref-type="fig" rid="fig4">Figure 4</xref> illustrates the architecture. <bold>Table 4</bold> presents the hyperparameters.</p>
        <p><bold>Table 4.</bold> Hyperparameters of the CNN model for static vs. dynamic gesture classification.</p>
        <table-wrap id="tbl4">
          <label>Table 4</label>
          <table>
            <tbody>
              <tr>
                <td>
                  <bold>Hyperparameter</bold>
                </td>
                <td>
                  <bold>Value</bold>
                </td>
              </tr>
              <tr>
                <td>Learning Rate</td>
                <td>0.001</td>
              </tr>
              <tr>
                <td>Batch Size</td>
                <td>32</td>
              </tr>
              <tr>
                <td>Epochs</td>
                <td>20</td>
              </tr>
              <tr>
                <td>Optimizer</td>
                <td>Adam</td>
              </tr>
              <tr>
                <td>Dropout Rate</td>
                <td>0.2</td>
              </tr>
              <tr>
                <td>Number of Filters</td>
                <td>16 (1st), 32 (2nd)</td>
              </tr>
              <tr>
                <td>Fully Connected Layer Units</td>
                <td>128</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>3.3.2. Tier 2—Gesture Recognition</p>
        <p>Dynamic gestures are recognized using MobileNetV3. Static gestures are classified using a hybrid MLP and KAN model.</p>
        <p><bold>Dynamic Gesture Recognition</bold></p>
        <p>MobileNetV3 is configured with the hyperparameters listed in <bold>Table 5</bold>.</p>
        <p><bold>Table 5.</bold> Hyperparameters of MobileNetV3 for dynamic gesture recognition.</p>
        <table-wrap id="tbl5">
          <label>Table 5</label>
          <table>
            <tbody>
              <tr>
                <td>
                  <bold>Hyperparameter</bold>
                </td>
                <td>
                  <bold>Value</bold>
                </td>
              </tr>
              <tr>
                <td>Architecture Type</td>
                <td>MobileNetV3-Small</td>
              </tr>
              <tr>
                <td>Activation Function</td>
                <td>Hard-Swish</td>
              </tr>
              <tr>
                <td>Dropout Rate</td>
                <td>0.2</td>
              </tr>
              <tr>
                <td>Optimizer</td>
                <td>Adam</td>
              </tr>
              <tr>
                <td>Learning Rate</td>
                <td>0.0005</td>
              </tr>
              <tr>
                <td>Batch Size</td>
                <td>32</td>
              </tr>
              <tr>
                <td>Number of Epochs</td>
                <td>50</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p><bold>Static Gesture Recognition</bold></p>
        <p>Static gestures are classified with the hybrid MLP-KAN model. The hyperparameters of the MLP and KAN components are listed in <bold>Table 6</bold> and <bold>Table 7</bold>, respectively.</p>
        <p><bold>Table 6.</bold> Hyperparameters of MLP for static gesture recognition.</p>
        <table-wrap id="tbl6">
          <label>Table 6</label>
          <table>
            <tbody>
              <tr>
                <td>
                  <bold>Hyperparameter</bold>
                </td>
                <td>
                  <bold>Value</bold>
                </td>
              </tr>
              <tr>
                <td>Number of Hidden Layers</td>
                <td>2</td>
              </tr>
              <tr>
                <td>Hidden Layer 1 Units</td>
                <td>512</td>
              </tr>
              <tr>
                <td>Hidden Layer 2 Units</td>
                <td>256</td>
              </tr>
              <tr>
                <td>Dropout Rate</td>
                <td>0.3</td>
              </tr>
              <tr>
                <td>Optimizer</td>
                <td>Adam</td>
              </tr>
              <tr>
                <td>Learning Rate</td>
                <td>0.001</td>
              </tr>
              <tr>
                <td>Batch Size</td>
                <td>32</td>
              </tr>
              <tr>
                <td>Number of Epochs</td>
                <td>50</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p><bold>Table 7.</bold> Hyperparameters of KAN for static gesture recognition.</p>
        <table-wrap id="tbl7">
          <label>Table 7</label>
          <table>
            <tbody>
              <tr>
                <td>
                  <bold>Hyperparameter</bold>
                </td>
                <td>
                  <bold>Value</bold>
                </td>
              </tr>
              <tr>
                <td>Number of Layers</td>
                <td>3</td>
              </tr>
              <tr>
                <td>Knowledge Module Type</td>
                <td>Graph-Based</td>
              </tr>
              <tr>
                <td>Hidden Layer 1 Units</td>
                <td>512</td>
              </tr>
              <tr>
                <td>Hidden Layer 2 Units</td>
                <td>256</td>
              </tr>
              <tr>
                <td>Dropout Rate</td>
                <td>0.3</td>
              </tr>
              <tr>
                <td>Optimizer</td>
                <td>AdamW</td>
              </tr>
              <tr>
                <td>Learning Rate</td>
                <td>0.0001</td>
              </tr>
              <tr>
                <td>Batch Size</td>
                <td>32</td>
              </tr>
              <tr>
                <td>Number of Epochs</td>
                <td>50</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
    </sec>
    <sec id="sec4">
      <title>4. Experimental Setup</title>
      <p>The experiments were implemented in Python with TensorFlow and run on a system with an NVIDIA GPU. The datasets were split into training (70%), validation (15%), and test (15%) subsets. The evaluation metrics were accuracy, precision, recall, F1-score, training time, and inference speed.</p>
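      <p>A 70/15/15 split of this kind can be sketched as follows. This is a minimal example using only the Python standard library; the function name, the fixed seed, and the sample count of 1000 are assumptions made for illustration, and the sketch does not stratify by class, which the actual pipeline may have done.</p>
      <code language="python">
```python
import random


def split_indices(n_samples, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle sample indices and split them into train/validation/test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility

    n_train = round(n_samples * train_frac)
    n_val = round(n_samples * val_frac)

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder, about 15% of the data
    return train, val, test


# Hypothetical dataset of 1000 samples (assumption for illustration).
train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```
      </code>
      <p>Assigning the remainder to the test subset guarantees that every sample lands in exactly one subset even when the fractions do not divide the dataset size evenly.</p>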
    </sec>
    <sec id="sec5">
      <title>5. Results and Discussion</title>
      <sec id="sec5dot1">
        <title>5.1. Performance Evaluation</title>
        <p>The proposed multi-modal model achieved an overall accuracy of 98.4%. For static gestures, precision was 98.7%, recall 98.4%, and F1-score 98.5%; for dynamic gestures, precision was 97.9%, recall 98.2%, and F1-score 98.0%. <xref ref-type="fig" rid="fig5">Figure 5</xref> shows these metrics.</p>
        <fig id="fig5">
          <label>Figure 5</label>
          <graphic xlink:href="https://html.scirp.org/file/9601742-rId29.jpeg?20260129024902" />
        </fig>
        <p><bold>Figure 5.</bold> Performance evaluation metrics of the proposed model.</p>
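        <p>The reported F1-scores can be checked against the reported precision and recall, since F1 is the harmonic mean of the two. The short sketch below uses the standard definition; the function name is chosen only for this example.</p>
        <code language="python">
```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in percent) as reported for each gesture type.
static_f1 = f1_score(98.7, 98.4)
dynamic_f1 = f1_score(97.9, 98.2)

print(round(static_f1, 1), round(dynamic_f1, 1))
```
        </code>
        <p>Rounded to one decimal place, the results are 98.5 and 98.0, which agree with the F1-scores stated for static and dynamic gestures.</p>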
        <p>The model was also computationally efficient: inference took 0.007 seconds for static gestures and 0.02 seconds for dynamic gestures (<xref ref-type="fig" rid="fig6">Figure 6</xref>).</p>
        <fig id="fig6">
          <label>Figure 6</label>
          <graphic xlink:href="https://html.scirp.org/file/9601742-rId30.jpeg?20260129024901" />
        </fig>
        <p><bold>Figure 6.</bold> Training and inference time comparison.</p>
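        <p>To relate these latencies to throughput, the sketch below converts a per-gesture inference time into gestures processed per second and shows one common way to measure mean latency. It assumes gestures are processed one at a time; the helper names, the warm-up count, and the number of timed runs are hypothetical choices for this example, not the measurement protocol used in this paper.</p>
        <code language="python">
```python
import time


def mean_latency_seconds(predict, sample, n_runs=100, n_warmup=10):
    """Average wall-clock time of predict(sample) over n_runs calls."""
    for _ in range(n_warmup):
        predict(sample)  # warm-up calls are not timed
    start = time.perf_counter()
    for _ in range(n_runs):
        predict(sample)
    return (time.perf_counter() - start) / n_runs


def throughput_per_second(latency_seconds):
    """Gestures per second when gestures are processed sequentially."""
    return 1.0 / latency_seconds


# Latencies reported in the text, in seconds.
print(round(throughput_per_second(0.007), 1))  # static gestures
print(round(throughput_per_second(0.02), 1))   # dynamic gestures
```
        </code>
        <p>Under the sequential-processing assumption, the reported latencies correspond to roughly 143 static gestures and 50 dynamic gestures per second.</p>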
      </sec>
      <sec id="sec5dot2">
        <title>5.2. Ablation Study</title>
        <p>Ablation studies confirmed the importance of each component. Removing attention-based fusion reduced accuracy to 87.4%. The single-modality variants reached 92.3% (spatial-only, CNN), 93.1% (temporal-only, Transformer), and 90.2% (depth-only, Depth-CNN), whereas combining all modalities gave 98.4%.</p>
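        <p>The size of each ablation effect can be read off as the gap, in percentage points, between the full model and each reduced variant. The sketch below performs this arithmetic on the accuracies reported in this section; the variable and function names are illustrative only.</p>
        <code language="python">
```python
FULL_MODEL_ACCURACY = 98.4  # all modalities combined, in percent

# Accuracies (in percent) of the reduced variants reported in this section.
ablation_accuracy = {
    "without attention-based fusion": 87.4,
    "spatial-only (CNN)": 92.3,
    "temporal-only (Transformer)": 93.1,
    "depth-only (Depth-CNN)": 90.2,
}


def accuracy_drop(full, ablated):
    """Drop in percentage points relative to the full model."""
    return round(full - ablated, 1)


drops = {
    name: accuracy_drop(FULL_MODEL_ACCURACY, acc)
    for name, acc in ablation_accuracy.items()
}

# List the variants from the largest drop to the smallest.
for name, drop in sorted(drops.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: -{drop} percentage points")
```
        </code>
        <p>The largest gap, 11.0 percentage points, belongs to the variant without attention-based fusion, followed by depth-only (8.2), spatial-only (6.1), and temporal-only (5.3).</p>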
      </sec>
    </sec>
    <sec id="sec6">
      <title>6. Conclusion</title>
      <p>This paper presented an approach for ArSL recognition that addresses the challenges of both static and dynamic gestures. The proposed methodology combines multi-modal feature extraction (CNN, Transformer, Depth-CNN) with attention-based fusion, followed by a hierarchical recognition framework. The system achieved 98.4% accuracy with inference times of 0.007 seconds for static gestures and 0.02 seconds for dynamic gestures, demonstrating its potential for real-world applications. Future work will focus on expanding dialectal coverage and optimizing the model for mobile deployment.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <title>References</title>
      <ref id="B1">
        <label>1.</label>
        <citation-alternatives>
          <mixed-citation publication-type="web">World Health Organization (2025) Deafness and Hearing Loss. https://www.who.int/news-room/fact-sheets/detail/deafness-and-hearing-loss</mixed-citation>
          <element-citation publication-type="web">
            <year>2025</year>
            <article-title>Deafness and Hearing Loss</article-title>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B2">
        <label>2.</label>
        <citation-alternatives>
          <mixed-citation publication-type="web">Center for Strategic and International Studies (2025) Disability Inclusion in Foreign Policy: Special Advisor Sara Minkara. https://www.csis.org/events/disability-inclusion-foreign-policy-special-advisor-sara-minkara</mixed-citation>
          <element-citation publication-type="web">
            <year>2025</year>
            <article-title>Disability Inclusion in Foreign Policy: Special Advisor Sara Minkara</article-title>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B3">
        <label>3.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Shin, J., Miah, A.S.M., Kabir, M.H., Rahim, M.A. and Al Shiam, A. (2024) A Methodological and Structural Review of Hand Gesture Recognition across Diverse Data Modalities. <italic>IEEE Access</italic>, 12, 142606-142639. https://doi.org/10.1109/access.2024.3456436 <pub-id pub-id-type="doi">10.1109/access.2024.3456436</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/access.2024.3456436">https://doi.org/10.1109/access.2024.3456436</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Shin, J.</string-name>
              <string-name>Miah, A.S.M.</string-name>
              <string-name>Kabir, M.H.</string-name>
              <string-name>Rahim, M.A.</string-name>
              <string-name>Al Shiam, A.</string-name>
            </person-group>
            <year>2024</year>
            <article-title>A Methodological and Structural Review of Hand Gesture Recognition across Diverse Data Modalities</article-title>
            <source>IEEE Access</source>
            <volume>12</volume>
            <pub-id pub-id-type="doi">10.1109/access.2024.3456436</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B4">
        <label>4.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Al Abdullah, B.A., Amoudi, G.A. and Alghamdi, H.S. (2024) Advancements in Sign Language Recognition: A Comprehensive Review and Future Prospects. <italic>IEEE Access</italic>, 12, 128871-128895. https://doi.org/10.1109/access.2024.3457692 <pub-id pub-id-type="doi">10.1109/access.2024.3457692</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/access.2024.3457692">https://doi.org/10.1109/access.2024.3457692</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Al Abdullah, B.A.</string-name>
              <string-name>Amoudi, G.A.</string-name>
              <string-name>Alghamdi, H.S.</string-name>
            </person-group>
            <year>2024</year>
            <article-title>Advancements in Sign Language Recognition: A Comprehensive Review and Future Prospects</article-title>
            <source>IEEE Access</source>
            <volume>12</volume>
            <pub-id pub-id-type="doi">10.1109/access.2024.3457692</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B5">
        <label>5.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Noor, T.H., Noor, A., Alharbi, A.F., Faisal, A., Alrashidi, R., Alsaedi, A.S., et al. (2024) Real-Time Arabic Sign Language Recognition Using a Hybrid Deep Learning Model. <italic>Sensors</italic>, 24, Article 3683. https://doi.org/10.3390/s24113683 <pub-id pub-id-type="doi">10.3390/s24113683</pub-id><pub-id pub-id-type="pmid">38894473</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3390/s24113683">https://doi.org/10.3390/s24113683</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Noor, T.H.</string-name>
              <string-name>Noor, A.</string-name>
              <string-name>Alharbi, A.F.</string-name>
              <string-name>Faisal, A.</string-name>
              <string-name>Alrashidi, R.</string-name>
              <string-name>Alsaedi, A.S.</string-name>
            </person-group>
            <year>2024</year>
            <article-title>Real-Time Arabic Sign Language Recognition Using a Hybrid Deep Learning Model</article-title>
            <source>Sensors</source>
            <volume>24</volume>
            <elocation-id>3683</elocation-id>
            <pub-id pub-id-type="doi">10.3390/s24113683</pub-id>
            <pub-id pub-id-type="pmid">38894473</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B6">
        <label>6.</label>
        <citation-alternatives>
          <mixed-citation publication-type="journal">Duwairi, R.M. and Halloush, Z.A. (2022) Automatic Recognition of Arabic Alphabets Sign Language Using Deep Learning. <italic>International Journal of Electrical and Computer Engineering</italic> (<italic>IJECE</italic>), 12, 2996-3004. https://doi.org/10.11591/ijece.v12i3.pp2996-3004 <pub-id pub-id-type="doi">10.11591/ijece.v12i3.pp2996-3004</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.11591/ijece.v12i3.pp2996-3004">https://doi.org/10.11591/ijece.v12i3.pp2996-3004</ext-link></mixed-citation>
          <element-citation publication-type="journal">
            <person-group person-group-type="author">
              <string-name>Duwairi, R.M.</string-name>
              <string-name>Halloush, Z.A.</string-name>
            </person-group>
            <year>2022</year>
            <article-title>Automatic Recognition of Arabic Alphabets Sign Language Using Deep Learning</article-title>
            <source>International Journal of Electrical and Computer Engineering (IJECE)</source>
            <volume>12</volume>
            <pub-id pub-id-type="doi">10.11591/ijece.v12i3.pp2996-3004</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B7">
        <label>7.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Alharthi, N.M. and Alzahrani, S.M. (2023) Vision Transformers and Transfer Learning Approaches for Arabic Sign Language Recognition. <italic>Applied Sciences</italic>, 13, Article 11625. https://doi.org/10.3390/app132111625 <pub-id pub-id-type="doi">10.3390/app132111625</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3390/app132111625">https://doi.org/10.3390/app132111625</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Alharthi, N.M.</string-name>
              <string-name>Alzahrani, S.M.</string-name>
            </person-group>
            <year>2023</year>
            <article-title>Vision Transformers and Transfer Learning Approaches for Arabic Sign Language Recognition</article-title>
            <source>Applied Sciences</source>
            <volume>13</volume>
            <elocation-id>11625</elocation-id>
            <pub-id pub-id-type="doi">10.3390/app132111625</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B8">
        <label>8.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Alsolai, H., Alsolai, L., Al-Wesabi, F.N., Othman, M., Rizwanullah, M. and Abdelmageed, A.A. (2024) Automated Sign Language Detection and Classification Using Reptile Search Algorithm with Hybrid Deep Learning. <italic>Heliyon</italic>, 10, e23252. https://doi.org/10.1016/j.heliyon.2023.e23252 <pub-id pub-id-type="doi">10.1016/j.heliyon.2023.e23252</pub-id><pub-id pub-id-type="pmid">38148822</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.heliyon.2023.e23252">https://doi.org/10.1016/j.heliyon.2023.e23252</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Alsolai, H.</string-name>
              <string-name>Alsolai, L.</string-name>
              <string-name>Al-Wesabi, F.N.</string-name>
              <string-name>Othman, M.</string-name>
              <string-name>Rizwanullah, M.</string-name>
              <string-name>Abdelmageed, A.A.</string-name>
            </person-group>
            <year>2024</year>
            <article-title>Automated Sign Language Detection and Classification Using Reptile Search Algorithm with Hybrid Deep Learning</article-title>
            <source>Heliyon</source>
            <volume>10</volume>
            <elocation-id>e23252</elocation-id>
            <pub-id pub-id-type="doi">10.1016/j.heliyon.2023.e23252</pub-id>
            <pub-id pub-id-type="pmid">38148822</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B9">
        <label>9.</label>
        <citation-alternatives>
          <mixed-citation publication-type="journal">Kong, F., Hu, K., Li, Y., Li, D., Liu, X. and Durrani, T.S. (2022) A Spectral-Spatial Feature Extraction Method with Polydirectional CNN for Multispectral Image Compression. <italic>IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing</italic>, 15, 2745-2758. https://doi.org/10.1109/jstars.2022.3158281 <pub-id pub-id-type="doi">10.1109/jstars.2022.3158281</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/jstars.2022.3158281">https://doi.org/10.1109/jstars.2022.3158281</ext-link></mixed-citation>
          <element-citation publication-type="journal">
            <person-group person-group-type="author">
              <string-name>Kong, F.</string-name>
              <string-name>Hu, K.</string-name>
              <string-name>Li, Y.</string-name>
              <string-name>Li, D.</string-name>
              <string-name>Liu, X.</string-name>
              <string-name>Durrani, T.S.</string-name>
            </person-group>
            <year>2022</year>
            <article-title>A Spectral-Spatial Feature Extraction Method with Polydirectional CNN for Multispectral Image Compression</article-title>
            <source>IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing</source>
            <volume>15</volume>
            <pub-id pub-id-type="doi">10.1109/jstars.2022.3158281</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B10">
        <label>10.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Abdul Ameer, R.S., Ahmed, M.A., Al-Qaysi, Z.T., Salih, M.M. and Shuwandy, M.L. (2024) Empowering Communication: A Deep Learning Framework for Arabic Sign Language Recognition with an Attention Mechanism. <italic>Computers</italic>, 13, Article 153. https://doi.org/10.3390/computers13060153 <pub-id pub-id-type="doi">10.3390/computers13060153</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3390/computers13060153">https://doi.org/10.3390/computers13060153</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Abdul Ameer, R.S.</string-name>
              <string-name>Ahmed, M.A.</string-name>
              <string-name>Al-Qaysi, Z.T.</string-name>
              <string-name>Salih, M.M.</string-name>
              <string-name>Shuwandy, M.L.</string-name>
            </person-group>
            <year>2024</year>
            <article-title>Empowering Communication: A Deep Learning Framework for Arabic Sign Language Recognition with an Attention Mechanism</article-title>
            <source>Computers</source>
            <volume>13</volume>
            <elocation-id>153</elocation-id>
            <pub-id pub-id-type="doi">10.3390/computers13060153</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B11">
        <label>11.</label>
        <citation-alternatives>
          <mixed-citation publication-type="journal">Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A., Kaiser, L. and Polosukhin, I. (2017) Attention Is All You Need. arXiv: 1706.03762.</mixed-citation>
          <element-citation publication-type="journal">
            <person-group person-group-type="author">
              <string-name>Vaswani, A.</string-name>
              <string-name>Shazeer, N.</string-name>
              <string-name>Parmar, N.</string-name>
              <string-name>Uszkoreit, J.</string-name>
              <string-name>Jones, L.</string-name>
              <string-name>Gomez, A.</string-name>
              <string-name>Kaiser, L.</string-name>
              <string-name>Polosukhin, I.</string-name>
            </person-group>
            <year>2017</year>
            <article-title>Attention Is All You Need</article-title>
            <pub-id pub-id-type="arxiv">1706.03762</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B12">
        <label>12.</label>
        <citation-alternatives>
          <mixed-citation publication-type="journal">Tharwat, G., Ahmed, A.M. and Bouallegue, B. (2021) Arabic Sign Language Recognition System for Alphabets Using Machine Learning Techniques. <italic>Journal of Electrical and Computer Engineering</italic>, 2021, Article ID: 2995851. https://doi.org/10.1155/2021/2995851 <pub-id pub-id-type="doi">10.1155/2021/2995851</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1155/2021/2995851">https://doi.org/10.1155/2021/2995851</ext-link></mixed-citation>
          <element-citation publication-type="journal">
            <person-group person-group-type="author">
              <string-name>Tharwat, G.</string-name>
              <string-name>Ahmed, A.M.</string-name>
              <string-name>Bouallegue, B.</string-name>
            </person-group>
            <year>2021</year>
            <article-title>Arabic Sign Language Recognition System for Alphabets Using Machine Learning Techniques</article-title>
            <source>Journal of Electrical and Computer Engineering</source>
            <volume>2021</volume>
            <elocation-id>2995851</elocation-id>
            <pub-id pub-id-type="doi">10.1155/2021/2995851</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B13">
        <label>13.</label>
        <citation-alternatives>
          <mixed-citation publication-type="journal">Zakariah, M., Alotaibi, Y.A., Koundal, D., Guo, Y. and Mamun Elahi, M. (2022) Sign Language Recognition for Arabic Alphabets Using Transfer Learning Technique. <italic>Computational Intelligence and Neuroscience</italic>, 2022, Article ID: 4567989. https://doi.org/10.1155/2022/4567989 <pub-id pub-id-type="doi">10.1155/2022/4567989</pub-id><pub-id pub-id-type="pmid">35498192</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1155/2022/4567989">https://doi.org/10.1155/2022/4567989</ext-link></mixed-citation>
          <element-citation publication-type="journal">
            <person-group person-group-type="author">
              <string-name>Zakariah, M.</string-name>
              <string-name>Alotaibi, Y.A.</string-name>
              <string-name>Koundal, D.</string-name>
              <string-name>Guo, Y.</string-name>
              <string-name>Mamun Elahi, M.</string-name>
            </person-group>
            <year>2022</year>
            <article-title>Sign Language Recognition for Arabic Alphabets Using Transfer Learning Technique</article-title>
            <source>Computational Intelligence and Neuroscience</source>
            <volume>2022</volume>
            <elocation-id>4567989</elocation-id>
            <pub-id pub-id-type="doi">10.1155/2022/4567989</pub-id>
            <pub-id pub-id-type="pmid">35498192</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B14">
        <label>14.</label>
        <citation-alternatives>
          <mixed-citation publication-type="journal">Hdioud, B. and Tirari, M.E.H. (2023) A Deep Learning Based Approach for Recognition of Arabic Sign Language Letters. <italic>International Journal of Advanced Computer Science and Applications</italic>, 14, 424-429. https://doi.org/10.14569/ijacsa.2023.0140447 <pub-id pub-id-type="doi">10.14569/ijacsa.2023.0140447</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.14569/ijacsa.2023.0140447">https://doi.org/10.14569/ijacsa.2023.0140447</ext-link></mixed-citation>
          <element-citation publication-type="journal">
            <person-group person-group-type="author">
              <string-name>Hdioud, B.</string-name>
              <string-name>Tirari, M.E.H.</string-name>
            </person-group>
            <year>2023</year>
            <article-title>A Deep Learning Based Approach for Recognition of Arabic Sign Language Letters</article-title>
            <source>International Journal of Advanced Computer Science and Applications</source>
            <volume>14</volume>
            <pub-id pub-id-type="doi">10.14569/ijacsa.2023.0140447</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B15">
        <label>15.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Al Khuzayem, L., Shafi, S., Aljahdali, S., Alkhamesie, R. and Alzamzami, O. (2024) Efhamni: A Deep Learning-Based Saudi Sign Language Recognition Application. <italic>Sensors</italic>, 24, Article 3112. https://doi.org/10.3390/s24103112 <pub-id pub-id-type="doi">10.3390/s24103112</pub-id><pub-id pub-id-type="pmid">38793964</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3390/s24103112">https://doi.org/10.3390/s24103112</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Al Khuzayem, L.</string-name>
              <string-name>Shafi, S.</string-name>
              <string-name>Aljahdali, S.</string-name>
              <string-name>Alkhamesie, R.</string-name>
              <string-name>Alzamzami, O.</string-name>
            </person-group>
            <year>2024</year>
            <article-title>Efhamni: A Deep Learning-Based Saudi Sign Language Recognition Application</article-title>
            <source>Sensors</source>
            <volume>24</volume>
            <elocation-id>3112</elocation-id>
            <pub-id pub-id-type="doi">10.3390/s24103112</pub-id>
            <pub-id pub-id-type="pmid">38793964</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B16">
        <label>16.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Zhang, Y. and Jiang, X. (2024) Recent Advances on Deep Learning for Sign Language Recognition. <italic>Computer Modeling in Engineering &amp; Sciences</italic>, 139, 2399-2450. https://doi.org/10.32604/cmes.2023.045731 <pub-id pub-id-type="doi">10.32604/cmes.2023.045731</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.32604/cmes.2023.045731">https://doi.org/10.32604/cmes.2023.045731</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Zhang, Y.</string-name>
              <string-name>Jiang, X.</string-name>
            </person-group>
            <year>2024</year>
            <article-title>Recent Advances on Deep Learning for Sign Language Recognition</article-title>
            <source>Computer Modeling in Engineering &amp; Sciences</source>
            <volume>139</volume>
            <pub-id pub-id-type="doi">10.32604/cmes.2023.045731</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B17">
        <label>17.</label>
        <citation-alternatives>
          <mixed-citation publication-type="journal">Gao, Q., Zhang, M. and Ju, Z. (2025) LGF-SLR: Hand Local-Global Fusion Network for Skeleton-Based Sign Language Recognition. <italic>IEEE Sensors Journal</italic>, 25, 8586-8597. https://doi.org/10.1109/jsen.2025.3527198 <pub-id pub-id-type="doi">10.1109/jsen.2025.3527198</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/jsen.2025.3527198">https://doi.org/10.1109/jsen.2025.3527198</ext-link></mixed-citation>
          <element-citation publication-type="journal">
            <person-group person-group-type="author">
              <string-name>Gao, Q.</string-name>
              <string-name>Zhang, M.</string-name>
              <string-name>Ju, Z.</string-name>
            </person-group>
            <year>2025</year>
            <article-title>LGF-SLR: Hand Local-Global Fusion Network for Skeleton-Based Sign Language Recognition</article-title>
            <source>IEEE Sensors Journal</source>
            <volume>25</volume>
            <pub-id pub-id-type="doi">10.1109/jsen.2025.3527198</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
    </ref-list>
  </back>
</article>