<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.4 20241031//EN" "JATS-journalpublishing1-4.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article" dtd-version="1.4" xml:lang="en">
  <front>
    <journal-meta>
      <journal-id journal-id-type="publisher-id">jcc</journal-id>
      <journal-title-group>
        <journal-title>Journal of Computer and Communications</journal-title>
      </journal-title-group>
      <issn pub-type="epub">2327-5227</issn>
      <issn pub-type="ppub">2327-5219</issn>
      <publisher>
        <publisher-name>Scientific Research Publishing</publisher-name>
      </publisher>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.4236/jcc.2026.143008</article-id>
      <article-id pub-id-type="publisher-id">jcc-150438</article-id>
      <article-categories>
        <subj-group>
          <subject>Article</subject>
        </subj-group>
        <subj-group>
          <subject>Computer Science</subject>
          <subject>Communications</subject>
        </subj-group>
      </article-categories>
      <title-group>
        <article-title>A Lightweight MobileViT with a Dual-Path Attention Mechanism for MRI Image Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name name-style="western">
            <surname>Xu</surname>
            <given-names>Youji</given-names>
          </name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <name name-style="western">
            <surname>Xiang</surname>
            <given-names>Siyu</given-names>
          </name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <name name-style="western">
            <surname>Feng</surname>
            <given-names>Huifang</given-names>
          </name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
      </contrib-group>
      <aff id="aff1"><label>1</label> College of Mathematics and Statistics, Northwest Normal University, Lanzhou, China </aff>
      <author-notes>
        <fn fn-type="conflict" id="fn-conflict">
          <p>The authors declare no conflicts of interest regarding the publication of this paper.</p>
        </fn>
      </author-notes>
      <pub-date pub-type="epub">
        <day>03</day>
        <month>03</month>
        <year>2026</year>
      </pub-date>
      <pub-date pub-type="collection">
        <month>03</month>
        <year>2026</year>
      </pub-date>
      <volume>14</volume>
      <issue>03</issue>
      <fpage>149</fpage>
      <lpage>173</lpage>
      <history>
        <date date-type="received">
          <day>04</day>
          <month>03</month>
          <year>2026</year>
        </date>
        <date date-type="accepted">
          <day>23</day>
          <month>03</month>
          <year>2026</year>
        </date>
        <date date-type="published">
          <day>26</day>
          <month>03</month>
          <year>2026</year>
        </date>
      </history>
      <permissions>
        <copyright-statement>© 2026 by the authors and Scientific Research Publishing Inc.</copyright-statement>
        <copyright-year>2026</copyright-year>
        <license license-type="open-access">
          <license-p>This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (<ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</ext-link>).</license-p>
        </license>
      </permissions>
      <self-uri content-type="doi" xlink:href="https://doi.org/10.4236/jcc.2026.143008">https://doi.org/10.4236/jcc.2026.143008</self-uri>
      <abstract>
        <p>Deep learning has been applied successfully to medical diagnosis, and more accurate classification of MRI images is important for early treatment and patient prognosis. Current deep learning-based MRI image classification algorithms, however, have large parameter counts and high computational complexity. To address this problem, this paper proposes a lightweight MobileViT with a dual-path attention mechanism for MRI image classification. The Convolutional Block Attention Module (CBAM) is embedded in the original MobileViT network to enhance the extraction of key features by attending to both the channel and spatial dimensions of the feature map. A Dual-Path Attention Module (DPAM), constructed by integrating CSPNet with CBAM, further strengthens feature extraction while keeping the parameter count small. Transfer learning is used to speed up training on the MRI image datasets, and a cosine annealing algorithm adjusts the learning rate during training to help the model converge. The proposed model is evaluated on Alzheimer’s disease and brain tumor MRI datasets and compared with recent deep learning models. The experimental results show that the model substantially improves MRI image classification accuracy while reducing computational complexity, making it suitable for mobile devices with constrained computing resources.</p>
      </abstract>
      <kwd-group kwd-group-type="author-generated" xml:lang="en">
        <kwd>MRI Image Classification</kwd>
        <kwd>MobileViT</kwd>
        <kwd>Attention Mechanism</kwd>
        <kwd>Data Enhancement</kwd>
        <kwd>Transfer Learning</kwd>
        <kwd>Lightweight</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec1">
      <title>1. Introduction</title>
      <p>Alzheimer’s Disease (AD) is a degenerative brain disease, which means it worsens over time [<xref ref-type="bibr" rid="B1">1</xref>]. Its most notable symptoms are memory loss and a progressive decline in cognitive function. Patients may gradually forget familiar people, places, and daily activities, which can seriously affect quality of life. There is no effective cure for Alzheimer’s disease, and existing medications and treatments are primarily designed to slow the progression of the disease and relieve symptoms.</p>
      <p>Brain tumors are abnormal tissue growths that form in the central nervous system, such as the brain, brainstem, and spinal cord, and they are among the top ten malignant tumors in terms of morbidity and mortality today [<xref ref-type="bibr" rid="B2">2</xref>]. The main types of brain tumors are meningiomas, pituitary tumors, and gliomas. Glioma is the most common and most important of these; it affects the neuroglial cells and also invades surrounding tissues. Brain tumors seriously endanger human health. As a tumor grows, the patient’s intracranial pressure increases, sometimes leading to brain damage or even death. Therefore, timely detection and accurate determination of the tumor type play an important role in treatment planning and patient care.</p>
      <p>Magnetic Resonance Imaging (MRI) of the head is a medical imaging technique that produces three-dimensional images of the head without the use of X-rays, and it plays an important role in the diagnosis of brain disorders. Both brain tumors and Alzheimer’s disease can be diagnosed with the help of the patient’s MRI images. In recent years, deep learning has been widely used in the computer-aided diagnosis of medical images and has achieved good results. Using computer vision to assist doctors in reading medical images can reduce their workload and improve diagnostic efficiency. Research on deep learning-based MRI image classification therefore has considerable practical value.</p>
      <p>Convolutional Neural Networks (CNNs) are among the most commonly used deep learning methods for MRI image classification. Asgharzadeh-Bonab <italic>et al.</italic> [<xref ref-type="bibr" rid="B3">3</xref>] proposed two fusion schemes, decision-level and feature-level, to combine different input information and used several CNN models to classify MRI images under different features. Their experimental results showed that EfficientNet-B7 gave better classification results. Zhang <italic>et al.</italic> [<xref ref-type="bibr" rid="B4">4</xref>] introduced an augmented neural network, ADnet, built upon VGG16. This model employed depthwise separable convolutions in place of conventional convolutions to decrease the number of parameters, and incorporated the ELU activation function instead of ReLU to mitigate the risk of gradient explosion. Experimental results confirmed that these enhancements improved accuracy across various classification tasks. Yang <italic>et al.</italic> [<xref ref-type="bibr" rid="B5">5</xref>] proposed a new region-to-sample framework based on graph convolutional neural networks. Qian <italic>et al.</italic> [<xref ref-type="bibr" rid="B6">6</xref>] proposed a multi-scale 3D residual network with an attention module for multi-task learning; using a 3D network avoided the subjectivity of manually selecting slices and preserved the spatial structure of the 3D data. Ait Amou <italic>et al.</italic> [<xref ref-type="bibr" rid="B7">7</xref>] addressed the complexity of hyperparameter tuning for CNNs by proposing an efficient hyperparameter optimization technique based on Bayesian optimization, and their model showed excellent classification results on a three-class brain tumor image dataset. Ozdemir [<xref ref-type="bibr" rid="B8">8</xref>] devised a novel deep convolutional neural network architecture for brain tumor classification and successfully classified three distinct types of brain tumors.</p>
      <p>Although CNN-based models achieve excellent classification accuracy on MRI image datasets, the prevalent CNN models have large parameter counts, which hinders their practical deployment. Therefore, more and more researchers are studying lightweight MRI image classification models [<xref ref-type="bibr" rid="B9">9</xref>]. Zhang <italic>et al.</italic> [<xref ref-type="bibr" rid="B10">10</xref>] proposed a lightweight neural network approach based on ShuffleNet and introduced the ECA attention mechanism to achieve efficient and scalable automatic Alzheimer’s disease detection. Khatri <italic>et al.</italic> [<xref ref-type="bibr" rid="B11">11</xref>] combined a convolutional attention mechanism with the Transformer to design a lightweight Alzheimer’s disease diagnostic model, in which lightweight multi-head attention replaced standard multi-head attention to improve performance without consuming too many computational resources. Liu <italic>et al.</italic> [<xref ref-type="bibr" rid="B12">12</xref>] introduced a lightweight automated 3D algorithm with an attention mechanism for segmenting brain tumor images. Specifically, this study used hierarchical decoupled convolution instead of standard convolution to reduce the number of parameters in the model. Dilated convolution was incorporated in the bottom convolution module to strengthen the network’s ability to capture multi-scale information, and an attention mechanism was introduced to improve model accuracy. Vaiyapuri <italic>et al.</italic> [<xref ref-type="bibr" rid="B13">13</xref>] used an integrated model of EfficientNet, DenseNet, and MobileNet for feature extraction from brain tumor images. Luo <italic>et al.</italic> [<xref ref-type="bibr" rid="B14">14</xref>] addressed the large parameter counts and computational complexity of CNN models by introducing a lightweight brain tumor segmentation network that incorporated multi-view extraction and dense attention mechanisms. Egaz <italic>et al.</italic> [<xref ref-type="bibr" rid="B15">15</xref>] developed a lightweight CNN-LSTM model for the diagnosis of Alzheimer’s disease. Nizamani <italic>et al.</italic> [<xref ref-type="bibr" rid="B16">16</xref>] proposed a lightweight deep fusion model for brain tumor classification that integrates a Lightweight Feature Extraction Module (LEM), Cross-Stream Attention (CSA), a Feature Fusion Module (FFM), and an Attention Prediction Head (APH).</p>
      <p>Although deep convolutional neural networks have achieved good results in image classification, traditional convolutional neural networks still have limitations, such as excessive model complexity, limited classification accuracy, and loss of key information about the lesion regions in the images. To address these shortcomings, we propose a lightweight MobileViT with a dual-path attention mechanism for MRI image classification. The model improves on MobileViT: it maintains strong image classification results while keeping a compact structure and a small computational overhead, balancing performance and efficiency.</p>
      <p>The main contributions of this paper are outlined as follows: </p>
      <p>1) A lightweight MobileViT with a dual-path attention mechanism is proposed to solve the problem of large parameter counts and high computational complexity in current deep learning-based MRI image classification. </p>
      <p>2) A dual-path attention module is constructed by integrating CSPNet and CBAM mechanisms to further extract detailed features of lesion regions in MRI images and improve the classification accuracy without increasing the computational cost. </p>
      <p>3) Transfer learning is employed: the model is pre-trained on the ImageNet dataset, which both provides richer features and accelerates learning on MRI images. In addition, to improve the robustness and generalization of the model, a cosine annealing algorithm is used to adjust the learning rate during training.</p>
      <p>Extensive experiments are conducted on Alzheimer’s disease and brain tumor MRI image datasets, and the performance of the model is evaluated by comparing it with recent deep learning models.</p>
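      <p>As an illustration of the cosine annealing idea mentioned in contribution 3), the following minimal Python sketch decays a learning rate from a maximum to a minimum value along half a cosine period, using the standard form lr = lr_min + 0.5 (lr_max - lr_min) (1 + cos(pi t / T)). The function name <monospace>cosine_annealing_lr</monospace> and all numeric settings are hypothetical and chosen only for illustration; they are not the training configuration of the proposed model.</p>
      <code language="python"><![CDATA[
```python
import math


def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    """Learning rate at `step` on a single cosine decay from lr_max to lr_min."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so that steps outside the schedule stay at the end points.
    progress = min(max(step, 0), total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Hypothetical settings (assumptions, not values from the paper):
# 100 epochs, decaying from 1e-3 down to 1e-5.
schedule = [cosine_annealing_lr(epoch, 100, 1e-3, 1e-5) for epoch in range(101)]
```
]]></code>
      <p>In this sketch the rate changes slowly near the start and the end of training and fastest in the middle, which is the behavior usually associated with cosine schedules.</p>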
      <p>The remainder of this paper is organized as follows. Section 2 introduces the proposed MRI image classification model and details its architecture and key components. Section 3 describes the datasets used in the experiments, the data preprocessing techniques, and the experimental setup. Section 4 describes the hardware and software environment and analyzes the experimental results, comparing the proposed model with existing approaches. Section 5 discusses the limitations of the proposed model. Finally, Section 6 summarizes the key findings and outlines directions for future research.</p>
    </sec>
    <sec id="sec2">
      <title>2. Methodology</title>
      <p>To reduce the large parameter counts and high computational complexity of current deep learning-based MRI image classification models, we propose a lightweight MobileViT with a dual-path attention mechanism for MRI image classification.</p>
      <sec id="sec2dot1">
        <title>2.1. Overall Framework of the Proposed Model</title>
        <p><xref ref-type="fig" rid="fig1">Figure 1</xref> depicts the architecture of the proposed model, and <xref ref-type="table" rid="tbl1">Table 1</xref> lists the parameters of each module. The key components of the proposed model are the MobileViT, MV2, CBAM, and dual-path attention modules, which are described in the following sections.</p>
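        <p>The spatial sizes listed in Table 1 can be cross-checked with the usual convolution output-size rule, out = floor((n + 2p - k) / s) + 1. The short Python sketch below traces the four halvings from 224 down to 14, assuming that every downsampling step behaves like a 3 × 3 convolution with stride 2 and padding 1. The padding value and the helper name <monospace>conv_out_size</monospace> are assumptions made for illustration and are not stated in the text.</p>
        <code language="python"><![CDATA[
```python
def conv_out_size(size, kernel=3, stride=1, padding=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# Trace the successive stride-2 reductions of a 224 x 224 input.
# Kernel 3 and padding 1 are assumed, not taken from the paper.
sizes = [224]
for stride in (2, 2, 2, 2):
    sizes.append(conv_out_size(sizes[-1], stride=stride))
```
]]></code>
        <p>Under these assumptions a stride-1 layer leaves the spatial size unchanged, while each stride-2 layer halves it.</p>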
        <fig id="fig1">
          <label>Figure 1</label>
          <caption>
            <p>The structure of the proposed model.</p>
          </caption>
          <graphic xlink:href="https://html.scirp.org/file/1733488-rId15.jpeg?20260326025912" />
        </fig>
        <table-wrap id="tbl1">
          <label>Table 1</label>
          <caption>
            <p>Parameter information for each module.</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th colspan="2">Module</th>
                <th>Input Feature Matrix</th>
                <th colspan="2">Input Size [H, W, C]</th>
                <th>Output Feature Matrix</th>
                <th>Output Size [H, W, C]</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td colspan="2">Conv-3 × 3 ↓2</td>
                <td>
                  <inline-formula>
                    <mml:math>
                      <mml:mrow>
                        <mml:msubsup>
                          <mml:mi>X</mml:mi>
                          <mml:mn>0</mml:mn>
                          <mml:mrow>
                            <mml:mi>i</mml:mi>
                            <mml:mi>n</mml:mi>
                          </mml:mrow>
                        </mml:msubsup>
                      </mml:mrow>
                    </mml:math>
                  </inline-formula>
                </td>
                <td colspan="2">[224, 224, 3]</td>
                <td>
                  <inline-formula>
                    <mml:math>
                      <mml:mrow>
                        <mml:msubsup>
                          <mml:mi>X</mml:mi>
                          <mml:mn>0</mml:mn>
                          <mml:mrow>
                            <mml:mi>o</mml:mi>
                            <mml:mi>u</mml:mi>
                            <mml:mi>t</mml:mi>
                          </mml:mrow>
                        </mml:msubsup>
                      </mml:mrow>
                    </mml:math>
                  </inline-formula>
                </td>
                <td>[112, 112, 16]</td>
              </tr>
              <tr>
                <td>Layer1</td>
                <td>MV2</td>
                <td>
                  <inline-formula>
                    <mml:math>
                      <mml:mrow>
                        <mml:msubsup>
                          <mml:mi>X</mml:mi>
                          <mml:mn>0</mml:mn>
                          <mml:mrow>
                            <mml:mi>o</mml:mi>
                            <mml:mi>u</mml:mi>
                            <mml:mi>t</mml:mi>
                          </mml:mrow>
                        </mml:msubsup>
                      </mml:mrow>
                    </mml:math>
                  </inline-formula>
                </td>
                <td colspan="2">[112, 112, 16]</td>
                <td>
                  <inline-formula>
                    <mml:math>
                      <mml:mrow>
                        <mml:msubsup>
                          <mml:mi>X</mml:mi>
                          <mml:mn>1</mml:mn>
                          <mml:mrow>
                            <mml:mi>o</mml:mi>
                            <mml:mi>u</mml:mi>
                            <mml:mi>t</mml:mi>
                          </mml:mrow>
                        </mml:msubsup>
                      </mml:mrow>
                    </mml:math>
                  </inline-formula>
                </td>
                <td>[112, 112, 16]</td>
              </tr>
              <tr>
                <td colspan="2">DPAM</td>
                <td>
                  <inline-formula>
                    <mml:math>
                      <mml:mrow>
                        <mml:msubsup>
                          <mml:mi>X</mml:mi>
                          <mml:mn>1</mml:mn>
                          <mml:mrow>
                            <mml:mi>o</mml:mi>
                            <mml:mi>u</mml:mi>
                            <mml:mi>t</mml:mi>
                          </mml:mrow>
                        </mml:msubsup>
                      </mml:mrow>
                    </mml:math>
                  </inline-formula>
                </td>
                <td colspan="2">[112, 112, 16]</td>
                <td>
                  <inline-formula>
                    <mml:math>
                      <mml:mrow>
                        <mml:msubsup>
                          <mml:mi>X</mml:mi>
                          <mml:mrow>
                            <mml:mi>D</mml:mi>
                            <mml:mn>1</mml:mn>
                          </mml:mrow>
                          <mml:mrow>
                            <mml:mi>o</mml:mi>
                            <mml:mi>u</mml:mi>
                            <mml:mi>t</mml:mi>
                          </mml:mrow>
                        </mml:msubsup>
                      </mml:mrow>
                    </mml:math>
                  </inline-formula>
                </td>
                <td>[56, 56, 24]</td>
              </tr>
              <tr>
                <td rowspan="2">Layer2</td>
                <td>MV2 ↓2</td>
                <td>
                  <inline-formula>
                    <mml:math>
                      <mml:mrow>
                        <mml:msubsup>
                          <mml:mi>X</mml:mi>
                          <mml:mn>1</mml:mn>
                          <mml:mrow>
                            <mml:mi>o</mml:mi>
                            <mml:mi>u</mml:mi>
                            <mml:mi>t</mml:mi>
                          </mml:mrow>
                        </mml:msubsup>
                      </mml:mrow>
                    </mml:math>
                  </inline-formula>
                </td>
                <td colspan="2">[112, 112, 16]</td>
                <td>
                  <inline-formula>
                    <mml:math>
                      <mml:mrow>
                        <mml:msubsup>
                          <mml:mover accent="true">
                            <mml:mi>X</mml:mi>
                            <mml:mo>˜</mml:mo>
                          </mml:mover>
                          <mml:mn>2</mml:mn>
                          <mml:mrow>
                            <mml:mi>o</mml:mi>
                            <mml:mi>u</mml:mi>
                            <mml:mi>t</mml:mi>
                          </mml:mrow>
                        </mml:msubsup>
                      </mml:mrow>
                    </mml:math>
                  </inline-formula>
                </td>
                <td>[56, 56, 24]</td>
              </tr>
              <tr>
                <td>MV2 × 2</td>
                <td>
                  <inline-formula>
                    <mml:math>
                      <mml:mrow>
                        <mml:msubsup>
                          <mml:mover accent="true">
                            <mml:mi>X</mml:mi>
                            <mml:mo>˜</mml:mo>
                          </mml:mover>
                          <mml:mn>2</mml:mn>
                          <mml:mrow>
                            <mml:mi>o</mml:mi>
                            <mml:mi>u</mml:mi>
                            <mml:mi>t</mml:mi>
                          </mml:mrow>
                        </mml:msubsup>
                      </mml:mrow>
                    </mml:math>
                  </inline-formula>
                </td>
                <td colspan="2">[56, 56, 24]</td>
                <td>
                  <inline-formula>
                    <mml:math>
                      <mml:mrow>
                        <mml:msubsup>
                          <mml:mi>X</mml:mi>
                          <mml:mn>2</mml:mn>
                          <mml:mrow>
                            <mml:mi>o</mml:mi>
                            <mml:mi>u</mml:mi>
                            <mml:mi>t</mml:mi>
                          </mml:mrow>
                        </mml:msubsup>
                      </mml:mrow>
                    </mml:math>
                  </inline-formula>
                </td>
                <td>[56, 56, 24]</td>
              </tr>
              <tr>
                <td colspan="2">CBAM</td>
                <td>
                  <inline-formula>
                    <mml:math>
                      <mml:mrow>
                        <mml:msubsup>
                          <mml:mi>X</mml:mi>
                          <mml:mn>2</mml:mn>
                          <mml:mrow>
                            <mml:mi>o</mml:mi>
                            <mml:mi>u</mml:mi>
                            <mml:mi>t</mml:mi>
                          </mml:mrow>
                        </mml:msubsup>
                      </mml:mrow>
                    </mml:math>
                  </inline-formula>
                </td>
                <td colspan="2">[56, 56, 24]</td>
                <td>
                  <inline-formula>
                    <mml:math>
                      <mml:mrow>
                        <mml:msubsup>
                          <mml:mi>X</mml:mi>
                          <mml:mrow>
                            <mml:mi>C</mml:mi>
                            <mml:mn>1</mml:mn>
                          </mml:mrow>
                          <mml:mrow>
                            <mml:mi>o</mml:mi>
                            <mml:mi>u</mml:mi>
                            <mml:mi>t</mml:mi>
                          </mml:mrow>
                        </mml:msubsup>
                      </mml:mrow>
                    </mml:math>
                  </inline-formula>
                </td>
                <td>[56, 56, 24]</td>
              </tr>
              <tr>
                <td colspan="2">Add⊕</td>
                <td>
                  <inline-formula>
                    <mml:math>
                      <mml:mrow>
                        <mml:msubsup>
                          <mml:mi>X</mml:mi>
                          <mml:mn>2</mml:mn>
                          <mml:mrow>
                            <mml:mi>o</mml:mi>
                            <mml:mi>u</mml:mi>
                            <mml:mi>t</mml:mi>
                          </mml:mrow>
                        </mml:msubsup>
                      </mml:mrow>
                    </mml:math>
                  </inline-formula>
                  ,
                  <inline-formula>
                    <mml:math>
                      <mml:mrow>
                        <mml:msubsup>
                          <mml:mi>X</mml:mi>
                          <mml:mrow>
                            <mml:mi>D</mml:mi>
                            <mml:mn>1</mml:mn>
                          </mml:mrow>
                          <mml:mrow>
                            <mml:mi>o</mml:mi>
                            <mml:mi>u</mml:mi>
                            <mml:mi>t</mml:mi>
                          </mml:mrow>
                        </mml:msubsup>
                      </mml:mrow>
                    </mml:math>
                  </inline-formula>
                  ,
                  <inline-formula>
                    <mml:math>
                      <mml:mrow>
                        <mml:msubsup>
                          <mml:mi>X</mml:mi>
                          <mml:mrow>
                            <mml:mi>C</mml:mi>
                            <mml:mn>1</mml:mn>
                          </mml:mrow>
                          <mml:mrow>
                            <mml:mi>o</mml:mi>
                            <mml:mi>u</mml:mi>
                            <mml:mi>t</mml:mi>
                          </mml:mrow>
                        </mml:msubsup>
                      </mml:mrow>
                    </mml:math>
                  </inline-formula>
                </td>
                <td colspan="2">[56, 56, 24]</td>
                <td>
                  <inline-formula>
                    <mml:math>
                      <mml:mrow>
                        <mml:msubsup>
                          <mml:mi>X</mml:mi>
                          <mml:mn>3</mml:mn>
                          <mml:mrow>
                            <mml:mi>i</mml:mi>
                            <mml:mi>n</mml:mi>
                          </mml:mrow>
                        </mml:msubsup>
                      </mml:mrow>
                    </mml:math>
                  </inline-formula>
                </td>
                <td>[56, 56, 24]</td>
              </tr>
              <tr>
                <td rowspan="2">Layer3</td>
                <td>MV2 ↓2</td>
                <td>
                  <inline-formula>
                    <mml:math>
                      <mml:mrow>
                        <mml:msubsup>
                          <mml:mi>X</mml:mi>
                          <mml:mn>3</mml:mn>
                          <mml:mrow>
                            <mml:mi>i</mml:mi>
                            <mml:mi>n</mml:mi>
                          </mml:mrow>
                        </mml:msubsup>
                      </mml:mrow>
                    </mml:math>
                  </inline-formula>
                </td>
                <td colspan="2">[56, 56, 24]</td>
                <td>
                  <inline-formula>
                    <mml:math>
                      <mml:mrow>
                        <mml:msubsup>
                          <mml:mover accent="true">
                            <mml:mi>X</mml:mi>
                            <mml:mo>˜</mml:mo>
                          </mml:mover>
                          <mml:mn>3</mml:mn>
                          <mml:mrow>
                            <mml:mi>o</mml:mi>
                            <mml:mi>u</mml:mi>
                            <mml:mi>t</mml:mi>
                          </mml:mrow>
                        </mml:msubsup>
                      </mml:mrow>
                    </mml:math>
                  </inline-formula>
                </td>
                <td>[28, 28, 48]</td>
              </tr>
              <tr>
                <td>MobileViT</td>
                <td>
                  <inline-formula>
                    <mml:math>
                      <mml:mrow>
                        <mml:msubsup>
                          <mml:mover accent="true">
                            <mml:mi>X</mml:mi>
                            <mml:mo>˜</mml:mo>
                          </mml:mover>
                          <mml:mn>3</mml:mn>
                          <mml:mrow>
                            <mml:mi>o</mml:mi>
                            <mml:mi>u</mml:mi>
                            <mml:mi>t</mml:mi>
                          </mml:mrow>
                        </mml:msubsup>
                      </mml:mrow>
                    </mml:math>
                  </inline-formula>
                </td>
                <td colspan="2">[28, 28, 48]</td>
                <td>
                  <inline-formula>
                    <mml:math>
                      <mml:mrow>
                        <mml:msubsup>
                          <mml:mi>X</mml:mi>
                          <mml:mn>3</mml:mn>
                          <mml:mrow>
                            <mml:mi>o</mml:mi>
                            <mml:mi>u</mml:mi>
                            <mml:mi>t</mml:mi>
                          </mml:mrow>
                        </mml:msubsup>
                      </mml:mrow>
                    </mml:math>
                  </inline-formula>
                </td>
                <td>[28, 28, 48]</td>
              </tr>
              <tr>
                <td colspan="2">DPAM</td>
                <td colspan="2">
                  <inline-formula>
                    <mml:math>
                      <mml:mrow>
                        <mml:msubsup>
                          <mml:mi>X</mml:mi>
                          <mml:mn>3</mml:mn>
                          <mml:mrow>
                            <mml:mi>o</mml:mi>
                            <mml:mi>u</mml:mi>
                            <mml:mi>t</mml:mi>
                          </mml:mrow>
                        </mml:msubsup>
                      </mml:mrow>
                    </mml:math>
                  </inline-formula>
                </td>
                <td>[28, 28, 48]</td>
                <td>
                  <inline-formula>
                    <mml:math>
                      <mml:mrow>
                        <mml:msubsup>
                          <mml:mi>X</mml:mi>
                          <mml:mrow>
                            <mml:mi>D</mml:mi>
                            <mml:mn>2</mml:mn>
                          </mml:mrow>
                          <mml:mrow>
                            <mml:mi>o</mml:mi>
                            <mml:mi>u</mml:mi>
                            <mml:mi>t</mml:mi>
                          </mml:mrow>
                        </mml:msubsup>
                      </mml:mrow>
                    </mml:math>
                  </inline-formula>
                </td>
                <td>[14, 14, 64]</td>
              </tr>
              <tr>
                <td rowspan="2">Layer4</td>
                <td>MV2 ↓2</td>
                <td colspan="2">
                  <inline-formula>
                    <mml:math>
                      <mml:mrow>
                        <mml:msubsup>
                          <mml:mi>X</mml:mi>
                          <mml:mn>3</mml:mn>
                          <mml:mrow>
                            <mml:mi>o</mml:mi>
                            <mml:mi>u</mml:mi>
                            <mml:mi>t</mml:mi>
                          </mml:mrow>
                        </mml:msubsup>
                      </mml:mrow>
                    </mml:math>
                  </inline-formula>
                </td>
                <td>[28, 28, 48]</td>
                <td>
                  <inline-formula>
                    <mml:math>
                      <mml:mrow>
                        <mml:msubsup>
                          <mml:mover accent="true">
                            <mml:mi>X</mml:mi>
                            <mml:mo>˜</mml:mo>
                          </mml:mover>
                          <mml:mn>4</mml:mn>
                          <mml:mrow>
                            <mml:mi>o</mml:mi>
                            <mml:mi>u</mml:mi>
                            <mml:mi>t</mml:mi>
                          </mml:mrow>
                        </mml:msubsup>
                      </mml:mrow>
                    </mml:math>
                  </inline-formula>
                </td>
                <td>[14, 14, 64]</td>
              </tr>
              <tr>
                <td>MobileViT</td>
                <td colspan="2">
                  <inline-formula>
                    <mml:math>
                      <mml:mrow>
                        <mml:msubsup>
                          <mml:mover accent="true">
                            <mml:mi>X</mml:mi>
                            <mml:mo>˜</mml:mo>
                          </mml:mover>
                          <mml:mn>4</mml:mn>
                          <mml:mrow>
                            <mml:mi>o</mml:mi>
                            <mml:mi>u</mml:mi>
                            <mml:mi>t</mml:mi>
                          </mml:mrow>
                        </mml:msubsup>
                      </mml:mrow>
                    </mml:math>
                  </inline-formula>
                </td>
                <td>[14, 14, 64]</td>
                <td>
                  <inline-formula>
                    <mml:math>
                      <mml:mrow>
                        <mml:msubsup>
                          <mml:mi>X</mml:mi>
                          <mml:mn>4</mml:mn>
                          <mml:mrow>
                            <mml:mi>o</mml:mi>
                            <mml:mi>u</mml:mi>
                            <mml:mi>t</mml:mi>
                          </mml:mrow>
                        </mml:msubsup>
                      </mml:mrow>
                    </mml:math>
                  </inline-formula>
                </td>
                <td>[14, 14, 64]</td>
              </tr>
              <tr>
                <td colspan="2">CBAM</td>
                <td colspan="2">
                  <inline-formula>
                    <mml:math>
                      <mml:mrow>
                        <mml:msubsup>
                          <mml:mi>X</mml:mi>
                          <mml:mn>4</mml:mn>
                          <mml:mrow>
                            <mml:mi>o</mml:mi>
                            <mml:mi>u</mml:mi>
                            <mml:mi>t</mml:mi>
                          </mml:mrow>
                        </mml:msubsup>
                      </mml:mrow>
                    </mml:math>
                  </inline-formula>
                </td>
                <td>[14, 14, 64]</td>
                <td>
                  <inline-formula>
                    <mml:math>
                      <mml:mrow>
                        <mml:msubsup>
                          <mml:mi>X</mml:mi>
                          <mml:mrow>
                            <mml:mi>C</mml:mi>
                            <mml:mn>2</mml:mn>
                          </mml:mrow>
                          <mml:mrow>
                            <mml:mi>o</mml:mi>
                            <mml:mi>u</mml:mi>
                            <mml:mi>t</mml:mi>
                          </mml:mrow>
                        </mml:msubsup>
                      </mml:mrow>
                    </mml:math>
                  </inline-formula>
                </td>
                <td>[14, 14, 64]</td>
              </tr>
              <tr>
                <td colspan="2">Add⊕</td>
                <td colspan="2">
                  <inline-formula>
                    <mml:math>
                      <mml:mrow>
                        <mml:msubsup>
                          <mml:mi>X</mml:mi>
                          <mml:mn>4</mml:mn>
                          <mml:mrow>
                            <mml:mi>o</mml:mi>
                            <mml:mi>u</mml:mi>
                            <mml:mi>t</mml:mi>
                          </mml:mrow>
                        </mml:msubsup>
                      </mml:mrow>
                    </mml:math>
                  </inline-formula>
                  ,
                  <inline-formula>
                    <mml:math>
                      <mml:mrow>
                        <mml:msubsup>
                          <mml:mi>X</mml:mi>
                          <mml:mrow>
                            <mml:mi>D</mml:mi>
                            <mml:mn>2</mml:mn>
                          </mml:mrow>
                          <mml:mrow>
                            <mml:mi>o</mml:mi>
                            <mml:mi>u</mml:mi>
                            <mml:mi>t</mml:mi>
                          </mml:mrow>
                        </mml:msubsup>
                      </mml:mrow>
                    </mml:math>
                  </inline-formula>
                  ,
                  <inline-formula>
                    <mml:math>
                      <mml:mrow>
                        <mml:msubsup>
                          <mml:mi>X</mml:mi>
                          <mml:mrow>
                            <mml:mi>C</mml:mi>
                            <mml:mn>2</mml:mn>
                          </mml:mrow>
                          <mml:mrow>
                            <mml:mi>o</mml:mi>
                            <mml:mi>u</mml:mi>
                            <mml:mi>t</mml:mi>
                          </mml:mrow>
                        </mml:msubsup>
                      </mml:mrow>
                    </mml:math>
                  </inline-formula>
                </td>
                <td>[14, 14, 64]</td>
                <td>
                  <inline-formula>
                    <mml:math>
                      <mml:mrow>
                        <mml:msubsup>
                          <mml:mi>X</mml:mi>
                          <mml:mn>5</mml:mn>
                          <mml:mrow>
                            <mml:mi>i</mml:mi>
                            <mml:mi>n</mml:mi>
                          </mml:mrow>
                        </mml:msubsup>
                      </mml:mrow>
                    </mml:math>
                  </inline-formula>
                </td>
                <td>[14, 14, 64]</td>
              </tr>
              <tr>
                <td rowspan="2">Layer5</td>
                <td>MV2 ↓2</td>
                <td colspan="2">
                  <inline-formula>
                    <mml:math>
                      <mml:mrow>
                        <mml:msubsup>
                          <mml:mi>X</mml:mi>
                          <mml:mn>5</mml:mn>
                          <mml:mrow>
                            <mml:mi>i</mml:mi>
                            <mml:mi>n</mml:mi>
                          </mml:mrow>
                        </mml:msubsup>
                      </mml:mrow>
                    </mml:math>
                  </inline-formula>
                </td>
                <td>[14, 14, 64]</td>
                <td>
                  <inline-formula>
                    <mml:math>
                      <mml:mrow>
                        <mml:msubsup>
                          <mml:mover accent="true">
                            <mml:mi>X</mml:mi>
                            <mml:mo>˜</mml:mo>
                          </mml:mover>
                          <mml:mn>5</mml:mn>
                          <mml:mrow>
                            <mml:mi>o</mml:mi>
                            <mml:mi>u</mml:mi>
                            <mml:mi>t</mml:mi>
                          </mml:mrow>
                        </mml:msubsup>
                      </mml:mrow>
                    </mml:math>
                  </inline-formula>
                </td>
                <td>[7, 7, 80]</td>
              </tr>
              <tr>
                <td>MobileViT</td>
                <td colspan="2">
                  <inline-formula>
                    <mml:math>
                      <mml:mrow>
                        <mml:msubsup>
                          <mml:mover accent="true">
                            <mml:mi>X</mml:mi>
                            <mml:mo>˜</mml:mo>
                          </mml:mover>
                          <mml:mn>5</mml:mn>
                          <mml:mrow>
                            <mml:mi>o</mml:mi>
                            <mml:mi>u</mml:mi>
                            <mml:mi>t</mml:mi>
                          </mml:mrow>
                        </mml:msubsup>
                      </mml:mrow>
                    </mml:math>
                  </inline-formula>
                </td>
                <td>[7, 7, 80]</td>
                <td>
                  <inline-formula>
                    <mml:math>
                      <mml:mrow>
                        <mml:msubsup>
                          <mml:mi>X</mml:mi>
                          <mml:mn>5</mml:mn>
                          <mml:mrow>
                            <mml:mi>o</mml:mi>
                            <mml:mi>u</mml:mi>
                            <mml:mi>t</mml:mi>
                          </mml:mrow>
                        </mml:msubsup>
                      </mml:mrow>
                    </mml:math>
                  </inline-formula>
                </td>
                <td>[7, 7, 80]</td>
              </tr>
              <tr>
                <td colspan="2">Conv-1 × 1</td>
                <td colspan="2">
                  <inline-formula>
                    <mml:math>
                      <mml:mrow>
                        <mml:msubsup>
                          <mml:mi>X</mml:mi>
                          <mml:mn>5</mml:mn>
                          <mml:mrow>
                            <mml:mi>o</mml:mi>
                            <mml:mi>u</mml:mi>
                            <mml:mi>t</mml:mi>
                          </mml:mrow>
                        </mml:msubsup>
                      </mml:mrow>
                    </mml:math>
                  </inline-formula>
                </td>
                <td>[7, 7, 80]</td>
                <td>
                  <inline-formula>
                    <mml:math>
                      <mml:mrow>
                        <mml:msubsup>
                          <mml:mi>X</mml:mi>
                          <mml:mn>6</mml:mn>
                          <mml:mrow>
                            <mml:mi>o</mml:mi>
                            <mml:mi>u</mml:mi>
                            <mml:mi>t</mml:mi>
                          </mml:mrow>
                        </mml:msubsup>
                      </mml:mrow>
                    </mml:math>
                  </inline-formula>
                </td>
                <td>[7, 7, 320]</td>
              </tr>
              <tr>
                <td colspan="2">Global pool → Linear</td>
                <td colspan="2">
                  <inline-formula>
                    <mml:math>
                      <mml:mrow>
                        <mml:msubsup>
                          <mml:mi>X</mml:mi>
                          <mml:mn>6</mml:mn>
                          <mml:mrow>
                            <mml:mi>o</mml:mi>
                            <mml:mi>u</mml:mi>
                            <mml:mi>t</mml:mi>
                          </mml:mrow>
                        </mml:msubsup>
                      </mml:mrow>
                    </mml:math>
                  </inline-formula>
                </td>
                <td>[7, 7, 320]</td>
                <td>
                  <inline-formula>
                    <mml:math>
                      <mml:mrow>
                        <mml:msubsup>
                          <mml:mi>X</mml:mi>
                          <mml:mn>7</mml:mn>
                          <mml:mrow>
                            <mml:mi>o</mml:mi>
                            <mml:mi>u</mml:mi>
                            <mml:mi>t</mml:mi>
                          </mml:mrow>
                        </mml:msubsup>
                      </mml:mrow>
                    </mml:math>
                  </inline-formula>
                </td>
                <td>[1, 4]</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
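        <p>The feature-map sizes in the rows above can be checked with a few lines of arithmetic. The following Python sketch is illustrative only: the helper names <monospace>downsample</monospace> and <monospace>same_size</monospace> are hypothetical, and each stage is reduced to its effect on the [H, W, C] shape, assuming that every ↓2 stage and the DPAM halve the spatial resolution exactly as listed in the table.</p>
        <code language="python"><![CDATA[
```python
# Shape bookkeeping only: no convolution is actually computed here.

def downsample(shape, out_channels):
    """Halve height and width and set the channel count (MV2 with stride 2, DPAM)."""
    height, width, _ = shape
    return [height // 2, width // 2, out_channels]


def same_size(shape, out_channels=None):
    """Keep height and width; optionally change the channel count."""
    height, width, channels = shape
    return [height, width, channels if out_channels is None else out_channels]


x3_out = [28, 28, 48]                        # output of the Layer3 MobileViT block
xd2_out = downsample(x3_out, 64)             # DPAM branch
x4_out = same_size(downsample(x3_out, 64))   # Layer4: MV2 (stride 2), then MobileViT
xc2_out = same_size(x4_out)                  # CBAM keeps the size
x5_in = same_size(x4_out)                    # Add of three equally sized maps
x5_out = same_size(downsample(x5_in, 80))    # Layer5: MV2 (stride 2), then MobileViT
x6_out = same_size(x5_out, 320)              # 1 x 1 convolution widens the channels

print(xd2_out, x4_out, xc2_out, x5_in, x5_out, x6_out)
```
]]></code>
        <p>Under these assumptions the three inputs of the Add row all have size [14, 14, 64], which is what makes their element-wise addition well defined.</p>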
      </sec>
      <sec id="sec2dot2">
        <title>2.2. MobileViT Module</title>
        <p>The MobileViT module is the core of the MobileViT model [<xref ref-type="bibr" rid="B17">17</xref>] and is depicted in <xref ref-type="fig" rid="fig2">Figure 2</xref>. The main calculation process in the MobileViT module can be summarized in the following four steps: </p>
        <p>1) The input feature map <inline-formula><mml:math><mml:mrow><mml:mi> X </mml:mi><mml:mo> ∈ </mml:mo><mml:msup><mml:mi> R </mml:mi><mml:mrow><mml:mi> H </mml:mi><mml:mo> × </mml:mo><mml:mi> W </mml:mi><mml:mo> × </mml:mo><mml:mi> C </mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> is passed through an <inline-formula><mml:math><mml:mrow><mml:mi> n </mml:mi><mml:mo> × </mml:mo><mml:mi> n </mml:mi></mml:mrow></mml:math></inline-formula> convolution layer followed by a 1 × 1 convolution layer. The <inline-formula><mml:math><mml:mrow><mml:mi> n </mml:mi><mml:mo> × </mml:mo><mml:mi> n </mml:mi></mml:mrow></mml:math></inline-formula> convolution captures local spatial information within the feature map, while the 1 × 1 convolution projects the feature map into a higher-dimensional feature space. These two convolutions yield the local representation <inline-formula><mml:math><mml:mrow><mml:msub><mml:mi> X </mml:mi><mml:mi> L </mml:mi></mml:msub><mml:mo> ∈ </mml:mo><mml:msup><mml:mi> R </mml:mi><mml:mrow><mml:mi> H </mml:mi><mml:mo> × </mml:mo><mml:mi> W </mml:mi><mml:mo> × </mml:mo><mml:mi> d </mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula>.</p>
        <p>2) <inline-formula><mml:math><mml:mrow><mml:msub><mml:mi> X </mml:mi><mml:mi> L </mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> is unfolded into a sequence of non-overlapping flattened patches, denoted <inline-formula><mml:math><mml:mrow><mml:msub><mml:mi> X </mml:mi><mml:mi> U </mml:mi></mml:msub><mml:mo> ∈ </mml:mo><mml:msup><mml:mi> R </mml:mi><mml:mrow><mml:mi> N </mml:mi><mml:mo> × </mml:mo><mml:mi> P </mml:mi><mml:mo> × </mml:mo><mml:mi> d </mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula>, where <inline-formula><mml:math><mml:mrow><mml:mi> P </mml:mi><mml:mo> = </mml:mo><mml:mi> w </mml:mi><mml:mi> h </mml:mi></mml:mrow></mml:math></inline-formula> is the number of pixels per patch, <inline-formula><mml:math><mml:mrow><mml:mi> N </mml:mi><mml:mo> = </mml:mo><mml:mrow><mml:mrow><mml:mi> H </mml:mi><mml:mi> W </mml:mi></mml:mrow><mml:mo> / </mml:mo><mml:mi> P </mml:mi></mml:mrow></mml:mrow></mml:math></inline-formula> is the number of patches, and <inline-formula><mml:math><mml:mrow><mml:mrow><mml:mo> ( </mml:mo><mml:mrow><mml:mi> w </mml:mi><mml:mo> , </mml:mo><mml:mi> h </mml:mi></mml:mrow><mml:mo> ) </mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> are the width and height of each patch.</p>
        <p>3) The global inter-patch relationship representation <inline-formula><mml:math><mml:mrow><mml:msub><mml:mi> X </mml:mi><mml:mi> G </mml:mi></mml:msub><mml:mo> ∈ </mml:mo><mml:msup><mml:mi> R </mml:mi><mml:mrow><mml:mi> N </mml:mi><mml:mo> × </mml:mo><mml:mi> P </mml:mi><mml:mo> × </mml:mo><mml:mi> d </mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> is obtained as follows: </p>
        <disp-formula id="FD1">
          <label>(1)</label>
          <mml:math>
            <mml:mrow>
              <mml:msub>
                <mml:mi>X</mml:mi>
                <mml:mi>G</mml:mi>
              </mml:msub>
              <mml:mrow>
                <mml:mo>(</mml:mo>
                <mml:mi>p</mml:mi>
                <mml:mo>)</mml:mo>
              </mml:mrow>
              <mml:mo>=</mml:mo>
              <mml:mtext>Transformer</mml:mtext>
              <mml:mrow>
                <mml:mo>(</mml:mo>
                <mml:mrow>
                  <mml:msub>
                    <mml:mi>X</mml:mi>
                    <mml:mi>U</mml:mi>
                  </mml:msub>
                  <mml:mrow>
                    <mml:mo>(</mml:mo>
                    <mml:mi>p</mml:mi>
                    <mml:mo>)</mml:mo>
                  </mml:mrow>
                </mml:mrow>
                <mml:mo>)</mml:mo>
              </mml:mrow>
              <mml:mo>,</mml:mo>
              <mml:mn>1</mml:mn>
              <mml:mo>≤</mml:mo>
              <mml:mi>p</mml:mi>
              <mml:mo>≤</mml:mo>
              <mml:mi>P</mml:mi>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <p>4) <inline-formula><mml:math><mml:mrow><mml:msub><mml:mi> X </mml:mi><mml:mi> F </mml:mi></mml:msub><mml:mo> ∈ </mml:mo><mml:msup><mml:mi> R </mml:mi><mml:mrow><mml:mi> H </mml:mi><mml:mo> × </mml:mo><mml:mi> W </mml:mi><mml:mo> × </mml:mo><mml:mi> d </mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> is obtained by folding <inline-formula><mml:math><mml:mrow><mml:msub><mml:mi> X </mml:mi><mml:mi> G </mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>. <inline-formula><mml:math><mml:mrow><mml:msub><mml:mi> X </mml:mi><mml:mi> F </mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> is then mapped back to the original feature space by a 1 × 1 convolution layer and combined with the input <inline-formula><mml:math><mml:mi> X </mml:mi></mml:math></inline-formula>. Finally, the combined features are fused by a 3 × 3 convolution layer.</p>
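        <p>Steps 2) to 4) rely on an unfold and fold pair that only rearranges pixels. A minimal pure-Python sketch of that rearrangement is given below. The function names <monospace>unfold</monospace>, <monospace>fold</monospace> and <monospace>pixel_sequence</monospace>, as well as the nested-list layout in which x[row][column] holds a feature vector, are assumptions made for illustration, and the Transformer of Equation (1) is not implemented.</p>
        <code language="python"><![CDATA[
```python
def unfold(x, h, w):
    """Split an H x W map of feature vectors into N patches of P = h * w pixels."""
    H, W = len(x), len(x[0])
    assert H % h == 0 and W % w == 0, "patch size must divide the map size"
    patches = []
    for top in range(0, H, h):
        for left in range(0, W, w):
            patches.append([x[top + i][left + j]
                            for i in range(h) for j in range(w)])
    return patches  # indexed as patches[n][p] -> vector of length d


def fold(patches, H, W, h, w):
    """Inverse of unfold: put every pixel back at its original position."""
    x = [[None] * W for _ in range(H)]
    n = 0
    for top in range(0, H, h):
        for left in range(0, W, w):
            for p, vector in enumerate(patches[n]):
                i, j = divmod(p, w)
                x[top + i][left + j] = vector
            n += 1
    return x


def pixel_sequence(patches, p):
    """Pixel p of every patch (0-based here): the sequence X_U(p) in Equation (1)."""
    return [patch[p] for patch in patches]


H, W, h, w = 4, 4, 2, 2
x_l = [[[float(r * W + c)] for c in range(W)] for r in range(H)]  # d = 1
x_u = unfold(x_l, h, w)
print(len(x_u), len(x_u[0]))           # N and P
print(pixel_sequence(x_u, 0))          # one token per patch
print(fold(x_u, H, W, h, w) == x_l)    # fold undoes unfold
```
]]></code>
        <p>In this layout, the attention of Equation (1) would operate on <monospace>pixel_sequence(x_u, p)</monospace> for each pixel index, so every pixel exchanges information with the pixel at the same position in all other patches.</p>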
      </sec>
      <sec id="sec2dot3">
        <title>2.3. MV2 Module</title>
        <p>The MV2 module is the inverted residual module of MobileNetV2 [<xref ref-type="bibr" rid="B18">18</xref>], and its structure is shown in <xref ref-type="fig" rid="fig3">Figure 3</xref>. MobileNetV2 replaces the standard convolution layer with a depthwise separable convolution, which consists of two stages: a depthwise convolution followed by a pointwise (1 × 1) convolution. This factorization decomposes the convolution operation and improves computational efficiency while maintaining expressive power. The depthwise convolution (DWConv) extracts features within each channel, and the pointwise convolution (Conv) fuses features across channels. As a result, depthwise separable convolution significantly reduces the number of parameters and computations, which keeps the network lightweight. The inverted residual with linear bottleneck used in MV2 reverses the bottleneck structure of ResNet: instead of first reducing and then restoring the number of channels, the feature maps are first expanded by a pointwise convolution, features are then extracted by a depthwise convolution, and a final pointwise convolution reduces the number of channels.</p>
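        <p>The parameter saving of depthwise separable convolution can be made concrete with a short calculation. The Python sketch below counts weights only (biases and normalization parameters are ignored), and the example sizes, a 3 × 3 kernel with 48 input and 64 output channels, are chosen for illustration rather than taken from a specific layer of the network.</p>
        <code language="python"><![CDATA[
```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights of a standard kernel x kernel convolution."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(kernel, c_in, c_out):
    """Weights of a depthwise convolution plus a pointwise (1 x 1) convolution."""
    depthwise = kernel * kernel * c_in   # one kernel x kernel filter per input channel
    pointwise = c_in * c_out             # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


standard = standard_conv_params(3, 48, 64)
separable = separable_conv_params(3, 48, 64)
print(standard, separable, round(standard / separable, 2))
```
]]></code>
        <p>For these sizes the separable form needs 3504 weights instead of 27648, roughly an eightfold reduction.</p>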
        <fig id="fig2">
          <label>Figure 2</label>
          <graphic xlink:href="https://html.scirp.org/file/1733488-rId127.jpeg?20260326025913" />
        </fig>
        <p>Figure 2. The structure of the MobileViT module.</p>
        <fig id="fig3">
          <label>Figure 3</label>
          <graphic xlink:href="https://html.scirp.org/file/1733488-rId128.jpeg?20260326025913" />
        </fig>
        <p>Figure 3. The structure of the MV2 module. </p>
        <p>Within the standard MV2 module, the activation function is ReLU6, a modified version of the Rectified Linear Unit (ReLU). ReLU avoids the vanishing gradient problem for positive inputs; however, its derivative is constantly 0 for negative inputs, so gradients vanish on the negative interval and many neurons stop being updated. Therefore, in this paper, we replace the ReLU-type activation function with the SiLU (Sigmoid Linear Unit) activation function, which combines the advantages of Sigmoid and ReLU. SiLU is smooth, non-monotonic, and unbounded above while remaining bounded below. These properties make SiLU more suitable for deeper networks, as it mitigates the vanishing gradient problem and improves gradient flow during training. The SiLU activation function is given in Equation (2).</p>
        <disp-formula id="FD2">
          <label>(2)</label>
          <mml:math>
            <mml:mrow>
              <mml:mtext>SiLU</mml:mtext>
              <mml:mrow>
                <mml:mo>(</mml:mo>
                <mml:mi>x</mml:mi>
                <mml:mo>)</mml:mo>
              </mml:mrow>
              <mml:mo>=</mml:mo>
              <mml:mfrac>
                <mml:mi>x</mml:mi>
                <mml:mrow>
                  <mml:mn>1</mml:mn>
                  <mml:mo>+</mml:mo>
                  <mml:msup>
                    <mml:mtext>e</mml:mtext>
                    <mml:mrow>
                      <mml:mo>−</mml:mo>
                      <mml:mi>x</mml:mi>
                    </mml:mrow>
                  </mml:msup>
                </mml:mrow>
              </mml:mfrac>
            </mml:mrow>
          </mml:math>
        </disp-formula>
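        <p>Equation (2) and the gradient argument above can be checked numerically. The following Python sketch is illustrative: the function names are hypothetical, and the derivative σ(x)(1 + x(1 − σ(x))) is obtained by differentiating Equation (2) rather than quoted from the paper.</p>
        <code language="python"><![CDATA[
```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def silu(x):
    """Equation (2): x / (1 + exp(-x)), written as x * sigmoid(x)."""
    return x * sigmoid(x)


def silu_grad(x):
    """Derivative of SiLU: sigmoid(x) * (1 + x * (1 - sigmoid(x)))."""
    s = sigmoid(x)
    return s * (1.0 + x * (1.0 - s))


def relu_grad(x):
    """Derivative of ReLU (taken as 0 at x = 0)."""
    return 1.0 if x > 0 else 0.0


# Columns: input, SiLU value, SiLU derivative, ReLU derivative.
for x in (-2.0, -1.0, 0.0, 1.0):
    print(x, round(silu(x), 4), round(silu_grad(x), 4), relu_grad(x))
```
]]></code>
        <p>At the sampled negative inputs the SiLU derivative is nonzero while the ReLU derivative is 0, and SiLU(−2) is larger than SiLU(−1), which reflects the non-monotonic shape mentioned above.</p>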
      </sec>
      <sec id="sec2dot4">
        <title>2.4. Convolutional Block Attention Module</title>
        <p>The Convolutional Block Attention Module (CBAM) [<xref ref-type="bibr" rid="B19">19</xref>] is a composite attention mechanism that improves a model’s representation capability by combining channel and spatial attention. CBAM operates in two stages: the first stage applies channel attention, which enables the model to weigh the importance of different feature channels, and the second stage applies spatial attention, which allows the model to focus on important regions of the feature map. This dual attention mechanism selectively enhances useful features while suppressing irrelevant ones, which makes it effective for a variety of vision tasks. The architecture of CBAM, including its channel and spatial attention sub-modules, is depicted in <xref ref-type="fig" rid="fig4">Figure 4</xref>.</p>
        <fig id="fig4">
          <label>Figure 4</label>
          <graphic xlink:href="https://html.scirp.org/file/1733488-rId131.jpeg?20260326025914" />
        </fig>
        <p>Figure 4. The structure of the Convolutional Block Attention Module (CBAM). </p>
        <p>The channel attention module helps the network assess the importance of each channel. The calculation is given in Equation (3) and Equation (4), and the main steps are: </p>
        <p>1) Global information is extracted from the feature map via global average pooling (AvgPool) and global maximum pooling (MaxPool). </p>
        <p>2) The two pooled descriptors are each passed through the Multilayer Perceptron (MLP), and the results are summed. </p>
        <p>3) The channel attention weights are obtained by applying a Sigmoid activation function to the sum. </p>
        <p>4) The input features are multiplied by the channel attention weight vector <inline-formula><mml:math><mml:mrow><mml:msub><mml:mi> M </mml:mi><mml:mi> c </mml:mi></mml:msub><mml:mrow><mml:mo> ( </mml:mo><mml:mi> F </mml:mi><mml:mo> ) </mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> to obtain the channel attention feature map. Similarly, the spatial attention feature map is obtained according to Equation (5) and Equation (6). </p>
        <disp-formula id="FD3">
          <label>(3)</label>
          <mml:math>
            <mml:mrow>
              <mml:msub>
                <mml:mi>M</mml:mi>
                <mml:mi>c</mml:mi>
              </mml:msub>
              <mml:mrow>
                <mml:mo>(</mml:mo>
                <mml:mi>F</mml:mi>
                <mml:mo>)</mml:mo>
              </mml:mrow>
              <mml:mo>=</mml:mo>
              <mml:mi>σ</mml:mi>
              <mml:mrow>
                <mml:mo>(</mml:mo>
                <mml:mrow>
                  <mml:mtext>MLP</mml:mtext>
                  <mml:mrow>
                    <mml:mo>(</mml:mo>
                    <mml:mrow>
                      <mml:mtext>AvgPool</mml:mtext>
                      <mml:mrow>
                        <mml:mo>(</mml:mo>
                        <mml:mi>F</mml:mi>
                        <mml:mo>)</mml:mo>
                      </mml:mrow>
                    </mml:mrow>
                    <mml:mo>)</mml:mo>
                  </mml:mrow>
                  <mml:mo>+</mml:mo>
                  <mml:mtext>MLP</mml:mtext>
                  <mml:mrow>
                    <mml:mo>(</mml:mo>
                    <mml:mrow>
                      <mml:mtext>MaxPool</mml:mtext>
                      <mml:mrow>
                        <mml:mo>(</mml:mo>
                        <mml:mi>F</mml:mi>
                        <mml:mo>)</mml:mo>
                      </mml:mrow>
                    </mml:mrow>
                    <mml:mo>)</mml:mo>
                  </mml:mrow>
                </mml:mrow>
                <mml:mo>)</mml:mo>
              </mml:mrow>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <disp-formula id="FD4">
          <label>(4)</label>
          <mml:math>
            <mml:mrow>
              <mml:mi>F</mml:mi>
              <mml:mn>1</mml:mn>
              <mml:mo>=</mml:mo>
              <mml:mi>F</mml:mi>
              <mml:mo>⊗</mml:mo>
              <mml:msub>
                <mml:mi>M</mml:mi>
                <mml:mi>c</mml:mi>
              </mml:msub>
              <mml:mrow>
                <mml:mo>(</mml:mo>
                <mml:mi>F</mml:mi>
                <mml:mo>)</mml:mo>
              </mml:mrow>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <disp-formula id="FD5">
          <label>(5)</label>
          <mml:math>
            <mml:mrow>
              <mml:msub>
                <mml:mi>M</mml:mi>
                <mml:mi>s</mml:mi>
              </mml:msub>
              <mml:mrow>
                <mml:mo>(</mml:mo>
                <mml:mi>F</mml:mi>
                <mml:mo>)</mml:mo>
              </mml:mrow>
              <mml:mo>=</mml:mo>
              <mml:mi>σ</mml:mi>
              <mml:mrow>
                <mml:mo>(</mml:mo>
                <mml:mrow>
                  <mml:mtext>Conv</mml:mtext>
                  <mml:mrow>
                    <mml:mo>(</mml:mo>
                    <mml:mrow>
                      <mml:mrow>
                        <mml:mo>[</mml:mo>
                        <mml:mrow>
                          <mml:mtext>AvgPool</mml:mtext>
                          <mml:mrow>
                            <mml:mo>(</mml:mo>
                            <mml:mi>F</mml:mi>
                            <mml:mo>)</mml:mo>
                          </mml:mrow>
                          <mml:mo>;</mml:mo>
                          <mml:mtext>MaxPool</mml:mtext>
                          <mml:mrow>
                            <mml:mo>(</mml:mo>
                            <mml:mi>F</mml:mi>
                            <mml:mo>)</mml:mo>
                          </mml:mrow>
                        </mml:mrow>
                        <mml:mo>]</mml:mo>
                      </mml:mrow>
                    </mml:mrow>
                    <mml:mo>)</mml:mo>
                  </mml:mrow>
                </mml:mrow>
                <mml:mo>)</mml:mo>
              </mml:mrow>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <disp-formula id="FD6">
          <label>(6)</label>
          <mml:math>
            <mml:mrow>
              <mml:mi>F</mml:mi>
              <mml:mn>2</mml:mn>
              <mml:mo>=</mml:mo>
              <mml:mi>F</mml:mi>
              <mml:mn>1</mml:mn>
              <mml:mo>⊗</mml:mo>
              <mml:msub>
                <mml:mi>M</mml:mi>
                <mml:mi>s</mml:mi>
              </mml:msub>
              <mml:mrow>
                <mml:mo>(</mml:mo>
                <mml:mi>F</mml:mi>
                <mml:mo>)</mml:mo>
              </mml:mrow>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <p>where <italic>F</italic>, <italic>F</italic>1, and <italic>F</italic>2 denote the input feature map, the channel attention feature map, and the spatial attention feature map, respectively. <inline-formula><mml:math><mml:mrow><mml:mtext> AvgPool </mml:mtext><mml:mrow><mml:mo> ( </mml:mo><mml:mo> ⋅ </mml:mo><mml:mo> ) </mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math><mml:mrow><mml:mtext> MaxPool </mml:mtext><mml:mrow><mml:mo> ( </mml:mo><mml:mo> ⋅ </mml:mo><mml:mo> ) </mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> denote average pooling and maximum pooling, applied globally over the spatial dimensions in Equation (3) and along the channel axis in Equation (5), as in the original CBAM [<xref ref-type="bibr" rid="B19">19</xref>]. <inline-formula><mml:math><mml:mrow><mml:mtext> MLP </mml:mtext><mml:mrow><mml:mo> ( </mml:mo><mml:mo> ⋅ </mml:mo><mml:mo> ) </mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> denotes the multilayer perceptron, <inline-formula><mml:math><mml:mrow><mml:mi> σ </mml:mi><mml:mrow><mml:mo> ( </mml:mo><mml:mo> ⋅ </mml:mo><mml:mo> ) </mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> denotes the Sigmoid activation function, <inline-formula><mml:math><mml:mrow><mml:mo> ⊗ </mml:mo></mml:mrow></mml:math></inline-formula> denotes element-wise multiplication, and <inline-formula><mml:math><mml:mrow><mml:mtext> Conv </mml:mtext><mml:mrow><mml:mo> ( </mml:mo><mml:mo> ⋅ </mml:mo><mml:mo> ) </mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> denotes a convolution operation.</p>
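        <p>Equations (3) to (6) can be traced with a small pure-Python sketch. It is illustrative only and rests on several assumptions that are not stated in the text: feature maps are nested lists indexed as F[row][column][channel], the MLP has one ReLU hidden layer without biases, the convolution in Equation (5) is a k × k convolution with zero padding over the two channel-pooled maps, and the spatial attention is computed from the channel-refined map <italic>F</italic>1, following the sequential arrangement of the original CBAM [<xref ref-type="bibr" rid="B19">19</xref>]. All weights are supplied by the caller.</p>
        <code language="python"><![CDATA[
```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def mlp(v, w1, w2):
    """Shared perceptron (assumed): ReLU hidden layer, linear output, no biases."""
    hidden = [max(0.0, sum(a * b for a, b in zip(row, v))) for row in w1]
    return [sum(a * b for a, b in zip(row, hidden)) for row in w2]


def channel_attention(F, w1, w2):
    """Equation (3): sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F)))."""
    pixels = [px for row in F for px in row]
    C = len(pixels[0])
    avg = [sum(px[c] for px in pixels) / len(pixels) for c in range(C)]
    mx = [max(px[c] for px in pixels) for c in range(C)]
    return [sigmoid(a + m) for a, m in zip(mlp(avg, w1, w2), mlp(mx, w1, w2))]


def spatial_attention(F, kernel):
    """Equation (5): sigmoid(Conv([AvgPool(F); MaxPool(F)])), pooled over channels.

    kernel[i][j] is a pair (weight for the average map, weight for the max map).
    """
    H, W, k = len(F), len(F[0]), len(kernel)
    pad = k // 2
    pooled = [[(sum(px) / len(px), max(px)) for px in row] for row in F]
    out = [[0.0] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            s = 0.0
            for i in range(k):
                for j in range(k):
                    yy, xx = y + i - pad, x + j - pad
                    if 0 <= yy < H and 0 <= xx < W:   # zero padding outside the map
                        w_avg, w_max = kernel[i][j]
                        s += w_avg * pooled[yy][xx][0] + w_max * pooled[yy][xx][1]
            out[y][x] = sigmoid(s)
    return out


def cbam(F, w1, w2, kernel):
    mc = channel_attention(F, w1, w2)
    F1 = [[[v * m for v, m in zip(px, mc)] for px in row] for row in F]   # Eq. (4)
    ms = spatial_attention(F1, kernel)   # assumption: computed from F1, not F
    F2 = [[[v * ms[y][x] for v in px] for x, px in enumerate(row)]
          for y, row in enumerate(F1)]                                    # Eq. (6)
    return F2


F = [[[1.0, 2.0], [3.0, 4.0]],
     [[5.0, 6.0], [7.0, 8.0]]]                     # H = W = 2, C = 2
w1 = [[0.0, 0.0]]                                  # hidden size 1
w2 = [[0.0], [0.0]]
kernel = [[(0.0, 0.0)] * 3 for _ in range(3)]      # 3 x 3, all-zero weights
print(cbam(F, w1, w2, kernel))
```
]]></code>
        <p>With all weights set to zero, both attention maps equal 0.5 everywhere, so the output is the input scaled by 0.25. This makes the two successive multiplications of Equations (4) and (6) easy to verify by hand.</p>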
      </sec>
      <sec id="sec2dot5">
        <title>2.5. Dual-Path Attention Module</title>
        <p>Inspired by CSPNet [<xref ref-type="bibr" rid="B20">20</xref>] and CSPAttention [<xref ref-type="bibr" rid="B21">21</xref>], we propose a Dual-Path Attention Module (DPAM) designed to strengthen attention on the lesion foreground in MRI images. By integrating the CSPNet and CBAM mechanisms, DPAM captures and refines detailed features of lesion regions, which improves classification accuracy without significantly increasing computational cost. The structure of DPAM is shown in <xref ref-type="fig" rid="fig5">Figure 5</xref>. First, the input feature map of size [H, W, C_in] is split along the channel dimension into two sub-feature maps of size [H, W, C_in/2]. Each sub-feature map is then resized to [H/2, W/2, C_out] by a Conv Block, which consists of a Pointwise Convolution (P_Conv) layer, a Batch Normalization (BN) layer, a SiLU activation layer, and an AvgPool layer. The two resized feature maps are each passed through a CBAM attention module and then combined by a multiplication operation, yielding an output feature map of size [H/2, W/2, C_out]. Here, H, W, and C_in denote the height, width, and number of channels of the input feature map, and C_out denotes the number of channels of the output feature map. </p>
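        <p>As a reading aid, the following Python sketch traces the feature-map sizes through the DPAM stages described above. The function name <monospace>dpam_shapes</monospace> is hypothetical, and the sketch assumes that the AvgPool layer is a 2 × 2 pooling with stride 2, which is one way to obtain the stated halving of H and W; neither is taken from the authors’ implementation. </p>
        <code language="python" xml:space="preserve"><![CDATA[
```python
def dpam_shapes(h, w, c_in, c_out):
    """Trace (height, width, channels) through the DPAM stages in the text."""
    if c_in % 2 != 0:
        raise ValueError("c_in must be even to split the channels in half")
    if h % 2 != 0 or w % 2 != 0:
        # Assumption: 2 x 2 average pooling with stride 2 halves H and W.
        raise ValueError("h and w must be even for the assumed pooling")

    split = (h, w, c_in // 2)             # each of the two channel halves
    conv_block = (h // 2, w // 2, c_out)  # P_Conv + BN + SiLU + AvgPool
    cbam = conv_block                     # CBAM reweights values, size unchanged
    output = cbam                         # element-wise product of both paths
    return {"split": split, "conv_block": conv_block,
            "cbam": cbam, "output": output}


shapes = dpam_shapes(h=32, w=32, c_in=64, c_out=96)
```
]]></code>
        <p>Because CBAM and the final multiplication preserve the feature-map size, the output size is fixed entirely by the Conv Block. </p>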
      </sec>
      <sec id="sec2dot6">
        <title>2.6. Transfer Learning</title>
        <p>Transfer Learning (TL) is a machine learning approach that accelerates and improves learning in a new domain by reusing knowledge gained from a related domain. TL is widely used in natural language processing, computer vision, and other fields, where it can improve model performance in tasks such as text classification, image recognition, object detection, and semantic segmentation. </p>
        <p>Our prediction model performs image classification, so the MobileViT parameters are pretrained on the ImageNet dataset, which includes 1000 classes and approximately 1.28 million natural training images. Although natural images differ from MRI images, the features learned from them remain relevant. Through transfer learning, the model learns low-level features such as corners, edges, colors, and textures from ImageNet, and these features assist in classifying MRI images, thereby improving the effectiveness of the network. </p>
        <fig id="fig5">
          <label>Figure 5</label>
          <graphic xlink:href="https://html.scirp.org/file/1733488-rId154.jpeg?20260326025916" />
        </fig>
        <p>Figure 5. The structure of the Dual-Path Attention Module (DPAM). </p>
      </sec>
    </sec>
    <sec id="sec3">
      <title>3. Datasets and Pre-Processing</title>
      <sec id="sec3dot1">
        <title>3.1. The Alzheimer’s Disease Dataset</title>
        <p>The Alzheimer’s disease dataset used in this paper is publicly available on Kaggle [<xref ref-type="bibr" rid="B22">22</xref>]. It contains 6400 MRI images in four classes: 896 mild demented, 64 moderate demented, 3200 non demented, and 2240 very mild demented. Each image is 128 × 128 pixels. Samples from the dataset are shown in <xref ref-type="fig" rid="fig6">Figure 6</xref>. </p>
        <fig id="fig6">
          <label>Figure 6</label>
          <graphic xlink:href="https://html.scirp.org/file/1733488-rId155.jpeg?20260326025917" />
        </fig>
        <p>Figure 6. The Alzheimer’s disease dataset. </p>
        <p>By proportion, 14% of the images are mild demented, 1% are moderate demented, 50% are non demented, and 35% are very mild demented. <bold>Table 2</bold> shows the distribution of samples in the Alzheimer’s disease dataset. The classes are clearly imbalanced, and the moderate demented class is by far the smallest. With imbalanced classes, a classifier tends to predict the majority classes and to learn little from minority-class samples. Overall accuracy can then look good while predictions for the minority classes remain poor, even though the minority lesion classes are precisely the ones that most need to be learned and predicted. In this paper, traditional data augmentation methods such as flipping, cropping, translation, and rotation are used to expand the mild demented and moderate demented samples. Before any augmentation, the dataset is divided into 70% for training and 30% for testing. Augmentation is applied only to the training set, which prevents data leakage in which augmented variants of the same original image appear in both the training and test sets. </p>
        <p>Table 2. Distribution of samples in the Alzheimer’s disease dataset.</p>
        <table-wrap id="tbl2">
          <label>Table 2</label>
          <table>
            <tbody>
              <tr>
                <td>Dataset</td>
                <td>Mild Demented</td>
                <td>Moderate Demented</td>
                <td>Non Demented</td>
                <td>Very Mild Demented</td>
                <td>Totals</td>
              </tr>
              <tr>
                <td>Training set</td>
                <td>628</td>
                <td>45</td>
                <td>2240</td>
                <td>1568</td>
                <td>4481</td>
              </tr>
              <tr>
                <td>Test set</td>
                <td>268</td>
                <td>19</td>
                <td>960</td>
                <td>672</td>
                <td>1919</td>
              </tr>
              <tr>
                <td>Totals</td>
                <td>896</td>
                <td>64</td>
                <td>3200</td>
                <td>2240</td>
                <td>6400</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
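        <p>The split-before-augmentation order described above can be sketched as follows. This is a minimal illustration rather than the authors’ code: the function name <monospace>split_then_augment</monospace>, the per-class rounding rule, and the tag strings are assumptions, and the augmentation itself is represented only by a tag on the source image index. </p>
        <code language="python" xml:space="preserve"><![CDATA[
```python
import random


def split_then_augment(labels, minority, copies, train_frac=0.7, seed=0):
    """Split per class first, then augment minority classes in training only.

    Returns two lists of (source_index, tag) pairs. The tag is "orig" for an
    original image and "aug<k>" for the k-th augmented copy of that image.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = round(train_frac * len(idxs))  # assumed rounding rule
        train += [(i, "orig") for i in idxs[:cut]]
        test += [(i, "orig") for i in idxs[cut:]]

    # Augmented copies are derived from training images only, so no variant
    # of a test image can ever reach the training set.
    augmented = [(i, f"aug{k}")
                 for i, _ in train if labels[i] in minority
                 for k in range(copies)]
    return train + augmented, test


labels = ["mild"] * 10 + ["moderate"] * 4 + ["non"] * 20
train_set, test_set = split_then_augment(labels, {"mild", "moderate"}, copies=2)
```
]]></code>
        <p>Reversing the order, that is, augmenting first and splitting afterwards, would allow an original image and its flipped or rotated copy to land on opposite sides of the split. </p>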
      </sec>
      <sec id="sec3dot2">
        <title>3.2. The Brain Tumor Dataset</title>
        <p>The brain tumor dataset used in this paper is also publicly available on Kaggle. It is a four-class dataset of brain tumor MRI scans, made available in July 2020 by Sartaj Bhuvaji and colleagues from the National Institute of Technology, Durgapur, India [<xref ref-type="bibr" rid="B23">23</xref>]. The dataset contains 3264 MRI images in four classes: 926 glioma tumor images, 937 meningioma tumor images, 500 no-tumor images, and 901 pituitary tumor images. A selection of images from this dataset is displayed in <xref ref-type="fig" rid="fig7">Figure 7</xref>. <bold>Table 3</bold> shows the distribution of samples in the brain tumor dataset. </p>
        <fig id="fig7">
          <label>Figure 7</label>
          <graphic xlink:href="https://html.scirp.org/file/1733488-rId156.jpeg?20260326025918" />
        </fig>
        <p>Figure 7. The brain tumor dataset.</p>
        <p>Table 3. Distribution of samples in the brain tumor dataset.</p>
        <table-wrap id="tbl3">
          <label>Table 3</label>
          <table>
            <tbody>
              <tr>
                <td>Dataset</td>
                <td>Glioma tumor</td>
                <td>Meningioma tumor</td>
                <td>No tumor</td>
                <td>Pituitary tumor</td>
                <td>Totals</td>
              </tr>
              <tr>
                <td>Training set</td>
                <td>649</td>
                <td>656</td>
                <td>350</td>
                <td>631</td>
                <td>2286</td>
              </tr>
              <tr>
                <td>Test set</td>
                <td>277</td>
                <td>281</td>
                <td>150</td>
                <td>270</td>
                <td>978</td>
              </tr>
              <tr>
                <td>Totals</td>
                <td>926</td>
                <td>937</td>
                <td>500</td>
                <td>901</td>
                <td>3264</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>For model training, all images are resized to 224 × 224, and the training images are randomly flipped horizontally with a probability of 0.5. The images are then converted to tensors and normalized. The normalized image <inline-formula><mml:math><mml:mrow><mml:msub><mml:mi> X </mml:mi><mml:mi> n </mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> is given by: </p>
        <disp-formula id="FD7">
          <label>(7)</label>
          <mml:math>
            <mml:mrow>
              <mml:msub>
                <mml:mi>X</mml:mi>
                <mml:mi>n</mml:mi>
              </mml:msub>
              <mml:mo>=</mml:mo>
              <mml:mfrac>
                <mml:mrow>
                  <mml:msub>
                    <mml:mi>X</mml:mi>
                    <mml:mi>i</mml:mi>
                  </mml:msub>
                  <mml:mo>−</mml:mo>
                  <mml:mi>μ</mml:mi>
                </mml:mrow>
                <mml:mi>σ</mml:mi>
              </mml:mfrac>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <p>where <inline-formula><mml:math><mml:mrow><mml:msub><mml:mi> X </mml:mi><mml:mi> i </mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> denotes a pixel value in a given channel of the image, <inline-formula><mml:math><mml:mi> μ </mml:mi></mml:math></inline-formula> denotes the mean of the pixel values in that channel, and <inline-formula><mml:math><mml:mi> σ </mml:mi></mml:math></inline-formula> denotes the standard deviation of the pixel values in that channel. </p>
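        <p>Equation (7) can be illustrated for a single channel with the standard library alone. In this sketch the function name <monospace>normalize_channel</monospace> is hypothetical, and the use of the population standard deviation is an assumption. The text does not state whether the statistics are computed per image or over the whole dataset, so the function accepts precomputed values as well. </p>
        <code language="python" xml:space="preserve"><![CDATA[
```python
from statistics import fmean, pstdev


def normalize_channel(pixels, mean=None, std=None):
    """Apply X_n = (X_i - mu) / sigma to every pixel value of one channel.

    If mean or std is omitted, it is computed from `pixels` itself.
    """
    mu = fmean(pixels) if mean is None else mean
    sigma = pstdev(pixels) if std is None else std
    if sigma == 0:
        raise ValueError("standard deviation is zero; the channel is constant")
    return [(x - mu) / sigma for x in pixels]


channel = [0.0, 0.5, 1.0]
normalized = normalize_channel(channel)
```
]]></code>
        <p>When the statistics come from the channel itself, the normalized values have zero mean and unit standard deviation. </p>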
      </sec>
    </sec>
    <sec id="sec4">
      <title>4. Experiment</title>
      <sec id="sec4dot1">
        <title>4.1. Experimental Environment and Parameter Settings</title>
        <p>The experiments were run with the PyTorch 2.2.2 deep learning framework on Windows 10, and the network models were implemented in Python 3.8.5. The processor is an Intel(R) Xeon(R) Gold 6226R CPU @ 2.90 GHz, and the memory is 16.0 GB. </p>
        <p>To ensure a rigorous comparison, all baseline models were retrained or fine-tuned under identical training configurations. All models used the AdamW optimizer with a weight decay of 0.01 and a cosine annealing learning-rate schedule, with an initial learning rate of 1e−3 and a minimum learning rate of 1e−5. Each model was trained for 40 epochs with a batch size of 30, and all models used the same early stopping mechanism. Transfer learning was applied to all models to improve training. All baseline models use the same pre-trained checkpoint and follow a unified phased fine-tuning strategy: the backbone network is first frozen while only the classification head is trained, and the higher layers are then progressively unfrozen for joint optimization. </p>
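        <p>For reference, the learning-rate values implied by these settings can be sketched with the usual closed form of cosine annealing. The sketch assumes that the rate is updated once per epoch, that there is no warm-up or restart, and that the annealing period equals the 40 training epochs; none of these details is stated in the text, and the function name is hypothetical. </p>
        <code language="python" xml:space="preserve"><![CDATA[
```python
import math


def cosine_annealing_lr(epoch, total_epochs=40, lr_max=1e-3, lr_min=1e-5):
    """Learning rate at `epoch` under cosine annealing without restarts."""
    if not 0 <= epoch <= total_epochs:
        raise ValueError("epoch must lie in [0, total_epochs]")
    cosine = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)


# One value per epoch boundary, from the initial to the minimum rate.
schedule = [cosine_annealing_lr(epoch) for epoch in range(41)]
```
]]></code>
        <p>Under these assumptions the rate starts at 1e−3, passes through the midpoint of the two bounds at epoch 20, and reaches 1e−5 at epoch 40. </p>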
      </sec>
      <sec id="sec4dot2">
        <title>4.2. Evaluation Metrics</title>
        <p>The following performance metrics are used to evaluate the classification performance of the model: </p>
        <disp-formula id="FD8">
          <label>(8)</label>
          <mml:math>
            <mml:mrow>
              <mml:mtext>Accuracy</mml:mtext>
              <mml:mo>=</mml:mo>
              <mml:mfrac>
                <mml:mrow>
                  <mml:mtext>TP</mml:mtext>
                  <mml:mo>+</mml:mo>
                  <mml:mtext>TN</mml:mtext>
                </mml:mrow>
                <mml:mrow>
                  <mml:mtext>TP</mml:mtext>
                  <mml:mo>+</mml:mo>
                  <mml:mtext>TN</mml:mtext>
                  <mml:mo>+</mml:mo>
                  <mml:mtext>FP</mml:mtext>
                  <mml:mo>+</mml:mo>
                  <mml:mtext>FN</mml:mtext>
                </mml:mrow>
              </mml:mfrac>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <disp-formula id="FD9">
          <label>(9)</label>
          <mml:math>
            <mml:mrow>
              <mml:mtext>Precision</mml:mtext>
              <mml:mo>=</mml:mo>
              <mml:mfrac>
                <mml:mrow>
                  <mml:mtext>TP</mml:mtext>
                </mml:mrow>
                <mml:mrow>
                  <mml:mtext>TP</mml:mtext>
                  <mml:mo>+</mml:mo>
                  <mml:mtext>FP</mml:mtext>
                </mml:mrow>
              </mml:mfrac>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <disp-formula id="FD10">
          <label>(10)</label>
          <mml:math>
            <mml:mrow>
              <mml:mtext>Recall</mml:mtext>
              <mml:mo>=</mml:mo>
              <mml:mfrac>
                <mml:mrow>
                  <mml:mtext>TP</mml:mtext>
                </mml:mrow>
                <mml:mrow>
                  <mml:mtext>TP</mml:mtext>
                  <mml:mo>+</mml:mo>
                  <mml:mtext>FN</mml:mtext>
                </mml:mrow>
              </mml:mfrac>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <disp-formula id="FD11">
          <label>(11)</label>
          <mml:math>
            <mml:mrow>
              <mml:mtext>F</mml:mtext>
              <mml:mn>1</mml:mn>
              <mml:mo>=</mml:mo>
              <mml:mfrac>
                <mml:mrow>
                  <mml:mn>2</mml:mn>
                  <mml:mo>×</mml:mo>
                  <mml:mtext>Precision</mml:mtext>
                  <mml:mo>×</mml:mo>
                  <mml:mtext>Recall</mml:mtext>
                </mml:mrow>
                <mml:mrow>
                  <mml:mtext>Precision</mml:mtext>
                  <mml:mo>+</mml:mo>
                  <mml:mtext>Recall</mml:mtext>
                </mml:mrow>
              </mml:mfrac>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <p>TP (True Positive) is the number of positive instances that the model correctly classifies as positive, and TN (True Negative) is the number of negative instances that it correctly classifies as negative. FP (False Positive) is the number of negative instances that are incorrectly classified as positive, and FN (False Negative) is the number of positive instances that are incorrectly classified as negative. Together, these four counts describe how well a model separates positive from negative samples. </p>
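        <p>For a multi-class problem, Equations (9)-(11) are commonly evaluated per class by treating each class in turn as positive and all others as negative. The sketch below follows that one-vs-rest convention, which is an assumption about how the per-class values were obtained; the function name and the choice to report 0.0 when a denominator is zero are also illustrative. </p>
        <code language="python" xml:space="preserve"><![CDATA[
```python
def per_class_metrics(confusion):
    """Per-class precision, recall, and F1, plus overall accuracy.

    confusion[i][j] is the number of samples of true class i that were
    predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    results = []
    for k in range(n_classes):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp
        fp = sum(confusion[i][k] for i in range(n_classes)) - tp
        tn = total - tp - fn - fp
        # Assumed convention: a metric with a zero denominator is 0.0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        results.append({"TP": tp, "TN": tn, "FP": fp, "FN": fn,
                        "precision": precision, "recall": recall, "f1": f1})
    accuracy = sum(confusion[k][k] for k in range(n_classes)) / total
    return results, accuracy


example = [[8, 2, 0],
           [1, 7, 2],
           [0, 0, 10]]
metrics, accuracy = per_class_metrics(example)
```
]]></code>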
        <p>The Kappa coefficient is a statistic used to assess the performance of a classification model. It measures the agreement between predicted and actual labels while correcting for the agreement expected by chance, which accuracy alone does not account for. Its value lies in [−1, 1], although in practical applications it usually falls between 0 and 1. A higher Kappa coefficient indicates better classification performance. It is calculated using Equations (12), (13), and (14), where <italic>p</italic><sub>0</sub> is the observed agreement and <italic>p</italic><sub>e</sub> is the agreement expected by chance. </p>
        <disp-formula id="FD12">
          <label>(12)</label>
          <mml:math>
            <mml:mrow>
              <mml:msub>
                <mml:mi>p</mml:mi>
                <mml:mn>0</mml:mn>
              </mml:msub>
              <mml:mo>=</mml:mo>
              <mml:mfrac>
                <mml:mrow>
                  <mml:mtext>TP</mml:mtext>
                  <mml:mo>+</mml:mo>
                  <mml:mtext>TN</mml:mtext>
                </mml:mrow>
                <mml:mrow>
                  <mml:mtext>TP</mml:mtext>
                  <mml:mo>+</mml:mo>
                  <mml:mtext>TN</mml:mtext>
                  <mml:mo>+</mml:mo>
                  <mml:mtext>FP</mml:mtext>
                  <mml:mo>+</mml:mo>
                  <mml:mtext>FN</mml:mtext>
                </mml:mrow>
              </mml:mfrac>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <disp-formula id="FD13">
          <label>(13)</label>
          <mml:math>
            <mml:mrow>
              <mml:msub>
                <mml:mi>p</mml:mi>
                <mml:mi>e</mml:mi>
              </mml:msub>
              <mml:mo>=</mml:mo>
              <mml:mfrac>
                <mml:mrow>
                  <mml:mrow>
                    <mml:mo>(</mml:mo>
                    <mml:mrow>
                      <mml:mtext>TP</mml:mtext>
                      <mml:mo>+</mml:mo>
                      <mml:mtext>FN</mml:mtext>
                    </mml:mrow>
                    <mml:mo>)</mml:mo>
                  </mml:mrow>
                  <mml:mrow>
                    <mml:mo>(</mml:mo>
                    <mml:mrow>
                      <mml:mtext>TP</mml:mtext>
                      <mml:mo>+</mml:mo>
                      <mml:mtext>FP</mml:mtext>
                    </mml:mrow>
                    <mml:mo>)</mml:mo>
                  </mml:mrow>
                  <mml:mo>+</mml:mo>
                  <mml:mrow>
                    <mml:mo>(</mml:mo>
                    <mml:mrow>
                      <mml:mtext>TN</mml:mtext>
                      <mml:mo>+</mml:mo>
                      <mml:mtext>FN</mml:mtext>
                    </mml:mrow>
                    <mml:mo>)</mml:mo>
                  </mml:mrow>
                  <mml:mrow>
                    <mml:mo>(</mml:mo>
                    <mml:mrow>
                      <mml:mtext>TN</mml:mtext>
                      <mml:mo>+</mml:mo>
                      <mml:mtext>FP</mml:mtext>
                    </mml:mrow>
                    <mml:mo>)</mml:mo>
                  </mml:mrow>
                </mml:mrow>
                <mml:mrow>
                  <mml:msup>
                    <mml:mrow>
                      <mml:mrow>
                        <mml:mo>(</mml:mo>
                        <mml:mrow>
                          <mml:mtext>TP</mml:mtext>
                          <mml:mo>+</mml:mo>
                          <mml:mtext>TN</mml:mtext>
                          <mml:mo>+</mml:mo>
                          <mml:mtext>FP</mml:mtext>
                          <mml:mo>+</mml:mo>
                          <mml:mtext>FN</mml:mtext>
                        </mml:mrow>
                        <mml:mo>)</mml:mo>
                      </mml:mrow>
                    </mml:mrow>
                    <mml:mn>2</mml:mn>
                  </mml:msup>
                </mml:mrow>
              </mml:mfrac>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <disp-formula id="FD14">
          <label>(14)</label>
          <mml:math>
            <mml:mrow>
              <mml:mtext>Kappa</mml:mtext>
              <mml:mo>=</mml:mo>
              <mml:mfrac>
                <mml:mrow>
                  <mml:msub>
                    <mml:mi>p</mml:mi>
                    <mml:mn>0</mml:mn>
                  </mml:msub>
                  <mml:mo>−</mml:mo>
                  <mml:msub>
                    <mml:mi>p</mml:mi>
                    <mml:mi>e</mml:mi>
                  </mml:msub>
                </mml:mrow>
                <mml:mrow>
                  <mml:mn>1</mml:mn>
                  <mml:mo>−</mml:mo>
                  <mml:msub>
                    <mml:mi>p</mml:mi>
                    <mml:mi>e</mml:mi>
                  </mml:msub>
                </mml:mrow>
              </mml:mfrac>
            </mml:mrow>
          </mml:math>
        </disp-formula>
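        <p>Equations (12)-(14) are written for the two-class case. A minimal sketch of the computation from a confusion matrix is given below; it uses the standard multi-class generalization of the chance-agreement term (the sum over classes of row total times column total, divided by the squared sample count), which reduces to Equation (13) for two classes. The function name is hypothetical, and this is not presented as the authors’ implementation. </p>
        <code language="python" xml:space="preserve"><![CDATA[
```python
def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix.

    confusion[i][j] is the number of samples of true class i that were
    predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement p_0: fraction of samples on the diagonal.
    p0 = sum(confusion[k][k] for k in range(n_classes)) / total
    # Chance agreement p_e: products of matching row and column totals.
    row_totals = [sum(confusion[k]) for k in range(n_classes)]
    col_totals = [sum(confusion[i][k] for i in range(n_classes))
                  for k in range(n_classes)]
    pe = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    if pe == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p0 - pe) / (1 - pe)


# Two-class example laid out as [[TP, FN], [FP, TN]].
binary = [[40, 10],
          [5, 45]]
kappa = cohen_kappa(binary)
```
]]></code>
        <p>For this example, <italic>p</italic><sub>0</sub> = 0.85 and <italic>p</italic><sub>e</sub> = (50 × 45 + 50 × 55) / 100<sup>2</sup> = 0.5, so Kappa = 0.35 / 0.5 = 0.7. </p>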
      </sec>
      <sec id="sec4dot3">
        <title>4.3. Results and Analysis</title>
        <p>4.3.1. Prediction Accuracy and Loss of the Alzheimer’s Disease Dataset</p>
        <p>The proposed model is an enhanced version of MobileViT. To assess its effectiveness, we compare its training results with those of the original MobileViT model, both before and after data augmentation. Both models were trained for 40 epochs using cross-entropy as the loss function. <xref ref-type="fig" rid="fig8">Figure 8(a)</xref> and <xref ref-type="fig" rid="fig8">Figure 8(b)</xref> present the loss and accuracy curves of the two models on the dataset before data augmentation, so that the models are compared under the same initial conditions. As shown in <xref ref-type="fig" rid="fig8">Figure 8(a)</xref>, the proposed model has a lower loss on both the training and test sets, and its loss approaches zero faster than that of MobileViT. <xref ref-type="fig" rid="fig8">Figure 8(b)</xref> shows that the accuracy of the proposed model exceeds that of MobileViT on both the training and test sets. <xref ref-type="fig" rid="fig9">Figure 9(a)</xref> and <xref ref-type="fig" rid="fig9">Figure 9(b)</xref> display the loss and accuracy of both models on the dataset after data augmentation, respectively. These figures show that the proposed model performs markedly better on the more balanced dataset. </p>
        <fig id="fig8">
          <label>Figure 8</label>
          <graphic xlink:href="https://html.scirp.org/file/1733488-rId181.jpeg?20260326025922" />
        </fig>
        <p>Figure 8. Loss and accuracy of the two models before data augmentation. </p>
        <fig id="fig9">
          <label>Figure 9</label>
          <graphic xlink:href="https://html.scirp.org/file/1733488-rId182.jpeg?20260326025921" />
        </fig>
        <p>Figure 9. Loss and accuracy of the two models after data augmentation. </p>
        <p><bold>Table 4</bold> shows the classification results of the proposed model and MobileViT on the dataset before and after data augmentation. On the original dataset, the classification accuracies of MobileViT and the proposed model are 74.1% and 93.9%, and their Kappa coefficients are 0.562 and 0.900, respectively. Relative to MobileViT, these correspond to increases of 26.7% in accuracy and 60.1% in the Kappa coefficient. </p>
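        <p>The 26.7% and 60.1% figures are relative increases over the MobileViT values, not differences in percentage points. The short check below reproduces them from the Table 4 values quoted above; the helper name is hypothetical. </p>
        <code language="python" xml:space="preserve"><![CDATA[
```python
def relative_increase(baseline, improved):
    """Relative increase of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


accuracy_gain = relative_increase(74.1, 93.9)  # accuracy in percent
kappa_gain = relative_increase(0.562, 0.900)   # Kappa coefficient
print(f"{accuracy_gain:.1f}% {kappa_gain:.1f}%")  # prints "26.7% 60.1%"
```
]]></code>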
        <p><bold>Table 4</bold> also shows the effect of data augmentation on classification performance. The accuracy and Kappa coefficient of MobileViT increase from 0.741 and 0.562 to 0.809 and 0.746, respectively, and those of the proposed model increase from 0.939 and 0.900 to 0.971 and 0.961. These results indicate that using data augmentation to address class imbalance has a notable positive effect on classification performance. Specifically, the increase in accuracy reflects a larger share of correctly classified test images, while the higher Kappa coefficient indicates better agreement between the model’s predictions and the actual labels. Overall, the evaluation results on the test dataset show that the proposed model is effective in terms of Precision, Recall, F1, Accuracy, and Kappa.</p>
        <p>Table 4. Classification results of MobileViT and the proposed model on the Alzheimer’s disease dataset.</p>
        <table-wrap id="tbl4">
          <label>Table 4</label>
          <table>
            <tbody>
              <tr>
                <td>Augmentation</td>
                <td>Model</td>
                <td>Class</td>
                <td>Precision</td>
                <td>Recall</td>
                <td>F1</td>
                <td>Accuracy</td>
                <td>Kappa</td>
              </tr>
              <tr>
                <td rowspan="8">Before</td>
                <td rowspan="4">MobileViT</td>
                <td>Mild Demented</td>
                <td>0.697</td>
                <td>0.575</td>
                <td>0.630</td>
                <td rowspan="4">0.741</td>
                <td rowspan="4">0.562</td>
              </tr>
              <tr>
                <td>Moderate Demented</td>
                <td>0.000</td>
                <td>0.000</td>
                <td>0.000</td>
              </tr>
              <tr>
                <td>Non Demented</td>
                <td>0.775</td>
                <td>0.857</td>
                <td>0.814</td>
              </tr>
              <tr>
                <td>Very Mild Demented</td>
                <td>0.700</td>
                <td>0.662</td>
                <td>0.680</td>
              </tr>
              <tr>
                <td rowspan="4">Proposed model</td>
                <td>Mild Demented</td>
                <td>0.875</td>
                <td>0.966</td>
                <td>0.918</td>
                <td rowspan="4">0.939</td>
                <td rowspan="4">0.900</td>
              </tr>
              <tr>
                <td>Moderate Demented</td>
                <td>0.950</td>
                <td>1.000</td>
                <td>0.974</td>
              </tr>
              <tr>
                <td>Non Demented</td>
                <td>0.966</td>
                <td>0.943</td>
                <td>0.954</td>
              </tr>
              <tr>
                <td>Very Mild Demented</td>
                <td>0.928</td>
                <td>0.920</td>
                <td>0.924</td>
              </tr>
              <tr>
                <td rowspan="8">After</td>
                <td rowspan="4">MobileViT</td>
                <td>Mild Demented</td>
                <td>0.813</td>
                <td>0.861</td>
                <td>0.836</td>
                <td rowspan="4">0.809</td>
                <td rowspan="4">0.746</td>
              </tr>
              <tr>
                <td>Moderate Demented</td>
                <td>0.979</td>
                <td>0.997</td>
                <td>0.988</td>
              </tr>
              <tr>
                <td>Non Demented</td>
                <td>0.878</td>
                <td>0.675</td>
                <td>0.763</td>
              </tr>
              <tr>
                <td>Very Mild Demented</td>
                <td>0.609</td>
                <td>0.763</td>
                <td>0.678</td>
              </tr>
              <tr>
                <td rowspan="4">Proposed model</td>
                <td>Mild Demented</td>
                <td>0.984</td>
                <td>0.987</td>
                <td>0.985</td>
                <td rowspan="4">0.971</td>
                <td rowspan="4">0.961</td>
              </tr>
              <tr>
                <td>Moderate Demented</td>
                <td>0.997</td>
                <td>1.000</td>
                <td>0.999</td>
              </tr>
              <tr>
                <td>Non Demented</td>
                <td>0.953</td>
                <td>0.974</td>
                <td>0.963</td>
              </tr>
              <tr>
                <td>Very Mild Demented</td>
                <td>0.960</td>
                <td>0.924</td>
                <td>0.942</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>4.3.2. Prediction Accuracy and Loss of the Brain Tumor Dataset</p>
        <p><xref ref-type="fig" rid="fig10">Figure 10</xref> shows the loss and accuracy of MobileViT and the proposed model on the brain tumor dataset. Compared with the original MobileViT, the proposed model exhibits lower loss values on both the training and test sets, which indicates that it is optimized more effectively. The classification accuracy of the proposed model on the test set is also higher than that of MobileViT, which further supports its effectiveness. </p>
        <p>Table 5. Classification results of the original and improved models on the brain tumor dataset. </p>
        <table-wrap id="tbl5">
          <label>Table 5</label>
          <table>
            <tbody>
              <tr>
                <td>Model</td>
                <td>Class</td>
                <td>Precision</td>
                <td>Recall</td>
                <td>F1</td>
                <td>Accuracy</td>
                <td>Kappa</td>
              </tr>
              <tr>
                <td rowspan="4">MobileViT</td>
                <td>Glioma tumor</td>
                <td>0.953</td>
                <td>0.881</td>
                <td>0.916</td>
                <td rowspan="4">0.933</td>
                <td rowspan="4">0.908</td>
              </tr>
              <tr>
                <td>Meningioma tumor</td>
                <td>0.893</td>
                <td>0.918</td>
                <td>0.905</td>
              </tr>
              <tr>
                <td>No tumor</td>
                <td>0.973</td>
                <td>0.947</td>
                <td>0.959</td>
              </tr>
              <tr>
                <td>Pituitary tumor</td>
                <td>0.934</td>
                <td>0.993</td>
                <td>0.962</td>
              </tr>
              <tr>
                <td rowspan="4">Proposed model</td>
                <td>Glioma tumor</td>
                <td>0.971</td>
                <td>0.960</td>
                <td>0.966</td>
                <td rowspan="4">0.971</td>
                <td rowspan="4">0.961</td>
              </tr>
              <tr>
                <td>Meningioma tumor</td>
                <td>0.948</td>
                <td>0.968</td>
                <td>0.958</td>
              </tr>
              <tr>
                <td>No tumor</td>
                <td>1.000</td>
                <td>0.960</td>
                <td>0.980</td>
              </tr>
              <tr>
                <td>Pituitary tumor</td>
                <td>0.982</td>
                <td>0.993</td>
                <td>0.987</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <fig id="fig10">
          <label>Figure 10</label>
          <graphic xlink:href="https://html.scirp.org/file/1733488-rId183.jpeg?20260326025922" />
        </fig>
        <p>Figure 10. Loss and accuracy of MobileViT and the proposed model on the brain tumor dataset. </p>
        <p><bold>Table 5</bold> shows the classification results of the two models on the brain tumor dataset. The Kappa coefficient of MobileViT is 0.908, while that of the proposed model is 0.961. The original model has an overall classification accuracy of 93.3% on the brain tumor dataset, but its precision for the meningioma class is only 89.3%. The improved MobileViT model achieves an overall classification accuracy of 97.1% and matches or exceeds the original model on every per-class metric. Together, the experimental results in <bold>Table 4</bold> and <bold>Table 5</bold> show that the proposed model classifies better than the original MobileViT model on both class-imbalanced and class-balanced datasets, which suggests that it adapts better to the varied datasets encountered in practical applications. </p>
        <p>4.3.3. Comparative Experiments</p>
        <p>To compare the proposed model with existing approaches, nine deep learning models were used to classify the expanded Alzheimer’s disease dataset and the brain tumor dataset. These comparison models are ResNet50 [<xref ref-type="bibr" rid="B24">24</xref>], ResNet34 [<xref ref-type="bibr" rid="B24">24</xref>], DenseNet121 [<xref ref-type="bibr" rid="B25">25</xref>], ShuffleNetV2_x2_0 [<xref ref-type="bibr" rid="B26">26</xref>], EfficientNet [<xref ref-type="bibr" rid="B27">27</xref>], MobileNetV3_large [<xref ref-type="bibr" rid="B28">28</xref>], MobileNetV3_small [<xref ref-type="bibr" rid="B28">28</xref>], MobileNetV2 [<xref ref-type="bibr" rid="B18">18</xref>], and MobileViT [<xref ref-type="bibr" rid="B17">17</xref>]. Among these baselines, the ResNets are deep convolutional neural networks, while DenseNet, ShuffleNet, EfficientNet, and MobileNet are lightweight deep learning models. </p>
        <p><bold>1) Results of the Alzheimer’s disease dataset</bold></p>
        <p>Using the elements of the confusion matrix, several performance metrics (accuracy, precision, recall, and F1 score) are calculated for all models. These metrics are summarized in <bold>Table 6</bold>. Based on the data in <bold>Table 6</bold>, the following conclusions can be drawn: </p>
        <p>a) The proposed model achieves a classification accuracy of 97.1% on the expanded dataset, and its precision, recall, and F1 score are also all 97.1% (0.971). Its classification accuracy is only 0.3 percentage points lower than that of ResNet50 and 0.2 percentage points lower than that of ResNet34. </p>
        <p>b) Among the baseline models, the classification performance of ResNet (ResNet50 and ResNet34) is markedly better than that of the lightweight deep learning models (DenseNet, ShuffleNet, EfficientNet, MobileNet, and MobileViT). This is consistent with the strong performance that ResNet, as a deep residual network, has shown in various computer vision tasks, including image classification, object detection, and semantic segmentation. </p>
        <p>c) Among the lightweight baseline models, MobileViT has a classification accuracy of 80.9%, a precision of 82.5%, a recall of 80.9%, and an F1 score of 81.1%. Although its classification accuracy is 5.4 percentage points lower than that of DenseNet121, DenseNet121 has more than 7 times as many parameters as MobileViT. The proposed model is not only roughly 15 to 16 percentage points higher than the original MobileViT model on all four metrics, but its results are also the best among the lightweight network models, including DenseNet121. These results show that our improvement of MobileViT is effective. </p>
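        <p>The calculation of these metrics from a confusion matrix can be sketched as follows. The sketch assumes support-weighted averaging over classes, which is an assumption on our part rather than something stated in the text; it is at least consistent with Table 6, where recall equals accuracy in every row, a property that weighted recall always has. The confusion matrix and the function name weighted_metrics are hypothetical.</p>
        <code language="python"><![CDATA[
```python
def weighted_metrics(confusion):
    """Accuracy and support-weighted precision, recall, F1 (rows: true, columns: predicted)."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[i][i] for i in range(n)) / total
    precision = recall = f1 = 0.0
    for k in range(n):
        tp = confusion[k][k]
        support = sum(confusion[k])                         # true samples of class k
        predicted = sum(confusion[i][k] for i in range(n))  # samples predicted as class k
        p_k = tp / predicted if predicted else 0.0
        r_k = tp / support if support else 0.0
        f_k = 2 * p_k * r_k / (p_k + r_k) if (p_k + r_k) else 0.0
        weight = support / total                            # class weight = share of samples
        precision += weight * p_k
        recall += weight * r_k
        f1 += weight * f_k
    return accuracy, precision, recall, f1


# Hypothetical, class-imbalanced counts for three classes.
toy_confusion = [
    [80, 15, 5],
    [10, 30, 10],
    [5, 5, 40],
]
acc, prec, rec, f1 = weighted_metrics(toy_confusion)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```
]]></code>
        <p>Because each class recall is weighted by its share of the samples, the weighted recall reduces to the number of correct predictions divided by the total, that is, to the accuracy.</p>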
        <p>Table 6. Classification results of each model on the Alzheimer’s disease dataset. </p>
        <table-wrap id="tbl6">
          <label>Table 6</label>
          <table>
            <tbody>
              <tr>
                <td>Model</td>
                <td>Accuracy</td>
                <td>Precision</td>
                <td>Recall</td>
                <td>F1</td>
              </tr>
              <tr>
                <td>ResNet50</td>
                <td>0.974</td>
                <td>0.974</td>
                <td>0.974</td>
                <td>0.974</td>
              </tr>
              <tr>
                <td>ResNet34</td>
                <td>0.973</td>
                <td>0.973</td>
                <td>0.973</td>
                <td>0.973</td>
              </tr>
              <tr>
                <td>DenseNet121</td>
                <td>0.863</td>
                <td>0.865</td>
                <td>0.863</td>
                <td>0.863</td>
              </tr>
              <tr>
                <td>ShuffleNetV2_x2_0</td>
                <td>0.718</td>
                <td>0.712</td>
                <td>0.718</td>
                <td>0.714</td>
              </tr>
              <tr>
                <td>EfficientNet</td>
                <td>0.663</td>
                <td>0.658</td>
                <td>0.663</td>
                <td>0.659</td>
              </tr>
              <tr>
                <td>MobileNetV3_large</td>
                <td>0.717</td>
                <td>0.710</td>
                <td>0.717</td>
                <td>0.707</td>
              </tr>
              <tr>
                <td>MobileNetV3_small</td>
                <td>0.617</td>
                <td>0.606</td>
                <td>0.617</td>
                <td>0.605</td>
              </tr>
              <tr>
                <td>MobileNetV2</td>
                <td>0.600</td>
                <td>0.588</td>
                <td>0.600</td>
                <td>0.581</td>
              </tr>
              <tr>
                <td>MobileViT</td>
                <td>0.809</td>
                <td>0.825</td>
                <td>0.809</td>
                <td>0.811</td>
              </tr>
              <tr>
                <td>Proposed Model</td>
                <td>0.971</td>
                <td>0.971</td>
                <td>0.971</td>
                <td>0.971</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p><bold>2) Results of the brain tumor dataset</bold></p>
        <p><bold>Table 7</bold> summarizes the classification performance of the different deep learning models on the brain tumor dataset. The MobileViT model, as a lightweight network, performs well, with an accuracy of 93.3% on brain tumor images, and ranks second among the lightweight models, behind only the proposed model. The accuracy of the proposed model is 3.8 percentage points higher than that of the second-ranked MobileViT. It is also slightly higher than that of the heavyweight ResNet series (97.1% versus 97.0% for ResNet50 and 96.8% for ResNet34). These results show the leading position of the proposed model in the brain tumor image classification task. </p>
        <p>Table 7. Classification results of each model on the brain tumor dataset. </p>
        <table-wrap id="tbl7">
          <label>Table 7</label>
          <table>
            <tbody>
              <tr>
                <td>Model</td>
                <td>Accuracy</td>
                <td>Precision</td>
                <td>Recall</td>
                <td>F1</td>
              </tr>
              <tr>
                <td>ResNet50</td>
                <td>0.970</td>
                <td>0.971</td>
                <td>0.970</td>
                <td>0.970</td>
              </tr>
              <tr>
                <td>ResNet34</td>
                <td>0.968</td>
                <td>0.969</td>
                <td>0.968</td>
                <td>0.968</td>
              </tr>
              <tr>
                <td>DenseNet121</td>
                <td>0.926</td>
                <td>0.927</td>
                <td>0.926</td>
                <td>0.926</td>
              </tr>
              <tr>
                <td>ShuffleNetV2_x2_0</td>
                <td>0.843</td>
                <td>0.846</td>
                <td>0.843</td>
                <td>0.842</td>
              </tr>
              <tr>
                <td>EfficientNet</td>
                <td>0.814</td>
                <td>0.815</td>
                <td>0.814</td>
                <td>0.814</td>
              </tr>
              <tr>
                <td>MobileNetV3_large</td>
                <td>0.906</td>
                <td>0.907</td>
                <td>0.906</td>
                <td>0.905</td>
              </tr>
              <tr>
                <td>MobileNetV3_small</td>
                <td>0.860</td>
                <td>0.862</td>
                <td>0.860</td>
                <td>0.860</td>
              </tr>
              <tr>
                <td>MobileNetV2</td>
                <td>0.822</td>
                <td>0.821</td>
                <td>0.822</td>
                <td>0.820</td>
              </tr>
              <tr>
                <td>MobileViT</td>
                <td>0.933</td>
                <td>0.933</td>
                <td>0.933</td>
                <td>0.932</td>
              </tr>
              <tr>
                <td>Proposed Model</td>
                <td>0.971</td>
                <td>0.972</td>
                <td>0.971</td>
                <td>0.971</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>4.3.4. Computational Complexity Analysis</p>
        <p>Computational complexity is also an important metric for deep learning models. Higher computational complexity can lead to long training times and slow inference, which may become a major factor limiting the application of a model, especially in scenarios with limited resources or strict real-time requirements. We use Params (the number of parameters) and FLOPs (the number of floating point operations) to assess computational complexity. FLOPs are commonly used to gauge the computational load of a model, with the number of FLOPs indicating the amount of computation needed during training and inference. Params refers to the number of trainable parameters in the model, which directly affects memory and computational resource demands. More parameters usually allow the model to fit the details and complex relationships of the training data more accurately, but they also increase the complexity of the model and with it the risk of overfitting. The computational complexity of the proposed model and the baseline models on the two MRI datasets is compared in <bold>Table 8</bold>. </p>
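        <p>To make the two measures concrete, the sketch below counts the parameters and multiply-accumulate operations of a single standard convolution layer. It is a minimal illustration under stated assumptions: the layer sizes are hypothetical, the helper conv2d_cost is not part of any library, and counting conventions differ between profiling tools (some report multiply-accumulates as FLOPs, others count each multiply-accumulate as two operations), so it is not necessarily how the values in Table 8 were obtained.</p>
        <code language="python"><![CDATA[
```python
def conv2d_cost(c_in, c_out, kernel, h_out, w_out, bias=True):
    """Parameter and multiply-accumulate (MAC) counts of a standard 2D convolution.

    Assumes a square kernel and no grouping (groups = 1).
    """
    weights = kernel * kernel * c_in * c_out
    params = weights + (c_out if bias else 0)
    # Every output position of every output channel needs kernel*kernel*c_in MACs.
    macs = weights * h_out * w_out
    return params, macs


# Hypothetical layer: 3x3 convolution, 3 -> 16 channels, 128 x 128 output map.
params, macs = conv2d_cost(c_in=3, c_out=16, kernel=3, h_out=128, w_out=128)
print(f"params = {params:,}")
print(f"MACs   = {macs:,}  (about {2 * macs:,} FLOPs if one MAC counts as two operations)")
```
]]></code>
        <p>Summing such per-layer counts over all layers gives model-level totals of the kind listed in Table 8.</p>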
        <p>From <bold>Table 8</bold>, the following conclusions can be drawn: </p>
        <p>1) The heavyweight ResNet networks, which have deeper layers, have far more Params and FLOPs than the lightweight networks. The number of parameters of ResNet50 is about 3.38 times that of DenseNet121, the lightweight network with the most parameters, and about 24.69 times that of MobileViT, the lightweight network with the fewest parameters. As seen in conjunction with <bold>Table 7</bold>, while ResNet50 and ResNet34 achieve higher classification accuracy than the baseline lightweight networks, this comes at the expense of significantly higher computational complexity. </p>
        <p>2) The proposed model has the second smallest number of parameters (957,761) after MobileViT (952,308). Compared with MobileViT, the number of parameters in the proposed model increases by only 0.57%, while the accuracy improves from 93.3% to 97.1%. In addition, the FLOPs of our proposed model (0.314778007 G) are larger than those of lightweight models such as MobileNetV3_small (0.057191512 G), MobileNetV3_large (0.220317112 G), MobileNetV2 (0.306178784 G), and MobileViT (0.304916367 G). Given the gain in accuracy, these costs are worthwhile. All in all, although the complexity of our proposed model is slightly higher than that of MobileViT, it remains a lightweight network that meets the requirements for real-time use or deployment on edge devices while also achieving relatively high classification accuracy. </p>
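        <p>The ratios quoted in these two points follow directly from Table 8. The short calculation below reproduces them, together with the relative FLOPs increase of the proposed model over MobileViT, which the table implies but the text does not state. The helper names ratio and percent_increase are illustrative only.</p>
        <code language="python"><![CDATA[
```python
# Parameter counts and FLOPs (in G) copied from Table 8.
PARAMS = {
    "ResNet50": 23_516_228,
    "DenseNet121": 6_957_956,
    "MobileViT": 952_308,
    "Proposed model": 957_761,
}
FLOPS_G = {
    "MobileViT": 0.304916367,
    "Proposed model": 0.314778007,
}


def ratio(a, b):
    """How many times larger a is than b."""
    return a / b


def percent_increase(new, old):
    """Relative increase of new over old, in percent."""
    return 100.0 * (new - old) / old


resnet_vs_densenet = ratio(PARAMS["ResNet50"], PARAMS["DenseNet121"])
resnet_vs_mobilevit = ratio(PARAMS["ResNet50"], PARAMS["MobileViT"])
param_increase = percent_increase(PARAMS["Proposed model"], PARAMS["MobileViT"])
flops_increase = percent_increase(FLOPS_G["Proposed model"], FLOPS_G["MobileViT"])

print(f"ResNet50 / DenseNet121 params: {resnet_vs_densenet:.2f}x")
print(f"ResNet50 / MobileViT params:   {resnet_vs_mobilevit:.2f}x")
print(f"Proposed vs MobileViT params:  +{param_increase:.2f}%")
print(f"Proposed vs MobileViT FLOPs:   +{flops_increase:.2f}%")
```
]]></code>
        <p>With the Table 8 values this gives ratios of about 3.38 and 24.69, a parameter increase of about 0.57%, and a FLOPs increase of about 3.23%.</p>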
        <p>Table 8. Results of computational complexity on two MRI datasets.</p>
        <table-wrap id="tbl8">
          <label>Table 8</label>
          <table>
            <tbody>
              <tr>
                <td>Model</td>
                <td>Parameters</td>
                <td>FLOPs</td>
              </tr>
              <tr>
                <td>ResNet50</td>
                <td>23,516,228</td>
                <td>4.09826048G</td>
              </tr>
              <tr>
                <td>ResNet34</td>
                <td>21,286,724</td>
                <td>3.66699008G</td>
              </tr>
              <tr>
                <td>DenseNet121</td>
                <td>6,957,956</td>
                <td>2.848985856G</td>
              </tr>
              <tr>
                <td>ShuffleNetV2_x2_0</td>
                <td>5,353,192</td>
                <td>0.584959696G</td>
              </tr>
              <tr>
                <td>EfficientNet</td>
                <td>4,012,672</td>
                <td>0.393804448G</td>
              </tr>
              <tr>
                <td>MobileNetV3_large</td>
                <td>4,207,156</td>
                <td>0.220317112G</td>
              </tr>
              <tr>
                <td>MobileNetV3_small</td>
                <td>1,521,956</td>
                <td>0.057191512G</td>
              </tr>
              <tr>
                <td>MobileNetV2</td>
                <td>2,228,996</td>
                <td>0.306178784G</td>
              </tr>
              <tr>
                <td>MobileViT</td>
                <td>952,308</td>
                <td>0.304916367G</td>
              </tr>
              <tr>
                <td>Proposed model</td>
                <td>957,761</td>
                <td>0.314778007G</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec4dot4">
        <title>4.4. Ablation Experiment</title>
        <p>The proposed model is an improvement of the original MobileViT. We examine the impact of the enhanced modules on the model’s classification performance through ablation experiments, the results of which are presented in <bold>Table 9</bold>. As shown in <bold>Table 9</bold>, when the improved MV2 module and the cosine annealing module are added to MobileViT in turn, the model metrics improve to different degrees on all three datasets compared to the baseline MobileViT. The combination MobileViT + Improved MV2 + Cosine Annealing achieves the best performance on the Alzheimer’s Disease Dataset (Before augmentation). After the Dual-Path Attention module is added to MobileViT + Improved MV2 + Cosine Annealing, the best metrics are achieved on the Alzheimer’s Disease Dataset (After augmentation) and the Brain Tumor Dataset. These results indicate that the proposed design improves on the baseline and attains better outcomes than the baseline techniques.</p>
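        <p>To make the contribution of each module easier to read off, the sketch below computes the accuracy gain of every configuration in Table 9 over the preceding one, in percentage points. The accuracy values are copied from Table 9; the dictionary keys and the helper step_gains are illustrative names only.</p>
        <code language="python"><![CDATA[
```python
# Accuracy values copied from Table 9, in the order the modules are added:
# MobileViT -> + Improved MV2 -> + Cosine Annealing -> + Dual-Path Attention
ACCURACY = {
    "Alzheimer (before augmentation)": [0.741, 0.923, 0.944, 0.939],
    "Alzheimer (after augmentation)": [0.809, 0.949, 0.969, 0.971],
    "Brain tumor": [0.933, 0.964, 0.966, 0.971],
}


def step_gains(values):
    """Gain of each configuration over the previous one, in percentage points."""
    return [round((b - a) * 100, 1) for a, b in zip(values, values[1:])]


GAINS = {name: step_gains(acc) for name, acc in ACCURACY.items()}

for name, gains in GAINS.items():
    print(f"{name}: {gains}")
```
]]></code>
        <p>For the brain tumor dataset, for example, the successive gains are 3.1, 0.2, and 0.5 percentage points. On the Alzheimer’s dataset before augmentation the last step is -0.5, reflecting the slight drop when Dual-Path Attention is added, whereas the improved MV2 module accounts for the largest gain on all three datasets.</p>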
      </sec>
      <sec id="sec4dot5">
        <title>4.5. Interpretability Analysis of Our Proposed Model</title>
        <p>To illustrate the rationale behind the decisions of our proposed model, we used the Grad-CAM method [<xref ref-type="bibr" rid="B29">29</xref>] to visualize the last convolutional layer of both the original MobileViT model and our proposed model. Grad-CAM produces a heat map by back-propagating the gradient of the class score to the feature maps of the chosen layer, averaging the gradients to obtain one weight per channel, and combining the feature maps weighted by these values. The resulting localization map highlights the regions of a given target image that contribute most to the prediction. </p>
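        <p>The core Grad-CAM computation can be sketched in a few lines. This is a minimal, framework-free illustration rather than the implementation used in the experiments: it assumes the feature maps and the gradients of the class score with respect to them have already been extracted from the network, and the function name grad_cam and the tiny 2 × 2 inputs are hypothetical.</p>
        <code language="python"><![CDATA[
```python
def grad_cam(activations, gradients):
    """Grad-CAM map from K feature maps and their gradients (each H x W nested lists)."""
    h = len(activations[0])
    w = len(activations[0][0])
    # One weight per channel: global average of that channel's gradients.
    weights = [sum(sum(row) for row in g) / (h * w) for g in gradients]
    # Weighted sum of the feature maps.
    cam = [[0.0] * w for _ in range(h)]
    for weight, fmap in zip(weights, activations):
        for i in range(h):
            for j in range(w):
                cam[i][j] += weight * fmap[i][j]
    # ReLU keeps only regions with a positive influence on the class score.
    cam = [[max(value, 0.0) for value in row] for row in cam]
    # Normalize to [0, 1] for display as a heat map.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Hypothetical example: two 2 x 2 feature maps with opposite-sign gradients.
toy_activations = [
    [[1, 0], [0, 0]],
    [[0, 0], [0, 1]],
]
toy_gradients = [
    [[1, 1], [1, 1]],
    [[-1, -1], [-1, -1]],
]
heatmap = grad_cam(toy_activations, toy_gradients)
print(heatmap)
```
]]></code>
        <p>In this toy case the first channel receives a positive weight and the second a negative one, so only the top-left position remains highlighted after the ReLU step. In practice the map is upsampled to the input resolution and overlaid on the image.</p>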
        <p>Table 9. Results of ablation experiment.</p>
        <table-wrap id="tbl9">
          <label>Table 9</label>
          <table>
            <tbody>
              <tr>
                <td>Dataset</td>
                <td>MobileViT</td>
                <td>Improved MV2</td>
                <td>Cosine Annealing</td>
                <td>Dual-Path Attention</td>
                <td>Accuracy</td>
                <td>Precision</td>
                <td>Recall</td>
                <td>F1</td>
              </tr>
              <tr>
                <td rowspan="4">Alzheimer’s Disease Dataset (Before augmentation)</td>
                <td>√</td>
                <td>×</td>
                <td>×</td>
                <td>×</td>
                <td>0.741</td>
                <td>0.730</td>
                <td>0.741</td>
                <td>0.733</td>
              </tr>
              <tr>
                <td>√</td>
                <td>√</td>
                <td>×</td>
                <td>×</td>
                <td>0.923</td>
                <td>0.928</td>
                <td>0.923</td>
                <td>0.923</td>
              </tr>
              <tr>
                <td>√</td>
                <td>√</td>
                <td>√</td>
                <td>×</td>
                <td>0.944</td>
                <td>0.946</td>
                <td>0.944</td>
                <td>0.945</td>
              </tr>
              <tr>
                <td>√</td>
                <td>√</td>
                <td>√</td>
                <td>√</td>
                <td>0.939</td>
                <td>0.940</td>
                <td>0.939</td>
                <td>0.939</td>
              </tr>
              <tr>
                <td rowspan="4">Alzheimer’s Disease Dataset (After augmentation)</td>
                <td>√</td>
                <td>×</td>
                <td>×</td>
                <td>×</td>
                <td>0.809</td>
                <td>0.825</td>
                <td>0.809</td>
                <td>0.811</td>
              </tr>
              <tr>
                <td>√</td>
                <td>√</td>
                <td>×</td>
                <td>×</td>
                <td>0.949</td>
                <td>0.949</td>
                <td>0.949</td>
                <td>0.949</td>
              </tr>
              <tr>
                <td>√</td>
                <td>√</td>
                <td>√</td>
                <td>×</td>
                <td>0.969</td>
                <td>0.969</td>
                <td>0.969</td>
                <td>0.969</td>
              </tr>
              <tr>
                <td>√</td>
                <td>√</td>
                <td>√</td>
                <td>√</td>
                <td>0.971</td>
                <td>0.971</td>
                <td>0.971</td>
                <td>0.971</td>
              </tr>
              <tr>
                <td rowspan="4">Brain Tumor Dataset</td>
                <td>√</td>
                <td>×</td>
                <td>×</td>
                <td>×</td>
                <td>0.933</td>
                <td>0.933</td>
                <td>0.933</td>
                <td>0.932</td>
              </tr>
              <tr>
                <td>√</td>
                <td>√</td>
                <td>×</td>
                <td>×</td>
                <td>0.964</td>
                <td>0.964</td>
                <td>0.964</td>
                <td>0.964</td>
              </tr>
              <tr>
                <td>√</td>
                <td>√</td>
                <td>√</td>
                <td>×</td>
                <td>0.966</td>
                <td>0.966</td>
                <td>0.966</td>
                <td>0.966</td>
              </tr>
              <tr>
                <td>√</td>
                <td>√</td>
                <td>√</td>
                <td>√</td>
                <td>0.971</td>
                <td>0.972</td>
                <td>0.971</td>
                <td>0.971</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p><xref ref-type="fig" rid="fig11">Figure 11</xref> and <xref ref-type="fig" rid="fig12">Figure 12</xref> show the heatmaps for each module configuration of the proposed model. In these visualizations, the warmer the color of a region, the more attention the model allocates to it; conversely, the cooler the color, the less attention the region receives. Compared with the other MobileViT-based variants, the proposed model (MobileViT + Improved MV2 + Cosine Annealing + Dual-Path Attention) identifies single or multiple targets within an image more accurately and places less emphasis on the background than the baseline MobileViT. In other words, it covers the lesion regions in the MRI images more precisely, whether they are small, medium, or large. These results suggest that the introduction of transfer learning, the improved activation function, Dual-Path Attention, and the optimized learning rate helps the proposed model learn more accurate features than MobileViT, thus improving its representation ability. </p>
      </sec>
    </sec>
    <sec id="sec5">
      <title>5. Limitations and Discussions</title>
      <p>Although the proposed model has achieved competitive performance in tumor image classification, the current model still has some obvious limitations. </p>
      <p>Firstly, to improve accuracy we introduced the CBAM module, which also increased the number of model parameters by a small amount and made the FLOPs of the model slightly larger than those of MobileViT. </p>
      <fig id="fig11">
        <label>Figure 11</label>
        <graphic xlink:href="https://html.scirp.org/file/1733488-rId184.jpeg?20260326025925" />
      </fig>
      <p>Figure 11. Grad-CAM-based model visualization (Alzheimer’s disease).</p>
      <fig id="fig12">
        <label>Figure 12</label>
        <graphic xlink:href="https://html.scirp.org/file/1733488-rId185.jpeg?20260326025925" />
      </fig>
      <p>Figure 12. Grad-CAM-based model visualization (brain tumor disease).</p>
      <p>Secondly, MRI images are highly diverse, differing in scanning parameters, resolution, type, size, and other factors, and the proposed model may encounter challenges in dealing with this diversity. The model also requires an extensive dataset for effective learning and generalization, so its performance could be constrained when only a small dataset is available. </p>
      <p>Thirdly, optimization algorithms play a crucial role in deep learning, as they directly affect both the efficiency of model training and the final performance. In this paper, we only used cosine annealing to adjust the learning rate of the network and did not consider optimizing the other hyperparameters, so the model requires extensive experimentation and fine-tuning to attain optimal results. In the future, we will investigate better optimization algorithms to determine the parameter set that is most likely to achieve the best results. </p>
      <p>Finally, various types of medical image data exist in the field, such as skin disease images, colorectal cancer images, lung disease images, and retina OCT images, among others. Image data exhibit distinct characteristics across various medical domains. Whether our proposed MRI image classification model can be used in these fields remains to be further investigated. </p>
    </sec>
    <sec id="sec6">
      <title>6. Conclusions</title>
      <p>To enable MRI image classification models to be applied on mobile and embedded devices, we proposed an improved lightweight model based on the MobileViT network. Firstly, CBAM and Dual-Path Attention were employed to improve the model’s ability to capture local information and to fuse global information. Secondly, the ReLU6 activation function in the MV2 module was replaced with the more robust SiLU activation function. Thirdly, a cosine annealing algorithm was used to update the learning rate of the proposed model to help it avoid becoming trapped in local optima. Finally, the proposed model employs a transfer learning approach, in which the model is pre-trained on the ImageNet dataset to better capture the features of MRI images. Extensive experiments demonstrate that our model delivers competitive performance in MRI image classification. The result is a low-cost intelligent diagnostic tool that can assist medical specialists and radiographers in the early diagnosis of brain disease and is also suitable for deployment on edge devices. </p>
      <p>In future studies, experiments will be conducted using a larger dataset. We plan to validate the effectiveness of our model in other medical image domains, such as skin disease images, colorectal cancer images, lung disease images, and retinal images. Additionally, although the lightweight model proposed in this paper achieves high classification accuracy, its computational complexity is not the lowest among the compared models. Therefore, a potential direction for future research is to further reduce the computational cost of the model without sacrificing accuracy. </p>
    </sec>
    <sec id="sec7">
      <title>Acknowledgements</title>
      <p>This research is supported in part by the National Natural Science Foundation of China (NSFC) (Grant No. 72461030).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <title>References</title>
      <ref id="B1">
        <label>1.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">(2024) 2024 Alzheimer’s Disease Facts and Figures. <italic>Alzheimer’s &amp; Dementia</italic>, 20, 3708-3821.</mixed-citation>
          <element-citation publication-type="other">
            <year>2024</year>
            <article-title>2024 Alzheimer’s Disease Facts and Figures</article-title>
            <source>Alzheimer’s &amp; Dementia</source>
            <volume>20</volume>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B2">
        <label>2.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Louis, D.N., Perry, A., Wesseling, P., Brat, D.J., Cree, I.A., Figarella-Branger, D., <italic>et al</italic>. (2021) The 2021 WHO Classification of Tumors of the Central Nervous System: A Summary. <italic>Neuro-Oncology</italic>, 23, 1231-1251. https://doi.org/10.1093/neuonc/noab106 <pub-id pub-id-type="doi">10.1093/neuonc/noab106</pub-id><pub-id pub-id-type="pmid">34185076</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1093/neuonc/noab106">https://doi.org/10.1093/neuonc/noab106</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Louis, D.N.</string-name>
              <string-name>Perry, A.</string-name>
              <string-name>Wesseling, P.</string-name>
              <string-name>Brat, D.J.</string-name>
              <string-name>Cree, I.A.</string-name>
              <string-name>Figarella-Branger, D.</string-name>
            </person-group>
            <year>2021</year>
            <article-title>The 2021 WHO Classification of Tumors of the Central Nervous System: A Summary</article-title>
            <source>Neuro-Oncology</source>
            <volume>23</volume>
            <pub-id pub-id-type="doi">10.1093/neuonc/noab106</pub-id>
            <pub-id pub-id-type="pmid">34185076</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B3">
        <label>3.</label>
        <citation-alternatives>
          <mixed-citation publication-type="journal">Asgharzadeh-Bonab, A., Kalbkhani, H. and Azarfardian, S. (2023) An Alzheimer’s Disease Classification Method Using Fusion of Features from Brain Magnetic Resonance Image Transforms and Deep Convolutional Networks. <italic>Healthcare Analytics</italic>, 4, Article ID: 100223. https://doi.org/10.1016/j.health.2023.100223 <pub-id pub-id-type="doi">10.1016/j.health.2023.100223</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.health.2023.100223">https://doi.org/10.1016/j.health.2023.100223</ext-link></mixed-citation>
          <element-citation publication-type="journal">
            <person-group person-group-type="author">
              <string-name>Asgharzadeh-Bonab, A.</string-name>
              <string-name>Kalbkhani, H.</string-name>
              <string-name>Azarfardian, S.</string-name>
            </person-group>
            <year>2023</year>
            <article-title>An Alzheimer’s Disease Classification Method Using Fusion of Features from Brain Magnetic Resonance Image Transforms and Deep Convolutional Networks</article-title>
            <source>Healthcare Analytics</source>
            <volume>4</volume>
            <elocation-id>100223</elocation-id>
            <pub-id pub-id-type="doi">10.1016/j.health.2023.100223</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B4">
        <label>4.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Zhang, X., Gao, L., Wang, Z., Yu, Y., Zhang, Y. and Hong, J. (2024) Improved Neural Network with Multi-Task Learning for Alzheimer’s Disease Classification. <italic>Heliyon</italic>, 10, e26405. https://doi.org/10.1016/j.heliyon.2024.e26405 <pub-id pub-id-type="doi">10.1016/j.heliyon.2024.e26405</pub-id><pub-id pub-id-type="pmid">38434063</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.heliyon.2024.e26405">https://doi.org/10.1016/j.heliyon.2024.e26405</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Zhang, X.</string-name>
              <string-name>Gao, L.</string-name>
              <string-name>Wang, Z.</string-name>
              <string-name>Yu, Y.</string-name>
              <string-name>Zhang, Y.</string-name>
              <string-name>Hong, J.</string-name>
            </person-group>
            <year>2024</year>
            <article-title>Improved Neural Network with Multi-Task Learning for Alzheimer’s Disease Classification</article-title>
            <source>Heliyon</source>
            <volume>10</volume>
            <pub-id pub-id-type="doi">10.1016/j.heliyon.2024.e26405</pub-id>
            <pub-id pub-id-type="pmid">38434063</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B5">
        <label>5.</label>
        <citation-alternatives>
          <mixed-citation publication-type="journal">Yang, Z., Liu, W., Gan, H., Huang, Z., Zhou, R. and Shi, M. (2024) Alzheimer’s Disease Classification Based on Brain Region-to-Sample Graph Convolutional Network. <italic>Biomedical Signal Processing and Control</italic>, 96, Article ID: 106589. https://doi.org/10.1016/j.bspc.2024.106589 <pub-id pub-id-type="doi">10.1016/j.bspc.2024.106589</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.bspc.2024.106589">https://doi.org/10.1016/j.bspc.2024.106589</ext-link></mixed-citation>
          <element-citation publication-type="journal">
            <person-group person-group-type="author">
              <string-name>Yang, Z.</string-name>
              <string-name>Liu, W.</string-name>
              <string-name>Gan, H.</string-name>
              <string-name>Huang, Z.</string-name>
              <string-name>Zhou, R.</string-name>
              <string-name>Shi, M.</string-name>
            </person-group>
            <year>2024</year>
            <article-title>Alzheimer’s Disease Classification Based on Brain Region-to-Sample Graph Convolutional Network</article-title>
            <source>Biomedical Signal Processing and Control</source>
            <volume>96</volume>
            <elocation-id>106589</elocation-id>
            <pub-id pub-id-type="doi">10.1016/j.bspc.2024.106589</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B6">
        <label>6.</label>
        <citation-alternatives>
          <mixed-citation publication-type="journal">Qian, C. and Wang, Y. (2024) Mmanet: A Multi-Task Residual Network for Alzheimer’s Disease Classification and Brain Age Prediction. <italic>IRBM</italic>, 45, Article ID: 100840. https://doi.org/10.1016/j.irbm.2024.100840 <pub-id pub-id-type="doi">10.1016/j.irbm.2024.100840</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.irbm.2024.100840">https://doi.org/10.1016/j.irbm.2024.100840</ext-link></mixed-citation>
          <element-citation publication-type="journal">
            <person-group person-group-type="author">
              <string-name>Qian, C.</string-name>
              <string-name>Wang, Y.</string-name>
            </person-group>
            <year>2024</year>
            <article-title>Mmanet: A Multi-Task Residual Network for Alzheimer’s Disease Classification and Brain Age Prediction</article-title>
            <source>IRBM</source>
            <volume>45</volume>
            <elocation-id>100840</elocation-id>
            <pub-id pub-id-type="doi">10.1016/j.irbm.2024.100840</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B7">
        <label>7.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Ait Amou, M., Xia, K., Kamhi, S. and Mouhafid, M. (2022) A Novel MRI Diagnosis Method for Brain Tumor Classification Based on CNN and Bayesian Optimization. <italic>Healthcare</italic>, 10, Article No. 494. https://doi.org/10.3390/healthcare10030494 <pub-id pub-id-type="doi">10.3390/healthcare10030494</pub-id><pub-id pub-id-type="pmid">35326972</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3390/healthcare10030494">https://doi.org/10.3390/healthcare10030494</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Ait Amou, M.</string-name>
              <string-name>Xia, K.</string-name>
              <string-name>Kamhi, S.</string-name>
              <string-name>Mouhafid, M.</string-name>
            </person-group>
            <year>2022</year>
            <article-title>A Novel MRI Diagnosis Method for Brain Tumor Classification Based on CNN and Bayesian Optimization</article-title>
            <source>Healthcare</source>
            <volume>10</volume>
            <elocation-id>494</elocation-id>
            <pub-id pub-id-type="doi">10.3390/healthcare10030494</pub-id>
            <pub-id pub-id-type="pmid">35326972</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B8">
        <label>8.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Ozdemir, C. (2023) Classification of Brain Tumors from MR Images Using a New CNN Architecture. <italic>Traitement du Signal</italic>, 40, 611-618. https://doi.org/10.18280/ts.400219 <pub-id pub-id-type="doi">10.18280/ts.400219</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.18280/ts.400219">https://doi.org/10.18280/ts.400219</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Ozdemir, C.</string-name>
            </person-group>
            <year>2023</year>
            <article-title>Classification of Brain Tumors from MR Images Using a New CNN Architecture</article-title>
            <source>Traitement du Signal</source>
            <volume>40</volume>
            <fpage>611</fpage>
            <lpage>618</lpage>
            <pub-id pub-id-type="doi">10.18280/ts.400219</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B9">
        <label>9.</label>
        <citation-alternatives>
          <mixed-citation publication-type="journal">Attallah, O. and Pacal, I. (2026) Comparative Evaluation of Lightweight Convolutional Neural Network and Vision Transformer Models for Multi-Class Brain Tumor Classification Using Merged Large MRI Datasets. <italic>Chemometrics and Intelligent Laboratory Systems</italic>, 269, Article ID: 105609. https://doi.org/10.1016/j.chemolab.2025.105609 <pub-id pub-id-type="doi">10.1016/j.chemolab.2025.105609</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.chemolab.2025.105609">https://doi.org/10.1016/j.chemolab.2025.105609</ext-link></mixed-citation>
          <element-citation publication-type="journal">
            <person-group person-group-type="author">
              <string-name>Attallah, O.</string-name>
              <string-name>Pacal, I.</string-name>
            </person-group>
            <year>2026</year>
            <article-title>Comparative Evaluation of Lightweight Convolutional Neural Network and Vision Transformer Models for Multi-Class Brain Tumor Classification Using Merged Large MRI Datasets</article-title>
            <source>Chemometrics and Intelligent Laboratory Systems</source>
            <volume>269</volume>
            <elocation-id>105609</elocation-id>
            <pub-id pub-id-type="doi">10.1016/j.chemolab.2025.105609</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B10">
        <label>10.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Zhang, Q., Long, Y., Cai, H. and Chen, Y. (2024) Lightweight Neural Network for Alzheimer’s Disease Classification Using Multi-Slice sMRI. <italic>Magnetic Resonance Imaging</italic>, 107, 164-170. https://doi.org/10.1016/j.mri.2023.12.010 <pub-id pub-id-type="doi">10.1016/j.mri.2023.12.010</pub-id><pub-id pub-id-type="pmid">38176576</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.mri.2023.12.010">https://doi.org/10.1016/j.mri.2023.12.010</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Zhang, Q.</string-name>
              <string-name>Long, Y.</string-name>
              <string-name>Cai, H.</string-name>
              <string-name>Chen, Y.</string-name>
            </person-group>
            <year>2024</year>
            <article-title>Lightweight Neural Network for Alzheimer’s Disease Classification Using Multi-Slice sMRI</article-title>
            <source>Magnetic Resonance Imaging</source>
            <volume>107</volume>
            <fpage>164</fpage>
            <lpage>170</lpage>
            <pub-id pub-id-type="doi">10.1016/j.mri.2023.12.010</pub-id>
            <pub-id pub-id-type="pmid">38176576</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B11">
        <label>11.</label>
        <citation-alternatives>
          <mixed-citation publication-type="journal">Khatri, U. and Kwon, G. (2024) Diagnosis of Alzheimer’s Disease via Optimized Lightweight Convolution-Attention and Structural MRI. <italic>Computers in Biology and Medicine</italic>, 171, Article ID: 108116. https://doi.org/10.1016/j.compbiomed.2024.108116 <pub-id pub-id-type="doi">10.1016/j.compbiomed.2024.108116</pub-id><pub-id pub-id-type="pmid">38346370</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.compbiomed.2024.108116">https://doi.org/10.1016/j.compbiomed.2024.108116</ext-link></mixed-citation>
          <element-citation publication-type="journal">
            <person-group person-group-type="author">
              <string-name>Khatri, U.</string-name>
              <string-name>Kwon, G.</string-name>
            </person-group>
            <year>2024</year>
            <article-title>Diagnosis of Alzheimer’s Disease via Optimized Lightweight Convolution-Attention and Structural MRI</article-title>
            <source>Computers in Biology and Medicine</source>
            <volume>171</volume>
            <elocation-id>108116</elocation-id>
            <pub-id pub-id-type="doi">10.1016/j.compbiomed.2024.108116</pub-id>
            <pub-id pub-id-type="pmid">38346370</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B12">
        <label>12.</label>
        <citation-alternatives>
          <mixed-citation publication-type="journal">Liu, H., Huo, G., Li, Q., Guan, X. and Tseng, M. (2023) Multiscale Lightweight 3D Segmentation Algorithm with Attention Mechanism: Brain Tumor Image Segmentation. <italic>Expert Systems with Applications</italic>, 214, Article ID: 119166. https://doi.org/10.1016/j.eswa.2022.119166 <pub-id pub-id-type="doi">10.1016/j.eswa.2022.119166</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.eswa.2022.119166">https://doi.org/10.1016/j.eswa.2022.119166</ext-link></mixed-citation>
          <element-citation publication-type="journal">
            <person-group person-group-type="author">
              <string-name>Liu, H.</string-name>
              <string-name>Huo, G.</string-name>
              <string-name>Li, Q.</string-name>
              <string-name>Guan, X.</string-name>
              <string-name>Tseng, M.</string-name>
            </person-group>
            <year>2023</year>
            <article-title>Multiscale Lightweight 3D Segmentation Algorithm with Attention Mechanism: Brain Tumor Image Segmentation</article-title>
            <source>Expert Systems with Applications</source>
            <volume>214</volume>
            <elocation-id>119166</elocation-id>
            <pub-id pub-id-type="doi">10.1016/j.eswa.2022.119166</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B13">
        <label>13.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Vaiyapuri, T., Mahalingam, J., Ahmad, S., Abdeljaber, H.A.M., Yang, E. and Jeong, S. (2023) Ensemble Learning Driven Computer-Aided Diagnosis Model for Brain Tumor Classification on Magnetic Resonance Imaging. <italic>IEEE Access</italic>, 11, 91398-91406. https://doi.org/10.1109/access.2023.3306961 <pub-id pub-id-type="doi">10.1109/access.2023.3306961</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/access.2023.3306961">https://doi.org/10.1109/access.2023.3306961</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Vaiyapuri, T.</string-name>
              <string-name>Mahalingam, J.</string-name>
              <string-name>Ahmad, S.</string-name>
              <string-name>Abdeljaber, H.A.M.</string-name>
              <string-name>Yang, E.</string-name>
              <string-name>Jeong, S.</string-name>
            </person-group>
            <year>2023</year>
            <article-title>Ensemble Learning Driven Computer-Aided Diagnosis Model for Brain Tumor Classification on Magnetic Resonance Imaging</article-title>
            <source>IEEE Access</source>
            <volume>11</volume>
            <fpage>91398</fpage>
            <lpage>91406</lpage>
            <pub-id pub-id-type="doi">10.1109/access.2023.3306961</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B14">
        <label>14.</label>
        <citation-alternatives>
          <mixed-citation publication-type="journal">Luo, H., Zhou, D., Cheng, Y. and Wang, S. (2024) MPEDA-Net: A Lightweight Brain Tumor Segmentation Network Using Multi-Perspective Extraction and Dense Attention. <italic>Biomedical Signal Processing and Control</italic>, 91, Article ID: 106054. https://doi.org/10.1016/j.bspc.2024.106054 <pub-id pub-id-type="doi">10.1016/j.bspc.2024.106054</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.bspc.2024.106054">https://doi.org/10.1016/j.bspc.2024.106054</ext-link></mixed-citation>
          <element-citation publication-type="journal">
            <person-group person-group-type="author">
              <string-name>Luo, H.</string-name>
              <string-name>Zhou, D.</string-name>
              <string-name>Cheng, Y.</string-name>
              <string-name>Wang, S.</string-name>
            </person-group>
            <year>2024</year>
            <article-title>MPEDA-Net: A Lightweight Brain Tumor Segmentation Network Using Multi-Perspective Extraction and Dense Attention</article-title>
            <source>Biomedical Signal Processing and Control</source>
            <volume>91</volume>
            <elocation-id>106054</elocation-id>
            <pub-id pub-id-type="doi">10.1016/j.bspc.2024.106054</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B15">
        <label>15.</label>
        <citation-alternatives>
          <mixed-citation publication-type="journal">Haq, E.U., Yong, Q., Yuan, Z., Huarong, X. and Haq, R.U. (2025) Multimodal Fusion Diagnosis of the Alzheimer’s Disease via Lightweight CNN-LSTM Model Using Magnetic Resonance Imaging (MRI). <italic>Biomedical Signal Processing and Control</italic>, 104, Article ID: 107545. https://doi.org/10.1016/j.bspc.2025.107545 <pub-id pub-id-type="doi">10.1016/j.bspc.2025.107545</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.bspc.2025.107545">https://doi.org/10.1016/j.bspc.2025.107545</ext-link></mixed-citation>
          <element-citation publication-type="journal">
            <person-group person-group-type="author">
              <string-name>Haq, E.U.</string-name>
              <string-name>Yong, Q.</string-name>
              <string-name>Yuan, Z.</string-name>
              <string-name>Huarong, X.</string-name>
              <string-name>Haq, R.U.</string-name>
            </person-group>
            <year>2025</year>
            <article-title>Multimodal Fusion Diagnosis of the Alzheimer’s Disease via Lightweight CNN-LSTM Model Using Magnetic Resonance Imaging (MRI)</article-title>
            <source>Biomedical Signal Processing and Control</source>
            <volume>104</volume>
            <elocation-id>107545</elocation-id>
            <pub-id pub-id-type="doi">10.1016/j.bspc.2025.107545</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B16">
        <label>16.</label>
        <citation-alternatives>
          <mixed-citation publication-type="journal">Nizamani, A.H., Chen, Z. and Bhatti, U.A. (2026) Deep-Fusion: A Lightweight Feature Fusion Model with Cross-Stream Attention and Attention Prediction Head for Brain Tumor Diagnosis. <italic>Biomedical Signal Processing and Control</italic>, 111, Article ID: 108305. https://doi.org/10.1016/j.bspc.2025.108305 <pub-id pub-id-type="doi">10.1016/j.bspc.2025.108305</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.bspc.2025.108305">https://doi.org/10.1016/j.bspc.2025.108305</ext-link></mixed-citation>
          <element-citation publication-type="journal">
            <person-group person-group-type="author">
              <string-name>Nizamani, A.H.</string-name>
              <string-name>Chen, Z.</string-name>
              <string-name>Bhatti, U.A.</string-name>
            </person-group>
            <year>2026</year>
            <article-title>Deep-Fusion: A Lightweight Feature Fusion Model with Cross-Stream Attention and Attention Prediction Head for Brain Tumor Diagnosis</article-title>
            <source>Biomedical Signal Processing and Control</source>
            <volume>111</volume>
            <elocation-id>108305</elocation-id>
            <pub-id pub-id-type="doi">10.1016/j.bspc.2025.108305</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B17">
        <label>17.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Mehta, S. and Rastegari, M. (2021) MobileViT: Light-Weight, General-Purpose, and Mobile-Friendly Vision Transformer.</mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Mehta, S.</string-name>
              <string-name>Rastegari, M.</string-name>
            </person-group>
            <year>2021</year>
            <article-title>MobileViT: Light-Weight, General-Purpose, and Mobile-Friendly Vision Transformer</article-title>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B18">
        <label>18.</label>
        <citation-alternatives>
          <mixed-citation publication-type="confproc">Sandler, M., Howard, A., Zhu, M., Zhmoginov, A. and Chen, L. (2018) MobileNetV2: Inverted Residuals and Linear Bottlenecks. 2018 <italic>IEEE/CVF Conference on Computer Vision and Pattern Recognition</italic>, Salt Lake City, 18-22 June 2018, 4510-4520. https://doi.org/10.1109/cvpr.2018.00474 <pub-id pub-id-type="doi">10.1109/cvpr.2018.00474</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/cvpr.2018.00474">https://doi.org/10.1109/cvpr.2018.00474</ext-link></mixed-citation>
          <element-citation publication-type="confproc">
            <person-group person-group-type="author">
              <string-name>Sandler, M.</string-name>
              <string-name>Howard, A.</string-name>
              <string-name>Zhu, M.</string-name>
              <string-name>Zhmoginov, A.</string-name>
              <string-name>Chen, L.</string-name>
            </person-group>
            <year>2018</year>
            <article-title>MobileNetV2: Inverted Residuals and Linear Bottlenecks</article-title>
            <source>2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
            <fpage>4510</fpage>
            <lpage>4520</lpage>
            <pub-id pub-id-type="doi">10.1109/cvpr.2018.00474</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B19">
        <label>19.</label>
        <citation-alternatives>
          <mixed-citation publication-type="confproc">Woo, S., Park, J., Lee, J. and Kweon, I.S. (2018) CBAM: Convolutional Block Attention Module. <italic>Proceedings of the European Conference on Computer Vision</italic> (<italic>ECCV</italic>), Munich, 8-14 September 2018, 3-19. https://doi.org/10.1007/978-3-030-01234-2_1 <pub-id pub-id-type="doi">10.1007/978-3-030-01234-2_1</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/978-3-030-01234-2_1">https://doi.org/10.1007/978-3-030-01234-2_1</ext-link></mixed-citation>
          <element-citation publication-type="confproc">
            <person-group person-group-type="author">
              <string-name>Woo, S.</string-name>
              <string-name>Park, J.</string-name>
              <string-name>Lee, J.</string-name>
              <string-name>Kweon, I.S.</string-name>
            </person-group>
            <year>2018</year>
            <article-title>CBAM: Convolutional Block Attention Module</article-title>
            <source>Proceedings of the European Conference on Computer Vision (ECCV)</source>
            <fpage>3</fpage>
            <lpage>19</lpage>
            <pub-id pub-id-type="doi">10.1007/978-3-030-01234-2_1</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B20">
        <label>20.</label>
        <citation-alternatives>
          <mixed-citation publication-type="confproc">Wang, C.Y., Liao, H.Y.M., Wu, Y.H., <italic>et al</italic>. (2020) CSPNet: A New Backbone That Can Enhance Learning Capability of CNN. 2020 <italic>IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops</italic> (<italic>CVPRW</italic>), 14-19 June 2020, 1571-1580. https://doi.org/10.1109/cvprw50498.2020.00203 <pub-id pub-id-type="doi">10.1109/cvprw50498.2020.00203</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/cvprw50498.2020.00203">https://doi.org/10.1109/cvprw50498.2020.00203</ext-link></mixed-citation>
          <element-citation publication-type="confproc">
            <person-group person-group-type="author">
              <string-name>Wang, C.Y.</string-name>
              <string-name>Liao, H.Y.M.</string-name>
              <string-name>Wu, Y.H.</string-name>
              <etal/>
            </person-group>
            <year>2020</year>
            <article-title>CSPNet: A New Backbone That Can Enhance Learning Capability of CNN</article-title>
            <source>2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)</source>
            <fpage>1571</fpage>
            <lpage>1580</lpage>
            <pub-id pub-id-type="doi">10.1109/cvprw50498.2020.00203</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B21">
        <label>21.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Shen, L. and Wang, Y. (2022) TCCT: Tightly-Coupled Convolutional Transformer on Time Series Forecasting. <italic>Neurocomputing</italic>, 480, 131-145. https://doi.org/10.1016/j.neucom.2022.01.039 <pub-id pub-id-type="doi">10.1016/j.neucom.2022.01.039</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.neucom.2022.01.039">https://doi.org/10.1016/j.neucom.2022.01.039</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Shen, L.</string-name>
              <string-name>Wang, Y.</string-name>
            </person-group>
            <year>2022</year>
            <article-title>TCCT: Tightly-Coupled Convolutional Transformer on Time Series Forecasting</article-title>
            <source>Neurocomputing</source>
            <volume>480</volume>
            <fpage>131</fpage>
            <lpage>145</lpage>
            <pub-id pub-id-type="doi">10.1016/j.neucom.2022.01.039</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B22">
        <label>22.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Pasnoori, N., Flores-Garcia, T. and Barkana, B.D. (2024) Histogram-Based Features Track Alzheimer’s Progression in Brain MRI. <italic>Scientific Reports</italic>, 14, Article No. 257. https://doi.org/10.1038/s41598-023-50631-1 <pub-id pub-id-type="doi">10.1038/s41598-023-50631-1</pub-id><pub-id pub-id-type="pmid">38167618</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1038/s41598-023-50631-1">https://doi.org/10.1038/s41598-023-50631-1</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Pasnoori, N.</string-name>
              <string-name>Flores-Garcia, T.</string-name>
              <string-name>Barkana, B.D.</string-name>
            </person-group>
            <year>2024</year>
            <article-title>Histogram-Based Features Track Alzheimer’s Progression in Brain MRI</article-title>
            <source>Scientific Reports</source>
            <volume>14</volume>
            <elocation-id>257</elocation-id>
            <pub-id pub-id-type="doi">10.1038/s41598-023-50631-1</pub-id>
            <pub-id pub-id-type="pmid">38167618</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B23">
        <label>23.</label>
        <citation-alternatives>
          <mixed-citation publication-type="journal">Muezzinoglu, T., Baygin, N., Tuncer, I., Barua, P.D., Baygin, M., Dogan, S., <italic>et al</italic>. (2023) PatchResNet: Multiple Patch Division-Based Deep Feature Fusion Framework for Brain Tumor Classification Using MRI Images. <italic>Journal of Digital Imaging</italic>, 36, 973-987. https://doi.org/10.1007/s10278-023-00789-x <pub-id pub-id-type="doi">10.1007/s10278-023-00789-x</pub-id><pub-id pub-id-type="pmid">36797543</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/s10278-023-00789-x">https://doi.org/10.1007/s10278-023-00789-x</ext-link></mixed-citation>
          <element-citation publication-type="journal">
            <person-group person-group-type="author">
              <string-name>Muezzinoglu, T.</string-name>
              <string-name>Baygin, N.</string-name>
              <string-name>Tuncer, I.</string-name>
              <string-name>Barua, P.D.</string-name>
              <string-name>Baygin, M.</string-name>
              <string-name>Dogan, S.</string-name>
              <etal/>
            </person-group>
            <year>2023</year>
            <article-title>PatchResNet: Multiple Patch Division-Based Deep Feature Fusion Framework for Brain Tumor Classification Using MRI Images</article-title>
            <source>Journal of Digital Imaging</source>
            <volume>36</volume>
            <fpage>973</fpage>
            <lpage>987</lpage>
            <pub-id pub-id-type="doi">10.1007/s10278-023-00789-x</pub-id>
            <pub-id pub-id-type="pmid">36797543</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B24">
        <label>24.</label>
        <citation-alternatives>
          <mixed-citation publication-type="confproc">He, K., Zhang, X., Ren, S. and Sun, J. (2016) Deep Residual Learning for Image Recognition. 2016 <italic>IEEE Conference on Computer Vision and Pattern Recognition</italic> (<italic>CVPR</italic>), Las Vegas, 27-30 June 2016, 770-778. https://doi.org/10.1109/cvpr.2016.90 <pub-id pub-id-type="doi">10.1109/cvpr.2016.90</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/cvpr.2016.90">https://doi.org/10.1109/cvpr.2016.90</ext-link></mixed-citation>
          <element-citation publication-type="confproc">
            <person-group person-group-type="author">
              <string-name>He, K.</string-name>
              <string-name>Zhang, X.</string-name>
              <string-name>Ren, S.</string-name>
              <string-name>Sun, J.</string-name>
            </person-group>
            <year>2016</year>
            <article-title>Deep Residual Learning for Image Recognition</article-title>
            <source>2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
            <fpage>770</fpage>
            <lpage>778</lpage>
            <pub-id pub-id-type="doi">10.1109/cvpr.2016.90</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B25">
        <label>25.</label>
        <citation-alternatives>
          <mixed-citation publication-type="confproc">Huang, G., Liu, Z., Van Der Maaten, L. and Weinberger, K.Q. (2017) Densely Connected Convolutional Networks. 2017 <italic>IEEE Conference on Computer Vision and Pattern Recognition</italic> (<italic>CVPR</italic>), Honolulu, 21-26 July 2017, 4700-4708. https://doi.org/10.1109/cvpr.2017.243 <pub-id pub-id-type="doi">10.1109/cvpr.2017.243</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/cvpr.2017.243">https://doi.org/10.1109/cvpr.2017.243</ext-link></mixed-citation>
          <element-citation publication-type="confproc">
            <person-group person-group-type="author">
              <string-name>Huang, G.</string-name>
              <string-name>Liu, Z.</string-name>
              <string-name>Van Der Maaten, L.</string-name>
              <string-name>Weinberger, K.Q.</string-name>
            </person-group>
            <year>2017</year>
            <article-title>Densely Connected Convolutional Networks</article-title>
            <source>2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
            <fpage>4700</fpage>
            <lpage>4708</lpage>
            <pub-id pub-id-type="doi">10.1109/cvpr.2017.243</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B26">
        <label>26.</label>
        <citation-alternatives>
          <mixed-citation publication-type="confproc">Zhang, X., Zhou, X., Lin, M. and Sun, J. (2018) ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. 2018 <italic>IEEE/CVF Conference on Computer Vision and Pattern Recognition</italic>, Salt Lake City, 18-22 June 2018, 6848-6856. https://doi.org/10.1109/cvpr.2018.00716 <pub-id pub-id-type="doi">10.1109/cvpr.2018.00716</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/cvpr.2018.00716">https://doi.org/10.1109/cvpr.2018.00716</ext-link></mixed-citation>
          <element-citation publication-type="confproc">
            <person-group person-group-type="author">
              <string-name>Zhang, X.</string-name>
              <string-name>Zhou, X.</string-name>
              <string-name>Lin, M.</string-name>
              <string-name>Sun, J.</string-name>
            </person-group>
            <year>2018</year>
            <article-title>ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices</article-title>
            <source>2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
            <fpage>6848</fpage>
            <lpage>6856</lpage>
            <pub-id pub-id-type="doi">10.1109/cvpr.2018.00716</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B27">
        <label>27.</label>
        <citation-alternatives>
          <mixed-citation publication-type="confproc">Tan, M. and Le, Q. (2019) EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. <italic>International Conference on Machine Learning</italic>, <italic>PMLR</italic>, Long Beach, 9-15 June 2019, 6105-6114.</mixed-citation>
          <element-citation publication-type="confproc">
            <person-group person-group-type="author">
              <string-name>Tan, M.</string-name>
              <string-name>Le, Q.</string-name>
            </person-group>
            <year>2019</year>
            <article-title>EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks</article-title>
            <source>International Conference on Machine Learning</source>
            <fpage>6105</fpage>
            <lpage>6114</lpage>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B28">
        <label>28.</label>
        <citation-alternatives>
          <mixed-citation publication-type="confproc">Howard, A., Sandler, M., Chen, B., Wang, W., Chen, L., Tan, M., <italic>et al</italic>. (2019) Searching for MobileNetV3. 2019 <italic>IEEE/CVF International Conference on Computer Vision</italic> (<italic>ICCV</italic>), Seoul, 27-28 October 2019, 1314-1324. https://doi.org/10.1109/iccv.2019.00140 <pub-id pub-id-type="doi">10.1109/iccv.2019.00140</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/iccv.2019.00140">https://doi.org/10.1109/iccv.2019.00140</ext-link></mixed-citation>
          <element-citation publication-type="confproc">
            <person-group person-group-type="author">
              <string-name>Howard, A.</string-name>
              <string-name>Sandler, M.</string-name>
              <string-name>Chen, B.</string-name>
              <string-name>Wang, W.</string-name>
              <string-name>Chen, L.</string-name>
              <string-name>Tan, M.</string-name>
              <etal/>
            </person-group>
            <year>2019</year>
            <article-title>Searching for MobileNetV3</article-title>
            <source>2019 IEEE/CVF International Conference on Computer Vision (ICCV)</source>
            <fpage>1314</fpage>
            <lpage>1324</lpage>
            <pub-id pub-id-type="doi">10.1109/iccv.2019.00140</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B29">
        <label>29.</label>
        <citation-alternatives>
          <mixed-citation publication-type="confproc">Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D. and Batra, D. (2017) Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. 2017 <italic>IEEE International Conference on Computer Vision</italic> (<italic>ICCV</italic>), Venice, 22-29 October 2017, 618-626. https://doi.org/10.1109/iccv.2017.74 <pub-id pub-id-type="doi">10.1109/iccv.2017.74</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/iccv.2017.74">https://doi.org/10.1109/iccv.2017.74</ext-link></mixed-citation>
          <element-citation publication-type="confproc">
            <person-group person-group-type="author">
              <string-name>Selvaraju, R.R.</string-name>
              <string-name>Cogswell, M.</string-name>
              <string-name>Das, A.</string-name>
              <string-name>Vedantam, R.</string-name>
              <string-name>Parikh, D.</string-name>
              <string-name>Batra, D.</string-name>
            </person-group>
            <year>2017</year>
            <article-title>Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization</article-title>
            <source>2017 IEEE International Conference on Computer Vision (ICCV)</source>
            <fpage>618</fpage>
            <lpage>626</lpage>
            <pub-id pub-id-type="doi">10.1109/iccv.2017.74</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
    </ref-list>
  </back>
</article>