<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="3.0" xml:lang="en" article-type="research article">
 <front>
  <journal-meta>
   <journal-id journal-id-type="publisher-id">
    jcc
   </journal-id>
   <journal-title-group>
    <journal-title>
     Journal of Computer and Communications
    </journal-title>
   </journal-title-group>
   <issn pub-type="epub">
    2327-5219
   </issn>
   <issn publication-format="print">
    2327-5227
   </issn>
   <publisher>
    <publisher-name>
     Scientific Research Publishing
    </publisher-name>
   </publisher>
  </journal-meta>
  <article-meta>
   <article-id pub-id-type="doi">
    10.4236/jcc.2025.134019
   </article-id>
   <article-id pub-id-type="publisher-id">
    jcc-142409
   </article-id>
   <article-categories>
    <subj-group subj-group-type="heading">
     <subject>
      Articles
     </subject>
    </subj-group>
    <subj-group subj-group-type="Discipline-v2">
     <subject>
      Computer Science 
     </subject>
     <subject>
       Communications
     </subject>
    </subj-group>
   </article-categories>
   <title-group>
    <article-title>Forest Fire Recognition Algorithm Based on Improved RT-DETR</article-title>
   </title-group>
   <contrib-group>
    <contrib contrib-type="author" xlink:type="simple">
     <name name-style="western">
      <surname>
       Da
      </surname>
      <given-names>
       Mu
      </given-names>
     </name>
    </contrib>
    <contrib contrib-type="author" xlink:type="simple">
     <name name-style="western">
      <surname>
       Yunfeng
      </surname>
      <given-names>
       Shang
      </given-names>
     </name>
    </contrib>
    <contrib contrib-type="author" xlink:type="simple">
     <name name-style="western">
      <surname>
       Xinlei
      </surname>
      <given-names>
       Hou
      </given-names>
     </name>
    </contrib>
   </contrib-group> 
   <aff id="affnull">
    <addr-line>
     Graduate Student Office, North China Institute of Science and Technology, Dongyanjiao, Beijing, China
    </addr-line> 
   </aff> 
   <pub-date pub-type="epub">
    <day>
     16
    </day> 
    <month>
     04
    </month>
    <year>
     2025
    </year>
   </pub-date> 
   <volume>
    13
   </volume> 
   <issue>
    04
   </issue>
   <fpage>
    311
   </fpage>
   <lpage>
    323
   </lpage>
   <history>
    <date date-type="received">
     <day>
      24
     </day>
     <month>
      April
     </month>
     <year>
      2025
     </year>
    </date>
    <date date-type="published">
     <day>
      27
     </day>
     <month>
      April
     </month>
     <year>
      2025
     </year> 
    </date> 
    <date date-type="accepted">
     <day>
      27
     </day>
     <month>
      April
     </month>
     <year>
      2025
     </year> 
    </date>
   </history>
   <permissions>
    <copyright-statement>
     © Copyright 2014 by authors and Scientific Research Publishing Inc. 
    </copyright-statement>
    <copyright-year>
     2014
    </copyright-year>
    <license>
     <license-p>
      This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0/
     </license-p>
    </license>
   </permissions>
   <abstract>
    In recent years, forest fires have occurred frequently around the world, making fire prevention and control technology crucial. Advances in artificial intelligence provide new technical means for forest fire prevention. RT-DETR, a recently proposed detection model, removes the Non-Maximum Suppression (NMS) step that constrains the YOLO series and has shown great potential in image recognition. However, its high resource consumption makes it difficult to deploy on embedded devices. To address this issue, a lightweight RT-DETR model is proposed that uses MobileNetV4 as the backbone network, reducing parameters, increasing computational speed, and lowering resource consumption. To maintain forest fire detection performance, a Hierarchical Level Fusion (HLF) module and Learnable Positional Encoding (LPE) are designed so that the model remains lightweight without compromising detection capability. This improvement offers a new approach for applying RT-DETR to forest fire prevention.
   </abstract>
   <kwd-group> 
    <kwd>
     Deep Learning
    </kwd> 
    <kwd>
      Forest Fire Prevention
    </kwd> 
    <kwd>
      Wildfire Detection
    </kwd> 
    <kwd>
      Lightweight
    </kwd>
   </kwd-group>
  </article-meta>
 </front>
 <body>
  <sec id="s1">
   <title>1. Introduction</title>
   <p>In recent years, fire incidents have become more frequent worldwide, making forest fire prevention particularly urgent. Against this backdrop, forest fire detection, a key component of forest fire prevention, has become an active research topic. YOLO, a classic family of object detection algorithms, is widely used for its processing speed and ease of use. However, YOLO series algorithms generally rely on a Non-Maximum Suppression (NMS) post-processing step, which has become a bottleneck for improving their performance.</p>
   <p>In this context, RT-DETR <xref ref-type="bibr" rid="scirp.142409-1">[1]</xref> departs from the traditional processing pipeline. It adopts an end-to-end Transformer architecture and, through successive improvements, outperforms previous Transformer architectures in performance, processing speed, and model size. RT-DETR has thus become a strong alternative in the field of image recognition, with the potential to match or even surpass YOLO in certain aspects. Research on applying RT-DETR to forest fire recognition is currently limited, so its potential in forest fire prevention has not been fully explored.</p>
   <p>A trained RT-DETR model needs to run not only on cloud platforms but also efficiently on embedded or mobile devices. Cloud platforms are typically located far from the fire scene, which makes real-time response to a fire difficult, so deploying the model on on-site embedded devices is particularly important. However, embedded devices have limited computing resources and cannot process large amounts of data quickly. The model therefore has to be made lightweight so that it remains efficient on resource-constrained devices.</p>
   <p>Jin, L. and others <xref ref-type="bibr" rid="scirp.142409-2">[2]</xref> reduced the consumption of computing resources through reparameterization to facilitate model deployment. Li, J. and others <xref ref-type="bibr" rid="scirp.142409-3">[3]</xref> reduced the model’s size by using MobileNetV3 instead of conventional convolution modules. Huang, J. <xref ref-type="bibr" rid="scirp.142409-4">[4]</xref>, Chen, G. <xref ref-type="bibr" rid="scirp.142409-5">[5]</xref>, and others proposed using Ghost and SENet modules to simplify the model. GhostNet and MobileNet are evidently popular network structures, each accounting for 40%, and they remain a focus of research on lightweight network structures. In addition to lightweight networks, model compression can also be achieved through knowledge distillation, channel pruning, and other operations. Wang, S. and others <xref ref-type="bibr" rid="scirp.142409-6">[6]</xref> proposed channel-level sparsity, in which each channel is assigned a scaling factor and channels whose factors remain small after training are eliminated. Zhou, M. and others <xref ref-type="bibr" rid="scirp.142409-7">[7]</xref> used semi-supervised knowledge distillation to simplify the YOLO model’s architecture, reducing its complexity and improving detection accuracy.</p>
   <p>Although these methods make the network structure more lightweight and easier to deploy on embedded devices, they often reduce model performance compared with the original structure. Therefore, an improved RT-DETR model has been designed to balance lightweight design and detection effectiveness. First, the newly proposed MobileNetV4 network is used to slim down the model. Second, the original sine and cosine encoding strategy is replaced with a learnable encoding strategy, so that the model can capture pixel location information more effectively and produce more precise encoded data. Finally, the HLF module is introduced to integrate features from different levels, including semantic and detailed information, further enhancing the model’s detection capabilities.</p>
  </sec><sec id="s2">
   <title>2. Materials and Methods</title>
   <sec id="s2_1">
    <title>2.1. Baseline RT-DETR Model</title>
    <p>RT-DETR is presented by its authors as the first real-time end-to-end object detection model. Built on the Transformer architecture, it aims to surpass YOLO in efficient and rapid real-time object recognition. Like DETR, RT-DETR consists of a backbone network, an encoder (here a hybrid encoder), and a Transformer decoder.</p>
    <p>The backbone is the initial stage where input images are processed to extract features. This typically involves a convolutional neural network that reduces the spatial dimensions of the image while increasing depth, generating feature maps at various stages (S3, S4, S5).</p>
    <p>The hybrid encoder comprises two main parts: Attention-based Intra-scale Feature Interaction (AIFI) and CNN-based Cross-scale Feature Fusion (CCFF).</p>
    <p>AIFI processes only the S5 feature map from the backbone. Since S5 represents the deepest layer, it contains the richest semantic information about complete objects and their context in the image—making it highly suitable for the Transformer-based attention to capture meaningful relationships between objects. Although S3 and S4 contain useful low-level features such as edges and object parts, applying attention to these layers would be computationally expensive and less efficient because they do not yet represent complete objects that need to be related to each other.</p>
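    <p>The cost argument above can be made concrete with a small calculation. The sketch below assumes a 640 × 640 input and strides of 8, 16, and 32 for S3, S4, and S5; these values are illustrative assumptions, not figures taken from this article. It counts the tokens in each feature map and the pairwise attention scores that full self-attention would compute over them.</p>
    <preformat preformat-type="code"><![CDATA[
```python
# Illustrative assumptions (not from the article): 640x640 input,
# strides 8/16/32 for the S3/S4/S5 feature maps.

def attention_cost(input_size: int, stride: int) -> tuple[int, int]:
    """Return (token count, pairwise attention scores) for one feature map."""
    side = input_size // stride
    tokens = side * side
    # Full self-attention compares every token with every token.
    return tokens, tokens * tokens

STRIDES = {"S3": 8, "S4": 16, "S5": 32}
costs = {name: attention_cost(640, stride) for name, stride in STRIDES.items()}

for name, (tokens, pairs) in costs.items():
    print(f"{name}: {tokens} tokens, {pairs} pairwise attention scores")
```
]]></preformat>
    <p>Under these assumptions, S3 would require 256 times as many attention scores as S5, which illustrates why restricting attention to S5 keeps the encoder inexpensive.</p>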
    <p>CCFF, based on CNN layers, combines the S3, S4, and AIFI-processed S5 feature maps. Its task is to fuse features from different scales so that the final feature maps contain information from a range of resolutions. The core of CCFF is a fusion block called RepBlock, which is built from RepConv layers. RepConv uses a multi-branch convolution structure during training and re-parameterizes it into a single equivalent convolution for inference, improving efficiency without sacrificing performance.</p>
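    <p>The idea behind switching convolution forms between training and inference can be sketched as follows. This is a minimal single-channel illustration, assuming a training-time block with one 3 × 3 branch and one 1 × 1 branch whose outputs are summed. Batch normalization folding and identity branches, which real re-parameterized blocks typically include, are omitted. The helper names <monospace>conv2d_same</monospace> and <monospace>fuse_branches</monospace> are hypothetical.</p>
    <preformat preformat-type="code"><![CDATA[
```python
def conv2d_same(image, kernel):
    """Single-channel 2-D correlation with zero padding (odd square kernel)."""
    k = len(kernel)
    pad = k // 2
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(k):
                for dx in range(k):
                    yy, xx = y + dy - pad, x + dx - pad
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += image[yy][xx] * kernel[dy][dx]
            out[y][x] = acc
    return out


def fuse_branches(k3, k1):
    """Fold a 1x1 kernel into the centre of a 3x3 kernel."""
    fused = [row[:] for row in k3]
    fused[1][1] += k1[0][0]
    return fused


image = [[1.0, 2.0, 0.0], [0.0, 3.0, 1.0], [2.0, 1.0, 4.0]]
k3 = [[0.5, 0.0, -0.5], [1.0, 0.25, 0.0], [0.0, -1.0, 0.5]]
k1 = [[2.0]]

# Training-time form: two parallel branches whose outputs are summed.
branch3 = conv2d_same(image, k3)
branch1 = conv2d_same(image, k1)
train_out = [[a + b for a, b in zip(r3, r1)] for r3, r1 in zip(branch3, branch1)]

# Inference-time form: one convolution with the fused kernel.
infer_out = conv2d_same(image, fuse_branches(k3, k1))
```
]]></preformat>
    <p>Because convolution is linear, both forms give the same output, while the fused form needs only a single convolution at inference.</p>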
    <p>Following this, uncertainty-based query selection chooses the most confident encoder features as initial object queries, and RT-DETR then uses a Transformer decoder with multi-scale deformable attention to predict object locations and categories. The decoder architecture is designed to process multi-scale feature maps efficiently while maintaining high precision.</p>
   </sec>
   <sec id="s2_2">
    <title>2.2. Improved Model RT-DETR-HMI</title>
    <p>To facilitate deployment of the model on embedded devices for on-site real-time forest fire detection while maintaining its detection performance, several improvements have been made. First, to address the large parameter count and computational complexity of RT-DETR, the backbone network is replaced with a lightweight MobileNetV4 structure. Second, to counteract the decrease in detection accuracy after lightweighting, the positional encoding is replaced with LPE (Learnable Positional Encoding), which allows the model to better learn feature information. Third, the HLF (Hierarchical Level Fusion) module is employed to enhance the integration of fire features at different resolutions. The improved structure is shown in <xref ref-type="fig" rid="fig1">
      Figure 1
     </xref>.</p>
    <fig id="fig1" position="float">
     <label>Figure 1</label>
     <caption>
      <title>Figure 1. Structure of the improved RT-DETR model.</title>
     </caption>
     <graphic mimetype="image" position="float" xlink:type="simple" xlink:href="https://html.scirp.org/file/1733098-rId12.jpeg?20250430045100" />
    </fig>
   </sec>
   <sec id="s2_3">
    <title>2.3. MobileNetV4 Lightweight Module</title>
    <p>The latest generation of MobileNets, MobileNetV4 (MNv4) <xref ref-type="bibr" rid="scirp.142409-8">[8]</xref>, presents a general and efficient architecture design tailored for mobile devices. At its core is a search block known as the Universal Inverted Bottleneck (UIB). This structure is unified and flexible, integrating the Inverted Bottleneck (IB) <xref ref-type="bibr" rid="scirp.142409-9">[9]</xref>, ConvNext <xref ref-type="bibr" rid="scirp.142409-10">[10]</xref>, Feed-Forward Network (FFN), and a new Extra Depthwise (ExtraDW) <xref ref-type="bibr" rid="scirp.142409-11">[11]</xref> variant. In addition to UIB, the MobileNetV4 authors proposed Mobile MQA, an attention block designed for mobile accelerators that is reported to deliver a 39% inference speedup. To improve the search efficiency of MNv4, they also developed an optimized Neural Architecture Search (NAS) recipe.</p>
    <p>(1) UIB</p>
    <p>As shown in <xref ref-type="fig" rid="fig2">
      Figure 2
     </xref>, in the design of the inverted bottleneck, two optional Depthwise layers have been added. They are placed before the expansion layer and between the expansion layer and the projection layer. The inclusion of these layers is a key decision in the Neural Architecture Search (NAS) optimization process, leading to the creation of a new network architecture. Although this adjustment may seem simple, it ingeniously integrates multiple key network blocks, including the original Inverted Bottleneck (IB) block, the ConvNext block, and the Feed-Forward Network (FFN) block from ViT. Additionally, the UIB block introduces an innovative variant, the Extra Depthwise Inverted Bottleneck (ExtraDW) block.</p>
    <fig id="fig2" position="float">
     <label>Figure 2</label>
     <caption>
      <title>Figure 2. UIB and various IB combination structure modules.</title>
     </caption>
     <graphic mimetype="image" position="float" xlink:type="simple" xlink:href="https://html.scirp.org/file/1733098-rId13.jpeg?20250430045101" />
    </fig>
    <p>The Inverted Bottleneck (IB) performs spatial mixing on the expanded feature activations, which increases the model’s capacity at the cost of additional computation. ConvNext achieves cheaper spatial mixing by performing it before feature expansion, which also allows larger convolution kernel sizes. ExtraDW is a new variant introduced with MobileNetV4 that increases the network’s depth and receptive field at low cost, combining the advantages of ConvNext and the Inverted Bottleneck (IB). The FFN consists of two stacked 1 × 1 pointwise convolution (PW) layers with activation and normalization layers in between. PW is among the most accelerator-friendly operations, but it works best when used together with other network blocks.</p>
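    <p>The relationship between the two optional depthwise layers and the four block types can be summarized in a short sketch. The function names <monospace>uib_variant</monospace> and <monospace>uib_layers</monospace> and the layer descriptions are hypothetical and serve only to illustrate the description above. This is not code from MobileNetV4.</p>
    <preformat preformat-type="code"><![CDATA[
```python
def uib_variant(start_dw: bool, middle_dw: bool) -> str:
    """Name the block obtained from the two optional depthwise (DW) layers.

    start_dw:  DW layer placed before the expansion layer.
    middle_dw: DW layer placed between expansion and projection.
    """
    if start_dw and middle_dw:
        return "ExtraDW"
    if start_dw:
        return "ConvNext-like"
    if middle_dw:
        return "IB"
    return "FFN"


def uib_layers(start_dw: bool, middle_dw: bool) -> list[str]:
    """List the layers of the resulting block in execution order."""
    layers = []
    if start_dw:
        layers.append("depthwise conv (before expansion)")
    layers.append("1x1 pointwise expansion")
    if middle_dw:
        layers.append("depthwise conv (between expansion and projection)")
    layers.append("1x1 pointwise projection")
    return layers


for start in (False, True):
    for middle in (False, True):
        print(uib_variant(start, middle), "->", uib_layers(start, middle))
```
]]></preformat>
    <p>With both flags off, only the two pointwise layers remain, which corresponds to the FFN described above.</p>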
    <p>(2) Mobile MQA</p>
    <p>Multi-Query Attention (MQA) reduces the memory access burden by sharing the keys and values across all attention heads. Building on this, Spatial Reduction Attention (SRA) is adopted, applying asymmetric spatial downsampling to the keys and values to improve computational efficiency. In the implementation of Mobile MQA, a 3 × 3 depthwise convolution with a stride of 2 replaces the traditional average pooling step, halving the spatial resolution of the keys and values and thereby reducing inference time. The mathematical expression for Mobile MQA is as follows:</p>
    <p>
     <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"> <mrow>
       <mi>MobileMQA</mi>
       <mrow><mo>(</mo><mi>X</mi><mo>)</mo></mrow>
       <mo>=</mo>
       <mi>Concat</mi>
       <mrow><mo>(</mo><mrow><msub><mi>attention</mi><mn>1</mn></msub><mo>,</mo><mo>⋯</mo><mo>,</mo><msub><mi>attention</mi><mi>n</mi></msub></mrow><mo>)</mo></mrow>
       <msup><mi>W</mi><mi>O</mi></msup>
      </mrow>
     </math> (1)</p>
    <p>
     <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"> <mrow>
       <msub><mi>attention</mi><mi>j</mi></msub>
       <mo>=</mo>
       <mi>softmax</mi>
       <mrow>
        <mo>(</mo>
        <mfrac>
         <mrow>
          <mrow><mo>(</mo><mrow><mi>X</mi><msup><mi>W</mi><msub><mi>Q</mi><mi>j</mi></msub></msup></mrow><mo>)</mo></mrow>
          <msup><mrow><mo>(</mo><mrow><mi>SR</mi><mrow><mo>(</mo><mi>X</mi><mo>)</mo></mrow><msup><mi>W</mi><mi>K</mi></msup></mrow><mo>)</mo></mrow><mi>T</mi></msup>
         </mrow>
         <msqrt><msub><mi>d</mi><mi>k</mi></msub></msqrt>
        </mfrac>
        <mo>)</mo>
       </mrow>
       <mrow><mo>(</mo><mrow><mi>SR</mi><mrow><mo>(</mo><mi>X</mi><mo>)</mo></mrow><msup><mi>W</mi><mi>V</mi></msup></mrow><mo>)</mo></mrow>
      </mrow>
     </math> (2)</p>
    <p>SR stands for spatial reduction, which involves asymmetric spatial downsampling of keys and values to further enhance efficiency.</p>
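    <p>A minimal sketch of this computation is given below, using plain Python lists. It assumes row-major tokens on an even-sized grid, per-head query projections, and a single key and value projection shared by all heads. As a simplification, 2 × 2 average pooling with stride 2 stands in for the strided depthwise convolution described above. All function and variable names are hypothetical.</p>
    <preformat preformat-type="code"><![CDATA[
```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def spatial_reduce(tokens, height, width):
    """SR stand-in: 2x2 average pooling, stride 2, over a row-major token grid."""
    reduced = []
    for y in range(0, height, 2):
        for x in range(0, width, 2):
            cell = [tokens[(y + dy) * width + (x + dx)] for dy in (0, 1) for dx in (0, 1)]
            reduced.append([sum(vals) / 4.0 for vals in zip(*cell)])
    return reduced


def mobile_mqa(tokens, height, width, w_q_heads, w_k, w_v, w_o):
    """Multi-query attention with spatially reduced, shared keys and values."""
    reduced = spatial_reduce(tokens, height, width)
    keys = matmul(reduced, w_k)      # one key projection shared by all heads
    values = matmul(reduced, w_v)    # one value projection shared by all heads
    keys_t = [list(col) for col in zip(*keys)]
    scale = math.sqrt(len(w_k[0]))
    heads = []
    for w_q in w_q_heads:            # each head keeps its own query projection
        queries = matmul(tokens, w_q)
        scores = [[s / scale for s in row] for row in matmul(queries, keys_t)]
        heads.append(matmul(softmax_rows(scores), values))
    concat = [sum((head[i] for head in heads), []) for i in range(len(tokens))]
    return matmul(concat, w_o)


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


H, W, D, D_K, HEADS = 4, 4, 8, 4, 2
x = rand_matrix(H * W, D)
out = mobile_mqa(
    x, H, W,
    [rand_matrix(D, D_K) for _ in range(HEADS)],
    rand_matrix(D, D_K), rand_matrix(D, D_K),
    rand_matrix(HEADS * D_K, D),
)
print(len(out), len(out[0]))  # 16 output tokens, each of dimension 8
```
]]></preformat>
    <p>In this toy setting the 16 query tokens attend to only 4 reduced key and value tokens, and the key and value projections are computed once rather than once per head.</p>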
   </sec>
   <sec id="s2_4">
    <title>2.4. AIFI-LPE Encoding</title>
    <p>Because the attention mechanism in Transformers processes all tokens in parallel, it has no inherent notion of token order. Positional encoding is therefore used to give each token vector positional information. The original positional encoding in the Transformer model uses a combination of sine and cosine functions <xref ref-type="bibr" rid="scirp.142409-12">[12]</xref>, which is fixed and may not be ideal for expressing complex positional information. Therefore, Learnable Positional Encoding (LPE) is adopted.</p>
    <p>LPE treats positional information as a trainable parameter and adjusts the encoding in each training iteration. This allows a more thorough representation of the positional relationships between vectors, thereby enhancing the effectiveness of the attention mechanism. Moreover, because LPE is learned from the training data, it can adapt to the data distribution at hand.</p>
    <fig id="fig3" position="float">
     <label>Figure 3</label>
     <caption>
      <title>Figure 3. Schematic diagram of learnable positional encoding.</title>
     </caption>
     <graphic mimetype="image" position="float" xlink:type="simple" xlink:href="https://html.scirp.org/file/1733098-rId18.jpeg?20250430045101" />
    </fig>
    <p>As shown in <xref ref-type="fig" rid="fig3">
      Figure 3
     </xref>, each element in the token sequence finds its corresponding positional encoding through a lookup in the embedding table. The positional encoding is then added to the element to obtain a vector representation that includes positional information. Subsequently, during training, the loss function is computed, and during the backpropagation process, the weight information in the position embedding is optimized. This results in more suitable positional information for each element, leading to better training outcomes for the attention mechanism.</p>
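    <p>The lookup, addition, and update steps described above can be sketched as follows. This is a toy illustration only: the class name, the zero initialization, the squared-error objective, and the learning rate are assumptions made for the example, and do not reflect the training setup used in this article.</p>
    <preformat preformat-type="code"><![CDATA[
```python
class LearnablePositionalEncoding:
    """Embedding table with one trainable vector per position."""

    def __init__(self, num_positions: int, dim: int):
        # Zero initialization is an assumption made for this example.
        self.table = [[0.0] * dim for _ in range(num_positions)]

    def forward(self, tokens):
        """Look up each position's vector and add it to the token."""
        return [[t + p for t, p in zip(tok, self.table[i])]
                for i, tok in enumerate(tokens)]

    def sgd_step(self, grads, lr: float):
        """Update the table by gradient descent."""
        for i, g_row in enumerate(grads):
            self.table[i] = [p - lr * g for p, g in zip(self.table[i], g_row)]


def loss_and_grad(outputs, targets):
    """Toy squared-error loss and its gradient w.r.t. the outputs.

    Since output = token + position, this gradient is also the
    gradient with respect to the positional table entries.
    """
    loss, grads = 0.0, []
    for o_row, t_row in zip(outputs, targets):
        diff = [o - t for o, t in zip(o_row, t_row)]
        loss += 0.5 * sum(d * d for d in diff)
        grads.append(diff)
    return loss, grads


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
targets = [[1.5, 0.0], [0.0, 0.5], [2.0, 2.0]]
lpe = LearnablePositionalEncoding(num_positions=3, dim=2)

initial_loss, _ = loss_and_grad(lpe.forward(tokens), targets)
for _ in range(20):
    _, grads = loss_and_grad(lpe.forward(tokens), targets)
    lpe.sgd_step(grads, lr=0.5)
final_loss, _ = loss_and_grad(lpe.forward(tokens), targets)
print(initial_loss, final_loss)
```
]]></preformat>
    <p>In a full model the gradient would arrive from the detection loss through the attention layers, but the principle is the same: the positional table is an ordinary parameter updated by backpropagation.</p>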
   </sec>
   <sec id="s2_5">
    <title>2.5. HLF Module</title>
    <p>In the baseline RT-DETR model, the feature fusion module is designed to reduce computational complexity and resource consumption. As a result, it prioritizes high-level semantic information over low-level detail, leading to insufficient fusion of high-level and low-level features. Therefore, the HLF module is proposed to enable more thorough fusion of high-level and low-level feature information. The schematic diagram of the module structure is shown in <xref ref-type="fig" rid="fig4">
      Figure 4
     </xref>.</p>
    <fig id="fig4" position="float">
     <label>Figure 4</label>
     <caption>
      <title>Figure 4. HLF module structure diagram.</title>
     </caption>
     <graphic mimetype="image" position="float" xlink:type="simple" xlink:href="https://html.scirp.org/file/1733098-rId19.jpeg?20250430045102" />
    </fig>
    <p>The illustration shows the HLF (Hierarchical Level Fusion) module. Initially, the feature layers of various scales from the encoder input are taken as inputs and processed through channel attention mechanisms <xref ref-type="bibr" rid="scirp.142409-13">
      [13]
     </xref> and spatial attention mechanisms <xref ref-type="bibr" rid="scirp.142409-14">
      [14]
     </xref>, respectively. The spatial attention mechanism focuses on identifying key regions within the feature maps, while the channel attention mechanism concentrates on selecting the primary channels within the feature maps. By utilizing these two attention mechanisms, the feature maps are able to fuse local spatial details with global channel information, resulting in a more expressive feature representation. This process is as shown in Equation 3:</p>
    <p>
     <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"> <mrow> 
       <msubsup> 
        <mi>
          f 
        </mi> 
        <mi>
          i 
        </mi> 
        <mn>
          1 
        </mn> 
       </msubsup> 
       <mo>
         = 
       </mo> 
       <msubsup> 
        <mi>
          φ 
        </mi> 
        <mi>
          i 
        </mi> 
        <mi>
          c 
        </mi> 
       </msubsup> 
       <mrow> 
        <mo>
          ( 
        </mo> 
        <mrow> 
         <msubsup> 
          <mi>
            φ 
          </mi> 
          <mi>
            i 
          </mi> 
          <mi>
            s 
          </mi> 
         </msubsup> 
         <mrow> 
          <mo>
            ( 
          </mo> 
          <mrow> 
           <msubsup> 
            <mi>
              f 
            </mi> 
            <mi>
              i 
            </mi> 
            <mn>
              0 
            </mn> 
           </msubsup> 
          </mrow> 
          <mo>
            ) 
          </mo> 
         </mrow> 
        </mrow> 
        <mo>
          ) 
        </mo> 
       </mrow> 
      </mrow> 
     </math> (3)</p>
    <p>In Equation 3, 
     <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"> <mrow><msubsup><mi>f</mi><mi>i</mi><mn>0</mn></msubsup></mrow></math> denotes the feature map at the i-th scale received from the encoder, 
     <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"> <mrow><msubsup><mi>φ</mi><mi>i</mi><mi>s</mi></msubsup></mrow></math> denotes the spatial attention mechanism applied at the i-th scale, 
     <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"> <mrow><msubsup><mi>φ</mi><mi>i</mi><mi>c</mi></msubsup></mrow></math> denotes the channel attention mechanism applied at the i-th scale, and 
     <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"> <mrow><msubsup><mi>f</mi><mi>i</mi><mn>1</mn></msubsup></mrow></math> is the resulting refined feature map.</p>
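    <p>The composition in Equation 3 can be illustrated with a small sketch. The article does not specify the internal form of the two attention mechanisms, so the code below uses simple parameter-free stand-ins: a sigmoid gate over the channel-wise mean for spatial attention, and a sigmoid gate over the global average for channel attention. These gates and all names are assumptions made for illustration, not the HLF implementation.</p>
    <preformat preformat-type="code"><![CDATA[
```python
import math


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def spatial_attention(feat):
    """Weight each location by a gate computed from its channel-wise mean."""
    channels, h, w = len(feat), len(feat[0]), len(feat[0][0])
    gate = [[sigmoid(sum(feat[c][y][x] for c in range(channels)) / channels)
             for x in range(w)] for y in range(h)]
    return [[[feat[c][y][x] * gate[y][x] for x in range(w)]
             for y in range(h)] for c in range(channels)]


def channel_attention(feat):
    """Weight each channel by a gate computed from its global average."""
    out = []
    for plane in feat:
        mean = sum(sum(row) for row in plane) / (len(plane) * len(plane[0]))
        gate = sigmoid(mean)
        out.append([[v * gate for v in row] for row in plane])
    return out


def refine(feat):
    """Equation 3: spatial attention first, then channel attention."""
    return channel_attention(spatial_attention(feat))


# A toy feature map with 2 channels and a 2x2 spatial grid.
feat = [[[1.0, 2.0], [3.0, 4.0]],
        [[0.0, 1.0], [2.0, 0.5]]]
refined = refine(feat)
print(refined)
```
]]></preformat>
    <p>The output keeps the shape of the input, and each value is rescaled according to how strongly its location and its channel respond.</p>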
    <p>Following this, feature scale adjustment is performed. If the size of the existing feature map is smaller than the target size, bilinear interpolation is used to upscale it to the target resolution. If the existing feature map is larger than the target size, adaptive average pooling is applied to downscale it to the target resolution. If the feature map’s size already matches the target resolution, it is used directly. This process is shown in formula 4:</p>
    <p>
     <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"> <mrow> 
       <msubsup> 
        <mi>
          f 
        </mi> 
        <mrow> 
         <mi>
           i 
         </mi> 
         <mi>
           j 
         </mi> 
        </mrow> 
        <mn>
          3 
        </mn> 
       </msubsup> 
        <mo>
          = 
        </mo> 
       <mrow> 
        <mo>
          { 
        </mo> 
        <mtable columnalign="left"> 
         <mtr> 
          <mtd> 
           <mi>
             D 
           </mi> 
           <mrow> 
            <mo>
              ( 
            </mo> 
            <mrow> 
             <msubsup> 
              <mi>
                f 
              </mi> 
              <mi>
                i 
              </mi> 
              <mn>
                2 
              </mn> 
             </msubsup> 
             <mo>
               , 
             </mo> 
             <mrow> 
              <mo>
                ( 
              </mo> 
              <mrow> 
               <msub> 
                <mi>
                  H 
                </mi> 
                <mi>
                  i 
                </mi> 
               </msub> 
               <mo>
                 , 
               </mo> 
               <msub> 
                <mi>
                  W 
                </mi> 
                <mi>
                  i 
                </mi> 
               </msub> 
              </mrow> 
              <mo>
                ) 
              </mo> 
             </mrow> 
            </mrow> 
            <mo>
              ) 
            </mo> 
           </mrow> 
           <mtext>
               
           </mtext> 
           <mtext>
               
           </mtext> 
           <mtext>
               
           </mtext> 
           <mi>
             j 
           </mi> 
           <mo>
             &lt; 
           </mo> 
           <mi>
             i 
           </mi> 
          </mtd> 
         </mtr> 
         <mtr> 
          <mtd> 
           <mi>
             I 
           </mi> 
           <mrow> 
            <mo>
              ( 
            </mo> 
            <mrow> 
             <msubsup> 
              <mi>
                f 
              </mi> 
              <mi>
                i 
              </mi> 
              <mn>
                2 
              </mn> 
             </msubsup> 
            </mrow> 
            <mo>
              ) 
            </mo> 
           </mrow> 
           <mtext>
               
           </mtext> 
           <mtext>
               
           </mtext> 
           <mtext>
               
           </mtext> 
           <mtext>
               
           </mtext> 
           <mtext>
               
           </mtext> 
           <mtext>
               
           </mtext> 
           <mtext>
               
           </mtext> 
           <mtext>
               
           </mtext> 
           <mtext>
               
           </mtext> 
           <mtext>
               
           </mtext> 
           <mtext>
               
           </mtext> 
           <mtext>
               
           </mtext> 
           <mtext>
               
           </mtext> 
           <mtext>
               
           </mtext> 
           <mtext>
               
           </mtext> 
           <mtext>
               
           </mtext> 
           <mi>
             j 
           </mi> 
           <mo>
             = 
           </mo> 
           <mi>
             i 
           </mi> 
          </mtd> 
         </mtr> 
         <mtr> 
          <mtd> 
           <mi>
             U 
           </mi> 
           <mrow> 
            <mo>
              ( 
            </mo> 
            <mrow> 
             <msubsup> 
              <mi>
                f 
              </mi> 
              <mi>
                i 
              </mi> 
              <mn>
                2 
              </mn> 
             </msubsup> 
             <mo>
               , 
             </mo> 
             <mrow> 
              <mo>
                ( 
              </mo> 
              <mrow> 
               <msub> 
                <mi>
                  H 
                </mi> 
                <mi>
                  i 
                </mi> 
               </msub> 
               <mo>
                 , 
               </mo> 
               <msub> 
                <mi>
                  W 
                </mi> 
                <mi>
                  i 
                </mi> 
               </msub> 
              </mrow> 
              <mo>
                ) 
              </mo> 
             </mrow> 
            </mrow> 
            <mo>
              ) 
            </mo> 
           </mrow> 
           <mtext>
               
           </mtext> 
           <mtext>
               
           </mtext> 
           <mtext>
               
           </mtext> 
           <mi>
             j 
           </mi> 
           <mo>
             &gt; 
           </mo> 
           <mi>
             i 
           </mi> 
          </mtd> 
         </mtr> 
        </mtable> 
       </mrow> 
      </mrow> 
     </math> (4)</p>
    <p>Equation 4 describes how feature maps at different resolutions are brought to a common size before fusion. D denotes downsampling, applied when the resolution of a feature map is higher than the target resolution. I denotes the identity mapping, used when the resolution already equals the target resolution. U denotes upsampling by bilinear interpolation, applied when the resolution is lower than the target resolution, to restore it to the target resolution.</p>
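    <p>The three cases of Equation 4 can be illustrated with a minimal single-channel sketch. It is not the authors' implementation: average pooling is assumed for D (the text does not name the downsampling operator), corner-aligned sampling is assumed for the bilinear interpolation in U, and the function names are hypothetical.</p>

```python
def avg_pool_down(fmap, out_h, out_w):
    """D (assumed): average non-overlapping blocks; input size must be a multiple of the output size."""
    in_h, in_w = len(fmap), len(fmap[0])
    sh, sw = in_h // out_h, in_w // out_w  # block height and width
    return [
        [
            sum(fmap[r * sh + dr][c * sw + dc] for dr in range(sh) for dc in range(sw)) / (sh * sw)
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def bilinear_up(fmap, out_h, out_w):
    """U: bilinear interpolation with corner-aligned sampling (an assumption)."""
    in_h, in_w = len(fmap), len(fmap[0])
    out = []
    for r in range(out_h):
        # Map the output row back to a fractional input row.
        y = r * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for c in range(out_w):
            x = c * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            top = fmap[y0][x0] * (1 - wx) + fmap[y0][x1] * wx
            bottom = fmap[y1][x0] * (1 - wx) + fmap[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def resize_to_level(fmap, target_h, target_w):
    """Pick D, I or U by comparing the map size with the target size (H_i, W_i)."""
    in_h, in_w = len(fmap), len(fmap[0])
    if in_h == target_h and in_w == target_w:
        return fmap  # I: already at the target resolution
    if in_h > target_h:
        return avg_pool_down(fmap, target_h, target_w)  # D: higher resolution
    return bilinear_up(fmap, target_h, target_w)  # U: lower resolution


high = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
low = [[0, 2], [4, 6]]
print(resize_to_level(high, 2, 2))
print(resize_to_level(low, 3, 3))
```

    <p>In a real detector the same selection would be applied per channel on tensors; the pure-Python lists here only make the three branches explicit.</p>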
    <p>For each feature map that has been resized, a 3 × 3 convolution kernel is applied for smoothing. This 3 × 3 convolution effectively removes interference noise while preserving the image details, resulting in clearer feature maps.</p>
    <p>This process is shown in Equation 5 as follows:</p>
    <p>
     <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"> <mrow> 
       <msubsup> 
        <mi>
          f 
        </mi> 
        <mrow> 
         <mi>
           i 
         </mi> 
         <mi>
           j 
         </mi> 
        </mrow> 
        <mn>
          5 
        </mn> 
       </msubsup> 
       <mo>
         = 
       </mo> 
       <msub> 
        <mi>
          θ 
        </mi> 
        <mrow> 
         <mi>
           i 
         </mi> 
         <mi>
           j 
         </mi> 
        </mrow> 
       </msub> 
       <mrow> 
        <mo>
          ( 
        </mo> 
        <mrow> 
         <msubsup> 
          <mi>
            f 
          </mi> 
          <mrow> 
           <mi>
             i 
           </mi> 
           <mi>
             j 
           </mi> 
          </mrow> 
          <mn>
            3 
          </mn> 
         </msubsup> 
        </mrow> 
        <mo>
          ) 
        </mo> 
       </mrow> 
      </mrow> 
     </math> (5)</p>
    <p>where θ<sub>ij</sub> represents the parameters of the smoothing convolution, and f<sub>ij</sub><sup>5</sup> is the j-th smoothed feature map at the i-th level.</p>
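    <p>As a rough illustration of the smoothing step in Equation 5, the sketch below applies a single-channel 3 × 3 convolution with zero padding so that the output keeps the input size. In the model the kernel weights are learned; the fixed kernels used here, the zero padding, and the function name are assumptions made only for the example.</p>

```python
def conv3x3_same(fmap, kernel):
    """Single-channel 3x3 convolution (cross-correlation form, as in deep learning)
    with zero padding, so the output has the same height and width as the input."""
    h, w = len(fmap), len(fmap[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    # Positions outside the map contribute zero (zero padding).
                    if rr in range(h) and cc in range(w):
                        acc += kernel[dr + 1][dc + 1] * fmap[rr][cc]
            out[r][c] = acc
    return out


# Stand-in kernels; a trained model would supply learned weights instead.
identity_kernel = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
ones_kernel = [[1.0] * 3 for _ in range(3)]

spike = [[0.0, 0.0, 0.0], [0.0, 9.0, 0.0], [0.0, 0.0, 0.0]]
print(conv3x3_same(spike, identity_kernel))
print(conv3x3_same(spike, ones_kernel))
```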
    <p>Finally, a Hadamard product <xref ref-type="bibr" rid="scirp.142409-15">
      [15]
     </xref> operation is applied to all the adjusted feature maps to generate the merged feature map. The Hadamard product, i.e. element-wise multiplication, combines feature maps of the same size while retaining rich detail information.</p>
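    <p>The merging step can be sketched as an element-wise product over all resized and smoothed maps. This is a minimal single-channel illustration with hypothetical function names; it assumes every map has already been brought to the same size.</p>

```python
from functools import reduce


def hadamard(a, b):
    """Element-wise product of two equally sized 2D maps."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("feature maps must have the same shape")
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def fuse(maps):
    """Merge a list of equally sized maps by chaining Hadamard products."""
    return reduce(hadamard, maps)


maps = [
    [[1, 2], [3, 4]],
    [[2, 2], [2, 2]],
    [[1, 0], [1, 0]],
]
print(fuse(maps))
```

    <p>Because the product is multiplicative, a near-zero response in any one map suppresses that position in the merged map, which is one reason the maps must be aligned in size before fusion.</p>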
   </sec>
  </sec><sec id="s3">
   <title>3. Experimental Environment and Dataset</title>
   <sec id="s3_1">
    <title>3.1. Experimental Environment</title>
    <p>In this experiment, the training parameters are set as follows: 100 epochs, an input image size of 640, a batch size of 16, and a learning rate (lr) of 0.1. The optimizer is AdamW with a weight decay of 0.0001, and global gradient clipping is set to 0.1. The network is not initialized with pre-trained weights. Data augmentation includes random color distortion, expansion, cropping, flipping, and resizing. The number of linear warm-up steps is set to 10. The hardware and software configuration of the experimental environment is shown in <xref ref-type="table" rid="table1">
      Table 1
     </xref> below.</p>
    <table-wrap id="table1">
     <label>
      <xref ref-type="table" rid="table1">
       Table 1
      </xref></label>
     <caption>
      <title>
       Table 1. Environment configuration table.</title>
     </caption>
     <table class="MsoTableGrid custom-table" border="0" cellspacing="0" cellpadding="0"> 
      <tr> 
       <td class="custom-bottom-td acenter"><p style="text-align:center">Environment</p></td> 
       <td class="custom-bottom-td acenter"><p style="text-align:center">Version</p></td> 
      </tr> 
      <tr> 
       <td class="custom-top-td acenter"><p style="text-align:center">OS</p></td> 
       <td class="custom-top-td acenter"><p style="text-align:center">Windows 11</p></td> 
      </tr> 
      <tr> 
       <td class="acenter"><p style="text-align:center">Python</p></td> 
       <td class="acenter"><p style="text-align:center">3.8.8</p></td> 
      </tr> 
      <tr> 
       <td class="acenter"><p style="text-align:center">PyTorch</p></td> 
       <td class="acenter"><p style="text-align:center">2.20</p></td> 
      </tr> 
      <tr> 
       <td class="acenter"><p style="text-align:center">GPU</p></td> 
       <td class="acenter"><p style="text-align:center">NVIDIA GeForce RTX 3060</p></td> 
      </tr> 
      <tr> 
       <td class="acenter"><p style="text-align:center">CPU</p></td> 
       <td class="acenter"><p style="text-align:center">12th Gen Intel Core i5-12400F</p></td> 
      </tr> 
     </table>
    </table-wrap>
    <p>As shown in <xref ref-type="table" rid="table1">
      Table 1
     </xref>, the operating system is Windows 11 and the Python version is 3.8.8. The PyTorch version is 2.20. The GPU is an NVIDIA GeForce RTX 3060, and the CPU is a 12th Gen Intel Core i5-12400F.</p>
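    <p>For reproducibility, the training parameters listed at the start of this subsection can be collected in one configuration object. The sketch below is only a convenient way to record them; the class and field names are hypothetical and do not correspond to any particular training framework.</p>

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TrainConfig:
    """Training settings as stated in the text (field names are illustrative)."""
    epochs: int = 100
    image_size: int = 640
    batch_size: int = 16
    learning_rate: float = 0.1
    optimizer: str = "AdamW"
    weight_decay: float = 1e-4
    grad_clip: float = 0.1  # "global gradient clipping range" in the text
    pretrained: bool = False  # no pre-trained weights are used
    warmup_steps: int = 10  # linear warm-up


config = TrainConfig()
print(asdict(config))
```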
   </sec>
   <sec id="s3_2">
    <title>3.2. Experimental Data</title>
    <p>The dataset used in this experiment was built by the authors and is available at the following address: <xref ref-type="bibr" rid="scirp.142409-https://download.csdn.net/download/m0_64879847/88535127">
      https://download.csdn.net/download/m0_64879847/88535127
     </xref>. It is intended to evaluate the effectiveness of the improved RT-DETR detection algorithm for small-target detection in unmanned aerial vehicle (UAV) aerial photography. The data was collected using a DJI M300 RTK drone operating at an altitude of 300 - 500 meters and equipped with an H20T payload camera. The video resolution is 1920 × 1080 at 30 frames per second. Data is transmitted to the backend server through CAN/4G/WiFi and other communication protocols. After frame extraction and data cleansing, an aerial forest fire image dataset is produced. The dataset covers forest fire incidents in the central region of the country over the past five years, from 2020 to 2025, across multiple forested areas. It is divided into two classes: flame and non-flame. To make the dataset more complete, it includes images of both small and large fire spots, various environmental backgrounds under different seasonal conditions, daylight and darkness scenarios, and a range of forest vegetation coverage states. For the object detection task, 3699 images were selected as the training set, 800 images as the test set, and 500 images as the validation set. Each image has a resolution of 512 × 512 pixels.</p>
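    <p>The split sizes above imply the following proportions. The short calculation below uses only the image counts given in the text; the variable names are illustrative.</p>

```python
# Image counts per split, as reported in the text.
splits = {"train": 3699, "test": 800, "val": 500}

total = sum(splits.values())
ratios = {name: count / total for name, count in splits.items()}

print(f"total: {total} images")
for name, count in splits.items():
    print(f"{name}: {count} images ({ratios[name]:.1%})")
```

    <p>The counts sum to 4999 images, which corresponds to a split of roughly 74% / 16% / 10% for training, test, and validation.</p>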
   </sec>
  </sec><sec id="s4">
   <title>4. Experimental Results and Analysis</title>
   <sec id="s4_1">
    <title>4.1. Evaluation Metrics</title>
    <p>The evaluation metrics used in this experiment are precision (P), recall (R), mean Average Precision at an IoU threshold of 0.5 (mAP50), number of parameters (Params), and GFLOPS.</p>
    <p>Precision is the proportion of samples predicted as positive that are actually positive. Recall is the proportion of actually positive samples that are correctly predicted as positive. mAP50 is the mean of the Average Precision (AP) values computed with an IoU threshold of 0.5: a prediction is counted as a true positive only when the IoU between the predicted bounding box and the ground-truth bounding box is not less than 0.5. The IoU threshold directly affects the calculation of AP, and mAP50 is a widely adopted evaluation standard because it balances bounding-box localization accuracy against the strictness of the detection criterion. The formulas for P, R, and mAP are as follows:</p>
    <p>
     <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"> <mrow> 
       <mtext>
         Precision 
       </mtext> 
       <mo>
         = 
       </mo> 
       <mfrac> 
        <mrow> 
         <mtext>
           TP 
         </mtext> 
        </mrow> 
        <mrow> 
         <mtext>
           TP 
         </mtext> 
         <mo>
           + 
         </mo> 
         <mtext>
           FP 
         </mtext> 
        </mrow> 
       </mfrac> 
      </mrow> 
     </math> (6)</p>
    <p>
     <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"> <mrow> 
       <mtext>
         Recall 
       </mtext> 
       <mo>
         = 
       </mo> 
       <mfrac> 
        <mrow> 
         <mtext>
           TP 
         </mtext> 
        </mrow> 
        <mrow> 
         <mtext>
           TP 
         </mtext> 
         <mo>
           + 
         </mo> 
         <mtext>
           FN 
         </mtext> 
        </mrow> 
       </mfrac> 
      </mrow> 
     </math> (7)</p>
    <p>
     <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"> <mrow> 
       <mtext>
         mAP 
       </mtext> 
       <mo>
         = 
       </mo> 
       <mfrac> 
        <mrow> 
         <mstyle displaystyle="true"> 
          <munderover> 
           <mo>
             ∑ 
           </mo> 
           <mrow> 
            <mi>
              i 
            </mi> 
            <mo>
              = 
            </mo> 
            <mn>
              1 
            </mn> 
           </mrow> 
           <mi>
             k 
           </mi> 
          </munderover> 
          <mrow> 
           <msub> 
            <mrow> 
             <mtext>
               AP 
             </mtext> 
            </mrow> 
            <mi>
              i 
            </mi> 
           </msub> 
          </mrow> 
         </mstyle> 
        </mrow> 
        <mi>
          k 
        </mi> 
       </mfrac> 
      </mrow> 
     </math> (8)</p>
    <p>TP stands for the number of positive samples that are predicted as positive, FP stands for the number of negative samples that are predicted as positive, and FN stands for the number of positive samples that are predicted as negative.</p>
    <p>mAP represents the sum of the average precision values for all classes divided by the number of classes.</p>
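    <p>Equations 6 to 8 and the IoU criterion can be written out as a short sketch. The box format (x1, y1, x2, y2), the function names, and the choice to return 0 when a denominator is zero are assumptions made for the example; the per-class AP values are taken as given inputs rather than computed from a precision-recall curve.</p>

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_box, threshold=0.5):
    """mAP50 criterion: a match counts when IoU is not less than 0.5."""
    return iou(pred_box, gt_box) >= threshold


def precision(tp, fp):
    """Equation 6: TP / (TP + FP); returns 0.0 when nothing was predicted."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def recall(tp, fn):
    """Equation 7: TP / (TP + FN); returns 0.0 when there are no positives."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


def mean_average_precision(ap_per_class):
    """Equation 8: sum of per-class AP divided by the number of classes k."""
    return sum(ap_per_class) / len(ap_per_class)


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
print(precision(tp=8, fp=2), recall(tp=8, fn=8))
print(mean_average_precision([0.5, 0.3]))
```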
   </sec>
   <sec id="s4_2">
    <title>4.2. Comparative Study</title>
    <p>The results in <xref ref-type="table" rid="table2">
      Table 2
     </xref> show that, compared with the classic YOLO-series algorithms, the improved RT-DETR achieves the highest precision (P, 0.516) and mAP50 (0.491). Its recall (R, 0.458) is higher than that of YOLOv8 and YOLOv10 but slightly below that of YOLOv5 (0.465), and it uses more parameters and GFLOPS than the YOLO models. This indicates that RT-DETR, as a detection algorithm, can perform well in the field of forest fire prevention.</p>
    <table-wrap id="table2">
     <label>
      <xref ref-type="table" rid="table2">
       Table 2
      </xref></label>
     <caption>
      <title>
       Table 2. Comparative experimental results.</title>
     </caption>
     <table class="MsoTableGrid custom-table" border="0" cellspacing="0" cellpadding="0"> 
      <tr> 
       <td class="custom-bottom-td acenter" width="16.66%"><p style="text-align:center">Model</p></td> 
       <td class="custom-bottom-td acenter" width="16.67%"><p style="text-align:center">P</p></td> 
       <td class="custom-bottom-td acenter" width="16.67%"><p style="text-align:center">R</p></td> 
       <td class="custom-bottom-td acenter" width="16.66%"><p style="text-align:center">mAP50</p></td> 
       <td class="custom-bottom-td acenter" width="16.67%"><p style="text-align:center">Params</p></td> 
       <td class="custom-bottom-td acenter" width="16.67%"><p style="text-align:center">GFLOPS</p></td> 
      </tr> 
      <tr> 
       <td class="custom-top-td acenter" width="16.66%"><p style="text-align:center">YOLOv5</p></td> 
       <td class="custom-top-td acenter" width="16.67%"><p style="text-align:center">0.511</p></td> 
       <td class="custom-top-td acenter" width="16.67%"><p style="text-align:center">0.465</p></td> 
       <td class="custom-top-td acenter" width="16.66%"><p style="text-align:center">0.458</p></td> 
       <td class="custom-top-td acenter" width="16.67%"><p style="text-align:center">2508854</p></td> 
       <td class="custom-top-td acenter" width="16.67%"><p style="text-align:center">7.1</p></td> 
      </tr> 
      <tr> 
       <td class="acenter" width="16.66%"><p style="text-align:center">YOLOv8</p></td> 
       <td class="acenter" width="16.67%"><p style="text-align:center">0.502</p></td> 
       <td class="acenter" width="16.67%"><p style="text-align:center">0.422</p></td> 
       <td class="acenter" width="16.66%"><p style="text-align:center">0.413</p></td> 
       <td class="acenter" width="16.67%"><p style="text-align:center">6092408</p></td> 
       <td class="acenter" width="16.67%"><p style="text-align:center">11.7</p></td> 
      </tr> 
      <tr> 
       <td class="acenter" width="16.66%"><p style="text-align:center">YOLOv10</p></td> 
       <td class="acenter" width="16.67%"><p style="text-align:center">0.501</p></td> 
       <td class="acenter" width="16.67%"><p style="text-align:center">0.45</p></td> 
       <td class="acenter" width="16.66%"><p style="text-align:center">0.46</p></td> 
       <td class="acenter" width="16.67%"><p style="text-align:center">8836518</p></td> 
       <td class="acenter" width="16.67%"><p style="text-align:center">24.4</p></td> 
      </tr> 
      <tr> 
       <td class="acenter" width="16.66%"><p style="text-align:center">ours</p></td> 
       <td class="acenter" width="16.67%"><p style="text-align:center">0.516</p></td> 
       <td class="acenter" width="16.67%"><p style="text-align:center">0.458</p></td> 
       <td class="acenter" width="16.66%"><p style="text-align:center">0.491</p></td> 
       <td class="acenter" width="16.67%"><p style="text-align:center">11441624</p></td> 
       <td class="acenter" width="16.67%"><p style="text-align:center">39.6</p></td> 
      </tr> 
     </table>
    </table-wrap>
   </sec>
   <sec id="s4_3">
    <title>4.3. Ablation Study</title>
    <p>As shown in <xref ref-type="table" rid="table3">
      Table 3
     </xref>, the ablation study validates the performance improvement obtained by adding each module in turn.</p>
    <table-wrap id="table3">
     <label>
      <xref ref-type="table" rid="table3">
       Table 3
      </xref></label>
     <caption>
      <title>
       Table 3. Results of ablation experiment.</title>
     </caption>
     <table class="MsoTableGrid custom-table" border="0" cellspacing="0" cellpadding="0"> 
      <tr> 
       <td class="custom-bottom-td acenter" width="16.66%"><p style="text-align:center">Model</p></td> 
       <td class="custom-bottom-td acenter" width="16.67%"><p style="text-align:center">P</p></td> 
       <td class="custom-bottom-td acenter" width="16.67%"><p style="text-align:center">R</p></td> 
       <td class="custom-bottom-td acenter" width="16.66%"><p style="text-align:center">mAP50</p></td> 
       <td class="custom-bottom-td acenter" width="16.67%"><p style="text-align:center">Params</p></td> 
       <td class="custom-bottom-td acenter" width="16.67%"><p style="text-align:center">GFLOPS</p></td> 
      </tr> 
      <tr> 
       <td class="custom-top-td acenter" width="16.66%"><p style="text-align:center">RT-DETR</p></td> 
       <td class="custom-top-td acenter" width="16.67%"><p style="text-align:center">0.502</p></td> 
       <td class="custom-top-td acenter" width="16.67%"><p style="text-align:center">0.421</p></td> 
       <td class="custom-top-td acenter" width="16.66%"><p style="text-align:center">0.441</p></td> 
       <td class="custom-top-td acenter" width="16.67%"><p style="text-align:center">32811576</p></td> 
       <td class="custom-top-td acenter" width="16.67%"><p style="text-align:center">108.0</p></td> 
      </tr> 
      <tr> 
       <td class="acenter" width="16.66%"><p style="text-align:center">+MobileNetV4</p></td> 
       <td class="acenter" width="16.67%"><p style="text-align:center">0.512</p></td> 
       <td class="acenter" width="16.67%"><p style="text-align:center">0.437</p></td> 
       <td class="acenter" width="16.66%"><p style="text-align:center">0.446</p></td> 
       <td class="acenter" width="16.67%"><p style="text-align:center">11311576</p></td> 
       <td class="acenter" width="16.67%"><p style="text-align:center">39.5</p></td> 
      </tr> 
      <tr> 
       <td class="acenter" width="16.66%"><p style="text-align:center">+LPE</p></td> 
       <td class="acenter" width="16.67%"><p style="text-align:center">0.511</p></td> 
       <td class="acenter" width="16.67%"><p style="text-align:center">0.453</p></td> 
       <td class="acenter" width="16.66%"><p style="text-align:center">0.464</p></td> 
       <td class="acenter" width="16.67%"><p style="text-align:center">11413976</p></td> 
       <td class="acenter" width="16.67%"><p style="text-align:center">39.5</p></td> 
      </tr> 
      <tr> 
       <td class="acenter" width="16.66%"><p style="text-align:center">+HLM</p></td> 
       <td class="acenter" width="16.67%"><p style="text-align:center">0.516</p></td> 
       <td class="acenter" width="16.67%"><p style="text-align:center">0.448</p></td> 
       <td class="acenter" width="16.66%"><p style="text-align:center">0.472</p></td> 
       <td class="acenter" width="16.67%"><p style="text-align:center">11441624</p></td> 
       <td class="acenter" width="16.67%"><p style="text-align:center">39.6</p></td> 
      </tr> 
     </table>
    </table-wrap>
    <p>First, the lightweight MobileNetV4 module is employed. Through careful design of the network architecture, such as the choice of layer width (number of channels), stride, and layer stacking method, the complexity of the model can be reduced while accuracy is maintained. MobileNetV4 uses an improved inverted residual structure, which first increases the number of channels with pointwise convolutions, then applies depthwise convolutions, and finally uses pointwise convolutions to reduce the number of channels. This design enhances the network’s representational capacity while keeping it lightweight. As a result, the number of parameters falls from 32,811,576 to 11,311,576 and the GFLOPS from 108.0 to 39.5, a reduction of approximately two-thirds. Second, Learnable Positional Encoding (LPE) makes the positional encoding learnable, so that its parameters can adapt to the dataset and describe the positional relationships between targets more accurately. This benefits feature extraction: R rises from 0.437 to 0.453 and mAP50 from 0.446 to 0.464, i.e. by just under 2 percentage points each. Finally, the Hierarchical Local-Global Module (HLM) is added. This module further strengthens the fusion of high-level semantic features and low-level detail features, leading to a more comprehensive expression of target features. With HLM, P rises by 0.5 percentage points (0.511 to 0.516) and mAP50 by 0.8 percentage points (0.464 to 0.472), while R decreases slightly (0.453 to 0.448). Overall, the improved model outperforms the original RT-DETR on all three accuracy metrics in forest fire detection.</p>
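    <p>The relative savings and metric gains discussed above can be recomputed directly from the values in Table 3. The sketch below does only that arithmetic; the dictionary layout and helper names are illustrative.</p>

```python
# Values copied from Table 3 (rows are cumulative: each adds one module).
TABLE_3 = {
    "RT-DETR": {"P": 0.502, "R": 0.421, "mAP50": 0.441, "params": 32811576, "gflops": 108.0},
    "+MobileNetV4": {"P": 0.512, "R": 0.437, "mAP50": 0.446, "params": 11311576, "gflops": 39.5},
    "+LPE": {"P": 0.511, "R": 0.453, "mAP50": 0.464, "params": 11413976, "gflops": 39.5},
    "+HLM": {"P": 0.516, "R": 0.448, "mAP50": 0.472, "params": 11441624, "gflops": 39.6},
}


def relative_reduction(before, after):
    """Fraction by which a cost (parameters, GFLOPS) shrinks."""
    return (before - after) / before


def point_gain(before, after):
    """Gain of a 0-1 metric expressed in percentage points."""
    return (after - before) * 100


base, light, final = TABLE_3["RT-DETR"], TABLE_3["+MobileNetV4"], TABLE_3["+HLM"]
param_cut = relative_reduction(base["params"], light["params"])
gflops_cut = relative_reduction(base["gflops"], light["gflops"])
map_gain = point_gain(base["mAP50"], final["mAP50"])

print(f"parameter reduction with MobileNetV4: {param_cut:.1%}")
print(f"GFLOPS reduction with MobileNetV4: {gflops_cut:.1%}")
print(f"mAP50 gain of the full model: {map_gain:.1f} points")
```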
    <fig-group id="fig5" position="float">
     <fig id="fig5" position="float">
      <label>Figure 5</label>
      <caption>
       <title>Figure 5. Comparison of visual results: RT-DETR and ours.</title>
      </caption>
      <graphic mimetype="image" position="float" xlink:type="simple" xlink:href="https://html.scirp.org/file/1733098-rId39.jpeg?20250430045105" />
     </fig>
     <fig id="fig5" position="float">
      <label>Figure 5</label>
      <caption>
       <title>Figure 5. Comparison of visual results: RT-DETR and ours.</title>
      </caption>
      <graphic mimetype="image" position="float" xlink:type="simple" xlink:href="https://html.scirp.org/file/1733098-rId40.jpeg?20250430045105" />
     </fig>
    </fig-group>
    <p>It can be clearly observed from <xref ref-type="fig" rid="fig5">
      Figure 5
     </xref> that the optimized RT-DETR model extracts flame features and fuses flame features at different scales more effectively, improving both the recall and the precision of detection. The gain is most apparent in the detection of small flame targets.</p>
   </sec>
  </sec><sec id="s5">
   <title>5. Conclusion</title>
   <p>To address the difficulty of deploying the original RT-DETR model on embedded devices, the MobileNetV4 lightweight structure is introduced, which reduces the model parameters by roughly 65% (from 32,811,576 to 11,441,624 in the final model) and significantly decreases device resource consumption. Building on this, an LPE encoding scheme and an HLM feature fusion structure are designed, which raise the mAP50 of the improved RT-DETR by 3.1 percentage points. The improvement takes both model size and detection performance into account and demonstrates the lightweight potential of RT-DETR: by lightweighting the model and optimizing the encoding method and feature fusion strategy, detection performance is maintained while the model becomes easier to deploy on embedded devices. This provides a useful reference for applying RT-DETR to forest fire detection. However, its detection speed still lags significantly behind YOLO. Therefore, further optimization of the detection speed should be prioritized in the next phase of work to enhance the model’s real-time capabilities.</p>
  </sec>
 </body><back>
  <ref-list>
   <title>References</title>
   <ref id="scirp.142409-ref1">
    <label>1</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Zhao, Y., Lv, W., Xu, S., Wei, J., Wang, G., Dang, Q., et al. (2024) DETRs Beat YOLOs on Real-Time Object Detection. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, 16-22 June 2024, 16965-16974. 
     <u>https://doi.org/10.1109/cvpr52733.2024.01605</u>
    </mixed-citation>
   </ref>
   <ref id="scirp.142409-ref2">
    <label>2</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Jin, L., Yu, Y., Zhou, J., Bai, D., Lin, H. and Zhou, H. (2024) SWVR: A Lightweight Deep Learning Algorithm for Forest Fire Detection and Recognition. Forests, 15, Article 204. 
     <u>https://doi.org/10.3390/f15010204</u>
    </mixed-citation>
   </ref>
   <ref id="scirp.142409-ref3">
    <label>3</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Li, J., Tang, H., Li, X., Dou, H. and Li, R. (2023) LEF-YOLO: A Lightweight Method for Intelligent Detection of Four Extreme Wildfires Based on the YOLO Framework. International Journal of Wildland Fire, vol. 33. 
     <u>https://doi.org/10.1071/wf23044</u>
    </mixed-citation>
   </ref>
   <ref id="scirp.142409-ref4">
    <label>4</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Huang, J., He, Z., Guan, Y. and Zhang, H. (2023) Real-Time Forest Fire Detection by Ensemble Lightweight YOLOX-L and Defogging Method. Sensors, 23, Article 1894. 
     <u>https://doi.org/10.3390/s23041894</u>
    </mixed-citation>
   </ref>
   <ref id="scirp.142409-ref5">
    <label>5</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Chen, G., Cheng, R., Lin, X., Jiao, W., Bai, D. and Lin, H. (2023) LMDFS: A Lightweight Model for Detecting Forest Fire Smoke in UAV Images Based on YOLOv7. Remote Sensing, 15, Article 3790. 
     <u>https://doi.org/10.3390/rs15153790</u>
    </mixed-citation>
   </ref>
   <ref id="scirp.142409-ref6">
    <label>6</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Wang, S., Zhao, J., Ta, N., Zhao, X., Xiao, M. and Wei, H. (2021) A Real-Time Deep Learning Forest Fire Monitoring Algorithm Based on an Improved Pruned + KD Model. Journal of Real-Time Image Processing, 18, 2319-2329. 
     <u>https://doi.org/10.1007/s11554-021-01124-9</u>
    </mixed-citation>
   </ref>
   <ref id="scirp.142409-ref7">
    <label>7</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Zhou, M., Wu, L., Liu, S. and Li, J. (2023) UAV Forest Fire Detection Based on Lightweight YOLOv5 Model. Multimedia Tools and Applications, 83, 61777-61788. 
     <u>https://doi.org/10.1007/s11042-023-15770-7</u>
    </mixed-citation>
   </ref>
   <ref id="scirp.142409-ref8">
    <label>8</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Qin, D., Leichner, C., Delakis, M., Fornoni, M., Luo, S., Yang, F., et al. (2024) Mobilenetv4: Universal Models for the Mobile Ecosystem. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T. and Varol, G., Eds., Computer Vision—ECCV 2024, Springer, 78-96. 
     <u>https://doi.org/10.1007/978-3-031-73661-2_5</u>
    </mixed-citation>
   </ref>
   <ref id="scirp.142409-ref9">
    <label>9</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Sandler, M., Howard, A., Zhu, M., Zhmoginov, A. and Chen, L. (2018) MobileNetV2: Inverted Residuals and Linear Bottlenecks. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, 18-23 June 2018, 4510-4520. 
     <u>https://doi.org/10.1109/cvpr.2018.00474</u>
    </mixed-citation>
   </ref>
   <ref id="scirp.142409-ref10">
    <label>10</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T. and Xie, S. (2022) A ConvNet for the 2020s. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, 18-24 June 2022, 11966-11976. 
     <u>https://doi.org/10.1109/cvpr52688.2022.01167</u>
    </mixed-citation>
   </ref>
   <ref id="scirp.142409-ref11">
    <label>11</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Wang, B., Shang, L., Lioma, C., Jiang, X., Yang, H., Liu, Q. and Simonsen, J.G. (2020) On Position Embeddings in BERT. International Conference on Learning Representations. <u>https://doi.org/10.48550/arXiv.2404.10518</u>
    </mixed-citation>
   </ref>
   <ref id="scirp.142409-ref12">
    <label>12</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Takase, S. and Okazaki, N. (2019) Positional Encoding to Control Output Sequence Length. Proceedings of the 2019 Conference of the North, Minneapolis, 3-5 June 2019, 3999-4004. 
     <u>https://doi.org/10.18653/v1/n19-1401</u>
    </mixed-citation>
   </ref>
   <ref id="scirp.142409-ref13">
    <label>13</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Bastidas, A.A. and Tang, H. (2019) Channel Attention Networks. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, 16-17 June 2019, 881-888. 
     <u>https://doi.org/10.1109/cvprw.2019.00117</u>
    </mixed-citation>
   </ref>
   <ref id="scirp.142409-ref14">
    <label>14</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Zhu, X., Cheng, D., Zhang, Z., Lin, S. and Dai, J. (2019) An Empirical Study of Spatial Attention Mechanisms in Deep Networks. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, 27 October-2 November 2019, 6687-6696. 
     <u>https://doi.org/10.1109/iccv.2019.00679</u>
    </mixed-citation>
   </ref>
   <ref id="scirp.142409-ref15">
    <label>15</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Pratt, W.K., Kane, J. and Andrews, H.C. (1969) Hadamard Transform Image Coding. Proceedings of the IEEE, 57, 58-68. 
     <u>https://doi.org/10.1109/proc.1969.6869</u>
    </mixed-citation>
   </ref>
  </ref-list>
 </back>
</article>