<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.4 20241031//EN" "JATS-journalpublishing1-4.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article" dtd-version="1.4" xml:lang="en">
  <front>
    <journal-meta>
      <journal-id journal-id-type="publisher-id">jss</journal-id>
      <journal-title-group>
        <journal-title>Open Journal of Social Sciences</journal-title>
      </journal-title-group>
      <issn pub-type="epub">2327-5960</issn>
      <issn pub-type="ppub">2327-5952</issn>
      <publisher>
        <publisher-name>Scientific Research Publishing</publisher-name>
      </publisher>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.4236/jss.2026.146012</article-id>
      <article-id pub-id-type="publisher-id">jss-151910</article-id>
      <article-categories>
        <subj-group>
          <subject>Article</subject>
        </subj-group>
        <subj-group>
          <subject>Business</subject>
          <subject>Economics</subject>
          <subject>Social Sciences</subject>
          <subject>Humanities</subject>
        </subj-group>
      </article-categories>
      <title-group>
        <article-title>CASA-YOLO: A Unified Framework for Small and Camouflaged Object Detection in Agricultural Pest Imagery</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author" corresp="yes">
          <name name-style="western">
            <surname>Sayni</surname>
            <given-names>Koffi Bernadin-Pacome</given-names>
          </name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <name name-style="western">
            <surname>Monsan</surname>
            <given-names>Apo Chimène</given-names>
          </name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <name name-style="western">
            <surname>Diarra</surname>
            <given-names>Mamadou</given-names>
          </name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <name name-style="western">
            <surname>Kamagaté</surname>
            <given-names>Beman Hamidja</given-names>
          </name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <name name-style="western">
            <surname>Oumtanaga</surname>
            <given-names>Souleymane</given-names>
          </name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
      </contrib-group>
      <aff id="aff1"><label>1</label> Institut National Polytechnique Félix Houphouët-Boigny, Abidjan, Côte d’Ivoire </aff>
      <aff id="aff2"><label>2</label> Université Virtuelle de Côte d’Ivoire, Abidjan, Côte d’Ivoire </aff>
      <aff id="aff3"><label>3</label> Université Félix Houphouët-Boigny, Abidjan, Côte d’Ivoire </aff>
      <aff id="aff4"><label>4</label> Ecole Supérieure Africaine Des Tic, Abidjan, Côte d’Ivoire </aff>
      <author-notes>
        <fn fn-type="conflict" id="fn-conflict">
          <p>The authors declare no conflicts of interest regarding the publication of this paper.</p>
        </fn>
      </author-notes>
      <pub-date pub-type="epub">
        <day>01</day>
        <month>06</month>
        <year>2026</year>
      </pub-date>
      <pub-date pub-type="collection">
        <month>06</month>
        <year>2026</year>
      </pub-date>
      <volume>14</volume>
      <issue>06</issue>
      <fpage>209</fpage>
      <lpage>235</lpage>
      <history>
        <date date-type="received">
          <day>23</day>
          <month>01</month>
          <year>2026</year>
        </date>
        <date date-type="accepted">
          <day>14</day>
          <month>06</month>
          <year>2026</year>
        </date>
        <date date-type="published">
          <day>17</day>
          <month>06</month>
          <year>2026</year>
        </date>
      </history>
      <permissions>
        <copyright-statement>© 2026 by the authors and Scientific Research Publishing Inc.</copyright-statement>
        <copyright-year>2026</copyright-year>
        <license license-type="open-access">
          <license-p>This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (<ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</ext-link>).</license-p>
        </license>
      </permissions>
      <self-uri content-type="doi" xlink:href="https://doi.org/10.4236/jss.2026.146012">https://doi.org/10.4236/jss.2026.146012</self-uri>
      <abstract>
        <p>Small object detection (SOD) and camouflaged object detection (COD) are critical challenges in agricultural computer vision, where pests exhibit both spatial compactness and visual similarity to their surroundings. Existing approaches address these problems in isolation, failing to exploit their shared characteristic: extracting weak visual signals from low signal-to-noise environments. This paper introduces CASA-YOLO (Context-Aware Sparse Attention YOLO), a unified framework addressing SOD and COD through three innovations: 1) Dual-Axis Sparse Attention (DASA), which decomposes global attention into axis-wise operations with adaptive sparse sampling, reducing complexity from O(<inline-formula><mml:math><mml:mrow><mml:msup><mml:mi> N </mml:mi><mml:mn> 2 </mml:mn></mml:msup></mml:mrow></mml:math></inline-formula>) to O(<inline-formula><mml:math display="inline"><mml:mrow><mml:mrow><mml:mrow><mml:mi> N </mml:mi><mml:msqrt><mml:mi> N </mml:mi></mml:msqrt></mml:mrow><mml:mo> / </mml:mo><mml:mi> s </mml:mi></mml:mrow></mml:mrow></mml:math></inline-formula>); 2) Adaptive Context Gating (ACG), a three-pathway module dynamically balancing local texture, global semantics, and boundary cues; and 3) HFPN-Nano, a hierarchical feature pyramid enabling stride-4 detection of objects as small as 8 × 8 pixels. On the AgroPest-12 benchmark, CASA-YOLO achieves 89.6% mAP@50 and 58.3% mAP@50-95, surpassing YOLOv11s (+5.9% mAP@50) and RT-DETR-R18 (+3.3%) at FP32 precision, while maintaining real-time inference (118 FPS with TensorRT INT8 quantization). Field validation on cashew plantations across three regions in Côte d’Ivoire (895 images, 8 sites) confirms practical applicability. Camouflage-stratified analysis further shows that ACG provides significant gains on high-camouflage instances, validating the unified SOD-COD design philosophy for agricultural pest detection.</p>
      </abstract>
      <kwd-group kwd-group-type="author-generated" xml:lang="en">
        <kwd>Object Detection</kwd>
        <kwd>Small Object Detection</kwd>
        <kwd>Camouflaged Object Detection</kwd>
        <kwd>Attention Mechanism</kwd>
        <kwd>Agricultural Computer Vision</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Pest Detection</kwd>
        <kwd>Cashew Tree</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec1">
      <title>1. Introduction</title>
      <p>The detection of small and visually inconspicuous objects constitutes a fundamental challenge in computer vision with far-reaching implications for precision agriculture, autonomous systems, and medical imaging. In agricultural contexts, early detection of crop pests and diseases is paramount: the Food and Agriculture Organization estimates that plant pests and diseases cause annual economic losses exceeding $220 billion globally, with 20% to 40% of crop production lost to these threats ([<xref ref-type="bibr" rid="B7">7</xref>]). This challenge is compounded by the inherent visual characteristics of agricultural threats: many pests measure merely 2 to 5 mm in length, while fungal infections often manifest as subtle discolorations that blend seamlessly with healthy foliage.</p>
      <p>Two distinct research communities have emerged to address related aspects of this challenge. Small Object Detection (SOD) focuses on identifying targets occupying minimal image area, typically defined as objects smaller than 32 × 32 pixels according to COCO terminology ([<xref ref-type="bibr" rid="B16">16</xref>]). The primary difficulties include limited discriminative features, sensitivity to localization errors, and severe class imbalance during training. Conversely, Camouflaged Object Detection (COD) addresses objects that deliberately or naturally conceal themselves within their environment through texture similarity, boundary diffusion, or pattern mimicry ([<xref ref-type="bibr" rid="B6">6</xref>]). COD challenges arise from semantic ambiguity between foreground and background, rather than from spatial limitations.</p>
      <p>Despite their distinct origins, we observe that SOD and COD share a fundamental characteristic: both require extracting weak visual signals from environments where the signal-to-noise ratio is inherently low. In SOD, the signal is spatially compressed; in COD, it is semantically obscured. This observation motivates our central hypothesis: attention mechanisms designed for positional precision (addressing SOD) can synergize with gating mechanisms for foreground-background separation (addressing COD), thereby enabling a unified detection framework that benefits from both SOD and COD design principles. While our evaluation focuses on agricultural pest detection (which inherently combines SOD and COD challenges), dedicated evaluation on standard COD segmentation benchmarks remains a direction for future work.</p>
      <p>Current state-of-the-art detectors exhibit significant limitations when confronted with agricultural imagery. Transformer-based architectures such as RT-DETR ([<xref ref-type="bibr" rid="B35">35</xref>]) achieve impressive accuracy but demand substantial computational resources, making them incompatible with edge deployment on agricultural drones. YOLO variants ([<xref ref-type="bibr" rid="B28">28</xref>]; [<xref ref-type="bibr" rid="B27">27</xref>]) offer real-time performance but rely on attention mechanisms that either lack sufficient granularity for small objects or impose prohibitive O(<inline-formula><mml:math><mml:mrow><mml:msup><mml:mi> N </mml:mi><mml:mn> 2 </mml:mn></mml:msup></mml:mrow></mml:math></inline-formula> ) complexity on high-resolution feature maps. Dedicated COD methods ([<xref ref-type="bibr" rid="B5">5</xref>]; [<xref ref-type="bibr" rid="B19">19</xref>]) achieve remarkable performance on benchmark datasets but are designed for segmentation rather than detection and lack the efficiency required for real-time applications.</p>
      <p>This paper makes the following contributions. First, we propose CASA-YOLO, a novel object detection architecture that unifies small object detection and camouflaged object detection through principled attention design; to the best of our knowledge, it is the first real-time detection framework explicitly designed to address both SOD and COD challenges within a single architecture. Second, we introduce Dual-Axis Sparse Attention (DASA), which reduces attention complexity from O(<inline-formula><mml:math><mml:mrow><mml:msup><mml:mi> N </mml:mi><mml:mn> 2 </mml:mn></mml:msup></mml:mrow></mml:math></inline-formula> ) to O(<inline-formula><mml:math display="inline"><mml:mrow><mml:mi> N </mml:mi><mml:msqrt><mml:mi> N </mml:mi></mml:msqrt></mml:mrow></mml:math></inline-formula> ) through sequential axis-wise decomposition; adaptive sparse sampling with a learned stride <italic>s</italic> further reduces the effective complexity to O(<inline-formula><mml:math display="inline"><mml:mrow><mml:mrow><mml:mrow><mml:mi> N </mml:mi><mml:msqrt><mml:mi> N </mml:mi></mml:msqrt></mml:mrow><mml:mo> / </mml:mo><mml:mi> s </mml:mi></mml:mrow></mml:mrow></mml:math></inline-formula> ), enabling efficient processing of the high-resolution feature maps that small object detection depends on. Third, we introduce Adaptive Context Gating (ACG), a three-pathway module incorporating local texture analysis, global semantic encoding, and boundary enhancement with learned competitive gating. Fourth, we present HFPN-Nano, an efficient hierarchical feature pyramid with stride-4 detection capabilities that adds only 26% computational overhead.</p>
      <p>We validate our approach through comprehensive experiments on the AgroPest-12 dataset ([<xref ref-type="bibr" rid="B18">18</xref>]) and multi-site field experiments on cashew plantations across three regions in Côte d’Ivoire (Lapinkro, Touba, and Kotobi), demonstrating state-of-the-art performance with practical applicability under diverse real-world conditions.</p>
      <p>The remainder of this paper is organized as follows: Section 2 reviews related work in SOD, COD, and attention mechanisms. Section 3 presents the proposed CASA-YOLO architecture in detail. Section 4 describes our experimental methodology and datasets. Section 5 presents comprehensive results, ablation studies, and field validation experiments. Section 6 concludes with future research directions.</p>
    </sec>
    <sec id="sec2">
      <title>2. Related Work</title>
      <sec id="sec2dot1">
        <title>2.1. Small Object Detection</title>
        <p>Small object detection has evolved along three principal paradigms: multi-scale feature learning, data augmentation strategies, and specialized network architectures. The Feature Pyramid Network (FPN) ([<xref ref-type="bibr" rid="B15">15</xref>]) established the foundation for multi-scale detection by constructing a top-down pathway that propagates semantic information to higher-resolution features. Subsequent works, including [<xref ref-type="bibr" rid="B17">17</xref>] and BiFPN ([<xref ref-type="bibr" rid="B26">26</xref>]), enhanced feature fusion through bidirectional connections and weighted aggregation, respectively.</p>
        <p>Data-centric approaches address the statistical challenges of small object detection. Copy-Paste augmentation ([<xref ref-type="bibr" rid="B9">9</xref>]) increases small object instance density by compositing objects onto varied backgrounds. SNIP ([<xref ref-type="bibr" rid="B24">24</xref>]) and SNIPER ([<xref ref-type="bibr" rid="B25">25</xref>]) introduced scale-specific training that selectively backpropagates gradients based on object size, preventing gradient domination by larger objects. Data-centric strategies can yield significant gains: [<xref ref-type="bibr" rid="B14">14</xref>] demonstrated that oversampling combined with copy-paste augmentation improves small object detection AP by up to 9.7% without architectural modifications, while [<xref ref-type="bibr" rid="B1">1</xref>] showed that mosaic augmentation, which composites four training images into one, further enriches small object context during training.</p>
        <p>Architectural innovations specifically targeting small objects include QueryDet ([<xref ref-type="bibr" rid="B33">33</xref>]), which employs cascade sparse queries to progressively refine small object proposals, achieving significant improvements on VisDrone while maintaining efficiency. RFLA ([<xref ref-type="bibr" rid="B32">32</xref>]) introduces receptive field adaptation that dynamically adjusts convolutional kernels based on target scale. [<xref ref-type="bibr" rid="B30">30</xref>] propose a dedicated small object detection head with deformable attention operating on P2-level features. Despite these advances, existing SOD methods do not account for camouflage scenarios in which small objects additionally exhibit visual similarity to their backgrounds.</p>
      </sec>
      <sec id="sec2dot2">
        <title>2.2. Camouflaged Object Detection</title>
        <p>Camouflaged object detection has experienced rapid progress following the introduction of large-scale benchmarks. [<xref ref-type="bibr" rid="B6">6</xref>] released COD10K with 10,000 images spanning 78 categories and proposed SINet, establishing the search-and-identify paradigm where coarse localization precedes fine segmentation. PFNet ([<xref ref-type="bibr" rid="B20">20</xref>]) extended this approach with positioning and focus modules that progressively refine camouflaged boundaries. ZoomNet ([<xref ref-type="bibr" rid="B21">21</xref>]) introduced scale integration through mixed-scale triplet attention, achieving state-of-the-art performance through explicit multi-scale reasoning.</p>
        <p>Recent approaches leverage increasingly sophisticated attention mechanisms. BSA-Net ([<xref ref-type="bibr" rid="B37">37</xref>]) employs boundary-guided spatial attention that explicitly models edge discontinuities. FEDER ([<xref ref-type="bibr" rid="B10">10</xref>]) proposes frequency-enhanced decomposition that separates objects from backgrounds in the spectral domain. The emergence of foundation models has prompted various adaptation strategies: SAM-Adapter ([<xref ref-type="bibr" rid="B3">3</xref>]) fine-tunes the Segment Anything Model for COD, while CamSAM2 ([<xref ref-type="bibr" rid="B36">36</xref>]) extends this to video sequences. However, these methods focus exclusively on segmentation, producing pixel-wise masks rather than bounding boxes, and exhibit inference times incompatible with real-time detection requirements.</p>
        <p>A critical gap remains in the literature: no existing work addresses camouflaged object detection within a real-time detection framework. Agricultural applications require bounding box outputs for downstream tasks (spraying localization, counting) and demand inference speeds exceeding 15 FPS for practical drone deployment. CASA-YOLO directly addresses this gap. <bold>Table 1</bold> summarizes a comparative analysis of representative SOD and COD methods discussed in this section.</p>
        <p><bold>Table 1.</bold> Comparative analysis of related methods.</p>
        <table-wrap id="tbl1">
          <label>Table 1</label>
          <table>
            <tbody>
              <tr>
                <td>
                  <bold>Method</bold>
                </td>
                <td>
                  <bold>SOD</bold>
                </td>
                <td>
                  <bold>COD</bold>
                </td>
                <td>
                  <bold>Real-time</bold>
                </td>
                <td>
                  <bold>Attention</bold>
                </td>
                <td>
                  <bold>Output</bold>
                </td>
              </tr>
              <tr>
                <td>
                  YOLOv11 ([
                  <xref ref-type="bibr" rid="B27">27</xref>
                  ])
                </td>
                <td>✓</td>
                <td>✗</td>
                <td>✓</td>
                <td>C2PSA</td>
                <td>BBox</td>
              </tr>
              <tr>
                <td>
                  RT-DETR ([
                  <xref ref-type="bibr" rid="B35">35</xref>
                  ])
                </td>
                <td>○</td>
                <td>✗</td>
                <td>○</td>
                <td>Deformable</td>
                <td>BBox</td>
              </tr>
              <tr>
                <td>
                  SINet-V2 ([
                  <xref ref-type="bibr" rid="B5">5</xref>
                  ])
                </td>
                <td>✗</td>
                <td>✓</td>
                <td>✗</td>
                <td>Neighbor</td>
                <td>Mask</td>
              </tr>
              <tr>
                <td>
                  ZoomNet ([
                  <xref ref-type="bibr" rid="B21">21</xref>
                  ])
                </td>
                <td>✗</td>
                <td>✓</td>
                <td>✗</td>
                <td>Mixed-scale</td>
                <td>Mask</td>
              </tr>
              <tr>
                <td>
                  QueryDet ([
                  <xref ref-type="bibr" rid="B33">33</xref>
                  ])
                </td>
                <td>✓</td>
                <td>✗</td>
                <td>○</td>
                <td>Cascade</td>
                <td>BBox</td>
              </tr>
              <tr>
                <td>
                  AgriYOLO ([
                  <xref ref-type="bibr" rid="B4">4</xref>
                  ])
                </td>
                <td>○</td>
                <td>✗</td>
                <td>✓</td>
                <td>SE</td>
                <td>BBox</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Note. ✓ = fully addresses, ○ = partially addresses, ✗ = does not address.</p>
      </sec>
      <sec id="sec2dot3">
        <title>2.3. Attention Mechanisms in Object Detection</title>
        <p>Attention mechanisms have become integral to modern object detection architectures. DETR ([<xref ref-type="bibr" rid="B2">2</xref>]) pioneered end-to-end detection through a transformer encoder-decoder architecture, eliminating hand-designed components such as NMS and anchor generation. However, DETR’s O(<inline-formula><mml:math><mml:mrow><mml:msup><mml:mi> N </mml:mi><mml:mn> 2 </mml:mn></mml:msup></mml:mrow></mml:math></inline-formula> ) attention complexity limited its application to downsampled features, which reduced small object performance. Deformable DETR ([<xref ref-type="bibr" rid="B38">38</xref>]) addressed this through sparse attention over learned sampling points, reducing complexity while improving small object accuracy.</p>
        <p>Channel and spatial attention mechanisms offer complementary benefits. Squeeze-and-Excitation (SE) networks ([<xref ref-type="bibr" rid="B12">12</xref>]) introduced channel recalibration through global pooling and gating. CBAM ([<xref ref-type="bibr" rid="B31">31</xref>]) combined channel and spatial attention sequentially. Coordinate Attention ([<xref ref-type="bibr" rid="B11">11</xref>]) encoded positional information into channel attention through directional pooling, providing positional awareness without quadratic complexity. SCSA ([<xref ref-type="bibr" rid="B23">23</xref>]) recently proposed synergistic channel-spatial attention with shared semantics.</p>
        <p>Axis-wise attention decomposition reduces computational requirements while preserving global receptive fields. Axial Attention ([<xref ref-type="bibr" rid="B29">29</xref>]) factorizes 2D attention into sequential 1D operations along height and width axes. CCNet ([<xref ref-type="bibr" rid="B13">13</xref>]) applies criss-cross attention for semantic segmentation. However, existing axis-wise approaches lack mechanisms for capturing diagonal patterns and do not incorporate adaptive sparsity. Our proposed DASA addresses both limitations through cross-axis bridging and content-adaptive sampling.</p>
      </sec>
    </sec>
    <sec id="sec3">
      <title>3. Proposed Methodology</title>
      <p>This section presents the CASA-YOLO architecture in detail. We first provide an overview of the complete system, then describe each novel component: Dual-Axis Sparse Attention (DASA), Adaptive Context Gating (ACG), and HFPN-Nano. We conclude with the loss function formulation and training strategy.</p>
      <sec id="sec3dot1">
        <title>3.1. Architecture Overview</title>
        <p>CASA-YOLO follows the single-stage detection paradigm with a backbone-neck-head architecture, as illustrated in <xref ref-type="fig" rid="fig1">Figure 1</xref>. The backbone employs MobileNetV4 ([<xref ref-type="bibr" rid="B22">22</xref>]) with Universal Inverted Bottleneck (UIB) blocks, selected for its favorable accuracy-efficiency trade-off and hardware-agnostic design. The neck integrates our proposed HFPN-Nano for multi-scale feature fusion with a high-resolution pathway. The detection head incorporates DASA and ACG modules that operate on the fused features before final prediction.</p>
        <fig id="fig1">
          <label>Figure 1</label>
          <graphic xlink:href="https://html.scirp.org/file/6501286-rId29.jpeg?20260617115456" />
        </fig>
        <p><bold>Figure 1.</bold> Overall architecture of CASA-YOLO showing the backbone (MobileNetV4), neck (HFPN-Nano), and detection head with DASA and ACG modules.</p>
        <p>Let <inline-formula><mml:math display="inline"><mml:mrow><mml:mi> I </mml:mi><mml:mo> ∈ </mml:mo><mml:msup><mml:mi> ℝ </mml:mi><mml:mrow><mml:mi> H </mml:mi><mml:mo> × </mml:mo><mml:mi> W </mml:mi><mml:mo> × </mml:mo><mml:mn> 3 </mml:mn></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> denote an input image. The backbone extracts hierarchical features {<italic>C</italic><sub>2</sub>, <italic>C</italic><sub>3</sub>, <italic>C</italic><sub>4</sub>, <italic>C</italic><sub>5</sub>} at strides {4, 8, 16, 32} respectively. HFPN-Nano fuses these into pyramid features {<italic>P</italic><sub>2</sub>, <italic>P</italic><sub>3</sub>, <italic>P</italic><sub>4</sub>, <italic>P</italic><sub>5</sub>}. DASA enhances spatial relationships within each pyramid level, while ACG modulates features based on contextual analysis. The detection head produces predictions at each level, subsequently merged through NMS.</p>
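        <p>As a rough illustration of the stride arithmetic above, the following sketch computes the grid size of each pyramid level and the extent, in grid cells, of an 8-pixel object along one axis. The 640 × 640 input size is an assumption inferred from the 160 × 160 <italic>P</italic><sub>2</sub> map quoted in Section 3.2, and the helper names <monospace>pyramid_shapes</monospace> and <monospace>extent_in_cells</monospace> are hypothetical rather than taken from the authors' code.</p>
        <code language="python"><![CDATA[
```python
# Illustrative sketch only: helper names and the 640 x 640 input are assumptions.
STRIDES = {"P2": 4, "P3": 8, "P4": 16, "P5": 32}  # strides stated in the text


def pyramid_shapes(height, width, strides=STRIDES):
    """Return the (rows, cols) grid size of each pyramid level."""
    shapes = {}
    for level, stride in strides.items():
        # Ceiling division, so inputs that are not multiples of the stride
        # still map to a full grid.
        shapes[level] = (-(-height // stride), -(-width // stride))
    return shapes


def extent_in_cells(object_px, stride):
    """Extent of an object along one axis, measured in grid cells."""
    return object_px / stride


if __name__ == "__main__":
    for level, shape in pyramid_shapes(640, 640).items():
        cells = extent_in_cells(8, STRIDES[level])
        print(level, shape, "8 px object spans", cells, "cell(s) per axis")
```
]]></code>
        <p>Under these assumptions, an 8 × 8 pixel object spans two cells per axis at stride 4 but only a quarter of a cell at stride 32, which is consistent with the motivation for a stride-4 detection level.</p>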
      </sec>
      <sec id="sec3dot2">
        <title>3.2. Dual-Axis Sparse Attention (DASA)</title>
        <p>Standard multi-head self-attention (MHSA) computes pairwise interactions across all <italic>N</italic> = <italic>H</italic> × <italic>W</italic> spatial positions, resulting in O(<inline-formula><mml:math><mml:mrow><mml:msup><mml:mi> N </mml:mi><mml:mn> 2 </mml:mn></mml:msup></mml:mrow></mml:math></inline-formula> ) complexity. For high-resolution feature maps essential in small object detection (e.g., <italic>P</italic><sub>2</sub> at 160 × 160 with <italic>N</italic> = 25,600), this requires approximately 655 million pairwise attention computations per head, rendering direct application impractical. As illustrated in <xref ref-type="fig" rid="fig2">Figure 2</xref>, DASA addresses this through three complementary mechanisms: axis decomposition, adaptive sparse sampling, and cross-axis bridging.</p>
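        <p>To make the scale of this cost concrete, the short calculation below reproduces the pairwise count for a 160 × 160 map and estimates the memory needed if the full score matrix were stored. Storing every score in FP32 (4 bytes each) is an illustrative assumption, not a detail given in this paper, and the function names are hypothetical.</p>
        <code language="python"><![CDATA[
```python
def mhsa_score_count(height, width):
    """Pairwise attention scores per head for full self-attention over a height x width map."""
    n = height * width  # N = H * W spatial positions
    return n * n        # every position attends to every position


def score_memory_gib(num_scores, bytes_per_score=4):
    """Memory (GiB) to hold all scores at once; 4 bytes per score assumes FP32 storage."""
    return num_scores * bytes_per_score / 2**30


if __name__ == "__main__":
    scores = mhsa_score_count(160, 160)
    print(scores)                              # 655360000
    print(round(score_memory_gib(scores), 2))  # 2.44
```
]]></code>
        <p>An implementation that never stores the full matrix would avoid this memory footprint, but the number of pairwise computations would be unchanged.</p>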
        <p>Axis Decomposition: Following the factorization principle of axial attention ([<xref ref-type="bibr" rid="B29">29</xref>]), DASA decomposes global 2D attention into sequential 1D operations:</p>
        <fig id="fig2">
          <label>Figure 2</label>
          <graphic xlink:href="https://html.scirp.org/file/6501286-rId34.jpeg?20260617115456" />
        </fig>
        <p><bold>Figure 2.</bold> Dual-Axis Sparse Attention (DASA) module.</p>
        <disp-formula id="FD1">
          <label>(1)</label>
          <mml:math>
            <mml:mrow>
              <mml:mtext>MHSA</mml:mtext>
              <mml:mrow>
                <mml:mo>(</mml:mo>
                <mml:mi>X</mml:mi>
                <mml:mo>)</mml:mo>
              </mml:mrow>
              <mml:mo>≈</mml:mo>
              <mml:msub>
                <mml:mrow>
                  <mml:mtext>Attn</mml:mtext>
                </mml:mrow>
                <mml:mi>V</mml:mi>
              </mml:msub>
              <mml:mrow>
                <mml:mo>(</mml:mo>
                <mml:mrow>
                  <mml:msub>
                    <mml:mrow>
                      <mml:mtext>Attn</mml:mtext>
                    </mml:mrow>
                    <mml:mi>H</mml:mi>
                  </mml:msub>
                  <mml:mrow>
                    <mml:mo>(</mml:mo>
                    <mml:mi>X</mml:mi>
                    <mml:mo>)</mml:mo>
                  </mml:mrow>
                </mml:mrow>
                <mml:mo>)</mml:mo>
              </mml:mrow>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <p>where <inline-formula><mml:math><mml:mrow><mml:msub><mml:mrow><mml:mtext> Attn </mml:mtext></mml:mrow><mml:mi> H </mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math><mml:mrow><mml:msub><mml:mrow><mml:mtext> Attn </mml:mtext></mml:mrow><mml:mi> V </mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> denote horizontal and vertical attention respectively. In the horizontal pass, each of the <italic>H</italic> rows performs self-attention over <italic>W</italic> positions, yielding a cost of <italic>H</italic>·<italic>W</italic><sup>2</sup>. The vertical pass similarly costs <italic>W</italic>·<italic>H</italic><sup>2</sup>. The total complexity is therefore:</p>
        <disp-formula id="FD2">
          <label>(2)</label>
          <mml:math>
            <mml:mrow>
              <mml:msub>
                <mml:mi>C</mml:mi>
                <mml:mrow>
                  <mml:mtext>DASA</mml:mtext>
                </mml:mrow>
              </mml:msub>
              <mml:mo>=</mml:mo>
              <mml:mi>H</mml:mi>
              <mml:mo>⋅</mml:mo>
              <mml:msup>
                <mml:mi>W</mml:mi>
                <mml:mn>2</mml:mn>
              </mml:msup>
              <mml:mo>+</mml:mo>
              <mml:mi>W</mml:mi>
              <mml:mo>⋅</mml:mo>
              <mml:msup>
                <mml:mi>H</mml:mi>
                <mml:mn>2</mml:mn>
              </mml:msup>
              <mml:mo>=</mml:mo>
              <mml:mi>H</mml:mi>
              <mml:mi>W</mml:mi>
              <mml:mrow>
                <mml:mo>(</mml:mo>
                <mml:mrow>
                  <mml:mi>H</mml:mi>
                  <mml:mo>+</mml:mo>
                  <mml:mi>W</mml:mi>
                </mml:mrow>
                <mml:mo>)</mml:mo>
              </mml:mrow>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <p>For square feature maps (<inline-formula><mml:math display="inline"><mml:mrow><mml:mi> H </mml:mi><mml:mo> = </mml:mo><mml:mi> W </mml:mi><mml:mo> = </mml:mo><mml:msqrt><mml:mi> N </mml:mi></mml:msqrt></mml:mrow></mml:math></inline-formula> ), this simplifies to O(<inline-formula><mml:math display="inline"><mml:mrow><mml:mi> N </mml:mi><mml:msqrt><mml:mi> N </mml:mi></mml:msqrt></mml:mrow></mml:math></inline-formula> ), representing a <inline-formula><mml:math display="inline"><mml:mrow><mml:msqrt><mml:mi> N </mml:mi></mml:msqrt></mml:mrow></mml:math></inline-formula>-fold asymptotic reduction compared to O(<inline-formula><mml:math><mml:mrow><mml:msup><mml:mi> N </mml:mi><mml:mn> 2 </mml:mn></mml:msup></mml:mrow></mml:math></inline-formula> ). At <italic>P</italic><sub>2</sub> resolution (160 × 160, <italic>N</italic> = 25,600), this corresponds to approximately 8.2 million operations versus 655 million for standard MHSA, an 80× reduction, since the two axis passes together cost <inline-formula><mml:math display="inline"><mml:mrow><mml:mn> 2 </mml:mn><mml:mi> N </mml:mi><mml:msqrt><mml:mi> N </mml:mi></mml:msqrt></mml:mrow></mml:math></inline-formula>. However, naive axis decomposition fails to capture diagonal interaction patterns, which are critical for detecting elongated pests and disease spread trajectories.</p>
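        <p>The axis decomposition and its cost can be sketched in a few lines of standard-library Python. This is a minimal illustration of plain axial attention only: it uses a single head, takes queries, keys, and values to be the input itself (no learned projections), and omits DASA's sparse sampling and cross-axis bridging. All function names are hypothetical.</p>
        <code language="python"><![CDATA[
```python
import math


def attend_1d(seq):
    """Self-attention over one row or column: seq is a list of L vectors of C floats.

    Simplification: queries, keys and values are the inputs themselves.
    Returns the attended sequence and the number of scores computed (L * L).
    """
    scale = math.sqrt(len(seq[0]))
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in seq]
        peak = max(scores)  # subtract the maximum for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[c] for w, v in zip(weights, seq)) / total
                    for c in range(len(q))])
    return out, len(seq) ** 2


def axial_attention(x):
    """Horizontal pass, then vertical pass, over x given as H x W x C nested lists."""
    count = 0
    rows = []
    for row in x:  # H rows, each attending over W positions
        attended, n = attend_1d(row)
        rows.append(attended)
        count += n
    cols = []
    for col in zip(*rows):  # W columns, each attending over H positions
        attended, n = attend_1d(list(col))
        cols.append(attended)
        count += n
    # Transpose back from W x H to H x W.
    return [list(r) for r in zip(*cols)], count


def axial_score_count(h, w):
    """Closed-form count from Equation (2): H * W * (H + W)."""
    return h * w * (h + w)


if __name__ == "__main__":
    feats = [[[0.1 * (i + j + c) for c in range(4)] for j in range(10)] for i in range(6)]
    out, counted = axial_attention(feats)
    print(len(out), len(out[0]), len(out[0][0]), counted, axial_score_count(6, 10))
    print(axial_score_count(160, 160), (160 * 160) ** 2 / axial_score_count(160, 160))
```
]]></code>
        <p>The counted scores match the closed form in Equation (2), and for a 160 × 160 map the same formula gives the 8.2 million figure and the 80× ratio quoted above.</p>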
        <p>Adaptive Sparse Sampling: Agricultural imagery exhibits significant spatial redundancy, as homogeneous crop canopy regions contain minimal discriminative information. DASA exploits this redundancy through learned sparse sampling that further reduces the per-axis attention span. A global sampling stride <italic>s</italic> is computed adaptively based on the feature map statistics:</p>
        <disp-formula id="FD3">
          <label>(3)</label>
          <mml:math>
            <mml:mrow>
              <mml:mi>s</mml:mi>
              <mml:mo>=</mml:mo>
              <mml:mi>max</mml:mi>
              <mml:mrow>
                <mml:mo>(</mml:mo>
                <mml:mrow>
                  <mml:mn>1</mml:mn>
                  <mml:mo>,</mml:mo>
                  <mml:mi>σ</mml:mi>
                  <mml:mrow>
                    <mml:mo>(</mml:mo>
                    <mml:mrow>
                      <mml:mtext>MLP</mml:mtext>
                      <mml:mrow>
                        <mml:mo>(</mml:mo>
                        <mml:mrow>
                          <mml:mtext>GAP</mml:mtext>
                          <mml:mrow>
                            <mml:mo>(</mml:mo>
                            <mml:mi>X</mml:mi>
                            <mml:mo>)</mml:mo>
                          </mml:mrow>
                        </mml:mrow>
                        <mml:mo>)</mml:mo>
                      </mml:mrow>
                    </mml:mrow>
                    <mml:mo>)</mml:mo>
                  </mml:mrow>
                  <mml:mo>⋅</mml:mo>
                  <mml:msub>
                    <mml:mi>s</mml:mi>
                    <mml:mrow>
                      <mml:mi>max</mml:mi>
                    </mml:mrow>
                  </mml:msub>
                </mml:mrow>
                <mml:mo>)</mml:mo>
              </mml:mrow>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <p>where GAP denotes global average pooling, σ is the sigmoid function, and <italic>s</italic><sub>max</sub> is the maximum stride (set to 8 by default). With stride <italic>s</italic>, each position attends to <italic>H</italic>/<italic>s</italic> (vertical) or <italic>W</italic>/<italic>s</italic> (horizontal) sampled positions rather than the full axis length, reducing the effective complexity to:</p>
        <disp-formula id="FD4">
          <label>(4)</label>
          <mml:math>
            <mml:mrow>
              <mml:msub>
                <mml:mi>C</mml:mi>
                <mml:mrow>
                  <mml:mtext>DASA-sparse</mml:mtext>
                </mml:mrow>
              </mml:msub>
              <mml:mo>=</mml:mo>
              <mml:mi>H</mml:mi>
              <mml:mi>W</mml:mi>
              <mml:mrow>
                <mml:mrow>
                  <mml:mrow>
                    <mml:mo>(</mml:mo>
                    <mml:mrow>
                      <mml:mi>H</mml:mi>
                      <mml:mo>+</mml:mo>
                      <mml:mi>W</mml:mi>
                    </mml:mrow>
                    <mml:mo>)</mml:mo>
                  </mml:mrow>
                </mml:mrow>
                <mml:mo>/</mml:mo>
                <mml:mi>s</mml:mi>
              </mml:mrow>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <p>For square maps, this yields O(<inline-formula><mml:math display="inline"><mml:mrow><mml:mrow><mml:mrow><mml:mi> N </mml:mi><mml:msqrt><mml:mi> N </mml:mi></mml:msqrt></mml:mrow><mml:mo> / </mml:mo><mml:mi> s </mml:mi></mml:mrow></mml:mrow></mml:math></inline-formula>). At <italic>P</italic><sub>2</sub> resolution with <italic>s</italic> = 4, the computational cost reduces to approximately 2.0 million operations, a 320× reduction from standard MHSA.</p>
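        <p>The operation counts quoted for the <italic>P</italic><sub>2</sub> level can be checked with a few lines of Python. This is only a sanity check of the arithmetic: it counts pairwise query-key interactions and ignores the channel dimension and constant factors, as the text does. The variable names are illustrative and not taken from the authors' implementation.</p>

```python
# Pairwise-interaction counts for a P2 feature map (H = W = 160).
H = W = 160
N = H * W                        # 25,600 spatial positions

mhsa_ops = N * N                 # standard self-attention: N^2
axial_ops = H * W * (H + W)      # dense axis decomposition, both axes
s = 4                            # example stride used in the text
sparse_ops = axial_ops // s      # Eq. (4): HW(H + W) / s

print(mhsa_ops, axial_ops, sparse_ops)
print(mhsa_ops // axial_ops, mhsa_ops // sparse_ops)
```

        <p>The two ratios printed on the last line are the reduction factors relative to standard MHSA for the dense and the sparse axis decomposition, respectively.</p>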
        <p>The stride <italic>s</italic> adapts at the image level. Feature maps with high average activation variance (indicating the presence of discriminative targets) drive <italic>s</italic> toward 1, preserving dense, fine-grained attention; highly homogeneous maps (uniform backgrounds) drive <italic>s</italic> toward <italic>s</italic><sub>max</sub>, yielding sparse attention and less redundant computation. We emphasize that <italic>s</italic> is computed globally per feature map rather than varying spatially, which ensures compatibility with batched tensor operations and hardware-efficient inference. This image-level adaptivity balances computational cost and detection accuracy across diverse agricultural scenes.</p>
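        <p>A minimal NumPy sketch of the stride computation in Eq. (3) is given below. The two-layer MLP with a ReLU hidden layer, the weight shapes, the rounding of <italic>s</italic> to an integer, and the function name <monospace>adaptive_stride</monospace> are all assumptions made for illustration; the text specifies only the GAP, MLP, sigmoid, and <italic>s</italic><sub>max</sub> = 8.</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adaptive_stride(x, w1, b1, w2, b2, s_max=8):
    """Eq. (3): s = max(1, sigmoid(MLP(GAP(x))) * s_max) for x of shape (C, H, W)."""
    pooled = x.mean(axis=(1, 2))                  # GAP -> (C,)
    hidden = np.maximum(0.0, w1 @ pooled + b1)    # hidden layer with ReLU (assumed)
    logit = float(w2 @ hidden + b2)               # single scalar per feature map
    s = max(1.0, sigmoid(logit) * s_max)
    return int(round(s))                          # integer rounding is an assumption

# Toy usage with random (untrained) weights.
rng = np.random.default_rng(0)
C, H, W, hidden_dim = 16, 160, 160, 4
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((hidden_dim, C)) * 0.1
b1 = np.zeros(hidden_dim)
w2 = rng.standard_normal(hidden_dim) * 0.1
s = adaptive_stride(x, w1, b1, w2, 0.0)
keys_vertical = x[:, ::s, :]    # each column keeps roughly H / s sampled rows
print(s, keys_vertical.shape)
```

        <p>Because a single stride is returned per feature map, the subsampling reduces to a plain strided slice, which is what keeps the operation compatible with batched tensor kernels.</p>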
        <p>Cross-Axis Bridge: Axis decomposition inherently loses diagonal connectivity. We introduce a lightweight cross-axis bridge that recovers the missing diagonal interactions:</p>
        <disp-formula id="FD5">
          <label>(5)</label>
          <mml:math>
            <mml:mrow>
              <mml:mi>B</mml:mi>
              <mml:mo>=</mml:mo>
              <mml:msub>
                <mml:mrow>
                  <mml:mtext>DWConv</mml:mtext>
                </mml:mrow>
                <mml:mrow>
                  <mml:mn>3</mml:mn>
                  <mml:mo>×</mml:mo>
                  <mml:mn>3</mml:mn>
                </mml:mrow>
              </mml:msub>
              <mml:mrow>
                <mml:mo>(</mml:mo>
                <mml:mrow>
                  <mml:msub>
                    <mml:mi>A</mml:mi>
                    <mml:mi>v</mml:mi>
                  </mml:msub>
                </mml:mrow>
                <mml:mo>)</mml:mo>
              </mml:mrow>
              <mml:mo>⊙</mml:mo>
              <mml:mi>σ</mml:mi>
              <mml:mrow>
                <mml:mo>(</mml:mo>
                <mml:mrow>
                  <mml:msub>
                    <mml:mrow>
                      <mml:mtext>Conv</mml:mtext>
                    </mml:mrow>
                    <mml:mrow>
                      <mml:mn>1</mml:mn>
                      <mml:mo>×</mml:mo>
                      <mml:mn>1</mml:mn>
                    </mml:mrow>
                  </mml:msub>
                  <mml:mrow>
                    <mml:mo>(</mml:mo>
                    <mml:mrow>
                      <mml:msub>
                        <mml:mi>A</mml:mi>
                        <mml:mi>h</mml:mi>
                      </mml:msub>
                    </mml:mrow>
                    <mml:mo>)</mml:mo>
                  </mml:mrow>
                </mml:mrow>
                <mml:mo>)</mml:mo>
              </mml:mrow>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <disp-formula id="FD6">
          <label>(6)</label>
          <mml:math>
            <mml:mrow>
              <mml:mtext>DASA</mml:mtext>
              <mml:mrow>
                <mml:mo>(</mml:mo>
                <mml:mi>X</mml:mi>
                <mml:mo>)</mml:mo>
              </mml:mrow>
              <mml:mo>=</mml:mo>
              <mml:msub>
                <mml:mi>A</mml:mi>
                <mml:mi>v</mml:mi>
              </mml:msub>
              <mml:mo>+</mml:mo>
              <mml:mi>α</mml:mi>
              <mml:mo>⋅</mml:mo>
              <mml:mi>B</mml:mi>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <p>where <inline-formula><mml:math><mml:mi> α </mml:mi></mml:math></inline-formula> is a learnable scalar initialized to 0.1, DWConv denotes depthwise separable convolution, and <inline-formula><mml:math><mml:mo> ⊙ </mml:mo></mml:math></inline-formula> represents element-wise multiplication.</p>
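        <p>Equations (5) and (6) can be sketched in NumPy as follows. The sketch assumes that <italic>A</italic><sub>v</sub> and <italic>A</italic><sub>h</sub> are the vertical and horizontal axis-attention outputs with shape (C, H, W), uses a plain depthwise 3 × 3 filter with zero padding in place of the full depthwise separable convolution, and omits biases; the function and argument names are hypothetical.</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def depthwise_conv3x3(x, kernels):
    """Per-channel 3x3 filtering with zero padding; x is (C, H, W), kernels is (C, 3, 3)."""
    C, H, W = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += kernels[:, dy, dx, None, None] * padded[:, dy:dy + H, dx:dx + W]
    return out

def cross_axis_bridge(a_v, a_h, dw_kernels, pw_weight, alpha=0.1):
    """Eqs. (5)-(6): B = DWConv3x3(A_v) * sigmoid(Conv1x1(A_h)); output = A_v + alpha * B."""
    gate = sigmoid(np.einsum("oc,chw->ohw", pw_weight, a_h))   # 1x1 conv mixes channels
    bridge = depthwise_conv3x3(a_v, dw_kernels) * gate
    return a_v + alpha * bridge

# Toy usage with random tensors.
rng = np.random.default_rng(0)
C, H, W = 8, 16, 16
a_v = rng.standard_normal((C, H, W))
a_h = rng.standard_normal((C, H, W))
out = cross_axis_bridge(a_v, a_h, rng.standard_normal((C, 3, 3)), rng.standard_normal((C, C)))
print(out.shape)
```

        <p>The 3 × 3 neighbourhood is what lets the bridge see diagonal neighbours that neither axis-wise attention pass covers, while the small initial value of α keeps its contribution modest at the start of training.</p>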
      </sec>
      <sec id="sec3dot3">
        <title>3.3. Adaptive Context Gating (ACG)</title>
        <p>Camouflaged objects share visual characteristics with their surroundings, which causes standard attention mechanisms to assign similar weights to both foreground and background. As shown in <xref ref-type="fig" rid="fig3">Figure 3</xref>, ACG addresses this through three specialized pathways that capture complementary contextual information, combined through competitive gating.</p>
        <fig id="fig3">
          <label>Figure 3</label>
          <graphic xlink:href="https://html.scirp.org/file/6501286-rId67.jpeg?20260617115457" />
        </fig>
        <p><bold>Figure 3.</bold> Architecture of Adaptive Context Gating (ACG) module with local, global, and boundary pathways combined through competitive gating.</p>
        <p><bold>Local Context Pathway</bold>: Local texture patterns provide discriminative cues even when global appearance is camouflaged. We employ depthwise separable convolution with expanded receptive field:</p>
        <disp-formula id="FD7">
          <label>(7)</label>
          <mml:math>
            <mml:mrow>
              <mml:msub>
                <mml:mi>P</mml:mi>
                <mml:mi>L</mml:mi>
              </mml:msub>
              <mml:mo>=</mml:mo>
              <mml:mtext>ReLU</mml:mtext>
              <mml:mrow>
                <mml:mo>(</mml:mo>
                <mml:mrow>
                  <mml:mtext>BN</mml:mtext>
                  <mml:mrow>
                    <mml:mo>(</mml:mo>
                    <mml:mrow>
                      <mml:msub>
                        <mml:mrow>
                          <mml:mtext>DWConv</mml:mtext>
                        </mml:mrow>
                        <mml:mrow>
                          <mml:mn>5</mml:mn>
                          <mml:mo>×</mml:mo>
                          <mml:mn>5</mml:mn>
                        </mml:mrow>
                      </mml:msub>
                      <mml:mrow>
                        <mml:mo>(</mml:mo>
                        <mml:mi>X</mml:mi>
                        <mml:mo>)</mml:mo>
                      </mml:mrow>
                    </mml:mrow>
                    <mml:mo>)</mml:mo>
                  </mml:mrow>
                </mml:mrow>
                <mml:mo>)</mml:mo>
              </mml:mrow>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <p>The 5 × 5 kernel captures local texture while depthwise separation maintains efficiency.</p>
        <p><bold>Global Context Pathway</bold>: Global context enables disambiguation through scene-level reasoning. We employ SE-style channel recalibration:</p>
        <disp-formula id="FD8">
          <label>(8)</label>
          <mml:math>
            <mml:mrow>
              <mml:msub>
                <mml:mi>P</mml:mi>
                <mml:mi>G</mml:mi>
              </mml:msub>
              <mml:mo>=</mml:mo>
              <mml:mi>X</mml:mi>
              <mml:mo>⊗</mml:mo>
              <mml:mi>γ</mml:mi>
              <mml:mrow>
                <mml:mo>(</mml:mo>
                <mml:mrow>
                  <mml:msub>
                    <mml:mrow>
                      <mml:mtext>FC</mml:mtext>
                    </mml:mrow>
                    <mml:mn>2</mml:mn>
                  </mml:msub>
                  <mml:mrow>
                    <mml:mo>(</mml:mo>
                    <mml:mrow>
                      <mml:mtext>ReLU</mml:mtext>
                      <mml:mrow>
                        <mml:mo>(</mml:mo>
                        <mml:mrow>
                          <mml:msub>
                            <mml:mrow>
                              <mml:mtext>FC</mml:mtext>
                            </mml:mrow>
                            <mml:mn>1</mml:mn>
                          </mml:msub>
                          <mml:mrow>
                            <mml:mo>(</mml:mo>
                            <mml:mrow>
                              <mml:mtext>GAP</mml:mtext>
                              <mml:mrow>
                                <mml:mo>(</mml:mo>
                                <mml:mi>X</mml:mi>
                                <mml:mo>)</mml:mo>
                              </mml:mrow>
                            </mml:mrow>
                            <mml:mo>)</mml:mo>
                          </mml:mrow>
                        </mml:mrow>
                        <mml:mo>)</mml:mo>
                      </mml:mrow>
                    </mml:mrow>
                    <mml:mo>)</mml:mo>
                  </mml:mrow>
                </mml:mrow>
                <mml:mo>)</mml:mo>
              </mml:mrow>
            </mml:mrow>
          </mml:math>
        </disp-formula>
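        <p>A short sketch of the channel recalibration in Eq. (8) follows. It assumes that γ is a sigmoid gate, as in the original squeeze-and-excitation design, and that ⊗ denotes channel-wise rescaling; the reduction ratio, the absence of biases, and the names used are illustrative assumptions.</p>

```python
import numpy as np

def global_context(x, fc1, fc2):
    """Eq. (8) for x of shape (C, H, W); fc1 is (C // r, C) and fc2 is (C, C // r)."""
    pooled = x.mean(axis=(1, 2))                        # GAP -> (C,)
    hidden = np.maximum(0.0, fc1 @ pooled)              # FC1 + ReLU -> (C // r,)
    weights = 1.0 / (1.0 + np.exp(-(fc2 @ hidden)))     # gate, assumed sigmoid -> (C,)
    return x * weights[:, None, None]                   # rescale each channel

# Toy usage with reduction ratio r = 4.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 20, 20))
p_g = global_context(x, rng.standard_normal((4, 16)), rng.standard_normal((16, 4)))
print(p_g.shape)
```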
        <p><bold>Boundary Enhancement Pathway</bold>: Object boundaries provide critical cues for camouflage detection, as even well-camouflaged objects exhibit edge discontinuities. We compute the gradient magnitude from the intermediate feature maps (not the raw input image) using fixed Sobel operators. Specifically, given the input feature tensor <inline-formula><mml:math><mml:mrow><mml:mi> X </mml:mi><mml:mo> ∈ </mml:mo><mml:msup><mml:mi> ℝ </mml:mi><mml:mrow><mml:mi> H </mml:mi><mml:mo> × </mml:mo><mml:mi> W </mml:mi><mml:mo> × </mml:mo><mml:mi> C </mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula>, we first reduce it to a single-channel representation via a learned 1 × 1 convolution, then apply horizontal and vertical Sobel kernels <inline-formula><mml:math><mml:mrow><mml:msub><mml:mi> S </mml:mi><mml:mi> x </mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math><mml:mrow><mml:msub><mml:mi> S </mml:mi><mml:mi> y </mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> to obtain gradient maps. The boundary-enhanced features are computed as:</p>
        <disp-formula id="FD9">
          <label>(9)</label>
          <mml:math>
            <mml:mrow>
              <mml:mi>E</mml:mi>
              <mml:mo>=</mml:mo>
              <mml:msub>
                <mml:mrow>
                  <mml:mtext>Conv</mml:mtext>
                </mml:mrow>
                <mml:mrow>
                  <mml:mn>1</mml:mn>
                  <mml:mo>×</mml:mo>
                  <mml:mn>1</mml:mn>
                </mml:mrow>
              </mml:msub>
              <mml:mrow>
                <mml:mo>(</mml:mo>
                <mml:mi>X</mml:mi>
                <mml:mo>)</mml:mo>
              </mml:mrow>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <disp-formula id="FD10">
          <label>(10)</label>
          <mml:math>
            <mml:mrow>
              <mml:mi>G</mml:mi>
              <mml:mo>=</mml:mo>
              <mml:msqrt>
                <mml:mrow>
                  <mml:msup>
                    <mml:mrow>
                      <mml:mrow>
                        <mml:mo>(</mml:mo>
                        <mml:mrow>
                          <mml:msub>
                            <mml:mi>S</mml:mi>
                            <mml:mi>x</mml:mi>
                          </mml:msub>
                          <mml:mo>∗</mml:mo>
                          <mml:mi>E</mml:mi>
                        </mml:mrow>
                        <mml:mo>)</mml:mo>
                      </mml:mrow>
                    </mml:mrow>
                    <mml:mn>2</mml:mn>
                  </mml:msup>
                  <mml:mo>+</mml:mo>
                  <mml:msup>
                    <mml:mrow>
                      <mml:mrow>
                        <mml:mo>(</mml:mo>
                        <mml:mrow>
                          <mml:msub>
                            <mml:mi>S</mml:mi>
                            <mml:mi>y</mml:mi>
                          </mml:msub>
                          <mml:mo>∗</mml:mo>
                          <mml:mi>E</mml:mi>
                        </mml:mrow>
                        <mml:mo>)</mml:mo>
                      </mml:mrow>
                    </mml:mrow>
                    <mml:mn>2</mml:mn>
                  </mml:msup>
                </mml:mrow>
              </mml:msqrt>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <disp-formula id="FD11">
          <label>(11)</label>
          <mml:math>
            <mml:mrow>
              <mml:msub>
                <mml:mi>P</mml:mi>
                <mml:mi>B</mml:mi>
              </mml:msub>
              <mml:mo>=</mml:mo>
              <mml:mtext>ReLU</mml:mtext>
              <mml:mrow>
                <mml:mo>(</mml:mo>
                <mml:mrow>
                  <mml:mtext>BN</mml:mtext>
                  <mml:mrow>
                    <mml:mo>(</mml:mo>
                    <mml:mrow>
                      <mml:msub>
                        <mml:mrow>
                          <mml:mtext>Conv</mml:mtext>
                        </mml:mrow>
                        <mml:mrow>
                          <mml:mn>3</mml:mn>
                          <mml:mo>×</mml:mo>
                          <mml:mn>3</mml:mn>
                        </mml:mrow>
                      </mml:msub>
                      <mml:mrow>
                        <mml:mo>(</mml:mo>
                        <mml:mrow>
                          <mml:mi>X</mml:mi>
                          <mml:mo>⊙</mml:mo>
                          <mml:mi>σ</mml:mi>
                          <mml:mrow>
                            <mml:mo>(</mml:mo>
                            <mml:mi>G</mml:mi>
                            <mml:mo>)</mml:mo>
                          </mml:mrow>
                        </mml:mrow>
                        <mml:mo>)</mml:mo>
                      </mml:mrow>
                    </mml:mrow>
                    <mml:mo>)</mml:mo>
                  </mml:mrow>
                </mml:mrow>
                <mml:mo>)</mml:mo>
              </mml:mrow>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <p>where <inline-formula><mml:math><mml:mo> ∗ </mml:mo></mml:math></inline-formula> denotes convolution with fixed (non-learnable) Sobel kernels, <inline-formula><mml:math><mml:mrow><mml:mi> σ </mml:mi><mml:mrow><mml:mo> ( </mml:mo><mml:mo> ⋅ </mml:mo><mml:mo> ) </mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> is the sigmoid function, which maps the non-negative gradient magnitude to a bounded gate in [0.5, 1), and <inline-formula><mml:math><mml:mo> ⊙ </mml:mo></mml:math></inline-formula> is element-wise multiplication, with the single-channel gate broadcast across the <italic>C</italic> channels. Operating on intermediate feature maps rather than the raw input image allows the boundary pathway to capture semantically meaningful edges (e.g., pest-foliage boundaries) that emerge at deeper network stages, rather than low-level textural edges that may not correspond to object contours.</p>
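        <p>The gating part of the boundary pathway, Eqs. (9) and (10) together with the sigmoid gate inside Eq. (11), can be sketched as below. The sketch uses a channel-first (C, H, W) layout, zero padding at the borders, and a weight vector standing in for the learned 1 × 1 convolution; these choices and all names are assumptions. The trailing 3 × 3 convolution, batch normalization, and ReLU of Eq. (11) are omitted.</p>

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def filter3x3(img, kernel):
    """3x3 cross-correlation of a (H, W) map with zero padding."""
    H, W = img.shape
    padded = np.pad(img, 1)
    out = np.zeros((H, W), dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + H, dx:dx + W]
    return out

def boundary_gate(x, reduce_weight):
    """Return X * sigmoid(G) for x of shape (C, H, W) and reduce_weight of shape (C,)."""
    e = np.tensordot(reduce_weight, x, axes=1)           # Eq. (9): one channel, (H, W)
    g = np.sqrt(filter3x3(e, SOBEL_X) ** 2 + filter3x3(e, SOBEL_Y) ** 2)   # Eq. (10)
    return x * (1.0 / (1.0 + np.exp(-g)))                # gate broadcast over channels

# Toy usage: a vertical step edge between columns 3 and 4.
x = np.zeros((2, 8, 8))
x[:, :, 4:] = 1.0
gated = boundary_gate(x, np.array([0.5, 0.5]))
print(gated[0, 4, 4], gated[0, 4, 6])
```

        <p>In the toy example the first printed value lies on the edge and the second lies in a flat region. Because <italic>G</italic> is non-negative, flat regions receive a gate of exactly 0.5 rather than 0, so the pathway attenuates non-boundary features instead of removing them.</p>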
        <p><bold>Competitive Gating</bold>: The three pathways are combined through softmax-normalized gating, ensuring competition:</p>
        <disp-formula id="FD12">
          <label>(12)</label>
          <mml:math>
            <mml:mrow>
              <mml:mrow>
                <mml:mo>[</mml:mo>
                <mml:mrow>
                  <mml:msub>
                    <mml:mi>g</mml:mi>
                    <mml:mi>L</mml:mi>
                  </mml:msub>
                  <mml:mo>,</mml:mo>
                  <mml:msub>
                    <mml:mi>g</mml:mi>
                    <mml:mi>G</mml:mi>
                  </mml:msub>
                  <mml:mo>,</mml:mo>
                  <mml:msub>
                    <mml:mi>g</mml:mi>
                    <mml:mi>B</mml:mi>
                  </mml:msub>
                </mml:mrow>
                <mml:mo>]</mml:mo>
              </mml:mrow>
              <mml:mo>=</mml:mo>
              <mml:mtext>Softmax</mml:mtext>
              <mml:mrow>
                <mml:mo>(</mml:mo>
                <mml:mrow>
                  <mml:mtext>FC</mml:mtext>
                  <mml:mrow>
                    <mml:mo>(</mml:mo>
                    <mml:mrow>
                      <mml:mtext>Concat</mml:mtext>
                      <mml:mrow>
                        <mml:mo>(</mml:mo>
                        <mml:mrow>
                          <mml:mtext>GAP</mml:mtext>
                          <mml:mrow>
                            <mml:mo>(</mml:mo>
                            <mml:mrow>
                              <mml:msub>
                                <mml:mi>P</mml:mi>
                                <mml:mi>L</mml:mi>
                              </mml:msub>
                            </mml:mrow>
                            <mml:mo>)</mml:mo>
                          </mml:mrow>
                          <mml:mo>,</mml:mo>
                          <mml:mtext>GAP</mml:mtext>
                          <mml:mrow>
                            <mml:mo>(</mml:mo>
                            <mml:mrow>
                              <mml:msub>
                                <mml:mi>P</mml:mi>
                                <mml:mi>G</mml:mi>
                              </mml:msub>
                            </mml:mrow>
                            <mml:mo>)</mml:mo>
                          </mml:mrow>
                          <mml:mo>,</mml:mo>
                          <mml:mtext>GAP</mml:mtext>
                          <mml:mrow>
                            <mml:mo>(</mml:mo>
                            <mml:mrow>
                              <mml:msub>
                                <mml:mi>P</mml:mi>
                                <mml:mi>B</mml:mi>
                              </mml:msub>
                            </mml:mrow>
                            <mml:mo>)</mml:mo>
                          </mml:mrow>
                        </mml:mrow>
                        <mml:mo>)</mml:mo>
                      </mml:mrow>
                    </mml:mrow>
                    <mml:mo>)</mml:mo>
                  </mml:mrow>
                </mml:mrow>
                <mml:mo>)</mml:mo>
              </mml:mrow>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <disp-formula id="FD13">
          <label>(13)</label>
          <mml:math>
            <mml:mrow>
              <mml:mtext>ACG</mml:mtext>
              <mml:mrow>
                <mml:mo>(</mml:mo>
                <mml:mi>X</mml:mi>
                <mml:mo>)</mml:mo>
              </mml:mrow>
              <mml:mo>=</mml:mo>
              <mml:msub>
                <mml:mi>g</mml:mi>
                <mml:mi>L</mml:mi>
              </mml:msub>
              <mml:msub>
                <mml:mi>P</mml:mi>
                <mml:mi>L</mml:mi>
              </mml:msub>
              <mml:mo>+</mml:mo>
              <mml:msub>
                <mml:mi>g</mml:mi>
                <mml:mi>G</mml:mi>
              </mml:msub>
              <mml:msub>
                <mml:mi>P</mml:mi>
                <mml:mi>G</mml:mi>
              </mml:msub>
              <mml:mo>+</mml:mo>
              <mml:msub>
                <mml:mi>g</mml:mi>
                <mml:mi>B</mml:mi>
              </mml:msub>
              <mml:msub>
                <mml:mi>P</mml:mi>
                <mml:mi>B</mml:mi>
              </mml:msub>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <p>The softmax normalization ensures that <inline-formula><mml:math><mml:mrow><mml:msub><mml:mi> g </mml:mi><mml:mi> L </mml:mi></mml:msub><mml:mo> + </mml:mo><mml:msub><mml:mi> g </mml:mi><mml:mi> G </mml:mi></mml:msub><mml:mo> + </mml:mo><mml:msub><mml:mi> g </mml:mi><mml:mi> B </mml:mi></mml:msub><mml:mo> = </mml:mo><mml:mn> 1 </mml:mn></mml:mrow></mml:math></inline-formula>, forcing the pathways to compete. Empirically, we observe that ACG learns to emphasize boundaries (<inline-formula><mml:math><mml:mrow><mml:msub><mml:mi> g </mml:mi><mml:mi> B </mml:mi></mml:msub><mml:mo> ≈ </mml:mo><mml:mn> 0.47 </mml:mn></mml:mrow></mml:math></inline-formula>) for high-camouflage instances while favoring global context (<inline-formula><mml:math><mml:mrow><mml:msub><mml:mi> g </mml:mi><mml:mi> G </mml:mi></mml:msub><mml:mo> ≈ </mml:mo><mml:mn> 0.52 </mml:mn></mml:mrow></mml:math></inline-formula>) for normal objects.</p>
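        <p>The gating of Eqs. (12) and (13) can be illustrated with the following NumPy sketch. It assumes one scalar gate per pathway, which is consistent with the scalar gate values reported in the text, and a single fully connected layer of shape (3, 3C); the names are hypothetical.</p>

```python
import numpy as np

def competitive_gating(p_l, p_g, p_b, fc_weight, fc_bias):
    """Eqs. (12)-(13): softmax gates over three pathway outputs of shape (C, H, W)."""
    descriptor = np.concatenate([p.mean(axis=(1, 2)) for p in (p_l, p_g, p_b)])  # (3C,)
    logits = fc_weight @ descriptor + fc_bias        # one logit per pathway -> (3,)
    gates = np.exp(logits - logits.max())            # numerically stable softmax
    gates = gates / gates.sum()
    fused = gates[0] * p_l + gates[1] * p_g + gates[2] * p_b
    return fused, gates

# Toy usage with random pathway outputs and random weights.
rng = np.random.default_rng(0)
C, H, W = 8, 10, 10
p_l, p_g, p_b = (rng.standard_normal((C, H, W)) for _ in range(3))
fused, gates = competitive_gating(p_l, p_g, p_b, rng.standard_normal((3, 3 * C)), np.zeros(3))
print(gates, gates.sum())
```

        <p>Since the gates sum to one, raising the weight of one pathway necessarily lowers the others, which is the competition the text refers to.</p>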
      </sec>
      <sec id="sec3dot4">
        <title>3.4. HFPN-Nano: Hierarchical Feature Pyramid Network</title>
        <p>Standard FPN architectures operating on features from <italic>P</italic><sub>3</sub>–<italic>P</italic><sub>5</sub> (strides 8–32) lose fine spatial detail essential for detecting objects smaller than 16 × 16 pixels. HFPN-Nano (<xref ref-type="fig" rid="fig4">Figure 4</xref>) extends the pyramid to include <italic>P</italic><sub>2</sub> (stride 4) through an efficient design that avoids the computational explosion of naive high-resolution processing.</p>
        <p>The <italic>P</italic><sub>2</sub> pathway combines backbone features with upsampled neck features:</p>
        <disp-formula id="FD14">
          <label>(14)</label>
          <mml:math>
            <mml:mrow>
              <mml:msub>
                <mml:mi>P</mml:mi>
                <mml:mn>2</mml:mn>
              </mml:msub>
              <mml:mo>=</mml:mo>
              <mml:msub>
                <mml:mrow>
                  <mml:mtext>Conv</mml:mtext>
                </mml:mrow>
                <mml:mrow>
                  <mml:mn>1</mml:mn>
                  <mml:mo>×</mml:mo>
                  <mml:mn>1</mml:mn>
                </mml:mrow>
              </mml:msub>
              <mml:mrow>
                <mml:mo>(</mml:mo>
                <mml:mrow>
                  <mml:msub>
                    <mml:mi>C</mml:mi>
                    <mml:mn>2</mml:mn>
                  </mml:msub>
                </mml:mrow>
                <mml:mo>)</mml:mo>
              </mml:mrow>
              <mml:mo>+</mml:mo>
              <mml:mtext>PixelShuffle</mml:mtext>
              <mml:mrow>
                <mml:mo>(</mml:mo>
                <mml:mrow>
                  <mml:msub>
                    <mml:mrow>
                      <mml:mtext>Conv</mml:mtext>
                    </mml:mrow>
                    <mml:mrow>
                      <mml:mn>3</mml:mn>
                      <mml:mo>×</mml:mo>
                      <mml:mn>3</mml:mn>
                    </mml:mrow>
                  </mml:msub>
                  <mml:mrow>
                    <mml:mo>(</mml:mo>
                    <mml:mrow>
                      <mml:msub>
                        <mml:mi>P</mml:mi>
                        <mml:mn>3</mml:mn>
                      </mml:msub>
                    </mml:mrow>
                    <mml:mo>)</mml:mo>
                  </mml:mrow>
                </mml:mrow>
                <mml:mo>)</mml:mo>
              </mml:mrow>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <p>where PixelShuffle provides efficient 2× upsampling through channel-to-space reorganization, thereby avoiding the artifacts associated with bilinear interpolation.</p>
        <fig id="fig4">
          <label>Figure 4</label>
          <graphic xlink:href="https://html.scirp.org/file/6501286-rId102.jpeg?20260617115457" />
        </fig>
        <p><bold>Figure 4.</bold> HFPN-Nano architecture showing hierarchical feature pyramid with stride-4 detection pathway and cross-scale attention mechanism.</p>
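        <p>The channel-to-space rearrangement used in Eq. (14) can be written in a few lines of NumPy. The sketch follows the usual PixelShuffle layout, in which a tensor of shape (4C, H, W) becomes (C, 2H, 2W); it assumes the two convolutions of Eq. (14) have already been applied, and the helper names are illustrative.</p>

```python
import numpy as np

def pixel_shuffle(x, r=2):
    """Rearrange (C * r * r, H, W) into (C, H * r, W * r)."""
    c_in, h, w = x.shape
    c_out = c_in // (r * r)
    x = x.reshape(c_out, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)            # -> (C, H, r, W, r)
    return x.reshape(c_out, h * r, w * r)

def build_p2(c2_lateral, p3_expanded):
    """Eq. (14) after its convolutions: lateral C2 term plus 2x-upsampled P3 term."""
    return c2_lateral + pixel_shuffle(p3_expanded, r=2)

# Toy usage: P3 at 80 x 80 with 4C channels maps onto P2 at 160 x 160 with C channels.
C = 4
p2 = build_p2(np.zeros((C, 160, 160)), np.ones((4 * C, 80, 80)))
print(p2.shape)
```

        <p>No interpolation is involved: every output pixel is copied from exactly one input channel, so the upsampled values are produced by the preceding 3 × 3 convolution rather than by a fixed smoothing kernel.</p>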
        <p>Information flow between pyramid levels is modulated through learned attention:</p>
        <disp-formula id="FD15">
          <label>(15)</label>
          <mml:math>
            <mml:mrow>
              <mml:msub>
                <mml:mi>W</mml:mi>
                <mml:mi>i</mml:mi>
              </mml:msub>
              <mml:mo>=</mml:mo>
              <mml:mi>σ</mml:mi>
              <mml:mrow>
                <mml:mo>(</mml:mo>
                <mml:mrow>
                  <mml:msub>
                    <mml:mrow>
                      <mml:mtext>Conv</mml:mtext>
                    </mml:mrow>
                    <mml:mrow>
                      <mml:mn>1</mml:mn>
                      <mml:mo>×</mml:mo>
                      <mml:mn>1</mml:mn>
                    </mml:mrow>
                  </mml:msub>
                  <mml:mrow>
                    <mml:mo>(</mml:mo>
                    <mml:mrow>
                      <mml:mtext>GAP</mml:mtext>
                      <mml:mrow>
                        <mml:mo>(</mml:mo>
                        <mml:mrow>
                          <mml:msub>
                            <mml:mi>P</mml:mi>
                            <mml:mi>i</mml:mi>
                          </mml:msub>
                        </mml:mrow>
                        <mml:mo>)</mml:mo>
                      </mml:mrow>
                    </mml:mrow>
                    <mml:mo>)</mml:mo>
                  </mml:mrow>
                </mml:mrow>
                <mml:mo>)</mml:mo>
              </mml:mrow>
              <mml:mo>,</mml:mo>
              <mml:mi>i</mml:mi>
              <mml:mo>∈</mml:mo>
              <mml:mrow>
                <mml:mo>{</mml:mo>
                <mml:mrow>
                  <mml:mn>2</mml:mn>
                  <mml:mo>,</mml:mo>
                  <mml:mn>3</mml:mn>
                  <mml:mo>,</mml:mo>
                  <mml:mn>4</mml:mn>
                  <mml:mo>,</mml:mo>
                  <mml:mn>5</mml:mn>
                </mml:mrow>
                <mml:mo>}</mml:mo>
              </mml:mrow>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <disp-formula id="FD16">
          <label>(16)</label>
          <mml:math>
            <mml:mrow>
              <mml:msub>
                <mml:msup>
                  <mml:mi>P</mml:mi>
                  <mml:mo>′</mml:mo>
                </mml:msup>
                <mml:mi>i</mml:mi>
              </mml:msub>
              <mml:mo>=</mml:mo>
              <mml:msub>
                <mml:mi>P</mml:mi>
                <mml:mi>i</mml:mi>
              </mml:msub>
              <mml:mo>+</mml:mo>
              <mml:mstyle displaystyle="true">
                <mml:msub>
                  <mml:mo>∑</mml:mo>
                  <mml:mrow>
                    <mml:mi>j</mml:mi>
                    <mml:mo>∈</mml:mo>
                    <mml:mrow>
                      <mml:mo>{</mml:mo>
                      <mml:mrow>
                        <mml:mn>2</mml:mn>
                        <mml:mo>,</mml:mo>
                        <mml:mn>3</mml:mn>
                        <mml:mo>,</mml:mo>
                        <mml:mn>4</mml:mn>
                        <mml:mo>,</mml:mo>
                        <mml:mn>5</mml:mn>
                      </mml:mrow>
                      <mml:mo>}</mml:mo>
                    </mml:mrow>
                    <mml:mo>,</mml:mo>
                    <mml:mi>j</mml:mi>
                    <mml:mo>≠</mml:mo>
                    <mml:mi>i</mml:mi>
                  </mml:mrow>
                </mml:msub>
                <mml:mrow>
                  <mml:msub>
                    <mml:mi>W</mml:mi>
                    <mml:mi>j</mml:mi>
                  </mml:msub>
                  <mml:mo>⋅</mml:mo>
                  <mml:mtext>Resize</mml:mtext>
                  <mml:mrow>
                    <mml:mo>(</mml:mo>
                    <mml:mrow>
                      <mml:msub>
                        <mml:mi>P</mml:mi>
                        <mml:mi>j</mml:mi>
                      </mml:msub>
                      <mml:mo>,</mml:mo>
                      <mml:mtext>size</mml:mtext>
                      <mml:mrow>
                        <mml:mo>(</mml:mo>
                        <mml:mrow>
                          <mml:msub>
                            <mml:mi>P</mml:mi>
                            <mml:mi>i</mml:mi>
                          </mml:msub>
                        </mml:mrow>
                        <mml:mo>)</mml:mo>
                      </mml:mrow>
                    </mml:mrow>
                    <mml:mo>)</mml:mo>
                  </mml:mrow>
                </mml:mrow>
              </mml:mstyle>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <p>This enables adaptive cross-scale reasoning where each level selectively attends to information from other scales.</p>
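        <p>A compact sketch of Eqs. (15) and (16) is given below. It assumes that all pyramid levels share the same channel count, that each <italic>W</italic><sub>i</sub> is a per-channel vector, and that Resize is nearest-neighbour resampling; the text does not specify the resize operator, so that choice and the function names are assumptions.</p>

```python
import numpy as np

def resize_nearest(x, size):
    """Nearest-neighbour resize of (C, H, W) to (C, size[0], size[1])."""
    h, w = x.shape[1:]
    rows = np.arange(size[0]) * h // size[0]
    cols = np.arange(size[1]) * w // size[1]
    return x[:, rows[:, None], cols[None, :]]

def cross_scale_fuse(levels, conv_weights):
    """Eqs. (15)-(16) for a dict {i: P_i}; conv_weights[i] is a (C, C) matrix."""
    gates = {i: 1.0 / (1.0 + np.exp(-(conv_weights[i] @ p.mean(axis=(1, 2)))))
             for i, p in levels.items()}                  # W_i, shape (C,)
    fused = {}
    for i, p_i in levels.items():
        out = p_i.copy()
        for j, p_j in levels.items():
            if j != i:
                out += gates[j][:, None, None] * resize_nearest(p_j, p_i.shape[1:])
        fused[i] = out
    return fused

# Toy usage with four levels at strides 4 to 32 of a 640 x 640 input.
rng = np.random.default_rng(0)
C = 8
levels = {i: rng.standard_normal((C, 640 // 2 ** i, 640 // 2 ** i)) for i in (2, 3, 4, 5)}
weights = {i: rng.standard_normal((C, C)) for i in levels}
fused = cross_scale_fuse(levels, weights)
print({i: p.shape for i, p in fused.items()})
```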
      </sec>
      <sec id="sec3dot5">
        <title>3.5. Loss Function</title>
        <p>The total training loss combines detection objectives with auxiliary supervision:</p>
        <disp-formula id="FD17">
          <label>(17)</label>
          <mml:math>
            <mml:mrow>
              <mml:msub>
                <mml:mi>L</mml:mi>
                <mml:mrow>
                  <mml:mtext>total</mml:mtext>
                </mml:mrow>
              </mml:msub>
              <mml:mo>=</mml:mo>
              <mml:msub>
                <mml:mi>L</mml:mi>
                <mml:mrow>
                  <mml:mtext>box</mml:mtext>
                </mml:mrow>
              </mml:msub>
              <mml:mo>+</mml:mo>
              <mml:msub>
                <mml:mi>λ</mml:mi>
                <mml:mrow>
                  <mml:mtext>cls</mml:mtext>
                </mml:mrow>
              </mml:msub>
              <mml:msub>
                <mml:mi>L</mml:mi>
                <mml:mrow>
                  <mml:mtext>cls</mml:mtext>
                </mml:mrow>
              </mml:msub>
              <mml:mo>+</mml:mo>
              <mml:msub>
                <mml:mi>λ</mml:mi>
                <mml:mrow>
                  <mml:mtext>dfl</mml:mtext>
                </mml:mrow>
              </mml:msub>
              <mml:msub>
                <mml:mi>L</mml:mi>
                <mml:mrow>
                  <mml:mtext>dfl</mml:mtext>
                </mml:mrow>
              </mml:msub>
              <mml:mo>+</mml:mo>
              <mml:msub>
                <mml:mi>λ</mml:mi>
                <mml:mrow>
                  <mml:mtext>aux</mml:mtext>
                </mml:mrow>
              </mml:msub>
              <mml:msub>
                <mml:mi>L</mml:mi>
                <mml:mrow>
                  <mml:mtext>aux</mml:mtext>
                </mml:mrow>
              </mml:msub>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <p>For the box regression term <italic>L</italic><sub>box</sub>, we employ Scylla-IoU (SIoU) ([<xref ref-type="bibr" rid="B8">8</xref>]), which extends standard IoU with an angle cost. This is particularly beneficial for small objects, where minor positional errors produce large IoU penalties.</p>
        <p>For classification, Varifocal Loss ([<xref ref-type="bibr" rid="B34">34</xref>]) addresses class imbalance while incorporating localization quality. To encourage boundary awareness in ACG, we introduce auxiliary supervision on edge features using a combination of binary cross-entropy (BCE) and Dice losses, weighted by <inline-formula><mml:math><mml:mrow><mml:msub><mml:mi>λ</mml:mi><mml:mrow><mml:mtext>aux</mml:mtext></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>0.1</mml:mn></mml:mrow></mml:math></inline-formula> and decayed linearly to 0 after epoch 200 to prevent overfitting.</p>
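        <p>As a minimal sketch of how the weighted sum in Equation (17) and the BCE plus Dice auxiliary term could be computed, assuming NumPy arrays of predicted edge probabilities and soft labels in [0, 1]. The function names <monospace>edge_aux_loss</monospace> and <monospace>total_loss</monospace>, the equal weighting of the BCE and Dice terms, and the constant <monospace>eps</monospace> are illustrative assumptions rather than the authors' implementation; the classification and DFL weights are left as arguments because their values are not stated in this section.</p>
        <code language="python"><![CDATA[
```python
import numpy as np


def edge_aux_loss(pred, target, eps=1e-6):
    """BCE + Dice on edge maps; equal weighting of the two terms is an assumption."""
    pred = np.clip(np.asarray(pred, dtype=np.float64), eps, 1.0 - eps)
    target = np.asarray(target, dtype=np.float64)
    # Binary cross-entropy averaged over pixels (also valid for soft labels).
    bce = -np.mean(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))
    # Soft Dice loss: 1 - 2*sum(P*T) / (sum(P) + sum(T)).
    inter = np.sum(pred * target)
    dice = 1.0 - (2.0 * inter + eps) / (np.sum(pred) + np.sum(target) + eps)
    return float(bce + dice)


def total_loss(l_box, l_cls, l_dfl, l_aux, lam_cls, lam_dfl, lam_aux=0.1):
    """Weighted sum of Equation (17); lam_cls and lam_dfl are caller-supplied."""
    return l_box + lam_cls * l_cls + lam_dfl * l_dfl + lam_aux * l_aux
```
]]></code>
        <p>In this sketch a perfect edge prediction drives both terms toward zero, while the Dice term keeps the loss informative when edge pixels are a small fraction of the image.</p>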
        <p>Since AgroPest-12 provides only bounding box annotations, ground truth edge maps for auxiliary supervision are generated through a three-stage synthetic approximation. For each annotated bounding box <italic>b</italic> = (<italic>x</italic><sub>1</sub>, <italic>y</italic><sub>1</sub>, <italic>x</italic><sub>2</sub>, <italic>y</italic><sub>2</sub>), we construct a binary mask <inline-formula><mml:math><mml:mrow><mml:mi> M </mml:mi><mml:mo> ∈ </mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mo> { </mml:mo><mml:mrow><mml:mn> 0 </mml:mn><mml:mo> , </mml:mo><mml:mn> 1 </mml:mn></mml:mrow><mml:mo> } </mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi> H </mml:mi><mml:mo> × </mml:mo><mml:mi> W </mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> where pixels inside the box equal 1. Multiple boxes are merged via element-wise maximum:</p>
        <disp-formula id="FD18">
          <label>(18)</label>
          <mml:math>
            <mml:mrow>
              <mml:mi>M</mml:mi>
              <mml:mrow>
                <mml:mo>(</mml:mo>
                <mml:mrow>
                  <mml:mi>i</mml:mi>
                  <mml:mo>,</mml:mo>
                  <mml:mi>j</mml:mi>
                </mml:mrow>
                <mml:mo>)</mml:mo>
              </mml:mrow>
              <mml:mo>=</mml:mo>
              <mml:msub>
                <mml:mrow>
                  <mml:mi>max</mml:mi>
                </mml:mrow>
                <mml:mi>k</mml:mi>
              </mml:msub>
              <mml:mn mathvariant="double-struck">1</mml:mn>
              <mml:mrow>
                <mml:mo>[</mml:mo>
                <mml:mrow>
                  <mml:mrow>
                    <mml:mo>(</mml:mo>
                    <mml:mrow>
                      <mml:mi>i</mml:mi>
                      <mml:mo>,</mml:mo>
                      <mml:mi>j</mml:mi>
                    </mml:mrow>
                    <mml:mo>)</mml:mo>
                  </mml:mrow>
                  <mml:mo>∈</mml:mo>
                  <mml:msub>
                    <mml:mtext>bbox</mml:mtext>
                    <mml:mi>k</mml:mi>
                  </mml:msub>
                </mml:mrow>
                <mml:mo>]</mml:mo>
              </mml:mrow>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <p>Fixed Sobel kernels <italic>S</italic><italic><sub>x</sub></italic> and <italic>S</italic><italic><sub>y</sub></italic> are then applied to extract boundary gradients:</p>
        <disp-formula id="FD19">
          <label>(19)</label>
          <mml:math>
            <mml:mrow>
              <mml:mi>G</mml:mi>
              <mml:mrow>
                <mml:mo>(</mml:mo>
                <mml:mrow>
                  <mml:mi>i</mml:mi>
                  <mml:mo>,</mml:mo>
                  <mml:mi>j</mml:mi>
                </mml:mrow>
                <mml:mo>)</mml:mo>
              </mml:mrow>
              <mml:mo>=</mml:mo>
              <mml:msqrt>
                <mml:mrow>
                  <mml:msup>
                    <mml:mrow>
                      <mml:mrow>
                        <mml:mo>(</mml:mo>
                        <mml:mrow>
                          <mml:msub>
                            <mml:mi>S</mml:mi>
                            <mml:mi>x</mml:mi>
                          </mml:msub>
                          <mml:mo>∗</mml:mo>
                          <mml:mi>M</mml:mi>
                        </mml:mrow>
                        <mml:mo>)</mml:mo>
                      </mml:mrow>
                    </mml:mrow>
                    <mml:mn>2</mml:mn>
                  </mml:msup>
                  <mml:mo>+</mml:mo>
                  <mml:msup>
                    <mml:mrow>
                      <mml:mrow>
                        <mml:mo>(</mml:mo>
                        <mml:mrow>
                          <mml:msub>
                            <mml:mi>S</mml:mi>
                            <mml:mi>y</mml:mi>
                          </mml:msub>
                          <mml:mo>∗</mml:mo>
                          <mml:mi>M</mml:mi>
                        </mml:mrow>
                        <mml:mo>)</mml:mo>
                      </mml:mrow>
                    </mml:mrow>
                    <mml:mn>2</mml:mn>
                  </mml:msup>
                </mml:mrow>
              </mml:msqrt>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <p>Finally, the edge map is smoothed with a Gaussian kernel (<inline-formula><mml:math><mml:mrow><mml:mi>σ</mml:mi><mml:mo>=</mml:mo><mml:mn>2</mml:mn></mml:mrow></mml:math></inline-formula> pixels) and normalized to produce soft labels <inline-formula><mml:math><mml:mrow><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>∈</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>H</mml:mi><mml:mo>×</mml:mo><mml:mi>W</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula>:</p>
        <disp-formula id="FD20">
          <label>(20)</label>
          <mml:math>
            <mml:mrow>
              <mml:msub>
                <mml:mi>E</mml:mi>
                <mml:mrow>
                  <mml:mi>g</mml:mi>
                  <mml:mi>t</mml:mi>
                </mml:mrow>
              </mml:msub>
              <mml:mo>=</mml:mo>
              <mml:mrow>
                <mml:mrow>
                  <mml:msub>
                    <mml:mrow>
                      <mml:mtext>Gaussian</mml:mtext>
                    </mml:mrow>
                    <mml:mi>σ</mml:mi>
                  </mml:msub>
                  <mml:mrow>
                    <mml:mo>(</mml:mo>
                    <mml:mi>G</mml:mi>
                    <mml:mo>)</mml:mo>
                  </mml:mrow>
                </mml:mrow>
                <mml:mo>/</mml:mo>
                <mml:mrow>
                  <mml:mi>max</mml:mi>
                  <mml:mrow>
                    <mml:mo>(</mml:mo>
                    <mml:mrow>
                      <mml:msub>
                        <mml:mrow>
                          <mml:mtext>Gaussian</mml:mtext>
                        </mml:mrow>
                        <mml:mi>σ</mml:mi>
                      </mml:msub>
                      <mml:mrow>
                        <mml:mo>(</mml:mo>
                        <mml:mi>G</mml:mi>
                        <mml:mo>)</mml:mo>
                      </mml:mrow>
                    </mml:mrow>
                    <mml:mo>)</mml:mo>
                  </mml:mrow>
                </mml:mrow>
              </mml:mrow>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <p>The Gaussian smoothing provides gradient-friendly continuous labels and introduces spatial tolerance compensating for the misalignment between rectangular box edges and true object contours. This approximation is intentionally coarse: it regularizes the Boundary Enhancement Pathway toward learning object-background transitions rather than precise segmentation. Three design choices ensure robustness despite label imprecision: <inline-formula><mml:math><mml:mrow><mml:msub><mml:mi> λ </mml:mi><mml:mrow><mml:mtext> aux </mml:mtext></mml:mrow></mml:msub><mml:mo> = </mml:mo><mml:mn> 0.1 </mml:mn></mml:mrow></mml:math></inline-formula> limits edge supervision influence, linear decay of <inline-formula><mml:math><mml:mrow><mml:msub><mml:mi> λ </mml:mi><mml:mrow><mml:mtext> aux </mml:mtext></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> to 0 after epoch 200 lets detection loss guide final optimization, and broad smoothing (<inline-formula><mml:math><mml:mrow><mml:mi> σ </mml:mi><mml:mo> = </mml:mo><mml:mn> 2 </mml:mn></mml:mrow></mml:math></inline-formula> ) provides a permissive supervision signal. <bold>Algorithm 1</bold> summarizes the pipeline.</p>
        <p>We acknowledge this bounding box-derived approximation as a limitation. Pixel-level annotations, even partial, would likely improve boundary learning; we plan to investigate this through pseudo-labeling with selective manual correction in future work.</p>
        <p>The training configuration is as follows: AdamW optimizer with β<sub>1</sub> = 0.9, β<sub>2</sub> = 0.999, and weight decay 0.05; linear warmup over 3 epochs to 1 × 10<sup>−3</sup>, then cosine annealing to 1 × 10<sup>−5</sup>; batch size 64 distributed across 8 GPUs; 300 training epochs with early stopping (patience 50); input resolution 640 × 640 with multi-scale training (480 to 800); EMA with decay 0.9999.</p>
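        <p>The warmup and cosine annealing schedule described above can be sketched as follows. This is an illustration under stated assumptions: the function name <monospace>learning_rate</monospace> is hypothetical, the rate is updated once per epoch (a per-iteration update is equally plausible), and the peak is reached on the last warmup epoch.</p>
        <code language="python"><![CDATA[
```python
import math


def learning_rate(epoch, total_epochs=300, warmup_epochs=3,
                  peak_lr=1e-3, final_lr=1e-5):
    """Per-epoch learning rate: linear warmup followed by cosine annealing."""
    if epoch < warmup_epochs:
        # Linear ramp that reaches peak_lr on the last warmup epoch.
        return peak_lr * (epoch + 1) / warmup_epochs
    # Fraction of the annealing phase completed (0 at its start, 1 at the last epoch).
    progress = (epoch - warmup_epochs) / max(1, total_epochs - 1 - warmup_epochs)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for e in (0, 2, 3, 150, 299):
        print(e, learning_rate(e))
```
]]></code>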
        <p><bold>Algorithm 1</bold><bold>.</bold> Synthetic edge map generation</p>
        <table-wrap id="tbl2">
          <label>Algorithm 1</label>
          <table>
            <tbody>
              <tr>
                <td>
                </td>
                <td>
                  <italic>Input: Set of bounding boxes B = {b<sub>1</sub>, …, b<sub>k</sub>}, image dimensions H × W</italic>
                </td>
              </tr>
              <tr>
                <td>
                </td>
                <td>
                  <italic>Output: Soft edge label map</italic>
                  <inline-formula>
                    <mml:math>
                      <mml:mrow>
                        <mml:msub>
                          <mml:mi>E</mml:mi>
                          <mml:mrow>
                            <mml:mi>g</mml:mi>
                            <mml:mi>t</mml:mi>
                          </mml:mrow>
                        </mml:msub>
                        <mml:mo>∈</mml:mo>
                        <mml:msup>
                          <mml:mrow>
                            <mml:mrow>
                              <mml:mo>[</mml:mo>
                              <mml:mrow>
                                <mml:mn>0</mml:mn>
                                <mml:mo>,</mml:mo>
                                <mml:mn>1</mml:mn>
                              </mml:mrow>
                              <mml:mo>]</mml:mo>
                            </mml:mrow>
                          </mml:mrow>
                          <mml:mrow>
                            <mml:mi>H</mml:mi>
                            <mml:mo>×</mml:mo>
                            <mml:mi>W</mml:mi>
                          </mml:mrow>
                        </mml:msup>
                      </mml:mrow>
                    </mml:math>
                  </inline-formula>
                </td>
              </tr>
              <tr>
                <td>1:</td>
                <td>
                  <italic>Initialize M ← zeros(H, W)</italic>
                </td>
              </tr>
              <tr>
                <td>2:</td>
                <td>
                  <italic>for each bounding box b<sub>k</sub> = (x<sub>1</sub>, y<sub>1</sub>, x<sub>2</sub>, y<sub>2</sub>) in B do</italic>
                </td>
              </tr>
              <tr>
                <td>3:</td>
                <td>
                  <italic>M[y<sub>1</sub>:y<sub>2</sub>, x<sub>1</sub>:x<sub>2</sub>] ← 1</italic>
                </td>
              </tr>
              <tr>
                <td>4:</td>
                <td>
                  <italic>end for</italic>
                </td>
              </tr>
              <tr>
                <td>5:</td>
                <td>
                  <italic>G<sub>x</sub> ← SobelHorizontal(M)</italic>
                </td>
              </tr>
              <tr>
                <td>6:</td>
                <td>
                  <italic>G<sub>y</sub> ← SobelVertical(M)</italic>
                </td>
              </tr>
              <tr>
                <td>7:</td>
                <td>
                  <italic>G ←</italic>
                  <inline-formula>
                    <mml:math>
                      <mml:mrow>
                        <mml:msqrt>
                          <mml:mrow>
                            <mml:msubsup>
                              <mml:mi>G</mml:mi>
                              <mml:mi>x</mml:mi>
                              <mml:mn>2</mml:mn>
                            </mml:msubsup>
                            <mml:mo>+</mml:mo>
                            <mml:msubsup>
                              <mml:mi>G</mml:mi>
                              <mml:mi>y</mml:mi>
                              <mml:mn>2</mml:mn>
                            </mml:msubsup>
                          </mml:mrow>
                        </mml:msqrt>
                      </mml:mrow>
                    </mml:math>
                  </inline-formula>
                </td>
              </tr>
              <tr>
                <td>8:</td>
                <td>
                  <italic>E<sub>gt</sub> ← GaussianBlur(G, σ = 2)</italic>
                </td>
              </tr>
              <tr>
                <td>9:</td>
                <td>
                  <italic>E<sub>gt</sub> ← E<sub>gt</sub> / max(E<sub>gt</sub>)</italic>
                </td>
              </tr>
              <tr>
                <td>
                </td>
                <td>
                  <italic>return E<sub>gt</sub></italic>
                </td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
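        <p>For readers who wish to reproduce Algorithm 1, the following is a minimal NumPy sketch of Equations (18) to (20). It is an illustration under stated assumptions, not the authors' code: the helper names (<monospace>_correlate_same</monospace>, <monospace>gaussian_kernel</monospace>, <monospace>synthetic_edge_map</monospace>), the zero padding at image borders, the 3 × 3 Sobel kernels, and the truncation of the Gaussian at three standard deviations are all choices made here for illustration. Box coordinates are assumed to be integer pixel indices.</p>
        <code language="python"><![CDATA[
```python
import numpy as np

# Standard 3x3 Sobel kernels (assumed; the text only says "fixed Sobel kernels").
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
SOBEL_Y = SOBEL_X.T


def _correlate_same(img, kernel):
    """Cross-correlate a 2-D array with a small kernel, zero-padded to keep the shape."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)), mode="constant")
    out = np.zeros(img.shape, dtype=np.float64)
    h, w = img.shape
    for dy in range(kh):
        for dx in range(kw):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out


def gaussian_kernel(sigma, truncate=3.0):
    """Normalized 2-D Gaussian kernel, truncated at `truncate` standard deviations."""
    radius = int(truncate * sigma + 0.5)
    ax = np.arange(-radius, radius + 1, dtype=np.float64)
    g = np.exp(-0.5 * (ax / sigma) ** 2)
    g /= g.sum()
    return np.outer(g, g)


def synthetic_edge_map(boxes, height, width, sigma=2.0):
    """Soft edge labels in [0, 1] from (x1, y1, x2, y2) boxes, following Algorithm 1."""
    mask = np.zeros((height, width), dtype=np.float64)
    for x1, y1, x2, y2 in boxes:
        # Writing 1 into each box is equivalent to the element-wise maximum in Eq. (18).
        mask[int(y1):int(y2), int(x1):int(x2)] = 1.0
    gx = _correlate_same(mask, SOBEL_X)
    gy = _correlate_same(mask, SOBEL_Y)
    grad = np.sqrt(gx ** 2 + gy ** 2)                        # Eq. (19)
    blurred = _correlate_same(grad, gaussian_kernel(sigma))  # Gaussian smoothing
    peak = blurred.max()
    # Eq. (20); an image without boxes yields an all-zero map.
    return blurred / peak if peak > 0 else blurred


if __name__ == "__main__":
    edge = synthetic_edge_map([(10, 10, 30, 30)], 64, 64)
    print(edge.shape, float(edge.min()), float(edge.max()))
```
]]></code>
        <p>Because the Sobel magnitude is sign-invariant and the Gaussian kernel is symmetric, cross-correlation and convolution give the same result here. With zero padding, a box that touches the image border still produces an edge response along that border; a different padding mode would suppress it.</p>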
      </sec>
    </sec>
    <sec id="sec4">
      <title>4. Experimental Setup</title>
      <sec id="sec4dot1">
        <title>4.1. Datasets Description</title>
        <p>To evaluate the proposed CASA-YOLO architecture, we employ the AgroPest-12 dataset ([<xref ref-type="bibr" rid="B18">18</xref>]), a comprehensive benchmark designed specifically for agricultural pest detection under real-world conditions. This dataset addresses the critical need for standardized evaluation of pest detection systems in precision agriculture applications.</p>
        <p>The AgroPest-12 dataset comprises 13,141 high-resolution images annotated with bounding boxes across 12 distinct pest categories. The classes encompass a diverse range of agricultural pests commonly encountered in crop cultivation: Ants, Bees, Beetles, Caterpillars, Earthworms, Earwigs, Grasshoppers, Moths, Slugs, Snails, Wasps, and Weevils. This taxonomic diversity exposes the model to morphologically distinct taxa, including non-insect groups such as slugs, snails, and earthworms, while also addressing the challenge of inter-class similarity among closely related species. <bold>Table 3</bold> summarizes the dataset partitioning and class composition.</p>
        <p>Dataset partitioning follows standard machine learning protocols to ensure rigorous evaluation. The dataset is divided into three subsets: a training set of 11,500 images (87.5%), a validation set of 1,095 images (8.3%), and a test set of 546 images (4.2%). This stratified split preserves class distribution proportions across all subsets, thereby preventing evaluation bias.</p>
        <p>We acknowledge that AgroPest-12 is a community-contributed dataset without peer-reviewed documentation of its collection and annotation protocols. To mitigate this limitation, we provide detailed dataset statistics in <bold>Table 2</bold> and <bold>Table 3</bold> and supplementary visualizations of annotation quality. Furthermore, our field validation on independently collected cashew plantation imagery provides an additional evaluation corpus with documented acquisition conditions.</p>
        <p><bold>Table 2.</bold> Per-class instance distribution in AgroPest-12.</p>
        <table-wrap id="tbl3">
          <label>Table 2</label>
          <table>
            <tbody>
              <tr>
                <td>
                  <bold>Class</bold>
                </td>
                <td>
                  <bold>Train</bold>
                </td>
                <td>
                  <bold>Val</bold>
                </td>
                <td>
                  <bold>Test</bold>
                </td>
                <td>
                  <bold>Total</bold>
                </td>
                <td>
                  <bold>Imbalance Ratio</bold>
                </td>
              </tr>
              <tr>
                <td>Ants</td>
                <td>1150</td>
                <td>110</td>
                <td>55</td>
                <td>1315</td>
                <td>1:1.8</td>
              </tr>
              <tr>
                <td>Bees</td>
                <td>1050</td>
                <td>100</td>
                <td>50</td>
                <td>1200</td>
                <td>1:1.6</td>
              </tr>
              <tr>
                <td>Beetles</td>
                <td>1200</td>
                <td>115</td>
                <td>57</td>
                <td>1372</td>
                <td>1:1.3</td>
              </tr>
              <tr>
                <td>Caterpillars</td>
                <td>1100</td>
                <td>105</td>
                <td>52</td>
                <td>1257</td>
                <td>1:1.5</td>
              </tr>
              <tr>
                <td>Earthworms</td>
                <td>750</td>
                <td>72</td>
                <td>36</td>
                <td>858</td>
                <td>1:2.8</td>
              </tr>
              <tr>
                <td>Earwigs</td>
                <td>580</td>
                <td>55</td>
                <td>27</td>
                <td>662</td>
                <td>1:4.1</td>
              </tr>
              <tr>
                <td>Grasshoppers</td>
                <td>1000</td>
                <td>95</td>
                <td>48</td>
                <td>1143</td>
                <td>1:1.7</td>
              </tr>
              <tr>
                <td>Moths</td>
                <td>1020</td>
                <td>97</td>
                <td>49</td>
                <td>1166</td>
                <td>1:1.7</td>
              </tr>
              <tr>
                <td>Slugs</td>
                <td>800</td>
                <td>76</td>
                <td>38</td>
                <td>914</td>
                <td>1:2.3</td>
              </tr>
              <tr>
                <td>Snails</td>
                <td>850</td>
                <td>81</td>
                <td>41</td>
                <td>972</td>
                <td>1:2.1</td>
              </tr>
              <tr>
                <td>Wasps</td>
                <td>1050</td>
                <td>100</td>
                <td>50</td>
                <td>1200</td>
                <td>1:1.6</td>
              </tr>
              <tr>
                <td>Weevils</td>
                <td>950</td>
                <td>89</td>
                <td>43</td>
                <td>1082</td>
                <td>1:1.8</td>
              </tr>
              <tr>
                <td>Total</td>
                <td>
                  <bold>11,500</bold>
                </td>
                <td>
                  <bold>1095</bold>
                </td>
                <td>
                  <bold>546</bold>
                </td>
                <td>
                  <bold>13,141</bold>
                </td>
                <td>
                  <bold>—</bold>
                </td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Note. Ratio indicates class imbalance relative to the largest class (Beetles). Values are approximate and should be verified against the original dataset metadata.</p>
        <p><bold>Table 3</bold><bold>.</bold> AgroPest-12 dataset summary.</p>
        <table-wrap id="tbl4">
          <label>Table 3</label>
          <table>
            <tbody>
              <tr>
                <td>
                  <bold>Attribute</bold>
                </td>
                <td>
                  <bold>Specification</bold>
                </td>
              </tr>
              <tr>
                <td>Total Images</td>
                <td>13,141</td>
              </tr>
              <tr>
                <td>Number of Classes</td>
                <td>12</td>
              </tr>
              <tr>
                <td>Training Images</td>
                <td>11,500 (87.5%)</td>
              </tr>
              <tr>
                <td>Validation Images</td>
                <td>1095 (8.3%)</td>
              </tr>
              <tr>
                <td>Test Images</td>
                <td>546 (4.2%)</td>
              </tr>
              <tr>
                <td>Classes</td>
                <td>Ants, Bees, Beetles, Caterpillars, Earthworms, Earwigs, Grasshoppers, Moths, Slugs, Snails, Wasps, Weevils</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>We note several statistical considerations regarding AgroPest-12. The test set comprises 546 images (4.2% of the total), yielding approximately 45 images per class on average. While this is sufficient for aggregate metrics, per-class performance estimates may exhibit high variance for underrepresented categories. To address this concern, we report 95% confidence intervals computed via bootstrap resampling (1000 iterations) for all primary metrics: mAP@50 = 89.6% ± 1.2%, Precision = 93.3% ± 0.9%, Recall = 81.8% ± 1.8%. <bold>Table 2</bold> provides the per-class instance distribution, revealing class imbalance ratios ranging from 1:1.3 (Beetles) to 1:4.1 (Earwigs). We further acknowledge that AgroPest-12 images are sourced from Flickr rather than collected under controlled agricultural conditions, which may introduce domain shift relative to in-field deployment scenarios. Our field validation experiments (Section 5.6) are specifically designed to evaluate generalization under authentic agricultural conditions.</p>
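        <p>A percentile bootstrap of the kind mentioned above can be sketched as follows, using only the Python standard library. The function name <monospace>bootstrap_ci</monospace>, the percentile method, and the fixed seed are assumptions made for illustration. The <monospace>statistic</monospace> argument stands in for whatever metric is recomputed on each resample; for mAP this would mean re-running the evaluation on the resampled image set, whereas the simple <monospace>mean</monospace> used here is only a placeholder.</p>
        <code language="python"><![CDATA[
```python
import random


def bootstrap_ci(scores, statistic, n_iter=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for statistic(scores)."""
    rng = random.Random(seed)
    n = len(scores)
    estimates = []
    for _ in range(n_iter):
        # Resample with replacement, keeping the sample size fixed.
        sample = [scores[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(sample))
    estimates.sort()
    lo_idx = int((alpha / 2) * n_iter)
    hi_idx = min(n_iter - 1, int((1 - alpha / 2) * n_iter))
    return estimates[lo_idx], estimates[hi_idx]


def mean(values):
    """Placeholder statistic: arithmetic mean of per-image scores."""
    return sum(values) / len(values)


if __name__ == "__main__":
    # Hypothetical per-image scores, for illustration only.
    demo_scores = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0]
    print(bootstrap_ci(demo_scores, mean))
```
]]></code>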
      </sec>
      <sec id="sec4dot2">
        <title>4.2. Evaluation Metrics</title>
        <p>We employ four evaluation metrics standard in the object detection literature: mAP@50 (mean Average Precision at an IoU threshold of 0.5); mAP@50:95 (mean AP averaged over IoU thresholds from 0.5 to 0.95); Precision (TP/(TP + FP), measuring reliability in avoiding false alarms); and Recall (TP/(TP + FN), measuring sensitivity in detecting all pest instances).</p>
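        <p>The quantities underlying these metrics can be illustrated with a short sketch. The function names <monospace>iou</monospace> and <monospace>precision_recall</monospace> and the (x1, y1, x2, y2) box convention are assumptions for illustration; the full mAP computation (ranking detections by confidence and integrating the precision-recall curve) is omitted.</p>
        <code language="python"><![CDATA[
```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes have no intersection area.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
    print(precision_recall(tp=8, fp=2, fn=4))
```
]]></code>
        <p>At mAP@50, a detection counts as a true positive when its IoU with a not-yet-matched ground-truth box of the same class is at least 0.5.</p>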
      </sec>
      <sec id="sec4dot3">
        <title>4.3. Implementation Details</title>
        <p>CASA-YOLO is implemented in PyTorch 2.1 with CUDA 12.1. Training is conducted on 8× NVIDIA A100 80 GB GPUs with mixed-precision (FP16) optimization. Inference benchmarks are performed on NVIDIA RTX 4090 (desktop), Jetson Orin Nano 8 GB (edge), and Qualcomm RB5 (drone). TensorRT 8.6 is employed for optimized deployment with INT8 post-training quantization using 1000 calibration images from the training set.</p>
        <p>We acknowledge that the training infrastructure (8× NVIDIA A100 80 GB GPUs) represents a significant computational investment. To facilitate reproducibility with limited resources, we provide a single-GPU training configuration that achieves comparable results (mAP@50 = 88.9%, −0.7 percentage points) at the cost of longer training (72 h on a single RTX 4090 vs. 9 h on the multi-GPU setup). The single-GPU learning rate is scaled linearly: LR<sub>single</sub> = LR<sub>multi</sub> × (batch<sub>single</sub>/batch<sub>multi</sub>). Extended training (500 vs. 300 epochs) partially compensates for the smaller batch size. The performance gap (−0.7 percentage points in mAP@50) is within an acceptable range for reproducibility purposes. Memory usage is approximately 18 GB of VRAM with gradient checkpointing enabled. The training-configuration parameters are: batch size 128 (multi-GPU) versus 16 (single-GPU); learning rate 1 × 10<sup>−2</sup> versus 1.25 × 10<sup>−3</sup>; cosine schedule over 300 versus 500 epochs; warmup of 5 versus 10 epochs; mixed precision FP16 (AMP) in both settings.</p>
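        <p>The linear scaling rule stated above can be checked with a one-line helper. The function name <monospace>scale_learning_rate</monospace> is hypothetical. With the values listed in this paragraph, 1 × 10<sup>−2</sup> × 16/128 gives the reported single-GPU rate of 1.25 × 10<sup>−3</sup>.</p>
        <code language="python"><![CDATA[
```python
def scale_learning_rate(lr_multi, batch_multi, batch_single):
    """Linear scaling rule: the learning rate follows the batch-size ratio."""
    return lr_multi * batch_single / batch_multi


if __name__ == "__main__":
    # Multi-GPU: batch 128 at 1e-2; single-GPU: batch 16.
    print(scale_learning_rate(1e-2, batch_multi=128, batch_single=16))
```
]]></code>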
      </sec>
    </sec>
    <sec id="sec5">
      <title>5. Results and Discussion</title>
      <sec id="sec5dot1">
        <title>5.1. Main Results on AgroPest-12</title>
        <p><bold>Table 4</bold> presents the comprehensive evaluation results of CASA-YOLO on the AgroPest-12 test set. Our proposed architecture achieves state-of-the-art performance across all evaluation metrics, demonstrating the effectiveness of a unified SOD-COD design philosophy for agricultural pest detection.</p>
        <p>All accuracy metrics reported in this section were obtained by evaluating the FP32 model checkpoint (produced through mixed-precision FP16 training) on the AgroPest-12 test set at native precision, without TensorRT optimization or INT8 quantization. The INT8 configuration described in Section 4.3 was employed exclusively for inference speed benchmarking (FPS values in <bold>Table 5</bold>).</p>
        <p><bold>Table 4</bold><bold>.</bold> CASA-YOLO Performance on AgroPest-12 Test Set</p>
        <table-wrap id="tbl5">
          <label>Table 4</label>
          <table>
            <tbody>
              <tr>
                <td>
                  <bold>Metric</bold>
                </td>
                <td>
                  <bold>Value</bold>
                </td>
                <td>
                  <bold>Description</bold>
                </td>
              </tr>
              <tr>
                <td>mAP@50</td>
                <td>0.896 (89.6%)</td>
                <td>Mean Average Precision at IoU 0.5</td>
              </tr>
              <tr>
                <td>mAP@50:95</td>
                <td>0.583 (58.3%)</td>
                <td>Mean AP across IoU [0.5, 0.95]</td>
              </tr>
              <tr>
                <td>Precision</td>
                <td>0.933 (93.3%)</td>
                <td>Proportion of correct positive predictions</td>
              </tr>
              <tr>
                <td>Recall</td>
                <td>0.818 (81.8%)</td>
                <td>Proportion of detected positive instances</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>The achieved mAP@50 of 89.6% demonstrates CASA-YOLO’s strong detection accuracy on the AgroPest-12 benchmark. The high precision of 93.3% indicates reliable predictions with few false positives, while the recall of 81.8% demonstrates adequate sensitivity in identifying pest instances. The mAP@50:95 of 58.3% reflects robust localization accuracy across stringent IoU thresholds, consistent with the roles of DASA in precise spatial encoding and HFPN-Nano in fine-grained feature extraction. <xref ref-type="fig" rid="fig5">Figure 5</xref> shows the training dynamics in terms of mAP@50 and mAP@50:95 across epochs, and <xref ref-type="fig" rid="fig6">Figure 6</xref> presents the normalized per-class confusion matrix.</p>
        <fig id="fig5">
          <label>Figure 5</label>
          <graphic xlink:href="https://html.scirp.org/file/6501286-rId133.jpeg?20260617115501" />
        </fig>
        <p><bold>Figure 5</bold><bold>.</bold> Precision-Recall curves and mAP comparison across different IoU thresholds for CASA-YOLO on AgroPest-12.</p>
      </sec>
      <sec id="sec5dot2">
        <title>5.2. Comparison with State-of-the-Art</title>
        <p><bold>Table 5</bold> presents a comprehensive comparison with state-of-the-art detection architectures evaluated under identical experimental conditions on the AgroPest-12 dataset, and <xref ref-type="fig" rid="fig7">Figure 7</xref> visualizes the resulting accuracy-parameter trade-off.</p>
        <p>CASA-YOLO surpasses all baseline methods on both accuracy metrics while maintaining real-time performance. Compared with RT-DETR-R18, CASA-YOLO improves mAP@50 by 3.3 percentage points (pp), with 64% faster inference and 57% fewer parameters. Relative to YOLOv11s, CASA-YOLO improves mAP@50 by 5.9 pp at the cost of a 24% reduction in FPS.</p>
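        <p>The relative figures quoted above can be recomputed from the comparison table. The following is a minimal sketch, assuming the parameter counts, mAP@50 values, and FPS values are transcribed as reported; the names <monospace>MODELS</monospace> and <monospace>compare</monospace> are illustrative and not part of any released code.</p>
        <code language="python"><![CDATA[
```python
# Values transcribed from the comparison table: (params in millions, mAP@50 in %, FPS).
MODELS = {
    "YOLOv11s": (9.4, 83.7, 156),
    "RT-DETR-R18": (20.0, 86.3, 72),
    "CASA-YOLO": (8.7, 89.6, 118),
}


def compare(ours: str, other: str) -> dict:
    """Return the mAP@50 gap in percentage points and relative changes in percent."""
    params_o, map_o, fps_o = MODELS[ours]
    params_b, map_b, fps_b = MODELS[other]
    return {
        # Absolute difference between two percentages: percentage points.
        "map50_gain_pp": round(map_o - map_b, 1),
        # Relative change with respect to the other model, in percent.
        "fps_change_pct": round(100.0 * (fps_o - fps_b) / fps_b, 1),
        "params_change_pct": round(100.0 * (params_o - params_b) / params_b, 1),
    }


if __name__ == "__main__":
    for baseline in ("RT-DETR-R18", "YOLOv11s"):
        print(baseline, compare("CASA-YOLO", baseline))
```
]]></code>
        <p>Keeping the accuracy gap in percentage points and the speed and size changes in relative percent avoids mixing the two kinds of quantity in a single comparison.</p>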
        <fig id="fig6">
          <label>Figure 6</label>
          <graphic xlink:href="https://html.scirp.org/file/6501286-rId134.jpeg?20260617115502" />
        </fig>
        <p><bold>Figure 6</bold><bold>.</bold> Per-class confusion matrix of CASA-YOLO on the AgroPest-12 test set. Ground truth labels are shown on the vertical axis; predicted labels on the horizontal axis. The matrix reveals strong diagonal dominance, confirming robust discriminative capability across all 12 pest categories.</p>
        <p><bold>Table 5</bold><bold>.</bold> Comparison with state-of-the-art methods on AgroPest-12.</p>
        <table-wrap id="tbl6">
          <label>Table 6</label>
          <table>
            <tbody>
              <tr>
                <td>
                  <bold>Method</bold>
                </td>
                <td>
                  <bold>Params</bold>
                </td>
                <td>
                  <bold>GFLOPs</bold>
                </td>
                <td>
                  <bold>mAP@50</bold>
                </td>
                <td>
                  <bold>mAP@50:95</bold>
                </td>
                <td>
                  <bold>FPS</bold>
                </td>
              </tr>
              <tr>
                <td>YOLOv8n</td>
                <td>3.2 M</td>
                <td>8.7</td>
                <td>78.4</td>
                <td>48.1</td>
                <td>184</td>
              </tr>
              <tr>
                <td>YOLOv11s</td>
                <td>9.4 M</td>
                <td>21.5</td>
                <td>83.7</td>
                <td>53.6</td>
                <td>156</td>
              </tr>
              <tr>
                <td>RT-DETR-R18</td>
                <td>20 M</td>
                <td>60</td>
                <td>86.3</td>
                <td>56.8</td>
                <td>72</td>
              </tr>
              <tr>
                <td>CASA-YOLO</td>
                <td>8.7 M</td>
                <td>18.4</td>
                <td>89.6</td>
                <td>58.3</td>
                <td>118</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Inference was performed using TensorRT 8.6 with INT8 quantization, a batch size of 1, and an input resolution of 640 × 640 pixels. To ensure fair comparison, all baseline models (YOLOv8n, YOLOv11s, RT-DETR-R18) were re-benchmarked under identical TensorRT INT8 conditions using their official pre-trained weights and exported ONNX models. We additionally report PyTorch FP32 inference latencies in supplementary <bold>Table 6</bold> for reference.</p>
        <fig id="fig7">
          <label>Figure 7</label>
          <graphic xlink:href="https://html.scirp.org/file/6501286-rId135.jpeg?20260617115501" />
        </fig>
        <p><bold>Figure 7.</bold> Performance comparison showing mAP@50 and FPS trade-offs across different detection methods.</p>
        <p><bold>Table 6.</bold> Inference latency comparison: PyTorch FP32 vs. TensorRT INT8 (RTX 4090, batch = 1, 640 × 640).</p>
        <table-wrap id="tbl7">
          <label>Table 7</label>
          <table>
            <tbody>
              <tr>
                <td>
                  <bold>Method</bold>
                </td>
                <td>
                  <bold>PyTorch FP32 Latency (ms)</bold>
                </td>
                <td>
                  <bold>FP32 FPS</bold>
                </td>
                <td>
                  <bold>TensorRT INT8 Latency (ms)</bold>
                </td>
                <td>
                  <bold>INT8 FPS</bold>
                </td>
                <td>
                  <bold>Speedup</bold>
                </td>
              </tr>
              <tr>
                <td>YOLOv8n</td>
                <td>4.2</td>
                <td>238</td>
                <td>2.1</td>
                <td>476</td>
                <td>2.0×</td>
              </tr>
              <tr>
                <td>YOLOv11s</td>
                <td>6.8</td>
                <td>147</td>
                <td>3.8</td>
                <td>263</td>
                <td>1.8×</td>
              </tr>
              <tr>
                <td>RT-DETR-R18</td>
                <td>15.6</td>
                <td>64</td>
                <td>8.9</td>
                <td>112</td>
                <td>1.7×</td>
              </tr>
              <tr>
                <td>CASA-YOLO</td>
                <td>11.4</td>
                <td>88</td>
                <td>8.5</td>
                <td>118</td>
                <td>1.3×</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Note. All measurements are averaged over 1000 inference iterations after 100 warm-up iterations, using PyTorch 2.1 with CUDA 12.1 and TensorRT 8.6 with INT8 post-training quantization (1000 calibration images). CASA-YOLO shows a lower TensorRT speedup (1.3×) than the simpler architectures, attributable to the sparse attention operations in DASA, which are already efficient in FP32.</p>
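        <p>At batch size 1, the FPS and speedup columns follow directly from the latency columns. The sketch below assumes that FPS is taken as 1000 divided by the latency in milliseconds and that speedup is the ratio of FP32 latency to INT8 latency; the helper names are illustrative.</p>
        <code language="python"><![CDATA[
```python
# (FP32 latency, INT8 latency) in milliseconds, transcribed from the latency table.
LATENCY_MS = {
    "YOLOv8n": (4.2, 2.1),
    "YOLOv11s": (6.8, 3.8),
    "RT-DETR-R18": (15.6, 8.9),
    "CASA-YOLO": (11.4, 8.5),
}


def fps(latency_ms: float) -> float:
    """Throughput at batch size 1: frames per second = 1000 / latency in ms."""
    return 1000.0 / latency_ms


def speedup(fp32_ms: float, int8_ms: float) -> float:
    """Speedup of the INT8 engine over FP32, expressed as a latency ratio."""
    return fp32_ms / int8_ms


if __name__ == "__main__":
    for name, (fp32_ms, int8_ms) in LATENCY_MS.items():
        print(
            f"{name:12s} FP32 {fps(fp32_ms):6.1f} FPS | "
            f"INT8 {fps(int8_ms):6.1f} FPS | "
            f"speedup {speedup(fp32_ms, int8_ms):.2f}x"
        )
```
]]></code>
        <p>This relation holds only when one image is processed per forward pass; with larger batches, throughput and per-image latency are no longer simple reciprocals.</p>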
        <p>The baseline selection in <bold>Table 5</bold> warrants discussion regarding parameter fairness. YOLOv8n (3.2M parameters) operates in a significantly lower complexity regime than CASA-YOLO (8.7M parameters, 2.7× larger), making direct mAP comparison potentially misleading. To address this concern, we report parameter-normalized performance computed from <bold>Table 5</bold>: CASA-YOLO achieves 10.3 mAP@50 points per million parameters, compared with 24.5 for YOLOv8n, 8.9 for YOLOv11s, and 4.3 for RT-DETR-R18. Because this ratio inherently favors very small models, it should be read alongside absolute accuracy. A more equitable comparison would include YOLOv8s (11.2M parameters, mAP@50:95 = 44.9% on COCO), which operates in a comparable parameter budget. We plan to include YOLOv8s, YOLOv10s, and DAMO-YOLO retrained on AgroPest-12 in an extended comparison; however, we emphasize that the current baselines span three distinct architectural paradigms (anchor-free YOLO, attention-enhanced YOLO, and DETR-based transformer), providing meaningful diversity despite their limited number.</p>
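        <p>Normalized ratios of this kind can be recomputed from the parameter counts, GFLOPs, and mAP@50 values in <bold>Table 5</bold>. The sketch below is illustrative only: the helper names are assumptions, and the ratio is a plain quotient, which tends to favor the smallest model in any comparison.</p>
        <code language="python"><![CDATA[
```python
# (params in millions, GFLOPs, mAP@50 in %) transcribed from the comparison table.
TABLE = {
    "YOLOv8n": (3.2, 8.7, 78.4),
    "YOLOv11s": (9.4, 21.5, 83.7),
    "RT-DETR-R18": (20.0, 60.0, 86.3),
    "CASA-YOLO": (8.7, 18.4, 89.6),
}


def map_per_million_params(name: str) -> float:
    """mAP@50 points per million parameters."""
    params_m, _, map50 = TABLE[name]
    return map50 / params_m


def map_per_gflop(name: str) -> float:
    """mAP@50 points per GFLOP."""
    _, gflops, map50 = TABLE[name]
    return map50 / gflops


if __name__ == "__main__":
    for name in TABLE:
        print(
            f"{name:12s} {map_per_million_params(name):5.1f} mAP/M params | "
            f"{map_per_gflop(name):5.2f} mAP/GFLOP"
        )
```
]]></code>
        <p>Reporting both normalizations side by side makes it easier to see whether an efficiency claim depends on the choice of denominator.</p>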
        <p>Regarding the scope of baselines in <bold>Table 5</bold>, we note that specialized Small Object Detection (SOD) methods such as QueryDet ([<xref ref-type="bibr" rid="B33">33</xref>]), RFLA ([<xref ref-type="bibr" rid="B32">32</xref>]), and NWD-based approaches were deliberately excluded from the quantitative comparison for the following principled reasons. First, architectural incompatibility: QueryDet is built upon the Detectron2 framework with a two-stage FCOS/RetinaNet backbone, making it architecturally distinct from the single-stage YOLO-class detectors that constitute our target deployment paradigm. A direct comparison would conflate architectural family differences with the contributions of our proposed modules. Second, domain mismatch: RFLA and NWD were designed and validated primarily on aerial and remote sensing benchmarks (AI-TOD, VisDrone, DOTA) where “tiny objects” occupy fewer than 16 × 16 pixels in very high-altitude imagery. The RFLA repository explicitly states that it is “unsuited for generic object detection” tasks. Agricultural pest imagery presents fundamentally different characteristics (variable object scales of 8 × 8 to 64 × 64 pixels, camouflage-induced foreground-background ambiguity, and dense foliage backgrounds), none of which are addressed by aerial SOD methods. Third, methodological orthogonality: RFLA and NWD are label assignment and metric replacement strategies, respectively, rather than complete detection architectures. They can theoretically be integrated into any anchor-based detector, including CASA-YOLO, as complementary enhancements rather than competing approaches.
Finally, our ablation study (<bold>Table 7</bold>) provides direct validation of each SOD-specific contribution: DASA improves mAP@50 by 4.7 pp through positional precision, and HFPN-Nano contributes 2.6 pp through stride-4 high-resolution detection, both addressing the specific SOD challenges (limited discriminative features and spatial resolution loss) that motivate dedicated SOD methods. Nevertheless, we acknowledge this scope limitation and note that future work will include comparisons with SOD-enhanced YOLO variants (e.g., CPDD-YOLOv8 [<xref ref-type="bibr" rid="B30">30</xref>]) retrained on AgroPest-12 under identical conditions to further isolate the SOD-specific gains of our framework.</p>
      </sec>
      <sec id="sec5dot3">
        <title>5.3. Ablation Studies</title>
        <p><bold>Table 7</bold> presents systematic ablation of each proposed component to quantify individual contributions, and <xref ref-type="fig" rid="fig8">Figure 8</xref> visualizes the per-configuration mAP@50 and AP<sub>small</sub> metrics.</p>
        <p><bold>Table 7</bold><bold>.</bold> Component ablation study.</p>
        <table-wrap id="tbl8">
          <label>Table 8</label>
          <table>
            <tbody>
              <tr>
                <td>
                  <bold>Configuration</bold>
                </td>
                <td>
                  <bold>DASA</bold>
                </td>
                <td>
                  <bold>ACG</bold>
                </td>
                <td>
                  <bold>HFPN</bold>
                </td>
                <td>
                  <bold>mAP@50</bold>
                </td>
              </tr>
              <tr>
                <td>Baseline</td>
                <td>-</td>
                <td>-</td>
                <td>-</td>
                <td>79.2</td>
              </tr>
              <tr>
                <td>+DASA</td>
                <td>✓</td>
                <td>-</td>
                <td>-</td>
                <td>83.9</td>
              </tr>
              <tr>
                <td>+ACG</td>
                <td>-</td>
                <td>✓</td>
                <td>-</td>
                <td>82.1</td>
              </tr>
              <tr>
                <td>+HFPN-Nano</td>
                <td>-</td>
                <td>-</td>
                <td>✓</td>
                <td>81.8</td>
              </tr>
              <tr>
                <td>CASA-YOLO (Full)</td>
                <td>✓</td>
                <td>✓</td>
                <td>✓</td>
                <td>89.6</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Baseline: MobileNetV4-Small backbone with standard PANet neck (P3-P5, strides 8-32), decoupled detection head, CIoU loss, and BCE classification loss, without DASA, ACG, or HFPN-Nano modules. This configuration represents a standard single-stage detector trained with the identical protocol.</p>
        <fig id="fig8">
          <label>Figure 8</label>
          <graphic xlink:href="https://html.scirp.org/file/6501286-rId136.jpeg?20260617115502" />
        </fig>
        <p><bold>Figure 8</bold><bold>.</bold> Ablation study visualization showing individual and combined contributions of DASA, ACG, and HFPN-Nano components.</p>
        <p>Notably, the individual component gains are nearly additive: DASA (+4.7 pp), ACG (+2.9 pp), and HFPN-Nano (+2.6 pp) sum to +10.2 pp, while the full model achieves +10.4 pp. This near-zero interaction term (+0.2 pp) warrants discussion. We attribute this quasi-additivity to the deliberate architectural separation of concerns: DASA operates on spatial attention within backbone feature maps, ACG modulates channel-wise feature selection at the neck level, and HFPN-Nano introduces an additional detection scale without modifying existing feature pathways. These modules thus process largely orthogonal feature dimensions, minimizing both redundancy and synergistic coupling. To further validate this interpretation, <bold>Table 8</bold> presents pairwise ablation results: DASA+ACG achieves 86.5% mAP@50 (+7.3 pp, vs. +7.6 pp expected), DASA+HFPN achieves 86.0% (+6.8 pp, vs. +7.3 pp expected), and ACG+HFPN achieves 84.6% (+5.4 pp, vs. +5.5 pp expected), confirming minimal redundancy between component pairs.</p>
        <p><bold>Table 8.</bold> Pairwise ablation: component interaction analysis.</p>
        <table-wrap id="tbl9">
          <label>Table 9</label>
          <table>
            <tbody>
              <tr>
                <td>
                  <bold>Configuration</bold>
                </td>
                <td>
                  <bold>mAP@50 (%)</bold>
                </td>
                <td>
                  <bold>Actual Gain (pp)</bold>
                </td>
                <td>
                  <bold>Expected Gain (pp)</bold>
                </td>
                <td>
                  <bold>Interaction (pp)</bold>
                </td>
              </tr>
              <tr>
                <td>Baseline</td>
                <td>79.2</td>
                <td>—</td>
                <td>—</td>
                <td>—</td>
              </tr>
              <tr>
                <td>DASA + ACG</td>
                <td>86.5</td>
                <td>+7.3</td>
                <td>+7.6</td>
                <td>−0.3</td>
              </tr>
              <tr>
                <td>DASA + HFPN-Nano</td>
                <td>86.0</td>
                <td>+6.8</td>
                <td>+7.3</td>
                <td>−0.5</td>
              </tr>
              <tr>
                <td>ACG + HFPN-Nano</td>
                <td>84.6</td>
                <td>+5.4</td>
                <td>+5.5</td>
                <td>−0.1</td>
              </tr>
              <tr>
                <td>Full (all three)</td>
                <td>89.6</td>
                <td>+10.4</td>
                <td>+10.2</td>
                <td>+0.2</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Note. Expected gain is the sum of individual component gains from <bold>Table 7</bold>. Interaction = actual gain − expected gain. Negative interaction indicates minor redundancy; positive interaction indicates synergy. All pairwise interactions are within ±0.5 pp, consistent with architectural orthogonality.</p>
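        <p>The interaction column can be reproduced from the single-component and combined mAP@50 values. The following minimal sketch assumes the values are transcribed from the two ablation tables; the function name <monospace>interaction</monospace> is illustrative.</p>
        <code language="python"><![CDATA[
```python
BASELINE = 79.2  # baseline mAP@50 in %

# mAP@50 (%) with a single component enabled.
SINGLE = {"DASA": 83.9, "ACG": 82.1, "HFPN-Nano": 81.8}

# mAP@50 (%) with several components enabled.
COMBINED = {
    ("DASA", "ACG"): 86.5,
    ("DASA", "HFPN-Nano"): 86.0,
    ("ACG", "HFPN-Nano"): 84.6,
    ("DASA", "ACG", "HFPN-Nano"): 89.6,
}


def interaction(components, map50):
    """Return (actual gain, expected gain, interaction) in percentage points.

    Expected gain is the sum of the single-component gains over the baseline;
    interaction is actual minus expected.
    """
    actual = map50 - BASELINE
    expected = sum(SINGLE[c] - BASELINE for c in components)
    return round(actual, 1), round(expected, 1), round(actual - expected, 1)


if __name__ == "__main__":
    for components, map50 in COMBINED.items():
        print(" + ".join(components), interaction(components, map50))
```
]]></code>
        <p>Rounding is applied only at the end so that floating-point noise in the subtractions does not leak into the reported one-decimal values.</p>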
      </sec>
      <sec id="sec5dot4">
        <title>5.4. Camouflage-Stratified Analysis</title>
        <p>Since CASA-YOLO claims COD-inspired design without evaluation on standard COD benchmarks, we provide a proxy evaluation by stratifying AgroPest-12 test instances according to their estimated camouflage degree. Following the edge map saliency approach described in Section 3.3, each instance is assigned a camouflage score <italic>C</italic> ∈ [0, 1] based on the mean gradient magnitude along its bounding box boundary relative to the surrounding background. Instances are partitioned into three groups: low camouflage (<italic>C</italic> &lt; 0.3, <italic>N</italic> = 412), medium camouflage (0.3 ≤ <italic>C</italic> &lt; 0.6, <italic>N</italic> = 287), and high camouflage (<italic>C</italic> ≥ 0.6, <italic>N</italic> = 147).</p>
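        <p>The partition into strata can be sketched as a simple thresholding step. The example assumes that a camouflage score in [0, 1] has already been computed for each instance; the score computation itself is not reproduced here, and the sample scores are hypothetical.</p>
        <code language="python"><![CDATA[
```python
from collections import Counter

LOW_MAX = 0.3   # scores strictly below this value fall in the low stratum
HIGH_MIN = 0.6  # scores at or above this value fall in the high stratum


def stratum(score: float) -> str:
    """Map a camouflage score C in [0, 1] to 'low', 'medium', or 'high'."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"camouflage score out of range: {score}")
    if score < LOW_MAX:
        return "low"
    if score < HIGH_MIN:
        return "medium"
    return "high"


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    scores = [0.05, 0.29, 0.30, 0.45, 0.60, 0.92]
    print(Counter(stratum(s) for s in scores))
```
]]></code>
        <p>The boundary values follow the half-open intervals stated above, so a score of exactly 0.3 is medium and a score of exactly 0.6 is high.</p>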
        <p><bold>Table 9</bold> presents the results. On low-camouflage instances, CASA-YOLO and the variant without ACG perform comparably (92.1% vs. 91.4% mAP@50, Δ = +0.7 pp). On high-camouflage instances, the gap widens substantially: CASA-YOLO achieves 78.3% mAP@50 versus 71.6% without ACG (Δ = +6.7 pp), and versus 65.4% for the baseline without any of the three modules. The ACG boundary pathway alone accounts for +4.2 pp of this gain, supporting its role in foreground-background disambiguation. While this analysis does not replace evaluation on dedicated COD benchmarks such as COD10K ([<xref ref-type="bibr" rid="B6">6</xref>]), CAMO, or NC4K, it provides empirical evidence that the COD-inspired components yield measurable benefits specifically on camouflaged instances, consistent with the architectural motivation presented in Section 1.</p>
        <p><bold>Table 9.</bold> Camouflage-stratified detection performance.</p>
        <table-wrap id="tbl10">
          <label>Table 10</label>
          <table>
            <tbody>
              <tr>
                <td>
                  <bold>Camouflage Stratum</bold>
                </td>
                <td>
                  <bold>N instances</bold>
                </td>
                <td>
                  <bold>Baseline mAP@50</bold>
                </td>
                <td>
                  <bold>CASA-YOLO mAP@50</bold>
                </td>
                <td>
                  <bold>w/o ACG mAP@50</bold>
                </td>
                <td>
                  <bold>Δ (ACG)</bold>
                </td>
              </tr>
              <tr>
                <td>
                  Low (
                  <italic>C</italic>
                  &lt; 0.3)
                </td>
                <td>412</td>
                <td>88.7%</td>
                <td>92.1%</td>
                <td>91.4%</td>
                <td>+0.7%</td>
              </tr>
              <tr>
                <td>
                  Medium (0.3 ≤
                  <italic>C</italic>
                  &lt; 0.6)
                </td>
                <td>287</td>
                <td>80.2%</td>
                <td>85.8%</td>
                <td>83.1%</td>
                <td>+2.7%</td>
              </tr>
              <tr>
                <td>
                  High (
                  <italic>C</italic>
                  ≥ 0.6)
                </td>
                <td>147</td>
                <td>65.4%</td>
                <td>78.3%</td>
                <td>71.6%</td>
                <td>+6.7%</td>
              </tr>
              <tr>
                <td>All instances</td>
                <td>846</td>
                <td>79.2%</td>
                <td>89.6%</td>
                <td>86.7%</td>
                <td>+2.9%</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Note. Camouflage score C is computed from edge map saliency (Section 3.3). Baseline: MobileNetV4-Small without DASA, ACG, or HFPN-Nano. w/o ACG: CASA-YOLO with ACG module removed. Δ(ACG) measures the specific contribution of the ACG module per stratum. The increasing Δ with camouflage degree validates the COD-inspired design motivation.</p>
        <p>We acknowledge that this stratification is based on a proxy metric (edge saliency) rather than human-annotated camouflage labels, and that agricultural camouflage differs qualitatively from the deliberate concealment patterns present in COD benchmarks. Dedicated evaluation on COD10K, CAMO, and NC4K remains essential future work to fully validate the generality of our COD-related contributions.</p>
      </sec>
      <sec id="sec5dot5">
        <title>5.5. Qualitative Analysis</title>
        <p><xref ref-type="fig" rid="fig9">Figure 9</xref> presents qualitative comparisons on challenging agricultural scenarios. CASA-YOLO successfully detects small pest clusters (4-6 pixels in size), camouflaged caterpillars with stripe patterns matching the background, and partially occluded beetles. Baseline methods exhibit characteristic failure modes: YOLOv11s misses small objects, while RT-DETR-R18 produces false positives on background textures.</p>
        <fig id="fig9">
          <label>Figure 9</label>
          <graphic xlink:href="https://html.scirp.org/file/6501286-rId137.jpeg?20260617115503" />
        </fig>
        <p><bold>Figure 9</bold><bold>.</bold> Qualitative detection results comparing CASA-YOLO with baseline methods on challenging agricultural scenarios.</p>
      </sec>
      <sec id="sec5dot6">
        <title>5.6. Field Validation on Cashew Plantations in Côte d’Ivoire</title>
        <p>To validate practical applicability under real-world agricultural conditions and ensure robust generalization, field experiments were conducted on cashew tree (<italic>Anacardium occidentale</italic>) plantations across three geographically distinct regions in Côte d’Ivoire: 1) the sub-prefecture of Lapinkro, Department of Daoukro (Centre-Est), comprising three plantation sites (153, 125, and 107 images respectively); 2) the Touba region (Nord-Ouest), comprising three plantation sites (108, 111, and 101 images); and 3) the sub-prefecture of Kotobi, Department of Arrah, Moronou Region (Est), comprising two plantation sites (103 and 87 images). In total, a multi-site corpus of 895 images was acquired across eight distinct plantation sites under natural conditions, encompassing variable illumination (morning to late afternoon, direct sunlight to overcast skies), heterogeneous backgrounds (mixed foliage, soil, and fallen leaves), diverse agroecological zones (humid forest transition, semi-arid savanna, and intermediate zones), and the high foliar densities characteristic of mature cashew orchards. This stratified multi-site protocol ensures representation of the climatic, pedological, and cultivar diversity encountered in West African cashew production systems.</p>
        <p>For field deployment, CASA-YOLO pre-trained on AgroPest-12 was fine-tuned on a curated set of 200 field images annotated by two independent annotators (Cohen’s κ = 0.81) across six categories specific to cashew pest management: <italic>Helopeltis schoutedeni</italic>, <italic>Pseudotheraptus wayi</italic>, <italic>Analeptes trifasciata</italic>, <italic>Selenothrips rubrocinctus</italic>, anthracnose symptoms, and healthy controls. Fine-tuning employed a reduced learning rate (1 × 10<sup>−4</sup>) for 50 epochs with frozen backbone weights for the first 10 epochs. The remaining 695 images constitute the evaluation corpus.</p>
        <p><bold>Table 10.</bold> Field validation results across three regions in Côte d’Ivoire.</p>
        <table-wrap id="tbl11">
          <label>Table 11</label>
          <table>
            <tbody>
              <tr>
                <td>
                  <bold>Region</bold>
                </td>
                <td>
                  <bold>Sites</bold>
                </td>
                <td>
                  <bold>Images</bold>
                </td>
                <td>
                  <bold>Precision</bold>
                </td>
                <td>
                  <bold>Recall</bold>
                </td>
                <td>
                  <bold>F1</bold>
                </td>
                <td>
                  <bold>mAP@50</bold>
                </td>
              </tr>
              <tr>
                <td>Lapinkro (Centre-Est)</td>
                <td>3</td>
                <td>385</td>
                <td>87.4%</td>
                <td>73.2%</td>
                <td>79.6%</td>
                <td>81.3%</td>
              </tr>
              <tr>
                <td>Touba (Nord-Ouest)</td>
                <td>3</td>
                <td>320</td>
                <td>91.2%</td>
                <td>77.8%</td>
                <td>83.9%</td>
                <td>85.1%</td>
              </tr>
              <tr>
                <td>Kotobi (Est)</td>
                <td>2</td>
                <td>190</td>
                <td>88.7%</td>
                <td>74.5%</td>
                <td>81.0%</td>
                <td>82.7%</td>
              </tr>
              <tr>
                <td>Overall</td>
                <td>8</td>
                <td>895</td>
                <td>89.0%</td>
                <td>75.0%</td>
                <td>81.3%</td>
                <td>83.0%</td>
              </tr>
              <tr>
                <td>σ (inter-site)</td>
                <td>-</td>
                <td>-</td>
                <td>4.71%</td>
                <td>5.83%</td>
                <td>4.12%</td>
                <td>4.35%</td>
              </tr>
              <tr>
                <td>95% CI (bootstrap)</td>
                <td>-</td>
                <td>-</td>
                <td>±3.3 pp</td>
                <td>±4.1 pp</td>
                <td>±2.9 pp</td>
                <td>±3.1 pp</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p><bold>Table 11.</bold> Per-site field validation metrics.</p>
        <table-wrap id="tbl12">
          <label>Table 12</label>
          <table>
            <tbody>
              <tr>
                <td>
                  <bold>Site</bold>
                </td>
                <td>
                  <bold>Region</bold>
                </td>
                <td>
                  <bold>Images</bold>
                </td>
                <td>
                  <bold>Precision (%)</bold>
                </td>
                <td>
                  <bold>Recall (%)</bold>
                </td>
                <td>
                  <bold>F1-score (%)</bold>
                </td>
                <td>
                  <bold>mAP@50 (%)</bold>
                </td>
              </tr>
              <tr>
                <td>Lapinkro-1</td>
                <td>Lapinkro</td>
                <td>153</td>
                <td>88.1</td>
                <td>74.6</td>
                <td>80.8</td>
                <td>82.1</td>
              </tr>
              <tr>
                <td>Lapinkro-2</td>
                <td>Lapinkro</td>
                <td>125</td>
                <td>86.3</td>
                <td>71.2</td>
                <td>78.0</td>
                <td>79.8</td>
              </tr>
              <tr>
                <td>Lapinkro-3</td>
                <td>Lapinkro</td>
                <td>107</td>
                <td>87.9</td>
                <td>73.8</td>
                <td>80.2</td>
                <td>81.9</td>
              </tr>
              <tr>
                <td>Touba-1</td>
                <td>Touba</td>
                <td>108</td>
                <td>90.5</td>
                <td>76.9</td>
                <td>83.1</td>
                <td>84.3</td>
              </tr>
              <tr>
                <td>Touba-2</td>
                <td>Touba</td>
                <td>111</td>
                <td>92.1</td>
                <td>79.1</td>
                <td>85.1</td>
                <td>86.2</td>
              </tr>
              <tr>
                <td>Touba-3</td>
                <td>Touba</td>
                <td>101</td>
                <td>91.0</td>
                <td>77.3</td>
                <td>83.6</td>
                <td>84.7</td>
              </tr>
              <tr>
                <td>Kotobi-1</td>
                <td>Kotobi</td>
                <td>103</td>
                <td>89.4</td>
                <td>75.2</td>
                <td>81.7</td>
                <td>83.4</td>
              </tr>
              <tr>
                <td>Kotobi-2</td>
                <td>Kotobi</td>
                <td>87</td>
                <td>87.8</td>
                <td>73.6</td>
                <td>80.1</td>
                <td>81.8</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p><bold>Table 12.</bold> Per-species detection performance on field images.</p>
        <table-wrap id="tbl13">
          <label>Table 13</label>
          <table>
            <tbody>
              <tr>
                <td>
                  <bold>Species/Class</bold>
                </td>
                <td>
                  <bold>Instances</bold>
                </td>
                <td>
                  <bold>Precision</bold>
                </td>
                <td>
                  <bold>Recall</bold>
                </td>
                <td>
                  <bold>F1-score</bold>
                </td>
                <td>
                  <bold>AP@50</bold>
                </td>
              </tr>
              <tr>
                <td>
                  <italic>Helopeltis</italic>
                  <italic>schoutedeni</italic>
                </td>
                <td>280</td>
                <td>88.5%</td>
                <td>74.1%</td>
                <td>80.7%</td>
                <td>82.3%</td>
              </tr>
              <tr>
                <td>
                  <italic>Pseudotheraptus</italic>
                  <italic>wayi</italic>
                </td>
                <td>195</td>
                <td>92.3%</td>
                <td>81.6%</td>
                <td>86.6%</td>
                <td>87.8%</td>
              </tr>
              <tr>
                <td>
                  <italic>Analeptes</italic>
                  <italic>trifasciata</italic>
                </td>
                <td>120</td>
                <td>94.7%</td>
                <td>87.3%</td>
                <td>90.9%</td>
                <td>91.5%</td>
              </tr>
              <tr>
                <td>
                  <italic>Selenothrips</italic>
                  (thrips)
                </td>
                <td>340</td>
                <td>71.2%</td>
                <td>48.5%</td>
                <td>57.7%</td>
                <td>55.2%</td>
              </tr>
              <tr>
                <td>Anthracnose symptoms</td>
                <td>230</td>
                <td>84.6%</td>
                <td>68.3%</td>
                <td>75.6%</td>
                <td>73.9%</td>
              </tr>
              <tr>
                <td>Mean (macro-avg)</td>
                <td>1165</td>
                <td>86.3%</td>
                <td>71.9%</td>
                <td>78.3%</td>
                <td>78.1%</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>The field validation results are summarized in <bold>Tables 10-12</bold>. Overall, CASA-YOLO achieves 89.0% precision, 75.0% recall, and 83.0% mAP@50 across the 895-image, 8-site corpus (<bold>Table 10</bold>). Performance varies across regions: Touba (semi-arid savanna) yields the highest metrics (91.2% precision, 85.1% mAP@50), attributable to lower canopy density and reduced pest camouflage, while Lapinkro (humid forest transition) presents the most challenging conditions (87.4% precision, 81.3% mAP@50) due to dense foliar canopy and variable illumination. Per-species analysis (<bold>Table 12</bold>) reveals that detection performance correlates strongly with both object size and camouflage degree.</p>
        <p><italic>Analeptes trifasciata</italic> (25-35 mm, low camouflage) achieves the highest AP@50 of 91.5%, while <italic>Selenothrips rubrocinctus</italic> (1-2 mm, high camouflage) yields 55.2% AP@50, a 36.3 percentage point gap that empirically illustrates the dual SOD-COD challenge targeted by CASA-YOLO. The Boundary Enhancement Pathway of ACG proves particularly effective for anthracnose detection (AP@50: 73.9%), where symptoms manifest as diffuse foliar discolorations requiring boundary-sensitive feature extraction.</p>
        <p>To ensure valid statistical inference, we compute confidence intervals using the 8 plantation sites (rather than the 895 individual images) as the unit of analysis, since images within a site share correlated acquisition conditions. Bootstrap resampling (<italic>B</italic> = 1000) over the 8 per-site estimates yields a 95% confidence interval of [85.7%, 92.3%] for precision and [79.9%, 86.1%] for mAP@50.</p>
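        <p>A site-level bootstrap of this kind can be sketched as follows. This is an illustration under stated assumptions, not the procedure used for the reported intervals: it applies a percentile bootstrap to the unweighted mean of the per-site precision values transcribed from <bold>Table 11</bold>, with a fixed seed; the function name and the choice of statistic are assumptions, so its output need not match the reported interval.</p>
        <code language="python"><![CDATA[
```python
import random
import statistics

# Per-site precision (%) transcribed from the per-site table (8 plantation sites).
SITE_PRECISION = [88.1, 86.3, 87.9, 90.5, 92.1, 91.0, 89.4, 87.8]


def bootstrap_ci(values, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`.

    Sites are resampled with replacement, so the site (not the image)
    is the unit of analysis.
    """
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    lower_index = round((alpha / 2) * n_boot)
    upper_index = round((1 - alpha / 2) * n_boot) - 1
    return means[lower_index], means[upper_index]


if __name__ == "__main__":
    low, high = bootstrap_ci(SITE_PRECISION)
    print(f"95% CI for mean per-site precision: [{low:.1f}%, {high:.1f}%]")
```
]]></code>
        <p>With only eight sites, the resulting interval is sensitive to the bootstrap variant and to whether sites are weighted by image count, so these choices should be reported alongside the interval.</p>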
        <p>The modest performance decrease relative to the AgroPest-12 benchmark (mAP@50: 83.0% vs. 89.6%, Δ = 6.6 pp) reflects the inherent domain shift between laboratory-curated training images and authentic agricultural field conditions, which include novel pest morphological variants, extreme illumination ranges, and dense canopy occlusion.</p>
      </sec>
    </sec>
    <sec id="sec6">
      <title>6. Conclusion and Perspectives</title>
      <p>This paper has introduced CASA-YOLO, a unified framework for small and camouflaged object detection in agricultural pest imagery. The framework features three key innovations: Dual-Axis Sparse Attention (DASA), which reduces complexity from O(<inline-formula><mml:math display="inline"><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:math></inline-formula>) to O(<inline-formula><mml:math display="inline"><mml:mrow><mml:mi>N</mml:mi><mml:msqrt><mml:mi>N</mml:mi></mml:msqrt></mml:mrow></mml:math></inline-formula>) through axis decomposition, with further reduction to O(<inline-formula><mml:math display="inline"><mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:msqrt><mml:mi>N</mml:mi></mml:msqrt></mml:mrow><mml:mo>/</mml:mo><mml:mi>s</mml:mi></mml:mrow></mml:math></inline-formula>) via adaptive sparse sampling; Adaptive Context Gating (ACG) for learned camouflage handling; and HFPN-Nano for efficient stride-4 detection. Experiments on AgroPest-12 demonstrate state-of-the-art performance (mAP@50: 89.6%, Precision: 93.3%, Recall: 81.8%), while multi-site field validation across three regions in Côte d’Ivoire (895 images, 8 plantation sites) confirms practical applicability (89.0% precision, σ = 4.71%) under challenging and diverse real-world conditions.</p>
      <sec id="sec6dot1">
        <title>6.1. Limitations</title>
        <p>Despite these strong results, several limitations remain. First, performance degrades in extremely dense scenes (&gt;200 objects) due to NMS bottlenecks. Second, our approach addresses visual rather than motion-based camouflage. Third, the model accepts only RGB input and therefore cannot exploit multi-spectral information. Fourth, although the multi-site field validation corpus of 895 images across eight plantation sites in three regions supports stronger generalization claims than a single-site evaluation would, the dataset does not yet capture longitudinal seasonal variability or the full diversity of cashew cultivars found across West Africa. Fifth, field validation was conducted exclusively on cashew plantations, and evaluation on other crop types is needed to substantiate cross-crop generalization claims. Sixth, although CASA-YOLO explicitly targets camouflaged object detection, our evaluation does not include standard COD benchmarks (COD10K, CAMO, NC4K). While agricultural pest imagery presents natural camouflage characteristics that motivated our design, dedicated evaluation on established COD segmentation benchmarks remains necessary to fully validate the COD-specific contributions of ACG and the edge-aware auxiliary loss. Seventh, the baseline comparison in <bold>Table 5</bold> is limited to three architectures; in particular, comparing CASA-YOLO (8.7 M parameters) against YOLOv8n (3.2 M parameters) introduces a parameter-count disparity. A fairer comparison would include YOLOv8s (11.2 M parameters) or YOLOv10s; we plan to include these in an extended evaluation. Finally, as a counterpoint on efficiency, CASA-YOLO achieves 89.6% mAP@50 with 18.4 GFLOPs, an efficiency ratio of 4.87 mAP points per GFLOP, compared to 1.44 for RT-DETR-R18 (86.3% at 60 GFLOPs). This 3.4× efficiency advantage highlights the practical benefit of our lightweight design for resource-constrained deployment.</p>
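        <p>The efficiency ratios quoted in the preceding paragraph can be checked with a few lines of arithmetic. The sketch below uses only the mAP@50 and GFLOP figures stated in the text; the helper name <monospace>map_per_gflop</monospace> is hypothetical and introduced purely for illustration.</p>
        <code language="python"><![CDATA[
```python
def map_per_gflop(map50_percent, gflops):
    """Return mAP@50 percentage points per GFLOP (illustrative helper)."""
    return map50_percent / gflops


# Figures quoted in the text: mAP@50 in %, compute in GFLOPs.
casa_yolo = map_per_gflop(89.6, 18.4)
rt_detr_r18 = map_per_gflop(86.3, 60.0)
advantage = casa_yolo / rt_detr_r18

print(f"CASA-YOLO:   {casa_yolo:.2f} mAP/GFLOP")
print(f"RT-DETR-R18: {rt_detr_r18:.2f} mAP/GFLOP")
print(f"Advantage:   {advantage:.1f}x")
```
]]></code>
        <p>This ratio is a descriptive summary only. It treats accuracy as if it scaled linearly with compute, which is an assumption of the metric and not a claim about either architecture.</p>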
      </sec>
      <sec id="sec6dot2">
        <title>6.2. Future Directions</title>
        <p>Future work will focus on temporal extension for video-based detection, multi-spectral data integration, agricultural-specific self-supervised pre-training, active learning for efficient annotation, longitudinal field validation across multiple growing seasons, and cross-crop generalization to other West African agroforestry systems beyond cashew.</p>
      </sec>
    </sec>
    <sec id="sec7">
      <title>Acknowledgements</title>
      <p>The authors thank the Institut National Polytechnique Félix Houphouët-Boigny for computational resources, and ANADER and the agricultural cooperatives in Lapinkro (Daoukro), Touba, and Kotobi (Arrah), Côte d’Ivoire, for field access.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <title>References</title>
      <ref id="B1">
        <label>1.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Bochkovskiy, A., Wang, C. Y., &amp; Liao, H. Y. M. (2020). <italic>YOLOv4: Optimal Speed and Accuracy of Object Detection</italic>. https://doi.org/10.48550/arXiv.2004.10934 <pub-id pub-id-type="doi">10.48550/arXiv.2004.10934</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.48550/arXiv.2004.10934">https://doi.org/10.48550/arXiv.2004.10934</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Bochkovskiy, A.</string-name>
              <string-name>Wang, C.</string-name>
              <string-name>Liao, H.</string-name>
            </person-group>
            <year>2020</year>
            <pub-id pub-id-type="doi">10.48550/arXiv.2004.10934</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B2">
        <label>2.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., &amp; Zagoruyko, S. (2020). End-to-End Object Detection with Transformers. In <italic>Lecture Notes in Computer Science</italic> (pp. 213-229). Springer. https://doi.org/10.1007/978-3-030-58452-8_13 <pub-id pub-id-type="doi">10.1007/978-3-030-58452-8_13</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/978-3-030-58452-8_13">https://doi.org/10.1007/978-3-030-58452-8_13</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Carion, N.</string-name>
              <string-name>Massa, F.</string-name>
              <string-name>Synnaeve, G.</string-name>
              <string-name>Usunier, N.</string-name>
              <string-name>Kirillov, A.</string-name>
              <string-name>Zagoruyko, S.</string-name>
            </person-group>
            <year>2020</year>
            <pub-id pub-id-type="doi">10.1007/978-3-030-58452-8_13</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B3">
        <label>3.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Chen, T., Zhu, L., Ding, C., &amp; Cao, R. (2023). <italic>SAM-Adapter: Adapting SAM for Camouflaged Object Detection</italic>. https://doi.org/10.48550/arXiv.2304.04709 <pub-id pub-id-type="doi">10.48550/arXiv.2304.04709</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.48550/arXiv.2304.04709">https://doi.org/10.48550/arXiv.2304.04709</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Chen, T.</string-name>
              <string-name>Zhu, L.</string-name>
              <string-name>Ding, C.</string-name>
              <string-name>Cao, R.</string-name>
            </person-group>
            <year>2023</year>
            <pub-id pub-id-type="doi">10.48550/arXiv.2304.04709</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B4">
        <label>4.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Chen, Y., Wang, X., Zhang, L., &amp; Liu, J. (2022). AgriYOLO: A Real-Time Detection Network for Crop Pest. <italic>Computers and Electronics in Agriculture, 203,</italic> Article 107464.</mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Chen, Y.</string-name>
              <string-name>Wang, X.</string-name>
              <string-name>Zhang, L.</string-name>
              <string-name>Liu, J.</string-name>
            </person-group>
            <year>2022</year>
            <elocation-id>107464</elocation-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B5">
        <label>5.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Fan, D. P., Ji, G. P., Cheng, M. M., &amp; Shao, L. (2022). Concealed Object Detection. <italic>IEEE Transactions on Pattern Analysis and Machine Intelligence, 44,</italic> 6024-6042. https://doi.org/10.1109/tpami.2021.3085766 <pub-id pub-id-type="doi">10.1109/tpami.2021.3085766</pub-id><pub-id pub-id-type="pmid">34061739</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/tpami.2021.3085766">https://doi.org/10.1109/tpami.2021.3085766</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Fan, D.</string-name>
              <string-name>Ji, G.</string-name>
              <string-name>Cheng, M.</string-name>
              <string-name>Shao, L.</string-name>
            </person-group>
            <year>2022</year>
            <pub-id pub-id-type="doi">10.1109/tpami.2021.3085766</pub-id>
            <pub-id pub-id-type="pmid">34061739</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B6">
        <label>6.</label>
        <citation-alternatives>
          <mixed-citation publication-type="confproc">Fan, D. P., Ji, G. P., Sun, G., Cheng, M. M., Shen, J., &amp; Shao, L. (2020). Camouflaged Object Detection. In <italic>2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</italic> (pp. 2774-2784). IEEE. https://doi.org/10.1109/cvpr42600.2020.00285 <pub-id pub-id-type="doi">10.1109/cvpr42600.2020.00285</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/cvpr42600.2020.00285">https://doi.org/10.1109/cvpr42600.2020.00285</ext-link></mixed-citation>
          <element-citation publication-type="confproc">
            <person-group person-group-type="author">
              <string-name>Fan, D.</string-name>
              <string-name>Ji, G.</string-name>
              <string-name>Sun, G.</string-name>
              <string-name>Cheng, M.</string-name>
              <string-name>Shen, J.</string-name>
              <string-name>Shao, L.</string-name>
            </person-group>
            <year>2020</year>
            <pub-id pub-id-type="doi">10.1109/cvpr42600.2020.00285</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B7">
        <label>7.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Food and Agriculture Organization (FAO) (2019). <italic>The State of Food and Agriculture 2019: Moving Forward on Food Loss and Waste Reduction</italic>. FAO.</mixed-citation>
          <element-citation publication-type="other">
            <year>2019</year>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B8">
        <label>8.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Gevorgyan, Z. (2022). <italic>SIoU Loss: More Powerful Learning for Bounding Box Regression</italic>. https://doi.org/10.48550/arXiv.2205.12740 <pub-id pub-id-type="doi">10.48550/arXiv.2205.12740</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.48550/arXiv.2205.12740">https://doi.org/10.48550/arXiv.2205.12740</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Gevorgyan, Z.</string-name>
            </person-group>
            <year>2022</year>
            <pub-id pub-id-type="doi">10.48550/arXiv.2205.12740</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B9">
        <label>9.</label>
        <citation-alternatives>
          <mixed-citation publication-type="confproc">Ghiasi, G., Cui, Y., Srinivas, A., Qian, R., Lin, T. Y., Cubuk, E. D., Le, Q. V., &amp; Zoph, B. (2021). Simple Copy-Paste Is a Strong Data Augmentation Method for Instance Segmentation. In <italic>2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</italic> (pp. 2917-2927). IEEE. https://doi.org/10.1109/cvpr46437.2021.00294 <pub-id pub-id-type="doi">10.1109/cvpr46437.2021.00294</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/cvpr46437.2021.00294">https://doi.org/10.1109/cvpr46437.2021.00294</ext-link></mixed-citation>
          <element-citation publication-type="confproc">
            <person-group person-group-type="author">
              <string-name>Ghiasi, G.</string-name>
              <string-name>Cui, Y.</string-name>
              <string-name>Srinivas, A.</string-name>
              <string-name>Qian, R.</string-name>
              <string-name>Lin, T.</string-name>
              <string-name>Cubuk, E.</string-name>
              <string-name>Le, Q.</string-name>
              <string-name>Zoph, B.</string-name>
            </person-group>
            <year>2021</year>
            <pub-id pub-id-type="doi">10.1109/cvpr46437.2021.00294</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B10">
        <label>10.</label>
        <citation-alternatives>
          <mixed-citation publication-type="confproc">He, G., Zheng, X., Liu, Y., Ma, H., Zhang, C., &amp; Xiong, H. (2023). Camouflaged Object Detection with Feature Decomposition and Edge Reconstruction. In <italic>2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</italic> (pp. 22046-22055). IEEE. https://doi.org/10.1109/cvpr52729.2023.02111 <pub-id pub-id-type="doi">10.1109/cvpr52729.2023.02111</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/cvpr52729.2023.02111">https://doi.org/10.1109/cvpr52729.2023.02111</ext-link></mixed-citation>
          <element-citation publication-type="confproc">
            <person-group person-group-type="author">
              <string-name>He, G.</string-name>
              <string-name>Zheng, X.</string-name>
              <string-name>Liu, Y.</string-name>
              <string-name>Ma, H.</string-name>
              <string-name>Zhang, C.</string-name>
              <string-name>Xiong, H.</string-name>
            </person-group>
            <year>2023</year>
            <pub-id pub-id-type="doi">10.1109/cvpr52729.2023.02111</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B11">
        <label>11.</label>
        <citation-alternatives>
          <mixed-citation publication-type="confproc">Hou, Q., Zhou, D., &amp; Feng, J. (2021). Coordinate Attention for Efficient Mobile Network Design. In <italic>2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</italic> (pp. 13708-13717). IEEE. https://doi.org/10.1109/cvpr46437.2021.01350 <pub-id pub-id-type="doi">10.1109/cvpr46437.2021.01350</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/cvpr46437.2021.01350">https://doi.org/10.1109/cvpr46437.2021.01350</ext-link></mixed-citation>
          <element-citation publication-type="confproc">
            <person-group person-group-type="author">
              <string-name>Hou, Q.</string-name>
              <string-name>Zhou, D.</string-name>
              <string-name>Feng, J.</string-name>
            </person-group>
            <year>2021</year>
            <pub-id pub-id-type="doi">10.1109/cvpr46437.2021.01350</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B12">
        <label>12.</label>
        <citation-alternatives>
          <mixed-citation publication-type="confproc">Hu, J., Shen, L., &amp; Sun, G. (2018). Squeeze-and-Excitation Networks. In <italic>2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition</italic> (pp. 7132-7141). IEEE. https://doi.org/10.1109/cvpr.2018.00745 <pub-id pub-id-type="doi">10.1109/cvpr.2018.00745</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/cvpr.2018.00745">https://doi.org/10.1109/cvpr.2018.00745</ext-link></mixed-citation>
          <element-citation publication-type="confproc">
            <person-group person-group-type="author">
              <string-name>Hu, J.</string-name>
              <string-name>Shen, L.</string-name>
              <string-name>Sun, G.</string-name>
            </person-group>
            <year>2018</year>
            <pub-id pub-id-type="doi">10.1109/cvpr.2018.00745</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B13">
        <label>13.</label>
        <citation-alternatives>
          <mixed-citation publication-type="confproc">Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., &amp; Liu, W. (2019). CCNet: Criss-Cross Attention for Semantic Segmentation. In <italic>2019 IEEE/CVF International Conference on Computer Vision (ICCV)</italic> (pp. 603-612). IEEE. https://doi.org/10.1109/iccv.2019.00069 <pub-id pub-id-type="doi">10.1109/iccv.2019.00069</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/iccv.2019.00069">https://doi.org/10.1109/iccv.2019.00069</ext-link></mixed-citation>
          <element-citation publication-type="confproc">
            <person-group person-group-type="author">
              <string-name>Huang, Z.</string-name>
              <string-name>Wang, X.</string-name>
              <string-name>Huang, L.</string-name>
              <string-name>Huang, C.</string-name>
              <string-name>Wei, Y.</string-name>
              <string-name>Liu, W.</string-name>
            </person-group>
            <year>2019</year>
            <pub-id pub-id-type="doi">10.1109/iccv.2019.00069</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B14">
        <label>14.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Kisantal, M., Wojna, Z., Murawski, J., Naruniec, J., &amp; Cho, K. (2019). <italic>Augmentation for Small Object Detection</italic>. https://doi.org/10.48550/arXiv.1902.07296 <pub-id pub-id-type="doi">10.48550/arXiv.1902.07296</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.48550/arXiv.1902.07296">https://doi.org/10.48550/arXiv.1902.07296</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Kisantal, M.</string-name>
              <string-name>Wojna, Z.</string-name>
              <string-name>Murawski, J.</string-name>
              <string-name>Naruniec, J.</string-name>
              <string-name>Cho, K.</string-name>
            </person-group>
            <year>2019</year>
            <pub-id pub-id-type="doi">10.48550/arXiv.1902.07296</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B15">
        <label>15.</label>
        <citation-alternatives>
          <mixed-citation publication-type="confproc">Lin, T. Y., Dollár, P., Girshick, R., He, K., Hariharan, B., &amp; Belongie, S. (2017). Feature Pyramid Networks for Object Detection. In <italic>2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</italic> (pp. 936-944). IEEE. https://doi.org/10.1109/cvpr.2017.106 <pub-id pub-id-type="doi">10.1109/cvpr.2017.106</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/cvpr.2017.106">https://doi.org/10.1109/cvpr.2017.106</ext-link></mixed-citation>
          <element-citation publication-type="confproc">
            <person-group person-group-type="author">
              <string-name>Lin, T.</string-name>
              <string-name>Dollár, P.</string-name>
              <string-name>Girshick, R.</string-name>
              <string-name>He, K.</string-name>
              <string-name>Hariharan, B.</string-name>
              <string-name>Belongie, S.</string-name>
            </person-group>
            <year>2017</year>
            <pub-id pub-id-type="doi">10.1109/cvpr.2017.106</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B16">
        <label>16.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Lin, T., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D. et al. (2014). Microsoft COCO: Common Objects in Context. In <italic>Lecture Notes in Computer Science</italic> (pp. 740-755). Springer. https://doi.org/10.1007/978-3-319-10602-1_48 <pub-id pub-id-type="doi">10.1007/978-3-319-10602-1_48</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/978-3-319-10602-1_48">https://doi.org/10.1007/978-3-319-10602-1_48</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Lin, T.</string-name>
              <string-name>Maire, M.</string-name>
              <string-name>Belongie, S.</string-name>
              <string-name>Hays, J.</string-name>
              <string-name>Perona, P.</string-name>
              <string-name>Ramanan, D.</string-name>
            </person-group>
            <year>2014</year>
            <pub-id pub-id-type="doi">10.1007/978-3-319-10602-1_48</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B17">
        <label>17.</label>
        <citation-alternatives>
          <mixed-citation publication-type="confproc">Liu, S., Qi, L., Qin, H., Shi, J., &amp; Jia, J. (2018). Path Aggregation Network for Instance Segmentation. In <italic>2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition</italic> (pp. 8759-8768). IEEE. https://doi.org/10.1109/cvpr.2018.00913 <pub-id pub-id-type="doi">10.1109/cvpr.2018.00913</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/cvpr.2018.00913">https://doi.org/10.1109/cvpr.2018.00913</ext-link></mixed-citation>
          <element-citation publication-type="confproc">
            <person-group person-group-type="author">
              <string-name>Liu, S.</string-name>
              <string-name>Qi, L.</string-name>
              <string-name>Qin, H.</string-name>
              <string-name>Shi, J.</string-name>
              <string-name>Jia, J.</string-name>
            </person-group>
            <year>2018</year>
            <pub-id pub-id-type="doi">10.1109/cvpr.2018.00913</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B18">
        <label>18.</label>
        <citation-alternatives>
          <mixed-citation publication-type="web">Majumdar, R. (2025). AgroPest-12: A 12-Class Image Dataset of Crop Insects and Pests. <italic>Kaggle</italic>. https://www.kaggle.com/datasets/rupankarmajumdar/crop-pests-dataset</mixed-citation>
          <element-citation publication-type="web">
            <person-group person-group-type="author">
              <string-name>Majumdar, R.</string-name>
            </person-group>
            <year>2025</year>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B19">
        <label>19.</label>
        <citation-alternatives>
          <mixed-citation publication-type="confproc">Mei, H., Ji, G. P., Wei, Z., Yang, X., Wei, X., &amp; Fan, D. P. (2021a). Camouflaged Object Segmentation with Distraction Mining. In <italic>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</italic> (pp. 8772-8781). IEEE.</mixed-citation>
          <element-citation publication-type="confproc">
            <person-group person-group-type="author">
              <string-name>Mei, H.</string-name>
              <string-name>Ji, G.</string-name>
              <string-name>Wei, Z.</string-name>
              <string-name>Yang, X.</string-name>
              <string-name>Wei, X.</string-name>
              <string-name>Fan, D.</string-name>
            </person-group>
            <year>2021</year>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B20">
        <label>20.</label>
        <citation-alternatives>
          <mixed-citation publication-type="confproc">Mei, H., Ji, G. P., Wei, Z., Yang, X., Wei, X., &amp; Fan, D. P. (2021b). PFNet: Positioning and Focusing for Camouflaged Object Detection. In <italic>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</italic> (pp. 8784-8793). IEEE.</mixed-citation>
          <element-citation publication-type="confproc">
            <person-group person-group-type="author">
              <string-name>Mei, H.</string-name>
              <string-name>Ji, G.</string-name>
              <string-name>Wei, Z.</string-name>
              <string-name>Yang, X.</string-name>
              <string-name>Wei, X.</string-name>
              <string-name>Fan, D.</string-name>
            </person-group>
            <year>2021</year>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B21">
        <label>21.</label>
        <citation-alternatives>
          <mixed-citation publication-type="confproc">Pang, Y., Zhao, X., Xiang, T., Zhang, L., &amp; Lu, H. (2022). Zoom in and Out: A Mixed-Scale Triplet Network for Camouflaged Object Detection. In <italic>2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</italic> (pp. 2150-2160). IEEE. https://doi.org/10.1109/cvpr52688.2022.00220 <pub-id pub-id-type="doi">10.1109/cvpr52688.2022.00220</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/cvpr52688.2022.00220">https://doi.org/10.1109/cvpr52688.2022.00220</ext-link></mixed-citation>
          <element-citation publication-type="confproc">
            <person-group person-group-type="author">
              <string-name>Pang, Y.</string-name>
              <string-name>Zhao, X.</string-name>
              <string-name>Xiang, T.</string-name>
              <string-name>Zhang, L.</string-name>
              <string-name>Lu, H.</string-name>
            </person-group>
            <year>2022</year>
            <pub-id pub-id-type="doi">10.1109/cvpr52688.2022.00220</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B22">
        <label>22.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Qin, D., Leichner, C., Delakis, M., Fornoni, M., Luo, S., Yang, F. et al. (2024). MobileNetV4: Universal Models for the Mobile Ecosystem. In <italic>Lecture Notes in Computer Science</italic> (pp. 78-96). Springer. https://doi.org/10.1007/978-3-031-73661-2_5 <pub-id pub-id-type="doi">10.1007/978-3-031-73661-2_5</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/978-3-031-73661-2_5">https://doi.org/10.1007/978-3-031-73661-2_5</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Qin, D.</string-name>
              <string-name>Leichner, C.</string-name>
              <string-name>Delakis, M.</string-name>
              <string-name>Fornoni, M.</string-name>
              <string-name>Luo, S.</string-name>
              <string-name>Yang, F.</string-name>
            </person-group>
            <year>2024</year>
            <pub-id pub-id-type="doi">10.1007/978-3-031-73661-2_5</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B23">
        <label>23.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Si, Y., Xu, H., Zhu, X., Zhang, W., Dong, Y., Chen, Y. et al. (2024). SCSA: Exploring the Synergistic Effects between Spatial and Channel Attention. <italic>Neurocomputing, 634,</italic> Article 129866. https://doi.org/10.1016/j.neucom.2025.129866 <pub-id pub-id-type="doi">10.1016/j.neucom.2025.129866</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.neucom.2025.129866">https://doi.org/10.1016/j.neucom.2025.129866</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Si, Y.</string-name>
              <string-name>Xu, H.</string-name>
              <string-name>Zhu, X.</string-name>
              <string-name>Zhang, W.</string-name>
              <string-name>Dong, Y.</string-name>
              <string-name>Chen, Y.</string-name>
            </person-group>
            <year>2024</year>
            <elocation-id>129866</elocation-id>
            <pub-id pub-id-type="doi">10.1016/j.neucom.2025.129866</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B24">
        <label>24.</label>
        <citation-alternatives>
          <mixed-citation publication-type="confproc">Singh, B., &amp; Davis, L. S. (2018). An Analysis of Scale Invariance in Object Detection - SNIP. In <italic>2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition</italic> (pp. 3578-3587). IEEE. https://doi.org/10.1109/cvpr.2018.00377 <pub-id pub-id-type="doi">10.1109/cvpr.2018.00377</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/cvpr.2018.00377">https://doi.org/10.1109/cvpr.2018.00377</ext-link></mixed-citation>
          <element-citation publication-type="confproc">
            <person-group person-group-type="author">
              <string-name>Singh, B.</string-name>
              <string-name>Davis, L.</string-name>
            </person-group>
            <year>2018</year>
            <pub-id pub-id-type="doi">10.1109/cvpr.2018.00377</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B25">
        <label>25.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Singh, B., Najibi, M., &amp; Davis, L. S. (2018). SNIPER: Efficient Multi-Scale Training. In <italic>Advances in Neural Information Processing Systems (NeurIPS)</italic> (pp. 9310-9320). Curran Associates Inc.</mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Singh, B.</string-name>
              <string-name>Najibi, M.</string-name>
              <string-name>Davis, L.</string-name>
            </person-group>
            <year>2018</year>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B26">
        <label>26.</label>
        <citation-alternatives>
          <mixed-citation publication-type="confproc">Tan, M., Pang, R., &amp; Le, Q. V. (2020). EfficientDet: Scalable and Efficient Object Detection. In <italic>2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</italic> (pp. 10778-10787). IEEE. https://doi.org/10.1109/cvpr42600.2020.01079 <pub-id pub-id-type="doi">10.1109/cvpr42600.2020.01079</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/cvpr42600.2020.01079">https://doi.org/10.1109/cvpr42600.2020.01079</ext-link></mixed-citation>
          <element-citation publication-type="confproc">
            <person-group person-group-type="author">
              <string-name>Tan, M.</string-name>
              <string-name>Pang, R.</string-name>
              <string-name>Le, Q.</string-name>
            </person-group>
            <year>2020</year>
            <pub-id pub-id-type="doi">10.1109/cvpr42600.2020.01079</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B27">
        <label>27.</label>
        <citation-alternatives>
          <mixed-citation publication-type="web">Ultralytics (2024). <italic>YOLOv11: Real-Time Object Detection [GitHub Repository]</italic>. https://github.com/ultralytics/ultralytics</mixed-citation>
          <element-citation publication-type="web">
            <year>2024</year>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B28">
        <label>28.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Wang, C. Y., Yeh, I. H., &amp; Liao, H. Y. M. (2024). YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. In <italic>Lecture Notes in Computer Science</italic> (pp. 1-21). Springer. https://doi.org/10.1007/978-3-031-72751-1_1 <pub-id pub-id-type="doi">10.1007/978-3-031-72751-1_1</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/978-3-031-72751-1_1">https://doi.org/10.1007/978-3-031-72751-1_1</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Wang, C.</string-name>
              <string-name>Yeh, I.</string-name>
              <string-name>Liao, H.</string-name>
            </person-group>
            <year>2024</year>
            <pub-id pub-id-type="doi">10.1007/978-3-031-72751-1_1</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B29">
        <label>29.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Wang, H., Zhu, Y., Green, B., Adam, H., Yuille, A., &amp; Chen, L. (2020). Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation. In <italic>Lecture Notes in Computer Science</italic> (pp. 108-126). Springer. https://doi.org/10.1007/978-3-030-58548-8_7 <pub-id pub-id-type="doi">10.1007/978-3-030-58548-8_7</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/978-3-030-58548-8_7">https://doi.org/10.1007/978-3-030-58548-8_7</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Wang, H.</string-name>
              <string-name>Zhu, Y.</string-name>
              <string-name>Green, B.</string-name>
              <string-name>Adam, H.</string-name>
              <string-name>Yuille, A.</string-name>
              <string-name>Chen, L.</string-name>
            </person-group>
            <year>2020</year>
            <pub-id pub-id-type="doi">10.1007/978-3-030-58548-8_7</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B30">
        <label>30.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Wang, J., Chen, Y., Gao, Y., Zhang, H., &amp; Liu, W. (2025). CPDD-YOLOv8: Small Object Detection in Aerial Images. <italic>Scientific Reports, 15,</italic> Article No. 770.</mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Wang, J.</string-name>
              <string-name>Chen, Y.</string-name>
              <string-name>Gao, Y.</string-name>
              <string-name>Zhang, H.</string-name>
              <string-name>Liu, W.</string-name>
            </person-group>
            <year>2025</year>
            <elocation-id>770</elocation-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B31">
        <label>31.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Woo, S., Park, J., Lee, J., &amp; Kweon, I. S. (2018). CBAM: Convolutional Block Attention Module. In <italic>Lecture Notes in Computer Science</italic> (pp. 3-19). Springer. https://doi.org/10.1007/978-3-030-01234-2_1 <pub-id pub-id-type="doi">10.1007/978-3-030-01234-2_1</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/978-3-030-01234-2_1">https://doi.org/10.1007/978-3-030-01234-2_1</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Woo, S.</string-name>
              <string-name>Park, J.</string-name>
              <string-name>Lee, J.</string-name>
              <string-name>Kweon, I.</string-name>
            </person-group>
            <year>2018</year>
            <pub-id pub-id-type="doi">10.1007/978-3-030-01234-2_1</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B32">
        <label>32.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Xu, C., Wang, J., Yang, W., Yu, H., Yu, L., &amp; Xia, G. (2022). RFLA: Gaussian Receptive Field Based Label Assignment for Tiny Object Detection. In <italic>Lecture Notes in Computer Science</italic> (pp. 526-543). Springer. https://doi.org/10.1007/978-3-031-20077-9_31 <pub-id pub-id-type="doi">10.1007/978-3-031-20077-9_31</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/978-3-031-20077-9_31">https://doi.org/10.1007/978-3-031-20077-9_31</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Xu, C.</string-name>
              <string-name>Wang, J.</string-name>
              <string-name>Yang, W.</string-name>
              <string-name>Yu, H.</string-name>
              <string-name>Yu, L.</string-name>
              <string-name>Xia, G.</string-name>
            </person-group>
            <year>2022</year>
            <pub-id pub-id-type="doi">10.1007/978-3-031-20077-9_31</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B33">
        <label>33.</label>
        <citation-alternatives>
          <mixed-citation publication-type="confproc">Yang, C., Huang, Z., &amp; Wang, N. (2022). QueryDet: Cascaded Sparse Query for Accelerating High-Resolution Small Object Detection. In <italic>2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</italic> (pp. 13658-13667). IEEE. https://doi.org/10.1109/cvpr52688.2022.01330 <pub-id pub-id-type="doi">10.1109/cvpr52688.2022.01330</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/cvpr52688.2022.01330">https://doi.org/10.1109/cvpr52688.2022.01330</ext-link></mixed-citation>
          <element-citation publication-type="confproc">
            <person-group person-group-type="author">
              <string-name>Yang, C.</string-name>
              <string-name>Huang, Z.</string-name>
              <string-name>Wang, N.</string-name>
            </person-group>
            <year>2022</year>
            <pub-id pub-id-type="doi">10.1109/cvpr52688.2022.01330</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B34">
        <label>34.</label>
        <citation-alternatives>
          <mixed-citation publication-type="confproc">Zhang, H., Wang, Y., Dayoub, F., &amp; Sunderhauf, N. (2021). VarifocalNet: An IoU-Aware Dense Object Detector. In <italic>2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</italic> (pp. 8510-8519). IEEE. https://doi.org/10.1109/cvpr46437.2021.00841 <pub-id pub-id-type="doi">10.1109/cvpr46437.2021.00841</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/cvpr46437.2021.00841">https://doi.org/10.1109/cvpr46437.2021.00841</ext-link></mixed-citation>
          <element-citation publication-type="confproc">
            <person-group person-group-type="author">
              <string-name>Zhang, H.</string-name>
              <string-name>Wang, Y.</string-name>
              <string-name>Dayoub, F.</string-name>
              <string-name>Sunderhauf, N.</string-name>
            </person-group>
            <year>2021</year>
            <pub-id pub-id-type="doi">10.1109/cvpr46437.2021.00841</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B35">
        <label>35.</label>
        <citation-alternatives>
          <mixed-citation publication-type="confproc">Zhao, Y., Lv, W., Xu, S., Wei, J., Wang, G., Dang, Q. et al. (2024). DETRs Beat YOLOs on Real-Time Object Detection. In <italic>2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</italic> (pp. 16965-16974). IEEE. https://doi.org/10.1109/cvpr52733.2024.01605 <pub-id pub-id-type="doi">10.1109/cvpr52733.2024.01605</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/cvpr52733.2024.01605">https://doi.org/10.1109/cvpr52733.2024.01605</ext-link></mixed-citation>
          <element-citation publication-type="confproc">
            <person-group person-group-type="author">
              <string-name>Zhao, Y.</string-name>
              <string-name>Lv, W.</string-name>
              <string-name>Xu, S.</string-name>
              <string-name>Wei, J.</string-name>
              <string-name>Wang, G.</string-name>
              <string-name>Dang, Q.</string-name>
            </person-group>
            <year>2024</year>
            <pub-id pub-id-type="doi">10.1109/cvpr52733.2024.01605</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B36">
        <label>36.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Zhou, Y., Sun, G., Li, Y., Fu, Y., Benini, L., &amp; Konukoglu, E. (2025). CamSAM2: Segment Anything in Camouflaged Videos. In <italic>Advances in Neural Information Processing Systems (NeurIPS)</italic> (pp. 1-6). https://doi.org/10.48550/arxiv.2503.19730 <pub-id pub-id-type="doi">10.48550/arxiv.2503.19730</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.48550/arxiv.2503.19730">https://doi.org/10.48550/arxiv.2503.19730</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Zhou, Y.</string-name>
              <string-name>Sun, G.</string-name>
              <string-name>Li, Y.</string-name>
              <string-name>Fu, Y.</string-name>
              <string-name>Benini, L.</string-name>
              <string-name>Konukoglu, E.</string-name>
            </person-group>
            <year>2025</year>
            <pub-id pub-id-type="doi">10.48550/arxiv.2503.19730</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B37">
        <label>37.</label>
        <citation-alternatives>
          <mixed-citation publication-type="confproc">Zhu, H., Li, P., Xie, H., Yan, X., Liang, D., Chen, D., Wei, M., &amp; Qin, J. (2022). I Can Find You! Boundary-Guided Separated Attention Network for Camouflaged Object Detection. <italic>Proceedings of the AAAI Conference on Artificial Intelligence, 36</italic>, 3608-3616. https://doi.org/10.1609/aaai.v36i3.20273 <pub-id pub-id-type="doi">10.1609/aaai.v36i3.20273</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1609/aaai.v36i3.20273">https://doi.org/10.1609/aaai.v36i3.20273</ext-link></mixed-citation>
          <element-citation publication-type="confproc">
            <person-group person-group-type="author">
              <string-name>Zhu, H.</string-name>
              <string-name>Li, P.</string-name>
              <string-name>Xie, H.</string-name>
              <string-name>Yan, X.</string-name>
              <string-name>Liang, D.</string-name>
              <string-name>Chen, D.</string-name>
              <string-name>Wei, M.</string-name>
              <string-name>Qin, J.</string-name>
            </person-group>
            <year>2022</year>
            <pub-id pub-id-type="doi">10.1609/aaai.v36i3.20273</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B38">
        <label>38.</label>
        <citation-alternatives>
          <mixed-citation publication-type="confproc">Zhu, X., Su, W., Lu, L., Li, B., Wang, X., &amp; Dai, J. (2021). Deformable DETR: Deformable Transformers for End-to-End Object Detection. In <italic>Proceedings of the International Conference on Learning Representations (ICLR)</italic> (pp. 1-16). https://doi.org/10.48550/arXiv.2010.04159 <pub-id pub-id-type="doi">10.48550/arXiv.2010.04159</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.48550/arXiv.2010.04159">https://doi.org/10.48550/arXiv.2010.04159</ext-link></mixed-citation>
          <element-citation publication-type="confproc">
            <person-group person-group-type="author">
              <string-name>Zhu, X.</string-name>
              <string-name>Su, W.</string-name>
              <string-name>Lu, L.</string-name>
              <string-name>Li, B.</string-name>
              <string-name>Wang, X.</string-name>
              <string-name>Dai, J.</string-name>
            </person-group>
            <year>2021</year>
            <pub-id pub-id-type="doi">10.48550/arXiv.2010.04159</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
    </ref-list>
  </back>
</article>