<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="3.0" xml:lang="en" article-type="research article">
 <front>
  <journal-meta>
   <journal-id journal-id-type="publisher-id">
    jcc
   </journal-id>
   <journal-title-group>
    <journal-title>
     Journal of Computer and Communications
    </journal-title>
   </journal-title-group>
   <issn pub-type="epub">
    2327-5219
   </issn>
   <issn publication-format="print">
    2327-5227
   </issn>
   <publisher>
    <publisher-name>
     Scientific Research Publishing
    </publisher-name>
   </publisher>
  </journal-meta>
  <article-meta>
   <article-id pub-id-type="doi">
    10.4236/jcc.2024.129011
   </article-id>
   <article-id pub-id-type="publisher-id">
    jcc-136447
   </article-id>
   <article-categories>
    <subj-group subj-group-type="heading">
     <subject>
      Articles
     </subject>
    </subj-group>
    <subj-group subj-group-type="Discipline-v2">
     <subject>
      Computer Science 
     </subject>
     <subject>
       Communications
     </subject>
    </subj-group>
   </article-categories>
   <title-group>
    <article-title>
     Efficient Vision Transformers for Autonomous Off-Road Perception Systems
    </article-title>
   </title-group>
   <contrib-group>
    <contrib contrib-type="author" xlink:type="simple">
     <name name-style="western">
      <surname>
       Faykus
      </surname>
      <given-names>
       Max H.
      </given-names>
      <suffix>
       III
      </suffix>
     </name>
    </contrib>
    <contrib contrib-type="author" xlink:type="simple">
     <name name-style="western">
      <surname>
       Pickeral
      </surname>
      <given-names>
       Adam
      </given-names>
     </name>
    </contrib>
    <contrib contrib-type="author" xlink:type="simple">
     <name name-style="western">
      <surname>
       Marquez
      </surname>
      <given-names>
       Ethan
      </given-names>
     </name>
    </contrib>
    <contrib contrib-type="author" xlink:type="simple">
     <name name-style="western">
      <surname>
       Smith
      </surname>
      <given-names>
       Melissa C.
      </given-names>
     </name>
    </contrib>
    <contrib contrib-type="author" xlink:type="simple">
     <name name-style="western">
      <surname>
       Calhoun
      </surname>
      <given-names>
       Jon C.
      </given-names>
     </name>
    </contrib>
   </contrib-group> 
   <aff id="affnull">
    <addr-line>
     Holcombe Department of Electrical and Computer Engineering, Clemson University, Clemson, SC, USA
    </addr-line> 
   </aff> 
   <pub-date pub-type="epub">
    <day>
     06
    </day> 
    <month>
     09
    </month>
    <year>
     2024
    </year>
   </pub-date> 
   <volume>
    12
   </volume> 
   <issue>
    09
   </issue>
   <fpage>
    188
   </fpage>
   <lpage>
    207
   </lpage>
   <history>
    <date date-type="received">
     <day>
      23
     </day>
     <month>
      July
     </month>
     <year>
      2024
     </year>
    </date>
    <date date-type="published">
     <day>
      27
     </day>
     <month>
      July
     </month>
     <year>
      2024
     </year> 
    </date> 
    <date date-type="accepted">
     <day>
      27
     </day>
     <month>
      September
     </month>
     <year>
      2024
     </year> 
    </date>
   </history>
   <permissions>
    <copyright-statement>
     © Copyright 2024 by authors and Scientific Research Publishing Inc.
    </copyright-statement>
    <copyright-year>
     2024
    </copyright-year>
    <license>
     <license-p>
      This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0/
     </license-p>
    </license>
   </permissions>
   <abstract>
    The development of autonomous vehicles has become one of the greatest research endeavors in recent years. These vehicles rely on many complex systems working in tandem to make decisions. For practical use and safety reasons, these systems must not only be accurate but also quickly detect changes in the surrounding environment. In autonomous vehicle research, the environment perception system is one of the key components of development. Environment perception systems allow the vehicle to understand its surroundings using cameras, light detection and ranging (LiDAR), and other sensor systems and modalities. Deep learning computer vision algorithms have been shown to be the strongest tool for translating camera data into accurate and safe traversability decisions regarding the environment surrounding a vehicle. For a vehicle to safely traverse an area in real time, these computer vision algorithms must be accurate and have low latency. While much research has studied autonomous driving in well-structured urban environments, limited research exists evaluating perception system improvements in off-road settings. This research investigates the adaptability of several existing deep learning architectures for semantic segmentation in off-road environments. Previous studies of two Convolutional Neural Network (CNN) architectures are included for comparison with a new evaluation of Vision Transformer (ViT) architectures for semantic segmentation. Our results demonstrate the viability of ViT architectures for off-road perception systems, with strong segmentation accuracy, lower inference time, and a smaller memory footprint compared to previous results with CNN architectures.
   </abstract>
   <kwd-group> 
    <kwd>
     Semantic Segmentation
    </kwd> 
    <kwd>
      Off-Road Vision
    </kwd> 
    <kwd>
      Transformers
    </kwd> 
    <kwd>
      CNNs
    </kwd> 
    <kwd>
      Autonomous Driving
    </kwd>
   </kwd-group>
  </article-meta>
 </front>
 <body>
  <sec id="s1">
   <title>1. Introduction</title>
   <p>Off-road autonomous vehicles, or Unmanned Ground Vehicles (UGVs), are important research efforts in academia and industry. UGVs are typically deployed in challenging terrain and atypical conditions not pre-built for driving. For example, small UGVs are often deployed for terrain exploration or monitoring <xref ref-type="bibr" rid="scirp.136447-1">
     [1]
    </xref>. Larger UGVs have many uses in military surveillance and defense <xref ref-type="bibr" rid="scirp.136447-2">
     [2]
    </xref>. These unmanned vehicles use many systems in tandem to make traversal decisions, one of the most crucial being the perception system <xref ref-type="bibr" rid="scirp.136447-3">
     [3]
    </xref>.</p>
   <p>Perception systems are classified into two main subsystems: the Position Estimation System and the Environment Perception System <xref ref-type="bibr" rid="scirp.136447-4">
     [4]
    </xref>. Position Estimation Systems typically use satellite GPS or Inertial Measurement Units (IMUs) to estimate the position of the vehicle <xref ref-type="bibr" rid="scirp.136447-4">
     [4]
    </xref>. Environment Perception Systems acquire knowledge by scanning the surrounding terrain to detect changes in driving conditions, using various sensors to gather information <xref ref-type="bibr" rid="scirp.136447-4">
     [4]
    </xref>. Sensor examples include Radio Detection and Ranging (RADAR), Light Detection and Ranging (LiDAR), and various types of cameras <xref ref-type="bibr" rid="scirp.136447-4">
     [4]
     </xref>. Since RADAR does not support simultaneous detection of multiple objects and LiDAR is used for point-cloud distance data <xref ref-type="bibr" rid="scirp.136447-5">
      [5]
     </xref>, cameras are an important sensor modality that provides fine-grained environmental knowledge for perception systems.</p>
   <p>Typically, multiple types of camera data, including truecolor red-green-blue (RGB) images, Forward Looking Infrared (FLIR) thermal, multispectral, and stereo, are used in an environment perception system. This work focuses on the extraction of meaningful insights from RGB image data, also known as Computer Vision <xref ref-type="bibr" rid="scirp.136447-3">
     [3]
    </xref>. In environment perception, object detection and semantic segmentation are the two most common computer vision tasks with RGB images. Object detection classifies objects in images and marks their location, typically using a bounding box <xref ref-type="bibr" rid="scirp.136447-6">
     [6]
    </xref>. Semantic segmentation classifies every pixel in an image, providing more fine-grained detail of an image’s scene than object detection but at a higher computational cost <xref ref-type="bibr" rid="scirp.136447-7">
     [7]
    </xref>. Since autonomous vehicles require details of their surroundings for safe and accurate decision-making, semantic segmentation is a crucial element of environment perception systems.</p>
   <p>Methods used for semantic segmentation take on numerous forms. In recent years, multi-layered neural networks (deep learning) have shown stronger results in semantic segmentation and other computer vision tasks compared to traditional rule-based algorithms. The use of deep learning has generated much research in the creation of new semantic segmentation neural network designs, or “architectures”. The design of an architecture determines how the neural network represents patterns in data or “learns” information. Convolutional neural network (CNN) architectures have traditionally been the state of the art for semantic segmentation <xref ref-type="bibr" rid="scirp.136447-8">
     [8]
    </xref>-<xref ref-type="bibr" rid="scirp.136447-11">
     [11]
    </xref>. Vision Transformer (ViT) architectures have recently shown comparable, and in some cases superior, results against CNN architectures in semantic segmentation and other computer vision tasks <xref ref-type="bibr" rid="scirp.136447-12">
     [12]
    </xref> <xref ref-type="bibr" rid="scirp.136447-13">
     [13]
    </xref>.</p>
   <p>Semantic segmentation architectures, in autonomy or otherwise, are typically evaluated by their ability to segment scenes in well-structured environments; most benchmark datasets contain images from urban areas <xref ref-type="bibr" rid="scirp.136447-14">
     [14]
    </xref> <xref ref-type="bibr" rid="scirp.136447-15">
     [15]
    </xref>. For autonomous urban travel, cars, street signs, and roads with distinct lines and intersections give perception systems visual cues for path planning <xref ref-type="bibr" rid="scirp.136447-14">
     [14]
     </xref>. However, far less research has evaluated semantic segmentation architectures in off-road or unstructured settings. Modeling the off-road environments common to UGV deployment (e.g., forests, country roads, and deserts) is difficult because naturally occurring objects blend into one another and the terrain is generally noisy. These environments are rarely studied with recently developed deep learning architectures, as evidenced by: 1) the scarcity of quality, labeled image datasets created in off-road environments <xref ref-type="bibr" rid="scirp.136447-16">
     [16]
    </xref>; 2) the absence of studies using semantic segmentation architectures with available off-road datasets <xref ref-type="bibr" rid="scirp.136447-3">
     [3]
    </xref>.</p>
   <p>In addition, the real-world deployment of deep learning architectures introduces other constraints. Typically, UGVs are deployed with limited computational resources. Therefore, semantic segmentation architectures that require large amounts of computation and time to produce insights about a surrounding environment are undesirable. It is crucial that the architectures deployed in perception systems not only be accurate but also compute segmentation predictions (“inference”) quickly on devices with limited computational resources (typically “edge” devices).</p>
   <p>This work evaluates the viability of multiple deep learning architectures for use in off-road UGVs, where accuracy must be high and inference must be fast. State-of-the-art architectures are evaluated on their ability to both accurately segment off-road data and perform inference quickly. Specifically, results from two CNN architectures, DeepLabV3+ <xref ref-type="bibr" rid="scirp.136447-17">
     [17]
    </xref> and SwiftNet <xref ref-type="bibr" rid="scirp.136447-18">
     [18]
    </xref>, are compared with results from two ViT architectures, EfficientViT <xref ref-type="bibr" rid="scirp.136447-19">
     [19]
    </xref> and Segformer <xref ref-type="bibr" rid="scirp.136447-20">
     [20]
    </xref>, on off-road data.</p>
   <p>This paper contributes the following to the literature:</p>
  </sec><sec id="s2">
   <title>2. Background and Related Works</title>
   <p>When evaluating segmentation architectures, there is a trade-off between accuracy and inference speed <xref ref-type="bibr" rid="scirp.136447-21">
     [21]
    </xref>-<xref ref-type="bibr" rid="scirp.136447-23">
     [23]
     </xref>. Typically, more accurate architectures take longer to perform inference. Conversely, architectures with higher inference speeds normally suffer from accuracy loss.</p>
   <p>Finding a balance between accuracy and inference speed is important in off-road autonomous driving for several reasons: 1) Safety: For the vehicle to make decisions that avoid damage to itself or any cargo, it must have an accurate representation of the surrounding environment to determine traversable terrain. 2) Latency Expectations: Deployment of these vehicles for missions requires real-time decision-making. Perception systems must be able to detect changes in an environment multiple times a second. 3) Constrained Resources: UGVs are typically deployed using edge devices for perception system computation. Ensuring the perception system efficiently segments RGB camera data on devices with limited computational resources (small GPUs and low memory) is crucial for deployment.</p>
   <p>This section details the deep learning architectures explored in off-road settings and the image datasets used to evaluate them.</p>
   <sec id="s2_1">
    <title>2.1. Deep Learning Architectures for Semantic Segmentation</title>
    <p>To effectively segment RGB off-road camera data, a variety of deep learning architectures were utilized. The architectures used in this work fall broadly into two categories: Convolutional Neural Networks and Vision Transformers.</p>
    <p>Convolutional neural networks (CNNs) use kernels, also known as filters, to extract spatial features from images <xref ref-type="bibr" rid="scirp.136447-9">
      [9]
      </xref>. Kernels are typically represented by a small square grid, where each grid element contains a numerical value. The kernel slides across an input image to produce a new representation, where each new pixel value is a weighted sum of the pixels in the kernel’s window. The kernel’s values, which weight the neighboring pixels that contribute to each new pixel’s sum, are learned through the training process and updated to recognize patterns in images that pertain to specific classes in the input data. The combination of these convolutional layers creates deep neural networks that have proven to work well for computer vision tasks <xref ref-type="bibr" rid="scirp.136447-9">
      [9]
     </xref> <xref ref-type="bibr" rid="scirp.136447-24">
      [24]
     </xref> <xref ref-type="bibr" rid="scirp.136447-25">
      [25]
     </xref>.</p>
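     <p>As an illustrative sketch of the weighted-sum operation described above (not the implementation used in this study), the following Python snippet slides a kernel over a small image with a stride of one and no padding. The function name <monospace>conv2d_valid</monospace>, the 4 × 4 image, and the hand-picked edge kernel are assumptions introduced for illustration; in a trained CNN the kernel values would be learned. As in most deep learning frameworks, the kernel is applied without flipping.</p>
     <preformat preformat-type="code"><![CDATA[
```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of rows) with stride 1 and no padding."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            # Weighted sum of the pixels under the kernel window.
            total = 0.0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += kernel[i][j] * image[r + i][c + j]
            row.append(total)
        output.append(row)
    return output


# Hypothetical 4 x 4 grayscale image with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-picked (not learned) 3 x 3 kernel that responds to vertical edges.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
feature_map = conv2d_valid(image, kernel)
```
]]></preformat>
     <p>Every entry of the resulting 2 × 2 feature map equals 3, since each window position straddles the vertical edge in the toy image.</p>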
    <p>Transformer architectures <xref ref-type="bibr" rid="scirp.136447-26">
      [26]
     </xref>, originally designed for natural language processing, use self-attention as a means to represent complex patterns in sequences. Self-attention uses attention “heads” to extract information from a sequence of data. Each head transforms individual parts of an input sequence into representations of “Queries”, questions about the data, “Keys”, answers to these queries, and “Values”, which determine how data should be transformed based on matching queries and keys. Each head transforms input sequence data into queries, keys, and values using weights learned through training. Based on the relationship between queries, keys, and values in each head, “attention maps” are created, highlighting complex relationships in the input data, such as semantic meaning or relationships to other data points. Each head creates distinct queries, keys, and values; resulting attention map data from multiple heads are combined to aggregate information. The Vision Transformer architecture (ViT) modified the idea to work with computer vision tasks <xref ref-type="bibr" rid="scirp.136447-13">
      [13]
      </xref>. New semantic segmentation architectures based on the ideas of the ViT typically have two things in common. First, most are multi-scale architectures that process input images at multiple scales to combine fine and coarse features <xref ref-type="bibr" rid="scirp.136447-12">
      [12]
     </xref>. Second, they use attention to create a global receptive field, meaning relationships between patterns everywhere in the image are considered <xref ref-type="bibr" rid="scirp.136447-12">
      [12]
     </xref>.</p>
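     <p>The query, key, and value mechanism described above can be sketched for a single attention head as follows. This is a minimal pure-Python illustration of scaled dot-product self-attention with a softmax, not the code used in this study; the helper names, the three toy tokens, and the identity projection matrices (standing in for learned weights) are all assumptions.</p>
     <preformat preformat-type="code"><![CDATA[
```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention."""
    queries = matmul(tokens, w_q)
    keys = matmul(tokens, w_k)
    values = matmul(tokens, w_v)
    scale = math.sqrt(len(keys[0]))
    keys_t = [list(col) for col in zip(*keys)]
    # One score per (query, key) pair: an N x N matrix for N tokens.
    scores = matmul(queries, keys_t)
    attention_map = [softmax([s / scale for s in row]) for row in scores]
    # Each output token is an attention-weighted mix of all value vectors.
    return matmul(attention_map, values), attention_map


# Hypothetical input: three tokens with two features each.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned projection weights
output, attention_map = self_attention(tokens, identity, identity, identity)
```
]]></preformat>
     <p>Each row of <monospace>attention_map</monospace> sums to one and weights every token in the sequence, which is what gives attention its global receptive field. A multi-head layer would run several such heads with distinct weights and combine their outputs.</p>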
     <p>CNN-based architectures are limited by the spatial window size of their kernels; using attention allows ViT models to find relationships that are not limited by such spatial constraints. However, it is well documented that the transformer model is hindered by high computational costs, a large memory footprint, and a need for large amounts of training data to perform well <xref ref-type="bibr" rid="scirp.136447-27">
       [27]
      </xref>. Another noted downside is the quadratic computational complexity of typical attention functions, meaning the cost to compute predictions typically grows quadratically with respect to the input data size. This computational cost is especially burdensome when using transformers for real-time applications in autonomous vehicles that use high-resolution images for perception. Several efforts have been made to reduce the memory and computational cost of ViT models. Some of these methods include creating hybrid architectures that combine CNN and ViT components and use efficient attention operations <xref ref-type="bibr" rid="scirp.136447-19">
      [19]
     </xref>, or using machine learning to compress data for more efficient processing <xref ref-type="bibr" rid="scirp.136447-20">
      [20]
     </xref>.</p>
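     <p>To make the quadratic growth concrete, the short calculation below counts the entries of a single attention map for a 1200 × 1920 image (the Rellis-3D image size given in Section 2.4). It is a rough sketch that assumes the image is split into non-overlapping square patches, one token per patch; the patch sizes of 16 and 8 pixels and the function name are assumptions chosen for illustration.</p>
     <preformat preformat-type="code"><![CDATA[
```python
def attention_entries(height, width, patch_size):
    """Return (token count, attention-map entries) for square, non-overlapping patches."""
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    tokens = (height // patch_size) * (width // patch_size)
    # Every token attends to every token, so the map has tokens ** 2 entries.
    return tokens, tokens * tokens


tokens_16, entries_16 = attention_entries(1200, 1920, 16)
tokens_8, entries_8 = attention_entries(1200, 1920, 8)
```
]]></preformat>
     <p>Under these assumptions, halving the patch side quadruples the token count (9000 to 36,000) and increases the number of attention-map entries sixteen-fold (81 million to roughly 1.3 billion).</p>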
   </sec>
   <sec id="s2_2">
    <title>2.2. Previous CNN Studies</title>
     <p>The DeepLabV3+ CNN architecture, shown in <xref ref-type="fig" rid="fig1">
       Figure 1
      </xref>, was investigated for image segmentation with the Rellis-3D dataset in a previous study <xref ref-type="bibr" rid="scirp.136447-28">
       [28]
      </xref>. DeepLabV3+ has several primary components, including atrous convolution, atrous spatial pyramid pooling, and an encoder-decoder structure <xref ref-type="bibr" rid="scirp.136447-17">
       [17]
      </xref>. An atrous convolution dilates the filter by inserting holes between its weights, allowing for denser feature maps. Atrous spatial pyramid pooling replaces general pooling and introduces global average pooling for global context. The encoder-decoder structure utilizes a backbone (ResNet in this study) to extract meaningful features in the data.</p>
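     <p>The effect of the holes in an atrous filter can be illustrated in one dimension. In the sketch below, which is an assumption-laden toy example rather than the DeepLabV3+ implementation, the same three weights cover a wider span of the input as the dilation rate grows, without adding parameters. The function name, signal, and weights are hypothetical.</p>
     <preformat preformat-type="code"><![CDATA[
```python
def atrous_conv1d(signal, weights, rate):
    """1-D atrous (dilated) convolution with stride 1 and no padding."""
    # Effective width covered by the filter once holes are inserted.
    span = (len(weights) - 1) * rate + 1
    output = []
    for start in range(len(signal) - span + 1):
        # Sample the input every `rate` positions under the filter.
        output.append(sum(w * signal[start + i * rate] for i, w in enumerate(weights)))
    return span, output


signal = [1, 2, 3, 4, 5, 6, 7]
weights = [1, 1, 1]
span_dense, out_dense = atrous_conv1d(signal, weights, rate=1)
span_atrous, out_atrous = atrous_conv1d(signal, weights, rate=2)
```
]]></preformat>
     <p>With a rate of 1 the filter spans three samples; with a rate of 2 the same three weights span five samples, enlarging the receptive field at no extra parameter cost.</p>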
    <p>In <xref ref-type="bibr" rid="scirp.136447-30">
      [30]
     </xref>, the SwiftNet multi-scale architecture <xref ref-type="bibr" rid="scirp.136447-18">
      [18]
      </xref> was explored with Rellis-3D. SwiftNet also uses an encoder-decoder structure, as shown in <xref ref-type="fig" rid="fig2">
       Figure 2
      </xref>. The encoder blocks (EB) are composed of ResNet-18 layers <xref ref-type="bibr" rid="scirp.136447-11">
       [11]
      </xref>. The features are extracted at three scales (full, 1/2, and 1/4 resolution) using different branches. The decoder consists of a ladder-style structure with two inputs: the low-resolution feature maps from the preceding upsampling module (UP), and the high-resolution feature maps from the encoder blocks. The feature maps are combined with summation before passage to the decoder. All upsampling is done with bilinear interpolation.</p>
     <fig id="fig1" position="float">
      <label>Figure 1</label>
      <caption>
       <title>Figure 1. DeepLabV3+ Architecture <xref ref-type="bibr" rid="scirp.136447-29">
         [29]
        </xref>.</title>
      </caption>
      <graphic mimetype="image" position="float" xlink:type="simple" xlink:href="https://html.scirp.org/file/1732833-rId18.jpeg?20240930044929" />
     </fig>
     <fig id="fig2" position="float">
      <label>Figure 2</label>
      <caption>
       <title>Figure 2. SwiftNet multi-scale architecture <xref ref-type="bibr" rid="scirp.136447-18">
         [18]
        </xref>.</title>
      </caption>
      <graphic mimetype="image" position="float" xlink:type="simple" xlink:href="https://html.scirp.org/file/1732833-rId19.jpeg?20240930044929" />
     </fig>
   </sec>
   <sec id="s2_3">
    <title>2.3. Efficient Vision Transformer Architectures</title>
     <p>As previously discussed when comparing CNNs and ViTs, the biggest hindrance of the ViT is the memory consumption and computational cost of self-attention <xref ref-type="bibr" rid="scirp.136447-27">
      [27]
     </xref>. Efforts to use hierarchical pyramidal fusions, convolutional layers, and self-supervised Vision Transformers have been made to reduce computational complexity and memory footprint <xref ref-type="bibr" rid="scirp.136447-12">
      [12]
      </xref>. In this study, two recently developed architectures, Segformer and EfficientViT, are investigated because of their reported efficiency for semantic segmentation.</p>
    <p>The Segformer architecture was created for semantic segmentation with a lightweight multi-layer perceptron (MLP) decoder and multi-scale attention <xref ref-type="bibr" rid="scirp.136447-20">
      [20]
     </xref>. The architecture uses attention at large and small scales of the input image data to capture fine-grained and coarse feature maps. Segformer uses projection to shorten input sequences into a smaller representation, making attention slightly more efficient, although it is still quadratically complex. As shown in <xref ref-type="fig" rid="fig3">
      Figure 3
     </xref>, attention feature maps are created at the 1/4, 1/8, 1/16, and 1/32 scale of the original input image. These feature maps are merged and upsampled using nearest-neighbor interpolation and then passed to the decoder. The decoder uses an MLP to output a 1/4 scale prediction segmentation. For this study, we used bicubic interpolation to upsample the final prediction segmentation back to full scale.</p>
    <fig id="fig3" position="float">
     <label>Figure 3</label>
     <caption>
      <title>Figure 3. Segformer architecture <xref ref-type="bibr" rid="scirp.136447-20">
        [20]
       </xref>.</title>
     </caption>
     <graphic mimetype="image" position="float" xlink:type="simple" xlink:href="https://html.scirp.org/file/1732833-rId20.jpeg?20240930044929" />
    </fig>
    <p>EfficientViT is a hybrid CNN-ViT architecture for low-latency computer vision in real-world systems. EfficientViT follows the typical encoder-decoder structure for segmentation neural networks. The encoder is pre-trained on the ImageNet dataset <xref ref-type="bibr" rid="scirp.136447-31">
      [31]
     </xref> for classification. The encoder backbone comprises an input stem and four stages that contribute to the produced feature maps. The full architecture is shown in <xref ref-type="fig" rid="fig4">
      Figure 4
      </xref>. The input stem is a simple convolutional layer followed by a depth-wise separable convolution layer. The first two stages consist of multiple mobile inverted bottleneck convolutional (MBConv) layers. Stages 3 and 4 consist of the same convolutional layers as Stages 1 and 2, followed by the EfficientViT module: a ReLU Linear Attention module with convolutions to aggregate nearby tokens. Using ReLU as a function in attention calculation, in place of the traditional softmax function <xref ref-type="bibr" rid="scirp.136447-13">
      [13]
     </xref>, allows for hardware efficiency, but is weaker for discovering patterns <xref ref-type="bibr" rid="scirp.136447-19">
      [19]
      </xref>. When input data is passed through the backbone, the outputs of stages 2, 3, and 4 are saved, forming a pyramid of feature maps. Bicubic upsampling is then used to match their spatial and channel size, followed by a fusion of the data by addition. The head design is simple, using a few MBConv layers to decode the feature maps.</p>
     <fig id="fig4" position="float">
      <label>Figure 4</label>
      <caption>
       <title>Figure 4. EfficientViT Architecture <xref ref-type="bibr" rid="scirp.136447-19">
         [19]
        </xref>.</title>
      </caption>
      <graphic mimetype="image" position="float" xlink:type="simple" xlink:href="https://html.scirp.org/file/1732833-rId21.jpeg?20240930044930" />
     </fig>
    <p>The linear attention backbone, the EfficientViT module, is shown in <xref ref-type="fig" rid="fig5">
      Figure 5
     </xref>. The learned query (Q), key (K), and value (V) matrices are fed into three channels before concatenation. The first layer only uses the ReLU linear attention function. The second and third layers additionally pass the resulting tokens through depth-wise separable convolution layers, with kernel sizes 3 × 3 and 5 × 5, respectively, to aggregate nearby information. The tokens are then passed through a 1 × 1 group convolution, aggregating channels into groups for efficient computation. This hierarchy creates three different scales of representation in the tokens. After passing through the Multi-Scale Linear Attention module, the data is passed through a simple feed-forward network with a depth-wise separable convolution layer to project the data further. This addition helps compensate for the weakness of ReLU as an attention function.</p>
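     <p>The efficiency gain from replacing softmax with ReLU comes from reordering the matrix products so that no N × N attention map is ever formed. The sketch below is our reading of the general ReLU linear attention formulation in <xref ref-type="bibr" rid="scirp.136447-19">[19]</xref>, written in plain Python for clarity; it omits the multi-scale convolution branches, and the function names, the small stabilizing constant, and the toy inputs are assumptions. A quadratic reference version is included only to check that both orderings agree.</p>
     <preformat preformat-type="code"><![CDATA[
```python
def relu(row):
    return [max(0.0, v) for v in row]


def relu_linear_attention(queries, keys, values):
    """ReLU linear attention computed without forming an N x N attention map."""
    q = [relu(row) for row in queries]
    k = [relu(row) for row in keys]
    n, d, d_v = len(k), len(k[0]), len(values[0])
    # Summaries shared by all queries: sum_j k_j^T v_j (d x d_v) and sum_j k_j (d).
    kv = [[sum(k[j][a] * values[j][b] for j in range(n)) for b in range(d_v)] for a in range(d)]
    k_sum = [sum(k[j][a] for j in range(n)) for a in range(d)]
    output = []
    for q_i in q:
        norm = sum(q_i[a] * k_sum[a] for a in range(d)) + 1e-6  # avoid division by zero
        output.append([sum(q_i[a] * kv[a][b] for a in range(d)) / norm for b in range(d_v)])
    return output


def relu_quadratic_attention(queries, keys, values):
    """Reference version that builds every query-key similarity explicitly."""
    q = [relu(row) for row in queries]
    k = [relu(row) for row in keys]
    output = []
    for q_i in q:
        sims = [sum(a * b for a, b in zip(q_i, k_j)) for k_j in k]
        norm = sum(sims) + 1e-6
        output.append([sum(s * v[b] for s, v in zip(sims, values)) / norm
                       for b in range(len(values[0]))])
    return output


# Hypothetical toy inputs: three tokens, two features each.
queries = [[1.0, -1.0], [0.5, 2.0], [-0.5, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
linear_out = relu_linear_attention(queries, keys, values)
quadratic_out = relu_quadratic_attention(queries, keys, values)
```
]]></preformat>
     <p>Both functions return the same values up to floating-point rounding, but the linear version only touches each token a constant number of times, so its cost grows linearly with the number of tokens.</p>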
    <fig id="fig5" position="float">
     <label>Figure 5</label>
     <caption>
      <title>Figure 5. EfficientViT Module <xref ref-type="bibr" rid="scirp.136447-19">
        [19]
       </xref>.</title>
     </caption>
     <graphic mimetype="image" position="float" xlink:type="simple" xlink:href="https://html.scirp.org/file/1732833-rId22.jpeg?20240930044930" />
    </fig>
   </sec>
   <sec id="s2_4">
    <title>2.4. Datasets</title>
    <p>To evaluate the segmentation ability of a deep learning architecture in an off-road setting, it must be trained on a large-scale dataset with labeled images of off-road environments. The datasets used in this study are the Rellis-3D dataset <xref ref-type="bibr" rid="scirp.136447-32">
      [32]
     </xref> and the CAVS Traversability (CaT) dataset <xref ref-type="bibr" rid="scirp.136447-33">
      [33]
      </xref>. These specific datasets were chosen because 1) they contain thousands of labeled images in various off-road environments; 2) their images are high-resolution. A large number of images gives architectures a wider range of scenes to learn from and be evaluated on. The high resolution of the images allows architectures to make detailed predictions about the surrounding scenes, a necessary trait for real-world use. These two factors are rare to find in off-road datasets <xref ref-type="bibr" rid="scirp.136447-16">
      [16]
     </xref>, making them the best choices for study in an off-road setting.</p>
     <p>Rellis-3D is an off-road dataset created to address the lack of multi-modal datasets for off-road environments. This off-road dataset challenges state-of-the-art deep learning architectures designed to segment urban data. It provides a full sensor stack that includes RGB camera images, LiDAR point clouds, stereo images, high-precision GPS measurements, and IMU data. This multimodal data aims to enhance autonomous off-road navigation with a comprehensive ontology of object and terrain classes.</p>
     <p>
      The Rellis-3D image collection contains 6234 labeled RGB images of size 1200 × 1920 <xref ref-type="bibr" rid="scirp.136447-32">
      [32]
     </xref>. <xref ref-type="fig" rid="fig6">
      Figure 6
      </xref> shows the ontology of the Rellis-3D dataset. The twenty class labels fall into two main subgroups: 1) traversable areas such as dirt, grass, and asphalt; 2) obstacles such as bushes, trees, objects, and poles. Since there are very few dirt labels, as seen in the dataset’s label distribution in <xref ref-type="fig" rid="fig7">
      Figure 7
     </xref>, this label is excluded from the study.</p>
    <fig id="fig6" position="float">
     <label>Figure 6</label>
     <caption>
      <title>Figure 6. Rellis-3D image example and ontology <xref ref-type="bibr" rid="scirp.136447-32">
        [32]
       </xref>.</title>
     </caption>
     <graphic mimetype="image" position="float" xlink:type="simple" xlink:href="https://html.scirp.org/file/1732833-rId23.jpeg?20240930044931" />
    </fig>
    <fig id="fig7" position="float">
     <label>Figure 7</label>
     <caption>
      <title>Figure 7. Rellis-3D data distribution.</title>
     </caption>
     <graphic mimetype="image" position="float" xlink:type="simple" xlink:href="https://html.scirp.org/file/1732833-rId24.jpeg?20240930044931" />
    </fig>
    <p>The Center for Advanced Vehicular Systems (CAVS) Traversability dataset (CaT) was created to explore off-road terrain in environments containing obstacles, ditches, and hidden objects <xref ref-type="bibr" rid="scirp.136447-33">
      [33]
     </xref>. The dataset includes 3624 labeled RGB images of varying high-definition sizes. The terrain in the images is segmented to show the traversing ability of three different-sized vehicles: a sedan, a pickup, and a sizeable off-road vehicle. A sedan is considered the vehicle with the least traversability and the off-road vehicle the most. <xref ref-type="fig" rid="fig8">
      Figure 8
     </xref> shows example images and annotations from the dataset. As shown in <xref ref-type="fig" rid="fig9">
      Figure 9
     </xref>, the CaT dataset has a class distribution with 25.29% of the pixels representing the driving capabilities of a sedan, 14.69% for a pickup, and 15.17% for an off-road vehicle. The last 44.86% are background pixels or untraversable terrain.</p>
    <fig id="fig8" position="float">
     <label>Figure 8</label>
     <caption>
      <title>Figure 8. CaT Image Examples and Corresponding Traversability Labels <xref ref-type="bibr" rid="scirp.136447-33">
        [33]
       </xref>.</title>
     </caption>
     <graphic mimetype="image" position="float" xlink:type="simple" xlink:href="https://html.scirp.org/file/1732833-rId25.jpeg?20240930044932" />
    </fig>
    <fig id="fig9" position="float">
     <label>Figure 9</label>
     <caption>
      <title>Figure 9. CaT data distribution.</title>
     </caption>
     <graphic mimetype="image" position="float" xlink:type="simple" xlink:href="https://html.scirp.org/file/1732833-rId26.jpeg?20240930044932" />
    </fig>
   </sec>
  </sec><sec id="s3">
   <title>3. Methods</title>
    <p>To determine whether ViTs improve a UGV perception system, two ViT architectures for semantic segmentation are evaluated on the Rellis-3D off-road dataset. The ViT architectures are evaluated based on their accuracy, ability to identify traversable terrain, and inference speed. Additionally, the inference memory usage and architecture memory size for each architecture are compared. The model best suited for real-time use is further explored on the CaT dataset. Results for the ViT architectures are compared to previous studies of CNN-based architectures on the Rellis-3D dataset <xref ref-type="bibr" rid="scirp.136447-28">
     [28]
    </xref> <xref ref-type="bibr" rid="scirp.136447-30">
     [30]
    </xref>. All hardware and software used for testing are listed in the Appendix.</p>
   <sec id="s3_1">
    <title>3.1. Model Training</title>
    <p>The Segformer and EfficientViT architectures are implemented in Python using PyTorch. To update the weights of the neural networks in both architectures, the AdamW optimization algorithm is used with default parameter values for the weight decay, epsilon, and beta parameters <xref ref-type="bibr" rid="scirp.136447-34">
      [34]
     </xref>. Both models were trained until convergence. Other specific hyperparameters for each architecture are documented in their respective papers and detailed below <xref ref-type="bibr" rid="scirp.136447-19">
      [19]
     </xref> <xref ref-type="bibr" rid="scirp.136447-20">
      [20]
     </xref>.</p>
    <p>To train the Segformer architecture, an initial learning rate of 0.00006 is used with a polynomial learning rate scheduler, as documented in the original paper <xref ref-type="bibr" rid="scirp.136447-20">
      [20]
     </xref>. Random flipping and random cropping were used for pre-processing the images as documented in the original paper <xref ref-type="bibr" rid="scirp.136447-20">
      [20]
     </xref>.</p>
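    <p>As a minimal sketch of the polynomial learning rate schedule mentioned above, the function below decays the initial learning rate of 0.00006 toward zero. The function name poly_lr, the decay power of 1.0, and the total step count are illustrative assumptions, not values taken from the training configuration used in this study.</p>
    <preformat>
```python
def poly_lr(step: int, total_steps: int, base_lr: float = 6e-5, power: float = 1.0) -> float:
    """Polynomial decay: base_lr * (1 - step / total_steps) ** power.

    base_lr matches the initial rate stated in the text; power and
    total_steps are assumptions made only for this illustration.
    """
    # Clamp so that steps past the end of training keep a rate of zero.
    progress = min(step, total_steps) / total_steps
    return base_lr * (1.0 - progress) ** power


# Learning rate at the start, midpoint, and end of a hypothetical 1000-step run.
schedule = [poly_lr(s, total_steps=1000) for s in (0, 500, 1000)]
print(schedule)
```
    </preformat>
    <p>With a power of 1.0 the decay is linear, so the midpoint rate is half the initial rate.</p>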
    <p>For the EfficientViT architecture, training began with an initial 20 epochs of warm-up training. In the warm-up epochs, the learning rate gradually increased from 0.0 to the base learning rate of 0.001. The learning rate was adjusted throughout the training based on a cosine learning rate scheduler <xref ref-type="bibr" rid="scirp.136447-35">
      [35]
     </xref>. Random flipping, random cropping, hue changing, and random erasing of image data were used for pre-processing <xref ref-type="bibr" rid="scirp.136447-36">
      [36]
     </xref>.</p>
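    <p>The warm-up and cosine schedule described above can be sketched as follows. The 20 warm-up epochs and the base learning rate of 0.001 come from the text; the linear shape of the warm-up, the decay to zero, the per-epoch granularity, and the name warmup_cosine_lr are assumptions made for illustration.</p>
    <preformat>
```python
import math


def warmup_cosine_lr(epoch: int, total_epochs: int,
                     warmup_epochs: int = 20, base_lr: float = 1e-3) -> float:
    """Linear warm-up from 0.0 to base_lr, then cosine decay to zero.

    total_epochs is a hypothetical value; the text does not state it.
    """
    if warmup_epochs > epoch:
        # Warm-up phase: the rate grows linearly with the epoch index.
        return base_lr * epoch / warmup_epochs
    # Cosine phase: progress runs from 0.0 to 1.0 over the remaining epochs.
    remaining = max(1, total_epochs - warmup_epochs)
    progress = min(1.0, (epoch - warmup_epochs) / remaining)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# Rates at the start, mid warm-up, end of warm-up, and end of a 100-epoch run.
rates = [warmup_cosine_lr(e, total_epochs=100) for e in (0, 10, 20, 100)]
print(rates)
```
    </preformat>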
   </sec>
   <sec id="s3_2">
    <title>3.2. Evaluation Metrics</title>
    <p>The CNN and ViT architectures were evaluated based on their ability to recognize and generalize patterns in an off-road setting (segmentation accuracy) and their ability to do so efficiently (inference speed and memory usage).</p>
    <p>The primary accuracy measurement in segmentation is intersection over union (IoU), shown in Equation (1). The intersection and union are based on the true positive (TP), false positive (FP), and false negative (FN) predictions of each class. The mean IoU (mIoU) is the average of the individual class IoU scores (see Equation (2)). For exploring Rellis-3D, the architectures are trained on 70% of the dataset (4364 images) and evaluated on 30% of the image data (1870 images), the same split used in previous studies. For exploring CaT, 70% of the dataset (2356 images) was used for training, and 30% (1088 images) was used for testing.</p>
    <p>
     <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"> <mrow> 
       <msub> 
        <mrow> 
         <mtext>
           IoU 
         </mtext> 
        </mrow> 
        <mrow> 
         <mtext>
           class 
         </mtext> 
        </mrow> 
       </msub> 
       <mo>
         = 
       </mo> 
       <mfrac> 
        <mrow> 
         <msub> 
          <mrow> 
           <mtext>
             Prediction 
           </mtext> 
          </mrow> 
          <mrow> 
           <mtext>
             class 
           </mtext> 
          </mrow> 
         </msub> 
         <mo>
           ∩ 
         </mo> 
         <msub> 
          <mrow> 
           <mtext>
             GroundTruth 
           </mtext> 
          </mrow> 
          <mrow> 
           <mtext>
             class 
           </mtext> 
          </mrow> 
         </msub> 
        </mrow> 
        <mrow> 
         <msub> 
          <mrow> 
           <mtext>
             Prediction 
           </mtext> 
          </mrow> 
          <mrow> 
           <mtext>
             class 
           </mtext> 
          </mrow> 
         </msub> 
         <mo>
           ∪ 
         </mo> 
         <msub> 
          <mrow> 
           <mtext>
             GroundTruth 
           </mtext> 
          </mrow> 
          <mrow> 
           <mtext>
             class 
           </mtext> 
          </mrow> 
         </msub> 
        </mrow> 
       </mfrac> 
       <mo>
         = 
       </mo> 
       <mfrac> 
        <mrow> 
         <mtext>
           TP 
         </mtext> 
        </mrow> 
        <mrow> 
         <mtext>
           TP 
         </mtext> 
         <mo>
           + 
         </mo> 
         <mtext>
           FP 
         </mtext> 
         <mo>
           + 
         </mo> 
         <mtext>
           FN 
         </mtext> 
        </mrow> 
       </mfrac> 
      </mrow> 
     </math>(1)</p>
    <p>
     <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"> <mrow> 
       <mtext>
         mIoU 
       </mtext> 
       <mo>
         = 
       </mo> 
       <mfrac> 
        <mrow> 
         <mstyle displaystyle="true"> 
          <mo>
            ∑ 
          </mo> 
          <mrow> 
           <msub> 
            <mrow> 
             <mtext>
               IoU 
             </mtext> 
            </mrow> 
            <mrow> 
             <mtext>
               class 
             </mtext> 
            </mrow> 
           </msub> 
          </mrow> 
         </mstyle> 
        </mrow> 
        <mrow> 
         <msub> 
          <mi>
            n 
          </mi> 
          <mrow> 
           <mtext>
             classes 
           </mtext> 
          </mrow> 
         </msub> 
        </mrow> 
       </mfrac> 
      </mrow> 
     </math>(2)</p>
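    <p>Equations (1) and (2) can be illustrated with a short, dependency-free sketch that counts TP, FP, and FN per class over flattened label sequences. The function names and the toy labels are hypothetical, and skipping classes that appear in neither the prediction nor the ground truth is an assumption of this sketch rather than something stated in the text.</p>
    <preformat>
```python
from typing import Dict, List, Sequence


def confusion_counts(pred: Sequence[int], truth: Sequence[int],
                     n_classes: int) -> List[Dict[str, int]]:
    """Count true positives, false positives, and false negatives per class."""
    counts = [{"tp": 0, "fp": 0, "fn": 0} for _ in range(n_classes)]
    for p, t in zip(pred, truth):
        if p == t:
            counts[t]["tp"] += 1
        else:
            counts[p]["fp"] += 1  # predicted class p where it was not present
            counts[t]["fn"] += 1  # missed a pixel of the true class t
    return counts


def class_iou(tp: int, fp: int, fn: int) -> float:
    """Equation (1): IoU = TP / (TP + FP + FN); NaN if the class never occurs."""
    denom = tp + fp + fn
    return tp / denom if denom else float("nan")


def mean_iou(pred: Sequence[int], truth: Sequence[int], n_classes: int) -> float:
    """Equation (2): average of the per-class IoU scores (absent classes skipped)."""
    ious = [class_iou(**c) for c in confusion_counts(pred, truth, n_classes)]
    valid = [v for v in ious if v == v]  # NaN is the only value not equal to itself
    return sum(valid) / len(valid) if valid else float("nan")


# Toy example with three classes and six pixels.
pred = [0, 0, 1, 1, 2, 2]
truth = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred, truth, n_classes=3))
```
    </preformat>
    <p>In this toy example the class IoU scores are 1/3, 2/3, and 1/2, so the mIoU is 0.5.</p>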
   </sec>
   <sec id="s3_3">
    <title>3.3. Inference Speed and Memory Usage</title>
    <p>The timing approach for evaluating inference speeds and memory consumption is detailed in <xref ref-type="bibr" rid="scirp.136447-#l1">
      Listing 1
     </xref>. The architectures were timed on 200 iterations of predicting segmented images on Rellis-3D resolution data (1200 × 1920), and the average results were reported. Since perception systems must transfer knowledge to the CPU for decision-making, the inference speed calculations included the time to transfer the predictions back to the CPU.</p>
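    <p>As a rough illustration of the timing procedure described above (and not a reproduction of Listing 1), the following sketch averages wall-clock latency over a fixed number of iterations. The names average_latency_ms, predict, sample, and sync are hypothetical, and the warm-up passes are an added assumption. For a PyTorch model on a GPU, predict would be expected to run the forward pass and copy the prediction back to the CPU, sync could wrap torch.cuda.synchronize(), and peak memory could be read separately with torch.cuda.max_memory_allocated().</p>
    <preformat>
```python
import time
from statistics import mean
from typing import Any, Callable, Optional


def average_latency_ms(predict: Callable[[Any], Any], sample: Any,
                       iterations: int = 200, warmup: int = 10,
                       sync: Optional[Callable[[], None]] = None) -> float:
    """Return the mean latency in milliseconds of predict(sample).

    iterations defaults to the 200 runs mentioned in the text; warmup is an
    assumption of this sketch. sync, if given, is called after each run so
    that asynchronous device work is included in the measurement.
    """
    for _ in range(warmup):
        predict(sample)  # untimed passes to exclude one-time setup costs
    if sync is not None:
        sync()

    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        predict(sample)
        if sync is not None:
            sync()
        timings.append((time.perf_counter() - start) * 1000.0)
    return mean(timings)


# Stand-in workload so the sketch runs without a model or a GPU.
latency = average_latency_ms(lambda x: sum(x), list(range(1000)), iterations=20)
print(f"{latency:.4f} ms")
```
    </preformat>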
     <fig position="float">
      <label>Listing 1</label>
      <caption>
       <title>Listing 1. Inference Speed and Memory Data Collection in PyTorch.</title>
      </caption>
      <graphic mimetype="image" position="float" xlink:type="simple" xlink:href="https://html.scirp.org/file/1732833-rId31.jpeg?20240930044934" />
     </fig>
    </sec>
    <sec>
     <title>4. Results and Discussion</title>
     <p>This section presents new results for inference speed and memory consumption, as well as mIoU on the Rellis-3D dataset. Additionally, the number of parameters and size in memory of each architecture are detailed. We then measure the inference time of the most accurate architectures on a Jetson Xavier AGX edge device to ensure real-time viability. Further, using the architecture deemed most suited for accurate and fast inference, the CaT dataset was explored. We compare CaT results against the benchmark IoU scores outlined in the CaT dataset paper <xref ref-type="bibr" rid="scirp.136447-33">
       [33]
      </xref>.</p>
    </sec>
    <sec>
     <title>4.1. Rellis-3D Accuracy</title>
     <p>First, the CNN and ViT architectures are evaluated on the Rellis-3D dataset and compared for accuracy in the off-road setting. The class and mIoU results are shown in <xref ref-type="table" rid="table1">
       Table 1
      </xref>.</p>
     <table-wrap>
      <label>Table 1</label>
      <caption>
       <title>Table 1. Class and Mean IoU Accuracy (%) on Rellis-3D.</title>
      </caption>
      <table class="MsoTableGrid custom-table" border="0" cellspacing="0" cellpadding="0"> 
       <tr> 
        <td class="custom-bottom-td acenter" width="19.99%"><p style="text-align:center">Class</p></td> 
        <td class="custom-bottom-td acenter" width="22.49%"><p style="text-align:center">DeepLabV3+ <xref ref-type="bibr" rid="scirp.136447-28">
           [28]
          </xref></p></td> 
        <td class="custom-bottom-td acenter" width="17.52%"><p style="text-align:center">Swiftnet <xref ref-type="bibr" rid="scirp.136447-30">
           [30]
          </xref></p></td> 
        <td class="custom-bottom-td acenter" width="20.00%"><p style="text-align:center">EfficientViT</p></td> 
        <td class="custom-bottom-td acenter" width="20.00%"><p style="text-align:center">Segformer</p></td> 
       </tr> 
       <tr> 
        <td class="custom-top-td acenter" width="19.99%"><p style="text-align:center">grass</p></td> 
        <td class="custom-top-td acenter" width="22.49%"><p style="text-align:center">72.70</p></td> 
        <td class="custom-top-td acenter" width="17.52%"><p style="text-align:center">91.83</p></td> 
        <td class="custom-top-td acenter" width="20.00%"><p style="text-align:center">92.07</p></td> 
        <td class="custom-top-td acenter" width="20.00%"><p style="text-align:center">85.65</p></td> 
       </tr> 
       <tr> 
        <td class="acenter" width="19.99%"><p style="text-align:center">tree</p></td> 
        <td class="acenter" width="22.49%"><p style="text-align:center">83.45</p></td> 
        <td class="acenter" width="17.52%"><p style="text-align:center">90.04</p></td> 
        <td class="acenter" width="20.00%"><p style="text-align:center">90.06</p></td> 
        <td class="acenter" width="20.00%"><p style="text-align:center">82.05</p></td> 
       </tr> 
       <tr> 
        <td class="acenter" width="19.99%"><p style="text-align:center">pole</p></td> 
        <td class="acenter" width="22.49%"><p style="text-align:center">7.57</p></td> 
        <td class="acenter" width="17.52%"><p style="text-align:center">42.15</p></td> 
        <td class="acenter" width="20.00%"><p style="text-align:center">40.53</p></td> 
        <td class="acenter" width="20.00%"><p style="text-align:center">14.48</p></td> 
       </tr> 
       <tr> 
        <td class="acenter" width="19.99%"><p style="text-align:center">water</p></td> 
        <td class="acenter" width="22.49%"><p style="text-align:center">53.35</p></td> 
        <td class="acenter" width="17.52%"><p style="text-align:center">81.48</p></td> 
        <td class="acenter" width="20.00%"><p style="text-align:center">79.22</p></td> 
        <td class="acenter" width="20.00%"><p style="text-align:center">50.63</p></td> 
       </tr> 
       <tr> 
        <td class="acenter" width="19.99%"><p style="text-align:center">sky</p></td> 
        <td class="acenter" width="22.49%"><p style="text-align:center">95.84</p></td> 
        <td class="acenter" width="17.52%"><p style="text-align:center">97.54</p></td> 
        <td class="acenter" width="20.00%"><p style="text-align:center">97.57</p></td> 
        <td class="acenter" width="20.00%"><p style="text-align:center">96.28</p></td> 
       </tr> 
       <tr> 
        <td class="acenter" width="19.99%"><p style="text-align:center">vehicle</p></td> 
        <td class="acenter" width="22.49%"><p style="text-align:center">26.96</p></td> 
        <td class="acenter" width="17.52%"><p style="text-align:center">67.30</p></td> 
        <td class="acenter" width="20.00%"><p style="text-align:center">65.34</p></td> 
        <td class="acenter" width="20.00%"><p style="text-align:center">31.02</p></td> 
       </tr> 
       <tr> 
        <td class="acenter" width="19.99%"><p style="text-align:center">object</p></td> 
        <td class="acenter" width="22.49%"><p style="text-align:center">24.89</p></td> 
        <td class="acenter" width="17.52%"><p style="text-align:center">72.73</p></td> 
        <td class="acenter" width="20.00%"><p style="text-align:center">68.44</p></td> 
        <td class="acenter" width="20.00%"><p style="text-align:center">13.32</p></td> 
       </tr> 
       <tr> 
        <td class="acenter" width="19.99%"><p style="text-align:center">asphalt</p></td> 
        <td class="acenter" width="22.49%"><p style="text-align:center">60.95</p></td> 
        <td class="acenter" width="17.52%"><p style="text-align:center">86.08</p></td> 
        <td class="acenter" width="20.00%"><p style="text-align:center">85.34</p></td> 
        <td class="acenter" width="20.00%"><p style="text-align:center">58.40</p></td> 
       </tr> 
       <tr> 
        <td class="acenter" width="19.99%"><p style="text-align:center">building</p></td> 
        <td class="acenter" width="22.49%"><p style="text-align:center">10.49</p></td> 
        <td class="acenter" width="17.52%"><p style="text-align:center">65.08</p></td> 
        <td class="acenter" width="20.00%"><p style="text-align:center">59.46</p></td> 
        <td class="acenter" width="20.00%"><p style="text-align:center">9.49</p></td> 
       </tr> 
       <tr> 
        <td class="acenter" width="19.99%"><p style="text-align:center">log</p></td> 
        <td class="acenter" width="22.49%"><p style="text-align:center">25.97</p></td> 
        <td class="acenter" width="17.52%"><p style="text-align:center">61.79</p></td> 
        <td class="acenter" width="20.00%"><p style="text-align:center">56.90</p></td> 
        <td class="acenter" width="20.00%"><p style="text-align:center">36.12</p></td> 
       </tr> 
       <tr> 
        <td class="acenter" width="19.99%"><p style="text-align:center">person</p></td> 
        <td class="acenter" width="22.49%"><p style="text-align:center">66.46</p></td> 
        <td class="acenter" width="17.52%"><p style="text-align:center">92.52</p></td> 
        <td class="acenter" width="20.00%"><p style="text-align:center">90.78</p></td> 
        <td class="acenter" width="20.00%"><p style="text-align:center">71.23</p></td> 
       </tr> 
       <tr> 
        <td class="acenter" width="19.99%"><p style="text-align:center">fence</p></td> 
        <td class="acenter" width="22.49%"><p style="text-align:center">15.79</p></td> 
        <td class="acenter" width="17.52%"><p style="text-align:center">65.61</p></td> 
        <td class="acenter" width="20.00%"><p style="text-align:center">58.31</p></td> 
        <td class="acenter" width="20.00%"><p style="text-align:center">18.88</p></td> 
       </tr> 
       <tr> 
        <td class="acenter" width="19.99%"><p style="text-align:center">bush</p></td> 
        <td class="acenter" width="22.49%"><p style="text-align:center">70.95</p></td> 
        <td class="acenter" width="17.52%"><p style="text-align:center">85.09</p></td> 
        <td class="acenter" width="20.00%"><p style="text-align:center">85.55</p></td> 
        <td class="acenter" width="20.00%"><p style="text-align:center">73.18</p></td> 
       </tr> 
       <tr> 
        <td class="acenter" width="19.99%"><p style="text-align:center">concrete</p></td> 
        <td class="acenter" width="22.49%"><p style="text-align:center">80.23</p></td> 
        <td class="acenter" width="17.52%"><p style="text-align:center">91.24</p></td> 
        <td class="acenter" width="20.00%"><p style="text-align:center">90.96</p></td> 
        <td class="acenter" width="20.00%"><p style="text-align:center">84.83</p></td> 
       </tr> 
       <tr> 
        <td class="acenter" width="19.99%"><p style="text-align:center">barrier</p></td> 
        <td class="acenter" width="22.49%"><p style="text-align:center">65.57</p></td> 
        <td class="acenter" width="17.52%"><p style="text-align:center">87.63</p></td> 
        <td class="acenter" width="20.00%"><p style="text-align:center">86.19</p></td> 
        <td class="acenter" width="20.00%"><p style="text-align:center">68.37</p></td> 
       </tr> 
       <tr> 
        <td class="acenter" width="19.99%"><p style="text-align:center">puddle</p></td> 
        <td class="acenter" width="22.49%"><p style="text-align:center">59.27</p></td> 
        <td class="acenter" width="17.52%"><p style="text-align:center">80.96</p></td> 
        <td class="acenter" width="20.00%"><p style="text-align:center">80.69</p></td> 
        <td class="acenter" width="20.00%"><p style="text-align:center">67.08</p></td> 
       </tr> 
       <tr> 
        <td class="acenter" width="19.99%"><p style="text-align:center">mud</p></td> 
        <td class="acenter" width="22.49%"><p style="text-align:center">29.51</p></td> 
        <td class="acenter" width="17.52%"><p style="text-align:center">65.46</p></td> 
        <td class="acenter" width="20.00%"><p style="text-align:center">66.09</p></td> 
        <td class="acenter" width="20.00%"><p style="text-align:center">45.37</p></td> 
       </tr> 
       <tr> 
        <td class="acenter" width="19.99%"><p style="text-align:center">rubble</p></td> 
        <td class="acenter" width="22.49%"><p style="text-align:center">36.43</p></td> 
        <td class="acenter" width="17.52%"><p style="text-align:center">77.87</p></td> 
        <td class="acenter" width="20.00%"><p style="text-align:center">74.96</p></td> 
        <td class="acenter" width="20.00%"><p style="text-align:center">49.88</p></td> 
       </tr> 
       <tr> 
        <td class="acenter" width="19.99%"><p style="text-align:center">mIoU</p></td> 
        <td class="acenter" width="22.49%"><p style="text-align:center">49.24</p></td> 
        <td class="acenter" width="17.52%"><p style="text-align:center">77.90</p></td> 
        <td class="acenter" width="20.00%"><p style="text-align:center">76.03</p></td> 
        <td class="acenter" width="20.00%"><p style="text-align:center">53.13</p></td> 
       </tr> 
      </table>
     </table-wrap>
     <p>EfficientViT and Swiftnet were the strongest-performing architectures, with 76.03% and 77.9% mIoU, respectively. The class IoUs for these architectures show that each generalizes well to both large terrain patterns and small obstacles/objects.</p>
     <p>Results from the previous study show that DeepLabV3+ could generalize to the large terrain patterns (e.g., tree, grass, sky, bush, and concrete) but struggled to generalize to the smaller objects/obstacles. Similarly, Segformer generalizes well to the dominant patterns in the dataset (e.g., grass, bush, concrete, sky, and tree) while struggling with the smaller objects. Both architectures seem to be affected by the class imbalance challenge common in off-road datasets, with sky, grass, tree, and bush being the most over-represented classes in the Rellis-3D dataset, as previously shown in <xref ref-type="fig" rid="fig7">
       Figure 7
      </xref>.</p>
     <p>Predicted segmentation results for the ViT architectures are compared to the ground truth segmentation in <xref ref-type="fig" rid="fig10">
       Figure 10
      </xref>. The traversable tracks of <xref ref-type="fig" rid="fig10(d)">
       Figure 10(d)
      </xref> show a mixture of classes, highlighting Segformer’s inaccuracy on small patterns. As seen in <xref ref-type="fig" rid="fig10(c)">
       Figure 10(c)
      </xref>, EfficientViT smoothly identifies traversability patterns in the off-road environment, barely deviating from the ground truth segmentation in <xref ref-type="fig" rid="fig10(b)">
       Figure 10(b)
      </xref>.</p>
     <fig id="fig10" position="float">
      <label>Figure 10</label>
      <graphic mimetype="image" position="float" xlink:type="simple" xlink:href="https://html.scirp.org/file/1732833-rId32.jpeg?20240930044935" />
     </fig>
    </sec>
   <sec id="s3_4">
    <title>4.2. Rellis-3D Inference Speed and Memory Usage</title>
    <p>Based on the accuracy results, Swiftnet and the ViT architectures show promising results for real-world use. To determine the viability of each architecture for implementation on an edge device, a baseline comparison of inference speed, inference memory usage, parameter count, and architecture size is first performed on a large GPU.</p>
    <p>All inference results presented in <xref ref-type="table" rid="table2">
      Table 2
     </xref> were measured using an NVIDIA V100 GPU (specifications shown in Appendix <xref ref-type="table" rid="tableA1">
      Table A1
     </xref>).</p>
    <table-wrap id="table1">
      <label>Table 2</label>
      <caption>
       <title>Table 2. Architecture Inference Speed and Memory Usage on Rellis-3D.</title>
     </caption>
     <table class="MsoTableGrid custom-table" border="0" cellspacing="0" cellpadding="0"> 
      <tr> 
       <td class="custom-bottom-td acenter" width="19.99%"><p style="text-align:center">Architecture</p></td> 
       <td class="custom-bottom-td acenter" width="20.00%"><p style="text-align:center">Inference Speed</p></td> 
       <td class="custom-bottom-td acenter" width="20.00%"><p style="text-align:center">Parameters</p></td> 
       <td class="custom-bottom-td acenter" width="20.00%"><p style="text-align:center">Architecture Size</p></td> 
       <td class="custom-bottom-td acenter" width="20.00%"><p style="text-align:center">Inference Memory Usage</p></td> 
      </tr> 
      <tr> 
       <td class="custom-top-td acenter" width="19.99%"><p style="text-align:center">EfficientViT</p></td> 
       <td class="custom-top-td acenter" width="20.00%"><p style="text-align:center">11.53 ms</p></td> 
       <td class="custom-top-td acenter" width="20.00%"><p style="text-align:center">0.7 M</p></td> 
       <td class="custom-top-td acenter" width="20.00%"><p style="text-align:center">2.76 MB</p></td> 
       <td class="custom-top-td acenter" width="20.00%"><p style="text-align:center">392 MB</p></td> 
      </tr> 
      <tr> 
       <td class="acenter" width="19.99%"><p style="text-align:center">Swiftnet</p></td> 
       <td class="acenter" width="20.00%"><p style="text-align:center">23.32 ms</p></td> 
       <td class="acenter" width="20.00%"><p style="text-align:center">12 M</p></td> 
       <td class="acenter" width="20.00%"><p style="text-align:center">46.14 MB</p></td> 
       <td class="acenter" width="20.00%"><p style="text-align:center">746 MB</p></td> 
      </tr> 
      <tr> 
       <td class="acenter" width="19.99%"><p style="text-align:center">Segformer</p></td> 
       <td class="acenter" width="20.00%"><p style="text-align:center">87.86 ms</p></td> 
       <td class="acenter" width="20.00%"><p style="text-align:center">3.7 M</p></td> 
       <td class="acenter" width="20.00%"><p style="text-align:center">14.22 MB</p></td> 
       <td class="acenter" width="20.00%"><p style="text-align:center">2571 MB</p></td> 
      </tr> 
     </table>
    </table-wrap>
    <p>EfficientViT outperformed the other architectures in inference speed. On a V100, it is about twice as fast as the CNN-based Swiftnet (11.53 ms versus 23.32 ms) while using fewer parameters and roughly half as much inference memory (392 MB versus 746 MB). Segformer suffers from slower inference and significantly higher inference memory consumption, likely due to the cost of its attention computation, a notable downside of self-attention on high-resolution images.</p>
    <p>Based on the results from the large GPU study, Swiftnet and EfficientViT are viable for edge device use given their fast inference speed and low memory usage. Only the Swiftnet and EfficientViT architectures were translated to run on the smaller edge device since the results of <xref ref-type="table" rid="table2">
      Table 2
      </xref> show that the Segformer inference time was significantly slower than that of the other two architectures, even with a powerful GPU like a V100. To verify that Swiftnet and EfficientViT maintain their inference speed in a real-time setting, the architectures were tested on an NVIDIA Jetson Xavier AGX edge device (specifications shown in Appendix <xref ref-type="table" rid="tableA2">
       Table A2
      </xref>). After testing these two architectures on the Xavier with the same method as in Listing 1, the NVIDIA TensorRT engine <xref ref-type="bibr" rid="scirp.136447-37">
       [37]
      </xref> was used to optimize the architectures for inference on the Xavier. EfficientViT strongly outperforms Swiftnet in inference speed on the edge device, as shown in <xref ref-type="table" rid="table3">
       Table 3
      </xref>. Without TensorRT, it is more than 3× faster (114 ms versus 388 ms); with TensorRT, it is about 4× faster (83 ms versus 321 ms). Based on these results, EfficientViT has the traits most desirable for real-world performance: strong segmentation accuracy and substantially faster inference speed than the other architectures studied.</p>
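    <p>The speed comparisons in this section follow directly from the measured latencies. As a small worked example, the sketch below recomputes the speedup ratios from the values reported in Tables 2 and 3; the helper names speedup and frames_per_second are hypothetical.</p>
    <preformat>
```python
# Latencies in milliseconds, copied from Tables 2 and 3.
V100_MS = {"EfficientViT": 11.53, "Swiftnet": 23.32, "Segformer": 87.86}
XAVIER_MS = {
    "EfficientViT": {"plain": 114.0, "tensorrt": 83.0},
    "Swiftnet": {"plain": 388.0, "tensorrt": 321.0},
}


def speedup(slow_ms: float, fast_ms: float) -> float:
    """How many times faster the second latency is than the first."""
    return slow_ms / fast_ms


def frames_per_second(latency_ms: float) -> float:
    """Throughput implied by a per-image latency."""
    return 1000.0 / latency_ms


v100_ratio = speedup(V100_MS["Swiftnet"], V100_MS["EfficientViT"])
plain_ratio = speedup(XAVIER_MS["Swiftnet"]["plain"], XAVIER_MS["EfficientViT"]["plain"])
trt_ratio = speedup(XAVIER_MS["Swiftnet"]["tensorrt"], XAVIER_MS["EfficientViT"]["tensorrt"])

print(round(v100_ratio, 2), round(plain_ratio, 2), round(trt_ratio, 2))
print(round(frames_per_second(XAVIER_MS["EfficientViT"]["tensorrt"]), 2))
```
    </preformat>
    <p>The ratios come out to roughly 2.02 on the V100, 3.40 on the Xavier without optimization, and 3.87 with TensorRT, consistent with the "about twice", "more than 3×", and "about 4×" figures quoted in the text.</p>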
   </sec>
   <sec id="s3_5">
    <title>4.3. CaT Dataset Results</title>
     <table-wrap id="table2">
      <label>Table 3</label>
      <caption>
       <title>Table 3. Inference Speed on Jetson Xavier AGX Edge Device.</title>
      </caption>
      <table class="MsoTableGrid custom-table" border="0" cellspacing="0" cellpadding="0"> 
       <tr> 
        <td class="custom-bottom-td acenter" width="33.34%"><p style="text-align:center">Architecture</p></td> 
        <td class="custom-bottom-td acenter" width="33.33%"><p style="text-align:center">No Optimization</p></td> 
        <td class="custom-bottom-td acenter" width="33.33%"><p style="text-align:center">TensorRT Optimized</p></td> 
       </tr> 
       <tr> 
        <td class="custom-top-td acenter" width="33.34%"><p style="text-align:center">EfficientViT</p></td> 
        <td class="custom-top-td acenter" width="33.33%"><p style="text-align:center">114 ms</p></td> 
        <td class="custom-top-td acenter" width="33.33%"><p style="text-align:center">83 ms</p></td> 
       </tr> 
       <tr> 
        <td class="acenter" width="33.34%"><p style="text-align:center">Swiftnet</p></td> 
        <td class="acenter" width="33.33%"><p style="text-align:center">388 ms</p></td> 
        <td class="acenter" width="33.33%"><p style="text-align:center">321 ms</p></td> 
       </tr> 
      </table>
     </table-wrap>
     <p>Since the results from Rellis-3D show that EfficientViT is the most viable architecture for real-world use, with high inference speed and accuracy, we also train it on the CaT dataset to further test its ability to determine traversable terrain in a different off-road setting. The results for training EfficientViT on the CaT dataset are shown in <xref ref-type="table" rid="table4">
       Table 4
      </xref>. Compared to the state-of-the-art benchmark from the CaT dataset <xref ref-type="bibr" rid="scirp.136447-33">
       [33]
      </xref>, EfficientViT detects the three types of traversable terrain in the off-road environment more accurately. It achieves an mIoU of 94.44%, an increase of 13.87 percentage points over the best CaT benchmark result of 80.57%. Individually, the IoU improves by 6.58 points for sedan traversability, 22.93 points for pickup, and 12.09 points for off-road. The traversability accuracy on both CaT and Rellis-3D, coupled with the high inference speed, indicates that EfficientViT is well suited for real-world use in determining traversable terrain in a perception system.</p>
     <table-wrap id="table3">
      <label>Table 4</label>
      <caption>
       <title>Table 4. IoU (%) Results on CaT Compared to State-of-the-Art Benchmark <xref ref-type="bibr" rid="scirp.136447-33">
         [33]
        </xref>.</title>
      </caption>
      <table class="MsoTableGrid custom-table" border="0" cellspacing="0" cellpadding="0"> 
       <tr> 
        <td class="custom-bottom-td cell-with-diagonal-border aright" width="37.22%"><p style="text-align:right">Classes</p><p style="text-align:left">Architectures</p></td> 
        <td class="custom-bottom-td acenter" width="15.69%"><p style="text-align:center">Sedan</p></td> 
        <td class="custom-bottom-td acenter" width="15.69%"><p style="text-align:center">Pickup</p></td> 
        <td class="custom-bottom-td acenter" width="15.69%"><p style="text-align:center">Off-Road</p></td> 
        <td class="custom-bottom-td acenter" width="15.71%"><p style="text-align:center">mIoU</p></td> 
       </tr> 
       <tr> 
        <td class="custom-top-td acenter" width="37.22%"><p style="text-align:center">PSPNet w/ResNet-18</p></td> 
        <td class="custom-top-td acenter" width="15.69%"><p style="text-align:center">90.44</p></td> 
        <td class="custom-top-td acenter" width="15.69%"><p style="text-align:center">66.62</p></td> 
        <td class="custom-top-td acenter" width="15.69%"><p style="text-align:center">79.71</p></td> 
        <td class="custom-top-td acenter" width="15.71%"><p style="text-align:center">78.92</p></td> 
       </tr> 
       <tr> 
        <td class="acenter" width="37.22%"><p style="text-align:center">PSPNet w/ResNet-34</p></td> 
        <td class="acenter" width="15.69%"><p style="text-align:center">91.21</p></td> 
        <td class="acenter" width="15.69%"><p style="text-align:center">68.64</p></td> 
        <td class="acenter" width="15.69%"><p style="text-align:center">80.52</p></td> 
        <td class="acenter" width="15.71%"><p style="text-align:center">80.12</p></td> 
       </tr> 
       <tr> 
        <td class="acenter" width="37.22%"><p style="text-align:center">PSPNet w/ResNet-50</p></td> 
        <td class="acenter" width="15.69%"><p style="text-align:center">90.70</p></td> 
        <td class="acenter" width="15.69%"><p style="text-align:center">67.40</p></td> 
        <td class="acenter" width="15.69%"><p style="text-align:center">80.00</p></td> 
        <td class="acenter" width="15.71%"><p style="text-align:center">79.36</p></td> 
       </tr> 
       <tr> 
        <td class="acenter" width="37.22%"><p style="text-align:center">PSPNet w/ResNet-101</p></td> 
        <td class="acenter" width="15.69%"><p style="text-align:center">91.64</p></td> 
        <td class="acenter" width="15.69%"><p style="text-align:center">69.08</p></td> 
        <td class="acenter" width="15.69%"><p style="text-align:center">81.00</p></td> 
        <td class="acenter" width="15.71%"><p style="text-align:center">80.57</p></td> 
       </tr> 
       <tr> 
        <td class="acenter" width="37.22%"><p style="text-align:center">EfficientViT</p></td> 
        <td class="acenter" width="15.69%"><p style="text-align:center">98.22</p></td> 
        <td class="acenter" width="15.69%"><p style="text-align:center">92.01</p></td> 
        <td class="acenter" width="15.69%"><p style="text-align:center">93.09</p></td> 
        <td class="acenter" width="15.71%"><p style="text-align:center">94.44</p></td> 
       </tr> 
      </table>
     </table-wrap>
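    <p>The per-class gains over the strongest CaT benchmark entry (PSPNet w/ResNet-101) can be recomputed from the Table 4 values with a few lines of arithmetic. The dictionary and variable names below are hypothetical; the numbers are copied from the table.</p>
    <preformat>
```python
# IoU values (%) copied from Table 4.
BEST_BASELINE = {"sedan": 91.64, "pickup": 69.08, "off_road": 81.00, "mIoU": 80.57}
EFFICIENTVIT = {"sedan": 98.22, "pickup": 92.01, "off_road": 93.09, "mIoU": 94.44}

# Gain in percentage points for each column, rounded to two decimals.
gains = {
    name: round(EFFICIENTVIT[name] - BEST_BASELINE[name], 2)
    for name in BEST_BASELINE
}

# Consistency check of Equation (2): the mean of the three class IoUs.
class_names = ("sedan", "pickup", "off_road")
recomputed_miou = round(sum(EFFICIENTVIT[n] for n in class_names) / len(class_names), 2)

print(gains)
print(recomputed_miou)
```
    </preformat>
    <p>The differences are 6.58, 22.93, and 12.09 percentage points for sedan, pickup, and off-road, and 13.87 points for mIoU; averaging the three EfficientViT class scores reproduces the reported 94.44%.</p>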
   </sec>
  </sec><sec id="s4">
   <title>5. Conclusion</title>
   <p>Using a state-of-the-art ViT architecture, EfficientViT, we demonstrated the viability of a ViT architecture for use in an off-road perception system. Compared to previous results with a CNN architecture, Swiftnet, EfficientViT maintained strong accuracy in off-road environments while having a much faster inference speed. EfficientViT has a 1.9% lower mIoU than Swiftnet on the Rellis-3D dataset, while being 2× as fast for inference on a large GPU and up to 4× as fast on an edge device with TensorRT optimization. Additionally, EfficientViT uses half as much memory for inference as Swiftnet and has a roughly 17× smaller model size (2.76 MB versus 46.14 MB), two traits extremely desirable in real-world systems with limited memory capacity. The use of hardware-efficient attention and efficient convolution operations makes this architecture extremely fast while maintaining strong accuracy with few parameters. These results make EfficientViT a viable option for real-time use in UGV perception systems.</p>
   <p>EfficientViT also demonstrated new state-of-the-art results on the CaT dataset with 94.44% mIoU on traversable terrain. These results further demonstrate the ability of the EfficientViT architecture to determine traversable terrain for a UGV, maintaining high accuracy, fast inference, and low memory usage.</p>
   <p>To add to current developments toward integrating higher levels of autonomy into UGVs, this research provides insights into new methods for improving off-road perception systems. Semantic segmentation architectures that maintain accuracy with a lower memory footprint and higher inference speed can alleviate latency and memory bottlenecks within the perception system, allowing vehicles to make safe decisions in real time.</p>
   <sec id="s4_1">
    <title>Future Work</title>
    <p>Perception systems may deploy a variety of sensors including RADAR, LiDAR, FLIR, multispectral and stereo images. Combinations and fusions of these sensor modalities can lead to a richer understanding of the surrounding environment, for example providing depth/distances for contextual information. In future work, the use and adaptation of ViT architectures with these additional sensor modalities for enriched perception will be explored.</p>
    <p>With power and physical space restrictions common on autonomous vehicles, perception data can be transferred to external devices to meet increased computation demands. Offloading data for processing can introduce new challenges, since restricted transfer bandwidth requires data manipulation to maintain high processing speeds and reduce latency.</p>
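    <p>As a rough illustration of why restricted bandwidth forces data reduction when offloading, the transfer time of a single frame can be estimated from its size and the link rate. All values below are assumptions chosen for illustration (a 1920 × 1200 RGB frame, a 100 Mbit/s link, a 10:1 size reduction); none are measurements from this work, and protocol overhead is ignored.</p>
    <preformat>
```python
def transfer_seconds(payload_bytes, link_bits_per_second):
    """Time to send one payload over a link, ignoring protocol overhead."""
    return payload_bytes * 8 / link_bits_per_second


# Hypothetical values, chosen only for illustration.
WIDTH, HEIGHT, CHANNELS = 1920, 1200, 3   # one uncompressed 8-bit RGB frame
LINK_BITS_PER_SECOND = 100e6              # assumed 100 Mbit/s link
REDUCTION_RATIO = 10                      # assumed 10:1 size reduction

raw_bytes = WIDTH * HEIGHT * CHANNELS
raw_seconds = transfer_seconds(raw_bytes, LINK_BITS_PER_SECOND)
reduced_seconds = transfer_seconds(raw_bytes / REDUCTION_RATIO, LINK_BITS_PER_SECOND)

print(f"raw frame: {raw_bytes} bytes, {raw_seconds:.3f} s per frame")
print(f"reduced frame: {reduced_seconds:.3f} s per frame")
```
    </preformat>
    <p>With these assumed values, an uncompressed frame takes roughly 0.55 s to transfer, which would dominate the latency of a real-time perception loop, while a 10:1 reduction brings the transfer to roughly 0.055 s.</p>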
   </sec>
  </sec><sec id="s5">
   <title>Acknowledgements</title>
   <p>DISTRIBUTION STATEMENT A. Approved for public release; distribution is unlimited. OPSEC#8920.</p>
   <p>This work was supported by the Virtual Prototyping of Autonomy Enabled Ground Systems (VIPR-GS), a US Army Center of Excellence for modeling and simulation of ground vehicles, under Cooperative Agreement W56HZV-21-2-0001 with the US Army DEVCOM Ground Vehicle Systems Center (GVSC).</p>
   <p>This research was also supported by the U.S. National Science Foundation under Grants SHF-1910197, SHF-1943114 and CCF-2312616.</p>
   <p>Clemson University is acknowledged for their generous allotment of compute time on the Palmetto Cluster.</p>
   <p>Clemson Future Computing Technologies Laboratory summer research students Michael Ellis, Adam Niemczura, Precious Eyabi and Ryan Chen are acknowledged for their contributions to this study.</p>
  </sec><sec id="s6">
   <title>Appendix</title>
   <table-wrap id="table4">
    <label>
     <xref ref-type="table" rid="table4">
      Table 4
     </xref></label>
    <caption>
     <title>
      Table A1. V100 inference testing specifications.</title>
    </caption>
    <table class="MsoTableGrid custom-table" border="0" cellspacing="0" cellpadding="0"> 
     <tr> 
      <td class="custom-bottom-td acenter"><p style="text-align:center">GPU Name</p></td> 
      <td class="custom-bottom-td acenter"><p style="text-align:center">NVIDIA Tesla V100</p></td> 
     </tr> 
     <tr> 
      <td class="custom-top-td acenter"><p style="text-align:center">Power Cap</p></td> 
      <td class="custom-top-td acenter"><p style="text-align:center">250 W</p></td> 
     </tr> 
     <tr> 
      <td class="acenter"><p style="text-align:center">CUDA Cores</p></td> 
      <td class="acenter"><p style="text-align:center">5120</p></td> 
     </tr> 
     <tr> 
      <td class="acenter"><p style="text-align:center">GPU Memory</p></td> 
      <td class="acenter"><p style="text-align:center">16 GB (GPU dedicated)</p></td> 
     </tr> 
     <tr> 
      <td class="acenter"><p style="text-align:center">CUDA Version</p></td> 
      <td class="acenter"><p style="text-align:center">12.4</p></td> 
     </tr> 
     <tr> 
      <td class="acenter"><p style="text-align:center">Python Version</p></td> 
      <td class="acenter"><p style="text-align:center">3.11.4</p></td> 
     </tr> 
     <tr> 
      <td class="acenter"><p style="text-align:center">PyTorch Version</p></td> 
      <td class="acenter"><p style="text-align:center">2.1.0</p></td> 
     </tr> 
     <tr> 
      <td class="acenter"><p style="text-align:center">Torchvision Version</p></td> 
      <td class="acenter"><p style="text-align:center">0.16.0</p></td> 
     </tr> 
    </table>
   </table-wrap>
   <table-wrap id="table5">
    <label>
     <xref ref-type="table" rid="table5">
      Table 5
     </xref></label>
    <caption>
     <title>
      Table A2. Jetson Xavier testing specifications.</title>
    </caption>
    <table class="MsoTableGrid custom-table" border="0" cellspacing="0" cellpadding="0"> 
     <tr> 
      <td class="custom-bottom-td acenter"><p style="text-align:center">Device Name</p></td> 
      <td class="custom-bottom-td acenter"><p style="text-align:center">Jetson AGX Xavier</p></td> 
     </tr> 
     <tr> 
      <td class="custom-top-td acenter"><p style="text-align:center">Power Cap</p></td> 
      <td class="custom-top-td acenter"><p style="text-align:center">15 W</p></td> 
     </tr> 
     <tr> 
      <td class="acenter"><p style="text-align:center">CUDA Cores</p></td> 
      <td class="acenter"><p style="text-align:center">512</p></td> 
     </tr> 
     <tr> 
      <td class="acenter"><p style="text-align:center">GPU Memory</p></td> 
      <td class="acenter"><p style="text-align:center">32 GB (shared)</p></td> 
     </tr> 
     <tr> 
      <td class="acenter"><p style="text-align:center">CUDA Version</p></td> 
      <td class="acenter"><p style="text-align:center">11.8</p></td> 
     </tr> 
     <tr> 
      <td class="acenter"><p style="text-align:center">Python Version</p></td> 
      <td class="acenter"><p style="text-align:center">3.8.0</p></td> 
     </tr> 
     <tr> 
      <td class="acenter"><p style="text-align:center">PyTorch Version</p></td> 
      <td class="acenter"><p style="text-align:center">2.0.0</p></td> 
     </tr> 
     <tr> 
      <td class="acenter"><p style="text-align:center">Torchvision Version</p></td> 
      <td class="acenter"><p style="text-align:center">0.15.0</p></td> 
     </tr> 
    </table>
   </table-wrap>
  </sec><sec id="s7">
   <title>NOTES</title>
   <p>*Indicates Equal Contribution.</p>
  </sec>
 </body><back>
  <ref-list>
   <title>References</title>
   <ref id="scirp.136447-ref1">
    <label>1</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Akhil, K.P., Manikutty, G., Ravindran, R. and Rao, R.B. (2019) Autonomous Navigation of an Unmanned Ground Vehicle for Soil Pollution Monitoring. 2019 2nd International Conference on Intelligent Computing, Instrumentation and Control Technologies, Kannur, 5-6 July 2019, 1563-1567. https://doi.org/10.1109/icicict46008.2019.8993292
    </mixed-citation>
   </ref>
   <ref id="scirp.136447-ref2">
    <label>2</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Long, L.N., Hanford, S.D., Janrathitikarn, O., Sinsley, G.L. and Miller, J.A. (2007) A Review of Intelligent Systems Software for Autonomous Vehicles. 2007 IEEE Symposium on Computational Intelligence in Security and Defense Applications, Honolulu, 1-5 April 2007, 69-76. https://doi.org/10.1109/cisda.2007.368137
    </mixed-citation>
   </ref>
   <ref id="scirp.136447-ref3">
    <label>3</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Islam, F., Nabi, M.M. and Ball, J.E. (2022) Off-Road Detection Analysis for Autonomous Ground Vehicles: A Review. Sensors, 22, Article 8463. https://doi.org/10.3390/s22218463
    </mixed-citation>
   </ref>
   <ref id="scirp.136447-ref4">
    <label>4</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Rosique, F., Navarro, P.J., Fernández, C. and Padilla, A. (2019) A Systematic Review of Perception System and Simulators for Autonomous Vehicles Research. Sensors, 19, Article 648. https://doi.org/10.3390/s19030648
    </mixed-citation>
   </ref>
   <ref id="scirp.136447-ref5">
    <label>5</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Pavitha, P.P., Rekha, K.B. and Safinaz, S. (2021) Perception System in Autonomous Vehicle: A Study on Contemporary and Forthcoming Technologies for Object Detection in Autonomous Vehicles. 2021 International Conference on Forensics, Analytics, Big Data, Security, Bengaluru, 21-22 December 2021, 1-6. https://doi.org/10.1109/fabs52071.2021.9702569
    </mixed-citation>
   </ref>
   <ref id="scirp.136447-ref6">
    <label>6</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Redmon, J. and Farhadi, A. (2018) YOLOv3: An Incremental Improvement.
    </mixed-citation>
   </ref>
   <ref id="scirp.136447-ref7">
    <label>7</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Chen, L.-C., Papandreou, G., Schroff, F. and Adam, H. (2017) Rethinking Atrous Convolution for Semantic Image Segmentation.
    </mixed-citation>
   </ref>
   <ref id="scirp.136447-ref8">
    <label>8</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Ketkar, N. and Moolayil, J. (2021) Convolutional Neural Networks. In: Ketkar, N. and Moolayil, J., Eds., Deep Learning with Python, Apress, 197-242. https://doi.org/10.1007/978-1-4842-5364-9_6
    </mixed-citation>
   </ref>
   <ref id="scirp.136447-ref9">
    <label>9</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Krizhevsky, A., Sutskever, I. and Hinton, G.E. (2017) ImageNet Classification with Deep Convolutional Neural Networks. Communications of the ACM, 60, 84-90. https://doi.org/10.1145/3065386
    </mixed-citation>
   </ref>
   <ref id="scirp.136447-ref10">
    <label>10</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Simonyan, K. and Zisserman, A. (2014) Very Deep Convolutional Networks for Large-Scale Image Recognition. 
    </mixed-citation>
   </ref>
   <ref id="scirp.136447-ref11">
    <label>11</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     He, K., Zhang, X., Ren, S. and Sun, J. (2016) Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 27-30 June 2016, 770-778. https://doi.org/10.1109/cvpr.2016.90
    </mixed-citation>
   </ref>
   <ref id="scirp.136447-ref12">
    <label>12</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Khan, S.H., Naseer, M., Hayat, M., Zamir, S.W., Khan, F.S. and Shah, M. (2021) Transformers in Vision: A Survey.
    </mixed-citation>
   </ref>
   <ref id="scirp.136447-ref13">
    <label>13</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. (2020) An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.
    </mixed-citation>
   </ref>
   <ref id="scirp.136447-ref14">
    <label>14</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., et al. (2016) The Cityscapes Dataset for Semantic Urban Scene Understanding. 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 27-30 June 2016, 3213-3223. https://doi.org/10.1109/cvpr.2016.350
    </mixed-citation>
   </ref>
   <ref id="scirp.136447-ref15">
    <label>15</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A. and Torralba, A. (2017) Scene Parsing through ADE20K Dataset. 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 21-26 July 2017, 5122-5130. https://doi.org/10.1109/cvpr.2017.544
    </mixed-citation>
   </ref>
   <ref id="scirp.136447-ref16">
    <label>16</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Szabó, L. and Weltsch, Z. (2024) A Comprehensive Review of Existing Datasets for Off-Road Autonomous Vehicles. 2024 IEEE 22nd World Symposium on Applied Machine Intelligence and Informatics, Stará Lesná, 25-27 January 2024, 403-410. https://doi.org/10.1109/sami60510.2024.10432820
    </mixed-citation>
   </ref>
   <ref id="scirp.136447-ref17">
    <label>17</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Chen, L., Zhu, Y., Papandreou, G., Schroff, F. and Adam, H. (2018) Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In: Ferrari, V., Hebert, M., Sminchisescu, C. and Weiss, Y., Eds., Computer Vision, ECCV 2018, Springer, 833-851. https://doi.org/10.1007/978-3-030-01234-2_49
    </mixed-citation>
   </ref>
   <ref id="scirp.136447-ref18">
    <label>18</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Oršić, M. and Šegvić, S. (2021) Efficient Semantic Segmentation with Pyramidal Fusion. Pattern Recognition, 110, Article 107611. https://doi.org/10.1016/j.patcog.2020.107611
    </mixed-citation>
   </ref>
   <ref id="scirp.136447-ref19">
    <label>19</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Cai, H., Li, J., Hu, M., Gan, C. and Han, S. (2023) EfficientViT: Lightweight Multi-Scale Attention for High-Resolution Dense Prediction. 2023 IEEE/CVF International Conference on Computer Vision, Paris, 1-6 October 2023, 17256-17267. https://doi.org/10.1109/iccv51070.2023.01587
    </mixed-citation>
   </ref>
   <ref id="scirp.136447-ref20">
    <label>20</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Xie, E., Wang, W., Yu, Z., Anandkumar, A., et al. (2021) SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers.
    </mixed-citation>
   </ref>
   <ref id="scirp.136447-ref21">
    <label>21</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Hofmarcher, M., Unterthiner, T., Arjona-Medina, J., Klambauer, G., Hochreiter, S. and Nessler, B. (2019) Visual Scene Understanding for Autonomous Driving Using Semantic Segmentation. In: Samek, W., Montavon, G., Vedaldi, A., Hansen, L.K. and Müller, K.R., Eds., Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, Springer, 285-296. https://doi.org/10.1007/978-3-030-28954-6_15
    </mixed-citation>
   </ref>
   <ref id="scirp.136447-ref22">
    <label>22</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Ma, Y., Wang, Z., Yang, H. and Yang, L. (2020) Artificial Intelligence Applications in the Development of Autonomous Vehicles: A Survey. IEEE/CAA Journal of Automatica Sinica, 7, 315-329. https://doi.org/10.1109/jas.2020.1003021
    </mixed-citation>
   </ref>
   <ref id="scirp.136447-ref23">
    <label>23</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Garcia-Garcia, A., Orts-Escolano, S., Oprea, S., Villena-Martinez, V., Martinez-Gonzalez, P. and Garcia-Rodriguez, J. (2018) A Survey on Deep Learning Techniques for Image and Video Semantic Segmentation. Applied Soft Computing, 70, 41-65. https://doi.org/10.1016/j.asoc.2018.05.018
    </mixed-citation>
   </ref>
   <ref id="scirp.136447-ref24">
    <label>24</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Redmon, J., Divvala, S., Girshick, R. and Farhadi, A. (2016) You Only Look Once: Unified, Real-Time Object Detection. 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 27-30 June 2016, 779-788. https://doi.org/10.1109/cvpr.2016.91
    </mixed-citation>
   </ref>
   <ref id="scirp.136447-ref25">
    <label>25</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Howard, A.G., Zhu, M., Chen, B., et al. (2017) MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications.
    </mixed-citation>
   </ref>
   <ref id="scirp.136447-ref26">
    <label>26</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Vaswani, A., Shazeer, N., Parmar, N., et al. (2017) Attention Is All You Need.
    </mixed-citation>
   </ref>
   <ref id="scirp.136447-ref27">
    <label>27</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Tay, Y., Dehghani, M., Bahri, D. and Metzler, D. (2020) Efficient Transformers: A Survey. 
    </mixed-citation>
   </ref>
   <ref id="scirp.136447-ref28">
    <label>28</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Faykus, M.H., Selee, B. and Smith, M. (2023) Utilizing Neural Networks for Semantic Segmentation on RGB/LIDAR Fused Data for Off-Road Autonomous Military Vehicle Perception. https://doi.org/10.4271/2023-01-0740
    </mixed-citation>
   </ref>
   <ref id="scirp.136447-ref29">
    <label>29</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Chen, L., Zhu, Y., Papandreou, G., Schroff, F. and Adam, H. (2018) Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In: Ferrari, V., Hebert, M., Sminchisescu, C. and Weiss, Y., Eds., Computer Vision, ECCV 2018, Springer, 833-851. https://doi.org/10.1007/978-3-030-01234-2_49
    </mixed-citation>
   </ref>
   <ref id="scirp.136447-ref30">
    <label>30</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Selee, B., Faykus, M. and Smith, M. (2023) Semantic Segmentation with High Inference Speed in Off-Road Environments. https://doi.org/10.4271/2023-01-0868
    </mixed-citation>
   </ref>
   <ref id="scirp.136447-ref31">
    <label>31</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Russakovsky, O., Deng, J., Su, H., Krause, J., et al. (2014) ImageNet Large Scale Visual Recognition Challenge.
    </mixed-citation>
   </ref>
   <ref id="scirp.136447-ref32">
    <label>32</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Jiang, P., Osteen, P., Wigness, M. and Saripalli, S. (2021) RELLIS-3D Dataset: Data, Benchmarks and Analysis. 2021 IEEE International Conference on Robotics and Automation, Xi’an, 30 May-5 June 2021, 1110-1116. https://doi.org/10.1109/icra48506.2021.9561251
    </mixed-citation>
   </ref>
   <ref id="scirp.136447-ref33">
    <label>33</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Sharma, S., Dabbiru, L., Hannis, T., Mason, G., Carruth, D.W., Doude, M., et al. (2022) CaT: CAVS Traversability Dataset for Off-Road Autonomous Driving. IEEE Access, 10, 24759-24768. https://doi.org/10.1109/access.2022.3154419
    </mixed-citation>
   </ref>
   <ref id="scirp.136447-ref34">
    <label>34</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Loshchilov, I. and Hutter, F. (2017) Fixing Weight Decay Regularization in Adam.
    </mixed-citation>
   </ref>
   <ref id="scirp.136447-ref35">
    <label>35</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Loshchilov, I. and Hutter, F. (2017) SGDR: Stochastic Gradient Descent with Warm Restarts. International Conference on Learning Representations, Toulon, 24-26 April 2017.
    </mixed-citation>
   </ref>
   <ref id="scirp.136447-ref36">
    <label>36</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Yang, S., Xiao, W., Zhang, M., Guo, S., Zhao, J. and Shen, F. (2023) Image Data Augmentation for Deep Learning: A Survey.
    </mixed-citation>
   </ref>
   <ref id="scirp.136447-ref37">
    <label>37</label>
    <mixed-citation publication-type="other" xlink:type="simple">
     Zhou, Y. and Yang, K. (2022) Exploring TensorRT to Improve Real-Time Inference for Deep Learning. 2022 IEEE 24th Int Conf on High Performance Computing &amp; Communications; 8th Int Conf on Data Science &amp; Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud &amp; Big Data Systems &amp; Application (HPCC/DSS/SmartCity/DependSys), Hainan, 18-20 December 2022, 2011-2018.
    </mixed-citation>
   </ref>
  </ref-list>
 </back>
</article>