<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.4 20241031//EN" "JATS-journalpublishing1-4.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article" dtd-version="1.4" xml:lang="en">
  <front>
    <journal-meta>
      <journal-id journal-id-type="publisher-id">jis</journal-id>
      <journal-title-group>
        <journal-title>Journal of Information Security</journal-title>
      </journal-title-group>
      <issn pub-type="epub">2153-1242</issn>
      <issn pub-type="ppub">2153-1234</issn>
      <publisher>
        <publisher-name>Scientific Research Publishing</publisher-name>
      </publisher>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.4236/jis.2026.173012</article-id>
      <article-id pub-id-type="publisher-id">jis-151912</article-id>
      <article-categories>
        <subj-group>
          <subject>Article</subject>
        </subj-group>
        <subj-group>
          <subject>Computer Science</subject>
          <subject>Communications</subject>
        </subj-group>
      </article-categories>
      <title-group>
        <article-title>Deep Reinforcement Learning for Phishing Detection with Transformer-Based Semantic Features</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name name-style="western">
            <surname>Faisal</surname>
            <given-names>Aseer Al</given-names>
          </name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <name name-style="western">
            <surname>Rahman</surname>
            <given-names>Atiqur</given-names>
          </name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
      </contrib-group>
      <aff id="aff1"><label>1</label> Department of Electrical and Computer Engineering, North South University, Dhaka, Bangladesh </aff>
      <author-notes>
        <fn fn-type="conflict" id="fn-conflict">
          <p>The authors declare no conflicts of interest regarding the publication of this paper.</p>
        </fn>
      </author-notes>
      <pub-date pub-type="epub">
        <day>27</day>
        <month>05</month>
        <year>2026</year>
      </pub-date>
      <pub-date pub-type="collection">
        <month>05</month>
        <year>2026</year>
      </pub-date>
      <volume>17</volume>
      <issue>03</issue>
      <fpage>221</fpage>
      <lpage>242</lpage>
      <history>
        <date date-type="received">
          <day>11</day>
          <month>03</month>
          <year>2026</year>
        </date>
        <date date-type="accepted">
          <day>14</day>
          <month>06</month>
          <year>2026</year>
        </date>
        <date date-type="published">
          <day>17</day>
          <month>06</month>
          <year>2026</year>
        </date>
      </history>
      <permissions>
        <copyright-statement>© 2026 by the authors and Scientific Research Publishing Inc.</copyright-statement>
        <copyright-year>2026</copyright-year>
        <license license-type="open-access">
          <license-p> This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license ( <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</ext-link> ). </license-p>
        </license>
      </permissions>
      <self-uri content-type="doi" xlink:href="https://doi.org/10.4236/jis.2026.173012">https://doi.org/10.4236/jis.2026.173012</self-uri>
      <abstract>
        <p>Phishing is a form of cybercrime in which people are deceived into exposing personal information, which can result in financial loss. These attacks are often executed via fraudulent messages, misleading advertisements and compromised legitimate websites. This study proposes a framework based on the Quantile Regression Deep Q-Network (QR-DQN) that integrates RoBERTa semantic embeddings and crafted lexical features to enhance phishing detection. Instead of predicting mean returns, QR-DQN uses quantile regression to model the distribution over returns; combined with semantic embeddings, this improves stability and generalization to previously unseen phishing samples over traditional DQN approaches. A diverse, custom-crawled dataset of 105,000 URLs was curated from PhishTank, OpenPhish, Cloudflare and other sources. The framework uses an 80/20 train-test split of the dataset. The QR-DQN model with RoBERTa embeddings and lexical features achieved a test accuracy of 99.86%, precision of 99.75%, recall of 99.96% and F1-score of 99.85%. Compared to the standard DQN with lexical features, the proposed QR-DQN framework with lexical and semantic features lowers the generalization gap from 1.66% to 0.04%. Experiments using 5-fold cross-validation yielded consistent results under this protocol, with a mean accuracy of 99.90% and a standard deviation of 0.04%. These results indicate that the hybrid technique, which combines quantile-based value estimation with RoBERTa semantic embeddings and lexical features, delivers strong performance and a reduced generalization gap.</p>
      </abstract>
      <kwd-group kwd-group-type="author-generated" xml:lang="en">
        <kwd>Phishing Detection</kwd>
        <kwd>Deep Reinforcement Learning</kwd>
        <kwd>RoBERTa Semantic Embeddings</kwd>
        <kwd>Quantile Regression Deep Q-Network</kwd>
        <kwd>QR-DQN</kwd>
        <kwd>URL Classification</kwd>
        <kwd>Lexical Features</kwd>
        <kwd>Cybersecurity</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec1">
      <title>1. Introduction</title>
      <p>The modern internet faces persistent danger from evolving phishing threats, which are consistently identified as one of the most critical attack vectors. What used to be a straightforward trick has quickly transformed into a complex, multifaceted cybersecurity threat. Phishing coerces users into revealing sensitive information, including credentials and financial details. The Cybercrime Information Center detected nearly 1M phishing incidents between November 2023 and January 2024 [<xref ref-type="bibr" rid="B1">1</xref>]. Phishing is executed by cybercriminals who understand human psychology and exploit greed, gullibility and curiosity. The manipulation technique that lures people into disclosing private and sensitive information is known as social engineering, and it is one of the most common ways to obtain confidential information. People’s tendency to make mistakes or to trust too easily makes them a prime target for attacks that result in security breaches. These attacks deliberately apply psychological pressure, urging victims to act immediately, manipulating trust and exploiting the tendency to comply with authority. This makes social engineering highly effective, and such attacks account for a significant portion of security breaches [<xref ref-type="bibr" rid="B2">2</xref>]. Manipulation of Uniform Resource Locators (URLs), through hijacking, ad injection attacks and URL spoofing, is one of the main underlying causes of identity theft and financial fraud [<xref ref-type="bibr" rid="B3">3</xref>]. The application of machine learning algorithms to phishing classification, and to security and malware detection more broadly, has gained considerable traction in the research community in recent years.</p>
      <p>The dynamic nature of these attacks, in which defense signatures that were effective yesterday become predictable vulnerabilities today, has revealed a significant weakness in traditional static cybersecurity approaches. Most standard detection algorithms depend on superficial characteristics referred to as engineered lexical features, such as the length of the URL, the count of special characters and the presence of particular keywords. Such features are fragile: small changes by attackers can defeat them, which leads to poor generalization. In practice, a significant difference is often seen between the accuracy achieved in controlled training and the accuracy sustained on unseen data and varied attack patterns. In machine learning this limitation is called the train-test generalization gap (Ggap); a large gap signifies that the detection policy has memorized the specific patterns of known attack instances in the training data rather than learning transferable concepts of phishing URLs. Deep reinforcement learning (DRL) has emerged as a paradigm for tackling cybersecurity challenges, especially via algorithms like DQN and DDQN, which allow the agent to learn adaptive policies [<xref ref-type="bibr" rid="B4">4</xref>][<xref ref-type="bibr" rid="B5">5</xref>]. Existing approaches, however, show limitations in handling uncertainty and in generalization. Unlike prior RL approaches that rely exclusively on engineered lexical features and content-based representations, our work introduces the Quantile Regression Deep Q-Network (QR-DQN) for distributional reinforcement learning. The framework combines frozen RoBERTa transformer embeddings (768-dimensional semantic representations) with lexical features, which allows the agent to capture a richer semantic interpretation of malicious intent, including domain impersonation, path obfuscation and brand mimicry that transcend superficial pattern matching, and to generalize beyond the training distribution. This hybrid approach, combined with distributional value learning, achieves better generalization than standard DQN on identical features through uncertainty-aware, semantically informed policy learning.</p>
      <p>Previous studies applying Deep Reinforcement Learning (DRL) to URL state representations have suffered from limited generalization and non-smooth learning because they rely on sub-optimal features [<xref ref-type="bibr" rid="B6">6</xref>]. Giving the reinforcement learning (RL) state richer contextual representations lets the agent learn strategies that generalize and defend against various kinds of phishing attacks. RoBERTa’s bidirectional contextual encoding analyzes each URL string and produces a 768-dimensional semantic embedding, which is combined with 50 engineered lexical features into a hybrid state representation. This representation lets the QR-DQN agent pair deep contextual embeddings with statistical pattern recognition, so that it learns to spot phishing URLs that were not seen in training. The architecture applies QR-DQN with a Q-network of hidden units, soft target network decoupling, experience replay buffering and gradient clipping [<xref ref-type="bibr" rid="B7">7</xref>]. The integration of semantic URL representations supports stronger detection of novel and complex obfuscation strategies [<xref ref-type="bibr" rid="B8">8</xref>].</p>
    </sec>
    <sec id="sec2">
      <title>2. Related Works</title>
      <p>As devices and software systems have become more advanced, phishing has grown in complexity at a fast pace. A variety of detection approaches are discussed below.</p>
      <sec id="sec2dot1">
        <title>2.1. Traditional Machine Learning Approaches</title>
        <p>A variety of machine learning techniques have been applied to phishing detection in recent years with promising results, including logistic regression, decision trees, support vector machines and neural networks. The support vector machine (SVM) establishes a boundary that separates phishing and legitimate URLs. SVM reduces overfitting by maximizing the margin between classes and performs well on training data, but can struggle with newer types of threats. Random Forest (RF) builds many decision trees and combines their outputs to make predictions, which makes it more resilient than many other ML techniques. If the dataset contains random noise, however, it can build trees that focus too much on this noise, which lowers prediction accuracy. K-Nearest Neighbour (KNN) is fast to train, but prediction can become quite slow as the dataset grows. KNN struggles in high-dimensional spaces, losing accuracy as the feature count increases, and it generalizes poorly because it primarily memorizes the training data. It is also highly sensitive to noisy data and is best suited to small-scale applications. Kumar <italic>et al.</italic> (2024) [<xref ref-type="bibr" rid="B9">9</xref>] investigate the integration of machine learning algorithms such as logistic regression and decision trees into phishing detection systems. These models used a wide range of features, including lexical/textual features, host-based website characteristics and content-based attributes, to improve phishing classification. Trained on labeled datasets, they achieved better detection rates than traditional rule-based models. Despite these improvements, the models faced considerable limitations: they relied on static data, had difficulty adapting quickly to new or unfamiliar phishing attempts, and lacked the ability to learn from phishing techniques in real time.</p>
      </sec>
      <sec id="sec2dot2">
        <title>2.2. BERT-Based Approaches</title>
        <p>Recent developments in transformer-based deep learning models have transformed the detection of malicious URLs. In particular, Bidirectional Encoder Representations from Transformers (BERT) uses self-attention to recognize semantic links among character- and word-level tokens in URL strings [<xref ref-type="bibr" rid="B6">6</xref>]. Compared to conventional lexical or host-based feature extraction methods, BERT-based systems attain a richer semantic understanding because of this bidirectional context modeling. The ability to recognize patterns of semantic maliciousness that evade simple lexical heuristics gives BERT an advantage over traditional machine learning techniques. BERT learns contextual representations end-to-end from large URL datasets, whereas SVM or Random Forest techniques rely on manually created features. The caveat is that pure semantic models like BERT have difficulty capturing numeric or aggregate features such as URL length or the number of special characters. These features can be important indicators of phishing, since phishing URLs often show unusual lengths or symbol patterns [<xref ref-type="bibr" rid="B6">6</xref>][<xref ref-type="bibr" rid="B10">10</xref>]. BERT operates on relationships between individual tokens and does not naturally compute statistics over the whole sequence. Advanced transformers can be engineered to do so, but this is not their standard capability. These models excel at learning deep semantic context from text, but their limitations can reduce detection performance when numeric, categorical and engineered features are critical for distinguishing phishing from safe URLs. Our approach uses a hybrid state representation that combines manually engineered structural and lexical features with transformer-based semantic embeddings from a BERT variant. The reinforcement learning (RL) agent receives a feature vector that contains both the transformer-based contextual information and explicit numeric, structural and statistical features such as URL length, special character counts and other known indicators of phishing. The agent can thus use both high-level semantic understanding and low-level statistical cues to improve detection and generalization beyond what either feature type alone could achieve.</p>
      </sec>
      <sec id="sec2dot3">
        <title>2.3. Reinforcement Learning Approaches</title>
        <p>Several studies have used reinforcement learning techniques to address phishing detection. Early explorations in this direction used Deep Reinforcement Learning (DRL), in which a self-learning agent is trained to identify malicious URLs by learning both a value function and a classification policy. Chatterjee <italic>et al.</italic> [<xref ref-type="bibr" rid="B11">11</xref>] proposed an early DRL framework that uses a Deep Q-Network (DQN) to model phishing URL detection as a sequential decision process. It depends on 14 crafted lexical features, such as URL length, IP presence and subdomain count, and achieves 90.1% accuracy. Because it relies solely on handcrafted URL-based attributes, the approach is vulnerable to evasion: attackers can alter surface-level attributes without changing their core malicious objectives. The application of DRL to intrusion detection is examined in [<xref ref-type="bibr" rid="B12">12</xref>], which compares the performance of DQN, DDQN, policy gradient and actor-critic methods against various machine learning (ML) algorithms on the NSL-KDD and AWID datasets. The reported scores show that DDQN outperforms the other DRL-based algorithms. Compared to methods like DDQN, distributional RL methods like QR-DQN offer better convergence properties and robustness in cybersecurity contexts, especially in uncertain scenarios [<xref ref-type="bibr" rid="B13">13</xref>]. Through trial, error and penalties, an RL agent tries different actions and continuously adapts its detection process using feedback from the results, which makes it well suited to real-time environments where new threats are constantly evolving. Our proposed RL method can learn more effectively from environment interactions when semantic embeddings from a BERT-style model are combined with lexical features for phishing detection. By integrating semantic embeddings with explicit lexical and numeric features, hybrid RL architectures give agents both contextual depth and structural, statistical information about URLs, which enables more nuanced and adaptive decision policies.</p>
      </sec>
    </sec>
    <sec id="sec3">
      <title>3. Methodology</title>
      <p>This paper introduces a framework that combines RoBERTa’s semantic embeddings with lexical analysis of URLs for phishing detection. The framework is designed for adaptability, generalization and transfer across different attack vectors. It addresses generalization and adaptability challenges in phishing detection, as evolving attack techniques can reduce the effectiveness of existing detection systems on previously unseen phishing websites [<xref ref-type="bibr" rid="B14">14</xref>]. Conventional supervised methods treat phishing detection as a static classification problem in which models are trained on labeled datasets and deployed with fixed parameters [<xref ref-type="bibr" rid="B15">15</xref>], while attackers look for newer strategies to bypass such detection systems. The reinforcement learning (RL) framework addresses this through trial, error and penalties, optimizing its decision-making policy for phishing detection and allowing the policy to be updated through reinforcement. In this study, the framework is evaluated offline using a fixed dataset. An effective phishing detection system requires semantic embeddings together with lexical and structural domain characteristics to capture the deeper context of URLs. To capture the semantic context, RoBERTa, a pre-trained language model derived from Bidirectional Encoder Representations from Transformers (BERT) [<xref ref-type="bibr" rid="B16">16</xref>], is used. It models deep bidirectional contextual relationships in text by considering both left and right contexts across all of its layers [<xref ref-type="bibr" rid="B17">17</xref>]. Recent studies show that BERT-based transformer models significantly improve the detection of adversarial manipulations in URLs, such as random character insertions, homoglyph attacks or deceptive use of subdomains, because their deep contextual embeddings go well beyond shallow surface-level features [<xref ref-type="bibr" rid="B18">18</xref>]. As a result, BERT-based models are increasingly preferred for cybersecurity applications. The framework concatenates 768-dimensional semantic embeddings with manually crafted lexical features and feeds them into a Quantile Regression Deep Q-Network (QR-DQN) agent. With this hybrid representation, the RL agent learns a robust policy that can adapt to changing phishing strategies through environment interaction and reward-driven optimization.</p>
      <sec id="sec3dot1">
        <title>3.1. Data Collection and Initial Processing</title>
        <p>The dataset was built by collecting URLs from multiple sources to ensure a comprehensive representation of both malicious and legitimate web content. Phishing URLs were crawled from sources such as PhishTank and OpenPhish, which were chosen for their community validation processes and recognized credibility within the research community. To collect legitimate URLs, Cloudflare’s domain ranking system was used to select top-ranked domains filtered by high traffic volume. Together these form a reliable real-world dataset of 105k URLs. During crawling, the features of each URL were extracted, and the script prints detailed progress showing the URL currently being processed. A set of manipulated URLs was included in the dataset to test how well the detection model handles unusual inputs. These character-level modifications mimic common patterns used in phishing attacks and assess the model’s resilience against evolving tactics. The extraction pipeline handles network-level failures gracefully. Type conversion and null filling are performed for columns that are expected to be Boolean. Some lexical URL features are derived from URL structure and character composition, including length-based and special-character-based properties [<xref ref-type="bibr" rid="B19">19</xref>].</p>
        <p>Additional features may include obfuscation ratios, character continuation rates, percentage encoded segments, non-alphanumeric characters, and log probability of URL character sequences.</p>
        <disp-formula id="FD1">
          <label>(1)</label>
          <mml:math>
            <mml:mrow>
              <mml:mtext>CharLogProb</mml:mtext>
              <mml:mrow>
                <mml:mo>(</mml:mo>
                <mml:mrow>
                  <mml:mtext>URL</mml:mtext>
                </mml:mrow>
                <mml:mo>)</mml:mo>
              </mml:mrow>
              <mml:mo>=</mml:mo>
              <mml:munderover>
                <mml:mstyle mathsize="140%" displaystyle="true">
                  <mml:mo>∑</mml:mo>
                </mml:mstyle>
                <mml:mrow>
                  <mml:mi>i</mml:mi>
                  <mml:mo>=</mml:mo>
                  <mml:mn>1</mml:mn>
                </mml:mrow>
                <mml:mi>L</mml:mi>
              </mml:munderover>
              <mml:mi>log</mml:mi>
              <mml:mi>p</mml:mi>
              <mml:mrow>
                <mml:mo>(</mml:mo>
                <mml:mrow>
                  <mml:msub>
                    <mml:mi>c</mml:mi>
                    <mml:mi>i</mml:mi>
                  </mml:msub>
                </mml:mrow>
                <mml:mo>)</mml:mo>
              </mml:mrow>
            </mml:mrow>
          </mml:math>
        </disp-formula>
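        <p>Equation (1) can be illustrated with a minimal sketch. The character model is not specified in the text, so the sketch assumes a unigram model with add-alpha smoothing fitted on a small, hypothetical list of benign URLs; the function names and sample URLs are illustrative only.</p>
        <code language="python"><![CDATA[
```python
import math
from collections import Counter


def fit_char_model(benign_urls, alpha=1.0):
    """Estimate smoothed unigram character probabilities p(c) (assumed model)."""
    counts = Counter(ch for url in benign_urls for ch in url)
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot shared by unseen characters
    denom = total + alpha * vocab_size

    def prob(ch):
        # Add-alpha smoothing gives unseen characters a small nonzero probability.
        return (counts.get(ch, 0) + alpha) / denom

    return prob


def char_log_prob(url, prob):
    """Equation (1): sum of log p(c_i) over the L characters of the URL."""
    return sum(math.log(prob(ch)) for ch in url)


# Hypothetical benign samples, for illustration only.
benign = [
    "https://example.com/",
    "https://example.org/docs",
    "https://www.example.net/",
]
p = fit_char_model(benign)
typical = char_log_prob("https://example.com/docs", p)
unusual = char_log_prob("https://ex@mple.c0m/%7E%7E", p)
print(f"typical={typical:.2f} unusual={unusual:.2f}")
```
]]></code>
        <p>Under these assumptions, the URL containing characters never seen in the benign sample receives the lower score. Because the raw sum grows more negative with length, a practical variant might also divide by L, although the text does not say whether that is done here.</p>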
        <p>Adjustable delays are added between requests to reduce network failures, and the extracted features are saved in a CSV file. The resulting feature set includes 50 columns covering URL-based, HTML-based and derived phishing indicators, consistent with feature-engineering approaches used for phishing URL detection [<xref ref-type="bibr" rid="B20">20</xref>].</p>
        <table-wrap id="tbl1">
          <label>Table 1</label>
          <table>
            <tbody>
              <tr>
                <td>
                  <bold>Category</bold>
                </td>
                <td>
                  <bold>Features (50 Total)</bold>
                </td>
              </tr>
              <tr>
                <td>URL-Based</td>
                <td>URLLength, DomainLength, IsDomainIP, URLSimilarityIndex, CharContinuationRate, TLDLegitimateProb, URLCharProb, TLDLength, NoOfSubDomain, HasObfuscation, NoOfObfuscatedChar, ObfuscationRatio, NoOfLettersInURL, LetterRatioInURL, NoOfDigitsInURL, DegitRatioInURL, NoOfEqualsInURL, NoOfQMarkInURL, NoOfAmpersandInURL, NoOfOtherSpecialCharsInURL, SpacialCharRatioInURL, IsHTTPS</td>
              </tr>
              <tr>
                <td>HTML Structure</td>
                <td>LineOfCode, LargestLineLength, HasTitle, DomainTitleMatchScore, URLTitleMatchScore, HasFavicon, Robots, IsResponsive, HasDescription</td>
              </tr>
              <tr>
                <td>Redirect &amp; Popup</td>
                <td>NoOfURLRedirect, NoOfSelfRedirect, NoOfPopup, NoOfiFrame</td>
              </tr>
              <tr>
                <td>Form &amp; Link Analysis</td>
                <td>HasExternalFormSubmit, HasSocialNet, HasSubmitButton, HasHiddenFields, HasPasswordField, NoOfSelfRef, NoOfEmptyRef, NoOfExternalRef</td>
              </tr>
              <tr>
                <td>Indicators &amp; Content</td>
                <td>Bank, Pay, Crypto, HasCopyrightInfo, NoOfImage, NoOfCSS, NoOfJS</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>The final training dataset consists of 100,000 URLs, evenly balanced between 50,000 phishing and 50,000 legitimate samples (labels: 1 = phishing, 0 = legitimate). It was constructed by merging a large aggregated corpus of 95,524 URLs with an additional phishing feed of 4476 URLs. Within the aggregated corpus, 45,524 samples are phishing and 50,000 are legitimate, including 2000 legitimate URLs sourced from the Cloudflare Top Sites list and randomly selected URLs from the PhiUSIIL dataset, together with curated benign web sources. The merged data were processed using exact URL-string matching for deduplication, followed by handling of missing or malformed values and min-max normalization; no further filtering or sample exclusion was applied.</p>
        <p>To avoid data leakage, both the 80/20 train-test split and the 5-fold cross-validation were performed at the group level rather than on individual URLs. Before partitioning, each sample was assigned to a canonical group based on its root domain, so that all related URLs and their manipulated variants were treated as one unit. Such groups include copies of the same page, near-identical pages such as mirrors, and adversarial variants. Adversarial variants include character substitutions, token insertions or padding, subdomain rearrangements, homoglyphs and encoded forms. Partitioning was then stratified by group: all samples belonging to the same canonical group were assigned in their entirety to a single split or fold. Thus, variants stemming from the same base URL or domain could not appear in more than one of the training, validation or test partitions. This directly addresses the risk of leakage from near-duplicate or manipulated samples, so that the reported results reflect generalization to unseen URL families rather than memorization of related variants.</p>
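        <p>A group-level split of this kind can be sketched as follows. This is a simplified illustration, not the authors’ pipeline: the root domain is approximated by the last two host labels (an assumption that fails for suffixes such as co.uk), stratification uses each group’s majority label, and the sample URLs are hypothetical.</p>
        <code language="python"><![CDATA[
```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def root_domain(url):
    """Naive canonical group key: the last two host labels (assumption)."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host


def group_split(samples, test_fraction=0.2, seed=0):
    """Split (url, label) pairs so each root-domain group lands in one partition."""
    groups = defaultdict(list)
    for url, label in samples:
        groups[root_domain(url)].append((url, label))

    # Stratify at the group level using each group's majority label.
    by_label = defaultdict(list)
    for key, members in groups.items():
        positives = sum(label for _, label in members)
        by_label[int(2 * positives >= len(members))].append(key)

    rng = random.Random(seed)
    train, test = [], []
    for keys in by_label.values():
        keys.sort()  # deterministic order before shuffling
        rng.shuffle(keys)
        n_test = max(1, round(test_fraction * len(keys)))
        for i, key in enumerate(keys):
            (test if i < n_test else train).extend(groups[key])
    return train, test


# Hypothetical data: 10 phishing and 10 legitimate domains, two URLs each.
samples = []
for i in range(10):
    samples.append((f"http://login.bad{i}.example/verify", 1))
    samples.append((f"http://secure.bad{i}.example/verify?x=1", 1))
    samples.append((f"https://www.site{i}.test/", 0))
    samples.append((f"https://site{i}.test/about", 0))

train, test = group_split(samples)
train_domains = {root_domain(u) for u, _ in train}
test_domains = {root_domain(u) for u, _ in test}
print(len(train), len(test), train_domains.isdisjoint(test_domains))
```
]]></code>
        <p>The fraction is applied to groups rather than samples, so the sample-level ratio is only approximately 80/20 when group sizes vary. The same grouping key could be reused to build cross-validation folds.</p>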
        <p>A subset of engineered features relies on successful HTTP/HTML retrieval of the target website. These are all content- and response-based attributes such as page title features (HasTitle, DomainTitleMatchScore, URLTitleMatchScore), description and robot indicators (HasDescription, Robots), structural elements (NoOfImage, NoOfCSS, NoOfJS, NoOfiFrame, NoOfPopup), form and interaction features (HasExternalFormSubmit, HasSubmitButton, HasHiddenFields, HasPasswordField), hyperlink-based statistics (NoOfSelfRef, NoOfEmptyRef, NoOfExternalRef), and redirect and responsiveness indicators (NoOfURLRedirect, NoOfSelfRedirect, IsResponsive, HasFavicon, HasSocialNet). All other features are directly derived from the URL string with no need for a page fetch. While collecting data, in some instances a very small number of samples could not be retrieved because of network errors, empty HTML responses, and inactive pages. These cases were not discarded; instead, all HTML-dependent features for such samples were deterministically set to 0 to represent the missing content. A zero-imputation strategy was applied to 497 samples (0.50% of the dataset) for which HTML retrieval failed. As part of the training process, features were normalized after zero-imputation, whereby failed retrievals were represented through zero-imputed HTML-dependent features and treated as valid but uninformative feature values. During inference, the same fallback mechanism is applied. If live HTML retrieval fails, prediction proceeds using URL-derived features together with zero-imputed content features to ensure robustness.</p>
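        <p>The fallback and normalization steps can be sketched as below. The sketch uses a small subset of the Table 1 feature names with made-up values; fitting the min-max bounds on training rows only and clipping inference values to [0, 1] are assumptions, since the text does not state them.</p>
        <code language="python"><![CDATA[
```python
# Subset of the Table 1 feature names, chosen for illustration.
HTML_FEATURES = ("HasTitle", "NoOfImage", "HasPasswordField")


def build_row(url_feats, html_feats):
    """Merge URL-derived and HTML-derived features; a failed fetch (None) becomes zeros."""
    row = {name: float(value) for name, value in url_feats.items()}
    for name in HTML_FEATURES:
        row[name] = 0.0 if html_feats is None else float(html_feats[name])
    return row


def fit_min_max(rows):
    """Record per-feature (min, max) bounds from the training rows."""
    return {n: (min(r[n] for r in rows), max(r[n] for r in rows)) for n in rows[0]}


def apply_min_max(row, bounds):
    """Scale each feature to [0, 1]; constant features map to 0."""
    out = {}
    for name, (lo, hi) in bounds.items():
        span = hi - lo
        value = (row[name] - lo) / span if span > 0 else 0.0
        out[name] = min(1.0, max(0.0, value))  # clip values outside the fitted range
    return out


# Made-up rows; the second one simulates a failed HTML retrieval.
train_rows = [
    build_row({"URLLength": 20, "NoOfSubDomain": 1},
              {"HasTitle": 1, "NoOfImage": 12, "HasPasswordField": 0}),
    build_row({"URLLength": 80, "NoOfSubDomain": 3}, None),
    build_row({"URLLength": 50, "NoOfSubDomain": 2},
              {"HasTitle": 1, "NoOfImage": 4, "HasPasswordField": 1}),
]
bounds = fit_min_max(train_rows)
normalized = [apply_min_max(r, bounds) for r in train_rows]
print(normalized[1])
```
]]></code>
        <p>In this sketch the failed retrieval keeps its URL-derived values, while every HTML-dependent feature stays at 0 after scaling, matching the "valid but uninformative" treatment described above.</p>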
      </sec>
      <sec id="sec3dot2">
        <title>3.2. Contextual Feature Engineering and Hybrid State Space</title>
        <p>The framework provides input for the QR-DQN agent by merging pre-computed semantic embeddings with engineered lexical features. This state vector is designed to capture the syntactic irregularities commonly seen in basic phishing along with the semantic cues necessary to detect more advanced hidden threats. We created a custom dataset using feature engineering processes. The processed datasets support URL and content analysis modes and use structured data that includes domain-specific attributes. The main goal of this implementation is to extract structural features by systematically parsing the URL string. A short subdomain and a less common top-level domain are treated as predictive of low trustworthiness. These structural signals are reinforced with lexical analyses that assess the ratio of digits to alphanumeric characters. The framework also checks for suspicious patterns such as IP addresses in domain names or homograph attacks [<xref ref-type="bibr" rid="B21">21</xref>].</p>
        <p>In addition, probabilistic metrics are used to enhance detection. For example, the framework tracks the ratio of repeating characters, which can indicate redundant character repetition used to obfuscate text. A model that estimates the probability distribution of characters assigns a log-probability score to each URL string. The training set for this model consists only of negative items (benign URLs). This probability method identifies sequences that are statistically improbable, which helps detect extreme anomalies and generated phishing URLs. Each URL also undergoes semantic encoding. The URL is tokenized using the RoBERTa Byte-Pair Encoding (BPE) tokenizer with a maximum sequence length of 256 tokens; longer inputs are truncated and the output is converted to tensor format. If a GPU is available, the model runs on the GPU; otherwise, it runs on a CPU. This design reduces computational demand while preserving the strong semantic capability of the pre-trained RoBERTa model.</p>
      </sec>
      <sec id="sec3dot3">
        <title>3.3. Phishing Environment Design</title>
        <p>Phishing detection is formulated as a single-step Markov Decision Process (MDP) within a custom PhishEnv environment built on the Gymnasium framework [<xref ref-type="bibr" rid="B22">22</xref>]. At the core of this environment lies a hybrid state representation that integrates two complementary categories of features. The first category is a set of hand-engineered numerical features, normalized to the range [0, 1] to facilitate stable learning [<xref ref-type="bibr" rid="B23">23</xref>]. The second category is a contextual semantic embedding produced by a pre-trained RoBERTa model. For a given input text, such as a URL or content, the final-layer embedding of the first token (&lt;s&gt;, RoBERTa’s counterpart of BERT’s [CLS]) is extracted, yielding a dense 768-dimensional vector that encodes linguistic and structural patterns. The numerical and semantic vectors are then concatenated into a single state vector, which defines a high-dimensional, continuous observation space.</p>
        <p>Although phishing detection is a one-shot binary decision, it is cast as a single-step MDP so that decisions can be optimized directly under asymmetric misclassification costs. The costs are encoded in the reward function, so the model learns a policy that avoids the high-cost error (e.g., a false negative). Unlike cost-sensitive supervised classifiers, which incorporate such costs indirectly through class weights or a shifted decision threshold, this formulation makes the cost part of the optimization objective itself. Within this formulation, QR-DQN learns a distribution of returns rather than only the expected return. Its quantile-based representation captures the variability and uncertainty of the outcome for each URL. In phishing detection, false negatives are rare but costly and must be penalized more heavily. By modeling the entire return distribution, QR-DQN supports risk-sensitive decision-making that considers the worst-case scenario as well as the mean. Standard classifiers and mean-value estimators do not naturally capture this without additional risk modeling.</p>
        <p>The agent interacts with this state through a discrete action space containing two options corresponding to legitimate (0) or phishing (1). One key improvement of this environment is the deliberately asymmetrical reward function, which simulates the severe real-world consequences of missed phishing attacks. The agent receives a positive reward (+1) for a correct prediction. On the other hand, the penalties for mistakes are carefully calibrated. A false negative, where a phishing site is missed, receives a heavier penalty (−2) to discourage this high-risk outcome. Meanwhile, a false positive, where a legitimate site is classified as phishing, receives a lower penalty (−0.5). This reward framework instructs the learning algorithm to prioritize high recall for the phishing category, which aligns with the security objective of assigning greater cost to missing a threat.</p>
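        <p>The effect of these reward values on the decision rule can be worked out directly. If p denotes the probability that a sample is phishing, flagging it has expected reward 1.5p − 0.5, while passing it as legitimate has expected reward 1 − 3p, so flagging is preferable whenever p exceeds 1/3. The short Python sketch below reproduces this calculation; the function names are illustrative and not taken from the paper.</p>
        <code language="python"><![CDATA[
```python
# Reward values from this section, indexed by (true_label, action);
# 0 = legitimate, 1 = phishing.
REWARD = {(0, 0): 1.0, (1, 1): 1.0, (1, 0): -2.0, (0, 1): -0.5}


def expected_reward(action: int, p_phish: float) -> float:
    """Expected one-step reward of an action when P(phishing) = p_phish."""
    return p_phish * REWARD[(1, action)] + (1.0 - p_phish) * REWARD[(0, action)]


def best_action(p_phish: float) -> int:
    """Action with the higher expected reward (ties resolved as phishing)."""
    flag = expected_reward(1, p_phish)
    allow = expected_reward(0, p_phish)
    return 1 if flag >= allow else 0
```
]]></code>
        <p>Under these rewards, a reward-maximizing policy therefore flags a URL at a lower level of suspicion than the symmetric threshold of 1/2, which is how the asymmetry favors recall.</p>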
        <p>To keep training class-balanced, the environment’s reset function samples the class before it samples an instance: on each reset, it draws an instance from either the legitimate or the phishing category with equal probability (50/50). In addition, for performance and reproducibility, all RoBERTa embeddings are precomputed at initialization and cached to disk under a hash of the dataset. This prevents repeated inference in future runs and speeds up training cycles significantly. Together, these choices create a secure, efficient, and stable training environment that helps the QR-DQN agent develop into a precise and robust phishing detector [<xref ref-type="bibr" rid="B24">24</xref>]. The reward function is expressed as follows:</p>
        <disp-formula id="FD2">
          <label>(2)</label>
          <mml:math display="inline">
            <mml:mrow>
              <mml:mi>R</mml:mi>
              <mml:mrow>
                <mml:mo>(</mml:mo>
                <mml:mrow>
                  <mml:msub>
                    <mml:mi>s</mml:mi>
                    <mml:mi>t</mml:mi>
                  </mml:msub>
                  <mml:mo>,</mml:mo>
                  <mml:msub>
                    <mml:mi>a</mml:mi>
                    <mml:mi>t</mml:mi>
                  </mml:msub>
                </mml:mrow>
                <mml:mo>)</mml:mo>
              </mml:mrow>
              <mml:mo>=</mml:mo>
              <mml:mrow>
                <mml:mo>{</mml:mo>
                <mml:mrow>
                  <mml:mtable columnalign="left">
                    <mml:mtr columnalign="left">
                      <mml:mtd columnalign="left">
                        <mml:mrow>
                          <mml:mo>+</mml:mo>
                          <mml:mn>1.0</mml:mn>
                          <mml:mo>,</mml:mo>
                        </mml:mrow>
                      </mml:mtd>
                      <mml:mtd columnalign="left">
                        <mml:mrow>
                          <mml:mtext>correct classification of legitimate or phishing</mml:mtext>
                          <mml:mo>,</mml:mo>
                        </mml:mrow>
                      </mml:mtd>
                    </mml:mtr>
                    <mml:mtr columnalign="left">
                      <mml:mtd columnalign="left">
                        <mml:mrow>
                          <mml:mo>−</mml:mo>
                          <mml:mn>2.0</mml:mn>
                          <mml:mo>,</mml:mo>
                        </mml:mrow>
                      </mml:mtd>
                      <mml:mtd columnalign="left">
                        <mml:mrow>
                          <mml:mtext>predicted legitimate when the true label is phishing</mml:mtext>
                          <mml:mtext>
                             
                          </mml:mtext>
                          <mml:mrow>
                            <mml:mo>(</mml:mo>
                            <mml:mrow>
                              <mml:mtext>false negative</mml:mtext>
                            </mml:mrow>
                            <mml:mo>)</mml:mo>
                          </mml:mrow>
                          <mml:mo>,</mml:mo>
                        </mml:mrow>
                      </mml:mtd>
                    </mml:mtr>
                    <mml:mtr columnalign="left">
                      <mml:mtd columnalign="left">
                        <mml:mrow>
                          <mml:mo>−</mml:mo>
                          <mml:mn>0.5</mml:mn>
                          <mml:mo>,</mml:mo>
                        </mml:mrow>
                      </mml:mtd>
                      <mml:mtd columnalign="left">
                        <mml:mrow>
                          <mml:mtext>predicted phishing when the true label is legitimate</mml:mtext>
                          <mml:mtext>
                             
                          </mml:mtext>
                          <mml:mrow>
                            <mml:mo>(</mml:mo>
                            <mml:mrow>
                              <mml:mtext>false positive</mml:mtext>
                            </mml:mrow>
                            <mml:mo>)</mml:mo>
                          </mml:mrow>
                          <mml:mo>.</mml:mo>
                        </mml:mrow>
                      </mml:mtd>
                    </mml:mtr>
                  </mml:mtable>
                </mml:mrow>
              </mml:mrow>
            </mml:mrow>
          </mml:math>
        </disp-formula>
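        <p>A minimal sketch of such an environment is given below, assuming precomputed state vectors for each class. It follows the Gymnasium reset/step return conventions but deliberately does not subclass gymnasium.Env so that it runs with the standard library alone; the class name PhishEnvSketch and its constructor arguments are hypothetical and are not the authors’ PhishEnv.</p>
        <code language="python"><![CDATA[
```python
import random


class PhishEnvSketch:
    """Single-step environment with balanced sampling and asymmetric rewards."""

    def __init__(self, legit_states, phish_states, seed=None):
        # Each argument is a list of precomputed state vectors (lists of floats).
        self.pools = {0: list(legit_states), 1: list(phish_states)}
        self.rng = random.Random(seed)
        self.state = None
        self.label = None

    def reset(self):
        """Pick a class with probability 0.5 each, then a sample from that class."""
        self.label = self.rng.choice([0, 1])
        self.state = self.rng.choice(self.pools[self.label])
        return self.state, {}

    def step(self, action):
        """Score one prediction with Equation (2); the episode then terminates."""
        if action == self.label:
            reward = 1.0
        elif self.label == 1:  # phishing predicted as legitimate: false negative
            reward = -2.0
        else:  # legitimate predicted as phishing: false positive
            reward = -0.5
        terminated, truncated = True, False
        return self.state, reward, terminated, truncated, {"label": self.label}
```
]]></code>
        <p>Because every episode ends after one action, each reset/step pair corresponds to one labeled example, and the class drawn at reset is independent of the class proportions in the underlying dataset.</p>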
      </sec>
      <sec id="sec3dot4">
        <title>3.4. Evaluation Setup</title>
        <p>To assess the model’s robustness against evasion attacks, the dataset was augmented with modified phishing URLs. These examples were created to imitate real obfuscation techniques that attackers could use to avoid detection. The alterations included character-level changes such as google to g00gle, token insertion and padding, rearranged domain and subdomain structures such as secure.paypal.com.login.cn, and encoding-based manipulation such as Unicode variants. With such variants, an existing functional URL can remain valid even though its lexical structure is altered. We incorporated these variations into both the training and testing splits to study how well the model generalizes under obfuscation. Many phishing datasets include intentional lexical deception designed to evade existing rule-based and machine learning detectors [<xref ref-type="bibr" rid="B25">25</xref>]. URLs often imitate formats used by real-world entities or exploit similarities at the domain level to mislead users and classifiers. Including these variants in model evaluation provides a realistic stress test of whether the model generalizes beyond clean datasets.</p>
        <table-wrap id="tbl2">
          <label>Table 2</label>
          <table>
            <tbody>
              <tr>
                <td>
                  <bold>Category</bold>
                </td>
                <td>
                  <bold>Example/Pattern</bold>
                </td>
                <td>
                  <bold>Description</bold>
                </td>
              </tr>
              <tr>
                <td>Homoglyph/Character Substitution</td>
                <td>g00gle.com, paypa1.net</td>
                <td>Characters visually similar to legitimate domains</td>
              </tr>
              <tr>
                <td>Punycode/IDN Homograph</td>
                <td>xn--example-9db.com</td>
                <td>Unicode-based domain encoding to mimic trusted sites</td>
              </tr>
              <tr>
                <td>Encoded URLs</td>
                <td>
                  <ext-link ext-link-type="uri" xlink:href="http://example.com/%32%31">http://example.com/%32%31</ext-link>
                </td>
                <td>Use of percent-encoded characters for obfuscation</td>
              </tr>
              <tr>
                <td>Look-alike TLDs</td>
                <td>.co, .orq, .net</td>
                <td>TLDs crafted to resemble popular legitimate domains</td>
              </tr>
              <tr>
                <td>IP/Randomized Domains</td>
                <td>
                  <ext-link ext-link-type="uri" xlink:href="http://192.168.0.1/login">http://192.168.0.1/login</ext-link>
                </td>
                <td>Numeric or non-semantic domain structures</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
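        <p>Variants of the kinds listed above could be generated with simple string transformations. The following Python sketch is illustrative only: the helper names and the substitution mappings are assumptions, and the paper does not describe its generation code.</p>
        <code language="python"><![CDATA[
```python
def substitute_chars(label: str, mapping: dict) -> str:
    """Replace characters with look-alikes, e.g. google -> g00gle for {"o": "0"}."""
    return "".join(mapping.get(c, c) for c in label)


def nest_brand(brand_domain: str, attacker_domain: str, prefix: str = "secure") -> str:
    """Place a trusted-looking name in the subdomain part of another domain."""
    return f"{prefix}.{brand_domain}.{attacker_domain}"


def percent_encode(text: str) -> str:
    """Percent-encode every byte of a string, e.g. "21" -> "%32%31"."""
    return "".join(f"%{b:02X}" for b in text.encode("utf-8"))


def to_punycode(label: str) -> str:
    """ASCII form of an internationalized label (it carries the "xn--" prefix)."""
    return label.encode("idna").decode("ascii")
```
]]></code>
        <p>For example, nest_brand("paypal.com", "login.cn") yields secure.paypal.com.login.cn, and percent_encode("21") yields %32%31, matching the patterns shown in the table.</p>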
      </sec>
      <sec id="sec3dot5">
        <title>3.5. QR-DQN Agent Architecture</title>
        <p>The phishing detection policy is learned by a Quantile Regression DQN (QR-DQN) agent, which models a distribution of returns per action through multiple quantiles trained with the quantile Huber loss, rather than a single scalar Q-value. This enables uncertainty-aware value estimation. The policy network is an MLP with hidden layers [512 → 512 → 256 → 128] and ReLU activations, sized for a high-dimensional hybrid state that concatenates normalized numerical features (URL structural and domain-specific metrics) with 768-dimensional RoBERTa embeddings capturing deep semantic context [<xref ref-type="bibr" rid="B26">26</xref>]. The online network outputs a flat vector of length num_actions × num_quantiles, which is reshaped to [num_actions, num_quantiles]. A separate target network, soft-updated via Polyak averaging (<inline-formula><mml:math><mml:mrow><mml:mi>τ</mml:mi><mml:mo>=</mml:mo><mml:mn>0.005</mml:mn></mml:mrow></mml:math></inline-formula>) every 1000 steps, provides stable quantile targets for the distributional Bellman updates. This distributional design improves the calibration of value estimates and enhances policy generalization under stochastic rewards.</p>
        <p>The learning process uses an experience replay buffer of 150,000 transitions and begins updates after an initial 5000-step data collection phase. Training uses batches of 512, with 8 gradient updates after every 4 environment interactions, a quantile Huber loss for robust distributional regression, and gradient clipping at a maximum norm of 10 for stability. Exploration follows an <inline-formula><mml:math><mml:mi>ε</mml:mi></mml:math></inline-formula>-greedy schedule that starts at <inline-formula><mml:math><mml:mrow><mml:mi>ε</mml:mi><mml:mo>=</mml:mo><mml:mn>1.0</mml:mn></mml:mrow></mml:math></inline-formula> and decays linearly to <inline-formula><mml:math><mml:mrow><mml:mi>ε</mml:mi><mml:mo>=</mml:mo><mml:mn>0.02</mml:mn></mml:mrow></mml:math></inline-formula> over the first 25% of the 300,000-timestep training budget; actions are selected by the expected return, computed as the mean over quantiles for each action. A high discount factor (<inline-formula><mml:math><mml:mrow><mml:mi>γ</mml:mi><mml:mo>=</mml:mo><mml:mn>0.995</mml:mn></mml:mrow></mml:math></inline-formula>) emphasizes long-term rewards, while the target network, updated every 1000 steps, supplies the quantile targets and prevents destabilizing feedback during learning.</p>
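        <p>The reshaping, mean-based action selection, and soft target update mentioned here can be sketched in a few lines of plain Python. The action-major memory layout of the flat output is an assumption, as are the function names; a real implementation would operate on tensors rather than lists.</p>
        <code language="python"><![CDATA[
```python
def quantiles_by_action(flat_output, num_actions, num_quantiles):
    """Reshape a flat network output into one list of quantiles per action."""
    assert len(flat_output) == num_actions * num_quantiles
    return [
        flat_output[a * num_quantiles:(a + 1) * num_quantiles]
        for a in range(num_actions)
    ]


def greedy_action(quantiles):
    """Pick the action whose quantile mean (expected return) is largest."""
    means = [sum(q) / len(q) for q in quantiles]
    return max(range(len(means)), key=means.__getitem__)


def polyak_update(target_params, online_params, tau=0.005):
    """Soft update: target = tau * online + (1 - tau) * target, element-wise."""
    return [tau * w + (1.0 - tau) * t for t, w in zip(target_params, online_params)]
```
]]></code>
        <p>With a small coefficient such as 0.005, each soft update moves the target parameters only slightly toward the online parameters, which is what keeps the quantile targets slowly varying.</p>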
        <p><bold>QR-DQN Training Hyperparameters:</bold></p>
        <table-wrap id="tbl3">
          <label>Table 3</label>
          <table>
            <tbody>
              <tr>
                <td>
                  <bold>Hyperparameter</bold>
                </td>
                <td>
                  <bold>Value</bold>
                </td>
                <td>
                  <bold>Description</bold>
                </td>
              </tr>
              <tr>
                <td>Architecture</td>
                <td>512 → 512 → 256 → 128</td>
                <td>Multi-layer perceptron hidden units for hybrid state fusion.</td>
              </tr>
              <tr>
                <td>Replay Buffer Size</td>
                <td>150,000</td>
                <td>Experience transitions stored for decorrelated sampling.</td>
              </tr>
              <tr>
                <td>Batch Size</td>
                <td>512</td>
                <td>Training batch size per update.</td>
              </tr>
              <tr>
                <td>Gradient Updates/Step</td>
                <td>8 per 4 steps</td>
                <td>Update frequency per environment interaction.</td>
              </tr>
              <tr>
                <td>Loss Function</td>
                <td>Quantile Huber loss</td>
                <td>Distributional regression for QR-DQN targets (replaces Smooth L1).</td>
              </tr>
              <tr>
                <td>Gradient Clipping</td>
                <td>Norm ≤ 10</td>
                <td>Maximum allowed gradient norm for stability.</td>
              </tr>
              <tr>
                <td>Exploration Schedule</td>
                <td>
                  <italic>ε</italic>
                  : 1 → 0.02
                </td>
                <td>Linear decay over the first 25% of 300,000 steps.</td>
              </tr>
              <tr>
                <td>Training Steps</td>
                <td>300,000</td>
                <td>Total interaction steps.</td>
              </tr>
              <tr>
                <td>Polyak Averaging Coefficient</td>
                <td>0.005</td>
                <td>
                  Target-network smoothing parameter
                  <italic>τ</italic>
                  .
                </td>
              </tr>
              <tr>
                <td>Target Network Interval</td>
                <td>1000 steps</td>
                <td>Frequency of target network updates.</td>
              </tr>
              <tr>
                <td>
                  Discount Factor (
                  <italic>γ</italic>
                  )
                </td>
                <td>0.995</td>
                <td>Future reward weighting.</td>
              </tr>
              <tr>
                <td>Initial Experience Steps</td>
                <td>5000</td>
                <td>Steps collected before learning starts.</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
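        <p>To make the exploration schedule and update budget in the table concrete, the sketch below computes the exploration rate at a given step and an approximate count of gradient updates. The function names are hypothetical, and the update count assumes that updates are triggered exactly every fourth step once learning has started, which the paper does not state explicitly.</p>
        <code language="python"><![CDATA[
```python
def epsilon_at(step, total_steps=300_000, fraction=0.25, start=1.0, end=0.02):
    """Linearly decay epsilon from start to end over the first part of training."""
    decay_steps = fraction * total_steps
    progress = min(step / decay_steps, 1.0)
    return start + progress * (end - start)


def approx_gradient_updates(total_steps=300_000, warmup=5_000,
                            train_every=4, updates_per_round=8):
    """Rough number of gradient updates implied by the listed settings."""
    rounds = (total_steps - warmup) // train_every
    return rounds * updates_per_round
```
]]></code>
        <p>With the listed values, the exploration rate reaches 0.02 at step 75,000 and stays there, and the settings imply on the order of 590,000 gradient updates over the full run.</p>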
        <p><bold>RoBERTa Parameters:</bold></p>
        <table-wrap id="tbl4">
          <label>Table 4</label>
          <table>
            <tbody>
              <tr>
                <td>
                  <bold>Parameter</bold>
                </td>
                <td>
                  <bold>Value</bold>
                </td>
                <td>
                  <bold>Description</bold>
                </td>
              </tr>
              <tr>
                <td>Model</td>
                <td>roberta-base</td>
                <td>Hugging Face pretrained RoBERTa backbone.</td>
              </tr>
              <tr>
                <td>Token Extraction</td>
                <td>&lt;s&gt; token (first)</td>
                <td>Source token for dense representation in RoBERTa (analogous to CLS).</td>
              </tr>
              <tr>
                <td>Embedding Dimension</td>
                <td>768</td>
                <td>Width of contextual feature vector for roberta-base.</td>
              </tr>
              <tr>
                <td>Max Sequence Length</td>
                <td>256</td>
                <td>Tokens per URL; padded or truncated.</td>
              </tr>
              <tr>
                <td>Tokenizer</td>
                <td>Byte-Pair Encoding</td>
                <td>RoBERTa’s BPE tokenizer for URL segmentation.</td>
              </tr>
              <tr>
                <td>Fine-Tuning</td>
                <td>None</td>
                <td>Pretrained weights used; not domain-adapted.</td>
              </tr>
              <tr>
                <td>Inference Hardware</td>
                <td>GPU/CPU</td>
                <td>Run on GPU if available, otherwise CPU.</td>
              </tr>
              <tr>
                <td>Empty/Invalid String</td>
                <td>Zero vector</td>
                <td>Encoding for missing strings.</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec3dot6">
        <title>3.6. Agent Environment Interaction</title>
        <p>In QR-DQN (Quantile Regression Deep Q-Network), the Q-function is represented by a set of quantiles that estimate the distribution of possible returns instead of a single expected value as in traditional Q-learning. Each state-action pair has its own quantile values, which reflect the full distribution of returns. The Q-function representation and expected return are given by [<xref ref-type="bibr" rid="B27">27</xref>]:</p>
        <disp-formula id="FD3">
          <label>(3)</label>
          <mml:math>
            <mml:mrow>
              <mml:mi>Q</mml:mi>
              <mml:mrow>
                <mml:mo>(</mml:mo>
                <mml:mrow>
                  <mml:mi>s</mml:mi>
                  <mml:mo>,</mml:mo>
                  <mml:mi>a</mml:mi>
                </mml:mrow>
                <mml:mo>)</mml:mo>
              </mml:mrow>
              <mml:mo>=</mml:mo>
              <mml:mi mathvariant="double-struck">E</mml:mi>
              <mml:mrow>
                <mml:mo>[</mml:mo>
                <mml:mrow>
                  <mml:msub>
                    <mml:mi>R</mml:mi>
                    <mml:mi>t</mml:mi>
                  </mml:msub>
                  <mml:mo>|</mml:mo>
                  <mml:msub>
                    <mml:mi>s</mml:mi>
                    <mml:mi>t</mml:mi>
                  </mml:msub>
                  <mml:mo>=</mml:mo>
                  <mml:mi>s</mml:mi>
                  <mml:mo>,</mml:mo>
                  <mml:msub>
                    <mml:mi>a</mml:mi>
                    <mml:mi>t</mml:mi>
                  </mml:msub>
                  <mml:mo>=</mml:mo>
                  <mml:mi>a</mml:mi>
                  <mml:mo>,</mml:mo>
                  <mml:mi>π</mml:mi>
                </mml:mrow>
                <mml:mo>]</mml:mo>
              </mml:mrow>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <p>For QR-DQN, the return distribution is represented using <inline-formula><mml:math><mml:mi> N </mml:mi></mml:math></inline-formula> quantile estimates, each corresponding to a quantile level [<xref ref-type="bibr" rid="B28">28</xref>]:</p>
        <disp-formula id="FD4">
          <label>(4)</label>
          <mml:math>
            <mml:mrow>
              <mml:mi>Z</mml:mi>
              <mml:mrow>
                <mml:mo>(</mml:mo>
                <mml:mrow>
                  <mml:mi>s</mml:mi>
                  <mml:mo>,</mml:mo>
                  <mml:mi>a</mml:mi>
                </mml:mrow>
                <mml:mo>)</mml:mo>
              </mml:mrow>
              <mml:mo>=</mml:mo>
              <mml:mrow>
                <mml:mo>[</mml:mo>
                <mml:mrow>
                  <mml:msub>
                    <mml:mi>q</mml:mi>
                    <mml:mn>1</mml:mn>
                  </mml:msub>
                  <mml:mrow>
                    <mml:mo>(</mml:mo>
                    <mml:mrow>
                      <mml:mi>s</mml:mi>
                      <mml:mo>,</mml:mo>
                      <mml:mi>a</mml:mi>
                    </mml:mrow>
                    <mml:mo>)</mml:mo>
                  </mml:mrow>
                  <mml:mo>,</mml:mo>
                  <mml:msub>
                    <mml:mi>q</mml:mi>
                    <mml:mn>2</mml:mn>
                  </mml:msub>
                  <mml:mrow>
                    <mml:mo>(</mml:mo>
                    <mml:mrow>
                      <mml:mi>s</mml:mi>
                      <mml:mo>,</mml:mo>
                      <mml:mi>a</mml:mi>
                    </mml:mrow>
                    <mml:mo>)</mml:mo>
                  </mml:mrow>
                  <mml:mo>,</mml:mo>
                  <mml:mo>⋯</mml:mo>
                  <mml:mo>,</mml:mo>
                  <mml:msub>
                    <mml:mi>q</mml:mi>
                    <mml:mi>N</mml:mi>
                  </mml:msub>
                  <mml:mrow>
                    <mml:mo>(</mml:mo>
                    <mml:mrow>
                      <mml:mi>s</mml:mi>
                      <mml:mo>,</mml:mo>
                      <mml:mi>a</mml:mi>
                    </mml:mrow>
                    <mml:mo>)</mml:mo>
                  </mml:mrow>
                </mml:mrow>
                <mml:mo>]</mml:mo>
              </mml:mrow>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <p>The Bellman target for each quantile is as follows [<xref ref-type="bibr" rid="B29">29</xref>]:</p>
        <disp-formula id="FD5">
          <label>(5)</label>
          <mml:math>
            <mml:mrow>
              <mml:msub>
                <mml:mi>z</mml:mi>
                <mml:mi>j</mml:mi>
              </mml:msub>
              <mml:mo>=</mml:mo>
              <mml:mi>r</mml:mi>
              <mml:mo>+</mml:mo>
              <mml:mi>γ</mml:mi>
              <mml:msubsup>
                <mml:mi>z</mml:mi>
                <mml:mi>j</mml:mi>
                <mml:mo>−</mml:mo>
              </mml:msubsup>
              <mml:mrow>
                <mml:mo>(</mml:mo>
                <mml:mrow>
                  <mml:msup>
                    <mml:mi>s</mml:mi>
                    <mml:mo>′</mml:mo>
                  </mml:msup>
                  <mml:mo>,</mml:mo>
                  <mml:msup>
                    <mml:mi>a</mml:mi>
                    <mml:mtext>*</mml:mtext>
                  </mml:msup>
                </mml:mrow>
                <mml:mo>)</mml:mo>
              </mml:mrow>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <p>where <inline-formula><mml:math><mml:mi>r</mml:mi></mml:math></inline-formula> is the observed reward, <inline-formula><mml:math><mml:mi>γ</mml:mi></mml:math></inline-formula> is the discount factor, and <inline-formula><mml:math><mml:mrow><mml:msubsup><mml:mi>z</mml:mi><mml:mi>j</mml:mi><mml:mo>−</mml:mo></mml:msubsup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mi>s</mml:mi><mml:mo>′</mml:mo></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mi>a</mml:mi><mml:mtext>*</mml:mtext></mml:msup></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> is the <inline-formula><mml:math><mml:mi>j</mml:mi></mml:math></inline-formula>-th quantile predicted by the target network for the next state <inline-formula><mml:math><mml:msup><mml:mi>s</mml:mi><mml:mo>′</mml:mo></mml:msup></mml:math></inline-formula> and action <inline-formula><mml:math><mml:mrow><mml:msup><mml:mi>a</mml:mi><mml:mtext>*</mml:mtext></mml:msup></mml:mrow></mml:math></inline-formula>. The greedy next action is <inline-formula><mml:math><mml:mrow><mml:msup><mml:mi>a</mml:mi><mml:mtext>*</mml:mtext></mml:msup><mml:mo>=</mml:mo><mml:mi>arg</mml:mi><mml:msub><mml:mrow><mml:mi>max</mml:mi></mml:mrow><mml:msup><mml:mi>a</mml:mi><mml:mo>′</mml:mo></mml:msup></mml:msub><mml:mfrac><mml:mn>1</mml:mn><mml:mi>N</mml:mi></mml:mfrac><mml:mstyle displaystyle="true"><mml:msubsup><mml:mo>∑</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>N</mml:mi></mml:msubsup><mml:mrow><mml:msub><mml:mi>z</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mi>s</mml:mi><mml:mo>′</mml:mo></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mi>a</mml:mi><mml:mo>′</mml:mo></mml:msup></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mstyle></mml:mrow></mml:math></inline-formula>, that is, the action with the highest mean over its quantiles.</p>
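        <p>The target construction in Equation (5) can be sketched as follows. Two points are assumptions rather than statements from the paper: the greedy action is chosen from the target network’s quantiles, and a terminal transition is given the reward alone as its target, as is common practice. The function name is hypothetical.</p>
        <code language="python"><![CDATA[
```python
def bellman_targets(reward, next_target_quantiles, gamma=0.995, terminated=False):
    """Return the N target quantiles z_j = r + gamma * z_j^-(s', a*).

    next_target_quantiles holds one list of N quantiles per action, as
    predicted by the target network for the next state s'.
    """
    n = len(next_target_quantiles[0])
    if terminated:
        # Assumed convention: no bootstrapping after a terminal step.
        return [reward] * n
    means = [sum(q) / len(q) for q in next_target_quantiles]
    a_star = max(range(len(means)), key=means.__getitem__)
    return [reward + gamma * z for z in next_target_quantiles[a_star]]
```
]]></code>
        <p>Under the assumed terminal convention, every transition of a single-step episode has a target equal to its reward, so the learned quantiles for a state-action pair describe the distribution of the one-step reward.</p>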
        <p>The binary action space has two choices: labeling the sample as either phishing or legitimate. After each action, the environment returns a new observation, a reward, and an episode-termination flag. During training, quantile-based Bellman targets guide the agent’s updates.</p>
        <p>Reinforcement learning (RL) agents use their experiences in the environment to gradually modify their value estimates according to the rewards returned by the environment. In our study, however, the agent learns offline on a static dataset. The value estimates are updated from sampled transitions using quantile-based targets. The core temporal-difference update used in QR-DQN can be expressed as follows [<xref ref-type="bibr" rid="B28">28</xref>]:</p>
        <disp-formula id="FD6">
          <label>(6)</label>
          <mml:math>
            <mml:mrow>
              <mml:mi>Q</mml:mi>
              <mml:mrow>
                <mml:mo>(</mml:mo>
                <mml:mrow>
                  <mml:mi>s</mml:mi>
                  <mml:mo>,</mml:mo>
                  <mml:mi>a</mml:mi>
                </mml:mrow>
                <mml:mo>)</mml:mo>
              </mml:mrow>
              <mml:mo>←</mml:mo>
              <mml:mi>Q</mml:mi>
              <mml:mrow>
                <mml:mo>(</mml:mo>
                <mml:mrow>
                  <mml:mi>s</mml:mi>
                  <mml:mo>,</mml:mo>
                  <mml:mi>a</mml:mi>
                </mml:mrow>
                <mml:mo>)</mml:mo>
              </mml:mrow>
              <mml:mo>+</mml:mo>
              <mml:mi>α</mml:mi>
              <mml:mrow>
                <mml:mo>[</mml:mo>
                <mml:mrow>
                  <mml:mi>r</mml:mi>
                  <mml:mo>+</mml:mo>
                  <mml:mi>γ</mml:mi>
                  <mml:munder>
                    <mml:mrow>
                      <mml:mi>max</mml:mi>
                    </mml:mrow>
                    <mml:msup>
                      <mml:mi>a</mml:mi>
                      <mml:mo>′</mml:mo>
                    </mml:msup>
                  </mml:munder>
                  <mml:mfrac>
                    <mml:mn>1</mml:mn>
                    <mml:mi>N</mml:mi>
                  </mml:mfrac>
                  <mml:munderover>
                    <mml:mstyle mathsize="140%" displaystyle="true">
                      <mml:mo>∑</mml:mo>
                    </mml:mstyle>
                    <mml:mrow>
                      <mml:mi>k</mml:mi>
                      <mml:mo>=</mml:mo>
                      <mml:mn>1</mml:mn>
                    </mml:mrow>
                    <mml:mi>N</mml:mi>
                  </mml:munderover>
                  <mml:mtext>
                     
                  </mml:mtext>
                  <mml:msub>
                    <mml:mi>z</mml:mi>
                    <mml:mi>k</mml:mi>
                  </mml:msub>
                  <mml:mrow>
                    <mml:mo>(</mml:mo>
                    <mml:mrow>
                      <mml:msup>
                        <mml:mi>s</mml:mi>
                        <mml:mo>′</mml:mo>
                      </mml:msup>
                      <mml:mo>,</mml:mo>
                      <mml:msup>
                        <mml:mi>a</mml:mi>
                        <mml:mo>′</mml:mo>
                      </mml:msup>
                    </mml:mrow>
                    <mml:mo>)</mml:mo>
                  </mml:mrow>
                  <mml:mo>−</mml:mo>
                  <mml:mi>Q</mml:mi>
                  <mml:mrow>
                    <mml:mo>(</mml:mo>
                    <mml:mrow>
                      <mml:mi>s</mml:mi>
                      <mml:mo>,</mml:mo>
                      <mml:mi>a</mml:mi>
                    </mml:mrow>
                    <mml:mo>)</mml:mo>
                  </mml:mrow>
                </mml:mrow>
                <mml:mo>]</mml:mo>
              </mml:mrow>
            </mml:mrow>
          </mml:math>
        </disp-formula>
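        <p>Equation (6) is written in terms of the mean over quantiles. In distributional training, each predicted quantile is instead regressed toward the target quantiles with the quantile Huber loss named in Section 3.5. The sketch below shows the standard form of that loss for a single state-action pair; the midpoint quantile levels and the threshold κ = 1 are assumptions, since the paper does not report them, and the function names are illustrative.</p>
        <code language="python"><![CDATA[
```python
def huber(u, kappa=1.0):
    """Huber loss: quadratic for |u| <= kappa, linear beyond."""
    if abs(u) <= kappa:
        return 0.5 * u * u
    return kappa * (abs(u) - 0.5 * kappa)


def quantile_huber_loss(predicted, targets, kappa=1.0):
    """Quantile Huber loss between N predicted and M target quantiles."""
    n = len(predicted)
    total = 0.0
    for i, theta in enumerate(predicted):
        tau = (2 * i + 1) / (2 * n)  # midpoint quantile level of estimate i
        for z in targets:
            u = z - theta
            # Errors are weighted by tau or 1 - tau depending on their sign.
            weight = abs(tau - (1.0 if u < 0 else 0.0))
            total += weight * huber(u, kappa) / kappa
    return total / len(targets)
```
]]></code>
        <p>The sign-dependent weight is what pulls low quantile estimates toward the lower tail of the targets and high estimates toward the upper tail, so that the set of estimates spreads out to cover the return distribution.</p>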
        <p>Parameter updates take place only during offline training on a fixed dataset. The deployed model is not updated online. Therefore, the reported results reflect offline learning rather than continual online adaptation to newly emerging threats. The overall architecture of the proposed phishing detection framework is shown in <xref ref-type="fig" rid="fig1">Figure 1</xref>.</p>
        <fig id="fig1">
          <label>Figure 1</label>
          <graphic xlink:href="https://html.scirp.org/file/7801218-rId55.jpeg?20260617013611" />
        </fig>
        <p><bold>Figure 1.</bold> Architecture of the phishing detection system.</p>
      </sec>
      <sec id="sec3dot7">
        <title>3.7. Evaluation Metrics</title>
        <p>To evaluate the performance of the proposed phishing detection framework, several standard metrics were used to assess both predictive accuracy and generalization capability on unseen data. Predictions are categorized as follows:</p>
        <p><bold>True Positives (TP):</bold> Phishing URLs correctly classified as phishing.</p>
        <p><bold>True Negatives (TN):</bold> Legitimate URLs correctly classified as legitimate.</p>
        <p><bold>False Positives (FP):</bold> Legitimate URLs incorrectly classified as phishing.</p>
        <p><bold>False Negatives (FN):</bold> Phishing URLs incorrectly classified as legitimate (missed attacks).</p>
        <p>Based on these fundamental counts, the following metrics are calculated.</p>
        <p>Accuracy measures the overall proportion of correct predictions across both classes.</p>
        <disp-formula id="FD7">
          <label>(7)</label>
          <mml:math>
            <mml:mrow>
              <mml:mtext>Accuracy</mml:mtext>
              <mml:mo>=</mml:mo>
              <mml:mfrac>
                <mml:mrow>
                  <mml:mi>T</mml:mi>
                  <mml:mi>P</mml:mi>
                  <mml:mo>+</mml:mo>
                  <mml:mi>T</mml:mi>
                  <mml:mi>N</mml:mi>
                </mml:mrow>
                <mml:mrow>
                  <mml:mi>T</mml:mi>
                  <mml:mi>P</mml:mi>
                  <mml:mo>+</mml:mo>
                  <mml:mi>T</mml:mi>
                  <mml:mi>N</mml:mi>
                  <mml:mo>+</mml:mo>
                  <mml:mi>F</mml:mi>
                  <mml:mi>P</mml:mi>
                  <mml:mo>+</mml:mo>
                  <mml:mi>F</mml:mi>
                  <mml:mi>N</mml:mi>
                </mml:mrow>
              </mml:mfrac>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <p>Balanced Accuracy accounts for class imbalance by computing the arithmetic mean of recall and specificity.</p>
        <disp-formula id="FD8">
          <label>(8)</label>
          <mml:math>
            <mml:mrow>
              <mml:mtext>Balanced Accuracy</mml:mtext>
              <mml:mo>=</mml:mo>
              <mml:mfrac>
                <mml:mrow>
                  <mml:mtext>Recall</mml:mtext>
                  <mml:mo>+</mml:mo>
                  <mml:mtext>Specificity</mml:mtext>
                </mml:mrow>
                <mml:mn>2</mml:mn>
              </mml:mfrac>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <p>Precision measures the fraction of URLs flagged as phishing that are actually phishing, indicating how dependable the model is when it raises an alert.</p>
        <disp-formula id="FD9">
          <label>(9)</label>
          <mml:math>
            <mml:mrow>
              <mml:mtext>Precision</mml:mtext>
              <mml:mo>=</mml:mo>
              <mml:mfrac>
                <mml:mrow>
                  <mml:mi>T</mml:mi>
                  <mml:mi>P</mml:mi>
                </mml:mrow>
                <mml:mrow>
                  <mml:mi>T</mml:mi>
                  <mml:mi>P</mml:mi>
                  <mml:mo>+</mml:mo>
                  <mml:mi>F</mml:mi>
                  <mml:mi>P</mml:mi>
                </mml:mrow>
              </mml:mfrac>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <p>Recall measures the model’s capacity to identify attacks by quantifying the fraction of real phishing URLs correctly detected.</p>
        <disp-formula id="FD10">
          <label>(10)</label>
          <mml:math>
            <mml:mrow>
              <mml:mtext>Recall</mml:mtext>
              <mml:mo>=</mml:mo>
              <mml:mfrac>
                <mml:mrow>
                  <mml:mi>T</mml:mi>
                  <mml:mi>P</mml:mi>
                </mml:mrow>
                <mml:mrow>
                  <mml:mi>T</mml:mi>
                  <mml:mi>P</mml:mi>
                  <mml:mo>+</mml:mo>
                  <mml:mi>F</mml:mi>
                  <mml:mi>N</mml:mi>
                </mml:mrow>
              </mml:mfrac>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <p>F1-score is the harmonic mean of precision and recall. It is high only when both are high, so it penalizes false positives and false negatives together.</p>
        <disp-formula id="FD11">
          <label>(11)</label>
          <mml:math>
            <mml:mrow>
              <mml:mi>F</mml:mi>
              <mml:mn>1</mml:mn>
              <mml:mo>=</mml:mo>
              <mml:mfrac>
                <mml:mrow>
                  <mml:mn>2</mml:mn>
                  <mml:mo>×</mml:mo>
                  <mml:mrow>
                    <mml:mo>(</mml:mo>
                    <mml:mrow>
                      <mml:mtext>Precision</mml:mtext>
                      <mml:mo>×</mml:mo>
                      <mml:mtext>Recall</mml:mtext>
                    </mml:mrow>
                    <mml:mo>)</mml:mo>
                  </mml:mrow>
                </mml:mrow>
                <mml:mrow>
                  <mml:mtext>Precision</mml:mtext>
                  <mml:mo>+</mml:mo>
                  <mml:mtext>Recall</mml:mtext>
                </mml:mrow>
              </mml:mfrac>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <p>Specificity measures the proportion of legitimate URLs correctly identified as legitimate.</p>
        <disp-formula id="FD12">
          <label>(12)</label>
          <mml:math>
            <mml:mrow>
              <mml:mtext>Specificity</mml:mtext>
              <mml:mo>=</mml:mo>
              <mml:mfrac>
                <mml:mrow>
                  <mml:mi>T</mml:mi>
                  <mml:mi>N</mml:mi>
                </mml:mrow>
                <mml:mrow>
                  <mml:mi>T</mml:mi>
                  <mml:mi>N</mml:mi>
                  <mml:mo>+</mml:mo>
                  <mml:mi>F</mml:mi>
                  <mml:mi>P</mml:mi>
                </mml:mrow>
              </mml:mfrac>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <p>False Negative Rate (FNR) quantifies the proportion of phishing attacks that evade detection.</p>
        <disp-formula id="FD13">
          <label>(13)</label>
          <mml:math>
            <mml:mrow>
              <mml:mtext>FNR</mml:mtext>
              <mml:mo>=</mml:mo>
              <mml:mfrac>
                <mml:mrow>
                  <mml:mi>F</mml:mi>
                  <mml:mi>N</mml:mi>
                </mml:mrow>
                <mml:mrow>
                  <mml:mi>F</mml:mi>
                  <mml:mi>N</mml:mi>
                  <mml:mo>+</mml:mo>
                  <mml:mi>T</mml:mi>
                  <mml:mi>P</mml:mi>
                </mml:mrow>
              </mml:mfrac>
              <mml:mo>=</mml:mo>
              <mml:mn>1</mml:mn>
              <mml:mo>−</mml:mo>
              <mml:mtext>Recall</mml:mtext>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <p>False Positive Rate (FPR) measures the proportion of legitimate URLs incorrectly flagged as phishing, which affects system usability.</p>
        <disp-formula id="FD14">
          <label>(14)</label>
          <mml:math>
            <mml:mrow>
              <mml:mtext>FPR</mml:mtext>
              <mml:mo>=</mml:mo>
              <mml:mfrac>
                <mml:mrow>
                  <mml:mi>F</mml:mi>
                  <mml:mi>P</mml:mi>
                </mml:mrow>
                <mml:mrow>
                  <mml:mi>F</mml:mi>
                  <mml:mi>P</mml:mi>
                  <mml:mo>+</mml:mo>
                  <mml:mi>T</mml:mi>
                  <mml:mi>N</mml:mi>
                </mml:mrow>
              </mml:mfrac>
              <mml:mo>=</mml:mo>
              <mml:mn>1</mml:mn>
              <mml:mo>−</mml:mo>
              <mml:mtext>Specificity</mml:mtext>
            </mml:mrow>
          </mml:math>
        </disp-formula>
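        <p>As a minimal sketch of how Equations (7)-(14) fit together, the following Python function derives every metric from the four confusion-matrix counts. The function name <monospace>confusion_metrics</monospace> and the example counts are illustrative assumptions: FP = 25 and FN = 4 follow the reported test results, whereas TP = 9996 and TN = 9975 assume an exactly balanced 10,000/10,000 test split, which is not stated here.</p>
        <code language="python"><![CDATA[
```python
def confusion_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute the metrics of Equations (7)-(14) from raw counts.

    Assumes every denominator is non-zero (both classes present and
    at least one positive prediction).
    """
    precision = tp / (tp + fp)        # Eq. (9)
    recall = tp / (tp + fn)           # Eq. (10)
    specificity = tn / (tn + fp)      # Eq. (12)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),          # Eq. (7)
        "balanced_accuracy": (recall + specificity) / 2,      # Eq. (8)
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),  # Eq. (11)
        "specificity": specificity,
        "fnr": fn / (fn + tp),                                # Eq. (13)
        "fpr": fp / (fp + tn),                                # Eq. (14)
    }


# Hypothetical counts: FP and FN as reported; TP and TN assume a
# balanced 10,000/10,000 test split (an assumption, not a reported figure).
metrics = confusion_metrics(tp=9996, tn=9975, fp=25, fn=4)
for name, value in metrics.items():
    print(f"{name}: {value:.4%}")
```
]]></code>
        <p>Under these assumed counts the sketch yields an accuracy of 99.855%, a precision of about 99.75%, and a recall of 99.96%, in line with the reported test results. It also makes the identities FNR = 1 − Recall and FPR = 1 − Specificity easy to check numerically.</p>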
        <p>The cost of different types of errors is asymmetric. Missing a phishing attack (false negative) has far more severe consequences than incorrectly flagging a legitimate site (false positive). The asymmetric reward function discussed in Section 3.3 reflects this imbalance by applying a penalty of −2.0 for false negatives and −0.5 for false positives, indicating that missing an attack is more costly than triggering a false alarm. Machine-learning-based security systems also face the challenge of generalization, since overfitting occurs when a model memorizes static patterns rather than learning transferable behavior for unseen threats.</p>
        <p>The accuracy gap measures how far accuracy drops from the training data to the test data.</p>
        <disp-formula id="FD15">
          <label>(15)</label>
          <mml:math>
            <mml:mrow>
              <mml:msub>
                <mml:mi>G</mml:mi>
                <mml:mrow>
                  <mml:mtext>gap</mml:mtext>
                </mml:mrow>
              </mml:msub>
              <mml:mo>=</mml:mo>
              <mml:msub>
                <mml:mrow>
                  <mml:mtext>Accuracy</mml:mtext>
                </mml:mrow>
                <mml:mrow>
                  <mml:mtext>train</mml:mtext>
                </mml:mrow>
              </mml:msub>
              <mml:mo>−</mml:mo>
              <mml:msub>
                <mml:mrow>
                  <mml:mtext>Accuracy</mml:mtext>
                </mml:mrow>
                <mml:mrow>
                  <mml:mtext>test</mml:mtext>
                </mml:mrow>
              </mml:msub>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <p>The F1 gap similarly measures generalization capability using the F1-score.</p>
        <disp-formula id="FD16">
          <label>(16)</label>
          <mml:math>
            <mml:mrow>
              <mml:msub>
                <mml:mrow>
                  <mml:mi>F</mml:mi>
                  <mml:mn>1</mml:mn>
                </mml:mrow>
                <mml:mrow>
                  <mml:mtext>gap</mml:mtext>
                </mml:mrow>
              </mml:msub>
              <mml:mo>=</mml:mo>
              <mml:msub>
                <mml:mrow>
                  <mml:mi>F</mml:mi>
                  <mml:mn>1</mml:mn>
                </mml:mrow>
                <mml:mrow>
                  <mml:mtext>train</mml:mtext>
                </mml:mrow>
              </mml:msub>
              <mml:mo>−</mml:mo>
              <mml:msub>
                <mml:mrow>
                  <mml:mi>F</mml:mi>
                  <mml:mn>1</mml:mn>
                </mml:mrow>
                <mml:mrow>
                  <mml:mtext>test</mml:mtext>
                </mml:mrow>
              </mml:msub>
            </mml:mrow>
          </mml:math>
        </disp-formula>
        <p>Lower F1-gap values indicate that the model preserves its F1-score better on test data. Models with smaller generalization gaps remain effective even when phishing tactics differ from the training examples. This indicates less overfitting and stronger robustness to newly emerging attacks, making the model more reliable over time.</p>
        <p>The choice of metrics reflects four considerations.</p>
        <p><bold>Recall prioritization:</bold> The evaluation prioritizes recall, the proportion of true phishing URLs that are correctly identified, because missed attacks are the costliest error.</p>
        <p><bold>Balanced evaluation:</bold> Metrics such as F1-score consider precision and recall jointly, providing a fair assessment by accounting for false positives and false negatives together. This is particularly important for imbalanced datasets.</p>
        <p><bold>Generalization emphasis:</bold> Static pattern memorization does not hold up against evolving attacks. The generalization-gap metrics directly measure how well performance carries over from training data to unseen attack variations, which is a main benefit of the proposed RoBERTa-enhanced method.</p>
        <p><bold>Operational transparency:</bold> Because the full confusion matrix (TP, TN, FP, FN) is reported, security analysts can determine whether the system’s error profile aligns with their organization’s risk tolerance and operational constraints.</p>
      </sec>
    </sec>
    <sec id="sec4">
      <title>4. Results</title>
      <sec id="sec4dot1">
        <title>4.1. Experimental Setup and Configuration</title>
        <p>The balanced phishing-URL dataset was divided using stratified train–test splits. We combined engineered numerical features with pre-computed RoBERTa embeddings and formulated detection as a one-step Markov decision process (MDP), in which each episode evaluates a single sample and the action space is binary. A deep multilayer perceptron (MLP)-based QR-DQN agent was trained with hyperparameters selected according to the dataset scale. The reward function emphasized reducing missed phishing attempts: it assigned a penalty of −2.0 to false negatives and −0.5 to false positives, and awarded +1.0 for correct predictions. Models were compared on the held-out test set using accuracy, F1-score, recall, and generalization-gap measures.</p>
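        <p>The one-step formulation and the asymmetric reward described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors’ implementation: the label encoding (1 = phishing, 0 = legitimate) and the names <monospace>reward</monospace> and <monospace>run_episode</monospace> are hypothetical, and only the three reward values come from the text.</p>
        <code language="python"><![CDATA[
```python
from typing import Callable, Sequence, Tuple

# Assumed label/action encoding (not specified in the text).
PHISHING, LEGITIMATE = 1, 0


def reward(action: int, label: int) -> float:
    """Asymmetric reward: +1.0 correct, -2.0 false negative, -0.5 false positive."""
    if action == label:
        return 1.0
    if label == PHISHING:   # predicted legitimate, actually phishing
        return -2.0
    return -0.5             # predicted phishing, actually legitimate


def run_episode(
    policy: Callable[[Sequence[float]], int],
    state: Sequence[float],
    label: int,
) -> Tuple[int, float, bool]:
    """One-step episode: observe one sample, act once, receive a reward, stop."""
    action = policy(state)
    done = True  # every episode ends after a single classification
    return action, reward(action, label), done


# Usage with a trivial policy that flags every URL as phishing.
always_flag = lambda state: PHISHING
print(run_episode(always_flag, [0.0, 0.0], LEGITIMATE))
```
]]></code>
        <p>One consequence of these values is worth noting: for an estimated phishing probability p, flagging has expected reward 1.5p − 0.5 and passing has expected reward 1 − 3p, so a reward-maximizing agent flags a URL once p exceeds 1/3 rather than 1/2. This is how the reward design biases the policy toward recall.</p>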
      </sec>
      <sec id="sec4dot2">
        <title>4.2. Comparative Performance Analysis</title>
        <p>The RoBERTa-enhanced QR-DQN agent obtained a test accuracy of 99.86%, precision of 99.75%, recall of 99.96%, and F1-score of 99.85%, with only 4 false negatives and 25 false positives out of 20,000 test samples. By comparison, the baseline DQN agent with lexical features only achieved 98.30% test accuracy, 99.82% precision, 96.76% recall, and 98.27% F1-score, with 323 false negatives and 17 false positives. The RoBERTa-enhanced model therefore raised accuracy by 1.56 percentage points and cut false negatives from 323 to 4 (a reduction of about 98.8%), at the cost of 8 additional false positives.</p>
        <p>An ablation study was carried out to isolate the contribution of distributional reinforcement learning by running QR-DQN and standard DQN in the same lexical-only feature setting. The DQN baseline achieved 98.30% accuracy, 96.76% recall, and 98.27% F1-score, while QR-DQN achieved 98.12% accuracy, 96.40% recall, and 98.08% F1-score. QR-DQN also produced more false negatives (359 versus 323) and a slightly larger generalization gap (1.77% versus 1.63%).</p>
        <p>The findings imply that distributional RL does not provide a significant advantage over standard DQN in this framework, nor does it further enhance generalization in a single-step deterministic phishing detection setting that only contains lexical features.</p>
        <p>However, when semantic embeddings from RoBERTa are added, QR-DQN outperforms DQN on every reported test metric, although by small margins: recall of 99.96% versus 99.95%, 4 versus 5 false negatives, 25 versus 37 false positives, and an accuracy gap of 0.042% versus 0.056%. The advantage of QR-DQN thus appears only with a rich state representation, where modeling the full return distribution may support more risk-sensitive decisions under uncertainty.</p>
      </sec>
      <sec id="sec4dot3">
        <title>4.3. Generalization Capability</title>
        <p>To quantify generalization, the gap between training and test accuracy was measured. The QR-DQN agent using RoBERTa semantic embeddings and lexical features has a very small generalization gap of 0.042%, whereas the baseline DQN agent using only lexical features shows a much larger gap of 1.63%. This is roughly a 39-fold reduction, which suggests that the RoBERTa-enhanced agent learns generalizable concepts of malicious URLs rather than memorizing patterns in the training data.</p>
        <p>The QR-DQN model that relies solely on lexical features shows an even larger generalization gap (accuracy gap: 1.77%, F1 gap: 1.81%) than the lexical DQN baseline. This indicates that distributional RL does not improve generalization by itself and that semantic embeddings are required for the improvement.</p>
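        <p>The gap arithmetic of Equations (15) and (16) can be reproduced with a short Python sketch. The helper name <monospace>generalization_gap</monospace> is hypothetical. The inputs are the rounded train-split and test-split accuracies from the results tables in Section 4.8, so the computed gaps match the reported ones only up to rounding.</p>
        <code language="python"><![CDATA[
```python
def generalization_gap(train_score: float, test_score: float) -> float:
    """Equations (15)-(16): train-split score minus test-split score."""
    return train_score - test_score


# Rounded accuracies (%) as listed in the train-split and test-split tables.
gap_hybrid = generalization_gap(99.90, 99.86)    # QR-DQN, RoBERTa + lexical
gap_lexical = generalization_gap(99.93, 98.30)   # DQN, lexical only

# Fold reduction computed from the reported accuracy gaps (1.631 and 0.042).
fold_reduction = 1.631 / 0.042

print(f"hybrid gap:  {gap_hybrid:.2f} points")
print(f"lexical gap: {gap_lexical:.2f} points")
print(f"fold reduction: {fold_reduction:.1f}x")
```
]]></code>
        <p>The ratio of the two reported gaps is about 38.8, which is the basis for the approximate 39-fold figure quoted above.</p>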
      </sec>
      <sec id="sec4dot4">
        <title>4.4. Semantic Feature Integration Benefits</title>
        <p>Using BERT-style embeddings substantially improved contextual understanding and robustness. Because the transformer encoder attends to both left and right context, the model can better capture brand-name manipulation, typosquatting, and semantic obfuscation. The semantic features proved more resistant to common evasion strategies, including character-level obfuscation, homoglyph attacks, insertion of special characters, subdomain manipulation, and URL shortening.</p>
      </sec>
      <sec id="sec4dot5">
        <title>4.5. Training Stability and Convergence</title>
        <p>The RoBERTa-enhanced agent also demonstrated improved training stability and better convergence properties. Policy optimization became more stable, and state exploration was guided more effectively by semantic information, resulting in reduced Q-value fluctuations. The asymmetric reward structure imposed a severe penalty on false negatives, which carry greater security risk. This helped the training process prioritize attack detection while still managing false positives at a level compatible with practical usability.</p>
      </sec>
      <sec id="sec4dot6">
        <title>4.6. Computational Efficiency Considerations</title>
        <p>The computational overhead of extracting RoBERTa embeddings was reduced through pre-computation and disk caching, along with efficient batching and parallelized processing. These design choices made training substantially more practical. Successful model training within 300,000 timesteps suggests that the framework is feasible for real-world deployment settings.</p>
      </sec>
      <sec id="sec4dot7">
        <title>4.7. Experimental Reliability and Evaluation Consistency</title>
        <p>The model achieved high accuracy, precision, recall, and F1-score on the held-out test set. Performance under 5-fold cross-validation was stable, with a mean accuracy of 99.9% and a standard deviation of 0.04%. The assessment is descriptive: it does not include multi-seed analysis, confidence-interval estimates, or formal statistical significance tests. All experiments were run offline on a static dataset. The QR-DQN policy was trained in this offline regime and used at inference with no online updates. The results therefore reflect offline phishing detection performance under the specified train/test split and cross-validation protocol. The observed improvement is attributed to the combination of QR-DQN with RoBERTa semantic embeddings and lexical features.</p>
      </sec>
      <sec id="sec4dot8">
        <title>4.8. Results Tables</title>
        <p>For completeness, <xref ref-type="fig" rid="fig2">Figure 2</xref> together with <bold>Tables 1</bold><bold>-</bold><bold>3</bold> summarize the train-split performance, test-split performance, and overfitting behavior of the evaluated models.</p>
        <fig id="fig2">
          <label>Figure 2</label>
          <graphic xlink:href="https://html.scirp.org/file/7801218-rId76.jpeg?20260617013612" />
        </fig>
        <p><bold>Figure 2.</bold> Train-test accuracy gap comparison across the evaluated phishing-detection models.</p>
        <p><bold>Table 1.</bold> Train-split performance comparison of the evaluated phishing-detection models.</p>
        <table-wrap id="tbl5">
          <label>Table 5</label>
          <table>
            <tbody>
              <tr>
                <td>
                  <bold>Model Name</bold>
                </td>
                <td>
                  <bold>Accuracy (%)</bold>
                </td>
                <td>
                  <bold>Precision (%)</bold>
                </td>
                <td>
                  <bold>Recall (%)</bold>
                </td>
                <td>
                  <bold>F1 Score (%)</bold>
                </td>
                <td>
                  <bold>FP</bold>
                </td>
                <td>
                  <bold>FN</bold>
                </td>
              </tr>
              <tr>
                <td>DQN (Lexical Features)</td>
                <td>99.93</td>
                <td>99.93</td>
                <td>99.93</td>
                <td>99.93</td>
                <td>29</td>
                <td>26</td>
              </tr>
              <tr>
                <td>QR-DQN (RoBERTa + Lexical Features)</td>
                <td>99.90</td>
                <td>99.83</td>
                <td>99.97</td>
                <td>99.90</td>
                <td>70</td>
                <td>12</td>
              </tr>
              <tr>
                <td>QR-DQN (Lexical Features)</td>
                <td>99.92</td>
                <td>99.91</td>
                <td>99.92</td>
                <td>99.92</td>
                <td>32</td>
                <td>28</td>
              </tr>
              <tr>
                <td>DQN (RoBERTa + Lexical Features)</td>
                <td>99.85</td>
                <td>99.72</td>
                <td>99.97</td>
                <td>99.85</td>
                <td>112</td>
                <td>11</td>
              </tr>
              <tr>
                <td>DQN (BERT + Lexical Features)</td>
                <td>99.83</td>
                <td>99.77</td>
                <td>99.88</td>
                <td>99.83</td>
                <td>92</td>
                <td>47</td>
              </tr>
              <tr>
                <td>QR-DQN (RoBERTa)</td>
                <td>99.69</td>
                <td>99.53</td>
                <td>99.85</td>
                <td>99.69</td>
                <td>189</td>
                <td>61</td>
              </tr>
              <tr>
                <td>DQN (DistilBERT + Lexical Features)</td>
                <td>99.34</td>
                <td>99.27</td>
                <td>99.41</td>
                <td>99.34</td>
                <td>291</td>
                <td>236</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p><bold>Table 2.</bold> Test-split performance comparison of the evaluated phishing-detection models.</p>
        <table-wrap id="tbl6">
          <label>Table 6</label>
          <table>
            <tbody>
              <tr>
                <td>
                  <bold>Model Name</bold>
                </td>
                <td>
                  <bold>Accuracy (%)</bold>
                </td>
                <td>
                  <bold>Precision (%)</bold>
                </td>
                <td>
                  <bold>Recall (%)</bold>
                </td>
                <td>
                  <bold>F1 Score (%)</bold>
                </td>
                <td>
                  <bold>FP</bold>
                </td>
                <td>
                  <bold>FN</bold>
                </td>
              </tr>
              <tr>
                <td>QR-DQN (RoBERTa + Lexical Features)</td>
                <td>99.86</td>
                <td>99.75</td>
                <td>99.96</td>
                <td>99.85</td>
                <td>25</td>
                <td>4</td>
              </tr>
              <tr>
                <td>DQN (RoBERTa + Lexical Features)</td>
                <td>99.79</td>
                <td>99.63</td>
                <td>99.95</td>
                <td>99.79</td>
                <td>37</td>
                <td>5</td>
              </tr>
              <tr>
                <td>QR-DQN (Lexical Features)</td>
                <td>98.12</td>
                <td>99.82</td>
                <td>96.40</td>
                <td>98.08</td>
                <td>17</td>
                <td>359</td>
              </tr>
              <tr>
                <td>DQN (BERT + Lexical Features)</td>
                <td>99.74</td>
                <td>99.64</td>
                <td>99.84</td>
                <td>99.74</td>
                <td>36</td>
                <td>16</td>
              </tr>
              <tr>
                <td>QR-DQN (RoBERTa)</td>
                <td>99.57</td>
                <td>99.38</td>
                <td>99.76</td>
                <td>99.57</td>
                <td>62</td>
                <td>24</td>
              </tr>
              <tr>
                <td>DQN (DistilBERT + Lexical Features)</td>
                <td>98.73</td>
                <td>99.29</td>
                <td>98.16</td>
                <td>98.72</td>
                <td>70</td>
                <td>183</td>
              </tr>
              <tr>
                <td>DQN (Lexical Features)</td>
                <td>98.30</td>
                <td>99.82</td>
                <td>96.76</td>
                <td>98.27</td>
                <td>17</td>
                <td>323</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p><bold>Table 3.</bold> Overfitting and generalization-gap comparison across model variants.</p>
        <table-wrap id="tbl7">
          <label>Table 7</label>
          <table>
            <tbody>
              <tr>
                <td>
                  <bold>Model Name</bold>
                </td>
                <td>
                  <bold>Test Accuracy (%)</bold>
                </td>
                <td>
                  <bold>Accuracy Gap (%)</bold>
                </td>
                <td>
                  <bold>Test F1 (%)</bold>
                </td>
                <td>
                  <bold>F1 Gap (%)</bold>
                </td>
              </tr>
              <tr>
                <td>QR-DQN (RoBERTa + Lexical Features)</td>
                <td>99.86</td>
                <td>0.042</td>
                <td>99.85</td>
                <td>0.043</td>
              </tr>
              <tr>
                <td>DQN (RoBERTa + Lexical Features)</td>
                <td>99.79</td>
                <td>0.056</td>
                <td>99.79</td>
                <td>0.057</td>
              </tr>
              <tr>
                <td>QR-DQN (Lexical Features)</td>
                <td>98.12</td>
                <td>1.77</td>
                <td>98.08</td>
                <td>1.81</td>
              </tr>
              <tr>
                <td>DQN (BERT + Lexical Features)</td>
                <td>99.74</td>
                <td>0.086</td>
                <td>99.74</td>
                <td>0.087</td>
              </tr>
              <tr>
                <td>QR-DQN (RoBERTa)</td>
                <td>99.57</td>
                <td>0.117</td>
                <td>99.57</td>
                <td>0.118</td>
              </tr>
              <tr>
                <td>DQN (DistilBERT + Lexical Features)</td>
                <td>98.73</td>
                <td>0.606</td>
                <td>98.72</td>
                <td>0.618</td>
              </tr>
              <tr>
                <td>DQN (Lexical Features)</td>
                <td>98.30</td>
                <td>1.631</td>
                <td>98.27</td>
                <td>1.663</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
    </sec>
    <sec id="sec5">
      <title>5. Discussion</title>
      <p>The proposed RoBERTa + QR-DQN framework compares favorably with prior phishing-detection approaches, but the key point is that QR-DQN does not improve generalization on its own. Transformer-only methods such as URLTran reported an 86.80% true positive rate under a strict false-positive threshold, whereas the proposed hybrid framework achieved 99.86% accuracy and 99.96% recall on held-out test data, which suggests a marked improvement in detection capability. Relative to the Deep Q-Network (DQN) model of Chatterjee <italic>et al.</italic>, which used a single agent and 14 lexical features, the RoBERTa-QR-DQN agent achieved higher overall accuracy. DDQN-based classifiers for imbalanced data, such as the approach discussed by Maci <italic>et al.</italic>, explicitly address class skew through ICMDP reward design and report strong performance under varying imbalance ratios without data sampling [<xref ref-type="bibr" rid="B30">30</xref>]. Earlier CNN-BiGRU hybrid systems also reported strong accuracy and recall [<xref ref-type="bibr" rid="B31">31</xref>]; relative to them, the main advantage of the RoBERTa-QR-DQN framework is generalization, and that advantage appears only when QR-DQN is combined with both lexical features and semantic embeddings.</p>
      <p>The results suggest that the reduction in the train–test accuracy gap is driven by the hybrid state representation rather than by distributional reinforcement learning alone. In the lexical-only setting, QR-DQN does not outperform the corresponding DQN baseline in generalization; instead, the strongest gains appear when QR-DQN is paired with RoBERTa semantic embeddings together with engineered lexical features. This combined representation enables the model to capture advanced evasion patterns such as brand mimicry, typosquatting, URL obfuscation, and structural manipulation more effectively than lexical features alone. In that setting, the quantile-based distributional reinforcement learning framework contributes uncertainty-aware and risk-sensitive decision-making. The asymmetric reward structure encourages the agent to reduce costly security mistakes, and the learned offline cost-sensitive policy helps control false alarms without sacrificing phishing detection. Overall, the experimental evidence indicates that QR-DQN improves generalization only when it operates on a richer hybrid representation that includes both lexical features and semantic embeddings.</p>
    </sec>
    <sec id="sec6">
      <title>6. Conclusions</title>
      <p>This study shows that combining semantic feature enhancement with advanced reinforcement learning improves test recall and substantially reduces the train–test generalization gap relative to lexical-feature-only baselines. The main conclusion is not that QR-DQN by itself is better than DQN with lexical-only features; rather, the advantage appears when reinforcement learning is paired with contextual embeddings such as RoBERTa together with lexical features. While the DQN agent using only lexical inputs achieves high training accuracy, it suffers from a larger generalization gap and a substantially higher false negative rate, indicating that lexical-only pattern matching is insufficient against diverse phishing strategies. In contrast, agents that incorporate semantic embeddings achieve higher test accuracy and, more importantly, stronger generalization and resilience against adversarial manipulation.</p>
      <p>These improvements suggest that the model benefits from learning semantic intent and structural meaning from URLs rather than merely exploiting superficial artifacts. From a security perspective, the findings emphasize that missing a phishing attack is substantially more costly than raising a false alarm, and that reinforcement-learning models become more effective in shrinking the generalization gap when they are augmented with semantic embeddings. Overall, the best-performing model was QR-DQN with RoBERTa embeddings and lexical features, achieving 99.86% test accuracy, 99.75% precision, 99.96% recall, 99.85% F1-score, 25 false positives, 4 false negatives, and a train-test accuracy gap of only 0.04%.</p>
    </sec>
    <sec id="sec7">
      <title>7. Limitations and Future Work</title>
      <sec id="sec7dot1">
        <title>7.1. Limitations</title>
        <p>This study has several limitations that create important directions for future investigation.</p>
        <p>1) <bold>Computational cost of semantic embeddings:</bold> Generating BERT-based embeddings is computationally expensive. Although pre-computation and caching reduce the online training burden, the initial feature-extraction stage remains resource-intensive. Very high-throughput real-time systems may therefore face latency and scalability constraints. </p>
        <p>2) <bold>Static offline training setting:</bold> The agent was trained on a large but fixed static dataset. As a result, the current model does not adapt online to newly emerging phishing campaigns during deployment. Regular retraining with recent data may help maintain performance under a rapidly evolving threat landscape. </p>
        <p>3) <bold>Limited feature scope:</bold> The present hybrid model combines BERT embeddings with 50 engineered lexical features, but it does not incorporate other potentially useful modalities such as visual similarity to known brands, DNS-level indicators, or network-traffic features. Including these additional signals may further improve detection capability and robustness. </p>
      </sec>
      <sec id="sec7dot2">
        <title>7.2. Future Work</title>
        <p>Several directions can be pursued to extend this work.</p>
        <p><bold>Online and continual learning:</bold> A key future objective is to evolve the current static RL framework into an online continual-learning system. This would allow the agent to update its policy gradually from a continuous stream of URLs, thereby adapting to emerging attack strategies while mitigating catastrophic forgetting.</p>
        <p><bold>Domain-specific BERT fine-tuning:</bold> Future work can explore fine-tuning the transformer encoder on large labeled corpora of phishing and benign URLs. Domain-adaptive fine-tuning may improve semantic representation quality and further strengthen downstream RL policy optimization.</p>
        <p><bold>Integration of explainable AI:</bold> Improving the interpretability of agent decisions is important for operational trust and adoption. Future extensions may incorporate explainability methods such as SHAP or LIME to provide post hoc explanations indicating which lexical or semantic URL components most strongly influenced the agent’s decision.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <title>References</title>
      <ref id="B1">
        <label>1.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Putra, F.P.E., Ubaidi, U., Zulfikri, A., Arifin, G. and Ilhamsyah, R.M. (2024) Analysis of Phishing Attack Trends, Impacts and Prevention Methods: Literature Study. <italic>Brilliance</italic>: <italic>Research of Artificial Intelligence</italic>, 4, 413-421. https://doi.org/10.47709/brilliance.v4i1.4357 <pub-id pub-id-type="doi">10.47709/brilliance.v4i1.4357</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.47709/brilliance.v4i1.4357">https://doi.org/10.47709/brilliance.v4i1.4357</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Putra, F.P.E.</string-name>
              <string-name>Ubaidi, U.</string-name>
              <string-name>Zulfikri, A.</string-name>
              <string-name>Arifin, G.</string-name>
              <string-name>Ilhamsyah, R.M.</string-name>
            </person-group>
            <year>2024</year>
            <article-title>Analysis of Phishing Attack Trends, Impacts and Prevention Methods: Literature Study</article-title>
            <source>Brilliance: Research of Artificial Intelligence</source>
            <volume>4</volume>
            <fpage>413</fpage>
            <lpage>421</lpage>
            <pub-id pub-id-type="doi">10.47709/brilliance.v4i1.4357</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B2">
        <label>2.</label>
        <citation-alternatives>
          <mixed-citation publication-type="thesis">Akbar, N. (2014) Analysing Persuasion Principles in Phishing Emails. Ph.D. Thesis, University of Twente.</mixed-citation>
          <element-citation publication-type="thesis">
            <person-group person-group-type="author">
              <string-name>Akbar, N.</string-name>
            </person-group>
            <year>2014</year>
            <article-title>Analysing Persuasion Principles in Phishing Emails</article-title>
            <comment>Ph.D. Thesis</comment>
            <publisher-name>University of Twente</publisher-name>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B3">
        <label>3.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Ejaz, A., Mian, A.N. and Manzoor, S. (2023) Life-Long Phishing Attack Detection Using Continual Learning. <italic>Scientific Reports</italic>, 13, Article No. 11488. https://doi.org/10.1038/s41598-023-37552-9 <pub-id pub-id-type="doi">10.1038/s41598-023-37552-9</pub-id><pub-id pub-id-type="pmid">37460588</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1038/s41598-023-37552-9">https://doi.org/10.1038/s41598-023-37552-9</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Ejaz, A.</string-name>
              <string-name>Mian, A.N.</string-name>
              <string-name>Manzoor, S.</string-name>
            </person-group>
            <year>2023</year>
            <article-title>Life-Long Phishing Attack Detection Using Continual Learning</article-title>
            <source>Scientific Reports</source>
            <volume>13</volume>
            <elocation-id>11488</elocation-id>
            <pub-id pub-id-type="doi">10.1038/s41598-023-37552-9</pub-id>
            <pub-id pub-id-type="pmid">37460588</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B4">
        <label>4.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Nguyen, T.T. and Reddi, V.J. (2023) Deep Reinforcement Learning for Cyber Security. <italic>IEEE Transactions on Neural Networks and Learning Systems</italic>, 34, 3779-3795. https://doi.org/10.1109/tnnls.2021.3121870 <pub-id pub-id-type="doi">10.1109/tnnls.2021.3121870</pub-id><pub-id pub-id-type="pmid">34723814</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/tnnls.2021.3121870">https://doi.org/10.1109/tnnls.2021.3121870</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Nguyen, T.T.</string-name>
              <string-name>Reddi, V.J.</string-name>
            </person-group>
            <year>2023</year>
            <article-title>Deep Reinforcement Learning for Cyber Security</article-title>
            <source>IEEE Transactions on Neural Networks and Learning Systems</source>
            <volume>34</volume>
            <fpage>3779</fpage>
            <lpage>3795</lpage>
            <pub-id pub-id-type="doi">10.1109/tnnls.2021.3121870</pub-id>
            <pub-id pub-id-type="pmid">34723814</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B5">
        <label>5.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Sarker, I.H. (2021) Deep Cybersecurity: A Comprehensive Overview from Neural Network and Deep Learning Perspective. <italic>SN Computer Science</italic>, 2, Article No. 154. https://doi.org/10.1007/s42979-021-00535-6 <pub-id pub-id-type="doi">10.1007/s42979-021-00535-6</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/s42979-021-00535-6">https://doi.org/10.1007/s42979-021-00535-6</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Sarker, I.H.</string-name>
            </person-group>
            <year>2021</year>
            <article-title>Deep Cybersecurity: A Comprehensive Overview from Neural Network and Deep Learning Perspective</article-title>
            <source>SN Computer Science</source>
            <volume>2</volume>
            <elocation-id>154</elocation-id>
            <pub-id pub-id-type="doi">10.1007/s42979-021-00535-6</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B6">
        <label>6.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Su, M. and Su, K. (2023) BERT-Based Approaches to Identifying Malicious URLs. <italic>Sensors</italic>, 23, Article 8499. https://doi.org/10.3390/s23208499 <pub-id pub-id-type="doi">10.3390/s23208499</pub-id><pub-id pub-id-type="pmid">37896591</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3390/s23208499">https://doi.org/10.3390/s23208499</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Su, M.</string-name>
              <string-name>Su, K.</string-name>
            </person-group>
            <year>2023</year>
            <article-title>BERT-Based Approaches to Identifying Malicious URLs</article-title>
            <source>Sensors</source>
            <volume>23</volume>
            <elocation-id>8499</elocation-id>
            <pub-id pub-id-type="doi">10.3390/s23208499</pub-id>
            <pub-id pub-id-type="pmid">37896591</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B7">
        <label>7.</label>
        <citation-alternatives>
          <mixed-citation publication-type="journal">Mnih, V., Kavukcuoglu, K., Silver, D., <italic>et al</italic>. (2015) Human-Level Control through Deep Reinforcement Learning. <italic>Nature</italic>, 518, 529-533. https://www.nature.com/articles/nature14236</mixed-citation>
          <element-citation publication-type="journal">
            <person-group person-group-type="author">
              <string-name>Mnih, V.</string-name>
              <string-name>Kavukcuoglu, K.</string-name>
              <string-name>Silver, D.</string-name>
            </person-group>
            <year>2015</year>
            <article-title>Human-Level Control through Deep Reinforcement Learning</article-title>
            <source>Nature</source>
            <volume>518</volume>
            <fpage>529</fpage>
            <lpage>533</lpage>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B8">
        <label>8.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Liu, R., Wang, Y., Xu, H., Qin, Z., Zhang, F., Liu, Y., <italic>et al</italic>. (2025) PMANet: Malicious URL Detection via Post-Trained Language Model Guided Multi-Level Feature Attention Network. <italic>Information Fusion</italic>, 113, Article 102638. https://doi.org/10.1016/j.inffus.2024.102638 <pub-id pub-id-type="doi">10.1016/j.inffus.2024.102638</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.inffus.2024.102638">https://doi.org/10.1016/j.inffus.2024.102638</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Liu, R.</string-name>
              <string-name>Wang, Y.</string-name>
              <string-name>Xu, H.</string-name>
              <string-name>Qin, Z.</string-name>
              <string-name>Zhang, F.</string-name>
              <string-name>Liu, Y.</string-name>
            </person-group>
            <year>2025</year>
            <article-title>PMANet: Malicious URL Detection via Post-Trained Language Model Guided Multi-Level Feature Attention Network</article-title>
            <source>Information Fusion</source>
            <volume>113</volume>
            <elocation-id>102638</elocation-id>
            <pub-id pub-id-type="doi">10.1016/j.inffus.2024.102638</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B9">
        <label>9.</label>
        <citation-alternatives>
          <mixed-citation publication-type="journal">AVS Kumar, S., <italic>et al</italic>. (2024) Phishing Email Detection Using Machine Learning. <italic>International Journal of Artificial Intelligence and Data Analysis</italic>, 11, 48-59.</mixed-citation>
          <element-citation publication-type="journal">
            <person-group person-group-type="author">
              <string-name>Kumar, S.</string-name>
            </person-group>
            <year>2024</year>
            <article-title>Phishing Email Detection Using Machine Learning</article-title>
            <source>International Journal of Artificial Intelligence and Data Analysis</source>
            <volume>11</volume>
            <fpage>48</fpage>
            <lpage>59</lpage>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B10">
        <label>10.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Rao, R.S., Kondaiah, C., Pais, A.R. and Lee, B. (2025) A Hybrid Super Learner Ensemble for Phishing Detection on Mobile Devices. <italic>Scientific Reports</italic>, 15, Article 16308. https://doi.org/10.1038/s41598-025-02009-8 <pub-id pub-id-type="doi">10.1038/s41598-025-02009-8</pub-id><pub-id pub-id-type="pmid">40374830</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1038/s41598-025-02009-8">https://doi.org/10.1038/s41598-025-02009-8</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Rao, R.S.</string-name>
              <string-name>Kondaiah, C.</string-name>
              <string-name>Pais, A.R.</string-name>
              <string-name>Lee, B.</string-name>
            </person-group>
            <year>2025</year>
            <article-title>A Hybrid Super Learner Ensemble for Phishing Detection on Mobile Devices</article-title>
            <source>Scientific Reports</source>
            <volume>15</volume>
            <elocation-id>16308</elocation-id>
            <pub-id pub-id-type="doi">10.1038/s41598-025-02009-8</pub-id>
            <pub-id pub-id-type="pmid">40374830</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B11">
        <label>11.</label>
        <citation-alternatives>
          <mixed-citation publication-type="confproc">Chatterjee, M. and Namin, A. (2019) Detecting Phishing Websites through Deep Reinforcement Learning. <italic>2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC)</italic>, Milwaukee, 15-19 July 2019, 227-232. https://doi.org/10.1109/compsac.2019.10211 <pub-id pub-id-type="doi">10.1109/compsac.2019.10211</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/compsac.2019.10211">https://doi.org/10.1109/compsac.2019.10211</ext-link></mixed-citation>
          <element-citation publication-type="confproc">
            <person-group person-group-type="author">
              <string-name>Chatterjee, M.</string-name>
              <string-name>Namin, A.</string-name>
            </person-group>
            <year>2019</year>
            <article-title>Detecting Phishing Websites through Deep Reinforcement Learning</article-title>
            <source>2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC)</source>
            <fpage>227</fpage>
            <lpage>232</lpage>
            <pub-id pub-id-type="doi">10.1109/compsac.2019.10211</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B12">
        <label>12.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Lopez-Martin, M., Carro, B. and Sanchez-Esguevillas, A. (2020) Application of Deep Reinforcement Learning to Intrusion Detection for Supervised Problems. <italic>Expert Systems with Applications</italic>, 141, Article 112963. https://doi.org/10.1016/j.eswa.2019.112963 <pub-id pub-id-type="doi">10.1016/j.eswa.2019.112963</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.eswa.2019.112963">https://doi.org/10.1016/j.eswa.2019.112963</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Lopez-Martin, M.</string-name>
              <string-name>Carro, B.</string-name>
              <string-name>Sanchez-Esguevillas, A.</string-name>
            </person-group>
            <year>2020</year>
            <article-title>Application of Deep Reinforcement Learning to Intrusion Detection for Supervised Problems</article-title>
            <source>Expert Systems with Applications</source>
            <volume>141</volume>
            <elocation-id>112963</elocation-id>
            <pub-id pub-id-type="doi">10.1016/j.eswa.2019.112963</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B13">
        <label>13.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Terranova, F., <italic>et al</italic>. (2024) Leveraging Deep Reinforcement Learning for Cyber-Attack Path Discovery. ACM Digital Library.</mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Terranova, F.</string-name>
            </person-group>
            <year>2024</year>
            <article-title>Leveraging Deep Reinforcement Learning for Cyber-Attack Path Discovery</article-title>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B14">
        <label>14.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Sahingoz, O.K., Buber, E., Demir, O. and Diri, B. (2019) Machine Learning Based Phishing Detection from URLs. <italic>Expert Systems with Applications</italic>, 117, 345-357. https://doi.org/10.1016/j.eswa.2018.09.029 <pub-id pub-id-type="doi">10.1016/j.eswa.2018.09.029</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.eswa.2018.09.029">https://doi.org/10.1016/j.eswa.2018.09.029</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Sahingoz, O.K.</string-name>
              <string-name>Buber, E.</string-name>
              <string-name>Demir, O.</string-name>
              <string-name>Diri, B.</string-name>
            </person-group>
            <year>2019</year>
            <article-title>Machine Learning Based Phishing Detection from URLs</article-title>
            <source>Expert Systems with Applications</source>
            <volume>117</volume>
            <fpage>345</fpage>
            <lpage>357</lpage>
            <pub-id pub-id-type="doi">10.1016/j.eswa.2018.09.029</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B15">
        <label>15.</label>
        <citation-alternatives>
          <mixed-citation publication-type="journal">Kheddar, H., Dawoud, D.W., Awad, A.I., Himeur, Y. and Khan, M.K. (2024) Reinforcement-Learning-Based Intrusion Detection in Communication Networks: A Review. <italic>IEEE Open Journal of the Communications Society</italic>, 5, 2115-2141.</mixed-citation>
          <element-citation publication-type="journal">
            <person-group person-group-type="author">
              <string-name>Kheddar, H.</string-name>
              <string-name>Dawoud, D.W.</string-name>
              <string-name>Awad, A.I.</string-name>
              <string-name>Himeur, Y.</string-name>
              <string-name>Khan, M.K.</string-name>
            </person-group>
            <year>2024</year>
            <article-title>Reinforcement-Learning-Based Intrusion Detection in Communication Networks: A Review</article-title>
            <source>IEEE Open Journal of the Communications Society</source>
            <volume>5</volume>
            <fpage>2115</fpage>
            <lpage>2141</lpage>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B16">
        <label>16.</label>
        <citation-alternatives>
          <mixed-citation publication-type="confproc">Devlin, J., Chang, M., Lee, K. and Toutanova, K. (2019) BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. <italic>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</italic>, Minneapolis, 2-7 June 2019, 4171-4186. https://doi.org/10.18653/v1/n19-1423 <pub-id pub-id-type="doi">10.18653/v1/n19-1423</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.18653/v1/n19-1423">https://doi.org/10.18653/v1/n19-1423</ext-link></mixed-citation>
          <element-citation publication-type="confproc">
            <person-group person-group-type="author">
              <string-name>Devlin, J.</string-name>
              <string-name>Chang, M.</string-name>
              <string-name>Lee, K.</string-name>
              <string-name>Toutanova, K.</string-name>
            </person-group>
            <year>2019</year>
            <article-title>BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding</article-title>
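            <source>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
            <fpage>4171</fpage>
            <lpage>4186</lpage>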
            <pub-id pub-id-type="doi">10.18653/v1/n19-1423</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B17">
        <label>17.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Young, T., Hazarika, D., Poria, S. and Cambria, E. (2018) Recent Trends in Deep Learning Based Natural Language Processing [Review Article]. <italic>IEEE Computational Intelligence Magazine</italic>, 13, 55-75. https://doi.org/10.1109/mci.2018.2840738 <pub-id pub-id-type="doi">10.1109/mci.2018.2840738</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/mci.2018.2840738">https://doi.org/10.1109/mci.2018.2840738</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Young, T.</string-name>
              <string-name>Hazarika, D.</string-name>
              <string-name>Poria, S.</string-name>
              <string-name>Cambria, E.</string-name>
            </person-group>
            <year>2018</year>
            <article-title>Recent Trends in Deep Learning Based Natural Language Processing [Review Article]</article-title>
            <source>IEEE Computational Intelligence Magazine</source>
            <volume>13</volume>
            <fpage>55</fpage>
            <lpage>75</lpage>
            <pub-id pub-id-type="doi">10.1109/mci.2018.2840738</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B18">
        <label>18.</label>
        <citation-alternatives>
          <mixed-citation publication-type="confproc">Maneriker, P., Stokes, J.W., Lazo, E.G., Carutasu, D., Tajaddodianfar, F. and Gururajan, A. (2021) URLTran: Improving Phishing URL Detection Using Transformers. <italic>MILCOM 2021-2021 IEEE Military Communications Conference (MILCOM)</italic>, San Diego, 29 November-2 December 2021, 197-204. https://doi.org/10.1109/milcom52596.2021.9653028 <pub-id pub-id-type="doi">10.1109/milcom52596.2021.9653028</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/milcom52596.2021.9653028">https://doi.org/10.1109/milcom52596.2021.9653028</ext-link></mixed-citation>
          <element-citation publication-type="confproc">
            <person-group person-group-type="author">
              <string-name>Maneriker, P.</string-name>
              <string-name>Stokes, J.W.</string-name>
              <string-name>Lazo, E.G.</string-name>
              <string-name>Carutasu, D.</string-name>
              <string-name>Tajaddodianfar, F.</string-name>
              <string-name>Gururajan, A.</string-name>
            </person-group>
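            <year>2021</year>
            <article-title>URLTran: Improving Phishing URL Detection Using Transformers</article-title>
            <source>MILCOM 2021-2021 IEEE Military Communications Conference (MILCOM)</source>
            <fpage>197</fpage>
            <lpage>204</lpage>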
            <pub-id pub-id-type="doi">10.1109/milcom52596.2021.9653028</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B19">
        <label>19.</label>
        <citation-alternatives>
          <mixed-citation publication-type="confproc">Saleem Raja, A., Vinodini, R. and Kavitha, A. (2021) Lexical Features Based Malicious URL Detection Using Machine Learning Techniques. <italic>Materials Today</italic>: <italic>Proceedings</italic>, 47, 163-166. https://doi.org/10.1016/j.matpr.2021.04.041 <pub-id pub-id-type="doi">10.1016/j.matpr.2021.04.041</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.matpr.2021.04.041">https://doi.org/10.1016/j.matpr.2021.04.041</ext-link></mixed-citation>
          <element-citation publication-type="confproc">
            <person-group person-group-type="author">
              <string-name>Saleem Raja, A.</string-name>
              <string-name>Vinodini, R.</string-name>
              <string-name>Kavitha, A.</string-name>
            </person-group>
            <year>2021</year>
            <article-title>Lexical Features Based Malicious URL Detection Using Machine Learning Techniques</article-title>
            <source>Materials Today: Proceedings</source>
            <volume>47</volume>
            <fpage>163</fpage>
            <lpage>166</lpage>
            <pub-id pub-id-type="doi">10.1016/j.matpr.2021.04.041</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B20">
        <label>20.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Ari Kustiawan, Y. and Ghauth, K.I. (2025) Evaluating the Impact of Feature Engineering in Phishing URL Detection: A Comparative Study of URL, HTML, and Derived Features. <italic>IEEE Access</italic>, 13, 126756-126768. https://doi.org/10.1109/access.2025.3579223 <pub-id pub-id-type="doi">10.1109/access.2025.3579223</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/access.2025.3579223">https://doi.org/10.1109/access.2025.3579223</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Ari Kustiawan, Y.</string-name>
              <string-name>Ghauth, K.I.</string-name>
            </person-group>
            <year>2025</year>
            <article-title>Evaluating the Impact of Feature Engineering in Phishing URL Detection: A Comparative Study of URL, HTML, and Derived Features</article-title>
            <source>IEEE Access</source>
            <volume>13</volume>
            <fpage>126756</fpage>
            <lpage>126768</lpage>
            <pub-id pub-id-type="doi">10.1109/access.2025.3579223</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B21">
        <label>21.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Asif, A.U.Z., Shirazi, H. and Ray, I. (2023) Machine Learning-Based Phishing Detection Using URL Features: A Comprehensive Review. In: Dolev, S. and Schieber, B., Eds., <italic>Lecture Notes in Computer Science</italic>, Springer, 481-497. https://doi.org/10.1007/978-3-031-44274-2_36 <pub-id pub-id-type="doi">10.1007/978-3-031-44274-2_36</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/978-3-031-44274-2_36">https://doi.org/10.1007/978-3-031-44274-2_36</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Asif, A.U.Z.</string-name>
              <string-name>Shirazi, H.</string-name>
              <string-name>Ray, I.</string-name>
            </person-group>
            <person-group person-group-type="editor">
              <string-name>Dolev, S.</string-name>
              <string-name>Schieber, B.</string-name>
            </person-group>
            <year>2023</year>
            <article-title>Machine Learning-Based Phishing Detection Using URL Features: A Comprehensive Review</article-title>
            <source>Lecture Notes in Computer Science</source>
            <publisher-name>Springer</publisher-name>
            <fpage>481</fpage>
            <lpage>497</lpage>
            <pub-id pub-id-type="doi">10.1007/978-3-031-44274-2_36</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B22">
        <label>22.</label>
        <citation-alternatives>
          <mixed-citation publication-type="journal">Brockman, G., Cheung, V., Pettersson, L., <italic>et al</italic>. (2016) OpenAI Gym. arXiv:1606.01540. https://arxiv.org/abs/1606.01540</mixed-citation>
          <element-citation publication-type="journal">
            <person-group person-group-type="author">
              <string-name>Brockman, G.</string-name>
              <string-name>Cheung, V.</string-name>
              <string-name>Pettersson, L.</string-name>
            </person-group>
            <year>2016</year>
            <article-title>OpenAI Gym</article-title>
            <pub-id pub-id-type="arxiv">1606.01540</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B23">
        <label>23.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Luong, N.C., Hoang, D.T., Gong, S., Niyato, D., Wang, P., Liang, Y., <italic>et al</italic>. (2019) Applications of Deep Reinforcement Learning in Communications and Networking: A Survey. <italic>IEEE Communications Surveys &amp; Tutorials</italic>, 21, 3133-3174. https://doi.org/10.1109/comst.2019.2916583 <pub-id pub-id-type="doi">10.1109/comst.2019.2916583</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/comst.2019.2916583">https://doi.org/10.1109/comst.2019.2916583</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Luong, N.C.</string-name>
              <string-name>Hoang, D.T.</string-name>
              <string-name>Gong, S.</string-name>
              <string-name>Niyato, D.</string-name>
              <string-name>Wang, P.</string-name>
              <string-name>Liang, Y.</string-name>
            </person-group>
            <year>2019</year>
            <article-title>Applications of Deep Reinforcement Learning in Communications and Networking: A Survey</article-title>
            <source>IEEE Communications Surveys &amp; Tutorials</source>
            <volume>21</volume>
            <fpage>3133</fpage>
            <lpage>3174</lpage>
            <pub-id pub-id-type="doi">10.1109/comst.2019.2916583</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B24">
        <label>24.</label>
        <citation-alternatives>
          <mixed-citation publication-type="journal">Mnih, V., Kavukcuoglu, K., Silver, D., <italic>et al</italic>. (2013) Playing Atari with Deep Reinforcement Learning. arXiv:1312.5602. https://arxiv.org/abs/1312.5602</mixed-citation>
          <element-citation publication-type="journal">
            <person-group person-group-type="author">
              <string-name>Mnih, V.</string-name>
              <string-name>Kavukcuoglu, K.</string-name>
              <string-name>Silver, D.</string-name>
            </person-group>
            <year>2013</year>
            <article-title>Playing Atari with Deep Reinforcement Learning</article-title>
            <pub-id pub-id-type="arxiv">1312.5602</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B25">
        <label>25.</label>
        <citation-alternatives>
          <mixed-citation publication-type="confproc">Kim, T., Park, N., Hong, J. and Kim, S. (2022) Phishing URL Detection: A Network-Based Approach Robust to Evasion. <italic>Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security</italic>, Los Angeles, 7-11 November 2022, 1679-1782. https://doi.org/10.1145/3548606.3560615 <pub-id pub-id-type="doi">10.1145/3548606.3560615</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/3548606.3560615">https://doi.org/10.1145/3548606.3560615</ext-link></mixed-citation>
          <element-citation publication-type="confproc">
            <person-group person-group-type="author">
              <string-name>Kim, T.</string-name>
              <string-name>Park, N.</string-name>
              <string-name>Hong, J.</string-name>
              <string-name>Kim, S.</string-name>
            </person-group>
            <year>2022</year>
            <article-title>Phishing URL Detection: A Network-Based Approach Robust to Evasion</article-title>
            <source>Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security</source>
            <fpage>1679</fpage>
            <lpage>1782</lpage>
            <pub-id pub-id-type="doi">10.1145/3548606.3560615</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B26">
        <label>26.</label>
        <citation-alternatives>
          <mixed-citation publication-type="confproc">Otieno, D.O., Abri, F., Namin, A.S. and Jones, K.S. (2023) Detecting Phishing URLs using the BERT Transformer Model. <italic>2023 IEEE International Conference on Big Data (BigData)</italic>, Sorrento, 15-18 December 2023, 1303-1310.</mixed-citation>
          <element-citation publication-type="confproc">
            <person-group person-group-type="author">
              <string-name>Otieno, D.O.</string-name>
              <string-name>Abri, F.</string-name>
              <string-name>Namin, A.S.</string-name>
              <string-name>Jones, K.S.</string-name>
            </person-group>
            <year>2023</year>
            <article-title>Detecting Phishing URLs using the BERT Transformer Model</article-title>
            <source>2023 IEEE International Conference on Big Data (BigData)</source>
            <fpage>1303</fpage>
            <lpage>1310</lpage>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B27">
        <label>27.</label>
        <citation-alternatives>
          <mixed-citation publication-type="book">Sutton, R.S. and Barto, A.G. (2018) Reinforcement Learning: An Introduction. 2nd Edition, MIT Press.</mixed-citation>
          <element-citation publication-type="book">
            <person-group person-group-type="author">
              <string-name>Sutton, R.S.</string-name>
              <string-name>Barto, A.G.</string-name>
            </person-group>
            <year>2018</year>
            <source>Reinforcement Learning: An Introduction</source>
            <edition>2nd</edition>
            <publisher-name>MIT Press</publisher-name>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B28">
        <label>28.</label>
        <citation-alternatives>
          <mixed-citation publication-type="confproc">Dabney, W., Rowland, M., Bellemare, M. and Munos, R. (2018) Distributional Reinforcement Learning with Quantile Regression. <italic>Proceedings of the AAAI Conference on Artificial Intelligence</italic>, 32, 2892-2901. https://doi.org/10.1609/aaai.v32i1.11791 <pub-id pub-id-type="doi">10.1609/aaai.v32i1.11791</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1609/aaai.v32i1.11791">https://doi.org/10.1609/aaai.v32i1.11791</ext-link></mixed-citation>
          <element-citation publication-type="confproc">
            <person-group person-group-type="author">
              <string-name>Dabney, W.</string-name>
              <string-name>Rowland, M.</string-name>
              <string-name>Bellemare, M.</string-name>
              <string-name>Munos, R.</string-name>
            </person-group>
            <year>2018</year>
            <article-title>Distributional Reinforcement Learning with Quantile Regression</article-title>
            <source>Proceedings of the AAAI Conference on Artificial Intelligence</source>
            <volume>32</volume>
            <pub-id pub-id-type="doi">10.1609/aaai.v32i1.11791</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B29">
        <label>29.</label>
        <citation-alternatives>
          <mixed-citation publication-type="confproc">Bellemare, M.G., Dabney, W. and Munos, R. (2017) A Distributional Perspective on Reinforcement Learning. <italic>Proceedings of the 34th International Conference on Machine Learning</italic>, Sydney, 6-11 August 2017, 449-458.</mixed-citation>
          <element-citation publication-type="confproc">
            <person-group person-group-type="author">
              <string-name>Bellemare, M.G.</string-name>
              <string-name>Dabney, W.</string-name>
              <string-name>Munos, R.</string-name>
            </person-group>
            <year>2017</year>
            <article-title>A Distributional Perspective on Reinforcement Learning</article-title>
            <source>Proceedings of the 34th International Conference on Machine Learning</source>
            <conf-loc>Sydney</conf-loc>
            <conf-date>6-11 August 2017</conf-date>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B30">
        <label>30.</label>
        <citation-alternatives>
          <mixed-citation publication-type="other">Maci, A., Santorsola, A., Coscia, A. and Iannacone, A. (2023) Unbalanced Web Phishing Classification through Deep Reinforcement Learning. <italic>Computers</italic>, 12, Article 118. https://doi.org/10.3390/computers12060118 <pub-id pub-id-type="doi">10.3390/computers12060118</pub-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3390/computers12060118">https://doi.org/10.3390/computers12060118</ext-link></mixed-citation>
          <element-citation publication-type="other">
            <person-group person-group-type="author">
              <string-name>Maci, A.</string-name>
              <string-name>Santorsola, A.</string-name>
              <string-name>Coscia, A.</string-name>
              <string-name>Iannacone, A.</string-name>
            </person-group>
            <year>2023</year>
            <article-title>Unbalanced Web Phishing Classification through Deep Reinforcement Learning</article-title>
            <source>Computers</source>
            <volume>12</volume>
            <elocation-id>118</elocation-id>
            <pub-id pub-id-type="doi">10.3390/computers12060118</pub-id>
          </element-citation>
        </citation-alternatives>
      </ref>
      <ref id="B31">
        <label>31.</label>
        <citation-alternatives>
          <mixed-citation publication-type="journal">Egigogo, O.E., Idris, I.A., Olalere, O.M., Abisoye, O.G. and Ojeniyi, B.A. (2022) Development of Hybridized CNN-BiGRU Framework for Detection of Website Phishing Attacks. <italic>Nigerian Journal of Technological Research</italic>, 3, 45-54.</mixed-citation>
          <element-citation publication-type="journal">
            <person-group person-group-type="author">
              <string-name>Egigogo, O.E.</string-name>
              <string-name>Idris, I.A.</string-name>
              <string-name>Olalere, O.M.</string-name>
              <string-name>Abisoye, O.G.</string-name>
              <string-name>Ojeniyi, B.A.</string-name>
            </person-group>
            <year>2022</year>
            <article-title>Development of Hybridized CNN-BiGRU Framework for Detection of Website Phishing Attacks</article-title>
            <source>Nigerian Journal of Technological Research</source>
            <volume>3</volume>
          </element-citation>
        </citation-alternatives>
      </ref>
    </ref-list>
  </back>
</article>