Intelligent Frequency-Hopping Strategies for Securing the Physical Layer Based on Learning for the Stochastic Dispersion Problem
1. Introduction
In today’s wireless communications landscape, securing the physical layer has become an essential defense against the emergence of increasingly sophisticated and adaptive jamming threats [1]. While traditional frequency-hopping methods often rely on predefined switching sequences that are vulnerable to interception or prediction, the integration of artificial intelligence opens new avenues for enhanced resilience. This article explores an intelligent frequency-hopping strategy designed to counter malicious adversaries in dynamic and uncertain environments. By leveraging the power of Q-learning, a reinforcement learning algorithm, the system becomes capable of autonomously learning jamming patterns and selecting optimal transmission channels in real time, without prior knowledge of the attacker’s strategy [2].
To overcome the limitations of purely random exploration and avoid getting stuck in local optima, we introduce a stochastic dispersion mechanism. This approach allows us to diversify frequency hops probabilistically, ensuring maximum unpredictability against the adversary’s tracking attempts while minimizing packet collisions. Through this synergy between reinforcement learning and controlled randomness, our work aims to optimize transmission success rates and reduce energy consumption, thereby offering a robust and self-adaptive solution to secure the critical communication networks of tomorrow.
2. Relevant Literature
Securing the physical layer has traditionally relied on the intrinsic properties of the transmission channel to ensure confidentiality, moving away from purely cryptographic approaches used in higher layers. In this field, Frequency Hopping is a pioneering spread-spectrum technique, initially designed to combat interference and intentional jamming. However, foundational literature highlights a major limitation: the use of static pseudorandom sequences which, once intercepted by an adversary with advanced computational capabilities, render the system vulnerable [3].
The emergence of Smart Jammers, capable of learning and predicting hopping patterns, necessitated a shift toward proactive defense mechanisms. The introduction of reinforcement learning, and more specifically Q-Learning, marked a decisive turning point. Research by Wang et al. has demonstrated how an agent can optimize its channel selection policy by interacting with a hostile environment, modeled as a Markov Decision Process [4]. Unlike reactive methods, Q-Learning allows the transmitter to anticipate the jammer’s actions by maximizing a reward function linked to the signal-to-interference-plus-noise ratio. Nevertheless, a recurring issue in the literature concerns the trade-off between exploration and exploitation. Insufficient exploration causes the transmitter to remain on suboptimal frequencies, while excessive exploration degrades quality of service. To address this rigidity, recent work explores stochastic dispersion as a complement to artificial intelligence, suggesting that introducing controlled randomness into the agent's action space enhances the system’s unpredictability [5].
This approach relies on wave propagation properties and channel stochasticity to conceal the transmitter’s intentions. By incorporating a dispersion component, the system no longer simply follows the “best” learned frequency, but navigates within a subspace of secure frequencies, making any attempt at tracking mathematically complex [6]. The current literature thus converges toward hybrid architectures where reinforcement learning ensures adaptation to traffic conditions, while the stochasticity of the physical layer guarantees robustness against interception and traffic analysis [7].
3. Methodology
The system is modeled as a Q-learning agent interacting with a dynamic and hostile radio environment. At each time step $t$, the transmitter observes the state of the spectrum $s_t$ (interference levels on the channels) and selects an action $a_t$ corresponding to a hopping frequency. The value function is updated according to the Bellman equation:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right] \quad (1)$$
where $\alpha$ is the learning rate, $\gamma$ is the discount factor, and the reward $r_t$ is correlated with the success of the transmission (no collision with the jammer). To break the predictability of deterministic optimal policies, we incorporate a stochastic dispersion layer. Rather than systematically choosing the action $a^{*} = \arg\max_{a} Q(s_t, a)$, the agent selects a frequency according to a Boltzmann probability distribution weighted by the channel entropy [8]. This approach allows the hops to be randomly dispersed within a subspace of high-reward frequencies, making the transmission pattern difficult for an external observer to predict. The model’s convergence is evaluated by the bit error rate and the probability of interception under various reactive jamming configurations [9].
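The update rule and the Boltzmann selection described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the defaults `alpha=0.1`, `gamma=0.9` and `tau=1.0` are taken from Table 1, and the entropy weighting of the distribution mentioned in the text is omitted (plain Boltzmann sampling is used instead).

```python
import math
import random

def boltzmann_probs(q_row, tau=1.0):
    """Boltzmann (softmax) probabilities over one row of Q-values."""
    m = max(q_row)  # subtract the maximum for numerical stability
    weights = [math.exp((q - m) / tau) for q in q_row]
    total = sum(weights)
    return [w / total for w in weights]

def select_action(q_row, tau=1.0, rng=random):
    """Sample a channel index from the Boltzmann distribution (no entropy weighting)."""
    probs = boltzmann_probs(q_row, tau)
    return rng.choices(range(len(q_row)), weights=probs, k=1)[0]

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new value."""
    td_target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q[s][a]

# Usage: five channels, all Q-values initialised to zero.
Q = [[0.0] * 5 for _ in range(5)]
new_value = q_update(Q, s=0, a=2, r=-1.0, s_next=2)  # collision on channel index 2
next_channel = select_action(Q[0])
```

Sampling from the distribution rather than taking the argmax is what keeps the hop sequence non-deterministic once the table has converged.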
Attacker Model: From Stationary to Adaptive Jamming
In this study, we consider an adversarial environment where an entity, denoted as Eve, aims to disrupt the communication between Alice and Bob by injecting interference. To evaluate the robustness of our proposed Q-learning strategy, we define two levels of adversarial capability:
Reactive Smart Jammer: The primary threat model is an intelligent agent capable of sensing the spectrum and adapting its strategy to follow the transmitter’s hopping pattern. This represents the “stochastic dispersion” challenge where the transmitter must maintain non-deterministic behavior.
Stationary Benchmark: For the purpose of quantifying the baseline convergence and success rate, the simulations focus on a stationary jammer fixed at a center frequency of 150 Hz with a constant power spectral density.
While the experimental results (Section 5) emphasize the system’s ability to “learn” and bypass the 150 Hz interference, this scenario serves as a fundamental proof-of-concept. By successfully identifying and avoiding a persistent obstacle, the agent demonstrates the underlying mechanism required to counter more complex, time-varying adaptive jammers.
4. System and Channel Modeling
The system consists of a transmitter (Alice), a legitimate receiver (Bob), and an intelligent jammer (Eve) operating in a two-dimensional space. The signal received by Bob at time $t$ on frequency $f_k$ is modeled by:
$$y_k(t) = \sqrt{P_{\mathrm{tx}}\, d^{-\eta}}\; h_k(t)\, x(t) + J_k(t) + n(t) \quad (2)$$
where $P_{\mathrm{tx}}$ is the transmit power, $d$ is the distance, $\eta$ is the path loss exponent, and $h_k(t)$ is the Rayleigh channel coefficient; $x(t)$ denotes the transmitted symbol and $n(t)$ the additive noise. The jamming interference $J_k(t)$ depends on Eve’s collision strategy. Stochastic scattering exploits the random nature of the electromagnetic field. Based on Maxwell’s equations in a complex medium, the electric field $E(\mathbf{r}, t)$ resulting from the superposition of multiple paths follows a statistical distribution. The spectral power density can be described by a spatio-temporal correlation function:
$$R_E(\Delta\mathbf{r}, \Delta t) = \mathbb{E}\left[ E(\mathbf{r}, t)\, E^{*}(\mathbf{r} + \Delta\mathbf{r},\, t + \Delta t) \right] \quad (3)$$
This intrinsic stochasticity is used to parameterize the variance of channel selection. The channel is discretized into $N$ orthogonal subbands. The system state $s_t$ is defined by the vector of received interference powers $[I_1(t), \ldots, I_N(t)]$. The instantaneous Shannon channel capacity per unit bandwidth, which will serve as the basis for the Q-learning reward, is then:
$$C_k(t) = \log_2\left( 1 + \frac{P_{\mathrm{tx}}\, d^{-\eta}\, |h_k(t)|^2}{P_J + \sigma^2} \right) \quad (4)$$
where $P_J$ is the jammer power and $\sigma^2$ is the thermal noise power. This model ensures that the agent does not simply learn a fixed sequence, but adapts to the physical dynamics of the signal and the channel uncertainty, thereby maximizing the probability of secure transmission against a reactive adversary.
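As a rough numerical illustration of the channel model, the sketch below evaluates the SINR and the per-unit-bandwidth capacity for one subband. It assumes the usual form in which the received power (transmit power times path loss times Rayleigh power gain) is divided by jammer power plus noise power; the function and parameter names are hypothetical, and the Rayleigh power gain is assumed to have unit mean.

```python
import math
import random

def rayleigh_power_gain(rng=random):
    """Draw |h|^2 for a unit-mean Rayleigh channel (exponentially distributed)."""
    return rng.expovariate(1.0)

def sinr(p_tx, d, eta, gain, p_jam, noise):
    """Received signal power over jammer-plus-noise power (linear scale)."""
    return p_tx * d ** (-eta) * gain / (p_jam + noise)

def capacity(p_tx, d, eta, gain, p_jam, noise):
    """Shannon capacity per unit bandwidth, in bit/s/Hz."""
    return math.log2(1.0 + sinr(p_tx, d, eta, gain, p_jam, noise))

# Usage: the same link with and without a jammer on the subband (assumed values).
gain = rayleigh_power_gain(random.Random(0))
clean = capacity(p_tx=1.0, d=1.0, eta=2.0, gain=gain, p_jam=0.0, noise=0.1)
jammed = capacity(p_tx=1.0, d=1.0, eta=2.0, gain=gain, p_jam=1.0, noise=0.1)
```

For any fixed fading realisation the jammed subband yields a strictly lower capacity, which is the quantity the reward is built on.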
5. Design of the Reward Function
In our intelligent frequency-hopping model, the objective function must balance three criteria: successful transmission, energy efficiency, and resistance to interference. We define a composite objective function given by:
$$R_t = w_1 R_{\mathrm{succ}} - w_2 E_{\mathrm{sw}} - w_3 P_{\mathrm{col}} \quad (5)$$
where $w_1, w_2, w_3$ are the normalized weighting coefficients. The first term, $R_{\mathrm{succ}}$, represents transmission success, defined as the normalized bit rate achieved over the selected channel; it is maximized when the signal-to-interference-plus-noise ratio (SINR) exceeds a decoding threshold $\gamma_{\mathrm{th}}$. The second term, $E_{\mathrm{sw}}$, penalizes the energy cost associated with frequency switching, since each fast hopping operation consumes computational and synchronization resources. Finally, $P_{\mathrm{col}}$ is the collision penalty, activated when the selected frequency coincides with the jammer’s frequency band detected by the receiver [10]. To account for stochastic dispersion, we modify the reward structure by adding an entropy term:
$$R'_t = R_t + \tau H(\pi) \quad (6)$$
where $H(\pi) = -\sum_{a} \pi(a \mid s_t) \log \pi(a \mid s_t)$ is the entropy of the selection policy and $\tau$ is a temperature parameter. This formulation encourages the agent to maintain a certain degree of variability in its frequency choices. A high reward is therefore assigned not merely to the absence of jamming, but to a strategy that remains unpredictable (entropic) while remaining effective [11]. This design pushes the Q-learning algorithm toward a robust policy that discourages pattern-prediction attacks, thereby strengthening physical-layer security in a highly unstable environment.
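The composite objective and its entropy bonus can be illustrated as follows. This is a sketch under stated assumptions: the weights `(0.6, 0.2, 0.2)` are hypothetical placeholders (the text only says they are normalized), the switching and collision terms are assumed to enter with a negative sign because the text calls them penalties, and all function names are illustrative.

```python
import math

def policy_entropy(probs):
    """Shannon entropy (in nats) of a channel-selection distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def composite_reward(success, switch_cost, collision, weights=(0.6, 0.2, 0.2)):
    """Weighted success term minus switching and collision penalties (assumed signs)."""
    w1, w2, w3 = weights
    return w1 * success - w2 * switch_cost - w3 * collision

def dispersed_reward(base_reward, probs, tau=1.0):
    """Add the entropy bonus, scaled by the temperature, to the base reward."""
    return base_reward + tau * policy_entropy(probs)

# Usage: the same successful slot scored under a spread-out and a deterministic policy.
base = composite_reward(success=1.0, switch_cost=1.0, collision=0.0)
spread = dispersed_reward(base, [0.25, 0.25, 0.25, 0.25])
fixed = dispersed_reward(base, [1.0, 0.0, 0.0, 0.0])
```

With identical transmission outcomes, the spread-out policy scores higher than the deterministic one, which is the incentive the entropy term is meant to create.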
Analysis of the results in Figure 1, obtained with the standard Q-learning algorithm, reveals a clear advantage in adaptability over conventional frequency hopping. In the early stages of the simulation, the error rate is unstable; this corresponds to the exploration phase, during which the agent tests channels at random, including the jammer’s channel (150 Hz). After a small number of iterations, however, the cumulative reward curve shows logarithmic growth before stabilizing. This convergence indicates that the agent has identified the attacker’s frequency signature and updated its Q-table to minimize the probability of selecting the compromised channel. Unlike a static frequency-hopping system, which keeps colliding with the jammer every time its sequence returns to the jammed channel, the intelligent agent achieves a transmission success rate close to 100% in steady state.
Figure 1. A robust implementation of a standard Q-learning agent.
The final distribution of the Q values shows a marked contrast: the healthy frequencies (50, 100, 200, 250 Hz) exhibit positive and balanced values, while the 150 Hz frequency shows a strongly negative value, acting as a mathematical barrier. The effectiveness of the $\varepsilon$-greedy strategy is crucial here, as it allows for continuous monitoring of the environment; if the jammer were to change targets, the agent would be able to rediscover a new secure path. These results confirm that reinforcement learning transforms physical-layer security from a reactive approach into a proactive strategy. By dynamically optimizing spectrum usage, this method ensures not only confidentiality but also the operational resilience of communication networks in the face of persistent and localized threats.
Figure 2. Performance graph.
Analysis of performance metrics in Figure 2 confirms the superiority of the intelligent approach in terms of reliability and spectral efficiency. The first graph, illustrating the convergence rate, shows that the Q-learning agent stabilizes its strategy in fewer than 150 episodes. This rapid learning phase is crucial for real-time communications, as it limits the duration during which the system is vulnerable to collisions. Once convergence is reached, the cumulative average reward plateaus at its maximum value, indicating that the algorithm has effectively excluded the jammed channel from its usual decision space.
The second indicator, the success rate as a function of SNR, validates the system’s physical robustness. We observe that the success rate reaches a plateau of 100% as soon as the signal-to-noise ratio exceeds the 8 dB threshold, demonstrating excellent resilience not only against intentional jamming but also against ambient thermal noise. Unlike conventional frequency-hopping methods, which would suffer a constant 20% loss (with 1 out of 5 channels compromised), our model maintains optimal service continuity.
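The 20% baseline figure can be checked with a short Monte Carlo sketch. It assumes that conventional hopping draws each slot uniformly at random from the five channels and that the jammer stays on a single channel; the function name, the slot count and the seed are hypothetical.

```python
import random

def uniform_hopping_success(n_channels=5, jammed=2, n_slots=100_000, seed=0):
    """Fraction of slots that avoid the jammed channel under uniform random hopping."""
    rng = random.Random(seed)
    clear = sum(1 for _ in range(n_slots) if rng.randrange(n_channels) != jammed)
    return clear / n_slots

# Usage: the estimate should sit near the analytical value 1 - 1/5 = 0.8.
baseline = uniform_hopping_success()
```

The estimate approaches 0.8, so a non-learning hopper gives up roughly one slot in five regardless of how long it runs.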
This synergy between artificial intelligence and physical layer modeling enables a constant Quality of Service, even under degraded channel conditions. In conclusion, these quantitative results support the idea that self-adaptation is the key to securing wireless networks against dynamic and unpredictable threats.
5.1. Simulation Setup and Hyperparameters
To ensure the reproducibility of the results presented in this study, the simulation environment was configured to model a wireless link under adaptive jamming conditions. Alice and Bob communicate over a Rayleigh fading channel, which is discretized into $N$ orthogonal subbands. The reinforcement learning agent (Alice) starts with no prior knowledge, initializing all Q-values to zero. The specific parameters used for the convergence analysis and success-rate evaluations are summarized in Table 1. The adversary is modeled as an intelligent jammer targeting a specific spectral segment (150 Hz) to evaluate the system’s avoidance capabilities. To ensure statistical significance, the results were averaged over multiple Monte Carlo iterations, accounting for both the stochastic nature of the electromagnetic channel and the probabilistic Boltzmann selection policy.
Table 1. System simulation and learning hyperparameters.
| Category | Parameter | Value |
| --- | --- | --- |
| Network | Number of Channels ($N$) | 5/64 |
| | Dwell Time ($T_d$) | Milliseconds (ms) |
| Adversary | Jammer Behavior | Stationary/Reactive at 150 Hz |
| | Jammer Power ($P_J$) | Constant |
| Learning | Max Episodes ($E_{\max}$) | 500 |
| | Learning Rate ($\alpha$) | 0.1 |
| | Discount Factor ($\gamma$) | 0.9 |
| | Exploration Rate ($\varepsilon$) | 0.1 ($\varepsilon$-greedy) |
| | Temperature ($\tau$) | 1.0 (Softmax) |
| Physical | SNR Range | −5 dB to 20 dB |
| | Path Loss Exponent ($\eta$) | 2.0 |
| | Decoding Threshold ($\gamma_{\mathrm{th}}$) | Variable |
5.2. Markov Decision Process Formulation
The frequency-hopping problem is modeled as a Markov Decision Process (MDP) defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{R})$. To maintain consistency across our analysis, we define the variables as follows:
State ($\mathcal{S}$): The state $s_t \in \mathcal{S}$ at time $t$ represents the local sensing observation of the spectrum. We define $s_t$ as the index of the previously occupied channel, $s_t = a_{t-1} \in \{1, \ldots, N\}$. This formulation avoids the complexity of a full interference vector while allowing the agent to learn transition probabilities relative to the jammer’s behavior.
Action ($\mathcal{A}$): An action $a_t \in \mathcal{A}$ corresponds to the selection of the next frequency band $f_k$ from the set of $N$ available orthogonal channels. Thus, $\mathcal{A} = \{1, \ldots, N\}$.
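Reward ($\mathcal{R}$): The reward $r_t$ is a scalar feedback signal reflecting the quality of the transmission. It is defined as:
$$r_t = \beta\, C_{a_t}(t) - \lambda_J\, \mathbb{1}\{a_t \text{ is jammed}\} - c_{\mathrm{sw}}\, \mathbb{1}\{a_t \neq a_{t-1}\} \quad (7)$$
where $\beta$ is a scaling factor applied to the capacity $C_{a_t}(t)$ of Eq. (4) on the selected channel, $\lambda_J$ is the penalty for interference, and $c_{\mathrm{sw}}$ is the cost of frequency switching.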
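The state, action and reward definitions above can be put together in a small training loop for the stationary benchmark. This is a sketch rather than the simulation used for the reported figures: the reward is simplified to +1 for a clear slot and −1 for a collision, the switching cost `0.05` and the `slots=20` per episode are hypothetical values, and fading is ignored. The learning rate, discount factor, exploration rate and episode count follow Table 1.

```python
import random

FREQS_HZ = [50, 100, 150, 200, 250]   # channel grid quoted in the results
JAMMED = FREQS_HZ.index(150)          # stationary jammer of the benchmark scenario

def step(state, action, switch_cost=0.05):
    """Simplified reward: +1 if clear, -1 if jammed, minus a cost when hopping."""
    reward = -1.0 if action == JAMMED else 1.0
    if action != state:
        reward -= switch_cost
    return action, reward              # next state = channel just occupied

def train(episodes=500, slots=20, alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration; returns the Q-table."""
    rng = random.Random(seed)
    n = len(FREQS_HZ)
    Q = [[0.0] * n for _ in range(n)]  # Q[state][action], all zeros initially
    for _ in range(episodes):
        s = rng.randrange(n)
        for _ in range(slots):
            if rng.random() < eps:
                a = rng.randrange(n)                        # explore
            else:
                a = max(range(n), key=lambda k: Q[s][k])    # exploit
            s_next, r = step(s, a)
            Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
            s = s_next
    return Q

# Usage: greedy channel choice per state after training.
Q = train()
greedy = [max(range(len(FREQS_HZ)), key=lambda k: Q[s][k]) for s in range(len(FREQS_HZ))]
```

Because the state is only the previous channel index, the table has $N \times N$ entries, which keeps the example small enough to inspect by hand.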
5.3. Novelty: Stochastic Dispersion and Standard Softmax
The primary distinction between the proposed stochastic dispersion layer and the standard Softmax exploration typically found in Q-learning lies in its temporal application and its role in the security architecture. In conventional reinforcement learning, Softmax is a transient exploration mechanism where the temperature parameter $\tau$ often decays, eventually leading the agent to a deterministic policy that exploits the single best frequency. In our framework, the dispersion layer acts as a permanent policy-shaping constraint. Even after the Q-table converges, the action selection maintains a controlled level of entropy to ensure that the frequency-hopping sequence remains non-deterministic. From a reward-shaping perspective, we introduce a stochasticity bonus into the objective function, penalizing the agent for selecting the same subband with a probability exceeding a security threshold. This ensures that while the system avoids jammed channels (exploitation), it “disperses” its remaining transmissions across the safe spectrum to prevent an intelligent adversary from predicting and following the next hop, a feature not present in standard goal-oriented Q-learning.
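The difference between a decaying and a permanent temperature can be made concrete by comparing policy entropies on the same Q-values. The sketch below is illustrative only: `Q_CONVERGED` is a hypothetical converged row (four safe channels, one jammed), `tau=0.01` stands in for a temperature that has decayed, and the security threshold `p_max` and penalty size are assumptions, since the text does not give their values.

```python
import math

def softmax(q_values, tau):
    """Boltzmann probabilities at temperature tau."""
    m = max(q_values)
    weights = [math.exp((q - m) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def entropy_bits(probs):
    """Shannon entropy of the hop distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def stochasticity_penalty(probs, p_max=0.5, penalty=1.0):
    """Penalty applied when any single subband exceeds the security threshold."""
    return penalty if max(probs) > p_max else 0.0

# Hypothetical converged Q-values; index 2 is the jammed channel.
Q_CONVERGED = [1.0, 0.9, -1.0, 0.8, 0.95]

permanent = softmax(Q_CONVERGED, tau=1.0)    # dispersion layer: temperature held fixed
decayed = softmax(Q_CONVERGED, tau=0.01)     # standard exploration after decay

h_permanent = entropy_bits(permanent)
h_decayed = entropy_bits(decayed)
```

With the fixed temperature the hops stay spread over the safe channels, whereas the decayed policy concentrates almost all probability on one channel and would trigger the threshold penalty.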
5.4. Performance Analysis and Convergence Interpretation
As illustrated in the performance curves, the proposed system achieves a near-perfect success rate once the Signal-to-Noise Ratio (SNR) exceeds 8 dB. This result can be interpreted through the lens of the interaction between the Q-learning avoidance strategy and the underlying Rayleigh fading model. Below this threshold, the bit error rate (BER) is dominated by additive white Gaussian noise (AWGN) and deep fades inherent to the channel, which persist even when the agent successfully avoids the 150 Hz jammer. At 8 dB and above, the “learning gain” becomes the primary driver of performance: the agent has successfully updated its Q-table to identify the jammed subband as a high-penalty state, effectively neutralizing the adversary. Consequently, the success rate saturates because the signal power is now sufficient to overcome standard channel impairments in the remaining “safe” subbands. This result holds under the assumptions of quasi-static fading during the dwell time and a jammer with constant power spectral density, confirming that the stochastic dispersion layer successfully balances interference avoidance with robust signal recovery.
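The saturation behaviour on a jammer-free subband can be related to the standard Rayleigh outage expression. The following is a textbook relation under stated assumptions, not a result taken from the simulations: the power gain $|h|^2$ is assumed exponentially distributed with unit mean, fading is quasi-static over the dwell time, $\bar{\gamma}$ denotes the mean SNR, and $\gamma_{\mathrm{th}}$ is the decoding threshold (both in linear scale).

```latex
P_{\mathrm{succ}}(\bar{\gamma})
  = \Pr\left\{ \bar{\gamma}\, |h|^{2} \ge \gamma_{\mathrm{th}} \right\}
  = \exp\left( -\frac{\gamma_{\mathrm{th}}}{\bar{\gamma}} \right)
```

The expression increases monotonically toward 1 as $\bar{\gamma}$ grows, and the SNR at which the curve visibly saturates depends on the value chosen for $\gamma_{\mathrm{th}}$, which Table 1 lists as variable.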
6. Complexity and Latency Analysis
The algorithmic complexity of standard Q-learning lies primarily in updating the value table and selecting the action. For a state space $\mathcal{S}$ and a set of actions $\mathcal{A}$, the spatial complexity is $O(|\mathcal{S}| \cdot |\mathcal{A}|)$, which, in our case of frequency hopping with a single state (the current spectrum), reduces to $O(N)$. In terms of time, each iteration requires a search for the maximum $\max_{a} Q(s, a)$, resulting in a complexity of $O(N)$.
The introduction of stochastic dispersion via the Softmax function adds an exponential calculation for each channel:
$$\pi(a \mid s) = \frac{\exp\left( Q(s, a) / \tau \right)}{\sum_{a'} \exp\left( Q(s, a') / \tau \right)}$$
where $\tau$ is the temperature. The total computation time per slot, $T_{\mathrm{comp}}$, must satisfy the real-time condition:
$$T_{\mathrm{comp}} < T_d \quad (8)$$
where $T_d$ is the dwell time.
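One way to check the real-time condition on a given platform is to time the selection step directly. The sketch below measures the mean time of one Softmax selection over 64 channels in pure Python; the function names, the trial count and the 1 ms dwell time are assumptions, and the measured figure depends entirely on the hardware and interpreter, so it says nothing about an embedded or FPGA implementation.

```python
import math
import random
import time

def softmax_select(q_values, tau, rng):
    """One O(N) Boltzmann selection: N exponentials followed by one weighted draw."""
    m = max(q_values)
    weights = [math.exp((q - m) / tau) for q in q_values]
    return rng.choices(range(len(q_values)), weights=weights, k=1)[0]

def mean_selection_time(n_channels=64, trials=10_000, tau=1.0, seed=0):
    """Average wall-clock time (seconds) of one selection on this machine."""
    rng = random.Random(seed)
    q_values = [rng.uniform(-1.0, 1.0) for _ in range(n_channels)]
    start = time.perf_counter()
    for _ in range(trials):
        softmax_select(q_values, tau, rng)
    return (time.perf_counter() - start) / trials

def meets_realtime(t_comp, t_dwell):
    """Real-time condition: computation must finish within the dwell time."""
    return t_comp < t_dwell

# Usage: compare the measured selection time with an assumed 1 ms dwell time.
t_comp = mean_selection_time(trials=1_000)
ok = meets_realtime(t_comp, t_dwell=1e-3)
```

A full per-slot budget would also include the Q-table update and any sensing step, which this measurement leaves out.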
In practice, for the channel counts listed in Table 1, the calculation takes only a few microseconds (μs), whereas standard FHSS slots are on the order of milliseconds (ms). The additional latency introduced by artificial intelligence is therefore negligible, ensuring that the selection of the next frequency is completed before the end of the current transmission. This efficiency allows maximum spectral agility to be maintained without degrading the bit rate, confirming the viability of the approach for securing the physical layer in highly dynamic environments.
The critical analysis presented in Table 2 highlights that, although standard Q-learning provides a robust proof of concept for anti-jamming, scaling it up to an industrial level requires structural adjustments. The combinatorial explosion of the state space in dense IoT networks justifies the transition to Deep Q-Learning, which is capable of handling continuous variables via neural networks. Furthermore, real-time viability will depend on hardware integration on FPGA chips, enabling the reduction of computational latency below the dwell time threshold.
Table 2. Critical analysis and the evolution of the intelligent frequency-hopping model.
| Critical Aspect | Current Approach (Q-Learning) | Proposed Evolution (Outlook) | Impact on Performance |
| --- | --- | --- | --- |
| Complexity Management | Static Q-table $Q(s, a)$. Limited by combinatorial explosion. | Deep Q-Learning (DQN): Use of deep neural networks. | Better generalization in dense IoT environments. |
| Latency and Throughput | Assumed perfect synchronization. Unquantified computation delay. | Hardware Optimization (FPGA): Parallel computation of rewards. | Reduction of Dwell Time and increase in actual throughput. |
| Attacker Model | Stationary or predictable jamming (simple Smart Jamming). | Multi-Agent Learning (MARL): Transmitter vs. AI Jammer competition. | Resilience against self-adaptive attacks. |
| Energy Efficiency | Exclusive Focus on Success Rate (SNR). | Multi-Objective Q-Learning: Reward Including Power Consumption. | Extended Battery Life for IoT Devices. |
| Synchronization | Implicit coordination mechanism between nodes. | Hybrid Sequences: Combination of fixed keys and AI-based adjustments. | Rapid recovery after a major collision. |
Finally, security cannot be viewed as static; in the face of a “Smart Jammer” that also uses artificial intelligence, the adoption of multi-agent reinforcement learning (MARL) models, analyzed with game-theoretic tools, becomes essential to maintaining a strategic advantage. Incorporating energy efficiency criteria into the reward function will enable this technology to be adapted to the strict constraints of autonomous sensors, thereby ensuring long-term and sustainable protection for next-generation networks.
7. Conclusions and Future Works
7.1. Conclusions
This paper has presented an intelligent frequency-hopping strategy for enhancing physical layer security against adaptive jamming threats. By combining reinforcement learning via Q-learning with a stochastic dispersion mechanism, we have shown that a communication system can not only learn to avoid compromised frequencies but also maintain the unpredictability crucial for preventing interception. Simulation results confirm rapid convergence of the algorithm, enabling a transmission success rate close to 100% as soon as the signal-to-noise ratio exceeds 8 dB, whereas conventional FHSS with a fixed hopping set keeps losing the transmissions that land on the jammed channel.
While the proposed strategy demonstrates high resilience and rapid convergence, several simplifying assumptions were made to establish this baseline proof-of-concept. First, this study focuses on a single-link, single-jammer scenario. In dense network environments, the presence of multi-user interference and multiple distributed jammers would significantly increase the state-space complexity and may require multi-agent reinforcement learning architectures. Furthermore, the current model assumes perfect time and frequency synchronization between Alice and Bob. In practical hardware implementations, synchronization errors and propagation delays could affect the stability of the learning loop. Finally, the simulations were conducted under static conditions; the impact of node mobility, which introduces dynamic Doppler shifts and rapidly time-varying channel geometries, remains to be investigated. Future research will focus on scaling this stochastic dispersion layer to ad-hoc multi-node networks and evaluating its robustness against non-stationary mobility patterns.
7.2. Future Works
A natural extension of this study involves integrating Deep Q-Networks to handle large, continuous state spaces, overcoming the limitations of the classical Q-table when dealing with heterogeneous control signals. Furthermore, the implementation of multi-agent strategies would enable decentralized coordination among multiple legitimate users, thereby optimizing spectrum access while avoiding mutual interference in 6G networks.
Another major focus is the study of robustness against cognitive jammers, which also use reinforcement learning to predict the transmitter’s stochastic dispersion. Finally, we plan to validate these models on Software-Defined Radio platforms to measure the actual impact of hardware imperfections and synchronization delays on the stability of the learning loop in an industrial setting.