1. Introduction

jamp

Journal of Applied Mathematics and Physics

2327-4379 2327-4352

Scientific Research Publishing

10.4236/jamp.2026.146117

jamp-152055

Article

Physics Mathematics

A Unified Gradient Temporal Difference Learning Algorithm for Off-Policy Learning

Zhao

Yafei

1 Yang

Long

1 Department of Mathematics, Zhejiang International Studies University, Hangzhou, China 2 Chinese Academy of Sciences, Beijing, China

The authors declare no conflicts of interest regarding the publication of this paper.

11 06 2026

06 2026

14 06 2384 2408 28 05 2026 21 06 2026 24 06 2026

2026

This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license ( https://creativecommons.org/licenses/by/4.0/ ).

https://doi.org/10.4236/jamp.2026.146117

In this paper, we propose a unification of gradient temporal difference (GTD) learning algorithm

GQ(

σ,λ

)

for off-policy learning. The proposed

GQ(

σ,λ

)

ranges from gradient

Tree Backup(

)

GQ(

)

when

ranges from 0 to 1. We investigate the structure of TD fixed-point of

GQ(

σ,λ

)

, and prove

GQ(

σ,λ

)

converges to its TD fixed-point with probability one. Furthermore, we prove that

GQ(

σ,λ

)

converges to an arbitrarily small neighborhood of the optimal solution with probability one. Empirical results show the

GQ(

σ,λ

)

with a value

σ∈(

0,1

)

that creates a mixture of

GQ(

)

and gradient

Tree Backup(

)

achieves a better performance than both the extreme end

σ=0

and

σ=1

Reinforcement Learning Off-Policy Learning

1. Introduction

In reinforcement learning (RL), unifying some disparate ideas not only providing a better understanding of existing algorithms but also creating better performing algorithms.

For example, TD ( λ ) [1] unifies one-step temporal difference learning (if λ = 0 ) and Monte Carlo method (if λ = 1 ) through the trace-decay parameter λ . Results show that the unified algorithm TD ( λ ) performs best at an intermediate value λ ∈ ( 0 , 1 ) rather than the extreme cases of λ = 0 and λ = 1 .

The work [2][3] propose a multi-step Q ( σ ) that unifies n -step Sarsa [4] (if σ = 1 , full-sampling) and n -step Tree-Backup [5] ( σ = 0 , pure-expectation), where the parameter σ ∈ [ 0 , 1 ] denotes the degree of the sampling. The work [2] have conducted experiments to show that for some value σ ∈ ( 0 , 1 ) , Q ( σ ) creates a mixture of full-sampling and pure-expectation approach, which performs better than the extreme case σ = 0 and σ = 1 . Later, the work [6][7] inherit the key idea of unification of TD ( λ ) and Q ( σ ) , they propose Q ( σ , λ ) unifies Sarsa ( λ ) [4] and Tree-Backup ( λ ) [5]. The previous works [6][7] show that Q ( σ , λ ) performs best at an intermediate value σ ∈ ( 0 , 1 ) .

It is noteworthy that the theoretical analysis of the work [2][6][7] only consider the tabular learning, which requires a very large table to store the estimated value function when the state space is huge. That implies the previous methods of [2][6][7] are considerably expensive for high-dimensional RL, which is the main focus of this work.

Our Main Works

A practical way to address the high-dimensional curse is using a parametric function to estimate the value function. In this paper, we focus on extending Q ( σ , λ ) with linear function approximation. Since the divergence of semi-gradient with multi-step bootstrapping for off-policy learning are well-documented in the existing literature (e.g., [3][8]), which could also happen in semi-gradient Q ( σ , λ ) . To propose a convergent gradient-based algorithm, we derive the GQ ( σ , λ ) algorithm via the mean square projected Bellman error (MSPBE) objective function [9], that inspired by weight-duplication trick (also known as “two-timescale stochastic approximation”) [9][10]. When σ ranges from 0 to 1, GQ ( σ , λ ) ranges from gradient Tree Backup ( λ ) ( TB ( λ ) ) to GQ ( λ ) [11], i.e., our GQ ( σ , λ ) unifies gradient Tree Backup ( λ ) and GQ ( λ ) .

Although GQ ( σ , λ ) | σ = 0 is a natural algorithm to extend TB ( λ ) with linear function approximation, to the best of our knowledge, the update rule of GQ ( σ , λ ) | σ = 0 has not been proposed in the existing literatures. It is worth to notice that Touati et al., [12] have proposed another version of gradient TB ( λ ) ( GTB ( λ ) ), which is different from the proposed GQ ( σ , λ ) | σ = 0 , we have clarified this point in Remark 3.

Then, we provide the convergence analysis of the proposed GQ ( σ , λ ) . Theorem 1 shows that GQ ( σ , λ ) converges to its TD fixed-point with probability one. Additionally, Theorem 1 illustrates the structure of such TD fixed-point: it is the global asymptotically stable equilibrium of its corresponding ordinary differential equation (ODE). For more discussion, see Remark 4. Furthermore, Theorem 2 shows that GQ ( σ , λ ) converges to an arbitrarily small neighborhood of the optimal solution with probability one.

Finally, our experiments show that when σ ranges from 0 to 1, GQ ( σ , λ ) achieves the best performance of off-policy evaluation or control within a certain σ ∈ ( 0 , 1 ) , neither σ = 0 , nor σ = 1 , which implies that with a certain value σ ∈ ( 0 , 1 ) , GQ ( σ , λ ) creates a mixture between GQ ( λ ) and gradient TB ( λ ) reaches a better performance than the extreme ends ( σ = 0 and σ = 1 ).

2. Preliminary

In this section, we briefly review the basics of reinforcement learning and off-policy evaluation.

2.1. Reinforcement Learning

Reinforcement learning (RL) [3] is often formalized as Markov decision processes (MDP) that considers a tuple ℳ = ( S , A , P , R , γ ) ; S is a set with finite states, A is a set with finite actions; P : S × A × S → [ 0 , 1 ] , p s s ′ a = P ( S t = s ′ | S t − 1 = s , A t − 1 = a ) is the probability of state transition from s to s ′ under playing the action a ; R ( ⋅ , ⋅ ) : S × A → ℝ 1 is the expected reward function; γ ∈ ( 0 , 1 ) .

A policy π is a probability distribution on S × A , and π ( a | s ) denotes the probability of playing a in state s . Let { S t , A t , R t + 1 } t ≥ 0 be generated by π , its state-action value function is: q π ( s , a ) = E π [ ∑ t = 0 ∞ γ t R t + 1 | S 0 = s , A 0 = a ] , where E π [ ⋅ | ⋅ ] is conditional expectation on the actions selected according to π . Let ℬ π : ℝ | S | → ℝ | S | denote Bellman operator with respect to policy π :

(1) ℬ π : q ↦ R π + γ P π q ,

where P π ∈ ℝ | S | × | S | , R π ∈ ℝ | S | × | A | , their corresponding elements are: [ P π ] s , s ′ = ∑ a ∈ A π ( a | s ) p s s ′ a , [ R π ] s , a = R ( s , a ) . It is well-known that q π is the unique fixed point of ℬ π , i.e., ℬ π q π = q π , which is known as Bellman equation.

2.2. Off-Policy Evaluation

Let us consider the trajectory τ = : { S t , A t , R t + 1 } t ≥ 0 generated by the behavior policy μ , where A t ~ μ ( ⋅ | S t ) , S t + 1 ~ P ( ⋅ | S t , A t ) . Off-policy evaluation is the task to estimate the value function of the target policy π via the data that is generated by the behavior policy μ , where μ ≠ π .

Assumption 1 (Ergodicity).The Markov chain induced by behavior policy μ is ergodic, i.e., there exists a stationary distribution ξ ( ⋅ , ⋅ ) over S × A : for ∀ ( S 0 , A 0 ) ,

(2) 1 n ∑ k = 1 n ℙ ( S k = s , A k = a | S 0 , A 0 ) → n → ∞ ξ ( s , a ) > 0.

The ergodicity of behavior policy μ is a standard assumption in off-policy learning [3], and it implies each-action pair is visited under this behavior policy μ . We use Ξ to denote a diagonal matrix whose diagonal element is ξ ( s , a ) , i.e., Ξ = diag { ⋯ , ξ ( s , a ) , ⋯ } .

2.3. Temporal Difference Learning and <italic>λ</italic> -Return

TD learning updates value function as follows, ∀ t ≥ 0 ,

(3) Q ( S t , A t ) ← Q ( S t , A t ) + α t δ t ,

where Q ( ⋅ , ⋅ ) is an estimator of q π , α t is step-size and δ t is TD error. Let Q t = : Q ( S t , A t ) , if

(4) δ t = δ t S = : R t + 1 + γ Q t + 1 − Q t ,

then update (3) is Sarsa [4]. If δ t is expected TD error:

(5) δ t ES = R t + 1 + γ ∑ a ∈ A π ( a | S t + 1 ) Q ( S t + 1 , a ) − Q t ,

then update (3) is Expected Sarsa [4].

The λ -return is an average contains all the n -step returns by weighting proportionally to λ n − 1 , where λ ∈ [ 0 , 1 ] . In this paper, we mainly consider two classic λ -return: Tree Backup ( λ ) ( TB ( λ ) ) and Expected Sarsa ( λ ) .

TreeBackup( λ ). For each pair ( S t , A t ) in the trajectory τ , TB ( λ ) [5] estimates q π ( S t , A t ) by

(6) G t λ = Q t + ∑ k = t ∞ ( γ λ ) k − t δ k ES ∏ i = t + 1 k π ( A i | S i ) ,

where δ k ES is expected TD error. Precup et al., [5] have proved the iteration (6) converges to q π with probability one under some certain conditions.

ExpectedSarsa( λ ). Sutton and Barto [3] have proposed a multi-step TD learning extends Expected Sarsa to λ -return version: for each t ≥ 0 ,

(7) G t λ = Q t + ∑ k = t ∞ ( γ λ ) k − t δ k ES ∏ i = t + 1 k π ( A i | S i ) μ ( A i | S i ) .

For the convenience, in the following paragraph, we consider the following notations,

ρ i = : π ( A i | S i ) μ ( A i | S i ) , ∏ i = t k ρ i = : ρ t : k , ρ t : t + 1 = 1.

2.4. A Unified View

In this section, we review an approach to unify TB ( λ ) and Expected Sarsa ( λ ) .

Q ( σ ) Algorithm. Recently, De Asis et al., [2] propose Q ( σ ) unifies multi-step Sarsa and multi-step TB ( 0 ) . Concretely, according to a mixed TD error δ t π , σ :

(8) δ t π , σ = σ δ t S + ( 1 − σ ) δ t ES ,

De Asis et al., [2] construct a multi-step estimator:

(9) G t σ = Q t + ∑ k = t ∞ γ k − t δ k π , σ ∏ i = t + 1 k [ ( 1 − σ ) π ( A i | S i ) + σ ] ,

where σ ∈ [ 0 , 1 ] is sampling parameter. When σ ranges from 0 to 1, the update (9) ranges from multi-step TB ( 0 ) to multi-step Sarsa. Experimental results show that a certain σ ∈ ( 0 , 1 ) results in a mixture of TB ( 0 ) and Sarsa performs better than both σ = 0 and σ = 1 [2], which implies unifying some seemingly disparate algorithmic ideas can create better performing algorithms.

Off-Policy Q ( σ , λ ) . Later, De Asis, [7] proposes a multi-step returns as follows,

(10) G t σ , λ = Q t + ∑ k = t ∞ δ k ES ∏ i = t + 1 k γ λ ( ( 1 − σ ) π ( A i | S i ) + σ ρ i ) = Q t + ∑ k = t ∞ δ k ES ∏ i = t + 1 k γ λ c i , σ ,

where

(11) c i , σ = ( 1 − σ ) π ( A i | S i ) + σ ρ i .

When σ ranges from 0 to 1, the estimator (10) ranges from TB ( λ ) (6) to Expected Sarsa ( λ ) (7).

We introduce a λ -operator ℬ σ , λ π , μ ( ⋅ ) : ℝ | S | × | A | → ℝ | S | × | A | that is a high level view of the λ -return (10),

(12) ℬ σ , λ π , μ ( ⋅ ) : q ↦ q + E μ [ ∑ k = 0 ∞ δ k ES ∏ i = 1 k γ λ c i , σ ] = q + σ ( I − λ γ P π ) − 1 ( ℬ π q − q ) + ( 1 − σ ) ( I − λ γ P π , μ ) − 1 ( ℬ π q − q ) ,

where ℬ π is Bellman operator (1), P π , μ ∈ ℝ | S | × | S | , and whose elements are: [ P π , μ ] s , s ′ = ∑ a ∈ A π ( a | s ) μ ( a | s ) p s s ′ a .

Remark 1.Yang et al. [6]propose another version of Q ( σ , λ ) algorithm that extends Q ( σ ) (9) with eligibility trace:

(13) G ˜ t σ , λ = Q t + ∑ k = t ∞ ( λ γ ) k − t δ k π , σ .

It is noteworthy that at one extreme end σ = 1 , both Q ( σ ) (9) and G ˜ t σ , λ (13) reduce to on-policy learning. Particularly, [6]prove that the performance of G ˜ t σ , λ for off-policy evaluation is determined by parameter σ :

(14) ‖ E μ [ G ˜ t σ , λ | ( S t , A t ) = ( s , a ) ] − q π ( s , a ) ‖ ∞ ≤ σ C ,

where C is a positive constant never reaches0 no matter how we choose the starting time t . The upper error bound of(14) illustrates the capacity of G ˜ t σ , λ for off-policy evaluation decays monotonously when σ ranges from0to1. At the extreme end σ = 1 , G ˜ t σ , λ achieves the worst performance of off-policy evaluation. We think this is a natural result since G ˜ t σ , λ | σ = 1 is an on-policy algorithm exactly. Thus, in this paper, we mainly concern (10) for off-policy learning.

2.5. Linear Function Approximation

TD learning (3) requires a very huge table to store the estimate value function Q ( ⋅ , ⋅ ) when | S | is very large, which implies tabular TD learning is considerably expensive for high-dimensional RL. We often use a parametric function Q θ ( ⋅ , ⋅ ) to approximate

(15) q π ( s , a ) ≈ ϕ ⊤ ( s , a ) θ = : Q θ ( s , a ) ,

where θ ∈ ℝ p is the parameter need to be learned, ϕ ( s , a ) = ( φ 1 ( s , a ) , φ 2 ( s , a ) , ⋯ , φ p ( s , a ) ) ⊤ , and each φ i : S × A → ℝ . Furthermore, Q θ can be rewritten as a version of matrix

(16) Q θ = Φ θ ≈ q π ,

where Φ is a matrix whose rows are the state-action feature vectors ϕ ⊤ ( s , a ) .

3. Divergence of <inline-formula> <mml:math display="inline"> <mml:mrow> <mml:mi>Q</mml:mi> <mml:mrow> <mml:mo>(</mml:mo> <mml:mrow> <mml:mi>σ</mml:mi> <mml:mo>,</mml:mo> <mml:mi>λ</mml:mi> </mml:mrow> <mml:mo>)</mml:mo> </mml:mrow> </mml:mrow> </mml:math> </inline-formula> with Semi-Gradient

In this section, firstly, we derive the semi-gradient Q ( σ , λ ) algorithm; then we briefly analyze the divergence of extending Q ( σ , λ ) (10) with semi-gradient method. In fact, the divergence of semi-gradient off-policy TD methods are well-documented in the literature (e.g., [8][12][13]), which are not specific to Q ( σ , λ ) .

3.1. Semi-Gradient <inline-formula> <mml:math display="inline"> <mml:mrow> <mml:mi>Q</mml:mi> <mml:mrow> <mml:mo>(</mml:mo> <mml:mrow> <mml:mi>σ</mml:mi> <mml:mo>,</mml:mo> <mml:mi>λ</mml:mi> </mml:mrow> <mml:mo>)</mml:mo> </mml:mrow> </mml:mrow> </mml:math> </inline-formula>

Recall τ = { ( S t , A t , R t + 1 ) } t ≥ 0 is generated by behavior policy μ , let ϕ t = ϕ ( S t , A t ) , we define semi-gradient Q ( σ , λ ) as follows:

(17) θ t + 1 = θ t − α t ∇ θ ( G t λ ( θ ) − θ ⊤ ϕ t ) 2 | θ = θ t = θ t − α t ( G t λ ( θ t ) − θ t ⊤ ϕ t ) ∇ θ ( − Q θ ( S t , A t ) ) | θ = θ t = θ t + α t ( ∑ k = t ∞ δ k ES ( θ t ) ∏ i = t + 1 k γ λ c i , σ ) ϕ t ,

where

G t λ ( θ t ) = θ t ⊤ ϕ t + ∑ k = t ∞ δ k ES ( θ t ) ∏ i = t + 1 k γ λ c i , σ

is an off-line estimator of value function according to Equation (10), TD error δ k ES ( θ t ) = R k + 1 + γ E π [ θ t ⊤ ϕ ( S k + 1 , ⋅ ) ] − θ t ⊤ ϕ k , and α t is step-size. Let

(18) A t = ϕ t ∑ k = t ∞ ( γ E π [ ϕ ( S k + 1 , ⋅ ) ] − ϕ k ) ⊤ ∏ i = t + 1 k γ λ c i , σ ,

(19) b t = ϕ t ∑ k = t ∞ R k + 1 ∏ i = t + 1 k γ λ c i , σ .

Then update (17) can be rewritten as follows,

(20) θ t + 1 = θ t + α t ( A t θ t + b t ) .

Furthermore, we have

(21) A σ = : E μ [ A t ] = Φ ⊤ Ξ P σ π , μ ( γ P π − I ) Φ ,

(22) b σ = : E μ [ b t ] = Φ ⊤ Ξ P σ π , μ R ,

P σ π , μ = σ ( I − λ γ P π ) − 1 + ( 1 − σ ) ( I − λ γ P π , μ ) − 1 .

3.2. Divergence Analysis

Now, we only briefly discuss the divergence lies in the iteration (17). Under the conditions of Proposition 4.8 presented by Bertsekas and Tsitsiklis [14], { θ t } t ≥ 0 (17) converges to a certain point if and only if A σ is a negative matrix. Furthermore, if iteration (17) converges, then it converges to its unique TD fixed point θ ⋆ that satisfes

(23) A σ θ ⋆ + b σ = 0.

Unfortunately, since the steady state-action distribution doesn’t match the transition probability during off-policy learning, we can not guarantee the negative definiteness of A σ , thus { θ t } t ≥ 0 may diverge. To clarify this point, we use the classic counterexample [12] to show the divergence of semi-gradient TD algorithms (17) for off-policy learning.

Figure 1

Figure 1. Counterexample from [12]: Two-State MDP.

For the MDP in Figure 1, the behavior policy μ ( right | ⋅ ) = 0.5 , and target policy π ( right | ⋅ ) = 1 . We assign the features { ( 1 , 0 ) ⊤ , ( 2 , 0 ) ⊤ , ( 0 , 1 ) ⊤ , ( 0 , 2 ) ⊤ } to the state-action pairs { ( s 1 , right ) , ( s 2 , right ) , ( s 1 , left ) , ( s 2 , left ) } , From the dynamic transition shown in Figure 1, we have

P π = ( 0 1 0 0 0 1 0 0 1 0 0 0 1 0 0 0 ) , Φ = ( 1 0 2 0 0 1 0 2 ) , Ξ = 1 2 I 4 × 4 .

Then, according to (21), we have

A σ = Φ ⊤ Ξ P σ π , μ ( γ P π − I ) Φ = ( 6 ( 2 − σ ) γ − γ λ − 5 ( 2 − σ ) 2 ( 1 − γ λ ) 0 3 γ 2 − 5 ( 1 − σ 2 ) ) ,

and the eigenvalues of A are: 6 ( 2 − σ ) γ − γ λ − 5 ( 2 − σ ) 2 ( 1 − γ λ ) and − 5 ( 1 − σ 2 ) . For any initial θ 0 = ( θ 0 , 1 , θ 0 , 2 ) ⊤ , let E [ θ t + 1 ] = : ( θ t + 1 , 1 , θ t + 1 , 2 ) ⊤ , according to (20), the first component of the term E [ θ t + 1 | θ t ] is:

(24) θ t + 1 , 1 = θ 0 , 1 ∏ i = 0 t ( 1 + α i 6 ( 2 − σ ) γ − γ λ − 5 ( 2 − σ ) 2 ( 1 − γ λ ) ) .

For any λ ∈ ( 0 , 1 ) , if γ ∈ ( 10 − 5 σ 12 − 6 σ − λ , 1 ) , then 6 ( 2 − σ ) γ − γ λ − 5 ( 2 − σ ) 2 ( 1 − γ λ ) is a positive scalar, which implies A σ can not be a negative matrix. Furthermore, if step size α t : ∑ i ≥ 0 α t = ∞ , we have

(25) | θ t + 1 , 1 | = | θ 0 , 1 | ∏ i = 0 t ( 6 ( 2 − σ ) γ − γ λ − 5 ( 2 − σ ) 2 ( 1 − γ λ ) ) → + ∞ ,

Equation (25) is a direct result of the following conclusion that could be found in any calculus textbook. Let p i = 1 + a i , where a i > 0 , if ∑ i = 1 ∞ a i = + ∞ , then ∏ i = 1 ∞ p i = ∏ i = 1 ∞ ( 1 + a i ) = + ∞ , which implies the way (17) to extend Q ( σ ) with linear function approximation via off-line estimate is unstable for off-policy learning.

4. Gradient <inline-formula> <mml:math display="inline"> <mml:mrow> <mml:mi>Q</mml:mi> <mml:mrow> <mml:mo>(</mml:mo> <mml:mrow> <mml:mi>σ</mml:mi> <mml:mo>,</mml:mo> <mml:mi>λ</mml:mi> </mml:mrow> <mml:mo>)</mml:mo> </mml:mrow> </mml:mrow> </mml:math> </inline-formula>

In this section, we derive the gradient Q ( σ , λ ) ( GQ ( σ , λ ) ) algorithm. The proposed GQ ( σ , λ ) unifies GQ ( λ ) [11] if σ = 1 . For more discussion, see Remark 2. At another extreme end σ = 0 , the proposed GQ ( σ , λ ) | σ = 0 can be seen as a new way to extend TB ( λ ) (6) with linear function approximation. Although GQ ( σ , λ ) | σ = 0 is a natural algorithm extends TB ( λ ) (6) with linear function approximation, to the best of our knowledge, the update rule of GQ ( σ , λ ) | σ = 0 has not been proposed in the existing literature. For more discussion, see Remark 3.

We derive the gradient the GQ ( σ , λ ) algorithm via mean square projected Bellman error (MSPBE) [9] objective function as follows,

(26) J ( θ ) = 1 2 ‖ Φ θ − Π ℬ σ , λ π , μ ( Φ θ ) ‖ Ξ 2 ,

where

Π = Φ ( Φ ⊤ Ξ Φ ) − 1 Φ ⊤ Ξ

is an | S | × | S | projection matrix, ‖ ⋅ ‖ Ξ 2 is the weighted Euclidean norm: ‖ x ‖ Ξ 2 = x ⊤ Ξ x . Furthermore, we can rewrite min θ J ( θ ) as follows,

(27) min θ J ( θ ) = min θ 1 2 ‖ A σ θ + b σ ‖ M − 1 2 ,

M = E μ [ ϕ t ϕ t ⊤ ] = Φ ⊤ Ξ Φ

The gradient method is a natural approach to solve problem (27), however, it is worth to notice that the challenges are two-fold: (I) Firstly, since the invertible matrix M − 1 is involved in ∇ J ( θ ) , so it is too expensive to apply stochastic gradient to solve the problem (27) directly. (II) Since ∇ J ( θ ) = A σ ⊤ M − 1 ( A σ θ + b σ ) involves the product of expectations, then the unbiased estimate of ∇ J ( θ ) cannot be obtained via a single sample. It needs to sample twice, so it is a double-sampling problem, which is the second bottleneck of applying gradient to solve the problem (27). Additionally, M − 1 = E [ ϕ t ϕ t ⊤ ] − 1 cannot also be estimated via a single sample. The above analysis pushes us to find a new practical way to solve the problem (27).

Let e − 1 , σ = 0 , e t , σ = λ γ c t , σ e t − 1 + ϕ t , ϕ ¯ t + 1 = E π [ ϕ ( S t + 1 , ⋅ ) ] , then we have the following equation:

(28) − 1 2 ∇ J ( θ t ) = − E μ [ ( γ ϕ ¯ t + 1 − ϕ t ) e t , σ ⊤ ] ϖ

(29) − 1 2 ∇ J ( θ t ) = E μ [ δ t ES e t , σ ] − E [ γ ( 1 − λ ) ϕ ¯ t + 1 e t , σ ⊤ ] ϖ ,

where

ϖ = E μ [ ϕ t ϕ t ⊤ ] − 1 E [ δ t ES e t , σ ] , δ t ES = R t + 1 + γ θ t ⊤ ϕ ¯ t + 1 − θ t ⊤ ϕ t .

For the limitation of space, we provide the derivation of (28)-(29) in Appendix A. We use the sign convention that the mean update of θ follows the negative gradient − ∇ J ( θ t ) . The term (31) provides its unbiased stochastic approximation, while the auxiliary recursion (30) tracks ϖ = M − 1 E μ [ δ t ES e t , σ ] , circumventing the double-sampling issue.

Let’s consider the term ϖ = E μ [ ϕ t ϕ t ⊤ ] − 1 E [ δ t ES e t , σ ] that is a solution of a least-squares problem, and a typical least mean square (LMS) update rule to find the vector ϖ is:

(30) ω t + 1 = ω t + β t ( δ t ES e t , σ − ϕ t ω t ⊤ ϕ t ) ,

where β t is step-size. Then, by directly sampling from (29) with (30), we define the update rule of θ as follows,

(31) θ t + 1 = θ t + α t ( δ t ES e t , σ − γ ( 1 − λ ) ϕ ¯ t + 1 e t , σ ⊤ ω t ) ,

where α t is step-size. We provide the details in Algorithm 1.

Remark 2 (Case of σ = 1 ).When σ = 1 , we regard GQ ( σ , λ ) | σ = 1 as an approach to extend Expected Sarsa ( λ ) (7) with linear function approximation. A more interesting result is that the proposed GQ ( σ , λ ) | σ = 1 is reduced to GQ ( λ ) [12]exactly. Now, we provide two fresh interpretations to the proposed GQ ( λ ) :

GQ ( λ ) is at one extreme end ( σ = 1 )of the proposed GQ ( σ , λ ) , i.e., the proposed GQ ( σ , λ ) contains GQ ( λ ) .

Figure 2

Furthermore, just because GQ ( λ ) is same as GQ ( σ , λ ) | σ = 1 , GQ ( λ ) can be seen as an extension of the tabular Expected Sarsa ( λ ) (7) with linear function approximation,while the original motivation ofMaeiand Sutton [11]to propose GQ ( λ ) is to introduce eligibility trace to gradient temporal difference learning for off-policy evaluation. Thus, the proposed GQ ( σ , λ ) | σ = 1 provides a fresh understanding for the GQ ( λ ) [11].

Computational complexity. The computational complexity of Algorithm 1 is O ( | A | p ) per step in time, and O ( p ) in memory, where p is the feature dimension and | A | is the number of actions. This maintains the same asymptotic efficiency as the baseline GQ ( λ ) and gradient TB ( λ ) methods.

Remark 3 (Case of σ = 0 ).When σ = 0 , the algorithm GQ ( σ , λ ) | σ = 0 is reduced to a version of extending TB ( λ ) (6) with linear function approximation. Although GQ ( σ , λ ) | σ = 0 is a natural algorithm extends TB ( λ ) (6) with linear function approximation, to the best of our knowledge, the update rule of GQ ( σ , λ ) | σ = 0 has not been proposed in the existing literatures. It is worth to notice that [12]have also proposed another version of gradient TB ( λ ) ( GTB ( λ ) ), while the difference between the proposed GQ ( σ , λ ) | σ = 0 and GTB ( λ ) [12]is reflected at least two aspects:

Firstly, the proposed gradient Q ( σ , λ ) | σ = 0 and GTB ( λ ) [12]share the same update rule of e t and ω t , but instead of (31), GTB ( λ ) updates the parameter θ as follows,

(32) θ t + 1 = θ t + α t ( − γ ϕ ¯ t + 1 + ϕ t ) e t ⊤ ω t ,

where e t = e t , σ | σ = 0 = λ γ π ( A t | S t ) e t − 1 .

Secondly, Touati et al., [12]derive the update rule (32) of their GTB ( λ ) via the convex-concave saddle-point framework [15], while our GQ ( σ , λ ) is based on the weight-duplication trick¹of (29)-(31).

5. Convergence Analysis

In this section, we prove the convergence of the proposed GQ ( σ , λ ) . Theorem 1 shows that GQ ( σ , λ ) converges to its TD fixed-point (23) with probability one. Besides, Theorem 1 illustrates the structure of this TD fixed-point: it is the global asymptotically stable equilibrium of its corresponding ordinary differential equation (ODE). For more discussion, we provide in Remark 4. Theorem 2 shows that GQ ( σ , λ ) converges to an arbitrarily small neighborhood of the optimal solution with probability one. While we rely on the standard two-timescale ODE method for the general convergence mechanism, our novel theoretical step is verifying that the proposed σ -dependent trace coefficient c t , σ yields a stable equilibrium structure for the ODEs.

5.1. Additional Assumptions

We need some additional assumptions to present the convergent of GQ ( σ , λ ) , those assumptions are widely used in reinforcement learning [13][17]-[19].

Assumption 2 (Diminishing Step-size).The positive sequences { α t } t ≥ 0 , { β t } t ≥ 0 satisfy the following conditions with probability one,

∑ t = 0 ∞ α t = ∑ t = 0 ∞ β t = ∞ , ∑ t = 0 ∞ α t 2 < ∞ , ∑ t = 0 ∞ β t 2 < ∞ .

Assumption 3.The features { ϕ t } t ≥ 0 is uniformly bounded by ϕ max . The reward function is uniformly bounded by R max . The importance sampling ρ t = π ( A t | S t ) μ ( A t | S t ) is uniformly bounded by ρ max .

Assumption 4 (Solvability of Problem).The matrix A σ is non-singular and rank ( Φ ) = p .

Assumption 4 requires the non-singular matrix A σ , which implies the optimal parameter θ ⋆ = − A − 1 b is well defined. The feature matrix Φ has linearly independent columns implies the matrix M = Φ ⊤ Ξ Φ is positive defined.

5.2. Main Results and Discussion

Theorem 1 (Convergence ofAlgorithm 1).Under Assumption 1-4, we consider the iteration { ( θ t , ω t ) } t ≥ 0 that is generated according to (30)-(31). The step-size α t , β t satisfy Assumption2 and η t = β t α t → 0 , as t → ∞ . We define two functions G ( θ ) , H ( ω , θ ) as follows,

(33) G ( θ ) = − A σ ⊤ M − 1 ( A σ θ + b σ ) ,

(34) H ( ω , θ ) = A σ θ + b σ − M ω .

Then ( θ t , ω t ) → w . p .1 ( θ ⋆ , ω ⋆ ) , as t → ∞ , where ( θ ⋆ , ω ⋆ ) is the unique global asymptotically stable equilibrium with respect to the following ODE correspondingly:

(35) { θ ˙ ( t ) = : d d t θ ( t ) = G ( θ ( t ) ) ω ˙ ( t ) = : d d t ω ( t ) = H ( ω ( t ) , θ ( t ) ) ,

where θ ( t ) , ω ( t ) ∈ ℝ p are the functions are defined on continuous time ( 0 , ∞ ) .

Proof. We provide its proof in Appendix B. □

Remark 4 (TD-Fixed Point of GQ ( σ , λ ) ).Theorem1 illustrates that the sequence { ( θ t , ω t ) } t ≥ 0 generated by GQ ( σ , λ ) converges to the global asymptotically stable equilibrium of its corresponding ODE (35). Furthermore, from the details of the proof inAppendix B, we know since A σ is invertible and M − 1 is positive definite, then A σ ⊤ M − 1 A σ is also positive defined. Because the ODE uses the negative MSPBE-gradient direction, the Jacobian at the equilibrium is − A σ ⊤ M − 1 A σ , whose eigenvalues have negative real parts.Sothe following ODE

(36) θ ˙ ( t ) = − A σ ⊤ M − 1 ( A σ θ ( t ) + b σ ) = G ( θ ( t ) ) ,

has a unique global asymptotically stable equilibrium θ ⋆ satisfies the equation

(37) A σ θ ⋆ + b σ = 0 ,

which implies the global asymptotically stable equilibrium θ ⋆ of the ODE θ ˙ ( t ) = G ( θ ( t ) ) is also the TD fixed point of (23). That means θ t converges to its TD fixed point:

θ t → θ ⋆ = − A σ − 1 b σ , w . p .1.

Remark 5 (Unification of TD-Fixed Point).If σ = 0 , the matrix A σ is reduced to

A σ | σ = 0 = Φ ⊤ Ξ ( I − λ γ P π , μ ) − 1 ( γ P π − I ) Φ ,

which implies GQ ( σ , λ ) | σ = 0 converges to the TD-fixed point of GTB ( λ ) [12], i.e., GQ ( σ , λ ) | σ = 0 shares the same TD-fixed point of GTB ( λ ) [12] as follows

θ ⋆ = − A σ − 1 b σ | σ = 0 .

If σ = 1 , then the key matrix A σ is reduced to

A σ | σ = 1 = Φ ⊤ Ξ ( I − λ γ P π ) − 1 ( γ P π − I ) Φ ,

which illustrates the TD-fixed point of GQ ( σ , λ ) | σ = 1 (i.e., GQ ( λ ) ) algorithm as follows

θ ⋆ = − A σ − 1 b σ | σ = 1 .

Theorem 1 presents an asymptotic result, which holds only in the limit as the number of iterations increases to infinity. Now, we present a result shows the distance between θ t and the optimal solution θ ⋆ convergence to 0 in probability. For any T > 0 , and t ≥ 0 , we introduce a notation

κ ( t ; T ) = min { k ≥ t | ∑ i = t k + 1 α i > T }

to denote the last iteration before the sum of step-size α i between it and the t -th iteration exceeds T . Since we consider the Assumption 2, the notation κ ( t ; T ) is well-defined.

Theorem 2.Under Assumption 1-4, we consider the iteration { ( θ t , ω t ) } t ≥ 0 generated by (30)-(31). The step-size α t , β t satisfy Assumption2. Moreover, for the integers m t → ∞ , m ′ t → ∞ , as t → ∞ ,

lim t → ∞ sup 0 < j ≤ m t | α t + j α t − 1 | = 0 , lim t → ∞ sup 0 < j ≤ m ′ t | β t + j β t − 1 | = 0.

Then there exists a sequence of positive numbers T t → ∞ (as t → ∞ ) such that for any ϵ > 0 ,

(38) lim t → ∞ ℙ [ θ i ∉ N ϵ ( θ ⋆ ) , ∃ i ∈ [ t , κ ( t , T t ) ] ] = 0 ,

where we use N ϵ ( θ ⋆ ) = { θ : ‖ θ − θ ⋆ ‖ 2 ≤ ϵ } to denote the ϵ -neighborhood of θ ⋆ .

Proof. We provide its proof in Appendix C. □

Remark 6Theorem2 shows that as t → ∞ , the sequence { θ i } i = t κ ( t , T t ) collects to an arbitrarily small neighborhood of the optimal solution θ ⋆ with probability one, i.e.,

lim t → ∞ ℙ ( ∩ i = t κ ( t , T t ) { θ i ∈ N ϵ ( θ ⋆ ) } ) = 1.

In fact, the results (38) implies

lim t → ∞ ℙ ( θ i ∉ N ( θ ⋆ , ϵ ) , ∃ i ∈ [ t , κ ( t , T t ) ] ) = 0 ,

then we know

1 − lim t → ∞ ℙ ( θ i ∉ N ( θ ⋆ , ϵ ) , ∃ i ∈ [ t , κ ( t , T t ) ] ) = lim t → ∞ ℙ ( θ i ∉ N ( θ ⋆ , ϵ ) , ∃ i ∈ [ t , κ ( t , T t ) ] ¯ ) = lim t → ∞ ℙ ( ∪ i = t κ ( t , T t ) { θ i ∉ N ( θ ⋆ , ϵ ) } ¯ ) = lim t → ∞ ℙ ( ∩ i = t κ ( t , T t ) { θ i ∈ N ( θ ⋆ , ϵ ) } ) .

That is lim t → ∞ ℙ ( ∩ i = t κ ( t , T t ) { θ i ∈ N ( θ ⋆ , ϵ ) } ) = 1 .

6. Experiments

In this section, we test the capacity of GQ ( σ , λ ) for two typical tasks: off-policy evaluation and control.

For off-policy evaluation, we compare GQ ( σ , λ ) with four state-of-art algorithms: GQ ( λ ) [11], ABQ ( ζ ) [20], GTB ( λ ) , GRetrace ( λ ) [12] over two typical measurements: MSPBE and mean square error (MSE). It is worth noting that if σ = 0 , GQ ( σ , λ ) is reduced to a new version of GTB ( λ ) (we denote it as GTB ( λ ) -v0), thus, we also show its comparison to GTB ( λ ) [12] (we denote it as GTB ( λ ) -v1).For the control task, our goal is to show the the trade-off between σ = 0 and σ = 1 . Empirical results show the GQ ( σ , λ ) with a value σ ∈ ( 0 , 1 ) that creates a mixture of GQ ( λ ) and gradient Tree Backup ( λ ) achieves a better performance than both the extreme end σ = 0 and σ = 1 .

These domains are standard diagnostic benchmarks intended to isolate the off-policy instability under linear function approximation. Scaling to very high-dimensional environments is left as future work.

6.1. Off-Policy Evaluation Experiments

We present the domains in the experiments as follows.

(1) Two State MDP [12]. This MDP is shown in Figure 1, the behavior policy μ ( right | ⋅ ) = 0.5 , and target policy π ( right | ⋅ ) = 1 . We assign the features { ( 1 , 0 ) ⊤ , ( 2 , 0 ) ⊤ , ( 0 , 1 ) ⊤ , ( 0 , 2 ) ⊤ } to the state-action pairs { ( s 1 , right ) , ( s 2 , right ) , ( s 1 , left ) , ( s 2 , left ) } , i.e.,

Φ = ( 1 2 0 0 0 0 1 2 ) ⊤ .

(2) Baird Star [21]. The Baird Star is an episodic seven states MDP with two actions: dashed action and solid action. In this example, the behavior policy μ ( ⋅ | dashed ) = 6 7 , μ ( ⋅ | solid ) = 1 7 and target policy π ( ⋅ | solid ) = 1 . We choose the feature map matrix as follows

Φ = ( 2 I 7 × 7 1 7 × 1 0 7 × 8 0 7 × 8 2 I 7 × 7 1 7 × 1 ) ,

where I denotes the identity matrix, 0 denotes a matrix whose elements are all 0, and 1 7 × 1 denotes a vector whose elements are all 1. We used θ 0 = ( 1 , 1 , 1 , 1 , 1 , 1 , 10 , 1 ) T as initial parameter vector for the methods that allow specifying a start estimate, TD-learning is known to diverge for this initialization of the parameter-vector [3][22].

(3) CliffWalking. This is a standard undiscounted, episodic task, with start and goal states, and the usual actions causing movement up, down, right, and left.

(4) Grid World. This environment from ([3], Chapter 4), where the agent on an 4 × 4 grid and your goal is to reach the terminal state at the top left or the bottom right corner.

(5) WindyGridworld. This environment from ([3], Chapter 6). Windy Gridworld problem for reinforcement learning. Actions include going up, down, right, and left. In each column the wind pushes you up a specific number of steps (for the next action). If an action would take you off the grid, you remain in the previous state. For each step you get a reward of −1, until the agent reach into a terminal state.

We summarize the domains, feature settings, target policy and behavior policy below.

Two Measurements for Off-Policy Evaluation. In this section, we use empirical RMSPBE = 1 2 ‖ b ^ + A ^ θ ‖ M ^ − 1 2 to evaluate the performance, where we evaluate A ^ , b ^ , and M ^ according to their unbiased estimators. Additionally, we also compare the performance over a common measurement empirical MSE: MSE = ‖ Φ θ − q π ‖ Ξ 2 , where q π is estimated by simulating the target policy π and averaging the discounted cumulative rewards over trajectories.

Hyper-parameter Setting. We run the hyper parameter σ as follows: σ ranges from 0 to 1 with step of 0.02, i.e.,

Σ = { 0 , 0.02 , 0.04 , ⋯ , 0.98 , 1.0 } .

It collects 51 results w.r.t. GQ ( σ , λ ) . We set λ = 0.99 , γ = 0.99 , and run the step-size α t = { 10 − 2 , 2 × 10 − 2 , 10 − 3 , 2 × 10 − 3 } , η t = β t / α t = { 2 0 , 2 − 1 , ⋯ , 2 − 10 } .

Results Report. All the results shown inFigure 2 and Figure 3 are the average of 5 runs, where we choose the best σ among the 51 results w.r.t. the space Σ The results the proposed GQ ( σ , λ ) outperforms the the baseline algorithms. The results of Figure 2 and Figure 3 also show GQ ( σ , λ ) with an intermediate σ (between 0 and 1) has a better performance than the extreme case ( σ = 0 and 1). This experiment further validates that unifying some existing algorithms can create a better algorithm.

Figure 3

Figure 2.RMSPBE comparison with different baseline algorithms.

Figure 4

Figure 3.MSE comparison with with different baseline algorithms.

6.2. Control Domain: Off-Policy Evaluation

In this section, we test the off-policy evaluation behavior of GQ ( σ , λ ) algorithm on mountain car domain, where the agent considers the task of driving an underpowered car up a steep mountain road. The agent receives a reward of -1 at every step until it reaches the goal region at the top of the hill. Since the state space of this domain is continuous, we use the open tile coding software² to extract feature of states. Recall the states and actions of MountainCar:

S = { ( Velocity , Position ) } = [ − 0.07 , 0.07 ] × [ − 1.2 , 0.6 ] ,

A = { left , neutral , right } .

In this experiment, if Velocity > 0 , we use behavior policy

μ = ( 1 100 , 1 100 , 98 100 ) , π = ( 1 10 , 1 10 , 8 10 ) ;

else

μ = ( 98 100 , 1 100 , 1 100 ) , π = ( 8 10 , 1 10 , 1 10 ) .

Note that the target policy π is fixed throughout. Thus, this experiment acts as an off-policy evaluation within the Mountain Car domain. In this experiment, we set the number of tilings to be 4, and there are no white noise features. As suggested by Sutton and Barto [3], we set all the initial parameters to be 0, which is optimistic about causing extensive exploration.

As before, we run the hyper parameter σ as follows: σ ranges from 0 to 1 with step of 0.02. It collects 51 results w.r.t. GQ ( σ , λ ) . We also set λ = 0.99 , γ = 0.99 , and run the step-size α t = { 10 − 2 , 2 × 10 − 2 , 10 − 3 , 2 × 10 − 3 } , η t = β t / α t = { 2 0 , 2 − 1 , ⋯ , 2 − 10 } , the dimension of feature p = { 512 , 1024 , 2048 } .

Overall Presentation. We give more comprehensive results of the trade-off between σ = 0 and σ = 1 . We statistic of the number of σ happens for the following three case:

(I) GQ ( σ , λ ) performs better than both σ = 0 and σ = 1 . (II) GQ ( σ , λ ) performs better than σ = 0 or σ = 1 . (III) GQ ( σ , λ ) performs worse than σ = 0 and σ = 1 .

The setting of σ is the same as the previous section, and the total number of σ reaches 51.

Table 1.Returns under various parameters.

Table 1

Case

σ = 0

σ = 1

σ = 0.1

σ = 0.2

σ = 0.3

σ = 0.4

σ = 0.5

σ = 0.6

σ = 0.7

σ = 0.8

σ = 0.9

α = 0.001

p = 512

−122.0

−123.6

− 123.5

− 119.8

− 117.4

− 120.8

− 119.1

−120.5

− 119.2

− 119.3

− 120.1

p = 1024

−119.9

−121.9

−127.8

−120.2

−122.4

−122.0

− 118.4

−121.6

−124.0

−120.1

p = 2048

−122.3

−121.4

−122.6

−121.9

− 120.1

−122.3

−122.6

− 119.6

− 120.9

−121.4

−122.5

α = 0.002

p = 512

−126.9

−124.2

− 124.2

−127.5

−125.3

−125.2

− 124.0

− 121.6

−125.4

−125.8

− 120.2

p = 1024

−124.1

−123.1

−122.4

−126.3

−122.3

−123.0

−121.4

−126.2

− 121.3

−123.6

−126.0

p = 2048

−124.8

−124.0

−126.4

−124.6

− 122.7

− 123.9

−125.1

− 122.5

− 123.6

−126.4

− 122.7

Table 2.Percentage under various parameters.

Table 2

Case		p = 512	p = 1024	p = 2048
α = 0.001	I	69.8%	20.4%	55.1%
	II	14.1%	36.7%	20.4%
	III	16.1%	42.9%	24.5%
α = 0.002	I	53.1%	6.1%	49.0%
	II	38.8%	16.3%	18.4%
	III	8.1%	77.6%	32.6%

Results Report. The results shown in Figure 2 and Table 1 are average of 5 runs, and each run contains 400 episodes. The result in Figure 2 shows that GQ ( σ , λ ) with an intermediate σ (between 0 and 1) has a better performance than the extreme case ( σ = 0 and 1). This experiment further validates that unifying some existing algorithms can create a better algorithm for reinforcement. Table 1 shows the returns mountaincar reaches the goal region at the top of the hill. As shown in Table 2, those results also show GQ ( σ , λ ) achieves the best performance at a σ ∈ ( 0 , 1 ) , which implies the trade-off between σ = 0 and σ = 1 . That is to see the GQ ( σ , λ ) with a value σ ∈ ( 0 , 1 ) that creates a mixture of GQ ( λ ) and gradient Tree Backup ( λ ) achieves a better performance than both the extreme end σ = 0 and σ = 1 .

7. Conclusion

In this paper, we extend tabular Q ( σ , λ ) with function approximation, and propose GQ ( σ , λ ) . We analyze the convergence of GQ ( σ , λ ) . Our theory analysis shows that GQ ( σ , λ ) converges to its TD fixed-point with probability one. Then, we show GQ ( σ , λ ) converges to the optimal solution of the minimizing MSPBE problem. Finally, we conduct experiments on some standard domains to confirm the effectiveness of the proposed GQ ( σ , λ ) . Result show that the best performance of GQ ( σ , λ ) achieved with a σ ∈ ( 0 , 1 ) , neither σ = 0 , nor σ = 1 . Extending this framework to non-linear function approximation is an important future direction. Such an extension faces theoretical hurdles, such as the parameter-dependence of the Jacobian and the loss of a fixed linear least-squares structure, which may require techniques like target networks or compatible gradients to stabilize.

Funding

The project was partially supported by Scientific Research Fund of Zhejiang Provincial Education Department under Grant No. Y202456228.

Appendix A. Derivation of (28)-(29)

Proof. Let us calculate MSPBE ( θ , λ ) directly,

(39) − 1 2 J ( θ t ) = − 1 2 ∇ θ MSPBE ( θ , λ ) | θ = θ t = − 1 2 ∇ θ ( E [ δ t e t , σ ] ⊤ E [ ϕ t ϕ t ⊤ ] − 1 E [ δ t e t , σ ] ) = − ( ∇ θ E [ δ t e t , σ ] ⊤ ) E [ ϕ t ϕ t ⊤ ] − 1 E [ δ t e t , σ ] = − E [ ( γ E π ϕ ( S t + 1 , ⋅ ) − ϕ t ) e t , σ ⊤ ] E [ ϕ t ϕ t ⊤ ] − 1 E [ δ t e t , σ ] = − E [ γ E π ϕ ( S t + 1 , ⋅ ) e t , σ ⊤ − ϕ t e t , σ ⊤ ] E [ ϕ t ϕ t ⊤ ] − 1 E [ δ t e t , σ ] = E [ ϕ t ϕ t ⊤ + ϕ t + 1 γ λ e t , σ ⊤ − γ E π ϕ ( S t + 1 , ⋅ ) e t , σ ⊤ ] E [ ϕ t ϕ t ⊤ ] − 1 E [ δ t e t , σ ] ︸ = : ϖ = E [ δ t ES e t , σ ] − E [ γ ( 1 − λ ) ϕ ¯ t + 1 e t , σ ⊤ ] ϖ .

□

B. Proof of Theorem 1

The ODE method (see Lemma 1) is our is our main tool to prove Theorem 1. We refer the reader to that reference for further technical details.

Lemma 1 ([25]).For the stochastic recursion of x n , y n given by

(40) x n + 1 = x n + a n [ g ( x n , y n ) + M n + 1 ( 1 ) ] ,

(41) y n + 1 = y n + b n [ h ( x n , y n ) + M n + 1 ( 2 ) ] , n ∈ ℕ

if the following assumptions are satisfied:

(A1) Step-sizes { a n } , { b n } are positive, satisfying

∑ n a n = ∑ n b n = ∞ , ∑ n a n 2 + b n 2 < ∞ , b n a n → 0 as n → ∞ .

(A2) The map g : ℝ d + k → ℝ d , h : ℝ d + k → ℝ k are Lipschitz. (A3) The sequence { M n + 1 ( 1 ) } n ∈ ℕ , { M n + 1 ( 2 ) } n ∈ ℕ are martingale difference sequencesw.r.t.the increasing σ -fields ℱ n = def σ ( x m , y m , M m ( 1 ) , M m ( 2 ) , m ≤ n ) , n ∈ ℕ , satisfying

E [ M n + 1 ( i ) | ℱ n ] = 0 , i = 1 , 2 , n ∈ ℕ .

Furthermore, { M n + 1 ( i ) } n ∈ ℕ , i = 1 , 2 , are square-integrable with

E [ ‖ M n + 1 ( i ) ‖ 2 | ℱ n ] ≤ K ( 1 + ‖ x n ‖ 2 + ‖ y n ‖ 2 ) ,

for some constant K > 0 .

(A4) For each x ∈ ℝ d , theo.d.e.

y ˙ ( t ) = h ( x , y ( t ) )

has a global asymptotically stable equilibrium Ω ( x ) such that: Ω ( x ) : ℝ d → ℝ k is Lipschitz.

(A5) Theo.d.e.

x ˙ ( t ) = g ( x ( t ) , Ω ( x ( t ) ) )

has a global asymptotically stable equilibrium x * .

Then, the iterates (40), (41) converge to ( x * , Ω ( x * ) ) a.s.on the set Q = def { sup n x n < ∞ , sup n y n < ∞ } .

Proof. Now, we apply Lemma 1 to prove results.

Step 1: On Convergence ofω_t.

We consider the following ODE:

θ ˙ ( t ) = 0 , ω ˙ ( t ) = E [ δ t ES e t | θ ( t ) ] − M ω ( t ) .

The equation θ ˙ ( t ) = 0 implies there exists a constant vector θ such that: θ ( t ) = θ , thus we can rewrite the above ODE associated ω ( t ) as follows,

(42) ω ˙ ( t ) = E [ δ t ES e t | θ ] − M ω ( t ) = A σ θ + b σ − M ω ( t ) .

Then, for any given θ ,

ω ⋆ = M − 1 ( A σ θ + b σ )

is the unique globally asymptotically stable equilibrium for the ODE (42). Let

H ( ω , θ ) = A σ θ + b σ − M ω ,

for a fixed θ , let

H ∞ ( ω , θ ) = lim r → ∞ H ( r ω ( t ) , θ ) r = − M ω ( t ) .

Since M is a positive definite matrix, then 0 is a globally asymptotically stable equilibrium for the following ODE

ω ˙ ( t ) = H ∞ ( ω ( t ) , θ ) .

Let the σ -field ℱ t = σ ( { θ k , ω k , ϕ k , R k } k < t ) be generated by the set { θ k , ω k , ϕ k , R k } k < t , where t ≥ 1 . Let

M t + 1 = δ t ES e t − ϕ t ω t ⊤ ϕ t − E [ δ t ES e t − ϕ t ω t ⊤ ϕ t | ℱ t ] ,

then for each t ≥ 0 , we have E [ M t + 1 | ℱ t ] = 0 . Furthermore, since Assumption 3 holds, there exists a non-negative constant K 1 > 0 , s.t. { M t } t ∈ ℕ is square-integrable with

E [ ‖ M t ‖ 2 | ℱ t ] ≤ K 1 ( 1 + ‖ θ t ‖ 2 + ‖ ω t ‖ 2 ) .

Since then, we have verified the conditions (A2)-(A5) of Lemma 1, thus

ω k − ω ⋆ → 0 , w . p .1

where w . p .1 is short for with probability one.

Step 2: On Convergence ofθ_t.

Let the σ -field G t = σ ( { θ k , ϕ k , R k } k < t ) be generated by the set { ω k , ϕ k , R k } k < t , where t ≥ 1 . The iteration (31) that can be rewritten as:

θ t + 1 = θ t + α t ( δ t ES e t − γ ( 1 − λ ) ϕ ¯ t + 1 e t ⊤ M − 1 E [ δ t ES e t | θ t ] ) .

Furthermore, we define a random variable N t as follows,

N t = δ t ES e t − γ ( 1 − λ ) ϕ ¯ t + 1 e t ⊤ M − 1 E [ δ t ES e t | θ t ] − E [ δ t ES e t − γ ( 1 − λ ) ϕ ¯ t + 1 e t ⊤ M − 1 E [ δ t ES e t | θ t ] | G t ] ,

then for each t ≥ 0 , we have E [ N t + 1 | G t ] = 0 . Now, we rewrite N t as follows,

N t = δ t ES e t − γ ( 1 − λ ) ϕ ¯ t + 1 e t ⊤ M − 1 E [ δ t ES e t θ t ] − E [ δ t ES e t θ t ] + γ ( 1 − λ ) E [ ϕ ¯ t + 1 e t ⊤ ] M − 1 E [ δ t ES e t | θ t ] .

Under Assumption 3, for each t ≥ 0 , there exists a non-negative constant K 2 > 0 such that,

E [ ‖ N t ‖ 2 | G t ] ≤ K 2 ( 1 + ‖ θ t ‖ 2 ) .

We consider the iteration (31) associated with the ODE

(43) θ ˙ ( t ) = ( I − γ ( 1 − λ ) E [ ϕ ¯ t + 1 e t ⊤ ] M − 1 ) E [ δ t ES e t | θ ( t ) ] .

Recall M = E [ ϕ t ϕ t ⊤ ] , from Equation (28), the ODE (43) can be rewritten as follows,

(44) θ ˙ ( t ) = − A σ ⊤ M − 1 ( A σ θ ( t ) + b σ ) = ( 33 ) G ( θ ( t ) ) .

Since A σ is invertible, then θ ⋆ = − A σ − 1 b σ is the unique global asymptotically stable equilibrium of ODE (44). Let

G ∞ ( θ ) = lim r → ∞ G ( r θ , ω ) r = − A σ ⊤ M − 1 A σ θ .

We consider the following ODE

(45) θ ˙ ( t ) = G ∞ ( θ ( t ) ) = − A σ ⊤ M − 1 A σ θ ( t ) .

Since A σ is invertible and M − 1 is positive definite, then A σ ⊤ M − 1 A σ is a positive defined matrix. Equivalently, − A σ ⊤ M − 1 A σ is negative definite. Thus the vector 0 is the unique global asymptotically stable equilibrium of (45). According to Lemma 1,

θ k − θ ⋆ → 0 , w . p .1.

Therefore the proof is completed. □

C. Proof of Theorem 2

Proof. Let ζ t = ( e t , S t , A t , S t + 1 ) ∈ ℝ p × S × A × S , then we rewrite the iteration (31) and (30) as follows

θ t + 1 = θ t + α t ( g ( θ t , ω t , ζ t ) + e t ι t ) ,

ω t + 1 = ω t + β t ( h ( θ t , ω t , ζ t ) + e t ν t ) ,

where

g ( θ t , ω t , ζ t ) = E [ δ t ES e t − γ ( 1 − λ ) E π [ ϕ ( S t + 1 , ⋅ ) ] e t ⊤ ω t ] ,

e t ι t = δ t ES e t − γ ( 1 − λ ) E π [ ϕ ( S t + 1 , ⋅ ) ] e t ⊤ ω t − g ( θ t , ω t , ζ t )

satisfies E [ e t ι t ] = 0 ; h ( θ t , ω t , ζ t ) = E [ δ t ES e t − ϕ t ω t ⊤ ϕ t ] and e t ν t = δ t ES e t − ϕ t ω t ⊤ ϕ t − h ( θ t , ω t , ζ t ) .

Now, we apply ([26], Theorem 2.3 of Chapter 8) twice, once for each time-scale. We refer the reader to that reference for further technical details in ([26], Theorem 2.3 of Chapter 8), which requires us to verify the following conditions of (i)-(iv):

(i)The random variables g ( θ t , ω t , ζ t ) and h ( θ t , ω t , ζ t ) are uniformly integrable (UI), i.e.,

lim a → ∞ sup t ∈ ℕ E [ ‖ g ( θ t , ω t , ζ t ) ‖ I { ‖ g ( θ t , ω t , ζ t ) ‖ ≥ a } ] = 0 ,

lim a → ∞ sup t ∈ ℕ E [ ‖ h ( θ t , ω t , ζ t ) ‖ I { ‖ h ( θ t , ω t , ζ t ) ‖ ≥ a } ] = 0 ,

where I { ⋅ } is the indicator function.

In fact, the uniform integrability of both g ( θ t , ω t , ζ t ) and h ( θ t , ω t , ζ t ) are ensured by the the UI property of e t = λ γ c t , σ e t − 1 + ϕ t , and according to the same analysis of Proposition 2 from [27], we have g ( θ t , ω t , ζ t ) and h ( θ t , ω t , ζ t ) are UI.

(ii)The set of random variables { ζ t } is tight,i.e, for each positive scalar δ , there exists a compact set D δ ∈ ℝ p × S × A × S such that

inf t ∈ ℕ ℙ [ ζ t ∈ D δ ] ≥ 1 − δ .

In fact, recall the Assumption 3 implies the trace vector e t satisfies sup t ∈ ℕ ‖ e t ‖ < ∞ , i.e.,

(46) ‖ e t ‖ 2 2 = ‖ ∑ k = 0 t ( γ λ ) t − k ∏ i = k + 1 t c i , σ ϕ k ‖ 2 2 ≤ ϕ max 2 1 − ( γ λ ( ( 1 − σ ) + σ ρ max ) ) 2 .

For each a > 0 , according to Markov inequality, we have

ℙ [ ‖ e t ‖ ≥ a ] ≤ sup t ∈ ℕ ‖ e t ‖ a → 0 , a → ∞ ,

which implies { e t } is tight. Since we consider the MDPs with finite state space and action space, so the sequence { ζ t = ( e t , S t , A t , S t + 1 ) } t ≥ 0 is also tight.

(iii)Let D ζ ∈ ℝ p × S × A × S be a compact set, for each ζ ∈ D ζ , both g ( θ , ω , ζ 0 ) and h ( θ , ω , ζ 0 ) are continuous with respect to ( θ , ω ) .

Recall the definitions of g ( θ , ω , ζ 0 ) and h ( θ , ω , ζ 0 ) , then we have

g ( θ , ω , ζ 0 ) = A σ ⊤ M − 1 ( A σ θ + b σ ) ,

h ( θ , ω , ζ 0 ) = A σ θ + b σ − M ω .

The Assumption 3 implies A σ , M , and b σ are bounded, thus both g ( θ , ω , ζ 0 ) and h ( θ , ω , ζ 0 ) are continuous with respect to ( θ , ω ) .

(iv)Recall the notation ϖ = E [ ϕ t ϕ t ⊤ ] − 1 E [ δ t E S e t ] , for each compact set D ζ ∈ ℝ p × S × A × S , let

H m , n = 1 m ∑ t = n n + m − 1 ( h ( θ , ω , ζ t ) − H ( θ , ω ) ) I ( ζ t ∈ D ζ ) ,

G m , n = 1 m ∑ t = n n + m − 1 ( g ( θ , ϖ , ζ t ) − G ( θ ) ) I ( ζ t ∈ D ζ ) ,

X m , n = 1 m ∑ t = n n + m − 1 α t β t g ( θ , ω , ζ t ) ,

then for each ( θ , ω ) ∈ ℝ p × ℝ p , we have:

lim m , n → ∞ E [ H m , n ] = lim m , n → ∞ E [ G m , n ] = lim m , n → ∞ E [ X m , n ] = 0.

By the Jensen’s inequality, it is sufficient that

H m , n → 0 , G m , n → 0 , X m , n → 0 ,

as m , n → ∞ . With the fact

E [ h ( θ , ω , ζ 0 ) ] = A σ θ + b σ − M ω

and following the same analysis of Proposition 2.3 in [28], we have H m , n → 0 , as m , n → ∞ . Similarly, we have G m , n → 0 , as m , n → ∞ .

Note that g ( θ , ω , ζ t ) is Lipschitz continuous with respect to the trace variable e t , where e t ∈ ζ t . Thus, there exists a positive scalar L such that ‖ g ( θ , ω , ζ t ) ‖ ≤ L ‖ e t ‖ . From Assumption 3, we have sup t ∈ ℕ ‖ e t ‖ < ∞ and lim t → ∞ α t / β t = 0 , then we have

X m , n → 0 ,

as m , n → ∞ . □

NOTES

¹The term “weight-duplication trick” we use here is coming from [10][11], while some other literatures may call it “two-timescale stochastic approximation”, e.g., [18].

²http://incompleteideas.net/rlai.cs.ualberta.ca/RLAI/RLtoolkit/tilecoding.html

References 1.

Sutton, R.S. (1988) Learning to Predict by the Methods of Temporal Differences. MachineLearning, 3, 9-44. https://doi.org/10.1023/a:1022633531479 10.1023/a:1022633531479

https://doi.org/10.1023/a:1022633531479

Sutton, R.S.

1988

Learning to Predict by the Methods of Temporal Differences

Machine Learning 3 102263

10.1023/a:1022633531479

De Asis, K., Hernandez-Garcia, J., Holland, G. and Sutton, R. (2018) Multi-Step Reinforcement Learning: A Unifying Algorithm. ProceedingsoftheAAAIConferenceonArtificialIntelligence, 32, 2902-2909. https://doi.org/10.1609/aaai.v32i1.11631 10.1609/aaai.v32i1.11631

https://doi.org/10.1609/aaai.v32i1.11631

Asis, K.

Hernandez-Garcia, J.

Holland, G.

Sutton, R.

2018

Multi-Step Reinforcement Learning: A Unifying Algorithm

Proceedings of the AAAI Conference on Artificial Intelligence 32

10.1609/aaai.v32i1.11631

Sutton, R.S. and Barto, A.G. (2018) Reinforcement Learning: An Introduction. MIT Press.

Sutton, R.S.

Barto, A.G.

2018

Reinforcement Learning: An Introduction

Rummery, G.A. and Niranjan, M. (1994) Online Q-Learning Using Connectionist Systems, Volume 37. University of Cambridge, Department of Engineering.

Rummery, G.A.

Niranjan, M.

Systems, V

Cambridge, D

1994

Online Q-Learning Using Connectionist Systems, Volume 37

University of Cambridge

Precup, D., Sutton, R.S., Singh, S.P., et al. (2000) Eligibility Traces for Off-Policy Policy Evaluation. Proceedings of the Seventeenth International Conference on Machine Learning, Standord, 29 June-2 July 2000, 759-766.

Precup, D.

Sutton, R.S.

Singh, S.P.

Learning, S

2000

Eligibility Traces for Off-Policy Policy Evaluation

Proceedings of the Seventeenth International Conference on Machine Learning 29

Yang, L., Shi, M., Zheng, Q., Meng, W. and Pan, G. (2018) A Unified Approach for Multi-Step Temporal-Difference Learning with Eligibility Traces in Reinforcement Learning. ProceedingsoftheTwenty- SeventhInternationalJointConferenceonArtificialIntelligence, Stockholm, 13-19 July 2018, 2984-2990. https://doi.org/10.24963/ijcai.2018/414 10.24963/ijcai.2018/414

https://doi.org/10.24963/ijcai.2018/414

Yang, L.

Shi, M.

Zheng, Q.

Meng, W.

Pan, G.

Intelligence, S

2018

A Unified Approach for Multi-Step Temporal-Difference Learning with Eligibility Traces in Reinforcement Learning

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence 13

10.24963/ijcai.2018/414

De Asis, K. (2018) A Unified View of Multi-Step Temporal Difference Learning. Ph.D. Thesis, University of Alberta Edmonton.

Asis, K.

Thesis, U

2018

A Unified View of Multi-Step Temporal Difference Learning

Ph.D. Thesis

Sutton, R.S., Mahmood, A.R. and White, M. (2016) An Emphatic Approach to the Problem of Off-Policy Temporal-Difference Learning. Journal of Machine Learning Research, 17, 2603-2631.

Sutton, R.S.

Mahmood, A.R.

White, M.

2016

An Emphatic Approach to the Problem of Off-Policy Temporal-Difference Learning

Journal of Machine Learning Research 17

Sutton, R.S., Maei, H.R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., et al. (2009) Fast Gradient-Descent Methods for Temporal-Difference Learning with Linear Function Approximation. Proceedingsofthe 26 thAnnualInternationalConferenceonMachineLearning, Montreal, 14-18 June 2009, 993-1000. https://doi.org/10.1145/1553374.1553501 10.1145/1553374.1553501

https://doi.org/10.1145/1553374.1553501

Sutton, R.S.

Maei, H.R.

Precup, D.

Bhatnagar, S.

Silver, D.

Learning, M

2009

Fast Gradient-Descent Methods for Temporal-Difference Learning with Linear Function Approximation

Proceedings of the 26th Annual International Conference on Machine Learning 14

10.1145/1553374.1553501

10.

Maei, H.R., Szepesvári, C., Bhatnagar, S., Precup, D., Silver, D. and Sutton, R.S. (2009) Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation. Proceedings of the23 rd International Conference on Neural Information Processing Systems, Vancouver, 7-10 December 2009, 1204-1212.

Maei, H.R.

Bhatnagar, S.

Precup, D.

Silver, D.

Sutton, R.S.

Systems, V

2009

Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation

Proceedings of the 23rd International Conference on Neural Information Processing Systems 7

11.

Maei, H.R. and Sutton, R.S. (2010) GQ( λ): A General Gradient Algorithm for Temporal-Difference Prediction Learning with Eligibility Traces. Proceedings of the 3 rdConference on Artificial General Intelligence ( AGI-10), Lugano, 5-8 March 2010, 100-105. https://doi.org/10.2991/agi.2010.22 10.2991/agi.2010.22

https://doi.org/10.2991/agi.2010.22

Maei, H.R.

Sutton, R.S.

2010

GQ(λ): A General Gradient Algorithm for Temporal-Difference Prediction Learning with Eligibility Traces

Proceedings of the 3rd Conference on Artificial General Intelligence (AGI-10) 5

10.2991/agi.2010.22

12.

Touati, A., Bacon, P.L., Precup, D. and Vincent, P. (2018) Convergent Tree-Backup and Retrace with Function Approximation. arXiv: 1705.09322.

Touati, A.

Bacon, P.L.

Precup, D.

Vincent, P.

2018

Convergent Tree-Backup and Retrace with Function Approximation

1705

13.

Yang, L., Zheng, G., Zhang, Y., Zheng, Q., Li, P. and Pan, G. (2021) On Convergence of Gradient Expected Sarsa( λ). ProceedingsoftheAAAIConferenceonArtificialIntelligence, 35, 10621-10629. https://doi.org/10.1609/aaai.v35i12.17270 10.1609/aaai.v35i12.17270

https://doi.org/10.1609/aaai.v35i12.17270

Yang, L.

Zheng, G.

Zhang, Y.

Zheng, Q.

Li, P.

Pan, G.

2021

On Convergence of Gradient Expected Sarsa(λ)

Proceedings of the AAAI Conference on Artificial Intelligence 35

10.1609/aaai.v35i12.17270

14.

Bertsekas, D.P. and Tsitsiklis, J.N. (1996) Neuro-Dynamic Programming, Volume 5. Athena Scientific.

Bertsekas, D.P.

Tsitsiklis, J.N.

Programming, V

1996

Neuro-Dynamic Programming, Volume 5

15.

Liu, B., Liu, J., Ghavamzadeh, M., Mahadevan, S. and Petrik, M. (2015) Finite-Sample Analysis of Proximal Gradient TD Algorithms. arXiv: 2006.14364.

Liu, B.

Liu, J.

Ghavamzadeh, M.

Mahadevan, S.

Petrik, M.

2015

Finite-Sample Analysis of Proximal Gradient TD Algorithms

2006

16.

Dalal, G., Szorenyi, B., Thoppe, G. and Mannor, S. (2018) Finite Sample Analysis of Two-Timescale Stochastic Approximation with Applications to Reinforcement Learning. arXiv: 1703.05376.

Dalal, G.

Szorenyi, B.

Thoppe, G.

Mannor, S.

2018

Finite Sample Analysis of Two-Timescale Stochastic Approximation with Applications to Reinforcement Learning

1703

17.

Wang, Y., Chen, W., Liu, Y.T., Ma, Z.M. and Liu, T.Y. (2017) Finite Sample Analysis of the GTD Policy Evaluation Algorithms in Markov Setting. arXiv: 1809.08926.

Wang, Y.

Chen, W.

Liu, Y.T.

Ma, Z.M.

Liu, T.Y.

2017

Finite Sample Analysis of the GTD Policy Evaluation Algorithms in Markov Setting

1809

18.

Bhandari, J., Russo, D. and Singal, R. (2018) A Finite Time Analysis of Temporal Difference Learning with Linear Function Approximation. arXiv: 1806.02450.

Bhandari, J.

Russo, D.

Singal, R.

2018

A Finite Time Analysis of Temporal Difference Learning with Linear Function Approximation

1806

19.

Xu, T.Y., Zou, S.F. and Liang, Y.B. (2019) Two Time-Scale Off-Policy TD Learning: Non-Asymptotic Analysis over Markovian Samples. arXiv: 1909.11907.

Xu, T.Y.

Zou, S.F.

Liang, Y.B.

2019

Two Time-Scale Off-Policy TD Learning: Non-Asymptotic Analysis over Markovian Samples

1909

20.

Mahmood, A.R., Yu, H. and Sutton, R.S. (2017) Multi-Step Off-Policy Learning without Importance Sampling Ratios. arXiv: 1702.03006.

Mahmood, A.R.

Yu, H.

Sutton, R.S.

2017

Multi-Step Off-Policy Learning without Importance Sampling Ratios

1702

21.

Baird, L. (1995) Residual Algorithms: Reinforcement Learning with Function Approximation. In: Prieditis, A. and Russell, S., Eds., MachineLearningProceedings 1995, Elsevier, 30-37. https://doi.org/10.1016/b978-1-55860-377-6.50013-x 10.1016/b978-1-55860-377-6.50013-x

https://doi.org/10.1016/b978-1-55860-377-6.50013-x

Baird, L.

Prieditis, A.

Russell, S.

1995

Residual Algorithms: Reinforcement Learning with Function Approximation

In: Prieditis 30

10.1016/b978-1-55860-377-6.50013-x

22.

Dann, C., Neumann, G. and Peters, J. (2014) Policy Evaluation with Temporal Differences: A Survey and Comparison. The Journal of Machine Learning Research, 15, 809-883.

Dann, C.

Neumann, G.

Peters, J.

2014

Policy Evaluation with Temporal Differences: A Survey and Comparison

The Journal of Machine Learning Research 15

23.

Borkar, V.S. (1997) Stochastic Approximation with Two Time Scales. Systems&ControlLetters, 29, 291-294. https://doi.org/10.1016/s0167-6911(97)90015-3 10.1016/s0167-6911(97)90015-3

https://doi.org/10.1016/s0167-6911(97)90015-3

Borkar, V.S.

1997

Stochastic Approximation with Two Time Scales

Systems & Control Letters 6911 97

10.1016/s0167-6911(97)90015-3

24.

Kushner, H. and Yin, G.G. (2003) Stochastic Approximation and Recursive Algorithms and Applications, Volume 35. Springer Science & Business Media.

Kushner, H.

Yin, G.G.

Applications, V

2003

Stochastic Approximation and Recursive Algorithms and Applications, Volume 35

25.

Yu, H.Z. (2016) Weak Convergence Properties of Constrained Emphatic Temporal-Difference Learning with Constant and Slowly Diminishing Stepsize. Journal of Machine Learning Research, 17, 7745-7802.

Yu, H.Z.

2016

Weak Convergence Properties of Constrained Emphatic Temporal-Difference Learning with Constant and Slowly Diminishing Stepsize

Journal of Machine Learning Research 17

26.

Yu, H.Z. (2017) On Convergence of Some Gradient-Based Temporal-Differences Algorithms for Off-Policy Learning. arXiv: 1712.09652.

Yu, H.Z.

2017

On Convergence of Some Gradient-Based Temporal-Differences Algorithms for Off-Policy Learning

1712