A Unified Gradient Temporal Difference Learning Algorithm for Off-Policy Learning

Abstract

In this paper, we propose a unification of gradient temporal difference (GTD) learning algorithm GQ( σ,λ ) for off-policy learning. The proposed GQ( σ,λ ) ranges from gradient Tree?Backup( λ ) to GQ( λ ) when σ ranges from 0 to 1. We investigate the structure of TD fixed-point of GQ( σ,λ ) , and prove GQ( σ,λ ) converges to its TD fixed-point with probability one. Furthermore, we prove that GQ( σ,λ ) converges to an arbitrarily small neighborhood of the optimal solution with probability one. Empirical results show the GQ( σ,λ ) with a value σ( 0,1 ) that creates a mixture of GQ( λ ) and gradient Tree?Backup( λ ) achieves a better performance than both the extreme end σ=0 and σ=1 .

Share and Cite:

Zhao, Y. and Yang, L. (2026) A Unified Gradient Temporal Difference Learning Algorithm for Off-Policy Learning. Journal of Applied Mathematics and Physics, 14, 2384-2408. doi: 10.4236/jamp.2026.146117.

1. Introduction

In reinforcement learning (RL), unifying some disparate ideas not only providing a better understanding of existing algorithms but also creating better performing algorithms.

For example, TD( λ ) [1] unifies one-step temporal difference learning (if λ=0 ) and Monte Carlo method (if λ=1 ) through the trace-decay parameter λ . Results show that the unified algorithm TD( λ ) performs best at an intermediate value λ( 0,1 ) rather than the extreme cases of λ=0 and λ=1 .

The work [2] [3] propose a multi-step Q( σ ) that unifies n -step Sarsa [4] (if σ=1 , full-sampling) and n -step Tree-Backup [5] ( σ=0 , pure-expectation), where the parameter σ[ 0,1 ] denotes the degree of the sampling. The work [2] have conducted experiments to show that for some value σ( 0,1 ) , Q( σ ) creates a mixture of full-sampling and pure-expectation approach, which performs better than the extreme case σ=0 and σ=1 . Later, the work [6] [7] inherit the key idea of unification of TD( λ ) and Q( σ ) , they propose Q( σ,λ ) unifies Sarsa( λ ) [4] and Tree-Backup( λ ) [5]. The previous works [6] [7] show that Q( σ,λ ) performs best at an intermediate value σ( 0,1 ) .

It is noteworthy that the theoretical analysis of the work [2] [6] [7] only consider the tabular learning, which requires a very large table to store the estimated value function when the state space is huge. That implies the previous methods of [2] [6] [7] are considerably expensive for high-dimensional RL, which is the main focus of this work.

Our Main Works

A practical way to address the high-dimensional curse is using a parametric function to estimate the value function. In this paper, we focus on extending Q( σ,λ ) with linear function approximation. Since the divergence of semi-gradient with multi-step bootstrapping for off-policy learning are well-documented in the existing literature (e.g., [3] [8]), which could also happen in semi-gradient Q( σ,λ ) . To propose a convergent gradient-based algorithm, we derive the GQ( σ,λ ) algorithm via the mean square projected Bellman error (MSPBE) objective function [9], that inspired by weight-duplication trick (also known as “two-timescale stochastic approximation”) [9] [10]. When σ ranges from 0 to 1, GQ( σ,λ ) ranges from gradient Tree Backup( λ ) ( TB( λ ) ) to GQ( λ ) [11], i.e., our GQ( σ,λ ) unifies gradient Tree Backup( λ ) and GQ( λ ) .

Although GQ( σ,λ )| σ=0 is a natural algorithm to extend TB( λ ) with linear function approximation, to the best of our knowledge, the update rule of GQ( σ,λ )| σ=0 has not been proposed in the existing literatures. It is worth to notice that Touati et al., [12] have proposed another version of gradient TB( λ ) ( GTB( λ ) ), which is different from the proposed GQ( σ,λ )| σ=0 , we have clarified this point in Remark 3.

Then, we provide the convergence analysis of the proposed GQ( σ,λ ) . Theorem 1 shows that GQ( σ,λ ) converges to its TD fixed-point with probability one. Additionally, Theorem 1 illustrates the structure of such TD fixed-point: it is the global asymptotically stable equilibrium of its corresponding ordinary differential equation (ODE). For more discussion, see Remark 4. Furthermore, Theorem 2 shows that GQ( σ,λ ) converges to an arbitrarily small neighborhood of the optimal solution with probability one.

Finally, our experiments show that when σ ranges from 0 to 1, GQ( σ,λ ) achieves the best performance of off-policy evaluation or control within a certain σ( 0,1 ) , neither σ=0 , nor σ=1 , which implies that with a certain value σ( 0,1 ) , GQ( σ,λ ) creates a mixture between GQ( λ ) and gradient TB( λ ) reaches a better performance than the extreme ends ( σ=0 and σ=1 ).

2. Preliminary

In this section, we briefly review the basics of reinforcement learning and off-policy evaluation.

2.1. Reinforcement Learning

Reinforcement learning (RL) [3] is often formalized as Markov decision processes (MDP) that considers a tuple =( S,A,P,R,γ ) ; S is a set with finite states, A is a set with finite actions; P:S×A×S[ 0,1 ] , p s s a =P( S t = s | S t1 =s, A t1 =a ) is the probability of state transition from s to s under playing the action a ; R( , ):S×A 1 is the expected reward function; γ( 0,1 ) .

A policy π is a probability distribution on S×A , and π( a|s ) denotes the probability of playing a in state s . Let { S t , A t , R t+1 } t0 be generated by π , its state-action value function is: q π ( s,a )= E π [ t=0 γ t R t+1 | S 0 =s, A 0 =a ] , where E π [ | ] is conditional expectation on the actions selected according to π . Let π : | S | | S | denote Bellman operator with respect to policy π :

π :q R π +γ P π q, (1)

where P π | S |×| S | , R π | S |×| A | , their corresponding elements are: [ P π ] s, s = aA π( a|s ) p s s a , [ R π ] s,a =R( s,a ) . It is well-known that q π is the unique fixed point of π , i.e., π q π = q π , which is known as Bellman equation.

2.2. Off-Policy Evaluation

Let us consider the trajectory τ=: { S t , A t , R t+1 } t0 generated by the behavior policy μ , where A t ~μ( | S t ) , S t+1 ~P( | S t , A t ) . Off-policy evaluation is the task to estimate the value function of the target policy π via the data that is generated by the behavior policy μ , where μπ .

Assumption 1 (Ergodicity). The Markov chain induced by behavior policy μ is ergodic, i.e., there exists a stationary distribution ξ( , ) over S×A : for ( S 0 , A 0 ) ,

1 n k=1 n ( S k =s, A k =a| S 0 , A 0 ) n ξ( s,a )>0. (2)

The ergodicity of behavior policy μ is a standard assumption in off-policy learning [3], and it implies each-action pair is visited under this behavior policy μ . We use Ξ to denote a diagonal matrix whose diagonal element is ξ( s,a ) , i.e., Ξ=diag{ ,ξ( s,a ), } .

2.3. Temporal Difference Learning and λ-Return

TD learning updates value function as follows, t0 ,

Q( S t , A t )Q( S t , A t )+ α t δ t , (3)

where Q( , ) is an estimator of q π , α t is step-size and δ t is TD error. Let Q t =:Q( S t , A t ) , if

δ t = δ t S =: R t+1 +γ Q t+1 Q t , (4)

then update (3) is Sarsa [4]. If δ t is expected TD error:

δ t ES = R t+1 +γ aA π( a| S t+1 )Q( S t+1 ,a ) Q t , (5)

then update (3) is Expected Sarsa [4].

The λ -return is an average contains all the n -step returns by weighting proportionally to λ n1 , where λ[ 0,1 ] . In this paper, we mainly consider two classic λ -return: Tree Backup( λ ) ( TB( λ ) ) and Expected Sarsa( λ ) .

Tree Backup( λ ). For each pair ( S t , A t ) in the trajectory τ , TB( λ ) [5] estimates q π ( S t , A t ) by

G t λ = Q t + k=t ( γλ ) kt δ k ES i=t+1 k π( A i | S i ), (6)

where δ k ES is expected TD error. Precup et al., [5] have proved the iteration (6) converges to q π with probability one under some certain conditions.

Expected Sarsa( λ ). Sutton and Barto [3] have proposed a multi-step TD learning extends Expected Sarsa to λ -return version: for each t0 ,

G t λ = Q t + k=t ( γλ ) kt δ k ES i=t+1 k π( A i | S i ) μ( A i | S i ) . (7)

For the convenience, in the following paragraph, we consider the following notations,

ρ i =: π( A i | S i ) μ( A i | S i ) , i=t k ρ i =: ρ t:k , ρ t:t+1 =1.

2.4. A Unified View

In this section, we review an approach to unify TB( λ ) and Expected Sarsa( λ ) .

Q( σ ) Algorithm. Recently, De Asis et al., [2] propose Q( σ ) unifies multi-step Sarsa and multi-step TB( 0 ) . Concretely, according to a mixed TD error δ t π,σ :

δ t π,σ =σ δ t S +( 1σ ) δ t ES , (8)

De Asis et al., [2] construct a multi-step estimator:

G t σ = Q t + k=t γ kt δ k π,σ i=t+1 k [ ( 1σ )π( A i | S i )+σ ], (9)

where σ[ 0,1 ] is sampling parameter. When σ ranges from 0 to 1, the update (9) ranges from multi-step TB( 0 ) to multi-step Sarsa. Experimental results show that a certain σ( 0,1 ) results in a mixture of TB( 0 ) and Sarsa performs better than both σ=0 and σ=1 [2], which implies unifying some seemingly disparate algorithmic ideas can create better performing algorithms.

Off-Policy Q( σ,λ ) . Later, De Asis, [7] proposes a multi-step returns as follows,

G t σ,λ = Q t + k=t δ k ES i=t+1 k γλ( ( 1σ )π( A i | S i )+σ ρ i ) = Q t + k=t δ k ES i=t+1 k γλ c i,σ , (10)

where

c i,σ =( 1σ )π( A i | S i )+σ ρ i . (11)

When σ ranges from 0 to 1, the estimator (10) ranges from TB( λ ) (6) to Expected Sarsa( λ ) (7).

We introduce a λ -operator σ,λ π,μ ( ): | S |×| A | | S |×| A | that is a high level view of the λ -return (10),

σ,λ π,μ ( ):qq+ E μ [ k=0 δ k ES i=1 k γλ c i,σ ] =q+σ ( Iλγ P π ) 1 ( π qq ) +( 1σ ) ( Iλγ P π,μ ) 1 ( π qq ), (12)

where π is Bellman operator (1), P π,μ | S |×| S | , and whose elements are: [ P π,μ ] s, s = aA π( a|s )μ( a|s ) p s s a .

Remark 1. Yang et al. [6] propose another version of Q( σ,λ ) algorithm that extends Q( σ ) (9) with eligibility trace:

G ˜ t σ,λ = Q t + k=t ( λγ ) kt δ k π,σ . (13)

It is noteworthy that at one extreme end σ=1 , both Q( σ ) (9) and G ˜ t σ,λ (13) reduce to on-policy learning. Particularly, [6] prove that the performance of G ˜ t σ,λ for off-policy evaluation is determined by parameter σ :

E μ [ G ˜ t σ,λ |( S t , A t )=( s,a ) ] q π ( s,a ) σC, (14)

where C is a positive constant never reaches 0 no matter how we choose the starting time t . The upper error bound of (14) illustrates the capacity of G ˜ t σ,λ for off-policy evaluation decays monotonously when σ ranges from 0 to 1. At the extreme end σ=1 , G ˜ t σ,λ achieves the worst performance of off-policy evaluation. We think this is a natural result since G ˜ t σ,λ | σ=1 is an on-policy algorithm exactly. Thus, in this paper, we mainly concern (10) for off-policy learning.

2.5. Linear Function Approximation

TD learning (3) requires a very huge table to store the estimate value function Q( , ) when | S | is very large, which implies tabular TD learning is considerably expensive for high-dimensional RL. We often use a parametric function Q θ ( , ) to approximate

q π ( s,a ) ϕ ( s,a )θ=: Q θ ( s,a ), (15)

where θ p is the parameter need to be learned, ϕ( s,a )= ( φ 1 ( s,a ), φ 2 ( s,a ),, φ p ( s,a ) ) , and each φ i :S×A . Furthermore, Q θ can be rewritten as a version of matrix

Q θ =Φθ q π , (16)

where Φ is a matrix whose rows are the state-action feature vectors ϕ ( s,a ) .

3. Divergence of Q( σ,λ ) with Semi-Gradient

In this section, firstly, we derive the semi-gradient Q( σ,λ ) algorithm; then we briefly analyze the divergence of extending Q( σ,λ ) (10) with semi-gradient method. In fact, the divergence of semi-gradient off-policy TD methods are well-documented in the literature (e.g., [8] [12] [13]), which are not specific to Q( σ,λ ) .

3.1. Semi-Gradient Q( σ,λ )

Recall τ= { ( S t , A t , R t+1 ) } t0 is generated by behavior policy μ , let ϕ t =ϕ( S t , A t ) , we define semi-gradient Q( σ,λ ) as follows:

θ t+1 = θ t α t θ ( G t λ ( θ ) θ ϕ t ) 2 | θ= θ t = θ t α t ( G t λ ( θ t ) θ t ϕ t ) θ ( Q θ ( S t , A t ) )| θ= θ t = θ t + α t ( k=t δ k ES ( θ t ) i=t+1 k γλ c i,σ ) ϕ t , (17)

where

G t λ ( θ t )= θ t ϕ t + k=t δ k ES ( θ t ) i=t+1 k γλ c i,σ

is an off-line estimator of value function according to Equation (10), TD error δ k ES ( θ t )= R k+1 +γ E π [ θ t ϕ( S k+1 , ) ] θ t ϕ k , and α t is step-size. Let

A t = ϕ t k=t ( γ E π [ ϕ( S k+1 , ) ] ϕ k ) i=t+1 k γλ c i,σ , (18)

b t = ϕ t k=t R k+1 i=t+1 k γλ c i,σ . (19)

Then update (17) can be rewritten as follows,

θ t+1 = θ t + α t ( A t θ t + b t ). (20)

Furthermore, we have

A σ =: E μ [ A t ]= Φ Ξ P σ π,μ ( γ P π I )Φ, (21)

b σ =: E μ [ b t ]= Φ Ξ P σ π,μ R, (22)

where P σ π,μ =σ ( Iλγ P π ) 1 +( 1σ ) ( Iλγ P π,μ ) 1 .

3.2. Divergence Analysis

Now, we only briefly discuss the divergence lies in the iteration (17). Under the conditions of Proposition 4.8 presented by Bertsekas and Tsitsiklis [14], { θ t } t0 (17) converges to a certain point if and only if A σ is a negative matrix. Furthermore, if iteration (17) converges, then it converges to its unique TD fixed point θ that satisfes

A σ θ + b σ =0. (23)

Unfortunately, since the steady state-action distribution doesn’t match the transition probability during off-policy learning, we can not guarantee the negative definiteness of A σ , thus { θ t } t0 may diverge. To clarify this point, we use the classic counterexample [12] to show the divergence of semi-gradient TD algorithms (17) for off-policy learning.

Figure 1. Counterexample from [12]: Two-State MDP.

For the MDP in Figure 1, the behavior policy μ( right| )=0.5 , and target policy π( right| )=1 . We assign the features { ( 1,0 ) , ( 2,0 ) , ( 0,1 ) , ( 0,2 ) } to the state-action pairs { ( s 1 ,right ),( s 2 ,right ),( s 1 ,left ),( s 2 ,left ) } , From the dynamic transition shown in Figure 1, we have

P π =( 0 1 0 0 0 1 0 0 1 0 0 0 1 0 0 0 ),Φ=( 1 0 2 0 0 1 0 2 ),Ξ= 1 2 I 4×4 .

Then, according to (21), we have

A σ = Φ Ξ P σ π,μ ( γ P π I )Φ =( 6( 2σ )γγλ5( 2σ ) 2( 1γλ ) 0 3γ 2 5( 1 σ 2 ) ),

and the eigenvalues of A are: 6( 2σ )γγλ5( 2σ ) 2( 1γλ ) and 5( 1 σ 2 ) . For any initial θ 0 = ( θ 0,1 , θ 0,2 ) , let E[ θ t+1 ]=: ( θ t+1,1 , θ t+1,2 ) , according to (20), the first component of the term E[ θ t+1 | θ t ] is:

θ t+1,1 = θ 0,1 i=0 t ( 1+ α i 6( 2σ )γγλ5( 2σ ) 2( 1γλ ) ). (24)

For any λ( 0,1 ) , if γ( 105σ 126σλ ,1 ) , then 6( 2σ )γγλ5( 2σ ) 2( 1γλ ) is a positive scalar, which implies A σ can not be a negative matrix. Furthermore, if step size α t : i0 α t = , we have

| θ t+1,1 |=| θ 0,1 | i=0 t ( 6( 2σ )γγλ5( 2σ ) 2( 1γλ ) )+, (25)

Equation (25) is a direct result of the following conclusion that could be found in any calculus textbook. Let p i =1+ a i , where a i >0 , if i=1 a i =+ , then i=1 p i = i=1 ( 1+ a i )=+ , which implies the way (17) to extend Q( σ ) with linear function approximation via off-line estimate is unstable for off-policy learning.

4. Gradient Q( σ,λ )

In this section, we derive the gradient Q( σ,λ ) ( GQ( σ,λ ) ) algorithm. The proposed GQ( σ,λ ) unifies GQ( λ ) [11] if σ=1 . For more discussion, see Remark 2. At another extreme end σ=0 , the proposed GQ( σ,λ )| σ=0 can be seen as a new way to extend TB( λ ) (6) with linear function approximation. Although GQ( σ,λ )| σ=0 is a natural algorithm extends TB( λ ) (6) with linear function approximation, to the best of our knowledge, the update rule of GQ( σ,λ )| σ=0 has not been proposed in the existing literature. For more discussion, see Remark 3.

We derive the gradient the GQ( σ,λ ) algorithm via mean square projected Bellman error (MSPBE) [9] objective function as follows,

J( θ )= 1 2 ΦθΠ σ,λ π,μ ( Φθ ) Ξ 2 , (26)

where

Π=Φ ( Φ ΞΦ ) 1 Φ Ξ

is an | S |×| S | projection matrix, Ξ 2 is the weighted Euclidean norm: x Ξ 2 = x Ξx . Furthermore, we can rewrite min θ J( θ ) as follows,

min θ J( θ )= min θ 1 2 A σ θ+ b σ M 1 2 , (27)

where M= E μ [ ϕ t ϕ t ]= Φ ΞΦ .

The gradient method is a natural approach to solve problem (27), however, it is worth to notice that the challenges are two-fold: (I) Firstly, since the invertible matrix M 1 is involved in J( θ ) , so it is too expensive to apply stochastic gradient to solve the problem (27) directly. (II) Since J( θ )= A σ M 1 ( A σ θ+ b σ ) involves the product of expectations, then the unbiased estimate of J( θ ) cannot be obtained via a single sample. It needs to sample twice, so it is a double-sampling problem, which is the second bottleneck of applying gradient to solve the problem (27). Additionally, M 1 =E [ ϕ t ϕ t ] 1 cannot also be estimated via a single sample. The above analysis pushes us to find a new practical way to solve the problem (27).

Let e 1,σ =0 , e t,σ =λγ c t,σ e t1 + ϕ t , ϕ ¯ t+1 = E π [ ϕ( S t+1, ) ] , then we have the following equation:

1 2 J( θ t )= E μ [ ( γ ϕ ¯ t+1 ϕ t ) e t,σ ]ϖ (28)

1 2 J( θ t )= E μ [ δ t ES e t,σ ]E[ γ( 1λ ) ϕ ¯ t+1 e t,σ ]ϖ, (29)

where

ϖ= E μ [ ϕ t ϕ t ] 1 E[ δ t ES e t,σ ], δ t ES = R t+1 +γ θ t ϕ ¯ t+1 θ t ϕ t .

For the limitation of space, we provide the derivation of (28)-(29) in Appendix A. We use the sign convention that the mean update of θ follows the negative gradient J( θ t ) . The term (31) provides its unbiased stochastic approximation, while the auxiliary recursion (30) tracks ϖ= M 1 E μ [ δ t ES e t,σ ] , circumventing the double-sampling issue.

Let’s consider the term ϖ= E μ [ ϕ t ϕ t ] 1 E[ δ t ES e t,σ ] that is a solution of a least-squares problem, and a typical least mean square (LMS) update rule to find the vector ϖ is:

ω t+1 = ω t + β t ( δ t ES e t,σ ϕ t ω t ϕ t ), (30)

where β t is step-size. Then, by directly sampling from (29) with (30), we define the update rule of θ as follows,

θ t+1 = θ t + α t ( δ t ES e t,σ γ( 1λ ) ϕ ¯ t+1 e t,σ ω t ), (31)

where α t is step-size. We provide the details in Algorithm 1.

Remark 2 (Case of σ=1 ). When σ=1 , we regard GQ( σ,λ )| σ=1 as an approach to extend Expected Sarsa( λ ) (7) with linear function approximation. A more interesting result is that the proposed GQ( σ,λ )| σ=1 is reduced to GQ( λ ) [12] exactly. Now, we provide two fresh interpretations to the proposed GQ( λ ) :

  • GQ( λ ) is at one extreme end ( σ=1 ) of the proposed GQ( σ,λ ) , i.e., the proposed GQ( σ,λ ) contains GQ( λ ) .

  • Furthermore, just because GQ( λ ) is same as GQ( σ,λ )| σ=1 , GQ( λ ) can be seen as an extension of the tabular Expected Sarsa( λ ) (7) with linear function approximation, while the original motivation of Maei and Sutton [11] to propose GQ( λ ) is to introduce eligibility trace to gradient temporal difference learning for off-policy evaluation. Thus, the proposed GQ( σ,λ )| σ=1 provides a fresh understanding for the GQ( λ ) [11].

Computational complexity. The computational complexity of Algorithm 1 is O( | A |p ) per step in time, and O( p ) in memory, where p is the feature dimension and | A | is the number of actions. This maintains the same asymptotic efficiency as the baseline GQ( λ ) and gradient TB( λ ) methods.

Remark 3 (Case of σ=0 ). When σ=0 , the algorithm GQ( σ,λ )| σ=0 is reduced to a version of extending TB( λ ) (6) with linear function approximation. Although GQ( σ,λ )| σ=0 is a natural algorithm extends TB( λ ) (6) with linear function approximation, to the best of our knowledge, the update rule of GQ( σ,λ )| σ=0 has not been proposed in the existing literatures. It is worth to notice that [12] have also proposed another version of gradient TB( λ ) ( GTB( λ ) ), while the difference between the proposed GQ( σ,λ )| σ=0 and GTB( λ ) [12] is reflected at least two aspects:

  • Firstly, the proposed gradient Q( σ,λ )| σ=0 and GTB( λ ) [12] share the same update rule of e t and ω t , but instead of (31), GTB( λ ) updates the parameter θ as follows,

θ t+1 = θ t + α t ( γ ϕ ¯ t+1 + ϕ t ) e t ω t , (32)

where e t = e t,σ | σ=0 =λγπ( A t | S t ) e t1 .

  • Secondly, Touati et al., [12] derive the update rule (32) of their GTB( λ ) via the convex-concave saddle-point framework [15], while our GQ( σ,λ ) is based on the weight-duplication trick1 of (29)-(31).

5. Convergence Analysis

In this section, we prove the convergence of the proposed GQ( σ,λ ) . Theorem 1 shows that GQ( σ,λ ) converges to its TD fixed-point (23) with probability one. Besides, Theorem 1 illustrates the structure of this TD fixed-point: it is the global asymptotically stable equilibrium of its corresponding ordinary differential equation (ODE). For more discussion, we provide in Remark 4. Theorem 2 shows that GQ( σ,λ ) converges to an arbitrarily small neighborhood of the optimal solution with probability one. While we rely on the standard two-timescale ODE method for the general convergence mechanism, our novel theoretical step is verifying that the proposed σ -dependent trace coefficient c t,σ yields a stable equilibrium structure for the ODEs.

5.1. Additional Assumptions

We need some additional assumptions to present the convergent of GQ( σ,λ ) , those assumptions are widely used in reinforcement learning [13] [17]-[19].

Assumption 2 (Diminishing Step-size). The positive sequences { α t } t0 , { β t } t0 satisfy the following conditions with probability one,

t=0 α t = t=0 β t =, t=0 α t 2 <, t=0 β t 2 <.

Assumption 3. The features { ϕ t } t0 is uniformly bounded by ϕ max . The reward function is uniformly bounded by R max . The importance sampling ρ t = π( A t | S t ) μ( A t | S t ) is uniformly bounded by ρ max .

Assumption 4 (Solvability of Problem). The matrix A σ is non-singular and rank( Φ )=p .

Assumption 4 requires the non-singular matrix A σ , which implies the optimal parameter θ = A 1 b is well defined. The feature matrix Φ has linearly independent columns implies the matrix M= Φ ΞΦ is positive defined.

5.2. Main Results and Discussion

Theorem 1 (Convergence of Algorithm 1). Under Assumption 1-4, we consider the iteration { ( θ t , ω t ) } t0 that is generated according to (30)-(31). The step-size α t , β t satisfy Assumption 2 and η t = β t α t 0 , as t . We define two functions G( θ ) , H( ω,θ ) as follows,

G( θ )= A σ M 1 ( A σ θ+ b σ ), (33)

H( ω,θ )= A σ θ+ b σ Mω. (34)

Then ( θ t , ω t ) w.p.1 ( θ , ω ) , as t , where ( θ , ω ) is the unique global asymptotically stable equilibrium with respect to the following ODE correspondingly:

{ θ ˙ ( t )=: d dt θ( t )=G( θ( t ) ) ω ˙ ( t )=: d dt ω( t )=H( ω( t ),θ( t ) ), (35)

where θ( t ),ω( t ) p are the functions are defined on continuous time ( 0, ) .

Proof. We provide its proof in Appendix B. □

Remark 4 (TD-Fixed Point of GQ( σ,λ ) ). Theorem 1 illustrates that the sequence { ( θ t , ω t ) } t0 generated by GQ( σ,λ ) converges to the global asymptotically stable equilibrium of its corresponding ODE (35). Furthermore, from the details of the proof in Appendix B, we know since A σ is invertible and M 1 is positive definite, then A σ M 1 A σ is also positive defined. Because the ODE uses the negative MSPBE-gradient direction, the Jacobian at the equilibrium is A σ M 1 A σ , whose eigenvalues have negative real parts. So the following ODE

θ ˙ ( t )= A σ M 1 ( A σ θ( t )+ b σ )=G( θ( t ) ), (36)

has a unique global asymptotically stable equilibrium θ satisfies the equation

A σ θ + b σ =0, (37)

which implies the global asymptotically stable equilibrium θ of the ODE θ ˙ ( t )=G( θ( t ) ) is also the TD fixed point of (23). That means θ t converges to its TD fixed point:

θ t θ = A σ 1 b σ ,w.p.1.

Remark 5 (Unification of TD-Fixed Point). If σ=0 , the matrix A σ is reduced to

A σ | σ=0 = Φ Ξ ( Iλγ P π,μ ) 1 ( γ P π I )Φ,

which implies GQ( σ,λ )| σ=0 converges to the TD-fixed point of GTB( λ ) [12], i.e., GQ( σ,λ )| σ=0 shares the same TD-fixed point of GTB( λ ) [12] as follows

θ = A σ 1 b σ | σ=0 .

If σ=1 , then the key matrix A σ is reduced to

A σ | σ=1 = Φ Ξ ( Iλγ P π ) 1 ( γ P π I )Φ,

which illustrates the TD-fixed point of GQ( σ,λ )| σ=1 (i.e., GQ( λ ) ) algorithm as follows

θ = A σ 1 b σ | σ=1 .

Theorem 1 presents an asymptotic result, which holds only in the limit as the number of iterations increases to infinity. Now, we present a result shows the distance between θ t and the optimal solution θ convergence to 0 in probability. For any T>0 , and t0 , we introduce a notation

κ( t;T )=min{ kt| i=t k+1 α i >T }

to denote the last iteration before the sum of step-size α i between it and the t -th iteration exceeds T . Since we consider the Assumption 2, the notation κ( t;T ) is well-defined.

Theorem 2. Under Assumption 1-4, we consider the iteration { ( θ t , ω t ) } t0 generated by (30)-(31). The step-size α t , β t satisfy Assumption 2. Moreover, for the integers m t , m t , as t ,

lim t sup 0<j m t | α t+j α t 1 |=0, lim t sup 0<j m t | β t+j β t 1 |=0.

Then there exists a sequence of positive numbers T t (as t ) such that for any ϵ>0 ,

lim t [ θ i N ϵ ( θ ),i[ t,κ( t, T t ) ] ]=0, (38)

where we use N ϵ ( θ )={ θ: θ θ 2 ϵ } to denote the ϵ -neighborhood of θ .

Proof. We provide its proof in Appendix C. □

Remark 6 Theorem 2 shows that as t , the sequence { θ i } i=t κ( t, T t ) collects to an arbitrarily small neighborhood of the optimal solution θ with probability one, i.e.,

lim t ( i=t κ( t, T t ) { θ i N ϵ ( θ ) } )=1.

In fact, the results (38) implies

lim t ( θ i N( θ ,ϵ ),i[ t,κ( t, T t ) ] )=0,

then we know

1 lim t ( θ i N( θ ,ϵ ),i[ t,κ( t, T t ) ] ) = lim t ( θ i N( θ ,ϵ ),i[ t,κ( t, T t ) ] ¯ ) = lim t ( i=t κ( t, T t ) { θ i N( θ ,ϵ ) } ¯ ) = lim t ( i=t κ( t, T t ) { θ i N( θ ,ϵ ) } ).

That is lim t ( i=t κ( t, T t ) { θ i N( θ ,ϵ ) } )=1 .

6. Experiments

In this section, we test the capacity of GQ( σ,λ ) for two typical tasks: off-policy evaluation and control.

  • For off-policy evaluation, we compare GQ( σ,λ ) with four state-of-art algorithms: GQ( λ ) [11], ABQ( ζ ) [20], GTB( λ ) , GRetrace( λ ) [12] over two typical measurements: MSPBE and mean square error (MSE). It is worth noting that if σ=0 , GQ( σ,λ ) is reduced to a new version of GTB( λ ) (we denote it as GTB( λ ) -v0), thus, we also show its comparison to GTB( λ ) [12] (we denote it as GTB( λ ) -v1).

  • For the control task, our goal is to show the the trade-off between σ=0 and σ=1 . Empirical results show the GQ( σ,λ ) with a value σ( 0,1 ) that creates a mixture of GQ( λ ) and gradient Tree Backup( λ ) achieves a better performance than both the extreme end σ=0 and σ=1 .

These domains are standard diagnostic benchmarks intended to isolate the off-policy instability under linear function approximation. Scaling to very high-dimensional environments is left as future work.

6.1. Off-Policy Evaluation Experiments

We present the domains in the experiments as follows.

(1) Two State MDP [12]. This MDP is shown in Figure 1, the behavior policy μ( right| )=0.5 , and target policy π( right| )=1 . We assign the features { ( 1,0 ) , ( 2,0 ) , ( 0,1 ) , ( 0,2 ) } to the state-action pairs { ( s 1 ,right ),( s 2 ,right ),( s 1 ,left ),( s 2 ,left ) } , i.e.,

Φ= ( 1 2 0 0 0 0 1 2 ) .

(2) Baird Star [21]. The Baird Star is an episodic seven states MDP with two actions: dashed action and solid action. In this example, the behavior policy μ( |dashed )= 6 7 , μ( |solid )= 1 7 and target policy π( |solid )=1 . We choose the feature map matrix as follows

Φ=( 2 I 7×7 1 7×1 0 7×8 0 7×8 2 I 7×7 1 7×1 ),

where I denotes the identity matrix, 0 denotes a matrix whose elements are all 0, and 1 7×1 denotes a vector whose elements are all 1. We used θ 0 = ( 1,1,1,1,1,1,10,1 ) T as initial parameter vector for the methods that allow specifying a start estimate, TD-learning is known to diverge for this initialization of the parameter-vector [3] [22].

(3) Cliff Walking. This is a standard undiscounted, episodic task, with start and goal states, and the usual actions causing movement up, down, right, and left.

(4) Grid World. This environment from ([3], Chapter 4), where the agent on an 4 × 4 grid and your goal is to reach the terminal state at the top left or the bottom right corner.

(5) Windy Gridworld. This environment from ([3], Chapter 6). Windy Gridworld problem for reinforcement learning. Actions include going up, down, right, and left. In each column the wind pushes you up a specific number of steps (for the next action). If an action would take you off the grid, you remain in the previous state. For each step you get a reward of −1, until the agent reach into a terminal state.

We summarize the domains, feature settings, target policy and behavior policy below.

Two Measurements for Off-Policy Evaluation. In this section, we use empirical RMSPBE= 1 2 b ^ + A ^ θ M ^ 1 2 to evaluate the performance, where we evaluate A ^ , b ^ , and M ^ according to their unbiased estimators. Additionally, we also compare the performance over a common measurement empirical MSE: MSE= Φθ q π Ξ 2 , where q π is estimated by simulating the target policy π and averaging the discounted cumulative rewards over trajectories.

Hyper-parameter Setting. We run the hyper parameter σ as follows: σ ranges from 0 to 1 with step of 0.02, i.e.,

Σ={ 0,0.02,0.04,,0.98,1.0 }.

It collects 51 results w.r.t. GQ( σ,λ ) . We set λ=0.99 , γ=0.99 , and run the step-size α t ={ 10 2 ,2× 10 2 , 10 3 ,2× 10 3 } , η t = β t / α t ={ 2 0 , 2 1 ,, 2 10 } .

Results Report. All the results shown in Figure 2 and Figure 3 are the average of 5 runs, where we choose the best σ among the 51 results w.r.t. the space Σ The results the proposed GQ( σ,λ ) outperforms the the baseline algorithms. The results of Figure 2 and Figure 3 also show GQ( σ,λ ) with an intermediate σ (between 0 and 1) has a better performance than the extreme case ( σ=0 and 1). This experiment further validates that unifying some existing algorithms can create a better algorithm.

Figure 2. RMSPBE comparison with different baseline algorithms.

Figure 3. MSE comparison with with different baseline algorithms.

6.2. Control Domain: Off-Policy Evaluation

In this section, we test the off-policy evaluation behavior of GQ( σ,λ ) algorithm on mountain car domain, where the agent considers the task of driving an underpowered car up a steep mountain road. The agent receives a reward of -1 at every step until it reaches the goal region at the top of the hill. Since the state space of this domain is continuous, we use the open tile coding software2 to extract feature of states. Recall the states and actions of MountainCar:

S={ ( Velocity,Position ) }=[ 0.07,0.07 ]×[ 1.2,0.6 ],

A={ left,neutral,right }.

In this experiment, if Velocity>0 , we use behavior policy

μ=( 1 100 , 1 100 , 98 100 ),π=( 1 10 , 1 10 , 8 10 );

else

μ=( 98 100 , 1 100 , 1 100 ),π=( 8 10 , 1 10 , 1 10 ).

Note that the target policy π is fixed throughout. Thus, this experiment acts as an off-policy evaluation within the Mountain Car domain. In this experiment, we set the number of tilings to be 4, and there are no white noise features. As suggested by Sutton and Barto [3], we set all the initial parameters to be 0, which is optimistic about causing extensive exploration.

As before, we run the hyper parameter σ as follows: σ ranges from 0 to 1 with step of 0.02. It collects 51 results w.r.t. GQ( σ,λ ) . We also set λ=0.99 , γ=0.99 , and run the step-size α t ={ 10 2 ,2× 10 2 , 10 3 ,2× 10 3 } , η t = β t / α t ={ 2 0 , 2 1 ,, 2 10 } , the dimension of feature p={ 512,1024,2048 } .

Overall Presentation. We give more comprehensive results of the trade-off between σ=0 and σ=1 . We statistic of the number of σ happens for the following three case:

  • (I) GQ( σ,λ ) performs better than both σ=0 and σ=1 .

  • (II) GQ( σ,λ ) performs better than σ=0 or σ=1 .

  • (III) GQ( σ,λ ) performs worse than σ=0 and σ=1 .

The setting of σ is the same as the previous section, and the total number of σ reaches 51.

Table 1. Returns under various parameters.

Case

σ=0

σ=1

σ=0.1

σ=0.2

σ=0.3

σ=0.4

σ=0.5

σ=0.6

σ=0.7

σ=0.8

σ=0.9

α=0.001

p=512

−122.0

−123.6

− 123.5

119.8

117.4

120.8

119.1

−120.5

119.2

119.3

120.1

p=1024

−119.9

−121.9

−127.8

−120.2

−122.4

−122.0

118.4

−121.6

−121.6

−124.0

−120.1

p=2048

−122.3

−121.4

−122.6

−121.9

120.1

−122.3

−122.6

119.6

120.9

−121.4

−122.5

α=0.002

p=512

−126.9

−124.2

124.2

−127.5

−125.3

−125.2

124.0

121.6

−125.4

−125.8

120.2

p=1024

−124.1

−123.1

−122.4

−126.3

−122.3

−123.0

−121.4

−126.2

121.3

−123.6

−126.0

p=2048

−124.8

−124.0

−126.4

−124.6

122.7

123.9

−125.1

122.5

123.6

−126.4

122.7

Table 2. Percentage under various parameters.

Case

p=512

p=1024

p=2048

α=0.001

I

69.8%

20.4%

55.1%

II

14.1%

36.7%

20.4%

III

16.1%

42.9%

24.5%

α=0.002

I

53.1%

6.1%

49.0%

II

38.8%

16.3%

18.4%

III

8.1%

77.6%

32.6%

Results Report. The results shown in Figure 2 and Table 1 are average of 5 runs, and each run contains 400 episodes. The result in Figure 2 shows that GQ( σ,λ ) with an intermediate σ (between 0 and 1) has a better performance than the extreme case ( σ=0 and 1). This experiment further validates that unifying some existing algorithms can create a better algorithm for reinforcement. Table 1 shows the returns mountaincar reaches the goal region at the top of the hill. As shown in Table 2, those results also show GQ( σ,λ ) achieves the best performance at a σ( 0,1 ) , which implies the trade-off between σ=0 and σ=1 . That is to see the GQ( σ,λ ) with a value σ( 0,1 ) that creates a mixture of GQ( λ ) and gradient Tree Backup( λ ) achieves a better performance than both the extreme end σ=0 and σ=1 .

7. Conclusion

In this paper, we extend tabular Q( σ,λ ) with function approximation, and propose GQ( σ,λ ) . We analyze the convergence of GQ( σ,λ ) . Our theory analysis shows that GQ( σ,λ ) converges to its TD fixed-point with probability one. Then, we show GQ( σ,λ ) converges to the optimal solution of the minimizing MSPBE problem. Finally, we conduct experiments on some standard domains to confirm the effectiveness of the proposed GQ( σ,λ ) . Result show that the best performance of GQ( σ,λ ) achieved with a σ( 0,1 ) , neither σ=0 , nor σ=1 . Extending this framework to non-linear function approximation is an important future direction. Such an extension faces theoretical hurdles, such as the parameter-dependence of the Jacobian and the loss of a fixed linear least-squares structure, which may require techniques like target networks or compatible gradients to stabilize.

Funding

The project was partially supported by Scientific Research Fund of Zhejiang Provincial Education Department under Grant No. Y202456228.

Appendix

A. Derivation of (28)-(29)

Proof. Let us calculate MSPBE( θ,λ ) directly,

1 2 J( θ t )= 1 2 θ MSPBE( θ,λ )| θ= θ t = 1 2 θ ( E [ δ t e t,σ ] E [ ϕ t ϕ t ] 1 E[ δ t e t,σ ] ) =( θ E [ δ t e t,σ ] )E [ ϕ t ϕ t ] 1 E[ δ t e t,σ ] =E[ ( γ E π ϕ( S t+1 , ) ϕ t ) e t,σ ]E [ ϕ t ϕ t ] 1 E[ δ t e t,σ ] =E[ γ E π ϕ( S t+1 , ) e t,σ ϕ t e t,σ ]E [ ϕ t ϕ t ] 1 E[ δ t e t,σ ] =E[ ϕ t ϕ t + ϕ t+1 γλ e t,σ γ E π ϕ( S t+1 , ) e t,σ ] E [ ϕ t ϕ t ] 1 E[ δ t e t,σ ] =:ϖ =E[ δ t ES e t,σ ]E[ γ( 1λ ) ϕ ¯ t+1 e t,σ ]ϖ. (39)

B. Proof of Theorem 1

The ODE method (see Lemma 1) is our is our main tool to prove Theorem 1. We refer the reader to that reference for further technical details.

Lemma 1 ([25]). For the stochastic recursion of x n , y n given by

x n+1 = x n + a n [ g( x n , y n )+ M n+1 ( 1 ) ], (40)

y n+1 = y n + b n [ h( x n , y n )+ M n+1 ( 2 ) ],n (41)

if the following assumptions are satisfied:

  • (A1) Step-sizes { a n },{ b n } are positive, satisfying

n a n = n b n =, n a n 2 + b n 2 <, b n a n 0asn.

  • (A2) The map g: d+k d ,h: d+k k are Lipschitz.

  • (A3) The sequence { M n+1 ( 1 ) } n , { M n+1 ( 2 ) } n are martingale difference sequences w.r.t. the increasing σ -fields n = def σ( x m , y m , M m ( 1 ) , M m ( 2 ) ,mn ),n, satisfying

E[ M n+1 ( i ) | n ]=0,i=1,2,n.

Furthermore, { M n+1 ( i ) } n ,i=1,2 , are square-integrable with

E[ M n+1 ( i ) 2 | n ]K( 1+ x n 2 + y n 2 ),

for some constant K>0 .

  • (A4) For each x d , the o.d.e.

y ˙ ( t )=h( x,y( t ) )

has a global asymptotically stable equilibrium Ω( x ) such that: Ω( x ): d k is Lipschitz.

  • (A5) The o.d.e.

x ˙ ( t )=g( x( t ),Ω( x( t ) ) )

has a global asymptotically stable equilibrium x * .

Then, the iterates (40), (41) converge to ( x * ,Ω( x * ) ) a.s. on the set Q = def { sup n x n <, sup n y n < } .

Proof. Now, we apply Lemma 1 to prove results.

Step 1: On Convergence of ωt.

We consider the following ODE:

θ ˙ ( t )=0, ω ˙ ( t )=E[ δ t ES e t |θ( t ) ]Mω( t ).

The equation θ ˙ ( t )=0 implies there exists a constant vector θ such that: θ( t )=θ , thus we can rewrite the above ODE associated ω( t ) as follows,

ω ˙ ( t )=E[ δ t ES e t |θ ]Mω( t )= A σ θ+ b σ Mω( t ). (42)

Then, for any given θ ,

ω = M 1 ( A σ θ+ b σ )

is the unique globally asymptotically stable equilibrium for the ODE (42). Let

H( ω,θ )= A σ θ+ b σ Mω,

for a fixed θ , let

H ( ω,θ )= lim r H( rω( t ),θ ) r =Mω( t ).

Since M is a positive definite matrix, then 0 is a globally asymptotically stable equilibrium for the following ODE

ω ˙ ( t )= H ( ω( t ),θ ).

Let the σ -field t =σ( { θ k , ω k , ϕ k , R k } k<t ) be generated by the set { θ k , ω k , ϕ k , R k } k<t , where t1 . Let

M t+1 = δ t ES e t ϕ t ω t ϕ t E[ δ t ES e t ϕ t ω t ϕ t | t ],

then for each t0 , we have E[ M t+1 | t ]=0 . Furthermore, since Assumption 3 holds, there exists a non-negative constant K 1 >0 , s.t. { M t } t is square-integrable with

E[ M t 2 | t ] K 1 ( 1+ θ t 2 + ω t 2 ).

Since then, we have verified the conditions (A2)-(A5) of Lemma 1, thus

ω k ω 0,w.p.1

where w.p.1 is short for with probability one.

Step 2: On Convergence of θt.

Let the σ -field G t =σ( { θ k , ϕ k , R k } k<t ) be generated by the set { ω k , ϕ k , R k } k<t , where t1 . The iteration (31) that can be rewritten as:

θ t+1 = θ t + α t ( δ t ES e t γ( 1λ ) ϕ ¯ t+1 e t M 1 E[ δ t ES e t | θ t ] ).

Furthermore, we define a random variable N t as follows,

N t = δ t ES e t γ( 1λ ) ϕ ¯ t+1 e t M 1 E[ δ t ES e t | θ t ] E[ δ t ES e t γ( 1λ ) ϕ ¯ t+1 e t M 1 E[ δ t ES e t | θ t ]| G t ],

then for each t0 , we have E[ N t+1 | G t ]=0 . Now, we rewrite N t as follows,

N t = δ t ES e t γ( 1λ ) ϕ ¯ t+1 e t M 1 E[ δ t ES e t θ t ]E[ δ t ES e t θ t ] +γ( 1λ )E[ ϕ ¯ t+1 e t ] M 1 E[ δ t ES e t | θ t ].

Under Assumption 3, for each t0 , there exists a non-negative constant K 2 >0 such that,

E[ N t 2 | G t ] K 2 ( 1+ θ t 2 ).

We consider the iteration (31) associated with the ODE

θ ˙ ( t )=( Iγ( 1λ )E[ ϕ ¯ t+1 e t ] M 1 )E[ δ t ES e t |θ( t ) ]. (43)

Recall M=E[ ϕ t ϕ t ] , from Equation (28), the ODE (43) can be rewritten as follows,

θ ˙ ( t )= A σ M 1 ( A σ θ( t )+ b σ ) = ( 33 ) G( θ( t ) ). (44)

Since A σ is invertible, then θ = A σ 1 b σ is the unique global asymptotically stable equilibrium of ODE (44). Let

G ( θ )= lim r G( rθ,ω ) r = A σ M 1 A σ θ.

We consider the following ODE

θ ˙ ( t )= G ( θ( t ) )= A σ M 1 A σ θ( t ). (45)

Since A σ is invertible and M 1 is positive definite, then A σ M 1 A σ is a positive defined matrix. Equivalently, A σ M 1 A σ is negative definite. Thus the vector 0 is the unique global asymptotically stable equilibrium of (45). According to Lemma 1,

θ k θ 0,w.p.1.

Therefore the proof is completed. □

C. Proof of Theorem 2

Proof. Let ζ t =( e t , S t , A t , S t+1 ) p ×S×A×S , then we rewrite the iteration (31) and (30) as follows

θ t+1 = θ t + α t ( g( θ t , ω t , ζ t )+ e t ι t ),

ω t+1 = ω t + β t ( h( θ t , ω t , ζ t )+ e t ν t ),

where

g( θ t , ω t , ζ t )=E[ δ t ES e t γ( 1λ ) E π [ ϕ( S t+1, ) ] e t ω t ],

e t ι t = δ t ES e t γ( 1λ ) E π [ ϕ( S t+1, ) ] e t ω t g( θ t , ω t , ζ t )

satisfies E[ e t ι t ]=0 ; h( θ t , ω t , ζ t )=E[ δ t ES e t ϕ t ω t ϕ t ] and e t ν t = δ t ES e t ϕ t ω t ϕ t h( θ t , ω t , ζ t ) .

Now, we apply ([26], Theorem 2.3 of Chapter 8) twice, once for each time-scale. We refer the reader to that reference for further technical details in ([26], Theorem 2.3 of Chapter 8), which requires us to verify the following conditions of (i)-(iv):

(i) The random variables g( θ t , ω t , ζ t ) and h( θ t , ω t , ζ t ) are uniformly integrable (UI), i.e.,

lim a sup t E[ g( θ t , ω t , ζ t ) I{ g( θ t , ω t , ζ t ) a } ]=0,

lim a sup t E[ h( θ t , ω t , ζ t ) I{ h( θ t , ω t , ζ t ) a } ]=0,

where I{ } is the indicator function.

In fact, the uniform integrability of both g( θ t , ω t , ζ t ) and h( θ t , ω t , ζ t ) are ensured by the the UI property of e t =λγ c t,σ e t1 + ϕ t , and according to the same analysis of Proposition 2 from [27], we have g( θ t , ω t , ζ t ) and h( θ t , ω t , ζ t ) are UI.

(ii) The set of random variables { ζ t } is tight, i.e, for each positive scalar δ , there exists a compact set D δ p ×S×A×S such that

inf t [ ζ t D δ ]1δ.

In fact, recall the Assumption 3 implies the trace vector e t satisfies sup t e t < , i.e.,

e t 2 2 = k=0 t ( γλ ) tk i=k+1 t c i,σ ϕ k 2 2 ϕ max 2 1 ( γλ( ( 1σ )+σ ρ max ) ) 2 . (46)

For each a>0 , according to Markov inequality, we have

[ e t a ] sup t e t a 0,a,

which implies { e t } is tight. Since we consider the MDPs with finite state space and action space, so the sequence { ζ t =( e t , S t , A t , S t+1 ) } t0 is also tight.

(iii) Let D ζ p ×S×A×S be a compact set, for each ζ D ζ , both g( θ,ω, ζ 0 ) and h( θ,ω, ζ 0 ) are continuous with respect to ( θ,ω ) .

Recall the definitions of g( θ,ω, ζ 0 ) and h( θ,ω, ζ 0 ) , then we have

g( θ,ω, ζ 0 )= A σ M 1 ( A σ θ+ b σ ),

h( θ,ω, ζ 0 )= A σ θ+ b σ Mω.

The Assumption 3 implies A σ ,M , and b σ are bounded, thus both g( θ,ω, ζ 0 ) and h( θ,ω, ζ 0 ) are continuous with respect to ( θ,ω ) .

(iv) Recall the notation ϖ=E [ ϕ t ϕ t ] 1 E[ δ t ES e t ] , for each compact set D ζ p ×S×A×S , let

H m,n = 1 m t=n n+m1 ( h( θ,ω, ζ t )H( θ,ω ) )I( ζ t D ζ ),

G m,n = 1 m t=n n+m1 ( g( θ,ϖ, ζ t )G( θ ) )I( ζ t D ζ ),

X m,n = 1 m t=n n+m1 α t β t g( θ,ω, ζ t ),

then for each ( θ,ω ) p × p , we have:

lim m,n E[ H m,n ]= lim m,n E[ G m,n ]= lim m,n E[ X m,n ]=0.

By the Jensen’s inequality, it is sufficient that

H m,n 0, G m,n 0, X m,n 0,

as m,n . With the fact

E[ h( θ,ω, ζ 0 ) ]= A σ θ+ b σ Mω

and following the same analysis of Proposition 2.3 in [28], we have H m,n 0 , as m,n . Similarly, we have G m,n 0 , as m,n .

Note that g( θ,ω, ζ t ) is Lipschitz continuous with respect to the trace variable e t , where e t ζ t . Thus, there exists a positive scalar L such that g( θ,ω, ζ t ) L e t . From Assumption 3, we have sup t e t < and lim t α t / β t =0 , then we have

X m,n 0,

as m,n . □

NOTES

1The term “weight-duplication trick” we use here is coming from [10] [11], while some other literatures may call it “two-timescale stochastic approximation”, e.g., [18].

2http://incompleteideas.net/rlai.cs.ualberta.ca/RLAI/RLtoolkit/tilecoding.html

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this paper.

References

[1] Sutton, R.S. (1988) Learning to Predict by the Methods of Temporal Differences. Machine Learning, 3, 9-44.[CrossRef]
[2] De Asis, K., Hernandez-Garcia, J., Holland, G. and Sutton, R. (2018) Multi-Step Reinforcement Learning: A Unifying Algorithm. Proceedings of the AAAI Conference on Artificial Intelligence, 32, 2902-2909.[CrossRef]
[3] Sutton, R.S. and Barto, A.G. (2018) Reinforcement Learning: An Introduction. MIT Press.
[4] Rummery, G.A. and Niranjan, M. (1994) Online Q-Learning Using Connectionist Systems, Volume 37. University of Cambridge, Department of Engineering.
[5] Precup, D., Sutton, R.S., Singh, S.P., et al. (2000) Eligibility Traces for Off-Policy Policy Evaluation. Proceedings of the Seventeenth International Conference on Machine Learning, Standord, 29 June-2 July 2000, 759-766.
[6] Yang, L., Shi, M., Zheng, Q., Meng, W. and Pan, G. (2018) A Unified Approach for Multi-Step Temporal-Difference Learning with Eligibility Traces in Reinforcement Learning. Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, Stockholm, 13-19 July 2018, 2984-2990.[CrossRef]
[7] De Asis, K. (2018) A Unified View of Multi-Step Temporal Difference Learning. Ph.D. Thesis, University of Alberta Edmonton.
[8] Sutton, R.S., Mahmood, A.R. and White, M. (2016) An Emphatic Approach to the Problem of Off-Policy Temporal-Difference Learning. Journal of Machine Learning Research, 17, 2603-2631.
[9] Sutton, R.S., Maei, H.R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., et al. (2009) Fast Gradient-Descent Methods for Temporal-Difference Learning with Linear Function Approximation. Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, 14-18 June 2009, 993-1000.[CrossRef]
[10] Maei, H.R., Szepesvári, C., Bhatnagar, S., Precup, D., Silver, D. and Sutton, R.S. (2009) Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation. Proceedings of the 23rd International Conference on Neural Information Processing Systems, Vancouver, 7-10 December 2009, 1204-1212.
[11] Maei, H.R. and Sutton, R.S. (2010) GQ(λ): A General Gradient Algorithm for Temporal-Difference Prediction Learning with Eligibility Traces. Proceedings of the 3rd Conference on Artificial General Intelligence (AGI-10), Lugano, 5-8 March 2010, 100-105.[CrossRef]
[12] Touati, A., Bacon, P.L., Precup, D. and Vincent, P. (2018) Convergent Tree-Backup and Retrace with Function Approximation. arXiv: 1705.09322.
[13] Yang, L., Zheng, G., Zhang, Y., Zheng, Q., Li, P. and Pan, G. (2021) On Convergence of Gradient Expected Sarsa(λ). Proceedings of the AAAI Conference on Artificial Intelligence, 35, 10621-10629. [Google Scholar] [CrossRef]
[14] Bertsekas, D.P. and Tsitsiklis, J.N. (1996) Neuro-Dynamic Programming, Volume 5. Athena Scientific.
[15] Liu, B., Liu, J., Ghavamzadeh, M., Mahadevan, S. and Petrik, M. (2015) Finite-Sample Analysis of Proximal Gradient TD Algorithms. arXiv: 2006.14364.
[16] Dalal, G., Szorenyi, B., Thoppe, G. and Mannor, S. (2018) Finite Sample Analysis of Two-Timescale Stochastic Approximation with Applications to Reinforcement Learning. arXiv: 1703.05376.
[17] Wang, Y., Chen, W., Liu, Y.T., Ma, Z.M. and Liu, T.Y. (2017) Finite Sample Analysis of the GTD Policy Evaluation Algorithms in Markov Setting. arXiv: 1809.08926.
[18] Bhandari, J., Russo, D. and Singal, R. (2018) A Finite Time Analysis of Temporal Difference Learning with Linear Function Approximation. arXiv: 1806.02450.
[19] Xu, T.Y., Zou, S.F. and Liang, Y.B. (2019) Two Time-Scale Off-Policy TD Learning: Non-Asymptotic Analysis over Markovian Samples. arXiv: 1909.11907.
[20] Mahmood, A.R., Yu, H. and Sutton, R.S. (2017) Multi-Step Off-Policy Learning without Importance Sampling Ratios. arXiv: 1702.03006.
[21] Baird, L. (1995) Residual Algorithms: Reinforcement Learning with Function Approximation. In: Prieditis, A. and Russell, S., Eds., Machine Learning Proceedings 1995, Elsevier, 30-37.[CrossRef]
[22] Dann, C., Neumann, G. and Peters, J. (2014) Policy Evaluation with Temporal Differences: A Survey and Comparison. The Journal of Machine Learning Research, 15, 809-883.
[23] Borkar, V.S. (1997) Stochastic Approximation with Two Time Scales. Systems & Control Letters, 29, 291-294.[CrossRef]
[24] Kushner, H. and Yin, G.G. (2003) Stochastic Approximation and Recursive Algorithms and Applications, Volume 35. Springer Science & Business Media.
[25] Yu, H.Z. (2016) Weak Convergence Properties of Constrained Emphatic Temporal-Difference Learning with Constant and Slowly Diminishing Stepsize. Journal of Machine Learning Research, 17, 7745-7802.
[26] Yu, H.Z. (2017) On Convergence of Some Gradient-Based Temporal-Differences Algorithms for Off-Policy Learning. arXiv: 1712.09652.

Copyright © 2026 by authors and Scientific Research Publishing Inc.

Creative Commons License

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.