TITLE:
A Unified Gradient Temporal Difference Learning Algorithm for Off-Policy Learning
AUTHORS:
Yafei Zhao, Long Yang
KEYWORDS:
Reinforcement Learning, Off-Policy Learning
JOURNAL NAME:
Journal of Applied Mathematics and Physics, Vol.14 No.6, June 24, 2026
ABSTRACT: In this paper, we propose GQ(σ, λ), a unified gradient temporal difference (GTD) learning algorithm for off-policy learning. The proposed GQ(σ, λ) ranges from gradient Tree-Backup(λ) to GQ(λ) as σ ranges from 0 to 1. We investigate the structure of the TD fixed point of GQ(σ, λ) and prove that GQ(σ, λ) converges to its TD fixed point with probability one. Furthermore, we prove that GQ(σ, λ) converges to an arbitrarily small neighborhood of the optimal solution with probability one. Empirical results show that GQ(σ, λ) with an intermediate value σ ∈ (0, 1), which mixes GQ(λ) and gradient Tree-Backup(λ), achieves better performance than either extreme, σ = 0 or σ = 1.
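To illustrate how σ interpolates between the two extremes, the sketch below computes the standard one-step Q(σ) backup target, which blends a sampled next-action value (σ = 1) with an expectation over the target policy (σ = 0). It is an illustration under stated assumptions rather than the authors' algorithm: it is tabular, omits eligibility traces (λ) and the gradient correction that GQ(σ, λ) adds, and the function name `q_sigma_target` and all of its arguments are hypothetical.

```python
def q_sigma_target(reward, gamma, sigma, q_next, pi_next, next_action):
    """One-step Q(sigma) backup target (tabular, illustrative only).

    q_next      -- action values Q(s', a) for every action a in the next state
    pi_next     -- target-policy probabilities pi(a | s') for the same actions
    next_action -- index of the action actually sampled in s'
    """
    # Full-sampling part: value of the action that was actually taken.
    sampled = q_next[next_action]
    # Pure-expectation part: average over the target policy, as in tree backup.
    expected = sum(p * q for p, q in zip(pi_next, q_next))
    # sigma = 1 keeps only the sample; sigma = 0 keeps only the expectation.
    return reward + gamma * (sigma * sampled + (1.0 - sigma) * expected)


if __name__ == "__main__":
    q_next = [1.0, 3.0]    # hypothetical next-state action values
    pi_next = [0.5, 0.5]   # hypothetical target policy in the next state
    for sigma in (0.0, 0.5, 1.0):
        target = q_sigma_target(1.0, 0.9, sigma, q_next, pi_next, next_action=1)
        print(f"sigma={sigma}: target={target:.3f}")
```

Intermediate values of σ give a target that lies between the expectation-based and the sample-based backups, which is the mixture the abstract refers to.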