A Unified Gradient Temporal Difference Learning Algorithm for Off-Policy Learning ()
1. Introduction
In reinforcement learning (RL), unifying some disparate ideas not only providing a better understanding of existing algorithms but also creating better performing algorithms.
For example,
[1] unifies one-step temporal difference learning (if
) and Monte Carlo method (if
) through the trace-decay parameter
. Results show that the unified algorithm
performs best at an intermediate value
rather than the extreme cases of
and
.
The work [2] [3] propose a multi-step
that unifies
-step Sarsa [4] (if
, full-sampling) and
-step Tree-Backup [5] (
, pure-expectation), where the parameter
denotes the degree of the sampling. The work [2] have conducted experiments to show that for some value
,
creates a mixture of full-sampling and pure-expectation approach, which performs better than the extreme case
and
. Later, the work [6] [7] inherit the key idea of unification of
and
, they propose
unifies
[4] and
[5]. The previous works [6] [7] show that
performs best at an intermediate value
.
It is noteworthy that the theoretical analysis of the work [2] [6] [7] only consider the tabular learning, which requires a very large table to store the estimated value function when the state space is huge. That implies the previous methods of [2] [6] [7] are considerably expensive for high-dimensional RL, which is the main focus of this work.
Our Main Works
A practical way to address the high-dimensional curse is using a parametric function to estimate the value function. In this paper, we focus on extending
with linear function approximation. Since the divergence of semi-gradient with multi-step bootstrapping for off-policy learning are well-documented in the existing literature (e.g., [3] [8]), which could also happen in semi-gradient
. To propose a convergent gradient-based algorithm, we derive the
algorithm via the mean square projected Bellman error (MSPBE) objective function [9], that inspired by weight-duplication trick (also known as “two-timescale stochastic approximation”) [9] [10]. When
ranges from 0 to 1,
ranges from gradient
(
) to
[11], i.e., our
unifies gradient
and
.
Although
is a natural algorithm to extend
with linear function approximation, to the best of our knowledge, the update rule of
has not been proposed in the existing literatures. It is worth to notice that Touati et al., [12] have proposed another version of gradient
(
), which is different from the proposed
, we have clarified this point in Remark 3.
Then, we provide the convergence analysis of the proposed
. Theorem 1 shows that
converges to its TD fixed-point with probability one. Additionally, Theorem 1 illustrates the structure of such TD fixed-point: it is the global asymptotically stable equilibrium of its corresponding ordinary differential equation (ODE). For more discussion, see Remark 4. Furthermore, Theorem 2 shows that
converges to an arbitrarily small neighborhood of the optimal solution with probability one.
Finally, our experiments show that when
ranges from 0 to 1,
achieves the best performance of off-policy evaluation or control within a certain
, neither
, nor
, which implies that with a certain value
,
creates a mixture between
and gradient
reaches a better performance than the extreme ends (
and
).
2. Preliminary
In this section, we briefly review the basics of reinforcement learning and off-policy evaluation.
2.1. Reinforcement Learning
Reinforcement learning (RL) [3] is often formalized as Markov decision processes (MDP) that considers a tuple
;
is a set with finite states,
is a set with finite actions;
,
is the probability of state transition from
to
under playing the action
;
is the expected reward function;
.
A policy
is a probability distribution on
, and
denotes the probability of playing
in state
. Let
be generated by
, its state-action value function is:
, where
is conditional expectation on the actions selected according to
. Let
denote Bellman operator with respect to policy
:
(1)
where
,
, their corresponding elements are:
,
. It is well-known that
is the unique fixed point of
, i.e.,
, which is known as Bellman equation.
2.2. Off-Policy Evaluation
Let us consider the trajectory
generated by the behavior policy
, where
,
. Off-policy evaluation is the task to estimate the value function of the target policy
via the data that is generated by the behavior policy
, where
.
Assumption 1 (Ergodicity). The Markov chain induced by behavior policy
is ergodic, i.e., there exists a stationary distribution
over
: for
,
(2)
The ergodicity of behavior policy
is a standard assumption in off-policy learning [3], and it implies each-action pair is visited under this behavior policy
. We use
to denote a diagonal matrix whose diagonal element is
, i.e.,
.
2.3. Temporal Difference Learning and λ-Return
TD learning updates value function as follows,
,
(3)
where
is an estimator of
,
is step-size and
is TD error. Let
, if
(4)
then update (3) is Sarsa [4]. If
is expected TD error:
(5)
then update (3) is Expected Sarsa [4].
The
-return is an average contains all the
-step returns by weighting proportionally to
, where
. In this paper, we mainly consider two classic
-return:
(
) and
.
Tree Backup(
). For each pair
in the trajectory
,
[5] estimates
by
(6)
where
is expected TD error. Precup et al., [5] have proved the iteration (6) converges to
with probability one under some certain conditions.
Expected Sarsa(
). Sutton and Barto [3] have proposed a multi-step TD learning extends Expected Sarsa to
-return version: for each
,
(7)
For the convenience, in the following paragraph, we consider the following notations,
2.4. A Unified View
In this section, we review an approach to unify
and
.
Algorithm. Recently, De Asis et al., [2] propose
unifies multi-step Sarsa and multi-step
. Concretely, according to a mixed TD error
:
(8)
De Asis et al., [2] construct a multi-step estimator:
(9)
where
is sampling parameter. When
ranges from 0 to 1, the update (9) ranges from multi-step
to multi-step Sarsa. Experimental results show that a certain
results in a mixture of
and Sarsa performs better than both
and
[2], which implies unifying some seemingly disparate algorithmic ideas can create better performing algorithms.
Off-Policy
. Later, De Asis, [7] proposes a multi-step returns as follows,
(10)
where
(11)
When
ranges from 0 to 1, the estimator (10) ranges from
(6) to
(7).
We introduce a
-operator
that is a high level view of the
-return (10),
(12)
where
is Bellman operator (1),
, and whose elements are:
.
Remark 1. Yang et al. [6] propose another version of
algorithm that extends
(9) with eligibility trace:
(13)
It is noteworthy that at one extreme end
, both
(9) and
(13) reduce to on-policy learning. Particularly, [6] prove that the performance of
for off-policy evaluation is determined by parameter
:
(14)
where
is a positive constant never reaches 0 no matter how we choose the starting time
. The upper error bound of (14) illustrates the capacity of
for off-policy evaluation decays monotonously when
ranges from 0 to 1. At the extreme end
,
achieves the worst performance of off-policy evaluation. We think this is a natural result since
is an on-policy algorithm exactly. Thus, in this paper, we mainly concern (10) for off-policy learning.
2.5. Linear Function Approximation
TD learning (3) requires a very huge table to store the estimate value function
when
is very large, which implies tabular TD learning is considerably expensive for high-dimensional RL. We often use a parametric function
to approximate
(15)
where
is the parameter need to be learned,
, and each
. Furthermore,
can be rewritten as a version of matrix
(16)
where
is a matrix whose rows are the state-action feature vectors
.
3. Divergence of
with Semi-Gradient
In this section, firstly, we derive the semi-gradient
algorithm; then we briefly analyze the divergence of extending
(10) with semi-gradient method. In fact, the divergence of semi-gradient off-policy TD methods are well-documented in the literature (e.g., [8] [12] [13]), which are not specific to
.
3.1. Semi-Gradient
Recall
is generated by behavior policy
, let
, we define semi-gradient
as follows:
(17)
where
is an off-line estimator of value function according to Equation (10), TD error
, and
is step-size. Let
(18)
(19)
Then update (17) can be rewritten as follows,
(20)
Furthermore, we have
(21)
(22)
where
3.2. Divergence Analysis
Now, we only briefly discuss the divergence lies in the iteration (17). Under the conditions of Proposition 4.8 presented by Bertsekas and Tsitsiklis [14],
(17) converges to a certain point if and only if
is a negative matrix. Furthermore, if iteration (17) converges, then it converges to its unique TD fixed point
that satisfes
(23)
Unfortunately, since the steady state-action distribution doesn’t match the transition probability during off-policy learning, we can not guarantee the negative definiteness of
, thus
may diverge. To clarify this point, we use the classic counterexample [12] to show the divergence of semi-gradient TD algorithms (17) for off-policy learning.
Figure 1. Counterexample from [12]: Two-State MDP.
For the MDP in Figure 1, the behavior policy
, and target policy
. We assign the features
to the state-action pairs
, From the dynamic transition shown in Figure 1, we have
Then, according to (21), we have
and the eigenvalues of
are:
and
. For any initial
, let
, according to (20), the first component of the term
is:
(24)
For any
, if
, then
is a positive scalar, which implies
can not be a negative matrix. Furthermore, if step size
, we have
(25)
Equation (25) is a direct result of the following conclusion that could be found in any calculus textbook. Let
, where
, if
, then
, which implies the way (17) to extend
with linear function approximation via off-line estimate is unstable for off-policy learning.
4. Gradient
In this section, we derive the gradient
(
) algorithm. The proposed
unifies
[11] if
. For more discussion, see Remark 2. At another extreme end
, the proposed
can be seen as a new way to extend
(6) with linear function approximation. Although
is a natural algorithm extends
(6) with linear function approximation, to the best of our knowledge, the update rule of
has not been proposed in the existing literature. For more discussion, see Remark 3.
We derive the gradient the
algorithm via mean square projected Bellman error (MSPBE) [9] objective function as follows,
(26)
where
is an
projection matrix,
is the weighted Euclidean norm:
. Furthermore, we can rewrite
as follows,
(27)
where
.
The gradient method is a natural approach to solve problem (27), however, it is worth to notice that the challenges are two-fold: (I) Firstly, since the invertible matrix
is involved in
, so it is too expensive to apply stochastic gradient to solve the problem (27) directly. (II) Since
involves the product of expectations, then the unbiased estimate of
cannot be obtained via a single sample. It needs to sample twice, so it is a double-sampling problem, which is the second bottleneck of applying gradient to solve the problem (27). Additionally,
cannot also be estimated via a single sample. The above analysis pushes us to find a new practical way to solve the problem (27).
Let
,
,
, then we have the following equation:
(28)
(29)
where
For the limitation of space, we provide the derivation of (28)-(29) in Appendix A. We use the sign convention that the mean update of
follows the negative gradient
. The term (31) provides its unbiased stochastic approximation, while the auxiliary recursion (30) tracks
, circumventing the double-sampling issue.
Let’s consider the term
that is a solution of a least-squares problem, and a typical least mean square (LMS) update rule to find the vector
is:
(30)
where
is step-size. Then, by directly sampling from (29) with (30), we define the update rule of
as follows,
(31)
where
is step-size. We provide the details in Algorithm 1.
Remark 2 (Case of
). When
, we regard
as an approach to extend
(7) with linear function approximation. A more interesting result is that the proposed
is reduced to
[12] exactly. Now, we provide two fresh interpretations to the proposed
:
is at one extreme end (
) of the proposed
, i.e., the proposed
contains
.
Furthermore, just because
is same as
,
can be seen as an extension of the tabular
(7) with linear function approximation, while the original motivation of Maei and Sutton [11] to propose
is to introduce eligibility trace to gradient temporal difference learning for off-policy evaluation. Thus, the proposed
provides a fresh understanding for the
[11].
Computational complexity. The computational complexity of Algorithm 1 is
per step in time, and
in memory, where
is the feature dimension and
is the number of actions. This maintains the same asymptotic efficiency as the baseline
and gradient
methods.
Remark 3 (Case of
). When
, the algorithm
is reduced to a version of extending
(6) with linear function approximation. Although
is a natural algorithm extends
(6) with linear function approximation, to the best of our knowledge, the update rule of
has not been proposed in the existing literatures. It is worth to notice that [12] have also proposed another version of gradient
(
), while the difference between the proposed
and
[12] is reflected at least two aspects:
Firstly, the proposed gradient
and
[12] share the same update rule of
and
, but instead of (31),
updates the parameter
as follows,
(32)
where
.
Secondly, Touati et al., [12] derive the update rule (32) of their
via the convex-concave saddle-point framework [15], while our
is based on the weight-duplication trick1 of (29)-(31).
5. Convergence Analysis
In this section, we prove the convergence of the proposed
. Theorem 1 shows that
converges to its TD fixed-point (23) with probability one. Besides, Theorem 1 illustrates the structure of this TD fixed-point: it is the global asymptotically stable equilibrium of its corresponding ordinary differential equation (ODE). For more discussion, we provide in Remark 4. Theorem 2 shows that
converges to an arbitrarily small neighborhood of the optimal solution with probability one. While we rely on the standard two-timescale ODE method for the general convergence mechanism, our novel theoretical step is verifying that the proposed
-dependent trace coefficient
yields a stable equilibrium structure for the ODEs.
5.1. Additional Assumptions
We need some additional assumptions to present the convergent of
, those assumptions are widely used in reinforcement learning [13] [17]-[19].
Assumption 2 (Diminishing Step-size). The positive sequences
,
satisfy the following conditions with probability one,
Assumption 3. The features
is uniformly bounded by
. The reward function is uniformly bounded by
. The importance sampling
is uniformly bounded by
.
Assumption 4 (Solvability of Problem). The matrix
is non-singular and
.
Assumption 4 requires the non-singular matrix
, which implies the optimal parameter
is well defined. The feature matrix
has linearly independent columns implies the matrix
is positive defined.
5.2. Main Results and Discussion
Theorem 1 (Convergence of Algorithm 1). Under Assumption 1-4, we consider the iteration
that is generated according to (30)-(31). The step-size
,
satisfy Assumption 2 and
, as
. We define two functions
,
as follows,
(33)
(34)
Then
, as
, where
is the unique global asymptotically stable equilibrium with respect to the following ODE correspondingly:
(35)
where
are the functions are defined on continuous time
.
Proof. We provide its proof in Appendix B. □
Remark 4 (TD-Fixed Point of
). Theorem 1 illustrates that the sequence
generated by
converges to the global asymptotically stable equilibrium of its corresponding ODE (35). Furthermore, from the details of the proof in Appendix B, we know since
is invertible and
is positive definite, then
is also positive defined. Because the ODE uses the negative MSPBE-gradient direction, the Jacobian at the equilibrium is
, whose eigenvalues have negative real parts. So the following ODE
(36)
has a unique global asymptotically stable equilibrium
satisfies the equation
(37)
which implies the global asymptotically stable equilibrium
of the ODE
is also the TD fixed point of (23). That means
converges to its TD fixed point:
Remark 5 (Unification of TD-Fixed Point). If
, the matrix
is reduced to
which implies
converges to the TD-fixed point of
[12], i.e.,
shares the same TD-fixed point of
[12] as follows
If
, then the key matrix
is reduced to
which illustrates the TD-fixed point of
(i.e.,
) algorithm as follows
Theorem 1 presents an asymptotic result, which holds only in the limit as the number of iterations increases to infinity. Now, we present a result shows the distance between
and the optimal solution
convergence to 0 in probability. For any
, and
, we introduce a notation
to denote the last iteration before the sum of step-size
between it and the
-th iteration exceeds
. Since we consider the Assumption 2, the notation
is well-defined.
Theorem 2. Under Assumption 1-4, we consider the iteration
generated by (30)-(31). The step-size
,
satisfy Assumption 2. Moreover, for the integers
,
, as
,
Then there exists a sequence of positive numbers
(as
) such that for any
,
(38)
where we use
to denote the
-neighborhood of
.
Proof. We provide its proof in Appendix C. □
Remark 6 Theorem 2 shows that as
, the sequence
collects to an arbitrarily small neighborhood of the optimal solution
with probability one, i.e.,
In fact, the results (38) implies
then we know
That is
.
6. Experiments
In this section, we test the capacity of
for two typical tasks: off-policy evaluation and control.
For off-policy evaluation, we compare
with four state-of-art algorithms:
[11],
[20],
,
[12] over two typical measurements: MSPBE and mean square error (MSE). It is worth noting that if
,
is reduced to a new version of
(we denote it as
-v0), thus, we also show its comparison to
[12] (we denote it as
-v1).
For the control task, our goal is to show the the trade-off between
and
. Empirical results show the
with a value
that creates a mixture of
and gradient
achieves a better performance than both the extreme end
and
.
These domains are standard diagnostic benchmarks intended to isolate the off-policy instability under linear function approximation. Scaling to very high-dimensional environments is left as future work.
6.1. Off-Policy Evaluation Experiments
We present the domains in the experiments as follows.
(1) Two State MDP [12]. This MDP is shown in Figure 1, the behavior policy
, and target policy
. We assign the features
to the state-action pairs
, i.e.,
(2) Baird Star [21]. The Baird Star is an episodic seven states MDP with two actions: dashed action and solid action. In this example, the behavior policy
,
and target policy
. We choose the feature map matrix as follows
where
denotes the identity matrix,
denotes a matrix whose elements are all 0, and
denotes a vector whose elements are all 1. We used
as initial parameter vector for the methods that allow specifying a start estimate, TD-learning is known to diverge for this initialization of the parameter-vector [3] [22].
(3) Cliff Walking. This is a standard undiscounted, episodic task, with start and goal states, and the usual actions causing movement up, down, right, and left.
(4) Grid World. This environment from ([3], Chapter 4), where the agent on an 4 × 4 grid and your goal is to reach the terminal state at the top left or the bottom right corner.
(5) Windy Gridworld. This environment from ([3], Chapter 6). Windy Gridworld problem for reinforcement learning. Actions include going up, down, right, and left. In each column the wind pushes you up a specific number of steps (for the next action). If an action would take you off the grid, you remain in the previous state. For each step you get a reward of −1, until the agent reach into a terminal state.
We summarize the domains, feature settings, target policy and behavior policy below.
Two Measurements for Off-Policy Evaluation. In this section, we use empirical to evaluate the performance, where we evaluate
,
, and
according to their unbiased estimators. Additionally, we also compare the performance over a common measurement empirical MSE:
, where
is estimated by simulating the target policy
and averaging the discounted cumulative rewards over trajectories.
Hyper-parameter Setting. We run the hyper parameter
as follows:
ranges from 0 to 1 with step of 0.02, i.e.,
It collects 51 results w.r.t.
. We set
,
, and run the step-size
,
.
Results Report. All the results shown in Figure 2 and Figure 3 are the average of 5 runs, where we choose the best
among the 51 results w.r.t. the space
The results the proposed
outperforms the the baseline algorithms. The results of Figure 2 and Figure 3 also show
with an intermediate
(between 0 and 1) has a better performance than the extreme case (
and 1). This experiment further validates that unifying some existing algorithms can create a better algorithm.
Figure 2. RMSPBE comparison with different baseline algorithms.
Figure 3. MSE comparison with with different baseline algorithms.
6.2. Control Domain: Off-Policy Evaluation
In this section, we test the off-policy evaluation behavior of
algorithm on mountain car domain, where the agent considers the task of driving an underpowered car up a steep mountain road. The agent receives a reward of -1 at every step until it reaches the goal region at the top of the hill. Since the state space of this domain is continuous, we use the open tile coding software2 to extract feature of states. Recall the states and actions of MountainCar:
In this experiment, if
, we use behavior policy
else
Note that the target policy
is fixed throughout. Thus, this experiment acts as an off-policy evaluation within the Mountain Car domain. In this experiment, we set the number of tilings to be 4, and there are no white noise features. As suggested by Sutton and Barto [3], we set all the initial parameters to be 0, which is optimistic about causing extensive exploration.
As before, we run the hyper parameter
as follows:
ranges from 0 to 1 with step of 0.02. It collects 51 results w.r.t.
. We also set
,
, and run the step-size
,
, the dimension of feature
.
Overall Presentation. We give more comprehensive results of the trade-off between
and
. We statistic of the number of
happens for the following three case:
(I)
performs better than both
and
.
(II)
performs better than
or
.
(III)
performs worse than
and
.
The setting of
is the same as the previous section, and the total number of
reaches 51.
Table 1. Returns under various parameters.
Case |
|
|
|
|
|
|
|
|
|
|
|
|
|
−122.0 |
−123.6 |
− 123.5 |
−119.8 |
−117.4 |
−120.8 |
−119.1 |
−120.5 |
−119.2 |
−119.3 |
−120.1 |
|
|
−119.9 |
−121.9 |
−127.8 |
−120.2 |
−122.4 |
−122.0 |
−118.4 |
−121.6 |
−121.6 |
−124.0 |
−120.1 |
|
|
−122.3 |
−121.4 |
−122.6 |
−121.9 |
−120.1 |
−122.3 |
−122.6 |
−119.6 |
−120.9 |
−121.4 |
−122.5 |
|
|
−126.9 |
−124.2 |
−124.2 |
−127.5 |
−125.3 |
−125.2 |
−124.0 |
−121.6 |
−125.4 |
−125.8 |
−120.2 |
|
|
−124.1 |
−123.1 |
−122.4 |
−126.3 |
−122.3 |
−123.0 |
−121.4 |
−126.2 |
−121.3 |
−123.6 |
−126.0 |
|
|
−124.8 |
−124.0 |
−126.4 |
−124.6 |
−122.7 |
−123.9 |
−125.1 |
−122.5 |
−123.6 |
−126.4 |
−122.7 |
Table 2. Percentage under various parameters.
Case |
|
|
|
|
I |
69.8% |
20.4% |
55.1% |
|
II |
14.1% |
36.7% |
20.4% |
|
III |
16.1% |
42.9% |
24.5% |
|
I |
53.1% |
6.1% |
49.0% |
|
II |
38.8% |
16.3% |
18.4% |
|
III |
8.1% |
77.6% |
32.6% |
Results Report. The results shown in Figure 2 and Table 1 are average of 5 runs, and each run contains 400 episodes. The result in Figure 2 shows that
with an intermediate
(between 0 and 1) has a better performance than the extreme case (
and 1). This experiment further validates that unifying some existing algorithms can create a better algorithm for reinforcement. Table 1 shows the returns mountaincar reaches the goal region at the top of the hill. As shown in Table 2, those results also show
achieves the best performance at a
, which implies the trade-off between
and
. That is to see the
with a value
that creates a mixture of
and gradient
achieves a better performance than both the extreme end
and
.
7. Conclusion
In this paper, we extend tabular
with function approximation, and propose
. We analyze the convergence of
. Our theory analysis shows that
converges to its TD fixed-point with probability one. Then, we show
converges to the optimal solution of the minimizing MSPBE problem. Finally, we conduct experiments on some standard domains to confirm the effectiveness of the proposed
. Result show that the best performance of
achieved with a
, neither
, nor
. Extending this framework to non-linear function approximation is an important future direction. Such an extension faces theoretical hurdles, such as the parameter-dependence of the Jacobian and the loss of a fixed linear least-squares structure, which may require techniques like target networks or compatible gradients to stabilize.
Funding
The project was partially supported by Scientific Research Fund of Zhejiang Provincial Education Department under Grant No. Y202456228.
Appendix
A. Derivation of (28)-(29)
Proof. Let us calculate
directly,
(39)
□
B. Proof of Theorem 1
The ODE method (see Lemma 1) is our is our main tool to prove Theorem 1. We refer the reader to that reference for further technical details.
Lemma 1 ([25]). For the stochastic recursion of
given by
(40)
(41)
if the following assumptions are satisfied:
(A2) The map
are Lipschitz.
(A3) The sequence
are martingale difference sequences w.r.t. the increasing
-fields satisfying
Furthermore,
, are square-integrable with
for some constant
.
has a global asymptotically stable equilibrium
such that:
is Lipschitz.
has a global asymptotically stable equilibrium
.
Then, the iterates (40), (41) converge to
a.s. on the set
.
Proof. Now, we apply Lemma 1 to prove results.
Step 1: On Convergence of ωt.
We consider the following ODE:
The equation
implies there exists a constant vector
such that:
, thus we can rewrite the above ODE associated
as follows,
(42)
Then, for any given
,
is the unique globally asymptotically stable equilibrium for the ODE (42). Let
for a fixed
, let
Since
is a positive definite matrix, then 0 is a globally asymptotically stable equilibrium for the following ODE
Let the
-field
be generated by the set
, where
. Let
then for each
, we have
. Furthermore, since Assumption 3 holds, there exists a non-negative constant
, s.t.
is square-integrable with
Since then, we have verified the conditions (A2)-(A5) of Lemma 1, thus
where
is short for with probability one.
Step 2: On Convergence of θt.
Let the
-field
be generated by the set
, where
. The iteration (31) that can be rewritten as:
Furthermore, we define a random variable
as follows,
then for each
, we have
. Now, we rewrite
as follows,
Under Assumption 3, for each
, there exists a non-negative constant
such that,
We consider the iteration (31) associated with the ODE
(43)
Recall
, from Equation (28), the ODE (43) can be rewritten as follows,
(44)
Since
is invertible, then
is the unique global asymptotically stable equilibrium of ODE (44). Let
We consider the following ODE
(45)
Since
is invertible and
is positive definite, then
is a positive defined matrix. Equivalently,
is negative definite. Thus the vector 0 is the unique global asymptotically stable equilibrium of (45). According to Lemma 1,
Therefore the proof is completed. □
C. Proof of Theorem 2
Proof. Let
, then we rewrite the iteration (31) and (30) as follows
where
satisfies
;
and
.
Now, we apply ([26], Theorem 2.3 of Chapter 8) twice, once for each time-scale. We refer the reader to that reference for further technical details in ([26], Theorem 2.3 of Chapter 8), which requires us to verify the following conditions of (i)-(iv):
(i) The random variables
and
are uniformly integrable (UI), i.e.,
where
is the indicator function.
In fact, the uniform integrability of both
and
are ensured by the the UI property of
, and according to the same analysis of Proposition 2 from [27], we have
and
are UI.
(ii) The set of random variables
is tight, i.e, for each positive scalar
, there exists a compact set
such that
In fact, recall the Assumption 3 implies the trace vector
satisfies
, i.e.,
(46)
For each
, according to Markov inequality, we have
which implies
is tight. Since we consider the MDPs with finite state space and action space, so the sequence
is also tight.
(iii) Let
be a compact set, for each
, both
and
are continuous with respect to
.
Recall the definitions of
and
, then we have
The Assumption 3 implies
, and
are bounded, thus both
and
are continuous with respect to
.
(iv) Recall the notation
, for each compact set
, let
then for each
, we have:
By the Jensen’s inequality, it is sufficient that
as
. With the fact
and following the same analysis of Proposition 2.3 in [28], we have
, as
. Similarly, we have
, as
.
Note that
is Lipschitz continuous with respect to the trace variable
, where
. Thus, there exists a positive scalar
such that
. From Assumption 3, we have
and
, then we have
as
. □
NOTES
1The term “weight-duplication trick” we use here is coming from [10] [11], while some other literatures may call it “two-timescale stochastic approximation”, e.g., [18].
2http://incompleteideas.net/rlai.cs.ualberta.ca/RLAI/RLtoolkit/tilecoding.html