<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article  PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="3.0" xml:lang="en" article-type="research article"><front><journal-meta><journal-id journal-id-type="publisher-id">JDAIP</journal-id><journal-title-group><journal-title>Journal of Data Analysis and Information Processing</journal-title></journal-title-group><issn pub-type="epub">2327-7211</issn><publisher><publisher-name>Scientific Research Publishing</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.4236/jdaip.2016.44014</article-id><article-id pub-id-type="publisher-id">JDAIP-71237</article-id><article-categories><subj-group subj-group-type="heading"><subject>Articles</subject></subj-group><subj-group subj-group-type="Discipline-v2"><subject>Computer Science&amp;Communications</subject><subject> Physics&amp;Mathematics</subject></subj-group></article-categories><title-group><article-title>
 
 
  Double Sarsa and Double Expected Sarsa with Shallow and Deep Learning
 
</article-title></title-group><contrib-group><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Ganger</surname><given-names>Michael</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref><xref ref-type="corresp" rid="cor1"><sup>*</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Duryea</surname><given-names>Ethan</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref><xref ref-type="corresp" rid="cor1"><sup>*</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Hu</surname><given-names>Wei</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref><xref ref-type="corresp" rid="cor1"><sup>*</sup></xref></contrib></contrib-group><aff id="aff1"><addr-line>Computer Science Department, Houghton College, Houghton, NY, USA</addr-line></aff><author-notes><corresp id="cor1">* E-mail: <email>michael.ganger17@houghton.edu</email> (MG); <email>ethan.duryea18@houghton.edu</email> (ED); <email>wei.hu@houghton.edu</email> (WH)</corresp></author-notes><pub-date pub-type="epub"><day>13</day><month>10</month><year>2016</year></pub-date><volume>04</volume><issue>04</issue><fpage>159</fpage><lpage>176</lpage><history><date date-type="received"><day>26</day><month>07</month><year>2016</year></date><date date-type="rev-recd"><day>Accepted:</day>	<month>October</month>	<year>14,</year>	</date><date date-type="accepted"><day>October</day>	<month>17,</month>	<year>2016</year></date></history><permissions><copyright-statement>&#169; Copyright  2014 by authors and Scientific Research Publishing Inc. </copyright-statement><copyright-year>2014</copyright-year><license><license-p>This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0/</license-p></license></permissions><abstract><p>
 
 
  Double Q-learning has been shown to be effective in reinforcement learning scenarios when the reward system is stochastic. We apply the idea of double learning that this algorithm uses to Sarsa and Expected Sarsa, producing two new algorithms called Double Sarsa and Double Expected Sarsa that are shown to be more robust than their single counterparts when rewards are stochastic. We find that these algorithms add a significant amount of stability in the learning process at only a minor computational cost, which leads to higher returns when using an on-policy algorithm. We then use shallow and deep neural networks to approximate the action-value, and show that Double Sarsa and Double Expected Sarsa are much more stable after convergence and can collect larger rewards than the single versions.
 
</p></abstract><kwd-group><kwd>Double Sarsa</kwd><kwd> Double Expected Sarsa</kwd><kwd> Reinforcement Learning</kwd><kwd> Deep Learning</kwd></kwd-group></article-meta></front><body><sec id="s1"><title>1. Introduction</title><p>Reinforcement learning is concerned with finding optimal solutions to the class of problems that can be described as agent-environment interactions. The agent explores and takes actions in an environment, which gives the agent a reward, r, for each state, <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x2.png" xlink:type="simple"/></inline-formula>, into which the agent transitions as a result of taking action a from the initial state s. The goal is to find an optimal policy <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x3.png" xlink:type="simple"/></inline-formula> that maximizes the expected reward collected by the agent [<xref ref-type="bibr" rid="scirp.71237-ref1">1</xref>] . Often, this is described as a Markov Decision Process (MDP), which groups this sequence into an experience:<inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x4.png" xlink:type="simple"/></inline-formula>. In an MDP, the state s fully describes the environment, meaning that no other information is required to choose the next action; in other words, information from all previous states that can affect all future states is expressed in s.</p><p>There are multiple approaches to finding the optimal policy<inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x5.png" xlink:type="simple"/></inline-formula>, which gives the probability of taking an action a given a state s. 
One set of techniques, known as Policy Gradient methods, directly search the space of available policies for one that maximizes the accumulated discounted reward per episode, <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x6.png" xlink:type="simple"/></inline-formula>, where <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x7.png" xlink:type="simple"/></inline-formula> is the discount rate [<xref ref-type="bibr" rid="scirp.71237-ref2">2</xref>] . Another set of approaches, called Temporal Difference (TD) methods [<xref ref-type="bibr" rid="scirp.71237-ref3">3</xref>] , estimate the value of a particular state, <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x8.png" xlink:type="simple"/></inline-formula>, or state-action pair, <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x9.png" xlink:type="simple"/></inline-formula>, and use these values to derive a policy <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x10.png" xlink:type="simple"/></inline-formula> that maximizes these value functions at each step instead of maximizing g. There are other techniques that combine the ideas of Policy Gradient and Temporal Difference methods, most notably a class of algorithms called Actor Critic [<xref ref-type="bibr" rid="scirp.71237-ref4">4</xref>] [<xref ref-type="bibr" rid="scirp.71237-ref5">5</xref>] , but in this paper we only consider algorithms that fall under the Temporal Difference category. Within this category, there are two main types of algorithms: on-policy and off-policy [<xref ref-type="bibr" rid="scirp.71237-ref4">4</xref>] . With off-policy algorithms, the target policy being learned is different from the behavior policy, which is the policy that the agent uses to explore the environment. 
For example, the behavior policy might be to choose completely random actions, while the target policy might be to always take the action with the largest expected return. In contrast to off-policy algorithms, the target and behavior policies are the same with on-policy algorithms.</p><p>One of the most popular Temporal Difference algorithms is Q-learning, first proposed in [<xref ref-type="bibr" rid="scirp.71237-ref6">6</xref>] . Q-learning is an off-policy algorithm that learns the greedy action-value <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x11.png" xlink:type="simple"/></inline-formula> by updating the estimate <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x12.png" xlink:type="simple"/></inline-formula> at every step. Although it is guaranteed to converge when the environment and rewards are deterministic, it is less robust in scenarios where these are stochastic. An extension of the Q-learning algorithm, called Double Q-learning [<xref ref-type="bibr" rid="scirp.71237-ref7">7</xref>] , uses two action-value estimates <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x13.png" xlink:type="simple"/></inline-formula> and<inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x14.png" xlink:type="simple"/></inline-formula>, improving the performance of Q-learning in these stochastic scenarios. 
Generally, the average of <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x15.png" xlink:type="simple"/></inline-formula> and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x16.png" xlink:type="simple"/></inline-formula> tends to be below the estimate that Q-learning makes, and is sometimes below<inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x17.png" xlink:type="simple"/></inline-formula>.</p><p>However, in many scenarios an off-policy algorithm is not realistic as it does not account for possible rewards and penalties that might result from an exploratory behavior policy. For example, receiving immediate returns might be more important than a true optimal policy, and while an on-policy algorithm may not converge to the optimal policy, it may still converge in fewer time steps than an off-policy algorithm to a policy which may be considered “sufficient” according to the problem domain. Often, these on-policy algorithms have stochastic policies that encourage exploration of the environment, which can also be a beneficial quality when the environment is subject to change. One such policy is called <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x18.png" xlink:type="simple"/></inline-formula>-greedy [<xref ref-type="bibr" rid="scirp.71237-ref6">6</xref>] , which uses the parameter <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x19.png" xlink:type="simple"/></inline-formula> to control the probability that the optimal action will be taken over a random one.</p><p>A simple on-policy algorithm that is similar to Q-learning is called Sarsa [<xref ref-type="bibr" rid="scirp.71237-ref8">8</xref>] . Like Q-learning, it learns the action-values at each step, but unlike Q-learning, it depends solely on the states visited and actions taken. 
Because of this, Sarsa’s action-value estimate <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x20.png" xlink:type="simple"/></inline-formula> never converges when the learning rate <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x21.png" xlink:type="simple"/></inline-formula> is constant, although for a sufficiently small <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x22.png" xlink:type="simple"/></inline-formula>, the policy can converge to one that balances exploration and exploitation. For example, if an <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x23.png" xlink:type="simple"/></inline-formula>-greedy policy is used, the policy that Sarsa converges to will avoid states that are adjacent to other states with a large negative reward. In other words, the policy will account for the possibility of random actions and take a path which figuratively does not come “too close to the edge”. A similar on-policy algorithm is called Expected Sarsa [<xref ref-type="bibr" rid="scirp.71237-ref9">9</xref>] ; like Sarsa, this algorithm converges to a policy that balances exploration and exploitation. 
However, unlike Sarsa, the action-value estimate also converges, which allows much higher learning rates to be used. Notably, the policies of both of these algorithms can only converge if the reward is deterministic; if it is stochastic, the policies are much less likely to converge (unless a sufficiently small learning rate is used).</p><p>The algorithms presented above use a tabular format to store the action-values, i.e. there is a single entry in the table for every (s, a) pair; as such, they are limited to simple problems where the state-action space is small. For many real problems, this is not the case, especially when the state-space is continuous; function approximation must be used instead. In application to Temporal Difference algorithms, it is the value functions that are approximated [<xref ref-type="bibr" rid="scirp.71237-ref10">10</xref>] , and a variety of techniques from supervised learning are used. A more recent development is the application of deep learning [<xref ref-type="bibr" rid="scirp.71237-ref11">11</xref>] to Q-learning, termed a Deep Q-Network [<xref ref-type="bibr" rid="scirp.71237-ref12">12</xref>] . Deep learning, the term given to function approximation with neural networks that have many layers, has been shown to be effective in reinforcement learning problems with large state-action spaces, such as those encountered in Atari games. Double learning has also been applied to Deep Q-Networks, which is referred to as Deep Double Q-learning [<xref ref-type="bibr" rid="scirp.71237-ref13">13</xref>] , and has shown success in the same domain. However, recent work has shown that shallow networks can achieve similar results [<xref ref-type="bibr" rid="scirp.71237-ref14">14</xref>] , so the advantage of deep learning over shallow learning appears to be highly domain-dependent. 
Additionally, deep learning has been applied to Actor Critic methods, combining Deep Q-networks with recent development in deterministic policy gradients [<xref ref-type="bibr" rid="scirp.71237-ref15">15</xref>] to produce a robust learning algorithm [<xref ref-type="bibr" rid="scirp.71237-ref16">16</xref>] .</p><p>The current state-of-the-art in reinforcement learning can be seen in [<xref ref-type="bibr" rid="scirp.71237-ref17">17</xref>] , which combined many techniques in order to learn the game of Go. This study used supervised learning to initialize a policy network, and then improved this network through self-play and generated new data. This data was then used to train a value network. During game play, a Monte Carlo Tree Search algorithm was used to simulate future moves and choose the best action. This efficient use of data to solve a problem with about 2.08 &#215; 10<sup>170</sup> states [<xref ref-type="bibr" rid="scirp.71237-ref18">18</xref>] represents a significant achievement in the field of reinforcement learning, and shows the power of combining multiple different techniques. The study used a combination of supervised learning and reinforcement learning to train both a policy and a value network, and combined real data with simulated data to improve their training. Additionally, although during training many CPUs and GPUs were used, the final rollout action selection was very efficient and ran on a single machine in a short period of time.</p></sec><sec id="s2"><title>2. Algorithms</title><p>In this paper, we present two new algorithms that extend from the Sarsa and Expected Sarsa algorithms, which we refer to as Double Sarsa and Double Expected Sarsa. 
The concept of doubling the algorithms comes from Double Q-learning, where two estimates of the action-value <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x24.png" xlink:type="simple"/></inline-formula> are decoupled and updated against each other in order to improve the rate of learning in an environment with a stochastic reward system. Although Q-learning and Double Q-learning are off-policy, this concept extends naturally to the on-policy algorithms of Sarsa and Expected Sarsa, producing a variation of each algorithm that is less susceptible to variations in the reward system. In addition, the ideas of Double Sarsa and Double Expected Sarsa can be extended with function approximation of the action-values, in the same way that Q-learning can be extended to Deep Q-networks through function approximation.</p><sec id="s2_1"><title>2.1. Double Sarsa</title><p>The update rule for Double Q-learning is what distinguishes it from standard Q-learning. In Q-learning, the action-value is updated according to</p><disp-formula id="scirp.71237-formula124"><label>(1)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/2-2870146x25.png"  xlink:type="simple"/></disp-formula><p>where s is the initial state, a is the action taken from that state, r is the reward observed from taking action a, and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x26.png" xlink:type="simple"/></inline-formula> is the next state the agent reaches after taking action a in state s. 
In Double Q-learning, the update is decoupled using two tables, <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x27.png" xlink:type="simple"/></inline-formula> and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x28.png" xlink:type="simple"/></inline-formula>:</p><disp-formula id="scirp.71237-formula125"><label>(2)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/2-2870146x29.png"  xlink:type="simple"/></disp-formula><p>The key idea is the replacement of the maximum action-value, <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x30.png" xlink:type="simple"/></inline-formula>, with the value in a second table. This serves to decouple the two tables, tending to reduce susceptibility to random variation in r and stabilize the action-values. 
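As a hedged illustration of updates (1) and (2), the following Python sketch assumes the standard forms of the two rules: Q-learning bootstraps from the maximum of its single table, while Double Q-learning selects the greedy action with one table and reads that action's value from the other. The function names, the list-of-lists table layout, and the numeric values are hypothetical and are not taken from the paper.
<preformat><![CDATA[
```python
# Tabular updates for Q-learning (1) and Double Q-learning (2).
# Tables are lists of lists indexed as table[state][action]; all names
# here are hypothetical and chosen only for this illustration.


def q_learning_update(q, s, a, r, s_next, alpha, gamma):
    """Move q[s][a] toward r + gamma * max over a' of q[s_next][a']."""
    target = r + gamma * max(q[s_next])
    q[s][a] += alpha * (target - q[s][a])


def double_q_update(q_a, q_b, s, a, r, s_next, alpha, gamma):
    """Update q_a: the greedy action comes from q_a, its value from q_b."""
    n_actions = len(q_a[s_next])
    best = max(range(n_actions), key=lambda i: q_a[s_next][i])
    target = r + gamma * q_b[s_next][best]
    q_a[s][a] += alpha * (target - q_a[s][a])


# Two states, two actions; q_a is optimistic about action 1 in state 1.
q_a = [[0.0, 0.0], [1.0, 5.0]]
q_b = [[0.0, 0.0], [1.0, 2.0]]
q_single = [row[:] for row in q_a]

q_learning_update(q_single, 0, 0, 1.0, 1, alpha=0.5, gamma=0.9)
double_q_update(q_a, q_b, 0, 0, 1.0, 1, alpha=0.5, gamma=0.9)
```
]]></preformat>
In this toy case the single-table target uses the optimistic value 5, whereas the decoupled target uses the second table's value 2 for the same action, so the updated estimate is smaller (1.4 rather than 2.75). 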
Additionally, the roles of the two tables <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x31.png" xlink:type="simple"/></inline-formula> and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x32.png" xlink:type="simple"/></inline-formula> are periodically switched, meaning that each table is only updated using half of all the experiences and that there is only a marginal increase in computational cost over having a single table.</p><p>The update rule for Double Sarsa is very similar to that used for Double Q-learning. However, because it is on-policy, a few modifications are necessary. First, we use an <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x33.png" xlink:type="simple"/></inline-formula>-greedy policy that uses the average of the two tables to determine the greedy action,</p><disp-formula id="scirp.71237-formula126"><label>(3)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/2-2870146x34.png"  xlink:type="simple"/></disp-formula><p>where <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x35.png" xlink:type="simple"/></inline-formula> is the probability of taking action a from state s, and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x36.png" xlink:type="simple"/></inline-formula> is the number of actions that can be taken from state s. 
In general, any policy derived from the average of <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x37.png" xlink:type="simple"/></inline-formula> and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x38.png" xlink:type="simple"/></inline-formula> can be used, such as <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x39.png" xlink:type="simple"/></inline-formula>-greedy or softmax [<xref ref-type="bibr" rid="scirp.71237-ref19">19</xref>] . 
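A minimal sketch of such a policy follows, assuming the usual ε-greedy form in which the greedy action under the averaged tables receives probability 1 − ε + ε/N and every other action receives ε/N, with N the number of actions. The helper names and the example tables are hypothetical, and ties are broken in favor of the lowest action index, which is an arbitrary choice made only for this example.
<preformat><![CDATA[
```python
import random


def epsilon_greedy_probs(q_a, q_b, s, epsilon):
    """Action probabilities in state s, greedy w.r.t. the averaged tables."""
    n_actions = len(q_a[s])
    avg = [(q_a[s][i] + q_b[s][i]) / 2.0 for i in range(n_actions)]
    # max() returns the first maximal index, so ties go to the lowest action.
    greedy = max(range(n_actions), key=lambda i: avg[i])
    probs = [epsilon / n_actions] * n_actions
    probs[greedy] += 1.0 - epsilon
    return probs


def sample_action(probs, rng=random):
    """Draw an action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# One state, three actions. Neither table alone prefers action 1 as
# strongly as the average does: the averages are [2.0, 2.5, 0.0].
q_a = [[1.0, 4.0, 0.0]]
q_b = [[3.0, 1.0, 0.0]]
probs = epsilon_greedy_probs(q_a, q_b, s=0, epsilon=0.3)
action = sample_action(probs)
```
]]></preformat>
Here the second table on its own would prefer action 0, but the averaged values make action 1 greedy, so it receives probability 0.8 and the other two actions 0.1 each. 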
The update rule then becomes</p><disp-formula id="scirp.71237-formula127"><label>(4)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/2-2870146x40.png"  xlink:type="simple"/></disp-formula><p>Because Sarsa does not take the maximum action-value in the update rule, but does so instead during the computation of the greedy policy, there is a weaker decoupling of the two tables. However, <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x41.png" xlink:type="simple"/></inline-formula> is still updated using the value from <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x42.png" xlink:type="simple"/></inline-formula> for the state-action pair <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x43.png" xlink:type="simple"/></inline-formula>, which helps to reduce the variation in the action-value.</p><p><xref ref-type="fig" rid="fig1">Figure 1</xref> shows the algorithm for Double Sarsa, using a generic policy <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x44.png" xlink:type="simple"/></inline-formula> that balances exploration and exploitation. 
Unlike Double Q-learning, where the algorithm updates <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x45.png" xlink:type="simple"/></inline-formula> or <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x46.png" xlink:type="simple"/></inline-formula> with equal probability, in Double Sarsa <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x47.png" xlink:type="simple"/></inline-formula> and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x48.png" xlink:type="simple"/></inline-formula> are instead swapped with equal probability to simplify implementation. 
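One way to sketch a single Double Sarsa step in Python is shown next. It assumes that update (4) moves the first table toward the reward plus the discounted value that the second table assigns to the next state-action pair, reads "with equal probability" as a swap probability of 0.5, and treats the discount as zero at a terminal state as noted in the caption of Figure 1. The next action is assumed to have been drawn already from a policy based on the averaged tables; the function name and argument layout are hypothetical.
<preformat><![CDATA[
```python
import random


def double_sarsa_update(q_a, q_b, s, a, r, s_next, a_next,
                        alpha, gamma, terminal=False, rng=random):
    """Apply the Double Sarsa update to q_a, then maybe swap the tables.

    Returns the pair (q_a, q_b) to use on the next step.
    """
    # The discount is treated as zero when s_next is terminal.
    discount = 0.0 if terminal else gamma
    # The bootstrap value for (s_next, a_next) is read from the OTHER table.
    target = r + discount * q_b[s_next][a_next]
    q_a[s][a] += alpha * (target - q_a[s][a])
    # Swap the two references with probability 0.5, so that each table
    # is updated with roughly half of the experiences.
    if rng.random() < 0.5:
        return q_b, q_a
    return q_a, q_b


first = [[0.0, 0.0], [0.0, 0.0]]
second = [[0.0, 0.0], [2.0, 4.0]]
new_a, new_b = double_sarsa_update(first, second, s=0, a=1, r=1.0,
                                   s_next=1, a_next=1,
                                   alpha=0.5, gamma=0.9)
```
]]></preformat>
Swapping references rather than choosing which table to update keeps the update code identical on every step, which appears to be the simplification the text refers to. 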
This algorithm is very similar to the original Sarsa algorithm [<xref ref-type="bibr" rid="scirp.71237-ref4">4</xref>] , except for the addition of the second action-value table and the swapping of the two tables.</p></sec><sec id="s2_2"><title>2.2. Double Expected Sarsa</title><p>Expected Sarsa is a more recently developed algorithm that improves on the on-policy nature of Sarsa. Because Sarsa has an update rule that requires the next action <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x49.png" xlink:type="simple"/></inline-formula>, it cannot converge unless the learning rate is reduced (<inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x50.png" xlink:type="simple"/></inline-formula>) or exploration is annealed (<inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x51.png" xlink:type="simple"/></inline-formula>), as <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x52.png" xlink:type="simple"/></inline-formula> always has a degree of randomness. 
Expected Sarsa changes this with an update rule that takes the expected action-value instead of the action-value of <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x53.png" xlink:type="simple"/></inline-formula>:</p><disp-formula id="scirp.71237-formula128"><label>(5)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/2-2870146x54.png"  xlink:type="simple"/></disp-formula><p>Because the update no longer depends on the next action taken, but instead depends on the expected action-value, Expected Sarsa can indeed converge; [<xref ref-type="bibr" rid="scirp.71237-ref9">9</xref>] notes that for the case of a greedy policy <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x55.png" xlink:type="simple"/></inline-formula>, Expected Sarsa is the same as Q-learning. 
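The relationship can be checked numerically with a small Python sketch, assuming update (5) has the standard Expected Sarsa form in which the bootstrap term is the policy-weighted sum of the next state's action-values. The function name, the table layout, and the numbers are hypothetical.
<preformat><![CDATA[
```python
def expected_sarsa_update(q, probs_next, s, a, r, s_next, alpha, gamma):
    """Update q[s][a] toward r + gamma * sum of pi(a'|s') * q[s'][a']."""
    expected = sum(p * v for p, v in zip(probs_next, q[s_next]))
    target = r + gamma * expected
    q[s][a] += alpha * (target - q[s][a])


# Stochastic policy in the next state: the expectation is
# 0.25 * 1.0 + 0.75 * 3.0 = 2.5.
q_stochastic = [[0.0, 0.0], [1.0, 3.0]]
expected_sarsa_update(q_stochastic, [0.25, 0.75], 0, 0, 1.0, 1,
                      alpha=0.5, gamma=0.9)

# Greedy policy in the next state: all probability is on the best action,
# so the expectation equals the maximum and the target is the Q-learning one.
q_greedy = [[0.0, 0.0], [1.0, 3.0]]
expected_sarsa_update(q_greedy, [0.0, 1.0], 0, 0, 1.0, 1,
                      alpha=0.5, gamma=0.9)
q_learning_value = 0.5 * (1.0 + 0.9 * max(q_greedy[1]))
```
]]></preformat>
Because the target no longer depends on which next action happened to be sampled, repeated updates from the same experience give the same result, which is the source of the reduced variance described above. 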
In order to adapt this to Double Expected Sarsa, we change the summation to be over <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x56.png" xlink:type="simple"/></inline-formula> instead of <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x57.png" xlink:type="simple"/></inline-formula>:</p><disp-formula id="scirp.71237-formula129"><label>(6)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/2-2870146x58.png"  xlink:type="simple"/></disp-formula><p>Although Expected Sarsa can be both on-policy and off-policy, here we discuss only</p><fig id="fig1"  position="float"><label><xref ref-type="fig" rid="fig1">Figure 1</xref></label><caption><title> Double Sarsa algorithm, with tabular representation of the action-values. Lines 10 and 11 swap the references to <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x60.png" xlink:type="simple"/></inline-formula> and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x61.png" xlink:type="simple"/></inline-formula>, meaning that each table is updated using half of the experiences. 
Note that <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x62.png" xlink:type="simple"/></inline-formula> if the next state <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x63.png" xlink:type="simple"/></inline-formula> is terminal, otherwise it is the discount rate.</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/2-2870146x59.png"/></fig><p>the on-policy version as it often has more utility; in Expected Sarsa, <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x64.png" xlink:type="simple"/></inline-formula> represents the estimated action-value under target policy <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x65.png" xlink:type="simple"/></inline-formula>, which is the same as the behavior policy when it is on-policy. If the behavior policy and target policy are different (i.e. it is off-policy), it is usually more desirable for the target policy to be greedy, and not stochastic, in which case Expected Sarsa degenerates to Q-learning. 
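For the on-policy case, update (6) might be sketched in Python as follows. The sketch assumes an ε-greedy policy computed from the average of the two tables, an expectation taken over the second table's values, a discount of zero at terminal states, and a swap of the table references with probability 0.5 as in Double Sarsa. All function names, parameters, and numbers are hypothetical.
<preformat><![CDATA[
```python
import random


def policy_probs(q_a, q_b, s, epsilon):
    """Epsilon-greedy probabilities from the average of the two tables."""
    n = len(q_a[s])
    avg = [(q_a[s][i] + q_b[s][i]) / 2.0 for i in range(n)]
    greedy = max(range(n), key=lambda i: avg[i])
    probs = [epsilon / n] * n
    probs[greedy] += 1.0 - epsilon
    return probs


def double_expected_sarsa_update(q_a, q_b, s, a, r, s_next,
                                 alpha, gamma, epsilon,
                                 terminal=False, rng=random):
    """Update q_a toward r + gamma * sum of pi(a'|s') * q_b[s'][a']."""
    discount = 0.0 if terminal else gamma
    probs = policy_probs(q_a, q_b, s_next, epsilon)
    # The expectation is taken over the OTHER table's action-values.
    expected = sum(p * v for p, v in zip(probs, q_b[s_next]))
    q_a[s][a] += alpha * (r + discount * expected - q_a[s][a])
    # Swap the references with probability 0.5 (assumed swap rate).
    if rng.random() < 0.5:
        return q_b, q_a
    return q_a, q_b


first = [[0.0, 0.0], [2.0, 0.0]]
second = [[0.0, 0.0], [4.0, 1.0]]
double_expected_sarsa_update(first, second, s=0, a=0, r=1.0, s_next=1,
                             alpha=0.5, gamma=0.9, epsilon=0.2)
```
]]></preformat>
With these numbers the policy in the next state is [0.9, 0.1], the expectation over the second table is 3.7, and the updated entry of the first table is 0.5 × (1 + 0.9 × 3.7) = 2.165. 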
The on-policy Double Expected Sarsa algorithm is shown in <xref ref-type="fig" rid="fig2">Figure 2</xref>, with lines 8 and 9 being the only differences from Double Sarsa. The two tables are again decoupled, this time in calculating the expected value <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x66.png" xlink:type="simple"/></inline-formula> under the current policy. Although the action <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x67.png" xlink:type="simple"/></inline-formula> is chosen in line 7, it is not needed until the next iteration (it is shown as such in order to be consistent with the Double Sarsa algorithm in <xref ref-type="fig" rid="fig1">Figure 1</xref>).</p></sec><sec id="s2_3"><title>2.3. Neural Network Approximation of Q(s, a)</title><p>Often, it is advantageous to represent the action-value function <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x68.png" xlink:type="simple"/></inline-formula> with a form of function approximation, especially when the state space is large or continuous. 
The simplest representation is a linear combination of the state-action features, <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x69.png" xlink:type="simple"/></inline-formula>, using a vector of weights, <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x70.png" xlink:type="simple"/></inline-formula>. In other words,</p><disp-formula id="scirp.71237-formula130"><label>(7)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/2-2870146x71.png"  xlink:type="simple"/></disp-formula><p>If <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x72.png" xlink:type="simple"/></inline-formula> is a one-hot encoding for each <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x73.png" xlink:type="simple"/></inline-formula> pair, this degenerates to the tabular form discussed above. However, it is often beneficial to introduce non-linearities into the function approximator; one set of functions that do so are known as neural networks. 
The action-value function can be written more generally to accommodate this change of form:</p><disp-formula id="scirp.71237-formula131"><label>(8)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/2-2870146x74.png"  xlink:type="simple"/></disp-formula><fig id="fig2"  position="float"><label><xref ref-type="fig" rid="fig2">Figure 2</xref></label><caption><title> Double Expected Sarsa algorithm, with tabular representation of the action-values. Lines 11 and 12 swap the references to <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x76.png" xlink:type="simple"/></inline-formula> and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x77.png" xlink:type="simple"/></inline-formula>, meaning each table is updated using half of the experiences. Note that <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x78.png" xlink:type="simple"/></inline-formula> if <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x79.png" xlink:type="simple"/></inline-formula> is terminal, otherwise it is the discount rate</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  
xlink:href="http://html.scirp.org/file/2-2870146x75.png"/></fig><p>where <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x80.png" xlink:type="simple"/></inline-formula> is a feature vector that represents the state s, <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x81.png" xlink:type="simple"/></inline-formula> is a vector that represents the parameters of the network, and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x82.png" xlink:type="simple"/></inline-formula> is the component of the vector-valued function f that corresponds to action a. It is important to note that this function approximation allows for a continuous state-space, but a discrete action-space; the approximation can be extended further to continuous action-spaces as well, especially in actor-critic algorithms [<xref ref-type="bibr" rid="scirp.71237-ref16">16</xref>], but in this paper we only discuss the former approximation. 
In order to update the Sarsa network, we use a target similar to the target used in the tabular form,</p><disp-formula id="scirp.71237-formula132"><label>(9)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/2-2870146x83.png"  xlink:type="simple"/></disp-formula><p>and for Expected Sarsa,</p><disp-formula id="scirp.71237-formula133"><label>(10)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/2-2870146x84.png"  xlink:type="simple"/></disp-formula><p>Deep Double Sarsa and Deep Double Expected Sarsa use two different neural networks that have the same structure; we represent these two networks by their parameters <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x85.png" xlink:type="simple"/></inline-formula> and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x86.png" xlink:type="simple"/></inline-formula>. 
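As a rough illustration of how two networks with the same structure can be decoupled when forming targets, the sketch below evaluates the next state with the parameters of the network that is not being updated. The one-hidden-layer tanh architecture, the parameter layout, and the names q_values and double_targets are assumptions made only for this example.
<preformat>
```python
import numpy as np


def q_values(x, params):
    """Action-values from a one-hidden-layer network; params = (w1, b1, w2, b2)."""
    w1, b1, w2, b2 = params
    hidden = np.tanh(x @ w1 + b1)
    return hidden @ w2 + b2


def double_targets(x_next, a_next, r, terminal, params_other, policy_probs,
                   gamma=0.99):
    """Return (Sarsa-style, Expected-Sarsa-style) targets for the updated network.

    The next state is evaluated with the other network's parameters.
    gamma is an illustrative default (assumption).
    """
    # The discount is zero when the next state is terminal.
    discount = 0.0 if terminal else gamma
    q_next = q_values(x_next, params_other)
    # Sarsa-style target: value of the action actually selected next.
    sarsa_target = r + discount * q_next[a_next]
    # Expected-Sarsa-style target: policy-weighted average over next actions.
    expected_target = r + discount * float(np.dot(policy_probs, q_next))
    return sarsa_target, expected_target
```
</preformat>
Either target would then be regressed against by the network being updated, with the roles of the two parameter sets exchanged between updates. 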
Similar to the tabular update rules, the target used for Deep Double Sarsa is</p><disp-formula id="scirp.71237-formula134"><label>(11)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/2-2870146x87.png"  xlink:type="simple"/></disp-formula><p>and the target for Deep Double Expected Sarsa is</p><disp-formula id="scirp.71237-formula135"><label>(12)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/2-2870146x88.png"  xlink:type="simple"/></disp-formula><p>As in the tabular algorithms, the policy is derived from the average of <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x89.png" xlink:type="simple"/></inline-formula> and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x90.png" xlink:type="simple"/></inline-formula>. The algorithms for Deep Double Sarsa and Deep Double Expected Sarsa are shown in <xref ref-type="fig" rid="fig3">Figure 3</xref> and <xref ref-type="fig" rid="fig4">Figure 4</xref>, respectively.</p><fig id="fig3"  position="float"><label><xref ref-type="fig" rid="fig3">Figure 3</xref></label><caption><title> Deep Double Sarsa algorithm, with neural network representation of the action-values. Lines 11 and 12 swap the references to <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x92.png" xlink:type="simple"/></inline-formula> and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x93.png" xlink:type="simple"/></inline-formula>, meaning each network is updated using half of the experiences. 
Note that <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x94.png" xlink:type="simple"/></inline-formula> if <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x95.png" xlink:type="simple"/></inline-formula> is terminal, otherwise it is the discount rate</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/2-2870146x91.png"/></fig><fig id="fig4"  position="float"><label><xref ref-type="fig" rid="fig4">Figure 4</xref></label><caption><title> Deep Double Expected Sarsa algorithm, with neural network representation of the action-values. Lines 12 and 13 swap the references to <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x97.png" xlink:type="simple"/></inline-formula> and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x98.png" xlink:type="simple"/></inline-formula>, meaning each network is updated using half of the experiences. 
Note that <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x99.png" xlink:type="simple"/></inline-formula> if <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x100.png" xlink:type="simple"/></inline-formula> is terminal, otherwise it is the discount rate</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/2-2870146x96.png"/></fig></sec></sec><sec id="s3"><title>3. Results</title><p>The experiment used to test the difference between Sarsa, Expected Sarsa, and their respective doubled versions was a simple grid world (see <xref ref-type="fig" rid="fig5">Figure 5</xref>) with two terminal states, one with a positive reward of 10 and the other with a negative reward of −10. Additionally, a blocking “wall” was placed between the terminal states. 
Every time the agent moves a step in the environment, it receives a reward r with mean <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x101.png" xlink:type="simple"/></inline-formula> and standard deviation <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x102.png" xlink:type="simple"/></inline-formula>. The state feature vector was represented by the concatenation of four one-hot encodings of the position of each of the objects,</p><disp-formula id="scirp.71237-formula136"><label>(13)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/2-2870146x103.png"  xlink:type="simple"/></disp-formula><p>where <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x104.png" xlink:type="simple"/></inline-formula> is the ith element of the one-hot encoding of the position of object <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x105.png" xlink:type="simple"/></inline-formula>, <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x106.png" xlink:type="simple"/></inline-formula> is the position of that object, and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x107.png" xlink:type="simple"/></inline-formula> is the concatenation of all the <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x108.png" xlink:type="simple"/></inline-formula> vectors. The number corresponding to each position can be seen in <xref ref-type="fig" rid="fig5">Figure 5</xref>, as well as the object positions, which were inspired by [<xref ref-type="bibr" rid="scirp.71237-ref20">20</xref>].</p><p>For comparison, we show the difference between the algorithms for rewards with both a deterministic distribution, where <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x109.png" xlink:type="simple"/></inline-formula> for non-terminal s, and a stochastic distribution of two values, where <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x110.png" xlink:type="simple"/></inline-formula> for arbitrary <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x111.png" xlink:type="simple"/></inline-formula> for non-terminal s. In both the deterministic and stochastic cases, the negative terminal state P had a reward of −10 and the positive terminal state G had a reward of +10. An environment with both a positive and negative terminal state is ideal for testing the robustness of on-policy algorithms because they must learn a policy that minimizes the number of steps to the positive terminal state while avoiding states that may lead to the negative terminal state, due to the stochastic nature of the policies.</p><fig id="fig5"  position="float"><label><xref ref-type="fig" rid="fig5">Figure 5</xref></label><caption><title> Grid world used to test the four algorithms discussed in this paper; the left grid shows the number corresponding to each position and the right grid shows the initial position of each object. A is the agent’s starting position, W is the “wall”, P is the terminal state with a reward of −10 (the “pit”), and G is the second terminal state with a reward of +10 (the “goal”). A is the only position allowed to change throughout the course of an episode</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/2-2870146x112.png"/></fig><p>In this paper, we first compare Sarsa, Expected Sarsa, Double Sarsa, and Double Expected Sarsa in tabular form, where <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x113.png" xlink:type="simple"/></inline-formula> is represented by a single table entry for each (s, a) pair, varying different parameters of exploration, learning, and rewards. 
Then, we discuss the extension of these algorithms to Q-Networks and Deep Q-Networks, using neural networks to approximate <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x114.png" xlink:type="simple"/></inline-formula> in a few scenarios that highlight the advantage of applying double learning to Sarsa and Expected Sarsa.</p><sec id="s3_1"><title>3.1. Tabular Representation of Q(s, a)</title><p>A comparison of Sarsa, Expected Sarsa, Double Sarsa, and Double Expected Sarsa under a deterministic reward system can be seen in <xref ref-type="fig" rid="fig6">Figure 6</xref>(a), showing the average return over 100,000 episodes. Expected Sarsa and Double Expected Sarsa appear to have almost identical performance, although for small learning rates Expected Sarsa tends to perform marginally better; presumably, this is because the doubled version must train two tables and consequently takes longer to converge than the single version. In the first 1000 episodes under the same reward system, the average return collected by Double Expected Sarsa was about 6.4% less than that collected by Expected Sarsa (not shown), which supports this hypothesis.</p><p>However, unlike the Expected algorithms, there is a clear performance difference between Sarsa and Double Sarsa for a deterministic reward. Like Expected Sarsa, Sarsa performs marginally better than Double Sarsa when the learning rate is small, although this is difficult to see in <xref ref-type="fig" rid="fig6">Figure 6</xref>(a). However, for learning rates greater than about 0.25, Double Sarsa shows a clear performance improvement over the standard Sarsa algorithm, especially as <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x115.png" xlink:type="simple"/></inline-formula>. 
When <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x116.png" xlink:type="simple"/></inline-formula>, the average return collected by Sarsa quickly drops off to below 0, while Double Sarsa still collects an average return of about 3.5. This improvement in performance is likely a consequence of the Sarsa update rule, which uses the value of the next action <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x117.png" xlink:type="simple"/></inline-formula> to update the value at the current state and action <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x118.png" xlink:type="simple"/></inline-formula>. 
This can introduce a substantial amount of variation in <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x119.png" xlink:type="simple"/></inline-formula>,</p><fig id="fig6"  position="float"><label><xref ref-type="fig" rid="fig6">Figure 6</xref></label><caption><title> (a) Average return per episode vs. learning rate α for 100,000 episodes, deterministic reward with <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x121.png" xlink:type="simple"/></inline-formula> and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x122.png" xlink:type="simple"/></inline-formula>, and an <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x123.png" xlink:type="simple"/></inline-formula>-greedy policy with <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x124.png" xlink:type="simple"/></inline-formula>. Expected Sarsa and Double Expected Sarsa are overlapping due to their convergence to very similar average returns. (b) Average return per episode vs. learning rate α for 100,000 episodes, stochastic reward with <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x125.png" xlink:type="simple"/></inline-formula> and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x126.png" xlink:type="simple"/></inline-formula>, and an <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x127.png" xlink:type="simple"/></inline-formula>-greedy policy with <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x128.png" xlink:type="simple"/></inline-formula></title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  
xlink:href="http://html.scirp.org/file/2-2870146x120.png"/></fig><p>especially if α is not annealed over time. Double Sarsa reduces this variation by decoupling the two tables, guarding against large changes in <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x129.png" xlink:type="simple"/></inline-formula>, which tends to produce a more stable policy and increase the amount of reward collected.</p><p><xref ref-type="fig" rid="fig6">Figure 6</xref>(b) shows the same comparison between the four algorithms with a stochastic reward, where <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x130.png" xlink:type="simple"/></inline-formula> and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x131.png" xlink:type="simple"/></inline-formula>. For most of the learning rates tested, the doubled versions of the algorithms performed better than their respective single versions. Unlike the deterministic case, Expected Sarsa does not have a clear advantage over Double Expected Sarsa in the first 1000 episodes (not shown), and like the deterministic case both exhibit the same trend over 100,000 episodes (<xref ref-type="fig" rid="fig6">Figure 6</xref>(b)), although the trend itself differs significantly from the deterministic one.</p><p><xref ref-type="fig" rid="fig7">Figure 7</xref> shows the learning rate below which returns are positive and above which they are negative, comparing the learning rates that produce the same average return. These results indicate that the double estimators employed by Double Sarsa and Double Expected Sarsa allow for faster learning rates under the same stochastic reward conditions. 
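A zero-crossing learning rate of the kind compared in <xref ref-type="fig" rid="fig7">Figure 7</xref> can be estimated with a least-squares line fitted to points near zero return. The following is only a sketch of that idea: the function name zero_crossing is hypothetical, and any data passed to it in an example would be synthetic rather than the measured returns.
<preformat>
```python
import numpy as np


def zero_crossing(alphas, returns):
    """Fit returns = slope * alpha + intercept and solve for returns = 0.

    Also reports the R^2 of the fit so its quality can be checked.
    """
    alphas = np.asarray(alphas, dtype=float)
    returns = np.asarray(returns, dtype=float)
    # np.polyfit with degree 1 returns the slope first, then the intercept.
    slope, intercept = np.polyfit(alphas, returns, 1)
    predicted = slope * alphas + intercept
    residual = np.sum((returns - predicted) ** 2)
    total = np.sum((returns - np.mean(returns)) ** 2)
    r_squared = 1.0 - residual / total
    # Learning rate at which the fitted line crosses zero return.
    return -intercept / slope, r_squared
```
</preformat>
The ratio of two such zero-crossing learning rates then gives the relative increase in learning rate afforded by one algorithm over another. 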
As shown in <xref ref-type="fig" rid="fig7">Figure 7</xref>, around a 40% increase in the learning rate can be applied before Double Sarsa and Double Expected Sarsa collect rewards equivalent to Sarsa and Expected Sarsa, respectively. In real world applications, this can be a significant advantage, allowing greater returns to be collected earlier on in the learning process.</p><p>A comparison of the path length distributions between the four algorithms in the stochastic case is shown in <xref ref-type="fig" rid="fig8">Figure 8</xref>. The path length L is the number of steps that it took the algorithm to reach a terminal state in a given episode. Although all four algorithms reach the negative terminal state in <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x132.png" xlink:type="simple"/></inline-formula> steps with approximately equal probability, it is apparent that Double Sarsa and Double Expected Sarsa tend to reach</p><fig id="fig7"  position="float"><label><xref ref-type="fig" rid="fig7">Figure 7</xref></label><caption><title> Comparison of the performance of each algorithm with <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x134.png" xlink:type="simple"/></inline-formula>, a stochastic reward system of <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x135.png" xlink:type="simple"/></inline-formula> and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x136.png" xlink:type="simple"/></inline-formula>, and an ε-greedy policy 
of <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x137.png" xlink:type="simple"/></inline-formula>. The zero-crossing was determined by fitting a line <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x138.png" xlink:type="simple"/></inline-formula> to each of the curves in <xref ref-type="fig" rid="fig6">Figure 6</xref>(b), using at least 7 points very close to the line <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x139.png" xlink:type="simple"/></inline-formula>, and finding <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x140.png" xlink:type="simple"/></inline-formula> such that <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x141.png" xlink:type="simple"/></inline-formula>. In all cases, the R<sup>2</sup> value was greater than 0.9. The computation time was averaged over 100 runs; a single run includes 100,000 episodes after initialization of the algorithm, where an episode is completed when the agent reaches the terminal state. Increase by Doubling was calculated by taking the ratio of the two metrics for Double Sarsa to Sarsa and Double Expected Sarsa to Expected Sarsa. Increase from Sarsa was calculated by taking the ratio of the two metrics for Double Sarsa, Expected Sarsa, and Double Expected Sarsa to Sarsa. Computational Efficiency was computed by subtracting the percentage increase of computation time from the percentage increase of the zero-crossing learning rate</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/2-2870146x133.png"/></fig><fig id="fig8"  position="float"><label><xref ref-type="fig" rid="fig8">Figure 8</xref></label><caption><title> Path length distributions for all four algorithms, accumulated over 100 runs with 100,000 episodes each, truncated to a path length of <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x143.png" xlink:type="simple"/></inline-formula> to show the most frequently occurring lengths. The path length is the number of steps that were needed to reach a terminal state. 
An <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x144.png" xlink:type="simple"/></inline-formula>-greedy policy of <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x145.png" xlink:type="simple"/></inline-formula> was used, with a learning rate of <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x146.png" xlink:type="simple"/></inline-formula>. 
The reward system was stochastic, with <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x147.png" xlink:type="simple"/></inline-formula> and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x148.png" xlink:type="simple"/></inline-formula></title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/2-2870146x142.png"/></fig><p>the positive terminal state in fewer steps than Sarsa and Expected Sarsa. 
This is likely due to the double versions having a more stable policy as a result of their decoupled action-value estimates, which protect against large changes in the action-values as well as in the policy.</p><p>Also shown in the table is the average computation time for 100,000 episodes, with <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x149.png" xlink:type="simple"/></inline-formula>. As can be seen in the table, the extra computational expense of Double Sarsa, Expected Sarsa, and Double Expected Sarsa is marginal, with all three algorithms taking less than 10% more time than Sarsa. This is in contrast to the increase in the zero-crossing learning rate (<inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x150.png" xlink:type="simple"/></inline-formula>, where <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x151.png" xlink:type="simple"/></inline-formula>), which in all cases is significantly greater than that of the original algorithm. This indicates that there is a significant advantage to using the doubled versions of Sarsa and Expected Sarsa when the reward is stochastic.</p><p>As was shown for the Double Q-Learning algorithm, Double Sarsa and Double Expected Sarsa initially tend to have a lower estimate of the action-value than Sarsa and Expected Sarsa, respectively. 
<xref ref-type="fig" rid="fig9">Figure 9</xref>(a) shows the maximum action value, <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x152.png" xlink:type="simple"/></inline-formula>, for the initial state of the agent, averaged over 1000 runs, with the doubled versions converging to the true value more slowly than the single versions. In addition, this plot shows the increased stability of the action-values that doubling the Sarsa and Expected Sarsa algorithms imparts. Interestingly, unlike Double Q-Learning, Double Sarsa and Double Expected Sarsa tended to converge to a higher maximum action value (see <xref ref-type="fig" rid="fig9">Figure 9</xref>(a)) than Sarsa and Expected Sarsa, and approached the true value, which was converged to in the deterministic case. This is likely due to the increased stability provided by the two decoupled action-value tables, instead of a single action-value table, which improves the quality of the policy and consequently increases the total reward.</p><p>This stability is especially important for on-policy algorithms, as a more stable behavior policy tends to reduce variation in the distribution of states visited by the agent, as</p><fig-group id="fig9"><label><xref ref-type="fig" rid="fig9">Figure 9</xref></label><caption><title> (a) Maximum action value for the initial state <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x154.png" xlink:type="simple"/></inline-formula> for each algorithm, averaged over 1000 runs. 
Reward was stochastic with <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x155.png" xlink:type="simple"/></inline-formula> and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x156.png" xlink:type="simple"/></inline-formula>, with an <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x157.png" xlink:type="simple"/></inline-formula>-greedy policy of <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x158.png" xlink:type="simple"/></inline-formula> and a learning rate of <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x159.png" xlink:type="simple"/></inline-formula>. The true value is the value which Expected Sarsa converged to in the deterministic case. (b) Average returns for the same experiment of 1000 runs. The return g is the sum of the rewards in a given episode, or <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x160.png" xlink:type="simple"/></inline-formula>. 
The convergence of the returns occurs much faster than that of the estimated maximum action value of the initial state.</title></caption><fig id ="fig9_1"><label> (b)</label><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/2-2870146x153.png"/></fig></fig-group><p>well as the actions taken, making them significantly more predictable. For comparison, in the experiment shown in <xref ref-type="fig" rid="fig9">Figure 9</xref>(a), the average variance of the maximum action-value over all <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x161.png" xlink:type="simple"/></inline-formula> episodes, <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x162.png" xlink:type="simple"/></inline-formula> (where <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x163.png" xlink:type="simple"/></inline-formula> is the variance of <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x164.png" xlink:type="simple"/></inline-formula> at episode t over 1000 runs), was computed. For Sarsa, <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x165.png" xlink:type="simple"/></inline-formula> was 4.51; it was 2.44 for Double Sarsa, 4.36 for Expected Sarsa, and 2.32 for Double Expected Sarsa. 
This is a significant reduction in variation, given the small difference in the average return curves shown in <xref ref-type="fig" rid="fig9">Figure 9</xref>(b).</p><p><xref ref-type="fig" rid="fig10">Figure 10</xref>(a) and <xref ref-type="fig" rid="fig10">Figure 10</xref>(b) show similar results from an experiment with the same parameters, except that the number of episodes was increased to 100,000 and the number of runs decreased to 100. As can be seen, the average return collected by Double Sarsa and Double Expected Sarsa quickly surpasses that of Sarsa and Expected Sarsa, and the maximum action-value increases accordingly. Once again, this is likely due to the reduction in variation provided by double learning. It is also interesting to note that this differs from what was shown in [<xref ref-type="bibr" rid="scirp.71237-ref7">7</xref>] for Double Q-learning. That study found that the double estimator should, on average, underestimate the single estimator; this is clearly not the case in <xref ref-type="fig" rid="fig9">Figure 9</xref>(a). This is likely because Q-learning is off-policy and takes the max in its update rule, while Sarsa is on-policy and often has a stochastic behavior (and target) policy.</p><p>The degree of effectiveness of Double Sarsa and Double Expected Sarsa is highly dependent on the distribution of rewards. <xref ref-type="fig" rid="fig11">Figure 11</xref>(a) shows the average return per algorithm against the standard deviation <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x166.png" xlink:type="simple"/></inline-formula> of the two-value stochastic distribution,</p><fig-group id="fig10"><label><xref ref-type="fig" rid="fig10">Figure 10</xref></label><caption><title> (a) Maximum action value for the initial state <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x168.png" xlink:type="simple"/></inline-formula> for each algorithm, averaged over 100 runs. 
Reward was stochastic with <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x169.png" xlink:type="simple"/></inline-formula> and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x170.png" xlink:type="simple"/></inline-formula>, with an <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x171.png" xlink:type="simple"/></inline-formula>-greedy policy of <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x172.png" xlink:type="simple"/></inline-formula> and a learning rate of <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x173.png" xlink:type="simple"/></inline-formula>. The graph shows the average value every 1000 episodes, averaged over the previous 1000 episodes. The true value is the value which Expected Sarsa converged to in the deterministic case. (b) Average return for the same experiment as <xref ref-type="fig" rid="fig9">Figure 9</xref>(a), averaged over the same 1000-episode intervals, where the return g is computed the same way as in <xref ref-type="fig" rid="fig9">Figure 9</xref>(b).</title></caption><fig id ="fig10_1"><label> (b)</label><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/2-2870146x167.png"/></fig></fig-group><fig id="fig11"  position="float"><label><xref ref-type="fig" rid="fig11">Figure 11</xref></label><caption><title> (a) Average return over 100,000 episodes with varying standard deviation of the reward system, <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x175.png" xlink:type="simple"/></inline-formula>. The reward distribution had two values, with <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x176.png" xlink:type="simple"/></inline-formula> for non-terminal s. 
For all cases, <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x177.png" xlink:type="simple"/></inline-formula>, <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x178.png" xlink:type="simple"/></inline-formula>, and an <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x179.png" xlink:type="simple"/></inline-formula>-greedy policy of <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x180.png" xlink:type="simple"/></inline-formula>. Sarsa and Expected Sarsa had very similar average returns, as did Double Sarsa and Double Expected Sarsa. (b) Average return with varying power x for an <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x181.png" xlink:type="simple"/></inline-formula>-decreasing policy of <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x182.png" xlink:type="simple"/></inline-formula>, taken over 10,000 episodes and averaged over 100 runs. A stochastic reward system was used, with <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x183.png" xlink:type="simple"/></inline-formula> and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x184.png" xlink:type="simple"/></inline-formula>, and the learning rate was kept constant at <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x185.png" xlink:type="simple"/></inline-formula> for all four algorithms</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/2-2870146x174.png"/></fig><p>with <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x186.png" xlink:type="simple"/></inline-formula>, <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x187.png" xlink:type="simple"/></inline-formula>, and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x188.png" xlink:type="simple"/></inline-formula>. It appears that the doubled versions are significantly more robust with respect to variations in the reward distribution. 
For example, when <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x189.png" xlink:type="simple"/></inline-formula>, Double Sarsa and Double Expected Sarsa still net positive rewards, while Sarsa and Expected Sarsa are significantly negative. Note that this means the reward that <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x190.png" xlink:type="simple"/></inline-formula> for non-terminal s, which covers a range that is about double the range of the terminal rewards, <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x191.png" xlink:type="simple"/></inline-formula>.</p><p>The advantage of doubling Sarsa and Expected Sarsa can also be seen in <xref ref-type="fig" rid="fig11">Figure 11</xref>(b), which compares the average return collected by each algorithm over 10,000 episodes with an <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x192.png" xlink:type="simple"/></inline-formula>-decreasing [<xref ref-type="bibr" rid="scirp.71237-ref21">21</xref>] policy and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x193.png" xlink:type="simple"/></inline-formula>. For the <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x194.png" xlink:type="simple"/></inline-formula>-decreasing policy, <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x195.png" xlink:type="simple"/></inline-formula> was calculated according to <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x196.png" xlink:type="simple"/></inline-formula>, where <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x197.png" xlink:type="simple"/></inline-formula> is the number of times s was visited and x is an arbitrary exponent used to control how quickly <inline-formula><inline-graphic 
xlink:href="http://html.scirp.org/file/2-2870146x196.png" xlink:type="simple"/></inline-formula><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x197.png" xlink:type="simple"/></inline-formula><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x198.png" xlink:type="simple"/></inline-formula>. For the same learning rate, a faster decreasing <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x192.png" xlink:type="simple"/></inline-formula><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x193.png" xlink:type="simple"/></inline-formula><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x194.png" xlink:type="simple"/></inline-formula><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x195.png" xlink:type="simple"/></inline-formula><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x196.png" xlink:type="simple"/></inline-formula><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x197.png" xlink:type="simple"/></inline-formula><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x198.png" xlink:type="simple"/></inline-formula><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x199.png" xlink:type="simple"/></inline-formula> (a larger x) can be used with Double Sarsa and Double Expected Sarsa than with Sarsa and Expected Sarsa before the returns collapse, meaning a greedy policy can be more quickly achieved and greater returns can be collected; in situations where exploration is highly undesirable (e.g. it is expensive), this can be a significant advantage.</p></sec><sec id="s3_2"><title>3.2. 
Neural Network Representation of Q(s, a)</title><p>To test the robustness of the algorithms, we evaluated each of them with a neural network function approximation of <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x200.png" xlink:type="simple"/></inline-formula>. All neural networks were implemented using the Keras library [<xref ref-type="bibr" rid="scirp.71237-ref22">22</xref>], and the networks were trained by backpropagation using the RMSProp optimizer [<xref ref-type="bibr" rid="scirp.71237-ref23">23</xref>]. A comparison of different neural network architectures applied to each algorithm can be seen in <xref ref-type="fig" rid="fig12">Figure 12</xref>. The parameter n was varied over a range of values in separate trials; typically, <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x201.png" xlink:type="simple"/></inline-formula>. 
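As a rough illustration of the kind of network evaluated here, the following Python sketch builds the length-64 one-hot input described in the caption of Figure 12 and passes it through fully connected layers with rectified linear hidden units and 4 linear outputs. It is not the Keras code used in the experiments: the helper names, the weight initialization, and the example hidden layer sizes [20, 8] are assumptions made only for this example, and training with RMSProp is omitted.
<preformat>
```python
import random

GRID_CELLS = 16  # one one-hot slot per grid position (vectors of length 16)


def one_hot(index, size=GRID_CELLS):
    """Return a list of zeros with a single 1.0 at position index."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def encode_state(agent, wall, pit, goal):
    """Concatenate four one-hot position vectors into a length-64 input."""
    return one_hot(agent) + one_hot(wall) + one_hot(pit) + one_hot(goal)


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (this initialization is an assumption)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_network(hidden_sizes, rng, n_inputs=64, n_actions=4):
    """Create fully connected layers: input, hidden layers, then 4 outputs."""
    sizes = [n_inputs] + list(hidden_sizes) + [n_actions]
    return [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]


def forward(x, layers):
    """Forward pass: ReLU on hidden layers, linear activation on the output layer."""
    last = len(layers) - 1
    for i, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if i != last:
            x = [max(0.0, v) for v in x]
    return x


rng = random.Random(0)
network = build_network([20, 8], rng)  # hypothetical architecture [20, n] with n = 8
state = encode_state(agent=0, wall=5, pit=7, goal=15)  # hypothetical positions
q_values = forward(state, network)  # one action-value estimate per action
```
</preformat>
Each of the four outputs is read as the estimated action-value of one action in the encoded state. 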
The returns were averaged over 16 runs to reduce the natural variation in performance caused by random initialization of the network parameters, <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x202.png" xlink:type="simple"/></inline-formula>, and the maximum average return for each architecture was taken over n according to <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x203.png" xlink:type="simple"/></inline-formula>, where <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x204.png" xlink:type="simple"/></inline-formula> is the average return for a given architecture with parameter n. Several trends are apparent in the figure. 
First, as the network architecture transitions from shallow to deep, the average return collected generally decreases. For a random policy, the average return g was determined experimentally to be about −16.03, indicating that any network architecture with an average return <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x205.png" xlink:type="simple"/></inline-formula> has learned a policy better than random, and any network architecture with an average return <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x206.png" xlink:type="simple"/></inline-formula> has learned a policy that must reach the positive terminal state at least part of the time.</p><p>For the architectures shown, the average increase in return of Double Sarsa (DS) over Sarsa (S), <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x207.png" xlink:type="simple"/></inline-formula>, is 1.05 &#177; 6.29; for Expected Sarsa (ES), the average increase in return over Sarsa, <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x208.png" xlink:type="simple"/></inline-formula>, is 1.42 &#177; 3.53; and for Double Expected Sarsa (DES), <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x209.png" xlink:type="simple"/></inline-formula> is 0.59 &#177; 7.23 (the uncertainty is the standard deviation of the differences). On average, Double Sarsa, Expected Sarsa, and Double Expected Sarsa improve on Sarsa when neural networks are used, although the spread across architectures is large. Presumably, this is because all three provide increased stability to the action-value estimates, in different ways.</p><fig id="fig12"  position="float"><label>Figure 12</label><caption><title>Comparison of average returns collected by neural network implementations of the four algorithms over 10,000 episodes, averaged over 4 runs and maximized over the size of the last hidden layer (max<sub>n</sub>g<sub>n</sub> for each algorithm and network parameter). 
The input was a vector of length 64, formed by concatenating the one-hot encodings of the positions of the agent, the “wall”, the “pit”, and the “goal”, each a vector of length 16. Each network architecture is written as a list of hidden layer sizes, in consecutive order from left to right, and n represents a parameter that was typically in the range of 5 to 10. A sub-list such as <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x211.png" xlink:type="simple"/></inline-formula> indicates that the layer is a convolutional layer with <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x212.png" xlink:type="simple"/></inline-formula> filters of <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x213.png" xlink:type="simple"/></inline-formula> inputs. The output layer had 4 units, one for each action-value. Each hidden layer used a rectified linear activation function, and the output layer used a linear activation function</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/2-2870146x210.png"/></fig><p>Although it is apparent that doubling Sarsa and Expected Sarsa generally improves the performance of the algorithms when neural networks are used to approximate the action-value function, our experiments do not show an advantage of deep learning over shallow learning. 
Presumably, this is because there are comparatively few states in our simple grid world environment; the benefit of neural network approximation will likely increase as the size of the grid increases. However, even in this case, the advantage of deep learning over shallow learning might not become fully apparent without increasing the complexity of the environment; deep neural networks might not be beneficial until the environment reaches a certain level of complexity and non-linearity.</p><p>Even so, the experiments summarized in <xref ref-type="fig" rid="fig12">Figure 12</xref> show the effect of using function approximation on an on-policy algorithm: in this case, it decreased the average return significantly, something that was not observed with off-policy algorithms. This is likely a product of the increased feedback present in on-policy algorithms; the choice of action a affects the update of <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x214.png" xlink:type="simple"/></inline-formula>, which changes the action-values and policy, and consequently affects the choice of the next action <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x215.png" xlink:type="simple"/></inline-formula>. 
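This feedback can be illustrated with a minimal tabular sketch in Python that contrasts the bootstrap target of a Sarsa-style (on-policy) update with that of a Q-learning-style (off-policy) update. The function names, the state label, and the numeric values are hypothetical and chosen only for illustration; this is not the implementation used in the experiments.
<preformat>
```python
def sarsa_target(q, reward, next_state, next_action, gamma):
    """On-policy target: bootstraps from the next action the agent actually takes."""
    return reward + gamma * q[next_state][next_action]


def q_learning_target(q, reward, next_state, gamma):
    """Off-policy target: bootstraps from the greedy action, whatever is taken next."""
    return reward + gamma * max(q[next_state])


# Hypothetical action-values for a single next state with two actions.
q = {"s_next": [1.0, 3.0]}
reward, gamma = 0.0, 0.5

# The Sarsa target changes with the sampled next action ...
sarsa_targets = [sarsa_target(q, reward, "s_next", a, gamma) for a in (0, 1)]
# ... while the Q-learning target is the same for either choice.
q_target = q_learning_target(q, reward, "s_next", gamma)
```
</preformat>
With these values, the on-policy target differs depending on which next action is sampled, whereas the off-policy target does not. 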
In off-policy algorithms, <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x216.png" xlink:type="simple"/></inline-formula> does not affect the policy, meaning that there is a greater degree of stability when training the neural network approximator.</p></sec></sec><sec id="s4"><title>4. Conclusion</title><p>Current on-policy reinforcement learning algorithms are less effective when rewards are stochastic, requiring a reduction in the learning rate in order to maintain a stable policy. Two new on-policy reinforcement learning algorithms, Double Sarsa and Double Expected Sarsa, were proposed in this paper to address this issue. Similar to what was found with Double Q-learning, Double Sarsa and Double Expected Sarsa were found to be more robust to random rewards. For a constant learning rate <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x217.png" xlink:type="simple"/></inline-formula>, these algorithms are more stable under large variations in rewards, allowing them to still achieve significant returns when the standard deviation <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/2-2870146x218.png" xlink:type="simple"/></inline-formula> is significantly larger than the magnitude of the rewards received in the terminal states. We found that the estimated action-values of Double Sarsa and Double Expected Sarsa were much more stable than those of both Sarsa and Expected Sarsa, which resulted in a better policy. 
However, we showed that, unlike in Double Q-learning, the double estimators of the proposed algorithms could overestimate the single estimators of the original algorithms. In addition, we found that, for the same average return, a more aggressive learning rate could be used with the doubled versions, at only a minor computational cost. Finally, we demonstrated that this technique could be extended with neural networks and deep reinforcement learning, showing the same improvement from doubling as the tabular forms do. Future work should focus on exploring the robustness of the neural network versions of Double Sarsa and Double Expected Sarsa in more complex environments.</p></sec><sec id="s5"><title>Acknowledgements</title><p>We would like to thank the Summer Research Institute at Houghton College for providing financial support for this study.</p></sec><sec id="s6"><title>Cite this paper</title><p>Ganger, M., Duryea, E. and Hu, W. (2016) Double Sarsa and Double Expected Sarsa with Shallow and Deep Learning. Journal of Data Analysis and Information Processing, 4, 159-176. http://dx.doi.org/10.4236/jdaip.2016.44014</p></sec></body><back><ref-list><title>References</title><ref id="scirp.71237-ref1"><label>1</label><mixed-citation publication-type="other" xlink:type="simple">Kaelbling, L.P., Littman, M.L. and Moore, A.W. (1996) Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, 4, 237-285.</mixed-citation></ref><ref id="scirp.71237-ref2"><label>2</label><mixed-citation publication-type="other" xlink:type="simple">Williams, R.J. (1992) Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8, 229-256.  
http://dx.doi.org/10.1007/BF00992696</mixed-citation></ref><ref id="scirp.71237-ref3"><label>3</label><mixed-citation publication-type="other" xlink:type="simple">Sutton, R.S. (1988) Learning to Predict by the Methods of Temporal Differences. Machine Learning, 3, 9-44. http://dx.doi.org/10.1007/BF00115009</mixed-citation></ref><ref id="scirp.71237-ref4"><label>4</label><mixed-citation publication-type="other" xlink:type="simple">Sutton, R.S. and Barto, A.G. (1998) Reinforcement Learning: An Introduction. Vol. 1, MIT press, Cambridge.</mixed-citation></ref><ref id="scirp.71237-ref5"><label>5</label><mixed-citation publication-type="other" xlink:type="simple">Konda, V.R. and Tsitsiklis, J.N. (1999) Actor-Critic Algorithms. NIPS Proceedings, 13, 1008-1014.</mixed-citation></ref><ref id="scirp.71237-ref6"><label>6</label><mixed-citation publication-type="other" xlink:type="simple">Watkins, C.J.C.H. (1989) Learning from Delayed Rewards. Ph.D. Thesis, University of Cambridge, Cambridge.</mixed-citation></ref><ref id="scirp.71237-ref7"><label>7</label><mixed-citation publication-type="book" xlink:type="simple">Hasselt, H.V. (2010) Double Q-Learning. In: Lafferty, J.D., Williams, C.K.I., Shawe-Taylor, J., Zemel, R.S. and Culotta, A., Eds., Advances in Neural Information Processing Systems, Curran Associates, Inc., New York, 2613-2621.</mixed-citation></ref><ref id="scirp.71237-ref8"><label>8</label><mixed-citation publication-type="other" xlink:type="simple">Rummery, G.A. and Niranjan, M. (1994) On-Line Q-Learning Using Connectionist Systems. Department of Engineering, University of Cambridge, Cambridge.</mixed-citation></ref><ref id="scirp.71237-ref9"><label>9</label><mixed-citation publication-type="other" xlink:type="simple">Van Seijen, H., Van Hasselt, H., Whiteson, S. and Wiering, M. (2009) A Theoretical and Empirical Analysis of Expected Sarsa. 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, Nashville, 30 March-2 April 2009, 177-184.  
http://dx.doi.org/10.1109/ADPRL.2009.4927542</mixed-citation></ref><ref id="scirp.71237-ref10"><label>10</label><mixed-citation publication-type="book" xlink:type="simple">Boyan, J. and Moore, A.W. (1995) Generalization in Reinforcement Learning: Safely Approximating the Value Function. In: Tesauro, G., Touretzky, D.S. and Leen, T.K., Eds., Advances in Neural Information Processing Systems, MIT Press, Cambridge, 369-376.</mixed-citation></ref><ref id="scirp.71237-ref11"><label>11</label><mixed-citation publication-type="other" xlink:type="simple">LeCun, Y., Bengio, Y. and Hinton, G. (2015) Deep Learning. Nature, 521, 436-444.  
http://dx.doi.org/10.1038/nature14539</mixed-citation></ref><ref id="scirp.71237-ref12"><label>12</label><mixed-citation publication-type="other" xlink:type="simple">Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D. and Riedmiller, M. (2013) Playing Atari with Deep Reinforcement Learning.  
https://arxiv.org/abs/1312.5602</mixed-citation></ref><ref id="scirp.71237-ref13"><label>13</label><mixed-citation publication-type="other" xlink:type="simple">van Hasselt, H., Guez, A. and Silver, D. (2015) Deep Reinforcement Learning with Double Q-Learning. http://arxiv.org/abs/1509.06461</mixed-citation></ref><ref id="scirp.71237-ref14"><label>14</label><mixed-citation publication-type="other" xlink:type="simple">Liang, Y., Machado, M.C., Talvitie, E. and Bowling, M. (2016) State of the Art Control of Atari Games Using Shallow Reinforcement Learning. Proceedings of the 2016 International Conference on Autonomous Agents &amp; Multiagent Systems, Singapore, 9-13 May 2016, 485-493.</mixed-citation></ref><ref id="scirp.71237-ref15"><label>15</label><mixed-citation publication-type="other" xlink:type="simple">Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D. and Riedmiller, M. (2014) Deterministic Policy Gradient Algorithms. Proceedings of the 31st International Conference on Machine Learning (ICML-14), Beijing, 21-26 June 2014, 387-395.</mixed-citation></ref><ref id="scirp.71237-ref16"><label>16</label><mixed-citation publication-type="other" xlink:type="simple">Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D. and Wierstra, D. (2015) Continuous Control with Deep Reinforcement Learning.  
https://arxiv.org/abs/1509.02971</mixed-citation></ref><ref id="scirp.71237-ref17"><label>17</label><mixed-citation publication-type="other" xlink:type="simple">Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. (2016) Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature, 529, 484-489.  
http://dx.doi.org/10.1038/nature16961</mixed-citation></ref><ref id="scirp.71237-ref18"><label>18</label><mixed-citation publication-type="other" xlink:type="simple">Tromp, J. (2016) Number of Legal Go Positions. https://tromp.github.io/go/legal.html</mixed-citation></ref><ref id="scirp.71237-ref19"><label>19</label><mixed-citation publication-type="book" xlink:type="simple">Tokic, M. and Palm, G. (2011) Value-Difference Based Exploration: Adaptive Control between Epsilon-Greedy and Softmax. In: Bach, J. and Edelkamp, S., Eds., KI 2011: Advances in Artificial Intelligence, Springer, Berlin, 335-346.</mixed-citation></ref><ref id="scirp.71237-ref20"><label>20</label><mixed-citation publication-type="other" xlink:type="simple">Brandon (2015) Q-Learning with Neural Networks.  
http://outlace.com/Reinforcement-Learning-Part-3/</mixed-citation></ref><ref id="scirp.71237-ref21"><label>21</label><mixed-citation publication-type="book" xlink:type="simple">Vermorel, J. and Mohri, M. (2005) Multi-Armed Bandit Algorithms and Empirical Evaluation. In: Gama, J., Camacho, R., Brazdil, P.B., Jorge, A.M. and Torgo, L., Eds., Machine Learning: ECML 2005, Springer, Berlin, 437-448. http://dx.doi.org/10.1007/11564096_42</mixed-citation></ref><ref id="scirp.71237-ref22"><label>22</label><mixed-citation publication-type="other" xlink:type="simple">Chollet, F. (2015) keras, GitHub. https://github.com/fchollet/keras</mixed-citation></ref><ref id="scirp.71237-ref23"><label>23</label><mixed-citation publication-type="other" xlink:type="simple">Tieleman, T. and Hinton, G. (2012) Lecture 6.5-rmsprop: Divide the Gradient by a Running Average of its Recent Magnitude. COURSERA: Neural Networks for Machine Learning, 4, 26-30.</mixed-citation></ref></ref-list></back></article>