<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article  PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="3.0" xml:lang="en" article-type="research article"><front><journal-meta><journal-id journal-id-type="publisher-id">JILSA</journal-id><journal-title-group><journal-title>Journal of Intelligent Learning Systems and Applications</journal-title></journal-title-group><issn pub-type="epub">2150-8402</issn><publisher><publisher-name>Scientific Research Publishing</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.4236/jilsa.2014.64014</article-id><article-id pub-id-type="publisher-id">JILSA-50485</article-id><article-categories><subj-group subj-group-type="heading"><subject>Articles</subject></subj-group><subj-group subj-group-type="Discipline-v2"><subject>Computer Science&amp;Communications</subject><subject> Engineering</subject></subj-group></article-categories><title-group><article-title>
 
 
  A Reinforcement Learning System to Dynamic Movement and Multi-Layer Environments
 
</article-title></title-group><contrib-group><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>thai</surname><given-names>Phommasak</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref><xref ref-type="corresp" rid="cor1"><sup>*</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Daisuke</surname><given-names>Kitakoshi</given-names></name><xref ref-type="aff" rid="aff2"><sup>2</sup></xref><xref ref-type="corresp" rid="cor1"><sup>*</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Hiroyuki</surname><given-names>Shioya</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref><xref ref-type="corresp" rid="cor1"><sup>*</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Junji</surname><given-names>Maeda</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref><xref ref-type="corresp" rid="cor1"><sup>*</sup></xref></contrib></contrib-group><aff id="aff2"><addr-line>Department of Information Engineering, Tokyo National College of Technology, Tokyo, Japan</addr-line></aff><aff id="aff1"><addr-line>Division of Information and Electronic Engineering, Muroran Institute of Technology, Muroran, Japan</addr-line></aff><author-notes><corresp id="cor1">* E-mail:<email>12092001@mmm.muroran-it.ac.jp(TP)</email>;<email>kitakosi@tokyo-ct.ac.jp(DK)</email>;<email>shioya@csse.muroran-it.ac.jp(HS)</email>;<email>junji@csse.muroran-it.ac.jp(JM)</email>;</corresp></author-notes><pub-date pub-type="epub"><day>15</day><month>10</month><year>2014</year></pub-date><volume>06</volume><issue>04</issue><fpage>176</fpage><lpage>185</lpage><history><date date-type="received"><day>22</day>	<month>August</month>	<year>2014</year></date><date date-type="rev-recd"><day>26</day>	<month>September</month>	<year>2014</year>	</date><date 
date-type="accepted"><day>8</day>	<month>October</month>	<year>2014</year></date></history><permissions><copyright-statement>&#169; Copyright  2014 by authors and Scientific Research Publishing Inc. </copyright-statement><copyright-year>2014</copyright-year><license><license-p>This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0/</license-p></license></permissions><abstract><p>
 
 
  Many policy-improvement systems have been proposed for Reinforcement Learning (RL) agents that adapt quickly to environmental change by using statistical methods such as mixture models of Bayesian networks, mixture probability, and clustering distribution. However, such methods increase the computational complexity. Moreover, the adaptation performance of other methods has yet to be examined in more complex settings such as multi-layer environments. In this study, we used the profit-sharing method for the agent to learn its policy, and added a mixture probability to the RL system to recognize changes in the environment and appropriately improve the agent’s policy to adjust to the changing environment. We also introduced a clustering method that enables a smaller, suitable selection in order to reduce the computational complexity while maintaining the system’s performance. The experimental results showed that the agent successfully learned the policy and efficiently adjusted to changes in a multi-layer environment. Finally, the proposed system kept both the computational complexity and the decline in effectiveness of the policy improvement under control.
 
</p></abstract><kwd-group><kwd>Reinforcement Learning</kwd><kwd> Profit-Sharing Method</kwd><kwd> Mixture Probability</kwd><kwd> Clustering</kwd></kwd-group></article-meta></front><body><sec id="s1"><title>1. Introduction</title><p>Along with the increasing need for rescue robots in disasters such as earthquakes and tsunamis, there is an urgent need to develop robotics software that can learn and adapt to any environment. Reinforcement Learning (RL) is often used in developing such software. RL is an area of machine learning within the computer science domain, and many RL methods have recently been proposed and applied to a variety of problems [<xref ref-type="bibr" rid="scirp.50485-ref1">1</xref>]-[<xref ref-type="bibr" rid="scirp.50485-ref4">4</xref>], in which agents learn policies that maximize the total reward, determined according to specific rules. In the process whereby agents obtain rewards, data consisting of state-action pairs are generated. The agents’ policies are effectively improved by a supervised learning mechanism using the sequential expression of the stored data series and rewards.</p><p>Normally, RL agents need to initialize their policies when they are placed in a new environment, and the learning process starts afresh each time. Effective adjustment to an unknown environment becomes possible by using statistical methods, such as a Bayesian network model [<xref ref-type="bibr" rid="scirp.50485-ref5">5</xref>] [<xref ref-type="bibr" rid="scirp.50485-ref6">6</xref>], mixture probability and clustering distribution [<xref ref-type="bibr" rid="scirp.50485-ref7">7</xref>] [<xref ref-type="bibr" rid="scirp.50485-ref8">8</xref>], which draw on observational data on multiple environments that the agents have learned in the past [<xref ref-type="bibr" rid="scirp.50485-ref9">9</xref>] [<xref ref-type="bibr" rid="scirp.50485-ref10">10</xref>]. 
However, the use of a mixture model of Bayesian networks increases the system’s calculation time, and when processing resources are limited, it becomes necessary to control the computational complexity. On the other hand, although the use of mixture probability and clustering distribution controlled the computational complexity while maintaining the system’s performance, the experiments were conducted only on 2D environments with fixed obstacles. Therefore, the computational load and the adaptation performance in dynamic 3D environments still need to be examined.</p><p>In this paper, we describe modifications of the profit-sharing method, with new parameters, that make it possible to handle dynamic movement in multi-layer environments. We then describe a mixture probability that integrates, within the framework of RL, observational data on environments that the agent learned in the past; it provides initial knowledge to the agent and enables efficient adjustment to a changing environment. We also describe a novel clustering method that makes it possible to select fewer elements, significantly reducing the computational complexity while retaining the system’s performance.</p><p>The paper is organized as follows. Section 2 briefly explains the profit-sharing method, the mixture probability, the clustering distribution, and the flow of the system. The experimental setup and procedure, as well as the results, are described in Section 3. Finally, Section 4 summarizes the key points and mentions our future work.</p></sec><sec id="s2"><title>2. Preparation</title><sec id="s2_1"><title>2.1. Profit-Sharing</title><p>Profit-sharing is an RL method that is used as the policy learning mechanism in our proposed system. RL agents learn their own policies through “rewards” received from an environment.</p><sec id="s2_1_1"><title>2.1.1. 
2D-Environments</title><p>The policy is given by the following function:</p><disp-formula id="scirp.50485-formula641"><label>(1)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/3-9601278x5.png"  xlink:type="simple"/></disp-formula><p>where <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x6.png" xlink:type="simple"/></inline-formula> and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x7.png" xlink:type="simple"/></inline-formula> denote the set of states and the set of actions, respectively. The pair <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x8.png" xlink:type="simple"/></inline-formula> is referred to as a rule. <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x9.png" xlink:type="simple"/></inline-formula> is used as the weight of the rule (<inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x10.png" xlink:type="simple"/></inline-formula> is positive in this paper). When state <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x11.png" xlink:type="simple"/></inline-formula> is observed, a rule is selected in proportion to the weight of the rule <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x12.png" xlink:type="simple"/></inline-formula>. 
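As a small worked illustration of proportional selection, assume three hypothetical rule weights 1, 3 and 4 for a single observed state (illustrative values only, not taken from the experiments in this paper). Dividing each weight by the sum of the weights would give the selection probabilities
<disp-formula><tex-math>\[ \frac{1}{1+3+4} = 0.125, \qquad \frac{3}{1+3+4} = 0.375, \qquad \frac{4}{1+3+4} = 0.5, \]</tex-math></disp-formula>
so the rule with weight 4 would be chosen four times as often as the rule with weight 1. 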
The agent selects a single rule corresponding to the given state <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x13.png" xlink:type="simple"/></inline-formula> using the following probability:</p><disp-formula id="scirp.50485-formula642"><label>(2)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/3-9601278x14.png"  xlink:type="simple"/></disp-formula><p>The agent stores the sequence of all rules that were selected until it reaches the target as an episode:</p><disp-formula id="scirp.50485-formula643"><label>(3)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/3-9601278x15.png"  xlink:type="simple"/></disp-formula><p>where <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x16.png" xlink:type="simple"/></inline-formula> is the length of the episode. When the agent selects rule <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x17.png" xlink:type="simple"/></inline-formula> and acquires reward <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x18.png" xlink:type="simple"/></inline-formula>, the weight of each rule in the episode is reinforced by</p><disp-formula id="scirp.50485-formula644"><label>(4)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/3-9601278x19.png"  xlink:type="simple"/></disp-formula><disp-formula id="scirp.50485-formula645"><label>(5)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/3-9601278x20.png"  xlink:type="simple"/></disp-formula><p>where <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x21.png" xlink:type="simple"/></inline-formula> is referred to as the reinforcement function and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x22.png" xlink:type="simple"/></inline-formula> is the “learning rate”. In this paper, the following non-fixed reward is used:</p><disp-formula id="scirp.50485-formula646"><label>(6)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/3-9601278x23.png"  xlink:type="simple"/></disp-formula><p>where <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x24.png" xlink:type="simple"/></inline-formula> is the initial reward, <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x25.png" xlink:type="simple"/></inline-formula> is the limit on the number of actions in one trial, and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x26.png" xlink:type="simple"/></inline-formula> is the actual number of actions taken until the agent reaches the target. We expect that, by using this non-fixed reward, the agent can choose a more suitable rule to reach the target in a dynamic environment.</p></sec><sec id="s2_1_2"><title>2.1.2. 
3D-Environments</title><p>The weight <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x27.png" xlink:type="simple"/></inline-formula> becomes <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x28.png" xlink:type="simple"/></inline-formula>, where <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x29.png" xlink:type="simple"/></inline-formula> (<inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x30.png" xlink:type="simple"/></inline-formula> is the number of layers in this paper). 
The probability of selecting the rule <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x31.png" xlink:type="simple"/></inline-formula> is then given by the following function:</p><disp-formula id="scirp.50485-formula647"><label>(7)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/3-9601278x32.png"  xlink:type="simple"/></disp-formula><p>and the new episode is given by the following functions:</p><disp-formula id="scirp.50485-formula648"><label>(8)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/3-9601278x33.png"  xlink:type="simple"/></disp-formula><disp-formula id="scirp.50485-formula649"><label>(9)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/3-9601278x34.png"  xlink:type="simple"/></disp-formula><p>For the movement on <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x35.png" xlink:type="simple"/></inline-formula>, we can set the pseudo-reward [<xref ref-type="bibr" rid="scirp.50485-ref11">11</xref>] by using the following function:</p><disp-formula id="scirp.50485-formula650"><label>(10)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/3-9601278x36.png"  xlink:type="simple"/></disp-formula><p>and update the weights according to the following functions by using function (10):</p><disp-formula id="scirp.50485-formula651"><label>(11)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/3-9601278x37.png"  
xlink:type="simple"/></disp-formula><disp-formula id="scirp.50485-formula652"><label>(12)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/3-9601278x38.png"  xlink:type="simple"/></disp-formula></sec><sec id="s2_1_3"><title>2.1.3. Ineffective Rule Suppression</title><p>As shown in <xref ref-type="fig" rid="fig1">Figure 1</xref>, the agent selects rule <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x39.png" xlink:type="simple"/></inline-formula> in <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x40.png" xlink:type="simple"/></inline-formula> and then moves to <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x41.png" xlink:type="simple"/></inline-formula>. 
When the agent selects any rule in <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x42.png" xlink:type="simple"/></inline-formula> and finally moves back to state <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x43.png" xlink:type="simple"/></inline-formula>, the rules selected on <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x44.png" xlink:type="simple"/></inline-formula> 
become detour rules that may not contribute to the acquisition of the reward; these detour rules are called ineffective rules [<xref ref-type="bibr" rid="scirp.50485-ref12">12</xref>] [<xref ref-type="bibr" rid="scirp.50485-ref13">13</xref>].</p><p>Ineffective rules have a further negative effect: they may continue to be selected repeatedly on the movement of <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x45.png" xlink:type="simple"/></inline-formula>, and the agent cannot escape from that situation. This may cause policy learning to stagnate. For these reasons, the suppression of ineffective rules becomes necessary.</p><p>In this paper, we use the following method to suppress ineffective rules:</p><p>Here, we use <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x46.png" xlink:type="simple"/></inline-formula> as the length of episode <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x47.png" xlink:type="simple"/></inline-formula> and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x48.png" xlink:type="simple"/></inline-formula> as a fixed number for determining ineffective rules.</p><p>When <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x49.png" xlink:type="simple"/></inline-formula>, all rules in <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x50.png" xlink:type="simple"/></inline-formula> are judged to be ineffective rules. Here, all rules in <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x51.png" xlink:type="simple"/></inline-formula> and the final rule in <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x52.png" xlink:type="simple"/></inline-formula> will be excluded from <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x53.png" xlink:type="simple"/></inline-formula>, as shown in <xref ref-type="fig" rid="fig1">Figure 1</xref>.</p><fig id="fig1"  position="float"><label><xref ref-type="fig" rid="fig1">Figure 1</xref></label><caption><title>Example of an ineffective rule</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/3-9601278x55.png"/></fig></sec></sec><sec id="s2_2"><title>2.2. Mixture Probability</title><p>Mixture probability is a mechanism for recognizing changes in the environment and consequently improving the agent’s policy to adjust to those changes.</p><p>The joint distribution [<xref ref-type="bibr" rid="scirp.50485-ref14">14</xref>] <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x54.png" xlink:type="simple"/></inline-formula>, consisting of the episodes observed while learning an agent’s policy, is probabilistic knowledge about the environment. 
Furthermore, the policy acquired by the agent is improved by using the mixture probability of <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x56.png" xlink:type="simple"/></inline-formula> obtained in multiple known environments. The mixing distribution is given by the following function:</p><disp-formula id="scirp.50485-formula653"><label>(13)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/3-9601278x57.png"  xlink:type="simple"/></disp-formula><p>where <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x58.png" xlink:type="simple"/></inline-formula> denotes the number of joint distributions, and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x59.png" xlink:type="simple"/></inline-formula> is the mixing parameter <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x60.png" xlink:type="simple"/></inline-formula>. 
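As a hedged illustration, suppose the mixing distribution is read as the usual convex combination of the known joint distributions; this reading, and the symbols P_mix, P_i and β_i used here, are assumptions for illustration rather than a transcription of Equation (13):
<disp-formula><tex-math>\[ P_{\mathrm{mix}}(x) = \sum_{i=1}^{N} \beta_i P_i(x), \qquad \sum_{i=1}^{N} \beta_i = 1, \quad \beta_i \ge 0. \]</tex-math></disp-formula>
For example, with N = 2, hypothetical mixing parameters β_1 = 0.25 and β_2 = 0.75, and P_1(x) = 0.2 and P_2(x) = 0.6 for some rule x, this form would give P_mix(x) = 0.25 × 0.2 + 0.75 × 0.6 = 0.5. 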
By adjusting the environment subject to this mixing parameter, we expect appropriate improvement of the policy in the unknown dynamic environment.</p><p>In this paper, we use the following Hellinger distance [<xref ref-type="bibr" rid="scirp.50485-ref15">15</xref>] function to determine the mixing parameter:</p><disp-formula id="scirp.50485-formula654"><label>(14)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/3-9601278x61.png"  xlink:type="simple"/></disp-formula><p>where <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x62.png" xlink:type="simple"/></inline-formula> is the distance between <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x63.png" xlink:type="simple"/></inline-formula> and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x64.png" xlink:type="simple"/></inline-formula>, and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x65.png" xlink:type="simple"/></inline-formula> is set to 0 when <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x66.png" xlink:type="simple"/></inline-formula> and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x67.png" xlink:type="simple"/></inline-formula> are the same. 
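For reference, one common normalized form of the Hellinger distance between two discrete distributions P and Q is sketched here; the symbols p_j, q_j and M and the factor 1/√2 are assumptions for illustration, and the convention used in Equation (14) may differ:
<disp-formula><tex-math>\[ D_H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{ \sum_{j=1}^{M} \left( \sqrt{p_j} - \sqrt{q_j} \right)^{2} }. \]</tex-math></disp-formula>
Under this form the distance is 0 when P and Q are identical and 1 when they share no outcome; for instance, P = (0.5, 0.5) and Q = (0.9, 0.1) would give a distance of approximately 0.325. 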
<inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x62.png" xlink:type="simple"/></inline-formula><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x63.png" xlink:type="simple"/></inline-formula><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x64.png" xlink:type="simple"/></inline-formula><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x65.png" xlink:type="simple"/></inline-formula><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x66.png" xlink:type="simple"/></inline-formula><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x67.png" xlink:type="simple"/></inline-formula><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x68.png" xlink:type="simple"/></inline-formula>is joint distributions obtained in <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x62.png" xlink:type="simple"/></inline-formula><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x63.png" xlink:type="simple"/></inline-formula><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x64.png" xlink:type="simple"/></inline-formula><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x65.png" xlink:type="simple"/></inline-formula><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x66.png" xlink:type="simple"/></inline-formula><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x67.png" xlink:type="simple"/></inline-formula><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x68.png" xlink:type="simple"/></inline-formula><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x69.png" xlink:type="simple"/></inline-formula> different environments that an agent has learned in the 
past, <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x70.png" xlink:type="simple"/></inline-formula> is the sample distribution obtained from <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x71.png" xlink:type="simple"/></inline-formula> successful trials in an unknown environment, and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x72.png" xlink:type="simple"/></inline-formula> is the total 
number of rules. Given that <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x73.png" xlink:type="simple"/></inline-formula> holds, the mixing parameter is determined by the following function:</p><disp-formula id="scirp.50485-formula655"><label>(15)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/3-9601278x74.png"  xlink:type="simple"/></disp-formula><p>However, when <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x75.png" xlink:type="simple"/></inline-formula>, <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x76.png" xlink:type="simple"/></inline-formula>, and when all distributions are equal, the mixing parameter is allotted evenly.</p></sec><sec id="s2_3"><title>2.3. Clustering Distributions</title><p>By using a clustering method to select only suitable joint distributions as the mixture probability elements, we expect to keep the computational complexity of the system under control while maintaining the effectiveness of policy learning.</p><p>In this study, we used the group average method as the clustering method. The distance between clusters is determined by the following function:</p><disp-formula id="scirp.50485-formula656"><label>(16)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/3-9601278x77.png"  xlink:type="simple"/></disp-formula><p>where <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x78.png" xlink:type="simple"/></inline-formula> are the numbers of joint distributions contained in <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x79.png" xlink:type="simple"/></inline-formula> and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x80.png" xlink:type="simple"/></inline-formula>, respectively. 
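</p><p>As a rough illustration of group-average clustering, the following Python sketch computes the mean pairwise distance between two clusters of discrete distributions and merges the closest pair of clusters until a requested number of clusters remains. The function names, the list-of-probabilities representation, and the normalisation of the Hellinger distance (the common form that lies between 0 and 1) are assumptions made for this sketch; they are not taken from Equation (16), and the sketch expects at least one cluster to be requested.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math
from itertools import combinations


def hellinger(p, q):
    """Hellinger distance between two discrete distributions of equal length.

    Assumption: the common normalised form, which lies between 0 and 1.
    """
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)


def group_average_distance(cluster_a, cluster_b, dist=hellinger):
    """Mean of all pairwise distances between members of the two clusters."""
    total = sum(dist(p, q) for p in cluster_a for q in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def cluster_distributions(distributions, k, dist=hellinger):
    """Merge the closest pair of clusters until only k clusters remain."""
    clusters = [[d] for d in distributions]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: group_average_distance(
                clusters[pair[0]], clusters[pair[1]], dist
            ),
        )
        clusters[i] = clusters[i] + clusters[j]  # i is always smaller than j
        del clusters[j]
    return clusters


if __name__ == "__main__":
    # Toy data (hypothetical): two distributions near each corner.
    toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    for members in cluster_distributions(toy, k=2):
        print(members)
```
]]></preformat>
<p>Passing the pairwise distance as the <italic>dist</italic> argument keeps the linkage rule separate from the choice of distance, so another distance between distributions could be substituted without changing the merging loop.</p><p>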
In this study, we used the Hellinger distance function <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x81.png" xlink:type="simple"/></inline-formula>. After the clustering is complete, the element <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x82.png" xlink:type="simple"/></inline-formula> having the minimum <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x83.png" xlink:type="simple"/></inline-formula> is selected from each cluster as a mixture probability element.</p></sec><sec id="s2_4"><title>2.4. Flow System</title><p>The system framework is shown in <xref ref-type="fig" rid="fig2">Figure 2</xref>. The application of mixture probability and clustering distributions to improve the agent’s policy proceeds as follows:</p><fig id="fig2"  position="float"><label><xref ref-type="fig" rid="fig2">Figure 2</xref></label><caption><title> System framework</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/3-9601278x84.png"/></fig><p>Step 1 Learn the policy in <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x85.png" xlink:type="simple"/></inline-formula> environments by using the profit-sharing method to make the joint distributions <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x86.png" xlink:type="simple"/></inline-formula>;</p><p>Step 2 Cluster the distributions into <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x87.png" xlink:type="simple"/></inline-formula> clusters;</p><p>Step 3 Calculate the Hellinger distance <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x88.png" xlink:type="simple"/></inline-formula> between the distributions <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x89.png" xlink:type="simple"/></inline-formula> and the sample distribution <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x90.png" xlink:type="simple"/></inline-formula>;</p><p>Step 4 Select the element having the minimum <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x91.png" xlink:type="simple"/></inline-formula> from each cluster;</p><p>Step 5 Calculate the mixing parameter <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x92.png" xlink:type="simple"/></inline-formula>;</p><p>Step 6 Mix the probabilities <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x93.png" xlink:type="simple"/></inline-formula>;</p><p>Step 7 Update the weights of all rules by using the following function:</p><disp-formula id="scirp.50485-formula657"><label>(17)</label><graphic position="anchor" xlink:href="http://html.scirp.org/file/3-9601278x94.png"  xlink:type="simple"/></disp-formula><p>and then continue learning with the updated weights by using the profit-sharing method.</p></sec></sec><sec id="s3"><title>3. Experiments</title><p>We performed an agent navigation experiment to illustrate how the RL agent’s policy improves when the parameters of the profit-sharing method are modified and the mixture probability scheme is used. The purpose of this experiment was to evaluate the adaptation performance in unknown dynamic 3D environments when policy improvement is applied, and to evaluate the effectiveness of the mixture probability.</p><sec id="s3_1"><title>3.1. Experimental Setup</title><p>The aim in the agent navigation problem is to reach the target from the default position at which the agent is placed in the environment. 
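</p><p>This section describes input states formed from the presence of obstacles in 8 directions, together with 8 output actions. A minimal Python sketch of one way such states and rules could be indexed is given here. The bit order (direction d sets bit d), the flat rule index, and all names are assumptions made for illustration; the encoding actually used in the experiment is not recoverable from the text.</p>
<preformat preformat-type="code"><![CDATA[
```python
# Direction indices follow the value grid of Table 1:
#   0 1 2
#   3 . 4
#   5 6 7
DIRECTIONS = (
    "top left", "top", "top right",
    "left", "right",
    "bottom left", "bottom", "bottom right",
)
N_ACTIONS = len(DIRECTIONS)        # 8 output actions
N_STATES = 2 ** len(DIRECTIONS)    # 256 obstacle patterns


def encode_state(obstacles):
    """Pack eight obstacle flags into one integer between 0 and 255.

    Assumption: direction d sets bit d; the paper's own bit order is unknown.
    """
    if len(obstacles) != len(DIRECTIONS):
        raise ValueError("expected one flag per direction")
    return sum(1 << d for d, blocked in enumerate(obstacles) if blocked)


def rule_index(state, action):
    """Hypothetical flat index of the rule (state, action) within one layer."""
    return state * N_ACTIONS + action


if __name__ == "__main__":
    no_obstacles = [False] * 8
    walled_in = [True] * 8
    print(encode_state(no_obstacles), encode_state(walled_in))
    print(N_STATES * N_ACTIONS)  # number of rules per layer
```
]]></preformat>
<p>With this indexing, the rule weights of one layer fit in a single array of 256 by 8 entries, which is one simple way to store the weights updated by the profit-sharing method.</p><p>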
In the experiment, the reward is obtained when the agent reaches the target while avoiding the obstacles in the environment, as shown in <xref ref-type="fig" rid="fig3">Figure 3</xref>.</p><p>The types of action and state are shown in <xref ref-type="table" rid="table1">Table 1</xref> and <xref ref-type="table" rid="table2">Table 2</xref>, respectively. <xref ref-type="table" rid="table1">Table 1</xref> shows the output actions of the agent in 8 directions, and <xref ref-type="table" rid="table2">Table 2</xref> shows some of the 256 input states that arise from the combinations of obstacles present in the 8 directions. The 8 directions are the top left, top, top right, left, right, bottom left, bottom, and bottom right. In each layer, the agent has 2048 rules in total (8 actions &#215; 256 states), which result from combining the input states with the output actions. The sizes of the agent, target, and environment are 1 &#215; 1, 5 &#215; 5, and 50 &#215; 50, respectively.</p></sec><sec id="s3_2"><title>3.2. Experimental Procedure</title><p>The agent learns the policy by using the profit-sharing method. A trial is considered successful if the agent reaches the target at least once within 200 action attempts. An action is selected at random, and that action continues until the state changes.</p>
<fig id="fig3"  position="float"><label><xref ref-type="fig" rid="fig3">Figure 3</xref></label><caption><title> Environment of agent navigation problem</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/3-9601278x95.png"/></fig>
<table-wrap id="table1" ><label><xref ref-type="table" rid="table1">Table 1</xref></label><caption><title> Types of action</title></caption><table><thead><tr><th align="center" valign="middle" colspan="3">Direction of action</th><th align="center" valign="middle" colspan="3">Value</th></tr></thead><tbody>
<tr><td align="center" valign="middle" ><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x96.png" xlink:type="simple"/></inline-formula></td><td align="center" valign="middle" ><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x97.png" xlink:type="simple"/></inline-formula></td><td align="center" valign="middle" ><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x98.png" xlink:type="simple"/></inline-formula></td><td align="center" valign="middle" >0</td><td align="center" valign="middle" >1</td><td align="center" valign="middle" >2</td></tr>
<tr><td align="center" valign="middle" ><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x99.png" xlink:type="simple"/></inline-formula></td><td align="center" valign="middle" ><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x100.png" xlink:type="simple"/></inline-formula></td><td align="center" valign="middle" ><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x101.png" xlink:type="simple"/></inline-formula></td><td align="center" valign="middle" >3</td><td align="center" valign="middle" ><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x105.png" xlink:type="simple"/></inline-formula></td><td align="center" valign="middle" >4</td></tr>
<tr><td align="center" valign="middle" ><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x102.png" xlink:type="simple"/></inline-formula></td><td align="center" valign="middle" ><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x103.png" xlink:type="simple"/></inline-formula></td><td align="center" valign="middle" ><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x104.png" xlink:type="simple"/></inline-formula></td><td align="center" valign="middle" >5</td><td align="center" valign="middle" >6</td><td align="center" valign="middle" >7</td></tr>
</tbody></table></table-wrap>
<table-wrap id="table2" ><label><xref ref-type="table" rid="table2">Table 2</xref></label><caption><title> Some types of state</title></caption><table><thead><tr><th align="center" valign="middle" ><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x96.png" xlink:type="simple"/></inline-formula></th><th align="center" valign="middle" ><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x97.png" xlink:type="simple"/></inline-formula></th><th align="center" valign="middle" ><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x98.png" xlink:type="simple"/></inline-formula></th></tr></thead><tbody><tr><td align="center" valign="middle" ><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x99.png" xlink:type="simple"/></inline-formula></td><td align="center" valign="middle" ><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x100.png" xlink:type="simple"/></inline-formula></td><td align="center" valign="middle" ><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x101.png" xlink:type="simple"/></inline-formula></td></tr><tr><td align="center" valign="middle" ><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x102.png" xlink:type="simple"/></inline-formula></td><td align="center" valign="middle" ><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x103.png" xlink:type="simple"/></inline-formula></td><td align="center" valign="middle" ><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x104.png" xlink:type="simple"/></inline-formula></td></tr></tbody></table></table-wrap>
<p>The purpose of the experiment is to learn the policy in the unknown dynamic environments <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x106.png" xlink:type="simple"/></inline-formula> and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x107.png" xlink:type="simple"/></inline-formula> in three cases (fixed-obstacle, periodic dynamic, and nonperiodic dynamic environments) by employing either the profit-sharing method alone or the mixture probability scheme (elements are <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x108.png" xlink:type="simple"/></inline-formula> and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x109.png" xlink:type="simple"/></inline-formula>); the evaluation is based on the success rate over 2000 trials. The experimental parameters are shown in <xref ref-type="table" rid="table3">Table 3</xref>. Some of the known environments that became mixture probability elements and the unknown dynamic environments <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x110.png" xlink:type="simple"/></inline-formula> used to evaluate the policy improvement are shown in <xref ref-type="fig" rid="fig4">Figure 4</xref> and <xref ref-type="fig" rid="fig5">Figure 5</xref>, respectively.</p></sec><sec id="s3_3"><title>3.3. 
Discussion</title><p>The success rates of policy improvement in <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x111.png" xlink:type="simple"/></inline-formula> and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x112.png" xlink:type="simple"/></inline-formula>, obtained by using only the profit-sharing method and by using mixture probabilities with clustering, are shown in <xref ref-type="fig" rid="fig6">Figure 6</xref>. The processing time from Step 3 of the system flow until the end of the experiment, when using all 50 elements and when using only 35, 25 and 15 elements, is shown in <xref ref-type="table" rid="table4">Table 4</xref>.</p><p><xref ref-type="fig" rid="fig6">Figure 6</xref> shows that the immediate success rate obtained by policy improvement is higher than that obtained by only the profit-sharing method in all environments. This means that the agent adapts faster to an unknown environment, and the higher success rate persists until the experiments end. These results show that the success rate with policy improvement exceeds that of the profit-sharing method alone by more than 20% in <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x113.png" xlink:type="simple"/></inline-formula> and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x114.png" xlink:type="simple"/></inline-formula>, and by more than 30% in <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x115.png" xlink:type="simple"/></inline-formula>. So, we can say the policy improvement is effective in all environments.</p><p>The success rate obtained with only 15 elements is also higher than that of the profit-sharing method alone, but it is lower than the results with 25 and 35 elements. Hence, reducing the number of elements too much visibly weakens the policy improvement in all environments. 
However, although the success rate using all 50 elements was the highest, the rate obtained using 25 elements was almost the same. So, the decline in effectiveness can still be controlled even if the number of mixture probability elements is reduced by half.</p>
<fig id="fig4"  position="float"><label><xref ref-type="fig" rid="fig4">Figure 4</xref></label><caption><title> Some of known environments</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/3-9601278x116.png"/></fig>
<fig id="fig5"  position="float"><label><xref ref-type="fig" rid="fig5">Figure 5</xref></label><caption><title> Unknown environments</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/3-9601278x117.png"/></fig>
<fig id="fig6"  position="float"><label><xref ref-type="fig" rid="fig6">Figure 6</xref></label><caption><title> Transition of success rate</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/3-9601278x118.png"/></fig>
<table-wrap id="table3" ><label><xref ref-type="table" rid="table3">Table 3</xref></label><caption><title> Experimental parameters</title></caption><table><thead><tr><th align="center" valign="middle" >0</th><th align="center" valign="middle" >1</th><th align="center" valign="middle" >2</th></tr></thead><tbody><tr><td align="center" valign="middle" >3</td><td align="center" valign="middle" ><inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x105.png" xlink:type="simple"/></inline-formula></td><td align="center" valign="middle" >4</td></tr><tr><td align="center" valign="middle" >5</td><td align="center" valign="middle" >6</td><td align="center" valign="middle" >7</td></tr></tbody></table></table-wrap>
<table-wrap id="table4" ><label><xref ref-type="table" rid="table4">Table 4</xref></label><caption><title> Processing time</title></caption><table><thead><tr><th align="center" valign="middle" colspan="7">Position of obstacle and value</th></tr></thead><tbody>
<tr><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" ></td><td align="center" valign="middle" >…</td><td align="center" valign="middle" ></td><td align="center" valign="middle" >…</td><td align="center" valign="middle" ></td></tr>
<tr><td align="center" valign="middle" >0</td><td align="center" valign="middle" >1</td><td align="center" valign="middle" >2</td><td align="center" valign="middle" ></td><td align="center" valign="middle" >111</td><td align="center" valign="middle" ></td><td align="center" valign="middle" >255</td></tr>
</tbody></table></table-wrap>
<p>Furthermore, the results in <xref ref-type="table" rid="table4">Table 4</xref> show that reducing the number of elements reduced the processing time considerably. 
Hence, by using 25 elements, we can reduce the processing time without degrading policy improvement performance.</p><p><xref ref-type="fig" rid="fig7">Figure 7</xref> shows typical trajectories of the agent following the policy acquired while selecting data in environment <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x132.png" xlink:type="simple"/></inline-formula> for trials 1 - 500, 501 - 1000 and 1001 - 2000. The intensity of color (from light red to dark red) shows how frequently the agent’s trajectories reached the target in each layer.</p><p>These results show that, in the first 500 trials, the agent reached all sub-targets in the top layer. However, because an agent starting from sub-target 1 had the most difficulty reaching the next sub-target, the agent reached sub-target 1 less often in trials 501 - 1000, and in trials 1001 - 2000 it reached almost exclusively sub-targets 2 and 3. In the middle layer, the agent likewise reached all sub-targets in the first 500 trials. However, because an agent starting from sub-target 5 could reach the final goal more easily, the frequency of trajectories from sub-target 5 to the final goal clearly increased as the number of trials grew.</p><p>From these typical agent trajectories, we can say that, by using the pseudo-reward, the agent can choose more suitable rules for reaching the target in each layer, even when the chosen sub-target is harder to reach in some layer but makes the final goal easier to reach.</p></sec><sec id="s3_4"><title>3.4. 
Supplemental Experiments</title><p>These experiments were conducted to compare policy improvement performance in the cases of a fixed obstacle, periodic dynamic movement and nonperiodic dynamic movement in <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x133.png" xlink:type="simple"/></inline-formula> and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x134.png" xlink:type="simple"/></inline-formula>, using 25 elements. In addition, the experiments for the periodic and nonperiodic cases only were repeated five times with the same parameters.</p></sec></sec><sec id="s4"><title>4. Discussion</title><p>The results of policy improvement using 25 mixture probability elements in the three cases are shown in <xref ref-type="fig" rid="fig8">Figure 8</xref>, and the results of the five sets of experiments with periodic and nonperiodic dynamic movement are shown in <xref ref-type="fig" rid="fig9">Figure 9</xref>.</p><p><xref ref-type="fig" rid="fig8">Figure 8</xref> shows that, in both <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x135.png" xlink:type="simple"/></inline-formula> and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x136.png" xlink:type="simple"/></inline-formula>, the success rate in the case of periodic dynamic movement differed little from that of the fixed obstacle case in the early period, and it remained comparably high until the end of the experiments. 
On the other hand, in the case of nonperiodic dynamic movement, the success rate in <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x137.png" xlink:type="simple"/></inline-formula> differed little from, and was sometimes even higher than, that of the fixed obstacle case. However, as shown in <xref ref-type="fig" rid="fig9">Figure 9</xref>, even though the experiments were conducted with the same parameters, the results of the nonperiodic case in <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x138.png" xlink:type="simple"/></inline-formula> were quite low compared with those of the periodic case. 
Moreover, the results of the nonperiodic case were unstable in both <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x139.png" xlink:type="simple"/></inline-formula> and <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x140.png" xlink:type="simple"/></inline-formula>.</p><p>From these results, we can deduce that the agent successfully learns the policy in the periodic dynamic movement environment and can reach the target more easily when the obstacle moves out of the trajectory, as in <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x141.png" xlink:type="simple"/></inline-formula>. Conversely, when the obstacle moves into the trajectory, it becomes more difficult for the agent to reach the target.</p></sec><sec id="s5"><title>5. 
Conclusions</title><fig id="fig7"  position="float"><label><xref ref-type="fig" rid="fig7">Figure 7</xref></label><caption><title> Typical agent trajectories in <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x145.png" xlink:type="simple"/></inline-formula></title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/3-9601278x144.png"/></fig><fig id="fig8"  position="float"><label><xref ref-type="fig" rid="fig8">Figure 8</xref></label><caption><title> Transition of success rate (3 cases)</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/3-9601278x146.png"/></fig><fig id="fig9"  position="float"><label><xref ref-type="fig" rid="fig9">Figure 9</xref></label><caption><title> Five sets of experiments</title></caption><graphic mimetype="image"   position="float"  xlink:type="simple"  xlink:href="http://html.scirp.org/file/3-9601278x147.png"/></fig><p>In this research, we used the joint distributions <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x142.png" xlink:type="simple"/></inline-formula> as the knowledge and the sample distribution <inline-formula><inline-graphic xlink:href="http://html.scirp.org/file/3-9601278x143.png" xlink:type="simple"/></inline-formula> to find the degree of similarity between the unknown environment and each known environment. We then used this similarity as the basis for updating the initial knowledge, which is very useful for the agent in learning the policy in a changing environment. 
Even if obtaining the sample distribution is time-consuming, it is still worthwhile if the agent can efficiently learn the policy in an unknown dynamic environment.</p><p>Also, by using the clustering method to group similar elements and then selecting just one suitable joint distribution from each cluster as a mixture probability element, we can avoid using similar elements and thus maintain a variety of elements when we reduce their number.</p><p>From the results of the computer experiments, an example application to the agent navigation problem, we can confirm that using the mixture probabilities makes policy improvement effective in dynamic movement environments. Furthermore, by using the pseudo-reward, the agent can select suitable rules for reaching the target in multi-layer environments, and the decline in effectiveness of the policy improvement can be limited by using the clustering method. We conclude that our proposed system is effective for improving the stability and speed of policy learning and for controlling computational complexity.</p><p>In future work, the RL policy also needs to be improved by using mixture probabilities with positive and negative weight values, so that the system can adapt to unknown environments that are not similar to any known environment. Finally, a new reward process and a new mixing parameter are needed so that the agent can adjust to a changing environment more efficiently and work well in any complicated environment.</p></sec><sec id="s6"><title>Acknowledgements</title><p>This research was made possible through the help and support of the Honjo International Scholarship Foundation.</p></sec></body><back><ref-list><title>References</title><ref id="scirp.50485-ref1"><label>1</label><mixed-citation publication-type="other" xlink:type="simple">Sutton, R.S. and Barto, A.G. (1998) Reinforcement Learning: An Introduction. 
MIT Press, Cambridge.</mixed-citation></ref><ref id="scirp.50485-ref2"><label>2</label><mixed-citation publication-type="other" xlink:type="simple">Croonenborghs, T., Ramon, J., Blockeel, H. and Bruynooghe, M. (2006) Model-Assisted Approaches for Relational Reinforcement Learning: Some Challenges for the SRL Community. Proceedings of the ICML-2006 Workshop on Open Problems in Statistical Relational Learning, Pittsburgh.</mixed-citation></ref><ref id="scirp.50485-ref3"><label>3</label><mixed-citation publication-type="other" xlink:type="simple">Fernandez, F. and Veloso, M. (2006) Probabilistic Policy Reuse in a Reinforcement Learning Agent. Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multi-Agent Systems, New York, May 2006, 720-727. http://dx.doi.org/10.1145/1160633.1160762</mixed-citation></ref><ref id="scirp.50485-ref4"><label>4</label><mixed-citation publication-type="other" xlink:type="simple">Kober, J., Bagnell, J.A. and Peters, J. (2013) Reinforcement Learning in Robotics: A Survey. International Journal of Robotics Research, 32, 1238-1274. http://dx.doi.org/10.1177/0278364913495721</mixed-citation></ref><ref id="scirp.50485-ref5"><label>5</label><mixed-citation publication-type="other" xlink:type="simple">Kitakoshi, D., Shioya, H. and Nakano, R. (2004) Adaptation of the Online Policy-Improving System by Using a Mixture Model of Bayesian Networks to Dynamic Environments. Electronics, Information and Communication Engineers, 104, 15-20.</mixed-citation></ref><ref id="scirp.50485-ref6"><label>6</label><mixed-citation publication-type="other" xlink:type="simple">Kitakoshi, D., Shioya, H. and Nakano, R. (2010) Empirical Analysis of an On-Line Adaptive System Using a Mixture of Bayesian Networks. Information Sciences, 180, 2856-2874. http://dx.doi.org/10.1016/j.ins.2010.04.001</mixed-citation></ref><ref id="scirp.50485-ref7"><label>7</label><mixed-citation publication-type="other" xlink:type="simple">Phommasak, U., Kitakoshi, D. 
and Shioya, H. (2012) An Adaptation System in Unknown Environments Using a Mixture Probability Model and Clustering Distributions. Journal of Advanced Computational Intelligence and Intelligent Informatics, 16, 733-740.</mixed-citation></ref><ref id="scirp.50485-ref8"><label>8</label><mixed-citation publication-type="other" xlink:type="simple">Phommasak, U., Kitakoshi, D., Mao, J. and Shioya, H. (2014) A Policy-Improving System for Adaptability to Dynamic Environments Using Mixture Probability and Clustering Distribution. Journal of Computer and Communications, 2, 210-219. http://dx.doi.org/10.4236/jcc.2014.24028</mixed-citation></ref><ref id="scirp.50485-ref9"><label>9</label><mixed-citation publication-type="other" xlink:type="simple">Tanaka, F. and Yamamura, M. (1997) An Approach to Lifelong Reinforcement Learning through Multiple Environments. Proceedings of the Sixth European Workshop on Learning Robots, Brighton, 1-2 August 1997, 93-99.</mixed-citation></ref><ref id="scirp.50485-ref10"><label>10</label><mixed-citation publication-type="other" xlink:type="simple">Minato, T. and Asada, M. (1998) Environmental Change Adaptation for Mobile Robot Navigation. 1998 IEEE/RSJ International Conference on Intelligent Robots and Systems, 3, 1859-1864.</mixed-citation></ref><ref id="scirp.50485-ref11"><label>11</label><mixed-citation publication-type="other" xlink:type="simple">Ghavamzadeh, M. and Mahadevan, S. (2007) Hierarchical Average Reward Reinforcement Learning. The Journal of Machine Learning Research, 8, 2629-2669.</mixed-citation></ref><ref id="scirp.50485-ref12"><label>12</label><mixed-citation publication-type="other" xlink:type="simple">Kato, S. and Matsuo, H. (2000) A Theory of Profit Sharing in Dynamic Environment. 
Proceedings of the 6th Pacific Rim International Conference on Artificial Intelligence, Melbourne, 28 August-1 September 2000, 115-124.</mixed-citation></ref><ref id="scirp.50485-ref13"><label>13</label><mixed-citation publication-type="other" xlink:type="simple">Nakano, H., Takada, S., Arai, S. and Miyauchi, A. (2005) An Efficient Reinforcement Learning Method for Dynamic Environments Using Short Term Adjustment. International Symposium on Nonlinear Theory and Its Applications, Bruges, 18-21 October 2005, 250-253.</mixed-citation></ref><ref id="scirp.50485-ref14"><label>14</label><mixed-citation publication-type="journal" xlink:type="simple"><name name-style="western"><surname>Hellinger</surname><given-names> E. </given-names></name>,<etal>et al</etal>. (<year>1909</year>) <article-title>Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen</article-title>. <source>Journal für die Reine und Angewandte Mathematik</source>, <volume>136</volume>, <fpage>210</fpage>-<lpage>271</lpage>.<pub-id pub-id-type="doi"></pub-id></mixed-citation></ref><ref id="scirp.50485-ref15"><label>15</label><mixed-citation publication-type="other" xlink:type="simple">Pearl, J. (1988) Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Pub. Inc., San Francisco.</mixed-citation></ref></ref-list></back></article>