Comments (4)

ShangtongZhang commented on August 23, 2024

Figure 4.1 is only policy evaluation, not policy iteration, so we should not change the random policy. And even if you do policy iteration, I do not see any reason to use a softmax unless you are working on soft Q-learning. https://arxiv.org/abs/1704.06440
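To make the distinction concrete, here is a minimal sketch of policy evaluation alone, assuming the 4x4 gridworld of Example 4.1 (terminal states in two opposite corners, reward -1 per step, no discounting). The names `step` and `evaluate_random_policy` are illustrative and not taken from the repository. Note that the policy stays fixed at 0.25 per action for the whole run:

```python
import numpy as np

SIZE = 4                                      # assumed: 4x4 gridworld of Example 4.1
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
TERMINALS = {(0, 0), (SIZE - 1, SIZE - 1)}

def step(state, action):
    """Deterministic move; stepping off the grid leaves the state unchanged."""
    if state in TERMINALS:
        return state, 0.0
    r, c = state[0] + action[0], state[1] + action[1]
    if not (0 <= r < SIZE and 0 <= c < SIZE):
        r, c = state
    return (r, c), -1.0

def evaluate_random_policy(theta=1e-6, gamma=1.0):
    """In-place iterative policy evaluation of the equiprobable random policy."""
    V = np.zeros((SIZE, SIZE))
    while True:
        delta = 0.0
        for r in range(SIZE):
            for c in range(SIZE):
                if (r, c) in TERMINALS:
                    continue
                v = V[r, c]
                new_value = 0.0
                for a in ACTIONS:
                    nxt, reward = step((r, c), a)
                    # The action probability is always 0.25: the policy never changes.
                    new_value += 0.25 * (reward + gamma * V[nxt])
                V[r, c] = new_value
                delta = max(delta, abs(v - V[r, c]))
        if delta < theta:
            return V

V = evaluate_random_policy()
print(np.round(V, 1))
```

Under these assumptions the top row should come out close to 0, -14, -20, -22, which matches the converged values in Figure 4.1.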

from reinforcement-learning-an-introduction.

cbrom commented on August 23, 2024

So, by changing the action probabilities I am doing policy iteration, and using softmax is not the right way?

ShangtongZhang commented on August 23, 2024

yes

cbrom commented on August 23, 2024

Hey, I came to realize that for it to be policy iteration it must follow this algorithm:

  1. Initialization
    V(s) ∈ R and π(s) ∈ A(s) arbitrarily for all s ∈ S
  2. Policy Evaluation
    Repeat
      ∆ ← 0
      For each s ∈ S:
        v ← V(s)
        V(s) ← Σ_{s',r} p(s', r | s, π(s)) [r + γ V(s')]
        ∆ ← max(∆, |v − V(s)|)
    until ∆ < θ (a small positive number)
  3. Policy Improvement
    policy-stable ← true
    For each s ∈ S:
      old-action ← π(s)
      π(s) ← argmax_a Σ_{s',r} p(s', r | s, a) [r + γ V(s')]
      If old-action ≠ π(s), then policy-stable ← false
    If policy-stable, then stop and return V ≈ v* and π ≈ π*; else go to 2
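The three steps above can be sketched roughly as follows, again assuming the 4x4 gridworld of Example 4.1 with deterministic moves. Two choices here are assumptions of this sketch rather than part of the pseudocode: a discount of 0.9 (so that evaluating an arbitrary deterministic policy that never reaches a terminal state still converges), and a small tolerance in the improvement step so that floating-point ties between equally good actions do not flip the policy forever. All names are illustrative.

```python
import numpy as np

SIZE = 4
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
TERMINALS = {(0, 0), (SIZE - 1, SIZE - 1)}
STATES = [(r, c) for r in range(SIZE) for c in range(SIZE)]

def step(state, action):
    """Deterministic move; stepping off the grid leaves the state unchanged."""
    if state in TERMINALS:
        return state, 0.0
    r, c = state[0] + action[0], state[1] + action[1]
    if not (0 <= r < SIZE and 0 <= c < SIZE):
        r, c = state
    return (r, c), -1.0

def q_values(V, state, gamma):
    """One-step lookahead r + gamma * V(s') for every action."""
    out = []
    for a in ACTIONS:
        nxt, reward = step(state, a)
        out.append(reward + gamma * V[nxt])
    return np.array(out)

def policy_iteration(gamma=0.9, theta=1e-10, tie_tol=1e-6):
    V = np.zeros((SIZE, SIZE))
    policy = {s: 0 for s in STATES}   # 1. arbitrary start: always "up"
    while True:
        # 2. Policy evaluation: sweep until the values of the CURRENT policy settle.
        while True:
            delta = 0.0
            for s in STATES:
                v = V[s]
                V[s] = q_values(V, s, gamma)[policy[s]]
                delta = max(delta, abs(v - V[s]))
            if delta < theta:
                break
        # 3. Policy improvement: switch only to a clearly better action.
        policy_stable = True
        for s in STATES:
            q = q_values(V, s, gamma)
            best = int(np.argmax(q))
            if q[best] > q[policy[s]] + tie_tol:
                policy[s] = best
                policy_stable = False
        if policy_stable:
            return V, policy

V, policy = policy_iteration()
print(np.round(V, 2))
```

The point to notice is the nesting: the evaluation loop runs to convergence for a fixed policy before any action is changed, and the policy is always a single action per state, never a softmax distribution.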

Policy iteration needs a complete policy evaluation followed by one policy improvement step,
but it seems that what my code actually did was value iteration:

Repeat
  ∆ ← 0
  For each s ∈ S:
    v ← V(s)
    V(s) ← max_a Σ_{s',r} p(s', r | s, a) [r + γ V(s')]
    ∆ ← max(∆, |v − V(s)|)
until ∆ < θ (a small positive number)
Output a deterministic policy, π ≈ π*, such that
  π(s) = argmax_a Σ_{s',r} p(s', r | s, a) [r + γ V(s')]
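For comparison, a minimal sketch of value iteration on the same assumed 4x4 gridworld (undiscounted, reward -1 per step, deterministic moves). The function names are illustrative. There is no explicit policy during the sweeps; the max over actions is folded into the value update, and a greedy policy is read off only at the end:

```python
import numpy as np

SIZE = 4
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
TERMINALS = {(0, 0), (SIZE - 1, SIZE - 1)}
STATES = [(r, c) for r in range(SIZE) for c in range(SIZE)]

def step(state, action):
    """Deterministic move; stepping off the grid leaves the state unchanged."""
    if state in TERMINALS:
        return state, 0.0
    r, c = state[0] + action[0], state[1] + action[1]
    if not (0 <= r < SIZE and 0 <= c < SIZE):
        r, c = state
    return (r, c), -1.0

def lookahead(V, state, gamma):
    """r + gamma * V(s') for every action from `state`."""
    values = []
    for a in ACTIONS:
        nxt, reward = step(state, a)
        values.append(reward + gamma * V[nxt])
    return values

def value_iteration(gamma=1.0, theta=1e-9):
    V = np.zeros((SIZE, SIZE))
    while True:
        delta = 0.0
        for s in STATES:
            v = V[s]
            V[s] = max(lookahead(V, s, gamma))   # max over actions, no policy involved
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break
    # Output a deterministic greedy policy (index into ACTIONS) from the final values.
    policy = {s: int(np.argmax(lookahead(V, s, gamma))) for s in STATES}
    return V, policy

V, policy = value_iteration()
print(V)
```

In this setting each optimal value is just minus the number of steps to the nearest terminal corner, which gives a quick sanity check on the output.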

In my code, I used softmax to set the probability of each action.
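A small sketch of why that differs from the argmax in the algorithms above. The action values in `q` are made-up numbers for a single state, and `softmax_policy` and `greedy_policy` are hypothetical helper names:

```python
import numpy as np

def softmax_policy(q, temperature=1.0):
    """Stochastic policy: every action keeps some probability."""
    z = (q - np.max(q)) / temperature   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def greedy_policy(q):
    """Deterministic policy: all probability on the first maximizing action."""
    p = np.zeros(len(q))
    p[int(np.argmax(q))] = 1.0
    return p

# Hypothetical action values for one state (up, down, left, right).
q = np.array([-1.9, -2.71, -1.0, -2.71])

print("softmax:", np.round(softmax_policy(q), 3))
print("greedy: ", greedy_policy(q))
```

The softmax version still spreads probability over worse actions, so it is neither the fixed random policy of Figure 4.1 nor the greedy policy that policy iteration and value iteration produce.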
