Understanding The Value Iteration Algorithm of Markov Decision Processes
Let B be your current balance.

If you choose to roll, the expected reward is 2.5 - B * 0.5: with probability 1/2 you roll a 1, 2, or 3 and lose your balance B, and with probability 1/6 each you gain 4, 5, or 6, for an expected gain of (4 + 5 + 6) / 6 = 2.5.

If you choose not to roll, the expected reward is 0.

So the policy is this: if B < 5, roll. Otherwise, don't.

And the expected reward on each step when following that policy is V = max(0, 2.5 - B * 0.5).
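As a quick sanity check of that arithmetic, here is a minimal Python sketch; the function name expected_roll_reward is my own, not something from the question's code:

```python
def expected_roll_reward(balance):
    # Rolling 1, 2, or 3 (probability 1/2 in total) loses the whole balance;
    # rolling 4, 5, or 6 (probability 1/6 each) adds that amount.
    return (4 + 5 + 6) / 6 - 0.5 * balance

for b in range(7):
    print(b, expected_roll_reward(b), expected_roll_reward(b) > 0)
# The expected reward of rolling drops to 0 at B = 5, where stopping becomes optimal.
```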
Now, if you want to express it in terms of the Bellman equation, you need to incorporate the balance into the state. Let the state <Balance, GameIsOver> consist of the current balance and a flag that indicates whether the game is over.
- Action stop:
  - turns the state <B, false> into <B, true>
- Action roll:
  - turns <B, false> into <0, true> with probability 1/2
  - turns <B, false> into <B + 4, false> with probability 1/6
  - turns <B, false> into <B + 5, false> with probability 1/6
  - turns <B, false> into <B + 6, false> with probability 1/6
- No action can turn <B1, true> into <B2, false>
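In code, this transition model might look like the following minimal sketch; the tuple representation of states and the name transitions are assumptions of mine, not part of the original formulation:

```python
from typing import List, Tuple

State = Tuple[int, bool]  # (balance, game_is_over), mirroring <B, GameIsOver>

def transitions(state: State, action: str) -> List[Tuple[float, State]]:
    """Return the (probability, next_state) pairs for the dice MDP."""
    balance, game_is_over = state
    if game_is_over:
        return []  # terminal: nothing leads back to <B, false>
    if action == "stop":
        return [(1.0, (balance, True))]
    if action == "roll":
        return [(1 / 2, (0, True)),             # roll 1-3: bankroll lost, game over
                (1 / 6, (balance + 4, False)),  # roll 4
                (1 / 6, (balance + 5, False)),  # roll 5
                (1 / 6, (balance + 6, False))]  # roll 6
    raise ValueError(f"unknown action: {action}")
```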
Using the notation from here:
π(<B, false>) = "roll", if B < 5
π(<B, false>) = "stop", if B >= 5
V(<B, false>) = 2.5 - B * 0.5, if B < 5
V(<B, false>) = 0, if B >= 5
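These four lines can be reproduced from the transition model with a one-step lookahead, taking a transition's reward to be the resulting change in balance. A self-contained sketch (names again my own; note that this V is the expected reward of a single step, as stated above, not the total return of the game):

```python
def roll_outcomes(balance):
    # (probability, next_balance) pairs for the "roll" action.
    return [(1 / 2, 0),
            (1 / 6, balance + 4),
            (1 / 6, balance + 5),
            (1 / 6, balance + 6)]

def one_step_reward(balance, action):
    if action == "stop":
        return 0.0
    # Expected change in balance after one roll.
    return sum(p * (b2 - balance) for p, b2 in roll_outcomes(balance))

def pi(balance):
    # Ties are broken in favour of "stop", which matches the policy at B = 5.
    return max(("stop", "roll"), key=lambda a: one_step_reward(balance, a))

def V(balance):
    return max(one_step_reward(balance, a) for a in ("stop", "roll"))

for b in (0, 4, 5, 6):
    print(b, pi(b), V(b))  # "roll" and 2.5 - 0.5 * B while B < 5, else "stop" and 0
```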
Sam Hammamy
Updated on June 16, 2022

Comments

- Sam Hammamy, almost 2 years ago:
In learning about MDPs I am having trouble with value iteration. Conceptually this example is very simple and makes sense: if you have a six-sided die, and you roll a 4, a 5, or a 6, you keep that amount in $, but if you roll a 1, a 2, or a 3, you lose your bankroll and the game ends.

In the beginning you have $0, so the choice between rolling and not rolling is:

```
k = 1
If I roll:       1/6*0 + 1/6*0 + 1/6*0 + 1/6*4 + 1/6*5 + 1/6*6 = 2.5
If I don't roll: 0
Since 2.5 > 0, I should roll.

k = 2
If I roll and get a 4:
    If I roll again: 4 + 1/6*(-4) + 1/6*(-4) + 1/6*(-4) + 1/6*4 + 1/6*5 + 1/6*6 = 4.5
    If I don't roll: 4
    Since 4.5 is greater than 4, I should roll.
If I roll and get a 5:
    If I roll again: 5 + 1/6*(-5) + 1/6*(-5) + 1/6*(-5) + 1/6*4 + 1/6*5 + 1/6*6 = 5
    If I don't roll: 5
    Since the difference is 0, I should not roll.
If I roll and get a 6:
    If I roll again: 6 + 1/6*(-6) + 1/6*(-6) + 1/6*(-6) + 1/6*4 + 1/6*5 + 1/6*6 = 5.5
    If I don't roll: 6
    Since the difference is -0.5, I should not roll.
```
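(As a quick arithmetic check, these hand computations can be reproduced with a tiny Python helper; roll_value is a made-up name:)

```python
def roll_value(balance):
    # Expected value of one more roll: keep the banked balance, lose it with
    # probability 1/2 (faces 1-3), or add the face value with probability 1/6 each.
    return balance + 0.5 * (-balance) + (4 + 5 + 6) / 6

for b in (0, 4, 5, 6):
    print(b, roll_value(b))  # 2.5, 4.5, 5.0, 5.5 -- matching the values above
```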
What I am having trouble with is converting that into Python code. It's not that I am bad with Python; rather, my understanding of the pseudocode may be wrong, even though the Bellman equation itself makes sense to me.
I borrowed the Berkeley code for value iteration and modified it to:

```python
isBadSide = [1, 1, 1, 0, 0, 0]

def R(s):
    if isBadSide[s - 1]:
        return -s
    return s

def T(s, a, N):
    return [(1. / N, s)]

def value_iteration(N, epsilon=0.001):
    "Solving an MDP by value iteration. [Fig. 17.4]"
    U1 = dict([(s, 0) for s in range(1, N + 1)])
    while True:
        U = U1.copy()
        delta = 0
        for s in range(1, N + 1):
            U1[s] = R(s) + max([sum([p * U[s1] for (p, s1) in T(s, a, N)])
                                for a in ('s', 'g')])
            delta = max(delta, abs(U1[s] - U[s]))
        if delta < epsilon:
            return U

print(value_iteration(6))
# {1: -1.1998456790123457, 2: -2.3996913580246915, 3: -3.599537037037037,
#  4: 4.799382716049383, 5: 5.999228395061729, 6: 7.199074074074074}
```
This gives the wrong answer. Where is the bug in this code? Or is it an issue with my understanding of the algorithm?