Building POMDPs
A Partially Observable Markov Decision Process (POMDP) is an MDP (see the previous notebooks) where the agent cannot see which state the model is currently in. It only knows which actions it can take in the current state, and possibly an observation for that state, and it has to make its decisions based on these.
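Slightly more formally (this is the common textbook formulation; stormvogel's internal representation may differ in the details), a POMDP extends an MDP with a set of observations $Z$ and an observation function that tells the agent what it sees in each state:

$$\mathcal{M} = (S, A, P, R, Z, O), \qquad O : S \to Z.$$

The agent only ever sees the observation $O(s)$ of the current state $s$, never $s$ itself.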
Note that when we refer to MDPs, we usually mean fully observable MDPs, as opposed to POMDPs.
We introduce a simple example to illustrate the difference between an MDP and a POMDP. The idea is that a coin is flipped while the agent is not looking, and the agent then has to guess whether it came up heads or tails. We first construct an MDP.
[1]:
from stormvogel import *

# The initial state: the coin still has to be flipped.
init = pgc.State(x=["flip"])

def available_actions(s: pgc.State):
    # Once the coin has landed, the agent can guess heads or tails.
    if "heads" in s.x or "tails" in s.x:
        return [pgc.Action(["guess", "heads"]), pgc.Action(["guess", "tails"])]
    # In all other states there is only a single (empty) action.
    return [pgc.Action([])]

def delta(s: pgc.State, a: pgc.Action):
    # Flipping the coin yields heads or tails with equal probability.
    if s == init:
        return [(0.5, pgc.State(x=["heads"])), (0.5, pgc.State(x=["tails"]))]
    elif "guess" in a.labels:
        # The guess is correct iff it matches the side the coin landed on.
        if ("heads" in s.x and "heads" in a.labels) or ("tails" in s.x and "tails" in a.labels):
            return [(1, pgc.State(x=["correct", "done"]))]
        else:
            return [(1, pgc.State(x=["wrong", "done"]))]
    else:
        # The 'done' states loop back to themselves.
        return [(1, s)]

labels = lambda s: s.x

def rewards(s: pgc.State, a: pgc.Action):
    # Reaching the 'correct' state yields a reward of 100; everything else yields 0.
    if "correct" in s.x:
        return {"r1": 100}
    return {"r1": 0}

coin_mdp = pgc.build_pgc(
    delta=delta,
    initial_state_pgc=init,
    available_actions=available_actions,
    labels=labels,
    modeltype=ModelType.MDP,
    rewards=rewards
)
vis = show(coin_mdp)
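As a quick sanity check, we can list the states of the model we just built together with their available actions. (This is just a small sketch; it only relies on the `states` dictionary and the `available_actions()` accessor that we also use further down in this notebook.)

# Print every state of the built MDP together with its available actions.
# (Relies only on the `states` dict and `available_actions()` used later in this notebook.)
for id, state in coin_mdp.states.items():
    print(id, state.available_actions())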
This MDP is fully observable, which means that the agent can actually see what state the world is in. In other words, the agent knows whether the coin landed heads or tails. If we ask stormpy to calculate the policy that maximizes the reward, we see that the agent always guesses correctly because of this information. The chosen actions are highlighted in red. (More on model checking later.)
[2]:
# Ask for the policy that maximizes the expected reward.
result = model_checking(coin_mdp, 'Rmax=? [S]')
vis2 = show(coin_mdp, result=result)
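Besides expected rewards, we can also check other properties of the fully observable model. As a small sketch (the property string below is an assumption; it follows the usual Storm/PRISM notation), the maximal probability of eventually reaching a state labelled "correct" should be 1, because the agent can see the outcome of the flip before guessing:

# Maximal probability of eventually guessing correctly.
# For the fully observable MDP we expect this to be 1.
# Note: the property syntax below is an assumption; it follows standard Storm/PRISM notation.
result_reach = model_checking(coin_mdp, 'Pmax=? [F "correct"]')
vis_reach = show(coin_mdp, result=result_reach)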
To model the fact that our agent cannot observe the state, we need a POMDP! (Note that we re-use a lot of the code from before.)
[3]:
def observations(s: pgc.State):
    # Every state is mapped to the same observation (0),
    # so the agent cannot distinguish any of the states.
    return 0

coin_pomdp = pgc.build_pgc(
    delta=delta,
    initial_state_pgc=init,
    available_actions=available_actions,
    labels=labels,
    modeltype=ModelType.POMDP,
    rewards=rewards,
    observations=observations
)
vis3 = show(coin_pomdp)
Unfortunately, model checking POMDPs turns out to be very hard in general, and in some cases even undecidable. For this model, the result of model checking would look similar to the following. The agent does not know whether it is currently in the heads or the tails state, so it simply guesses heads and only wins with a 50 percent chance.
[4]:
import stormvogel.result

# Build a scheduler by hand that always picks the first available action,
# which corresponds to always guessing heads.
taken_actions = {}
for id, state in coin_pomdp.states.items():
    taken_actions[id] = state.available_actions()[0]
scheduler2 = stormvogel.result.Scheduler(coin_pomdp, taken_actions)

# The expected reward under this scheduler: 50 in the states before the guess is resolved,
# 100 in the 'correct' state and 0 in the 'wrong' state.
values = [50, 50, 50, 100.0, 0.0]
result2 = stormvogel.result.Result(coin_pomdp, values, scheduler2)
vis4 = show(coin_pomdp, result=result2)
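The value 50 assigned to the states before the guess is resolved is exactly the expected reward of guessing blindly: the coin is fair, so a fixed guess of heads is correct with probability 0.5.

# Expected reward of always guessing heads on a fair coin:
# correct with probability 0.5 (reward 100), wrong with probability 0.5 (reward 0).
expected_reward = 0.5 * 100 + 0.5 * 0
print(expected_reward)  # 50.0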
[ ]: