Module 2 - Material - MDPs and POMDPs
This module is about control and decision making. If you follow the order of topics suggested by us, you will already have looked at some control schemes, such as PID control, which is a form of low-level decision making.
MDP
A common way to formalise slightly more high-level control/decision-making problems is to use Markov Decision Processes (MDPs). The "Markov" in MDP comes from the Markov assumption, which says that we can make a prediction about the next state using only information about the current state; there is no benefit in keeping track of any other information about the past. A state here can be more or less anything. In order to define an MDP we also need the so-called transition function, which describes the possible outcomes of different actions. Thanks to the Markov assumption this transition function can be written as a probability function P(s'|s,a): the probability of ending up in state s' after performing action a in state s. This is sometimes written as T(s,a,s'). Another component of an MDP is the reward function R(s), which specifies the reward the agent gets in a certain state. This reward function is typically not something that is given by the problem, but rather something that you, the engineer, have to come up with in order to get desirable results. Finally, an MDP requires the initial state of the agent. So, in summary, an MDP is defined by
- Initial state s_0
- Transition function P(s'|s,a)
- Reward function R(s)
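As a concrete illustration, here is how these three ingredients could be written down for a tiny, made-up MDP. The Python sketch below (not part of the course material) uses plain dictionaries; the same representation is reused in the sketches further down this page.

```python
# A tiny, hypothetical two-state MDP written out explicitly.
# P[(s, a)] maps each possible next state s' to its probability P(s'|s, a).
s0 = "A"                                  # initial state
P = {
    ("A", "stay"): {"A": 1.0},
    ("A", "go"):   {"A": 0.2, "B": 0.8},  # outcomes of actions are uncertain
    ("B", "stay"): {"B": 1.0},
    ("B", "go"):   {"A": 0.1, "B": 0.9},
}
R = {"A": 0.0, "B": 1.0}                  # reward R(s) for being in each state
```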
Because the outcomes of actions are uncertain, the best solution to an MDP cannot be a fixed plan (sequence of actions). It is a policy that maps a state to an action. This is often denoted as a = π(s). One of the consequences of this is that if you perform repeated experiments you will get different realizations because the transition function will take you to different sequences of states.
To determine how good a certain policy is, we calculate the sum of rewards; this is sometimes referred to as the utility. To handle infinite sequences we typically use the discounted reward, where we form a geometric sum of the rewards
U\left(s_0,s_1,s_2,\ldots\right)\:=\:R\left(s_0\right)\:+\:\gamma R\left(s_1\right)\:+\:\gamma^2 R\left(s_2\right)\:+\:\ldots
where the discount factor, γ, is between 0 and 1.
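As a small worked example of this geometric sum (the reward sequence and the value of the discount factor below are made up for illustration):

```python
def discounted_utility(rewards, gamma):
    """U = R(s0) + gamma*R(s1) + gamma^2*R(s2) + ... for one finite realization."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Reward 1 in three consecutive states with gamma = 0.9 gives 1 + 0.9 + 0.81 = 2.71.
print(discounted_utility([1, 1, 1], 0.9))
```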
To find the best policy, π∗, we maximize the expected utility (remember that the actions are non-deterministic)
\pi^{\ast}\:=\:\underset{\pi}{\arg\max}\:E\left(\sum_t\gamma^t R\left(s_t\right)|\pi\right)
The solution to this problem comes via the insight that the value of a state s, V(s), is the immediate reward for that state plus the discounted expected utility of the next states
V\left(s\right)\:=\:R\left(s\right)\:+\:\gamma\:\underset{a}\max\sum_{s'}P\left(s'|s,a\right)V\left(s'\right)
Algorithm: Value iteration
- Initialize V(s) for all s (to any value)
- Loop until the policy has converged
  - Loop over all states, s
    - Loop over all actions, a
      - Q\left(s,a\right)\:=\:R\left(s\right)\:+\:\gamma\sum_{s'}P\left(s'|s,a\right)V\left(s'\right)
    - V(s)\:=\:\underset{a}\max\:Q(s,a)
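The algorithm can be written out in a few lines. The following is a minimal Python sketch (not the Matlab/Octave script provided in the exercise further down) that assumes the dictionary representation from the small example earlier, with states and actions given as lists; as a practical simplification it checks convergence on the values rather than on the policy.

```python
def value_iteration(states, actions, P, R, gamma, tol=1e-6):
    """Repeat the Bellman update until V stops changing (within tol)."""
    V = {s: 0.0 for s in states}              # initialize V(s) arbitrarily (here: zeros)
    while True:
        V_new = {}
        for s in states:
            # Q(s, a) = R(s) + gamma * sum_s' P(s'|s, a) * V(s')
            qs = [R[s] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
                  for a in actions if (s, a) in P]
            # States with no outgoing actions in P (e.g. terminal states) keep their reward.
            V_new[s] = max(qs) if qs else R[s]
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new

def greedy_policy(states, actions, P, R, V, gamma):
    """Extract the policy pi(s) = argmax_a Q(s, a) from a converged value function."""
    pi = {}
    for s in states:
        available = [a for a in actions if (s, a) in P]
        if available:
            pi[s] = max(available,
                        key=lambda a: R[s] + gamma * sum(p * V[s2]
                                                         for s2, p in P[(s, a)].items()))
    return pi
```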
One of the central insights in reinforcement learning, which builds on MDPs, is that one can arrive at a policy without actually knowing the transition function, by learning the Q-function directly.
The equation used to calculate V(s) above is an example of the Bellman equation
V_{i+1}\left(s\right)\:=\:R\left(s\right)\:+\:\gamma\:\underset{a}\max\sum_{s'}P\left(s'|s,a\right)V_i\left(s'\right)
which can be shown to converge to a unique optimal solution V.
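As mentioned above, reinforcement learning can learn the Q-function directly from experienced transitions (s, a, r, s') instead of from a known transition function. A standard tabular Q-learning update (a generic textbook formulation, not something derived in this module; the learning rate alpha is an assumed tuning parameter) looks roughly like this sketch:

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One temporal-difference update of a tabular Q-function after observing (s, a, r, s')."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```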
A classical example
Below follows a classical example for MDPs. The world consists of a 4x3 grid. The agent starts in the lower left corner and the goal is to move to the upper right corner, where the reward is +1. It should also avoid getting stuck in a trap, for which the reward is -1 (i.e. a penalty). The allowed actions are up, down, left and right. The agent cannot move outside of the world or into the obstacle at [2,2].
To formulate this as an MDP we need to define the reward in each state. Let us say that R(s)=r for all states but the two terminal ones. We also need the transition function P. Let us assume that the probability is 0.8 that the agent will move in the desired direction and 0.1 that it will move 90 degrees to the left or to the right of the desired direction, respectively. (If the agent is unable to move to the new location, it will remain in the old state.)
Depending on the value of r we get different policies.
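A sketch of how this world could be encoded in the dictionary form used by the value_iteration sketch above is shown below. The helper names are made up, and the exact coordinates of the goal and the trap are not given in the text, so they follow the classical version of the example (+1 at (4,3), -1 at (4,2)).

```python
# 4x3 grid world: columns 1..4, rows 1..3, obstacle at (2, 2), start at (1, 1).
COLS, ROWS = 4, 3
OBSTACLE = (2, 2)
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}   # assumed goal/trap cells (classical example)
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
SIDE = {"up": ("left", "right"), "down": ("right", "left"),
        "left": ("down", "up"), "right": ("up", "down")}   # 90 degrees to either side

states = [(x, y) for x in range(1, COLS + 1) for y in range(1, ROWS + 1) if (x, y) != OBSTACLE]
actions = list(MOVES)

def step(s, move):
    """Deterministic effect of a move; stay put if it would leave the grid or hit the obstacle."""
    nxt = (s[0] + MOVES[move][0], s[1] + MOVES[move][1])
    return nxt if nxt in states else s

def build_P():
    """Transition model: 0.8 in the intended direction, 0.1 to each side of it."""
    P = {}
    for s in states:
        if s in TERMINALS:
            continue                        # no actions lead out of a terminal state
        for a in actions:
            dist = {}
            for move, prob in [(a, 0.8), (SIDE[a][0], 0.1), (SIDE[a][1], 0.1)]:
                s2 = step(s, move)
                dist[s2] = dist.get(s2, 0.0) + prob
            P[(s, a)] = dist
    return P

def build_R(r):
    """Reward r in every non-terminal state, +1/-1 in the terminal cells."""
    return {s: TERMINALS.get(s, r) for s in states}

# Mirrors value_iteration(-0.04, +1, -1, 0.9999) in the exercise below.
P, R = build_P(), build_R(-0.04)
V = value_iteration(states, actions, P, R, gamma=0.9999)
pi = greedy_policy(states, actions, P, R, V, gamma=0.9999)
```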
Exercise
Investigate this using the Matlab / Octave script value_iteration.m, which you would execute with, for example,
value_iteration(-0.04, +1, -1, 0.9999)
where the first argument is r and the last is \gamma. The two figures illustrate V_i(s) and the optimal action a. Be ready to describe for the quiz how changing r and \gamma affects the policy.
POMDP
As we saw in Module 1 on Sensing and Perception, it is actually really hard to find out your exact state. This means that the state is associated with some uncertainty, so one of the assumptions for MDPs fails: that of full state observability. If we only have a partially observable state we need to solve a Partially Observable MDP, or POMDP, instead. (Sometimes the approximation that you live in an MDP world is good enough, e.g. if the state uncertainty is small. The MDP assumption sure is practical, for example if you want to produce impressive learning results in simulations, and don't mind leaving the real world to others :-))
Replacing the direct state observation that was available for the MDP, we here have an observation function O(s,o), which is another way to write P(o|s), i.e. it gives the probability of getting a certain observation o in a state s. Since the state s is not known and we only get uncertain information about it, we instead need to maintain a so-called belief state b(s), which assigns a probability to each state.
In the example above we would start with the distribution (1,0,0,0,0,0,0,0,0,0,0), i.e. probability 1 that we are in the start state and 0 in all others. As soon as we execute the first action this distribution will be spread out: 0.8 of the probability mass moves in the direction that the action defined and 0.1 ends up at each of two other spots. If we make use of the knowledge encoded in the observation we get the following update rule for the belief state
b'\left(s'\right)\:=\:\alpha\:P\left(o|s'\right)\sum_s P\left(s'|s,a\right)b\left(s\right)
(where \alpha is a normalizing constant that makes the elements b'(s') sum to 1). When solving a POMDP the key insight is that the optimal action depends on the belief state, not on the actual state. That is, the optimal policy is a function that maps a belief state to an action, \pi^{\ast}\left(b\right).
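A sketch of the belief update step, in the same dictionary representation as before, might look as follows. Here O is assumed to be a dictionary mapping (s, o) to P(o|s); that is just one convenient choice for storing the observation model, not something prescribed by the text.

```python
def update_belief(b, a, o, states, P, O):
    """Bayes filter step: b'(s') = alpha * P(o|s') * sum_s P(s'|s, a) * b(s)."""
    # Prediction: push the current belief through the transition model for action a.
    predicted = {s2: sum(P[(s, a)].get(s2, 0.0) * b[s] for s in b if (s, a) in P)
                 for s2 in states}
    # Correction: weight by the likelihood of the received observation o.
    unnormalized = {s2: O[(s2, o)] * predicted[s2] for s2 in states}
    alpha = 1.0 / sum(unnormalized.values())   # assumes o is possible under the current belief
    return {s2: alpha * p for s2, p in unnormalized.items()}
```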
An agent powered by a POMDP would then cycle through
- Execute the action a=\pi^{\ast}\left(b\right)
- Receive observation o
- Update the belief state b given action a and observation o
POMDPs are very challenging to solve. If we turn back to our example, we notice that in the fully observable MDP case we had 1 state variable that could take 11 different discrete values. In the partially observable case we have an 11-dimensional continuous state space. For an even worse example: If your state s is a single real-valued number, the belief state b will be a member of the space of all probability distributions over R, an infinite-dimensional object.
Alternative sources of information
You can find the information above in a slightly extended version in these slides: AS1-M2-SeqDecisionMaking.pdf. You can also watch the following videos (total ~25 min) that discuss these two concepts using the same problems as in the slides and the text above.
MDPs
- Markov Decision Processes - Georgia Tech - Machine Learning (2 min)
- Markov Decision Processes Two - Georgia Tech - Machine Learning (4 min)
- Markov Decision Processes Three - Georgia Tech - Machine Learning (5 min)
- Markov Decision Processes Four - Georgia Tech - Machine Learning (7 min)
POMDPs
- POMDPs (3 min)
- POMDP Example - 1 (2 min)
Additional Reading:
A survey paper on POMDPs, written for behavioral scientists, is available here.
The POMDP model was introduced by Åström in 1965 in this paper, where the optimal control of POMDPs (with a finite discrete state space) was determined.