# Markov Process

Related to Markov Chain.

Markov decision processes formally describe an environment for reinforcement learning where the environment is fully observable,

- i.e. the current state completely characterizes the process
- Almost all RL problems can be formalized as MDPs, e.g.
    - Optimal control primarily deals with continuous MDPs
    - Partially observable problems can be converted into MDPs
    - Bandits are MDPs with one state

## Markov Property

Definition

A state $S_{t}$ is Markov if and only if $P[S_{t+1} \mid S_{t}] = P[S_{t+1} \mid S_{1}, \ldots, S_{t}]$, i.e. the state captures all relevant information from the history.

- “The future is independent of the past given the present”
- Once the state is known, the history may be thrown away
- i.e. The state is a sufficient statistic of the future

### State Transition Probability

“What is the probability of transitioning from state $s$ to state $s'$?”

For a Markov state $s$ and successor state $s'$, the state transition probability is defined by $P_{ss'} = P[S_{t+1} = s' \mid S_{t} = s]$

The state transition matrix $P$ defines transition probabilities from all states $s$ to all successor states $s'$, where $P = \begin{bmatrix} P_{11} & \dots & P_{1n} \\ \vdots & \ddots & \vdots \\ P_{n1} & \dots & P_{nn} \end{bmatrix}$
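A minimal sketch of a transition matrix in code (the states and probabilities are made up for illustration): each row of $P$ must sum to 1, and sampling a successor state just means drawing from the row of the current state.

```python
import numpy as np

# A hypothetical 3-state transition matrix (values are illustrative).
# Row s holds the probabilities of moving from state s to each successor s'.
P = np.array([
    [0.5, 0.4, 0.1],
    [0.2, 0.6, 0.2],
    [0.0, 0.3, 0.7],
])

# Each row of a valid transition matrix must sum to 1.
assert np.allclose(P.sum(axis=1), 1.0)

rng = np.random.default_rng(0)

def step(s):
    """Sample a successor state s' with probability P[s, s']."""
    return int(rng.choice(len(P), p=P[s]))
```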

### Markov Process / Markov Chain

At a high level, a Markov process is a sequence of random states $S_{1},S_{2},…$ with the Markov property.

Definition

A Markov Process is a tuple $⟨S, P⟩$

- $S$ is a finite set of states
- $P$ is a state transition probability matrix, $P_{ss'} = P[S_{t+1} = s' \mid S_{t} = s]$
$P$ stands for probability.

## Markov Reward Process

A Markov reward process is a Markov process with rewards (values for each state).

Definition

A Markov Reward Process is a tuple $⟨S, P, R, γ⟩$

- $S$ is a finite set of states
- $P$ is a state transition probability matrix, $P_{ss'} = P[S_{t+1} = s' \mid S_{t} = s]$
- $R$ is a reward function, $R_{s}=E[R_{t+1}∣S_{t}=s]$
- $γ$ is a discount factor, $γ∈[0,1]$
$P$ stands for probability, while $E$ stands for expectation.

We can solve the MRP analytically, using the fact that we know this is a Markov Process. We break the value function down into: Value = (immediate reward) + (discounted value of successor states):

$V = R + γPV$

We know both $R$ and $P$, and we set $γ$ ourselves, so we can solve for $V$ directly:

$V = (I - γP)^{-1}R$

This analytical approach requires inverting a matrix, which takes around $O(N^{3})$ time, where $N$ is the number of states, so it is only practical for small MRPs.
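The direct solution $V = (I - γP)^{-1}R$ can be sketched in a few lines (the 2-state MRP below is made up; `np.linalg.solve` is used rather than forming the inverse explicitly):

```python
import numpy as np

# Illustrative 2-state MRP.
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])      # transition matrix
R = np.array([1.0, -1.0])       # expected immediate reward per state
gamma = 0.9                     # discount factor

# Solve (I - gamma * P) V = R for V.
V = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
```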

We can also use an iterative algorithm based on Dynamic Programming, applying the same idea as the Bellman Equation.

- Initialize $V_{0}(s) = 0$ for all $s$
- For $k = 1$ until convergence (i.e. $\|V_{k} - V_{k-1}\| < ϵ$)
    - $∀s∈S$, $V_{k}(s) = R(s) + γ\sum_{s'∈S} P(s' \mid s) V_{k-1}(s')$
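The iterative scheme above can be sketched directly (same made-up 2-state MRP; the update is the vectorized form $V_{k} = R + γPV_{k-1}$):

```python
import numpy as np

# Illustrative 2-state MRP.
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])
R = np.array([1.0, -1.0])
gamma, eps = 0.9, 1e-8

# Iterative evaluation: V_k(s) = R(s) + gamma * sum_{s'} P(s'|s) V_{k-1}(s')
V = np.zeros(len(R))
while True:
    V_new = R + gamma * P @ V
    done = np.max(np.abs(V_new - V)) < eps
    V = V_new
    if done:
        break
```

Each sweep costs $O(N^{2})$, which scales better than the $O(N^{3})$ direct solve when only an approximate $V$ is needed.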

## Markov Decision Process

Now, we add decisions. An MDP is an MRP with decisions (actions).

Definition

A Markov Decision Process is a tuple $⟨S, A, P, R, γ⟩$

- $S$ is a finite set of states
- $A$ is a finite set of actions
- $P$ is a state transition probability matrix, $P_{ss'}^{a} = P[S_{t+1} = s' \mid S_{t} = s, A_{t} = a]$
- $R$ is a reward function, $R_{s}^{a} = E[R_{t+1} \mid S_{t} = s, A_{t} = a]$
- $γ$ is a discount factor, $γ∈[0,1]$

The goal is to find the Optimal Policy, which maximizes the expected return.

If you have an MDP + a policy, that just becomes an MRP, so we can solve it using the techniques above.
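Collapsing an MDP under a fixed policy $π$ into an MRP just means averaging transitions and rewards over the policy's action probabilities. A sketch with made-up numbers (shapes: `P_a[a, s, s']`, `R_a[s, a]`, `pi[s, a]`):

```python
import numpy as np

# Illustrative 2-state, 2-action MDP.
P_a = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions under action 0
    [[0.5, 0.5], [0.6, 0.4]],   # transitions under action 1
])
R_a = np.array([[1.0, 0.0],
                [0.5, 2.0]])    # R_a[s, a]: reward for action a in state s
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])     # pi[s, a] = probability of action a in state s

# Induced MRP transition matrix: P(s'|s) = sum_a pi(a|s) P(s'|s,a)
P_pi = np.einsum('sa,ast->st', pi, P_a)
# Induced MRP reward: R(s) = sum_a pi(a|s) R(s,a)
R_pi = np.einsum('sa,sa->s', pi, R_a)
```

`P_pi` and `R_pi` can then be fed to either MRP solver above.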

Also see Partially Observable Markov Decision Process

### MDPs are everywhere

- Cleaning Robot
- Walking Robot
- Pole Balancing
- Tetris
- Backgammon
- Server management
- Shortest path problems
- Model for animals or people

### Solving MDPs

To solve MDPs, we use Dynamic Programming in Reinforcement Learning.

### Maximum Entropy MDP

Regular formulation: $\max_{π} E\left[\sum_{t=0}^{∞} r_{t}\right]$

Maximum-entropy formulation: $\max_{π} E\left[\sum_{t=0}^{∞} \left(r_{t} + βH(π(⋅ \mid s_{t}))\right)\right]$
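The entropy bonus $H(π(⋅ \mid s_{t}))$ rewards policies that stay stochastic. A minimal sketch of the per-step augmented reward (the action distribution, reward, and temperature $β$ below are all made up):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) = -sum_a p(a) log p(a) of a distribution."""
    p = np.asarray(p)
    return float(-np.sum(p * np.log(p + 1e-12)))  # small eps avoids log(0)

pi_s = np.array([0.25, 0.25, 0.25, 0.25])  # pi(.|s_t), uniform over 4 actions
beta, r_t = 0.1, 1.0                        # temperature and raw reward

augmented = r_t + beta * entropy(pi_s)      # per-step max-entropy objective term
```

A uniform distribution maximizes the bonus ($H = \log 4$ here), while a deterministic policy gets no bonus at all.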

### Continuous MDPs with Discretization

Value iteration becomes impractical when the state space $S$ is continuous.

- One solution is to use Discretization
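Discretization maps each continuous state to an integer cell index so that the tabular methods above apply. A sketch for a 1-D state (the grid bounds and resolution are arbitrary choices):

```python
import numpy as np

# Split the illustrative continuous range [-1, 1] into 10 cells.
bins = np.linspace(-1.0, 1.0, num=11)  # 11 edges -> 10 cells

def discretize(x):
    """Map a continuous state x to an integer cell index in [0, 9]."""
    return int(np.clip(np.digitize(x, bins) - 1, 0, len(bins) - 2))
```

Value iteration then runs over the 10 discrete cells; the trade-off is that the number of cells grows exponentially with the state dimension.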

### Extending MDPs

From page 7 of this paper, there are also:

- Markov Game
- Extensive-Form Game