An *information state* (a.k.a. *Markov state*) contains all useful information from the history. A state S_t is *Markov* if and only if

P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t],

i.e., the future is independent of the past given the present. If an environment is fully observable, a Markov Decision Process (MDP) can be applied; if only partial observability is guaranteed, then a partially observable Markov decision process (POMDP) should be applied.
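The Markov property above can be illustrated with a tiny sketch: the next-state distribution depends only on the current state and action, never on earlier history. The state names, actions, and probabilities below are made-up assumptions for illustration only.

```python
import random

# A minimal sketch of a fully observable MDP. All states, actions, and
# probabilities here are illustrative placeholders.
TRANSITIONS = {
    # (state, action) -> list of (next_state, probability)
    ("s0", "a"): [("s0", 0.3), ("s1", 0.7)],
    ("s0", "b"): [("s1", 1.0)],
    ("s1", "a"): [("s0", 0.5), ("s1", 0.5)],
    ("s1", "b"): [("s1", 1.0)],
}

def step(state, action, rng=random):
    """Sample the next state from P[S_{t+1} | S_t, A_t]."""
    next_states, probs = zip(*TRANSITIONS[(state, action)])
    return rng.choices(next_states, weights=probs, k=1)[0]

# Because of the Markov property, a whole trajectory can be generated
# from the current state alone -- no history needs to be kept.
state = "s0"
for _ in range(5):
    state = step(state, "a")
```

Note that `step` never looks at past states, which is exactly what the Markov condition guarantees.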

An RL agent may contain a policy, a value function, and a model. The policy is what we want to learn from experience – how to behave so as to maximize reward by trial and error. The value function is a prediction of future reward; it evaluates how good or bad each state is.

Lastly, the model predicts what the environment will do next. An agent is not required to have a model (model-free methods solve the MDP without one).
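The three components can be sketched side by side. The toy states, actions, and numbers below are assumptions for illustration, not a reference implementation.

```python
# Illustrative sketch of the three RL agent components; all names are assumptions.

# Policy: maps a state to an action (what the agent learns to improve).
policy = {"low": "recharge", "high": "search"}

# Value function: predicts future reward, i.e., how good each state is.
value = {"low": -1.0, "high": 3.5}

# Model: the agent's prediction of what the environment will do next,
# here as (state, action) -> (next_state, reward).
model = {("low", "recharge"): ("high", 0.0),
         ("high", "search"): ("high", 1.0)}

def act(state):
    """Follow the policy."""
    return policy[state]

def predict(state, action):
    """Model-based prediction of (next_state, reward); model-free agents skip this."""
    return model[(state, action)]
```

A model-free agent would keep only `policy` (and possibly `value`) and drop `model` entirely.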


**Figure 1. MDP control in blackjack**

The figure above shows MDP control applied to the game of blackjack. The resulting policy specifies how to behave in each state: if the value of a state is smaller than 0, the player should hit, and if larger than 0, the player should stick. For example, if the dealer's visible card is 7 and the sum of the player's cards is 19, the player should stick in both the usable-ace and no-usable-ace states.
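The decision rule just described (hit when the state's value is below 0, otherwise stick) can be sketched as a greedy policy over a value table. The numbers below are made-up stand-ins for a learned value function, not the actual blackjack solution.

```python
# Hypothetical learned state values keyed by (player_sum, dealer_card, usable_ace).
# These numbers are illustrative placeholders only.
state_values = {
    (19, 7, True): 0.6,
    (19, 7, False): 0.4,
    (13, 7, True): -0.3,
}

def blackjack_action(player_sum, dealer_card, usable_ace):
    """Greedy MDP control: hit if the state's value is negative, else stick."""
    v = state_values[(player_sum, dealer_card, usable_ace)]
    return "hit" if v < 0 else "stick"

# With a dealer 7 and a player sum of 19, the rule says stick in both
# the usable-ace and no-usable-ace states.
```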

This concept can be applied to UAS routing algorithms, especially in grid-based systems.
