
Example of an Infinite Horizon MDP


An infinite horizon Markov Decision Process (MDP) is a mathematical framework used to model decision-making problems with an unbounded number of stages. At each stage, an agent observes the current state of the environment, takes an action, receives a reward, and the environment transitions to a new state. The goal of the agent is to maximize the cumulative discounted reward over an infinite horizon.

Infinite Horizon MDP Definition

An infinite horizon MDP is defined by the following components:

  • States: A set of states S, which represents the possible situations the agent can be in.
  • Actions: A set of actions A, which represents the possible actions the agent can take.
  • Transition function: A function P(s’ | s, a) that specifies the probability of transitioning from state s to state s’ when taking action a.
  • Reward function: A function R(s, a) that specifies the reward the agent receives when taking action a in state s.
  • Discount factor: A parameter γ with 0 ≤ γ < 1 that determines how strongly future rewards are weighted relative to immediate ones.

The agent’s goal is to find a policy π, mapping states to actions, such that the expected discounted cumulative reward over an infinite horizon is maximized.
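To make these components concrete, here is a minimal sketch of how they might be written down in Python. The state and action names, the particular probabilities, and the value γ = 0.9 are illustrative assumptions, not part of any specific problem:

```python
# A minimal, illustrative encoding of an infinite horizon MDP.
# All names and numbers below are hypothetical placeholders.

states = ["s1", "s2", "goal"]            # S: the set of states
actions = ["up", "right", "stay"]        # A: the set of actions

# P[(s, a)] maps each successor state s' to P(s' | s, a); each row sums to 1.
P = {
    ("s1", "up"):     {"s2": 0.9, "s1": 0.1},
    ("s2", "right"):  {"goal": 1.0},
    ("goal", "stay"): {"goal": 1.0},     # the goal state absorbs the agent
}

# R[(s, a)] is the immediate reward for taking action a in state s.
R = {("s1", "up"): 0.0, ("s2", "right"): 1.0, ("goal", "stay"): 0.0}

gamma = 0.9                              # discount factor, 0 <= gamma < 1

# A deterministic policy maps each state to an action.
policy = {"s1": "up", "s2": "right", "goal": "stay"}
```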

Infinite Horizon MDP Example

Consider a simple example of an infinite horizon MDP in which an agent navigates a grid world. The agent can move up, down, left, or right; it receives a reward of +1 for reaching the goal state and a reward of 0 otherwise. The agent starts in a random state, and its objective is to reach the goal.

State | Action | Reward | Next State
s1    | up     | 0      | s2
s1    | down   | 0      | s3
s2    | up     | 0      | s4
s2    | right  | 1      | goal
s3    | down   | 0      | s5
s3    | left   | 0      | s1

In this example, the agent needs to find a policy that maximizes the expected discounted cumulative reward over an infinite horizon. Starting from s1, the optimal policy is to move up to s2 and then right to reach the goal, as the short sketch below illustrates.
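One possible way to encode the table above is as a Python dictionary from (state, action) pairs to (reward, next state) pairs; since every listed transition is deterministic, a single outcome per pair suffices. The rollout at the end simply follows the optimal policy from s1:

```python
# The transition table above, encoded as (state, action) -> (reward, next_state).
transitions = {
    ("s1", "up"):    (0, "s2"),
    ("s1", "down"):  (0, "s3"),
    ("s2", "up"):    (0, "s4"),
    ("s2", "right"): (1, "goal"),
    ("s3", "down"):  (0, "s5"),
    ("s3", "left"):  (0, "s1"),
}

# Follow the optimal policy from s1: move up, then right.
state, total_reward = "s1", 0
for action in ["up", "right"]:
    reward, state = transitions[(state, action)]
    total_reward += reward
print(state, total_reward)   # goal 1
```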

💡 In infinite horizon MDPs, the discount factor γ plays a crucial role in determining the optimal policy. A higher discount factor means that the agent values future rewards more, while a lower discount factor means that the agent values immediate rewards more.

Solution Methods for Infinite Horizon MDPs

There are several solution methods for infinite horizon MDPs, including:

  1. Value Iteration: An iterative algorithm that repeatedly applies the Bellman optimality update to the value function until it converges, then reads off a greedy optimal policy (see the example below).
  2. Policy Iteration: An iterative algorithm that alternates between evaluating the current policy and improving it greedily until the policy stops changing (a short sketch follows this section).
  3. Linear Programming: A method that formulates the MDP as a linear program and solves it using linear programming techniques.
  4. Deep Reinforcement Learning: A method that uses deep neural networks to approximate the value function and policy.

Each of these methods has its strengths and weaknesses, and the choice of method depends on the specific problem and the desired level of accuracy.
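To give a feel for the second method before the value iteration example below, here is a compact policy iteration sketch on the grid world above. It reuses the (state, action) → (reward, next state) encoding; the self-loop "stay" actions at s4, s5, and the goal are an added assumption so that every state has at least one action, and the code is an illustration rather than a reference implementation:

```python
# Illustrative policy iteration on the grid world example.
transitions = {
    ("s1", "up"):   (0, "s2"),  ("s1", "down"):  (0, "s3"),
    ("s2", "up"):   (0, "s4"),  ("s2", "right"): (1, "goal"),
    ("s3", "down"): (0, "s5"),  ("s3", "left"):  (0, "s1"),
    # Assumed self-loops so that every state has an action.
    ("s4", "stay"): (0, "s4"),  ("s5", "stay"):  (0, "s5"),
    ("goal", "stay"): (0, "goal"),
}
gamma = 0.9
states = sorted({s for s, _ in transitions})


def actions_in(s):
    return [a for (state, a) in transitions if state == s]


policy = {s: actions_in(s)[0] for s in states}      # arbitrary initial policy
while True:
    # Policy evaluation: repeatedly apply V(s) <- R(s, pi(s)) + gamma * V(s').
    V = {s: 0.0 for s in states}
    for _ in range(200):
        V = {s: transitions[(s, policy[s])][0]
             + gamma * V[transitions[(s, policy[s])][1]] for s in states}
    # Policy improvement: act greedily with respect to the evaluated V.
    improved = {s: max(actions_in(s),
                       key=lambda a: transitions[(s, a)][0]
                       + gamma * V[transitions[(s, a)][1]])
                for s in states}
    if improved == policy:                           # policy is stable -> optimal
        break
    policy = improved

print(policy)   # e.g. {'s1': 'up', 's2': 'right', 's3': 'left', ...}
```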

Value Iteration Example

Consider the grid world example above. The value iteration algorithm iteratively computes the optimal value function V(s) as follows:

V(s) = max_a [ R(s, a) + γ ∑_{s’} P(s’ | s, a) V(s’) ]

where V(s) is the value of state s, R(s, a) is the reward for taking action a in state s, and P(s’ | s, a) is the probability of transitioning from state s to state s’ when taking action a; the maximization is over actions a and the sum is over successor states s’.

The algorithm iterates until convergence, at which point the optimal value function and policy can be computed.
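As a rough sketch of this update in code, again assuming the grid world encoding used earlier (with self-loop "stay" actions at s4, s5, and the goal as an added assumption), value iteration might look like this:

```python
# Illustrative value iteration: V(s) <- max_a [ R(s, a) + gamma * V(s') ].
transitions = {
    ("s1", "up"):   (0, "s2"),  ("s1", "down"):  (0, "s3"),
    ("s2", "up"):   (0, "s4"),  ("s2", "right"): (1, "goal"),
    ("s3", "down"): (0, "s5"),  ("s3", "left"):  (0, "s1"),
    ("s4", "stay"): (0, "s4"),  ("s5", "stay"):  (0, "s5"),
    ("goal", "stay"): (0, "goal"),
}
gamma = 0.9
states = sorted({s for s, _ in transitions})

V = {s: 0.0 for s in states}
while True:
    # One Bellman backup per state (transitions here are deterministic,
    # so the sum over s' collapses to a single term).
    new_V = {s: max(r + gamma * V[nxt]
                    for (state, a), (r, nxt) in transitions.items() if state == s)
             for s in states}
    converged = max(abs(new_V[s] - V[s]) for s in states) < 1e-6
    V = new_V
    if converged:
        break

# Extract a greedy (optimal) policy from the converged value function.
policy = {s: max(((a, r + gamma * V[nxt])
                  for (state, a), (r, nxt) in transitions.items() if state == s),
                 key=lambda pair: pair[1])[0]
          for s in states}
print(V)       # e.g. V(s1) = 0.9, V(s2) = 1.0, V(s3) = 0.81
print(policy)  # s1 -> up, s2 -> right, s3 -> left
```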

What is the difference between a finite horizon MDP and an infinite horizon MDP?


A finite horizon MDP has a fixed number of stages, while an infinite horizon MDP has an unbounded number of stages. In a finite horizon MDP, the agent's goal is to maximize the expected cumulative reward over that fixed number of stages; in an infinite horizon MDP, the goal is to maximize the expected discounted cumulative reward, where the discount factor γ < 1 keeps the infinite sum finite.

How does the discount factor affect the optimal policy in an infinite horizon MDP?


The discount factor γ affects the optimal policy in an infinite horizon MDP by determining the trade-off between immediate and future rewards. A higher discount factor means that the agent values future rewards more, while a lower discount factor means that the agent values immediate rewards more. The optimal policy will balance the trade-off between immediate and future rewards based on the discount factor.
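As a rough numeric illustration of this trade-off (the reward stream below is entirely made up), the same sequence of rewards is worth very different amounts under different discount factors:

```python
# Hypothetical reward stream: +1 immediately, then +10 ten steps later.
rewards = [1] + [0] * 9 + [10]

for gamma in (0.5, 0.99):
    discounted_return = sum(gamma ** t * r for t, r in enumerate(rewards))
    print(gamma, round(discounted_return, 3))
# gamma = 0.5  -> about 1.01  (the delayed +10 is almost worthless)
# gamma = 0.99 -> about 10.04 (the delayed +10 dominates)
```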

In conclusion, infinite horizon MDPs are a powerful framework for modeling decision-making problems with an infinite number of stages. The solution methods for infinite horizon MDPs, including value iteration, policy iteration, linear programming, and deep reinforcement learning, can be used to compute the optimal policy and value function. Understanding the discount factor and its effect on the optimal policy is crucial in solving infinite horizon MDPs.
