Reinforcement_Learning

reinforcement_learning, deep learning, rl, deepmind

Contents

1. Markov Decision Processes (MDPs)
2. Policies and Value Functions
3. Optimality
4. Q-Learning
5. Exploration vs. Exploitation
6. Deep Q-Learning

[1] Markov Decision Processes (MDPs)

IMG

  • Agent: The decision maker, which interacts with the environment sequentially over time.
    • Agent's goal in an MDP: maximize the expected cumulative reward under its policy.
  • Environment: According to the MDP model, it decides the next state (s') and reward (r).
  • State: Represents the current situation of the environment.
  • Action: Chosen by the agent according to its policy.
  • Reward: Received from the environment for taking an action in a state.

Probability of transitioning to the next state (s') with reward (r), given the current state (s) and action (a):
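Since the formula image isn't shown here, a standard way to write this one-step dynamics function (an assumption about what the image contains, following common textbook notation) is:

$$p(s', r \mid s, a) = \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$$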

[1-1] Episodic VS Continuing Tasks

  • Episodic Tasks: Each new round of the game can be thought of as an episode, and the final time step of an episode occurs when a player scores a point.

    • The next episode then begins independently from how the previous episode ended.

    • [Cumulative Reward: G - Episodic]

      IMG
  • Continuing Tasks: There is no final time step, which is why a discount factor is needed (both return formulas are written out right after this list).

    • [Cumulative Reward: G - Continuing]

      IMG
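In case the formula images above don't render, the two return definitions are assumed to be the standard ones:

$$G_t = R_{t+1} + R_{t+2} + \dots + R_T \quad \text{(episodic, with final time step } T\text{)}$$

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \quad 0 \le \gamma < 1 \quad \text{(continuing, discounted)}$$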

[2] Policies and Value Functions

  • Policy: a function that maps a given state to probabilities of selecting each possible action from that state. We will use the symbol π to denote a policy.

    • Policy Background: How probable is it for an agent to select any action from a given state?

      IMG
  • Value Functions: Value functions are functions of states, or of state-action pairs, that estimate how good it is for an agent to be in a given state, or how good it is for the agent to perform a given action in a given state.

    • Value Functions Background: How good is a given state, or a given action in a given state, for the agent?
    • Since the way an agent acts is influenced by the policy it's following, value functions are defined with respect to policies (standard forms of both functions are written out after the two subsections below).

    [2-1] State-Value Function

    IMG

    [2-2] Action-Value Function

    IMG
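For reference, the standard forms of the policy and both value functions (assumed to match the images above) are:

$$\pi(a \mid s) = \Pr\{A_t = a \mid S_t = s\}$$

$$v_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]$$

$$q_\pi(s, a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right]$$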

[3] Optimality

The goal of reinforcement learning algorithms is to find a policy that yields a high return for the agent if the agent indeed follows that policy. Specifically, reinforcement learning algorithms seek a policy that yields a greater expected return than every other policy.

[3-1] Optimal Policy

IMG

[3-2] Optimal State-Value Function

IMG

[3-3] Optimal Action-Value Function

IMG
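Again assuming the images show the standard definitions: a policy π* is optimal if its expected return is greater than or equal to that of every other policy in every state, and the optimal value functions are

$$v_*(s) = \max_\pi v_\pi(s), \qquad q_*(s, a) = \max_\pi q_\pi(s, a)$$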

[3-4] Bellman Optimality Equation

IMG

IMG

Once we have our optimal Q-function q*, we can determine the optimal policy directly: for each state, choose the action that maximizes q*.
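A common way to write the Bellman optimality equation for q* (assumed to be what the images above show), together with the greedy policy extraction just described:

$$q_*(s, a) = \mathbb{E}\left[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \,\middle|\, S_t = s, A_t = a\right]$$

$$\pi_*(s) = \arg\max_a q_*(s, a)$$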

[4] Q-learning

  • What is Q-learning?
    • Q-learning finds the optimal policy by learning the optimal Q-value for each state-action pair in a Markov Decision Process.
    • Chicago's guess at the Q-learning flow: some overall policy -> Q-learning -> the optimal policy.

[4-1] Value Iteration

The Q-learning algorithm iteratively updates the Q-values for each state-action pair using the Bellman equation until the Q-function converges to the optimal Q-function, q*. This approach is called value iteration. To see exactly how this happens, let's set up an example, appropriately called The Lizard Game.
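A minimal sketch of that iterative update in Python (the grid size, learning rate, and reward values here are illustrative assumptions, not taken from The Lizard Game):

```python
import numpy as np

n_states, n_actions = 9, 4    # assumed size of a small grid world
alpha, gamma = 0.1, 0.99      # assumed learning rate and discount factor

q_table = np.zeros((n_states, n_actions))

def q_learning_update(s, a, r, s_next):
    """One value-iteration-style update toward r + gamma * max_a' Q(s', a')."""
    td_target = r + gamma * np.max(q_table[s_next])
    q_table[s, a] += alpha * (td_target - q_table[s, a])
```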

[4-2] Epsilon Greedy Strategy

In Q-learning, actions are chosen with an epsilon greedy strategy. At the start of each new episode, epsilon decays by some rate that we set, so that exploration becomes less and less probable as the agent learns more and more about the environment.

Epsilon: Exploration rate

  • Algorithm
    • To determine whether the agent will choose exploration or exploitation at each time step, we generate a random number between 0 and 1. If this number is greater than epsilon, then the agent will choose its next action via exploitation, i.e. it will choose the action with the highest Q-value for its current state from the Q-table. Otherwise, its next action will be chosen via exploration, i.e. randomly choosing its action and exploring what happens in the environment.
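A sketch of that selection rule plus the per-episode epsilon decay (the decay rate and bounds are assumptions):

```python
import random
import numpy as np

epsilon = 1.0                             # start fully exploratory
min_epsilon, decay_rate = 0.01, 0.001     # assumed lower bound and decay rate

def choose_action(q_table, state, n_actions):
    """Exploit with probability 1 - epsilon, otherwise explore randomly."""
    if random.random() > epsilon:
        return int(np.argmax(q_table[state]))   # exploitation: best known action
    return random.randrange(n_actions)          # exploration: random action

def decay_epsilon(episode):
    """Called at the start of each new episode to shrink the exploration rate."""
    global epsilon
    epsilon = min_epsilon + (1.0 - min_epsilon) * np.exp(-decay_rate * episode)
```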

[5] Exploration vs. Exploitation

[Epsilon greedy strategy]: epsilon is the exploration rate. When epsilon is near 0, the agent mostly exploits; when epsilon is near 1, the agent mostly explores.

[6] Deep Q-Learning

IMG

A single image can represent the state, but one frame by itself can make it hard to recognize the situation. That's why, in most cases, a sequence of images is given as the input.

There are as many output nodes as there are possible actions from the state.

Experience replay: At time t, the agent's experience e_t is defined as this tuple:

IMG

Two types of neural networks:

1. Policy network (takes s as its input)
2. Target network (takes s' as its input)

Replay memory: stores the agent's experiences e_t, and mini-batches are sampled from it for training (a rough sketch of these pieces follows below).
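A rough sketch of how these pieces fit together (replay memory plus one policy/target-network update), written here with PyTorch; the network shape, batch size, and hyperparameters are assumptions, not the repository's code:

```python
import random
from collections import deque, namedtuple

import torch
import torch.nn as nn

Experience = namedtuple("Experience", ("state", "action", "reward", "next_state"))

class ReplayMemory:
    """Stores experiences e_t and serves random mini-batches for training."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, *args):
        self.buffer.append(Experience(*args))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

# Assumed small fully connected network; DQN on images would use a conv net over a frame stack.
def make_network(state_dim=4, n_actions=2):
    return nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

policy_net = make_network()                            # scores the current state s
target_net = make_network()                            # scores the next state s'
target_net.load_state_dict(policy_net.state_dict())    # start with identical weights

optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)
gamma = 0.99

def train_step(memory, batch_size=32):
    """One gradient step on the policy network using targets from the target network."""
    if len(memory.buffer) < batch_size:
        return
    batch = Experience(*zip(*memory.sample(batch_size)))
    states = torch.stack(batch.state)
    actions = torch.tensor(batch.action).unsqueeze(1)
    rewards = torch.tensor(batch.reward, dtype=torch.float32)
    next_states = torch.stack(batch.next_state)

    q_sa = policy_net(states).gather(1, actions).squeeze(1)     # Q(s, a) from the policy net
    with torch.no_grad():
        max_next = target_net(next_states).max(dim=1).values    # max_a' Q_target(s', a')
    target = rewards + gamma * max_next

    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```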
