reinforcement_learning, deep learning, rl, deepmind
1. Markov Decision Processes (MDPs)
2. Policies and Value Functions
3. Optimality
4. Q-Learning
5. Exploration vs. Exploitation
6. Deep Q-Learning
- Agent: The decision maker, interacting with the environment sequentially over time.
- Agent goal in MDP: Maximize the expected cumulative reward obtained by following its policy.
- Environment: According to the MDP model, it decides the next state (s') and the reward (r).
- State: Represents the environment's current situation.
- Action: Chosen by the agent at each state according to its policy.
- Reward: Received from the environment for taking a given action in a given state.
- Episodic Tasks: Tasks that break into episodes. Each new round of a game can be thought of as an episode, and the final time step of an episode occurs when a player scores a point.
- Continuing Tasks: Tasks with no terminal state. Since rewards could accumulate without bound, a discount factor is needed to keep the return finite.
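The role of the discount factor in continuing tasks can be sketched in a few lines. The reward values and gamma below are illustrative, not from the notes:

```python
# Sketch: the discounted return G_t = sum over k of gamma^k * r_{t+k+1}.
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# For a long stream of constant reward r, the discounted return converges
# toward r / (1 - gamma) instead of growing without bound.
print(discounted_return([1.0] * 1000, gamma=0.9))  # approaches 1 / (1 - 0.9) = 10
```

Without discounting (gamma = 1), the same infinite stream of rewards would diverge, which is why continuing tasks need gamma < 1.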
- Policy: A function that maps a given state to the probabilities of selecting each possible action from that state. We use the symbol π to denote a policy.
- Value Functions: Functions of states, or of state-action pairs, that estimate how good it is for an agent to be in a given state, or how good it is for the agent to perform a given action in a given state.
- Value Functions Background: They answer the question: how good is a given action or a given state for the agent?
Since the way an agent acts is influenced by the policy it's following, then we can see that value functions are defined with respect to policies.
It is the goal of reinforcement learning algorithms to find a policy that will yield a lot of reward for the agent if the agent indeed follows that policy. Specifically, reinforcement learning algorithms seek a policy whose return is greater than or equal to that of every other policy.
Once we have the optimal Q-function q*, we can determine the optimal policy directly: in each state, choose the action that maximizes q* (i.e., act greedily with respect to q*).
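Extracting the greedy policy from a learned Q-function is a one-line argmax. A minimal sketch, with a made-up Q-table for illustration:

```python
# Hypothetical Q-table: maps (state, action) pairs to learned Q-values.
q_table = {
    ("s0", "left"): 0.1, ("s0", "right"): 0.8,
    ("s1", "left"): 0.5, ("s1", "right"): 0.2,
}

def greedy_policy(q_table, state, actions):
    # pi*(s) = argmax over a of q*(s, a)
    return max(actions, key=lambda a: q_table[(state, a)])

print(greedy_policy(q_table, "s0", ["left", "right"]))  # right
```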
- What is Q-learning?
- Q-learning finds the optimal policy by learning the optimal Q-value for each state-action pair in a Markov Decision Process.
Chicago's guess at the Q-learning flow: start from an arbitrary policy -> run Q-learning -> extract the optimal policy.
The Q-learning algorithm iteratively updates the Q-values for each state-action pair using the Bellman equation until the Q-function converges to the optimal Q-function, q*. This approach is called value iteration. To see exactly how this happens, let's set up an example, appropriately called The Lizard Game.
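The iterative update can be sketched in tabular form. The learning rate, discount factor, and the single transition below are illustrative values, not taken from the Lizard Game:

```python
from collections import defaultdict

def q_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    # Bellman-based update:
    # Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max over a' of Q(s', a'))
    best_next = max(q[(s_next, a2)] for a2 in actions)
    q[(s, a)] = (1 - alpha) * q[(s, a)] + alpha * (r + gamma * best_next)

q = defaultdict(float)  # all Q-values start at 0
actions = [0, 1]
# One illustrative transition: in state "A", action 1 yields reward 1.0 and leads to "B".
q_update(q, "A", 1, 1.0, "B", actions)
print(q[("A", 1)])  # 0.1 after the first update
```

Repeating this update over many transitions is what drives the Q-table toward q*.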
Q-learning chooses actions with an epsilon-greedy strategy. At the start of each new episode, epsilon decays by some rate that we set, so that exploration becomes less and less probable as the agent learns more and more about the environment.
Epsilon: Exploration rate
- Algorithm
- To determine whether the agent will choose exploration or exploitation at each time step, we generate a random number between 0 and 1. If this number is greater than epsilon, the agent chooses its next action via exploitation, i.e., it takes the action with the highest Q-value for its current state from the Q-table. Otherwise, the next action is chosen via exploration, i.e., the agent picks an action at random and explores what happens in the environment.
[epsilon-greedy strategy]: epsilon is the exploration rate. When epsilon is near 0, the agent mostly exploits; when epsilon is near 1, it mostly explores.
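The selection rule above can be sketched directly. The Q-values and decay rate here are illustrative assumptions:

```python
import random

def select_action(q_values, epsilon):
    # Generate a random number in [0, 1); exploit if it exceeds epsilon.
    if random.random() > epsilon:
        # Exploit: take the action with the highest Q-value.
        return max(range(len(q_values)), key=lambda a: q_values[a])
    # Explore: pick an action at random.
    return random.randrange(len(q_values))

epsilon, decay, min_epsilon = 1.0, 0.99, 0.01
for episode in range(5):
    action = select_action([0.2, 0.7, 0.1], epsilon)
    # Decay epsilon once per episode so exploration becomes less likely over time.
    epsilon = max(min_epsilon, epsilon * decay)
```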
An image can represent the state, but a single frame can be hard to interpret (for example, it does not show the direction of motion), so in most cases a sequence of frames is used as input. The network has one output node for each action available from the state.
Experience replay: At time t, the agent's experience e_t is defined as the tuple e_t = (s_t, a_t, r_{t+1}, s_{t+1}).
Two types of neural network: 1. Policy network (takes s as input and outputs a Q-value per action) 2. Target network (takes s' as input to compute target Q-values; its weights are synced with the policy network only periodically).
Replay memory: Stored experiences are kept in a fixed-size replay memory, and training batches are sampled from it at random to break correlations between consecutive steps.
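A replay memory can be sketched as a fixed-capacity buffer; the capacity and the tuple layout below are illustrative:

```python
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity):
        # deque with maxlen silently drops the oldest experiences when full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        # Store one experience tuple e_t = (s_t, a_t, r_{t+1}, s_{t+1}).
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Random sampling breaks correlations between consecutive steps.
        return random.sample(self.buffer, batch_size)

memory = ReplayMemory(capacity=3)
for t in range(5):
    memory.push(t, 0, 1.0, t + 1)
print(len(memory.buffer))  # 3: only the most recent experiences remain
```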