reinforcement_learning, deep learning, rl, deepmind
1. Markov Decision Processes (MDPs)
2. Policies and Value Functions
3. Optimality
4. Q-Learning
5. Exploration vs. Exploitation
6. Deep Q-Learning
- Agent: The decision maker, interacting with the environment sequentially over time.
- Agent goal in MDP: Maximize the expected cumulative reward obtained by following its policy.
- Environment: According to the MDP model, it decides the next state (s') and the reward (r).
- State: Represents the environment's current situation.
- Action: Chosen by the agent at each state according to its policy.
- Reward: Received from the environment for taking a given action in a given state.
- Episodic Tasks: Tasks that break into episodes. Each new round of a game can be thought of as an episode, and the final time step of an episode occurs when a player scores a point.
- Continuing Tasks: Tasks with no terminal state. Since rewards could accumulate without bound, a discount factor is needed to keep the return finite.
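The role of the discount factor in continuing tasks can be sketched in a few lines. The reward values and gamma below are illustrative, not from the notes:

```python
# Sketch: the discounted return G_t = sum over k of gamma^k * r_{t+k+1}.
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# For a long stream of constant reward r, the discounted return converges
# toward r / (1 - gamma) instead of growing without bound.
print(discounted_return([1.0] * 1000, gamma=0.9))  # approaches 1 / (1 - 0.9) = 10
```

Without discounting (gamma = 1), the same infinite stream of rewards would diverge, which is why continuing tasks need gamma < 1.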
- Policy: A function that maps a given state to the probabilities of selecting each possible action from that state. We use the symbol π to denote a policy.
- Value Functions: Functions of states, or of state-action pairs, that estimate how good it is for an agent to be in a given state, or how good it is for the agent to perform a given action in a given state.
- Value Functions Background: They answer the question: how good is a given action or a given state for the agent?
Since the way an agent acts is influenced by the policy it's following, then we can see that value functions are defined with respect to policies.
It is the goal of reinforcement learning algorithms to find a policy that will yield a lot of reward for the agent if the agent indeed follows that policy. Specifically, reinforcement learning algorithms seek a policy whose return is greater than or equal to that of every other policy.
Once we have the optimal Q-function q*, we can determine the optimal policy directly: in each state, choose the action that maximizes q* (i.e., act greedily with respect to q*).
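Extracting the greedy policy from a learned Q-function is a one-line argmax. A minimal sketch, with a made-up Q-table for illustration:

```python
# Hypothetical Q-table: maps (state, action) pairs to learned Q-values.
q_table = {
    ("s0", "left"): 0.1, ("s0", "right"): 0.8,
    ("s1", "left"): 0.5, ("s1", "right"): 0.2,
}

def greedy_policy(q_table, state, actions):
    # pi*(s) = argmax over a of q*(s, a)
    return max(actions, key=lambda a: q_table[(state, a)])

print(greedy_policy(q_table, "s0", ["left", "right"]))  # right
```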
- What is Q-learning?
- Q-learning finds the optimal policy by learning the optimal Q-value for each state-action pair in a Markov Decision Process.
Chicago's guess at the Q-learning flow: start from an arbitrary policy -> run Q-learning -> extract the optimal policy.
The Q-learning algorithm iteratively updates the Q-values for each state-action pair using the Bellman equation until the Q-function converges to the optimal Q-function, q*. This approach is called value iteration. To see exactly how this happens, let's set up an example, appropriately called The Lizard Game.
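The iterative update can be sketched in tabular form. The learning rate, discount factor, and the single transition below are illustrative values, not taken from the Lizard Game:

```python
from collections import defaultdict

def q_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    # Bellman-based update:
    # Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max over a' of Q(s', a'))
    best_next = max(q[(s_next, a2)] for a2 in actions)
    q[(s, a)] = (1 - alpha) * q[(s, a)] + alpha * (r + gamma * best_next)

q = defaultdict(float)  # all Q-values start at 0
actions = [0, 1]
# One illustrative transition: in state "A", action 1 yields reward 1.0 and leads to "B".
q_update(q, "A", 1, 1.0, "B", actions)
print(q[("A", 1)])  # 0.1 after the first update
```

Repeating this update over many transitions is what drives the Q-table toward q*.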
Q-learning chooses actions with an epsilon-greedy strategy. At the start of each new episode, epsilon decays by some rate that we set, so that exploration becomes less and less probable as the agent learns more and more about the environment.
Epsilon: Exploration rate
- Algorithm
- To determine whether the agent will choose exploration or exploitation at each time step, we generate a random number between 0 and 1. If this number is greater than epsilon, the agent chooses its next action via exploitation, i.e., it takes the action with the highest Q-value for its current state from the Q-table. Otherwise, the next action is chosen via exploration, i.e., the agent picks an action at random and explores what happens in the environment.
[epsilon-greedy strategy]: epsilon is the exploration rate. When epsilon is near 0, the agent mostly exploits; when epsilon is near 1, it mostly explores.
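The selection rule above can be sketched directly. The Q-values and decay rate here are illustrative assumptions:

```python
import random

def select_action(q_values, epsilon):
    # Generate a random number in [0, 1); exploit if it exceeds epsilon.
    if random.random() > epsilon:
        # Exploit: take the action with the highest Q-value.
        return max(range(len(q_values)), key=lambda a: q_values[a])
    # Explore: pick an action at random.
    return random.randrange(len(q_values))

epsilon, decay, min_epsilon = 1.0, 0.99, 0.01
for episode in range(5):
    action = select_action([0.2, 0.7, 0.1], epsilon)
    # Decay epsilon once per episode so exploration becomes less likely over time.
    epsilon = max(min_epsilon, epsilon * decay)
```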
An image can represent the state, but a single frame can be hard to interpret (for example, it does not show the direction of motion), so in most cases a sequence of frames is used as input. The network has one output node for each action available from the state.
Experience replay: At time t, the agent's experience e_t is defined as the tuple e_t = (s_t, a_t, r_{t+1}, s_{t+1}).
Two types of neural network: 1. Policy network (takes s as input and outputs a Q-value per action) 2. Target network (takes s' as input to compute target Q-values; its weights are synced with the policy network only periodically).
Replay memory: Stored experiences are kept in a fixed-size replay memory, and training batches are sampled from it at random to break correlations between consecutive steps.
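A replay memory can be sketched as a fixed-capacity buffer; the capacity and the tuple layout below are illustrative:

```python
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity):
        # deque with maxlen silently drops the oldest experiences when full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        # Store one experience tuple e_t = (s_t, a_t, r_{t+1}, s_{t+1}).
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Random sampling breaks correlations between consecutive steps.
        return random.sample(self.buffer, batch_size)

memory = ReplayMemory(capacity=3)
for t in range(5):
    memory.push(t, 0, 1.0, t + 1)
print(len(memory.buffer))  # 3: only the most recent experiences remain
```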