Twin Delayed Deep Deterministic Policy Gradient (TD3)

Overview

This repository contains a PyTorch implementation of Twin Delayed Deep Deterministic Policy Gradients (TD3), a reinforcement learning algorithm that addresses some of the key challenges associated with continuous control tasks. The TD3 algorithm builds on the foundation of Deep Deterministic Policy Gradients (DDPG) by introducing several improvements to enhance stability and performance. One of the primary motivations behind TD3 is to mitigate the overestimation bias in Q-learning, which can lead to suboptimal policies. To achieve this, the authors proposed using a pair of critic networks to provide more accurate Q-value estimates. Additionally, TD3 employs a delayed policy update strategy, which reduces the variance in policy updates and helps in achieving more robust learning. Finally, the introduction of target policy smoothing adds noise to the target action, which reduces the likelihood of policy exploitation due to function approximation errors.
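For reference, here is a minimal sketch of the clipped double-Q target described above. The network and argument names (`target_actor`, `target_critic_1`, `rewards`, etc.) and the hyperparameter values are illustrative assumptions, not the exact interfaces used in this repository:

```python
import torch

def td3_critic_target(target_actor, target_critic_1, target_critic_2,
                      rewards, next_states, dones,
                      gamma=0.99, policy_noise=0.2, noise_clip=0.5, max_action=1.0):
    """Clipped double-Q target with target policy smoothing (illustrative sketch)."""
    with torch.no_grad():
        # Target policy smoothing: perturb the target action with clipped Gaussian
        # noise so the value estimate is smoothed over nearby actions.
        next_actions = target_actor(next_states)
        noise = (torch.randn_like(next_actions) * policy_noise).clamp(-noise_clip, noise_clip)
        next_actions = (next_actions + noise).clamp(-max_action, max_action)

        # Clipped double Q-learning: take the minimum of the two target critics
        # to reduce overestimation bias.
        q1 = target_critic_1(next_states, next_actions)
        q2 = target_critic_2(next_states, next_actions)
        target_q = rewards + gamma * (1.0 - dones) * torch.min(q1, q2)
    return target_q
```

The third ingredient, delayed policy updates, simply means the actor and the target networks are updated only once every few critic updates (every second update in the original TD3 paper), which reduces the variance of the policy gradient.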

🤔 It kind of seems like the catastrophic drops in average score are occurring at regular intervals... could this be a function of the parameter updates?
I'm also not convinced I'm handling the actions correctly for environments with action bounds |x| > 1 (see the sketch below).
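One common way to handle environments whose action bounds exceed [-1, 1] is to keep the actor's output tanh-squashed to [-1, 1] and rescale it to the environment's action box before stepping. A hypothetical helper (not this repository's actual code) is shown here:

```python
import numpy as np

def scale_action(raw_action, action_space):
    """Map a tanh-squashed action in [-1, 1] onto the environment's [low, high] box."""
    low, high = action_space.low, action_space.high
    scaled = low + 0.5 * (np.asarray(raw_action) + 1.0) * (high - low)
    return scaled.astype(np.float32)
```

Usage would look like `env.step(scale_action(agent_action, env.action_space))`, with any exploration noise added before the rescaling.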

Setup

Required Dependencies

Install the required dependencies using the following command:

pip install -r requirements.txt

Running the Algorithm

You can run the algorithm on any supported Gymnasium environment. For example:

python main.py --env 'LunarLanderContinuous-v2'

No hyperparameter tuning was conducted for the individual environments. This was an intentional choice, to see how well the algorithm generalizes across tasks. For this reason, the agent learned successfully in some cases, while in others it was still training after 10,000 epochs.

Environments tested:

Pendulum-v1, LunarLanderContinuous-v2, MountainCarContinuous-v0, BipedalWalker-v3, Hopper-v4, Humanoid-v4, Ant-v4, HalfCheetah-v4, HumanoidStandup-v4, InvertedDoublePendulum-v4, InvertedPendulum-v4, Pusher-v4, Reacher-v4, Swimmer-v3, Walker2d-v4

Acknowledgements

Special thanks to Phil Tabor, an excellent teacher! I highly recommend his YouTube channel.