Hi bic4907, I really like your BicNet implementation! My goal is to run it on an environment where every agent receives a reward of -1 for each time step it takes to finish the episode. But I'm confused by your actor loss implementation: since the actor loss is defined as the critic's prediction, doesn't the reward need to converge to zero when the agents perform perfectly?
Can you explain why you implemented it this way? Also, is it possible that the reward does not converge to 0 even when the agents perform well (like in the environment I mentioned above)?
Hi, Simon.
Sorry for the late reply to your issue.
This repository was written two years ago, so I can't remember everything :(
Reading your issue, I understand that you intend to make the agents hurry to reach the landmark (-1 reward for every step). My reward shaping is tuned specifically for my environment, so any alternative reward shaping is welcome. If the reward converges to a value below zero while the agents work well, don't worry about it; that is just an artifact of my old implementation.
I think the actor loss you mentioned is implemented so that the actor chooses actions that maximize the Q-value predicted by the critic.
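To illustrate the point, here is a minimal sketch of a DDPG-style actor update of the kind used in BicNet-like implementations (the function and tensor shapes are hypothetical, not copied from this repository). The actor is trained by minimizing the negative of the critic's Q estimate, so the optimum can be any value, including a negative one when every step yields -1 reward; the actor only needs the gradient toward higher Q, not a Q of exactly zero.

```python
import torch

def actor_loss(actor, critic, states):
    # Actor proposes joint actions for all agents: (batch, n_agents, action_dim)
    actions = actor(states)
    # Critic scores the state-action pair: (batch, n_agents) or (batch, 1)
    q_values = critic(states, actions)
    # Minimizing -Q is equivalent to maximizing Q; the sign of Q itself
    # (e.g. a negative return from per-step -1 rewards) does not matter.
    return -q_values.mean()
```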
If there is more stuff to discuss, please contact me anytime.