
[rllib] Memory leak in MADDPG #6857

Closed
jkterry1 opened this issue Jan 20, 2020 · 4 comments
Labels
bug Something that is supposed to be working; but isn't

Comments

@jkterry1
Contributor

jkterry1 commented Jan 20, 2020

I'm running on two custom (proprietary) environments, and it otherwise works fine, but with MADDPG Ray throws RayOutOfMemoryError after 52 episodes (or 49k timesteps). The error.txt output log:

Failure # 1 (occurred at 2020-01-17_19-40-17)
Traceback (most recent call last):
  File "/home/ananth/.local/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 426, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/home/ananth/.local/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 378, in fetch_result
    result = ray.get(trial_future[0], DEFAULT_GET_TIMEOUT)
  File "/home/ananth/.local/lib/python3.6/site-packages/ray/worker.py", line 1457, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RayOutOfMemoryError): ray::MADDPG.train() (pid=15090, ip=10.70.51.83)
  File "python/ray/_raylet.pyx", line 636, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 619, in ray._raylet.execute_task.function_executor
  File "/home/ananth/.local/lib/python3.6/site-packages/ray/rllib/agents/trainer.py", line 444, in train
    raise e
  File "/home/ananth/.local/lib/python3.6/site-packages/ray/rllib/agents/trainer.py", line 433, in train
    result = Trainable.train(self)
  File "/home/ananth/.local/lib/python3.6/site-packages/ray/tune/trainable.py", line 176, in train
    result = self._train()
  File "/home/ananth/.local/lib/python3.6/site-packages/ray/rllib/agents/trainer_template.py", line 138, in _train
    res = collect_metrics_fn(self)
  File "/home/ananth/.local/lib/python3.6/site-packages/ray/rllib/contrib/maddpg/maddpg.py", line 170, in collect_metrics
    result = trainer.collect_metrics()
  File "/home/ananth/.local/lib/python3.6/site-packages/ray/rllib/agents/trainer.py", line 753, in collect_metrics
    selected_workers=selected_workers)
  File "/home/ananth/.local/lib/python3.6/site-packages/ray/rllib/optimizers/policy_optimizer.py", line 108, in collect_metrics
    timeout_seconds=timeout_seconds)
  File "/home/ananth/.local/lib/python3.6/site-packages/ray/rllib/evaluation/metrics.py", line 75, in collect_episodes
    metric_lists = ray_get_and_free(collected)
  File "/home/ananth/.local/lib/python3.6/site-packages/ray/rllib/utils/memory.py", line 33, in ray_get_and_free
    result = ray.get(object_ids)
ray.exceptions.RayTaskError(RayOutOfMemoryError): ray::RolloutWorker (pid=15078, ip=10.70.51.83)
  File "python/ray/_raylet.pyx", line 627, in ray._raylet.execute_task
  File "/home/ananth/.local/lib/python3.6/site-packages/ray/memory_monitor.py", line 130, in raise_if_low_memory
    self.error_threshold))
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node gammaumd5 is used (30.3 / 31.07 GB). The top 10 memory consumers are:

PID	MEM	COMMAND
15078	4.68GiB	ray::RolloutWorker
15090	3.81GiB	ray::MADDPG.train()
14994	0.21GiB	python3 run_maddpg.py
15102	0.08GiB	ray::IDLE
15106	0.08GiB	ray::IDLE
15096	0.08GiB	ray::IDLE
15093	0.08GiB	ray::IDLE
15099	0.08GiB	ray::IDLE
15107	0.08GiB	ray::IDLE
15098	0.08GiB	ray::IDLE

In addition, up to 17.93 GiB of shared memory is currently being used by the Ray object store. You can set the object store size with the `object_store_memory` parameter when starting Ray, and the max Redis size with `redis_max_memory`. Note that Ray assumes all system memory is available for use by workers. If your system has other applications running, you should manually set these memory limits to a lower value.

Ray was initialized with ray.init(redis_max_memory=int(4e9), object_store_memory=int(10e9), memory=int(16e9)), on a machine with 32 GB of RAM. I've attached my slightly modified version of run_maddpg.py from the example code to include all interesting parameters. This issue persists after upgrading to tf 1.15 and ray 0.9dev0, which GitHub reports say fixed similar issues. I'm also not the only person having this problem, as evidenced by the comments in the old example code repo: wsjeon/maddpg-rllib#7 (the one that was marked off topic).

I followed this comment from #3884 to set my object store size. Each observation returned by a step call in my game is 538 KB. Therefore, 538 KB * max(48, 12*4) * 16 ≈ 414 MB (I believe the default value of LEARNER_QUEUE_SIZE is 16). So I set object_store_memory to 10 GB, which should be enough.
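For concreteness, here is a minimal sketch of the initialization and the sizing arithmetic described above. The observation size, queue depth, and LEARNER_QUEUE_SIZE value are the reporter's own estimates repeated verbatim, not recommendations; the ray.init() keyword arguments are the ones quoted in this report (Ray 0.8-era API).

```python
# Sketch of the memory setup described in this report (values are the
# reporter's estimates, shown only to make the arithmetic explicit).
import ray

OBS_SIZE_BYTES = 538_000          # ~538 KB per observation returned by step()
QUEUE_DEPTH = max(48, 12 * 4)     # assumed number of in-flight sample batches
LEARNER_QUEUE_SIZE = 16           # assumed default learner queue size

estimated_store = OBS_SIZE_BYTES * QUEUE_DEPTH * LEARNER_QUEUE_SIZE
print(f"estimated object store need: {estimated_store / 1e6:.0f} MB")  # ~413 MB

ray.init(
    redis_max_memory=int(4e9),      # cap Redis at 4 GB
    object_store_memory=int(10e9),  # 10 GB object store, well above the estimate
    memory=int(16e9),               # 16 GB heap memory for tasks/actors
)
```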

One super curious thing about this issue: while the num_workers parameter is supposed to control the number of rollout workers Ray spawns, when I run pstree to check the spawned processes I see 89 ray::RolloutWorker threads running. The number changes depending on the memory parameters I set, but it never goes below 60. This doesn't make sense.

python3 run_mad─┬─/usr/bin/python───32*[{/usr/bin/python}]
                ├─2*[/usr/bin/python───{/usr/bin/python}]
                ├─plasma_store_se
                ├─python3
                ├─raylet─┬─30*[ray::IDLE───9*[{ray::IDLE}]]
                │        ├─ray::MADDPG.tra───110*[{ray::MADDPG.tra}]
                │        ├─ray::RolloutWor───89*[{ray::RolloutWor}]
                │        └─34*[{raylet}]
                ├─raylet_monitor
                ├─2*[redis-server───3*[{redis-server}]]
                └─18*[{python3 run_mad}]

After the initial start, the number of processes doesn't keep growing, so this isn't some sort of leak where processes accumulate and are never released.

The nature of the memory management here also leads me to believe that Ray somehow isn't respecting the limits I set, which I filed a separate bug report about here: #6785

The documentation also seems to say that we can enforce memory quotas on workers: "Enforcement: If an actor exceeds its memory quota, calls to it will throw RayOutOfMemoryError and it may be killed. Memory quota is currently enforced on a best-effort basis for actors only (but quota is taken into account during scheduling in all cases)." As far as I can tell, there's no way to set this for MADDPG. Am I missing something?
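For reference, the mechanism that documentation quote refers to looks roughly like the sketch below for a plain actor (Ray 0.8-era memory-quota API); whether and how MADDPG's workers expose it is exactly the open question here. The actor class and numbers are hypothetical.

```python
# Minimal sketch of a per-actor memory quota (hypothetical actor, not RLlib code).
import numpy as np
import ray

ray.init()

@ray.remote(memory=500 * 1024 * 1024)  # request ~500 MiB heap quota for this actor
class BufferActor:
    def __init__(self):
        self.data = []

    def add(self, arr):
        self.data.append(arr)
        return len(self.data)

actor = BufferActor.remote()
# If the actor's heap grows past its quota, calls to it may raise
# RayOutOfMemoryError (enforcement is best-effort, as the docs above note).
ray.get(actor.add.remote(np.zeros((1000, 1000))))
```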

Getting this resolved is incredibly important to me, and I'm very happy to submit PRs and work with you all on debugging this issue (I'm already the new maintainer of the maddpg example repo), but I'm completely at a loss. Can anyone offer some insight here?

cc @wsjeon @richardliaw @ericl

Here's the code mentioned above, I had to change the extension from .py to .txt to post it.
run_maddpg.txt

@jkterry1 jkterry1 added the bug Something that is supposed to be working; but isn't label Jan 20, 2020
@ericl
Contributor

ericl commented Jan 20, 2020

Probably your replay buffer is just using too much memory. Does the leak still happen if you set num_workers=0, buffer_size=1000, and leave ray.init() alone?
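A rough sketch of what that suggested configuration could look like, assuming the Ray 0.8-era "contrib/MADDPG" trainer name and its "num_workers"/"buffer_size" config keys; the environment name is a placeholder and the multiagent policy mapping from run_maddpg.py is omitted for brevity.

```python
# Sketch of the settings suggested above (placeholder env, no manual memory limits).
import ray
from ray import tune

ray.init()  # leave ray.init() alone, i.e. no redis/object-store overrides

tune.run(
    "contrib/MADDPG",
    config={
        "env": "my_custom_env",  # placeholder: register your own environment
        "num_workers": 0,        # sample in the trainer process only
        "buffer_size": 1000,     # shrink the replay buffer drastically
        # (multiagent policy mapping from run_maddpg.py omitted for brevity)
    },
)
```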

@jkterry1
Contributor Author

  1. It looks like that fixed it. Something in the run_maddpg.py example repo made it look like I was changing the replay buffer size when I actually wasn't. I'm the maintainer of that repo now, so I fixed it.

  2. I need to use MADDPG with a CNN policy, which it doesn't currently support. Could you walk me through a little of what adding this would entail? I'm not terribly familiar with the RLlib source code, but I'll submit a PR for it.

@edoakes edoakes closed this as completed Jan 23, 2020
@GoingMyWay

GoingMyWay commented May 1, 2020

  1. It looks like that fixed it. Something in the run_maddpg.py example repo made it look like I was changing the replay buffer size when I actually wasn't. I'm the maintainer of that repo now, so I fixed it.
  2. I need to use MADDPG with a CNN policy, which it doesn't currently support. Could you walk me through a little of what adding this would entail? I'm not terribly familiar with the RLlib source code, but I'll submit a PR for it.

You can add CNN layers to _build_actor_network and _build_critic_network in maddpg.py.
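As an illustration only (not the actual maddpg.py code), a conv stack could be prepended to a dense network builder along these lines, using the TF1-style layers API that era of RLlib relied on; the function name, filter sizes, and hidden sizes here are placeholders.

```python
# Illustrative sketch of prepending conv layers to a dense builder such as
# _build_actor_network (hypothetical function, not the RLlib implementation).
import tensorflow as tf

def _build_actor_network_with_cnn(obs, num_outputs, hiddens=(64, 64), scope="actor"):
    with tf.variable_scope(scope):
        out = obs  # expects image observations shaped [batch, height, width, channels]
        # Hypothetical conv stack; filter/kernel/stride choices depend on the env.
        for filters, kernel, stride in [(16, 8, 4), (32, 4, 2)]:
            out = tf.layers.conv2d(out, filters, kernel, strides=stride,
                                   activation=tf.nn.relu)
        out = tf.layers.flatten(out)
        # Existing dense stack, as in the original builder.
        for units in hiddens:
            out = tf.layers.dense(out, units, activation=tf.nn.relu)
        return tf.layers.dense(out, num_outputs, activation=None)
```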

@duburcqa
Contributor

I have the same issue using DDPG. I use 50 workers and a replay buffer of size 100000. It is consuming more than 60 GB after 50M iterations, and memory usage has been increasing linearly since the beginning. I'm using release 0.8.6. @ericl Do you think this is normal?
