Request for Detailed Hyperparameters in LOTUS Experiments #6

Open
BrightMoonStar opened this issue Jun 22, 2024 · 10 comments
Comments

@BrightMoonStar

Dear Dr. Weikang Wan and Team,

I recently came across your fascinating work on the LOTUS algorithm, as detailed in your paper "LOTUS: Continual Imitation Learning for Robot Manipulation Through Unsupervised Skill Discovery."

Your approach to lifelong robot learning through unsupervised skill discovery is truly impressive and offers significant insights into continual imitation learning for robot manipulation. I am particularly interested in replicating and building upon your experiments as part of my research.

However, I noticed that the paper does not provide specific details on some of the experimental hyperparameters, such as the learning rate, number of epochs, and batch size used during training. These details are crucial for ensuring that my replication is as accurate as possible.

Could you kindly provide the following details:

1. The learning rate(s) used for training the models.
2. The number of epochs each model was trained for.
3. The batch size used during training.
4. Any other relevant hyperparameters or settings that were critical to the performance of the LOTUS algorithm.

I greatly appreciate your time and assistance. Your work is a significant contribution to the field, and having these details would be immensely helpful for my research.

Thank you very much for your support, and I look forward to your response.

Best regards

@wkwan7
Contributor

wkwan7 commented Jun 24, 2024

Hi @BrightMoonStar, thanks for your interest in our work! For the hyperparameters you mentioned, you can find them in the configs; we used the provided training scripts for the experiments in our paper.
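
In case it helps others: one quick way to see the exact values is to load the training config and print it. This is just a minimal sketch, assuming the YAML path mentioned later in this thread (lotus/configs/train/default.yaml); the key names may differ between configs.

import yaml

# Path taken from a later comment in this thread; adjust to your checkout.
with open("lotus/configs/train/default.yaml") as f:
    train_cfg = yaml.safe_load(f)

# Print the default training hyperparameters (e.g., n_epochs, batch_size);
# the exact keys depend on the repo's config layout.
for key, value in sorted(train_cfg.items()):
    print(f"{key}: {value}")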

@BrightMoonStar
Author

Thank you for your reply. I strictly followed each step of the instructions in your project and repeated the process many times. All evaluation metrics, including success rate and AoC, are always 0. I found that LIBERO has the same problem, as shown in Lifelong-Robot-Learning/LIBERO#21. I really can't find where the problem is.

@BrightMoonStar
Author

At first I thought n_epochs: 50 in lotus/configs/train/default.yaml might not be enough, so I set n_epochs=1000. The training and evaluation logs are attached below, but the success rate and AoC results are still 0. Thank you again.
output.log

@pengzhi1998

pengzhi1998 commented Jun 27, 2024

Dear Weikang,

Thank you for open-sourcing this great repo!

I'm wondering how you tackled the challenge of opening hdf5 files with multiprocessing in LIBERO. It seems many people have encountered this same issue: Lifelong-Robot-Learning/LIBERO#19 (comment). May we have your suggestions? @wkwan7

Thank you for your time and attention. I look forward to your reply!

Best regards,
Pengzhi

@wkwan7
Contributor

wkwan7 commented Jun 29, 2024

Hi @BrightMoonStar, can you try the default parameters (e.g., n_epochs: 50) and post your output log here? By the way, I recommend using wandb, which shows more detailed logs.
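
For reference, a minimal, repo-independent sketch of wandb logging (the project and metric names below are placeholders, not the repo's actual integration):

import wandb

# Placeholder project/metric names; the repo's own wandb setup may differ.
run = wandb.init(project="lotus-debug", config={"n_epochs": 50})

for epoch in range(50):
    train_loss = 0.0  # placeholder; use the loss from your training loop
    run.log({"epoch": epoch, "train/loss": train_loss})

run.finish()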

@wkwan7
Contributor

wkwan7 commented Jun 29, 2024

Hi @pengzhi1998, if the default dataloader setting does not work for you, you can try this:

from torch.utils.data import DataLoader, RandomSampler

train_dataloader = DataLoader(
    dataset,
    batch_size=self.cfg.train.batch_size,
    num_workers=0,  # self.cfg.train.num_workers; 0 keeps data loading in the main process
    sampler=RandomSampler(dataset),
    # persistent_workers=True,  # only meaningful when num_workers > 0
)

I don't think this will significantly increase the training time.

@pengzhi1998

pengzhi1998 commented Jun 29, 2024

Thank you Weikang for your reply!

Yes, I tried it and it worked well when training and evaluating on the first task in LIBERO. However, when training on the second task, the same problem occurred, but for a different reason:

Traceback (most recent call last):
  File "lifelong/main.py", line 219, in main
    s_fwd, l_fwd = algo.learn_one_task(
  File "/workspace/LIBERO/libero/lifelong/algos/base.py", line 170, in learn_one_task
    loss = self.observe(data)
  File "/workspace/LIBERO/libero/lifelong/algos/er.py", line 73, in observe
    buf_data = next(self.buffer)
  File "/workspace/LIBERO/libero/lifelong/algos/er.py", line 17, in cycle
    for data in dl:
  File "/opt/conda/envs/libero/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 363, in __iter__
    self._iterator = self._get_iterator()
  File "/opt/conda/envs/libero/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 314, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/opt/conda/envs/libero/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 927, in __init__
    w.start()
  File "/opt/conda/envs/libero/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/opt/conda/envs/libero/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/opt/conda/envs/libero/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/opt/conda/envs/libero/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/opt/conda/envs/libero/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/opt/conda/envs/libero/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/opt/conda/envs/libero/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "/opt/conda/envs/libero/lib/python3.8/site-packages/h5py/_hl/base.py", line 370, in __getnewargs__
    raise TypeError("h5py objects cannot be pickled")
TypeError: h5py objects cannot be pickled

It seems that when running the experience replay (ER) algorithm, this cycle function also creates multiple processes for loading data from the previous task (even though I set num_workers to 0, which confuses me the most).

Besides, I noticed that Robomimic also uses multiprocessing for training on hdf5 data. Although its implementation is very similar to LIBERO's, it doesn't encounter this error, which is also confusing.
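
For what it's worth, a common workaround for the "h5py objects cannot be pickled" error is to open the hdf5 file lazily inside each worker instead of holding an open h5py.File on the dataset object, so only the (picklable) path string crosses the process boundary. A minimal sketch under that assumption (hypothetical LazyHDF5Dataset class, not the repo's actual dataset):

import h5py
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class LazyHDF5Dataset(Dataset):
    """Hypothetical dataset that avoids pickling h5py handles."""

    def __init__(self, hdf5_path, keys):
        self.hdf5_path = hdf5_path  # only this string is pickled when workers spawn
        self.keys = keys            # e.g., ["data/demo_0/actions", ...]
        self._file = None           # opened lazily, once per process

    def _h5(self):
        if self._file is None:
            self._file = h5py.File(self.hdf5_path, "r")
        return self._file

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, idx):
        data = self._h5()[self.keys[idx]][()]  # read the dataset into memory
        return torch.from_numpy(np.asarray(data))


# Usage: each worker opens its own file handle on first access.
# loader = DataLoader(LazyHDF5Dataset("demo.hdf5", keys), batch_size=32, num_workers=4)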

May I have some of your insights about these issues? Thank you so much again!!

@wkwan7
Contributor

wkwan7 commented Jun 29, 2024

Hi @pengzhi1998, when I run ER using the Lotus codebase, it doesn't seem to have the issue you mentioned. The command I used is as follows:

export CUDA_VISIBLE_DEVICES=0 && \
export MUJOCO_EGL_DEVICE_ID=0 && \
python lotus/lifelong/main_old.py seed=0 \
                               benchmark_name=LIBERO_OBJECT \
                               policy=bc_transformer_policy \
                               lifelong=er

Maybe you can try using the Lotus codebase to see if you still have the issue.

@BrightMoonStar
Author

Hi @BrightMoonStar, can you try the default parameters (e.g., n_epochs: 50) and post your output log here? By the way, I recommend using wandb, which shows more detailed logs.

Hi, this is the wandb log with n_epochs=50.
offline-run-20240702_001934-vaark4lu.zip

@pengzhi1998

@BrightMoonStar Hi, did you manage to solve this problem (success rate always around 0) in the end?
