Issues training custom model #4

Open

marcellszi opened this issue Mar 3, 2022 · 2 comments

@marcellszi

I've attempted to reproduce some of the results from the paper:

L. Fu, Y. Cao, J. Wu, Q. Peng, Q. Nie, and X. Xie, "UFold: fast and accurate RNA secondary structure prediction with deep learning", Nucleic Acids Research, p. gkab1074, Nov. 2021, doi: 10.1093/nar/gkab1074.

I attempted to re-train UFold on a custom dataset, but ran into some issues, and have a few questions I'm hoping you can help clear up.

Running the testing script

After following the installation instructions, I attempted to check if the installation succeeded (by running python ufold_test.py --test_files TS2), which resulted in the following traceback:

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1579022034529/work/torch/csrc/cuda/Module.cpp line=59 error=101 : invalid device ordinal
Traceback (most recent call last):
  File "ufold_test.py", line 341, in <module>
    main()
  File "ufold_test.py", line 204, in main
    torch.cuda.set_device(2)
  File "/home/usr/anaconda3/envs/UFold/lib/python3.6/site-packages/torch/cuda/__init__.py", line 292, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at /opt/conda/conda-bld/pytorch_1579022034529/work/torch/csrc/cuda/Module.cpp:59

I suspected that this issue was simply due to setting a hardcoded CUDA device that I don't have, so I made the following changes to ufold_test.py:

72c72
<     device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
---
>     device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")
204c204
<     torch.cuda.set_device(0)
---
>     torch.cuda.set_device(2)
257c257
<     device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
---
>     device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")
322c322
<     contact_net.load_state_dict(torch.load(MODEL_SAVED,map_location='cuda:0'))
---
>     contact_net.load_state_dict(torch.load(MODEL_SAVED,map_location='cuda:1'))

After this, the test script runs without issues.
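For anyone hitting the same error, here is a minimal sketch of picking the device dynamically instead of hardcoding an ordinal. The checkpoint-loading line is left as a comment because contact_net and MODEL_SAVED are names from ufold_test.py, not defined here:

import torch

# Use the first visible GPU if one is available, otherwise fall back to CPU,
# rather than hardcoding a specific ordinal such as cuda:2.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
if torch.cuda.is_available():
    torch.cuda.set_device(device)

# When loading the checkpoint, map it onto whichever device was chosen:
# contact_net.load_state_dict(torch.load(MODEL_SAVED, map_location=device))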

Running the training script

I then attempted to train UFold on some other datasets, starting with the provided TS0 as an example. First, I made similar fixes to the hardcoded CUDA devices within ufold_train.py. Then, when trying to train the model via python ufold_train.py --train_files TS0, I ran into more issues.

The code contains several breakpoints (pdb.set_trace()), which I assume were simply left over from debugging. However, continuing through the breakpoints results in the following traceback:

Traceback (most recent call last):
  File "ufold_train.py", line 220, in <module>
    main()
  File "ufold_train.py", line 189, in main
    train_merge = Dataset_FCN_merge(train_data_list)
  File "/home/usr/ufold2/UFold/ufold/data_generator.py", line 609, in __init__
    self.data = self.merge_data(data_list)
  File "/home/usr/ufold2/UFold/ufold/data_generator.py", line 617, in merge_data
    self.data2.data_x = np.concatenate((data_list[0].data_x,data_list[1].data_x),axis=0)
IndexError: list index out of range

I believe this is because at least two datasets must be provided as arguments, e.g. python ufold_train.py --train_files dataset_A dataset_B. However, it's not clear to me why this is the case. Is one of the datasets used for pre-training?
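
This is not the repository's actual fix, just a guess at what a guard might look like; merge_feature_arrays is a hypothetical helper, not a function from data_generator.py:

import numpy as np

def merge_feature_arrays(data_list):
    # Hypothetical guard: handle a single dataset instead of assuming
    # data_list always has at least two entries, which is what triggers
    # the IndexError above when only TS0 is passed.
    if len(data_list) == 1:
        return data_list[0].data_x
    return np.concatenate([d.data_x for d in data_list], axis=0)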

Finally, the training runs through an epoch, and then fails due to what I assume to be a hardcoded save path with the following traceback:

Traceback (most recent call last):
  File "ufold_train.py", line 220, in <module>
    main()
  File "ufold_train.py", line 210, in main
    train(contact_net,train_merge_generator,epoches_first)
  File "ufold_train.py", line 43, in train
    steps_done = 0
  File "/home/usr/anaconda3/envs/UFold/lib/python3.6/site-packages/torch/serialization.py", line 327, in save
    with _open_file_like(f, 'wb') as opened_file:
  File "/home/usr/anaconda3/envs/UFold/lib/python3.6/site-packages/torch/serialization.py", line 212, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/usr/anaconda3/envs/UFold/lib/python3.6/site-packages/torch/serialization.py", line 193, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '../models_ckpt/final_model/for_servermodel/tmp/ufold_train_onalldata_0.pt'
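
A minimal workaround I considered, assuming the hardcoded path is kept: create the checkpoint directory up front so torch.save does not fail at the end of the first epoch.

import os

# Ensure the hardcoded checkpoint directory exists before training starts,
# so that torch.save() does not raise FileNotFoundError.
save_dir = '../models_ckpt/final_model/for_servermodel/tmp'
os.makedirs(save_dir, exist_ok=True)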

Could you please assist me in re-creating your training methodology for a custom dataset? Additionally, could you detail how I might go about re-training with synthetic data as mentioned in the paper, along with the methodology used to generate it? I found references to multiple training stages in ufold/config.json (for example, epoches_first, epoches_second, and epoches_third), but I could not find any other references to them elsewhere in the code.

@sperfu
Contributor

sperfu commented Mar 4, 2022

Hi there,

Thanks for reaching out.

1. The first issue you point out is due to the GPU id. You can switch to your own CUDA device id or change it to CPU; our framework is capable of running on CPU.
2. The second one is because our model supports multiple input datasets, so we call a function that merges all the datasets. We have also changed the code to accept a single dataset, as you can see in the data_generator.py file.
3. The third issue is simply due to an error in the pre-trained model saving path, which we have changed to the working path; you may change it to your own path as well.
4. Some references in the ufold/config.json file do not appear in the code because those params were only used for earlier debugging and testing, so we have deleted the unnecessary references.

Last but not least, regarding the synthetic data: for each real sequence, we first generate 3 synthetic sequences by randomly mutating some nucleotides of the corresponding real sequence, creating a pool of synthetic sequences. To make sure these sequences pass the redundancy-removal procedure and stay clearly separated from the training set, we then use CD-HIT at an 80% threshold to remove any synthetic sequence with more than 80% similarity to a real sequence. The synthetic ground-truth labels are generated with Contrafold, and the resulting data are then used to train UFold.
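
To illustrate the mutation step only (a rough sketch rather than the exact script; the 20% per-nucleotide mutation rate is just a placeholder, and the CD-HIT filtering and Contrafold labelling happen in separate external tools):

import random

def mutate_sequence(seq, mutation_rate=0.2, alphabet="ACGU"):
    # Produce one synthetic sequence by randomly replacing a fraction of
    # nucleotides in a real sequence. The rate here is only illustrative.
    out = []
    for base in seq:
        if random.random() < mutation_rate:
            # Substitute with a different nucleotide.
            out.append(random.choice([b for b in alphabet if b != base]))
        else:
            out.append(base)
    return "".join(out)

# Three synthetic sequences per real sequence, as described above.
real_seq = "GGGAAACUCCCUAGCAUGG"
synthetic_pool = [mutate_sequence(real_seq) for _ in range(3)]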

Thanks

@marcellszi
Author

Hi,

Thank you very much for your quick response and fixes.

I see your last four commits addressed my issues. I appreciate the help. I was able to start training a model after your changes.

Quick FYI: 8db5a90 breaks things due to 528533143e194854e264fcfd9802252c95f2f6b7/ufold/config.py#L24, but I was able to trivially fix it by reverting the config file.
