Learning rate restart broken with Nanoset? #233
Comments
Hey, thanks for opening the issue. Can you add the error message you get and the log?
Here they are:
(switched to .txt due to GitHub constraints)
Hi! About the "0 steps remaining" in this issue, see here: nanotron/src/nanotron/helpers.py, lines 694 to 698 in 97c13b0
cc @zzhhjjj, maybe you can take a look at this (I took a screenshot of the part that shows two different lr values despite having the same lr_schedule in the config and resuming from the checkpoint)
I think you are correct. I'll take a look. I remember seeing the same issue before. A temporary bypass would be to modify the metafile by hand.
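To make the symptom in this thread concrete, here is a minimal sketch of a sanity check one could run after resuming. It is not nanotron code: the schedule shape (linear warmup followed by cosine decay), the function names `lr_at_step` and `resumed_lr_matches`, and all default values are assumptions chosen for illustration. The idea is simply that the lr logged on the first step after resuming should equal the schedule evaluated at the checkpoint step; if the step counter is restored incorrectly, the two diverge.

```python
import math


def lr_at_step(step, base_lr=3e-4, warmup_steps=100, total_steps=1000, min_lr=3e-5):
    """Hypothetical schedule: linear warmup, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Linear warmup reaching base_lr on the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to 1.0 past total_steps.
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def resumed_lr_matches(checkpoint_step, observed_lr, rel_tol=1e-6, **schedule_kwargs):
    """Return True if the lr seen after resuming agrees with the schedule."""
    expected = lr_at_step(checkpoint_step, **schedule_kwargs)
    return math.isclose(expected, observed_lr, rel_tol=rel_tol)


if __name__ == "__main__":
    ckpt_step = 500
    # Correct resume: the scheduler continues from the checkpoint step.
    print(resumed_lr_matches(ckpt_step, lr_at_step(ckpt_step)))
    # Faulty resume (assumed failure mode): the step counter restarts at 0.
    print(resumed_lr_matches(ckpt_step, lr_at_step(0)))
```

Under these assumptions, the first check passes and the second fails, which mirrors the "two different lr" observation above. The actual cause in nanotron may differ.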
Resuming training from a checkpoint works perfectly with on-the-fly tokenization, but breaks when using Nanoset: training restarts with a different lr, one that does not match lr_schedule.pt.
We also have two additional issues that are likely connected:
Training tested with this configuration: