Frequently getting NaN for losses? Halts the training process #189
(0) Invalid argument: Nan in summary histogram for: generator/encoder_5/conv2d/kernel/values
I'm having the same issue and it's driving me crazy.
Batch Normalization is the culprit.
@antonio0372 did you fix it?
Hi Samantha,
I made some progress. It turns out the default batch size in pix2pix is 1, which makes batch normalization pretty useless, and indeed it seemed to easily cause NaNs in the gradients on the GPU, but not on the CPU, which is strange.
Anyway, by replacing batch normalization with per-image normalization, I got much further and fully trained 200 epochs on the GPU (around 1.8M steps).
However, I later modified my dataset and got NaNs again, so it seems TensorFlow on the GPU is still quite sensitive/unstable depending on several factors.
Cheers
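A minimal sketch of the per-image normalization Antonio describes, assuming a TF 1.x graph like pix2pix-tensorflow's; the helper name and where it gets wired in are illustrative, not the repo's actual code:

```python
import tensorflow as tf

def per_image_norm(x, epsilon=1e-5):
    # Normalize each image independently over its spatial axes (H, W),
    # per channel, instead of normalizing across the batch. With batch
    # size 1 this is roughly what batch norm collapses to anyway, but it
    # avoids the batch statistics that appear to be producing the NaNs.
    mean, variance = tf.nn.moments(x, axes=[1, 2], keep_dims=True)
    return (x - mean) * tf.rsqrt(variance + epsilon)

# Hypothetical usage: call per_image_norm(conv_output) wherever the
# generator/discriminator currently apply batch normalization.
```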
Hi, can you please elaborate on that solution a little bit? By the way, for me this issue also happened when using the CPU backend.
Hi!
I've actually switched to Torch. It's training, has passed 2M iterations, and no problems at all so far.
I'm starting to believe TensorFlow is unstable.
The last thing I tried was clipping the gradients; there's no way the gradients can explode if clipped, but I still got NaNs. So I finally decided to ditch TensorFlow altogether.
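For reference, a sketch of what the gradient clipping attempt looks like in TF 1.x; the clip norm and optimizer settings are illustrative, not values taken from this thread:

```python
import tensorflow as tf

def clipped_train_op(loss, clip_norm=1.0):
    # Clip gradients by global norm before applying them. Note that even
    # with clipping, NaNs can still appear if the loss itself goes NaN
    # (e.g. a log(0) in the GAN loss), which may be why this didn't help.
    optimizer = tf.train.AdamOptimizer(learning_rate=2e-4, beta1=0.5)
    grads_and_vars = optimizer.compute_gradients(loss)
    grads, variables = zip(*grads_and_vars)
    clipped, _ = tf.clip_by_global_norm(grads, clip_norm)
    return optimizer.apply_gradients(zip(clipped, variables))
```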
Antonio, thanks for your quick answer! Torch.. oh well.. I don't know anything about that, but I don't know anything about TF either, so I guess I'm giving it a shot then :)
Got the same problem on a generator decoder (not encoder): Nan in summary histogram for: generator/decoder_5/conv2d_transpose/kernel/values
And it started only when I'm using batch size > 1, and only on my own dataset. I suppose it happens on duplicated images in the dataset. @antonio0372, may you also have duplicates in yours?
Interesting. I don't have duplicates, but my images are probably similar enough that their differences are minimal in stochastic terms.
Thanks for the fast reply, @antonio0372!
Downgrading TensorFlow to 1.14.0 resolves this issue.
But I strongly advise NOT to use such a huge batch size, as it generalizes MUCH worse. I tried batch sizes of 4-10 and got much more convincing results in a comparable amount of time.
As mentioned by @skabbit, using TensorFlow 1.14.0 (pip install tensorflow-gpu==1.14.0) seems to work fine for now. I am using Anaconda on a Windows 10 machine.
During training I am getting NaN for the training losses, sometimes in the first epoch and sometimes much later. Example:
progress epoch 5 step 357 image/sec 10.4 remaining 391m
discrim_loss nan
gen_loss_GAN 1.5034107
gen_loss_L1 nan
Training looks to be working perfectly until this point, and then the process halts. Any idea?
Thank you
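A quick way to narrow down where a NaN first appears in a TF 1.x graph like this one is tf.check_numerics, which fails fast with the name of the offending tensor instead of letting the NaN propagate into the summary histograms. A minimal sketch; the wrapper name and the tensors it is applied to are placeholders, not pix2pix's actual graph names:

```python
import tensorflow as tf

def checked(tensor, name):
    # Raise an InvalidArgument error the first time `tensor` contains
    # NaN or Inf, naming it, instead of silently propagating the NaN
    # into the losses and summaries.
    return tf.check_numerics(tensor, "NaN/Inf detected in %s" % name)

# Hypothetical usage on the losses reported above:
# discrim_loss = checked(discrim_loss, "discrim_loss")
# gen_loss_L1 = checked(gen_loss_L1, "gen_loss_L1")
```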