Frequently getting NaN for losses? Halts the training process #189
(0) Invalid argument: Nan in summary histogram for: generator/encoder_5/conv2d/kernel/values
I'm having the same issue and it's driving me crazy.
Batch Normalization is the culprit.
@antonio0372 did you fix it?
Hi Samantha,
I made some progress. It turns out the default batch size in pix2pix is 1, which makes batch normalization pretty useless, and indeed it seemed to easily cause NaNs in the gradients on the GPU, but not on the CPU, which is strange.
Anyway, by replacing batch normalization with per-image normalization, I got much further and fully trained 200 epochs on the GPU (around 1.8M steps).
However, I later modified my dataset and got NaNs again, so it seems TensorFlow on the GPU is still quite sensitive/unstable depending on several factors.
Cheers
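A minimal sketch of the per-image normalization Antonio describes, assuming a TF 1.x graph like pix2pix-tensorflow's; the helper name and where it gets wired in are illustrative, not the repo's actual code:

```python
import tensorflow as tf

def per_image_norm(x, epsilon=1e-5):
    # Normalize each image independently over its spatial axes (H, W),
    # per channel, instead of normalizing across the batch. With batch
    # size 1 this is roughly what batch norm collapses to anyway, but it
    # avoids the batch statistics that appear to be producing the NaNs.
    mean, variance = tf.nn.moments(x, axes=[1, 2], keep_dims=True)
    return (x - mean) * tf.rsqrt(variance + epsilon)

# Hypothetical usage: call per_image_norm(conv_output) wherever the
# generator/discriminator currently apply batch normalization.
```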
Hi, can you please elaborate on that solution a little bit? By the way, for me this issue also happened when using the CPU backend.
Hi!
I've actually switched to Torch. It's training, has passed 2M iterations, and no problems at all so far.
I'm starting to believe TensorFlow is unstable.
The last thing I tried was clipping the gradients; there's no way the gradients can explode if clipped, but I still got NaNs. So I finally decided to ditch TensorFlow altogether.
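For reference, a sketch of what the gradient clipping attempt looks like in TF 1.x; the clip norm and optimizer settings are illustrative, not values taken from this thread:

```python
import tensorflow as tf

def clipped_train_op(loss, clip_norm=1.0):
    # Clip gradients by global norm before applying them. Note that even
    # with clipping, NaNs can still appear if the loss itself goes NaN
    # (e.g. a log(0) in the GAN loss), which may be why this didn't help.
    optimizer = tf.train.AdamOptimizer(learning_rate=2e-4, beta1=0.5)
    grads_and_vars = optimizer.compute_gradients(loss)
    grads, variables = zip(*grads_and_vars)
    clipped, _ = tf.clip_by_global_norm(grads, clip_norm)
    return optimizer.apply_gradients(zip(clipped, variables))
```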
Antonio, thanks for your quick answer! Torch.. oh well.. I don't know anything about that, but I don't know anything about TF either, so I guess I'm giving it a shot then :)
Got the same problem on a generator decoder (not encoder): Nan in summary histogram for: generator/decoder_5/conv2d_transpose/kernel/values
And it started only when I'm using batch size > 1, and only on my own dataset. I suppose it happens on duplicated images in the dataset. @antonio0372, may you also have duplicates in yours?
Interesting. I don't have duplicates, but my images are probably similar enough that their differences are minimal in stochastic terms.
Thanks for the fast reply, @antonio0372!
Downgrading TensorFlow to 1.14.0 resolves this issue.
But I strongly advise NOT to use such a huge batch size, as it generalizes MUCH worse. I tried batch sizes of 4-10 and got much more convincing results in a comparable amount of time.
As mentioned by @skabbit, using TensorFlow 1.14.0 (pip install tensorflow-gpu==1.14.0) seems to work fine for now. I am using Anaconda on a Windows 10 machine.
During training I am getting NaN for the training losses, sometimes in the first epoch and sometimes much later. Example:
progress epoch 5 step 357 image/sec 10.4 remaining 391m
discrim_loss nan
gen_loss_GAN 1.5034107
gen_loss_L1 nan
Training looks to be working perfectly until this point, and then the process halts. Any idea?
Thank you
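A quick way to narrow down where a NaN first appears in a TF 1.x graph like this one is tf.check_numerics, which fails fast with the name of the offending tensor instead of letting the NaN propagate into the summary histograms. A minimal sketch; the wrapper name and the tensors it is applied to are placeholders, not pix2pix's actual graph names:

```python
import tensorflow as tf

def checked(tensor, name):
    # Raise an InvalidArgument error the first time `tensor` contains
    # NaN or Inf, naming it, instead of silently propagating the NaN
    # into the losses and summaries.
    return tf.check_numerics(tensor, "NaN/Inf detected in %s" % name)

# Hypothetical usage on the losses reported above:
# discrim_loss = checked(discrim_loss, "discrim_loss")
# gen_loss_L1 = checked(gen_loss_L1, "gen_loss_L1")
```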