Training hangs: GPU utilization is 0, but GPU memory is full #60

Open
cyLVlyl opened this issue Dec 12, 2019 · 3 comments

Comments

@cyLVlyl

cyLVlyl commented Dec 12, 2019

Thank you very much for your pse-tensorflow reimplementation; it has been a great help to me. May I ask why training hangs partway through?
INFO:root:Step 000000, model loss 0.9816, total loss 1.2604, 1.28 seconds/step, 12.52 examples/second
INFO:root:Step 000010, model loss 0.9747, total loss 1.2535, 1.46 seconds/step, 10.94 examples/second
INFO:root:Step 000020, model loss 0.9603, total loss 1.2391, 0.80 seconds/step, 20.00 examples/second
INFO:root:Step 000030, model loss 0.9513, total loss 1.2301, 0.79 seconds/step, 20.31 examples/second
INFO:root:Step 000040, model loss 0.9450, total loss 1.2237, 0.79 seconds/step, 20.15 examples/second
INFO:root:Step 000050, model loss 0.9209, total loss 1.1995, 0.79 seconds/step, 20.19 examples/second
INFO:root:Step 000060, model loss 0.8839, total loss 1.1626, 0.80 seconds/step, 20.06 examples/second
INFO:root:Step 000070, model loss 0.9407, total loss 1.2193, 0.81 seconds/step, 19.83 examples/second
INFO:root:Step 000080, model loss 0.7876, total loss 1.0662, 0.80 seconds/step, 19.97 examples/second
INFO:root:Step 000090, model loss 0.9840, total loss 1.2626, 0.81 seconds/step, 19.72 examples/second
INFO:root:Step 000100, model loss 0.8153, total loss 1.0938, 0.81 seconds/step, 19.82 examples/second
INFO:root:Step 000110, model loss 0.8064, total loss 1.0850, 0.87 seconds/step, 18.29 examples/second
INFO:root:Step 000120, model loss 0.8660, total loss 1.1446, 0.81 seconds/step, 19.79 examples/second
INFO:root:Step 000130, model loss 0.7714, total loss 1.0499, 0.80 seconds/step, 19.99 examples/second
INFO:root:Step 000140, model loss 0.9863, total loss 1.2648, 0.81 seconds/step, 19.66 examples/second
INFO:root:Step 000150, model loss 0.8436, total loss 1.1220, 0.81 seconds/step, 19.75 examples/second
INFO:root:Step 000160, model loss 0.9230, total loss 1.2014, 0.81 seconds/step, 19.67 examples/second
INFO:root:Step 000170, model loss 0.9442, total loss 1.2226, 0.81 seconds/step, 19.74 examples/second
INFO:root:Step 000180, model loss 0.7808, total loss 1.0592, 0.81 seconds/step, 19.85 examples/second
INFO:root:Step 000190, model loss 0.9916, total loss 1.2700, 0.82 seconds/step, 19.48 examples/second
INFO:root:Step 000200, model loss 0.9583, total loss 1.2367, 0.81 seconds/step, 19.71 examples/second
INFO:root:Step 000210, model loss 0.7617, total loss 1.0401, 0.89 seconds/step, 18.08 examples/second
INFO:root:Step 000220, model loss 0.8324, total loss 1.1107, 0.81 seconds/step, 19.83 examples/second
INFO:root:Step 000230, model loss 0.7749, total loss 1.0533, 0.81 seconds/step, 19.79 examples/second
INFO:root:Step 000240, model loss 0.7469, total loss 1.0252, 0.80 seconds/step, 20.05 examples/second
INFO:root:Step 000250, model loss 0.9720, total loss 1.2504, 0.80 seconds/step, 19.90 examples/second
INFO:root:Step 000260, model loss 0.7180, total loss 0.9963, 0.82 seconds/step, 19.59 examples/second
INFO:root:Step 000270, model loss 0.8716, total loss 1.1499, 0.82 seconds/step, 19.58 examples/second
INFO:root:Step 000280, model loss 0.8580, total loss 1.1363, 0.80 seconds/step, 19.95 examples/second
INFO:root:Step 000290, model loss 0.9351, total loss 1.2134, 0.80 seconds/step, 20.02 examples/second
INFO:root:Step 000300, model loss 0.7840, total loss 1.0623, 0.80 seconds/step, 19.92 examples/second
INFO:root:Step 000310, model loss 0.9569, total loss 1.2352, 0.89 seconds/step, 18.05 examples/second
INFO:root:Step 000320, model loss 0.6371, total loss 0.9154, 0.81 seconds/step, 19.66 examples/second
INFO:root:Step 000330, model loss 0.8040, total loss 1.0823, 0.85 seconds/step, 18.82 examples/second
INFO:root:Step 000340, model loss 0.8689, total loss 1.1471, 0.81 seconds/step, 19.81 examples/second
INFO:root:Step 000350, model loss 0.8724, total loss 1.1506, 0.84 seconds/step, 19.01 examples/second
INFO:root:Step 000360, model loss 0.8443, total loss 1.1225, 0.83 seconds/step, 19.26 examples/second
INFO:root:Step 000370, model loss 0.8604, total loss 1.1386, 0.80 seconds/step, 20.12 examples/second
INFO:root:Step 000380, model loss 0.8354, total loss 1.1136, 0.87 seconds/step, 18.46 examples/second
INFO:root:Step 000390, model loss 0.7982, total loss 1.0764, 0.82 seconds/step, 19.53 examples/second
INFO:root:Step 000400, model loss 0.8847, total loss 1.1629, 0.84 seconds/step, 19.08 examples/second
INFO:root:Step 000410, model loss 0.8517, total loss 1.1299, 0.91 seconds/step, 17.55 examples/second
INFO:root:Step 000420, model loss 0.7689, total loss 1.0471, 0.82 seconds/step, 19.42 examples/second
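After this point training makes no further progress. What should I change so that it no longer hangs with GPU memory full?
I am training on two GTX 1080 cards.
Any advice would be much appreciated.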
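
A hang with 0% GPU utilization often means the process is blocked on the CPU side (data loading, for example), though that is only a guess for this log. One way to find out where it is stuck is to have Python dump every thread's traceback when the script runs too long. Below is a minimal sketch using the standard-library `faulthandler` module; the helper names `arm_hang_watchdog` and `disarm_hang_watchdog` and the 300-second default are made up for illustration and are not part of this repository.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s=300, out=sys.stderr):
    """Dump all thread tracebacks to `out` every `timeout_s` seconds.

    `out` must be a real file (it needs a file descriptor), because
    faulthandler writes to it from a separate watchdog thread.
    The timer is not reset by training progress, so pick a timeout
    longer than a healthy stretch of steps, or re-arm it periodically.
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=out)


def disarm_hang_watchdog():
    """Cancel the pending traceback dump (call when training finishes)."""
    faulthandler.cancel_dump_traceback_later()
```

Calling `arm_hang_watchdog()` once before the training loop should be enough. If the run stalls, the dumped tracebacks show which Python call each thread is waiting in. They only cover Python frames, so a stall inside native code will appear as the last Python call that entered it.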

@liuheng92
Owner

Check whether your data is stored in the required format.
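
The reply above does not spell out the format, so the following is only a sketch of what such a check could look like. It assumes an ICDAR-style layout: every image `NAME.jpg` sits next to a label file `NAME.txt` in the same directory, and each label line is `x1,y1,x2,y2,x3,y3,x4,y4,text`. Both the layout and the helper name `check_dataset` are assumptions; adjust them to whatever the repository's README actually specifies.

```python
import os

# Assumed image extensions; extend if your dataset uses others.
IMAGE_EXTS = {".jpg", ".jpeg", ".png"}


def check_dataset(data_dir):
    """Return a list of human-readable problems found under data_dir."""
    problems = []
    for name in sorted(os.listdir(data_dir)):
        stem, ext = os.path.splitext(name)
        if ext.lower() not in IMAGE_EXTS:
            continue  # only images drive the check
        label_path = os.path.join(data_dir, stem + ".txt")
        if not os.path.isfile(label_path):
            problems.append(f"{name}: missing label file {stem}.txt")
            continue
        # utf-8-sig tolerates the BOM that some annotation files carry
        with open(label_path, encoding="utf-8-sig") as f:
            for lineno, line in enumerate(f, start=1):
                line = line.strip()
                if not line:
                    continue  # ignore blank lines
                fields = line.split(",")
                if len(fields) < 9:
                    problems.append(
                        f"{stem}.txt:{lineno}: expected 8 coordinates "
                        f"plus text, got {len(fields)} fields"
                    )
                    continue
                try:
                    [float(v) for v in fields[:8]]
                except ValueError:
                    problems.append(
                        f"{stem}.txt:{lineno}: non-numeric coordinate"
                    )
    return problems
```

Running `check_dataset("path/to/training_data")` and printing the result before training would flag missing or malformed label files. An empty list means only that the files match the layout assumed here, not that the data is correct.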

@yanqi1811

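@cyLVlyl @liuheng92 A question for you: during my training, model_loss stays at 1.0 and never changes, while total_loss decreases normally. Is there a problem with my data, or is something else the cause? Thanks!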
