About OOM issue related to 'create_graph=True' #7

Open

DOHA-HWANG opened this issue Jan 26, 2022 · 4 comments

Comments

@DOHA-HWANG commented Jan 26, 2022

Dear Author,

First of all, thank you for sharing this nice code.

By the way, regarding quantized training on CIFAR-10: have you ever run into OOM issues caused by loss.backward(create_graph=True) in update_grad_scales?
When I ran it with the arguments below, I hit a "RuntimeError: CUDA out of memory" error.

python train_quant.py --gpu_id '0' \
  --weight_levels 8 \
  --act_levels 8 \
  --baseline False \
  --use_hessian True \
  --load_pretrain True \
  --pretrain_path '../results/ResNet20_CIFAR10/fp/checkpoint/last_checkpoint.pth' \
  --log_dir '../results/ResNet20_CIFAR10/ours(hess)/W8A8/'

Do you have any idea how to avoid this issue?

Thank you in advance.
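
For reference, the sketch below is not this repository's code; it is a minimal PyTorch illustration (with assumed names) of the failure mode: backward(create_graph=True) keeps the backward graph alive so second-order terms can be computed later, and any gradient references that survive the loop keep one graph per batch in GPU memory.

```python
# Minimal sketch (assumed names, not the repository's code): with
# create_graph=True the backward graph is retained for higher-order use.
# Any surviving reference to the gradients (here the `retained` list)
# keeps one graph alive per batch, so GPU memory grows until CUDA
# reports "out of memory".
import torch

model = torch.nn.Linear(512, 512).cuda()
retained = []
for _ in range(1000):
    x = torch.randn(256, 512, device="cuda")
    loss = model(x).pow(2).mean()
    loss.backward(create_graph=True)  # p.grad now carries a grad_fn -> graph stays in memory
    retained.append([p.grad for p in model.parameters()])  # dangling references leak one graph per batch
```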

@Joejwu commented May 11, 2022

So, have you solved the problem? I am facing the same issue.

@kartikgupta-at-anu

I run into the same issue when training on ImageNet.

@DOHA-HWANG (Author)

I haven't used this code recently, so I can't remember exactly how I avoided this problem.
However, when I checked my last private version of the code, I had the Hessian option disabled ("--use_hessian False"). (And I remember training this project successfully before.)
As far as I know, using the Hessian was not crucial in this project, according to the experiments in their paper.

@kartikgupta-at-anu commented May 24, 2022

use_hessian is important if we want the scale factors in the EWGS equation to be based on the Hessian.
I figured out a solution to this problem. The code in this part is not well written: some references are left dangling, so it keeps accumulating computation graphs. One possible solution/hack to avoid the OOM issue is to add a loss.backward() in utils.py after line 116, which releases the graph after each batch.
