CI fails on windows: ci/circleci: unittest_windows_cpu_pyX.Y #6189

Closed · vfdev-5 opened this issue Jun 22, 2022 · 18 comments

vfdev-5 (Collaborator) commented Jun 22, 2022

Tests on Windows started failing:

test/test_models.py::test_classification_model[cpu-regnet_y_16gf] PASSED [ 79%]
test/test_models.py::test_classification_model[cpu-regnet_y_32gf] PASSED [ 79%]
test/test_models.py::test_classification_model[cpu-regnet_y_128gf] 

Exited with code exit status 127

CircleCI received exit code 127

It started appearing on PyTorch core nightly 20220622

cc @pmeier @seemethere

YosuaMichael (Contributor) commented Jun 22, 2022

Seems like it is green now on #5009; should we close this issue, @vfdev-5?

vfdev-5 (Collaborator, Author) commented Jun 22, 2022

Thanks for the update! Let's close this issue if everything is OK now

vfdev-5 closed this as completed Jun 22, 2022

vfdev-5 (Collaborator, Author) commented Jun 22, 2022

@YosuaMichael actually, tests are still failing on #5009. I'm reopening.

vfdev-5 reopened this Jun 22, 2022

YosuaMichael (Contributor) commented:

> @YosuaMichael actually, tests are still failing on #5009. I'm reopening.

Ah yeah, previously I just reran the test and it seemed green, but it got the error again after I updated the branch. Sorry for the false negative, @vfdev-5!

datumbox (Contributor) commented:

@vfdev-5 the failure is suspicious because it's on a very large model. Can you try skipping the specific test to see if this is related to issues on CircleCI's side rather than in core? Another thing we can do to confirm that core is not the issue is to pin the nightly to the previous one and rerun the job. If it still fails, we will know it's CircleCI.
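
One quick way to check the skip hypothesis (a sketch, not the eventual fix: the test id comes from the log at the top of this issue, and pytest's -k filter matches substrings of test ids) would be to deselect the suspect parametrization when running the suite:

# run the classification tests but leave out the very large regnet_y_128gf case
python -m pytest test/test_models.py -k "test_classification_model and not regnet_y_128gf"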

YosuaMichael (Contributor) commented:

@datumbox @vfdev-5 Let me check the hypothesis by skipping the large model (I will create a mock PR for this).

YosuaMichael (Contributor) commented:

I have confirmed that skipping the big models indeed makes the CI green again. Now, with the same PR #6195, I am using the older nightly version (torch==1.13.0.dev20220621) to check whether this issue is caused by core or by CircleCI.

YosuaMichael (Contributor) commented Jun 23, 2022

@vfdev-5 @datumbox Using the older nightly version seems to hit the same error as before (see #6195, https://app.circleci.com/pipelines/github/pytorch/vision/18534/workflows/54804215-e148-4523-af21-c0ce30837484/jobs/1499610).
I think this means the problem is not in core.
Looking at the torchvision PRs from the last 7 days, there seem to be no changes to models or tests, hence I think it might be because of CircleCI.

Any suggestion on how to proceed from here?

^^ Never mind on this; I checked the error and it seems the job still uses the new torch nightly:

Installed c:\users\circleci\project\env\lib\site-packages\charset_normalizer-2.0.12-py3.9.egg
Searching for torch==1.13.0.dev20220623
Best match: torch 1.13.0.dev20220623
Adding torch 1.13.0.dev20220623 to easy-install.pth file

It seems the Windows build has a different way of getting the torch core; I will look into this first.

vfdev-5 (Collaborator, Author) commented Jun 23, 2022

@YosuaMichael you should also be able to SSH into the failing CircleCI job and check directly which nightly is passing (if any).
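
For reference, a rough sketch of what that check could look like once connected (CircleCI's "Rerun job with SSH" option prints the exact ssh command on the job page; the env path below is inferred from the easy-install log earlier in this thread and is an assumption):

# inside the SSH session of the failing Windows job
conda activate /c/users/circleci/project/env   # conda env used by the unittest scripts (assumed path)
python -c "import torch; print(torch.__version__)"   # shows which nightly actually got installed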

YosuaMichael (Contributor) commented:

@vfdev-5 I have never SSHed into CircleCI before, do you have any pointers on how to do it?

@datumbox @vfdev-5 I identified that the installation of pytorch on Windows seems to happen here: https://github.com/pytorch/vision/blob/main/.circleci/unittest/windows/scripts/install.sh#L37. Do you have any idea how to modify it to install the older nightly version?
Note that this is the command it used to install torch in the failing test:

conda install -y -c pytorch-nightly -c nvidia 'pytorch-nightly::pytorch[build=*cpu*]' cpuonly

Another note: in https://anaconda.org/pytorch-nightly/pytorch/files it seems that only version 20220623 is available, and the nearest one before that is 20220423 ...
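
For reference, pinning the conda nightly would look roughly like the line below (the version spec is an assumption; as noted above, the older builds may no longer be on the pytorch-nightly channel, in which case this would simply fail to resolve):

conda install -y -c pytorch-nightly pytorch=1.13.0.dev20220621 cpuonly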

vfdev-5 (Collaborator, Author) commented Jun 23, 2022

@YosuaMichael there was a wiki page on pytorch about that. There is a link in the CircleCI docs: https://circleci.com/docs/2.0/ssh-access-jobs

Basically, once you are logged in on CircleCI, you get options like: restart failed job, restart failed job with SSH. I think we need write rights on the repo to be able to rerun with SSH.
SSH on Windows can be tricky due to a sometimes-missing tty: we may just see an empty terminal, but the connection is established (check by running dir or ls). To be able to SSH, we have to add a public SSH key to GitHub.

As for the installation with conda, I would also try with pip in case a matching version is available...
I can try from my side and help you with that if you want (in ~45 mins).

YosuaMichael (Contributor) commented:

@vfdev-5 thanks for the suggestion! For now I hardcoded and replaced the conda install with pip, and it seems to successfully install the torch version that we want (see https://app.circleci.com/pipelines/github/pytorch/vision/18540/workflows/3765b5f9-445c-4d89-895b-b100f2f99834/jobs/1500134).
Now we just need to wait and see whether it reproduces the error.
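
For reference, a sketch of what the pip replacement for the conda line in install.sh might look like (the nightly index URL and the exact version pin are assumptions, not copied from the actual change):

pip install --pre torch==1.13.0.dev20220621+cpu --extra-index-url https://download.pytorch.org/whl/nightly/cpu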

YosuaMichael (Contributor) commented:

Now I have confirmed that the problem is not from core.
I have made sure it uses the nightly version 20220621 (see the Install Torchvision section in https://app.circleci.com/pipelines/github/pytorch/vision/18540/workflows/3765b5f9-445c-4d89-895b-b100f2f99834/jobs/1500134) and it still got the error (probably an OOM).

@vfdev-5 @datumbox any idea what to do next?

datumbox (Contributor) commented:

That's the problem with very large models like that: they often cause random memory issues. If you send a PR that adds a list of such models and skips them (similar to what you have for the GPU), I'll be happy to review it. Basically, we should turn off the specific test and recover our CI.

vfdev-5 (Collaborator, Author) commented Jun 23, 2022

@YosuaMichael just to confirm: if we run everything locally it does not fail, right? Only CircleCI is failing every time?

YosuaMichael (Contributor) commented:

> @YosuaMichael just to confirm: if we run everything locally it does not fail, right? Only CircleCI is failing every time?

On my MacBook it does not fail, but I think this is expected (on CircleCI only the Windows job is failing, probably because of a resource problem like memory).

datumbox (Contributor) commented:

Fixed by #6197
