CI fails on windows: ci/circleci: unittest_windows_cpu_pyX.Y #6189

Closed · vfdev-5 opened this issue Jun 22, 2022 · 18 comments

vfdev-5 (Collaborator) commented Jun 22, 2022

Tests on Windows started failing:

test/test_models.py::test_classification_model[cpu-regnet_y_16gf] PASSED [ 79%]
test/test_models.py::test_classification_model[cpu-regnet_y_32gf] PASSED [ 79%]
test/test_models.py::test_classification_model[cpu-regnet_y_128gf] 

Exited with code exit status 127

CircleCI received exit code 127

It started appearing on PyTorch core nightly 20220622

cc @pmeier @seemethere

YosuaMichael (Contributor) commented Jun 22, 2022

Seems like it is green now on #5009; should we close this issue, @vfdev-5?

vfdev-5 (Collaborator, Author) commented Jun 22, 2022

Thanks for the update! Let's close this issue if everything is OK now

vfdev-5 closed this as completed Jun 22, 2022

vfdev-5 (Collaborator, Author) commented Jun 22, 2022

@YosuaMichael actually, tests are still failing on #5009. I'm reopening.

vfdev-5 reopened this Jun 22, 2022

YosuaMichael (Contributor) commented:

> @YosuaMichael actually, tests are still failing on #5009. I'm reopening.

Ah yeah, previously I just reran the test and it seemed green, but it got the error again after I updated the branch. Sorry for the false negative, @vfdev-5!

datumbox (Contributor) commented:

@vfdev-5 the failure is suspicious because it's on a very large model. Can you try skipping the specific test to see if this is related to issues on CircleCI's side rather than in core? Another thing we can do to confirm that core is not the issue is to pin the nightly to the previous one and rerun the job. If it still fails, we will know it's CircleCI.
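
One quick way to check the skip hypothesis (a sketch, not the eventual fix: the test id comes from the log at the top of this issue, and pytest's -k filter matches substrings of test ids) would be to deselect the suspect parametrization when running the suite:

# run the classification tests but leave out the very large regnet_y_128gf case
python -m pytest test/test_models.py -k "test_classification_model and not regnet_y_128gf"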

YosuaMichael (Contributor) commented:

@datumbox @vfdev-5 Let me check the hypothesis by skipping the large model (I will create a mock PR for this).

YosuaMichael (Contributor) commented:

I have confirmed that skipping the big models indeed makes the CI green again. Now, with the same PR #6195, I am using the older nightly version (torch==1.13.0.dev20220621) to check whether this issue is caused by core or by CircleCI.

YosuaMichael (Contributor) commented Jun 23, 2022

@vfdev-5 @datumbox Using the older nightly version seems to hit the same error as before (see #6195, https://app.circleci.com/pipelines/github/pytorch/vision/18534/workflows/54804215-e148-4523-af21-c0ce30837484/jobs/1499610).
I think this means the problem is not in core.
Looking at the torchvision PRs from the last 7 days, there seem to be no changes to models or tests, hence I think it might be because of CircleCI.

Any suggestion on how to proceed from here?

^^ Never mind on this; I checked the error and it seems the job still uses the new torch nightly:

Installed c:\users\circleci\project\env\lib\site-packages\charset_normalizer-2.0.12-py3.9.egg
Searching for torch==1.13.0.dev20220623
Best match: torch 1.13.0.dev20220623
Adding torch 1.13.0.dev20220623 to easy-install.pth file

It seems the Windows build has a different way of getting the torch core; I will look into this first.

vfdev-5 (Collaborator, Author) commented Jun 23, 2022

@YosuaMichael you should also be able to SSH into the failing CircleCI job and check directly which nightly is passing (if any).
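
For reference, a rough sketch of what that check could look like once connected (CircleCI's "Rerun job with SSH" option prints the exact ssh command on the job page; the env path below is inferred from the easy-install log earlier in this thread and is an assumption):

# inside the SSH session of the failing Windows job
conda activate /c/users/circleci/project/env   # conda env used by the unittest scripts (assumed path)
python -c "import torch; print(torch.__version__)"   # shows which nightly actually got installed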

YosuaMichael (Contributor) commented:

@vfdev-5 I have never SSHed into CircleCI before, do you have any pointers on how to do it?

@datumbox @vfdev-5 I identified that the installation of pytorch on Windows seems to happen here: https://github.com/pytorch/vision/blob/main/.circleci/unittest/windows/scripts/install.sh#L37. Do you have any idea how to modify it to install the older nightly version?
Note that this is the command it used to install torch in the failing test:

conda install -y -c pytorch-nightly -c nvidia 'pytorch-nightly::pytorch[build=*cpu*]' cpuonly

Another note: in https://anaconda.org/pytorch-nightly/pytorch/files it seems that only version 20220623 is available, and the nearest one before that is 20220423 ...
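
For reference, pinning the conda nightly would look roughly like the line below (the version spec is an assumption; as noted above, the older builds may no longer be on the pytorch-nightly channel, in which case this would simply fail to resolve):

conda install -y -c pytorch-nightly pytorch=1.13.0.dev20220621 cpuonly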

vfdev-5 (Collaborator, Author) commented Jun 23, 2022

@YosuaMichael there was a wiki page on pytorch about that. There is a link in the CircleCI docs: https://circleci.com/docs/2.0/ssh-access-jobs

Basically, once you are logged in on CircleCI, you get options like: restart failed job, restart failed job with SSH. I think we need write rights on the repo to be able to rerun with SSH.
SSH on Windows can be tricky due to a sometimes-missing tty: we may just see an empty terminal, but the connection is established (check by running dir or ls). To be able to SSH, we have to add a public SSH key to GitHub.

As for the installation with conda, I would also try with pip in case a matching version is available...
I can try from my side and help you with that if you want (in ~45 mins).

YosuaMichael (Contributor) commented:

@vfdev-5 thanks for the suggestion! For now I hardcoded and replaced the conda install with pip, and it seems to successfully install the torch version that we want (see https://app.circleci.com/pipelines/github/pytorch/vision/18540/workflows/3765b5f9-445c-4d89-895b-b100f2f99834/jobs/1500134).
Now we just need to wait and see whether it reproduces the error.
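
For reference, a sketch of what the pip replacement for the conda line in install.sh might look like (the nightly index URL and the exact version pin are assumptions, not copied from the actual change):

pip install --pre torch==1.13.0.dev20220621+cpu --extra-index-url https://download.pytorch.org/whl/nightly/cpu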

YosuaMichael (Contributor) commented:

Now I have confirmed that the problem is not from core.
I have made sure it uses the nightly version 20220621 (see the Install Torchvision section in https://app.circleci.com/pipelines/github/pytorch/vision/18540/workflows/3765b5f9-445c-4d89-895b-b100f2f99834/jobs/1500134) and it still got the error (probably an OOM).

@vfdev-5 @datumbox any idea what to do next?

datumbox (Contributor) commented:

That's the problem with very large models like that: they often cause random memory issues. If you send a PR that adds a list of such models and skips them (similar to what you have for the GPU), I'll be happy to review it. Basically, we should turn off the specific test and recover our CI.

vfdev-5 (Collaborator, Author) commented Jun 23, 2022

@YosuaMichael just to confirm: if we run everything locally it does not fail, right? Only CircleCI is failing every time?

YosuaMichael (Contributor) commented:

> @YosuaMichael just to confirm: if we run everything locally it does not fail, right? Only CircleCI is failing every time?

On my MacBook it does not fail, but I think this is expected (on CircleCI only the Windows job is failing, probably because of a resource problem like memory).

datumbox (Contributor) commented:

Fixed by #6197
