
Fix v2 transforms in spawn mp context #8067

Merged: 7 commits into pytorch:main, Oct 27, 2023
Conversation

NicolasHug (Member) commented Oct 25, 2023

Fixes #8066

cc @vfdev-5
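For context, here's a minimal sketch of the failure mode from #8066 (the dataset choice and paths are illustrative, not taken from the issue). Under the "spawn" start method, DataLoader workers receive their dataset by pickling it, so a plain pickle round trip reproduces what happens at worker startup:

    import pickle

    from torchvision import datasets
    from torchvision.datasets import wrap_dataset_for_transforms_v2
    from torchvision.transforms import v2

    # Hypothetical paths; any dataset supported by the v2 wrapper shows the bug.
    dataset = datasets.CocoDetection(
        "path/to/images",
        "path/to/annotations.json",
        transforms=v2.Resize((224, 224)),
    )
    wrapped = wrap_dataset_for_transforms_v2(dataset)

    # A "spawn" worker unpickles its own copy of the dataset, so this round
    # trip mimics worker startup:
    restored = pickle.loads(pickle.dumps(wrapped))

    # Before this fix, the transforms were dropped during pickling, so the
    # restored dataset returned un-resized samples.
    img, target = restored[0]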

pytorch-bot bot commented Oct 25, 2023

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/vision/8067

❌ 2 New Failures, 30 Unrelated Failures

As of commit 33dc494 with merge base 96d2ce9.

Comment on lines 732 to 733
if isinstance(dataset, VOCDetection):
assert wrapped_sample[0][0].size == (321, 123)
NicolasHug (Member, Author):

The point of this test is just to show that the fix works. It obviously won't stay as-is.

We should really add transforms to this test, though. @pmeier, do you have any suggestion on how you'd prefer to do that? Otherwise, I'll just do something similar for all datasets, i.e. check that the images in wrapped_sample have a specific expected size.
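For illustration, the per-dataset check could look something like this (the Resize transform and the helper name are assumptions, not the final test design):

    from torchvision.transforms import v2

    # Give every dataset under test a transform with a distinctive output size,
    # then assert that the wrapped sample actually went through it.
    transform = v2.Resize((123, 321))  # v2.Resize takes (height, width)

    def check_transform_applied(wrapped_dataset):
        img = wrapped_dataset[0][0]
        # PIL images report size as (width, height), hence the swapped order:
        assert img.size == (321, 123)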

dataset.transform = self.transform
dataset.transforms = self.transforms
dataset.target_transform = self.target_transform
return wrap_dataset_for_transforms_v2, (dataset, self._target_keys)
NicolasHug (Member, Author):

TBH, I don't really understand why we needed to have __reduce__ in the first place. I understand it's needed to support pickle, but my understanding stops there.

The whole logic seems really strange: we call wrap_dataset_for_transforms_v2() yet again, which is how the instance being pickled was created in the first place. And now, on top of that, we add the "field resetting" logic, i.e. we just undo what we did in __init__.

@pmeier, can you remind me of the details here? Do you think we could support pickleability of these datasets in a different way that wouldn't require this "fix"?

pmeier (Collaborator):

pickle does not support dynamic types like the one we create in

wrapped_dataset_cls = type(f"Wrapped{type(dataset).__name__}", (VisionDatasetTVTensorWrapper, type(dataset)), {})

by default. Thus, we implement the __reduce__ method to tell pickle how to construct the object from its parts. Now only the items on L212 need to be pickled. During unpickling, the dynamic type is created anew, and thus we circumvent the issue.

We have the dynamic type in the first place to support isinstance checks. There was also a different option in #7239 (comment) that would work without a dynamic type. I think that could potentially work without your fix, but I would need to test. However, this option also has its drawbacks (see discussion in the original PR).
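To make the mechanism concrete, here is a self-contained sketch of both points in plain Python, independent of torchvision (all names are illustrative):

    import pickle

    class Base:
        def __init__(self, value):
            self.value = value

    def make_broken(value):
        # A dynamic type, analogous to Wrapped<Dataset>: it only exists inside
        # this call, so pickle's by-name class lookup cannot find it.
        cls = type("BrokenWrapper", (Base,), {})
        return cls(value)

    try:
        pickle.dumps(make_broken(1))
    except pickle.PicklingError as exc:
        print(exc)  # "Can't pickle ...: attribute lookup ... failed"

    class Reducible(Base):
        def __reduce__(self):
            # Pickle a factory plus its arguments instead of the class itself;
            # unpickling calls make_fixed(value), which recreates the dynamic
            # type anew. This mirrors the wrapper returning
            # (wrap_dataset_for_transforms_v2, (dataset, self._target_keys)).
            return make_fixed, (self.value,)

    def make_fixed(value):
        cls = type("FixedWrapper", (Reducible,), {})
        return cls(value)

    restored = pickle.loads(pickle.dumps(make_fixed(2)))
    assert restored.value == 2
    assert isinstance(restored, Reducible)  # isinstance checks still work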

@@ -198,8 +199,19 @@ def __getitem__(self, idx):
def __len__(self):
return len(self._dataset)

# TODO: maybe we should use __getstate__ and __setstate__ instead of __reduce__, as recommended in the docs.
NicolasHug (Member, Author):

See the discussion just above.

pmeier (Collaborator):

We can try, but I think it is not possible. The "state" in __getstate__ and __setstate__ is the second return value of __reduce__ below. And while we can recreate a VisionDatasetTVTensorWrapper from it, pickle does not know how to create the dynamic type. I'll give it a shot.
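A quick illustration of why the state hooks alone don't help (a plain-Python sketch, not the actual wrapper):

    import pickle

    class Base:
        def __getstate__(self):
            return {"value": self.value}

        def __setstate__(self, state):
            self.value = state["value"]

    def make_dynamic():
        # Created at call time, so the class is not importable by name:
        return type("Dynamic", (Base,), {})()

    obj = make_dynamic()
    obj.value = 1

    # Still fails: __getstate__/__setstate__ only control the instance state,
    # but pickle also serializes the *class* by name, and the dynamic class
    # cannot be looked up.
    try:
        pickle.dumps(obj)
    except pickle.PicklingError as exc:
        print(exc)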

@@ -1005,9 +1014,11 @@ def inject_fake_data(self, tmpdir, config):
)
return num_videos_per_class * len(classes)

@pytest.mark.xfail(reason="FIXME")
NicolasHug (Member, Author):

This fails; I think it's because Kinetics doesn't convert its transform attribute into a transforms attribute, but I haven't double-checked. If that's OK, I'd like to merge this PR right now to get it behind us, and investigate that separately just after.

pmeier (Collaborator):

OK with investigating later. Will have a look.

pmeier (Collaborator) left a review:

Thanks!

# https://github.com/pytorch/vision/issues/8066
# Implicitly, this also checks that the wrapped datasets are pickleable.

# To save CI/test time, we only check on macOS where "spawn" is the default
pmeier (Collaborator):

Spawn is also the default on Windows. Since macOS CI is by far the costliest, should we just use Windows here?
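For reference, a platform gate along these lines would do it (the test name and wrapped_dataset fixture are illustrative, not the PR's actual test):

    import sys

    import pytest
    import torch

    @pytest.mark.skipif(
        sys.platform != "win32",
        reason="'spawn' is the default start method on Windows (and macOS), "
        "and Windows CI is cheaper than macOS CI",
    )
    def test_transforms_survive_spawn(wrapped_dataset):
        # Pull one sample through a spawned worker. Before the fix, the
        # worker's unpickled copy of the dataset had lost its transforms.
        loader = torch.utils.data.DataLoader(
            wrapped_dataset,
            num_workers=1,
            multiprocessing_context="spawn",
            collate_fn=lambda batch: batch,
        )
        next(iter(loader))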


NicolasHug (Member, Author) commented Oct 27, 2023

Thanks for the review! The failures seem to be unrelated.

There was

FAILED test/test_datasets.py::FlyingThings3DTestCase::test_str_smoke - PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\tmp3_12x8ek\\FlyingThings3D\\optical_flow\\TRAIN\\B\\0001\\into_future\\left'
= 1 failed, 21433 passed, 16912 skipped, 2 xfailed, 231 warnings in 2584.93s (0:43:04) =

on the 3.10 Windows job, but this dataset isn't touched by this PR, and the other Windows jobs are executing the new test and passing. Let's hope it's just a one-off; I'll merge.

NicolasHug merged commit b80bdb7 into pytorch:main on Oct 27, 2023 (2 checks passed).
NicolasHug deleted the fix_spawn branch on October 27, 2023 at 12:31.
NicolasHug added a commit to NicolasHug/vision that referenced this pull request Oct 27, 2023
facebook-github-bot pushed a commit that referenced this pull request Nov 14, 2023
Reviewed By: vmoens

Differential Revision: D50789087

Linked issue: Transforms are lost when using a dataloader with spawn (#8066)