Iterable Dataloader #88

Merged
merged 2 commits into from
Nov 18, 2023

@ibanesh ibanesh commented Nov 16, 2023

The fairseq2 data pipeline is an iterable without index-based random access; it loads samples from the data file one at a time or in batches.
The current SimulEval design can't iterate through such iterable dataloaders without first loading them fully into memory, which is inefficient. To accommodate this lazy loading and stay memory efficient, this PR refactors the way we iterate through the dataset.
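
To illustrate the difference between the two iteration styles, here is a minimal sketch. It is not the fairseq2 or SimulEval API: `read_samples`, `count_eager`, and `count_lazy` are hypothetical names, and a plain generator stands in for the lazily evaluated data pipeline.

```python
import io
from typing import Iterator, TextIO


def read_samples(handle: TextIO) -> Iterator[str]:
    """Hypothetical lazy reader: yields one sample per line, on demand."""
    for line in handle:
        yield line.rstrip("\n")


def count_eager(handle: TextIO) -> int:
    # Index-based style: every sample is materialized before iteration starts.
    samples = list(read_samples(handle))
    return sum(1 for index in range(len(samples)) if samples[index])


def count_lazy(handle: TextIO) -> int:
    # Iterable style: only the current sample is held in memory at any time.
    return sum(1 for sample in read_samples(handle) if sample)


data = "a\nb\nc\n"
eager_total = count_eager(io.StringIO(data))
lazy_total = count_lazy(io.StringIO(data))
```

Both functions visit the same samples; only the eager one needs the whole dataset in memory at once.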

Eventually, it would be more memory efficient to switch to iterable dataloaders by default and avoid loading the whole file into memory, as most dataloaders do now.

An example iterable dataloader is defined in https://github.com/fairinternal/seamless_communication/pull/61 - changes were tested using this PR.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 16, 2023
@ibanesh ibanesh force-pushed the iterable-dataloader branch 2 times, most recently from d81622b to 1c6df08 Compare November 16, 2023 19:39
- segment, states[-1], upstream_states=upstream_states + states_list[:index]
+ segment,
+ states[-1],
+ upstream_states=upstream_states + states_list[: len(self.module_list[:-1])],
Contributor

just for my understanding, when is len(self.module_list[:-1]) not equal to index?

Contributor Author

When len(self.module_list) == 1.
In that case len(self.module_list[:-1]) is 0 and index is never assigned a value, which results in an exception.
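
The failure mode described above can be reproduced in isolation. This is only a sketch: the loop shape (an `enumerate` over `module_list[:-1]`) is an assumption about the surrounding code, and both function names are hypothetical.

```python
from typing import List


def slice_with_loop_index(module_list: List[str], states_list: List[str]) -> List[str]:
    # Assumed shape: `index` is only bound inside a loop over all but the last module.
    for index, _module in enumerate(module_list[:-1]):
        pass  # per-module work would go here
    # With a single module the loop body never runs, so `index` is unbound here.
    return states_list[:index]


def slice_with_len(module_list: List[str], states_list: List[str]) -> List[str]:
    # Computing the bound from the list itself works for any length.
    return states_list[: len(module_list[:-1])]


single_module = ["agent"]
states = ["state0"]

try:
    slice_with_loop_index(single_module, states)
    loop_index_failed = False
except UnboundLocalError:
    loop_index_failed = True

safe_result = slice_with_len(single_module, states)
```

With one module, the loop-index version raises `UnboundLocalError`, while the `len`-based version returns an empty slice.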

simuleval/evaluator/evaluator.py (resolved)
"--output",
type=str,
default=None,
help="Output directory. Required if using iterable dataloader.",
Contributor

Also just curious: why required if using iterable dataloader but not otherwise?

Contributor Author

For the typical dataloader (other than "iterable" ones), we create a list of instances and iterate through this list in the evaluator's __call__ method. The results are stored as a field in each instance object in this list, and at the end of iteration we use this list from memory to calculate the metrics/scores. So it doesn't matter whether an output arg is specified: without one, the instances don't get written to a file, but we can still calculate the scores/metrics from the results in memory.

But for iterable dataloaders, where we lazily load the data one instance at a time, only the current instance is stored in memory, and at each iteration we write this instance's data to a file if output is given. After iterating through all the data, we load the results back into memory from the file we wrote during iteration and use these in-memory instances to calculate the cumulative scores/metrics. So without the output arg, we won't write to a file, and subsequently we won't be able to calculate the scores/metrics.
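
The write-then-reload flow described above could look roughly like the following. This is a sketch under stated assumptions, not SimulEval's implementation: the JSON-lines layout, the `instances.log` file name, the record fields, and the exact-match metric are all hypothetical stand-ins.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterable, Iterator, List


def lazy_instances() -> Iterator[Dict[str, object]]:
    """Hypothetical lazy source: yields one finished instance at a time."""
    for i in range(3):
        reference = "other" if i == 1 else f"hyp {i}"
        yield {"index": i, "prediction": f"hyp {i}", "reference": reference}


def run_and_log(instances: Iterable[Dict[str, object]], log_path: Path) -> None:
    # Only the current instance is in memory; each one is appended to the log.
    with log_path.open("w", encoding="utf-8") as log:
        for instance in instances:
            log.write(json.dumps(instance) + "\n")


def load_logged(log_path: Path) -> List[Dict[str, object]]:
    # After iteration, the full result set is rebuilt from the file.
    with log_path.open(encoding="utf-8") as log:
        return [json.loads(line) for line in log if line.strip()]


def exact_match_rate(instances: List[Dict[str, object]]) -> float:
    # Stand-in cumulative metric computed over the reloaded instances.
    hits = sum(1 for item in instances if item["prediction"] == item["reference"])
    return hits / len(instances)


with tempfile.TemporaryDirectory() as output_dir:
    log_path = Path(output_dir) / "instances.log"
    run_and_log(lazy_instances(), log_path)
    reloaded = load_logged(log_path)
    score = exact_match_rate(reloaded)
```

Without an output directory there is no file to reload from, which is why the cumulative score cannot be computed in the iterable case.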

Comment on lines -236 to -248
system.reset()
for instance in self.instance_iterator:
    while not self.is_finished(instance):
        input_segment = instance.send_source(self.source_segment_size)
        output_segment = system.pushpop(input_segment)
        instance.receive_prediction(output_segment)
        if instance.finish_prediction:
            # if instance.finish_prediction was set by the reader,
            # source_finished_reading will be set as well. If it is
            # set by any of the intermediate components, then we didn't
            # end yet. We are going to clear the state and continue
            # processing the rest of the input.
            system.reset()
Contributor Author

@ibanesh ibanesh Nov 16, 2023

This part was added/modified in #48 & #49 to support inference on YouTube streams, but it has sometimes caused issues with wait-k inference.

Contributor Author

Keeping it as is for now; will revert to the old logic if we run into issues when testing parity for wait-k.

@ibanesh ibanesh merged commit 2e3bd7e into main Nov 18, 2023
3 checks passed
@ibanesh ibanesh deleted the iterable-dataloader branch November 18, 2023 00:51

3 participants