-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Iterable Dataloader #88
Conversation
6e78bda
to
c95c761
Compare
d81622b
to
1c6df08
Compare
segment, states[-1], upstream_states=upstream_states + states_list[:index] | ||
segment, | ||
states[-1], | ||
upstream_states=upstream_states + states_list[: len(self.module_list[:-1])], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just for my understanding, when is len(self.module_list[:-1])
not equal to index
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the len(self.module_list) = 1.
len(self.module_list[:-1]) will be 0 and index
will not even get assigned a value in this case, resulting in an exception.
"--output", | ||
type=str, | ||
default=None, | ||
help="Output directory. Required if using iterable dataloader.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also just curious: why required if using iterable dataloader but not otherwise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the typical dataloader (other than "iterable" ones), we create a list of instances and iterate through this list in the evaluator's __call__
method, the results are stored as a filed in the instance object in this list and at the end of iteration we this list from memory to calculate the metrics/scores. So it doesn't matter if an output arg is specified, without an output arg the instances don't get written to a file but we will still be able to calculate the score/metrics using the results in memory.
But for iterable dataloaders where we are lazy loading the data one at a time, only the current instance is stored in memory and at each iteration we write this instance data to a file if output
is given. At the end of iterating through all the data, we will load the results into memory using the file we wrote to while iterating and then use this in memory instances to calculate the cumulative scores/metric. So without the output
arg, we won't write to file and subsequently we wont be able to calculate the score/metrics.
system.reset() | ||
for instance in self.instance_iterator: | ||
while not self.is_finished(instance): | ||
input_segment = instance.send_source(self.source_segment_size) | ||
output_segment = system.pushpop(input_segment) | ||
instance.receive_prediction(output_segment) | ||
if instance.finish_prediction: | ||
# if instance.finish_prediction where set by the reader, | ||
# source_finished_reading will be set as well. If it is | ||
# set by any of the intermediate components, then we didn't | ||
# end yet. We are going to clear the state and continue | ||
# processing the rest of the input. | ||
system.reset() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keeping it as is for now, will change this to old logic if we run into issue when testing parity for wait-k.
segment, states[-1], upstream_states=upstream_states + states_list[:index] | ||
segment, | ||
states[-1], | ||
upstream_states=upstream_states + states_list[: len(self.module_list[:-1])], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the len(self.module_list) = 1.
len(self.module_list[:-1]) will be 0 and index
will not even get assigned a value in this case, resulting in an exception.
"--output", | ||
type=str, | ||
default=None, | ||
help="Output directory. Required if using iterable dataloader.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the typical dataloader (other than "iterable" ones), we create a list of instances and iterate through this list in the evaluator's __call__
method, the results are stored as a filed in the instance object in this list and at the end of iteration we this list from memory to calculate the metrics/scores. So it doesn't matter if an output arg is specified, without an output arg the instances don't get written to a file but we will still be able to calculate the score/metrics using the results in memory.
But for iterable dataloaders where we are lazy loading the data one at a time, only the current instance is stored in memory and at each iteration we write this instance data to a file if output
is given. At the end of iterating through all the data, we will load the results into memory using the file we wrote to while iterating and then use this in memory instances to calculate the cumulative scores/metric. So without the output
arg, we won't write to file and subsequently we wont be able to calculate the score/metrics.
1c6df08
to
0c3f032
Compare
Fairseq2 data pipeline is an iterable without index based random access.
The current simuleval design doesn't allow us to iterate through such iterable dataloaders without fully loading them into memory (which is inefficient). Fairseq2 data pipeline is an iterable which loads the samples from teh data file one at a time or in batches. To accommodate this lazy loading of data and to be memory efficient, refactoring the way we iterate through the dataset.
Eventually, it would be memory efficient to switch to iterable dataloaders by default and avoid loading the whole file into memory like we do for most dataloaders now.
An example iterable dataloader is defined in https://github.com/fairinternal/seamless_communication/pull/61 - changes were tested using this PR.