Iterable Dataloader #88

ibanesh · 2023-11-16T11:21:52Z

The Fairseq2 data pipeline is an iterable without index-based random access; it loads samples from the data file one at a time or in batches.
The current SimulEval design can't iterate through such iterable dataloaders without first loading them fully into memory, which is inefficient. To accommodate this lazy loading and stay memory efficient, this PR refactors the way we iterate through the dataset.

Eventually, it would be memory efficient to switch to iterable dataloaders by default and avoid loading the whole file into memory like we do for most dataloaders now.
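
To make the difference concrete, here is a minimal sketch contrasting eager and lazy loading. It is not the Fairseq2 data pipeline or any SimulEval dataloader; the function names (`load_all`, `iter_samples`, `count_tokens`) and the one-sample-per-line file layout are assumptions made for illustration only.

```python
from typing import Iterable, Iterator, List


def load_all(path: str) -> List[str]:
    """Eager loading: every sample is read into a list up front.

    Supports index-based access, but memory grows with the file size.
    """
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def iter_samples(path: str) -> Iterator[str]:
    """Lazy loading: yield one sample at a time.

    Only the current line is held in memory; there is no random access.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")


def count_tokens(samples: Iterable[str]) -> int:
    """Consumer that only iterates, so it accepts a list or a lazy iterator."""
    return sum(len(sample.split()) for sample in samples)
```

Code that only iterates, like `count_tokens`, works unchanged with either loader; code that relies on `len()` or indexing needs refactoring before it can consume the lazy variant.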

An example iterable dataloader is defined in https://github.com/fairinternal/seamless_communication/pull/61 - changes were tested using this PR.

annasun28 · 2023-11-16T23:52:04Z

simuleval/agents/pipeline.py

-            segment, states[-1], upstream_states=upstream_states + states_list[:index]
+            segment,
+            states[-1],
+            upstream_states=upstream_states + states_list[: len(self.module_list[:-1])],


just for my understanding, when is len(self.module_list[:-1]) not equal to index?

ibanesh · 2023-11-17T08:06:51Z

When len(self.module_list) is 1: len(self.module_list[:-1]) is 0, and index is never assigned a value in that case, which results in an exception.
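
The failure mode can be reproduced in isolation. The loop in pipeline.py is not shown in this thread, so the loop below is a hypothetical stand-in; the sketch only illustrates why a loop variable is unsafe to reuse after a loop that may run zero times, and makes no claim about the exact slice values in the real code. The function names are made up for the example.

```python
from typing import List


def last_loop_index(modules: List[str]) -> int:
    """Reuse the loop variable after the loop (hypothetical old pattern)."""
    for index, _module in enumerate(modules[:-1]):
        pass  # per-module work would happen here
    # With a single module, modules[:-1] is empty, the loop body never runs,
    # and `index` is unbound: this line raises UnboundLocalError.
    return index


def upstream_count(modules: List[str]) -> int:
    """Derive the bound from the list itself, so it is always defined."""
    return len(modules[:-1])
```

With `["only"]` as input, `last_loop_index` raises `UnboundLocalError`, while `upstream_count` returns 0, so slicing with it yields an empty list instead of an exception.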

simuleval/evaluator/evaluator.py

annasun28 · 2023-11-16T23:56:18Z

simuleval/options.py

+        "--output",
+        type=str,
+        default=None,
+        help="Output directory. Required if using iterable dataloader.",


Also just curious: why required if using iterable dataloader but not otherwise?

ibanesh · 2023-11-17T08:18:21Z

For the typical dataloader (other than the "iterable" ones), we create a list of instances and iterate through this list in the evaluator's __call__ method. The results are stored as a field in each instance object in this list, and at the end of the iteration we use this list from memory to calculate the metrics/scores. So it doesn't matter whether an output arg is specified: without one, the instances don't get written to a file, but we can still calculate the scores/metrics from the results in memory.

But for iterable dataloaders, where we lazy-load the data one at a time, only the current instance is stored in memory, and at each iteration we write this instance's data to a file if output is given. After iterating through all the data, we load the results into memory from the file we wrote to while iterating, and then use these in-memory instances to calculate the cumulative scores/metrics. So without the output arg, we won't write to a file and subsequently won't be able to calculate the scores/metrics.
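
The write-then-reload flow described above can be sketched as follows. This is an illustration under stated assumptions, not SimulEval's implementation: the JSON-lines log format, the upper-casing stand-in for inference, the exact-match metric, and all function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, List


def run_and_log(instances: Iterable[Dict], log_path: Path) -> None:
    """Consume instances lazily and append each finished one to a log file.

    Only the current instance is held in memory during the loop.
    """
    with log_path.open("w", encoding="utf-8") as f:
        for instance in instances:
            # Placeholder for real inference (assumption): upper-case the source.
            finished = {**instance, "prediction": instance["source"].upper()}
            f.write(json.dumps(finished) + "\n")


def load_results(log_path: Path) -> List[Dict]:
    """Reload every logged instance once iteration is complete."""
    with log_path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def exact_match(results: List[Dict]) -> float:
    """Toy cumulative metric: fraction of predictions equal to the reference."""
    if not results:
        return 0.0
    hits = sum(r["prediction"] == r["reference"] for r in results)
    return hits / len(results)
```

Without a log path there is nothing for `load_results` to read after the loop, which mirrors why the output argument is required in the iterable case.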

ibanesh · 2023-11-16T23:30:32Z

simuleval/evaluator/evaluator.py

-        system.reset()
-        for instance in self.instance_iterator:
-            while not self.is_finished(instance):
-                input_segment = instance.send_source(self.source_segment_size)
-                output_segment = system.pushpop(input_segment)
-                instance.receive_prediction(output_segment)
-                if instance.finish_prediction:
-                    # if instance.finish_prediction where set by the reader,
-                    # source_finished_reading will be set as well. If it is
-                    # set by any of the intermediate components, then we didn't
-                    # end yet. We are going to clear the state and continue
-                    # processing the rest of the input.
-                    system.reset()


This part was added/modified in #48 and #49 to support inference on YouTube streams, but it has sometimes caused issues with wait-k inference.

Keeping it as is for now; will change it back to the old logic if we run into issues when testing parity for wait-k.

ibanesh requested a review from kauterry November 16, 2023 11:21

facebook-github-bot added the CLA Signed label Nov 16, 2023

ibanesh force-pushed the iterable-dataloader branch from 6e78bda to c95c761 Compare November 16, 2023 18:18

ibanesh requested review from xutaima and annasun28 November 16, 2023 18:20

ibanesh force-pushed the iterable-dataloader branch 2 times, most recently from d81622b to 1c6df08 Compare November 16, 2023 19:39

annasun28 approved these changes Nov 16, 2023

ibanesh commented Nov 17, 2023

ibanesh added 2 commits November 17, 2023 12:57

Iterable Dataloader

cd8680a

keep the evaluvator loop logic as is

0c3f032

ibanesh force-pushed the iterable-dataloader branch from 1c6df08 to 0c3f032 Compare November 17, 2023 21:22

ibanesh merged commit 2e3bd7e into main Nov 18, 2023
3 checks passed

ibanesh deleted the iterable-dataloader branch November 18, 2023 00:51
