
write to parquet #679

Merged: 11 commits into google-research:main on Aug 7, 2024

Conversation

mschulist (Contributor):

Right now, the file sizes are giant when writing the output from the classifier to a csv. However, Parquet reduces this file size significantly (in our case, we saw files go from ~80GB to ~3GB due to the number of duplicate filenames).

By writing to a partitioned Parquet file, we can get small files and reduce the amount of memory needed.
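For concreteness, a minimal sketch of the write path being proposed, assuming a pandas DataFrame of inference rows with the columns used later in this PR ('filename', 'timestamp_s', 'label', 'logit'); the directory layout, shard naming, and example values are illustrative rather than the PR's exact code:

import os
import pandas as pd

output_dir = 'inference_output.parquet'
os.makedirs(output_dir, exist_ok=True)

rows = [
    {'filename': 'site1/rec001.wav', 'timestamp_s': 0.0, 'label': 'amecro', 'logit': 1.23},
    {'filename': 'site1/rec001.wav', 'timestamp_s': 5.0, 'label': 'amecro', 'logit': -0.41},
]
# Each buffered chunk becomes its own shard file under output_dir, so no
# single write needs the full result set in memory. Repeated string columns
# such as 'filename' dictionary-encode well in Parquet, which is where most
# of the size reduction comes from.
pd.DataFrame(rows).to_parquet(os.path.join(output_dir, 'part-00000.parquet'))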

sdenton4 (Collaborator) commented Aug 4, 2024:

Hey, Mark! Thanks for this; I actually didn't know that Pandas has parquet support.

There's a lot of repeated logic in the parquet and csv methods; could you merge them into a single method with an argument to choose the output format?

mschulist (Contributor, PR author):

Yeah, that sounds good! I'm not sure if concatenating dataframes is the most memory efficient way of buffering dataframes (it might have to copy a lot of data?), so I might look into other ways of buffering the output.
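One alternative, sketched here under the assumption of a hypothetical generate_inference_rows() generator and an illustrative shard size: buffer plain rows and build one DataFrame per shard, since each pd.concat copies its inputs.

import os
import pandas as pd

SHARD_SIZE = 1_000_000  # illustrative buffer size
output_dir = 'inference_output.parquet'
os.makedirs(output_dir, exist_ok=True)

def write_shard(rows, shard_num):
  pd.DataFrame(rows).to_parquet(
      os.path.join(output_dir, f'part-{shard_num:05d}.parquet'))

buffer, shard_num = [], 0
for row in generate_inference_rows():  # hypothetical generator of dict rows
  buffer.append(row)
  if len(buffer) >= SHARD_SIZE:
    write_shard(buffer, shard_num)
    buffer, shard_num = [], shard_num + 1
if buffer:  # flush the remainder rows
  write_shard(buffer, shard_num)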

mschulist (Contributor, PR author):

I'd like to test it before merging, but I couldn't find an easy way to test it in the existing test file. Is there a way to make a unit test for it, or do we have to just run it in a notebook on an existing dataset?

sdenton4 (Collaborator) commented Aug 4, 2024:

Certainly wouldn't say no to a test, but you're right that there's not an existing test for this function. The place to put a new test is here:
https://github.com/google-research/perch/blob/main/chirp/inference/tests/classify_test.py

There's an example of creating a test dataset here which might be helpful:
https://source.corp.google.com/piper///depot/google3/third_party/py/chirp/inference/tests/bootstrap_test.py;l=47

mschulist (Contributor, PR author):

I was able to make a test (just creating random embeddings) and check that the csv and parquet files are "equal" (or close enough, given floating-point rounding). The files are significantly smaller with parquet, even in the test with only 4 classes, which is good to see.
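Roughly the shape of that comparison, sketched with pytest's tmp_path fixture and made-up data for brevity (the repo's actual test harness and helpers may differ); since the CSV path round-trips floats through text, the check uses a tolerance:

import numpy as np
import pandas as pd

def test_csv_and_parquet_outputs_agree(tmp_path):
  rng = np.random.default_rng(42)
  df = pd.DataFrame({
      'filename': ['rec.wav'] * 8,
      'timestamp_s': np.arange(8, dtype=float),
      'label': ['class_a', 'class_b', 'class_c', 'class_d'] * 2,
      'logit': rng.normal(size=8),
  })
  df.to_csv(tmp_path / 'out.csv', index=False)
  df.to_parquet(tmp_path / 'out.parquet')
  from_csv = pd.read_csv(tmp_path / 'out.csv')
  from_parquet = pd.read_parquet(tmp_path / 'out.parquet')
  # Compare logits with a tolerance rather than exact equality.
  np.testing.assert_allclose(from_csv['logit'], from_parquet['logit'], rtol=1e-6)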

"""Write a CSV file of inference results."""
"""Write inference results."""

if format != 'parquet' and format != 'csv':
sdenton4 (Collaborator):

A bit cleaner:

if format == 'parquet':
  ...
elif format == 'csv':
  ...
else:
  raise ValueError(...)

if format == 'parquet':
  if output_filepath.endswith('.csv'):
    output_filepath = output_filepath[:-4]
  if not output_filepath.endswith('.parquet'):
sdenton4 (Collaborator):

This second-guessing of the user's intention from the extension and format args is a bit cumbersome.

Maybe we should get the extension from the output file and use that instead of an arg? (and complain if it's not one of our accepted types.)

Then we would have:

if output_filepath.endswith('.parquet'):
  format = 'parquet'
elif output_filepath.endswith('.csv'):
  format = 'csv'
else:
  raise ValueError(...)

which saves an argument and ~12 lines of code.

offset = ex['timestamp_s'] + t * embedding_hop_size_s
logit = '{:.2f}'.format(ex['logits'][t, i])
if format == 'csv':
  f = open(output_filepath, 'w')
sdenton4 (Collaborator):

It's good to use the with open(...) as f pattern because it ensures that the file will be properly flushed and closed if an exception arises, or if we return early for some reason.
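Illustratively, with the identifiers from the snippet above (the row-writing details are placeholders):

with open(output_filepath, 'w') as f:
  f.write(','.join(headers) + '\n')
  for row in rows:  # placeholder for the buffered inference rows
    f.write(','.join(str(v) for v in row) + '\n')
# On leaving the block, the file is flushed and closed even if an exception
# was raised or the function returned early.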

if threshold is None or ex['logits'][t, i] > threshold[label]:
  offset = ex['timestamp_s'] + t * embedding_hop_size_s
  logit = ex['logits'][t, i]
if format == 'parquet':
sdenton4 (Collaborator):

Maybe simpler:

Write a helper function flush_rows(output_path, shard_num, rows, format, headers) which writes everything in rows to a file. Then all of the writing logic is centralized; you can call the function here and below when you deal with the remainder rows.

This also helps with the csv file handling; you just open the file and write to it when you're flushing the data to disk.
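A possible shape for that helper, under assumptions about its inputs (each row is a dict keyed by the headers, and parquet shards land in a directory named by output_path); the real signature and behaviour are whatever the PR settles on:

import os
import pandas as pd

def flush_rows(output_path, shard_num, rows, format, headers):
  """Writes the buffered rows to disk so the caller can reset its buffer."""
  if format == 'parquet':
    os.makedirs(output_path, exist_ok=True)
    pd.DataFrame(rows, columns=headers).to_parquet(
        os.path.join(output_path, f'part-{shard_num:05d}.parquet'))
  elif format == 'csv':
    # Append rows; the caller writes the header line once when creating the file.
    with open(output_path, 'a') as f:
      for row in rows:
        f.write(','.join(str(row[h]) for h in headers) + '\n')
  else:
    raise ValueError(f'Unsupported format: {format}')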

mschulist (Contributor, PR author):

Thanks! That makes it SO much cleaner.

sdenton4 (Collaborator) left a comment:

Think we're just about there, thanks for sticking with it!

else:
  nondetection_count += 1
headers = ['filename', 'timestamp_s', 'label', 'logit']
# Write column headers if CSV format
sdenton4 (Collaborator):

Seems like this comment can be deleted now

@@ -180,45 +183,88 @@ def classify_batch(batch):
  )
  return inference_ds

def flush_rows(
sdenton4 (Collaborator):

Let's use a slightly more descriptive name, like flush_inference_rows.

sdenton4 merged commit 0402b78 into google-research:main on Aug 7, 2024
5 checks passed