Replies: 6 comments 2 replies
-
We have considered implementing it before. I'm not sure, though, whether native Parquet support would be much faster than piping the TSV output to a script that converts it to Parquet. Have you tried this?
-
Thanks for the reply. I've run tests on Parquet conversion, using an 8 GB TSV output from Diamond as a test case. First, read speed using pyarrow (I didn't bother testing the standard Python reader, as it is slow).
Above is our baseline. Let's convert it to Parquet.
Conversion took almost 100 s. Oof.
Read time, however, is more than twice as fast. Unfortunately, the conversion is a huge time sink for a file we are only going to read once. Our conclusion was that unless Diamond natively outputs Parquet, the format isn't worth using. That is why I asked whether you had plans to support output in the format in the future. Do you believe outputting in this format would not be much faster than a TSV-to-Parquet conversion (including the original write time for the TSV itself)?
-
Thanks, it looks like it makes sense to either include native support in Diamond or develop a faster way to convert TSV to Parquet. I will give this some thought.
-
Thanks! I would also strongly consider the JSON format (perhaps even more so than Parquet!). There is no option for this, correct? The write penalty for outputting JSON should be lower than for Parquet, and tools for parsing it are very fast compared to TSV/CSV; see simdjson.
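To illustrate why JSON is attractive on the consumer side: if hits arrive as newline-delimited JSON (an assumption here, since no layout is specified), the parsing loop is trivial, and the standard-library parser can later be swapped for a SIMD-accelerated one without restructuring anything.

```python
import json

def parse_hits(stream):
    # Assumption: one JSON object per line (newline-delimited JSON).
    # Swapping in a SIMD-accelerated parser (e.g. pysimdjson or orjson)
    # only changes which module provides loads(); the loop is unchanged.
    return [json.loads(line) for line in stream if line.strip()]
```

The field names used by any real output would of course be whatever DIAMOND documents; nothing here depends on them.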
-
The JSON output format has been included in the latest release.
-
We now provide instructions on how to create Parquet output with the help of DuckDB: https://github.com/bbuchfink/diamond/wiki/File-formats
-
Hello!
I was wondering whether it would be possible to have Diamond output in the open-source Parquet format? It is a more modern output format than TSV.
Reading in and processing the TSV is a significant bottleneck when dealing with hundreds of millions of hits.
Using the Parquet format would have several benefits. It can be read far faster than a TSV/CSV, and it allows more efficient compression, including compressing individual columns.
https://posit.co/blog/speed-up-data-analytics-with-parquet-files/
https://en.wikipedia.org/wiki/Apache_Parquet