Replies: 6 comments 2 replies
-
We have considered implementing it before. I'm not sure, though, whether native Parquet support would be much faster than piping the TSV output to a script that converts it to Parquet. Have you tried this?
-
Thanks for the reply. I've run tests on Parquet conversion, using an 8 GB TSV output from Diamond as a test case. First, read speed using pyarrow (I didn't bother testing the standard Python reader, as it is slow).
Above is our baseline. Let's convert it to Parquet.
Conversion took almost 100 s. Oof.
Read time, however, is more than twice as fast. Unfortunately, the conversion is a huge time sink for a file we are only going to read once. Our conclusion was that unless Diamond natively outputs Parquet, the format isn't worth using. That is why I asked whether you had plans to support output in the format in the future. Do you believe outputting in this format would not be much faster than a TSV-to-Parquet conversion (including the original write time for the TSV itself)?
-
Thanks, it looks like it makes sense to either include native support in Diamond or develop a faster way to convert TSV to Parquet. I will give this some thought.
-
Thanks! I would also strongly consider the JSON format (perhaps even more so than Parquet!). There is no option for this, correct? The write penalty for outputting JSON should be lower than for Parquet, and tools for parsing it are very fast compared to TSV/CSV; see simdjson.
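To illustrate why JSON is attractive on the consumer side: if hits arrive as newline-delimited JSON (an assumption here, since no layout is specified), the parsing loop is trivial, and the standard-library parser can later be swapped for a SIMD-accelerated one without restructuring anything.

```python
import json

def parse_hits(stream):
    # Assumption: one JSON object per line (newline-delimited JSON).
    # Swapping in a SIMD-accelerated parser (e.g. pysimdjson or orjson)
    # only changes which module provides loads(); the loop is unchanged.
    return [json.loads(line) for line in stream if line.strip()]
```

The field names used by any real output would of course be whatever DIAMOND documents; nothing here depends on them.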
-
The JSON output format has been included in the latest release.
-
We now provide instructions on how to create Parquet output with the help of DuckDB: https://github.com/bbuchfink/diamond/wiki/File-formats
-
Hello!
I was wondering whether it would be possible to have Diamond output in the open-source Parquet format? It is a more modern output format than TSV.
Reading in and processing the TSV is a significant bottleneck when dealing with hundreds of millions of hits.
Using the Parquet format would have several benefits. It can be read far faster than a TSV/CSV, and it allows more efficient compression, including compressing individual columns.
https://posit.co/blog/speed-up-data-analytics-with-parquet-files/
https://en.wikipedia.org/wiki/Apache_Parquet