Read test cases checklist #7

Open
5 of 8 tasks
wjones127 opened this issue Oct 4, 2022 · 7 comments


@wjones127
Collaborator

wjones127 commented Oct 4, 2022

What is our philosophy of test cases? Do we care about each individual feature? Or are we collecting a set of cases that have maximal coverage of important common and corner cases? I'm assuming the latter for this draft list.

Reader protocol v1:

  • A Delta Lake table with all data types
  • A table with a checkpoint (with and without early transactions present / no replay)
  • A table which has had a schema change
  • A table with stats as struct
  • A table with multiple levels of partitioning, including null values at both levels (should cover all serialization cases here: https://github.com/delta-io/delta/blob/master/PROTOCOL.md#partition-value-serialization)
  • A table with a multi-part checkpoint

Reader protocol v2:

  • Partitioned table with id-based column mapping
  • Partitioned table with name-based column mapping
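A reader that passes the partitioned-table case has to turn the serialized partition strings back into typed values. Here is a minimal sketch of that step, assuming the rules in the linked PROTOCOL.md section (values stored as strings, with an empty string or JSON null meaning a null partition value). The helper name `parse_partition_value` is hypothetical, and decimal and binary are left out for brevity.

```python
from datetime import date, datetime
from typing import Any, Optional


def parse_partition_value(raw: Optional[str], data_type: str) -> Any:
    """Convert a serialized partition value to a Python value.

    Hypothetical helper; `data_type` is a Delta primitive type name.
    """
    # A null partition value shows up as JSON null or as an empty string.
    if raw is None or raw == "":
        return None
    if data_type in ("byte", "short", "integer", "long"):
        return int(raw)
    if data_type in ("float", "double"):
        return float(raw)
    if data_type == "boolean":
        return raw == "true"
    if data_type == "date":
        return date.fromisoformat(raw)
    if data_type == "timestamp":
        # Fractional seconds are optional in the serialized form.
        fmt = "%Y-%m-%d %H:%M:%S.%f" if "." in raw else "%Y-%m-%d %H:%M:%S"
        return datetime.strptime(raw, fmt)
    if data_type == "string":
        return raw
    raise ValueError(f"unsupported partition type: {data_type}")
```

A reference table with nulls at both partition levels would exercise the first branch for every partition column type.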
@MrPowers
Collaborator

MrPowers commented Oct 4, 2022

@wjones127 - Your list looks great.

I'd add a Delta Lake table that's constructed with different save modes to the list.

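# Version 0: create the table with ids 0, 1, 2
df = spark.range(0, 3)
df.write.format("delta").save("/tmp/delta-table")
# Version 1: overwrite, so only ids 4, 5 remain in the current snapshot
df2 = spark.range(4, 6)
df2.write.mode("overwrite").format("delta").save("/tmp/delta-table")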

This test will make sure that the Delta Lake reader isn't just reading all the Parquet files.
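To make the failure mode concrete, here is a rough sketch of the log replay a reader needs: apply `add` and `remove` actions in commit order and keep only the files that are still live. It uses only the standard library. The function name `active_files` and the file names are hypothetical, and the commit contents are hand-written stand-ins for what the two writes would produce.

```python
import json
from typing import Iterable, Set


def active_files(commits: Iterable[str]) -> Set[str]:
    """Replay newline-delimited JSON commits in order; return live file paths."""
    live: Set[str] = set()
    for commit in commits:
        for line in commit.splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live


# Hypothetical file names standing in for the two writes.
commit_0 = '{"add": {"path": "part-0000-a.parquet"}}'
commit_1 = "\n".join([
    '{"remove": {"path": "part-0000-a.parquet"}}',
    '{"add": {"path": "part-0000-b.parquet"}}',
])

print(active_files([commit_0, commit_1]))  # {'part-0000-b.parquet'}
```

A reader that globs every Parquet file in the directory would also return the rows from `part-0000-a.parquet`, which is the bug this table is meant to catch.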

@wjones127
Collaborator Author

Some notes for implementing each of these

A Delta Lake table with all data types

Example from delta-rs tests: https://github.com/delta-io/delta-rs/blob/fae50cca528446e27c5401818a4f31b5a97e8ad2/python/tests/conftest.py#L30-L53

A table with a checkpoint

Set delta.checkpointInterval to 2 and we can get one with three commits.

A table which has had a schema change

Overwrite the table with .option("overwriteSchema", "true").

A table with stats as struct

Turn delta.checkpoint.writeStatsAsJson off, delta.checkpoint.writeStatsAsStruct on.

A table with id-based column mapping

Set delta.columnMapping.mode to id.
Maybe alter a column in subsequent version? https://docs.databricks.com/delta/delta-column-mapping.html

A table with name-based column mapping

Set delta.columnMapping.mode to name.
Maybe alter a column in subsequent version? https://docs.databricks.com/delta/delta-column-mapping.html

A table with multi-part checkpoint

Use the checkpoint.partSize setting (or is it delta.checkpoint.partSize?) to force a multi-part checkpoint.

https://github.com/delta-io/delta/pull/946/files
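On the reader side, the multi-part case mostly comes down to recognizing the file names and checking that every part is present. Below is a minimal sketch, assuming the naming scheme from PROTOCOL.md (`<20-digit version>.checkpoint.<10-digit part>.<10-digit total>.parquet`). The function name `latest_complete_multipart` is hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set, Tuple

# <20-digit version>.checkpoint.<10-digit part>.<10-digit total>.parquet
MULTI_PART = re.compile(r"^(\d{20})\.checkpoint\.(\d{10})\.(\d{10})\.parquet$")


def latest_complete_multipart(names: Iterable[str]) -> Optional[int]:
    """Return the highest version whose multi-part checkpoint has every part."""
    parts: Dict[Tuple[int, int], Set[int]] = defaultdict(set)
    for name in names:
        m = MULTI_PART.match(name)
        if m:
            version, part, total = (int(g) for g in m.groups())
            parts[(version, total)].add(part)
    # A checkpoint is usable only if parts 1..total are all present.
    complete = [
        version
        for (version, total), seen in parts.items()
        if seen == set(range(1, total + 1))
    ]
    return max(complete, default=None)
```

A reference table whose log directory also contains an incomplete multi-part checkpoint would check that readers fall back to an earlier complete one.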

@MrPowers
Collaborator

MrPowers commented Nov 4, 2022

@wjones127 - are you cool with separate reference tables for "A Delta Lake table with all data types"? I think this will make it more obvious what types aren't supported for each connector. Suppose a connector doesn't support 5 data types. One failing test might not fully explain the gap like 5 failing tests would. Thoughts?

@wjones127
Collaborator Author

IMO that doesn't seem fully necessary. But perhaps we can separate the primitive types from the nested types (struct, list, map).

@MrPowers
Collaborator

MrPowers commented Nov 4, 2022

@wjones127 - separating the primitive types from complex types seems like a nice balance 👍

@tdas

tdas commented Nov 4, 2022

These table ideas look very good to me.
Let me add a few more ideas, some of which may already be covered:

  • Table with a file removed (e.g. compaction)
  • Table with actions that have extra random fields in them (AddFiles, RemoveFiles, etc.). JSON parsing should ignore them. This table has to be hand constructed. It is important because we have seen multiple issues regarding this.
  • Table with all the different actions (settxn)
  • Tables with and without stats

I will think of more and keep adding to this thread. :D
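For the extra-fields case, the behavior being tested is roughly the following: a parser that keeps the fields it knows and silently drops the rest. This is only a sketch. The `AddFile` class here is a simplified, hypothetical stand-in with a subset of the real `add` fields, and `madeUpField` is an invented name for the hand-constructed test data.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class AddFile:
    # Simplified subset of the fields on an `add` action.
    path: str
    size: int
    dataChange: bool = True


def parse_add(line: str) -> Optional[AddFile]:
    """Parse one log line; return None if it is not an `add` action."""
    action = json.loads(line)
    body = action.get("add")
    if body is None:
        return None
    known = {f.name for f in fields(AddFile)}
    # Unknown keys are dropped instead of raising an error.
    return AddFile(**{k: v for k, v in body.items() if k in known})


line = '{"add": {"path": "a.parquet", "size": 10, "dataChange": true, "madeUpField": 123}}'
print(parse_add(line))
```

A strict parser that rejects unknown keys would fail on this line, which is what the hand-constructed table is meant to surface.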

@wjones127
Collaborator Author

wjones127 commented Jan 17, 2023

  • Table that has extra actions
  • A non-Hive-partitioned table
