feat(pytest): add read_table() and log_files() to DeltaTestLocation (#6135)#6219
feat(pytest): add read_table() and log_files() to DeltaTestLocation (#6135)#6219kimjune01 wants to merge 1 commit into
Conversation
DeltaTestLocation.row_count() was sufficient for basic validation, but tests that need to inspect table contents or verify file-level behavior require richer access. Add two methods: * read_table(missing_ok=False) → pyarrow.Table | None Returns the current snapshot as a pyarrow Table. Useful for row-level assertions. * log_files(missing_ok=False) → list[str] Returns the list of parquet file URIs in the current snapshot. Useful for verifying compaction, file count, and storage layout. Both follow the existing pattern: deferred deltalake import to avoid aarch64 import crashes, and missing_ok flag to probe caches. Fixes feldera#6135 Signed-off-by: June Kim <june@ea.com> Signed-off-by: June Kim <kimjune01@gmail.com>
|
Thank you for your contribution! |
mythical-fred
left a comment
There was a problem hiding this comment.
LGTM. Clean test helpers, six unit tests cover both missing_ok=True/False paths and the happy-path read/list, both methods follow the existing row_count pattern of deferring the deltalake import to the call site (so aarch64 collection stays alive), and tests are gated with @skip_on_arm64. No findings.
| raise | ||
| return dt.count() | ||
|
|
||
| def read_table(self, missing_ok: bool = False): |
There was a problem hiding this comment.
this should have return type annotation
| if missing_ok: | ||
| return None | ||
| raise | ||
| return dt.to_pyarrow_table() |
There was a problem hiding this comment.
calling .to_pyarrow_table() would load whole table in memory, not good for larger dataset. We should choose a streaming approach here
| raise | ||
| return dt.to_pyarrow_table() | ||
|
|
||
| def log_files(self, missing_ok: bool = False) -> list[str]: |
There was a problem hiding this comment.
log_files should ideally return delta log files ( _delta_log/*.json ) not file_uris()
which one we want? if logs, then update method, else let's call this file_uris / data_files ;
|
Thanks for flagging @swanandx. Both PRs touch the same Full reasoning: https://github.com/kimjune01/sweep/blob/master/repo-hypotheses/feldera-feldera.md @gz, happy to rebase after #6042 lands, or close this and reapply as a follow-up if you'd prefer one integrated PR. Which works? |
|
Cannot validate on current setup, so drafting. Please close or take it over. |
Summary
read_table()method toDeltaTestLocationthat returns the Delta table as a PyArrow Table, withmissing_okparameter for graceful handling of non-existent tableslog_files()method that returns the list of data file URIs in the current Delta snapshot, also withmissing_oksupportrow_countpattern) to avoid crashes on aarch64 hostsFixes #6135
Test plan
missing_ok=True/False, reading existing tables, listing files@skip_on_arm64matching repo conventions