Skip to content

feat(pytest): add read_table() and log_files() to DeltaTestLocation (#6135)#6219

Draft
kimjune01 wants to merge 1 commit into
feldera:mainfrom
kimjune01:fix-delta-test-location-methods
Draft

feat(pytest): add read_table() and log_files() to DeltaTestLocation (#6135)#6219
kimjune01 wants to merge 1 commit into
feldera:mainfrom
kimjune01:fix-delta-test-location-methods

Conversation

@kimjune01
Copy link
Copy Markdown

Summary

  • Add read_table() method to DeltaTestLocation that returns the Delta table as a PyArrow Table, with missing_ok parameter for graceful handling of non-existent tables
  • Add log_files() method that returns the list of data file URIs in the current Delta snapshot, also with missing_ok support
  • Both methods use deferred imports (matching the existing row_count pattern) to avoid crashes on aarch64 hosts

Fixes #6135

Test plan

  • 6 new unit tests covering: missing table with missing_ok=True/False, reading existing tables, listing files
  • All 37 total unit tests pass with no regressions
  • Tests gated with @skip_on_arm64 matching repo conventions

DeltaTestLocation.row_count() was sufficient for basic validation,
but tests that need to inspect table contents or verify file-level
behavior require richer access.

Add two methods:

* read_table(missing_ok=False) → pyarrow.Table | None
  Returns the current snapshot as a pyarrow Table. Useful for
  row-level assertions.

* log_files(missing_ok=False) → list[str]
  Returns the list of parquet file URIs in the current snapshot.
  Useful for verifying compaction, file count, and storage layout.

Both follow the existing pattern: deferred deltalake import to avoid
aarch64 import crashes, and missing_ok flag to probe caches.

Fixes feldera#6135

Signed-off-by: June Kim <june@ea.com>
Signed-off-by: June Kim <kimjune01@gmail.com>
@gz gz requested a review from swanandx May 12, 2026 17:40
@gz
Copy link
Copy Markdown
Contributor

gz commented May 12, 2026

Thank you for your contribution!

Copy link
Copy Markdown

@mythical-fred mythical-fred left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Clean test helpers, six unit tests cover both missing_ok=True/False paths and the happy-path read/list, both methods follow the existing row_count pattern of deferring the deltalake import to the call site (so aarch64 collection stays alive), and tests are gated with @skip_on_arm64. No findings.

Copy link
Copy Markdown
Member

@swanandx swanandx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution! 🚀

PR #6042 also adds similar functionalities, @gz which one would be preferred / can you comment regarding the use case so we can update the methods in this PR to align with what we need?

Comment thread python/tests/utils.py
raise
return dt.count()

def read_table(self, missing_ok: bool = False):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should have return type annotation

Comment thread python/tests/utils.py
if missing_ok:
return None
raise
return dt.to_pyarrow_table()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

calling .to_pyarrow_table() would load whole table in memory, not good for larger dataset. We should choose a streaming approach here

Comment thread python/tests/utils.py
raise
return dt.to_pyarrow_table()

def log_files(self, missing_ok: bool = False) -> list[str]:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log_files should ideally return delta log files ( _delta_log/*.json ) not file_uris()

which one we want? if logs, then update method, else let's call this file_uris / data_files ;

@kimjune01
Copy link
Copy Markdown
Author

kimjune01 commented May 14, 2026

Thanks for flagging @swanandx. Both PRs touch the same DeltaTestLocation class in python/tests/utils.py, but with complementary helpers: this PR adds read_table / log_files (148 lines), #6042 adds read_rows / log_json_paths (2953 lines). They'll merge-conflict in that file regardless of order, so deferring to land #6042 first is the cleaner path.

Full reasoning: https://github.com/kimjune01/sweep/blob/master/repo-hypotheses/feldera-feldera.md

@gz, happy to rebase after #6042 lands, or close this and reapply as a follow-up if you'd prefer one integrated PR. Which works?

Copy link
Copy Markdown

@mythical-fred mythical-fred left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kimjune01 kimjune01 marked this pull request as draft May 19, 2026 06:11
@kimjune01
Copy link
Copy Markdown
Author

Cannot validate on current setup, so drafting. Please close or take it over.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[pytest] add required methods to DeltaTestLocation

4 participants