feat(session): configure DataFusion's built-in CacheManager from Java #78

Open. LantaoJin wants to merge 2 commits into apache:main from LantaoJin:feat/cache-manager

LantaoJin (Contributor) commented May 21, 2026

Which issue does this PR close?

Rationale for this change

DataFusion's RuntimeEnv accepts a CacheManagerConfig with three independent caches: the file-embedded metadata cache (Parquet footers and page metadata), the list-files cache (object-store LIST results), and the file-statistics cache (per-file row counts and column statistics used by the planner). The Rust API is RuntimeEnvBuilder::with_cache_manager(CacheManagerConfig). The Java binding exposes none of this, so every SessionContext currently runs with the no-op upstream defaults. As a result, a Parquet workload that reads the same footer thousands of times across queries goes back to the object store every time, and statistics-driven planning cannot reuse file statistics across queries.

This PR adds a typed cacheManager(CacheManagerOptions) setter on SessionContextBuilder that exposes the three caches independently:

SessionContext ctx = SessionContext.builder()
    .cacheManager(CacheManagerOptions.builder()
        .fileMetadataCache(64L << 20)                       // 64 MiB cap
        .listFilesCache(8L << 20, Duration.ofMinutes(5))    // 8 MiB cap, 5min TTL
        .fileStatisticsCache(true)
        .build())
    .build();

Semantics: three independent toggles, all matching upstream DataFusion:

| Field | Java unset → Rust behaviour |
|---|---|
| fileMetadataCache(maxBytes) | leave metadata_cache_limit at upstream default |
| listFilesCache(maxBytes, ttl) | leave list_files_cache = None (disabled) |
| fileStatisticsCache(enabled) | leave table_files_statistics_cache = None |
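
The three toggles above can be modelled on the Java side as optional fields that stay empty until a setter is called. The sketch below is only an illustration of that shape: `CacheOptionsSketch`, its accessors, and the rejection of negative sizes are assumptions made for this example, not the `CacheManagerOptions` class added by the PR.

```java
import java.time.Duration;
import java.util.Objects;
import java.util.Optional;
import java.util.OptionalLong;

/** Hypothetical stand-in for the PR's CacheManagerOptions; not the actual class. */
final class CacheOptionsSketch {
    private final OptionalLong fileMetadataCacheBytes;
    private final OptionalLong listFilesCacheBytes;
    private final Optional<Duration> listFilesCacheTtl;
    private final Optional<Boolean> fileStatisticsCache;

    private CacheOptionsSketch(Builder b) {
        this.fileMetadataCacheBytes = b.fileMetadataCacheBytes;
        this.listFilesCacheBytes = b.listFilesCacheBytes;
        this.listFilesCacheTtl = b.listFilesCacheTtl;
        this.fileStatisticsCache = b.fileStatisticsCache;
    }

    static Builder builder() {
        return new Builder();
    }

    // Each accessor is empty when the matching setter was never called.
    OptionalLong fileMetadataCacheBytes() { return fileMetadataCacheBytes; }
    OptionalLong listFilesCacheBytes() { return listFilesCacheBytes; }
    Optional<Duration> listFilesCacheTtl() { return listFilesCacheTtl; }
    Optional<Boolean> fileStatisticsCache() { return fileStatisticsCache; }

    static final class Builder {
        private OptionalLong fileMetadataCacheBytes = OptionalLong.empty();
        private OptionalLong listFilesCacheBytes = OptionalLong.empty();
        private Optional<Duration> listFilesCacheTtl = Optional.empty();
        private Optional<Boolean> fileStatisticsCache = Optional.empty();

        Builder fileMetadataCache(long maxBytes) {
            this.fileMetadataCacheBytes = OptionalLong.of(requireNonNegative(maxBytes));
            return this;
        }

        Builder listFilesCache(long maxBytes, Duration ttl) {
            this.listFilesCacheBytes = OptionalLong.of(requireNonNegative(maxBytes));
            this.listFilesCacheTtl = Optional.of(Objects.requireNonNull(ttl, "ttl"));
            return this;
        }

        Builder fileStatisticsCache(boolean enabled) {
            this.fileStatisticsCache = Optional.of(enabled);
            return this;
        }

        CacheOptionsSketch build() {
            return new CacheOptionsSketch(this);
        }

        // Assumption for this sketch: negative byte limits are rejected eagerly.
        private static long requireNonNegative(long maxBytes) {
            if (maxBytes < 0) {
                throw new IllegalArgumentException("maxBytes must be >= 0, got " + maxBytes);
            }
            return maxBytes;
        }
    }
}
```

Keeping "unset" distinct from "set to a value" is what lets each row of the table fall back to the upstream behaviour independently of the other two.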

What changes are included in this PR?

  • Proto: proto/cache_manager_options.proto
  • Java API: org.apache.datafusion.CacheManagerOptions
  • Native: native/src/cache_manager.rs
  • Build wiring: proto/cache_manager_options.proto

Are these changes tested?

Yes, 18 new tests across CacheManagerOptionsTest and SessionContextCacheManagerTest.

Are there any user-facing changes?

Yes, but additive only — no breaking changes:

  • New public class org.apache.datafusion.CacheManagerOptions with a static builder() and three setters.
  • New SessionContextBuilder.cacheManager(CacheManagerOptions) setter.

No behavior change for callers that do not invoke the new setter — the cache_manager field is absent on the wire and the native side leaves upstream's RuntimeEnvBuilder defaults in place.
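
The "absent on the wire" behaviour can be pictured with a small model: the sender writes an entry only for options the caller set, and the receiver starts from its default and overrides it only when an entry is present. This is a simplified sketch, not the PR's protobuf or native code; `AbsentFieldSketch`, the map-based "wire" format, and the default value passed in by the caller are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.OptionalLong;

/** Hypothetical model of "absent field leaves the default in place"; not the PR's code. */
final class AbsentFieldSketch {
    private static final String KEY = "metadata_cache_limit";

    /** Sender side: an unset option produces no entry at all. */
    static Map<String, Long> encode(OptionalLong metadataCacheLimit) {
        Map<String, Long> wire = new LinkedHashMap<>();
        metadataCacheLimit.ifPresent(v -> wire.put(KEY, v));
        return wire;
    }

    /** Receiver side: keep the default unless the entry is present. */
    static long resolve(Map<String, Long> wire, long upstreamDefault) {
        return wire.getOrDefault(KEY, upstreamDefault);
    }
}
```

With this shape, a caller that never sets the option gets exactly the value the receiver would have used before the setter existed, which is the additive-only guarantee described above.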

LantaoJin added 2 commits May 21, 2026 02:49
# Conflicts:
#	core/src/main/java/org/apache/datafusion/SessionContextBuilder.java
#	proto/session_options.proto
Development

Successfully merging this pull request may close these issues.

feat: configure DataFusion's built-in CacheManager from Java
