feat(dataframe): add join, joinOn, and JoinType#72

Merged
andygrove merged 2 commits into
apache:mainfrom
LantaoJin:feat/dataframe-join
May 21, 2026

@LantaoJin LantaoJin commented May 20, 2026

Which issue does this PR close?

Rationale for this change

Joins are the largest missing piece of the DataFrame API today; without them, Java callers cannot programmatically express even simple star-join queries that DataFusion's Rust API has supported for years.

What changes are included in this PR?

This PR ships Phase 1: column-name join plus joinOn with String-form predicates. By committing to SQL-string predicates here, it also closes the Phase-2 design question; a typed Expr builder is deliberately deferred.

New public Java enum org.apache.datafusion.JoinType mirroring upstream's 10-variant enum (INNER, LEFT, RIGHT, FULL, LEFT_SEMI, RIGHT_SEMI, LEFT_ANTI, RIGHT_ANTI, LEFT_MARK, RIGHT_MARK). The enum crosses JNI as a byte code, following the existing Volatility precedent for UDFs.
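As a rough illustration of what "crosses JNI as a byte code" can look like, here is a minimal sketch. The name JoinTypeSketch, the code() and fromCode() helpers, and the specific numeric values are all assumptions made for illustration; the actual org.apache.datafusion.JoinType may assign its codes differently.

```java
/**
 * Hypothetical sketch of an enum that crosses a JNI boundary as one byte.
 * The numeric codes below are assumptions for illustration only.
 */
enum JoinTypeSketch {
    INNER(0), LEFT(1), RIGHT(2), FULL(3),
    LEFT_SEMI(4), RIGHT_SEMI(5), LEFT_ANTI(6), RIGHT_ANTI(7),
    LEFT_MARK(8), RIGHT_MARK(9);

    private final byte code;

    JoinTypeSketch(int code) {
        this.code = (byte) code;
    }

    /** Value handed to the native side. */
    byte code() {
        return code;
    }

    /** Reverse lookup; rejects bytes that map to no variant. */
    static JoinTypeSketch fromCode(byte code) {
        for (JoinTypeSketch t : values()) {
            if (t.code == code) {
                return t;
            }
        }
        throw new IllegalArgumentException("unknown join type code: " + code);
    }
}
```

Storing an explicit code per variant, rather than relying on ordinal(), keeps the wire value stable if the declaration order ever changes.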

Three new methods on DataFrame:

  • DataFrame.join(right, type, leftCols, rightCols) — equi-join on named columns, no residual filter.
  • DataFrame.join(right, type, leftCols, rightCols, filter) — equi-join with a residual SQL filter parsed against the combined schema of left + right.
  • DataFrame.joinOn(right, type, predicates...) — arbitrary join predicates, each parsed as a SQL expression against the combined schema. Predicates may be qualified with the relation alias when ambiguous (e.g. "l.id = r.id", "left.x < right.y").
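To make "equi-join with a residual filter" concrete, the following is a small in-memory sketch of the inner-join case only. It is not the DataFusion implementation: EquiJoinSketch, its innerJoin method, and the row-as-Map representation are hypothetical, and the sketch assumes the two sides have distinct column names. Rows pair up when every left key equals the corresponding right key, and the residual filter is then evaluated against the combined row.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;

/** Hypothetical in-memory illustration of inner equi-join semantics. */
final class EquiJoinSketch {

    static List<Map<String, Object>> innerJoin(
            List<Map<String, Object>> left,
            List<Map<String, Object>> right,
            String[] leftCols,
            String[] rightCols,
            BiPredicate<Map<String, Object>, Map<String, Object>> filter) {
        if (leftCols.length != rightCols.length) {
            throw new IllegalArgumentException("key lists must have equal length");
        }
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> l : left) {
            for (Map<String, Object> r : right) {
                // Keys must match first; the residual filter then sees both rows.
                if (keysMatch(l, r, leftCols, rightCols) && filter.test(l, r)) {
                    Map<String, Object> row = new LinkedHashMap<>(l);
                    row.putAll(r); // assumes no column-name collisions
                    out.add(row);
                }
            }
        }
        return out;
    }

    private static boolean keysMatch(
            Map<String, Object> l, Map<String, Object> r,
            String[] leftCols, String[] rightCols) {
        for (int i = 0; i < leftCols.length; i++) {
            Object lv = l.get(leftCols[i]);
            // As in SQL, a null key never compares equal.
            if (lv == null || !lv.equals(r.get(rightCols[i]))) {
                return false;
            }
        }
        return true;
    }
}
```

The nested loop is chosen for readability, not speed. For outer, semi, anti, and mark joins the handling of unmatched rows differs, and this sketch does not cover those cases.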

All three are non-consuming on the Java side: both left (the receiver) and right remain usable and must still be closed independently. Upstream's Rust join / join_on consume both, but the JNI layer clones the underlying Arc-backed DataFrame like every other transformation method (select / filter / withColumn / unnestColumns). This matches the established Java-side convention and supports the natural star-join pattern where the same fact table is joined to multiple dimensions:

DataFrame fact = ctx.sql(...);
DataFrame d1   = fact.join(dimA, INNER, fact_keys, dimA_keys);
DataFrame d2   = fact.join(dimB, INNER, fact_keys, dimB_keys);  // fact still usable
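The non-consuming contract described above can be sketched with a toy handle type. FrameHandleSketch and its methods are hypothetical stand-ins, not the real DataFrame class; a plain String stands in for the native, Arc-backed state. The point is only that join builds a new handle from copies of its inputs' state, so each handle is closed on its own.

```java
/** Hypothetical stand-in for a native-backed handle with non-consuming joins. */
final class FrameHandleSketch implements AutoCloseable {
    private final String plan; // stands in for the cloned native state
    private boolean closed;

    FrameHandleSketch(String plan) {
        this.plan = plan;
    }

    /** Returns a new handle; neither input is closed or invalidated. */
    FrameHandleSketch join(FrameHandleSketch right) {
        ensureOpen();
        right.ensureOpen();
        return new FrameHandleSketch("(" + plan + " JOIN " + right.plan + ")");
    }

    String plan() {
        ensureOpen();
        return plan;
    }

    boolean isClosed() {
        return closed;
    }

    /** Closes only this handle; handles derived from it stay usable. */
    @Override
    public void close() {
        closed = true;
    }

    private void ensureOpen() {
        if (closed) {
            throw new IllegalStateException("handle already closed");
        }
    }
}
```

Under this model, joining the same fact handle twice yields two independent results, and closing any one handle leaves the others untouched.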

Out of scope (for follow-ups):

  • Typed Expr builder (Phase 2 option 2). Andy's issue explicitly cautions against the maintenance burden. The String-form predicate channel covers everything DataFusion's parser supports.
  • Cross-join, natural-join. Not covered by the design issue "design: DataFrame joins (join, joinOn) and the Java Expr question" (#44); these belong in separate issues.
  • Predicate validation pre-flight on the Java side. DataFusion's parser is the source of truth and surfaces errors the same way filter(String) already does.

Are these changes tested?

Yes, 18 new tests in core/src/test/java/org/apache/datafusion/DataFrameJoinTest.java.

Are there any user-facing changes?

Yes, purely additive. New public API:

  • org.apache.datafusion.JoinType (enum)
  • DataFrame.join(DataFrame, JoinType, String[], String[])
  • DataFrame.join(DataFrame, JoinType, String[], String[], String)
  • DataFrame.joinOn(DataFrame, JoinType, String...)

No API removals, no deprecations, no behavior change for existing callers. The native binary is unchanged in size (no new Cargo features or dependencies).

@andygrove andygrove left a comment

Thanks @LantaoJin

@andygrove andygrove merged commit af78e61 into apache:main May 21, 2026
1 check passed

Development

Successfully merging this pull request may close these issues.

design: DataFrame joins (join, joinOn) and the Java Expr question

2 participants