feat(dataframe): expose sort and repartition #66
Open
LantaoJin wants to merge 2 commits into
Add DataFrame.sort(SortExpr...), DataFrame.repartitionRoundRobin(int), and DataFrame.repartitionHash(int, String...). SortExpr is a small value class with static asc/desc factories and a fluent nullsFirst setter, mirroring DataFusion's expr::Sort. The SQL-string sort flavour the issue lists as option 1 is deferred: DataFusion 53.1 has no parse_sort_exprs helper on DataFrame, so the string flavour would force hand-rolled ORDER BY parsing. The typed SortExpr API is the same shape the issue authorises in option 2. repartitionHash takes column-name keys for v1 and translates each through col(...) in the native handler. Expression keys are deferred until a Java-side Expr builder lands.
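To make the shape of the value class concrete, here is a minimal, self-contained sketch of a SortExpr-style class with static asc/desc factories and a fluent nullsFirst setter. It is not the PR's source: the class name SortExprSketch, the accessor names, and the choice to return a new immutable instance from nullsFirst(boolean) are assumptions made for illustration. The defaults follow the ones stated in the PR body (ASC → NULLs last, DESC → NULLs first).

```java
import java.util.Objects;

// Hypothetical sketch only; names and immutability are assumptions, not the PR's code.
final class SortExprSketch {
    private final String column;      // column name only, no arbitrary expressions
    private final boolean ascending;
    private final boolean nullsFirst;

    private SortExprSketch(String column, boolean ascending, boolean nullsFirst) {
        this.column = Objects.requireNonNull(column, "column");
        this.ascending = ascending;
        this.nullsFirst = nullsFirst;
    }

    // Ascending sort; defaults to NULLs last.
    static SortExprSketch asc(String column) {
        return new SortExprSketch(column, true, false);
    }

    // Descending sort; defaults to NULLs first.
    static SortExprSketch desc(String column) {
        return new SortExprSketch(column, false, true);
    }

    // Fluent override of the NULL placement; returns a new instance (assumed immutable design).
    SortExprSketch nullsFirst(boolean nullsFirst) {
        return new SortExprSketch(column, ascending, nullsFirst);
    }

    String column() { return column; }
    boolean isAscending() { return ascending; }
    boolean isNullsFirst() { return nullsFirst; }
}
```

With a class of this shape, a call such as SortExprSketch.asc("a").nullsFirst(true) overrides only the NULL placement and leaves the direction untouched, which is the property the fluent setter is meant to provide.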
Which issue does this PR close?
Rationale for this change
Two ordering / layout primitives have been missing from the Java DataFrame API: sort (no way to order without dropping to SQL) and repartition (no way to control parallelism / partitioning). Both are first-class on the upstream Rust DataFrame, in the default feature set, with no Cargo flag impact. This PR exposes them additively.
What changes are included in this PR?
- SortExpr -- new value class. Final class with static factories SortExpr.asc(String) / SortExpr.desc(String) and a fluent nullsFirst(boolean) setter. Mirrors DataFusion's expr::Sort { expr, asc, nulls_first }. Defaults match upstream: ASC → NULLs last, DESC → NULLs first.
- DataFrame.sort(SortExpr...) -- ordering. An empty array is a no-op (matches DataFrame::sort(vec![])); each SortExpr is null-checked Java-side; the receiver remains usable.
- DataFrame.repartitionRoundRobin(int) -- maps to Partitioning::RoundRobinBatch(usize). Java validates numPartitions > 0.
- DataFrame.repartitionHash(int, String...) -- maps to Partitioning::Hash(Vec<Expr>, usize). Column-name keys for v1; the native handler translates each name through datafusion::logical_expr::col(...). Java validates numPartitions > 0, columns non-null/non-empty, no null elements.
- native/src/lib.rs -- three JNI handlers (sortRows, repartitionRoundRobinRows, repartitionHashRows) using the existing try_unwrap_or_throw plumbing. Boolean arrays are decoded via JBooleanArray + get_boolean_array_region (jni 0.21). Imports: datafusion::logical_expr::{col, Partitioning, SortExpr}, jni::objects::JBooleanArray.
Why typed SortExpr instead of the SQL-string flavour the issue suggests as option 1: DataFrame::parse_sql_expr parses a single expression, not an ORDER BY list, and DataFusion 53.1 has no parse_sort_exprs helper. The string flavour would force hand-rolled SQL parsing on the native side. The issue authorises starting at option 2; the SQL-string flavour can be layered on later if/when an Expr builder lands.
Out of scope (for follow-ups):
- The SQL-string sort flavour (df.sort("a ASC, b DESC NULLS FIRST")).
- Expression sort keys (SortExpr.asc("a + b")). The field is named column (not expr) to make this contract enforceable.
- Partitioning::DistributeBy and Partitioning::Hash with arbitrary expressions.
- collect_partitioned. Tests assert the row-preservation invariant only.
Are these changes tested?
Yes -- 20 new tests across SortExprTest and DataFrameTransformationsTest, plus six new lines extending the existing close/collect coverage.
Are there any user-facing changes?
Yes -- purely additive. New public API:
- org.apache.datafusion.SortExpr (value class)
- DataFrame.sort(SortExpr...) → DataFrame
- DataFrame.repartitionRoundRobin(int) → DataFrame
- DataFrame.repartitionHash(int, String...) → DataFrame
No API removals, no deprecations, no behaviour change for existing callers. No Cargo feature changes; binary size is unchanged.
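For reviewers who want a picture of the Java-side argument checks described for the two repartition methods (numPartitions > 0; columns non-null, non-empty, with no null elements), the following is a rough, self-contained sketch. The helper class RepartitionArgsSketch, its method names, and the specific exception types are assumptions for illustration; the PR text does not say which exceptions the real methods throw.

```java
import java.util.Objects;

// Hypothetical helper; class name, method names, and exception types are assumptions.
final class RepartitionArgsSketch {
    private RepartitionArgsSketch() {}

    // Shared check for both repartition flavours: the partition count must be positive.
    static void checkNumPartitions(int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException(
                "numPartitions must be > 0, got " + numPartitions);
        }
    }

    // Hash-key check: array non-null, non-empty, and free of null elements.
    // Returns a defensive copy so later caller-side mutation cannot affect the keys.
    static String[] checkHashColumns(String... columns) {
        Objects.requireNonNull(columns, "columns");
        if (columns.length == 0) {
            throw new IllegalArgumentException("columns must not be empty");
        }
        for (int i = 0; i < columns.length; i++) {
            if (columns[i] == null) {
                throw new NullPointerException("columns[" + i + "] is null");
            }
        }
        return columns.clone();
    }
}
```

Validating before crossing the JNI boundary keeps bad arguments from reaching the native handler, so failures surface as ordinary Java exceptions at the call site.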