Skip to content

Add support for bitmap based filters#6007

Merged
gz merged 8 commits into
mainfrom
roaring
May 4, 2026
Merged

Add support for bitmap based filters#6007
gz merged 8 commits into
mainfrom
roaring

Conversation

@gz
Copy link
Copy Markdown
Contributor

@gz gz commented Apr 8, 2026

This PR adds three things:

1st commit: A bitmap filter based on the roaring crate with

  • Necessary plumbing so we can have different filters in front of our files and batches
  • Necessary plumbing so we can decide if we should use the roaring filter or fall back to the bloom filter
    • This needed us to consistently have a min/max for every batch
    • A heuristic that estimates when it's better to use roaring than bloom (see 2nd commit): High-level answer is that it's better when the set of keys in the batch is dense, with caveats

2nd commit: A benchmark that validates the predictor function we use to choose between roaring and bloom filters (2k lines, this code can be mostly ignored for the PR review)

3rd commit: Code that adjusts storage files by setting an incompatible feature flag

Describe Manual Test Plan

Ran many pipelines

Checklist

  • Unit tests added/updated

Breaking Changes?

Ideally no

Describe Incompatible Changes

File format has new incompatible feature, old versions will refuse to read the files with a roaring filter.
Users should not downgrade once they go beyond this version without clearing storage.

@gz gz changed the title Roaring Add support for bitmap based filter Apr 8, 2026
@gz gz changed the title Add support for bitmap based filter Add support for bitmap based filters Apr 8, 2026
Copy link
Copy Markdown

@mythical-fred mythical-fred left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Draft: two design questions, see inline.

Comment thread crates/dbsp/src/storage/file/format.rs Outdated
Comment thread crates/dbsp/src/storage/file/filter.rs Outdated
@gz
Copy link
Copy Markdown
Contributor Author

gz commented Apr 8, 2026

datagen

CREATE TABLE simple (
    payload [ BIGINT | INTEGER ] NOT NULL PRIMARY KEY
) WITH (
  'materialized' = 'true',
  'connectors' = '[{"name":"data","transport":{"name":"datagen","config":{"workers":9,"plan":[{"fields":{"payload":{"strategy":"[ increment | uniform ]"}}}]}}}]'
);

original is v0.281.0 and roaring branch is the current platform runtime on 92e4a95. enable_roaring=true was set on all runs.

Bench Summary

Workload Original Throughput Roaring Branch Throughput Δ Throughput Original Mem Roaring Branch Mem Δ Mem Original Storage Roaring Branch Storage Δ Storage
BIGINT + increment 4.21M/s 4.69M/s +11.3% 12.33 GiB 8.88 GiB -28.0% 22.22 GiB 25.20 GiB +13.4%
BIGINT + uniform 2.31M/s 2.29M/s -0.8% 7.49 GiB 7.70 GiB +2.9% 21.18 GiB 19.63 GiB -7.3%
INTEGER + increment 4.34M/s 4.71M/s +8.4% 11.86 GiB 8.58 GiB -27.7% 22.38 GiB 22.85 GiB +2.1%
INTEGER + uniform 2.18M/s 2.16M/s -1.2% 7.03 GiB 7.19 GiB +2.3% 11.95 GiB 12.58 GiB +5.3%
  • increment workloads improved on the roaring branch: better throughput and much lower memory.
  • uniform workloads in the noise on throughput, memory.

Filter Hit/Miss Comparison:

Workload Original Roaring Branch Read
BIGINT + increment Range: 0 hits / 4.340B misses, 100% miss rate. Bloom: 0/0. Roaring: 0/0. Range: 0 / 4.268B, 100% miss rate. Bloom: 0/0. Roaring: 0/0. Useful filtering is all in the range filter for both versions.
BIGINT + uniform Bloom: 185.6K hits / 1.845B misses, 99.9899% miss rate. Range: 1,177 misses on 1.845B passes, 0.000064% miss rate. Bloom: 152.6K / 1.506B, 99.9899% miss rate. Range: 1,121 misses on 1.507B passes, 0.000074% miss rate. Bloom is doing almost all of the useful rejection.
INTEGER + increment Range: 0 / 3.649B, 100% miss rate. Bloom: 0/0. Roaring: 0/0. Range: 0 / 3.413B, 100% miss rate. Bloom: 0/0. Roaring: 0/0. Same pattern as BIGINT + increment.
INTEGER + uniform Bloom: 15.64M hits / 785.0M misses, 98.05% miss rate. Range: 437 misses on 800.7M passes, 0.000055% miss rate. Bloom: 16.56M / 789.0M, 97.94% miss rate. Range: 414 misses on 805.5M passes, 0.000051% miss rate. Same pattern as BIGINT + uniform.

Takeaways:

  • higher miss rate is better here it means the filter is rejecting more lookups.
  • In these runs, the measurable rejection work came from range for increment and bloom for uniform; roaring hit/miss counters themselves stayed 0 in the final samples.
  • throuhgput increase most likely from building the bitmap filters faster than the bloom filters for increment workloads

postgres

original is v0.281.0 and roaring branch is the platform runtime on 92e4a95.

I loaded all four Postgres tables to 200,000,000 rows each. It's very slow.

Bench Summary

Workload Original Tput Roaring Branch Tput Δ Tput Original Mem Roaring Branch Mem Δ Mem Original Storage Roaring Branch Storage Δ Storage
INT random 1.080M/s 1.075M/s -0.49% 24.43 GiB 24.17 GiB -1.09% 3.79 GiB 3.85 GiB +1.59%
BIGINT random 1.052M/s 1.058M/s +0.53% 25.09 GiB 25.13 GiB +0.14% 3.88 GiB 3.89 GiB +0.42%
INT sequence 1.081M/s 1.087M/s +0.56% 24.57 GiB 24.68 GiB +0.45% 3.81 GiB 3.62 GiB -5.05%
BIGINT sequence 1.069M/s 1.099M/s +2.77% 25.00 GiB 24.86 GiB -0.58% 3.96 GiB 3.40 GiB -14.03%

Filter Read
Miss rate here is misses / (hits + misses), so higher is better for the filter.

Workload Original Roaring Branch
INT random Bloom 11.4K / 112.38M, miss 99.9898%, bloom size 452.8 MiB. Range 112.39M / 44, miss 0.0000%. Bloom 11.1K / 109.45M, miss 99.9898%, bloom size 452.8 MiB. Range 109.46M / 43, miss 0.0000%.
BIGINT random Bloom 11.4K / 113.54M, miss 99.9900%, bloom size 452.8 MiB. Range 113.55M / 36, miss 0.0000%. Bloom 10.6K / 106.68M, miss 99.9900%, bloom size 443.3 MiB. Range 106.69M / 55, miss 0.0001%.
INT sequence Bloom counters 0/0, bloom size 455.9 MiB. Range 0 / 94.05M, miss 100.0000%. Bloom counters 0/0, roaring counters 0/0, roaring size 186.5 MiB. Range 0 / 121.21M, miss 100.0000%.
BIGINT sequence Bloom counters 0/0, bloom size 446.3 MiB. Range 0 / 168.18M, miss 100.0000%. Bloom counters 0/0, roaring counters 0/0, roaring size 186.5 MiB. Range 0 / 96.21M, miss 100.0000%.

Takeaways:

  • The postgres input connector is not able to saturate the circuit (buffered records 0).
  • Random workloads are basically flat. One is slightly worse, one slightly better.
  • Sequence workloads show roaring roaring branch helps with filter size.
  • For sequence keys, the roaring branch switches filter representation from bloom to roaring and cuts filter size a lot, observed rejection already in the range filter. Roaring hit/miss counters stayed 0/0.
  • For random keys, bloom is doing the rejection at ~99.99% miss rate in both versions.

deltalake

original is v0.281.0 and roaring branch is the platform runtime on the current manager.

CREATE TABLE int_random_keys (
    id INT NOT NULL PRIMARY KEY
) WITH (
  'materialized' = 'true',
  'connectors' = '[{"name":"data","transport":{"name":"delta_table_input","config":{"uri":"file:///mnt/data/deltalake/$table","mode":"snapshot","transaction_mode":"none"}}}]'
);

These were 300-second fda bench runs, and all four runs had enable_roaring=true.
int_random_keys: delta-rs table filled with random u32 keys
int_sequence_keys: delta-rs table built with sequentially filling keys

Bench Summary

Workload Original Tput Roaring Branch Tput Δ Tput Original Mem Roaring Branch Mem Δ Mem Original Storage Roaring Branch Storage Δ Storage Δ State Amp
INT random 2.332M/s 2.326M/s -0.25% 8.96 GiB 8.77 GiB -2.07% 13.87 GiB 13.85 GiB -0.14% +0.10%
INT sequence 2.748M/s 3.336M/s +21.41% 9.70 GiB 7.94 GiB -18.10% 15.19 GiB 18.19 GiB +19.77% -1.30%

For fixed 300s runs, absolute storage at stop time is influenced by how much work got processed in that window.

Filter Miss Table

Miss rate is misses / (hits + misses), so higher is better.

Workload Version Bloom hits / misses Bloom Miss Rate Roaring hits / misses Roaring Miss Rate Range hits / misses Range Miss Rate
INT random original (v0.281.0) 140.2K / 1.387B 99.9899% 0 / 0 N/A 1.387B / 962 0.0001%
INT random roaring branch 150.9K / 1.495B 99.9899% 0 / 0 N/A 1.495B / 768 0.0001%
INT sequence original (v0.281.0) 158.4K / 1.566B 99.9899% 0 / 0 N/A 1.567B / 16.842M 1.0637%
INT sequence roaring branch 0 / 0 N/A 0 / 2.480B 100.0000% 2.480B / 27.664M 1.1030%

Read

  • INT random is effectively flat. Both versions stay bloom-based, and the filter behavior is almost identical.
  • INT sequence is the real win for the roaring branch: +21.4% throughput and -18.1% memory.
  • On INT sequence, original stays bloom-based, while the roaring branch switches fully to roaring and shows a 100% roaring miss rate in the final sample.
  • Absolute storage is higher on the roaring branch for INT sequence, but it also processed much more data in the same 300s window; normalized state-amplification is slightly better.

kafka

CREATE TABLE kafka_int_sequence_keys (
    id INT NOT NULL PRIMARY KEY
) WITH (
  'materialized' = 'true',
  'connectors' = '[{"name":"data","transport":{"name":"kafka_input","config":{"topic":"$topic","bootstrap.servers":"127.0.0.1:9092","start_from":"earliest","resume_earliest_if_data_expires":true,"poller_threads":10,"log_level":"info"}},"format":{"name":"avro","config":{"update_format":"raw","schema":"{\"type\":\"record\",\"name\":\"KafkaInt\",\"fields\":[{\"name\":\"id\",\"type\":\"int\"}]}","skip_schema_id":true}}}]'
);
  • two topics: kafka_int_sequence_keys, kafka_int_random_keys
  • This only ran at ~800k/s. Not maxing out the circuit

Bench Summary

Workload Original Tput Roaring Branch Tput Δ Tput Original Mem Roaring Branch Mem Δ Mem Original Storage Roaring Branch Storage Δ Storage Δ State Amp
INT random 0.766M/s 0.761M/s -0.67% 5.59 GiB 5.81 GiB +4.07% 3.56 GiB 3.63 GiB +1.89% +2.58%
INT sequence 0.775M/s 0.775M/s +0.12% 5.70 GiB 5.47 GiB -3.96% 4.11 GiB 3.49 GiB -14.95% -15.05%

Filter Miss Table

Workload Version Bloom hits / misses Bloom Miss Rate Roaring hits / misses Roaring Miss Rate Range hits / misses Range Miss Rate
INT random original (v0.281.0) 19.7K / 195.218M 99.9899% 0 / 0 N/A 195.238M / 38 0.000019%
INT random roaring branch 19.7K / 193.705M 99.9898% 0 / 0 N/A 193.725M / 38 0.000020%
INT sequence original (v0.281.0) 19.4K / 192.655M 99.9899% 0 / 0 N/A 192.674M / 46.957M 19.5954%
INT sequence roaring branch 0 / 0 N/A 0 / 139.045M 100.0000% 139.045M / 34.087M 19.6883%

Main read:

  • INT sequence is where roaring helps on Kafka+Avro: throughput is bottlenecked on kafka, memory drops a bit and storage/state amplification drop a lot.
  • The filter behavior matches that: random stays bloom-based; sequence flips from bloom on original to roaring on the roaring branch.

@gz gz force-pushed the roaring branch 3 times, most recently from a1e6733 to 84819c1 Compare April 9, 2026 03:09
@gz gz requested review from blp and ryzhyk April 9, 2026 03:10
@gz gz marked this pull request as ready for review April 9, 2026 03:10
Comment thread crates/dbsp/src/trace/sampling.rs
@gz
Copy link
Copy Markdown
Contributor Author

gz commented Apr 9, 2026

This is currently configured as an enabled by default feature with option to disable it in dev-tweaks.
Alternative: disabled for now, enable selectively for some users.

Thoughts?

Copy link
Copy Markdown

@mythical-fred mythical-fred left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Three blockers. The new dev_tweaks knob needs docs in docs.feldera.com, the roaring path should start default-off rather than enabled-by-default with a kill switch, and the [ci] apply automatic fixes commit needs to be squashed because dirty history is still a hard no.

Comment thread crates/feldera-types/src/config/dev_tweaks.rs
pub fn enable_roaring(&self) -> bool {
// Roaring is enabled by default, but `enable_roaring = false` remains
// available as a kill switch while the feature is still being tuned.
self.enable_roaring.unwrap_or(true)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the feature-flag question from the PR thread: this should start default-off. An incompatible storage-format feature that is still being tuned wants a real rollout gate, not an opt-out kill switch after it is already on for everyone. enable_roaring = true should selectively enable it; the default should stay false until we have more production mileage.

@gz gz force-pushed the roaring branch 3 times, most recently from 73d0445 to 0db137f Compare April 9, 2026 05:39
Comment thread crates/dbsp/src/storage/file/filter/roaring.rs Outdated
Comment thread crates/dbsp/src/utils/supports_roaring.rs Outdated
Comment thread crates/dbsp/src/utils/supports_roaring.rs Outdated
Comment thread crates/dbsp/src/utils/supports_roaring.rs Outdated
Comment thread crates/dbsp/src/utils/supports_roaring.rs
Comment thread crates/dbsp/src/storage/file/filter.rs Outdated
Comment thread crates/dbsp/src/storage/file/filter.rs Outdated
Comment thread crates/dbsp/src/storage/file/format.rs Outdated
Comment thread crates/dbsp/src/trace/sampling.rs Outdated
Comment thread crates/dbsp/src/trace/spine_async.rs Outdated
@gz gz force-pushed the roaring branch 2 times, most recently from 1f2dec2 to 04e01a1 Compare April 10, 2026 18:11
@gz
Copy link
Copy Markdown
Contributor Author

gz commented Apr 10, 2026

Thanks @blp great feedback, I addressed evyerthing except for the upsampling constant of 100. Will try to run the benchmark with lower values and see if it has an impact.

@gz
Copy link
Copy Markdown
Contributor Author

gz commented Apr 11, 2026

  • It looks like some of our users disable bloom filters, need to make sure both bloom and bitmaps can be disabled at the same time

Copy link
Copy Markdown
Contributor

@ryzhyk ryzhyk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation looks nice and modular, but it's a lot of complexity. Most benchmarks are mixed. Is the main justification for this that it significantly speeds up ingest into tables whose PKs arrive in-order?

Comment thread crates/dbsp/src/trace/ord/fallback/indexed_wset.rs
Comment thread crates/dbsp/src/trace/sampling.rs
Comment thread crates/dbsp/src/trace/sampling.rs
Comment thread crates/dbsp/src/trace.rs Outdated
Comment thread crates/dbsp/src/utils/supports_roaring.rs
Comment thread crates/dbsp/src/utils/supports_roaring.rs Outdated
Comment thread crates/dbsp/src/storage/file/filter.rs Outdated
K: DataTrait + ?Sized,
{
fn sample_count_for_filter_plan(num_keys: usize) -> usize {
let scaled = ((num_keys as f64) * (FILTER_PLAN_SAMPLE_PERCENT / 100.0)).ceil() as usize;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does the theory (and/or benchmarks) say here? Sampling 0.1% of a large batch is a lot of IO. Sampling 1024 keys in a small batch can be relatively expensive too.

Is there a constant ceiling on the number of keys that need to be sampled to achieve good probabilistic precision? E.g., is 100 or 1000 sufficient?

Copy link
Copy Markdown
Contributor Author

@gz gz Apr 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an implicit upper bound by the fact that we don't attempt to build a bitmap filter if the batch range does not fit in a u32::MAX.

e.g. 0.1% of u32::MAX is 4294967 or 34 MiB of we assume 8 bytes keys. But I'll see what happens to prediction if we make it lower.

Comment thread crates/dbsp/src/storage/file/filter.rs Outdated
@gz
Copy link
Copy Markdown
Contributor Author

gz commented Apr 12, 2026

Most benchmarks are mixed. Is the main justification for this that it significantly speeds up ingest into tables whose PKs arrive in-order?

As long as the connector can max out the circuit it seems to be between 10-20% better wrg to tput (datagen+delta), and less memory (-28% datagen, -20% delta) for things that arrive mostly ordered.
When the connector isn't maxed out we probably would have to run longer to see effects on mem+tput.

#6007 (comment)

One question I don't know and haven't looked at is if also can benefits transactions commits if it has to build a spine and can do it in a way such new entries appear mostly sequential. Do you think it's plausible?

@ryzhyk
Copy link
Copy Markdown
Contributor

ryzhyk commented Apr 12, 2026

Most benchmarks are mixed. Is the main justification for this that it significantly speeds up ingest into tables whose PKs arrive in-order?

As long as the connector can max out the circuit it seems to be between 10-20% better wrg to tput (datagen+delta), and less memory (-28% datagen, -20% delta) for things that arrive mostly ordered. When the connector isn't maxed out we probably would have to run longer to see effects on mem+tput.

#6007 (comment)

I missed the numbers for memory. But yes, this is a significant improvement, if the input pattern is favorable. I don't know how frequently this happens in practice.

One question I don't know and haven't looked at is if also can benefits transactions commits if it has to build a spine and can do it in a way such new entries appear mostly sequential. Do you think it's plausible?

This will be the case if the output of an operator is indexed in the same way as the input, then you can expect outputs to be ordered. Also not sure how common this is. Normally outputs need to be reindexed.

@ryzhyk
Copy link
Copy Markdown
Contributor

ryzhyk commented Apr 12, 2026

This is currently configured as an enabled by default feature with option to disable it in dev-tweaks. Alternative: disabled for now, enable selectively for some users.

Thoughts?

Have you tested it on some customer workloads? If you feel you've done enough testing, it may be ok to enable by default.

@gz gz force-pushed the roaring branch 7 times, most recently from 51ca6f7 to 3793d66 Compare May 4, 2026 07:51
@gz gz enabled auto-merge May 4, 2026 07:51
@gz gz disabled auto-merge May 4, 2026 07:57
@gz gz enabled auto-merge May 4, 2026 08:01
Comment thread crates/feldera-types/src/config/dev_tweaks.rs
Comment thread Cargo.toml
#deltalake-catalog-unity = { git = "https://github.com/ryzhyk/delta-rs.git", rev = "b24e5aaec322db0f084449e5c57f97ff4526bbdc" }
#deltalake = { git = "https://github.com/ryzhyk/delta-rs.git", rev = "b24e5aaec322db0f084449e5c57f97ff4526bbdc" }
deltalake-catalog-unity = { git = "https://github.com/gz/delta-rs.git", rev = "c37cf6e788d653db4149b205542559cf25674002" }
deltalake = { git = "https://github.com/gz/delta-rs.git", rev = "c37cf6e788d653db4149b205542559cf25674002" }
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two concerns on this dep bump:

  1. Personal-fork dependency on main. This switches deltalake/deltalake-catalog-unity from ryzhyk/delta-rs (the team's vetted fork) to gz/delta-rs at a single SHA. Personal forks under one engineer's account are fine for spike branches, but landing them on main makes the build depend on an account/repo with no team write access — if gz ever force-pushes, deletes the branch, or rotates ownership, the workspace stops building. Please push these patches into ryzhyk/delta-rs (or a feldera-org fork) and bump to that SHA before merging.

  2. Round-robin pre-split is back, just relocated. The commit message says this points at the "round-robin pre-split branch". That's the same design we agreed to withdraw on 2026-04-30 — moving it down into the delta-rs fork doesn't make the design questions go away (lex ordering assumptions, per-file reader scheduling, behavior under concurrent writers / OPTIMIZE rewrites, retry semantics). If we want to land it via the fork, please link the delta-rs PR/diff in this PR description and add tests on the feldera side that exercise the new path; otherwise drop the fork bump from this PR and keep it for a separate, dedicated change.

.map(|n| n as usize)
.filter(|n| *n > 0)
})
.unwrap_or(DEFAULT_MAX_CONCURRENT_READERS);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is presented as opt-in via DELTA_DF_TARGET_PARTITIONS, but the default path now pins datafusion.execution.target_partitions = 6 (or max_concurrent_readers) for every Delta connector. Before this change target_partitions was unset, so DataFusion used its own default (num_cpus). On a typical multi-core worker that's a regression in snapshot parallelism — e.g. a 32-core host drops from 32 to 6 partitions on upgrade with no config change.

If the goal is just to expose a knob, leave target_partitions unset when neither the env var nor max_concurrent_readers is provided (i.e. only set_usize it when one of those was actually specified). Otherwise this needs to be called out as an intentional default change with a benchmark, because it silently re-tunes every existing Delta source.

@gz gz added this pull request to the merge queue May 4, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 4, 2026
gz added 8 commits May 4, 2026 08:18
This makes sure the merger overhead does not become too high
once batches grow close to u32::MAX sampling millions of keys
(lots of random lookups/reads).

One way to fix is would have been to cap sampling to
like 1000 keys. This works but..

It turns out it's probably good enough to just look at min/max/count
and make the predictor slightly more pessimistic (so pick bloom when
it's close because it will not cause perf regressions in existing
deployments).

The new predictor is very close for this. It gets it mostly right,
when it gets it wrong it mostly suggests bloom, when it wrongly
suggests roaring either the performance diff is negligible or
the distribution is artifically pathological.

Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
Track the exact number of 16-bit windows touched by each simulated input batch using a temporary 8 KiB bitset, store the final count in a compact 16-bit encoding, and feed those counts into the merge-time roaring lookup heuristic alongside min/max span overlap. This fixes the wide-but-holey benchmark cases without sampling keys and updates the CSV/report output to show the new touched-window estimates.
This turned out to be a cheaper predictor than sampling
because it doesnt require reads over the file.

Also make some perf tweaks to avoid regressions:

- Avoid building filter for in-memory batches (unlikely to be useful at
  that size anyways)

Reduce runtime overheads when feature is disabled:
- inline(always) for per-key code-path

Reduce runtime overhead when feature is enabled (and random ingest):
- Don't build filter for memory batches

Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
Bumps the delta-rs deps to the round-robin pre-split branch
(gz/delta-rs.git#c37cf6e7), and adds three env-var knobs to the Delta
input connector that the patched delta-rs branch plays with:

  DELTA_DF_TARGET_PARTITIONS  - number of parallel scan partitions.
                                Falls back to `max_concurrent_readers`
                                from the connector config.
  DELTA_DF_BATCH_SIZE         - rows per parquet batch. Defaults to
                                DataFusion's 8192 when unset.
  DELTA_SNAPSHOT_EXPLAIN      - one-shot dump of the snapshot query's
                                physical plan at INFO level, useful
                                when sanity-checking parallelism /
                                ordering decisions.

Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
@gz gz enabled auto-merge May 4, 2026 15:58
pub fn enable_roaring(&self) -> bool {
// Roaring is enabled by default, but `enable_roaring = false` remains
// available as a kill switch while the feature is still being tuned.
self.enable_roaring.unwrap_or(true)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still unwrap_or(true). The new INCOMPATIBLE_FEATURE_ROARING_FILTERS gating in 53da209 is a nice cleanup (avoids a v6 bump and keeps Bloom-only files readable) but it does not address the on-disk forward-compat concern: with this default, every roaring-eligible batch written by an upgraded binary still sets the incompatible bit, and the PR body itself notes downgrade requires clearing storage. While the comment literally says "still being tuned", the default should be false for the first release. Flip it; opt in selectively until tuning is complete.

Comment thread Cargo.toml
deltalake = { git = "https://github.com/ryzhyk/delta-rs.git", rev = "b24e5aaec322db0f084449e5c57f97ff4526bbdc" }
#deltalake-catalog-unity = { git = "https://github.com/ryzhyk/delta-rs.git", rev = "b24e5aaec322db0f084449e5c57f97ff4526bbdc" }
#deltalake = { git = "https://github.com/ryzhyk/delta-rs.git", rev = "b24e5aaec322db0f084449e5c57f97ff4526bbdc" }
deltalake-catalog-unity = { git = "https://github.com/gz/delta-rs.git", rev = "c37cf6e788d653db4149b205542559cf25674002" }
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gz/delta-rs#c37cf6e7 is still a personal fork — single-engineer maintenance/supply-chain risk and, per the commit subject, the SHA is the "round-robin pre-split" branch, i.e. the snapshot round-robin design we agreed to withdraw on 2026-04-30 has just moved into the fork. Two things needed before this can land:

  1. Upstream the delta-rs change to delta-io/delta-rs (or at minimum to ryzhyk/delta-rs) and re-pin here.
  2. Link the delta-rs diff in the PR description and add feldera-side coverage for the round-robin behavior — the prior round-robin review raised lex-ordering assumptions, retry semantics under concurrent writers / OPTIMIZE, and behavior on restart, none of which are exercised by anything in this PR.

.map(|n| n as usize)
.filter(|n| *n > 0)
})
.unwrap_or(DEFAULT_MAX_CONCURRENT_READERS);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.unwrap_or(DEFAULT_MAX_CONCURRENT_READERS) is unconditional: when neither DELTA_DF_TARGET_PARTITIONS nor max_concurrent_readers is set, this pins datafusion.execution.target_partitions = 6 for every Delta connector. Previously the option was unset and DataFusion picked num_cpus. On any host with > 6 cores this is a silent snapshot-parallelism regression on upgrade.

Only call set_usize("datafusion.execution.target_partitions", …) when the env var or max_concurrent_readers is explicitly provided; otherwise leave DataFusion's default in place.

@gz gz added this pull request to the merge queue May 4, 2026
Merged via the queue into main with commit 5e6c8e8 May 4, 2026
1 check passed
@gz gz deleted the roaring branch May 4, 2026 17:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants