Add support for bitmap based filters by gz · Pull Request #6007 · feldera/feldera

gz · 2026-04-08T00:47:54Z

This PR adds three things:

1st commit: A bitmap filter based on the roaring crate with

Necessary plumbing so we can have different filters in front of our files and batches
Necessary plumbing so we can decide if we should use the roaring filter or fall back to the bloom filter
- This needed us to consistently have a min/max for every batch
- A heuristic that estimates when it's better to use roaring than bloom (see 2nd commit): High-level answer is that it's better when the set of keys in the batch is dense, with caveats

2nd commit: A benchmark that validates the predictor function we use to choose between roaring and bloom filters (2k lines, this code can be mostly ignored for the PR review)

3rd commit: Code that adjusts storage files by setting an incompatible feature flag

Describe Manual Test Plan

Ran many pipelines

Checklist

Unit tests added/updated

Breaking Changes?

Ideally no

Describe Incompatible Changes

File format has new incompatible feature, old versions will refuse to read the files with a roaring filter.
Users should not downgrade once they go beyond this version without clearing storage.

mythical-fred

Draft: two design questions, see inline.

gz · 2026-04-08T19:09:24Z

datagen

CREATE TABLE simple (
    payload [ BIGINT | INTEGER ] NOT NULL PRIMARY KEY
) WITH (
  'materialized' = 'true',
  'connectors' = '[{"name":"data","transport":{"name":"datagen","config":{"workers":9,"plan":[{"fields":{"payload":{"strategy":"[ increment | uniform ]"}}}]}}}]'
);

original is v0.281.0 and roaring branch is the current platform runtime on 92e4a95. enable_roaring=true was set on all runs.

Bench Summary

Workload	Original Throughput	Roaring Branch Throughput	Δ Throughput	Original Mem	Roaring Branch Mem	Δ Mem	Original Storage	Roaring Branch Storage	Δ Storage
BIGINT + increment	4.21M/s	4.69M/s	+11.3%	12.33 GiB	8.88 GiB	-28.0%	22.22 GiB	25.20 GiB	+13.4%
BIGINT + uniform	2.31M/s	2.29M/s	-0.8%	7.49 GiB	7.70 GiB	+2.9%	21.18 GiB	19.63 GiB	-7.3%
INTEGER + increment	4.34M/s	4.71M/s	+8.4%	11.86 GiB	8.58 GiB	-27.7%	22.38 GiB	22.85 GiB	+2.1%
INTEGER + uniform	2.18M/s	2.16M/s	-1.2%	7.03 GiB	7.19 GiB	+2.3%	11.95 GiB	12.58 GiB	+5.3%

increment workloads improved on the roaring branch: better throughput and much lower memory.
uniform workloads in the noise on throughput, memory.

Filter Hit/Miss Comparison:

Workload	Original	Roaring Branch	Read
BIGINT + increment	Range: 0 hits / 4.340B misses, 100% miss rate. Bloom: 0/0. Roaring: 0/0.	Range: 0 / 4.268B, 100% miss rate. Bloom: 0/0. Roaring: 0/0.	Useful filtering is all in the range filter for both versions.
BIGINT + uniform	Bloom: 185.6K hits / 1.845B misses, 99.9899% miss rate. Range: 1,177 misses on 1.845B passes, 0.000064% miss rate.	Bloom: 152.6K / 1.506B, 99.9899% miss rate. Range: 1,121 misses on 1.507B passes, 0.000074% miss rate.	Bloom is doing almost all of the useful rejection.
INTEGER + increment	Range: 0 / 3.649B, 100% miss rate. Bloom: 0/0. Roaring: 0/0.	Range: 0 / 3.413B, 100% miss rate. Bloom: 0/0. Roaring: 0/0.	Same pattern as BIGINT + increment.
INTEGER + uniform	Bloom: 15.64M hits / 785.0M misses, 98.05% miss rate. Range: 437 misses on 800.7M passes, 0.000055% miss rate.	Bloom: 16.56M / 789.0M, 97.94% miss rate. Range: 414 misses on 805.5M passes, 0.000051% miss rate.	Same pattern as BIGINT + uniform.

Takeaways:

higher miss rate is better here it means the filter is rejecting more lookups.
In these runs, the measurable rejection work came from range for increment and bloom for uniform; roaring hit/miss counters themselves stayed 0 in the final samples.
throuhgput increase most likely from building the bitmap filters faster than the bloom filters for increment workloads

postgres

original is v0.281.0 and roaring branch is the platform runtime on 92e4a95.

I loaded all four Postgres tables to 200,000,000 rows each. It's very slow.

Bench Summary

Workload	Original Tput	Roaring Branch Tput	Δ Tput	Original Mem	Roaring Branch Mem	Δ Mem	Original Storage	Roaring Branch Storage	Δ Storage
INT random	1.080M/s	1.075M/s	-0.49%	24.43 GiB	24.17 GiB	-1.09%	3.79 GiB	3.85 GiB	+1.59%
BIGINT random	1.052M/s	1.058M/s	+0.53%	25.09 GiB	25.13 GiB	+0.14%	3.88 GiB	3.89 GiB	+0.42%
INT sequence	1.081M/s	1.087M/s	+0.56%	24.57 GiB	24.68 GiB	+0.45%	3.81 GiB	3.62 GiB	-5.05%
BIGINT sequence	1.069M/s	1.099M/s	+2.77%	25.00 GiB	24.86 GiB	-0.58%	3.96 GiB	3.40 GiB	-14.03%

Filter Read
Miss rate here is misses / (hits + misses), so higher is better for the filter.

Workload	Original	Roaring Branch
INT random	Bloom 11.4K / 112.38M, miss 99.9898%, bloom size 452.8 MiB. Range 112.39M / 44, miss 0.0000%.	Bloom 11.1K / 109.45M, miss 99.9898%, bloom size 452.8 MiB. Range 109.46M / 43, miss 0.0000%.
BIGINT random	Bloom 11.4K / 113.54M, miss 99.9900%, bloom size 452.8 MiB. Range 113.55M / 36, miss 0.0000%.	Bloom 10.6K / 106.68M, miss 99.9900%, bloom size 443.3 MiB. Range 106.69M / 55, miss 0.0001%.
INT sequence	Bloom counters 0/0, bloom size 455.9 MiB. Range 0 / 94.05M, miss 100.0000%.	Bloom counters 0/0, roaring counters 0/0, roaring size 186.5 MiB. Range 0 / 121.21M, miss 100.0000%.
BIGINT sequence	Bloom counters 0/0, bloom size 446.3 MiB. Range 0 / 168.18M, miss 100.0000%.	Bloom counters 0/0, roaring counters 0/0, roaring size 186.5 MiB. Range 0 / 96.21M, miss 100.0000%.

Takeaways:

The postgres input connector is not able to saturate the circuit (buffered records 0).
Random workloads are basically flat. One is slightly worse, one slightly better.
Sequence workloads show roaring roaring branch helps with filter size.
For sequence keys, the roaring branch switches filter representation from bloom to roaring and cuts filter size a lot, observed rejection already in the range filter. Roaring hit/miss counters stayed 0/0.
For random keys, bloom is doing the rejection at ~99.99% miss rate in both versions.

deltalake

original is v0.281.0 and roaring branch is the platform runtime on the current manager.

CREATE TABLE int_random_keys (
    id INT NOT NULL PRIMARY KEY
) WITH (
  'materialized' = 'true',
  'connectors' = '[{"name":"data","transport":{"name":"delta_table_input","config":{"uri":"file:///mnt/data/deltalake/$table","mode":"snapshot","transaction_mode":"none"}}}]'
);

These were 300-second fda bench runs, and all four runs had enable_roaring=true.
int_random_keys: delta-rs table filled with random u32 keys
int_sequence_keys: delta-rs table built with sequentially filling keys

Bench Summary

Workload	Original Tput	Roaring Branch Tput	Δ Tput	Original Mem	Roaring Branch Mem	Δ Mem	Original Storage	Roaring Branch Storage	Δ Storage	Δ State Amp
INT random	2.332M/s	2.326M/s	-0.25%	8.96 GiB	8.77 GiB	-2.07%	13.87 GiB	13.85 GiB	-0.14%	+0.10%
INT sequence	2.748M/s	3.336M/s	+21.41%	9.70 GiB	7.94 GiB	-18.10%	15.19 GiB	18.19 GiB	+19.77%	-1.30%

For fixed 300s runs, absolute storage at stop time is influenced by how much work got processed in that window.

Filter Miss Table

Miss rate is misses / (hits + misses), so higher is better.

Workload	Version	Bloom hits / misses	Bloom Miss Rate	Roaring hits / misses	Roaring Miss Rate	Range hits / misses	Range Miss Rate
INT random	original (v0.281.0)	140.2K / 1.387B	99.9899%	0 / 0	N/A	1.387B / 962	0.0001%
INT random	roaring branch	150.9K / 1.495B	99.9899%	0 / 0	N/A	1.495B / 768	0.0001%
INT sequence	original (v0.281.0)	158.4K / 1.566B	99.9899%	0 / 0	N/A	1.567B / 16.842M	1.0637%
INT sequence	roaring branch	0 / 0	N/A	0 / 2.480B	100.0000%	2.480B / 27.664M	1.1030%

Read

INT random is effectively flat. Both versions stay bloom-based, and the filter behavior is almost identical.
INT sequence is the real win for the roaring branch: +21.4% throughput and -18.1% memory.
On INT sequence, original stays bloom-based, while the roaring branch switches fully to roaring and shows a 100% roaring miss rate in the final sample.
Absolute storage is higher on the roaring branch for INT sequence, but it also processed much more data in the same 300s window; normalized state-amplification is slightly better.

kafka

CREATE TABLE kafka_int_sequence_keys (
    id INT NOT NULL PRIMARY KEY
) WITH (
  'materialized' = 'true',
  'connectors' = '[{"name":"data","transport":{"name":"kafka_input","config":{"topic":"$topic","bootstrap.servers":"127.0.0.1:9092","start_from":"earliest","resume_earliest_if_data_expires":true,"poller_threads":10,"log_level":"info"}},"format":{"name":"avro","config":{"update_format":"raw","schema":"{\"type\":\"record\",\"name\":\"KafkaInt\",\"fields\":[{\"name\":\"id\",\"type\":\"int\"}]}","skip_schema_id":true}}}]'
);

two topics: kafka_int_sequence_keys, kafka_int_random_keys
This only ran at ~800k/s. Not maxing out the circuit

Bench Summary

Workload	Original Tput	Roaring Branch Tput	Δ Tput	Original Mem	Roaring Branch Mem	Δ Mem	Original Storage	Roaring Branch Storage	Δ Storage	Δ State Amp
INT random	0.766M/s	0.761M/s	-0.67%	5.59 GiB	5.81 GiB	+4.07%	3.56 GiB	3.63 GiB	+1.89%	+2.58%
INT sequence	0.775M/s	0.775M/s	+0.12%	5.70 GiB	5.47 GiB	-3.96%	4.11 GiB	3.49 GiB	-14.95%	-15.05%

Filter Miss Table

Workload	Version	Bloom hits / misses	Bloom Miss Rate	Roaring hits / misses	Roaring Miss Rate	Range hits / misses	Range Miss Rate
INT random	original (v0.281.0)	19.7K / 195.218M	99.9899%	0 / 0	N/A	195.238M / 38	0.000019%
INT random	roaring branch	19.7K / 193.705M	99.9898%	0 / 0	N/A	193.725M / 38	0.000020%
INT sequence	original (v0.281.0)	19.4K / 192.655M	99.9899%	0 / 0	N/A	192.674M / 46.957M	19.5954%
INT sequence	roaring branch	0 / 0	N/A	0 / 139.045M	100.0000%	139.045M / 34.087M	19.6883%

Main read:

INT sequence is where roaring helps on Kafka+Avro: throughput is bottlenecked on kafka, memory drops a bit and storage/state amplification drop a lot.
The filter behavior matches that: random stays bloom-based; sequence flips from bloom on original to roaring on the roaring branch.

gz · 2026-04-09T03:11:39Z

This is currently configured as an enabled by default feature with option to disable it in dev-tweaks.
Alternative: disabled for now, enable selectively for some users.

Thoughts?

mythical-fred

Three blockers. The new dev_tweaks knob needs docs in docs.feldera.com, the roaring path should start default-off rather than enabled-by-default with a kill switch, and the [ci] apply automatic fixes commit needs to be squashed because dirty history is still a hard no.

mythical-fred · 2026-04-09T05:10:36Z

+    pub fn enable_roaring(&self) -> bool {
+        // Roaring is enabled by default, but `enable_roaring = false` remains
+        // available as a kill switch while the feature is still being tuned.
+        self.enable_roaring.unwrap_or(true)


On the feature-flag question from the PR thread: this should start default-off. An incompatible storage-format feature that is still being tuned wants a real rollout gate, not an opt-out kill switch after it is already on for everyone. enable_roaring = true should selectively enable it; the default should stay false until we have more production mileage.

gz · 2026-04-10T18:12:30Z

Thanks @blp great feedback, I addressed evyerthing except for the upsampling constant of 100. Will try to run the benchmark with lower values and see if it has an impact.

gz · 2026-04-11T05:12:32Z

It looks like some of our users disable bloom filters, need to make sure both bloom and bitmaps can be disabled at the same time

ryzhyk

The implementation looks nice and modular, but it's a lot of complexity. Most benchmarks are mixed. Is the main justification for this that it significantly speeds up ingest into tables whose PKs arrive in-order?

ryzhyk · 2026-04-12T18:51:48Z

+    K: DataTrait + ?Sized,
+{
+    fn sample_count_for_filter_plan(num_keys: usize) -> usize {
+        let scaled = ((num_keys as f64) * (FILTER_PLAN_SAMPLE_PERCENT / 100.0)).ceil() as usize;


What does the theory (and/or benchmarks) say here? Sampling 0.1% of a large batch is a lot of IO. Sampling 1024 keys in a small batch can be relatively expensive too.

Is there a constant ceiling on the number of keys that need to be sampled to achieve good probabilistic precision? E.g., is 100 or 1000 sufficient?

There is an implicit upper bound by the fact that we don't attempt to build a bitmap filter if the batch range does not fit in a u32::MAX.

e.g. 0.1% of u32::MAX is 4294967 or 34 MiB of we assume 8 bytes keys. But I'll see what happens to prediction if we make it lower.

gz · 2026-04-12T19:56:14Z

Most benchmarks are mixed. Is the main justification for this that it significantly speeds up ingest into tables whose PKs arrive in-order?

As long as the connector can max out the circuit it seems to be between 10-20% better wrg to tput (datagen+delta), and less memory (-28% datagen, -20% delta) for things that arrive mostly ordered.
When the connector isn't maxed out we probably would have to run longer to see effects on mem+tput.

#6007 (comment)

One question I don't know and haven't looked at is if also can benefits transactions commits if it has to build a spine and can do it in a way such new entries appear mostly sequential. Do you think it's plausible?

ryzhyk · 2026-04-12T20:25:33Z

Most benchmarks are mixed. Is the main justification for this that it significantly speeds up ingest into tables whose PKs arrive in-order?

As long as the connector can max out the circuit it seems to be between 10-20% better wrg to tput (datagen+delta), and less memory (-28% datagen, -20% delta) for things that arrive mostly ordered. When the connector isn't maxed out we probably would have to run longer to see effects on mem+tput.

#6007 (comment)

I missed the numbers for memory. But yes, this is a significant improvement, if the input pattern is favorable. I don't know how frequently this happens in practice.

One question I don't know and haven't looked at is if also can benefits transactions commits if it has to build a spine and can do it in a way such new entries appear mostly sequential. Do you think it's plausible?

This will be the case if the output of an operator is indexed in the same way as the input, then you can expect outputs to be ordered. Also not sure how common this is. Normally outputs need to be reindexed.

ryzhyk · 2026-04-12T20:27:09Z

This is currently configured as an enabled by default feature with option to disable it in dev-tweaks. Alternative: disabled for now, enable selectively for some users.

Thoughts?

Have you tested it on some customer workloads? If you feel you've done enough testing, it may be ok to enable by default.

mythical-fred · 2026-05-04T08:02:22Z

+#deltalake-catalog-unity = { git = "https://github.com/ryzhyk/delta-rs.git", rev = "b24e5aaec322db0f084449e5c57f97ff4526bbdc" }
+#deltalake = { git = "https://github.com/ryzhyk/delta-rs.git", rev = "b24e5aaec322db0f084449e5c57f97ff4526bbdc" }
+deltalake-catalog-unity = { git = "https://github.com/gz/delta-rs.git", rev = "c37cf6e788d653db4149b205542559cf25674002" }
+deltalake = { git = "https://github.com/gz/delta-rs.git", rev = "c37cf6e788d653db4149b205542559cf25674002" }


Two concerns on this dep bump:

Personal-fork dependency on main. This switches deltalake/deltalake-catalog-unity from ryzhyk/delta-rs (the team's vetted fork) to gz/delta-rs at a single SHA. Personal forks under one engineer's account are fine for spike branches, but landing them on main makes the build depend on an account/repo with no team write access — if gz ever force-pushes, deletes the branch, or rotates ownership, the workspace stops building. Please push these patches into ryzhyk/delta-rs (or a feldera-org fork) and bump to that SHA before merging.

Round-robin pre-split is back, just relocated. The commit message says this points at the "round-robin pre-split branch". That's the same design we agreed to withdraw on 2026-04-30 — moving it down into the delta-rs fork doesn't make the design questions go away (lex ordering assumptions, per-file reader scheduling, behavior under concurrent writers / OPTIMIZE rewrites, retry semantics). If we want to land it via the fork, please link the delta-rs PR/diff in this PR description and add tests on the feldera side that exercise the new path; otherwise drop the fork bump from this PR and keep it for a separate, dedicated change.

mythical-fred · 2026-05-04T08:02:22Z

+                    .map(|n| n as usize)
+                    .filter(|n| *n > 0)
+            })
+            .unwrap_or(DEFAULT_MAX_CONCURRENT_READERS);


This is presented as opt-in via DELTA_DF_TARGET_PARTITIONS, but the default path now pins datafusion.execution.target_partitions = 6 (or max_concurrent_readers) for every Delta connector. Before this change target_partitions was unset, so DataFusion used its own default (num_cpus). On a typical multi-core worker that's a regression in snapshot parallelism — e.g. a 32-core host drops from 32 to 6 partitions on upgrade with no config change.

If the goal is just to expose a knob, leave target_partitions unset when neither the env var nor max_concurrent_readers is provided (i.e. only set_usize it when one of those was actually specified). Otherwise this needs to be called out as an intentional default change with a benchmark, because it silently re-tunes every existing Delta source.

This makes sure the merger overhead does not become too high once batches grow close to u32::MAX sampling millions of keys (lots of random lookups/reads). One way to fix is would have been to cap sampling to like 1000 keys. This works but.. It turns out it's probably good enough to just look at min/max/count and make the predictor slightly more pessimistic (so pick bloom when it's close because it will not cause perf regressions in existing deployments). The new predictor is very close for this. It gets it mostly right, when it gets it wrong it mostly suggests bloom, when it wrongly suggests roaring either the performance diff is negligible or the distribution is artifically pathological. Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>

Track the exact number of 16-bit windows touched by each simulated input batch using a temporary 8 KiB bitset, store the final count in a compact 16-bit encoding, and feed those counts into the merge-time roaring lookup heuristic alongside min/max span overlap. This fixes the wide-but-holey benchmark cases without sampling keys and updates the CSV/report output to show the new touched-window estimates.

This turned out to be a cheaper predictor than sampling because it doesnt require reads over the file. Also make some perf tweaks to avoid regressions: - Avoid building filter for in-memory batches (unlikely to be useful at that size anyways) Reduce runtime overheads when feature is disabled: - inline(always) for per-key code-path Reduce runtime overhead when feature is enabled (and random ingest): - Don't build filter for memory batches Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>

Bumps the delta-rs deps to the round-robin pre-split branch (gz/delta-rs.git#c37cf6e7), and adds three env-var knobs to the Delta input connector that the patched delta-rs branch plays with: DELTA_DF_TARGET_PARTITIONS - number of parallel scan partitions. Falls back to `max_concurrent_readers` from the connector config. DELTA_DF_BATCH_SIZE - rows per parquet batch. Defaults to DataFusion's 8192 when unset. DELTA_SNAPSHOT_EXPLAIN - one-shot dump of the snapshot query's physical plan at INFO level, useful when sanity-checking parallelism / ordering decisions. Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>

mythical-fred · 2026-05-04T16:00:57Z

+    pub fn enable_roaring(&self) -> bool {
+        // Roaring is enabled by default, but `enable_roaring = false` remains
+        // available as a kill switch while the feature is still being tuned.
+        self.enable_roaring.unwrap_or(true)


Still unwrap_or(true). The new INCOMPATIBLE_FEATURE_ROARING_FILTERS gating in 53da209 is a nice cleanup (avoids a v6 bump and keeps Bloom-only files readable) but it does not address the on-disk forward-compat concern: with this default, every roaring-eligible batch written by an upgraded binary still sets the incompatible bit, and the PR body itself notes downgrade requires clearing storage. While the comment literally says "still being tuned", the default should be false for the first release. Flip it; opt in selectively until tuning is complete.

mythical-fred · 2026-05-04T16:00:57Z

-deltalake = { git = "https://github.com/ryzhyk/delta-rs.git", rev = "b24e5aaec322db0f084449e5c57f97ff4526bbdc" }
+#deltalake-catalog-unity = { git = "https://github.com/ryzhyk/delta-rs.git", rev = "b24e5aaec322db0f084449e5c57f97ff4526bbdc" }
+#deltalake = { git = "https://github.com/ryzhyk/delta-rs.git", rev = "b24e5aaec322db0f084449e5c57f97ff4526bbdc" }
+deltalake-catalog-unity = { git = "https://github.com/gz/delta-rs.git", rev = "c37cf6e788d653db4149b205542559cf25674002" }


gz/delta-rs#c37cf6e7 is still a personal fork — single-engineer maintenance/supply-chain risk and, per the commit subject, the SHA is the "round-robin pre-split" branch, i.e. the snapshot round-robin design we agreed to withdraw on 2026-04-30 has just moved into the fork. Two things needed before this can land:

Upstream the delta-rs change to delta-io/delta-rs (or at minimum to ryzhyk/delta-rs) and re-pin here.

Link the delta-rs diff in the PR description and add feldera-side coverage for the round-robin behavior — the prior round-robin review raised lex-ordering assumptions, retry semantics under concurrent writers / OPTIMIZE, and behavior on restart, none of which are exercised by anything in this PR.

mythical-fred · 2026-05-04T16:00:57Z

+                    .map(|n| n as usize)
+                    .filter(|n| *n > 0)
+            })
+            .unwrap_or(DEFAULT_MAX_CONCURRENT_READERS);


.unwrap_or(DEFAULT_MAX_CONCURRENT_READERS) is unconditional: when neither DELTA_DF_TARGET_PARTITIONS nor max_concurrent_readers is set, this pins datafusion.execution.target_partitions = 6 for every Delta connector. Previously the option was unset and DataFusion picked num_cpus. On any host with > 6 cores this is a silent snapshot-parallelism regression on upgrade.

Only call set_usize("datafusion.execution.target_partitions", …) when the env var or max_concurrent_readers is explicitly provided; otherwise leave DataFusion's default in place.

gz changed the title ~~Roaring~~ Add support for bitmap based filter Apr 8, 2026

gz changed the title ~~Add support for bitmap based filter~~ Add support for bitmap based filters Apr 8, 2026

gz force-pushed the roaring branch from f9ae764 to 92e4a95 Compare April 8, 2026 01:11

mythical-fred reviewed Apr 8, 2026

View reviewed changes

Comment thread crates/dbsp/src/storage/file/format.rs Outdated

Comment thread crates/dbsp/src/storage/file/filter.rs Outdated

gz force-pushed the roaring branch 3 times, most recently from a1e6733 to 84819c1 Compare April 9, 2026 03:09

gz requested review from blp and ryzhyk April 9, 2026 03:10

gz marked this pull request as ready for review April 9, 2026 03:10

gz commented Apr 9, 2026

View reviewed changes

Comment thread crates/dbsp/src/trace/sampling.rs

mythical-fred suggested changes Apr 9, 2026

View reviewed changes

mythical-fred reviewed Apr 9, 2026

View reviewed changes

Comment thread crates/feldera-types/src/config/dev_tweaks.rs

mythical-fred reviewed Apr 9, 2026

View reviewed changes

gz force-pushed the roaring branch 3 times, most recently from 73d0445 to 0db137f Compare April 9, 2026 05:39

blp approved these changes Apr 9, 2026

View reviewed changes

gz force-pushed the roaring branch 2 times, most recently from 1f2dec2 to 04e01a1 Compare April 10, 2026 18:11

ryzhyk reviewed Apr 12, 2026

View reviewed changes

ryzhyk approved these changes Apr 12, 2026

View reviewed changes

gz force-pushed the roaring branch from 04e01a1 to c18fdea Compare April 12, 2026 23:43

gz force-pushed the roaring branch 7 times, most recently from 51ca6f7 to 3793d66 Compare May 4, 2026 07:51

gz enabled auto-merge May 4, 2026 07:51

gz disabled auto-merge May 4, 2026 07:57

gz force-pushed the roaring branch from 3793d66 to 392df47 Compare May 4, 2026 08:00

gz enabled auto-merge May 4, 2026 08:01

mythical-fred suggested changes May 4, 2026

View reviewed changes

gz added this pull request to the merge queue May 4, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 4, 2026

gz added 8 commits May 4, 2026 08:18

dbsp: add roaring-based file filters

7cc19ca

bench: add filter predictor benchmark

eb574c9

storage: gate roaring format changes by feature flag

53da209

bench: add holey-wide filter predictor case

a82b593

gz force-pushed the roaring branch from 392df47 to 650a7c3 Compare May 4, 2026 15:56

gz enabled auto-merge May 4, 2026 15:58

mythical-fred suggested changes May 4, 2026

View reviewed changes

gz added this pull request to the merge queue May 4, 2026

Merged via the queue into main with commit 5e6c8e8 May 4, 2026
1 check passed

gz deleted the roaring branch May 4, 2026 17:41

mythical-fred mentioned this pull request May 20, 2026

[connectors] configure storage for datafusion instances used by connectors #6170

Open

4 tasks

Conversation

gz commented Apr 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Describe Manual Test Plan

Checklist

Breaking Changes?

Describe Incompatible Changes

Uh oh!

mythical-fred left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

gz commented Apr 8, 2026

datagen

postgres

deltalake

kafka

Uh oh!

Uh oh!

gz commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mythical-fred left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mythical-fred Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

gz commented Apr 10, 2026

Uh oh!

gz commented Apr 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ryzhyk left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ryzhyk Apr 12, 2026

Choose a reason for hiding this comment

Uh oh!

gz Apr 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

gz commented Apr 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ryzhyk commented Apr 12, 2026

Uh oh!

ryzhyk commented Apr 12, 2026

Uh oh!

Uh oh!

mythical-fred May 4, 2026

Choose a reason for hiding this comment

Uh oh!

mythical-fred May 4, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mythical-fred May 4, 2026

Choose a reason for hiding this comment

Uh oh!

gz commented Apr 8, 2026 •

edited

Loading

gz commented Apr 9, 2026 •

edited

Loading

gz commented Apr 11, 2026 •

edited

Loading

gz Apr 12, 2026 •

edited

Loading

gz commented Apr 12, 2026 •

edited

Loading