arrow-go shipped AVX2/SSE4/NEON SBBF probes in 18.3.0 (PR #336). arrow-cpp and Velox ship the same scalar reference line-for-line; arrow-rs is the same algorithm in Rust. Those three cover the C-family Parquet ecosystem: DuckDB, ClickHouse, Polars, DataFusion, Trino, Presto, Spark via Gluten, StarRocks, Doris, pyarrow → pandas, Apache Drill.
A C++ port of the arrow-go approach: bit-identical on 167M (query, filter) pairs, 3–5× in-cache on the probe microbenchmark, 1.5× out-of-L3 with a 4-way bulk path across row-group filters.
The three still-scalar implementations
Apache Arrow C++ (bloom_filter.cc:348),
the reference forked by most C++ readers:
bool BlockSplitBloomFilter::FindHash(uint64_t hash) const {
  const uint32_t bucket_index = static_cast<uint32_t>(
      ((hash >> 32) * (num_bytes_ / kBytesPerFilterBlock)) >> 32);
  const uint32_t key = static_cast<uint32_t>(hash);
  const uint32_t* bitset32 = reinterpret_cast<const uint32_t*>(data_->data());
  for (int i = 0; i < kBitsSetPerBlock; ++i) {
    const uint32_t mask = UINT32_C(0x1) << ((key * SALT[i]) >> 27);
    if (ARROW_PREDICT_FALSE(
            0 == (bitset32[kBitsSetPerBlock * bucket_index + i] & mask))) {
      return false;
    }
  }
  return true;
}
arrow-rs (mod.rs:241),
used by Polars and DataFusion:
fn check(&self, hash: u32) -> bool {
    let mask = Self::mask(hash);
    for i in 0..8 {
        if self[i] & mask[i] == 0 {
            return false;
        }
    }
    true
}
Velox (BloomFilter.cpp:196),
which powers Prestissimo and Trino, and JVM Spark via Gluten:
bool BlockSplitBloomFilter::findHash(uint64_t hash) const {
  const uint32_t bucketIndex = static_cast<uint32_t>(
      ((hash >> 32) * (numBytes_ / kBytesPerFilterBlock)) >> 32);
  const uint32_t key = static_cast<uint32_t>(hash);
  const uint32_t* bitset32 =
      reinterpret_cast<const uint32_t*>(data_->as<char>());
  for (int i = 0; i < kBitsSetPerBlock; ++i) {
    const uint32_t mask = UINT32_C(0x1) << ((key * SALT[i]) >> 27);
    if (0 == (bitset32[kBitsSetPerBlock * bucketIndex + i] & mask)) {
      return false;
    }
  }
  return true;
}
Eight dependent 32-bit loads per probe, early-exit branch after each.
LLVM and GCC don’t auto-vectorize this: the early `return false` stops
the compiler from proving the loop equivalent to a branchless horizontal
AND-test across the block.
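For contrast, a hypothetical branchless rewrite of the scalar loop (the form a compiler would have to prove equivalent before vectorizing) replaces the early exit with a running AND. A sketch, using the eight SALT constants from the Parquet SBBF spec:

```cpp
#include <cassert>
#include <cstdint>

// The eight SALT constants from the Parquet SBBF spec.
static const uint32_t SALT[8] = {0x47b6137bU, 0x44974d91U, 0x8824ad5bU,
                                 0xa2b7289dU, 0x705495c7U, 0x2df1424bU,
                                 0x9efc4947U, 0x5c6bfb31U};

// Branchless probe over one 256-bit block: accumulate all eight tests
// into a running AND instead of returning on the first clear bit, so
// the eight loads become independent and the loop body is vectorizable.
bool find_branchless(const uint32_t block[8], uint32_t key) {
  uint32_t all_set = 1;
  for (int i = 0; i < 8; ++i) {
    const uint32_t mask = UINT32_C(1) << ((key * SALT[i]) >> 27);
    all_set &= static_cast<uint32_t>((block[i] & mask) != 0);
  }
  return all_set != 0;
}
```

The trade-off is that the branchless form always touches all eight words, which is exactly what the SIMD version does anyway with one 256-bit load.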
The spec maps onto AVX2
The 2018 SBBF spec dictates:
- 256-bit blocks = one `__m256i`.
- 8 bits set per block = 8 lanes of a 32-bit AVX2 register.
- Bit position formula `(key * SALT[i]) >> 27` for `i = 0..7` = `vpmulld` + `vpsrld` across 8 lanes.
- “All 8 bits set?” check = `_mm256_testc_si256` in one op.
- The eight SALT constants slot into one `_mm256_load_si256(SALT)`.
The on-disk format reads like an instruction sheet for AVX2. Impala and Kudu shipped AVX2 SBBF probes years ago; arrow-go added AVX2/SSE4/NEON last year. Apache Arrow C++ shipped the scalar pseudocode reference from day one, and arrow-rs and Velox ported from it, scalar shape and all.
The AVX2 probe
Same on-disk format, same SALT, same bit positions. Only the probe changes:
bool find_avx2(uint64_t hash) const {
  uint32_t bucket_index = uint32_t(((hash >> 32) * num_blocks_) >> 32);
  uint32_t key = uint32_t(hash);
  __m256i hash_v = _mm256_set1_epi32(int32_t(key));
  __m256i salt = _mm256_load_si256(reinterpret_cast<const __m256i*>(SALT));
  __m256i prod = _mm256_mullo_epi32(hash_v, salt);
  __m256i shift = _mm256_srli_epi32(prod, 27);
  __m256i ones = _mm256_set1_epi32(1);
  __m256i mask = _mm256_sllv_epi32(ones, shift);
  const __m256i* blk =
      reinterpret_cast<const __m256i*>(data_ + bucket_index * 32);
  return _mm256_testc_si256(_mm256_loadu_si256(blk), mask);
}
About a dozen instructions, no branches, ~7 cycles in the hot path.
_mm256_mullo_epi32(hash_v, salt) produces eight 32-bit products,
each equal to scalar key * SALT[i]. The shifts produce the same 8
bit positions. _mm256_testc_si256 returns 1 iff (~block & mask) == 0
— the same condition as “every mask bit set in the block.”
Bit-identical to the scalar path on every input.
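The carry-flag condition can be written out as a scalar model (hypothetical helper name, not from the repo):

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of _mm256_testc_si256(block, mask): the result is 1 iff
// (~block & mask) == 0 over all 256 bits, i.e. every bit set in mask is
// also set in block ("mask is a subset of block").
bool testc_model(const uint32_t block[8], const uint32_t mask[8]) {
  for (int i = 0; i < 8; ++i) {
    if (~block[i] & mask[i]) return false;  // a mask bit missing from block
  }
  return true;
}
```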
The bench in the repo runs both paths on the same Parquet-format data (same SALT, same XXH64, same Lemire fastrange bucket index): 0 mismatches across 167M (query, filter) pairs.
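The bucket index in all the listings above is Lemire’s fastrange: it maps the top 32 bits of the hash into `[0, num_blocks)` with a multiply and a shift instead of a modulo. A minimal sketch:

```cpp
#include <cassert>
#include <cstdint>

// Lemire fastrange: map x (uniform over 2^32) into [0, n) by taking the
// high 32 bits of the 64-bit product x * n. Cheaper than x % n; this is
// the ((hash >> 32) * num_blocks) >> 32 expression in the probe code.
uint32_t fastrange32(uint32_t x, uint32_t n) {
  return static_cast<uint32_t>((static_cast<uint64_t>(x) * n) >> 32);
}
```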
Numbers
Intel Xeon @ 2.8 GHz (Cascade Lake-class, 33 MiB L3, virtualised),
g++ -O3 -mavx2 -mbmi2. All numbers are post-hash probe latency;
XXH64 is computed once at setup and excluded from the timed loop.
| Regime | Filter footprint | scalar | AVX2 single-key | AVX2 4-way bulk |
|---|---|---|---|---|
| Small (in-cache) | 0.5 MiB | 12.73 ns | 3.70 ns (3.4×) | 2.36 ns (5.4×) |
| Medium (out-of-L3) | 128 MiB | 35.84 ns | 29.18 ns (1.2×) | 22.79 ns (1.6×) |
| Large (deep DRAM) | 1 GiB | 41.41 ns | 35.90 ns (1.15×) | 27.75 ns (1.5×) |
In-cache the bottleneck is ALU work: 8 dependent loads with branches vs. two SIMD ops after one cache-line load. The 3.4× comes from collapsing that ALU work.
Out-of-L3 both paths wait for the same DRAM load and the per-probe gain narrows to 1.15–1.2×. The 4-way bulk path (one query hash probed against four row-group filters in parallel) overlaps four cache misses in flight and recovers ~1.5×.
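A scalar model of the 4-way bulk idea (hypothetical types and names; the repo’s version runs the AVX2 probe per filter): resolve all four block addresses before testing any of them, so the four cache misses are in flight together instead of serialized.

```cpp
#include <cassert>
#include <cstdint>

// The eight SALT constants from the Parquet SBBF spec.
static const uint32_t SALT[8] = {0x47b6137bU, 0x44974d91U, 0x8824ad5bU,
                                 0xa2b7289dU, 0x705495c7U, 0x2df1424bU,
                                 0x9efc4947U, 0x5c6bfb31U};

struct Filter {          // hypothetical minimal filter view
  const uint32_t* data;  // num_blocks * 8 words
  uint32_t num_blocks;
};

// Probe one query hash against four row-group filters. Computing all
// four block pointers up front lets the out-of-cache loads overlap.
void find_bulk4(const Filter filters[4], uint64_t hash, bool out[4]) {
  const uint32_t key = static_cast<uint32_t>(hash);
  const uint32_t* blocks[4];
  for (int f = 0; f < 4; ++f) {
    const uint32_t bucket = static_cast<uint32_t>(
        ((hash >> 32) * filters[f].num_blocks) >> 32);
    blocks[f] = filters[f].data + bucket * 8;  // addresses ready up front
  }
  for (int f = 0; f < 4; ++f) {
    uint32_t all_set = 1;
    for (int i = 0; i < 8; ++i) {
      const uint32_t mask = UINT32_C(1) << ((key * SALT[i]) >> 27);
      all_set &= static_cast<uint32_t>((blocks[f][i] & mask) != 0);
    }
    out[f] = all_set != 0;
  }
}
```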
What the AVX2 path touches
Scoped to the probe stage, not query wall time. Databricks moved off Parquet bloom for Photon predictive I/O, but the rest of the OSS lakehouse — Iceberg, Hudi, DuckDB, ClickHouse — still scans through this code.
| Concern | AVX2 changes it? |
|---|---|
| Probe-stage CPU | Yes — 3–5× in-cache, 1.15–1.5× out-of-L3 (microbench) |
| `WHERE col IN (val1, val2, ...)` probe cost | Yes — 4-way bulk amortises across values |
| Build-stage CPU during writes | Yes — bulk insert path |
| Storage overhead (~10 bits/value extra per file) | No |
| FP rate vs alternatives (min/max + page indexes) | No |
| Maintenance cost on UPDATE/MERGE rebuild | No |
Same on-disk format, no behaviour change for any reader.
ARM
x86 AVX2 only — no NEON or SVE2 measurements. The intrinsics map almost line-for-line on paper (NEON in two 128-bit ops per block; SVE2 in one register on Graviton 3/4), but the ratios need measuring. The ALU collapse should still apply since the dependent-load chain is the same; rough estimate 2–2.5× in-cache.
Upstream, the AVX2 path would land behind a CpuInfo runtime dispatch; non-AVX2 builds keep the scalar path.
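A minimal sketch of what that dispatch could look like with the GCC/Clang builtin (upstream would go through Arrow’s CpuInfo instead; the probe functions here are stand-ins):

```cpp
#include <cassert>
#include <cstdint>

// Stand-in probe implementations; the real ones are the scalar loop and
// the AVX2 intrinsics version compiled for an AVX2 target.
static bool find_scalar(uint64_t) { return true; }
static bool find_avx2_stub(uint64_t) { return true; }

using ProbeFn = bool (*)(uint64_t);

// Pick the probe once at startup; non-AVX2 machines keep the scalar path.
ProbeFn select_probe() {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
  if (__builtin_cpu_supports("avx2")) return find_avx2_stub;
#endif
  return find_scalar;
}
```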
The repo
Standalone bench, vendored scalar references from arrow-cpp / arrow-rs / Velox for diffing against upstream, full results:
Under `investigations/`:

- `parquet_port/` — the bench. `make && ./bench` reproduces the numbers. The diff-test runs every time.
- `parquet_port/reference/` — verbatim source from the three upstream readers.
- `PARQUET.md` — technical analysis.
Upstream
Sent a [DISCUSS] to dev@arrow.apache.org:
lists.apache.org thread.