← Research

// article

Replacing DuckDB with Rust: 10.4× through predicate pushdown

October 4, 2025 Article

The trading platform used DuckDB for time-series queries. Performance was fine — 1–2ms typical — but the access pattern was wasteful. Every query read the full Parquet dataset, decompressed it, and filtered in memory.

Query: "SPY bars where close >= $450"
DuckDB execution:
  1. Read 100,000 bars from Parquet (1MB compressed)
  2. Decompress to memory (7.2MB)
  3. Filter in memory
  4. Return 30,000 matching bars

Latency: 1.2ms
I/O: 1MB read for a 300KB result (70% wasted)

DuckDB is the right tool for ad-hoc analytics, complex joins, window functions. For single-table time-series lookups with simple filters, most of the engine is dead weight. I built a custom Rust engine that targets exactly that workload — and got 10.4× on selective queries by pushing filters down to storage.

This is the writeup of five iterations: what worked, what regressed by 86%, and which decision did the heavy lifting.

What the baseline looked like

Criterion.rs benchmarks. 100,000 bars, price range 400–500, various filter selectivities.

QueryLatency (ns)I/O readBars returned
High selectivity (90% filtered)2,781100%10%
Medium selectivity (50% filtered)3,011100%50%
Low selectivity (10% filtered)3,367100%90%
No filter1,341100%100%

Filtered queries were slower than unfiltered. Counterintuitive at first — reading 100K bars and keeping 10K took longer than reading 100K bars and keeping all of them.

The cost was the filter itself. Each bar required field access and comparison; 90K discarded bars meant 90K wasted comparisons.

The code:

let mut all_bars = Vec::new();

// Read everything
for partition in &partitions {
    if let Ok(bars) = self.storage.read(symbol, partition.as_str()) {
        all_bars.extend(bars);
    }
}

// Filter after the fact
for filter in &query.filters {
    all_bars.retain(|bar| filter.matches_ohlcv(bar));
}

Filtering ran after I/O, after decompression, after deserialization. Wrong order.

The fix: predicate pushdown

Move filtering from memory to storage. Two levels — row group pruning (skip whole chunks via Parquet metadata) and Arrow compute filtering (filter rows before deserializing to structs).

Row group pruning

Parquet files carry min/max statistics per row group. For close >= 450, any row group with max_close < 450 can be skipped entirely.

Parquet File:
├── Metadata (10KB)
│   ├── Row Group 0: {min: 440.0, max: 445.0} → SKIP
│   ├── Row Group 1: {min: 445.0, max: 452.0} → READ (overlaps)
│   ├── Row Group 2: {min: 450.0, max: 455.0} → READ (all match)
│   └── ...
├── Row Group 0 Data (100KB compressed) → never read from disk
├── Row Group 1 Data (100KB compressed)
└── Row Group 2 Data (100KB compressed)

Row Group 0 is skipped at the metadata level. I/O saved depends on data distribution — 33% for this example, up to 90% for highly selective queries.

Arrow compute filtering

For row groups that survive pruning, apply filters to the Arrow RecordBatch before converting to OHLCV structs.

let batch = reader.next()?;
let filtered_batch = filter::apply_filters(&batch, filters)?;
let bars = Self::from_record_batch(&filtered_batch)?;

Arrow’s compute kernels use SIMD automatically. Comparisons run 4–8 values per instruction on the columnar data; only matching rows get converted to structs.

Wired up

pub fn read_with_filters(
    &self,
    symbol: &str,
    partition: &str,
    filters: &[QueryFilter],
) -> Result<Vec<OHLCV>> {
    let file = File::open(&file_path)?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    // Level 1: prune row groups
    let metadata = builder.metadata();
    let row_groups_to_read: Vec<usize> = metadata
        .row_groups()
        .iter()
        .enumerate()
        .filter_map(|(idx, rg)| {
            if filter::should_read_row_group(rg, filters) {
                Some(idx)
            } else {
                None
            }
        })
        .collect();

    if row_groups_to_read.is_empty() {
        return Ok(Vec::new());
    }

    let reader = builder
        .with_row_groups(row_groups_to_read)
        .build()?;

    let mut all_data = Vec::new();

    // Level 2: Arrow compute filtering
    for batch_result in reader {
        let batch = batch_result?;
        let filtered_batch = filter::apply_filters(&batch, filters)?;
        let data = Self::from_record_batch(&filtered_batch)?;
        all_data.extend(data);
    }

    Ok(all_data)
}

Row group selection happens at the builder level, so only selected groups are read from disk. Arrow filtering then runs on each batch after decompression.

Filter translation

Query filters translate to Arrow predicates:

pub fn filter_to_arrow_predicate(
    batch: &RecordBatch,
    filter: &QueryFilter,
) -> Result<BooleanArray> {
    match filter {
        QueryFilter::MinPrice(threshold_i64) => {
            let threshold = *threshold_i64 as f64 / 10000.0;

            let close_array = batch.column(4)
                .as_any()
                .downcast_ref::<Float64Array>()?;

            let threshold_array = Float64Array::from(
                vec![threshold; close_array.len()]
            );

            arrow::compute::gt_eq(close_array, &threshold_array)
        },

        QueryFilter::MinVolume(threshold) => {
            let volume_array = batch.column(5)
                .as_any()
                .downcast_ref::<UInt64Array>()?;

            let threshold_array = UInt64Array::from(
                vec![*threshold; volume_array.len()]
            );

            arrow::compute::gt_eq(volume_array, &threshold_array)
        }
    }
}

Arrow’s gt_eq is SIMD-accelerated under the hood. No manual vectorization.

Zero regression on unfiltered queries

Unfiltered queries can’t regress. Explicit branching keeps them on the original fast path:

if query.filters.is_empty() {
    // No filters: original path, zero overhead
    for partition in &partitions {
        all_bars.extend(self.storage.read(symbol, partition)?);
    }
} else {
    // With filters: predicate pushdown
    for partition in &partitions {
        all_bars.extend(
            self.storage.read_with_filters(symbol, partition, &query.filters)?
        );
    }
}

Existing workloads pay nothing for an optimization they don’t use.

Numbers after

ScenarioBefore (ns)After (ns)SpeedupI/O saved
High selectivity (90% filtered)2,78126710.4×90%
Medium selectivity (50% filtered)3,0117993.8×50%
Low selectivity (10% filtered)3,3671,2712.6×10%
No filter (regression check)1,3411,3481.0×0%
Multiple filters (AND)2,9113767.7×75%

The regression check came in at +0.5% — measurement noise.

Why the speedup exceeds the I/O cut

For the high-selectivity case, I/O was cut 90% but the speedup was 10.4×. Disproportionate. Four compounding effects:

  • Avoided decompression. LZ4 decompression is CPU-bound. Skipping 90% of compressed data saved most of it.
  • Smaller allocations. Growing a Vec to 10K is cheaper than growing it to 100K — allocator behavior isn’t linear.
  • Better cache locality. 10K items fit in L2. 100K items thrash L3.
  • SIMD on what remains. Arrow compute uses AVX-2 on the row groups that do get read.

I/O reduction is the first-order effect. The rest piles on top.

Breakdown for a single query (close >= 480)

Before:

Read: 100KB compressed → 1MB decompressed
Deserialize: 10,000 bars → Vec<OHLCV> (7.2MB)
Filter: 10,000 → 1,000 (9,000 discarded)
Time: 2,781ns

After:

Read metadata: 10KB
Check 10 row-group statistics
Skip 9 row groups (90KB compressed saved)
Read 1 row group: 10KB compressed → 100KB decompressed
Arrow filter: 1,200 bars → 1,000 bars
Deserialize: 1,000 → Vec<OHLCV> (720KB)
Time: 267ns (10.4×)

In production

Two-week migration with gradual rollout. DuckDB removed entirely at the end.

MetricDuckDBCustomChange
P50 latency1.5ms0.3ms
P99 latency5ms0.5ms10×
Throughput667 qps2,500 qps3.8×
Memory120MB80MB−33%

The new ceiling enabled features the old one couldn’t:

  • 100ms-refresh dashboards
  • 10-symbol correlation analysis with parallel queries
  • Historical backtests that, combined with the LRU cache (iteration 3 below), run 50× faster

Monitoring tracked filter effectiveness directly:

let selectivity = result.bars.len() as f64 / total_bars_scanned as f64;
metrics::histogram!("query.selectivity", selectivity);

let pruning_rate = 1.0 - (row_groups_read as f64 / total_row_groups as f64);
metrics::histogram!("query.row_groups_pruned", pruning_rate);

Observed distribution:

  • 60% of queries — high selectivity (>80% filtered) → 8–10× speedup
  • 30% — medium (30–80%) → 2–4× speedup
  • 10% — low (<30%) → 1.5–2× speedup

No regressions. Unfiltered queries stayed at baseline.

How the five iterations played out

Not all of them worked. One regressed 86% and got reverted.

Iteration 1: baseline

Multi-partition query 2.45ms. Storage read 9.30M bars/sec. Storage write 142K bars/sec. Compression 2.6:1.

Iteration 2: async I/O

Parallel partition reads with Tokio:

let futures: Vec<_> = partitions
    .iter()
    .map(|p| storage.read_async(symbol, p))
    .collect();

let results = futures::future::join_all(futures).await;

2.19ms → 0.81ms. 2.7× speedup. Saturated past 5 partitions because of synchronization overhead.

Iteration 3: LRU cache

if let Some(cached) = self.cache.get(&query) {
    return Ok(cached.clone());
}

let result = self.execute_uncached(query)?;
self.cache.put(query.clone(), result.clone());

Cache hit: 1,054× speedup (sub-microsecond). At 80% hit rate: 588× speedup. Cache size 1,000 queries.

This was the single largest gain. Dashboard queries went from 1.5ms to under 5μs.

Iteration 4: manual SIMD (failed)

Tried to vectorize VWAP, average, and sum. Data was stored as Vec<OHLCV> — Array-of-Structs. SIMD wants Struct-of-Arrays.

OperationScalar (ns)SIMD (ns)Outcome
VWAP1,3402,4860.54× (−86%)
Average4126870.60× (−67%)
Sum3395660.60× (−67%)

Field extraction dominated. Reverted everything.

The lesson: data layout matters more than instruction optimization. Arrow’s columnar format gives you SIMD at the library level. Manual SIMD on AoS data costs more than it saves.

Iteration 5: predicate pushdown (the 10.4× section above)

Combined

IterationOptimizationSpeedupCumulative
1Baseline1.0×1.0×
2Async I/O2.7×2.7×
3LRU cache (80% hit)588×~1,590×
4SIMD0.5× — reverted
5Predicate pushdown10.4×~16,500×

For the realistic mixed workload (80% cache hit, selective filters when not cached):

  • Cache hit path: ~5μs
  • Cache miss with filter: ~300μs (vs. 2.5ms baseline)
  • Effective speedup: ~500×

Five things this taught me

1. Optimize I/O before CPU. Iteration 4 targeted CPU; it regressed 86%. Iteration 5 targeted I/O; it gained 10.4×. Time-series queries are I/O-bound. The expensive parts are decompression and deserialization — not the filter math.

2. Parquet statistics are free. Row-group min/max already exist. You don’t pay to write them, and reading the metadata costs ~10KB to skip ~1MB of data. 100:1 ROI on the cheapest possible optimization.

3. Equivalence tests catch bugs before users do.

#[test]
fn test_filter_equivalence_inmemory() {
    let storage_result = storage.read_with_filters("SPY", "2024-01-02", &filters)?;

    let memory_result: Vec<_> = all_bars.iter()
        .filter(|bar| filters.iter().all(|f| f.matches_ohlcv(bar)))
        .cloned()
        .collect();

    assert_eq!(storage_result, memory_result);
}

Three bugs caught this way: boundary value handling (450.0 vs 449.9999), AND vs OR in multi-filter logic, and statistics byte order (assumed big-endian; Parquet is little-endian). None reached production.

4. Zero-regression branching protects existing workloads. Unfiltered queries route through the old path; filtered queries route through the new one. Measured cost on the no-filter case: 1,341ns → 1,348ns. That’s noise.

5. Eliminate FFI when latency matters. DuckDB through Python was ~1.2ms query + ~0.3ms FFI = 1.5ms. Native Rust: ~0.3ms total. Eliminating FFI was about 20% of the total win.

When to build, when to use DuckDB

DuckDB is the right call when:

  • You need SQL flexibility
  • Queries are complex — joins, window functions, CTEs
  • Development time matters more than query time
  • 10ms latency is fine
  • Workload is varied, no single hot path

A custom engine pays off when:

  • The workload is narrow and well-defined
  • Latency requirements are strict (<1ms)
  • You need control over optimization strategy
  • FFI overhead is non-trivial relative to the query itself
  • Query patterns are predictable

The trading workload fit the second column exactly: single-table time-series, simple filters, latency budget under 1ms. DuckDB’s SQL parser and planner were paying a tax for capabilities I wasn’t using.

Custom engine, 5× lower latency, full control over what gets optimized next.


The final state: 10.4× on selective queries, zero regression on unfiltered queries, 118 tests passing, DuckDB removed. Production serves 2,500 queries/second with P99 under 1ms.

Part 1 of the database series: database optimization reference.