// article

Database optimization: a working reference

Six database optimization techniques — predicate pushdown, row group pruning, result caching, async I/O, indexes, and where SIMD goes wrong — with impact numbers and code.

October 4, 2025 Article

There’s a hierarchy. I/O reduction beats CPU optimization. Caching beats computation. Filtering early beats filtering late. Most people reach for the last category first and find out it didn’t matter.

Six techniques follow, in roughly the order they tend to pay off. Each has the same structure: what it does, when it applies, how it’s wired, and what it actually buys you. Part 2 is the DuckDB-replacement case study where these stack up to a 10.4× speedup.

Quick reference

Technique	Impact	Effort	Priority	When
Predicate pushdown	10×+	Medium	High	Selective queries
Row group pruning	5–10×	Low	High	Columnar storage
Result caching	100–1000×	Low	High	Repeated queries
Async I/O	2–3×	Medium	Medium	Multi-partition reads
Database indexes	10–100×	Low	High	Point queries
SIMD	Variable	High	Low	Hot CPU paths

Predicate pushdown

The idea: filter at the storage layer, not in memory after reading. Two levels of pushdown work together — row group pruning (skip entire chunks via metadata) and record batch filtering (filter rows before deserializing).

For selective queries, this is a 10–20× win on I/O. For non-selective queries, it does nothing.

Selectivity	Data kept	I/O reduction	Speedup
High (1% match)	1%	90–95%	10–20×
Medium (30% match)	30%	60–70%	3–5×
Low (70% match)	70%	20–30%	1.5–2×

Code

pub fn filter_to_arrow_predicate(
    batch: &RecordBatch,
    filter: &QueryFilter,
) -> Result<BooleanArray> {
    match filter {
        QueryFilter::MinPrice(threshold_i64) => {
            let threshold = *threshold_i64 as f64 / 10000.0;
            let close_array = batch.column(4)
                .as_any()
                .downcast_ref::<Float64Array>()?;

            let threshold_array = Float64Array::from(
                vec![threshold; close_array.len()]
            );

            // Arrow compute kernel — SIMD accelerated
            gt_eq(close_array, &threshold_array)
        }
    }
}

The Arrow gt_eq kernel is SIMD-accelerated under the hood. You don’t write vectorized code; Arrow handles it.

When it applies

Yes when: storage is columnar (Parquet, Arrow, ORC), queries have selective WHERE clauses, and I/O is the bottleneck.

No when: queries scan most of the table, the format lacks statistics, or filter selectivity is below ~30%.

Trade-offs

Requires storage statistics. Parquet and ORC write them by default; older formats may not.

False positives are possible — a row group with range [440, 460] will be read for close >= 450 even if all values are 440–449. You waste I/O, you don’t return wrong rows.

Implementation complexity is real. You need to understand the storage format internals and Arrow compute kernels.

Row group pruning

Specific case of pushdown: skip entire chunks based on per-chunk statistics.

Parquet files carry metadata for each row group:

Parquet File
├── Metadata (10KB)
│   ├── Row Group 0: {min_close: 440.0, max_close: 445.0}
│   ├── Row Group 1: {min_close: 445.0, max_close: 450.0}
│   └── Row Group 2: {min_close: 450.0, max_close: 455.0}
├── Row Group 0 (100KB compressed)
├── Row Group 1 (100KB compressed)
└── Row Group 2 (100KB compressed)

For close >= 450:

Row group	min	max	Action	Why
0	440.0	445.0	Skip	max < 450
1	445.0	450.0	Read	Overlaps
2	450.0	455.0	Read	All match

Row Group 0 is never read. 33% I/O saved on this query.

Code

fn filter_matches_row_group(
    row_group: &RowGroupMetaData,
    filter: &QueryFilter,
) -> bool {
    match filter {
        QueryFilter::MinPrice(threshold_i64) => {
            let threshold = *threshold_i64 as f64 / 10000.0;

            if let Some(stats) = row_group.column(4).statistics() {
                if let Some(max_bytes) = stats.max_bytes_opt() {
                    if let Ok(max_close) = parse_f64_from_bytes(max_bytes) {
                        return max_close >= threshold;
                    }
                }
            }

            // No stats: don't skip
            true
        }
    }
}

Missing statistics default to “match” — safe but suboptimal. Effectiveness depends on data distribution. Time-ordered data with trending values prunes well. Randomly distributed data prunes poorly.

Result caching

Stores query results in memory. For repeated queries, this skips all I/O and all computation. Hit latency is 50–100× faster than disk.

The cache key has to be everything that affects the result: symbol, time range, filters, aggregation. Hash the whole query object.

Metric	Without	With (80% hit)	Speedup
P50 latency	1.5ms	0.3ms	5×
P99 latency	5ms	0.4ms	12.5×
Throughput	667 qps	3,333 qps	5×

Hit rate is the lever. Dashboards typically run at 80%; interactive analytics often exceed 90%.

Code

// Check cache
if let Some(cached_bars) = self.cache.borrow_mut().get(query) {
    let execution_time_us = start.elapsed().as_micros() as u64;
    return Ok(QueryResult {
        bars: cached_bars.clone(),
        execution_time_us,
        from_cache: true,
    });
}

// Execute
let result = storage.read_with_filters(...)?;

// Store
self.cache.borrow_mut().put(query.clone(), result.clone());

LRU eviction; add TTL if data changes faster than your cache can serve stale data.

When it applies

Yes when: queries repeat (dashboards, APIs), reads outnumber writes 10:1+, results fit in memory, brief staleness is tolerable.

No when: every query is unique, writes are frequent, memory is tight, or consistency is non-negotiable.

Invalidation

The hard part. Three approaches:

Strategy	Accuracy	Complexity	Fits
TTL (time-based)	Low	Low	Infrequent writes
Write-through	High	Medium	Moderate writes
Version stamps	High	High	Complex dependencies

TTL is the cheapest and most common. It will serve stale data; the question is whether the staleness window is acceptable.

Async I/O

For queries that span multiple partitions, read them in parallel. Time-series queries spanning 5 days can read 5 daily partitions concurrently.

Partitions	Sequential	Parallel	Speedup
1	1.0ms	1.0ms	1.0×
2	2.0ms	1.1ms	1.8×
5	5.0ms	1.8ms	2.8×
10	10.0ms	3.2ms	3.1×

Speedup saturates around 3× — synchronization and I/O contention put a ceiling on it. SSDs scale better than HDDs; network-attached storage sometimes scales better than either.

Code

let futures: Vec<_> = partitions
    .iter()
    .map(|partition| {
        let storage = storage.clone();
        let symbol = symbol.to_string();
        let partition = partition.to_string();

        async move {
            storage.read_async(&symbol, &partition).await
        }
    })
    .collect();

let results = futures::future::join_all(futures).await;

let mut all_bars = Vec::new();
for result in results {
    all_bars.extend(result?);
}

Use tokio::task::spawn_blocking for the actual I/O so the async runtime stays responsive.

When it applies

Yes when: queries span multiple partitions, storage supports parallelism (SSD, distributed), partitions are independent.

No when: single-partition queries dominate, storage is the bottleneck and already saturated, or thread overhead exceeds the I/O time.

Indexes

Indexes are lookup structures. B-tree for range queries, hash for equality, bitmap for low-cardinality, covering for SELECT subsets. The trade is always the same: faster reads, slower writes.

Type	Best for	Write overhead	Read speedup
B-tree	Range queries	20–30%	10–100×
Hash	Equality	10–20%	50–200×
Bitmap	Low cardinality	5–10%	20–50×
Covering	SELECT subset	30–50%	100–1000×

B-tree is the default. Supports <, >, =, BETWEEN. Moderate write cost.

Hash is faster for equality, useless for ranges.

Bitmap compresses well for columns with few distinct values (country, status, category). Fast AND/OR.

Covering includes every column the query needs. The base table is never touched. High write overhead because the index duplicates data.

When to index

Yes when: a slow query is doing a full scan, filter selectivity is high (<10% rows match), the table is large (>100K rows), reads outnumber writes 10:1+.

No when: the table is small (<10K rows), filters return most of the table, or writes dominate enough that index maintenance erases the read win.

Costs

Disk: a B-tree index is roughly 50% of the indexed column’s size. Multiple indexes multiply.

Maintenance: every INSERT/UPDATE/DELETE touches every index. For bulk loads, drop indexes, load, rebuild.

Statistics: ANALYZE after significant data changes. Stale statistics produce bad query plans.

When SIMD fails

SIMD processes 2–8 values per CPU instruction. AVX-2 is 256-bit, AVX-512 is 512-bit. Both require contiguous data in the right layout.

Layout	SIMD-friendly	Why
Struct-of-Arrays (SoA)	Yes	Contiguous fields
Array-of-Structs (AoS)	No	Scattered fields
Columnar (Arrow)	Yes	Already SoA

Consider this OHLCV struct:

struct OHLCV {
    timestamp: i64,
    open: f64,
    high: f64,
    low: f64,
    close: f64,
    volume: u64,
}

Vec<OHLCV> is AoS. To SIMD over close, you’d extract every 5th element, paying extraction overhead that exceeds the SIMD win.

I tried this on real data. Results:

Operation	Scalar (ns)	SIMD (ns)	Outcome
VWAP	1,340	2,486	0.54× (86% regression)
Average	412	687	0.60× (67% regression)
Sum	339	566	0.60× (67% regression)

Field extraction dominated. Scalar code accessed struct fields directly; SIMD code copied them into temporary arrays first.

The actual fix

Move to columnar storage. Parquet/Arrow is already SoA. Arrow compute kernels use SIMD automatically:

let close_array = batch.column(4)
    .as_any()
    .downcast_ref::<Float64Array>()?;

let predicate = arrow::compute::gt_eq(close_array, &threshold_array)?;

No manual SIMD. Arrow handles it.

The lesson: data layout matters more than instruction-level optimization. SIMD on AoS data is worse than not trying.

Decision tree

Profile first. The slow thing is rarely what you think it is.

Is the query slow?
├─ I/O-bound?
│  ├─ Selective filter? → Predicate pushdown (10×)
│  ├─ Multiple partitions? → Async I/O (3×)
│  └─ Repeated query? → Cache (100×+)
└─ CPU-bound?
   ├─ Hot aggregation, columnar data? → SIMD via Arrow
   └─ Complex computation? → Algorithm

For I/O-bound queries, optimize I/O first. CPU work won’t help.

For CPU-bound queries, verify the data layout supports the optimization you have in mind. Don’t assume.

Priority order

Profile. Measure, don’t guess.
Cache if queries repeat. 100–1000× for low effort.
Predicate pushdown if filters are selective. 10× for medium effort.
Async I/O if multi-partition. 3× for medium effort.
Indexes for point queries. 10–100× for low effort.
Only then consider CPU work.

The highest-leverage moves eliminate work entirely. Caching eliminates all of it. Pushdown eliminates ~90% of I/O. CPU optimization makes the remaining 10% twice as fast.

Eliminate work before speeding it up.

Part 2: Replacing DuckDB with Rust and predicate pushdown — 10.4× speedup.