← Research

// article

Database optimization: a working reference

October 4, 2025 Article

There’s a hierarchy. I/O reduction beats CPU optimization. Caching beats computation. Filtering early beats filtering late. Most people reach for the last category first and find out it didn’t matter.

Six techniques follow, in roughly the order they tend to pay off. Each has the same structure: what it does, when it applies, how it’s wired, and what it actually buys you. Part 2 is the DuckDB-replacement case study where these stack up to a 10.4× speedup.

Quick reference

TechniqueImpactEffortPriorityWhen
Predicate pushdown10×+MediumHighSelective queries
Row group pruning5–10×LowHighColumnar storage
Result caching100–1000×LowHighRepeated queries
Async I/O2–3×MediumMediumMulti-partition reads
Database indexes10–100×LowHighPoint queries
SIMDVariableHighLowHot CPU paths

Predicate pushdown

The idea: filter at the storage layer, not in memory after reading. Two levels of pushdown work together — row group pruning (skip entire chunks via metadata) and record batch filtering (filter rows before deserializing).

For selective queries, this is a 10–20× win on I/O. For non-selective queries, it does nothing.

SelectivityData keptI/O reductionSpeedup
High (1% match)1%90–95%10–20×
Medium (30% match)30%60–70%3–5×
Low (70% match)70%20–30%1.5–2×

Code

pub fn filter_to_arrow_predicate(
    batch: &RecordBatch,
    filter: &QueryFilter,
) -> Result<BooleanArray> {
    match filter {
        QueryFilter::MinPrice(threshold_i64) => {
            let threshold = *threshold_i64 as f64 / 10000.0;
            let close_array = batch.column(4)
                .as_any()
                .downcast_ref::<Float64Array>()?;

            let threshold_array = Float64Array::from(
                vec![threshold; close_array.len()]
            );

            // Arrow compute kernel — SIMD accelerated
            gt_eq(close_array, &threshold_array)
        }
    }
}

The Arrow gt_eq kernel is SIMD-accelerated under the hood. You don’t write vectorized code; Arrow handles it.

When it applies

Yes when: storage is columnar (Parquet, Arrow, ORC), queries have selective WHERE clauses, and I/O is the bottleneck.

No when: queries scan most of the table, the format lacks statistics, or filter selectivity is below ~30%.

Trade-offs

Requires storage statistics. Parquet and ORC write them by default; older formats may not.

False positives are possible — a row group with range [440, 460] will be read for close >= 450 even if all values are 440–449. You waste I/O, you don’t return wrong rows.

Implementation complexity is real. You need to understand the storage format internals and Arrow compute kernels.

Row group pruning

Specific case of pushdown: skip entire chunks based on per-chunk statistics.

Parquet files carry metadata for each row group:

Parquet File
├── Metadata (10KB)
│   ├── Row Group 0: {min_close: 440.0, max_close: 445.0}
│   ├── Row Group 1: {min_close: 445.0, max_close: 450.0}
│   └── Row Group 2: {min_close: 450.0, max_close: 455.0}
├── Row Group 0 (100KB compressed)
├── Row Group 1 (100KB compressed)
└── Row Group 2 (100KB compressed)

For close >= 450:

Row groupminmaxActionWhy
0440.0445.0Skipmax < 450
1445.0450.0ReadOverlaps
2450.0455.0ReadAll match

Row Group 0 is never read. 33% I/O saved on this query.

Code

fn filter_matches_row_group(
    row_group: &RowGroupMetaData,
    filter: &QueryFilter,
) -> bool {
    match filter {
        QueryFilter::MinPrice(threshold_i64) => {
            let threshold = *threshold_i64 as f64 / 10000.0;

            if let Some(stats) = row_group.column(4).statistics() {
                if let Some(max_bytes) = stats.max_bytes_opt() {
                    if let Ok(max_close) = parse_f64_from_bytes(max_bytes) {
                        return max_close >= threshold;
                    }
                }
            }

            // No stats: don't skip
            true
        }
    }
}

Missing statistics default to “match” — safe but suboptimal. Effectiveness depends on data distribution. Time-ordered data with trending values prunes well. Randomly distributed data prunes poorly.

Result caching

Stores query results in memory. For repeated queries, this skips all I/O and all computation. Hit latency is 50–100× faster than disk.

The cache key has to be everything that affects the result: symbol, time range, filters, aggregation. Hash the whole query object.

MetricWithoutWith (80% hit)Speedup
P50 latency1.5ms0.3ms
P99 latency5ms0.4ms12.5×
Throughput667 qps3,333 qps

Hit rate is the lever. Dashboards typically run at 80%; interactive analytics often exceed 90%.

Code

// Check cache
if let Some(cached_bars) = self.cache.borrow_mut().get(query) {
    let execution_time_us = start.elapsed().as_micros() as u64;
    return Ok(QueryResult {
        bars: cached_bars.clone(),
        execution_time_us,
        from_cache: true,
    });
}

// Execute
let result = storage.read_with_filters(...)?;

// Store
self.cache.borrow_mut().put(query.clone(), result.clone());

LRU eviction; add TTL if data changes faster than your cache can serve stale data.

When it applies

Yes when: queries repeat (dashboards, APIs), reads outnumber writes 10:1+, results fit in memory, brief staleness is tolerable.

No when: every query is unique, writes are frequent, memory is tight, or consistency is non-negotiable.

Invalidation

The hard part. Three approaches:

StrategyAccuracyComplexityFits
TTL (time-based)LowLowInfrequent writes
Write-throughHighMediumModerate writes
Version stampsHighHighComplex dependencies

TTL is the cheapest and most common. It will serve stale data; the question is whether the staleness window is acceptable.

Async I/O

For queries that span multiple partitions, read them in parallel. Time-series queries spanning 5 days can read 5 daily partitions concurrently.

PartitionsSequentialParallelSpeedup
11.0ms1.0ms1.0×
22.0ms1.1ms1.8×
55.0ms1.8ms2.8×
1010.0ms3.2ms3.1×

Speedup saturates around 3× — synchronization and I/O contention put a ceiling on it. SSDs scale better than HDDs; network-attached storage sometimes scales better than either.

Code

let futures: Vec<_> = partitions
    .iter()
    .map(|partition| {
        let storage = storage.clone();
        let symbol = symbol.to_string();
        let partition = partition.to_string();

        async move {
            storage.read_async(&symbol, &partition).await
        }
    })
    .collect();

let results = futures::future::join_all(futures).await;

let mut all_bars = Vec::new();
for result in results {
    all_bars.extend(result?);
}

Use tokio::task::spawn_blocking for the actual I/O so the async runtime stays responsive.

When it applies

Yes when: queries span multiple partitions, storage supports parallelism (SSD, distributed), partitions are independent.

No when: single-partition queries dominate, storage is the bottleneck and already saturated, or thread overhead exceeds the I/O time.

Indexes

Indexes are lookup structures. B-tree for range queries, hash for equality, bitmap for low-cardinality, covering for SELECT subsets. The trade is always the same: faster reads, slower writes.

TypeBest forWrite overheadRead speedup
B-treeRange queries20–30%10–100×
HashEquality10–20%50–200×
BitmapLow cardinality5–10%20–50×
CoveringSELECT subset30–50%100–1000×

B-tree is the default. Supports <, >, =, BETWEEN. Moderate write cost.

Hash is faster for equality, useless for ranges.

Bitmap compresses well for columns with few distinct values (country, status, category). Fast AND/OR.

Covering includes every column the query needs. The base table is never touched. High write overhead because the index duplicates data.

When to index

Yes when: a slow query is doing a full scan, filter selectivity is high (<10% rows match), the table is large (>100K rows), reads outnumber writes 10:1+.

No when: the table is small (<10K rows), filters return most of the table, or writes dominate enough that index maintenance erases the read win.

Costs

Disk: a B-tree index is roughly 50% of the indexed column’s size. Multiple indexes multiply.

Maintenance: every INSERT/UPDATE/DELETE touches every index. For bulk loads, drop indexes, load, rebuild.

Statistics: ANALYZE after significant data changes. Stale statistics produce bad query plans.

When SIMD fails

SIMD processes 2–8 values per CPU instruction. AVX-2 is 256-bit, AVX-512 is 512-bit. Both require contiguous data in the right layout.

LayoutSIMD-friendlyWhy
Struct-of-Arrays (SoA)YesContiguous fields
Array-of-Structs (AoS)NoScattered fields
Columnar (Arrow)YesAlready SoA

Consider this OHLCV struct:

struct OHLCV {
    timestamp: i64,
    open: f64,
    high: f64,
    low: f64,
    close: f64,
    volume: u64,
}

Vec<OHLCV> is AoS. To SIMD over close, you’d extract every 5th element, paying extraction overhead that exceeds the SIMD win.

I tried this on real data. Results:

OperationScalar (ns)SIMD (ns)Outcome
VWAP1,3402,4860.54× (86% regression)
Average4126870.60× (67% regression)
Sum3395660.60× (67% regression)

Field extraction dominated. Scalar code accessed struct fields directly; SIMD code copied them into temporary arrays first.

The actual fix

Move to columnar storage. Parquet/Arrow is already SoA. Arrow compute kernels use SIMD automatically:

let close_array = batch.column(4)
    .as_any()
    .downcast_ref::<Float64Array>()?;

let predicate = arrow::compute::gt_eq(close_array, &threshold_array)?;

No manual SIMD. Arrow handles it.

The lesson: data layout matters more than instruction-level optimization. SIMD on AoS data is worse than not trying.

Decision tree

Profile first. The slow thing is rarely what you think it is.

Is the query slow?
├─ I/O-bound?
│  ├─ Selective filter? → Predicate pushdown (10×)
│  ├─ Multiple partitions? → Async I/O (3×)
│  └─ Repeated query? → Cache (100×+)
└─ CPU-bound?
   ├─ Hot aggregation, columnar data? → SIMD via Arrow
   └─ Complex computation? → Algorithm

For I/O-bound queries, optimize I/O first. CPU work won’t help.

For CPU-bound queries, verify the data layout supports the optimization you have in mind. Don’t assume.

Priority order

  1. Profile. Measure, don’t guess.
  2. Cache if queries repeat. 100–1000× for low effort.
  3. Predicate pushdown if filters are selective. 10× for medium effort.
  4. Async I/O if multi-partition. 3× for medium effort.
  5. Indexes for point queries. 10–100× for low effort.
  6. Only then consider CPU work.

The highest-leverage moves eliminate work entirely. Caching eliminates all of it. Pushdown eliminates ~90% of I/O. CPU optimization makes the remaining 10% twice as fast.

Eliminate work before speeding it up.


Part 2: Replacing DuckDB with Rust and predicate pushdown — 10.4× speedup.