// article
There’s a hierarchy. I/O reduction beats CPU optimization. Caching beats computation. Filtering early beats filtering late. Most people reach for the last category first and find out it didn’t matter.
Six techniques follow, in roughly the order they tend to pay off. Each has the same structure: what it does, when it applies, how it’s wired, and what it actually buys you. Part 2 is the DuckDB-replacement case study where these stack up to a 10.4× speedup.
| Technique | Impact | Effort | Priority | When |
|---|---|---|---|---|
| Predicate pushdown | 10×+ | Medium | High | Selective queries |
| Row group pruning | 5–10× | Low | High | Columnar storage |
| Result caching | 100–1000× | Low | High | Repeated queries |
| Async I/O | 2–3× | Medium | Medium | Multi-partition reads |
| Database indexes | 10–100× | Low | High | Point queries |
| SIMD | Variable | High | Low | Hot CPU paths |
The idea: filter at the storage layer, not in memory after reading. Two levels of pushdown work together — row group pruning (skip entire chunks via metadata) and record batch filtering (filter rows before deserializing).
For selective queries, this is a 10–20× win on I/O. For non-selective queries, it does nothing.
| Selectivity | Data kept | I/O reduction | Speedup |
|---|---|---|---|
| High (1% match) | 1% | 90–95% | 10–20× |
| Medium (30% match) | 30% | 60–70% | 3–5× |
| Low (70% match) | 70% | 20–30% | 1.5–2× |
pub fn filter_to_arrow_predicate(
batch: &RecordBatch,
filter: &QueryFilter,
) -> Result<BooleanArray> {
match filter {
QueryFilter::MinPrice(threshold_i64) => {
let threshold = *threshold_i64 as f64 / 10000.0;
let close_array = batch.column(4)
.as_any()
.downcast_ref::<Float64Array>()?;
let threshold_array = Float64Array::from(
vec![threshold; close_array.len()]
);
// Arrow compute kernel — SIMD accelerated
gt_eq(close_array, &threshold_array)
}
}
}
The Arrow gt_eq kernel is SIMD-accelerated under the hood. You don’t write vectorized code; Arrow handles it.
Yes when: storage is columnar (Parquet, Arrow, ORC), queries have selective WHERE clauses, and I/O is the bottleneck.
No when: queries scan most of the table, the format lacks statistics, or filter selectivity is below ~30%.
Requires storage statistics. Parquet and ORC write them by default; older formats may not.
False positives are possible — a row group with range [440, 460] will be read for close >= 450 even if all values are 440–449. You waste I/O, you don’t return wrong rows.
Implementation complexity is real. You need to understand the storage format internals and Arrow compute kernels.
Specific case of pushdown: skip entire chunks based on per-chunk statistics.
Parquet files carry metadata for each row group:
Parquet File
├── Metadata (10KB)
│ ├── Row Group 0: {min_close: 440.0, max_close: 445.0}
│ ├── Row Group 1: {min_close: 445.0, max_close: 450.0}
│ └── Row Group 2: {min_close: 450.0, max_close: 455.0}
├── Row Group 0 (100KB compressed)
├── Row Group 1 (100KB compressed)
└── Row Group 2 (100KB compressed)
For close >= 450:
| Row group | min | max | Action | Why |
|---|---|---|---|---|
| 0 | 440.0 | 445.0 | Skip | max < 450 |
| 1 | 445.0 | 450.0 | Read | Overlaps |
| 2 | 450.0 | 455.0 | Read | All match |
Row Group 0 is never read. 33% I/O saved on this query.
fn filter_matches_row_group(
row_group: &RowGroupMetaData,
filter: &QueryFilter,
) -> bool {
match filter {
QueryFilter::MinPrice(threshold_i64) => {
let threshold = *threshold_i64 as f64 / 10000.0;
if let Some(stats) = row_group.column(4).statistics() {
if let Some(max_bytes) = stats.max_bytes_opt() {
if let Ok(max_close) = parse_f64_from_bytes(max_bytes) {
return max_close >= threshold;
}
}
}
// No stats: don't skip
true
}
}
}
Missing statistics default to “match” — safe but suboptimal. Effectiveness depends on data distribution. Time-ordered data with trending values prunes well. Randomly distributed data prunes poorly.
Stores query results in memory. For repeated queries, this skips all I/O and all computation. Hit latency is 50–100× faster than disk.
The cache key has to be everything that affects the result: symbol, time range, filters, aggregation. Hash the whole query object.
| Metric | Without | With (80% hit) | Speedup |
|---|---|---|---|
| P50 latency | 1.5ms | 0.3ms | 5× |
| P99 latency | 5ms | 0.4ms | 12.5× |
| Throughput | 667 qps | 3,333 qps | 5× |
Hit rate is the lever. Dashboards typically run at 80%; interactive analytics often exceed 90%.
// Check cache
if let Some(cached_bars) = self.cache.borrow_mut().get(query) {
let execution_time_us = start.elapsed().as_micros() as u64;
return Ok(QueryResult {
bars: cached_bars.clone(),
execution_time_us,
from_cache: true,
});
}
// Execute
let result = storage.read_with_filters(...)?;
// Store
self.cache.borrow_mut().put(query.clone(), result.clone());
LRU eviction; add TTL if data changes faster than your cache can serve stale data.
Yes when: queries repeat (dashboards, APIs), reads outnumber writes 10:1+, results fit in memory, brief staleness is tolerable.
No when: every query is unique, writes are frequent, memory is tight, or consistency is non-negotiable.
The hard part. Three approaches:
| Strategy | Accuracy | Complexity | Fits |
|---|---|---|---|
| TTL (time-based) | Low | Low | Infrequent writes |
| Write-through | High | Medium | Moderate writes |
| Version stamps | High | High | Complex dependencies |
TTL is the cheapest and most common. It will serve stale data; the question is whether the staleness window is acceptable.
For queries that span multiple partitions, read them in parallel. Time-series queries spanning 5 days can read 5 daily partitions concurrently.
| Partitions | Sequential | Parallel | Speedup |
|---|---|---|---|
| 1 | 1.0ms | 1.0ms | 1.0× |
| 2 | 2.0ms | 1.1ms | 1.8× |
| 5 | 5.0ms | 1.8ms | 2.8× |
| 10 | 10.0ms | 3.2ms | 3.1× |
Speedup saturates around 3× — synchronization and I/O contention put a ceiling on it. SSDs scale better than HDDs; network-attached storage sometimes scales better than either.
let futures: Vec<_> = partitions
.iter()
.map(|partition| {
let storage = storage.clone();
let symbol = symbol.to_string();
let partition = partition.to_string();
async move {
storage.read_async(&symbol, &partition).await
}
})
.collect();
let results = futures::future::join_all(futures).await;
let mut all_bars = Vec::new();
for result in results {
all_bars.extend(result?);
}
Use tokio::task::spawn_blocking for the actual I/O so the async runtime stays responsive.
Yes when: queries span multiple partitions, storage supports parallelism (SSD, distributed), partitions are independent.
No when: single-partition queries dominate, storage is the bottleneck and already saturated, or thread overhead exceeds the I/O time.
Indexes are lookup structures. B-tree for range queries, hash for equality, bitmap for low-cardinality, covering for SELECT subsets. The trade is always the same: faster reads, slower writes.
| Type | Best for | Write overhead | Read speedup |
|---|---|---|---|
| B-tree | Range queries | 20–30% | 10–100× |
| Hash | Equality | 10–20% | 50–200× |
| Bitmap | Low cardinality | 5–10% | 20–50× |
| Covering | SELECT subset | 30–50% | 100–1000× |
B-tree is the default. Supports <, >, =, BETWEEN. Moderate write cost.
Hash is faster for equality, useless for ranges.
Bitmap compresses well for columns with few distinct values (country, status, category). Fast AND/OR.
Covering includes every column the query needs. The base table is never touched. High write overhead because the index duplicates data.
Yes when: a slow query is doing a full scan, filter selectivity is high (<10% rows match), the table is large (>100K rows), reads outnumber writes 10:1+.
No when: the table is small (<10K rows), filters return most of the table, or writes dominate enough that index maintenance erases the read win.
Disk: a B-tree index is roughly 50% of the indexed column’s size. Multiple indexes multiply.
Maintenance: every INSERT/UPDATE/DELETE touches every index. For bulk loads, drop indexes, load, rebuild.
Statistics: ANALYZE after significant data changes. Stale statistics produce bad query plans.
SIMD processes 2–8 values per CPU instruction. AVX-2 is 256-bit, AVX-512 is 512-bit. Both require contiguous data in the right layout.
| Layout | SIMD-friendly | Why |
|---|---|---|
| Struct-of-Arrays (SoA) | Yes | Contiguous fields |
| Array-of-Structs (AoS) | No | Scattered fields |
| Columnar (Arrow) | Yes | Already SoA |
Consider this OHLCV struct:
struct OHLCV {
timestamp: i64,
open: f64,
high: f64,
low: f64,
close: f64,
volume: u64,
}
Vec<OHLCV> is AoS. To SIMD over close, you’d extract every 5th element, paying extraction overhead that exceeds the SIMD win.
I tried this on real data. Results:
| Operation | Scalar (ns) | SIMD (ns) | Outcome |
|---|---|---|---|
| VWAP | 1,340 | 2,486 | 0.54× (86% regression) |
| Average | 412 | 687 | 0.60× (67% regression) |
| Sum | 339 | 566 | 0.60× (67% regression) |
Field extraction dominated. Scalar code accessed struct fields directly; SIMD code copied them into temporary arrays first.
Move to columnar storage. Parquet/Arrow is already SoA. Arrow compute kernels use SIMD automatically:
let close_array = batch.column(4)
.as_any()
.downcast_ref::<Float64Array>()?;
let predicate = arrow::compute::gt_eq(close_array, &threshold_array)?;
No manual SIMD. Arrow handles it.
The lesson: data layout matters more than instruction-level optimization. SIMD on AoS data is worse than not trying.
Profile first. The slow thing is rarely what you think it is.
Is the query slow?
├─ I/O-bound?
│ ├─ Selective filter? → Predicate pushdown (10×)
│ ├─ Multiple partitions? → Async I/O (3×)
│ └─ Repeated query? → Cache (100×+)
└─ CPU-bound?
├─ Hot aggregation, columnar data? → SIMD via Arrow
└─ Complex computation? → Algorithm
For I/O-bound queries, optimize I/O first. CPU work won’t help.
For CPU-bound queries, verify the data layout supports the optimization you have in mind. Don’t assume.
The highest-leverage moves eliminate work entirely. Caching eliminates all of it. Pushdown eliminates ~90% of I/O. CPU optimization makes the remaining 10% twice as fast.
Eliminate work before speeding it up.
Part 2: Replacing DuckDB with Rust and predicate pushdown — 10.4× speedup.