// article
Part 1 of 2 in the Database Performance Optimization series. Part 2: the Atlas trading-system case study.
Most database performance comes from three things: configuration, indexing, and how you batch work. The technology choice matters less than people think. SQLite with the right PRAGMAs serves 100,000+ queries/second. The trading system in Part 2 went from a 6.6-second startup to 8 milliseconds with a single anti-pattern fix.
This piece is the reference I built up while doing that work — the optimizations that paid off, in roughly the order they pay off. Configuration first (1 line, 2–20× gains), indexes second (10–1000×), batching third (10–100×), pooling fourth (10–50×). They compound.
| Optimization | Impact | Effort | Priority |
|---|---|---|---|
| Enable WAL mode | 2–20× | 1 line | Critical |
| Add composite indexes | 10–1000× | Low–Medium | Critical |
| Batch operations | 10–100× | Medium | High |
| Connection pooling | 10–50× | Medium | High |
| Eliminate N+1 queries | 10× | Medium | High |
| Cache schema metadata | 30× | Low | High |
| Short transactions (<100ms) | Prevents deadlocks | Medium | High |
| Covering indexes | 50–90% I/O reduction | Medium | Medium |
| Tune remaining PRAGMAs | 10–30% | Low | Medium |
Run these immediately on every connection.
connection.executescript("""
PRAGMA journal_mode = WAL; -- 2-20× improvement
PRAGMA synchronous = normal; -- Reduces fsync overhead
PRAGMA cache_size = -64000; -- 64MB memory cache
PRAGMA temp_store = memory; -- Keep temp tables in RAM
PRAGMA mmap_size = 30000000000; -- 30GB mmap window
PRAGMA busy_timeout = 5000; -- Wait 5s for locks
PRAGMA foreign_keys = ON; -- Enable FK constraints
PRAGMA journal_size_limit = 67108864; -- Cap WAL at 64MB
""")
What each one does:
| PRAGMA | Effect | When to use |
|---|---|---|
journal_mode = WAL | Concurrent reads during writes | Always, in production |
synchronous = normal | Transaction latency from 30ms+ to <1ms | With WAL |
cache_size = -64000 | 64MB cache reduces disk I/O | When memory allows |
mmap_size | 10–30% read perf gain | 64-bit systems with RAM |
busy_timeout = 5000 | Automatic retry on locks | High-concurrency workloads |
mmap_size reserves virtual address space, not physical RAM. The OS pages in what’s actually touched. So you can request a large window on 64-bit systems without committing memory upfront.
| Platform | Recommended | Reason |
|---|---|---|
| 64-bit server | 30GB | Plenty of address space, OS manages physical |
| Mobile / embedded | 256MB | Limited virtual address space |
| 32-bit | 128MB | Address space constraints |
| Network filesystem | 0 (disabled) | mmap is incompatible with remote FS |
connection.execute("PRAGMA optimize") # Before closing connections
connection.execute("PRAGMA optimize=0x10002") # Periodic on long-running apps
cursor.execute("EXPLAIN QUERY PLAN SELECT ...") # Verify index usage
Reading EXPLAIN QUERY PLAN output:
SCAN — full table scan. Bad.SEARCH — index used. Good.COVERING INDEX — index has everything; no table lookup. Best.CREATE TABLE market_data (
symbol TEXT NOT NULL,
timestamp_us INTEGER NOT NULL,
price REAL NOT NULL,
volume INTEGER NOT NULL,
bid_price REAL,
ask_price REAL,
PRIMARY KEY (symbol, timestamp_us)
) WITHOUT ROWID, STRICT;
-- Composite: equality before range
CREATE INDEX idx_symbol_time
ON market_data(symbol, timestamp_us DESC);
-- Covering index for the common query
CREATE INDEX idx_symbol_time_price_vol
ON market_data(symbol, timestamp_us DESC, price, volume);
-- Partial index on the hot path
CREATE INDEX idx_recent_data
ON market_data(symbol, timestamp_us DESC, price, volume)
WHERE timestamp_us > unixepoch('now', '-24 hours');
| Technique | Benefit | Cost |
|---|---|---|
INTEGER timestamps | B-tree friendly, no float precision issues | Must convert from datetime |
WITHOUT ROWID | Saves 8 bytes/row | Only works when PK is unique |
STRICT (3.37.0+) | Type checking, better plans | Requires explicit types |
NOT NULL | Smaller storage, better plans | Must always provide a value |
| Partial indexes | Faster, smaller | Only helps matching predicates |
| Covering indexes | 2–5× speedup, no table lookups | 10–30% write overhead |
| Pattern | Problem | Fix |
|---|---|---|
| EAV (entity-attribute-value) | Query plans fall apart | Real columns or JSON for dynamic data |
| Comma-separated lists | Can’t index, can’t filter | Junction table with FKs |
| Polymorphic foreign keys | Type ambiguity | Separate FKs or inheritance |
| No primary key | Update/dedupe problems downstream | Always define one |
TEXT for numbers | Type conversion overhead | Use INTEGER / REAL |
Leftmost prefix matching. An index on (A, B, C) supports:
WHERE A = ?WHERE A = ? AND B = ?WHERE A = ? AND B = ? AND C = ?It does not support WHERE B = ? or WHERE C = ? alone.
Column ordering. Three rules, in priority order:
Equality columns before range columns.
-- Good: equality (symbol) before range (timestamp)
CREATE INDEX ON trades(symbol, timestamp DESC, price);
-- Bad: range before equality
CREATE INDEX ON trades(timestamp DESC, symbol, price);
Most selective columns first, when all are equality.
-- domain_id: 1,000 unique values. user_id: 10,000.
CREATE INDEX ON events(user_id, domain_id, event_type);
Follow logical hierarchy when the structure matters more than raw selectivity.
CREATE INDEX ON servers(domain_id, server_id, instance_id);
Include every column the query needs, so the table itself never has to be touched:
SELECT price, volume FROM market_data
WHERE symbol = ? AND timestamp_us > ?;
CREATE INDEX idx_covering
ON market_data(symbol, timestamp_us DESC, price, volume);
The trade: 50–90% I/O reduction on the read side, 10–30% slower writes. Worth it for read-heavy hot paths.
| Scenario | Strategy | Why |
|---|---|---|
| Read-heavy, rare writes | Covering indexes | Maximize read perf, accept write cost |
| Balanced | Standard composite | Reasonable compromise |
| Write-heavy with range queries | Partial indexes on hot data | Reduce write overhead |
| Time-series with recency bias | Partial on recent rows | Smaller, faster, less maintenance |
| Sparse columns | Partial with WHERE | Only index non-NULL rows |
Problem. Database init was taking 6,632ms, blocking 37 trading features from loading.
Before. Query information_schema.tables per table to check existence:
def _initialize_schema(self):
cursor.execute("""
SELECT COUNT(*) FROM information_schema.tables
WHERE table_schema = 'main'
""")
if cursor.fetchone()[0] == 0:
self._create_tables()
After. Let SQLite handle existence checks itself:
def _initialize_schema(self):
cursor.execute("""
CREATE TABLE IF NOT EXISTS market_data (
symbol TEXT NOT NULL,
timestamp INTEGER NOT NULL,
bid_price REAL,
ask_price REAL,
PRIMARY KEY (symbol, timestamp)
)
""")
cursor.execute("""
CREATE INDEX IF NOT EXISTS idx_market_data_symbol_time
ON market_data(symbol, timestamp DESC)
""")
Result. 6,632ms → 8ms (829×). Symbol queries: 450ms → 12ms (37×). Historical analysis: 2.3s → 85ms (27×). The full case study is in Part 2.
| Level | Consistency | Concurrency | Use |
|---|---|---|---|
| READ UNCOMMITTED | Dirty reads allowed | Highest | Analytics, approximate results |
| READ COMMITTED | Only committed data | High | OLTP default |
| REPEATABLE READ | Consistent reads | Medium | Reports needing consistency |
| SERIALIZABLE | Perfect consistency | Lowest | Critical paths only |
Default to READ COMMITTED. Escalate only when the requirement is real.
# Bad: long transaction holds locks during external I/O
def process_order(order_id):
with connection:
order = fetch_order(order_id)
validate_inventory(order) # External API call
send_confirmation_email(order) # Network I/O
update_inventory(order)
# Good: only the DB work is inside the transaction
def process_order(order_id):
order = fetch_order(order_id)
validate_inventory(order)
send_confirmation_email(order)
with connection:
update_inventory(order)
Target: keep transactions under 100ms.
In the trading system, cursor creation under a thread lock was a bottleneck. Adding a lock-free fast path for the cached case dropped it to effectively zero.
def get_cursor(self, thread_id):
# Fast path: lock-free read
if thread_id in self._cursors:
return self._cursors[thread_id]
# Slow path: lock only when allocating
with self._lock:
if thread_id not in self._cursors:
self._cursors[thread_id] = self.connection.cursor()
return self._cursors[thread_id]
Throughput: 200,000 cursor lookups/second on the fast path.
Always lock objects in a consistent order across all transactions.
def transfer(from_acct, to_acct, amount):
# Lower account ID always locked first
accounts = sorted([from_acct, to_acct], key=lambda a: a.id)
with transaction:
lock_account(accounts[0])
lock_account(accounts[1])
perform_transfer(from_acct, to_acct, amount)
Counter increments serialize. If one row is being updated by everyone, the fix is structural:
| Solution | Trade |
|---|---|
| Batch updates in memory, flush periodically | Eventual consistency; loss risk on crash |
| Shard counters across N rows | Aggregate at read time |
| Optimistic locking with version columns | Retry overhead on contention |
| Append-only event log | Storage and aggregation cost |
| Operation | Time | Throughput | Speedup |
|---|---|---|---|
| Individual INSERTs (auto-commit) | 120s | 167 rows/s | 1× |
| Batched INSERTs (manual commit) | 1s | 20,000 rows/s | 120× |
Bulk-load API (COPY, SqlBulkCopy) | 0.5s | 40,000 rows/s | 240× |
# Bad: commit per row triggers fsync per row
for row in data:
cursor.execute("INSERT INTO trades VALUES (?, ?, ?)", row)
connection.commit()
# Better: one transaction
connection.execute("BEGIN")
for row in data:
cursor.execute("INSERT INTO trades VALUES (?, ?, ?)", row)
connection.commit()
# Best: executemany
connection.execute("BEGIN")
cursor.executemany(
"INSERT INTO trades VALUES (?, ?, ?)",
data
)
connection.commit()
Python sqlite3 benchmark: executemany() is 1.7× faster than a loop with execute() for 1M rows (1.6s vs. 2.7s).
# Bad: index update per insert
CREATE INDEX idx_symbol ON trades(symbol);
INSERT INTO trades ... # millions of rows, each updating the index
# Good: bulk insert, single index build
INSERT INTO trades ...
CREATE INDEX idx_symbol ON trades(symbol);
6–7× faster on large loads.
| Database | API | Speedup |
|---|---|---|
| PostgreSQL | COPY FROM | 50–100× |
| MySQL | LOAD DATA INFILE | 50–100× |
| SQL Server | SqlBulkCopy | 50–100× |
| SQLite | executemany() in a transaction | 120× |
sqlite3 patternsimport threading
import sqlite3
class DatabaseManager:
def __init__(self, db_path):
self._db_path = db_path
self._local = threading.local()
def get_connection(self):
if not hasattr(self._local, 'connection'):
self._local.connection = sqlite3.connect(
self._db_path,
timeout=30.0,
check_same_thread=False
)
self._local.connection.execute("PRAGMA journal_mode=WAL")
return self._local.connection
db = DatabaseManager("trades.db")
conn = db.get_connection() # Unique per thread
WAL gives you concurrent reads; thread-local connections give you contention-free access.
conn = sqlite3.connect("db.sqlite", autocommit=False)
# Or, fine-grained control:
conn = sqlite3.connect("db.sqlite", autocommit=True)
conn.execute("BEGIN IMMEDIATE") # Acquire write lock upfront
try:
conn.execute("INSERT ...")
conn.execute("UPDATE ...")
conn.execute("COMMIT")
except Exception:
conn.execute("ROLLBACK")
raise
BEGIN IMMEDIATE prevents upgrade deadlocks where multiple deferred transactions compete for the same write lock.
import time
import random
import sqlite3
def execute_with_retry(conn, sql, params=None, max_retries=5):
for attempt in range(max_retries):
try:
cursor = conn.execute(sql, params or [])
conn.commit()
return cursor
except sqlite3.OperationalError as e:
if "database is locked" in str(e) and attempt < max_retries - 1:
wait = (2 ** attempt) * 0.1
jitter = random.uniform(0, wait)
time.sleep(wait + jitter)
else:
raise
Don’t retry IntegrityError (constraint violations) or ProgrammingError (code bugs). Those won’t go away with time.
with block gotcha# Wrong: connection stays open after the block
with sqlite3.connect("db.sqlite") as conn:
conn.execute("INSERT ...")
# conn is STILL OPEN here
# Right: explicit close
conn = sqlite3.connect("db.sqlite")
try:
with conn:
conn.execute("INSERT ...")
finally:
conn.close()
# Or: contextlib.closing
from contextlib import closing
with closing(sqlite3.connect("db.sqlite")) as conn:
with conn:
conn.execute("INSERT ...")
sqlite3’s with manages transactions, not the connection lifetime.
# Bad: full materialization
cursor.execute("SELECT * FROM large_table")
rows = cursor.fetchall()
for row in rows:
process(row)
# Good: iterate
for row in conn.execute("SELECT * FROM large_table"):
process(row)
if should_stop():
break
For OLTP, start at 10–20 connections per application instance.
The traditional formula:
connections = (2 × database_CPU_cores) + disk_spindle_count
Database CPUs, not application CPUs. And it’s a starting point — measure and adjust.
from queue import Queue, Empty
import sqlite3
class ConnectionPool:
def __init__(self, db_path, pool_size=10):
self._db_path = db_path
self._pool = Queue(maxsize=pool_size)
for _ in range(pool_size):
conn = sqlite3.connect(db_path, check_same_thread=False)
conn.execute("PRAGMA journal_mode=WAL")
self._pool.put(conn)
def get_connection(self, timeout=5.0):
try:
return self._pool.get(timeout=timeout)
except Empty:
raise RuntimeError("Connection pool exhausted")
def return_connection(self, conn):
self._pool.put(conn)
def __enter__(self):
self._conn = self.get_connection()
return self._conn
def __exit__(self, *args):
self.return_connection(self._conn)
pool = ConnectionPool("trades.db", pool_size=20)
with pool as conn:
conn.execute("INSERT ...")
Transaction-mode pooling multiplexes 1,000 client connections onto 25 database connections. 40× connection-side amplification.
[databases]
mydb = host=localhost dbname=mydb
[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
Three modes:
session — one DB conn per client (traditional, no amplification)transaction — one DB conn per transaction (40× amplification)statement — one DB conn per statement (breaks prepared statements)| Signal | Metric | Target | Alert |
|---|---|---|---|
| Latency | P95, P99 query time | <10ms OLTP | P99 > 50ms |
| Traffic | Queries / sec | Varies | Sudden 3× spike |
| Errors | Failed query % | <0.1% | >1% |
| Saturation | Pool utilization | <80% | >90% |
Percentiles beat averages here. P99 latency tells you what the worst-served users see; average latency hides that.
# Tier 1: liveness — microseconds
def liveness_check():
return {"status": "ok"}
# Tier 2: local health — milliseconds
def local_health():
start = time.perf_counter()
cursor = get_cursor()
cursor.execute("SELECT 1")
latency_ms = (time.perf_counter() - start) * 1000
return {
"status": "healthy" if latency_ms < 1.0 else "degraded",
"latency_ms": latency_ms,
"active_cursors": len(self._cursors)
}
# Tier 3: dependency health — seconds
def deep_health():
db_health = local_health()
api_health = check_external_api()
return {
"database": db_health,
"api": api_health,
"status": "healthy" if all_healthy() else "unhealthy"
}
A subtle rule: fail open, not closed. When every node reports unhealthy because a shared dependency is down, keep routing — declaring everything unhealthy is worse than serving degraded responses.
# 1. Identify slow queries
cursor.execute("EXPLAIN QUERY PLAN " + query)
for row in cursor:
print(row)
# 2. Check for SCAN. SEARCH or COVERING INDEX is what you want.
# 3. Refresh statistics
cursor.execute("ANALYZE")
# 4. Re-check the plan
cursor.execute("EXPLAIN QUERY PLAN " + query)
# 5. Add missing indexes
cursor.execute("CREATE INDEX idx_missing ON table(column)")
Tracing 100% of queries is too expensive. Sample 1–5%.
import random
def should_trace():
return random.random() < 0.05 # 5%
def execute_query(sql):
if should_trace():
start = time.perf_counter()
result = cursor.execute(sql)
duration = time.perf_counter() - start
log_slow_query(sql, duration)
return result
return cursor.execute(sql)
Statistical insight at ~5% of the overhead.
| Pattern | Symptom | Impact | Fix | Win |
|---|---|---|---|---|
| Schema metadata in hot path | information_schema per request | 30× slowdown | Cache schema at startup | 30× |
| N+1 queries | 1 parent + N child | 10× | JOIN or IN | 10× |
| Per-row inserts | Slow bulk ops | 100× | Batch in a transaction | 100× |
| Missing indexes | Full scans | Minutes vs. ms | Composite indexes | 1000×+ |
| Over-indexing | Slow writes | 50–100× | Drop unused indexes | 50–100× |
| Long transactions | Lock contention | Cascading delays | Keep <100ms | Avoids deadlocks |
| Hot rows | Serialized updates | Throughput cap | Shard or batch | 10×+ |
| No pooling | Connection overhead | 10–50× slower | Pool | 10–50× |
# Bad: 1 + N queries
posts = db.execute("SELECT * FROM posts WHERE user_id = ?", [user_id])
for post in posts:
comments = db.execute(
"SELECT * FROM comments WHERE post_id = ?", [post.id]
)
# Good: one query, one JOIN
posts_with_comments = db.execute("""
SELECT p.*, c.*
FROM posts p
LEFT JOIN comments c ON p.id = c.post_id
WHERE p.user_id = ?
""", [user_id])
Measured: 800 items across 17 categories — 100ms with JOIN, >1s with N+1.
# Bad: race between SELECT and INSERT
result = db.execute("SELECT id FROM users WHERE email = ?", [email])
if result.fetchone():
db.execute("UPDATE users SET name = ? WHERE email = ?", [name, email])
else:
db.execute("INSERT INTO users (email, name) VALUES (?, ?)", [email, name])
# Good: database-native UPSERT
# SQLite
db.execute("""
INSERT INTO users (email, name) VALUES (?, ?)
ON CONFLICT(email) DO UPDATE SET name = excluded.name
""", [email, name])
# PostgreSQL
db.execute("""
INSERT INTO users (email, name) VALUES (%s, %s)
ON CONFLICT (email) DO UPDATE SET name = EXCLUDED.name
""", [email, name])
# MySQL
db.execute("""
INSERT INTO users (email, name) VALUES (%s, %s)
ON DUPLICATE KEY UPDATE name = VALUES(name)
""", [email, name])
| Requirement | Spec | Why |
|---|---|---|
| Iterations | 30–100+ | Statistical significance |
| Confidence | p < 0.05 | 95% the difference is real |
| Percentiles | P50, P90, P99, P99.9 | Tail latency matters |
| Warm-up | 5–10 iterations discarded | Caches need to settle |
| Consistency | Same hardware/load/time | Eliminate confounding |
from scipy import stats
# Two-sample t-test (normal data)
t_stat, p_value = stats.ttest_ind(baseline_times, optimized_times)
if p_value < 0.05:
print(f"Significant improvement (p={p_value:.4f})")
# Mann-Whitney U (non-parametric)
u_stat, p_value = stats.mannwhitneyu(baseline_times, optimized_times)
# Wilcoxon signed-rank (paired)
w_stat, p_value = stats.wilcoxon(before_times, after_times)
| Tool | Database | Use |
|---|---|---|
| pgbench | PostgreSQL | TPC-B, custom scripts |
| mysqlslap | MySQL | Concurrency simulation |
| HammerDB | Multi-DB | TPC-C, TPC-H |
| YCSB | NoSQL / distributed | Customizable |
| sysbench | Multi-DB | Lua-scriptable |
| Database | Command | Output |
|---|---|---|
| PostgreSQL | EXPLAIN ANALYZE | Times, rows, buffers |
| MySQL (8.0.18+) | EXPLAIN ANALYZE | Cost and actual times |
| SQLite | EXPLAIN QUERY PLAN | Index usage, scan types |
| SQL Server | Query Store | Historical stats |
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM trades
WHERE symbol = 'AAPL' AND timestamp > 1234567890;
Output to watch: estimated vs. actual row counts (mismatch = stale stats), shared buffer hits (cache effectiveness), per-node execution time.
Bencher’s SQLite instance: queries running 38+ seconds. Adding composite indexes that matched the query patterns and materializing one expensive aggregation dropped queries to sub-second. 1,200×.
Lesson: optimization wins only show up at scale. The same indexes on small data produced negligible differences.
Trading system database manager. 6.6s init blocking 37 features. Root cause: information_schema queries scanning hundreds of tables on every startup. Replaced with CREATE TABLE IF NOT EXISTS. 6,644ms → 8.2ms. Full breakdown in Part 2.
2,100+ clusters, 3M queries/second:
Even at extreme scale, version upgrades and query optimization compound.
Failed at 4 concurrent requests in one configuration; stable at 16+ with P99 latency in budget in another. The differences were BEGIN IMMEDIATE for write locks, a custom busy_handler that releases Ruby’s GVL, and WAL mode.
Configuration alone moves the system from “doesn’t work” to “production viable.”
time.perf_counter() before/after the operation, 30+ samples.# 4: verify
from scipy import stats
import numpy as np
baseline = [measure_baseline() for _ in range(100)]
optimized = [measure_optimized() for _ in range(100)]
t_stat, p_value = stats.ttest_ind(baseline, optimized)
improvement = (np.mean(baseline) - np.mean(optimized)) / np.mean(baseline)
print(f"Improvement: {improvement*100:.1f}%")
print(f"Significance: p={p_value:.6f}")
# 5: monitor
def monitor_database():
metrics = {
"query_p99": measure_p99_latency(),
"connection_pool_utilization": pool.active / pool.size,
"cache_hit_ratio": get_cache_hit_ratio(),
"lock_wait_time": get_lock_wait_time()
}
for metric, value in metrics.items():
if exceeds_threshold(metric, value):
alert(f"{metric} = {value}")
Part 2: Optimizing Atlas Trading System — 810× database performance improvement.