← Research

Part 1 of 2 in the Database Performance Optimization series. Part 2: the Atlas trading-system case study.

Most database performance comes from three things: configuration, indexing, and how you batch work. The technology choice matters less than people think. SQLite with the right PRAGMAs serves 100,000+ queries/second. The trading system in Part 2 went from a 6.6-second startup to 8 milliseconds with a single anti-pattern fix.

This piece is the reference I built up while doing that work — the optimizations that paid off, in roughly the order they pay off. Configuration first (1 line, 2–20× gains), indexes second (10–1000×), batching third (10–100×), pooling fourth (10–50×). They compound.

Quick reference

OptimizationImpactEffortPriority
Enable WAL mode2–20×1 lineCritical
Add composite indexes10–1000×Low–MediumCritical
Batch operations10–100×MediumHigh
Connection pooling10–50×MediumHigh
Eliminate N+1 queries10×MediumHigh
Cache schema metadata30×LowHigh
Short transactions (<100ms)Prevents deadlocksMediumHigh
Covering indexes50–90% I/O reductionMediumMedium
Tune remaining PRAGMAs10–30%LowMedium

SQLite configuration

The eight PRAGMAs that matter

Run these immediately on every connection.

connection.executescript("""
    PRAGMA journal_mode = WAL;              -- 2-20× improvement
    PRAGMA synchronous = normal;            -- Reduces fsync overhead
    PRAGMA cache_size = -64000;             -- 64MB memory cache
    PRAGMA temp_store = memory;             -- Keep temp tables in RAM
    PRAGMA mmap_size = 30000000000;         -- 30GB mmap window
    PRAGMA busy_timeout = 5000;             -- Wait 5s for locks
    PRAGMA foreign_keys = ON;               -- Enable FK constraints
    PRAGMA journal_size_limit = 67108864;   -- Cap WAL at 64MB
""")

What each one does:

PRAGMAEffectWhen to use
journal_mode = WALConcurrent reads during writesAlways, in production
synchronous = normalTransaction latency from 30ms+ to <1msWith WAL
cache_size = -6400064MB cache reduces disk I/OWhen memory allows
mmap_size10–30% read perf gain64-bit systems with RAM
busy_timeout = 5000Automatic retry on locksHigh-concurrency workloads

mmap sizing

mmap_size reserves virtual address space, not physical RAM. The OS pages in what’s actually touched. So you can request a large window on 64-bit systems without committing memory upfront.

PlatformRecommendedReason
64-bit server30GBPlenty of address space, OS manages physical
Mobile / embedded256MBLimited virtual address space
32-bit128MBAddress space constraints
Network filesystem0 (disabled)mmap is incompatible with remote FS

Query plan checks

connection.execute("PRAGMA optimize")              # Before closing connections
connection.execute("PRAGMA optimize=0x10002")      # Periodic on long-running apps
cursor.execute("EXPLAIN QUERY PLAN SELECT ...")    # Verify index usage

Reading EXPLAIN QUERY PLAN output:

  • SCAN — full table scan. Bad.
  • SEARCH — index used. Good.
  • COVERING INDEX — index has everything; no table lookup. Best.

Schema design for time-series

CREATE TABLE market_data (
    symbol TEXT NOT NULL,
    timestamp_us INTEGER NOT NULL,
    price REAL NOT NULL,
    volume INTEGER NOT NULL,
    bid_price REAL,
    ask_price REAL,
    PRIMARY KEY (symbol, timestamp_us)
) WITHOUT ROWID, STRICT;

-- Composite: equality before range
CREATE INDEX idx_symbol_time
ON market_data(symbol, timestamp_us DESC);

-- Covering index for the common query
CREATE INDEX idx_symbol_time_price_vol
ON market_data(symbol, timestamp_us DESC, price, volume);

-- Partial index on the hot path
CREATE INDEX idx_recent_data
ON market_data(symbol, timestamp_us DESC, price, volume)
WHERE timestamp_us > unixepoch('now', '-24 hours');
TechniqueBenefitCost
INTEGER timestampsB-tree friendly, no float precision issuesMust convert from datetime
WITHOUT ROWIDSaves 8 bytes/rowOnly works when PK is unique
STRICT (3.37.0+)Type checking, better plansRequires explicit types
NOT NULLSmaller storage, better plansMust always provide a value
Partial indexesFaster, smallerOnly helps matching predicates
Covering indexes2–5× speedup, no table lookups10–30% write overhead

Anti-patterns

PatternProblemFix
EAV (entity-attribute-value)Query plans fall apartReal columns or JSON for dynamic data
Comma-separated listsCan’t index, can’t filterJunction table with FKs
Polymorphic foreign keysType ambiguitySeparate FKs or inheritance
No primary keyUpdate/dedupe problems downstreamAlways define one
TEXT for numbersType conversion overheadUse INTEGER / REAL

Indexing

Composite index rules

Leftmost prefix matching. An index on (A, B, C) supports:

  • WHERE A = ?
  • WHERE A = ? AND B = ?
  • WHERE A = ? AND B = ? AND C = ?

It does not support WHERE B = ? or WHERE C = ? alone.

Column ordering. Three rules, in priority order:

  1. Equality columns before range columns.

    -- Good: equality (symbol) before range (timestamp)
    CREATE INDEX ON trades(symbol, timestamp DESC, price);
    
    -- Bad: range before equality
    CREATE INDEX ON trades(timestamp DESC, symbol, price);
  2. Most selective columns first, when all are equality.

    -- domain_id: 1,000 unique values. user_id: 10,000.
    CREATE INDEX ON events(user_id, domain_id, event_type);
  3. Follow logical hierarchy when the structure matters more than raw selectivity.

    CREATE INDEX ON servers(domain_id, server_id, instance_id);

Covering indexes

Include every column the query needs, so the table itself never has to be touched:

SELECT price, volume FROM market_data
WHERE symbol = ? AND timestamp_us > ?;

CREATE INDEX idx_covering
ON market_data(symbol, timestamp_us DESC, price, volume);

The trade: 50–90% I/O reduction on the read side, 10–30% slower writes. Worth it for read-heavy hot paths.

Index decision matrix

ScenarioStrategyWhy
Read-heavy, rare writesCovering indexesMaximize read perf, accept write cost
BalancedStandard compositeReasonable compromise
Write-heavy with range queriesPartial indexes on hot dataReduce write overhead
Time-series with recency biasPartial on recent rowsSmaller, faster, less maintenance
Sparse columnsPartial with WHEREOnly index non-NULL rows

Worked example: 829× initialization speedup

Problem. Database init was taking 6,632ms, blocking 37 trading features from loading.

Before. Query information_schema.tables per table to check existence:

def _initialize_schema(self):
    cursor.execute("""
        SELECT COUNT(*) FROM information_schema.tables
        WHERE table_schema = 'main'
    """)
    if cursor.fetchone()[0] == 0:
        self._create_tables()

After. Let SQLite handle existence checks itself:

def _initialize_schema(self):
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS market_data (
            symbol TEXT NOT NULL,
            timestamp INTEGER NOT NULL,
            bid_price REAL,
            ask_price REAL,
            PRIMARY KEY (symbol, timestamp)
        )
    """)

    cursor.execute("""
        CREATE INDEX IF NOT EXISTS idx_market_data_symbol_time
        ON market_data(symbol, timestamp DESC)
    """)

Result. 6,632ms → 8ms (829×). Symbol queries: 450ms → 12ms (37×). Historical analysis: 2.3s → 85ms (27×). The full case study is in Part 2.

Transactions and locks

Isolation levels

LevelConsistencyConcurrencyUse
READ UNCOMMITTEDDirty reads allowedHighestAnalytics, approximate results
READ COMMITTEDOnly committed dataHighOLTP default
REPEATABLE READConsistent readsMediumReports needing consistency
SERIALIZABLEPerfect consistencyLowestCritical paths only

Default to READ COMMITTED. Escalate only when the requirement is real.

Keep transactions short

# Bad: long transaction holds locks during external I/O
def process_order(order_id):
    with connection:
        order = fetch_order(order_id)
        validate_inventory(order)        # External API call
        send_confirmation_email(order)   # Network I/O
        update_inventory(order)

# Good: only the DB work is inside the transaction
def process_order(order_id):
    order = fetch_order(order_id)
    validate_inventory(order)
    send_confirmation_email(order)

    with connection:
        update_inventory(order)

Target: keep transactions under 100ms.

Double-checked locking for cursor reuse

In the trading system, cursor creation under a thread lock was a bottleneck. Adding a lock-free fast path for the cached case dropped it to effectively zero.

def get_cursor(self, thread_id):
    # Fast path: lock-free read
    if thread_id in self._cursors:
        return self._cursors[thread_id]

    # Slow path: lock only when allocating
    with self._lock:
        if thread_id not in self._cursors:
            self._cursors[thread_id] = self.connection.cursor()
        return self._cursors[thread_id]

Throughput: 200,000 cursor lookups/second on the fast path.

Deadlock prevention

Always lock objects in a consistent order across all transactions.

def transfer(from_acct, to_acct, amount):
    # Lower account ID always locked first
    accounts = sorted([from_acct, to_acct], key=lambda a: a.id)
    with transaction:
        lock_account(accounts[0])
        lock_account(accounts[1])
        perform_transfer(from_acct, to_acct, amount)

Hot rows

Counter increments serialize. If one row is being updated by everyone, the fix is structural:

SolutionTrade
Batch updates in memory, flush periodicallyEventual consistency; loss risk on crash
Shard counters across N rowsAggregate at read time
Optimistic locking with version columnsRetry overhead on contention
Append-only event logStorage and aggregation cost

Batching

The numbers

OperationTimeThroughputSpeedup
Individual INSERTs (auto-commit)120s167 rows/s
Batched INSERTs (manual commit)1s20,000 rows/s120×
Bulk-load API (COPY, SqlBulkCopy)0.5s40,000 rows/s240×

Three batching levels

# Bad: commit per row triggers fsync per row
for row in data:
    cursor.execute("INSERT INTO trades VALUES (?, ?, ?)", row)
    connection.commit()

# Better: one transaction
connection.execute("BEGIN")
for row in data:
    cursor.execute("INSERT INTO trades VALUES (?, ?, ?)", row)
connection.commit()

# Best: executemany
connection.execute("BEGIN")
cursor.executemany(
    "INSERT INTO trades VALUES (?, ?, ?)",
    data
)
connection.commit()

Python sqlite3 benchmark: executemany() is 1.7× faster than a loop with execute() for 1M rows (1.6s vs. 2.7s).

Build indexes after bulk loads

# Bad: index update per insert
CREATE INDEX idx_symbol ON trades(symbol);
INSERT INTO trades ...   # millions of rows, each updating the index

# Good: bulk insert, single index build
INSERT INTO trades ...
CREATE INDEX idx_symbol ON trades(symbol);

6–7× faster on large loads.

Native bulk APIs

DatabaseAPISpeedup
PostgreSQLCOPY FROM50–100×
MySQLLOAD DATA INFILE50–100×
SQL ServerSqlBulkCopy50–100×
SQLiteexecutemany() in a transaction120×

Python sqlite3 patterns

Thread-local connections

import threading
import sqlite3

class DatabaseManager:
    def __init__(self, db_path):
        self._db_path = db_path
        self._local = threading.local()

    def get_connection(self):
        if not hasattr(self._local, 'connection'):
            self._local.connection = sqlite3.connect(
                self._db_path,
                timeout=30.0,
                check_same_thread=False
            )
            self._local.connection.execute("PRAGMA journal_mode=WAL")
        return self._local.connection

db = DatabaseManager("trades.db")
conn = db.get_connection()  # Unique per thread

WAL gives you concurrent reads; thread-local connections give you contention-free access.

Explicit transactions (Python 3.12+)

conn = sqlite3.connect("db.sqlite", autocommit=False)

# Or, fine-grained control:
conn = sqlite3.connect("db.sqlite", autocommit=True)
conn.execute("BEGIN IMMEDIATE")  # Acquire write lock upfront
try:
    conn.execute("INSERT ...")
    conn.execute("UPDATE ...")
    conn.execute("COMMIT")
except Exception:
    conn.execute("ROLLBACK")
    raise

BEGIN IMMEDIATE prevents upgrade deadlocks where multiple deferred transactions compete for the same write lock.

Retry with exponential backoff

import time
import random
import sqlite3

def execute_with_retry(conn, sql, params=None, max_retries=5):
    for attempt in range(max_retries):
        try:
            cursor = conn.execute(sql, params or [])
            conn.commit()
            return cursor
        except sqlite3.OperationalError as e:
            if "database is locked" in str(e) and attempt < max_retries - 1:
                wait = (2 ** attempt) * 0.1
                jitter = random.uniform(0, wait)
                time.sleep(wait + jitter)
            else:
                raise

Don’t retry IntegrityError (constraint violations) or ProgrammingError (code bugs). Those won’t go away with time.

The with block gotcha

# Wrong: connection stays open after the block
with sqlite3.connect("db.sqlite") as conn:
    conn.execute("INSERT ...")
# conn is STILL OPEN here

# Right: explicit close
conn = sqlite3.connect("db.sqlite")
try:
    with conn:
        conn.execute("INSERT ...")
finally:
    conn.close()

# Or: contextlib.closing
from contextlib import closing
with closing(sqlite3.connect("db.sqlite")) as conn:
    with conn:
        conn.execute("INSERT ...")

sqlite3’s with manages transactions, not the connection lifetime.

Stream large results

# Bad: full materialization
cursor.execute("SELECT * FROM large_table")
rows = cursor.fetchall()
for row in rows:
    process(row)

# Good: iterate
for row in conn.execute("SELECT * FROM large_table"):
    process(row)
    if should_stop():
        break

Connection pooling

Sizing

For OLTP, start at 10–20 connections per application instance.

The traditional formula:

connections = (2 × database_CPU_cores) + disk_spindle_count

Database CPUs, not application CPUs. And it’s a starting point — measure and adjust.

A minimal pool

from queue import Queue, Empty
import sqlite3

class ConnectionPool:
    def __init__(self, db_path, pool_size=10):
        self._db_path = db_path
        self._pool = Queue(maxsize=pool_size)
        for _ in range(pool_size):
            conn = sqlite3.connect(db_path, check_same_thread=False)
            conn.execute("PRAGMA journal_mode=WAL")
            self._pool.put(conn)

    def get_connection(self, timeout=5.0):
        try:
            return self._pool.get(timeout=timeout)
        except Empty:
            raise RuntimeError("Connection pool exhausted")

    def return_connection(self, conn):
        self._pool.put(conn)

    def __enter__(self):
        self._conn = self.get_connection()
        return self._conn

    def __exit__(self, *args):
        self.return_connection(self._conn)

pool = ConnectionPool("trades.db", pool_size=20)
with pool as conn:
    conn.execute("INSERT ...")

PgBouncer for PostgreSQL

Transaction-mode pooling multiplexes 1,000 client connections onto 25 database connections. 40× connection-side amplification.

[databases]
mydb = host=localhost dbname=mydb

[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25

Three modes:

  • session — one DB conn per client (traditional, no amplification)
  • transaction — one DB conn per transaction (40× amplification)
  • statement — one DB conn per statement (breaks prepared statements)

Monitoring

Four signals worth watching

SignalMetricTargetAlert
LatencyP95, P99 query time<10ms OLTPP99 > 50ms
TrafficQueries / secVariesSudden 3× spike
ErrorsFailed query %<0.1%>1%
SaturationPool utilization<80%>90%

Percentiles beat averages here. P99 latency tells you what the worst-served users see; average latency hides that.

Three health-check tiers

# Tier 1: liveness — microseconds
def liveness_check():
    return {"status": "ok"}

# Tier 2: local health — milliseconds
def local_health():
    start = time.perf_counter()
    cursor = get_cursor()
    cursor.execute("SELECT 1")
    latency_ms = (time.perf_counter() - start) * 1000
    return {
        "status": "healthy" if latency_ms < 1.0 else "degraded",
        "latency_ms": latency_ms,
        "active_cursors": len(self._cursors)
    }

# Tier 3: dependency health — seconds
def deep_health():
    db_health = local_health()
    api_health = check_external_api()
    return {
        "database": db_health,
        "api": api_health,
        "status": "healthy" if all_healthy() else "unhealthy"
    }

A subtle rule: fail open, not closed. When every node reports unhealthy because a shared dependency is down, keep routing — declaring everything unhealthy is worse than serving degraded responses.

Profiling workflow

# 1. Identify slow queries
cursor.execute("EXPLAIN QUERY PLAN " + query)
for row in cursor:
    print(row)

# 2. Check for SCAN. SEARCH or COVERING INDEX is what you want.

# 3. Refresh statistics
cursor.execute("ANALYZE")

# 4. Re-check the plan
cursor.execute("EXPLAIN QUERY PLAN " + query)

# 5. Add missing indexes
cursor.execute("CREATE INDEX idx_missing ON table(column)")

Sampling

Tracing 100% of queries is too expensive. Sample 1–5%.

import random

def should_trace():
    return random.random() < 0.05  # 5%

def execute_query(sql):
    if should_trace():
        start = time.perf_counter()
        result = cursor.execute(sql)
        duration = time.perf_counter() - start
        log_slow_query(sql, duration)
        return result
    return cursor.execute(sql)

Statistical insight at ~5% of the overhead.

Anti-patterns

PatternSymptomImpactFixWin
Schema metadata in hot pathinformation_schema per request30× slowdownCache schema at startup30×
N+1 queries1 parent + N child10×JOIN or IN10×
Per-row insertsSlow bulk ops100×Batch in a transaction100×
Missing indexesFull scansMinutes vs. msComposite indexes1000×+
Over-indexingSlow writes50–100×Drop unused indexes50–100×
Long transactionsLock contentionCascading delaysKeep <100msAvoids deadlocks
Hot rowsSerialized updatesThroughput capShard or batch10×+
No poolingConnection overhead10–50× slowerPool10–50×

N+1

# Bad: 1 + N queries
posts = db.execute("SELECT * FROM posts WHERE user_id = ?", [user_id])
for post in posts:
    comments = db.execute(
        "SELECT * FROM comments WHERE post_id = ?", [post.id]
    )

# Good: one query, one JOIN
posts_with_comments = db.execute("""
    SELECT p.*, c.*
    FROM posts p
    LEFT JOIN comments c ON p.id = c.post_id
    WHERE p.user_id = ?
""", [user_id])

Measured: 800 items across 17 categories — 100ms with JOIN, >1s with N+1.

UPSERT race condition

# Bad: race between SELECT and INSERT
result = db.execute("SELECT id FROM users WHERE email = ?", [email])
if result.fetchone():
    db.execute("UPDATE users SET name = ? WHERE email = ?", [name, email])
else:
    db.execute("INSERT INTO users (email, name) VALUES (?, ?)", [email, name])

# Good: database-native UPSERT
# SQLite
db.execute("""
    INSERT INTO users (email, name) VALUES (?, ?)
    ON CONFLICT(email) DO UPDATE SET name = excluded.name
""", [email, name])

# PostgreSQL
db.execute("""
    INSERT INTO users (email, name) VALUES (%s, %s)
    ON CONFLICT (email) DO UPDATE SET name = EXCLUDED.name
""", [email, name])

# MySQL
db.execute("""
    INSERT INTO users (email, name) VALUES (%s, %s)
    ON DUPLICATE KEY UPDATE name = VALUES(name)
""", [email, name])

Benchmarking

What to measure

RequirementSpecWhy
Iterations30–100+Statistical significance
Confidencep < 0.0595% the difference is real
PercentilesP50, P90, P99, P99.9Tail latency matters
Warm-up5–10 iterations discardedCaches need to settle
ConsistencySame hardware/load/timeEliminate confounding

Tests

from scipy import stats

# Two-sample t-test (normal data)
t_stat, p_value = stats.ttest_ind(baseline_times, optimized_times)
if p_value < 0.05:
    print(f"Significant improvement (p={p_value:.4f})")

# Mann-Whitney U (non-parametric)
u_stat, p_value = stats.mannwhitneyu(baseline_times, optimized_times)

# Wilcoxon signed-rank (paired)
w_stat, p_value = stats.wilcoxon(before_times, after_times)

Tools

ToolDatabaseUse
pgbenchPostgreSQLTPC-B, custom scripts
mysqlslapMySQLConcurrency simulation
HammerDBMulti-DBTPC-C, TPC-H
YCSBNoSQL / distributedCustomizable
sysbenchMulti-DBLua-scriptable

Profiling commands

DatabaseCommandOutput
PostgreSQLEXPLAIN ANALYZETimes, rows, buffers
MySQL (8.0.18+)EXPLAIN ANALYZECost and actual times
SQLiteEXPLAIN QUERY PLANIndex usage, scan types
SQL ServerQuery StoreHistorical stats
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM trades
WHERE symbol = 'AAPL' AND timestamp > 1234567890;

Output to watch: estimated vs. actual row counts (mismatch = stale stats), shared buffer hits (cache effectiveness), per-node execution time.

Three case studies

1,200× via indexing

Bencher’s SQLite instance: queries running 38+ seconds. Adding composite indexes that matched the query patterns and materializing one expensive aggregation dropped queries to sub-second. 1,200×.

Lesson: optimization wins only show up at scale. The same indexes on small data produced negligible differences.

794× via removing a metadata query

Trading system database manager. 6.6s init blocking 37 features. Root cause: information_schema queries scanning hundreds of tables on every startup. Replaced with CREATE TABLE IF NOT EXISTS. 6,644ms → 8.2ms. Full breakdown in Part 2.

Uber’s MySQL 8.0 fleet upgrade

2,100+ clusters, 3M queries/second:

  • Inserts: 29% faster
  • Reads: 33% faster
  • Updates: 47% faster
  • Client queries: 78% faster
  • Lock time: 94% reduced

Even at extreme scale, version upgrades and query optimization compound.

SQLite-on-Rails

Failed at 4 concurrent requests in one configuration; stable at 16+ with P99 latency in budget in another. The differences were BEGIN IMMEDIATE for write locks, a custom busy_handler that releases Ruby’s GVL, and WAL mode.

Configuration alone moves the system from “doesn’t work” to “production viable.”

A workflow

  1. Measure baseline. time.perf_counter() before/after the operation, 30+ samples.
  2. Profile. Find the actual bottleneck. The slowest function is rarely the one you suspected.
  3. Apply. In priority order: configuration → indexes → batching → pooling → schema.
  4. Verify. Statistical test on baseline vs. optimized. Don’t trust eyeballed numbers.
  5. Monitor. Continuous P99, pool utilization, cache hit ratio, lock waits.
# 4: verify
from scipy import stats
import numpy as np

baseline = [measure_baseline() for _ in range(100)]
optimized = [measure_optimized() for _ in range(100)]

t_stat, p_value = stats.ttest_ind(baseline, optimized)
improvement = (np.mean(baseline) - np.mean(optimized)) / np.mean(baseline)

print(f"Improvement: {improvement*100:.1f}%")
print(f"Significance: p={p_value:.6f}")
# 5: monitor
def monitor_database():
    metrics = {
        "query_p99": measure_p99_latency(),
        "connection_pool_utilization": pool.active / pool.size,
        "cache_hit_ratio": get_cache_hit_ratio(),
        "lock_wait_time": get_lock_wait_time()
    }
    for metric, value in metrics.items():
        if exceeds_threshold(metric, value):
            alert(f"{metric} = {value}")

Part 2: Optimizing Atlas Trading System — 810× database performance improvement.