Deploying ML in production: a working reference (Part 1)

October 3, 2025 Article

Part 1 of 2 in the ML deployment series. Part 2: building Atlas’s forecasting system in production.

Production ML involves 5–10× more infrastructure code than model-training code [1]. The model is the part everyone focuses on; the rest (serving, containerization, lifecycle management, monitoring, drift) determines whether it works. This piece is the reference I’ve used to think through each stage, drawing on practice from Netflix [2][3], Uber [4][5], AWS [6], and NVIDIA [7][8][9] alongside systems I’ve built.

Quick reference: deployment patterns

| Pattern | Latency | Throughput | Fits | Complexity |
|---|---|---|---|---|
| REST API (FastAPI) | 50–200ms | Medium | Public APIs, <10KB payloads | Low |
| gRPC | 10–50ms | High | Internal services, >100KB data | Medium |
| Kafka streaming | <10ms | Very high | Real-time events, millions msg/s | High |
| Batch (Celery) | Minutes–hours | Very high | Offline scoring, cacheable inputs | Low |
| Serverless (Lambda) | Variable | Low–medium | <1,000 req/day, sporadic traffic | Low |

Serving architectures

REST API

The subtle failure mode: synchronous model inference inside an async framework blocks the event loop, degrading I/O performance 10–20× (5ms → 100ms). FastAPI plus a model is a textbook example.

The fix is to push the model into a process pool and await it.

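import asyncio
from concurrent.futures import ProcessPoolExecutor

from fastapi import FastAPI
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer

model = None  # set once per worker process by the initializer

def create_model():
    # Runs inside each worker process, so the model is never pickled.
    global model
    model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')

def model_predict(text: str) -> list:
    # Executes in the worker; return a plain list so the result is JSON-serializable.
    return model.encode(text).tolist()

class PredictionRequest(BaseModel):
    text: str

pool = ProcessPoolExecutor(max_workers=1, initializer=create_model)
app = FastAPI()

async def predict_async(text: str) -> list:
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(pool, model_predict, text)

@app.post("/predict")
async def predict(request: PredictionRequest):
    result = await predict_async(request.text)
    return {"prediction": result}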

CPU-bound inference runs in the pool; async I/O stays responsive.

gRPC

For internal service-to-service traffic, gRPC consistently delivers 9–10× the throughput of REST [10], with P99 latency roughly 4× better at 500 concurrent threads (7s vs. 30s). Computer vision models see 75–85% latency reductions:

  • MobileNetV2: 100ms → 25ms
  • EfficientDetD1: 200ms → 30ms

Three reasons it wins:

  • Binary protobuf, 30% smaller than JSON
  • HTTP/2 multiplexing eliminates 40–60% connection overhead
  • Native bidirectional streaming
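The payload-size point is easy to check for numeric data. The sketch below is only an illustration: it uses `struct` packing as a stand-in for protobuf's packed float encoding (an assumption; real protobuf adds small field tags and length prefixes) and compares it with JSON for a hypothetical 384-dimensional embedding. Numeric-heavy payloads like this usually shrink by much more than the 30% figure, while text-heavy payloads shrink less.

```python
import json
import random
import struct

def json_size(vector):
    """Bytes needed to send the vector as a JSON array."""
    return len(json.dumps(vector).encode("utf-8"))

def packed_float32_size(vector):
    """Bytes needed for the same values as packed little-endian float32."""
    return len(struct.pack(f"<{len(vector)}f", *vector))

random.seed(0)
# Hypothetical embedding; the dimension is an assumption for illustration.
embedding = [random.uniform(-1.0, 1.0) for _ in range(384)]

json_bytes = json_size(embedding)
binary_bytes = packed_float32_size(embedding)
print(f"JSON: {json_bytes} B, packed float32: {binary_bytes} B")
```

Note that float32 packing also drops precision relative to JSON's decimal doubles, which is normally acceptable for model inputs.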

Kafka streaming

Millions of messages per second with single-digit-ms P99. Two patterns:

| Pattern | Latency | Coupling | Fits |
|---|---|---|---|
| Remote serving | Higher | Coupled availability | Separation of concerns |
| Embedded inference | Lowest | Decoupled | Best latency, exactly-once semantics |

Embedded looks like:

Kafka → Flink (with embedded model) → Kafka

Good for fraud detection, real-time recommendations, IoT telemetry.

Batch vs. real-time

| Criterion | Batch | Real-time |
|---|---|---|
| Update latency | Hours/days OK | Seconds required |
| Input space | Cacheable, limited variety | Unpredictable |
| Traffic | Predictable, periodic | Sporadic, on-demand |
| Cost sensitivity | Optimize compute | Optimize experience |

Service mesh overhead

| Mesh | Latency overhead | When |
|---|---|---|
| Linkerd | 5–10% | Latency-critical ML, GPU cost significant |
| Istio | 25–35% | Complex routing, multi-cluster, large teams |

Small teams default to Linkerd. Platform teams with feature requirements default to Istio [11][12].

Containerization and orchestration

Multi-stage Docker builds

The right multi-stage pattern takes an ML image from 450MB to 158MB:

# Stage 1: dependency builder
FROM python:3.11-slim as deps-builder
RUN apt-get update && apt-get install -y build-essential && \
    rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt .
RUN python -m venv /venv && \
    /venv/bin/pip install --no-cache-dir -r requirements.txt

# Stage 2: model preparation
FROM python:3.11-slim as model-stage
COPY --from=deps-builder /venv /venv
ENV PATH="/venv/bin:$PATH"
WORKDIR /app
COPY download_models.py ./
RUN python download_models.py

# Stage 3: production runtime
FROM python:3.11-slim as runtime
RUN apt-get update && apt-get install -y libgomp1 && \
    rm -rf /var/lib/apt/lists/* && \
    groupadd -r appuser && useradd -r -g appuser appuser

COPY --from=deps-builder /venv /venv
ENV PATH="/venv/bin:$PATH"
COPY --from=model-stage /app/models /models
WORKDIR /app
COPY . .
RUN chown -R appuser:appuser /app
USER appuser

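# urlopen raises on connection errors and HTTP error statuses, so a failing endpoint fails the check.
HEALTHCHECK --interval=30s --timeout=10s \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=5)"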

EXPOSE 8000
CMD ["gunicorn", "--bind", "0.0.0.0:8000", "--workers", "2", "app:app"]

Use python:3.11-slim-bookworm, not Alpine. Alpine’s musl libc cannot use the glibc-based manylinux wheels that most ML packages publish, so pip falls back to compiling from source and builds run roughly 50× slower.

Model loading

| Strategy | Cold start (140MB) | Cold start (1GB) | When |
|---|---|---|---|
| Baked into image | 5.6s | 17.6s | Small models (<500MB) |
| S3 loading | 5.7s | 20.2s | Large models, frequent updates |
| Provisioned concurrency | 0s | 0s | Production APIs (~$15/mo) |
| Lazy (@lru_cache) | First request: 10s | First request: 30s | Low-traffic endpoints |
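The lazy strategy in the last row can be sketched with `functools.lru_cache`. This is a minimal illustration, not a production loader: `_load_from_disk`, the path, and the returned dictionary are hypothetical stand-ins for whatever deserialization the real model needs.

```python
import time
from functools import lru_cache

LOAD_CALLS = 0  # counts real loads, for illustration only

def _load_from_disk(path: str) -> dict:
    """Hypothetical stand-in for an expensive deserialization step."""
    global LOAD_CALLS
    LOAD_CALLS += 1
    time.sleep(0.01)  # a real model would take seconds here
    return {"path": path, "weights": [0.1, 0.2, 0.3]}

@lru_cache(maxsize=1)
def get_model(path: str = "models/model.bin") -> dict:
    """Load the model on first use; later calls return the cached object."""
    return _load_from_disk(path)

first = get_model()   # pays the load cost
second = get_model()  # served from the cache
```

The cache is per process, so every worker process pays the first-request cost once.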

Kubernetes HPA for ML

CPU-based autoscaling responds in 30–60s. GPU and queue-length signals: 5–15s.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-inference-deployment
  minReplicas: 2
  maxReplicas: 15
  metrics:
  - type: External
    external:
      metric:
        name: dcgm_fi_dev_gpu_util  # NVIDIA DCGM exporter
      target:
        type: AverageValue
        averageValue: "75"
  - type: Pods
    pods:
      metric:
        name: tgi_queue_size
      target:
        type: AverageValue
        averageValue: "10"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30

Scale up fast; scale down slow. Thrashing is more expensive than excess capacity.
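To see how the manifest above turns metrics into a replica count, the sketch below applies the documented HPA rule, desired = ceil(current × metric / target), takes the largest proposal across metrics, and then applies the 100% scale-up cap. It is a simplification under stated assumptions: it ignores the controller's tolerance band, the stabilization windows, and the scale-down policy, and the metric values are made up.

```python
import math

def desired_replicas(current: int, metrics: dict, targets: dict,
                     min_replicas: int = 2, max_replicas: int = 15,
                     max_scale_up_percent: int = 100) -> int:
    """Approximate one HPA evaluation for several AverageValue metrics."""
    # Each metric proposes a replica count; the HPA acts on the largest.
    proposals = [
        math.ceil(current * metrics[name] / targets[name])
        for name in targets
    ]
    desired = max(proposals)
    # Percent scale-up policy: at most +100% of current replicas per period.
    scale_up_cap = math.ceil(current * (1 + max_scale_up_percent / 100))
    desired = min(desired, scale_up_cap)
    return max(min_replicas, min(desired, max_replicas))

targets = {"gpu_util": 75, "queue_size": 10}
busy = desired_replicas(4, {"gpu_util": 90, "queue_size": 25}, targets)
idle = desired_replicas(4, {"gpu_util": 30, "queue_size": 2}, targets)
print(busy, idle)
```

In the busy case the queue metric asks for 10 replicas, but the scale-up policy limits a single step to 8.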

GPU scheduling

GPUs are requested through limits. Setting limits alone is enough (requests default to the same value); if you set both, they must match:

resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    nvidia.com/gpu: 1

Node affinity for specific GPU types:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: accelerator
          operator: In
          values:
          - nvidia-tesla-v100
          - nvidia-tesla-a100

Serverless vs. Kubernetes

| Metric | Serverless | Kubernetes |
|---|---|---|
| Traffic | <1,000 req/day | >10K req/day |
| Latency SLA | Best-effort | <100ms required |
| GPU needs | None | Required |
| Cost | Optimize idle | Optimize throughput |

Cold-start reality:

  • 140MB model: 5.7s P50
  • 1GB model: 20s P50
  • Super-cold starts: 100s+ on fresh deployments

Mitigations in order of cost:

  1. Provisioned concurrency: ~$15/mo per instance, eliminates cold starts
  2. Warm pinging every 5 minutes: ~$0.20/mo
  3. Code optimization: 10–30% shorter cold starts

Lifecycle management

Service initialization

Dependency injection makes the service testable and the resource lifecycle explicit.

from dependency_injector import containers, providers
from dependency_injector.wiring import Provide, inject

class MLServiceContainer(containers.DeclarativeContainer):
    config = providers.Configuration()

    model = providers.Singleton(
        MLModel,
        model_path=config.model_path
    )

    prediction_service = providers.Factory(
        PredictionService,
        model=model
    )

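@inject
def predict(
    data: dict,
    service: PredictionService = Provide[MLServiceContainer.prediction_service]
):
    return service.predict(data)

# @inject only resolves Provide[...] after the container is wired to this module.
container = MLServiceContainer()
container.config.from_dict({"model_path": "models/model.pkl"})  # placeholder path
container.wire(modules=[__name__])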

Three-tier health checks

| Probe | Endpoint | Purpose | Timeout |
|---|---|---|---|
| Liveness | /health | Process alive | 5s |
| Readiness | /health/ready | Dependencies + model loaded | 10s |
| Startup | /health/startup | Extended initialization | 150s (30 × 5s) |
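One way to back the three probes is a small state object that each endpoint queries. The sketch below is framework-agnostic and hypothetical: `HealthState` and its field names are illustrative, and each method returns an (HTTP status, payload) pair that a Flask or FastAPI route could pass through.

```python
import time

class HealthState:
    """Tracks what each probe needs to know."""

    def __init__(self):
        self.started_at = time.monotonic()
        self.model_loaded = False
        self.dependencies_ok = False

    def liveness(self):
        # The process is alive if this code runs at all; keep it dependency-free.
        return 200, {"status": "alive"}

    def startup(self):
        # Passes once slow initialization (model load) has finished.
        if self.model_loaded:
            return 200, {"status": "started"}
        elapsed = round(time.monotonic() - self.started_at, 1)
        return 503, {"status": "starting", "elapsed_s": elapsed}

    def readiness(self):
        # Ready only when the model and downstream dependencies are usable.
        ready = self.model_loaded and self.dependencies_ok
        return (200 if ready else 503), {
            "model_loaded": self.model_loaded,
            "dependencies_ok": self.dependencies_ok,
        }

state = HealthState()
before = state.readiness()[0]   # 503 while the model is still loading
state.model_loaded = True
state.dependencies_ok = True
after = state.readiness()[0]    # 200 once everything is up
```

Keeping liveness free of dependency checks matters: a liveness probe that fails when a database is down restarts healthy pods without fixing anything.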

Graceful shutdown

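import logging
import signal
import sys
import time
from threading import Lock

from flask import Flask, abort, g

app = Flask(__name__)
logger = logging.getLogger(__name__)

# Assumption: keep this below the pod's terminationGracePeriodSeconds.
DRAIN_TIMEOUT_SECONDS = 30
is_shutting_down = False
active_requests = 0
request_lock = Lock()

def shutdown_handler(signum, frame):
    global is_shutting_down
    is_shutting_down = True
    logger.info("Shutdown initiated, draining requests...")

    # Bound the wait so a hung request cannot block shutdown forever.
    deadline = time.monotonic() + DRAIN_TIMEOUT_SECONDS
    while active_requests > 0 and time.monotonic() < deadline:
        time.sleep(0.1)

    # model and db_connection are created at startup (not shown).
    model.cleanup()
    db_connection.close()
    logger.info("Shutdown complete")
    sys.exit(0)

signal.signal(signal.SIGTERM, shutdown_handler)

@app.before_request
def check_shutdown():
    global active_requests
    if is_shutting_down:
        abort(503, "Service shutting down")

    with request_lock:
        active_requests += 1
    g.counted = True

@app.teardown_request
def decrement_counter(exc):
    # Runs even when the view raises; only undo requests that were counted,
    # so rejected 503s never push the counter negative.
    global active_requests
    if g.pop("counted", False):
        with request_lock:
            active_requests -= 1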

Model versioning

Semantic versioning:

  • MAJOR: breaking API changes
  • MINOR: retraining, backward-compatible improvements
  • PATCH: bug fixes, performance

MLflow registry:

import mlflow

mlflow.register_model(
    model_uri="runs:/abc123/model",
    name="fraud_detector"
)

client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name="fraud_detector",
    version=5,
    stage="Production",
    archive_existing_versions=True
)

A/B testing

50/50 splits are optimal: for a fixed total sample, the variance of the measured difference is smallest when both arms are equal. A 95/5 split has roughly 5× the variance for the same total traffic, so it needs more samples or accepts less power.
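The variance claim follows from Var(difference) ∝ 1/n_treatment + 1/n_control. A quick check, assuming both arms have the same per-user variance (a simplifying assumption):

```python
def relative_variance(treatment_share: float) -> float:
    """Variance of the difference in means, in units of sigma^2 / N."""
    control_share = 1.0 - treatment_share
    return 1.0 / treatment_share + 1.0 / control_share

balanced = relative_variance(0.50)  # 1/0.5 + 1/0.5
skewed = relative_variance(0.05)    # 1/0.05 + 1/0.95
ratio = skewed / balanced
print(f"95/5 split has {ratio:.2f}x the variance of 50/50")
```

The ratio comes out a little above 5, which is where the rule of thumb comes from.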

Sample size:

import numpy as np
from scipy import stats

def calculate_sample_size(
    baseline_rate: float,
    minimum_detectable_effect: float,
    alpha: float = 0.05,
    power: float = 0.8
) -> int:
    z_alpha = stats.norm.ppf(1 - alpha/2)
    z_beta = stats.norm.ppf(power)

    p1 = baseline_rate
    p2 = baseline_rate * (1 + minimum_detectable_effect)
    p_avg = (p1 + p2) / 2

    n = ((z_alpha * np.sqrt(2 * p_avg * (1 - p_avg)) +
          z_beta * np.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) /
         (p2 - p1)) ** 2

    return int(np.ceil(n))

# 3% baseline, 20% relative improvement
n = calculate_sample_size(0.03, 0.20)  # ~13,900 per group

Minimum test duration: max(sample_size / daily_users, 14 days) — the 14-day floor captures weekly patterns.

Canary deployments

Standard progression: 5% → 25% → 50% → 100%, with 15min / 30min / 60min pauses.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: ml-model
spec:
  replicas: 10
  strategy:
    canary:
      steps:
      - setWeight: 5
      - pause: {duration: 15m}
      - setWeight: 25
      - pause: {duration: 30m}
      - setWeight: 50
      - pause: {duration: 1h}
      analysis:
        templates:
        - templateName: success-metrics
        startingStep: 1

Automatic rollback on: error rate >10%, latency >1s, accuracy <70%, or three consecutive checks past a warning threshold.
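The rollback rule can be written as a small predicate over recent analysis checks. This is a sketch under assumptions: `CanaryMetrics` and its fields are hypothetical, the latency is taken to be whichever percentile the analysis template reports, and the warning flag is computed elsewhere against softer thresholds.

```python
from dataclasses import dataclass

@dataclass
class CanaryMetrics:
    error_rate: float      # fraction, 0.10 == 10%
    latency_s: float       # seconds
    accuracy: float        # fraction, 0.70 == 70%
    warning: bool = False  # True if any metric crossed a warning threshold

def should_rollback(history: list) -> bool:
    """True on a hard-limit breach in the latest check, or three warnings in a row."""
    if not history:
        return False
    latest = history[-1]
    hard_breach = (
        latest.error_rate > 0.10
        or latest.latency_s > 1.0
        or latest.accuracy < 0.70
    )
    three_warnings = len(history) >= 3 and all(m.warning for m in history[-3:])
    return hard_breach or three_warnings

healthy = CanaryMetrics(error_rate=0.01, latency_s=0.2, accuracy=0.91)
warned = CanaryMetrics(error_rate=0.04, latency_s=0.6, accuracy=0.85, warning=True)
broken = CanaryMetrics(error_rate=0.15, latency_s=0.2, accuracy=0.91)
```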

Deployment patterns

| Pattern | When | Split | Duration | Rollback |
|---|---|---|---|---|
| A/B test | Statistical comparison | 50/50 | 1–2 weeks | Moderate |
| Canary | Gradual rollout | 5 → 25 → 50 → 100% | 2–4 hours | Fast (<1min) |
| Blue-green | Instant rollback needed | 0 → 100% | Immediate | Instant |
| Shadow | Zero user impact | 0% (observe) | Continuous | N/A |

Performance optimization

What costs what

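| Technique | Speedup | Memory reduction | Accuracy cost | Complexity |
|---|---|---|---|---|
| INT8 quantization | 2–4× | | 1–3% | Medium |
| FP16 quantization | | | <1% | Low |
| Structured pruning | 1.5–2× | | 2–5% | Medium–high |
| Knowledge distillation | 2–5× | 2–10× | 2–5% | High |
| TensorRT | | | Minimal | Low–medium |
| ONNX Runtime | | | None | Low |
| Redis caching | 100–1000× | | None | Low |
| Dynamic batching | 3–4× | | None | Low |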

Quantization

The best return on the list. INT8: 2–4× speedup, 4× memory cut, <1% accuracy loss in most cases.
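The 4× memory figure is just the step from 4-byte floats to 1-byte integers. The mechanics can be sketched in plain Python with an affine (scale plus zero-point) mapping. This is an illustration only: real toolchains pick ranges per tensor or per channel from calibration data, and the weights below are made up.

```python
def quantize_int8(values):
    """Affine-quantize floats to int8; returns (codes, scale, zero_point)."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 or 1.0  # avoid a zero scale for constant input
    zero_point = round(-128 - lo / scale)
    codes = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point

def dequantize(codes, scale, zero_point):
    """Map int8 codes back to approximate float values."""
    return [(c - zero_point) * scale for c in codes]

weights = [-0.62, -0.10, 0.0, 0.25, 0.91]  # hypothetical FP32 weights
codes, scale, zero_point = quantize_int8(weights)
restored = dequantize(codes, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, f"max round-trip error: {max_error:.4f}")
```

The round-trip error is bounded by the step size `scale`, which is why tensors with a few large outliers quantize poorly: the outliers stretch the range and coarsen every other value.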

TensorRT measured speedups [13][14]:

  • ResNet50: 9.9ms → 3.0ms (3.3×)
  • ResNet18: 3.8ms → 1.7ms (2.2×)

QAT vs. PTQ matters for complex models:

  • EfficientNet-B0 with QAT: 76.8% accuracy
  • EfficientNet-B0 with PTQ: 33.9%
  • FP32 baseline: 77.4%

PTQ on a complex model can crater accuracy. Use QAT.

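import torch
import torch.quantization

# Post-training static quantization (PTQ), eager mode. For QAT, use
# get_default_qat_qconfig and prepare_qat, then fine-tune before convert.
model.eval()
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')  # x86 backend
torch.quantization.prepare(model, inplace=True)

# Calibration: run representative data through the inserted observers.
with torch.no_grad():
    for data in calibration_loader:
        model(data)

torch.quantization.convert(model, inplace=True)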

Inference acceleration

| Framework | Platform | Speedup | Fits |
|---|---|---|---|
| TensorRT | NVIDIA GPU | | Maximum GPU performance |
| ONNX Runtime | Cross-platform | | Compatibility, portability |
| OpenVINO | Intel CPU/GPU | 27% | Intel hardware, edge |

ONNX export:

import torch.onnx
import onnxruntime as ort

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],    # must match the key passed to session.run below
    output_names=["output"],
    opset_version=13,
    do_constant_folding=True
)

session = ort.InferenceSession("model.onnx")
output = session.run(None, {"input": input_data.numpy()})

TensorRT, in concrete numbers — DenseNet: 113 → 273 inferences/sec (2.4× throughput); latency 9ms → 5ms.

Redis caching

100–1000× for repeat inputs.

import redis
import json
import hashlib

class RedisCachingDecorator:
    def __init__(self, host='localhost', port=6379, prefix='ml-cache'):
        self._redis_client = redis.Redis(host=host, port=port, db=0)
        self.prefix = prefix  # a None prefix would produce keys starting with "None/"

    def _create_cache_key(self, model_name, version, input_data):
        input_hash = hashlib.sha256(
            json.dumps(input_data, sort_keys=True).encode()
        ).hexdigest()[:16]
        return f"{self.prefix}/{model_name}/{version}/{input_hash}"

    def predict(self, model, input_data):
        cache_key = self._create_cache_key(
            model.__class__.__name__,
            getattr(model, 'version', '1.0'),
            input_data
        )

        cached_result = self._redis_client.get(cache_key)
        if cached_result is not None:  # explicit check: a cache miss is None
            return json.loads(cached_result)

        result = model.predict(input_data)
        self._redis_client.setex(cache_key, 3600, json.dumps(result))
        return result

TTLs:

  • Real-time fraud: 5–60 minutes
  • User profiles: 1–24 hours
  • Static classifications: days to weeks

Dynamic batching

NVIDIA Triton [15]: 3.7× throughput. Inception plateaus at 73 inferences/sec without batching; with dynamic batching it reaches 272 at concurrency 8.

max_batch_size: 16
dynamic_batching {
  preferred_batch_size: [4, 8, 16]
  max_queue_delay_microseconds: 100
  priority_levels: 2
  default_priority_level: 1
}
instance_group [{
  count: 2
  kind: KIND_GPU
  gpus: [0]
}]

Latency/throughput trade — Llama-2-70B on A100:

| Batch size | Latency/token | Throughput | Speedup |
|---|---|---|---|
| 1 | 30ms | 33 tok/s | baseline |
| 4 | 35ms | 114 tok/s | 3.5× |
| 16 | 40ms | 400 tok/s | 12× |
| 64 | 120ms | 533 tok/s | 16× (plateau) |

The right batch size depends on the SLA. Hit batch 16 if your latency budget allows it.
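Choosing a batch size from measurements like these is a lookup: take the largest batch whose per-token latency still fits the budget. A small sketch using the Llama-2-70B figures from the table (the function names are illustrative):

```python
# (batch size, latency per token in ms), taken from the table
MEASUREMENTS = [(1, 30), (4, 35), (16, 40), (64, 120)]

def throughput_tok_s(batch_size: int, latency_ms: float) -> float:
    """Tokens per second when batch_size sequences each emit a token every latency_ms."""
    return batch_size / (latency_ms / 1000.0)

def pick_batch_size(latency_budget_ms: float):
    """Largest measured batch size that fits the latency budget, or None."""
    fitting = [batch for batch, latency in MEASUREMENTS if latency <= latency_budget_ms]
    return max(fitting) if fitting else None

for budget in (32, 50, 150):
    print(budget, "ms budget ->", pick_batch_size(budget))
```

With a 50ms budget this picks batch 16; going to 64 buys only a third more throughput for three times the latency.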

Monitoring

Percentiles, not averages

| Percentile | Meaning | Target (alert if exceeded) |
|---|---|---|
| P50 (median) | Typical experience | <200ms |
| P95 | Tail catch | <500ms |
| P99 | Architectural bottlenecks | <1000ms |

from prometheus_client import Histogram

latency_histogram = Histogram(
    'http_request_duration_seconds',
    'Duration of HTTP requests',
    ['status', 'path', 'method'],
    buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
)

# P95 in PromQL:
# histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

Histogram vs. Summary

| Type | Aggregates across instances | Flexible percentiles | Use |
|---|---|---|---|
| Histogram | Yes | Yes | Distributed systems, Kubernetes |
| Summary | No (per-instance only) | No (pre-defined) | Single instance, exact values |

Default to Histogram for ML.
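Histograms aggregate because bucket counts are plain counters: sum them across pods, then estimate the quantile once. The sketch below mirrors the linear interpolation that `histogram_quantile` performs inside a bucket. It is a simplified re-implementation for illustration, assumes 0 < q ≤ 1 and identical bucket layouts on every pod, and uses made-up counts.

```python
import math

def histogram_quantile(q: float, buckets: list) -> float:
    """buckets: sorted (upper_bound, cumulative_count) pairs ending with +Inf."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                return prev_bound  # cannot interpolate into the open-ended bucket
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound

def merge(*instances):
    """Sum cumulative counts bucket by bucket across instances."""
    return [(group[0][0], sum(b[1] for b in group)) for group in zip(*instances)]

pod_a = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 100), (math.inf, 100)]
pod_b = [(0.1, 10), (0.25, 40), (0.5, 90), (1.0, 98), (math.inf, 100)]
fleet = merge(pod_a, pod_b)
p95 = histogram_quantile(0.95, fleet)
print(f"fleet P95 is about {p95:.3f}s")
```

A Summary cannot do this: each pod reports a finished quantile, and averaging per-pod P95s does not give the fleet P95.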

ML-specific metrics

Beyond standard service metrics, track:

  • Prediction distribution (histogram)
  • Confidence scores (flag low confidence <0.7)
  • Feature values (out-of-range inputs)
  • Model version usage

Recording rules to precompute the expensive queries:

groups:
  - name: ml_sli_rules
    interval: 30s
    rules:
      - record: ml:prediction_latency:p95
        expr: |
          histogram_quantile(0.95,
            sum(rate(ml_prediction_duration_seconds_bucket[5m]))
            by (le, model_version))

      - record: ml:prediction_error_rate
        expr: |
          sum(rate(ml_predictions_total{result="error"}[5m]))
          by (model_version) /
          sum(rate(ml_predictions_total[5m]))
          by (model_version)

      - record: ml:slo:prediction_latency_good_ratio
        expr: |
          sum(rate(ml_prediction_duration_seconds_bucket{le="0.5"}[5m]))
          by (model_version) /
          sum(rate(ml_prediction_duration_seconds_count[5m]))
          by (model_version)

Burn-rate alerts

Fast burn (1h window): page. Slow burn (6h): ticket.

- alert: MLLatencyFastBurn
  expr: |
    (ml:prediction_latency:p95 > 0.5 and
     (1 - ml:slo:prediction_latency_good_ratio) > (14.4 * 0.001))
  for: 2m
  labels:
    severity: page

- alert: MLLatencySlowBurn
  expr: |
    (ml:prediction_latency:p95 > 0.5 and
     (1 - ml:slo:prediction_latency_good_ratio) > (6 * 0.001))
  for: 15m
  labels:
    severity: ticket
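The multipliers 14.4 and 6 are burn rates: how many times faster than "exactly on budget" the error budget is being spent. A quick way to see what they mean, assuming a 30-day SLO period and the 0.1% error budget used in the rules above (the function names are illustrative):

```python
def burn_rate(observed_bad_ratio: float, error_budget: float = 0.001) -> float:
    """How many times faster than sustainable the budget is being consumed."""
    return observed_bad_ratio / error_budget

def budget_consumed(rate: float, window_hours: float, slo_period_days: int = 30) -> float:
    """Fraction of the period's error budget spent if `rate` holds for the window."""
    return rate * window_hours / (slo_period_days * 24)

fast = budget_consumed(14.4, window_hours=1)  # fast-burn page
slow = budget_consumed(6, window_hours=6)     # slow-burn ticket
print(f"fast burn: {fast:.1%} of budget in 1h, slow burn: {slow:.1%} in 6h")
```

A 14.4× burn sustained for an hour spends about 2% of the monthly budget, and a 6× burn sustained for six hours spends about 5%, which is why the first pages and the second only opens a ticket.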

Structured logging

structlog with JSON in production, plain console in development.

import structlog
import uuid
from flask import Flask, request, g

app = Flask(__name__)

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        # Swap for structlog.dev.ConsoleRenderer() in development.
        structlog.processors.JSONRenderer()
    ]
)
logger = structlog.get_logger()

@app.before_request
def add_correlation_id():
    # Reset per-request context so IDs never leak between requests on a reused worker.
    structlog.contextvars.clear_contextvars()
    correlation_id = request.headers.get(
        'X-Correlation-ID',
        str(uuid.uuid4())
    )
    g.correlation_id = correlation_id
    structlog.contextvars.bind_contextvars(
        correlation_id=correlation_id
    )

@app.route('/predict', methods=['POST'])
def predict():
    # Placeholder values; log the real model output and measured latency here.
    logger.info(
        "prediction_made",
        prediction=0.85,
        confidence=0.92,
        latency_ms=45
    )
    return {"prediction": 0.85, "confidence": 0.92}

SLI/SLO targets

| Metric | Target | What it means |
|---|---|---|
| P50 latency | <200ms | Typical user experience |
| P95 latency | <500ms | Good experience threshold |
| P99 latency | <1000ms | Worst acceptable case |
| Availability | 99.5% | ~3.6 h (216 min) downtime/month |
| Error rate | <0.1% | 1 in 1,000 requests |
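The downtime implied by an availability target is simple arithmetic and worth checking whenever a target changes. A sketch, assuming a 30-day month:

```python
def allowed_downtime_minutes(availability_pct: float, days: int = 30) -> float:
    """Minutes of downtime permitted per period at the given availability."""
    return (1 - availability_pct / 100) * days * 24 * 60

for target in (99.5, 99.9, 99.95):
    print(f"{target}% -> {allowed_downtime_minutes(target):.1f} min/month")
```

For a 30-day month, 99.5% allows 216 minutes, 99.9% allows about 43 minutes, and 99.95% allows about 22 minutes.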

Capacity planning: track at P95, alert at 80% utilization.

Drift detection

| Method | Data type | Sensitivity | Sample dependency | Fits |
|---|---|---|---|---|
| KS test | Continuous | Very high | Strong | Small datasets, tiny shifts |
| PSI | Both | Low–medium | None | Large datasets, finance/credit |
| Wasserstein | Continuous | Medium | Weak | General purpose, balanced |
| Chi-square | Categorical | Medium | Moderate | Categorical features |
| JS divergence | Both | Medium | Weak | Symmetric comparison |
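For intuition about the first row, the two-sample KS statistic is the largest vertical gap between the two empirical CDFs. The pure-Python sketch below computes only the statistic, not the p-value; in practice `scipy.stats.ks_2samp` returns both. The sample values are made up.

```python
from bisect import bisect_right

def ks_statistic(reference, current) -> float:
    """Two-sample KS statistic: max gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    gap = 0.0
    # The maximum gap always occurs at one of the observed values.
    for x in ref + cur:
        f_ref = bisect_right(ref, x) / len(ref)
        f_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(f_ref - f_cur))
    return gap

same = ks_statistic([1, 2, 3], [1, 2, 3])
shifted = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
disjoint = ks_statistic([1, 2, 3], [10, 11, 12])
print(same, shifted, disjoint)
```

The strong sample dependency in the table shows up in the p-value: with enough data, even a tiny gap becomes statistically significant.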

Population Stability Index

PSI = Σ((%_new − %_old) × ln(%_new / %_old))

import numpy as np

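def calculate_psi(expected, actual, buckets=10, eps=1e-6):
    """Calculate Population Stability Index using quantile bins of `expected`."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)

    # Bin edges sit at the quantiles of the reference sample, not at raw 0-100 values.
    breakpoints = np.percentile(expected, np.linspace(0, 100, buckets + 1))
    # Fold out-of-range production values into the outermost bins.
    actual = np.clip(actual, breakpoints[0], breakpoints[-1])

    expected_percents = np.histogram(expected, bins=breakpoints)[0] / len(expected)
    actual_percents = np.histogram(actual, bins=breakpoints)[0] / len(actual)

    # Floor empty bins instead of zeroing the term, which would hide drift.
    expected_percents = np.clip(expected_percents, eps, None)
    actual_percents = np.clip(actual_percents, eps, None)

    psi_values = (actual_percents - expected_percents) * \
                 np.log(actual_percents / expected_percents)
    return float(np.sum(psi_values))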

Thresholds:

  • PSI < 0.1: no significant change
  • PSI 0.1–0.25: small shift
  • PSI > 0.25: significant drift, investigate

Retraining triggers

| Severity | PSI | Accuracy drop | Action |
|---|---|---|---|
| Small | 0.10–0.15 | <2% | Automated retrain |
| Moderate | 0.15–0.25 | 2–5% | Human review |
| Severe | >0.25 | >5% | Emergency |

Plus scheduled retraining (weekly/monthly), minimum-data thresholds, and cost-benefit checks.
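The trigger table maps directly onto a small decision function. In this sketch the more severe of the two signals wins, which is an assumption; the table does not say how to resolve a disagreement between PSI and accuracy. The function name and return labels are illustrative.

```python
def retraining_action(psi: float, accuracy_drop: float) -> str:
    """Map drift signals to an action; accuracy_drop is a fraction (0.02 == 2%)."""
    if psi > 0.25 or accuracy_drop > 0.05:
        return "emergency"
    if psi >= 0.15 or accuracy_drop >= 0.02:
        return "human_review"
    if psi >= 0.10:
        return "automated_retrain"
    return "no_action"

print(retraining_action(psi=0.12, accuracy_drop=0.01))
print(retraining_action(psi=0.05, accuracy_drop=0.03))
```

Scheduled retraining and the minimum-data and cost-benefit checks would sit around this function as additional gates.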

Input validation

from pydantic import BaseModel, validator, Field

class PredictionInput(BaseModel):
    age: int = Field(..., ge=18, le=100)
    income: float = Field(..., gt=0, lt=10000000)
    credit_score: int = Field(..., ge=300, le=850)

    @validator('income')
    def income_outlier_check(cls, v):
        if v > 500000:
            raise ValueError('income value suspicious')
        return v

Security

| Layer | Practice | Implementation |
|---|---|---|
| Input | Schema validation | Pydantic, strict types |
| Input | Adversarial detection | Confidence threshold, perturbation |
| API | Authentication | JWT RS256, API key rotation |
| API | Rate limiting | Token bucket, 100 req/min |
| Model | Access control | RBAC, audit logs, least privilege |
| Model | Encryption | AES-256 at rest, TLS 1.2+ in transit |
| Pipeline | Dependency scanning | safety, Snyk, version pinning |
| Compliance | Audit trails | Append-only logs, hash chaining |

Rate limiting

Fixed-window counter with Redis (a simpler approximation of a token bucket):

import time

import redis

class RateLimiter:
    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client

    def check_rate_limit(
        self,
        user_id: str,
        max_requests: int = 100,
        window_seconds: int = 60
    ):
        # Key on the window index so the key honors window_seconds.
        window = int(time.time() // window_seconds)
        key = f"rate_limit:{user_id}:{window}"
        current = self.redis.incr(key)

        if current == 1:
            # First hit in this window: let the key expire with it.
            self.redis.expire(key, window_seconds)

        if current > max_requests:
            return False, f"Rate limit exceeded: {current}/{max_requests}"

        return True, f"Requests remaining: {max_requests - current}"

Defaults:

  • Interactive APIs: 100 req/min per user
  • Batch APIs: 1,000 req/min
  • Expensive inference: 10 req/min
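For comparison, a token bucket proper refills continuously and allows short bursts up to its capacity. The in-memory sketch below shows the algorithm only; it is single-process and not thread-safe, and a shared limiter would keep the same two fields (tokens and last update time) in Redis and update them atomically. The class name and the injectable `clock` parameter are illustrative choices.

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `refill_per_second`."""

    def __init__(self, capacity: int = 100, refill_per_second: float = 100 / 60,
                 clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock  # injectable so tests can control time
        self.tokens = float(capacity)
        self.updated_at = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Spend `cost` tokens if available; return whether the request may proceed."""
        now = self.clock()
        elapsed = now - self.updated_at
        # Refill for the time that has passed, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The `cost` argument gives a natural way to express the "expensive inference" tier: charge more tokens per call rather than maintaining a separate limit.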

Resilience

| Pattern | Purpose | When | Complexity |
|---|---|---|---|
| Retry | Transient failures | Network timeouts, rate limits | Low |
| Circuit breaker | Cascading failures | Service degradation | Medium |
| Fallback | Graceful degradation | Circuit open | Low–medium |
| Timeout | Bound latency | Hung requests | Low |
| Bulkhead | Resource isolation | Prevent exhaustion | Medium |

Retry with exponential backoff

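import requests
from tenacity import retry, stop_after_attempt, wait_exponential

API_URL = "https://api.example.com/score"  # placeholder endpoint

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10)
)
def call_external_api(data):
    response = requests.post(API_URL, json=data, timeout=5)
    response.raise_for_status()
    return response.json()

With three attempts there are at most two waits, and both are clamped to the 4s minimum. The backoff only grows toward the 10s cap if you allow more attempts.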

Circuit breaker

from pybreaker import CircuitBreaker, CircuitBreakerError

breaker = CircuitBreaker(
    fail_max=5,
    reset_timeout=60,
    exclude=[ValueError]
)

@breaker
def call_prediction_service(data):
    response = requests.post(PREDICTION_URL, json=data, timeout=5)
    response.raise_for_status()
    return response.json()

try:
    result = call_prediction_service(input_data)
except CircuitBreakerError:
    result = get_cached_prediction(input_data)

States: CLOSED (normal), OPEN (fail fast after 5 consecutive failures), HALF-OPEN (probe recovery after 60s).
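The three states fit in a few lines, which helps when reasoning about what a library such as pybreaker is doing. The class below is a simplified illustration, not pybreaker's implementation: it counts consecutive failures, has no exclusion list, and takes an injectable `clock` (an illustrative choice) so the timeout can be tested.

```python
import time

class SimpleCircuitBreaker:
    def __init__(self, fail_max=5, reset_timeout=60.0, clock=time.monotonic):
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means CLOSED

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "CLOSED"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "HALF-OPEN"
        return "OPEN"

    def call(self, func, *args, **kwargs):
        state = self.state
        if state == "OPEN":
            raise RuntimeError("circuit open: failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed probe in HALF-OPEN, or too many failures, (re)opens the circuit.
            if state == "HALF-OPEN" or self.failures >= self.fail_max:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In HALF-OPEN a single successful probe closes the circuit and a single failure reopens it for another full timeout.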

Quick picks

Serving:

  • FastAPI for public APIs with <100ms inference, <10KB payloads
  • gRPC for internal services, large payloads (>100KB), needing 5–10× speedup
  • Triton for multi-framework serving, GPU optimization, dynamic batching
  • Kubernetes at >10K req/day with autoscaling or GPU scheduling
  • Serverless at <1,000 req/day with sporadic traffic

First optimizations (return-on-effort order):

  1. INT8 quantization — 2–4× speedup, easiest
  2. Redis caching — 100–1000× for repeats
  3. Dynamic batching — 3–4× throughput
  4. TensorRT — 5× on GPUs

Monitoring baseline:

  • Histogram metrics (not Summary)
  • P95 alerts with burn-rate
  • SLO: P95 <500ms, P99 <1s
  • Structured logs with correlation IDs

Security baseline:

  • Pydantic input validation
  • JWT authentication
  • 100 req/min rate limiting
  • TLS 1.2+ everywhere

Part 2: building Atlas’s forecasting system in production.


References

  1. Production ML Systems — Google for Developers.

  2. Supporting Diverse ML Systems at Netflix — Netflix Technology Blog.

  3. Scaling Media Machine Learning at Netflix — Netflix Technology Blog.

  4. Meet Michelangelo: Uber’s Machine Learning Platform — Uber Engineering.

  5. Upgrading Uber’s MySQL Fleet to version 8.0 — Uber Engineering.

  6. AWS HealthLake.

  7. NVIDIA TensorRT Best Practices Guide.

  8. DCGM Exporter Documentation.

  9. Monitoring GPUs in Kubernetes with DCGM.

  10. gRPC vs REST Performance Comparison.

  11. Benchmarking Linkerd and Istio.

  12. Service Meshes Decoded: Istio vs Linkerd.

  13. NVIDIA TensorRT Best Practices Guide — ResNet benchmarks.

  14. Torch-TensorRT ResNet-50 Example.

  15. NVIDIA Triton Optimization Guide.