// project
A biostatistics study investigating whether a common medication class affects the rate of cognitive decline in older adults. The dataset: NACC, the National Alzheimer’s Coordinating Center — 130,000+ longitudinal observations across 37 Alzheimer’s Disease Research Centers, multi-year follow-up on 30,000+ participants.
Under peer review. Nobody assigned this. The question was interesting, so I built the pipeline.
| Dimension | Value |
|---|---|
| Participants | 30,000+ |
| Longitudinal observations | 130,000+ |
| Research centers | 37 |
| Follow-up window | Multi-year, per-participant variable |
| Adjustment covariates | 24 (demographic, clinical, comorbid) |
| Outcome measure | Standardized neuropsychological battery |
| Missingness | Multi-pattern; handled via multiple imputation |
| Status | Under peer review |
Findings are withheld until publication. Methodology is below.
The medication class in question is widely prescribed. Its effect on cognitive trajectory — positive, negative, or null — is contested. Existing studies disagree, often because of confounding: people who get prescribed the medication differ systematically from those who don’t, and those differences also predict cognitive outcomes.
The research design tries to subtract those differences out.
Four statistical machines, each addressing a specific source of bias:
Propensity score matching. Models the probability of receiving the medication as a function of the observed covariates, then matches treated and untreated participants with similar scores. Matching balances the two groups on observed confounders only, so what remains should be closer to the causal effect but is still exposed to anything unmeasured.
```python
from sklearn.linear_model import LogisticRegression


class PropensityAnalysis:
    def __init__(self, treatment_var, covariates):
        self.treatment = treatment_var
        self.covariates = covariates
        self.matched_data = None

    def calculate_scores(self, data):
        """Propensity scores via logistic regression"""
        X = data[self.covariates]
        y = data[self.treatment]
        model = LogisticRegression(max_iter=1000)
        model.fit(X, y)
        # Column 1 is the predicted probability of treatment (assuming 0/1 coding)
        return model.predict_proba(X)[:, 1]

    def match_subjects(self, data, caliper=0.01):
        """1:1 matching within a caliper distance"""
        # Greedy nearest-neighbor matching within the caliper
        ...
```
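The matching step is elided above. As a rough illustration only, here is a minimal sketch of greedy 1:1 nearest-neighbor matching without replacement. The function name `greedy_caliper_match`, the dict-based inputs, and the toy scores are assumptions made for this example, not the study's code or data. Calipers are also often set on the logit of the propensity score rather than the raw score; this sketch uses the raw score for simplicity.

```python
def greedy_caliper_match(treated, control, caliper=0.01):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    treated, control: dicts mapping subject id -> propensity score.
    Returns a list of (treated_id, control_id) pairs.
    """
    available = dict(control)  # controls that have not been matched yet
    pairs = []
    # Visit treated subjects in a fixed order so the result is reproducible.
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        # Closest remaining control by absolute score difference.
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]  # each control is used at most once
    return pairs


# Toy scores, invented for illustration.
treated = {"t1": 0.30, "t2": 0.52, "t3": 0.90}
control = {"c1": 0.31, "c2": 0.50, "c3": 0.525, "c4": 0.10}
pairs = greedy_caliper_match(treated, control, caliper=0.02)
```

Treated subjects with no control inside the caliper (here `t3`) are dropped, which is the price of matching: the estimate applies to the region of overlap, not to everyone who was treated.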
Survival analysis with time-varying covariates. Time-to-event modeling where the exposure and covariates can change over follow-up. Critical for medication studies because people start and stop the drug, and treating that as a fixed baseline variable biases the estimate.
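Time-varying exposure is usually encoded by splitting each participant's follow-up into intervals over which exposure is constant (the counting-process or start/stop layout that time-varying Cox implementations generally consume). The sketch below shows one way to do that split for a single participant. The function name `to_counting_process`, the argument layout, and the example times are hypothetical.

```python
def to_counting_process(followup_end, event, exposure_periods):
    """Split one participant's follow-up into (start, stop, exposed, event) rows.

    followup_end: time of event or censoring (e.g. years since baseline).
    event: 1 if the outcome occurred at followup_end, else 0.
    exposure_periods: sorted, non-overlapping (on, off) intervals on the drug.
    """
    # Every time point at which exposure status can change.
    cuts = {0.0, followup_end}
    for on, off in exposure_periods:
        for t in (on, off):
            if 0.0 < t < followup_end:
                cuts.add(t)
    cuts = sorted(cuts)

    rows = []
    for start, stop in zip(cuts, cuts[1:]):
        # Exposure is constant within each interval, so checking its start suffices.
        exposed = int(any(on <= start < off for on, off in exposure_periods))
        # The event, if any, belongs only to the final interval.
        is_event = int(event == 1 and stop == followup_end)
        rows.append((start, stop, exposed, is_event))
    return rows


# Invented example: on the drug from year 1 to year 3, event at year 5.
rows = to_counting_process(5.0, 1, [(1.0, 3.0)])
```

With this layout the unexposed time before the first prescription is counted as unexposed person-time instead of being credited to the treated group, which is what a fixed baseline indicator gets wrong.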
Mixed-effects models for repeated cognitive assessments. Each participant has multiple cognitive scores over time. Random intercepts and slopes per participant let the model separate within-person change from between-person variation.
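For orientation, a generic random-intercept, random-slope model of this kind can be written as below. The notation is an assumption made for illustration, not the study's exact specification: $y_{ij}$ is the cognitive score of participant $i$ at visit $j$, $t_{ij}$ is time since baseline, $A_i$ is an exposure indicator (treated as fixed here for simplicity), and $\mathbf{x}_{ij}$ collects the adjustment covariates.

```latex
y_{ij} = (\beta_0 + b_{0i}) + (\beta_1 + b_{1i})\, t_{ij}
       + \beta_2 A_i + \beta_3 A_i t_{ij}
       + \mathbf{x}_{ij}^{\top} \boldsymbol{\gamma} + \varepsilon_{ij},
\qquad
\begin{pmatrix} b_{0i} \\ b_{1i} \end{pmatrix} \sim \mathcal{N}(\mathbf{0}, \mathbf{G}),
\quad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

In this form $b_{0i}$ and $b_{1i}$ absorb between-person differences in starting level and rate of change, and $\beta_3$ is the quantity of interest: the difference in average slope between exposed and unexposed participants.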
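Multiple imputation for missing data. Cognitive studies have missingness. Single imputation lies to you about uncertainty by treating filled-in values as if they were observed; complete-case analysis loses statistical power and may introduce selection bias. Multiple imputation carries the imputation uncertainty through to the standard errors, which are valid provided the data are missing at random given the variables in the imputation model.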
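The reason multiple imputation gets the uncertainty right is the pooling step (Rubin's rules): the analysis is run once per imputed dataset, and the total variance adds the between-imputation spread to the average within-imputation variance. A minimal sketch, with a hypothetical function name `pool_rubin` and made-up inputs:

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m per-imputation estimates and their squared standard errors."""
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # sample variance across imputations
    total = within + (1 + 1 / m) * between
    return q_bar, math.sqrt(total)


# Invented numbers: three imputations, each with squared SE 0.5.
estimate, se = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

Single imputation corresponds to ignoring the `between` term entirely, which is why its standard errors come out too small.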
Data engineering is most of the work in biostatistics, and almost none of the published papers say so.
```python
def calculate_hazard_ratio(data, exposure, outcome, covariates):
    """
    Calculate adjusted hazard ratio with confidence intervals.

    Parameters
    ----------
    data : pandas.DataFrame
    exposure : str
    outcome : str
    covariates : list[str]

    Returns
    -------
    dict : hazard ratio, 95% CI, p-value
    """
    # Cox proportional hazards with time-varying covariates,
    # validated against published reference values.
    ...
```
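The model fit itself is elided above. The last step, turning a fitted Cox coefficient into the reported quantities, is standard: the hazard ratio is the exponentiated log-hazard coefficient, and the Wald interval is built on the log scale. A small sketch, where `hazard_ratio_summary` and the input values are hypothetical and not results from the study:

```python
import math
from statistics import NormalDist


def hazard_ratio_summary(beta, se, level=0.95):
    """Hazard ratio, Wald confidence interval and two-sided p-value
    from a log-hazard coefficient and its standard error."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    z = beta / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {
        "hazard_ratio": math.exp(beta),
        "ci_lower": math.exp(beta - z_crit * se),
        "ci_upper": math.exp(beta + z_crit * se),
        "p_value": p_value,
    }


# Made-up coefficient for illustration only.
summary = hazard_ratio_summary(beta=math.log(2), se=0.1)
```

Because the interval is computed on the log scale and then exponentiated, it is asymmetric around the hazard ratio, which is expected.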
This is independent research. There’s no grant funding it, no advisor pushing it, no employer that benefits. The reason it exists is that the question is real and the data is available, and I have the statistical and engineering background to do the work properly.
The same rigor that makes a trading system not lose money makes a biostatistics study not mislead clinicians. Propensity matching guards against comparing groups that were never comparable; walk-forward validation guards against testing on data the model has already seen. Both exist to stop you from fooling yourself, and the underlying discipline is identical.
Specific findings and the medication class are withheld pending publication.