// project

Cognitive Health Study

September 2025

Technologies

Jupyter, Python, Pandas, SciPy, Statsmodels, Mixed Models, Survival Analysis

A biostatistics study investigating whether a common medication class affects the rate of cognitive decline in older adults. The dataset: NACC, the National Alzheimer’s Coordinating Center — 130,000+ longitudinal observations across 37 Alzheimer’s Disease Research Centers, multi-year follow-up on 30,000+ participants.

Under peer review. Nobody assigned this. The question was interesting, so I built the pipeline.

Study at a glance

Dimension	Value
Participants	30,000+
Longitudinal observations	130,000+
Research centers	37
Follow-up window	Multi-year; varies by participant
Adjustment covariates	24 (demographic, clinical, comorbid)
Outcome measure	Standardized neuropsychological battery
Missingness	Multi-pattern; handled via multiple imputation
Status	Under peer review

Findings are withheld until publication. Methodology is below.

The question

The medication class in question is widely prescribed. Its effect on cognitive trajectory — positive, negative, or null — is contested. Existing studies disagree, often because of confounding: people who get prescribed the medication differ systematically from those who don’t, and those differences also predict cognitive outcomes.

The research design tries to subtract those differences out.

Methodology

Four statistical machines, each addressing a specific source of bias:

Propensity score matching. Models the probability of receiving the medication as a function of all observed covariates, then matches treated and untreated participants with similar scores. This makes the two groups comparable on observed confounders, so the difference in outcomes that remains should be closer to the causal effect. Unobserved confounders are not addressed by matching.

class PropensityAnalysis:
    def __init__(self, treatment_var, covariates):
        self.treatment = treatment_var
        self.covariates = covariates
        self.matched_data = None

    def calculate_scores(self, data):
        """Propensity scores via logistic regression"""
        from sklearn.linear_model import LogisticRegression

        X = data[self.covariates]
        y = data[self.treatment]

        model = LogisticRegression(max_iter=1000)
        model.fit(X, y)

        return model.predict_proba(X)[:, 1]

    def match_subjects(self, data, caliper=0.01):
        """1:1 matching within a caliper distance"""
        # Greedy nearest-neighbor matching within the caliper
        ...
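
The matching step elided above could be sketched as follows. This is a minimal illustration of greedy 1:1 nearest-neighbor matching within a caliper, not the study's implementation: the function name `greedy_caliper_match`, the dict-based inputs, and the example scores are all hypothetical.

```python
def greedy_caliper_match(treated_scores, control_scores, caliper=0.01):
    """Greedy 1:1 nearest-neighbor matching on propensity score, without replacement.

    Hypothetical helper for illustration. Both inputs map a subject id to its
    propensity score. Returns a list of (treated_id, control_id) pairs.
    """
    available = dict(control_scores)  # controls that have not been used yet
    pairs = []
    # Walk treated subjects in score order so the result is deterministic.
    for t_id, t_score in sorted(treated_scores.items(), key=lambda kv: kv[1]):
        if not available:
            break
        # Closest remaining control by absolute score difference.
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]  # without replacement: each control is used once
        # Treated subjects with no control inside the caliper stay unmatched.
    return pairs


# Made-up scores, purely illustrative.
treated = {"t1": 0.30, "t2": 0.62, "t3": 0.95}
controls = {"c1": 0.305, "c2": 0.61, "c3": 0.40, "c4": 0.70}
pairs = greedy_caliper_match(treated, controls, caliper=0.02)
```

Greedy matching depends on processing order and is not globally optimal; the caliper trades sample size for match quality, since treated subjects without a close control are dropped.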

Survival analysis with time-varying covariates. Time-to-event modeling where the exposure and covariates can change over follow-up. Critical for medication studies because people start and stop the drug, and treating that as a fixed baseline variable biases the estimate.
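
To make the time-varying idea concrete, here is a small sketch of how one participant's follow-up might be split into intervals with constant exposure, the long (start, stop) layout that time-varying Cox models are commonly fit on. The function `split_followup` and its argument names are assumptions for illustration, not the pipeline's code.

```python
def split_followup(entry, exit_time, event, exposure_changes):
    """Split one participant's follow-up into (start, stop, exposed, event) rows.

    Hypothetical helper. `exposure_changes` is a time-sorted list of
    (time, exposed) pairs; each status holds until the next change.
    Exposure is assumed to be 0 before the first recorded change.
    """
    rows = []
    start, exposed = entry, 0
    for time, new_exposed in exposure_changes:
        if time <= entry:
            # A change at or before entry only sets the starting status.
            exposed = new_exposed
            continue
        if time >= exit_time:
            break  # changes after follow-up ends are irrelevant
        rows.append((start, time, exposed, 0))  # interval closes with no event
        start, exposed = time, new_exposed
    # The event indicator belongs only to the final interval.
    rows.append((start, exit_time, exposed, event))
    return rows


# Example: starts the drug at year 2, stops at year 4.5, event at year 6.
rows = split_followup(0.0, 6.0, 1, [(2.0, 1), (4.5, 0)])
```

Each row contributes person-time under the exposure status that actually applied during it, which is what prevents time before the first prescription from being counted as treated.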

Mixed-effects models for repeated cognitive assessments. Each participant has multiple cognitive scores over time. Random intercepts and slopes per participant let the model separate within-person change from between-person variation.
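
A random-intercept, random-slope model of this kind can be sketched with statsmodels on simulated data. Everything here is an assumption for illustration: the column names (`pid`, `years`, `score`), the simulated parameters, and the bare `score ~ years` formula, which omits the exposure and covariates a real analysis would include.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_participants, n_visits = 200, 5
true_slope = -0.5  # simulated average change in score per year

records = []
for pid in range(n_participants):
    intercept = 25.0 + rng.normal(0, 2.0)    # person-specific baseline
    slope = true_slope + rng.normal(0, 0.2)  # person-specific rate of change
    for visit in range(n_visits):
        years = float(visit)
        score = intercept + slope * years + rng.normal(0, 1.0)
        records.append({"pid": pid, "years": years, "score": score})
df = pd.DataFrame(records)

# Random intercept and random slope on `years`, grouped by participant.
model = smf.mixedlm("score ~ years", df, groups=df["pid"], re_formula="~years")
result = model.fit()
fixed_slope = result.fe_params["years"]  # population-average slope
```

The fixed effect on `years` estimates the average trajectory, while the random-effects covariance describes how much individual baselines and slopes vary around it.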

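Multiple imputation for missing data. Longitudinal cognitive studies always have missing data. Single imputation lies to you about uncertainty, because it treats imputed values as if they had been observed; complete-case analysis loses statistical power and may introduce selection bias. Multiple imputation carries the imputation uncertainty through to the standard errors, which are valid when the data are missing at random given the observed variables.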
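
The pooling step of multiple imputation is usually done with Rubin's rules, sketched below. The function name `pool_rubin` and the input numbers are made up for illustration; they are not results from this study.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed datasets using Rubin's rules.

    estimates: point estimate from each imputed dataset
    variances: squared standard error from each imputed dataset
    Returns (pooled_estimate, pooled_standard_error).
    """
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return q_bar, math.sqrt(total)


# Illustrative numbers only: five imputations, each with standard error 0.05.
estimates = [0.10, 0.12, 0.08, 0.11, 0.09]
std_errors = [0.05] * 5
pooled, pooled_se = pool_rubin(estimates, [se ** 2 for se in std_errors])
```

The pooled standard error is larger than any single-dataset standard error whenever the estimates disagree across imputations; that extra term is the uncertainty single imputation hides.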

The pipeline

Data engineering is most of the work in biostatistics, and almost none of the published papers say so.

Automated extraction from the NACC source files with structural validation against the codebook
Per-variable cleaning rules — out-of-range values, type conversions, longitudinal consistency checks
A reproducible analysis framework that re-runs the full pipeline from raw data to final figures
Unit tests on the statistical functions themselves — the formula for a hazard ratio is the same formula every time, and the tests pin that down

def calculate_hazard_ratio(data, exposure, outcome, covariates):
    """
    Calculate adjusted hazard ratio with confidence intervals.

    Parameters
    ----------
    data : pandas.DataFrame
    exposure : str
    outcome : str
    covariates : list[str]

    Returns
    -------
    dict : hazard ratio, 95% CI, p-value
    """
    # Cox proportional hazards with time-varying covariates,
    # validated against published reference values.
    ...
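
As an example of the kind of unit test meant here, the conversion from a fitted Cox coefficient to a hazard ratio and Wald confidence interval can be pinned down without any data. The helper `hazard_ratio_from_coef` and the two tests are a hypothetical sketch, not the project's test suite.

```python
import math


def hazard_ratio_from_coef(beta, se, z=1.959964):
    """Convert a log-hazard coefficient and its standard error to HR and 95% Wald CI."""
    return {
        "hr": math.exp(beta),
        "ci_lower": math.exp(beta - z * se),
        "ci_upper": math.exp(beta + z * se),
    }


def test_null_coefficient_gives_unit_hazard_ratio():
    out = hazard_ratio_from_coef(0.0, 0.1)
    assert out["hr"] == 1.0
    assert out["ci_lower"] < 1.0 < out["ci_upper"]


def test_ci_is_symmetric_on_log_scale():
    out = hazard_ratio_from_coef(math.log(2.0), 0.2)
    assert math.isclose(out["hr"], 2.0)
    # On the log scale the interval is centered on beta, so lower * upper == hr^2.
    assert math.isclose(out["ci_lower"] * out["ci_upper"], out["hr"] ** 2)
```

Tests like these check invariants of the formula rather than any particular dataset, so they keep passing when the data changes and fail only when the arithmetic does.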

What I built around the analysis

Reproducibility — every figure in the manuscript regenerates from the same script. The container is the analysis environment.
Interactive exploration — Jupyter notebooks for the work-in-progress; publication-quality figures for the manuscript.
HIPAA-compliant handling — even though NACC data is de-identified, the pipeline treats it conservatively. Encryption at rest, restricted access, audit logging.
Code review and version control — every analytic decision is in git history with a commit message explaining why.

Why I’m doing it

This is independent research. There’s no grant funding it, no advisor pushing it, no employer that benefits. The reason it exists is that the question is real and the data is available, and I have the statistical and engineering background to do the work properly.

The same rigor that makes a trading system not lose money makes a biostatistics study not mislead clinicians. Propensity matching is regularization on causal inference; walk-forward validation is the same idea in time. The underlying discipline is identical.

Specific findings and the medication class are withheld pending publication.