Validating Biomarker Signatures of Aging: Extending Insights from SNAC-K to the BLSA Cohort
Understanding how we age and why some individuals experience accelerated disease accumulation is a central challenge in biomedical research. Our recent work in the SNAC-K (Swedish National Study on Aging and Care in Kungsholmen) cohort identified key biomarkers associated with the rate of disease accumulation using LASSO regression, a regularized regression technique that also performs variable selection. But a crucial next step is confirming these findings: do they hold true in other populations? This is where external validation comes in, and in our study, we leveraged the Baltimore Longitudinal Study of Aging (BLSA) to rigorously test the generalizability of our SNAC-K discoveries.
Here’s a detailed look at how we approached this validation, and why it’s so important for translating research into real-world impact.
Harmonizing Data for Cross-Study Comparison
Before we could compare results, we needed to ensure the data from BLSA and SNAC-K were “speaking the same language.” This involved:
* Standardizing Chronic Condition Definitions: We carefully mapped chronic conditions in BLSA using both International Classification of Diseases (ICD) and Anatomical Therapeutic Chemical (ATC) codes, aligning them with the definitions used in SNAC-K.
* Leveraging Study Visit Data: We utilized data collected across multiple study visits in both cohorts to create a consistent longitudinal framework.
* Baseline Characteristics: Detailed baseline characteristics of the BLSA study sample are available in Supplementary Table 7.
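To make the harmonization step concrete, here is a minimal sketch of mapping ICD and ATC codes onto shared condition labels. The study's analyses were run in R and its actual code lists are not reproduced in this post, so the lookup tables, the function name `conditions_for`, and the choice of prefixes below are all hypothetical and for illustration only.

```python
# Hypothetical prefix tables: the study's real ICD/ATC mapping is not shown here.
ICD10_PREFIXES = {
    "E11": "diabetes",       # ICD-10 E11: type 2 diabetes mellitus
    "I10": "hypertension",   # ICD-10 I10: essential hypertension
    "I50": "heart failure",  # ICD-10 I50: heart failure
}
ATC_PREFIXES = {
    "A10": "diabetes",       # ATC A10: drugs used in diabetes
}


def _match(code, table):
    """Return the labels whose prefix matches a normalized code."""
    normalized = code.replace(".", "").upper()
    return {label for prefix, label in table.items() if normalized.startswith(prefix)}


def conditions_for(icd_codes, atc_codes):
    """Combine diagnosis (ICD) and medication (ATC) codes into one set of condition labels."""
    found = set()
    for code in icd_codes:
        found |= _match(code, ICD10_PREFIXES)
    for code in atc_codes:
        found |= _match(code, ATC_PREFIXES)
    return found


# One hypothetical participant with two diagnoses and one medication record.
print(sorted(conditions_for(["E11.9", "I50.0"], ["A10BA02"])))
```

Applying one shared lookup to both cohorts is what keeps a label such as "diabetes" meaning the same thing in each dataset, whether it was recorded as a diagnosis or inferred from a medication.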
Why External Validation Matters, and Our Approach
Simply finding biomarkers associated with aging in one study isn’t enough. You need to know if those findings are robust and applicable to other populations. This is what external validation achieves.
Because fewer biomarkers were available in BLSA than in the more comprehensive SNAC-K dataset, a full replication of the original LASSO models wasn’t feasible. Instead, we focused on validating the predictive accuracy of the biomarkers already identified by LASSO in SNAC-K. This is a standard and well-respected approach in the field, as highlighted by Hastie et al. (refs. 65, 66).
Here’s the multi-step process we employed:
- Estimating Disease Accumulation Rates in BLSA: We used linear mixed-effects models to calculate individual rates of disease accumulation within the BLSA cohort, accounting for individual variability over time.
- Age Adjustment: We adjusted for age, a primary driver of disease progression, in our subsequent analyses.
- Applying SNAC-K LASSO Coefficients: We took the biomarker weights (coefficients) identified by the LASSO model in SNAC-K and applied them to the BLSA data to predict individual disease accumulation rates. Essentially, we were asking: “Can the SNAC-K model accurately predict aging trajectories in BLSA?”
- Assessing Predictive Accuracy (MSE): We used Mean Squared Error (MSE), the same metric used during the original model training, to quantify the accuracy of our predictions. A lower MSE indicates better predictive performance. We then compared the MSE obtained in BLSA to the MSE achieved in SNAC-K, providing a direct measure of generalizability.
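The steps above can be sketched in a few lines. This is a simplified illustration, not the study's code. The analyses were run in R with lme4 and glmnet, whereas this sketch uses plain Python and swaps the linear mixed-effects model for a per-participant least-squares slope. Age adjustment is omitted for brevity. The coefficient values, biomarker names (`marker_a`, `marker_b`), and participant records are all hypothetical.

```python
from statistics import mean


def ols_slope(times, counts):
    """Least-squares slope of disease count over time for one participant.

    Simplified stand-in for the subject-specific rate from a mixed-effects model.
    """
    t_bar, c_bar = mean(times), mean(counts)
    num = sum((t - t_bar) * (c - c_bar) for t, c in zip(times, counts))
    den = sum((t - t_bar) ** 2 for t in times)
    return num / den


def predict_rate(biomarkers, intercept, coefs):
    """Apply fixed, previously estimated coefficients; nothing is refitted here."""
    return intercept + sum(weight * biomarkers[name] for name, weight in coefs.items())


def mse(observed, predicted):
    """Mean squared error between observed and predicted rates."""
    return mean((o - p) ** 2 for o, p in zip(observed, predicted))


# Hypothetical weights standing in for coefficients exported from the training cohort.
INTERCEPT = 0.10
COEFS = {"marker_a": 0.05, "marker_b": -0.02}

# Hypothetical validation-cohort records: visit years, disease counts, standardized biomarkers.
participants = [
    {"years": [0, 2, 4], "counts": [1, 2, 2], "biomarkers": {"marker_a": 1.0, "marker_b": 0.5}},
    {"years": [0, 2, 4], "counts": [0, 0, 1], "biomarkers": {"marker_a": -0.5, "marker_b": 1.2}},
    {"years": [0, 3, 6], "counts": [2, 3, 5], "biomarkers": {"marker_a": 2.0, "marker_b": -1.0}},
]

observed = [ols_slope(p["years"], p["counts"]) for p in participants]
predicted = [predict_rate(p["biomarkers"], INTERCEPT, COEFS) for p in participants]
validation_mse = mse(observed, predicted)
print(f"validation MSE: {validation_mse:.4f}")
```

The key design point is that the coefficients are treated as constants. Only the observed rates and the biomarker values come from the validation cohort, so the resulting MSE reflects how well the original model transfers rather than how well a new model can be fitted.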
Why This Approach is Powerful
This validation strategy is particularly strong because:
* BLSA is a Sizable Cohort: At approximately 25% of the size of SNAC-K, BLSA provides a meaningful sample for external validation.
* Realistic Scenario: Working with a limited set of available biomarkers mirrors real-world settings, where complete datasets are often unavailable.
* Focus on Generalizability: By assessing predictive performance, we’re directly testing whether the SNAC-K biomarker signature can generalize to a new, independent population.
Tools and Technologies
All statistical analyses were performed using R (version 4.2.3) with the following key packages:
* poLCA: Latent Class Analysis
* glmnet: LASSO Regression
* lme4: Linear Mixed-Effects Models
* factoextra: Exploratory Data Analysis and Visualization
* corrplot: Correlation Visualization
* survival: Survival Analysis
* ggplot2: Data Visualization
Further Information
For a more detailed overview of our research design, please refer to the Nature Portfolio Reporting









