Laura Jiménez-Gracia
2026-01-12 00:00:00
Ethics declaration
Human blood processed in-house for this project was preselected and included within other ongoing studies. All the studies included were conducted in accordance with ethical guidelines, and all patients provided written informed consent. Ethical committees and research project approvals for the different studies included in this paper are detailed in the following text.
SCGT00 and SCGT00val were approved by the Hospital Universitari Vall d’Hebron Research Ethics Committee (PR(AG)144/201). SCGT01 received institutional review board (IRB) approval by the Parc de Salut Mar Ethics Committee (2016/7075/I). SCGT02 received ethics approval by the Medisch-Etische Toetsingscommissie (METc) committee—for patients with asthma (ARMS and ORIENT projects: NL53173.042.15 and NL69765.042.19, respectively), for patients with COPD (SHERLOck project, NL57656.042.16) and for healthy controls (NORM project, NL26187.042.09). SCGT03 was approved by the Comité Ético de Investigación con Medicamentos del Hospital Universitario Vall d’Hebron (654/C/2019). SCGT04 and SCGT06 were approved by the Comitè d’Ètica d’Investigació amb medicaments (CEim) del Hospital de la Santa Creu i Sant Pau (EC/21/373/6616 and EC/23/258/7364). SCGT05 was approved by the IRBs of the Commissie Medische Ethiek UZ KU Leuven/Onderzoek (S66460 and S62294).
Atlas of circulating immune cells
The Inflammation Landscape of Circulating Immune Cells atlas was conceived as a comprehensive resource to expand the current knowledge of physiological and pathological inflammation through the study of circulating immune cells. With this aim, we included data representing both acute and chronic inflammatory processes as well as healthy donors. Further details about the included datasets are available (Supplementary Table 1).
The project includes in-house scRNA-seq data generation from samples shared by our collaborators from several research institutions. Samples were collected with written informed consent from all participants, in compliance with ethical guidelines for human samples. Specifically, we generated data from patients suffering from rheumatoid arthritis, psoriatic arthritis, Crohnʼs disease, ulcerative colitis, psoriasis and SLE and from healthy controls in collaboration with the Vall d’Hebron Research Institute within the DoCTIS consortia (https://doctis.eu/) (SCGT00 and SCGT00val). Additionally, we processed and obtained data from healthy controls in collaboration with the Institut Hospital del Mar d’Investigacions Mèdiques (SCGT01); asthma, COPD and healthy control samples in collaboration with the University Medical Center Groningen (SCGT02); BRCA samples in collaboration with the Vall d’Hebron Institute of Oncology (SCGT03); cirrhosis samples in collaboration with the Biomedical Research Institut Sant Pau (SCGT04); CRC samples in collaboration with the Katholieke Universiteit Leuven (SCGT05); and COVID and healthy control samples also in collaboration with the Biomedical Research Institut Sant Pau (SCGT06).
We also included publicly available datasets to complete our cohort. Specifically, we considered data from patients suffering from sepsis26,55, HNSCC56, HBV57, multiple sclerosis58, NPC59, HIV60,61, SLE17,62,63, cirrhosis64, Crohnʼs disease65, COVID-Flu-sepsis34 and COVID66 and from healthy controls from Terekhova et al.67 and 10x Genomics, together with the available healthy samples from all the cited studies. The data information and access identifiers for each project can be found in Supplementary Table 1 (Sheet 1). When raw data were available, we downloaded FASTQ files; otherwise, we retrieved the raw count matrices from the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) (https://www.ncbi.nlm.nih.gov/geo/) or Sequence Read Archive (SRA) (https://submit.ncbi.nlm.nih.gov/about/sra/), BioStudies ArrayExpress (https://www.ebi.ac.uk/biostudies/arrayexpress), Broad Institute DUOS (https://duos.broadinstitute.org/), Synapse (https://www.synapse.org), Genome Sequence Archive (GSA) (https://ngdc.cncb.ac.cn/gsa-human/), CELLxGENE Data Portal (https://cellxgene.cziscience.com/datasets) and 10x Genomics (https://www.10xgenomics.com/datasets) resources. For all studies, we also collected clinical metadata.
Sample collection
Human blood samples were collected in EDTA tubes (BD Biosciences). PBMCs from the SCGT00, SCGT00val, SCGT02, SCGT04, SCGT05 and SCGT06 datasets were isolated using Ficoll density gradient centrifugation (STEMCELL Technologies, Lymphoprep; GE Healthcare Biosciences AB, Ficoll-Plus). PBMCs belonging to the SCGT01 and SCGT03 datasets were isolated using Vacutainer CPT tubes (BD Biosciences). Subsequently, all aliquots were centrifuged following the manufacturer’s protocol. After centrifugation, PBMCs were washed and resuspended in freezing media. Aliquots were gradually frozen using a commercial freezing box (Thermo Fisher Scientific; Mr. Frosty, Nalgene) at −80 °C for 24 hours before being transferred to liquid nitrogen for long-term storage.
Cell thawing and preprocessing
Cryopreserved PBMCs were thawed in a water bath at 37 °C and transferred to a 15-ml Falcon tube containing 10 ml of prewarmed RPMI media supplemented with 10% FBS (Thermo Fisher Scientific). Samples were centrifuged at 350g for 8 minutes at room temperature, supernatant was removed and pellets were resuspended with 1 ml of cold 1× PBS (Thermo Fisher Scientific) supplemented with 0.05% BSA (Miltenyi Biotec, PN 130-091-376). Samples were incubated for 10 minutes at room temperature with 0.1 mg ml−1 DNAse I (Worthington Biochemical, PN LS002007) to eliminate ambient DNA and favor the resuspension of the pellet. Cells were filtered with a 40-µm strainer (Cell Strainer, PN 43-10040-70) to remove any clumps and washed by adding 10 ml of cold PBS + 0.05% BSA. Samples were centrifuged at 350g for 8 minutes at 4 °C and resuspended in an adequate volume of PBS + 0.05% BSA to reach the desired concentration. Cell concentration and viability were verified with a TC20 Automated Cell Counter (Bio-Rad) after staining the cells with trypan blue.
Sample multiplexing by genotyping
PBMC samples were evenly mixed in pools of eight donors per library following a multiplexing approach based on the donor’s genotype for a more cost-efficient and time-efficient strategy. Notably, in the case of SCGT00, libraries were designed to pool together samples from the same disease with different responses to treatment (not relevant in this study), whereas, in the case of the SCGT02 Asthma+HC cohort, six samples belonging to patients were pooled with two samples derived from non-smoking healthy control individuals. With this approach, we aimed to avoid technical artifacts that could mask subtle biological differences.
3′ CellPlex
PBMC samples belonging to the SCGT01, SCGT02 COPD+HC, SCGT04 and SCGT06 cohorts were multiplexed with 10x Genomics CellPlex Kit following the Cell Multiplexing Oligo Labeling for Single Cell RNA Sequencing Protocol (10x Genomics). For the SCGT02 COPD+HC and SCGT06 projects, we pooled eight samples per library, combining patients and healthy controls, whereas for SCGT01 and SCGT04 we included only samples from the condition of interest. In brief, 0.2−1 million cells were centrifuged at 350g at room temperature with a swinging bucket rotor, resuspended in 100 µl of Cell Multiplexing Oligo (10x Genomics, 3′ CellPlex Kit Set A, PN-1000261) and incubated at room temperature for 5 minutes. Cells were washed three times with cold 1× PBS (Thermo Fisher Scientific) supplemented with 1% BSA (MACS, Miltenyi Biotec), all centrifugations being performed at 350g at 4 °C. Cells were finally resuspended in an appropriate volume of 1× PBS + 1% BSA to obtain a final cell concentration of approximately 1,600 cells per microliter and counted using a TC20 Automated Cell Counter (Bio-Rad). An equal number of cells of each sample was pooled and filtered with a 40-µm strainer to remove any clumps; final cell concentration and viability of the pools were verified before loading onto the Chromium for cell partitioning.
Cell encapsulation and scRNA-seq library preparation
Multiplexed samples were loaded for a target cell recovery between 20,000 and 60,000 cells (corresponding to 5,000−7,500 cells per sample within each plex). More specifically, samples belonging to SCGT00, SCGT00val and SCGT01 cohorts were encapsulated using standard-throughput Chromium Next GEM Single Cell 3′ Reagent Kit version 3.1, whereas multiplex samples belonging to SCGT02 Asthma+HC and COPD+HC, SCGT04 and SCGT06 were encapsulated using the high-throughput Chromium Next GEM Single Cell 3′ HT Reagent Kit version 3.1 in combination with the Chromium X instrument. On the other hand, SCGT03 and SCGT05 cohorts were loaded in a standard assay with a target recovery of 6,000−8,000 cells per sample using Chromium Next GEM Single Cell 5′ Reagent Kit version 2 (10x Genomics, PN-1000263).
Libraries were prepared following the manufacturer’s instructions of protocols CG000315 or CG000390, for the standard assay without and with sample multiplexing, and protocols CG000416 and CG000419, for the high-throughput assay without and with sample multiplexing. Protocol CG000331 was instead followed for the SCGT03 and SCGT05 cohorts. Between 20 ng and 200 ng of cDNA was used for preparing libraries, and final library size distribution and concentration were determined using a Bioanalyzer High Sensitivity chip (Agilent Technologies). Sequencing was carried out on a NovaSeq 6000 system (Illumina) and a NextSeq 500 system (Illumina) using the following sequencing conditions: 28 bp (Read 1) + 10 bp (i7 index) + 10 bp (i5 index) + 90 bp (Read 2), to obtain approximately 40,000 read pairs per cell for the gene expression library and 2,000−4,000 read pairs per cell for the CellPlex library.
Data processing
To profile the cellular transcriptome, we processed the sequencing reads with the 10x Genomics software package Cell Ranger (version 6.1) (https://support.10xgenomics.com/single-cell-gene-expression/software/overview/welcome) and mapped them against the human GRCh38 reference genome (GENCODE version 32/Ensembl 98). This step was applied to the sequencing reads obtained from in-house-processed samples and from published projects, when available.
Genotype processing
Genome-wide genotyping data for patients from the SCGT00, SCGT00val and SCGT02 Asthma+HC studies were generated from PBMC samples. For SCGT00, 184 patients were distributed across four genotyping cohorts (N1 = 64, N2 = 32, N3 = 40 and N4 = 48 samples), and, for SCGT00val, 32 patients were processed with the Illumina Omni2.5-8 and Illumina GSA MG v3-24 arrays, respectively. For SCGT02 Asthma+HC, 16 patients were distributed across two genotyping cohorts (N1 = 8 and N2 = 8 samples) using Infinium Global Screening Array-24 version 3.0 (GSAMD-24v3.0) with the A1 array. Genotyping was performed using the GRCh37 human reference genome. Data preprocessing and quality control analysis were separately performed for each genotyping batch of samples at IMIDomics, Inc. (Barcelona, Spain) and at Erasmus MC (Rotterdam, Netherlands). Quality control analysis was performed using PLINK software (versions 1.9 and 2). In the SCGT00 and SCGT00val quality control analysis, we identified autosomal single-nucleotide polymorphisms (SNPs), and, using those SNPs from chromosome X, we confirmed the consistency between SNP-estimated and clinically reported genders. Then, we quantified the percentage of SNPs with a minor allele frequency (MAF) higher than 5%. Next, we computed the percentage of missingness both at the SNP-wise and sample-wise levels. Finally, we assessed the heterozygosity rate (F) of each sample to evaluate whether any of the genotyped samples could be contaminated. In the SCGT02 Asthma+HC quality control analysis, we excluded samples with a SNP calling rate lower than 98%.
Patient genotypes (VCF format) were simplified by removing single-nucleotide variants (SNVs) that were unannotated (chr 0) or located on the sex chromosome Y (chr 24), the pseudo-autosomal XY region (chr 25) or the mitochondrial chromosome (chr 26). As genotypes were obtained using the human hg19 reference genome, we converted their coordinates to the same reference genome used to map the sequencing reads (GRCh38) using the UCSC LiftOver tool (https://genome.ucsc.edu/cgi-bin/hgLiftOver). LiftOver requires an input file in BED format. Thus, we used a Python script (https://github.com/single-cell-genetics/cellsnp-lite/blob/master/scripts/liftOver/liftOver_vcf.py) to convert our VCF file accordingly.
Library demultiplexing
Multiplexed libraries from SCGT00, SCGT00val and SCGT02 Asthma+HC cohorts were demultiplexed with Cellsnp-lite (version 1.2.2) in Mode 1a68, which allows us to genotype single-cell gene expression libraries by piling up the expressed alleles based on a list of given SNPs. To do so, we used a list of 7.4 million common SNPs in the human population (MAF > 5%) published by the 1000 Genomes Project consortium and compiled by the authors (https://sourceforge.net/projects/cellsnp/files/SNPlist/). Then, we performed the donor deconvolution with vireo (version 0.5.6)69, which assigns the deconvoluted samples to their donor identities using known genotypes while detecting doublets and unassigned cells. Finally, we discarded detected doublets and unassigned cells before moving on to the downstream processing steps. For CellPlex libraries, we followed a joint deconvolution strategy combining cell multiplexing oligo (CMO) hashing and genotype-based deconvolution; we generated pools of cells belonging to different samples based on the individual SNPs and traced them back to their donor of origin based on the CMO hashing. When no reference genotype is available, this dual approach minimizes the number of discarded cells.
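The core idea of genotype-based donor assignment can be illustrated with a toy, non-probabilistic sketch (vireo itself fits a full statistical model over allele counts; the function, data layout and margin rule below are ours for illustration only):

```python
# Toy illustration of genotype-based deconvolution: assign each cell to the
# donor whose known genotype best matches the alleles observed in that
# cell's reads, leaving cells that are ambiguous between donors unassigned.

def assign_donor(cell_alleles, donor_genotypes, min_margin=1):
    """cell_alleles: {snp: observed_allele}; donor_genotypes:
    {donor: {snp: allele}}. Returns the best-matching donor or None."""
    scores = {
        donor: sum(genotype.get(snp) == allele
                   for snp, allele in cell_alleles.items())
        for donor, genotype in donor_genotypes.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    if len(ranked) > 1 and scores[ranked[0]] - scores[ranked[1]] < min_margin:
        return None  # ambiguous between top donors: leave unassigned
    return ranked[0]
```

In the real pipeline, cells left unassigned by the genotype can still be rescued through their CMO hash, which is what makes the dual approach lose fewer cells.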
Data analysis
All analyses presented in this paper were carried out using mainly Python, unless specified otherwise. In particular, we structured our data in anndata objects70 compatible with SCANPY suite71, which allowed us to apply single-cell data processing and visualization best practices. All experiments and panels are reproducible with the code released in the project’s GitHub repository.
Data standardization
Considering the diversity of the datasets included in the reference of circulating immune cells, a standardization step was needed.
Cell barcodes
The ‘cellID’ barcodes assigned were inspired by The Cancer Genome Atlas (TCGA) project (https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/). Each barcode unequivocally identifies a cell, and it is composed of the studyID (project), libraryID (10x GEM channel), patientID, chemistry (only when 3′ and 5′ gene expression were available for the same sample), timepoint (if multiple observations were available for a patient) and the 10x Genomics cell barcode, respectively (for example, SCGT00_L046_P006.3P_T0_AAACCCAAGGTGAGAA).
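The barcode scheme above can be sketched with a pair of helper functions (illustrative only; the released pipeline may implement this differently):

```python
# Sketch of composing/parsing the TCGA-inspired 'cellID' barcode. Field
# order follows the example in the text; the optional chemistry and
# timepoint fields are handled as described above.

def make_cell_id(study, library, patient, cell_barcode,
                 chemistry=None, timepoint=None):
    """Build a cellID such as SCGT00_L046_P006.3P_T0_AAACCCAAGGTGAGAA."""
    patient_field = patient
    if chemistry is not None:  # only when 3'/5' coexist for the same sample
        patient_field += f".{chemistry}"
    parts = [study, library, patient_field]
    if timepoint is not None:  # only with multiple observations per patient
        parts.append(timepoint)
    parts.append(cell_barcode)
    return "_".join(parts)

def split_cell_id(cell_id):
    """Recover (studyID, libraryID, middle fields, 10x barcode)."""
    fields = cell_id.split("_")
    return fields[0], fields[1], fields[2:-1], fields[-1]
```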
Gene name harmonization
All datasets were mapped using human GRCh38 genome reference, but the annotation file version might differ, resulting in gene names with multiple aliases or deprecated symbols. To avoid gene redundancy or mismatching, we used Ensembl symbols instead of gene names. Then, for datasets without the Ensembl symbols, we compared all gene names with the HUGO Gene Nomenclature Committee (HGNC) database (latest version, February 2024; https://www.genenames.org/), in order to convert them to the latest official HUGO name, merging possible duplicates and retrieving the corresponding Ensembl symbol. For non-official genes, we used the MyGene Python interface (https://mygene.info/) to query the Ensembl symbol. Finally, we removed 16 genes categorized as ‘artifact’ or ‘TEC (To be Experimentally Confirmed)’.
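The collapsing logic can be sketched as follows, assuming the alias-to-official mapping has already been retrieved from the HGNC dump or MyGene queries (the small table below is an illustrative stand-in, not the real mapping):

```python
# Toy sketch of the harmonization step: map aliases/deprecated symbols to an
# official symbol and merge the counts of genes that collapse onto the same
# name. The alias table here is illustrative only.

ALIAS_TO_OFFICIAL = {
    "FAM46C": "TENT5C",  # example of a deprecated symbol and its rename
    "TENT5C": "TENT5C",
    "CD8A": "CD8A",
}

def harmonize_counts(gene_counts):
    """Collapse a {gene_name: counts} dict onto official symbols, summing
    counts for names that merge; unknown names are kept as-is."""
    merged = {}
    for name, counts in gene_counts.items():
        official = ALIAS_TO_OFFICIAL.get(name, name)
        merged[official] = merged.get(official, 0) + counts
    return merged
```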
Metadata harmonization
Patient metadata were unified across datasets, using common variable names and values for those present in multiple sources; specifically, we homogenized variables of interest such as sex, age, disease, diseaseStatus, smokingStatus, ethnicity and institute. For instance, ‘M’, ‘Male’ and ‘Hombre’ entries were replaced with ‘male’. Additionally, we created a new variable, ‘binned_age’, to group patients within a range of 10 years, considering that, for the SCGT01, SCGT04 and SCGT11 datasets, the specific age information was not available. As detailed below, the datasets missing sex and age information were considered as data from unseen studies and used to evaluate the patient classifier.
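A minimal sketch of this homogenization, using the raw values mentioned above (the value map is illustrative and far from exhaustive):

```python
# Sketch of the metadata homogenization: a value map for 'sex' (covering
# English and Spanish entries, as in the text) plus a 10-year age binning
# corresponding to the 'binned_age' variable.

SEX_MAP = {"M": "male", "Male": "male", "Hombre": "male",
           "F": "female", "Female": "female", "Mujer": "female"}

def harmonize_sex(value):
    """Map heterogeneous sex annotations onto a common vocabulary."""
    return SEX_MAP.get(value, value)

def binned_age(age):
    """Group an integer age into a 10-year bin label, e.g. 47 -> '40-49'."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"
```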
Data splitting
Some studies in our cohort included patients with samples collected in multiple replicates, at different timepoints or using different chemistry protocols. In studies with multiple replicates—that is, Zhang2022 and Terekhova2023—we selected the sample with the largest number of cells. When multiple timepoints or disease statuses were available for the same patient—that is, Perez2022, COMBAT2022 and Ren2021—we kept only the samples associated with the highest disease severity.
The filtered Inflammation Atlas cohort was then split into two datasets: CORE and unseen studies. Data from unseen studies include 86 samples and were used as an independent validation of our patient classifier pipeline. For this dataset, we selected studies that either involve diseases with large support in our full cohort or lack metadata on sex and age. The chosen studies are as follows: SCGT00val, SCGT06, Palshikar2022, Ramachandran2019, Martin2019, Savage2021, Jiang2020, Mistry2019 and 10XGenomics.
After performing data quality control (removing low-quality libraries and cells; see section below), the CORE dataset includes 961 samples and was further split into Main and data from unseen patients, with 817 and 144 samples, respectively. We first stratified samples based on the following metadata: studyID, chemistry and disease. From each of these groups, we randomly selected 20% of samples to be part of the unseen patients, provided that the group amounted to at least five samples. In the patient classifier pipeline, the Main dataset is used as a reference, whereas data from unseen patients and unseen studies are used as a query dataset in two independent scenarios.
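The stratified split can be sketched as follows; note that reading "at least five samples" as a condition on the stratum size is our interpretation, and the function names are ours:

```python
# Sketch of the stratified 80/20 split: samples are grouped by
# (studyID, chemistry, disease) and, for strata with at least five
# samples, 20% are assigned to the unseen-patients set.

import random

def split_unseen_patients(samples, seed=0, frac=0.2, min_group=5):
    """samples: list of dicts with 'sampleID', 'studyID', 'chemistry' and
    'disease' keys. Returns (main_ids, unseen_ids)."""
    rng = random.Random(seed)
    groups = {}
    for s in samples:
        key = (s["studyID"], s["chemistry"], s["disease"])
        groups.setdefault(key, []).append(s["sampleID"])
    main, unseen = [], []
    for ids in groups.values():
        ids = sorted(ids)  # deterministic order before sampling
        if len(ids) >= min_group:
            picked = set(rng.sample(ids, int(len(ids) * frac)))
        else:
            picked = set()  # small strata stay entirely in Main
        unseen += [i for i in ids if i in picked]
        main += [i for i in ids if i not in picked]
    return main, unseen
```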
The Centralized Dataset included samples from SCGT00 and SCGT00val. Because all healthy patients were sequenced in the same pool, we did not take them into account. Then, because multiple samples were multiplexed and sequenced together, we split the data stratifying by sequencing pool (patientPool) to generate reference and query datasets that each include at least one pool for all the IMID diseases (that is, rheumatoid arthritis, psoriasis, psoriatic arthritis, Crohnʼs disease, ulcerative colitis and SLE). Further information about the samples classified in each group is detailed in Supplementary Table 1.
Data quality control
We performed data quality control on the CORE dataset by computing the main metrics (that is, library size, library complexity and percentage of mitochondrial, ribosomal, hemoglobin and platelet-related gene expression) on the count matrix. Metric distributions were visualized grouping cells by library (10x Genomics) and by considering their chemistry (3′ or 5′ and their version). Consequently, we removed low-quality observations using permissive thresholds, whereas the robust cleaning process was performed during cell annotation tasks. In particular, we initially excluded cells with a high percentage of mitochondrial gene expression (>25% for 3′ V3 and >20% for 3′ V2 and 5′ chemistries), as it is indicative of lysed cells. Then, we removed barcodes with a high library size (>50,000 UMIs for 3′ V3 and 5′ V1, >40,000 UMIs for 5′ V2 or >25,000 UMIs for 3′ V2 chemistry) or with a high complexity (>6,000 genes for 3′ V3 and 5′ or >4,000 genes for 3′ V2 chemistry). After cell quality control, we also removed low-quality libraries. We then computed cell cycle scores72 (version 4.3.0.1) and defined the different cell cycle phases (G1, G2M and S). Before the dataset cleanup, we predicted doublets using the function scanpy.external.pp.scrublet() from the SCANPY library (version 1.9.8), which provides a score to flag putative doublets without filtering them out at this stage. Consequently, during the clustering and annotation step, clusters co-expressing gene markers from different lineages/populations and showing a high doublet score were assessed to determine whether a specific cluster could be classified as a group of doublets and subsequently excluded. After this step, the CORE dataset was split into Main and data from unseen patients, as explained above.
Quality control on the data from unseen studies was performed independently. We applied the same approach described above, but we filtered only poor-quality libraries and cells.
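The permissive per-chemistry barcode filters can be sketched as a lookup table; the chemistry keys are our own shorthand, and reading the mitochondrial percentages as upper bounds follows the thresholds quoted above:

```python
# Sketch of the per-chemistry barcode filters. Each threshold triple is
# (max_mito_pct, max_umis, max_genes); values follow the text.

QC_THRESHOLDS = {
    "3P_V3": (25, 50_000, 6_000),
    "3P_V2": (20, 25_000, 4_000),
    "5P_V1": (20, 50_000, 6_000),
    "5P_V2": (20, 40_000, 6_000),
}

def keep_barcode(chemistry, mito_pct, n_umis, n_genes):
    """Return True if a barcode passes the permissive QC filters."""
    max_mito, max_umis, max_genes = QC_THRESHOLDS[chemistry]
    return (mito_pct <= max_mito
            and n_umis <= max_umis
            and n_genes <= max_genes)
```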
Data processing for annotation
Annotation strategy
To identify all the immune cell types and states present in human blood, we employed a recursive top-down approach inspired by previous work by Massoni-Badosa et al.73. Starting with 4,918,140 cells and 817 patients from the Main dataset, we divided the annotation into several stages. In brief, we first grouped all cells into the primary compartments within our study. Subsequently, each compartment was processed to detect potential doublets, low-quality cells and cells resembling platelets or erythrocytes (cells with high expression of hemoglobin genes). Additionally, we placed some clusters of cells back into their corresponding cell lineages when they were wrongly clustered due to similar profiles (for example, T cells found in the natural killer cell group or vice versa). Then, we identified the clusters resembling specific biological cell profiles (cell subtypes), obtaining a final number of 64 different subpopulations, excluding Doublets and LowQuality_cells, that we defined as annotation Level 2. These cell subtypes were grouped into 15 cell populations that we defined as annotation Level 1. In the end, our Inflammation Atlas contains 4,435,922 cells. For each group identified in the initial stage (cell lineages), we applied the following tasks: normalization, feature selection, integration, clustering and annotation. In the following, we always refer to the parameters of the initial stage; the specifics of the subsequent steps (from lineages to cell types), along with the annotation labels and the marker genes used to define them, can be found in Supplementary Table 3.
Data normalization
Following standard practices, filtered cells were normalized by total counts over all genes and multiplied by a scaling factor of 10⁴ (scanpy.pp.normalize_total(target_sum = 1e4)). Then, the normalized count matrix X was log transformed as ln(X + 1) (scanpy.pp.log1p()).
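Written out with NumPy instead of SCANPY, the two operations amount to:

```python
# Explicit version of the normalization above: scale each cell to a total
# of 1e4 counts, then apply the natural log of (x + 1).

import numpy as np

def normalize_log1p(counts, target_sum=1e4):
    """counts: cells x genes raw count matrix. Returns the log-normalized
    matrix, matching scanpy.pp.normalize_total + scanpy.pp.log1p."""
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1  # avoid division by zero for empty barcodes
    scaled = counts / totals * target_sum
    return np.log1p(scaled)
```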
Feature selection
Gene selection was performed by identifying the highly variable genes (HVGs). Before doing so, we excluded genes related to the mitochondrial and ribosomal organelles. Also, we skipped T cell and B cell receptor (TCR/BCR) genes, including joining and variable regions, because they are not useful to describe cell identities but, rather, capture patient-specific clonally expanded cell populations within an inflammatory-related condition. Lastly, we excluded major histocompatibility complex (MHC) genes. To reduce the influence of a study’s specific composition and prevent biases in the gene selection task, we preferred genes that are highly variable in as many studies as possible. Therefore, similar to Sikkema et al.74, we first considered each study independently and computed the HVGs using the Seurat implementation75 (scanpy.pp.highly_variable_genes(min_disp = 0.3, min_mean = 0.01, max_mean = 4)). Then, we ranked genes based on the number of studies in which they are among the highly variable. Finally, for the initial stage, we determined the minimum number of studies required to compose an HVG list of more than 3,000 genes. Applying this strategy, we selected a total of 3,236 genes that were highly variable in at least six studies. In the following steps, we required more than 2,000 HVGs; the minimum number of studies required and the total of selected genes depend on the step and the cells under study. To identify red blood cells (RBCs) and platelets, we kept genes associated with erythrocytes, such as hemoglobin subunits, and pro-platelet basic protein (PPBP, platelet related) in the HVG list. Because such genes are known to be related to ambient RNA when found in other cell types, we subsequently removed them after having annotated the above cell types.
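The cross-study consensus can be sketched as follows (the per-study HVG sets would come from the scanpy call above; the function itself is illustrative):

```python
# Sketch of the cross-study HVG consensus: given per-study HVG lists, rank
# genes by how many studies flag them, then pick the largest study-count
# threshold whose gene list still exceeds the requested size.

from collections import Counter

def consensus_hvgs(per_study_hvgs, min_list_size):
    """per_study_hvgs: list of per-study HVG name sets. Returns
    (selected_genes, min_studies) for the strictest threshold whose list
    exceeds min_list_size genes."""
    support = Counter(g for hvgs in per_study_hvgs for g in set(hvgs))
    for n_studies in range(len(per_study_hvgs), 0, -1):
        selected = [g for g, n in support.items() if n >= n_studies]
        if len(selected) > min_list_size:
            return sorted(selected), n_studies
    return sorted(support), 1  # fall back to the union of all HVGs
```

In the initial stage described above, min_list_size would be 3,000 and the procedure lands on a threshold of six studies.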
Data integration
Our dataset includes single-cell data obtained from multiple studies including different chemistry protocols, inflammatory status samples and a broad range of other clinical features (for example, age and sex). Although this is a strength of our atlas, such high levels of heterogeneity induced by technical confounding factors and unwanted biological variability resulted in challenging integration tasks before clustering and annotating cell populations. Therefore, we employed scVI14, a variational autoencoder (VAE) approach that has proven to be one of the most effective integration methods in complex scenarios, particularly when annotation information is missing16. scVI takes as input the raw count matrix to generate an integrated, low-dimensional embedding space, where the cell states are preserved and the batch effects are reduced. Moreover, scVI’s embedding space can be exploited to cluster and annotate cells based on either known or cluster-specific marker genes. Details on the scVI parameters used in each annotation step can be found in Supplementary Table 7.
Cell clustering
To cluster cells into cell types with the Leiden algorithm, we first built the k-nearest neighbors (KNN) graph using scVI’s latent embeddings and k = 20 as the number of neighbors (scanpy.pp.neighbors(n_neighbors = k)). We then applied the Leiden algorithm using a resolution of r = 0.1 (scanpy.tl.leiden(resolution = r)). The k and r used in every other step for every lineage can be found in Supplementary Table 3.
Cell annotation
Cell clusters were manually annotated by immunology experts by comparing the expression levels of canonical gene markers. Moreover, the final annotation step used the cluster markers obtained from a differential expression analysis among clusters (Supplementary Table 3). First, we ranked genes to characterize each cluster (scanpy.tl.rank_genes_groups()), considering normalized RNA counts with the Wilcoxon rank-sum test. Then, we selected the genes with log2 fold change (log2FC) > 0.25 and false discovery rate (FDR)-adjusted P < 0.05.
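The marker filter can be sketched as follows; the Benjamini−Hochberg adjustment is written out explicitly, and the 0.05 FDR cutoff is the conventional choice assumed here:

```python
# Sketch of the marker-selection filter: BH adjustment of the per-gene
# Wilcoxon P values, then the log2FC/FDR cutoffs stated in the text.

def benjamini_hochberg(pvals):
    """Return BH-adjusted P values, in the same order as the input."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    prev = 1.0
    for rank, i in enumerate(reversed(order)):
        k = m - rank  # 1-based rank of this P value (largest first)
        prev = min(prev, pvals[i] * m / k)
        adjusted[i] = prev
    return adjusted

def select_markers(genes, pvals, log2fc, max_fdr=0.05, min_lfc=0.25):
    """Keep genes passing both the FDR and log2 fold-change cutoffs."""
    fdr = benjamini_hochberg(pvals)
    return [g for g, q, l in zip(genes, fdr, log2fc)
            if q < max_fdr and l > min_lfc]
```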
External annotation validation
We compared our independent annotations with the ones available in the largest public datasets. To quantify the overlap of cells among groups, we computed the adjusted Rand index (ARI) to measure the similarity between our label assignments and the ones performed by the original authors. Further details are available in Supplementary Table 2.
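For reference, the ARI can be computed from scratch from the two label vectors (this is equivalent to sklearn.metrics.adjusted_rand_score):

```python
# Adjusted Rand index between two partitions of the same cells, computed
# from the contingency table of label co-occurrences.

from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """labels_a, labels_b: equal-length sequences of cluster labels."""
    n = len(labels_a)
    contingency = Counter(zip(labels_a, labels_b))
    sum_ij = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # both partitions trivial
        return 1.0
    return (sum_ij - expected) / (max_index - expected)
```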
Centralized Dataset annotation
All previously described steps were applied to process and annotate the Centralized Dataset (SCGT00), with the following adjustments: (1) standard HVG selection was performed as the dataset included only a single study; (2) the dataset was integrated using ‘patientPool’ as the batch key; and (3) cell annotation was conducted up to Level 1, recovering the same cell types as in the main Inflammation Atlas, as this was necessary for the patient classifier. Here, starting with 855,417 cells and 152 patients included in the reference dataset, we recovered 15 cell populations (Level 1), excluding Doublets and LowQuality_cells. Details on the scVI parameters used in each annotation step can be found in Supplementary Table 7, whereas details on the clustering and annotation steps are provided in Supplementary Table 3.
Feature selection after annotation
Gene selection
To improve the quality of the downstream analyses characterizing the inflammation landscape, we performed a gene selection to remove dataset-specific genes and reduce the batch effect. First, we performed data normalization (as described above), kept only the genes that are expressed (raw count > 0) in at least one cell in each study and removed mitochondrial, ribosomal, TCR/BCR, MHC, hemoglobin and platelet-related genes. This step retained a total of 14,127 genes. Then, we identified three sets of genes: (1) the HVGs; (2) the differentially expressed genes (DEGs) between healthy and each inflammatory status; and (3) Cytopus76, a manually curated immune-specific gene list.
HVGs
Similar to the feature selection approach described in the annotation section, we selected a total of 3,283 HVGs, requiring a final list of more than 3,000 genes. In practice, we first ranked the genes based on the number of studies in which they were concurrently highly variable (scanpy.pp.highly_variable_genes(min_disp = 0.3, min_mean = 0.01, max_mean = 4, batch_key = ’libraryID’)) and then set the minimum number of studies to five.
DEGs between healthy and each disease
We obtained a list of DEGs after grouping single-cell gene expression profiles into pseudobulks. We first combined the expression profiles of individual cells to produce pseudobulks for every patient and cell type (Level 1), removing groups with fewer than 20 cells, using the Python implementation of decoupleR77 (version 1.6.0) (decoupler.get_pseudobulk(min_cells = 20, sample_col = ’sampleID’, groups_col = ‘Level1’, layer = ‘counts’, mode = ‘sum’)). Then, we applied the edgeR (version 4.0.16) quasi-likelihood functions to search for DEGs between healthy patients and each inflammatory condition, considering one cell type at a time. Because not all the cell types were detected in each patient, we did not perform the pairwise comparison if one disease had fewer than three pseudobulks. In more detail, for each pairwise comparison, we first removed genes with a low expression value (filterByExpr(y, group = disease)). Second, we normalized the aggregated raw counts by library size (calcNormFactors(y, logratioTrim = 0.3)). Third, we corrected for the main confounding factors—that is, chemistry protocol, sex and binned age—considering an additive model. One patient was excluded from the analysis due to missing age information. We defined the design of our comparison using the following patsy-style (https://patsy.readthedocs.io/en/latest/formulas.html) formula: ‘~0 + C(disease) + C(chemistry) + C(sex) + C(binned_age)’. Fourth, we estimated a negative binomial dispersion for each gene using estimateDisp(), which we fed into a gene-wise negative binomial generalized linear model (glmQLFit(robust = TRUE)) to test for DEGs with a quasi-likelihood F-test (glmQLFTest()). Lastly, results obtained from each comparison were merged together, and the F-test P values were corrected using the Benjamini−Hochberg FDR procedure implemented in R (p.adjust(method = ‘BH’)). Given the corrected P values and the log2FC, we selected 6,868 DEGs with P < 0.05 and log2FC > 1.5.
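The pseudobulk aggregation can be sketched without decoupleR as a plain sum of raw counts over cells per patient and cell type, mirroring the min_cells = 20 filter (function and argument names below are ours):

```python
# Sketch of the pseudobulk step: sum raw counts over all cells sharing a
# (sampleID, Level1 cell type) pair and drop groups with fewer than
# min_cells cells, as in decoupler.get_pseudobulk(mode='sum').

import numpy as np

def pseudobulk(counts, sample_ids, cell_types, min_cells=20):
    """counts: cells x genes array; sample_ids and cell_types: per-cell
    labels. Returns {(sample, cell_type): summed gene vector} for groups
    with at least min_cells cells."""
    groups = {}
    for i, key in enumerate(zip(sample_ids, cell_types)):
        groups.setdefault(key, []).append(i)
    return {key: counts[idx].sum(axis=0)
            for key, idx in groups.items() if len(idx) >= min_cells}
```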
Curated immune-specific genes
To be able to track the full spectrum of inflammatory processes, including immune activation and progression, we curated nine inflammation-related functions defined in the literature78,79,80,81,82,83,84 (1,364 genes present in our dataset; Supplementary Table 4) and complemented them with a published list of cell-type-specific signatures derived from immunological knowledge based on single-cell studies (Cytopus76). Specifically, we retrieved all global gene sets for the leukocyte category and the following inflammatory-related cell-type-specific factors: naive and non-naive CD4 T cells (CD4T_TFH_UP, CD4T_TH1_UP, CD4T_TH2_UP, CD4T_TH17_UP, Tregs_FOXP3_stabilization); naive and non-naive CD8 T cells (CD8T_exhaustion, CD8T_tcr_activation); B cells (B_effector); monocytes (IFNG response, IL4-IL13 response); and dendritic cells (dendritic cell antigen crosspresentation).
Aggregation of gene sets
We generated the final gene set as the union of HVGs, DEGs and the manually curated list, resulting in 8,253 unique genes.
Dataset integration and gene expression correction via scANVI
Atlas-level analysis requires a careful preprocessing of the gene expression profiles to deal with the heterogeneity of the studies, the batch effect and the missing or noisy observations. scANVI15 is one of the existing methods capable of addressing these challenges and has been proven effective on atlas-level benchmarks compared to other integration methods. We validated its performance on our data by using the metrics from the scib-metrics package16 (version 0.5) (Extended Data Fig. 1).
scANVI integration
scANVI is an extension of the scVI model, employed previously for data integration, that also leverages the information of the cell type annotation. We first trained an scVI VAE (scvi.model.SCVI) and then trained scANVI (scvi.model.SCANVI) starting from the pretrained scVI model (see parameters in Supplementary Table 7). Both models corrected for the chemistry batch while also considering libraryID, studyID, sex and binned age as covariates. After training, we generated the normalized corrected counts by sampling from scANVI’s negative binomial posterior (SCANVI.get_normalized_expression). The batch effect was mitigated by sampling and averaging each cell’s expression as if it originated from each chemistry protocol by setting the transform_batch parameter to the list of chemistry protocols present in our atlas.
Comparison of cell type composition
To estimate changes in the proportions of cell populations across conditions, we applied the scCODA package85 (version 0.1.9), a Bayesian modeling tool that takes into account the compositional nature of the data to reduce the risk of false discoveries. scCODA allows us to infer changes between conditions, corresponding to the disease status in our setting, while considering other covariates. scCODA searches for changes between a reference cell type, assumed to be constant across conditions, and the other cell types. We selected as the reference population the one showing the lowest variance across conditions, excluding rare cell populations (that is, progenitors, plasmacytoid dendritic cells and cycling cells). This resulted in the selection of dendritic cells as the reference cell type for all diseases. scCODA takes as input the count of cells belonging to each cell type in each patient and returns the list of cell type proportion changes with the corresponding corrected P values (through the FDR procedure). A patsy-style formula was used to build the covariate matrix, specified with ‘healthy’ as baseline and sex and binned age as covariates (C(disease, Treatment(‘healthy’)) + C(sex) + C(binned_age)), because we are interested in detecting changes between a normal and a diseased status. We reported only changes with a corrected P < 0.05 and log2FC > 0.2.
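The patients × cell types count matrix that scCODA expects can be built with a simple cross-tabulation; the sketch below uses invented patient and cell type labels and omits the Bayesian model fitting itself:

```python
import pandas as pd

# Toy per-cell annotations: one row per cell
cells = pd.DataFrame({
    "sampleID": ["P1", "P1", "P1", "P2", "P2", "P3"],
    "Level1":   ["T", "T", "DC", "B", "DC", "T"],
})

# One row per patient, one column per cell type, values = cell counts;
# this matrix (plus per-patient covariates) is scCODA's input
composition = pd.crosstab(cells["sampleID"], cells["Level1"])
print(composition)
```

Each row is a composition; scCODA then models proportion shifts relative to the chosen reference cell type (dendritic cells, in the atlas).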
Comparison of gene expression profiles
Gene factor inference
To expand the list of curated immune-related genes following a data-driven approach, we employed Spectra22 (version 0.2.0), a matrix factorization algorithm that identifies minimal sets of genes related to specific functions in the data—that is, factors. Spectra takes cell type labels as input to infer global and cell-type-specific factors that decompose the overall gene expression matrix and each cell type’s submatrix, respectively. Given our list of curated gene sets, we considered the Cytopus ones as global factors, whereas we regarded all the remaining ones as cell-type-specific factors. The Spectra model was fitted with default parameters, with the exception of λ, which was set to 0.001. Considering the prohibitive computational resources required to apply Spectra to our single-cell data, we fed the algorithm the metacell aggregated expression matrix, as described in the paragraph below. Spectra returned a list of 135 factors that are linear combinations of the gene expression from the original matrix. The coefficients included in the matrix can then be used as a proxy of each gene’s relevance in a given factor.
Metacell generation
We generated metacells using SEACells86 (version 0.3.3), which aggregates cells by exploiting their distances in a low-dimensional embedding space. Starting from the normalized data, we selected the top 3,000 HVGs using the highly_variable_genes function in SCANPY, with the Seurat flavor. To define SEACells’ input embedding space, we calculated the first 50 principal components and selected those that explain 90% of the total variance observed. To avoid biases due to batch effect and other confounding factors, we executed SEACells for each sample independently. In particular, we generated a number of metacells equal to the number of cells of each patient divided by 50. We further filtered the obtained metacells by computing the proportion of the most abundant annotation label (Level 1) in each SEACells group and removing those with a purity lower than 0.75. Overall, we defined 71,108 metacells. Given the assignment of cells to each metacell, we generated each metacell’s gene expression profile by averaging the corresponding cells’ scANVI normalized and corrected expression profiles. Because scANVI returns counts sampled from a negative binomial distribution, we also log scaled the obtained metacell profiles.
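The purity filtering of metacells can be illustrated with a small pandas sketch (toy metacell assignments; the SEACells aggregation itself is omitted):

```python
import pandas as pd

# Toy assignment of cells to metacells, with each cell's Level 1 label
assign = pd.DataFrame({
    "metacell": ["mc0"] * 4 + ["mc1"] * 4,
    "Level1":   ["T", "T", "T", "B", "T", "B", "B", "NK"],
})

# Purity = proportion of the most abundant label within each metacell
purity = (assign.groupby("metacell")["Level1"]
                .agg(lambda s: s.value_counts(normalize=True).iloc[0]))

PURITY_THRESHOLD = 0.75  # metacells below this purity are discarded
kept = purity[purity >= PURITY_THRESHOLD].index.tolist()
print(purity.round(2).to_dict(), kept)
```

Here mc0 (3 of 4 cells are T cells, purity 0.75) is kept, while the mixed mc1 (purity 0.5) is removed.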
Inflammation-related signature definition
Spectra provided a total of 135 factors, each including a refined gene list for one of the gene sets we used as input. We therefore needed to assign these factors back to the original gene sets to retrieve the corresponding biological function. To do so, we performed enrichment analysis with univariate linear models (ULMs), available in the Python implementation of decoupleR77, to estimate the factors associated with each gene set. The gene coefficients returned by Spectra were considered as the response variable, with a vector of weights (1 if the gene was included in the gene set, 0 otherwise) serving as the explanatory variable. ULM returns an estimate and a P value for each enrichment. We corrected those P values for multiple comparisons by computing the FDR with the Benjamini−Hochberg procedure implemented in the scipy (version 1.12.0) library. We kept 125 factors with a positive estimate and an adjusted P value < 0.05.
Inflammation-related signature scores
To compare immune-relevant activation profiles across diseases and cell types, we applied an enrichment signature scoring procedure, considering the factors obtained with Spectra22. First, we generated pseudobulks stratified by cell type (Level 1 or Level 2) and patient, discarding groups with fewer than 10 cells. We averaged the scANVI-corrected gene expression matrix of the cells belonging to a given pseudobulk and then log transformed and scaled the expression values to zero mean and unit variance, to reduce the impact of highly expressed genes. We fitted decoupleR’s ULM77 by considering pseudobulk expression profiles as the response variable and the gene coefficients returned by Spectra as the explanatory one. We assessed the scores for the 119 cell-type-specific factors only in their corresponding cell type. The output of the model is a Studentʼs t-statistic for each combination of pseudobulk and factor, which is used as a proxy for the corresponding biological function activity: positive values are associated with more active functions in a given sample and vice versa. To identify the upregulated or downregulated biological functions across inflammatory conditions, we compared the activation scores between healthy and each disease, considering only comparisons that include at least three observations in both conditions. To take into account the batch effect induced by studies and chemistry protocols that still affects the data (Extended Data Fig. 1b,c; scib-metrics and principal component analysis (PCA)), we applied a LMEM. In particular, we fitted the function mixedlm() from the statsmodels Python library (version 0.14.1) with the following formula: f"Q('{factor}') ~ C(disease, Treatment(reference='healthy')) + C(chemistry)", grouping by ‘studyID’. We corrected the obtained P values for multiple testing using FDR, considering all the comparisons when tested at Level 1 and within each Level 1 population when tested at Level 2.
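The ULM scoring idea, a univariate regression of an expression profile on a factor's gene coefficients whose slope t-statistic serves as the activity score, can be sketched with scipy on synthetic data; this mimics the logic rather than decoupleR's exact implementation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Toy setup: 50 genes; the factor assigns positive coefficients to 10 of them
n_genes = 50
factor_coefs = np.zeros(n_genes)
factor_coefs[:10] = rng.uniform(0.5, 1.0, 10)

# A pseudobulk profile where the factor's genes are upregulated
expr = rng.normal(0, 1, n_genes)
expr[:10] += 2.0

# ULM: regress the expression profile on the factor coefficients;
# the slope's t-statistic is used as the activity score
res = stats.linregress(factor_coefs, expr)
t_stat = res.slope / res.stderr
print(round(t_stat, 2))
```

Because the profile is enriched in the factor's genes, the slope is positive and the t-statistic indicates an active function in this sample.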
GRN analysis
Pseudobulk matrices were calculated by averaging the corrected and standardized count matrices by cell type and sample. We computed differential expression for each cell type in each disease using healthy individuals as reference. LMEMs were used to model the expression levels of each gene independently, considering the disease as a fixed effect while modeling the study ID as a random effect. We used the mixedlm() function of the statsmodels (version 0.14.0) Python package to run the analysis. To associate each cell-type-specific ‘IFN-induced’ factor with a given transcription factor regulator, we integrated these signatures with the CollecTRI Gene Regulatory Network87 by matching target genes to identify the genes shared between transcription factor regulons and Spectra signatures. Each ‘IFN-induced’ signature was thus linked to a subset of transcription factor regulons. The activity of each transcription factor was calculated using only the genes shared between each transcription factor regulon and each signature, employing the ULM from decoupleR77 and the z-values obtained from the differential expression analysis. To ensure robustness, only regulons with at least 10 gene targets were considered. This pipeline was applied across ‘IFN-induced’ factors and diseases, focusing on the activity in the cell type where the Spectra signature was identified. Negative activities (t-statistic < 0) and non-significant activities (P > 0.05) were filtered out. This analysis identified STAT1 and SP1 as the sole transcription factor regulators of the defined cell types. We performed one-versus-all Wilcoxon rank-sum tests to compare transcription factor activity across Level 2 subpopulations within each Level 1 lineage for SLE and Flu. For each transcription factor (SP1 and STAT1), activity within a given Level 2 state was compared to all other states within the same Level 1 compartment. Tests were two-sided and restricted to comparisons with at least three observations per group.
P values were adjusted using the Benjamini−Hochberg method. The same approach was applied to monocytes, comparing transcription factor activity across diseases within each Level 2 monocyte subset.
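A minimal statsmodels sketch of such an LMEM, with disease as a fixed effect and study as a random intercept, fitted on synthetic data (the gene, study labels and effect sizes are invented for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)

# Toy pseudobulk expression of one gene across patients from 4 studies,
# with study-specific offsets and a true upregulation in disease
n = 80
df = pd.DataFrame({
    "studyID": rng.choice(["S1", "S2", "S3", "S4"], n),
    "disease": rng.choice(["healthy", "SLE"], n),
})
study_effect = df["studyID"].map({"S1": 0.0, "S2": 0.5, "S3": -0.5, "S4": 1.0})
df["expr"] = (study_effect
              + np.where(df["disease"] == "SLE", 1.5, 0.0)
              + rng.normal(0, 0.5, n))

# Disease as fixed effect, studyID as random intercept
model = smf.mixedlm("expr ~ C(disease, Treatment(reference='healthy'))",
                    df, groups=df["studyID"])
fit = model.fit()
coef = fit.params.filter(like="[T.SLE]").iloc[0]
print(round(coef, 2))
```

The fitted disease coefficient recovers the simulated effect while the random intercept absorbs the study-level shifts.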
For the comparison of flare and non-flare patients with SLE, the non-corrected log1p-normalized single-cell expression matrix from Perez et al.17 was used to further investigate SP1 and STAT1 regulon activities across both categories. Pseudobulk profiles were calculated by averaging by cell type, considering only cell types (Level 2) with a minimum of 10 cells and groups that included at least three patients in both conditions (flare versus non-flare). Before calculating transcription factor activities across samples, we standardized the gene expression data of patients with SLE based on healthy individuals: for each gene, the mean and standard deviation were calculated from the healthy group, and these statistics were then used to scale the gene expression values across patients with SLE. Only the gene targets identified in the previous step were used to calculate enrichment using the ULM method. Finally, the activity of STAT1 and SP1 was calculated at Level 2 using CollecTRI87.
Immune gene importance evaluation
In this section, we introduce our pipeline used to obtain a gene importance metric by interpreting cell-type-specific classifiers for disease prediction. All the steps described below were carried out separately for each cell type (excluding RBCs, platelets, progenitors and cycling cells). Specifically, the classification task was performed with GBDTs implemented in the XGBoost library88 (py-xgboost-gpu: version 2.0.3). Furthermore, interpretability was performed using SHAP values35 (version 0.45.1), a powerful approach assigning an importance to each gene by also taking into account their interactions.
Feature selection
To focus our analysis on cell-type-specific inflammatory-related signatures, we considered only genes relevant in annotated Spectra factors, and we further reduced the list by removing cell identity genes (for example, CD3E and MS4A1) as well as non-protein-coding genes. This filtering gave a final number of 935 genes.
Data processing
We split our data into three parts: the training set plus the validation and testing sets, used for hyperparameter tuning and performance evaluation, respectively. We balanced the splits by disease while ensuring that each sample’s cells were included in the same set. Initially, we partitioned the data into five splits using the function sklearn.model_selection.StratifiedGroupKFold. Three of these splits were assigned to the training set, one was designated for validation and one for testing. Accounting for both stratification by disease and patient-level partitioning might lead to an uneven distribution of cells among diseases. To address this, we first assigned the splits with the most balanced distribution of cells to the training and testing sets.
XGBoost fitting
XGBoost (xgboost.XGBClassifier) hyperparameters were tuned using the Optuna library89 (version 3.6.0). The performance of each model configuration was estimated using the WF1 score on the validation set. To reduce the computational cost, we both pruned unpromising configurations and stopped training early when no improvement was achieved for 20 steps, up to an upper bound of 1,500 steps. We considered 50 configurations of XGBoost, taking into account the hyperparameters detailed in Supplementary Table 7. Using the best configuration and its corresponding number of training steps (equivalent to the number of estimators), we retrained the best model on the union of the training and validation sets. This time, we did not apply early stopping and increased the number of training steps by 20% to account for the larger number of training samples.
d-SHAP interpretability
To interpret the decisions of the selected XGBoost classifier, we employed the widely adopted Shapley values through the SHAP library. SHAP values were computed with shap.TreeExplainer using the observational ‘tree_path_dependent’ approach. Given the potentially resource-intensive nature of handling all SHAP values for every cell and disease, especially in terms of storage, we computed their mean and variance across all samples in batches using the Welford online algorithm90. Given a specific cell type ct, we have a SHAP value for every gene in every cell and for each disease: a matrix of real values $\mathrm{SHAP}^{\mathrm{ct}}(c,g,d)$, where c, g and d identify the cell, gene and disease, respectively. The average contribution of a gene g for a disease d can be computed as $\text{d-SHAP}^{\mathrm{ct}}(g,d)=\mathrm{mean}_{c\in C}\,|\mathrm{SHAP}^{\mathrm{ct}}(c,g,d)|$, where C is the set of cells. To aggregate the d-SHAP values across multiple diseases—for example, the ones included in the same study—we summed their values across genes.
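Welford's online algorithm, which lets the per-gene mean and variance of |SHAP| values be accumulated without holding the full matrices in memory, can be sketched in a few lines of Python (toy SHAP values):

```python
# Online mean/variance via Welford's algorithm: each value is folded into
# running statistics, so per-cell SHAP matrices never need to be stored
class Welford:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / self.n if self.n > 0 else float("nan")

# d-SHAP for one (gene, disease) pair: mean of |SHAP| across cells
shap_values = [0.5, -1.5, 2.0, -1.0]
w = Welford()
for v in shap_values:
    w.update(abs(v))
print(w.mean, w.variance)  # 1.25 0.3125
```

In practice one such accumulator is kept per (gene, disease) pair and updated batch by batch.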
Gene selection
To validate our ensemble of important genes through d-SHAP values, we tested whether our selection generalizes to unseen studies. First, we defined a gene set GS that included genes expressed in at least 5% of the cells. Then, for each of the eight conditions included in unseen studies, we selected from GS the top k genes ranked by d-SHAP importance. We then trained XGBoost in a nested cross-validation setting on data from unseen studies, performing both hyperparameter tuning and performance evaluation, using only one of the gene sets as input features. Next, we computed WF1 and BAS to test the performance considering k = 5, k = 10 and k = 20. Given our selection S of genes, with size |S|, we also tested 20 sets of |S| randomly selected genes from GS, excluding the ones in S (that is, not top ranked according to d-SHAP). Lastly, we compared the performance of the models trained on each gene set against XGBoost trained on the whole gene set GS. The analysis was repeated for each cell type independently.
Study classifier and s-SHAP values
To identify whether the gene importance is driven by study batch effects, we trained a separate classifier to predict the study instead of the disease. Feature selection, data processing and model fitting were done as explained above for disease classification, apart from the data split, where cells were stratified by study instead of disease. SHAP values were computed for each of the cell-type-specific study classifiers, resulting in an average contribution of a gene g for a study s: $\text{s-SHAP}^{\mathrm{ct}}(g,s)=\mathrm{mean}_{c\in C}\,|\mathrm{SHAP}^{\mathrm{ct}}(c,g,s)|$. Because diseases can be associated with multiple studies, we aggregated the s-SHAP values for study prediction by summing them across all studies that include the selected disease. This allowed us to compare the batch-related signal (s-SHAP) with the disease-related signal (d-SHAP).
Patient classifier pipeline
In this section, we describe the pipeline used to validate the Inflammation Atlas as a diagnostic tool. In the following analysis, the terms ‘patient’ and ‘sample’ are equivalent, because, after data splitting, we kept only one sample for each patient. The pipeline consists of (1) integrating an annotated reference dataset with data integration tools that provide batch-corrected embeddings, (2) mapping a query dataset into the reference to obtain its corrected embeddings, (3) transferring the cell annotation labels from the reference, (4) defining a patient embedding space and (5) training a classifier to predict the patient conditions from the embeddings.
Starting from a large annotated reference dataset, we applied four state-of-the-art integration methods, described below, to obtain a batch-corrected embedding. We considered different chemistry protocols as the main source of batch effect; thus, we corrected for the chemistry covariate. Then, an independent query dataset was mapped into the corrected embeddings. This step provides both batch correction and allows us to transfer cell annotation labels from the reference to the query dataset. To define patient-wise embeddings, we averaged each patient’s cell embeddings by cell type, resulting in an embedding for each cell type and each patient (30 embedded dimensions for scANVI main configuration; see Supplementary Table 7 for all the methods and configurations). To predict the inflammatory conditions of the patients in the independent query dataset, we fit one classifier for each cell type on the reference patient embeddings. Then, we predicted the inflammatory condition of the query patients by returning the most frequent condition among the predictions of every cell-type-specific classifier.
We validated our pipeline considering three different settings. In the first one, we performed a cross-validation on the Main dataset, where each left-out split is considered as a query dataset and the remaining as the reference. Moreover, we tested our diagnostic tool on data from unseen patients and unseen studies, this time using the whole Main as a reference.
Integration methods
In this section, we explain each data integration method, and the tested configurations of hyperparameters can be found in Extended Data Fig. 8 and Supplementary Table 7. Note that the scGen and Harmony/Symphony approaches generate one integrated dataset that is independent from the query data, whereas scANVI and scPoli require a fine-tuning of the reference model for a given query dataset.
scGen
scGen is defined by two main components: a VAE and a latent space arithmetic method. The VAE estimates a posterior distribution of latent variables through the encoder, from which we can reconstruct the expression matrices via the decoder (scGen_model.batch_removal()). Similar to commonly employed VAEs, scGen approximates the posterior through a variational distribution, modeled by the encoder and defined as a multivariate Gaussian. Once scGen’s VAE has been fitted on the reference dataset, latent space arithmetic is employed to correct for the batch effect induced by the chemistry protocol used. Within each cell type, scGen first selects the mean $\mu_{\max}$ of the most populated batch and then corrects each batch with mean $\mu_0$ by adding $\delta = \mu_{\max} - \mu_0$ to each cell’s embedding. Importantly, the cell type has to be inferred when not known. The final corrected count matrix corresponds to the count matrix generated from the arithmetic-corrected embeddings. Following scGen’s tutorials, we refer to the embeddings obtained by providing the corrected expression matrix as input as corrected embeddings. To apply scGen batch correction on the query dataset, we also need to infer the cell types of those cells. This step was performed through label transfer by nearest neighbors, following a similar approach employed in the Human Lung Cell Atlas74 and introduced in ref. 45. The idea is to employ (approximate) nearest neighbors through PyNNDescent91 (version 0.5.11) (pynndescent.NNDescent().prepare()) and infer the most probable cell type among the 10 nearest neighbors (pynndescent.NNDescent().query()) from the already annotated cells in the reference dataset. To account for the shape of the distribution of the neighbors, a Gaussian kernel was applied instead of using the Euclidean distance directly. The most probable nearest neighbor cell type is then assigned to annotate new cells.
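The Gaussian-kernel nearest-neighbor label transfer can be sketched with scikit-learn's NearestNeighbors as a stand-in for PyNNDescent (toy two-cluster reference embedding; the kernel bandwidth choice here is an illustrative assumption):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(4)

# Toy reference embedding: two annotated cell-type clusters
ref = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])
ref_labels = np.array(["T"] * 30 + ["B"] * 30)

# Query cell near the second cluster
query = np.array([[2.8, 3.1]])

# 10 nearest neighbors; weight each vote with a Gaussian kernel on distance
nn = NearestNeighbors(n_neighbors=10).fit(ref)
dist, idx = nn.kneighbors(query)
sigma = dist.mean() if dist.mean() > 0 else 1.0  # bandwidth: an assumption
weights = np.exp(-(dist[0] ** 2) / (2 * sigma ** 2))

# Sum the kernel weights per candidate label and take the most probable one
votes = {}
for w, lab in zip(weights, ref_labels[idx[0]]):
    votes[lab] = votes.get(lab, 0.0) + w
predicted = max(votes, key=votes.get)
print(predicted)  # "B"
```

Weighting by the kernel rather than raw counts lets nearby neighbors dominate when the neighborhood straddles two populations.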
scANVI
We first trained scVI and scANVI on the reference dataset, as in the dataset integration described above, and then fine-tuned the model on the query dataset. Regarding label transfer, we employed the scANVI predict() function with default parameters.
Harmony and Symphony
Harmony92 and Symphony93 are two related methods that integrate a reference and map a query dataset to it, respectively. Harmony takes a PCA embedding of cells as input, along with their batch covariates (chemistry). Next, the model represents cell states as soft clusters, where each cell identity is defined as a probabilistic assignment across clusters, with the aim of maximizing diversity among batches within those clusters. Cells are iteratively assigned soft-cluster memberships; those assignments are used as weights in a linear mixture model to remove confounding factors. The result is a new batch-corrected embedded space. The Symphony algorithm starts from the linear model parameters inferred by Harmony to map query cells onto the corresponding embedding space. First, it projects the query gene expression profiles into the same uncorrected low-dimensional space as the reference cells. Next, Symphony computes soft-cluster assignments for the query cells based on their proximity to the reference cluster centroids. Finally, Symphony employs the Harmony mixture model components to estimate and regress out batch effects from the query data. Importantly, the reference cell embedding remains stable during this mapping process. We transferred annotation labels from the reference to the query dataset by exploiting cell proximity in the embedding space using nearest neighbors through sklearn.neighbors.KNeighborsClassifier, the Symphony default choice (https://github.com/potulabe/symphonypy).
scPoli
In contrast to other integration methods such as scANVI, scPoli94 encodes the condition (chemistry and sampleID) as a learnable conditional embedding and characterizes each cell type as a prototype in the latent embedding to facilitate label transfer. In the reference building phase, we first pretrained the model on the reference dataset and its conditions and then fine-tuned it to optimize the prototypes. In the reference mapping phase, we froze the model and learned the new conditional embeddings belonging to the query dataset. Label transfer is performed by simply assigning the cell type of the closest prototype in the latent embedding space. All these methods belong to the scArches45 class scarches.models.scPoli.
Disease classifiers
Patient embeddings definition
After obtaining the corrected embedding from one of the data integration approaches described previously, we need to aggregate the cell-wise embeddings into patient-wise embeddings. We decided to group at the level of the cell types by computing the mean embedding across cells belonging to the same cell type and sampleID. Only for scPoli, we generated three different types of patient embeddings: the learned patient embeddings (sample), the averaged cell-wise latent embeddings (cell) and the concatenation of the two (cell&sample).
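The aggregation of cell-wise embeddings into patient-wise, cell-type-level embeddings is a groupby-mean, sketched here with pandas on toy data (dimensions and labels are invented):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)

# Toy corrected cell embeddings (4 dims) with patient and cell type labels
emb = pd.DataFrame(rng.normal(size=(6, 4)),
                   columns=[f"dim{i}" for i in range(4)])
emb["sampleID"] = ["P1", "P1", "P1", "P2", "P2", "P2"]
emb["cell_type"] = ["T", "T", "B", "T", "B", "B"]

# One embedding per (patient, cell type): the mean over that patient's cells
patient_emb = emb.groupby(["sampleID", "cell_type"]).mean()
print(patient_emb.shape)  # (4, 4)
```

Each row of the result is the input to one cell-type-specific classifier.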
Classifiers definition and hyperparameter tuning
In this phase, the aim is to train a classifier for each cell type on the patient-wise embeddings belonging to the reference dataset. We tested the following classifier types: sklearn.svm.LinearSVC, sklearn.svm.SVC and sklearn.neighbors.KNeighborsClassifier (sklearn version 1.4.1.post1). For each classifier type, we trained different configurations defined in Supplementary Table 7 and evaluated their performance using a five-fold cross-validation on the reference patient embeddings. Similar to what we did to optimize the XGBoost classifier when estimating the immune gene importance, we employed the Optuna library to perform the hyperparameter tuning for each classifier. The best hyperparameter combination was selected according to the WF1 score independently of the cell type.
Majority voting and evaluation
The best classifier type according to the average performance over all cell types is then used to train from scratch the corresponding classifier on the whole reference patient embedding. The predicted condition (disease) for a patient is simply the majority vote among the classifiers. In case of a tie between different conditions, we conservatively rejected the prediction of the classifiers. Then, the overall metrics (WF1 score, BAS and Matthews correlation coefficient (MCC)) and the disease-wise metrics (Precision, Recall, BAS and F1 score) were computed by comparing the inflammatory conditions predicted by each classifier type in the query dataset with the available ground truth. All these metrics were computed with the sklearn Python library. When we refer to the weighted version of a given metric, we use the average = ’weighted’ parameter to take into account the imbalance of the inflammatory condition observations.
Note that if a given query patient does not have any cells annotated for a given cell type, the corresponding prediction was set as ‘Not Available’. This label was not taken into account during the majority voting procedure and was considered a wrong prediction when evaluating the performance of that cell type.
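The majority voting with conservative tie rejection might look like the following sketch (the ‘Unassigned’ return label is a hypothetical placeholder for a rejected prediction):

```python
from collections import Counter

def majority_vote(predictions):
    """Aggregate per-cell-type predictions for one patient.

    'Not Available' entries are ignored during voting;
    ties are conservatively rejected ('Unassigned').
    """
    votes = Counter(p for p in predictions if p != "Not Available")
    if not votes:
        return "Unassigned"
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "Unassigned"  # tie between top conditions: reject
    return ranked[0][0]

print(majority_vote(["SLE", "SLE", "healthy", "Not Available"]))  # clear majority
print(majority_vote(["SLE", "healthy"]))                          # tie
```

During evaluation, the skipped ‘Not Available’ predictions would still count against the corresponding cell-type classifier.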
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.







