We aim to investigate whether transcriptomic differences in 2n nuclei isolated from hepatocytes of different cellular ploidy could be found. To this end, intact hepatocytes were FACS sorted based on their cellular ploidy, followed by nuclei isolation and the downstream snRNA-seq2 pipeline.
Analysis of cellular and nuclear ploidy using snRNAseq2 and a “double sorting strategy”.
Data Acquisition and Preprocessing: Paired-end RNA sequencing data were aligned using the Spliced Transcripts Alignment to a Reference (STAR) software (version 2.7.1a), a commonly used tool for aligning high-throughput RNA-seq data. A pre-built STAR index, generated from a reference genome with appended spike-in sequences (ERCC92), served as the genomic scaffold for the alignment. Following the alignment process, each sample's resulting BAM files were sorted by genomic coordinates and available for downstream analysis. PCR duplicates resulting from library preparation and sequencing were identified and removed from the aligned BAM files using the Picard tools suite (version: 3.1.1). Picard MarkDuplicates was employed for its efficacy in pinpointing and flagging duplicate reads. For gene expression quantification, we used the htseq-count tool from the HTSeq framework (version: 0.11.3). Gene expression levels were quantified in terms of read counts mapped to each gene. The output was a matrix of raw read counts, which could be used for subsequent normalization and differential expression analysis. By encompassing ERCC spike-ins within the
quantification process, we incorporated a control for technical variability which could be used to assess data quality and normalization efficiency in downstream analyses.
System and Package Configuration: R (version 4.0.3; R Core Team, 2020) and RStudio were used for all analyses. Packages required for single-cell RNA-seq (scRNA-seq) data analysis, including Seurat, Signac, dplyr, and others, were installed from CRAN and Bioconductor repositories. Libraries specific to mouse genome annotation such as EnsDb.Mmusculus.v79 and org.Mm.eg.db were installed from AnnotationHub. Parallel processing was enabled using the future and future.apply packages to maximize computational efficiency. The Seurat object assay version was set to 'v4' to ensure compatibility with subsequent analyses.
scRNA processing and visualization: The scRNA-seq count matrix was acquired from HTSeq-count and imported into R. Genome annotation files were processed to create a metadata frame for genes, including gene names, sources, and biotypes. Then, quality control was performed to ensure matching between raw count gene IDs and gene IDs in the genome annotation file, including spike-in ERCC controls. This step was critical in distinguishing endogenous genes from spike-ins later in the analysis. Transcript lengths for genes were computed using the annotation file, and TPM (Transcripts Per Million) normalization was conducted to accommodate differences in gene length and sequencing depth across samples. ERCC spike-in controls were used to calculate the proportion of ERCC in each cell, which served as a metric for technical variation. Cells with extreme values for gene counts, total counts, percentage of ERCC, or complexity were filtered out to ensure the quality of downstream analyses. A Seurat object was created, incorporating the TPM-normalized count data, relevant quality control metrics, and metadata. Additional metadata, such as ploidy status and age, were factored in for subgroup analyses. The Seurat object was processed, including identification of variable features, scaling, and regression of unwanted sources of variation (e.g., total counts, percentage ERCC). PCA was conducted to reduce data dimensionality. With the selected principal components, nearest neighbor graph construction and subsequent clustering were performed to identify distinct cell populations within the dataset. Both UMAP (Uniform Manifold Approximation and Projection) and t-SNE (t-Distributed Stochastic Neighbor Embedding) were applied for non-linear dimensionality reduction to visualize clusters. Cell populations were characterized and visualized based on metadata such as ploidy status and age.
Inter-Cluster Similarity Analysis: Jaccard correlation heatmaps were generated to examine the similarity between cell populations grouped by ploidy status. A combination of different resolutions and features were tested to achieve the most informative correlation patterns.
Data Acquisition and System Configuration: Raw methylation data were acquired from the BeadChip platform. For data processing and statistical analysis, we utilized the R software environment (version 3.6.1). We employed several R packages including SeSAM (https://github.com/zwdzwd/SeSAM) for preprocessing the IDAT files, ggpubr for data visualization, FactoMineR and factoextra for principal component analysis (PCA), and future and future.apply for parallel computation to increase the efficiency of data handling. Moreover, the Bioconductor package SummarizedExperiment was used to manage the assay data. A computation plan using 16 cores in a multisession parallelized setup was established to facilitate concurrent execution of tasks during data preprocessing.
Data Preprocessing and Quality Control (QC):The dataset was preprocessed using the SeSAM package. A custom preprocessing function included temperature, quantile color adjustment, dye-bias, and a peak-based correction (TQCDPB). Following the initial data loading, quality control (QC) steps were applied to each sample. QC was performed to assess signal quality by calculating detection P-values and signal intensity values. We also generated Quantile-Quantile (QQ) plots and Intensity-Beta value plots for each sample to further inspect and visualize data quality.
Principal Component Analysis (PCA):PCA was conducted on the batch-corrected beta values to capture the major sources of variation within the dataset. The PCA was executed using the FactoMineR package, retaining the first five principal components. The eigenvalues of the principal components were visualized to evaluate their contribution to the total variance. Subsequently, PCA plots were created using the factoextra package, incorporating genotype- and age-specific coloring schemes, to provide visual summaries of sample distributions in the context of age groups and genotypes.
Aging Prediction Analysis:Beta values were extracted for aging prediction analysis. The predictMouseAgeInMonth() function from SeSAM package was used to predict the age in months for
each mouse sample based on methylation patterns. Predicted ages were then compiled into a dataset (76).
Data Integration and Linear Regression Analysis: We merged the experimentally obtained chronological ages with the predicted ages derived from the methylation data. A linear model analysis was carried out to correlate predicted ages with chronological ages across different genotypes. We utilized ggpubr2 to visualize the results and extract information regarding the fit of the models in terms of equations and R-squared values as well as grouped boxplots.