The nf-core/scdownstream pipeline is an automated workflow designed to process already-quantified single-cell RNA sequencing data. Downstream analysis includes clustering, differential genes per cluster, gene set enrichment analysis, and cell-cell ligand-receptor interaction predictions.
NOTE: we substantially modified and re-structured the pipeline. Features and usage description will differ from the official nf-core/scdownstream pipeline, which is originally developed by Nico Trummer. The code of the UK DRI scdownstream pipeline is hosted in this github repository:
https://github.com/UKDRI/scdownstream/tree/dev_ukdri
The pipeline contains two entry points
qc_clustering: single cell specific QC, filtering, and clustering using several clustering resolutions. After inspecting UMAPs, you pick a clustering for the downstream analysis.downstream: donwstream analysis based on the selected clustering: differential genes per cluster, gene set enrichment analysis, cell-cell ligand-receptor interaction predictionSelect this entry point to perform QC and selecting a clustering resolution:
Create a samplesheet file named samplesheet_scdownstream.csv pointing to the filtered and unfiltered single-cell anndata object file:
sample,filtered,unfiltered
R6_1_Striatum_1,/data/ukdri/SCR/SCR_MM_2/processed/nfcore/scrnaseq/out/cellranger/mtx_conversions/R6_1_Striatum_1/R6_1_Striatum_1_filtered_matrix.h5ad,/data/ukdri/SCR/SCR_MM_2/processed/nfcore/scrnaseq/out/cellranger/mtx_conversions/R6_1_Striatum_1/R6_1_Striatum_1_raw_matrix.h5ad
R6_1_Striatum_2,/data/ukdri/SCR/SCR_MM_2/processed/nfcore/scrnaseq/out/cellranger/mtx_conversions/R6_1_Striatum_2/R6_1_Striatum_2_filtered_matrix.h5ad,/data/ukdri/SCR/SCR_MM_2/processed/nfcore/scrnaseq/out/cellranger/mtx_conversions/R6_1_Striatum_2/R6_1_Striatum_2_raw_matrix.h5ad
This should be the output from the nfcore:scrnaseq pipeline:
out/cellranger/mtx_conversions/SAMPLE_NAME/SAMPLE_NAME_filtered_matrix.h5adout/ cellranger/mtx_conversions/SAMPLE_NAME/SAMPLE_NAME_raw_matrix.h5adThe job template script file can be found here:
/nfsdata/scripts/job_scripts/run_scdownstream_qc_clustering.sh
To use the script, copy it to your dataset specific project folder and change the input files and parameters as desired.
As input, the pipeline needs:
samplesheet: path to a sample sheet, which specifies, which fastQ files belong to which sample (see more information below)resdir: path to the directory where results will be stored. The pipeline output will be stored in a subfolder called out# CHANGE INPUT FOLDER AND CREATE samplesheet
samplesheet=/data/${USER}/PROJECT_NAME/scdownstream/samplesheet_scdownstream.csv
# CHANGE RESULTS FOLDER
resdir=/data/${USER}/PROJECT_NAME/scdownstream/qc_clustering
outdir=$resdir/out
Always use full file paths to avoid any complications.
Per default, the pipeline is configured for mouse data. Change the species to human if required:
# CHANGE SPECIES IF NEEDED
species="mouse"
The pipeline is using CellTypist, pre-trained logistic regession models, for predicting cell types/states per cell. These predictions can be very insightful but should be taken with a grain of salt. Have a look at this webpage to select currently available models: https://www.celltypist.org/models
# CHANGE TO FITTING MODEL FOR ANNOTATING CELL TYPES
celltypist_mname="Mouse_Whole_Brain"
Currently, only scrublet is supported for doublet detection. We will add support for additional methods in the near future.
The pipeline is currently using decontX for ambient RNA correction. We will add more options in the near future.
Currently, only scvi is supported for sample integration. In addition to a PCA-based UMAP, a UMAP based on the scvi model latent space is computed. We will add support for additional methods in the near future.
Genes are filtered based on the minimum number of cells that express the gene. Adjust following parameter for changing the minimum number of cells:
--min_cells 5 \
We implemented a process for automatically determining QC thresholds and filtering cells based on outlier-detection, see Single-cell best practices.
Automated filtering is enabled per default. To disable it, remove the following line from the jobscript:
--automatic_cell_filtering true \
If automated filtering is disabled, you can add or adjust the thresholds for cell filtering manually:
Minimum and maximum number expressed genes per cell:
--min_genes 200 \
--min_genes 10000 \
Minimum and maximum total counts per cell:
--min_counts 0 \
--max_counts 50000\
Maximum percentage of counts that come from mitochondrial genes:
--max_mito_percentage 25 \
out/integrated_scvi_finalized.h5ad: anndata object comprising counts, embeddings, and clusteringsout/multiqc/multiqc_report.html: HTML report about single cell data QC metrics, doublet score distributions, and filtering thresholdsout/integrated_scvi_scdownstream_report.html: HTML report showing sample distributions and QC metrics in PCA and UMAPs, CellTypist prediction, and Leiden clusterings based on the different user-specified resolutionSelect this entry point once clusters are established and you are ready for downstream analysis:
The job template script file can be found here:
/nfsdata/scripts/job_scripts/run_scdownstream_downstream_analysis.sh
To use the script, copy it to your dataset specific project folder and change the input files and parameters as desired.
As input, thie entry point of the pipeline needs:
h5adf: an anndata object .5ad file. This should be the out put of the qc_clustering entry point out/integrated_scvi_finalized.h5adclustering: the selected clustering after running the qc_clustering entry point. This entry has be column in the obs cell metadata table of the h5adf anndata object file.name: a project/dataset name that will be used as prefix for the output filesresdir: path to the directory where results will be stored. The pipeline output will be stored in a subfolder called out# CHANGE NAME, will be used as file prefix
name="scdownstream"
# CHANGE INPUT_H5AD ANNDATA FILE
h5adf=/data/${USER}/PROJECT_NAME/scdownstream/qc_clustering/out/integrated_scvi_finalized.h5ad
# CHANGE SELECTED CLUSTERING
clustering="leiden_1.0"
# CHANGE RESULTS_FOLDER
resdir=/data/${USER}/PROJECT_NAME/scdownstream/downstream
outdir=$resdir/out
Per default, the pipeline is configured for mouse data. Change the species to human if required:
# CHANGE SPECIES IF NEEDED
species="mouse"
Differential genes are computed per cluster using the scanpy function scanpy.tl.rank_genes_groups using Wilcoxon rank-sum tests. Here, the gene expression in cells of a cluster is compared against the one in all other cells. We will update this step and add more methods in the near future.
Gene set enrichment analysis is performed using the scanpy g:Profiler wrapper function sc.queries.enrich.
For the gene set enrichment analysis, genes are filtered to be cluster specific based on:
enrich_min_in_group_fraction: the minimum fraction of cells in the cluster that express the geneenrich_min_in_group_fraction: the maximum fraction of cells outside of the cluster that express the geneenrich_min_fold_change: the minimum log2 fold-change between cells of the cluster and all other cellsTo adjust the criteria, you can add and adjust the following parameters when running the pipeline:
--enrich_min_in_group_fraction 0.25 \
--enrich_max_out_group_fraction 0.5 \
--enrich_min_fold_change 1.0 \
We will add additional methods and options for gene set enrichment analysis in the near future.
The pipeline is using LIANA+ for cell-cell ligand-receptor interaction predictions. LIANA+ is using known human ligand-receptor pairs as input. For mouse, the 1:1 human-mouse orthologs are used from the HUGO Gene Nomenclature Committee Comparison of Orthology Predictions (HCOP) to define the set of ligand-receptor pairs.
PROJECT_NAME_finalized.h5ad: anndata object .h5ad file that contains all downstream analysis results in uns entriesPROJECT_NAME_scdownstream_report.html: HTML analysis report that comprises the CellTypist predictions per cluster, differential genes, gene set enrichment per cluster, and predicted ligand-receptor predictions.Although we specified resource limits for the individual processes of the pipeline that should work for most dataset, you may have to adjust these for very large datasets that contain more then 500,000 cells.
You can adjust the slurm cluster resource limits per process by adding a custom nfcore-scdownstream.conf config file; for example:
process {
cache = true
withName: 'SCANPY_LOG_NORMALIZE' {
memory = { 225.GB }
}
withName: 'SCANPY_HVGS' {
memory = { 225.GB }
}
withName: 'SCANPY_RANKGENESGROUPS' {
queue = 'htc'
memory = { 225.GB }
time = {24.h * 7}
}
withName: 'SCANPY_ENRICH' {
memory = { 225.GB }
}
withLabel: 'process_medium' {
memory = { 150.GB * task.attempt * params.memory_scale }
}
withName: 'ADATA_ADD_OBS_OBSM' {
memory = { 225.GB }
}
withName: 'LIANA_RANKAGGREGATE' {
queue = 'htc'
memory = { 225.GB }
}
withName: 'SCANPY_EXPORT_MARKERS' {
queue = 'htc'
memory = { 225.GB }
}
withName: 'SCANPY_FILTER' {
memory = { 225.GB }
}
}
If you need more than 250G memory, change the queue to dynamic which allows up to 1T memory.
A custom Nextflow config file is invoked by added following parameter when running the pipeline:
-c /data/${USER}/PROJECT_NAME/scdownstream/nfcore-scdownstream.conf