The nf-core/scrnaseq pipeline is an automated workflow for analysing single-cell RNA sequencing (scRNA-seq) data. It takes raw sequencing files, performs quality control, aligns reads, and generates single-cell count matrices.
You can find the official usage documentation here: https://nf-co.re/scrnaseq/4.1.0/docs/usage/
Create a samplesheet, samplesheet.csv, that links the FASTQ files to samples. Follow this structure:
sample,fastq_1,fastq_2
Sample_1,/data/ukdri/SCR/SCR_MM_1/raw/fastq/Sample_1_GEX_Gut_1234_WT_S9_L001_R1_001.fastq.gz,/data/ukdri/SCR/SCR_MM_1/raw/fastq/Sample_1_GEX_Gut_1234_WT_S9_L001_R2_001.fastq.gz
Sample_2,/data/ukdri/SCR/SCR_MM_1/raw/fastq/Sample_2_GEX_Gut_1234_3KL_S11_L001_R1_001.fastq.gz,/data/ukdri/SCR/SCR_MM_1/raw/fastq/Sample_2_GEX_Gut_1234_3KL_S11_L001_R2_001.fastq.gz
You can add sample metadata, such as tissue or condition, as extra columns:
sample,fastq_1,fastq_2,tissue,condition
Sample_1,/data/ukdri/SCR/SCR_MM_1/raw/fastq/Sample_1_GEX_Gut_1234_WT_S9_L001_R1_001.fastq.gz,/data/ukdri/SCR/SCR_MM_1/raw/fastq/Sample_1_GEX_Gut_1234_WT_S9_L001_R2_001.fastq.gz,Gut,WT
Sample_2,/data/ukdri/SCR/SCR_MM_1/raw/fastq/Sample_2_GEX_Gut_1234_3KL_S11_L001_R1_001.fastq.gz,/data/ukdri/SCR/SCR_MM_1/raw/fastq/Sample_2_GEX_Gut_1234_3KL_S11_L001_R2_001.fastq.gz,Gut,3KL
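Before submitting a job, it can help to confirm that the samplesheet has the expected header and that every FASTQ path is absolute and exists. The following is a minimal sketch, not part of the pipeline or the job template; the function name check_samplesheet is hypothetical, and it assumes a simple comma-separated file without quoted fields.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not provided by nf-core): sanity-check a samplesheet.
check_samplesheet() {
    local sheet="$1" status=0 header sample fq1 fq2 rest fq
    # The first three columns must be sample,fastq_1,fastq_2; metadata columns may follow.
    header=$(head -n 1 "$sheet" | tr -d '\r')
    case "$header" in
        sample,fastq_1,fastq_2|sample,fastq_1,fastq_2,*) ;;
        *) echo "unexpected header: $header" >&2; status=1 ;;
    esac
    # Read the data rows; anything after fastq_2 lands in 'rest' and is ignored.
    while IFS=, read -r sample fq1 fq2 rest || [ -n "$sample" ]; do
        [ -z "$sample" ] && continue
        for fq in "$fq1" "$fq2"; do
            fq=${fq%$'\r'}   # tolerate Windows line endings
            case "$fq" in
                /*) ;;
                *) echo "$sample: not an absolute path: $fq" >&2; status=1 ;;
            esac
            [ -f "$fq" ] || { echo "$sample: file not found: $fq" >&2; status=1; }
        done
    done < <(tail -n +2 "$sheet")
    return "$status"
}
```

After sourcing the function, calling check_samplesheet with the path to samplesheet.csv returns a non-zero exit status and prints one message per problem found.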
We provide a job template script for submitting a Slurm job:
/nfsdata/scripts/job_scripts/run_nfcore_scrnaseq.sh
To use the script, copy it to your dataset-specific project folder and change the input files and parameters as needed.
As input, the pipeline needs:
samplesheet: path to the samplesheet, which specifies which FASTQ files belong to which sample (see the samplesheet description above)
resdir: path to the directory where results will be stored. The pipeline output is written to a subfolder called out
# CREATE AND CHANGE PATH TO SAMPLESHEET
samplesheet=/nfsdata/${USER}/PATH_TO_SAMPLE_SHEET
# CHANGE RESULTS_DIR TO YOUR FOLDER ON /data
resdir=/data/${USER}/RESULTS_DIR
outdir=$resdir/out
Always use full file paths to avoid any complications.
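If you only have a relative path at hand, it can be resolved to a full path before it goes into the job script. This is a small sketch that assumes GNU coreutils realpath is available on the cluster; the function name abs_path is hypothetical.

```shell
# Hypothetical helper: print the absolute form of a path.
# Assumes GNU coreutils 'realpath' is installed.
abs_path() {
    realpath -- "$1"
}
```

For example, running samplesheet=$(abs_path samplesheet.csv) from the folder that contains the file stores its full path in the variable.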
By default, the pipeline is configured to use the mouse mm39 genome and the Ensembl release 115 gene annotation. You can change the genome assembly and annotation files:
# CHANGE GENOME AND ANNOTATION IF NEEDED
gtf=/nfsdata/genome/ensembl/release-115/GRCm39/chrMus_musculus.GRCm39.115.chr.gtf.gz
genome_fasta=/nfsdata/genome/ucsc/mm39/mm39.fa.gz
# aligner options include star, kallisto and cellranger (see the usage documentation for the full list)
Genome assembly files and gene annotations are stored here:
/nfsdata/genome/ucsc/
/nfsdata/genome/ensembl/
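When switching to a different genome or annotation, a quick readability check on the chosen files can catch typos in the paths before the job is queued. A minimal sketch; the function name check_reference_files is hypothetical and it only tests that each file exists and is readable, not that the assembly and annotation match.

```shell
# Hypothetical helper: verify that each given reference file is readable.
check_reference_files() {
    local status=0 f
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            echo "missing or unreadable: $f" >&2
            status=1
        fi
    done
    return "$status"
}
```

With the variables from the job script set, check_reference_files "$gtf" "$genome_fasta" returns a non-zero exit status if either file cannot be read.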
If you use 10X Genomics data, we recommend using cellranger as the aligner and setting the sequencing protocol to auto.
# CHANGE ALIGNER AND SEQUENCING PROTOCOL IF NEEDED
aligner=cellranger
# sequencing protocol
protocol="auto"
For non-10X Genomics data, we recommend STARsolo:
aligner=star
Supported protocols are listed here: https://nf-co.re/scrnaseq/4.1.0/docs/usage/#support-for-different-scrna-seq-protocols. For example, for Smart-seq3:
aligner=star
protocol=smartseq
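For orientation, the variables set above are the kind of values that get passed to Nextflow as pipeline parameters. The sketch below prints such a command rather than running it. It is an assumption about how the variables could be wired up, not a copy of the template script: the function name build_nextflow_cmd and the singularity profile are illustrative, while --input, --outdir, --fasta, --gtf, --aligner and --protocol are parameters of nf-core/scrnaseq.

```shell
# Sketch only: the actual job template may call Nextflow differently.
# Expects samplesheet, outdir, genome_fasta, gtf, aligner and protocol to be set.
build_nextflow_cmd() {
    # Print the command instead of executing it, so it can be reviewed first.
    printf '%s ' nextflow run nf-core/scrnaseq -r 4.1.0 \
        -profile singularity \
        --input "$samplesheet" --outdir "$outdir" \
        --fasta "$genome_fasta" --gtf "$gtf" \
        --aligner "$aligner" --protocol "$protocol"
    printf '\n'
}
```

Printing the command first makes it easy to see which aligner and protocol a run will use before any cluster time is spent.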
Submit the job script to run the pipeline:
sbatch run_nfcore_scrnaseq.sh
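To keep track of the job after submission, the job ID returned by sbatch can be captured and passed to squeue. A minimal sketch, assuming the standard Slurm client tools are on your PATH; the function name submit_and_watch is hypothetical.

```shell
# Hypothetical helper: submit a job script and show its state in the queue.
submit_and_watch() {
    local script="$1" jobid
    # --parsable makes sbatch print only the job ID (optionally followed by ";cluster").
    jobid=$(sbatch --parsable "$script") || return 1
    jobid=${jobid%%;*}
    echo "submitted job $jobid"
    # List the job while it is pending or running.
    squeue -j "$jobid"
}
```

Calling submit_and_watch run_nfcore_scrnaseq.sh from the project folder submits the script and prints its queue entry.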
The pipeline generates organized folders inside your designated output directory. If you use cellranger:
out/fastqc/: Raw read quality control reports.
out/cellranger/: Complete Cell Ranger count matrices, web summaries, and BAM files.
out/multiqc/: A unified HTML report summarizing all sample and alignment metrics.
The Cell Ranger count matrices for each sample are located in these subfolders:
out/cellranger/count/SAMPLE_NAME/outs: cellranger count output
out/cellranger/mtx_conversions/SAMPLE_NAME/SAMPLE_NAME_filtered_matrix.h5ad: empty cell barcodes are removed
out/cellranger/mtx_conversions/SAMPLE_NAME/SAMPLE_NAME_raw_matrix.h5ad: contains all cell barcodes
out/cellranger/SAMPLE_NAME/cellbender_removebackground/: ambient-RNA-corrected counts after running CellBender
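Once a run has finished, the per-sample filtered matrices can be collected with a short find call. This sketch relies only on the folder layout described above; the function name list_filtered_matrices is hypothetical.

```shell
# Hypothetical helper: list the filtered h5ad matrix of every sample in an output directory.
list_filtered_matrices() {
    local outdir="$1"
    find "$outdir/cellranger/mtx_conversions" -type f -name '*_filtered_matrix.h5ad' | sort
}
```

Running list_filtered_matrices "$outdir" prints one path per sample, which is a convenient starting point for downstream analysis.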