The nf-core/differentialabundance pipeline is an automated workflow for statistical differential analysis. It ingests raw count matrices (e.g., from nf-core/rnaseq) or quantitative proteomics data, runs statistical comparisons, and exports interactive reports.
You can find the official usage documentation here: https://nf-co.re/differentialabundance/1.5.0/docs/usage/.
NOTE: We modified this pipeline so that it can process proteomics data produced by the UK DRI. This includes support for DIA-NN-generated count matrices and additional normalisation methods. These proteomics-related changes are not part of the official nf-core/differentialabundance usage description. You can find the code in our GitHub repository:
https://github.com/UKDRI/differentialabundance/tree/dev_ukdri
You require three distinct files to define your experiment:
The sample metadata file (metadata.csv) maps sample IDs to your experimental conditions:
sample,treatment,batch,sex
control_1,control,batch1,female
control_2,control,batch1,male
control_3,control,batch2,male
treatment_1,treated,batch1,female
treatment_2,treated,batch1,male
treatment_3,treated,batch2,male
You can specify any kind of categorical variable in the columns here.
The contrast file (contrasts.csv) defines the statistical comparisons you want to perform:
id,variable,reference,target,blocking
treated_vs_control,treatment,control,treated,
treated_vs_control_blocked,treatment,control,treated,batch;sex
id: the name of the contrast; it is used as the filename prefix and shown in the report.
variable: the variable or condition to be compared; any column in the metadata file.
reference: the group that the comparison is performed against (must be present as a value in the specified metadata column).
target: the group that you compare against the reference (must be present as a value in the specified metadata column).
blocking: confounding variables, such as batch or sex, to account for during differential expression analysis. For multiple variables, use ; as the separator, e.g. batch;sex.
The matrix is a tab-separated table containing raw gene counts or protein expression values. The sample names in the columns have to match the metadata.csv file. As a minimum, you need a gene_id column and the expression values per sample. Typically, you will use the output of the nf-core/rnaseq pipeline:
gene_id transcript_id(s) control_1 control_2 treatment_1 treatment_2
ENSMUSG00000000001 ENSMUST00000000001 480.00 297.00 353.00 324.00
ENSMUSG00000000028 ENSMUST00000000028,ENSMUST00000096990,ENSMUST00000115585,ENSMUST00000231819 56.00 12.00 9.00 15.00
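Before starting a run, it can help to check that the three files agree with each other. The following is a minimal shell sketch and not part of the pipeline: missing_samples and invalid_contrasts are hypothetical helper names. The code assumes simple CSV files without quoted fields, sample IDs in the first metadata column, a truly tab-separated matrix, and the contrast column order id,variable,reference,target shown above.

```shell
# List sample IDs from the metadata file that have no column in the matrix.
# Usage: missing_samples metadata.csv matrix.tsv
missing_samples() {
    awk '
        { sub(/\r$/, "") }                      # tolerate Windows line endings
        FILENAME == ARGV[1] {                   # first file: the matrix
            if (FNR == 1) {                     # only the header row matters
                n = split($0, header, "\t")
                for (i = 1; i <= n; i++) has_column[header[i]] = 1
            }
            next
        }
        FNR > 1 && $0 != "" {                   # second file: metadata rows
            split($0, field, ",")
            if (!(field[1] in has_column)) print field[1]
        }
    ' "$2" "$1"
}

# Report contrasts whose variable, reference, or target is absent from the metadata.
# Usage: invalid_contrasts metadata.csv contrasts.csv
invalid_contrasts() {
    awk -F, '
        { sub(/\r$/, "") }                      # tolerate Windows line endings
        NF == 0 { next }                        # skip blank lines
        FILENAME == ARGV[1] {                   # first file: the metadata
            if (FNR == 1) { for (i = 1; i <= NF; i++) column[$i] = i }
            else { for (name in column) level[name, $(column[name])] = 1 }
            next
        }
        FNR == 1 { next }                       # skip the contrasts header
        !($2 in column)      { print $1 ": unknown variable " $2; next }
        !(($2, $3) in level) { print $1 ": unknown reference " $3 }
        !(($2, $4) in level) { print $1 ": unknown target " $4 }
    ' "$1" "$2"
}
```

Both functions print nothing when the files are consistent. For example, missing_samples metadata.csv matrix.tsv lists every sample that lacks a column in the matrix, and invalid_contrasts metadata.csv contrasts.csv prints one line per problem, prefixed with the contrast id.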
We provide two template job script files: one for RNAseq and one for proteomics data. The RNAseq template file can be found here:
/nfsdata/scripts/job_scripts/run_nfcore_differentialabundance.sh
To use the script, copy it to your dataset-specific project folder and change the input files and parameters as needed.
This job script will use DESeq2 for differential expression analysis.
As input, the pipeline needs:
metadata: path to the sample metadata file
contrasts: path to the contrast file
matrix: raw counts matrix, ideally using Ensembl gene IDs. This should be the nf-core/rnaseq output: out/star_rsem/rsem.merged.gene_counts.tsv
resdir: path to the directory where results will be stored. The pipeline output will be stored in a subfolder called out
# CHANGE INPUT_FOLDER AND CREATE INPUT FILES
metadata=/data/${USER}/INPUT_FOLDER/metadata.csv
contrasts=/data/${USER}/INPUT_FOLDER/contrasts.csv
# CHANGE EXPRESSION COUNT matrix PATH, ideally to output from nfcore:rnaseq
matrix=/data/${USER}/${accession}/rsem.merged.gene_counts.tsv
# CHANGE RESULTS_FOLDER
resdir=/data/${USER}/RESULT_FOLDER
outdir=$resdir/out
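Since all four paths are edited by hand, a short guard after these assignments can catch a wrong path before the pipeline starts. This is only a sketch under the assumption that the variables are set as shown above; check_inputs is a hypothetical helper and is not provided by the template script.

```shell
# Hypothetical helper: return non-zero and name each offending path on stderr
# if any given file is missing, unreadable, or empty.
check_inputs() {
    missing=0
    for file in "$@"; do
        if [ ! -f "$file" ] || [ ! -r "$file" ] || [ ! -s "$file" ]; then
            echo "missing, unreadable, or empty: $file" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call, once metadata, contrasts, and matrix are set in the job script:
#   check_inputs "$metadata" "$contrasts" "$matrix" || exit 1
```

The function reports every problematic path rather than stopping at the first one, so a single run shows all the entries that still need to be changed.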
By default, this template job script is configured for mouse data. The species, gene annotation, and gene set enrichment files are specified here:
# CHANGE GENE ANNOTATION gtf, species, and genesets IF NEEDED
gtf=/nfsdata/genome/ensembl/release-115/GRCm39/chrMus_musculus.GRCm39.115.chr.gtf.gz
genesets=/nfsdata/genome/gprofiler/mmusculus/gprofiler_full_mmusculus.ENSG.gmt
Gene annotations and gene set enrichment files are stored here:
/nfsdata/genome/ensembl/
/nfsdata/genome/gprofiler/
For proteomics data, we provide the following template job script:
/nfsdata/scripts/job_scripts/run_nfcore_differentialabundance_proteomics.sh
To use the script, copy it to your dataset-specific project folder and change the input files and parameters as needed.
This job script will use limma for differential expression analysis.
As input, the pipeline needs:
metadata: path to the sample metadata file
contrasts: path to the contrast file
matrix: raw counts matrix, ideally using gene names. This should be the bigbio:quantmsdiann output: out/quant_tables/diann_report.unique_genes_matrix.tsv
resdir: path to the directory where results will be stored. The pipeline output will be stored in a subfolder called out
# CHANGE INPUT_FOLDER AND CREATE INPUT FILES
metadata=/data/${USER}/INPUT_FOLDER/metadata.csv
contrasts=/data/${USER}/INPUT_FOLDER/contrasts.csv
# CHANGE EXPRESSION COUNT matrix PATH, ideally to output from bigbio:quantmsdiann
matrix=/data/${USER}/PROJECT_NAME/processed/quantmsdiann/out/quant_tables/diann_report.unique_genes_matrix.tsv
# CHANGE RESULTS_FOLDER
resdir=/data/${USER}/RESULT_FOLDER
outdir=$resdir/out
By default, this template job script is configured for mouse data. For gene set enrichment analysis, all proteins are used as the background:
# CHANGE BACKGROUND AND GENE SETS IF DESIRED
backgroundf=/nfsdata/genome/ensembl/release-115/GRCh38/list_prot_geneNames_Homo_sapiens.GRCh38.115.txt
genesets=/data/nhecker/genomes/gprofiler/hsapiens/gprofiler_full_hsapiens.ENSG.gmt
idcolumn="Genes"
Gene set and background files for mouse data are stored here:
# CHANGE BACKGROUND AND GENE SETS IF DESIRED
backgroundf=/nfsdata/genome/ensembl/release-115/GRCm39/list_prot_geneNames_Mus_musculus.GRCm39.115.chr.txt
genesets=/nfsdata/genome/gprofiler/mmusculus/gprofiler_full_mmusculus.name.gmt
idcolumn="Genes"
We added four ways of normalising proteomics data:
median_normalised: data will be log2-transformed and normalised with the limma normalizeBetweenArrays function using method=scale
quantile_normalised: data will be log2-transformed and normalised with the limma normalizeBetweenArrays function using method=quantile
cyclic_loess: data will be log2-transformed and normalised with the limma normalizeBetweenArrays function using method=cyclicloess
variance_stabilised: data will be normalised using the limma normalizeVSN function
One of the four methods is used for exploratory analysis (differential expression and gene set enrichment). By default, variance_stabilised data is used:
# CHANGE ASSAY FOR EXPLORATORY ANALYSIS IF DESIRED: median_normalised, quantile_normalised, cyclic_loess, or variance_stabilised
final_assay="variance_stabilised"
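Because final_assay is a free-text variable, a typo may only surface once the job is already running. A small guard such as the following could be placed in the job script. It is a sketch only: is_valid_assay is a hypothetical helper that the template does not provide, and it accepts nothing but the four assay names listed above.

```shell
# Hypothetical helper: succeed only for the four assay names listed above.
is_valid_assay() {
    case "$1" in
        median_normalised|quantile_normalised|cyclic_loess|variance_stabilised)
            return 0 ;;
        *)
            return 1 ;;
    esac
}

final_assay="variance_stabilised"
if ! is_valid_assay "$final_assay"; then
    echo "unknown final_assay: $final_assay" >&2
    exit 1
fi
```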
The pipeline generates organized folders inside your designated output directory:
out/tables: differential and gene set enrichment result tables with p-values, adjusted p-values, and log fold-changes.
out/tables/processed_abundance (RNAseq only): normalised count matrices.
out/limma (proteomics only): normalised count matrices.
out/report: an HTML report and its matching R Markdown document for fully customizable downstream work.