The bigbio/quantmsdiann pipeline is an automated workflow built following nf-core guidelines for quantitative mass spectrometry data analysis using DIA-NN.
You can find the official usage documentation here: https://github.com/bigbio/quantmsdiann/blob/main/docs/usage.md
Unlike standard genomics pipelines, this pipeline requires input metadata in the Sample and Data Relationship Format (SDRF), provided as an sdrf.tsv file. The pipeline and DIA-NN require several columns to be present in sdrf.tsv, including the source name, assay name, data file, label, cleavage agent details, modification parameters, and technology type. For a minimal example, see https://github.com/bigbio/quantmsdiann/blob/main/docs/usage.md#minimal-valid-metadata-example.
A typical SDRF file looks like this:
characteristic[organism]	comment[cleavage agent details]	comment[instrument]	comment[proteomics data acquisition method]	technology type	comment[modification parameters]	comment[label]	comment[precursor mass tolerance]	comment[fragment mass tolerance]	assay name	comment[data file]	source name	factor value[Type]	characteristics[Sex]	factor value[AAV]	factor value[LPS_or_PBS]
mus musculus	trypsin; Lys-C	timsTOF SCP	DIA	dia-PASEF	NT=Carbamidomethyl;AC=UNIMOD:4;MT=Fixed;TA=C	label free sample	15 ppm	15 ppm	NI145	/data/ukdri/BUP/BUP_MM_1/230321_DRIX057_SH_1_S1-A1_1_4879.d	NI145	astrocyte	f	TurboID	PBS
mus musculus	trypsin; Lys-C	timsTOF SCP	DIA	dia-PASEF	NT=Carbamidomethyl;AC=UNIMOD:4;MT=Fixed;TA=C	label free sample	15 ppm	15 ppm	NI146	/data/ukdri/BUP/BUP_MM_1/230321_DRIX057_SH_2_S1-A2_1_4880.d	NI146	astrocyte	f	TurboID	LPS
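Before submitting a job, it can help to confirm that the header of sdrf.tsv contains the required columns. The following is a minimal sketch, not the pipeline's own validator: check_sdrf_columns is a hypothetical helper, the column names are taken from the example header above, and matching column names case-insensitively is an assumption.

```bash
# Sketch: report required SDRF columns that are missing from the header line.
# check_sdrf_columns is a hypothetical helper, not part of the pipeline.
check_sdrf_columns() {
    local sdrf_file="$1" missing=0 col header_fields
    # Header fields, one per line, lower-cased, without Windows line endings
    header_fields="$(head -n 1 "$sdrf_file" | tr -d '\r' | tr '\t' '\n' | tr '[:upper:]' '[:lower:]')"
    for col in \
        "source name" \
        "assay name" \
        "comment[data file]" \
        "comment[label]" \
        "comment[cleavage agent details]" \
        "comment[modification parameters]" \
        "technology type"
    do
        # -x: match the whole field, -F: treat the square brackets literally
        if ! printf '%s\n' "$header_fields" | grep -qxF -- "$col"; then
            echo "missing column: $col" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, check_sdrf_columns /data/$USER/PROJECT_DIR/raw/sdrf.tsv returns a non-zero status and lists each missing column on standard error.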
Input data can be in either Thermo Fisher format (.raw files) or Bruker format (.d directories).
We provide a job template script for submitting a Slurm job:
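As a rough sanity check, the data paths listed in the SDRF file can be tested against these two formats. This is only a sketch: check_data_files is a hypothetical helper, it assumes the paths are stored in the comment[data file] column as in the example above, and it treats any other extension as unexpected, which may be stricter than the pipeline itself.

```bash
# Sketch: verify that every comment[data file] entry is an existing
# .raw file or .d directory. check_data_files is a hypothetical helper.
check_data_files() {
    local sdrf_file="$1" rc=0 path paths
    # Print the comment[data file] column of every data row
    paths="$(awk -F'\t' '
        { sub(/\r$/, "") }
        NR == 1 { for (i = 1; i <= NF; i++) if (tolower($i) == "comment[data file]") col = i; next }
        col && $col != "" { print $col }
    ' "$sdrf_file")"
    if [ -z "$paths" ]; then
        echo "no comment[data file] entries found in $sdrf_file" >&2
        return 1
    fi
    while IFS= read -r path; do
        [ -n "$path" ] || continue
        case "$path" in
            *.raw) [ -f "$path" ] || { echo "missing .raw file: $path" >&2; rc=1; } ;;
            *.d)   [ -d "$path" ] || { echo "missing .d directory: $path" >&2; rc=1; } ;;
            *)     echo "unexpected extension: $path" >&2; rc=1 ;;
        esac
    done <<EOF
$paths
EOF
    return "$rc"
}
```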
/nfsdata/scripts/job_scripts/run_quantmsdiann.sh
To use the script, copy it to your dataset-specific project folder and change the input files and parameters as needed.
As input, the pipeline needs:
input_sdrf: path to the SDRF file, which specifies the paths to the input data and the required metadata
resdir: path to the directory where results will be stored. The pipeline output will be stored in the subfolder out
database: species-specific protein sequences for mapping peptide fragments
# CHANGE PATH TO SDRF file
input_sdrf=/data/$USER/PROJECT_DIR/raw/sdrf.tsv
# CHANGE RESULTS_DIR to your dataset-specific folder on /data
resdir=/data/$USER/PROJECT_DIR/processed/quantmsdiann
outdir=$resdir/out
Always use full file paths to avoid any complications.
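A small guard in the job script can catch relative paths before the pipeline starts. The sketch below is an illustration only; require_absolute_path is a hypothetical helper and is not part of the provided template.

```bash
# Sketch: fail early when a path is not absolute.
# require_absolute_path is a hypothetical helper.
require_absolute_path() {
    case "$1" in
        /*) return 0 ;;
        *)  echo "not an absolute path: $1" >&2
            return 1 ;;
    esac
}

# Possible use after the variable assignments in the job script:
#   require_absolute_path "$input_sdrf" || exit 1
#   require_absolute_path "$resdir" || exit 1
```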
By default, the job script specifies a mouse protein database. We added the universal Protein Contaminant Libraries for DDA and DIA Proteomics to the UniProt sequences:
# CHANGE PROTEIN DATABASE IF NEEDED
database=/nfsdata/genome/uniprot/mouse/UP000000589_10090_plus_HaoUniversal.fasta
We currently provide protein database files for mouse and human:
/nfsdata/genome/uniprot/mouse/UP000000589_10090_plus_HaoUniversal.fasta
/nfsdata/genome/uniprot/human/UP000005640_9606_plus_HaoUniversal.fasta
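If you switch between species often, the database path can be selected by name instead of edited by hand. This is a sketch under the assumption that only the two files listed above are available; database_for_species is a hypothetical helper and does not exist in the template script.

```bash
# Sketch: map a species name to one of the two provided FASTA files.
# database_for_species is a hypothetical helper.
database_for_species() {
    case "$1" in
        mouse) echo "/nfsdata/genome/uniprot/mouse/UP000000589_10090_plus_HaoUniversal.fasta" ;;
        human) echo "/nfsdata/genome/uniprot/human/UP000005640_9606_plus_HaoUniversal.fasta" ;;
        *)     echo "no database provided for species: $1" >&2
               return 1 ;;
    esac
}

# Possible use in the job script:
#   database="$(database_for_species human)" || exit 1
```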
Details about the pipeline parameters are available here: https://github.com/bigbio/quantmsdiann/blob/main/docs/parameters.md
Submit the job script to run the pipeline:
sbatch run_quantmsdiann.sh
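To follow the job after submission, you can capture the job ID that sbatch prints ("Submitted batch job <id>") and pass it to squeue. The helper name job_id_from_sbatch_output below is hypothetical, and the sketch assumes sbatch's default output format.

```bash
# Sketch: extract the job ID from sbatch's default output line,
# "Submitted batch job <id>". job_id_from_sbatch_output is a hypothetical helper.
job_id_from_sbatch_output() {
    awk '/^Submitted batch job / { print $4; exit }'
}

# Possible use on the cluster (requires Slurm, so shown as comments):
#   job_id="$(sbatch run_quantmsdiann.sh | job_id_from_sbatch_output)"
#   squeue -j "$job_id"
```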
The pipeline processes raw files, performs deep learning predictions, and organizes files into the output directory:
out/sdrf/: Validated metadata.
out/quant_tables/: Core quantification generated by DIA-NN including count matrices.
out/pmultqc/: A unified quality control HTML dashboard detailing identification rates and metrics.
Count matrix files:
out/quant_tables/diann_report.unique_genes_matrix.tsv: counts per gene (using gene names)
out/quant_tables/diann_report.pr_matrix.tsv: counts per protein (using UniProt IDs)
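Once the job has finished, a quick check can confirm that both matrix files exist and show how many data rows each contains. This is a sketch only: check_quant_outputs is a hypothetical helper, and it assumes that each matrix starts with a single header line.

```bash
# Sketch: confirm the two count matrices exist and report their data row counts.
# check_quant_outputs is a hypothetical helper; pass it the pipeline's out directory.
check_quant_outputs() {
    local outdir="$1" rc=0 f n
    for f in \
        "quant_tables/diann_report.unique_genes_matrix.tsv" \
        "quant_tables/diann_report.pr_matrix.tsv"
    do
        if [ -s "$outdir/$f" ]; then
            # Data rows = total lines minus the assumed single header line
            n=$(( $(wc -l < "$outdir/$f") - 1 ))
            echo "$f rows=$n"
        else
            echo "missing or empty: $outdir/$f" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

For example, check_quant_outputs "$outdir" can be run with the outdir value from the job script.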