A metagenomic analysis pipeline for eukaryotic viruses written in Nextflow.
Introduction
Viralgenie is a bioinformatics best-practice analysis pipeline for reconstructing consensus genomes and identifying intra-host variants from metagenomic sequencing data or enrichment-based sequencing data such as hybrid capture.
Pipeline summary
- Read QC (FastQC)
- Performs optional read pre-processing
- Metagenomic diversity mapping
- De novo assembly (SPAdes, TRINITY, megahit), combine contigs.
- [Optional] Extend the contigs with sspace_basic and filter with prinseq++
- [Optional] Map reads to contigs for coverage estimation (BowTie2, BWAmem2 and BWA)
- Contig reference identification (blastn)
  - Identify top 5 blast hits
  - Merge blast hit and all contigs of a sample
- [Optional] Precluster contigs based on taxonomy
- Cluster contigs (or every taxonomic bin) of samples, options are:
- [Optional] Remove clusters with low read coverage. bin/extract_clusters.py
- Scaffolding of contigs to centroid (Minimap2, iVar-consensus)
- [Optional] Annotate 0-depth regions with external reference bin/lowcov_to_reference.py.
- [Optional] Select best reference from --mapping_constrains:
- Mapping filtered reads to supercontig and mapping constraints (BowTie2, BWAmem2 and BWA)
- [Optional] Deduplicate reads (Picard, or UMI-tools if UMIs are used)
- Variant calling and filtering (BCFTools, iVar)
- Create consensus genome (BCFTools, iVar)
- Repeat steps 12-15 multiple times for the de novo contig route
- Consensus evaluation and annotation (QUAST, CheckV, blastn, mmseqs-search, MAFFT - alignment of contigs vs iterations & consensus)
- Result summary visualisation for raw read, alignment, assembly, variant calling and consensus calling results (MultiQC)
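To make the "Identify top 5 blast hits" step more concrete, here is a minimal sketch of how the best hits per contig could be selected from blastn tabular output. It is not the pipeline's own implementation. It assumes the default `-outfmt 6` column layout, where column 1 is the query ID and column 12 is the bit score. The function name `top_hits` and the file name in the usage note are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Viralgenie.
# Keep the N best hits per query from BLAST tabular output (-outfmt 6).
# Assumes the default layout: column 1 = query ID, column 12 = bit score.
top_hits() {
  local blast_tsv="$1"
  local n="${2:-5}"
  # Group rows by query, best bit score first, then keep the first N rows per query.
  sort -t $'\t' -k1,1 -k12,12nr "$blast_tsv" \
    | awk -F'\t' -v n="$n" '++seen[$1] <= n'
}
```

For example, `top_hits blast_results.tsv 5` would print at most five rows per contig, ordered by decreasing bit score.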
Usage
[!NOTE] If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.
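A test run might look like the sketch below. It relies on Nextflow's ability to combine profiles with a comma and assumes Docker as the container engine. The function name `run_test_profile` and the default output directory `test_results` are hypothetical. The command is wrapped in a function so that nothing launches until you call it.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run the bundled test profile with a container engine.
# $1 = container profile (assumed default: docker), $2 = output directory (placeholder).
run_test_profile() {
  local engine="${1:-docker}"
  local outdir="${2:-test_results}"
  nextflow run nf-core/viralgenie \
    -profile "test,${engine}" \
    --outdir "${outdir}"
}
```

Calling `run_test_profile singularity` would use Singularity instead of Docker.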
First, prepare a samplesheet with your input data that looks as follows:
samplesheet.csv:
sample,fastq_1,fastq_2
sample1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
sample2,AEG588A5_S5_L003_R1_001.fastq.gz,
sample3,AEG588A3_S3_L002_R1_001.fastq.gz,AEG588A3_S3_L002_R2_001.fastq.gz
Each row represents a FASTQ file (single-end) or a pair of FASTQ files (paired-end).
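Before launching, it can help to confirm how each row will be interpreted. The sketch below is a hypothetical helper, not part of the pipeline. It assumes the exact three-column header shown above and treats an empty `fastq_2` field as single-end. The pipeline's own samplesheet validation may be stricter or accept additional columns.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Viralgenie: report the layout of each samplesheet row.
# Assumes the header is exactly "sample,fastq_1,fastq_2".
describe_samplesheet() {
  awk -F',' '
    NR == 1 {
      if ($0 != "sample,fastq_1,fastq_2") {
        print "unexpected header: " $0 > "/dev/stderr"
        exit 1
      }
      next
    }
    # Every data row needs a sample name and at least one FASTQ file.
    $1 == "" || $2 == "" {
      print "line " NR ": sample and fastq_1 are required" > "/dev/stderr"
      exit 1
    }
    # An empty third field means the sample is single-end.
    { print $1 "\t" ($3 == "" ? "single-end" : "paired-end") }
  ' "$1"
}
```

Run on the example above, `describe_samplesheet samplesheet.csv` would report sample1 and sample3 as paired-end and sample2 as single-end.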
Now, you can run the pipeline using:
nextflow run nf-core/viralgenie \
-profile <docker/singularity/.../institute> \
--input samplesheet.csv \
--outdir <OUTDIR>
[!WARNING] Please provide pipeline parameters via the CLI or the Nextflow -params-file option. Custom config files, including those provided by the -c Nextflow option, can be used to provide any configuration except for parameters; see docs.
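As an illustration of the `-params-file` route, the sketch below writes the two parameters from the command above into a YAML file and passes it to Nextflow. The file name `params.yaml`, the output directory `results`, the choice of the `docker` profile, and the function name `run_viralgenie` are all placeholders. The launch command is wrapped in a function so that nothing runs until you call it.

```shell
#!/usr/bin/env bash
# Write pipeline parameters to a YAML file (values are placeholders).
cat > params.yaml <<'EOF'
input: samplesheet.csv
outdir: results
EOF

# Hypothetical wrapper: parameters come from the file, while -profile
# (a Nextflow option, not a pipeline parameter) stays on the command line.
run_viralgenie() {
  nextflow run nf-core/viralgenie \
    -profile docker \
    -params-file params.yaml
}
```

Keeping parameters in a file makes a run easier to repeat and to share than a long command line.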
For more details and further functionality, please refer to the usage documentation and the parameter documentation.
Pipeline output
To see the results of an example test run with a full-size dataset, refer to the results tab on the nf-core website pipeline page. For more details about the output files and reports, please refer to the output documentation.
Credits
Viralgenie was originally written by Joon-Klaps.
We thank the following people for their extensive assistance in the development of this pipeline:
Contributions and Support
If you would like to contribute to this pipeline, please see the contributing guidelines.
Citations
[!WARNING] Viralgenie is currently not published. Please cite as: Klaps J, Lemey P, Kafetzopoulou L. Viralgenie: A metagenomics analysis pipeline for eukaryotic viruses. GitHub https://github.com/Joon-Klaps/viralgenie
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.