Output
Introduction
This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline.
The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
Tip
A global, partly random prefix can be set with the argument --prefix <string>. The resulting string is then used as a prefix for all output files.
Preprocessing
All output files of the preprocessing steps can be found in the directory preprocessing/.
FastQC
Output files
fastqc/{raw,trim,host}
- *_fastqc.html: FastQC report containing quality metrics.
- *_fastqc.zip: Zip archive containing the FastQC report, tab-delimited data file and plot images.
FastQC gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the FastQC help pages.
Tip
The FastQC plots displayed in the MultiQC report show untrimmed, trimmed and host-filtered reads. Make sure to check the section titles for the correct set of reads.
fastp
fastp is a FASTQ pre-processing tool for quality control, adapter trimming, quality filtering and other features.
Output files
fastp/
- report/<sample-id>.*{html,json}: report files in different formats.
- log/<sample-id>.*{html,json}: log files.
- <sample-id>.fastp.fastq.gz: file with the trimmed FASTQ reads.
- fail/<sample-id>.fail.fastq.gz: file with reads that failed the quality controls.
By default viralgenie will only provide the report and log files if fastp is selected. The trimmed reads can be saved by specifying --save_intermediate_reads or --save_final_reads 'trimming'. Similarly, the reads that failed trimming can be saved with --save_trimmed_fail.
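The JSON report can be inspected programmatically. The following is a minimal sketch, assuming the usual fastp JSON layout with a "summary" object holding "before_filtering" and "after_filtering" read counts; the function name fastp_read_counts is hypothetical and not part of viralgenie.

```python
import json


def fastp_read_counts(report_text: str) -> dict:
    """Return read counts before and after filtering from a fastp JSON report."""
    report = json.loads(report_text)
    summary = report["summary"]
    before = summary["before_filtering"]["total_reads"]
    after = summary["after_filtering"]["total_reads"]
    return {
        "before": before,
        "after": after,
        # Guard against empty input files.
        "retained_fraction": after / before if before else 0.0,
    }


# Tiny hand-written example standing in for report/<sample-id>.fastp.json
example = json.dumps(
    {
        "summary": {
            "before_filtering": {"total_reads": 1000},
            "after_filtering": {"total_reads": 900},
        }
    }
)
print(fastp_read_counts(example))
```

For a real report, read the file content and pass it to the function instead of the hand-written example.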
Trimmomatic
Trimmomatic is a FASTQ pre-processing tool for quality control, adapter trimming, quality filtering and other features.
Output files
trimmomatic/
- <sample-id>.fastq.gz: file with the trimmed FASTQ reads.
- log/<sample-id>.*{html,txt,zip}: log files generated by Trimmomatic.
By default viralgenie will only provide the report and log files if Trimmomatic is selected. The trimmed reads can be saved by specifying --save_intermediate_reads or --save_final_reads 'trimming'.
UMI-deduplication
UMI-deduplication can be done at the read level using HUMID. Viralgenie also provides the opportunity to extract the UMI from the read using UMI-tools extract if the UMI is not in the header. Results will be stored in the preprocessing/umi directory.
Output files
umi/
- humid/
  - log/<sample-id>.log: log file of HUMID.
  - annotated/<sample-id>_annotated_*.fastq.gz: annotated FASTQ files, reads will have their assigned cluster in the read header.
  - deduplicated/<sample-id>_deduplicated_*.fastq.gz: deduplicated FASTQ files.
- umitools/
  - log/<sample-id>.log: log file of UMI-tools.
  - extracts/<sample-id>.umi_extract*.fastq.gz: FASTQ file where UMIs have been removed from the read and moved to the read header.
By default viralgenie will not assume reads have UMIs. To enable this, use the parameter --with_umi. Specify where UMI deduplication should occur with --umi_deduplicate: at a 'read' level, on a 'mapping' level, or 'both' at a read and mapping level. The deduplicated reads can be saved by specifying --save_intermediate_reads or --save_final_reads 'deduplication'.
BBDuk
BBDuk stands for Decontamination Using Kmers. BBDuk was developed to combine most common data-quality-related trimming, filtering, and masking operations into a single high-performance tool.
It is used in viralgenie for complexity filtering using different algorithms. This means that it will remove reads with low sequence diversity (e.g. mono- or dinucleotide repeats).
Output files
bbduk/
- log/<sample-id>.bbduk.log: log file containing filtering statistics.
- <sample-id>.fastq.gz: resulting FASTQ file without low-complexity reads.
By default viralgenie will only provide the log files of bbduk. The filtered reads can be saved by specifying --save_intermediate_reads or --save_final_reads 'complexity'.
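To illustrate what "low sequence diversity" means, the sketch below scores a read by the Shannon entropy of its k-mer distribution. This is only an illustration of the idea: the exact algorithm and thresholds used by BBDuk are not reproduced here, and the function names, k and min_entropy values are assumptions chosen for the example.

```python
import math
from collections import Counter


def kmer_entropy(seq: str, k: int = 3) -> float:
    """Shannon entropy (in bits) of the k-mer distribution of a read."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    counts = Counter(kmers)
    total = len(kmers)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def is_low_complexity(seq: str, k: int = 3, min_entropy: float = 1.5) -> bool:
    """Flag reads whose k-mer entropy falls below an (illustrative) cutoff."""
    return kmer_entropy(seq, k) < min_entropy


print(is_low_complexity("AAAAAAAAAAAAAAAAAAAA"))  # mononucleotide repeat
print(is_low_complexity("ATATATATATATATATATAT"))  # dinucleotide repeat
print(is_low_complexity("ACGTTGCAAGCTTAGGCATC"))  # diverse sequence
```

Mono- and dinucleotide repeats yield only one or two distinct k-mers, so their entropy stays at or below one bit, while a diverse read approaches the maximum for its length.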
prinseq++
prinseq++ is used in viralgenie for complexity filtering using different algorithms. This means that it will remove reads with low sequence diversity (e.g. mono- or dinucleotide repeats).
Output files
prinseq/
- log/<sample-id>.log: log file containing filtering statistics.
- <sample-id>.fastq.gz: resulting FASTQ file without low-complexity reads.
By default viralgenie will only provide the log files of prinseq. The filtered reads can be saved by specifying --save_intermediate_reads or --save_final_reads 'complexity'.
Hostremoval-Kraken2
Kraken2 is a taxonomic sequence classifier that assigns taxonomic labels to DNA sequences. Kraken examines the k-mers within a query sequence and uses the information within those k-mers to query a database. That database maps k-mers to the lowest common ancestor (LCA) of all genomes known to contain a given k-mer.
Output files
hostremoval-kraken2/
- <sample-id>_kraken2_host.report.txt: A profile of the aligned reads to a given host contamination database.
- <sample-id>_kraken2_host.unclassified*.fastq.gz: resulting FASTQ file with reads that don't have any matches to the given host contamination database.
By default viralgenie will only provide the log files of Kraken2, which are visualised in MultiQC. The filtered reads can be saved by specifying --save_intermediate_reads or --save_final_reads 'host'.
Metagenomic Diversity
The results of the metagenomic diversity analysis are stored in the directory metagenomic_diversity/. Results are also visualised in the MultiQC report.
Kraken2
Kraken is a taxonomic sequence classifier that assigns taxonomic labels to DNA sequences. Kraken examines the k-mers within a query sequence and uses the information within those k-mers to query a database. That database maps k-mers to the lowest common ancestor (LCA) of all genomes known to contain a given k-mer.
Output files
metagenomic_diversity/kraken2/
- <sample-id>.report.txt: A Kraken2 report that summarises the fraction abundance, taxonomic ID, number of k-mers and taxonomic path of all the hits in the Kraken2 run for a given sample. It has 8 columns rather than 6 if --save_minimizers is specified.
- <sample-id>_kraken2_host.unclassified*.fastq.gz: resulting FASTQ file with reads that don't have any matches to the given host contamination database.
- <sample-id>.classified.fastq.gz: FASTQ file containing all reads that had a hit against a reference in the database for a given sample.
- <sample-id>.unclassified.fastq.gz: FASTQ file containing all reads that did not have a hit in the database for a given sample.
- <sample-id>.classifiedreads.txt: A list of read IDs and the hits each read had against the database for a given sample.
By default viralgenie will not provide any classified or unclassified FASTQ files; enable this with --kraken2_save_reads. Similarly, the classifiedreads table can be saved with --kraken2_save_readclassification.
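The report file is tab-delimited and easy to load for downstream analysis. The sketch below assumes the standard Kraken2 report layout (percentage, clade reads, direct reads, rank code, taxonomic ID, name), with two extra minimizer columns inserted after the third column when minimizer data is reported. The function name, the example taxonomic ID 12345 and the name "example virus" are made up for illustration.

```python
def parse_kraken2_report(text: str) -> list[dict]:
    """Parse a Kraken2 report with either 6 or 8 tab-separated columns."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) not in (6, 8):
            raise ValueError(f"unexpected number of columns: {len(fields)}")
        # The last three columns are always rank code, taxonomic ID and name,
        # so indexing from the end works for both layouts.
        rows.append(
            {
                "percent": float(fields[0]),
                "clade_reads": int(fields[1]),
                "direct_reads": int(fields[2]),
                "rank": fields[-3],
                "taxid": int(fields[-2]),
                "name": fields[-1].strip(),  # names are indented by rank
            }
        )
    return rows


report = (
    " 50.00\t50\t50\tU\t0\tunclassified\n"
    " 50.00\t50\t0\tR\t1\troot\n"
    " 40.00\t40\t40\tS\t12345\t    example virus\n"
)
for row in parse_kraken2_report(report):
    print(row["rank"], row["taxid"], row["name"], row["clade_reads"])
```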
Kaiju
Kaiju is a program for sensitive taxonomic classification of high-throughput sequencing reads from metagenomic data. It is based on the Burrows-Wheeler transform and the lowest common ancestor algorithm.
Output files
metagenomic_diversity/kaiju/
- <sample-id>.tsv: Raw output from Kaiju with taxonomic rank, read ID and taxonomic ID.
- <sample-id>.txt: A summary of the taxonomic classification of the reads in the sample.
Krona
Krona is a hierarchical data visualisation tool that can be used to visualise the taxonomic classification of metagenomic data.
Output files
metagenomic_diversity/krona/
- <kaiju|kraken2>_.html: An HTML file containing the Krona visualisation of the taxonomic classification of the reads in the sample.
Assembly & Polishing
The results of the assembly processes & polishing are stored in the directory assembly/.
Multiple intermediate files can be generated during the assembly process, and not all of them are always of interest. For this reason, the intermediate files are only saved when the --save_intermediate_polishing argument is given, which is off by default.
Assemblers
Multiple assemblers [spades, trinity, megahit] can be used, and their results are combined. Each assembler has its own directory in the assembly/assemblers directory, where there will be a subfolder for the contigs and the QC results from QUAST.
Output files
assemblers/
- spades/<spades_mode>/
  - contigs/<sample-id>_spades.fa.gz: Contigs generated by SPAdes.
  - log/<sample-id>_spades.log: Directory containing the log file of the SPAdes run.
  - quast/<sample-id>_spades.tsv: Directory containing the QUAST report.
- trinity/
  - contigs/<sample-id>_trinity.fa.gz: Contigs generated by Trinity.
  - quast/<sample-id>_trinity.tsv: Directory containing the QUAST report.
- megahit/
  - contigs/<sample-id>_megahit.fa.gz: Contigs generated by Megahit.
  - quast/<sample-id>_megahit.tsv: Directory containing the QUAST report.
Quast results are also summarised and plotted in the MultiQC report.
Finally, the results of the assemblers are combined and stored in the tools_combined/ directory.
Output files
assemblers
- tools_combined/<sample-id>.combined.fa: Contigs generated by combining the results of the assemblers.
SSPACE Basic
SSPACE Basic is a tool for scaffolding contigs using paired-end reads. It is modified from SSAKE assembler and has the feature of extending contigs using reads that are unmappable in the contig assembly step.
Output files
sspace_basic/
- scaffolds/<sample-id>.scaffolds.fasta: Scaffolds generated by SSPACE Basic.
- log/<sample-id>.*.txt: Various txt files containing log and summary information on the SSPACE Basic run.
prinseq++ - contigs
prinseq++ is used for complexity filtering of contigs.
Output files
prinseq/
- scaffolds/<sample-id>.scaffolds.fasta: Scaffolds generated by SSPACE Basic.
- log/<sample-id>.*.txt: Various txt files containing log and summary information on the SSPACE Basic run.
BLAST
BLAST is a sequence comparison tool that can be used to compare a query sequence against a database of sequences. In viralgenie, BLAST is used to compare the contigs generated by the assemblers to a database of viral sequences.
By default, viralgenie will only provide the BLAST results in a tabular format. Only the top five hits are selected, and a filtered version is also produced that only includes hits with an e-value of 0.01 or lower, a bitscore of 50 or higher and an alignment percentage of 0.80 or higher.
Column names
- qseqid
- sseqid
- stitle
- pident
- qlen
- slen
- length
- mismatch
- gapopen
- qstart
- qend
- sstart
- send
- evalue
- bitscore
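The filter thresholds described above can be applied to a table with these columns as sketched below. How the pipeline computes the "alignment percentage" is not stated here, so this sketch assumes it is the alignment length divided by the query length; the function name filter_blast_hits and the example rows are hypothetical.

```python
import csv
import io

COLUMNS = [
    "qseqid", "sseqid", "stitle", "pident", "qlen", "slen", "length",
    "mismatch", "gapopen", "qstart", "qend", "sstart", "send",
    "evalue", "bitscore",
]


def filter_blast_hits(tsv_text, max_evalue=0.01, min_bitscore=50.0,
                      min_aln_fraction=0.80):
    """Keep hits that pass the e-value, bitscore and alignment cutoffs."""
    reader = csv.DictReader(
        io.StringIO(tsv_text),
        fieldnames=COLUMNS,
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # subject titles may contain quote characters
    )
    kept = []
    for hit in reader:
        # Assumption: alignment percentage = aligned length / query length.
        aln_fraction = int(hit["length"]) / int(hit["qlen"])
        if (
            float(hit["evalue"]) <= max_evalue
            and float(hit["bitscore"]) >= min_bitscore
            and aln_fraction >= min_aln_fraction
        ):
            kept.append(hit)
    return kept


hits = (
    "contig1\tref1\tExample virus\t98.5\t1000\t1200\t900\t10\t2\t1\t900\t50\t950\t1e-50\t1500\n"
    "contig2\tref2\tOther virus\t80.0\t1000\t1200\t100\t20\t0\t1\t100\t1\t100\t0.5\t30\n"
)
print([hit["qseqid"] for hit in filter_blast_hits(hits)])
```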
Output files
polishing/
- blast/<sample-id>_filter.tsv: Filtered BLAST results in tabular format.
- intermediate/blast/filtered-sequences/<sample-id>_withref.fa: Contigs with the BLAST hit sequence in a fasta file.
- intermediate/blast/hits/<sample-id>.txt: Unfiltered BLAST results in tabular format.
By default viralgenie will only provide the filtered blast.txt file. The intermediate files can be saved by specifying --save_intermediate_polishing.
Preclustering - Kaiju & Kraken2
Kaiju is a program for sensitive taxonomic classification of high-throughput sequencing reads from metagenomic data. It is based on the Burrows-Wheeler transform and the lowest common ancestor algorithm.
Kraken2 is a taxonomic sequence classifier that assigns taxonomic labels to DNA sequences. Kraken examines the k-mers within a query sequence and uses the information within those k-mers to query a database. That database maps k-mers to the lowest common ancestor (LCA) of all genomes known to contain a given k-mer.
Output files
polishing/intermediate/precluster
- kaiju/<sample-id>_kaiju.tsv: Raw output from Kaiju with taxonomic rank, read ID and taxonomic ID.
- kraken2/<sample-id>_kraken2_reports.txt: A Kraken2 report that summarises the fraction abundance, taxonomic ID, number of k-mers and taxonomic path of all the hits in the Kraken2 run for a given sample. It has 8 columns rather than 6 if --save_minimizers is specified.
- kraken2/<sample-id>_kraken2.classifiedreads.txt: A list of read IDs and the hits each read had against the database for a given sample.
- merged_classifications/<sample-id>.txt: Taxonomy merged based on the specified strategy, filtered based on the specified filters and simplified up to a certain taxonomy, with the columns being taxonomic rank, read ID and taxonomic ID.
- sequences/<sample-id>/<sample-id>_taxid<taxonomic ID>.fa: Fasta file with the contigs that were classified to that specific taxonomic ID.
By default viralgenie will not provide any preclustering file. The intermediate files can be saved by specifying --save_intermediate_polishing.
Clustering
The output files of each clustering method are put directly in the assembly/polishing directory, with the exception of a summary file that the pipeline generates for each cluster with the size of the cluster, the centroid, and so on.
Output files
polishing/intermediate/cluster/
- <sample-id>/<sample-id>.summary_mqc.tsv: A tabular file with comments used for MultiQC with statistics on the number of identified clusters in a sample.
- <sample-id>/<sample-id>.clusters.tsv: A tabular file with metadata on all clusters in a sample. It is the JSON file of all clusters in a table format.
Tip
Whenever there is a 'cl#' in the file name, it refers to the cluster number of that sample.
By default viralgenie will not provide any clustering overview files. The intermediate files can be saved by specifying --save_intermediate_polishing.
CD-HIT-EST
CD-HIT is a very fast, widely used program for clustering and comparing protein or nucleotide sequences.
Output files
polishing/cdhit/
- <sample-id>/<sample-id>.fa.clstr: A cluster file containing the clustering information, where ">" starts a new cluster, a "*" at the end means that this sequence is the representative or centroid of this cluster, and a "%" is the identity between this sequence and the representative.
- <sample-id>/<sample-id>.fa: A fasta file containing the centroid sequence.
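The .clstr format described above can be read into a simple dictionary. This is a minimal sketch, assuming the usual CD-HIT line layout (a ">Cluster N" header followed by member lines such as "0	2799nt, >name... *"); the function name parse_clstr and the sequence names are hypothetical.

```python
import re

# Matches the ">name... <suffix>" part of a member line.
MEMBER = re.compile(r">(?P<name>.+?)\.\.\. (?P<rest>.*)$")


def parse_clstr(text: str) -> dict:
    """Map each cluster header to its centroid and member names."""
    clusters = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # ">Cluster 0" starts a new cluster.
            current = line[1:].strip()
            clusters[current] = {"centroid": None, "members": []}
            continue
        match = MEMBER.search(line)
        if match is None or current is None:
            continue
        name = match.group("name")
        clusters[current]["members"].append(name)
        # A trailing "*" marks the representative (centroid) sequence.
        if match.group("rest").strip() == "*":
            clusters[current]["centroid"] = name
    return clusters


example = (
    ">Cluster 0\n"
    "0\t2799nt, >seq1... *\n"
    "1\t2214nt, >seq2... at +/99.77%\n"
    ">Cluster 1\n"
    "0\t1500nt, >seq3... *\n"
)
print(parse_clstr(example))
```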
vsearch-cluster
vsearch implements a single-pass, greedy centroid-based clustering algorithm, similar to the algorithms implemented in usearch, DNAclust and sumaclust for example. The output has to be in the --uc format or else the pipeline will not be able to process the output.
Output files
polishing/vsearch/
- <sample-id>/<sample-id>.tsv.gz: A cluster file containing the clustering information.
vsearch -uc columns
- Entry (S, H, or C):
- Record type: S, H, or C.
- Cluster number (zero-based).
- Centroid length (S), query length (H), or cluster size (C).
- Percentage of similarity with the centroid sequence (H), or set to ’*’ (S, C).
- Match orientation + or - (H), or set to ’*’ (S, C).
- Not used, always set to ’*’ (S, C) or to zero (H).
- Not used, always set to ’*’ (S, C) or to zero (H).
- Set to ’*’ (S, C) or, for H, compact representation of the pairwise alignment using the CIGAR format (Compact Idiosyncratic Gapped Alignment Report): M (match/mismatch), D (deletion), and I (insertion). The equal sign ’=’ indicates that the query is identical to the centroid sequence.
- Label of the query sequence (H), or of the centroid sequence (S, C).
- Label of the centroid sequence (H), or set to ’*’ (S, C).
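The columns above are enough to rebuild the clusters from a uc file. The sketch below assumes ten tab-separated columns per record and uses only the record type (column 1), the cluster number (column 2) and the query label (column 9); the function name clusters_from_uc and the contig names are hypothetical. For the gzipped pipeline output, the text would first be read with gzip.open(path, "rt").

```python
from collections import defaultdict


def clusters_from_uc(text: str) -> dict:
    """Return {centroid label: [member labels]} from vsearch uc records."""
    centroids = {}
    members = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        record, cluster, label = fields[0], int(fields[1]), fields[8]
        if record == "S":
            # S records introduce the centroid of a new cluster.
            centroids[cluster] = label
            members[cluster].append(label)
        elif record == "H":
            # H records are hits assigned to an existing centroid.
            members[cluster].append(label)
        # C records only summarise a cluster, so they are skipped.
    return {centroids[c]: members[c] for c in centroids}


uc_text = (
    "S\t0\t1000\t*\t*\t*\t*\t*\tcontigA\t*\n"
    "H\t0\t900\t98.5\t+\t0\t0\t900M\tcontigB\tcontigA\n"
    "S\t1\t500\t*\t*\t*\t*\t*\tcontigC\t*\n"
    "C\t0\t2\t*\t*\t*\t*\t*\tcontigA\t*\n"
    "C\t1\t1\t*\t*\t*\t*\t*\tcontigC\t*\n"
)
print(clusters_from_uc(uc_text))
```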
MMseqs2
MMseqs2 is a software suite to search and cluster huge protein and nucleotide sequence sets. The cascaded clustering workflow (mmseqs-cluster) first runs linclust, the linear-time clustering module of mmseqs (mmseqs-linclust), which can produce clusterings down to 50% sequence identity in a very short time.
Output files
polishing/
- mmseqs2/<sample-id>/<sample-id>.tsv: A cluster file containing the clustering information, where the first column is the cluster representative and the second column the member.
- intermediate/mmseqs/clustered_db/*: An MMseqs2 database of the clustered sequences.
- intermediate/mmseqs/sequence_db/*: An MMseqs2 database of the input sequences (contigs + blast hits).
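Because the cluster file is a two-column table of representative and member, grouping it takes only a few lines. This is a minimal sketch; the function name clusters_from_mmseqs_tsv and the contig names are hypothetical.

```python
from collections import defaultdict


def clusters_from_mmseqs_tsv(text: str) -> dict:
    """Group members (column 2) under their representative (column 1)."""
    clusters = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        representative, member = line.split("\t")[:2]
        clusters[representative].append(member)
    return dict(clusters)


tsv_text = "contigA\tcontigA\ncontigA\tcontigB\ncontigC\tcontigC\n"
print(clusters_from_mmseqs_tsv(tsv_text))
```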
vRhyme
vRhyme is a multi-functional tool for binning virus genomes from metagenomes. vRhyme functions by utilizing coverage variance comparisons and supervised machine learning classification of sequence features to construct viral metagenome-assembled genomes (vMAGs).
Output files
polishing/vrhyme
- <sample-id>/vRhyme_best_bins.#.membership.tsv: scaffold membership of best bins.
- <sample-id>/vRhyme_best_bins.#.summary.tsv: summary stats of best bins.
Mash
Mash calculates the distance between two sequences based on the jaccard distance. The Mash distance can be quickly computed from the size-reduced sketches alone, yet produces a result that strongly correlates with alignment-based measures such as the Average Nucleotide Identity (ANI).
Output files
polishing/mash
- <sample-id>/dist/*.tsv: A distance matrix of the genomes with ANI.
- <sample-id>/cluster/*.tsv: A table where the first column represents the contig/genome and the second column its corresponding cluster.
- <sample-id>/visual/*.png: A visualisation of the network.
In the network of, for example, a tri-segmented Hazara virus, each node represents a contig coloured by cluster. An edge indicates that the ANI between two contigs is higher than the specified --identity_threshold.
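The network idea can be sketched as follows: contigs are nodes, an edge is drawn when the estimated identity exceeds the threshold, and connected nodes end up in the same group. This sketch approximates ANI as 1 minus the Mash distance and groups nodes by connected components with a union-find structure. Both choices are assumptions for illustration, and the clustering method viralgenie actually applies to the network may differ; cluster_by_identity is a hypothetical name.

```python
def cluster_by_identity(distances: dict, identity_threshold: float = 0.8) -> list:
    """Group contigs connected by edges with ANI above the threshold.

    distances maps (contig_a, contig_b) pairs to a Mash distance.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for (a, b), distance in distances.items():
        root_a, root_b = find(a), find(b)
        # Assumption: ANI is approximated as 1 - Mash distance.
        if a != b and 1.0 - distance > identity_threshold:
            parent[root_a] = root_b

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(group) for group in groups.values())


example = {("a", "b"): 0.05, ("b", "c"): 0.5, ("a", "c"): 0.6}
print(cluster_by_identity(example, identity_threshold=0.8))
```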
What are those names?
Most assemblers tend to give each contig name a specific prefix. For example,
- Trinity: 'TRINITY_...'
- SPAdes: 'NODE_...'
- Megahit: 'k\d{3}_...'
Based on these prefixes viralgenie separates external references from de novo contigs. If any assemblers are added, consider specifying a specific regex for --assembler_patterns.
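The prefix-based separation can be illustrated with regular expressions built from the prefixes listed above. This is a sketch of the idea only: the dictionary, the function name contig_origin and the example sequence names are hypothetical, and the exact patterns viralgenie uses internally are not shown here.

```python
import re

# Patterns derived from the prefixes listed above, anchored at the start.
ASSEMBLER_PATTERNS = {
    "trinity": r"^TRINITY_",
    "spades": r"^NODE_",
    "megahit": r"^k\d{3}_",
}


def contig_origin(name: str) -> str:
    """Return the assembler a contig name matches, or 'external reference'."""
    for assembler, pattern in ASSEMBLER_PATTERNS.items():
        if re.match(pattern, name):
            return assembler
    return "external reference"


for name in ["TRINITY_DN0_c0_g1_i1", "NODE_1_length_1000_cov_5.0",
             "k141_25", "reference_1"]:
    print(name, "->", contig_origin(name))
```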
Minimap2
Minimap2 is a versatile sequence alignment program that aligns larger DNA or mRNA sequences against a large reference database.
Output files
polishing/scaffolding/<sample-id>/minimap
- <sample-id>_cl#.bam: A BAM file containing the alignment of contigs to the centroid.
- <sample-id>_cl#.mmi: The centroid index file.
By default viralgenie will not provide the minimap output files. The intermediate files can be saved by specifying --save_intermediate_polishing.
iVar contig consensus
iVar is a computational method for calling consensus sequences from viral populations.
Output files
polishing/scaffolding/<sample-id>
- <sample-id>_cl#_consensus.fa: A fasta file containing the consensus sequence of the cluster.
- <sample-id>_cl#_consensus.mpileup: An mpileup file containing the depth at each position of the consensus sequence.
- hybrid-<sample-id>_cl#_consensus.fa: A fasta file containing the hybrid consensus sequence of the cluster and the reference.
- /visualised/
  - *.png: A visualisation of the consensus sequence displaying which regions came from the reference and which from the contigs.
  - *.txt: The alignment of the reference to the consensus sequence written as a blast alignment.
By default viralgenie will not provide the iVar output files. The intermediate files can be saved by specifying --save_intermediate_polishing.
A visualisation is made to show which regions came from the external reference (red) and which from the de novo contigs (green).
Info
The hybrid consensus is generated by mapping the contigs to the reference and then calling the consensus sequence. This is done to fill in the gaps in the contigs with the reference sequence. If there are no positions with 0 coverage, there will not be a hybrid consensus and the output from iVar will be used.
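The gap-filling logic can be sketched as below. It is a simplified illustration that assumes the consensus, the reference and the per-position depth share the same coordinates (no insertions or deletions), which will not hold in general; the function name hybrid_consensus is hypothetical and this is not the pipeline's implementation.

```python
def hybrid_consensus(consensus: str, reference: str, depth: list) -> str:
    """Fill zero-coverage positions of a consensus with reference bases."""
    if not (len(consensus) == len(reference) == len(depth)):
        raise ValueError("inputs must share the same coordinate system")
    # Without any zero-coverage position the plain consensus is kept as-is.
    if all(d > 0 for d in depth):
        return consensus
    return "".join(
        ref if d == 0 else con
        for con, ref, d in zip(consensus, reference, depth)
    )


print(hybrid_consensus("ACNNGT", "ACTTGT", [3, 2, 0, 0, 5, 4]))
```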
Variant Calling & Iterative Refinement
The results from variant calling, resulting from the mapping constrains & the final round of polishing, are stored in the directory variant_calling/.
Info
Mapping constrains are combined with the specified samples: the identifier of the mapping constrain is combined with the sample identifier. All results will have a new prefix, which is <sample-id>_<mapping_constrain_id>-CONSTRAIN.
The results from the iterations are stored with the same structure as the final round of polishing in the assembly/polishing/iterations/it# directory.
Info
To be able to make a distinction between the output files of the iterations, viralgenie follows a schema where it starts from 'singletons' or a 'consensus', goes through the iterations and ends with the 'variant-calling'. The output files will have the following structure:
graph LR
F[singleton] --> B[Iteration 1: 'it1']
A[consensus] --> B[Iteration 1: 'it1']
B --> C[Iteration 2: 'it2']
C --> D[...]
D --> E[Variant-calling: 'itvariant-calling']
The prefix of the sample is combined with the previous state of the sample. For example, in the first iteration (directory iterations/it1), reads will be mapped to the reference-assisted de novo consensus sequence (i.e. consensus) and so the output file will be assembly/polishing/iterations/it1/bwamem2/bam/<sample-id>/<sample-id>_cl#_consensus.bam.
Reference selection
Reference selection is done using the mash tool. The reference file is sketched (variants/mapping-info/mash/sketch) and compared to the reads (variants/mapping-info/mash/screen), and the reference with the highest estimated average nucleotide identity (ANI) and shared hashes is selected (variants/mapping-info/mash/select-ref).
Output files
variants/mapping-info/mash
- sketch/<sample-id>_<constrain-id>-CONSTRAIN.msh: The sketch file of the reads.
- screen/<sample-id>_<constrain-id>-CONSTRAIN.screen: The tab results file of the comparisons between references and reads.
- select-ref/<sample-id>_<constrain-id>-CONSTRAIN.json: The reference with the highest estimated ANI and shared hashes.
Column names: mash-screen
- identity
- shared-hashes
- median-multiplicity
- p-value
- query-ID
- query-comment
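Selecting the best reference from a screen file with these columns can be sketched as follows. The sketch assumes tab-separated rows and a shared-hashes field written as "matched/total"; ranking by identity first and the matched hash count second is an assumption about how ties are resolved, and select_reference and the reference names are hypothetical.

```python
def select_reference(screen_text: str):
    """Return the query-ID with the highest identity and shared hashes."""
    best_key, best_id = None, None
    for line in screen_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        identity = float(fields[0])
        shared = int(fields[1].split("/")[0])  # "995/1000" -> 995
        key = (identity, shared)
        if best_key is None or key > best_key:
            best_key, best_id = key, fields[4]
    return best_id


screen = (
    "0.95\t800/1000\t5\t0\tref_A\tfirst example reference\n"
    "0.99\t995/1000\t7\t0\tref_B\tsecond example reference\n"
    "0.99\t990/1000\t7\t0\tref_C\tthird example reference\n"
)
print(select_reference(screen))
```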
Read mapping
The mapping results are stored in the directory variants/mapping-info/ or in the iterations directory assembly/polishing/iterations/it#.
If bowtie is used, the raw mapping results (in addition to the results after deduplication) are included in the MultiQC report.
Output files - variants
variants/mapping-info/
- bwamem2/
  - index/<sample-id>_<constrain-id>/*: The index files of the consensus.
  - bam/<sample-id>/<sample-id>_<constrain-id>-CONSTRAIN.bam: A BAM file containing the alignment of contigs to the consensus.
  - unmapped/<sample-id>/<sample-id>_<constrain-id>-CONSTRAIN.fastq.gz: A FASTQ file containing the unmapped reads.
- bwamem/
  - index/<sample-id>_<constrain-id>/*: The index files of the consensus.
  - bam/<sample-id>/<sample-id>_<constrain-id>-CONSTRAIN.bam: A BAM file containing the alignment of contigs to the consensus.
  - unmapped/<sample-id>/<sample-id>_<constrain-id>-CONSTRAIN.fastq.gz: A FASTQ file containing the unmapped reads.
- bowtie2/
  - build/<sample-id>_<constrain-id>/*: The index files of the consensus.
  - bam/<sample-id>/<sample-id>_<constrain-id>-CONSTRAIN.bam: A BAM file containing the alignment of contigs to the consensus.
  - unmapped/<sample-id>/<sample-id>_<constrain-id>-CONSTRAIN.fastq.gz: A FASTQ file containing the unmapped reads.
  - log/<sample-id>_<constrain-id>-CONSTRAIN.log: A log file of the bowtie2 run.
Output files - iterations
assembly/polishing/iterations/it#/
- bwamem2/
  - index/<sample-id>/*: The index files of the consensus.
  - bam/<sample-id>/<sample-id>_cl#_it#.bam: A BAM file containing the alignment of contigs to the consensus.
  - unmapped/<sample-id>/<sample-id>_cl#_it#.fastq.gz: A FASTQ file containing the unmapped reads.
- bwamem/
  - index/<sample-id>/*: The index files of the consensus.
  - bam/<sample-id>/<sample-id>_cl#_it#.bam: A BAM file containing the alignment of contigs to the consensus.
  - unmapped/<sample-id>/<sample-id>_cl#_it#.fastq.gz: A FASTQ file containing the unmapped reads.
- bowtie2/
  - build/<sample-id>/*: The index files of the consensus.
  - bam/<sample-id>/<sample-id>_cl#_it#.bam: A BAM file containing the alignment of contigs to the consensus.
  - unmapped/<sample-id>/<sample-id>_cl#_it#.fastq.gz: A FASTQ file containing the unmapped reads.
  - log/<sample-id>_cl#_it#.log: A log file of the bowtie2 run.
Deduplication
To account for PCR duplicates, the reads are deduplicated. The deduplication results are stored in the directory variants/mapping-info/deduplicate/ or in the iterations directory assembly/polishing/iterations/it#/deduplicate.
Deduplication results are also visualised within the MultiQC report.
UMI-tools
UMI-tools is a set of tools for handling Unique Molecular Identifiers (UMIs) in NGS data. The deduplication is done by the dedup tool.
Number of deduplicated reads:
Summary statistics:
Output files - variants
variants/mapping-info/deduplicate/
- bam/<sample-id>/<sample-id>_<constrain-id>-CONSTRAIN.umi_deduplicated.bam: A BAM file containing the alignment of contigs to the consensus.
- log/
  - <sample-id>_<constrain-id>-CONSTRAIN.umi_deduplicated.log: A log file of the UMI-tools run.
  - <sample-id>_<constrain-id>-CONSTRAIN.umi_deduplicated_edit_distance.tsv: Reports the (binned) average edit distance between the UMIs at each position.
  - <sample-id>_<constrain-id>-CONSTRAIN.umi_deduplicated_per_umi.tsv: UMI-level summary statistics.
  - <sample-id>_<constrain-id>-CONSTRAIN.umi_deduplicated_per_umi_per_position.tsv: Tabulates the counts for unique combinations of UMI and position.
Output files - iterations
assembly/polishing/iterations/it#/deduplicate
- bam/<sample-id>/<sample-id>_cl#_it#.umi_deduplicated.bam: A BAM file containing the alignment of contigs to the consensus.
- log/
  - <sample-id>_cl#_it#.umi_deduplicated.log: A log file of the UMI-tools run.
  - <sample-id>_cl#_it#.umi_deduplicated_edit_distance.tsv: Reports the (binned) average edit distance between the UMIs at each position.
  - <sample-id>_cl#_it#.umi_deduplicated_per_umi.tsv: UMI-level summary statistics.
  - <sample-id>_cl#_it#.umi_deduplicated_per_umi_per_position.tsv: Tabulates the counts for unique combinations of UMI and position.
Picard - Mark Duplicates
Picard is a set of command line tools for manipulating high-throughput sequencing data and formats such as SAM/BAM/CRAM and VCF. The deduplication is done by the MarkDuplicates tool.
Output files - variants
variants/mapping-info/deduplicate/
- picard/
  - bam/<sample-id>/<sample-id>_<constrain-id>-CONSTRAIN.bam: A BAM file containing the alignment of contigs to the consensus.
  - log/<sample-id>_<constrain-id>-CONSTRAIN.dedup.MarkDuplicates.metrics.txt: Deduplication metrics from Picard.
Output files - iterations
assembly/polishing/iterations/it#/deduplicate
- picard/
  - bam/<sample-id>/<sample-id>_cl#_it#.bam: A BAM file containing the alignment of contigs to the consensus.
  - log/<sample-id>_cl#_it#.dedup.MarkDuplicates.metrics.txt: Deduplication metrics from Picard.
Mapping statistics
Info
If --deduplicate is set to true [default], all metrics will be calculated on the deduplicated BAM file.
Samtools
Samtools is a suite of programs for interacting with high-throughput sequencing data. We use samtools in this pipeline to obtain mapping statistics from three tools: flagstat, idxstats and stats.
Output files - variants
variants/mapping-info/metrics
- flagstat/<sample-id>_<constrain-id>-CONSTRAIN.flagstat: A text file containing the flagstat output.
- idxstats/<sample-id>_<constrain-id>-CONSTRAIN.idxstats: A text file containing the idxstats output.
- stats/<sample-id>_<constrain-id>-CONSTRAIN.stats: A text file containing the stats output.
Output files - iterations
assembly/polishing/iterations/it#/metrics
- flagstat/<sample-id>_cl#_it#.flagstat: A text file containing the flagstat output.
- idxstats/<sample-id>_cl#_it#.idxstats: A text file containing the idxstats output.
- stats/<sample-id>_cl#_it#.stats: A text file containing the stats output.
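As an example of using these files, the fraction of mapped reads can be derived from an idxstats file. The sketch assumes the standard samtools idxstats layout of four tab-separated columns (reference name, reference length, mapped count, unmapped count), where the final "*" row holds reads without a reference; mapped_fraction and the reference name are hypothetical.

```python
def mapped_fraction(idxstats_text: str) -> float:
    """Fraction of reads mapped, summed over all rows of an idxstats file."""
    mapped = unmapped = 0
    for line in idxstats_text.splitlines():
        if not line.strip():
            continue
        _name, _length, n_mapped, n_unmapped = line.split("\t")
        mapped += int(n_mapped)
        unmapped += int(n_unmapped)
    total = mapped + unmapped
    return mapped / total if total else 0.0


idxstats = "consensus_cl0\t1000\t90\t0\n*\t0\t0\t10\n"
print(mapped_fraction(idxstats))
```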
Picard - Collect Multiple Metrics
Picard is a set of command line tools for manipulating high-throughput sequencing data. We use picard-tools in this pipeline to obtain mapping and coverage metrics.
Output files - variants
variants/mapping-info/metrics/picard
- *.CollectMultipleMetrics.*: Alignment QC files from Picard CollectMultipleMetrics in *_metrics textual format.
- *.pdf: plots for metrics obtained from CollectMultipleMetrics.
Output files - iterations
assembly/polishing/iterations/it#/metrics/picard
- *.CollectMultipleMetrics.*: Alignment QC files from Picard CollectMultipleMetrics in *_metrics textual format.
- *.pdf: plots for metrics obtained from CollectMultipleMetrics.
Custom - mpileup like file
To facilitate the intra-host analysis, an mpileup-like file is generated. This file contains the depth of every nucleotide at each position of the reference.
Output files - variants
variants/mapping-info/custom-vcf/<sample-id>
- *.tsv: A custom tsv file containing the depth of every nucleotide at each position of the reference.
Output files - iterations
assembly/polishing/iterations/it#/custom-vcf/<sample-id>
- *.tsv: A custom tsv file containing the depth of every nucleotide at each position of the reference.
Mosdepth - Coverage
mosdepth is a fast BAM/CRAM depth calculation for WGS, exome, or targeted sequencing. mosdepth is used in this pipeline to obtain genome-wide coverage values in 200bp windows. The results are rendered in MultiQC (genome-wide coverage).
Output files - variants
variants/mapping-info/metrics/mosdepth
- <sample-id>_<constrain-id>-CONSTRAIN.per-base.bed.gz: A bed file containing the coverage values in 200bp windows.
- <sample-id>_<constrain-id>-CONSTRAIN.per-base.bed.gz.csi: Indexed bed file.
- <sample-id>_<constrain-id>-CONSTRAIN.mosdepth.summary.txt: Summary metrics including mean, min and max coverage values.
- <sample-id>_<constrain-id>-CONSTRAIN.mosdepth.global.dist.txt: A cumulative distribution indicating the proportion of total bases that were covered for at least a given coverage value.
Output files - iterations
assembly/polishing/iterations/it#/metrics/mosdepth
- <sample-id>_cl#_it#.per-base.bed.gz: A bed file containing the coverage values in 200bp windows.
- <sample-id>_cl#_it#.per-base.bed.gz.csi: Indexed bed file.
- <sample-id>_cl#_it#.mosdepth.summary.txt: Summary metrics including mean, min and max coverage values.
- <sample-id>_cl#_it#.mosdepth.global.dist.txt: A cumulative distribution indicating the proportion of total bases that were covered for at least a given coverage value.
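The cumulative distribution file can be queried for the share of the genome covered at a given depth. The sketch below assumes three tab-separated columns (sequence name or "total", coverage value, proportion of bases covered at least that deeply); fraction_covered is a hypothetical helper and the numbers are made up.

```python
def fraction_covered(dist_text: str, min_depth: int, chrom: str = "total") -> float:
    """Proportion of bases covered at >= min_depth for one sequence."""
    # The distribution is cumulative, so the largest proportion among rows
    # with depth >= min_depth is the answer even if some depths are absent.
    proportions = []
    for line in dist_text.splitlines():
        if not line.strip():
            continue
        name, depth, proportion = line.split("\t")
        if name == chrom and int(depth) >= min_depth:
            proportions.append(float(proportion))
    return max(proportions, default=0.0)


dist = "total\t3\t0.10\ntotal\t2\t0.50\ntotal\t1\t0.90\ntotal\t0\t1.00\n"
print(fraction_covered(dist, min_depth=2))
```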
Variant calling & filtering
Variant calling is done with BCFtools mpileup or iVar, and the filtering with BCFtools filter.
Variant files are visualised in the MultiQC report.
Output files - variants
variants/variant_calling
- bcftools/
  - <sample-id>/<sample-id>_<constrain-id>-CONSTRAIN.vcf.gz: A VCF file containing the variant calls.
  - <sample-id>/<sample-id>_<constrain-id>-CONSTRAIN.norm.vcf.gz: A compressed VCF file where multiallelic sites are split up into biallelic records and SNPs and indels should be merged into a single record.
  - <sample-id>/<sample-id>_<constrain-id>-CONSTRAIN.filtered.vcf.gz: A compressed VCF file containing the filtered variants.
  - <sample-id>/<sample-id>_<constrain-id>-CONSTRAIN.vcf.gz.tbi: An index file for the compressed VCF file.
  - <sample-id>/<sample-id>_<constrain-id>-CONSTRAIN_stats.txt: A text stats file which is suitable for machine processing and can be plotted using plot-vcfstats.
- ivar/
  - <sample-id>/<sample-id>_<constrain-id>-CONSTRAIN.ivar.tsv: A tabular file containing the variant calls.
  - <sample-id>/<sample-id>_<constrain-id>-CONSTRAIN.ivar.vcf: A VCF file containing the variant calls.
  - <sample-id>/<sample-id>_<constrain-id>-CONSTRAIN.ivar.variant_counts.log: A summary file containing the number of indels and SNPs.
  - <sample-id>/<sample-id>_<constrain-id>-CONSTRAIN.filtered.vcf.gz: A compressed VCF file containing the variant calls.
  - <sample-id>/<sample-id>_<constrain-id>-CONSTRAIN_stats.txt: A text stats file which is suitable for machine processing and can be plotted using plot-vcfstats.
Output files - iterations
- `assembly/polishing/iterations/it#/variants/variant_calling`
  - `bcftools/`
    - `<sample-id>/<sample-id>_cl#_it#.vcf.gz`: A VCF file containing the variant calls.
    - `<sample-id>/<sample-id>_cl#_it#.norm.vcf.gz`: A compressed VCF file where multiallelic sites are split up into biallelic records and SNPs and indels should be merged into a single record.
    - `<sample-id>/<sample-id>_cl#_it#.filtered.vcf.gz`: A compressed VCF file containing the filtered variants.
    - `<sample-id>/<sample-id>_cl#_it#.vcf.gz.tbi`: An index file for the compressed VCF file.
    - `<sample-id>/<sample-id>_cl#_it#.stats.txt`: A text file of statistics that is suitable for machine processing and can be plotted using plot-vcfstats.
  - `ivar/`
    - `<sample-id>/<sample-id>_cl#_it#.ivar.tsv`: A tabular file containing the variant calls.
    - `<sample-id>/<sample-id>_cl#_it#.ivar.vcf`: A VCF file containing the variant calls.
    - `<sample-id>/<sample-id>_cl#_it#.ivar.variant_counts.log`: A summary file containing the number of indels and SNPs.
    - `<sample-id>/<sample-id>_cl#_it#.filtered.vcf.gz`: A compressed VCF file containing the filtered variant calls.
    - `<sample-id>/<sample-id>_cl#_it#.stats.txt`: A text file of statistics that is suitable for machine processing and can be plotted using plot-vcfstats.
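For a quick look at one of the VCF files listed above without extra tooling, the following minimal sketch counts SNPs and indels using only the Python standard library. The function name `count_variant_types` is hypothetical, and the SNP/indel classification is a simplification based on allele lengths; it is not how the pipeline or BCFtools classifies variants.

```python
import gzip
from collections import Counter
from pathlib import Path


def count_variant_types(vcf_path):
    """Count SNPs and indels in a plain or gzip/bgzip-compressed VCF file."""
    path = Path(vcf_path)
    opener = gzip.open if path.suffix == ".gz" else open
    counts = Counter()
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip header and blank lines
            fields = line.rstrip("\n").split("\t")
            ref, alts = fields[3], fields[4].split(",")
            for alt in alts:
                if alt in (".", "*"):
                    continue  # no alternate allele, or spanning deletion
                if len(ref) == 1 and len(alt) == 1:
                    counts["snp"] += 1
                elif len(ref) != len(alt):
                    counts["indel"] += 1
                else:
                    counts["other"] += 1  # e.g. multi-nucleotide substitutions
    return counts
```

Calling `count_variant_types` on a `.filtered.vcf.gz` file returns a `Counter` keyed by `snp`, `indel` and `other`.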
Consensus generation
The consensus sequences are generated by BCFtools or iVar. The consensus sequences are stored in the directory consensus/ or in the iterations directory assembly/polishing/iterations/it#/consensus.
BCFtools uses the filtered variants file, whereas iVar redetermines the variants to collapse into the consensus using its own workflow. Read more about their differences in the consensus calling section.
Output files - iterations & variants
consensus
seq/<it# | consensus | singleton | constrain>/
<sample-id>/*.fasta
: A fasta file containing the consensus sequence.
mask/<it# | consensus | singleton | constrain>
<sample-id>/*.qual.txt
: A log file of the consensus run containing statistics. [iVar only]
<sample-id>/*.bed
: A bed file containing the masked regions. [BCFtools only]
<sample-id>/*.mpileup
: An mpileup file containing information on the depth and the quality of each aligned base.
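A consensus FASTA file can be inspected with a few lines of Python. The sketch below reads a FASTA file and reports the fraction of `N` bases per sequence. It assumes that masked or uncovered positions are written as `N`, which this section does not state; the function names are hypothetical.

```python
def read_fasta(path):
    """Return a dict mapping each sequence name to its sequence string."""
    records = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                name = line[1:].split()[0]  # keep the identifier, drop the description
                records[name] = []
            elif name is not None:
                records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def n_fraction(sequence):
    """Fraction of positions that are 'N' (assumed here to mark masked bases)."""
    if not sequence:
        return 0.0
    return sequence.upper().count("N") / len(sequence)
```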
Consensus Quality control
Consensus quality control is done with multiple tools; the results are stored in the directory consensus/quality_control/.
Quast
QUAST is a quality assessment tool for genome assemblies. It calculates various metrics such as N50, L50, number of contigs, number of mismatches, number of indels, etc.
Output files
consensus/quality_control/quast/
<sample-id>/<iteration>/<sample-id>_<cl# | constrain-id>.tsv
: A tabular file containing the QUAST report.
If no iterative refinement was run, the output will be in the consensus/quality_control/quast/<sample-id>/constrain directory.
CheckV
CheckV
is a tool for assessing the quality of viral genomes recovered from metagenomes. It calculates various metrics such as the number of viral genes, the number of viral contigs, the number of viral genomes, etc.
Output files
consensus/quality_control/checkv/
<sample-id>/<sample-id>_<cl# | constrain-id>/quality_summary.tsv
: A tabular file that integrates the results from the three main modules of CheckV and should be the main output referred to.
<sample-id>/<sample-id>_<cl# | constrain-id>/completeness.tsv
: A detailed overview of how completeness was estimated.
<sample-id>/<sample-id>_<cl# | constrain-id>/contamination.tsv
: A detailed overview of how contamination was estimated.
<sample-id>/<sample-id>_<cl# | constrain-id>/complete_genomes.tsv
: A detailed overview of putative complete genomes identified.
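As an illustration of how quality_summary.tsv might be used downstream, the sketch below keeps genomes above a completeness threshold. It assumes the file has a header row with `contig_id` and `completeness` columns, as in CheckV's standard output; the function name and the 90% default are hypothetical choices.

```python
import csv


def select_by_completeness(quality_summary_path, min_completeness=90.0):
    """Return (contig_id, completeness) pairs at or above the threshold."""
    selected = []
    with open(quality_summary_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            try:
                completeness = float(row.get("completeness"))
            except (TypeError, ValueError):
                continue  # e.g. "NA" when completeness could not be estimated
            if completeness >= min_completeness:
                selected.append((row["contig_id"], completeness))
    return selected
```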
BLASTn
BLAST is a tool for comparing primary biological sequence information. The output from the BLAST run is stored in the directory consensus/quality_control/blast/
. Final consensus genomes are searched against the --reference_pool
.
Column names
- qseqid
- sseqid
- stitle
- pident
- qlen
- slen
- length
- mismatch
- gapopen
- qstart
- qend
- sstart
- send
- evalue
- bitscore
Modifying blast columns
Modifying these columns can be done through a custom config file and by updating bin/utils/constant_variables.py
.
Output files
consensus/quality_control/blast/
A tabular file containing the BLAST report of all intermediate & final results.
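The column names listed above can be used to parse the tabular report. The sketch below is a minimal example that assumes a tab-separated file whose fields follow that column order; a header row, if present, is skipped. The helper names (`read_hits`, `query_coverage`) are hypothetical and not part of viralgenie.

```python
import csv

BLAST_COLUMNS = [
    "qseqid", "sseqid", "stitle", "pident", "qlen", "slen", "length",
    "mismatch", "gapopen", "qstart", "qend", "sstart", "send",
    "evalue", "bitscore",
]
INT_COLUMNS = {"qlen", "slen", "length", "mismatch", "gapopen",
               "qstart", "qend", "sstart", "send"}
FLOAT_COLUMNS = {"pident", "evalue", "bitscore"}


def read_hits(path, columns=BLAST_COLUMNS):
    """Parse a tab-separated hit table into a list of typed dictionaries."""
    hits = []
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for fields in reader:
            # Skip blank lines, comment lines and an optional header row.
            if not fields or fields[0].startswith("#") or fields[0] == columns[0]:
                continue
            hit = dict(zip(columns, fields))
            for name in INT_COLUMNS:
                if name in hit:
                    hit[name] = int(hit[name])
            for name in FLOAT_COLUMNS:
                if name in hit:
                    hit[name] = float(hit[name])
            hits.append(hit)
    return hits


def query_coverage(hit):
    """Fraction of the query sequence spanned by the alignment."""
    return (abs(hit["qend"] - hit["qstart"]) + 1) / hit["qlen"]
```

If the columns are changed through a custom config, the `columns` argument would need to match the new order.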
MMseqs-search (annotation)
MMseqs-search is an ultra-fast and sensitive search tool for protein and nucleotide databases. Viralgenie uses MMseqs to search the consensus genomes against an annotated database, like Virosaurus (see also defining your own custom annotation database), and uses the annotation data of the best hit to assign the consensus genome a species name, segment name, expected host and any other metadata that is embedded within the database.
Column names
- qseqid
- sseqid
- stitle
- pident
- qlen
- slen
- length
- mismatch
- gapopen
- qstart
- qend
- sstart
- send
- evalue
- bitscore
Modifying mmseqs columns
Modifying these columns can be done through a custom config file and by updating bin/utils/constant_variables.py
.
Output files
consensus/quality_control/mmseqs-search/all_genomes_annotation.hits.tsv
: A tabular file containing the MMseqs-search hits; all genomes are combined to reduce the number of jobs.
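Because all genomes share one hits file, selecting a single hit per genome means grouping by `qseqid`. How viralgenie ranks hits is not stated here; the sketch below assumes that "best" means the highest `bitscore`, and it works on already-parsed rows (dictionaries with at least `qseqid` and `bitscore`). The function name is hypothetical.

```python
def best_hit_per_query(hits):
    """Keep, for every qseqid, the hit with the highest bitscore."""
    best = {}
    for hit in hits:
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best


# Small in-memory example with made-up identifiers.
example_hits = [
    {"qseqid": "genome_a", "sseqid": "ref_1", "bitscore": 850.0},
    {"qseqid": "genome_a", "sseqid": "ref_2", "bitscore": 1200.0},
    {"qseqid": "genome_b", "sseqid": "ref_3", "bitscore": 400.0},
]
best = best_hit_per_query(example_hits)
```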
MAFFT
MAFFT is a multiple sequence alignment program for amino acid or nucleotide sequences. The output from the MAFFT run is stored in the directory consensus/quality_control/mafft/
.
It is used to align the following genomic data:
- The final consensus genome
- The identified reference genome from --reference_pool
- The de novo contigs from each assembler (that constituted the final consensus genome)
- Each consensus genome from the iterative refinement steps.
Output files
consensus/quality_control/mafft/
<sample-id>/<sample_id>_cl#_iterations.fas
: A fasta file containing a multiple sequence alignment of only the iterations.
<sample-id>/<sample_id>_cl#_aligned.fas
: A fasta file containing a multiple sequence alignment of the de novo contigs, the reference from reference_pool and the consensus from iterations.
The alignment can then be opened with an MSA viewer, for example Jalview.
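Besides visual inspection, two rows of the alignment can be compared directly. The sketch below computes pairwise identity between two aligned sequences of equal length, ignoring columns where either sequence has a gap. This definition of identity is an assumption made for illustration, and the function name is hypothetical.

```python
def pairwise_identity(aligned_a, aligned_b):
    """Identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for base_a, base_b in zip(aligned_a.upper(), aligned_b.upper()):
        if base_a == "-" or base_b == "-":
            continue  # skip gapped columns
        compared += 1
        if base_a == base_b:
            matches += 1
    return matches / compared if compared else 0.0
```

Comparing consecutive iterations this way gives a rough indication of how much each refinement step changed the consensus.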
MultiQC
MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.
Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.
Furthermore, viralgenie runs MultiQC twice, as it uses the output of the first run to create multiple summary tables of the consensus genomes and their iterations.
Output files
multiqc/
overview-tables/
: a directory with a set of commented TSV files (comments taken from --multiqc_comment_headers) that summarise aspects of the pipeline runs.
multiqc_report.html
: a standalone HTML file that can be viewed in your web browser.
multiqc_data/
: directory containing parsed statistics from the different tools used in the pipeline.
multiqc_dataprep/
: preparation files for the generated custom tables.
multiqc_plots/
: directory containing static images from the report in various formats.
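Since the overview tables are commented TSV files, a standard TSV parser needs the comment lines removed first. The sketch below assumes that comment lines start with `#` and that the first non-comment line is the header; both are assumptions, and the function name is hypothetical.

```python
import csv


def read_commented_tsv(path, comment_prefix="#"):
    """Read a TSV file into a list of dicts, ignoring comment lines."""
    with open(path, newline="") as handle:
        data_lines = [line for line in handle if not line.startswith(comment_prefix)]
    return list(csv.DictReader(data_lines, delimiter="\t"))
```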
Pipeline information
Output files
pipeline_info/
- Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
- Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
- Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
- Parameters used by the pipeline run: params.json.
Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
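When troubleshooting, a quick summary of task outcomes can be pulled from execution_trace.txt. The sketch below assumes the trace file is tab-separated with a header row that includes a `status` column, as in Nextflow's default trace format; the function name is hypothetical.

```python
import csv
from collections import Counter


def summarise_trace(trace_path):
    """Count tasks per status (e.g. COMPLETED, FAILED) in a Nextflow trace file."""
    with open(trace_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return Counter(row["status"] for row in reader)
```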