Assembly & polishing
Viralgenie offers an elaborate workflow for the assembly and polishing of viral genomes:
- Assembly: combining the results of multiple assemblers.
- Reference Matching: comparing the newly assembled contigs to a reference sequence pool.
- Clustering: clustering the contigs based on taxonomy and similarity.
- Scaffolding: scaffolding the contigs to the centroid of each bin.
- Annotation with Reference: annotating regions with 0-depth coverage with the reference sequence.
The overal workflow of creating reference assisted assemblies can be skipped with the argument
--skip_assembly
. See the parameters assembly section for all relevant arguments to control the assembly steps.The overall refinement of contigs can be skipped with the argument
--skip_polishing
. See the parameters polishing section for all relevant arguments to control the polishing steps.
The consensus genome of all clusters are then send to the variant analysis & iterative refinement step.
De-novo Assembly
Three assemblers are used, SPAdes, Megahit, and Trinity. The resulting contigs of all specified assemblers, are combined and processed further together.
Modify the spades mode with
--spades_mode [default: rnaviral]
and supply specific params with--spades_yml
or a hmm model with--spades_hmm
.Specify the assemblers to use with the
--assemblers
parameter where the assemblers are separated with a ','. The default isspades,megahit,trinity
.
Reference Matching
The newly assembled contigs are compared to a reference sequence pool (--reference_pool) using a BLASTn search. This process not only helps annotate the contigs but also assists in linking together sets of contigs that are distant within a single genome. Essentially, it aids in identifying contigs belonging to the same genomic segment and choosing the right reference for scaffolding purposes.
The top 5 hits for each contig are combined with the denovo contigs and send to the clustering step.
The reference pool can be specified with the
--reference_pool
parameter. The default is the latest clustered Reference Viral DataBase (RVDB).
Clustering
The clustering workflow of contigs consists out of 2 steps, the pre-clustering and actual clustering. Here contigs are first separated based on identified taxonomy-id (Kraken2, Kaiju) and are subsequently clustered further to identify genome segments.
Pre-clustering
The contigs along with their references have their taxonomy assigned using Kraken2 and Kaiju.
The default databases are the same ones used for read classification: - Kraken2: viral refseq database,
--kraken2_db
- Kaiju: clustered RVDB,--kaiju_db
As Kajiu and Kraken2 can have different taxonomic assignments, an additional step is performed to resolve potential inconsistencies in taxonomy and to identify the taxonomy of the contigs. This is done with a custom script that is based on KrakenTools extract_kraken_reads.py
and kaiju-Merge-Outputs
.
graph LR;
A[Contigs] --> B["`**Kraken2**`"];
A --> C["`**Kaiju**`"];
B --> D[Taxon merge resolving];
C --> D;
D --> E["Taxon filtering"];
E --> F["Taxon simplification"];
Having very complex metagenomes
The pre-clustering step can be used to simplify the taxonomy of the contigs, let NCBI's taxonomy browser help you identify taxon-id's for simplification. The simplification can be done in several ways:
- Make sure your contamination database is up to date and removes the relevant taxa.
- Exclude unclassified contigs with
--keep_unclassified false
parameter. - Simplify the taxonomy of the contigs to a higher rank using
--precluster_simplify_taxa
parameter (1). - Specify the taxa to include or exclude with
--precluster_include_children
(2),--precluster_include_parents
(3),--precluster_exclude_children
,--precluster_exclude_parents
,--precluster_exclude_taxa
parameters.
Warning
Providing lists to nextflow is done by encapsulating values with "
and separating them with a space. For example: --precluster_exclude_taxa "taxon1 taxon2 taxon3"
.
-
Options here are 'species', 'genus', 'family', 'order', 'class', 'phylum', 'kingdom' or 'superkingdom'.
-
--precluster_include_childeren
"genus1" :graph TD; A[family] -.- B["genus1 (included)"]; A -.- C[genus2]; B --- D[species1]; B --- E[species2]; C -.- F[species3];
-
--precluster_include_parents
"species3" :graph TD; A["family (included)"] -.- B["genus1"] A --- C[genus2] B -.- D[species1] B -.- E[species2] C --- F[species3]
The pre-clustering step will be run by default but can be skipped with the argument
--skip_preclustering
. Specify which classifier to use with--precluster_classifiers
parameter. The default iskaiju,kraken2
. Contig taxon filtering is still enabled despite not having to solve for inconsistencies if only Kaiju or Kraken2 is ran.
Actual clustering
The clustering is performed with one of the following tools:
These methods all come with their own advantages and disadvantages. For example, cdhitest is very fast but cannot be used for large viruses >10Mb and similarity threshold cannot go below 80% which is not preferable for highly diverse RNA viruses. Vsearch is slower but accurate. Mmseqs-linclust is the fastest but tends to create a large amount of bins. Mmseqs-cluster is slower but can handle larger datasets and is more accurate. vRhyme is a new method that is still under development but has shown promising results but can sometimes not output any bins when segments are small. Mash is a very fast comparison method is linked with a custom script that identifies communities within a network.
Tip
When pre-clustering is performed, it is recommended to set a lower identity_threshold (60-70% ANI) as the new goal becomes to separate genome segments within the same bin.
The clustering method can be specified with the
--clustering_method
parameter. The default ismash
.The network clustering method for
mash
can be specified with the--network_clustering
parameter. The default isconnected_components
, alternative isleiden
.The similarity threshold can be specified with the
--similarity_threshold
parameter. The default is0.6
. However, for cdhit the default is0.8
which is its minimum value.
Scaffolding
After classifying all contigs and their top BLAST hits into distinct clusters or bins, the contigs are then scaffolded to the centroid of each bin. Any external references that are not centroids of the cluster are subsequently removed to prevent further bias. All members of the cluster are consequently mapped towards their centroid with Minimap2 and conensus is called using iVar-consensus.
Annotation with Reference
Regions with 0-depth coverage are annotated with the reference sequence. This is done with a custom script that uses the coverage of the denovo contigs towards the reference sequence to identify regions with 0-depth coverage. The reference sequence is then annotated to these regions.
This step can be skipped using
--skip_hybrid_consensus
parameter.