Metagenome-Assembled Genomes

A workflow, written in Workflow Description Language (WDL), for identifying high-quality metagenome-assembled genomes (MAGs) from PacBio HiFi data.

Workflow Inputs

The workflow accepts HiFi reads in either FASTQ or BAM format. If BAM reads are supplied, they are first converted to FASTQ before being run through the remainder of the MAG pipeline.

Input	Description
`sample_id`	Sample ID; used for naming files.
`hifi_reads_bam`	HiFi reads in BAM format. If supplied, the reads will first be converted to FASTQ. One of [hifi_reads_bam, hifi_reads_fastq] is required.
`hifi_reads_fastq`	HiFi reads in FASTQ format. One of [hifi_reads_bam, hifi_reads_fastq] is required.
`checkm2_ref_db`	The CheckM2 DIAMOND reference database (UniRef100/KO) used to predict the completeness and contamination of MAGs.
`min_contig_length`	Minimum length (bp) for a contig to be considered a long contig. [500000]
`min_contig_completeness`	Minimum completeness percentage (from CheckM2) to mark a contig as complete and place it in a distinct bin; this value should not be lower than 90%. [93]
`metabat2_min_contig_size`	The minimum size of contig to be included in binning for MetaBAT2. [30000]
`semibin2_model`	The trained model to be used in SemiBin2. If set to ‘TRAIN’, a new model will be trained from your data. (‘TRAIN’, ‘human_gut’, ‘human_oral’, ‘dog_gut’, ‘cat_gut’, ‘mouse_gut’, ‘pig_gut’, ‘chicken_caecum’, ‘ocean’, ‘soil’, ‘built_environment’, ‘wastewater’, ‘global’) [‘global’]
`dastool_search_engine`	The search engine DAS Tool uses to find single-copy genes. (‘blast’, ‘diamond’, ‘usearch’) [‘diamond’]
`dastool_score_threshold`	Bin score threshold (0..1) used by DAS Tool; the selection algorithm keeps selecting bins until scores fall below this value. [0.2]
`min_mag_completeness`	Minimum completeness percentage for a genome bin. [70]
`max_mag_contamination`	Maximum contamination percentage for a genome bin. [10]
`max_contigs`	The maximum number of contigs allowed in a genome bin. [20]
`gtdbtk_data_tar_gz`	A .tar.gz file of GTDB-Tk (Genome Taxonomy Database Toolkit) reference data, release207_v2, used for assigning taxonomic classifications to bacterial and archaeal genomes.
`backend`	Backend where the workflow will be executed. [“Azure”, “AWS”, “GCP”, “HPC”]
`zones`	Zones where compute will take place; required if backend is set to ‘AWS’ or ‘GCP’.
`aws_spot_queue_arn`	Queue ARN for the spot batch queue; required if backend is set to ‘AWS’ and `preemptible` is set to `true`.
`aws_on_demand_queue_arn`	Queue ARN for the on-demand batch queue; required if backend is set to ‘AWS’ and `preemptible` is set to `false`.
`container_registry`	Container registry where workflow images are hosted. If left blank, PacBio’s public Quay.io registry will be used.
`preemptible`	If set to `true`, run tasks preemptibly where possible. On-demand VMs will be used only for tasks that run for >24 hours if the backend is set to GCP. If set to `false`, on-demand VMs will be used for every task. Ignored if backend is set to HPC.
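
As a rough illustration of how the requirements in the table above fit together, the following Python sketch assembles an inputs JSON and checks the conditional requirements (read input, `zones`, AWS queue ARNs). The `metagenomics.` key prefix, the helper name `build_inputs`, and all file paths are assumptions made for illustration; check the workflow's WDL for the actual workflow name and input types.

```python
import json

# Assumed workflow name used as the key prefix in the inputs JSON.
WORKFLOW_NAME = "metagenomics"
BACKENDS = {"Azure", "AWS", "GCP", "HPC"}


def build_inputs(sample_id, backend, hifi_reads_bam=None, hifi_reads_fastq=None,
                 zones=None, preemptible=True, aws_spot_queue_arn=None,
                 aws_on_demand_queue_arn=None, **other_inputs):
    """Build an inputs dict, enforcing the requirements listed in the table."""
    if hifi_reads_bam is None and hifi_reads_fastq is None:
        raise ValueError("one of hifi_reads_bam or hifi_reads_fastq is required")
    if backend not in BACKENDS:
        raise ValueError(f"backend must be one of {sorted(BACKENDS)}")
    if backend in {"AWS", "GCP"} and not zones:
        raise ValueError("zones is required when backend is AWS or GCP")
    if backend == "AWS":
        if preemptible and not aws_spot_queue_arn:
            raise ValueError("aws_spot_queue_arn is required for preemptible AWS runs")
        if not preemptible and not aws_on_demand_queue_arn:
            raise ValueError("aws_on_demand_queue_arn is required for on-demand AWS runs")

    values = {
        "sample_id": sample_id,
        "backend": backend,
        "hifi_reads_bam": hifi_reads_bam,
        "hifi_reads_fastq": hifi_reads_fastq,
        "zones": zones,
        "preemptible": preemptible,
        "aws_spot_queue_arn": aws_spot_queue_arn,
        "aws_on_demand_queue_arn": aws_on_demand_queue_arn,
    }
    # Remaining inputs (reference databases, thresholds) pass through unchanged.
    values.update(other_inputs)
    # Unset optional inputs are left out so the workflow defaults apply.
    return {f"{WORKFLOW_NAME}.{key}": value
            for key, value in values.items() if value is not None}


# Placeholder paths; substitute the locations of your own data.
example = build_inputs(
    "sample1", "HPC",
    hifi_reads_fastq="/path/to/sample1.fastq.gz",
    checkm2_ref_db="/path/to/checkm2_db.dmnd",
    gtdbtk_data_tar_gz="/path/to/gtdbtk_data.tar.gz",
)
print(json.dumps(example, indent=2))
```

Thresholds such as `min_mag_completeness` can be passed as extra keyword arguments; anything omitted falls back to the defaults shown in square brackets in the table.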

Workflow Outputs

The set of outputs generated by the Metagenomics workflow depends on whether any long or incomplete contigs pass quality filters.

Output	Description

Metagenome assembly

`converted_fastq`	If a BAM file was provided, the converted FASTQ version of that file
`assembled_contigs_gfa`	Assembled contigs in GFA format
`assembled_contigs_fa_gz`	Assembled contigs in gzipped FASTA format

Contig binning

`dereplicated_bin_fas`	Set of passing long contig and non-redundant incomplete contig bins
`bin_quality_report_tsv`	CheckM2 completeness/contamination report for long and non-redundant incomplete contig bins
`gtdb_batch_txt`	GTDB-Tk batch file; used during taxonomy assignment
`passed_bin_count_txt`	Text file containing an integer specifying the number of bins that passed quality control
`filtered_quality_report_tsv`	Filtered `bin_quality_report_tsv` containing quality information about passing bins
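
The relationship between `bin_quality_report_tsv`, `filtered_quality_report_tsv`, and `passed_bin_count_txt` can be sketched in Python as a threshold filter. This is a minimal sketch, not the workflow's own code: it assumes the filter uses `min_mag_completeness`, `max_mag_contamination`, and `max_contigs`, that the comparisons are inclusive, and that the report has columns named `Name`, `Completeness`, `Contamination`, and `Contig_Count`. The column names and the example rows are made up for illustration.

```python
import csv
import io


def filter_bins(report_tsv, min_mag_completeness=70.0,
                max_mag_contamination=10.0, max_contigs=20):
    """Return the report rows for bins that meet all three thresholds."""
    reader = csv.DictReader(io.StringIO(report_tsv), delimiter="\t")
    passed = []
    for row in reader:
        complete_enough = float(row["Completeness"]) >= min_mag_completeness
        clean_enough = float(row["Contamination"]) <= max_mag_contamination
        few_contigs = int(row["Contig_Count"]) <= max_contigs
        if complete_enough and clean_enough and few_contigs:
            passed.append(row)
    return passed


# Made-up report with hypothetical column names and values.
example_report = (
    "Name\tCompleteness\tContamination\tContig_Count\n"
    "bin.1\t98.2\t0.5\t1\n"
    "bin.2\t65.0\t2.0\t8\n"
    "bin.3\t88.4\t12.3\t15\n"
    "bin.4\t91.0\t3.1\t42\n"
)

passing = filter_bins(example_report)
passed_bin_count = len(passing)  # analogous to the integer in passed_bin_count_txt
print(passed_bin_count, [row["Name"] for row in passing])
```

In this toy report only `bin.1` meets all three thresholds; the other bins each fail one of them (completeness, contamination, and contig count, respectively).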

Long contig binning

`long_contig_bin_map`	Map between passing long contigs and bins in TSV format
`long_contig_bin_quality_report_tsv`	CheckM2 completeness/contamination report for long contigs
`filtered_long_contig_bin_map`	Map between passing long contigs and bins that also pass the completeness threshold in TSV format
`long_contig_scatterplot_pdf`	Completeness vs. size scatterplot
`long_contig_histogram_pdf`	Completeness histogram
`passing_long_contig_bin_map`	If any contigs pass the length filter, this will be the `filtered_long_contig_bin_map`; otherwise, this is the `long_contig_bin_map`
`filtered_long_bin_fas`	Set of long bin FASTA files that pass the length and completeness thresholds
`incomplete_contigs_fa`	FASTA file containing contigs that fail the length threshold, the completeness threshold, or both
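
The split between `filtered_long_bin_fas` and `incomplete_contigs_fa` can be pictured with a short Python sketch. It assumes that a contig is placed in its own bin only when it meets both `min_contig_length` and `min_contig_completeness`, that the comparisons are inclusive, and that completeness is only known for contigs that were evaluated. The `Contig` class, the function name, and the example values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Contig:
    name: str
    length: int                           # contig length in bp
    completeness: Optional[float] = None  # percent; None if never evaluated


def split_contigs(contigs, min_contig_length=500_000, min_contig_completeness=93.0):
    """Partition contigs into (long and complete, incomplete)."""
    long_complete, incomplete = [], []
    for contig in contigs:
        is_long = contig.length >= min_contig_length
        is_complete = (contig.completeness is not None
                       and contig.completeness >= min_contig_completeness)
        if is_long and is_complete:
            long_complete.append(contig)   # would get its own bin
        else:
            incomplete.append(contig)      # passed on to incomplete contig binning
    return long_complete, incomplete


# Made-up contigs for illustration.
contigs = [
    Contig("contig_1", 2_000_000, 97.5),  # long and complete
    Contig("contig_2", 800_000, 60.0),    # long but incomplete
    Contig("contig_3", 40_000),           # too short to be a long contig
]
own_bins, to_binning = split_contigs(contigs)
print([c.name for c in own_bins], [c.name for c in to_binning])
```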

Incomplete contig binning

`aligned_sorted_bam`	HiFi reads aligned to the assembled contigs
`contig_depth_txt`	Summary of contig depths from the aligned BAM
`metabat2_bin_fas`	Bins output by `metabat2` in FASTA format
`metabat2_contig_bin_map`	Map between contigs and `metabat2` bins
`semibin2_bins_tsv`	Bin info TSV output by `semibin2`
`semibin2_bin_fas`	Bins output by `semibin2` in FASTA format
`semibin2_contig_bin_map`	Map between contigs and `semibin2` bins
`merged_incomplete_bin_fas`	Non-redundant incomplete contig bin set from `metabat2` and `semibin2`

Taxonomy assignment

Taxonomy assignment outputs will be generated if there is at least one long or incomplete bin passing filters.

`gtdbtk_summary_txt`	GTDB-Tk summary file in txt format
`gtdbk_output_tar_gz`	GTDB-Tk results for dereplicated bins that passed filtering with CheckM2
`mag_summary_txt`	A main summary file that brings together information from CheckM2 and GTDB-Tk for all MAGs that pass the filtering step.
`filtered_mags_fas`	The FASTA files for all high-quality MAGs/bins
`dastool_bins_plot_pdf`	Figure that shows the dereplicated bins that were created from the set of incomplete contigs (using MetaBAT2 and SemiBin2) as well as the long complete contigs
`contigs_quality_plot_pdf`	A plot showing the relationship between completeness and contamination for each high-quality MAG recovered, colored by the number of contigs per MAG.
`genome_size_depths_plot_df`	A plot showing the relationship between genome size and depth of coverage for each high-quality MAG recovered, colored by % GC content per MAG.

References

Reference datasets are hosted publicly for use in the pipeline.

Containers

Docker image definitions used by this workflow can be found in the wdl-dockerfiles repository. Images are hosted in PacBio’s quay.io registry. Docker images used in the workflow are pinned to specific versions by referring to their digests rather than tags.

The Docker image used by a particular step of the workflow can be identified by looking at the `docker` key in the `runtime` block for the given task.
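
To illustrate the difference between a digest reference and a tag reference, here is a minimal Python sketch that checks whether an image string is pinned by a `sha256` digest. The function name and the example image references are made up for illustration and do not correspond to real images used by the workflow.

```python
import re

# A digest-pinned reference has the form <repository>@sha256:<64 hex characters>.
DIGEST_REF = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")


def is_pinned_by_digest(image_ref):
    """Return True if the image reference uses a sha256 digest rather than a tag."""
    return DIGEST_REF.match(image_ref) is not None


# Hypothetical references: the first is pinned by digest, the second uses a tag.
pinned = "quay.io/example/tool@sha256:" + "0" * 64
tagged = "quay.io/example/tool:1.0.0"
print(is_pinned_by_digest(pinned), is_pinned_by_digest(tagged))
```

A digest identifies one exact image, whereas a tag can later be moved to point at a different image, which is why digest references give more reproducible runs.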