Metagenome-Assembled Genomes

Identify high-quality metagenome-assembled genomes (MAGs) from PacBio HiFi data.

A workflow, written in Workflow Description Language (WDL), for identifying high-quality metagenome-assembled genomes (MAGs) from PacBio HiFi data.

Metagenomics workflow diagram

Workflow Inputs

The workflow can run with either FASTQ- or BAM-format HiFi reads as input. If BAM reads are supplied, they are first converted to FASTQ before being run through the remainder of the MAG pipeline.

| Input | Description |
| --- | --- |
| sample_id | Sample ID; used for naming files. |
| hifi_reads_bam | HiFi reads in BAM format. If supplied, the reads are first converted to FASTQ. One of [hifi_reads_bam, hifi_reads_fastq] is required. |
| hifi_reads_fastq | HiFi reads in FASTQ format. One of [hifi_reads_bam, hifi_reads_fastq] is required. |
| checkm2_ref_db | The CheckM2 DIAMOND reference database (UniRef100/KO) used to predict the completeness and contamination of MAGs. |
| min_contig_length | Minimum size of a contig for it to be considered a long contig. [500000] |
| min_contig_completeness | Minimum completeness percentage (from CheckM2) to mark a contig as complete and place it in a distinct bin; this value should not be lower than 90%. [93] |
| metabat2_min_contig_size | The minimum size of a contig to be included in binning for MetaBAT2. [30000] |
| semibin2_model | The trained model to be used in SemiBin2. If set to ‘TRAIN’, a new model will be trained from your data. (‘TRAIN’, ‘human_gut’, ‘human_oral’, ‘dog_gut’, ‘cat_gut’, ‘mouse_gut’, ‘pig_gut’, ‘chicken_caecum’, ‘ocean’, ‘soil’, ‘built_environment’, ‘wastewater’, ‘global’) [‘global’] |
| dastool_search_engine | The engine used by DAS Tool for single-copy gene searching. (‘blast’, ‘diamond’, ‘usearch’) [‘diamond’] |
| dastool_score_threshold | Score threshold (0..1) used by DAS Tool; the selection algorithm keeps selecting bins until scores fall below this value. [0.2] |
| min_mag_completeness | Minimum completeness percentage for a genome bin. [70] |
| max_mag_contamination | Maximum contamination percentage for a genome bin. [10] |
| max_contigs | The maximum number of contigs allowed in a genome bin. [20] |
| gtdbtk_data_tar_gz | A .tar.gz file of GTDB-Tk (Genome Taxonomy Database Toolkit) reference data, release207_v2, used for assigning taxonomic classifications to bacterial and archaeal genomes. |
| backend | Backend where the workflow will be executed. (‘Azure’, ‘AWS’, ‘GCP’, ‘HPC’) |
| zones | Zones where compute will take place; required if backend is set to ‘AWS’ or ‘GCP’. |
| aws_spot_queue_arn | Queue ARN for the spot batch queue; required if backend is set to ‘AWS’ and preemptible is set to true. |
| aws_on_demand_queue_arn | Queue ARN for the on-demand batch queue; required if backend is set to ‘AWS’ and preemptible is set to false. |
| container_registry | Container registry where workflow images are hosted. If left blank, PacBio’s public Quay.io registry will be used. |
| preemptible | If set to true, run tasks preemptibly where possible. On-demand VMs will be used only for tasks that run for >24 hours if the backend is set to GCP. If set to false, on-demand VMs will be used for every task. Ignored if backend is set to HPC. |
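
The conditional requirements in the inputs above (one of the two read inputs, zones on AWS or GCP, and the matching queue ARN on AWS) can be checked before submitting a run. The following is a minimal sketch, not part of the workflow itself: the `validate_inputs` helper and the example values are hypothetical, and it checks only the rules stated in this section. Workflow engines generally expect input keys to be qualified with the workflow name; that prefix is omitted here because this section does not state it.

```python
import json

VALID_BACKENDS = {"Azure", "AWS", "GCP", "HPC"}


def validate_inputs(inputs: dict) -> list:
    """Return a list of problems; an empty list means the stated rules are met."""
    problems = []

    if not inputs.get("sample_id"):
        problems.append("sample_id is required")

    # One of hifi_reads_bam / hifi_reads_fastq is required.
    # Assumption: supplying both is not treated as an error here.
    if not (inputs.get("hifi_reads_bam") or inputs.get("hifi_reads_fastq")):
        problems.append("one of hifi_reads_bam or hifi_reads_fastq is required")

    backend = inputs.get("backend")
    if backend not in VALID_BACKENDS:
        problems.append(f"backend must be one of {sorted(VALID_BACKENDS)}")

    # zones is required when the backend is AWS or GCP.
    if backend in {"AWS", "GCP"} and not inputs.get("zones"):
        problems.append("zones is required when backend is AWS or GCP")

    # On AWS, the required queue ARN depends on the preemptible setting.
    if backend == "AWS":
        key = "aws_spot_queue_arn" if inputs.get("preemptible") else "aws_on_demand_queue_arn"
        if not inputs.get(key):
            problems.append(f"{key} is required for this AWS configuration")

    return problems


# Hypothetical example values; replace with real paths and settings.
example_inputs = {
    "sample_id": "sample1",
    "hifi_reads_fastq": "sample1.hifi.fastq.gz",
    "backend": "HPC",
    "preemptible": False,
}

problems = validate_inputs(example_inputs)
print("\n".join(problems) if problems else json.dumps(example_inputs, indent=2))
```

Returning a list of problems rather than raising on the first one makes it easier to fix an inputs file in a single pass.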

Workflow Outputs

The set of outputs generated by the metagenomics workflow depends on whether any long or incomplete contigs pass quality filters.
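
As a rough illustration of how contigs end up on the long-contig or incomplete-contig path, the sketch below splits contigs using the min_contig_length and min_contig_completeness inputs and their defaults. It is an interpretation of the input descriptions, not the workflow's actual code: the `Contig` class, the `split_contigs` function, the example values, and the assumption that both thresholds are inclusive are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    name: str
    length: int          # contig size
    completeness: float  # CheckM2 completeness, in percent


def split_contigs(contigs, min_contig_length=500_000, min_contig_completeness=93.0):
    """Split contigs into (long and complete, incomplete) name lists.

    Assumption: a contig must meet BOTH thresholds to be placed in its own bin;
    every other contig is passed on to incomplete contig binning.
    """
    long_complete, incomplete = [], []
    for contig in contigs:
        if (contig.length >= min_contig_length
                and contig.completeness >= min_contig_completeness):
            long_complete.append(contig.name)
        else:
            incomplete.append(contig.name)
    return long_complete, incomplete


# Made-up contigs for illustration only.
contigs = [
    Contig("ctg1", 2_400_000, 98.5),  # long and complete
    Contig("ctg2", 750_000, 61.0),    # long, but below the completeness threshold
    Contig("ctg3", 40_000, 3.2),      # short
]

long_complete, incomplete = split_contigs(contigs)
print("long complete:", long_complete)
print("incomplete:", incomplete)
```

If either list is empty, the outputs tied to that path would not be expected to appear.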

Metagenome assembly

| Output | Description |
| --- | --- |
| converted_fastq | If a BAM file was provided, the converted FASTQ version of that file |
| assembled_contigs_gfa | Assembled contigs in GFA format |
| assembled_contigs_fa_gz | Assembled contigs in gzipped FASTA format |

Contig binning

| Output | Description |
| --- | --- |
| dereplicated_bin_fas | Set of passing long contig and non-redundant incomplete contig bins |
| bin_quality_report_tsv | CheckM2 completeness/contamination report for long and non-redundant incomplete contig bins |
| gtdb_batch_txt | GTDB-Tk batch file; used during taxonomy assignment |
| passed_bin_count_txt | Text file containing an integer specifying the number of bins that passed quality control |
| filtered_quality_report_tsv | Filtered bin_quality_report_tsv containing quality information about passing bins |

Long contig binning

| Output | Description |
| --- | --- |
| long_contig_bin_map | Map between passing long contigs and bins in TSV format |
| long_contig_bin_quality_report_tsv | CheckM2 completeness/contamination report for long contigs |
| filtered_long_contig_bin_map | Map between passing long contigs and bins that also pass the completeness threshold in TSV format |
| long_contig_scatterplot_pdf | Completeness vs. size scatterplot |
| long_contig_histogram_pdf | Completeness histogram |
| passing_long_contig_bin_map | If any contigs pass the length filter, this will be the filtered_long_contig_bin_map; otherwise, this is the long_contig_bin_map |
| filtered_long_bin_fas | Set of long bin FASTA files that pass the length and completeness thresholds |
| incomplete_contigs_fa | FASTA file containing contigs that do not pass either the length or the completeness threshold |

Incomplete contig binning

| Output | Description |
| --- | --- |
| aligned_sorted_bam | HiFi reads aligned to the assembled contigs |
| contig_depth_txt | Summary of aligned BAM contig depths |
| metabat2_bin_fas | Bins output by MetaBAT2 in FASTA format |
| metabat2_contig_bin_map | Map between contigs and MetaBAT2 bins |
| semibin2_bins_tsv | Bin info TSV output by SemiBin2 |
| semibin2_bin_fas | Bins output by SemiBin2 in FASTA format |
| semibin2_contig_bin_map | Map between contigs and SemiBin2 bins |
| merged_incomplete_bin_fas | Non-redundant incomplete contig bin set from MetaBAT2 and SemiBin2 |

Taxonomy assignment

Taxonomy assignment outputs will be generated if there is at least one long or incomplete bin passing filters.

| Output | Description |
| --- | --- |
| gtdbtk_summary_txt | GTDB-Tk summary file in txt format |
| gtdbk_output_tar_gz | GTDB-Tk results for dereplicated bins that passed filtering with CheckM2 |
| mag_summary_txt | A main summary file that brings together information from CheckM2 and GTDB-Tk for all MAGs that pass the filtering step |
| filtered_mags_fas | The FASTA files for all high-quality MAGs/bins |
| dastool_bins_plot_pdf | Figure that shows the dereplicated bins that were created from the set of incomplete contigs (using MetaBAT2 and SemiBin2) as well as the long complete contigs |
| contigs_quality_plot_pdf | A plot showing the relationship between completeness and contamination for each high-quality MAG recovered, colored by the number of contigs per MAG |
| genome_size_depths_plot_df | A plot showing the relationship between genome size and depth of coverage for each high-quality MAG recovered, colored by % GC content per MAG |
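
The passing-bin filter that gates taxonomy assignment can be sketched from the min_mag_completeness, max_mag_contamination, and max_contigs inputs. This is an illustration under stated assumptions rather than the workflow's implementation: the dictionary keys, the `passes_mag_filters` function, the example bins, and the inclusive comparisons are hypothetical and do not reflect the actual column names in the CheckM2 report.

```python
def passes_mag_filters(bin_stats, min_mag_completeness=70.0,
                       max_mag_contamination=10.0, max_contigs=20):
    """Return True if a bin meets all three quality thresholds.

    Assumption: thresholds are inclusive (a bin exactly at a limit passes).
    """
    return (bin_stats["completeness"] >= min_mag_completeness
            and bin_stats["contamination"] <= max_mag_contamination
            and bin_stats["contig_count"] <= max_contigs)


# Made-up bins for illustration only.
bins = [
    {"bin": "bin.1", "completeness": 96.2, "contamination": 1.1, "contig_count": 1},
    {"bin": "bin.2", "completeness": 82.0, "contamination": 14.5, "contig_count": 9},
    {"bin": "bin.3", "completeness": 74.3, "contamination": 3.0, "contig_count": 35},
]

passed = [b["bin"] for b in bins if passes_mag_filters(b)]

# The number of passing bins corresponds to what passed_bin_count_txt records;
# taxonomy assignment outputs are only expected when it is at least one.
passed_bin_count = len(passed)
run_taxonomy_assignment = passed_bin_count > 0

print(passed, passed_bin_count, run_taxonomy_assignment)
```

In this example only bin.1 passes: bin.2 exceeds the contamination limit and bin.3 has too many contigs.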

References

Reference datasets are hosted publicly for use in the pipeline.

Containers

Docker image definitions used by this workflow can be found in the wdl-dockerfiles repository. Images are hosted in PacBio’s Quay.io registry. Docker images used in the workflow are pinned to specific versions by referring to their digests rather than tags.

The Docker image used by a particular step of the workflow can be identified by looking at the docker key in the runtime block for the given task.
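
Looking up the docker key for every task by hand gets tedious, so the lookup can be scripted. The sketch below uses a simple line-based regular expression scan rather than a real WDL parser, and it runs on a made-up WDL snippet: the task names, registry, image names, and digest are placeholders, not the workflow's actual values. It also checks whether each image reference ends in a `@sha256:` digest, which is how Docker expresses digest-pinned images.

```python
import re

DIGEST = "a" * 64  # placeholder, not a real image digest

# Hypothetical WDL text for illustration; real tasks will differ.
WDL_TEXT = """
task example_align {
    command <<<
        echo align
    >>>
    runtime {
        docker: "registry.example.com/example/aligner@sha256:%s"
    }
}

task example_plot {
    command <<<
        echo plot
    >>>
    runtime {
        docker: "registry.example.com/example/plotter:1.0"
    }
}
""" % DIGEST

TASK_RE = re.compile(r"^\s*task\s+(\w+)\s*\{")
DOCKER_RE = re.compile(r'^\s*docker\s*:\s*"([^"]+)"')
PINNED_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def docker_images_by_task(wdl_text):
    """Map each task name to the string value of its docker key."""
    images = {}
    current_task = None
    for line in wdl_text.splitlines():
        task_match = TASK_RE.match(line)
        if task_match:
            current_task = task_match.group(1)
            continue
        docker_match = DOCKER_RE.match(line)
        if docker_match and current_task is not None:
            images[current_task] = docker_match.group(1)
    return images


def is_digest_pinned(image):
    """True if the image reference ends with a sha256 digest instead of a tag."""
    return PINNED_RE.search(image) is not None


for task, image in docker_images_by_task(WDL_TEXT).items():
    status = "pinned by digest" if is_digest_pinned(image) else "not pinned by digest"
    print(f"{task}: {image} ({status})")
```

A line-based scan is enough for a quick audit, but it only sees docker values written as a quoted string on one line; values assembled from variables elsewhere in the file would need a proper WDL parser.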
