Human de novo Assembly

Generate de novo assemblies from PacBio read data. Supports single-sample and trio-based assemblies.

This workflow runs de novo assembly on human PacBio whole genome sequencing (WGS) data. It is written in Workflow Description Language (WDL) and can assemble individual samples as well as trios.

Human de novo assembly workflow diagram

Workflow Inputs

Single-sample de novo assembly can be run independently for each sample. If the cohort includes a trio, trio-based assembly can also be run.

Input | Description

cohort

A cohort can include one or more samples. Samples need not be related.

cohort_id | A unique name for the cohort; used to name outputs
samples | The set of samples for the cohort. At least one sample must be defined.
run_de_novo_assembly_trio | Run trio-binned de novo assembly.
samples

Sample information for each sample in the workflow run.

sample_id | A unique name for the sample; used to name outputs
movie_bams | The set of unaligned movie BAMs associated with this sample
sex | Sample sex
father_id | Paternal sample_id
mother_id | Maternal sample_id
run_de_novo_assembly | If true, run single-sample de novo assembly for this sample
reference

Files associated with the reference genome.

name | Reference name; used to name outputs (e.g., “GRCh38”)
fasta | Reference genome and index
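
As a rough illustration of how the cohort, samples, and reference inputs fit together, the sketch below builds them as a Python dictionary and serializes it to JSON. The field names come from the tables above. Every value (sample IDs, BAM paths, the sex strings, the FASTA path) is a hypothetical placeholder, and make_sample is a helper invented for this example. How the index is supplied alongside fasta is not specified here, so only a single placeholder path is shown. In a WDL inputs file, top-level keys are typically also prefixed with the workflow name, which is omitted here.

```python
import json


def make_sample(sample_id, movie_bams, sex=None, father_id=None,
                mother_id=None, run_de_novo_assembly=False):
    """Build one sample entry using the field names from the samples table."""
    return {
        "sample_id": sample_id,
        "movie_bams": list(movie_bams),
        "sex": sex,                      # value format is an assumption
        "father_id": father_id,
        "mother_id": mother_id,
        "run_de_novo_assembly": run_de_novo_assembly,
    }


# Hypothetical three-sample cohort: one child plus both parents.
cohort = {
    "cohort_id": "family1",
    "samples": [
        make_sample("kid", ["kid.movie1.bam"], sex="MALE",
                    father_id="dad", mother_id="mom",
                    run_de_novo_assembly=True),
        make_sample("dad", ["dad.movie1.bam"], sex="MALE"),
        make_sample("mom", ["mom.movie1.bam"], sex="FEMALE"),
    ],
    "run_de_novo_assembly_trio": True,
}

# Placeholder reference entry; the path is hypothetical.
reference = {"name": "GRCh38", "fasta": "reference/GRCh38.fasta"}

inputs = {"cohort": cohort, "reference": reference}
inputs_json = json.dumps(inputs, indent=2)
print(inputs_json)
```

Unset optional fields serialize as JSON null, which keeps the parent links explicit for the child and empty for the parents.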
backend

Backend where the workflow will be executed

zones

Zones where compute will take place; required if backend is set to ‘AWS’ or ‘GCP’.

aws_spot_queue_arn

Queue ARN for the spot batch queue; required if backend is set to ‘AWS’ and preemptible is set to true

aws_on_demand_queue_arn

Queue ARN for the on-demand batch queue; required if backend is set to ‘AWS’ and preemptible is set to false

container_registry

Container registry where workflow images are hosted. If left blank, PacBio’s public Quay.io registry will be used.

preemptible

If set to true, run tasks preemptibly where possible; when the backend is set to GCP, on-demand VMs will be used only for tasks that run for more than 24 hours. If set to false, on-demand VMs will be used for every task. Ignored if backend is set to HPC.
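
The conditional requirements among backend, zones, preemptible, and the two AWS queue ARNs can be checked before a run is submitted. The following is a minimal sketch of such a check, not part of the workflow itself. The function name check_backend_inputs and the example ARN are hypothetical, and the check encodes only the rules stated above.

```python
def check_backend_inputs(backend, preemptible, zones=None,
                         aws_spot_queue_arn=None,
                         aws_on_demand_queue_arn=None):
    """Return a list of problems; an empty list means the stated rules are met."""
    problems = []

    # zones is required for the two cloud backends named in the description.
    if backend in ("AWS", "GCP") and not zones:
        problems.append("zones is required when backend is AWS or GCP")

    # On AWS, the queue that is required depends on preemptible.
    if backend == "AWS":
        if preemptible and not aws_spot_queue_arn:
            problems.append(
                "aws_spot_queue_arn is required when backend is AWS "
                "and preemptible is true")
        if not preemptible and not aws_on_demand_queue_arn:
            problems.append(
                "aws_on_demand_queue_arn is required when backend is AWS "
                "and preemptible is false")

    return problems


# Hypothetical example: AWS with preemptible tasks but no spot queue given.
for problem in check_backend_inputs("AWS", True, zones="us-east-1a"):
    print(problem)
```

Returning a list rather than raising on the first failure lets a caller report every missing input at once.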

Workflow Outputs

The output set will depend on whether single-sample or trio-based de novo assembly is run.

Output | Description
Sample de novo assembly

These files will be output if cohort.samples[sample].run_de_novo_assembly is set to true for any sample.

zipped_assembly_fastas | De novo dual assembly generated by hifiasm
assembly_noseq_gfas | Assembly graphs in GFA format.
assembly_lowQ_beds | Coordinates of low quality regions in BED format.
assembly_stats | Assembly size and NG50 stats generated by calN50.
asm_bam | minimap2 alignment of assembly to reference.
htsbox_vcf | Naive pileup variant calling of assembly against reference with htsbox
htsbox_vcf_stats | bcftools stats summary statistics for htsbox variant calls
Trio de novo assembly

These files will be output if cohort.run_de_novo_assembly_trio is set to true and there is at least one parent-parent-kid trio in the cohort.

trio_zipped_assembly_fastas | Haplotype-resolved de novo assembly of the trio kid generated by hifiasm with trio binning
trio_assembly_noseq_gfas | Assembly graphs in GFA format.
trio_assembly_lowQ_beds | Coordinates of low quality regions in BED format.
trio_assembly_stats | Assembly size and NG50 stats generated by calN50.
trio_asm_bams | minimap2 alignment of assembly to reference.
haplotype_key | Indication of which haplotype (hap1/hap2) corresponds to which parent.
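
Whether trio outputs can be expected depends on the cohort containing at least one parent-parent-kid trio. How the workflow itself identifies trios is not shown here. The sketch below assumes a trio is any sample whose father_id and mother_id both name other samples in the same cohort. The function name find_trios and the sample IDs are hypothetical.

```python
def find_trios(samples):
    """Return (child_id, father_id, mother_id) tuples found in the cohort.

    Assumption: a sample counts as a trio child only when both its
    father_id and mother_id match the sample_id of a sample in the cohort.
    """
    sample_ids = {s["sample_id"] for s in samples}
    trios = []
    for s in samples:
        father = s.get("father_id")
        mother = s.get("mother_id")
        # A missing parent ID is None, which never matches a sample_id.
        if father in sample_ids and mother in sample_ids:
            trios.append((s["sample_id"], father, mother))
    return trios


# Hypothetical cohort: one complete trio plus an unrelated sample.
example_samples = [
    {"sample_id": "kid", "father_id": "dad", "mother_id": "mom"},
    {"sample_id": "dad"},
    {"sample_id": "mom"},
    {"sample_id": "other", "father_id": "absent_father"},
]
print(find_trios(example_samples))
```

A sample whose listed parent is absent from the cohort, such as "other" above, is not reported as a trio child.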

References

Reference datasets are hosted publicly for use in the pipeline.

Containers

Docker image definitions used by this workflow can be found in the wdl-dockerfiles repository. Images are hosted in PacBio’s Quay.io registry. Docker images used in the workflow are pinned to specific versions by referring to their digests rather than tags.
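
To illustrate the difference between a digest reference and a tag reference, the sketch below checks whether an image string ends in a sha256 digest. The image names are hypothetical and the helper is_pinned_by_digest is invented for this example; it is not taken from the workflow.

```python
import re

# A digest-pinned reference ends in "@sha256:" followed by 64 hex characters.
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_pinned_by_digest(image_ref):
    """Return True if the image reference is pinned by a sha256 digest."""
    return DIGEST_RE.search(image_ref) is not None


# Hypothetical image references: one by digest, one by tag.
by_digest = "quay.io/example/tool@sha256:" + "a" * 64
by_tag = "quay.io/example/tool:1.0"

print(is_pinned_by_digest(by_digest))
print(is_pinned_by_digest(by_tag))
```

A tag can be moved to point at a different image later, whereas a digest identifies one specific image, which is why digest references give reproducible runs.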
