Whole Genome Sequencing Variant Pipeline

Analyze human PacBio whole genome sequencing (WGS) data to produce sequence alignment and variant calls for a sample or cohort.

Workflow for analyzing human PacBio whole genome sequencing (WGS) data using Workflow Description Language (WDL). The human WGS workflow performs read alignment, variant calling, and phasing. Joint-calling of small variants and structural variants for cohorts and optional variant filtering and annotation is also available.

Human whole genome sequencing workflow diagram

Workflow Inputs

An input template file can be found here.

Templates for each of the cloud backends and the HPC with reference dataset input locations filled in can be found here:

Input Description

cohort_id

A unique name for the cohort; used to name outputs

samples

The set of samples for the cohort. At least one sample must be defined. Each sample has the following fields:

sample_id	A unique name for the sample; used to name outputs
movie_bams	The set of unaligned movie BAMs associated with this sample
sex	The sample’s sex
affected	Is this sample affected by the phenotype? [true, false]
father_id	Paternal `sample_id`
mother_id	Maternal `sample_id`

phenotypes

Human Phenotype Ontology (HPO) phenotypes associated with the cohort. If no particular phenotypes are desired, the root HPO term, HP:0000001, can be used.

reference_data

Files associated with the reference genome.

These files are hosted publicly in each of the cloud backends.

name	Reference name; used to name outputs (e.g., “GRCh38”)
fasta	Reference genome and index
chromosomes	Chromosomes to phase, typically `chr{1..22} chr{X,Y}`
chromosome_lengths	Reference chromosome lengths
tandem_repeat_bed	Tandem repeat locations used by pbsv to normalize SV representation
trgt_tandem_repeat_bed	Tandem repeat sites to be genotyped by TRGT
hificnv_exclude_bed	Compressed BED and index of regions to exclude from calling by HiFiCNV. We recommend cnv.excluded_regions.common_50.hg38.bed.gz.
hificnv_expected_bed_male	BED of expected copy number for male karyotype for HiFiCNV
hificnv_expected_bed_female	BED of expected copy number for female karyotype for HiFiCNV
gnomad_af	gnomAD v3.1 allele frequences in `slivar gnotate` format
hprc_af	Allele frequences in ~100 Human Pangenome Reference Consortium (HPRC) samples in `slivar gnotate` format
gff	Ensembl GFF3 reference annotation
population_vcfs	An array of structural variant population VCFs

slivar_data

Files associated with slivar annotation.

These files are hosted publicly in each of the cloud backends.

slivar_js	Additional javascript functions for slivar
hpo_terms	HPO annotation lookups
hpo_dag	HPO annotation lookups
hpo_annotations	HPO annotation lookups
ensembl_to_hgnc	Ensembl to HGNC gene mapping
lof_lookup	Loss-of-function scores per gene
clinvar_lookup	ClinVar annotations per gene

deepvariant_version

Version of deepvariant to use [“1.5.0”]

deepvariant_model

Optional alternate DeepVariant model file to use

pbsv_call_mem_gb

Optionally set RAM (GB) for pbsv_call during cohort analysis

glnexus_mem_gb

Optionally set RAM (GB) for GLnexus during cohort analysis

run_tertiary_analysis

Run the optional tertiary analysis steps [false]

backend

Backend where the workflow will be executed [“Azure”, “AWS”, “GCP”, “HPC”]

zones

Zones where compute will take place; required if backend is set to ‘AWS’ or ‘GCP’.

aws_spot_queue_arn

Queue ARN for the spot batch queue; required if backend is set to ‘AWS’ and preemptible is set to true

aws_on_demand_queue_arn

Queue ARN for the on demand batch queue; required if backend is set to ‘AWS’ and preemptible is set to false

container_registry

Container registry where workflow images are hosted. If left blank, PacBio’s public Quay.io registry will be used.

preemptible

If set to true, run tasks preemptibly where possible. On-demand VMs will be used only for tasks that run for >24 hours if the backend is set to GCP. If set to false, on-demand VMs will be used for every task. Ignored if backend is set to HPC.

Workflow Outputs

The set of workflow outputs will depend on which set of analyses are specified to run, determined by the number of samples in the cohort as well as whether options such as run_tertiary_analysis are set to true.

Output Description

Sample analysis

These files will be output for each sample defined in the cohort.

bam_stats	TSV of length and quality for each read (per input BAM)
read_length_summary	For each input BAM, read length distribution (per input BAM)
read_quality_summary	For each input BAM, read quality distribution (per input BAM)
small_variant_gvcfs	Small variants (SNPs and INDELs < 50bp) gVCFs called by DeepVariant (with index)
small_variant_vcf_stats	`bcftools stats` summary statistics for small variants
small_variant_roh_bed	Regions of homozygosity determiend by `bcftools roh`
sample_sv_vcfs	Structural variants called by pbsv (with index)
sample_phased_small_variant_vcfs	Small variants called by DeepVariant and phased by WhatsHap (with index)
sample_whatshap_stats_tsvs	Phase block statistics written by `whatshap stats`
sample_whatshap_stats_gtfs	Phase block GTF written by `whatshap stats`
sample_whatshap_stats_blocklists	Haplotype block list written by `whatshap stats`
merged_haplotagged_bam	Aligned (by pbmm2), haplotagged (by `whatshap haplotag`) reads (with index)
haplotagged_bam_mosdepth_summary	mosdepth summary of median depths per chromosome
haplotagged_bam_mosdepth_region_bed	mosdepthhttps://github.com/brentp/mosdepth BED of median coverage depth per 500 bp window
trgt_repeat_vcf	Tandem repeat genotypes from TRGT (with index)
trgt_spanning_reads	Fragments of HiFi reads spanning loci genotyped by TRGT (with index)
trgt_dropouts	Regions with insufficient coverage to genotype by TRGT
cpg_pileup_beds	5mCpG site methylation probability pileups generated by pb-CpG-tools
cpg_pileup_bigwigs	5mCpG site methylation probability pileups generated by pb-CpG-tools
paraphase_output	Output generated by Paraphase
paraphase_realigned_bam	Realigned BAM for selected medically relevant genes in segmental duplications (with index), generated by Paraphase
paraphase_vcfs	Phased Variant calls for selected medically relevant genes in segmental duplications, generated by Paraphase
hificnv_vcfs	VCF output containing copy number variant calls for the sample from HiFiCNV
hificnv_copynum_bedgraphs	Copy number values calculated for each region
hificnv_depth_bws	Bigwig file containing the depth measurements from HiFiCNV
hificnv_maf_bws	Bigwig file containing the minor allele frequency measurements from DeepVariant, generated by HiFiCNV

Cohort analysis

These files will be output if the cohort includes more than one sample.

cohort_sv_vcf	Structural variants joint-called by pbsv (with index)
cohort_phased_joint_called_vcf	Small variants called by DeepVariant, joint-called by GLnexus, and phased by WhatsHap (with index)
cohort_whatshap_stats_tsvs	Phase block statistics written by `whatshap stats`
cohort_whatshap_stats_gtfs	Phase block GTF written by `whatshap stats`
cohort_whatshap_stats_blocklists	Haplotype block list written by `whatshap stats`

Tertiary analysis

These files will be output for each run of the workflow if run_tertiary_analysis is set to true. The files that are being annotated will depend on whether the number of samples is equal to or greater than one:

If the number of samples is equal to one, the files being annotated in this step are the sample small variant VCF and SV VCF.
If the number of samples is greater than one, the files being annotated in this step are the phased, joint-called small variant VCF and the cohort SV VCF.

filtered_small_variant_vcf	Small variant calls that are filtered based on population frequency and annotated with cohort information, population frequency, gene, functional impact, etc., using slivar and `bcftools csq`
compound_het_small_variant_vcf	Compound heterozygotes annotated with cohort information, population frequency, gene, functional impact, etc., using slivar and `bcftools csq`
filtered_small_variant_tsv	Filtered VCFs are reformatted as a human-readable TSV by `slivar tsv`
compound_het_small_variant_tsv	Filtered VCFs are reformatted as a human-readable TSV by `slivar tsv`
filtered_svpack_vcf	Structural variant calls that are filtered based on population frequency and annotated with cohort information, population frequency, gene, functional impact, etc., using svpack
filtered_svpack_tsv	Filtered VCFs are reformatted as a human-readable TSV by `slivar tsv`

References

Reference datasets are hosted publicly for use in the pipeline. For data locations, see the backend-specific documentation and template inputs files for each backend with paths to publicly hosted reference files filled out.

Containers

Docker images definitions used by this workflow can be found in the wdl-dockerfiles repository. Images are hosted in PacBio’s quay.io. Docker images used in the workflow are pegged to specific versions by referring to their digests rather than tags.