Whole Genome Sequencing Variant Pipeline

Analyze human PacBio whole genome sequencing (WGS) data to produce sequence alignment and variant calls for a sample or cohort.

Analyze human PacBio whole genome sequencing (WGS) data to produce sequence alignment and variant calls for a sample or cohort.

Workflow for analyzing human PacBio whole genome sequencing (WGS) data using Workflow Description Language (WDL). The human WGS workflow performs read alignment, variant calling, and phasing. Joint-calling of small variants and structural variants for cohorts and optional variant filtering and annotation is also available.

Human whole genome sequencing workflow diagram
Human whole genome sequencing workflow diagram

Workflow Inputs

An input template file can be found here.

Templates for each of the cloud backends and the HPC with reference dataset input locations filled in can be found here:

InputDescription
cohort_id

A unique name for the cohort; used to name outputs

samples

The set of samples for the cohort. At least one sample must be defined. Each sample has the following fields:

sample_idA unique name for the sample; used to name outputs
movie_bamsThe set of unaligned movie BAMs associated with this sample
sexThe sample’s sex
affectedIs this sample affected by the phenotype? [true, false]
father_idPaternal sample_id
mother_idMaternal sample_id
phenotypes

Human Phenotype Ontology (HPO) phenotypes associated with the cohort. If no particular phenotypes are desired, the root HPO term, HP:0000001, can be used.

reference_data

Files associated with the reference genome.

These files are hosted publicly in each of the cloud backends.

nameReference name; used to name outputs (e.g., “GRCh38”)
fastaReference genome and index
chromosomesChromosomes to phase, typically chr{1..22} chr{X,Y}
chromosome_lengthsReference chromosome lengths
tandem_repeat_bedTandem repeat locations used by pbsv to normalize SV representation
trgt_tandem_repeat_bedTandem repeat sites to be genotyped by TRGT
hificnv_exclude_bedCompressed BED and index of regions to exclude from calling by HiFiCNV. We recommend cnv.excluded_regions.common_50.hg38.bed.gz.
hificnv_expected_bed_maleBED of expected copy number for male karyotype for HiFiCNV
hificnv_expected_bed_femaleBED of expected copy number for female karyotype for HiFiCNV
gnomad_afgnomAD v3.1 allele frequences in slivar gnotate format
hprc_afAllele frequences in ~100 Human Pangenome Reference Consortium (HPRC) samples in slivar gnotate format
gffEnsembl GFF3 reference annotation
population_vcfsAn array of structural variant population VCFs
slivar_data

Files associated with slivar annotation.

These files are hosted publicly in each of the cloud backends.

slivar_jsAdditional javascript functions for slivar
hpo_termsHPO annotation lookups
hpo_dagHPO annotation lookups
hpo_annotationsHPO annotation lookups
ensembl_to_hgncEnsembl to HGNC gene mapping
lof_lookupLoss-of-function scores per gene
clinvar_lookupClinVar annotations per gene
deepvariant_version

Version of deepvariant to use [“1.5.0”]

deepvariant_model

Optional alternate DeepVariant model file to use

pbsv_call_mem_gb

Optionally set RAM (GB) for pbsv_call during cohort analysis

glnexus_mem_gb

Optionally set RAM (GB) for GLnexus during cohort analysis

run_tertiary_analysis

Run the optional tertiary analysis steps [false]

backend

Backend where the workflow will be executed [“Azure”, “AWS”, “GCP”, “HPC”]

zones

Zones where compute will take place; required if backend is set to ‘AWS’ or ‘GCP’.

aws_spot_queue_arn

Queue ARN for the spot batch queue; required if backend is set to ‘AWS’ and preemptible is set to true

aws_on_demand_queue_arn

Queue ARN for the on demand batch queue; required if backend is set to ‘AWS’ and preemptible is set to false

container_registry

Container registry where workflow images are hosted. If left blank, PacBio’s public Quay.io registry will be used.

preemptible

If set to true, run tasks preemptibly where possible. On-demand VMs will be used only for tasks that run for >24 hours if the backend is set to GCP. If set to false, on-demand VMs will be used for every task. Ignored if backend is set to HPC.

Workflow Outputs

The set of workflow outputs will depend on which set of analyses are specified to run, determined by the number of samples in the cohort as well as whether options such as run_tertiary_analysis are set to true.

OutputDescription
Sample analysis

These files will be output for each sample defined in the cohort.

bam_statsTSV of length and quality for each read (per input BAM)
read_length_summaryFor each input BAM, read length distribution (per input BAM)
read_quality_summaryFor each input BAM, read quality distribution (per input BAM)
small_variant_gvcfsSmall variants (SNPs and INDELs < 50bp) gVCFs called by DeepVariant (with index)
small_variant_vcf_statsbcftools stats summary statistics for small variants
small_variant_roh_bedRegions of homozygosity determiend by bcftools roh
sample_sv_vcfsStructural variants called by pbsv (with index)
sample_phased_small_variant_vcfsSmall variants called by DeepVariant and phased by WhatsHap (with index)
sample_whatshap_stats_tsvsPhase block statistics written by whatshap stats
sample_whatshap_stats_gtfsPhase block GTF written by whatshap stats
sample_whatshap_stats_blocklistsHaplotype block list written by whatshap stats
merged_haplotagged_bamAligned (by pbmm2), haplotagged (by whatshap haplotag) reads (with index)
haplotagged_bam_mosdepth_summarymosdepth summary of median depths per chromosome
haplotagged_bam_mosdepth_region_bedmosdepthhttps://github.com/brentp/mosdepth BED of median coverage depth per 500 bp window
trgt_repeat_vcfTandem repeat genotypes from TRGT (with index)
trgt_spanning_readsFragments of HiFi reads spanning loci genotyped by TRGT (with index)
trgt_dropoutsRegions with insufficient coverage to genotype by TRGT
cpg_pileup_beds5mCpG site methylation probability pileups generated by pb-CpG-tools
cpg_pileup_bigwigs5mCpG site methylation probability pileups generated by pb-CpG-tools
paraphase_outputOutput generated by Paraphase
paraphase_realigned_bamRealigned BAM for selected medically relevant genes in segmental duplications (with index), generated by Paraphase
paraphase_vcfsPhased Variant calls for selected medically relevant genes in segmental duplications, generated by Paraphase
hificnv_vcfsVCF output containing copy number variant calls for the sample from HiFiCNV
hificnv_copynum_bedgraphsCopy number values calculated for each region
hificnv_depth_bwsBigwig file containing the depth measurements from HiFiCNV
hificnv_maf_bwsBigwig file containing the minor allele frequency measurements from DeepVariant, generated by HiFiCNV
Cohort analysis

These files will be output if the cohort includes more than one sample.

cohort_sv_vcfStructural variants joint-called by pbsv (with index)
cohort_phased_joint_called_vcfSmall variants called by DeepVariant, joint-called by GLnexus, and phased by WhatsHap (with index)
cohort_whatshap_stats_tsvsPhase block statistics written by whatshap stats
cohort_whatshap_stats_gtfsPhase block GTF written by whatshap stats
cohort_whatshap_stats_blocklistsHaplotype block list written by whatshap stats
Tertiary analysis

These files will be output for each run of the workflow if run_tertiary_analysis is set to true. The files that are being annotated will depend on whether the number of samples is equal to or greater than one:

  • If the number of samples is equal to one, the files being annotated in this step are the sample small variant VCF and SV VCF.
  • If the number of samples is greater than one, the files being annotated in this step are the phased, joint-called small variant VCF and the cohort SV VCF.
filtered_small_variant_vcfSmall variant calls that are filtered based on population frequency and annotated with cohort information, population frequency, gene, functional impact, etc., using slivar and bcftools csq
compound_het_small_variant_vcfCompound heterozygotes annotated with cohort information, population frequency, gene, functional impact, etc., using slivar and bcftools csq
filtered_small_variant_tsvFiltered VCFs are reformatted as a human-readable TSV by slivar tsv
compound_het_small_variant_tsvFiltered VCFs are reformatted as a human-readable TSV by slivar tsv
filtered_svpack_vcfStructural variant calls that are filtered based on population frequency and annotated with cohort information, population frequency, gene, functional impact, etc., using svpack
filtered_svpack_tsvFiltered VCFs are reformatted as a human-readable TSV by slivar tsv

References

Reference datasets are hosted publicly for use in the pipeline. For data locations, see the backend-specific documentation and template inputs files for each backend with paths to publicly hosted reference files filled out.

Containers

Docker images definitions used by this workflow can be found in the wdl-dockerfiles repository. Images are hosted in PacBio’s quay.io. Docker images used in the workflow are pegged to specific versions by referring to their digests rather than tags.

Top