HiFi Target Enrichment

Analyze targeted HiFi sequence datasets using PacBio read data. Call and phase small and structural variants.

A workflow, written in Workflow Description Language (WDL), for analyzing target-enriched PacBio HiFi data. At minimum, the workflow runs demultiplexing, duplicate marking, alignment to the reference, small variant calling with DeepVariant, and structural variant (SV) calling with pbsv. It also phases and haplotags each sample. Cohort analysis, QC, HS metrics, and PharmCAT steps are optional.

HiFi Target Enrichment workflow diagram

Workflow Inputs

The HifiTargetEnrichment FAQ section has details about file formats.

Input / Description
batch_id

Batch name; used for naming files

hifi_reads

HiFi reads in BAM format.

barcode_sample_map

FASTA file containing forward and reverse barcode sequences; used to demultiplex reads.

target_regions_bed

BED file specifying the coordinates of the regions of interest.

reference

Files associated with the reference genome.

name: Reference name; used to name outputs
fasta: Reference genome and index to align reads to
chromosome_lengths: File specifying the lengths of each of the reference chromosomes
tandem_repeat_bed: Tandem repeat locations in the reference genome
exons_bed: BED file specifying reference exon locations

run_cohort_analysis

Run optional cohort analysis steps

run_qc

Run optional QC steps

qc_low_coverage

Low coverage cutoff for QC [10]

probes_bed

BED file specifying the coordinates for the probes used to prepare the target capture library. The same file used for target_regions_bed may be used in place of the probes_bed if you do not have access to the probes_bed. If this file is specified, the HS metrics workflow will run.

picard_sample_size

Sample size for Picard CollectHsMetrics; the sample size used for Theoretical Het Sensitivity sampling. [1000]

picard_near_distance

Near distance cutoff for Picard CollectHsMetrics; the maximum distance between a read and the nearest probe/bait/amplicon for the read to be considered ‘near probe’ and included in the percent selected. [5000]

run_pharmcat

Run optional pharmcat and pangu_cyp2d6 steps

pharmcat_positions

VCF file and index specifying PharmCAT positions; required if run_pharmcat is set to true.

pharmcat_min_coverage

Minimum coverage cutoff used to filter the preprocessed VCF passed to pharmcat [10]

deepvariant_version

Version of deepvariant to use [1.4.0]

deepvariant_model

Optional alternate DeepVariant model file to use

backend

Backend where the workflow will be executed [“Azure”, “AWS”, “GCP”, “HPC”]

zones

Zones where compute will take place; required if backend is set to ‘AWS’ or ‘GCP’.

aws_spot_queue_arn

Queue ARN for the spot batch queue; required if backend is set to ‘AWS’ and preemptible is set to true

aws_on_demand_queue_arn

Queue ARN for the on demand batch queue; required if backend is set to ‘AWS’ and preemptible is set to false

container_registry

Container registry where workflow images are hosted. If left blank, PacBio’s public Quay.io registry will be used.

preemptible

If set to true, run tasks preemptibly where possible. On-demand VMs will be used only for tasks that run for >24 hours if the backend is set to GCP. If set to false, on-demand VMs will be used for every task. Ignored if backend is set to HPC.
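
Several of the inputs above are only required under certain conditions. As a rough illustration, the following Python sketch checks those conditions before a run is submitted. It is not part of the workflow: the function name `check_inputs` is hypothetical, the dictionary uses bare input names (a real WDL inputs JSON usually prefixes each key with the workflow name), and the zone value is a placeholder.

```python
from typing import Any, Dict, List

# Backends listed in the inputs table above.
BACKENDS = {"Azure", "AWS", "GCP", "HPC"}


def check_inputs(inputs: Dict[str, Any]) -> List[str]:
    """Return a list of problems with the conditional inputs (empty if none)."""
    problems: List[str] = []
    backend = inputs.get("backend")
    preemptible = inputs.get("preemptible")

    if backend not in BACKENDS:
        problems.append(f"backend must be one of {sorted(BACKENDS)}")

    # zones is required for the AWS and GCP backends.
    if backend in {"AWS", "GCP"} and not inputs.get("zones"):
        problems.append("zones is required when backend is AWS or GCP")

    # On AWS, the queue ARN that is required depends on preemptible.
    if backend == "AWS":
        if preemptible is True and not inputs.get("aws_spot_queue_arn"):
            problems.append("aws_spot_queue_arn is required when preemptible is true")
        if preemptible is False and not inputs.get("aws_on_demand_queue_arn"):
            problems.append("aws_on_demand_queue_arn is required when preemptible is false")

    # PharmCAT needs its positions VCF.
    if inputs.get("run_pharmcat") and not inputs.get("pharmcat_positions"):
        problems.append("pharmcat_positions is required when run_pharmcat is true")

    return problems


# Example with placeholder values: the spot queue ARN and the
# pharmcat_positions file are both missing.
example = {
    "backend": "AWS",
    "preemptible": True,
    "zones": "us-east-1a",  # placeholder value, format assumed
    "run_pharmcat": True,
}
for problem in check_inputs(example):
    print(problem)
```

The check only covers the conditions stated in the table; it does not verify that files exist or that values are well formed.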

Workflow Outputs

The set of workflow outputs will depend on which set of analyses are specified to run, determined by the set of inputs that are provided as well as whether options such as run_cohort_analysis, run_qc, and run_pharmcat are set to true.

Output / Description
demultiplex_failed_samples

A file listing samples that failed demultiplexing.

Sample analysis

Sample analysis outputs are produced for each demultiplexed sample generated from the input HiFi reads.

pbsv_vcf: Structural variants called by pbsv (with index)
sample_phased_vcfs: Phased VCFs and indices called by DeepVariant and phased by WhatsHap
haplotagged_bams: Haplotagged BAM output by WhatsHap

Cohort analysis

Cohort analysis outputs will be produced if the input run_cohort_analysis is set to true.

cohort_phased_joint_called_vcf: Phased cohort VCF called by GLnexus and phased by WhatsHap

Quality control (QC)

QC outputs will be produced if the input run_qc is set to true.

Sample
sample_readcount_beds: Per-sample BED file containing counts of intersections between the input target_regions_bed and the aligned BAM
sample_readcount_csvs: Per-sample CSV file containing counts of intersections between the input target_regions_bed and the aligned BAM
sample_coverage_fraction_csvs: Per-sample base coverage fractions in CSV format
sample_read_metrics_csvs: Per-sample read metrics in CSV format
sample_duplicate_lengths_csvs: Per-sample PCR/optical read duplicate lengths in CSV format
sample_merged_read_metrics_csvs: Per-sample merged read metrics, target read metrics, and exons per read information in CSV format
sample_mean_base_coverage_by_target_plots: Per-sample PNG plot of mean base coverage by target
sample_coverage_plots: Per-sample PNG plot of coverage per target

Batch
batch_covered_fraction_summary_csvs: Batch-level covered fraction summary for targets and exons in CSV format
batch_coverage_summary_csvs: Batch-level coverage summary for targets and exons in CSV format
batch_dropped_issue_elements_csvs: Batch-level dropped elements for targets and exons in CSV format
batch_lowcov_issue_elements_csvs: Batch-level low-coverage elements for targets and exons in CSV format
batch_gc_content_csvs: Batch-level high GC content sites for targets and exons in CSV format
batch_duplicate_lengths_csv: Batch-level PCR/optical read duplicate lengths in CSV format
batch_read_data_csv: Batch-level read metrics
batch_mean_base_coverage_plot_png: Batch-level PNG plot of mean base coverage
batch_multi_coverage_by_target_png: Batch-level PNG plot of coverage
batch_read_categories_png: Batch-level PNG plot of read categories
batch_read_length_by_sample_csv: Batch-level CSV denoting read lengths by sample

Hybrid selection (HS) metrics

HS metrics outputs will be generated if the input probes_bed is defined.

sample_hs_metrics: Picard hybrid-selection (HS) metrics
batch_consolidated_hs_metrics_tsv: Consolidated HS metrics TSV
batch_consolidated_hs_metrics_quickview_tsv: Consolidated HS metrics quickview TSV

PharmCAT

PharmCAT outputs will be produced if the input run_pharmcat is set to true.

pangu_jsons: Pangu report JSON
pangu_tsvs: Pangu TSV output; used by PharmCAT
fixed_pangu_tsvs: Pangu TSV with missing calls fixed
pharmcat_missing_pgx_vcfs: Phased VCF with missing calls converted to ref calls
pharmcat_preprocessed_filtered_vcfs: Phased VCF with low-coverage ref calls removed
pharmcat_match_jsons: PharmCAT match results in JSON format
pharmcat_phenotype_jsons: PharmCAT phenotype results in JSON format
pharmcat_report_htmls: PharmCAT report in HTML format
pharmcat_report_jsons: PharmCAT report in JSON format
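
The conditions in this section that decide which output groups appear can be summarized in a few lines of Python. This is only a sketch of the rules as described here, not code from the workflow: the function name `expected_output_groups` and the group labels are hypothetical, and the input dictionary uses bare input names.

```python
from typing import Any, Dict, List


def expected_output_groups(inputs: Dict[str, Any]) -> List[str]:
    """List the output groups that the documented rules say will be produced."""
    # Sample analysis outputs are described without any condition.
    groups = ["sample analysis"]
    if inputs.get("run_cohort_analysis"):
        groups.append("cohort analysis")
    if inputs.get("run_qc"):
        groups.append("quality control")
    # HS metrics depend on probes_bed being defined, not on a boolean flag.
    if inputs.get("probes_bed"):
        groups.append("HS metrics")
    if inputs.get("run_pharmcat"):
        groups.append("PharmCAT")
    return groups


# Example: QC enabled and a probes BED supplied (placeholder file name).
print(expected_output_groups({"run_qc": True, "probes_bed": "probes.bed"}))
```

Note that HS metrics are the one group triggered by the presence of a file input rather than by a run_* flag.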

References

Reference datasets are hosted publicly for use in the pipeline.

Containers

Docker image definitions used by this workflow can be found in the wdl-dockerfiles repository. Images are hosted in PacBio’s Quay.io registry. Docker images used in the workflow are pinned to specific versions by referring to their digests rather than tags.

The Docker image used by a particular step of the workflow can be identified by looking at the docker key in the runtime block for the given task.
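
To illustrate that lookup, the Python sketch below scans WDL text for docker keys and reports whether each image reference is pinned by digest. It is an assumption-laden example rather than a tool shipped with the workflow: the function name `docker_images` is hypothetical, the sample task and image references are made up, and the regular expression matches any line that starts with a quoted docker key without checking that it sits inside a runtime block.

```python
import re
from typing import List, Tuple

# Matches lines such as:  docker: "registry/org/image@sha256:..."
DOCKER_KEY = re.compile(r'^\s*docker\s*:\s*"([^"]+)"', re.MULTILINE)


def docker_images(wdl_text: str) -> List[Tuple[str, bool]]:
    """Return (image reference, pinned by digest?) for each docker key found."""
    return [(image, "@sha256:" in image) for image in DOCKER_KEY.findall(wdl_text)]


# Made-up WDL fragment; the image names and digest are placeholders.
SAMPLE_WDL = '''
task pinned_task {
  command <<< echo hello >>>
  runtime {
    docker: "registry.example/org/image@sha256:0123abcd"
    cpu: 2
  }
}

task tagged_task {
  command <<< echo hello >>>
  runtime {
    docker: "registry.example/org/image:1.0"
  }
}
'''

for image, pinned in docker_images(SAMPLE_WDL):
    print(image, "digest" if pinned else "tag")
```

Image references built from WDL string interpolation would still be matched as long as they are written as a single quoted string.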
