DRAGEN – Whole Genome Germline Single Sample Analysis

DRAGEN functional equivalence germline SNP and indel discovery in human whole genome sequencing data.

The Whole Genome Germline Single Sample pipeline implements data pre-processing and initial variant calling according to the GATK Best Practices for germline SNP and indel discovery in human whole genome sequencing data. When run in DRAGEN-GATK mode, the pipeline produces outputs that are functionally equivalent to those of the DRAGEN pipeline.

This workflow is maintained by the Broad Institute and is written in Workflow Description Language (WDL). Further documentation can be found here.

DRAGEN Whole Genome Single Sample workflow diagram

Workflow Inputs

The workflow requires sample and reference information. Whether the pipeline runs in DRAGEN functional equivalence mode is controlled by the dragen_functional_equivalence_mode input.

The Broad Institute provides various test inputs hosted in GCP that can be used to run the pipeline.

Input: Description

sample_and_unmapped_bams: Information and files associated with the sample.
    base_file_name: String used for output files; can be set to a read group ID.
    final_gvcf_base_name: Base name for the output GVCF file; can be set to a read group ID.
    flowcell_unmapped_bams: Human whole-genome paired-end sequencing data in unmapped BAM (uBAM) format; each uBAM file contains one or more read groups, all belonging to a single sample (SM).
    sample_name: String describing the sample; can be set to a read group ID.
    unmapped_bam_suffix: Suffix of the input uBAM files; must be consistent across files (e.g., ".unmapped.bam").

references: Data associated with the reference genome.
    contamination_sites_ud: Contamination site files for the CheckContamination task.
    contamination_sites_bed: Contamination site files for the CheckContamination task.
    contamination_sites_mu: Contamination site files for the CheckContamination task.
    calling_interval_list: Interval list used for variant calling.
    reference_fasta: Reference FASTA, index, dict, and associated BWA index files. See the struct definition for the full list of associated reference files.
    known_indels_sites_vcfs: Set of known indel site VCFs.
    known_indels_sites_indices: Set of known indel site VCF indices.
    dbsnp_vcf: dbSNP VCF file.
    dbsnp_vcf_index: dbSNP VCF file index.
    evaluation_interval_list: File containing the target set of genomic intervals.
    haplotype_database_file: File containing known haplotype major and minor alleles and frequencies.

dragmap_reference: Files used by the DRAGMAP aligner.
    reference_bin: Binary representation of the reference FASTA file, used by the DRAGMAP aligner in DRAGEN mode.
    hash_table_cfg_bin: Binary representation of the hash table configuration, used by the DRAGMAP aligner in DRAGEN mode.
    hash_table_cmp: Compressed representation of the hash table, used by the DRAGMAP aligner in DRAGEN mode.

scatter_settings: Information for variant calling scatter settings.
    haplotype_scatter_count: Scatter count used for variant calling.
    break_bands_at_multiples_of: Breaks reference bands up at genomic positions that are multiples of this number; used to reduce GVCF file size.

papi_settings: Information regarding the number of preemptions allowed.
    preemptible_tries: Number of times the workflow can be preempted.
    agg_preemptible_tries: Number of preemptible machine tries for the BamToCram task.

wgs_coverage_interval_list: Interval list for the CollectWgsMetrics tool.
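
WDL workflows are commonly given their inputs as a JSON file. As a minimal sketch, the Python below assembles part of such a file for one sample and enforces the rule that every uBAM shares the same unmapped_bam_suffix. The workflow namespace WholeGenomeGermlineSingleSample, the example bucket path, the boolean type of dragen_functional_equivalence_mode, and the helper name build_inputs are assumptions made for illustration; the remaining inputs (references, dragmap_reference, scatter_settings, papi_settings, wgs_coverage_interval_list) would be added in the same way.

```python
import json

# Assumption: inputs are namespaced by workflow name, as in Cromwell-style JSON.
WORKFLOW = "WholeGenomeGermlineSingleSample"


def build_inputs(sample_name, ubam_paths, suffix=".unmapped.bam", dragen_fe_mode=False):
    """Return a partial inputs dict for one sample (hypothetical helper)."""
    # unmapped_bam_suffix must be consistent across all uBAM files.
    mismatched = [p for p in ubam_paths if not p.endswith(suffix)]
    if mismatched:
        raise ValueError(f"uBAMs without suffix {suffix!r}: {mismatched}")

    sample = {
        "sample_name": sample_name,
        "base_file_name": sample_name,
        "final_gvcf_base_name": sample_name,
        "flowcell_unmapped_bams": list(ubam_paths),
        "unmapped_bam_suffix": suffix,
    }
    return {
        f"{WORKFLOW}.sample_and_unmapped_bams": sample,
        # Assumption: the mode switch is a boolean.
        f"{WORKFLOW}.dragen_functional_equivalence_mode": dragen_fe_mode,
    }


if __name__ == "__main__":
    # Placeholder sample name and path, not real test data.
    example = build_inputs(
        "sample1",
        ["gs://example-bucket/sample1.rg1.unmapped.bam"],
        dragen_fe_mode=True,
    )
    print(json.dumps(example, indent=2))
```

Failing early on a mismatched suffix is cheaper than discovering the problem after the workflow has been submitted.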

Workflow Outputs

The pipeline outputs variant calls, aligned reads, and various metrics files.
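
Several of the metrics files in this section (for example wgs_metrics, raw_wgs_metrics, and duplicate_metrics) come from Picard tools. A minimal sketch for reading one in Python follows. It assumes the file uses Picard's tab-delimited text layout, in which a "## METRICS CLASS" line is followed by a header row and one or more data rows ending at a blank line. parse_picard_metrics is a hypothetical helper, not part of the workflow.

```python
def parse_picard_metrics(text):
    """Return the metrics rows of a Picard-style metrics file as dicts.

    Assumes the layout: '## METRICS CLASS ...' line, then a tab-delimited
    header row, then data rows terminated by a blank line or end of file.
    Values are returned as strings; convert them as needed.
    """
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("## METRICS CLASS"):
            if i + 1 >= len(lines):
                return []
            header = lines[i + 1].split("\t")
            rows = []
            for row in lines[i + 2:]:
                if not row.strip():
                    break  # a blank line ends the metrics table
                rows.append(dict(zip(header, row.split("\t"))))
            return rows
    return []  # no metrics section found
```

Any histogram section that follows the metrics table is ignored by this sketch.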

Output: Description

UnmappedBamToAlignedBam: Quality control metrics and files output during alignment.
    quality_yield_metrics: Quality metrics calculated for the unmapped BAM files.
    unsorted_read_group_base_distribution_by_cycle_pdf: PDF of the base distribution by cycle for each unsorted, read-group-specific BAM.
    unsorted_read_group_base_distribution_by_cycle_metrics: Metrics of the base distribution by cycle for each unsorted, read-group-specific BAM.
    unsorted_read_group_insert_size_histogram_pdf: Histograms of insert size for the unsorted, read-group-specific BAMs.
    unsorted_read_group_insert_size_metrics: Insert size metrics for the unsorted, read-group-specific BAMs.
    unsorted_read_group_quality_by_cycle_pdf: Quality by cycle PDF for the unsorted, read-group-specific BAMs.
    unsorted_read_group_quality_by_cycle_metrics: Quality by cycle metrics for the unsorted, read-group-specific BAMs.
    unsorted_read_group_quality_distribution_pdf: Quality distribution PDF for the unsorted, read-group-specific BAMs.
    unsorted_read_group_quality_distribution_metrics: Quality distribution metrics for the unsorted, read-group-specific BAMs.
    cross_check_fingerprints_metrics: Fingerprint metrics file, if optional fingerprinting is performed.
    selfSM: Contamination estimate from VerifyBamID2.
    contamination: Estimated contamination from the CheckContamination task.
    duplicate_metrics: Duplicate read metrics from the MarkDuplicates tool.
    output_bqsr_reports: BQSR reports, if the BQSR tool is run.
    output_bam: Aligned, recalibrated output BAM, if provided_output_bam is true.
    output_bam_index: Optional index for the aligned, recalibrated BAM, if provided_output_bam is true.

AggregatedBamQC: Outputs from aggregating the aligned, recalibrated BAM and calculating quality control metrics.
    read_group_alignment_summary_metrics: Alignment summary metrics for the aggregated BAM.
    read_group_gc_bias_detail_metrics: GC bias detail metrics for the aggregated BAM.
    read_group_gc_bias_pdf: PDF of the GC bias by read group for the aggregated BAM.
    read_group_gc_bias_summary_metrics: GC bias summary metrics by read group for the aggregated BAM.
    calculate_read_group_checksum_md5: MD5 checksum for the aggregated BAM.
    agg_alignment_summary_metrics: Alignment summary metrics for the aggregated BAM.
    agg_bait_bias_detail_metrics: Bait bias detail metrics for the aggregated BAM.
    agg_bait_bias_summary_metrics: Bait bias summary metrics for the aggregated BAM.
    agg_gc_bias_detail_metrics: GC bias detail metrics for the aggregated BAM.
    agg_gc_bias_pdf: PDF of GC bias for the aggregated BAM.
    agg_gc_bias_summary_metrics: GC bias summary metrics for the aggregated BAM.
    agg_insert_size_histogram_pdf: Histogram of insert size for the aggregated BAM.
    agg_insert_size_metrics: Insert size metrics for the aggregated BAM.
    agg_pre_adapter_detail_metrics: Detail metrics for artifacts that occur prior to the addition of adapters, for the aggregated BAM.
    agg_pre_adapter_summary_metrics: Summary metrics for artifacts that occur prior to the addition of adapters, for the aggregated BAM.
    agg_quality_distribution_pdf: PDF of the quality distribution for the aggregated BAM.
    agg_quality_distribution_metrics: Quality distribution metrics for the aggregated BAM.
    agg_error_summary_metrics: Error summary metrics for the aggregated BAM.
    fingerprint_summary_metrics: Optional fingerprint summary metrics for the aggregated BAM.
    fingerprint_detail_metrics: Optional fingerprint detail metrics for the aggregated BAM.

CollectWgsMetrics: WGS metrics collected using stringent thresholds.
    wgs_metrics: Metrics from the CollectWgsMetrics tool.

CollectRawWgsMetrics: WGS metrics collected using less stringent thresholds.
    raw_wgs_metrics: Metrics from the CollectRawWgsMetrics tool.

BamToGvcf: HaplotypeCaller variant calling outputs.
    gvcf_summary_metrics: (g)VCF summary metrics.
    gvcf_detail_metrics: (g)VCF detail metrics.
    output_vcf: Final reblocked gVCF with variant calls produced by HaplotypeCaller (read more in the Reblocking section).
    output_vcf_index: Index for the final gVCF.

BamToCram: Files associated with converting the aggregated, recalibrated BAM to CRAM.
    output_cram: Aligned, recalibrated output CRAM.
    output_cram_index: Index for the aligned, recalibrated CRAM.
    output_cram_md5: MD5 checksum for the aligned, recalibrated CRAM.
    validate_cram_file_report: Validation report for the CRAM, created with the ValidateSam tool.
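
The output_cram_md5 file can be used to check a copied or downloaded file against its recorded checksum. A minimal sketch in Python follows. It assumes the MD5 file holds the hexadecimal digest as its first whitespace-delimited token, which is an assumption about the file layout rather than something documented here. Both function names are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5_file(data_path, md5_path):
    """Return True if data_path hashes to the digest stored in md5_path.

    Assumption: the digest is the first whitespace-delimited token in the
    MD5 file. An empty MD5 file is treated as a mismatch.
    """
    with open(md5_path) as handle:
        tokens = handle.read().split()
    if not tokens:
        return False
    return md5_of_file(data_path) == tokens[0].lower()
```

Reading in chunks keeps memory use flat, which matters for whole-genome CRAM files.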

References

Reference data hosted in GCP may be found here.

Containers

Containers used by the pipeline are hosted in the Broad Institute’s public container registry and in the public BioContainers registry on quay.io.
