IsoSeq Isoform Discovery

Identify transcripts in PacBio single-molecule sequencing data.

A workflow, written in Workflow Description Language (WDL), for running scalable de novo isoform discovery on PacBio HiFi data.

IsoSeq workflow diagram

Workflow Inputs

The type of run (single-cell or bulk) is determined by whether or not the barcodes_txt file is provided. If provided, the single-cell IsoSeq pipeline, including barcode correction, will run. If not provided, the bulk IsoSeq pipeline will run.
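The selection rule above can be sketched in a few lines of Python. This is only an illustration of the rule as described here, not code from the workflow itself; the function name `select_pipeline` and the plain-dictionary representation of the inputs are assumptions made for the example.

```python
from typing import Any, Mapping


def select_pipeline(inputs: Mapping[str, Any]) -> str:
    """Return "single-cell" if barcodes_txt is provided, otherwise "bulk".

    Hypothetical helper: `inputs` is assumed to be a plain dictionary
    keyed by the input names listed in this README.
    """
    # A missing key, None, or an empty string all count as "not provided".
    return "single-cell" if inputs.get("barcodes_txt") else "bulk"


if __name__ == "__main__":
    print(select_pipeline({"barcodes_txt": "barcodes.txt"}))
    print(select_pipeline({"batch_id": "my_batch"}))
```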

Input: Description

batch_id: Batch name; used for naming files
hifi_reads: Array of HiFi reads in BAM format
primers_fasta: FASTA file containing forward and reverse primer sequences. Used to demultiplex and refine reads.
reference: Reference data and associated files. See the IsoSeq docs for more information.
    name: Reference name; used to name outputs (e.g., “GRCh38”)
    fasta: Reference genome and index
    annotation_gtf: Annotation file for the reference genome in GTF format
    cage_bed: CAGE peaks in BED format
    intropolis_tsv: Intropolis data in custom format
    polyA_list: polyA motif list in custom format
adapters_fasta: Optional FASTA file containing adapter sequences, listed in the order in which the adapters are expected to appear within the reads. If this file is provided, skera will be run first to segment the reads from the sample movie BAM. Required if the movie BAMs were generated using MAS-Seq.
barcodes_txt: Optional file containing valid (whitelisted) barcode sequences. If provided, the single-cell IsoSeq pipeline will be run. Otherwise, bulk IsoSeq will run.
tag_design: Optional UMI/barcode design. If not provided and the single-cell pipeline is run, the isoseq tag default design (T-8U-10B) will be used.
backend: Backend where the workflow will be executed
zones: Zones where compute will take place; required if backend is set to ‘AWS’ or ‘GCP’.
aws_spot_queue_arn: Queue ARN for the spot batch queue; required if backend is set to ‘AWS’ and preemptible is set to true
aws_on_demand_queue_arn: Queue ARN for the on-demand batch queue; required if backend is set to ‘AWS’ and preemptible is set to false
container_registry: Container registry where workflow images are hosted. If left blank, PacBio’s public Quay.io registry will be used.
preemptible: If set to true, run tasks preemptibly where possible. If the backend is set to GCP, on-demand VMs will be used only for tasks that run for longer than 24 hours. If set to false, on-demand VMs will be used for every task. Ignored if backend is set to HPC.
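The backend-related requirements in the list above (zones for AWS or GCP, and one of the two queue ARNs for AWS depending on preemptible) can be checked before submitting a run. The following is a minimal sketch under stated assumptions: the function name `check_backend_inputs` is hypothetical, the inputs are assumed to be available as a plain dictionary keyed by the names in this README, and only the rules written above are encoded.

```python
from typing import Any, List, Mapping


def check_backend_inputs(inputs: Mapping[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the rules are met.

    Hypothetical helper that encodes only the requirements stated in
    the inputs list of this README.
    """
    problems: List[str] = []
    backend = inputs.get("backend")
    preemptible = bool(inputs.get("preemptible"))

    # zones is required for the two cloud backends named in the README.
    if backend in ("AWS", "GCP") and not inputs.get("zones"):
        problems.append("zones is required when backend is AWS or GCP")

    # On AWS, which queue ARN is required depends on preemptible.
    if backend == "AWS":
        queue_key = (
            "aws_spot_queue_arn" if preemptible else "aws_on_demand_queue_arn"
        )
        if not inputs.get(queue_key):
            problems.append(
                f"{queue_key} is required when backend is AWS and "
                f"preemptible is {str(preemptible).lower()}"
            )
    return problems


if __name__ == "__main__":
    for problem in check_backend_inputs({"backend": "AWS", "preemptible": True}):
        print(problem)
```

Returning a list of messages rather than raising on the first problem lets a caller report every missing input at once.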

Workflow Outputs

Common outputs are produced by both pipelines. Either the single-cell or the bulk outputs are produced in addition, depending on which pipeline is run.

Output: Description

Common
refine_metadata: Metadata output from the polyA and concatemer removal step
refine_summary_jsons: Summary JSON output from the polyA and concatemer removal step
refine_report_csvs: Report CSV output from the polyA and concatemer removal step
aligned_bam: Reads aligned to the reference genome in BAM format
collapse_read_stat: Read stats output from the transcript collapse step
collapse_report_json: Report JSON output from the transcript collapse step
sorted_gff: Sorted GFF output by pigeon
classification_summary_txt: Summary file output by pigeon transcript classification
classification_report_json: Report JSON output by pigeon transcript classification
classification_txt: Classification file output by pigeon transcript classification
junctions_txt: Junctions txt file containing every junction for each isoform. Follows the SQANTI3 junction file convention.
filtered_reasons_txt: Txt file output by the filter tool, containing the reasons each isoform was filtered. See the pigeon documentation for the reasons an isoform can be filtered.
filtered_report_json: Filtered report JSON file
filtered_classification_txt: Filtered classification txt file following the SQANTI3 classification file convention, with two added columns: fl_assoc and cell_barcodes.
filtered_junctions_txt: Filtered junctions txt file
filtered_gff: Filtered GFF output by pigeon filter
gene_saturation_txt: Txt file containing the read count and the number of unique genes found in a subsampled number of reads

Single-cell IsoSeq
corrected_bam: Barcode-corrected BAM
corrected_summary_json: Corrected barcode report JSON
bcstats_json: Stats for group barcodes in JSON format
bcstats_tsv: Stats for group barcodes in TSV format
seurat_tar: Files required to run tertiary analysis with Seurat

Bulk IsoSeq
clustered_bam: Clustered BAM and index
cluster_report_csv: Clustering report in CSV format
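As an illustration of how the output groups above combine, the sketch below lists the outputs expected for each pipeline and reports which of them are absent from a set of produced output names. The helper names `expected_outputs` and `missing_outputs` are hypothetical. The prefix stripping assumes the execution engine reports names in a `workflow_name.output_name` form, which may not match every engine.

```python
from typing import Iterable, List

COMMON_OUTPUTS = [
    "refine_metadata", "refine_summary_jsons", "refine_report_csvs",
    "aligned_bam", "collapse_read_stat", "collapse_report_json",
    "sorted_gff", "classification_summary_txt",
    "classification_report_json", "classification_txt", "junctions_txt",
    "filtered_reasons_txt", "filtered_report_json",
    "filtered_classification_txt", "filtered_junctions_txt",
    "filtered_gff", "gene_saturation_txt",
]
SINGLE_CELL_OUTPUTS = [
    "corrected_bam", "corrected_summary_json", "bcstats_json",
    "bcstats_tsv", "seurat_tar",
]
BULK_OUTPUTS = ["clustered_bam", "cluster_report_csv"]


def expected_outputs(single_cell: bool) -> List[str]:
    """Common outputs plus the outputs specific to the pipeline that ran."""
    specific = SINGLE_CELL_OUTPUTS if single_cell else BULK_OUTPUTS
    return COMMON_OUTPUTS + specific


def missing_outputs(produced: Iterable[str], single_cell: bool) -> List[str]:
    """Return expected output names that do not appear in `produced`."""
    # Assumption: names may carry a "workflow_name." prefix; keep the last part.
    names = {name.rsplit(".", 1)[-1] for name in produced}
    return [name for name in expected_outputs(single_cell) if name not in names]


if __name__ == "__main__":
    print(missing_outputs(["example_workflow.clustered_bam"], single_cell=False))
```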

References

Reference datasets are hosted publicly for use in the pipeline.

Containers

Docker image definitions used by this workflow can be found in the wdl-dockerfiles repository. Images are hosted in PacBio’s Quay.io registry. Docker images used in the workflow are pinned to specific versions by referring to their digests rather than tags.

The Docker image used by a particular step of the workflow can be identified by looking at the docker key in the runtime block for the given task.
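To illustrate the lookup described above, the sketch below pulls the value of each `docker` key out of WDL text and checks whether it refers to a digest rather than a tag. It is a rough, regex-based approximation rather than a WDL parser: it matches any line of the form `docker: "..."` and does not verify that the line sits inside a runtime block. The task in `EXAMPLE_WDL`, the image name, and the all-zero digest are placeholders invented for the example, not taken from this workflow.

```python
import re
from typing import List

# Matches lines such as:   docker: "registry/image@sha256:<digest>"
DOCKER_KEY = re.compile(r'^\s*docker\s*:\s*"([^"]+)"', re.MULTILINE)

# Hypothetical task; the image name and digest are placeholders.
EXAMPLE_WDL = '''
task example_task {
  command <<<
    echo "hello"
  >>>
  runtime {
    docker: "quay.io/example/tool@sha256:%s"
  }
}
''' % ("0" * 64)


def docker_images(wdl_text: str) -> List[str]:
    """Return the string value of every docker key found in the WDL text."""
    return DOCKER_KEY.findall(wdl_text)


def is_digest_pinned(image: str) -> bool:
    """True if the image reference uses a sha256 digest instead of a tag."""
    return "@sha256:" in image


if __name__ == "__main__":
    for image in docker_images(EXAMPLE_WDL):
        print(image, is_digest_pinned(image))
```

In real task definitions the registry portion of the string may be an interpolated expression; the regex returns the raw string unchanged in that case.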
