BWA Enrichment v1.0 BaseSpace App Guide (15050958 A)

BWA Enrichment App
Introduction
Running BWA Enrichment
BWA Enrichment Output
BWA Enrichment Methods
Technical Assistance
ILLUMINA PROPRIETARY
15050958 Rev. A
January 2014
This document and its contents are proprietary to Illumina, Inc. and its affiliates ("Illumina"), and are intended solely for the
contractual use of its customer in connection with the use of the product(s) described herein and for no other purpose. This
document and its contents shall not be used or distributed for any other purpose and/or otherwise communicated, disclosed,
or reproduced in any way whatsoever without the prior written consent of Illumina. Illumina does not convey any license
under its patent, trademark, copyright, or common-law rights nor similar rights of any third parties by this document.
The instructions in this document must be strictly and explicitly followed by qualified and properly trained personnel in order
to ensure the proper and safe use of the product(s) described herein. All of the contents of this document must be fully read
and understood prior to using such product(s).
FAILURE TO COMPLETELY READ AND EXPLICITLY FOLLOW ALL OF THE INSTRUCTIONS CONTAINED HEREIN
MAY RESULT IN DAMAGE TO THE PRODUCT(S), INJURY TO PERSONS, INCLUDING TO USERS OR OTHERS, AND
DAMAGE TO OTHER PROPERTY.
ILLUMINA DOES NOT ASSUME ANY LIABILITY ARISING OUT OF THE IMPROPER USE OF THE PRODUCT(S)
DESCRIBED HEREIN (INCLUDING PARTS THEREOF OR SOFTWARE) OR ANY USE OF SUCH PRODUCT(S) OUTSIDE
THE SCOPE OF THE EXPRESS WRITTEN LICENSES OR PERMISSIONS GRANTED BY ILLUMINA IN CONNECTION
WITH CUSTOMER'S ACQUISITION OF SUCH PRODUCT(S).
FOR RESEARCH USE ONLY
© 2014 Illumina, Inc. All rights reserved.
Illumina, IlluminaDx, BaseSpace, BeadArray, BeadXpress, cBot, CSPro, DASL, DesignStudio, Eco, GAIIx, Genetic
Energy, Genome Analyzer, GenomeStudio, GoldenGate, HiScan, HiSeq, Infinium, iScan, iSelect, MiSeq, MiSeqDx,
Nextera, NextSeq, NuPCR, SeqMonitor, Solexa, TruGenome, TruSeq, TruSight, Understand Your Genome, UYG,
VeraCode, the pumpkin orange color, and the Genetic Energy streaming bases design are trademarks of Illumina, Inc. in the
U.S. and/or other countries. All other names, logos, and other trademarks are the property of their respective owners.
The BWA Enrichment app analyzes DNA that has been enriched for particular target
sequences using Nextera Rapid Capture. Alignment is performed with BWA, and variant
calling is performed with GATK. Variant analysis is limited to the target regions. Statistics
reporting accumulates coverage and enrichment-specific statistics for each target, as well
as overall metrics.
The main output files generated by the BWA Enrichment workflow are .bam files
(containing the reads after alignment), .vcf files (containing variant calls such as single
nucleotide variants (SNVs) and indels), and Genome VCF (.genome.vcf) files (describing
the calls for all variant and non-variant sites in the genome). In addition, there are
summary pages and PDF reports.
See BWA Enrichment Methods on page 27 and BWA Enrichment Output on page 7 for more
information.
Figure 1 BWA Enrichment Analysis with BWA/GATK
Versions
The following module versions are used in the BWA Enrichment app:
} BWA: 0.6.1-r104-tpx-isis
} GATK: 1.6-22-g3ec78bd
} Picard: 1.79
} Samtools: 0.1.18
} Tabix: 0.2.5 (r1005)
Current Limitations
Before running the BWA Enrichment app, be aware of the following limitations:
} hg19 reference only
} Read length of at least 21 bp
} Dataset size less than 200 Gigabases
} No minimum number of reads, but use a reasonable input size to get your required
coverage
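The limitations above can be checked before launching an analysis. The following is a minimal sketch only; the function name `check_limitations` and its parameters are hypothetical and are not part of the BWA Enrichment app or BaseSpace.

```python
def check_limitations(reference, shortest_read_bp, dataset_gigabases):
    """Return a list of problems; an empty list means the input fits the limits.

    Hypothetical helper: the limits come from the list above, the
    function and parameter names are illustrative assumptions.
    """
    problems = []
    if reference != "hg19":
        problems.append("reference must be hg19")
    if shortest_read_bp < 21:
        problems.append("read length must be at least 21 bp")
    if dataset_gigabases >= 200:
        problems.append("dataset must be smaller than 200 gigabases")
    return problems


if __name__ == "__main__":
    for problem in check_limitations("hg19", 75, 12.5):
        print(problem)
```

An empty result means that none of the listed limitations is violated; it does not say whether the input is large enough to reach the coverage you need.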
1 Navigate to the project or sample that you want to analyze.
2 Click the Apps button and select BWA Enrichment from the dropdown list.
3 Read the End User License Agreement and permissions, and click Accept if you
agree.
4 Fill out the required fields in the BWA Enrichment input form:
a Analysis Name: Provide the analysis name. Default name is the app name with
the date and time the analysis was started.
b Save Results To: Select the project that stores the app results.
c Sample(s): Browse to the sample you want to analyze, and select the check box.
You can analyze multiple samples.
d Reference Genome: Select the reference genome. Currently, you can only use
hg19.
e Targeted Regions: Select the targeted region of your enrichment.
f Base Padding: Select the padding you want. Padding defines the amount of
sequence immediately upstream and downstream of the targeted regions that is
also used in enrichment analysis.
g Annotation: Choose which gene and transcript annotation reference database to
use.
5 If desired, fill out the advanced fields in the BWA Enrichment input form:
a Depth Threshold: The GATK variant caller filters variants if the coverage depth
at that location is less than the specified threshold. Decreasing this value will
increase variant calling sensitivity, but raise the risk of false positives. The
variant caller reports the variants, but filters them in the VCF files by adding a
LowDP flag. Default value for GATK: 5.
b Trim Nextera Rapid Capture Adapters: If selected, Nextera Rapid Capture
adapters will be trimmed. Use this setting only if not already applied as a
sample sheet setting.
c Flag PCR Duplicates: If selected, PCR duplicates are flagged in the BAM files
and not used for variant calling. PCR duplicates are defined as two clusters from
a paired-end run where both clusters have the exact same alignment positions
for each read. Optical duplicates are already filtered out during RTA processing.
d Generate Picard HS Metrics: If selected, Picard HS metrics are generated. See
Picard Metrics on page 27 for more information.
Figure 2 BWA Enrichment Input Form
6 Click Launch.
The BWA Enrichment app will now analyze your sample. When completed, the status of
the app session is automatically updated, and you receive an email.
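The effect of the Depth Threshold setting in step 5 can be illustrated with a short sketch. It assumes that a variant whose depth is below the threshold keeps its record and has LowDP added to its FILTER value, as described above; the function name `apply_depth_filter` is hypothetical and this is not the GATK implementation.

```python
def apply_depth_filter(filter_field, depth, threshold=5):
    """Return the FILTER value after applying a LowDP depth check.

    Assumption: 'PASS' or '.' is replaced by 'LowDP', and existing
    filter codes are kept with 'LowDP' appended after a semicolon.
    """
    if depth >= threshold:
        return filter_field
    if filter_field in ("PASS", ".", ""):
        return "LowDP"
    codes = filter_field.split(";")
    if "LowDP" not in codes:
        codes.append("LowDP")
    return ";".join(codes)


if __name__ == "__main__":
    print(apply_depth_filter("PASS", depth=4))  # below the default threshold of 5
    print(apply_depth_filter("PASS", depth=5))  # meets the threshold, unchanged
```

Lowering `threshold` lets more low-depth variants pass unflagged, which matches the sensitivity and false-positive trade-off described for this setting.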
This chapter describes the output that is produced by the BWA Enrichment app. To go to
the results, click the Projects button, then the project, then the analysis.
Figure 3 BWA Enrichment Output Navigation Bar
When the analysis is completed, you can access your output through the left navigation
bar, which provides the following:
} Analysis Info: an overview of the app session settings. See Analysis Info on page 14
for a description.
} Output Files: access to the output files, organized by sample and app session.
See BWA Enrichment Output Files on page 16 for descriptions.
} Inputs: an overview of the input settings. See Inputs Overview on page 16.
} Aggregate Summary: access to analysis metrics for the aggregate results. The
Aggregate Summary is only displayed if multiple samples are analyzed. See
Aggregate Summary Page on page 7.
} Sample Pages: access to analysis reports for each sample. See Sample Summary Page
on page 9 for a description.
Aggregate Summary Page
The BWA Enrichment app provides an overview of metrics for all samples combined on
the Aggregate Summary page. You can view the histograms with metrics graphed by
sample, or view the numerical data by clicking the icon shown on the page.
NOTE
PCR duplicate reads are not removed from statistics. Results are not directly comparable
to Picard HsMetrics.
A brief description of the page is provided below.
Alignment Summary
The mean region coverage depth is plotted against sample, and a table is available with
additional metrics. The metrics displayed are explained here:
Statistic | Definition
Mean coverage | The total number of aligned bases in the targeted regions divided by the targeted region size.
Target coverage at 1X | Percentage of targets with coverage greater than 1X.
Target coverage at 10X | Percentage of targets with coverage greater than 10X.
Target coverage at 20X | Percentage of targets with coverage greater than 20X.
Target coverage at 50X | Percentage of targets with coverage greater than 50X.
Enrichment Statistics
Aligned bases are plotted against sample, and a table is available with additional
metrics. The metrics displayed are explained here:
} Read Level
Statistic | Definition
Total aligned reads | The total number of reads passing filter that aligned.
Percent aligned reads | Percentage of reads passing filter that aligned.
Target aligned reads | Number of reads that aligned to the target.
Read enrichment | 100*(Target aligned reads/Total aligned reads).
Padded target aligned reads | Number of reads that aligned to the padded target.
Padded read enrichment | 100*(Padded target aligned reads/Total aligned reads).
} Base Level
Statistic | Definition
Total aligned bases | Total aligned bases.
Target aligned bases | Total aligned bases in the target region.
Base enrichment | 100*(Total Aligned Bases in Targeted Regions/Total Aligned Bases).
Padded target aligned bases | Total aligned bases in the padded target region.
Padded base enrichment | 100*(Total Aligned Bases in Padded Targeted Regions/Total Aligned Bases).
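The read-level and base-level enrichment formulas in the tables above share one form, which the following sketch works through. The function names and the example counts are illustrative assumptions, not values or code from the app.

```python
def percent(part, whole):
    """Return 100*part/whole, or 0.0 when nothing aligned."""
    return 100.0 * part / whole if whole else 0.0


def enrichment_metrics(total_aligned, target_aligned, padded_target_aligned):
    """Compute enrichment and padded enrichment from aligned counts.

    The counts can be reads (read enrichment) or bases (base enrichment);
    the formula is the same in both cases.
    """
    return {
        "enrichment": percent(target_aligned, total_aligned),
        "padded_enrichment": percent(padded_target_aligned, total_aligned),
    }


if __name__ == "__main__":
    # Made-up counts: 1000 aligned reads, 600 on target, 750 on padded target.
    print(enrichment_metrics(1000, 600, 750))
```

Because the padded target contains the target, padded enrichment is never lower than enrichment for the same sample.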
Variant Statistics
} SNVs
The number of SNVs passing the quality filters is plotted against sample, and a table is available with
additional metrics. The metrics displayed are explained here:
Statistic | Definition
Total number | Total number of single nucleotide variants present in the dataset that pass the quality filters.
Het/Hom Ratio | Number of heterozygous SNVs/Number of homozygous SNVs.
Transitions / Transversion | Transition rate of SNVs that pass the quality filters divided by transversion rate of SNVs that pass the quality filters. Transitions are interchanges of purines (A, G) or of pyrimidines (C, T). Transversions are interchanges of purine and pyrimidine bases (for example, A to T).
Percent found in dbSNP | 100*(Number of SNVs in dbSNP/Number of SNVs).
} Indels
The number of indels passing the quality filters is plotted against sample, and a table is available with
additional metrics. The metrics displayed are explained here:
Statistic | Definition
Total number | Total number of indels present in the dataset that pass the quality filters.
Het/Hom Ratio | Number of heterozygous indels/Number of homozygous indels.
Percent found in dbSNP | 100*(Number of Indels in dbSNP/Number of Indels).
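The variant statistics defined in this section can be sketched for SNVs as follows. The input layout (tuples of reference base, alternate base, zygosity, and dbSNP membership) is an assumption made for illustration, and the transition/transversion value is computed as a ratio of counts over the same set of passing SNVs.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for purine-to-purine or pyrimidine-to-pyrimidine changes."""
    pair = {ref, alt}
    return ref != alt and (pair <= PURINES or pair <= PYRIMIDINES)


def ratio(numerator, denominator):
    """Return numerator/denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def snv_summary(snvs):
    """Summarize passing SNVs given as (ref, alt, zygosity, in_dbsnp) tuples.

    zygosity is 'het' or 'hom'; this tuple layout is hypothetical.
    """
    transitions = sum(is_transition(ref, alt) for ref, alt, _, _ in snvs)
    transversions = len(snvs) - transitions
    het = sum(1 for _, _, zyg, _ in snvs if zyg == "het")
    hom = sum(1 for _, _, zyg, _ in snvs if zyg == "hom")
    in_dbsnp = sum(1 for _, _, _, known in snvs if known)
    return {
        "total": len(snvs),
        "het_hom_ratio": ratio(het, hom),
        "ts_tv_ratio": ratio(transitions, transversions),
        "percent_in_dbsnp": ratio(100.0 * in_dbsnp, len(snvs)),
    }


if __name__ == "__main__":
    example = [
        ("A", "G", "het", True),   # transition
        ("C", "T", "het", True),   # transition
        ("A", "T", "hom", False),  # transversion
        ("G", "A", "het", True),   # transition
    ]
    print(snv_summary(example))
```

A ratio is returned as None when its denominator is zero (for example, no homozygous calls), so that an undefined value is not mistaken for zero.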
Sample Summary Page
The BWA Enrichment app provides an overview of statistics per sample on the sample
pages. You can also download the Enrichment Summary Report as a PDF (see also
Enrichment Summary Reports by Sample on page 11). A brief description of the Sample
Summary Page metrics is listed below.
NOTE
PCR duplicate reads are not removed from statistics. Results are not directly comparable
to Picard HsMetrics.
Sample Information
Statistic | Definition
Total PF Reads | The number of reads passing filter for the sample.
Percent Q30 | The percentage of bases with a quality score of 30 or higher.
Percent Duplicate Paired Reads | Percentage of paired reads that have duplicates.
Fragment Length Median | Median length of the sequenced fragment. The fragment length is calculated based on the locations at which a read pair aligns to the reference. The read mapping information is parsed from the BAM files.
Fragment Length Standard Deviation | Standard deviation of the sequenced fragment length.
Enrichment Summary
Statistic | Definition
Total length of targeted reference | Total length of sequenced bases in the target region.
Padding size | The length of sequence immediately upstream and downstream of the enrichment targets that is included for a padded target.
} Read Level Enrichment
Statistic | Definition
Total aligned reads | The total number of reads passing filter that aligned.
Percent aligned reads | Percentage of reads passing filter that aligned.
Target aligned reads | Number of reads that aligned to the target.
Read enrichment | 100*(Target aligned reads/Total aligned reads).
Padded target aligned reads | Number of reads that aligned to the padded target.
Padded read enrichment | 100*(Padded target aligned reads/Total aligned reads).
} Base Level Enrichment
Statistic | Definition
Total aligned bases | Total aligned bases.
Target aligned bases | Total aligned bases in the target region.
Base enrichment | 100*(Total Aligned Bases in Targeted Regions/Total Aligned Bases).
Padded target aligned bases | Total aligned bases in the padded target region.
Padded base enrichment | 100*(Total Aligned Bases in Padded Targeted Regions/Total Aligned Bases).
Small Variants Summary
This table provides metrics about the number of SNVs, deletions, and insertions.
Statistic | Definition
Total number | Total number of variants present in the dataset that pass the quality filters.
Het/Hom Ratio | Number of heterozygous variants/Number of homozygous variants.
Transitions / Transversion | Transition rate of SNVs that pass the quality filters divided by transversion rate of SNVs that pass the quality filters. Transitions are interchanges of purines (A, G) or of pyrimidines (C, T). Transversions are interchanges of purine and pyrimidine bases (for example, A to T).
Coverage Summary
Statistic | Definition
Mean coverage | The total number of aligned bases in the targeted regions divided by the targeted region size.
Uniformity of coverage (Pct > 0.2*mean) | The percentage of targeted base positions in which the read depth is greater than 0.2 times the mean region target coverage depth.
Target coverage at 1X | Percentage of targets with coverage greater than 1X.
Target coverage at 10X | Percentage of targets with coverage greater than 10X.
Target coverage at 20X | Percentage of targets with coverage greater than 20X.
Target coverage at 50X | Percentage of targets with coverage greater than 50X.
Enrichment Summary Reports by Sample
The BWA Enrichment app provides an enrichment statistics PDF report for each sample. A
brief description of the report is below.
NOTE
PCR duplicate reads are not removed from statistics. Results are not directly comparable
to Picard HsMetrics.
Sample Information
Provides the sample and run folder in the report.
Enrichment Summary
Statistic | Definition
Total length of targeted reference | Total length of sequenced bases in the target region.
Padding size | The length of sequence immediately upstream and downstream of the enrichment targets that is included for a padded target.
Read Level Enrichment
Statistic | Definition
Total aligned reads | The total number of reads passing filter that aligned.
Targeted aligned reads | Number of reads that aligned to the target.
Read enrichment | 100*(Target aligned reads/Total aligned reads).
Padded target aligned reads | Number of reads that aligned to the padded target.
Padded read enrichment | 100*(Padded target aligned reads/Total aligned reads).
Base Level Enrichment
Statistic | Definition
Total aligned bases | Total aligned bases.
Targeted aligned bases | Total aligned bases in the target region.
Base enrichment (not padded) | 100*(Total Aligned Bases in Targeted Regions/Total Aligned Bases).
Padded target aligned bases | Total aligned bases in the padded target region.
Padded base enrichment | 100*(Total Aligned Bases in Padded Targeted Regions/Total Aligned Bases).
SNV Summary
Statistic | Definition
SNVs | Total number of single nucleotide variants present in the dataset that pass the quality filters.
SNVs (Percent found in dbSNP) | 100*(Number of SNVs in dbSNP/Number of SNVs).
SNV Ts/Tv Ratio | Transition rate of SNVs that pass the quality filters divided by transversion rate of SNVs that pass the quality filters. Transitions are interchanges of purines (A, G) or of pyrimidines (C, T). Transversions are interchanges of purine and pyrimidine bases (for example, A to T).
SNV Het/Hom Ratio | Number of heterozygous SNVs/Number of homozygous SNVs.
Indel Summary
Statistic | Definition
Indels | Total number of indels present in the dataset that pass the quality filters.
Indels (Percent found in dbSNP) | 100*(Number of Indels in dbSNP/Number of Indels).
Indel Het/Hom ratio | Number of heterozygous indels/Number of homozygous indels.
Coverage Summary
Statistic | Definition
Mean region coverage depth | The total number of aligned bases in the targeted regions divided by the targeted region size.
Uniformity of coverage (Pct > 0.2*mean) | The percentage of targeted base positions in which the read depth is greater than 0.2 times the mean region target coverage depth.
Target coverage at 1X | Percentage of targets with coverage greater than 1X.
Target coverage at 10X | Percentage of targets with coverage greater than 10X.
Target coverage at 20X | Percentage of targets with coverage greater than 20X.
Target coverage at 50X | Percentage of targets with coverage greater than 50X.
In addition, the app provides two graphs:
} A Mean Coverage by Targeted Region graph that plots the mean coverage by the
targeted region.
} A Targeted Regions Depth of Coverage graph that plots the number of targeted
sequences by the depth of coverage.
Statistic | Definition
Depth of Sequencing Coverage | The coverage depth of a position in the genome refers to the number of sequenced bases that align to that position.
Number of Targeted Bases Covered at Depth | Number of targeted bases that have at least the indicated depth of coverage.
Total Targeted Bases Covered | Total bases aligning to the target regions that have at least the indicated depth of coverage.
Target Coverage | Percent of targeted bases that reach the indicated depth of coverage.
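The coverage metrics in this Coverage Summary can be sketched from a list of per-base depths over the targeted positions. This is an illustration under stated assumptions: the function name is hypothetical, coverage "at NX" is computed per targeted base with depth of at least N, and the app's exact treatment of the boundary value is not specified here.

```python
def coverage_summary(depths, thresholds=(1, 10, 20, 50)):
    """Compute mean coverage, uniformity, and percent of bases at each depth.

    depths: read depth at every targeted base position (must be non-empty).
    """
    if not depths:
        raise ValueError("depths must contain at least one targeted position")
    count = len(depths)
    # Aligned bases in the target divided by the targeted region size.
    mean = sum(depths) / count
    # Percent of targeted positions with depth greater than 0.2 * mean.
    uniformity = 100.0 * sum(1 for d in depths if d > 0.2 * mean) / count
    # Assumption: "at NX" means depth of at least N.
    at_depth = {
        t: 100.0 * sum(1 for d in depths if d >= t) / count for t in thresholds
    }
    return {"mean": mean, "uniformity": uniformity, "at_depth": at_depth}


if __name__ == "__main__":
    print(coverage_summary([0, 10, 20, 50]))
```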
Fragment Length Summary
Statistic | Definition
Fragment length median | Median length of the sequenced fragment. The fragment length is calculated based on the locations at which a read pair aligns to the reference. The read mapping information is parsed from the BAM files.
Minimum | Minimum length of the sequenced fragment.
Maximum | Maximum length of the sequenced fragment.
Standard Deviation | Standard deviation of the sequenced fragment length.
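The fragment length statistics above can be sketched as follows. The coordinate convention (0-based start, exclusive end for each read) and the use of the population standard deviation are assumptions for illustration; the guide does not state which conventions the app uses.

```python
import statistics


def fragment_length(read1_start, read1_end, read2_start, read2_end):
    """Length of the reference span covered by an aligned read pair.

    Assumes 0-based start and exclusive end coordinates on one chromosome.
    """
    return max(read1_end, read2_end) - min(read1_start, read2_start)


def fragment_length_summary(lengths):
    """Median, minimum, maximum, and standard deviation of fragment lengths."""
    return {
        "median": statistics.median(lengths),
        "minimum": min(lengths),
        "maximum": max(lengths),
        # Assumption: population standard deviation.
        "standard_deviation": statistics.pstdev(lengths),
    }


if __name__ == "__main__":
    pairs = [(100, 200, 250, 350), (500, 600, 600, 700), (900, 1000, 1100, 1200)]
    lengths = [fragment_length(*pair) for pair in pairs]
    print(fragment_length_summary(lengths))
```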
Gaps Summary
The app also provides a Targeted Regions Gap Length Distribution graph that plots the
number of gaps on a log scale by the length of the gap in bases.
Duplicates Information
Statistic | Definition
Percent duplicate paired reads | Percentage of paired reads that have duplicates.
Analysis Info
This app provides an overview of the analysis on the Analysis Info page. A brief
description of the metrics is below.
Row | Definition
Name | Name of the app session.
Application | App that generated this analysis.
Date started | Date the app session started.
Date completed | Date the app session completed.
Duration | Duration of analysis.
Status | Status of the app session.
Log Files
Clicking the Log Files link at the bottom of the Analysis Info page provides access to the
BWA Enrichment app log files.
The following files log information to help follow data processing and debugging:
} WorkflowLog.txt: Workflow standard output (contains details about workflow steps,
command line calls with parameters, timing, and progress).
} WorkflowError.txt: Workflow standard error output (contains error messages
created while running the workflow).
} Logging.zip: Contains all detailed workflow log files for each step of the workflow
(content from the Isis Logging folder).
} IlluminaAppsService.log.copy: Wrapper log file containing information about
communication (GET and POST requests) between BaseSpace and AWS.
The following files contain additional information in case components such as mono do not
work as expected:
} monoErr.txt: Wrapper mono call error output (contains anything that is not caught
by WorkflowError.txt; in most cases it is empty except for one line).
} monoOut.txt: Wrapper mono call standard output (contains the command calling the
workflow and anything that is not caught by WorkflowLog.txt).
NOTE
For an explanation of mono, see www.mono-project.com.
BWA Enrichment Status
For single samples, the status of the BWA Enrichment app session can have the
following values:
1 Preparing Run Data
2 Finished Preparing Run Data
3 Analysis Started
4 Alignment for Sample {SampleName}
5 Variant analysis for Sample {SampleName}
6 Statistics evaluation for Sample {SampleName}
7 Report generation for Sample {SampleName}
8 Analysis Completed for Sample {SampleName}
9 Finalizing Analysis Results for Sample {SampleName}
10 Finished Finalizing Analysis Results
For multiple samples, the status of the BWA Enrichment app session can have the
following values:
1 Preparing Run Data
2 Finished Preparing Run Data
3 Analysis Started
Begin loop over samples:
4 Analysis Started for Sample {currentSampleIndex} of {totalSampleCount}
5 Alignment for Sample {currentSampleIndex} of {totalSampleCount}
6 Variant analysis for Sample {currentSampleIndex} of {totalSampleCount}
7 Statistics evaluation for Sample {currentSampleIndex} of {totalSampleCount}
8 Report generation for Sample {currentSampleIndex} of {totalSampleCount}
9 Analysis Completed for Sample {currentSampleIndex} of {totalSampleCount}
10 Finalizing Analysis Results for Sample {currentSampleIndex} of {totalSampleCount}
When all samples are done:
11 Aggregate Analysis Started
12 Finalizing PDF Reports
13 Finalizing Aggregate Results
14 Finished Finalizing Analysis Results
Inputs Overview
The BWA Enrichment app provides an overview of the input samples and settings that
were specified when setting up the BWA Enrichment run.
BWA Enrichment Output Files
The Output Files page provides access to the output files, organized by sample and app
session, and an aggregate Summary section.
Figure 4 BWA Enrichment Output Files Navigation
See the following pages for descriptions:
} BAM Files on page 16
} VCF Files on page 17
} gVCF Files on page 17
} Enrichment_summary.csv on page 20
} Aggregate Summary Report on page 23
BAM Files
The Sequence Alignment/Map (SAM) format is a generic alignment format for storing
read alignments against reference sequences, supporting short and long reads (up to 128
Mb) produced by different sequencing platforms. SAM is a human-readable text format.
The Binary Alignment/Map (BAM) format keeps exactly the same information as
SAM, but in a compressed, binary format that is only machine-readable.
If you use an app in BaseSpace that uses BAM files as input, the app will locate the file
when launched. If using BAM files in other tools, download the file to use it in the
external tool.
Go to http://samtools.sourceforge.net/SAM1.pdf to see the exact SAM specification.
VCF Files
VCF is a text file format that contains information about variants found at specific
positions in a reference genome. The file format consists of meta-information lines, a
header line, and then data lines. Each data line contains information about a single
variant.
If you use an app in BaseSpace that uses VCF files as input, the app will locate the file
when launched. If using VCF files in other tools, download the file to use it in the
external tool.
A detailed description of the VCF format is provided in the BaseSpace User Guide.
gVCF Files
This application also produces the genome Variant Call Format file (gVCF). gVCF was
developed to store sequencing information for both variant and non-variant positions,
which is required for human clinical applications. gVCF is a set of conventions applied
to the standard variant call format (VCF) 4.1 as documented by the 1000 Genomes
Project. These conventions allow representation of genotype, annotation, and other
information across all sites in the genome in a compact format. Typical human whole
genome sequencing results expressed in gVCF with annotation are less than 1 Gbyte, or
about 1/100 the size of the BAM file used for variant calling. If you are performing
targeted sequencing, gVCF is also an appropriate choice to represent and compress the
results.
gVCF is a text file format, stored as a gzip compressed file (*.genome.vcf.gz).
Compression is further achieved by joining contiguous non-variant regions with similar
properties into single ‘block’ VCF records. To maximize the utility of gVCF, especially for
high stringency applications, the properties of the compressed blocks are conservative;
thus, block properties like depth and genotype quality reflect the minimum of any site in
the block. The gVCF file can be indexed (creating a .tbi file) and used with existing VCF
tools such as tabix and IGV, making it convenient both for direct interpretation and as a
starting point for tertiary analysis.
For more information, see https://sites.google.com/site/gvcftools/home/about-gvcf.
The following conventions are used in the variant caller gVCF files.
Samples per File
There is only one sample per gVCF file.
Non-Variant Blocks Using END Key
Contiguous non-variant segments of the genome can be represented as single records in
gVCF. These records use the standard 'END' INFO key to indicate the extent of the
record. Even though the record can span multiple bases, only the first base is provided
in the REF field to reduce file size.
The following is a simplified segment of a gVCF file, describing a segment of non-variant
calls (starting with an A) on chromosome 1 from position 51845 to 51862.
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA19238
chr1 51845 . A . . PASS END=51862
Any fields provided for a block of sites, such as read depth (using the DP key), will show
the minimum value observed among all sites encompassed by the block. Each sample
value shown for the block, such as the depth (using the DP key), is restricted to a range
where the maximum value is within 30% or 3 of the minimum, i.e. for sample value
range [x,y], y <= x+max(3,x*0.3). This range restriction applies to each of the sample
values printed out in the final block record.
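The range restriction y <= x+max(3,x*0.3) can be illustrated with a small sketch that groups consecutive non-variant sites into blocks based on a single sample value (depth). The greedy left-to-right grouping and the function name are assumptions for illustration; the actual variant caller also requires sites in a block to share the same filter value, which is not modeled here.

```python
def compress_depths(start, depths):
    """Group consecutive site depths into (first_pos, last_pos, min_depth) blocks.

    A site joins the current block only if, after adding it, the block's
    maximum y and minimum x still satisfy y <= x + max(3, x * 0.3).
    Greedy grouping is an assumption, not the documented algorithm.
    """
    if not depths:
        return []
    blocks = []
    block_start = start
    lo = hi = depths[0]
    for offset, depth in enumerate(depths[1:], start=1):
        new_lo, new_hi = min(lo, depth), max(hi, depth)
        if new_hi <= new_lo + max(3, new_lo * 0.3):
            lo, hi = new_lo, new_hi
        else:
            # Close the block; its reported value is the minimum depth seen.
            blocks.append((block_start, start + offset - 1, lo))
            block_start = start + offset
            lo = hi = depth
    blocks.append((block_start, start + len(depths) - 1, lo))
    return blocks


if __name__ == "__main__":
    # Six made-up site depths starting at position 51845.
    print(compress_depths(51845, [10, 11, 12, 13, 20, 21]))
```

In this example the jump from depth 13 to depth 20 exceeds the allowed range for a block whose minimum is 10, so a second block begins at that site.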
Indel Regions
Note that sites which are "filled in" inside of deletions have additional changes:
All deletions:
} Sites inside of any deletion are marked with the deletion's filters, in addition to any
filters which have already been applied to the site.
} Sites inside of deletions cannot have a genotype or alternate allele quality score
higher than the corresponding value from the enclosing indel.
Heterozygous deletions:
} Sites inside of heterozygous deletions are altered to have haploid genotype entries
(e.g. "0" instead of "0/0", "1" instead of "1/1").
} Heterozygous SNV calls inside of heterozygous deletions are marked with the
"SiteConflict" filter and their genotype is unchanged.
Homozygous deletions:
} Homozygous reference and no-call sites inside of homozygous deletions have
genotype "."
} Sites inside of homozygous deletions which have a non-reference genotype are
marked with a “SiteConflict” filter, and their genotype is unchanged.
} Site and genotype quality are set to "."
The above modifications reflect the notion that the site confidence is bound by the
enclosing indel confidence.
Also note that, on occasion, the variant caller produces multiple overlapping indel
calls that cannot be resolved into two haplotypes. If this occurs, all indels and sites in
the region of the overlap are marked with the “IndelConflict” filter (see below).
Genotype Quality for Variant and Non-variant Sites
The gVCF file uses an adapted version of genotype quality for variant and non-variant
site filtration. This value is associated with the key GQX. The GQX value is intended to
represent the minimum of {Phred genotype quality assuming the site is variant, Phred
genotype quality assuming the site is non-variant}. The reason for using this is to allow a
single value to be used as the primary quality filter for both variant and non-variant
sites. Filtering on this value corresponds to a conservative assumption appropriate for
applications where reference genotype calls must be determined at the same stringency
as variant genotypes, i.e.:
} An assertion that a site is homozygous reference at GQX >= 30 is made assuming the
site is variant.
} An assertion that a site is a non-reference genotype at GQX >= 30 is made assuming
the site is non-variant.
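The GQX definition above reduces to taking a minimum and comparing it with a cutoff. The sketch below assumes the two genotype qualities are already available as Phred-scaled numbers; the function names are hypothetical, and the cutoff of 30 is the value used in the examples above.

```python
def gqx(gq_assuming_variant, gq_assuming_nonvariant):
    """GQX: the smaller of the two Phred-scaled genotype qualities."""
    return min(gq_assuming_variant, gq_assuming_nonvariant)


def passes_gqx(gqx_value, cutoff=30):
    """True when GQX is present and at least the cutoff."""
    return gqx_value is not None and gqx_value >= cutoff


if __name__ == "__main__":
    value = gqx(45, 28)
    print(value, passes_gqx(value))
```

Taking the minimum means one cutoff applies equally to variant and non-variant sites, which is the conservative behavior described above.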
Section Descriptions
The gVCF file contains the following sections:
} Meta-information lines start with ## and contain meta-data, config information, and
define the values that the INFO, FILTER, and FORMAT fields can have.
} The header line starts with # and names the fields that the data lines use. These are
#CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, followed by one or
more sample columns.
} Data lines contain information about one or more positions in the genome.
Note that if you extract the variant lines from a gVCF file, you produce a conventional
variant VCF file.
Field Descriptions
The fixed fields #CHROM, POS, ID, REF, ALT, QUAL are defined in the VCF 4.1
standard provided by the 1000 Genomes Project, while the fields ID, INFO, FORMAT,
and sample are described in the meta-information. Descriptions are provided below.
} CHROM: Chromosome: an identifier from the reference genome or an angle-bracketed
ID String ("<ID>") pointing to a contig.
} POS: Position: The reference position, with the 1st base having position 1. Positions
are sorted numerically, in increasing order, within each reference sequence CHROM.
There can be multiple records with the same POS. Telomeres are indicated by using
positions 0 or N+1, where N is the length of the corresponding chromosome or
contig.
} ID: Semi-colon separated list of unique identifiers where available. If this is a dbSNP
variant it is encouraged to use the rs number(s). No identifier should be present in
more than one data record. If there is no identifier available, then the missing value
should be used.
} REF: Reference base(s): A,C,G,T,N; there can be multiple bases. The value in the POS
field refers to the position of the first base in the string. For simple insertions and
deletions in which either the REF or one of the ALT alleles would otherwise be
null/empty, the REF and ALT strings include the base before the event (which is
reflected in the POS field), unless the event occurs at position 1 on the contig in
which case they include the base after the event. If any of the ALT alleles is a
symbolic allele (an angle-bracketed ID String "<ID>") then the padding base is
required and POS denotes the coordinate of the base preceding the polymorphism.
} ALT: Comma separated list of alternate non-reference alleles called on at least one of
the samples. Options are:
• Base strings made up of the bases A,C,G,T,N
• Angle-bracketed ID String ("<ID>")
• Breakend replacement string as described in the section on breakends.
If there are no alternative alleles, then the missing value should be used.
} QUAL: Phred-scaled quality score for the assertion made in ALT, i.e. -10log_10 prob
(call in ALT is wrong). If ALT is "." (no variant), then this is -10log_10 p(variant),
and if ALT is not ".", this is -10log_10 p(no variant). High QUAL scores indicate
high confidence calls. Although traditionally people use integer phred scores, this
field is permitted to be a floating point to enable higher resolution for low confidence
calls if desired. If unknown, the missing value should be specified. (Numeric)
} FILTER: PASS if this position has passed all filters, i.e. a call is made at this
position. Otherwise, if the site has not passed all filters, a semicolon-separated list of
codes for filters that fail. gVCF files use the following values:
• PASS: position has passed all filters.
• IndelConflict: Locus is in region with conflicting indel calls.
• SiteConflict: Site genotype conflicts with proximal indel call. This is typically a
heterozygous SNV call made inside of a heterozygous deletion.
• LowGQX: Locus GQX (minimum of {Genotype quality assuming variant
position,Genotype quality assuming non-variant position}) is less than 30 or not
present.
• HighDPFRatio: The fraction of basecalls filtered out at a site is greater than 0.3.
• HighSNVSB: SNV strand bias value (SNVSB) exceeds 10. High strand bias
indicates a potential high false-positive rate for SNVs.
• HighSNVHPOL: SNV contextual homopolymer length (SNVHPOL) exceeds 6.
• HighREFREP: Indel contains an allele which occurs in a homopolymer or
dinucleotide track with a reference repeat greater than 8.
• HighDepth: Locus depth is greater than 3x the mean chromosome depth.
} INFO: Additional information. INFO fields are encoded as a semicolon-separated
series of short keys with optional values in the format: <key>=<data>[,data]. gVCF
files use the following values:
• END: End position of the region described in this record.
• BLOCKAVG_min30p3a: Non-variant site block. All sites in a block are
constrained to be non-variant, have the same filter value, and have all sample
values in range [x,y], y <= max(x+3,(x*1.3)). All printed site block sample values
are the minimum observed in the region spanned by the block.
• SNVSB: SNV site strand bias.
• SNVHPOL: SNV contextual homopolymer length.
• CIGAR: CIGAR alignment for each alternate indel allele.
• RU: Smallest repeating sequence unit extended or contracted in the indel allele
relative to the reference. RUs are not reported if longer than 20 bases.
• REFREP: Number of times RU is repeated in reference.
• IDREP: Number of times RU is repeated in indel allele.
} FORMAT: Format of the sample field. FORMAT specifies the data types and order of
the subfields. gVCF files use the following values:
• GT: Genotype.
• GQ: Genotype Quality.
• GQX: Minimum of {Genotype quality assuming variant position,Genotype
quality assuming non-variant position}.
• DP: Filtered basecall depth used for site genotyping.
• DPF: Basecalls filtered from input before site genotyping.
• AD: Allelic depths for the ref and alt alleles in the order listed. For indels this
value only includes reads which confidently support each allele (posterior
probability 0.999 or higher that read contains indicated allele vs all other
intersecting indel alleles).
• DPI: Read depth associated with indel, taken from the site preceding the indel.
} SAMPLE: Sample fields as defined by the header.
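As an illustration of how the fields described above fit together, the following Python sketch splits one gVCF data line into its columns, expands the INFO and FORMAT/sample subfields, and converts a Phred-scaled score back to a probability. The function names and the example record are hypothetical and are not part of the BWA Enrichment app; the record was made up to resemble a non-variant block.

```python
def phred_to_prob(score):
    """Invert Q = -10*log10(p) to recover the probability p."""
    return 10 ** (-score / 10.0)


def parse_gvcf_line(line):
    """Split one tab-delimited gVCF data line into a dictionary of fields."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info, fmt = cols[:9]

    # INFO: semicolon-separated keys with optional "=value" parts.
    # Keys without a value (flags) are stored as True.
    info_dict = {}
    if info != ".":
        for item in info.split(";"):
            key, _, value = item.partition("=")
            info_dict[key] = value if value else True

    # FORMAT names the colon-separated subfields of every sample column.
    keys = fmt.split(":")
    samples = [dict(zip(keys, col.split(":"))) for col in cols[9:]]

    return {
        "CHROM": chrom,
        "POS": int(pos),
        "ID": vid,
        "REF": ref,
        "ALT": [] if alt == "." else alt.split(","),  # "." is the missing value
        "QUAL": None if qual == "." else float(qual),
        "FILTER": flt.split(";"),  # ["PASS"] or a list of failed filter codes
        "INFO": info_dict,
        "SAMPLES": samples,
    }


# Hypothetical record, invented for this example only.
EXAMPLE_LINE = (
    "chr1\t10440\t.\tC\t.\t0.00\tLowGQX\t"
    "END=10455;BLOCKAVG_min30p3a\tGT:GQX:DP:DPF\t0/0:24:9:1"
)

record = parse_gvcf_line(EXAMPLE_LINE)
print(record["FILTER"], record["INFO"], record["SAMPLES"][0])
print(phred_to_prob(30))
```

Values in the sample dictionaries are left as strings so that missing values (".") survive unchanged; a caller can convert GQX, DP, or DPF to numbers as needed.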
Enrichment_summary.csv
The BWA Enrichment app produces an overview of statistics for each sample and for the
aggregate results in comma-separated values (CSV) format: the *.enrichment_summary.csv
file. These files are located in the results folder for each sample and for the aggregate
results. A brief description of the metrics follows.
NOTE
PCR duplicate reads are not removed from statistics. Results are not directly comparable
to Picard HsMetrics.
Statistic | Definition
Sample ID | IDs of samples reported on in the file.
Sample Name | Names of samples reported on in the file.
Run Folder | Run folders for samples reported on in the file.
Target manifest | The target manifest file used for analysis. This file specifies the targeted regions for the aligner and variant caller.
Reference Genome | Reference genome selected.
Padding size | The length of sequence immediately upstream and downstream of the enrichment targets that is included for a padded target.
Total length of targeted reference | Total length of sequenced bases in the target region.
Total PF reads | The number of reads passing filter for the sample.
Percent Q30 | The percentage of bases with a quality score of 30 or higher.
Total aligned reads | The total number of reads passing filter that aligned.
Percent aligned reads | Percentage of reads passing filter that aligned.
Target aligned reads | Number of reads that aligned to the target.
Read enrichment | 100*(Target aligned reads/Total aligned reads).
Padded target aligned reads | Number of reads that aligned to the padded target.
Padded read enrichment | 100*(Padded target aligned reads/Total aligned reads).
Total aligned bases | Total aligned bases.
Target aligned bases | Total aligned bases in the target region.
Base enrichment | 100*(Total Aligned Bases in Targeted Regions/Total Aligned Bases).
Padded target aligned bases | Total aligned bases in the padded target region.
Padded base enrichment | 100*(Total Aligned Bases in Padded Targeted Regions/Total Aligned Bases).
Percent duplicate paired reads | Percentage of paired reads that have duplicates.
Mean region coverage depth | The total number of aligned bases in the targeted region divided by the targeted region size.
Uniformity of coverage (Pct > 0.2*mean): | The percentage of targeted base positions in which the read depth is greater than 0.2 times the mean region target coverage depth.
Target coverage at 1X | Percentage of targets with coverage greater than 1X.
Target coverage at 10X | Percentage of targets with coverage greater than 10X.
Target coverage at 20X | Percentage of targets with coverage greater than 20X.
Target coverage at 50X | Percentage of targets with coverage greater than 50X.
Fragment length median | Median length of the sequenced fragment. The fragment length is calculated based on the locations at which a read pair aligns to the reference. The read mapping information is parsed from the BAM files.
Fragment length min | Minimum length of the sequenced fragment.
Fragment length max | Maximum length of the sequenced fragment.
Fragment length SD | Standard deviation of the sequenced fragment length.
SNVs, Indels, Insertions, Deletions | Total number of variants present in the dataset that pass the quality filters.
SNVs, Indels, Insertions, Deletions (Percent found in dbSNP) | 100*(Number of variants in dbSNP/Number of variants).
SNV Ts/Tv ratio | Transition rate of SNVs that pass the quality filters divided by transversion rate of SNVs that pass the quality filters. Transitions are interchanges of purines (A, G) or of pyrimidines (C, T). Transversions are interchanges of purine and pyrimidine bases (for example, A to T).
SNVs, Indels, Insertions, Deletions Het/Hom ratio | Number of heterozygous variants/Number of homozygous variants.
SNVs, Insertions, Deletions in Genes | The number of variants that fall into a gene.
SNVs, Insertions, Deletions in Exons | The number of variants that fall into an exon.
SNVs, Insertions, Deletions in Coding Regions | The number of variants that fall into a coding region.
SNVs, Insertions, Deletions in UTR Region | The number of variants that fall into an untranslated region (UTR).
SNVs, Insertions, Deletions in Splice Site Region | The number of variants that fall into a splice site region.
Stop Gained SNVs, Insertions, Deletions | The number of variants that cause an additional stop codon.
Stop Lost SNVs, Insertions, Deletions | The number of variants that cause the loss of a stop codon.
Frameshift SNVs, Insertions, Deletions | The number of variants that cause a frameshift.
Nonsynonymous SNVs, Insertions, Deletions | The number of variants that cause an amino acid change in a coding region.
Synonymous SNVs | The number of variants that are within a coding region, but do not cause an amino acid change.
Aggregate Summary Report
The BWA Enrichment app provides an aggregate enrichment statistics PDF report for all
samples combined on the Summary page. A brief description of the report is below.
NOTE
PCR duplicate reads are not removed from statistics. Results are not directly comparable
to Picard HsMetrics.
Sample Information
Defines the sample numbers and names in the report.
Enrichment Summary
Statistic | Definition
Total length of targeted reference | Total length of sequenced bases in the target region.
Padding size | The length of sequence immediately upstream and downstream of the enrichment targets that is included for a padded target.
Read Level Enrichment
Statistic | Definition
Total aligned reads | The total number of reads passing filter that aligned.
Targeted aligned reads | Number of reads that aligned to the target.
Read enrichment | 100*(Target aligned reads/Total aligned reads).
Padded target aligned reads | Number of reads that aligned to the padded target.
Padded read enrichment | 100*(Padded target aligned reads/Total aligned reads).
Base Level Enrichment
Statistic | Definition
Total aligned bases | Total aligned bases.
Targeted aligned bases | Total aligned bases in the target region.
Base enrichment | 100*(Total Aligned Bases in Targeted Regions/Total Aligned Bases).
Padded target aligned bases | Total aligned bases in the padded target region.
Padded base enrichment | 100*(Total Aligned Bases in Padded Targeted Regions/Total Aligned Bases).
The Base Enrichment histogram graphs the total aligned bases, padded targeted aligned
bases, and padded base enrichment by sample.
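The read-level and base-level enrichment percentages defined above all follow the same pattern of 100 times a targeted count divided by a total count. The following Python sketch shows that arithmetic under stated assumptions: the function and argument names are hypothetical, and the input counts are invented for illustration rather than taken from a real report.

```python
def percent(numerator, denominator):
    """Return 100 * numerator / denominator, or None if the denominator is zero."""
    if denominator == 0:
        return None
    return 100.0 * numerator / denominator


def enrichment_metrics(total_aligned_reads, target_aligned_reads,
                       padded_target_aligned_reads, total_aligned_bases,
                       target_aligned_bases, padded_target_aligned_bases):
    """Compute the four enrichment percentages from the raw counts."""
    return {
        "Read enrichment": percent(target_aligned_reads, total_aligned_reads),
        "Padded read enrichment": percent(padded_target_aligned_reads,
                                          total_aligned_reads),
        "Base enrichment": percent(target_aligned_bases, total_aligned_bases),
        "Padded base enrichment": percent(padded_target_aligned_bases,
                                          total_aligned_bases),
    }


# Invented counts, for illustration only.
metrics = enrichment_metrics(
    total_aligned_reads=1000,
    target_aligned_reads=600,
    padded_target_aligned_reads=750,
    total_aligned_bases=100000,
    target_aligned_bases=55000,
    padded_target_aligned_bases=70000,
)
for name, value in metrics.items():
    print(name, value)
```

Returning None for a zero denominator is a design choice of this sketch; it keeps an empty sample from raising a division error.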
SNV Summary
Statistic | Definition
SNVs | Total number of Single Nucleotide Variants present in the dataset that pass the quality filters.
SNVs (Percent found in dbSNP) | 100*(Number of SNVs in dbSNP/Number of SNVs).
SNV Ts/Tv Ratio | Transition rate of SNVs that pass the quality filters divided by transversion rate of SNVs that pass the quality filters. Transitions are interchanges of purines (A, G) or of pyrimidines (C, T). Transversions are interchanges of purine and pyrimidine bases (for example, A to T).
SNV Het/Hom Ratio | Number of heterozygous SNVs/Number of homozygous SNVs.
The SNVs histogram graphs the number of SNVs passing by sample.
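To make the transition and transversion definitions concrete, the following Python sketch classifies each SNV by its reference and alternate base and then divides the two counts. It is a minimal illustration that assumes the ratio is taken over counts of passing SNVs; the function names and the SNV list are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid or ref == alt:
        raise ValueError("expected two different bases from A, C, G, T")
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ts_tv_ratio(snvs):
    """Ratio of transitions to transversions; None if there are no transversions."""
    transitions = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    transversions = len(snvs) - transitions
    if transversions == 0:
        return None
    return transitions / transversions


# Invented (REF, ALT) pairs: three transitions and two transversions.
example_snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "T"), ("C", "G")]
print(ts_tv_ratio(example_snvs))
```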
Indel Summary
Statistic | Definition
Indels | Total number of indels present in the dataset that pass the quality filters.
Indels (Percent found in dbSNP) | 100*(Number of Indels in dbSNP/Number of Indels).
Indel Het/Hom ratio | Number of heterozygous indels/Number of homozygous indels.
The Indels histogram graphs the number of indels passing by sample.
Coverage Summary
Statistic | Definition
Mean region coverage depth | The total number of aligned bases in the targeted region divided by the targeted region size.
Uniformity of coverage (Pct > 0.2*mean): | The percentage of targeted base positions in which the read depth is greater than 0.2 times the mean region target coverage depth.
Target coverage at 1X | Percentage of targets with coverage greater than 1X.
Target coverage at 10X | Percentage of targets with coverage greater than 10X.
Target coverage at 20X | Percentage of targets with coverage greater than 20X.
Target coverage at 50X | Percentage of targets with coverage greater than 50X.
In addition, the app provides two graphs:
} A Mean Coverage and Uniformity histogram that plots the mean region coverage
depth and uniformity of coverage by sample.
} A Targeted Regions Depth of Coverage histogram that plots the number of targeted
sequences by the depth of coverage.
Statistic | Definition
Depth of Sequencing Coverage | The coverage depth of a position in the genome refers to the number of sequenced bases that align to that position.
Number of Targeted Bases Covered | Number of targeted bases that have at least the indicated depth of coverage.
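The coverage statistics above can be illustrated from a list of per-position depths over the targeted region. The Python sketch below is a simplified reading of those definitions, not the app's implementation: it assumes one depth value per targeted base position, and the function names and depth values are hypothetical.

```python
def mean_coverage_depth(depths):
    """Sum of per-position depths divided by the number of targeted positions."""
    if not depths:
        raise ValueError("depths must not be empty")
    return sum(depths) / len(depths)


def uniformity_of_coverage(depths, fraction=0.2):
    """Percentage of positions whose depth is greater than fraction * mean depth."""
    threshold = fraction * mean_coverage_depth(depths)
    above = sum(1 for d in depths if d > threshold)
    return 100.0 * above / len(depths)


def bases_covered_at_least(depths, min_depth):
    """Number of targeted positions with at least min_depth coverage."""
    return sum(1 for d in depths if d >= min_depth)


# Invented depths for a five-base target, for illustration only.
example_depths = [0, 5, 10, 10, 25]
print(mean_coverage_depth(example_depths))
print(uniformity_of_coverage(example_depths))
print(bases_covered_at_least(example_depths, 10))
```

Calling bases_covered_at_least once per depth value yields the series plotted in a depth-of-coverage histogram of this kind.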
Fragment Length Summary
Statistic | Definition
Fragment length median | Median length of the sequenced fragment. The fragment length is calculated based on the locations at which a read pair aligns to the reference. The read mapping information is parsed from the BAM files.
Minimum | Minimum length of the sequenced fragment.
Maximum | Maximum length of the sequenced fragment.
Standard Deviation | Standard deviation of the sequenced fragment length.
The Fragment Length Medians histogram graphs the fragment length median by sample.
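The fragment length statistics can be sketched from the template length values that a BAM file records for each aligned read. The Python example below makes several assumptions that the guide does not state: it keeps only positive template lengths so that each read pair is counted once, it uses the sample standard deviation, and the function name and input values are hypothetical.

```python
import statistics


def fragment_length_summary(template_lengths):
    """Summarize fragment lengths from signed template length (TLEN) values."""
    # Each properly aligned pair contributes +length for one read and -length
    # for its mate; keeping only positive values counts each pair once.
    lengths = [t for t in template_lengths if t > 0]
    if len(lengths) < 2:
        raise ValueError("need at least two fragments to compute a deviation")
    return {
        "median": statistics.median(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "sd": statistics.stdev(lengths),  # sample standard deviation (assumed)
    }


# Invented template lengths for three read pairs plus one unpaired read (0).
example_tlens = [150, -150, 200, -200, 250, -250, 0]
summary = fragment_length_summary(example_tlens)
print(summary)
```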
Duplicates Summary
Statistic | Definition
Percent duplicate paired reads | Percentage of paired reads that have duplicates.
The Percent Duplicate Paired Reads histogram graphs the percent duplicate paired reads
by sample.
BWA Enrichment Methods
This chapter describes the methods that are used in the BWA Enrichment app.
BWA
The BWA Enrichment workflow uses the Burrows-Wheeler Aligner (BWA), which aligns
relatively short nucleotide sequences against a long reference sequence. BWA
automatically adjusts parameters based on read lengths and error rates, and then
estimates insert size distribution. For more information, see bio-bwa.sourceforge.net/.
After BWA alignment, variant calling is done by GATK.1
1 DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis
AA, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell TJ, Kernytsky AM,
Sivachenko AY, Cibulskis K, Gabriel SB, Altshuler D, Daly MJ (2011) A framework
for variation discovery and genotyping using next-generation DNA sequencing data.
Nat Genet. 43(5): 491-8.
GATK
The Genome Analysis Toolkit (GATK), developed by the Broad Institute, is the standard
variant caller after BWA alignment. GATK calls raw variants for each sample, analyzes
the variants against known variants, and then applies a calibration procedure to
compute a false discovery rate for each variant. Variants are flagged as homozygous
(1/1) or heterozygous (0/1) in the VCF file sample column.
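As a small illustration of the genotype flags, the following Python sketch classifies GT strings from the sample column and derives a het/hom ratio from them. The function names, category labels, and genotype list are hypothetical; only the 1/1 and 0/1 encodings come from the description above.

```python
def classify_genotype(gt):
    """Classify a VCF GT string such as "0/1" or "1/1"."""
    alleles = gt.replace("|", "/").split("/")  # accept phased or unphased calls
    if "." in alleles:
        return "no-call"
    if all(a == "0" for a in alleles):
        return "hom-ref"
    if len(set(alleles)) == 1:
        return "hom-alt"  # for example 1/1
    return "het"  # for example 0/1


def het_hom_ratio(genotypes):
    """Heterozygous calls divided by homozygous variant calls; None if no hom calls."""
    calls = [classify_genotype(gt) for gt in genotypes]
    het = calls.count("het")
    hom = calls.count("hom-alt")
    if hom == 0:
        return None
    return het / hom


# Invented genotypes, for illustration only.
example_gts = ["0/1", "0/1", "1/1", "0/0", "1/2"]
print([classify_genotype(gt) for gt in example_gts])
print(het_hom_ratio(example_gts))
```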
The app is guided by the GATK best practices as described here:
www.broadinstitute.org/gatk/guide/topic?name=best-practices
For more information about GATK, see www.broadinstitute.org/gatk and
www.broadinstitute.org/gatk/about/citing-gatk.
Picard Metrics
Picard is a suite of tools in Java that work with next generation sequencing data in BAM
format. BWA Enrichment uses the CalculateHsMetrics tool in Picard to compute a set of
Hybrid Selection specific metrics from an aligned SAM or BAM file. If a reference
sequence is provided, AT/GC dropout metrics will be calculated. GC and mean coverage
information for every target can also be computed.
For more information, see: http://picard.sourceforge.net/command-line-overview.shtml
Technical Assistance
For technical assistance, contact Illumina Technical Support.
Table 1 Illumina General Contact Information
Illumina Website | www.illumina.com
Email | [email protected]
Table 2 Illumina Customer Support Telephone Numbers
Region | Contact Number
North America | 1.800.809.4566
Austria | 0800.296575
Belgium | 0800.81102
Denmark | 80882346
Finland | 0800.918363
France | 0800.911850
Germany | 0800.180.8994
Ireland | 1.800.812949
Italy | 800.874909
Netherlands | 0800.0223859
Norway | 800.16836
Spain | 900.812168
Sweden | 020790181
Switzerland | 0800.563118
United Kingdom | 0800.917.0041
Other countries | +44.1799.534000
Safety Data Sheets
Safety data sheets (SDSs) are available on the Illumina website at
www.illumina.com/msds.
Product Documentation
Product documentation in PDF is available for download from the Illumina website. Go
to www.illumina.com/support, select a product, then click Documentation & Literature.
Illumina
San Diego, California 92122 U.S.A.
+1.800.809.ILMN (4566)
+1.858.202.4566 (outside North America)
[email protected]
www.illumina.com