Local Run Manager DNA Enrichment Analysis Module Workflow Guide (1000000003343 v00)

For Research Use Only. Not for use in diagnostic procedures.
Overview
Set Parameters
Analysis Methods
View Analysis Results
Analysis Report
Analysis Output Files
Custom Analysis Settings
Technical Assistance
ILLUMINA PROPRIETARY
Document # 1000000003343 v00
January 2016
This document and its contents are proprietary to Illumina, Inc. and its affiliates ("Illumina"), and are intended solely for the
contractual use of its customer in connection with the use of the product(s) described herein and for no other purpose. This
document and its contents shall not be used or distributed for any other purpose and/or otherwise communicated, disclosed,
or reproduced in any way whatsoever without the prior written consent of Illumina. Illumina does not convey any license
under its patent, trademark, copyright, or common-law rights nor similar rights of any third parties by this document.
The instructions in this document must be strictly and explicitly followed by qualified and properly trained personnel in order
to ensure the proper and safe use of the product(s) described herein. All of the contents of this document must be fully read
and understood prior to using such product(s).
FAILURE TO COMPLETELY READ AND EXPLICITLY FOLLOW ALL OF THE INSTRUCTIONS CONTAINED HEREIN
MAY RESULT IN DAMAGE TO THE PRODUCT(S), INJURY TO PERSONS, INCLUDING TO USERS OR OTHERS, AND
DAMAGE TO OTHER PROPERTY.
ILLUMINA DOES NOT ASSUME ANY LIABILITY ARISING OUT OF THE IMPROPER USE OF THE PRODUCT(S)
DESCRIBED HEREIN (INCLUDING PARTS THEREOF OR SOFTWARE).
© 2016 Illumina, Inc. All rights reserved.
Illumina, 24sure, BaseSpace, BeadArray, BlueFish, BlueFuse, BlueGnome, cBot, CSPro, CytoChip, DesignStudio,
Epicentre, ForenSeq, Genetic Energy, GenomeStudio, GoldenGate, HiScan, HiSeq, HiSeq X, Infinium, iScan, iSelect,
MiSeq, MiSeqDx, MiSeq FGx, NeoPrep, NextBio, Nextera, NextSeq, Powered by Illumina, SureMDA, TruGenome,
TruSeq, TruSight, Understand Your Genome, UYG, VeraCode, verifi, VeriSeq, the pumpkin orange color, and the
streaming bases design are trademarks of Illumina, Inc. and/or its affiliate(s) in the U.S. and/or other countries. All other
names, logos, and other trademarks are the property of their respective owners.
Overview
The Local Run Manager DNA Enrichment analysis module aligns reads against the
whole genome reference, and then performs variant analysis for regions of interest
specified in the manifest file.
Compatible Library Types
The DNA Enrichment analysis module is compatible with specific library types
represented by library kit categories on the Create Run screen. For a current list of
compatible library kits, see the Local Run Manager support page on the Illumina
website.
Input Requirements
In addition to sequencing data files generated during the sequencing run, such as base
call files, the DNA Enrichment analysis module requires the following files.
• Manifest file: The DNA Enrichment analysis module requires at least 1 manifest
file. The manifest files are available for download from the Illumina website. The
manifest file is a list of targeted regions and the chromosome start and end positions.
• Reference genome: The DNA Enrichment analysis module requires the reference
genome specified in the manifest file. The reference genome provides the
chromosome and start coordinate in the BAM file output.
• Custom Probe Manifest: When using Picard HS Metrics reporting, the DNA
Enrichment analysis module requires a custom probe manifest, which lists the
targeted regions and the chromosome start and end positions.
Uploading Manifests
To import a manifest for all runs using the DNA Enrichment analysis module, use the
Module Settings command from the Local Run Manager navigation bar. For more
information, see the Local Run Manager Software Guide (document # 1000000002702).
Alternatively, you can import a manifest for the current run only using the Import
Manifests command on the Create Run screen.
About This Guide
This guide provides instructions for setting up run parameters for sequencing and
analysis parameters for the DNA Enrichment analysis module. For information about the
Local Run Manager dashboard and system settings, see the Local Run Manager Software
Guide (document # 1000000002702).
Set Parameters
1. Click Create Run, and select DNA Enrichment.
2. Enter a run name that identifies the run from sequencing through analysis.
   Use alphanumeric characters, spaces, underscores, or dashes.
3. [Optional] Enter a run description to help identify the run.
   Use alphanumeric characters.
Specify Run Settings
1. From the Library Kit drop-down list, select from the following library kit categories.
   • Nextera Rapid Capture Enrichment
   • TruSight Enrichment Panels
2. Specify the number of index reads.
   • 0 for a run with no indexing
   • 1 for a single-indexed run
   • 2 for a dual-indexed run
3. Specify a read type: Single Read or Paired End.
4. Enter the number of cycles for the run.
5. [Optional] Specify any custom primers to be used for the run.
Specify Module-Specific Settings
1. Expand the Aligner drop-down list and select an alignment method.
   • BWA-MEM: (Default) Optimized for Illumina sequencing data and reads ≥ 70 bp.
   • BWA-Backtrack Legacy: Use with legacy data or reads < 70 bp.
2. Expand the Variant Caller drop-down list and select a variant calling method.
   • Starling: (Default) Calls SNPs and small indels, and summarizes depth and
     probabilities for every site in the genome.
   • Somatic: Identifies variants at low frequency and minimizes false positives.
     Recommended for analysis of tumor samples.
   • GATK: Calls raw variants for each sample, analyzes variants against known
     variants, and then calculates a false discovery rate for each variant.
3. If using the Somatic Variant Caller, specify the following analysis settings.
   • Variant Frequency: Set to a threshold of 0.05 by default. Variants with a
     frequency below the specified threshold are not reported in VCF files.
   • Indel Repeat Filter Cutoff: On by default. When enabled, indels are filtered when
     the reference has a 1-base or 2-base motif over 8 times next to the variant.
4. Use the up/down arrows to specify the Manifest Padding threshold.
   Set to 150 by default, this setting specifies the number of bases immediately
   upstream and downstream of the targeted regions used to calculate enrichment
   statistics. Options are 0–250 in increments of 50.
5. Click the On/Off toggle to enable or disable the following analysis settings.
   • Flag PCR Duplicates: On by default. When enabled, PCR duplicates are flagged
     in the BAM files and not used for variant calling. PCR duplicates are defined as 2
     clusters from a paired-end run where both clusters have the exact same alignment
     position for each read.
   • Indel Realignment: On by default. When enabled, regions containing indels are
     locally realigned to minimize the number of mismatches.
6. Click Show advanced module settings.
7. Click the On/Off toggle to enable or disable Picard HS metrics.
   • Picard HS Metric Reporting: Off by default. When enabled, this setting generates
     Picard HS metrics for the given bait and manifest file. When enabled, you are
     prompted to upload a custom probe manifest.
   • To upload a custom probe manifest file, click Import and navigate to the location
     of the file.
Import Manifest Files for the Run
1. Make sure that the manifests you want to import are available in an accessible
   network location or on a USB drive.
2. Click Import Manifests.
3. Navigate to the manifest file and select the manifest that you want to add.
NOTE
To import manifests for any run using the DNA Enrichment analysis module, use the
Module Settings feature from the navigation bar.
Specify Samples for the Run
Specify samples for the run using the following options:
• Enter samples manually: Use the blank table on the Create Run screen.
• Import samples: Navigate to an external file in a comma-separated values (*.csv)
  format. A template is available for download on the Create Run screen.
After you have populated the samples table, you can export the sample information to
an external file, and use the file as a reference when preparing libraries or import the file
for another run.
Enter Samples Manually
1. Adjust the samples table to an appropriate number of rows.
   • Click the + icon to add a row.
   • Use the up/down arrows to add multiple rows. Click the + icon.
   • Click the x icon to delete a row.
   • Right-click on a row in the table and use the commands in the drop-down menu.
2. Enter a unique sample ID in the Sample ID field.
   Use alphanumeric characters, dashes, or underscores.
3. [Optional] Enter a sample description in the Sample Description field.
   Use alphanumeric characters, dashes, underscores, or spaces.
4. Expand the Index 1 (i7) drop-down list and select an Index 1 adapter.
5. Expand the Index 2 (i5) drop-down list and select an Index 2 adapter.
6. Expand the Manifest drop-down list and select a manifest file.
7. [Optional] Click the Export icon to export sample information in *.csv format.
8. When finished, click Save Run.
Import Samples
1. Click Template. The template file contains the correct column headings for import.
2. Enter the sample information in each column for the samples in the run, and then
   save the file.
3. Click Import Samples and browse to the location of the sample information file.
4. When finished, click Save Run.
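The character rules given in this section for run names and sample IDs can be checked before a sample file is imported. The following is a minimal sketch, not part of Local Run Manager: the function names are hypothetical, and the patterns encode only the rules stated in this guide (any length limits or other checks the software applies are not modeled).

```python
import re

# Patterns derived from the rules stated in this guide (assumption: no other limits).
RUN_NAME_RE = re.compile(r"[A-Za-z0-9 _-]+")   # alphanumerics, spaces, underscores, dashes
SAMPLE_ID_RE = re.compile(r"[A-Za-z0-9_-]+")   # alphanumerics, dashes, underscores


def is_valid_run_name(name: str) -> bool:
    """True when the whole run name uses only the permitted characters."""
    return RUN_NAME_RE.fullmatch(name) is not None


def is_valid_sample_id(sample_id: str) -> bool:
    """True when the whole sample ID uses only the permitted characters."""
    return SAMPLE_ID_RE.fullmatch(sample_id) is not None


def find_duplicate_ids(sample_ids):
    """Return each sample ID that occurs more than once (sample IDs must be unique)."""
    seen, duplicates = set(), []
    for sample_id in sample_ids:
        if sample_id in seen and sample_id not in duplicates:
            duplicates.append(sample_id)
        seen.add(sample_id)
    return duplicates
```

Running these checks on the Sample ID column of the filled-in template catches invalid characters and repeated IDs before the file is imported.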
Analysis Methods
The DNA Enrichment analysis module performs the following analysis steps and then
writes analysis output files to the Alignment folder.
• Demultiplexes index reads
• Generates FASTQ files
• Aligns to a reference
• Identifies variants
Demultiplexing
Demultiplexing compares each Index Read sequence to the index sequences specified for
the run. No quality values are considered in this step.
Index reads are identified using the following steps:
• Samples are numbered starting from 1 based on the order they are listed for the run.
• Sample number 0 is reserved for clusters that were not assigned to a sample.
• Clusters are assigned to a sample when the index sequence matches exactly or when
  there is up to a single mismatch per Index Read.
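The assignment rule above can be sketched as follows. This is an illustration only, not the module's implementation: the function names are hypothetical, and leaving a cluster unassigned when it matches more than one sample is an assumption that the guide does not state.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_sample(index_reads, sample_indexes, max_mismatches=1):
    """Return the 1-based sample number for a cluster, or 0 if unassigned.

    index_reads: tuple of observed Index Read sequences for one cluster.
    sample_indexes: list of tuples of expected index sequences, in run order.
    """
    matches = []
    for number, expected in enumerate(sample_indexes, start=1):
        # Each Index Read is allowed up to max_mismatches; no quality values are used.
        if all(hamming(obs, exp) <= max_mismatches
               for obs, exp in zip(index_reads, expected)):
            matches.append(number)
    # Assumption: a cluster matching zero or several samples stays unassigned (0).
    return matches[0] if len(matches) == 1 else 0
```

For a dual-indexed run, each tuple holds the Index 1 and Index 2 sequences; for a single-indexed run, each tuple holds one sequence.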
FASTQ File Generation
After demultiplexing, the software generates intermediate analysis files in the FASTQ
format, which is a text format used to represent sequences. FASTQ files contain reads for
each sample and the associated quality scores. Any controls used for the run and
clusters that did not pass filter are excluded.
Each FASTQ file contains reads for only 1 sample, and the name of that sample is
included in the FASTQ file name. FASTQ files are the primary input for alignment.
Adapter Trimming
The DNA Enrichment analysis module performs adapter trimming by default.
During longer runs, clusters can sequence beyond the sample DNA and read bases from
a sequencing adapter. To keep these adapter bases out of the results, the adapter
sequence is trimmed before the sequence is written to the FASTQ file. Trimming the
adapter sequence avoids reporting false mismatches with the reference sequence and
improves alignment accuracy and performance.
Adapter Sequences
When using the DNA Enrichment analysis module with TruSight Enrichment or Nextera
Rapid Capture Enrichment, the following adapter sequence is trimmed:
CTGTCTCTTATACACATCT
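As an illustration of what trimming does to a read, the sketch below cuts a read at the adapter sequence listed above. It is not the module's algorithm, which this guide does not describe: it assumes exact matching only, and the `min_overlap` parameter for partial adapters at the read end is a hypothetical choice.

```python
ADAPTER = "CTGTCTCTTATACACATCT"  # adapter sequence listed in this guide


def trim_adapter(sequence, qualities, adapter=ADAPTER, min_overlap=5):
    """Cut the read at the first exact adapter match, or at an adapter
    prefix of at least min_overlap bases that runs off the end of the read."""
    pos = sequence.find(adapter)
    if pos == -1:
        # Try progressively shorter adapter prefixes against the read end.
        longest = min(len(adapter) - 1, len(sequence))
        for overlap in range(longest, min_overlap - 1, -1):
            if sequence.endswith(adapter[:overlap]):
                pos = len(sequence) - overlap
                break
    if pos == -1:
        return sequence, qualities  # no adapter found, read unchanged
    # Bases and quality scores are trimmed together so they stay aligned.
    return sequence[:pos], qualities[:pos]
```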
Alignment
During the alignment step, reads are aligned against the entire reference genome using
the Burrows-Wheeler Aligner (BWA), which aligns relatively short nucleotide sequences
against a long reference sequence. BWA automatically adjusts parameters based on read
lengths and error rates, and then estimates insert size distribution.
The DNA Enrichment analysis module provides the option of using BWA-MEM or
BWA-Backtrack Legacy for the alignment step.
BWA-MEM
BWA-MEM is the most recent version of the Burrows-Wheeler Alignment algorithm.
Optimized for longer read lengths of ≥ 70 bp, BWA-MEM has a significant positive
impact on detection of variants, especially insertions and deletions.
BWA-Backtrack
BWA-Backtrack is an earlier version of the Burrows-Wheeler Aligner algorithm, intended
for sequencing reads shorter than 70 bp. Use this version for very short reads, or when
consistency is required with previous study data.
Variant Calling
Variant calling records single nucleotide polymorphisms (SNPs), insertions/deletions
(indels), and other structural variants in a standardized variant call format (VCF).
For each SNP or indel called, the probability of an error is provided as a variant quality
score. Reads are realigned around candidate indels to improve the quality of the calls
and site coverage summaries.
The DNA Enrichment analysis module provides the option of using Starling, Somatic, or
GATK for variant calling.
Starling
Starling calls both SNPs and small indels, and summarizes depth and probabilities for
every site in the genome. Starling produces a VCF file for each sample that contains
variants.
Starling treats each insertion or deletion as a single mismatch. Base calls with more than
2 mismatches to the reference sequence within 20 bases of the call are ignored. If the call
occurs within the first or last 20 bases of a read, the mismatch limit is applied to the
41-base window at that end of the read.
Somatic Variant Caller
Developed by Illumina, the somatic variant caller identifies variants present at low
frequency in the DNA sample and minimizes false positives.
The somatic variant caller identifies SNPs in 3 steps:
• Considers each position in the reference genome separately
• Counts bases at the given position for aligned reads that overlap the position
• Computes a variant score that measures the quality of the call using a Poisson
  model. Variants with a quality score below Q20 are excluded.
The somatic variant caller analyzes how many alignments covering a given position
include a particular indel compared to the overall coverage at that position.
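To make the idea of a Poisson-based variant score concrete, the sketch below scores a position by how unlikely the observed number of variant-supporting reads would be if it came from noise alone. This guide does not give the exact model, so the sketch rests on assumptions: the `noise_rate` default, the cap `max_q`, and the function names are all hypothetical.

```python
import math


def poisson_tail(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam), computed as 1 minus the lower sum."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(1.0 - cdf, 0.0)


def variant_q_score(alt_count, depth, noise_rate=0.01, max_q=100.0):
    """Phred-scaled score of seeing alt_count or more variant reads by noise.

    noise_rate is an assumed per-base error rate, not a value from this guide.
    """
    p = poisson_tail(alt_count, depth * noise_rate)
    if p <= 0.0:
        return max_q
    return min(-10.0 * math.log10(p), max_q)


def passes_q20(alt_count, depth):
    """Apply the Q20 exclusion described above to the sketched score."""
    return variant_q_score(alt_count, depth) >= 20.0
```

Under these assumptions, 50 variant reads at a depth of 1000 pass the Q20 cutoff, while 10 variant reads at the same depth do not, because 10 is the count expected from noise alone.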
GATK
The Genome Analysis Toolkit (GATK) calls raw variants for each sample, analyzes
variants against known variants, and then calculates a false discovery rate for each
variant. Variants are flagged as homozygous (1/1) or heterozygous (0/1) in the VCF file
sample column. For more information, see www.broadinstitute.org/gatk.
View Analysis Results
1. From the Local Run Manager dashboard, click the run name.
2. From the Run Overview tab, review the sequencing run metrics.
3. [Optional] Click the Copy to Clipboard icon for access to the output run folder.
4. Click the Sequencing Information tab to review run parameters and consumables
   information.
5. Click the Samples and Results tab to view the analysis report.
   • If analysis was repeated, expand the Select Analysis drop-down and select the
     appropriate analysis.
   • From the left navigation bar, select a sample name to view the report for another
     sample.
6. [Optional] Click the Copy to Clipboard icon for access to the Analysis folder.
Analysis Report
Analysis results are summarized on the Samples and Results tab. The report is also
available in a PDF file format for each sample and as an aggregate report in the Analysis
folder.
Sample Information
Table 1 Sample Information Table
Column Heading       Description
Sample ID            The sample ID provided when the run was created.
Sample Name          The sample name provided when the run was created.
Run Folder           The name of the run folder.
Total PF Reads       The total number of reads passing filter.
Percent Q30 Bases    The percentage of bases called with a quality score ≥ Q30.
Median Read Length   The median read length in base pairs.
Adapters Trimmed     Indicates whether adapter trimming was performed.
Enrichment Summary
Table 2 Enrichment Summary Table
Column Heading                       Description
Target Manifest                      The name of the file that specifies the reference and targeted reference regions.
Total Length of Targeted Reference   The total length in base pairs of sequenced bases in the target regions.
Padding Size                         The length of sequence immediately upstream and downstream of the enriched targets.
Enrichment values are calculated without padding. If a targeted region overlaps another
region, positions are adjusted to remove the overlap.
Read Level Enrichment
Table 3 Read Level Enrichment Table
Column Heading                Description
Total Aligned Reads           The total number of reads that aligned to the reference.
Percent Aligned Reads         The percentage of reads that aligned to the reference.
Targeted Aligned Reads        The number of reads that aligned to the target.
Read Enrichment               The percentage of targeted aligned reads over total aligned reads.
Padded Target Aligned Reads   The number of reads that aligned to the padded target.
Padded Read Enrichment        The percentage of padded target aligned reads over total aligned reads.
Base Level Enrichment
Table 4 Base Level Enrichment Table
Column Heading                Description
Total Aligned Bases           The total number of bases that aligned to the reference.
Percent Aligned Bases         The percentage of bases that aligned to the reference.
Targeted Aligned Bases        The total number of aligned bases in the target region.
Base Enrichment               The percentage of aligned bases in targeted regions over total aligned bases.
Padded Target Aligned Bases   The total number of aligned bases in the padded target region.
Padded Base Enrichment        The percentage of total aligned bases in padded target regions over total aligned bases.
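The enrichment percentages in the read level and base level tables share one calculation: the count in the target (or padded target) divided by the total aligned count. A minimal sketch with a hypothetical function name and example counts that are not taken from a real run:

```python
def enrichment_percent(target_count: int, total_aligned: int) -> float:
    """Percentage of aligned reads (or bases) that fall in the target.

    Pass padded-target counts to obtain the padded enrichment values.
    """
    if total_aligned == 0:
        return 0.0  # assumption: report 0 rather than fail when nothing aligned
    return 100.0 * target_count / total_aligned


# Example with made-up counts: 750 targeted aligned reads out of 1000 aligned reads.
read_enrichment = enrichment_percent(750, 1000)
```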
Small Variants Summary
Table 5 Small Variants Summary Table
Row Heading              Description
Total Passing            The total number of variants passing filter for single nucleotide variations (SNVs), insertions, and deletions.
Percent Found in dbSNP   The percentage of variants called by the variant caller that are also present in dbSNP.
Het/Hom Ratio            The ratio of the number of heterozygous SNPs and number of homozygous SNPs detected for the sample.
Ts/Tv Ratio              The ratio of transitions and transversions in SNPs.
                         • Transitions are variants of the same nucleotide type (pyrimidine to pyrimidine, C and T; or purine to purine, A and G).
                         • Transversions are variants of a different nucleotide type (pyrimidine to purine, or purine to pyrimidine).
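The transition and transversion definitions above translate directly into code. The sketch below uses hypothetical function names and expects SNPs as (reference base, alternate base) pairs in uppercase.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """True for a change within the same nucleotide type (A<->G or C<->T)."""
    pair = {ref, alt}
    return ref != alt and (pair <= PURINES or pair <= PYRIMIDINES)


def ts_tv_ratio(snps):
    """Ratio of transitions to transversions for (ref, alt) pairs."""
    ts = sum(1 for ref, alt in snps if is_transition(ref, alt))
    tv = sum(1 for ref, alt in snps if ref != alt and not is_transition(ref, alt))
    # Assumption: return infinity when no transversions are present.
    return ts / tv if tv else float("inf")
```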
Variants by Sequence Context
Table 6 Variants by Sequence Context Table
Row Heading              Description
In Genes                 The number of variants present in genes.
In Exons                 The number of variants present in exons. Includes coding and UTR regions.
In Coding Regions        The number of variants present in coding regions.
In UTR Regions           The number of variants present in untranslated regions (UTR). Includes 5' and 3' UTR regions.
In Splice Site Regions   The number of variants present in splice site regions. Includes regions annotated as splice acceptor, splice donor, splice site, or splice region.
Variants by Consequence
Table 7 Variants by Consequence Table
Row Heading     Description
Frameshifts     The number of variants involving a number of base pairs that are not a multiple of 3, which disrupts the triple reading frame.
Nonsynonymous   The number of variants with an amino acid change in a coding region.
Synonymous      The number of variants within a coding region and without an amino acid change.
Stop Gained     The number of variants with the gain of a stop codon in a coding sequence.
Stop Lost       The number of variants with the loss of a stop codon in a coding sequence.
Coverage Summary
Table 8 Coverage Summary Table
Column Heading               Description
Mean Region Coverage Depth   The total number of aligned bases divided by the targeted region size.
Uniformity of Coverage       The percentage of targeted base positions with coverage values greater than the low coverage threshold of 0.2 * mean region coverage depth.
Target Coverage at 1X        The percentage of coverage greater than 1X.
Target Coverage at 10X       The percentage of coverage greater than 10X.
Target Coverage at 20X       The percentage of coverage greater than 20X.
Target Coverage at 50X       The percentage of coverage greater than 50X.
The Mean Coverage by Targeted Region graph shows the mean coverage across target
regions.
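The coverage summary metrics can be reproduced from a list of per-base depths over the targeted positions. The sketch below is an illustration with a hypothetical function name. It reads "greater than" literally as a strict comparison, which is an assumption about how the software treats positions that sit exactly on a threshold.

```python
def coverage_summary(depths, thresholds=(1, 10, 20, 50)):
    """Return (mean depth, uniformity %, {threshold: % of positions above it}).

    depths: coverage depth at each targeted base position.
    """
    n = len(depths)
    if n == 0:
        raise ValueError("no targeted positions")
    mean_depth = sum(depths) / n           # aligned bases / targeted region size
    low_threshold = 0.2 * mean_depth       # low coverage threshold from Table 8
    uniformity = 100.0 * sum(d > low_threshold for d in depths) / n
    at_depth = {t: 100.0 * sum(d > t for d in depths) / n for t in thresholds}
    return mean_depth, uniformity, at_depth
```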
Depth of Coverage in Targeted Regions
Table 9 Depth of Coverage in Targeted Regions Table
Column Heading                              Description
Depth of Sequencing Coverage                The coverage depth based on the number of sequence bases that align to the position.
Number of Targeted Bases Covered at Depth   The number of targeted bases that have at least the indicated depth of coverage.
Total Targeted Bases Covered                The total number of bases that align to the target regions that have at least the indicated depth of coverage.
Target Coverage                             The percentage of targeted bases that meet the indicated depth of coverage.
Reads marked as duplicates are not included.
Fragment Length Summary
The fragment length summary section lists the average length of the sequenced fragment
for the selected sample, the minimum fragment length, the maximum fragment length,
and the range of variability listed as standard deviation. To account for potential
outliers, the minimum and maximum are calculated from values within ~ 3 standard
deviations, excluding the lower and upper 0.15% of the data.
Duplicate Information
The duplicate information section lists the percentage of clusters for a paired-end
sequencing run that are considered to be PCR duplicates. PCR duplicates are defined as
2 clusters from a paired-end run where both clusters have the exact same alignment
positions for each read.
Analysis Output Files
The following analysis output files are generated for the DNA Enrichment analysis
module and provide analysis results for alignment and variant calling. Analysis output
files are located in the Alignment folder.
File Name                                                    Description
Demultiplexing (*.demux)                                     Intermediate files containing demultiplexing results.
FASTQ (*.fastq.gz)                                           Intermediate files containing quality scored base calls. FASTQ files are the primary input for the alignment step.
Alignment files in the BAM format (*.bam)                    Contains aligned reads for a given sample.
Per-Pool variant call files in the VCF format (*.vcf)        Contains variants called at each position from either the forward pool or the reverse pool.
Variant call files in the genome VCF format (*.genome.vcf)   Contains the genotype for each position, whether called as a variant or called as a reference.
Consensus variant call files in the VCF format (*.vcf)       Contains variants called at each position from both pools.
AmpliconCoverage_M1.tsv                                      Contains information about coverage per amplicon per sample for each manifest provided. M# represents the manifest number.
Demultiplexing File Format
The process of demultiplexing reads the index sequence attached to each cluster to
determine from which sample the cluster originated. The mapping between clusters and
sample numbers is written to 1 demultiplexing (*.demux) file for each tile of the flow
cell.
The demultiplexing file naming format is s_1_X.demux, where X is the tile number.
Demultiplexing files start with a header:
• Version (4 byte integer), currently 1
• Cluster count (4 byte integer)
The remainder of the file consists of sample numbers for each cluster from the tile.
When the demultiplexing step is complete, the software generates a demultiplexing file
named DemultiplexSummaryF1L1.txt.
• In the file name, F1 represents the flow cell number.
• In the file name, L1 represents the lane number.
• Demultiplexing results in a table with 1 row per tile and 1 column per sample,
  including sample 0.
• The most commonly occurring sequences in index reads.
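The 8-byte header of a *.demux file, as described above, can be read with the Python `struct` module. This is a sketch under assumptions that this guide does not state: the integers are taken to be little-endian and signed, and the function name is hypothetical. The per-cluster sample numbers that follow the header are not decoded because their byte width is not given here.

```python
import struct


def read_demux_header(data: bytes):
    """Return (version, cluster_count) from the first 8 bytes of a *.demux file.

    Assumption: two little-endian signed 4-byte integers.
    """
    if len(data) < 8:
        raise ValueError("file too short for a demux header")
    version, cluster_count = struct.unpack_from("<ii", data, 0)
    return version, cluster_count
```

In practice the bytes would come from opening a file such as s_1_X.demux in binary mode and reading its first 8 bytes.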
FASTQ File Format
FASTQ is a text-based file format that contains base calls and quality values per read.
Each record contains 4 lines:
• The identifier
• The sequence
• A plus sign (+)
• The quality scores in an ASCII encoded format
The identifier is formatted as:
@Instrument:RunID:FlowCellID:Lane:Tile:X:Y ReadNum:FilterFlag:0:SampleNumber
Example:
@SIM:1:FCX:1:15:6329:1045 1:N:0:2
TCGCACTCAACGCCCTGCATATGACAAGACAGAATC
+
<>;##=><9=AAAAAAAAAA9#:<#<;<<<????#=
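The identifier layout shown above can be split into named fields. The following is a minimal sketch with hypothetical class and function names. It assumes that the identifier follows the format exactly, with a single space between the two colon-separated groups.

```python
from typing import NamedTuple


class FastqId(NamedTuple):
    instrument: str
    run_id: str
    flow_cell_id: str
    lane: int
    tile: int
    x: int
    y: int
    read_num: int
    filter_flag: str
    sample_number: int


def parse_fastq_identifier(line: str) -> FastqId:
    """Parse '@Instrument:RunID:FlowCellID:Lane:Tile:X:Y ReadNum:FilterFlag:0:SampleNumber'."""
    if not line.startswith("@"):
        raise ValueError("identifier line must start with '@'")
    left, right = line[1:].strip().split(" ", 1)
    instrument, run_id, flow_cell_id, lane, tile, x, y = left.split(":")
    read_num, filter_flag, _zero, sample_number = right.split(":")
    return FastqId(instrument, run_id, flow_cell_id, int(lane), int(tile),
                   int(x), int(y), int(read_num), filter_flag, int(sample_number))
```

Applied to the example record, the parser reports tile 15, read 1, and sample number 2.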
BAM File Format
A BAM file (*.bam) is the compressed binary version of a SAM file that is used to
represent aligned sequences up to 128 Mb. SAM and BAM formats are described in
detail at https://samtools.github.io/hts-specs/SAMv1.pdf.
BAM files use the file naming format of SampleName_S#.bam, where # is the sample
number determined by the order that samples are listed for the run.
BAM files contain a header section and an alignments section:
• Header: Contains information about the entire file, such as sample name, sample
  length, and alignment method. Alignments in the alignments section are associated
  with specific information in the header section.
• Alignments: Contains read name, read sequence, read quality, alignment
  information, and custom tags. The read name includes the chromosome, start
  coordinate, alignment quality, and the match descriptor string.
The alignments section includes the following information for each read or read pair:
• RG: Read group, which indicates the number of reads for a specific sample.
• BC: Barcode tag, which indicates the demultiplexed sample ID associated with the
  read.
• SM: Single-end alignment quality.
• AS: Paired-end alignment quality.
• NM: Edit distance tag, which records the Levenshtein distance between the read and
  the reference.
• XN: Amplicon name tag, which records the amplicon tile ID associated with the
  read.
BAM files are suitable for viewing with an external viewer such as IGV or the UCSC
Genome Browser.
BAM index files (*.bam.bai) provide an index of the corresponding BAM file.
VCF File Format
VCF is a widely used file format developed by the genomics scientific community that
contains information about variants found at specific positions in a reference genome.
VCF files use the file naming format SampleName_S#.vcf, where # is the sample number
determined by the order that samples are listed for the run.
VCF File Header: Includes the VCF file format version and the variant caller version.
The header lists the annotations used in the remainder of the file. If MARS is listed, the
Illumina internal annotation algorithm annotated the VCF file. The VCF header includes
the reference genome file and BAM file. The last line in the header contains the column
headings for the data lines.
VCF File Data Lines: Each data line contains information about a single variant.
VCF File Headings
Heading
Description
CHROM
The chromosome of the reference genome. Chromosomes appear in
the same order as the reference FASTA file.
POS
The single-base position of the variant in the reference chromosome.
For SNPs, this position is the reference base with the variant; for insertions
or deletions, this position is the reference base immediately before the
variant.
ID
The rs number for the SNP obtained from dbSNP.txt, if applicable.
If there are multiple rs numbers at this location, the list is semicolon
delimited. If no dbSNP entry exists at this position, a missing value
marker ('.') is used.
REF
The reference genotype. For example, a deletion of a single T is
represented as reference TT and alternate T. An A to T single nucleotide
variant is represented as reference A and alternate T.
ALT
The alleles that differ from the reference read.
For example, an insertion of a single T is represented as reference A and
alternate AT. An A to T single nucleotide variant is represented as
reference A and alternate T.
QUAL
A Phred-scaled quality score assigned by the variant caller.
Higher scores indicate higher confidence in the variant and lower
probability of errors. For a quality score of Q, the estimated probability
of an error is 10^(-Q/10). For example, the set of Q30 calls has a 0.1% error
rate. Many variant callers assign quality scores based on their statistical
models, which are high in relation to the error rate observed.
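The relationship between a Phred-scaled quality score and the estimated error probability described for QUAL can be written as a pair of conversions. The function names below are hypothetical.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Estimated probability of an error for quality score Q: 10^(-Q/10)."""
    return 10.0 ** (-q / 10.0)


def error_probability_to_phred(p: float) -> float:
    """Inverse conversion: Q = -10 * log10(p), for 0 < p <= 1."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -10.0 * math.log10(p)
```

A score of Q30 corresponds to an error probability of 0.001, which is the 0.1% error rate given in the QUAL description.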
VCF File Annotations
Heading
Description
FILTER
If all filters are passed, PASS is written in the filter column.
• LowDP—Applied to sites with depth of coverage below a cutoff.
• LowGQ—The genotyping quality (GQ) is below a cutoff.
• LowQual—The variant quality (QUAL) is below a cutoff.
• LowVariantFreq—The variant frequency is less than the given
threshold.
• R8—For an indel, the number of adjacent repeats (1-base or 2-base)
in the reference is greater than 8.
• SB—The strand bias is more than the given threshold. Used with the
Somatic Variant Caller and GATK.
INFO
Possible entries in the INFO column include:
• AC—Allele count in genotypes for each ALT allele, in the same order
as listed.
• AF—Allele Frequency for each ALT allele, in the same order as listed.
• AN—The total number of alleles in called genotypes.
• CD—A flag indicating that the SNP occurs within the coding region
of at least 1 RefGene entry.
• DP—The depth (number of base calls aligned to a position and used
in variant calling).
• Exon—A comma-separated list of exon regions read from RefGene.
• FC—Functional Consequence.
• GI—A comma-separated list of gene IDs read from RefGene.
• QD—Variant Confidence/Quality by Depth.
• TI—A comma-separated list of transcript IDs read from RefGene.
FORMAT
The format column lists fields separated by colons. For example,
GT:GQ. The list of fields provided depends on the variant caller used.
Available fields include:
• AD—Entry of the form X,Y, where X is the number of reference calls,
and Y is the number of alternate calls.
• DP—Approximate read depth; reads with MQ=255 or with bad mates
are filtered.
• GQ—Genotype quality.
• GQX—Genotype quality. GQX is the minimum of the GQ value and
the QUAL column. In general, these values are similar; taking the
minimum makes GQX the more conservative measure of genotype
quality.
• GT—Genotype. 0 corresponds to the reference base, 1 corresponds
to the first entry in the ALT column, and so on. The forward slash (/)
indicates that no phasing information is available.
• NL—Noise level; an estimate of base calling noise at this position.
• PL—Normalized, Phred-scaled likelihoods for genotypes.
• SB—Strand bias at this position. Larger negative values indicate less
bias; values near 0 indicate more bias. Used with the Somatic Variant
Caller and GATK.
• VF—Variant frequency; the percentage of reads supporting the
alternate allele.
SAMPLE
The sample column gives the values specified in the FORMAT column.
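To show how the FORMAT and SAMPLE columns fit together, the sketch below splits one tab-separated VCF data line and pairs each FORMAT field with its SAMPLE value. It assumes a single sample column and a numeric QUAL value. The function names and the example line in the usage note are hypothetical, not output from the module.

```python
def parse_vcf_line(line: str) -> dict:
    """Split a single-sample VCF data line into its columns."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt, qual, flt, info, fmt, sample = fields[:10]
    return {
        "CHROM": chrom,
        "POS": int(pos),
        "ID": var_id,
        "REF": ref,
        "ALT": alt.split(","),
        "QUAL": float(qual),  # assumption: QUAL is numeric, not the '.' marker
        "FILTER": flt,
        "INFO": info,
        # FORMAT keys and SAMPLE values are both colon-separated, in the same order.
        "SAMPLE": dict(zip(fmt.split(":"), sample.split(":"))),
    }


def gqx(gq: float, qual: float) -> float:
    """GQX is the minimum of the GQ value and the QUAL column."""
    return min(gq, qual)
```

For a made-up line such as `chr1<TAB>12345<TAB>.<TAB>A<TAB>T<TAB>48.0<TAB>PASS<TAB>DP=60<TAB>GT:GQ:VF<TAB>0/1:45:0.48`, the SAMPLE dictionary maps GT to 0/1, and `gqx(45.0, 48.0)` returns 45.0.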
Genome VCF Files
Genome VCF (gVCF) files are VCF v4.1 files that follow a set of conventions for
representing all sites within the genome in a reasonably compact format. The gVCF files
include all sites within the region of interest in a single file for each sample.
The gVCF file shows no-calls at positions with low coverage, or where a low-frequency
variant (< 3%) occurs often enough (> 1%) that the position cannot be called to the
reference. A genotype (GT) tag of ./. indicates a no-call.
For more information, see sites.google.com/site/gvcftools/home/about-gvcf.
Coverage File Format
Coverage files can be copied into a spreadsheet program such as Microsoft Excel for
viewing, sorting, or graphing.
Coverage files contain a header section and a data section:
• Header—Contains 1 line per targeted region that begins with a # character.
  • The first header line specifies the enrichment, which is defined as the fraction of
    aligned reads overlapping any of the targeted regions.
  • The second header line specifies the number of reads aligning to targeted regions.
  • The third header line specifies the column headings for the data section.
• Data—The data section includes the following information.
Column Heading | Description
Chromosome | Contains the chromosome of the targeted region.
Start | Contains the start position of the targeted region.
Stop | Contains the stop position of the targeted region.
RegionID | Contains the identity of the region as specified in the manifest.
MeanCoverage | Contains the mean coverage. Only reads mapped as proper pairs count toward the coverage calculation if the run is a paired-end run.
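For scripted inspection rather than a spreadsheet, a coverage file can be read with a few lines of Python. This is a sketch under assumptions: the file is taken to be tab-delimited, the last line beginning with # is taken to hold the column headings, and the sample text (including its first two header lines) is invented for illustration.

```python
def read_coverage(text, delimiter="\t"):
    """Split coverage-file text into header lines and per-region records."""
    header_lines = []
    data_rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            header_lines.append(line[1:].strip())
        else:
            data_rows.append(line.split(delimiter))
    # Assumption: the final header line lists the data column headings.
    columns = header_lines[-1].split(delimiter) if header_lines else []
    records = [dict(zip(columns, row)) for row in data_rows]
    return header_lines, records


# Hypothetical file content, for illustration only.
sample_text = (
    "#Enrichment\t0.62\n"
    "#Reads aligned to targets\t1200\n"
    "#Chromosome\tStart\tStop\tRegionID\tMeanCoverage\n"
    "chr1\t100\t200\tregionA\t152.3\n"
)
header_lines, records = read_coverage(sample_text)
print(records[0]["RegionID"], float(records[0]["MeanCoverage"]))
```

Values are returned as strings, so convert MeanCoverage to a number before sorting or graphing.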
Gaps File Format
Given a depth threshold, a gap is defined as a consecutive run of bases in which all
bases have coverage less than the threshold. It is in these regions that variants are
filtered due to low depth. The gaps file lists all gaps identified in any targeted region.
Gaps files contain a header section and a data section:
• Header—The header section specifies the column headings for the data section.
• Data—The data section includes the following information.
Column Heading | Description
Chromosome | Contains the chromosome of the targeted region.
GapStart | Contains the first coordinate of the gap.
GapStop | Contains the last coordinate of the gap.
RegionID | Contains the identity of the region as specified in the manifest.
MeanGapCoverage | Contains the mean coverage in the gap region. Only proper pairs are counted in a paired-end run.
RegionInterval | Contains a representation of the targeted interval in a format that can be easily copied and pasted into genome and read browsers.
GapInterval | Contains a representation of the gap interval in a format that can be easily copied and pasted into genome and read browsers.
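The gap definition given above (a consecutive run of bases that all have coverage below the threshold) can be illustrated with a short scan over per-base depths. This is a sketch only, not the module's implementation; the function names, the threshold value, and the depth values are hypothetical.

```python
def _close_gap(coverage, first, last, region_start):
    """Build a (gap_start, gap_stop, mean_coverage) tuple with inclusive coordinates."""
    depths = coverage[first:last + 1]
    return (region_start + first, region_start + last, sum(depths) / len(depths))


def find_gaps(coverage, threshold, region_start=0):
    """Return every run of consecutive bases whose depth is below the threshold."""
    gaps = []
    run_start = None
    for offset, depth in enumerate(coverage):
        if depth < threshold:
            if run_start is None:
                run_start = offset  # a new gap opens here
        elif run_start is not None:
            gaps.append(_close_gap(coverage, run_start, offset - 1, region_start))
            run_start = None
    if run_start is not None:  # gap runs to the end of the region
        gaps.append(_close_gap(coverage, run_start, len(coverage) - 1, region_start))
    return gaps


# Hypothetical per-base depths for a region starting at coordinate 1000.
example_gaps = find_gaps([30, 5, 4, 6, 40, 50, 2, 3], threshold=10, region_start=1000)
print(example_gaps)
```

Each tuple corresponds to the GapStart, GapStop, and MeanGapCoverage columns for one gap.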
Supplementary Output Files
The following output files provide supplementary information, or summarize run results
and analysis errors. Although these files are not required for assessing analysis results,
they can be used for troubleshooting purposes. All files are located in the Alignment
folder unless otherwise specified.
File Name | Description
AdapterTrimming.txt | Lists the number of trimmed bases and percentage of bases for each tile. This file is present only if adapter trimming was specified for the run.
AnalysisLog.txt | Processing log that describes every step that occurred during analysis of the current run folder. This file does not contain error messages. Located in the root level of the run folder.
AnalysisError.txt | Processing log that lists any errors that occurred during analysis. This file is present only if errors occurred. Located in the root level of the run folder.
CompletedJobInfo.xml | Written after analysis is complete, contains information about the run, such as date, flow cell ID, software version, and other parameters. Located in the root level of the run folder.
DemultiplexSummaryF1L1.txt | Reports demultiplexing results in a table with 1 row per tile and 1 column per sample.
ErrorsAndNoCallsByLaneTileReadCycle.csv | A comma-separated values file that contains the percentage of errors and no-calls for each tile, read, and cycle.
Mismatch.htm | Contains histograms of mismatches per cycle and no-calls per cycle for each tile.
EnrichmentStatistics.xml | Contains summary statistics specific to the run. Located in the root level of the run folder.
Summary.xml | Contains a summary of mismatch rates and other base calling results.
Summary.htm | Contains a summary web page generated from Summary.xml.
Analysis Folder
The analysis folder holds the files generated by the Local Run Manager software.
The relationship between the output folder and analysis folder is summarized as follows:
• During sequencing, Real-Time Analysis (RTA) populates the output folder with files
  generated during image analysis, base calling, and quality scoring.
• RTA copies files to the analysis folder in real time. After RTA assigns a quality score
  to each base for each cycle, the software writes the file RTAComplete.xml to both
  folders.
• When the file RTAComplete.xml is present, analysis begins.
• As analysis continues, Local Run Manager writes output files to the analysis folder,
  and then copies the files back to the output folder.
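A script that waits on a run can apply the same signal described above by checking for RTAComplete.xml. The sketch below is illustrative only: the function name is hypothetical, and a temporary directory stands in for a real run folder.

```python
import tempfile
from pathlib import Path


def ready_for_analysis(folder):
    """Return True when RTAComplete.xml is present in the given folder."""
    return (Path(folder) / "RTAComplete.xml").is_file()


# Demonstration with a temporary directory standing in for a run folder.
with tempfile.TemporaryDirectory() as tmp:
    before = ready_for_analysis(tmp)
    (Path(tmp) / "RTAComplete.xml").touch()
    after = ready_for_analysis(tmp)
print(before, after)
```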
Folder Structure
Data
  Intensities
    BaseCalls
      Alignment—Contains *.bam and *.vcf files, and files specific to the
      analysis module.
      L001—Contains one subfolder per cycle, each containing *.bcl files.
      Sample1_S1_L001_R1_001.fastq.gz
      Sample2_S2_L001_R1_001.fastq.gz
      Undetermined_S0_L001_R1_001.fastq.gz
    L001—Contains *.locs files, 1 for each tile.
  RTA Logs—Contains log files from RTA software analysis.
InterOp—Contains binary files used by Sequencing Analysis Viewer (SAV).
Logs—Contains log files describing steps performed during sequencing.
Queued—A working folder for software; also called the copy folder.
AnalysisError.txt
AnalysisLog.txt
CompletedJobInfo.xml
QueuedForAnalysis.txt
[WorkflowName]RunStatistics
RTAComplete.xml
RunInfo.xml
runParameters.xml
Alignment Folders
Each time that analysis is requeued, the Local Run Manager creates an Alignment folder
named AlignmentN, where N is a sequential number.
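When a run has been requeued several times, a script may need the most recent results. The sketch below picks the folder with the highest N among folders named AlignmentN. The function name is hypothetical, the parent folder is passed in by the caller, and only numbered folder names are matched, which is an assumption about the naming.

```python
import re
import tempfile
from pathlib import Path

_ALIGNMENT_NAME = re.compile(r"^Alignment(\d+)$")


def latest_alignment(parent):
    """Return the AlignmentN folder with the highest N, or None if there is none."""
    best = None
    for child in Path(parent).iterdir():
        match = _ALIGNMENT_NAME.match(child.name)
        if match and child.is_dir():
            number = int(match.group(1))
            if best is None or number > best[0]:
                best = (number, child)
    return best[1] if best else None


# Demonstration with temporary folders standing in for real analysis output.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("Alignment1", "Alignment2", "Alignment10"):
        (Path(tmp) / name).mkdir()
    latest_name = latest_alignment(tmp).name
print(latest_name)
```

Comparing N as an integer avoids the text-sort pitfall in which Alignment10 sorts before Alignment2.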
Custom Analysis Settings
Custom analysis settings are intended for technically advanced users. If settings are
applied incorrectly, serious problems can occur.
Add a Custom Analysis Setting
1. From the Module-Specific Settings section of the Create Run screen, click Show
   advanced module settings.
2. Click Add custom setting.
3. In the custom setting field, enter the setting name as listed in the Available Analysis
   Settings section.
4. In the setting value field, enter the setting value.
5. To remove a setting, click the x icon.
Available Analysis Settings
• Adapter Trimming—By default, adapter trimming is enabled in the DNA
  Enrichment analysis module. To specify a different adapter, use the Adapter setting.
  The same adapter sequence is trimmed for Read 1 and Read 2.
  • To specify 2 adapter sequences, separate the sequences with a plus (+) sign.
  • To specify a different adapter sequence for Read 2, use the AdapterRead2 setting.
Setting Name | Setting Value
Adapter | Enter the sequence of the adapter to be trimmed.
AdapterRead2 | Enter the sequence of the adapter to be trimmed.
• Quality Score Trim—The BWA alignment algorithm automatically trims the 3' ends
  of non-indexed reads with low quality scores. By default, the value is set to 15.
Setting Name | Setting Value
QualityScoreTrim | Enter a value greater than 0.
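To give a sense of what 3' quality trimming does, the following sketch follows the running-sum approach used by the quality trimming option of BWA: scanning from the 3' end, it accumulates (threshold minus base quality) and keeps the read up to the point where that sum peaks. Whether the analysis module applies exactly this routine is an assumption, and the function name, the min_len parameter, and the quality values are hypothetical.

```python
def trim_length(quals, threshold=15, min_len=0):
    """Return how many 5' bases to keep after 3' quality trimming.

    quals holds one Phred quality score per base, in read order.
    """
    running = 0
    best = 0
    keep = len(quals)
    for i in range(len(quals) - 1, min_len - 1, -1):
        running += threshold - quals[i]
        if running < 0:
            break  # high-quality bases now outweigh the low-quality tail
        if running > best:
            best = running
            keep = i  # trimming here removes bases i through the 3' end
    return keep


# Hypothetical quality scores: a good read with a low-quality 3' tail.
kept = trim_length([30, 30, 30, 30, 10, 12, 8], threshold=15)
print(kept)
```

With these values the three low-quality bases at the 3' end are removed and the first four bases are kept. A read with uniformly high scores is left untrimmed.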
• Variant Frequency—Filters variants with a frequency less than the specified
  threshold. If using the Somatic Variant Caller, adjust the value for this setting on the
  Create Run screen.
Setting Name | Setting Value
VariantFrequencyFilterCutoff | Enter a threshold value. With the Somatic Variant Caller, the default value is 0.05. With GATK or Starling, the default value is 0.20.
• Indel Repeat Cutoff—Filters insertions and deletions when the reference has a
  1-base or 2-base motif over 8 times (by default) next to the variant. If using the Somatic
  Variant Caller, enable or disable this setting on the Create Run screen.
Setting Name | Setting Value
IndelRepeatFilterCutoff | Enter a threshold value. The default value is 8.
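The repeat check can be pictured as counting tandem copies of a 1-base or 2-base motif in the reference beside the indel. The sketch below is one possible reading and not the module's implementation: it looks only to the right of a 0-based position, treats "over 8 times" as a count strictly greater than the cutoff, and uses hypothetical function names and sequences.

```python
def max_adjacent_repeat(reference, pos, motif_lengths=(1, 2)):
    """Largest number of tandem copies of a short motif starting at pos."""
    best = 0
    for k in motif_lengths:
        motif = reference[pos:pos + k]
        if len(motif) < k:
            continue  # not enough sequence left for a motif of this length
        count = 0
        i = pos
        while reference[i:i + k] == motif:
            count += 1
            i += k
        best = max(best, count)
    return best


def fails_indel_repeat_filter(reference, pos, cutoff=8):
    """Assumption: the variant is filtered when the repeat count exceeds the cutoff."""
    return max_adjacent_repeat(reference, pos) > cutoff


# Hypothetical reference sequences, for illustration only.
homopolymer = "GGC" + "A" * 10 + "TCG"
dinucleotide = "ACGT" + "AC" * 4 + "G"
print(fails_indel_repeat_filter(homopolymer, 3), fails_indel_repeat_filter(dinucleotide, 4))
```

In the first sequence a run of 10 A bases exceeds the default cutoff of 8. In the second, 4 copies of AC do not.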
• Variant Genotyping Quality—Filters variants with a genotype quality (GQ) less than
  the specified threshold.
Setting Name | Setting Value
VariantMinimumGQCutoff | Enter a value less than 99. With GATK or Somatic Variant Caller, the default value is 30. With Starling, the default value is 20.
• Variant Quality Cutoff—Filters variants with a quality (QUAL) less than the
  specified threshold. QUAL indicates the confidence of the variant call.
Setting Name | Setting Value
VariantMiniumQualCutoff | Enter a threshold value. With GATK or Somatic Variant Caller, the default value is 30. With Starling, the default value is 20.
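The three threshold settings above can be summarized in a small sketch that reports which cutoffs a variant falls below. The default values come from the tables in this section; everything else is an assumption made for illustration, including the function name, the dictionary layout, the treatment of variant frequency as a fraction, and the example variant.

```python
# Defaults as listed in this section. Setting names are spelled as in the tables above.
DEFAULTS = {
    "Somatic Variant Caller": {
        "VariantFrequencyFilterCutoff": 0.05,
        "VariantMinimumGQCutoff": 30,
        "VariantMiniumQualCutoff": 30,
    },
    "GATK": {
        "VariantFrequencyFilterCutoff": 0.20,
        "VariantMinimumGQCutoff": 30,
        "VariantMiniumQualCutoff": 30,
    },
    "Starling": {
        "VariantFrequencyFilterCutoff": 0.20,
        "VariantMinimumGQCutoff": 20,
        "VariantMiniumQualCutoff": 20,
    },
}


def failed_filters(variant, caller, overrides=None):
    """List the settings whose threshold the variant falls below."""
    settings = {**DEFAULTS[caller], **(overrides or {})}
    checks = (
        ("VF", "VariantFrequencyFilterCutoff"),
        ("GQ", "VariantMinimumGQCutoff"),
        ("QUAL", "VariantMiniumQualCutoff"),
    )
    return [name for key, name in checks if variant[key] < settings[name]]


# Hypothetical variant, with VF expressed as a fraction.
variant = {"VF": 0.12, "GQ": 45, "QUAL": 25}
gatk_result = failed_filters(variant, "GATK")
somatic_result = failed_filters(variant, "Somatic Variant Caller")
custom_result = failed_filters(variant, "GATK", {"VariantFrequencyFilterCutoff": 0.10})
print(gatk_result, somatic_result, custom_result)
```

The overrides argument plays the role of a custom analysis setting: lowering VariantFrequencyFilterCutoff to 0.10 lets the example variant pass the frequency filter while it still fails the quality cutoff.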
Technical Assistance
For technical assistance, contact Illumina Technical Support.
Table 10 Illumina General Contact Information
Website | www.illumina.com
Email | [email protected]
Table 11 Illumina Customer Support Telephone Numbers
Region | Contact Number | Region | Contact Number
North America | 1.800.809.4566 | Japan | 0800.111.5011
Australia | 1.800.775.688 | Netherlands | 0800.0223859
Austria | 0800.296575 | New Zealand | 0800.451.650
Belgium | 0800.81102 | Norway | 800.16836
China | 400.635.9898 | Singapore | 1.800.579.2745
Denmark | 80882346 | Spain | 900.812168
Finland | 0800.918363 | Sweden | 020790181
France | 0800.911850 | Switzerland | 0800.563118
Germany | 0800.180.8994 | Taiwan | 00806651752
Hong Kong | 800960230 | United Kingdom | 0800.917.0041
Ireland | 1.800.812949 | Other countries | +44.1799.534000
Italy | 800.874909 | |
Safety data sheets (SDSs)—Available on the Illumina website at
support.illumina.com/sds.html.
Product documentation—Available for download in PDF from the Illumina website. Go
to support.illumina.com, select a product, then select Documentation & Literature.
Illumina
5200 Illumina Way
San Diego, California 92122 U.S.A.
+1.800.809.ILMN (4566)
+1.858.202.4566 (outside North America)
[email protected]
www.illumina.com