TruSeq Amplicon v2.0 BaseSpace App Guide

For Research Use Only. Not for use in diagnostic procedures.

Introduction
Workflow Diagram
Set Analysis Parameters
Analysis Methods
Analysis Output
Revision History
Technical Assistance

ILLUMINA PROPRIETARY
Document # 15055857 v02
January 2016

This document and its contents are proprietary to Illumina, Inc. and its affiliates ("Illumina"), and are intended solely for the contractual use of its customer in connection with the use of the product(s) described herein and for no other purpose. This document and its contents shall not be used or distributed for any other purpose and/or otherwise communicated, disclosed, or reproduced in any way whatsoever without the prior written consent of Illumina. Illumina does not convey any license under its patent, trademark, copyright, or common-law rights nor similar rights of any third parties by this document.

The instructions in this document must be strictly and explicitly followed by qualified and properly trained personnel in order to ensure the proper and safe use of the product(s) described herein. All of the contents of this document must be fully read and understood prior to using such product(s).

FAILURE TO COMPLETELY READ AND EXPLICITLY FOLLOW ALL OF THE INSTRUCTIONS CONTAINED HEREIN MAY RESULT IN DAMAGE TO THE PRODUCT(S), INJURY TO PERSONS, INCLUDING TO USERS OR OTHERS, AND DAMAGE TO OTHER PROPERTY.

ILLUMINA DOES NOT ASSUME ANY LIABILITY ARISING OUT OF THE IMPROPER USE OF THE PRODUCT(S) DESCRIBED HEREIN (INCLUDING PARTS THEREOF OR SOFTWARE).

© 2016 Illumina, Inc. All rights reserved.
Illumina, 24sure, BaseSpace, BeadArray, BlueFish, BlueFuse, BlueGnome, cBot, CSPro, CytoChip, DesignStudio, Epicentre, ForenSeq, Genetic Energy, GenomeStudio, GoldenGate, HiScan, HiSeq, HiSeq X, Infinium, iScan, iSelect, MiSeq, MiSeqDx, MiSeq FGx, NeoPrep, NextBio, Nextera, NextSeq, Powered by Illumina, SureMDA, TruGenome, TruSeq, TruSight, Understand Your Genome, UYG, VeraCode, verifi, VeriSeq, the pumpkin orange color, and the streaming bases design are trademarks of Illumina, Inc. and/or its affiliate(s) in the U.S. and/or other countries. All other names, logos, and other trademarks are the property of their respective owners.

Introduction

The BaseSpace® App, TruSeq Amplicon v2.0, analyzes DNA enriched for particular target sequences. The app aligns amplicon reads against the reference specified in the manifest file, and then performs variant analysis. Variants are called for the targeted regions. Statistics reporting accumulates coverage statistics for each target and overall metrics.

Compatible Libraries
See the BaseSpace support page for a list of library types that are compatible with the TruSeq Amplicon v2.0 App.

Workflow Requirements
• The minimum read length is 50 bases.
• No minimum number of reads is required. However, use sufficient data for each sample to support an appropriate depth of coverage for variant calling.
• Only 1 manifest is used per analysis.
• Paired-end sequencing data is required.
• Sequenced samples have the same read lengths.
• A maximum of 96 samples is possible per analysis.

Versions
The following components are used in the TruSeq Amplicon v2.0 App.

Software | Version
TruSeq Amplicon (BaseSpace Workflow) | 2.0.0
Isis (Analysis Software) | 2.6.21.7
SAMtools | 1.2
Somatic Variant Caller | 4.0.13.1
Starling (Isaac Variant Caller) | 2.1.4.1
GATK (Variant Caller) | v1.6-23-gf0210b3
IONA (Illumina On-Node Annotation) | 1.0.10.37
MarkDuplicates | 1.0.1
mono | 3.12.1

This app was validated with data from MiSeq and NextSeq systems.
Reference Genomes
• Human, UCSC hg19. The human reference genome is PAR-Masked, which means that the Y chromosome sequence has the Pseudo Autosomal Regions (PAR) masked (set to N) to avoid mismapping of reads in the duplicate regions of sex chromosomes.
• Cow, UMD3.1
• Mouse, mm9
• Rat, rn4

NOTE
When using a custom manifest file, specify a reference genome in the Build ID column. Reference genome names are case sensitive.

Workflow Diagram

Figure 1 TruSeq Amplicon v2.0 App Workflow

Set Analysis Parameters

1 In BaseSpace, click the Apps tab.
2 Click TruSeq Amplicon.
3 From the drop-down list, select version 2.0, and then click Launch to open the app.
4 In the Analysis Name field on the app input form, enter the analysis name. By default, the analysis name includes the app name, followed by the date and time that the analysis session starts.
5 From the Save Results To field, select the project that stores the app results.
6 From the Sample(s) field, browse to the sample you want to analyze and select the checkbox. You can select multiple samples.
7 From the Targeted Amplicons field, select a panel of targeted amplicons representative of the selected samples.
8 If you selected Custom Panel in the Targeted Amplicons drop-down list, upload a custom manifest and select the manifest file from the Custom Manifest File field. Upload a custom manifest as follows.
   a Navigate to your project in BaseSpace.
   b Click Import.
   c Follow the instructions to add the Custom Amplicon manifest file (*.txt) to the project.
   NOTE
   Specify the reference genome in the header of the manifest file. See Reference Genomes on page 3.
9 From the Variant Caller field, select a variant caller. For tumor samples, use the Somatic Variant Caller.
10 If using the somatic variant caller, specify the Somatic Variant Caller Threshold (percentage).
Set to 5 by default. Variants with a frequency below the specified threshold are not reported in VCF files. Lower threshold values can result in false positive variants.
11 Set the Read Stitching option. When enabled, reads that overlap by > 10 bases between Read 1 and Read 2 are combined to create a single (longer) read for alignment.
12 From the Annotation field, select a preferred gene and transcript annotation reference database.
13 Click Continue. The TruSeq Amplicon v2.0 App begins analysis of the sample. When analysis is complete, the status of the app session is updated automatically and an email is sent to notify you.

Analysis Methods

The TruSeq Amplicon v2.0 workflow evaluates short regions of amplified DNA, or amplicons, for variants. Focused sequencing of amplicons enables high coverage of particular regions across many samples.

Alignment
During the alignment step, the banded Smith-Waterman algorithm aligns clusters from each sample against amplicon sequences specified in the manifest file.

The banded Smith-Waterman algorithm performs local sequence alignments to determine similar regions between 2 sequences. Instead of comparing the total sequence, the Smith-Waterman algorithm compares segments of all possible lengths. Local alignments are useful for dissimilar sequences that are suspected to contain regions of similarity within the larger sequence. This process allows alignment across small amplicon targets, often less than 10 bp.

Each paired-end read is evaluated in terms of its alignment to the relevant probe sequences for that read.
• Read 1 is evaluated against the reverse complement of the Downstream Locus-Specific Oligos (DLSO).
• Read 2 is evaluated against the Upstream Locus-Specific Oligos (ULSO).
• If the start of a read matches a probe sequence with no more than 1 mismatch, the full length of the read is aligned against the amplicon target for that sequence.
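The probe check in the list above can be illustrated with a minimal sketch. It is not the app's implementation: the function names are hypothetical, and a simple position-by-position comparison stands in for the banded Smith-Waterman alignment that the app uses. Only the two rules stated above are modeled: Read 1 is compared with the reverse complement of the DLSO, Read 2 with the ULSO, and at most 1 mismatch is tolerated at the start of the read.

```python
# Hypothetical helpers; a simplified stand-in for the app's probe check.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def starts_with_probe(read: str, probe: str, max_mismatches: int = 1) -> bool:
    """True if the start of the read matches the probe with at most
    max_mismatches substitutions (indels are not modeled here)."""
    if len(read) < len(probe):
        return False
    # zip stops at the end of the probe, so only the read prefix is compared.
    mismatches = sum(1 for r, p in zip(read.upper(), probe.upper()) if r != p)
    return mismatches <= max_mismatches


def read1_matches_dlso(read1: str, dlso: str) -> bool:
    """Read 1 is evaluated against the reverse complement of the DLSO."""
    return starts_with_probe(read1, reverse_complement(dlso))


def read2_matches_ulso(read2: str, ulso: str) -> bool:
    """Read 2 is evaluated against the ULSO as given."""
    return starts_with_probe(read2, ulso)
```

A read that passes this check would then be aligned in full against the amplicon target for that probe; that alignment step is outside the scope of the sketch.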
Alignments that include more than 3 indels are filtered from alignment results. Filtered alignments are written in alignment files as unaligned and are not used in variant calling.

Variant Calling
Variant calling is performed with one of the following variant calling options.
• GATK
• Isaac Variant Caller
• Somatic Variant Caller

GATK
Developed by the Broad Institute, the Genome Analysis Toolkit (GATK) first calls raw variants for each sample read. Then GATK analyzes the variants against known variants, and applies a calibration procedure to compute a false discovery rate for each variant. Variants are flagged as homozygous (1/1) or heterozygous (0/1) in the VCF file sample column.

The GATK best practices served as guidelines for the app; they are described here: www.broadinstitute.org/gatk/guide/topic?name=best-practices. For more information about GATK, see www.broadinstitute.org/gatk.

DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell TJ, Kernytsky AM, Sivachenko AY, Cibulskis K, Gabriel SB, Altshuler D, Daly MJ (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 43(5): 491-8.

Starling (Isaac Variant Caller)
Starling identifies single nucleotide variants (SNVs) and small indels using the following steps:
• Read filtering: Filters out reads failing quality checks.
• Indel calling: Identifies a set of possible indel candidates and realigns all reads overlapping the candidates using a multiple sequence aligner.
• SNV calling: Computes the probability of each possible genotype given the aligned read data and a prior distribution of variation in the genome.
• Indel genotypes: Calls indel genotypes and assigns probabilities.
• Variant call output: Generates output in a VCF file and a compressed genome variant call (gVCF) file.
See VCF File Format on page 12 and Genome VCF Files on page 14.

Read Filtering
Input reads are filtered by removing any of the following reads:
• Reads that failed base calling quality checks
• Paired-end reads not marked as a proper pair
• Reads with a mapping quality < 20

Indel Calling
The variant caller proceeds with candidate indel discovery and generates alternate read alignments based on the candidate indels. As part of the realignment process, the variant caller selects a representative alignment to be used for site genotype calling and depth summarization by the SNV caller.

SNV Calling
The variant caller runs a series of filters on the set of filtered and realigned reads for SNV calling without affecting indel calls. First, any contiguous trailing sequence of N base calls is trimmed from the ends of reads. Using a mismatch density filter, reads having an unexpectedly high number of disagreements with the reference are masked, as follows:
• The variant caller treats each insertion or deletion as a single mismatch.
• Base calls with more than 2 mismatches to the reference sequence within 20 bases of the call are ignored.
• If the call occurs within the first or last 20 bases of a read, the mismatch limit is applied to a 41-base window at the corresponding end of the read.
• The mismatch limit is applied to the entire read when the read length is 41 or shorter.

Indel Genotypes
The variant caller filters out all bases marked by the mismatch density filter and any N base calls that remain after the end-trimming step. These filtered base calls are not used for site-genotyping but appear in the filtered base call counts in the variant caller output for each site.

All remaining base calls are used for site-genotyping. The genotyping method heuristically adjusts the joint error probability that is calculated from multiple observations of the same allele on each strand of the genome. This correction accounts for the possibility of error dependencies.
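The window rules of the mismatch density filter described under SNV Calling can be sketched as follows. This is an illustration under stated assumptions, not Starling's code: the function name is hypothetical, the input is assumed to be one boolean flag per read position (True where the read disagrees with the reference, with each indel already collapsed to a single flag), and the window is assumed to include the base call itself.

```python
from typing import List


def mismatch_density_mask(
    mismatch: List[bool], flank: int = 20, max_mismatches: int = 2
) -> List[bool]:
    """Return one flag per base call: True if the call would be ignored.

    Assumptions (not confirmed by the guide): the window counts the call
    itself, and indels were already reduced to single mismatch flags.
    """
    n = len(mismatch)
    window = 2 * flank + 1  # 41 bases for the default flank of 20
    masked = []
    for i in range(n):
        if n <= window:
            # Short read: the limit applies to the entire read.
            start, end = 0, n
        elif i < flank:
            # Call in the first 20 bases: 41-base window at the read start.
            start, end = 0, window
        elif i >= n - flank:
            # Call in the last 20 bases: 41-base window at the read end.
            start, end = n - window, n
        else:
            # Interior call: 20 bases on either side of the call.
            start, end = i - flank, i + flank + 1
        masked.append(sum(mismatch[start:end]) > max_mismatches)
    return masked
```

For a 100-base read with mismatches at positions 50, 52, and 54 (0-based), the sketch masks the calls whose window contains all three mismatches and leaves the rest of the read available for site genotyping.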
Variant Call Output
After the SNV and indel genotyping methods are complete, the variant caller applies a final set of heuristic filters to produce the final set of calls in the output.

The output in the genome variant call (gVCF) file captures the genotype at each position and the probability that the consensus call differs from reference. This score is expressed as a Phred-scaled quality score.

Somatic Variant Caller
Developed by Illumina, the somatic variant caller identifies variants present at low frequency in the DNA sample.

The somatic variant caller identifies SNPs in 3 steps:
• Considers each position in the reference genome separately
• Counts bases at the given position for aligned reads that overlap the position
• Computes a variant score that measures the quality of the call using a Poisson model

Variants are first called for each pool separately. Then, variants from each pool are compared and combined into a single output file. If a variant meets the following criteria, the variant is marked as PASS in the variant call (VCF) file:
• The variant is present in both pools
• The variant has a cumulative depth of 1000 or an average depth of 500x per pool
• The variant has a variant frequency of ≥ 3% as reported in the merged VCF file

A locus for a mutation or reference is classified as a no call under the following conditions:
• The variant frequency is near the signal noise level, between 1% and 2.6%
• The variant quality is < Q30
• The depth is < 500
• Significant strand bias is detected
• The indel occurs in a homopolymer region

Illumina On-Node Annotation (IONA)
Annotation with IONA populates several values in the VCF file, including dbSNP ID (in the ID column), and some values in the INFO column. More detailed and extensive annotations are stored in a binary ANT file. This binary file can be imported into VariantStudio. For more information, see www.illumina.com/informatics/research/biological-data-interpretation/variantstudio.html.
Ensembl or RefSeq annotation through IONA is available for alignments against the human reference genome: UCSC build hg19 / Ensembl build GRCh37 / NCBI build 37.2.

In the Starling genotyping method, the correction for error dependencies treats the highest-quality base call from each allele and strand as an independent observation and leaves the associated base call quality scores unmodified. Quality scores for subsequent base calls for each allele and strand are then adjusted. This adjustment is done to increase the joint error probability of the given allele above the error expected from independent base call observations.

Analysis Output

To view the results, click the Projects tab, then the project name, and then the analysis.

Figure 2 Output Navigation Bar

After analysis is complete, access the output through the left navigation bar.
• Analysis Info: Information about the analysis session, including log files.
• Inputs: Lists the samples and settings specified for the analysis session.
• Output Files: Output files for the sample.
• Summary Analysis Report: Analysis metrics for the aggregate results, displayed when multiple samples are analyzed.
• Sample Analysis Reports: Analysis reports for each sample.

Analysis Info
The Analysis Info page displays the analysis settings and execution details.

Row Heading | Definition
Name | Name of the analysis session.
Application | App that generated this analysis.
Date Started | Date and time the analysis session started.
Date Completed | Date and time the analysis session completed.
Duration | Duration of the analysis.
Session Type | Number of nodes used.
Status | Status of the analysis session. The status shows either Running or Complete.

Log Files

File Name | Description
AmpliconRunStatistics.xml | Provides amplicon-related information.
CompletedJobInfo.xml | Contains information about the completed analysis session.
Logging.zip | Contains all detailed log files for each step of the workflow.
RunInfo.xml | Identifies the boundaries of the reads (including index reads).
SampleSheet.csv | Sample sheet.
SampleSheetUsed.csv | A copy of the sample sheet, generated at the beginning of the workflow.
[Manifest name].txt | Manifest files used in the analysis.

Output Files
The Output Files page provides access to the output files for each sample analysis.
• BAM Files: Aligned sequences and quality scores in the BAM (*.bam) file format.
• VCF Files: Variant calls in the VCF (*.vcf) file format.
• Genome VCF Files: Variants, references, and no calls for all sites in the genome in the genome VCF (gVCF) file format.
• Annotation File: Detailed annotations in a binary file format.
• Summary File: Statistics for each sample.

BAM File Format
A BAM file (*.bam) is the compressed binary version of a SAM file that is used to represent aligned sequences up to 128 Mb. SAM and BAM formats are described in detail at https://samtools.github.io/hts-specs/SAMv1.pdf.

BAM files use the file naming format of SampleName_S#.bam, where # is the sample number determined by the order that samples are listed for the run.

BAM files contain a header section and an alignments section:
• Header: Contains information about the entire file, such as sample name, sample length, and alignment method. Alignments in the alignments section are associated with specific information in the header section.
• Alignments: Contains read name, read sequence, read quality, alignment information, and custom tags. The read name includes the chromosome, start coordinate, alignment quality, and the match descriptor string.

The alignments section includes the following information for each read or read pair:
• RG: Read group, which indicates the number of reads for a specific sample.
• BC: Barcode tag, which indicates the demultiplexed sample ID associated with the read.
• SM: Single-end alignment quality.
• AS: Paired-end alignment quality.
• NM: Edit distance tag, which records the Levenshtein distance between the read and the reference.
• XN: Amplicon name tag, which records the amplicon tile ID associated with the read.

BAM index files (*.bam.bai) provide an index of the corresponding BAM file.

VCF File Format
VCF is a widely used file format developed by the genomics scientific community that contains information about variants found at specific positions in a reference genome.

VCF files use the file naming format SampleName_S#.vcf, where # is the sample number determined by the order that samples are listed for the run.

VCF File Header: Includes the VCF file format version and the variant caller version. The header lists the annotations used in the remainder of the file. If MARS is listed, the Illumina internal annotation algorithm annotated the VCF file. The VCF header includes the reference genome file and BAM file. The last line in the header contains the column headings for the data lines.

VCF File Data Lines: Each data line contains information about a single variant.

VCF File Headings

Heading | Description
CHROM | The chromosome of the reference genome. Chromosomes appear in the same order as the reference FASTA file.
POS | The single-base position of the variant in the reference chromosome. For SNPs, this position is the reference base with the variant; for insertions or deletions, this position is the reference base immediately before the variant.
ID | The rs number for the SNP obtained from dbSNP.txt, if applicable. If there are multiple rs numbers at this location, the list is semicolon delimited. If no dbSNP entry exists at this position, a missing value marker ('.') is used.
REF | The reference genotype. For example, a deletion of a single T is represented as reference TT and alternate T. An A to T single nucleotide variant is represented as reference A and alternate T.
ALT | The alleles that differ from the reference read.
For example, an insertion of a single T is represented as reference A and alternate AT. An A to T single nucleotide variant is represented as reference A and alternate T.
QUAL | A Phred-scaled quality score assigned by the variant caller. Higher scores indicate higher confidence in the variant and lower probability of errors. For a quality score of Q, the estimated probability of an error is 10^(-Q/10). For example, the set of Q30 calls has a 0.1% error rate. Many variant callers assign quality scores based on their statistical models, which are high in relation to the error rate observed.

VCF File Annotations

Heading | Description
FILTER | If all filters are passed, PASS is written in the filter column.
   • LowDP: Applied to sites with depth of coverage below a cutoff.
   • LowGQ: The genotyping quality (GQ) is below a cutoff.
   • LowQual: The variant quality (QUAL) is below a cutoff.
   • LowVariantFreq: The variant frequency is less than the given threshold.
   • R8: For an indel, the number of adjacent repeats (1-base or 2-base) in the reference is greater than 8.
   • SB: The strand bias is more than the given threshold. Used with the Somatic Variant Caller and GATK.
INFO | Possible entries in the INFO column include:
   • AC: Allele count in genotypes for each ALT allele, in the same order as listed.
   • AF: Allele Frequency for each ALT allele, in the same order as listed.
   • AN: The total number of alleles in called genotypes.
   • CD: A flag indicating that the SNP occurs within the coding region of at least 1 RefGene entry.
   • DP: The depth (number of base calls aligned to a position and used in variant calling).
   • Exon: A comma-separated list of exon regions read from RefGene.
   • FC: Functional Consequence.
   • GI: A comma-separated list of gene IDs read from RefGene.
   • QD: Variant Confidence/Quality by Depth.
   • TI: A comma-separated list of transcript IDs read from RefGene.
FORMAT | The format column lists fields separated by colons. For example, GT:GQ.
The list of fields provided depends on the variant caller used. Available fields include:
   • AD: Entry of the form X,Y, where X is the number of reference calls, and Y is the number of alternate calls.
   • DP: Approximate read depth; reads with MQ=255 or with bad mates are filtered.
   • GQ: Genotype quality.
   • GQX: Genotype quality. GQX is the minimum of the GQ value and the QUAL column. In general, these values are similar; taking the minimum makes GQX the more conservative measure of genotype quality.
   • GT: Genotype. 0 corresponds to the reference base, 1 corresponds to the first entry in the ALT column, and so on. The forward slash (/) indicates that no phasing information is available.
   • NL: Noise level; an estimate of base calling noise at this position.
   • PL: Normalized, Phred-scaled likelihoods for genotypes.
   • SB: Strand bias at this position. Larger negative values indicate less bias; values near 0 indicate more bias. Used with the Somatic Variant Caller and GATK.
   • VF: Variant frequency; the percentage of reads supporting the alternate allele.
SAMPLE | The sample column gives the values specified in the FORMAT column.

Genome VCF Files
Genome VCF (gVCF) files are VCF v4.1 files that follow a set of conventions for representing all sites within the genome in a reasonably compact format. The gVCF files include all sites within the region of interest in a single file for each sample.

The gVCF file shows no-calls at positions with low coverage, or where a low-frequency variant (< 3%) occurs often enough (> 1%) that the position cannot be called to the reference. A genotype (GT) tag of ./. indicates a no-call.

For more information, see sites.google.com/site/gvcftools/home/about-gvcf.

Annotation File
Illumina On-Node Annotation (IONA) generates a binary annotation file (*.ant), which contains consequences for all affected transcripts. The annotations are more detailed than the annotations in the VCF file.
You can view this binary file in VariantStudio. For more information, see www.illumina.com/informatics/research/biological-data-interpretation/variantstudio.html.

Summary File
The TruSeq Amplicon v2.0 App produces an overview of statistics for each sample and the aggregate results in a comma-separated values (CSV) format: *.summary.csv. These files are located in the results folder for each sample and the aggregate results.

Statistic | Definition
Sample ID | IDs of samples reported in the file.
Sample Name | Names of samples reported in the file.
Run Folder | Run folders for samples reported in the file.
Reference genome | Reference genome selected.
Manifest | The manifest file used for analysis. This file specifies the targeted regions for the aligner and variant caller.
Number of amplicon regions | The number of amplicon regions that were sequenced.
Total length of amplicon regions | The total length of the sequenced bases in the targeted region.
Total PF reads | The number of reads passing filter for the sample.
Total aligned reads | The total number of reads passing filter present in the data set that aligned to the reference genome. Numbers are calculated per read, and over both reads.
Percent aligned reads | The percentage of reads passing filter that aligned to the reference genome. Numbers are calculated per read, and over both reads.
Total probe bases | Total number of bases that aligned to the probe sequences (ULSO and DLSO) and are soft-clipped in the BAM files. Numbers are calculated per read, and over both reads.
Total aligned nonprobe bases | Total number of bases that aligned to the reference, excluding bases aligning to the probe sequences. This number is the same as the number of bases aligned in the BAM file (probe sequence bases are soft-clipped). Numbers are calculated per read, and over both reads.
Total PF bases | The number of bases passing filter for the sample. Numbers are calculated per read, and over both reads.
Percent Q30 bases | The percentage of bases with a quality score of 30 or higher. Numbers are calculated per read, and over both reads.
Total aligned bases | The total number of bases present in the data set that aligned to the reference genome. Numbers are calculated per read, and over both reads.
Percent aligned bases | The percentage of bases that aligned to the reference genome. Numbers are calculated per read, and over both reads.
Mismatch rate | The average percentage of mismatches across both reads 1 and 2 over all cycles. Numbers are calculated per read.
Amplicon mean coverage | The total number of aligned reads to the targeted region divided by the number of targeted regions.
Uniformity of Coverage (Pct > 0.2*mean) | The percentage of targeted base positions in which the read depth is greater than 0.2 times the mean region target coverage depth.
SNVs, Insertions, Deletions (All) | Total number of variants present in the data set.
SNVs, Insertions, Deletions | Total number of variants present in the data set that pass the quality filters.
SNV Ts/Tv ratio | Transition rate of SNVs that pass the quality filters divided by transversion rate of SNVs that pass the quality filters. Transitions are interchanges of purines (A, G) or of pyrimidines (C, T). Transversions are interchanges of purine and pyrimidine bases (for example, A to T).
SNVs, Insertions, Deletions Het/Hom ratio | Number of heterozygous variants/Number of homozygous variants.
SNVs, Insertions, Deletions (Percent found in dbSNP) | 100*(Number of variants in dbSNP/Number of variants).
SNVs, Insertions, Deletions in genes | Number of variants that fall into a gene.
SNVs, Insertions, Deletions in exons | Number of variants that fall into an exon.
SNVs, Insertions, Deletions in coding regions | Number of variants that fall into a coding region.
SNVs, Insertions, Deletions in UTR regions | Number of variants that fall into an untranslated region (UTR).
SNVs, Insertions, Deletions in splice site regions | Number of variants that fall into a splice site region.
Stop gained SNVs, Insertions, Deletions | Number of variants that cause an additional stop codon.
Stop lost SNVs, Insertions, Deletions | Number of variants that cause the loss of a stop codon.
Non-synonymous SNVs, Insertions, Deletions | Number of variants that cause an amino acid change in a coding region.
Synonymous SNVs | Number of variants that are within a coding region, but do not cause an amino acid change.
Frameshift Insertions, Deletions | Number of variants that cause a frameshift.

Sample Analysis Reports
The TruSeq Amplicon v2.0 App provides an overview of statistics per sample on the Analysis Reports sample pages. To download statistics in the TruSeq Amplicon Sequencing Report, click PDF Summary Report.

Amplicon Summary

Table 1 Amplicon Summary Table
Statistic | Definition
Manifest (PDF only) | The name of the manifest used in the analysis.
Number of Amplicon Regions | The number of amplicon regions that were sequenced.
Total Length of Amplicon Regions | The total length of the sequenced bases in the targeted region.

Table 2 Read Level Statistics Table
Statistic | Definition
Total Aligned Reads | The total number of reads passing filter present in the data set that aligned to the reference genome.
Percent Aligned Reads | The percentage of reads passing filter that aligned to the reference genome.

Table 3 Base Level Statistics Table
Statistic | Definition
Percent Q30 | The percentage of bases with a quality score of 30 or higher.
Total Aligned Bases | The total number of bases present in the data set that aligned to the reference genome.
Percent Aligned Bases | The percentage of bases that aligned to the reference genome.
Mismatch Rate | The average percentage of mismatches across both reads 1 and 2 over all cycles.
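As an illustration of the Percent Q30 definition in Table 3, the following sketch computes the percentage of bases with a quality score of 30 or higher from FASTQ-style quality strings. It is not the app's reporting code: the function name is hypothetical, and the Phred+33 ASCII encoding of quality characters is an assumption about the input, not something stated in this guide.

```python
from typing import Iterable


def percent_q30(
    quality_strings: Iterable[str], offset: int = 33, threshold: int = 30
) -> float:
    """Percentage of base calls whose quality score is >= threshold.

    Assumes each character encodes one Phred score as chr(score + offset).
    Returns 0.0 when there are no base calls.
    """
    total = 0
    passing = 0
    for qual in quality_strings:
        for ch in qual:
            total += 1
            if ord(ch) - offset >= threshold:
                passing += 1
    return 100.0 * passing / total if total else 0.0
```

With Phred+33 encoding, the character "?" corresponds to Q30 and "I" to Q40, so a read with the quality string "II??" counts fully toward the Q30 percentage.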
Small Variants Summary
This table provides metrics about the number of SNVs, deletions, and insertions.

Table 4 Small Variants Summary Table
Statistic | Definition
Total Passing | The total number of variants present in the data set that passed the variant quality filters.
Percent Found in dbSNP | 100*(Number of variants in dbSNP/Number of variants).
Het/Hom Ratio | Number of heterozygous variants/Number of homozygous variants.
Ts/Tv Ratio | Transition rate of SNVs that pass the quality filters divided by transversion rate of SNVs that pass the quality filters. Transitions are interchanges of purines (A, G) or of pyrimidines (C, T). Transversions are interchanges between purine and pyrimidine bases (for example, A to T).

Table 5 Variants by Sequence Context Table
Statistic | Definition
Number in Genes | The number of variants that fall into a gene.
Number in Exons | The number of variants that fall into an exon.
Number in Coding Regions | The number of variants that fall into a coding region.
Number in UTR Regions | The number of variants that fall into an untranslated region (UTR).
Number in Splice Site Regions | The number of variants that fall into a splice site region.

To view the guidelines for calculating variation consequences, visit the Ensembl website: uswest.ensembl.org/info/genome/variation/predicted_data.html#consequences.

Table 6 Variants by Consequence Table
Statistic | Definition
Frameshifts | The number of variants that cause a frameshift.
Non-synonymous | The number of variants that cause an amino acid change in a coding region.
Synonymous | The number of variants that are within a coding region, but do not cause an amino acid change.
Stop Gained | The number of variants that cause an additional stop codon.
Stop Lost | The number of variants that cause the loss of a stop codon.

Variants Table
Use the Variants Table to view, sort, filter, and export a subset of the data provided in the VCF files.
NOTE
The Variants table is provided on the Sample Analysis Reports page, and not in the TruSeq Amplicon Sequencing Report.

Table 7 Variants Table
Statistic | Definition
Chromosome (Chr) | Name of reference chromosome.
Position (Pos) | Position within reference chromosome.
Reference Allele (Ref) | The reference allele.
Variant Allele (Alt) | The alt allele.
Variant Type (Type) | Type of variant, including single nucleotide variant (SNV), insertion, and deletion.
Sequence Context (Context) | Location of the variant based on annotations of the reference genome.
Consequence | Predicted transcript consequence as described at uswest.ensembl.org/info/genome/variation/predicted_data.html#consequences.
dbSNP ID (dbSNP) | Identifier in the Single Nucleotide Polymorphism Database (dbSNP), a free public archive for genetic variation within and across different species developed and hosted by the National Center for Biotechnology Information (NCBI).
COSMIC ID (COSMIC) | The numeric identifier for the variant in the Catalogue of Somatic Mutations in Cancer (COSMIC) database.
ClinVar | Clinical significance based on ClinVar, the freely accessible, public archive of reports of the relationships among human variations and phenotypes.
Variant Quality (Qual) | Phred-scaled quality score indicating how confident we are in this asserted haplotype.
Variant Frequency (Alt Freq) | Proportion of the variant allele among all alleles being considered.
Total Depth | Number of reads aligned at this position.
Reference Allele Depth (Ref Depth) | Number of reads containing the reference allele.
Variant Allele Depth (Alt Depth) | Number of reads containing the variant allele.
Strand Bias | Strand bias is a type of sequencing bias in which 1 DNA strand is favored over the other, which can result in incorrect evaluation of the amount of evidence observed for one allele versus the other.
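Two of the columns in Table 7 are simple derived quantities, and the sketch below shows how they relate to the raw numbers. It is an illustration only, with hypothetical function names. The conversion of a Phred-scaled Variant Quality to an error probability uses the standard relationship 10^(-Q/10). Computing Variant Frequency as alt depth over the sum of reference and alt depth is an assumption about which alleles are "being considered"; the app may use a different denominator. The default of 5 in the threshold check mirrors the default Somatic Variant Caller Threshold from Set Analysis Parameters.

```python
def phred_to_error_probability(q: float) -> float:
    """Estimated probability that a call with Phred-scaled quality q is wrong."""
    return 10 ** (-q / 10)


def variant_frequency(ref_depth: int, alt_depth: int) -> float:
    """Proportion of reads supporting the variant allele.

    Assumption: only the reference and variant allele depths are considered.
    Returns 0.0 when there is no coverage.
    """
    total = ref_depth + alt_depth
    return alt_depth / total if total else 0.0


def meets_frequency_threshold(frequency: float, threshold_percent: float = 5.0) -> bool:
    """True if a frequency (as a proportion) reaches the percentage threshold."""
    return frequency * 100 >= threshold_percent
```

For example, a variant seen in 50 of 1000 reads has a frequency of 0.05, which sits exactly at a 5 percent threshold, and a Variant Quality of 30 corresponds to an estimated error probability of 0.001.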
Coverage Summary

Table 8 Coverage Summary Table
Statistic: Definition
Amplicon Mean Coverage: The total number of aligned reads to the targeted region divided by the number of targeted regions.
Uniformity of Coverage (Pct > 0.2*mean): The percentage of targeted base positions in which the read depth is greater than 0.2 times the mean region target coverage depth.

Coverage by Amplicon Region Plot

The Coverage by Amplicon Region plot shows the number of bases plotted against the amplicon region. It has the following features:
• Amplicon regions with coverage values less than the low coverage threshold (0.2 * amplicon mean coverage) are highlighted in red.
• The horizontal red line marks the low coverage threshold.
• The orange line marks the moving average of all coverage values.
• Off-targets are excluded from the plot.

Figure 3 Example Coverage by Amplicon Region Plot

Coverage values for each amplicon are detailed in the downloadable Export (CSV) file.

Summary Analysis Report

The TruSeq Amplicon v2.0 App provides an Aggregate Summary for all samples. Statistics are plotted against samples, with tables providing additional metrics.

Amplicon Summary

Table 9 Read Level Statistics Table
Statistic: Definition
Total Aligned Reads (R1/R2): The total number of reads passing filter present in the data set that aligned to the reference genome. Numbers are per read.
Percent Aligned Reads (R1/R2): The percentage of reads passing filter that aligned to the reference genome. Numbers are calculated per read.
Overall Aligned Reads: The percentage of reads passing filter that aligned for the sample across both reads.

Table 10 Base Level Statistics Table
Statistic: Definition
Total Aligned Bases (R1/R2): The total number of bases present in the data set that aligned to the reference genome. Numbers are calculated per read.
Overall Total Aligned Bases: The total number of bases present in the data set that aligned to the reference genome across both reads (R1 and R2). The value is the sum of the individual Total Aligned Bases values.
Percent Aligned Bases (R1/R2): The percentage of bases that aligned to the reference genome. Numbers are calculated per read, and the average over both reads.
Overall Percent Aligned Bases: The percentage of bases that aligned to the reference genome across both reads (R1 and R2). The value is the average of the individual Percent Aligned Bases values.
Percent Q30 (R1/R2): The percentage of bases with a quality score of 30 or higher. Numbers are calculated per read.
Mismatch Rate (R1/R2): The average percentage of mismatches across both reads 1 and 2 over all cycles. Numbers are calculated per read.

Small Variants Summary

Table 11 SNVs Table
Statistic: Definition
SNVs: The total number of SNVs present in the data set that passed the variant quality filters.
SNV Ts/Tv Ratio: Transition rate of SNVs that pass the quality filters divided by transversion rate of SNVs that pass the quality filters. Transitions are interchanges of purines (A, G) or of pyrimidines (C, T). Transversions are interchanges between purine and pyrimidine bases (for example, A to T).
SNV Het/Hom Ratio: Number of heterozygous variants/Number of homozygous variants.

Table 12 Insertions Table
Statistic: Definition
Insertions: The total number of insertions present in the data set that passed the variant quality filters.
Insertion Het/Hom Ratio: Number of heterozygous variants/Number of homozygous variants.

Table 13 Deletions Table
Statistic: Definition
Deletions: The total number of deletions present in the data set that passed the variant quality filters.
Deletions Het/Hom Ratio: Number of heterozygous variants/Number of homozygous variants.
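The ratio definitions in Tables 11 through 13 can be illustrated with a short Python sketch. It is not the app's implementation: the (ref, alt, zygosity) tuple layout, the function names, and the length-based classification of variant types are assumptions made for this example, and the Ts/Tv ratio is computed here as a simple ratio of counts among passing SNVs.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def variant_type(ref, alt):
    """Classify by allele length. A simplification: inputs are assumed to be
    SNVs, insertions, or deletions only."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    return "insertion" if len(alt) > len(ref) else "deletion"


def is_transition(ref, alt):
    """True for purine-to-purine or pyrimidine-to-pyrimidine changes."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ratio(numerator, denominator):
    """Return None instead of dividing by zero."""
    return numerator / denominator if denominator else None


def small_variant_ratios(variants):
    """variants: iterable of (ref, alt, zygosity); zygosity is 'het' or 'hom'."""
    counts = {t: {"het": 0, "hom": 0} for t in ("SNV", "insertion", "deletion")}
    ts = tv = 0
    for ref, alt, zygosity in variants:
        kind = variant_type(ref, alt)
        counts[kind][zygosity] += 1
        if kind == "SNV":
            if is_transition(ref, alt):
                ts += 1
            else:
                tv += 1
    result = {"SNV Ts/Tv": ratio(ts, tv)}
    for kind, c in counts.items():
        result[f"{kind} Het/Hom"] = ratio(c["het"], c["hom"])
    return result


# Made-up passing variants for demonstration only.
example = [
    ("A", "G", "het"),   # transition (purine to purine)
    ("C", "T", "hom"),   # transition (pyrimidine to pyrimidine)
    ("A", "T", "het"),   # transversion (purine to pyrimidine)
    ("G", "GA", "het"),  # insertion
    ("CT", "C", "hom"),  # deletion
]
print(small_variant_ratios(example))
```

Returning None when the denominator is zero is a choice made for this sketch; how the app reports a ratio with no homozygous variants or no transversions is not stated in this guide.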
Coverage Summary

Table 14 Coverage Summary Table
Statistic: Definition
Amplicon Mean Coverage Depth: The total number of aligned reads to the targeted region divided by the number of targeted regions.

Revision History

Document, Date: Description of Change
Document # 15055857 v02, January 2016: Reorganized topics, updated writing style.
Document # 15055857 v01, October 2015: Release supporting TruSeq Amplicon v2.0.

Technical Assistance

For technical assistance, contact Illumina Technical Support.

Table 15 Illumina General Contact Information
Website: www.illumina.com
Email: [email protected]

Table 16 Illumina Customer Support Telephone Numbers
Region: Contact Number
North America: 1.800.809.4566
Australia: 1.800.775.688
Austria: 0800.296575
Belgium: 0800.81102
China: 400.635.9898
Denmark: 80882346
Finland: 0800.918363
France: 0800.911850
Germany: 0800.180.8994
Hong Kong: 800960230
Ireland: 1.800.812949
Italy: 800.874909
Japan: 0800.111.5011
Netherlands: 0800.0223859
New Zealand: 0800.451.650
Norway: 800.16836
Singapore: 1.800.579.2745
Spain: 900.812168
Sweden: 020790181
Switzerland: 0800.563118
Taiwan: 00806651752
United Kingdom: 0800.917.0041
Other countries: +44.1799.534000

Safety data sheets (SDSs): Available on the Illumina website at support.illumina.com/sds.html.
Product documentation: Available for download in PDF from the Illumina website. Go to support.illumina.com, select a product, then select Documentation & Literature.

Illumina
5200 Illumina Way
San Diego, California 92122 U.S.A.
+1.800.809.ILMN (4566)
+1.858.202.4566 (outside North America)
[email protected]
www.illumina.com