BWA Whole Genome Sequencing v1.0 BaseSpace App Guide For Research Use Only. Not for use in diagnostic procedures. Introduction Workflow Diagram Set Analysis Parameters Analysis Methods Analysis Output Revision History Technical Assistance ILLUMINA PROPRIETARY 15050952 v01 January 2016 3 5 6 7 10 19 This document and its contents are proprietary to Illumina, Inc. and its affiliates ("Illumina"), and are intended solely for the contractual use of its customer in connection with the use of the product(s) described herein and for no other purpose. This document and its contents shall not be used or distributed for any other purpose and/or otherwise communicated, disclosed, or reproduced in any way whatsoever without the prior written consent of Illumina. Illumina does not convey any license under its patent, trademark, copyright, or common-law rights nor similar rights of any third parties by this document. The instructions in this document must be strictly and explicitly followed by qualified and properly trained personnel in order to ensure the proper and safe use of the product(s) described herein. All of the contents of this document must be fully read and understood prior to using such product(s). FAILURE TO COMPLETELY READ AND EXPLICITLY FOLLOW ALL OF THE INSTRUCTIONS CONTAINED HEREIN MAY RESULT IN DAMAGE TO THE PRODUCT(S), INJURY TO PERSONS, INCLUDING TO USERS OR OTHERS, AND DAMAGE TO OTHER PROPERTY. ILLUMINA DOES NOT ASSUME ANY LIABILITY ARISING OUT OF THE IMPROPER USE OF THE PRODUCT(S) DESCRIBED HEREIN (INCLUDING PARTS THEREOF OR SOFTWARE). © 2016 Illumina, Inc. All rights reserved. Illumina, 24sure, BaseSpace, BeadArray, BlueFish, BlueFuse, BlueGnome, cBot, CSPro, CytoChip, DesignStudio, Epicentre, ForenSeq, Genetic Energy, GenomeStudio, GoldenGate, HiScan, HiSeq, HiSeq X, Infinium, iScan, iSelect, MiSeq, MiSeqDx, MiSeq FGx, NeoPrep, NextBio, Nextera, NextSeq, Powered by Illumina, SureMDA, TruGenome, TruSeq, TruSight, Understand Your Genome, UYG, VeraCode, verifi, VeriSeq, the pumpkin orange color, and the streaming bases design are trademarks of Illumina, Inc. and/or its affiliate(s) in the U.S. and/or other countries. All other names, logos, and other trademarks are the property of their respective owners. After BaseSpace® generates the FASTQ files containing the base calls and quality scores of the run, you can use the Burrows-Wheeler Aligner (BWA) Whole Genome Sequencing v1.0 App to analyze the sequencing data. The app analyzes the data in two parts; first, it aligns to the reference genome, and then it assembles and performs variant calls. Compatible Libraries See the BaseSpace support page for a list of library types that are compatible with the BWA Whole Genome Sequencing v1.0 App. Workflow Requirements } } } } } } } } } } This app does not support mate-pair or other non-forward and -reverse styles of paired-end sequencing. This app does not support annotation of non-human genomes. A minimum read length of 21 bp and a maximum read length of 150 bp. Minimum recommended data set size is enough data to yield 10 × coverage after alignment of the genome being sequenced. See Table 1 for details. Maximum data set size is fewer than 200 gigabases, which equates to the following: } Approximately 1 billion reads assuming 2 × 100. } Approximately 665 million reads assuming 2 × 150. Completedjobinfo.xml may not print all statistics. Sample name length has a maximum of 32 characters. GQX can be entered as any value, although the maximum recommended value is 99. GATK indel realignment on chrM (at very high coverage) displays this warning: } Reads will be written to bam out of order } One of the Picard tools, ValidateSamFile, displays this error: } MAPQ should be 0 for unmapped read and Mate unmapped flag does not match read unmapped flag of mate. Isaac App is recommended if PhiX is being sequenced at a coverage > 5,000,000 ×. For recommended minimum number of reads for 10 × coverage for different species, see Table 1. The number of reads listed yields 10 × coverage with an additional 5% to account for unaligned reads. Table 1 Recommended Minimums for 10 × Coverage Genome Size Data Size Reads for 2 × 100 (million) Reads for 2 × 150 (million) Arabidopsis thaliana 63.4 Mb 666 Mb 3.33 2.22 Bos taurus 2.65 Gb 28 Gb 140.00 93.33 Escherichia coli K-12 DH10B 4.7 Mb 50 Mb 0.25 0.17 Escherichia coli K-12 MG1655 4.6 Mb 49 Mb 0.25 0.16 Drosophila melanogaster 139.5 Mb 1.5 Gb 7.33 4.88 BWA Whole Genome Sequencing v1.0 BaseSpace App Guide 3 Introduction Introduction Genome Size Data Size Reads for 2 × 100 (million) Reads for 2 × 150 (million) Human 3.3 Gb 35 Gb 175.00 116.67 Mus musculus 2.6 Gb 28 Gb 140.00 93.33 PhiX (Illumina) 5386 b 57 Kb 282.77 188.51 Rattus norvegicus 2.9 Gb 31 Gb 155.00 103.33 Rhodobacter sphaeroides 2.4.1 4.6 Mb 49 Mb 0.25 0.16 Saccharomyces cerevisiae 12.2 Mb 129 Mb 0.65 0.43 Staphylococcus aureus NCTC 8325 12.8 Mb 135 Mb 0.68 0.45 Versions These components are used in the BWA WGS App. Software Version BWA 0.6.1-r104-tpx-isis GATK 1.6-22-g3ec78bd CNV Variant Caller (CNVseg) 2.2.4 Structural Variant Caller (Grouper) 1.4.2 SAMtools 0.1.18 Tabix 0.2.5 (r1005) Reference Genomes These genomes are available for alignment: } Human, UCSC hg19 The human reference genome is PAR-Masked, which means that the Y chromosome sequence has the Pseudo Autosomal Regions (PAR) masked (set to N) to avoid mismapping of reads in the duplicate regions of sex chromosomes. } Arabidopsis thaliana (NCBI build9.1) } Bos taurus (Ensembl UMD3.1) } Escherichia coli K-12 DH10B (NCBI 2008-03-17) } Escherichia coli K-12 MG1655 (NCBI 2001-10-15 ) } Drosophila melanogaster (UCSC dm3) } Mus musculus (UCSC MM9) } Phi X (Illumina) } Rattus norvegicus (UCSC RN5) } Rhodobacter sphaeroides 2.4.1 (NCBI 2005-10-07) } Saccharomyces cerevisiae (UCSC sacCer2) } Staphylococcus aureus NCTC 8325 (NCBI 2006-02-13) 4 15050952 v01 Workflow Diagram Workflow Diagram Figure 1 BWA WGS App Workflow BWA Whole Genome Sequencing v1.0 BaseSpace App Guide 5 Set Analysis Parameters 1 Navigate to BaseSpace, and then click the Apps tab. 2 Click BWA Whole Genome Sequencing v1.0. 3 From the drop-down list, select version 1.0.0, and then click Launch to open the app. 4 In the Analysis Name field, enter the analysis name. By default, the analysis name includes the app name, followed by the date and time that the analysis session starts. 5 From the Save Results To field, select the project that stores the app results. 6 From the Sample(s) field, browse to the sample you want to analyze, and select the checkbox. You can select multiple samples. 7 From the Reference Genome field, select the reference genome you want to align. 8 From the Enable SV/CNV calling field, select the checkbox to perform Isaac Variant Caller calls structural variation (SV) and copy number variation (CNV) calling. This option applies to paired-end data. For more information, see Large Indel and Structural Variant Calls on page 7 and CNV Variant Caller on page 8). 9 From the Annotation field, select a preferred gene and transcript annotation reference database. 10 Optional, click the Advance drop-down list for additional parameter fields. a b c From the Min GQX for Variants field, enter the quality GQX for variants. GQX is the minimum of the GQ (genotype quality) and QUAL (low quality filter), which makes it a conservative filter. The default value is 30. The maximum recommended is 99. From the Max Strand Bias for Variants field, enter the maximum allowed strand bias for variant calling. This option filters for reads when the differences in allele frequencies for forward- and reverse-strand reads are too high. The default value is 10. Variants with a high strand bias are flagged as filtered. Strand bias is higher when the evidence for a variant is confined to reads from just one strand. From the FlagPCRDuplicates field, select the checkbox to have PCR duplicates flagged in the BAM files and not be used for variant calling. PCR duplicates are defined as paired-end reads generated from two clusters that have the exact same alignment positions for each read. Optical duplicates are already filtered out during RTA processing. 11 Click Continue. The BWA Whole Genome Sequencing v1.0 App begins analysis of the sample. When analysis is complete, the status of the app session is updated automatically and an email is sent to notify you. 6 15050952 v01 Analysis Methods Analysis Methods The BWA Whole Genome Sequencing v1.0 App uses these methods to analyze the sequencing data. BWA The BWA Whole Genome Sequencing workflow uses the Burrows-Wheeler Aligner (BWA, which adjusts parameters based on read lengths and error rates, and then estimates the insert size distribution. For more information, see github.com/lh3/bwa. After BWA alignment, GATK performs variant calling. GATK The Genome Analysis Toolkit (GATK) is the standard variant caller after BWA alignment. Developed by the Broad Institute, the Genome Analysis Toolkit (GATK) first calls raw variants for each sample read. Then GATK analyzes the variants against known variants, and applies a calibration procedure to compute a false discovery rate for each variant. Variants are flagged as homozygous (1/1) or heterozygous (0/1) in the VCF file sample column. The GATK best practices were guidelines for the app; they are described here: www.broadinstitute.org/gatk/guide/topic?name=best-practices. For more information about GATK, see www.broadinstitute.org/gatk. DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell TJ, Kernytsky AM, Sivachenko AY, Cibulskis K, Gabriel SB, Altshuler D, Daly MJ (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 43(5): 491-8. Large Indel and Structural Variant Calls The large indel and structural variant caller uses the series of modules described here, and then generates output files in VCF 4.1 format. Before ReadBroker } } } StatsGenerator—Computes summary statistics on insert sizes, read orientation, and alignment scores for each input BAM file. AnomalousReadFinder—Grouper processes chromosomes in chunks. This method enables parallel execution and, therefore, faster performance. AnomalousReadFinder examines all alignments in a block and classifies reads and read pairs as follows: } Classifies reads as either shadow (unaligned) or semialigned partial or clipped alignment). } Classifies read pairs as either InsertionPair, DeletionPair, InversionPair, TandemDuplicationPair, or ChimericPair, according to which type of structural variant an anomalously mapped read pair is associated. ClusterFinder—Clusters reads based on their type and the position of their alignment. Only reads of the same type are clustered together at this stage, except shadow and semialigned reads, which can be clustered together. BWA Whole Genome Sequencing v1.0 BaseSpace App Guide 7 } ClusterMerger—Associates clusters of various anomalous read types with shadow/semi-aligned read clusters, which breakpoints can cause. A breakpoint is a pair of bases that are adjacent in the sample genome but not in the reference. Two clusters are merged if they share the read or if they agree on the position and length of the structural variant. This information is inferred from read alignment orientation and distance. ReadBroker } Interchromosomal translocations yield chimeric read pairs where 1 read aligns to one chromosome and its partner aligns to another. Because Grouper examines each chromosome individually, the ReadBroker step is performed to join the information from chimeric read pairs across chromosomes. After ReadBroker } } } } } } } SmallAssembler—Assembles reads in clusters into contigs using a de Bruijn method and iteratively assembles reads into contigs until all reads in the cluster are assembled. It also produces a file containing the reads that were used to assemble the contig, with a realignment to the contig sequence. SpanContigs—Uses the presence of nearby anomalous read pairs to determine whether to extend the search range used by the subsequent AlignContig step from its default. AlignContig—Computes a dynamic programming alignment of a contig to a region of the reference genome; merges full or partial duplicate calls of the same event into a single call. VariantFilter—Removes all structural variants that overlap with gaps identified in UCSC gaps. The UCSC gaps file defines regions of the genome that have not been sequenced. DeletionGenotyper—Assigns a genotype to all deletions. SomaticGenotyper—Assigns a quality score (Q-score) to all structural variants. Higher Q-scores indicate a higher probability that this structural variant is somatic. DeletionGenotyper—Assigns a genotype to all deletions. CNV Variant Caller The CNV variant caller is designed to identify copy number variants (CNVs) in diploid genomes using Hidden Markov Models (HMM) or unbalanced Haar wavelets. The method adopts a count-based approach for CNV calling, which comprises the following steps: 1 Pre-processing step, during which read depth is computed at each position and then filtered based on CpG islands, assembly gaps, telomeric/centromeric regions. Either alignability tracks or coverage tracks obtained from a pool of reference sample are used to normalize the data. Counts or count ratios are produced as an output. 2 Segmentation of read counts/ratios using fixed or variable bin size and a copy number assignment. 8 15050952 v01 A single sample or a pool of reference samples is used for normalization, by deriving a ratio between a test and the reference. Window size is fixed (by default to 100 bp). The HMM model with Gaussian emission distribution is used for segmentation. A bin exclusion criterion (less than 10% of build coverage in both samples) is applied. The reference for CNV normalization is an alignability measure that is meant to gauge the probability of a position aligning to a single unique region of the genome. In detail, the notion of alignability for reads of length k is as follows: given a map M that, for a fixed read length k and any position P in a genome G, stores at M(P) the number of occurrences in G of the k-mer that starts at P for a given position P in G, define the overlap set of P as the k-mers that overlap P. The alignability of P is the proportion of this overlap set that is unique. Variant Scoring After copy number assignment, each CNV call is assigned a quality score based on a two-sample t-test. Each counts/ratio in a 1 kb window on each size of a breakpoint (or half the length of a variant call, whichever is smaller) is compared using t-test. This test is based on the null hypothesis that there is no difference in coverage on each size of the breakpoint. Obtained p-values are then reported as Q-scores on a Phred scale as -10 log10. BWA Whole Genome Sequencing v1.0 BaseSpace App Guide 9 Analysis Methods Normalization Analysis Output To view the results, click the Projects tab, then the project name, and then the analysis. Figure 2 Output Navigation Bar After analysis is complete, access the output through the left navigation bar. } Analysis Info—Information about the analysis session, including log files. } Inputs—Overview of input settings. } Output Files—Output files for the sample. } Sample Analysis Reports—Analysis reports for each sample. Analysis Info The Analysis Info page displays the analysis settings and execution details. Row Heading Definition Name Name of the analysis session. Application App that generated this analysis. Date Started Date and time the analysis session started. Date Completed Date and time the analysis session completed. Duration Duration of the analysis. Session Type Number of nodes used. Status Status of the analysis session. The status shows either Running or Complete. Log Files Click the Log Files link to access the app log files. You can locate some log files in Output Files. 10 File Name Description CompletedJobInfo.xml Contains information about the completed analysis session. 15050952 v01 Description Logging.zip Contains all detailed log files for each step of the workflow. WorkflowError.txt Workflow standard error output (contains error messages created when running the workflow). WorkflowLog.txt Workflow standard output (contains details about workflow steps, command line calls with parameters, timing, and progress). monoOut.txt Wrapper mono call standard output that contains the command calling the workflow and anything that WorkflowLog.txt does not catch. Available when the mono component does not work as expected. Analysis Output File Name NOTE For explanation about mono, see www.mono-project.com. Output Files The Output Files page provides access to the output files for each sample analysis. } BAM Files } VCF Files } Genome VCF Files } Resequencing File } Summary File BAM File Format A BAM file (*.bam) is the compressed binary version of a SAM file that is used to represent aligned sequences up to 128 Mb. SAM and BAM formats are described in detail at https://samtools.github.io/hts-specs/SAMv1.pdf. BAM files use the file naming format of SampleName_S#.bam, where # is the sample number determined by the order that samples are listed for the run. BAM files contain a header section and an alignments section: } Header—Contains information about the entire file, such as sample name, sample length, and alignment method. Alignments in the alignments section are associated with specific information in the header section. } Alignments—Contains read name, read sequence, read quality, alignment information, and custom tags. The read name includes the chromosome, start coordinate, alignment quality, and the match descriptor string. The alignments section includes the following information for each or read pair: } RG: Read group, which indicates the number of reads for a specific sample. } BC: Barcode tag, which indicates the demultiplexed sample ID associated with the read. } SM: Single-end alignment quality. } AS: Paired-end alignment quality. } NM: Edit distance tag, which records the Levenshtein distance between the read and the reference. } XN: Amplicon name tag, which records the amplicon tile ID associated with the read. BWA Whole Genome Sequencing v1.0 BaseSpace App Guide 11 BAM index files (*.bam.bai) provide an index of the corresponding BAM file. VCF File Format VCF is a widely used file format developed by the genomics scientific community that contains information about variants found at specific positions in a reference genome. VCF files use the file naming format SampleName_S#.vcf, where # is the sample number determined by the order that samples are listed for the run. VCF File Header—Includes the VCF file format version and the variant caller version. The header lists the annotations used in the remainder of the file. If MARS is listed, the Illumina internal annotation algorithm annotated the VCF file. The VCF header includes the reference genome file and BAM file. The last line in the header contains the column headings for the data lines. VCF File Data Lines—Each data line contains information about a single variant. VCF File Headings 12 Heading Description CHROM The chromosome of the reference genome. Chromosomes appear in the same order as the reference FASTA file. POS The single-base position of the variant in the reference chromosome. For SNPs, this position is the reference base with the variant; for indels or deletions, this position is the reference base immediately before the variant. ID The rs number for the SNP obtained from dbSNP.txt, if applicable. If there are multiple rs numbers at this location, the list is semicolon delimited. If no dbSNP entry exists at this position, a missing value marker ('.') is used. REF The reference genotype. For example, a deletion of a single T is represented as reference TT and alternate T. An A to T single nucleotide variant is represented as reference A and alternate T. ALT The alleles that differ from the reference read. For example, an insertion of a single T is represented as reference A and alternate AT. An A to T single nucleotide variant is represented as reference A and alternate T. QUAL A Phred-scaled quality score assigned by the variant caller. Higher scores indicate higher confidence in the variant and lower probability of errors. For a quality score of Q, the estimated probability of an error is 10-(Q/10). For example, the set of Q30 calls has a 0.1% error rate. Many variant callers assign quality scores based on their statistical models, which are high in relation to the error rate observed. 15050952 v01 Analysis Output VCF File Annotations Heading Description FILTER If all filters are passed, PASS is written in the filter column. • LowDP—Applied to sites with depth of coverage below a cutoff. • LowGQ—The genotyping quality (GQ) is below a cutoff. • LowQual—The variant quality (QUAL) is below a cutoff. • LowVariantFreq—The variant frequency is less than the given threshold. • R8—For an indel, the number of adjacent repeats (1-base or 2-base) in the reference is greater than 8. • SB—The strand bias is more than the given threshold. Used with the Somatic Variant Caller and GATK. INFO Possible entries in the INFO column include: • AC—Allele count in genotypes for each ALT allele, in the same order as listed. • AF—Allele Frequency for each ALT allele, in the same order as listed. • AN—The total number of alleles in called genotypes. • CD—A flag indicating that the SNP occurs within the coding region of at least 1 RefGene entry. • DP—The depth (number of base calls aligned to a position and used in variant calling). • Exon—A comma-separated list of exon regions read from RefGene. • FC—Functional Consequence. • GI—A comma-separated list of gene IDs read from RefGene. • QD—Variant Confidence/Quality by Depth. • TI—A comma-separated list of transcript IDs read from RefGene. FORMAT The format column lists fields separated by colons. For example, GT:GQ. The list of fields provided depends on the variant caller used. Available fields include: • AD—Entry of the form X,Y, where X is the number of reference calls, and Y is the number of alternate calls. • DP—Approximate read depth; reads with MQ=255 or with bad mates are filtered. • GQ—Genotype quality. • GQX—Genotype quality. GQX is the minimum of the GQ value and the QUAL column. In general, these values are similar; taking the minimum makes GQX the more conservative measure of genotype quality. • GT—Genotype. 0 corresponds to the reference base, 1 corresponds to the first entry in the ALT column, and so on. The forward slash (/) indicates that no phasing information is available. • NL—Noise level; an estimate of base calling noise at this position. • PL—Normalized, Phred-scaled likelihoods for genotypes. • SB—Strand bias at this position. Larger negative values indicate less bias; values near 0 indicate more bias. Used with the Somatic Variant Caller and GATK. • VF—Variant frequency; the percentage of reads supporting the alternate allele. SAMPLE The sample column gives the values specified in the FORMAT column. BWA Whole Genome Sequencing v1.0 BaseSpace App Guide 13 Genome VCF Files Genome VCF (gVCF) files are VCF v4.1 files that follow a set of conventions for representing all sites within the genome in a reasonably compact format. The gVCF files include all sites within the region of interest in a single file for each sample. The gVCF file shows no-calls at positions with low coverage, or where a low-frequency variant (< 3%) occurs often enough (> 1%) that the position cannot be called to the reference. A genotype (GT) tag of ./. indicates a no-call. For more information, see sites.google.com/site/gvcftools/home/about-gvcf. Summary File The BWA WGS App produces an overview of statistics for each sample and the aggregate results in a comma-separated values (CSV) format. The *resquencing_ summary.csv file contains the same data as the Sample Summary Report but is formatted for easier analysis. These files are located in the results folder for each sample and the aggregate results. Statistic 14 Definition Sample ID IDs of samples reported on in the file. Run Folder Run folders for samples reported on in the file. Fragment length median Median length of the sequenced fragment. The fragment length is calculated based on the locations at which a read pair aligns to the reference. The read mapping information is parsed from the BAM files. Fragment length min Minimum length of the sequenced fragment. Fragment length max Maximum length of the sequenced fragment. Fragment length SD Standard deviation of the sequenced fragment length. Number of Reads Total number of reads passing filter for this sample. Percent Aligned (per read) Percentage of reads passing filter that aligned. Percent Q30 (per read) The percentage of bases with a quality score of 30 or higher. MismatchRate (per read) The average percentage of mismatches across both reads 1 and 2 over all cycles. SNVs All Total number of Single Nucleotide Variants present in the data set passing the quality filters. SNVs Passing Filters SNVs passing variants filter. SNVs (Percent found in dbSNP) 100*(Number of SNVs in dbSNP/Number of SNVs). 15050952 v01 Analysis Output Statistic Definition SNV Ts/Tv ratio Transition rate of SNVs that pass the quality filters divided by transversion rate of SNVs that pass the quality filters. Transitions are interchanges of purines (A, G) or of pyrimidines (C, T). Transversions are interchanges of purine and pyrimidine bases (for example, A to T). SNV Het/Hom ratio Number of heterozygous SNVs/Number of homozygous SNVs. Indels Total number of indels present in the data set passing the quality filters. Insertions Passing Variants Insertions passing variant filters. Deletions Passing Variants Deletions passing variant filters. Indels (Percent found in dbSNP) 100*(Number of Indels in dbSNP/Number of Indels). Insertions (Percent found in dbSNP) 100*(Number of insertions in dbSNP/ Number of insertions) Deletions(Percent found in dbSNP) 100*(Number of deletions in dbSNP/ Number of deletions) Indel Het/Hom ratio Number of heterozygous indels/Number of homozygous indels. Insertion Het/Hom ratio Ratio of the number of heterozygous to homozygous insertions. Deletion Het/Hom ratio Ratio of the number of heterozygous to homozygous deletions. SmallVariantStatisticsFlag Flags whether SmallVariantStatistics was run (1 means that it was run) SVStatisticsFlag Flags whether SVStatistics was run (1 means that it was run) CNVStatisticsFlag Flags whether CNVStatistics was run (1 means that it was run) Sample Analysis Reports The BWA WGS App provides an overview of statistics per sample in the Analysis Reports sample pages. To download the statistics, click PDF Summary Report. } Alignment Summary } Small Variants } Structural Variants } Coverage Histogram Summary BWA Whole Genome Sequencing v1.0 BaseSpace App Guide 15 Alignment Summary Table 2 Alignment Summary Statistic Definition Number of Reads Total number of reads passing filter for this sample. Coverage Total number of aligned bases divided by the genome size. Percent Duplicate Paired Reads Percentage of paired reads that have duplicates. Fragment Length Median Median length of the sequenced fragment. The fragment length is calculated based on the locations at which a read pair aligns to the reference. The read mapping information is parsed from the BAM files. Fragment Length Standard Deviation Standard deviation of the sequenced fragment length. Table 3 Read Level Statistics Statistic Definition Percent Aligned The percentage of reads passing filter that aligned to the reference genome. Percent Q30 The percentage of bases with a quality score of 30 or higher. Mismatch Rate The average percentage of mismatches across both reads 1 and 2 over all cycles. Small Variants Summary Table 4 Small Variants Summary 16 Statistic Definition Total Passing The total number of variants present in the data set that passed the variant quality filters. Percent Found in dbSNP 100*(Number of variants in dbSNP/Number of variants). Het/Hom Ratio Number of heterozygous variants/Number of homozygous variants. Ts/Tv Ratio Transition rate of SNVs that pass the quality filters divided by transversion rate of SNVs that pass the quality filters. Transitions are interchanges of purines (A, G) or of pyrimidines (C, T). Transversions are interchanges between purine and pyrimidine bases (for example, A to T). 15050952 v01 Analysis Output Table 5 Variants by Sequence Context Statistic Definition Number in Genes The number of variants that fall into a gene. Number in Exons The number of variants that fall into an exon. Number in Coding Regions The number of variants that fall into a coding region. Number in UTR Regions The number of variants that fall into an untranslated region (UTR). Number in Mature microRNA The number of variants that fall into a mature microRNA. Number in Splice Site Regions The number of variants that fall into a splice site region. Table 6 Variants by Consequence Statistic Definition Frameshifts The number of variants that cause a frameshift. Non-synonymous The number of variants that cause an amino acid change in a coding region. Synonymous The number of variants that are within a coding region, but do not cause an amino acid change. Stop Gained The number of variants that cause an additional stop codon. Stop Lost The number of variants that cause the loss of a stop codon. Structural Variants Summary This table breaks structural variant output into the classes of variants called, and reports the total number and their overlap with annotated genes. All counts are based on PASS filter variants. Variant Class Definition CNV A copy-number variation (CNV) is a large category of structural variation, which includes insertions, deletions and duplications. CNVs are generally greater than 10 kb. CNVs below 10 kb are filtered but still present in the VCF file. Insertion In an insertion, nucleotides are added between 2 adjacent nucleotides in the reference sequence. The insertions in the structural variants category are 51 bp or greater. Tandem Duplication In a tandem duplication, a segment of a chromosome is duplicated front to end, with both segments in the same orientation. The segments are 51 bp or greater. BWA Whole Genome Sequencing v1.0 BaseSpace App Guide 17 Variant Class Definition Deletion In a deletion, contiguous nucleotides are absent compared to the reference sequence. The deletions in the structural variants category are 51 bp or greater. Inversion An inversion is a chromosome rearrangement in which a segment of a chromosome is reversed end to end. An inversion occurs when a single chromosome undergoes breakage and rearrangement within itself. The segments are 51 bp or greater. Coverage Histogram The coverage histogram shows the number of reference bases plotted against the depth of coverage (read depth). The features include the following: } The drop-down menu lets you look at the overall picture, or highlight a particular chromosome. } The Fix Y Scale checkbox lets you keep the Y Scale the same when comparing multiple chromosomes. } The Export TSV link lets you to export the coverage data in a tab-separated txt. file. 18 15050952 v01 Revision History Revision History Document Document # 15050952 v01 Date January 2016 Description of Change Reorganized topics, updated writing style. BWA Whole Genome Sequencing v1.0 BaseSpace App Guide 19 Notes For technical assistance, contact Illumina Technical Support. Table 7 Illumina General Contact Information Website Email www.illumina.com [email protected] Table 8 Illumina Customer Support Telephone Numbers Region Contact Number Region North America 1.800.809.4566 Japan Australia 1.800.775.688 Netherlands Austria 0800.296575 New Zealand Belgium 0800.81102 Norway China 400.635.9898 Singapore Denmark 80882346 Spain Finland 0800.918363 Sweden France 0800.911850 Switzerland Germany 0800.180.8994 Taiwan Hong Kong 800960230 United Kingdom Ireland 1.800.812949 Other countries Italy 800.874909 Contact Number 0800.111.5011 0800.0223859 0800.451.650 800.16836 1.800.579.2745 900.812168 020790181 0800.563118 00806651752 0800.917.0041 +44.1799.534000 Safety data sheets (SDSs)—Available on the Illumina website at support.illumina.com/sds.html. Product documentation—Available for download in PDF from the Illumina website. Go to support.illumina.com, select a product, then select Documentation & Literature. BWA Whole Genome Sequencing v1.0 BaseSpace App Guide Technical Assistance Technical Assistance Illumina 5200 Illumina Way San Diego, California 92122 U.S.A. +1.800.809.ILMN (4566) +1.858.202.4566 (outside North America) [email protected] www.illumina.com
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
advertisement