MiSeq Reporter Resequencing Workflow Reference Guide

FOR RESEARCH USE ONLY

Revision History
Introduction
Resequencing Workflow Overview
Resequencing Summary Tab
Resequencing Details Tab
Optional Settings for the Resequencing Workflow
Analysis Output Files
Technical Assistance

ILLUMINA PROPRIETARY
Part # 15042319 Rev. F
December 2014

This document and its contents are proprietary to Illumina, Inc. and its affiliates ("Illumina"), and are intended solely for the contractual use of its customer in connection with the use of the product(s) described herein and for no other purpose. This document and its contents shall not be used or distributed for any other purpose and/or otherwise communicated, disclosed, or reproduced in any way whatsoever without the prior written consent of Illumina. Illumina does not convey any license under its patent, trademark, copyright, or common-law rights nor similar rights of any third parties by this document.

The instructions in this document must be strictly and explicitly followed by qualified and properly trained personnel in order to ensure the proper and safe use of the product(s) described herein. All of the contents of this document must be fully read and understood prior to using such product(s).

FAILURE TO COMPLETELY READ AND EXPLICITLY FOLLOW ALL OF THE INSTRUCTIONS CONTAINED HEREIN MAY RESULT IN DAMAGE TO THE PRODUCT(S), INJURY TO PERSONS, INCLUDING TO USERS OR OTHERS, AND DAMAGE TO OTHER PROPERTY.

ILLUMINA DOES NOT ASSUME ANY LIABILITY ARISING OUT OF THE IMPROPER USE OF THE PRODUCT(S) DESCRIBED HEREIN (INCLUDING PARTS THEREOF OR SOFTWARE) OR ANY USE OF SUCH PRODUCT(S) OUTSIDE THE SCOPE OF THE EXPRESS WRITTEN LICENSES OR PERMISSIONS GRANTED BY ILLUMINA IN CONNECTION WITH CUSTOMER'S ACQUISITION OF SUCH PRODUCT(S).

© 2014 Illumina, Inc. All rights reserved.
Illumina, 24sure, BaseSpace, BeadArray, BlueFish, BlueFuse, BlueGnome, cBot, CSPro, CytoChip, DesignStudio, Epicentre, GAIIx, Genetic Energy, Genome Analyzer, GenomeStudio, GoldenGate, HiScan, HiSeq, HiSeq X, Infinium, iScan, iSelect, ForenSeq, MiSeq, MiSeqDx, MiSeq FGx, NeoPrep, Nextera, NextBio, NextSeq, Powered by Illumina, SeqMonitor, SureMDA, TruGenome, TruSeq, TruSight, Understand Your Genome, UYG, VeraCode, verifi, VeriSeq, the pumpkin orange color, and the streaming bases design are trademarks of Illumina, Inc. and/or its affiliate(s) in the U.S. and/or other countries. All other names, logos, and other trademarks are the property of their respective owners.

Revision History
Part #, Revision, Date: Description of Change
15042319, F, December 2014: Added a note in the Demultiplexing section about the default index recognition for index pairs that differ by < 3 bases.
15042319, E, September 2014: Updated the description of the variants table with directions when there are > 2000 variants.
15042319, D, February 2014: Updated to changes introduced in MiSeq Reporter v2.4:
• Added alignment method to the description of the BAM file header.
• Added the command line and annotation algorithm to the description of VCF file header.
15042319, C, November 2013: Corrected error in the description of paired-end evaluation.
15042319, B, August 2013: Added descriptions of optional settings: FlagPCRDuplicates, ReverseComplement, and VariantMinimumGQCutoff (also known as VariantFilterQualityCutoff).
15042319, A, June 2013: Initial release. The information provided within was previously included in the MiSeq Reporter User Guide. With this release, the MiSeq Reporter User Guide contains information about the interface, how to view run results, how to requeue a run, and how to install and configure the software. Information specific to the Resequencing workflow is provided in this guide.
Introduction
The Resequencing workflow aligns reads against the reference genome specified in the sample sheet, and then performs variant analysis. This workflow is intended for small genomes, such as E. coli.
In the MiSeq Reporter Analyses tab, a run folder associated with the Resequencing workflow is represented with the letter R. For more information about the software interface, see the MiSeq Reporter User Guide (part # 15042295).
This guide describes the analysis steps performed in the Resequencing workflow, the types of data that appear on the interface, and the analysis output files generated by the workflow.

Workflow Requirements
• Reference genome: The Resequencing workflow requires a reference genome to provide the chromosome and start coordinate in the alignment output file. Specify the path to the genome folder in the sample sheet. For more information, see the MiSeq Sample Sheet Quick Reference Guide (part # 15028392).

Resequencing Workflow Overview
The Resequencing workflow compares the DNA sequence in the samples against a reference genome and identifies any variants (SNPs or indels) relative to the reference sequence.
The Resequencing workflow demultiplexes indexed reads, generates FASTQ files, aligns reads to a reference, identifies variants, and writes output files to the Alignment folder.
NOTE: The Resequencing workflow is intended for whole-genome sequencing of small genomes. The workflow is not appropriate for use with nontargeted sequencing of human samples because coverage is too low. Consider using a targeted workflow, such as the Enrichment workflow, to analyze targeted human sequencing data.

Demultiplexing
Demultiplexing separates data from pooled samples based on short index sequences that tag samples from different libraries. Index reads are identified using the following steps:
• Samples are numbered starting from 1 based on the order they are listed in the sample sheet.
• Sample number 0 is reserved for clusters that were not successfully assigned to a sample.
• Clusters are assigned to a sample when the index sequence matches exactly or there is up to a single mismatch per Index Read.
NOTE: Illumina indexes are designed so that any index pair differs by ≥ 3 bases, allowing for a single mismatch in index recognition. Index sets that are not from Illumina can include pairs of indexes that differ by < 3 bases. In such cases, the software detects the insufficient difference and overrides the default index recognition (mismatch=1), performing demultiplexing using only perfect index matches (mismatch=0).
When demultiplexing is complete, one demultiplexing file named DemultiplexSummaryF1L1.txt is written to the Alignment folder. The file name and contents are as follows:
• In the file name, F1 represents the flow cell number.
• In the file name, L1 represents the lane number, which is always L1 for MiSeq.
• The file reports demultiplexing results in a table with one row per tile and one column per sample, including sample 0.
• The file reports the most commonly occurring sequences for the index reads.

FASTQ File Generation
MiSeq Reporter generates intermediate analysis files in the FASTQ format, which is a text format used to represent sequences. FASTQ files contain reads for each sample and their quality scores, excluding reads identified as in-line controls and clusters that did not pass filter.
FASTQ files are the primary input for alignment. The files are written to the BaseCalls folder (Data\Intensities\BaseCalls) in the MiSeqAnalysis folder, and then copied to the BaseCalls folder in the MiSeqOutput folder. Each FASTQ file contains reads for only one sample, and the name of that sample is included in the FASTQ file name.
For more information about FASTQ files, see the MiSeq Reporter User Guide (part # 15042295).
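The index-matching rules from the Demultiplexing section can be illustrated with a short Python sketch. This is not the MiSeq Reporter implementation. It assumes a single Index Read per cluster, and the function names (hamming, allowed_mismatches, assign_sample) and the example index sequences are hypothetical, chosen only for illustration.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length index sequences differ."""
    if len(a) != len(b):
        raise ValueError("index sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def allowed_mismatches(sample_indexes: list[str]) -> int:
    """Return 1 when every index pair differs by >= 3 bases, otherwise 0."""
    for a, b in combinations(sample_indexes, 2):
        if hamming(a, b) < 3:
            return 0  # indexes too similar: require perfect matches
    return 1


def assign_sample(index_read: str, sample_indexes: list[str]) -> int:
    """Return the 1-based sample number, or 0 when the read is unassigned."""
    limit = allowed_mismatches(sample_indexes)
    for number, index in enumerate(sample_indexes, start=1):
        if hamming(index_read, index) <= limit:
            return number
    return 0  # sample 0 collects clusters that match no sample


# Illustrative sample sheet order: sample 1, sample 2, sample 3.
indexes = ["ATCACG", "CGATGT", "TTAGGC"]
print(assign_sample("ATCACC", indexes))  # one mismatch to sample 1 -> 1
print(assign_sample("GGGGGG", indexes))  # no sample within one mismatch -> 0
```

Because well-separated indexes differ by at least 3 bases, an index read can lie within one mismatch of at most one of them, so returning the first match is unambiguous in this sketch.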
Alignment
Reads are aligned against the entire reference genome using the Burrows-Wheeler Aligner (BWA), which aligns relatively short nucleotide sequences against a long reference sequence. BWA automatically adjusts parameters based on read lengths and error rates, and then estimates insert size distribution.

Paired-End Evaluation
For paired-end runs, the top-scoring alignment for each read is considered. Reads are flagged as an unresolved pair if either read did not align, or the paired reads aligned to different chromosomes.

Bin/Sort
The bin/sort step groups reads by sample and chromosome, and then sorts by chromosome position. Results are written to one BAM file per sample.

Indel Realignment
Reads near detected indels are realigned to remove alignment artifacts.

Variant Calling
SNPs and short indels are identified using the Genome Analysis Toolkit (GATK), by default. GATK calls raw variants for each sample, analyzes variants against known variants, and then calculates a false discovery rate for each variant. Variants are flagged as homozygous (1/1) or heterozygous (0/1) in the VCF file sample column. For more information, see www.broadinstitute.org/gatk.
Alternatively, you can specify the somatic variant caller using the VariantCaller sample sheet setting.

Variant Annotation
If the SNP database (dbsnp.txt) is available in the Annotation subfolder of the reference genome folder, any known SNPs or indels are flagged in the VCF output file.
If a reference gene database (refGene.txt) is available in the Annotation subfolder of the reference genome folder, any SNPs or indels that occur within known genes are annotated.

Statistics Reporting
Statistics are summarized and reported, and written to the Alignment folder.

Resequencing Summary Tab
The Summary tab for the Resequencing workflow includes a low percentages graph, a high percentages graph, a clusters graph, and a mismatch graph.
• Low percentages graph: Shows phasing, prephasing, and mismatches in percentages. Low percentages indicate good run statistics.
• High percentages graph: Shows clusters passing filter, alignment to a reference, and intensities in percentages. High percentages indicate good run statistics.
• Clusters graph: Shows numbers of raw clusters, clusters passing filter, clusters that did not align, clusters not associated with an index, and duplicates.
• Mismatch graph: Shows mismatches per cycle. A mismatch refers to any mismatch between the sequencing read and a reference genome after alignment.

Low Percentages Graph
Y Axis: Percent
X Axis: Description
Phasing 1: The percentage of molecules in a cluster that fall behind the current cycle within Read 1.
Phasing 2: The percentage of molecules in a cluster that fall behind the current cycle within Read 2.
Prephasing 1: The percentage of molecules in a cluster that run ahead of the current cycle within Read 1.
Prephasing 2: The percentage of molecules in a cluster that run ahead of the current cycle within Read 2.
Mismatch 1: The average percentage of mismatches for Read 1 over all cycles.
Mismatch 2: The average percentage of mismatches for Read 2 over all cycles.

High Percentages Graph
Y Axis: Percent
X Axis: Description
PF: The percentage of clusters passing filters.
Align 1: The percentage of clusters that aligned to the reference in Read 1.
Align 2: The percentage of clusters that aligned to the reference in Read 2.
I20 / I1 1: The ratio of intensities at cycle 20 to the intensities at cycle 1 for Read 1.
I20 / I1 2: The ratio of intensities at cycle 20 to the intensities at cycle 1 for Read 2.
PE Resynthesis: The ratio of first cycle intensities for Read 1 to first cycle intensities for Read 2.
Clusters Graph
Y Axis: Clusters
X Axis: Description
Raw: The total number of clusters detected in the run.
PF: The total number of clusters passing filter in the run.
Unaligned: The total number of clusters passing filter that did not align to the reference genome, if applicable. Clusters that are unindexed are not included in the unaligned count.
Unindexed: The total number of clusters passing filter that were not associated with any index sequence in the run.
Duplicate: The total number of clusters for a paired-end sequencing run that are considered to be PCR duplicates. PCR duplicates are defined as two clusters from a paired-end run where both clusters have the exact same alignment positions for each read.

Mismatch Graph
Y Axis: Percent
X Axis: Cycle
Description: Plots the percentage of mismatches for all clusters in a run by cycle.

Resequencing Details Tab
The Details tab for the Resequencing workflow includes a samples table, targets table, coverage graph, Q-score graph, variant score graph, and variants table.
• Samples table: Summarizes the sequencing results for each sample.
• Targets table: Shows statistics for a particular sample and chromosome.
• Coverage graph: Shows read depth at a given position in the reference.
• Q-score graph: Shows the average quality score. For a quality score Q, the estimated probability of an error is 10^(-Q/10). For example, a score of Q30 has an error rate of 1 in 1000, or 0.1%.
• Variant score graph: Shows the location of SNPs and indels.
• Variants table: Summarizes differences between sample DNA and the reference. Both SNPs and indels are reported.
The variants table shows up to 2000 variants for the selected sample and chromosome. If there are > 2000 variants, open the .vcf file of the sample to view the complete list.

Samples Table
Column: Description
#: An ordinal identification number in the table.
Sample ID: The sample ID from the sample sheet. Sample ID must always be a unique value.
Sample Name: The sample name from the sample sheet.
Cluster PF: The number of clusters passing filter for the sample that aligned to the reference genome.
Cluster Align: The total count of PF clusters aligning for the sample (Read 1/Read 2).
Mismatch: The percentage mismatch to reference averaged over cycles per read (Read 1/Read 2).
No Call: The percentage of bases that could not be called (no-call) for the sample averaged over cycles per read (Read 1/Read 2).
Coverage: Median coverage (number of bases aligned to a given reference position) averaged over all positions.
Het SNPs: The number of heterozygous SNPs detected for the sample.
Hom SNPs: The number of homozygous SNPs detected for the sample.
Insertions: The number of insertions detected for the sample.
Deletions: The number of deletions detected for the sample.
Median Len: The median fragment length for the sample.
Genome: The name of the reference genome.

Targets Table
Column: Description
#: An ordinal identification number in the table.
Chr: The reference target or chromosome name.
Cluster PF: The number of clusters passing filter for the sample that aligned to the reference genome.
Cluster Align: The total count of PF clusters aligning for the sample (Read 1/Read 2).
Mismatch: The percentage mismatch to reference averaged over cycles per read (Read 1/Read 2).
No Call: The percentage of bases that could not be called (no-call) for the sample averaged over cycles per read (Read 1/Read 2).
Coverage: Median coverage (number of bases aligned to a given reference position) averaged over all positions.
Het SNPs: The number of heterozygous SNPs detected for the sample.
Hom SNPs: The number of homozygous SNPs detected for the sample.
Insertions: The number of insertions detected for the sample.
Deletions: The number of deletions detected for the sample.
Genome: The name of the reference genome.

Coverage Graph
Y Axis: Coverage
X Axis: Position
Description: The green curve is the number of aligned reads that cover each position in the reference. The red curve is the number of aligned reads that have a miscall at this position in the reference. SNPs and other variants show up as spikes in the red curve.

Q-Score Graph
Y Axis: QScore
X Axis: Position
Description: The average quality score of bases at the given position of the reference.

Variant Score Graph
Y Axis: Score
X Axis: Position
Description: Graphically depicts quality score and the position of SNPs and indels.

Variants Table
Column: Description
#: An ordinal identification number in the table.
Sample ID: The sample ID from the sample sheet. Sample ID must always be a unique value.
Sample Name: The sample name from the sample sheet.
Chr: The reference target or chromosome name.
Position: The position at which the variant was found.
Score: The quality score for this variant.
Variant Type: The variant type, which can be either SNP or indel.
Call: A string representing how the base or bases changed at this location in the reference.
Frequency: The fraction of reads for the sample that includes the variant. For example, if the reference base is A, and sample 1 has 60 A reads and 40 T reads, then the SNP has a variant frequency of 0.4.
Depth: The number of reads for a sample covering a particular position. The GATK variant caller subsamples data in regions of high coverage. The GATK subsampling limit is 5000 in MiSeq Reporter v2.2, raised from 250 in v2.1.
Filter: The criteria for a filtered variant.
dbSNP: The dbSNP name of the variant, if applicable.
RefGene: The gene according to RefGene in which this variant appears.
Genome: The name of the reference genome.
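The quality score and variant frequency arithmetic used on the Details tab can be checked with a small Python sketch. It is an illustration only, not MiSeq Reporter code. The function names are hypothetical, and the frequency calculation assumes that only reference and variant reads cover the position, as in the 60 A / 40 T example.

```python
def phred_error_probability(q: float) -> float:
    """Estimated probability of an error for a Phred-scaled quality score Q."""
    return 10 ** (-q / 10)


def variant_frequency(ref_reads: int, alt_reads: int) -> float:
    """Fraction of reads at a position that include the variant.

    Assumes depth is the sum of reference and variant reads (hypothetical
    simplification that matches the worked example in the guide).
    """
    depth = ref_reads + alt_reads
    if depth == 0:
        raise ValueError("no reads cover this position")
    return alt_reads / depth


# Worked example from the Frequency column: 60 A reads and 40 T reads.
print(variant_frequency(60, 40))  # 0.4
```

Calling phred_error_probability(30) gives an error probability of about 0.001, which corresponds to the 1 in 1000 (0.1%) error rate quoted for Q30.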
Optional Settings for the Resequencing Workflow
Sample sheet settings are optional commands that control various analysis parameters.
Settings are used in the Settings section of the sample sheet and require a setting name and a setting value.
• If you are viewing or editing the sample sheet in Excel, the setting name resides in the first column and the setting value in the second column.
• If you are viewing or editing the sample sheet in a text editor such as Notepad, follow the setting name with a comma and a setting value. Do not include a space between the comma and the setting value.
Example: VariantCaller,Somatic
The following optional settings are compatible with the Resequencing workflow.

Sample Sheet Settings for Analysis
Parameter: Description
FlagPCRDuplicates: Settings are 0 or 1. Default is 1, filtering. If set to 1, PCR duplicates are flagged in the BAM files and not used for variant calling. PCR duplicates are defined as two clusters from a paired-end run where both clusters have the exact same alignment positions for each read. For single-read runs, duplicates (including PCR duplicates) are not flagged. (Formerly FilterPCRDuplicates. FilterPCRDuplicates is acceptable for backward compatibility.)
ReverseComplement: Settings are 0 or 1. Default is 1, reads are reverse-complemented. If set to true (1), all reads are reverse-complemented as they are written to FASTQ files. This setting is necessary in certain cases, such as processing of mate pair data using BWA, which expects paired-end data. This setting can disrupt per-cycle metrics. Use the default setting of 1 when using the Resequencing workflow with Nextera Mate Pair libraries.
VariantCaller: Specify one of the following variant caller settings:
• GATK (default)
• Somatic (recommended for tumor samples)
• None (no variant calling)
When using the default variant caller for the workflow, it is not necessary to specify the variant calling method in the sample sheet.

Sample Sheet Settings for Variant Calling
Setting Name: Description
VariantMinimumGQCutoff: This setting filters variants if the genotype quality (GQ) is less than the threshold. GQ is a measure of the quality of the genotype call and has a maximum value of 99. (Formerly VariantFilterQualityCutoff, which is acceptable for backward compatibility.) Default value:
• 30 for GATK
• 30 for the somatic variant caller

Analysis Output Files
The following analysis output files are generated for the Resequencing workflow and provide analysis results for alignment and variant calling.
File Name: Description
*.bam files: Contains aligned reads for a given sample. Located in Data\Intensities\BaseCalls\Alignment.
*.vcf files: Contains information about variants found at specific positions in a reference genome. Located in Data\Intensities\BaseCalls\Alignment.
For descriptions of BAM and VCF file formats, see the MiSeq Reporter User Guide (part # 15042295).

BAM File Format
A BAM file (*.bam) is the compressed binary version of a SAM file that is used to represent aligned sequences. SAM and BAM formats are described in detail on the SAM Tools website: samtools.sourceforge.net.
BAM files are written to the alignment folder in Data\Intensities\BaseCalls\Alignment. BAM files use the file naming format of SampleName_S#.bam, where # is the sample number determined by the order that samples are listed in the sample sheet.
BAM files contain a header section and an alignments section:
• Header: Contains information about the entire file, such as sample name, sample length, and alignment method.
Alignments in the alignments section are associated with specific information in the header section. Alignment methods include banded Smith-Waterman, Burrows-Wheeler Aligner (BWA), and Bowtie. The term Isis indicates that an Illumina alignment method is in use, which is the banded Smith-Waterman method.
• Alignments: Contains read name, read sequence, read quality, and custom tags.
GA23_40:8:1:10271:11781 64 chr22 17552189 8 35M * 0 0 TACAGACATCCACCACCACACCCAGCTAATTTTTG IIIII>FA?C::B=:GGGB>GGGEGIIIHI3EEE# BC:Z:ATCACG XD:Z:55 SM:I:8
The alignment record includes the read name GA23_40:8:1:10271:11781, the chromosome and start coordinate chr22 17552189, the alignment quality 8, and the match descriptor 35M * 0 0.
BAM files are suitable for viewing with an external viewer such as IGV or the UCSC Genome Browser.
BAM index files (*.bam.bai) provide an index of the corresponding BAM file.

VCF File Format
VCF is a widely used file format developed by the genomics scientific community that contains information about variants found at specific positions in a reference genome.
VCF files use the file naming format SampleName_S#.vcf, where # is the sample number determined by the order that samples are listed in the sample sheet.
• VCF File Header: Includes the VCF file format version and the variant caller version. The header lists the annotations used in the remainder of the file. If MARS is listed as the annotator, the Illumina internal annotation algorithm is in use to annotate the VCF file. The VCF header also contains the command line call used by MiSeq Reporter to run the variant caller. The command-line call specifies all parameters used by the variant caller, including the reference genome file and .bam file. The last line in the header is column headings for the data lines. For more information, see VCF File Annotations.
##fileformat=VCFv4.1
##FORMAT=<ID=GQX,Number=1,Type=Integer>
##FORMAT=<ID=AD,Number=.,Type=Integer>
##FORMAT=<ID=DP,Number=1,Type=Integer>
##FORMAT=<ID=GQ,Number=1,Type=Float>
##FORMAT=<ID=GT,Number=1,Type=String>
##FORMAT=<ID=PL,Number=G,Type=Integer>
##FORMAT=<ID=VF,Number=1,Type=Float>
##INFO=<ID=TI,Number=.,Type=String>
##INFO=<ID=GI,Number=.,Type=String>
##INFO=<ID=EXON,Number=0,Type=Flag>
##INFO=<ID=FC,Number=.,Type=String>
##INFO=<ID=IndelRepeatLength,Number=1,Type=Integer>
##INFO=<ID=AC,Number=A,Type=Integer>
##INFO=<ID=AF,Number=A,Type=Float>
##INFO=<ID=AN,Number=1,Type=Integer>
##INFO=<ID=DP,Number=1,Type=Integer>
##INFO=<ID=QD,Number=1,Type=Float>
##FILTER=<ID=LowQual>
##FILTER=<ID=R8>
##annotator=MARS
##CallSomaticVariants_cmdline=" -B D:\Amplicon_DS_Soma2\121017_M00948_0054_000000000A2676_Binf02\Data\Intensities\BaseCalls\Alignment3_Tamsen_SomaWorker -g [D:\Genomes\Homo_sapiens\UCSC\hg19\Sequence\WholeGenomeFASTA,] -f 0.01 -fo False -b 20 -q 100 -c 300 -s 0.5 -a 20 -F 20 -gVCF True -i true -PhaseSNPs true -MaxPhaseSNPLength 100 -r D:\Amplicon_DS_Soma2\121017_M00948_0054_000000000-A2676_Binf02"
##reference=file://d:\Genomes\Homo_sapiens\UCSC\hg19\Sequence\WholeGenomeFASTA\genome.fa
##source=GATK 1.6
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 10002 - R1
• VCF File Data Lines: Contains information about a single variant. Data lines are listed under the column headings included in the header.

VCF File Headings
The VCF file format is flexible and extensible, so not all VCF files contain the same fields. The following tables describe VCF files generated by MiSeq Reporter.
Heading: Description
CHROM: The chromosome of the reference genome. Chromosomes appear in the same order as the reference FASTA file.
POS: The single-base position of the variant in the reference chromosome.
For SNPs, this position is the reference base with the variant; for insertions or deletions, this position is the reference base immediately before the variant.
ID: The rs number for the SNP obtained from dbSNP.txt, if applicable. If there are multiple rs numbers at this location, the list is semicolon delimited. If no dbSNP entry exists at this position, a missing value marker ('.') is used.
REF: The reference genotype. For example, a deletion of a single T is represented as reference TT and alternate T.
ALT: The alleles that differ from the reference read. For example, an insertion of a single T is represented as reference A and alternate AT.
QUAL: A Phred-scaled quality score assigned by the variant caller. Higher scores indicate higher confidence in the variant and lower probability of errors. For a quality score of Q, the estimated probability of an error is 10^(-Q/10). For example, the set of Q30 calls has a 0.1% error rate. Many variant callers assign quality scores based on their statistical models, which are high relative to the error rate observed.

VCF File Annotations
Heading: Description
FILTER: If all filters are passed, PASS is written in the filter column.
• LowDP: Applied to sites with depth of coverage below a cutoff. Configure cutoff using the MinimumCoverageDepth sample sheet setting.
• LowGQ: The genotyping quality (GQ) is below a cutoff. Configure cutoff using the VariantMinimumGQCutoff sample sheet setting.
• LowQual: The variant quality (QUAL) is below a cutoff. Configure using the VariantMinimumQualCutoff sample sheet setting.
• LowVariantFreq: The variant frequency is less than the given threshold. Configure using the VariantFrequencyFilterCutoff sample sheet setting.
• R8: For an indel, the number of adjacent repeats (1-base or 2-base) in the reference is greater than 8. This filter is configurable using the IndelRepeatFilterCutoff setting in the config file or the sample sheet.
• SB: The strand bias is more than the given threshold.
This filter is configurable using the StrandBiasFilter sample sheet setting; available only for somatic variant caller and GATK.
For more information about sample sheet settings, see MiSeq Sample Sheet Quick Reference Guide (part # 15028392).
INFO: Possible entries in the INFO column include:
• AC: Allele count in genotypes for each ALT allele, in the same order as listed.
• AF: Allele Frequency for each ALT allele, in the same order as listed.
• AN: The total number of alleles in called genotypes.
• CD: A flag indicating that the SNP occurs within the coding region of at least one RefGene entry.
• DP: The depth (number of base calls aligned to a position and used in variant calling). In regions of high coverage, GATK down-samples the available reads.
• Exon: A comma-separated list of exon regions read from RefGene.
• FC: Functional Consequence.
• GI: A comma-separated list of gene IDs read from RefGene.
• QD: Variant Confidence/Quality by Depth.
• TI: A comma-separated list of transcript IDs read from RefGene.
FORMAT: The format column lists fields separated by colons. For example, GT:GQ. The list of fields provided depends on the variant caller used. Available fields include:
• AD: Entry of the form X,Y, where X is the number of reference calls, and Y is the number of alternate calls.
• DP: Approximate read depth; reads with MQ=255 or with bad mates are filtered.
• GQ: Genotype quality.
• GQX: Genotype quality. GQX is the minimum of the GQ value and the QUAL column. In general, these values are similar; taking the minimum makes GQX the more conservative measure of genotype quality.
• GT: Genotype. 0 corresponds to the reference base, 1 corresponds to the first entry in the ALT column, and so on. The forward slash (/) indicates that no phasing information is available.
• NL: Noise level; an estimate of base calling noise at this position.
• PL: Normalized, Phred-scaled likelihoods for genotypes.
• SB: Strand bias at this position.
Larger negative values indicate less bias; values near zero indicate more bias.
• VF: Variant frequency; the percentage of reads supporting the alternate allele.
SAMPLE: The sample column gives the values specified in the FORMAT column.

Supplementary Output Files
The following output files provide supplementary information, or summarize run results and analysis errors. Although these files are not required for assessing analysis results, they can be used for troubleshooting purposes.
File Name: Description
AdapterTrimming.txt: Lists the number of trimmed bases and percentage of bases for each tile. This file is present only if adapter trimming was specified for the run. Located in Data\Intensities\BaseCalls\Alignment.
AnalysisLog.txt: Processing log that describes every step that occurred during analysis of the current run folder. This file does not contain error messages. Located in the root level of the run folder.
AnalysisError.txt: Processing log that lists any errors that occurred during analysis. This file is present only if errors occurred. Located in the root level of the run folder.
CompletedJobInfo.xml: Written after analysis is complete, contains information about the run, such as date, flow cell ID, software version, and other parameters. Located in the root level of the run folder.
DemultiplexSummaryF1L1.txt: Reports demultiplexing results in a table with one row per tile and one column per sample. Located in Data\Intensities\BaseCalls\Alignment.
ErrorsAndNoCallsByLaneTileReadCycle.csv: A comma-separated values file that contains the percentage of errors and no-calls for each tile, read, and cycle. Located in Data\Intensities\BaseCalls\Alignment.
Mismatch.htm: Contains histograms of mismatches per cycle and no-calls per cycle for each tile. Located in Data\Intensities\BaseCalls\Alignment.
ResequencingRunStatistics.xml: Contains summary statistics specific to the run.
Located in the root level of the run folder.
Summary.xml: Contains a summary of mismatch rates and other base calling results. Located in Data\Intensities\BaseCalls\Alignment.
Summary.htm: Contains a summary web page generated from Summary.xml. Located in Data\Intensities\BaseCalls\Alignment.

Technical Assistance
For technical assistance, contact Illumina Technical Support.

Table 1 Illumina General Contact Information
Website: www.illumina.com
Email: [email protected]

Table 2 Illumina Customer Support Telephone Numbers
Region: Contact Number
North America: 1.800.809.4566
Australia: 1.800.775.688
Austria: 0800.296575
Belgium: 0800.81102
Denmark: 80882346
Finland: 0800.918363
France: 0800.911850
Germany: 0800.180.8994
Ireland: 1.800.812949
Italy: 800.874909
Netherlands: 0800.0223859
New Zealand: 0800.451.650
Norway: 800.16836
Spain: 900.812168
Sweden: 020790181
Switzerland: 0800.563118
United Kingdom: 0800.917.0041
Other countries: +44.1799.534000

Safety Data Sheets
Safety data sheets (SDSs) are available on the Illumina website at support.illumina.com/sds.html.

Product Documentation
Product documentation in PDF is available for download from the Illumina website. Go to support.illumina.com, select a product, then click Documentation & Literature.

Illumina
San Diego, California 92122 U.S.A.
+1.800.809.ILMN (4566)
+1.858.202.4566 (outside North America)
[email protected]
www.illumina.com