MiSeq Reporter Resequencing Workflow Guide (15042319 Rev. F)

MiSeq Reporter
Resequencing Workflow
Reference Guide
FOR RESEARCH USE ONLY
Revision History
Introduction
Resequencing Workflow Overview
Resequencing Summary Tab
Resequencing Details Tab
Optional Settings for the Resequencing Workflow
Analysis Output Files
Technical Assistance
ILLUMINA PROPRIETARY
Part # 15042319 Rev. F
December 2014
This document and its contents are proprietary to Illumina, Inc. and its affiliates ("Illumina"), and are intended solely for the
contractual use of its customer in connection with the use of the product(s) described herein and for no other purpose. This
document and its contents shall not be used or distributed for any other purpose and/or otherwise communicated, disclosed,
or reproduced in any way whatsoever without the prior written consent of Illumina. Illumina does not convey any license
under its patent, trademark, copyright, or common-law rights nor similar rights of any third parties by this document.
The instructions in this document must be strictly and explicitly followed by qualified and properly trained personnel in order
to ensure the proper and safe use of the product(s) described herein. All of the contents of this document must be fully read
and understood prior to using such product(s).
FAILURE TO COMPLETELY READ AND EXPLICITLY FOLLOW ALL OF THE INSTRUCTIONS CONTAINED HEREIN
MAY RESULT IN DAMAGE TO THE PRODUCT(S), INJURY TO PERSONS, INCLUDING TO USERS OR OTHERS, AND
DAMAGE TO OTHER PROPERTY.
ILLUMINA DOES NOT ASSUME ANY LIABILITY ARISING OUT OF THE IMPROPER USE OF THE PRODUCT(S)
DESCRIBED HEREIN (INCLUDING PARTS THEREOF OR SOFTWARE) OR ANY USE OF SUCH PRODUCT(S) OUTSIDE
THE SCOPE OF THE EXPRESS WRITTEN LICENSES OR PERMISSIONS GRANTED BY ILLUMINA IN CONNECTION
WITH CUSTOMER'S ACQUISITION OF SUCH PRODUCT(S).
FOR RESEARCH USE ONLY
© 2014 Illumina, Inc. All rights reserved.
Illumina, 24sure, BaseSpace, BeadArray, BlueFish, BlueFuse, BlueGnome, cBot, CSPro, CytoChip, DesignStudio,
Epicentre, GAIIx, Genetic Energy, Genome Analyzer, GenomeStudio, GoldenGate, HiScan, HiSeq, HiSeq X, Infinium,
iScan, iSelect, ForenSeq, MiSeq, MiSeqDx, MiSeq FGx, NeoPrep, Nextera, NextBio, NextSeq, Powered by Illumina,
SeqMonitor, SureMDA, TruGenome, TruSeq, TruSight, Understand Your Genome, UYG, VeraCode, verifi, VeriSeq, the
pumpkin orange color, and the streaming bases design are trademarks of Illumina, Inc. and/or its affiliate(s) in the U.S. and/or
other countries. All other names, logos, and other trademarks are the property of their respective owners.
Revision History
Part # | Revision | Date | Description of Change
15042319 | F | December 2014 | Added a note in the Demultiplexing section about the default index recognition for index pairs that differ by < 3 bases.
15042319 | E | September 2014 | Updated the description of the variants table with directions when there are > 2000 variants.
15042319 | D | February 2014 | Updated to changes introduced in MiSeq Reporter v2.4: added alignment method to the description of the BAM file header; added the command line and annotation algorithm to the description of the VCF file header.
15042319 | C | November 2013 | Corrected error in the description of paired-end evaluation.
15042319 | B | August 2013 | Added descriptions of optional settings: FlagPCRDuplicates, ReverseComplement, and VariantMinimumGQCutoff (also known as VariantFilterQualityCutoff).
15042319 | A | June 2013 | Initial release. The information provided within was previously included in the MiSeq Reporter User Guide. With this release, the MiSeq Reporter User Guide contains information about the interface, how to view run results, how to requeue a run, and how to install and configure the software. Information specific to the Resequencing workflow is provided in this guide.
Introduction
The Resequencing workflow aligns reads against the reference genome specified in the
sample sheet, and then performs variant analysis. This workflow is intended for small
genomes, such as E. coli.
In the MiSeq Reporter Analyses tab, a run folder associated with the Resequencing
workflow is represented with the letter R. For more information about the software
interface, see the MiSeq Reporter User Guide (part # 15042295).
This guide describes the analysis steps performed in the Resequencing workflow, the
types of data that appear on the interface, and the analysis output files generated by the
workflow.
Workflow Requirements
} Reference genome—The Resequencing workflow requires a reference genome to
provide the chromosome and start coordinate in the alignment output file. Specify
the path to the genome folder in the sample sheet. For more information, see the
MiSeq Sample Sheet Quick Reference Guide (part # 15028392).
Resequencing Workflow Overview
The Resequencing workflow compares the DNA sequence in the samples against a
reference genome and identifies any variants (SNPs or indels) relative to the reference
sequence.
The Resequencing workflow demultiplexes indexed reads, generates FASTQ files, aligns
reads to a reference, identifies variants, and writes output files to the Alignment folder.
NOTE
The Resequencing workflow is intended for whole-genome sequencing of small genomes.
The workflow is not appropriate for use with nontargeted sequencing of human samples
because coverage is too low. Consider using a targeted workflow, such as the Enrichment
workflow, to analyze targeted human sequencing data.
Demultiplexing
Demultiplexing separates data from pooled samples based on short index sequences that
tag samples from different libraries. Index reads are identified using the following steps:
} Samples are numbered starting from 1 based on the order they are listed in the
sample sheet.
} Sample number 0 is reserved for clusters that were not successfully assigned to a
sample.
} Clusters are assigned to a sample when the index sequence matches exactly or there
is up to a single mismatch per Index Read.
NOTE
Illumina indexes are designed so that any index pair differs by ≥ 3 bases, allowing for a
single mismatch in index recognition. Index sets that are not from Illumina can include
pairs of indexes that differ by < 3 bases. In such cases, the software detects the insufficient
difference and modifies the default index recognition (mismatch=1). Instead, the software
performs demultiplexing using only perfect index matches (mismatch=0).
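The assignment rule and the fallback described in the note can be sketched in Python as follows. This is an illustration only, not the MiSeq Reporter implementation: the function names (`hamming`, `allowed_mismatches`, `assign_sample`) are hypothetical, and the sketch assumes a single Index Read per cluster.

```python
from itertools import combinations


def hamming(a, b):
    """Count the positions at which two equal-length index sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def allowed_mismatches(indexes):
    """Return 1 if every index pair differs by >= 3 bases, otherwise 0."""
    if all(hamming(a, b) >= 3 for a, b in combinations(indexes, 2)):
        return 1
    return 0  # indexes too similar: accept perfect matches only


def assign_sample(index_read, indexes):
    """Return the 1-based sample number, or 0 if the cluster is unassigned.

    `indexes` is listed in sample sheet order (hypothetical input format).
    """
    limit = allowed_mismatches(indexes)
    for number, index in enumerate(indexes, start=1):
        if hamming(index_read, index) <= limit:
            return number
    return 0
```

With indexes that all differ by at least 3 bases, a read with one mismatch can be within range of only one index, so the first match found is the only possible match.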
When demultiplexing is complete, one demultiplexing file named
DemultiplexSummaryF1L1.txt is written to the Alignment folder. The file name and
contents are as follows:
} In the file name, F1 represents the flow cell number.
} In the file name, L1 represents the lane number, which is always L1 for MiSeq.
} Reports demultiplexing results in a table with one row per tile and one column per
sample, including sample 0.
} Reports the most commonly occurring sequences for the index reads.
FASTQ File Generation
MiSeq Reporter generates intermediate analysis files in the FASTQ format, which is a text
format used to represent sequences. FASTQ files contain reads for each sample and their
quality scores, excluding reads identified as in-line controls and clusters that did not
pass filter.
FASTQ files are the primary input for alignment. The files are written to the BaseCalls
folder (Data\Intensities\BaseCalls) in the MiSeqAnalysis folder, and then copied to the
BaseCalls folder in the MiSeqOutput folder. Each FASTQ file contains reads for only one
sample, and the name of that sample is included in the FASTQ file name. For more
information about FASTQ files, see the MiSeq Reporter User Guide (part # 15042295).
Alignment
Reads are aligned against the entire reference genome using the Burrows-Wheeler
Aligner (BWA), which aligns relatively short nucleotide sequences against a long
reference sequence. BWA automatically adjusts parameters based on read lengths and
error rates, and then estimates insert size distribution.
Paired-End Evaluation
For paired-end runs, the top-scoring alignment for each read is considered. Reads are
flagged as an unresolved pair if either read did not align, or the paired reads aligned to
different chromosomes.
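The unresolved-pair rule can be expressed as a small check. This is a sketch under assumptions: each read is represented by a hypothetical `(chromosome, position)` tuple for its top-scoring alignment, or `None` if the read did not align.

```python
def is_unresolved_pair(read1, read2):
    """Return True if the pair should be flagged as unresolved.

    Each argument is a (chromosome, position) tuple for the top-scoring
    alignment, or None when the read did not align (assumed representation).
    """
    if read1 is None or read2 is None:
        return True  # either read did not align
    return read1[0] != read2[0]  # reads aligned to different chromosomes
```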
Bin/Sort
The bin/sort step groups reads by sample and chromosome, and then sorts by
chromosome position. Results are written to one BAM file per sample.
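As a rough illustration of the bin/sort step, the following sketch groups reads by sample and chromosome and then sorts each group by position. The dictionary keys (`sample`, `chrom`, `pos`) are hypothetical names chosen for the example, not fields defined by MiSeq Reporter.

```python
from collections import defaultdict


def bin_and_sort(reads):
    """Group reads by (sample, chromosome), then sort each bin by position."""
    bins = defaultdict(list)
    for read in reads:
        bins[(read["sample"], read["chrom"])].append(read)
    for key in bins:
        bins[key].sort(key=lambda read: read["pos"])
    return dict(bins)
```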
Indel Realignment
Reads near detected indels are realigned to remove alignment artifacts.
Variant Calling
SNPs and short indels are identified using the Genome Analysis Toolkit (GATK), by
default. GATK calls raw variants for each sample, analyzes variants against known
variants, and then calculates a false discovery rate for each variant. Variants are flagged
as homozygous (1/1) or heterozygous (0/1) in the VCF file sample column. For more
information, see www.broadinstitute.org/gatk.
Alternatively, you can specify the somatic variant caller using the VariantCaller sample
sheet setting.
Variant Annotation
If the SNP database (dbsnp.txt) is available in the Annotation subfolder of the reference
genome folder, any known SNPs or indels are flagged in the VCF output file. If a
reference gene database (refGene.txt) is available in the Annotation subfolder of the
reference genome folder, any SNPs or indels that occur within known genes are
annotated.
Statistics Reporting
Statistics are summarized and reported, and written to the Alignment folder.
Resequencing Summary Tab
The Summary tab for Resequencing workflow includes a low percentage graph, high
percentage graph, a clusters graph, and a mismatch graph.
} Low percentages graph—Shows phasing, prephasing, and mismatches in
percentages. Low percentages indicate good run statistics.
} High percentages graph—Shows clusters passing filter, alignment to a reference, and
intensities in percentages. High percentages indicate good run statistics.
} Clusters graph—Shows numbers of raw clusters, clusters passing filter, clusters that
did not align, clusters not associated with an index, and duplicates.
} Mismatch graph—Shows mismatches per cycle. A mismatch refers to any mismatch
between the sequencing read and a reference genome after alignment.
Low Percentages Graph
Y Axis | X Axis | Description
Percent | Phasing 1 | The percentage of molecules in a cluster that fall behind the current cycle within Read 1.
Percent | Phasing 2 | The percentage of molecules in a cluster that fall behind the current cycle within Read 2.
Percent | Prephasing 1 | The percentage of molecules in a cluster that run ahead of the current cycle within Read 1.
Percent | Prephasing 2 | The percentage of molecules in a cluster that run ahead of the current cycle within Read 2.
Percent | Mismatch 1 | The average percentage of mismatches for Read 1 over all cycles.
Percent | Mismatch 2 | The average percentage of mismatches for Read 2 over all cycles.
High Percentages Graph
Y Axis | X Axis | Description
Percent | PF | The percentage of clusters passing filters.
Percent | Align 1 | The percentage of clusters that aligned to the reference in Read 1.
Percent | Align 2 | The percentage of clusters that aligned to the reference in Read 2.
Percent | I20 / I1 1 | The ratio of intensities at cycle 20 to the intensities at cycle 1 for Read 1.
Percent | I20 / I1 2 | The ratio of intensities at cycle 20 to the intensities at cycle 1 for Read 2.
Percent | PE Resynthesis | The ratio of first cycle intensities for Read 1 to first cycle intensities for Read 2.
Clusters Graph
Y Axis | X Axis | Description
Clusters | Raw | The total number of clusters detected in the run.
Clusters | PF | The total number of clusters passing filter in the run.
Clusters | Unaligned | The total number of clusters passing filter that did not align to the reference genome, if applicable. Clusters that are unindexed are not included in the unaligned count.
Clusters | Unindexed | The total number of clusters passing filter that were not associated with any index sequence in the run.
Clusters | Duplicate | The total number of clusters for a paired-end sequencing run that are considered to be PCR duplicates. PCR duplicates are defined as two clusters from a paired-end run where both clusters have the exact same alignment positions for each read.
Mismatch Graph
Y Axis | X Axis | Description
Percent | Cycle | Plots the percentage of mismatches for all clusters in a run by cycle.
Resequencing Details Tab
The Details tab for the Resequencing workflow includes a samples table, targets table,
coverage graph, Q-score graph, variant score graph, and variants table.
} Samples table—Summarizes the sequencing results for each sample.
} Targets table—Shows statistics for a particular sample and chromosome.
} Coverage graph—Shows read depth at a given position in the reference.
} Q-score graph—Shows the average quality score. A quality score Q corresponds to an
estimated error probability of 10^(-Q/10). For example, a score of Q30 has an error rate
of 1 in 1000, or 0.1%.
} Variant score graph—Shows the location of SNPs and indels.
} Variants table—Summarizes differences between sample DNA and the reference.
Both SNPs and indels are reported. The variants table shows up to 2000 variants for
the selected sample and chromosome. If there are > 2000 variants, open the .vcf file
of the sample to view the complete list.
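The relationship between a quality score and its estimated error probability, as used by the Q-score graph in the list above, can be checked with a short calculation. The function names are illustrative only.

```python
import math


def error_probability(q):
    """Estimated probability of an error for a Phred-scaled quality score Q."""
    return 10 ** (-q / 10)


def quality_score(p):
    """Phred-scaled quality score for an estimated error probability p."""
    return -10 * math.log10(p)
```

For example, `error_probability(30)` evaluates to approximately 0.001, matching the 1 in 1000 error rate quoted for Q30.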
Samples Table
Column | Description
# | An ordinal identification number in the table.
Sample ID | The sample ID from the sample sheet. Sample ID must always be a unique value.
Sample Name | The sample name from the sample sheet.
Cluster PF | The number of clusters passing filter for the sample that aligned to the reference genome.
Cluster Align | The total count of PF clusters aligning for the sample (Read 1/Read 2).
Mismatch | The percentage mismatch to reference averaged over cycles per read (Read 1/Read 2).
No Call | The percentage of bases that could not be called (no-call) for the sample averaged over cycles per read (Read 1/Read 2).
Coverage | Median coverage (number of bases aligned to a given reference position) averaged over all positions.
Het SNPs | The number of heterozygous SNPs detected for the sample.
Hom SNPs | The number of homozygous SNPs detected for the sample.
Insertions | The number of insertions detected for the sample.
Deletions | The number of deletions detected for the sample.
Median Len | The median fragment length for the sample.
Genome | The name of the reference genome.
Targets Table
Column | Description
# | An ordinal identification number in the table.
Chr | The reference target or chromosome name.
Cluster PF | The number of clusters passing filter for the sample that aligned to the reference genome.
Cluster Align | The total count of PF clusters aligning for the sample (Read 1/Read 2).
Mismatch | The percentage mismatch to reference averaged over cycles per read (Read 1/Read 2).
No Call | The percentage of bases that could not be called (no-call) for the sample averaged over cycles per read (Read 1/Read 2).
Coverage | Median coverage (number of bases aligned to a given reference position) averaged over all positions.
Het SNPs | The number of heterozygous SNPs detected for the sample.
Hom SNPs | The number of homozygous SNPs detected for the sample.
Insertions | The number of insertions detected for the sample.
Deletions | The number of deletions detected for the sample.
Genome | The name of the reference genome.
Coverage Graph
Y Axis | X Axis | Description
Coverage | Position | The green curve is the number of aligned reads that cover each position in the reference. The red curve is the number of aligned reads that have a miscall at this position in the reference. SNPs and other variants show up as spikes in the red curve.
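The quantity plotted by the green curve, the number of aligned reads covering each reference position, can be sketched as a simple per-position count. This sketch assumes ungapped alignments described by a hypothetical `(start, length)` pair with 0-based start positions; it is not how MiSeq Reporter computes the graph.

```python
def coverage(reads, reference_length):
    """Return the number of aligned reads covering each reference position.

    `reads` is a list of (start, length) pairs with 0-based starts
    (assumed representation; gaps and clipping are ignored).
    """
    depth = [0] * reference_length
    for start, length in reads:
        for position in range(start, min(start + length, reference_length)):
            depth[position] += 1
    return depth
```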
Q-Score Graph
Y Axis | X Axis | Description
QScore | Position | The average quality score of bases at the given position of the reference.
Variant Score Graph
Y Axis | X Axis | Description
Score | Position | Graphically depicts quality score and the position of SNPs and indels.
Variants Table
Column | Description
# | An ordinal identification number in the table.
Sample ID | The sample ID from the sample sheet. Sample ID must always be a unique value.
Sample Name | The sample name from the sample sheet.
Chr | The reference target or chromosome name.
Position | The position at which the variant was found.
Score | The quality score for this variant.
Variant Type | The variant type, which can be either SNP or indel.
Call | A string representing how the base or bases changed at this location in the reference.
Frequency | The fraction of reads for the sample that includes the variant. For example, if the reference base is A, and sample 1 has 60 A reads and 40 T reads, then the SNP has a variant frequency of 0.4.
Depth | The number of reads for a sample covering a particular position. The GATK variant caller subsamples data in regions of high coverage. The GATK subsampling limit is 5000 in MiSeq Reporter v2.2, raised from 250 in v2.1.
Filter | The criteria for a filtered variant.
dbSNP | The dbSNP name of the variant, if applicable.
RefGene | The gene according to RefGene in which this variant appears.
Genome | The name of the reference genome.
Optional Settings for the Resequencing Workflow
Sample sheet settings are optional commands that control various analysis parameters.
Settings are used in the Settings section of the sample sheet and require a setting name
and a setting value.
} If you are viewing or editing the sample sheet in Excel, the setting name resides in
the first column and the setting value in the second column.
} If you are viewing or editing the sample sheet in a text editor such as Notepad,
follow the setting name with a comma and a setting value. Do not include a space
between the comma and the setting value.
Example: VariantCaller,Somatic
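The comma-separated form described above can be generated with a small helper. This is a sketch: the helper names are hypothetical, and the `[Settings]` header line assumes the bracketed section naming used in sample sheet files.

```python
def format_setting(name, value):
    """Return one settings line: name, comma, value, with no space after the comma."""
    return f"{name},{str(value).strip()}"


def settings_section(settings):
    """Return the text of a Settings section for a dict of name/value pairs."""
    lines = ["[Settings]"]  # assumed section header format
    for name, value in settings.items():
        lines.append(format_setting(name, value))
    return "\n".join(lines)
```

For example, `format_setting("VariantCaller", "Somatic")` returns the line shown in the example above.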
The following optional settings are compatible with the Resequencing workflow.
Sample Sheet Settings for Analysis
Parameter | Description
FlagPCRDuplicates | Settings are 0 or 1. Default is 1, filtering. If set to 1, PCR duplicates are flagged in the BAM files and not used for variant calling. PCR duplicates are defined as two clusters from a paired-end run where both clusters have the exact same alignment positions for each read. Duplicates are not flagged for single-read runs, including PCR duplicates. (Formerly FilterPCRDuplicates. FilterPCRDuplicates is acceptable for backward compatibility.)
ReverseComplement | Settings are 0 or 1. Default is 1, reads are reverse-complemented. If set to true (1), all reads are reverse-complemented as they are written to FASTQ files. This setting is necessary in certain cases, such as processing of mate pair data using BWA, which expects paired-end data. This setting can disrupt per-cycle metrics. Use the default setting of 1 when using the Resequencing workflow with Nextera Mate Pair libraries.
VariantCaller | Specify one of the following variant caller settings: GATK (default); Somatic (recommended for tumor samples); None (no variant calling). When using the default variant caller for the workflow, it is not necessary to specify the variant calling method in the sample sheet.
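The PCR duplicate definition given for FlagPCRDuplicates (two clusters from a paired-end run with identical alignment positions for each read) can be sketched as follows. The input representation is hypothetical, and the choice to leave the first cluster seen at a given pair of positions unflagged is an assumption of this sketch; the guide does not state which cluster is retained.

```python
def flag_pcr_duplicates(pairs):
    """Return one boolean per cluster: True if it duplicates an earlier cluster.

    Each item in `pairs` is (read1_position, read2_position), where each
    position is a (chromosome, start) tuple (assumed representation).
    """
    seen = set()
    flags = []
    for pair in pairs:
        flags.append(pair in seen)  # same positions for both reads seen before
        seen.add(pair)
    return flags
```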
Sample Sheet Settings for Variant Calling
Setting Name | Description
VariantMinimumGQCutoff | This setting filters variants if the genotype quality (GQ) is less than the threshold. GQ is a measure of the quality of the genotype call and has a maximum value of 99. (Formerly VariantFilterQualityCutoff, which is acceptable for backward compatibility.) Default value: 30 for GATK; 30 for the somatic variant caller.
Analysis Output Files
The following analysis output files are generated for the Resequencing workflow and
provide analysis results for alignment and variant calling.
File Name | Description
*.bam files | Contains aligned reads for a given sample. Located in Data\Intensities\BaseCalls\Alignment.
*.vcf files | Contains information about variants found at specific positions in a reference genome. Located in Data\Intensities\BaseCalls\Alignment.
For descriptions of BAM and VCF file formats, see the MiSeq Reporter User Guide (part #
15042295).
BAM File Format
A BAM file (*.bam) is the compressed binary version of a SAM file that is used to
represent aligned sequences. SAM and BAM formats are described in detail on the SAM
Tools website: samtools.sourceforge.net.
BAM files are written to the alignment folder in Data\Intensities\BaseCalls\Alignment.
BAM files use the file naming format of SampleName_S#.bam, where # is the sample
number determined by the order that samples are listed in the sample sheet.
BAM files contain a header section and an alignments section:
} Header—Contains information about the entire file, such as sample name, sample
length, and alignment method. Alignments in the alignments section are associated
with specific information in the header section.
Alignment methods include banded Smith-Waterman, Burrows-Wheeler Aligner
(BWA), and Bowtie. The term Isis indicates that an Illumina alignment method is in
use, which is the banded Smith-Waterman method.
} Alignments—Contains read name, read sequence, read quality, and custom tags.
GA23_40:8:1:10271:11781 64 chr22 17552189 8 35M * 0 0
TACAGACATCCACCACCACACCCAGCTAATTTTTG
IIIII>FA?C::B=:GGGB>GGGEGIIIHI3EEE#
BC:Z:ATCACG XD:Z:55 SM:I:8
After the read name, the record gives the chromosome and start coordinate
(chr22 17552189), the alignment quality (8), and the match descriptor (35M * 0 0).
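To make the layout of the example record concrete, the following sketch splits it into the eleven mandatory fields named in the SAM specification, followed by the optional tags. The parser is illustrative only and splits on whitespace because the record is shown here as wrapped text; SAM files themselves are tab-delimited.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

# The example alignment record from this guide, joined into one line.
RECORD = ("GA23_40:8:1:10271:11781 64 chr22 17552189 8 35M * 0 0 "
          "TACAGACATCCACCACCACACCCAGCTAATTTTTG "
          "IIIII>FA?C::B=:GGGB>GGGEGIIIHI3EEE# "
          "BC:Z:ATCACG XD:Z:55 SM:I:8")


def parse_sam_record(line):
    """Split one alignment record into named mandatory fields plus tags."""
    parts = line.split()
    record = dict(zip(SAM_FIELDS, parts[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])  # numeric fields
    record["TAGS"] = parts[11:]  # optional TAG:TYPE:VALUE entries
    return record
```

In the parsed result, `RNAME` and `POS` hold the chromosome and start coordinate, `MAPQ` holds the alignment quality, and `CIGAR` holds the 35M portion of the match descriptor.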
BAM files are suitable for viewing with an external viewer such as IGV or the UCSC
Genome Browser.
BAM index files (*.bam.bai) provide an index of the corresponding BAM file.
VCF File Format
VCF is a widely used file format developed by the genomics scientific community that
contains information about variants found at specific positions in a reference genome.
VCF files use the file naming format SampleName_S#.vcf, where # is the sample number
determined by the order that samples are listed in the sample sheet.
VCF File Header—Includes the VCF file format version and the variant caller version.
The header lists the annotations used in the remainder of the file. If MARS is listed as
the annotator, the Illumina internal annotation algorithm is in use to annotate the VCF
file. The VCF header also contains the command line call used by MiSeq Reporter to run
the variant caller. The command-line call specifies all parameters used by the variant
caller, including the reference genome file and .bam file. The last line in the header is
column headings for the data lines. For more information, see VCF File Annotations.
##fileformat=VCFv4.1
##FORMAT=<ID=GQX,Number=1,Type=Integer>
##FORMAT=<ID=AD,Number=.,Type=Integer>
##FORMAT=<ID=DP,Number=1,Type=Integer>
##FORMAT=<ID=GQ,Number=1,Type=Float>
##FORMAT=<ID=GT,Number=1,Type=String>
##FORMAT=<ID=PL,Number=G,Type=Integer>
##FORMAT=<ID=VF,Number=1,Type=Float>
##INFO=<ID=TI,Number=.,Type=String>
##INFO=<ID=GI,Number=.,Type=String>
##INFO=<ID=EXON,Number=0,Type=Flag>
##INFO=<ID=FC,Number=.,Type=String>
##INFO=<ID=IndelRepeatLength,Number=1,Type=Integer>
##INFO=<ID=AC,Number=A,Type=Integer>
##INFO=<ID=AF,Number=A,Type=Float>
##INFO=<ID=AN,Number=1,Type=Integer>
##INFO=<ID=DP,Number=1,Type=Integer>
##INFO=<ID=QD,Number=1,Type=Float>
##FILTER=<ID=LowQual>
##FILTER=<ID=R8>
##annotator=MARS
##CallSomaticVariants_cmdline=" -B D:\Amplicon_DS_Soma2\121017_M00948_0054_000000000A2676_Binf02\Data\Intensities\BaseCalls\Alignment3_Tamsen_SomaWorker -g [D:\Genomes\Homo_sapiens\UCSC\hg19\Sequence\WholeGenomeFASTA,] -f 0.01 -fo False -b 20 -q 100 -c 300 -s 0.5 -a 20 -F 20 -gVCF True -i true -PhaseSNPs true -MaxPhaseSNPLength 100 -r D:\Amplicon_DS_Soma2\121017_M00948_0054_000000000-A2676_Binf02"
##reference=file://d:\Genomes\Homo_sapiens\UCSC\hg19\Sequence\WholeGenomeFASTA\genome.fa
##source=GATK 1.6
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 10002 - R1
VCF File Data Lines—Contains information about a single variant. Data lines are listed
under the column headings included in the header.
VCF File Headings
The VCF file format is flexible and extensible, so not all VCF files contain the same fields.
The following tables describe VCF files generated by MiSeq Reporter.
Heading | Description
CHROM | The chromosome of the reference genome. Chromosomes appear in the same order as the reference FASTA file.
POS | The single-base position of the variant in the reference chromosome. For SNPs, this position is the reference base with the variant; for indels or deletions, this position is the reference base immediately before the variant.
ID | The rs number for the SNP obtained from dbSNP.txt, if applicable. If there are multiple rs numbers at this location, the list is semi-colon delimited. If no dbSNP entry exists at this position, a missing value marker ('.') is used.
REF | The reference genotype. For example, a deletion of a single T is represented as reference TT and alternate T.
ALT | The alleles that differ from the reference read. For example, an insertion of a single T is represented as reference A and alternate AT.
QUAL | A Phred-scaled quality score assigned by the variant caller. Higher scores indicate higher confidence in the variant and lower probability of errors. For a quality score of Q, the estimated probability of an error is 10^(-Q/10). For example, the set of Q30 calls has a 0.1% error rate. Many variant callers assign quality scores based on their statistical models, which are high relative to the error rate observed.
VCF File Annotations
Heading
Description
FILTER
If all filters are passed, PASS is written in the filter column.
• LowDP—Applied to sites with depth of coverage below a cutoff. Configure
cutoff using the MinimumCoverageDepth sample sheet setting.
• LowGQ—The genotyping quality (GQ) is below a cutoff. Configure cutoff
using the VariantMinimumGQCutoff sample sheet setting.
• LowQual—The variant quality (QUAL) is below a cutoff. Configure using the
VariantMinimumQualCutoff sample sheet setting.
• LowVariantFreq—The variant frequency is less than the given threshold.
Configure using the VariantFrequencyFilterCutoff sample sheet setting.
• R8—For an indel, the number of adjacent repeats (1-base or 2-base) in the
reference is greater than 8. This filter is configurable using the
IndelRepeatFilterCutoff setting in the config file or the sample sheet.
• SB—The strand bias is more than the given threshold. This filter is
configurable using the StrandBiasFilter sample sheet setting; available only for
somatic variant caller and GATK.
For more information about sample sheet settings, see MiSeq Sample Sheet Quick
Reference Guide (part # 15028392).
INFO
Possible entries in the INFO column include:
• AC—Allele count in genotypes for each ALT allele, in the same order as listed.
• AF—Allele Frequency for each ALT allele, in the same order as listed.
• AN—The total number of alleles in called genotypes.
• CD—A flag indicating that the SNP occurs within the coding region of at least
one RefGene entry.
• DP—The depth (number of base calls aligned to a position and used in variant
calling). In regions of high coverage, GATK down-samples the available reads.
• Exon—A comma-separated list of exon regions read from RefGene.
• FC—Functional Consequence.
• GI—A comma-separated list of gene IDs read from RefGene.
• QD—Variant Confidence/Quality by Depth.
• TI—A comma-separated list of transcript IDs read from RefGene.
FORMAT
The format column lists fields separated by colons. For example, GT:GQ. The list
of fields provided depends on the variant caller used. Available fields include:
• AD—Entry of the form X,Y, where X is the number of reference calls, and Y is
the number of alternate calls.
• DP—Approximate read depth; reads with MQ=255 or with bad mates are
filtered.
• GQ—Genotype quality.
• GQX—Genotype quality. GQX is the minimum of the GQ value and the QUAL
column. In general, these values are similar; taking the minimum makes GQX
the more conservative measure of genotype quality.
• GT—Genotype. 0 corresponds to the reference base, 1 corresponds to the first
entry in the ALT column, and so on. The forward slash (/) indicates that no
phasing information is available.
• NL—Noise level; an estimate of base calling noise at this position.
• PL—Normalized, Phred-scaled likelihoods for genotypes.
• SB—Strand bias at this position. Larger negative values indicate less bias;
values near zero indicate more bias.
• VF—Variant frequency; the percentage of reads supporting the alternate allele.
SAMPLE
The sample column gives the values specified in the FORMAT column.
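The pairing of the FORMAT and SAMPLE columns, and the GQX definition given above, can be illustrated with a short sketch. The function names and the example values used with them are hypothetical and are not taken from a MiSeq Reporter output file.

```python
def parse_sample(format_field, sample_field):
    """Pair colon-separated FORMAT keys with the values in a sample column."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


def gqx(gq, qual):
    """GQX is the minimum of the GQ value and the QUAL column."""
    return min(gq, qual)


def is_heterozygous(gt):
    """Return True if the genotype lists different alleles, for example 0/1."""
    alleles = gt.replace("|", "/").split("/")
    return len(set(alleles)) > 1
```

For instance, `parse_sample("GT:GQ", "0/1:99")` returns a dictionary in which `GT` maps to `0/1` and `GQ` maps to `99` (as strings), and `is_heterozygous("1/1")` returns `False`.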
Supplementary Output Files
The following output files provide supplementary information, or summarize run results
and analysis errors. Although these files are not required for assessing analysis results,
they can be used for troubleshooting purposes.
File Name | Description
AdapterTrimming.txt | Lists the number of trimmed bases and percentage of bases for each tile. This file is present only if adapter trimming was specified for the run. Located in Data\Intensities\BaseCalls\Alignment.
AnalysisLog.txt | Processing log that describes every step that occurred during analysis of the current run folder. This file does not contain error messages. Located in the root level of the run folder.
AnalysisError.txt | Processing log that lists any errors that occurred during analysis. This file is present only if errors occurred. Located in the root level of the run folder.
CompletedJobInfo.xml | Written after analysis is complete, contains information about the run, such as date, flow cell ID, software version, and other parameters. Located in the root level of the run folder.
DemultiplexSummaryF1L1.txt | Reports demultiplexing results in a table with one row per tile and one column per sample. Located in Data\Intensities\BaseCalls\Alignment.
ErrorsAndNoCallsByLaneTileReadCycle.csv | A comma-separated values file that contains the percentage of errors and no-calls for each tile, read, and cycle. Located in Data\Intensities\BaseCalls\Alignment.
Mismatch.htm | Contains histograms of mismatches per cycle and no-calls per cycle for each tile. Located in Data\Intensities\BaseCalls\Alignment.
ResequencingRunStatistics.xml | Contains summary statistics specific to the run. Located in the root level of the run folder.
Summary.xml | Contains a summary of mismatch rates and other base calling results. Located in Data\Intensities\BaseCalls\Alignment.
Summary.htm | Contains a summary web page generated from Summary.xml. Located in Data\Intensities\BaseCalls\Alignment.
Technical Assistance
For technical assistance, contact Illumina Technical Support.
Table 1 Illumina General Contact Information
Website | www.illumina.com
Email | [email protected]
Table 2 Illumina Customer Support Telephone Numbers
Region | Contact Number
North America | 1.800.809.4566
Australia | 1.800.775.688
Austria | 0800.296575
Belgium | 0800.81102
Denmark | 80882346
Finland | 0800.918363
France | 0800.911850
Germany | 0800.180.8994
Ireland | 1.800.812949
Italy | 800.874909
Netherlands | 0800.0223859
New Zealand | 0800.451.650
Norway | 800.16836
Spain | 900.812168
Sweden | 020790181
Switzerland | 0800.563118
United Kingdom | 0800.917.0041
Other countries | +44.1799.534000
Safety Data Sheets
Safety data sheets (SDSs) are available on the Illumina website at
support.illumina.com/sds.html.
Product Documentation
Product documentation in PDF is available for download from the Illumina website. Go
to support.illumina.com, select a product, then click Documentation & Literature.
Illumina
San Diego, California 92122 U.S.A.
+1.800.809.ILMN (4566)
+1.858.202.4566 (outside North America)
[email protected]
www.illumina.com