MiSeq Reporter PCR Amplicon Workflow Guide (15042318 Rev. D)

MiSeq Reporter
PCR Amplicon Workflow
Reference Guide
FOR RESEARCH USE ONLY
Revision History
Introduction
PCR Amplicon Workflow Overview
PCR Amplicon Summary Tab
PCR Amplicon Details Tab
Optional Settings for the PCR Amplicon Workflow
Analysis Output Files
PCR Amplicon Manifest File Format
Technical Assistance
ILLUMINA PROPRIETARY
Part # 15042318 Rev. D
December 2014
This document and its contents are proprietary to Illumina, Inc. and its affiliates ("Illumina"), and are intended solely for the
contractual use of its customer in connection with the use of the product(s) described herein and for no other purpose. This
document and its contents shall not be used or distributed for any other purpose and/or otherwise communicated, disclosed,
or reproduced in any way whatsoever without the prior written consent of Illumina. Illumina does not convey any license
under its patent, trademark, copyright, or common-law rights nor similar rights of any third parties by this document.
The instructions in this document must be strictly and explicitly followed by qualified and properly trained personnel in order
to ensure the proper and safe use of the product(s) described herein. All of the contents of this document must be fully read
and understood prior to using such product(s).
FAILURE TO COMPLETELY READ AND EXPLICITLY FOLLOW ALL OF THE INSTRUCTIONS CONTAINED HEREIN
MAY RESULT IN DAMAGE TO THE PRODUCT(S), INJURY TO PERSONS, INCLUDING TO USERS OR OTHERS, AND
DAMAGE TO OTHER PROPERTY.
ILLUMINA DOES NOT ASSUME ANY LIABILITY ARISING OUT OF THE IMPROPER USE OF THE PRODUCT(S)
DESCRIBED HEREIN (INCLUDING PARTS THEREOF OR SOFTWARE) OR ANY USE OF SUCH PRODUCT(S) OUTSIDE
THE SCOPE OF THE EXPRESS WRITTEN LICENSES OR PERMISSIONS GRANTED BY ILLUMINA IN CONNECTION
WITH CUSTOMER'S ACQUISITION OF SUCH PRODUCT(S).
FOR RESEARCH USE ONLY
FOR RESEARCH, FORENSIC, OR PATERNITY USE ONLY
© 2013-2014 Illumina, Inc. All rights reserved.
Illumina, 24sure, BaseSpace, BeadArray, BlueFish, BlueFuse, BlueGnome, cBot, CSPro, CytoChip, DesignStudio,
Epicentre, GAIIx, Genetic Energy, Genome Analyzer, GenomeStudio, GoldenGate, HiScan, HiSeq, HiSeq X, Infinium,
iScan, iSelect, ForenSeq, MiSeq, MiSeqDx, MiSeq FGx, NeoPrep, Nextera, NextBio, NextSeq, Powered by Illumina,
SeqMonitor, SureMDA, TruGenome, TruSeq, TruSight, Understand Your Genome, UYG, VeraCode, verifi, VeriSeq, the
pumpkin orange color, and the streaming bases design are trademarks of Illumina, Inc. and/or its affiliate(s) in the U.S. and/or
other countries. All other names, logos, and other trademarks are the property of their respective owners.
Revision History
Part # 15042318, Revision D, December 2014
Description of Change: Added a note in the Demultiplexing section about the default index recognition for index pairs that differ by < 3 bases.
Part # 15042318, Revision C, February 2014
Description of Change: Updated to changes introduced in MiSeq Reporter v2.4:
• Added alignment method to the description of the BAM file header.
• Added the command line and annotation algorithm to the description of VCF file header.
Updated sample sheet OutputGenomeVCF parameter default setting information.
Part # 15042318, Revision B, August 2013
Description of Change:
• Updated to MiSeq Reporter v2.3: added sample sheet setting OutputGenomeVCF and description of genome VCF (gVCF) files.
• Added descriptions of optional settings: Adapter, FlagPCRDuplicates, VariantCaller, and VariantMinimumGQCutoff (also known as VariantFilterQualityCutoff).
• Removed ReverseComplement optional setting, which is not applicable to the Resequencing workflow.
Part # 15042318, Revision A, June 2013
Description of Change: Initial release. The information provided within was previously included in the MiSeq Reporter User Guide. With this release, the MiSeq Reporter User Guide contains information about the interface, how to view run results, how to requeue a run, and how to install and configure the software. Information specific to the PCR Amplicon workflow is provided in this guide.
Introduction
The PCR Amplicon workflow aligns reads against the reference genomes specified in the
sample sheet, and performs variant analysis for the regions specified in the manifest.
This workflow is intended for analysis of PCR amplicons that have been fragmented
using Nextera fragmentation.
In the MiSeq Reporter Analyses tab, a run folder associated with the PCR Amplicon
workflow is represented with the letter P. For more information about the software
interface, see the MiSeq Reporter User Guide (part # 15042295).
This guide describes the analysis steps performed in the PCR Amplicon workflow, the
types of data that appear on the interface, and the analysis output files generated by the
workflow.
Workflow Requirements
• Manifest file: The PCR Amplicon workflow requires at least one manifest file. Illumina recommends using the Illumina Experiment Manager to create the PCR Amplicon manifest file.
• Reference genome: The PCR Amplicon workflow requires the reference genome specified in the manifest file. The reference genome provides variant annotations and sets the chromosome sizes in the BAM file output. Specify the path to the genome folder in the sample sheet. For more information, see the MiSeq Sample Sheet Quick Reference Guide (part # 15028392).
PCR Amplicon Workflow Overview
The PCR Amplicon workflow sequences any number of PCR amplicons that have been
fragmented using Nextera tagmentation.
Illumina recommends using the Adapter sample sheet setting for Nextera libraries. This setting enables adapter trimming during analysis, which prevents reporting sequence beyond the sample DNA. For more information, see Optional Settings for the PCR Amplicon Workflow.
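The idea behind adapter trimming can be illustrated with a minimal Python sketch. It is not the MiSeq Reporter implementation: the function name trim_adapter is hypothetical, and the sketch assumes an exact, full-length adapter match, whereas production trimmers typically also tolerate mismatches and partial adapters at the end of a read. The adapter value is the example sequence given for the Adapter setting in this guide.

```python
def trim_adapter(read: str, adapter: str) -> str:
    """Return the read truncated at the first exact occurrence of the adapter.

    Hypothetical helper for illustration only; assumes an exact match.
    """
    idx = read.find(adapter)
    # If the adapter is absent, the read is returned unchanged.
    return read if idx == -1 else read[:idx]


# Example adapter value shown for the Adapter setting in this guide.
ADAPTER = "CTGTCTCTTATACACATCT"

# A made-up read: 10 bases of sample DNA, then adapter, then trailing bases.
example_read = "ACGTACGTAC" + ADAPTER + "GGGG"
trimmed = trim_adapter(example_read, ADAPTER)
```

Everything from the start of the adapter onward is discarded, so only sample DNA is reported.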
The PCR Amplicon workflow demultiplexes indexed reads, generates FASTQ files, aligns
reads to a reference, identifies variants, and writes output files to the Alignment folder.
Demultiplexing
Demultiplexing separates data from pooled samples based on short index sequences that
tag samples from different libraries. Index reads are identified using the following steps:
• Samples are numbered starting from 1 based on the order they are listed in the sample sheet.
• Sample number 0 is reserved for clusters that were not successfully assigned to a sample.
• Clusters are assigned to a sample when the index sequence matches exactly or there is up to a single mismatch per Index Read.
NOTE
Illumina indexes are designed so that any index pair differs by ≥ 3 bases, allowing for a
single mismatch in index recognition. Index sets that are not from Illumina can include
pairs of indexes that differ by < 3 bases. In such cases, the software detects the insufficient
difference and modifies the default index recognition (mismatch=1). Instead, the software
performs demultiplexing using only perfect index matches (mismatch=0).
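The assignment rules and the note above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the software's actual code: the function names are hypothetical, a single Index Read is assumed, and all index sequences are assumed to have the same length.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def allowed_mismatches(indexes: list[str]) -> int:
    """Return 1 if every index pair differs by >= 3 bases, otherwise 0."""
    min_dist = min(
        (hamming(a, b) for a, b in combinations(indexes, 2)),
        default=3,  # a single index has no pairs to compare
    )
    return 1 if min_dist >= 3 else 0


def assign_sample(index_read: str, indexes: list[str]) -> int:
    """Return the 1-based sample number, or 0 if the cluster is unassigned.

    `indexes` is listed in sample sheet order.
    """
    limit = allowed_mismatches(indexes)
    for number, index in enumerate(indexes, start=1):
        if hamming(index_read, index) <= limit:
            return number
    return 0


# Made-up index sets for illustration.
well_separated = ["ATCACG", "CGATGT", "TTAGGC"]
too_similar = ["AAAAAA", "AAAAAT"]
```

With the well-separated set, a read with one mismatch is still assigned. With the too-similar set, only perfect matches are accepted.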
When demultiplexing is complete, one demultiplexing file named DemultiplexSummaryF1L1.txt is written to the Alignment folder:
• In the file name, F1 represents the flow cell number.
• In the file name, L1 represents the lane number, which is always L1 for MiSeq.
• The file reports demultiplexing results in a table with one row per tile and one column per sample, including sample 0.
• The file reports the most commonly occurring sequences for the index reads.
FASTQ File Generation
MiSeq Reporter generates intermediate analysis files in the FASTQ format, which is a text
format used to represent sequences. FASTQ files contain reads for each sample and their
quality scores, excluding reads identified as in-line controls and clusters that did not
pass filter.
FASTQ files are the primary input for alignment. The files are written to the BaseCalls
folder (Data\Intensities\BaseCalls) in the MiSeqAnalysis folder, and then copied to the
BaseCalls folder in the MiSeqOutput folder. Each FASTQ file contains reads for only one
sample, and the name of that sample is included in the FASTQ file name. For more
information about FASTQ files, see the MiSeq Reporter User Guide (part # 15042295).
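As a rough illustration of what a FASTQ record holds, the following Python sketch reads records from a text stream. It assumes the common four-line record layout (header, sequence, separator, quality) and Phred+33 quality encoding; neither detail is stated in this guide, and the function name read_fastq and the sample record are hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality scores) for each four-line record."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break  # end of input
        sequence = handle.readline().rstrip()
        handle.readline()  # separator line beginning with '+'
        quality = handle.readline().rstrip()
        # Assumption: Phred+33 encoding, so 'I' decodes to Q40.
        scores = [ord(ch) - 33 for ch in quality]
        yield header[1:], sequence, scores


# A made-up single-record FASTQ stream.
example = io.StringIO("@read1\nACGT\n+\nII5#\n")
records = list(read_fastq(example))
```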
Alignment
Reads are aligned against the entire reference genome using the Burrows-Wheeler
Aligner (BWA), which aligns relatively short nucleotide sequences against a long
reference sequence. BWA automatically adjusts parameters based on read lengths and
error rates, and then estimates insert size distribution.
Paired-End Evaluation
For paired-end runs, the top-scoring alignment for each read is considered. Reads are
flagged as an unresolved pair under the following conditions:
• If either read did not align, or the paired reads aligned to different chromosomes.
• If two alignments come from different amplicons or different rows in the Targets section of the manifest.
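The two conditions above can be expressed as a small predicate. The following Python sketch is illustrative only: the Alignment class and its fields are hypothetical stand-ins for whatever the software tracks internally, and None is used here to represent a read that did not align.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alignment:
    """Top-scoring alignment for one read (hypothetical structure)."""
    chromosome: str
    amplicon: str  # name of the manifest row the alignment falls in


def is_unresolved_pair(read1: Optional[Alignment],
                       read2: Optional[Alignment]) -> bool:
    """Return True if the pair meets any unresolved-pair condition."""
    if read1 is None or read2 is None:
        return True  # either read did not align
    if read1.chromosome != read2.chromosome:
        return True  # reads aligned to different chromosomes
    # reads came from different amplicons or manifest rows
    return read1.amplicon != read2.amplicon
```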
Bin/Sort
The bin/sort step groups reads by sample and chromosome, and then sorts by
chromosome position. Results are written to one BAM file per sample.
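A minimal Python sketch of the grouping and sorting described above follows. It is a simplification: the function name and the tuple layout are hypothetical, and chromosome names are ordered alphabetically here, whereas the software may follow the chromosome order of the reference.

```python
from collections import defaultdict


def bin_and_sort(reads):
    """Group reads by sample, then sort by chromosome and position.

    `reads` is an iterable of (sample, chromosome, position, read_name).
    Returns one sorted list per sample, mirroring one BAM file per sample.
    """
    bins = defaultdict(list)
    for sample, chromosome, position, name in reads:
        bins[sample].append((chromosome, position, name))
    # Tuples sort by chromosome first, then position (alphabetical
    # chromosome order is an assumption of this sketch).
    return {sample: sorted(items) for sample, items in bins.items()}


# Made-up reads for illustration.
example_reads = [
    ("S1", "chr2", 50, "r1"),
    ("S1", "chr1", 200, "r2"),
    ("S2", "chr1", 10, "r3"),
    ("S1", "chr1", 100, "r4"),
]
sorted_bins = bin_and_sort(example_reads)
```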
Indel Realignment
Reads near detected indels are realigned to remove alignment artifacts.
Variant Calling
SNPs and short indels are identified using the Genome Analysis Toolkit (GATK), by
default. GATK calls raw variants for each sample, analyzes variants against known
variants, and then calculates a false discovery rate for each variant. Variants are flagged
as homozygous (1/1) or heterozygous (0/1) in the VCF file sample column. For more
information, see www.broadinstitute.org/gatk.
Alternatively, you can specify the somatic variant caller using the VariantCaller sample
sheet setting.
Variant Annotation
Variant analysis is performed only for the amplicon regions specified in the manifest file.
Statistics Reporting
Statistics are summarized and reported, and written to the Alignment folder.
PCR Amplicon Summary Tab
The Summary tab for the PCR Amplicon workflow includes a low percentages graph, a high percentages graph, a clusters graph, and a mismatch graph.
• Low percentages graph: Shows phasing, prephasing, and mismatches in percentages. Low percentages indicate good run statistics.
• High percentages graph: Shows clusters passing filter, alignment to a reference, and intensities in percentages. High percentages indicate good run statistics.
• Clusters graph: Shows numbers of raw clusters, clusters passing filter, clusters that did not align, clusters not associated with an index, and duplicates.
• Mismatch graph: Shows mismatches per cycle. A mismatch refers to any mismatch between the sequencing read and a reference genome after alignment.
Low Percentages Graph
Y Axis: Percent
X Axis | Description
Phasing 1 | The percentage of molecules in a cluster that fall behind the current cycle within Read 1.
Phasing 2 | The percentage of molecules in a cluster that fall behind the current cycle within Read 2.
Prephasing 1 | The percentage of molecules in a cluster that run ahead of the current cycle within Read 1.
Prephasing 2 | The percentage of molecules in a cluster that run ahead of the current cycle within Read 2.
Mismatch 1 | The average percentage of mismatches for Read 1 over all cycles.
Mismatch 2 | The average percentage of mismatches for Read 2 over all cycles.
High Percentages Graph
Y Axis: Percent
X Axis | Description
PF | The percentage of clusters passing filters.
Align 1 | The percentage of clusters that aligned to the reference in Read 1.
Align 2 | The percentage of clusters that aligned to the reference in Read 2.
I20 / I1 1 | The ratio of intensities at cycle 20 to the intensities at cycle 1 for Read 1.
I20 / I1 2 | The ratio of intensities at cycle 20 to the intensities at cycle 1 for Read 2.
PE Resynthesis | The ratio of first cycle intensities for Read 1 to first cycle intensities for Read 2.
Clusters Graph
Y Axis: Clusters
X Axis | Description
Raw | The total number of clusters detected in the run.
PF | The total number of clusters passing filter in the run.
Unaligned | The total number of clusters passing filter that did not align to the reference genome, if applicable. Clusters that are unindexed are not included in the unaligned count.
Unindexed | The total number of clusters passing filter that were not associated with any index sequence in the run.
Duplicate | The total number of clusters for a paired-end sequencing run that are considered to be PCR duplicates. PCR duplicates are defined as two clusters from a paired-end run where both clusters have the exact same alignment positions for each read.
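The PCR duplicate definition above can be sketched in Python by grouping clusters on the alignment positions of both reads. The function name and tuple layout are hypothetical, and counting every cluster after the first in each group as a duplicate is an assumption of this sketch, not a documented rule.

```python
from collections import Counter


def count_pcr_duplicates(pairs):
    """Count clusters whose Read 1 and Read 2 positions repeat an earlier cluster.

    `pairs` is an iterable of (chromosome, read1_start, read2_start).
    Assumption: within each group of identical positions, one cluster is
    kept and the rest are counted as duplicates.
    """
    counts = Counter(pairs)
    return sum(n - 1 for n in counts.values())


# Made-up paired-end alignment positions.
example_pairs = [
    ("chr1", 100, 250),
    ("chr1", 100, 250),  # same positions for both reads: duplicate
    ("chr1", 100, 260),  # Read 2 differs: not a duplicate
    ("chr1", 100, 250),  # duplicate
]
duplicates = count_pcr_duplicates(example_pairs)
```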
Mismatch Graph
Y Axis: Percent
X Axis: Cycle
Description: Plots the percentage of mismatches for all clusters in a run by cycle.
PCR Amplicon Details Tab
The Details tab for the PCR Amplicon workflow includes a samples table, targets table, coverage graph, Q-score graph, variant score graph, and variants table.
• Samples table: Summarizes the sequencing results for each sample.
• Targets table: Shows statistics for a particular sample and chromosome.
• Coverage graph: Shows read depth at a given position in the reference.
• Q-score graph: Shows the average quality score. A quality score of Q corresponds to an estimated error probability of 10^(-Q/10). For example, a score of Q30 has an error rate of 1 in 1000, or 0.1%.
• Variant score graph: Shows the location of SNPs and indels.
• Variants table: Summarizes differences between sample DNA and the reference. Both SNPs and indels are reported.
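The relationship between a quality score and its estimated error probability can be checked with a few lines of Python. The function names are illustrative only.

```python
import math


def error_probability(q: float) -> float:
    """Estimated probability of an error for quality score Q: 10^(-Q/10)."""
    return 10 ** (-q / 10)


def quality_score(p: float) -> float:
    """Inverse relationship: Q = -10 * log10(p)."""
    return -10 * math.log10(p)


# Q30 corresponds to 1 error in 1000, or 0.1%.
p_q30 = error_probability(30)
```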
Samples Table
Column | Description
# | An ordinal identification number in the table.
Sample ID | The sample ID from the sample sheet. Sample ID must always be a unique value.
Sample Name | The sample name from the sample sheet.
Cluster PF | The number of clusters passing filter for the sample.
Cluster Align | The total count of PF clusters aligning for the sample (Read 1/Read 2).
Mismatch | The percentage mismatch to reference averaged over cycles per read (Read 1/Read 2).
No Call | The percentage of bases that could not be called (no-call) for the sample averaged over cycles per read (Read 1/Read 2).
Coverage | Median coverage (number of bases aligned to a given reference position) averaged over all positions.
Het SNPs | The number of heterozygous SNPs detected for the sample.
Hom SNPs | The number of homozygous SNPs detected for the sample.
Insertions | The number of insertions detected for the sample.
Deletions | The number of deletions detected for the sample.
Median Len | The median fragment length for the sample.
Manifest | The name of the file that specifies the alignments to a reference and the targeted reference regions.
Targets Table
Column | Description
# | An ordinal identification number in the table.
Name | The name of the target in the manifest.
Chr | The reference target or chromosome name.
Start Position | The start position of the target region.
End Position | The end position of the target region.
Cluster PF | Number of clusters passing filter for the target displayed per read (Read 1/Read 2).
Mismatch | The percentage of mismatched bases to target averaged over all cycles, displayed per read. Mismatch = [mean(errors count in cycles) / cluster PF] * 100.
No Call | The percentage of no-call bases for the target averaged over cycles, displayed per read.
Het SNPs | The number of heterozygous SNPs detected for the target across all samples.
Hom SNPs | The number of homozygous SNPs detected for the target across all samples.
Insertions | The number of insertions detected for the target across all samples.
Deletions | The number of deletions detected for the target across all samples.
Manifest | The name of the file that specifies the alignments to a reference and the targeted reference regions.
Coverage Graph
Y Axis: Coverage
X Axis: Position
Description: The green curve is the number of aligned reads that cover each position in the reference. The red curve is the number of aligned reads that have a miscall at this position in the reference. SNPs and other variants show up as spikes in the red curve.
Q-Score Graph
Y Axis: QScore
X Axis: Position
Description: The average quality score of bases at the given position of the reference.
Variant Score Graph
Y Axis: Score
X Axis: Position
Description: Graphically depicts quality score and the position of SNPs and indels.
Variants Table
Column | Description
# | An ordinal identification number in the table.
Sample ID | The sample ID from the sample sheet. Sample ID must always be a unique value.
Sample Name | The sample name from the sample sheet.
Chr | The reference target or chromosome name.
Position | The position at which the variant was found.
Score | The quality score for this variant.
Variant Type | The variant type, which can be either SNP or indel.
Call | A string representing how the base or bases changed at this location in the reference.
Frequency | The fraction of reads for the sample that includes the variant. For example, if the reference base is A, and sample 1 has 60 A reads and 40 T reads, then the SNP has a variant frequency of 0.4.
Depth | The number of reads for a sample covering a particular position. The GATK variant caller subsamples data in regions of high coverage. The GATK subsampling limit is 5000 in MiSeq Reporter v2.2, raised from 250 in v2.1.
Filter | The criteria for a filtered variant.
dbSNP | The dbSNP name of the variant, if applicable.
RefGene | The gene according to RefGene in which this variant appears.
Genome | The name of the reference genome.
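The Frequency example in the Variants table (60 A reads and 40 T reads giving 0.4) can be reproduced with a short Python sketch. The function name is hypothetical, and the sketch assumes that only reference and variant reads cover the position.

```python
def variant_frequency(ref_reads: int, alt_reads: int) -> float:
    """Fraction of reads at a position that include the variant.

    Assumption: depth is the sum of reference and variant reads.
    """
    depth = ref_reads + alt_reads
    if depth == 0:
        raise ValueError("no reads cover this position")
    return alt_reads / depth


# Reference base A with 60 A reads and 40 T reads.
frequency = variant_frequency(ref_reads=60, alt_reads=40)
```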
Optional Settings for the PCR Amplicon Workflow
Sample sheet settings are optional commands that control various analysis parameters.
Settings are used in the Settings section of the sample sheet and require a setting name
and a setting value.
• If you are viewing or editing the sample sheet in Excel, the setting name resides in the first column and the setting value in the second column.
• If you are viewing or editing the sample sheet in a text editor such as Notepad, follow the setting name with a comma and the setting value. Do not include a space between the comma and the setting value.
Example: Adapter,CTGTCTCTTATACACATCT
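For readers who process sample sheets with scripts, the following Python sketch shows one way to split a setting line into its name and value. The function names are hypothetical and this is not how MiSeq Reporter itself reads the file. The second helper applies the plus (+) separator that this guide describes for listing several adapter sequences.

```python
def parse_setting(line: str) -> tuple[str, str]:
    """Split a 'Name,Value' line from the Settings section."""
    name, _, value = line.strip().partition(",")
    return name, value


def adapter_sequences(value: str) -> list[str]:
    """Split an Adapter value into one or more adapter sequences."""
    return value.split("+")


name, value = parse_setting("Adapter,CTGTCTCTTATACACATCT")
two_adapters = adapter_sequences("CTGTCTCTTATACACATCT+AGATGTGTATAAGAGACAG")
```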
The following optional settings are compatible with the PCR Amplicon workflow.
Sample Sheet Settings for Analysis
Parameter | Description
Adapter | Specify the 5' portion of the adapter sequence to prevent reporting sequence beyond the sample DNA. Illumina recommends adapter trimming for Nextera libraries and Nextera Mate Pair libraries. To specify two or more adapter sequences, separate the sequences by a plus (+) sign. For example: CTGTCTCTTATACACATCT+AGATGTGTATAAGAGACAG
AdapterRead2 | Specify the 5' portion of the Read 2 adapter sequence to prevent reporting sequence beyond the sample DNA. Use this setting to specify an adapter different from the one specified in the Adapter setting.
FlagPCRDuplicates | Settings are 0 or 1. Default is 1, filtering. If set to 1, PCR duplicates are flagged in the BAM files and not used for variant calling. PCR duplicates are defined as two clusters from a paired-end run where both clusters have the exact same alignment positions for each read. Duplicates are not flagged for single-read runs, including PCR duplicates. (Formerly FilterPCRDuplicates. FilterPCRDuplicates is acceptable for backward compatibility.)
OutputGenomeVCF | Settings are 0 or 1. Default is 1. If set to true (1), this setting turns on genome VCF (gVCF) output for single sample variant calling. If set to false (0), gVCF files are not generated. This setting requires MiSeq Reporter v2.3, or later.
VariantCaller | Specify one of the following variant caller settings:
• GATK (default)
• Somatic (recommended for tumor samples)
• None (no variant calling)
When using the default variant caller for the workflow, it is not necessary to specify the variant calling method in the sample sheet.
Sample Sheet Settings for Variant Calling
Setting Name | Description
VariantMinimumGQCutoff | This setting filters variants if the genotype quality (GQ) is less than the threshold. GQ is a measure of the quality of the genotype call and has a maximum value of 99. (Formerly VariantFilterQualityCutoff, which is acceptable for backward compatibility.) Default value:
• 30 for GATK
• 30 for the somatic variant caller
Analysis Output Files
The following analysis output files are generated for the PCR Amplicon workflow and
provide analysis results for alignment and variant calling.
File Name | Description
*.bam files | Contains aligned reads for a given sample. Located in Data\Intensities\BaseCalls\Alignment.
*.vcf files | Contains information about variants found at specific positions in a reference genome. Located in Data\Intensities\BaseCalls\Alignment.
Alignment Files
Alignment files contain the aligned read sequence and quality score. MiSeq Reporter
generates alignment files in the BAM (*.bam) file format.
BAM File Format
A BAM file (*.bam) is the compressed binary version of a SAM file that is used to
represent aligned sequences. SAM and BAM formats are described in detail on the SAM
Tools website: samtools.sourceforge.net.
BAM files are written to the alignment folder in Data\Intensities\BaseCalls\Alignment.
BAM files use the file naming format of SampleName_S#.bam, where # is the sample
number determined by the order that samples are listed in the sample sheet.
BAM files contain a header section and an alignments section:
• Header: Contains information about the entire file, such as sample name, sample length, and alignment method. Alignments in the alignments section are associated with specific information in the header section. Alignment methods include banded Smith-Waterman, Burrows-Wheeler Aligner (BWA), and Bowtie. The term Isis indicates that an Illumina alignment method is in use, which is the banded Smith-Waterman method.
• Alignments: Contains read name, read sequence, read quality, and custom tags.
GA23_40:8:1:10271:11781 64 chr22 17552189 8 35M * 0 0
TACAGACATCCACCACCACACCCAGCTAATTTTTG
IIIII>FA?C::B=:GGGB>GGGEGIIIHI3EEE#
BC:Z:ATCACG XD:Z:55 SM:I:8
In this example, the alignment line begins with the read name GA23_40:8:1:10271:11781, and includes the chromosome and start coordinate chr22 17552189, the alignment quality 8, and the match descriptor 35M * 0 0.
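The example alignment line above can be broken into named fields with a short Python sketch. The field names follow the standard SAM column order. The sketch splits on whitespace because the example is shown wrapped across several lines, and it treats the type code I on the SM tag as an integer; both choices are assumptions made for this illustration, and the function name is hypothetical.

```python
# The example alignment from this guide, joined into a single line.
SAM_LINE = (
    "GA23_40:8:1:10271:11781 64 chr22 17552189 8 35M * 0 0 "
    "TACAGACATCCACCACCACACCCAGCTAATTTTTG "
    "IIIII>FA?C::B=:GGGB>GGGEGIIIHI3EEE# "
    "BC:Z:ATCACG XD:Z:55 SM:I:8"
)

FIELD_NAMES = ["qname", "flag", "rname", "pos", "mapq", "cigar",
               "rnext", "pnext", "tlen", "seq", "qual"]


def parse_alignment(line: str) -> dict:
    """Split one text alignment line into the 11 mandatory fields plus tags."""
    fields = line.split()
    record = dict(zip(FIELD_NAMES, fields[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    tags = {}
    for tag in fields[11:]:
        name, kind, value = tag.split(":", 2)
        # Assumption: both 'i' and 'I' type codes denote integers.
        tags[name] = int(value) if kind in ("i", "I") else value
    record["tags"] = tags
    return record


alignment = parse_alignment(SAM_LINE)
```

The parsed record exposes the chromosome as alignment["rname"], the start coordinate as alignment["pos"], and the alignment quality as alignment["mapq"].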
BAM files are suitable for viewing with an external viewer such as IGV or the UCSC
Genome Browser.
BAM index files (*.bam.bai) provide an index of the corresponding BAM file.
Variant Call Files
Variant call files contain all called variants. MiSeq Reporter generates variant call files in
the VCF (*.vcf) file format and genome VCF (*.gVCF), if configured to do so using the
optional sample sheet setting, OutputGenomeVCF.
• VCF files contain information about variants found at specific positions.
• gVCF files contain information about all sites within the region of interest.
VCF File Format
VCF is a widely used file format developed by the genomics scientific community that
contains information about variants found at specific positions in a reference genome.
VCF files use the file naming format SampleName_S#.vcf, where # is the sample number
determined by the order that samples are listed in the sample sheet.
VCF File Header—Includes the VCF file format version and the variant caller version.
The header lists the annotations used in the remainder of the file. If MARS is listed as
the annotator, the Illumina internal annotation algorithm is in use to annotate the VCF
file. The VCF header also contains the command line call used by MiSeq Reporter to run
the variant caller. The command-line call specifies all parameters used by the variant
caller, including the reference genome file and .bam file. The last line in the header is
column headings for the data lines. For more information, see VCF File Annotations.
##fileformat=VCFv4.1
##FORMAT=<ID=GQX,Number=1,Type=Integer>
##FORMAT=<ID=AD,Number=.,Type=Integer>
##FORMAT=<ID=DP,Number=1,Type=Integer>
##FORMAT=<ID=GQ,Number=1,Type=Float>
##FORMAT=<ID=GT,Number=1,Type=String>
##FORMAT=<ID=PL,Number=G,Type=Integer>
##FORMAT=<ID=VF,Number=1,Type=Float>
##INFO=<ID=TI,Number=.,Type=String>
##INFO=<ID=GI,Number=.,Type=String>
##INFO=<ID=EXON,Number=0,Type=Flag>
##INFO=<ID=FC,Number=.,Type=String>
##INFO=<ID=IndelRepeatLength,Number=1,Type=Integer>
##INFO=<ID=AC,Number=A,Type=Integer>
##INFO=<ID=AF,Number=A,Type=Float>
##INFO=<ID=AN,Number=1,Type=Integer>
##INFO=<ID=DP,Number=1,Type=Integer>
##INFO=<ID=QD,Number=1,Type=Float>
##FILTER=<ID=LowQual>
##FILTER=<ID=R8>
##annotator=MARS
##CallSomaticVariants_cmdline=" -B D:\Amplicon_DS_Soma2\121017_
M00948_0054_000000000A2676_Binf02\Data\Intensities\BaseCalls\Alignment3_Tamsen_
SomaWorker -g [D:\Genomes\Homo_sapiens
\UCSC\hg19\Sequence\WholeGenomeFASTA,] -f 0.01 -fo False -b 20
-q 100 -c 300 -s 0.5 -a 20 -F 20 -gVCF
True -i true -PhaseSNPs true -MaxPhaseSNPLength 100 -r D:
\Amplicon_DS_Soma2\121017_M00948_0054_000000000-A2676_Binf02"
##reference=file://d:\Genomes\Homo_
sapiens\UCSC\hg19\Sequence\WholeGenomeFASTA\genome.fa
##source=GATK 1.6
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 10002 - R1
VCF File Data Lines—Contains information about a single variant. Data lines are listed
under the column headings included in the header.
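To show how a data line maps onto the column headings, the following Python sketch parses a single tab-delimited line. The data line is made up for illustration and does not come from a real run. The function names are hypothetical, and the sketch assumes one sample column.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT",
               "QUAL", "FILTER", "INFO", "FORMAT"]


def parse_info(info: str) -> dict:
    """Parse semicolon-separated INFO entries; flags have no value."""
    entries = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        entries[key] = value if sep else True
    return entries


def parse_data_line(line: str) -> dict:
    """Map one tab-delimited data line onto the column headings."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields))
    record["POS"] = int(record["POS"])
    record["INFO"] = parse_info(record["INFO"])
    # Pair each FORMAT key with the matching value in the sample column.
    keys = record["FORMAT"].split(":")
    record["SAMPLE"] = dict(zip(keys, fields[9].split(":")))
    return record


# Hypothetical data line (not taken from real output).
EXAMPLE_LINE = "chr1\t12345\t.\tA\tT\t50.0\tPASS\tDP=100;EXON\tGT:GQ\t0/1:99\n"
variant = parse_data_line(EXAMPLE_LINE)
```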
VCF File Headings
The VCF file format is flexible and extensible, so not all VCF files contain the same fields.
The following tables describe VCF files generated by MiSeq Reporter.
Heading | Description
CHROM | The chromosome of the reference genome. Chromosomes appear in the same order as the reference FASTA file.
POS | The single-base position of the variant in the reference chromosome. For SNPs, this position is the reference base with the variant; for indels or deletions, this position is the reference base immediately before the variant.
ID | The rs number for the SNP obtained from dbSNP.txt, if applicable. If there are multiple rs numbers at this location, the list is semi-colon delimited. If no dbSNP entry exists at this position, a missing value marker ('.') is used.
REF | The reference genotype. For example, a deletion of a single T is represented as reference TT and alternate T.
ALT | The alleles that differ from the reference read. For example, an insertion of a single T is represented as reference A and alternate AT.
QUAL | A Phred-scaled quality score assigned by the variant caller. Higher scores indicate higher confidence in the variant and lower probability of errors. For a quality score of Q, the estimated probability of an error is 10^(-Q/10). For example, the set of Q30 calls has a 0.1% error rate. Many variant callers assign quality scores based on their statistical models, which are high relative to the error rate observed.
VCF File Annotations
Heading | Description
FILTER | If all filters are passed, PASS is written in the filter column.
• LowDP: Applied to sites with depth of coverage below a cutoff. Configure cutoff using the MinimumCoverageDepth sample sheet setting.
• LowGQ: The genotyping quality (GQ) is below a cutoff. Configure cutoff using the VariantMinimumGQCutoff sample sheet setting.
• LowQual: The variant quality (QUAL) is below a cutoff. Configure using the VariantMinimumQualCutoff sample sheet setting.
• LowVariantFreq: The variant frequency is less than the given threshold. Configure using the VariantFrequencyFilterCutoff sample sheet setting.
• R8: For an indel, the number of adjacent repeats (1-base or 2-base) in the reference is greater than 8. This filter is configurable using the IndelRepeatFilterCutoff setting in the config file or the sample sheet.
• SB: The strand bias is more than the given threshold. This filter is configurable using the StrandBiasFilter sample sheet setting; available only for somatic variant caller and GATK.
For more information about sample sheet settings, see MiSeq Sample Sheet Quick Reference Guide (part # 15028392).
INFO | Possible entries in the INFO column include:
• AC: Allele count in genotypes for each ALT allele, in the same order as listed.
• AF: Allele Frequency for each ALT allele, in the same order as listed.
• AN: The total number of alleles in called genotypes.
• CD: A flag indicating that the SNP occurs within the coding region of at least one RefGene entry.
• DP: The depth (number of base calls aligned to a position and used in variant calling). In regions of high coverage, GATK down-samples the available reads.
• Exon: A comma-separated list of exon regions read from RefGene.
• FC: Functional Consequence.
• GI: A comma-separated list of gene IDs read from RefGene.
• QD: Variant Confidence/Quality by Depth.
• TI: A comma-separated list of transcript IDs read from RefGene.
FORMAT | The format column lists fields separated by colons. For example, GT:GQ. The list of fields provided depends on the variant caller used. Available fields include:
• AD: Entry of the form X,Y, where X is the number of reference calls, and Y is the number of alternate calls.
• DP: Approximate read depth; reads with MQ=255 or with bad mates are filtered.
• GQ: Genotype quality.
• GQX: Genotype quality. GQX is the minimum of the GQ value and the QUAL column. In general, these values are similar; taking the minimum makes GQX the more conservative measure of genotype quality.
• GT: Genotype. 0 corresponds to the reference base, 1 corresponds to the first entry in the ALT column, and so on. The forward slash (/) indicates that no phasing information is available.
• NL: Noise level; an estimate of base calling noise at this position.
• PL: Normalized, Phred-scaled likelihoods for genotypes.
• SB: Strand bias at this position. Larger negative values indicate less bias; values near zero indicate more bias.
• VF: Variant frequency; the percentage of reads supporting the alternate allele.
SAMPLE | The sample column gives the values specified in the FORMAT column.
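Several of the FORMAT definitions above are simple enough to express in Python. In this hedged sketch, the function names are hypothetical, the default cutoff of 30 is the default documented for VariantMinimumGQCutoff, and the handling of the ./. no-call genotype follows the convention described for gVCF files in this guide.

```python
def gqx(gq: float, qual: float) -> float:
    """GQX is the minimum of the GQ value and the QUAL column."""
    return min(gq, qual)


def fails_low_gq(gq: float, cutoff: float = 30) -> bool:
    """True if the genotype quality is below the cutoff (LowGQ filter)."""
    return gq < cutoff


def zygosity(gt: str) -> str:
    """Classify a GT value such as 0/1, 1/1, 0/0, or ./."""
    if "." in gt:
        return "no-call"
    alleles = gt.replace("|", "/").split("/")
    if all(allele == "0" for allele in alleles):
        return "reference"
    # Identical non-reference alleles are homozygous; mixed are heterozygous.
    return "homozygous" if len(set(alleles)) == 1 else "heterozygous"
```

For example, a call with GQ 99 and QUAL 50 has a GQX of 50, and a genotype of 0/1 is classified as heterozygous.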
Genome VCF Files
Genome VCF (gVCF) files are VCF v4.1 files that follow a set of conventions for
representing all sites within the genome in a reasonably compact format. The gVCF files
generated in the PCR Amplicon workflow include all sites within the region of interest
specified in the manifest file.
For more information, see sites.google.com/site/gvcftools/home/about-gvcf.
The following example illustrates the convention for representing non-variant and
variant sites in a gVCF file.
Figure 1 Example gVCF File
NOTE
The gVCF file shows no-calls at positions with low coverage, or where a low-frequency
variant (< 3%) occurs often enough (> 1%) that the position cannot be called to the
reference. A genotype (GT) tag of ./. indicates a no-call.
Supplementary Output Files
The following output files provide supplementary information, or summarize run results and analysis errors. Although these files are not required for assessing analysis results, they can be used for troubleshooting purposes.
File Name | Description
AdapterTrimming.txt | Lists the number of trimmed bases and percentage of bases for each tile. This file is present only if adapter trimming was specified for the run. Located in Data\Intensities\BaseCalls\Alignment.
AnalysisLog.txt | Processing log that describes every step that occurred during analysis of the current run folder. This file does not contain error messages. Located in the root level of the run folder.
AnalysisError.txt | Processing log that lists any errors that occurred during analysis. This file is present only if errors occurred. Located in the root level of the run folder.
CompletedJobInfo.xml | Written after analysis is complete, contains information about the run, such as date, flow cell ID, software version, and other parameters. Located in the root level of the run folder.
DemultiplexSummaryF1L1.txt | Reports demultiplexing results in a table with one row per tile and one column per sample. Located in Data\Intensities\BaseCalls\Alignment.
ErrorsAndNoCallsByLaneTileReadCycle.csv | A comma-separated values file that contains the percentage of errors and no-calls for each tile, read, and cycle. Located in Data\Intensities\BaseCalls\Alignment.
Mismatch.htm | Contains histograms of mismatches per cycle and no-calls per cycle for each tile. Located in Data\Intensities\BaseCalls\Alignment.
PCRAmpliconRunStatistics.xml | Contains summary statistics specific to the run. Located in the root level of the run folder.
Summary.xml | Contains a summary of mismatch rates and other base calling results. Located in Data\Intensities\BaseCalls\Alignment.
Summary.htm | Contains a summary web page generated from Summary.xml. Located in Data\Intensities\BaseCalls\Alignment.
PCR Amplicon Manifest File Format
A manifest file is required input for the PCR Amplicon workflow. The manifest is
generated using the Illumina Experiment Manager and uses a tab-delimited
*.AmpliconManifest file format. The manifest name for each sample is specified in the
Data section of the sample sheet.
The PCR Amplicon manifest file begins with a header section comprising a header line
followed by two columns: ReferenceGenome and Reference Genome Folder. The main
section of the file is the Regions section, which contains the following columns:
• Name: Unique user-specified name for the amplicon.
• Chromosome: Chromosome from which the amplicon originates.
• Amplicon Start: 1-based coordinate start position of the amplicon including the probe.
• Amplicon End: 1-based and inclusive coordinate of the end position of the amplicon including the probe.
• Upstream Probe Length: The length of the upstream (5') PCR probe. This field is set to zero.
• Downstream Probe Length: The length of the downstream (3') PCR probe. This field is set to zero.
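The coordinate conventions in the Regions columns can be illustrated with a small Python sketch. The Region class, the function names, and the sample row are hypothetical. The sketch assumes that a Regions row lists the columns in the order given above, separated by tabs.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """One row of the Regions section (hypothetical structure)."""
    name: str
    chromosome: str
    start: int  # Amplicon Start, 1-based
    end: int    # Amplicon End, 1-based and inclusive

    def length(self) -> int:
        """Amplicon length; both end coordinates are included."""
        return self.end - self.start + 1

    def to_zero_based_half_open(self) -> tuple[int, int]:
        """Convert to the 0-based, end-exclusive convention used by some tools."""
        return self.start - 1, self.end


def parse_region(line: str) -> Region:
    """Parse the first four tab-delimited columns of a Regions row."""
    name, chromosome, start, end = line.rstrip("\n").split("\t")[:4]
    return Region(name, chromosome, int(start), int(end))


# Made-up row: name, chromosome, start, end, and two probe lengths of zero.
region = parse_region("amp1\tchr7\t1000\t1200\t0\t0\n")
```

Because both coordinates are inclusive, an amplicon from 1000 to 1200 spans 201 bases.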
Technical Assistance
For technical assistance, contact Illumina Technical Support.
Table 1 Illumina General Contact Information
Website | www.illumina.com
Email | [email protected]
Table 2 Illumina Customer Support Telephone Numbers
Region | Contact Number
North America | 1.800.809.4566
Australia | 1.800.775.688
Austria | 0800.296575
Belgium | 0800.81102
Denmark | 80882346
Finland | 0800.918363
France | 0800.911850
Germany | 0800.180.8994
Ireland | 1.800.812949
Italy | 800.874909
Netherlands | 0800.0223859
New Zealand | 0800.451.650
Norway | 800.16836
Spain | 900.812168
Sweden | 020790181
Switzerland | 0800.563118
United Kingdom | 0800.917.0041
Other countries | +44.1799.534000
Safety Data Sheets
Safety data sheets (SDSs) are available on the Illumina website at
support.illumina.com/sds.html.
Product Documentation
Product documentation in PDF is available for download from the Illumina website. Go
to support.illumina.com, select a product, then click Documentation & Literature.
Illumina
San Diego, California 92122 U.S.A.
+1.800.809.ILMN (4566)
+1.858.202.4566 (outside North America)
[email protected]
www.illumina.com