RNA-Seq Alignment BaseSpace App Guide v1.0 (1000000006111 v00)

RNA-Seq Alignment v1.0
BaseSpace App Guide
For Research Use Only. Not for use in diagnostic procedures.
Introduction
Workflow
Workflow Diagram
Set Analysis Parameters
Analysis Methods
Analysis Output
Technical Assistance
ILLUMINA PROPRIETARY
1000000006111 v00
February 2016
This document and its contents are proprietary to Illumina, Inc. and its affiliates ("Illumina"), and are intended solely for the
contractual use of its customer in connection with the use of the product(s) described herein and for no other purpose. This
document and its contents shall not be used or distributed for any other purpose and/or otherwise communicated, disclosed,
or reproduced in any way whatsoever without the prior written consent of Illumina. Illumina does not convey any license
under its patent, trademark, copyright, or common-law rights nor similar rights of any third parties by this document.
The instructions in this document must be strictly and explicitly followed by qualified and properly trained personnel in order
to ensure the proper and safe use of the product(s) described herein. All of the contents of this document must be fully read
and understood prior to using such product(s).
FAILURE TO COMPLETELY READ AND EXPLICITLY FOLLOW ALL OF THE INSTRUCTIONS CONTAINED HEREIN
MAY RESULT IN DAMAGE TO THE PRODUCT(S), INJURY TO PERSONS, INCLUDING TO USERS OR OTHERS, AND
DAMAGE TO OTHER PROPERTY.
ILLUMINA DOES NOT ASSUME ANY LIABILITY ARISING OUT OF THE IMPROPER USE OF THE PRODUCT(S)
DESCRIBED HEREIN (INCLUDING PARTS THEREOF OR SOFTWARE).
© 2016 Illumina, Inc. All rights reserved.
Illumina, 24sure, BaseSpace, BeadArray, BlueFish, BlueFuse, BlueGnome, cBot, CSPro, CytoChip, DesignStudio,
Epicentre, ForenSeq, Genetic Energy, GenomeStudio, GoldenGate, HiScan, HiSeq, HiSeq X, Infinium, iScan, iSelect,
MiniSeq, MiSeq, MiSeqDx, MiSeq FGx, NeoPrep, NextBio, Nextera, NextSeq, Powered by Illumina, SureMDA,
TruGenome, TruSeq, TruSight, Understand Your Genome, UYG, VeraCode, verifi, VeriSeq, the pumpkin orange color,
and the streaming bases design are trademarks of Illumina, Inc. and/or its affiliate(s) in the U.S. and/or other countries. All
other names, logos, and other trademarks are the property of their respective owners.
Introduction
The BaseSpace® RNA-Seq Alignment v1.0 App supports RNA sequencing read
alignment, variant calling, gene fusion detection, and novel transcript assembly with
Cufflinks. The app can run STAR and tools from the Tuxedo Suite (Bowtie, TopHat,
Cufflinks) to produce aligned reads, variant calls, and FPKM abundance estimates of
reference genes and transcripts. The app can also perform fusion calling using the
Manta-fusion or TopHat-fusion tools.
The Cufflinks Assembly & DE v2.0 App uses the analysis results from the RNA-Seq
Alignment App to perform novel transcript merging and differential expression.
Compatible Libraries
See the BaseSpace support page for a list of library types that are compatible with the
RNA-Seq Alignment App.
Workflow Requirements
} Supports 100,000–400,000,000 reads per sample.
} Supports 40 billion reads across all samples in a single analysis.
} Supports 35–500 bp read lengths.
} Requires paired-end reads for fusion detection.
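The limits above can be checked before an analysis is launched. The following is a minimal sketch only: the function name and the layout of the `samples` argument are hypothetical and are not part of the app.

```python
# Limits copied from the Workflow Requirements list.
MIN_READS_PER_SAMPLE = 100_000
MAX_READS_PER_SAMPLE = 400_000_000
MAX_READS_PER_ANALYSIS = 40_000_000_000
MIN_READ_LENGTH = 35
MAX_READ_LENGTH = 500


def check_requirements(samples, call_fusions=False):
    """Return a list of problems; an empty list means all limits are met.

    `samples` maps a sample name to (read_count, read_length, is_paired_end).
    This layout is an assumption made for the sketch.
    """
    problems = []
    total_reads = 0
    for name, (reads, length, paired) in samples.items():
        total_reads += reads
        if not MIN_READS_PER_SAMPLE <= reads <= MAX_READS_PER_SAMPLE:
            problems.append(f"{name}: read count {reads} is out of range")
        if not MIN_READ_LENGTH <= length <= MAX_READ_LENGTH:
            problems.append(f"{name}: read length {length} bp is out of range")
        if call_fusions and not paired:
            problems.append(f"{name}: fusion detection requires paired-end reads")
    if total_reads > MAX_READS_PER_ANALYSIS:
        problems.append(f"analysis: {total_reads} total reads exceeds the limit")
    return problems
```

An empty result means that the samples fall within the documented limits. It does not guarantee that the analysis succeeds.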
Versions
The following components are used in the RNA-Seq Alignment App.
Software                     Version
Isis (Analysis Software)     2.6.25.11
TopHat (Aligner)             2.1.0
STAR (Aligner)               2.5.0a
Bowtie (Aligner)             0.12.9
Bowtie2 (Aligner)            2.2.6
Isaac Variant Caller         2.3.13-31-g3c98c29
IONA (Annotation Service)    1.0.10.37
BEDTools                     2.17.0
Cufflinks                    2.2.1
BLAST                        2.2.26+
Reference Genomes
The following reference genomes are available for alignment:
} Homo sapiens UCSC hg19 (RefSeq & Gencode gene annotations)
} Homo sapiens UCSC hg38 (RefSeq & Gencode gene annotations)
The human reference genome is PAR-Masked, which means that the Y chromosome
sequence has the Pseudo Autosomal Regions (PAR) masked (set to N) to avoid
mismapping of reads in the duplicate regions of sex chromosomes.
} Arabidopsis thaliana Ensembl TAIR10 (Ensembl gene annotation)
} Bos taurus UCSC bosTau6 (RefSeq gene annotation)
} Caenorhabditis elegans UCSC ce10 (RefSeq gene annotation)
} Danio rerio UCSC danRer7 (RefSeq gene annotation)
} Drosophila melanogaster UCSC dm3 (RefSeq gene annotation)
} Gallus gallus UCSC galGal4 (RefSeq gene annotation)
} Mus musculus UCSC mm9 (RefSeq gene annotation)
} Mus musculus UCSC mm10 (RefSeq gene annotation)
} Oryza sativa japonica Ensembl IRGSP-1.0 (Ensembl gene annotation)
} Rattus norvegicus UCSC rn5 (RefSeq gene annotation)
} Saccharomyces cerevisiae Ensembl R64-1-1 (Ensembl gene annotation)
} Sus scrofa UCSC susScr3 (RefSeq gene annotation)
} Zea mays Ensembl AGPv3 (Ensembl gene annotation)
Workflow
} Filtering—Bowtie filters the input reads against abundant sequences, such as
mitochondrial or ribosomal sequences, as defined by iGenomes at
support.illumina.com/sequencing/sequencing_software/igenome.html.
  } Only sequences that do not align against abundant sequences are passed through
  to the next phase of the analysis. Bowtie filters read pairs when at least 1 read
  aligns to an abundant sequence. Also, Bowtie trims off 2 bases from the 5’ end of
  the read because of a high mismatch rate from these 2 bases in the RNA-Seq
  libraries. See the Bowtie section.
} Alignment—The STAR or TopHat2 aligner performs a spliced alignment of the
filtered reads against the genome. Based on the user-specified genome, STAR or
TopHat2 aligns reads against known transcripts and splice junctions. See the STAR
section.
} Alignment to ERCC—If selected, STAR aligns all reads to the ERCC RNA spike-in
sequences, independent of alignment to the transcriptome. The aligner counts reads
that align to each spike-in sequence, calculates FPKMs, and computes the correlation
between FPKMs and the expected spike-in concentrations.
} Fusion Calling—If selected, fusion calling uses Manta-fusion with the STAR aligner
or TopHat-fusion with the TopHat aligner. With TopHat-fusion, TopHat2 first detects
fused alignments, and then a post-alignment analysis script identifies candidate
fusion genes from the fused alignments. See the Manta section.
} Variant Calling—The Isaac Variant Caller performs variant calling, which produces
gVCF output files. For stranded library preps, the strand bias filter is disabled.
  } Also, the Isaac Variant Caller uses the -bsnp-diploid-het-bias parameter to expand
  the range for the heterozygous variant call, in order to account for allele-specific
  expression.
  } The Isaac tool uses an RNA-specific random-forest-based variant scoring model,
  which was built using Platinum Genomes data as a reference. See the Isaac Variant
  Caller section.
} Quantification—Cufflinks quantifies reference genes and transcripts.
RnaReadCounter counts the number of aligned reads matching each annotated gene.
See the Cufflinks and RnaReadCounter sections.
} Novel Transcript Assembly—If selected, transcripts are assembled and quantified
independently for each sample.
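The Alignment to ERCC step compares measured FPKMs with expected spike-in concentrations. The guide does not state which correlation measure or transform the app uses, so the sketch below makes two assumptions: Pearson correlation, computed on log2-transformed values with a pseudocount. The function names and spike-in labels are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ercc_correlation(fpkm_by_id, expected_by_id, pseudocount=1.0):
    """Correlate observed FPKMs with expected concentrations.

    Both arguments map a spike-in ID to a value. Only IDs present in both
    mappings are used. The log2 transform and pseudocount are assumptions.
    """
    ids = sorted(set(fpkm_by_id) & set(expected_by_id))
    observed = [math.log2(fpkm_by_id[i] + pseudocount) for i in ids]
    expected = [math.log2(expected_by_id[i] + pseudocount) for i in ids]
    return pearson(observed, expected)
```

A value close to 1 suggests that the measured abundances track the expected spike-in concentrations.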
Workflow Diagram
Figure 1 RNA-Seq Alignment Workflow
Set Analysis Parameters
1 Navigate to BaseSpace, and then click the Apps tab.
2 In Categories, click RNA-Seq, and then click RNA-Seq Alignment.
3 From the drop-down list, select version 1.0.0, and then click Launch to open the app.
4 In the App Session Name field, enter the analysis name.
By default, the analysis name includes the app name, followed by the date and time
that the analysis session starts.
5 From the Save Results To field, select the project that stores the app results.
6 From the Samples field, browse to the sample you want to analyze and select the
checkbox.
You can analyze multiple samples. Select the checkbox to identify samples prepared
with the TruSeq Stranded Total RNA and TruSeq Stranded mRNA library prep kits
for the first strand and the ScriptSeq v2.0 RNA-Seq library prep kit for the second
strand.
7 From the Reference Genome field, select the reference genome to be used for
alignment. The default is Homo sapiens (PAR-masked)/hg19 (RefSeq).
8 From the Panel field, select from the following:
} None (default)
} TruSight RNA Pan-Cancer
The TruSight RNA Pan-Cancer library prep kit only supports the Human, UCSC
hg19 (RefSeq & Gencode) reference genome.
9 From the Aligner field, select from the following methods:
} STAR (default)
} TopHat (Bowtie)
} TopHat (Bowtie2)
10 [Optional] Select the QC Mode checkbox to analyze only a subset of the read pairs
for each sample. If selected, enter the number of read pairs for each sample.
11 [Optional] Select the Novel Transcript Assembly checkbox for Cufflinks to assemble
novel transcripts. If selected, select the Adjust Transcript Assembly for Samples
Without PolyA Selection checkbox if the samples are prepared without PolyA
selection (TruSeq Total RNA kit).
NOTE
By default, the Call Fusions checkbox is checked if you selected a panel and an aligner that
supports fusion calling. Paired-end reads are required. The TopHat (Bowtie) aligner
supports TopHat-fusion. The STAR aligner supports Manta-fusion. The TopHat (Bowtie2)
aligner does not support fusion calling. For best results, use the STAR aligner with the
TruSight RNA Pan-Cancer panel.
12 From the ERCC Spike-In Controls field, select from the following options:
} None (default)
} Mix 1
} Mix 2
13 [Optional] Select the Trim TruSeq Adapters checkbox.
This option trims adapter sequences from the FASTQ file. Use this option if adapter
trimming was not performed in the demultiplexing.
14 [Optional] Select the Set Advanced Options checkbox to enable the advanced
options and then specify the values for the appropriate options.
15 Click Continue.
The RNA-Seq Alignment App begins analysis.
When analysis is complete, the app updates the status of the session and sends an
email to notify you.
Advanced Options
[Optional] Specify the values for the advanced options.
Table 1 TopHat Options Table
Option                      Description

Read Mismatches             Enter a number between 0 and 5. The default is 2.
                            Alignments with more than this number of mismatches are
                            discarded.

Read Gap Length             Enter a number between 0 and 5. The default is 2.
                            Alignments with a total gap length greater than this value
                            are discarded.

Read Edit Distance          Enter a number between 0 and 10. The default is 2.
                            Alignments with more than the selected edit distance are
                            discarded.

Mate Inner Distance         Enter a number between 0 and 300. The default is 50.
                            The expected (mean) inner distance between mate pairs. For
                            paired-end runs with fragments selected at 300 bp, where
                            each end is 50 bp, set this value to 200.

Mate Standard Deviation     Enter a number between 1 and 100. The default is 20.
                            The standard deviation of the distribution of inner distances
                            between mate pairs.

Minimum Intron Length       Enter a number between 10 and the maximum intron length.
                            The default is 70.
                            TopHat ignores donor/acceptor pairs closer than the specified
                            number of bases.

Maximum Intron Length       Enter a number between the minimum intron length and
                            1,000,000. The default is 500,000.
                            When searching for junctions, TopHat ignores donor/acceptor
                            pairs farther than the specified number of bases, except when
                            a pair is supported by a split segment alignment of a long
                            read.

Maximum Insertion Length    Enter a number between 0 and 5. The default is 3.

Maximum Deletion Length     Enter a number between 0 and 5. The default is 3.
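The Mate Inner Distance example in Table 1 is the fragment length minus the length of both reads. The following sketch shows that arithmetic. The function name is hypothetical, and the range check mirrors the 0 to 300 range that the option accepts.

```python
def mate_inner_distance(fragment_length, read_length):
    """Expected inner distance between mates of a paired-end fragment.

    inner distance = fragment length - 2 * read length
    """
    inner = fragment_length - 2 * read_length
    if not 0 <= inner <= 300:
        # The option accepts values from 0 to 300 only.
        raise ValueError(f"inner distance {inner} is outside the range 0-300")
    return inner


# Fragments selected at 300 bp with 50 bp reads, as in the table.
print(mate_inner_distance(300, 50))
```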
Table 2 STAR Options Table
Option                      Description

Score Difference to         Enter a number between 1 and 5. The default is 1.
Filter Multimapping         When a read aligns to multiple loci, the alignment is reported
Alignments                  if its score is in the range of (s - value, s], where "s" is
                            the highest alignment score and "value" is the number that
                            you entered.

Maximum Mismatches          Enter a number between 1 and 21. The default is 10.
                            The output includes alignments that have fewer mismatches
                            than this value.

Maximum Mismatches          Enter a number between 0 and 0.5. The default is 0.3.
Over Read Length            The output includes alignments that have a ratio of
                            mismatches to mapped length that is less than this value.

Minimum Score Over          Enter a number between 0.33 and 1. The default is 0.66.
Read Length                 The output includes alignments that have a score higher than
                            this value. The score is normalized by read length, which is
                            the length sum of mates for paired-end reads.

Minimum Matches             Enter a number between 0.33 and 1. The default is 0.66.
Over Read Length            The output includes alignments that have a number of matched
                            bases higher than this value. The number is normalized by read
                            length, which is the length sum of mates for paired-end reads.

Maximum Seed Search         Enter a number between 30 and 1000. The default is 50.
Step                        The seed search starts at position 1 and the step length
                            determines the next start position. The step length cannot be
                            longer than this value.

Minimum Intron Length       Enter a number between 10 and the maximum intron size.
                            The default is 21.
                            The genomic gap is considered an intron when its length is
                            greater than or equal to this value. Otherwise, the gap is
                            considered a deletion.

Maximum Intron Length       Enter 0 or a number between the minimum intron size and
                            1,000,000. The default is 0.
                            If the value is set to 0, STAR calculates the maximum intron
                            size.

Use Annotation              Select the checkbox to use splice junction information in the
                            annotation. By default, the checkbox is selected.

Two-Pass Mode               Select the STAR 2-pass alignment mode. The default is Basic.
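The half-open score range (s - value, s] used by the Score Difference to Filter Multimapping Alignments option can be illustrated as follows. This is a sketch of the rule as described in Table 2, not of STAR's internal code, and the function name is hypothetical.

```python
def filter_multimappers(scores, score_difference=1):
    """Keep alignment scores in the range (best - score_difference, best]."""
    if not scores:
        return []
    best = max(scores)
    # The lower bound is exclusive and the upper bound is inclusive.
    return [s for s in scores if best - score_difference < s <= best]
```

With the default of 1, only alignments that tie for the best score are reported. A larger value also admits alignments that score slightly lower.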
Table 3 Cuffnorm Options Table
Field                       Description

Hits Normalization          Compatible—Cuffnorm counts only those fragments
                            compatible with some reference transcript towards the number
                            of mapped fragments used in the FPKM denominator. This
                            option is the default.
                            Total—Cuffnorm counts all fragments, including those not
                            compatible with any reference transcript, towards the number
                            of mapped fragments used in the FPKM denominator.
Table 4 Cufflinks Options Table
Option                      Description

Hits Normalization          Compatible—Cufflinks counts only those fragments
                            compatible with some reference transcript towards the
                            number of mapped fragments used in the FPKM
                            denominator.
                            Total—Cufflinks counts all fragments, including the
                            fragments not compatible with any reference transcript,
                            towards the number of mapped fragments used in the FPKM
                            denominator. This option is the default.

Minimum Isoform Fraction    Enter a number between 0.05 and 1. The default is 0.1.
                            After calculating isoform abundance for a gene, Cufflinks
                            filters out transcripts that have low abundance. Isoforms that
                            are expressed at low levels often cannot be reliably
                            assembled.
                            The isoforms can be artifacts of incompletely spliced
                            precursors of processed transcripts. This parameter filters out
                            introns that have fewer supporting spliced alignments.

Pre-mRNA Fraction           Enter a number between 0 and 1. The default is 0.15.
                            Some RNA-seq protocols produce a significant number of
                            reads that come from incompletely spliced transcripts. These
                            reads can confound the assembly of fully spliced mRNAs.
                            Cufflinks uses this parameter to filter out alignments that are
                            within the intronic intervals. The minimum depth of coverage
                            in the intronic region covered by the alignment is divided by
                            the number of spliced reads. If the result is lower than the
                            value in this parameter, the intronic alignments are ignored.

Minimum Intron Length       Enter a number between 10 and the maximum intron length.
                            The default is 50.

Maximum Intron Length       Enter a number between the minimum intron length and
                            600,000. The default is 300,000.
                            Cufflinks does not report transcripts with introns longer than
                            this value and ignores SAM alignments with REF_SKIP CIGAR
                            operations longer than this value.

Minimum Fragments per       Enter a number between 5 and 100. The default is 10.
Transfrag                   Assembled transcript fragments supported by fewer than this
                            value of aligned RNA-seq fragments are not reported.
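The Pre-mRNA Fraction rule in Table 4 can be written as a small calculation. The sketch below follows the wording of the table only. It is not Cufflinks code, the function name and arguments are hypothetical, and the handling of a zero spliced-read count is an assumption.

```python
def ignore_intronic_alignment(intron_depths, spliced_read_count,
                              pre_mrna_fraction=0.15):
    """Return True if an intronic alignment would be ignored.

    intron_depths: coverage depth at each position of the intronic region
        covered by the alignment.
    spliced_read_count: number of spliced reads that support the intron.
    """
    if spliced_read_count == 0:
        # Assumption: with no spliced reads there is nothing to compare against.
        return False
    ratio = min(intron_depths) / spliced_read_count
    return ratio < pre_mrna_fraction
```

For example, a minimum intronic depth of 2 with 40 spliced reads gives a ratio of 0.05, which is below the default of 0.15, so the alignment would be ignored under this reading of the rule.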
Table 5 Cuffquant/Cufflinks Options Table
Option                      Description

Fragment Bias Correction    Cuffquant/Cufflinks runs a bias detection and correction
                            algorithm, which can improve the accuracy of transcript
                            abundance estimates.

Multi-read Correction       Cuffquant/Cufflinks runs an initial estimation procedure to
                            weight reads mapping to multiple locations in the genome
                            more accurately.

No Effective Length         Cuffquant/Cufflinks disables effective length normalization to
Correction                  transcript FPKM.
Analysis Methods
The RNA-Seq Alignment App uses the following methods to analyze the sequencing
data.
} STAR
} Manta
} Tuxedo Suite, which includes Bowtie, Bowtie2, TopHat, and Cufflinks
} Isaac Variant Caller
STAR
Spliced Transcripts Alignment to a Reference (STAR) is a fast RNA-seq read mapper,
with support for splice-junction and fusion read detection.
STAR aligns reads by finding the Maximal Mappable Prefix (MMP) hits between reads
(or read pairs) and the genome, using a Suffix Array index. Different parts of a read can
be mapped to different genomic positions, corresponding to splicing or RNA-fusions. The
genome index includes known splice-junctions from annotated gene models, allowing for
sensitive detection of spliced reads. STAR performs local alignment, automatically soft
clipping ends of reads with high mismatches.
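The idea of a Maximal Mappable Prefix can be shown with a toy example. The sketch below uses a naive substring scan instead of a suffix array, requires exact matches, and ignores the reverse strand, so it illustrates the concept only and is not STAR's algorithm. The function names and the sequences are made up.

```python
def maximal_mappable_prefix(read, genome, start=0):
    """Return (length, positions) of the longest prefix of read[start:]
    that occurs exactly in the genome."""
    best_len, best_pos = 0, []
    for length in range(1, len(read) - start + 1):
        seed = read[start:start + length]
        positions = [i for i in range(len(genome) - length + 1)
                     if genome.startswith(seed, i)]
        if not positions:
            break  # a longer prefix cannot match if this one does not
        best_len, best_pos = length, positions
    return best_len, best_pos


def seed_read(read, genome):
    """Split a read into consecutive maximal mappable prefixes."""
    seeds = []
    start = 0
    while start < len(read):
        length, positions = maximal_mappable_prefix(read, genome, start)
        if length == 0:
            start += 1  # assumption: skip a base that matches nowhere
            continue
        seeds.append((start, length, positions))
        start += length
    return seeds


# Two "exons" separated by a run of G bases; the read spans the junction.
genome = "ACGTTGCA" + "GGGGGGGG" + "TTAGGCAT"
print(seed_read("TTGCATTAGG", genome))
```

In this example, the first seed maps to the end of the first exon and the second seed maps to the start of the second exon. This is how different parts of one read can map to different genomic positions.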
We recommend using STAR because it can quickly align more reads than other aligner
methods.
For more information, see https://github.com/alexdobin/STAR.
Manta
Manta calls structural variants (SVs) from mapped paired-end sequencing reads. Manta
discovers candidate SVs from discordant pair and split-read alignments, followed by
local assembly and realignment to refine candidates.
In combination with STAR, the app uses Manta on RNA-seq data to detect gene fusions,
which appear like translocations in the RNA alignments. The Manta workflow is
followed by RNA-specific filtering and scoring, which is based on the following:
} Read counts across the fusion and alignment qualities.
} Genome-wide realignment of fusion contigs to filter candidates that can be explained
by a local alignment elsewhere in the genome.
} Length of coverage around the breakpoints, indicating presence of stable fusion
transcripts.
For more information, see https://github.com/Illumina/manta.
Tuxedo Suite
The Tuxedo suite offers a set of tools for analyzing a variety of RNA-Seq data, including
short-read mapping, identification of splice junctions, transcript and isoform detection,
differential expression, visualizations, and quality control metrics.
Bowtie
Bowtie is an ultrafast, memory-efficient aligner designed to quickly align large sets of
short reads to large genomes. Bowtie indexes the genome to keep its memory footprint
small: for the human genome, the index is typically about 2.2 GB for single-read
alignment or 2.9 GB for paired-end alignment. Multiple processors can be used
simultaneously to achieve greater alignment speed.
Bowtie forms the basis for other tools like TopHat, a fast splice junction mapper for
RNA-seq reads, and Cufflinks, a tool for transcriptome assembly and isoform
quantitation from RNA-seq reads.
For more information, see http://bowtie-bio.sourceforge.net/index.shtml.
Bowtie 2
Bowtie 2 can quickly align large sets of short DNA sequences to large genomes.
You can use Bowtie 2 to align reads of about 50 up to hundreds or thousands of
characters. For the human genome, the memory footprint is approximately 3.2 GB.
Bowtie 2 forms the basis for other tools like TopHat, a fast splice junction mapper for
RNA-seq reads, and Cufflinks, a tool for transcriptome assembly and isoform
quantitation from RNA-seq reads.
For more information, see http://bowtie-bio.sourceforge.net/bowtie2/index.shtml.
TopHat
TopHat is a fast splice junction mapper for RNA-seq reads that can only be used with
Bowtie or Bowtie2. TopHat uses Bowtie or Bowtie2 to map RNA-seq reads, and then it
analyzes the mapping results to identify splice junctions between exons.
For more information, see http://ccb.jhu.edu/software/tophat/index.shtml.
Cufflinks
Cufflinks assembles aligned RNA-Seq reads into transcripts, estimates their abundances,
and tests for differential expression and regulation of the transcriptome.
For more information, see cole-trapnell-lab.github.io/cufflinks/.
Isaac Variant Caller
The Isaac variant caller identifies single nucleotide variants (SNVs) and small indels
using the following steps:
} Read filtering—Filters reads failing quality checks.
} Indel calling—Identifies a set of possible indel candidates and realigns all reads
overlapping the candidates using a multiple sequence aligner.
} SNV calling—Computes the probability of each possible genotype given the aligned
read data and a prior distribution of variation in the genome.
} Indel genotypes—Calls indel genotypes and assigns probabilities.
} Variant call output—Generates output in a VCF and a compressed genome variant
call (gVCF) file.
For more information, see https://github.com/sequencing/isaac_variant_caller
Read Filtering
Input reads are filtered under the following conditions:
} Reads that failed base calling quality checks.
} Reads marked as PCR duplicates.
} Paired-end reads not marked as a proper pair.
} Reads with a mapping quality < 20.
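The four filter conditions can be expressed against the FLAG and MAPQ fields of a SAM/BAM record. The sketch below assumes that the conditions correspond to the standard SAM flag bits. The guide does not say how the variant caller implements the filter, and the function name is hypothetical.

```python
# Standard SAM flag bits (from the SAM specification).
FLAG_PAIRED = 0x1
FLAG_PROPER_PAIR = 0x2
FLAG_QC_FAIL = 0x200
FLAG_DUPLICATE = 0x400

MIN_MAPQ = 20  # reads with a mapping quality below 20 are filtered


def passes_read_filter(flag, mapq):
    """Return True if a read survives the four filter conditions."""
    if flag & FLAG_QC_FAIL:
        return False  # failed quality checks
    if flag & FLAG_DUPLICATE:
        return False  # marked as a PCR duplicate
    if (flag & FLAG_PAIRED) and not (flag & FLAG_PROPER_PAIR):
        return False  # paired-end read that is not in a proper pair
    if mapq < MIN_MAPQ:
        return False  # low mapping quality
    return True
```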
Indel Calling
The variant caller proceeds with candidate indel discovery and generates alternate read
alignments based on the candidate indels. As part of the realignment process, the variant
caller selects a representative alignment to be used for site genotype calling and depth
summarization by the SNV caller.
SNV Calling
The variant caller runs a series of filters on the set of filtered and realigned reads for
SNV calling without affecting indel calls. First, any contiguous trailing sequence of N
base calls is trimmed from the ends of reads. Using a mismatch density filter, reads
having an unexpectedly high number of disagreements with the reference are masked, as
follows:
} The variant caller treats each insertion or deletion as a single mismatch.
} Base calls with more than 2 mismatches to the reference sequence within 20 bases of
the call are ignored.
} If the call occurs within the first or last 20 bases of a read, the mismatch limit is
applied to a 41-base window at the corresponding end of the read.
} The mismatch limit is applied to the entire read when the read length is 41 or
shorter.
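The windowing rules of the mismatch density filter can be sketched as follows. The input is a per-position list of mismatch events in which each insertion or deletion has already been collapsed to a single event. The sketch assumes that the window includes the base call itself, which the text does not state explicitly. The function name is hypothetical.

```python
def mismatch_density_mask(mismatches, max_mismatches=2, flank=20):
    """Return a list of booleans; True means the base call is ignored.

    mismatches: one boolean per read position, True where the read
        disagrees with the reference (an indel counts as one event).
    """
    n = len(mismatches)
    window = 2 * flank + 1  # 41 bases
    masked = []
    for i in range(n):
        if n <= window:
            # Short read: the limit applies to the entire read.
            start, end = 0, n
        else:
            # Centre the window on the call, but keep it inside the read,
            # so calls near either end use the 41-base window at that end.
            start = min(max(i - flank, 0), n - window)
            end = start + window
        masked.append(sum(mismatches[start:end]) > max_mismatches)
    return masked
```

For a 100-base read with mismatches at positions 10, 12, and 14, base calls whose window still covers all three positions are masked. Calls farther along the read are kept.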
Indel Genotypes
The variant caller filters all bases marked by the mismatch density filter and any N base
calls that remain after the end-trimming step. These filtered base calls are not used for
site-genotyping but appear in the filtered base call counts in the variant caller output for
each site.
All remaining base calls are used for site-genotyping. The genotyping method
heuristically adjusts the joint error probability that is calculated from multiple
observations of the same allele on each strand of the genome. This correction accounts
for the possibility of error dependencies.
This method treats the highest-quality base call from each allele and strand as an
independent observation and leaves the associated base call quality scores unmodified.
Quality scores for subsequent base calls for each allele and strand are then adjusted. This
adjustment increases the joint error probability of the given allele above the error
expected from independent base call observations.
Variant Call Output
After the SNV and indel genotyping methods are complete, the variant caller applies a
final set of heuristic filters to produce the final set of calls in the output.
The output in the genome variant call (gVCF) file captures the genotype at each position
and the probability that the consensus call differs from reference. This score is expressed
as a Phred-scaled quality score.
RnaReadCounter
The RnaReadCounter, an internal tool, counts the number of aligned reads in an RNA-Seq
sample that match each annotated gene. The RnaReadCounter method is similar to
"htseq-count" in "union-mode", using a "chromsweep" algorithm.
Read counts are based on the overlapping of both reads in a pair with exons of a single
gene and do not consider individual transcripts separately. Reads are not counted if they
map to more than 1 genomic position or to a position with overlapping exons from more
than one gene.
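The counting rules above can be sketched in a few lines. This is not the RnaReadCounter implementation: it uses a linear scan instead of a chromsweep, and the tuple layouts, half-open coordinates, and function names are assumptions made for the example.

```python
def genes_overlapping(chrom, start, end, exons):
    """Return the set of genes with an exon that overlaps [start, end).

    exons: list of (gene, chrom, exon_start, exon_end), half-open.
    """
    return {g for g, c, s, e in exons if c == chrom and s < end and start < e}


def count_pair(read1, read2, exons, num_hits=1):
    """Return the gene that a read pair is counted for, or None.

    read1, read2: (chrom, start, end) of each mate.
    num_hits: number of genomic positions that the pair maps to.
    """
    if num_hits > 1:
        return None  # multi-mapped reads are not counted
    genes1 = genes_overlapping(*read1, exons)
    genes2 = genes_overlapping(*read2, exons)
    if len(genes1 | genes2) != 1:
        return None  # no gene, or exons from more than one gene
    if genes1 != genes2:
        return None  # both reads must overlap exons of the same gene
    return next(iter(genes1))
```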
Analysis Output
1 Navigate to the BaseSpace site.
2 To view the results, click the Projects tab, then the project name, and then the
analysis.
Figure 2 RNA-Seq Alignment Output Navigation Bar
Use the left navigation bar to access the following analysis output:
} Analysis Info—Information about the analysis session, including log files.
} Inputs—Overview of input settings.
} Output Files—Output files for the samples.
} Analysis Reports
} Summary—Analysis metrics for the aggregate results.
} Sample Analysis—Analysis reports for each sample.
Analysis Info
The Analysis Info page displays the analysis settings and execution details.
Row Heading         Definition
Name                Name of the analysis session.
Application         App that generated this analysis.
Date Started        Date and time the analysis session started.
Date Completed      Date and time the analysis session completed.
Duration            Duration of the analysis.
Session Type        Multi-Node or Single-Node
Status              Status of the analysis session. The status shows either Running
                    or Complete and the number of nodes used.
Log Files
File Name               Description
CompletedJobInfo.xml    Contains information about the completed analysis session.
Logging.zip             Contains all detailed log files for each step of the workflow.
SampleSheet.csv         Sample sheet.
SampleSheetUsed.csv     A copy of the sample sheet.
WorkflowError.txt       Contains error messages created when running the
                        workflow.
Output Files
The Output Files page provides access to the output files for each sample analysis.
} BAM Files
} VCF Files
} gVCF Files
} FPKM Files
} Coverage.bedGraph.gz Files
} Coverage.bw Files
} Junctions.bed Files
BAM File Format
A BAM file (*.bam) is the compressed binary version of a SAM file that is used to
represent aligned sequences up to 128 Mb. SAM and BAM formats are described in
detail at https://samtools.github.io/hts-specs/SAMv1.pdf.
BAM files use the file naming format of SampleName_S#.bam, where # is the sample
number determined by the order that samples are listed for the run. In multi-node mode,
the S# is set to S1, regardless of the order of the sample.
BAM files contain a header section and an alignment section:
} Header—Contains information about the entire file, such as sample name, sample
length, and alignment method. Alignments in the alignments section are associated
with specific information in the header section.
} Alignments—Contains read name, read sequence, read quality, alignment
information, and custom tags. The read name includes the chromosome, start
coordinate, alignment quality, and the match descriptor string.
The alignments section includes the following information for each read or read pair:
} RG: Read group, which indicates the number of reads for a specific sample.
} BC: Barcode tag, which indicates the demultiplexed sample ID associated with the
read.
} SM: Single-end alignment quality.
} AS: Paired-end alignment quality.
} NM: Edit distance tag, which records the Levenshtein distance between the read and
the reference.
} XN: Amplicon name tag, which records the amplicon tile ID associated with the
read.
BAM index files (*.bam.bai) provide an index of the corresponding BAM file.
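The tags listed above appear as optional TAG:TYPE:VALUE fields after the 11 mandatory columns of a SAM record. The following sketch parses them from one line of SAM text, such as the text form of a BAM record. The function names are hypothetical, and only the integer and float types are converted.

```python
def parse_sam_tags(fields):
    """Parse optional SAM fields of the form TAG:TYPE:VALUE into a dict."""
    tags = {}
    for field in fields:
        tag, type_code, value = field.split(":", 2)
        if type_code == "i":
            tags[tag] = int(value)
        elif type_code == "f":
            tags[tag] = float(value)
        else:
            tags[tag] = value  # other types are kept as text in this sketch
    return tags


def tags_from_sam_line(line):
    """Return the optional tags of one tab-separated SAM alignment line."""
    columns = line.rstrip("\n").split("\t")
    return parse_sam_tags(columns[11:])  # 11 mandatory columns come first
```

For a record that carries `NM:i:2`, the returned dictionary maps `NM` to the integer 2.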
VCF File Format
Variant Call Format (VCF) is a widely used file format developed by the genomics
scientific community that contains information about variants found at specific positions
in a reference genome.
VCF files use the file naming format SampleName_S#.vcf, where # is the sample number
determined by the order that samples are listed for the run.
VCF File Header—Includes the VCF file format version and the variant caller version.
The header lists the annotations used in the remainder of the file. If MARS is listed, the
Illumina internal annotation algorithm annotated the VCF file. The VCF header includes
the reference genome file and BAM file. The last line in the header contains the column
headings for the data lines.
VCF File Data Lines—Each data line contains information about a single variant.
VCF File Headings
Heading
Description
CHROM
The chromosome of the reference genome. Chromosomes appear in
the same order as the reference FASTA file.
POS
The single-base position of the variant in the reference chromosome.
For SNPs, this position is the reference base with the variant; for insertions
or deletions, this position is the reference base immediately before the
variant.
ID
The rs number for the SNP obtained from dbSNP.txt, if applicable.
If there are multiple rs numbers at this location, the list is semicolon
delimited. If no dbSNP entry exists at this position, a missing value
marker ('.') is used.
REF
The reference genotype. For example, a deletion of a single T is
represented as reference TT and alternate T. An A to T single nucleotide
variant is represented as reference A and alternate T.
ALT
The alleles that differ from the reference read.
For example, an insertion of a single T is represented as reference A and
alternate AT. An A to T single nucleotide variant is represented as
reference A and alternate T.
QUAL
A Phred-scaled quality score assigned by the variant caller.
Higher scores indicate higher confidence in the variant and lower
probability of errors. For a quality score of Q, the estimated probability
of an error is 10^(-Q/10). For example, the set of Q30 calls has a 0.1% error
rate. Many variant callers assign quality scores based on their statistical
models, which are high in relation to the error rate observed.
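The Phred relationship described for the QUAL column can be checked with a short calculation. The function names below are illustrative only.

```python
import math


def phred_to_error_probability(q):
    """Estimated error probability for a Phred-scaled quality score Q."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred-scaled quality score for an error probability p."""
    return -10 * math.log10(p)


# Q30 corresponds to an error probability of about 0.001, that is, 0.1%.
print(phred_to_error_probability(30))
```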
VCF File Annotations
Heading
Description
FILTER
If all filters are passed, PASS is written in the filter column.
• LowDP—Applied to sites with depth of coverage below a cutoff.
• LowGQ—The genotyping quality (GQ) is below a cutoff.
• LowQual—The variant quality (QUAL) is below a cutoff.
• LowVariantFreq—The variant frequency is less than the given
threshold.
• R8—For an indel, the number of adjacent repeats (1-base or 2-base)
in the reference is greater than 8.
• SB—The strand bias is more than the given threshold. Used with the
Somatic Variant Caller and GATK.
INFO
Possible entries in the INFO column include:
• AC—Allele count in genotypes for each ALT allele, in the same order
as listed.
• AF—Allele Frequency for each ALT allele, in the same order as listed.
• AN—The total number of alleles in called genotypes.
• CD—A flag indicating that the SNP occurs within the coding region
of at least 1 RefGene entry.
• DP—The depth (number of base calls aligned to a position and used
in variant calling).
• Exon—A comma-separated list of exon regions read from RefGene.
• FC—Functional Consequence.
• GI—A comma-separated list of gene IDs read from RefGene.
• QD—Variant Confidence/Quality by Depth.
• TI—A comma-separated list of transcript IDs read from RefGene.
FORMAT
The format column lists fields separated by colons. For example,
GT:GQ. The list of fields provided depends on the variant caller used.
Available fields include:
• AD—Entry of the form X,Y, where X is the number of reference calls,
and Y is the number of alternate calls.
• DP—Approximate read depth; reads with MQ=255 or with bad mates
are filtered.
• GQ—Genotype quality.
• GQX—Genotype quality. GQX is the minimum of the GQ value and
the QUAL column. In general, these values are similar; taking the
minimum makes GQX the more conservative measure of genotype
quality.
• GT—Genotype. 0 corresponds to the reference base, 1 corresponds
to the first entry in the ALT column, and so on. The forward slash (/)
indicates that no phasing information is available.
• NL—Noise level; an estimate of base calling noise at this position.
• PL—Normalized, Phred-scaled likelihoods for genotypes.
• SB—Strand bias at this position. Larger negative values indicate less
bias; values near 0 indicate more bias. Used with the Somatic Variant
Caller and GATK.
• VF—Variant frequency; the percentage of reads supporting the
alternate allele.
SAMPLE
The sample column gives the values specified in the FORMAT column.
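As a rough illustration of how the FORMAT and SAMPLE columns pair up, the following Python sketch splits one single-sample VCF data line and computes GQX as the minimum of GQ and QUAL, as described in the table. The record and the function name parse_sample are made up for illustration and are not output from the app.

```python
def parse_sample(vcf_line):
    """Pair FORMAT keys with SAMPLE values for a single-sample VCF data line."""
    cols = vcf_line.rstrip("\n").split("\t")
    # Fixed VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
    qual, filt, fmt, sample = cols[5], cols[6], cols[8], cols[9]
    fields = dict(zip(fmt.split(":"), sample.split(":")))
    fields["FILTER"] = filt.split(";")  # ["PASS"] when all filters are passed
    # GQX as described in the table: the minimum of GQ and QUAL.
    if "GQ" in fields and qual != ".":
        fields["GQX_computed"] = min(float(fields["GQ"]), float(qual))
    return fields


# Hypothetical record, for illustration only.
line = "chr1\t12345\t.\tA\tG\t87.5\tPASS\tDP=40\tGT:GQ:AD\t0/1:99:22,18"
rec = parse_sample(line)
print(rec["GT"], rec["GQX_computed"], rec["AD"])
```

Which FORMAT keys are present depends on the variant caller, so the sketch looks fields up by name rather than by position.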
Genome VCF Files
Genome VCF (gVCF) files are VCF v4.1 files that follow a set of conventions for
representing all sites within the genome in a reasonably compact format. The gVCF files
include all sites within the region of interest in a single file for each sample.
The gVCF file shows no-calls at positions with low coverage, or where a low-frequency
variant (< 3%) occurs often enough (> 1%) that the position cannot be called to the
reference. A genotype (GT) tag of ./. indicates a no-call.
For more information, see sites.google.com/site/gvcftools/home/about-gvcf.
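A minimal sketch of how a no-call could be recognized when reading a gVCF file, assuming the GT value has already been extracted from the sample column. The helper name is_no_call is hypothetical.

```python
def is_no_call(gt):
    """Return True when every allele in a GT string is missing, for example './.'."""
    alleles = gt.replace("|", "/").split("/")
    return all(allele == "." for allele in alleles)


print(is_no_call("./."), is_no_call("0/1"))
```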
FPKM Files
Fragments Per Kilobase of sequence per Million mapped reads (FPKM) normalizes the
number of aligned reads by the size of the sequence feature and the total number of
mapped reads.
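The normalization described above can be sketched as a simple ratio. This is only the textbook definition with made-up numbers; the values in the app's output files come from its quantification step and can differ from this calculation.

```python
def fpkm(fragments, feature_length_bp, total_mapped_fragments):
    """Fragments per kilobase of feature per million mapped fragments."""
    kilobases = feature_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragments / (kilobases * millions)


# Made-up example: 500 fragments on a 2,000 bp transcript, 20 million mapped.
print(fpkm(500, 2_000, 20_000_000))  # 500 / (2 * 20) = 12.5
```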
In each output directory, this app creates the following output files:
} genes.fpkm_tracking—Quantifies the expression of genes specified in the GTF
annotation file.
} isoforms.fpkm_tracking—Quantifies the expression of transcripts specified in the
GTF annotation file.
Analysis Reports
The RNA-Seq Alignment App provides an aggregate summary for all the samples and a
summary of statistics per sample.
Summary Analysis Report
The RNA-Seq Alignment App provides an aggregate summary for all the samples.
Table 6 Summary Table
Statistic | Definition
Sample ID | The sample ID.
Read Length | The length of reads.
Number of Reads | The total number of reads passing filter for this sample.
% Total Aligned | Percentage of reads passing filter that aligned to the reference, including abundant reads.
% Abundant | Percentage of reads that align to abundant transcripts, such as mitochondrial and ribosomal sequences.
% Unaligned | Percentage of reads that do not align to the reference.
Median CV Coverage Uniformity | The median coefficient of variation of coverage of the 1000 most highly expressed transcripts, as reported by the CollectRnaSeqMetrics utility from Picard tools. Ideal value = 0.
% Stranded | Percentage of reads that are stranded.
Summary Plots
} To save plot as scalable vector graphics (SVG), click Save Plot as SVG.
} To export data from plot as comma-separated values (CSV), click Export Data as CSV.
Table 7 Summary Plots Table
Plot Name | Description
Insert Length Distribution | The plot summarizes the insert length distribution for paired-end reads. The 3 vertical lines of the box represent the quartiles and the whiskers represent the 5th and 95th percentiles. The insert length for this box plot is capped at 600 bp.
Alignment Distribution | If a panel is selected, the app calculates the alignment information based on the on-target genes. The plot shows the percentage for the color-coded genomic regions, which include coding, UTR, intron, and non-targeted or intergenic. The app reports "non-targeted" if a panel is selected.
Transcript Coverage | If a panel is selected, the app calculates the alignment information based on the on-target genes. The plot shows the transcript coverage position as reported by the CollectRnaSeqMetrics utility from Picard tools. A vertical bar shows the relative coverage at the position in each row. The numbers between 0 and 100 represent the normalized position along a transcript.
ERCC Spike-Ins
The ERCC spike-ins analysis summary is available when an ERCC mix is selected.
Table 8 ERCC Spike-Ins Table
Statistic | Definition
Total Spike-In Reads (% Reads) | Percentage of reads passing filter that aligned to the ERCC sequences.
Pearson Correlation | Pearson correlation of log RNA FPKM and log spike-in molar concentration.
Spearman Correlation | Spearman correlation of log RNA FPKM and log spike-in molar concentration.
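The two correlations in Table 8 can be illustrated with a short standard-library sketch. The concentrations and FPKM values are made up and are not real ERCC data. The log base (base 2 here) is an assumption, although it does not change either correlation, and the sketch assumes all values are positive so the logarithm is defined.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """Average 1-based ranks, so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up spike-in data, for illustration only.
concentration = [0.5, 2.0, 8.0, 32.0, 128.0]
fpkm_values = [1.1, 3.9, 20.0, 55.0, 300.0]
log_conc = [math.log2(c) for c in concentration]
log_fpkm = [math.log2(f) for f in fpkm_values]
print(pearson(log_fpkm, log_conc), spearman(log_fpkm, log_conc))
```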
Sample Analysis Reports
The RNA-Seq Alignment App provides an overview of statistics per sample in the
Analysis Reports section. To download the statistics, click PDF Summary Report.
Primary Analysis Information
Table 9 Primary Analysis Information Table
Statistic | Definition
Read Length | Length of reads.
Number of Reads | Total number of reads passing filter for this sample.
Bases (GB) | The total number of bases for this sample.
Q30 Bases (GB) | The total number of bases with a quality score of 30 or higher.
Insert Information
Table 10 Insert Information Table
Statistic | Definition
Insert Length Median | Median length of a sequenced fragment. The fragment length is calculated based on the locations at which a read pair aligns to the reference. The read mapping information is parsed from the BAM files.
Insert Length S.D. | Standard deviation of the sequenced fragment length.
Duplicates (% Reads) | Percentage of paired reads that have duplicates, from a subsampled set of 4 million reads or from the total number of reads when there are fewer than 4 million reads.
Alignment Summary
Table 11 Alignment Quality Table
Statistic | Definition
Total Aligned Reads (% Reads) | Percentage of reads passing filter that aligned to the reference.
Abundant Reads (% Reads) | Percentage of reads that align to abundant transcripts, such as mitochondrial and ribosomal sequences.
Unaligned Reads (% Reads) | Percentage of reads passing filter that are not aligned to the reference.
Reads with spliced alignment (% Aligned Reads) | Percentage of aligned reads with a spliced alignment.
Reads aligned at multiple loci (% Aligned Reads) | Percentage of aligned reads that align to multiple loci.
If a panel is selected, the app calculates the alignment information based on the on-target
genes.
Table 12 Alignment Information Table
Statistic | Definition
Coding | Metrics based on coding bases.
UTR | Metrics based on bases in untranslated regions (UTR).
Intron | Metrics based on bases in introns.
Intergenic or Non-targeted | Metrics based on bases in intergenic or non-targeted regions. The app reports "non-targeted" if a panel is selected.
Coverage Summary
If a panel is selected, the app calculates the coverage information based on the on-target
genes.
Table 13 Coverage Uniformity Information Table
Statistic | Definition
Median CV | The median coefficient of variation of coverage of the 1000 most highly expressed transcripts, as reported by the CollectRnaSeqMetrics utility from Picard tools. Ideal value = 0.
Median 3' | The median uniformity of coverage of the 1000 most highly expressed transcripts at the 3' end, as reported by the CollectRnaSeqMetrics utility from Picard tools. 3' bias is calculated per transcript as the mean coverage of the 3' most 100 bases divided by the mean coverage of the whole transcript.
Median 5' | The median uniformity of coverage of the 1000 most highly expressed transcripts at the 5' end, as reported by the CollectRnaSeqMetrics utility from Picard tools. 5' bias is calculated per transcript as the mean coverage of the 5' most 100 bases divided by the mean coverage of the whole transcript.
Reads aligned to correct strand | Percentage of reads that align to the correct strand, as reported by the CollectRnaSeqMetrics utility from Picard tools.
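The per-transcript quantities behind Table 13 can be sketched as follows. This is an illustration of the definitions given in the table, not the Picard implementation. The use of the population standard deviation for the coefficient of variation is an assumption, and the coverage values are made up.

```python
import statistics


def coverage_metrics(per_base_coverage, end_window=100):
    """Return (CV, 5' bias, 3' bias) for one transcript.

    per_base_coverage lists coverage per base, ordered from the 5' end to the 3' end.
    """
    mean_cov = statistics.fmean(per_base_coverage)
    # Assumption: population standard deviation; the tool may use another estimator.
    cv = statistics.pstdev(per_base_coverage) / mean_cov
    bias_5 = statistics.fmean(per_base_coverage[:end_window]) / mean_cov
    bias_3 = statistics.fmean(per_base_coverage[-end_window:]) / mean_cov
    return cv, bias_5, bias_3


# Made-up 400-base transcript whose coverage rises toward the 3' end.
coverage = [10] * 100 + [20] * 200 + [30] * 100
cv, bias_5, bias_3 = coverage_metrics(coverage)
print(round(cv, 4), bias_5, bias_3)
```

The reported statistics would then be the median of each value (for example with statistics.median) across the 1000 most highly expressed transcripts.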
If a panel is selected, the app calculates the gene-level coverage information based on the
on-target genes.
Table 14 Gene-Level Coverage Table
Coverage | Number of Genes
1X, 10X, 30X, 100X | Number of genes covered at the mean base coverage level or deeper.
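One way to read Table 14 is as a count of genes whose mean base coverage reaches each level. The following sketch shows that interpretation, which is an assumption. The gene names, coverage values, and the function name genes_at_depth are hypothetical.

```python
def genes_at_depth(mean_coverage_by_gene, thresholds=(1, 10, 30, 100)):
    """Count genes whose mean base coverage is at or above each threshold."""
    return {
        f"{t}X": sum(1 for cov in mean_coverage_by_gene.values() if cov >= t)
        for t in thresholds
    }


# Made-up mean coverage per gene, for illustration only.
example = {"GENE_A": 0.4, "GENE_B": 12.0, "GENE_C": 45.5, "GENE_D": 230.0}
counts = genes_at_depth(example)
print(counts)
```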
Variants Summary
Table 15 Variant Calls Table
Statistic | Definition
Homozygous reference | Number of homozygous reference calls.
Heterozygous | Number of heterozygous variant calls.
Homozygous variant | Number of homozygous variant calls.
SNV | Total number of single nucleotide variants (SNVs) detected for the sample.
Indel | The number of indels detected for the sample.
Tn/Tv | The number of transition SNVs that pass the quality filters divided by the number of transversion SNVs that pass the quality filters. Transitions are interchanges of purines (A, G) or of pyrimidines (C, T). Transversions are interchanges of purine and pyrimidine bases (for example, A to T).
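The Tn/Tv definition in Table 15 can be sketched as below, assuming the SNVs have already passed the quality filters and are given as (reference, alternate) base pairs. The list of SNVs is made up.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True when both bases are purines or both are pyrimidines."""
    pair = {ref.upper(), alt.upper()}
    return pair <= PURINES or pair <= PYRIMIDINES


def tn_tv_ratio(snvs):
    """Transitions divided by transversions; NaN when there are no transversions."""
    transitions = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    transversions = len(snvs) - transitions
    return transitions / transversions if transversions else float("nan")


# Made-up SNVs: 4 transitions (A>G, C>T, T>C, G>A) and 2 transversions (A>T, G>C).
snvs = [("A", "G"), ("C", "T"), ("T", "C"), ("A", "T"), ("G", "C"), ("G", "A")]
print(tn_tv_ratio(snvs))  # 4 / 2 = 2.0
```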
Sample Plots
} To save plot as scalable vector graphics (SVG), click Save Plot as SVG.
} To export data from plot as comma-separated values (CSV), click Export Data as CSV.
Table 16 Sample Plots Table
Plot Name | Description
Insert Length Distribution | The diagram summarizes the insert length distribution for paired-end reads. The insert length for this diagram is capped at 600 bp.
Alignment Distribution | If a panel is selected, the app calculates the alignment information based on the on-target genes. The plot shows the percentage for the color-coded genomic regions, which include coding, UTR, intron, and non-targeted or intergenic. The app reports "non-targeted" when a panel is selected.
Transcript Coverage | If a panel is selected, the app calculates the alignment information based on the on-target genes. The plot shows the transcript coverage position as reported by the CollectRnaSeqMetrics utility from Picard tools. A vertical bar shows the relative coverage at the position in each row. The numbers between 0 and 100 represent the normalized position along a transcript.
ERCC Spike-Ins | The plot shows the log RNA FPKM versus log spike-in molar concentration. Each dot corresponds to a gene. The least squares method calculates the fitted red line.
Small Variants Summary
The small variants analysis summary is available when you select a panel.
Table 17 Small Variants Summary Table
Statistic | Definition
Gene | The genes harboring the variant.
Chr | The chromosome name.
Position | The position on the chromosome.
Depth | The read depth at this position.
Ref | The reference allele.
Alt | The alternative allele.
Alt Freq | The observed alternate allele frequency.
Variant Type | The specific variant type. For more information, see http://uswest.ensembl.org/info/genome/variation/predicted_data.html#consequences.
dbSNP | The dbSNP ID.
COSMIC | Catalog of somatic mutations in cancer. Because of SNP filtering, some entries are not available on the COSMIC website. For more information, see http://cancer.sanger.ac.uk/cosmic/analyses.
ClinVar | ClinVar annotation.
Fusion Calls
The fusion calls analysis summary is available when fusion calling is turned on. The
app highlights the on-target genes when a panel is selected.
Table 18 Fusion Calls Table
Statistic | Definition
Gene1 | The gene on the 5' end. It is highlighted when it is on target.
Chr1 | The chromosome of gene 1.
Pos1 | The position of gene 1.
Str1 | The strand of gene 1.
Gene2 | The gene on the 3' end. It is highlighted when it is on target.
Chr2 | The chromosome of gene 2.
Pos2 | The position of gene 2.
Str2 | The strand of gene 2.
Paired Read | The number of read pairs in which one read aligns to the left gene and the other read aligns to the right gene.
Split Read | The number of read pairs in which one of the reads spans the junction.
Technical Assistance
For technical assistance, contact Illumina Technical Support.
Table 19 Illumina General Contact Information
Website | www.illumina.com
Email | [email protected]
Table 20 Illumina Customer Support Telephone Numbers
Region | Contact Number
North America | 1.800.809.4566
Australia | 1.800.775.688
Austria | 0800.296575
Belgium | 0800.81102
China | 400.635.9898
Denmark | 80882346
Finland | 0800.918363
France | 0800.911850
Germany | 0800.180.8994
Hong Kong | 800960230
Ireland | 1.800.812949
Italy | 800.874909
Japan | 0800.111.5011
Netherlands | 0800.0223859
New Zealand | 0800.451.650
Norway | 800.16836
Singapore | 1.800.579.2745
Spain | 900.812168
Sweden | 020790181
Switzerland | 0800.563118
Taiwan | 00806651752
United Kingdom | 0800.917.0041
Other countries | +44.1799.534000
Safety data sheets (SDSs)—Available on the Illumina website at
support.illumina.com/sds.html.
Product documentation—Available for download in PDF from the Illumina website. Go
to support.illumina.com, select a product, then select Documentation & Literature.
Illumina
5200 Illumina Way
San Diego, California 92122 U.S.A.
+1.800.809.ILMN (4566)
+1.858.202.4566 (outside North America)
[email protected]
www.illumina.com