RNA-Seq Alignment BaseSpace App Guide v1.0 (1000000006111 v00)


RNA-Seq Alignment v1.0

BaseSpace App Guide

For Research Use Only. Not for use in diagnostic procedures.

Introduction

Workflow

Workflow Diagram

Set Analysis Parameters

Analysis Methods

Analysis Output

Technical Assistance


ILLUMINA PROPRIETARY

1000000006111 v00

February 2016

This document and its contents are proprietary to Illumina, Inc. and its affiliates ("Illumina"), and are intended solely for the contractual use of its customer in connection with the use of the product(s) described herein and for no other purpose. This document and its contents shall not be used or distributed for any other purpose and/or otherwise communicated, disclosed, or reproduced in any way whatsoever without the prior written consent of Illumina. Illumina does not convey any license under its patent, trademark, copyright, or common-law rights nor similar rights of any third parties by this document.

The instructions in this document must be strictly and explicitly followed by qualified and properly trained personnel in order to ensure the proper and safe use of the product(s) described herein. All of the contents of this document must be fully read and understood prior to using such product(s).

FAILURE TO COMPLETELY READ AND EXPLICITLY FOLLOW ALL OF THE INSTRUCTIONS CONTAINED HEREIN MAY RESULT IN DAMAGE TO THE PRODUCT(S), INJURY TO PERSONS, INCLUDING TO USERS OR OTHERS, AND DAMAGE TO OTHER PROPERTY.

ILLUMINA DOES NOT ASSUME ANY LIABILITY ARISING OUT OF THE IMPROPER USE OF THE PRODUCT(S) DESCRIBED HEREIN (INCLUDING PARTS THEREOF OR SOFTWARE).

© 2016 Illumina, Inc. All rights reserved.

Illumina, 24sure, BaseSpace, BeadArray, BlueFish, BlueFuse, BlueGnome, cBot, CSPro, CytoChip, DesignStudio, Epicentre, ForenSeq, Genetic Energy, GenomeStudio, GoldenGate, HiScan, HiSeq, HiSeq X, Infinium, iScan, iSelect, MiniSeq, MiSeq, MiSeqDx, MiSeq FGx, NeoPrep, NextBio, Nextera, NextSeq, Powered by Illumina, SureMDA, TruGenome, TruSeq, TruSight, Understand Your Genome, UYG, VeraCode, verifi, VeriSeq, the pumpkin orange color, and the streaming bases design are trademarks of Illumina, Inc. and/or its affiliate(s) in the U.S. and/or other countries. All other names, logos, and other trademarks are the property of their respective owners.

Introduction

The BaseSpace® RNA-Seq Alignment v1.0 App supports RNA sequencing read alignment, variant calling, gene fusion detection, and novel transcript assembly with Cufflinks. The app can run STAR and tools from the Tuxedo Suite (Bowtie, TopHat, Cufflinks) to produce aligned reads, variant calls, and FPKM abundance estimates of reference genes and transcripts. The app can also perform fusion calling using the Manta-fusion or TopHat-fusion tools.

The Cufflinks Assembly & DE v2.0 App uses the analysis results from the RNA-Seq Alignment App to perform novel transcript merging and differential expression.

Compatible Libraries

See the BaseSpace support page for a list of library types that are compatible with the RNA-Seq Alignment App.

Workflow Requirements

• Supports 100,000–400,000,000 reads per sample.
• Supports 40 billion reads across all samples in a single analysis.
• Supports 35–500 bp read lengths.
• Requires paired-end reads for fusion detection.
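The per-sample requirements above can be checked before launching an analysis. The following is a minimal sketch only; the function name and arguments are hypothetical and are not part of the app or of any BaseSpace API.

```python
# Hypothetical pre-flight check based on the Workflow Requirements list.
MIN_READS_PER_SAMPLE = 100_000
MAX_READS_PER_SAMPLE = 400_000_000
MIN_READ_LENGTH = 35    # bp
MAX_READ_LENGTH = 500   # bp


def check_sample(read_count, read_length, paired_end, call_fusions=False):
    """Return a list of requirement violations (empty if the sample qualifies)."""
    problems = []
    if not MIN_READS_PER_SAMPLE <= read_count <= MAX_READS_PER_SAMPLE:
        problems.append("read count outside 100,000-400,000,000")
    if not MIN_READ_LENGTH <= read_length <= MAX_READ_LENGTH:
        problems.append("read length outside 35-500 bp")
    if call_fusions and not paired_end:
        problems.append("fusion detection requires paired-end reads")
    return problems
```

An empty list means that the sample meets the listed per-sample limits. The limit on total reads across all samples is not covered by this sketch.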

Versions

The following components are used in the RNA-Seq Alignment App.

Software | Version
Isis (Analysis Software) | 2.6.25.11
TopHat (Aligner) | 2.1.0
STAR (Aligner) | 2.5.0a
Bowtie (Aligner) | 0.12.9
Bowtie2 (Aligner) | 2.2.6
Isaac Variant Caller | 2.3.13-31-g3c98c29
IONA (Annotation Service) | 1.0.10.37
BEDTools | 2.17.0
Cufflinks | 2.2.1
BLAST | 2.2.26+

Reference Genomes

The following reference genomes are available for alignment:

• Homo sapiens UCSC hg19 (RefSeq & Gencode gene annotations)
• Homo sapiens UCSC hg38 (RefSeq & Gencode gene annotations)
   The human reference genome is PAR-Masked, which means that the Y chromosome sequence has the Pseudo Autosomal Regions (PAR) masked (set to N) to avoid mismapping of reads in the duplicate regions of sex chromosomes.
• Arabidopsis thaliana Ensembl TAIR10 (Ensembl gene annotation)
• Bos taurus UCSC bosTau6 (RefSeq gene annotation)
• Caenorhabditis elegans UCSC ce10 (RefSeq gene annotation)
• Danio rerio UCSC danRer7 (RefSeq gene annotation)
• Drosophila melanogaster UCSC dm3 (RefSeq gene annotation)
• Gallus gallus UCSC galGal4 (RefSeq gene annotation)
• Mus musculus UCSC mm9 (RefSeq gene annotation)
• Mus musculus UCSC mm10 (RefSeq gene annotation)
• Oryza sativa japonica Ensembl IRGSP-1.0 (Ensembl gene annotation)
• Rattus norvegicus UCSC rn5 (RefSeq gene annotation)
• Saccharomyces cerevisiae Ensembl R64-1-1 (Ensembl gene annotation)
• Sus scrofa UCSC susScr3 (RefSeq gene annotation)
• Zea mays Ensembl AGPv3 (Ensembl gene annotation)


Workflow

• Filtering—Bowtie filters the input reads against abundant sequences, such as mitochondrial or ribosomal sequences, as defined by iGenomes at support.illumina.com/sequencing/sequencing_software/igenome.html. Only sequences that do not align against abundant sequences are passed through to the next phase of the analysis. Bowtie filters read pairs when at least 1 read aligns to an abundant sequence. Also, Bowtie trims 2 bases from the 5' end of the read because of a high mismatch rate from these 2 bases in RNA-Seq libraries. See the Bowtie section.
• Alignment—The STAR or TopHat2 aligner performs a spliced alignment of the filtered reads against the genome. Based on the user-specified genome, STAR or TopHat aligns reads against known transcripts and splice junctions. See the STAR section.
• Alignment to ERCC—If selected, STAR aligns all reads to the ERCC RNA spike-in sequences, independent of alignment to the transcriptome. The aligner counts reads that align to each spike-in sequence, calculates FPKMs, and computes the correlation between FPKMs and the expected spike-in concentrations.
• Fusion Calling—If selected, the STAR aligner supports Manta-fusion and the TopHat aligner supports TopHat-fusion. With TopHat-fusion, TopHat2 first detects fused alignments, and then a post-alignment analysis script identifies candidate fusion genes from the fused alignments. See the Manta section.
• Variant Calling—The Isaac Variant Caller performs variant calling, which produces gVCF output files. For stranded library preps, the strand bias filter is disabled. Also, the Isaac Variant Caller uses the -bsnp-diploid-het-bias parameter to expand the range for the heterozygous variant call, in order to account for allele-specific expression. The Isaac tool uses an RNA-specific random-forest-based variant scoring model, which was built using Platinum Genomes data as a reference. See the Isaac Variant Caller section.
• Quantification—Cufflinks quantifies reference genes and transcripts. RnaReadCounter counts the number of aligned reads matching each annotated gene. See the Cufflinks and RnaReadCounter sections.
• Novel Transcript Assembly—If selected, transcripts are assembled and quantified independently for each sample.
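The filtering rules in the first workflow step can be sketched as follows. This is an illustration only: the function names are hypothetical, the alignment to abundant sequences is represented by precomputed booleans rather than by an actual Bowtie run, and it is an assumption that the trimmed bases are removed from both the sequence and the quality string.

```python
def trim_five_prime(sequence, qualities, n_bases=2):
    """Drop the first n_bases from the 5' end of a read and its quality string."""
    return sequence[n_bases:], qualities[n_bases:]


def keep_pair(read1_hits_abundant, read2_hits_abundant):
    """A read pair is filtered when at least 1 read aligns to an abundant sequence."""
    return not (read1_hits_abundant or read2_hits_abundant)
```

For example, a pair in which only the second read aligns to a ribosomal sequence is removed as a whole, so neither read reaches the alignment step.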


Workflow Diagram

Figure 1 RNA-Seq Alignment Workflow


Set Analysis Parameters

1. Navigate to BaseSpace, and then click the Apps tab.

2. In Categories, click RNA-Seq, and then click RNA-Seq Alignment.

3. From the drop-down list, select version 1.0.0, and then click Launch to open the app.

4. In the App Session Name field, enter the analysis name.
   By default, the analysis name includes the app name, followed by the date and time that the analysis session starts.

5. From the Save Results To field, select the project that stores the app results.

6. From the Samples field, browse to the sample you want to analyze and select the checkbox.
   You can analyze multiple samples. Select the checkbox to identify samples prepared with the TruSeq Stranded Total RNA and TruSeq Stranded mRNA library prep kits for the first strand and the ScriptSeq v2.0 RNA-Seq library prep kit for the second strand.

7. From the Reference Genome field, select the reference genome to be used for alignment. The default is Homo sapiens (PAR-masked)/hg19 (RefSeq).

8. From the Panel field, select from the following:
   • None (default)
   • TruSight RNA Pan-Cancer
   The TruSight RNA Pan-Cancer library prep kit only supports the Human, UCSC hg19 (RefSeq & Gencode) reference genome.

9. From the Aligner field, select from the following methods:
   • STAR (default)
   • TopHat (Bowtie)
   • TopHat (Bowtie2)

10. [Optional] Select the QC Mode checkbox to analyze only a subset of the read pairs for each sample. If selected, enter the number of read pairs for each sample.

11. [Optional] Select the Novel Transcript Assembly checkbox for Cufflinks to assemble novel transcripts. If selected, select the Adjust Transcript Assembly for Samples Without PolyA Selection checkbox if the samples are prepared without PolyA selection (TruSeq Total RNA kit).

    NOTE
    By default, the Call Fusions checkbox is checked if you selected a panel and an aligner that supports fusion calling. Paired-end reads are required. The TopHat (Bowtie) aligner supports TopHat-fusion. The STAR aligner supports Manta-fusion. The TopHat (Bowtie2) aligner does not support fusion calling. For best results, use the STAR aligner with the TruSight RNA Pan-Cancer panel.

12. From the ERCC Spike-In Controls field, select from the following options:
   • None (default)
   • Mix 1
   • Mix 2

13. [Optional] Select the Trim TruSeq Adapters checkbox.
   This option trims adapter sequences from the FASTQ file. Use this option if adapter trimming was not performed during demultiplexing.

14. [Optional] Select the Set Advanced Options checkbox to enable the advanced options and then specify the values for the appropriate options.

15. Click Continue.
   The RNA-Seq Alignment App begins analysis.
   When analysis is complete, the app updates the status of the session and sends an email to notify you.

Advanced Options

[Optional]

Specify the values for the advanced options.

Table 1 TopHat Options Table

Option | Description
Read Mismatches | Enter a number between 0 and 5. The default is 2. Alignments with more than the number of mismatches are discarded.
Read Gap Length | Enter a number between 0 and 5. The default is 2. Alignments with more than the total length of gaps are discarded.
Read Edit Distance | Enter a number between 0 and 10. The default is 2. Alignments with more than the selected edit distance are discarded.
Mate Inner Distance | Enter a number between 0 and 300. The default is 50. The expected (mean) inner distance between mate pairs. For paired-end runs with fragments selected at 300 bp, where each end is 50 bp, set this value at 200.
Mate Standard Deviation | Enter a number between 1 and 100. The default is 20. The standard deviation of the distribution on inner distances between mate pairs.
Minimum Intron Length | Enter a number between 10 and the maximum intron length. The default is 70. TopHat ignores donor/acceptor pairs closer than the specified number of bases.
Maximum Intron Length | Enter a number between the minimum intron length and 1,000,000. The default is 500,000. When searching for junctions, TopHat ignores donor/acceptor pairs farther than the specified number of bases, except when a pair is supported by a split segment alignment of a long read.
Maximum Insertion Length | Enter a number between 0 and 5. The default is 3.
Maximum Deletion Length | Enter a number between 0 and 5. The default is 3.
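The Mate Inner Distance example in Table 1 (300 bp fragments, 50 bp reads, value 200) follows from subtracting both read lengths from the fragment length. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
def mate_inner_distance(fragment_length, read_length):
    """Inner distance between mates: the fragment minus the two sequenced ends."""
    return fragment_length - 2 * read_length


def in_allowed_range(value, low=0, high=300):
    """Check a value against the 0-300 range that the option accepts."""
    return low <= value <= high
```

With 300 bp fragments and 2 x 50 bp reads, the helper returns 200, which is inside the accepted range.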


Table 2 STAR Options Table

Option | Description
Score Difference to Filter Multimapping Alignments | Enter a number between 1 and 5. The default is 1. When a read aligns to multiple loci, the alignment is reported if its score is in the range of (s - value, s], where "s" is the highest alignment score and "value" is the number that you entered.
Maximum Mismatches | Enter a number between 1 and 21. The default is 10. The output includes alignments that have fewer mismatches than this value.
Maximum Mismatches Over Read Length | Enter a number between 0 and 0.5. The default is 0.3. The output includes alignments that have a ratio of mismatches to mapped length that is less than this value.
Minimum Score Over Read Length | Enter a number between 0.33 and 1. The default is 0.66. The output includes alignments that have a score higher than this value. The score is normalized by read length, which is the sum of the mate lengths for paired-end reads.
Minimum Matches Over Read Length | Enter a number between 0.33 and 1. The default is 0.66. The output includes alignments that have a number of matched bases higher than this value. The number is normalized by read length, which is the sum of the mate lengths for paired-end reads.
Maximum Seed Search Step | Enter a number between 30 and 1000. The default is 50. The seed search starts at position 1 and the step length determines the next start position. The step length cannot be longer than this value.
Minimum Intron Length | Enter a number between 10 and the maximum intron size. The default is 21. A genomic gap is considered an intron when its length is greater than or equal to this value. Otherwise, the gap is considered a deletion.
Maximum Intron Length | Enter 0 or a number between the minimum intron size and 1,000,000. The default is 0. If the value is set to 0, STAR calculates the maximum intron size.
Use Annotation | Select the checkbox to use splice junction information in the annotation. By default, the checkbox is selected.
Two-Pass Mode | Select the STAR 2-pass alignment. The default is Basic.
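The half-open score range (s - value, s] used by the Score Difference to Filter Multimapping Alignments option can be illustrated with a short sketch. The function name is hypothetical and the scores are made-up integers; this is not how STAR is implemented internally.

```python
def reported_alignments(scores, score_difference=1):
    """Keep the alignment scores that fall in (best - score_difference, best]."""
    best = max(scores)
    return [s for s in scores if best - score_difference < s <= best]
```

With scores 98, 97, and 95, the default value of 1 keeps only the alignment scoring 98, because 97 lies on the excluded lower bound. A value of 2 keeps 98 and 97.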

Table 3 Cuffnorm Options Table

Field | Description
Hits Normalization | Compatible—Cuffnorm counts only those fragments compatible with some reference transcript towards the number of mapped fragments used in the FPKM denominator. This option is the default. Total—Cuffnorm counts all fragments, including those not compatible with any reference transcript, towards the number of mapped fragments used in the FPKM denominator.


Table 4 Cufflinks Options Table

Option | Description
Hits Normalization | Compatible—Cufflinks counts only those fragments compatible with some reference transcript towards the number of mapped fragments used in the FPKM denominator. Total—Cufflinks counts all fragments, including the fragments not compatible with any reference transcript, towards the number of mapped fragments used in the FPKM denominator. This option is the default.
Minimum Isoform Fraction | Enter a number between 0.05 and 1. The default is 0.1. After calculating isoform abundance for a gene, Cufflinks filters out transcripts with low abundance. Isoforms that are expressed at low levels often cannot be reliably assembled. The isoforms can be artifacts of incompletely spliced precursors of processed transcripts. This parameter filters out introns that have fewer supporting spliced alignments.
Pre-mRNA Fraction | Enter a number between 0 and 1. The default is 0.15. Some RNA-seq protocols produce a significant number of reads that come from incompletely spliced transcripts. These reads can confound the assembly of fully spliced mRNAs. Cufflinks uses this parameter to filter out alignments that are within the intronic intervals. The minimum depth of coverage in the intronic region covered by the alignment is divided by the number of spliced reads. If the result is lower than the value in this parameter, the intronic alignments are ignored.
Minimum Intron Length | Enter a number between 10 and the maximum intron length. The default is 50.
Maximum Intron Length | Enter a number between the minimum intron length and 600,000. The default is 300,000. Cufflinks does not report transcripts with introns longer than this value and ignores SAM alignments with REF_SKIP CIGAR operations longer than this value.
Minimum Fragments per Transfrag | Enter a number between 5 and 100. The default is 10. Assembled transcript fragments supported by fewer than this value of aligned RNA-seq fragments are not reported.

Table 5 Cuffquant/Cufflinks Options Table

Option | Description
Fragment Bias Correction | Cuffquant/Cufflinks runs a bias detection and correction algorithm, which can improve the accuracy of transcript abundance estimates.
Multi-read Correction | Cuffquant/Cufflinks runs an initial estimation procedure to weight reads mapping to multiple locations in the genome more accurately.
No Effective Length Correction | Cuffquant/Cufflinks disables effective length normalization to transcript FPKM.


Analysis Methods

The RNA-Seq Alignment App uses the following methods to analyze the sequencing data.

• STAR
• Manta
• Tuxedo Suite, which includes Bowtie, Bowtie2, TopHat, and Cufflinks
• Isaac Variant Caller

STAR

Spliced Transcripts Alignment to a Reference (STAR) is a fast RNA-seq read mapper, with support for splice-junction and fusion read detection.

STAR aligns reads by finding the Maximal Mappable Prefix (MMP) hits between reads (or read pairs) and the genome, using a Suffix Array index. Different parts of a read can be mapped to different genomic positions, corresponding to splicing or RNA-fusions. The genome index includes known splice-junctions from annotated gene models, allowing for sensitive detection of spliced reads. STAR performs local alignment, automatically soft clipping ends of reads with high mismatches.

We recommend using STAR because it can quickly align more reads than the other aligners.

For more information, see https://github.com/alexdobin/STAR.

Manta

Manta calls structural variants (SVs) from mapped paired-end sequencing reads. Manta discovers candidate SVs from discordant pair and split-read alignments, followed by local assembly and realignment to refine candidates.

In combination with STAR, the app uses Manta on RNA-seq data to detect gene fusions, which appear like translocations in the RNA alignments. The Manta workflow is followed by RNA-specific filtering and scoring, which is based on the following:

• Read counts across the fusion and alignment qualities.
• Genome-wide realignment of fusion contigs to filter candidates that can be explained by a local alignment elsewhere in the genome.
• Length of coverage around the breakpoints, indicating presence of stable fusion transcripts.

For more information, see https://github.com/Illumina/manta.

Tuxedo Suite

The Tuxedo suite offers a set of tools for analyzing a variety of RNA-Seq data, including short-read mapping, identification of splice junctions, transcript and isoform detection, differential expression, visualizations, and quality control metrics.

Bowtie

Bowtie is an ultrafast, memory-efficient aligner designed to quickly align large sets of short reads to large genomes. Bowtie indexes the genome to keep its memory footprint small: for the human genome, the index is typically about 2.2 GB for single-read alignment or 2.9 GB for paired-end alignment. Multiple processors can be used simultaneously to achieve greater alignment speed.


Bowtie forms the basis for other tools like TopHat, a fast splice junction mapper for RNA-seq reads, and Cufflinks, a tool for transcriptome assembly and isoform quantitation from RNA-seq reads.

For more information, see http://bowtie-bio.sourceforge.net/index.shtml.

Bowtie 2

Bowtie 2 can quickly align large sets of short DNA sequences to large genomes.

You can use Bowtie 2 to align reads from about 50 characters up to hundreds or thousands of characters. For the human genome, the memory footprint is approximately 3.2 GB.

Bowtie 2 forms the basis for other tools like TopHat, a fast splice junction mapper for RNA-seq reads, and Cufflinks, a tool for transcriptome assembly and isoform quantitation from RNA-seq reads.

For more information, see http://bowtie-bio.sourceforge.net/bowtie2/index.shtml.

TopHat

TopHat is a fast splice junction mapper for RNA-seq reads that can only be used with Bowtie or Bowtie2. TopHat uses Bowtie or Bowtie2 to map RNA-seq reads, and then it analyzes the mapping results to identify splice junctions between exons.

For more information, see http://ccb.jhu.edu/software/tophat/index.shtml.

Cufflinks

Cufflinks assembles aligned RNA-Seq reads into transcripts, estimates their abundances, and tests for differential expression and regulation of the transcriptome.

For more information, see cole-trapnell-lab.github.io/cufflinks/.

Isaac Variant Caller

The Isaac variant caller identifies single nucleotide variants (SNVs) and small indels using the following steps:

• Read filtering—Filters reads failing quality checks.
• Indel calling—Identifies a set of possible indel candidates and realigns all reads overlapping the candidates using a multiple sequence aligner.
• SNV calling—Computes the probability of each possible genotype given the aligned read data and a prior distribution of variation in the genome.
• Indel genotypes—Calls indel genotypes and assigns probabilities.
• Variant call output—Generates output in a VCF and a compressed genome variant call (gVCF) file.

For more information, see https://github.com/sequencing/isaac_variant_caller.

Read Filtering

Input reads are filtered under the following conditions:

• Reads that failed base calling quality checks.
• Reads marked as PCR duplicates.
• Paired-end reads not marked as a proper pair.
• Reads with a mapping quality < 20.
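The four filtering conditions above can be combined into a single predicate. The sketch below is illustrative: the argument names are hypothetical and stand in for the corresponding alignment flags and mapping quality of a read, and it is not the variant caller's own code.

```python
def passes_read_filter(passed_qc, is_duplicate, is_paired, is_proper_pair,
                       mapping_quality, min_mapping_quality=20):
    """Return True if a read survives the four read-filtering conditions."""
    if not passed_qc:                      # failed base calling quality checks
        return False
    if is_duplicate:                       # marked as a PCR duplicate
        return False
    if is_paired and not is_proper_pair:   # paired-end read, not a proper pair
        return False
    return mapping_quality >= min_mapping_quality
```

A read with mapping quality 20 passes, because only reads with a mapping quality below 20 are filtered.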


Indel Calling

The variant caller proceeds with candidate indel discovery and generates alternate read alignments based on the candidate indels. As part of the realignment process, the variant caller selects a representative alignment to be used for site genotype calling and depth summarization by the SNV caller.

SNV Calling

The variant caller runs a series of filters on the set of filtered and realigned reads for SNV calling without affecting indel calls. First, any contiguous trailing sequence of N base calls is trimmed from the ends of reads. Using a mismatch density filter, reads having an unexpectedly high number of disagreements with the reference are masked, as follows:

• The variant caller treats each insertion or deletion as a single mismatch.
• Base calls with more than 2 mismatches to the reference sequence within 20 bases of the call are ignored.
• If the call occurs within the first or last 20 bases of a read, the mismatch limit is applied to a 41-base window at the corresponding end of the read.
• The mismatch limit is applied to the entire read when the read length is 41 or shorter.
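The windowing rules of the mismatch density filter can be sketched as follows. This is one interpretation under stated assumptions, not the variant caller's implementation: the input is a hypothetical per-position list of mismatch flags (with each indel already collapsed to a single mismatch), and the window is assumed to include the base call itself.

```python
def mismatch_density_mask(is_mismatch, max_mismatches=2, flank=20):
    """Return one flag per read position: True if the base call is ignored."""
    n = len(is_mismatch)
    window = 2 * flank + 1  # 41 bases: the call plus 20 bases on each side
    masked = []
    for i in range(n):
        if n <= window:
            start, end = 0, n  # short read: the limit applies to the whole read
        else:
            # Centre the window on the call, but pin it to the read ends.
            start = min(max(i - flank, 0), n - window)
            end = start + window
        masked.append(sum(is_mismatch[start:end]) > max_mismatches)
    return masked
```

For a 100-base read with three mismatches clustered near the middle, only the positions whose 41-base window contains all three mismatches are masked. Positions far from the cluster remain usable.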

Indel Genotypes

The variant caller filters all bases marked by the mismatch density filter and any N base calls that remain after the end-trimming step. These filtered base calls are not used for site-genotyping but appear in the filtered base call counts in the variant caller output for each site.

All remaining base calls are used for site-genotyping. The genotyping method heuristically adjusts the joint error probability that is calculated from multiple observations of the same allele on each strand of the genome. This correction accounts for the possibility of error dependencies.

This method treats the highest-quality base call from each allele and strand as an independent observation and leaves the associated base call quality scores unmodified.

Quality scores for subsequent base calls for each allele and strand are then adjusted. This adjustment increases the joint error probability of the given allele above the error expected from independent base call observations.

Variant Call Output

After the SNV and indel genotyping methods are complete, the variant caller applies a final set of heuristic filters to produce the final set of calls in the output.

The output in the genome variant call (gVCF) file captures the genotype at each position and the probability that the consensus call differs from reference. This score is expressed as a Phred-scaled quality score.

RnaReadCounter

The RnaReadCounter, an internal tool, counts the number of aligned reads in an RNA-Seq sample that match each annotated gene. The RnaReadCounter method is similar to "htseq-count" in "union-mode", using a "chromsweep" algorithm.

Read counts are based on the overlapping of both reads in a pair with exons of a single gene and do not consider individual transcripts separately. Reads are not counted if they map to more than 1 genomic position or to a position with overlapping exons from more than one gene.
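The counting rules can be illustrated with a small sketch. RnaReadCounter is an internal tool, so everything here is an assumption made for illustration: the data layout, the function names, the half-open intervals on a single chromosome, and the naive overlap scan, which stands in for the chromsweep algorithm.

```python
def genes_overlapping(interval, exons_by_gene):
    """Return the set of genes with at least one exon overlapping the interval."""
    start, end = interval
    return {
        gene
        for gene, exons in exons_by_gene.items()
        if any(start < exon_end and exon_start < end
               for exon_start, exon_end in exons)
    }


def count_read_pairs(pairs, exons_by_gene):
    """Count pairs whose two reads both overlap exons of exactly one gene.

    Each pair is (number_of_alignments, mate1_interval, mate2_interval).
    """
    counts = {gene: 0 for gene in exons_by_gene}
    for n_alignments, mate1, mate2 in pairs:
        if n_alignments > 1:
            continue  # maps to more than 1 genomic position
        hits1 = genes_overlapping(mate1, exons_by_gene)
        hits2 = genes_overlapping(mate2, exons_by_gene)
        union = hits1 | hits2
        if hits1 and hits2 and len(union) == 1:
            counts[next(iter(union))] += 1
    return counts
```

A pair that touches exons of two different genes is skipped rather than split between them, which mirrors the rule that ambiguous reads are not counted.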


Analysis Output

1. Navigate to the BaseSpace site.

2. To view the results, click the Projects tab, then the project name, and then the analysis.

Figure 2 RNA-Seq Alignment Output Navigation Bar

Use the left navigation bar to access the following analysis output:

• Analysis Info—Information about the analysis session, including log files.
• Inputs—Overview of input settings.
• Output Files—Output files for the samples.
• Analysis Reports
   • Summary—Analysis metrics for the aggregate results.
   • Sample Analysis—Analysis reports for each sample.

Analysis Info

The Analysis Info page displays the analysis settings and execution details.

Row Heading | Definition
Name | Name of the analysis session.
Application | App that generated this analysis.
Date Started | Date and time the analysis session started.
Date Completed | Date and time the analysis session completed.
Duration | Duration of the analysis.
Session Type | Multi-Node or Single-Node
Status | Status of the analysis session. The status shows either Running or Complete and the number of nodes used.


Log Files

File Name | Description
CompletedJobInfo.xml | Contains information about the completed analysis session.
Logging.zip | Contains all detailed log files for each step of the workflow.
SampleSheet.csv | Sample sheet.
SampleSheetUsed.csv | A copy of the sample sheet.
WorkflowError.txt | Contains error messages created when running the workflow.

Output Files

The Output Files page provides access to the output files for each sample analysis.

• BAM Files
• VCF Files
• gVCF Files
• FPKM Files
• Coverage.bedGraph.gz Files
• Coverage.bw Files
• Junctions.bed Files

BAM File Format

A BAM file (*.bam) is the compressed binary version of a SAM file that is used to represent aligned sequences up to 128 Mb. SAM and BAM formats are described in detail at https://samtools.github.io/hts-specs/SAMv1.pdf.

BAM files use the file naming format of SampleName_S#.bam, where # is the sample number determined by the order that samples are listed for the run. In multi-node mode, the S# is set to S1, regardless of the order of the samples.
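When post-processing results, the SampleName_S#.bam convention can be split back into its parts with a regular expression. The sketch below is an assumption about how such a name might be parsed; the helper name is hypothetical, and in multi-node mode every file carries S1, so the number alone does not identify a sample.

```python
import re

BAM_NAME = re.compile(r"^(?P<sample>.+)_S(?P<number>\d+)\.bam$")


def parse_bam_name(file_name):
    """Split 'SampleName_S#.bam' into (sample name, sample number), or None."""
    match = BAM_NAME.match(file_name)
    if match is None:
        return None
    return match.group("sample"), int(match.group("number"))
```

Sample names that themselves contain underscores are handled because the pattern anchors on the final _S# suffix.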

BAM files contain a header section and an alignment section:

• Header—Contains information about the entire file, such as sample name, sample length, and alignment method. Alignments in the alignments section are associated with specific information in the header section.
• Alignments—Contains read name, read sequence, read quality, alignment information, and custom tags. The read name includes the chromosome, start coordinate, alignment quality, and the match descriptor string.

The alignments section includes the following information for each read or read pair:

• RG: Read group, which indicates the number of reads for a specific sample.
• BC: Barcode tag, which indicates the demultiplexed sample ID associated with the read.
• SM: Single-end alignment quality.
• AS: Paired-end alignment quality.
• NM: Edit distance tag, which records the Levenshtein distance between the read and the reference.
• XN: Amplicon name tag, which records the amplicon tile ID associated with the read.

BAM index files (*.bam.bai) provide an index of the corresponding BAM file.

VCF File Format

Variant Call Format (VCF) is a widely used file format developed by the genomics scientific community that contains information about variants found at specific positions in a reference genome.

VCF files use the file naming format SampleName_S#.vcf, where # is the sample number determined by the order that samples are listed for the run.

VCF File Header—Includes the VCF file format version and the variant caller version. The header lists the annotations used in the remainder of the file. If MARS is listed, the Illumina internal annotation algorithm annotated the VCF file. The VCF header includes the reference genome file and BAM file. The last line in the header contains the column headings for the data lines.

VCF File Data Lines—Each data line contains information about a single variant.

VCF File Headings

Heading | Description
CHROM | The chromosome of the reference genome. Chromosomes appear in the same order as the reference FASTA file.
POS | The single-base position of the variant in the reference chromosome. For SNPs, this position is the reference base with the variant; for indels or deletions, this position is the reference base immediately before the variant.
ID | The rs number for the SNP obtained from dbSNP.txt, if applicable. If there are multiple rs numbers at this location, the list is semicolon delimited. If no dbSNP entry exists at this position, a missing value marker ('.') is used.
REF | The reference genotype. For example, a deletion of a single T is represented as reference TT and alternate T. An A to T single nucleotide variant is represented as reference A and alternate T.
ALT | The alleles that differ from the reference read. For example, an insertion of a single T is represented as reference A and alternate AT. An A to T single nucleotide variant is represented as reference A and alternate T.
QUAL | A Phred-scaled quality score assigned by the variant caller. Higher scores indicate higher confidence in the variant and lower probability of errors. For a quality score of Q, the estimated probability of an error is 10^(-Q/10). For example, the set of Q30 calls has a 0.1% error rate. Many variant callers assign quality scores based on their statistical models, which are high in relation to the error rate observed.
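The relationship between a Phred-scaled QUAL score and the estimated error probability can be checked with a few lines of code. This is a minimal sketch of the formula given for QUAL; the function names are illustrative.

```python
import math


def phred_to_error_probability(q):
    """Estimated probability of an error for a Phred-scaled quality score Q."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Inverse conversion: Phred-scaled score for an error probability p."""
    return -10 * math.log10(p)
```

For Q30, the first function gives an error probability of about 0.001, which is the 0.1% error rate quoted for Q30 calls.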


VCF File Annotations

Heading Description

FILTER

If all filters are passed, PASS is written in the filter column.

• LowDP: Applied to sites with depth of coverage below a cutoff.

• LowGQ: The genotyping quality (GQ) is below a cutoff.

• LowQual: The variant quality (QUAL) is below a cutoff.

• LowVariantFreq: The variant frequency is less than the given threshold.

• R8: For an indel, the number of adjacent repeats (1-base or 2-base) in the reference is greater than 8.

• SB: The strand bias is more than the given threshold. Used with the Somatic Variant Caller and GATK.

INFO

Possible entries in the INFO column include:

• AC: Allele count in genotypes for each ALT allele, in the same order as listed.

• AF: Allele Frequency for each ALT allele, in the same order as listed.

• AN: The total number of alleles in called genotypes.

• CD: A flag indicating that the SNP occurs within the coding region of at least 1 RefGene entry.

• DP: The depth (number of base calls aligned to a position and used in variant calling).

• Exon: A comma-separated list of exon regions read from RefGene.

• FC: Functional Consequence.

• GI: A comma-separated list of gene IDs read from RefGene.

• QD: Variant Confidence/Quality by Depth.

• TI: A comma-separated list of transcript IDs read from RefGene.

FORMAT

The format column lists fields separated by colons. For example, GT:GQ. The list of fields provided depends on the variant caller used. Available fields include:

• AD: Entry of the form X,Y, where X is the number of reference calls, and Y is the number of alternate calls.

• DP: Approximate read depth; reads with MQ=255 or with bad mates are filtered.

• GQ: Genotype quality.

• GQX: Genotype quality. GQX is the minimum of the GQ value and the QUAL column. In general, these values are similar; taking the minimum makes GQX the more conservative measure of genotype quality.

• GT: Genotype. 0 corresponds to the reference base, 1 corresponds to the first entry in the ALT column, and so on. The forward slash (/) indicates that no phasing information is available.

• NL: Noise level; an estimate of base calling noise at this position.

• PL: Normalized, Phred-scaled likelihoods for genotypes.

• SB: Strand bias at this position. Larger negative values indicate less bias; values near 0 indicate more bias. Used with the Somatic Variant Caller and GATK.

• VF: Variant frequency; the percentage of reads supporting the alternate allele.

SAMPLE

The sample column gives the values specified in the FORMAT column.
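
The pairing of the FORMAT and SAMPLE columns, and the GQX rule described above, can be sketched as follows. The record values and helper names are hypothetical and serve only as an illustration; the fields actually present depend on the variant caller used.

```python
def parse_sample(format_col: str, sample_col: str) -> dict:
    """Pair the colon-separated FORMAT keys with the matching SAMPLE values."""
    keys = format_col.split(":")
    values = sample_col.split(":")
    return dict(zip(keys, values))


def gqx(gq: float, qual: float) -> float:
    """GQX as described above: the minimum of the GQ value and the QUAL column."""
    return min(gq, qual)


# Hypothetical values from one VCF data line, for illustration only.
qual = 62.5
fields = parse_sample("GT:GQ:AD:DP", "0/1:48:12,9:21")

# AD has the form X,Y: reference calls, then alternate calls.
ref_calls, alt_calls = (int(n) for n in fields["AD"].split(","))

print("Genotype:", fields["GT"])
print("GQX:", gqx(float(fields["GQ"]), qual))
print("Reference calls:", ref_calls, "Alternate calls:", alt_calls)
```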

Genome VCF Files

Genome VCF (gVCF) files are VCF v4.1 files that follow a set of conventions for representing all sites within the genome in a reasonably compact format. The gVCF files include all sites within the region of interest in a single file for each sample.

The gVCF file shows no-calls at positions with low coverage, or where a low-frequency variant (< 3%) occurs often enough (> 1%) that the position cannot be called to the reference. A genotype (GT) tag of ./. indicates a no-call.

For more information, see sites.google.com/site/gvcftools/home/about-gvcf.

FPKM Files

Fragments Per Kilobase of sequence per Million mapped reads (FPKM) normalizes the number of aligned reads by the size of the sequence feature and the total number of mapped reads.

In each output directory, this app creates the following output files:

• genes.fpkm_tracking: Quantifies the expression of genes specified in the GTF annotation file.

• isoforms.fpkm_tracking: Quantifies the expression of transcripts specified in the GTF annotation file.
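
The normalization described above can be illustrated with the textbook FPKM formula. This is a minimal sketch under the assumption that the fragment count, feature length, and mapped total are already known; the values reported in the output files may be estimated differently, and the function name is hypothetical.

```python
def fpkm(fragments: int, feature_length_bp: int, total_mapped: int) -> float:
    """Fragments per kilobase of feature per million mapped fragments."""
    if feature_length_bp <= 0 or total_mapped <= 0:
        raise ValueError("feature length and mapped total must be positive")
    # Dividing by (length / 1e3) and by (total / 1e6) is a factor of 1e9 overall.
    return fragments * 1e9 / (feature_length_bp * total_mapped)


# Hypothetical gene: 500 fragments on a 2000 bp feature, 20 million mapped in total.
print(fpkm(fragments=500, feature_length_bp=2000, total_mapped=20_000_000))
```

Doubling the feature length or the total number of mapped fragments halves the FPKM, which is what makes values comparable across features and samples.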

Analysis Reports

The RNA-Seq Alignment App provides an aggregate summary for all the samples and a summary of statistics per sample.

Summary Analysis Report

The RNA-Seq Alignment App provides an aggregate summary for all the samples.

Table 6 Summary Table

Statistic Definition

Sample ID

The sample ID.

Read Length

The length of reads.

Number of Reads

The total number of reads passing filter for this sample.

% Total Aligned

Percentage of reads passing filter that aligned to the reference, including abundant reads.

% Abundant

Percentage of reads that align to abundant transcripts, such as mitochondrial and ribosomal sequences.

% Unaligned

Percentage of reads that do not align to the reference.

Median CV Coverage Uniformity

The median coefficient of variation of coverage of the 1000 most highly expressed transcripts, as reported by the CollectRnaSeqMetrics utility from Picard tools. Ideal value = 0.

% Stranded

Percentage of reads that are stranded.

Summary Plots

• To save plot as scalable vector graphics (SVG), click Save Plot as SVG.

• To export data from plot as comma-separated values (CSV), click Export Data as CSV.

Table 7 Summary Plots Table

Plot Name Description

Insert Length Distribution

The plot summarizes the insert length distribution for paired-end reads. The 3 vertical lines of the box represent the quartiles and the whiskers represent the 5th and 95th percentiles. The insert length for this box plot is capped at 600 bp.

Alignment Distribution

If a panel is selected, the app calculates the alignment information based on the on-target genes.

The plot shows the percentage for the color-coded genomic regions, which include coding, UTR, intron, and non-targeted or intergenic. The app reports "non-targeted" if a panel is selected.

Transcript Coverage

If a panel is selected, the app calculates the alignment information based on the on-target genes.

The plot shows the transcript coverage position as reported by the CollectRnaSeqMetrics utility from Picard tools. A vertical bar shows the relative coverage at the position in each row. The numbers between 0 and 100 represent the normalized position along a transcript.
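
The five values drawn in the insert length box plot (quartiles plus the 5th and 95th percentiles, after capping at 600 bp) can be approximated as shown below. This is a sketch only: the interpolation method and the point at which the cap is applied are assumptions, and the input lengths are hypothetical.

```python
import statistics

INSERT_CAP_BP = 600  # cap stated in the plot description


def box_plot_summary(insert_lengths):
    """Return the percentiles a box plot of this kind would draw."""
    # Assumption: lengths above the cap are clamped to the cap before summarizing.
    capped = [min(length, INSERT_CAP_BP) for length in insert_lengths]
    # quantiles(n=100) returns 99 cut points; index p - 1 is the p-th percentile.
    cuts = statistics.quantiles(capped, n=100, method="inclusive")
    return {
        "p5": cuts[4],
        "q1": cuts[24],
        "median": cuts[49],
        "q3": cuts[74],
        "p95": cuts[94],
    }


# Hypothetical insert lengths in bp: 100, 108, ..., 900.
lengths = list(range(100, 901, 8))
print(box_plot_summary(lengths))
```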

ERCC Spike-Ins

The ERCC spike-ins analysis summary is available when an ERCC mix is selected.

Table 8 ERCC Spike-Ins Table

Statistic Definition

Total Spike-In Reads (% Reads)

Percentage of reads passing filter that aligned to the ERCC sequences.

Pearson Correlation

Pearson correlation of log RNA FPKM and log spike-in molar concentration.

Spearman Correlation

Spearman correlation of log RNA FPKM and log spike-in molar concentration.
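
The two correlation statistics can be reproduced in outline as follows. This is a minimal sketch using hypothetical spike-in values; the logarithm base is an assumption, although neither correlation depends on it.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical spike-in data: known molar concentration and measured FPKM.
concentration = [0.5, 2.0, 8.0, 32.0, 128.0]
measured_fpkm = [1.1, 3.5, 20.0, 61.0, 300.0]
log_conc = [math.log2(c) for c in concentration]
log_fpkm = [math.log2(f) for f in measured_fpkm]

print("Pearson:", pearson(log_conc, log_fpkm))
print("Spearman:", spearman(log_conc, log_fpkm))
```

Pearson measures how close the log-log relationship is to a straight line, while Spearman only checks that the measured values rise in the same order as the known concentrations.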

Sample Analysis Reports

The RNA-Seq Alignment App provides an overview of statistics per sample in the Analysis Reports section. To download the statistics, click PDF Summary Report.

Primary Analysis Information

Table 9 Primary Analysis Information Table

Statistic Definition

Read Length

Length of reads.

Number of Reads

Total number of reads passing filter for this sample.

Bases (GB)

The total number of bases for this sample.

Q30 Bases (GB)

The total number of bases with a quality score of 30 or higher.

Insert Information

Table 10 Insert Information Table

Statistic Definition

Insert Length Median

Median length of a sequenced fragment. The fragment length is calculated based on the locations at which a read pair aligns to the reference. The read mapping information is parsed from the BAM files.

Insert Length S.D.

Standard deviation of the sequenced fragment length.

Duplicates (% Reads)

Percentage of paired reads that have duplicates, calculated from a subsampled set of 4 million reads, or from all reads when there are fewer than 4 million reads.

Alignment Summary

Table 11 Alignment Quality Table

Statistic Definition

Total Aligned Reads (% Reads)

Percentage of reads passing filter that aligned to the reference.

Abundant Reads (% Reads)

Percentage of reads that align to abundant transcripts, such as mitochondrial and ribosomal sequences.

Unaligned Reads (% Reads)

Percentage of reads passing filter that are not aligned to the reference.

Reads with spliced alignment (% Aligned Reads)

Percentage of aligned reads with a spliced alignment.

Reads aligned at multiple loci (% Aligned Reads)

Percentage of aligned reads that align to multiple loci.

If a panel is selected, the app calculates the alignment information based on the on-target genes.

Table 12 Alignment Information Table

Statistic Definition

Coding

Metrics based on coding bases.

UTR

Metrics based on bases in untranslated regions (UTR).

Intron

Metrics based on bases in introns.

Intergenic or Non-targeted

Metrics based on bases in intergenic or non-targeted regions.

The app reports "non-targeted" if a panel is selected.

Coverage Summary

If a panel is selected, the app calculates the coverage information based on the on-target genes.

Table 13 Coverage Uniformity Information Table

Statistic Definition

Median CV

The median coefficient of variation of coverage of the 1000 most highly expressed transcripts, as reported by the CollectRnaSeqMetrics utility from Picard tools. Ideal value = 0.

Median 3'

The median uniformity of coverage of the 1000 most highly expressed transcripts at the 3' end, as reported by the CollectRnaSeqMetrics utility from Picard tools.

3' bias is calculated per transcript as the mean coverage of the 3' most 100 bases divided by the mean coverage of the whole transcript.

Median 5'

The median uniformity of coverage of the 1000 most highly expressed transcripts at the 5' end, as reported by the CollectRnaSeqMetrics utility from Picard tools.

5' bias is calculated per transcript as the mean coverage of the 5' most 100 bases divided by the mean coverage of the whole transcript.

Reads aligned to correct strand

Percentage of reads that align to the correct strand, as reported by the CollectRnaSeqMetrics utility from Picard tools.
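
The per-transcript quantities behind the Median CV, Median 3', and Median 5' statistics can be sketched as follows. This is an illustration only and does not reproduce the Picard implementation; the choice of population standard deviation and the per-base coverage values are assumptions.

```python
import statistics


def coefficient_of_variation(coverage):
    """Standard deviation of per-base coverage divided by its mean."""
    # Assumption: population standard deviation; a tool may use the sample form.
    return statistics.pstdev(coverage) / statistics.mean(coverage)


def end_bias(coverage, end, window=100):
    """Mean coverage of the `window` bases at one end over the transcript mean.

    `coverage` holds per-base depth in 5'-to-3' transcript orientation.
    """
    bases = coverage[:window] if end == "5" else coverage[-window:]
    return statistics.mean(bases) / statistics.mean(coverage)


# Hypothetical 500-base transcript with coverage rising toward the 3' end.
transcript = [10] * 100 + [20] * 300 + [30] * 100

print("CV:", coefficient_of_variation(transcript))
print("5' bias:", end_bias(transcript, "5"))
print("3' bias:", end_bias(transcript, "3"))

# The reported statistic is the median of the per-transcript values.
per_transcript_cv = [coefficient_of_variation(t) for t in (transcript, [15] * 200)]
print("Median CV:", statistics.median(per_transcript_cv))
```

A bias near 1 at both ends and a CV near 0 indicate even coverage along the transcript.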

If a panel is selected, the app calculates the gene-level coverage information based on the on-target genes.

Table 14 Gene-Level Coverage Table

Coverage Number of Genes

For each coverage level (1X, 10X, 30X, and 100X), the table reports the number of genes with a mean base coverage at that level or deeper.
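
The gene-level coverage counts can be illustrated with a short tally. This sketch assumes that the mean base coverage per gene is already available; the gene names and values are hypothetical.

```python
THRESHOLDS = (1, 10, 30, 100)  # coverage levels listed in the table


def genes_at_or_above(mean_coverage_by_gene, thresholds=THRESHOLDS):
    """Count genes whose mean base coverage meets or exceeds each level."""
    return {
        f"{level}X": sum(1 for cov in mean_coverage_by_gene.values() if cov >= level)
        for level in thresholds
    }


# Hypothetical mean base coverage per gene.
mean_cov = {"GENE_A": 0.4, "GENE_B": 12.0, "GENE_C": 45.5, "GENE_D": 250.0}
print(genes_at_or_above(mean_cov))
```

The counts are cumulative, so a gene counted at 100X is also counted at every lower level.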

Variants Summary

Table 15 Variant Calls Table

Statistic Definition

Homozygous reference

Number of homozygous reference calls.

Heterozygous

Number of heterozygous variant calls.

Homozygous variant

Number of homozygous variant calls.

SNV

Total number of single nucleotide variants (SNVs) detected for the sample.

Indel

The number of indels detected for the sample.

Tn/Tv

The number of transition SNVs that pass the quality filters divided by the number of transversion SNVs that pass the quality filters. Transitions are interchanges of purines (A, G) or of pyrimidines (C, T). Transversions are interchanges of purine and pyrimidine bases (for example, A to T).
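
The transition and transversion definitions above translate directly into a small classifier. The following sketch assumes a list of (reference, alternate) base pairs for SNVs that have already passed the quality filters; the input values are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """True for a purine-to-purine or pyrimidine-to-pyrimidine change."""
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)


def is_transversion(ref: str, alt: str) -> bool:
    """True for a change between a purine and a pyrimidine."""
    ref, alt = ref.upper(), alt.upper()
    return (ref in PURINES and alt in PYRIMIDINES) or (
        ref in PYRIMIDINES and alt in PURINES
    )


def tn_tv_ratio(snvs) -> float:
    """Number of transitions divided by number of transversions."""
    tn = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    tv = sum(1 for ref, alt in snvs if is_transversion(ref, alt))
    if tv == 0:
        raise ValueError("no transversions; the ratio is undefined")
    return tn / tv


# Hypothetical filtered SNVs as (reference base, alternate base).
snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "T"), ("T", "C"), ("C", "G")]
print(tn_tv_ratio(snvs))
```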

Sample Plots

• To save plot as scalable vector graphics (SVG), click Save Plot as SVG.

• To export data from plot as comma-separated values (CSV), click Export Data as CSV.

Table 16 Sample Plots Table

Plot Name Description

Insert Length Distribution

The diagram summarizes the insert length distribution for paired-end reads. The insert length for this diagram is capped at 600 bp.

Alignment Distribution

If a panel is selected, the app calculates the alignment information based on the on-target genes.

The plot shows the percentage for the color-coded genomic regions, which include coding, UTR, intron, and non-targeted or intergenic. The app reports non-targeted when a panel is selected.

Transcript Coverage

If a panel is selected, the app calculates the alignment information based on the on-target genes.

The plot shows the transcript coverage position as reported by the CollectRnaSeqMetrics utility from Picard tools. A vertical bar shows the relative coverage at the position in each row. The numbers between 0 and 100 represent the normalized position along a transcript.

ERCC Spike-Ins

The plot shows the log RNA FPKM versus log spike-in molar concentration. Each dot corresponds to a gene. The least squares method calculates the fitted red line.
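
A least squares line of the kind drawn in the ERCC Spike-Ins plot can be computed as shown below. This is a minimal sketch; placing log concentration on the x-axis and log FPKM on the y-axis is an assumption, and the data points are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least squares fit y = a*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical points: x = log spike-in concentration, y = log RNA FPKM.
log_concentration = [0.0, 1.0, 2.0, 3.0]
log_rna_fpkm = [1.0, 3.0, 5.0, 7.0]

slope, intercept = least_squares_line(log_concentration, log_rna_fpkm)
print(f"fitted line: y = {slope:.3f} * x + {intercept:.3f}")
```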

Small Variants Summary

The small variants analysis summary is available when you select a panel.

Table 17 Small Variants Summary Table

Statistic Definition

Gene

The genes harboring the variant.

Chr

The chromosome name.

Position

The position on the chromosome.

Depth

The read depth at this position.

Ref

The reference allele.

Alt

The alternative allele.

Alt Freq

The observed alternate allele frequency.

Variant Type

The specific variant type.

For more information, see http://uswest.ensembl.org/info/genome/variation/predicted_data.html#consequences.

dbSNP

The dbSNP ID.

COSMIC

Catalog of somatic mutations in cancer.

Because of SNP filtering, some entries are not available on the COSMIC website. For more information, see http://cancer.sanger.ac.uk/cosmic/analyses.

ClinVar

ClinVar annotation.

Fusion Calls

The fusion calls analysis summary is available when fusion calling is turned on. The app highlights the on-target genes when a panel is selected.

Table 18 Fusion Calls Table

Statistic Definition

Gene1

The gene on the 5' end. The gene is highlighted when it is on target.

Chr1

The chromosome of gene 1.

Pos1

The position of gene 1.

Str1

The strand of gene 1.

Gene2

The gene on the 3' end. The gene is highlighted when it is on target.

Chr2

The chromosome of gene 2.

Pos2

The position of gene 2.

Str2

The strand of gene 2.

Paired Read

The number of read pairs in which one read aligns to the left gene and the other read aligns to the right gene.

Split Read

The number of read pairs in which one of the reads spans the junction.

Technical Assistance

For technical assistance, contact Illumina Technical Support.

Table 19 Illumina General Contact Information

Website

www.illumina.com

Email

[email protected]

Table 20 Illumina Customer Support Telephone Numbers

Region Contact Number

North America 1.800.809.4566

Australia 1.800.775.688

Austria 0800.296575

Belgium 0800.81102

China 400.635.9898

Denmark 80882346

Finland 0800.918363

France 0800.911850

Germany 0800.180.8994

Hong Kong 800960230

Ireland 1.800.812949

Italy 800.874909

Japan 0800.111.5011

Netherlands 0800.0223859

New Zealand 0800.451.650

Norway 800.16836

Singapore 1.800.579.2745

Spain 900.812168

Sweden 020790181

Switzerland 0800.563118

Taiwan 00806651752

United Kingdom 0800.917.0041

Other countries +44.1799.534000

Safety data sheets (SDSs): Available on the Illumina website at support.illumina.com/sds.html.

Product documentation: Available for download in PDF from the Illumina website. Go to support.illumina.com, select a product, then select Documentation & Literature.

Illumina

5200 Illumina Way

San Diego, California 92122 U.S.A.

+1.800.809.ILMN (4566)

+1.858.202.4566 (outside North America) [email protected]

www.illumina.com
