MiSeq Reporter Library QC Workflow Guide (15042316 Rev. C)

MiSeq Reporter

Library QC Workflow

Reference Guide

FOR RESEARCH USE ONLY

Revision History

Introduction

Library QC Workflow Overview

Library QC Summary Tab

Library QC Details Tab

Optional Settings for the Library QC Workflow

Analysis Output Files

Technical Assistance

ILLUMINA PROPRIETARY

Part # 15042316 Rev. C

February 2014

This document and its contents are proprietary to Illumina, Inc. and its affiliates ("Illumina"), and are intended solely for the contractual use of its customer in connection with the use of the product(s) described herein and for no other purpose. This document and its contents shall not be used or distributed for any other purpose and/or otherwise communicated, disclosed, or reproduced in any way whatsoever without the prior written consent of Illumina. Illumina does not convey any license under its patent, trademark, copyright, or common-law rights nor similar rights of any third parties by this document.

The instructions in this document must be strictly and explicitly followed by qualified and properly trained personnel in order to ensure the proper and safe use of the product(s) described herein. All of the contents of this document must be fully read and understood prior to using such product(s).

FAILURE TO COMPLETELY READ AND EXPLICITLY FOLLOW ALL OF THE INSTRUCTIONS CONTAINED HEREIN MAY RESULT IN DAMAGE TO THE PRODUCT(S), INJURY TO PERSONS, INCLUDING TO USERS OR OTHERS, AND DAMAGE TO OTHER PROPERTY.

ILLUMINA DOES NOT ASSUME ANY LIABILITY ARISING OUT OF THE IMPROPER USE OF THE PRODUCT(S) DESCRIBED HEREIN (INCLUDING PARTS THEREOF OR SOFTWARE) OR ANY USE OF SUCH PRODUCT(S) OUTSIDE THE SCOPE OF THE EXPRESS WRITTEN LICENSES OR PERMISSIONS GRANTED BY ILLUMINA IN CONNECTION WITH CUSTOMER'S ACQUISITION OF SUCH PRODUCT(S).

FOR RESEARCH USE ONLY

© 2013-2014 Illumina, Inc. All rights reserved.

Illumina, IlluminaDx, BaseSpace, BeadArray, BeadXpress, cBot, CSPro, DASL, DesignStudio, Eco, GAIIx, Genetic Energy, Genome Analyzer, GenomeStudio, GoldenGate, HiScan, HiSeq, Infinium, iScan, iSelect, MiSeq, MiSeqDx, Nextera, NextSeq, NuPCR, SeqMonitor, Solexa, TruGenome, TruSeq, TruSight, Understand Your Genome, UYG, VeraCode, the pumpkin orange color, and the Genetic Energy streaming bases design are trademarks of Illumina, Inc. in the U.S. and/or other countries. All other names, logos, and other trademarks are the property of their respective owners.

Revision History

Part #      Revision   Date            Description of Change
15042316    C          February 2014   Updated for a change introduced in MiSeq Reporter v2.4: added the alignment method to the description of the BAM file header.
15042316    B          August 2013     Added optional settings section with descriptions of Adapter, FlagPCRDuplicates, and ReverseComplement.
15042316    A          June 2013       Initial release. The information provided within was previously included in the MiSeq Reporter User Guide. With this release, the MiSeq Reporter User Guide contains information about the interface, how to view run results, how to requeue a run, and how to install and configure the software. Information specific to the Library QC workflow is provided in this guide.

Introduction

Intended for evaluating the quality of DNA libraries, the Library QC workflow aligns reads against the reference genome specified in the sample sheet.

In the MiSeq Reporter Analyses tab, a run folder associated with the Library QC workflow is represented with the letter L. For more information about the software interface, see the MiSeq Reporter User Guide (part # 15042295).

This guide describes the analysis steps performed in the Library QC workflow, the types of data that appear on the interface, and the analysis output files generated by the workflow.

Library QC Workflow Overview

The Library QC workflow is intended for evaluating abundance, fragment length, and sample quality of DNA libraries.

The alignment step uses a faster, less sensitive setting with the Burrows-Wheeler Aligner (BWA) that provides an efficient turnaround time. After alignment, MiSeq Reporter calculates diversity and fragment lengths, and generates a sample report named LibraryQC.html that is written to the Alignment folder. The sample report lists the characteristics of each DNA sample in terms of percentage of reads aligned. Data written to the sample report appear in the samples table.

Library QC Summary Tab

The Summary tab for the Library QC workflow includes a low percentages graph, high percentages graph, clusters graph, and mismatch graph.

• Low percentages graph: Shows phasing, prephasing, and mismatches in percentages. Low percentages indicate good run statistics.
• High percentages graph: Shows clusters passing filter, alignment to a reference, and intensities in percentages. High percentages indicate good run statistics.
• Clusters graph: Shows numbers of raw clusters, clusters passing filter, clusters that did not align, clusters not associated with an index, and duplicates.
• Mismatch graph: Shows mismatches per cycle. A mismatch refers to any mismatch between the sequencing read and a reference genome after alignment.

Low Percentages Graph

Y Axis    X Axis         Description
Percent   Phasing 1      The percentage of molecules in a cluster that fall behind the current cycle within Read 1.
          Phasing 2      The percentage of molecules in a cluster that fall behind the current cycle within Read 2.
          Prephasing 1   The percentage of molecules in a cluster that run ahead of the current cycle within Read 1.
          Prephasing 2   The percentage of molecules in a cluster that run ahead of the current cycle within Read 2.
          Mismatch 1     The average percentage of mismatches for Read 1 over all cycles.
          Mismatch 2     The average percentage of mismatches for Read 2 over all cycles.

High Percentages Graph

Y Axis    X Axis           Description
Percent   PF               The percentage of clusters passing filters.
          Align 1          The percentage of clusters that aligned to the reference in Read 1.
          Align 2          The percentage of clusters that aligned to the reference in Read 2.
          I20 / I1 1       The ratio of intensities at cycle 20 to the intensities at cycle 1 for Read 1.
          I20 / I1 2       The ratio of intensities at cycle 20 to the intensities at cycle 1 for Read 2.
          PE Resynthesis   The ratio of first cycle intensities for Read 1 to first cycle intensities for Read 2.
          PE Orientation   The percentage of paired-end alignments with the expected orientation.

Clusters Graph

Y Axis     X Axis      Description
Clusters   Raw         The total number of clusters detected in the run.
           PF          The total number of clusters passing filter in the run.
           Unaligned   The total number of clusters passing filter that did not align to the reference genome, if applicable. Clusters that are unindexed are not included in the unaligned count.
           Unindexed   The total number of clusters passing filter that were not associated with any index sequence in the run.
           Duplicate   The total number of clusters for a paired-end sequencing run that are considered to be PCR duplicates. PCR duplicates are defined as two clusters from a paired-end run where both clusters have the exact same alignment positions for each read.

Mismatch Graph

Y Axis    X Axis   Description
Percent   Cycle    Plots the percentage of mismatches for all clusters in a run by cycle.

Library QC Details Tab

The Details tab for the Library QC workflow includes a samples table, targets table, coverage graph, and Q-score graph.

• Samples table: Summarizes the sequencing results for each sample.
• Targets table: Shows statistics for a particular sample and chromosome.
• Coverage graph: Shows read depth at a given position in the reference.
• Q-score graph: Shows the average quality score. A quality score Q corresponds to an estimated error probability of 10^(-Q/10). For example, a score of Q30 corresponds to an error rate of 1 in 1000, or 0.1%.
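As a minimal sketch of the Q-score relationship described above, the following Python snippet converts between a quality score and its estimated error probability. The function names are illustrative assumptions and are not part of MiSeq Reporter.

```python
import math


def q_to_error_probability(q: float) -> float:
    """Return the estimated error probability for quality score q: 10^(-q/10)."""
    return 10 ** (-q / 10)


def error_probability_to_q(p: float) -> float:
    """Inverse conversion: Q = -10 * log10(p). p must be in (0, 1]."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Print the error probability for a few common quality scores.
    for q in (10, 20, 30, 40):
        print(f"Q{q}: error probability {q_to_error_probability(q):.4%}")
```

Each increase of 10 in the quality score corresponds to a tenfold lower estimated error probability.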

Samples Table

Column                Description
#                     An ordinal identification number in the table.
Sample ID             The sample ID from the sample sheet. Sample ID must always be a unique value.
Sample Name           The sample name from the sample sheet.
Clusters Raw          The number of clusters sequenced for this sample.
%Clusters             The percentage of the total cluster number matching the index for this sample.
%PF                   The percentage of clusters passing filter for this sample.
%Aligned              The percentage of clusters successfully aligned.
%Mismatch             The percentage mismatch to reference averaged over cycles per read (Read 1/Read 2).
Median Len            The median fragment length for the sample.
Min Len               The low percentile of fragment lengths for this sample, corresponding to three standard deviations below the median.
Max Len               The high percentile of fragment lengths for this sample, corresponding to three standard deviations above the median.
Estimated Diversity   An estimate of the total library diversity derived from the observed diversity and the number of apparent PCR duplicates. This calculation is available for paired-end runs unless PCR duplicate flagging was disabled in the sample sheet.
Observed Diversity    Number of distinct aligned positions. Reads with the same aligned positions are assumed to be PCR duplicates. PCR duplicates are defined as sequences with identical Read 1 and Read 2 start sites.
Genome                The name of the reference genome.

Targets Table

Column       Description
#            An ordinal identification number in the table.
Chr          The reference target or chromosome name.
Cluster PF   The number of clusters passing filter for the sample that aligned to the reference genome.
Mismatch     The percentage mismatch to reference averaged over cycles per read (Read 1/Read 2).
No Call      The percentage of bases that could not be called (no-call) for the sample averaged over cycles per read (Read 1/Read 2).
Genome       The name of the reference genome.

Q-Score Graph

Y Axis    X Axis     Description
Q-Score   Position   The average quality score of bases at the given position of the reference.

Coverage Graph

Y Axis     X Axis     Description
Coverage   Position   The green curve is the number of aligned reads that cover each position in the reference. The red curve is the number of aligned reads that have a miscall at this position in the reference. SNPs and other variants show up as spikes in the red curve.
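To illustrate what the coverage curve represents, the following Python sketch counts how many aligned reads cover each reference position. It is not MiSeq Reporter's implementation. It assumes ungapped alignments (such as a 35M match) given as hypothetical (start, length) tuples with 1-based start coordinates.

```python
from typing import Iterable, List, Tuple


def coverage_depth(reads: Iterable[Tuple[int, int]], reference_length: int) -> List[int]:
    """Return the read depth at positions 1..reference_length.

    reads holds (start, length) tuples with 1-based start coordinates.
    Alignments are assumed to be ungapped (an assumption of this sketch).
    """
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (reference_length + 2)
    for start, length in reads:
        if length <= 0 or start < 1 or start > reference_length:
            continue  # skip reads that fall outside the reference
        end = min(start + length - 1, reference_length)
        diff[start] += 1
        diff[end + 1] -= 1

    # A running sum of the difference array gives the depth at each position.
    depth = []
    running = 0
    for position in range(1, reference_length + 1):
        running += diff[position]
        depth.append(running)
    return depth


if __name__ == "__main__":
    print(coverage_depth([(1, 3), (2, 3)], reference_length=5))
```

The difference-array approach keeps the work proportional to the number of reads plus the reference length, rather than to the total number of aligned bases.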

Optional Settings for the Library QC Workflow

Sample sheet settings are optional commands that control various analysis parameters.

Settings are used in the Settings section of the sample sheet and require a setting name and a setting value.

• If you are viewing or editing the sample sheet in Excel, the setting name resides in the first column and the setting value in the second column.
• If you are viewing or editing the sample sheet in a text editor such as Notepad, the setting name is followed by a comma and a setting value. Do not include a space between the comma and the setting value.

Example: Adapter,CTGTCTCTTATACACATCT

The following optional settings are compatible with the Library QC workflow.

Sample Sheet Settings for Analysis

Parameter           Description
Adapter             Specify the 5' portion of the adapter sequence to prevent reporting sequence beyond the sample DNA. Illumina recommends adapter trimming for Nextera libraries and Nextera Mate Pair libraries. To specify two or more adapter sequences, separate the sequences with a plus (+) sign. For example: CTGTCTCTTATACACATCT+AGATGTGTATAAGAGACAG
FlagPCRDuplicates   Settings are 0 or 1. Default is 1, filtering. If set to 1, PCR duplicates are flagged in the BAM files and not used for variant calling. PCR duplicates are defined as two clusters from a paired-end run where both clusters have the exact same alignment positions for each read. Duplicates, including PCR duplicates, are not flagged for single-read runs. (Formerly FilterPCRDuplicates. FilterPCRDuplicates is acceptable for backward compatibility.)
ReverseComplement   Settings are 0 or 1. Default is 1, reads are reverse-complemented. If set to 1, all reads are reverse-complemented as they are written to FASTQ files. This setting is necessary in certain cases, such as processing of mate pair data using BWA, which expects paired-end data. This setting can disrupt per-cycle metrics. Use the default setting of 1 when using the Library QC workflow with Nextera Mate Pair libraries.
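The setting syntax described in this section can be sketched in Python as follows. This is an illustrative reader, not the parser used by MiSeq Reporter. It assumes the lines of the Settings section have already been extracted from the sample sheet, and the function names are hypothetical.

```python
from typing import Dict, Iterable, List

# Defaults stated in this guide for the Library QC workflow.
DEFAULTS = {"FlagPCRDuplicates": "1", "ReverseComplement": "1"}

# Former setting name that is still accepted, mapped to the current name.
ALIASES = {"FilterPCRDuplicates": "FlagPCRDuplicates"}


def parse_settings(lines: Iterable[str]) -> Dict[str, str]:
    """Parse 'name,value' lines into a dictionary, starting from the defaults."""
    settings = dict(DEFAULTS)
    for line in lines:
        # Excel exports can add trailing empty cells; only the first two matter.
        cells = [cell.strip() for cell in line.strip().split(",")]
        if len(cells) < 2 or not cells[0]:
            continue  # not a name,value line (for example, a section header)
        name = ALIASES.get(cells[0], cells[0])
        settings[name] = cells[1]
    return settings


def adapter_sequences(settings: Dict[str, str]) -> List[str]:
    """Split the Adapter value on '+' to get the individual adapter sequences."""
    return [seq for seq in settings.get("Adapter", "").split("+") if seq]


if __name__ == "__main__":
    example = [
        "Adapter,CTGTCTCTTATACACATCT+AGATGTGTATAAGAGACAG",
        "FilterPCRDuplicates,0",
    ]
    parsed = parse_settings(example)
    print(parsed)
    print(adapter_sequences(parsed))
```

Values are kept as strings so that the sketch does not guess at how each setting is interpreted downstream.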

Analysis Output Files

The following analysis output files are generated for the Library QC workflow and provide analysis results for alignment and a sample report.

File Name        Description
*.bam files      Contains aligned reads for a given sample. Located in Data\Intensities\BaseCalls\Alignment.
LibraryQC.html   Lists the characteristics of each sample in terms of percentage of reads aligned. Located in Data\Intensities\BaseCalls\Alignment.

BAM File Format

A BAM file (*.bam) is the compressed binary version of a SAM file that is used to represent aligned sequences. SAM and BAM formats are described in detail on the SAM Tools website: samtools.sourceforge.net.

BAM files are written to the alignment folder in Data\Intensities\BaseCalls\Alignment.

BAM files use the file naming format of SampleName_S#.bam, where # is the sample number determined by the order that samples are listed in the sample sheet.
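The naming format and folder location can be sketched in Python as follows. The helper names are hypothetical, and the sketch assumes that sample numbering starts at 1 for the first sample listed in the sample sheet, which this guide does not state explicitly.

```python
from pathlib import PureWindowsPath
from typing import List

# Alignment folder relative to the run folder, as given in this guide.
ALIGNMENT_SUBFOLDER = ("Data", "Intensities", "BaseCalls", "Alignment")


def bam_file_names(sample_names: List[str]) -> List[str]:
    """Build SampleName_S#.bam names from the sample sheet order.

    Assumption: the first listed sample is numbered 1.
    """
    return [
        f"{name}_S{number}.bam"
        for number, name in enumerate(sample_names, start=1)
    ]


def bam_path(run_folder: str, bam_name: str) -> PureWindowsPath:
    """Return the expected location of a BAM file inside a run folder."""
    return PureWindowsPath(run_folder, *ALIGNMENT_SUBFOLDER, bam_name)


if __name__ == "__main__":
    for bam_name in bam_file_names(["SampleA", "SampleB"]):
        print(bam_path(r"D:\Run1", bam_name))
```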

BAM files contain a header section and an alignments section:

• Header: Contains information about the entire file, such as sample name, sample length, and alignment method. Alignments in the alignments section are associated with specific information in the header section. Alignment methods include banded Smith-Waterman, Burrows-Wheeler Aligner (BWA), and Bowtie. The term Isis indicates that an Illumina alignment method is in use, which is the banded Smith-Waterman method.
• Alignments: Contains read name, read sequence, read quality, and custom tags.

GA23_40:8:1:10271:11781 64 chr22 17552189 8 35M * 0 0 TACAGACATCCACCACCACACCCAGCTAATTTTTG IIIII>FA?C::B=:GGGB>GGGEGIIIHI3EEE# BC:Z:ATCACG XD:Z:55 SM:I:8

The alignment record begins with the read name GA23_40:8:1:10271:11781 and includes the chromosome and start coordinate chr22 17552189, the alignment quality 8, and the match descriptor 35M. The fields * 0 0 that follow the match descriptor hold the mate reference name, mate position, and template length, which are not set in this example.
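As a hedged illustration of how the fields of the example alignment record are laid out, the following Python sketch splits the record into named fields. It is not part of MiSeq Reporter, and the class and function names are assumptions. The example is shown with spaces in this guide, so the sketch splits on any whitespace, whereas an actual SAM file separates fields with tab characters.

```python
from typing import Dict, NamedTuple, Union

# Example record from this guide, joined into a single line.
EXAMPLE = (
    "GA23_40:8:1:10271:11781 64 chr22 17552189 8 35M * 0 0 "
    "TACAGACATCCACCACCACACCCAGCTAATTTTTG "
    "IIIII>FA?C::B=:GGGB>GGGEGIIIHI3EEE# "
    "BC:Z:ATCACG XD:Z:55 SM:I:8"
)


class SamRecord(NamedTuple):
    read_name: str
    flag: int
    chromosome: str
    position: int
    mapping_quality: int
    cigar: str
    mate_chromosome: str
    mate_position: int
    template_length: int
    sequence: str
    quality: str
    tags: Dict[str, Union[int, str]]


def parse_sam_record(line: str) -> SamRecord:
    """Split one alignment record into its 11 mandatory fields plus tags."""
    fields = line.split()
    if len(fields) < 11:
        raise ValueError("an alignment record has at least 11 fields")
    tags: Dict[str, Union[int, str]] = {}
    for item in fields[11:]:
        tag, type_code, value = item.split(":", 2)
        # The example writes the integer type as "I"; accept either case.
        tags[tag] = int(value) if type_code.lower() == "i" else value
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tags,
    )


if __name__ == "__main__":
    record = parse_sam_record(EXAMPLE)
    print(record.chromosome, record.position, record.mapping_quality, record.cigar)
    # Flag bit 0x40 marks the first read of a pair in the SAM format.
    print("first in pair:", bool(record.flag & 0x40))
```

For real BAM files, an established library or the SAM Tools utilities are a better choice than hand-written parsing; the sketch only shows the field order.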

BAM files are suitable for viewing with an external viewer such as IGV or the UCSC Genome Browser.

BAM index files (*.bam.bai) provide an index of the corresponding BAM file.

Library QC Analysis File

The sample report, LibraryQC.html, lists cluster counts, cluster quality, alignment information, fragment length, and diversity for each sample. For a description of each column in the sample report, see the Samples Table in the Library QC Details Tab section.

Figure 1 LibraryQC.html Analysis File

Data in the sample report is generated by calculating diversity and then computing fragment length.

• Diversity calculation: Considers all clusters that are proper pairs, meaning both mates map to the same chromosome, and then computes the number of clusters. The number of distinct alignment positions is reported as the observed diversity. Clusters with the same alignment positions are assumed to be PCR duplicates. From this information, the estimated diversity is calculated. This calculation is available for paired-end sequencing runs, unless PCR duplicate flagging is disabled in the sample sheet.
• Fragment length: Lengths are computed for clusters where both reads successfully aligned to the same chromosome. This workflow reports lengths in median (50th percentile), minimum (0.15th percentile, corresponding to three standard deviations below the mean for a normal distribution), and maximum (99.85th percentile). Fragment lengths of ≥ 10000 bases are discarded as possible chimeras.
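The two calculations above can be sketched in Python under stated assumptions. This is not the MiSeq Reporter implementation: the Pair type and function names are hypothetical, the percentile uses linear interpolation because the guide does not specify a method, and the estimated diversity is left out because the guide does not give its formula.

```python
from typing import Dict, List, NamedTuple, Optional, Sequence

CHIMERA_LENGTH = 10000  # fragment lengths of >= 10000 bases are discarded


class Pair(NamedTuple):
    chrom1: str           # chromosome of Read 1
    start1: int           # alignment start of Read 1
    chrom2: str           # chromosome of Read 2
    start2: int           # alignment start of Read 2
    fragment_length: int


def percentile(sorted_values: Sequence[float], pct: float) -> float:
    """Percentile by linear interpolation (an assumption of this sketch)."""
    if not sorted_values:
        raise ValueError("no values")
    rank = (pct / 100) * (len(sorted_values) - 1)
    low = int(rank)
    high = min(low + 1, len(sorted_values) - 1)
    fraction = rank - low
    return sorted_values[low] + (sorted_values[high] - sorted_values[low]) * fraction


def library_qc_metrics(pairs: List[Pair]) -> Dict[str, Optional[float]]:
    # Proper pairs: both mates map to the same chromosome.
    proper = [p for p in pairs if p.chrom1 == p.chrom2]
    # Observed diversity: number of distinct alignment positions.
    distinct = {(p.chrom1, p.start1, p.start2) for p in proper}
    # Fragment lengths, with possible chimeras discarded.
    lengths = sorted(
        p.fragment_length for p in proper if p.fragment_length < CHIMERA_LENGTH
    )
    return {
        "observed_diversity": len(distinct),
        "apparent_duplicates": len(proper) - len(distinct),
        "median_len": percentile(lengths, 50) if lengths else None,
        "min_len": percentile(lengths, 0.15) if lengths else None,
        "max_len": percentile(lengths, 99.85) if lengths else None,
    }


if __name__ == "__main__":
    example = [
        Pair("chr1", 100, "chr1", 300, 250),
        Pair("chr1", 100, "chr1", 300, 250),      # same positions: apparent duplicate
        Pair("chr1", 500, "chr1", 800, 350),
        Pair("chr1", 10, "chr2", 20, 0),          # not a proper pair
        Pair("chr1", 1000, "chr1", 20000, 19050), # discarded from length statistics
    ]
    print(library_qc_metrics(example))
```

Percentiles are used instead of the raw minimum and maximum so that a few extreme fragments do not dominate the reported range.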

Supplementary Output Files

The following output files provide supplementary information, or summarize run results and analysis errors. Although these files are not required for assessing analysis results, they can be used for troubleshooting purposes.

File Name                                 Description
AdapterTrimming.txt                       Lists the number of trimmed bases and percentage of bases for each tile. This file is present only if adapter trimming was specified for the run. Located in Data\Intensities\BaseCalls\Alignment.
AnalysisLog.txt                           Processing log that describes every step that occurred during analysis of the current run folder. This file does not contain error messages. Located in the root level of the run folder.
AnalysisError.txt                         Processing log that lists any errors that occurred during analysis. This file is present only if errors occurred. Located in the root level of the run folder.
CompletedJobInfo.xml                      Written after analysis is complete, contains information about the run, such as date, flow cell ID, software version, and other parameters. Located in the root level of the run folder.
DemultiplexSummaryF1L1.txt                Reports demultiplexing results in a table with one row per tile and one column per sample. Located in Data\Intensities\BaseCalls\Alignment.
ErrorsAndNoCallsByLaneTileReadCycle.csv   A comma-separated values file that contains the percentage of errors and no-calls for each tile, read, and cycle. Located in Data\Intensities\BaseCalls\Alignment.
Mismatch.htm                              Contains histograms of mismatches per cycle and no-calls per cycle for each tile. Located in Data\Intensities\BaseCalls\Alignment.
ResequencingRunStatistics.xml             Contains summary statistics specific to the run. Located in the root level of the run folder.
Summary.xml                               Contains a summary of mismatch rates and other base calling results. Located in Data\Intensities\BaseCalls\Alignment.
Summary.htm                               Contains a summary web page generated from Summary.xml. Located in Data\Intensities\BaseCalls\Alignment.

Technical Assistance

For technical assistance, contact Illumina Technical Support.

Table 1 Illumina General Contact Information

Illumina Website   www.illumina.com
Email              [email protected]

Table 2 Illumina Customer Support Telephone Numbers

Region            Contact Number
North America     1.800.809.4566
Austria           0800.296575
Belgium           0800.81102
Denmark           80882346
Finland           0800.918363
France            0800.911850
Germany           0800.180.8994
Ireland           1.800.812949
Italy             800.874909
Netherlands       0800.0223859
Norway            800.16836
Spain             900.812168
Sweden            020790181
Switzerland       0800.563118
United Kingdom    0800.917.0041
Other countries   +44.1799.534000

Safety Data Sheets

Safety data sheets (SDSs) are available on the Illumina website at www.illumina.com/msds.

Product Documentation

Product documentation in PDF is available for download from the Illumina website. Go to www.illumina.com/support, select a product, then click Documentation & Literature.

MiSeq Reporter Library QC Workflow Reference Guide

Illumina

San Diego, California 92122 U.S.A.

+1.800.809.ILMN (4566)

+1.858.202.4566 (outside North America)

[email protected]

www.illumina.com
