HiSeq Analysis Software User Guide - Support

HiSeq Analysis Software v2.0
User Guide
For Research Use Only. Not for use in diagnostic procedures.
Revision History
Introduction
Computing Requirements
Software Dependencies
Install HiSeq Analysis Software v2.0 RPM
Validate HiSeq Analysis Software v2.0
GenerateFASTQ Workflow
Resequencing Analysis Workflow
Tumor-Normal Analysis Workflow
Input Files
Output Files
Methods for Resequencing and Alignment Analysis Workflows
Methods for Tumor-Normal Workflow
Technical Assistance
ILLUMINA PROPRIETARY
Part # 15070536 Rev. A
May 2015
This document and its contents are proprietary to Illumina, Inc. and its affiliates ("Illumina"), and are intended solely for the
contractual use of its customer in connection with the use of the product(s) described herein and for no other purpose. This
document and its contents shall not be used or distributed for any other purpose and/or otherwise communicated, disclosed,
or reproduced in any way whatsoever without the prior written consent of Illumina. Illumina does not convey any license
under its patent, trademark, copyright, or common-law rights nor similar rights of any third parties by this document.
The instructions in this document must be strictly and explicitly followed by qualified and properly trained personnel in order
to ensure the proper and safe use of the product(s) described herein. All of the contents of this document must be fully read
and understood prior to using such product(s).
FAILURE TO COMPLETELY READ AND EXPLICITLY FOLLOW ALL OF THE INSTRUCTIONS CONTAINED HEREIN
MAY RESULT IN DAMAGE TO THE PRODUCT(S), INJURY TO PERSONS, INCLUDING TO USERS OR OTHERS, AND
DAMAGE TO OTHER PROPERTY.
ILLUMINA DOES NOT ASSUME ANY LIABILITY ARISING OUT OF THE IMPROPER USE OF THE PRODUCT(S)
DESCRIBED HEREIN (INCLUDING PARTS THEREOF OR SOFTWARE).
© 2015 Illumina, Inc. All rights reserved.
Illumina, 24sure, BaseSpace, BeadArray, BlueFish, BlueFuse, BlueGnome, cBot, CSPro, CytoChip, DesignStudio,
Epicentre, GAIIx, Genetic Energy, Genome Analyzer, GenomeStudio, GoldenGate, HiScan, HiSeq, HiSeq X, Infinium,
iScan, iSelect, ForenSeq, MiSeq, MiSeqDx, MiSeq FGx, NeoPrep, Nextera, NextBio, NextSeq, Powered by Illumina,
SeqMonitor, SureMDA, TruGenome, TruSeq, TruSight, Understand Your Genome, UYG, VeraCode, verifi, VeriSeq, the
pumpkin orange color, and the streaming bases design are trademarks of Illumina, Inc. and/or its affiliate(s) in the U.S. and/or
other countries. All other names, logos, and other trademarks are the property of their respective owners.
Revision History
Part # | Revision | Date | Description of Change
15070536 | A | May 2015 | Initial release
Introduction
HiSeq Analysis Software v2.0 is a software package for analyzing sequencing data
generated by Illumina HiSeq X sequencing systems. The software leverages a suite of
proven algorithms to detect genomic variants comprehensively and accurately. HiSeq
Analysis Software v2.0 is a complete package that calls a range of variant types, including single
nucleotide variants (SNVs), indels, structural variants (SVs), and copy number variants
(CNVs), for tumor and normal samples. HiSeq Analysis Software v2.0 uses the base
calls and quality scores generated by Real-Time Analysis (RTA) software during the
sequencing run to analyze data rapidly for high-throughput whole-genome sequencing
analysis.
HiSeq Analysis Software v2.0 leverages the Isaac Aligner and Isaac Variant Caller to
provide both aligned and unaligned reads and variants. See Isaac Aligner on page 28 and
Isaac Variant Caller on page 30 for more information.
For structural variants, HiSeq Analysis Software v2.0 uses 2 complementary approaches:
} Read depth analysis by Isaac Copy Number Variant Caller (Canvas). See Isaac Copy
Number Variant Caller on page 36.
} Discordant paired-end analysis by Isaac Structural Variant Caller (Manta). See Isaac
Structural Variant Caller on page 40.
HiSeq Analysis Software v2.0 supports a multistep analysis workflow that runs from
a simple command line. The recommended default settings are specified in the sample
sheet.
This document provides an overview of the HiSeq Analysis Software v2.0 workflow and
data structure, computing requirements and installation guidelines. The aim of this
document is to help you understand the details and use of HiSeq Analysis Software v2.0
for high-throughput whole-genome sequencing analysis.
Computing Requirements
Operating System
HiSeq Analysis Software v2.0 requires the CentOS 6 operating system with the standard
set of included libraries, which are indicated in the RPM installer. HiSeq Analysis
Software v2.0 also requires the R statistical package, which is not specified in the RPM.
For R statistical package installation procedures and a complete list of dependencies, see
Software Dependencies on page 6.
Compute Hardware
HiSeq Analysis Software v2.0 requires specific compute architecture for proper
functionality and performance. The hardware required is available from a range of
suppliers and manufacturers.
For a HiSeq X Ten system, the following architecture is recommended for analysis.
Archival storage is not included in this recommendation.
} Resilient NAS solution serving CIFS and NFS
• A minimum of 100 TB usable storage capacity with performance > 2 GB/s
To retain data after analysis, additional capacity is needed.
• Multiple 10 Gb or 40 Gb network links
} Resilient 10 Gb network
} 26 servers, each with the following:
• Dual 10 Core CPUs @ 2.8 GHz
• 128 GB 1867 MHz memory
• 6 TB usable local storage capacity with performance > 500 MB/s
• 10 Gb Ethernet adapter
Software Dependencies
The following system dependencies are required to operate HiSeq Analysis Software v2.0.
R Statistical Package
Install the R statistical package using the following command:
sudo yum install R
If yum is unable to find the R package, first install the EPEL repository using these
commands:
wget http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm
sudo rpm -ivh epel-release-6-8.noarch.rpm
CentOS v6 Standard Libraries
The following packages are indicated in the RPM. If these packages are not present, they
are automatically installed during software installation.
ConsoleKit, ConsoleKit-libs, GConf2, ImageMagick, ORBit2, OpenEXR-libs, atk, audit-libs, avahi-libs, basesystem, bash, bzip2-libs, ca-certificates, cairo, chkconfig, coreutils,
coreutils-libs, cracklib, cracklib-dicts, cups-libs, curl, cyrus-sasl-lib, db4, db4-utils, dbus,
dbus-glib, dbus-libs, eggdbus, elfutils-libelf, expat, file-libs, filesystem, fontconfig, freetype,
gamin, gd, gdbm, gdk-pixbuf2, ghostscript, ghostscript-fonts, glib2, glibc, glibc-common,
gmp, gnuplot, gnuplot-common, gnutls, grep, groff, gtk2, gzip, hicolor-icon-theme,
ilmbase, info, jasper-libs, keyutils-libs, keyutils-libs-devel, krb5-devel, krb5-libs, lcms-libs,
less, libICE, libIDL, libSM, libX11, libX11-common, libXau, libXcomposite, libXcursor,
libXdamage, libXext, libXfixes, libXfont, libXft, libXi, libXinerama, libXpm, libXrandr,
libXrender, libXt, libacl, libattr, libcap, libcap-ng, libcom_err, libcom_err-devel, libcroco,
libcurl, libffi, libfontenc, libgcc, libgcrypt, libgomp, libgpg-error, libgsf, libidn, libjpeg-turbo,
libpng, librsvg2, libselinux, libselinux-devel, libsepol, libsepol-devel, libssh2, libstdc++,
libtasn1, libthai, libtiff, libtool-ltdl, libuuid, libwmf-lite, libxcb, libxml2, libxslt, lua, make,
ncurses, ncurses-base, ncurses-libs, nspr, nss, nss-softokn, nss-softokn-freebl, nss-sysinit,
nss-tools, nss-util, openldap, openssl, openssl-devel, p11-kit, p11-kit-trust, pam, pango,
pcre, pixman, pkgconfig, polkit, popt, psmisc, python, python-libs, readline, rpm, rpm-libs, sed, setup, sgml-common, shadow-utils, shared-mime-info, sqlite, tzdata, urw-fonts,
xorg-x11-font-utils, xz-libs, zlib, zlib-devel
Install HiSeq Analysis Software v2.0 RPM
NOTE
You need root privileges to perform the installation and to unpack the reference genome.
NOTE
Hard drives containing the HiSeq Analysis Software v2.0 and genome installers use the
NTFS file system.
1 Copy the RPM and tarball installers to your system.
• HiSeqAnalysisSoftwareX-2.5.55.1311.HAS-1.x86_64.rpm (1.1 GB)
• HASv2_2.5.55.13XX_Homo_sapiens_UCSC_hg19.tar.gz (32 GB)
2 Install the base analysis software (2.8 GB) using the RPM package.
• Use the following install command:
sudo yum install HiSeqAnalysisSoftwarev2-2.5.55.1311.HAS-1.x86_64.rpm
• To install to a different location, use this command:
sudo rpm -i --prefix /install/path HiSeqAnalysisSoftwarev2-2.5.55.1311.HAS-1.x86_64.rpm
When using the --prefix option, the package is placed inside the illumina
directory within your custom directory.
3 Install the hg19 reference genome (108 GB). Installation takes approximately 45
minutes. Make sure that the HiSeq Analysis Software v2.0 RPM is installed before
installing the reference genome files.
• Use the following install command:
sudo /opt/illumina/HiSeqAnalysisSoftwareX/unpackIsisReference.py --input-file=/path/to/HASv2_2.5.55.13XX_Homo_sapiens_UCSC_hg19.tar.gz
• To install to a different location, use this command:
sudo /opt/illumina/HiSeqAnalysisSoftwareX/unpackIsisReference.py --input-file=/path/to/HASv2_2.5.55.13XX_Homo_sapiens_UCSC_hg19.tar.gz --genomes-root=/genomes/root/genomes_drive --jobs=16
NOTE
The --genomes-root option specifies the location for the
genome installation. Make sure that the directory exists before executing the
command.
Additional Information
} The RPM installs the prerequisite software to run HiSeq Analysis Software v2.0 (eg,
Java and Mono).
} If there are missing dependencies when installing the software using the
rpm -i --prefix command, use the yum command to install the missing
dependencies, and then retry installation.
} The RPM is unsigned.
} If there are problems with running and executing the program, change the
ownership of the installation directory to a group containing all users.
} To uninstall the RPM, use the following command:
sudo yum remove HiSeqAnalysisSoftwareX.x86_64
} When running HiSeq Analysis Software on a cluster, create a script with the
necessary command lines and schedule the script on a node. See your scheduler
documentation for the correct command.
Make sure that the node is dedicated to the script. When running, HiSeq Analysis
Software uses all the RAM and cores on the node.
Validate HiSeq Analysis Software v2.0
To validate the installation, run the following script. Update the path if the software is
installed to a different location.
/opt/illumina/scratch/InstallValidationData/ValidateInstall
A successful validation prints the following output lines:
Starting Isaac validation run...
Isaac run passed validation
Log file written to
/opt/illumina/scratch/InstallValidationData/ValidateInstall.log
An unsuccessful validation prints the following output line:
Isaac run failed
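If you run the validation from a script, the pass and fail lines above can be checked programmatically. The following is a minimal Python sketch, not part of the software: the function name validation_passed is hypothetical, and it assumes you have already captured the text printed by the validation script.

```python
PASS_LINE = "Isaac run passed validation"
FAIL_LINE = "Isaac run failed"


def validation_passed(output: str) -> bool:
    """Return True only if the pass line is present and the fail line is absent."""
    lines = [line.strip() for line in output.splitlines()]
    return PASS_LINE in lines and FAIL_LINE not in lines


# Example text copied from the successful output shown in this guide.
sample_output = """Starting Isaac validation run...
Isaac run passed validation
Log file written to
/opt/illumina/scratch/InstallValidationData/ValidateInstall.log"""

print(validation_passed(sample_output))
```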
GenerateFASTQ Workflow
Run folders of base call and quality score data in .bcl.gz file format are first processed
with the GenerateFASTQ workflow, which demultiplexes the data and converts it to
FASTQ files organized under a per-sample directory structure.
Create a SampleSheet.csv file for the GenerateFASTQ workflow step. For each step, a new
sample sheet file is automatically generated that specifies the settings to be applied in the
subsequent steps. For more information, see Sample Sheet Settings on page 24.
Run GenerateFASTQ
1 Identify a run folder for the analysis, which can be the original run folder for the
flow cell.
2 Create a sample sheet named SampleSheet.csv in the top level of the run folder. For
more information, see Sample Sheet Settings on page 24.
3 Start the GenerateFASTQ workflow for each flow cell using the following command.
Specify absolute paths. Relative paths are not supported. If your HiSeq Analysis
Software v2.0 executable is not installed in the default location, use your system
location instead of the /opt location shown below.
/opt/illumina/HiSeqAnalysisSoftwareX/latest/HAS GenerateFastq -r /path/to/your/RunFolder -a /path/to/analysis/folder -f /path/to/fastq/folder
If you log out of your terminal session, the HiSeq Analysis Software v2.0 command can
be terminated prematurely. To keep the command running, run it under screen or prefix
it with the nohup command. The log output is captured in the WorkflowError.txt
and WorkflowLog.txt files. Alternatively, redirect standard output and standard error
to a file using > logfile.txt 2>&1, as shown below.
nohup /opt/illumina/HiSeqAnalysisSoftwareX/latest/HAS GenerateFastq -r /path/to/your/RunFolder -a /path/to/analysis/folder -f /path/to/fastq/folder > logfile.txt 2>&1 &
Use these command line options to customize the GenerateFASTQ run.
Option | Description
-a | The path to the root analysis folder. Use the same path for all analyses.
-r | The path to the run folder.
-f | The path to the FASTQ folder. Use the same root directory to store FASTQ files for all analyses.
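Because relative paths are not supported, it can help to check the paths before launching the workflow. The following Python sketch assembles the GenerateFastq command from the three options above and rejects relative paths. The function name build_generate_fastq_command is hypothetical, and the executable path is the default location shown in this guide; the sketch only builds the argument list and does not run the software.

```python
import posixpath
import shlex

# Default install location used in this guide; adjust if installed elsewhere.
HAS_EXECUTABLE = "/opt/illumina/HiSeqAnalysisSoftwareX/latest/HAS"


def build_generate_fastq_command(run_folder, analysis_folder, fastq_folder,
                                 executable=HAS_EXECUTABLE):
    """Return the GenerateFastq argument list, refusing relative paths."""
    options = {"-r": run_folder, "-a": analysis_folder, "-f": fastq_folder}
    for flag, path in options.items():
        if not posixpath.isabs(path):
            raise ValueError(f"{flag} requires an absolute path, got {path!r}")
    command = [executable, "GenerateFastq"]
    for flag, path in options.items():
        command += [flag, path]
    return command


command = build_generate_fastq_command(
    "/path/to/your/RunFolder", "/path/to/analysis/folder", "/path/to/fastq/folder"
)
# Print the command as a single shell-quoted line for review.
print(shlex.join(command))
```

The resulting list can be passed to a process launcher of your choice, or printed and pasted into a scheduler script.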
Top-Off Samples
Topping-off allows you to merge samples to combine data from multiple lanes when
there is insufficient yield for downstream analysis.
To combine data from multiple run folders, use the project name and Sample_ID as a
unique identifier for a sample. If a project name and Sample_ID pair are reused in
multiple GenerateFASTQ calls, the combined data are treated as a single sample in the
Resequencing or Alignment workflows.
Make sure to use the same analysis folder path (-a) for the runs. Do not proceed to the
Resequencing or Alignment workflows until you have performed all the necessary
GenerateFASTQ runs.
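The pairing rule for top-off samples can be illustrated with a short Python sketch. It groups hypothetical GenerateFASTQ runs by their (project name, Sample_ID) pair, which is how the text above describes a sample being identified. The function name, project names, sample IDs, and run folder paths are all made up for illustration, and the software's internal bookkeeping may differ.

```python
from collections import defaultdict


def group_top_off_samples(fastq_runs):
    """Group run folders by the (project, sample_id) pair that identifies a sample.

    fastq_runs is an iterable of (project, sample_id, run_folder) tuples.
    """
    samples = defaultdict(list)
    for project, sample_id, run_folder in fastq_runs:
        samples[(project, sample_id)].append(run_folder)
    return dict(samples)


# Hypothetical runs: NA12878 was sequenced on two flow cells.
runs = [
    ("ProjA", "NA12878", "/runs/FC1"),
    ("ProjA", "NA12878", "/runs/FC2"),
    ("ProjA", "NA12891", "/runs/FC1"),
]
merged = group_top_off_samples(runs)
for (project, sample_id), folders in merged.items():
    print(project, sample_id, folders)
```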
Demultiplexing
If the sample sheet defines multiple samples within the same lane, demultiplexing is
performed to assign clusters to the samples. For runs with index reads, demultiplexing
compares each Index Read sequence to the index sequences specified in the sample
sheet.
Demultiplexing separates data from pooled samples based on short index sequences that
tag samples from different libraries. Index reads are identified using the following steps:
} Samples are numbered starting from 1 based on the order they are listed in the
sample sheet.
} Sample number 0 is reserved for clusters that were not successfully assigned to a
sample.
} Clusters are assigned to a sample when the index sequence matches exactly or there
is up to a single mismatch per Index Read.
NOTE
Illumina indexes are designed so that any index pair differs by ≥ 3 bases, allowing for a
single mismatch in index recognition. Index sets that are not from Illumina can include
pairs of indexes that differ by < 3 bases. In such cases, the software detects the insufficient
difference and modifies the default index recognition (mismatch=1). Instead, the software
performs demultiplexing using only perfect index matches (mismatch=0).
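The assignment rules above can be sketched in Python as follows. This is an illustrative simplification, not the software's implementation: it handles a single Index Read only, the function names are hypothetical, and the index sequences in the usage example are arbitrary. It allows one mismatch when every pair of sample sheet indexes differs by at least 3 bases, and falls back to perfect matches otherwise.

```python
from itertools import combinations


def hamming(a, b):
    """Count positions at which two sequences differ (compared over the shorter length)."""
    return sum(x != y for x, y in zip(a, b))


def allowed_mismatches(sample_indexes):
    """Return 1 if all index pairs differ by >= 3 bases, otherwise 0."""
    distances = (hamming(a, b) for a, b in combinations(sample_indexes, 2))
    return 1 if min(distances, default=3) >= 3 else 0


def assign_sample(index_read, sample_indexes):
    """Return the 1-based sample number, or 0 if the cluster cannot be assigned."""
    limit = allowed_mismatches(sample_indexes)
    for number, index in enumerate(sample_indexes, start=1):
        # Only the first cycles of the Index Read are used if the index is shorter.
        if hamming(index_read[:len(index)], index) <= limit:
            return number
    return 0


indexes = ["ATCACG", "CGATGT", "TTAGGC"]   # every pair differs by >= 3 bases
print(assign_sample("ATCACC", indexes))    # one mismatch against the first index
close_indexes = ["ATCACG", "ATCACC"]       # pair differs by only 1 base
print(assign_sample("ATCACT", close_indexes))
```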
When demultiplexing is complete, the results are summarized in the
DemultiplexSummaryF1L1.txt file:
} In the file name, F1 represents the flow cell number.
} In the file name, L1 represents the lane number.
} The file reports demultiplexing results in a table with one row per tile and one column per
sample, including sample 0.
} The file reports the most commonly occurring sequences for the index reads.
} Reports the most commonly occurring sequences for the index reads.
The demultiplexing summary file is written to the QC folder. For the folder location, see
FASTQ Folder Structure on page 12.
If the index sequence specified in the sample sheet is shorter than the actual Index Read,
then the first cycles of the Index Read are used for demultiplexing. If there are no index
sequences specified for Index or Index2, then the first or second index reads are not used
for demultiplexing.
Make sure that all samples in the same lane have a compatible index scheme. Use the
same number of cycles for Index and Index2.
FASTQ Folder Structure
[User-Specified FASTQ Folder]
[Flow Cell ID]—Contains FASTQ files.
Delete_Bcl.cmdline—Shell script used to delete the BCL files that were
used for the analysis.
samplename_S1_L001_R1_001.fastq.gz–FASTQ file.
SampleSheet.csv–Copied from Run Folder.
QC—Contains undetermined FASTQ files, demultiplexing summary, and
AdapterCounts.txt and AdapterTrimming.txt files.
LogFiles—Contains files used for troubleshooting only.
InterOp–Contains binary files used by Sequencing Analysis
Viewer (SAV).
commands.tsv—A select list of commands that were run during
the analysis.
CompletedJobInfo.xml—XML file containing parameters that were used
for the analysis.
RunInfo.xml—Copied from Run Folder.
runParameters.xml—Copied from Run Folder.
WorkflowError.txt—Main error log file.
WorkflowLog.txt—Main log file.
FASTQ File Generation
HiSeq Analysis Software v2.0 generates intermediate analysis files in the FASTQ format,
which is a text format used to represent sequences and their quality scores. FASTQ files
contain reads for each sample and their quality scores, excluding reads identified as
inline controls and clusters that did not pass filter.
FASTQ file generation includes all tiles by default. If individual .bcl files or the
associated .filter and .loc files are missing or corrupted, make sure that the
ConvertMissingBCLsToNoCalls setting is set to true (value=1) in the sample sheet. This
setting is true by default.
FASTQ File Format
FASTQ is a text-based file format that contains base calls and quality values per read.
Each record contains 4 lines:
} The identifier
} The sequence
} A plus sign (+)
} The Phred quality scores in an ASCII +33 encoded format
The identifier is formatted as @Instrument:RunID:FlowCellID:Lane:Tile:X:Y
ReadNum:FilterFlag:0:SampleNumber as shown in the following example:
@SIM:1:FCX:1:15:6329:1045 1:N:0:2
TCGCACTCAACGCCCTGCATATGACAAGACAGAATC
+
<>;##=><9=AAAAAAAAAA9#:<#<;<<<????#=
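To make the record layout concrete, the following Python sketch parses the example record above and decodes its ASCII +33 quality string into Phred scores. The function name and the dictionary keys are hypothetical, and the third field after the space (always 0 in the format above) is stored without interpretation.

```python
def parse_fastq_record(lines):
    """Parse one 4-line FASTQ record into a dictionary of fields and Phred scores."""
    identifier, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not identifier.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    location, flags = identifier[1:].split(" ")
    instrument, run_id, flow_cell, lane, tile, x, y = location.split(":")
    read_num, filter_flag, reserved, sample_number = flags.split(":")
    return {
        "instrument": instrument,
        "run_id": run_id,
        "flow_cell": flow_cell,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
        "read_num": int(read_num),
        "filter_flag": filter_flag,
        "reserved": reserved,
        "sample_number": int(sample_number),
        "sequence": sequence,
        # ASCII +33 encoding: subtract 33 from each character code.
        "phred": [ord(char) - 33 for char in quality],
    }


record = parse_fastq_record([
    "@SIM:1:FCX:1:15:6329:1045 1:N:0:2",
    "TCGCACTCAACGCCCTGCATATGACAAGACAGAATC",
    "+",
    "<>;##=><9=AAAAAAAAAA9#:<#<;<<<????#=",
])
print(record["sample_number"], record["phred"][:5])
```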
The X and Y coordinates in the read name are transformed relative to the raw pixel
values according to the following calculation:
(int)((PixelX*10) + 1000 +0.5)
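The calculation scales the pixel value by 10, adds an offset of 1000, and rounds to the nearest integer by adding 0.5 before truncating. A minimal Python equivalent is sketched below; the function name is hypothetical, and the pixel values in the usage lines are invented so that they reproduce the X and Y values of the example read name.

```python
def pixel_to_read_coordinate(pixel):
    """Apply (int)((pixel * 10) + 1000 + 0.5); int() truncates like the C cast."""
    return int((pixel * 10) + 1000 + 0.5)


print(pixel_to_read_coordinate(532.9))
print(pixel_to_read_coordinate(4.5))
```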
The instrument name is derived from the <Instrument> tag in the RunInfo.xml file. The
run ID is derived from the Number attribute of the <Run> tag, or from the Id
attribute. The following is an example of the tag, in which the run number is 108:
<Run Id="11-07-23_BoltM11_0108_A-FCA012D_1" Number="108">
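As an illustration, the two values can be read with Python's standard XML parser. The XML below is a reduced, hypothetical RunInfo.xml: only the <Run> tag attributes come from the example above, while the <RunInfo> root element, the nesting, and the instrument name are assumptions made for the sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical, reduced RunInfo.xml content for illustration only.
RUN_INFO_XML = """<RunInfo>
  <Run Id="11-07-23_BoltM11_0108_A-FCA012D_1" Number="108">
    <Instrument>BoltM11</Instrument>
  </Run>
</RunInfo>"""


def read_name_fields(run_info_xml):
    """Return (instrument, run_id), preferring the Number attribute over Id."""
    run = ET.fromstring(run_info_xml).find("Run")
    instrument = run.findtext("Instrument")
    run_id = run.get("Number") or run.get("Id")
    return instrument, run_id


print(read_name_fields(RUN_INFO_XML))
```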
FASTQ File Names
FASTQ files are named with the sample name and the sample number. The sample
number is a numeric assignment based on the order that the sample is listed in the
sample sheet. For example:
Data\Intensities\BaseCalls\samplename_S1_L001_R1_001.fastq.gz
} samplename—The sample name provided in the sample sheet. If a sample name is
not provided, the file name includes the sample ID.
} S1—The sample number based on the order that samples are listed in the sample
sheet starting with 1. In this example, S1 indicates that this sample is the first
sample listed in the sample sheet.
NOTE
Reads that cannot be assigned to any sample are written to a FASTQ file for sample
number 0, and excluded from downstream analysis.
} L001—The lane number.
} R1—The read. In this example, R1 means Read 1. For a paired-end run, a file from
Read 2 includes R2 in the file name.
} 001—The last segment is always 001.
FASTQ files are compressed in the GNU zip format, as indicated by *.gz in the file name.
FASTQ files can be uncompressed using tools such as gzip (command-line) or 7-zip
(GUI).
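The naming convention can be checked with a regular expression. The following Python sketch splits a FASTQ file name into the parts described above; the pattern and function name are illustrative assumptions based only on the example file name in this section.

```python
import re

# samplename_S<sample number>_L<3-digit lane>_R<read>_001.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<sample_name>.+)_S(?P<sample_number>\d+)"
    r"_L(?P<lane>\d{3})_R(?P<read>\d)_001\.fastq\.gz$"
)


def parse_fastq_name(file_name):
    """Split a FASTQ file name into sample name, sample number, lane, and read."""
    match = FASTQ_NAME.match(file_name)
    if match is None:
        raise ValueError(f"unexpected FASTQ file name: {file_name!r}")
    return {
        "sample_name": match["sample_name"],
        "sample_number": int(match["sample_number"]),
        "lane": int(match["lane"]),
        "read": int(match["read"]),
    }


print(parse_fastq_name("samplename_S1_L001_R1_001.fastq.gz"))
```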
Quality Control Files
To perform QC, first use the Sequencing Analysis Viewer (SAV) to view important
quality metrics. If the generated FASTQ files are empty or nearly empty, view the output
files for possible errors.
If the run does not include indexes, view the following files.
File | Description
GenerateFASTQRunStatistics.xml | Contains the number of raw and PF clusters for the whole flow cell and per sample.
WorkflowError.txt, WorkflowLog.txt, BuildFastq*.stdout.txt | Processing logs that contain additional information on the run. The BuildFastq*.stdout.txt file contains information for each lane and tile.
InterOp files in the InterOp folder | Contain the %Q30 scores and additional information for QC.
FastqSummaryF1L*.txt | Contains the number of raw and PF clusters for each lane and tile on a per sample basis.
If the run includes indexes, view the following files in addition to the files listed above.
File | Description
IndexMetricsOut.bin | SAV uses this file with the other files in the InterOp folder.
DemultiplexSummaryF1L1.txt | Contains the percentage of each index sample for each lane and tile. This file also lists the most frequent indexes (unique or combination) found in the lanes. View this file when a sample has too many reads or no reads.
AdapterCounts.txt | Lists the number of reads that contain part of the adapter sequence.
AdapterTrimming.txt | Lists the number of trimmed bases and percentage of bases for each tile. This file is present only if adapter trimming was specified for the run.
Undetermined_S0_L00*_R*_001.gz | Contains the reads with unidentifiable index reads. This file is present only when demultiplexing.
Resequencing Analysis Workflow
The Resequencing analysis workflow uses the Isaac Aligner and Isaac Variant Caller to
compare the DNA sequence in the sample against the human reference genome hg19. It
identifies any small variants (SNPs or indels) and large structural variants (SVs and
CNVs) relative to the reference sequence.
The inputs to the Resequencing workflow are the sample FASTQ files located in the
GenerateFASTQ output folder structure and the SampleSheet.csv file generated by the
GenerateFASTQ step.
The main output files are BAM files (containing the reads after alignment), VCF files
(containing the variant calls), Genome VCF files (describing the calls for all variant and
nonvariant sites in the genome), and SV VCF files (describing the calls for structural
variants and copy number variants in the genome).
Figure 1 Resequencing Analysis Workflow
Run Resequencing Analysis
Run the Resequencing analysis workflow using the following command. If your HiSeq
Analysis Software v2.0 executable is not installed in the default location, use your system
location instead of the /opt location shown below.
/opt/illumina/HiSeqAnalysisSoftwareX/latest/HAS Resequencing -a /path/to/analysis/folder -i NormalSampleId -t /path/for/temp/files/likely/on/node
Use these command line options to customize the Resequencing run.
Option | Description
-p | [Optional] The project the sample belongs to. Adds a directory level named <Project> to the base of the other output directories.
-l | [Optional] The path to the local storage. If specified, the input files for the specified workflow are copied to the local storage of the compute node (local hard drive). The workflow also generates temporary and output files in the local node storage during the analysis. Using this option requires more compute node space than the -t option. Upon completion, the results are moved to the standard location.
--copy-fastqs | [Optional] Copy the FASTQ files to the analysis output folder instead of using a symlink. If this option is not specified, FASTQ files are softlinked from the analysis output folder to the actual FASTQ source directory generated by the GenerateFASTQ workflow. If this option is specified, FASTQ folders in the results folder are not subject to the Delete_FastQ.cmdline script.
-t | The path to the temporary directory used by Isaac.
Resequencing Analysis Folder Structure
[User-Specified Analysis Folder]
[Project Name]—Present if the project name is specified in the sample sheet.
[Sample ID]
Delete_FastQ.cmdline–Shell script used to delete the FASTQ files
that were used for the analysis.
FASTQ_SOURCE—Contains internal information to link FASTQ
source files to the sample. Do not delete or alter.
FASTQ_STATS—FASTQ and demultiplex summary files copied from
the GenerateFASTQ run. Contains the same content as the QC folder in
the original FASTQ directory, but renamed by flow cell.
Results
samplename_S1.bam—BAM file with all the reads aligned against
the hg19 genome.
samplename_S1.bam.bai—BAM index file.
samplename_S1.bam.md5sum—MD5 checksum of the BAM file.
samplename_S1.genome.vcf.gz—gVCF file with all reference
and small variant calls.
samplename_S1.genome.vcf.gz.tbi—Tabix index file for the
gVCF file.
samplename_S1.report.html—HTML report.
samplename_S1.report.pdf—PDF report.
samplename_S1.summary.csv—Comma-separated file with various
analysis stats.
SampleSheet.csv—Sample sheet used for the analysis.
Summary.htm—HTML file containing statistics on the flow cell on a
per tile basis.
Summary.xml—XML version of the summary file.
LogFiles—Contents are used for troubleshooting only.
Fastq—Contains softlinks to the FASTQ files.
commands.tsv—A select list of commands that were run during
the analysis.
CompletedJobInfo.xml—XML file containing parameters used
for the analysis.
RunInfo.xml—Copied from Run Folder.
WorkflowError.txt—Main error log file.
WorkflowLog.txt—Main log file.
Resequencing Analysis Report
The Resequencing workflow outputs a PDF and HTML report that provides an overview
of statistics per sample.
NOTE
When you run the Resequencing workflow, the -l option is recommended as standard
practice if there is available storage on the local node. If the -l option is specified,
there is no need to specify the -t option.
Sample Information
The sample information table provides statistics about the sample and alignment
quality.
Statistic | Definition
Total PF Reads | The total number of reads passing filter.
% Q30 Bases | The percentage of bases with a quality score of 30 or higher.
Total Aligned Reads 1 and 2 | The total number of aligned reads for reads 1 and 2.
% Aligned Reads 1 and 2 | The percentage of aligned passing filter reads for reads 1 and 2.
Total Aligned Bases Reads 1 and 2 | The total number of aligned bases for reads 1 and 2.
% Aligned Bases Reads 1 and 2 | The percentage of aligned bases for reads 1 and 2.
Mismatch Rate Reads 1 and 2 | The average percentage of mismatches of reads 1 and 2 over all cycles.
Coverage Histogram
The coverage histogram shows the number of reference bases plotted against the depth of
coverage.
The report also includes the mean coverage, which is the total number of aligned bases
divided by the genome size. The coverage histogram considers duplicates, but the mean
coverage does not.
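The two quantities can be illustrated with a small Python sketch. The function names and the input numbers are hypothetical (the genome size is a round figure chosen for the arithmetic, not a value from the software), and the sketch does not model duplicate handling.

```python
from collections import Counter


def mean_coverage(total_aligned_bases, genome_size):
    """Mean coverage is the total number of aligned bases divided by the genome size."""
    return total_aligned_bases / genome_size


def coverage_histogram(per_base_depths):
    """Map each depth of coverage to the number of reference bases at that depth."""
    return Counter(per_base_depths)


# Hypothetical totals: 93 Gb of aligned bases over a 3.1 Gb genome.
print(mean_coverage(93_000_000_000, 3_100_000_000))

# Hypothetical depths at five reference positions.
histogram = coverage_histogram([30, 31, 30, 29, 30])
print(sorted(histogram.items()))
```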
Small Variants Summary
The small variants summary table provides metrics about the number of SNVs,
insertions, and deletions.
Statistic | Definition
Total Passing | The total number of variants present in the data set that pass the variant quality filters.
Het/Hom Ratio | The number of heterozygous variants divided by the number of homozygous variants.
Ts/Tv Ratio | Transition rate of SNVs that pass the quality filters divided by the transversion rate of SNVs that pass the quality filters. Transversions are interchanges between purine and pyrimidine bases (eg, A to T).
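Both ratios are simple counts, as the following Python sketch shows. It assumes that the inputs are already restricted to variants that pass the quality filters. The function names and the example variants are hypothetical, genotypes are represented as pairs of allele indexes, and the sketch does not guard against a zero denominator.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """A transition stays within purines or within pyrimidines (ref != alt assumed)."""
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES


def ts_tv_ratio(snvs):
    """Divide the number of transitions by the number of transversions."""
    transitions = sum(is_transition(ref, alt) for ref, alt in snvs)
    transversions = len(snvs) - transitions
    return transitions / transversions


def het_hom_ratio(genotypes):
    """Divide the number of heterozygous calls by the number of homozygous calls."""
    het = sum(1 for first, second in genotypes if first != second)
    hom = len(genotypes) - het
    return het / hom


# Hypothetical passing SNVs as (reference base, alternate base) pairs.
snvs = [("A", "G"), ("C", "T"), ("A", "T"), ("G", "A")]
print(ts_tv_ratio(snvs))

# Hypothetical genotypes as allele index pairs: (0, 1) is het, (1, 1) is hom.
genotypes = [(0, 1), (0, 1), (1, 1)]
print(het_hom_ratio(genotypes))
```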
Structural Variants Summary
The structural variants summary table separates the structural variant output into the
classes of called variants. The table also includes the total number of structural variants
and the overlap with annotated genes. All counts are based on PASS filter variants.
Fragment Length Summary
The fragment length summary table provides statistics on the sequenced fragments.
Statistic | Definition
Fragment Length Median | The median length of the sequenced fragment. The fragment length is based on the locations where a read pair aligns to the reference. The read mapping information is parsed from the BAM files.
Minimum | The minimum length of the sequenced fragment.
Maximum | The maximum length of the sequenced fragment.
Standard Deviation | The standard deviation of the sequenced fragment length.
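For reference, the four statistics can be computed from a list of fragment lengths with Python's standard library. This is a sketch with made-up lengths and a hypothetical function name. It uses the sample standard deviation, which is an assumption; the guide does not state which estimator the software reports.

```python
import statistics


def fragment_length_summary(lengths):
    """Summarize fragment lengths (sample standard deviation is an assumption)."""
    return {
        "median": statistics.median(lengths),
        "minimum": min(lengths),
        "maximum": max(lengths),
        "standard_deviation": statistics.stdev(lengths),
    }


# Hypothetical fragment lengths in bases.
summary = fragment_length_summary([300, 310, 320, 330, 340])
print(summary)
```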
Duplicate Information
The duplicate information table provides the percentage of paired reads that have
duplicates.
Analysis Details
The analysis detail tables provide the settings and software packages used for the
analysis.
Tumor-Normal Analysis Workflow
HiSeq Analysis Software v2.0 includes a Tumor-Normal analysis workflow that
leverages a suite of proven algorithms that are optimized for the complexities of tumor
samples. The software delivers a set of accurate somatic variants when compared with a
matched normal sample.
Following alignment of both the tumor and normal sample, the Tumor-Normal workflow
is used to identify the somatic variants (SNVs, small indels, and structural variants) that
are unique to the tumor sample. When analyzing the tumor sample, use the Alignment
analysis workflow rather than the Resequencing analysis workflow for the alignment
step, as shown below. The Alignment analysis workflow is identical to the Resequencing
analysis workflow, but does not include the variant call analysis. The variant calling
methods used in the Resequencing analysis workflow assume a diploid genotype, so they
are not valid for a tumor sample, which can contain somatic variants.
For optimal results, Illumina recommends a minimum coverage of 30x for the normal
sample and 60x coverage for the tumor sample.
Below is an outline of the Tumor-Normal analysis workflow and generated files. The
Tumor-Normal analysis workflow leverages 3 interconnected somatic variant callers:
Isaac Somatic Variant Caller for SNVs and small somatic variants, Isaac Structural
Variant Caller for large indels and structural variants, and SENECA for somatic copy
number variants.
Figure 2 Tumor-Normal Analysis Workflow
Run Tumor-Normal Analysis
1 Run the Alignment workflow for each tumor sample. A sample can span multiple
flow cells.
/opt/illumina/HiSeqAnalysisSoftwareX/latest/HAS Alignment -a /path/to/analysis/folder -i TumorSampleId -l /path/to/storage/on/node
2 Run the Tumor-Normal workflow for each tumor-normal pair. This example
assumes that Resequencing has already been run on the normal sample.
/opt/illumina/HiSeqAnalysisSoftwareX/latest/HAS TumorNormal -a /path/to/analysis/folder --tumor-sample-id TumorSampleId --normal-sample-id NormalSampleId
Use these command line options to customize the Tumor-Normal run.
Option | Description
-p | [Optional] The project the sample belongs to. Adds a directory level named <Project> to the base of the other output directories.
-l | [Optional] The path to the local storage. If specified, the input files for the specified workflow are copied to the local storage of the compute node (local hard drive). The workflow also generates temporary and output files in the local node storage during the analysis. Using this option requires more compute node space than the -t option. Upon completion, the results are moved to the standard location.
--tumor-sample-id | The sample ID of the tumor sample.
--normal-sample-id | The sample ID of the normal sample.
NOTE
If there is available storage on the local node, the -l option is recommended as standard
practice. If the -l option is specified, there is no need to specify the -t option.
Tumor-Normal Analysis Folder Structure
[User-Specified Analysis Folder]
[Project Name]—Present if the project name is specified in the sample sheet.
[Tumor Sample ID or Normal Sample ID]
[Tumor-Normal]—Present in addition to a Resequencing or Alignment
folder.
normalsamplename_tumorsamplename_G1_P1.report.pdf—
PDF report.
normalsamplename_tumorsamplename_G1_P1.report.json—
JSON report.
normalsamplename_tumorsamplename_G1_P1.somatic.vcf—VCF
file containing the somatic small variants called.
normalsamplename_tumorsamplename_G1_P1.somatic.SV.vcf—
VCF file containing the somatic structural variants called
(excluding CNVs).
normalsamplename_tumorsamplename_G1_P1.somatic.CNV.vcf—
VCF file containing the somatic CNVs called.
SampleSheet.csv—Sample sheet used for the analysis.
LogFiles—Contents are used for troubleshooting only.
SoxLogs_G1_P1—Contains more detailed log files.
commands.tsv—A select list of commands that were run
during the analysis.
CompletedJobInfo.xml—XML file containing parameters
used for the analysis.
runSoxWorkflow_G1_P1_stderr.txt—Main log file.
WorkflowError.txt—Main error log file.
WorkflowLog.txt—Does not contain detailed information.
HiSeq Analysis Software v2.0 User Guide
20
Tumor-Normal Analysis Workflow
Use these command line options to customize the Resequencing run.
Somatic Analysis Report
The Tumor-Normal workflow outputs a PDF report that provides a summary of the
somatic analysis results.
Sample Information
The sample information table provides statistics about the tumor and normal samples.
Statistic                 Definition
Gigabases Passing Filter  The number of gigabases passing filter.
% Bases ≥ Q30             The percentage of bases with a quality score of 30 or higher.
Purity                    The approximate amount of signal from the tumor sample that is derived from the tumor cells.
Ploidy                    The average chromosome content of the cell compared to the haploid state.
Somatic Variants Summary
The somatic variants summary tables provide the total counts for the following:
} Somatic SNVs, deletions, and insertions.
} Different structural variants, including CNVs.
} Variants overlapping known genes.
The workflow does not provide annotations for small variants, which results in counts of 0
for the remaining sections of the summary table.
All counts are based on PASS filter variants.
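The PASS-based counting described above can be approximated outside the report. The following Python sketch tallies PASS small variants by type from the lines of a VCF file. It is an illustration only: the function name is hypothetical, the classification by REF and ALT length is an assumption, and the result is not guaranteed to match the numbers in the PDF report.

```python
from collections import Counter


def count_pass_small_variants(vcf_lines):
    """Count PASS SNVs, insertions, and deletions in an iterable of VCF lines."""
    counts = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split()
        ref, alt, flt = fields[3], fields[4], fields[6]
        if flt != "PASS":
            continue  # only PASS filter variants are counted
        for allele in alt.split(","):
            if allele == "." or allele.startswith("<"):
                continue  # no alternate allele, or a symbolic allele
            if len(ref) == 1 and len(allele) == 1:
                counts["SNV"] += 1
            elif len(allele) > len(ref):
                counts["insertion"] += 1
            elif len(allele) < len(ref):
                counts["deletion"] += 1
            else:
                counts["other"] += 1
    return counts
```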
Circos Plot of Somatic Variations
The circos plot provides visualization of somatic small variation, ploidy, and structural
variations reported in the somatic variation files (VCF). The circos plot displays somatic
variation data in tracks with chromosomes circularly arranged. Following is an example
legend. Labels are described from inside the circle to the outside.
Legend  Label (From Inner Circle to Outer Circle)
        Description
A       Somatic structural variants
        The somatic structural variants detailed in somatic.SVs.vcf are plotted in the center of the plot.
        } Green links: Segmental duplications (at the center of the circle).
        } Green boxes: Inversions (the first inner track).
        } Purple boxes: Deletions (the second track). The width of the boxes indicates the length of SVs.
        } Purple bars: Insertion breakpoints (the third track).
        } Red links: Translocations. The end of the links indicates the 2 breakpoints of SVs.
B       Number of somatic indels per Mb
        The density of PASS somatic indels reported in somatic.indels.vcf.gz in 1 Mb windows. The scale of Y-axis in the histogram indicates the counts.
C       Number of somatic SNVs per Mb
        The density of PASS somatic SNVs reported in somatic.snvs.vcf in 1 Mb windows, arbitrarily scaled in a histogram with Y-axis pointing inward.
D       Copy-neutral loss of heterozygosity (LOH)
        The LOH regions with SNP calls in the normal genome but a homozygous reference call in the tumor genome, in CNVs.vcf.
E       B-allele frequency
        The B-allele ratios calculated by SENECA that will be used in the ploidy and purity estimation.
F       Called level
        The copy number aberrations from CNVs.vcf file. The scale of Y-axis in the histogram indicates the called level.
G       Karyotype
        The standard Circos ideogram defining the chromosome position, identity, and color of cytogenetic bands.
H       Chromosome position
        The reference coordinates along the chromosome (in megabases).
I       Chromosome number
        Chromosome number: 1, 2,…,22, X, Y.
Depth/B-Allele Plot
The top plot provides an overview of the depth of coverage by chromosomal position.
Aberrant values indicate copy number variations. Copy number ratios are classified as
either gains (red), losses (green), or copy number unchanged (black). Estimated purity
and ploidy values are listed at the top of the plot.
The bottom of the plot displays B-allele ratio by chromosomal position. The B-allele ratio
is the ratio of the 2 alleles A and B. The B-allele ratios that are around 0.5 are filtered out
for diploid regions.
Input Files
The following file is user-provided.
} SampleSheet.csv—Contains user-specified analysis options for the run, including
Sample_IDs and index sequences. Create the SampleSheet.csv file for the
GenerateFASTQ step and save it in the top level of the run folder. HiSeq Analysis
Software v2.0 creates the SampleSheet.csv for subsequent workflow steps.
HiSeq X generates the following files. These files are found in the run folder after a
sequencing run.
} RunInfo.xml: Produced by RTA. Contains run information such as the number of cycles per read.
} Base call files (s_x_yyyy.bcl.gz): Include base calls for lane x, tile yyyy, cycle z.
Located in Data\Intensities\BaseCalls\L00x\Cz.1.
} Filter files (s_x_yyyy.filter): Include filter flags for lane x, tile yyyy. Located in
Data\Intensities\BaseCalls\L00x\.
} Position files (s.locs): Include locations for all lanes. Located in Data\Intensities.
Sample Sheet Settings
The sample sheet identifies the samples included in an analysis and contains the index
information used in sample demultiplexing. To customize the sample sheet for the
GenerateFASTQ workflow, enter the following parameters in the [Settings] section of the
sample sheet.
Each line in the [Settings] section contains a parameter name in the first column and a
value in the second column. Settings are not case-sensitive.
Parameter     Description
Adapter       Specify the 5' portion of the adapter sequence to prevent reporting sequence beyond the sample DNA. To trim two or more adapters, separate the sequences with a plus (+) sign (eg, for Nextera Mate Pair libraries: CTGTCTCTTATACACATCT+AGATGTGTATAAGAGACAG).
AdapterRead2  Specify the 5' portion of the Read 2 adapter sequence to prevent reporting sequence beyond the sample DNA. Use this setting to specify an adapter other than the one specified in the Adapter setting. If not specified, the Adapter setting is used for Read 2.
[Data] Section
The following table includes the parameters and descriptions of the sample sheet [Data]
fields.
Parameter       Description
Lane            [Optional] Lane numbers used for a sample. If not specified, all lanes are used. Each lane can be entered as a separate row. Delimited by '+', eg '1+2+3'.
Sample_ID       Sample ID. Combined with Sample_Project to make a unique identifier for the sample.
Sample_Name     [Optional] Sample Name. If this field is completed, the sample name is used when naming FASTQ files.
index           [Optional] Index bases used to identify a sample.
index2          [Optional] Second index bases used to identify a sample.
GenomeFolder    Location of genome used for alignment.
Sample_Project  [Optional] Project Name. Completing this field is highly recommended.
Example GenerateFASTQ Sample Sheet
An example of a sample sheet for the GenerateFASTQ workflow is shown below.
[Header]
IEMFileVersion,4
Investigator Name,Jane Doe
Project Name,Project A
Experiment Name,HiSeqX
Date,2/12/2015
Workflow,GenerateFASTQ
Application,HiSeq FASTQ Only
Assay,TruSeq HT
Description,HiSeqX run
Chemistry,Default
[Reads]
151
151
[Settings]
Adapter,AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
AdapterRead2,AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
[Data]
Lane,Sample_ID,Sample_Name,index,index2,GenomeFolder,Sample_Project
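A sample sheet in the layout above can be checked with a short script before a run. The following Python sketch splits the sheet into its bracketed sections, lowercases the [Settings] parameter names (settings are not case-sensitive), and turns the [Data] rows into dictionaries keyed by the column header. The function names parse_sample_sheet and split_plus are hypothetical, and this is not the parser used by HiSeq Analysis Software.

```python
import csv


def parse_sample_sheet(text):
    """Return (settings, samples) parsed from sample sheet text."""
    sections = {}
    current = None
    for row in csv.reader(text.splitlines()):
        if not row or not any(cell.strip() for cell in row):
            continue  # skip empty rows
        first = row[0].strip()
        if first.startswith("[") and first.endswith("]"):
            current = first[1:-1]  # section name, e.g. Settings
            sections[current] = []
        elif current is not None:
            sections[current].append([cell.strip() for cell in row])
    # Parameter name in the first column, value in the second column.
    settings = {r[0].lower(): r[1]
                for r in sections.get("Settings", []) if len(r) >= 2}
    data_rows = sections.get("Data", [])
    samples = []
    if data_rows:
        header = data_rows[0]
        samples = [dict(zip(header, r)) for r in data_rows[1:]]
    return settings, samples


def split_plus(value):
    """Split a '+'-delimited value such as '1+2+3' or a list of adapters."""
    return [part for part in value.split("+") if part]
```

For example, split_plus applied to the Lane field of a sample returns the individual lane numbers as strings.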
Output Files
Table 1 Common Data Files
File Name              Description
*.bam files            Contains reads for a given sample. Located in {USER-SPECIFIED-ANALYSIS-FOLDER}/{SAMPLEID}/Results.
*.vcf files            Contains information about variants found at specific positions in a reference genome. Located in {USER-SPECIFIED-ANALYSIS-FOLDER}/{SAMPLEID}/Results.
*.genome.vcf.gz files  Contains all variant and nonvariant sites in the genome for the sample. The file is block-compressed with bgzip so that it is compatible with tabix indexing, and a corresponding tabix index file is also generated. This information is represented using the Genome VCF (gVCF) conventions for VCF. Located in {USER-SPECIFIED-ANALYSIS-FOLDER}/{SAMPLEID}/Results. For more information, see sites.google.com/site/gvcftools/home/about-gvcf.
Table 2 Common Metrics Files
File Name                                Description
AdapterTrimming.txt                      Lists the number of trimmed bases and percentage of bases for each tile. This file is present only if adapter trimming was specified for the run. Located in {USER-SPECIFIED-FASTQ-FOLDER}/{FLOWCELLID}/QC.
*.CoverageHistogram.txt                  Contains coverage depth information for every chromosome. This file can be used to generate a coverage histogram. Located in {USER-SPECIFIED-ANALYSIS-FOLDER}/{SAMPLEID}/Results/LogFiles.
DemultiplexSummaryF1L1.txt               Reports demultiplexing results in a table with one row per tile and one column per sample. Located in {USER-SPECIFIED-FASTQ-FOLDER}/{FLOWCELLID}/QC.
ErrorsAndNoCallsByLaneTileReadCycle.csv  A comma-separated values file that contains the percentage of errors and no-calls for each tile, read, and cycle. Located in {USER-SPECIFIED-ANALYSIS-FOLDER}/{SAMPLEID}/Results/LogFiles.
*.summary.csv                            Summary statistics for the sample, stored as a comma-separated values (.csv) file for parsing by downstream tools. Located in {USER-SPECIFIED-ANALYSIS-FOLDER}/{SAMPLEID}/Results.
*.report.pdf / *.report.html             Summary report for the sample. Provides several high-level statistics pulled from *.summary.csv, such as genome coverage and % aligned. Located in {USER-SPECIFIED-ANALYSIS-FOLDER}/{SAMPLEID}/Results.
Table 3 Common Run Progress Files
File Name             Description
WorkflowLog.txt       Processing log that describes every step that occurred during analysis of the current run folder. This file does not contain error messages. Located in {USER-SPECIFIED-ANALYSIS-FOLDER}/{SAMPLEID}/Results/LogFiles.
WorkflowError.txt     Processing log that lists any errors that occurred during analysis. This file is present only if errors occurred. Located in {USER-SPECIFIED-ANALYSIS-FOLDER}/{SAMPLEID}/Results/LogFiles.
CompletedJobInfo.xml  Written after analysis is complete. Contains information about the run, such as date, flow cell ID, software version, and other parameters. Located in {USER-SPECIFIED-ANALYSIS-FOLDER}/{SAMPLEID}/Results/LogFiles.
RunInfo.xml           Contains information about the run. Located in {USER-SPECIFIED-ANALYSIS-FOLDER}/{SAMPLEID}/Results/LogFiles.
Methods for Resequencing and Alignment Analysis Workflows
This section describes the underlying methods for the resequencing and alignment
algorithms in HiSeq Analysis Software v2.0. The resequencing analysis and alignment
analysis workflows are identical, except that the alignment analysis workflow does not
include variant calling and is recommended only for tumor samples.
Introduction
After the sequencer generates base calls and quality scores, the resulting data are
analyzed in 2 steps—alignment to the reference genome followed by assembly and
variant calling.
Alignment and variant calling are performed with the Isaac Alignment Software, Isaac
Variant Caller, Isaac CNV Caller, and Isaac SV Caller. The following output is produced:
} Realigned and duplicate marked reads in a *.bam file format.
} Variants in a VCF file format.
} An additional Genome VCF (gVCF) file. This file features an entry for every base in
the reference, which differentiates reference calls and no calls, and a summary of
quality. The reference calls are block compressed and all single nucleotide
polymorphisms and indels are included. Currently Structural Variants and CNVs
are kept in separate files.
Genome Specific Details
Illumina currently uses hg19 from UCSC as a reference genome. The chromosome
naming scheme follows the UCSC conventions of chr1-22, chrX, chrY, chrM. The
pseudoautosomal region (PAR) of the Y chromosome is masked out with N’s. The result
of this is that any mappings occurring in the PAR region map to the X chromosome.
Currently, only the main chromosomes and mitochondria are used in the reference; none
of the nonmapped contigs are included. As per GATK specification for UCSC, chrM is
the first chromosome followed by the rest in karyotypic order.
The hg19 PAR regions are defined as follows.
Table 4 hg19 PAR regions
Name   Chr  Start        Stop
PAR#1  X    60,001       2,699,520
PAR#2  X    154,931,044  155,260,560
PAR#1  Y    10,001       2,649,520
PAR#2  Y    59,034,050   59,363,566
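The PAR coordinates in Table 4 can be used to flag positions that fall inside a pseudoautosomal region. The following Python sketch is a minimal lookup that assumes the coordinates are 1-based and inclusive, and that chromosomes use the UCSC names chrX and chrY. The names PAR_HG19 and par_name are hypothetical.

```python
# hg19 PAR coordinates copied from Table 4 (assumed 1-based, inclusive).
PAR_HG19 = {
    "chrX": [("PAR#1", 60001, 2699520), ("PAR#2", 154931044, 155260560)],
    "chrY": [("PAR#1", 10001, 2649520), ("PAR#2", 59034050, 59363566)],
}


def par_name(chrom, pos):
    """Return the PAR name containing the position, or None if outside a PAR."""
    for name, start, stop in PAR_HG19.get(chrom, []):
        if start <= pos <= stop:
            return name
    return None
```

Because the Y chromosome PARs are masked with N's in the reference, alignments are expected only at the chrX coordinates of these regions.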
Isaac Aligner
The Isaac Aligner (see footnote 1) aligns DNA sequencing data, single or paired-end, with read lengths
of 32–150 bp and low error rates using the following steps:
} Candidate mapping positions—Identifies the complete set of relevant candidate
mapping positions using a 32-mer seed-based search.
} Mapping selection—Selects the best mapping among all candidates.
} Alignment score—Determines alignment scores for the selected candidates based on
a Bayesian model.
} Alignment output: Generates the final output as a sorted, duplicate-marked,
indel-realigned BAM file and a summary file.
1 Come Raczy, Roman Petrovski, Christopher T. Saunders, Ilya Chorny, Semyon
Kruglyak, Elliott H. Margulies, Han-Yu Chuang, Morten Källberg, Swathi A. Kumar,
Arnold Liao, Kristina M. Little, Michael P. Strömberg and Stephen W. Tanner (2013)
Isaac: Ultra-fast whole genome secondary analysis on Illumina sequencing
platforms. Bioinformatics 29(16):2041-3.
bioinformatics.oxfordjournals.org/content/29/16/2041
Candidate Mapping
To align reads, the Isaac Aligner first identifies a small but complete set of relevant
candidate mapping positions. The Isaac Aligner begins with a seed-based search using
32-mers from the extremities of the read as seeds. Isaac Aligner performs another search
using different seeds for only those reads that were not mapped unambiguously with the
first pass seeds.
Mapping Selection
Following a seed-based search, the Isaac Aligner selects the best mapping among all the
candidates. For paired-end data sets, all mappings where only one end is aligned (called
orphan mappings) trigger a local search to find additional mapping candidates. These
candidates (called shadow mappings) are defined through the expected minimum and
maximum insert size. After optional trimming of low quality 3' ends and adapter
sequences, the possible mapping positions of each fragment are compared. This step
takes into account paired-end information (when available), possible gaps using a banded
Smith-Waterman gap aligner, and possible shadows. The selection is based on the
Smith-Waterman score and on the log-probability of each mapping.
Alignment Scores
The alignment scores of each read pair are based on a Bayesian model, where the
probability of each mapping is inferred from the base qualities and the positions of the
mismatches. The final mapping quality (MAPQ) is the alignment score, truncated to 60
for scores above 60, and corrected based on known ambiguities in the reference flagged
during candidate mapping. Following alignment, reads are sorted. Further analysis is
performed to identify duplicates and optionally to realign indels.
Alignment Output
After sorting the reads, the Isaac Aligner generates compressed binary alignment output
files, called BAM (*.bam) files, using the following process:
} Marking duplicates—Detection of duplicates is based on the location and observed
length of each fragment. The Isaac Aligner identifies and marks duplicates even
when they appear on oversized fragments or chimeric fragments.
} Realigning indels—The Isaac Aligner tracks previously detected indels, over a
window large enough for the current read length, and applies the known indels to
all reads with mismatches.
} Generating BAM files—The first step in BAM file generation is creation of the BAM
record, which contains all required information except the name of the read. The
Isaac Aligner reads data from base call (BCL) files that were written during base
calling on the sequencer to generate the read names. Data are then compressed into
blocks of 64 kb or less to create the BAM file.
Isaac Variant Caller
The Isaac Variant Caller identifies single nucleotide variants (SNVs) and small indels
using the following steps:
} Read filtering—Filters out reads failing quality checks.
} Indel calling—Identifies a set of possible indel candidates and realigns all reads
overlapping the candidates using a multiple sequence aligner.
} SNV calling—Computes the probability of each possible genotype given the aligned
read data and a prior distribution of variation in the genome.
} Indel genotypes—Calls indel genotypes and assigns probabilities.
Indel Candidates
Input reads are filtered by removing any of the following:
} Reads that failed base calling quality checks.
} Reads marked as PCR duplicates.
} Paired-end reads not marked as a proper pair.
} Reads with a mapping quality less than 20.
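The four read filters above map naturally onto the FLAG and MAPQ fields of a SAM/BAM record. The following Python sketch expresses them that way. Treating "failed base calling quality checks" as SAM flag 0x200 and "PCR duplicates" as flag 0x400 is an assumption made for illustration, and the function name is hypothetical. This is not the variant caller's own code.

```python
# Standard SAM FLAG bits.
FLAG_PAIRED = 0x1        # read is paired in sequencing
FLAG_PROPER_PAIR = 0x2   # read is mapped in a proper pair
FLAG_QC_FAIL = 0x200     # read failed quality checks
FLAG_DUPLICATE = 0x400   # read is a PCR or optical duplicate


def passes_read_filter(flag, mapq, min_mapq=20):
    """Return True if a read survives the four filters listed above."""
    if flag & FLAG_QC_FAIL:
        return False
    if flag & FLAG_DUPLICATE:
        return False
    if (flag & FLAG_PAIRED) and not (flag & FLAG_PROPER_PAIR):
        return False
    return mapq >= min_mapq
```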
Indel Calling
The variant caller proceeds with candidate indel discovery and generates alternate read
alignments based on the candidate indels. As part of the realignment process, the variant
caller selects a representative alignment to be used for site genotype calling and depth
summarization by the SNV caller.
SNV Calling
The variant caller runs a series of filters on the set of filtered and realigned reads for
SNV calling without affecting indel calls. First, any contiguous trailing sequence of N
base calls is trimmed from the ends of reads. Using a mismatch density filter, reads
having an unexpectedly high number of disagreements with the reference are masked, as
follows:
} The variant caller treats each insertion or deletion as a single mismatch.
} Base calls with more than two mismatches to the reference sequence within 20 bases
of the call are ignored.
} If the call occurs within the first or last 20 bases of a read, the mismatch limit is
applied to a 41-base window at the corresponding end of the read.
} The mismatch limit is applied to the entire read when the read length is 41 or
shorter.
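The mismatch density rules above can be sketched as a sliding-window mask over a read. The following Python example takes one boolean per read position (True where the read disagrees with the reference, with each insertion or deletion assumed to be already encoded as a single True) and returns the positions that would be ignored. Counting the call itself as part of its 41-base window is an assumption, and the function name is hypothetical.

```python
def mismatch_density_mask(mismatches, flank=20, max_mismatches=2):
    """Return one boolean per read position; True means the call is ignored."""
    n = len(mismatches)
    window = 2 * flank + 1  # 41 bases with the default flank of 20
    # Prefix sums give the mismatch count of any window in constant time.
    prefix = [0]
    for m in mismatches:
        prefix.append(prefix[-1] + (1 if m else 0))
    mask = []
    for i in range(n):
        if n <= window:
            start, end = 0, n  # short read: the limit applies to the entire read
        else:
            # Clamp the window so that calls near either end of the read
            # use the 41-base window at that end.
            start = min(max(i - flank, 0), n - window)
            end = start + window
        mask.append(prefix[end] - prefix[start] > max_mismatches)
    return mask
```

For a 100-base read with mismatches at positions 10, 12, and 14 (0-based), positions 0 through 30 are masked because their windows contain all three mismatches.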
Indel Genotypes
The variant caller filters out all bases marked by the mismatch density filter and any N
base calls that remain after the end-trimming step. These filtered base calls are not used
for site-genotyping but appear in the filtered base call counts in the variant caller output
for each site.
All remaining base calls are used for site-genotyping. The genotyping method
heuristically adjusts the joint error probability that is calculated from multiple
observations of the same allele on each strand of the genome. This correction accounts
for the possibility of error dependencies.
This method treats the highest-quality base call from each allele and strand as an
independent observation and leaves the associated base call quality scores unmodified.
Quality scores for subsequent base calls for each allele and strand are then adjusted. This
adjustment is done to increase the joint error probability of the given allele above the
error expected from independent base call observations.
Variant Call Output
After the SNV and indel genotyping methods are complete, the variant caller applies a
final set of heuristic filters to produce the final set of calls in the output.
The output in the genome variant call (gVCF) file captures the genotype at each position
and the probability that the consensus call differs from reference. This score is expressed
as a Phred-scaled quality score.
Genome VCF (gVCF)
Human genome sequencing applications require sequencing information for both variant
and nonvariant positions, yet there is no common exchange format for such data. gVCF
addresses this issue.
gVCF is a set of conventions applied to the standard variant call format (VCF). These
conventions allow representation of genotype, annotation, and additional information
across all sites in the genome, in a reasonably compact format. Typical human whole-genome
sequencing results expressed in gVCF with annotation are less than 1.7 GB, or
about 1/50 the size of the BAM file used for variant calling.
gVCF is also equally appropriate for representing and compressing targeted sequencing
results. Compression is achieved by joining contiguous nonvariant regions with similar
properties into single ‘block’ VCF records. To maximize the utility of gVCF, especially for
high stringency applications, the properties of the compressed blocks are conservative.
Block properties such as depth and genotype quality reflect the minimum of any site in
the block. The gVCF file is also a valid VCF v4.1 file, and can be indexed and used with
existing VCF tools such as tabix and IGV. This feature makes the file convenient both for
direct interpretation and as a starting point for further analysis.
gvcftools
Illumina has created a full set of utilities aimed at creating and analyzing Genome VCF
files. For up-to-date information and downloads, visit the gvcftools website at
sites.google.com/site/gvcftools/home.
Examples
The following is a segment of a VCF file following the gVCF conventions for
representation of nonvariant sites and, more specifically, using gvcftools block
compression and filtration levels.
In the following gVCF example, nonvariant regions are the records with a period (.) in
the ALT column, and variants are the records with an alternate allele in the ALT column.
NOTE
The variant lines can be extracted from a gVCF file to produce a conventional variant VCF
file.
chr20 676337 . T . 0.00 PASS END=676401;BLOCKAVG_min30p3a GT:GQX:DP:DPF 0/0:143:51:0
chr20 676402 . A . 0.00 PASS END=676441;BLOCKAVG_min30p3a GT:GQX:DP:DPF 0/0:169:57:0
chr20 676442 . T G 287.00 PASS SNVSB=-30.5;SNVHPOL=3 GT:GQ:GQX:DP:DPF:AD 0/1:316:287:66:1:33,33
chr20 676443 . T . 0.00 PASS END=676468;BLOCKAVG_min30p3a GT:GQX:DP:DPF 0/0:202:68:1
chr20 676469 . G . 0.00 PASS . GT:GQX:DP:DPF 0/0:199:67:5
chr20 676470 . A . 0.00 PASS END=676528;BLOCKAVG_min30p3a GT:GQX:DP:DPF 0/0:157:53:0
chr20 676529 . T . 0.00 PASS END=676566;BLOCKAVG_min30p3a GT:GQX:DP:DPF 0/0:120:41:0
chr20 676567 . C . 0.00 PASS END=676574;BLOCKAVG_min30p3a GT:GQX:DP:DPF 0/0:114:39:0
chr20 676575 . A T 555.00 PASS SNVSB=-50.0;SNVHPOL=3 GT:GQ:GQX:DP:DPF:AD 1/1:114:114:39:0:0,39
chr20 676576 . T . 0.00 PASS END=676625;BLOCKAVG_min30p3a GT:GQX:DP:DPF 0/0:95:36:0
chr20 676626 . T . 0.00 PASS END=676650;BLOCKAVG_min30p3a GT:GQX:DP:DPF 0/0:117:40:0
chr20 676651 . T . 0.00 PASS END=676698;BLOCKAVG_min30p3a GT:GQX:DP:DPF 0/0:90:31:0
chr20 676699 . T . 0.00 PASS END=676728;BLOCKAVG_min30p3a GT:GQX:DP:DPF 0/0:69:24:0
chr20 676729 . C . 0.00 PASS END=676783;BLOCKAVG_min30p3a GT:GQX:DP:DPF 0/0:57:20:0
chr20 676784 . C . 0.00 PASS END=676803;BLOCKAVG_min30p3a GT:GQX:DP:DPF 0/0:51:18:0
chr20 676804 . G A 62.00 PASS SNVSB=-7.5;SNVHPOL=2 GT:GQ:GQX:DP:DPF:AD 0/1:95:62:17:0:11,66
chr20 676805 . C . 0.00 PASS END=676818;BLOCKAVG_min30p3a GT:GQX:DP:DPF 0/0:48:17:0
chr20 676819 . T . 0.00 PASS END=676824;BLOCKAVG_min30p3a GT:GQX:DP:DPF 0/0:39:14:0
chr20 676825 . A . 0.00 PASS END=676836;BLOCKAVG_min30p3a GT:GQX:DP:DPF 0/0:30:11:0
chr20 676837 . T . 0.00 LowGQX END=676857;BLOCKAVG_min30p3a GT:GQX:DP:DPF 0/0:21:8:0
chr20 676858 . G . 0.00 PASS END=676873;BLOCKAVG_min30p3a GT:GQX:DP:DPF 0/0:30:11:0
In addition to the nonvariant and variant regions in the example, there is also one
nonvariant region from [676837,676857] that is filtered out due to insufficient confidence
that the region is homozygous reference.
Conventions
Any VCF file following the gVCF convention combines information on variant calls
(SNVs and small-indels) with genotype and read depth information for all nonvariant
positions in the reference. Because this information is integrated into a single file,
distinguishing variant, reference, and no-call states for any site of interest is
straightforward.
The following subsections describe the general conventions followed in any gVCF file,
and provide information on the specific parameters and filters used in the Isaac
workflow gVCF output.
NOTE
gVCF conventions are written with the assumption that only one sample per file is being
represented.
Interpretation
A gVCF file can be interpreted as follows:
} Fast interpretation—As a discrete classification of the genome into ‘variant’,
‘reference’, and ‘no-call’ loci. This classification is the simplest way to use the gVCF.
The Filter fields for the gVCF file have already been set to mark uncertain calls as
filtered for both variant and nonvariant positions. Simple analysis can be performed
to look for all loci with a filter value of “PASS” and treat them as called.
} Research interpretation—As a ‘statistical’ genome. Additional fields, such as
genotype quality, are provided for both variant and reference positions to allow the
threshold between called and uncalled sites to be varied. These fields can also be
used to apply more stringent criteria to a set of loci from an initial screen.
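The fast interpretation can be sketched in a few lines. The following Python example classifies one gVCF record as 'variant', 'reference', or 'no-call' using only the ALT and FILTER columns. It assumes that a PASS record with a period in the ALT column is a confident reference call and that any record without PASS is treated as uncalled. The function name is hypothetical.

```python
def classify_record(line):
    """Classify a gVCF data line as 'variant', 'reference', or 'no-call'."""
    fields = line.split()
    alt, flt = fields[4], fields[6]
    if flt != "PASS":
        return "no-call"  # filtered records are treated as uncalled
    return "reference" if alt == "." else "variant"
```

Keeping only the records classified as 'variant' is one way to reduce a gVCF file to a conventional variant VCF file.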
External Tools
gVCF is written to the VCF 4.1 specifications, so any tool that is compatible with the
specification (such as IGV and tabix) can use the file. However, certain tools are not
appropriate if they:
} Apply algorithms to VCF files that make sense only for variant calls (as opposed to
variant and nonvariant regions in the full gVCF);
} Are only computationally feasible for variant calls.
For these cases, extract the variant calls from the full gVCF file.
Special Handling for Indel Conflicts
Sites that are "filled in" inside deletions have additional treatment.
} Heterozygous Deletions—Sites inside heterozygous deletions have haploid genotype
entries (ie "0" instead of "0/0", "1" instead of "1/1"). Heterozygous SNVs are marked
with the SiteConflict filter and their original genotype is left unchanged. Sites inside
heterozygous deletions cannot have a genotype quality score higher than the
enclosing deletion genotype quality.
} Homozygous Deletions—Sites inside homozygous deletions have genotype set to "."
(period), and site and genotype quality are also set to "." (period).
} All Deletions—Sites inside any deletion are marked with the filters of the deletion,
and more filters can be added pertaining to the site itself. These modifications reflect
the idea that the enclosing indel confidence bounds the site confidence.
} Indel Conflicts—In any region where overlapping deletion evidence cannot be
resolved into 2 haplotypes, all indel and site records in the region are marked with
the IndelConflict filter.
Table 5 Indel Conflict Filters
ID             Type        Description
IndelConflict  site/indel  Locus is in region with conflicting indel calls.
SiteConflict   site        Site genotype conflicts with proximal indel call. This conflict is typically a heterozygous genotype found inside of a heterozygous deletion.
Representation of Non-Variant Segments
This section includes the following subsections:
} Block representation using END key
} Joining nonvariant sites into a single block record
} Block sample values
} Nonvariant block implementations
Block Representation Using END Key
Continuous nonvariant segments of the genome can be represented as single records in
gVCF. These records use the standard 'END' INFO key to indicate the extent of the
record. Even though the record can span multiple bases, only the first base is provided
in the REF field (to reduce file size). Following is a simplified example of a nonvariant
block record:
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA19238
chr1 51845 . A . . PASS END=51862
The example record spans positions [51845,51862].
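The span of a block record can be recovered from the POS column and the END key. The following Python sketch parses a whitespace-delimited record and returns its first and last positions. A record without an END key is treated as a single site. The function name is hypothetical.

```python
def block_span(line):
    """Return (start, end) positions covered by a gVCF record, both inclusive."""
    fields = line.split()
    start = int(fields[1])
    end = start  # records without an END key cover a single position
    for item in fields[7].split(";"):
        if item.startswith("END="):
            end = int(item[len("END="):])
    return start, end
```

The number of reference bases covered by a record is end - start + 1.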
Joining Non-Variant Sites Into a Single Block Record
Address the following issues when joining adjacent nonvariant sites into block records:
} The criteria that allow adjacent sites to be joined into a single block record.
} The method to summarize the distribution of SAMPLE or INFO values from each
site in the block record.
At any gVCF compression level, a set of sites can be joined into a block if the following conditions are met:
} Each site is nonvariant with the same genotype call. Expected nonvariant genotype
calls are { "0/0", "0", "./.", "." }.
} Each site has the same coverage state, where 'coverage state' refers to whether at
least 1 read maps to the site. For example, sites with 0 coverage cannot be joined into
the same block with covered sites.
} Each site has the same set of FILTER tags.
} Sites have less than a threshold fraction of nonreference allele observations
compared to all observed alleles (based on AD and DP field information). This
threshold is used to keep sites with high ratios of nonreference alleles from being
compressed into nonvariant blocks. In the Isaac Variant Caller gVCF output, the
maximum nonreference fraction is 0.2.
Block Sample Values
Any field provided for a block of sites, such as read depth (using the DP key), shows the
minimum value observed among all sites encompassed by the block.
Nonvariant Block Implementations
Files conforming to the gVCF conventions delineated in this document can use different
criteria for creation of block records, depending on the desired trade-off between
compression and nonvariant site detail. The Isaac Variant Caller provides the following
blocking scheme 'min30p3a' as the nonvariant block compression scheme.
Each sample value shown for the block, such as the depth (using the DP key), is
restricted to have a range where the maximum value is within 30% or 3 of the
minimum. Therefore, for sample value range [x,y], y ≤ x+max(3,x*0.3). This range
restriction applies to all sample values written in the final block record.
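The min30p3a range restriction can be checked directly from the inequality above. The following Python sketch tests whether a set of per-site values for one sample field (for example DP) satisfies y ≤ x+max(3,x*0.3), where x is the minimum and y is the maximum. The function names are hypothetical, and the other joining criteria (genotype, coverage state, FILTER tags, nonreference fraction) are not checked here.

```python
def within_min30p3a(values):
    """Return True if max(values) is within 30% or 3 of min(values)."""
    lo, hi = min(values), max(values)
    return hi <= lo + max(3, lo * 0.3)


def can_extend_block(block_values, new_value):
    """Return True if adding new_value keeps the block within the range limit."""
    return within_min30p3a(list(block_values) + [new_value])
```

For a block whose minimum depth is 10, the maximum depth allowed is 13. For a minimum depth of 51, the limit is 66.3.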
Genotype Quality for Variant and Nonvariant Sites
The gVCF file uses an adapted version of genotype quality for variant and nonvariant
site filtration. This value is associated with the GQX key. The GQX value is intended to
represent the minimum of two Phred genotype qualities: the quality computed assuming
the site is variant and the quality computed assuming the site is nonvariant.
This approach allows a single value to be used as the primary quality filter for
both variant and nonvariant sites. Filtering on this value corresponds to a conservative
assumption appropriate for applications where reference genotype calls must be
determined at the same stringency as variant genotypes.
Filter Criteria
The gVCF FILTER description is divided into 2 sections: (1) describes filtering based on
genotype quality; (2) describes all other filters.
NOTE
These filters are default values used in the current Isaac Variant Caller implementation.
However, no set of filters or cutoff values are required for a file to conform to gVCF
conventions.
The genotype quality is the primary filter for all sites in the genome. In particular,
traditional discovery-based site quality values that convey confidence that the site is
"anything besides the homozygous reference genotype," such as SNV quality, are not
used. Instead, a site or locus is filtered based on the confidence in the reported genotype
for the current sample.
The genotype quality used in gVCF is a Phred-scaled probability that the given genotype
is correct. It is indicated with the FORMAT field tag GQX. Any locus where the genotype
quality is below the cutoff threshold is filtered with the tag LowGQX. In addition to
filtering on genotype quality, some other filters are also applied.
The gVCF output from Isaac Variant Caller includes several heuristic filters applied to
the site and indel records. The filters are as follows.
Table 6 VCF Site and Indel Record Filters
VCF Filter ID | Type | Description
PhasingConflict | site | The locus read evidence displays unbalanced phasing patterns.
PLOIDY_CONFLICT | site/indel | The genotype call from the variant caller is not consistent with the chromosome ploidy.
IndelConflict | indel | The locus is in a region with conflicting indel calls.
HighDPFRatio | site | The fraction of basecalls filtered out at a site, DPF/(DP+DPF), is greater than 0.4.
HighDepth | site/indel | The locus depth is greater than 3x the mean chromosome depth.
LowGQX | site/indel | Locus GQX is less than 30.
LowGQXHetSNP | site | Locus GQX is less than 15 for het SNP.
LowGQXHomSNP | site | Locus GQX is less than 17 for hom SNP.
LowGQXHetIns | site/indel | Locus GQX is less than 6 for het insertion.
LowGQXHomIns | site/indel | Locus GQX is less than 6 for hom insertion.
LowGQXHetDel | site/indel | Locus GQX is less than 6 for het deletion.
PASS | site/indel | Position has passed all filters.
SiteConflict | indel | The site genotype conflicts with the proximal indel call. This call is typically a heterozygous SNV call made inside of a heterozygous deletion.
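The following Python sketch illustrates how the GQX definition and three of the default cutoffs in Table 6 could be applied to a single site record. It is not the Isaac Variant Caller implementation. The function names and the use of DP as the locus depth for the HighDepth test are assumptions made for illustration.

```python
# Illustrative only: function names and record layout are hypothetical.
LOW_GQX_CUTOFF = 30          # LowGQX cutoff from Table 6
MAX_DPF_RATIO = 0.4          # HighDPFRatio cutoff from Table 6
MAX_DEPTH_FACTOR = 3         # HighDepth: 3x the mean chromosome depth


def gqx(gq_if_variant, gq_if_nonvariant):
    """GQX is the minimum of the two Phred genotype qualities."""
    return min(gq_if_variant, gq_if_nonvariant)


def site_filters(gqx_value, dp, dpf, mean_chrom_depth):
    """Return the list of filter IDs for a site, or ["PASS"]."""
    filters = []
    if gqx_value < LOW_GQX_CUTOFF:
        filters.append("LowGQX")
    total = dp + dpf
    if total > 0 and dpf / total > MAX_DPF_RATIO:
        filters.append("HighDPFRatio")
    # Assumption: DP is used as the locus depth.
    if dp > MAX_DEPTH_FACTOR * mean_chrom_depth:
        filters.append("HighDepth")
    return filters or ["PASS"]
```

For example, a site with GQX 45, DP 30, and DPF 2 on a chromosome with mean depth 32 passes all three tests, while a site with GQX 12, DP 120, and DPF 100 fails all three.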
Isaac Copy Number Variant Caller
Isaac Copy Number Variant (CNV) Caller is an algorithm for calling copy number
variants from a diploid sample. Most of a normal DNA sample is diploid, that is, present
in 2 copies. Isaac CNV Caller identifies regions of the sample genome that are absent, or
present either 1 time or more than 2 times in the genome. Isaac CNV Caller scans the
genome for regions having an unexpected number of short read alignments. Regions
with fewer than the expected number of alignments are classified as losses. Regions
having more than the expected number of alignments are classified as gains.
Isaac CNV Caller is appropriately applied to low-depth cytogenetics experiments,
low-depth single-cell experiments, or whole-genome sequencing experiments. Isaac CNV
Caller is not appropriate for whole exome experiments, cancer studies, or any other
experiment with the following conditions:
} Most of the genome is not assumed to be diploid.
} Reads are not distributed randomly across the diploid genome.
Workflow
Isaac CNV Caller can be conceptually divided into 4 processes:
} Binning: Counting alignments in genomic bins.
} Cleaning: Removal of systematic biases and outliers from the counts.
} Partitioning: Partitioning the counts into homogeneous regions.
} Calling: Assigning a copy number to each homogeneous region.
These processes are explained in subsequent sections.
Binning
The binning procedure creates genomic windows, or bins, across the genome and counts
the number of observed alignments that fall into each bin. The alignments are provided
in the form of a BAM file.
Isaac CNV Caller binning keeps in memory a collection of BitArrays to store observed
alignments, one BitArray for each chromosome. Each BitArray length is the same as its
corresponding chromosome length. As the BAM file is read in, Isaac CNV Caller records
the position of the left-most base in each alignment within the chromosome-appropriate
BitArray. After all alignments in the BAM file have been read, the BitArrays have a “1”
wherever an alignment was observed and a “0” everywhere else.
After reading in the BAM file, a masked FASTA file is read in, one chromosome at a
time. This FASTA file contains the genomic sequences that were used for alignment.
Each 35-mer within this FASTA file is marked as unique or nonunique with uppercase
and lowercase letters. If a 35-mer is unique, then its first nucleotide is capitalized;
otherwise, it is not capitalized. For example, in the sequence:
acgtttaATgacgatGaacgatcagctaagaatacgacaatatcagacaa
The 35-mers marked as unique are as follows:
ATGACGATGAACGATCAGCTAAGAATACGACAATA
TGACGATGAACGATCAGCTAAGAATACGACAATAT
GAACGATCAGCTAAGAATACGACAATATCAGACAA
Isaac CNV Caller stores the genomic locations of unique 35-mers in another collection of
BitArrays analogous to BitArrays used to store alignment positions. Unique positions
and nonunique positions are marked with “1”s and “0”s, respectively. This marking is
used as a mask to guarantee that only alignments that start at unique 35-mer positions
in the genome are used.
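The masking scheme described for binning can be sketched in Python as follows. This is an illustration under stated assumptions, not the Isaac CNV Caller code: a bytearray stands in for a BitArray, the function names are hypothetical, and the requirement that a full 35-mer fit within the sequence is an assumption.

```python
KMER_LENGTH = 35


def unique_kmer_starts(masked_seq, k=KMER_LENGTH):
    """Positions whose first nucleotide is uppercase mark unique k-mers."""
    return [
        i for i, base in enumerate(masked_seq)
        if base.isupper() and i + k <= len(masked_seq)  # assumed bound check
    ]


def unique_kmers(masked_seq, k=KMER_LENGTH):
    """The k-mers that start at the unique positions, in uppercase."""
    return [masked_seq[i:i + k].upper() for i in unique_kmer_starts(masked_seq, k)]


def masked_alignment_bits(alignment_starts, masked_seq, k=KMER_LENGTH):
    """Bit vector with 1 where an alignment starts at a unique k-mer position."""
    n = len(masked_seq)
    observed = bytearray(n)            # stand-in for the alignment BitArray
    for pos in alignment_starts:
        observed[pos] = 1              # repeated starts still set a single bit
    unique = bytearray(n)              # stand-in for the uniqueness BitArray
    for pos in unique_kmer_starts(masked_seq, k):
        unique[pos] = 1
    return bytearray(o & u for o, u in zip(observed, unique))
```

Applied to the example sequence above, `unique_kmers` returns the three 35-mers listed in the text, starting at 0-based positions 7, 8, and 15.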
Bin Sizes
Isaac CNV Caller is initialized with 100 alignments per bin and then proceeds to
compute the bin boundaries such that each bin contains the same bin size, or number of
unique 35-mers. The term “bin size” refers to the number of unique genomic 35-mers per
bin. Because some regions of the human genome are more repetitive than others,
physical bin sizes (in genomic coordinates) are not identical. In the following example,
each box is a position along the genome. Each checkmark represents a unique 35-mer
while each X represents a nonunique 35-mer. The bin size in this example is 3 (3
checkmarks per bin). The physical size of each bin is not constant. B1 and B3 have a
physical size of 3 but B2 and B4 have physical sizes of 4 and 6, respectively.
Computing Bin Size
To compute bin size, the ratio of observed alignments to unique 35-mers is calculated for
each autosome. The desired number of alignments per bin is then divided by the median
of these ratios to yield bin size. For whole-genome sequencing, bin sizes are typically in
the range of 800–1000 unique 35-mers. Correspondingly, most physical window sizes are
in the 1–1.2 kb range. The advantage of this approach relative to using fixed genomic
intervals is that approximately the same number of reads is expected to map to each bin,
regardless of “uniqueness” or ability to be mapped.
After bin size is computed, bins are defined as consecutive genomic windows such that
each bin contains the same bin size, or number of unique 35-mers. The number of
observed alignments present within the boundary of each bin is then counted from the
alignment BitArrays. The GC content of each bin is also calculated. The chromosome,
genomic start, genomic stop, observed counts and GC content in each bin are output to
disk.
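The bin size calculation and bin definition can be sketched in Python as follows. The function names are hypothetical, the inputs are plain lists rather than BitArrays, and dropping a trailing partial bin is an assumption because the guide does not say how the end of a chromosome is handled.

```python
from statistics import median


def compute_bin_size(alignments_per_autosome, unique_kmers_per_autosome,
                     target_alignments_per_bin=100):
    """Bin size = desired alignments per bin / median(alignments per unique 35-mer)."""
    ratios = [
        aligned / unique
        for aligned, unique in zip(alignments_per_autosome, unique_kmers_per_autosome)
    ]
    return round(target_alignments_per_bin / median(ratios))


def define_bins(unique_mask, bin_size):
    """Consecutive half-open (start, stop) windows, each holding bin_size unique positions."""
    bins = []
    start = 0
    seen = 0
    for pos, is_unique in enumerate(unique_mask):
        seen += is_unique
        if seen == bin_size:
            bins.append((start, pos + 1))
            start = pos + 1
            seen = 0
    return bins  # assumption: a trailing partial bin is dropped


def count_alignments(bins, alignment_bits):
    """Number of observed alignment starts inside each bin."""
    return [sum(alignment_bits[start:stop]) for start, stop in bins]
```

With the hypothetical mask `[1,1,1, 1,0,1,1, 1,1,1, 1,0,0,1,0,1]` and a bin size of 3, `define_bins` yields four bins with physical sizes 3, 4, 3, and 6, which matches the sizes quoted for B1 through B4 in the example above.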
Cleaning
The Isaac CNV Caller cleaning comprises the following 3 procedures that remove
outliers and systematic biases from the count data computed during binning.
1 Single point outlier removal.
2 Physical size outlier removal.
3 GC content correction.
These procedures are performed on the bins produced during the Isaac CNV Caller
binning process.
Single Point Outlier Removal
This step removes individual bins that represent extreme outliers. These bins have
counts that are very different from the counts present in upstream and downstream bins.
Two values, a and b, are defined as very different when their difference is greater
than expected by chance, assuming a and b come from the same underlying distribution.
The difference is assessed with a Chi-squared statistic, as follows:
$\mu = 0.5a + 0.5b$
$\chi^2 = \frac{(a - \mu)^2 + (b - \mu)^2}{\mu}$
A value of $\chi^2$ greater than 6.635, which is the 99th percentile of the Chi-squared
distribution with 1 degree of freedom, is considered very different. If a bin count is very
different from the count of both upstream and downstream neighbors, then the bin is
deemed an outlier and removed.
Physical Size Outlier Removal
Bins generally do not have the same physical (genomic) size. The average for whole-genome
sequencing runs might be approximately 1 kb. If the bins cover repetitive regions of the
genome, some bins might be several megabases in size. Example regions include
centromeres and telomeres. The counts in these regions tend to be unreliable, so
bins with extreme physical size are removed. Specifically, the 98th percentile of observed
physical sizes is calculated and bins with sizes larger than this threshold are removed.
GC Content Correction
The main source of variability in bin counts is GC content. An example of the bias is
represented in the following figure.
Figure 3 GC Bias Example
The following correction is performed:
1 Bins are first aggregated according to GC content, which is rounded to the nearest
integer.
2 Second, each bin count is divided by the median count of bins having the same GC
content.
3 Finally, this value is multiplied by the desired average count per bin (100 by default)
and rounded to the nearest integer. The effect is to flatten the midpoints of the bars in
the example box-and-whisker plot.
Some values of GC content have few bins, so the estimate of their median is not robust.
Therefore, bins are discarded when the number of bins having the same GC content is
fewer than 100.
For some sample preparation schemes, GC content correction has a dramatic effect. The
following figure illustrates the effect of GC content correction for a low-depth sequencing
experiment using the Nextera library preparation method. The figure on the left shows
bin counts as a function of chromosome position before normalization. The figure on
the right shows the result after GC content correction.
For whole-genome sequencing experiments, the median absolute deviation (MAD) of the
corrected counts is typically 10.3, which is close to the expected value of 10. The expected
value is predicted using the Poisson model for an average count of 100 and indicates that
little bias remains following GC content correction.
It is important to note that the normalization does not dampen signal from CNVs,
as shown in the following 2 figures. The figure on the left shows a chromosome known
to harbor a single copy gain. The figure on the right shows a chromosome known to
harbor a double copy gain.
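The three cleaning procedures described in this section can be sketched in Python as follows. This is a minimal illustration, not the Isaac CNV Caller code. The function names are hypothetical, the first and last bins are never flagged as single point outliers because they lack one neighbor (an assumption), and the 98th percentile uses a nearest-rank definition (also an assumption).

```python
import math
from collections import defaultdict
from statistics import median

CHI2_99TH_PERCENTILE_1DF = 6.635  # threshold quoted in the text


def very_different(a, b):
    """Chi-squared comparison of two counts assumed to share one distribution."""
    mu = 0.5 * a + 0.5 * b
    if mu == 0:
        return False  # two empty bins cannot differ
    chi2 = ((a - mu) ** 2 + (b - mu) ** 2) / mu
    return chi2 > CHI2_99TH_PERCENTILE_1DF


def single_point_outliers(counts):
    """Indices of bins that differ from both the upstream and downstream bin."""
    return [
        i for i in range(1, len(counts) - 1)
        if very_different(counts[i], counts[i - 1])
        and very_different(counts[i], counts[i + 1])
    ]


def physical_size_outliers(sizes, percentile=98):
    """Indices of bins larger than the percentile threshold (nearest-rank, assumed)."""
    if not sizes:
        return []
    ranked = sorted(sizes)
    rank = max(1, math.ceil(percentile * len(ranked) / 100))
    threshold = ranked[rank - 1]
    return [i for i, size in enumerate(sizes) if size > threshold]


def gc_correct(counts, gc_percent, target=100, min_bins=100):
    """GC-normalized counts; None marks bins discarded for sparse GC values."""
    by_gc = defaultdict(list)
    for count, gc in zip(counts, gc_percent):
        by_gc[round(gc)].append(count)
    corrected = []
    for count, gc in zip(counts, gc_percent):
        group = by_gc[round(gc)]
        group_median = median(group)
        if len(group) < min_bins or group_median == 0:
            corrected.append(None)
        else:
            corrected.append(round(target * count / group_median))
    return corrected
```

As a worked check of the statistic, counts of 100 and 150 give a mean of 125 and a Chi-squared value of (625 + 625) / 125 = 10, which exceeds 6.635, while counts of 100 and 120 give about 1.8 and are not considered very different.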
Partitioning
The Isaac CNV Caller partitioning implements an algorithm for identifying regions of the
genome such that their average counts are statistically different than average counts of
neighboring regions. The implementation is a port of the circular binary segmentation
(CBS) algorithm.
Briefly, the algorithm initially considers each chromosome as a segment. The algorithm
assesses each segment and identifies the pair of bins for which the counts in the bins
between them are maximally different from the counts of the rest of the bins. The statistical
significance of the maximal difference is assessed via permutation testing. If the
difference is statistically significant, then the procedure is applied recursively to the 2 or
3 segments created by partitioning the current segment by the identified pair of points.
Input to the algorithm is the output generated by the Isaac CNV Caller cleaning
algorithm.
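The recursive procedure described above can be sketched in Python as follows. This is a simplified illustration of circular binary segmentation, not the Isaac CNV Caller port. The scaled mean-difference statistic, the number of permutations, the significance level, and the minimum segment length are all assumptions chosen for the example.

```python
import random


def best_split(counts):
    """Return (i, j, stat) for the interval counts[i:j] that differs most from the rest."""
    n = len(counts)
    prefix = [0]
    for c in counts:
        prefix.append(prefix[-1] + c)
    total = prefix[-1]
    best = (0, n, 0.0)
    for i in range(n):                      # exhaustive search over all pairs
        for j in range(i + 1, n + 1):
            k = j - i
            if k == n:
                continue                    # the whole segment is not a split
            inside = prefix[j] - prefix[i]
            diff = inside / k - (total - inside) / (n - k)
            stat = abs(diff) / (1 / k + 1 / (n - k)) ** 0.5
            if stat > best[2]:
                best = (i, j, stat)
    return best


def segment(counts, offset=0, n_perm=100, alpha=0.01, rng=None):
    """Recursively partition counts; returns half-open (start, stop) bin ranges."""
    rng = rng or random.Random(0)
    n = len(counts)
    if n < 4:                               # assumed minimum length for testing
        return [(offset, offset + n)]
    i, j, stat = best_split(counts)
    shuffled = list(counts)
    exceed = 0
    for _ in range(n_perm):                 # permutation test of the maximal statistic
        rng.shuffle(shuffled)
        if best_split(shuffled)[2] >= stat:
            exceed += 1
    if exceed / n_perm > alpha:
        return [(offset, offset + n)]       # not significant: keep as one segment
    pieces = []
    for lo, hi in ((0, i), (i, j), (j, n)): # 2 or 3 non-empty subsegments
        if hi > lo:
            pieces.extend(segment(counts[lo:hi], offset + lo, n_perm, alpha, rng))
    return pieces
```

The exhaustive search over all pairs of bins in `best_split` is quadratic in the number of bins, which is why a direct implementation like this one is practical only for short segments.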
Because the computational complexity of the algorithm is $O(N^2)$, in practice the problem
is divided into subchromosome problems followed by merging. Heuristics are used to
speed up the permutation testing.
Calling
The final module of the Isaac CNV Caller algorithm assigns discrete copy numbers
to each of the regions identified by the Isaac CNV Caller partitioner.
A Gaussian model is used as the default calling method. In this case, both the mean and
standard deviation are estimated from the data for the diploid model and adjusted for
the other copy number models. For example, if the mean, µ, and standard deviation, σ,
are estimated to be 100 and 15 in the diploid model, then corresponding estimates in the
haploid model would be µ/2 and σ/2. The mean and standard deviation are estimated
using the autosomal median and MAD of counts. This model is the default as it is more
appropriate in cases where the spread of counts is higher than expected from the Poisson
model due to unaccounted sources of variability. An example of this case is single cell
sequencing experiments where whole-genome amplification is required.
Following assignment of copy number states, neighboring regions that received the same
copy number call are merged into a single region.
Phred-scaled Q-scores are assigned to each region using a simple logistic function
derived using arrayCGH data as the gold standard. The probability of a miscall is
modeled as
$p = 1 - \frac{1}{1 + e^{0.5532 - 0.147N}}$
where N is the number of bins found within the nondiploid region. This probability is
converted to a Q-score by
$q = -10 \log_{10} p$
This estimate is likely conservative as it is derived from arrayCGH. Importantly,
Q-scores are a function of number of bins, not genomic size, so they are applicable to
experiments of any sequencing depth, including low-depth cytogenetics screening.
The coordinates of nondiploid regions and their Q-scores are output to a VCF file. A
variant must satisfy 2 conditions to receive the PASS filter. First, the variant must have a
Q-score of Q10 or greater. Second, the variant must be 10 kb or larger in size.
Isaac Structural Variant Caller
Isaac Structural Variant (SV) Caller is a structural variant caller for short sequencing
reads. It can discover structural variants of any size and score these variants using both
a diploid genotype model and a somatic model (when separate tumor and normal
samples are specified). Structural variant discovery and scoring incorporate both paired
read fragment spanning and split read evidence.
Method Overview
Isaac SV Caller works by dividing the structural variant discovery process into 2
primary steps: scanning the genome to find SV-associated regions, and analysis, scoring,
and output of SVs found in such regions.
1 Build SV association graph
In this step, the entire genome is scanned to discover evidence of possible SVs and
large indels. This evidence is enumerated into a graph with edges connecting all
regions of the genome that have a possible SV association. Edges can connect 2
different regions of the genome to represent evidence of a long-range association, or
an edge can connect a region to itself to capture a local indel/small SV association.
These associations are more general than a specific SV hypothesis, in that many SV
candidates can be found on 1 edge, although typically only 1 or 2 candidates are
found per edge.
2 Analyze graph edges to find SVs
The second step is to analyze individual graph edges or groups of highly connected
edges to discover and score SVs associated with the edges. The substeps of this
process include:
• Inference of SV candidates associated with the edge.
• Attempted assembly of the SV break-ends.
• Scoring and filtration of the SV under various biological models (currently
diploid germline and somatic).
• Output to VCF.
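The association graph built in step 1 can be pictured with the following Python sketch. It is a conceptual illustration only. The guide does not describe how Isaac SV Caller defines regions, what counts as evidence, or how much evidence an edge needs, so the fixed region size, the evidence format, and the `min_evidence` cutoff are all hypothetical.

```python
from collections import Counter

REGION_SIZE = 1000  # hypothetical fixed region width in bases


def region_of(chrom, pos):
    """Map a genomic position to a coarse region identifier."""
    return (chrom, pos // REGION_SIZE)


def build_association_graph(evidence):
    """Count evidence per edge.

    evidence: iterable of ((chrom, pos), (chrom, pos)) pairs, for example the
    two ends of an anomalous read pair. An edge whose two ends are the same
    region is a self-edge and represents a local indel/small SV association.
    """
    edges = Counter()
    for end_a, end_b in evidence:
        edge = tuple(sorted((region_of(*end_a), region_of(*end_b))))
        edges[edge] += 1
    return edges


def supported_edges(edges, min_evidence=2):
    """Edges with enough evidence to be analyzed in step 2 (assumed cutoff)."""
    return {edge: n for edge, n in edges.items() if n >= min_evidence}
```

In this picture, step 2 would iterate over the supported edges and attempt candidate inference, assembly, and scoring for each one.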
Capabilities
Isaac SV Caller can detect all structural variant types that are identifiable in the absence
of copy number analysis and large scale de novo assembly. Detectable types are
enumerated in this section.
For each structural variant and indel, Isaac SV Caller attempts to align the break-ends to
base pair resolution and report the left-shifted break-end coordinate (per the VCF 4.1 SV
reporting guidelines). Isaac SV Caller also reports any break-end microhomology
sequence and inserted sequence between the break-ends. Often the assembly fails to
provide a confident explanation of the data. In such cases, the variant is reported as
IMPRECISE, and scored according to the paired-end read evidence alone.
The sequencing reads provided as input to Isaac SV Caller are expected to be from a
paired-end sequencing assay that results in an inwards orientation between the 2 reads
of each DNA fragment. Each read presents a read from the outer edge of the fragment
insert inward.
Detected Variant Classes
Isaac SV Caller is able to detect all variation classes that can be explained as novel DNA
adjacencies in the genome. Simple insertion/deletion events can be detected down to a
configurable minimum size cutoff (defaulting to 51). All DNA adjacencies are classified
into the following categories based on the break-end pattern:
} Deletions
} Insertions
} Inversions
} Tandem Duplications
} Interchromosomal Translocations
Known Limitations
Isaac SV Caller cannot detect the following variant types:
} Nontandem repeats/amplifications
} Large insertions: The maximum detectable size corresponds to approximately the
read-pair fragment size, but note that detection power falls off to impractical levels
well before this size.
} Small inversions: The limiting size is not tested, but in theory detection falls off
below ~200 bases. So-called microinversions might be detected indirectly as
combined insertion/deletion variants.
More general repeat-based limitations exist for all variant types:
} Power to assemble variants to break-end resolution falls to 0 as break-end repeat
length approaches the read size.
} Power to detect any break-end falls to (nearly) 0 as the break-end repeat length
approaches the fragment size.
} The method cannot detect nontandem repeats.
While Isaac SV Caller classifies novel DNA adjacencies, it does not infer the higher level
constructs implied by the classification. For instance, a variant marked as a deletion by
Isaac SV Caller indicates an intrachromosomal translocation with a deletion-like
break-end pattern. However, there is no test of depth, b-allele frequency, or intersecting
adjacencies to infer the SV type directly.
Methods for Tumor-Normal Workflow
This section describes the underlying methodologies for the Tumor-Normal analysis
algorithm in HiSeq Analysis Software v2.0.
Introduction
The somatic variant calling pipeline uses 2 aligned sequence files (*.bam files) as inputs–
a normal *.bam and a tumor *.bam. These *.bam files are then processed through 3
interconnected callers:
} Isaac Somatic Variant Caller
} Isaac Structural Variant Caller
} Copy Number Aberration Caller (SENECA).
Isaac Somatic Variant Caller and SENECA are described in the following sections. For
information on the Isaac Structural Variant Caller, see the Isaac Structural Variant Caller
section.
During the first stage of the pipeline, the tumor and normal *.bam files run through a
combined indel realignment operation. The output of this realignment is used as the input
for further processing. During calling, putative calls and de novo reassembled sections of
sequence are passed between the callers to produce internally consistent variant calls. All
3 callers use statistical models that operate on the combined tumor and normal reads as
input, instead of on variants called separately from each sample. Combined calling
produces superior results compared with subtraction of variant calls. As a consequence,
subtracting the calls in the normal whole-genome results from the calls in the tumor
whole-genome results often does not reproduce the somatic calls from the combined
caller. For example, you can find a somatic variant that was not called in the tumor WGS
sample because the combined caller is operating on the reads.
Isaac Somatic Variant Caller
The Isaac Somatic Variant Caller detects somatic SNVs and indels in sequencing data
from a tumor and matched normal sample, based on the following assumptions:
} The normal sample is a mixture of diploid germline variation and noise.
} The tumor sample is a combination of the normal sample and somatic variation. It
is assumed that the somatic variation and the normal noise can occur at any allele
frequency ratio.
For SNVs, but not for indels, the normal noise component is further modeled as a
combination of single-strand and double-strand noise.
Figure 4 Isaac Somatic Variant Caller Method
NOTE
For a detailed overview of Isaac Somatic Variant Caller methods, go to
www.ncbi.nlm.nih.gov/pubmed/22581179.
Candidate Indel Search
The Isaac Somatic Variant Caller scans through the genome using sequence
alignments from the normal sample and tumor sample together to find a joint set of
candidate indels. The information in sequence alignments is supplemented with
externally generated candidate indels discovered by the Isaac Structural Variant (SV)
Caller. Isaac SV Caller provides external candidate indels to Isaac Somatic Variant Caller
for indels of size 50 and below.
Candidate indels are used for realignment of reads, during which each candidate indel is
evaluated as a potential somatic indel. Any other types of indels are considered noise
indels. If a better alignment is not found, these indels are allowed to remain in the read
alignments; otherwise, they are not used.
The candidate indel thresholds are designed so that the joint candidate indel set is at
least the combined set found if the Isaac Variant Caller is run on the individual samples.
Specifically, where a minimum number of nominating reads is required for candidacy in
Isaac Variant Caller, Isaac Somatic Variant Caller requires the same minimum number
of nominating reads from the combined input. Isaac Somatic Variant Caller requires that
at least one sample contains a minimum fraction of supporting reads among the sample
reads for candidacy.
Realignment
For every read that intersects a candidate indel, the Isaac Somatic Variant Caller
attempts to find the most probable alignments including the candidate indel and
excluding the candidate indel. Typically, the alignment excluding the candidate indel
aligns to the reference, but occasionally an alternate indel that overlaps or interferes with
the candidate is found to be more likely. The indel caller uses the probabilities of both
alignments as part of the indel quality score calculation, whereas only a single alignment
(usually the most probable) is preserved for SNV calling.
Somatic Caller
The Isaac Somatic Variant Caller uses a Bayesian probability model similar to the one
used for germline variant calling in the Isaac Variant Caller or in external tools such as
GATK. Using this model, the objective is to compute the posterior probability P(θ | D),
which is the probability of the model state θ conditioned on the observed sequencing
data D.
In a germline variant caller, the state space of the model is conventionally a discrete set
of diploid genotypes. For SNVs, the set of possible states is
G = {AA, CC, GG, TT, AC, AG, AT, CG, CT, GT}.
The Isaac Somatic Variant Caller model instead approximates continuous allele
frequencies for each allele:
f={f_A, f_C, f_G, f_T}
The allele frequencies are restricted to allow a maximum of 2 nonzero frequencies. Any
additional alleles observed in the data are treated as noise.
Another departure from typical germline calling methods is that the state space of the
model is the allele frequency of both the tumor and the normal sample:
θ=(f_t, f_n)
In the equation above, f_t and f_n represent the allele frequencies of the tumor and
normal samples, respectively.
The final somatic variant quality value reported by the model is computed from the
probability that the allele frequencies are unequal (that is, f_t ≠ f_n) given the observed
sequence data.
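The idea of a joint state space θ = (f_t, f_n) can be illustrated with the following toy Python calculation. It is not the Isaac Somatic Variant Caller model. The frequency grid, the binomial likelihood with a fixed error rate, and the prior probability of a somatic state are all assumptions introduced only to show how a Phred-scaled quality can be derived from the posterior probability that f_t ≠ f_n.

```python
from math import comb, log10


def binomial_pmf(k, n, p):
    """Probability of k successes in n trials with success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def toy_somatic_quality(alt_t, depth_t, alt_n, depth_n,
                        steps=20, prior_somatic=1e-3, error=0.01):
    """Phred-scaled P(f_t == f_n | D) on a discrete grid (toy model, assumed)."""
    freqs = [i / steps for i in range(steps + 1)]

    def likelihood(alt, depth, f):
        # Assumed read model: alt allele at frequency f plus symmetric error.
        p_alt = f * (1 - error) + (1 - f) * error
        return binomial_pmf(alt, depth, p_alt)

    n_equal = len(freqs)
    n_unequal = len(freqs) * (len(freqs) - 1)
    equal = 0.0
    unequal = 0.0
    for f_t in freqs:
        lik_t = likelihood(alt_t, depth_t, f_t)
        for f_n in freqs:
            joint = lik_t * likelihood(alt_n, depth_n, f_n)
            if f_t == f_n:
                equal += joint * (1 - prior_somatic) / n_equal
            else:
                unequal += joint * prior_somatic / n_unequal
    p_not_somatic = equal / (equal + unequal)
    return -10 * log10(max(p_not_somatic, 1e-300))
```

In this toy model, a site with 15 of 50 alternate reads in the tumor and 0 of 50 in the normal scores higher than a site with no alternate reads in either sample, and higher than a site with the same alternate fraction in both samples.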
Post-Call Filtration
Heuristic filters remove several types of improbable calls resulting from data artifacts
that cannot be easily represented in the somatic probability model. These filters act as a
final step to separate out the final set of somatic calls reported by Isaac Somatic Variant
Caller.
Input Data Filtration
Isaac Somatic Variant Caller uses 2 tiers of input data filtration during somatic small
variant calling:
} Tier 1—A more stringent filtering to ensure high quality calls
} Tier 2—A lower filtration stringency
Initially, candidates are called using a subset of the data with more stringent tier 1
filtering. If the method produces a nonzero quality score for any SNV or indel, the
potential somatic variant is called again using data with the lower tier 2 stringency. The
lower quality from the 2 tiers is selected for output. However, if the tier 2 quality is 0, the
call is eliminated.
For somatic SNVs and indels, Isaac Somatic Variant Caller produces a general somatic
quality score, Q(ssnv) or Q(somatic indel), which indicates the probability of the
somatic variant. It also produces a joint quality score for the somatic variant and a
specific normal genotype, Q(ssnv+ntype) or Q(somatic indel+ntype). The 2 tier
evaluation is applied to each of these qualities separately, as follows:
Q(ssnv) = min(Q(ssnv|tier1), Q(ssnv|tier2))
Q(ssnv+ntype) = min(Q(ssnv+ntype|tier1), Q(ssnv+ntype|tier2))
The tier used for each quality value is provided in the Isaac Somatic Variant Caller
output record for each somatic variant. If the most likely normal genotype is not the
same at tier 1 and tier 2, then the normal genotype is reported as a conflict in the output.
Using 2 data tiers enables an initial somatic call based on high-quality data. Given a
potential call, re-evaluating it with tier 2 data allows the call to be downgraded or
eliminated when lower quality data show support for the putative somatic allele in the
normal sample. The following table lists the primary data filtration levels that are
changed between tier 1 and tier 2.
Table 7 Tiered Filtration Parameters
Parameter | Tier 1 Value | Tier 2 Value
Min paired-end alignment score | 20 | 0
Min single-end alignment score | 10 | 0
Single-end score rescue? | No | Yes
Include unanchored pairs? | No | Yes
Include anomalous pairs? | No | Yes
Include singleton pairs? | No | Yes
Mismatch density filter - max mismatches in window | 3 | 10
Additional Filtration
Additional filters are applied after the somatic caller completes. A single candidate
somatic call can be annotated with several filters.
Figure 5 Additional Filtration
Quality Filtration Levels
Only somatic calls originating from homozygous reference alleles in the normal sample
are reviewed for validation and included in the output.
} Somatic SNVs are reported if the normal genotype is equal to the reference
and Q(ssnv+ntype) ≥ 15.
} Somatic indels are reported if the normal genotype is equal to the reference
and Q(somatic indel+ntype) ≥ 30.
NOTE
The value Q(ssnv+ntype) is associated with the VCF key QSS_NT.
The value Q(somatic indel+ntype) is associated with the VCF key QSI_NT.
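The two-tier quality selection and the reporting thresholds described above can be sketched in Python as follows. The function names and return conventions are hypothetical, and choosing tier 1 when both tiers give the same quality is an assumption.

```python
MIN_QSS_NT = 15   # reporting threshold for somatic SNVs
MIN_QSI_NT = 30   # reporting threshold for somatic indels


def tiered_quality(q_tier1, q_tier2):
    """Return (quality, tier) using the lower of the two tiers, or None if eliminated."""
    if q_tier1 == 0:
        return None            # no candidate: tier 2 is never evaluated
    if q_tier2 == 0:
        return None            # call is eliminated when the tier 2 quality is 0
    if q_tier1 <= q_tier2:     # assumption: ties are attributed to tier 1
        return (q_tier1, 1)
    return (q_tier2, 2)


def is_reported(variant_kind, normal_is_reference, q_with_normal_type):
    """Apply the quality filtration levels to a somatic SNV ("snv") or indel ("indel")."""
    threshold = MIN_QSS_NT if variant_kind == "snv" else MIN_QSI_NT
    return normal_is_reference and q_with_normal_type >= threshold
```

For example, tier qualities of 40 and 25 yield a reported quality of 25 attributed to tier 2, and an indel with a joint quality of 29 is not reported even when the normal genotype matches the reference.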
Copy Number Aberrations (SENECA)
The copy number aberrations module is also referred to as SENECA (SEnsitive detection
of copy NumbErs in CAncer). It identifies copy number aberrations (CNAs) in
heterogeneous tumor samples that exhibit contamination with normal tissues,
aneuploidy, and loss of heterozygosity (LOH) that can confound correct copy assignment
and lead to erroneous CNA calls.
The algorithm workflow comprises 2 distinct steps:
} Segmentation of data into regions with putatively distinct copy numbers.
} Calculation of ploidy and purity with a final copy number assignment.
As input, SENECA uses aligned sequences from tumor and matched normal samples (in
*.bam format) and annotation information about the location of known variants in
dbSNP, regional alignability, and the location of gaps in dbSNP.
Segmentation
SENECA is a count-based method to assign copy number state. It compares coverage
between tumor and normal samples. Specifically, it bins read coverage using
nonoverlapping 1 kb windows to derive counts in tumor and normal samples, and it
then takes the ratio of the 2 counts. Bins are skipped during segmentation when they
overlap low alignability regions in more than 20% of their size.
Independently, SENECA calculates B allele ratios at dbSNP positions from a tumor BAM
file, and it keeps only SNVs that are heterozygous in the corresponding normal sample.
Segmentation is carried out independently for copy number and B allele ratios.
Ploidy and Purity Calculation
Following segmentation, SENECA performs ploidy and purity calculations. These
calculations are based on the principle that for each value of ploidy and purity and a
selected copy number, the values of B allele and read count ratios can be inferred. For
example, for copy number state 1 (1 deleted allele of a diploid genome), the B allele ratio
is always near 0 because only 1 allele is present. However, if a tumor sample has only
70% purity because of the presence of the normal genome as background, the B
allele ratio increases due to the presence of a heterozygous normal allele. At 70% purity,
the final B allele ratio is 0.15.
SENECA fits a multivariate Gaussian distribution to copy data and B allele ratio data on
a two-dimensional grid of varying ploidy and purity. On the grid, each state encodes
ploidy and purity values. In addition, SENECA uses a separate state encoding copy
neutral LOH and copy gain LOH to identify loss-of-heterozygosity events.
Ploidy and purity associated with the model having highest log-likelihood are then used
to assign a copy number state to each segment. When both segments and copy numbers
are estimated, a quality score for copy number assignment is computed using a
likelihood ratio test. This test compares the likelihood of a current copy number
assignment to a likelihood of assigning 1 more or 1 fewer copy. Results of the likelihood
ratio test are then reported as a Q-score field in the VCF file using the following
transformation: 2*log (s1/s2), where s1 is a sum of squares for the selected model and s2
is a sum of squares for the next nearest model. A Q-score threshold of 1.5 provides a good
trade-off between sensitivity and specificity.
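Two of the quantities described for SENECA can be illustrated with the following Python sketch. It is not the SENECA implementation. The function names are hypothetical, returning None for skipped bins is an assumption, and the B allele calculation is one simple reading of the 70% purity example: a cell-fraction weighted average in which normal cells contribute a ratio of 0.5 and tumor cells with a single remaining allele contribute 0. A model that also weights by copy number would give a different value.

```python
MAX_LOW_ALIGNABILITY_FRACTION = 0.20  # bins above this fraction are skipped


def coverage_ratios(tumor_counts, normal_counts, low_alignability_fractions):
    """Tumor/normal count ratio per 1 kb bin; None for bins that are skipped."""
    ratios = []
    for tumor, normal, low_fraction in zip(
        tumor_counts, normal_counts, low_alignability_fractions
    ):
        if low_fraction > MAX_LOW_ALIGNABILITY_FRACTION or normal == 0:
            ratios.append(None)   # assumption: empty normal bins are also skipped
        else:
            ratios.append(tumor / normal)
    return ratios


def expected_b_allele_ratio_single_copy(purity):
    """Expected B allele ratio for copy number state 1 at a given tumor purity.

    Assumed model: cell-fraction weighted average of 0.5 (normal cells)
    and 0 (tumor cells with 1 remaining allele).
    """
    return 0.5 * (1 - purity)
```

With a purity of 0.7, `expected_b_allele_ratio_single_copy` returns approximately 0.15, which agrees with the value given in the example above.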
Technical Assistance
For technical assistance, contact Illumina Technical Support.
Table 8 Illumina General Contact Information
Website | www.illumina.com
Email | [email protected]
Table 9 Illumina Customer Support Telephone Numbers
Region | Contact Number
North America | 1.800.809.4566
Australia | 1.800.775.688
Austria | 0800.296575
Belgium | 0800.81102
Denmark | 80882346
Finland | 0800.918363
France | 0800.911850
Germany | 0800.180.8994
Ireland | 1.800.812949
Italy | 800.874909
Netherlands | 0800.0223859
New Zealand | 0800.451.650
Norway | 800.16836
Spain | 900.812168
Sweden | 020790181
Switzerland | 0800.563118
United Kingdom | 0800.917.0041
Other countries | +44.1799.534000
Safety Data Sheets
Safety data sheets (SDSs) are available on the Illumina website at
support.illumina.com/sds.html.
Illumina
San Diego, California 92122 U.S.A.
+1.800.809.ILMN (4566)
+1.858.202.4566 (outside North America)
[email protected]
www.illumina.com