HiSeq Analysis Software v2.0 User Guide

For Research Use Only. Not for use in diagnostic procedures.

Revision History
Introduction
Computing Requirements
Software Dependencies
Install HiSeq Analysis Software v2.0 RPM
Validate HiSeq Analysis Software v2.0
GenerateFASTQ Workflow
Resequencing Analysis Workflow
Tumor-Normal Analysis Workflow
Input Files
Output Files
Methods for Resequencing and Alignment Analysis Workflows
Methods for Tumor-Normal Workflow
Technical Assistance

ILLUMINA PROPRIETARY
Part # 15070536 Rev. A
May 2015

This document and its contents are proprietary to Illumina, Inc. and its affiliates ("Illumina"), and are intended solely for the contractual use of its customer in connection with the use of the product(s) described herein and for no other purpose. This document and its contents shall not be used or distributed for any other purpose and/or otherwise communicated, disclosed, or reproduced in any way whatsoever without the prior written consent of Illumina. Illumina does not convey any license under its patent, trademark, copyright, or common-law rights nor similar rights of any third parties by this document.

The instructions in this document must be strictly and explicitly followed by qualified and properly trained personnel in order to ensure the proper and safe use of the product(s) described herein. All of the contents of this document must be fully read and understood prior to using such product(s).

FAILURE TO COMPLETELY READ AND EXPLICITLY FOLLOW ALL OF THE INSTRUCTIONS CONTAINED HEREIN MAY RESULT IN DAMAGE TO THE PRODUCT(S), INJURY TO PERSONS, INCLUDING TO USERS OR OTHERS, AND DAMAGE TO OTHER PROPERTY.

ILLUMINA DOES NOT ASSUME ANY LIABILITY ARISING OUT OF THE IMPROPER USE OF THE PRODUCT(S) DESCRIBED HEREIN (INCLUDING PARTS THEREOF OR SOFTWARE).

© 2015 Illumina, Inc. All rights reserved.
Illumina, 24sure, BaseSpace, BeadArray, BlueFish, BlueFuse, BlueGnome, cBot, CSPro, CytoChip, DesignStudio, Epicentre, GAIIx, Genetic Energy, Genome Analyzer, GenomeStudio, GoldenGate, HiScan, HiSeq, HiSeq X, Infinium, iScan, iSelect, ForenSeq, MiSeq, MiSeqDx, MiSeq FGx, NeoPrep, Nextera, NextBio, NextSeq, Powered by Illumina, SeqMonitor, SureMDA, TruGenome, TruSeq, TruSight, Understand Your Genome, UYG, VeraCode, verifi, VeriSeq, the pumpkin orange color, and the streaming bases design are trademarks of Illumina, Inc. and/or its affiliate(s) in the U.S. and/or other countries. All other names, logos, and other trademarks are the property of their respective owners.

Revision History

Part # | Revision | Date | Description of Change
15070536 | A | May 2015 | Initial release

Introduction

HiSeq Analysis Software v2.0 is a software package for analyzing sequencing data generated by Illumina HiSeq X sequencing systems. The software leverages a suite of proven algorithms to detect genomic variants comprehensively and accurately. It is a complete package that detects a range of variant types, including single nucleotide variants (SNVs), indels, structural variants (SVs), and copy number variants (CNVs), for tumor and normal samples.

HiSeq Analysis Software v2.0 uses the base calls and quality scores generated by Real-Time Analysis (RTA) software during the sequencing run to analyze data rapidly for high-throughput whole-genome sequencing analysis.

HiSeq Analysis Software v2.0 leverages the Isaac Aligner and Isaac Variant Caller to provide both aligned and unaligned reads and variants. See Isaac Aligner on page 28 and Isaac Variant Caller on page 30 for more information.

For structural variants, HiSeq Analysis Software v2.0 uses 2 complementary approaches:
• Read depth analysis by Isaac Copy Number Variant Caller (Canvas). See Isaac Copy Number Variant Caller on page 36.
• Discordant paired-end analysis by Isaac Structural Variant Caller (Manta). See Isaac Structural Variant Caller on page 40.

HiSeq Analysis Software v2.0 supports a multistep analysis workflow that runs from a simple command line using recommended default settings, which are specified in the sample sheet.

This document provides an overview of the HiSeq Analysis Software v2.0 workflow and data structure, computing requirements, and installation guidelines. The aim of this document is to help you understand the details and use of HiSeq Analysis Software v2.0 for high-throughput whole-genome sequencing analysis.

Computing Requirements

Operating System

HiSeq Analysis Software v2.0 requires the CentOS 6 operating system with the standard set of included libraries, which are indicated in the RPM installer. HiSeq Analysis Software v2.0 also requires the R statistical package, which is not specified in the RPM. For R statistical package installation procedures and a complete list of dependencies, see Software Dependencies on page 6.

Compute Hardware

HiSeq Analysis Software v2.0 requires specific compute architecture for proper functionality and performance. The hardware required is available from a range of suppliers and manufacturers. For a HiSeq X Ten system, the following architecture is recommended for analysis. Archival storage is not included in this recommendation.
• Resilient NAS solution serving CIFS and NFS
  - A minimum of 100 TB usable storage capacity with performance > 2 GB/s. To retain data after analysis, additional capacity is needed.
  - Multiple 10 Gb or 40 Gb network links
• Resilient 10 Gb network
• 26 servers, each with the following:
  - Dual 10-core CPUs @ 2.8 GHz
  - 128 GB 1867 MHz memory
  - 6 TB usable local storage capacity with performance > 500 MB/s
  - 10 Gb Ethernet adapter
Software Dependencies

The following system dependencies are required to operate HiSeq Analysis Software v2.0.

R Statistical Package

Install the R statistical package using the following command:

    sudo yum install R

If yum is unable to find the R package, first install the EPEL repository using these command lines:

    wget http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm
    sudo rpm -ivh epel-release-6-8.noarch.rpm

CentOS v6 Standard Libraries

The following packages are indicated in the rpm. If these packages are not present, they are automatically installed during software installation.

ConsoleKit, ConsoleKit-libs, GConf2, ImageMagick, ORBit2, OpenEXR-libs, atk, audit-libs, avahi-libs, basesystem, bash, bzip2-libs, ca-certificates, cairo, chkconfig, coreutils, coreutils-libs, cracklib, cracklib-dicts, cups-libs, curl, cyrus-sasl-lib, db4, db4-utils, dbus, dbus-glib, dbus-libs, eggdbus, elfutils-libelf, expat, file-libs, filesystem, fontconfig, freetype, gamin, gd, gdbm, gdk-pixbuf2, ghostscript, ghostscript-fonts, glib2, glibc, glibc-common, gmp, gnuplot, gnuplot-common, gnutls, grep, groff, gtk2, gzip, hicolor-icon-theme, ilmbase, info, jasper-libs, keyutils-libs, keyutils-libs-devel, krb5-devel, krb5-libs, lcms-libs, less, libICE, libIDL, libSM, libX11, libX11-common, libXau, libXcomposite, libXcursor, libXdamage, libXext, libXfixes, libXfont, libXft, libXi, libXinerama, libXpm, libXrandr, libXrender, libXt, libacl, libattr, libcap, libcap-ng, libcom_err, libcom_err-devel, libcroco, libcurl, libffi, libfontenc, libgcc, libgcrypt, libgomp, libgpg-error, libgsf, libidn, libjpeg-turbo, libpng, librsvg2, libselinux, libselinux-devel, libsepol, libsepol-devel, libssh2, libstdc++, libtasn1, libthai, libtiff, libtool-ltdl, libuuid, libwmf-lite, libxcb, libxml2, libxslt, lua, make, ncurses, ncurses-base, ncurses-libs, nspr, nss, nss-softokn, nss-softokn-freebl, nss-sysinit, nss-tools, nss-util, openldap, openssl, openssl-devel, p11-kit, p11-kit-trust, pam, pango, pcre, pixman, pkgconfig, polkit, popt, psmisc, python, python-libs, readline, rpm, rpm-libs, sed, setup, sgml-common, shadow-utils, shared-mime-info, sqlite, tzdata, urw-fonts, xorg-x11-font-utils, xz-libs, zlib, zlib-devel

Install HiSeq Analysis Software v2.0 RPM

NOTE: You need root privileges to perform the installation and to unpack the reference genome.

NOTE: Hard drives containing the HiSeq Analysis Software v2.0 and genome installers use the NTFS file system.

1. Copy the rpm and tarball installers to your system.
   • HiSeqAnalysisSoftwareX-2.5.55.1311.HAS-1.x86-64.rpm (1.1 GB)
   • HASv2_2.5.55.13XX_Homo_sapiens_UCSC_hg19.tar.gz (32 GB)

2. Install the base analysis software (2.8 GB) using the RPM package.
   • Use the following install command:

         sudo yum install HiSeqAnalysisSoftwarev2-2.5.55.1311.HAS-1.x86_64.rpm

   • To install to a different location, use this command:

         sudo rpm -i --prefix /install/path HiSeqAnalysisSoftwarev2-2.5.55.1311.HAS-1.x86_64.rpm

     When using the --prefix option, the package is placed inside the illumina directory within your custom directory.

3. Install the hg19 reference genome (108 GB). Installation takes approximately 45 minutes. Make sure that the HiSeq Analysis Software v2.0 RPM is installed before installing the reference genome files.
   • Use the following install command:

         sudo /opt/illumina/HiSeqAnalysisSoftwareX/unpackIsisReference.py --input-file=/path/to/HASv2_2.5.55.13XX_Homo_sapiens_UCSC_hg19.tar.gz

   • To install to a different location, use this command:

         sudo /opt/illumina/HiSeqAnalysisSoftwareX/unpackIsisReference.py --input-file=/path/to/HASv2_2.5.55.13XX_Homo_sapiens_UCSC_hg19.tar.gz --genomes-root=/genomes/root/genomes_drive --jobs=16

NOTE: The --genomes-root section of the command refers to the location for the genome installation. Make sure that the directory exists before executing the command.
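The reference genome step can be scripted so that the directory check from the note above happens before the unpack command is launched. The following is a minimal Python sketch, not part of the software: the helper name build_unpack_command is hypothetical, the script path is the default location shown in this guide, and only the --input-file, --genomes-root, and --jobs options shown above are used.

```python
from pathlib import Path

# Default script location from the install instructions (an assumption if the
# RPM was installed elsewhere with --prefix; adjust accordingly).
UNPACK_SCRIPT = "/opt/illumina/HiSeqAnalysisSoftwareX/unpackIsisReference.py"


def build_unpack_command(tarball, genomes_root=None, jobs=None, script=UNPACK_SCRIPT):
    """Return the argument list for unpacking the reference genome tarball.

    Raises FileNotFoundError when genomes_root is given but does not exist,
    because the guide requires the directory to exist before the command runs.
    """
    cmd = ["sudo", script, f"--input-file={tarball}"]
    if genomes_root is not None:
        if not Path(genomes_root).is_dir():
            raise FileNotFoundError(f"genomes root does not exist: {genomes_root}")
        cmd.append(f"--genomes-root={genomes_root}")
    if jobs is not None:
        # --jobs is passed through exactly as in the guide's example (--jobs=16).
        cmd.append(f"--jobs={int(jobs)}")
    return cmd
```

The returned list can be passed to subprocess.run, or joined with spaces and printed for review before running it by hand.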
Additional Information

• The RPM installs the prerequisite software to run HiSeq Analysis Software v2.0 (eg, Java and Mono).
• If there are missing dependencies when installing the software using the rpm -i --prefix command, use the yum command to install the missing dependencies, and then retry installation.
• The RPM is unsigned.
• If there are problems with running and executing the program, change the ownership of the installation directory to a group containing all users.
• To uninstall the RPM, use the following command:

      sudo yum remove HiSeqAnalysisSoftwareX.x86_64

• When running HiSeq Analysis Software on a cluster, create a script with the necessary command lines and schedule the script on a node. See your scheduler documentation for the correct command. Make sure that the node is dedicated to the script, because HiSeq Analysis Software uses all the RAM and cores on the node while running.

Validate HiSeq Analysis Software v2.0

To validate the installation, run the following script. Update the path if the software is installed to a different location.

    /opt/illumina/scratch/InstallValidationData/ValidateInstall

A successful validation prints the following output lines:

    Starting Isaac validation run...
    Isaac run passed validation
    Log file written to /opt/illumina/scratch/InstallValidationData/ValidateInstall.log

An unsuccessful validation prints the following output line:

    Isaac run failed

GenerateFASTQ Workflow

Run folders of base call and quality score data in .bcl.gz file format are first processed with the GenerateFASTQ workflow, which demultiplexes the data and converts it to FASTQ files organized under a per-sample directory structure.

Create a SampleSheet.csv file for the GenerateFASTQ workflow step. For each step, a new sample sheet file is automatically generated that specifies the settings to be applied in the subsequent steps. For more information, see Sample Sheet Settings on page 24.

Run GenerateFASTQ

1. Identify a run folder for the analysis, which can be the original run folder for the flow cell.

2. Create a sample sheet named SampleSheet.csv in the top level of the run folder. For more information, see Sample Sheet Settings on page 24.

3. Start the GenerateFASTQ workflow for each flow cell using the following command. Specify absolute paths. Relative paths are not supported. If your HiSeq Analysis Software v2.0 executable is not installed in the default location, use your system location instead of the /opt location shown below.

       /opt/illumina/HiSeqAnalysisSoftwareX/latest/HAS GenerateFastq -r /path/to/your/RunFolder -a /path/to/analysis/folder -f /path/to/fastq/folder

If you log out of your terminal session, the HiSeq Analysis Software v2.0 command can be prematurely terminated. To keep the analysis running, run it under screen or prefix the command with nohup. The log output is captured in the WorkflowError.txt and WorkflowLog.txt files. Alternatively, standard output and standard error can be redirected to a file, as shown below.

    nohup /opt/illumina/HiSeqAnalysisSoftwareX/latest/HAS GenerateFastq -r /path/to/your/RunFolder -a /path/to/analysis/folder -f /path/to/fastq/folder > logfile.txt 2>&1 &

Use these command line options to customize the GenerateFASTQ run.

Option | Description
-a | The path to the root analysis folder. Use the same path for all analyses.
-r | The path to the run folder.
-f | The path to the FASTQ folder. Use the same root directory to store FASTQ files for all analyses.

Top-Off Samples

Topping-off allows you to merge samples to combine data from multiple lanes when there is insufficient yield for downstream analysis. To combine data from multiple run folders, use the project name and Sample_ID as a unique identifier for a sample.
If a project name and Sample_ID pair are reused in multiple GenerateFASTQ calls, the combined data are treated as a single sample in the Resequencing or Alignment workflows. Make sure to use the same analysis folder path (-a) for the runs. Do not proceed to the Resequencing or Alignment workflows until you have performed all the necessary GenerateFASTQ runs.

Demultiplexing

If the sample sheet defines multiple samples within the same lane, demultiplexing is performed to assign clusters to the samples. For runs with index reads, demultiplexing compares each Index Read sequence to the index sequences specified in the sample sheet. Demultiplexing separates data from pooled samples based on short index sequences that tag samples from different libraries.

Index reads are identified using the following steps:
• Samples are numbered starting from 1 based on the order they are listed in the sample sheet.
• Sample number 0 is reserved for clusters that were not successfully assigned to a sample.
• Clusters are assigned to a sample when the index sequence matches exactly or there is up to a single mismatch per Index Read.

NOTE: Illumina indexes are designed so that any index pair differs by ≥ 3 bases, allowing for a single mismatch in index recognition. Index sets that are not from Illumina can include pairs of indexes that differ by < 3 bases. In such cases, the software detects the insufficient difference and overrides the default index recognition (mismatch=1), performing demultiplexing using only perfect index matches (mismatch=0).

When demultiplexing is complete, the DemultiplexSummaryF1L1.txt file summarizes the results:
• In the file name, F1 represents the flow cell number.
• In the file name, L1 represents the lane number.
• Reports demultiplexing results in a table with one row per tile and one column per sample, including sample 0.
• Reports the most commonly occurring sequences for the index reads.

The demultiplexing summary file is written to the QC folder. For the folder location, see FASTQ Folder Structure on page 12.

If the index sequence specified in the sample sheet is shorter than the actual Index Read, then the first cycles of the Index Read are used for demultiplexing. If no index sequence is specified for Index or Index2, the corresponding first or second Index Read is not used for demultiplexing. Make sure that all samples in the same lane have a compatible index scheme. Use the same number of cycles for Index and Index2.

FASTQ Folder Structure

[User-Specified FASTQ Folder]
[Flow Cell ID]: Contains FASTQ files.
Delete_Bcl.cmdline: Shell script used to delete the BCL files that were used for the analysis.
samplename_S1_L001_R1_001.fastq.gz: FASTQ file.
SampleSheet.csv: Copied from Run Folder.
QC: Contains undetermined FASTQ files, demultiplexing summary, and AdapterCounts.txt and AdapterTrimming.txt files.
LogFiles: Contains files used for troubleshooting only.
InterOp: Contains binary files used by Sequencing Analysis Viewer (SAV).
commands.tsv: A select list of commands that were run during the analysis.
CompletedJobInfo.xml: XML file containing parameters that were used for the analysis.
RunInfo.xml: Copied from Run Folder.
runParameters.xml: Copied from Run Folder.
WorkflowError.txt: Main error log file.
WorkflowLog.txt: Main log file.

FASTQ File Generation

HiSeq Analysis Software v2.0 generates intermediate analysis files in the FASTQ format, which is a text format used to represent sequences and their quality scores. FASTQ files contain reads for each sample and their quality scores, excluding reads identified as inline controls and clusters that did not pass filter. FASTQ file generation includes all tiles by default.
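The index-matching rule from the Demultiplexing section can be illustrated with a short Python sketch. This is an illustration of the rule as described in this guide, not the software's implementation: the function names are hypothetical, a single Index Read is handled (a dual-index run would apply the same check to each Index Read), and all sample sheet indexes in a lane are assumed to have the same length.

```python
from itertools import combinations


def hamming(a, b):
    """Count mismatching positions between two sequences of equal length."""
    return sum(x != y for x, y in zip(a, b))


def allowed_mismatches(sample_indexes):
    """Return 1 if every pair of indexes differs by >= 3 bases, otherwise 0."""
    for a, b in combinations(sample_indexes, 2):
        if hamming(a, b) < 3:
            return 0  # indexes too similar: require perfect matches
    return 1


def assign_sample(index_read, sample_indexes):
    """Return the 1-based sample number for an index read, or 0 if unassigned.

    sample_indexes is ordered as in the sample sheet. When a sample sheet
    index is shorter than the read, only the first cycles are compared.
    """
    limit = allowed_mismatches(sample_indexes)
    for number, index in enumerate(sample_indexes, start=1):
        if hamming(index_read[:len(index)], index) <= limit:
            return number
    return 0  # sample 0 collects clusters that match no sample
```

For example, with the indexes ATCACG, CGATGT, and TTAGGC, the read ATCACC is assigned to sample 1 (one mismatch), whereas a read such as GGGGGG is assigned to sample 0.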
If individual .bcl files or the associated .filter and .loc files are missing or corrupted, make sure that the ConvertMissingBCLsToNoCalls setting is set to true (value=1) in the sample sheet. This setting is true by default.

FASTQ File Format

FASTQ is a text-based file format that contains base calls and quality values per read. Each record contains 4 lines:
• The identifier
• The sequence
• A plus sign (+)
• The Phred quality scores in an ASCII +33 encoded format

The identifier is formatted as @Instrument:RunID:FlowCellID:Lane:Tile:X:Y ReadNum:FilterFlag:0:SampleNumber, as shown in the following example:

    @SIM:1:FCX:1:15:6329:1045 1:N:0:2
    TCGCACTCAACGCCCTGCATATGACAAGACAGAATC
    +
    <>;##=><9=AAAAAAAAAA9#:<#<;<<<????#=

The X and Y coordinates in the read name are transformed relative to the raw pixel values according to the following calculation:

    (int)((PixelX*10) + 1000 + 0.5)

The instrument name is derived from the <Instrument> tag in the RunInfo.xml file. The run ID is derived from the Number attribute of the <Run> tag, or from the Id attribute. The following is an example of the tag, in which the run number is 108:

    <Run Id="11-07-23_BoltM11_0108_A-FCA012D_1" Number="108">

FASTQ File Names

FASTQ files are named with the sample name and the sample number. The sample number is a numeric assignment based on the order that the sample is listed in the sample sheet. For example:

    Data\Intensities\BaseCalls\samplename_S1_L001_R1_001.fastq.gz

• samplename: The sample name provided in the sample sheet. If a sample name is not provided, the file name includes the sample ID.
• S1: The sample number based on the order that samples are listed in the sample sheet, starting with 1. In this example, S1 indicates that this sample is the first sample listed in the sample sheet.
  NOTE: Reads that cannot be assigned to any sample are written to a FASTQ file for sample number 0, and excluded from downstream analysis.
• L001: The lane number.
• R1: The read. In this example, R1 means Read 1. For a paired-end run, a file from Read 2 includes R2 in the file name.
• 001: The last segment is always 001.

FASTQ files are compressed in the GNU zip format, as indicated by *.gz in the file name. FASTQ files can be uncompressed using tools such as gzip (command-line) or 7-zip (GUI).

Quality Control Files

To perform QC, first use the Sequencing Analysis Viewer (SAV) to view important quality metrics. If the generated FASTQ files are empty or nearly empty, view the output files for possible errors.

If the run does not include indexes, view the following files.

File | Description
GenerateFASTQRunStatistics.xml | Contains the number of raw and PF clusters for the whole flow cell and per sample.
WorkflowError.txt, WorkflowLog.txt, BuildFastq*.stdout.txt | Processing logs that contain additional information on the run. The BuildFastq*.stdout.txt file contains information for each lane and tile.
InterOp files in the InterOp folder | Contain the %Q30 scores and additional information for QC.
FastqSummaryF1L*.txt | Contains the number of raw and PF clusters for each lane and tile on a per sample basis.

If the run includes indexes, view the following files in addition to the files listed above.

File | Description
IndexMetricsOut.bin | SAV uses this file with the other files in the InterOp folder.
DemultiplexSummaryF1L1.txt | Contains the percentage of each index sample for each lane and tile. This file also lists the most frequent indexes (unique or combination) found in the lanes. View this file when a sample has too many reads or no reads.
AdapterCounts.txt | Lists the number of reads that contain part of the adapter sequence.
AdapterTrimming.txt | Lists the number of trimmed bases and percentage of bases for each tile. This file is present only if adapter trimming was specified for the run.
Undetermined_S0_L00*_R*_001.gz | Contains the reads with unidentifiable index reads. This file is present only when demultiplexing.

Resequencing Analysis Workflow

The Resequencing analysis workflow uses the Isaac Aligner and Isaac Variant Caller to compare the DNA sequence in the sample against the human reference genome hg19. It identifies any small variants (SNPs or indels) and large structural variants (SVs and CNVs) relative to the reference sequence.

The inputs to the Resequencing workflow are the sample FASTQ files located in the GenerateFASTQ output folder structure and the SampleSheet.csv file generated by the GenerateFASTQ step. The main output files are BAM files (containing the reads after alignment), VCF files (containing the variant calls), Genome VCF files (describing the calls for all variant and nonvariant sites in the genome), and SV VCF files (describing the calls for structural variants and copy number variants in the genome).

Figure 1 Resequencing Analysis Workflow

Run Resequencing Analysis

Run the Resequencing analysis workflow using the following command. If your HiSeq Analysis Software v2.0 executable is not installed in the default location, use your system location instead of the /opt location shown below.

    /opt/illumina/HiSeqAnalysisSoftwareX/latest/HAS Resequencing -a /path/to/analysis/folder -i NormalSampleId -t /path/for/temp/files/likely/on/node

Use these command line options to customize the Resequencing run.

Option | Description
-p | [Optional] The project the sample belongs to. Adds a directory level named <Project> to the base of the other output directories.
-l | [Optional] The path to the local storage. If specified, the input files for the specified workflow are copied to the local storage of the compute node (local hard drive). The workflow also generates temporary and output files in the local node storage during the analysis. Using this option requires more compute node space than the -t option. Upon completion, the results are moved to the standard location.
--copy-fastqs | [Optional] Copy the FASTQ files to the analysis output folder instead of using a symlink. If this option is not specified, FASTQ files are softlinked from the analysis output folder to the actual FASTQ source directory generated by the GenerateFASTQ workflow. If this option is specified, FASTQ folders in the results folder are not subject to the Delete_FastQ.cmdline script.
-t | The path to the temporary directory used by Isaac.

Resequencing Analysis Folder Structure

[User-Specified Analysis Folder]
[Project Name]: Present if the project name is specified in the sample sheet.
[Sample ID]
Delete_FastQ.cmdline: Shell script used to delete the FASTQ files that were used for the analysis.
FASTQ_SOURCE: Contains internal information to link FASTQ source files to the sample. Do not delete or alter.
FASTQ_STATS: FASTQ and demultiplex summary files copied from the GenerateFASTQ run. Contains the same content as the QC folder in the original FASTQ directory, but renamed by flow cell.
Results
samplename_S1.bam: BAM file with all the reads aligned against the hg19 genome.
samplename_S1.bam.bai: BAM index file.
samplename_S1.bam.md5sum: MD5 checksum of the BAM file.
samplename_S1.genome.vcf.gz: gVCF file with all reference and small variant calls.
samplename_S1.genome.vcf.gz.tbi: Tabix index file for the gVCF file.
samplename_S1.report.html: HTML report.
samplename_S1.report.pdf: PDF report.
samplename_S1.summary.csv: Comma-separated file with various analysis stats.
SampleSheet.csv: Sample sheet used for the analysis.
Summary.htm: HTML file containing statistics on the flow cell on a per tile basis.
Summary.xml: XML version of the summary file.
LogFiles: Contents are used for troubleshooting only.
Fastq: Contains softlinks to the FASTQ files.
commands.tsv: A select list of commands that were run during the analysis.
CompletedJobInfo.xml: XML file containing parameters used for the analysis.
RunInfo.xml: Copied from Run Folder.
WorkflowError.txt: Main error log file.
WorkflowLog.txt: Main log file.

NOTE: If there is available storage on the local node, the -l option is recommended as standard practice. If the -l option is specified, there is no need to specify the -t option.

Resequencing Analysis Report

The Resequencing workflow outputs a PDF and HTML report that provides an overview of statistics per sample.

Sample Information

The sample information table provides statistics about the sample and alignment quality.

Statistic | Definition
Total PF Reads | The total number of reads passing filter.
% Q30 Bases | The percentage of bases with a quality score of 30 or higher.
Total Aligned Reads 1 and 2 | The total number of aligned reads for reads 1 and 2.
% Aligned Reads 1 and 2 | The percentage of aligned passing filter reads for reads 1 and 2.
Total Aligned Bases Reads 1 and 2 | The total number of aligned bases for reads 1 and 2.
% Aligned Bases Reads 1 and 2 | The percentage of aligned bases for reads 1 and 2.
Mismatch Rate Reads 1 and 2 | The average percentage of mismatches of reads 1 and 2 over all cycles.

Coverage Histogram

The coverage histogram shows the number of reference bases plotted against the depth of coverage. The report also includes the mean coverage, which is the total number of aligned bases divided by the genome size. The coverage histogram considers duplicates, but the mean coverage does not.

Small Variants Summary

The small variants summary table provides metrics about the number of SNVs, insertions, and deletions.

Statistic | Definition
Total Passing | The total number of variants present in the data set that pass the variant quality filters.
Het/Hom Ratio | The number of heterozygous variants divided by the number of homozygous variants.
Ts/Tv Ratio | The transition rate of SNVs that pass the quality filters divided by the transversion rate of SNVs that pass the quality filters. Transversions are interchanges between purine and pyrimidine bases (eg, A to T).

Structural Variants Summary

The structural variants summary table separates the structural variant output into the classes of called variants. The table also includes the total number of structural variants and the overlap with annotated genes. All counts are based on PASS filter variants.

Fragment Length Summary

The fragment length summary table provides statistics on the sequenced fragments.

Statistic | Definition
Fragment Length Median | The median length of the sequenced fragment. The fragment length is based on the locations where a read pair aligns to the reference. The read mapping information is parsed from the BAM files.
Minimum | The minimum length of the sequenced fragment.
Maximum | The maximum length of the sequenced fragment.
Standard Deviation | The standard deviation of the sequenced fragment length.

Duplicate Information

The duplicate information table provides the percentage of paired reads that have duplicates.

Analysis Details

The analysis detail tables provide the settings and software packages used for the analysis.

Tumor-Normal Analysis Workflow

HiSeq Analysis Software v2.0 includes a Tumor-Normal analysis workflow that leverages a suite of proven algorithms that are optimized for the complexities of tumor samples. The software delivers a set of accurate somatic variants when compared with a matched normal sample. Following alignment of both the tumor and normal sample, the Tumor-Normal workflow is used to identify the somatic variants (SNVs, small indels, and structural variants) that are unique to the tumor sample.
When analyzing the tumor sample, use the Alignment analysis workflow rather than the Resequencing analysis workflow for the alignment step, as shown below. The Alignment analysis workflow is identical to the Resequencing analysis workflow, but does not include the variant call analysis. The variant call analysis used in the Resequencing analysis workflow is not valid for a tumor sample, which can contain somatic variants. The variant calling methods used in the Resequencing analysis workflow assume a diploid genotype.

For optimal results, Illumina recommends a minimum coverage of 30x for the normal sample and 60x coverage for the tumor sample.

Below is an outline of the Tumor-Normal analysis workflow and generated files. The Tumor-Normal analysis workflow leverages 3 interconnected somatic variant callers: Isaac Somatic Variant Caller for SNV and small somatic variants, Isaac Structural Variant Caller for large indel and structural variants, and SENECA for somatic copy number variants.

Figure 2 Tumor-Normal Analysis Workflow

Run Tumor-Normal Analysis

1. Run the Alignment workflow for each tumor sample. A sample can span multiple flow cells.

       /opt/illumina/HiSeqAnalysisSoftwareX/latest/HAS Alignment -a /path/to/analysis/folder -i TumorSampleId -l /path/to/storage/on/node

2. Run the Tumor-Normal workflow for each tumor-normal pair. This example assumes that Resequencing has already been run on the normal sample.

       /opt/illumina/HiSeqAnalysisSoftwareX/latest/HAS TumorNormal -a /path/to/analysis/folder --tumor-sample-id TumorSampleId --normal-sample-id NormalSampleId

Use these command line options to customize the Tumor-Normal run.

Option | Description
-p | [Optional] The project the sample belongs to. Adds a directory level named <Project> to the base of the other output directories.
-l | [Optional] The path to the local storage. If specified, the input files for the specified workflow are copied to the local storage of the compute node (local hard drive). The workflow also generates temporary and output files in the local node storage during the analysis. Using this option requires more compute node space than the -t option. Upon completion, the results are moved to the standard location.
--tumor-sample-id | The sample ID of the tumor sample.
--normal-sample-id | The sample ID of the normal sample.

NOTE: If there is available storage on the local node, the -l option is recommended as standard practice. If the -l option is specified, there is no need to specify the -t option.

Tumor-Normal Analysis Folder Structure

[User-Specified Analysis Folder]
[Project Name]: Present if the project name is specified in the sample sheet.
[Tumor Sample ID or Normal Sample ID]
[Tumor-Normal]: Present in addition to a Resequencing or Alignment folder.
normalsamplename_tumorsamplename_G1_P1.report.pdf: PDF report.
normalsamplename_tumorsamplename_G1_P1.report.json: JSON report.
normalsamplename_tumorsamplename_G1_P1.somatic.vcf: VCF file containing the somatic small variants called.
normalsamplename_tumorsamplename_G1_P1.somatic.SV.vcf: VCF file containing the somatic structural variants called (excluding CNVs).
normalsamplename_tumorsamplename_G1_P1.somatic.CNV.vcf: VCF file containing the somatic CNVs called.
SampleSheet.csv: Sample sheet used for the analysis.
LogFiles: Contents are used for troubleshooting only.
SoxLogs_G1_P1: Contains more detailed log files.
commands.tsv: A select list of commands that were run during the analysis.
CompletedJobInfo.xml: XML file containing parameters used for the analysis.
runSoxWorkflow_G1_P1_stderr.txt: Main log file.
WorkflowError.txt: Main error log file.
WorkflowLog.txt: Does not contain detailed information.

Somatic Analysis Report

The Tumor-Normal workflow outputs a PDF report that provides a summary of the somatic analysis results.
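The report counts are based on PASS filter variants. A comparable tally can be made directly from the somatic small variant VCF, for example as a cross-check. The following Python sketch is not how the software computes its report: it assumes the standard tab-delimited VCF column layout (REF in column 4, ALT in column 5, FILTER in column 7), classifies alleles by length alone, and uses hypothetical function names.

```python
import gzip
from collections import Counter


def classify(ref, alt):
    """Classify one REF/ALT pair by allele length (a common convention)."""
    if alt == "." or alt.startswith("<") or "[" in alt or "]" in alt:
        return "other"  # missing, symbolic, or breakend alleles
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(alt) > len(ref):
        return "insertion"
    if len(alt) < len(ref):
        return "deletion"
    return "other"


def count_pass_variants(lines):
    """Tally PASS records by variant class from an iterable of VCF lines."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        ref, alts, filt = fields[3], fields[4], fields[6]
        if filt != "PASS":
            continue
        for alt in alts.split(","):
            counts[classify(ref, alt)] += 1
    return counts


def count_pass_variants_in_file(path):
    """Open a plain or gzip-compressed VCF file and tally its PASS records."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_pass_variants(handle)
```

Calling count_pass_variants_in_file on a somatic.vcf file returns a Counter keyed by "SNV", "insertion", "deletion", and "other". Totals can differ from the report if the software applies additional criteria not described here.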
Sample Information

The sample information table provides statistics about the tumor and normal samples.

Statistic  Definition
Gigabases Passing Filter  The number of gigabases passing filter.
% Bases ≥ Q30  The percentage of bases with a quality score of 30 or higher.
Purity  The approximate amount of signal from the tumor sample that is derived from the tumor cells.
Ploidy  The average chromosome content of the cell compared to the haploid state.

Somatic Variants Summary

The somatic variants summary tables provide the total counts for the following:
- Somatic SNVs, deletions, and insertions.
- Different structural variants, including CNVs.
- Variants overlapping known genes.

The workflow does not provide annotations for small variants, which results in 0 counts for the remaining sections of the summary table. All counts are based on PASS filter variants.

Circos Plot of Somatic Variations

The circos plot provides visualization of somatic small variation, ploidy, and structural variations reported in the somatic variation files (VCF). The circos plot displays somatic variation data in tracks with chromosomes circularly arranged. Following is an example legend. Labels are described from inside the circle to the outside.

Legend  Label (From Inner Circle to Outer Circle)  Description
A  Somatic structural variants  The somatic structural variants detailed in somatic.SVs.vcf are plotted in the center of the plot.
   - Green links: Segmental duplications (at the center of the circle).
   - Green boxes: Inversions (the first inner track).
   - Purple boxes: Deletions (the second track). The width of the boxes indicates the length of SVs.
   - Purple bars: Insertion breakpoints (the third track).
   - Red links: Translocations. The ends of the links indicate the 2 breakpoints of SVs.
B  Number of somatic indels per Mb  The density of PASS somatic indels reported in somatic.indels.vcf.gz in 1 Mb windows. The scale of the Y-axis in the histogram indicates the counts.
C  Number of somatic SNVs per Mb  The density of PASS somatic SNVs reported in somatic.snvs.vcf in 1 Mb windows, arbitrarily scaled in a histogram with the Y-axis pointing inward.
D  Copy-neutral loss of heterozygosity (LOH)  The LOH regions with SNP calls in the normal genome but a homozygous reference call in the tumor genome, in CNVs.vcf.
E  B-allele frequency  The B-allele ratios calculated by SENECA that are used in the ploidy and purity estimation.
F  Called level  The copy number aberrations from the CNVs.vcf file. The scale of the Y-axis in the histogram indicates the called level.
G  Karyotype  The standard Circos ideogram defining the chromosome position, identity, and color of cytogenetic bands.
H  Chromosome position  The reference coordinates along the chromosome (in megabases).
I  Chromosome number  Chromosome number: 1, 2,…,22, X, Y.

Depth/B-Allele Plot

The top plot provides an overview of the depth of coverage by chromosomal position. Aberrant values indicate copy number variations. Copy number ratios are classified as either gains (red), losses (green), or copy number unchanged (black). Estimated purity and ploidy values are listed at the top of the plot.

The bottom of the plot displays the B-allele ratio by chromosomal position. The B-allele ratio is the ratio of the 2 alleles A and B. The B-allele ratios that are around 0.5 are filtered out for diploid regions.

Input Files

The following file is user-provided.
- SampleSheet.csv: Contains user-specified analysis options for the run, including Sample_IDs and index sequences. Create the SampleSheet.csv file for the GenerateFASTQ step and save it in the top level of the run folder. HiSeq Analysis Software v2.0 creates the SampleSheet.csv for subsequent workflow steps.

HiSeq X generates the following files. These files are found in the run folder after a sequencing run.
- RunInfo.xml: Produced by RTA and contains information about cycles per read, etc.
- Base call files (s_x_yyyy.bcl.gz): Include base calls for lane x, tile yyyy, cycle z. Located in Data\Intensities\BaseCalls\L00x\Cz.1.
- Filter files (s_x_yyyy.filter): Include filter flags for lane x, tile yyyy. Located in Data\Intensities\BaseCalls\L00x\.
- Position files (s.locs): Include locations for all lanes. Located in Data\Intensities.

Sample Sheet Settings

The sample sheet identifies the samples included in an analysis and contains the index information used in sample demultiplexing. To customize the sample sheet for the GenerateFASTQ workflow, enter the following parameters in the [Settings] section of the sample sheet. Each line in the [Settings] section contains a parameter name in the first column and a value in the second column. Settings are not case-sensitive.

Parameter  Description
Adapter  Specify the 5' portion of the adapter sequence to prevent reporting sequence beyond the sample DNA. To trim two or more adapters, separate the sequences with a plus (+) sign, for example CTGTCTCTTATACACATCT+AGATGTGTATAAGAGACAG for Nextera Mate Pair libraries.
AdapterRead2  Specify the 5' portion of the Read 2 adapter sequence to prevent reporting sequence beyond the sample DNA. Use this setting to specify a Read 2 adapter that differs from the one specified in the Adapter setting. If not specified, the Adapter setting is used for Read 2.

[Data] Section

The following table includes the parameters and descriptions of the sample sheet [Data] fields.

Parameter  Description
Lane  [Optional] Lane numbers used for a sample. If not specified, all lanes are used. Enter each lane as a separate row, or list multiple lanes delimited by '+', for example '1+2+3'.
Sample_ID  Sample ID. Combined with Sample_Project to make a unique identifier for the sample.
Sample_Name  [Optional] Sample name. If this field is completed, the sample name is used when naming FASTQ files.
index  [Optional] Index bases used to identify a sample.
index2  [Optional] Second index bases used to identify a sample.
GenomeFolder  Location of the genome used for alignment.
Sample_Project  [Optional] Project name. Completing this field is highly recommended.

Example GenerateFASTQ Sample Sheet

An example of a sample sheet for the GenerateFASTQ workflow is shown below.

    [Header]
    IEMFileVersion,4
    Investigator Name,Jane Doe
    Project Name,Project A
    Experiment Name,HiSeqX
    Date,2/12/2015
    Workflow,GenerateFASTQ
    Application,HiSeq FASTQ Only
    Assay,TruSeq HT
    Description,HiSeqX run
    Chemistry,Default

    [Reads]
    151
    151

    [Settings]
    Adapter,AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
    AdapterRead2,AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

    [Data]
    Lane,Sample_ID,Sample_Name,index,index2,GenomeFolder,Sample_Project

Output Files

Table 1 Common Data Files
File Name  Description
*.bam files  Contain reads for a given sample. Located in {USER-SPECIFIED-ANALYSIS-FOLDER}/{SAMPLEID}/Results.
*.vcf files  Contain information about variants found at specific positions in a reference genome. Located in {USER-SPECIFIED-ANALYSIS-FOLDER}/{SAMPLEID}/Results.
*.genome.vcf.gz files  Contain all variant and nonvariant sites in the genome for SampleName. Block-level compression (bgzip) is used to generate this file so that it is compatible with tabix indexing. A tabix-generated index file corresponding to the genome.vcf.gz file is also generated. This information is represented using the Genome VCF (gVCF) conventions for VCF. Located in {USER-SPECIFIED-ANALYSIS-FOLDER}/{SAMPLEID}/Results. For more information, see sites.google.com/site/gvcftools/home/about-gvcf.

Table 2 Common Metrics Files
File Name  Description
AdapterTrimming.txt  Lists the number of trimmed bases and percentage of bases for each tile. This file is present only if adapter trimming was specified for the run. Located in {USER-SPECIFIED-FASTQ-FOLDER}/{FLOWCELLID}/QC.
*.CoverageHistogram.txt  Contains coverage depth information for every chromosome. This file can be used to generate a coverage histogram. Located in {USER-SPECIFIED-ANALYSIS-FOLDER}/{SAMPLEID}/Results/LogFiles.
DemultiplexSummaryF1L1.txt  Reports demultiplexing results in a table with one row per tile and one column per sample. Located in {USER-SPECIFIED-FASTQ-FOLDER}/{FLOWCELLID}/QC.
ErrorsAndNoCallsByLaneTileReadCycle.csv  A comma-separated values file that contains the percentage of errors and no-calls for each tile, read, and cycle. Located in {USER-SPECIFIED-ANALYSIS-FOLDER}/{SAMPLEID}/Results/LogFiles.
*.summary.csv  Summary statistics for the sample, stored as a comma-separated values (.csv) file for parsing by downstream tools. Located in {USER-SPECIFIED-ANALYSIS-FOLDER}/{SAMPLEID}/Results.
*.report.pdf / *.report.html  Summary report for the sample. Provides several high-level statistics pulled from *.summary.csv, such as genome coverage and % aligned. Located in {USER-SPECIFIED-ANALYSIS-FOLDER}/{SAMPLEID}/Results.

Table 3 Common Run Progress Files
File Name  Description
WorkflowLog.txt  Processing log that describes every step that occurred during analysis of the current run folder. This file does not contain error messages. Located in {USER-SPECIFIED-ANALYSIS-FOLDER}/{SAMPLEID}/Results/LogFiles.
WorkflowError.txt  Processing log that lists any errors that occurred during analysis. This file is present only if errors occurred. Located in {USER-SPECIFIED-ANALYSIS-FOLDER}/{SAMPLEID}/Results/LogFiles.
CompletedJobInfo.xml  Written after analysis is complete. Contains information about the run, such as date, flow cell ID, software version, and other parameters. Located in {USER-SPECIFIED-ANALYSIS-FOLDER}/{SAMPLEID}/Results/LogFiles.
RunInfo.xml  Contains information about the run. Located in {USER-SPECIFIED-ANALYSIS-FOLDER}/{SAMPLEID}/Results/LogFiles.

Methods for Resequencing and Alignment Analysis Workflows

This section describes the underlying methodologies for the resequencing and alignment algorithms in HiSeq Analysis Software v2.0. The resequencing analysis and alignment analysis workflows are identical, except that the alignment analysis workflow does not contain variant calling and is only recommended for tumor samples.

Introduction

After the sequencer generates base calls and quality scores, the resulting data are analyzed in 2 steps: alignment to the reference genome, followed by assembly and variant calling. Alignment and variant calling are performed with the Isaac Alignment Software, Isaac Variant Caller, Isaac CNV Caller, and Isaac SV Caller. The following output is produced:
- Realigned and duplicate-marked reads in a *.bam file format.
- Variants in a VCF file format.
- An additional Genome VCF (gVCF) file. This file features an entry for every base in the reference, which differentiates reference calls and no calls, and a summary of quality. The reference calls are block compressed and all single nucleotide polymorphisms and indels are included.

Currently, structural variants and CNVs are kept in separate files.

Genome Specific Details

Illumina currently uses hg19 from UCSC as a reference genome. The chromosome naming scheme follows the UCSC conventions of chr1-22, chrX, chrY, chrM. The pseudoautosomal region (PAR) of the Y chromosome is masked out with N’s. As a result, any mappings occurring in the PAR region map to the X chromosome. Currently, only the main chromosomes and mitochondria are used in the reference; none of the nonmapped contigs are included. As per the GATK specification for UCSC, chrM is the first chromosome, followed by the rest in karyotypic order. The hg19 PAR regions are defined as follows.
Table 4 hg19 PAR regions
Name   Chr  Start        Stop
PAR#1  X    60,001       2,699,520
PAR#2  X    154,931,044  155,260,560
PAR#1  Y    10,001       2,649,520
PAR#2  Y    59,034,050   59,363,566

Isaac Aligner

The Isaac Aligner [1] aligns DNA sequencing data, single or paired-end, with read lengths of 32–150 bp and low error rates using the following steps:
- Candidate mapping positions: Identifies the complete set of relevant candidate mapping positions using a 32-mer seed-based search.
- Mapping selection: Selects the best mapping among all candidates.
- Alignment score: Determines alignment scores for the selected candidates based on a Bayesian model.
- Alignment output: Generates the final output as a sorted, duplicate-marked BAM file with realigned indels, plus a summary file.

[1] Come Raczy, Roman Petrovski, Christopher T. Saunders, Ilya Chorny, Semyon Kruglyak, Elliott H. Margulies, Han-Yu Chuang, Morten Källberg, Swathi A. Kumar, Arnold Liao, Kristina M. Little, Michael P. Strömberg and Stephen W. Tanner (2013) Isaac: Ultra-fast whole genome secondary analysis on Illumina sequencing platforms. Bioinformatics 29(16):2041-3 bioinformatics.oxfordjournals.org/content/29/16/2041

Candidate Mapping

To align reads, the Isaac Aligner first identifies a small but complete set of relevant candidate mapping positions. The Isaac Aligner begins with a seed-based search using 32-mers from the extremities of the read as seeds. The Isaac Aligner then performs another search using different seeds, but only for those reads that were not mapped unambiguously with the first-pass seeds.

Mapping Selection

Following a seed-based search, the Isaac Aligner selects the best mapping among all the candidates. For paired-end data sets, all mappings where only one end is aligned (called orphan mappings) trigger a local search to find additional mapping candidates. These candidates (called shadow mappings) are defined through the expected minimum and maximum insert size. After optional trimming of low-quality 3' ends and adapter sequences, the possible mapping positions of each fragment are compared. This step takes into account paired-end information (when available), possible gaps using a banded Smith-Waterman gap aligner, and possible shadows. The selection is based on the Smith-Waterman score and on the log-probability of each mapping.

Alignment Scores

The alignment scores of each read pair are based on a Bayesian model, where the probability of each mapping is inferred from the base qualities and the positions of the mismatches. The final mapping quality (MAPQ) is the alignment score, truncated to 60 for scores above 60, and corrected based on known ambiguities in the reference flagged during candidate mapping. Following alignment, reads are sorted. Further analysis is performed to identify duplicates and optionally to realign indels.

Alignment Output

After sorting the reads, the Isaac Aligner generates compressed binary alignment output files, called BAM (*.bam) files, using the following process:
- Marking duplicates: Detection of duplicates is based on the location and observed length of each fragment. The Isaac Aligner identifies and marks duplicates even when they appear on oversized fragments or chimeric fragments.
- Realigning indels: The Isaac Aligner tracks previously detected indels, over a window large enough for the current read length, and applies the known indels to all reads with mismatches.
- Generating BAM files: The first step in BAM file generation is creation of the BAM record, which contains all required information except the name of the read. The Isaac Aligner reads data from base call (BCL) files that were written during base calling on the sequencer to generate the read names. Data are then compressed into blocks of 64 kb or less to create the BAM file.

Isaac Variant Caller

The Isaac Variant Caller identifies single nucleotide variants (SNVs) and small indels using the following steps:
- Read filtering: Filters out reads failing quality checks.
- Indel calling: Identifies a set of possible indel candidates and realigns all reads overlapping the candidates using a multiple sequence aligner.
- SNV calling: Computes the probability of each possible genotype given the aligned read data and a prior distribution of variation in the genome.
- Indel genotypes: Calls indel genotypes and assigns probabilities.

Indel Candidates

Input reads are filtered by removing any of the following:
- Reads that failed base calling quality checks.
- Reads marked as PCR duplicates.
- Paired-end reads not marked as a proper pair.
- Reads with a mapping quality less than 20.

Indel Calling

The variant caller proceeds with candidate indel discovery and generates alternate read alignments based on the candidate indels. As part of the realignment process, the variant caller selects a representative alignment to be used for site genotype calling and depth summarization by the SNV caller.

SNV Calling

The variant caller runs a series of filters on the set of filtered and realigned reads for SNV calling without affecting indel calls. First, any contiguous trailing sequence of N base calls is trimmed from the ends of reads. Then, using a mismatch density filter, base calls in read regions having an unexpectedly high number of disagreements with the reference are masked, as follows:
- The variant caller treats each insertion or deletion as a single mismatch.
- Base calls with more than two mismatches to the reference sequence within 20 bases of the call are ignored.
- If the call occurs within the first or last 20 bases of a read, the mismatch limit is applied to a 41-base window at the corresponding end of the read.
- The mismatch limit is applied to the entire read when the read length is 41 or shorter.
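The mismatch density rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Isaac Variant Caller implementation: the function name `mismatch_density_mask` and the boolean-list input are hypothetical, the input is assumed to already represent each insertion or deletion as one mismatch, and the sketch assumes that a mismatch at the call's own position counts toward the limit, which the text does not specify.

```python
from typing import List

FLANK = 20                # bases considered on each side of a call (from the text)
WINDOW = 2 * FLANK + 1    # 41-base window
MAX_MISMATCHES = 2        # calls whose window holds more mismatches are masked


def mismatch_density_mask(is_mismatch: List[bool]) -> List[bool]:
    """Return True for each base call that the filter would mask.

    is_mismatch[i] is True when read position i disagrees with the reference.
    Each indel is assumed to appear as a single True entry.
    """
    n = len(is_mismatch)
    # Prefix sums give the mismatch count of any window in constant time.
    prefix = [0]
    for flag in is_mismatch:
        prefix.append(prefix[-1] + int(flag))

    masked = []
    for i in range(n):
        if n <= WINDOW:
            start, end = 0, n  # short read: the limit applies to the whole read
        else:
            # Clamp the window so calls near either end of the read use
            # the 41-base window at that end.
            start = min(max(i - FLANK, 0), n - WINDOW)
            end = start + WINDOW
        masked.append(prefix[end] - prefix[start] > MAX_MISMATCHES)
    return masked
```

For a 100-base read with mismatches at positions 10, 12, and 14 (0-based), this sketch masks positions 0 through 30 and leaves the rest of the read available for genotyping.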
Indel Genotypes

The variant caller filters out all bases marked by the mismatch density filter and any N base calls that remain after the end-trimming step. These filtered base calls are not used for site genotyping but appear in the filtered base call counts in the variant caller output for each site. All remaining base calls are used for site genotyping.

The genotyping method heuristically adjusts the joint error probability that is calculated from multiple observations of the same allele on each strand of the genome. This correction accounts for the possibility of error dependencies. This method treats the highest-quality base call from each allele and strand as an independent observation and leaves the associated base call quality scores unmodified. Quality scores for subsequent base calls for each allele and strand are then adjusted. This adjustment is done to increase the joint error probability of the given allele above the error expected from independent base call observations.

Variant Call Output

After the SNV and indel genotyping methods are complete, the variant caller applies a final set of heuristic filters to produce the final set of calls in the output. The output in the genome variant call (gVCF) file captures the genotype at each position and the probability that the consensus call differs from reference. This score is expressed as a Phred-scaled quality score.

Genome VCF (gVCF)

Human genome sequencing applications require sequencing information for both variant and nonvariant positions, yet there is no common exchange format for such data. gVCF addresses this issue. gVCF is a set of conventions applied to the standard variant call format (VCF). These conventions allow representation of genotype, annotation, and additional information across all sites in the genome, in a reasonably compact format. Typical human whole-genome sequencing results expressed in gVCF with annotation are less than 1.7 GB, or about 1/50 the size of the BAM file used for variant calling. gVCF is equally appropriate for representing and compressing targeted sequencing results.

Compression is achieved by joining contiguous nonvariant regions with similar properties into single ‘block’ VCF records. To maximize the utility of gVCF, especially for high-stringency applications, the properties of the compressed blocks are conservative. Block properties such as depth and genotype quality reflect the minimum of any site in the block. The gVCF file is also a valid VCF v4.1 file, and can be indexed and used with existing VCF tools such as tabix and IGV. This feature makes the file convenient both for direct interpretation and as a starting point for further analysis.

gvcftools

Illumina has created a full set of utilities aimed at creating and analyzing Genome VCF files. For up-to-date information and downloads, visit the gvcftools website at sites.google.com/site/gvcftools/home.

Examples

The following is a segment of a VCF file following the gVCF conventions for representation of nonvariant sites and, more specifically, using gvcftools block compression and filtration levels. In the following gVCF example, variant records are the ones with an alternate allele in the ALT column; all other records are nonvariant regions.

NOTE
The variant lines can be extracted from a gVCF file to produce a conventional variant VCF file.

    chr20 676337 . T . 0.00 PASS END=676401;BLOCKAVG_min30p3a GT:GQX:DP:DPF 0/0:143:51:0
    chr20 676402 . A . 0.00 PASS END=676441;BLOCKAVG_min30p3a GT:GQX:DP:DPF 0/0:169:57:0
    chr20 676442 . T G 287.00 PASS SNVSB=-30.5;SNVHPOL=3 GT:GQ:GQX:DP:DPF:AD 0/1:316:287:66:1:33,33
    chr20 676443 . T . 0.00 PASS END=676468;BLOCKAVG_min30p3a GT:GQX:DP:DPF 0/0:202:68:1
    chr20 676469 . G . 0.00 PASS . GT:GQX:DP:DPF 0/0:199:67:5
    chr20 676470 . A . 0.00 PASS END=676528;BLOCKAVG_min30p3a GT:GQX:DP:DPF 0/0:157:53:0
    chr20 676529 . T . 0.00 PASS END=676566;BLOCKAVG_min30p3a GT:GQX:DP:DPF 0/0:120:41:0
    chr20 676567 . C . 0.00 PASS END=676574;BLOCKAVG_min30p3a GT:GQX:DP:DPF 0/0:114:39:0
    chr20 676575 . A T 555.00 PASS SNVSB=-50.0;SNVHPOL=3 GT:GQ:GQX:DP:DPF:AD 1/1:114:114:39:0:0,39
    chr20 676576 . T . 0.00 PASS END=676625;BLOCKAVG_min30p3a GT:GQX:DP:DPF 0/0:95:36:0
    chr20 676626 . T . 0.00 PASS END=676650;BLOCKAVG_min30p3a GT:GQX:DP:DPF 0/0:117:40:0
    chr20 676651 . T . 0.00 PASS END=676698;BLOCKAVG_min30p3a GT:GQX:DP:DPF 0/0:90:31:0
    chr20 676699 . T . 0.00 PASS END=676728;BLOCKAVG_min30p3a GT:GQX:DP:DPF 0/0:69:24:0
    chr20 676729 . C . 0.00 PASS END=676783;BLOCKAVG_min30p3a GT:GQX:DP:DPF 0/0:57:20:0
    chr20 676784 . C . 0.00 PASS END=676803;BLOCKAVG_min30p3a GT:GQX:DP:DPF 0/0:51:18:0
    chr20 676804 . G A 62.00 PASS SNVSB=-7.5;SNVHPOL=2 GT:GQ:GQX:DP:DPF:AD 0/1:95:62:17:0:11,6
    chr20 676805 . C . 0.00 PASS END=676818;BLOCKAVG_min30p3a GT:GQX:DP:DPF 0/0:48:17:0
    chr20 676819 . T . 0.00 PASS END=676824;BLOCKAVG_min30p3a GT:GQX:DP:DPF 0/0:39:14:0
    chr20 676825 . A . 0.00 PASS END=676836;BLOCKAVG_min30p3a GT:GQX:DP:DPF 0/0:30:11:0
    chr20 676837 . T . 0.00 LowGQX END=676857;BLOCKAVG_min30p3a GT:GQX:DP:DPF 0/0:21:8:0
    chr20 676858 . G . 0.00 PASS END=676873;BLOCKAVG_min30p3a GT:GQX:DP:DPF 0/0:30:11:0

In addition to the nonvariant and variant regions in the example, there is also one nonvariant region from [676837,676857] that is filtered out due to insufficient confidence that the region is homozygous reference.

Conventions

Any VCF file following the gVCF convention combines information on variant calls (SNVs and small indels) with genotype and read depth information for all nonvariant positions in the reference. Because this information is integrated into a single file, distinguishing variant, reference, and no-call states for any site of interest is straightforward.

The following subsections describe the general conventions followed in any gVCF file, and provide information on the specific parameters and filters used in the Isaac workflow gVCF output.

NOTE
gVCF conventions are written with the assumption that only one sample per file is being represented.

Interpretation

A gVCF file can be interpreted as follows:
- Fast interpretation: As a discrete classification of the genome into ‘variant’, ‘reference’, and ‘no-call’ loci. This classification is the simplest way to use the gVCF. The FILTER fields for the gVCF file have already been set to mark uncertain calls as filtered for both variant and nonvariant positions. Simple analysis can be performed to look for all loci with a filter value of “PASS” and treat them as called.
- Research interpretation: As a ‘statistical’ genome. Additional fields, such as genotype quality, are provided for both variant and reference positions to allow the threshold between called and uncalled sites to be varied. These fields can also be used to apply more stringent criteria to a set of loci from an initial screen.

External Tools

gVCF is written to the VCF 4.1 specifications, so any tool that is compatible with the specification (such as IGV and tabix) can use the file. However, certain tools are not appropriate if they:
- Apply algorithms to VCF files that make sense for only variant calls (as opposed to variant and nonvariant regions in the full gVCF).
- Are only computationally feasible for variant calls.

For these cases, extract the variant calls from the full gVCF file.

Special Handling for Indel Conflicts

Sites that are "filled in" inside deletions have additional treatment.
- Heterozygous Deletions: Sites inside heterozygous deletions have haploid genotype entries (ie "0" instead of "0/0", "1" instead of "1/1"). Heterozygous SNVs are marked with the SiteConflict filter and their original genotype is left unchanged. Sites inside heterozygous deletions cannot have a genotype quality score higher than the enclosing deletion genotype quality.
- Homozygous Deletions: Sites inside homozygous deletions have genotype set to "." (period), and site and genotype quality are also set to "." (period).
- All Deletions: Sites inside any deletion are marked with the filters of the deletion, and more filters can be added pertaining to the site itself. These modifications reflect the idea that the enclosing indel confidence bounds the site confidence.
- Indel Conflicts: In any region where overlapping deletion evidence cannot be resolved into 2 haplotypes, all indel and site records in the region are marked with the IndelConflict filter.

Table 5 Indel Conflict Filters
ID             Type        Description
IndelConflict  site/indel  Locus is in region with conflicting indel calls.
SiteConflict   site        Site genotype conflicts with proximal indel call. This conflict is typically a heterozygous genotype found inside of a heterozygous deletion.

Representation of Non-Variant Segments

This section includes the following subsections:
- Block representation using END key
- Joining nonvariant sites into a single block record
- Block sample values
- Nonvariant block implementations

Block Representation Using END Key

Continuous nonvariant segments of the genome can be represented as single records in gVCF. These records use the standard 'END' INFO key to indicate the extent of the record. Even though the record can span multiple bases, only the first base is provided in the REF field (to reduce file size). Following is a simplified example of a nonvariant block record:

    ##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA19238
    chr1 51845 . A . . PASS END=51862

The example record spans positions [51845,51862].

Joining Non-Variant Sites Into a Single Block Record

Address the following issues when joining adjacent nonvariant sites into block records:
- The criteria that allow adjacent sites to be joined into a single block record.
- The method to summarize the distribution of SAMPLE or INFO values from each site in the block record.

At any gVCF compression level, a set of sites can be joined into a block if all of the following are true:
- Each site is nonvariant with the same genotype call. Expected nonvariant genotype calls are { "0/0", "0", "./.", "." }.
- Each site has the same coverage state, where 'coverage state' refers to whether at least 1 read maps to the site. For example, sites with 0 coverage cannot be joined into the same block with covered sites.
- Each site has the same set of FILTER tags.
- Sites have less than a threshold fraction of nonreference allele observations compared to all observed alleles (based on AD and DP field information). This threshold is used to keep sites with high ratios of nonreference alleles from being compressed into nonvariant blocks. In the Isaac Variant Caller gVCF output, the maximum nonreference fraction is 0.2.

Block Sample Values

Any field provided for a block of sites, such as read depth (using the DP key), shows the minimum value observed among all sites encompassed by the block.

Nonvariant Block Implementations

Files conforming to the gVCF conventions delineated in this document can use different criteria for creation of block records, depending on the desired trade-off between compression and nonvariant site detail. The Isaac Variant Caller uses the 'min30p3a' blocking scheme as its nonvariant block compression scheme. Each sample value shown for the block, such as the depth (using the DP key), is restricted to have a range where the maximum value is within 30% or 3 of the minimum. Therefore, for sample value range [x,y], y ≤ x + max(3, 0.3x).
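As a rough illustration of the min30p3a range rule and the minimum-value convention for block sample values, the following Python sketch greedily groups a run of per-site depths into blocks. The function names `within_min30p3a` and `block_depths` are hypothetical, integer arithmetic stands in for the 30% test to avoid floating-point edge cases, and only the DP range and coverage-state criteria are modelled. The genotype, FILTER, and nonreference-fraction criteria are left out, so this is not how the Isaac Variant Caller itself builds blocks.

```python
from typing import Iterable, List, Tuple


def within_min30p3a(lo: int, hi: int) -> bool:
    """True when range [lo, hi] satisfies hi <= lo + max(3, 0.3 * lo).

    Both sides are multiplied by 10 so the comparison stays in integers.
    """
    return 10 * hi <= 10 * lo + max(30, 3 * lo)


def block_depths(depths: Iterable[int]) -> List[Tuple[int, int, int]]:
    """Group consecutive site depths into (first_index, last_index, min_depth).

    The reported depth is the block minimum, following the block sample
    value convention. Sites with 0 coverage are never merged with covered sites.
    """
    blocks = []
    start = 0
    lo = hi = None
    for i, dp in enumerate(depths):
        if lo is None:
            start, lo, hi = i, dp, dp
            continue
        new_lo, new_hi = min(lo, dp), max(hi, dp)
        same_coverage_state = (new_lo == 0) == (new_hi == 0)
        if same_coverage_state and within_min30p3a(new_lo, new_hi):
            lo, hi = new_lo, new_hi
        else:
            blocks.append((start, i - 1, lo))  # close the current block
            start, lo, hi = i, dp, dp
    if lo is not None:
        blocks.append((start, i, lo))
    return blocks
```

With depths `[51, 52, 60, 66, 67]`, the first four sites share a block reported at depth 51 because 66 ≤ 51 + 15.3, while 67 exceeds that bound and starts a new block.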
This range restriction applies to all sample values written in the final block record.

Genotype Quality for Variant and Nonvariant Sites

The gVCF file uses an adapted version of genotype quality for variant and nonvariant site filtration. This value is associated with the GQX key. The GQX value is intended to represent the minimum of {Phred genotype quality assuming the site is variant, Phred genotype quality assuming the site is nonvariant}. This definition allows a single value to be used as the primary quality filter for both variant and nonvariant sites. Filtering on this value corresponds to a conservative assumption appropriate for applications where reference genotype calls must be determined at the same stringency as variant genotypes.

Filter Criteria

The gVCF FILTER description is divided into 2 sections: (1) describes filtering based on genotype quality; (2) describes all other filters.

NOTE
These filters are default values used in the current Isaac Variant Caller implementation. However, no set of filters or cutoff values is required for a file to conform to gVCF conventions.

The genotype quality is the primary filter for all sites in the genome. In particular, traditional discovery-based site quality values that convey confidence that the site is "anything besides the homozygous reference genotype," such as SNV quality, are not used. Instead, a site or locus is filtered based on the confidence in the reported genotype for the current sample. The genotype quality used in gVCF is a Phred-scaled probability that the given genotype is correct. It is indicated with the FORMAT field tag GQX. Any locus where the genotype quality is below the cutoff threshold is filtered with the tag LowGQX.
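To make the GQX-based filtering concrete, the following Python sketch parses single-sample gVCF data lines and applies the simple PASS-based classification into variant, reference, and no-call loci. It is a sketch under stated assumptions, not part of HiSeq Analysis Software: the names `gqx`, `parse_record`, and `classify` are hypothetical, the cutoff of 30 is the default LowGQX value listed for the Isaac Variant Caller filters, and records whose GQX is "." (for example, sites inside homozygous deletions) are not handled.

```python
from typing import Dict

LOW_GQX_CUTOFF = 30  # assumed default LowGQX cutoff; not required by gVCF itself


def gqx(gq_if_variant: int, gq_if_nonvariant: int) -> int:
    """GQX as described in the text: the lower of the two genotype qualities."""
    return min(gq_if_variant, gq_if_nonvariant)


def parse_record(line: str) -> Dict[str, object]:
    """Parse one single-sample gVCF data line (tab or space separated)."""
    chrom, pos, _id, _ref, alt, _qual, flt, info, fmt, sample = line.split()
    # INFO entries are KEY=VALUE pairs or bare flags such as BLOCKAVG_min30p3a.
    info_map = dict(item.partition("=")[::2] for item in info.split(";"))
    sample_map = dict(zip(fmt.split(":"), sample.split(":")))
    start = int(pos)
    return {
        "chrom": chrom,
        "start": start,
        "end": int(info_map.get("END", start)),  # block records carry END
        "alt": alt,
        "filters": flt.split(";"),
        "gqx": int(sample_map["GQX"]),
    }


def classify(record: Dict[str, object]) -> str:
    """Classify a record as 'variant', 'reference', or 'no-call'."""
    if record["filters"] != ["PASS"]:
        return "no-call"
    return "reference" if record["alt"] == "." else "variant"
```

Applied to the block record that starts at chr20:676837 in the gVCF example (GQX of 21, FILTER of LowGQX), `classify` returns "no-call", which matches the GQX value falling below the cutoff of 30.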
In addition to filtering on genotype quality, some other filters are also applied. The gVCF output from Isaac Variant Caller includes several heuristic filters applied to the site and indel records. The filters are as follows.

Table 6 VCF Site and Indel Record Filters

VCF Filter ID     Type         Description
PhasingConflict   site         The locus read evidence displays unbalanced phasing patterns.
PLOIDY_CONFLICT   site/indel   The genotype call from the variant caller is not consistent with the chromosome ploidy.
IndelConflict     indel        The locus is in a region with conflicting indel calls.
HighDPFRatio      site         The fraction of basecalls filtered out at a site, DPF/(DP+DPF), is greater than 0.4.
HighDepth         site/indel   The locus depth is greater than 3x the mean chromosome depth.
LowGQX            site/indel   Locus GQX is less than 30.
LowGQXHetSNP      site         Locus GQX is less than 15 for het SNP.
LowGQXHomSNP      site         Locus GQX is less than 17 for hom SNP.
LowGQXHetIns      site/indel   Locus GQX is less than 6 for het insertion.
LowGQXHomIns      site/indel   Locus GQX is less than 6 for hom insertion.
LowGQXHetDel      site/indel   Locus GQX is less than 6 for het deletion.
PASS              site/indel   Position has passed all filters.
SiteConflict      indel        The site genotype conflicts with the proximal indel call. This call is typically a heterozygous SNV call made inside of a heterozygous deletion.

Isaac Copy Number Variant Caller

Isaac Copy Number Variant (CNV) Caller is an algorithm for calling copy number variants from a diploid sample. Most of a normal DNA sample is diploid, that is, present in 2 copies. Isaac CNV Caller identifies regions of the sample genome that are not present, or that are present either 1 time or more than 2 times in the genome. Isaac CNV Caller scans the genome for regions having an unexpected number of short read alignments. Regions with fewer than the expected number of alignments are classified as losses. Regions having more than the expected number of alignments are classified as gains.

Isaac CNV Caller is appropriately applied to low-depth cytogenetics experiments, low-depth single-cell experiments, or whole-genome sequencing experiments. Isaac CNV Caller is not appropriate for whole exome experiments, cancer studies, or any other experiment in which either of the following conditions applies:
} Most of the genome is not assumed to be diploid.
} Reads are not distributed randomly across the diploid genome.

Workflow

Isaac CNV Caller can be conceptually divided into 4 processes:
} Binning: Counting alignments in genomic bins.
} Cleaning: Removal of systematic biases and outliers from the counts.
} Partitioning: Partitioning the counts into homogeneous regions.
} Calling: Assigning a copy number to each homogeneous region.
These processes are explained in subsequent sections.

Binning

The binning procedure creates genomic windows, or bins, across the genome and counts the number of observed alignments that fall into each bin. The alignments are provided in the form of a BAM file. During binning, Isaac CNV Caller keeps in memory a collection of BitArrays to store observed alignments, one BitArray for each chromosome. Each BitArray has the same length as its corresponding chromosome. As the BAM file is read in, Isaac CNV Caller records the position of the left-most base in each alignment within the chromosome-appropriate BitArray. After all alignments in the BAM file have been read, the BitArrays have a "1" wherever an alignment was observed and a "0" everywhere else.

After reading in the BAM file, a masked FASTA file is read in, one chromosome at a time. This FASTA file contains the genomic sequences that were used for alignment. Each 35-mer within this FASTA file is marked as unique or nonunique with uppercase and lowercase letters. If a 35-mer is unique, then its first nucleotide is capitalized; otherwise, it is not capitalized. For example, in the sequence:

    acgtttaATgacgatGaacgatcagctaagaatacgacaatatcagacaa

The 35-mers marked as unique are as follows:

    ATGACGATGAACGATCAGCTAAGAATACGACAATA
    TGACGATGAACGATCAGCTAAGAATACGACAATAT
    GAACGATCAGCTAAGAATACGACAATATCAGACAA

Isaac CNV Caller stores the genomic locations of unique 35-mers in another collection of BitArrays, analogous to the BitArrays used to store alignment positions. Unique positions and nonunique positions are marked with "1"s and "0"s, respectively. This marking is used as a mask to guarantee that only alignments that start at unique 35-mer positions in the genome are used.

Bin Sizes

Isaac CNV Caller is initialized with a target of 100 alignments per bin and then computes the bin boundaries such that each bin contains the same bin size. The term "bin size" refers to the number of unique genomic 35-mers per bin. Because some regions of the human genome are more repetitive than others, physical bin sizes (in genomic coordinates) are not identical. In the following example, each box is a position along the genome. Each checkmark represents a unique 35-mer, while each X represents a nonunique 35-mer. The bin size in this example is 3 (3 checkmarks per bin). The physical size of each bin is not constant: B1 and B3 have a physical size of 3, but B2 and B4 have physical sizes of 4 and 6, respectively.

Computing Bin Size

To compute bin size, the ratio of observed alignments to unique 35-mers is calculated for each autosome. The desired number of alignments per bin is then divided by the median of these ratios to yield the bin size. For whole-genome sequencing, bin sizes are typically in the range of 800–1000 unique 35-mers. Correspondingly, most physical window sizes are in the 1–1.2 kb range.
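The masked-FASTA convention and the bin size computation described above can be sketched in Python as follows. This is an illustrative sketch, not the Isaac CNV Caller code: the function names, the use of plain lists in place of BitArrays, the rounding of the bin size, and the dropping of a trailing partial bin are all assumptions. The example sequence is the one given in the text; the uniqueness mask in the last example is hypothetical and was chosen only to reproduce the physical bin sizes 3, 4, 3, and 6 quoted above.

```python
from statistics import median

K = 35  # k-mer length used for the uniqueness mask


def unique_kmers(masked_seq: str, k: int = K) -> list[str]:
    """Return every k-mer whose first base is uppercase (marked unique)."""
    last_start = len(masked_seq) - k
    return [
        masked_seq[i:i + k].upper()
        for i in range(last_start + 1)
        if masked_seq[i].isupper()
    ]


def compute_bin_size(alignments, unique_counts, target_per_bin=100):
    """Bin size = desired alignments per bin / median per-autosome ratio.

    alignments[i] and unique_counts[i] are the observed alignments and the
    number of unique k-mers on autosome i. Rounding to an integer is an
    assumption of this sketch.
    """
    ratios = [a / u for a, u in zip(alignments, unique_counts)]
    return round(target_per_bin / median(ratios))


def bin_boundaries(unique_mask, bin_size):
    """Split positions into consecutive half-open bins [start, stop) that
    each contain exactly bin_size unique positions.

    A trailing partial bin is dropped (an assumption of this sketch).
    """
    bins, start, seen = [], 0, 0
    for pos, is_unique in enumerate(unique_mask):
        seen += 1 if is_unique else 0
        if seen == bin_size:
            bins.append((start, pos + 1))
            start, seen = pos + 1, 0
    return bins


if __name__ == "__main__":
    seq = "acgtttaATgacgatGaacgatcagctaagaatacgacaatatcagacaa"
    for kmer in unique_kmers(seq):
        print(kmer)

    # Hypothetical mask (1 = unique, 0 = nonunique) chosen so that a bin
    # size of 3 gives physical sizes 3, 4, 3, and 6.
    mask = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1]
    print([stop - start for start, stop in bin_boundaries(mask, 3)])
```

Defining bins by unique 35-mer count rather than by genomic length is what lets the physical size vary while the expected number of alignments per bin stays constant.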
The advantage of this approach relative to using fixed genomic intervals is that the same number of reads is expected to map to each bin, regardless of "uniqueness" or ability to be mapped. After bin size is computed, bins are defined as consecutive genomic windows such that each bin contains the same number of unique 35-mers. The number of observed alignments present within the boundary of each bin is then counted from the alignment BitArrays. The GC content of each bin is also calculated. The chromosome, genomic start, genomic stop, observed counts, and GC content of each bin are output to disk.

Cleaning

The Isaac CNV Caller cleaning comprises the following 3 procedures, which remove outliers and systematic biases from the count data computed during binning.
1 Single point outlier removal.
2 Physical size outlier removal.
3 GC content correction.
These procedures are performed on the bins produced during the Isaac CNV Caller binning process.

Single Point Outlier Removal

This step removes individual bins that represent extreme outliers. These bins have counts that are very different from the counts present in upstream and downstream bins. Two values, a and b, are defined to be very different when their difference is greater than expected by chance, assuming a and b come from the same underlying distribution. The comparison uses the Chi-squared distribution, as follows:

\[ \mu = 0.5a + 0.5b \]
\[ \chi^2 = \frac{(a - \mu)^2 + (b - \mu)^2}{\mu} \]

A value of \chi^2 greater than 6.635, which is the 99th percentile of the Chi-squared distribution with 1 degree of freedom, is considered very different. If a bin count is very different from the counts of both its upstream and downstream neighbors, then the bin is deemed an outlier and removed.

Physical Size Outlier Removal

Bins generally do not have the same physical (genomic) size. The average for whole-genome sequencing runs might be approximately 1 kb. If the bins cover repetitive regions of the genome, such as centromeres and telomeres, some bins might be several megabases in size. The counts in these regions tend to be unreliable, so bins with extreme physical size are removed. Specifically, the 98th percentile of observed physical sizes is calculated and bins with sizes larger than this threshold are removed.

GC Content Correction

The main source of variability in bin counts is GC content. An example of the bias is represented in the following figure.

Figure 3 GC Bias Example

The following correction is performed:
1 Bins are aggregated according to GC content, which is rounded to the nearest integer.
2 Each bin count is divided by the median count of bins having the same GC content.
3 This value is multiplied by the desired average count per bin (100 by default) and rounded to the nearest integer.

The effect is to flatten the midpoints of the bars in the example box-and-whisker plot. Some values of GC content have few bins, so the estimates of their medians are not robust. Therefore, bins are discarded when the number of bins having the same GC content is fewer than 100.

For some sample preparation schemes, GC content correction has a dramatic effect. The following figure illustrates the effect of GC content correction for a low-depth sequencing experiment using the Nextera library preparation method. The figure on the left shows bin counts as a function of chromosome position before normalization. The figure on the right shows the result after GC content correction.

For whole-genome sequencing experiments, the typical median absolute deviation (MAD) is 10.3, which is close to the expected value of 10. The expected value is predicted using the Poisson model for an average count of 100 and indicates that little bias remains following GC content correction.
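The single point outlier test and the GC content correction described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Isaac CNV Caller implementation: the function names are hypothetical, the first and last bins are never flagged because they have only 1 neighbor, discarded bins are represented by None, and Python's built-in round() (which rounds halves to the nearest even integer) stands in for "rounded to the nearest integer".

```python
from collections import defaultdict
from statistics import median

CHI2_CUTOFF = 6.635  # 99th percentile of chi-squared with 1 degree of freedom


def very_different(a: float, b: float, cutoff: float = CHI2_CUTOFF) -> bool:
    """Apply the chi-squared comparison from the text to two bin counts."""
    mu = 0.5 * a + 0.5 * b
    if mu == 0:
        return False  # both counts are 0, so they cannot differ
    chi2 = ((a - mu) ** 2 + (b - mu) ** 2) / mu
    return chi2 > cutoff


def single_point_outliers(counts):
    """Return indices of bins very different from both neighbors.

    The first and last bins have only 1 neighbor and are never flagged
    (an assumption of this sketch).
    """
    return [
        i for i in range(1, len(counts) - 1)
        if very_different(counts[i], counts[i - 1])
        and very_different(counts[i], counts[i + 1])
    ]


def gc_correct(counts, gc_percent, target=100, min_bins=100):
    """Normalize each count by the median count of bins with the same GC.

    Bins whose rounded GC value is shared by fewer than min_bins bins are
    discarded and returned as None.
    """
    by_gc = defaultdict(list)
    for count, gc in zip(counts, gc_percent):
        by_gc[round(gc)].append(count)
    medians = {gc: median(v) for gc, v in by_gc.items() if len(v) >= min_bins}
    corrected = []
    for count, gc in zip(counts, gc_percent):
        m = medians.get(round(gc))
        corrected.append(round(count / m * target) if m else None)
    return corrected


if __name__ == "__main__":
    print(single_point_outliers([100, 102, 160, 98, 101]))
    counts = [50] * 100 + [200] * 100 + [80]
    gc = [40.2] * 100 + [60.4] * 100 + [75.0]
    print(set(gc_correct(counts, gc)[:200]), gc_correct(counts, gc)[-1])
```

In the usage example, two GC groups with very different raw medians (50 and 200) are both mapped to the target count of 100, while the single bin at 75% GC is discarded because its group is too small to estimate a median.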
It is important to note that the normalization does not dampen the signal from CNVs, as shown in the following 2 figures. The figure on the left shows a chromosome known to harbor a single copy gain. The figure on the right shows a chromosome known to harbor a double copy gain.

Partitioning

The Isaac CNV Caller partitioning implements an algorithm for identifying regions of the genome whose average counts are statistically different from the average counts of neighboring regions. The implementation is a port of the circular binary segmentation (CBS) algorithm. Briefly, the algorithm considers each chromosome as a segment. The algorithm assesses each segment and identifies the pair of bins for which the counts in the bins between them are maximally different from the counts of the rest of the bins. The statistical significance of the maximal difference is assessed via permutation testing. If the difference is statistically significant, then the procedure is applied recursively to the 2 or 3 segments created by partitioning the current segment at the identified pair of points. Input to the algorithm is the output generated by the Isaac CNV Caller cleaning algorithm.

Because of the computational complexity of the algorithm, O(N^2), the problem is in practice divided into subchromosome problems followed by merging. Heuristics are used to speed up the permutation testing.

Calling

The final module of the Isaac CNV Caller algorithm assigns discrete copy numbers to each of the regions identified by the Isaac CNV Caller partitioner. A Gaussian model is used as the default calling method. In this case, both the mean and standard deviation are estimated from the data for the diploid model and adjusted for the other copy number models. For example, if the mean, \mu, and standard deviation, \sigma, are estimated to be 100 and 15 in the diploid model, then the corresponding estimates in the haploid model would be \mu/2 and \sigma/2. The mean and standard deviation are estimated using the autosomal median and MAD of counts. This model is the default because it is more appropriate in cases where the spread of counts is higher than expected from the Poisson model due to unaccounted sources of variability. An example of this case is single-cell sequencing experiments where whole-genome amplification is required.

Following assignment of copy number states, neighboring regions that received the same copy number call are merged into a single region. Phred-scaled Q-scores are assigned to each region using a simple logistic function derived using arrayCGH data as the gold standard. The probability of a miscall is modeled as

\[ p = 1 - \frac{1}{1 + e^{0.5532 - 0.147N}} \]

where N is the number of bins found within the nondiploid region. This probability is converted to a Q-score by

\[ q = -10 \log_{10} p \]

This estimate is likely conservative because it is derived from arrayCGH. Importantly, Q-scores are a function of the number of bins, not genomic size, so they are applicable to experiments of any sequencing depth, including low-depth cytogenetics screening.

The coordinates of nondiploid regions and their Q-scores are output to a VCF file. A variant must satisfy 2 filters to be marked PASS. First, a variant must have a Q-score of Q10 or greater. Second, a variant must be of size 10 kb or greater.

Isaac Structural Variant Caller

Isaac Structural Variant (SV) Caller is a structural variant caller for short sequencing reads. It can discover structural variants of any size and score these variants using both a diploid genotype model and a somatic model (when separate tumor and normal samples are specified). Structural variant discovery and scoring incorporate both paired read fragment spanning and split read evidence.

Method Overview

Isaac SV Caller works by dividing the structural variant discovery process into 2 primary steps: scanning the genome to find SV-associated regions, and analysis, scoring, and output of SVs found in such regions.

1 Build SV association graph
In this step, the entire genome is scanned to discover evidence of possible SVs and large indels. This evidence is enumerated into a graph with edges connecting all regions of the genome that have a possible SV association. Edges can connect 2 different regions of the genome to represent evidence of a long-range association, or an edge can connect a region to itself to capture a local indel/small SV association. These associations are more general than a specific SV hypothesis, in that many SV candidates can be found on 1 edge, although typically only 1 or 2 candidates are found per edge.

2 Analyze graph edges to find SVs
The second step is to analyze individual graph edges or groups of highly connected edges to discover and score SVs associated with the edges. The substeps of this process include:
• Inference of SV candidates associated with the edge.
• Attempted assembly of the SV break-ends.
• Scoring and filtration of the SV under various biological models (currently diploid germline and somatic).
• Output to VCF.

Capabilities

Isaac SV Caller can detect all structural variant types that are identifiable in the absence of copy number analysis and large-scale de novo assembly. Detectable types are enumerated in this section.

For each structural variant and indel, Isaac SV Caller attempts to align the break-ends to base pair resolution and report the left-shifted break-end coordinate (per the VCF 4.1 SV reporting guidelines). Isaac SV Caller also reports any break-end microhomology sequence and inserted sequence between the break-ends. Often the assembly fails to provide a confident explanation of the data.
In such cases, the variant is reported as IMPRECISE and scored according to the paired-end read evidence alone.

The sequencing reads provided as input to Isaac SV Caller are expected to be from a paired-end sequencing assay that results in an inward orientation between the 2 reads of each DNA fragment. Each read runs from the outer edge of the fragment insert inward.

Detected Variant Classes

Isaac SV Caller is able to detect all variation classes that can be explained as novel DNA adjacencies in the genome. Simple insertion/deletion events can be detected down to a configurable minimum size cutoff (defaulting to 51). All DNA adjacencies are classified into the following categories based on the break-end pattern:
} Deletions
} Insertions
} Inversions
} Tandem Duplications
} Interchromosomal Translocations

Known Limitations

Isaac SV Caller cannot detect the following variant types:
} Nontandem repeats/amplifications
} Large insertions: The maximum detectable size corresponds to approximately the read-pair fragment size, but note that detection power falls off to impractical levels well before this size.
} Small inversions: The limiting size is not tested, but in theory detection falls off below ~200 bases. So-called microinversions might be detected indirectly as combined insertion/deletion variants.

More general repeat-based limitations exist for all variant types:
} Power to assemble variants to break-end resolution falls to 0 as break-end repeat length approaches the read size.
} Power to detect any break-end falls to (nearly) 0 as the break-end repeat length approaches the fragment size.
} The method cannot detect nontandem repeats.

While Isaac SV Caller classifies novel DNA adjacencies, it does not infer the higher-level constructs implied by the classification. For instance, a variant marked as a deletion by Isaac SV Caller indicates an intrachromosomal translocation with a deletion-like break-end pattern. However, there is no test of depth, B allele frequency, or intersecting adjacencies to infer the SV type directly.

Methods for Tumor-Normal Workflow

This section describes the underlying methodologies for the Tumor-Normal analysis algorithm in HiSeq Analysis Software v2.0.

Introduction

The somatic variant calling pipeline uses 2 aligned sequence files (*.bam files) as inputs: a normal *.bam and a tumor *.bam. These *.bam files are then processed through 3 interconnected callers:
} Isaac Somatic Variant Caller
} Isaac Structural Variant Caller
} Copy Number Aberration Caller (SENECA)

Isaac Somatic Variant Caller and SENECA are described in the following sections. For information on the Isaac Structural Variant Caller, see Isaac Structural Variant Caller on page 40.

During the first stage of the pipeline, the tumor and normal *.bam files run through a combined indel realignment operation. The output of this realignment operation is used as the input for further processing. During calling, putative calls and de novo reassembled sections of sequence are passed between the callers to produce internally consistent variant calls.

All 3 callers use statistical models that operate on the combined tumor and normal reads as input instead of the variants. The statistical models use combined calling instead of subtraction of variant calls. Using combined calling produces superior results. However, the result of subtracting the normal whole-genome calls from the tumor whole-genome calls often does not match the somatic calls from a combined caller. For example, you can find a somatic variant that was not called in the tumor WGS sample because the combined caller is operating on the reads.
Isaac Somatic Variant Caller

The Isaac Somatic Variant Caller detects somatic SNVs and indels in sequencing data from a tumor and matched normal sample, based on the following assumptions:
} The normal sample is a mixture of diploid germline variation and noise.
} The tumor sample is a combination of the normal sample and somatic variation.

It is assumed that the somatic variation and the normal noise can occur at any allele frequency ratio. For SNVs, but not for indels, the normal noise component is further modeled as a combination of single-strand and double-strand noise.

Figure 4 Isaac Somatic Variant Caller Method

NOTE
For a detailed overview of Isaac Somatic Variant Caller methods, go to www.ncbi.nlm.nih.gov/pubmed/22581179.

Candidate Indel Search

The Isaac Somatic Variant Caller scans through the genome using sequence alignments from the normal sample and tumor sample together to find a joint set of candidate indels. The information in sequence alignments is supplemented with externally generated candidate indels discovered by the Isaac Structural Variant (SV) Caller. Isaac SV Caller provides external candidate indels to Isaac Somatic Variant Caller for indels of size 50 and below.

Candidate indels are used for realignment of reads, during which each candidate indel is evaluated as a potential somatic indel. Any other types of indels are considered noise indels. If a better alignment is not found, these indels are allowed to remain in the read alignments; otherwise, they are not used.

The candidate indel thresholds are designed so that the joint candidate indel set is at least the combined set found if the Isaac Variant Caller is run on the individual samples. Specifically, where a minimum number of nominating reads is required for candidacy in Isaac Variant Caller, Isaac Somatic Variant Caller requires the same minimum number of nominating reads from the combined input.
Isaac Somatic Variant Caller also requires that at least one sample contain a minimum fraction of supporting reads among the sample reads for candidacy.

Realignment

For every read that intersects a candidate indel, the Isaac Somatic Variant Caller attempts to find the most probable alignments including the candidate indel and excluding the candidate indel. Typically, the alignment excluding the candidate indel aligns to the reference, but occasionally an alternate indel that overlaps or interferes with the candidate is found to be more likely. The indel caller uses the probabilities of both alignments as part of the indel quality score calculation, whereas only a single alignment (usually the most probable) is preserved for SNV calling.

Somatic Caller

The Isaac Somatic Variant Caller uses a Bayesian probability model similar to the one used for germline variant calling in the Isaac Variant Caller or in external tools such as GATK. Using this model, the objective is to compute the posterior probability P(\theta \mid D), which is the probability of the model state \theta conditioned on the observed sequencing data D.

In a germline variant caller, the state space of the model is conventionally a discrete set of diploid genotypes. For SNVs, the set of possible states is

\[ G = \{AA, CC, GG, TT, AC, AG, AT, CG, CT, GT\} \]

The Isaac Somatic Variant Caller model instead approximates continuous allele frequencies for each allele:

\[ f = \{f_A, f_C, f_G, f_T\} \]

The allele frequencies are restricted to allow a maximum of 2 nonzero frequencies. Any additional alleles observed in the data are treated as noise.

Another departure from typical germline calling methods is that the state space of the model is the allele frequency of both the tumor and the normal sample:

\[ \theta = (f_t, f_n) \]

In the equation above, f_t and f_n represent the allele frequencies of the tumor and normal samples, respectively.
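The two state spaces described above can be made concrete with a short Python sketch. It enumerates the 10 diploid SNV genotypes of the germline model, and it shows one hypothetical way to reduce observed base counts to an allele frequency vector with at most 2 nonzero entries. The helper name two_allele_frequencies, the use of raw read fractions as the frequency estimate, and the example counts are assumptions made for illustration only; the actual caller evaluates a Bayesian model over allele frequencies rather than taking read fractions directly.

```python
from itertools import combinations

BASES = "ACGT"


def diploid_genotypes() -> list[str]:
    """Enumerate the germline state space: 4 homozygous + 6 heterozygous."""
    homozygous = [b + b for b in BASES]
    heterozygous = [a + b for a, b in combinations(BASES, 2)]
    return homozygous + heterozygous


def two_allele_frequencies(base_counts: dict) -> tuple:
    """Keep the 2 best-supported alleles and treat the rest as noise.

    Returns (frequencies, noise_reads). Frequencies are simple read
    fractions among the 2 retained alleles (an assumption of this sketch).
    """
    ranked = sorted(BASES, key=lambda b: base_counts.get(b, 0), reverse=True)
    kept, dropped = ranked[:2], ranked[2:]
    total = sum(base_counts.get(b, 0) for b in kept)
    freqs = {
        b: (base_counts.get(b, 0) / total if b in kept and total else 0.0)
        for b in BASES
    }
    noise_reads = sum(base_counts.get(b, 0) for b in dropped)
    return freqs, noise_reads


if __name__ == "__main__":
    print(diploid_genotypes())
    # Hypothetical pileups at one site; theta pairs tumor and normal.
    f_t, _ = two_allele_frequencies({"A": 30, "C": 10, "G": 1})
    f_n, _ = two_allele_frequencies({"A": 40})
    theta = (f_t, f_n)
    print(theta)
```

In this hypothetical site, the tumor carries a C allele at a frequency that no diploid genotype describes well, which is the situation the continuous frequency model is meant to capture.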
The final somatic variant quality value reported by the model is computed from the probability that the allele frequencies are unequal (that is, f_t ≠ f_n) given the observed sequence data.

Post-Call Filtration

Heuristic filters remove several types of improbable calls resulting from data artifacts that cannot be easily represented in the somatic probability model. These filters act as a final step to separate out the final set of somatic calls reported by Isaac Somatic Variant Caller.

Input Data Filtration

Isaac Somatic Variant Caller uses 2 tiers of input data filtration during somatic small variant calling:
} Tier 1: A more stringent filtering to ensure high quality calls
} Tier 2: A lower filtration stringency

Initially, candidates are called using a subset of the data with the more stringent tier 1 filtering. If the method produces a nonzero quality score for any SNV or indel, the potential somatic variant is called again using data with the lower tier 2 stringency. The lower quality from the 2 tiers is selected for output. However, if the tier 2 quality is 0, the call is eliminated.

For somatic SNVs and indels, Isaac Somatic Variant Caller produces a general somatic quality score, Q(ssnv) or Q(somatic indel), which indicates the probability of the somatic variant. It also produces a joint quality score for the somatic variant and a specific normal genotype, Q(ssnv+ntype) or Q(somatic indel+ntype). The 2-tier evaluation is applied to each of these qualities separately, as follows:

    Q(ssnv) = min(Q(ssnv|tier1), Q(ssnv|tier2))
    Q(ssnv+ntype) = min(Q(ssnv+ntype|tier1), Q(ssnv+ntype|tier2))

The tier used for each quality value is provided in the Isaac Somatic Variant Caller output record for each somatic variant. If the most likely normal genotype is not the same at tier 1 and tier 2, then the normal genotype is reported as a conflict in the output.

Using 2 data tiers enables an initial somatic call based on high-quality data. Given a potential call, the second tier then discounts the call when lower quality data show support for the putative somatic allele in the normal sample.

The following table lists the primary data filtration levels that are changed between tier 1 and tier 2.

Table 7 Tiered Filtration Parameters

Parameter                                             Tier 1 Value   Tier 2 Value
Min paired-end alignment score                        20             0
Min single-end alignment score                        10             0
Single-end score rescue?                              No             Yes
Include unanchored pairs?                             No             Yes
Include anomalous pairs?                              No             Yes
Include singleton pairs?                              No             Yes
Mismatch density filter - max mismatches in window    3              10

Additional Filtration

Additional filters are applied after the somatic caller completes. A single candidate somatic call can be annotated with several filters.

Figure 5 Additional Filtration

Quality Filtration Levels

Only somatic calls originating from homozygous reference alleles in the normal sample are reviewed for validation and included in the output.
} Somatic SNVs are reported if the normal genotype is equal to the reference and Q(ssnv+ntype) ≥ 15.
} Somatic indels are reported if the normal genotype is equal to the reference and Q(somatic indel+ntype) ≥ 30.

NOTE
The value Q(ssnv+ntype) is associated with the VCF key QSS_NT. The value Q(somatic indel+ntype) is associated with the VCF key QSI_NT.

Copy Number Aberrations (SENECA)

The copy number aberrations module is also referred to as SENECA (SEnsitive detection of copy NumbErs in CAncer). It identifies copy number aberrations (CNAs) in heterogeneous tumor samples that exhibit contamination with normal tissues, aneuploidy, and loss of heterozygosity (LOH), all of which can confound correct copy assignment and lead to erroneous CNA calls.

The algorithm workflow comprises 2 distinct steps:
} Segmentation of data into regions with putatively distinct copy numbers.
} Calculation of ploidy and purity with a final copy number assignment.

As input, SENECA uses aligned sequences from tumor and matched normal samples (in *.bam format) and annotation information about the location of known variants in dbSNP, regional alignability, and the location of gaps in dbSNP.

Segmentation

SENECA is a count-based method to assign copy number state. It compares coverage between tumor and normal samples. Specifically, it bins read coverage using nonoverlapping 1 kb windows to derive counts in tumor and normal samples, and it then takes the ratio of the 2 counts. Bins are skipped during segmentation when they overlap low alignability regions in more than 20% of their size. Independently, SENECA calculates B allele ratios at dbSNP positions from the tumor BAM file, and it keeps only SNVs that are heterozygous in the corresponding normal sample. Segmentation is carried out independently for copy number and B allele ratios.

Ploidy and Purity Calculation

Following segmentation, SENECA performs ploidy and purity calculations. These calculations are based on the principle that, for each value of ploidy and purity and a selected copy number, the values of the B allele and read count ratios can be inferred. For example, for copy number state 1 (1 deleted allele of a diploid genome), the B allele ratio in a pure tumor is always near 0 because only 1 allele is present. However, if a tumor sample has only 70% purity because of the presence of the normal genome as background, the B allele ratio increases due to the presence of a heterozygous normal allele. The low percentage of purity results in a final B allele ratio of 0.15.

SENECA fits a multivariate Gaussian distribution to copy data and B allele ratio data on a two-dimensional grid of varying ploidy and purity. On the grid, each state encodes ploidy and purity values. In addition, SENECA uses a separate state encoding copy neutral LOH and copy gain LOH to identify loss-of-heterozygosity events. The ploidy and purity associated with the model having the highest log-likelihood are then used to assign a copy number state to each segment.

When both segments and copy numbers are estimated, a quality score for copy number assignment is computed using a likelihood ratio test. This test compares the likelihood of the current copy number assignment to the likelihood of assigning 1 more or 1 fewer copy. Results of the likelihood ratio test are then reported as a Q-score field in the VCF file using the following transformation:

\[ 2 \log(s_1 / s_2) \]

where s_1 is the sum of squares for the selected model and s_2 is the sum of squares for the next nearest model. A Q-score threshold of 1.5 provides a good trade-off between sensitivity and specificity.

Technical Assistance

For technical assistance, contact Illumina Technical Support.

Table 8 Illumina General Contact Information

Website   www.illumina.com
Email     [email protected]

Table 9 Illumina Customer Support Telephone Numbers

Region          Contact Number    Region            Contact Number
North America   1.800.809.4566    Italy             800.874909
Australia       1.800.775.688     Netherlands       0800.0223859
Austria         0800.296575       New Zealand       0800.451.650
Belgium         0800.81102        Norway            800.16836
Denmark         80882346          Spain             900.812168
Finland         0800.918363       Sweden            020790181
France          0800.911850       Switzerland       0800.563118
Germany         0800.180.8994     United Kingdom    0800.917.0041
Ireland         1.800.812949      Other countries   +44.1799.534000

Safety Data Sheets

Safety data sheets (SDSs) are available on the Illumina website at support.illumina.com/sds.html.

Illumina
San Diego, California 92122 U.S.A.
+1.800.809.ILMN (4566)
+1.858.202.4566 (outside North America)
[email protected]
www.illumina.com