Download manual - CLC Cancer

CLC Cancer Research Workbench
APPLICATION BASED MANUAL
Manual for
CLC Cancer Research Workbench 1.5
Windows, Mac OS X and Linux
November 18, 2014
This software is for research purposes only.
CLC bio, a QIAGEN Company
Silkeborgvej 2
Prismet
DK-8000 Aarhus C
Denmark
Contents
I Introduction
1 Welcome to CLC Cancer Research Workbench
1.1 Introduction to CLC Cancer Research Workbench
1.2 Available documentation
1.3 The material covered by this manual
1.4 We welcome your comments and suggestions
1.5 Contact information
2 Introduction to user interface, workflows, and tracks
2.1 The start screen
2.2 The user interface
2.3 Workflows - an overview
2.4 The track format
II Applications - ready-to-use workflows
3 Getting started
3.1 Reference data
3.2 Create new folder
3.3 Import data
4 Preparing Raw Data
4.1 Prepare sequencing data - all application types
4.2 Analysis of sequencing data
5 Whole genome sequencing (WGS)
5.1 Automatic analysis of sequencing data (WGS)
5.2 Identify Variants (WGS)
5.3 Annotate Variants (WGS)
5.4 Filter Somatic Variants (WGS)
5.5 Identify Somatic Variants from Tumor Normal Pair (WGS)
5.6 Identify Known Variants in One Sample (WGS)
6 Whole exome sequencing (WES)
6.1 Automatic analysis of sequencing data (WES)
6.2 Identify Variants (WES)
6.3 Annotate Variants (WES)
6.4 Filter Somatic Variants (WES)
6.5 Identify Somatic Variants from Tumor Normal Pair (WES)
6.6 Identify Known Variants in One Sample (WES)
6.7 Identify and Annotate Variants (WES)
7 Targeted amplicon sequencing (TAS)
7.1 Automatic analysis of sequencing data (TAS)
7.2 Identify Variants (TAS)
7.3 Annotate Variants (TAS)
7.4 Filter Somatic Variants (TAS)
7.5 Identify Somatic Variants from Tumor Normal Pair (TAS)
7.6 Identify Known Variants in One Sample (TAS)
7.7 Identify and Annotate Variants (TAS)
8 Whole Transcriptome Sequencing (WTS)
8.1 Automatic analysis of RNA-seq data
8.2 Analysis of multiple samples
8.3 Annotate Variants (WTS)
8.4 Compare variants in DNA and RNA
8.5 Identify Candidate Variants and Genes from Tumor Normal Pair
8.6 Identify variants and add expression values
8.7 Identify and Annotate Differentially Expressed Genes and Pathways
III Customized data analysis
9 How to edit application workflows
9.1 Introduction to customized data analysis
9.2 How to edit preinstalled workflows
10 Using data from other workbenches
10.1 Open outputs from other workbenches
IV Plugins
11 Plugins
V Appendix
A Reference data overview
B Mini dictionary
Bibliography
VI Index
Part I
Introduction
Chapter 1
Welcome to CLC Cancer Research Workbench
Contents
1.1 Introduction to CLC Cancer Research Workbench
1.2 Available documentation
1.3 The material covered by this manual
1.4 We welcome your comments and suggestions
1.5 Contact information
High throughput sequencing is currently revolutionizing both the cancer research and diagnostics
areas. Since the introduction of "next generation sequencing" (NGS) technologies, the field has
quickly moved forward, with rapid improvements in sequencing capacity and the time required for
data production. As a result, in many studies the sequencing process is no longer the bottleneck.
The bottleneck now is the bioinformatic analysis of the data.
CLC Cancer Research Workbench has been developed to address the bioinformatic bottleneck
by offering automated workflows that cover all steps from the initial data processing and quality
assurance through data analyses, annotation, and reporting.
1.1 Introduction to CLC Cancer Research Workbench
CLC Cancer Research Workbench has been developed specifically for cancer research.
A core part of the CLC Cancer Research Workbench is the ready-to-use workflows that are
bundled with reference data. Workflows have been developed for the following applications:
• Whole Genome Sequencing
• Whole Exome Sequencing
• Targeted Amplicon Sequencing
1.2 Available documentation
The documentation for CLC Cancer Research Workbench can be found here: http://clccancer.com/software/#downloads.
Two manuals are available for CLC Cancer Research Workbench:
• The CLC Cancer Research Workbench application based manual. This relatively short
manual gives a basic introduction to CLC Cancer Research Workbench, which includes a
section on how to get started, as well as describing how to use the different ready-to-use
workflows for analysis of different types of sequencing data.
• The CLC Cancer Research Workbench reference manual. This comprehensive manual
explains the features and functionalities of the CLC Cancer Research Workbench in detail.
If you would like to use a CLC Server, there are three additional manuals that are relevant:
• The CLC Cancer Research Server Plugin manual. This short manual gives a description
of how to get started using the CLC Cancer Research Server Plugin. We recommend that
this manual is used in combination with the "CLC Genomics Server administrator manual"
and the "CLC Genomics Server end user manual" if you are not already familiar with CLC
Genomics Server.
• The CLC Genomics Server administrator manual. This manual is for server administrators
and describes how to install and manage CLC Genomics Server.
• The CLC Genomics Server end user manual. This manual is for the users of the CLC Server. In
this manual you can find a description of how to use a CLC Server from a CLC Workbench.
1.3 The material covered by this manual
This user manual provides introductory material on how to work with the software, including the
import and initial handling of data and a guide to the data types and user interface. Its main
focus is to provide guidance on how to use the workflows that come with the software.
Also included is an appendix where there is a table listing the available reference data as well as
a small dictionary of terminology used in the CLC Cancer Research Workbench. The dictionary is
not exhaustive, but we hope it will serve as a useful reference, especially for new users.
For comprehensive descriptions of the features and functionalities of the individual tools, please
refer to the CLC Cancer Research Workbench reference manual.
1.4 We welcome your comments and suggestions
We aim to provide user-friendly software for important analyses, such as identifying inherited
disease traits and identifying somatic mutations that underlie this complex disease. To this end,
we continuously develop our bioinformatic platform, expand the collection of research tools, and
extend our documentation resources. We welcome any comments or suggestions you have. These
help us greatly in further developing and improving our software. Comments and suggestions
can be submitted directly from within the CLC Cancer Research Workbench using the menu option:
Help | Contact Support
1.5 Contact information
The CLC Cancer Research Workbench is developed by:
CLC bio, a QIAGEN Company
Silkeborgvej 2
Prismet
8000 Aarhus C
Denmark
http://www.clcbio.com
VAT no.: DK 28 30 50 87
Telephone: +45 70 22 32 44
Fax: +45 86 20 12 22
E-mail: [email protected]
If you have questions or comments regarding the program, you can contact us through the support team as described here:
http://www.clcsupport.com/clcgenomicsworkbench/current/index.php?manual=Getting_help.html.
Chapter 2
Introduction to user interface, workflows, and tracks
Contents
2.1 The start screen
2.1.1 The getting started table
2.1.2 Import of example data
2.2 The user interface
2.2.1 The Toolbox
2.3 Workflows - an overview
2.4 The track format
2.4.1 Track types
2.4.2 The Genome Browser
This section introduces the CLC Cancer Research Workbench general features and functionalities,
including the user interface and a general introduction to workflows and tracks. The information in
this chapter underpins that of later chapters and is highly recommended for new users of the Workbench. You can find more detailed information in the CLC Cancer Research Workbench reference
manual, which can be found online at http://clccancer.com/software/#downloads/.
2.1 The start screen
When you start up the CLC Cancer Research Workbench, you should see an image like the one in
figure 2.1. The information in the left hand panes will differ, depending on what data you already
have available and any plugins you may have installed.
2.1.1 The getting started table
When no data has been opened for viewing, a table is visible in the View Area of the Workbench.
This table provides links to sections of the application based user manual, and is thus a simple
and fast way to access information about using the CLC Cancer Research Workbench.
Currently CLC Cancer Research Workbench can be used to analyze DNA sequencing data. Analysis
of RNA sequencing data is planned for a future release.
Figure 2.1: The Cancer Research Workbench start up window.
In this section, we take a closer look at the table in the viewing area (figure 2.2).
Figure 2.2: The table in the Cancer Research Workbench, visible when no datasets have been
opened for viewing, provides links so that you can quickly navigate to relevant sections of the
application based manual. To the right hand side of the table, the "Getting Started" and "Explore
and Learn" areas provide links to more general information resources that you may find useful.
Summary stages in data analysis are listed at the left side of the table: Data Preparation, Data
Analysis, Interpretation, and Data Analysis and Interpretation. Click on the text in the table to
open the relevant section in the application based manual.
The recommended way to use the table is to start at the top and click on one of the "Whole
genome", "Whole exome", or "Targeted" tabs found under the big "DNA" label if you are working
on DNA-seq data. This acts to select the relevant application area. This done, when you click on
a link within the "DNA" section of the table, you will be directed to the section in the application
based manual about that topic, for example, "Annotate Variants" that applies to that particular
application area, for example, "Whole genome analysis". Likewise, if you work on RNA-seq data,
you can find relevant manual entries with the links provided under the big "RNA" label.
To the right side of the table is a box with two sections: "Getting started" and "Explore and
learn". The "Getting started" area contains links to the Tutorials (http://clccancer.com/tutorials/),
the Full-length application based manual (PDF), and the Full-length reference manual
(PDF) (http://clccancer.com/software/#downloads). The "Explore and Learn" section
provides links to different sections of the application based manual as well as a link to a web
page where you can download example data.
Finally, the Download example data link provides access to two different example data sets. This is
described in section 2.1.2.
2.1.2 Import of example data
It might be easier to understand the logic of the program by trying to do simple operations on
existing data. Therefore CLC Cancer Research Workbench includes an example data set.
If you would like to download the example data you have three options:
1. You can click Download Example Data in the start up table that is visible in the CLC Cancer
Research Workbench when no datasets have been opened for viewing. This will take you to
http://clccancer.com/downloads/ where you can choose to download two different
example datasets that can be used for the following purposes:
• Variant identification in a tumor sample. This dataset is taken from a larger whole
exome dataset and includes data from a small fraction of chromosome 5 (Example_data_tumor.zip).
• Identification of somatic variants in a tumor sample using the matched normal sample
for removal of germline variants. This is matched tumor and normal samples from
chromosome 22 from a whole exome dataset (Example_data_tumor_normal.zip).
2. You can also go directly to http://clccancer.com/downloads/ and download the
example data from there.
3. Finally, you can use these links to get the data:
http://download.clcbio.com/testdata/cancer/current/Example_data_tumor.zip or
http://download.clcbio.com/testdata/cancer/current/Example_data_tumor_normal.zip
When you have downloaded the data from the website, you need to import them into the CLC
Cancer Research Workbench. How to import data is described in section 3.3.
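For scripted setups, the two direct links listed above can also be fetched outside the Workbench. The following is a minimal sketch in Python and is not part of the product: the helper names and the local target folder are assumptions made for illustration, and only the base URL and file names come from the links above.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Base URL and file names are taken from the direct links listed above.
BASE_URL = "http://download.clcbio.com/testdata/cancer/current/"
EXAMPLE_FILES = ("Example_data_tumor.zip", "Example_data_tumor_normal.zip")


def example_data_urls():
    """Return the full download URL for each example dataset."""
    return [BASE_URL + name for name in EXAMPLE_FILES]


def download_example_data(target_dir="example_data"):
    """Download both archives into target_dir (a hypothetical folder name)."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in example_data_urls():
        # Use the last path component of the URL as the local file name.
        destination = target / url.rsplit("/", 1)[-1]
        urlretrieve(url, str(destination))  # performs a network request
        saved.append(destination)
    return saved
```

Calling download_example_data() would place both archives in the chosen folder. The archives still have to be imported into the Workbench as described in section 3.3.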
2.2 The user interface
The CLC Cancer Research Workbench user interface includes the Toolbox, Navigation Area,
Menu Bar, Toolbar, Side Panel, View Area, View Tools, and Status Bar (figure 2.3).
Figure 2.3: At the top you find the Menu Bar and, under that, the Toolbar. The Navigation Area
is on the left. Here you can view and organize your data, open data for viewing, and select data
for launching in applications. Saved data will appear within this area. The Toolbox
is available in two locations in the Workbench. One is in a tab of the pane below the Navigation
Area. The other is via the menu system. The Toolbox is where Workflows and most tools that play a
role in your data analysis are launched from. When opened, datasets are shown in the View Area,
along with a Side Panel that allows you to customize the viewing options and navigate
to specific areas of the data. At the bottom right of a data view are the View Tools, which
can be used for panning, zooming, and selecting specific regions. At the bottom left are
icons that let you view the data in a different way, for example as a table or as
the history of actions taken on that dataset. The Status Bar in the lower right corner indicates the
location of a selection you have made, or where the mouse pointer is pointing, within a dataset
that has coordinates, such as a track or sequence.
After a dataset is opened, for example by double-clicking on an item in one of the folders visible
in the Navigation Area, the user interface will look similar to that shown in figure 2.3. Each
dataset in the View Area will have an associated Side Panel, Status Bar, and a set of View
Tools.
The Side Panel, Status Bar, and View Area are only visible when data are open for viewing. When
no datasets are open, the view is like that in figure 2.1.
To learn more about the specific areas and functionalities of the user interface, please refer
to the CLC Cancer Research Workbench reference manual, which can be found here:
http://clccancer.com/software/#downloads.
2.2.1 The Toolbox
Here, we focus on the organization of the Toolbox. The first thing to note is the top level folders and their
associated icons (see figure 2.4).
Figure 2.4: The top level folders of the Toolbox are divided into two main categories: the "Ready-to-Use Workflows" and the "Tools". The elements under the folders of the "Tools" section can be used
for manual analysis or used for editing existing workflows and building your own workflows.
The Toolbox contains two different categories of tools: 1) the Ready-to-Use Workflows, which can
be used to run complete analyses, and 2) Tools, containing many individual tools that can be
used for analysis by themselves, used to build workflows, or added
to existing workflows to expand their functionality. The names of the folders in the Ready-to-Use
Workflows section reflect the type of analysis the workflows in that folder are designed for (see
figure 2.5).
Manual data analysis, that is, execution of individual analysis steps, can be performed using
the tools contained in the Tools section. Full analyses can be run this way, or such tools can
be used upstream or downstream of workflow-based data analyses. The tools that are relevant for
different types of data analysis will vary depending on the questions being asked of the data. In
section 2.3 we will use diagrams and examples to illustrate how different tools and workflows
can be used for data analysis.
Figure 2.5: Each application type has its own set of ready-to-use workflows.
For a detailed description of the individual tools we refer to the CLC Cancer Research Workbench
reference manual (http://clccancer.com/software/#downloads).
2.3 Workflows - an overview
CLC Cancer Research Workbench offers a number of analysis workflows, also referred to here as
the pre-installed ready-to-use workflows, which include all the necessary steps for a particular
analysis, from the initial quality checking and trimming of the reads to the final reporting of the
results, for example, the disease causing mutations detected in an analysis. The workflows are
easy to use and just require the sequence data as input. You may need to provide additional
information relevant to your data and analysis to run a given workflow, for example adapter trim
lists for trimming sequences, or, when performing "Targeted Amplicon Sequencing", a description
of the sequenced regions.
Irrespective of the type of sequencing data you wish to analyze, only a few steps are necessary
before the identified variants are available for your inspection. A schematic representation of the
flow that an analysis could take is shown in figure 2.6.
Figure 2.6: A basic example of the flow of steps for a sequencing data analysis. The data is
first imported into the Workbench. Then it should be prepared for analysis. Here, a ready-to-use
workflow labeled workflow 1 is used for this. It runs quality control and trimming steps. After
inspection of the quality and trimming reports, the trimmed data are used as input for another
ready-to-use workflow, called workflow 2 in this figure. This is where the data analysis is carried
out. Here, workflow choices associated with variant detection are shown. Additional analyses
can be performed downstream of this if desired. Downstream analysis could involve using another
ready-to-use workflow or could involve running individual tools from the Tools section of the Toolbox.
Which ready-to-use workflows to run, and how many of them, depends on the type of data you
have and the analysis you wish to perform. For example, overlapping paired data involves other
considerations than single or non-overlapping paired data. Different workflows will be relevant
if your aim is to detect variants or annotate variants with information from other databases.
Typically you will need to run two or three workflows to complete a full analysis that includes
preparation of the raw data.
The ready-to-use workflows can be divided into four categories:
1. Preparing Raw Data The overall purpose of this step is to perform quality control (QC)
of the reads, trim the reads whenever relevant, and when working with reads containing
overlapping pairs, merge the reads at this step. At this step you must choose the
appropriate workflow based on the read types you are working with.
The available "Preparing Raw Data" ready-to-use workflows are:
• Prepare Overlapping Raw Data: Performs quality control and trimming of the sequencing
reads and merges overlapping read pairs. This workflow generates five different
outputs:
QC graphic report
QC supplementary report
Trimming report (the trimmed sequences will be used directly and automatically
as input for the merging of paired reads step).
Merged reads output
Not merged reads output
• Prepare Raw Data: Performs quality control and trimming of the sequencing reads.
This workflow generates five different outputs:
QC graphic report
QC supplementary report
Trimming report
Trimmed sequences output
Trimmed sequences (broken pairs) output
2. Data analysis This includes the identification and calling of variants. The "Identify Variants"
workflow performs read mapping and variant calling. The workflow also includes a quality
control of the read mapping and removal of false positives. Optionally you can choose to
extend your analysis with an "interpretation" step.
The available tool for data analysis is:
• Identify Variants
3. Interpretation At this step you can annotate, filter, and compare the variants that were
identified in the data analysis step.
The available tools for data interpretation are:
• Annotate Variants
• Filter Somatic Variants
• Identify Somatic Variants from Tumor Normal Pair
4. Data analysis and interpretation This type of workflow combines both data analysis and the
interpretation and includes variant calling, annotation, filtering and comparison of variants.
The available tools for data analysis and interpretation are:
• Identify and Annotate Variants
• Identify Known Variants in One Sample
You can find a detailed description of what the individual workflows can be used for in section
4.2.
Figure 2.7 shows all the ready-to-use workflows available for each application. Irrespective of the
application type, the first step involves preparation of the raw data. The ready-to-use workflow to
choose to launch the data preparation depends on the type of data being analyzed. For example,
the "Prepare Overlapping Raw Data" workflow is designed to handle reads with overlapping pairs,
whereas the "Prepare Raw Data" workflow is for read sets without overlapping pairs. The initial
data preparation step involves quality control and trimming of the reads.
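The choice between the two preparation workflows can be summarized as a small decision rule. The sketch below is only an illustration in Python: the workflow names and output names are those listed in this section, while the function name and the dictionary layout are assumptions and do not correspond to any scripting interface of the Workbench.

```python
# Outputs per preparation workflow, as listed in this section.
PREPARATION_OUTPUTS = {
    "Prepare Overlapping Raw Data": [
        "QC graphic report",
        "QC supplementary report",
        "Trimming report",
        "Merged reads output",
        "Not merged reads output",
    ],
    "Prepare Raw Data": [
        "QC graphic report",
        "QC supplementary report",
        "Trimming report",
        "Trimmed sequences output",
        "Trimmed sequences (broken pairs) output",
    ],
}


def choose_preparation_workflow(has_overlapping_pairs):
    """Pick the preparation workflow from the read type (hypothetical helper)."""
    if has_overlapping_pairs:
        return "Prepare Overlapping Raw Data"
    return "Prepare Raw Data"


workflow = choose_preparation_workflow(has_overlapping_pairs=True)
expected_outputs = PREPARATION_OUTPUTS[workflow]
```

Here expected_outputs holds the five outputs to look for in the Navigation Area after the chosen workflow has run.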
Figure 2.7: The available pre-installed ready-to-use workflows for the individual application types.
2.4 The track format
The CLC Cancer Research Workbench provides a built-in Genome Browser. This view allows
the reference sequence to be displayed together with other data provided in a so-called track
format. One of the big advantages of using tracks is that they allow visualization, comparison,
and analysis of genome-scale studies, with all the information tied to genomic positions. A central
coordinate-system, provided by a reference genome, makes it possible to view and compare
different datasets together in a Genome Browser view. Of course, each track can be viewed
individually if desired.
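The idea of tying every dataset to positions on one reference can be illustrated with a small sketch. The Python code below is not how the Workbench stores tracks; the class, the function, the track names, and all coordinates are hypothetical and serve only to show why a shared coordinate system makes different datasets directly comparable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    chromosome: str
    start: int  # 1-based, inclusive (assumed convention for this sketch)
    end: int    # 1-based, inclusive
    label: str


def features_in_region(tracks, chromosome, start, end):
    """Return, per track name, the features that overlap the given region."""
    hits = {}
    for name, features in tracks.items():
        hits[name] = [
            f for f in features
            if f.chromosome == chromosome and f.start <= end and f.end >= start
        ]
    return hits


# Made-up example data: two tracks on the same reference coordinates.
tracks = {
    "genes": [Feature("5", 100, 900, "GENE_A"), Feature("5", 2000, 2600, "GENE_B")],
    "variants": [Feature("5", 450, 450, "SNV G>T"), Feature("22", 450, 450, "SNV A>C")],
}

# One query region returns matching features from every track at once.
hits = features_in_region(tracks, "5", 400, 500)
```

Because both tracks use the same coordinates, a single region query shows that the hypothetical variant at position 450 lies inside GENE_A.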
2.4.1 Track types
Several different track types are available. To make it easier to recognize the different track types
in the Navigation Area and in the View Area, each track type is associated with a specific icon:
• Coverage graph
• Read mapping
• Reference genome sequence
• Annotation track
• Genome browser view
• Variants from variant calling
• Expression track
• Differentially expressed genes
2.4.2 The Genome Browser
The Genome Browser view is a collection of tracks. Each track in a Genome Browser view is tied
to the same underlying genomic co-ordinate set, making visualization and comparison of different
results and data types simple and intuitive.
Annotations and variant information are provided together with the human reference genome via
our Data Management. Datasets, e.g. in GFF or VCF format, from resources not provided for
download by CLC Cancer Research Workbench can be imported into the Navigation Area using the
import option found in the toolbar:
Toolbar | Import | Tracks
To illustrate this, a Genome Browser view is shown in figure 2.8 to figure 2.13. It consists of
the following tracks, all tied to the human hg19 reference: genomic sequence, gene, coding
sequence (CDS), a read mapping, and variants. In figure 2.8 we have used the zoom tools to
zoom all the way in on a SNV that is found in a coding region.
A Genome Browser view like the one shown in figure 2.8, allows for a complete overview of reads
mapped to a reference and identified variants. You can see how many reads and variants you
have, and you can compare them to the complete human genome, genes and coding regions.
How to zoom in a Genome Browser view
Figure 2.8: A Genome Browser view with a genomic sequence track, a gene track, a coding
sequence (CDS) track, a read mapping track, and a variant track.
One way to zoom in to take a closer look at the reads and variants is to use the zoom tools.
These are located in the lower right corner of the view area (see figure 2.9). Click and hold
down the mouse button for a second or two on the relevant icon. This can be either an arrow
or a magnifying glass. By clicking the magnifying glass icon, three icons will appear. These can
be used for zooming in, zooming out, or panning. The different zoom options are described in
detail in the CLC Cancer Research Workbench reference manual in the section entitled "Zoom
and selection in View Area".
Figure 2.9: Click and hold down the mouse button for a second or two on the magnifying glass
icon until additional icons appear. Select the arrow to activate the "selection" tool. This can be
used to select user-defined regions.
A quick and easy way to zoom in on a particular region is to first use the selection tool, which
is activated by clicking on the arrow shown in figure 2.9. You can then select specific regions
by clicking on the relevant point in the track and, keeping the mouse button depressed, dragging
across the area that you wish to zoom in on. This selects the region. Once selected, you can use
the "Zoom to selection" tool (shown in figure 2.10) to zoom in on the selected region.
It is also possible to zoom in just using the mouse: hold down the "Alt" key while scrolling with
the mouse wheel. This zooms in (or out) on the region that is in focus in the View Area.
When clicking on the "Zoom to selection" icon, you will zoom in on the region that you have
selected, and you will be able to see more and more details as you zoom in. This is shown in
figure 2.11 and figure 2.12.
Figure 2.10: The "Zoom to selection" tool can be used to zoom in on a selected region. Next to the
"Zoom to selection" icon you can find the "Zoom to fit" icon that can be used to zoom all the way
out. The "Zoom slider" on the left side of the "Zoom to selection" can also be used to zoom in and
out.
In figure 2.12 the presence of SNVs can be seen in the variant track, and the read mapping
at that region can be examined in more detail.
To expand the depth of the reads track to view more details of the reads in a specific region,
simply place the mouse cursor near the bottom of the left side of the Genome Browser view,
where the track names are, hold down the mouse and drag downwards. This is illustrated in
the lower left side of figure 2.12. Here, the blue line with the arrow under it (within a red circle)
illustrates where you would place the mouse cursor to be able to expand the depth of the track.
In this figure, the four bases in the genomic reference sequence can be discerned via the color
coding. The color codes for each of the bases are: A=red, C=blue, G=yellow, and T=green.
Particular SNVs can also be discerned at this zoom level. The color of the reads indicates
whether a read is part of an intact pair (blue), is a single read or a member of a broken pair
mapped in the forward direction relative to the reference (green), or a single read or a member
of a broken pair mapped in the reverse direction relative to the reference (red). Reads that could
map equally well to other locations in the reference are colored yellow.
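The color scheme described above can be written down as a small lookup. This is a sketch in Python for reference only: the colors come from the text, while the function name, its parameters, and the order in which the conditions are checked (ambiguous mapping first) are assumptions.

```python
# Base colors in the reference sequence track, as stated above.
BASE_COLORS = {"A": "red", "C": "blue", "G": "yellow", "T": "green"}


def read_color(in_intact_pair, maps_forward, maps_equally_well_elsewhere=False):
    """Return the display color of a read (hypothetical helper).

    Assumption: a read that maps equally well elsewhere is shown in yellow
    regardless of its pairing status.
    """
    if maps_equally_well_elsewhere:
        return "yellow"
    if in_intact_pair:
        return "blue"
    # Single reads and members of broken pairs are colored by direction.
    return "green" if maps_forward else "red"
```

For example, a member of a broken pair mapped in the reverse direction gives read_color(False, False), which evaluates to "red".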
Figure 2.13 shows the view after zooming in on one specific SNV. By looking at the other tracks
at that point, we can see that this SNV is found in a gene. The tooltip, which comes up when
the mouse cursor hovers over the SNV in the variant track, reveals that this is a heterozygous
mutation occurring in 29 out of 447 reads. Full details about the variants in a track are shown in
the table view of the track, as described in the next section.
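To put the read counts from the tooltip in perspective, the fraction of reads carrying the variant can be computed directly. The short Python sketch below uses the two numbers quoted above; the function name is an assumption, and the calculation is plain arithmetic rather than a description of how the Workbench derives its own frequency values.

```python
def variant_read_fraction(variant_reads, total_reads):
    """Fraction of reads at a position that carry the variant."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return variant_reads / total_reads


# Counts taken from the tooltip example: 29 out of 447 reads.
fraction = variant_read_fraction(29, 447)
print(f"{fraction:.1%}")  # prints 6.5%
```

Roughly 6.5% of the reads covering this position therefore support the variant.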
How to open a table in split view
The table view of a track provides the details of the information
that is presented in the track itself. It is often useful to view the table at the same time as the
track. This is done by opening the table in a split view.
From an individual track open in the View Area of the Workbench, this can be done by
holding down the Ctrl key and clicking on the small table icon at the bottom
of the view.
From a Genome Browser view open in the View Area, the table view of a particular track can
be opened in a split view by double-clicking on the track name in the list. This is shown in
figure 2.14.
The table and the track are linked, which means that clicking on a particular row in the table
brings that position into focus in the Genome Browser view. For example, if you wished to jump
to a particular SNV in the Genome Browser view, you could click on the row in the variant track
table. This is shown in figure 2.15.
Add tracks to a Genome Browser view
Figure 2.11: When zooming in on a selected region more and more details become visible. In this
image, the individual genes are visible. To distinguish the individual exons, you would have to zoom
in a bit further.
The simplest way to add a track to the Genome Browser view is to locate the file in the
Navigation Area, click on the file while holding down the mouse button and drag it into the Genome
Browser view in the View Area. When you drop the file in the Genome Browser view, the track
will be added to the Genome Browser view (figure 2.16).
Note! After a new track has been added to the Genome Browser view, an asterisk appears on
the Genome Browser view tab. This indicates that the Genome Browser view must be saved if
you wish to keep the track that has been added.
Figure 2.12: Zooming in reveals more details in all tracks.
Figure 2.13: We have now zoomed in on one specific SNV that is found in a coding region. When
you hold the mouse over the variant, a tooltip appears that provides further information about the
specific variant. In this case we have found a heterozygous SNV. The normal base at this position
is G but in some of the reads you will see a "T". Actually you can only see one "T" in the reads,
but if you look in the stacked reads, which are those in the color mass where you cannot see each
individual read represented, there are four green lines (red box) indicating that there are Ts at this
position in more reads. When holding the mouse over an individual SNV, as highlighted in the red
circle, a tooltip will appear with information about the SNV. This tooltip informs us that 29 Ts are
observed in the 447 reads covering this particular position. When hovering the mouse cursor over
a particular base in the reference track, the genomic position for this base is shown, as highlighted
with a red arrow here.
Figure 2.14: Double-click on the track name in the left side of the view area to open the table
view shown in split view. When opening a track directly from the genome browser view, the table
and track are linked. Hence, when selecting a row in the table by clicking on this row, this specific
position in the track will be brought into focus.
Figure 2.15: When you click on an entry in the table this position will automatically be brought into
focus. Here, a row with information about an MNV, which is a variant consisting of two or more SNVs,
was clicked on. This brought the location of that MNV into focus in the graphical view. To jump
directly to a detailed view of a position, zoom the graphical view to the desired level first and then
click on the row in the table view.
Figure 2.16: The COSMIC track has been added to the Genome Browser view by dragging the track
from the Navigation Area into the Genome Browser view in the View Area.
Part II
Applications - ready-to-use workflows
Chapter 3
Getting started
Contents
3.1 Reference data
  3.1.1 The Workbench Reference data location
  3.1.2 Space requirements
  3.1.3 Where reference data is downloaded from
  3.1.4 Download and configure reference data
  3.1.5 Troubleshooting reference data downloads
3.2 Create new folder
3.3 Import data
  3.3.1 How to import data
3.1
Reference data
The ready-to-use workflows rely on the presence of particular reference datasets. This reference
data must be downloaded and configured before these workflows can be used. The tools in
the Workbench make it easy to download the necessary data such that the workflows can find
and use it. This section covers the download and configurations needed to make available the
reference data relevant to the CLC Cancer Research Workbench, including the human genome
reference, annotations and variants made available by a variety of databases.
3.1.1
The Workbench Reference data location
Reference data must be stored in a folder called CLC_References. When the CLC Cancer
Research Workbench is installed, such a folder is created on your file system under your home
area. This folder is specified within the Workbench as a reference location.
You can specify a different location to download reference data to. This is recommended if you
do not have enough space in the area the Workbench designates as the reference data location
by default. To change the reference data location from within the Navigation Area:
Right-click on the folder "CLC_References" | Choose "Location" | Choose "Specify
Reference Location"
The new folder will also be called CLC_References, but will be located where you specify.
In more detail, this action results in the following:
• A folder called CLC_References is created in the location you specified, if a folder of this
name did not already exist.
• The Workbench sets this new location as the place to download reference data to and the
place the ready-to-use workflows should look for reference data.
This action does not:
• Remove the old CLC_References folder.
• Remove the contents of the old CLC_References folder, such as previously downloaded
data.
If you have previously downloaded data into the CLC_References folder in the old location, you
will need to use standard system tools to delete this folder and/or its contents. If you would
like to keep the reference data from the old location, you can move it, using standard system
tools, into the new CLC_References folder that you just specified. This saves you from having
to download it again.
Note! If you run out of space, and realize that the CLC_References folder should be stored somewhere
else, you can do this by choosing a new location, then manually moving the already downloaded
files to that new location, and restarting the Workbench. The "downloaded references" file will
then be updated with all the new references.
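The manual leaves the move itself to "standard system tools". As one possible way of doing it, the Python sketch below moves everything from an old CLC_References folder into a new one without overwriting anything that already exists there. The function name `move_reference_data` and all paths are hypothetical, and the demo runs on throw-away temporary directories rather than on real reference data.

```python
import shutil
import tempfile
from pathlib import Path


def move_reference_data(old_dir: Path, new_dir: Path) -> list:
    """Move every item in old_dir into new_dir and return the names that were moved."""
    new_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for item in sorted(old_dir.iterdir()):
        target = new_dir / item.name
        if target.exists():
            # Never overwrite data that is already present in the new location.
            continue
        shutil.move(str(item), str(target))
        moved.append(item.name)
    return moved


# Demo on temporary directories (placeholder content, not real reference data).
with tempfile.TemporaryDirectory() as tmp:
    old = Path(tmp) / "old" / "CLC_References"
    new = Path(tmp) / "new" / "CLC_References"
    (old / "example_dataset").mkdir(parents=True)
    (old / "example_dataset" / "data.bin").write_bytes(b"placeholder")
    moved = move_reference_data(old, new)
    moved_ok = (new / "example_dataset" / "data.bin").exists()

print(moved, moved_ok)
```

After moving real data in this way, the Workbench should be restarted, as described in the note above.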
3.1.2
Space requirements
The total size of the complete reference data set you can download is approximately 12 GB (size
as estimated in April 2014). It is in a zipped format, and the total size after the data is unzipped
is substantially larger: about 75 GB (also as estimated in April 2014). The amount of time it will
take to download this amount of data depends on your network connection. It can take several
hours, or longer on slower connections.
For reference, in April 2014, the size of each individual reference data file was approximately:
Database         Size
1000 Genomes     10 GB
CDS              49 MB
ClinVar          41 MB
PhastCons        5 GB
COSMIC           372 MB
dbSNP            44 GB
dbSNP Common     12 GB
Genes            3 MB
Gene Ontology    33 MB
HapMap           3 GB
mRNA             62 MB
Sequence         683 MB
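Before starting the download it can be useful to check whether the chosen location has room for the data. The sketch below adds up the sizes from the table above and compares the total with the free space reported by the operating system. It assumes 1 GB = 1000 MB; the names `REFERENCE_SIZES_MB`, `total_size_gb` and `has_enough_space` are hypothetical. The sum comes out close to the 75 GB quoted for the unzipped data, which suggests (but the manual does not state) that the table lists unzipped sizes.

```python
import shutil

# Approximate sizes in MB, taken from the table (April 2014 estimates).
REFERENCE_SIZES_MB = {
    "1000 Genomes": 10_000,
    "CDS": 49,
    "ClinVar": 41,
    "PhastCons": 5_000,
    "COSMIC": 372,
    "dbSNP": 44_000,
    "dbSNP Common": 12_000,
    "Genes": 3,
    "Gene Ontology": 33,
    "HapMap": 3_000,
    "mRNA": 62,
    "Sequence": 683,
}


def total_size_gb(sizes_mb: dict) -> float:
    """Sum the individual sizes and convert to GB (assuming 1 GB = 1000 MB)."""
    return sum(sizes_mb.values()) / 1000


def has_enough_space(path: str, required_gb: float) -> bool:
    """Check whether the file system holding `path` has at least required_gb free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


total = total_size_gb(REFERENCE_SIZES_MB)
print(f"Reference data: about {total:.0f} GB")
print("Enough space here:", has_enough_space(".", total))
```

In practice, "." would be replaced with the folder that contains CLC_References. The check does not account for the additional space that the zipped download (about 12 GB) may need temporarily.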
3.1.3
Where reference data is downloaded from
Reference data must be downloaded and configured manually before you can start using the
ready-to-use workflows in the CLC Cancer Research Workbench. You only have to do this
once. When all necessary reference data have been downloaded and configured, you will be
automatically notified whenever updated reference data are available.
Data is provided by CLC bio and the Workbench is configured to download from CLC bio by
default. The location to download the data from can be seen in the Workbench Preferences
(see figure 3.1):
Edit | Preferences | Advanced
Unless you are in the special circumstance that your system administrator has decided to mirror
this data locally and wishes you to use that mirror of the data, you should not change this setting.
Figure 3.1: The location where reference data is downloaded from can be seen in the Workbench
Preferences. Generally this should not be altered except in the special case that the data from CLC
bio is being mirrored locally.
3.1.4
Download and configure reference data
The first time you open CLC Cancer Research Workbench you will be presented with the dialog
box shown in figure 3.2, which informs you that data are available for download to either
the local or the server CLC_References repository. If you check "Never show this dialog again",
you will subsequently only be presented with the dialog box when updated versions of the
reference data are available.
Click on the button labeled Yes. This will take you to the wizard shown in figure 3.3.
This wizard can also be accessed from the upper right corner of the CLC Cancer Research
Workbench by clicking on Data Management (figure 3.4).
Figure 3.2: Notification that new versions of the reference data are available.
Figure 3.3: The Manage Reference Data wizard gives access to the reference data that are required
to be able to run the ready-to-use workflows. The default view shows the references that are used
in the workflows. With the "Show All" button the reference list can be expanded with additional
(optional) reference data that you may find useful.
Figure 3.4: Click on the button labeled "Data management" to open the "Manage Reference Data"
dialog where you can download and configure the reference data that are necessary to be able to
run the ready-to-use workflows.
The "Manage Reference Data" wizard gives access to all the reference data that are used in
the ready-to-use workflows. From the wizard you can download and configure the reference data.
A button labeled "Show All" at the bottom of the dialog can be used to expand the list with
additional reference data that are not required for any of the workflows (Gene Ontology). Rather,
these extra reference data have been provided as an extra service for those of our users who
would like to include information from these databases in their data analyses.
Icons are used in the "Manage Reference Data" wizard to give a quick overview of the current
status of each reference: "Not downloaded and / or unconfigured", "Workflows use different
versions" or "Selected version is inconsistent / not fully downloaded" references are marked
with a red exclamation mark, references that are "Up to date and configured" are marked
with a green check mark, and when a new version of a reference data set is available, you
will see a green mark labeled "New".
Guide to the "Manage Reference Data" wizard:
• In the upper part of the wizard you can find:
  - A small descriptive text.
  - An indication of how many issues you have, how many of these are "unconfigured
    issues", and how many are reference data that are "ready for update".
  - The button labeled Download All, which can be used to download all reference data
    that are shown in the wizard. The first time you use the "Download All" button, all
    listed reference data are downloaded. Subsequently, only reference data for which a
    newer version is available will be downloaded. If you have selected "Show All" (the
    "Show All" button is found at the bottom of the wizard), all reference data will be
    downloaded (including "Gene Ontology"). If you have selected "Show Used", only the
    reference data that are used in the ready-to-use workflows will be downloaded.
• The central area of the wizard:
  - Lists all available reference data. After the reference name, a small note shows the
    status of the reference (see figure 3.5), which can be:
    * Not downloaded and / or unconfigured
    * Workflows use different versions
    * Selected version is inconsistent / not fully downloaded
    * Up to date and configured
    * New version available
    When a new version is available, it is stated in parentheses whether it is for your
    local disk, for the server, or for both local and server (see figure 3.5).
    If a version is inconsistent / not fully downloaded, it is stated in parentheses
    whether it is the local or server version (or both). Check the process tab for running
    or suspended download processes. Please wait for all of these to finish. If the data
    is inconsistent, even after all downloads have finished, it is likely that you ran out of
    disk space, or the download or import was somehow stopped prematurely.
    In this case, you can "Delete" the reference, and try downloading it again.
    In the unlikely event that a reference has the mark Workflows use different versions,
    the Workbench has discovered that two or more installed workflows use different
    versions of a reference, and is unable to determine which should be used. Please
    select the correct version from the drop down menu and click "Use Reference" to
    solve this. See Workflow configuration below for more information on configuring
    workflows.
Figure 3.5: The Manage Reference Data wizard lists the reference data. Three different icons are
used to mark the status of the reference.
  - Under the reference used, you can find info about the reference version (Versions
    available from CLC bio) and the size of the reference data. By clicking on the info
    button you can see the legal notice and license information for this particular
    reference data set (see figure 3.6).
Figure 3.6: Click on the info button to see the legal notice and license information.
  - The button labeled Download can be used to download the reference data individually.
    When you click on the button labeled Download, a wizard appears with a message
    informing you that the selected reference data are now being downloaded (figure 3.7).
    After the reference data have been downloaded the icon changes to a green check
    mark for those of the databases that only contain one reference data file.
Figure 3.7: The Downloading Reference wizard informs you that data is being downloaded.
  - The boldface text Workflow configuration can be expanded to reveal additional
    options. When unfolded, you can see which version of the reference is being used,
    and which of the ready-to-use workflows use this reference. In addition, three buttons
    appear:
    * Use Reference: When the reference data have been downloaded, the workflows will
      automatically be configured with all the reference data available. The drop down
      "Select Version" allows you to change between the downloaded versions, and
      pressing "Use Reference" will update the installed workflows to use this specific
      version for the selected reference. However, for references like the "1000 Genomes
      Project" and "HapMap" databases, which contain more than one reference data
      file (see footnote 3), you have to specify which reference data to use. This is what
      the "Use Reference" option allows you to do. Select the reference data by clicking on
      the data you want to use. If you want to select more than one population, hold down
      the Ctrl key while selecting data files.
      When you have selected the population that you want to use for your data
      analyses, click on the button labeled OK. Your workflow will now be configured
      with the reference data for the population(s) that you have selected. Please
      note that you have to do this for both the "1000 Genomes Project" and for the
      "HapMap" reference data. See figure 3.8.
    * Delete Version: With this button all users can delete locally installed
      reference data, whereas only administrators can delete reference
      data installed on the server. This can be used if you suspect that a downloaded
      reference is corrupt and needs to be re-downloaded, or if you need to free up
      space, e.g. locally.
    * Use Own File: Allows you to use your own reference data. The data type and
      number of files to select will be restricted to match the reference. This is useful
      when you have your own version of the reference data that you would like to use
      rather than the data made available to download directly into the Workbench. If
      you want to switch back to using the downloaded references, you must use
      "Use Reference" again.
Footnote 3: In some cases, reference data are available from different population subgroups. This
is the case for HapMap and the 1000 Genomes Project. Three letter codes are used to specify the
population that the different reference data originate from (e.g. ASW = Americans of African
Ancestry in SW USA). For the phase 3 HapMap population codes, please
see http://www.sanger.ac.uk/resources/downloads/human/hapmap3.html and for the 1000 Genomes
Project see http://www.ensembl.org/Help/Faq?id=328. Figure 3.10 shows the CLC_References folder. You
can see that different populations are available for HapMap and the 1000 Genomes Project.
Figure 3.8: Select the population variant track that you want to use in your ready-to-use workflows.
• At the bottom of the wizard you can find:
  - A button with a question mark. This is the "help" button that links to the section in
    the CLC Cancer Research Workbench reference manual that describes the "Manage
    Reference Data" button.
  - A button labeled "Show All" (or "Show Used"). With this button you can choose
    whether you only want to see the reference data that is being used in the ready-to-use
    workflows, or if you want to see all available reference data. Please note that if you
    choose to use the "Download All" function, you will download the references that are
    shown in the wizard. This means that if you have selected "Show Used" you will only
    download the reference data that is being used in the workflows.
  - A button labeled "Close". Click on this to close the wizard.
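The interplay between "Download All" and "Show All" / "Show Used" described in the wizard guide above can be expressed as a small selection rule. The Python sketch below is only a model of that description, not the Workbench's implementation. The `Reference` class, the function `download_all_selection` and the version labels "v1" and "v2" are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reference:
    name: str
    used_in_workflows: bool            # False for optional data such as Gene Ontology
    downloaded_version: Optional[str]  # None if nothing has been downloaded yet
    latest_version: str


def download_all_selection(refs, show_all: bool) -> list:
    """Return the names of the references that "Download All" would fetch."""
    # Only references currently shown in the wizard are candidates.
    shown = [r for r in refs if show_all or r.used_in_workflows]
    # Fetch anything never downloaded, or for which a newer version exists.
    return [r.name for r in shown if r.downloaded_version != r.latest_version]


refs = [
    Reference("dbSNP", True, None, "v2"),           # never downloaded
    Reference("COSMIC", True, "v1", "v2"),          # newer version available
    Reference("Genes", True, "v2", "v2"),           # up to date
    Reference("Gene Ontology", False, None, "v2"),  # optional, only listed with "Show All"
]
print(download_all_selection(refs, show_all=False))
print(download_all_selection(refs, show_all=True))
```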
If you are connected to a CLC Server, you will be asked whether you want to save the downloaded
reference data to your Workbench or to your Server when you click on the button labeled Download
or Download All. See figure 3.9. You will see this dialog the first time you download data.
After this the dialog will appear only in situations where both the Local and Server version need
updating. If a new version is found with respect to only Local or Server, the data will automatically
be downloaded to that location.
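The rule for when the save-location dialog appears can be summarized as follows. This Python sketch simply restates the paragraph above as a function; `download_targets` is a hypothetical name and the code is not taken from the Workbench.

```python
def download_targets(local_needs_update: bool,
                     server_needs_update: bool,
                     first_download: bool = False):
    """Return (ask_user, candidate_locations) for a download while connected to a server."""
    if first_download or (local_needs_update and server_needs_update):
        # The user is asked to choose between Workbench (Local) and Server.
        return True, ["Local", "Server"]
    if local_needs_update:
        return False, ["Local"]
    if server_needs_update:
        return False, ["Server"]
    return False, []


print(download_targets(local_needs_update=True, server_needs_update=False))
```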
Figure 3.9: Select where to save the downloaded reference data. Please be aware that the total
size of all reference data (in April 2014) is about 12 GB when compressed. It can take some
time to download all reference data. When unzipped, the same reference data take up about 75 GB.
When the reference data have been downloaded, the workflows will automatically be configured
with the reference data. However, in some cases reference data are available from different
population subgroups. This is the case for HapMap and the 1000 Genomes Project. Three letter
codes are used to specify the population that the different reference data originate from (e.g.
ASW = Americans of African Ancestry in SW USA). For the phase 3 HapMap population codes, please see
http://www.sanger.ac.uk/resources/downloads/human/hapmap3.html and for the
1000 Genomes Project see http://www.ensembl.org/Help/Faq?id=328.
Whenever workflows use reference data that are available from more than one population, the
workflow will initially be automatically configured with all the populations being available, and
which population to use in the workflow will then need to be specified by the user in one of the
wizard steps that appear when starting the workflow. How to configure your workflow with the
right population is described in section 3.1.4.
Figure 3.10 shows the CLC_References folder. If you open the folders holding the reference data,
you can see that different populations are available for HapMap and the 1000 Genomes Project.
Figure 3.10: For the 1000 Genomes Project and HapMap reference data, data are available
from different populations. For these two databases the user must manually specify the relevant
population to be used in the workflows. If the user chooses not to select a population manually, the
workflow will use a randomly selected population.
3.1.5
Troubleshooting reference data downloads
Network connection errors can occur when downloading reference data. If this happens, you can
try to resume the download when the network connection has been restored (see figure 3.11).
Alternatively, you can simply press stop to cancel the download process and clean up any
temporary data.
Figure 3.11: It is possible to resume the download of data if you have encountered e.g. network
connection errors.
3.2
Create new folder
To get started you need some data to work with. However, before looking into how you can
import your data into the CLC Cancer Research Workbench we will first create a new folder in the
Navigation area that can be used to hold all data that are relevant for the analysis you are about
to perform. You can see how to do this in figure 3.12.
Figure 3.12: Click on the Create Folder icon (or use the tool labeled "New" in the toolbar) to create
a new folder. Provide a name that will make it easy to keep track of your data.
The folder that you have just created will be placed in the CLC_Data location as shown in
figure 3.13.
Figure 3.13: The folder that you have just created will be placed in the CLC_Data location.
3.3
Import data
We are now ready to start importing the data. The simple diagram shown in figure 3.14 will be
used throughout the rest of the manual to provide an overview as we move step by step
from data import to analysis of your sequencing data.
Figure 3.14: The first thing to do is to import your sequencing data.
Below you can find a short guide on how to import data into the CLC Cancer Research Workbench.
If you wish to learn more about the import options in the CLC Cancer Research Workbench, you
can find a more detailed description in the CLC Cancer Research Workbench reference manual
(http://clccancer.com/software/#downloads).
3.3.1
How to import data
1. Use the Import tool in the toolbar (see figure 3.15) to import your sequencing data into the
CLC Cancer Research Workbench.
2. Click on one of the import options e.g. "Illumina". This will make a wizard appear as shown
in figure 3.16.
3. Locate and select the files to import. Note that you can select all sequence files and import
them simultaneously. If you take a closer look at the different options in this wizard, you
can see that it is possible to choose different import options. We recommend importing
data with the standard settings. If you wish to make your own adjustments, you can find
further details about the import options in the CLC Cancer Research Workbench reference
manual (http://clccancer.com/software/#downloads).
Figure 3.15: Click on the tool labeled "Import" in the toolbar to import data. Select importer
according to the data type you wish to import.
Figure 3.16: Locate and select the files to import. Tick "Paired reads" if you, as in this example,
are importing paired reads.
4. Click on the button labeled Next. This will take you to the next wizard step (see figure 3.17).
5. Choose the default settings to save the sequence data and click on the button labeled
Next. This will take you to the wizard step shown in figure 3.18.
Figure 3.17: You now have the option to choose whether you wish to open or save the imported
reads. If you select to open the reads, they will not be saved unless you do it manually at a later
point. Select "Save" and click on the button labeled "Next".
6. Locate the folder in the Navigation Area that you have created for the purpose.
Figure 3.18: Locate the folder in the Navigation Area that you have just created and save your
imported reads in the folder.
7. Click on the button labeled Finish. It can take some seconds or even minutes before all
data have been imported and saved.
Chapter 4
Preparing Raw Data
Contents
4.1 Prepare sequencing data - all application types
  4.1.1 Import adapter trim list
  4.1.2 How to run the "Prepare Overlapping Raw Data" ready-to-use workflow
  4.1.3 How to run the "Prepare Raw Data" ready-to-use workflow
  4.1.4 Output from the Prepare Overlapping Raw Data and Prepare Raw Data workflows
  4.1.5 How to check the output reports
4.2 Analysis of sequencing data
4.1
Prepare sequencing data - all application types
The first thing to do after data import is to check the quality of the sequencing reads and
perform the necessary trimming. This applies no matter whether you are working with Whole
Genome Sequencing, Exome Sequencing, or Targeted Amplicon Sequencing. In the toolbox you
can choose between the two different ready-to-use workflows for data preparation that are shown
in the "Run workflow 1" box in figure 4.1.
The "Preparing Raw Data" ready-to-use workflows are universal and can be used for all applications: Whole Genome Sequencing, Exome Sequencing, and Targeted Amplicon Sequencing.
Choosing between "Prepare Raw Data" and "Prepare Overlapping Raw Data" workflows:
Many whole genome sequencing, exome sequencing using capture technology, and targeted
amplicon sequencing strategies produce overlapping reads. Downstream stages of the Cancer
Research Workbench (e.g. Variant calling) take the frequencies of observed alleles into consideration as well as the forward-reverse strand balance. When merging overlapping reads these
two parameters will be affected: 1) the frequency of observed alleles in overlapping regions will
be corrected (a variant found both on the forward and the reverse read of the same fragment
should only be counted once), and 2) in the merged fragments the information on forward-reverse
strand origin has become meaningless. These effects have to be taken into consideration when
filtering variants on these statistics. As the forward-reverse strand balance statistic is used as a
variant filter (i.e. the Read direction filter), we recommend using the "Prepare Overlapping Raw
Data" workflow on targeted amplicon sequencing data with an overlapping read sequencing strategy,
whereas we recommend the "Prepare Raw Data" workflow for other sequencing protocols (e.g.
whole genome sequencing and whole exome sequencing, even if they make use of overlapping read
sequencing).
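The recommendation above, and the effect of merging on allele counts, can be illustrated with a short Python sketch. Both function names are hypothetical, the application labels are free-text placeholders, and the counting function is a simplified model of "count a variant once per fragment", not the Workbench's variant caller.

```python
def recommended_workflow(application: str, overlapping_reads: bool) -> str:
    """Return the data preparation workflow recommended in the text."""
    if application == "targeted amplicon" and overlapping_reads:
        return "Prepare Overlapping Raw Data"
    # Whole genome and exome data, with or without overlapping reads.
    return "Prepare Raw Data"


def fragment_level_count(observations) -> int:
    """Count a variant once per fragment, even if both reads of the pair show it.

    `observations` is an iterable of (fragment_id, strand) tuples, one per read
    in which the variant was observed.
    """
    return len({fragment_id for fragment_id, _strand in observations})


# The variant is seen on both reads of fragment f1 and on one read of fragment f2.
obs = [("f1", "forward"), ("f1", "reverse"), ("f2", "forward")]
print(len(obs), "read-level observations,", fragment_level_count(obs), "fragments")
print(recommended_workflow("targeted amplicon", overlapping_reads=True))
```

Note that the fragment-level count discards the strand information, which mirrors the point made above: after merging, forward-reverse balance is no longer meaningful.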
Figure 4.1: Two ready-to-use workflows are available for data preparation: "Prepare Overlapping
Raw Data" and "Prepare Raw Data".
4.1.1
Import adapter trim list
One important part of the preparation of raw data is adapter trimming. To be able to trim off the
adapters, an adapter trim list is required. To obtain this file, contact the vendor of the
enrichment kit and sequencing machine and ask them to send the adapter trim list file to you.
Because the list is supplied by the vendor, it must be imported into the CLC Cancer Research
Workbench before it can be used. The adapter trim list can be imported by
clicking on the button labeled "Import" in the Toolbar. Select standard import (figure 4.2) and
find the adapter trim list you want to import.
Select "Trim adapter list (.xls, .xlsx/.csv)" in the "Files of type" drop-down list in the Import
wizard. Click on the button labeled Next and select where you wish to save the adapter trim list.
Figure 4.2: After you have identified the trim list that you want to import, select "Trim adapter list
(.xls, .xlsx/.csv)" in the "Files of type" drop-down list in the Import wizard.
4.1.2
How to run the "Prepare Overlapping Raw Data" ready-to-use workflow
If your sequencing reads contain overlapping pairs you can use the "Prepare Overlapping Raw
Data" ready-to-use workflow for preparation of your sequences before you proceed to data
analysis such as variant calling.
1. Go to the toolbox and double-click on the "Prepare Overlapping Raw Data" ready-to-use
workflow (figure 4.3).
Figure 4.3: The ready-to-use workflows are found in the toolbox.
This will open the wizard shown in figure 4.4 where you can select the reads that you wish
to prepare for further analyses.
At this step you can choose to prepare one sample at a time or you can select several
samples and prepare them simultaneously. To prepare more than one sample, you can either
select multiple samples and use the small arrow pointing to the right in the middle of the
wizard to send them to "Selected elements" in the right side of the wizard, or you can run
the samples in "Batch" mode. Batch mode is enabled by ticking "Batch" at the bottom of the
wizard (as shown in figure 4.4) and selecting the folder that holds the data you wish to
analyze. If your sequencing data are found in separate folders, you should choose to run
the analysis in batch mode.
The difference between analyzing multiple samples in batch mode versus in non-batch
mode is the reporting. If you use batch mode, you will get an individual report for every
single sample whereas you will get one combined report for all samples if you do not run in
batch mode.
Figure 4.4: Select the sequencing raw data that should be prepared for further analysis. At this
step you can also choose to prepare several reads in batch mode.
When you have selected the sample(s) you want to prepare, click on the button labeled
Next.
2. As part of the data preparation, the sequences are trimmed. In the wizard shown in
figure 4.5 you can specify different trimming parameters and select the adapter trim list
that should be used for adapter trimming by clicking on the folder icon ( ).
Figure 4.5: Select your adapter trim list. You can use the default trim parameters or adjust them if
necessary.
3. Click on the button labeled Next. This will take you to the next wizard step (figure 4.6).
At this step you get the chance to check the selected settings by clicking on the button
labeled Preview All Parameters (figure 4.7).
Figure 4.6: Check the settings and save your results.
Figure 4.7: In this wizard you can check the parameter settings. It is also possible to export the
settings to a file format that can be specified using the "Export to" drop-down list.
In the Preview All Parameters wizard you can only check the settings; it is not possible to
make any changes at this point. At the bottom of the wizard there are two buttons for
exporting: one button allows specification of the export format, and the other button
(the one labeled "Export Parameters") allows specification of the export destination. When
selecting an export location, you will export the analysis parameter settings that were
specified for this specific experiment.
4. Click on the button labeled OK to go back to the previous wizard step and choose Save.
Note! If you choose to open the results, the results will not be saved automatically. You
can always save the results at a later point.
4.1.3
How to run the "Prepare Raw Data" ready-to-use workflow
If you have sequencing reads without overlapping pairs, you can use the "Prepare Raw Data"
ready-to-use workflow for preparation of your sequences before you proceed to data analysis such
as variant calling.
1. Go to the toolbox and double-click on the "Prepare Raw Data" ready-to-use workflow
(figure 4.8).
Figure 4.8: The ready-to-use workflows are found in the toolbox.
This will open the wizard shown in figure 4.9 where you can select the reads that you wish
to prepare for further analyses.
Figure 4.9: Select the sequencing raw data that you wish to prepare before further analysis. At this
step you can also choose whether you wish to prepare several reads in batch mode.
You can prepare one sample at a time, or you can select several samples and prepare
them simultaneously. To prepare several samples together, select them and use the small
arrow to send them to "Selected elements" in the right side of the wizard. Alternatively,
you can run the samples in "batch mode": tick "Batch" at the bottom of the wizard (as
shown in figure 4.4) and select the folder that holds the data you wish to analyze. If you
have your sequencing data in separate folders, you should run the analysis in batch mode.
The difference between analyzing multiple samples in batch mode versus in non-batch
mode is the reporting. If you use batch mode, you will get an individual report for every
single sample whereas you will get one combined report for all samples if you do not run in
batch mode.
2. When you have selected the sample(s) you want to prepare, click on the button labeled
Next.
As part of the data preparation, the sequences are trimmed. In the next wizard (figure 4.10)
you can specify different trimming parameters and select the adapter trim list that should
be used for adapter trimming by clicking on the folder icon. The adapter trim list is
supplied by the vendor of the enrichment kit and sequencing machine; to obtain this file,
contact the vendor and ask them to send it to you. See section 4.1.1 for a description of
how to import the adapter trim list.
Figure 4.10: Select your adapter trim list. You can use the default trim parameters or adjust them
if necessary.
3. Click on the button labeled Next, which will take you to the next wizard (figure 4.11).
Figure 4.11: Check the settings and save your results.
If you click on the button labeled Preview All Parameters you get the chance to check the
selected settings. At this step you can only check the settings; it is not possible to make
any changes.
The settings can be exported with the two buttons found at the bottom of this wizard; one
button allows specification of the export format, and the other button (the one labeled
"Export Parameters") allows specification of the export destination. When selecting an
export location, you will export the analysis parameter settings that were specified for this
specific experiment.
4. Click on the button labeled OK to go back to the previous wizard and choose Save.
4.1.4
Output from the Prepare Overlapping Raw Data and Prepare Raw Data workflows
Different outputs are generated from the "Prepare Overlapping Raw Data" and "Prepare Raw
Data" workflows.
Prepare Overlapping Raw Data. Performs quality control and trimming of the sequencing reads
and merging of overlapping read pairs and generates five different outputs:
1. QC graphic report. The report should be checked by the user.
2. QC supplementary report. The report should be checked by the user.
3. Trimming report (the trimmed sequences are automatically used as input in the merging of
paired reads step). The report should be checked by the user.
4. Merged reads output. Use as input together with the "Not merged reads output" in the next
ready-to-use workflow (e.g. "Identify Variants WES").
5. Not merged reads output. These should be used as input together with the "Merged reads
output" in the next ready-to-use workflow (e.g. "Identify Variants WES").
Prepare Raw Data. Performs quality control and trimming of the sequencing reads and generates
five different outputs:
1. QC graphic report. The report should be checked by the user.
2. QC supplementary report. The report should be checked by the user.
3. Trimming report. The report should be checked by the user.
4. Trimmed sequences output. Use as input together with the "Trimmed sequences (broken
pairs) output" in the next ready-to-use workflow (e.g. "Identify Variants WES").
5. Trimmed sequences (broken pairs) output. Use as input together with the "Trimmed
sequences output" in the next ready-to-use workflow (e.g. "Identify Variants WES").
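To give an idea of what "merging of overlapping read pairs" means, the following Python sketch joins a forward read and a reverse read when the end of the forward read matches the start of the reverse-complemented reverse read. This is not the algorithm used by the Workbench, which may tolerate mismatches and use alignment scores. The function names, the exact-match rule, and the `min_overlap` parameter are assumptions made for illustration.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq))


def merge_pair(forward, reverse, min_overlap=10):
    """Merge a read pair if the reads overlap exactly; return None otherwise.

    Hypothetical, simplified rule: the longest exact overlap of at least
    `min_overlap` bases between the 3' end of the forward read and the 5' end
    of the reverse-complemented reverse read is used to join the two reads.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    rc = reverse_complement(reverse)
    longest_possible = min(len(forward), len(rc))
    # Try the longest overlap first so that the most specific match wins.
    for overlap in range(longest_possible, min_overlap - 1, -1):
        if forward[-overlap:] == rc[:overlap]:
            return forward + rc[overlap:]
    return None  # no overlap found: the pair stays "not merged"
```

In this sketch, pairs for which `merge_pair` returns `None` would correspond to the "Not merged reads output", and the joined sequences to the "Merged reads output".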
4.1.5
How to check the output reports
Three different reports are generated, and all of these should be inspected in order to determine
whether the quality of the sequencing reads and the trimming is acceptable. We are now at the
"Inspect results" step in figure 4.12. The interpretation of the reports is not always completely
straightforward, but as you gain experience it becomes easier.
Figure 4.12: Inspect the quality and trimming reports and determine whether you can proceed with
the data analysis or if you have to resequence some of the samples.
Graphical QC Report
• 1 Summary
• 2 Per-sequence analysis
• 2.1 Lengths distribution
• 2.2 GC-content
• 2.3 Ambiguous base-content
• 2.4 Quality distribution
• 3 Per-base analysis
• 3.1 Coverage
• 3.2 Nucleotide distributions
• 3.3 GC-content
• 3.4 Ambiguous base-content
• 3.5 Quality distribution
• 4 Over-representation analyses
• 4.1 Enriched 5mers
• 4.2 Sequence duplication levels
• 4.3 Duplicated sequences
Supplementary QC Report
• 1 Summary
• 2 Per-sequence analysis
• 2.1 Lengths distribution
• 2.2 GC-content
• 2.3 Ambiguous base-content
• 2.4 Quality distribution
• 3 Per-base analysis
• 3.1 Coverage
• 3.2 Nucleotide distributions
• 3.3 GC-content
• 3.4 Ambiguous base-content
• 3.5 Quality distribution
• 4 Over-representation analyses
• 4.1 Enriched 5mers
• 4.2 Sequence duplication levels
• 4.3 Duplicated sequences
The majority of the reads should have a PHRED score above 30 when looking at the "Quality
distribution" graph.
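A PHRED score Q relates to the estimated probability P that a base call is wrong through Q = -10 log10(P), so a score of 30 corresponds to an error probability of 1 in 1000. The Python sketch below is not part of the Workbench; it only illustrates the conversion and one possible way to check which fraction of reads lies above the threshold. The function names and the list of per-read average scores are hypothetical.

```python
def phred_to_error_probability(q):
    """Convert a PHRED quality score to the estimated base-call error probability."""
    return 10 ** (-q / 10)


def fraction_above(read_qualities, threshold=30):
    """Return the fraction of reads whose average PHRED score is above `threshold`."""
    if not read_qualities:
        return 0.0
    passing = sum(1 for q in read_qualities if q > threshold)
    return passing / len(read_qualities)


# Hypothetical per-read average scores, e.g. taken from a QC report table.
example_scores = [35, 32, 28, 40]
majority_ok = fraction_above(example_scores) > 0.5
```

Here "the majority" is interpreted as more than half of the reads; a stricter cut-off can be chosen if your application requires it.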
If you can accept the read quality you can now proceed to the next step and use the prepared
reads output as input in the next ready-to-use workflow. If the quality of your reads is poor and
cannot be accepted for further analysis, the best solution to the problem is to go back to start
and resequence the sample.
Figure 4.13: Use the prepared data as input in the relevant ready-to-use workflow, which we here
for the sake of simplicity call "Workflow 2".
4.2
Analysis of sequencing data
You are now ready to perform the actual analysis of your sequencing data (see figure 4.13).
For each application six different ready-to-use workflows are available. These can be divided into
three different categories: "Data analysis", "Interpretation", and "Data analysis and Interpretation".
Note! The ready-to-use workflows found under each of the three application types have similar
names (with the only difference that "WGS", "WES", or "TAS" have been added after the name).
However, some of the workflows have been tailored to the individual applications. Therefore, we
recommend that you use the ready-to-use workflow that is found under the relevant application
heading.
• Data analysis The data analysis includes read mapping and variant calling. One ready-to-use
workflow is available in this category: the Identify Variants ready-to-use workflow.
• Interpretation At this step you can annotate, filter, and compare the variants that were
identified in the data analysis step.
The available tools for interpretation are:
Annotate Variants: Annotates variants with gene names, conservation scores, amino
acid changes, and information from clinically relevant databases.
Filter Somatic Variants: Removes variants outside the target region (only targeted
experiments) and common variants present in publicly available databases. Annotates with gene names, conservation scores, and information from clinically relevant
databases.
Identify Somatic Variants from Tumor Normal Pair: Removes germline variants by referring to the control sample read mapping, removes variants outside the target region
(in case of a targeted experiment), and annotates with gene names, conservation
scores, amino acid changes, and information from clinically relevant databases.
• Data analysis and Interpretation With these ready-to-use workflows you can perform the
variant calling, annotation, filtering, and/or comparison of variants in one go.
The available tools for Data analysis and Interpretation are:
Identify and Annotate Variants: Maps reads to the human reference sequence, does
a local realignment, runs quality control for targeted regions, calls variants, removes
false positives, and annotates variants with gene names, amino acid changes,
conservation scores, and information from different external databases.
Identify Known Variants in One Sample: Maps sequencing reads and looks for the
presence or absence of user-specified variants in the mapping.
Chapter 5
Whole genome sequencing (WGS)
Contents
5.1 Automatic analysis of sequencing data (WGS)
5.2 Identify Variants (WGS)
  5.2.1 How to run the "Identify Variants" ready-to-use workflow
  5.2.2 Output from the Identify Variants workflow
5.3 Annotate Variants (WGS)
5.4 Filter Somatic Variants (WGS)
5.5 Identify Somatic Variants from Tumor Normal Pair (WGS)
5.6 Identify Known Variants in One Sample (WGS)
  5.6.1 Import your known variants
  5.6.2 Import your targeted regions
  5.6.3 How to run the "Identify Known Variants in One Sample" ready-to-use workflow
  5.6.4 Output from the Identify Known Variants in One Sample
The most comprehensive sequencing method is whole genome sequencing, which allows for
identification of genetic variations and somatic mutations across the entire human genome. This
type of sequencing encompasses both chromosomal and mitochondrial DNA. The advantage of
sequencing the entire genome is that not only the protein-coding regions are sequenced;
information is also provided for regulatory and non-protein-coding regions.
5.1
Automatic analysis of sequencing data (WGS)
Five ready-to-use workflows are available for analysis of whole genome sequencing data. The
concept of the pre-installed ready-to-use workflows is that read data are used as input in one
end of the workflow and in the other end of the workflow you get a track based genome browser
view and a table with all the identified variants, which may or may not have been subjected to
different kinds of filtering and/or annotation.
In this chapter we will discuss what the individual ready-to-use workflows can be used for and go
through step by step how to run the workflows.
Note! Often you will have to prepare data with one of the two Preparing Raw Data workflows
described in section 4 before you proceed to Automatic analysis of sequencing data (WGS).
5.2
Identify Variants (WGS)
The "Identify Variants" tool takes sequencing reads as input and returns identified variants in a
Genome Browser View.
The tool runs an internal workflow that first maps the sequencing reads to the human reference
sequence. Next, it runs a local realignment to improve the subsequent variant detection.
Two different variant callers are used: the "Low Frequency Variant Detection" caller, which
calls small insertions, deletions, SNVs, MNVs, and replacements, and the "InDel and Structural
Variants" caller, which calls larger insertions, deletions, translocations, and replacements.
At the end of the variant detection, variants detected by the "Low Frequency Variant
Detection" caller with an average base quality lower than 20 are filtered away.
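The final quality filter can be pictured with the short Python sketch below. It is an illustration only and does not reproduce the Workbench's internal implementation. The `Variant` record, its field names, and the caller labels are hypothetical; only the threshold of 20 and the restriction to the "Low Frequency Variant Detection" caller come from the description above.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    chromosome: str
    position: int
    caller: str             # hypothetical label: "low_frequency" or "indel_sv"
    average_quality: float   # average base quality of the supporting bases


def filter_low_quality(variants, min_quality=20.0):
    """Drop low-frequency-caller variants whose average base quality is below `min_quality`.

    Variants from the other caller are kept unchanged, since the quality
    filter described in the text applies only to the low-frequency caller.
    """
    kept = []
    for variant in variants:
        if variant.caller == "low_frequency" and variant.average_quality < min_quality:
            continue
        kept.append(variant)
    return kept
```

For example, a low-frequency call with an average base quality of 15 would be removed, while a structural variant call would be retained regardless of this value.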
A detailed mapping report is created to inspect the overall coverage and mapping specificity in
the targeted regions.
5.2.1
How to run the "Identify Variants" ready-to-use workflow
1. Go to the toolbox and double-click on the "Identify Variants" ready-to-use workflow (figure 5.1).
Figure 5.1: Find the "Identify Variants" ready-to-use workflows in the toolbox from the folder that
has the name of the application you are using.
This will open the wizard shown in figure 5.2 where you can select the sequencing reads
from the sample that should be analyzed.
Figure 5.2: Please select all sequencing reads from the sample to be analyzed.
Please select all sequencing reads from your sample. If several samples should be
analyzed, the tool has to be run in batch mode. To do this, tick "Batch" at the bottom of
the wizard and select the folder that holds the data you wish to analyze.
If you have your sequencing data in separate folders, you should choose to run the analysis
in batch mode.
When you have selected the sample(s) that you want to analyze, click on the button labeled
Next.
2. In the next wizard step (figure 5.3) you can specify the parameters for variant detection.
Figure 5.3: The next thing to do is to specify the parameters that should be used to detect variants.
3. Click on the button labeled Next. This will take you to the next wizard step (figure 5.4).
Figure 5.4: Check the settings and save your results.
In this wizard you can check the selected settings by clicking on the button labeled Preview
All Parameters.
In the Preview All Parameters wizard you can only check the settings; it is not possible to
make any changes at this point. At the bottom of this wizard there are two buttons regarding
export functions; one button allows specification of the export format, and the other button
(the one labeled "Export Parameters") allows specification of the export destination. When
selecting an export location, you will export the analysis parameter settings that were
specified for this specific experiment.
4. Click on the button labeled OK to go back to the previous wizard and choose Save.
Note! If you choose to open the results, the results will not be saved automatically. You
can always save the results at a later point.
5.2.2
Output from the Identify Variants workflow
The "Identify Variants" tool produces six different types of output:
1. Structural Variants ( ) Variant track showing the structural variants: insertions, deletions,
and replacements. Hold the mouse over one of the variants or right-click on the variant, and a
tooltip will appear with detailed information about the variant. The structural variants can
also be viewed in table format by switching to the table view. This is done by pressing the
table icon found in the lower left corner of the View Area.
2. Structural Variant Report ( ) The report consists of a number of tables and graphs that
in different ways provide information about the structural variants.
3. Read Mapping ( ) The mapped sequencing reads. The reads are shown in different
colors depending on their orientation, whether they are single reads or paired reads,
and whether they map unambiguously. For the color codes please see the description
of sequence colors in the CLC Genomics Workbench manual that can be found here:
http://www.clcsupport.com/clcgenomicsworkbench/current/index.php?manual=View_settings_in_Side_Panel.html.
4. Read Mapping Report ( ) The report consists of a number of tables and graphs that in
different ways provide information about the mapped reads.
5. Structural Variants ( ) A variant track holding the identified variants. The variants can
be shown in track format or in table format. When holding the mouse over the detected
variants in the Genome Browser view a tooltip appears with information about the individual
variants. You will have to zoom in on the variants to be able to see the detailed tooltip.
6. Genome Browser View Identify Variants ( ) A collection of tracks presented together.
Shows the annotated variants track together with the human reference sequence, genes,
transcripts, coding regions, the mapped reads, the identified variants, and the structural
variants (see figure 5.10).
Before looking at the identified variants, we recommend that you first take a look at the mapping
report to see whether the coverage is sufficient in the regions of interest (e.g. > 30×). Furthermore,
please check that at least 90% of the reads map to the human reference sequence. In case of a
targeted experiment, please also check that the majority of reads map to the targeted region.
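If you note down the key numbers from the mapping report, the two recommended checks can be expressed as in the Python sketch below. It is a hedged illustration, not a Workbench function: the function name, its arguments, and the returned dictionary are hypothetical, and the example thresholds (coverage above 30x, at least 90% mapped reads) are the ones suggested above.

```python
def check_mapping_report(mean_coverage, mapped_reads, total_reads,
                         min_coverage=30.0, min_mapped_fraction=0.90):
    """Compare values read from a mapping report against the suggested thresholds."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    mapped_fraction = mapped_reads / total_reads
    return {
        "coverage_ok": mean_coverage > min_coverage,
        "mapped_fraction": mapped_fraction,
        "mapping_ok": mapped_fraction >= min_mapped_fraction,
    }


# Hypothetical numbers copied by hand from a mapping report.
result = check_mapping_report(mean_coverage=42.5, mapped_reads=950, total_reads=1000)
```

If either check fails, inspect the mapping report more closely before interpreting the variants.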
Next, open the Genome Browser View file (see figure 5.5).
The Genome Browser View shows the track of identified variants in the context of the human
reference sequence, genes, transcripts, coding regions, and mapped sequencing reads.
Figure 5.5: The Genome Browser View allows easy inspection of the identified smaller variants,
larger insertions and deletions, and structural variants in the context of the human genome.
By double-clicking on the InDel variant track in the Genome Browser View, a table will be shown
that lists all identified larger insertions and deletions (see figure 5.6).
If you would like to change the reference sequence used for read mapping or the human
genes, please use the "Data Management" function (see section 3.1.4).
5.3
Annotate Variants (WGS)
Using a variant track ( ) (e.g. the output from the Identify Variants ready-to-use workflow) the
Annotate Variants (WGS) ready-to-use workflow runs an internal workflow that adds the following
annotations to the variant track:
• Gene names Adds names of genes whenever a variant is found within a known gene.
• mRNA Adds names of mRNA whenever a variant is found within a known transcript.
• CDS Adds names of CDS whenever a variant is found within a coding sequence.
• Amino acid changes Adds information about amino acid changes caused by the variants.
• Information from COSMIC. Adds information from the "Catalogue of Somatic Mutations in
Cancer" database.
• Information from ClinVar Adds information about the relationships between human variations and their clinical significance.
Figure 5.6: This figure shows a Genome Browser View with an open track table. The table allows
deeper inspection of the identified variants.
• Information from dbSNP Adds information from the "Single Nucleotide Polymorphism
Database", which is a general catalog of genome variation, including SNPs, multinucleotide
polymorphisms (MNPs), insertions and deletions (InDels), and short tandem repeats (STRs).
• PhastCons Conservation scores The conservation scores, in this case generated from
a multiple alignment with a number of vertebrates, describe the level of nucleotide
conservation in the region around each variant.
1. Go to the toolbox and select the Annotate Variants (WGS) workflow. In the first wizard
step, select the input variant track (figure 5.7).
2. Click on the button labeled Next. The only parameter that should be specified by the
user is which 1000 Genomes population you use (figure 5.8). This can be done using the
drop-down list found in this wizard step. Please note that the populations available from
the drop-down list can be specified with the Data Management ( ) function found in the
top right corner of the Workbench (see section 3.1.4).
3. Click on the button labeled Next to go to the last wizard step (figure 5.9).
In this wizard step you can check the selected settings by clicking on the button labeled
Preview All Parameters. In the Preview All Parameters wizard you can only check the
settings; it is not possible to make any changes at this point.
Figure 5.7: Select the variant track to annotate.
Figure 5.8: Select the relevant 1000 Genomes population(s).
Figure 5.9: Check the settings and save your results.
4. Choose to Save your results and click on the button labeled Finish.
Two types of output are generated:
1. Annotated Variants ( ) Annotation track showing the variants. Hold the mouse over one
of the variants or right-click on the variant, and a tooltip will appear with detailed information
about the variant.
2. Genome Browser View Annotated Variants ( ) A collection of tracks presented together.
Shows the annotated variants track together with the human reference sequence, genes,
transcripts, coding regions, and variants detected in dbSNP, ClinVar, COSMIC, 1000
Genomes, and PhastCons conservation scores (see figure 5.10).
Figure 5.10: The output from the "Annotate Variants" ready-to-use workflow is a genome browser
view (a track list) containing individual tracks for all added annotations.
Note! Please be aware that if you delete the annotated variant track, this track will also
disappear from the genome browser view.
It is possible to add tracks to the Genome Browser View such as mapped sequencing reads as
well as other tracks. This can be done by dragging the track directly from the Navigation Area to
the Genome Browser View.
If you double-click on the name of the annotated variant track in the left hand side of the Genome
Browser View, a table that includes all variants and the added information/annotations will open
(see figure 5.11). The table and the Genome Browser View are linked; if you click on an entry in
the table, this particular position in the genome will automatically be brought into focus in the
Genome Browser View.
You may be met with a warning as shown in figure 5.12. This is simply a warning telling you that
it may take some time to create the table if you are working with tracks containing large amounts
Figure 5.11: The output from the "Annotate Variants" ready-to-use workflow is a genome browser
view (a track list). The information is also available in table view. Click on the small table icon to
open the table view. If you hold down the "Ctrl" key while clicking on the table icon, you will open a
split view showing both the genome browser view and the table view.
of annotations. Please note that in case none of the variants are present in COSMIC, ClinVar or
dbSNP, the corresponding annotation column headers are missing from the result.
Figure 5.12: Warning that appears when you work with tracks containing many annotations.
Adding information from other sources may help you identify interesting candidate variants for
further research. For example, known cancer-associated variants (present in the COSMIC database) or
variants known to play a role in drug response or other clinically relevant phenotypes (present in
the ClinVar database) can easily be identified. Further, variants not found in the COSMIC and/or
ClinVar databases can be prioritized based on amino acid changes in case the variant causes
changes on the amino acid level.
A high conservation level between different vertebrates or mammals in the region containing the
variant can also be used to give a hint about whether a given variant is found in a region with an
important functional role. If you would like to use the conservation scores to identify interesting
variants, we recommend that variants with a conservation score of more than 0.9 (PhastCons
score) are prioritized over variants with lower conservation scores.
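The recommended prioritization can be sketched in Python as follows, for example on a variant table exported from the Workbench. The dictionary layout and the key name `phastcons` are hypothetical and would need to be adapted to the actual column names in your export; only the threshold of 0.9 comes from the recommendation above.

```python
def prioritize_by_conservation(variants, threshold=0.9):
    """Return variants with a PhastCons score above `threshold` first.

    The original order is preserved within the high-scoring group and
    within the remaining group.
    """
    high = [v for v in variants if v["phastcons"] > threshold]
    rest = [v for v in variants if v["phastcons"] <= threshold]
    return high + rest


# Hypothetical rows from an exported annotated variant table.
table = [
    {"id": "a", "phastcons": 0.50},
    {"id": "b", "phastcons": 0.95},
    {"id": "c", "phastcons": 0.90},
    {"id": "d", "phastcons": 1.00},
]
ordered = prioritize_by_conservation(table)
```

Within the Workbench itself, the same effect can be obtained by sorting or filtering the conservation score column in the table view.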
It is possible to filter variants based on their annotations. This type of filtering can be facilitated
using the table filter found at the top part of the table. If you are performing multiple experiments
where you would like to use the exact same filter criteria, you can create a filter that can be
saved and reused. To do this:
Toolbox | Identify Candidate Variants | Create Filter Criteria
This tool can be used to specify the filter. The Annotate Variants workflow should then be
extended with the Identify Candidate Tool (configured with the Filter Criterion). The CLC Cancer
Research Workbench reference manual has a chapter that describes this in detail (see the
chapter "Workflows" at http://clccancer.com/software/#downloads for more information
on how pre-installed workflows can be extended and/or edited).
Note! Sometimes the databases (e.g. COSMIC) are updated with a newer version, or maybe you
have your own version of the database. In such cases you may wish to change one of the used
databases. This can be done with the "Data Management" function, which is described in section
3.1.4.
5.4
Filter Somatic Variants (WGS)
If you are analyzing a list of variants that have been detected in a tumor or blood sample
where no control sample is available from the same patient, you can use the "Filter Somatic
Variants (WGS)" ready-to-use workflow to identify potential somatic variants. The purpose of this
ready-to-use workflow is to use publicly available (or your own) databases, with common variants
in a population, to extract potential somatic variants whenever no control/normal sample from
the same patient is available.
The "Filter Somatic Variants (WGS)" ready-to-use workflow accepts variant tracks ( ) (e.g. the
output from the Identify Variants ready-to-use workflow) as input. Variants that are identical to the
human reference sequence are first filtered away, and then variants found in the Common dbSNP,
1000 Genomes Project, and HapMap databases are deleted. Those databases are
assumed not to contain relevant somatic variants.
Please note that this tool will likely also remove inherited cancer variants that are present at a
low percentage in a population.
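Conceptually, this filtering step is a set subtraction: every variant that also occurs in one of the common-variant databases is discarded. The Python sketch below illustrates the idea only; it is not the Workbench's implementation. Representing a variant as a `(chromosome, position, reference, alternative)` tuple and the function name are assumptions made for this example.

```python
def remove_common_variants(sample_variants, *common_variant_sets):
    """Keep only the sample variants that are absent from all common-variant sets.

    Each variant is a (chromosome, position, reference, alternative) tuple.
    """
    known = set().union(*common_variant_sets)
    return [variant for variant in sample_variants if variant not in known]


# Hypothetical toy data standing in for dbSNP Common, 1000 Genomes, and HapMap.
dbsnp_common = {("1", 100, "A", "G")}
thousand_genomes = {("2", 250, "C", "T")}
hapmap = {("1", 100, "A", "G"), ("3", 75, "G", "A")}

sample = [("1", 100, "A", "G"), ("2", 250, "C", "T"), ("7", 900, "T", "C")]
candidates = remove_common_variants(sample, dbsnp_common, thousand_genomes, hapmap)
```

The sketch also makes the caveat above concrete: any inherited cancer variant that happens to be listed in one of the sets is removed together with the harmless common variants.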
Next, the remaining somatic variants are annotated with gene names, amino acid changes,
conservation scores and information from COSMIC (database with known variants in cancer),
ClinVar (known variants with medical impact) and dbSNP (all known variants).
To run the Filter Somatic Variants tool, go to:
Toolbox | Ready-to-Use Workflows | Whole Genome Sequencing | Filter Somatic Variants
1. Double-click on the Filter Somatic Variants tool to start the analysis. If you are connected
to a server, you will first be asked where you would like to run the analysis. Next, you will
be asked to select the variant track you would like to use for filtering somatic variants.
The panel in the left side of the wizard shows the kind of input that should be provided
(figure 5.13). Select by double-clicking on the variant track file name or clicking once on the file
and then clicking on the arrow pointing to the right side in the middle of the wizard.
Click on the button labeled Next.
Figure 5.13: Select the variant track from which you would like to filter somatic variants.
2. In the next step you will be asked to specify which of the 1000 Genomes populations
should be used for annotation (figure 5.14).
Figure 5.14: Specify which 1000 Genomes population to use for annotation.
Click on the button labeled Next.
3. The next wizard step will once again allow you to specify the 1000 Genomes population
that should be used, this time for filtering out variants found in the 1000 Genomes project
(figure 5.15).
Figure 5.15: Specify which 1000 Genomes population to use for filtering out known variants.
Click on the button labeled Next.
4. The next wizard step (figure 5.16) concerns removal of variants found in the HapMap
database. Select the population you would like to use from the drop-down list. Please
note that the populations available from the drop-down list can be specified with the Data
Management ( ) function found in the top right corner of the Workbench (see section
3.1.4).
Figure 5.16: Specify which HapMap population to use for filtering out known variants.
5. Click on the button labeled Next to go to the last wizard step (shown in figure 5.17).
Figure 5.17: Check the selected parameters by pressing "Preview All Parameters".
Pressing the button Preview All Parameters allows you to preview all parameters. At this
step you can only view the parameters; it is not possible to make any changes. Choose to
save the results and click on the button labeled Finish.
Two types of output are generated:
1. Somatic Candidate Variants Track that holds the variant data. This track is also included
in the Genome Browser View. If you hold down the Ctrl key (Cmd on Mac) while clicking on
the table icon in the lower left side of the View Area, you can open the table view in split
view. The table and the variant track are linked together, and when you click on a row in
the table, the track view will automatically bring this position into focus.
2. Genome Browser View Filter Somatic Variants A collection of tracks presented together.
Shows the somatic candidate variants together with the human reference sequence, genes,
transcripts, coding regions, and variants detected in ClinVar, COSMIC, 1000 Genomes, and
the PhastCons conservation scores (see figure 5.18).
Figure 5.18: The Genome Browser View showing the annotated somatic variants together with a
range of other tracks.
The track with the conservation scores allows you to see the level of nucleotide conservation
(from a multiple alignment with many vertebrates) in the region around each variant. Mapped
sequencing reads as well as other tracks can be easily added to the Genome Browser View.
If you click on the annotated variant track in the Genome Browser View, a table will be shown
that includes all variants and the added information/annotations. This is shown in figure 5.19.
Adding information from other sources may help you identify interesting candidate variants for
further research. For example, known cancer-associated variants (present in the COSMIC database) or
variants known to play a role in drug response or other clinically relevant phenotypes (present in
the ClinVar database) can easily be identified. Further, variants not found in the COSMIC and/or
ClinVar databases can be prioritized based on amino acid changes in case the variant causes
changes on the amino acid level.
A high conservation level between different vertebrates or mammals in the region containing the
variant can also be used to give a hint about whether a given variant is found in a region with an
important functional role. If you would like to use the conservation scores to identify interesting
variants, we recommend that variants with a conservation score of more than 0.9 (PhastCons
score) are prioritized over variants with lower conservation scores.
It is possible to filter variants based on their annotations. This type of filtering can be facilitated
using the table filter found at the top part of the table. If you are performing multiple experiments
Figure 5.19: The Genome Browser View showing the annotated somatic variants together with a
range of other tracks.
where you would like to use the exact same filter criteria, you can create a filter that can be
saved and reused. To do this:
Toolbox | Identify Candidate Variants | Create Filter Criteria
This tool can be used to specify the filter, and the Annotate Variants workflow should then be
extended with the Identify Candidate Variants tool (configured with the Filter Criterion). The CLC Cancer
Research Workbench reference manual has a chapter that describes this in detail
(http://clccancer.com/software/#downloads, see chapter: "Workflows" for more information
on how pre-installed workflows can be extended and/or edited).
Note! Sometimes the databases (e.g. COSMIC) are updated with a newer version, or you may
have your own version of the database. In such cases you may wish to change one of the
databases used. This can be done with the "Data Management" function, which is described in section
3.1.4.
5.5 Identify Somatic Variants from Tumor Normal Pair (WGS)
The "Identify Somatic Variants from Tumor Normal Pair (WGS)" ready-to-use workflow can be used
to identify potential somatic variants in a tumor sample when you also have a normal/control
sample from the same patient.
When running the "Identify Somatic Variants from Tumor Normal Pair (WGS)" workflow, the reads are
mapped and the variants identified. An internal workflow removes germline variants that are
found in the mapped reads of the normal/control sample, and variants outside the target region are
removed as they are likely to be false positives due to non-specific mapping of sequencing reads.
Next, remaining variants are annotated with gene names, amino acid changes, conservation
scores and information from clinically relevant databases like COSMIC (known cancer associated
variants) and ClinVar (variants with clinically relevant association). Finally, information from
dbSNP is added to see which of the detected variants have been observed before and which are
completely new.
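The idea behind the germline removal step can be sketched as a simple set operation. This is a conceptual illustration only and not the Workbench's internal workflow, which inspects the mapped reads of the normal sample using the settings shown in the wizard. The names `VariantKey` and `remove_germline` are hypothetical, and the sketch assumes that the alleles observed in the normal sample have already been collected into a set.

```python
from typing import Iterable, List, Set, Tuple

# (chromosome, position, reference allele, alternative allele); a hypothetical key format.
VariantKey = Tuple[str, int, str, str]


def remove_germline(
    tumor_variants: Iterable[VariantKey], normal_alleles: Set[VariantKey]
) -> List[VariantKey]:
    """Keep only tumor variants that were not observed in the normal sample."""
    return [v for v in tumor_variants if v not in normal_alleles]
```

A variant seen in both samples is treated as germline and dropped; what remains is the list of candidate somatic variants that would go on to annotation.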
How to run the "Identify Somatic Variants from Tumor Normal Pair" ready-to-use workflow
1. Go to the toolbox and double-click on the "Identify Somatic Variants from Tumor Normal
Pair" ready-to-use workflow (figure 5.20).
Figure 5.20: The ready-to-use workflows are found in the toolbox.
This will open the wizard shown in figure 5.21 where you can select the tumor sample
reads.
Figure 5.21: Select the tumor sample reads.
When you have selected the tumor sample reads click on the button labeled Next.
2. In the next wizard step (figure 5.22), please specify the normal sample reads.
3. Click on the button labeled Next, which will take you to the next wizard step (figure 5.23).
In this wizard step you can adjust the settings used for variant detection. For a description of the different parameters that can be adjusted in the variant detection
step, we refer to the description of the "Low Frequency Variant Detection" tool in
the CLC Cancer Research Workbench user manual (http://www.clcsupport.com/clccancerresearchworkbench/current/index.php?manual=Low_Frequency_Variant_Detection.html). As general filters are applied to the different variant detectors that are available in CLC Cancer Research Workbench, the description of the filters
is found in a separate section called "Filters" (see http://www.clcsupport.com/clccancerresearchworkbench/current/index.php?manual=Filters.html). If
you click on "Locked Settings", you will be able to see all parameters used for variant
detection in the ready-to-use workflow.
Figure 5.22: Select the normal sample reads.
Figure 5.23: Specify the settings for the variant detection.
4. Click on the button labeled Next to go to the step where you can adjust the settings for
removal of germline variants (figure 5.24).
5. Click on the button labeled Next.
In the next wizard step you can check the selected settings by clicking on the button labeled
Preview All Parameters (figure 5.25).
Figure 5.24: Specify setting for removal of germline variants.
Figure 5.25: Check the parameters and save the results.
In the Preview All Parameters wizard you can only check the settings; it is not possible to
make any changes at this point. At the bottom of this wizard there are two buttons regarding
export functions; one button allows specification of the export format, and the other button
(the one labeled "Export Parameters") allows specification of the export destination. When
selecting an export location, you will export the analysis parameter settings that were
specified for this specific experiment.
6. Click on the button labeled OK to go back to the previous wizard step and choose Save.
Note! If you choose to open the results, the results will not be saved automatically. You
can always save the results at a later point.
Six different outputs are generated:
1. Read Mapping Tumor: The mapped sequencing reads for the tumor sample. The
reads are shown in different colors depending on their orientation, whether they are single
reads or paired reads, and whether they map unambiguously. For the color codes please
see the description of sequence colors in the CLC Genomics Workbench manual that can
be found here: http://www.clcsupport.com/clcgenomicsworkbench/current/index.php?manual=View_settings_in_Side_Panel.html.
2. Read Mapping Normal: The mapped sequencing reads for the normal sample. The
reads are shown in different colors depending on their orientation, whether they are single
reads or paired reads, and whether they map unambiguously. For the color codes please
see the description of sequence colors in the CLC Genomics Workbench manual that can
be found here: http://www.clcsupport.com/clcgenomicsworkbench/current/index.php?manual=View_settings_in_Side_Panel.html.
3. Mapping Report Tumor: The report consists of a number of tables and graphs that in
different ways provide information about the mapped reads from the tumor sample.
4. Mapping Report Normal: The report consists of a number of tables and graphs that in
different ways provide information about the mapped reads from the normal sample.
5. Annotated Somatic Variants: A variant track holding the identified and annotated
somatic variants. The variants can be shown in track format or in table format. When
holding the mouse over the detected variants in the Genome Browser view a tooltip appears
with information about the individual variants. You will have to zoom in on the variants to
be able to see the detailed tooltip.
6. Genome Browser View Tumor Normal Comparison: A collection of tracks presented
together. Shows the annotated variants track together with the human reference sequence,
genes, transcripts, coding regions, the mapped reads for both normal and tumor, the
annotated somatic variants, information from the ClinVar and COSMIC databases, and
finally a track showing the conservation score (see figure 5.26).
5.6 Identify Known Variants in One Sample (WGS)
The "Identify Known Variants in One Sample" ready-to-use workflow is a combined data analysis
and interpretation ready-to-use workflow.
It should be used to test a sample for the presence or absence of known variants specified
by the user (e.g. known breast cancer associated variants).
Please note that the ready-to-use workflow will not identify new variants.
The Identify Known Variants in One Sample ready-to-use workflow runs an internal workflow that
maps the sequencing reads to the human genome sequence and does a local realignment of the
mapped reads to improve the subsequent variant detection. Next, the variants specified by the user are
identified in the read mapping. At the end, the information that was already present on the known variants is
added to the results.
5.6.1 Import your known variants
To import your known variants into the Cancer Research Workbench, you should have them in GVF
or VCF 4.1 format.
Please use the Tracks import as part of the Import tool in the toolbar to import your file into the
Cancer Research Workbench.
5.6.2 Import your targeted regions
A file with the genomic regions targeted by the amplicon or hybridization kit will be provided by
the vendor. To obtain this file you will have to get in contact with the vendor and ask them to
send this target regions file to you. You will get it in either .bed or .gff format.
Figure 5.26: The Genome Browser View presents all the different data tracks together and makes
it easy to compare different tracks.
Please use the Tracks import as part of the Import tool in the toolbar to import your file into the
Cancer Research Workbench.
5.6.3 How to run the "Identify Known Variants in One Sample" ready-to-use workflow
1. Go to the toolbox and double-click on the "Identify Known Variants in One Sample"
ready-to-use workflow (figure 5.27).
Figure 5.27: The ready-to-use workflows are found in the toolbox.
This will open the wizard step shown in figure 5.28 where you can select the reads of the
sample, which should be tested for presence or absence of your known variants.
Figure 5.28: Select the sequencing reads from the sample you would like to test for your known
variants.
Please select all sequencing reads from your sample. If several samples should be
analyzed, the tool has to be run in batch mode. This is done by ticking "Batch" at the
bottom of the wizard (as shown in figure 5.28) and selecting the folder that
holds the data you wish to analyze. If you have your sequencing data in separate folders,
you should choose to run the analysis in batch mode.
When you have selected the sample(s) you wish to analyze, click on the button labeled
Next and specify the track with the known variants that should be identified in your sample
(figure 5.29). Furthermore, in this wizard step you can specify the minimum read coverage
for the position of the variant that should be identified. If the coverage at the position of
the variant is below this minimum, this will be indicated in the result.
The parameter "Detection Frequency" is used twice in the calculation. First, it determines
whether a variant is reported as detected (observed frequency > specified frequency) or
not detected (observed frequency < specified frequency). Second, it determines whether a variant
is labeled as heterozygous (frequency of another allele identified at the position of the
variant in the alignment > specified frequency) or homozygous (frequency of all other alleles
identified at the position of the variant in the alignment < specified frequency).
2. Click on the button labeled Next, which will take you to the next wizard step (figure 5.30).
In this and the next dialog, you will be asked which of the annotations/information
added to the variants should be included in the results.
Please specify your track with known variants.
3. Click on the button labeled Next and once again select the same track with known variants
(figure 5.31).
Figure 5.29: Specify the track with the known variants that should be identified.
Figure 5.30: Please select the track with your known variants again. Annotations/information
from this track will be added to the overview mutation track.
Figure 5.31: Once again select the track with known variants. This time the track is used to add
information to the detailed mutation track.
4. Click on the button labeled Next to go to the last wizard step (figure 5.32).
In this wizard step you can check the selected settings by clicking on the button labeled
Preview All Parameters. In the Preview All Parameters wizard you can only check the
settings; it is not possible to make any changes at this point. At the bottom of this
wizard there are two buttons regarding export functions; one button allows specification
of the export format, and the other button (the one labeled "Export Parameters") allows
specification of the export destination. When selecting an export location, you will export
the analysis parameter settings that were specified for this specific experiment.
5. Click on the button labeled OK to go back to the previous dialog box and choose Save.
Note! If you choose to open the results, the results will not be saved automatically. You
can always save the results at a later point.
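The role of the minimum coverage and the "Detection Frequency" parameter described in step 1 can be sketched as follows. This is an illustrative reading of the description and not the Workbench's code. The function name `classify_known_variant`, the dictionary of allele counts, and the use of fractions between 0 and 1 for frequencies are assumptions of the sketch. A frequency exactly equal to the threshold is treated as not detected here, since the text only defines the strict cases.

```python
from typing import Dict, Optional


def classify_known_variant(
    allele_counts: Dict[str, int],
    variant_allele: str,
    detection_frequency: float,
    min_coverage: int,
) -> Dict[str, Optional[object]]:
    """Report coverage, detection and zygosity for one known variant position."""
    coverage = sum(allele_counts.values())
    result: Dict[str, Optional[object]] = {
        "coverage_sufficient": coverage >= min_coverage,
        "frequency": 0.0,
        "detected": False,
        "zygosity": None,
    }
    if coverage == 0:
        return result

    frequency = allele_counts.get(variant_allele, 0) / coverage
    result["frequency"] = frequency
    # First use of the threshold: is the known variant detected at all?
    result["detected"] = frequency > detection_frequency

    if result["detected"]:
        # Second use: does any other allele also exceed the threshold?
        other_allele_present = any(
            count / coverage > detection_frequency
            for allele, count in allele_counts.items()
            if allele != variant_allele
        )
        result["zygosity"] = "heterozygous" if other_allele_present else "homozygous"
    return result
```

For example, with 48 reads showing A and 52 showing G at the position of a known G variant and a threshold of 0.2, the variant is reported as detected and heterozygous, because the other allele also exceeds the threshold.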
Figure 5.32: Check the settings and save your results.
5.6.4 Output from the Identify Known Variants in One Sample
The "Identify Known Variants in One Sample" tool produces six different output types.
1. Read Mapping Report: The report consists of a number of tables and graphs that in
different ways provide information about the mapped reads.
2. Read Mapping: The mapped sequencing reads. The reads are shown in different
colors depending on their orientation, whether they are single reads or paired reads,
and whether they map unambiguously. For the color codes please see the description of sequence colors in the CLC Genomics Workbench manual that can be found
here: http://www.clcsupport.com/clcgenomicsworkbench/current/index.php?manual=View_settings_in_Side_Panel.html.
3. Overview Variants Detected: Annotation track showing the known variants. The
table view provides information about the known variants. Four columns starting with the
sample name and followed by "Read Mapping coverage", "Read Mapping detection", "Read
Mapping frequency", and "Read Mapping zygosity" provide an overview of whether or not
the known variants have been detected in the sequencing reads.
4. Variants Detected in Detail: Annotation track showing the known variants. Like
the "Overview Variants Detected" table, this table provides information about the known
variants. The difference between the two tables is that the "Variants Detected in Detail"
table includes detailed information about the most frequent alternative allele (MFAA).
5. Genome Browser View Identify Known Variants: A collection of tracks presented
together. Shows the annotated variants track together with the human reference sequence,
genes, transcripts, coding regions, target regions coverage, the mapped reads, the overview
of the detected variants, and the variants detected in detail.
6. Log: A log of the workflow execution.
It is a good idea to start by looking at the mapping report to see whether the coverage is sufficient in
the regions of interest (e.g. > 30x). Please also check that at least 90% of the reads are mapped
to the human reference sequence. In case of a targeted experiment, we also recommend that
you check that the majority of the reads are mapping to the targeted region.
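If you export the key numbers from the mapping report, the checks above can be written down as a small helper. This is only a sketch of the rule of thumb in the text; the function `check_mapping_qc` and its parameters are hypothetical, and the numbers have to be read from the report manually or from an export.

```python
from typing import List


def check_mapping_qc(
    total_reads: int,
    mapped_reads: int,
    mean_coverage: float,
    min_coverage: float = 30.0,        # example threshold from the text
    min_mapped_fraction: float = 0.90,  # "at least 90% of the reads"
) -> List[str]:
    """Return a list of warnings; an empty list means both checks passed."""
    warnings: List[str] = []
    mapped_fraction = mapped_reads / total_reads if total_reads else 0.0
    if mean_coverage <= min_coverage:
        warnings.append(
            f"Coverage {mean_coverage:.1f} is not above {min_coverage:.0f}"
        )
    if mapped_fraction < min_mapped_fraction:
        warnings.append(
            f"Only {mapped_fraction:.1%} of reads mapped to the reference"
        )
    return warnings
```

Both thresholds are kept as parameters because the text presents them as examples rather than fixed limits.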
When this has been done you can open the Genome Browser View file (see figure 5.33).
The Genome Browser View includes the overview track of known variants and the detailed result
track in the context of the human reference sequence, genes, transcripts, coding regions,
targeted regions, mapped sequencing reads, and clinically relevant variants in the COSMIC
database.
Figure 5.33: Genome Browser View that allows inspection of the identified variants in the context
of the human genome and external databases.
Finally, a track with conservation scores has been added to be able to see the level of nucleotide
conservation (from a multiple alignment with many vertebrates) in the region around each variant.
The difference between the overview variant track and the detailed variant track is the annotations
added to the variants.
By double clicking on one of the annotated variant tracks in the Genome Browser View, a table
will be shown that includes all variants and the added information/annotations (see figure 5.34).
Note! We do not recommend deleting any of the produced files individually, as some of
them are linked to other outputs. Please always delete all of them at the same time.
Figure 5.34: Genome Browser View with an open overview variant track with information about if
the variant has been detected or not, the identified zygosity, if the coverage was sufficient at this
position and the observed allele frequency.
Chapter 6
Whole exome sequencing (WES)
Contents
6.1 Automatic analysis of sequencing data (WES)
6.2 Identify Variants (WES)
6.3 Annotate Variants (WES)
6.4 Filter Somatic Variants (WES)
6.5 Identify Somatic Variants from Tumor Normal Pair (WES)
6.5.1 Import your targeted regions
6.5.2 How to run the "Identify Somatic Variants from Tumor Normal Pair" ready-to-use workflow
6.6 Identify Known Variants in One Sample (WES)
6.6.1 Import your known variants
6.6.2 Import your targeted regions
6.6.3 How to run the "Identify Known Variants in One Sample" ready-to-use workflow
6.6.4 Output from the Identify Known Variants in One Sample
6.7 Identify and Annotate Variants (WES)
The protein coding part of the human genome accounts for around 1 % of the genome and
consists of around 180,000 exons covering an area of 30 megabases (Mb) [Ng et al., 2009].
By targeting sequencing to only the protein coding parts of the genome, exome sequencing is a
cost efficient way of generating sequencing data that is believed to harbor the vast majority of
the disease-causing mutations [Choi et al., 2009].
6.1 Automatic analysis of sequencing data (WES)
Six ready-to-use workflows are available for analysis of whole exome sequencing data. The
concept of the pre-installed ready-to-use workflows is that read data are used as input in one
end of the workflow and in the other end of the workflow you get a track based genome browser
view and a table with all the identified variants, which may or may not have been subjected to
different kinds of filtering and/or annotation.
In this chapter we will discuss what the individual ready-to-use workflows can be used for and go
through step by step how to run the workflows.
Note! Often you will have to prepare data with one of the two Preparing Raw Data workflows
described in section 4 before you proceed to Analysis of sequencing data (WES).
6.2 Identify Variants (WES)
The "Identify Variants" tool takes sequencing reads as input and returns identified variants as
part of a Genome Browser View.
The tool runs an internal workflow, which starts with mapping the sequencing reads to the human
reference sequence. Then it runs a local realignment to improve the variant detection, which is
run afterwards. At the end, variants with an average base quality smaller than 20 are filtered
away.
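The final filtering step of this internal workflow can be pictured with a minimal sketch. It is not the Workbench's implementation; it only assumes that each variant carries the base qualities of the reads supporting it. The dictionary key `base_qualities` and the function `filter_low_quality` are hypothetical names; the cutoff of 20 comes from the text.

```python
from typing import Dict, List


def filter_low_quality(
    variants: List[Dict], min_average_quality: float = 20.0
) -> List[Dict]:
    """Drop variants whose average base quality is smaller than the cutoff."""
    kept = []
    for variant in variants:
        qualities = variant["base_qualities"]
        if not qualities:
            # No supporting qualities recorded; dropped in this sketch (an assumption).
            continue
        if sum(qualities) / len(qualities) >= min_average_quality:
            kept.append(variant)
    return kept
```

A variant with an average quality of exactly 20 is kept, since only values smaller than 20 are filtered away.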
In addition, a targeted region report is created to inspect the overall coverage and mapping
specificity in the targeted regions.
Import your targeted regions
A file with the genomic regions targeted by the amplicon or hybridization kit will be provided by
the vendor. To obtain this file you will have to get in contact with the vendor and ask them to
send this target regions file to you. You will get it in either .bed or .gff format.
Please use the Tracks import as part of the Import tool in the toolbar to import your file into the
Cancer Research Workbench.
How to run the "Identify Variants" ready-to-use workflow
1. Go to the toolbox and double-click on the "Identify Variants" ready-to-use workflow (figure 6.1).
Figure 6.1: The ready-to-use workflows are found in the toolbox.
This will open the wizard shown in figure 6.2 where you can select the sequencing reads
from the sample, which should be analyzed.
Please select all sequencing reads from your sample. If several samples should be
analyzed, the tool has to be run in batch mode. This is done by ticking "Batch" at the
bottom of the wizard (as shown in figure 6.43) and selecting the folder that
holds the data you wish to analyze. If you have your sequencing data in separate folders,
you should choose to run the analysis in batch mode.
When you have selected the sample(s) you wish to analyze, click on the button labeled
Next.
2. In the next wizard step (figure 6.3) you have to specify the track with the targeted regions
from the experiment. You can also specify the minimum read coverage, which should be
present in the targeted regions.
Figure 6.2: Please select all sequencing reads from the sample to be analyzed.
Figure 6.3: Select the track with the targeted regions from your experiment.
3. Click on the button labeled Next, which will take you to the next wizard step (figure 6.4). In
this wizard step you can specify the parameters for detecting variants.
4. Click on the button labeled Next, which will take you to the next wizard step (figure 6.5).
In this wizard step you select the targeted region track. Variants found outside the targeted
region will be removed.
5. Click on the button labeled Next to go to the last wizard step (figure 6.6).
In this wizard step you can check the selected settings by clicking on the button
labeled Preview All Parameters. In the Preview All Parameters wizard step you can only
check the settings; it is not possible to make any changes at this point. At the bottom of
this wizard there are two buttons regarding export functions; one button allows specification
of the export format, and the other button (the one labeled "Export Parameters") allows
specification of the export destination. When selecting an export location, you will export
the analysis parameter settings that were specified for this specific experiment.
6. Click on the button labeled OK to go back to the previous wizard step and choose Save.
Note! If you choose to open the results, the results will not be saved automatically. You
can always save the results at a later point.
Output from the Identify Variants workflow
The "Identify Variants" tool produces six different types of output:
1. Read Mapping: The mapped sequencing reads. The reads are shown in different
colors depending on their orientation, whether they are single reads or paired reads,
and whether they map unambiguously. For the color codes please see the description of sequence colors in the CLC Genomics Workbench manual that can be found
here: http://www.clcsupport.com/clcgenomicsworkbench/current/index.php?manual=View_settings_in_Side_Panel.html.
2. Target Regions Coverage: The target regions coverage track shows the coverage of the
targeted regions. Detailed information about coverage and read count can be found in the
table format, which can be opened by pressing the table icon found in the lower left corner
of the View Area.
3. Target Regions Coverage Report: The report consists of a number of tables and graphs
that in different ways provide information about the targeted regions.
4. Identified Variants: A variant track holding the identified variants. The variants can
be shown in track format or in table format. When holding the mouse over the detected
variants in the Genome Browser view a tooltip appears with information about the individual
variants. You will have to zoom in on the variants to be able to see the detailed tooltip.
5. Genome Browser View Identify Variants: A collection of tracks presented together.
Shows the annotated variants track together with the human reference sequence, genes,
transcripts, coding regions, the mapped reads, the identified variants, and the structural
variants (see figure 6.12).
Figure 6.4: Please specify the parameters for variant detection.
Figure 6.5: Select the targeted region track. Variants found outside the targeted region will be
removed.
Figure 6.6: Choose to save the results. In this wizard step you get the chance to preview the
settings used in the ready-to-use workflow.
It is important that you do not delete any of the produced files individually as some of the outputs
are linked to other outputs. If you would like to delete the outputs, please always delete all of
them at the same time.
Please first have a look at the mapping report to see if the coverage is sufficient in regions of
interest (e.g. > 30x). Furthermore, please check that at least 90% of reads are mapped to the
human reference sequence. In case of a targeted experiment, please also check that the majority
of reads are mapping to the targeted region.
Afterwards please open the Genome Browser View file (see figure 6.7).
The Genome Browser View includes the track of identified variants in the context of the human
reference sequence, genes, transcripts, coding regions, targeted regions and mapped sequencing
reads.
By double clicking on the variant track in the Genome Browser View, a table will be shown which
includes information about all identified variants (see figure 6.8).
If you would like to change the reference sequence used for mapping or the human genes,
please use the "Data Management" function.
6.3 Annotate Variants (WES)
Using a variant track (e.g. the output from the Identify Variants ready-to-use workflow) as input, the
Annotate Variants (WES) ready-to-use workflow runs an internal workflow that adds the following
annotations to the variant track:
• Gene names: Adds names of genes whenever a variant is found within a known gene.
• mRNA: Adds names of mRNA whenever a variant is found within a known transcript.
• CDS: Adds names of CDS whenever a variant is found within a coding sequence.
• Amino acid changes: Adds information about amino acid changes caused by the variants.
• Information from COSMIC: Adds information from the "Catalogue of Somatic Mutations in
Cancer" database.
• Information from ClinVar: Adds information about the relationships between human variations and their clinical significance.
• Information from dbSNP: Adds information from the "Single Nucleotide Polymorphism
Database", which is a general catalog of genome variation, including SNPs, multinucleotide
polymorphisms (MNPs), insertions and deletions (InDels), and short tandem repeats (STRs).
• PhastCons Conservation scores: The conservation scores, in this case generated from
a multiple alignment with a number of vertebrates, describe the level of nucleotide
conservation in the region around each variant.
Figure 6.7: The Genome Browser View allows you to inspect the identified variants in the context
of the human genome.
1. Go to the toolbox and select the Annotate Variants (WES) workflow. In the first wizard
step, select the input variant track (figure 6.9).
2. Click on the button labeled Next. The only parameter that should be specified by the
user is which 1000 Genomes population to use (figure 6.10). This can be done using the
drop-down list found in this wizard step. Please note that the populations available from
the drop-down list can be specified with the Data Management function found in the
top right corner of the Workbench (see section 3.1.4).
3. Click on the button labeled Next to go to the last wizard step (figure 6.11).
Figure 6.8: Genome Browser View with an open track table to inspect identified variants more
closely in the context of the human genome.
Figure 6.9: Select the variant track to annotate.
In this wizard step you can check the selected settings by clicking on the button labeled
Preview All Parameters. In the Preview All Parameters wizard you can only check the
settings; it is not possible to make any changes at this point.
4. Choose to Save your results and click on the button labeled Finish.
Two types of output are generated:
1. Annotated Variants: Annotation track showing the variants. Hold the mouse over one
of the variants or right-click on the variant, and a tooltip will appear with detailed information
about the variant.
2. Genome Browser View Annotated Variants: A collection of tracks presented together.
Shows the annotated variants track together with the human reference sequence, genes,
transcripts, coding regions, and variants detected in dbSNP, ClinVar, COSMIC, 1000
Genomes, and PhastCons conservation scores (see figure 6.12).
Figure 6.10: Select the relevant 1000 Genomes population(s).
Figure 6.11: Check the settings and save your results.
Note! Please be aware, that if you delete the annotated variant track, this track will also
disappear from the genome browser view.
It is possible to add tracks to the Genome Browser View such as mapped sequencing reads as
well as other tracks. This can be done by dragging the track directly from the Navigation Area to
the Genome Browser View.
If you double-click on the name of the annotated variant track in the left hand side of the Genome
Browser View, a table that includes all variants and the added information/annotations will open
(see figure 6.13). The table and the Genome Browser View are linked; if you click on an entry in
the table, this particular position in the genome will automatically be brought into focus in the
Genome Browser View.
You may be met with a warning as shown in figure 6.14. This is simply a warning telling you that
it may take some time to create the table if you are working with tracks containing large amounts
of annotations. Please note that in case none of the variants are present in COSMIC, ClinVar or
dbSNP, the corresponding annotation column headers are missing from the result.
Figure 6.12: The output from the "Annotate Variants" ready-to-use workflow is a genome browser
view (a track list) containing individual tracks for all added annotations.
Adding information from other sources may help you identify interesting candidate variants for
further research. For example, known cancer-associated variants (present in the COSMIC database) and
variants known to play a role in drug response or other clinically relevant phenotypes (present in
the ClinVar database) can easily be identified. Further, variants not found in the COSMIC and/or
ClinVar databases can be prioritized based on amino acid changes, in case the variant causes
changes at the amino acid level.
A high conservation level between different vertebrates or mammals in the region containing the
variant can also hint at whether a given variant is found in a region with an
important functional role. If you would like to use the conservation scores to identify interesting
variants, we recommend that variants with a conservation score of more than 0.9 (PhastCons
score) are prioritized over variants with lower conservation scores.
It is possible to filter variants based on their annotations. This type of filtering can be facilitated
using the table filter found at the top part of the table. If you are performing multiple experiments
where you would like to use the exact same filter criteria, you can create a filter that can be
saved and reused. To do this:
Figure 6.13: The output from the "Annotate Variants" ready-to-use workflow is a genome browser
view (a track list). The information is also available in table view. Click on the small table icon to
open the table view. If you hold down the "Ctrl" key while clicking on the table icon, you will open a
split view showing both the genome browser view and the table view.
Figure 6.14: Warning that appears when you work with tracks containing many annotations.
Toolbox | Identify Candidate Variants | Create Filter Criteria
This tool can be used to specify the filter and the Annotate Variants workflow should be
extended by the Identify Candidate Tool (configured with the Filter Criterion). The CLC Cancer
Research Workbench reference manual has a chapter that describes this in detail (http:
//clccancer.com/software/#downloads, see chapter: "Workflows" for more information
on how pre-installed workflows can be extended and/or edited).
Note! Sometimes the databases (e.g. COSMIC) are updated with a newer version, or maybe you
have your own version of the database. In such cases you may wish to change one of the
databases used. This can be done with the "Data Management" function, which is described in section
3.1.4.
6.4
Filter Somatic Variants (WES)
If you are analyzing a list of variants that have been detected in a tumor or blood sample
where no control sample is available from the same patient, you can use the "Filter Somatic
Variants (WES)" ready-to-use workflow to identify potential somatic variants. The purpose of this
ready-to-use workflow is to use publicly available (or your own) databases, with common variants
in a population, to extract potential somatic variants whenever no control/normal sample from
the same patient is available.
The "Filter Somatic Variants (WES)" ready-to-use workflow accepts variant tracks ( ) (e.g. the
output from the Identify Variants ready-to-use workflow) as input. In cases with heterozygous
variants, the reference allele is first filtered away, then variants outside the targeted region are
removed, and lastly, variants found in the Common dbSNP, 1000 Genomes Project, and HapMap
databases are deleted. Variants in those databases are assumed to not contain relevant somatic
variants.
Please note that this tool will likely also remove inherited cancer variants that are present at a
low percentage in a population.
Next, the remaining somatic variants are annotated with gene names, amino acid changes,
conservation scores and information from COSMIC (database with known variants in cancer),
ClinVar (known variants with medical impact) and dbSNP (all known variants).
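The three filtering steps described above (removing reference alleles, removing variants outside the targeted regions, and removing variants found in databases of common variants) can be sketched as follows. This is a simplified illustration and not the Workbench's implementation; the `Variant` record, the 1-based inclusive region coordinates, and the representation of each database as a set of `(chrom, pos, ref, alt)` tuples are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple

Region = Tuple[str, int, int]          # (chromosome, start, end), 1-based inclusive
VariantKey = Tuple[str, int, str, str]  # (chromosome, position, ref, alt)


@dataclass(frozen=True)
class Variant:
    # Hypothetical record used only for this sketch.
    chrom: str
    pos: int
    ref: str
    alt: str
    is_reference_allele: bool


def in_targets(variant: Variant, targets: Iterable[Region]) -> bool:
    """True if the variant position lies inside any targeted region."""
    return any(c == variant.chrom and s <= variant.pos <= e for c, s, e in targets)


def filter_somatic_candidates(
    variants: Iterable[Variant],
    targets: List[Region],
    common_databases: List[Set[VariantKey]],
) -> List[Variant]:
    """Keep variants that pass all three filters, in the order described above."""
    kept = []
    for v in variants:
        if v.is_reference_allele:          # step 1: drop reference alleles
            continue
        if not in_targets(v, targets):     # step 2: drop off-target variants
            continue
        key = (v.chrom, v.pos, v.ref, v.alt)
        if any(key in db for db in common_databases):  # step 3: drop common variants
            continue
        kept.append(v)
    return kept
```

As noted above, a filter of this kind also removes inherited cancer variants that happen to be listed in one of the common-variant databases.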
To run the Filter Somatic Variants tool, go to:
Toolbox | Ready-to-Use Workflows | Whole Exome Sequencing ( ) | Filter Somatic Variants ( )
1. Double-click on the Filter Somatic Variants tool to start the analysis. If you are connected
to a server, you will first be asked where you would like to run the analysis. Next, you will
be asked to select the variant track you would like to use for filtering somatic variants.
The panel on the left side of the wizard shows the kind of input that should be provided
(figure 6.15). Select the variant track by double-clicking on its name or by clicking once on it
and then clicking on the arrow pointing to the right side in the middle of the wizard.
Figure 6.15: Select the variant track from which you would like to filter somatic variants.
Click on the button labeled Next.
2. In the next step you will be asked to specify which of the 1000 Genomes populations
should be used for annotation (figure 6.16).
Figure 6.16: Specify which 1000 Genomes population to use for annotation.
Click on the button labeled Next.
3. In this wizard step, you are asked to supply a track containing the targeted regions
(figure 6.17). Select the track by clicking on the folder icon ( ) in the wizard.
Figure 6.17: Select your target regions track.
Click on the button labeled Next.
4. The next wizard step will once again allow you to specify the 1000 Genomes population
that should be used, this time for filtering out variants found in the 1000 Genomes project
(figure 6.18).
Figure 6.18: Specify which 1000 Genomes population to use for filtering out known variants.
Click on the button labeled Next.
5. The next wizard step (figure 6.19) concerns removal of variants found in the HapMap
database. Select the population you would like to use from the drop-down list. Please
note that the populations available from the drop-down list can be specified with the Data
Management ( ) function found in the top right corner of the Workbench (see section
3.1.4).
Figure 6.19: Specify which HapMap population to use for filtering out known variants.
6. Click on the button labeled Next to go to the last wizard step (shown in figure 6.20).
Figure 6.20: Check the selected parameters by pressing "Preview All Parameters".
Pressing the button Preview All Parameters allows you to preview all parameters. At this
step you can only view the parameters; it is not possible to make any changes. Choose to
save the results and click on the button labeled Finish.
Two types of output are generated:
1. Somatic Candidate Variants Track that holds the variant data. This track is also included
in the Genome Browser View. If you hold down the Ctrl key (Cmd on Mac) while clicking on
the table icon in the lower left side of the View Area, you can open the table view in split
view. The table and the variant track are linked together, and when you click on a row in
the table, the track view will automatically bring this position into focus.
2. Genome Browser View Filter Somatic Variants A collection of tracks presented together.
Shows the somatic candidate variants together with the human reference sequence, genes,
transcripts, coding regions, and variants detected in ClinVar, COSMIC, 1000 Genomes, and
the PhastCons conservation scores (see figure 6.21).
Figure 6.21: The Genome Browser View showing the annotated somatic variants together with a
range of other tracks.
To see the level of nucleotide conservation (from a multiple alignment with many vertebrates)
in the region around each variant, a track with conservation scores is added as well. Mapped
sequencing reads as well as other tracks can be easily added to this Genome Browser View. By
double clicking on the annotated variant track in the Genome Browser View, a table will be shown
that includes all variants and the added information/annotations (see figure 6.22).
Adding information from other sources may help you identify interesting candidate variants for
further research. For example, known cancer-associated variants (present in the COSMIC database) or
variants known to play a role in drug response or other clinically relevant phenotypes (present in
the ClinVar database) can easily be identified. Further, variants not found in the COSMIC and/or
ClinVar databases can be prioritized based on whether they cause changes at the amino acid
level.
A high conservation level between different vertebrates or mammals in the region containing the
variant can also hint that the variant is found in a region with an
important functional role. If you would like to use the conservation scores to identify interesting
variants, we recommend that variants with a conservation score of more than 0.9 (PhastCons
score) are prioritized over variants with lower conservation scores.
Figure 6.22: The Genome Browser View showing the annotated somatic variants together with a
range of other tracks.
It is possible to filter variants based on their annotations. This type of filtering can be facilitated
using the table filter found at the top part of the table. If you are performing multiple experiments
where you would like to use the exact same filter criteria, you can create a filter that can be
saved and reused. To do this:
Toolbox | Identify Candidate Variants ( ) | Create Filter Criteria ( )
Use this tool to specify the filter. The Annotate Variants workflow should then be
extended with the Identify Candidate Tool (configured with the Filter Criterion). The CLC Cancer
Research Workbench reference manual has a chapter that describes this in detail
(http://clccancer.com/software/#downloads, see chapter: "Workflows" for more information
on how pre-installed workflows can be extended and/or edited).
Note! Sometimes the databases (e.g. COSMIC) are updated with a newer version, or maybe you
have your own version of the database. In such cases you may wish to change one of the
databases used. This can be done with the "Data Management" function, which is described in section
3.1.4.
6.5
Identify Somatic Variants from Tumor Normal Pair (WES)
The "Identify Somatic Variants from Tumor Normal Pair" ready-to-use workflow can be used to
identify potential somatic variants in a tumor sample when you also have a normal/control
sample from the same patient.
When you run the "Identify Somatic Variants from Tumor Normal Pair" workflow, the reads are mapped
and the variants are identified. An internal workflow removes germline variants that are found in the
mapped reads of the normal/control sample. Variants outside the target region are also removed,
as they are likely to be false positives due to non-specific mapping of sequencing reads. Next, the
remaining variants are annotated with gene names, amino acid changes, conservation scores and
information from clinically relevant databases like COSMIC (known cancer-associated variants)
and ClinVar (variants with clinically relevant association). Finally, information from dbSNP is
added to see which of the detected variants have been observed before and which are completely
new.
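The idea behind the germline removal step can be illustrated with a small Python sketch: a tumor variant is kept as a somatic candidate only if its allele is not supported by reads in the normal sample. The data layout and the `max_normal_reads` threshold are hypothetical and chosen only for the illustration; the actual settings used by the workflow are the ones shown in the wizard.

```python
from typing import Dict, List, Tuple

TumorVariant = Tuple[str, int, str]                      # (chromosome, position, allele)
AlleleCounts = Dict[Tuple[str, int], Dict[str, int]]     # read counts per allele in the normal sample


def remove_germline(
    tumor_variants: List[TumorVariant],
    normal_allele_counts: AlleleCounts,
    max_normal_reads: int = 0,
) -> List[TumorVariant]:
    """Keep tumor variants whose allele is seen in at most
    `max_normal_reads` reads of the normal sample (assumed threshold)."""
    somatic = []
    for chrom, pos, allele in tumor_variants:
        counts_here = normal_allele_counts.get((chrom, pos), {})
        support_in_normal = counts_here.get(allele, 0)
        if support_in_normal <= max_normal_reads:
            somatic.append((chrom, pos, allele))
    return somatic
```

Positions with no coverage in the normal sample pass this simple check, which is one reason why it is worth inspecting the coverage of the normal read mapping as well.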
6.5.1
Import your targeted regions
A file with the genomic regions targeted by the amplicon or hybridization kit is available from the
vendor of the enrichment kit and sequencing machine. To obtain this file you will have to get in
contact with the vendor and ask them to send this target regions file to you. You will get the file
in either .bed or .gff format.
To import the file:
Go to the toolbar | Import ( ) | Tracks ( )
6.5.2
How to run the "Identify Somatic Variants from Tumor Normal Pair" ready-to-use workflow
1. Go to the toolbox and double-click on the "Identify Somatic Variants from Tumor Normal
Pair" ready-to-use workflow (figure 6.23).
Figure 6.23: The ready-to-use workflows are found in the toolbox.
This will open the wizard shown in figure 6.24 where you can select the tumor sample
reads.
When you have selected the tumor sample reads click on the button labeled Next.
2. In the next wizard step (figure 6.25), please specify the normal sample reads.
3. Click on the button labeled Next, which will take you to the next wizard step (figure 6.26).
4. Click on the button labeled Next, which will take you to the next wizard step (figure 6.27).
In this wizard step you can select your target regions track.
Figure 6.24: Select the tumor sample reads.
Figure 6.25: Select the normal sample reads.
5. Click on the button labeled Next to specify the target regions track to be used in the
"Remove Variants Outside Targeted Regions" step (figure 6.28). The targeted region track
should be the same as the track you selected in the previous wizard step. Variants found
outside the targeted regions will not be included in the output that is generated with the
ready-to-use workflow.
Click on the button labeled Next.
6. Click on the button labeled Next to go to the step where you can adjust the settings for
removal of germline variants (figure 6.29).
7. Click on the button labeled Next and once again select the target region track (the same
track as you have already selected in previous wizard steps). This time you specify the track
to be used for quality control of the targeted sequencing, as this tool reports the performance
(enrichment and specificity) of a targeted re-sequencing experiment (figure 6.30).
In the next wizard step you can check the selected settings by clicking on the button labeled
Preview All Parameters (figure 6.31).
In the Preview All Parameters wizard you can only check the settings; it is not possible to
make any changes at this point. At the bottom of this wizard there are two buttons regarding
export functions; one button allows specification of the export format, and the other button
(the one labeled "Export Parameters") allows specification of the export destination. When
selecting an export location, you will export the analysis parameter settings that were
specified for this specific experiment.
Figure 6.26: Specify the settings for the variant detection.
Figure 6.27: Select your target region track.
8. Click on the button labeled OK to go back to the previous wizard step and choose Save.
Note! If you choose to open the results, the results will not be saved automatically. You
can always save the results at a later point.
Eight different outputs are generated:
1. Read Mapping Normal ( ) The mapped sequencing reads for the normal sample. The
reads are shown in different colors depending on their orientation, whether they are single
reads or paired reads, and whether they map unambiguously. For the color codes please
see the description of sequence colors in the CLC Genomics Workbench manual that can
be found here: http://www.clcsupport.com/clcgenomicsworkbench/current/index.php?manual=View_settings_in_Side_Panel.html
2. Read Mapping Tumor ( ) The mapped sequencing reads for the tumor sample. The
reads are shown in different colors depending on their orientation, whether they are single
reads or paired reads, and whether they map unambiguously. For the color codes please
see the description of sequence colors in the CLC Genomics Workbench manual that can
be found here: http://www.clcsupport.com/clcgenomicsworkbench/current/index.php?manual=View_settings_in_Side_Panel.html.
Figure 6.28: Select your target region track.
Figure 6.29: Specify setting for removal of germline variants.
Figure 6.30: Select target region track.
3. Target Region Coverage Report Normal ( ) The report consists of a number of tables and
graphs that in different ways provide information about the mapped reads from the normal
sample.
Figure 6.31: Check the parameters and save the results.
4. Target Region Coverage Tumor ( ) A track showing the targeted regions. The table view
provides information about the targeted regions such as target region length, coverage,
regions without coverage, and GC content.
5. Target Region Coverage Report Tumor ( ) The report consists of a number of tables and
graphs that in different ways provide information about the mapped reads from the tumor
sample.
6. Variants ( ) A variant track holding the identified variants that are found in the targeted
regions. The variants can be shown in track format or in table format. When holding
the mouse over the detected variants in the Genome Browser view a tooltip appears with
information about the individual variants. You will have to zoom in on the variants to be
able to see the detailed tooltip.
7. Annotated Somatic Variants ( ) A variant track holding the identified and annotated
somatic variants. The variants can be shown in track format or in table format. When
holding the mouse over the detected variants in the Genome Browser view a tooltip appears
with information about the individual variants. You will have to zoom in on the variants to
be able to see the detailed tooltip.
8. Genome Browser View Tumor Normal Comparison ( ) A collection of tracks presented
together. Shows the annotated variants track together with the human reference sequence,
genes, transcripts, coding regions, the mapped reads for both normal and tumor, the
annotated somatic variants, information from the ClinVar and COSMIC databases, and
finally a track showing the conservation score (see figure 6.32).
6.6
Identify Known Variants in One Sample (WES)
The "Identify Known Variants in One Sample" ready-to-use workflow is a combined data analysis
and interpretation ready-to-use workflow.
Figure 6.32: The Genome Browser View presents all the different data tracks together and makes
it easy to compare different tracks.
It should be used to check whether known variants specified by the user (e.g. known breast cancer
associated variants) are present or absent in a sample.
Please note that the ready-to-use workflow will not identify new variants.
The Identify Known Variants in One Sample ready-to-use workflow runs an internal workflow that
maps the sequencing reads to the human genome sequence and does a local realignment of the
mapped reads to improve the subsequent variant detection. Next, the variants specified by the user are
identified in the read mapping. Finally, the information already available for the known variants is
added to the results.
6.6.1
Import your known variants
To make an import into the Cancer Research Workbench, you should have your variants in GVF
or VCF 4.1 format.
Please use the Tracks import as part of the Import tool in the toolbar to import your file into the
Cancer Research Workbench.
6.6.2
Import your targeted regions
A file with the genomic regions targeted by the amplicon or hybridization kit will be provided by
the vendor. To obtain this file you will have to get in contact with the vendor and ask them to
send this target regions file to you. You will get it in either .bed or .gff format.
Please use the Tracks import as part of the Import tool in the toolbar to import your file into the
Cancer Research Workbench.
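Before importing a target regions file, it can be useful to check that it is well-formed. The following Python sketch reads the first three columns of a .bed file (chromosome, start, end) and sums the targeted length. It is an illustration only and independent of the Workbench importer; the function names are made up for the example. It relies on the BED convention that start coordinates are 0-based and end coordinates are exclusive.

```python
from typing import List, Tuple

Region = Tuple[str, int, int]  # (chromosome, 0-based start, exclusive end)


def parse_bed(text: str) -> List[Region]:
    """Parse the first three columns of BED-formatted text."""
    regions = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        # Skip blank lines, comments, and optional header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if len(fields) < 3:
            raise ValueError(f"line {line_number}: expected at least 3 columns")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if start < 0 or end < start:
            raise ValueError(f"line {line_number}: invalid coordinates")
        regions.append((chrom, start, end))
    return regions


def total_target_length(regions: List[Region]) -> int:
    """Sum of region lengths; overlapping regions are counted twice."""
    return sum(end - start for _, start, end in regions)
```

For a file on disk, the text can be obtained with `open(path).read()` and passed to `parse_bed`.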
6.6.3
How to run the "Identify Known Variants in One Sample" ready-to-use workflow
1. Go to the toolbox and double-click on the "Identify Known Variants in One Sample"
ready-to-use workflow (figure 6.33).
Figure 6.33: The ready-to-use workflows are found in the toolbox.
This will open the wizard step shown in figure 6.34 where you can select the reads of the
sample, which should be tested for presence or absence of your known variants.
Figure 6.34: Select the sequencing reads from the sample you would like to test for your known
variants.
Please select all sequencing reads from your sample. If several samples should be
analyzed, the tool has to be run in batch mode. This is done by ticking
"Batch" at the bottom of the wizard (as shown in figure 6.34) and selecting the folder that
holds the data you wish to analyse. If you have your sequencing data in separate folders,
you should choose to run the analysis in batch mode.
When you have selected the sample(s) you wish to analyze, click on the button labeled
Next.
2. In the next wizard step you can select your target regions track and specify the minimum
coverage to be used when checking the quality of the targeted sequencing. The minimum
coverage will be used to provide the length of each target region that has at least this
coverage. You can also specify whether or not to ignore non-specific matches and broken
pairs. When these are applied, reads that are non-specifically mapped or belong to broken
pairs will be ignored (figure 6.35).
Figure 6.35: Select your target regions track and specify the parameters to be used for checking
the quality of the targeted sequencing.
3. Click on the button labeled Next and specify the track with the known variants that
should be identified in your sample (figure 6.36). Furthermore, in this wizard step you can
specify the minimum read coverage for the position of the variant that should be identified.
If the coverage at the position of the variant is below this minimum, this will be reported in the result.
The parameter "Detection Frequency" will be used in the calculation twice. First, it will report
in the result if a variant has been detected (observed frequency > specified frequency) or
not (observed frequency <= specified frequency). Moreover, it will determine if a variant
should be labeled as heterozygous (frequency of another allele identified at a position of a
variant in the alignment > specified frequency) or homozygous (frequency of all other alleles
identified at a position of a variant in the alignment < specified frequency).
Figure 6.36: Specify the track with the known variants that should be identified.
4. Click on the button labeled Next, which will take you to the next wizard step (figure 6.37).
In this and the next dialog, you will be asked which of the annotations/information
added to the variants should be included in the results.
Please specify your track with known variants.
5. Click on the button labeled Next and once again select the same track with known variants
(figure 6.38).
6. Click on the button labeled Next to go to the last wizard step (figure 6.39).
In this wizard step you can check the selected settings by clicking on the button labeled
Preview All Parameters. In the Preview All Parameters wizard you can only check the
settings; it is not possible to make any changes at this point. At the bottom of this
wizard there are two buttons regarding export functions; one button allows specification
of the export format, and the other button (the one labeled "Export Parameters") allows
specification of the export destination. When selecting an export location, you will export
the analysis parameter settings that were specified for this specific experiment.
Figure 6.37: Please select the track with your known variants again. Annotations/information
from this track will be added to the overview mutation track.
Figure 6.38: Once again select the track with known variants. This time the track is used to add
information to the detailed mutation track.
Figure 6.39: Check the settings and save your results.
7. Click on the button labeled OK to go back to the previous dialog box and choose Save.
Note! If you choose to open the results, the results will not be saved automatically. You
can always save the results at a later point.
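The way the minimum coverage and the "Detection Frequency" parameter from step 3 are used can be sketched in Python as follows. This is an illustration of the rules as described in this section, not the Workbench's own code. Several details are assumptions made for the sketch: frequencies are given in percent, zygosity is only reported for detected variants, and a competing allele whose frequency is exactly equal to the threshold is treated as not present (the text does not cover this boundary case).

```python
from typing import Dict, Optional, Tuple


def classify_known_variant(
    allele_frequencies: Dict[str, float],  # observed frequency (percent) per allele
    variant_allele: str,
    coverage: int,
    min_coverage: int,
    detection_frequency: float,
) -> Tuple[bool, bool, Optional[str]]:
    """Return (coverage_sufficient, detected, zygosity)."""
    coverage_sufficient = coverage >= min_coverage

    # Detected only if the observed frequency is strictly above the threshold.
    observed = allele_frequencies.get(variant_allele, 0.0)
    detected = observed > detection_frequency
    if not detected:
        return coverage_sufficient, False, None  # assumption: no zygosity call

    # Heterozygous if any other allele is also above the threshold.
    other_frequencies = [
        freq for allele, freq in allele_frequencies.items() if allele != variant_allele
    ]
    if any(freq > detection_frequency for freq in other_frequencies):
        zygosity = "heterozygous"
    else:
        # Assumption: equality with the threshold is treated as homozygous.
        zygosity = "homozygous"
    return coverage_sufficient, True, zygosity
```

For example, with a detection frequency of 20, a position where the known allele is seen in 48% of the reads and the reference allele in 52% would be reported as detected and heterozygous.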
6.6.4
Output from the Identify Known Variants in One Sample
The "Identify Known Variants in One Sample" tool produces seven different output types:
1. Read Mapping ( ) The mapped sequencing reads. The reads are shown in different
colors depending on their orientation, whether they are single reads or paired reads,
and whether they map unambiguously. For the color codes please see the description
of sequence colors in the CLC Genomics Workbench manual that can be found
here: http://www.clcsupport.com/clcgenomicsworkbench/current/index.php?manual=View_settings_in_Side_Panel.html.
2. Target Regions Coverage ( ) A track showing the targeted regions. The table view
provides information about the targeted regions such as target region length, coverage,
regions without coverage, and GC content.
3. Target Regions Coverage Report ( ) The report consists of a number of tables and graphs
that in different ways show e.g. the number, length, and coverage of the target regions and
provides information about the read count per GC%.
4. Overview Variants Detected ( ) Annotation track showing the known variants. The
table view provides information about the known variants. Four columns starting with the
sample name and followed by "Read Mapping coverage", "Read Mapping detection", "Read
Mapping frequency", and "Read Mapping zygosity" provide the overview of whether or not
the known variants have been detected in the sequencing reads.
5. Variants Detected in Detail ( ) Annotation track showing the known variants. Like
the "Overview Variants Detected" table, this table provides information about the known
variants. The difference between the two tables is that the "Variants Detected in Detail"
table includes detailed information about the most frequent alternative allele (MFAA).
6. Genome Browser View Identify Known Variants ( ) A collection of tracks presented
together. Shows the annotated variants track together with the human reference sequence,
genes, transcripts, coding regions, target regions coverage, the mapped reads, the overview
of the detected variants, and the variants detected in detail.
7. Log (
) A log of the workflow execution.
It is a good idea to start by looking at the Target Regions Coverage Report to see whether the
coverage is sufficient in the regions of interest (e.g. > 30x). Please also check that at least 90%
of the reads are mapped to the human reference sequence. In case of a targeted experiment,
we also recommend that you check that the majority of the reads are mapping to the targeted
region.
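If you copy the key numbers from the report into a script, the three checks recommended above can be expressed as a short Python function. This is a sketch only; the function and parameter names are hypothetical, the report itself is not read by the code, and "majority" is interpreted here as more than 50% of the mapped reads.

```python
from typing import List


def check_mapping_quality(
    mean_target_coverage: float,
    total_reads: int,
    mapped_reads: int,
    reads_on_target: int,
    min_coverage: float = 30.0,
    min_mapped_fraction: float = 0.9,
) -> List[str]:
    """Return a list of warnings; an empty list means all checks passed."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")

    warnings = []
    if mean_target_coverage <= min_coverage:
        warnings.append("coverage in the regions of interest is not above the minimum")

    mapped_fraction = mapped_reads / total_reads
    if mapped_fraction < min_mapped_fraction:
        warnings.append("fewer than the required fraction of reads are mapped")

    # Assumption: 'majority' means more than half of the mapped reads.
    on_target_fraction = reads_on_target / mapped_reads if mapped_reads else 0.0
    if on_target_fraction <= 0.5:
        warnings.append("the majority of mapped reads are not on target")
    return warnings
```

The default thresholds follow the recommendations in the text and can be adjusted to the requirements of your own experiment.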
When you have inspected the target regions coverage report you can open the Genome Browser
View Identify Known Variants file (see figure 6.40).
The Genome Browser View includes an overview track of the known variants and a detailed
result track presented in the context of the human reference sequence, genes, transcripts,
coding regions, targeted regions, mapped sequencing reads, and clinically relevant variants in
the COSMIC database.
Finally, a track with conservation scores has been added to be able to see the level of nucleotide
conservation (from a multiple alignment with many vertebrates) in the region around each variant.
Figure 6.40: Genome Browser View that allows inspection of the identified variants in the context
of the human genome and external databases.
The difference between the overview variant track and the detailed variant track is the annotations
added to the variants.
By double-clicking on one of the annotated variant tracks in the Genome Browser View, a table
will be shown that includes all variants and the added information/annotations (see figure 6.41).
Note! We do not recommend that any of the produced files are deleted individually as some of
them are linked to other outputs. Please always delete all of them at the same time.
6.7
Identify and Annotate Variants (WES)
The "Identify and Annotate Variants" tool should be used to identify and annotate variants in one
sample. The tool consists of a workflow that is a combination of the "Identify Variants" and the
"Annotate Variants" workflows.
The tool runs an internal workflow, which starts with mapping the sequencing reads to the
human reference sequence. Then it runs a local realignment to improve the subsequent variant
detection. After the variants have been detected, they are annotated with gene
names, amino acid changes, conservation scores, information from clinically relevant variants
present in the COSMIC and ClinVar databases, and information from common variants present in
the common dbSNP, HapMap, and 1000 Genomes databases. Furthermore, a detailed mapping
report or a targeted region report (whole exome and targeted amplicon analysis) is created to
inspect the overall coverage and mapping specificity.
Import your targeted regions
A file with the genomic regions targeted by the amplicon or hybridization kit is available from the
vendor of the enrichment kit and sequencing machine. To obtain this file you will have to get in
contact with the vendor and ask them to send this target regions file to you. You will get the file
in either .bed or .gff format.
Figure 6.41: Genome Browser View with an open overview variant track with information about if
the variant has been detected or not, the identified zygosity, if the coverage was sufficient at this
position and the observed allele frequency.
To import the file:
Go to the toolbar | Import ( ) | Tracks ( )
How to run the "Identify and Annotate Variants" ready-to-use workflow
1. Go to the toolbox and double-click on the "Identify and Annotate Variants" ready-to-use
workflow (figure 6.42).
Figure 6.42: The ready-to-use workflows are found in the toolbox.
This will open the wizard shown in figure 6.43 where you can select the sequencing reads
from the sample that should be analyzed.
Figure 6.43: Please select all sequencing reads from the sample to be analyzed.
If several samples should be analyzed, the tool has to be run in batch mode. This is done
by ticking "Batch" at the bottom of the wizard (as shown in figure 6.43) and
selecting the folder that holds the data you wish to analyse. If you have your sequencing data
in separate folders, you should choose to run the analysis in batch mode.
When you have selected the sample(s) you wish to analyze, click on the button labeled
Next.
2. In the next wizard step (figure 6.44) you can select the population from the 1000 Genomes
project that you would like to use for annotation.
Figure 6.44: Select the population from the 1000 Genomes project that you would like to use for
annotation.
3. In the next wizard step (figure 6.45) you can select the target region track and specify the
minimum read coverage that should be present in the targeted regions.
4. Click on the button labeled Next, which will take you to the next wizard step (figure 6.46).
In this dialog, you have to specify the parameters for the variant detection.
For a description of the different parameters that can be adjusted in the variant detection
step, we refer to the description of the "Low Frequency Variant Detection" tool
in the CLC Cancer Research Workbench user manual
(http://www.clcsupport.com/clccancerresearchworkbench/current/index.php?manual=Low_Frequency_Variant_Detection.html).
As general filters are applied to the different variant detectors that are available in
CLC Cancer Research Workbench, the description of the filters
is found in a separate section called "Filters" (see
http://www.clcsupport.com/clccancerresearchworkbench/current/index.php?manual=Filters.html). If
you click on "Locked Settings", you will be able to see all parameters used for variant
detection in the ready-to-use workflow.
Figure 6.45: Select the track with targeted regions from your experiment.
Figure 6.46: Specify the parameters for variant calling.
5. Click on the button labeled Next, which will take you to the next wizard step (figure 6.47). In
this dialog you can specify the target regions track. The variants found outside the targeted
region will be removed at this step in the workflow.
6. Click on the button labeled Next, which will take you to the next wizard step (figure 6.48).
Once again, select the relevant population from the 1000 Genomes project. This will add
information from the 1000 Genomes project to your variants.
7. Click on the button labeled Next, which will take you to the next wizard step (figure 6.49). At
this step you can select a population from the HapMap database. This will add information
from the HapMap database to your variants.
8. In this wizard step (figure 6.50) you get the chance to check the selected settings by clicking
on the button labeled Preview All Parameters. In the Preview All Parameters wizard you
can only check the settings; it is not possible to make any changes at this point.
Figure 6.47: In this wizard step you can specify the target regions track. Variants found outside
these regions will be removed.
Figure 6.48: Select the relevant population from the 1000 Genomes project. This will add
information from the 1000 Genomes project to your variants.
9. Choose to Save your results and press Finish.
Note! If you choose to open the results, the results will not be saved automatically. You
can always save the results at a later point.
Output from the Identify and Annotate Variants workflow
The "Identify and Annotate Variants" tool produces several outputs.
Please do not delete any of the produced files alone as some of them are linked to other outputs.
Please always delete all of them at the same time.
A good place to start is to take a look at the mapping report to see whether the coverage is
sufficient in the regions of interest (e.g. > 30x). Furthermore, please check that at least 90%
of the reads are mapped to the human reference sequence. In case of a targeted experiment,
please also check that the majority of the reads are mapping to the targeted region.
Next, open the Genome Browser View file (see figure 6.51).
The Genome Browser View includes a track of the identified annotated variants in the context of
the human reference sequence, genes, transcripts, coding regions, targeted regions, mapped
sequencing reads, clinically relevant variants in the COSMIC and ClinVar databases as well as
common variants in the common dbSNP, HapMap, and 1000 Genomes databases.
Figure 6.49: Select a population from the HapMap database. This will add information from the
HapMap database to your variants.
Figure 6.50: Check the settings and save your results.
To see the level of nucleotide conservation (from a multiple alignment with many vertebrates) in
the region around each variant, a track with conservation scores is added as well.
By double-clicking on the annotated variant track in the Genome Browser View, a table will be
shown that includes all variants and the added information/annotations (see figure 6.52).
The added information will help you to identify candidate variants for further research. For
example, known cancer-associated variants (present in the COSMIC database) or variants
known to play a role in drug response or other clinically relevant phenotypes (present in the ClinVar
database) can easily be seen.
Variants not found in COSMIC or ClinVar can, for example, be prioritized based on amino
acid changes (do they cause any changes on the amino acid level?). A high conservation level
at the position of the variant between many vertebrates or mammals can also be a hint that this
region could have an important functional role, and variants with a conservation score of more
than 0.9 (PhastCons score) should be prioritized higher. Further filtering of the variants based
on their annotations can be done using the table filter at the top of the table.
Figure 6.51: Genome Browser View to inspect identified variants in the context of the human
genome and external databases.
Figure 6.52: Genome Browser View with an open track table to inspect identified somatic variants
more closely in the context of the human genome and external databases.
If you wish to always apply the same filter criteria, the "Create new Filter Criteria" tool should be
used to specify this filter and the "Identify and Annotate" workflow should be extended by the
"Identify Candidate Tool" (configured with the Filter Criterion). See the reference manual for more
information on how preinstalled workflows can be edited.
Please note that in case none of the variants are present in COSMIC, ClinVar or dbSNP, the
corresponding annotation column headers are missing from the result.
If you would like to change the databases or the database versions used, please use the
"Data Management" function.
Chapter 7
Targeted amplicon sequencing (TAS)
Contents
7.1
Automatic analysis of sequencing data (TAS) . . . . . . . . . . . . . . . . . . 112
7.2
Identify Variants (TAS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.3
Annotate Variants (TAS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.4
Filter Somatic Variants (TAS) . . . . . . . . . . . . . . . . . . . . . . . . . . 122
7.5
Identify Somatic Variants from Tumor Normal Pair (TAS) . . . . . . . . . . . . 126
7.5.1
Import your targeted regions . . . . . . . . . . . . . . . . . . . . . . . . 127
7.5.2
How to run the "Identify Somatic Variants from Tumor Normal Pair"
ready-to-use workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
7.6
Identify Known Variants in One Sample (TAS) . . . . . . . . . . . . . . . . . . 131
7.6.1
Import your known variants . . . . . . . . . . . . . . . . . . . . . . . . . 132
7.6.2
Import your targeted regions . . . . . . . . . . . . . . . . . . . . . . . . 132
7.6.3
How to run the "Identify Known Variants in One Sample" ready-to-use
workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
7.6.4
Output from the Identify Known Variants in One Sample . . . . . . . . . . 135
7.7
Identify and Annotate Variants (TAS) . . . . . . . . . . . . . . . . . . . . . . 137
Targeted sequencing, also known as "targeted resequencing" or "amplicon sequencing", is
a focused approach to genome sequencing with only selected areas of the genome being
sequenced. In cancer research and diagnostics, targeted sequencing is usually based on
sequencing panels that target a number of known cancer-associated genes.
7.1
Automatic analysis of sequencing data (TAS)
Six ready-to-use workflows are available for analysis of targeted amplicon sequencing data. The
concept of the pre-installed ready-to-use workflows is that read data are used as input at one
end of the workflow, and at the other end you get a track-based genome browser
view and a table with all the identified variants, which may or may not have been subjected to
different kinds of filtering and/or annotation.
In this chapter we will discuss what the individual ready-to-use workflows can be used for and go
through step by step how to run the workflows.
Note! Often you will have to prepare data with one of the two Preparing Raw Data workflows
described in section 4 before you proceed to Automatic analysis of sequencing data (TAS).
7.2
Identify Variants (TAS)
The "Identify Variants" tool takes sequencing reads as input and returns identified variants as
part of a Genome Browser View.
The tool runs an internal workflow, which starts with mapping the sequencing reads to the human
reference sequence. It then runs a local realignment to improve the subsequent variant detection.
Finally, variants with an average base quality below 20 are filtered
away.
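The final filtering step can be pictured as a simple threshold on one annotation per variant. The sketch below is an illustration only and does not reproduce the tool's internal implementation; the dictionary layout and the key name `average_quality` are assumptions.

```python
def filter_by_average_quality(variants, min_average_quality=20.0):
    """Keep only variants whose average base quality reaches the threshold."""
    return [v for v in variants if v["average_quality"] >= min_average_quality]


# Hypothetical variant records; only the quality value matters for this filter.
variants = [
    {"position": 1042, "average_quality": 34.5},
    {"position": 2210, "average_quality": 19.9},
    {"position": 3377, "average_quality": 20.0},
]
kept = filter_by_average_quality(variants)
print([v["position"] for v in kept])
```

A variant with an average base quality of exactly 20 is kept here, since only values smaller than 20 are removed.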
In addition, a targeted region report is created to inspect the overall coverage and mapping
specificity in the targeted regions.
Import your targeted regions
A file with the genomic regions targeted by the amplicon or hybridization kit will be provided by
the vendor. To obtain this file you will have to get in contact with the vendor and ask them to
send this target regions file to you. You will get it in either .bed or .gff format.
Please use the Tracks import as part of the Import tool in the toolbar to import your file into the
Cancer Research Workbench.
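Before importing a target regions file it can be useful to check it outside the Workbench. The following is a minimal sketch for the .bed case, assuming a standard BED file in which the first three whitespace-separated columns are chromosome, 0-based start, and end (exclusive). The function names are hypothetical and the code is unrelated to the Workbench importer.

```python
def parse_bed(lines):
    """Return a list of (chromosome, start, end) tuples from BED-formatted lines."""
    regions = []
    for line in lines:
        line = line.strip()
        # Skip blank lines, comments, and optional header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if end <= start:
            raise ValueError(f"invalid region: {line}")
        regions.append((chrom, start, end))
    return regions


def total_target_size(regions):
    """Sum of region lengths in bases; overlapping regions are counted twice."""
    return sum(end - start for _, start, end in regions)


example = ["track name=panel", "chr1\t100\t200\tamplicon_1", "", "chr2\t0\t50"]
regions = parse_bed(example)
print(len(regions), "regions,", total_target_size(regions), "bases")
```

To check a real file, pass an open file handle instead of the example list, for instance `parse_bed(open(path))` with `path` pointing to the file received from the vendor.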
How to run the "Identify Variants" ready-to-use workflow
1. Go to the toolbox and double-click on the "Identify Variants" ready-to-use workflow (figure 7.1).
Figure 7.1: The ready-to-use workflows are found in the toolbox.
This will open the wizard shown in figure 7.2 where you can select the sequencing reads
from the sample, which should be analyzed.
Please select all sequencing reads from your sample. If several samples should be
analyzed, the tool has to be run in batch mode. This is done by ticking "Batch" at the
bottom of the wizard (as shown in figure 7.43) and selecting the folder that
holds the data you wish to analyze. If you have your sequencing data in separate folders,
you should choose to run the analysis in batch mode.
When you have selected the sample(s) you wish to analyze, click on the button labeled
Next.
2. In the next wizard step (figure 7.3) you have to specify the track with the targeted regions
from the experiment. You can also specify the minimum read coverage, which should be
present in the targeted regions.
Figure 7.2: Please select all sequencing reads from the sample to be analyzed.
Figure 7.3: Select the track with the targeted regions from your experiment.
3. Click on the button labeled Next, which will take you to the next wizard step (figure 7.4). In
this wizard you can specify the parameters for detecting variants.
4. Click on the button labeled Next, which will take you to the next wizard step (figure 7.5).
5. Click on the button labeled Next to go to the last wizard step (figure 7.6).
In this wizard you get the chance to check the selected settings by clicking on the button
labeled Preview All Parameters. In the Preview All Parameters wizard step you can only
check the settings; it is not possible to make any changes at this point. At the bottom of
this wizard there are two buttons regarding export functions; one button allows specification
of the export format, and the other button (the one labeled "Export Parameters") allows
specification of the export destination. When selecting an export location, you will export
the analysis parameter settings that were specified for this specific experiment.
6. Click on the button labeled OK to go back to the previous wizard step and choose Save.
Note! If you choose to open the results, the results will not be saved automatically. You
can always save the results at a later point.
Output from the Identify Variants workflow
The "Identify Variants" tool produces five different types of output:
1. Read Mapping The mapped sequencing reads. The reads are shown in different
colors depending on their orientation, whether they are single reads or paired reads,
and whether they map unambiguously. For the color codes please see the description of
sequence colors in the CLC Genomics Workbench manual that can be found here:
http://www.clcsupport.com/clcgenomicsworkbench/current/index.php?manual=View_settings_in_Side_Panel.html.
Figure 7.4: Please specify the parameters for variant detection.
Figure 7.5: Select the targeted region track. Variants found outside the targeted region will be
removed.
2. Target Regions Coverage The target regions coverage track shows the coverage of the
targeted regions. Detailed information about coverage and read count can be found in the
table format, which can be opened by pressing the table icon found in the lower left corner
of the View Area.
3. Target Regions Coverage Report The report consists of a number of tables and graphs
that in different ways provide information about the targeted regions.
4. Identified Variants A variant track holding the identified variants. The variants can
be shown in track format or in table format. When holding the mouse over the detected
variants in the Genome Browser view a tooltip appears with information about the individual
variants. You will have to zoom in on the variants to be able to see the detailed tooltip.
Figure 7.6: Choose to save the results. In this wizard step you get the chance to preview the
settings used in the ready-to-use workflow.
5. Genome Browser View Identify Variants A collection of tracks presented together.
Shows the annotated variants track together with the human reference sequence, genes,
transcripts, coding regions, the mapped reads, the identified variants, and the structural
variants (see figure 7.7).
It is important that you do not delete any of the produced files individually as some of the outputs
are linked to other outputs. If you would like to delete the outputs, please always delete all of
them at the same time.
Please first have a look at the mapping report to see if the coverage is sufficient in regions of
interest (e.g. > 30x). Furthermore, please check that at least 90% of reads are mapped to the
human reference sequence. In case of a targeted experiment, please also check that the majority
of reads are mapping to the targeted region.
Afterwards please open the Genome Browser View file (see figure 7.7).
The Genome Browser View includes the track of identified variants in the context of the human
reference sequence, genes, transcripts, coding regions, targeted regions and mapped sequencing
reads.
By double-clicking on the variant track in the Genome Browser View, a table will be shown which
includes information about all identified variants (see figure 7.8).
If you would like to change the reference sequence used for mapping or the human genes,
please use the "Data Management" function.
7.3
Annotate Variants (TAS)
Using a variant track (e.g. the output from the Identify Variants ready-to-use workflow),
the Annotate Variants (TAS) ready-to-use workflow runs an "internal" workflow that adds the
following annotations to the variant track:
• Gene names Adds names of genes whenever a variant is found within a known gene.
• mRNA Adds names of mRNA whenever a variant is found within a known transcript.
• CDS Adds names of CDS whenever a variant is found within a coding sequence.
• Amino acid changes Adds information about amino acid changes caused by the variants.
• Information from COSMIC Adds information from the "Catalogue of Somatic Mutations in
Cancer" database.
• Information from ClinVar Adds information about the relationships between human variations and their clinical significance.
• Information from dbSNP Adds information from the "Single Nucleotide Polymorphism
Database", which is a general catalog of genome variation, including SNPs, multinucleotide
polymorphisms (MNPs), insertions and deletions (InDels), and short tandem repeats (STRs).
• PhastCons Conservation scores The conservation scores, in this case generated from
a multiple alignment with a number of vertebrates, describe the level of nucleotide
conservation in the region around each variant.
Figure 7.7: The Genome Browser View allows you to inspect the identified variants in the context
of the human genome.
1. Go to the toolbox and select the Annotate Variants (TAS) workflow. In the first wizard step,
select the input variant track (figure 7.9).
2. Click on the button labeled Next. The only parameter that should be specified by the
user is which 1000 Genomes population to use (figure 7.10). This can be done using the
drop-down list found in this wizard step. Please note that the populations available from
the drop-down list can be specified with the Data Management function found in the
top right corner of the Workbench (see section 3.1.4).
3. Click on the button labeled Next to go to the last wizard step (figure 7.11).
In this wizard step you can check the selected settings by clicking on the button labeled
Preview All Parameters. In the Preview All Parameters wizard you can only check the
settings; it is not possible to make any changes at this point.
Figure 7.8: Genome Browser View with an open track table to inspect identified variants more
closely in the context of the human genome.
Figure 7.9: Select the variant track to annotate.
4. Choose to Save your results and click on the button labeled Finish.
Two types of output are generated:
1. Annotated Variants Annotation track showing the variants. Hold the mouse over one
of the variants, or right-click on the variant, and a tooltip will appear with detailed information
about the variant.
2. Genome Browser View Annotated Variants A collection of tracks presented together.
Shows the annotated variants track together with the human reference sequence, genes,
transcripts, coding regions, and variants detected in dbSNP, ClinVar, COSMIC, 1000
Genomes, and PhastCons conservation scores (see figure 7.12).
Figure 7.10: Select the relevant 1000 Genomes population(s).
Figure 7.11: Check the settings and save your results.
Note! Please be aware that if you delete the annotated variant track, this track will also
disappear from the genome browser view.
It is possible to add tracks to the Genome Browser View such as mapped sequencing reads as
well as other tracks. This can be done by dragging the track directly from the Navigation Area to
the Genome Browser View.
If you double-click on the name of the annotated variant track in the left hand side of the Genome
Browser View, a table that includes all variants and the added information/annotations will open
(see figure 7.13). The table and the Genome Browser View are linked; if you click on an entry in
the table, this particular position in the genome will automatically be brought into focus in the
Genome Browser View.
You may be met with a warning as shown in figure 7.14. This is simply a warning telling you that
it may take some time to create the table if you are working with tracks containing large amounts
of annotations. Please note that in case none of the variants are present in COSMIC, ClinVar or
dbSNP, the corresponding annotation column headers are missing from the result.
Figure 7.12: The output from the "Annotate Variants" ready-to-use workflow is a genome browser
view (a track list) containing individual tracks for all added annotations.
Adding information from other sources may help you identify interesting candidate variants for
further research. For example, known cancer-associated variants (present in the COSMIC database) or
variants known to play a role in drug response or other clinically relevant phenotypes (present in
the ClinVar database) can easily be identified. Further, variants not found in the COSMIC and/or
ClinVar databases can be prioritized based on amino acid changes in case the variant causes
changes on the amino acid level.
A high conservation level between different vertebrates or mammals, in the region containing the
variant, can also be used to give a hint about whether a given variant is found in a region with an
important functional role. If you would like to use the conservation scores to identify interesting
variants, we recommend that variants with a conservation score of more than 0.9 (PhastCons
score) are prioritized over variants with lower conservation scores.
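If the annotated variant table is exported, the recommended prioritization can be expressed as a sort. The sketch below is one possible reading of the recommendation and not a Workbench function. The record layout and the key name `conservation_score` are assumptions, and variants without a score are treated as having a score of 0.

```python
def prioritize_by_conservation(variants, threshold=0.9):
    """Sort variants so that those scoring above the threshold come first.

    Within each group, variants are ordered by descending conservation score.
    """
    def sort_key(variant):
        score = variant.get("conservation_score")
        if score is None:
            score = 0.0  # assumption: a missing score counts as not conserved
        return (score <= threshold, -score)

    return sorted(variants, key=sort_key)


# Hypothetical records exported from the annotated variant table.
variants = [
    {"name": "a", "conservation_score": 0.50},
    {"name": "b", "conservation_score": 0.95},
    {"name": "c", "conservation_score": None},
    {"name": "d", "conservation_score": 0.91},
]
for variant in prioritize_by_conservation(variants):
    print(variant["name"], variant["conservation_score"])
```

Sorting instead of discarding keeps the less conserved variants available for later inspection.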
It is possible to filter variants based on their annotations. This type of filtering can be facilitated
using the table filter found at the top part of the table. If you are performing multiple experiments
where you would like to use the exact same filter criteria, you can create a filter that can be
saved and reused. To do this:
Toolbox | Identify Candidate Variants | Create Filter Criteria
This tool can be used to specify the filter and the Annotate Variants workflow should be
extended by the Identify Candidate Tool (configured with the Filter Criterion). The CLC Cancer
Research Workbench reference manual has a chapter that describes this in detail
(http://clccancer.com/software/#downloads, see chapter: "Workflows" for more information
on how pre-installed workflows can be extended and/or edited).
Figure 7.13: The output from the "Annotate Variants" ready-to-use workflow is a genome browser
view (a track list). The information is also available in table view. Click on the small table icon to
open the table view. If you hold down the "Ctrl" key while clicking on the table icon, you will open a
split view showing both the genome browser view and the table view.
Figure 7.14: Warning that appears when you work with tracks containing many annotations.
Note! Sometimes the databases (e.g. COSMIC) are updated with a newer version, or maybe you
have your own version of the database. In such cases you may wish to change one of the used
databases. This can be done with the "Data Management" function, which is described in section
3.1.4.
7.4
Filter Somatic Variants (TAS)
If you are analyzing a list of variants that have been detected in a tumor or blood sample
where no control sample is available from the same patient, you can use the "Filter Somatic
Variants (TAS)" ready-to-use workflow to identify potential somatic variants. The purpose of this
ready-to-use workflow is to use publicly available (or your own) databases, with common variants
in a population, to extract potential somatic variants whenever no control/normal sample from
the same patient is available.
The "Filter Somatic Variants (TAS)" ready-to-use workflow accepts variant tracks (e.g. the
output from the Identify Variants ready-to-use workflow) as input. Variants that are identical to the
human reference sequence are first filtered away, then variants outside the targeted region are
removed, and lastly, variants found in the Common dbSNP, 1000 Genomes Project, and HapMap
databases are deleted. These databases are assumed not to contain relevant somatic
variants.
Please note that this tool will likely also remove inherited cancer variants that are present at a
low percentage in a population.
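The three removal steps described above can be illustrated with a short script. This is a simplified sketch under stated assumptions and not the workflow's actual implementation: variants are represented as (chromosome, position, reference, allele) tuples with 0-based positions, target regions as half-open (chromosome, start, end) tuples, and the common-variant databases as a single set of variant tuples. All names are hypothetical.

```python
def in_targets(chrom, pos, targets):
    """True if a 0-based position falls inside any half-open (chrom, start, end) region."""
    return any(c == chrom and start <= pos < end for c, start, end in targets)


def filter_somatic_candidates(variants, targets, common_variants):
    """Apply the three removal steps in the order described in the text."""
    candidates = []
    for variant in variants:
        chrom, pos, ref, allele = variant
        if allele == ref:
            continue  # identical to the reference sequence
        if not in_targets(chrom, pos, targets):
            continue  # outside the targeted regions
        if variant in common_variants:
            continue  # present in a database of common variants
        candidates.append(variant)
    return candidates


variants = [
    ("chr1", 150, "A", "A"),
    ("chr1", 160, "C", "T"),
    ("chr1", 500, "G", "A"),
    ("chr1", 170, "G", "C"),
]
targets = [("chr1", 100, 200)]
common = {("chr1", 170, "G", "C")}
print(filter_somatic_candidates(variants, targets, common))
```

As noted above, a filter of this kind also removes inherited variants that happen to be listed in the population databases, so the result should be read as a list of candidates only.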
Next, the remaining somatic variants are annotated with gene names, amino acid changes,
conservation scores and information from COSMIC (database with known variants in cancer),
ClinVar (known variants with medical impact) and dbSNP (all known variants).
To run the Filter Somatic Variants tool, go to:
Toolbox | Ready-to-Use Workflows | Targeted Amplicon Sequencing | Filter Somatic Variants
1. Double-click on the Filter Somatic Variants tool to start the analysis. If you are connected
to a server, you will first be asked where you would like to run the analysis. Next, you will
be asked to select the variant track you would like to use for filtering somatic variants.
The panel in the left side of the wizard shows the kind of input that should be provided
(figure 7.15). Select by double-clicking on the variant track name or clicking once on the file
and then clicking on the arrow pointing to the right side in the middle of the wizard.
Figure 7.15: Select the variant track from which you would like to filter somatic variants.
Click on the button labeled Next.
2. In the next step you will be asked to specify which of the 1000 Genomes populations
should be used for annotation (figure 7.16).
Figure 7.16: Specify which 1000 Genomes population to use for annotation.
Click on the button labeled Next.
3. In this wizard step, you are asked to supply a track containing the targeted regions
(figure 7.17). Select the track by clicking on the folder icon in the wizard.
Figure 7.17: Select your target regions track.
Click on the button labeled Next.
4. The next wizard step will once again allow you to specify the 1000 Genomes population
that should be used, this time for filtering out variants found in the 1000 Genomes project
(figure 7.18).
Figure 7.18: Specify which 1000 Genomes population to use for filtering out known variants.
Click on the button labeled Next.
5. The next wizard step (figure 7.19) concerns removal of variants found in the HapMap
database. Select the population you would like to use from the drop-down list. Please
note that the populations available from the drop-down list can be specified with the Data
Management function found in the top right corner of the Workbench (see section
3.1.4).
Figure 7.19: Specify which HapMap population to use for filtering out known variants.
6. Click on the button labeled Next to go to the last wizard step (shown in figure 7.20).
Figure 7.20: Check the selected parameters by pressing "Preview All Parameters".
Pressing the button Preview All Parameters allows you to preview all parameters. At this
step you can only view the parameters; it is not possible to make any changes. Choose to
save the results and click on the button labeled Finish.
Two types of output are generated:
1. Somatic Candidate Variants Track that holds the variant data. This track is also included
in the Genome Browser View. If you hold down the Ctrl key (Cmd on Mac) while clicking on
the table icon in the lower left side of the View Area, you can open the table view in split
view. The table and the variant track are linked together, and when you click on a row in
the table, the track view will automatically bring this position into focus.
2. Genome Browser View Filter Somatic Variants A collection of tracks presented together.
Shows the somatic candidate variants together with the human reference sequence, genes,
transcripts, coding regions, and variants detected in ClinVar, COSMIC, 1000 Genomes, and
the PhastCons conservation scores (see figure 7.21).
Figure 7.21: The Genome Browser View showing the annotated somatic variants together with a
range of other tracks.
To see the level of nucleotide conservation (from a multiple alignment with many vertebrates)
in the region around each variant, a track with conservation scores is added as well. Mapped
sequencing reads as well as other tracks can be easily added to this Genome Browser View. By
double clicking on the annotated variant track in the Genome Browser View, a table will be shown
that includes all variants and the added information/annotations (see figure 7.22).
Adding information from other sources may help you identify interesting candidate variants for
further research. For example, known cancer-associated variants (present in the COSMIC database) or
variants known to play a role in drug response or other clinically relevant phenotypes (present in
the ClinVar database) can easily be identified. Further, variants not found in the COSMIC and/or
ClinVar databases can be prioritized based on amino acid changes in case the variant causes
changes on the amino acid level.
A high conservation level between different vertebrates or mammals, in the region containing the
variant, can also be used to give a hint about whether a given variant is found in a region with an
important functional role. If you would like to use the conservation scores to identify interesting
variants, we recommend that variants with a conservation score of more than 0.9 (PhastCons
score) are prioritized over variants with lower conservation scores.
It is possible to filter variants based on their annotations. This type of filtering can be facilitated
using the table filter found at the top part of the table. If you are performing multiple experiments
where you would like to use the exact same filter criteria, you can create a filter that can be
saved and reused. To do this:
Toolbox | Identify Candidate Variants | Create Filter Criteria
This tool can be used to specify the filter and the Annotate Variants workflow should be
extended by the Identify Candidate Tool (configured with the Filter Criterion). The CLC Cancer
Research Workbench reference manual has a chapter that describes this in detail
(http://clccancer.com/software/#downloads, see chapter: "Workflows" for more information
on how pre-installed workflows can be extended and/or edited).
Figure 7.22: The Genome Browser View showing the annotated somatic variants together with a
range of other tracks.
Note! Sometimes the databases (e.g. COSMIC) are updated with a newer version, or maybe you
have your own version of the database. In such cases you may wish to change one of the used
databases. This can be done with the "Data Management" function, which is described in section
3.1.4.
7.5
Identify Somatic Variants from Tumor Normal Pair (TAS)
The "Identify Somatic Variants from Tumor Normal Pair" ready-to-use workflow can be used to
identify potential somatic variants in a tumor sample when you also have a normal/control
sample from the same patient.
When running the "Identify Somatic Variants from Tumor Normal Pair" workflow, the reads are mapped
and the variants identified. An internal workflow removes germline variants that are found in the
mapped reads of the normal/control sample, and variants outside the target region are removed
as they are likely to be false positives due to non-specific mapping of sequencing reads. Next,
remaining variants are annotated with gene names, amino acid changes, conservation scores and
information from clinically relevant databases like COSMIC (known cancer associated variants)
and ClinVar (variants with clinically relevant association). Finally, information from dbSNP is
added to see which of the detected variants have been observed before and which are completely
new.
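The germline removal step can be thought of as a lookup of each tumor variant in the normal sample. The sketch below is a strong simplification and not the workflow's implementation. It assumes that the number of normal-sample reads supporting each variant has already been counted and stored in a dictionary; the function name and the `max_normal_reads` parameter are hypothetical.

```python
def remove_germline(tumor_variants, normal_read_support, max_normal_reads=0):
    """Keep tumor variants supported by at most max_normal_reads reads in the normal sample."""
    return [
        variant for variant in tumor_variants
        if normal_read_support.get(variant, 0) <= max_normal_reads
    ]


# Hypothetical (chromosome, position, reference, allele) tuples.
tumor_variants = [
    ("chr7", 140453136, "A", "T"),
    ("chr17", 7577120, "C", "T"),
    ("chr12", 25398284, "C", "A"),
]
# Number of reads in the normal sample that show the same allele.
normal_read_support = {
    ("chr7", 140453136, "A", "T"): 12,
    ("chr12", 25398284, "C", "A"): 1,
}
print(remove_germline(tumor_variants, normal_read_support))
print(remove_germline(tumor_variants, normal_read_support, max_normal_reads=1))
```

Allowing a small number of supporting reads in the normal sample is one way to tolerate sequencing errors. The setting that is appropriate depends on the coverage of the normal sample.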
7.5.1
Import your targeted regions
A file with the genomic regions targeted by the amplicon or hybridization kit is available from the
vendor of the enrichment kit and sequencing machine. To obtain this file you will have to get in
contact with the vendor and ask them to send this target regions file to you. You will get the file
in either .bed or .gff format.
To import the file:
Go to the toolbar | Import | Tracks
7.5.2
How to run the "Identify Somatic Variants from Tumor Normal Pair" ready-to-use
workflow
1. Go to the toolbox and double-click on the "Identify Somatic Variants from Tumor Normal
Pair" ready-to-use workflow (figure 7.23).
Figure 7.23: The ready-to-use workflows are found in the toolbox.
This will open the wizard shown in figure 7.24 where you can select the tumor sample
reads.
Figure 7.24: Select the tumor sample reads.
When you have selected the tumor sample reads click on the button labeled Next.
2. In the next wizard step (figure 7.25), please specify the normal sample reads.
Figure 7.25: Select the normal sample reads.
3. Click on the button labeled Next, which will take you to the next wizard step (figure 7.26).
Figure 7.26: Specify the settings for the variant detection.
4. Click on the button labeled Next, which will take you to the next wizard step (figure 7.27).
In this wizard step you can select your target regions track.
5. Click on the button labeled Next to specify the target regions track to be used in the
"Remove Variants Outside Targeted Regions" step (figure 7.28). The targeted region track
should be the same as the track you selected in the previous wizard step. Variants found
outside the targeted regions will not be included in the output that is generated with the
ready-to-use workflow.
Click on the button labeled Next.
6. Click on the button labeled Next to go to the step where you can adjust the settings for
removal of germline variants (figure 7.29).
7. Click on the button labeled Next and once again select the target region track (the same
track as you have already selected in previous wizard steps). This time you specify the track
to be used for quality control of the targeted sequencing, as this tool reports the performance
(enrichment and specificity) of a targeted re-sequencing experiment (figure 7.30).
In the next wizard step you can check the selected settings by clicking on the button labeled
Preview All Parameters (figure 7.31).
In the Preview All Parameters wizard you can only check the settings; it is not possible to
make any changes at this point. At the bottom of this wizard there are two buttons regarding
export functions; one button allows specification of the export format, and the other button
(the one labeled "Export Parameters") allows specification of the export destination. When
selecting an export location, you will export the analysis parameter settings that were
specified for this specific experiment.
Figure 7.27: Select your target region track.
Figure 7.28: Select your target region track.
8. Click on the button labeled OK to go back to the previous wizard step and choose Save.
Note! If you choose to open the results, the results will not be saved automatically. You
can always save the results at a later point.
Eight different outputs are generated:
1. Read Mapping Normal (
) The mapped sequencing reads for the normal sample. The
CHAPTER 7. TARGETED AMPLICON SEQUENCING (TAS)
130
Figure 7.29: Specify setting for removal of germline variants.
Figure 7.30: Select target region track.
reads are shown in different colors depending on their orientation, whether they are single
reads or paired reads, and whether they map unambiguously. For the color codes please
see the description of sequence colors in the CLC Genomics Workbench manual that can
be found here: http://www.clcsupport.com/clcgenomicsworkbench/current/
index.php?manual=View_settings_in_Side_Panel.html
2. Read Mapping Tumor ( ) The mapped sequencing reads for the tumor sample. The
reads are shown in different colors depending on their orientation, whether they are single
reads or paired reads, and whether they map unambiguously. For the color codes please
see the description of sequence colors in the CLC Genomics Workbench manual that can
be found here: http://www.clcsupport.com/clcgenomicsworkbench/current/index.php?manual=View_settings_in_Side_Panel.html.
3. Target Region Coverage Report Normal ( ) The report consists of a number of tables and
graphs that in different ways provide information about the mapped reads from the normal
sample.
4. Target Region Coverage Tumor ( ) A track showing the targeted regions. The table view
provides information about the targeted regions such as target region length, coverage,
regions without coverage, and GC content.
Figure 7.31: Check the parameters and save the results.
5. Target Region Coverage Report Tumor ( ) The report consists of a number of tables and
graphs that in different ways provide information about the mapped reads from the tumor
sample.
6. Variants ( ) A variant track holding the identified variants that are found in the targeted
regions. The variants can be shown in track format or in table format. When holding
the mouse over the detected variants in the Genome Browser view a tooltip appears with
information about the individual variants. You will have to zoom in on the variants to be
able to see the detailed tooltip.
7. Annotated Somatic Variants ( ) A variant track holding the identified and annotated
somatic variants. The variants can be shown in track format or in table format. When
holding the mouse over the detected variants in the Genome Browser view a tooltip appears
with information about the individual variants. You will have to zoom in on the variants to
be able to see the detailed tooltip.
8. Genome Browser View Tumor Normal Comparison ( ) A collection of tracks presented
together. Shows the annotated variants track together with the human reference sequence,
genes, transcripts, coding regions, the mapped reads for both normal and tumor, the
annotated somatic variants, information from the ClinVar and COSMIC databases, and
finally a track showing the conservation score (see figure 7.32).
7.6
Identify Known Variants in One Sample (TAS)
The "Identify Known Variants in One Sample" ready-to-use workflow is a combined data analysis
and interpretation ready-to-use workflow.
It should be used to check whether known variants specified by the user (e.g. known breast
cancer associated variants) are present or absent in a sample.
Please note that the ready-to-use workflow will not identify new variants.
The Identify Known Variants in One Sample ready-to-use workflow runs an internal workflow that
maps the sequencing reads to the human genome sequence and does a local realignment of the
mapped reads to improve the subsequent variant detection. Next, the variants specified by the user
are identified in the read mapping. Finally, the information already present on the known variants
is added to the results.
Figure 7.32: The Genome Browser View presents all the different data tracks together and makes
it easy to compare different tracks.
7.6.1
Import your known variants
To import your known variants into the Cancer Research Workbench, they must be in GVF
or VCF 4.1 format.
Please use the Tracks import, which is part of the Import tool in the toolbar, to import your file
into the Cancer Research Workbench.
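Before importing, it can be useful to confirm that a variant file declares the expected format
version. The following is a minimal sketch and not part of the Workbench. It relies only on the
fact that a VCF file starts with a "##fileformat=" header line; the function name and the file
content are made up for illustration.

```python
import io


def vcf_version(handle):
    """Return the version declared on the first line of a VCF file, or None."""
    first_line = handle.readline().strip()
    prefix = "##fileformat=VCFv"
    if first_line.startswith(prefix):
        return first_line[len(prefix):]
    return None


# Hypothetical file content, for illustration only.
example = io.StringIO(
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "17\t41245466\t.\tG\tA\t.\t.\t.\n"
)
print(vcf_version(example))
```

For a real file, the same function can be called on an opened file handle instead of the
in-memory example.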
7.6.2
Import your targeted regions
A file with the genomic regions targeted by the amplicon or hybridization kit will be provided by
the vendor. To obtain this file you will have to get in contact with the vendor and ask them to
send this target regions file to you. You will get it in either .bed or .gff format.
Please use the Tracks import as part of the Import tool in the toolbar to import your file into the
Cancer Research Workbench.
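If you would like to inspect a .bed target regions file before importing it, the sketch below
shows one way to read it outside the Workbench. It assumes the standard BED layout (chromosome,
0-based start, exclusive end in the first three columns); the function name and the example
regions are hypothetical.

```python
def read_bed_regions(lines):
    """Parse BED lines into (chromosome, start, end) tuples.

    BED coordinates are 0-based with an exclusive end, so the length of a
    region is simply end - start.
    """
    regions = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and optional header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        regions.append((fields[0], int(fields[1]), int(fields[2])))
    return regions


# Hypothetical target regions, for illustration only.
example_bed = [
    "track name=targets",
    "chr1\t100\t250\tamplicon_1",
    "chr2\t500\t560\tamplicon_2",
]
regions = read_bed_regions(example_bed)
total_length = sum(end - start for _, start, end in regions)
print(len(regions), total_length)
```

The number of regions and their total length can then be compared with the figures stated by
the vendor of the kit.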
7.6.3
How to run the "Identify Known Variants in One Sample" ready-to-use workflow
1. Go to the toolbox and double-click on the "Identify Known Variants in One Sample"
ready-to-use workflow (figure 7.33).
Figure 7.33: The ready-to-use workflows are found in the toolbox.
This will open the wizard step shown in figure 7.34 where you can select the reads of the
sample, which should be tested for presence or absence of your known variants.
Figure 7.34: Select the sequencing reads from the sample you would like to test for your known
variants.
Please select all sequencing reads from your sample. If several samples should be
analyzed, the tool has to be run in batch mode. This is done by ticking "Batch" at the
bottom of the wizard (as shown in figure 7.34) and selecting the folder that holds the data
you wish to analyze. If you have your sequencing data in separate folders, you should
choose to run the analysis in batch mode.
When you have selected the sample(s) you wish to analyze, click on the button labeled
Next.
2. In the next wizard step you can select your target regions track and specify the minimum
coverage to be used when checking the quality of the targeted sequencing. The minimum
coverage will be used to provide the length of each target region that has at least this
coverage. You can also specify whether or not to ignore non-specific matches and broken
pairs. When these are applied, reads that are non-specifically mapped or belong to broken
pairs will be ignored (figure 7.35).
3. Click on the button labeled Next and specify the track with the known variants that
should be identified in your sample (figure 7.36). Furthermore, in this wizard step you can
specify the minimum read coverage for the position of the variant that should be identified.
If the coverage at the position of the variant is below this threshold, this is reported in the result.
Figure 7.35: Select your target regions track and specify the parameters to be used for checking
the quality of the targeted sequencing.
The parameter "Detection Frequency" is used twice in the calculation. First, it determines
whether a variant is reported as detected (observed frequency > specified frequency) or
not (observed frequency <= specified frequency). Second, it determines whether a variant
is labeled as heterozygous (frequency of another allele identified at the position of the
variant in the alignment > specified frequency) or homozygous (frequency of all other alleles
identified at the position of the variant in the alignment < specified frequency).
Figure 7.36: Specify the track with the known variants that should be identified.
4. Click on the button labeled Next, which will take you to the next wizard step (figure 7.37).
In this and the next dialog, you will be asked which of the annotations/information
added to the variants should be included in the results.
Please specify your track with known variants.
Figure 7.37: Please select the track with your known variants again. Annotations/information
from this track will be added to the overview mutation track.
5. Click on the button labeled Next and once again select the same track with known variants
(figure 7.38).
Figure 7.38: Once again select the track with known variants. This time the track is used to add
information to the detailed mutation track.
6. Click on the button labeled Next to go to the last wizard step (figure 7.39).
Figure 7.39: Check the settings and save your results.
In this wizard step you can check the selected settings by clicking on the button labeled
Preview All Parameters. In the Preview All Parameters wizard you can only check the
settings; it is not possible to make any changes at this point. At the bottom of this
wizard there are two buttons regarding export functions; one button allows specification
of the export format, and the other button (the one labeled "Export Parameters") allows
specification of the export destination. When selecting an export location, you will export
the analysis parameter settings that were specified for this specific experiment.
7. Click on the button labeled OK to go back to the previous dialog box and choose Save.
Note! If you choose to open the results, the results will not be saved automatically. You
can always save the results at a later point.
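The two uses of the "Detection Frequency" parameter described in step 3 can be sketched as
follows. This is an illustration of the stated comparisons only, not the Workbench's own
implementation. The function name is hypothetical, and two points that the description leaves
open are marked as assumptions in the comments.

```python
def classify_variant(variant_freq, other_allele_freqs, detection_frequency):
    """Return (detected, zygosity) for one known variant position.

    All frequencies must use the same unit (fractions or percentages).
    """
    detected = variant_freq > detection_frequency
    if not detected:
        # Assumption: zygosity is left undefined for undetected variants.
        return False, None
    # Heterozygous if any other allele at the position exceeds the threshold.
    if any(freq > detection_frequency for freq in other_allele_freqs):
        return True, "heterozygous"
    # Assumption: a frequency exactly equal to the threshold counts as homozygous.
    return True, "homozygous"


# Hypothetical frequencies with a detection frequency of 20%.
print(classify_variant(48.0, [51.0], 20.0))
print(classify_variant(97.0, [2.0], 20.0))
print(classify_variant(10.0, [90.0], 20.0))
```

Lowering the detection frequency makes more variants count as detected, but it also makes it
more likely that a second allele exceeds the threshold and the variant is labeled heterozygous.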
7.6.4
Output from the Identify Known Variants in One Sample workflow
The "Identify Known Variants in One Sample" tool produces seven different output types:
1. Read Mapping ( ) The mapped sequencing reads. The reads are shown in different
colors depending on their orientation, whether they are single reads or paired reads,
and whether they map unambiguously. For the color codes please see the description of
sequence colors in the CLC Genomics Workbench manual that can be found
here: http://www.clcsupport.com/clcgenomicsworkbench/current/index.php?manual=View_settings_in_Side_Panel.html.
2. Target Regions Coverage ( ) A track showing the targeted regions. The table view
provides information about the targeted regions such as target region length, coverage,
regions without coverage, and GC content.
3. Target Regions Coverage Report ( ) The report consists of a number of tables and graphs
that in different ways show e.g. the number, length, and coverage of the target regions and
provide information about the read count per GC%.
4. Overview Variants Detected ( ) Annotation track showing the known variants. The
table view provides information about the known variants. Four columns starting with the
sample name and followed by "Read Mapping coverage", "Read Mapping detection", "Read
Mapping frequency", and "Read Mapping zygosity" provide an overview of whether or not
the known variants have been detected in the sequencing reads.
5. Variants Detected in Detail ( ) Annotation track showing the known variants. Like
the "Overview Variants Detected" table, this table provides information about the known
variants. The difference between the two tables is that the "Variants Detected in Detail"
table includes detailed information about the most frequent alternative allele (MFAA).
6. Genome Browser View Identify Known Variants ( ) A collection of tracks presented
together. Shows the annotated variants track together with the human reference sequence,
genes, transcripts, coding regions, target regions coverage, the mapped reads, the overview
of the detected variants, and the variants detected in detail.
7. Log ( ) A log of the workflow execution.
It is a good idea to start looking at the Target Regions Coverage Report to see whether the
coverage is sufficient in the regions of interest (e.g. > 30x). Please also check that at least 90%
of the reads are mapped to the human reference sequence. In case of a targeted experiment,
we also recommend that you check that the majority of the reads are mapping to the targeted
region.
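The three checks above can be written down as a small checklist. The sketch below is only an
illustration: the values have to be read from the Target Regions Coverage Report by hand, the
function and parameter names are hypothetical, and interpreting "the majority of the reads" as
more than 50% is our assumption.

```python
def check_mapping_quality(mean_target_coverage, mapped_fraction, on_target_fraction,
                          min_coverage=30, min_mapped=0.90, min_on_target=0.50):
    """Return a list of warnings; an empty list means all checks passed."""
    warnings = []
    if mean_target_coverage <= min_coverage:
        warnings.append(f"coverage is not above {min_coverage}")
    if mapped_fraction < min_mapped:
        warnings.append(f"less than {min_mapped:.0%} of the reads are mapped")
    # Assumption: "the majority" is taken to mean more than min_on_target (50%).
    if on_target_fraction <= min_on_target:
        warnings.append("the majority of the reads do not map to the targeted region")
    return warnings


# Hypothetical values read from two reports.
print(check_mapping_quality(120, 0.97, 0.85))
print(check_mapping_quality(12, 0.80, 0.30))
```

The thresholds are keyword arguments so that they can be adapted to the requirements of a
particular experiment.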
When you have inspected the target regions coverage report you can open the Genome Browser
View Identify Known Variants file (see figure 7.40).
The Genome Browser View includes an overview track of the known variants and a detailed
result track presented in the context of the human reference sequence, genes, transcripts,
coding regions, targeted regions, mapped sequencing reads, and clinically relevant variants in
the COSMIC database.
Finally, a track with conservation scores has been added to be able to see the level of nucleotide
conservation (from a multiple alignment with many vertebrates) in the region around each variant.
The difference between the overview variant track and the detailed variant track is the annotations
added to the variants.
By double clicking on one of the annotated variant tracks in the Genome Browser View, a table
will be shown that includes all variants and the added information/annotations (see figure 7.41).
Note! We do not recommend deleting any of the produced files individually, as some of
them are linked to other outputs. Please always delete all of them at the same time.
Figure 7.40: Genome Browser View that allows inspection of the identified variants in the context
of the human genome and external databases.
7.7
Identify and Annotate Variants (TAS)
The "Identify and Annotate Variants" tool should be used to identify and annotate variants in one
sample. The tool consists of a workflow that is a combination of the "Identify Variants" and the
"Annotate Variants" workflows.
The tool runs an internal workflow, which starts with mapping the sequencing reads to the
human reference sequence. Then it runs a local realignment to improve the subsequent variant
detection. After the variants have been detected, they are annotated with gene
names, amino acid changes, conservation scores, information from clinically relevant variants
present in the COSMIC and ClinVar databases, and information from common variants present in
the common dbSNP, HapMap, and 1000 Genomes databases. Furthermore, a detailed mapping
report or a targeted region report (whole exome and targeted amplicon analysis) is created to
inspect the overall coverage and mapping specificity.
Import your targeted regions
A file with the genomic regions targeted by the amplicon or hybridization kit is available from the
vendor of the enrichment kit and sequencing machine. To obtain this file you will have to get in
contact with the vendor and ask them to send this target regions file to you. You will get the file
in either .bed or .gff format.
To import the file:
Go to the toolbar | Import ( ) | Tracks ( )
Figure 7.41: Genome Browser View with an open overview variant track with information about
whether the variant has been detected, the identified zygosity, whether the coverage was sufficient
at this position, and the observed allele frequency.
How to run the "Identify and Annotate Variants" ready-to-use workflow
1. Go to the toolbox and double-click on the "Identify and Annotate Variants" ready-to-use
workflow (figure 7.42).
This will open the wizard shown in figure 7.43 where you can select the sequencing reads
from the sample that should be analyzed.
If several samples should be analyzed, the tool has to be run in batch mode. This is done
by ticking "Batch" at the bottom of the wizard (as shown in figure 7.43) and selecting
the folder that holds the data you wish to analyze. If you have your sequencing data
in separate folders, you should choose to run the analysis in batch mode.
When you have selected the sample(s) you wish to analyze, click on the button labeled
Next.
2. In the next wizard step (figure 7.44) you can select the population from the 1000 Genomes
project that you would like to use for annotation.
Figure 7.42: The ready-to-use workflows are found in the toolbox.
Figure 7.43: Please select all sequencing reads from the sample to be analyzed.
Figure 7.44: Select the population from the 1000 Genomes project that you would like to use for
annotation.
3. In the next wizard step (figure 7.45) you can select the target region track and specify the
minimum read coverage that should be present in the targeted regions.
Figure 7.45: Select the track with targeted regions from your experiment.
4. Click on the button labeled Next, which will take you to the next wizard step (figure 7.46).
In this dialog, you have to specify the parameters for the variant detection.
For a description of the different parameters that can be adjusted in the variant
detection step, we refer to the description of the "Low Frequency Variant Detection" tool
in the CLC Cancer Research Workbench user manual (http://www.clcsupport.com/clccancerresearchworkbench/current/index.php?manual=Low_Frequency_Variant_Detection.html).
As general filters are applied to the different variant detectors that are available in
CLC Cancer Research Workbench, the description of the filters is found in a separate section
called "Filters" (see http://www.clcsupport.com/clccancerresearchworkbench/current/index.php?manual=Filters.html).
If you click on "Locked Settings", you will be able to see all parameters used for variant
detection in the ready-to-use workflow.
Figure 7.46: Specify the parameters for variant calling.
5. Click on the button labeled Next, which will take you to the next wizard step (figure 7.47). In
this dialog you can specify the target regions track. The variants found outside the targeted
region will be removed at this step in the workflow.
6. Click on the button labeled Next, which will take you to the next wizard step (figure 7.48).
Once again, select the relevant population from the 1000 Genomes project. This will add
information from the 1000 Genomes project to your variants.
Figure 7.47: In this wizard step you can specify the target regions track. Variants found outside
these regions will be removed.
Figure 7.48: Select the relevant population from the 1000 Genomes project. This will add
information from the 1000 Genomes project to your variants.
7. Click on the button labeled Next, which will take you to the next wizard step (figure 7.49). At
this step you can select a population from the HapMap database. This will add information
from the HapMap database to your variants.
Figure 7.49: Select a population from the HapMap database. This will add information from the
HapMap database to your variants.
8. In this wizard step (figure 7.50) you get the chance to check the selected settings by clicking
on the button labeled Preview All Parameters. In the Preview All Parameters wizard you
can only check the settings; it is not possible to make any changes at this point.
Figure 7.50: Check the settings and save your results.
9. Choose to Save your results and press Finish.
Note! If you choose to open the results, the results will not be saved automatically. You
can always save the results at a later point.
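The effect of step 5, where variants found outside the targeted regions are removed, can be
illustrated with a simple interval check. The sketch below is not the Workbench's implementation.
It assumes 0-based positions and regions with an exclusive end (as in a .bed file), and all names
and example values are hypothetical.

```python
def variants_in_targets(variants, regions):
    """Keep variants whose position lies inside at least one target region.

    variants: (chromosome, position) tuples with 0-based positions.
    regions:  (chromosome, start, end) tuples, 0-based with exclusive end.
    """
    kept = []
    for chrom, pos in variants:
        if any(chrom == region_chrom and start <= pos < end
               for region_chrom, start, end in regions):
            kept.append((chrom, pos))
    return kept


# Hypothetical target region and variant positions.
target_regions = [("chr1", 100, 250)]
found_variants = [("chr1", 120), ("chr1", 250), ("chr2", 120)]
print(variants_in_targets(found_variants, target_regions))
```

Only the first variant is kept: the second lies just past the exclusive end of the region, and
the third is on a chromosome without any target region.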
Output from the Identify and Annotate Variants workflow
The "Identify and Annotate Variants" tool produces several outputs.
Please do not delete any of the produced files individually, as some of them are linked to other
outputs. Please always delete all of them at the same time.
A good place to start is to take a look at the mapping report to see whether the coverage is
sufficient in the regions of interest (e.g. > 30x). Furthermore, please check that at least 90%
of the reads are mapped to the human reference sequence. In case of a targeted experiment,
please also check that the majority of the reads are mapping to the targeted region.
Next, open the Genome Browser View file (see figure 7.51).
The Genome Browser View includes a track of the identified annotated variants in the context of
the human reference sequence, genes, transcripts, coding regions, targeted regions, mapped
sequencing reads, clinically relevant variants in the COSMIC and ClinVar databases as well as
common variants in common dbSNP, HapMap, and 1000 Genomes databases.
To see the level of nucleotide conservation (from a multiple alignment with many vertebrates) in
the region around each variant, a track with conservation scores is added as well.
By double-clicking on the annotated variant track in the Genome Browser View, a table will be
shown that includes all variants and the added information/annotations (see figure 7.52).
The added information will help you to identify candidate variants for further research. For
example, known cancer-associated variants (present in the COSMIC database) or variants
known to play a role in drug response or other clinically relevant phenotypes (present in the
ClinVar database) can easily be seen.
Figure 7.51: Genome Browser View to inspect identified variants in the context of the human
genome and external databases.
Figure 7.52: Genome Browser View with an open track table to inspect identified somatic variants
more closely in the context of the human genome and external databases.
Variants not found in COSMIC or ClinVar can, for example, be prioritized based on amino
acid changes (do they cause any changes at the amino acid level?). A high conservation level
at the position of the variant across many vertebrates or mammals can also be a hint that this
region could have an important functional role, and variants with a conservation score of more
than 0.9 (PhastCons score) should be prioritized higher. The variants can be filtered further
based on their annotations using the table filter at the top of the table.
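The prioritization described above can be summarized as a simple ranking. The sketch below shows
one possible ordering, assuming the variant table has been exported to a list of records. The
field names, the tier numbers, and the example variants are hypothetical and do not correspond
to column names in the Workbench.

```python
def priority_tier(variant, min_conservation=0.9):
    """Return 1 (highest priority) to 4 (lowest) for an annotated variant."""
    if variant.get("in_cosmic") or variant.get("in_clinvar"):
        return 1  # known cancer-associated or clinically relevant variant
    if variant.get("amino_acid_change"):
        return 2  # causes a change at the amino acid level
    score = variant.get("conservation")
    if score is not None and score > min_conservation:
        return 3  # lies in a highly conserved region
    return 4


# Hypothetical annotated variants, for illustration only.
variants = [
    {"name": "v1", "conservation": 0.95},
    {"name": "v2", "in_cosmic": True},
    {"name": "v3", "amino_acid_change": "p.Val600Glu", "conservation": 0.2},
    {"name": "v4", "conservation": 0.1},
]
ranked = sorted(variants, key=priority_tier)
print([v["name"] for v in ranked])
```

Whether an amino acid change should rank above a high conservation score is a judgment call;
the order of the checks in the function can be changed to reflect a different preference.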
If you wish to always apply the same filter criteria, the "Create new Filter Criteria" tool should be
used to specify this filter and the "Identify and Annotate" workflow should be extended by the
"Identify Candidate Tool" (configured with the Filter Criterion). See the reference manual for more
information on how preinstalled workflows can be edited.
Please note that in case none of the variants are present in COSMIC, ClinVar or dbSNP, the
corresponding annotation column headers are missing from the result.
If you would like to change the databases or the database versions used, please use the
"Data Management" function.
Chapter 8
Whole Transcriptome Sequencing (WTS)
Contents
8.1 Automatic analysis of RNA-seq data
8.2 Analysis of multiple samples
8.3 Annotate Variants (WTS)
8.4 Compare variants in DNA and RNA
8.5 Identify Candidate Variants and Genes from Tumor Normal Pair
8.6 Identify variants and add expression values
8.7 Identify and Annotate Differentially Expressed Genes and Pathways
The technologies originally developed for next-generation DNA sequencing can also be applied to
deep sequencing of the transcriptome. This is done through cDNA sequencing and is called RNA
sequencing or simply RNA-seq.
One of the key advantages of RNA-seq is that the method is independent of prior knowledge
of the corresponding genomic sequences and therefore can be used to identify transcripts
from unannotated genes, novel splicing isoforms, and gene-fusion transcripts [Wang et al.,
2009, Martin and Wang, 2011]. Another strength is that it opens up studies of transcriptomic
complexities such as deciphering allele-specific transcription by the use of SNPs present in the
transcribed regions [Heap et al., 2010].
RNA-seq-based transcriptomic studies have the potential to increase the overall understanding of
the transcriptome. However, gaining access to the hidden information and making a meaningful
interpretation of the sequencing data relies heavily on the downstream bioinformatic
analysis.
In this chapter we will first discuss the initial steps in the data analysis that lie upstream of
the analysis using ready-to-use workflows. Next, we will look at what the individual ready-to-use
workflows can be used for and go through step by step how to run the workflows.
8.1
Automatic analysis of RNA-seq data
The CLC Cancer Research Workbench offers a range of different tools for RNA-seq analysis.
Currently five different ready-to-use workflows are available for analysis of RNA-seq data:
• Annotate Variants (WTS)
• Compare Variants in DNA and RNA
• Identify Candidate Variants and Genes from Tumor Normal Pair
• Identify and Annotate Differentially Expressed Genes and Pathways
• Identify Variants and Add Expression Values
The ready-to-use workflows can be found in the toolbox under Whole Transcriptome Sequencing
as shown in figure 8.1.
Figure 8.1: The RNA-seq ready-to-use workflows.
Note! Often you will have to prepare data with one of the two Preparing Raw Data workflows
described in section 4 before you proceed to the analysis of the RNA-seq sequencing data.
8.2
Analysis of multiple samples
To analyze differential expression in multiple samples, you need to tell the workbench how the
samples are related. This is done by setting up an experiment. The tool that can be used to do
this can be found here:
Toolbox | Tools | Transcriptomics Analysis ( ) | Set Up Experiment ( )
The output from the tool is an experiment, which essentially is a set of samples that are grouped.
When setting up the experiment, you define the relationship between the samples. This makes it
possible to do statistical analysis to investigate the differential expression between the groups.
The experiment is also used to accumulate calculations like t-tests and clustering because this
information is closely related to the grouping of the samples.
How to set up an experiment is described in detail in the CLC Cancer Research Workbench
reference manual under "Setting up an experiment" in Chapter "Transcriptomics Analysis".
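To illustrate why the grouping matters, the sketch below represents an experiment as a mapping
from group names to sample names and computes a t statistic for one gene. This is only a
conceptual illustration: all names and values are hypothetical, and Welch's t statistic is used
here as an example, which is not necessarily the test applied by the Workbench.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical experiment: sample names grouped by condition.
experiment = {
    "tumor": ["tumor_1", "tumor_2", "tumor_3"],
    "normal": ["normal_1", "normal_2", "normal_3"],
}

# Hypothetical expression values for one gene, keyed by sample name.
expression = {
    "tumor_1": 10.0, "tumor_2": 12.0, "tumor_3": 14.0,
    "normal_1": 4.0, "normal_2": 6.0, "normal_3": 8.0,
}


def welch_t(values_a, values_b):
    """Welch's t statistic for two groups of expression values."""
    standard_error = sqrt(variance(values_a) / len(values_a)
                          + variance(values_b) / len(values_b))
    return (mean(values_a) - mean(values_b)) / standard_error


# Collect the expression values per group, using the grouping of the experiment.
groups = {name: [expression[sample] for sample in samples]
          for name, samples in experiment.items()}
t_statistic = welch_t(groups["tumor"], groups["normal"])
print(round(t_statistic, 3))
```

Without the grouping there would be no way to decide which values belong on which side of the
comparison, which is why the experiment has to be set up before the statistical analysis.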
8.3
Annotate Variants (WTS)
Using a variant track ( ) (e.g. the output from the Identify Variants and Add Expression Values
ready-to-use workflow) the Annotate Variants (WTS) ready-to-use workflow runs an "internal"
workflow that adds the following annotations to the variant track:
• Gene names Adds names of genes whenever a variant is found within a known gene.
• mRNA Adds names of mRNA whenever a variant is found within a known transcript.
• CDS Adds names of CDS whenever a variant is found within a coding sequence.
• Amino acid changes Adds information about amino acid changes caused by the variants.
• Information from COSMIC Adds information from the "Catalogue of Somatic Mutations in
Cancer" database.
• Information from ClinVar Adds information about the relationships between human variations and their clinical significance.
• Information from dbSNP Adds information from the "Single Nucleotide Polymorphism
Database", which is a general catalog of genome variation, including SNPs, multinucleotide
polymorphisms (MNPs), insertions and deletions (InDels), and short tandem repeats (STRs).
• PhastCons Conservation scores The conservation scores, in this case generated from
a multiple alignment with a number of vertebrates, describe the level of nucleotide
conservation in the region around each variant.
1. Go to the toolbox and select the Annotate Variants (WTS) workflow. In the first wizard
step, select the input variant track (figure 8.2).
Figure 8.2: Select the variant track to annotate.
2. Click on the button labeled Next. The only parameter that should be specified by the user is
which 1000 Genomes population to use (figure 8.3). This can be done using the drop-down
list found in this wizard step. Please note that the populations available from the drop-down
list can be specified with the Data Management ( ) function found in the top right corner
of the Workbench (see section 3.1.4).
3. Click on the button labeled Next to go to the last wizard step (figure 8.4).
In this wizard step you can check the selected settings by clicking on the button labeled
Preview All Parameters. In the Preview All Parameters wizard you can only check the
settings; it is not possible to make any changes at this point.
4. Choose to Save your results and click on the button labeled Finish.
Two types of output are generated:
Figure 8.3: Select the relevant 1000 Genomes population(s).
Figure 8.4: Check the settings and save your results.
1. Annotated Variants ( ) Annotation track showing the variants. Hold the mouse over one
of the variants or right-click on the variant. A tooltip will appear with detailed information
about the variant.
2. Genome Browser View Annotated Variants ( ) A collection of tracks presented together.
Shows the annotated variants track together with the human reference sequence, genes,
transcripts, coding regions, and variants detected in dbSNP, ClinVar, COSMIC, 1000
Genomes, and PhastCons conservation scores (see figure 8.5).
Note! Please be aware that if you delete the annotated variant track, this track will also disappear
from the genome browser view.
It is possible to add tracks to the Genome Browser View such as mapped sequencing reads as
well as other tracks. This can be done by dragging the track directly from the Navigation Area to
the Genome Browser View.
If you double-click on the name of the annotated variant track in the left hand side of the Genome
Browser View, a table that includes all variants and the added information/annotations will open
(see figure 8.6). The table and the Genome Browser View are linked; if you click on an entry in
the table, this particular position in the genome will automatically be brought into focus in the
Genome Browser View.
Figure 8.5: The output from the "Annotate Variants" ready-to-use workflow is a genome browser
view (a track list) containing individual tracks for all added annotations.
You may be met with a warning as shown in figure 8.7. This is simply a warning telling you that it
may take some time to create the table if you are working with tracks containing large amounts
of annotations. Please note that in case none of the variants are present in COSMIC, ClinVar or
dbSNP, the corresponding annotation column headers are missing from the result.
Adding information from other sources may help you identify interesting candidate variants for
further research. For example, known cancer-associated variants (present in the COSMIC database)
or variants known to play a role in drug response or other clinically relevant phenotypes (present
in the ClinVar database) can easily be identified. Further, variants not found in the COSMIC and/or
ClinVar databases can be prioritized based on amino acid changes in case the variant causes
changes at the amino acid level.
Figure 8.6: The output from the "Annotate Variants" ready-to-use workflow is a genome browser
view (a track list). The information is also available in table view. Click on the small table icon to
open the table view. If you hold down the "Ctrl" key while clicking on the table icon, you will open a
split view showing both the genome browser view and the table view.
Figure 8.7: Warning that appears when you work with tracks containing many annotations.
A high conservation level between different vertebrates or mammals, in the region containing the
variant, can also be used to give a hint about whether a given variant is found in a region with an
important functional role. If you would like to use the conservation scores to identify interesting
variants, we recommend that variants with a conservation score of more than 0.9 (PhastCons
score) are prioritized over variants with lower conservation scores.
It is possible to filter variants based on their annotations. This type of filtering can be facilitated
using the table filter found at the top part of the table. If you are performing multiple experiments
where you would like to use the exact same filter criteria, you can create a filter that can be
saved and reused. To do this:
Toolbox | Identify Candidate Variants ( ) | Create Filter Criteria ( )
This tool can be used to specify the filter, and the Annotate Variants workflow should then be
extended by the Identify Candidate Tool (configured with the Filter Criterion). The CLC Cancer
Research Workbench reference manual has a chapter that describes this in detail
(http://clccancer.com/software/#downloads, see chapter: "Workflows" for more information
on how pre-installed workflows can be extended and/or edited).
Note! Sometimes the databases (e.g. COSMIC) are updated with a newer version, or you may
have your own version of the database. In such cases you may wish to change one of the
databases used. This can be done with the "Data Management" function, which is described in section
3.1.4.
8.4 Compare variants in DNA and RNA
Integrated analysis of genomic and transcriptomic sequencing data is a powerful tool that can
help increase our current understanding of human genomic variants. The Compare variants
in DNA and RNA ready-to-use workflow identifies variants in DNA and RNA and studies the
relationship between the identified genomic and transcriptomic variants.
To run the ready-to-use workflow:
Toolbox | Ready-to-Use Workflows | Whole Transcriptome Sequencing ( ) | Compare variants in DNA and RNA ( )
1. Double-click on the Compare variants in DNA and RNA ready-to-use workflow to start the
analysis. If you are connected to a server, you will first be asked where you would like to
run the analysis. Next, you will be asked to select the DNA reads that you would like to
analyze (figure 8.8). To select the DNA reads, double-click on the reads file name or click
once on the file and then on the arrow pointing to the right side in the middle of the wizard.
Click on the button labeled Next.
Figure 8.8: Select the DNA reads to analyze.
2. In the next step you can choose the RNA reads to analyze (see figure 8.9).
Figure 8.9: Select the RNA reads to analyze.
3. Click on the button labeled Next to go to the transcriptomic variant detection step (see
figure 8.10). For a description of the different parameters that can be adjusted in the variant
detection step, we refer to the description of the "Low Frequency Variant Detection" tool
in the CLC Cancer Research Workbench user manual
(http://www.clcsupport.com/clccancerresearchworkbench/current/index.php?manual=Low_Frequency_Variant_Detection.html).
As general filters are applied to the different variant detectors that are available in CLC Cancer
Research Workbench, the description of the filters is found in a separate section called "Filters" (see
http://www.clcsupport.com/clccancerresearchworkbench/current/index.php?manual=Filters.html).
If you click on "Locked Settings", you will be able to see all parameters used for variant
detection in the ready-to-use workflow.
Figure 8.10: Specify the parameters for transcriptomic variant detection.
4. The next two wizard steps are annotation steps where the transcriptomic variants are
annotated with information from known databases. The variants are in fact annotated with
a range of different data in this ready-to-use workflow, but only databases that provide data
from more than one population need to be specified by the user. This is the case for
HapMap and the 1000 Genomes Project. First, the variants are annotated with information
from the 1000 Genomes Project (see figure 8.11). From the drop-down list you can choose
the population that matches the population your samples are derived from. The drop-down
list shows the populations that were selected under "Data Management" as described
in the CLC Cancer Research Workbench user manual
(http://www.clcsupport.com/clccancerresearchworkbench/current/index.php?manual=Download_configure_reference_data.html).
Under "Locked settings" you can see that "Automatically join adjacent MNVs and SNVs"
has been selected. The reason for this is that many databases do not report a succession
of SNVs as one MNV as is the case for the CLC Cancer Research Workbench, and as a
consequence it is not possible to directly compare variants called with CLC Cancer Research
Workbench with these databases. In order to support filtering against these databases
anyway, the option to Automatically join adjacent SNVs and MNVs is enabled. This means
that an MNV in the experimental data will get an exact match, if a set of SNVs and MNVs
in the database can be combined to provide the same allele.
Note! This assumes that SNVs and MNVs in the track of known variants represent the
same allele, although there is no evidence for this in the track of known variants.
Figure 8.11: Select the relevant population from the drop-down list.
5. Click on the button labeled Next and do the same to annotate with information from
HapMap (figure 8.12).
Figure 8.12: Select the relevant population from the drop-down list.
6. Click on the button labeled Next to go to the genomic variant detection step (shown in
figure 8.13).
Figure 8.13: Specify the parameters for genomic variant detection.
7. Again, the next two wizard steps are annotation steps. This time the genomic variants are
annotated with information from known databases. First, the variants are annotated with
information from the 1000 Genomes Project (see figure 8.14).
Figure 8.14: Select the relevant population from the drop-down list.
8. Click on the button labeled Next and do the same to annotate the genomic variants with
information from HapMap (figure 8.15).
9. Click on the button labeled Next to go to the result handling step (figure 8.16).
Figure 8.15: Select the relevant population from the drop-down list.
Figure 8.16: Select the relevant population from the drop-down list.
Pressing the button Preview All Parameters allows you to preview all parameters. At
this step you can only view the parameters; it is not possible to make any changes (see
figure 8.17). Choose to save the results and click on the button labeled Finish.
10. Press OK, specify where to save the results, and then click on the button labeled Finish to
run the analysis.
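The effect of the "Automatically join adjacent MNVs and SNVs" setting described in step 4 can be sketched in a few lines of Python. This is not the Workbench's implementation; it is a minimal illustration in which variants are hypothetical (position, reference, allele) tuples and an MNV from the experimental data counts as an exact match when adjacent database variants can be combined to give the same allele.

```python
def has_exact_match(sample_variant, known_variants):
    """Return True if database variants can be combined to give the sample allele.

    Variants are (position, ref, alt) tuples with ref and alt of equal length
    (SNVs have length 1, MNVs are longer). As in the note above, this assumes
    that adjacent database variants sit on the same allele, which the database
    itself does not prove.
    """
    pos, ref, alt = sample_variant
    end = pos + len(ref)

    # Index the database variants by start position for quick lookup.
    by_start = {}
    for k_pos, k_ref, k_alt in known_variants:
        by_start.setdefault(k_pos, []).append((k_ref, k_alt))

    def tile(current):
        """Try to cover the sample variant from `current` to `end`."""
        if current == end:
            return True
        offset = current - pos
        for k_ref, k_alt in by_start.get(current, []):
            k_end = current + len(k_ref)
            if (k_end <= end
                    and ref[offset:offset + len(k_ref)] == k_ref
                    and alt[offset:offset + len(k_alt)] == k_alt
                    and tile(k_end)):
                return True
        return False

    return tile(pos)


# Two adjacent SNVs in the database, one MNV in the experimental data.
database = [(100, "A", "G"), (101, "C", "T")]
mnv_matches = has_exact_match((100, "AC", "GT"), database)
other_allele_matches = has_exact_match((100, "AC", "GA"), database)
```

Without joining, the MNV at position 100 would find no single database entry with the same allele, even though both of its bases are known variants.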
Figure 8.17: Preview all parameters. At this step it is not possible to introduce any changes; it is
only possible to view the settings.
Ten different output types are generated:
1. DNA Read Mapping ( ) The mapped DNA sequencing reads. The DNA sequencing reads
are shown in different colors depending on their orientation, whether they are single reads
or paired reads, and whether they map unambiguously. For the color codes please see
the description of sequence colors in the CLC Genomics Workbench manual that can
be found here:
http://www.clcsupport.com/clcgenomicsworkbench/current/index.php?manual=View_settings_in_Side_Panel.html.
2. DNA Mapping Report ( ) This report contains information about the reads, reference,
transcripts, and statistics. This is explained in more detail in the CLC Cancer Research
Workbench reference manual in section RNA-Seq report
(http://clcsupport.com/clccancerresearchworkbench/current/index.php?manual=RNA_Seq_report.html).
3. RNA Gene Expression ( ) A track showing gene expression annotations. Hold the mouse
over the track or right-click on it. If you have zoomed in to nucleotide level, a tooltip will
appear with information about e.g. gene name and expression values.
4. RNA Transcript Expression ( ) A track showing transcript expression annotations. Hold
the mouse over the track or right-click on it. A tooltip will appear with information about
e.g. gene name and expression values.
5. RNA Mapping Report ( ) This report contains information about the reads, reference,
transcripts, and statistics. This is explained in more detail in the CLC Cancer Research
Workbench reference manual in section RNA-Seq report
(http://clcsupport.com/clccancerresearchworkbench/current/index.php?manual=RNA_Seq_report.html).
6. RNA Read Mapping ( ) The mapped RNA-seq reads. The RNA-seq reads are shown in
different colors depending on their orientation, whether they are single reads or paired
reads, and whether they map unambiguously. For the color codes please see the
description of sequence colors in the CLC Genomics Workbench manual that can be found
here:
http://www.clcsupport.com/clcgenomicsworkbench/current/index.php?manual=View_settings_in_Side_Panel.html.
7. Variants Found in Both DNA and RNA ( ) This track shows only the variants that are
present in both DNA and RNA. With the table icon ( ) found in the lower left part of the
View Area it is possible to switch to table view. The table view provides details about the
variants such as type, zygosity, and information from a range of different databases.
8. All Variants Found in DNA or RNA ( ) This track shows all variants that have been
detected in either RNA, DNA or both.
9. Genome Browser View Variants Found in DNA and RNA ( ) A collection of tracks
presented together. Shows the annotated variants track together with the human reference
sequence, genes, transcripts, coding regions, and variants detected in COSMIC, ClinVar
and dbSNP (see figure 8.18).
10. Log ( ) A log of the workflow execution.
The three most important tracks of the ten generated are the Variants found in both DNA and
RNA track, the All variants found in DNA or RNA track, and the Genome Browser View. The Genome
Browser View makes it easy to get an overview in the context of a reference sequence, and to
compare variant and expression tracks with information from different databases. The two other
tracks (Variants found in both DNA and RNA track and All variants found in DNA or RNA track)
provide detailed information about the detected variants when opened in table view.
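The relationship between the two variant tracks can be pictured as set operations. In this hedged Python sketch each variant is reduced to a hypothetical (chromosome, position, reference, allele) key; how the Workbench itself decides that a DNA variant and an RNA variant are the same is not described in this section.

```python
# Illustrative variant keys: (chromosome, position, reference, allele).
dna_variants = {("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T")}
rna_variants = {("chr1", 1000, "A", "G"), ("chr3", 42, "G", "A")}

# Corresponds to the "Variants Found in Both DNA and RNA" track.
found_in_both = dna_variants & rna_variants

# Corresponds to the "All Variants Found in DNA or RNA" track.
found_in_either = dna_variants | rna_variants

# Variants detected in DNA but not in RNA.
dna_only = dna_variants - rna_variants
```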
8.5 Identify Candidate Variants and Genes from Tumor Normal Pair
The Identify Candidate Variants and Genes from Tumor Normal Pair tool identifies somatic
variants and differentially expressed genes in a tumor normal pair. Only one tumor normal pair can
be compared at a time. If you would like to compare more than one pair, you must repeat the
analysis with the next tumor normal pair.
To run the ready-to-use workflow:
Toolbox | Ready-to-Use Workflows | Whole Transcriptome Sequencing ( ) | Identify Candidate Variants and Genes from Tumor Normal Pair ( )
1. Double-click on the Identify Candidate Variants and Genes from Tumor Normal Pair tool
to start the analysis. If you are connected to a server, you will first be asked where you
would like to run the analysis. Next, you will be asked to select the RNA-seq reads from the
normal sample. The panel in the left side of the wizard shows the kind of input that should
be provided (figure 8.19). Select by double-clicking on the reads file name or clicking once
on the file and then clicking on the arrow pointing to the right side in the middle of the
wizard. Click on the button labeled Next.
Figure 8.18: The genome browser view makes it easy to compare a range of different data.
Figure 8.19: Select the RNA-seq reads from the normal sample.
2. In the next step you will be asked to select the RNA-seq reads from the tumor sample (see
figure 8.20).
Figure 8.20: Select the RNA-seq reads from the tumor sample.
3. Click on the button labeled Next. In this wizard step (figure 8.21) you can adjust the
settings for the Create fold change track tool. In brief, for each transcript or gene, the tool
calculates the ratio between the expression values in the
normal and the tumor sample. This makes it possible to filter on fold changes and
expression values, which makes it easy to identify differentially expressed transcripts
or genes. The parameters that can be adjusted in this wizard step are described in detail in the
CLC Cancer Research Workbench user manual (see
http://clcsupport.com/clccancerresearchworkbench/current/index.php?manual=Create_fold_change_track.html).
Figure 8.21: Specify the parameters for variant calling.
4. Click on the button labeled Next. This will allow you to specify the parameters for the
variant detection (figure 8.22). For a description of the different parameters that can be
adjusted in the variant detection step, we refer to the description of the "Low Frequency
Variant Detection" tool in the CLC Cancer Research Workbench user manual
(http://www.clcsupport.com/clccancerresearchworkbench/current/index.php?manual=Low_Frequency_Variant_Detection.html).
As general filters are applied to the different variant detectors that are available in CLC Cancer
Research Workbench, the description of the filters is found in a separate section called "Filters" (see
http://www.clcsupport.com/clccancerresearchworkbench/current/index.php?manual=Filters.html).
If you click on "Locked Settings", you will be able to see all parameters
used for variant detection in the ready-to-use workflow.
Figure 8.22: Specify the parameters for variant calling.
5. The next wizard step (figure 8.23) concerns removal of germline variants. You are asked to
supply the number of reads in the control data set that should support the variant allele in
order to include it as a match. All the variants where at least this number of control reads
show the particular allele will be filtered away in the result track.
Figure 8.23: Specify the number of reads to use as cutoff for removal of germline variants.
6. In the next wizard step variants found in known databases are removed. The
variants from a range of different databases are in fact removed in this ready-to-use workflow, but
only databases that provide data from more than one population need to be specified by
the user. This is the case for the HapMap database. From the drop-down list you can choose
the population that matches the population your samples are derived from (figure 8.24). The
drop-down list shows the populations that were selected under "Data Management" as described
in the CLC Cancer Research Workbench user manual
(http://www.clcsupport.com/clccancerresearchworkbench/current/index.php?manual=Download_configure_reference_data.html).
Figure 8.24: Select the relevant population from the drop-down list.
7. Click on the button labeled Next to go to the last wizard step (shown in figure 8.25).
Figure 8.25: Check the selected parameters by pressing "Preview All Parameters".
Pressing the button Preview All Parameters allows you to preview all parameters. At
this step you can only view the parameters; it is not possible to make any changes (see
figure 8.26). Choose to save the results and click on the button labeled Finish.
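Two of the steps above, the fold change calculation (step 3) and the removal of germline variants (step 5), can be sketched in Python. This is a simplified illustration and not the Workbench's algorithm: the direction of the ratio (tumor divided by normal), the handling of zero expression, and all names and values are assumptions made for the example.

```python
def fold_change(normal_expression, tumor_expression):
    """Ratio of tumor to normal expression; None when the normal value is zero."""
    if normal_expression == 0:
        return None  # assumption: leave undefined rather than divide by zero
    return tumor_expression / normal_expression


def remove_germline(tumor_variants, control_read_counts, min_control_reads):
    """Keep tumor variants supported by fewer than min_control_reads control reads.

    control_read_counts maps a variant to the number of reads in the control
    (normal) data set that show the same allele. Variants with at least
    min_control_reads supporting control reads are filtered away.
    """
    return [
        variant for variant in tumor_variants
        if control_read_counts.get(variant, 0) < min_control_reads
    ]


# Hypothetical expression values: gene -> (normal, tumor).
expression = {"GENE_A": (10.0, 40.0), "GENE_B": (8.0, 2.0)}
changes = {gene: fold_change(n, t) for gene, (n, t) in expression.items()}

# Hypothetical variants and control read support.
tumor_variants = ["var1", "var2", "var3"]
control_counts = {"var1": 0, "var2": 5}
somatic = remove_germline(tumor_variants, control_counts, min_control_reads=2)
```

Here `var2` is removed because five control reads show the allele, which meets the cutoff of two.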
Figure 8.26: Preview all parameters. At this step it is not possible to introduce any changes; it is
only possible to view the settings.
Thirteen types of output are generated:
1. Gene Expression Normal ( ) A track showing gene expression annotations. Hold the
mouse over the track or right-click on it. A tooltip will appear with information about e.g.
gene name and gene expression values.
2. Transcript Expression Normal ( ) A track showing transcript expression annotations.
Hold the mouse over the track or right-click on it. A tooltip will appear with information
about e.g. gene name and transcript expression values.
3. RNA-Seq Mapping Report Normal ( ) This report contains information about the reads,
reference, transcripts, and statistics. This is explained in more detail in the CLC Cancer
Research Workbench reference manual in section RNA-Seq report
(http://clcsupport.com/clccancerresearchworkbench/current/index.php?manual=RNA_Seq_report.html).
4. Gene Expression Tumor ( ) A track showing gene expression annotations. Hold the
mouse over the track or right-click on it. A tooltip will appear with information about e.g.
gene name and gene expression values.
5. Transcript Expression Tumor ( ) A track showing transcript expression annotations. Hold
the mouse over the track or right-click on it. A tooltip will appear with information about
e.g. gene name and transcript expression values.
6. RNA-Seq Mapping Report Tumor ( ) This report contains information about the reads,
reference, transcripts, and statistics. This is explained in more detail in the CLC Cancer
Research Workbench reference manual in section RNA-Seq report
(http://clcsupport.com/clccancerresearchworkbench/current/index.php?manual=RNA_Seq_report.html).
7. Differentially Expressed Genes ( ) A track showing the differentially expressed genes. The
table view provides information about fold change, difference in expression, the maximum
expression (observed in either the case or the control), the expression in the case, and the
expression in the control.
8. Read Mapping Tumor ( ) The mapped RNA-seq reads. The RNA-seq reads are shown
in different colors depending on their orientation, whether they are single reads or paired
reads, and whether they map unambiguously. For the color codes please see the
description of sequence colors in the CLC Genomics Workbench manual that can be found
here:
http://www.clcsupport.com/clcgenomicsworkbench/current/index.php?manual=View_settings_in_Side_Panel.html.
9. Read Mapping Normal ( ) The mapped RNA-seq reads. The RNA-seq reads are shown
in different colors depending on their orientation, whether they are single reads or paired
reads, and whether they map unambiguously. For the color codes please see the
description of sequence colors in the CLC Genomics Workbench manual that can be found
here:
http://www.clcsupport.com/clcgenomicsworkbench/current/index.php?manual=View_settings_in_Side_Panel.html.
10. Variant Calling Report Tumor ( ) Report showing error rates for quality categories, quality
of examined sites, and estimated frequencies of actual to called bases for different quality
score ranges.
11. Annotated Somatic Variants with Expression Values ( ) A variant track showing the
somatic variants. When mousing over a variant, a tooltip will appear with information about
the variant.
12. Genome Browser View RNA-Seq Tumor_Normal Comparison ( ) A collection of tracks
presented together. Shows the annotated variants track together with the human reference
sequence, genes, transcripts, coding regions, and variants detected in COSMIC, ClinVar
and dbSNP (see figure 8.27).
13. Log ( ) A log of the workflow execution.
8.6 Identify variants and add expression values
The Identify Variants and Add Expression Values ready-to-use workflow can be used to identify
novel and known mutations in RNA-seq data, automatically map, quantify, and annotate the
transcriptomes, and compare the mutational patterns in the samples with the expression values
of the corresponding transcripts and genes.
To run the ready-to-use workflow:
Toolbox | Ready-to-Use Workflows | Whole Transcriptome Sequencing ( ) | Identify Variants and Add Expression Values ( )
1. Double-click on the Identify Variants and Add Expression Values tool to start the analysis.
If you are connected to a server, you will first be asked where you would like to run the
analysis. Next, you will be asked to select the RNA-seq reads. The reads can be selected
by double-clicking on the reads file name or clicking once on the file and then clicking on
the arrow pointing to the right side in the middle of the wizard (figure 8.28).
Click on the button labeled Next.
2. In the next wizard step (figure 8.29) you can specify the parameters for variant detection.
For a description of the different parameters that can be adjusted in the variant
detection step, we refer to the description of the "Low Frequency Variant Detection" tool
in the CLC Cancer Research Workbench user manual
(http://www.clcsupport.com/clccancerresearchworkbench/current/index.php?manual=Low_Frequency_Variant_Detection.html).
As general filters are applied to the different variant detectors that are available in CLC Cancer
Research Workbench, the description of the filters is found in a separate section called "Filters" (see
http://www.clcsupport.com/clccancerresearchworkbench/current/index.php?manual=Filters.html).
Figure 8.27: The Genome Browser View is a collection of a number of tracks. The Genome Browser
View makes it easy to compare the different tracks. Each track can be opened individually by
double-clicking on the track name in the left side of the View Area.
Figure 8.28: Select the sequencing reads to analyze.
Figure 8.29: Specify the parameters for variant calling.
3. The next two wizard steps are annotation steps where the detected variants are annotated
with information from known databases. The variants are in fact annotated with a range
of different data in this ready-to-use workflow, but only databases that provide data from
more than one population need to be specified by the user. This is the case for HapMap
and the 1000 Genomes Project. First, the variants are annotated with information from
the 1000 Genomes Project (see figure 8.30). From the drop-down list you can choose the
population that matches the population your samples are derived from. The drop-down
list shows the populations that were selected under "Data Management" as described
in the CLC Cancer Research Workbench user manual
(http://www.clcsupport.com/clccancerresearchworkbench/current/index.php?manual=Download_configure_reference_data.html).
Under "Locked settings" you can see that "Automatically join adjacent MNVs and SNVs"
has been selected. The reason for this is that many databases do not report a succession
of SNVs as one MNV as is the case for the CLC Cancer Research Workbench, and as a
consequence it is not possible to directly compare variants called with CLC Cancer Research
Workbench with these databases. In order to support filtering against these databases
anyway, the option to Automatically join adjacent SNVs and MNVs is enabled. This means
that an MNV in the experimental data will get an exact match if a set of SNVs and MNVs
in the database can be combined to provide the same allele.
Note! This assumes that SNVs and MNVs in the track of known variants represent the
same allele, although there is no evidence for this in the track of known variants.
4. Click on the button labeled Next and do the same to annotate with information from
HapMap (figure 8.31).
5. Click on the button labeled Next to go to the last wizard step (shown in figure 8.32).
Figure 8.30: Select the relevant population from the drop-down list.
Figure 8.31: Select the relevant population from the drop-down list.
Figure 8.32: Check the selected parameters by pressing "Preview All Parameters".
Pressing the button Preview All Parameters allows you to preview all parameters. At
this step you can only view the parameters; it is not possible to make any changes (see
figure 8.33). Choose to save the results and click on the button labeled Finish.
Figure 8.33: Preview all parameters. At this step it is not possible to introduce any changes; it is
only possible to view the settings.
Seven different output types are generated:
1. Gene expression ( ) A track showing gene expression annotations. Hold the mouse over
the track or right-click on it. A tooltip will appear with information about e.g. gene name
and expression values.
2. Transcript expression ( ) A track showing transcript expression annotations. Hold the
mouse over the track or right-click on it. A tooltip will appear with information about e.g.
gene name and expression values.
3. RNA-Seq Mapping Report ( ) This report contains information about the reads, reference,
transcripts, and statistics. This is explained in more detail in the CLC Cancer Research
Workbench reference manual in section RNA-Seq report
(http://clcsupport.com/clccancerresearchworkbench/current/index.php?manual=RNA_Seq_report.html).
4. Read Mapping ( ) The mapped RNA-seq reads. The RNA-seq reads are shown in
different colors depending on their orientation, whether they are single reads or paired
reads, and whether they map unambiguously. For the color codes please see the
description of sequence colors in the CLC Genomics Workbench manual that can be found
here:
http://www.clcsupport.com/clcgenomicsworkbench/current/index.php?manual=View_settings_in_Side_Panel.html.
5. Annotated Variants with Expression Values ( ) Annotation track showing the variants.
Hold the mouse over one of the variants or right-click on the variant. A tooltip will appear
with detailed information about the variant.
6. RNA-Seq Genome Browser View ( ) A collection of tracks presented together. Shows the
annotated variants track together with the human reference sequence, genes, transcripts,
coding regions, and variants detected in COSMIC, ClinVar and dbSNP (see figure 8.18).
7. Log ( ) A log of the workflow execution.
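Conceptually, the Annotated Variants with Expression Values track pairs each variant with the expression of the gene it falls in. The sketch below shows that idea in Python with hypothetical gene coordinates and expression values; it says nothing about how the Workbench performs the lookup internally.

```python
# Illustrative data only: (gene name, chromosome, start, end, expression value).
genes = [
    ("GENE_A", "chr1", 1000, 2000, 12.5),
    ("GENE_B", "chr1", 5000, 6000, 0.0),
]
# Illustrative variants: (chromosome, position).
variants = [("chr1", 1500), ("chr1", 5500), ("chr1", 9000)]


def add_expression(variants, genes):
    """Attach the name and expression of the overlapping gene to each variant."""
    annotated = []
    for chrom, pos in variants:
        # Take the first gene whose interval contains the variant, if any.
        hit = next(
            ((name, expr) for name, g_chrom, start, end, expr in genes
             if g_chrom == chrom and start <= pos <= end),
            (None, None),
        )
        annotated.append(
            {"chrom": chrom, "pos": pos, "gene": hit[0], "expression": hit[1]}
        )
    return annotated


annotated_variants = add_expression(variants, genes)
```

A variant in a gene with an expression value of zero, such as the second one here, may be of less interest than a variant in an expressed gene.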
8.7 Identify and Annotate Differentially Expressed Genes and Pathways
The Identify and Annotate Differentially Expressed Genes and Pathways ready-to-use workflow
compares the gene expression in different groups of samples using an empirical analysis and
performs a gene ontology (GO) enrichment analysis on the differentially expressed genes to
identify affected pathways.
To run the ready-to-use workflow:
Toolbox | Ready-to-Use Workflows | Whole Transcriptome Sequencing ( ) | Identify and Annotate Differentially Expressed Genes and Pathways ( )
1. Double-click on the Identify and Annotate Differentially Expressed Genes and Pathways
ready-to-use workflow to start the analysis. If you are connected to a server, you will first
be asked where you would like to run the analysis. Next, you will be asked to select the
experiment to analyze (figure 8.34). To select an experiment ( ), double-click on the
experiment file name or click once on the file and then on the arrow pointing to the right
side in the middle of the wizard. Click on the button labeled Next.
Figure 8.34: Select the experiment to analyze.
2. In the next wizard step you can specify the parameters to be used for extraction of
differentially expressed genes.
Configurable Parameters
• Type of p-value This drop-down menu allows you to select between raw and corrected
p-values. For a description of these, please see the Transcriptomics Chapter, section
"Corrected p-values" in the CLC Genomics Workbench manual that can be found here:
http://www.clcsupport.com/clcgenomicsworkbench/current/index.php?manual=Corrected_p_values.html.
Only the types of p-values available for the given statistical analysis will be present in the
drop-down menu.
• Maximum p-value In this input field, you can enter the maximum allowed p-value, as
a number between 0 and 1. If you do not want any filtering based on p-value, enter 1.
• Minimum fold-change value You can also specify the minimum allowed fold-change
value as a number greater than zero. If you do not want any filtering based on
fold-change, enter 0.
3. Click on the button labeled Next to go to the next step where you can choose the gene
ontology type you wish to use.
Figure 8.35: Select the parameters for extraction of differentially expressed genes.
Figure 8.36: Select which gene ontology type to use.
4. In the next step you can choose to preview the settings and save the results (see
figure 8.37).
Figure 8.37: The results handling step.
5. Click on the button labeled "Preview All Parameters" if you would like to preview the
settings. The parameter settings can be viewed but not edited in this view.
6. Press OK, specify where to save the results, and then click on the button labeled Finish to
run the analysis.
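The effect of the two thresholds configured in step 2 can be illustrated with a small Python filter. The gene records, field names and default values are made up, and comparing the absolute value of the fold change against the minimum is an assumption of this sketch rather than documented Workbench behavior.

```python
def extract_differentially_expressed(genes, max_p_value=0.05, min_fold_change=2.0):
    """Keep genes that pass both the p-value and the fold-change threshold.

    The default thresholds are illustrative. Passing max_p_value=1 and
    min_fold_change=0 disables filtering, as described in step 2.
    """
    return [
        gene for gene in genes
        if gene["p_value"] <= max_p_value
        and abs(gene["fold_change"]) >= min_fold_change
    ]


# Hypothetical rows from a statistical comparison.
genes = [
    {"name": "GENE_A", "p_value": 0.001, "fold_change": 4.2},
    {"name": "GENE_B", "p_value": 0.200, "fold_change": 5.0},
    {"name": "GENE_C", "p_value": 0.010, "fold_change": -1.1},
]
selected = extract_differentially_expressed(genes)
unfiltered = extract_differentially_expressed(genes, max_p_value=1, min_fold_change=0)
```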
Three different types of output are generated:
1. Annotated Differentially Expressed Genes ( ) This is an annotation track that gives
access to the expression values and other information. This information can be accessed
in two different ways:
• Hold the mouse over the track or right-click on it. A tooltip will appear with information
about e.g. gene name, results of statistical tests, expression values, and GO
information.
• Open the track in table format by clicking on the table icon in the lower left side of the
View Area.
2. Enriched Gene Groups and Pathways ( ) A table showing the results of the GO enrichment
analysis. The table includes GO terms, a description of the affected function/pathway,
the number of genes in each function/pathway, the number of affected genes within the
function/pathway, and p-values.
3. Genome Browser View Differentially Expressed Genes and Pathways ( ) A collection of
tracks presented together. Shows the human reference sequence, annotation tracks for
genes, coding regions, transcripts, and expression comparison with GO information, and a
conservation score track (see figure 8.38).
Figure 8.38: The genome browser view allows comparison of the expression comparison tracks
with the reference sequence and different annotation tracks.
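This section does not state which statistical test underlies the p-values in the Enriched Gene Groups and Pathways table. One commonly used choice for this kind of analysis is a hypergeometric test, sketched below in Python purely to illustrate how the number of genes in a group and the number of affected genes within it could be turned into a p-value. The function name and the example numbers are hypothetical.

```python
from math import comb


def enrichment_p_value(total_genes, genes_in_group, affected_genes, affected_in_group):
    """Upper-tail hypergeometric probability P(X >= affected_in_group).

    total_genes:        all genes considered in the analysis
    genes_in_group:     genes annotated with the GO term
    affected_genes:     differentially expressed genes overall
    affected_in_group:  differentially expressed genes annotated with the term
    """
    upper = min(genes_in_group, affected_genes)
    favorable = sum(
        comb(genes_in_group, i)
        * comb(total_genes - genes_in_group, affected_genes - i)
        for i in range(affected_in_group, upper + 1)
    )
    return favorable / comb(total_genes, affected_genes)


# Toy example: 10 genes in total, 3 in the group, 3 affected, all 3 in the group.
p_all_in_group = enrichment_p_value(10, 3, 3, 3)
```

A small p-value indicates that the group contains more affected genes than would be expected if the affected genes were drawn at random.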
Part III
Customized data analysis
Chapter 9
How to edit application workflows
Contents
9.1 Introduction to customized data analysis
9.2 How to edit preinstalled workflows
9.1 Introduction to customized data analysis
CLC Cancer Research Workbench offers a range of different tools that can be used for customized
data analysis. The vast majority of the tools are workflow enabled, which means that the tools can
be connected and used in customized workflows. The CLC Cancer Research Workbench reference
manual has a chapter that describes this in detail
(http://clccancer.com/software/#downloads, chapter: "Workflows").
9.2 How to edit preinstalled workflows
An important feature of the CLC Cancer Research Workbench is the possibility to add, delete,
and replace tools in the preinstalled workflows (the tools found in the "Application" folder of the
toolbox). Moreover, parameter settings can be unlocked or locked with different values.
The edited workflow can be installed in the CLC Cancer Research Workbench and Genomics
Server as well as distributed between your collaborators.
When would it be relevant to edit a preinstalled workflow?
Example 1
You have an in-house database with common variants identified in people from your local region.
You have imported the database variants as a track and would like to use this database for
filtering out common variants instead of using HapMap, 1000 Genomes data and common
dbSNP.
Hence, what you would like to do is to modify the "Filter Somatic Variants" workflow and replace
the tools "Add Information from HapMap", "Add Information from 1000 Genomes project" and
"Add Information from common dbSNP" with "Add Information from External Databases".
Example 2
You would like to only see the known cancer associated variants and non synonymous variants
in the result.
You have used the "Create New Filter Criteria" tool to create a new filter criterion and would
like to extend the "Identify Somatic Variants from Tumor Normal Pair" to include the "Identify
Candidate Variants" tool at the end.
How can I edit a workflow?
Click on Workflows -> Create new Workflow in the upper right side corner of the workbench
(figure 9.1).
Figure 9.1: Click on Create new Workflow.
Next, drag and drop the preinstalled workflow that you would like to modify, from the toolbox to
the opened Workflow Editor (figure 9.2). You can now see the underlying workflow. If you right
click on the View Area and click "Layout", the layout will be adjusted.
You will see that at this point you do not have any input associated with the workflow. Please
add an input at the top of the workflow by right-clicking on the first tool in the workflow.
Figure 9.2: Drag and drop the preinstalled workflow in the workflow editor.
You can remove tools, connections, or drag and drop new tools from the toolbox into the workflow
editor.
How can I install the edited workflow and where will it be in the toolbox?
After you have finished editing your workflow, make sure that the validation of the workflow was
successful and save your workflow design file.
Then click on the button labeled Installation. This will open the wizard shown in figure 9.3.
Figure 9.3: The "Create Installer" wizard to be used for workflow installation.
After you have added your details (your name, institution, workflow name and a description of the
workflow), click on the button labeled Next. This will open the wizard step shown in figure 9.4.
Figure 9.4: The second "Create Installer" wizard step.
The installed workflow will appear in the "Workflow" folder in the toolbox.
Chapter 10
Using data from other workbenches
10.1 Open outputs from other workbenches
If you also have access to CLC Genomics Workbench, CLC Main Workbench, or CLC Sequence Viewer,
you may have generated different types of output that you would like to view in the CLC Cancer
Research Workbench. All types of output that have been created in CLC Genomics Workbench, CLC
Main Workbench, or CLC Sequence Viewer can be opened in the CLC Cancer Research Workbench. This
means that you can open certain output types that cannot be generated from within the CLC Cancer
Research Workbench. For further information about output types that are not described in the
CLC Cancer Research Workbench manual, please refer to our other manuals, e.g. the CLC Genomics
Workbench manual, which can be found here: http://www.clcbio.com/support/downloads/#manuals
Output files from other workbenches can be imported as described in section 3.3.1 using
Standard Import.
Part IV
Plugins
Chapter 11
Plugins
The CLC Cancer Research Workbench can be upgraded and customized by installing plugins. This
can be done by clicking on the button labeled "Plugins" in the upper right corner of the CLC
Cancer Research Workbench (figure 11.1).
Figure 11.1: Click on the button labeled "Plugins" to download plugins.
The plugins that are available for CLC Cancer Research Workbench are:
• Batch Rename
• Biobase Genome Trax Annotate
• Biobase Genome Trax Download
• Duplicate Mapped Reads Removal
• Shannon Human Splicing Pipeline
• Shannon Human Splicing Pipeline Client
You can find a detailed description of how to download and install plugins in the CLC Cancer Research Workbench reference manual in chapter Introduction to CLC Cancer Research
Workbench section Plugins.
Part V
Appendix
Appendix A
Reference data overview
Data: Human reference sequence
Provider: ENSEMBL
URL to the original file: ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/
Description: Chromosomes 1-22, X, Y and M human reference DNA sequence GRCh37(HG19).

Data: Human genes, coding sequences and transcripts
Provider: ENSEMBL
URL to the original file: ftp://ftp.ensembl.org/pub/current_gtf/homo_sapiens/
Description: All annotated protein coding genes for human reference sequence GRCh37(HG19).
The annotation was done by ENSEMBL and includes annotations from RefSeq, CCDS as well as
ENSEMBL itself.

Data: HapMap variants
Provider: ENSEMBL
URL to the original file: ftp://ftp.ensembl.org/pub/current_variation/gvf/homo_sapiens/
Description: The goal of the International HapMap Project is to develop a haplotype map of
the human genome, the HapMap, which will describe the common patterns of human DNA sequence
variation (for more information about HapMap see http://hapmap.ncbi.nlm.nih.gov/).
Please note that there are 12 different files (tracks) to be downloaded (one file for each
population). It is recommended that you configure your workflows with the file from the
population that best matches the ethnicity of the patient from which the sample was taken.
You can find more about the population codes, which are part of the filename, here:
http://www.sanger.ac.uk/resources/downloads/human/hapmap3.html.

Data: Variants found by the 1000 Genomes Project
Provider: ENSEMBL
URL to the original file: ftp://ftp.ensembl.org/pub/current_variation/gvf/homo_sapiens/
Description: The 1000 Genomes Project Phase 1 created an integrated map of genetic variation
from 1092 human genomes [1000 Genomes Project Consortium et al., 2012]. Please note that
there are 4 different files (tracks) to be downloaded (one file for each population). It is
recommended that you configure your workflows with the file from the population that best
matches the ethnicity of the patient from which the sample was taken. You can learn more
about the population codes that are part of the filename here:
http://www.ensembl.org/Help/Faq?id=328.

Data: COSMIC variants
Provider: Sanger Institute
URL to the original file: ftp://ftp.sanger.ac.uk/pub/CGP/cosmic/data_export/CosmicMutantExport_*.tsv.gz
Description: The mutation data was obtained from the Sanger Institute Catalogue Of Somatic
Mutations In Cancer web site, http://www.sanger.ac.uk/cosmic; see Bamford et al. (2004),
"The COSMIC (Catalogue of Somatic Mutations in Cancer) database and website"
[Bamford et al., 2004]. The COSMIC database is a human, curated database.

Data: dbSNP variants
Provider: UCSC
URL to the original file: http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/snp*.txt.gz
Description: Human variants present in the Single Nucleotide Polymorphism Database (dbSNP),
which includes smaller insertions, deletions, replacements, SNPs and MNVs. Please note that
most variants in dbSNP are not validated and everybody can submit data to dbSNP. The
collection of variants includes clinically relevant as well as common variants.

Data: dbSNP variants
Provider: UCSC
URL to the original file: http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/snp*Common.txt.gz
Description: Uniquely mapped variants that appear in at least 1% of the population or are
100% non-reference.

Data: ClinVar database variants
Provider: NCBI
URL to the original file: ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf/clinvar_00-latest.vcf.gz
Description: ClinVar is designed to provide a freely accessible, public archive of reports
of the relationships among human variations and phenotypes, with supporting evidence.

Data: PhastCons Conservation Scores
Provider: UCSC
URL to the original file: http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/hg19.100way.phastCons/
Description: Conservation track of UCSC from a multiple alignment of 100 species and
measurements of evolutionary conservation using the phastCons algorithm from the PHAST
package.

Data: Human Gene Ontology (GO slim) file
Provider: EBI
URL to the original file: http://www.ebi.ac.uk/QuickGO/GMultiTerm
Description: Gene Ontology file in slim format (only high level GO terms annotated) for the
GO categories Molecular Function, Biological Process and Cellular Component annotated on
human genes. The file was made using the QuickGO tool from the EBI
(http://www.ebi.ac.uk/QuickGO/GMultiTerm).
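Several of the URLs listed above end in a file name pattern containing the wildcard `*` (for example `CosmicMutantExport_*.tsv.gz` and `snp*Common.txt.gz`). The short Python sketch below shows one way to check which pattern a downloaded file name matches, using the standard library function `fnmatch.fnmatchcase`. The labels in `PATTERNS`, the helper `matching_entries` and the example file names are hypothetical and used for illustration only.

```python
from fnmatch import fnmatchcase
from typing import List

# File name patterns taken from the URLs in this appendix.
# The labels are illustrative and not names used by the workbench.
PATTERNS = {
    "COSMIC variants": "CosmicMutantExport_*.tsv.gz",
    "dbSNP variants": "snp*.txt.gz",
    "dbSNP common variants": "snp*Common.txt.gz",
}


def matching_entries(filename: str) -> List[str]:
    """Return the labels of all patterns that the file name matches (case-sensitive)."""
    return [label for label, pattern in PATTERNS.items()
            if fnmatchcase(filename, pattern)]


# Hypothetical file names, for illustration only.
for name in ("snp141.txt.gz", "snp141Common.txt.gz", "CosmicMutantExport_v70.tsv.gz"):
    print(name, "->", matching_entries(name))
```

Note that a file name ending in `Common.txt.gz` matches both dbSNP patterns, because `snp*.txt.gz` is the more general of the two. When telling the two dbSNP files apart by name, test the more specific pattern first.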
Appendix B
Mini dictionary
Application: Type of analysis (Whole Genome Sequencing, Whole Exome Sequencing, Targeted
Amplicon Sequencing, RNA-seq)

Automated workflow: A workflow consisting of several tools that have been built together and
that only requires a few inputs from the user

Navigation area: The area in the left side of the CLC Cancer Research Workbench that holds
the data

Ready-to-use workflow: Pre-installed automated workflow consisting of several tools that have
been built together and that only requires a few inputs from the user

Side Panel: The Side Panel, shown to the right of all views that are opened in CLC Cancer
Research Workbench, allows you to change the way the content of a view is displayed

Status Bar: The Status Bar is located at the bottom of all views. The left side of the bar
shows whether the computer is making calculations or whether it is idle. The right side of
the bar indicates the range of the selection of a sequence.

Tool: In the CLC Cancer Research Workbench this term is used about both single tools and
ready-to-use workflows

Toolbox: The area in the lower left side of the CLC Cancer Research Workbench that holds the
tools

Track: Data is presented in track format (=genome browser view) in the CLC Cancer Research
Workbench

View Area: The area in the middle of the CLC Cancer Research Workbench. This is where you can
visualize your results and work with your data

View Tools: The area in the lower right part of the View Area. Here you can find tools for
zooming, panning, and selection of data
Bibliography
[1000 Genomes Project Consortium et al., 2012] 1000 Genomes Project Consortium, Abecasis, G. R.,
Auton, A., Brooks, L. D., DePristo, M. A., Durbin, R. M., Handsaker, R. E., Kang, H. M.,
Marth, G. T., and McVean, G. A. (2012). An integrated map of genetic variation from 1,092
human genomes. Nature, 491(7422):56-65.
[Bamford et al., 2004] Bamford, S., Dawson, E., Forbes, S., Clements, J., Pettett, R., Dogan, A.,
Flanagan, A., Teague, J., Futreal, P. A., Stratton, M. R., and Wooster, R. (2004). The COSMIC
(Catalogue of Somatic Mutations in Cancer) database and website. Br J Cancer, 91(2):355-358.
[Choi et al., 2009] Choi, M., Scholl, U. I., Ji, W., Liu, T., Tikhonova, I. R., Zumbo, P., Nayir, A.,
Bakkaloglu, A., Özen, S., Sanjad, S., Nelson-Williams, C., Farhi, A., Mane, S., and Lifton, R. P.
(2009). Genetic diagnosis by whole exome capture and massively parallel DNA sequencing.
Proc Natl Acad Sci U S A, 106(45):19096-19101.
[Heap et al., 2010] Heap, G. A., Yang, J. H. M., Downes, K., Healy, B. C., Hunt, K. A., Bockett,
N., Franke, L., Dubois, P. C., Mein, C. A., Dobson, R. J., Albert, T. J., Rodesch, M. J., Clayton,
D. G., Todd, J. A., van Heel, D. A., and Plagnol, V. (2010). Genome-wide analysis of allelic
expression imbalance in human primary cells by high-throughput transcriptome resequencing.
Hum Mol Genet, 19(1):122-134.
[Martin and Wang, 2011] Martin, J. A. and Wang, Z. (2011). Next-generation transcriptome
assembly. Nat Rev Genet, 12(10):671-682.
[Ng et al., 2009] Ng, S. B., Turner, E. H., Robertson, P. D., Flygare, S. D., Bigham, A. W., Lee,
C., Shaffer, T., Wong, M., Bhattacharjee, A., Eichler, E. E., Bamshad, M., Nickerson, D. A.,
and Shendure, J. (2009). Targeted capture and massively parallel sequencing of 12 human
exomes. Nature, 461(7261):272-276.
[Wang et al., 2009] Wang, Z., Gerstein, M., and Snyder, M. (2009). RNA-Seq: a revolutionary
tool for transcriptomics. Nat Rev Genet, 10(1):57-63.
Part VI
Index
Index
Annotate Variants (TAS), 116
Annotate Variants (WES), 83
Annotate Variants (WGS), 59
Annotate Variants (WTS), 146
Automatic analysis (TAS), 112
Automatic analysis (WES), 79
Automatic analysis (WGS), 55
Automatic analysis, RNA-seq, 145
Bibliography, 183
Compare variants in DNA and RNA, 151
Configure reference data, 32
Contact information, 9
Create new folder, 39
Customized data analysis, 172
Download reference data, 32
Edit preinstalled workflows, 172
Example data, import, 12
Filter Somatic Variants (TAS), 122
Filter Somatic Variants (WES), 89
Filter Somatic Variants (WGS), 64
Identify and annotate differentially expressed genes, 168
Identify and Annotate Variants (TAS), 137
Identify and Annotate Variants (WES), 104
Identify candidate variants and genes from tumor normal pair, 157
Identify Known Variants in One Sample (TAS), 131
Identify Known Variants in One Sample (WES), 98
Identify Known Variants in One Sample (WGS), 72
Identify Somatic Variants from Tumor Normal Pair (TAS), 126
Identify Somatic Variants from Tumor Normal Pair (WES), 93
Identify Somatic Variants from Tumor Normal Pair (WGS), 68
Identify Variants (TAS), 113
Identify Variants (WES), 80
Identify Variants (WGS), 56
Identify variants and add expression values, 163
Import data, 40
Menu Bar, illustration, 13
Navigation Area
    illustration, 13
Reference data, 29
    Configure, 32
    Download, 32
References, 183
RNA-seq analysis, Identify variants and add expression values, 163
RNA-seq, differentially expressed genes and pathways, 168
RNA-seq, identify candidate variants and differentially expressed genes, 157
Status Bar
    illustration, 13
Toolbar
    illustration, 13
Toolbox
    illustration, 13
User interface, 13
View Area
    illustration, 13