NOTES
LIMITATIONS ON USE
While the best efforts have been made in preparing this
manuscript, no liability is assumed by the authors with
respect to the use of the information provided.
The BioNumerics software and this accompanying
guide are subject to the terms and conditions outlined in
the License Agreement. The support, entitlement to
upgrades and the right to use the software automatically
terminate if the user fails to comply with any of the
statements of the License Agreement.
No part of this guide may be reproduced by any means
without prior written permission of the authors.
SUPPORT BY APPLIED MATHS
Applied Maths will provide support to research
laboratories in developing new and highly specialized
applications, as well as to diagnostic laboratories where
speed, efficiency and continuity are of primary
importance. Our software owes its current status in
part to the feedback of many customers worldwide.
Please contact us if you have any problems or questions
concerning the use of BioNumerics, or suggestions for
improvement, refinement or extension of the software to
your specific applications:
Applied Maths BVBA
Keistraat 120
9830 Sint-Martens-Latem
Belgium
PHONE: +32 9 2222 100
FAX: +32 9 2222 102
E-MAIL: [email protected]
Applied Maths, Inc.
512 East 11th Street, Suite 207
Austin, Texas 78701
U.S.A.
PHONE: +1 512-482-9700
FAX: +1 512-482-9708
E-MAIL: [email protected]
URL: www.applied-maths.com
Copyright (C) 1998, 2004, Applied Maths BVBA. All rights reserved.
BioNumerics is a registered trademark of Applied Maths BVBA.
All other product names or trademarks are the property of their respective owners.
BioNumerics includes a library for XML input and output from Apache Software Foundation (http://www.apache.org).
The BioNumerics manual
Table of contents
1. The concepts of BioNumerics
1.1 The programs
1.2 The database and the experiments
1.3 Multi-database setup
1.4 Home directory and databases
2. About this guide
2.1 Conventions
2.2 Floating menus
3. Installing the software
3.1 The Setup program
3.2 Example database
4. Installing the BioNumerics Network Software
4.1 Introduction
4.2 Setup
4.3 Advanced features of the Netkey server program
4.4 Features of the Client program (BioNumerics)
5. Starting and setting up BioNumerics
5.1 The programs
5.2 The BioNumerics Startup program
5.3 Creating a database
5.4 Settings of a database
5.5 Database protection tools
5.6 Log files
6. Database functions
6.1 The BioNumerics Main window
6.2 Adding entries to the database
6.3 Creating database information fields
6.4 Entering information fields
6.5 Attaching files to database entries
6.6 Configuring the database layout
7. Setting up experiments
7.1 Defining a new Fingerprint Type
7.2 Processing gels
7.3 Defining pattern strips on the gel
7.4 Defining densitometric curves
7.5 Normalizing a gel
7.6 Defining bands and quantification
7.7 Advanced band search using size-dependent threshold
7.8 Quantification of bands
7.9 Adding the gel lanes to the database
7.10 Superimposed normalization based on internal reference patterns
7.11 Import of molecular size tables as Fingerprint Type
7.12 Conversion of gel patterns from GelCompar versions 4.1 and 4.2
7.13 Dealing with multiple reference systems within the same Fingerprint Type
7.14 Defining a new Character Type
7.15 Input of character data
7.16 Import of character data by quantification of images scanned as TIFF files
7.17 Defining a new Sequence Type
7.18 Input of sequences using the BioNumerics Assembler program
7.19 Defining a new Matrix Type
8. Experiment display and edit functions
8.1 The experiment card
8.2 Entering experiment data via the experiment card
8.3 Entering experiment data via the experiment file
9. Comparison functions
9.1 Definition
9.2 Manual selection functions
9.3 Automatic search and select functions
9.4 The advanced query tool
9.5 Subsets
9.6 Pairwise comparison between two entries
9.7 The Comparison window
10. Band matching and polymorphism analysis
10.1 Composite Data Sets
10.2 Creating a band matching
10.3 Manual editing of a band matching
10.4 Analyzing polymorphic bands only
10.5 Adding entries to a band matching
10.6 Band and band class filters
10.7 Creating a band matching table for polymorphism analysis
10.8 Tools to display selective band classes
10.9 Finding discriminative bands between entries
11. Cluster analysis
11.1 Introduction
11.2 Calculating a dendrogram
11.3 Calculation priority settings
11.4 General edit functions
11.5 Adding and deleting entries
11.6 Dendrogram display functions
11.7 Working with Groups
11.8 Cluster significance tools
11.9 Matrix display functions
11.10 Group statistics
11.11 Printing a cluster analysis
11.12 Analysis of the concordance between techniques
12. Cluster analysis of fingerprints
12.1 Defining ‘active zones’ on fingerprints
12.2 Calculation of optimal position tolerance optimization and settings
13. Cluster Analysis of characters
13.1 Coefficients for character data and conversion to binary
13.2 Advanced analysis of massive character sets using GeneMaths
14. Multiple alignment and cluster analysis of sequences
14.1 Calculating a cluster analysis based on pairwise alignment (steps 1 and 2)
14.2 Calculating a multiple alignment (steps 3 and 4)
14.3 Multiple alignment display options
14.4 Editing a multiple alignment
14.5 Drag-and-drop manual alignment
14.6 Inserting and deleting gaps
14.7 Removing common gaps in a multiple alignment
14.8 Changing sequences in a multiple alignment
14.9 Finding a subsequence
14.10 Calculating a clustering based on the multiple alignment (steps 5 and 6)
14.11 Adding entries to and deleting entries from an existing global alignment
14.12 Automatically realigning selected sequences
14.13 Sequence display and analysis settings
14.14 Converting sequences data to categorical character sets
14.15 Excluding regions from the sequence comparisons
14.16 Writing comments in the alignment
15. Cluster analysis of Composite Data Sets
15.1 Principles
15.2 Composite Data Sets
15.3 Calculating a dendrogram from a Composite Data Set
15.4 Cluster analysis of characters
16. Phylogenetic clustering methods
16.1 Maximum parsimony of Fingerprint Type data and Character Type data
16.2 Maximum parsimony clustering of sequence data
16.3 Maximum likelihood clustering
17. Advanced clustering and consensus trees
17.1 Introduction
17.2 Degeneracy of dendrograms
17.3 Consensus trees
17.4 Advanced clustering tools
17.5 Displaying the degeneracy of a tree
17.6 Creating consensus trees
17.7 Managing Advanced Trees
18. Minimum Spanning Trees for population modelling
18.1 Introduction
18.2 Minimum spanning trees in BioNumerics
18.3 Calculating a minimum spanning tree
18.4 Interpreting and editing a minimum spanning tree
19. Dimensioning techniques (PCA, MDS and SOM)
19.1 Calculating an MDS
19.2 Editing an MDS
19.3 Calculating a PCA
19.4 Calculating a discriminant analysis
19.5 Self organizing maps
19.6 Multivariate analysis of variance (MANOVA) and discriminant analysis
20. Chart and statistics tools
20.1 Introduction
20.2 Basic terminology
20.3 Charts and statistics
20.4 Using the plot tool and general appearance
20.5 Bar graph
20.6 Contingency table
20.7 2-D Scatterplot
20.8 3-D scatterplot
20.9 ANOVA plot
20.10 1-D numerical distribution
20.11 3-D Bar graph
21. Identification with database entries
21.1 Creating lists for identification
21.2 Identifying unknown entries
21.3 Fast band-based database screening of fingerprints
22. Identification using libraries
22.1 Creating a library
22.2 Identifying entries against a library
22.3 Detailed identification reports
22.4 Creating a neural network
23. Analyzing 2D gels
23.1 Proteomics in a broader context: the BioNumerics Platform
23.2 Data sources for BioNumerics 2D
23.3 Applications for BioNumerics 2D
23.4 Getting started with BioNumerics 2D
23.5 Creating a new database
23.6 Defining a new 2D experiment type
23.7 Importing 2D gel image files
23.8 Processing 2D gel images
23.9 Step 1: Spot detection
23.10 Calibration
23.11 Normalization
23.12 Defining metrics
23.13 Describing the 2D gel in the database
23.14 Normalization of other 2D gels
24. Comparing 2D gels
24.1 Introduction
24.2 Matching spots on different gels
24.3 Creating 2D spot queries
24.4 Listing spots in spreadsheets
24.5 Comparing spots in scatter plots
24.6 Clustering and statistical analysis of 2D gels in BioNumerics
24.7 Analyzing 2D gel spot tables with GeneMaths or GeneMaths XT
24.8 Editing reference systems
24.9 Creating synthetic gels
25. Database exchange tools
25.1 Creating a new bundle
25.2 Opening an existing bundle
26. The BioNumerics client functions
26.1 Introduction
26.2 Connecting to the server database
26.3 Searching and downloading entries on the server database
26.4 Uploading bundles and gels to the server
26.5 Performing identifications on the server
27. Import of data from external databases
27.1 Setting up the ODBC link
27.2 Import of database fields using ODBC
27.3 Import of character data using ODBC
27.4 Import of sequence data using ODBC
28. Connected databases
28.1 Setting up a new Connected Database
28.2 Configuring the Connected Database link in BioNumerics
28.3 Working in a Connected Database
28.4 Linking to an existing database with standard BioNumerics table structure
28.5 Linking to an existing database with table structure not in BioNumerics format
28.6 Converting a local database to a Connected Database
28.7 Opening and closing database connections
28.8 Restricting queries
29. Preserving the BioNumerics database integrity
29.1 Taking backups of a database
29.2 Detecting and correcting faults in a database
29.3 Missing directories
29.4 Corrupted files
29.5 Empty files
29.6 Database entries with identical keys
29.7 Database entries without keys
29.8 Experiments with identical keys
29.9 Experiment keys without database entries
30. Appendix
30.1 Connected Database table structure
30.2 Regular expressions
Index
1. The concepts of BioNumerics
1.1 The programs
The core of the BioNumerics software is composed of
two executable units: a Startup program, which creates
and manages the databases and associated directories,
and the Analyze program, which the Startup program
launches with a selected database. Most import functions
and all analysis functions are performed in the Analyze
program. This includes the processing of gel files
starting from TIFF images, including lane finding and
normalization. Independent conversion programs allow
the import of character data in different file formats.
These programs can also be launched from the
BioNumerics Startup program.
1.2 The database and the experiments
The logical flow of processing raw experiment files is
represented in Figure 1-1. The basis of BioNumerics is a
powerful relational database (4) consisting of entries. The
entries correspond to the individual organisms under
study: animals, plants, fungi, bacterial strains. Each
database entry is characterized by a unique key, assigned
either automatically by the software or manually, and
by a number of user-defined information fields. Each
entry may be characterized by one or more experiments
that can be linked easily to the entry (3). What we call
experiments in BioNumerics are in fact the experimental
data that are the numerical results of the biological
experiments or assays performed to estimate the
relationship between the organisms. In BioNumerics, all
biological experiments are functionally classified in five
different classes, called Experiment Types:
•Fingerprint Types: Any densitometric record seen as a
one-dimensional profile of peaks or bands can be
considered as a Fingerprint Type. Examples are of
course gel and capillary electrophoresis patterns, but
also gas chromatography or HPLC profiles,
spectrophotometric curves, etc. Fingerprint Types can
be derived from TIFF or bitmap files as well, which
are two-dimensional bitmaps. The condition is that
one must be able to translate the patterns into
densitometric curves.
•2D Gel Types: Any two-dimensional bitmap image
seen as a profile of spots or defined labelled structures.
Examples are 2D protein gel electrophoresis
patterns, 2D DNA electrophoresis profiles, 2D thin
layer chromatograms, or even images from
radioactively labelled cryosections or short half-life
radiotracers.
•Character Types: Any array of named characters,
binary or continuous, with fixed or undefined length
can be classified within the Character Types. The main
difference between Character Types and
electrophoresis types is that in the Character Types,
each character has a well-determined name, whereas
in the electrophoresis types, the bands, peaks or
densitometric values are unnamed (a molecular size is
NOT a well-determined name!).
•Sequence Types: Within the Sequence Types, the user
can enter nucleic acid (DNA and RNA) sequences and
amino acid (protein) sequences.
•A fifth type, Matrix Types, is not a native experiment
type, but the result of a comparison between database
entries, expressed as similarity values between certain
database entries. An example of a Matrix Type is a
matrix of DNA homology values. DNA homology
between organisms can only be expressed as pairwise
similarity, not as native character data.
Each experiment type is available as a module of the
BioNumerics software.
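The relationship between entries, keys, information fields and experiments can be pictured with a small sketch. The Python below is illustrative only and is not part of BioNumerics or its script language: the names `Entry`, `Database`, `add_entry` and `link_experiment`, the `ENTRY00001` key format and the sample field values are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One database entry, i.e. one organism under study."""
    key: str                                         # unique within the database
    info: dict = field(default_factory=dict)         # user-defined information fields
    experiments: dict = field(default_factory=dict)  # experiment name -> data


class Database:
    """A collection of entries addressed by their unique key."""

    def __init__(self):
        self.entries = {}
        self._counter = 0

    def add_entry(self, key=None, **info):
        """Create an entry with a manual key, or assign a key automatically."""
        if key is None:
            # Automatic assignment: take the next counter value not yet in use.
            while True:
                self._counter += 1
                key = f"ENTRY{self._counter:05d}"  # hypothetical key format
                if key not in self.entries:
                    break
        elif key in self.entries:
            raise KeyError(f"duplicate key: {key}")
        entry = Entry(key=key, info=dict(info))
        self.entries[key] = entry
        return entry

    def link_experiment(self, key, name, data):
        """Attach experiment data to an existing entry."""
        self.entries[key].experiments[name] = data


db = Database()
first = db.add_entry(source="soil")                # key assigned automatically
second = db.add_entry(key="LAB-7", source="water") # key assigned manually
db.link_experiment("LAB-7", "RFLP1", [0.841, 1.428, 1.203])
print(first.key, sorted(db.entries))
```

The dictionary keyed by entry key mirrors the requirement that each key be unique; a duplicate manual key is rejected rather than silently overwriting an existing entry.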
Essentially, adding a single organism (entry) with its
associated experiments to the database involves three
steps, which correspond to steps 1 to 4 in Figure 1-1
(steps 3 and 4 are described together):
1. Generating experimental data. This involves carrying
out the experiments on the organism, and creating a
data file of the experiment in any readable (text)
format.
2. Import, conversion and normalization of
experiments. Import of Character Type data is usually
done using a script language available in the
BioNumerics software. This script tool allows the user
to define reading protocols for virtually any character
file in readable text format. Most other import and
conversion functions are done by the same script tool.
These include processing of gel images starting from
densitometric curve files, sequences, etc. Special
formats such as ABI four channel chromatogram files
are imported using a separate program. BioNumerics
offers state-of-the-art technology for processing
electrophoresis fingerprints, including lane-finding,
normalization, band searching and quantification
tools. Normalization means the correction of shifts of
whatever kind or origin, within and between gels, so
that bands or peaks of the same size have the same
physical position after normalization, even if they
occur on different gels.
3. See under 4.
Figure 1-1. Flow chart of conversion and import of TIFF and densitometric files.
Stages shown in the figure: Sequence types, Character types and Electrophoresis types; import, conversion and normalization of experiments; linkage of experiments to database entry; generation of unlimited relational databases; clustering of individual experiments and composite data sets; generation of libraries for identification.
4. Creation of database entries and linkage of
experiments. There are three ways to create entries
(note: an entry is the record for an individual
organism) in the BioNumerics database: (1) an entry
or a batch of entries can be generated directly in the
database. Afterwards, experiments can be linked to
these entries; (2) new database entries can be created
automatically from a file of experiments; (3)
BioNumerics can search automatically for entries with
corresponding database information fields in the
database and in an experiment file. This means that
new database entries will be created for the
experiments if the information fields do not
correspond with one of the existing database entries.
If on the other hand, the information fields of an
experiment correspond to an existing entry, this
experiment will be linked to the entry.
5. Cluster analysis. The grouping analysis functions
involve the calculation of dendrograms to reveal
groups of related organisms, principal component
analysis, estimation of significance of groups, etc.
6. Generation of libraries and identification. The user
can create highly specific libraries for identification.
New and unknown organisms, available as database
entries, can be identified against libraries of known
entries, using individual experiments, or
combinations of experiments, which results in a
consensus-identification.
1.3 Multi-database setup
BioNumerics is a multi-database software, which
supports the setup of different users in Windows NT. It
is very important to understand the hierarchical
structure of the user, database, and experiment setup in
order to make optimal use of these features.
Windows NT (Windows 2000, Windows XP): The
BioNumerics users are associated with the Windows NT
login users. Each Windows NT user can specify his/her
BioNumerics databases directory, and BioNumerics
saves this information in the user’s system registry. For
example, suppose that a user X logs in on a Windows
NT machine with BioNumerics installed. This user can
create a directory, e.g. XXX, and specify this directory as
the Home directory in the Startup program.
BioNumerics will save this information in this user’s
system registry, so that each time the user logs in,
BioNumerics will automatically consider the same
directory as the home directory. In this way, each
Windows NT user can define his/her own BioNumerics
home directory, without interfering with other users.
Within this home directory, the user can specify as many
Databases as desired. Protection of the BioNumerics
databases depends on the protection of the specified
directory by the Windows NT user. If a user protects the
directory containing the BioNumerics databases, other
users will not be able to change or to read the databases,
depending on whether the directory is write or read
protected.
Figure 1-2. is an example of the database structure of one
Windows NT user. The figure illustrates that one user
can have many databases, and within each database, can
define various experiment types. If it is the intention to
compare database entries across different databases, the
experiment types that the entries of different databases
share should have the same name and definition.
Within the specified home directory, the BioNumerics
Startup program automatically creates the necessary
subdirectories for each database. The following
directory structure (Figure 1-3.) corresponds to the
above database structure of user X. The files shown are
the configuration files for the experiment types defined
under this user. If you want to create exactly the same
experiment types under another database or another
user, you can copy these .CNF files to the corresponding
directories.
Figure 1-3. Structure of databases and experiment types within one user.
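Copying the .CNF configuration files from one database to another, as suggested in section 1.3, can be done by hand in Windows Explorer. For users who prefer to script it, the following Python sketch shows one possible approach. It is not a BioNumerics tool: the function name `copy_cnf_files` is invented for this example, and because the directory layout of Figure 1-3 is not reproduced here, the source and target directories are left for the caller to supply.

```python
import shutil
from pathlib import Path


def copy_cnf_files(source_dir, target_dir, overwrite=False):
    """Copy every .CNF file from source_dir into target_dir.

    Returns the names of the files that were copied. Files that already
    exist in target_dir are skipped unless overwrite is True.
    """
    source = Path(source_dir)
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(source.iterdir()):
        # Compare the extension case-insensitively (.CNF or .cnf).
        if not path.is_file() or path.suffix.upper() != ".CNF":
            continue
        destination = target / path.name
        if destination.exists() and not overwrite:
            continue
        shutil.copy2(path, destination)
        copied.append(path.name)
    return copied
```

Skipping existing files by default is a deliberately cautious choice: it avoids replacing the definition of an experiment type that is already in use in the target database.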
1.4 Home directory and databases
As explained in the previous paragraph, BioNumerics
recognizes its databases by means of a Home directory.
This home directory can be different per Windows login,
and can even be on a different computer in the network.
By default, BioNumerics will install its databases under
the home directory. However, a database can also be
located in a different directory, and even on a different
computer in the network. What is important is that in
the home directory, a Database descriptor file is present
for each database. These files have the name of the
database with the extension .DBS. The
*Database*.DBS file is a pure text file which can be
edited in Notepad or any other text editor. It contains
the following information:
Windows NT USER ‘X’
(Home directory XXX)
Database X1
Fingerprint type X1A, Fingerprint type X1B
Character type X1A, Character type X1B
Sequence type I1A, Sequence type I1B
Database X2
Fingerprint type X2A
Character type X2A, Character type X2B
Database X3
Fingerprint type X3A, Fingerprint type X3B
Character type X3A
Matrix type X3A
Figure 1-2. Structure of databases and experiment types within one user.
[DIR]
C:\Program Files\BioNumerics\data\DemoBase
[BACKCOL]
150 171 172
[SAVELOGFILE]
0
The line after the tag [DIR] indicates the full path where
the database is located. The line after [BACKCOL]
contains the RGB values for the window background
color, and the line after [SAVELOGFILE] indicates
whether log files are saved or not.
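As an illustration of the tag and value layout just described, the following Python sketch reads such a descriptor. It is not part of BioNumerics: the function names `parse_dbs` and `interpret` are invented for this example. The sketch assumes that each tag is followed by exactly one value line, and that a [SAVELOGFILE] value of 0 means that log files are not saved (the guide only states that this line indicates whether they are saved).

```python
EXAMPLE = r"""[DIR]
C:\Program Files\BioNumerics\data\DemoBase
[BACKCOL]
150 171 172
[SAVELOGFILE]
0
"""


def parse_dbs(text):
    """Return a dict mapping each [TAG] to the line that follows it."""
    settings = {}
    tag = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            tag = line[1:-1]
            settings[tag] = ""
        elif tag is not None:
            settings[tag] = line
            tag = None  # only the first line after a tag is its value
    return settings


def interpret(settings):
    """Convert the raw strings into typed values."""
    red, green, blue = (int(part) for part in settings["BACKCOL"].split())
    return {
        "directory": settings["DIR"],
        "background_rgb": (red, green, blue),
        # Assumption: "0" means log files are not saved.
        "save_log_file": settings["SAVELOGFILE"] != "0",
    }


print(interpret(parse_dbs(EXAMPLE)))
```

The parser works on text rather than on a file path, so the caller decides how the .DBS file is opened and decoded.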
Under normal circumstances, BioNumerics keeps track
of this database descriptor file. However, in case a
database is moved from one computer to another, it may
be useful to edit the file and enter the correct path. The
correct path for a database can also be entered from the
Startup program by pressing <Settings> and <Change
directory>.
In case a database has been physically removed (or
moved) from a computer, the *Database*.DBS file may
still be present in the home directory, which causes the
BioNumerics Startup program to list the database. When
attempts are made to open or edit such a removed
database, BioNumerics will produce an error. The only
remedy is to delete the *Database*.DBS file.
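Before deleting descriptor files by hand, it can help to list which ones no longer point to an existing directory. The Python sketch below is one way to do this and is not a BioNumerics feature: the function names `read_dir_tag` and `find_stale_descriptors` are hypothetical. It only reports candidates and deletes nothing, because a database on a network computer that is temporarily unreachable would look the same as a removed one.

```python
from pathlib import Path


def read_dir_tag(dbs_path):
    """Return the line following [DIR] in a descriptor file, or None."""
    lines = Path(dbs_path).read_text(errors="replace").splitlines()
    for index, line in enumerate(lines[:-1]):
        if line.strip() == "[DIR]":
            return lines[index + 1].strip()
    return None


def find_stale_descriptors(home_dir):
    """List .DBS files in home_dir whose [DIR] path is not an existing directory."""
    stale = []
    for path in sorted(Path(home_dir).iterdir()):
        if not path.is_file() or path.suffix.upper() != ".DBS":
            continue
        target = read_dir_tag(path)
        if target is None or not Path(target).is_dir():
            stale.append(path.name)
    return stale
```

A typical call would be `find_stale_descriptors(home)`, where `home` is the home directory set in the Startup program. Each file name returned should be checked before the corresponding .DBS file is removed.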
2. About this guide
2.1 Conventions
In the sections that follow, all menu commands and
button texts are typed in bold-italic. Submenus are
separated from parent menus by a “greater than” sign
(>). Button text is always given between < and > signs.
Examples:
The following menu command will be indicated as Edit
> Order entries by database field.
The following buttons are indicated as Consider absent
values as zero (check box), <OK>, <Cancel>, <Apply>
(disabled).
Each window and dialog box described in the guide will
be given a name. This name is shown in italic, and
usually corresponds to the name in the caption of the
window or dialog box. For example, the dialog box
above will be called the Character type settings dialog
box.
2.2 Floating menus
In almost every window in the BioNumerics Analyze
program, the use of place-specific “floating menus” is
supported. For example, if you right-click (clicking the
right mouse button) on a database entry, a floating
menu pops up, showing you all the possible menu
commands that apply to the selected entry (see Figure
2-1). Right-clicking in the Comparison panel shows a
menu with all commands related to comparisons: Create
new comparison, Load comparison, and Delete
comparison.
Figure 2-1. Floating menu appearing after clicking
right on a database entry.
The floating menus make the use of BioNumerics easier
and more intuitive for beginners, and much faster for
experienced users. In describing menu commands in
this guide, we will not usually mention the
corresponding floating menu command. It is up to the
user to try right-clicking in all window panels in order
to find out which is more convenient in every specific
case: calling the command from the window’s menu or
toolbar button or from the place-specific floating menu.
3. Installing the software
3.1 The Setup program
The BioNumerics software is delivered on CD-ROM. If
you insert the CD-ROM in the drive, the Installation
program will automatically load.
NOTE: The Installation program will automatically
load if the Auto insert notification of the CD-ROM
drive is enabled. You can change this setting in the
Control Panel, System, Device Manager, CD-ROM,
Properties, Settings, where you check the Auto insert
notification option.
On the installation screen, click Install BioNumerics. If
this is the first installation of BioNumerics, you should
allow the program to install the demo database. The
software is installed in a subdirectory BioNumerics of
the Program files directory. Upon completion, the
installation program creates a shortcut to the
BioNumerics Startup program on the desktop.
Insert the protection key (dongle) in the parallel port of
the computer. BioNumerics is now ready for use.
Chapter 4. deals with the installation and features of the
BioNumerics network software.
3.2 Example database
One database, DemoBase, is preinstalled with the
software, and this database will serve as a tutorial and as
an example in this guide. This database contains
experimental data on some fictitious bacterial genera.
The database contains the following experiment types:
•Fingerprint Types:
RFLP: Two different RFLP techniques, called RFLP1 and
RFLP2, resulting in two patterns for each bacterial
strain.
•Character Types:
FAME: Fatty Acid Methyl Esters (FAME) profiles
obtained on a Hewlett Packard 5890A gas-liquid
chromatography instrument. This is a typical example of
an open data set: the number of fatty acids found
depends on the group of entries analyzed. If more
entries are added, more fatty acids will probably be
found. Furthermore, FAME profiles are an example of a
continuous Character Type: the percentage occurrence of
a fatty acid in a bacterium can have any real value
between zero and 100%.
PhenoTest: This is a fictitious phenotypic test assay that
reveals the metabolic activity or enzyme activities of
bacteria on 19 different compounds. The first cup of the
test is a blank control. This is an example of a closed data
set: the 20 characters are well-defined, and regardless of
the number of entries examined, the number of
characters in the experiment will always remain 20. Real
examples of such types of assays are Biolog microplates,
API test galleries, Vitek, etc. They can be interpreted in
two ways. One can read the reactions by eye and score
them as positive or negative; in this case the Character
Type is binary. If the microplates are read automatically
using a microplate reader, the reactions in the cups may
have any real value between an OD of zero and 2.5 to
3.0, which again is a continuous Character Type. In the
example database, the reactions are scored as continuous
characters.
In addition to the binary and continuous Character
Types, one can also distinguish the semi-quantitative
Character Types. These are tests that can have a number of
discrete values, e.g. 0, 1, 2, 3, 4, or 5. In practice, a
number of continuous Character Types are interpreted
as multistate characters for convenience. Examples are
API galleries (BioMérieux SA, La Balme-les-Grottes,
France) that can be read using a color scale ranging from
0 to 5.
•Sequence Types:
16S rDNA: For all of the strains, and a number of
additional strains, the nearly complete 16S ribosomal
RNA gene has been sequenced. The sequences are
approximately 1500 bases long, but not all of them are
sequenced completely.
•Matrix Types:
A partial homology matrix based upon hybridization of
total genomic DNA has been generated for the genera.
A separate demo database, Demo2D, is available for
exploring the 2D module of BioNumerics (chapter 23.).
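As a rough illustration of the binary, continuous, and semi-quantitative Character Types described in this section, the following Python sketch converts a continuous reading into a binary or a multistate score. The function names, the 0.5 cut-off, and the 0 to 3.0 reading range are assumptions chosen for the example; they are not BioNumerics settings or functions.

```python
def to_binary(value, threshold=0.5):
    """Score a continuous reading as positive (1) or negative (0).

    The threshold is a hypothetical cut-off, not a BioNumerics default.
    """
    return 1 if value >= threshold else 0


def to_multistate(value, max_value=3.0, states=6):
    """Map a continuous reading onto the discrete states 0..states-1.

    Assumes readings lie between 0 and max_value; values outside that
    range are clamped to the nearest valid state.
    """
    if max_value <= 0 or states < 2:
        raise ValueError("max_value must be positive and states >= 2")
    clamped = min(max(value, 0.0), max_value)
    # Scale to 0..states-1 and round to the nearest state.
    return int(round(clamped / max_value * (states - 1)))


# Hypothetical optical densities for four cups of a closed test panel.
readings = [0.05, 0.40, 1.80, 2.90]
binary_scores = [to_binary(r) for r in readings]
multistate_scores = [to_multistate(r) for r in readings]
```

The same list of readings thus yields either a binary or a six-state character set, which is the choice the text describes for microplate-type assays.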
4. Installing the BioNumerics Network Software
4.1 Introduction
The BioNumerics network software is compatible with
any TCP/IP supporting network in combination with
Windows 2000, Windows NT 4.0 and Windows XP. The
communication is based on TCP/IP sockets provided by
Windows. The server may be any Windows 2000, XP, or
NT 4.0 computer in the network, and the clients, running
BioNumerics, are all computers with the same Domain
Name, including the server computer. The network
software even allows licenses to be granted to physically
distant locations via Dial-up connections, provided that
the domain name for such distant clients is the same.
The system consists of three components: the security
driver program, the security key (dongle) and the client
software.
The security key is a hardware device (dongle). It
attaches to the parallel port (or USB port) of a computer
that is part of the network. This computer will be the
License server.
The security driver is a program, NETKEY.EXE, that is
available on the server computer. This program
manages multiple licensing over the network. It is
permanently running as a Windows Service on the server
computer in the network, i.e. where the security key is
attached.
The client software is a BioNumerics software version
that contains the routines needed to register with the
security server. This can be installed on any computer
connected to the network, but only a restricted number
of computers, the license limit, can run the software at the
same time.
NOTE: In a TCP/IP network with Internet access, each
computer has its own name in addition to its IP
address. These computer names must be valid and
registered names for all client computers, since the
BioNumerics network software uses these names to
recognize the client computers. If a Name Server is
used, the names of the client computers must be
validly registered in the Name Server of the
network, otherwise, license granting will not be
possible!
4.2 Setup
The License server computer has the security key inserted
in the LPT1 port and runs the Netkey server program,
which manages the network licenses. All computers
connected to the network can have the BioNumerics
application software installed, but only the number
allowed by the license limit is able to run the software
simultaneously. If the license limit is reached, a new
license becomes free whenever the application is closed
on one computer.
First, identify a suitable License server computer. The
server should be a stable computer in terms of hardware
and software configuration, that is permanently
running and available over the network to other
computers. A computer running Windows 2000/NT or
XP Professional is to be preferred, only for reasons of
stability of the operating system.
Once the License server is chosen, install BioNumerics on
the License server and on one or more client computers
that will run BioNumerics. When installing BioNumerics
on the server, you should check the option Install
Netkey server program in the startup wizard. Next,
consider installing on just one client computer first, and
take note of all steps needed to configure the network
software. Do not forget to plug the network security key
into the parallel port of the License server computer!
After installation on the server, the following programs
are installed under Program files > BioNumerics:
•BioNumerics
•Netkey
The Netkey Server program manages the network
licenses. This program runs as a Service that is
automatically started on the License server, and should
never be halted as long as licenses are in use.
Before the network software can be put into use
successfully, there are some settings that will need to be
made or changed. If changes to the network settings of
the computers are needed, we recommend having
these changes made by the system administrator or
computer expert of your department or institution!
•TCP/IP
Each computer that will be used in the BioNumerics
network configuration needs the TCP/IP protocol
installed on the network. The TCP/IP protocol is
provided with the installation package of Windows.
•IP address and DNS host name
Furthermore, each computer in the BioNumerics
network configuration needs a valid and unique IP
address, to be specified in the TCP/IP properties. The IP
address may be a permanent address assigned to the
computer, or an IP address assigned by the DHCP
server (Dynamic Host Configuration Protocol). It also
must have a valid and unique DNS host name, which
should not include spaces or periods. The DNS host
name can be found by opening Network in the Control
Panel, selecting TCP/IP and clicking Properties, under
IP address and DNS Configuration. If permanent, the IP
address can be found in the same window. Both can
also be seen in the BioNumerics Startup program by
clicking the small Network settings button to the right of
the Homedir button. In the Netkey settings box that appears, click Info
to show the computer name, domain name and the IP
address. Note down the DNS host names of the client
computers and the server computer, and if permanent,
also note down the IP addresses.
•Initial settings
On each computer, including the server computer,
BioNumerics has created a settings file NETKEY.INI,
which needs to be completed for the network. Run the
BioNumerics Startup program on the server, and click
the small Network settings button to the right of
the Homedir button. Under Server computer name, fill in
the DNS host name without the domain name. For
example, if a computer is known as
computer.dept.univ.ext, you should fill in
computer without the domain name dept.univ.ext.
You should not change the Port number unless there is a
conflict with other software that uses the same port
number.
You can also edit NETKEY.INI in Notepad by
double-clicking on the file name in the Windows Explorer.
When opened in Notepad, the contents of the file look as
follows:
SERVERNAME=
SERVERPORT=2350
After SERVERNAME= , enter the DNS host name of the
server computer, and save the file. This change must be
made on each client computer and on the server
computer, in order to allow BioNumerics to find the
server in the network.
The SERVERPORT is the TCP/IP port that is used by
the Netkey server and the clients to communicate with
each other, and thus should be the same on all
computers. In normal circumstances, you can leave
SERVERPORT unchanged. However, in case there is a
firewall between the Netkey server and the clients, you
will have to open three TCP/IP ports: 2350, 2351 and
2352. Any other three successive port numbers can be
specified, as long as the first port number is correctly
indicated in the SERVERPORT line, both on the Netkey
server computer and on the clients.
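The NETKEY.INI settings described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, the file is assumed to contain exactly the two KEY=VALUE lines shown above, and none of this is a tool supplied with BioNumerics.

```python
def short_host_name(fqdn):
    """Return the host part of a DNS name, e.g. 'computer' for
    'computer.dept.univ.ext'."""
    return fqdn.split(".", 1)[0]


def render_netkey_ini(server_fqdn, first_port=2350):
    """Build the two-line NETKEY.INI text (assumed layout)."""
    return "SERVERNAME={}\nSERVERPORT={}\n".format(
        short_host_name(server_fqdn), first_port
    )


def parse_netkey_ini(text):
    """Parse KEY=VALUE lines into a dict; other lines are ignored."""
    settings = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings


def firewall_ports(first_port):
    """The three successive TCP ports to open when a firewall sits
    between the Netkey server and the clients."""
    return [first_port, first_port + 1, first_port + 2]


ini_text = render_netkey_ini("computer.dept.univ.ext")
settings = parse_netkey_ini(ini_text)
ports = firewall_ports(int(settings["SERVERPORT"]))
```

Because the same text must be present on the server and on every client, generating it once and copying it is less error-prone than typing it on each computer.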
Start the Netkey configuration program on the server
computer. The following window appears (Figure 4-1.).
Figure 4-1. The Netkey configuration program, initial view.
Initially, both panels 'Registered computers' and
'Current connected users' are empty. The upper panel
lists the computers that are granted access to the
BioNumerics network, and the lower panel lists the
computers where the software is currently in use. Every
computer that can get access to the BioNumerics
network must be specified in the server program, by
means of its IP address and DNS host name. In large
institutions, this feature allows precise control over
which computers and users are allowed to use the
BioNumerics software.
Start the Netkey service by clicking the button <Start
service>. Since the service is not installed yet, you will
be asked to confirm to install it. When this is finished, a
message "The NetKey service has been successfully
installed" appears.
After clicking <OK> to this message, another message
tells that the "Service has been started". The Netkey
service is now ready to distribute licenses.
•Configuring a client
On each client computer, configure the file NETKEY.INI
in the same way as described above.
•Defining a client
On the server computer, add the client computer to the
list of BioNumerics clients as follows: Click <Add>.
Enter the DNS host name of the client computer in the
dialog box. In non-DHCP configurations (i.e. in case of
permanent IP addresses), also enter the IP address. Press
<OK>. The client is now shown in the upper panel, with
its name only (DHCP) or with its name and IP address
(permanent). From this point on, the client has access to
the BioNumerics network software.
NOTE: If you do not wish to define specific computers
to have permission to obtain a license for BioNumerics,
you can enter an asterisk (*) in place of the computer
name, without specifying an IP address. When doing
so, every computer in the LAN will be able to obtain a
BioNumerics license.
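The registration rules above (name only for DHCP clients, name plus IP address for permanent addresses, and an asterisk for any computer) can be summarized in a short Python sketch. The function and its data layout are hypothetical, and the case-insensitive name comparison is an assumption; the actual Netkey server logic is not documented here.

```python
def is_registered(registered, host_name, ip_address=None):
    """Return True if a client may request a license.

    `registered` is a list of (name, ip) tuples. ip is None for DHCP
    clients, and the name "*" admits every computer in the LAN.
    """
    for name, ip in registered:
        if name == "*":
            return True
        # Assumption: host names are compared without regard to case.
        if name.lower() != host_name.lower():
            continue
        if ip is None or ip == ip_address:
            return True
    return False


# Hypothetical list of registered computers.
clients = [("lab1", None), ("lab2", "192.168.0.12")]
```

With this list, lab1 is accepted with any address, lab2 only with its registered address, and any other computer is refused unless an asterisk entry is added.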
•Running BioNumerics
On the client computer, start BioNumerics and press the
Analyze button. The program should load if the
network is configured correctly and if the server name,
the IP addresses, and the DNS host names are filled in
correctly.
On the server computer, the client that uses
BioNumerics is now listed in the lower panel, showing
its IP address, DNS host name, total usage time
(elapsed) and idle time (4.3) (Figure 4-2.).
More client computers can be added to the network by
simply adding the IP address and the computer name as
described in the previous paragraph.
Figure 4-2. The Netkey configuration program, listing all computers that are granted access (top)
and licenses in use (bottom).
4.3 Advanced features of the Netkey server program
The Netkey server program is a Windows NT/2000
Service. As such, it can be seen in the list of installed
Services. The startup settings, i.e. Manual, Automatic or
Disabled, can be specified from the Windows Services
administration tool (Control Panel > Administrative
tools > Services). If you close the Netkey configuration
program, the service will not be halted. Even when the
current user logs off, the service remains running in the
background. To effectively shut down the service, click
the <Stop service> button in the Netkey configuration
window. If licenses are still in use, the program will
produce a warning message, asking whether you want
to continue.
•License granting
Each computer in the network can be granted or refused
access to the application software by the server
program. To refuse access to a particular computer,
select it in the upper panel, and refuse its access with
<Change access>. The blue screen icon changes into a
red screen. To grant the access again, click <Change
access> a second time. To permanently remove a
computer from the users list, select the computer in the
upper panel, and click <Delete>.
•Messaging
The License server can send messages to any or all
connected clients, for example in case the server
computer will be shut down or if a client will be
disconnected. Send a message to one user by selecting
the user in the lower panel and clicking <Send message>. Enter a
message string and press <OK>. The user will receive
the message in a dialog box. Send a message to all users
with <Send message to all users>. Enter a message
string and press <OK>. All active users will receive the
message in a dialog box.
•Usage statistics
The Netkey server program records every usage of each
client. Graphical statistics can be displayed about the
history of the usage over longer periods, and the relative
usage of each client computer can be shown for any time
interval. To view the usage history of the BioNumerics
network version, click <Statistics>. The panel shows a
detailed view of the number of computers that have
used the software on a time scale divided in hours. You
can scroll in this panel to look back in time. The
license limit is shown as a red line; computers in a
waiting list are shown in red. The relative usage of each
client computer can be shown by clicking the <Relative
usage> tab. Enter the time period (from-to) in Days /
Months / Years. The result is a circle diagram with the
percentage usage time for each computer shown.
•Disconnect users
The server can disconnect a client if needed. Select a user
in the lower panel, and disconnect it (withdraw its
license) with <Disconnect>.
•Time-out
The idle time of each user is recorded by the Netkey
server program. A time-out for inactive licenses can be
specified: in case there is a waiting list, a client for which
the idle time exceeds the time-out value will lose its
license in favor of the first in the waiting list. Specify a
maximum idle time with <Settings>, and enter the
minutes of idle time. Note that a user who has exceeded
the idle time limit will not be disconnected by the
server as long as there is no waiting list.
•Maximum usage limit
The usage time by each client is recorded by the Netkey
server program; it is the total connection time of the
current session. A maximum usage time can be
specified: in case there is a waiting list, a client for which
the usage time exceeds the maximum usage time will
lose its license in favor of the first in the waiting list.
Specify a maximum usage limit with <Settings>, and
enter the minutes of usage time. Note that a user who
has exceeded the maximum usage time limit will not
be disconnected by the License server as long as there
is no waiting list.
4.4 Features of the Client program (BioNumerics)
•Waiting lists
In case the maximum license number is exceeded, the
server program manages a waiting list. The client
receives a message with its number in the waiting
queue, and the BioNumerics software pops up as soon
as the client’s license becomes available.
The user can request an overview of the computers
currently using a BioNumerics license by clicking the
Network settings button in the BioNumerics
Startup program, and then clicking <Status>. It shows
for each connected computer the IP address, the
computer name, the total usage time and the idle time.
•Disconnection by server or license loss
If the client is disconnected by the server or loses its
license, e.g. due to idle time or maximum usage limit, a
warning box flashes that you should save any unsaved
data and quit the program immediately. BioNumerics
tries four times again to negotiate its license with
intervals of 15 seconds. After the fourth time (1’ 15’’ in
total), the program halts automatically.
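The interplay of the license limit, the waiting list, the time-out, and the maximum usage limit described in 4.3 and 4.4 can be sketched as a small Python model. The class and method names are hypothetical and times are plain numbers of minutes; this is a simplified reading of the rules in the text, not the Netkey implementation.

```python
from collections import deque


class LicensePool:
    def __init__(self, limit, max_idle=None, max_usage=None):
        self.limit = limit
        self.max_idle = max_idle      # minutes, None = no time-out
        self.max_usage = max_usage    # minutes, None = no usage limit
        self.active = {}              # client -> [start, last_activity]
        self.waiting = deque()

    def request(self, client, now):
        """Grant a license (returns 0) or queue the client and return
        its 1-based position in the waiting list."""
        if len(self.active) < self.limit:
            self.active[client] = [now, now]
            return 0
        self.waiting.append(client)
        return len(self.waiting)

    def touch(self, client, now):
        """Record user activity, resetting the client's idle time."""
        self.active[client][1] = now

    def release(self, client, now):
        """Free a license and pass it to the first waiting client."""
        del self.active[client]
        if self.waiting:
            self.active[self.waiting.popleft()] = [now, now]

    def enforce(self, now):
        """Withdraw licenses that exceed a limit, but only while
        somebody is waiting. Returns the clients that lost a license."""
        evicted = []
        for client, (start, last) in list(self.active.items()):
            if not self.waiting:
                break
            too_idle = self.max_idle is not None and now - last > self.max_idle
            too_long = self.max_usage is not None and now - start > self.max_usage
            if too_idle or too_long:
                self.release(client, now)
                evicted.append(client)
        return evicted


pool = LicensePool(limit=1, max_idle=30)
first = pool.request("pc1", now=0)
without_queue = pool.enforce(now=60)   # pc1 is idle, but nobody waits
position = pool.request("pc2", now=60)
with_queue = pool.enforce(now=61)      # pc2 waits, so pc1 loses its license
```

The key design point taken from the text is that exceeding a limit alone never costs a client its license; a license is only withdrawn when another client is waiting for it.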
5. Starting and setting up BioNumerics
5.1 The programs
The BioNumerics software consists of 2 programs:
The Startup program. This program will allow you to
run the BioNumerics main application, to set-up and
select databases and to customize various settings
(colours, directories, information fields etc.) for each
database.
The Main program. This is the real analysis program,
including database functions, experiment processing,
analysis, and reporting functions.
5.2 The BioNumerics Startup program
5.2.1 Double-click the “BioNumerics” icon on the
desktop to run the Startup program. This program (see
Figure 5-1.) shows the Intro screen with version number
and license site information (1). It allows you to run the
BioNumerics server application Analysis, to Inspect the
databases with the Diagnostics program (4 and 5
respectively), to create New databases (6), to run import
or Filter programs (7), and customize various Settings
(8) such as colors, directories, order of information fields
and experiments, for the selected database (2). The
Home directory for the databases can be specified (2).
5.2.2 Use the <Exit> button (9) when you are finished
running the BioNumerics applications.
Figure 5-1. The BioNumerics Startup program (callouts 1 to 9).
The Diagnostics program analyzes system features of
your computer and detects and reports any errors in the
selected database. It checks the BioNumerics drives and
directories of the selected database and inspects the
validity of the files contained therein.
5.3 Creating a database
In order to facilitate the use of BioNumerics in different
research projects, it is possible to set up Databases. The
principles of a database are explained in 1.3. The
BioNumerics Startup program will look for all databases
in one Home directory, specified by the user. Note that, in
Windows NT, Windows 2000, and Windows XP, each
Windows user may specify a different home directory.
The BioNumerics home directory is saved in the
system registry of the user.
If you want to change the current home directory follow
the steps below:
5.3.1 Press the button <Homedir> on the startup screen,
and select the desired directory. You can also specify a
directory on a network drive, on condition that this
drive is permanently available.
5.3.2 Press <OK> to select the new home directory. The
program updates the list of available databases in the
new directory.
BioNumerics offers two alternative database solutions to
store its databases: the program’s own built-in database
(=local database) or an external ODBC compatible
database engine. The latter solution is called a Connected
Database. Connected databases are particularly useful
for environments where data is already stored in a
relational database, or where vast amounts of data are
generated. It also allows more professional database
setups to be achieved, for example with different
access/permission settings for different users. In
addition, when the connected database supports multi-user access, BioNumerics becomes a true multi-user
application. Currently supported are Microsoft Access,
SQL Server, and Oracle. Note that Connected Databases
are only available with the Database Sharing Tools
module.
However, for most purposes and single-user access, the
local BioNumerics database is quick, powerful, and easy
to set up and maintain. We will describe the use of local
databases in this chapter. The use of connected
databases is discussed in detail in chapter 28.
Create a new database as follows:
5.3.3 Press the <New> button to enter the New database
wizard.
5.3.4 Enter a name for the database, e.g. Example, and
press <Next>.
Notes:
The program automatically creates a subdirectory of the
current home directory for the database directories. If
you want to change this (not recommended), press
<Change>. This option allows you to select any
directory from any permanent drive. It is recommended
to create a new empty directory before you choose it as
the database directory.
If you do not want the program to automatically create
subdirectories, click No to this question (not
recommended). In that case, you will have to create the
subdirectories manually (see Figure 1-3.).
5.3.5 Press <Next> again.
5.3.6 You are now asked whether or not you want to
create log files. If you enable BioNumerics to create log
files, every change made to a database component
(entry, experiment etc.) is recorded to the log file with
indication of the kind, the date, and the time of change.
5.3.7 Press <Finish> to complete the setup of the new
database.
The BioNumerics database is now created. A new dialog
box pops up, asking if you want to store the data as a
Local database or as a Connected database.
5.3.8 Select Local database and press <OK>.
5.4 Settings of a database
5.4.1 In the Startup program, select the new database
and press the <Settings> button. The Database settings
dialog box appears (Figure 5-2.). Here you can change the
Database directory, you can Enable the use of log files
(5.3.6), you can remove the current database with Delete
database, and you can modify the window background
colors for the database using red, green and blue slide
bars. The checkbox Use default Windows colors makes
the program use the standard Windows colors for the
window panels and the highlight bars. With <ID code>
you can install an ID code to protect all important
settings in the database.
5.4.2 The database tab allows the order of the database
fields to be changed, and four experiment types tabs
allow the order of the experiments to be changed. In the
new database, these tabs are initially empty since new
database fields or experiments are not defined yet.
5.4.3 Press <OK> or <Cancel> to exit the Database
settings dialog box.
Figure 5-2. The Database settings dialog box in the
Startup program.
5.5 Database protection tools
In order to protect a database against accidental data
loss, it is possible to lock the important settings and data
files in the database. The following files can be locked:
•The settings files of the experiment types: as long as
this file is locked, the settings for the experiment type
cannot be changed.
•The data files for the experiment types: data of
existing entries cannot be changed, however, new
entries can be added in a new experiment data file.
•Libraries for identification: nothing can be changed in
a locked library, but it still can be used for
identification.
Each file can be locked and unlocked separately, so that
it is possible to lock and protect "final" files and leave
other files open for additional input.
5.5.1 The setting files (see chapter 7.), data files, and
libraries in the database can be locked using the File >
Lock command in the file's edit window. Once the
settings are locked they cannot be changed anymore,
until they are unlocked again by executing the
command File > Lock.
A locked file is shown with a small key icon left from the
filename in the Files panel of the BioNumerics Main
window, and a key icon also appears in the left upper
corner of the file's edit window.
To protect a database against modification by others or
misuse, BioNumerics allows an ID code to be set for a
database. Once an ID code is set, the database settings
can only be changed after entering the ID code. In
addition, files can only be locked or unlocked
after entering the ID code.
5.5.2 In order to set an ID code for the database, run the
startup program and press <Settings>.
5.5.3 In the database settings (Figure 5-2.), press <ID
code> and enter the ID code. Any string of characters is
allowed. The program will ask you to confirm this by
entering the ID code a second time.
5.5.4 If you want to remove an ID code, press <ID code>
in the database settings and leave the input box empty.
WARNING: if you forget the ID code, you will need a
special bypass code from Applied Maths.
5.6 Log files
In certified environments and laboratories where
conscientious recording of manipulations is important,
the log files in BioNumerics are a very useful tool. For
every BioNumerics session, the log files show the
Windows user who has last made the changes together
with the kind of changes and the date and hour.
Log files are recorded for the following files:
•The database: the log file lists any changes in names of
database fields, any entries that are added or deleted,
and keys of entries that are changed. It also reports if
new experiment types are created, if experiment types
have been renamed or removed.
•The settings files of the experiment types: for every
change made, the kind of change is indicated. All
settings are recorded in the log file, so that the user
may restore the previous settings based upon the log
file, if enabled.
•The data files for the experiment types: if data for
entries are changed, the log file lists from which
entries. It also mentions the creation of new
experiments, and the deletion of experiments.
•Libraries for identification: the log file keeps record of
any changes in library units and records the addition
and deletion of library units.
5.6.1 In order to enable the creation of log files for the
database, run the startup program and press <Settings>.
5.6.2 In the database settings (Figure 5-2.), check the
Enable log files checkbox.
5.6.3 In the data file windows, experiment file windows,
the Main window, or the Library window, select File >
View log file or the corresponding toolbar button to
display the log file. In a local
database (as opposed to a Connected Database; see
chapter 1.), BioNumerics creates a temporary file,
DBASE.LOG or <EXPERIMENTNAME>.LG. For data
files, it creates a log file <DATAFILE>.LOG. The log
files are loaded in BioNumerics’ Event log viewer
(Figure 5-3.).
Figure 5-3. Event log viewer for a database component.
5.6.4 You can clear a log file with the command <Clear>
or copy it to the clipboard with <Copy to clipboard>.
From the clipboard it can be pasted in other applications
with a Paste command. The text is formatted as RTF
(Rich Text Format) which enables the formatting to be
kept in other software that supports RTF.
The items Database and Component are only applicable
to Connected databases (see chapter 28.).
6. Database functions
6.1 The BioNumerics Main window
6.1.1 In the Startup program, select the database Demobase,
and click on the <Analyze> button.
The Main window (Figure 6-1.) consists of a menu, a
toolbar for quick access to the most important functions,
a status bar, and the following six panels:
•The Database panel, listing all the available entries in
the database, with their information fields and their
unique keys (see 1.2). A BioNumerics database can
contain up to 150,000 entries.
•The Experiment types panel, showing the different
experiment types, and the experiments that are
defined under each type.
•The Experiment presence panel, which for each database
entry shows whether an experiment is available
(green dot) or not. Clicking on a green dot causes the
Experiment card for that experiment to be popped up
(see 8.2).
•The Experiment files panel, showing the available data
files for the selected experiment.
•The Comparisons panel, listing all comparisons that are
saved.
•The Libraries panel, which shows the available
identification libraries.
6.2 Adding entries to the database
In the database Demobase, there are already entries
defined. In most further exercises in this guide, we will
work on our own database Example. Therefore we will
start BioNumerics again with this new database:
6.2.1 Select File > Exit to quit the Analyze program.
6.2.2 Back in the Startup program, select the database
Example.
6.2.3 Press <Analyze> to start the Analyze program with
the empty Example database.
Adding entries to the database can happen in two ways:
You can add one or more entries directly to the
database. Initially, these entries will be empty and no
experiments will be linked to them. When you import
experiment data later, you can link the data to the
entries.
When you import a file of experiments, the program will
ask you whether you want it to automatically create a
corresponding database entry for each experiment.
We will now create a few database entries without
importing experiments.
6.2.4 Select Database > Add new entries or the
corresponding button in the toolbar.
A dialog box appears, asking for the number of new
entries to create, and the database where they should be
created. When there are Connected Databases associated
with the database (see chapter 28.), there is a possibility
to add the new entries either in the local database or in
the Connected Database.
The input field in the bottom of the window allows a
key to be entered by the user. This input field is only
accessible when one single entry is added. As soon as
the number of entries is specified to be more than one,
the field is disabled.
6.2.5 Enter the number of entries you want to create, e.g.
3, and press <OK>.
The database now lists three entries with a unique key
automatically assigned by the software. Usually, one
will not want to change this entry key, but in special
cases, it may be useful to change or correct the key
manually. This can be done as follows.
6.2.6 Select the entry and Database > Change entry key.
6.2.7 Change the entry key in the input box, e.g. Entry 1,
and press <OK>.
The key is a critical identifier of the database entries, and
if you already have unique labels that identify your
organisms under study, you can use these labels as keys
in BioNumerics. In the latter case, the key can be effectively
used as a database field. As we will explain later, the key
is also an important component in automatically linking
experiments to existing database entries.
NOTE: Remember the use of floating menus as
described in 2.2: right-clicking in the database panel of
the Main window would directly pop up the menu with
Add new entries and Change entry key (if you click
on an entry).
6.2.8 To remove an entry from the database, select one of
the entries, e.g. the third one, and Database > Remove
entry or the corresponding button in the toolbar. The
program asks to confirm this action, and will warn you
if there is any experiment information linked to the entry.
6.2.9 To remove all selected entries at once, choose
Database > Remove all selected entries.
WARNING: There is no undo function for this action
and removed entries are irrevocably lost, together with
any experiment information linked to them!
6.2.10 To remove all entries that have no experiment
linked to them, you can select Database > Remove
unlinked entries. In the case of our example database
this would result in removal of all entries, since none has
an experiment linked yet.
Figure 6-1. The Main window of the BioNumerics Analyze program, showing the menu, toolbar, database field names, status bar, and the Database, Experiment presence, Experiment types, Experiment files, Comparisons, and Libraries panels.
6.3 Creating database information fields
Besides the key, which is the primary information field
for each database entry, the user can define a number of
database information fields. An information field may
contain up to 80 characters, and a maximum of 90 fields
can be defined for a database.
6.3.1 Select Database > Add new information field.
6.3.2 Enter the name of the database information field,
for example Genus, and press <OK>.
6.3.3 Select Database > Add new information field again
to define the second field, Species.
6.3.4 Then, select Database > Add new information field
again to define a third field, Subspecies.
6.3.5 Finally, select Database > Add new information
field again to define a field Strain no.
The menu functions Database > Rename information
field and Database > Remove information field can be
used to rename and remove an information field,
respectively.
6.4 Entering information fields
6.4.1 By double-clicking, or pressing Enter on a database
entry, the Entry edit window appears (Figure 6-2.).
Right-clicking on the entry, and selecting Open entry also
works.
The upper left panel shows the information fields and
the upper right panel shows the available experiments
for the entry. The bottom panel allows attachments to be
added and viewed for the entry (see 6.5). The Entry edit
window can be rescaled to see more and/or longer
information fields.
Figure 6-2. The Entry edit window.
6.4.2 Enter some information in each of the fields (see
Figure 6-2.).
6.4.3 If a number of entries have mostly the same fields,
you can copy the complete entry information to the
clipboard using the F7 key or the corresponding button.
6.4.4 To clear the complete information of the entry,
press the corresponding button.
6.4.5 To paste the information from the clipboard, press
the F8 key or the corresponding button.
If some of the information fields are the same as entered
for previous entries (for example genus and species
name), you can drop down a history list for each
information field. The history lists can contain up to 10
previously entered strings for the information field.
Using the history lists is recommended (i) to save time
and work and (ii) to avoid typographical errors.
6.4.6 Drop down a history list by clicking the button to
the right of the information field. A floating menu
appears from which you can select an information
string.
A further button is related to ODBC communication
with an external database (see chapter 27.).
6.4.7 Using the selection button, you can select or
unselect the opened entry in the database (see 9.2), for
the construction of comparisons. When the entry is
selected, the button changes its appearance.
6.4.8 Press the Enter key or <OK> to close the Entry edit
window and store the information, or press the Escape
key or <Cancel> to close the window without changing
any information.
In order to quickly enter the same information for many
entries, the use of the keyboard is recommended: use the
Arrow Up and Down keys to move through the entries
in the database, use the Enter key to edit an entry, use
the F7 and F8 keys to copy and paste information, and
use the Enter key again to close the Entry edit window.
6.5 Attaching files to database entries
Besides its information fields and the experiments
linked to it, a database entry can also have files attached
to it. Usually the attachment is a link to a file, except for
text attachments, which are physically contained in the
database. In addition to the attachment itself,
BioNumerics also allows a description to be entered for
the attachment. The following data types are supported
as attachment:
•Text: Plain ASCII text attachments of unlimited
length. BioNumerics contains its own editor (similar
to Notepad) to paste or type text strings.
•Bitmap image: images of the following bitmap types
are supported: TIFF, JPEG (JPG), GIF, BMP, PNG, and
WMF. BioNumerics contains its own viewer for image
attachments.
•HTML documents: HTML and XML documents can
be attached as well as URLs. BioNumerics contains its
own HTML viewer.
•Word® document: Documents in Microsoft® Word
format can be attached. The default editor or viewer
registered by your Windows system will be opened if
you want to edit or view the document.
•Excel® document: Documents in Microsoft® Excel
format can be attached. The default editor or viewer
registered by your Windows system will be opened if
you want to edit or view the document.
•PDF® document: Documents in Adobe® PDF
format can be attached. The default editor or viewer
registered by your Windows system will be opened if
you want to edit or view the document.
6.5.1 To create an attachment for an entry, open the
Entry edit window as described in 6.4. The Attachment
panel (Figure 6-2.) contains three buttons: to create a
new attachment, open (view) an attachment
and delete an attachment. The same commands are
available from the menu as Attachment > Add new,
Attachment > Open, and Attachment > Delete,
respectively.
6.5.2 Press the button for a new attachment to create a
new attachment. The Entry attachment dialog box
appears (Figure 6-3.).
Figure 6-3. The Entry attachment dialog box.
6.5.3 Under Data type, specify one of the supported data
types, as described earlier in this paragraph.
6.5.4 All data types except Text link to a file on the
computer or the network. You can enter a path and a file
name or use the <Browse> button to browse to a file of
the specified type. Text attachments are stored inside the
BioNumerics database.
6.5.5 A Description input field allows you to enter a
description line for the attachment. The description will
appear next to the attachment icon in the Entry edit
window (Figure 6-2.) and for text, bitmap, and HTML
type attachments, it will also appear in the viewer or
editor (text) window when the attachment is opened.
6.5.6 To open an attachment, double-click on the
attachment icon in the Entry edit window.
6.5.7 In case of a text attachment, a Text attachment editor
is opened (Figure 6-4.) where one can type or paste a text
document of unlimited length. The format should be
pure text; any formatting will be lost while pasting texts
from other editors. The editor contains a Save button, an
Undo (CTRL+Z) and Redo (CTRL+Y) button, as well as
a Cut (CTRL+X), Copy (CTRL+C) and Paste (CTRL+V)
button.
Figure 6-4. The Text attachment editor.
6.5.8 In case of a bitmap image (TIFF, JPEG [JPG], GIF,
BMP, PNG, and WMF), BioNumerics’ own viewer is
opened with the bitmap displayed. The window
contains a Zoom in and Zoom out button.
6.5.9 In case of HTML and XML attachments,
BioNumerics’ own browser is opened with the HTML or
XML file displayed. Note that an HTML document can
be a link to a website, in which case the browser will
display the website. The browser contains a Back button
to return to the previous page.
6.5.10 Word®, Excel® and PDF® attachments are
opened in the default programs registered by your
Windows system for these file types.
6.5.11 To edit the link of an attachment or its description
line, use Attachments > Edit in the menu of the Entry
editor. The Entry attachment dialog box appears as shown
in Figure 6-3.
Entries having one or more attachments linked are
marked with a small paperclip icon in the left index
column of the database panel in the Main window (Figure
6-5.).
Figure 6-5. Detail of database panel in Main
window, showing two entries with attachments.
6.6 Configuring the database layout
Entries in the database can be ordered alphabetically by
any of the information fields.
6.6.1 Click on one of the database field names in the
database panel header (see Figure 6-1.).
6.6.2 Select Edit > Arrange entries by field.
When two or more entries have identical strings in a
field used to rearrange the order, the existing order of
the entries is preserved. As such it is possible to
categorize entries according to fields that contain
information of different hierarchical rank, for example
genus and species. In this case, first arrange the entries
based upon the field with the lowest hierarchical rank,
i.e. species, and then upon the higher rank, i.e. genus.
When a field contains numerical values, which you want
to sort according to increasing number, use Edit >
Arrange entries by field (numerical). In case numbers are
combined numerical and alphabetical, for example entry
numbers [213, 126c, 126a, 126b], you can first arrange the
entries alphabetically (Edit > Arrange entries by field),
and then numerically using Edit > Arrange entries by
field (numerical). The result will be [126a, 126b, 126c,
213].
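The two-pass arrangement described above relies on the sort preserving the existing order of equal items, which is what a stable sort does. The following Python sketch illustrates the principle only; the function names are hypothetical, and the assumption that the numerical arrangement compares the leading number of each string is ours, not a statement about how BioNumerics is implemented.

```python
import re

def arrange_by_field(entries, field):
    """Alphabetical arrangement. sorted() is stable, so entries with
    identical strings keep their existing relative order."""
    return sorted(entries, key=lambda entry: entry[field])

def arrange_by_field_numerical(entries, field):
    """Numerical arrangement on the leading number of the field
    (assumption: any trailing letters are ignored)."""
    def leading_number(entry):
        match = re.match(r"\s*(\d+)", entry[field])
        return int(match.group(1)) if match else 0
    return sorted(entries, key=leading_number)

# Lowest hierarchical rank first (species), then the higher rank (genus).
taxa = [
    {"Genus": "B", "Species": "y"},
    {"Genus": "A", "Species": "z"},
    {"Genus": "B", "Species": "x"},
    {"Genus": "A", "Species": "w"},
]
by_genus_then_species = arrange_by_field(arrange_by_field(taxa, "Species"), "Genus")

# Mixed numbers: alphabetical pass first, numerical pass second.
numbers = [{"No": n} for n in ["213", "126c", "126a", "126b"]]
ordered = arrange_by_field_numerical(arrange_by_field(numbers, "No"), "No")
print([entry["No"] for entry in ordered])  # ['126a', '126b', '126c', '213']
```

Reversing the two passes would not give the same result: the last pass always determines the primary order, and earlier passes only break ties.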
The number of characters to be displayed for a database
field can be specified as follows.
6.6.3 Click on one of the database info fields in the
database panel header (see Figure 6-1.).
6.6.4 Select Edit > Set database field length.
6.6.5 Enter a number between 0 and 80, and press <OK>.
6.6.6 The width of each database field can be adjusted by
dragging the separator lines between the database field
names to the left or to the right.
For example, if the key is not informative for your
database, you can drag the separator line between the
key and the first information field as much to the left as
possible. If the genus name for the organisms is known
or mostly the same, you can abbreviate it to one
character and drag the separator between Genus and
Species as much to the left as possible to show one
character.
6.6.7 The width of the database panel as a whole can be
changed by dragging the separator line between the
database panel and the experiment presence panel to the
left or to the right.
6.6.8 It is possible to freeze one or more information
fields, so that they always remain visible to the left of the
scrollable area. For example, if you want to freeze the
Key field, select the field to the right of the Key in the field
header, and select Edit > Freeze left pane. This feature,
combined with the possibility to change the order of
information fields (see 5.4.2), makes it possible to freeze
any subset of fields. The feature applies to the
Comparison window as well (see 9.7).
6.6.9 Optionally, a grid can be shown, separating the
entries and information fields, as in spreadsheet
programs. The grid can be shown or hidden using the
Edit > Show grid command. The grid also extends to the
Experiment file windows (see chapter 7.) and the
Comparison window (see 9.7).
Settings 6.6.4 to 6.6.9 as well as all window sizes and
positions are stored when you exit the software and are
specific for each database.
7. Setting up experiments
In BioNumerics, experiments are divided into five classes:
Fingerprint Types, Character Types, Sequence Types, 2D Gel
Types, and Matrix Types.
The Fingerprint Types include any densitometric record
seen as a profile of peaks or bands. Examples are
electrophoresis patterns, gas chromatography or HPLC
profiles, spectrophotometric curves, etc. For example,
within the Fingerprint Types, you can create a Pulsed
Field Gel Electrophoresis (PFGE) experiment type with
specific settings such as reference marker, MW
regression, stain, band matching tolerance, similarity
coefficient, clustering method, etc. Fingerprint Types
can be derived from TIFF or bitmap files as well, which
are two-dimensional bitmaps. The condition is that one
must be able to translate the patterns into densitometric
curves.
With the Character Types, it is possible to define any
array of named characters, binary or continuous, with
fixed or undefined length. The main difference between
Character Types and electrophoresis types is that in the
Character Types, each character has a well-determined
name, whereas in the electrophoresis types, the bands,
peaks or densitometric values are unnamed (a molecular
size is NOT a well-determined name!). Examples of
Character Types are antibiotics resistance profiles, fatty
acid profiles (if the fatty acids are known), metabolic
assimilation or enzyme activity test panels such as API,
Biolog, and Vitek, etc. Single characters such as Gram
stain, length, etc. also fall within this category.
Within the Sequence Types, the user can enter
sequences of nucleic acids (DNA and RNA) and amino
acids. BioNumerics recognizes widely used sequence
file formats such as EMBL, GenBank, and Fasta with
import of user-selected header tags as information
fields, and optional storage of headers. Other sequence
formats can be imported easily.
The 2D Gel Types include any two-dimensional bitmap
image seen as a profile of spots or defined labelled
structures. Examples are 2D protein gel
electrophoresis patterns, 2D DNA electrophoresis
profiles, 2D thin layer chromatograms, or even images
from radioactively labelled cryosections or short half-life
radiotracers.
With the Matrix Types, it is possible to import external
similarity matrices, providing similarity between entries
revealed directly by the technique, or by other software.
These matrices can be linked to the database entries in
BioNumerics and they are used together with other
information to obtain classifications and identifications.
An example of a Matrix Type is a matrix of DNA
homology values. DNA homology between organisms
can only be expressed as pairwise similarity, not as
character data.
The user can create several experiments of the same
type. For example, one can create two different
Fingerprint Type experiments, to analyze PFGE gels
obtained with two different restriction enzymes. The
setup of the different experiment types is described
below. The 2D Gel Types are described separately in
chapter 23.
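The difference between Character Types (a named value per character) and Matrix Types (only a pairwise similarity per pair of entries) can be pictured with two small data structures. This is a conceptual Python sketch only; the class names, the sample antibiotic values and the choice of 100.0 as self-similarity are assumptions made for illustration and do not describe BioNumerics' internal storage.

```python
from dataclasses import dataclass, field

@dataclass
class CharacterRecord:
    """Character Type data: every value belongs to a named character."""
    entry_key: str
    values: dict  # character name -> value (binary or continuous)

@dataclass
class SimilarityMatrix:
    """Matrix Type data: only pairwise similarities between entries."""
    values: dict = field(default_factory=dict)

    def set(self, key_a, key_b, similarity):
        # An unordered pair is used so that (a, b) and (b, a) are the same cell.
        self.values[frozenset((key_a, key_b))] = similarity

    def get(self, key_a, key_b):
        if key_a == key_b:
            return 100.0  # assumed self-similarity, in percent
        return self.values[frozenset((key_a, key_b))]

resistance = CharacterRecord("Entry 1", {"Ampicillin": 1, "Tetracycline": 0})

homology = SimilarityMatrix()
homology.set("Entry 1", "Entry 2", 87.5)
print(homology.get("Entry 2", "Entry 1"))  # 87.5
```

A matrix cell has no meaning for a single entry on its own, which is why such data cannot be rewritten as character data.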
7.1 Defining a new Fingerprint Type
Before we create a new Fingerprint Type, we will copy
an example TIFF file from the CD-ROM. The CD-ROM
contains a directory EXAMPLES, which contains a gel
GEL_01.TIF. After a standard installation of
BioNumerics, the default path to copy gel files is
C:\Program Files\BioNumerics\Data\Example\Images.
The recommended way of working is to save
new gel files immediately in this directory. However, if
TIFF files are already present in another directory, you
can select the Fingerprint Type in the Experiments
panel, then simply drag the files from the Windows Explorer
and drop them into the Files panel. Alternatively, you can
select File > Import experiment data in the BioNumerics
main menu, and select the TIFF files you want to process
in this Fingerprint Type. In both cases, BioNumerics
makes a copy of the original TIFF file in the Images
subdirectory.
7.1.1 In BioNumerics, select Experiments > Create new
fingerprint type from the main menu, or press the
corresponding button and select New fingerprint type.
7.1.2 The New fingerprint type wizard prompts you to
enter a name for the new type. Enter a name, for
example “RFLP”.
7.1.3 Press <Next> and check the type of the fingerprint
data files. The default settings correspond to the most
common case, i.e. two-dimensional TIFF files with 8-bit
OD depth (256 gray values).
7.1.4 After pressing <Next> again, the wizard asks
whether the fingerprints have inverted densitometric
values. This is the case when you are using EtBr stained
gels, photographed under UV light (such as the example
provided). The bands then appear as bright, fluorescent
bands on a black background. Since BioNumerics
recognizes the darkness as the intensity of a band, you
should answer Yes, to allow the program to
automatically invert the values when converting the
images to densitometric curves. Furthermore, the
wizard allows you to adjust the color of the background
and the bands to match the reality. The red, green and
blue components can be adjusted individually for both
the background color and the band color. Usually, you
will leave the colors unaltered.
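The inversion of densitometric values mentioned in 7.1.4 amounts to mirroring each value within the available range. The following Python sketch illustrates this for a list of pixel values; the function name `invert_od` is hypothetical and the code is not taken from BioNumerics.

```python
def invert_od(values, bit_depth=8):
    """Invert densitometric values so that bright bands on a dark
    background become dark bands on a bright background."""
    max_value = (1 << bit_depth) - 1  # 255 for 8-bit, 65535 for 16-bit
    return [max_value - v for v in values]

# An EtBr-stained lane: background near 0, a bright band at 240.
lane = [5, 8, 240, 10]
print(invert_od(lane))  # [250, 247, 15, 245]
```

Applying the inversion twice returns the original values, so no information is lost.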
7.1.5 In the next step, you are prompted to allow a
Background subtraction, and to enter the size of the
disk, as a percentage of the track length. The default disk
size of 10% will suit most Fingerprint Types. For
high resolution fingerprints (e.g. AFLP and
sequencer-generated patterns) you can try a smaller disk size.
Later, we will see how we can have the program
propose the optimal background subtraction settings
automatically. At this time, we leave the background
subtraction disabled.
7.1.6 Press Finish to complete the creation of the new
Fingerprint Type.
NOTE: You will be able to adjust any of these
parameters later on.
The experiment types panel (Figure 6-1.) now lists “RFLP”
under Fingerprint types.
7.1.7 Run the Windows Explorer, and select file
GEL_01.TIF in the directory EXAMPLES on the CD-ROM.
7.1.8 Drag and drop the file in the Files panel in the
BioNumerics Main window or click the corresponding
button (File > Add new experiment file) to select an
image file from the browser.
7.1.9 The software now asks "Do you want to edit the
image before adding it to the database". Answer <Yes>
to open the image import editor.
The selected file is opened in the Fingerprint image import
editor, an editor which allows the user to perform a
number of preprocessing functions on the image (Figure
7-1.). These functions include flipping, rotating and
mirroring the image, inverting the image color,
converting color images to grayscale, and cropping the
image to defined areas.
Figure 7-1. The Fingerprint image import window.
NOTE: It is possible to skip the Fingerprint image
import editor and copy the file directly to the database,
by answering <No> to the question in 7.1.9. In case
you skip this step, make sure the file is an uncompressed
grayscale TIFF file, which is the only format recognized
by the BioNumerics database. The Fingerprint image
import editor supports most known file types such as
JPEG, GIF, PNG and compressed TIFF files in gray
scale or RGB color.
The window consists of three tabs: Original, Processed,
and Cropped.
7.1.10 In the Original tab, the unprocessed image is
shown. In the Original view, you can zoom in or zoom
out with the corresponding buttons, and save the image
to the database (with the corresponding button or
File > Add image to database). The image can
only be saved when it is in grayscale mode (see below).
7.1.11 In the Processed tab, the same options are
available as in the Original tab, plus a number of image
editing tools. These include:
- Inverting the color (button or Image > Invert) to invert
images that have a black background, for example gels
that were stained with EtBr.
- Rotating the image 90° left (button or Image > Rotate >
90° left), 90° right (button or Image > Rotate > 90°
right), or 180° (button or Image > Rotate > 180°).
- Mirroring the image horizontally (button or Image >
Mirror > Horizontal) or vertically (button or Image >
Mirror > Vertical).
- Averaging RGB colors to grayscale (button or Image >
Convert to gray scale > Averaged), or converting a single
channel to grayscale, either red (button or Image >
Convert to gray scale > Red channel), green (button or
Image > Convert to gray scale > Green channel) or blue
(button or Image > Convert to gray scale > Blue
channel).
7.1.12 The editor also allows you to crop the image to a
selected area, for which the following functions are
available:
- Crop > Add new crop or the corresponding button, to
add a new crop mask to the image. The crop mask can be
moved by clicking anywhere inside the rectangle and
dragging it to another position, or resized by clicking
and dragging the bottom right corner of the rectangle.
- Crop > Rotate selected crop or the corresponding
button, to rotate the crop mask over a defined angle.
Rotating the crop mask over an angle different from 90°
or 180° will cause the program to recalculate
densitometric values based upon interpolation, which
means that the quality of the image may slightly
decrease. This action is therefore not recommended
unless it is inevitable.
- Crop > Delete selected crop or the corresponding
button, to delete the crop mask that is currently
selected. Note in this respect that the program allows
multiple crop masks to be defined for a single image.
The final image that will be saved to the database will
be composed of all cropped areas aligned horizontally
next to each other.
7.1.13 With Image > Expand intensity range, it is
possible to recalculate the pixel values of the image so
that they cover the entire range within the OD depth of
the file, e.g. 8-bit = 256 gray levels, 16-bit = 65536 gray
levels.
7.1.14 The image can be reset to its original state by
Image > Load from original or by pressing the
corresponding button.
7.1.15 To edit this gel, convert it to gray scale by
averaging the 3 channels (Image > Convert to gray
scale > Averaged) and define a crop mask
within the gel borders, excluding the black area at the
left bottom, but including the full patterns.
7.1.16 The third tab, Cropped, displays the result of the
image as defined by the crop mask(s). When the result is
satisfactory, you can save the image to the database
using the corresponding button and exit the Fingerprint
image import window.
One gel becomes available in the experiment files panel
(Figure 6-1.), Gel_01. The file is marked with N, which
means that it has not been edited yet. Any other gel TIFF
file you want to process can be imported in the same
way in the current database (see Figure 1-3.). The
program will list these TIFF files in the experiment files
panel.
NOTE: Experiment files added to the Experiment files
panel can also be deleted by selecting the file and
choosing File > Delete experiment file from the main
menu. Deleted experiment files are struck through (red
line) but are not actually deleted until you exit the
program. Until then, you can undo the deletion of the file
by selecting File > Delete experiment file again.
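Two of the preprocessing operations of the Fingerprint image import editor, the conversion to grayscale (7.1.11) and the expansion of the intensity range (7.1.13), can be sketched in a few lines of Python. The function names are hypothetical, and the linear stretch shown for the intensity range is an assumption: the manual only states that the values are recalculated to cover the entire range.

```python
def average_to_gray(rgb_pixels):
    """Average the red, green and blue channels of each (r, g, b) pixel."""
    return [round((r + g + b) / 3) for r, g, b in rgb_pixels]

def channel_to_gray(rgb_pixels, channel):
    """Use a single channel ('red', 'green' or 'blue') as the gray value."""
    index = {"red": 0, "green": 1, "blue": 2}[channel]
    return [pixel[index] for pixel in rgb_pixels]

def expand_intensity_range(gray, bit_depth=8):
    """Linearly rescale gray values so that they span the full range
    of the given depth (0..255 for 8-bit, 0..65535 for 16-bit)."""
    top = (1 << bit_depth) - 1
    lowest, highest = min(gray), max(gray)
    if highest == lowest:
        return list(gray)  # a flat image cannot be stretched
    return [round((v - lowest) * top / (highest - lowest)) for v in gray]

pixels = [(30, 60, 90), (0, 0, 0), (255, 255, 255)]
print(average_to_gray(pixels))               # [60, 0, 255]
print(channel_to_gray(pixels, "red"))        # [30, 0, 255]
print(expand_intensity_range([10, 20, 50]))  # [0, 64, 255]
```

Expanding the range changes only the scaling of the values, not their order, so the relative darkness of bands is preserved.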
7.2 Processing gels
An experiment file is edited in two steps: in a first step,
the data are entered or edited, and in a second step, the
data are assigned to the database entries.
7.2.1 Click on Gel_01 in the experiment file panel, and
then select File > Open experiment file (data) in the main
menu.
Since the gel is new (unprocessed), BioNumerics doesn’t
know what Fingerprint Type it belongs to. Therefore, a
list box is first shown, listing all available Fingerprint
Types, and allowing you to select one of them, or to
create a new Fingerprint Type with <Create new>. In
this case, there is only one Fingerprint Type available,
RFLP.
7.2.2 Select RFLP and press <OK>.
The gel file is being loaded, which may take some time,
depending on the size of the image. The Fingerprint data
editor window appears (Figure 7-2.), showing the image
of the gel.
Figure 7-2. The Fingerprint data editor window. Step 1: defining pattern strips.
7.2.3 If the gel needs to be mirrored, use the tools File >
Tools > Vertical mirror of TIFF image or File > Tools >
Horizontal mirror of TIFF image. Using two times the
same command restores the original TIFF file.
The whole process of lane finding, normalization, band
finding and band quantification is contained in a
wizard, allowing the user to move back and forth
through the process and make changes easily in
whichever step of the process. Two buttons in the
toolbar are used to move back and forth,
respectively. The process involves the following steps,
shown in the toolbar of the window: 1. Strips (defining
lanes), 2. Curves (defining densitometric curves), 3.
Normalization, and 4. Bands (defining bands and
quantification).
Within each of these four steps, there is an undo/redo
function. To undo one or more actions, you can use the
undo button, or Edit > Undo (CTRL+Z) from the
menu. To redo one or more actions, use the redo button,
or Edit > Redo (CTRL+Y) from the menu. Once
you have moved from one step to another, the undo/
redo function within that step is lost.
7.3 Defining pattern strips on the gel
7.3.1 At the start, the image is shown in original size (x
1.00, see status bar of the window). You can zoom in and
zoom out with Edit > Zoom in and Edit > Zoom out, or
using the corresponding buttons, or the + and - keys,
respectively.
7.3.2 When a large image is loaded, a Navigator window
can be popped up to focus on a region of the image. To
call the navigator, double-click on the image, or
right-click and select Navigator from the floating menu.
7.3.3 You can change the brightness and contrast of the
image with Edit > Change brightness & contrast or with
the corresponding button. This pops up the Image
brightness & contrast dialog box (Figure 7-3.).
7.3.4 In the Image brightness & contrast dialog box, click
Dynamical preview to have the image directly updated
with changes you make.
7.3.5 Use the Minimum value slide bar to reduce
background if the background of the whole image is too
high.
7.3.6 Use the Maximum value slide bar to darken the
image if the darkest bands are too weak.
The option Rainbow palette can be used to reveal even
more visual information in areas of poor contrast (weak
and oversaturated areas) by using a palette that consists of
multiple color transitions.
7.3.7 If you press <OK>, the changes made to the image
appearance are saved along with the Fingerprint Type.
NOTE: The brightness and contrast settings are saved
along with the Fingerprint Type, but are not specific for
a particular gel. The Tone curve editor, as explained
further, is a more powerful image enhancement tool for
which the settings are saved for each particular gel.
Figure 7-3. Image brightness & contrast dialog box.
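The effect of the Minimum value and Maximum value slide bars (7.3.5 and 7.3.6) can be pictured as a display window applied to the darkness values of the image. The Python sketch below assumes a simple linear mapping between the two limits; the manual does not state the exact mapping, and the function name `apply_window` is hypothetical.

```python
def apply_window(values, minimum, maximum, bit_depth=8):
    """Map darkness values to display values.

    Values at or below `minimum` are shown as empty background (0),
    values at or above `maximum` are shown as fully dark, and values
    in between are scaled linearly (assumed mapping).
    """
    if maximum <= minimum:
        raise ValueError("maximum must be larger than minimum")
    top = (1 << bit_depth) - 1
    shown = []
    for v in values:
        clipped = min(max(v, minimum), maximum)
        shown.append(round((clipped - minimum) * top / (maximum - minimum)))
    return shown

# Background around 20 and darkest band at 200 on an 8-bit image:
# raising the minimum removes the background, lowering the maximum
# makes the weak darkest band appear fully dark.
print(apply_window([0, 20, 65, 200, 255], minimum=20, maximum=200))
# [0, 0, 64, 255, 255]
```

Such a window only changes how the image is displayed in the sketch; the underlying values passed in are left untouched.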
7.3.8 With File > Show 3D view or the corresponding
button, a 3-dimensional view of the gel image can be
obtained in a separate 3D view window (Figure 7-4.).
7.3.9 In the 3D view window, you can use the Left,
Right, Up and Down arrow keys on the keyboard, to
turn the position of the image in all directions. The
image can also be rotated horizontally and vertically by
dragging the image left/right or up/down using the
mouse.
Figure 7-4. The 3D view window.
7.3.10 You can change the zoom factor using View >
Zoom in (PgDn) or View > Zoom out (PgUp).
7.3.11 You can also change the vertical zoom (Z-axis
showing the peak height) with View > Higher peaks
(INS) or View > Lower peaks (DEL).
NOTE: In the three further steps of the Fingerprint
data editor window (2. Curves, 3. Normalization,
and 4. Bands), the 3D view window can also be
popped up, showing only the selected lane image rather
than the entire gel image.
7.3.12 Close the 3D view window with File > Exit.
Figure 7-5. Defining the bounding box to follow
contours of distorted gel.
7.3.13 To save the work done at any stage of the process,
you can select File > Save, press the F2 key or the
corresponding button. In case you work with complex
gels, it is advisable to save the work at regular times.
When you save the gel file with File > Save, the program
may prompt you with the following question: “The
resolution of this gel differs considerably from the
normalized track resolution. Do you wish to update the
normalized track resolution?”. The resolution is
explained further (see 7.8.17). If the question appears
(not for the example gel), answer <Yes>.
The green rectangle is the bounding box, which delimits
the region of interest of the gel: tracks and gelstrips will
be extracted within the bounding box.
7.3.14 To move the bounding box as a whole, hold down
the CTRL key while dragging it in any of the green
squares (distortion nodes).
7.3.15 Adjust the box by dragging the distortion nodes
as necessary: corner nodes can be used to resize the box
in two directions, whereas inside nodes can only be used
to resize one side of the box.
7.3.16 By using the SHIFT key, one can even distort the
sides of the rectangle. Holding the SHIFT key while
dragging the corner nodes will change the rectangle into
a non-rectangular quadrangle (parallelepiped).
7.3.17 A curvature can be assigned to the sides of the
bounding box by holding the SHIFT key while dragging
one of the inside nodes in any direction (see Figure 7-5.,
top and bottom sides).
7.3.18 On the top and bottom sides of the bounding box,
more nodes can be added using Lanes > Add bounding
box node. While holding down the SHIFT key, a node
can be dragged to the left or to the right using the
mouse.
7.3.19 A node can be deleted from the bounding box
using Lanes > Delete bounding box node.
NOTES:
(1) Following the curvature of a distorted gel is not
crucial, as this is normally corrected in the
normalization step (see further, 7.5) in case there are
sufficient reference lanes on the gel. However, as it will
provide a first rough normalization, it can aid the
automatic or manual assignment of bands as explained
in 7.5. Also, the software allows the bounding box
curvature to be used for rectifying sloping or “smiling”
lanes (e.g. Figure 7-5., outer lanes), if this option is
enabled (see the Fingerprint conversion settings
dialog box, Figure 7-6., and explanation below).
(2) If you are running an upgrade from an older
BioNumerics version (prior to 4.0) and using a
Connected Database, the column BOUNDINGBOX in
the Connected Database may not be long enough to hold
an increased number of nodes. To resolve this, perform
<Auto construct tables> in the Connected
Database setup window (see 28.2).
7.3.20 Select Lanes > Auto search lanes or the
corresponding button to let
the program find the patterns automatically. A dialog
box asks you to enter the approximate number of tracks
on the gel.
Each lane found on the image is represented by a strip: a
small image that is extracted from the complete file to
represent a particular pattern. The borders of these
strips are represented as blue lines, or red for the
selected lane (see Figure 7-7.). By default, the strip
thickness is 31 points, which is too wide in this example.
7.3.21 Call the Fingerprint conversion settings dialog box
with Edit > Settings or the corresponding button. This
dialog box consists of
four tabs, of which the tab corresponding to the current
stage of the processing is automatically selected. Since
we are now in the first step (defining strips), the Raw
data tab is selected (see Figure 7-6.).
Figure 7-6. The Fingerprint conversion settings
dialog box. Raw data tab.
7.3.22 Adjust the Thickness of the image strips so that
the blue lines enclose the complete patterns (blue lines
of neighboring patterns should nearly touch each other).
See Figure 7-7. for an optimally adjusted example.
Figure 7-7. Optimal strip thickness settings, detail.
7.3.23 If necessary, increase the number of distortion
nodes. These nodes allow you to bend the strips locally.
Usually, three nodes should be fine.
Two more options, Background subtraction and Spot
removal allow gel scannings with irregular background
and spots or artifacts to be cleaned up to a certain extent.
It should be emphasized that the options Background
subtraction and Spot removal have an influence on
gelstrips in all further processes of the program: gelstrips
will always be shown with background subtracted and
with spots removed. In addition, when two-dimensional
quantification is done, the gelstrips with background
subtracted and spots removed are used. Hence, we
recommend NOT to use these options unless (1) the
image has a strong irregular background, for example
by non-homogeneous illumination of the gel, so that the
gelstrips would not look appropriate for presentation or
publication; (2) the gel contains numerous spots that
would influence the densitometric curves extracted
from the gelstrips (spots on the image are seen as peaks
on a densitometric curve, and hence have a strong
impact on correlation coefficients, band searching etc.).
The Background subtraction is based on the “rolling
ball” principle, and the size of the ball in pixels of the
image can be entered. The larger the size of the ball, the
less background will be subtracted.
The Spot removal is a mechanism similar to the rolling
ball, but an ellipse is used instead, in order to separate
bands from spots. The size of the ellipse can be entered
in pixels. Unlike the background subtraction, the size of
the ellipse should be kept as small as possible in order
not to erase bands.
NOTES:
(1) The spot removal mechanism inevitably causes some
distortion on the patterns. The smaller the size of the
background removal, the less the distortion.
(2) If background subtraction on the gelstrips is applied, it
is not necessary anymore to perform background
subtraction on the densitometric curves, since this
does exactly the same but on one-dimensional
patterns.
(3) The effect of background subtraction and spot removal
on gelstrips is only seen in the next step, when the
gelstrips are shown. Since the example gels do not
require these features, we will not further discuss them.
Using the option Use bounding box curvature, it is
possible to have the program correct smiling or sloping
bands due to distortion in the gel. The bands will be
rectified according to the bounding box curvatures
defined (7.3.17). An example is given in Figure 7-5.,
where the bounding box has been assigned a curvature
to follow the distortions in the outer lanes. The result of
enabling the correction for bounding box curvature is
shown in Figure 7-9., where it can be clearly seen that
the bands of the outer lanes have been straightened.
7.3.24 Click <OK> to validate the changes.
7.3.25 Adjust the position of each spline as necessary by
grabbing the nodes using the mouse. Use the SHIFT key
to bend a spline locally in one node.
7.3.26 Add lanes with Lanes > Add new lane, the
ENTER key, or the corresponding toolbar button. A new
lane is placed to the right of the selected one.
7.3.27 If necessary, remove a selected lane with Lanes >
Delete selected lane, DEL, or the corresponding toolbar
button.
7.3.28 If one lane is more distorted than the number of
nodes can follow, you can increase the number of nodes
in that lane by selecting it and choosing Strips >
Increase number of nodes.
7.3.29 If the lanes are not equally thick, you can increase
or decrease the thickness of each individual strip with
Strips > Make larger and Strips > Make smaller (F7 and
F8, or the corresponding toolbar buttons), respectively.
Once the lanes are defined on the gel, a more powerful
tool to edit the appearance of the image is the Tone curve
editor. While the Image brightness and contrast settings act
at the screen (monitor) level, i.e. after the TIFF grayscale
information is converted into 8-bit grayscale, the Tone
curve editor acts at the original TIFF information level.
This means that, in case a gel image is scanned as 16-bit
TIFF file, the tone curve settings are applied to the full
16-bit (65,536 levels) grayscale information, which allows
much more information to be magnified in particular
areas of darkness. The advantages are:
•Weak bands are much better enhanced resulting in a
smoother and more reliable picture.
•The tone curve acts at a level below the brightness and
contrast settings and can be saved along with a
particular gel. In all further imaging tools of the
program, the tone curve for the particular gel is
applied. Brightness and contrast settings are not
specific to a particular gel.
•The user can fine-tune the tone curve to obtain
optimal results. This will be explained below.
Figure 7-8. Gel image tone curve editor.
7.3.30 First select the brightness and contrast box with
Edit > Change brightness & contrast or with the
corresponding toolbar button, and press <Defaults> to
restore the defaults.
7.3.31 In the Fingerprint data editor window menu, select
Edit > Edit tone curve. The tone curve editor appears as in
Figure 7-8.
The upper panel is a distribution plot of the
densitometric values in the TIFF file over the available
range. The two windows on the right show a part of the
image Before correction and After correction, respectively.
7.3.32 You can scroll through the preview images by
left-clicking and moving the mouse while keeping the
mouse button pressed.
7.3.33 Select a part of the preview images which contains
both very weak and dark bands.
On the left, there are two buttons, <Linear> and
<Logarithmic>. Both functions introduce a number of
distortion points on the tone curve, and reposition the
tone curve so that it begins at the grayscale level where
the first densitometric values are found, and ends at its
maximum where the darkest densitometric values are
found. This is a simple optimization function that
rescales the used grayscale interval optimally within the
available display range. The difference between linear
and logarithmic is whether a linear or a logarithmic
curve is used.
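The rescaling performed by <Linear> and <Logarithmic> can be sketched as follows. This is only an illustration of the principle: the 8-bit output range, the logarithmic formula and its steepness constant, and all names are assumptions, not the formulas used by the program.

```python
import math


def build_tone_curve(values, mode="linear", out_max=255, steepness=100.0):
    """Return a function that maps the used grayscale interval onto 0..out_max.

    The curve starts at the lowest value found in the image and reaches its
    maximum at the highest value found (hypothetical helper for illustration).
    """
    lowest, highest = min(values), max(values)
    span = highest - lowest

    def curve(value):
        if span == 0:
            return 0
        t = min(max((value - lowest) / span, 0.0), 1.0)  # position within used interval
        if mode == "logarithmic":
            # Assumed logarithmic shape: expands the low end of the interval.
            t = math.log1p(steepness * t) / math.log1p(steepness)
        return round(t * out_max)

    return curve


# A few densitometric values from a hypothetical 16-bit scan.
samples = [1200, 5000, 20000, 40000]
linear = build_tone_curve(samples, "linear")
logarithmic = build_tone_curve(samples, "logarithmic")

for value in samples:
    print(value, linear(value), logarithmic(value))
```

Both curves map the lowest used value to 0 and the highest to the top of the display range. The logarithmic curve assigns more of the display range to the low end of the interval, which is why it reveals more contrast there.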
Figure 7-9. The Fingerprint data editor window. Step 2: defining densitometric curves.
7.3.34 In case of 8-bit gels, a linear curve is the best
starting point, so press <Linear>. The interval is now
optimized between minimum and maximum available
values, and the preview After correction looks a little bit
brighter.
There are six other buttons that are more or less
self-explanatory: <Decrease zero level> and <Increase zero
level> decrease and increase the starting point of
the curve, respectively.
<Enhance weak bands> and <Enhance dark bands> are
also complementary to each other, the first making the
curve more logarithmic so that more contrast is revealed
in the left part of the curve (bright area), and the second
making the curve more exponential so that more
contrast is revealed in the right part of the curve (dark
area).
<Reduce contrast> and <Increase contrast> make the
curve more sigmoid so that the total contrast of the
image is reduced or enhanced, respectively.
7.3.35 For the image loaded, pressing <Enhance weak
bands> three times and subsequently <Increase zero
level> 10 times provides a clear, sharp, high-contrast
picture.
7.3.36 Press <OK> to save these tone curve settings.
NOTE: It is also possible to edit the tone curve
manually: nodes can be added by double-clicking on the
curve, or can be deleted by selecting them and pressing
the DEL key. The curve can be edited in each node by
left-clicking on the node and moving it. There is a
<Reset> button to restore the original linear zero-to-100%
curve.
7.3.37 Press the next-step toolbar button to go to the
next step: defining densitometric curves.
7.4 Defining densitometric curves
In this step, the window is divided in two panels (Figure
7-9.): the left panel shows the strips extracted from the
image file and the right panel shows the densitometric
curve of the selected pattern, extracted from the image
file.
7.4.1 You can move the separator between both panels to
the left or to the right to allow more space for the strips
or for the curves.
The program has automatically defined the
densitometric curves using the information of the lane
strips you entered in the previous step. Normally, you
will not have to change the positions of the
densitometric curves anymore, except when you want to
avoid a distorted region within a pattern, e.g. due to an
air bubble within the gel.
7.4.2 If necessary, adjust the position of a spline by
grabbing the nodes using the mouse. Use the SHIFT key
to bend the spline locally in one node.
The blue lines represent the width of the area within
which the curve will be averaged. The default value is 7
points. In most cases, you will have to optimize this
value for a given type of gel images.
7.4.3 Call the Fingerprint conversion settings dialog box
with Edit > Settings. This time, the Densitometric curves
tab is displayed (Figure 7-10.).
The curve extraction settings include other important
parameters which apply to the background removal and
smoothing.
When we defined the Fingerprint Type, we left the
Background subtraction disabled (see 7.1.5), because we
will see how we can have the program propose the
optimal settings.
The Filtering is the method used to combine the
values within the specified thickness into one curve
value. Simple averaging is obtained with Arithmetic
average, whereas Median filter and Mode filter are more
sophisticated methods that reduce peak-like artifacts
caused by spots on the patterns. Figure 7-12. illustrates
the effect of the Median filter on a small spot. These
filters, however, remove less noise from the curves
(particularly the Mode filter). Use the Mode filter only
if your gels contain disturbing spots.
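The difference between arithmetic averaging and median filtering can be seen in a small sketch. Each row below holds the pixel values found across the averaging thickness at one position along the lane; the data and the function name are hypothetical.

```python
from statistics import mean, median


def extract_curve(strip, method="average"):
    """Reduce each row of pixel values (taken across the lane) to one curve value."""
    reduce = mean if method == "average" else median
    return [reduce(row) for row in strip]


strip = [
    [10, 11, 10, 12, 11],   # background
    [50, 52, 51, 49, 50],   # a real band: high across the whole thickness
    [10, 11, 200, 12, 10],  # a spot: one very dark pixel only
    [11, 10, 12, 11, 10],   # background
]

averaged = extract_curve(strip, "average")
filtered = extract_curve(strip, "median")

print(averaged)  # the spot shows up as a peak almost as high as the band
print(filtered)  # the spot is suppressed, the band is kept
```

Because a spot covers only a small part of the thickness, the median ignores it, whereas the arithmetic average turns it into a peak on the curve.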
Figure 7-12. Result of Arithmetic average filtering
(left) and Median filtering (right).
7.4.6 Select Median filter.
Figure 7-10. The Fingerprint conversion settings
dialog box. Densitometric curves tab.
7.4.4 Change the Averaging thickness for curve
extraction. For the example, enter 11. Ideally, the
thickness should be chosen as broad as possible.
However, smiling and distortion at the edges of the
bands should be excluded (see Figure 7-11.).
The Least square filtering applies to the smoothing of
the profiles. This filter will remove background noise,
seen as small irregular peaks, from the profile of real
(broader) peaks. As with background subtraction, the
program can predict the optimal settings for least square
filtering, if necessary. For now, we leave this parameter
disabled.
Richardson-Lucy deconvolution is a method to deblur
(sharpen) one-dimensional and two-dimensional arrays.
This function sharpens and enhances the contrast of
peaks in the densitometric curves. While peaks will
become sharper, noise also will increase. Deconvolution
actually does the opposite of least-square filtering. Since
the method is iterative, the number of Iterations can be
set (default 50). The more iterations, the stronger the
deconvolution. The Kernel size (default
2.00) determines the resolution of the deconvolution: the
smaller this value is set, the more shoulders will be split
up into separate peaks.
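For readers unfamiliar with the method, the following is a minimal one-dimensional sketch of the standard Richardson-Lucy iteration. It assumes a Gaussian blurring kernel whose standard deviation plays the role of the Kernel size; how the program actually interprets that parameter is not documented here, and all names are hypothetical.

```python
import math


def gaussian_kernel(sigma):
    """Normalized Gaussian weights covering +/- 3 sigma."""
    radius = max(1, math.ceil(3 * sigma))
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur(signal, kernel):
    """Weighted moving sum with the kernel; the ends of the signal are repeated."""
    n, radius = len(signal), len(kernel) // 2
    out = []
    for i in range(n):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += weight * signal[j]
        out.append(acc)
    return out


def richardson_lucy(curve, kernel_size=2.0, iterations=50):
    """Iteratively sharpen a positive curve that was blurred by a Gaussian."""
    kernel = gaussian_kernel(kernel_size)
    mirrored = kernel[::-1]  # identical here, because the kernel is symmetric
    estimate = list(curve)
    for _ in range(iterations):
        blurred = blur(estimate, kernel)
        ratio = [c / max(b, 1e-12) for c, b in zip(curve, blurred)]
        correction = blur(ratio, mirrored)
        estimate = [e * c for e, c in zip(estimate, correction)]
    return estimate


# A single sharp band on a flat baseline, blurred and then deconvolved.
true_curve = [1.0] * 41
true_curve[20] = 100.0
observed = blur(true_curve, gaussian_kernel(2.0))
sharpened = richardson_lucy(observed, kernel_size=2.0, iterations=50)

print(round(max(observed), 1), round(max(sharpened), 1))
```

The sharpened peak is higher and narrower than the observed one. On real data, the same multiplicative correction also amplifies noise, which is the trade-off described above.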
Figure 7-11. Optimal settings for curve averaging
thickness.
7.4.7 Press <OK> to save the settings.
7.4.5 Select Edit > Settings again to specify other
settings.
We will now determine the optimal settings for
background and filtering settings using spectral
(Fourier) analysis.
7.4.8 Select Curves > Spectral analysis. This shows the
Spectral analysis window (Figure 7-13.).
The black line is the spectral analysis of the curves as a
function of the frequency in number of points
(logarithmic scale). Ideally, the curve should show a flat
background line at the right hand side, and then slowly
rise further to the left. This indicates that the scanning
resolution is high enough. Another parameter which
indicates the quality is the Signal/noise ratio, which
should be above 50 if possible. The example gel is only
of moderate resolution.
The Wiener cut-off scale determines the optimal setting
for the least square filtering. Figure 7-13. shows an
optimal setting of 0.82%.
The Background scale is an estimation of the disk size
for background subtraction. The figure shows a setting
of 8%.
7.4.9 Call Edit > Settings again and specify the
background subtraction and the least square filtering.
7.4.10 If you want to have a better look at the curves
(right panel) you can rescale them with Edit > Rescale
curves. This will rescale the gray processed curves
(background subtracted and filtering applied) to fit
within the available window space. The raw curves
(lines) may then fall beyond the window.
7.5 Normalizing a gel
In the Normalization step, the Fingerprint data editor
window consists of three panels (Figure 7-14.): left the
Reference system panel, which will show the reference
positions, and the standard pattern; the center panel shows
the pattern strips; and the right panel shows the
densitometric curve of the selected pattern.
When setting up a new database, the normalization
process of the first gel involves the following steps. The
underlined steps are the ones that will be followed for
all subsequent gels.
•Marking the reference patterns (reference patterns are
identical samples loaded at different positions on the
gel for normalization purposes);
•Showing the gel in normalized view;
•Identifying a suitable reference pattern on which we
will define bands as reference positions. Reference
positions are bands that will be used to align the
corresponding bands on all reference patterns from
the same and from other gels.
•Defining the reference positions;
•Assigning the bands on the reference patterns to the
corresponding reference positions;
•Updating the normalization.
•Defining a standard (optional).
We proceed as follows:
7.5.1 Select the first reference pattern (lane 1 on the
example) and choose Reference > Use as reference lane or
the corresponding toolbar button. Repeat this action for
all other reference lanes (lanes 9 and 18 on the example).
7.5.2 Select Normalization > Show normalized view or
the corresponding toolbar button.
7.5.3 Choose the most suitable reference pattern to serve
as standard: lane 9.
Figure 7-13. Spectral analysis of the patterns of a gel.
7.4.11 With the command File > Print report or File >
Export report, you can generate a printed or text report
of the non-normalized densitometric curves,
respectively.
7.4.12 Press the next-step toolbar button to enter the
next phase: normalization of the patterns.
7.5.4 Select a suitable band on the destined standard
pattern and References > Add external reference
position.
You are prompted to enter a name for the band. You can
enter any name, or if possible, the molecular weight of
the band. In the latter case, the program will be able to
determine the molecular weight regression from the
sizes entered at this stage.
7.5.5 Use the following scheme to enter all reference
positions on the example gel (Figure 7-15.).
Within a Fingerprint Type, the set of reference positions
as defined, and their names, together form a reference
system. Once a gel is normalized using the defined
reference positions and saved, the reference system is
saved as well. As soon as you change anything in the
reference system, a position or a name, a new reference
system will automatically be created in addition to the
original reference system. Once a reference system has
been used in one or more gels however, the program
will produce a warning if you want to change anything
to the reference positions.
If more than one reference system exists, one of them is
the active reference system, i.e. the reference system to
which all new gels will be normalized. Without
intervention of the user, the first created reference
system will always remain the default. Later, we will see
how we can set the active reference system and delete
unused reference systems (7.13).
NOTE: Our current gel shows “No active reference
system defined” in the left panel. This message is
displayed because we are processing the first gel of this
Fingerprint Type. We already have created the reference
system, but it is not saved to disk yet. Once a second gel
is normalized, this message will not be displayed
anymore.
The normalization is done in two steps: first, the
reference bands are assigned to the corresponding
reference positions, and then the display is updated
according to the assignments made. The last step is
optional, but is useful for checking the correctness of
the alignments made.
Assign bands manually as follows:
7.5.6 Click on a label of a reference position, or anywhere
on the gel at the height of the reference position.
7.5.7 Then, hold the CTRL key and click on the reference
band you want to assign to that reference position.
7.5.8 Repeat this action for all other reference bands you
want to assign to the same reference position.
7.5.9 Repeat actions 7.5.6 to 7.5.8 until all reference
bands are assigned to their corresponding reference
positions.
NOTE: the cursor automatically jumps to the closest
peak; to avoid this, hold down the TAB key while
clicking on a band.
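Once reference bands are assigned to reference positions, normalizing a lane amounts to remapping positions so that each assigned band lands on its reference position. The sketch below assumes simple piecewise-linear interpolation between the assigned bands; the interpolation actually used by the program is not specified here, and the function name and numbers are hypothetical.

```python
from bisect import bisect_right


def make_normalizer(assigned_bands, reference_positions):
    """Build a mapping from raw positions on a lane to normalized positions.

    assigned_bands: raw positions (e.g. in pixels) of the assigned reference bands.
    reference_positions: the reference positions they were assigned to.
    At least two assignments are required.
    """
    pairs = sorted(zip(assigned_bands, reference_positions))
    xs = [raw for raw, _ in pairs]
    ys = [ref for _, ref in pairs]

    def normalize(position):
        # Pick the segment that contains the position; the first and last
        # segments are extended to cover positions outside the assigned range.
        i = min(max(bisect_right(xs, position) - 1, 0), len(xs) - 2)
        slope = (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
        return ys[i] + slope * (position - xs[i])

    return normalize


# Three reference bands found at pixels 50, 210 and 400, assigned to
# reference positions at 10%, 50% and 90% of the normalized pattern.
normalize = make_normalizer([50, 210, 400], [10.0, 50.0, 90.0])

print(normalize(50), normalize(130), normalize(305))
```

The upper part of this lane is stretched and the lower part is shrunk by different factors, so that all three assigned bands line up with their reference positions.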
7.5.10 With Normalization > Show normalized view, or
the corresponding toolbar button, the gel will be shown
in normalized view, i.e. the gelstrips will be stretched or
shrunk so that assigned bands on the reference patterns
match with their corresponding reference positions.
Figure 7-14. The Fingerprint data editor window. Step 3: normalization.
Figure 7-15. Band sizes of the reference positions
on the example gel.
To show how the automated assignment works, we will
undo the manual normalization:
7.5.11 Show the gel back in original view by pressing
the same button again.
7.5.12 Remove all the manual assignments by
Normalization > Delete all assignments.
To let the program assign the bands and reference
positions automatically, select Normalization > Auto
assign or the corresponding toolbar button. This will
open the Auto assign reference positions dialog box
(Figure 7-16.). Under Search method, two options are
available: Using bands and Using densitometric curve.
Figure 7-16. The Auto assign reference positions
dialog box
In the Using bands option, the program searches for
bands on the reference patterns and tries to match them
optimally with the defined reference positions. This
method is always applicable, even for the very first gel,
when no standard is defined. This method depends on
two parameters that can be varied between 0% and
100%: the Band intensity importance and the Local
distortion tolerance. With increased band intensity
importance, the algorithm will give priority to bands
that have a sufficient intensity, and skip very weak
bands that might otherwise be assigned to a reference
position. With increased local distortion tolerance, the
algorithm will allow a larger local distortion to be
induced between bands, i.e., bands with larger shifts in
the opposite direction can be matched with adjacent
reference positions. The default parameter settings are
50% each.
In the Using densitometric curve option, a different
algorithm is used, which matches the densitometric
curve of the standard pattern with the curves of the
reference patterns. Obviously, the option requires a
standard to be defined. This method employs a pattern
matching algorithm that works best for complex
banding patterns, but is less suitable for simple patterns
such as molecular weight ladders. In addition to the
Local distortion tolerance parameter, which is the same
as described above, the algorithm provides a Window
size parameter. This has to do with the way the
algorithm works: to allow local distortions to be
corrected, it divides the patterns into a number of
windows, and the patterns are matched within each
window. The larger the window size is taken, the more
chance that the overall alignment is correct, but the less
flexibility the program has to correct small local
distortions. In general, the more bands the patterns
contain, the smaller the window size can be taken. The
default value is 12%.
An option independent of the search method is Keep
existing assignments. When this option is chosen, any
assignments made previously are preserved. This option
allows the user to assign a few bands manually and let
the program automatically assign the remaining bands
on the reference patterns. This way of working is useful
to provide some initial help to the algorithm in case of
very distorted or difficult gels.
7.5.13 Select Using bands and press <OK>. Carefully
inspect the assignments made, and if some are incorrect,
correct them manually, as explained in 7.5.6 to 7.5.8.
7.5.14 Finally, when all assignments are made correctly,
select Normalization > Show normalized view, or
the corresponding toolbar button.
NOTE: In case most or all of the patterns on a gel
contain one or more identical bands, such bands can be
used for internal alignment of the gel. The software
therefore creates an internal reference position which
is saved with the gel but is not part of the reference
system. An internal reference position can be created
with References > Add internal reference
position, or by right-clicking on the band and selecting
Add internal reference position. The program then asks
“Do you want to automatically search for this reference
band?”. If you answer <Yes>, it will try to find all the
correct assignments, but you can change or delete
assignments afterwards.
When the gel is in normalized view, a very reliable way
to reveal remaining mismatches is by showing the
distortion bars: these bars indicate local deviations with
respect to the general shift of a reference pattern
compared to the reference positions. A shift that is too
strong is seen as a zone ranging from yellow over red to
black, whereas a shift that is too weak is indicated by a
zone ranging from bright blue over dark blue to black.
7.5.16 Save the normalized gel with File > Save (F2) or
the corresponding toolbar button.
7.5.17 It is possible to generate a printout or a text file of
the complete alignment of the gel, by selecting the
command File > Print report or File > Export report,
respectively.
The file lists all the reference bands defined in the
reference system with their relative positions, and the
corresponding bands on each reference pattern, with the
absolute occurrence on the pattern in distance from the
start.
If you are going to use band-matching coefficients to
compare the patterns, you should read the next
paragraph (7.6), corresponding to the fourth phase in
the processing of a gel (see page 32). If you are going to
use a curve-based coefficient, you can skip paragraph
7.6 and continue with 7.9.
7.6 Defining bands and quantification
In step 3. Normalization, press the next-step toolbar
button, which brings you to step 4. Bands. This is the
last step in processing a gel, which involves defining
bands and quantifying band areas and/or volumes (see
Figure 7-19.).
7.6.1 Call the Fingerprint conversion settings dialog box
with Edit > Settings or the corresponding toolbar
button. The fourth tab, Bands, is shown, which allows
you to enter the Band search filters and the
Quantification units (Figure 7-17.).
7.5.15 Show the distortion bars with Normalization >
Show distortion bars.
Slight transitions from bright yellow to bright blue are
normal, as long as the color does not change abruptly.
An abrupt change indicates that a wrong assignment
was made. You can correct the misalignment by
assigning the correct band manually and selecting
Normalization > Update normalization or the
corresponding toolbar button. Alternatively, you can
return to the original view (7.5.11), assign the correct
band manually, and show the normalized view again
(7.5.14). The Show distortion bars setting (on or off) is
stored along with the Fingerprint Type.
NOTE: if the program has difficulties in assigning the
bands correctly, you can first make a few assignments
manually (for example, the first and the last band of the
reference patterns), then display the normalized view
with Normalization > Show normalized view, or
the corresponding toolbar button, and then have the
program find the assignments automatically with the
option Keep existing assignments checked.
Figure 7-17. The Fingerprint conversion settings
dialog box. Bands tab.
The band search filters include a Minimum profiling,
which is the elevation of the band with respect to the
surrounding background, expressed as a percentage. The
minimal profiling is dependent on the OD range you
specified under Raw data (same dialog box, first tab). If,
for example, you increase the OD range, peaks will look
smaller on the densitometric profiles, and a smaller
minimum profiling will need to be set in order to find
the same number of bands. However, when Rel. to max.
val is checked, the minimal profiling, i.e. the minimal
height of the bands will be taken relative to the highest
band on that pattern. When patterns with different
intensities occur on the same gel, it is recommended to
enable this option. Along with the minimum profiling, it
is possible to specify a "Gray zone", also as a height
percentage. This gray zone specifies bands that will be
marked uncertain. In comparing two patterns, the
software will ignore all the positions in which one of the
patterns has an uncertain band. The percentage value
for the gray zone is added to the minimum profiling
value. To take the example of Figure 7-17., all bands
with a profiling of less than 5% are excluded; bands with
a profiling between 5% and 10% are marked uncertain,
and all bands with a profiling of more than 10% are
selected (see Figure 7-18.).
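[Figure content: a vertical height scale from 0% to 100%,
where 100% = OD range, or highest band with Rel. to
max. val checked. With Min. profiling = 5% and "Gray
zone" = 5%: below 5%, bands are not selected; between
5% and 10%, bands are marked uncertain; above 10%,
bands are selected.]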
Figure 7-18. Understanding the meaning of the
“gray zone” of uncertain bands in relation to the
minimum profiling.
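The relation between minimum profiling and gray zone can be written out as a small classification rule. In this sketch, the function name, the handling of values that fall exactly on a threshold, and the band heights are assumptions for illustration.

```python
def classify_bands(heights, full_scale, min_profiling=5.0, gray_zone=5.0):
    """Classify band heights as 'selected', 'uncertain' or 'not selected'.

    full_scale is the height that counts as 100%: the OD range, or the
    highest band on the pattern when Rel. to max. val is checked.
    """
    labels = []
    for height in heights:
        percent = 100.0 * height / full_scale
        if percent < min_profiling:
            labels.append("not selected")
        elif percent < min_profiling + gray_zone:
            labels.append("uncertain")  # inside the gray zone
        else:
            labels.append("selected")
    return labels


heights = [200, 15, 8, 120, 30]
# Relative to the highest band on the pattern (Rel. to max. val checked).
labels = classify_bands(heights, full_scale=max(heights))

for height, label in zip(heights, labels):
    print(height, label)
```

With the settings of the example (5% minimum profiling, 5% gray zone), the band at 7.5% of the highest band is marked uncertain and the band at 4% is not selected.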
A Minimum area can also be specified, as percentage of
the total area of the pattern.
A more advanced tool based on deconvolution
algorithms, Shoulder sensitivity, allows shoulders
without a local maximum as well as doublets of bands
with one maximum to be found. If you want to use the
shoulder sensitivity feature, we recommend starting
with a sensitivity of 5, but optimal parameters may
depend on the type of gels analyzed.
7.6.2 Change Minimal profiling to adjust the minimal
peak height (in % of the highest peak of the pattern),
and/or Minimal area to adjust the minimal area, in % of
the total area of the pattern. Usually, setting 5% minimal
profiling will be convenient, whereas the minimal area
can be left zero in most cases. The present example
however, requires a higher minimal profiling (e.g. 10%).
Optionally, you can enter a percentage for uncertain
bands (gray zone). As an example to see what happens,
enter 5%. Click Relative to max. value of lane. Specify a
Shoulder sensitivity only if you want to allow the
program to find band doublets and bands on shoulders
(sensitivity of 5 should be fine for most gels).
7.6.3 Press <OK> to accept the settings.
7.6.4 Select Bands > Auto search bands or the
corresponding toolbar button to find bands on all the
patterns.
Before actually defining the bands on the patterns, the
software displays a preview window (Figure 7-20.). This
preview shows the first pattern on the gel with its curve
and gelstrip. Press the <Preview> button to see what
bands the program finds using the current settings. A
pink mask shows the threshold level based upon both
the minimal profiling and the minimal area (if set). Only
bands that exceed the threshold will be selected. If
inappropriate, the settings can be changed in this
preview window. The sensitivity of this search depends
on the band search settings: If too many (false) peaks are
found, or if real bands are undetected, you can change
the search sensitivity using the band search filters as
described above.
In addition, a blue mask shows the threshold level for
bands that will be found as uncertain (gray zone). All
bands exceeding the pink mask but not exceeding the
blue mask will become uncertain bands.
In the band search preview window, the currently
selected pattern is shown and indicated in the status bar
(bottom). To scroll through other patterns in the
preview, press the < or > button (left and right from the
curve).
You can search for bands on an individual lane by
pressing <Search on this lane>, or on all lanes of the gel
at once by pressing <Search on all lanes>.
7.6.5 Press <Search on all lanes> to start the search on
the full gel.
NOTE: If bands were already defined on the gel, the
program will now ask "There are already some
bands defined on the gel. Do you want to keep
existing bands?". If you answer <No>, the existing
bands will be deleted before the program starts a new
search. By answering <Yes>, you can change the search
settings and start a new search while any work done
previously is preserved.
Bands that were found are marked with a green
horizontal line, whereas uncertain bands are marked
with a small green ellipse (see magnification in
Figure 7-19.).
7.6.6 Add a band with Bands > Add new band, the
ENTER key, or CTRL + left-click.
NOTES:
Figure 7-19. The Fingerprint data editor window. Step 4. Bands.
Figure 7-20. Band search preview window.
(1) The cursor automatically jumps to the closest peak;
to avoid this, hold down the TAB key while clicking on a
band.
(2) When there is evidence of a double band at a certain
position, you can add a band over an existing one
(7.6.6). Double bands (or multiplets) are indicated by
outwards pointing arrows on the band marker. Double
uncertain bands are marked with a filled ellipse instead
of an open ellipse. The clustering and identification
functions using band based similarity coefficients (11.2)
support the existence of double overlapping bands. For
example, two patterns, having a single band and a
double band, respectively, at the same position will be
treated as having one matching and one unmatched
band. Two patterns, each having a double band at the
same position, will be treated as having two matching
bands.
7.6.7 Hold the SHIFT key and drag the mouse pointer
holding the left mouse button to select a group of bands.
7.6.8 Press the DEL key or Bands > Delete selected
band(s) to delete all selected bands.
7.6.9 Select a band and Bands > Mark band(s) as
uncertain (or press F5).
7.6.10 With Bands > Mark band(s) as certain (or press
F6), the band is marked again as certain.
7.7 Advanced band search using size-dependent threshold
In many electrophoresis systems, staining intensity of
the bands is dependent on the size of the molecules. In
DNA patterns stained with Ethidium Bromide for
example (e.g., Pulsed-Field Gel Electrophoresis, PFGE),
larger DNA molecules can capture many more Ethidium
Bromide molecules than small DNA molecules,
resulting in large size bands to appear much stronger
than small size bands.
In other electrophoresis systems, the definition of the
bands (sharpness) might depend on the size, which can
also result in apparent different height depending on the
position on the pattern.
In such systems, a method that uses a single threshold
parameter for finding bands on the patterns (i.e. the
minimum profiling) might not work well: in case of
PFGE for example, in the high molecular weight zone it
might detect spots and irrelevant fragments whereas in
the low molecular weight zone real bands might remain
undetected.
In order to provide a more accurate band search for
patterns with systematic dependence of intensity
according to the position, BioNumerics provides a way
to calculate a regression that reflects the average peak
intensity for every position on the patterns in a given
Fingerprint Type. The only requirement for this method
is that a sufficient number of gels already needs to be
processed, with the bands defined appropriately, before
the regression can be calculated. The user can make a
selection of entries from the database, and based upon
that selection and the bands they contain in the
Fingerprint Type, the regression is established.
To obtain the regression, we proceed as follows.
7.7.1 Open DemoBase in the BioNumerics Main window.
7.7.2 Select Edit > Search entries or press F3 or the
corresponding toolbar button.
This pops up the Entry search dialog box (see 9.3 for
detailed explanation on search and select functions).
7.7.3 In the Entry search dialog box, check RFLP1 and
press <Search>. All entries having a pattern of RFLP
associated are now selected in the database, which is
visible as a blue arrow left from the entry fields (see 9.3).
7.7.4 Under Fingerprint types (Experiment type panel),
double click on RFLP1 to open the Fingerprint type
window.
7.7.5 In the Fingerprint type window, select Settings >
Create peak intensity profile. This pops up the Peak
intensity profile window, a plot of all intensities of the
selected patterns as a function of the position on the
pattern (Figure 7-21.).
7.7.6 Initially, the threshold factor is a flat line at 1.0. By
pressing <Calculate from peaks>, a non-linear
regression is automatically calculated from the
scatterplot (Figure 7-21.).
7.7.7 The regression line contains 5 nodes, of which the
position can be changed independently by the user. To
change a node’s position, click and hold the left mouse
button and move the node to the desired position.
7.7.8 The regression can be reset to a flat line using the
<Reset> button. To confirm and save the regression,
press <OK>.
The regression can be edited anytime later by opening
the Peak intensity profile window again (7.7.5). As a result
of creating a peak intensity regression curve, the
minimum profiling threshold (7.6.2) will be dependent
on the curve. The value entered for the minimum
profiling will correspond to the highest value on the
intensity profile regression curve (the outermost left
point in Figure 7-21.). Therefore, after creating an
intensity profile regression, you may have to increase
the minimum profiling setting to find the bands
optimally: noise and irrelevant peaks will be filtered out
in the high intensity areas whereas faint bands will still
be detected in the low intensity areas.
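The effect of the peak intensity regression on the minimum profiling threshold can be illustrated with a minimal sketch. It assumes that the curve is a piecewise-linear interpolation through the five nodes and that the entered minimum profiling value is scaled by the ratio of the local curve value to the highest curve value. The manual does not state the exact formula, so the function names, the node values and the scaling rule are all illustrative assumptions.

```python
def interpolate(x, xs, ys):
    """Piecewise-linear interpolation; xs must be increasing."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for i in range(1, len(xs)):
        if x <= xs[i]:
            t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
            return ys[i - 1] + t * (ys[i] - ys[i - 1])

def local_threshold(position, node_pos, node_intensity, min_profiling):
    """Assumed rule: the entered value applies at the highest point of the curve."""
    curve = interpolate(position, node_pos, node_intensity)
    return min_profiling * curve / max(node_intensity)

# Five hypothetical nodes: position (% of pattern length), average peak intensity
node_pos = [0.0, 25.0, 50.0, 75.0, 100.0]
node_intensity = [200.0, 150.0, 100.0, 60.0, 40.0]

# Hypothetical candidate peaks as (position %, height)
peaks = [(10.0, 15.0), (50.0, 12.0), (90.0, 6.0)]
kept = [(p, h) for p, h in peaks
        if h >= local_threshold(p, node_pos, node_intensity, min_profiling=20.0)]
```

In this sketch the peak of height 15 in the high intensity zone is rejected, whereas the fainter peak of height 6 in the low intensity zone is kept, which is the behavior described above.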
7.8 Quantification of bands
The right panel of the window shows the densitometric
curve of the selected pattern. For each band found, the
program automatically calculates a best-fitting Gaussian
curve, which makes more reliable quantification
possible.
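The manual does not describe how the best-fitting Gaussian curve is computed. As a rough illustration only, the sketch below estimates the center, width, height and one-dimensional surface of a single peak by moment matching, which is a simpler technique than a least-squares fit. It assumes an evenly sampled, baseline-free densitometric curve; the function name and the test data are hypothetical.

```python
import math

def gaussian_from_moments(xs, ys):
    """Estimate centre, width, height and area of a single, baseline-free peak."""
    step = xs[1] - xs[0]                      # assumes evenly spaced samples
    total = sum(ys)
    mu = sum(x * y for x, y in zip(xs, ys)) / total
    var = sum((x - mu) ** 2 * y for x, y in zip(xs, ys)) / total
    sigma = math.sqrt(var)
    area = total * step                       # one-dimensional surface
    height = area / (sigma * math.sqrt(2.0 * math.pi))
    return mu, sigma, height, area

# Synthetic peak: centre 50, width (sigma) 3, height 120
xs = [float(i) for i in range(35, 66)]
ys = [120.0 * math.exp(-0.5 * ((x - 50.0) / 3.0) ** 2) for x in xs]
mu, sigma, height, area = gaussian_from_moments(xs, ys)
```

On real curves, overlapping bands and background would require a proper iterative fit; the sketch only shows which quantities such a fit delivers.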
7.8.1 Select a band on a pattern.
7.8.2 Show rescaled curves with Edit > Rescale curves.
Figure 7-21. The Peak intensity profile window with peak intensity regression
curve.
molecular weight), the height, and relative one-dimensional surface, as calculated by Gaussian fit.
7.8.3 Zoom in on the band by pressing the zoom-in button
repeatedly. Figure 7-22. shows a strongly zoomed band
with its densitometric representation and the Gaussian
fit (red). The blue points are dragging nodes where you
can change the position and the shape of the Gaussian fit
for each band separately.
Once bands are defined, two-dimensional quantification
is done as follows.
7.8.6 Bring the window in Quantification mode with the
quantification button or Quantification > Band
quantification. The quantification button changes its
appearance, and two more quantification buttons are
shown.
7.8.7 To find the surfaces (contours) of the bands, use
Quantification > Search all surfaces or the corresponding
button.
If you have added a band later, you can search the
surface of that band alone with Quantification > Search
surface of band.
Figure 7-22. Zoomed band with its densitometric
curve and best-fitting Gaussian approach.
7.8.4 Save the gel with File > Save (F2) or the
corresponding toolbar button.
7.8.5 It is possible to generate a printout or a text file of
the complete band information of the gel, by selecting
the command File > Print report or File > Export report,
respectively.
The file lists all the bands defined for each pattern with
their normalized relative positions, the metrics (e.g.
When the contours are found, the program shows for
each selected band its volume in the status bar: the sum
of the densitometric values within the contour.
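The volume definition given above (the sum of the densitometric values within the contour) can be written as a short sketch. The image and the contour mask below are hypothetical toy data, and representing the contour as a boolean mask is an assumption made for illustration.

```python
def band_volume(image, inside_contour):
    """Sum the densitometric values of all pixels that lie within the contour."""
    return sum(value
               for row, mask_row in zip(image, inside_contour)
               for value, inside in zip(row, mask_row)
               if inside)

# Toy gel strip (rows run along the lane) and a mask marking the band contour
image = [[0, 1, 0],
         [2, 9, 3],
         [1, 8, 2],
         [0, 1, 0]]
inside_contour = [[False, False, False],
                  [True,  True,  True],
                  [True,  True,  True],
                  [False, False, False]]
volume = band_volume(image, inside_contour)   # 2 + 9 + 3 + 1 + 8 + 2
```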
7.8.8 To change the contour of a band manually, first
select the band and zoom in heavily (7.8.1 and 7.8.3).
7.8.9 Hold the CTRL key and drag the mouse (holding
left button) to correct the upper and lower contours.
7.8.10 For known reference bands, you can enter a
concentration value by selecting the band and choosing
Quantification > Assign value (or by using the floating
menu obtained by right-clicking, or by double-clicking).
Known reference bands are marked with a dedicated
symbol.
7.8.11 Once multiple reference bands are assigned their
concentrations, a regression to determine each unknown
band concentration is calculated by selecting
Quantification > Calculate concentrations.
The Band quantification window (Figure 7-23.) shows the
real concentration as a function of the band volumes,
using cubic spline regression functions.
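The calibration principle can be sketched as follows. For brevity the sketch uses piecewise-linear interpolation between the reference bands instead of the cubic spline regression used by the program, and all volumes and concentrations are hypothetical values.

```python
# Hypothetical reference bands: measured volume and assigned concentration
ref_volume = [500.0, 1000.0, 2000.0, 4000.0]
ref_conc = [5.0, 10.0, 20.0, 40.0]

def estimate_concentration(volume):
    """Piecewise-linear stand-in for the cubic spline regression."""
    if volume <= ref_volume[0]:
        return ref_conc[0]
    if volume >= ref_volume[-1]:
        return ref_conc[-1]
    for i in range(1, len(ref_volume)):
        if volume <= ref_volume[i]:
            t = (volume - ref_volume[i - 1]) / (ref_volume[i] - ref_volume[i - 1])
            return ref_conc[i - 1] + t * (ref_conc[i] - ref_conc[i - 1])

unknown = [estimate_concentration(v) for v in (750.0, 3000.0)]
```

Note that this stand-in clamps volumes outside the range of the reference bands to the nearest reference value; how the program extrapolates is not described in the manual.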
Figure 7-23. Band quantification window:
concentration as a function of known band volumes.
7.8.12 Save the gel with File > Save (F2) or the
corresponding toolbar button in order to store the
quantification data.
7.8.13 It is possible to generate a printout or a text file of
the complete two-dimensional band information of the
gel, by selecting the command File > Print report or File
> Export report, respectively.
The file lists all the bands defined for each pattern with
their normalized relative positions, the absolute volume,
and if regression is done, the relative volume as
determined by the calibration bands.
We are now at a point where we can discuss the
functioning of the reference system. We will explain how
to calculate molecular weights for the Fingerprint Type
and how to link a standard pattern to the Fingerprint
Type.
7.8.14 Exit the Fingerprint data editor window: File > Exit.
The program asks “Settings have been changed. Do you
want to use the current settings as new defaults?”. This
question is asked when changes have been made to the
Fingerprint Type-related settings, for example the
gelstrip thickness, the rolling disk size, etc. If you
answer <Yes>, the settings used for this gel will be saved
in the Fingerprint Type’s settings, and all new gels will
be processed using the same settings.
7.8.15 Answer <Yes> to save the changes made into the
Fingerprint Type settings.
NOTE: Answering <Yes> to the above question has the
same effect as the menu function Edit > Save as
default settings in the Fingerprint data editor
window. Conversely, the current default settings can
be copied to the current gel with Edit > Load default
settings.
To show that the reference system is now defined for
our gel type RFLP, we will open the Fingerprint type
window.
7.8.16 In the Main window, select RFLP under
Fingerprint types in the experiment types panel (see
Figure 6-1.). Double-click on RFLP, or select
Experiments > Edit experiment type in the main menu.
This opens the Fingerprint type window (Figure 7-24.).
Figure 7-24. The Experiment type window; Standard
is not yet defined.
7.8.17 The Fingerprint type window allows you to change
all settings which we have defined when creating the
Fingerprint Type, and when processing the first gel with
Settings > General settings or the corresponding toolbar
button.
One setting which we have not discussed during the
normalization of the example gel is the Normalization
tab. This tab shows the Resolution of normalized tracks
as only setting. In reality, the program always stores the
real length of the raw patterns. For display purposes
however, the program converts the tracks to the same
length in real time, so that the gel strips are properly
aligned to each other. For comparison of patterns by
means of the Pearson product-moment correlation also,
the densitometric curves need to be of the same length.
Thus, the resolution value only influences two features:
the length of the patterns shown on the screen, and the
length (resolution, number of points) of the
densitometric curves to be compared by the Pearson
product-moment correlation coefficient. By default, the
program uses 600 as resolution, but when you normalize
the first gel, the program automatically uses the average
track length for that gel as the new resolution value.
Whenever you save the gel, and the value differs by more
than 50% from the default value, BioNumerics will ask
you to copy the resolution of the current gel to the
default for the Fingerprint Type (see 7.3.13). Another
option is Bypass normalization. You can use this option
to have the program process the densitometric curves of
the tracks without any change. This option is only useful
to import patterns in BioNumerics that are already
normalized, and for which you want the values of the
densitometric curves to remain exactly the same after
the normalization process.
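The role of the resolution value in curve comparison can be illustrated with a small sketch: two tracks of different raw length are resampled to the same number of points before the Pearson product-moment correlation is calculated. The linear interpolation, the function names and the synthetic tracks are assumptions made for illustration; the manual does not specify how the program resamples the curves.

```python
import math

def resample(curve, resolution):
    """Linearly interpolate a curve to `resolution` points."""
    n = len(curve)
    out = []
    for k in range(resolution):
        pos = k * (n - 1) / (resolution - 1)
        i = min(int(pos), n - 2)
        t = pos - i
        out.append(curve[i] + t * (curve[i + 1] - curve[i]))
    return out

def pearson(a, b):
    """Pearson product-moment correlation of two equally long curves."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def track(length, centre=0.3, width=0.02):
    """Synthetic track with one band at a relative position `centre`."""
    return [math.exp(-((i / (length - 1) - centre) / width) ** 2)
            for i in range(length)]

resolution = 600                       # the default resolution mentioned above
a = resample(track(450), resolution)   # raw track of 450 points
b = resample(track(700), resolution)   # raw track of 700 points
similarity = pearson(a, b)
```

Because both synthetic tracks carry the same band at the same relative position, the correlation is close to 1 once they are brought to the same length.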
The default brightness and contrast setting can be
changed with Layout > Brightness & contrast or the
corresponding toolbar button, and the quantification
settings with Settings > Comparative quantification or
the corresponding toolbar button. Further settings
include the comparison settings, and the position
tolerance settings, which will be discussed later.
The Fingerprint type window shows the defined reference
positions in relation to the distance on the pattern (in
percentage), and calls this reference system R01. Other
reference systems (if created) will be called R02, R03 etc.
Currently, R01 is shown in red because it is the active
reference system.
In this window, the panel for the Standard is still blank:
the Fingerprint Type still misses a standard pattern. The
standard pattern actually has no essential contribution
to the normalization; it is only intended to show a
normalized reference pattern next to the reference
positions, in order to make visual assignment of bands
to the reference positions easier. Another feature for
which the standard is required is the automated
normalization by pattern recognition. This algorithm
requires a curve of a normalized reference pattern to be
present in order to be able to align other reference
patterns to it.
Now, link a standard to the Fingerprint Type as follows:
7.8.18 Close the Fingerprint type window for now (File >
Exit).
7.8.19 Select the gel file in the experiment files panel
(Figure 6-1.) and choose File > Open experiment file
(entries) from the main menu.
This opens the Fingerprint entry file window, listing the
lanes defined for the example gel (Figure 7-25.).
Figure 7-25. The Fingerprint entry file.
These lanes are not linked to database entries yet. A link
arrow for each lane allows you to link a lane to a
database entry, by clicking on the arrow and dragging it
onto a database entry, and then releasing the mouse
button. When the experiment is linked, its link arrow is
purple. The window also shows the Fingerprint
Type of the gel, the reference system according to which
the gel is normalized, and the reference positions of this
reference system.
7.8.20 In the Main window, add a new database entry
with Database > Add new entries (see 6.2.4 to 6.2.5).
7.8.21 Edit the new entry’s information fields (see 6.4.1
to 6.4.2) and enter STANDARD as genus name.
7.8.22 Drag the link arrow of lane 9 to the new database
entry ‘STANDARD’: pattern 9 is now linked to this
database entry.
7.8.23 Select the lane marked as STANDARD, and
choose Database > Set lane as standard. The program
will ask for confirmation.
Alternatively, the standard can also be assigned using a
drag-and-drop operation from the Fingerprint Type
window, as follows:
7.8.24 Close the Fingerprint entry file window with File >
Exit.
7.8.25 In the Main window, open the Fingerprint Type
window again for RFLP (7.8.16).
7.8.26 Link a reference lane (for example lane 9) to the
Fingerprint Type by dragging the link arrow button to the
database entry STANDARD.
The standard pattern is now displayed in the standard
panel next to the reference positions, and the database
entry key of the standard is indicated next to the link
arrow (Figure 7-26.). From this point on, all further gels
that are normalized will display the standard pattern
to the left of the gel panel in the normalization step. This
makes manual association of peaks easier and allows
automated alignment using curve matching.
Figure 7-26. The Experiment type window; Standard
is defined.
NOTE: The choice of a standard has no influence on the
normalization process, since it is only used as a visual
aid. One can change the standard pattern at any time
later on, e.g. if another reference pattern appears to be
more suitable for this purpose.
The molecular sizes of the bands are not calculated within
a particular gel file, but for a whole reference system.
This means that, once you have created a reference
system and normalized one gel, you can define the
molecular size regression for all further gels that will be
normalized using the same reference system.
7.8.27 In the Fingerprint type window for RFLP, call
Settings > Edit reference system (or double-click in the
R01 panel). This pops up the Reference system window for
Fingerprint Type RFLP (Figure 7-27.).
Figure 7-27. The Reference system window, showing
molecular weight regression and remapping
function to the active reference system (if different).
Initially, the regression cannot be calculated, since the
program doesn’t know where to take the marker points
from. The message “Could not calculate calibration curve.
Not enough markers” is displayed.
7.8.28 You can add the markers manually (Metrics >
Add marker), but if you have entered the molecular
weights as names for the reference positions (see 7.5.4
and 7.5.5), the obvious solution is to copy these
molecular weights: Metrics > Copy markers from
reference system.
The result is a regression curve, shown in Figure 7-27.
As regression function, you can choose between a first
degree, third degree, cubic spline, and pole fit, and each
of these functions can be combined with a logarithmic
dependence.
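As an illustration of what a regression with logarithmic dependence means, the sketch below fits a first degree function to the logarithm of the marker sizes as a function of the position on the pattern. The marker positions and sizes are hypothetical, and the least-squares fit is only a minimal stand-in for the regression functions offered by the program.

```python
import math

# Hypothetical markers: position on the pattern (%) and molecular size (bp)
marker_pos = [10.0, 30.0, 50.0, 70.0, 90.0]
marker_size = [10000.0, 3162.0, 1000.0, 316.0, 100.0]

def fit_log_linear(positions, sizes):
    """Least-squares line through (position, log10(size))."""
    logs = [math.log10(s) for s in sizes]
    n = len(positions)
    mean_x = sum(positions) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in positions)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(positions, logs))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_log_linear(marker_pos, marker_size)

def size_at(position):
    """Molecular size predicted for a band at the given position (%)."""
    return 10.0 ** (slope * position + intercept)
```

With these markers, `size_at(50.0)` returns a value close to 1000 bp; a cubic spline fit would follow the markers more closely when the relation is not a straight line on the logarithmic scale.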
7.8.29 For this example, choose Metrics > Cubic spline
fit with Logarithmic Dependence.
7.8.30 Choose a unit with Metrics > Assign unit, and
enter bp (base pairs).
7.8.31 Close the Reference system window, and close the
Fingerprint type window.
7.9 Adding the gel lanes to the database
In Paragraph 6.2, we have seen how entries are added to
the database. Once these entries are defined in the
database, it is easy to link the experiments, which are gel
lanes in this case, to the corresponding entries. We have
done so with the STANDARD lane, explained in the
previous paragraph. In summary, adding lanes to the
database and linking experiments to them works as
follows:
7.9.1 Select Database > Add new entries or the
corresponding button in the toolbar.
7.9.2 Enter the number of entries you want to create, e.g.
1, and press <OK>.
The database now lists one more entry with a unique
key automatically assigned by the software.
7.9.3 Select the gel file in the experiment files panel
(Figure 6-1.) and choose File > Open experiment file
(entries) from the main menu.
This opens the Fingerprint entry file window, listing the
lanes defined for the example gel (Figure 7-25.).
These lanes are not linked to database entries yet. A link
arrow for each lane allows you to link a lane to a
database entry, by clicking on the arrow and dragging it
onto a database entry, and then releasing the mouse
button. When the experiment is linked, its link arrow is
purple.
7.9.4 Drag the link arrow of lane 2 (lane one is a
reference) to the new database entry: as soon as you pass
over a database entry, the cursor shape changes.
7.9.5 Release the mouse button on the above created
database entry; pattern 2 is now linked to this database
entry, and its arrow in the Fingerprint entry file window
has become purple instead of gray.
The corresponding menu command is Database > Link
lane. The program asks you to enter the key of
the database entry to which the experiment is to be
linked.
NOTES:
(1) If you try to link a lane to an entry which already
has a lane of the same experiment type linked to it, the
program will ask whether you want to create a
duplicate key for this entry. This feature is very
useful in case you want to define experiments that are
run in duplicate for one or more organisms. Rather than
overwriting the first entry or disregarding duplicate
entries, BioNumerics automatically considers them as
duplicates and assigns an extension /# x to such
duplicates. In case for a given entry a duplicate already
exists (after import of another experiment),
BioNumerics will automatically fill such existing
duplicates that are still empty for the experiment type
that is being imported. Database fields are
automatically taken over from the "master" entry, i.e.
the entry without extension. If the database fields from
the "master" entry are changed, the /# x duplicates are
automatically changed accordingly.
(2) If you enter an entry key which does not already
exist, the program asks whether you want to create an
entry with that key.
As soon as an experiment is linked to a database entry,
the experiment presence panel (see Figure 6-1.) shows a
green dot for the experiment of this entry.
7.9.6 You can click on a green dot, which pops up the
Experiment card for that experiment (see 8.1).
7.9.7 You can edit the information fields for this entry in
two places: directly in the database (see 6.4.1 to 6.4.2), or
in the Fingerprint entry file window, by double clicking
on the entry.
If no database entries are defined for the current gel
lanes, you can have the program create new entries and
link the gel lanes automatically in a very simple way:
7.9.8 In the Fingerprint entry file window, select Database
> Add all lanes to database. All lanes that were
not linked yet, will be added as new entries to the
database, with the gel lanes linked.
NOTE: In some cases, a gel can be composed of patterns
belonging to different Fingerprint Types. For example,
if you are running digests by three different restriction
enzymes for the same set of organisms, for some
remaining entries, you may want to run all three RE
digests on the same gel. In this case, you should process
the gel according to one of the Fingerprint Types, and
then, in the Fingerprint entry file window, select a
lane that belongs to another Fingerprint Type and choose
Database > Change fingerprint type of lane. A
condition for this feature to work is that both
Fingerprint Types are based upon the same reference
system (the same set of reference markers, defined
consistently using the same names). If the reference
system for both Fingerprint Types is not the same, the
software can still use the molecular weight calibration
curves as a basis for conversion, if these are defined.
If you do not wish to add all lanes to the database, you
can select individual lanes, and use the menu command
Database > Add lane to database.
You can unlink a gel lane from the database using
Database > Remove link. All entries from the
gel are unlinked at once using Database > Remove all
links.
7.10 Superimposed normalization based
on internal reference patterns
This paragraph describes how to normalize patterns
based upon “inline” reference patterns, i.e. reference
patterns that are loaded in each lane, but that are
revealed using a different color dye or hybridization
probe. Examples within this category are (1) the
multichannel automated sequencer chromatograms and
(2) RFLP gels that contain internal reference patterns
which are visualized using a different color dye or
hybridization probe.
In case (1), a special import program, CrvConv, is
required to convert the multichannel sample
chromatogram files into the BioNumerics curve format.
It can read chromatogram files from ABI, Beckman, and
Amersham MegaBace. CrvConv splits the multichannel
sample files into separate gel files for each available
channel (color). The gel files are automatically saved in
the IMAGES directory of the selected database.
Logically, the separate gels all contain the same lanes at
the same position. One of the gels contains the internal
reference patterns, whereas the other gel (or gels)
contain the real data samples, to be normalized
according to the reference patterns. The aim is to
normalize the obtained reference gel, and to
superimpose the normalization on the other gel(s). The
only difference with TIFF files is that there are no
two-dimensional gelstrips available for the ABI sequencer
patterns. BioNumerics creates reconstructed gelstrips
instead. A non-reference gel (i.e. a real data gel) is
normalized by first normalizing the reference gel (i.e.
the gel containing the internal reference patterns), and
then copying the normalization of the reference gel to
the data gel. This can be done easily by simply linking
the data gel(s) to the corresponding reference gel: each
data gel is automatically updated when anything in the
conversion and normalization of the reference gel is
changed.
In case (2), the conversion from the ABI sample files is
not needed, but on the other hand, an initial alignment
between the images of the reference gel and the data gel
is needed, since both images are not usually scanned in
exactly the same position. The further steps are the same
as for the ABI gels: A non-reference gel (i.e. a real data
gel) is normalized by first normalizing the reference gel
(i.e. the gel containing the internal reference patterns),
and then copying the normalization of the reference gel
to the data gel, or linking the data gel to the reference
gel.
A. Multichannel sequencer gels
7.10.1 In the BioNumerics startup program, press the
<Filters> button and select CrvConv.
7.10.2 Open the ABI sample chromatogram files in the
CrvConv window with File > Import curves from file. A
set of example files together composing one gel can be
found on the CD-ROM, in directory EXAMPLES\ABI.
If you are using this example, select all files.
7.10.3 The program may produce a warning that the
curve order is not specified, and that the default setting
CTAG will be used. If you want to change the colors for
the curves, you can use View > Customize colors.
7.10.4 If necessary, you can change the order of the lanes
with Edit > Move curve up and Edit > Move curve down,
or using the + and - keys.
7.10.5 You can remove lanes if necessary with Edit >
Remove curve.
7.10.6 Select File > Export curves to save the curve gel
files in the default directory for the active database. Use
for example the name ABIGEL01.
7.10.7 The program now asks "Do you want to reverse
the curves?". If you know the top of the lanes is at the
end of the curves, press <Reverse>, otherwise, press
<Don't reverse>.
The program adds a number 01, 02, 03, 04 (the program
supports up to 8 channels per lane) to each gel,
depending on the color, and adds the extension .CRV to
each gel.
7.10.8 Run the Analyze program and create a new
Fingerprint Type (7.1), e.g. ABI, specifying
Densitometric curves when the wizard asks “What kind
of fingerprint type do you have?”.
7.10.9 Specify 12-bit OD range (4096 gray levels).
When finished, two new files are listed in the Files
panel: ABIGEL01_02 and ABIGEL01_04.
7.10.10 Process gel ABIGEL01_04 with File > Open
experiment file (data), and assign it to Fingerprint Type
ABI.
We select this gel first because it is the one containing
the reference patterns. The lanes are shown as
reconstructed pattern images. It may be necessary to
adjust the brightness and contrast (Edit > Change
brightness & contrast), by enabling the Dynamical
preview and slowly moving down the Maximum value
until the darkest bands are (nearly) black.
7.10.11 Define the lanes with Lanes > Auto search lanes.
You will notice that some setting options that apply to
TIFF files are not available here: for example the
Gelstrip thickness, the Number of nodes.
7.10.12 Move on to the next step with the next-step
button. This shows the densitometric curves.
Here again, Average thickness and Number of nodes do
not apply. You may want to adjust the background
subtraction and the filtering as described in 7.2.
7.10.13 Move on to the Normalization step with the
next-step button.
7.10.14 Locate a suitable standard, place the gel in
Normalized view, and define the reference
bands. The example uses the reference mix from ABI,
containing 13 bands with known molecular weight.
7.10.15 Select References > Use all lanes as reference
lanes to mark all 16 lanes as Reference lane, and align
the bands with Normalization > Auto assign (bands).
7.10.16 Update the normalization with the corresponding
button and save the normalized reference gel.
7.10.17 Do not close the Fingerprint data editor window for
gel ABIGEL01_04, but reduce it or move it away from
the center of the screen so that the Experiments and
Files panels of the main analyze screen become
available.
7.10.18 Select the data gel ABIGEL01_02 with File >
Open experiment file (data), and assign it to Fingerprint
Type ABI.
7.10.19 Link this gel to the reference gel by dragging the
grayed linkage button of the data gel and dropping it on
the window of the reference gel.
You can also use the menu command File > Link to
reference gel, and enter the name of the reference gel.
The program now asks “Link this gel to the reference gel
ABIGEL01_04?”.
7.10.20 If you answer <Yes>, the linkage button becomes
ungrayed.
7.10.21 Once the data gel is linked, you can close the
Fingerprint data editor window of the reference gel.
The tracking info, curve settings, and alignment of the
reference gel are now automatically superimposed to
the data gel. You can run through the different steps till
you reach the normalization step: the alignments as
obtained in the reference gel are shown. If you wish, you
can show the normalized view before you move to the
last step, i.e. defining bands.
7.10.22 Whenever needed, you can pop up a reference
gel to which a data gel is linked by clicking the
corresponding button (File > Open reference gel).
7.10.23 If you have made changes to the reference gel
without saving them, you can update the changes to the
data gel by right clicking on the same button (File >
Update linked information). Once you save the changes
to the reference gel, the data gel(s) are updated
automatically.
B. RFLP gelscans containing internal markers
The way of processing the gels is similar to that described
for the ABI multichannel files, except that we start from
two independent TIFF files here. An absolute condition
is that the two TIFF files, containing the references and
the datalanes respectively, have exactly the same
resolution (dpi). If the images are shifted or rotated, they
can be aligned to each other by applying two or more
marker points to the gel. These marker points will be
visible on the TIFF files, and the software allows such
markers to be used to align the images.
7.10.24 Read the TIFF image of the reference gel.
If the reference gel and the data gel need to be aligned to
each other, you should define marker points as follows:
7.10.25 In the first step (1. Strips), select Lanes > Add
marker point and click on the first marker point of the
gel.
7.10.26 Repeat the same action for the other marker
points.
At least two marker points should be present before the
program can copy the geometry from one gel to another.
If the TIFF images are already aligned (for example,
when different fluorescent markers are used in the same
gel, which are visualized at the same time), you should
not add marker points.
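Why two marker points are enough to align shifted or rotated images can be shown with a small sketch. Two point pairs determine a rotation, a translation and a uniform scale, written here with complex numbers. This is an illustrative assumption about the geometry only; the manual does not describe the transformation the program applies, and the coordinates are hypothetical.

```python
def transform_from_two_points(src, dst):
    """Return (a, b) such that dst = a * src + b for points encoded as complex numbers."""
    s0, s1 = (complex(*p) for p in src)
    d0, d1 = (complex(*p) for p in dst)
    a = (d1 - d0) / (s1 - s0)   # rotation and scale
    b = d0 - a * s0             # translation
    return a, b

def apply_transform(a, b, point):
    """Map one (x, y) point from the reference image to the data image."""
    z = a * complex(*point) + b
    return (z.real, z.imag)

# Hypothetical marker coordinates (pixels): the data gel is shifted by (10, 20)
ref_markers = [(100.0, 50.0), (900.0, 50.0)]
data_markers = [(110.0, 70.0), (910.0, 70.0)]

a, b = transform_from_two_points(ref_markers, data_markers)
mapped = apply_transform(a, b, (500.0, 300.0))
```

Since both TIFF files must have the same resolution, the scale factor `abs(a)` is expected to be 1 in practice; a single marker point would fix the translation but leave the rotation undetermined.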
7.10.27 Proceed with the full normalization of the
reference gel as described in 7.2. Save the file but do not
close the window.
7.10.28 Open the data gel and assign it to the same
Fingerprint Type.
7.10.29 Link this gel to the reference gel by dragging the
grayed linkage button of the data gel and dropping it on
the window of the reference gel.
The program now asks “Link this gel to the reference gel
GELNAME?”.
7.10.30 If you answer <Yes>, the linkage button becomes
ungrayed.
7.10.31 Once the data gel is linked, you can close the
Fingerprint data editor window of the reference gel.
The tracking info, curve settings, and alignment of the
reference gel are now automatically superimposed to
the data gel. In the second step (2. Curves), it is still
possible to adjust the position of the track splines
individually, or to add nodes and distort the curves
where necessary. You can run through the different
steps till you reach the normalization step: the
alignments as obtained in the reference gel are shown. If
you wish, you can show the normalized view before you
move to the last step, i.e. defining bands.
7.10.32 Whenever needed, you can pop up a reference
gel to which a data gel is linked by clicking the
corresponding button (File > Open reference gel).
7.10.33 If you have made changes to the reference gel
without saving them, you can update the changes to the
data gel by right clicking on the same button (File >
Update linked information). Once you save the changes
to the reference gel, the data gel(s) are updated
automatically.
NOTE: It is also possible to copy the geometry and
normalization from one gel to another without linking
them. In the reference gel, go back to the first step (1.
Strips) and select Lanes > Copy
geometry. In the data gel, use Lanes > Paste
geometry to copy the gelstrip definition from the
reference gel. The normalization from the reference gel
is copied with References > Copy normalization
and References > Paste normalization in the
normalization step. This approach may offer additional
flexibility in special cases, but is not recommended.
7.11 Import of molecular size tables as
Fingerprint Type
BioNumerics allows the input of band size and band
position tables, and reconstructs fingerprints from these,
based upon the size and the amplitude (area or height)
of the peaks.
7.11.1 In the Example database, create a new Fingerprint
Type ABI-Genescan. Leave every setting as default
except in the second step, where you should specify
Densitometric curves and 16-bit (65536 values).
Sample &   Running  Size in  Height  Volume  Running  %RD
band no.   time     bp                       time
17B,1       33.00    60.47     228     929     330    90.9
17B,2       34.60    67.53     201     815     346    86.7
17B,3       43.30   106.02     113     855     433    69.3
17B,4       52.90   146.14     381    1908     529    56.7
17B,5       88.20   298.95     131     690     882    34.0
17B,6       89.00   302.68    1425    7821     890    33.7
17B,7      155.40   709.46     304    1800    1554    19.3
17B,8      158.50   736.00     182     966    1585    18.9
17B,9      165.10   796.02     121     713    1651    18.2
Figure 7-28. Lane in an ABI Genescan table and
conversion of running distances to BioNumerics
positions.
We will now create a new reference system to allow the
import of an ABI Genescan table, part of which is shown
in Figure 7-28. below. The whole file (5 patterns) can be
found in the Examples\TXTfiles directory on the
installation CD-ROM, as Genescan.txt. There are two
possible approaches to create a new reference system:
•Enter positions on the gel (running distances) and the
corresponding band sizes. Based upon the positions
and the corresponding sizes, the program is able to
establish a regression curve, upon which all imported
bands can be mapped. This option is particularly
suitable when you know the exact positions of the size
markers in a gel system, and you want to reproduce
the real regression exactly.
•Allow the program to create its own regression curve
between a defined maximum and minimum
molecular weight, so that it can map the imported
bands on this synthetic regression curve. This method
is useful if you want to import band tables of which
you know nothing else than the sizes.
We will focus on the example Genescan file
Genescan.txt to apply both methods. The file format
contains a column with the sample number, a comma
and then the band number (Figure 7-28.), next is a
column with the running time, next is the size in
basepairs, then the height, the volume, and the running
time again.
7.11.6 Make the new reference system the active reference
system by selecting it and Settings > Set as active
reference system (not necessary if the reference system is
the only one available).
7.11.7 Select Settings > Edit reference system or double-click to define the molecular weight regression.
7.11.8 In the Reference system window, copy the entered
molecular weights with Metrics > Copy markers from
reference system.
BioNumerics is now configured to import the Genescan
tables.
Figure 7-29. Defining a new reference system based
upon known band positions and sizes.
Option 1: Composing a regression curve by
entering positions and sizes.
The running distance needed by BioNumerics is
reciprocal to the running time given in the Genescan file
(second column). Therefore, we will calculate the
reciprocal value of the running time, keeping in mind
that this value should never exceed 100%. Thus in order
to calculate a running distance of a band (RD), we look
for the lowest running time (RTmin) in the file (highest
running distance), divide this number by the actual
running time (RT) of that band and multiply by 100 to
have it in percent:
%RD = (RTmin / RT) × 100
In the example, RTmin = 30. For the reference lane 17B,
this yields the extra column under the running time
column (Figure 7-28.)
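The conversion can be checked with a few lines of code. The sketch below applies the formula to the running times of lane 17B listed in Figure 7-28, with RTmin = 30 as stated above; the function and variable names are illustrative.

```python
rt_min = 30.0   # lowest running time in the file, as stated in the text

# Running times of lane 17B, bands 1 to 9 (Figure 7-28)
running_time = [33.00, 34.60, 43.30, 52.90, 88.20, 89.00, 155.40, 158.50, 165.10]

def running_distance_percent(rt, rt_min):
    """%RD = (RTmin / RT) x 100"""
    return rt_min / rt * 100.0

rd = [round(running_distance_percent(rt, rt_min), 1) for rt in running_time]
# rd == [90.9, 86.7, 69.3, 56.7, 34.0, 33.7, 19.3, 18.9, 18.2]
```

The results reproduce the extra column of Figure 7-28, and no value exceeds 100% because RTmin is the lowest running time in the file.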
7.11.9 Exit the Reference system window and the
Fingerprint type window.
To import ABI Genescan files, there are scripts available
on the website of Applied Maths. These scripts can be
launched from the BioNumerics Main window, using
the menu Scripts > Browse Internet, or the corresponding
toolbar button. The
script to import Genescan data can be found under
Import tools and is called Import ABI Genescan tables.
A description of how to use this script is available on the
website.
7.11.10 When running the script, you can use the
example Genescan file on the CD-ROM:
Examples\TXTfiles\Genescan.txt.
Option 2: Importing band sizes by using a
synthetic regression curve
As an exercise, we will now import the same file using
the second option described above, i.e. allowing the
program to create its own regression curve.
7.11.11 In the Main window, open the Fingerprint Type
window for ABI-Genescan.
Based upon this running distance in percent and the
band sizes, we can create a realistic regression curve
according to the first approach described above.
7.11.12 In the ABI-Genescan Fingerprint type window,
select Settings > New reference system (curve) .
7.11.2 In the Fingerprint type window, select Settings >
New reference system (positions) .
The New reference system window (Figure 7-30.) allows the
size range to be specified as well as the type and
strength of the regression.
The input box shown in Figure 7-29. allows all known
reference bands to be entered.
7.11.13 Under Metrics range of fingerprint, enter 1000 as
Top and 30 as Bottom.
7.11.3 Press the <Add> button and enter all running
distances and sizes of lane 17B, as shown in Figure 7-29.
7.11.14 Press the <Add> button to add the sizes for all
reference bands available in the Fingerprint Type (see
lane 17B, Figure 7-28.).
7.11.4 Enter a name for the reference system, e.g. ABI.
7.11.5 When finished, press <OK>.
Warning: once a new reference system has been defined,
it cannot be changed anymore! If you want to change a
self-made reference system after it has been saved, you
will have to delete it and create it again.
The reference bands are shown as red dots on the
regression curve. This makes the adjustment of the
Calibration curve easier.
7.11.15 Optimize the Calibration curve and the strength
in % to obtain the best spread of the reference bands.
Chapter 7 - Setting up experiments
Figure 7-30. Defining a new reference system using
a synthetic regression curve between user-defined
size limits.
7.11.16 When finished, press <OK> to save the new
reference system.
7.11.17 Make the new reference system the active
reference system by selecting it and choosing Settings >
Set as active reference system.
Now you can import the same band table as described in
7.11.9 and further. When creating the database file, you
should change the name Genescan.txt into another
name, for example Genescan2.txt, because the program
does not allow existing database files to be overwritten.
The two differently imported band size tables are an
excellent example to illustrate the remapping functions in
BioNumerics. Both gels have their bands at different
positions because a different logarithmic function was
used to reconstruct each gel.
7.11.18 Select an entry of the first imported file (should
be [email protected]@[email protected] or similar if you used
other names).
7.11.19 Select the corresponding entry of the second
imported file (should be [email protected]@[email protected]
or similar if you used other names).
7.11.20 Create a comparison containing these entries and
select Layout > Show image. The patterns look the same
except for very minor differences due to the inevitable
error caused by remapping.
7.12 Conversion of gel patterns from
GelCompar versions 4.1 and 4.2
The installation CD-ROM contains a directory
GCEXPORT, in which the following two files are found:
BNEXPORT.EXE and BNEXPORT.HLP.
The program BNEXPORT.EXE and its help file
BNEXPORT.HLP should be copied to the home
directory of GelCompar 4.1 or GelCompar 4.2.
The file BNEXPORT.HLP is a Windows help file which
explains step by step how to proceed to convert patterns
from GelCompar to BioNumerics.
7.13 Dealing with multiple reference
systems within the same Fingerprint
Type
Under normal circumstances, a reference system is
created once initially, and is never changed afterwards.
In some cases however, it can be required that a second
reference system is created. Some examples are:
(1) The gel used originally for defining the reference
positions appears to be an aberrant one, so that
repositioning the reference positions is required to
allow most other gels to be normalized easily.
(2) One or more bands defined as reference positions are
found to be unreliable or inappropriate and should
be deleted or replaced with another band.
(3) The user switches to a new reference pattern for the
Fingerprint Type.
(4) Gels of the same Fingerprint Type are imported from
another database and need to be analyzed together
with gels from the local database.
Case (1), shown in Figure 7-31., results in two reference
systems with the same reference position names, but
having different % distances on the gel. Gels processed
under both reference systems are perfectly compatible
and there is no loss of accuracy compared to gels
analyzed under the same reference system.
[Figure 7-31 content: the reference positions 15.3, 11.5,
9.6 and 8.5 lie at 7%, 18%, 57% and 85% in the first
reference system and at 9%, 18%, 65% and 95% in the
second.]
Figure 7-31. Example of different reference systems
in the same Fingerprint Type for which remapping
causes no loss of accuracy. See text for explanation.
The same situation can arise if gels are imported from
another database where they have been processed under
a different reference system [case (4)], but where the
same marker pattern is used and the reference positions
have been given the same names (even though the %
distances are different).
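One way to picture why two reference systems with identical position names remain compatible is shown below. The sketch first checks that both systems define the same names, then maps a % distance from one system to the other by linear interpolation between the shared reference positions. The interpolation is an assumption chosen for illustration and is not presented as the algorithm BioNumerics uses; the function names are invented, and the % distances are those read from Figure 7-31.

```python
def share_position_names(system_a, system_b):
    """True if both reference systems define the same position names."""
    return set(system_a) == set(system_b)


def remap_position(percent, system_a, system_b):
    """Map a % distance from system_a to system_b by linear interpolation
    between the reference positions the two systems have in common."""
    if not share_position_names(system_a, system_b):
        raise ValueError("position names differ; direct remapping not possible")
    # Pair the % distances of equally named positions, sorted along the gel.
    pairs = sorted((system_a[name], system_b[name]) for name in system_a)
    for (a0, b0), (a1, b1) in zip(pairs, pairs[1:]):
        if a0 <= percent <= a1:
            fraction = (percent - a0) / (a1 - a0)
            return b0 + fraction * (b1 - b0)
    raise ValueError("position lies outside the range of the reference positions")


# % distances of the reference positions as read from Figure 7-31.
first = {"15.3": 7.0, "11.5": 18.0, "9.6": 57.0, "8.5": 85.0}
second = {"15.3": 9.0, "11.5": 18.0, "9.6": 65.0, "8.5": 95.0}

print(share_position_names(first, second))  # True
print(remap_position(37.5, first, second))  # 41.5
```

A band halfway between the 11.5 and 9.6 positions in the first system lands halfway between the same two positions in the second system.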
Case (2) may result in a new reference system with more
or fewer bands, or with bands having a different name
(Figure 7-32.). In either case, the new reference system
will not be automatically compatible with the original,
and compatibility can only be obtained by creating a
molecular weight regression curve for both reference
systems. Both reference systems can then be remapped
onto each other, which inevitably causes some loss in
accuracy. The degree of compatibility depends on the
number of reference positions in both systems, the
amount of overlap between regression curves, the
predictability of the regression curve using one of the
available methods, the spread of calibration points
(reference positions), the definition of the reference
bands, etc.
[Figure 7-32 content: the first reference system has the
positions 15.3 (7%), 11.5 (18%), 9.6 (57%) and 8.5 (85%);
the second has 15.3 (9%), 10.8 (22%), 9.6 (65%) and
8.5 (95%). A "?" marks the position that does not match.]
Figure 7-32. Example of different reference systems
in the same Fingerprint Type for which remapping
relies on molecular weight regression curves for
both reference systems and as such, causes some
loss of accuracy. See text for explanation.
Case (3) obviously also leads to a situation where
reference positions have different names, since one can
assume that a new marker has different bands, so that
remapping is required.
When more than one reference system is present in a
Fingerprint Type, one of the reference systems is
specified as the “active” reference system. The active
reference system is the one to which all new gels will be
normalized. By default, the first created reference
system is the active one. The name of the active
reference system is shown in red in the Fingerprint type
window.
7.13.1 To change the active reference system, open the
Fingerprint type window, and select the reference system
to become the active one. Choose Settings > Set as
active reference system.
7.13.2 To remove a reference system that is not used
anymore, select the reference system in the Fingerprint
type window, and choose Settings > Remove reference
system.
The program asks “Do you want to check if this
reference system is in use?”. For large Connected
Databases, this may take a long time. If you answer
<No> to this question, the selected reference system is
removed, regardless of whether it is used in gels or not.
However, opening and saving a gel that was processed
under the removed reference system will restore it. If
you answer <Yes>, the program checks the database for
gels normalized with the reference system, and if any
such gels are found, the reference system is not
removed.
NOTE: To avoid any possible conflict situations, it is
recommended to allow the program to scan the database
for the presence of gels normalized with the reference
system, and not to remove any reference systems that
are in use.
7.14 Defining a new Character Type
7.14.1 Select Experiments > Create new character type
from the main menu, or press the corresponding toolbar
button and choose New character type.
7.14.2 The New character type wizard prompts you to
enter a name for the new type. Enter a name, for
example “Pheno”.
7.14.3 Press <Next> and check the kind of the character
data files. Check Numerical values if the tests are not
just positive or negative but can differ in intensity
(choose Numerical values in this example).
7.14.4 For numerical values, enter the number of
decimal digits you want to use. If you only want to use
integer values, for example between 0 and 10, enter zero
(this example).
7.14.5 After pressing <Next> again, the wizard asks if
the Character Type has an open or closed character set.
In an open character set, the number of characters is not
defined. For example, studying 10 bacterial strains by
means of fatty acids can result in a total of 20 fatty acids
found, but if some more strains are added, more fatty
acids may become present in the list. In such cases,
Consider absent values as zero should be checked,
because if a fatty acid is not found in a strain it will not
be listed in its fatty acid profile, and thus should be
considered as zero.
In a closed character set, the same number of characters
are present for all entries studied. This is the case with
commercially available test kits. In such cases, Consider
absent values as zero should not usually be checked.
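The effect of Consider absent values as zero on an open character set can be illustrated with a short Python sketch. The function name, the data layout and the fatty acid values are invented for this example and do not reflect how BioNumerics stores character data.

```python
def build_character_table(profiles, absent_as_zero=True):
    """Build a table with one column per character found in any profile.

    profiles maps an entry name to {character name: value}. Characters
    missing from an entry become 0 when absent_as_zero is True, and
    None (unknown) otherwise.
    """
    characters = sorted({name for profile in profiles.values() for name in profile})
    fill = 0 if absent_as_zero else None
    table = {
        entry: [profile.get(name, fill) for name in characters]
        for entry, profile in profiles.items()
    }
    return characters, table


# Hypothetical fatty acid profiles (percent of total) for two strains.
profiles = {
    "strain1": {"C16:0": 40, "C18:1": 60},
    "strain2": {"C14:0": 10, "C16:0": 90},
}
characters, table = build_character_table(profiles)
print(characters)        # ['C14:0', 'C16:0', 'C18:1']
print(table["strain1"])  # [0, 40, 60]
```

Adding a third strain with a new fatty acid would add a column, and the two existing strains would receive zero for it. In a closed set every entry has a value for every test, so a missing value means "not measured" rather than "not present".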
7.14.6 Answer No to the open character set and leave the
absent values checkbox unchecked.
If the character set is closed, i.e. when all the tests are
predefined, the user is allowed to specify the Layout of
the test panel. This layout involves a Number of rows
and Number of columns to be specified, as well as the
Maximum value for all the tests. By default, the number
of rows and columns is set to zero, which means that the
character set will be empty initially. In this case, you can
still add all the tests one by one or by columns and rows
once the character type is defined. If you are defining a
test panel based upon a microplate system (96 wells),
you can now enter 8 as Number of rows and 12 as
Number of columns. The program will automatically
assign names to the tests: A1, A2, A3, …, A12, B1, B2, B3,
… These names can be changed into the real test or
substrate names afterwards.
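The naming scheme for a microplate layout can be reproduced with the following sketch. It only illustrates the pattern A1, A2, ..., A12, B1, ... described above; the function name is invented.

```python
import string


def microplate_names(rows, columns):
    """Name tests row by row: A1, A2, ..., A12, B1, B2, ..."""
    if rows > len(string.ascii_uppercase):
        raise ValueError("this sketch only supports up to 26 rows")
    return [
        f"{string.ascii_uppercase[r]}{c}"
        for r in range(rows)
        for c in range(1, columns + 1)
    ]


names = microplate_names(8, 12)
print(len(names))    # 96
print(names[:3])     # ['A1', 'A2', 'A3']
print(names[11:14])  # ['A12', 'B1', 'B2']
print(names[-1])     # H12
```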
Press the <Finish> button to complete the setup of the
new Character Type. It is now listed under Character
types in the experiment type panel.
The new Character Type now exists, but the program
does not yet know which tests it contains, or how many.
We will now define its tests.
7.14.7 Double-click on Pheno in the experiment types
panel, or select Pheno and choose Experiments > Edit
experiment type, or press the corresponding button
in the Experiments panel.
The Character type window appears (Figure 7-34.),
initially with an empty character list. Suppose that the
phenotypic character kit consists of 10 tests; we will enter
them one by one.
7.14.8 Select Characters > Add new character. Enter any
name, e.g. Character 1, Glucosidase... and press <OK>.
The character is now listed in the characters panel
(Figure 7-34.). Its default color scale ranges from white
(negative) to black (most positive). Its default intensity
range is 0 to 100. If you want the character to cover
another range, proceed as follows:
7.14.9 Select Character > Change character range. Enter
a minimum and maximum value and press <OK>.
NOTE: You can quickly access all menu commands
that apply to a character by right-clicking on the
character.
Figure 7-34. Character type window; no characters defined yet.
7.14.10 To change the color for a character, first select it
and press <Copy from character> in the color setup
panel.
7.14.11 Then, select the start color on the left (negative
reaction) of the color scale: it is now marked with a black
triangle (see arrow on Figure 7-34.).
The three slide bars represent red, green, and blue,
respectively. By moving the sliders to the right, the color
becomes brighter.
7.14.12 Adjust the red, green and blue components
individually until you have obtained the desired color
for a negative reaction of this character.
7.14.13 Select the end color on the right (positive
reaction) of the color scale and adjust the red, green and
blue components individually until you have obtained
the desired color for a positive reaction of this character.
7.14.14 If you want more transition colors, use the <Add
color> button. A new color mark appears in the middle.
You can select this color, drag it to the left or to the right,
and adjust it as described.
7.14.15 Press <Copy to character> to copy the created
color scale to the character.
Repeat action 7.14.8 to add more characters, and adjust
the colors and the ranges individually as needed. When
all characters have the same color range, you can use
<Copy to all characters>.
A quick method to add a complete array of characters at
a time, for example a microplate array, is Characters >
Add array of characters. The program subsequently
asks to enter the number of rows, the number of
columns, and the maximum values for the tests. The
program automatically assigns names to the tests: A1,
A2, A3, …, B1, B2, B3, … These names can be changed
into the real test or substrate names afterwards.
When a default Connected Database is defined for the
current database, it is possible to define additional
information fields for the characters. Additional
character fields can be added, renamed or removed with
Fields > Add new field, Fields > Rename field, and
Fields > Remove field, respectively. A string can be
entered for a given character by double-clicking on the
intersection between the field and the character, or by
clicking on the intersection and choosing Fields > Set
field content. In a Comparison window, you can choose
to display another field than the character name (first
field) as the default to display. Select the field header
and Fields > Use as default field.
Each character is marked with a green check mark,
which means that it is used in comparisons and
identifications.
7.14.16 If you want a character to be disabled (not used)
in comparisons, uncheck the Characters > Use character
for comparisons menu item. This may be useful for a
blank test which is often present in commercial
identification kits. Unused characters are marked with a
red cross.
7.14.17 Using the menu Settings > General settings or
the corresponding button, the Character Type settings
which were entered in the setup wizard can be changed.
The Experiment card tab lets you define some visual
attributes of the experiment. These settings apply to the
experiment card, which is explained in paragraph 8.1.
With Represent as Plate and Represent as List, you can
choose whether the individual tests are shown
graphically on a panel, using colors, or as a list of
characters with their name and intensities as a
numerical value.
7.14.18 For the example here, choose Plate, and enter the
number of columns in the test panel (if you entered 10
tests, enter 10 here too).
For test kits on microtiter plates, one would enter 96
tests and 12 columns.
In order to represent existing commercial kits as
truthfully as possible, you can choose between three
different circular cup types and elliptical cups. For blots
and microarrays, you can choose between small blot,
large microarray spots and small microarray spots.
7.14.19 Select the type of cells in the Cell type pull-down
menu, e.g. elliptical, and press <OK>.
7.14.20 With the menu command Settings > Binary
conversion settings, or the corresponding button, you
can specify a binary cutoff value in percent.
Whenever converting the numbers to binary states,
BioNumerics will consider all values above the cutoff
value as positive and those below the cutoff value as
negative. If you have entered 50% as cutoff value, you
can choose the cutoff level to be 50% of the maximum
value found in the experiment, or 50% of the average
value from the experiment.
7.14.21 Enter 50% and Of mean value.
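The two ways of deriving the cutoff can be compared with a small Python sketch. It is an illustration under stated assumptions, not the program's code: the function name and test values are invented, the maximum and mean are taken over the values passed in, and a value exactly equal to the cutoff is treated as negative because the text only specifies "above" and "below".

```python
def to_binary(values, cutoff_percent=50.0, basis="mean"):
    """Convert numerical character values to binary states.

    The cutoff is cutoff_percent of either the maximum or the mean of
    the values. Values above the cutoff are positive (1), all others 0.
    """
    if basis == "max":
        reference = max(values)
    elif basis == "mean":
        reference = sum(values) / len(values)
    else:
        raise ValueError("basis must be 'max' or 'mean'")
    cutoff = reference * cutoff_percent / 100.0
    return [1 if value > cutoff else 0 for value in values]


values = [0, 10, 30, 80, 100]          # hypothetical test intensities
print(to_binary(values, 50, "mean"))   # mean 44, cutoff 22: [0, 0, 1, 1, 1]
print(to_binary(values, 50, "max"))    # max 100, cutoff 50: [0, 0, 0, 1, 1]
```

With the same 50% setting, the mean-based cutoff is lower here, so one more test is scored as positive.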
7.15 Input of character data
There are four possibilities for entering data of a
Character Type:
1. Importing a character file or importing characters
from text files or from an ODBC source using the
BioNumerics script language. Such files can be any
text format (see Figure 1-1., step 2). This method is
used to import external text formats, e.g. from
automated reading devices and other software.
2. Defining a new character file in BioNumerics and
entering the values manually.
3. Entering the data via the experiment card of the
database entry (see 8.2).
4. Processing and quantification of images scanned as
TIFF files (see 7.16).
•Importing characters from files or databases
To import characters from text or ODBC sources
(databases or spreadsheets), there are scripts available
on the website of Applied Maths. These scripts can be
launched from the BioNumerics Main window, using
the menu Scripts > Browse Internet, or the corresponding
toolbar button. The scripts to import character data can
be found under Import tools. A general script to import
character data from tab-delimited text documents is
called Import characters from text files. A script to
import characters from ODBC sources is called Import
characters (ODBC). A full description of how to use
these scripts is available on the website.
•Defining a new character file
A new character file is created as follows:
7.15.1 In database Example, select the new Character
Type Pheno.
7.15.2 Right-click in the Files panel, and choose Add new
experiment file from the floating menu.
7.15.3 Enter a name, e.g. Data01, and press <OK>.
7.15.4 Select Data01 in the Files panel, and File > Open
experiment file (data).
This opens the Character data file window (Figure 7-35.),
which is empty initially. The 10 test names which we
entered as an example are shown in the column header.
Figure 7-35. Character data file window.
You can add characters here with Characters > Add new
character or the corresponding button.
Before you can enter data, you have to add new entries
to the file. Suppose that we want to add character data
for all entries of the database except the standard (17).
7.15.5 Select Entries > Add new entries or the
corresponding button.
7.15.6 You are prompted to enter the number of entries;
enter 17, and press <OK>.
Seventeen entries are now present, and all character
values are initially represented by a question mark
(Figure 7-35.).
7.15.7 To enter values, you can either double-click on the
question mark, or press the Enter key.
7.15.8 Enter values between 0 and 5 (the range of the
characters), and press Enter again.
The next character of the same entry is automatically
selected, so that you can directly enter the next value.
7.15.9 If you are entering large character files, we
recommend saving regularly with File > Save or the F2
shortcut.
7.15.10 Select File > Exit when you are finished entering
the data.
7.15.11 In the Main window, double-click on the file
Data01 (or select File > Open experiment file (entries)).
The Character entry file window (cf. the Fingerprint entry
file window, Figure 7-25.) contains unlinked entries,
which you can now link to the corresponding database
entries.
A link arrow for each entry allows you to link the
entry to a database entry by clicking on the arrow,
dragging it onto a database entry, and then releasing the
mouse button. When the experiment is linked, its link
arrow is purple.
7.15.12 Drag the link arrow of entry 1 to the first
database entry: as soon as you pass over a database
entry, the cursor shape changes.
7.15.13 Release the mouse button on the database entry;
entry 1 is now linked to this database entry, and its arrow
in the Character entry file window has become purple
instead of gray.
NOTE: if you try to link an entry to a database entry
which already has an entry of the same experiment type
linked to it, the program will refuse the second link with
the message:
“The Experiment ‘Pheno’ of this database is already
defined in XXX”
where XXX is another lane of the same Character Type,
in the same or another experiment file.
As soon as an experiment is linked to a database entry,
the experiment presence panel (see Figure 6-1.) shows a
green dot for the experiment of this entry.
You can edit the information fields for this entry in two
places: directly in the database (see 6.4.1 to 6.4.2), or in
the Character entry file window, by double-clicking on the
entry.
NOTE: Experiment files added to the Experiment files
panel can also be deleted by selecting the file and
choosing File > Delete experiment file from the main
menu. Deleted experiment files are struck through (red
line) but are not actually deleted until you exit the
program. Until then, you can undo the deletion of the file
by selecting File > Delete experiment file again.
7.16 Import of character data by
quantification of images scanned as TIFF
files
As with gel images, BioNumerics can import
Character Type data from TIFF images. This happens by
quantification of the color intensity and/or color
transitions on the TIFF file. Character data from
phenotypic test panels often provide color transitions
rather than changes in intensity. For example, many test
panels have reactions that change from yellow to red, or
from blue over green to yellow, and hence, quantifying
the colors by the intensity would provide no meaningful
information. Rather, the program needs to be able to
read the files as true RGB images and allow the
possibility to define negative colors and positive colors,
as well as transition colors. For example, in an
acidification reaction with a bromophenol blue dye,
non-reactive tests will be blue, whereas weak reactions
will show the transition color green, and strongly
positive reactions will show up yellow.
Using the same tool, BioNumerics also allows the
import of micro-array and gene chip images scanned as
TIFF files, offering for each gene or oligo a quantitative
reaction value.
The character import tool is provided as a separate
program, BNIMA.EXE, that can be started from the
BioNumerics program. BNIMA only works when the
BioNumerics analyze program is running; in other
words, you should either start BNIMA from within
BioNumerics, or first launch BioNumerics and then start
BNIMA.
Example 1: import of microtiter plate image.
The first example we will use to illustrate the program is
EXAMPLES\CHARFILE\PLATE1.TIF on the
BioNumerics installation CD-ROM. It is a photograph of
a 96-well microtiter plate with bromophenol blue as
reaction indicator dye (see Figure 7-36.).
Figure 7-36. 96-well microtiter plate with
bromophenol blue as reaction indicator.
7.16.1 Create a new closed Character Type as described in
7.14.1 to 7.14.6, and call it Microplate. Specify 8 rows
and 12 columns under Layout (third step). Specify a
color range from blue to yellow, over green (see 7.14.10).
7.16.2 Double-click on an entry to show its Entry edit
window.
The experiment type Microplate shows an empty flask.
7.16.3 Click on the flask button. Since this experiment is
not defined for the selected entry, the program asks “Do
you want to create a new one?”.
7.16.4 Answer <Yes> to create an Experiment card (see
further, paragraph 8.1), which shows an empty
microplate image.
7.16.5 Right-click on the empty microplate image and
select Edit image from the floating menu.
This loads the BNIMA program.
7.16.6 Select File > Load image in BNIMA and load the
file EXAMPLES\CHARFILE\PLATE1.TIF from the
BioNumerics installation CD-ROM. The resulting
window looks as in Figure 7-37.
7.16.7 First call the Settings dialog box with Edit >
Settings or the corresponding button.
The Image tab offers two choices for the Image type:
Densitometric and Color scale.
In case the color reaction can be interpreted as a simple
change in intensity (e.g. from light to dark), one should
select Densitometric. The Densitometric values panel
offers some additional tools to edit the TIFF file:
Inverted values is to invert the densitometric values;
Background subtraction allows a two-dimensional
subtraction of the background from the TIFF file, using
the rolling ball principle. The Ball size can be entered in
pixels. Spot removal allows all spots and irregularities
below a certain size to be removed from the image,
whereas larger structures are preserved. The
background subtraction and spot removal changes are
only seen when Edit > Show value scale is enabled in
the Main window.
In case the reaction causes a change from one color to
another color, as in the above example, Color scale is the
right option. An additional feature, Hue only, is
particularly useful when the scanned images differ in
brightness (illumination) or contrast. If the images do
not contain black or white in their color range, it is better
to enable this feature.
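Why hue is insensitive to differences in illumination can be seen with Python's standard colorsys module. The sketch below is only an illustration of the color model and says nothing about how BNIMA is implemented; the helper name and the two sample colors are invented.

```python
import colorsys


def hue_degrees(r, g, b):
    """Return the hue (0-360 degrees) of an RGB color with 0-255 channels."""
    h, _s, _v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return h * 360.0


bright_yellow = (255, 255, 0)
dim_yellow = (128, 128, 0)  # the same yellow under weaker illumination

# Both colors have a hue of about 60 degrees, although their brightness differs.
print(hue_degrees(*bright_yellow))
print(hue_degrees(*dim_yellow))
```

Pure black, white and gray have no defined hue (colorsys reports 0 for them), which is consistent with the advice to use Hue only when the color range of the images contains neither black nor white.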
7.16.8 Select Color scale and Hue only.
7.16.9 Press <OK> to proceed with these settings. The
other settings will be discussed later.
Like the normalization of gels, processing a
character panel image consists of a number of steps: (1)
Grid definition; (2) Cell layout; and (3) Quantification.
In Step 1: Grid definition we will create a grid that
defines the wells of the microplate.
7.16.10 Select Grid > Add new and enter 8 as Number of
rows and 12 as Number of columns.
Figure 7-37. The BNIMA program with a microplate image loaded.
7.16.11 Press <OK> and the grid appears.
At each corner of the grid, there is a dragging node (green
square). The upper left dragging node moves the grid
as a whole; the lower right node resizes the grid; and
the upper right and lower left nodes distort the
grid in case the image is not perfectly rectangular or not
scanned horizontally.
7.16.12 Drag the nodes until the grid matches with all 96
wells.
NOTE: Using the SHIFT key, one can distort the grid
locally if needed. The size of the local distortion area is
indicated by a circle. It is even possible to reduce or
enlarge the size of the distortion area as follows: hold the
shift key and left-click on any cell-marking cross of the
grid. The area defining circle appears. Hold the left
mouse button down and release the SHIFT key: the
circle is still visible. While holding the left mouse
button down, press the PgDn or PgUp key to reduce or
enlarge the area of distortion. The size of the circle will
decrease or increase. Using a very small circle, it is
possible to correct the grid in any individual cell.
7.16.13 In case you want to remove the grid and define it
again, select one of the cells of the grid (becomes red)
and Grid > Delete.
In case the image consists of two or more subsets of cells
(e.g. some more complex test panels or micro-arrays), it
is possible to define more than one grid using the Grid >
Add new command.
7.16.14 Move to the next step using Edit > Next step or
the corresponding button.
In this step, the layout of the cells is defined: the shape
and size of the quantification area within each cell. In
this step, we also define which cells we want to use for
quantification and which not. By default, all cells of
the grid are used for quantification.
7.16.15 Click in the upper left corner of the image and
while holding the left mouse button down, select the left
half of the test panel.
All the selected cells are marked in red.
7.16.16 Select Cells > Delete selected. The left half of the
panel will not be used for quantification, and hence,
cannot be used in the resulting character set.
Cross marks of unused cells are smaller than those of
used cells.
7.16.17 Select the unused cells again and choose Cells >
Add selected.
Before the program can do the quantification, it needs to
know what the averaging area of the cells is. This is
done using a mask which the user defines. One can
define the same mask for all cells, or assign particular
masks to individual (groups of) cells. In the case of a
microplate it is obvious that all cells should have the
same mask.
7.16.18 Select all cells as in 7.16.15.
7.16.19 Add a circular mask to all selected cells with
Cells > Add disk to mask.
A dialog box prompts to enter a Radius for the disk in
pixels, the X offset (horizontal shift from the cell
marking cross) and the Y offset (vertical shift from the
cell marking cross). For the offsets, a negative value can
be entered.
7.16.20 Enter 8 as radius, and leave the offsets zero.
Press <OK> to confirm.
The masks appear on all used cells of the grid as
semitransparent red disks. In order to see the masks, it is
important that they are in a color that is complementary
to the reaction colors of the cells. One can change the
color of the masks as follows:
7.16.21 Select Edit > Settings or the corresponding
button.
7.16.22 Under Layout, pull down the Mask color menu
and select the appropriate color.
In the example microplate, the most appropriate color is
red.
NOTE: In case of very small cells, e.g. microplate
images, you can select Small cross marks, so that they
don’t overlap most of the cells.
It is possible to add more than one mask to the cells. In
case the cells have a more complex layout, i.e. not just
circular, one can add two or more disks with different
offsets to approach the shape of the cells. By selecting
individual cells or groups of cells, it is also possible to
change the shape of the masks per cell or per group of
cells.
NOTE: Some more advanced features allow the mask of
individual cells to be changed manually: With Cells >
Add pixels to mask or the corresponding button, the
cursor changes into a pencil which you can use to add
pixels to the masks manually. When doing so, it is
recommended to zoom in on the cell using the Edit >
Zoom in command or the corresponding button.
Similarly, it is possible to remove pixels from the mask
with Cells > Remove pixels from mask or the
corresponding button. Clicking a second time on these
buttons or selecting the menu item finishes the pixel
editing mode.
If you selected the image type to be Color scale, and not
Densitometric in the settings (see 7.16.7 to 7.16.8), you
can now specify the negative color, the positive color,
and any transition colors between negative and positive.
For each cell, you can define a unique color scale, which
can be necessary for some commercial test panels
containing more than one reaction dye.
7.16.23 In the example case, there is only one reaction
dye, so select all cells as in 7.16.15.
7.16.24 Select Cells > Edit color scale or the
corresponding button. This brings up the Color scale
editor as shown in Figure 7-38.
By default, the color scale consists of two colors: white as
negative and black as positive. In the case of the
example microtiter plate, this scale would obviously not
work. Since the scale ranges from bluish (negative)
over greenish to yellow (positive), we will add a new
intermediate color.
7.16.25 Press the Add color button. One new color (gray)
is defined in the middle of the scale.
7.16.26 Select the color selector of the negative color
(left).
7.16.27 Move the slider on the color scale of predefined
colors to blue.
7.16.28 Move the slider in the Saturation/Brightness
square to the lower left corner to obtain maximum
brightness and saturation.
7.16.29 Repeat 7.16.26 to 7.16.28 for the intermediate
color (middle), assigning green, and for the positive
color (right), assigning yellow.
NOTE: If you selected Hue only in the settings (see
7.16.7 to 7.16.8), changing saturation and brightness
has no effect on the obtained color scale. If saturation or
brightness transitions within the same color are to be
registered, you should disable the Hue only feature in
the settings.
The upper color scale now should range from blue over
green to yellow (Figure 7-39.).
Figure 7-39. Appropriate color scale for the
example microplate image.
NOTE: One can also pick up colors from the image in
order to define the selected color in the upper color scale.
To this end, click and hold the left mouse button on the
left pipet button. The mouse pointer shape changes
into a pipet which you can drag to the most negative
cell, e.g. the blank control. The selected color in the color
scale automatically changes into the color at the pipet’s
position. If Hue only is enabled, the closest hue color is
selected.
Once the color scale is defined you can interrogate the
reaction of any cell using the right pipet button as
described above. This pipet does not affect the defined
color scale, but only shows the position of the pointed
cell graphically on the color scale, and the percentage
reaction with error indication.
With the Max. value field you can enter the maximum
value to which all characters will be rescaled.
7.16.30 Enter 100 as Max. value and press <OK> to
confirm the color settings.
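As an illustration of how a three-color scale with a Max. value of 100 could turn a cell color into a reaction value, the following Python sketch interpolates on hue only. It is not the BNIMA implementation: the names SCALE, hue and reaction, the piecewise linear interpolation, and the clamping of colors outside the scale are all assumptions made for this example.

```python
import colorsys

# Hypothetical color scale: (position on the scale from 0 to 1, RGB color).
# Blue is negative, green intermediate, yellow positive, as in the example.
SCALE = [(0.0, (0, 0, 255)), (0.5, (0, 255, 0)), (1.0, (255, 255, 0))]


def hue(rgb):
    """Return the hue of an RGB color (0..255 per channel) in degrees."""
    r, g, b = (channel / 255.0 for channel in rgb)
    h, _s, _v = colorsys.rgb_to_hsv(r, g, b)
    return h * 360.0


def reaction(rgb, scale=SCALE, max_value=100.0):
    """Map a cell color to a value between 0 and max_value, using hue only."""
    h = hue(rgb)
    stops = [(pos, hue(color)) for pos, color in scale]
    # Look for the pair of neighboring scale colors whose hues enclose h.
    for (p0, h0), (p1, h1) in zip(stops, stops[1:]):
        if min(h0, h1) <= h <= max(h0, h1):
            frac = (h - h0) / (h1 - h0)
            return max_value * (p0 + frac * (p1 - p0))
    # Hue outside the scale: clamp to the scale color with the nearest hue.
    nearest = min(stops, key=lambda stop: abs(stop[1] - h))
    return max_value * nearest[0]
```

With this sketch, a pure blue cell maps to 0, a pure green cell to 50 and a pure yellow cell to 100. Saturation and brightness are ignored, which corresponds to working with Hue only enabled.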
7.16.31 Move to the next step using Edit > Next step or
the
button.
The next and last step involves quantification of the cells. First of all, the cells to be added to the character set need to be defined. In case one or more cells are intended only for calibration purposes, they can be excluded from the resulting character set, but used as calibration marker.
Figure 7-38. Color scale editor in the Cell layout step of the BNIMAGE program.
7.16.32 Select all cells as in ”Click in the upper left corner of the image and while holding the left mouse button down, select the left half of the test panel.”.
7.16.33 Select Quantification > Add cells to character set.
The cells are now numbered from 1 to 96.
7.16.34 If you click on a particular cell, its quantified value as rescaled according to 7.16.30 is given in the status bar as well as the value after calibration (see further).
Quantification is done by integrating the pixels within the defined mask. There are different options for integration:
7.16.35 Select Edit > Settings or the corresponding button, and choose the Quantification tab.
Cell integration methods include Average, Median, and Sum. In case the image contains spots that could influence the quantified values, the Median option will provide more reliable results than the arithmetic averages.
7.16.36 Select Median integration and press <OK>.
In order to illustrate the calibration feature, we will define one of the cells as negative control (minimum value), and another cell as positive control (maximum value).
7.16.37 Select cell A1 (negative control) and Quantification > Define calibration point. Enter 0 as value and press <OK>.
7.16.38 Select cell A12 (positive control) and Quantification > Define calibration point. Enter 100 as value and press <OK>.
Since only two calibration points are defined now, it is obvious that the program needs to calculate a linear regression through the defined points, in order to requantify the other cells according to the negative and positive controls:
7.16.39 Select Edit > Settings or the corresponding button, and choose the Quantification tab.
7.16.40 Under Calibration, enter 1 as Polynomial degree. This will result in a first degree regression.
7.16.41 Press <OK> to close the Settings dialog box.
7.16.42 Select Quantification > View calibration curve. This shows a linear regression between the two calibration points, zero and 100.
Finally, there is one more thing to do, i.e. to copy the character values in the microplate opened in BioNumerics.
7.16.43 Select Quantification > Export to clipboard or the corresponding button.
Before closing the BNIMA program, you can save the entire configuration defined for this microplate system. If you load a next microplate, you can reload the grid and all other settings such as color scale, disabled cells, quantification parameters etc.
7.16.44 Select File > Save configuration as or the corresponding button.
7.16.45 Enter a name e.g. microplate, and press <OK> to save the configuration.
For next microplates you can reload the configuration using File > Load configuration or the corresponding button.
7.16.46 Close the BNIMA program.
7.16.47 Right click on the Experiment card (see also paragraph 8.1), and select Paste data from clipboard from the floating menu.
The microplate now is filled with data and looks as in Figure 7-40.
Figure 7-40. Example microplate experiment card after import of character values using BNIMA.
7.16.48 Click the upper left triangular button to close the experiment card.
Example 2: import of gene-array scanning.
The second example we will use to illustrate the BNIMA
program is a fragment of a gene array which can be
found in EXAMPLES\CHARFILE\ARRAY.TIF on the
BioNumerics installation CD-ROM. The array image
was generated by chemiluminescent detection of
digoxigenin-labeled cDNA (see footnote 1). Each gene is characterized
by two spots (horizontally next to each other), which can
be considered as a control measure. For this example,
we have used a fragment representing two blocks of 14 x
7 genes (the complete array is composed of six blocks of
14x7 genes, totalling 588 characters). The left and right
half are separated by one blank column, and the two
bottom rows contain calibration and reference spots (see
Figure 7-41.).
7.16.49 Create a new closed Character Type as described in
7.14.1 to 7.14.6, and call it Gene array. Specify 14 rows
and 14 columns under Layout (third step).
7.16.50 When the Gene array experiment type is created,
double click on it in the Experiments panel.
7.16.51 In the appearing Character type window, select
Settings > General settings, and click the Experiment
card tab.
7.16.52 Under Cell type, select Small blot, which makes
it possible to show large data sets in the experiment
cards (see paragraph 8.1).
7.16.53 Click <OK> and close the Character Type window.
7.16.54 Double click on an entry to show its Entry edit
window.
The experiment type Gene array shows an empty flask.
7.16.55 Click on the flask button. Since this experiment is
not defined for the selected entry, the program asks “Do
you want to create a new one?”.
Figure 7-41. Fragment of gene array scanned as TIFF
image (file ARRAY.TIF).
7.16.56 Answer <Yes> to create an Experiment card (see
further, paragraph 8.1), an empty 14 by 14 array image.
7.16.57 Right-click on the empty array image and select
Edit image from the floating menu.
1. Courtesy S.D. Vernon, M.S. Mangalathu, and E.R. Unger
(J. Histochemistry & Cytochemistry 1999; 47:337-342).
This loads the BNIMA program.
Figure 7-42. The BNIMA program with a gene array image (fragment) loaded.
7.16.58 Select File > Load image in BNIMA and load the
file EXAMPLES\CHARFILE\ARRAY.TIF from the
BioNumerics installation CD-ROM. The resulting
window looks as in Figure 7-42.
7.16.59 First call the Settings dialog box with Edit >
Settings or
.
The Image tab offers two choices for the Image type:
Densitometric and Color scale.
Unlike the first microplate image, the color reaction of
this gene array can be interpreted as a simple change in
intensity (e.g. from light to dark), hence one should
select Densitometric.
7.16.60 Select Densitometric under Image type.
The Densitometric values panel offers some additional tools to edit the TIFF file: Inverted values is to invert the densitometric values; Background subtraction allows a two-dimensional subtraction of the background from the TIFF file, using the rolling ball principle. The Ball size can be entered in pixels. Background subtraction is only necessary if the illumination of the image is not uniform, which is not the case in the example image. Spot removal allows all spots and irregularities below a certain size to be removed from the image, whereas larger structures are preserved.
7.16.61 Leave Background subtraction disabled, and enable Spot removal, with a maximal Spot size of 3 pixels.
7.16.62 Press <OK> to quit the settings panel.
The background subtraction and spot removal changes are only seen when Edit > Show value scale is enabled in the Main window.
7.16.63 Check Edit > Show value scale. The image now looks “cleaned up”: spots are removed and the image is shown in grayscale rather than as 24 bit true color image.
In Step 1: Grid definition we will create a grid that defines the cells of the array.
7.16.64 Select Grid > Add new and enter 17 as Number of rows and 15 as Number of columns.
Choosing 17 and 15 rather than 14 by 14 is to allow the calibration spots to be included, and to take account of the blank column.
7.16.65 Press <OK> and the grid appears.
7.16.66 Move the upper left dragging node until the grid crosses match the middle of each double spot in the upper left area of the array (see Figure 7-43.).
Figure 7-43. Correct alignment of grid on gene array spots.
7.16.67 Next, move the lower right dragging node until the grid crosses match the middle of each double spot in the lower right area of the array.
7.16.68 Then, move the lower left and upper right dragging nodes of the grid to distort the rectangle so that the grid crosses in the lower left and upper right areas, respectively, match with the double spots.
The grid on the image should now look as in Figure 7-44.
Figure 7-44. Correctly aligned grid on example gene array.
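To make the effect of the four dragging nodes concrete, the sketch below computes the positions of all grid crosses from the four corner nodes by bilinear interpolation. That BNIMA distorts the grid in exactly this way is an assumption; the function name grid_crosses and the pixel coordinates in the usage note are hypothetical.

```python
def grid_crosses(upper_left, upper_right, lower_left, lower_right, rows, cols):
    """Return a rows x cols list of (x, y) crosses between four corner nodes.

    Each corner is an (x, y) pixel position. Positions in between are found
    by bilinear interpolation, so moving a single corner distorts the grid.
    """
    crosses = []
    for r in range(rows):
        v = r / (rows - 1) if rows > 1 else 0.0  # 0 at the top, 1 at the bottom
        row = []
        for c in range(cols):
            u = c / (cols - 1) if cols > 1 else 0.0  # 0 at the left, 1 at the right
            weights = ((1 - u) * (1 - v), u * (1 - v), (1 - u) * v, u * v)
            corners = (upper_left, upper_right, lower_left, lower_right)
            x = sum(w * corner[0] for w, corner in zip(weights, corners))
            y = sum(w * corner[1] for w, corner in zip(weights, corners))
            row.append((x, y))
        crosses.append(row)
    return crosses
```

For the example grid of 17 rows and 15 columns, a call such as grid_crosses((0, 0), (140, 0), (0, 160), (140, 160), 17, 15) returns 17 rows of 15 crosses each, with the corner crosses coinciding with the dragging nodes.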
7.16.69 Move to the next step using Edit > Next step or
the
button.
In this step, the layout of the cells is defined: the shape
and size of the quantification area within each cell. In
this step, we also define which cells we want to use for
quantification and which cells not. By default, all cells of
the grid are used for quantification.
7.16.70 Select the cells in the blank column of the image
and Cells > Delete selected.
7.16.71 Similarly, select the three lowest rows and Cells
> Delete selected.
Two cells of the second last row represent 0 and 100%
hybridization respectively: the 4th and the 5th cell. We
will include these cells for calibration, hence we have to
include them again:
7.16.72 Select the 4th and 5th cell of the second last row
and Cells > Add selected.
Before the program can do the quantification, it needs to know what the averaging area of the cells is. This is done using a mask which the user defines. In this case, it is clear that we will have to define two masks per cell, in order to cover the duplicate spots.
7.16.73 Select all cells as in 7.16.15.
7.16.74 Add a circular mask to all selected cells with Cells > Add disk to mask.
A dialog box prompts to enter a Radius for the disk in pixels, the X offset (horizontal shift from the cell marking cross) and the Y offset (vertical shift from the cell marking cross). For the offsets, a negative value can be entered.
7.16.75 Enter 6 as radius, and -6 as X offset. Press <OK> to confirm.
The masks appear on all used cells of the grid as semitransparent red disks.
7.16.76 Add a second mask to all selected cells with Cells > Add disk to mask.
7.16.77 Enter 6 as radius, and 6 as X offset. Press <OK> to confirm.
After these steps, the BNIMA window should look as in Figure 7-45.
Figure 7-45. Array editing in BNIMA, with included and excluded cells, and masks defined.
7.16.78 If this is the case, move to the next step using Edit > Next step or the corresponding button.
7.16.79 In the Quantification step, first call the Settings dialog box with Edit > Settings or the corresponding button.
7.16.80 Select the Quantification tab, specify a first degree polynomial fit and click <OK>.
7.16.81 Select the 4th cell in the second last row and Quantification > Define calibration point.
7.16.82 Enter 0 (zero).
7.16.83 Select the 5th cell in the second last row and Quantification > Define calibration point.
7.16.84 Enter 100.
All cells are now quantified between the zero and 100% hybridization control, and we now need to specify which cells to add to the character set. Since the calibration cells (second last row) are not part of the character set, these should not be included.
7.16.85 Select all but the three last rows and Quantification > Add cells to character set.
The cells to be used in the character set are now numbered 1 to 196.
7.16.86 Copy the quantified cells to the clipboard with Quantification > Export to clipboard or the corresponding button.
Before closing the BNIMA program, you can save the entire configuration defined for this gene array system:
7.16.87 Select File > Save configuration as or the corresponding button.
7.16.88 Enter a name e.g. “Gene array”, and press <OK> to save the configuration.
7.16.89 Close the BNIMA program.
7.16.90 Right click on the Experiment card (see also paragraph 8.1), and select Paste data from clipboard from the floating menu.
The experiment card now is filled with data and looks as in Figure 7-46.
Figure 7-46. Example gene array experiment card after import of character values using BNIMA.
7.16.91 Click the upper left triangular button to close the experiment card.
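The quantification steps of this example (two disk masks per cell, a choice of integration method, and a first degree calibration through a zero and a 100% control) can be sketched in Python as follows. This is only an illustration under stated assumptions: the image is taken to be a list of rows of densitometric values, and the function names disk_pixels, quantify and linear_calibration are hypothetical, not part of BNIMA.

```python
from statistics import mean, median


def disk_pixels(image, cx, cy, radius):
    """Collect the values of all pixels within `radius` of the point (cx, cy)."""
    values = []
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                values.append(value)
    return values


def quantify(image, cross, radius, x_offsets, method=median):
    """Integrate all disk masks of one cell.

    `cross` is the (x, y) position of the cell marking cross, `x_offsets`
    holds one horizontal shift per disk, and `method` is the integration
    method: mean (Average), median (Median) or sum (Sum).
    """
    cx, cy = cross
    values = []
    for dx in x_offsets:
        values.extend(disk_pixels(image, cx + dx, cy, radius))
    return method(values)


def linear_calibration(raw_zero, raw_full, low=0.0, high=100.0):
    """Return a function that rescales raw values through two calibration points."""
    slope = (high - low) / (raw_full - raw_zero)
    return lambda raw: low + slope * (raw - raw_zero)
```

For the gene array, the two masks of steps 7.16.75 and 7.16.77 correspond to quantify(image, cross, 6, (-6, 6)). A single outlier pixel inside a mask shifts the average but not the median, which is why the Median option is more robust against spots. The calibration function is built from the raw values of the two control cells and then applied to every other cell.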
7.17 Defining a new Sequence Type
7.17.1 Select Experiments > Create new sequence type
from the main menu, or press
and New sequence
type.
7.17.2 The New sequence type wizard prompts you to
enter a name for the new type. Enter a name, for
example “SSU-Ribo”.
7.17.3 Press <Next> and check the kind of the sequences:
nucleic acid sequences or amino acid sequences. Select
nucleic acid sequences.
Press the <Finish> button to complete the setup of the
new Sequence Type. It is now listed under Sequence
types in the experiment type panel.
The new Sequence Type exists by now, and we can enter
sequence data in several ways:
1. Assembling sequencer trace files into consensus
sequences using BioNumerics’ own Assembler
program (see 7.18).
Figure 7-47. The Sequence import dialog box.
2. Importing a sequence file as external format. EMBL,
GenBank and Fasta formats are supported, and the
information fields can be extracted from the headers
using tags.
3. Defining a new sequence file in BioNumerics and
entering the values manually, or pasting them from
the clipboard.
4. Entering or pasting the sequences via the experiment
card of the database entry (see 8.2).
5. Importing the sequences using a script available from
the website of Applied Maths. Scripts for importing
sequences in various file formats can be launched from
the BioNumerics Main window, using the menu
Scripts > Browse Internet or the corresponding button.
The relevant scripts can be found under Import tools or
Sequence related scripts. A description of how to use
the scripts is available on the website.
To import sequences in EMBL format, we will use the
example file EMBL.TXT that is provided on the CD-ROM (directory EXAMPLES\TXTfiles) and use
BioNumerics’ internal import routine.
7.17.4 Select SSU-Ribo under Sequence types.
7.17.5 Choose File > Import experiment data.
In the Open file dialog box, select file EMBL.TXT on the
CD-ROM. This opens the Sequence import dialog box
(Figure 7-47.).
In order to understand how the import of information
via tags works, you may want to open the file
EMBL.TXT in Notepad (Wordpad) or another text
editor. EMBL files and GenBank files contain for each
sequence a header of which the information is
characterized by tags. In EMBL, DE refers to the
organism name, AC is the accession number, KW is the
keyword, etc. Similar tags are used in the GenBank
format. We will specify these tags to extract specific
information from the headers.
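To illustrate what extracting information by tag means, the following Python sketch reads EMBL-style text, in which each header line starts with a two-letter tag and the sequence follows the SQ line up to the // terminator. It is not the BioNumerics import routine: the function name parse_embl and the sample record (including its accession number and organism) are made up for this example.

```python
# Hypothetical EMBL-style record, used only to demonstrate the parser.
SAMPLE = "\n".join([
    "ID   EXAMPLE1   standard; DNA; PRO; 20 BP.",
    "XX",
    "AC   X00001;",
    "XX",
    "DE   Example 16S rRNA gene fragment",
    "XX",
    "OS   Xanthomonas campestris",
    "XX",
    "SQ   Sequence 20 BP;",
    "     acgtacgtac gtacgtacgt                                    20",
    "//",
])


def parse_embl(text):
    """Split EMBL-style text into entries with their tags and sequence."""
    entries, tags, seq, in_seq = [], {}, [], False
    for line in text.splitlines():
        if line.startswith("//"):
            # End of record: store it and start a new one.
            entries.append({"tags": tags, "sequence": "".join(seq)})
            tags, seq, in_seq = {}, [], False
        elif in_seq:
            # Sequence lines: keep the letters, drop blanks and counters.
            seq.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("SQ"):
            in_seq = True
        else:
            code, content = line[:2], line[5:].strip()
            if code == "XX" or not content:
                continue  # spacer line
            # Lines repeating the same tag are joined with a space.
            tags[code] = tags[code] + " " + content if code in tags else content
    return entries
```

Calling parse_embl(SAMPLE) returns one entry whose tags dictionary can be queried by tag, for example entry["tags"]["AC"] for the accession number. This is the kind of lookup the tag fields in the Sequence import dialog box refer to.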
In the Fasta format, each sequence is preceded by one
information line, which starts with a > sign. If this
information line contains several fields separated by
vertical slashes (|), you can enter a number: for
example, the third field is represented by a 3.
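The field numbering for Fasta information lines can be sketched as follows. The function name fasta_field and the header used in the usage note are hypothetical, and counting fields from 1 is inferred from the statement that the third field is represented by a 3.

```python
def fasta_field(header, number):
    """Return field `number` (counted from 1) of a Fasta information line.

    Fields are separated by vertical slashes (|). An empty string is
    returned if the line has fewer fields than requested.
    """
    fields = header.lstrip(">").split("|")
    if number < 1 or number > len(fields):
        return ""
    return fields[number - 1].strip()
```

For an information line such as ">gi|12345|emb|X00001|Example", the call fasta_field(header, 4) would return "X00001".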
7.17.6 The file name as it will be saved in the experiment
files directory of BioNumerics is shown in a text box
File. If desired, you can change this name, for example if
a file with the same name already exists in the
experiment files directory.
7.17.7 Select EMBL format from the supported formats.
With Do not create keys, the program will import the
sequences without automatically creating new
corresponding database entries. Since no entries are
created, the Database fields list has no importance in
this case. If you select this function, you will have to link
the sequences manually to existing database entries, as
explained for Character Types (see 7.15.11 to 7.15.13).
If Create keys from tag: is selected, the program will
automatically create a new database entry for each
sequence. If nothing is filled in the TAG field, the
program will automatically construct keys for the
entries. By entering a tag that refers to a unique code, for
example the accession number, you can have the
program use these numbers to create the entry keys. The
<Prefix> button allows you to specify a fixed prefix that
will precede each key.
7.17.8 Choose Create keys from tag: and enter NI as tag.
7.17.9 Select Genus under Database fields, and enter OS
as tag. In fact, OS in the EMBL file displays the full
organism name, so that we are unable to read the genus
and species names separately from the file.
7.17.10 Select Strain no under Database fields, and enter
AC as tag.
7.17.11 Press <OK>. Twenty new database entries are
created.
7.17.12 You can further edit the sequences by selecting
file EMBL.TXT in the Files panel, and File > Open
experiment file (data).
NOTE: Files to import should contain no more than
300 entries. If the file is larger, the number of imported
entries will be truncated after the first 300. If you want
to import more than 300 sequences from single files,
you should use the sequence import script(s) available
on the website of Applied Maths, as explained above.
Double click on an entry or Sequence > Edit to edit the sequence.
A new sequence file is created as follows:
7.17.13 In database Example, select the new Sequence Type SSU-Ribo.
7.17.14 Right-click in the Files panel, and choose Add new experiment file from the floating menu.
7.17.15 Enter a name, e.g. Seq01, and press <OK>.
7.17.16 Select Seq01 in the Files panel, and File > Open experiment file (data).
This opens the Sequence data file window, which is empty initially.
Before you can enter sequences, you have to add new entries to the file. Suppose that we want to add sequence data for three more entries of the database.
7.17.17 Select Entries > Add new entries or the corresponding button.
7.17.18 You are prompted to enter the number of entries; enter 3, and press <OK>.
Three entries are now present, and all sequences are initially represented by a blank line.
7.17.19 Double click or Sequence > Edit is to edit the sequence or to enter the bases manually.
With Sequence > Paste from clipboard, the contents of the clipboard is pasted into the selected sequence.
7.17.20 If you are doing a lot of editing work, we recommend to save now and then with File > Save or the F2 shortcut.
7.17.21 File > Exit when you are finished editing the sequences.
7.17.22 In the Main window, double click on the file Seq01 (or File > Open experiment file (entries)).
The Sequence entry file window (cf. the Fingerprint entry file window, Figure 7-25.) contains unlinked entries, which you can now link to the corresponding database entries. A link arrow for each entry allows you to link an entry to a database entry, by clicking on the arrow and dragging it onto a database entry, and then releasing the mouse button. When the experiment is linked, its link arrow is purple.
7.17.23 Drag the link arrow of entry 1 to any database entry that doesn’t have a sequence linked: as soon as you pass over a database entry, the cursor shape changes.
7.17.24 Release the mouse button on the database entry; the entry is now linked to this database entry, and its arrow in the Sequence entry file window has become purple instead of gray.
NOTE: if you try to link an entry to a database entry
which already has an entry of the same experiment type
linked to it, the program will refuse the second link with
the message:
“The Experiment ‘SSU-Ribo’ of this database is already
defined in XXX”
where XXX is another lane of the same Sequence Type,
in the same or another experiment file.
As soon as an experiment is linked to a database entry,
the experiment presence panel (see Figure 6-1.) shows a
green dot for the experiment of this entry.
You can edit the information fields for this entry in two
places: directly in the database (see 6.4.1 to 6.4.2), or in
the Sequence entry file window, by double clicking on the
entry.
NOTE: Experiment files added to the Experiment files
panel can also be deleted by selecting the file and
choosing File > Delete experiment file from the main
menu. Deleted experiment files are struck through (red
line) but are not actually deleted until you exit the
program. Until then, you can undo the deletion of the file
by selecting File > Delete experiment file again.
7.18 Input of sequences using the
BioNumerics Assembler program
Assembler is a plugin tool to assemble contig sequences
from partial sequences which result from sequencing
experiments. The program accepts flat text files as well
as binary chromatogram files from ABI, Beckman, and
Amersham automated sequencers, including the SCF
sequence trace format. In the latter cases, Assembler
allows the user to verify base assignments by inspecting
the chromatograms along with the partial sequences and
the consensus sequence. Assembler also investigates the
quality and ambiguity of the curve profiles to assign a
quality label to the partial sequences and trim off bad
parts where necessary.
Contig sequences are saved into projects, which contain all the information about the partial sequences, the editing made by the user, the multiple alignment, and the editing done on the contig. A contig project and its full information can be opened at any time from the BioNumerics sequence entry to which it is associated. Assembler can handle thousands of sequences in one single contig project and is optimized for speed and editability in large projects. The program can be launched from BioNumerics but not as a separate program.
7.18.1 Double-click on an entry in the database which does not have a sequence assigned.
The Entry edit window of the entry appears.
7.18.2 Click on the button next to the Sequence Type (e.g. SSU-ribo or another name you entered).
The program now asks "The experiment "SSU-ribo" is not defined for this entry. Do you want to create a new one?".
7.18.3 By answering <Yes>, the program will create a new empty sequence that is linked to this entry. The experiment card for the Sequence Type of this entry appears: a small empty window (see Figure 7-48.; see also 8.1).
Figure 7-48. Empty sequence experiment card with button to launch Assembler.
7.18.4 It is now possible to simply paste a sequence in this window; however, pressing the button will launch the Assembler program to assemble a contig sequence from a series of partial sequencing experiments for this entry (see also 8.2).
A. The Assembler main window
The Assembler main window, initially empty, looks as in Figure 7-49.
A set of partial 16S-rRNA sequences from Xanthomonas strain ICMP 9121 (see footnote 1), run on an ABI 370 machine, is provided on the installation CD-ROM and can be imported by selecting all the files under Examples\Seqassem.
7.18.5 Select File > Import sequence files or the corresponding button.
7.18.6 Select all sequences 11 ICMP 9121 ... 18 ICMP 9121 under Examples\Seqassem and press <Open>.
The six partial sequences are now shown in the Assembler main window as in Figure 7-49. The window consists of two tabs: Trimming and Assembly. The first tab, Trimming, displays the original sequences and gives an indication of the quality.
NOTE: The colors of text, background, bases, and all
other symbols may be changed by the user. The
descriptions below are given using the default colors,
which can be obtained by selecting View > Display
settings and pressing <Default>.
The top right panel shows the sequences in a graphical
representation. For each sequence, there is a quality
assignment, based on the quality of the densitometric
curves and the base assignment. Based on the quality,
the program will automatically trim the bad parts from
the sequences, which are underlined with a black bar.
Unknown bases (ambiguous positions) are indicated
with a dark red flag on top of the sequence.
1. Hauben L., L. Vauterin, J. Swings, and E.R.B. Moore.
1997. Int. J. Syst. Evol. Microbiol. 47: 328-335.
The top left panel shows the corresponding file names in the upper line. In the bottom line, the original size in base pairs and the size after trimming are shown for the sequence.
7.18.7 You can zoom in on the graphical overview panel with View > Zoom in (overview) or by pressing the corresponding button left from the overview panel.
7.18.8 To zoom out on the graphical overview panel, use View > Zoom out (overview) or press the corresponding button left from the overview panel.
7.18.9 The bottom panel displays the densitometric curve in four colors and the corresponding bases for the selected sequence.
7.18.10 You can zoom in on the curve view with View > Zoom in (trace) or the corresponding button in the button bar left from the curve panel.
7.18.11 To zoom out on the curve view, use View > Zoom out (trace) or the corresponding button in the button bar left from the curve panel.
7.18.12 To enlarge the curve vertically, click the corresponding button in the button bar left from the curve panel.
7.18.13 To shrink the curve vertically, click the corresponding button in the button bar left from the curve panel.
7.18.14 A sequence can be selected from the graphical overview in the upper right panel, or from the upper left file name panel. The selected sequence is highlighted in blue and its graphical overview is bordered by a blue rectangle.
7.18.15 A position can be selected on any sequence of the graphical overview by left-clicking. The selected position is indicated with a blue vertical line. The corresponding sequence chromatogram is shown in the bottom window, with the selected position centralized and highlighted in blue.
7.18.16 Likewise, a base position can be selected on the chromatogram in the bottom panel, which causes the selection to be updated in the upper panel as well.
Figure 7-49. The Assembler main window.
The logical workflow of a contig assembly is:
1. Cleaning trace files and quality assignment
2. Manual inspection of cleaning result
3. Removal of vector sequences (optional)
4. Assembling the contig (multiple alignment)
5. Manual inspection and correction of mismatches and unresolved positions
6. Trimming the consensus sequence according to known start and end signatures (e.g. primers) (optional)
The steps will be described in this order.
B. Cleaning chromatogram readouts
Before we actually align the sequences, we need to have
the bad parts cut out, i.e. the outermost left and/or right
parts from the curves with unreliable signal or no signal
at all. This process, called cleaning of sequences, consists
of two levels:
1. Trimming of the sequences, i.e. physically removing
the unusable ends. This level of cleaning is based
upon the percentage of unresolved positions at both
ends of the sequence. Trimmed ends are neither used,
nor shown in the Assembly view of the Assembler
main window.
2. Inactivating doubtful parts of the sequence. This level
of cleaning is based both on the quality of the
densitometric curves and the proportion of
unresolved positions. Inactivated parts are still
shown, but do not actively contribute to obtain the
consensus. However, they are aligned to the
consensus. In case there is no consensus base at a
position, the inactivated regions will not be
considered by the program. The user can still compare
the consensus position with the base in an inactivated
sequence region. Inactive regions can still be set as
active at anytime, whereas active regions can be set as
inactive as well. In case an inactivated region is the
only information available in a part of the consensus
sequence, it will be used to fill in the consensus
sequence. In case a position on an inactivated region
conflicts with other sequences, it will be ignored.
7.18.17 Cleaning of the sequences happens
automatically and is based on the quality assignment
settings. The quality of the sequence is shown on the
graphical overview in the Trimming view (Figure 7-49.).
A color scale ranges from green (acceptable quality)
over yellow and orange to red (unacceptable quality).
The trimmed ends are indicated by a black bar
underlining the sequence. Inactivated zones are
indicated by a gray bar. Unresolved positions (‘N’) are
indicated with a small flag on top of the sequence.
7.18.18 The quality assignment can be changed by
modifying the settings in the Quality assignment dialog
box (Figure 7-50.). This dialog box can be popped up
with File > Quality assignment or
.
7.18.19 The Curve quality parameters determine how
the program will investigate the quality of signal
derived from the curves. They include two ratios that
are considered in a certain window, determined by the
Sliding window size. The latter should be an odd
number, including the position itself and a number of
positions at either side.
The Minimum good/bad peak ratio is the ratio between
the signal strength of the weakest peak resulting in a
base and the strongest peak not resulting in a base
within the sliding window. The higher this ratio is set,
the more stringent the quality assignment becomes. A
suitable starting value for most systems is 1.50.
Figure 7-50. The Quality assignment dialog box.
The Minimum short/long distance ratio is the ratio of
the shortest distance between two positions and the
longest distance within the sliding window. A suitable
starting value is 0.60; the larger it is set, the more
stringent the quality assignment will be.
A typical value for the Sliding window size is 5
positions; increasing this value will result in a more
stringent quality assignment.
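The two curve quality ratios can be illustrated with a small Python sketch. It is an interpretation of the definitions above, not Assembler's actual algorithm. The function name curve_quality and its inputs are assumptions: called[i] is the height of the peak that resulted in base i, uncalled[i] is the height of the strongest peak at that position that did not result in a base, and positions[i] is the location of base i along the trace, in strictly increasing order.

```python
def curve_quality(called, uncalled, positions, window=5,
                  min_peak_ratio=1.5, min_distance_ratio=0.6):
    """Flag each base as good (True) or bad (False) from its sliding window."""
    half = window // 2
    flags = []
    for i in range(len(called)):
        lo, hi = max(0, i - half), min(len(called), i + half + 1)
        # Good/bad peak ratio: weakest called peak over strongest uncalled peak.
        weakest_called = min(called[lo:hi])
        strongest_uncalled = max(uncalled[lo:hi])
        peak_ok = (strongest_uncalled == 0
                   or weakest_called / strongest_uncalled >= min_peak_ratio)
        # Short/long distance ratio between neighboring positions in the window.
        gaps = [b - a for a, b in zip(positions[lo:hi], positions[lo + 1:hi])]
        distance_ok = not gaps or min(gaps) / max(gaps) >= min_distance_ratio
        flags.append(peak_ok and distance_ok)
    return flags
```

In this sketch a single strong uncalled peak lowers the ratio for every position whose window contains it, which illustrates why a larger Sliding window size leads to a more stringent quality assignment.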
7.18.20 Under Base calling quality parameters, you can
specify a Sliding window size, and the number of
resolved positions that should be found within the
sliding window (Minimum resolved). Similar as under
Base quality assignment, the sliding window size
should be an odd number. Suggested starting values are
a Sliding window size of 41 of which minimum 30
resolved positions.
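A minimal sketch of the base calling criterion, assuming that a resolved position is an A, C, G or T and that anything else (such as N) counts as unresolved. The function name base_calling_quality is hypothetical. Scaling the required count for shortened windows at the sequence ends is a choice made for this example, since the manual does not say how the ends are treated.

```python
def base_calling_quality(sequence, window=41, min_resolved=30):
    """Flag each position as good if enough bases around it are resolved."""
    half = window // 2
    flags = []
    for i in range(len(sequence)):
        lo, hi = max(0, i - half), min(len(sequence), i + half + 1)
        resolved = sum(1 for base in sequence[lo:hi] if base.upper() in "ACGT")
        # Compare fractions so that shortened windows at the ends are not
        # penalized: resolved / (hi - lo) >= min_resolved / window.
        flags.append(resolved * window >= min_resolved * (hi - lo))
    return flags
```

With the suggested starting values, a position is flagged as good when at least 30 of the 41 positions centered on it are resolved.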
7.18.21 Sequence trimming is based upon the Minimum
number of consecutive good bases, as defined by the
Curve quality parameters and the Base calling quality
parameters. A suggested value is 15; the larger the
number the heavier the sequence will be trimmed.
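One plausible reading of this rule is that the usable part of a sequence runs from the first to the last stretch of sufficiently many consecutive good bases. The Python sketch below implements that reading. The function name trim_bounds and the exact rule are assumptions, not Assembler's documented behavior.

```python
def trim_bounds(good, min_run=15):
    """Return (start, end) of the usable part, or None if nothing is usable.

    `good` holds one True/False flag per base. `start` is the first index
    kept and `end` is one past the last index kept.
    """
    runs = []
    start = None
    # A trailing False closes a run that extends to the end of the sequence.
    for i, flag in enumerate(list(good) + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_run:
                runs.append((start, i))
            start = None
    if not runs:
        return None
    return runs[0][0], runs[-1][1]
```

Raising min_run discards short good stretches near the ends, so the returned range shrinks, in line with the remark that a larger number trims the sequence more heavily.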
7.18.22 The Sequence acceptance parameters determine
whether a sequence as a whole will be accepted to
contribute to the consensus or not. The Minimum length
of usable sequence determines the length of the non-trimmed part of the sequence. The Minimum fraction of
good bases determines the ratio of good bases over the
total number of bases in the usable part of the sequence.
Suggested values are 50 bases of which minimum 25%
good bases.
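The acceptance test can be sketched as a simple check on the non-trimmed part of the sequence. The function name accept_sequence is hypothetical. The per-base good flags and the trimming bounds are assumed to come from the quality assignment and trimming described above.

```python
def accept_sequence(good, start, end, min_length=50, min_good_fraction=0.25):
    """Decide whether a trimmed sequence may contribute to the consensus.

    `good` holds one True/False flag per base, and `start` and `end`
    delimit the non-trimmed (usable) part of the sequence.
    """
    usable = good[start:end]
    if not usable or len(usable) < min_length:
        return False
    return sum(usable) / len(usable) >= min_good_fraction
```

With the suggested values, a sequence is accepted only if its usable part is at least 50 bases long and at least 25% of those bases are good.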
7.18.23 For the example sequences, which were generated on an old sequencer and have a rather poor quality, it is recommended to change the standard trimming settings slightly: under Curve quality parameters: Sliding window size 5; Minimum good/bad peak ratio 1.30; Minimum short/long distance ratio 0.60. Under Base calling quality parameters: Sliding window size 51; Minimum resolved 30. The other settings can remain unchanged, i.e. under Sequence trimming, 15 bp and under Sequence acceptance 50 bp and 25%, respectively.
7.18.24 Automatic cleanup (trimming and assignment of inactive zones) happens automatically after pressing the <OK> button in the Quality assignment dialog box. Any manual trimming and (in)activation done (see further) will be lost at this point.
After automatic quality assignment and trimming, the user can still manually correct the trimmed ends and inactive zones.
7.18.25 To mark the start of a sequence, click on the position to start (this can be done both on the overview and on the chromatogram) and select Edit > Mark start of sequence. You can also use the CTRL+Home keys on the keyboard, or the corresponding button left from the curve view (lower panel).
7.18.26 To mark the end of a sequence, click on the position to end (this can be done both on the overview and on the chromatogram) and select Edit > Mark end of sequence. You can also use the CTRL+End keys on the keyboard, or the corresponding button left from the curve view (lower panel).
7.18.27 To mark a zone as inactive, click on the start position of the zone, then hold down the SHIFT key while clicking on the end position of the zone (this can be done both on the overview and on the chromatogram). Choose Edit > Inactivate selected region or press the - (minus) key on the keyboard. The corresponding button left from the curve view (lower panel) can also be used.
7.18.28 To mark a selected zone as active, choose Edit > Activate selected region or press the + (plus) key on the keyboard. The corresponding button left from the curve view (lower panel) can also be used.
A sequence can be inactivated as a whole with Edit > Inactivate selected sequence. When inactivated, a sequence is marked with a red cross in the files panel (upper left).
A sequence that was inactivated by the Sequence acceptance parameters (7.18.22) can be activated manually with Edit > Activate selected sequence.
7.18.29 A sequence can be removed from the project with File > Remove selected sequence or the corresponding button. Conversely, sequences can be added to a project at any time with File > Import sequence files or the corresponding button.
C. Removing vectors
If the sequences contain residues from vector sequences, these need to be removed before the sequences are assembled.
7.18.30 Vectors can be removed from the unaligned sequences with File > Remove vectors or the corresponding button. This pops up the Remove vectors dialog box (Figure 7-51.).
Figure 7-51. The Remove vectors dialog box.
7.18.31 Vector sequences to remove can be added from the clipboard (by copying from another application). They can be pasted in the list by pressing <Add from clipboard>. This opens a new window, the Import vectors from clipboard editor (Figure 7-52.). The sequence on the clipboard is automatically pasted into the editor, which the user can still edit. An input field Name allows a name to be entered for the vector.
Figure 7-52. Import vectors from clipboard editor.
7.18.32 Vectors can be deleted from the list using the
<Delete selected> button.
Vectors entered are automatically saved along with the
project.
The Remove vectors dialog box (Figure 7-51.) contains a
number of alignment parameters:
7.18.33 Minimum score: the minimum number of
matching bases the sequence and the vector should have
in order for the vector sequence to be removed. This
number is the result of the total number of matching
bases minus the total penalty resulting from mismatches
and gaps.
7.18.34 Unit penalty per gap: the penalty, as a factor of
the match score, assigned to a gap in either the sequence
or the vector after the alignment.
7.18.35 Unit penalty per mismatch: the penalty, as a
factor of the match score, for a single mismatch between
the vector and the sequence after the alignment.
7.18.36 Maximum distance to edge: the maximum
number of unmatched bases at the end of the sequence.
Normally a vector sequence will extend over the end of
the trace sequence, so one will not expect unmatched
bases at the end of the sequence. Therefore, this number
should be set very low (e.g. 5 or less).
7.18.37 By pressing <OK> the vector sequences are
automatically searched for and removed from the
unaligned sequences. Removed vector sequences are
indicated in blue on the sequence overview (upper
panel).
NOTE: To undo vector removal, open the Remove
vectors dialog box, delete all vectors defined and press
<OK>. Vector removal as well as undoing vector
removal can only be executed on unaligned sequences. If
sequences are already aligned, you will first have to
remove the consensus (see below).
D. Alignment to consensus
7.18.38 The sequences are assembled into a consensus
with the menu command File > Assemble sequences or
by pressing the corresponding button. The Calculate
assembly dialog box is displayed (Figure 7-53.), allowing
the various alignment parameters to be entered.
Figure 7-53. Calculate assembly dialog box with
alignment parameters.
The Minimum match word size determines the number
of bases that are taken together into one word. The
algorithm creates a lookup table of these words to
accelerate the alignment. In an alignment to a consensus
sequence, no mismatches are expected, except due to bad
base calling. In that case, it is justified to choose a high
word size number. In the default setting of 7, only
stretches of 7 identical bases or more will be considered
as matches.
7.18.39 Minimum score: the minimum number of
matching bases the two sequences should have before
they will be aligned. This number is the result of the
total number of matching bases minus the total penalty
resulting from mismatches and gaps.
7.18.40 Unit penalty per gap: the penalty, as a factor of
the match score, assigned to a gap introduced in one of
the sequences after the alignment.
7.18.41 Unit penalty per mismatch: the penalty, as a
factor of the match score, for a single mismatch between
the two sequences after the alignment.
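The relation between the Minimum score and the two unit penalties can be written out as a small calculation. The Python sketch below is only an illustration of the description in 7.18.39 to 7.18.41: the function names and the penalty values used in the example are hypothetical, and the program's own scoring may differ in detail.

```python
def alignment_score(matches, mismatches, gaps,
                    mismatch_penalty=1.0, gap_penalty=1.0):
    """Matching bases minus the total penalty, in units of one match."""
    return matches - mismatch_penalty * mismatches - gap_penalty * gaps


def passes_minimum_score(matches, mismatches, gaps, minimum_score,
                         mismatch_penalty=1.0, gap_penalty=1.0):
    """True if the alignment reaches the Minimum score setting."""
    score = alignment_score(matches, mismatches, gaps,
                            mismatch_penalty, gap_penalty)
    return score >= minimum_score


# 40 matching bases, 2 mismatches and 1 gap, with both unit penalties
# set to 2 (illustrative values): 40 - 2*2 - 1*2 = 34.
print(alignment_score(40, 2, 1, mismatch_penalty=2.0, gap_penalty=2.0))
```

With these illustrative numbers, an overlap would be accepted at a Minimum score of 30 and rejected at 35.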
7.18.42 Maximum number of gaps relates to the
alignment technique that is used, i.e. a fast algorithm
based upon Needleman and Wunsch (1970)1. The
number of gaps the algorithm can create is proportional
to the number of diagonals specified. The larger the
number, the more accurate but the slower the
calculations. The suggested default setting is 25
diagonals.
7.18.43 The checkbox Ignore current assemblies allows
the algorithm to recalculate the consensus sequence(s)
from individual trace sequences without taking into
account any already calculated contigs.
7.18.44 Press <OK> to calculate the assembly or
<Cancel> to exit the Calculate assembly dialog box without
making any changes.
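The word lookup table behind the Minimum match word size setting can be pictured with the following Python sketch. It is a generic word (k-mer) index, written under the assumption that words are exact strings of identical bases; the function names are hypothetical and the sketch is not the program's actual algorithm.

```python
from collections import defaultdict


def build_word_table(sequence, word_size=7):
    """Map every word of word_size bases to the positions where it starts."""
    table = defaultdict(list)
    for start in range(len(sequence) - word_size + 1):
        table[sequence[start:start + word_size]].append(start)
    return table


def find_word_matches(seq_a, seq_b, word_size=7):
    """Return (pos_a, pos_b) pairs where both sequences share an identical word."""
    table = build_word_table(seq_a, word_size)
    matches = []
    for pos_b in range(len(seq_b) - word_size + 1):
        word = seq_b[pos_b:pos_b + word_size]
        # Looking up a word is a single dictionary access, which is what
        # makes the table faster than comparing every pair of positions.
        for pos_a in table.get(word, []):
            matches.append((pos_a, pos_b))
    return matches


# The two sequences share the 8-base stretch TACGGTCA, which contains
# two overlapping words of 7 bases.
print(find_word_matches("ACGTACGGTCA", "TTACGGTCAGG"))
```

A larger word size gives fewer, more specific matches and a smaller table to search, which is why a high value is reasonable when few mismatches are expected.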
D. Editing a consensus sequence
7.18.45 When the alignment is finished, the second view,
i.e. the Assembly view, is shown (Figure 7-54.). As
compared to the first view (Trimming view, see Figure
7-49.), a central panel now shows the consensus sequence
(upper line) and the individual trace sequences that
contribute to the displayed consensus.
Figure 7-54. The Assembler Main window, Assembly view (second tab).
1. Needleman, S., and C. Wunsch. 1970. J. Mol. Biol. 48:443-453.
7.18.46 The upper panel (overview panel) now displays
the aligned trace sequences. If the arrow points to the
left, the program has invert-complemented the sequence
to obtain the correct alignment.
7.18.47 The upper left panel now displays the selected
consensus with its length and the number of sequences
that are part of it. If the program could not align all the
trace sequences to a single consensus, the panel lists the
different consensus sequences with their lengths and
number of trace sequences. One should click on a
particular consensus sequence to select it for viewing
and editing.
The bottom panel displays the chromatogram file for the
selected trace sequence. Regardless of whether the
sequence is invert-complemented in the alignment, the
chromatogram is always shown in original mode. This
means that, when the sequence has been
invert-complemented, a G on the original sequence, for
example, will appear as a C on the consensus. Because
the direction of the curve can be opposite from
the sequence and the bases are not aligned, it is not
possible to select bases on the raw curves directly.
7.18.48 As opposed to this raw mode view, there is also an
aligned mode view, which is obtained by pressing the
corresponding button or with View > Show aligned
sequences. This view has the following features:
- Curves have been stretched or shrunk to obtain
equidistant spacing between the base positions.
- Trace sequences are always shown as transformed and
oriented in the consensus. If a sequence is
invert-complemented, the complement of the bases is
shown, and the colors of the curves are adjusted likewise.
- Multiple trace chromatograms can be shown together
and are aligned to each other and to the consensus (see
Figure 7-54.).
- Arrows on the curves indicate the direction of the
sequence: if the sequence has been inverted, the arrow
points to the left (Figure 7-54.).
- In the aligned view, it is possible to select bases directly
on the curves.
7.18.49 To shrink the curves vertically, click the
button in the button bar left from the curve panel.
7.18.50 To enlarge the curves vertically, click the
button in the button bar left from the curve panel.
7.18.51 A sequence can be moved up or down by
selecting it and choosing Edit > Move sequence up
(PgUp or the corresponding button) or Edit > Move
sequence down (PgDown or the corresponding button),
respectively.
7.18.52 Bases on the consensus sequence are assigned
according to the Consensus determination parameters,
which can be set with Assembly > Consensus
determination. The dialog box (Figure 7-55.) allows
three parameters to be set:
- Required bases to include position: The percentage of
sequences that need to have a base at a certain
position in order for the position to be inserted in the
consensus. For example using the default value 40, if
the consensus is determined by three sequences at a
certain position, it will not be accepted as a base if
there is a gap in two of the three sequences (33.3%).
- Required consensus for unique base calling: The
percentage of sequences that need to have the same
base at a position in order for the base to be accepted
as resolved.
- Minimum difference between first and second: The
percentage by which the second most occurring base
should be less than the first choice. For example, if the
most frequent base at a position is 50%, the second
most frequent base should be less than 30%, otherwise
the position remains unresolved.
Figure 7-55. The Consensus determination parameters
dialog box.
7.18.53 Allow group editing of sequences is a feature
that allows bases to be changed directly on the
consensus sequence. If this feature is disabled, bases can
only be changed on the individual trace sequences.
7.18.54 Unresolved positions on the consensus are
indicated in pink and extend over all sequences shown
(central panel, see Figure 7-54.).
Problem positions on individual trace sequences, which
have been solved under the current Consensus
determination parameters (7.18.52) are indicated in
orange. Such problem positions include mismatches as
well as unresolved positions.
7.18.55 To change a base in a trace sequence, place the
cursor on the base or on the position on the
chromatogram and type the base, which can be A, G, C,
or T (IUPAC codes for ambiguous positions are not
accepted).
7.18.56 To delete a base, select Edit > Delete base or
press the DEL key.
7.18.57 To insert a position, select Edit > Insert column
or press the INSERT key.
7.18.58 If consensus editing is enabled in the Consensus
determination parameters (7.18.52), it is also possible to
place the cursor on the consensus sequence and type a
base, which causes the base to be changed on all
sequences that have signal at the selected position.
7.18.59 As mentioned before, the Assembler program
contains a multistep undo and redo function. In
addition, the program also stores a history of editing
actions done on each individual sequence. This
information can be popped up by selecting the sequence
(clicking on any position on the sequence, in the
chromatogram or on the overview) and calling Edit >
Sequence information (CTRL+I) or pressing the
corresponding button. The Sequence editing information
box (Figure 7-56.) lists all base corrections that are made
to the sequence. The corrections recorded include base
changes, deletions and insertions.
Figure 7-56. The Sequence editing information box.
7.18.60 From the Sequence editing information box, you
can select a particular editing action in the list, and press
<Select on sequence>. The position will be selected on
the sequence. A correction made can be undone by
pressing the <Discard change> button.
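The three Consensus determination parameters of 7.18.52 can be illustrated with a short Python sketch that calls a single consensus position. Several points are assumptions made for this example only: the function name, the use of "-" for a gap and "N" for an unresolved position, the default values other than 40 for the first parameter, and the choice to compute base percentages over the sequences that have a base at the position. It is not the program's actual implementation.

```python
from collections import Counter


def call_consensus(column, required_bases_pct=40.0,
                   required_consensus_pct=50.0, min_difference_pct=20.0):
    """Call one consensus position from a column of aligned bases.

    column: one character per sequence covering this position, '-' for a gap.
    Returns None if the position is left out of the consensus, 'N' if it
    stays unresolved, or the resolved base.
    """
    bases = [b for b in column if b != "-"]
    # Parameter 1: enough sequences must have a base at this position.
    if not bases or 100.0 * len(bases) / len(column) < required_bases_pct:
        return None
    counts = Counter(bases).most_common()
    top_base, top_count = counts[0]
    second_count = counts[1][1] if len(counts) > 1 else 0
    top_pct = 100.0 * top_count / len(bases)
    second_pct = 100.0 * second_count / len(bases)
    # Parameter 2: the most frequent base must be frequent enough.
    if top_pct < required_consensus_pct:
        return "N"
    # Parameter 3: the runner-up must trail by more than the minimum difference.
    if top_pct - second_pct <= min_difference_pct:
        return "N"
    return top_base


# One base and two gaps out of three sequences: 33.3% is below 40%.
print(call_consensus(["A", "-", "-"]))
# 50% A against 30% C: the difference is not more than 20, so unresolved.
print(call_consensus(list("AAAAACCCGG")))
# 50% A against 20% C: resolved as A.
print(call_consensus(list("AAAAACCGGT")))
```

The first two calls mirror the worked examples given for the first and third parameter.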
7.18.61 A range of bases can be selected on the curves or
on the sequences in the central panel by clicking the first
position of the range, then holding down the SHIFT key
while clicking on the last position. A selected range is
highlighted by a blue rectangle in the sequence view.
Range selection by dragging the mouse is also possible
in the sequence view.
7.18.62 If a selected range of bases is flanked by a gap
at one side, it is possible to shift the selection towards
that gap, to correct misalignments. Shifting towards the
left can be done with ALT+left arrow key and shifting
towards the right with ALT+right arrow key. These
commands can also be found in the menu (Edit > Shift
block left, and Edit > Shift block right, respectively).
7.18.63 To check the consensus sequence for correctness,
you can let the program jump to each next unresolved
problem position using View > Next unresolved problem
or the corresponding button (or using the shortcut
CTRL+Right arrow key).
7.18.64 To jump to the previous unresolved problem,
use View > Previous unresolved problem or the
corresponding button (or using the shortcut CTRL+Left
arrow key).
7.18.65 In case the program has incorrectly aligned a
sequence to one or more other sequences, you can place
the cursor on the misaligned sequence and select
Alignment > Break selected sequence apart.
7.18.66 New sequences can be added at any time to the
existing alignment project by switching to the first view
and selecting File > Import sequence files, and
subsequently selecting File > Assemble sequences. In the
Calculate assembly dialog box (Figure 7-53.), Ignore current
assemblies should normally be unchecked, to preserve
the assembly or assemblies already present.
Although Assembler automatically inverts and
complements subsequences wherever necessary to
obtain the consensus sequence, the program cannot
know the correct orientation of the consensus sequence.
Hence, it may be necessary to invert and complement
the consensus sequence before entering it into the
database.
7.18.67 Invert-complement the consensus sequence by
selecting the consensus to invert and choosing
Assembly > Invert direction or the corresponding button.
NOTE: In case the program couldn't find one single
consensus for all subsequences, two or more assemblies
will exist. Therefore you will need to select the assembly
to invert from the list in the upper left panel before
executing the invert-complement function.
7.18.68 The following editing actions are available to
further clean up sequences (see also 7.18.25 to 7.18.28 in
the Trimming view).
7.18.69 To mark the start of a sequence, click on the
position to start and select Edit > Mark start of
sequence. You can also use the CTRL+Home keys on the
keyboard, or the corresponding button to the left of the
curve view (central panel).
7.18.70 To mark the end of a sequence, click on the
position to end and select Edit > Mark end of sequence.
You can also use the CTRL+End keys on the keyboard,
or the corresponding button to the left of the curve view
(central panel).
7.18.71 To mark a zone as inactive, click on the start
position of the zone, then hold down the SHIFT key
while clicking on the end position of the zone (this can
be done both on the sequence and on the chromatogram,
not on the overview). Choose Edit > Inactivate selected
region or press the - (minus) key on the keyboard. The
corresponding button to the left of the curve view
(lower panel) can also be used.
7.18.72 To mark a selected zone as active, choose Edit >
Activate selected region or press the + (plus) key on the
keyboard. The corresponding button to the left of the
curve view (lower panel) can also be used.
7.18.73 It is also possible to extend a
sequence that has been trimmed off too far. To do so,
select the outermost base on the sequence and choose
Edit > Extend sequence (CTRL+X). An input box will ask
you to enter the number of bases to extend.
7.18.74 A region on an individual sequence or on the
consensus can be selected as explained in 7.18.61, and
can be copied to the clipboard using Edit > Copy.
7.18.75 The entire sequence on which the cursor stands,
or the entire consensus, can be selected with Edit >
Select all.
7.18.76 A consensus sequence and its associated
alignment can be removed by selecting it in the upper
left panel and choosing Assembly > Delete selected
contig or pressing the corresponding button.
7.18.77 All alignments and consensus sequences can be
removed with Assembly > Delete all contigs or by
pressing the corresponding button.
The latter two options can be useful if you want to load
stored templates (see 7.18.101), remove vectors (7.18.30)
or change the quality assignment parameters (7.18.18).
Those actions cannot be performed if an alignment is
present.
7.18.78 The overview panel of a contig project can be
printed with File > Print overview.
E. Advanced alignment editing using the dot
plot view
7.18.79 Using a dot plot view, regions of homology
between two sequences are displayed graphically. To
allow the dot plot to display the homology between very
long sequences in an efficient way, three reduction
factors are applied: (1) bases are grouped together
into words of a specific length, (2) a minimum number of
bases should match before the match is displayed on the
dot plot, and (3) the entire plot is reduced in size. These
three parameters can be set when the Dot plot view
window is called with Assembly > View dot plot or
the corresponding button.
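The effect of the three reduction factors can be illustrated with the following Python sketch, which computes the cells of a dot plot for direct repeats between two sequences. It rests on several assumptions that are not taken from the program: the function name is hypothetical, a match is read as an uninterrupted run of identical bases along one diagonal, the minimum score is read as the minimum length of such a run, and inverted repeats are left out.

```python
def dot_plot_points(seq_a, seq_b, word_size=4, min_score=6, reduction=1):
    """Return reduced (x, y) cells for direct repeats between two sequences."""
    # (1) Group bases into words and record where each word occurs in seq_a.
    index = {}
    for i in range(len(seq_a) - word_size + 1):
        index.setdefault(seq_a[i:i + word_size], []).append(i)

    # Collect word hits per diagonal (i - j is constant along a diagonal).
    diagonals = {}
    for j in range(len(seq_b) - word_size + 1):
        for i in index.get(seq_b[j:j + word_size], []):
            diagonals.setdefault(i - j, []).append(i)

    points = set()
    for diag, starts in diagonals.items():
        starts.sort()
        run = [starts[0]]
        # The trailing None flushes the last run of each diagonal.
        for i in starts[1:] + [None]:
            if i is not None and i == run[-1] + 1:
                run.append(i)
                continue
            # (2) A run of n consecutive word hits spans n + word_size - 1 bases.
            if len(run) + word_size - 1 >= min_score:
                for a in range(run[0], run[-1] + word_size):
                    # (3) Shrink the plot by the reduction factor.
                    points.add((a // reduction, (a - diag) // reduction))
            if i is not None:
                run = [i]
    return sorted(points)


# The sequences share the 8-base stretch TACGGTCA; with a reduction
# factor of 4 it collapses to a handful of cells.
print(dot_plot_points("ACGTACGGTCA", "TTACGGTCAGG", reduction=4))
```

Larger word sizes and minimum scores suppress short chance matches, and a larger reduction factor keeps the plot manageable for long sequences, which is why suitable values depend on the size of the project.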
7.18.80 The Dot plot parameters dialog box (Figure 7-57.)
prompts to enter the parameters for Word size,
Minimum score, and Reduction factor. The values to
enter depend strongly on the size of the project.
Figure 7-57. The Dot plot parameters dialog box.
7.18.81 When pressing <OK> the Dot plot view window
appears (Figure 7-58.). In this view, each consensus
sequence is represented as one gray square. Repeats
found within a consensus are shown within the gray
squares, whereas repeats found between the consensus
sequences are shown in the rectangles that form the
intersections between the consensus sequences. The
upper left part of the view displays the direct repeats (in
green), whereas the lower left part of the view displays
the inverted repeats (in blue).
Figure 7-58. The Dot plot view window.
7.18.82 You can zoom in or out on the plot using Edit >
Zoom in and Edit > Zoom out, or by using the zoom
buttons.
A repeat (direct or inverted) can only be considered
interesting in a contig project if it extends from a vertical
side of a rectangle to a horizontal side: only then is there
a complete overlapping end between consensus
sequences, which can thus be merged.
Inside the dot plot view, you can click on a particular
dot or stretch of dots. A red cursor appears, and the
upper panel displays the matching region between the
two sequences, with matching bases on a green
background.
7.18.83 To merge two sequences that have a terminal
match, select Edit > Merge contigs or press the
corresponding button. The consensus sequences are now
merged in the Assembler main window and the dot plot
view is updated accordingly.
F. Storing and a contig project
7.18.84 A contig project can be marked as being
approved or not. When working in a connected
database, the user can specify to display the status of the
contig projects (approved or not) in the Experiment
presence panel (see 28.2). Approved sequences are
indicated as a green square whereas non-approved
sequences are indicated with a transparent square. The
same squares are also indicated on the sequence
experiment card (Figure 7-48.). Sequences can be
marked approved or non-approved with the File >
Approved command.
7.18.85 When the aligned sequences are ready for
importing in the sequence database, select File > Save
(CTRL+S), or press the corresponding button.
7.18.86 From a sequence in BioNumerics assembled
using the Assembler tool, the project can be opened in
Assembler by pressing the corresponding button in the
small sequence edit box opened from the Entry edit card,
or in the Kodon Sequence Editor if Kodon is linked to
BioNumerics. Such projects can be changed at any time
and are updated automatically in the BioNumerics
database.
G. Finding subsequences
7.18.87 With Edit > Find or CTRL+F you can pop up a
Find sequence tool in Assembler (Figure 7-59.) to find
subsequences. You can fill in a subsequence including
unresolved positions according to the IUPAC code (e.g.
"R", "Y", etc., including "N").
Figure 7-59. The Find sequence dialog box.
7.18.88 Under Search in, you can choose between
Current sequence (the selected one), All sequences, and
Consensus.
7.18.89 Using Mismatches allowed, it is possible to find
subsequences that differ in a defined number of bases
from the entered string.
7.18.90 The checkbox Consider gaps as mismatches
allows the search algorithm to introduce gaps in either
the search sequence or the target sequence to match
them. Gaps are considered in the same way as
mismatches, and thus depend on the Mismatches
allowed setting.
7.18.91 Use IUPAC codes allows the search sequence to
be matched with uncertain positions denoted as IUPAC
unresolved positions (e.g. "R", "Y", etc., including "N").
7.18.92 With Search in both directions enabled, the
invert-complemented sequence will be searched through
as well.
7.18.93 Press <Search> to execute the search command.
The Result set displays all the instances that were found
(Figure 7-59.), indicating with arrows if they have been
found on the sequence as is, or after
invert-complementing. The positions are also indicated.
7.18.94 If you click on an item in the list under Result set,
the matching subsequence is selected in the sequence
panel (central panel). The bottom panel of the Find
sequence window displays the alignment of the search
sequence and the target sequence, indicating
mismatches and gaps introduced (if allowed).
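The search options of the Find sequence tool (IUPAC codes, Mismatches allowed, Search in both directions) can be illustrated with the following Python sketch. It is a plain, gap-free scan written for this guide, not the program's search algorithm: the function names are hypothetical, sequences are assumed to be upper case, and the Consider gaps as mismatches option is not covered.

```python
# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def invert_complement(seq):
    """Complement every base and reverse the result."""
    return seq.translate(COMPLEMENT)[::-1]


def find_pattern(target, pattern, mismatches_allowed=0, both_directions=False):
    """Return (position, strand) hits of pattern in target, without gaps.

    Positions on the "-" strand refer to the invert-complemented target.
    """
    def scan(text, strand):
        hits = []
        for pos in range(len(text) - len(pattern) + 1):
            mismatches = 0
            for p, t in zip(pattern, text[pos:pos + len(pattern)]):
                # A position matches if the two codes share at least one base.
                if not set(IUPAC[p]) & set(IUPAC[t]):
                    mismatches += 1
                    if mismatches > mismatches_allowed:
                        break
            else:
                # The inner loop finished without exceeding the limit.
                hits.append((pos, strand))
        return hits

    hits = scan(target, "+")
    if both_directions:
        hits += scan(invert_complement(target), "-")
    return hits


# "Y" stands for C or T, so CGTY matches CGTT at position 2; the
# invert-complemented target TGCAACGTT holds a second hit at position 5.
print(find_pattern("AACGTTGCA", "CGTY", both_directions=True))
```

Raising mismatches_allowed widens the result set in the same way as the Mismatches allowed setting; supporting gaps would require an alignment step that this sketch leaves out.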
H. Trimming the consensus
The purpose of this tool is to locate two fixed
subsequences on the consensus to define the start and
end position, respectively. One can choose to include or
exclude the locator sequences in the final consensus. In
many cases, but not always, these subsequences will
correspond to primers used. For generality, the
subsequences are called trimming targets in the program
and in the description that follows.
7.18.95 Select Assembly > Consensus trimming or press
the corresponding button to open the Consensus
trimming dialog box (Figure 7-60.).
Figure 7-60. The Consensus trimming dialog box.
7.18.96 Under Trimming targets, you can fill in a Start
pattern and an End pattern. For both the start and end
patterns, you can specify Mismatches allowed, and fill
in a Target range on the consensus. The latter is to
restrict the search to certain regions on the consensus,
e.g. to prevent incidental matches inside the targeted
consensus sequence.
7.18.97 With Search both directions, the entered
trimming targets will be searched for on the consensus
as it appears as well as on its complementary strand. In
case the trimming targets match the complementary
strand of the consensus, it will be automatically
invert-complemented.
7.18.98 Minimum number of sequences specifies a
minimum number of trace sequences that should be
contributing to the subsequence on the consensus that
matches the trimming targets. For example, if 2 is
entered, a trimming target will only be set if the
matching region on the consensus is fully defined by at
least 2 sequences.
7.18.99 With Start offset and End offset, one can specify
that the consensus is trimmed at a certain offset from the
start and end trimming target positions, respectively. If
no offset is specified (zero), the trimming targets are
included in the trimmed consensus.
7.18.100 When the trimming targets have been set by
pressing the <OK> button in the Consensus trimming
dialog box, the overview window (upper panel) shows
black hatched lines at the positions of the trimming
targets. Likewise, the consensus sequence in the central
panel is grayed where it is trimmed off.
I. Storing and using assembly templates
7.18.101 The Assembler program allows all user-defined
settings to be stored in a Template. These settings
include the display settings, the quality assignment
parameters, the vectors to remove and their parameters,
the alignment parameters, the consensus determination
parameters, and the consensus trimming targets and
their parameters. A template can be stored or opened
with File > Templates or the corresponding button. This
will open the Templates dialog box (Figure 7-61.) which
allows the current template to be saved with <Save
current>, or a selected template from the list (left) to be
loaded with <Load template>. A selected template can
be deleted with <Delete selected>.
NOTE: A template can only be loaded if no alignment is
present. To load a template, you will need to remove the
assemblies first, which can be done with Assembly >
Delete all contigs (7.18.77).
Figure 7-61. The Templates dialog box.
The BioNumerics sequence experiment card is filled
with the assembled sequence as soon as the project is
saved.
7.18.102 From a sequence in BioNumerics assembled
using GeneBuilder, the project can be opened in
GeneBuilder by pressing the corresponding button in the
sequence experiment card (7.18.4). The base selected in
the experiment card will automatically be selected in the
GeneBuilder editor. Such projects can be changed at any
time and are updated automatically in the BioNumerics
database.
7.18.103 From within a BioNumerics comparison, you
can double-click on a base of a sequence, which pops up
the sequence experiment card with that base selected.
Pressing the corresponding button in the sequence
experiment card in turn launches GeneBuilder with the
same base selected.
7.19 Defining a new Matrix Type
7.19.1 Select Experiments > Create new matrix type from
the main menu, or press the corresponding button and
select New matrix type.
7.19.2 Enter a name for the new type, for
example "DNA-homol".
Press the <OK> button to complete the setup of the new
Matrix Type. It is now listed under Matrix types in the
experiment type panel.
Unlike other experiments, a Matrix Type does not
provide an experiment for each entry. Instead, it
contains similarities between entries. Hence, the "data file"
which contains the experiment data, and the "entry file"
which links the experiments to database entries are the
same here. There are two ways to enter similarity
values: by importing a matrix as a whole, and by
entering the values from the keyboard.
To import a matrix, it must have the following format:
ENTRY KEY<tab>VALUE<eol>
ENTRY KEY<tab>VALUE<tab>VALUE<eol>
ENTRY KEY<tab>VALUE<tab>VALUE<tab>VALUE<eol>
Etc.
<eol> means "end of line", a simple return in MS-DOS
text, which corresponds to ASCII character # 13 followed
by ASCII character # 10.
Matrix files can be imported by selecting the Matrix
Type in the experiment types panel, and File > Import
experiment file. The program compares the entry keys as
provided in the import file with the entry keys in the
database and assigns values to the corresponding keys.
If entry keys are not found in the database, it will
automatically create new database entries.
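The matrix import format (one entry key per line, followed by tab-separated values, with each line holding one more value than the previous one) can be checked before import with a few lines of Python. The sketch below is an illustration only: the function name, the entry keys K1 to K3 and the similarity values are made up, and reading the last value on each line as the similarity of the entry with itself is an assumption drawn from the layout shown in the format description.

```python
def parse_matrix_text(text):
    """Parse a lower-triangular similarity matrix into keys and a full matrix."""
    keys, rows = [], []
    # splitlines() accepts the MS-DOS line ending (character 13 followed by 10).
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        keys.append(fields[0])
        rows.append([float(v) for v in fields[1:]])
    size = len(keys)
    matrix = [[None] * size for _ in range(size)]
    for i, row in enumerate(rows):
        # Line 1 holds 1 value, line 2 holds 2 values, and so on.
        if len(row) != i + 1:
            raise ValueError(f"row {i + 1} should hold {i + 1} values")
        for j, value in enumerate(row):
            # Similarities are symmetric, so mirror each value.
            matrix[i][j] = matrix[j][i] = value
    return keys, matrix


# Hypothetical three-entry file with made-up keys and values.
example = "K1\t100\r\nK2\t85.5\t100\r\nK3\t60\t72.5\t100\r\n"
keys, matrix = parse_matrix_text(example)
print(keys)
print(matrix[0][1], matrix[1][2])
```

A check of this kind catches lines with a missing or extra value before the file is handed to File > Import experiment file.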
To enter similarity values manually, you first have to
select the entries in the database for which you want to
create a matrix.
7.19.3 Select some entries in the database by holding the
CTRL key and left-clicking (see further, paragraph 9.2).
Selected entries are marked with a blue arrow.
7.19.4 Double click on the file DNA-homol in the Files
panel (not in the Experiment types panel). This opens
the Matrix file window (Figure 7-62.).
The diagonals, i.e. the similarity values of the entries
with themselves, are filled in already and cannot be
changed.
7.19.5 To enter a value, press Enter or double click on a
field.
7.19.6 When finished, exit the window with File > Exit.
Figure 7-62. The Matrix file window to enter and edit similarity values.
8. Experiment display and edit functions
In paragraph 6.4, we have explained how you can edit
the information fields for each database entry by double
clicking on the entry (6.4.1), which pops up the Entry edit
window. It is possible to enter and view experiment data
directly from the Entry edit window.
In order to explain the edit functions, we will use the
DemoBase database.
8.0.1 Exit the main program
8.0.2 Back in the Startup screen, select DemoBase and
start Analyze again.
8.1 The experiment card
8.1.1 If we open the Entry edit window for any database
entry (except a standard) here, the window lists all
available experiment types for this entry, each of which
contains two buttons (Figure 8-1.).
Figure 8-1. Entry edit window.
8.1.2 With the corresponding button, you can display the
Experiment card of an experiment (Figure 8-2.).
An Experiment card can also be opened from the Main
window, by clicking on the green dot that indicates that
an experiment is present for an entry (see 6.1).
8.1.3 You can move the experiment card by clicking and
holding the left mouse button on the card, and then
dragging it to its new position. For sequence experiment
cards, drag the window by its caption.
For experiments that are not available for the entry, an
empty flask is shown.
8.1.4 When you hover over the image card with the
mouse, a small tag displays additional information. In
case of a fingerprint, it shows the key of the entry, and
the gel name and lane number. In case of a Character
Type shown as a plate it shows the key of the entry, and
the name and the value of the character being pointed
to.
8.1.5 Close an experiment card by clicking in the small
triangle-shaped button in the left upper corner.
You can open an experiment card for an entry, close its
Entry edit window, and then show the corresponding
experiment card for another entry, to arrange and
compare them side by side. Only the screen size limits
the number of experiment cards that can be shown
together.
8.1.6 In case of a Fingerprint Type experiment card, you
can increase or decrease the size of the card using the
keyboard, by pressing the numerical + key (increase) or
the numerical - key (decrease). You can right-click on
the experiment card to pop up a floating menu, from
which you can Export normalized curve, Export
normalized band positions, and Export band metrics.
8.1.7 In case of a Character Type, you can right-click on
the experiment card to pop up a floating menu, from
which you can call the character image import program
BNIMA (Edit image), copy the data set to the clipboard
(Copy data to clipboard), or paste data from the
clipboard into the experiment (Paste data from
clipboard). Export character values creates a similar
output, but provides the names of the characters in case
of an open character set (see 7.14).
8.2 Entering experiment data via the experiment card
For Character Types and Sequence Types, it is possible
to enter the experiment data directly on the experiment
card.
8.2.1 Open the Entry edit window. In this example, we
will open the Entry edit window for a STANDARD.
•Character Types
Depending on whether the Character Type is closed
(defined number of tests) or open (undefined number of
tests), the input method is different.
Figure 8-2. Experiment cards of Fingerprint Type (1), Character Type with fixed number of characters (closed
type) (2), Character Type with non-fixed number of characters (open type) (3), and Sequence Type (4).
8.2.2 Click the button of an empty Character Type;
in the example, we choose PhenoTest, which is a closed
Character Type.
A message displays “The experiment ‘PhenoTest’ is not
defined for this entry. Do you wish to create a new one?”.
8.2.3 Answer <Yes> to this question. An empty
experiment card appears.
8.2.4 In case of binary (plus or minus) data, you can
enter the values using the numerical + and - keys. The
cursor automatically jumps to the next test once you have
entered a value.
8.2.5 You can move the cursor using the Left and Right
arrow keys.
NOTE: if you use the + and - keys to enter non-binary
data, the defined maximum for the Character Type is
used if + is entered.
8.2.6 In case of non-binary values (real or integer
values), each test can be varied continuously between
the minimum and the maximum using the PgUp key
(increase intensity) and the PgDn key (decrease
intensity).
8.2.7 Press the close button of the experiment card. The
program asks to save the changes made.
8.2.8 Open an Entry edit window of an entry which
contains a FAME experiment (shown as a green dot in
the experiment presence panel).
8.2.9 In the Entry edit window, press the button of
FAME. The card lists all fatty acids that are present in
this entry, as percentages.
8.2.10 Double click on a fatty acid to change its value. A
dialog box prompts to “Enter a value of character XXX”.
8.2.11 Press <Cancel> if you do not wish to modify the
data of the demobase.
8.2.12 At the bottom of the list, the last line shows <Add
new character>. If you double click on this line, a dialog
box shows all known characters for this Character Type
which are not yet available in this entry, from which you
can select one.
8.2.13 Press the <Create new> button to create a new
character.
•Sequence Types
8.2.14 Press the button of an empty Sequence Type;
in the example, 16S rDNA.
8.2.15 An empty sequence editor appears. You can enter
bases or amino acids manually, or by pasting from the
clipboard (SHIFT + INS).
8.2.16 When the sequence has been generated using the
BioNumerics contig assembler program "GeneBuilder",
pressing the button will launch GeneBuilder with
the contig project associated with this sequence.
8.2.17 When no contig project is available for this entry,
pressing the button will launch GeneBuilder with a
new project associated. See 7.18 to work with
GeneBuilder.
8.3 Entering experiment data via the
experiment file
8.3.1 Using the
button, you can open the Experiment
entry file window, with the entry selected. With File >
Edit fingerprint data (
), File > Edit character data
or File > Edit sequence data, you can edit the
experiment data for the entry directly in the file.
9. Comparison functions
9.1 Definition
A Comparison in BioNumerics includes every function to
compare multiple database entries. This involves
displaying experiment images of selected entries,
calculating and showing cluster analyses, aligning
sequences, and calculating principal component
analysis (PCA) and multi-dimensional scaling (MDS).
The Comparison window in BioNumerics presents a
comprehensive overview of all available experiments for
a selection of entries and enables the user to show and
compare any combination of images of experiments. A
Comparison window is always associated with a selection
of entries from the database. To select entries, several
search and selection functions are available.
In order to explain the comparison functions, we will
use the DemoBase database. If the DemoBase is not the
current database loaded, execute the following two
steps.
9.1.1 Exit the main program
9.1.2 Back in the Startup screen, select DemoBase and
start Analyze again.
9.2 Manual selection functions
A single entry can be selected by holding the CTRL key
and left-clicking. Selected entries are marked by a blue
arrow (Figure 9-1.). Selected entries are unselected in the
same way.
Figure 9-1. Database panel in the main program,
showing selected entries (blue arrows).
9.2.1 Select the first non-standard lane (CTRL + mouse
click). The entry is now marked by a blue arrow.
9.2.2 In order to select a group of entries, hold the SHIFT
key and click on another entry.
9.2.3 If you wish to select entries using the keyboard,
you can scroll through the database using the Up/Down
arrow keys, and select or unselect entries using the
space bar.
9.2.4 A single entry can be selected or unselected from
its Entry edit window (6.4) using the
button. When
the entry is selected, this button shows as
.
9.2.5 To make viewing of selected entries easier in a
large database, you can bring all selected entries to the
top of the list with Edit > Bring selected entries to top.
9.2.6 Clear all selected entries with Edit > Clear selection
list (F4 key) or
.
9.3 Automatic search and select functions
Besides the manual selection functions as described
above, BioNumerics possesses more advanced database
search functions.
9.3.1 Select Edit > Search entries (F3) or press the
corresponding button. This pops up the Entry search
dialog box (Figure 9-2.).
Figure 9-2. Entry search dialog box.
You can enter a specific search string for each of the
database fields defined in the database (left panel).
Wildcards can be used to search for substrings: an
asterisk * replaces any range of characters in the
beginning or the end of a string, whereas a question
mark ? replaces one single character.
It is also possible to search for all entries that contain a
certain experiment (right panel). Both the string search
and the experiment search can be combined.
Normally, successive searches are additive: new
searches are added to the selection list. The Search in list
checkbox allows you to refine the search within a list of
selected entries.
With Negative search, all entries that do not match the
specified criteria will be selected.
Case sensitive lets the program make a distinction
between uppercase and lowercase.
The <Clear> button clears all entered search criteria.
9.3.2 As an example, enter *L* in the Species field.
9.3.3 Press <Search>. All entries having an L in their
species name are selected: Ambiorix sylvestris and
Vercingetorix palustris.
9.3.4 Call the Entry search dialog box again, and press the
<Clear> button.
9.3.5 Enter STANDARD in the Genus field, and check
the Negative search checkbox.
9.3.6 Press <Search> to select all database entries, except
the entries used as standard lanes in the RFLP
techniques.
9.3.7 Clear the selection with the F4 key or click the
corresponding button (Edit > Clear selection list).
9.4 The advanced query tool
BioNumerics contains an advanced query tool that
allows searches of any complexity to be made within the
database, based on information fields and experiment
data.
9.4.1 Call the query tool again, by selecting Edit > Search
entries or pressing the corresponding button.
The Entry search dialog box (Figure 9-2.) contains a button
<Advanced query tool>.
9.4.2 Press <Advanced query tool>. The normal Entry
search dialog box changes into the Advanced query tool
(Figure 9-3.).
The advanced query tool allows you to create individual
query components, which can be combined with logical
operators. The available targets for query components are
Database field, Database field range, Experiment
presence, Fingerprint bands, Character value,
Subsequence, and Attachment.
•Database field
Using this component button, you can enter a
(sub)string to find in any database field (<Any field>) or
in any specific field that exists in the database (Figure
9-4.). Note that the wildcards * and ? are not used in the
advanced query tool.
The search component can be specified to be Case
sensitive or not. In addition, a search string can be
entered as a regular expression (see 30.2).
Figure 9-4. Database field search component dialog
box.
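The difference between the wildcard search of the Entry search dialog box (9.3) and the (sub)string search of the Database field component can be illustrated with a short sketch. This is only an illustration of the matching rules described in the text, not the program's own code: the function names are hypothetical, the wildcard sketch accepts * and ? anywhere in the string, and Python's regular expression syntax is assumed in place of the syntax described in 30.2.

```python
import re

def wildcard_to_regex(pattern: str) -> str:
    """Translate * (any run of characters) and ? (one character) to a regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))  # every other character is literal
    return "".join(parts)

def wildcard_match(value: str, pattern: str, case_sensitive: bool = False) -> bool:
    """Entry search dialog style: the whole field must match the pattern."""
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.fullmatch(wildcard_to_regex(pattern), value, flags) is not None

def substring_match(value: str, needle: str, case_sensitive: bool = False) -> bool:
    """Database field component style: plain (sub)string, no wildcards."""
    if not case_sensitive:
        value, needle = value.lower(), needle.lower()
    return needle in value

def regex_match(value: str, expression: str, case_sensitive: bool = False) -> bool:
    """Regular expression search (Python syntax, an assumption)."""
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.search(expression, value, flags) is not None

# Species names taken from the example in 9.3.2 and 9.3.3.
species = ["sylvestris", "palustris", "sp."]
selected = [name for name in species if wildcard_match(name, "*L*")]
print(selected)
```

With the pattern *L*, the two species names containing an L are selected, as in step 9.3.3. In the advanced query tool the same result would be obtained by entering the plain substring L, because the component already searches for substrings.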
Figure 9-3. The advanced query tool.
•Database field range
Using this component button, you can search for
database field data within a specific range, which can be
alphabetical or numerical. Specify a database field, and
enter the start and the end of the range in the respective
input boxes (Figure 9-5.). A range should be specified
with the lower string or value first. Note that, when only
one of the two limits is entered, the program will accept all
strings above or below that limit, depending on which
limit was entered. For example, when only the first
(lower) limit of the range is entered and the upper limit
is left blank, all strings (values) above the specified string
(value) will be accepted.
The search component can be specified to be Case
sensitive or not. When Numerical values is checked, the
search component will look only for numerical values
and ignore any other characters.
Figure 9-5. Database field range component dialog
box.
•Experiment presence
With this search component, you can specify an
experiment to be present in order for entries to be
selected.
•Fingerprint bands
The Fingerprint bands search component allows specific
combinations of bands to be found in the database
entries. The dialog box that pops up (Figure 9-6.) allows
you to enter a Fingerprint experiment type, and specify
an Intensity filter, a Target range, and a Number of
bands present.
Figure 9-6. Fingerprint bands presence component
dialog box.
Under Intensity filter, you can choose which intensity
parameter to be used: Band height, Band surface or
Relative band surface. When a 2D quantification
analysis is done, you can also choose Volume, Relative
volume or Concentration. A range should always be
specified with the lower value first. Note that, when
only one of both limits is entered, the program will
consider all bands above or below that limit, depending
on which limit was entered. For example, when only the
first (lower) limit is entered and the upper limit is left
blank, all bands above the specified intensity will be
accepted. When both fields are left blank, no intensity
range will be looked for, i.e. all bands will be
considered.
Under Target range, you can search for bands with
specific sizes, either entered as Normalized run length
(%) or as Metric values. A target range should always be
entered with the lower value first. Note that, when only
one of both limits is entered, the program will consider
all bands above or below that limit, depending on which
limit was entered. For example, when only the first
(lower) limit is entered and the upper limit is left blank,
all bands above the specified size will be accepted. When
both fields are left blank, no size range will be looked
for, i.e. all bands will be considered.
Under Number of bands present, you can enter a
minimum and a maximum number of bands the
patterns should contain. Note that, when only one of
both limits is entered, the program will consider all
patterns with band numbers above or below that limit,
depending on which limit was entered. For example,
when only the first (lower) limit is entered and the
upper limit is left blank, all patterns having at least the
specified number of bands will be accepted. At least one
of both limits must be entered.
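The way the three settings of the Fingerprint bands component work together, including the rule for limits that are left blank, can be sketched as follows. This is a minimal sketch under stated assumptions: a band is represented as a hypothetical (position, intensity) pair, a blank limit is represented by None, and limits are treated as inclusive, which the text does not specify.

```python
from typing import Optional, Sequence, Tuple

Band = Tuple[float, float]                      # (position, intensity), hypothetical
Limits = Tuple[Optional[float], Optional[float]]  # (lower, upper); None = left blank

def in_range(value: float, lower: Optional[float], upper: Optional[float]) -> bool:
    """True if value lies within the limits; a blank limit is not checked."""
    if lower is not None and value < lower:
        return False
    if upper is not None and value > upper:
        return False
    return True

def bands_component(bands: Sequence[Band],
                    intensity: Limits = (None, None),
                    target: Limits = (None, None),
                    count: Limits = (1, None)) -> bool:
    """Does a pattern satisfy the Fingerprint bands query component?"""
    if count == (None, None):
        # The text requires at least one limit for the number of bands.
        raise ValueError("enter a minimum and/or maximum number of bands")
    # Count the bands that pass both the intensity filter and the target range.
    hits = sum(1 for position, strength in bands
               if in_range(strength, *intensity) and in_range(position, *target))
    return in_range(hits, *count)

pattern = [(71.5, 120.0), (80.0, 30.0)]
print(bands_component(pattern, target=(71, 72)))    # at least 1 band in 71-72
print(bands_component(pattern, target=(90, None)))  # at least 1 band above 90
```

The first call corresponds to a component of the form "has at least 1 bands in the range 71.00 – 72.00"; the second uses only a lower limit, so every band above 90 would be accepted.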
•Character value
With the Character value component, you can search for
characters within certain ranges. You should select a
Character experiment type, specify a character or <All>
characters, and enter a maximum and minimum value.
A range should always be specified with the lower value
first. Note that, when only one of the two limits is entered,
the program will consider all characters above or below
that limit, depending on which limit was entered. For
example, when only the first (lower) limit is entered and
the upper limit is left blank, all characters with values
above the specified value will be accepted.
•Subsequence
With the Subsequence component you can perform a
search for a specific subsequence in a Sequence Type
experiment (Figure 9-7.). The Sequence Type experiment
should be chosen, and a subsequence entered. A
mismatch tolerance can be specified with Maximum
number of mismatches allowed. The program can also
search for sequences that have one or more gaps as
compared to the search sequence, with the option Allow
gaps in sequence. Similarly, the program can also find
subsequences that match the search string with one or
more gaps introduced with Allow gaps in search string.
The gaps are counted with the mismatches, and the total
number of mismatches and gaps together is defined by
the parameter Maximum number of mismatches
allowed. Unknown or partially unknown positions can
also be entered according to the IUPAC code, when
Accept IUPAC codes is enabled.
Figure 9-7. The Subsequence search dialog box.
•Attachments
With the Attachments component, one can perform a
search in attachments that are linked to database entries
(see 6.5). With the picklist you can choose the type of
attachments to search in. One of the possibilities is All,
i.e. to search within all attachment types. For all types of
attachments it is possible to search in the Description
field, and for text type attachments, it is also possible to
search within the Text. The Text option does not apply
to the other attachment types.
Figure 9-8. The Attachment search dialog box.
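The mismatch tolerance and the IUPAC option of the Subsequence component described above can be made concrete with a small sketch. It is an illustration only, not the search algorithm used by the program: the function names are hypothetical, only nucleotide codes are covered, and the two gap options are deliberately left out, so every mismatch is a substitution.

```python
# Standard IUPAC nucleotide ambiguity codes (U is treated as T).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def position_matches(query_base: str, seq_base: str, accept_iupac: bool) -> bool:
    """Compare one query position with one sequence position."""
    q, s = query_base.upper(), seq_base.upper()
    if not accept_iupac:
        return q == s
    # With IUPAC codes, two positions match if they share a possible base.
    return bool(set(IUPAC.get(q, q)) & set(IUPAC.get(s, s)))

def find_subsequence(sequence: str, query: str,
                     max_mismatches: int = 0,
                     accept_iupac: bool = False) -> list:
    """Return all start positions where query occurs with few enough mismatches."""
    if not query:
        return []
    hits = []
    for start in range(len(sequence) - len(query) + 1):
        mismatches = 0
        for offset, q in enumerate(query):
            if not position_matches(q, sequence[start + offset], accept_iupac):
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many mismatches at this start position
        else:
            hits.append(start)
    return hits

sequence = "ACGTTGGTGCATTGAC"  # made-up sequence for illustration
print(find_subsequence(sequence, "TGGTGCATTG"))
print(find_subsequence(sequence, "TGGTGCATAG", max_mismatches=1))
print(find_subsequence(sequence, "TGGTGCATNG", accept_iupac=True))
```

Positions are counted from 0. The second call finds the site only because one mismatch is allowed; the third finds it because N stands for any base.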
•Logical operators
NOT, operates on one component. When
a component is combined with NOT, the condition of
the component will be inverted.
AND, combines two or more components.
All conditions of the combined components should be
fulfilled at the same time for a sequence to be selected.
OR, combines two or more components.
The condition implied by at least one of the combined
components should be fulfilled for a sequence to be
selected.
XOR, combines two or more components.
Exactly one condition from the combined components
should be fulfilled for a sequence to be selected.
NOTE: the buttons for the logical operators contain a
helpful Venn diagram icon that clearly explains the
function of the operator.
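The meaning of the four logical operators can also be written down as a short sketch, in which a query component is simply a function that answers yes or no for one entry. The entry layout (a dictionary with the fields key, genus and has_fame) and all names are hypothetical and serve only as an illustration.

```python
from typing import Callable, Dict, List

Entry = Dict[str, object]            # hypothetical stand-in for a database entry
Component = Callable[[Entry], bool]  # a query component: entry -> True/False

def NOT(component: Component) -> Component:
    """Invert the condition of one component."""
    return lambda entry: not component(entry)

def AND(*components: Component) -> Component:
    """All combined conditions must be fulfilled at the same time."""
    return lambda entry: all(c(entry) for c in components)

def OR(*components: Component) -> Component:
    """At least one of the combined conditions must be fulfilled."""
    return lambda entry: any(c(entry) for c in components)

def XOR(*components: Component) -> Component:
    """Exactly one of the combined conditions must be fulfilled."""
    return lambda entry: sum(1 for c in components if c(entry)) == 1

def select(entries: List[Entry], query: Component) -> List[str]:
    """Return the keys of all entries that satisfy the query."""
    return [str(e["key"]) for e in entries if query(e)]

entries = [
    {"key": "E1", "genus": "Ambiorix", "has_fame": True},
    {"key": "E2", "genus": "Ambiorix", "has_fame": False},
    {"key": "E3", "genus": "Vercingetorix", "has_fame": True},
    {"key": "E4", "genus": "STANDARD", "has_fame": False},
]
is_ambiorix = lambda e: e["genus"] == "Ambiorix"
has_fame = lambda e: bool(e["has_fame"])

print(select(entries, AND(is_ambiorix, has_fame)))
print(select(entries, OR(is_ambiorix, has_fame)))
print(select(entries, XOR(is_ambiorix, has_fame)))
print(select(entries, NOT(is_ambiorix)))
```

Because the result of an operator is itself a component, it can be combined again with other components or operators, which is how more complex queries are built.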
To create a search component, you can select to search in
the database fields, fingerprint bands, characters, and
sequences. As an example, we will select all entries from
the genus Ambiorix that have no RFLP1 bands in the
range 71-72 basepairs, and of which the 16S ribosomal
RNA sequence contains the subsequence
“TGGTGCATTG”.
9.4.3 Press <Database field>. In the box that appears
you can choose the genus field or leave <Any field>.
9.4.4 Enter “ambiorix” and press <OK>.
A query component now appears in the right panel,
stating “Database field: Search ‘ambiorix’ in field
‘Genus’”.
9.4.5 Press <Fingerprint bands>. The Fingerprint bands
presence component dialog box appears (Figure 9-6.).
9.4.6 Select RFLP1 from the Fingerprint experiment
pull-down list.
9.4.7 Under Target range, enter 71 - 72, and specify
Metric values.
9.4.8 Press <OK>. A second component appears in the
query window, saying “Fingerprint bands: ‘RFLP1’ has
at least 1 bands in the range 71.00 – 72.00”.
9.4.9 Select this Fingerprint bands component by
clicking on it (purple highlighted when selected), and
press the <NOT> button.
9.4.10 Select the first component by clicking on it.
9.4.11 Hold down the CTRL key and click on the NOT
box resulting from the second component to select it
together with the first one.
9.4.12 Press the <AND> button to combine
the created components with AND.
9.4.13 Press <Subsequence>. This box allows you to type
or paste a sequence that will be searched for.
9.4.14 Type TGGTGCATTG in the input field, and
select 16S rDNA. Press <OK>.
A third query component appears in the right panel,
stating “Subsequence: Search ‘TGGTGCATTG’ in the
sequence ‘16S rDNA’”. We will now combine the
resulting AND box from the first two components with
this last component, using an AND operator, to restrict
the selection to those sequences that fulfill the
‘Ambiorix’ and RFLP1 conditions AND contain the
specified subsequence.
9.4.15 Select the AND box by clicking on it.
9.4.16 Hold down the CTRL key and click on the
Subsequence component to select it together with the
AND box.
As both components are now selected, we can combine
them with a logical operator.
9.4.17 Press the <AND> button to combine the created
components with AND.
This is now shown graphically in the right panel (Figure
9-9.).
Note that:
•Individual components can be re-edited at any time
by double-clicking on the component or by selecting
them and pressing <Edit>.
•Selected components can be deleted with <Delete>.
•The result of a logical operator as obtained in this
example can be combined again with other
components (or logical operators) to construct more
complex queries.
•Queries can be saved with <Save> or <Save as>.
•Saved queries can be loaded using the pull-down
listbox under Stored queries.
•Existing queries can be removed by loading them
first and then pressing <Delete>.
9.4.18 To view the selected entries, press <Add to list>.
The entries that were found are marked with a blue
arrow to their left.
NOTES:
(1) In order to speed up the search function in case of
large databases, it is important to know that searching
through the database fields is extremely quick, while
searching through sequences or large character sets can
be much slower. Using the AND operator, it is always
recommended to define the quickest search component
as the first, since the searching algorithm will first
screen this first component and subsequently screen for
the second component on the subset that match the first
component.
(2) When combined with a logical operator, query
components contain a small node at the place where
they are connected to the logical operator box (AND,
OR, XOR). By dragging this node up or down, you can
switch the order of the query components, thus making
it possible to move the most efficient component to the
top in AND combinations, as explained above.
(3) Multiple components/operators can also be selected
together by dragging the mouse over the boxes in the
right panel.
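The advice in note (1), to place the quickest component first in an AND combination, can be illustrated with a sketch in which each component only screens the entries that passed the previous one. The entry data are made up, and the counter merely stands in for the cost of a slow sequence search; nothing here reflects the actual implementation.

```python
from typing import Callable, Dict, List

Entry = Dict[str, str]  # hypothetical entry: key, genus and a sequence string
calls = {"subsequence": 0}

def genus_is_ambiorix(entry: Entry) -> bool:
    """Quick component: compares a database field."""
    return entry["genus"].lower() == "ambiorix"

def contains_subsequence(entry: Entry) -> bool:
    """Slow component: scans the sequence; each call is counted."""
    calls["subsequence"] += 1
    return "TGGTGCATTG" in entry["sequence"]

def and_query(entries: List[Entry],
              components: List[Callable[[Entry], bool]]) -> List[str]:
    """Screen with the first component, then the next on the remaining subset."""
    selected = list(entries)
    for component in components:
        selected = [e for e in selected if component(e)]
    return [e["key"] for e in selected]

def slow_calls(components: List[Callable[[Entry], bool]]) -> int:
    """Run the query and report how often the slow component was evaluated."""
    calls["subsequence"] = 0
    and_query(entries, components)
    return calls["subsequence"]

entries = [
    {"key": "E1", "genus": "Ambiorix", "sequence": "AATGGTGCATTGCC"},
    {"key": "E2", "genus": "Ambiorix", "sequence": "AACCGGTTAACCGG"},
    {"key": "E3", "genus": "Vercingetorix", "sequence": "TTTGGTGCATTGAA"},
    {"key": "E4", "genus": "Vercingetorix", "sequence": "GGGGCCCCAAAATT"},
]

print(slow_calls([genus_is_ambiorix, contains_subsequence]))  # quick one first
print(slow_calls([contains_subsequence, genus_is_ambiorix]))  # slow one first
```

Both orders select the same entries, but with the quick component first the slow search runs only on the entries that already passed the genus test.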
9.5 Subsets
A selection of entries from the database can be saved as
a subset. Subsets can include a certain target group in a
database, for example, a single species in a database
containing many species, or any selection of relevant
strains for a certain purpose. Selecting the defined
subset displays a view of the database containing only
the entries of the subset. Search functions, copy and
select functions will be restricted to the displayed
subset, and new comparisons, when created, will only
contain the selected entries from the subset.
9.5.1 In database DemoBase, make sure no entries are
selected using Edit > Clear selection list (F4 key) or
the corresponding button.
9.5.2 Select Edit > Search entries or press the
corresponding button.
9.5.3 In the Entry search dialog box, enter “Ambiorix”
under Genus, and press <OK>.
All Ambiorix entries are now selected. When we create a
new subset, the selected entries will be automatically
placed in the subset.
9.5.4 Select Subsets > Create new or press the
corresponding button.
Alternatively, you can also click on the subset selector
button, which will drop down a list of currently defined
subsets (initially empty), and an option <Create new
subset>. Selecting this option has the same effect.
Figure 9-9. Combined query constructed in the Advanced query tool (see text for explanation).
9.5.5 Enter a name for the subset, e.g. the name of the
selected genus “Ambiorix”.
The created subset is now displayed, and the name of
the current subset is displayed in the subset selector
button.
9.5.6 Selecting the complete database or another subset,
when available, can be done by pressing the subset
selector button and selecting Complete database or the
other subset in the list.
Once a subset exists, it remains possible to add or
remove entries, using the copy and paste functions. The
following example will illustrate this.
9.5.7 Select subset Ambiorix from the subset selector
button.
9.5.8 We want to remove all the “sp.” entries from this
subset. Clear any selected entries and select the 3 “sp.”
entries by manual selection or using the search function.
9.5.9 Press the corresponding button or select Edit > Cut
selection to cut the selected entries from the current subset.
9.5.10 We can place them in a new subset by pressing
the button to create a new subset, entering a name, e.g.
“Unknowns”, and, in this new subset, pressing the
corresponding button or selecting Edit > Paste
selection.
9.5.11 If you want to copy entries from a subset to
another subset, without removing them from that first
subset, there is also a command Edit > Copy selection
with a corresponding button.
NOTES:
(1) A selection that is copied or cut from a subset or
copied from the database is placed on the Windows
clipboard as the keys of the selected entries, separated by
line breaks. You can paste them in other software when
desired.
(2) The commands Cut selection, Paste selection and
Delete selection are not available in the Complete
database.
When you want to remove entries from a subset without
overwriting the contents of the clipboard, you can use
the command Edit > Delete selection.
9.5.12 The current subset can be renamed using Edit >
Rename current.
9.5.13 The current subset can be deleted using Edit >
Delete current or the corresponding button.
9.6 Pairwise comparison between two entries
From within any window where you can select entries,
you can display a detailed comparison between two
entries. This pairwise comparison includes all the
images of the experiment types as well as the similarities
obtained using the specified coefficients.
9.6.1 In the Main window, clear any selection with F4.
9.6.2 Select any two entries you want to compare.
9.6.3 In the main menu, select Comparison > Compare
two entries or use the CTRL+2 shortcut (numerical 2).
The CTRL+2 shortcut works from within any window.
The Pairwise comparison window appears (Figure 9-10.).
9.6.4 Click on an experiment type in the left panel to
display the images and the similarity in the right panel.
In case of sequences, the aligned sequences are shown.
In case of a Matrix Type, the similarity value is shown, if
available.
9.7 The Comparison window
The Comparison window is the basis for all grouping and
comparison functions. When a new comparison is
created, it will automatically use the list of selected
entries in the database. With all entries except the
standards selected, we will create a new comparison.
9.7.1 Select Comparison > Create new comparison
(ALT+C) or press the corresponding button. A Comparison
window is created, with the selected database entries
(Figure 9-11.).
The Comparison window is divided in four main panels:
the Dendrogram panel, which shows the dendrogram if
calculated, the Image panel, showing the images of the
experiments, the Entry names panel, which shows the
database fields in the same layout as in the database (see
6.6), and the matrix panel, which shows the similarity
values. Initially, the dendrogram panel, the image panel
and the matrix panel are empty.
9.7.2 You can drag the separator lines between the four
panels to the left or to the right, in order to divide the
space among the panels optimally.
The bottom of the window is the Experiment type selection
bar, from where you can select one of the available
experiment types, to show an image, to calculate a
Figure 9-11. The Comparison window, initial view (labeled parts: Toolbar, Dendrogram panel, Image panel, Entry names panel, Matrix panel, Experiment type selection bar).
dendrogram, or to show a matrix. Each experiment type
on the selection bar contains two objects: a button, and
the experiment type name to the right of the button. The
button shows a different icon for electrophoresis
(fingerprint) experiment types, character experiment
types, sequence experiment types, and Matrix Types.
This button is used to show the image of the experiment
type.
Figure 9-10. Pairwise comparison window.
9.7.3 Press the
button of RFLP1; the pattern images
are shown for RFLP1. When the image of an experiment
type is shown, the button shows like
.
NOTE: It is also possible to display an experiment type
image from the menu by selecting Layout > Display
experiments, and then choosing the experiment type
from the list that appears.
NOTES:
(1) To display more than one image at a time, we
recommend to maximize the Comparison window, and
to use maximal space for the image panel by minimizing
the dendrogram and matrix panels (see 9.7.2).
(2) When preferred, the image of patterns can be shown
with a space between the gelstrips. To do so, open the
Experiment type window in the program's Main
window (under Fingerprint types) and select
Layout > Show space between gelstrips.
9.7.4 Press the button of RFLP2; the pattern images
of RFLP2 are shown to the right of those of RFLP1.
9.7.5 If insufficient space is available to show both
images at the same time, you can scroll through the
image panel, or use the zoom functions
(Layout > Zoom in and Layout > Zoom out, or the
corresponding buttons).
9.7.6 In the caption of the image panels, you can drag the
separator line between the images to the left or to the
right, to reserve more or less horizontal space for a
particular experiment image. Contrary to the zoom
function, the original aspect ratio (proportion height to
width) of the image will not be maintained by this
action.
9.7.7 To select an experiment type in the Comparison
window, you can either click on the experiment type
name in the selection bar (bottom), or on the image
itself.
9.7.8 Press the corresponding button or Layout > Show
metric scale to display the molecular weight scale of the
selected Fingerprint Type.
9.7.9 Press the corresponding button or Layout > Show
bands to show or hide the band positions in the image
panel. One can show band positions without showing the
image.
9.7.10 When bands are shown on the image, they can be
exported as a tab-delimited file with File > Export bands.
The export file, popped up as RESULT.TXT in Notepad,
contains the key of the entry, and a list of band positions
as relative run lengths (in percent) and molecular weight
(if determined).
9.7.11 Press the corresponding button or Layout > Show
densitometric curves to show or hide small densitometric
curves in the image panel. One can also show curves
without showing the image.
9.7.12 When densitometric curves are shown on the
image, they can be exported as a tab-delimited file with
File > Export densitometric curves. The export file,
popped up as RESULT.TXT in Notepad, contains the list
of entry keys separated by tabs, and a list of densitometric
curves, of which the curves are listed as columns,
separated by tabs.
In case only densitometric curves are available, it can be
useful to display the curves as pseudo gelstrips
(reconstructed images). This option is selected in the
Fingerprint type window as follows:
9.7.13 In the Main window, double-click on a
Fingerprint Type (e.g. RFLP1) in the Experiments panel.
9.7.14 In the Fingerprint type window, select Layout >
Show curves as images. If densitometric curves are now
shown in a comparison (9.7.11), they will be displayed
as pseudo gelstrips.
In case densitometric curves have different intensities,
the densitometric curves can be rescaled so that each
curve fills the full available intensity range specified for
the Fingerprint Type. This can be achieved as follows.
9.7.15 In the Fingerprint type window, select Layout >
Rescale curves. If densitometric curves are now shown
in a comparison (9.7.11), they will all be displayed with
equal intensity.
When an experiment type is selected, both the image
caption and its name in the selection bar are highlighted.
Functions like Clustering, PCA, and Bandmatching as
well as some Layout functions, apply to the selected
experiment type.
Selections of entries made in the database are also
shown in the comparison. The entries in the current
comparison are all marked with a blue selection arrow,
since they were all selected in the database. You can
manually select and unselect entries in the Entry names
panel (see Figure 9-11.), using the CTRL and SHIFT keys
as described in 9.2.1 and 9.2.2.
9.7.16 First unselect all entries by pressing the F4 key
(9.2.6).
9.7.17 Select some entries from the comparison (see 9.2.1
and 9.2.2).
9.7.18 With Edit > Cut selection or the corresponding
button, the selected entries are removed from the
comparison, and are copied to the clipboard.
9.7.19 With Edit > Paste selection or the corresponding
button, the same entries are placed back in the
comparison. If no dendrogram is present, they are placed
at the position of the selection bar.
This tool can be used to rearrange entries in the
Comparison window. Some other convenient functions
are available for rearrangement of entries in a
comparison, as explained below.
9.7.20 Select Edit > Arrange entries by database field to
sort the entries according to the highlighted database
field.
When two or more entries have identical strings in a
field used to rearrange the order, the existing order of
the entries is preserved. As such it is possible to
categorize entries according to fields that contain
information of different hierarchical rank, for example
genus and species. In this case, first arrange the entries
based upon the field with the lowest hierarchical rank,
i.e. species, and then upon the higher rank, i.e. genus.
9.7.21 When a field contains numerical values, which
you want to sort according to increasing number, use
Edit > Arrange entries by database field (numerical).
In case numbers are combined numerical and
alphabetical, for example entry numbers [213, 126c,
126a, 126b], you can first arrange the entries
alphabetically (Edit > Arrange entries by database field),
and then numerically using Edit > Arrange entries by
database field (numerical). The result will be [126a,
126b, 126c, 213].
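Both arranging recipes rely on the fact that the existing order of entries is preserved when their sort values are identical (a so-called stable sort). The sketch below reproduces the two cases with Python's built-in sort, which is also stable. How the program itself extracts the numerical part of a value such as 126c is not documented here; taking the leading digits is an assumption, and the genus/species pairs are only sample data.

```python
import re

def leading_number(value: str) -> int:
    """Numerical part of a field value: the leading digits (assumption)."""
    match = re.match(r"\d+", value)
    return int(match.group()) if match else 0

# Case 1: hierarchical fields. Arrange by the lower rank (species) first,
# then by the higher rank (genus); ties keep the species order.
entries = [
    ("Vercingetorix", "palustris"),
    ("Ambiorix", "sylvestris"),
    ("Vercingetorix", "aquaticus"),
    ("Ambiorix", "aberrans"),
]
by_species = sorted(entries, key=lambda e: e[1])
by_genus_then_species = sorted(by_species, key=lambda e: e[0])
print(by_genus_then_species)

# Case 2: mixed numerical and alphabetical entry numbers. Arrange
# alphabetically first, then numerically; equal numbers keep a, b, c order.
numbers = ["213", "126c", "126a", "126b"]
alphabetical = sorted(numbers)
arranged = sorted(alphabetical, key=leading_number)
print(arranged)
```

Reversing the order of the two steps in either case would lose the inner ordering, because the last arrangement always determines the primary order.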
9.7.22 A group of selected entries (blue arrows) can be
placed at the position of the cursor (the entry you last
clicked on) with Edit > Bring selected entries to cursor.
9.7.23 A group of selected entries (blue arrows) can be
moved to the top of the comparison with Edit > Bring
selected entries to top.
9.7.24 An individual entry can be moved up and down
by left-clicking on it to select it, and then using Edit >
Move entry up or Edit > Move entry down (or the
corresponding up/down buttons).
9.7.25 When using the up/down buttons, you can move
an entry to the top or the bottom at once by holding the
CTRL key.
9.7.26 As explained in 6.6.8, it is possible to freeze one or
more information fields in the Main window using Edit >
Freeze left panel, so that they always remain visible left
from the scrollable area. The same fields will be frozen
in the Comparison window. This feature can be combined
with the possibility to change the order of information
fields (see 5.4.2), which makes it possible to freeze any
subset of fields.
9.7.27 The option to show a grid in the Information fields
and the Experiments presence panels in the Main window
(6.6.9) extends to the Comparison window as well. The
grid can be shown or hidden using the Edit > Show grid
command in the Main window.
9.7.28 Database fields can be abbreviated using Edit >
Set database field length (see also 6.6.4).
9.7.29 To save the comparison, select File > Save as or
press the corresponding button, and enter a name, e.g. All.
9.7.30 Close the comparison with File > Exit.
Comparison “All” is now listed in the Comparisons panel
of the Main window.
9.7.31 You can open the existing comparison by double
clicking on All in the comparisons panel.
Entries can be added to an existing comparison at any
time. The entries first need to be copied to the clipboard
from the Main window or from another comparison.
9.7.32 To copy entries to the clipboard, use the Edit >
Copy selection command or the corresponding button.
9.7.33 To cut entries from one comparison into another,
use Edit > Cut selection or the corresponding button in
the one comparison and Edit > Paste selection or the
corresponding button in the other comparison.
97
10. Band matching and polymorphism analysis
Band matching is a comparison function which applies
only to Fingerprint Types. It can be executed on any
selection of entries from the database. In a first step,
BioNumerics divides all the bands found among the
selected patterns into classes of common bands. As such,
every band of a given pattern belongs to a class, and
conversely, every band class is represented by a band on
one or more patterns. The result is shown in Figure 10-1.
123
45
6 7
8
Pattern 1
Pattern 2
Pattern 3
1
+
+
+
2
+
+
3
+
-
4
+
-
5
+
+
6
+
+
+
7
+
+
8
+
+
+
Figure 10-2. Binary presence/absence table of
banding patterns.
Instead of using binary (+/−) data, the
be generated using band intensities
curves (band heights or surfaces)
dimensional
pattern
contours
concentrations).
same tables can
obtained from
or from two(volumes
or
The use of band matching tables is obvious: it provides a
binary or numerical character table for Fingerprint Type
patterns, which allows a number of statistical techniques
to be applied, including Minimum Spanning Trees
(chapter 18.), Maximum Parsimony trees (chapter 16.),
dimensioning techniques such as Principal Components
Analysis and related techniques (chapter 19.), and
bootstrap analysis on dendrograms (11.8).
Figure 10-1. Comparative quantification: bands are
assigned to classes.
Clearly, the number of band classes distinguished will
depend on the optimization and the position tolerance that
is allowed between bands considered as matching. For
example, when a larger position tolerance is specified,
more bands will be grouped in the same class than when
a small position tolerance is chosen. In Figure 10-1.,
taking a larger position tolerance would have resulted in
the merging of band classes 2 and 3, whereas a smaller
position tolerance would have resulted in two separate
classes for band class 8.
For each pattern, a particular band class can have two
states: present or absent. This is the basis for
polymorphism analysis, a tool which allows comparative
binary (+/−) tables to be generated, displaying
polymorphic bands between the selected patterns. These
tables, created as text or tab-delimited files, are ready
for export to other specialized software for statistics,
genetic mapping or other further analysis. The binary
table for the above example (Figure 10-1.) is shown in
Figure 10-2.
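The grouping of bands into classes and the resulting
presence/absence table can be illustrated with a minimal
sketch. It is not the BioNumerics algorithm, which the
manual does not specify: the function names, the running
class mean and the use of band positions expressed as a
percentage of the pattern length are all assumptions made
for illustration.

```python
from typing import Dict, List


def find_band_classes(patterns: Dict[str, List[float]],
                      tolerance: float) -> List[float]:
    """Group all band positions into classes (hypothetical helper).

    A band joins the current class when it lies within `tolerance`
    of the running mean of that class; otherwise it opens a new class.
    Returns the mean position of every class.
    """
    positions = sorted(p for bands in patterns.values() for p in bands)
    classes: List[List[float]] = []
    for pos in positions:
        if classes and abs(pos - sum(classes[-1]) / len(classes[-1])) <= tolerance:
            classes[-1].append(pos)
        else:
            classes.append([pos])
    return [sum(c) / len(c) for c in classes]


def presence_table(patterns: Dict[str, List[float]],
                   class_positions: List[float],
                   tolerance: float) -> Dict[str, List[str]]:
    """Score every band class as present (+) or absent (-) per pattern."""
    return {
        name: ["+" if any(abs(b - c) <= tolerance for b in bands) else "-"
               for c in class_positions]
        for name, bands in patterns.items()
    }


# Invented example data: band positions in % of the pattern length.
patterns = {
    "Pattern 1": [10.0, 30.2, 50.0],
    "Pattern 2": [10.3, 30.0],
    "Pattern 3": [9.8, 41.0, 50.4],
}
classes = find_band_classes(patterns, tolerance=1.0)
table = presence_table(patterns, classes, tolerance=1.0)
```

With a tolerance of 1.0 the example yields four band
classes; raising the tolerance to 12.0 merges neighboring
classes and leaves three, which mirrors the effect of the
position tolerance described above.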
To visualize a band matching table as a character matrix
(binary or quantitative), it is necessary that a Composite
Data Set is associated with the Fingerprint Type.
Therefore, the use of Composite Data Sets is described
here in association with band matching tables. However,
it is possible to apply the techniques mentioned in the
previous paragraph directly on the Fingerprint Type
without having a Composite Data Set associated to it.
10.1 Composite Data Sets
Before you can export a (binary) presence/absence table
as shown in Figure 10-2., you will need to define a
Composite Data Set, containing the Fingerprint Type as
input. A Composite Data Set is a character table that
contains all the characters of one or more experiment
types. Such a character table is necessary to convert the
band classes and represent them as presence/absence
tables.
10.1.1 In the Main window, with the database DemoBase
loaded, select Experiments > Create new composite data
set, or press the corresponding toolbar button.
10.1.2 Enter a name, for example RFLP1-table and press
<OK>.
The Composite data set window is shown for “RFLP1-table”
(see Figure 10-3.). All experiment types defined
for the database are listed, and when they are marked
with a red cross, they are not selected in the Composite
Data Set.
Figure 10-3. Composite data set window.
10.1.3 Since we want to create a character table for the
Fingerprint Type RFLP1, we select RFLP1 and
Experiment > Use in composite data set.
When the experiment type is selected, it is marked with
a green check mark. The scroll bar that appears in the
Weights column as well as the other menu commands
under Experiment apply to cluster analysis and will not
be discussed here.
In order to create a band presence/absence table using
multiple Fingerprint Types (for example, obtained by
using different primers or restriction enzymes), one can
simply include the other Fingerprint Types in the
Composite Data Set.
10.1.4 Close the Composite data set window with File >
Exit. The new Composite Data Set is shown in the
experiment types panel of the Main window.
10.2 Creating a band matching
10.2.1 In the Main window, with the database DemoBase
loaded, select a number of entries, for example all but
the “STANDARD” entries. If the comparison All is
present, you can open this one (see previous chapter).
10.2.2 Select RFLP1 in the experiment type selection bar
(bottom), and press the corresponding button or select
Layout > Show image.
10.2.3 Choose Bandmatching > Perform band matching.
The Position tolerance settings dialog box for the
Fingerprint Type is popped up (Figure 10-4.). This is the
same dialog box which can be called from the Experiment
type window settings (7.8.16).
Figure 10-4. Position tolerance settings dialog box of
a Fingerprint Type.
The Position tolerance is the maximal shift (in
percentage of the pattern length) between two bands
allowed to consider the bands as matching. With Change
towards end of fingerprint, you can specify a gradual
increase or decrease in tolerance.
The Optimization is a shift that you allow between any
two patterns and within which the program will look for
the best possible matching. To understand the utility of
optimization in addition to tolerance, see the example in
Figure 10-5.
Figure 10-5. Effect of position tolerance (1) and
optimization (2) on the matching between shifted
patterns.
With minimum height and minimum surface, you can
exclude weak or irrelevant bands.
The Uncertain bands option allows you to either include
uncertain bands or ignore them (see 7.6.1). When Ignore
is chosen, uncertain bands are ignored. This means that
in composing a band matching table, the software will
omit the uncertain bands, considering them as
characters that are unknown. When Include is chosen,
uncertain bands are treated in the same way as certain
bands, which means that uncertain bands will
contribute to the band classes of a band matching table
in the same way as certain bands.
10.2.4 Enter a position tolerance of 1%, an optimization
of 1%, a change of 0%, and a minimum height and
minimum surface of 0%, and press <OK>.
NOTE: Depending on whether or not there are selected
entries in the comparison, the program will ask
“Search only inside the selection?”. If you want to
include all patterns for band class searching, answer
<No> to this question.
The program has now defined the band classes and has
associated each band with a class. The band classes are
shown as blue lines (Figure 10-6.) and the bands are
linked to a class in red. On top of the image are the band
class selectors. If a band class is selected, this selector is
blue.
Figure 10-6. Band matching analysis.
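The interplay of position tolerance and optimization
described for the Position tolerance settings can be
sketched as follows. This is an illustration under
assumptions, not the program's implementation: the
function names, the greedy one-to-one matching and the
fixed step size for trying shifts are all hypothetical.

```python
from typing import List, Tuple


def count_matches(a: List[float], b: List[float],
                  tolerance: float, shift: float = 0.0) -> int:
    """Count bands of `a` that find a partner in `b` (moved by `shift`)
    within `tolerance`; every band of `b` is used at most once."""
    remaining = sorted(x + shift for x in b)
    matches = 0
    for band in sorted(a):
        for i, other in enumerate(remaining):
            if abs(band - other) <= tolerance:
                matches += 1
                del remaining[i]
                break
    return matches


def best_shift(a: List[float], b: List[float], tolerance: float,
               optimization: float, step: float = 0.1) -> Tuple[float, int]:
    """Try global shifts of pattern `b` within +/- `optimization` and keep
    the smallest shift that gives the highest number of matching bands."""
    n = int(round(optimization / step))
    candidates = sorted((k * step for k in range(-n, n + 1)), key=abs)
    best = (0.0, count_matches(a, b, tolerance))
    for s in candidates:
        m = count_matches(a, b, tolerance, s)
        if m > best[1]:
            best = (s, m)
    return best


# Invented example: pattern b is pattern a displaced by 1.45 %.
a = [10.0, 30.0, 50.0, 70.0]
b = [11.45, 31.45, 51.45, 71.45]
unshifted = count_matches(a, b, tolerance=1.0)
shift, matched = best_shift(a, b, tolerance=1.0, optimization=1.0)
```

In this example no band matches with a tolerance of 1%
alone, whereas allowing an optimization of 1% lets all four
bands match after a global shift of pattern b.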
10.2.5 Zoom in on the image as necessary using the
zoom buttons (Layout > Zoom in and
Layout > Zoom out).
10.2.6 An interesting option for long patterns with
numerous small bands is Layout > Stretch (X dir).
This causes the image to be enlarged in the horizontal
direction only, so that sharp bands become better
visible, without losing the overview of a large number
of patterns.
10.2.7 Restore the view with Layout > Compress (X dir).
10.2.8 Press the corresponding button or select Layout >
Show metric scale to display the molecular weight scale
of the Fingerprint Type.
10.3 Manual editing of a band matching
The program does not always assign the bands to the
correct class. Therefore, you can manually correct the
assignments.
10.3.1 For the manual band matching editing tools, a
multilevel undo and redo function is available. The
undo function can be accessed with Bandmatching >
Undo or CTRL+Z or the corresponding toolbar button.
The redo function is accessible through Bandmatching >
Redo or CTRL+Y or the corresponding toolbar button.
In Figure 10-7., the band marked with the arrow is
assigned to the left of two close classes, whereas it
should be assigned to the right class.
Figure 10-7. Detail of band class assignments.
NOTE: you can easily see which bands belong to a
given band class by double clicking on the vertical blue
line that represents the class: all bands that belong to
the class are selected with a green flag.
In order to reassign a band to another class, proceed as
follows:
10.3.2 Select the band class to which the band ought to
be assigned. The selector becomes blue.
10.3.3 Right-click on the band, and select Band classes >
Assign band to class from the floating menu.
Alternatively, simply hold down the CTRL key and
left-click on the band. You can also click on the band and
then press SHIFT+ENTER on the keyboard.
A whole band class is deleted as follows:
10.3.4 Click on a band belonging to the band class.
10.3.5 Right-click on the band, and select Band classes >
Remove band class from the floating menu. You can also
press SHIFT+DEL on the keyboard to remove the
selected band class.
If different bands are incorrectly assigned to the same
class, you can create a second class as follows (Figure
10-8.):
10.3.6 Select a band which should belong to a new class
(left-click) (see Figure 10-8.).
10.3.7 Right-click on the band, and select Band classes >
Add new band class from the floating menu.
10.3.8 Right-click again on the band, and select Band
classes > Auto assign bands to class from the floating
menu.
All bands that are closer to the new band class are
automatically reassigned to that new class. In order to
reassign bands to the other class, follow the procedure
explained in 10.3.2 to 10.3.3.
Figure 10-8. Splitting up a band class into two band
classes.
If bands are incorrectly assigned to different classes, you
can merge the classes as follows (Figure 10-8.):
10.3.9 Choose a band which occurs roughly in the middle
of the two classes.
10.3.10 Right-click on the band, and select Band classes
> Add new band class from the floating menu.
10.3.11 Choose a band which belongs to the left class.
10.3.12 Right-click on the band, and select Band classes
> Remove band class from the floating menu, or press
SHIFT+DEL.
10.3.13 Choose a band which belongs to the right class
(left-click).
10.3.14 Right-click on the band, and select Band classes
> Remove band class from the floating menu, or press
SHIFT+DEL.
10.3.15 Right-click again on the band, and select Band
classes > Auto assign bands to class from the floating
menu.
10.3.16 Select the new band class to which all the bands
should belong (left-click). The selector becomes blue.
10.3.17 Right-click, and select Band classes > Auto
assign bands to class from the floating menu.
If you do not wish to use a single band in a band
matching analysis, you can undo its assignment as
follows:
10.3.18 Right-click on the band and select Band classes >
Remove band from class or press DEL.
In defining a band class, the program automatically
calculates the average position of all bands belonging to
that class to define the position of the band class. After
reassigning bands, removing and adding bands etc. the
band class position may not be the center anymore. You
can correct the position of the band class:
10.3.19 Select the band class (left-click) and call the
floating menu (right mouse button) to select Band
classes > Center class position.
NOTE: these commands are also accessible from the
main menu, but they are much easier to use from the
floating menu.
10.3.20 If all assignments are corrected, you can save the
band matching with File > Save or the corresponding
button.
10.3.21 A band matching report can be exported as a
tab-delimited table using Bandmatching > Export
bandmatching. The resulting table looks as in Figure
10-11.
10.4 Analyzing polymorphic bands only
Some more advanced tools are available for specific
purposes. Suppose that it is only important to show and
compare the bands that are polymorphic in a subset of
the entries in the comparison (for example, two parent
lines in a plant breeding study).
10.4.1 First clear any selected entries, and then select the
entries in which you want to find polymorphic bands.
10.4.2 Select Bandmatching > Polymorphic classes only
(for selection list).
The table of band classes is reduced to contain only the
bands that are polymorphic within the selected entries
(i.e. not present in all of them and not absent in all of
them). This feature can also be used to select
only polymorphic bands from the complete study: in
that case, first select all entries in the comparison, and
then choose Bandmatching > Polymorphic classes only
(for selection list).
10.5 Adding entries to a band matching
Since a band matching analysis and the associated table
can be saved, it should be possible to delete entries from,
or add entries to the band matching at any time.
10.5.1 To delete some entries, simply select some entries
and choose Edit > Cut selection or the corresponding
button.
If entries are added however, it is possible that those
new entries contain bands that are not defined as a band
class yet. If you have done some editing work to the
band classes already, it would be beneficial to preserve
the existing band classes, and simply associate the bands
of the new entries to the existing classes, and introduce
new classes in those cases where the new entries have
bands that do not fit in any of the existing classes. This is
achieved as follows:
10.5.2 Select a few entries in the database. If you have
executed the previous step (10.5.1) there are still some
entries selected and placed on the clipboard.
10.5.3 [In case you would have copied something else in
the meantime, select Edit > Copy selection or the
corresponding button in the Main window.]
10.5.4 With Edit > Paste selection or the corresponding
button in the other comparison, the selected entries are
placed back in the band matching.
10.5.5 Select Bandmatching > Search band classes. The
Position tolerance settings dialog box as in Figure 10-4. is
shown. Leave the settings as they are and press <OK>.
The program now asks “Remove existing band classes?”.
10.5.6 In order to preserve the existing band matching, it
is important to answer <No> to this question.
The program asks one more question: “Search only inside
the selection?”.
10.5.7 If you wish to search only for new band classes
that might occur in the new patterns added to the band
matching, answer <Yes> to this question. Otherwise,
any band classes that you may have deleted manually
from the other entries would appear again.
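The idea of preserving existing band classes while adding
entries can be sketched as follows. This is a hypothetical
illustration, not the program's procedure: the function
name and the nearest-class rule are assumptions.

```python
from typing import Dict, List, Tuple


def add_entry_bands(class_positions: List[float],
                    new_bands: List[float],
                    tolerance: float) -> Tuple[List[float], Dict[float, float]]:
    """Assign each new band to the nearest existing class within
    `tolerance`; bands that fit no class open a new class.
    Existing class positions are never moved or removed."""
    classes = list(class_positions)
    assignment: Dict[float, float] = {}
    for band in sorted(new_bands):
        nearest = min(classes, key=lambda c: abs(c - band), default=None)
        if nearest is not None and abs(nearest - band) <= tolerance:
            assignment[band] = nearest
        else:
            classes.append(band)      # new class at the band's own position
            assignment[band] = band
    return sorted(classes), assignment


# Invented example: two existing classes, three bands from a new entry.
classes, assignment = add_entry_bands([10.0, 30.0], [10.4, 45.0, 45.3], 1.0)
```

Here the band at 10.4 joins the existing class at 10.0,
while the bands at 45.0 and 45.3 share one newly created
class.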
10.6 Band and band class filters
When searching bands in complex patterns, especially
those for which the terminal step is a PCR reaction such
as AFLP patterns, it is sometimes difficult to define
objective criteria as to what is a band and what is not a
band. However, when the user examines a set of
patterns by eye, it often becomes easier to decide
whether a band is valid or not, because the user
automatically compares the band with those on
neighboring patterns, thus obtaining information which
cannot be obtained by inspecting the pattern alone. This
is more or less the way the band filters work in the band
matching application of BioNumerics: in a first step,
band classes are defined over all patterns; then the
relative areas of all bands of a given class are averaged,
and if a band deviates more than a certain percentage
from this average, it is not considered as being a
matching band for this class.
Using this tool, it is possible to define more bands on the
gels than one would usually do, without spending a lot
of time deleting and adding bands manually. Using the
band matching filters, weak bands or artifacts that do
not reflect the expected intensity will be filtered out
automatically, and the assignment of bands is often as
reliable as after hours of band editing work.
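The filtering principle can be sketched as follows,
assuming the relative areas of the bands in one class are
available per pattern. The function names and data layout
are hypothetical and the sketch does not reproduce the
program's internal calculations.

```python
from typing import Dict


def filter_bands(areas: Dict[str, float], min_percent: float) -> Dict[str, bool]:
    """Band filter: keep a band only if its relative area reaches
    `min_percent` of the average relative area in its class."""
    mean = sum(areas.values()) / len(areas)
    threshold = mean * min_percent / 100.0
    return {pattern: area >= threshold for pattern, area in areas.items()}


def keep_class(areas: Dict[str, float], min_area: float) -> bool:
    """Band class filter: keep the class only if at least one band
    exceeds `min_area` (relative area, in % of its pattern)."""
    return any(area > min_area for area in areas.values())


# Invented relative areas (% of total pattern area) for one band class.
band_class = {"entry A": 10.0, "entry B": 12.0, "entry C": 2.0}
kept = filter_bands(band_class, min_percent=80.0)
relevant = keep_class(band_class, min_area=20.0)
```

With an average area of 8.0 and a threshold of 80%, the
weak band of entry C falls below 6.4 and is scored
negative, while the bands of entries A and B are kept.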
10.6.1 In the band matching analysis created in the
current paragraph, select Bandmatching > Band class
filter.
This pops up the Band filtering settings dialog box (Figure
10-9.). It consists of two parts: the upper part “Remove all
bands below…” is to filter individual bands within a
found band class, and the lower part “Remove all band
classes that have no bands exceeding…” is to remove all
band classes that do not contain any significant band.
Figure 10-9. Band filtering settings dialog box for
band matching.
As for band searching, the band class filters
consist of two separately working components: a
profiling component, which is the height of the band or
class, and an area component, which is the relative area
(surface).
10.6.2 Within a band class, you can Remove all bands
below a certain % of the average profiling in the class.
If you entered 80%, this means that, if the height of a
band is lower than 80% of the average profiling
calculated for its class, it will not be matched with that
class, and the band will be recorded negative in the band
matching table. Note that the profiling of a band is an
absolute measure: if a pattern as a whole is rather weak,
many of its bands may be excluded from the band
matching just by this fact. In such cases, we recommend
taking the area as filtering factor.
10.6.3 Within a band class, you also can Remove all
bands below a certain % of the relative surface in the
class.
In this case, if you entered 80%, all bands that have a
relative surface less than 80% of the average surface for
the band class will not be matched with that class, and
the bands will be recorded negative in the band
matching table. Since the surface is relative to the total
surface of a pattern, weak patterns in principle will not
be treated differently compared to dark patterns.
In case of complex patterns such as AFLP, many band
classes consist of just one weak band, spot or artifact and
have no phylogenetic or taxonomic relevance. Such
band classes just fill up the band matching table,
and, being treated as equally important, they obscure
the information provided by the band matching table.
Therefore, BioNumerics offers the possibility to have all
band classes that do not contain at least one clear
relevant band excluded from the band matching table.
10.6.4 With Remove all band classes that have no bands
exceeding a certain % minimum profiling, you can
remove all irrelevant band classes based upon the
minimum height of the bands included.
If you entered 20%, this means that a band class for
which the highest band is lower than 20% of the OD
range of the Fingerprint Type will be considered
irrelevant and will be removed.
NOTE: This is again a non-relative parameter. If by
coincidence a band class is formed by a set of weak
patterns, it may be excluded incorrectly. If this happens
to be a problem, we recommend using the more reliable
feature of % minimum area only.
10.6.5 With Remove all band classes that have no bands
exceeding a certain % minimum area, you can remove all
irrelevant band classes based upon the minimum area of
the bands included. The minimum area is defined as the
area relative to the total area of a pattern.
If you enter 20% here, a band class that contains no band
with an area bigger than 20% of its pattern’s total area
will be removed from the band matching table.
10.7 Creating a band matching table for
polymorphism analysis
After the Composite Data Set RFLP1-table has been
defined (see paragraph 10.1), it is listed in the
experiment type selection bar at the bottom of the
Comparison window. Since we defined RFLP1 as being the
experiment type used in this Composite Data Set, the
band matching values are automatically filled in as
character values.
10.7.1 Press the Show image button of RFLP1-table. The
binary band matching table appears as in Figure 10-10.
10.7.2 In order to reveal the complete information on the
band classes, it may be necessary to drag the separator
line between the table and its header (see Figure 10-10.)
downwards.
Figure 10-10. Binary band matching table; detail.
NOTES:
(1) You can scroll between the image of gel patterns and
the character table using the scroll bar at the bottom of
the image panel. Once the character table is present, it is
still possible to edit the band class assignments on the
patterns. The character table is updated automatically.
(2) Band classes that have been created by the user are
marked with an asterisk (*).
10.7.3 Use Composite > Export character table to export
a space- or tab-delimited text file of the binary band
matching table.
When the program asks “Use tab-delineated fields”,
you should answer <Yes> to produce a tab-delimited
text file. The space-delimited table looks as follows
(Figure 10-11.).
Figure 10-11. Binary band matching character table
exported from BioNumerics (tab-delimited).
In the tab-delimited format, the band classes (header)
and the band presence/absence table are given in
columns separated by tabs. This format is the easiest to
import in spreadsheet or database software packages.
10.7.4 To show the intensity of the bands, choose
Composite > Show quantification (colors).
The color ranges from blue (weakest bands) over cyan,
green, yellow, orange to red (darkest bands). The
intensity is based upon the Comparative quantification
settings for the used Fingerprint Type. This option can be
defined in the Fingerprint type window (see 7.8.16), but it
can also be changed in the Comparison window, as
follows:
10.7.5 Select RFLP1 in the experiment type selection bar.
10.7.6 Make sure the bands are shown for RFLP1. Check
whether Show bands is marked with a check mark in the
Layout menu.
10.7.7 Select Bandmatching > Comparative
Quantification settings. This opens the Comparative
Quantification settings dialog box (Figure 10-12.).
Figure 10-12. The Comparative quantification
settings dialog box.
One dimension quantification is based on the
densitometric curves extracted from the patterns (see
paragraph 7.6): Band height is the height of the peak;
Band surface is the area under the Gaussian curve
approximating a band; Relative band surface is the
same as band surface, as a percentage of the total band
area of the pattern.
Two dimensions quantification is based on the band
contours of the two-dimensional pattern images (see
paragraph 7.6): Volume is the absolute volume within
the contour; Relative volume is the same as a percentage
of the total band volume of the pattern; Concentration is
the physical concentration unit the user has assigned
based upon regression through known calibration
bands.
If no two-dimensional quantification is performed for
the gels, one should select among the first three options.
10.7.8 Close the Comparative quantification settings dialog
box with <OK> or <Cancel>.
10.7.9 Select Composite > Show quantification (values)
to display the numerical intensities of the bands.
10.7.10 With Composite > Export character table, a
numerical band matching table is created in text format,
separated by tabs (Figure 10-13.).
The meaning of the numbers depends on the selected
comparative quantification parameter (see 10.7.5 to
10.7.7).
10.8 Tools to display selective band
classes
10.8.1 First, remove the existing band matching analysis
by selecting RFLP1 in the experiment type selection bar
and Bandmatching > Perform band matching or the
corresponding toolbar button.
10.8.2 Clear any selection made by pressing F4, and
manually select some entries in the comparison.
Figure 10-13. Numerical band matching character table exported from BioNumerics (space-delimited).
The export consists of a HEADER block listing the band
classes (RFLP1:94.36, RFLP1:88.09, RFLP1:84.46,
RFLP1:78.00, RFLP1:74.40, RFLP1:70.94, RFLP1:68.23,
RFLP1:54.45, RFLP1:53.05, RFLP1:46.59, RFLP1:41.00,
RFLP1:40.33) and a TABLE block (rows = entries) with one
numerical value per band class.
10.8.3 Select Bandmatching > Perform band matching to
create a new band matching. Press <OK> to the position
tolerance settings.
The program now asks “Search only inside the
selection?”. With this option, the program will perform
the band matching only for the selected entries. This
means that it will only create band classes for bands
found on the entries in the selection.
10.8.4 Answer <Yes> to perform a band matching on the
selected entries only.
10.8.5 With Bandmatching > Auto assign all bands to
all classes, you can let the program assign the bands of
the unselected entries to the corresponding band
classes.
BioNumerics offers another interesting tool to display
only the polymorphic bands. To make this tool as
flexible as possible, the polymorphic bands are only
looked for within the selection list. For genetic mapping
purposes, the user can select the patterns from two (or
more) parent entries, and have the program display only
the polymorphic band classes between these two
patterns. This reduces the size of the band matching
table to contain only the polymorphic bands of interest.
Of course, the user can add or delete band classes
afterwards, as desired.
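The selection of polymorphic band classes can be
illustrated with a short sketch on a binary band matching
table. The data layout (one list of 0/1 values per entry)
and the function name are assumptions made for the
example.

```python
from typing import Dict, List


def polymorphic_classes(table: Dict[str, List[int]],
                        selection: List[str]) -> List[int]:
    """Return the indices of band classes that are polymorphic within
    the selected entries, i.e. neither present in all of them nor
    absent in all of them."""
    n_classes = len(next(iter(table.values())))
    keep = []
    for j in range(n_classes):
        states = {table[entry][j] for entry in selection}
        if len(states) > 1:          # both 0 and 1 occur in the selection
            keep.append(j)
    return keep


# Invented binary table: two parent lines and one offspring.
table = {
    "Parent 1": [1, 1, 0, 1],
    "Parent 2": [1, 0, 0, 1],
    "F1":       [1, 1, 1, 0],
}
between_parents = polymorphic_classes(table, ["Parent 1", "Parent 2"])
```

For the two parents only the second band class (index 1)
differs, so the table would be reduced to that single class.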
10.8.6 Clear any list of selected patterns with F4.
10.8.7 First, remove the existing band matching analysis
by selecting RFLP1 in the experiment type selection bar
and Bandmatching > Perform band matching or the
corresponding toolbar button.
10.8.8 Select Bandmatching > Perform band matching to
create a new band matching, including all band classes.
10.8.9 Select two entries having a few different bands.
10.8.10 Select Bandmatching > Polymorphic bands only
(for selection list). Only the band classes that are
polymorphic between the selected two patterns are now
displayed.
10.9 Finding discriminative bands
between entries
10.9.1 In database DemoBase, have a Comparison window
open with all non-“STANDARD” entries selected (e.g.
comparison ‘All’, see 9.7.29) and the Composite Data Set
RFLP1-table shown (see 10.2 and 10.6.3).
10.9.2 Make sure that the image of the Composite Data
Set is shown, by pressing the Show image button of
RFLP1-table.
10.9.3 Minimize or reduce the Comparison window so that
the Main window (at least the menu and toolbar)
becomes visible.
10.9.4 Press F4 to make sure that no entries are selected.
10.9.5 In the Main window, select Edit > Search entries
(F3) and enter Vercingetorix in the Genus field.
All Vercingetorix entries are selected in the database (and
in the Comparison window).
10.9.6 To group the selected entries, choose Edit > Bring
selected entries to top in the Comparison window.
10.9.7 Select Composite > Discriminative characters.
The characters (bands) are reorganized in such a way
that those characters positive for the selected entries and
negative for the other entries occur left, and those
characters negative for the selected entries and positive
for the other entries occur right (see Figure 10-14.).
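One way to picture this reordering is to score each
character by how much more often it is positive in the
selected entries than in the others, and to sort on that
score. The scoring rule below is an assumption chosen for
illustration; the manual does not state how the program
ranks the characters.

```python
from typing import Dict, List


def discriminative_order(table: Dict[str, List[int]],
                         selected: List[str]) -> List[int]:
    """Order character indices so that characters positive for the
    selected entries and negative for the others come first (left),
    and the opposite case comes last (right)."""
    others = [entry for entry in table if entry not in selected]
    n_chars = len(next(iter(table.values())))

    def score(j: int) -> float:
        in_selected = sum(table[e][j] for e in selected) / len(selected)
        in_others = sum(table[e][j] for e in others) / len(others)
        return in_selected - in_others

    return sorted(range(n_chars), key=score, reverse=True)


# Invented binary table; V1 and V2 stand for the selected entries.
table = {
    "V1": [1, 0, 1, 0],
    "V2": [1, 0, 1, 1],
    "A1": [0, 1, 1, 0],
    "A2": [0, 1, 1, 1],
}
order = discriminative_order(table, ["V1", "V2"])
```

The sketch assumes that both the selection and the
remaining entries are non-empty.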
In Composite Data Sets, it is also possible to list the
entries according to the amount of a selected character.
In case of banding patterns, the entries will be ordered
by the intensity of a selected band. For a particular
band, this feature allows the entries in which the band is
present or absent to be found.
10.9.8 Show the band table as intensity table with
Composite > Show quantification (colors).
10.9.9 Select a band class in the band classes header
(Figure 10-14.) and Composite > Sort by character.
The entries are now sorted by increasing intensity of the
selected band class.
Figure 10-14. Discriminative bands for selected entries, positive discrimination left, negative discrimination
right.
11. Cluster analysis
11.1 Introduction
The term cluster analysis is a collective noun for a variety
of techniques that have the common feature of producing
a hierarchical tree-like structure from the set of sample
data provided. The tree usually allows the samples to be
classified based upon the clusters produced by the
method. Apart from this common goal, the principles
and algorithms used, as well as the purposes, are very
different. Cluster analysis sensu lato has therefore been
subdivided in four chapters in this manual:
• Cluster analysis sensu stricto (this chapter) is based
upon a matrix of similarities between database entries
and a subsequent algorithm for calculating bifurcating
dendrograms to cluster the entries.
• Phylogenetic cluster methods (chapter 16.) are
methods which attempt to create trees that optimize a
specific phylogenetic criterion. These methods start
from the data set directly rather than from a similarity
matrix.
• Minimum spanning trees (chapter 18.) are trees
calculated from a distance matrix, that possess the
property of having a total branch length that is as
small as possible.
Cluster analysis based upon a similarity matrix can also
be performed on characters, which is described in
chapter 15..
11.2 Calculating a dendrogram
11.2.1 Run the Analyze program with database
DemoBase loaded.
11.2.2 Select all entries except STANDARD (9.3) and
create a new comparison (9.7), or open comparison All if
existing.
11.2.3 Select RFLP1 in the experiment type selection bar
(bottom of window), and show the normalized gel
image by pressing the corresponding button.
11.2.4 Select Clustering > Calculate > Cluster analysis
(similarity matrix). You can also press the dendrogram
button, in which case the following menu pops up
(Figure 11-1.).
Figure 11-1. Cluster analysis menu popped up from
the dendrogram button.
The first choice is the matrix-based cluster analysis
discussed in this chapter, whereas the second and third
choices are discussed in chapters 16. and 18.,
respectively. Note that, in case of aligned sequence data,
an extra option Calculate maximum likelihood tree
becomes available, which is also discussed in chapter
16..
The Comparison settings dialog box allows you to specify
the similarity coefficient to calculate the similarity
matrix, and the clustering method (Figure 11-2.).
Two coefficients provide similarity based upon
densitometric curves; the Pearson product-moment
correlation (Pearson correlation) and the Cosine
coefficient.
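The two curve-based coefficients can be written out in a
few lines using their standard textbook definitions. The
sketch treats each densitometric curve as a plain list of
intensity values of equal length; that data layout and the
function names are assumptions.

```python
import math
from typing import List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson product-moment correlation between two curves."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def cosine(x: List[float], y: List[float]) -> float:
    """Cosine coefficient between two curves."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


# Invented curves: `shifted` is `curve` with a constant background added.
curve = [1.0, 2.0, 3.0, 4.0]
shifted = [3.0, 4.0, 5.0, 6.0]
r = pearson(curve, shifted)
c = cosine(curve, shifted)
```

In this example the Pearson correlation stays at 1 because
it subtracts the mean of each curve, whereas the Cosine
coefficient drops below 1 when a constant background is
added.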
Four different binary coefficients measure the similarity
based upon common and different bands: the Jaccard,
Dice, Jeffrey’s X, and Ochiai coefficients. A fifth
coefficient, Different bands, is essentially a distance
coefficient as it simply counts the number of different
bands in two patterns. It is converted into a similarity by
subtracting this distance value from 100. If you select
one of these binary coefficients, you can enable the
Fuzzy logic option: instead of a yes/no decision whether
two bands are matching or not, the program lets the
matching value gradually decrease with the distance
between the bands. The Area sensitive option makes the
coefficient take into account differences in area between
two matching bands: if for each matching band the areas
on both patterns are exactly the same, the coefficient
reduces to a normal binary coefficient; the more the
areas differ, the lower the similarity will be.
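The band-based coefficients can be illustrated with their
standard definitions, expressed as percentages. The sketch
below is an assumption-laden illustration: the band
matching is a simple one-to-one comparison of sorted
positions within a fixed tolerance, the function names are
hypothetical, and Jeffrey's X, the Fuzzy logic option and
the Area sensitive option are left out because the manual
does not give their formulas here.

```python
import math
from typing import Dict, List


def common_bands(a: List[float], b: List[float], tolerance: float) -> int:
    """Count one-to-one matching bands between two sorted patterns."""
    a, b = sorted(a), sorted(b)
    i = j = n = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tolerance:
            n += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return n


def band_similarities(a: List[float], b: List[float],
                      tolerance: float) -> Dict[str, float]:
    """Jaccard, Dice, Ochiai and 'different bands' similarities (in %)."""
    n_ab = common_bands(a, b, tolerance)
    n_a, n_b = len(a), len(b)
    different = (n_a - n_ab) + (n_b - n_ab)
    return {
        "jaccard": 100.0 * n_ab / (n_a + n_b - n_ab),
        "dice": 100.0 * 2 * n_ab / (n_a + n_b),
        "ochiai": 100.0 * n_ab / math.sqrt(n_a * n_b),
        # distance converted to a similarity by subtracting from 100
        "different_bands": 100.0 - different,
    }


# Invented band positions (% of pattern length).
result = band_similarities([10.0, 20.0, 30.0, 40.0], [10.2, 30.1, 50.0], 1.0)
```

For these patterns two bands match, giving a Jaccard value
of 40%, a Dice value of about 57.1% and a Different bands
similarity of 97.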
The Position tolerance is the maximal shift (in
percentage of the pattern length) between two bands
allowed to consider the bands as matching. This
parameter only applies to band matching coefficients.
With Change towards end of fingerprint, you can specify
a gradual increase or decrease in tolerance. In paragraph
12.2, we discuss how to have the program automatically
calculate the optimal position tolerance settings for your
Fingerprint Type.
The Optimization is a shift that you allow between any
two patterns and within which the program will look for
the best possible matching. This parameter applies for
both curve-based and band matching coefficients. To
understand the utility of optimization in addition to
tolerance, see the example in Figure 11-4.
Figure 11-2. The Comparison settings dialog box.
Among the dendrogram types, the program offers the
Unweighted Pair group Method using Arithmetic
averages (UPGMA), the Ward algorithm, the Neighbor
Joining method, and two variants of UPGMA, namely
Single linkage and Complete linkage. The option
Advanced is explained in chapter 17..
11.2.5 Select Dice and UPGMA.
The Position tolerances button allows you to specify the
maximum allowed distance between the positions of
two bands on different patterns, for these bands to be
considered as matching.
11.2.6 Press the <Position tolerances> button.
The Position tolerance settings dialog box for the
Fingerprint Type is popped up (Figure 10-4.).
Figure 11-4. Effect of position tolerance (1) and
optimization (2) on the matching between shifted
patterns.
In paragraph 12.2, we discuss how to have the program
automatically find the best optimization value for your
Fingerprint Type.
With minimum height and minimum surface, you can
exclude weak or irrelevant bands.
NOTE: Both the Comparison settings and the
Position tolerance settings are stored along with the
Fingerprint Type. The same dialog boxes can be called
from the Experiment type window settings (7.8.16).
Figure 11-3. Position tolerance settings dialog box of
a Fingerprint Type.
The Uncertain bands option allows you to either include
uncertain bands or ignore them (see 7.6.1). When Ignore
is chosen, uncertain bands are ignored. This means that
in a pairwise comparison, an uncertain band is not
penalized if there is no matching band on the other
pattern. Conversely, if there is a band on the other
pattern that matches an uncertain band, it will also be
ignored in that comparison. When Include is chosen,
uncertain bands are treated in the same way as certain
bands, which means that an uncertain band which is not
complemented by a band in the other pattern, is
penalized.
NOTE: The Ignore option will only work when both
Fuzzy logic and Area sensitive are disabled in the
Comparison settings dialog box (Figure 11-2.).
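The difference between Ignore and Include can be sketched as follows. This is an illustration under stated assumptions, not the program's code: the function name and the representation of a band as a (position, certain) pair are hypothetical, and Fuzzy logic and Area sensitive are not modelled.

```python
def dice_with_uncertain(a, b, tolerance, ignore_uncertain=True):
    """Dice coefficient in percent for two band lists.

    a, b: lists of (position, certain) tuples, where certain is a bool.
    With ignore_uncertain=True, an uncertain band never counts, neither
    as a match nor as a mismatch; a band matching it is skipped as well.
    """
    a, b = sorted(a), sorted(b)
    i = j = matches = unmatched = 0
    while i < len(a) or j < len(b):
        if i < len(a) and j < len(b) and abs(a[i][0] - b[j][0]) <= tolerance:
            # Matching pair: counted unless one of the two bands is ignored.
            if not ignore_uncertain or (a[i][1] and b[j][1]):
                matches += 1
            i += 1
            j += 1
        elif j >= len(b) or (i < len(a) and a[i][0] < b[j][0]):
            # Band of a without partner: penalized unless ignored.
            if not ignore_uncertain or a[i][1]:
                unmatched += 1
            i += 1
        else:
            # Band of b without partner: penalized unless ignored.
            if not ignore_uncertain or b[j][1]:
                unmatched += 1
            j += 1
    total = 2 * matches + unmatched
    return 100.0 * 2 * matches / total if total else 100.0


pattern_1 = [(10.0, True), (30.0, True), (50.0, False)]  # last band uncertain
pattern_2 = [(10.0, True), (30.0, True)]
ignored = dice_with_uncertain(pattern_1, pattern_2, 1.0, ignore_uncertain=True)
included = dice_with_uncertain(pattern_1, pattern_2, 1.0, ignore_uncertain=False)
```

With Ignore, the unmatched uncertain band leaves the similarity untouched; with Include, the same band lowers it.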
11.2.7 Enter a position tolerance of 1%, an optimization
of 1%, a change of 0%, and a minimum height and
minimum surface of 0%, and press <OK>.
11.2.8 Press <OK> again in the Comparison settings dialog
box to start the cluster analysis.
During the calculations, the program shows the
progress in the Comparison window’s caption (as a
percentage), and there is a green progress bar between
the toolbar and the window panel that proceeds from
left to right.
When finished, the dendrogram and the similarity
matrix are shown (Figure 11-5.). For more information
about the panels in the Comparison window, see 9.7.
The experiment type (or Composite Data Set) from
which the dendrogram is generated is shown in the
header of the dendrogram panel. The parameters and
settings of the cluster analysis are shown in the header
of the matrix panel.
Figure 11-5. Comparison window with dendrogram, image, entry names, and similarity matrix.
11.3 Calculation priority settings
BioNumerics performs almost all its calculations in
multithreaded mode. This means that you can continue
to use BioNumerics or any other programs while the
often time-consuming calculations are going on
(especially sequence alignments and phylogenetic
clustering can take long). In order to speed up the
calculations, or make multitasking smoother, you may
want to select one of the priority settings for the
calculations.
11.3.1 In the Main window, select File > Calculation
priority settings. The Calculation settings dialog box
(Figure 11-6.) allows 5 priority levels.
Figure 11-6. Calculation priority settings dialog box.
If Foreground is chosen, it will not be possible to run
other applications while the calculations are going on.
Idle time background means that the computer will only
process the BioNumerics calculations while it has
nothing else to do.
11.3.2 Select Normal priority background and <OK>.
11.3.3 While the program is calculating, you can abort
the calculations at any time using the abort button.
11.4 General edit functions
11.4.1 You can drag the separator lines between the four
panels to the left or to the right, in order to divide the
space among the panels optimally.
11.4.2 Similarly, you can drag the separator lines
between the information field columns to the left or to
the right, in order to divide the space among the
information fields optimally.
11.4.3 Left-click on the dendrogram to place a cursor on
any node or tip (where a branch ends in an individual
entry). The average similarity at the cursor’s place is
shown in the upper left corner.
11.4.4 You can also move the cursor with the arrow keys.
Many of the display and edit functions, for example
zooming in and out, showing bands, are described in
paragraph 9.7 (9.7.5 to 9.7.10).
11.5 Adding and deleting entries
11.5.1 First make sure that no entries are selected by
pressing the F4 key (9.2.6).
11.5.2 Select some entries from the comparison (see 9.2.1
and 9.2.2).
This can either be done in the information fields panel
(use CTRL and Shift keys) or groups can be directly
selected on the dendrogram.
11.5.3 To select a cluster on the dendrogram at once,
hold the CTRL key and left-click on a branch node.
Repeat this action to unselect a branch. Alternatively,
right-click on a branch and choose Select branch into list
from the floating menu.
11.5.4 With Edit > Cut selection or the corresponding
button, the selected entries are removed from the
comparison, and are copied to the clipboard.
11.5.5 With Edit > Paste selection or the corresponding
button, the same entries are placed back in the
comparison. If no dendrogram is present, they are
placed at the position of the selection bar.
Note that in this way, you can add new database entries
to an existing dendrogram: select the new entries in the
database, open an existing comparison with
dendrogram, and paste the selection into the
comparison. Both the similarity matrix and the
dendrogram will be updated, which takes considerably
less time than recalculating the whole cluster analysis.
The entries first need to be copied to the clipboard from
the Main window or from another comparison.
11.5.6 To copy entries to the clipboard, use the Edit >
Copy selection command or the corresponding button.
11.5.7 To move entries from one comparison into
another, use Edit > Cut selection or the corresponding
button in the one comparison and Edit > Paste selection
or the corresponding button in the other comparison.
11.5.8 To save the comparison with the dendrogram,
select File > Save as or the corresponding button and
enter a name, e.g. All.
Comparison “All” now contains a dendrogram for
Fingerprint Type RFLP1.
11.6 Dendrogram display functions
In some cases, it may be necessary to select the root of a
dendrogram, for example if you want to (un)select all
the entries of the dendrogram. In case of large
dendrograms, selecting the root may be difficult using
the mouse.
11.6.1 With Clustering > Select root, the cursor is placed
on the root of the dendrogram.
Two branches grouped at the same node can be
swapped to improve the layout of a dendrogram or
make its description easier:
11.6.2 Select the node where two branches originate and
Clustering > Swap branches.
To simplify the representation of large and complex
dendrograms, it is possible to abridge branches as a
triangle.
11.6.3 Select a cluster of closely related entries and
Clustering > Collapse/expand branch.
11.6.4 With Clustering > Show similarity values, the
average similarity of every branch is indicated on the
dendrogram.
Another function, Clustering > Reroot tree, only applies
to so-called unrooted trees, i.e. neighbor joining,
parsimony and maximum likelihood. These clustering
methods produce trees without any specification as to
the position of the root or origin. Since users will want to
display such trees in the familiar dendrogram
representation, the tree is to be rooted artificially.
“Rerooting” is usually done by adding one or more
unrelated entries (so-called outgroup) to the clustering,
and using the branch connecting the outgroup to the
others as root. The result is a pseudo-rooted tree.
To illustrate the rerooting of an unrooted tree, we will
create a second dendrogram, based upon neighbor
joining of the Fingerprint Type RFLP2.
11.6.5 In the experiment selection bar, select experiment
RFLP2.
11.6.6 Select Clustering > Calculate > Cluster analysis
(similarity matrix) and specify Neighbor Joining in the
dialog box. A neighbor joining tree is calculated for
RFLP2.
If you scroll through the tree, you will notice that two
entries, i.e. Perdrix sp. strain numbers 53175 and 25693,
protrude on a very long branch. These two entries are
ideally suited as “outgroup”.
11.6.7 Click somewhere in the middle of the branch. A
secondary, X-shaped cursor appears.
11.6.8 Select Clustering > Reroot tree, and the new root
connects the outgroup with the rest of the entries.
11.6.9 The software automatically limits the similarity
range to the depth of the dendrogram. If you want to
change this range, select Clustering > Set minimum
similarity value.
11.6.10 The similarity scale can be displayed in
similarity (default for most clustering types) or in
distance. To toggle between similarity and distance
modes, select Layout > Show distances.
11.7 Working with Groups
An important display function is the creation of Groups.
Groups basically are subsets of a comparison that can
be defined from clusters, from database fields, or just
from any subdivision the user desires. Groups are
normally displayed using rhombs of different colors
next to the entries, each group having its own color.
They can also be displayed using different symbols, or
using alphanumerical codes. In the first place, Groups
make the comparison between a dendrogram or a
dimensioning on the one hand, and a certain
characteristic on the other hand, easier. Groups also
make the comparison between dendrograms obtained
from different experiments easier. In addition, Groups
are used in a number of derived statistical analysis
functions, such as Partitioning, Group separation,
Discriminant Analysis, and MANOVA, and form an
easy link between dimensional representations such as
PCA, SOM or graphs and scatterplots on the one hand,
and database field information on the other hand. To
make the distinction between groups as clusters on the
one hand and groups as defined by the Groups tool on
the other hand, the latter Groups are always written
with a capital.
11.7.1 Show the UPGMA dendrogram of RFLP1 as
follows: right-click on RFLP1 in the experiment type
selection bar, and select Show dendrogram.
This dendrogram reveals three major clusters:
Vercingetorix, Ambiorix, and Perdrix (with some
exceptions).
11.7.2 First make sure that no entries are selected by
pressing F4.
11.7.3 Hold the CTRL key and click on the node that
connects all the entries belonging to the Vercingetorix
cluster. The entries of this cluster are now selected.
11.7.4 In the menu, select Groups > Assign selected to.
The menu lists 30 different colors and accompanying
symbols, from which you can choose one (e.g. the first
one, green).
11.7.5 Press F4 to clear the selection and CTRL click on
the node connecting all Ambiorix entries.
11.7.6 Select Groups > Assign selected to, and choose the
second color (red).
11.7.7 Repeat actions 11.7.2, 11.7.3, and 11.7.4 for the
third cluster mainly composed of Perdrix. Use for
example the third color (blue).
11.7.8 You can repeat these actions for two outliers of
Perdrix, using another color.
Whatever dendrogram you now display, you will be
able to recover the groups of the RFLP1 dendrogram at a
glance.
11.7.9 Right-click on RFLP2 in the experiment type
selection bar, and select Show dendrogram. The Perdrix
and Ambiorix strains are not well separated by this
technique: the second and the third Group are mixed up.
The Group assignments are saved along with the cluster
analysis.
An alternative method to define Groups is by selecting a
database field and having the program automatically
create Groups based upon the different names that exist
in this database field. One should be aware, however,
that any misspelled name or typographic error will
result in a different group. The method works as
follows.
11.7.10 Select a database field by clicking on the
database field name, for example Genus.
11.7.11 In the Groups menu, select Create from database
field.
The program now asks “Remove existing groups?”.
If you answer <Yes>, the existing Groups will first be
removed and the program will assign the new Groups
based upon the selected database field only. If you
answer <No>, the program will keep the Groups that
are already defined, and split existing Groups into more
groups if differences in the selected database field are
found.
11.7.12 Press <Yes> to remove the existing Groups. The
program creates three groups according to the genus
names.
11.7.13 Select the Species database field and Groups >
Create from database field again.
11.7.14 Press <No> to keep the existing Groups.
Every unique species name now is assigned to a
different Group. In addition, if two different genera
had the same species name, their entries would belong
to different Groups too, since we kept the existing
Groups based upon the genus database field.
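The effect of answering <Yes> or <No> can be sketched in a few lines. The sketch below is an assumption-laden illustration, not the program's logic: the function name, the dictionary layout of the entries and the made-up species names are hypothetical.

```python
def groups_from_field(entries, field, existing=None):
    """Assign a Group number to every entry from the content of one field.

    entries:  dict mapping entry key -> dict of database fields.
    existing: dict mapping entry key -> current Group, or None to remove
              the existing Groups first (the <Yes> answer).
    Groups are numbered in order of first appearance.
    """
    numbers = {}
    result = {}
    for key, fields in entries.items():
        previous = existing.get(key) if existing else None
        # Keeping existing Groups means an existing Group is split
        # whenever the field content differs within it.
        label = (previous, fields[field])
        if label not in numbers:
            numbers[label] = len(numbers) + 1
        result[key] = numbers[label]
    return result


entries = {
    "k1": {"Genus": "Ambiorix", "Species": "sylvestris"},
    "k2": {"Genus": "Perdrix", "Species": "sylvestris"},
    "k3": {"Genus": "Ambiorix", "Species": "aberrans"},
    "k4": {"Genus": "Ambiorx", "Species": "aberrans"},  # typing error in Genus
}
by_genus = groups_from_field(entries, "Genus")
by_species = groups_from_field(entries, "Species", existing=by_genus)
```

Note how the misspelled genus name in entry k4 produces a Group of its own, and how entries k1 and k2 end up in different Groups although they share a species name.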
Since different colors are not equally distinguishable by
different persons it may be useful to customize the
Group colors in a user-defined scheme.
11.7.15 To define your own Group color scheme, select
Groups > Edit group colors. This brings up the Group
color editor dialog box (Figure 11-7.). For each color, three
slide bars (red, green and blue, respectively) can be
adjusted to produce any desired color.
11.7.16 A color scheme obtained in this way can be saved
by pressing the <Save as> button, and entering a name.
Figure 11-7. The Group colors editor dialog box.
11.7.17 A user-defined color scheme can be selected from
the drop-down list of Saved color schemes.
11.7.18 To delete a saved color scheme, first select it, and
then press the <Delete selected> button.
11.7.19 To bring up the default color scheme, press
<Default>. Another predefined scheme, using pastel
colors, can be loaded by pressing <Pastels>.
11.7.20 It is also possible to generate a scheme of
transition colors by pressing <Range>. The program will
ask to enter the number of colors to include in the range.
Enter a number between 2 and 30.
11.8 Cluster significance tools
A dendrogram tells you something about the groups
among a selection of entries, but nothing about the
significance, i.e. the reliability, of these groups.
Therefore, the software offers a range of methods that
express the stability or the error at each branching level.
The simplest indication of the significance of branches is
showing the average similarities of the dendrogram
branches (see 11.6.4).
The Standard Deviation of a branch is obtained by
reconstructing the similarity values from the
dendrogram branch and comparing the values with the
original similarity values. The standard deviation of the
derived values versus the original values is a measure of
the reliability and internal consistency of the branch.
11.8.1 Right-click on RFLP1 in the experiment type
selection bar, and select Show dendrogram.
11.8.2 Select Clustering > Calculate error flags.
An error flag is drawn on each branch. The average
similarity and the exact standard deviation are shown at
the position of the cursor (see Figure 11-8.). The smaller
this error flag, the more consistent a group is. For
example, the Perdrix group has a small error flag,
meaning that this group is very consistent. This group
will for example not disappear by incidental changes
such as tolerance settings, adding or deleting entries etc.
Figure 11-8. Dendrogram with error flags, detail.
The average similarity and standard deviation are
shown at the cursor’s position (top).
The Cophenetic Correlation is also a parameter to express
the consistency of a cluster. This method calculates the
correlation between the dendrogram-derived
similarities and the matrix similarities. The value is
usually calculated for a whole dendrogram, to have an
estimation of the faithfulness of a cluster analysis. In
BioNumerics, the value is calculated for each cluster
(branch), thus estimating the faithfulness of each
subcluster of the dendrogram. Obviously, you can
obtain the cophenetic correlation for the whole
dendrogram by looking at the cophenetic correlation at
the root.
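As an illustration of the principle, the following sketch derives the dendrogram similarities from a small UPGMA clustering and correlates them with the original matrix. It computes the value for the whole dendrogram (the root) only; the function names and the example matrix are assumptions of this sketch, and the per-branch calculation of the program is not reproduced.

```python
from itertools import combinations
from math import sqrt


def upgma_cophenetic(sim):
    """Return the matrix of dendrogram-derived similarities for UPGMA."""
    n = len(sim)
    clusters = {i: [i] for i in range(n)}  # cluster id -> member indices
    coph = [[100.0] * n for _ in range(n)]

    def average(c1, c2):
        values = [sim[i][j] for i in clusters[c1] for j in clusters[c2]]
        return sum(values) / len(values)

    next_id = n
    while len(clusters) > 1:
        # Join the two clusters with the highest average similarity.
        a, b = max(combinations(clusters, 2), key=lambda pair: average(*pair))
        level = average(a, b)
        for i in clusters[a]:
            for j in clusters[b]:
                coph[i][j] = coph[j][i] = level
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return coph


def pearson(x, y):
    """Product-moment correlation between two equally long lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def cophenetic_correlation(sim):
    coph = upgma_cophenetic(sim)
    pairs = list(combinations(range(len(sim)), 2))
    return pearson([sim[i][j] for i, j in pairs],
                   [coph[i][j] for i, j in pairs])


# Hypothetical similarity matrix (percent) for four entries.
sim = [
    [100, 90, 40, 30],
    [90, 100, 35, 45],
    [40, 35, 100, 80],
    [30, 45, 80, 100],
]
coph = upgma_cophenetic(sim)
r = cophenetic_correlation(sim)
```

The closer the result is to 1, the more faithfully the dendrogram represents the similarity matrix.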
11.8.3 Select Clustering > Calculate error flags again to
remove the error flags.
11.8.4 Select Clustering > Calculate cophenetic
correlations.
The cophenetic correlation is shown at each branch
(Figure 11-9.), together with a colored dot, of which the
color ranges between green-yellow-orange-red
according to decreasing cophenetic correlation. Thus, it
is easy to detect reliable and unreliable clusters at a
glance.
Figure 11-9. Dendrogram showing cophenetic
correlation values, detail.
Bootstrap analysis¹ measures cluster significance at a
different level. Instead of comparing the dendrogram to
the similarity matrix, it directly measures the influence
of characters on the obtained dendrogram. The concept
is very simple: “sampling with replacement”, i.e.
characters are randomly left out from the character set
and replaced with another². For each sampling case, the
dendrogram is recalculated, and the relative number of
dendrograms in which a given cluster occurs is a
measure of its significance. This method requires the
characters to be independent and equally important.
Since bootstrap analysis requires a closed character set,
the method can only be performed on aligned sequences
and Character Type data. In case of Fingerprint Type
data, a band matching needs to be performed first
(chapter 10.).
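The sampling procedure can be sketched as follows. The sketch makes several assumptions that are not taken from the program: binary characters, a simple matching coefficient, UPGMA clustering, and hypothetical function names. It merely shows how resampling the characters leads to a support percentage per cluster.

```python
import random
from itertools import combinations


def simple_matching(a, b):
    """Fraction of characters with the same state in both entries."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def upgma_clusters(data):
    """Return the set of clusters (frozensets of entry indices) found by UPGMA."""
    n = len(data)
    sim = [[simple_matching(data[i], data[j]) for j in range(n)] for i in range(n)]
    clusters = [frozenset([i]) for i in range(n)]
    found = set()

    def average(c1, c2):
        return sum(sim[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda pair: average(*pair))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        found.add(a | b)
    return found


def bootstrap_support(data, n_samples=100, seed=0):
    """Percentage of resampled dendrograms in which each original cluster occurs."""
    rng = random.Random(seed)
    n_chars = len(data[0])
    counts = dict.fromkeys(upgma_clusters(data), 0)
    for _ in range(n_samples):
        # Sampling with replacement: draw as many characters as the set
        # contains, so some are left out and others occur more than once.
        columns = [rng.randrange(n_chars) for _ in range(n_chars)]
        resampled = [[row[c] for c in columns] for row in data]
        for cluster in upgma_clusters(resampled):
            if cluster in counts:
                counts[cluster] += 1
    return {c: 100.0 * k / n_samples for c, k in counts.items()}


# Hypothetical character set: four entries, eight binary characters.
data = [
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 0],
]
support = bootstrap_support(data, n_samples=20, seed=1)
```

The cluster containing all entries is by definition present in every resampled dendrogram; the values of interest are those of the subclusters.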
11.8.7 Press <OK> and wait till the sampling and
calculation process is finished. No need to explain that
calculating 100 matrices and dendrograms can take
some computing time.
The bootstrap values are shown in a similar way as the
cophenetic correlation values (see Figure 11-9.).
Another way of looking at dendrograms is to try to
delimit, by objective means, the relevant clusters from
the non-relevant clusters. The simplest and most
arbitrary method is to draw a vertical line through the
dendrogram in a way that it cuts most homogeneous
clusters from most heterogeneous clusters. However,
there are more statistically founded methods to draw
either straight lines, or to evaluate cluster by cluster and
delimit relevant clusters at different similarity levels.
The Cluster Cutoff method in BioNumerics is one of these
statistical methods. The method draws a line through
the dendrogram at a certain similarity level, and from
the resulting number of clusters defined by that line, it
creates a new, simplified, similarity matrix, in which all
within-cluster values are 100%, and all between-cluster
values are 0%. Then, the Point-bisectional correlation is
calculated, i.e. the correlation between this new matrix
and the original similarity matrix. The same is done
again for other cutoff similarity levels, and the level
offering the highest PBC is the one offering the most
relevant groups.
In BioNumerics, this standard method is even refined,
as the cutoff values can be different per cluster, to allow
even more reliable clusters to be defined.
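The standard method with a single cutoff line can be sketched as follows. The correlation between a two-valued matrix and a continuous one is commonly known as the point-biserial correlation, which is assumed here to be what the text refers to. The function names, the example matrix and the candidate partitions are hypothetical, and the per-cluster refinement of the program is not covered.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Product-moment correlation; 0.0 if one of the lists is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def partition_correlation(sim, labels):
    """Correlate the original matrix with the simplified 100%/0% matrix."""
    pairs = list(combinations(range(len(sim)), 2))
    original = [sim[i][j] for i, j in pairs]
    simplified = [100.0 if labels[i] == labels[j] else 0.0 for i, j in pairs]
    return pearson(original, simplified)


def best_cutoff(sim, partitions):
    """partitions: dict mapping a cutoff level to the cluster label per entry."""
    return max(partitions, key=lambda level: partition_correlation(sim, partitions[level]))


# Hypothetical similarity matrix (percent) for four entries.
sim = [
    [100, 90, 40, 30],
    [90, 100, 35, 45],
    [40, 35, 100, 80],
    [30, 45, 80, 100],
]
# Clusters obtained when cutting the dendrogram at 85%, 60% and 20%.
partitions = {
    85: [0, 0, 1, 2],
    60: [0, 0, 1, 1],
    20: [0, 0, 0, 0],
}
cutoff = best_cutoff(sim, partitions)
```

In this example the cutoff at 60% gives the highest correlation, because it is the only level that keeps both tight pairs together and separates them from each other.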
11.8.8 Select Clustering > Calculate cluster cutoff
values. The branches that were found to be below the
cluster cutoff value are shown in dashed lines.
11.9 Matrix display functions
NOTES:
(1) If the similarity matrix is not shown for the selected
experiment, you can display it with Layout > Show
matrix.
(2) It is also possible to show the average similarities for
the branches directly on the dendrogram; see 11.6.4.
11.8.5 Select All-Pheno (a Composite Data Set; see
chapter 15.) in the experiment type selection bar and
Clustering > Calculate > Cluster analysis (similarity
matrix).
The matrix panel is at the right side of the Comparison
window (see Figure 11-5.).
11.8.6 When the dendrogram appears, select Clustering
> Bootstrap analysis and enter the number of
simulations (samplings) to perform. A reasonable
number of samplings is 100.
11.9.1 It may be necessary to reduce the space allocated
for the image and for the information fields, in order to
increase the space for the matrix panel, by dragging the
separator lines between the panels.
1. Efron, B. 1979. Bootstrap methods: another look at the
jackknife. Ann. Statist. 7:1-26
2. Felsenstein, J. 1985. Confidence limits on phylogenies: an
approach using the bootstrap. Evolution 39: 783-791
Initially, the matrix is displayed as differentially shaded
blocks representing the similarity values. The interval
settings for the shadings are graphically represented in
the caption of the matrix panel (Figure 11-10.).
Figure 11-10. Adjustable similarity shading scale.
There are two ways to change the intervals for shading:
11.9.2 Drag the interval bars on the scale; the matrix is
updated instantly.
11.9.3 Select Layout > Similarity shades in the menu.
The maximum/minimum values for each interval can
be entered as numbers.
11.9.4 To show the similarity values in the matrix, select
Layout > Show similarity values. If it is difficult to read
the values on the shaded background, you can remove
the shades with Layout > Similarity shades and
entering 100% for each interval.
11.9.5 With the option Layout > Show matrix rulers
(default enabled), a set of horizontal and vertical rulers
appears on the similarity cell where you click,
connecting the two entries from which the similarity
value is derived.
If you want to find the similarity value on the matrix
between two entries in the comparison, click first on one
entry inside the information fields panel, and then on
the other entry inside the selection panel (Figure 11-11.).
The similarity value is at the intersection of the
horizontal and the vertical rulers.
Figure 11-11. Workflow for finding a similarity
value between two entries: (1) click on the first entry,
(2) click on the second entry, (3) read the value between
the selected entries.
11.9.6 By double-clicking on a similarity block or value,
you can pop up the detailed comparison between the
two entries (9.6).
11.9.7 To export a tab-delimited text file of the
similarity matrix, select File > Export similarity matrix.
This text file contains the entry keys as descriptors. You
can export a text file which contains the same
descriptors with the corresponding information fields:
11.9.8 Export the information fields with File > Export
database fields. If a maximum is specified for a database
field, the exported text for this information field is
abbreviated as well.
11.10 Group statistics
The group statistics functions are based upon the
groups the user has defined. We have explained earlier
how to define groups (see 11.7.2 to 11.7.8), and if you
have gone through the dendrogram display functions
(11.6), the groups are already present on the
dendrogram.
•K-means partitioning
One function to let the software automatically determine
groups is the mathematical function K-means
partitioning. The user first creates groups based upon
one or more strains (e.g. type strains). Then, the
program automatically calculates for each entry of the
cluster analysis in which group it fits best. This fitting
can be based upon Average similarity with the group,
upon the highest similarity (Nearest neighbor), or upon
the lowest similarity (Furthest neighbor). Obviously, the
partitioning process must be executed iteratively, since
by adding an entry to a group, the average similarity of
the group as well as the highest and lowest similarities
with entries may change.
11.10.1 To illustrate the partitioning method, we select
RFLP1.
11.10.2 Select the root and select all entries on the
dendrogram with CTRL + left-click.
11.10.3 Remove all group assignments with Groups >
Assign selected to > None.
11.10.4 Select one or a few entries per cluster, each time
assigning a different group color to them (see Figure
11-12. for an example). Do not forget to unselect all
entries before you start defining a next group.
11.10.5 In the menu, select Groups > Partitioning of
groups, which allows you to choose between the three
options described above (Figure 11-13.).
Figure 11-13. Partitioning of groups dialog box.
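The three fitting options of the Partitioning of groups dialog box can be sketched as an iterative assignment. The sketch is an illustration only: the function name is hypothetical, the manually assigned entries are assumed to keep their group, and only the case with manually defined groups is covered.

```python
def partition(sim, seeds, criterion="average", max_rounds=100):
    """Assign every entry to the group it fits best.

    sim:       square similarity matrix (percent).
    seeds:     dict mapping entry index -> group of manually assigned entries.
    criterion: "average", "nearest" (highest similarity) or
               "furthest" (lowest similarity).
    """
    score = {
        "average": lambda values: sum(values) / len(values),
        "nearest": max,
        "furthest": min,
    }[criterion]
    labels = dict(seeds)
    for _ in range(max_rounds):
        new_labels = dict(seeds)
        for i in range(len(sim)):
            if i in seeds:
                continue
            best_group, best_score = None, None
            for group in sorted(set(labels.values())):
                members = [j for j, g in labels.items() if g == group and j != i]
                if not members:
                    continue
                s = score([sim[i][j] for j in members])
                if best_score is None or s > best_score:
                    best_group, best_score = group, s
            new_labels[i] = best_group
        # Repeat until no entry changes group any more.
        if new_labels == labels:
            break
        labels = new_labels
    return [labels[i] for i in range(len(sim))]


# Hypothetical similarity matrix; entries 0 and 2 are assigned manually.
sim = [
    [100, 90, 40, 30],
    [90, 100, 35, 45],
    [40, 35, 100, 80],
    [30, 45, 80, 100],
]
result = partition(sim, {0: "A", 2: "B"}, criterion="nearest")
```

The loop is repeated because every newly assigned entry changes the average, highest and lowest similarities of its group, which may in turn change the best group for other entries.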
11.10.6 Leave Number of groups on zero so that the
program will only use the groups we have defined
manually.
Figure 11-12. Example of manual group assignment
in preparation of a partitioning process.
11.10.7 Select Nearest neighbor, which will place a new
entry in the group containing the highest individual
similarity with that entry.
11.10.8 Press <OK> to execute the partitioning. After
partitioning, all entries belong to one of the defined
groups.
Note that these groups do not necessarily correspond
exactly to the visual clusters on the dendrogram. They
will differ if the clusters on the dendrogram are not
well-defined or are inconsistent. A second reason is the
oversimplification of complex matrices by the UPGMA
algorithm.
11.10.9 As an alternative, you can also select Groups >
Partitioning of groups, specifying a predefined number
of groups, e.g. 3.
11.10.10 Press <OK> to partition into 3 groups. The
program now has defined the 3 most relevant groups in
the comparison.
•Group separation statistics
These statistical methods determine the stability of the
defined groups, whether they are defined manually,
derived from clusters, using K-means partitioning, or
created from an information field. They involve the
Jackknife method and the “group violation” measurement.
11.10.11 With Groups > Group separation, the
separation between the defined groups is investigated.
The Group separation settings dialog box is shown,
allowing a number of choices to be made (Figure 11-14.).
Figure 11-14. Group separation settings dialog box.
The principle of the Jackknife method is to take out one
entry from the list, and to identify this entry against the
different groups. This can be done by calculating the
Average similarities with each group, or finding the
Maximum similarities with each group. This is done for
all entries (when Match against selection only is not
checked). The percentage of cases in which entries are
identified as belonging to the group they were assigned
to is a measure of the internal stability (significance) of
that group. Between the groups, the percentages of cases
in which entries are identified as belonging to another
group are given.
Using Match against selection only, you can let the
program calculate the matches against a selection you
made in the comparison, rather than against all entries
of the groups.
In cases where an entry has an equal match with a
member of its own group and a member of another
group (a “tie”), there are two equally valid
interpretations possible. The program can handle such
ties in an ‘optimistic’ way, i.e., by always assigning
equal matches to their own group, or in a ‘realistic’ way,
by spreading ties equally between their own groups and
the other groups.
The way ties are handled can be chosen in the Settings
dialog box under Tie handling. This includes two options,
Assign to own group and Spread equally.
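The Jackknife principle and the two ways of handling ties can be sketched as follows. The function name, the parameter values and the example matrix are assumptions made for this illustration; the option Match against selection only is not modelled.

```python
def jackknife(sim, labels, method="average", ties="own"):
    """Return percent[true group][identified group].

    method: "average" or "maximum" similarity with each group.
    ties:   "own" assigns a tie to the entry's own group,
            "spread" divides it equally over the tied groups.
    """
    groups = sorted(set(labels))
    combine = max if method == "maximum" else (lambda v: sum(v) / len(v))
    counts = {t: {g: 0.0 for g in groups} for t in groups}
    for i, own in enumerate(labels):
        # Take entry i out and score it against every group.
        scores = {}
        for g in groups:
            values = [sim[i][j] for j, lab in enumerate(labels)
                      if lab == g and j != i]
            if values:
                scores[g] = combine(values)
        best = max(scores.values())
        winners = [g for g, s in scores.items() if s == best]
        if ties == "own" and own in winners:
            winners = [own]
        for g in winners:
            counts[own][g] += 1.0 / len(winners)
    size = {g: labels.count(g) for g in groups}
    return {t: {g: 100.0 * c / size[t] for g, c in row.items()}
            for t, row in counts.items()}


# Hypothetical matrix in which entry 1 matches entries 0 and 2 equally well.
sim = [
    [100, 90, 40, 30],
    [90, 100, 90, 45],
    [40, 90, 100, 80],
    [30, 45, 80, 100],
]
labels = ["A", "A", "B", "B"]
optimistic = jackknife(sim, labels, method="maximum", ties="own")
realistic = jackknife(sim, labels, method="maximum", ties="spread")
```

With Assign to own group, all members of group A are identified as A; with Spread equally, the tie of entry 1 lowers this value to 75%. The result is not symmetric: half of group B is identified as A, but no member of A is identified as B in the optimistic case.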
11.10.12 Click <OK> with the default settings to display
the Group separation window (Figure 11-15.).
Figure 11-15. Group separation statistics window.
Note that the values in the matrix are not reciprocal,
i.e., the matrix is not symmetric! The numbers of
misidentifications for members of group A are given in
column 1 (Figure 11-15.), for members of group B in
column 2, etc. For example, 25% of group B members
may be identified as group C, while only 3% of group C
members are identified as group B. To facilitate the
interpretation of this matrix, the columns are separated
by black lines.
Figure 11-16. Schematized representation of internal
similarity range of group A (A-A), and similarity
ranges with other groups (A-B, A-C, and A-D). The
overlapping values are group violations.
When the Jackknife method is used, a value (or cell) in
the group separation matrix can be selected, and with
File > Select cell members or the corresponding button,
the entries contributing to this cell will be selected in the
Comparison window. The method is useful to identify
entries that fit well or do not fit well in their assigned
groups.
NOTE: The interpretation of matching and
non-matching entries is less easy when the Spread
equally function has been chosen, since in that case,
some entries may fall outside their group “unexpectedly”
when they have an equally high score with another
group.
11.10.13 Select Settings > Statistics or click the
corresponding button to call the Settings dialog box
again.
11.10.14 Under Method, select Group violations. Figure
11-15. is based on group violations between three
groups partitioned as above (RFLP1).
The group violations method compares all the similarity
values within a group with those between the group and
the other groups. All the values occurring in the overlap
zones (see Figure 11-16.) are considered “violations” of
the integrity of the group.
The percentage of group violations for group A is the
number of external entries scoring higher than the
lowest internal values over the total number of
similarity values considered. The percentages seen in
the diagonal of the matrix are the percentages of
non-violations.
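A minimal sketch of this count is given below. The text leaves some room for interpretation, so the sketch makes explicit assumptions: the “values considered” are taken to be the internal values of the group plus the values between the group and one other group, and the function name and example matrix are hypothetical.

```python
from itertools import combinations


def group_violations(sim, labels, group):
    """Percentage of violations of `group` by each other group.

    A violation is a between-group similarity that is higher than the
    lowest similarity found within the group. The group needs at least
    two members, otherwise there is no internal value to compare with.
    """
    members = [i for i, lab in enumerate(labels) if lab == group]
    internal = [sim[i][j] for i, j in combinations(members, 2)]
    lowest = min(internal)
    result = {}
    for other in sorted(set(labels) - {group}):
        external = [sim[i][j] for i in members
                    for j, lab in enumerate(labels) if lab == other]
        violations = sum(value > lowest for value in external)
        # Assumed denominator: internal plus between-group values.
        result[other] = 100.0 * violations / (len(internal) + len(external))
    return result


# Hypothetical similarity matrix (percent) and group assignment.
sim = [
    [100, 90, 40, 30],
    [90, 100, 90, 45],
    [40, 90, 100, 80],
    [30, 45, 80, 100],
]
labels = ["A", "A", "B", "B"]
violations_of_a = group_violations(sim, labels, "A")
violations_of_b = group_violations(sim, labels, "B")
```

In this example group A is not violated, whereas one value between groups B and A (90%) exceeds the lowest internal value of group B (80%).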
11.10.15 The Group statistics can be copied to the
clipboard using File > Copy to clipboard.
11.11 Printing a cluster analysis
When printing from the Comparison window,
BioNumerics first shows a print preview. This print
preview shows the same items as the panels of the
Comparison window: for example, a dendrogram, one or
more images from different experiments, metrics scale,
etc. One exception is the similarity matrix: the print
preview does not include the matrix unless you explicitly
select it in the print preview. The preview looks exactly
as it will look on the printed pages. You can edit the
layout of the print preview by adjusting the space
allowed for the different items (dendrogram, image(s),
information fields), by changing the size of the figure to
fit on one or more pages, etc.
11.11.1 In the Comparison window, select File > Print
preview, which opens the Comparison print preview
window (Figure 11-17.).
The Comparison print preview window is divided in two
panels: left, it shows an overview of the pages that will
be printed, with the actual page in yellow. Right, the
actual page is shown.
11.11.2 With the PgUp and PgDn keys or Edit >
Previous page and Edit > Next page, you
can thumb through the pages that will be printed out.
11.11.3 It is possible to zoom in or out on a page using
Edit > Zoom in and Edit > Zoom out, or
the + and - keys.
11.11.4 When zoomed, the horizontal and vertical scroll
bars allow you to scroll through the page.
11.11.5 The whole image can be enlarged or reduced
with Layout > Enlarge image size or Layout >
Reduce image size.
11.11.6 If a similarity matrix is available, it can be shown
and printed with Layout > Show similarity matrix.
On top of the preview page, there are a number of small
yellow slide bars (Figure 11-17.). These slide bars
represent the following margins, respectively:
•Left margin of the whole image;
•If dendrogram shown, right margin of dendrogram;
•If image shown, right margin of image;
•If groups are defined, right margin of groups;
•Right margin of entry keys or group codes (if not set to
zero length);
•Right margins of different information fields (except
those set to zero length);
•If similarity matrix shown, right margin of matrix.
Left on the first preview page, there are two slide bars,
representing the top margin of the whole figure and the
lower margin of the header, respectively. Left on the last
page, there is one slide bar representing the bottom
margin of the image.
Each of these slide bars can be shifted individually to
reserve the appropriate space for the mentioned items.
The image is printed exactly as it looks on the preview.
Figure 11-17. Comparison print preview window.
11.11.7 You can preview and print the image in full color
with Layout > Use colors or the corresponding button.
11.11.8 In addition, the menu command File > Printer
setup or the corresponding button allows you to set the
paper orientation, the margins, and other printer
settings for the default printer.
11.11.9 If the preview takes more than one page, you
can click on a page in the left page preview panel to
select a page from the range.
11.11.10 With File > Print this page or the corresponding
button, the current page is printed.
11.11.11 Use File > Print all pages or the corresponding
button to print all pages at once.
11.11.12 If you want to export the image to another
software package for further editing, use File > Copy
page to clipboard or the corresponding button.
This function provides a choice between the Windows
Enhanced Metafile format, i.e. the standard clipboard
exchange format between native Windows applications
(default), or a bitmap file with 75 dpi, 150 dpi, 300 dpi or
600 dpi resolution. Many software applications,
although supporting the enhanced metafile format, are
unable to properly import some advanced BioNumerics
clipboard files that make use of mixed vector, bitmap
and (rotated) text components. If you experience such
problems, you should select a bitmap file to be exported,
or use another software application to import the
graphical data.
With the latter function, only the current page is copied
to the clipboard. If you want the whole image to be
copied to the clipboard, first reduce the size of the image
(11.11.5).
NOTE: If preferred, the image of a Fingerprint
Type can be shown and printed with a space between
the gelstrips. To do so, open the Experiment type
window in the program's Main window (under
Fingerprint types) and select Layout > Show space
between gelstrips.
11.11.13 Select File > Exit to close the Comparison print
preview window.
11.12 Analysis of the concordance
between techniques
As soon as multiple techniques are used to study the
relationships between organisms, the question arises
how concordant the groupings obtained using the
different techniques are. It is also interesting to compare
the techniques by the level at which they discriminate
the entries, in other words, the taxonomic depth of the
techniques.
An obvious way to perform such a study is by
comparing the similarity matrices obtained from the
different experiment types used. By plotting the
corresponding similarity values in an X-Y coordinate
system, one can observe the kind and degree of
concordance at a glance. BioNumerics also calculates a
regression curve through the plot.
11.12.1 In the Main window with DemoBase loaded,
open comparison All, or a comparison with all entries
except those defined as STANDARD.
11.12.2 Check whether a cluster analysis is present for
each experiment type (except the Composite Data Sets)
by right-clicking on the experiment name in the
experiment type selection bar (bottom). The floating
menu appears.
If the menu items Show dendrogram and Show matrix
display in black (enabled), a cluster analysis is present
for the experiment and no further calculation is needed.
11.12.3 If the menu items Show dendrogram and Show
matrix display in gray (disabled), it means that no
cluster analysis is present for the experiment. In that
case, select Calculate cluster analysis from the floating
menu.
11.12.4 In the Comparison window's menu, select
Clustering > Congruence of experiments.
The Experiment congruence window (Figure 11-18.) shows
both a matrix of congruence values between the
techniques (experiment types) and a dendrogram
derived from that matrix.
Figure 11-18. Experiment congruence window.
The default method to calculate the congruence between
two experiment types is by using the Pearson
product-moment correlation coefficient. An alternative
coefficient is Kendall's tau. The principle of Kendall's tau
is as follows: if value A is higher than value B in
experiment 1, then corresponding value A of experiment
2 should also be higher than corresponding value B of
experiment 2. The fewer violations of this statement,
the more congruent the techniques are. Kendall's tau has
the advantage over Pearson correlation that non-linear
(but monotonic) correlations still obtain good scores. In
addition, the significance of the correlation between
techniques is shown (green values) when Kendall's tau is
selected, as well as the standard deviation on the values.
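The difference between the two coefficients can be illustrated with a minimal sketch. The code below is not the BioNumerics implementation; the function names and the similarity values are hypothetical, and Kendall's tau is computed in its simplest form (tau-a, without correction for ties). The second technique is a monotonic but non-linear function of the first.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation of two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of value pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1      # same order in both techniques
        elif sign < 0:
            discordant += 1      # order reversed: a violation
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical similarity values (%) for the same entry pairs in two techniques.
technique_1 = [20.0, 40.0, 60.0, 80.0, 95.0]
technique_2 = [s ** 3 / 100 ** 2 for s in technique_1]  # monotonic, non-linear

print(kendall_tau(technique_1, technique_2))  # perfect rank agreement
print(pearson(technique_1, technique_2))      # lower, penalized for curvature
```

Because the order of all value pairs is preserved, Kendall's tau reaches its maximum, whereas the Pearson correlation stays below 1.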
11.12.5 Select Calculate > Experiment correlations.
The Correlation between experiments dialog box that pops
up shows the settings for the calculation of correlation
between the experiment types (Figure 11-19.).
Figure 11-19. Correlation between experiments settings
dialog box.
11.12.6 In the Correlation type box, select Kendall's tau.
The Minimum similarity used and Maximum similarity
used settings allow a range of similarity values to be
specified within which the analysis is done. Normally,
one can enter 0% and 100%, respectively, for these values.
Include self-matches lets the user choose whether to
include entries compared with themselves. Self-matches
are always 100% and may thus influence the correlation
obtained between two experiment types.
11.12.7 Enter 0% and 100%, respectively, for minimum
and maximum similarity used, and uncheck (disable)
Include self-matches.
The Regression settings determine the kind of best-fitting
curve that is calculated through a Similarity plot of two
experiment types (see 11.12.9). You can enter the Degree
of the regression (first degree is linear, second degree is
a quadratic function, etc.). If there is any concordance
between techniques, one should expect that the function
increases monotonically; with Monotonous fit, only
such functions are allowed. With Force through 100%,
the program will force the regression curve to pass
through the 100% point for both techniques. In other
words, if entries are seen as identical in one technique,
you would expect them to be seen as identical in another
technique as well.
Figure 11-20. Similarity plot between two experiments.
Under Correlation plot, you can choose Scatter plot to
plot each pair of similarity values as one dot in a
Similarity plot between two experiment types (see
11.12.9). Especially for very large data sets resulting in
dense scatter plots, it can be useful to average the
number of dots in a given area and represent that
average rather than the individual pairs. This can be
achieved with Histogram (gray scales) and Histogram
(color). When color is chosen, a multicolor scale is used
that ranges continuously from white through blue, green,
yellow, orange, and red to black.
11.12.8 Select a 3rd degree, Monotonous fit, and Force
through 100%. Choose Scatter plot as correlation plot
type and press <OK>.
11.12.9 Click on a value in the similarity matrix and
select Calculate > Similarity plot.
The similarity plot between the two selected
experiments appears (Figure 11-20.), with a third degree
regression drawn through it. Excluded values (due to
Minimum similarity used and Maximum similarity
used) are shown in gray.
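As an illustration of the Minimum/Maximum similarity used range and of Force through 100%, the sketch below fits a first-degree regression constrained to pass through the point (100%, 100%). It is a simplified stand-in, not the program's algorithm: BioNumerics supports higher degrees and a monotonic constraint, and the assumption that the range is applied to both axes, as well as the function name and the data, are hypothetical.

```python
def fit_through_100(pairs, min_sim=0.0, max_sim=100.0):
    """Least-squares line y = 100 + slope * (x - 100) through (100, 100).

    pairs holds (similarity in technique 1, similarity in technique 2).
    Pairs outside [min_sim, max_sim] on either axis are excluded
    (an assumption about how the range is applied).
    """
    used = [(x, y) for x, y in pairs
            if min_sim <= x <= max_sim and min_sim <= y <= max_sim]
    sxx = sum((x - 100.0) ** 2 for x, _ in used)
    if sxx == 0:
        raise ValueError("no usable pairs away from 100%")
    sxy = sum((x - 100.0) * (y - 100.0) for x, y in used)
    return sxy / sxx, used

# Hypothetical similarity pairs; the last one falls below the minimum.
pairs = [(50.0, 75.0), (80.0, 90.0), (90.0, 95.0), (10.0, 99.0)]
slope, used = fit_through_100(pairs, min_sim=20.0)
print(slope, len(used))
```

A non-negative slope corresponds to a monotonically increasing first-degree fit; the excluded pair is the kind of value that the similarity plot shows in gray.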
11.12.10 You can click on any dot in the similarity plot
to pop up a detailed pairwise comparison between the
two entries (see 9.6).
12. Cluster analysis of fingerprints
12.1 Defining ‘active zones’ on
fingerprints
When clustering fingerprints, one is not necessarily
interested in comparing the complete patterns. For
example, when the loading well or the loading dye is
comprised within the fingerprints, it may be better to
exclude such a region from the cluster analysis.
It is possible to define excluded regions which are applied
for all comparisons.
12.1.1 Select any entry in the database that contains a
fingerprint of RFLP1.
12.1.2 Open the Fingerprint Type window for RFLP1 in the
experiment types panel.
At the bottom of the window, the fingerprint of the
selected database entry is shown (Figure 12-1.).
12.1.3 To exclude a region for comparison, hold the left
mouse button and the SHIFT key at the same time while
dragging the mouse pointer over the fingerprint.
The excluded region becomes cross-hatched in red. The
header of the window shows the parts of the
fingerprints that are included for comparison, as
percentages (see Figure 12-1.).
12.1.4 To include a region, hold the left mouse button
(without holding the SHIFT key) while dragging the
mouse pointer over the fingerprint.
12.1.5 You can, for example, exclude the top 15% and the
bottom 15% of the fingerprints.
NOTE: You can exclude / include multiple regions. The
defined regions apply both to comparisons based on
densitometric curves and to comparisons based on band
matching. Bands falling within an excluded region will
not be considered for cluster analysis and band
matching analysis.
12.1.6 You can specify the exact start and end of the
active zone(s) using a script available on Applied Maths'
website. The script can be launched from the
BioNumerics Main window, using the menu Scripts >
Browse Internet, or the corresponding toolbar button,
and then selecting Fingerprint related tools > Set active
zones.
12.1.7 Back in the Comparison window, select RFLP1 and
Clustering > Calculate > Cluster analysis (similarity
matrix), to recalculate the dendrogram using the
excluded regions.
Figure 12-1. Fingerprint type window with excluded regions defined (see arrows).
12.2 Calculation of optimal position
tolerance and optimization settings
BioNumerics offers an option to calculate the optimal
settings for position tolerance and optimization
automatically for a given Fingerprint Type.
The principle is as follows: the user selects a number of
entries, which he or she wants to cluster into a
comparison, and the program will calculate similarity
matrices with a range of position tolerance values. We
have found that the optimal position tolerance value
yields the matrix with the highest group contrast: scores
as high as possible within groups and as low as possible
between groups. This translates into the highest standard
deviation on the matrix of similarity values. The same
process can be used to find the best optimization range.
Given the principle of the method, it is important to
select entries belonging to different groups or showing
enough heterogeneity.
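The selection principle for the case without defined groups can be sketched as follows. This is only an illustration under stated assumptions: the candidate tolerance values and the similarity values are invented, the function name is hypothetical, and the population standard deviation is used as the measure of contrast.

```python
from statistics import pstdev

def best_setting(matrices):
    """Return the candidate setting whose similarity values spread most.

    matrices maps each candidate setting (e.g. a position tolerance in %)
    to the list of pairwise similarity values obtained with it.
    """
    return max(matrices, key=lambda setting: pstdev(matrices[setting]))

# Hypothetical similarity values (%) for three entry pairs,
# calculated with three candidate position tolerance values.
matrices = {
    0.5: [90.0, 40.0, 35.0],
    1.0: [95.0, 30.0, 25.0],   # high within the group, low between groups
    2.0: [97.0, 80.0, 78.0],   # too tolerant: everything looks similar
}
print(best_setting(matrices))
```

With too large a tolerance all scores rise together and the contrast collapses, which is why the middle candidate wins in this example.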
The best way to proceed is to create a comparison with
groups (see 11.6) already defined, e.g. based upon cluster
analysis or partitioning (see 11.10). The program will
then optimize the intergroup separation based upon
these groups. If no groups are defined, the standard
deviation of the whole matrix is optimized, which also
works in case the comparison contains some groups of
more related patterns.
In case you choose a correlation coefficient based on
densitometric curves, only the optimization value is
needed, and the program will calculate this value.
However, in case you apply a band matching coefficient,
for example Dice or Jaccard, both the tolerance and
optimization values are important. Therefore, the
program can also calculate the optimal setting for both
values in combination with each other. If n matrices are
to be calculated for the tolerance value, and n matrices
for the optimization, the combined process requires n x
n matrices to be calculated. In addition, each value from
each matrix is to be calculated a number of times within
the tolerance/optimization boundaries, in order to find
the highest value. Needless to say, this process is
extremely time-consuming; it should only be executed
on very small numbers of entries.
Given the time needed to calculate n x n matrices with
increasing tolerance applied, we recommend first
calculating the optimization value using the Pearson
coefficient, and then, using this value, calculating the
optimal position tolerance setting. This is done as
follows:
12.2.1 In the Main window with DemoBase loaded, open
comparison All, or a comparison containing all entries
except the STANDARDs.
12.2.2 In case no groups are defined, select the Genus
database field and Groups > Create from database field.
12.2.3 Select RFLP1 in the experiment type selection bar,
and Clustering > Tolerance & optimization analysis.
The comparison settings dialog box appears (see Figure
11-2.) where you can select the coefficient and clustering
method. Only the coefficient is important to calculate
the optimization.
12.2.4 Select Pearson correlation under Similarity
coefficient and press <OK>.
The program now calculates the best optimization
value. When finished, the Position tolerance analysis
window appears (Figure 12-2.) showing the group
separation as a function of the allowed optimization in
the right diagram.
Figure 12-2. Position tolerance analysis. Optimization
analysis shown for curve-based coefficient.
The ideal optimization value is shown (bottom) and is
automatically saved in the settings for the experiment
type.
12.2.5 Close the window with File > Exit and select
Clustering > Tolerance & optimization analysis again.
12.2.6 This time, select the Dice coefficient and press
<OK>.
12.2.7 The program asks "Do you wish to estimate the
optimization parameter?". Answer <No>.
The program now calculates the best position tolerance
value for band matching. When finished, the Position
tolerance analysis window (Figure 12-3.) shows the group
separation as a function of the allowed band matching
tolerance in the left diagram.
The position tolerance value is shown (bottom) and is
automatically saved in the settings for the experiment
type.
12.2.8 Close the window with File > Exit.
Figure 12-3. Position tolerance analysis. Position
tolerance analysis shown for band matching
coefficient.
13. Cluster Analysis of characters
13.1 Coefficients for character data and
conversion to binary
In terms of parameter settings, character sets are the
simplest class of data to analyze. The various types of
character sets that exist, however, require that a large
number of coefficients should be available for analyzing
character tables.
13.1.1 When selecting a character set to analyze in the
Comparison window, for example Phenodata, and
selecting Clustering > Calculate > Cluster analysis
(similarity matrix), the Comparison settings dialog box
appears (Figure 13-1.).
Figure 13-1. Comparison settings dialog box for
character data.
Binary coefficients include Jaccard, Dice, and Simple
matching. Dice and Jaccard are closely related to each
other whereas Simple matching is more fundamentally
different. The Jaccard and Dice coefficients only
consider characters that are positive in both data sets as
"scoring characters", whereas the Simple matching
coefficient also considers characters that are negative in
both data sets as scoring. When dealing with a non-binary
(numerical) data set, a conversion needs to be done from
numerical values to binary values (positive or negative).
13.1.2 The <Conversion to binary> button lets you
specify how this conversion is done (Figure 13-2.).
By default, every character that has a value above zero
will be converted to positive. Alternatively, one can
specify a certain percentage of either the maximum
value or the mean value from the experiment.
Figure 13-2. Conversion to binary dialog box.
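The conversion to binary and the three binary coefficients can be sketched as below. The formulas are the standard textbook definitions of Jaccard, Dice and Simple matching; the function names, the character values, and the choice of a strict "greater than" comparison against the threshold are assumptions made for illustration, not the program's code.

```python
def to_binary(values, threshold=0.0):
    """Convert numerical characters to positive (True) / negative (False).

    The default threshold of zero mirrors "every value above zero is
    positive"; a percentage of the maximum or mean value can be passed
    instead, e.g. threshold=0.5 * max(values).
    """
    return [v > threshold for v in values]

def binary_coefficients(a, b):
    """Return (Jaccard, Dice, Simple matching) for two binary profiles."""
    both = sum(1 for x, y in zip(a, b) if x and y)             # positive in both
    neither = sum(1 for x, y in zip(a, b) if not x and not y)  # negative in both
    only_one = len(a) - both - neither                         # mismatches
    if both + only_one == 0:
        jaccard = dice = 0.0    # no positive character at all
    else:
        jaccard = both / (both + only_one)
        dice = 2 * both / (2 * both + only_one)
    simple_matching = (both + neither) / len(a)
    return jaccard, dice, simple_matching

# Hypothetical numerical character values for two entries.
entry_a = to_binary([0, 3, 5, 0, 2, 0])
entry_b = to_binary([0, 4, 0, 0, 1, 7])
jaccard, dice, simple_matching = binary_coefficients(entry_a, entry_b)
print(jaccard, dice, simple_matching)
```

The two characters that are negative in both entries raise the Simple matching score but leave Jaccard and Dice unchanged.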
Numerical coefficients include Pearson correlation
(product-moment correlation) and the related Cosine
coefficient, the Canberra metric and Gower coefficients,
and Euclidean distance. The Rank correlation (or
Spearman rank-order correlation) is a special purpose
coefficient, which, for each entry, ranks the correlation
of the other entries, and uses this rank value as the
correlation value. By thus not taking into account the
relative distances, the coefficient is very robust, but not
sensitive to details. For comparisons between highly
related organisms, it can be useful to check the option
Use square root, especially when using Pearson
correlation or Euclidean distance. This has the effect that
narrow branches on a dendrogram are stretched out
relatively more than distant links.
The Categorical coefficient is neither binary nor
numerical, since it treats each different value as a
different state. This coefficient is useful for analyzing
multistate character sets; for example, colors (red, green,
blue, etc.) are categorical states. Typical multistate
characters used in typing, taxonomy and phylogeny come
from phage typing, Multilocus Sequence Typing (MLST),
and Variable Number Tandem Repeats (VNTR) typing. The
types or categories assigned to the different phage
reactions, allele numbers, or repeat numbers,
respectively, in the aforementioned techniques, are good
examples of categorical or multistate data which can be
analyzed using the categorical coefficient.
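A categorical comparison can be sketched as the fraction of characters that have the same state in both entries. This is one common definition, assumed here for illustration; the exact formula used by the program is not given in this section, and the allele profiles are invented.

```python
def categorical_similarity(a, b):
    """Fraction of characters with an identical state in both profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of characters")
    same = sum(1 for x, y in zip(a, b) if x == y)
    return same / len(a)

# Hypothetical MLST allele numbers at seven loci for two entries.
profile_1 = [1, 3, 1, 1, 1, 1, 3]
profile_2 = [1, 3, 1, 1, 1, 2, 3]
print(categorical_similarity(profile_1, profile_2))
```

Note that allele 1 and allele 2 simply count as different states; the numerical difference between them carries no meaning.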
13.2 Advanced analysis of massive
character sets using GeneMaths
The analysis of huge data sets such as produced by gene
chips or high-density gene arrays (micro-arrays)
requires special clustering algorithms capable of
processing many thousands of entries or characters. In
addition, the successful exploration of such data sets
also depends on the ability to associate certain clusters
of characters (observations) with groups of entries
(samples). Although these features are available in the
Comparison functions of BioNumerics, the flexibility
needed to handle and cluster extremely large matrices,
as well as some sophisticated functions, is provided in
separate programs, GeneMaths or GeneMaths XT. The
GeneMaths and GeneMaths XT programs are capable of
clustering data sets of up to a million characters per
entry.
GeneMaths and GeneMaths XT are available as
standalone programs, but can also be added as a module
to the BioNumerics software. In the latter case,
BioNumerics provides the database tools, and a data
matrix for comparison is first created in BioNumerics.
The menu command File > Analyze with GeneMaths in
the BioNumerics Comparison window automatically
launches GeneMaths or GeneMaths XT (whichever was
installed last) with the created selection of entries and
experiments.
13.2.1 To run GeneMaths or GeneMaths XT as a module
of BioNumerics, create a comparison in BioNumerics
with an appropriate large character set. If you do not
have gene array data available, create a comparison in
BioNumerics' DemoBase containing all entries that have
the experiment FAME available, and press the
button in the Comparison window's status
bar. This character set contains some 60 characters.
13.2.2 In the BioNumerics Comparison window, select
File > Analyze with GeneMaths.
This will launch the GeneMaths or GeneMaths XT
program (whichever was installed last) with its Main
window. Full descriptions of the GeneMaths software
and the GeneMaths XT software are available in
separate manuals.
When a Connected Database is defined, characters can
be described by more than one information field (see
page 58). GeneMaths/GeneMaths XT is launched with
all the character field information, and is able to display
multiple character fields together, when the characters
are chosen to be the rows.
Character as well as entry information fields can be
edited directly from GeneMaths or GeneMaths XT and
changes are saved in the BioNumerics database.
Selecting entries in BioNumerics and GeneMaths or
GeneMaths XT is also synchronized.
14. Multiple alignment and cluster analysis of
sequences
Among all types of experimental data, cluster analysis of
sequence data is by far the most complex in steps and
possibilities. The fact that sequences need to be aligned
before one can estimate similarity requires a number of
additional steps before a dendrogram is achieved.
Furthermore, sequence data are a suitable substrate for a
number of phylogenetic clustering algorithms which can
seldom be applied to other types of data.
There are two ways to obtain a dendrogram from
sequence data: by aligning the sequences pairwise (steps
1-2 in Figure 14-1.), or by first obtaining a multiple
alignment of all sequences (steps 1-6 in Figure 14-1.).
The best multiple alignments that can be achieved,
particularly for large numbers of sequences, involve the
following steps.
1. Pairwise alignment and calculation of similarity of all
possible pairs of sequences, resulting in the Pairwise
alignment similarity matrix.
2. Construction of a UPGMA dendrogram based on the
similarity matrix.
3. Determination of consensus sequences at each linkage
node of the dendrogram, down to the root.
4. Alignment of all sequences based on the local and the
root consensus sequences.
5. Calculation of a similarity matrix based on the aligned
sequences, the Multiple alignment similarity matrix.
6. Construction of a Neighbor Joining dendrogram based
on the multiple alignment similarity matrix.
In step 1, each individual sequence is aligned with each
other sequence, and for each pair of aligned sequences,
the similarity value is calculated into a similarity matrix.
The obtained matrix of similarity values based on
pairwise alignment (pairwise alignment similarity matrix)
will serve as the basis for cluster analysis by the
Unweighted Pair Group Method using Arithmetic
averages (UPGMA) (step 2). Neighbor Joining or other
algorithms resulting in unrooted dendrograms would not
be suitable here, as in such dendrograms, the closest
linked sequences are not necessarily the most related ones.
Step 3, discussed below, requires that the closest linked
sequences are also the most related ones.
[Figure 14-1 panels:
Unaligned sequences: Seq1 ACTAGTGACTTA, Seq2 ACAAGGACTTT,
Seq3 GACTAGGACTTA.
Homology matrix (pairwise alignment): Seq1/Seq2 79,
Seq1/Seq3 91, Seq2/Seq3 79.
Dendrogram: Seq1 and Seq3 linked first, then Seq2.
Local consensus sequences: Cons(1,3), Cons((1,3),2).
Global alignment:
Seq1 -ACTAGTGACTTA
Seq2 -ACAAG-GACTTT
Seq3 GACTAG-GACTTA
Homology matrix (multiple alignment): Seq1/Seq2 81,
Seq1/Seq3 91, Seq2/Seq3 81.
Global alignment with associated dendrogram and homology
matrix.]
Figure 14-1. Steps in a cluster analysis of sequences: dendrogram based on pairwise alignment (steps 1 to 2),
and dendrogram based on multiple alignment (steps 1 to 6).
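Step 2 can be illustrated with a minimal UPGMA sketch applied to the pairwise similarity values shown in Figure 14-1. The code is a textbook formulation written for this example, not the program's implementation; the function name and the data structures are hypothetical.

```python
def upgma(names, similarity):
    """Cluster by UPGMA and return the joins as (subtree, similarity level).

    similarity maps (name_a, name_b) to a similarity value; each
    unordered pair of names must be listed exactly once.
    """
    index = {name: i for i, name in enumerate(names)}
    trees = {i: name for i, name in enumerate(names)}   # cluster id -> subtree
    sizes = {i: 1 for i in trees}                       # cluster id -> leaf count
    sim = {frozenset((index[a], index[b])): s
           for (a, b), s in similarity.items()}
    joins = []
    next_id = len(names)
    while len(trees) > 1:
        best = max(sim, key=sim.get)        # most similar pair of clusters
        i, j = sorted(best)
        level = sim.pop(best)
        averaged = {}
        for k in trees:
            if k in (i, j):
                continue
            s_ik = sim.pop(frozenset((i, k)))
            s_jk = sim.pop(frozenset((j, k)))
            # Size-weighted mean = arithmetic average over all leaf pairs.
            averaged[k] = (sizes[i] * s_ik + sizes[j] * s_jk) / (sizes[i] + sizes[j])
        trees[next_id] = (trees.pop(i), trees.pop(j))
        sizes[next_id] = sizes.pop(i) + sizes.pop(j)
        for k, s in averaged.items():
            sim[frozenset((next_id, k))] = s
        joins.append((trees[next_id], level))
        next_id += 1
    return joins

# Pairwise similarity values (%) as shown in Figure 14-1.
joins = upgma(
    ["Seq1", "Seq2", "Seq3"],
    {("Seq1", "Seq2"): 79, ("Seq1", "Seq3"): 91, ("Seq2", "Seq3"): 79},
)
for subtree, level in joins:
    print(subtree, level)
```

The first join links Seq1 and Seq3, the most related pair, which is the property that the consensus calculation in step 3 relies on.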
Steps 3 and 4 are very important for obtaining a sensible
global alignment. Each linkage node on the UPGMA
dendrogram represents a local alignment of the
sequences linked at the node, resulting in a local
consensus. These local consensus sequences are
calculated downwards, i.e. starting from the highest
related sequences down to the dendrogram root (step 3).
In the above example, the highest linkage observed is
between sequences 1 and 3, leading to consensus (1,3).
The next linkage level is the branch that links sequences
1 and 3 with sequence 2. At this node, the consensus
(1,3) is aligned with sequence 2. This results in a
consensus ((1,3),2), which will in turn be aligned with
the consensus of another group linked to this one. For
each sequence or local consensus, the program keeps
track of the positions of the gaps that are introduced to
align it with the branch it is linked to. Finally, a global
consensus for the whole dendrogram is inferred.
The program now introduces into each individual
sequence all the gaps that were introduced in the
successive consensus sequences along the path
from the sequence itself down to the global consensus
(step 4). This results in a global or multiple alignment.
The multiple alignment in turn can be used as the basis
for the calculation of a similarity matrix. Now, instead of
aligning each sequence with each other sequence to
determine their mutual similarity, the multiple
alignment is used to calculate the multiple
alignment-based similarity between each pair of
sequences (step 5).
Once the multiple alignment is present, this step is
extremely fast. The multiple alignment-based similarity
matrix can be used for Neighbor Joining or UPGMA
clustering, or other clustering algorithms (step 6).
14.1 Calculating a cluster analysis based
on pairwise alignment (steps 1 and 2)
14.1.1 Open comparison All, or any selection containing
the three genera in the database DemoBase.
Select 16S rDNA and Layout > Show image.
Initially, the sequences are not aligned and no similarity
matrix exists.
NOTE: it is possible that a dendrogram (and a matrix)
are still displayed in the Comparison window. This is
the dendrogram of the last clustered experiment, which
you can remove with Layout > Show dendrogram
and Layout > Show matrix.
14.1.2 The similarity matrix is calculated with Clustering
> Calculate > Cluster analysis (similarity matrix) or the
corresponding toolbar button.
The Pairwise comparison settings dialog box appears
(Figure 14-2.), showing three groups of settings: the
Pairwise alignment settings, the settings for Similarity
calculation, and the Clustering method.
Figure 14-2. The Pairwise comparison settings
dialog box.
The pairwise alignment settings involve an Open gap
penalty and a Unit gap penalty. A mismatch between
bases on two sequences, e.g. A with G, is counted as a
cost of 100%. The open gap penalty is the cost, as a
percentage of that mismatch cost, of introducing one
single gap in one of the two sequences. The unit gap
penalty is the cost, again as a percentage of the
mismatch cost, of increasing the gap by one base
position. The default setting is 100% open gap penalty
and 0% unit gap penalty, which means that introducing a
gap in one of the two sequences has the same cost as a
mismatch, whereas there is no extra cost for extending a
gap over multiple positions. It should be emphasized
that the pairwise alignment settings will only determine
the way the alignment is done: if a large unit gap cost is
set (e.g. 350%), the program would not easily introduce
gaps between sequences; for example, the program would
rather allow three successive mismatches than one
single gap. If no gap cost is chosen (0%) the program
would introduce gaps to match every single base. The
pairwise alignment settings have no direct influence on
the similarity values, but of course, if the obtained
alignments differ, the similarity values may differ too.
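The role of the two penalties can be made concrete by scoring a given alignment. The sketch below is an illustration only: it assumes that a gap of L positions costs the open gap penalty once plus the unit gap penalty for each of the L - 1 additional positions, which is one reading of the description above, and the function name is hypothetical. It evaluates an existing alignment; it does not search for the best one.

```python
def alignment_cost(seq_a, seq_b, open_gap=100.0, unit_gap=0.0, mismatch=100.0):
    """Total cost (%) of two already aligned sequences ('-' marks a gap)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    cost = 0.0
    gap_side = None                     # sequence that holds the running gap
    for x, y in zip(seq_a, seq_b):
        if x == "-" and y == "-":
            continue                    # column empty in both: ignored
        if x == "-" or y == "-":
            side = "a" if x == "-" else "b"
            # First position of a gap pays the open penalty,
            # every further position pays the unit penalty.
            cost += unit_gap if gap_side == side else open_gap
            gap_side = side
        else:
            gap_side = None
            if x != y:
                cost += mismatch
    return cost

# Seq1 and Seq2 as aligned in Figure 14-1 (leading shared gap column dropped).
print(alignment_cost("ACTAGTGACTTA", "ACAAG-GACTTT"))
print(alignment_cost("AAAA", "A--A"))                  # default: gap length is free
print(alignment_cost("AAAA", "A--A", unit_gap=50.0))   # longer gaps now cost more
```

With the default settings a two-position gap costs exactly as much as a single mismatch; raising the unit gap penalty makes long gaps progressively less attractive to an aligner that minimizes this cost.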
Use fast algorithm offers an accelerated algorithm with
two adjustable parameters: the Minimum match
sequence and the Maximum number of gaps. The
program creates a lookup table of groups of bases for
both sequences. The minimum match sequence is the size
of such a group. The smaller the groups are, the more
precise the alignment will be, but the longer the
alignment will take. The parameter can be varied
between 1 and 5, with 2 as default. The maximum number
of gaps is the maximum number of possible gaps that
you allow the algorithm to introduce in one of the two
sequences. The values can be varied between 0 and 99
with 9 as default. The larger the number, the more gaps
the program can create to align every two sequences, but
the longer the alignment will take. If zero is selected, no
gaps at all would be introduced. Thus, you can
custom-define the accuracy of the algorithm, from very
fast and fairly rough to slow and very accurate.
Contrary to the pairwise alignment settings, the
Similarity calculation parameters will not influence the
alignments, but they determine the way the similarity is
calculated. The Gap penalty is a parameter which allows
you to specify the cost the program uses when one
single gap is introduced. This cost is relative to the score
the program uses for a base mismatch, which is equal to
100%. The program uses 0% as default. When Discard
unknown bases is disabled, the program will use a
predefined cost table for scoring uncertain or unknown
bases. For example, N with A will have a 75% penalty, as
there is only a 25% chance that N is A. Y with C will be
counted as a 50% penalty because Y can be C or T with
50% chance each. If this setting is enabled, all uncertain
and unknown bases will not be considered in calculating
the final similarity. Use conversion cost is a parameter
which makes calculation of the pairwise similarity
matrix faster. Both described alignment methods work
in two steps: first they determine the total maximal
conversion score to convert one sequence into the other
(given the current alignment settings) and then they
realize the alignment using the minimal gap cost and
maximal matching score. If Use conversion cost is
enabled, the calculated conversion cost is transformed
into a similarity value. This method is two times faster
than the usual similarity calculation, but the obtained
values cannot be described as real “similarity”.
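The reasoning behind the cost table for uncertain bases can be sketched as follows. The table itself is not reproduced in this section, so the code derives a penalty from the standard IUPAC nucleotide ambiguity codes under the assumption that every base covered by a code is equally likely; the function name is hypothetical.

```python
# Standard IUPAC nucleotide codes and the bases they can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def mismatch_penalty(x, y):
    """Penalty (%) = 100 * (1 - chance that x and y denote the same base)."""
    bases_x, bases_y = set(IUPAC[x]), set(IUPAC[y])
    # Each base within a code is assumed equally likely (an assumption).
    p_same = len(bases_x & bases_y) / (len(bases_x) * len(bases_y))
    return 100.0 * (1.0 - p_same)

print(mismatch_penalty("N", "A"))   # 25% chance that N is A
print(mismatch_penalty("Y", "C"))   # Y is C or T with 50% chance each
print(mismatch_penalty("A", "G"))   # ordinary mismatch
```

Under this assumption the two examples from the text, N with A and Y with C, come out at 75% and 50% respectively.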
Under Correction, one can select the one-parameter
correction for the evolutionary distance as calculated
from the number of nucleotide substitutions as
described by Jukes and Cantor (1969; see footnote 1).
The resulting dendrogram displays a distance scale
which is proportional to an evolutionary time, rather
than a similarity scale.
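The Jukes and Cantor correction converts the observed fraction p of differing positions into an estimated number of substitutions per site, d = -(3/4) ln(1 - (4/3) p). The sketch below applies this published formula to a similarity percentage; the function name and the way the similarity is converted to p are assumptions made for the example.

```python
import math

def jukes_cantor_distance(similarity_percent):
    """Jukes-Cantor distance (substitutions per site) from a similarity in %."""
    p = 1.0 - similarity_percent / 100.0      # observed fraction of differences
    if p >= 0.75:
        # The formula is undefined at or beyond the saturation point p = 3/4.
        raise ValueError("similarity too low for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

for similarity in (100.0, 90.0, 70.0):
    print(similarity, jukes_cantor_distance(similarity))
```

The corrected distance is always at least as large as the observed fraction of differences, because it accounts for multiple substitutions at the same position.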
As Clustering method, you can choose between
UPGMA, Ward, Neighbor Joining, single linkage and
complete linkage.
14.1.3 Select an Open gap penalty of 100, a Unit gap
penalty of 0, Minimum match sequence of 2, Maximum
number of gaps of 9, enable Discard unknown bases,
with a Gap penalty of 0 for similarity calculation, None
for correction, and select UPGMA as Clustering method.
14.1.4 Press <OK> to calculate the matrix and the
dendrogram.
When the calculations are finished, the dendrogram and
the matrix are shown. The sequences are still unaligned
since no multiple alignment is calculated yet.
1. Jukes, T.H. and C.R. Cantor. 1969. In "Mammalian
Protein Metabolism III" (H.N. Munro, ed.), p. 21. Academic
Press, New York.
14.2 Calculating a multiple alignment
(steps 3 and 4)
14.2.1 Select Sequence > Multiple alignment or the
corresponding toolbar button.
The Global alignment settings dialog box (Figure 14-3.)
appears.
When a multiple alignment is calculated, individual
sequences and local consensus sequences are aligned
pairwise, down to the root, to obtain a global consensus
(see steps 3 and 4 on Figure 14-1.). It is this pairwise
alignment of local consensus sequences that uses the
same two parameters as explained before: the Open gap
penalty and the Unit gap penalty.
Figure 14-3. Global alignment settings dialog box.
The significance of the open and unit gap penalties is the
same as explained for pairwise alignment: they are the
percentage of the mismatch cost to create a gap, and to
increase the gap by one base position, respectively. The
default setting is 100% open gap penalty and 0% unit
gap penalty, which means that introducing a gap in one
of the two sequences has the same cost as a mismatch,
whereas there is no extra cost for extending a gap over
multiple positions. These pairwise alignment settings will only
determine the way the alignment of the local consensus
sequences is done: if a large unit gap cost is set (e.g.
more than 100%), the program would not easily
introduce gaps between sequences. If no gap cost is
chosen (0%) the program would introduce gaps in order
to match single bases. The pairwise alignment settings
have no direct influence on the similarity values
obtained from a global alignment, but if the eventual
multiple alignment differs, the derived similarity values
may differ too.
Use fast algorithm is an algorithm with two adjustable
parameters: the Minimum match sequence and the
Maximum number of gaps (see also under pairwise
alignment). The minimum match sequence can be varied
between 1 and 5, with 2 as default. The maximum number
of gaps can be varied between 0 and 198 with 98 as
default. The smaller the first number and the larger the
second number, the more accurate the multiple alignment
should be. If the default values are not satisfactory, some
experimenting is recommended.
the sequences have that base at the given position. A
consensus sequence of the root is now shown on the
header of the image panel.
Note that the Global alignment settings dialog box does not
contain settings for similarity calculation, unlike the
Pairwise alignment settings dialog box. The similarity
matrix based upon the global alignment is not calculated
automatically by the program, but requires a further
command by the user (step 5 in Figure 14-1.).
NOTE: A consensus sequence cannot be obtained from
an Advanced Tree (see chapter 17.).
14.3.2 Select Sequence > Consensus blocks to show the
consensus match representation (Figure 14-5.).
14.2.2 Press <OK> to start the multiple alignment. When
the calculations are done, the sequences are aligned in
the image panel.
14.3 Multiple alignment display options
With Sequence > Display settings, the general display
options such as colors and symbols, can be changed.
These settings are specific to the Sequence Type and can
therefore also be accessed from the Sequence type window
(see 14.13).
In order to facilitate visual interpretation of multiple
alignments there are three methods to highlight
homologous regions.
Figure 14-5. Consensus match representation.
The Consensus difference also displays the consensus
sequence in the editor caption, and only shows bases
that differ from the consensus while bases that are the
same as the consensus are shown as |. Example:
14.3.3 Select Sequence > Consensus difference. The
consensus difference representation is as in Figure 14-6..
Select Sequence > Neighbor blocks to show the Neighbor
match representation.
This representation shows bases as blocks (highlighted)
if at least one of the neighboring sequences has the same
base at the corresponding position. Between two
different groups of consensus, a small black line is
drawn (Figure 14-4.).
Figure 14-6. Consensus difference representation.
14.3.4 A consensus sequence can be copied to the
clipboard with Sequence > Copy consensus to clipboard.
Bases for which there is a consensus in more than 50% of
the sequences are named, the other bases are unnamed
(N).
Figure 14-4. Neighbor match representation.
The Consensus match first requires a consensus sequence
to be present. A consensus sequence is defined from one
or more sequences, and in case a user-defined
percentage of the sequences have the same base at a
given position, this base will be written in the
consensus. Usually, one will select the root to calculate a
consensus from. This method highlights bases (shown as
blocks) on the aligned sequences if they are the same as
on the consensus sequence.
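As a minimal sketch of how a consensus with a minimum percentage can be derived: the function below is hypothetical, treats every symbol (including gaps) alike, and writes N where no base exceeds the threshold. How BioNumerics handles gaps and ties internally is not stated in this guide, so those details are assumptions.

```python
from collections import Counter


def consensus(aligned, min_percent=50.0, unnamed="N"):
    """Return a consensus string for equal-length aligned sequences.

    A base is written only if MORE than `min_percent` of the sequences
    carry it at that position; otherwise the position is left unnamed.
    """
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must have equal length")
    out = []
    for col in range(length):
        # Most frequent symbol in this column and how often it occurs.
        base, count = Counter(s[col] for s in aligned).most_common(1)[0]
        share = 100.0 * count / len(aligned)
        out.append(base if share > min_percent else unnamed)
    return "".join(out)


print(consensus(["ACGT", "ACGA", "ACTA", "AGTC"], min_percent=50))
```

With a threshold of 50, the last two positions stay unnamed because no base occurs in more than half of the four sequences.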
14.3.1 Select the root and Sequence > Create consensus of
branch. A dialog box prompts Enter minimum
consensus percentage. You can for example enter 50,
which means that a base at a given position will only be
shown in the consensus sequence if more than 50% of
14.4 Editing a multiple alignment
A multiple alignment can be edited manually and is
saved along with the comparison.
14.4.1 Select File > Save or press the corresponding
toolbar button to save the multiple
alignment.
In order to rearrange the multiple alignment as desired,
any sequence can be moved up or down:
14.4.2 Left-click on the entry you want to move up or
down.
14.4.3 Press the up button to move the entry up, or
the down button to move it down.
14.4.4 To move a sequence to the top or the bottom of the
alignment, hold the CTRL key and press the up or down
button, respectively.
Note that, as soon as an entry is moved up or down, the
dendrogram disappears: a dendrogram imposes a
certain order on the entries, which is not compatible with
freely moving sequences up or down. You can display
the dendrogram again using Layout > Show
dendrogram. However, this will reorder the entries again
so that any manual changes you made to the sequence
order are lost.
14.4.5 A number of manual alignment editing tools are
described below. For these editing tools, the multiple
alignment editor contains a multilevel undo and redo
function. The undo function can be accessed with
Sequence > Edit alignment > Undo, CTRL+Z, or the undo
button. The redo function is accessible through
Sequence > Edit alignment > Redo, CTRL+Y, or the redo
button.
The undo/redo function works for the following
sequence editing functions: drag-and-drop realignments
(14.5), inserting and deleting gaps (14.6), removing
common gaps (14.7), and changing sequence bases
(14.8). The undo/redo function also works for all
automatic alignment functions, including full multiple
alignment (14.2) and partial alignments obtained with
one of the following commands: Align internal branch,
Align external branch, and Align selected sequences
(14.11).
14.5 Drag-and-drop manual alignment
14.5.1 A cursor, visible as a white rectangle, can be
placed on any base of any sequence, and can be moved
up, down, left, and right using the arrow keys.
14.5.2 The cursor can also be extended to cover a range
of bases both in the vertical and the horizontal direction.
This can be achieved by holding down the SHIFT key
while pressing the arrow keys. The result is that blocks
of bases can be selected as shown in Figure 14-7. By
dragging the mouse towards the left or the right (see
Figure 14-7.), the block of bases can be realigned within
the alignment. While the block is being moved, it remains
displayed so that the user can see the resulting
alignment at each position. The realignment takes
effect as soon as the mouse button is released. If
necessary, the block can be moved over other bases at
the left or right side. This forces a gap to be
introduced in the sequences above and below the
block, in order both to preserve the original alignments
left and right of the block and to align the block the way
the user has forced it to.
Figure 14-7. Selecting blocks of bases for drag-and-drop manual alignment.
A useful tool to select a group of identical bases at once
is to click on one of the bases and choose Sequence > Edit
> Highlight identical positions or CTRL+SHIFT+E on
the keyboard.
14.6 Inserting and deleting gaps
Besides the easy drag-and-drop realignment tool
described above (14.5), a number of buttons (and
corresponding keyboard shortcuts) are available to
manually edit a multiple alignment. With the editing
tools below, all changes made to a sequence, i.e.
inserting gaps or deleting gaps, cause shifts no further
than the next gap. You can consider an aligned sequence
as a series of blocks with some space in between (the
gaps), just like carriages on a railway: if one block is
shifted to the right, it will move alone until it touches the
next block, which will then move together, until they
touch the next block etc. The following manual
alignment editing tools are available:
• Inserts a gap at the position of the cursor, by
shifting the block right from the cursor position to the
right. This function can be used on a gap as well as on a
base. In the latter case, the base at the cursor position
will also shift to the right, i.e. the gap will be inserted left
from it. Keyboard: INSERT.
• Inserts a gap at the position of the cursor, by
shifting the block left from the cursor position to the left.
This function is similar to the previous function.
Keyboard: HOME.
• Inserts gaps at the position of the cursor, by
shifting the block right from the cursor position to the
right, until it closes up with the next block. Keyboard:
SHIFT+INSERT.
• Inserts gaps at the position of the cursor, by
shifting the block left from the cursor position to the left,
until it closes up with the next block. Keyboard:
SHIFT+HOME.
• Deletes a gap by shifting the block right from the
gap to the left. Keyboard: DEL.
• Deletes a gap by shifting the block left from the
gap to the right. Keyboard: END.
• Deletes all gaps right from, and including the
cursor, by shifting the block right from the gap to the
left. Keyboard: SHIFT+DEL.
• Deletes all gaps left from, and including the
cursor, by shifting the block left from the gap to the
right. Keyboard: SHIFT+END.
14.6.1 To insert and delete gaps or move blocks of a
group of sequences as a whole, it is possible to lock a
branch on the dendrogram by selecting the branch on
the dendrogram (click on the dendrogram node) and
Sequence > Lock / unlock dendrogram branch.
Locked branches are displayed in red and can be
unlocked using the same command.
When no dendrogram is present for a set of aligned
sequences, it is also possible to create groups of locked
sequences, as follows.
14.6.2 Make sure no dendrogram is present with the
multiple alignment. If a dendrogram is present, right-click
in the dendrogram panel and select Show
dendrogram to uncheck it.
14.6.3 Select a consecutive group of entries using CTRL
+ left-click or SHIFT + left-click. When selected, the
entries are marked with blue arrows.
14.6.4 In the Sequence menu, select Create locked group.
Locked sequences are connected by a red brace in the
left panel.
14.6.5 To unlock locked groups of sequences, click on
any of the entries within the group, and select Sequence
> Unlock group.
Note that locked groups are not the same as locked
branches on the dendrogram (14.6.1). When the
dendrogram is shown, the locked groups will not be
seen anymore, whereas clusters on the dendrogram that
were locked previously, become visible and active.
When the dendrogram is removed again, the locked
groups become visible and active again.
Locked groups have the advantage over locked
dendrogram branches that the sequences within a
locked group are not restricted to clusters from the
dendrogram. One can rearrange the sequences in the
multiple alignment as desired (14.4.3), and then create
groups of locked sequences.
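The railway-carriage behavior of gap insertion described in 14.6 can be sketched as follows. This is only an illustration under stated assumptions: the aligned sequence is a plain string, "-" stands for a gap, and the function name is hypothetical. It models the INSERT key only; the other tools follow the same idea in the opposite direction or for several gaps at once.

```python
def insert_gap_shift_right(seq, pos, gap="-"):
    """Insert one gap at `pos`, shifting the block to its right.

    The shifted block moves alone until it touches the next block,
    i.e. the next gap to the right is absorbed and everything beyond
    it stays in place. If no gap follows, the sequence grows by one.
    """
    # When the cursor is on a gap, the block to shift starts after the gap run.
    start = pos
    while start < len(seq) and seq[start] == gap:
        start += 1
    nxt = seq.find(gap, start)
    if nxt == -1:
        # No gap left to absorb the shift: the whole tail moves right.
        return seq[:start] + gap + seq[start:]
    # The next gap is consumed, so positions beyond it are unchanged.
    return seq[:start] + gap + seq[start:nxt] + seq[nxt + 1:]


print(insert_gap_shift_right("ACG-TT-A", 1))
```

In this call only the block CG moves; the bases behind the first existing gap keep their positions, which is the point of the railway analogy.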
14.7 Removing common gaps in a
multiple alignment
14.7.1 After a series of manual realignments, it may be
possible that the multiple alignment contains one or
more common gaps, i.e. gaps that occur over all
sequences. Instead of having to remove those gaps for
all sequences, the user can let the software find and
remove all common gaps automatically.
14.7.2 To remove common gaps automatically, select
Sequence > Edit alignment > Remove common gaps or
press CTRL+SHIFT+G on the keyboard.
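What removing common gaps amounts to can be shown with a short sketch. It assumes aligned sequences are equal-length strings with "-" as the gap symbol; the function name is illustrative only.

```python
def remove_common_gaps(aligned, gap="-"):
    """Drop every alignment column in which ALL sequences have a gap."""
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must have equal length")
    # Keep a column as soon as at least one sequence has a base there.
    keep = [i for i in range(length) if any(s[i] != gap for s in aligned)]
    return ["".join(s[i] for i in keep) for s in aligned]


print(remove_common_gaps(["A--C", "A-GC", "T--C"]))
```

The second column is a gap in every sequence and is removed; the third column is kept because one sequence has a base there.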
14.8 Changing sequences in a multiple
alignment
In some cases, it is possible that ambiguous positions in
certain sequences can be filled in when a multiple
alignment of highly homologous sequences is present.
BioNumerics offers the possibility to change bases in
sequences within a multiple alignment.
14.9 Finding a subsequence
In order to find certain subsequences in a sequence from
a multiple alignment, e.g. restriction sites, primer
sequences, repeat patterns etc., you can perform a
subsequence search.
14.9.1 First, select a sequence within the multiple
alignment (white rectangular cursor).
14.9.2 The Subsequence search dialog box (Figure 14-8.) is
opened with Sequence > Find sequence pattern.
You can enter any sequence including unknown
positions, which are entered as a question mark. You
also can allow a number of mismatches to occur in
matching subsequences, by specifying a number under
Mismatches allowed.
For rare subsequences which you do not expect to occur
more than once, select Complete sequence. For
frequently occurring subsequences, you can place the
cursor at the start of the sequence, and check Right from
cursor. By successively pressing Find, all subsequent
matching patterns will be shown. Similarly, Left from
cursor shows the first matching pattern left from the
cursor, whereas Closest to cursor only shows the
matching pattern closest to the cursor, in any direction.
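A subsequence search with unknown positions and allowed mismatches can be sketched as below. The function name and the simple sliding-window approach are assumptions for illustration; the search as implemented in the software may differ, for example in how it treats gaps.

```python
def find_pattern(seq, pattern, mismatches=0, wildcard="?"):
    """Return all start positions where `pattern` matches `seq`.

    A wildcard position matches any base; up to `mismatches`
    non-wildcard positions may differ.
    """
    hits = []
    for start in range(len(seq) - len(pattern) + 1):
        window = seq[start:start + len(pattern)]
        missed = sum(
            1 for p, b in zip(pattern, window) if p != wildcard and p != b
        )
        if missed <= mismatches:
            hits.append(start)
    return hits


sequence = "GAATTCGGATTC"
print(find_pattern(sequence, "GAATTC"))
print(find_pattern(sequence, "GAATTC", mismatches=1))
print(find_pattern(sequence, "G?ATTC"))
```

Options such as Right from cursor or Closest to cursor then correspond to picking one hit from the returned list relative to the cursor position.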
14.8.1 Place the cursor on any base in the multiple
alignment.
14.8.2 Hold the CTRL key and type a base letter, a space
(gap) or any letter corresponding to the IUPAC
nucleotide naming code.
The sequence is now changed in the multiple alignment,
but not yet in the BioNumerics database.
14.8.3 In order to reload the original sequence, select
Sequence > Reload sequence from database.
14.8.4 To save the changed sequence to the database,
select Sequence > Save changed sequences.
As an alternative, a base in a sequence can also be
changed by double-clicking on that base, which will pop
up the experiment card of the sequence, with the clicked
base selected. To change the base, simply type in
another base from the keyboard. Upon exiting the
experiment card, the software will ask to save the
changes. These changes, however, are not updated in
the multiple alignment. You have to close and reopen
the multiple alignment for the changes to become
visible.
Figure 14-8. Subsequence search dialog box.
14.10 Calculating a clustering based on
the multiple alignment (steps 5 and 6)
The mutual similarities between all the sequences are
calculated from the aligned sequences as present in the
multiple alignment.
14.10.1 Select Sequence > Calculate global cluster
analysis or press the corresponding toolbar button; the
Global alignment similarity dialog
box is shown (Figure 14-9.):
14.11 Adding entries to and deleting
entries from an existing global
alignment
The feature of BioNumerics that makes it possible to add
entries to (or delete entries from) an existing cluster
analysis also applies to sequence clusterings: it is not
necessary to recalculate the complete similarity matrix
because the program calculates the similarity of the new
sequence(s) with each of the other sequences and adds
the new similarity values to the existing matrix.
Particularly in case of sequence clusterings, this feature
is extremely time-saving and causes no degeneration of
the clusterings.
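The idea of extending an existing similarity matrix rather than recalculating it can be sketched as follows. Both functions are hypothetical, and the percent-identity measure merely stands in for whatever similarity calculation is configured.

```python
def percent_identity(a, b):
    """Toy similarity: percentage of identical positions (assumed measure)."""
    same = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * same / max(len(a), 1)


def add_entries(matrix, sequences, new_sequences, similarity):
    """Extend a square similarity matrix with new sequences.

    Only the similarities involving a new sequence are calculated;
    all existing values are copied unchanged.
    """
    all_seqs = list(sequences)
    matrix = [row[:] for row in matrix]
    for new in new_sequences:
        row = [similarity(new, old) for old in all_seqs]
        for existing_row, value in zip(matrix, row):
            existing_row.append(value)  # new column
        matrix.append(row + [100.0])    # new row, self-similarity 100%
        all_seqs.append(new)
    return matrix, all_seqs


old = [[100.0, 75.0], [75.0, 100.0]]
new_matrix, order = add_entries(old, ["AAAA", "AAAT"], ["AATT"], percent_identity)
print(new_matrix)
```

For n existing entries and one new entry, only n similarity values are computed instead of the full n(n+1)/2.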
Figure 14-9. Global alignment similarity dialog box.
These settings determine the way the similarity is
calculated between the pairs of sequences. The Gap
penalty is a parameter which allows you to specify the
cost the program uses when one single gap is
introduced. This cost is relative to the score the program
uses for a matching base, which is equal to 100%. The
program uses 0% as default. When Discard unknown
bases is disabled, the program will use a predefined cost
table for scoring uncertain or unknown bases. For
example, N with A will have a 75% penalty, as there is
only a 25% chance that N is A. Y with C will be counted
as a 50% penalty because Y can be C or T with 50% chance
each. If this setting is enabled, uncertain and
unknown bases will not be considered in calculating the
final similarity. To obtain a dendrogram, you can choose
between the available clustering algorithms.
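The two penalty examples above (N with A, Y with C) follow from treating each ambiguity code as a set of equally likely bases. The sketch below generalizes this to any pair of IUPAC codes; that generalization, the function name, and the table layout are assumptions, since the actual predefined cost table is not reproduced in this guide.

```python
# Standard IUPAC nucleotide codes and the bases each one can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def mismatch_penalty(code1, code2):
    """Penalty (0..1) = 1 minus the chance that both codes are the same base.

    Each code is assumed to represent its bases with equal probability.
    """
    set1, set2 = set(IUPAC[code1.upper()]), set(IUPAC[code2.upper()])
    p_match = len(set1 & set2) / (len(set1) * len(set2))
    return 1.0 - p_match


print(mismatch_penalty("N", "A"))  # N is A with 25% chance
print(mismatch_penalty("Y", "C"))  # Y is C or T with 50% chance each
```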
The checkbox Use active zones only is only applicable
when a reference sequence is defined, and when certain
zones on this reference sequence are excluded for
analysis (see further on page 137).
Under Correction, one can select the Jukes and Cantor
(1969)¹ correction, a one-parameter correction for the
evolutionary distance as calculated from the number of
nucleotide substitutions. Alternatively, the Kimura
2-parameter correction (Kimura, 1980)² can be selected. In
either case, the resulting dendrogram displays a
distance scale which is proportional to evolutionary
time, rather than a similarity scale.
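For reference, the two corrections can be written out in their standard published forms. The sketch below assumes p is the observed fraction of differing positions, and P and Q are the observed fractions of transitions and transversions; whether BioNumerics applies exactly these expressions is an assumption.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor distance: d = -3/4 * ln(1 - 4p/3)."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def kimura_2p(P, Q):
    """Kimura 2-parameter distance.

    d = -1/2 * ln(1 - 2P - Q) - 1/4 * ln(1 - 2Q)
    with P = transition fraction and Q = transversion fraction.
    """
    a = 1.0 - 2.0 * P - Q
    b = 1.0 - 2.0 * Q
    if a <= 0.0 or b <= 0.0:
        raise ValueError("sequences too divergent for this correction")
    return -0.5 * math.log(a) - 0.25 * math.log(b)


print(jukes_cantor(0.10))
print(kimura_2p(0.06, 0.04))
```

Both corrected distances are larger than the observed difference of 0.10, reflecting multiple substitutions at the same site.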
14.10.2 Check Discard unknown bases, select a Gap
penalty of 0, and leave Use active zones only
unchecked. Apply no Correction and select Neighbor
Joining as clustering method.
14.10.3 Press <OK> to start calculating the multiple
alignment-based dendrogram. This calculation is
usually fast.
1. Jukes, T.H. and C.R. Cantor. 1969. In "Mammalian
Protein Metabolism III" (H.N. Munro, ed.), p. 21. Academic
Press, New York.
2. Kimura, M. 1980. J. Mol. Evol. 16: 111.
In case a multiple alignment exists, the problem is more
complex. As soon as sequences are added, the program
will have to recalculate the multiple alignment (steps 3
and 4 of scheme in Figure 14-1.) to find the optimal
alignments again for the new set of sequences. This
would cause corrections in the alignment made by the
user to be lost each time sequences are added. Therefore,
the program offers some additional features to add
sequences to existing multiple alignments without
affecting the alignments (and manual corrections)
between the other sequences.
14.11.1 In database DemoBase, open comparison All,
and display the multiple alignment.
14.11.2 Calculate a dendrogram from the multiple
alignment with Sequence > Calculate global cluster
analysis.
14.11.3 Select some entries and cut them from the
analysis with Edit > Cut selection.
The dendrogram is recalculated immediately, and the
multiple alignment is preserved, since deleting entries
does not influence the multiple alignment.
14.11.4 Paste the selection (which is still on the
clipboard) again with Edit > Paste selection.
The matrix based upon pairwise alignments and the
corresponding dendrogram are now being updated, and
when finished, the pasted sequences are shown in the
multiple alignment. However, the program has NOT
aligned them. The multiple alignment dendrogram does
not appear, since the multiple alignment is not updated
yet.
14.11.5 To show that the pairwise dendrogram is
updated, select Sequence > Show global cluster
analysis to uncheck it, so that the global cluster
analysis is no longer displayed.
14.11.6 The pairwise dendrogram appears; if not, choose
Layout > Show dendrogram.
14.11.7 Inspect where the pasted sequences are inserted
in the dendrogram (blue arrows).
14.11.8 If the pasted sequences constitute one single
branch, select that branch on the dendrogram, and
Sequence > Align internal branch.
14.11.9 The sequences within the branch are now being
aligned internally, and once this is finished, you can
select Sequence > Align external branch to align the
sequences from the rest of the dendrogram with the
selected branch.
The advantage of this approach is that by using the
Align internal branch feature, only the sequences within
the selected branch are aligned. This is useful to update
a part of a multiple alignment without affecting the non-selected branches. With the Align external branch
feature, the selected branch is aligned to the rest of the
dendrogram as a whole: all sequences within the branch
are treated as one block, and all the other sequences are
treated as another block. The two blocks are aligned to
each other. These features give the user full control over
how new sequences are added to a multiple alignment
without affecting any editing.
14.11.10 A similar result can be obtained with the Align
selected sequences function (see 14.12).
14.13.4 With Edit > Global alignment comparison
settings, the settings for calculating cluster analysis from
a multiple alignment can be edited, as explained in
14.10.
14.13.5 The menu Edit > Character conversion settings
allows the parameters to be set for converting bases into
categorical characters (see further, 14.14).
14.13.6 Edit > Display settings allows the color and
viewing settings in the multiple alignment editor to be
specified.
14.13.7 The Sequence display settings window (Figure 14-10.) provides two defaults for color settings: the White
default, which corresponds to the most widely used
colors for the bases on a white background, and the
Black default, which uses a black background in the
multiple alignment editor, using the base color scheme
of earlier versions of BioNumerics.
14.13.8 Apart from the two defaults, every item can be
assigned a specific color using the slide bars for the Red,
Green and Blue components. A character can be chosen
to indicate gaps and consensus positions, respectively.
14.12 Automatically realigning selected
sequences
14.12.1 With the function Sequence > Align selected
sequences, any set of selected sequences can be realigned
within an existing multiple alignment. The function
preserves any automatic and manual alignment that
exists between all the non-selected sequences, which are
treated as one block. The selected sequences are aligned
one by one to the non-selected sequences. The difference
from the method described in 14.11 is that the new
sequences are not first aligned among each other, which
may produce a slightly different result.
14.13 Sequence display and analysis
settings
Figure 14-10. The Sequence display settings window.
A number of settings related to a Sequence Type are
stored as initial settings. These include display settings
as well as alignment, clustering, and conversion settings.
The initial settings can be changed in the Sequence type
window (Figure 14-12.).
14.14 Converting sequence data to
categorical character sets
14.13.1 To open the Sequence type window, double click
on a Sequence Type in the Experiments panel of the Main
window (or select Experiments > Edit experiment type).
14.13.2 With Edit > Comparison settings, you can edit
the pairwise comparison settings as explained in 14.1.
14.13.3 Using Edit > Global alignment settings, the
settings for calculating a global alignment can be edited
(see 14.2).
DNA sequence data can be converted into categorical
character data, whereby each base is represented by an
integer number: A = 1, C = 2, G = 3, and T = 4. The
converted categorical data can be visualized and
analyzed as a Composite Data Set (10.1). The possibility
to convert bases into categorical characters requires that
a multiple alignment is calculated from the sequences,
and also, that a Composite Data Set exists which
includes the Sequence Type (exclusively or in
combination with other Sequence Types).
In addition to the four above states displayed in the
Composite Data Set, a fifth state (zero) can optionally be
assigned to a gap position. As another option, it is
possible to consider only the mutating positions, i.e. the
positions that differ in at least one sequence from the
others.
14.14.1 The settings for converting sequences into
character data can be changed in the Sequence type
window (see 14.13). Choose Settings > Character
conversion settings to open the Character conversion
settings dialog box.
14.14.2 With the option Exclude non-mutating positions,
only those base positions in a multiple alignment that do
not contain the same base for all sequences will be included
in the character set.
14.14.3 With the option Exclude positions with gaps,
those positions where one or more sequences have a gap
in the multiple alignment will be excluded from the
character set. If gaps are not excluded, a fifth state is
assigned to gaps (zero).
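The conversion and its two options can be sketched as follows. The function and parameter names are hypothetical, "-" is assumed to be the gap symbol, and only A, C, G, and T are handled; how ambiguity codes are converted is not described here. Counting a gap as a difference when deciding whether a position mutates is also an assumption.

```python
BASE_STATE = {"A": 1, "C": 2, "G": 3, "T": 4}


def to_characters(aligned, exclude_non_mutating=False,
                  exclude_gap_positions=False, gap="-"):
    """Convert aligned sequences to rows of integer character states.

    A = 1, C = 2, G = 3, T = 4; a gap becomes state 0 unless
    positions with gaps are excluded.
    """
    columns = []
    for col in zip(*aligned):
        if exclude_gap_positions and gap in col:
            continue
        if exclude_non_mutating and len(set(col)) == 1:
            continue
        columns.append([0 if b == gap else BASE_STATE[b.upper()] for b in col])
    if not columns:
        return [[] for _ in aligned]
    # Transpose back to one row of characters per sequence.
    return [list(row) for row in zip(*columns)]


alignment = ["ACG-", "ACGT", "ATGT"]
print(to_characters(alignment))
print(to_characters(alignment, exclude_non_mutating=True))
print(to_characters(alignment, exclude_non_mutating=True,
                    exclude_gap_positions=True))
```

With both options enabled, only the second alignment position remains, since it is the only polymorphic position without a gap.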
Converting sequences into character data can have
several useful applications:
Minimum Spanning Trees can be calculated from the
Composite Data Set (see chapter 18.), thus allowing
sequence data to be analyzed using MSTs.
Different genes, each represented in a separate Sequence
Type, can be combined in one Composite Data Set, so
that the information from the different genes can be
condensed in one single dendrogram. The option
Exclude non-mutating positions (14.14.2) thereby offers
the possibility to reduce the amount of information to
only those base positions that are polymorphic in the
entries analyzed.
In addition to clustering the entries, it is also possible in
the Composite Data Set to cluster the base positions,
using the Transversal Clustering method (see 15.4). The
result looks like Figure 14-11., where groups of bases
are clustered together according to their discriminatory
behaviour between groups of entries.
14.14.4 In this view (Figure 14-11.), the bases can be
shown as letters (default, to be obtained with Composite
> Show presence/absence), as colors (using Composite >
Show quantification (colors)), or as numbers (with
Composite > Show quantification (numbers)). In the
color view, “A” is shown in magenta, “C” in green, “G”
in orange and “T” in red; a gap is blue. In the numbered
view, “A” is 1, “C” is 2, “G” is 3, and “T” is 4; a gap is
zero.
Figure 14-11. Comparison window showing Composite Data Set generated from Sequence Type (Demobase).
Bases were converted into categorical characters and clustered in both directions.
14.15 Excluding regions from the
sequence comparisons
As for the comparison of fingerprints, it is
possible to exclude regions from the sequences to be
clustered. First, one needs to define a reference
sequence, and next, one can indicate the zones to be
excluded and included on the reference sequence. The
exclusion of regions is only possible when calculating a
cluster analysis based upon globally aligned sequences
(multiple alignment) and when the reference sequence is
included in the multiple alignment. Only then, the
program can introduce a consistent base numbering
based on the reference sequence, which makes it
possible to specify the same exclude/include settings for
different multiple alignments within the Sequence Type.
14.15.1 In the Main window, open the Sequence type
window of 16S rDNA by double-clicking on 16S rDNA
in the Experiment types panel.
Initially, there is no reference sequence present. A link
arrow allows you to link a reference sequence to a
database entry, by clicking on the arrow and dragging it
onto a database entry, and then releasing the mouse
button. When the experiment is linked, its link arrow is
purple.
14.15.2 Drag the link arrow to database entry
Vercingetorix palustris strain no. 42819: as soon as you
pass over a database entry, the cursor shape changes.
14.15.3 Release the mouse button on database entry
Vercingetorix palustris strain no. 42819.
This entry is now defined as reference sequence, and the
arrow in the Sequence type window has become purple
instead of gray. The reference sequence is shown in
the Sequence type window (Figure 14-12.).
14.15.4 Select Settings > Exclude active region or press
the corresponding toolbar button
to exclude a region from comparison.
14.15.5 Enter start and end base number of the region to
be excluded, and press the <OK> button.
Figure 14-12. Sequence type window with reference sequence defined, region excluded, and comments added.
The included regions are marked with a green line
whereas the excluded regions are marked with a red line
(Figure 14-12.).
reference sequence is present, when the sequences are
aligned, and when a consensus sequence is shown. The
comment line is saved along with the Sequence Type,
and new comments can be added at any time.
14.15.6 In order to remove all excluded regions at once,
select Settings > Include active region or press the
corresponding toolbar button. Enter 1
as From number, and enter the length of the sequence,
or a number which certainly exceeds the sequence's
length, as To number.
14.15.7 Open comparison All (or another comparison
which you have saved with the aligned sequences).
14.15.8 Show the image of the aligned sequences.
14.15.9 Select the branch top of Vercingetorix palustris
strain no. 42819 in the dendrogram panel (the reference
sequence) and Sequence > Create consensus of branch.
By creating a consensus of a single sequence, you can
display the reference sequence in the consensus
sequence line (Figure 14-13.). At the same time, the
excluded and included regions are indicated, and the
base numbering, according to the reference sequence
appears (Figure 14-13.).
14.15.10 In order to see the base numbering, it may be
necessary to drag the horizontal line that separates the
header from the image panels downwards.
14.16 Writing comments in the alignment
In order to mark special regions on the reference
sequence or on the multiple alignment, a simple
comment editor allows you to add any comment to the
comparison. The comments can only be added when a
14.16.1 Click the cursor on one of the aligned sequences
in the image. At the position of the cursor, you can start
writing comments.
The comments appear in the image header, above the
consensus sequence (Figure 14-13.). Any character input
is supported. A's, C's, G's and T's are written in the
colors of the bases.
14.16.2 To delete a comment, place the cursor on any
sequence at the position of the first character of the
comment and enter spaces.
14.16.3 Aligned or not aligned sequences in a
comparison can be exported as text-file with the
command File > Export sequences.
The program now asks "Do you want to export the database
fields?".
14.16.4 Answer <Yes> to export tab-delimited database
fields along with the sequences.
Next, the program asks “Do you want to include regions
with gaps?”.
14.16.5 Answer <Yes> if you want to preserve the gaps
introduced in the multiple alignment.
This allows aligned sequences to be exported from
BioNumerics to other software applications. Gaps are
represented as spaces.
Figure 14-13. Comparison window with image of aligned sequences in consensus blocks view, detail.
15. Cluster analysis of Composite Data Sets
15.1 Principles
A clustering based upon a similarity matrix can be
performed on an individual experiment type or on a
combination of experiment types. The methods that
BioNumerics uses to arrive at dendrograms
representing combined techniques are schematized in
Figure 15-1.
• Flows 1 and 2 represent the steps to obtain
dendrograms for two single experiments, experiment
1 and experiment 2, respectively. The steps involve
the creation of a similarity matrix and the calculation
of a dendrogram based on this matrix.
• Flow 3 is the first method to calculate a combined
dendrogram from multiple experiments: the
individual similarity matrices are first calculated and
from these matrices, a combined matrix (A) is
calculated by averaging the values. The averaging can
happen in two ways: each value can be considered
equally important, or the program can assign a weight
proportional to the number of tests in an experiment.
In addition, the user can define an extra weight for
each experiment manually.
• Flow 4 starts directly from the character tables, and
merges all characters from different experiment types
to obtain a Composite Data Set. From this Composite
Data Set, a similarity matrix is calculated (combined
matrix B), resulting in combined dendrogram B.
Both steps 3 and 4 require a Composite Data Set to be
generated.
15.2 Composite Data Sets
A Composite Data Set is a character table that contains
all the characters of one or more experiment types. We
have briefly described the use of Composite Data Sets in
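[Figure 15-1 diagram. Flows 1 and 2: Experiment 1 and Experiment 2 give Matrix 1 and Matrix 2, and from these Dendrogram 1 and Dendrogram 2. Flow 3: Matrix 1 and Matrix 2 are averaged into Combined matrix A, which gives Combined dendrogram A. Flow 4: Experiment 1 and Experiment 2 are merged into a Composite experiment, which gives Combined matrix B and Combined dendrogram B.]
Matrix 1        Matrix 2        Combined matrix A    Combined matrix B
100             100             100                  100
 75 100          85 100          80 100               80 100
 60  85 100      50  70 100      55  77 100           55  77 100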
Figure 15-1. Scheme of possibilities in BioNumerics to obtain combined dendrograms from multiple
experiments.
order to create a character table for a band matching
analysis (10.1).
NOTE: In addition to the obvious reasons of (i) creating
band matching tables and (ii) creating clusterings of
multiple data sets, a Composite Data Set also offers
some additional interesting features compared to single
Character Types. These include a function to
discriminate groups based upon differential characters
in the Comparison window (Composite >
Discriminative characters) and a function to sort the
entries in a comparison by the intensity of a selected
character (Composite > Sort by character). Both
features are described in 10.9. Further, the characters in
a Composite Data Set can be displayed as numerical
values which can be exported as tab-delimited tables in
a comparison (Composite > Export character table).
This feature is described in 10.2.
We will now describe the use of Composite Data Sets for
cluster analysis based upon multiple
experiments.
15.2.1 In the Main window, with the database DemoBase
loaded, select Experiments > Create new composite data
set, or press the corresponding toolbar button.
15.2.2 Enter a name, for example All-pheno and press
<OK>.
The Composite data set window is shown for “All-pheno”
(see Figure 15-2.). All experiment types defined for the
database are listed, and when they are marked with a
red cross, they are not selected in the Composite Data
Set.
15.2.3 We want to create a character table for all
phenotypic tests defined in the database, so we select
Phenotest and Experiment > Use in composite data set.
We repeat this for FAME.
When an experiment type is selected in the Composite
Data Set, it is marked with a green ✓ sign.
The scroll bar that appears in the Weights column
allows the user to manually assign weights to each of the
selected experiment types (see step 3 described in 15.1).
If the individual matrices of the experiments are
averaged to obtain a combined matrix, the similarity
values will be multiplied by the weights the user has
specified for each experiment.
In order to treat individual characters on an equal basis
while averaging matrices, the program can
automatically use weights proportional to the number of
tests each experiment contains. This correction is
achieved as follows:
15.2.4 Select Experiment > Correct for internal weights.
The header now shows Weights (x internal weight).
Figure 15-2. Composite data set window.
NOTES:
(1) The correction for internal weights also applies to
banding patterns: if technique RFLP1 reveals 10 bands
between entries A and B, whereas RFLP2 only reveals 5
bands, the similarity value resulting from RFLP1 will
be twofold more important in averaging similarity
between entries A and B.
(2) Both functions Correct for internal weights and
the manual weight assignment can be combined. The
program will then multiply the weights obtained after
correction by the weights assigned by the user.
(3) In case step 4 described in 15.1 is chosen further in
the analysis, i.e. the character sets are merged to a
combined character set to which a similarity coefficient
is applied, the user defined weights also have their
function: in this case, the program multiplies each
character of a given experiment with the weight
assigned to that experiment. This feature is useful in
case the ranges of combined experiments are different;
for example when one experiment has a character value
range between 0 and 1 and another experiment has a
range between 0 and 100, a quantitative coefficient such
as a correlation coefficient, Gower, or Euclidean
distance would in practice only rely on the second
experiment. Assigning a weight of x100 to the first
experiment makes them equally important for
quantitative coefficients.
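As a rough illustration of how the weights enter the averaging of matrices (step 3 in 15.1), the following Python sketch multiplies a user-defined weight by the number of tests or bands when Correct for internal weights is enabled, and then takes the weighted mean of the similarity values. The function names, the example similarity values, and the normalization by the sum of the weights are assumptions made for this example; the exact formula used by the program is not documented here.

```python
def final_weights(user_weights, sizes, correct_internal=True):
    """Combine user-defined weights with internal weights (number of tests or bands)."""
    return {
        name: user_weights[name] * (sizes[name] if correct_internal else 1)
        for name in user_weights
    }


def averaged_similarity(similarities, weights):
    """Weighted mean of the per-experiment similarity values for one pair of entries."""
    total = sum(weights.values())
    return sum(weights[name] * similarities[name] for name in similarities) / total


# Example from note (1): RFLP1 reveals 10 bands, RFLP2 reveals 5 bands.
sizes = {"RFLP1": 10, "RFLP2": 5}
user_weights = {"RFLP1": 1, "RFLP2": 1}
# Hypothetical similarity values (in %) between entries A and B.
similarities = {"RFLP1": 90.0, "RFLP2": 60.0}

plain = averaged_similarity(similarities, final_weights(user_weights, sizes, False))
corrected = averaged_similarity(similarities, final_weights(user_weights, sizes, True))
print(plain, corrected)
```

With the correction enabled, RFLP1 counts twice as much as RFLP2, so the averaged value moves toward the RFLP1 similarity.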
15.2.5 The comparison settings for the Composite Data
Set are entered with Experiment > Comparison settings.
The Composite data set comparison dialog box (Figure 15-3.)
allows you to choose between step 3 (averaging the
matrices of the experiments) and step 4 (merging the
experiments to a composite experiment). With the
Similarity option Average from experiments, the
matrices from the individual experiments are averaged
according to the defined weights (step 3 in Figure 15-1.).
With one of the coefficients under Binary coefficient,
Numerical coefficient (non-binary coefficients), or
Multi-state coefficient step 4 in Figure 15-1. will be
followed using a composite character table. For a
description of the coefficients, see 13.1.
Figure 15-3. Composite data set comparison dialog
box.
If non-binary characters (values) are used, it may be
meaningful to enable the feature Standardized
characters in the following cases. (1) For some
techniques, e.g. fatty acid methyl ester analysis, it is
common that some fatty acids occur in high amounts,
whereas other fatty acids occur only in very small
amounts. It is likely that the major fatty acids will
account for most of the discrimination between the
organisms studied, whereas the minor fatty acids, which
may be as valuable from a taxonomic point of view, are
masked. (2) When creating composite character sets
from different experiments, the ranges of the experiment
may be different. When using a coefficient such as the
correlation coefficient, characters with a higher range
will have more influence on the similarity and the
dendrogram. The feature Standardized characters
standardizes each character by subtracting its mean
value and dividing by its standard deviation. The result
is that all characters have equal influences on the
similarity.
The feature Use square root is intended for character
sets that yield high similarities within groups. In such
cases, it may be useful to combine Use square root with
Pearson correlation and Cosine coefficient (or Euclidean
distance in case of non-Composite Data Sets).
The Rank correlation coefficient first transforms an
array of characters into an array of ranks according to
the magnitude of the character values. The rank arrays
are then compared using the Pearson product-moment
correlation coefficient. The Rank correlation is known to
be a very robust coefficient, but with low sensitivity.
15.2.6 Select Pearson correlation with Standardized
characters and UPGMA as clustering method.
15.2.7 Close the Composite data set window with File >
Exit. The new Composite Data Set is shown in the
experiment types panel of the Main window.
15.3 Calculating a dendrogram from a
Composite Data Set
Calculating a dendrogram from a Composite Data Set is
almost the same as for a single experiment. The creation
of a Composite Data Set and its functions is described in
15.2, and if you have gone through that paragraph, a
Composite Data Set All-Pheno should be available in
the database, including the Character Type experiments
Phenotest and FAME.
15.3.1 Run the Analyze program with database
DemoBase loaded.
15.3.2 Select all entries except STANDARD (9.3) and
create a new comparison (9.7), or open comparison All if
existing.
15.3.3 Select All-Pheno in the experiment type selection
bar (bottom of window), and show the character image
by pressing the
button of All-Pheno.
15.3.4 Right-click on the image to select Show
quantification (colors).
15.3.5 Select Clustering > Calculate > Cluster analysis
(similarity matrix).
The Composite data set comparison settings dialog box
allows you to specify the similarity coefficient to
calculate the similarity matrix, and the clustering
method (Figure 15-3.). It also allows you to take the
similarity matrices from the individual experiments and
to calculate an average matrix from these (Average from
experiments). The meaning of the options is described
in 15.2.
If the combined experiments are comparable in terms of
biological meaning, reaction type and numerical range,
it is possible to use one of the binary coefficients Jaccard,
Dice, Simple matching, or one of the numerical
coefficients Pearson correlation, Cosine correlation, or
Canberra metric. For example, if both experiments
involve substrate utilization tests and are recorded
either positive or negative, the best option is to select
under Binary coefficient (Jaccard, Dice or Simple
matching). If both character tests are registered
quantitatively as numerical values between 0 and 100, a
suitable option is to select under Numerical coefficient
(Pearson correlation, Cosine correlation, Canberra
metric, or Rank correlation).
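To make the difference between the Pearson correlation and the Rank correlation more concrete, the sketch below first converts each array of character values into ranks and then applies the Pearson product-moment correlation to the rank arrays, as described in 15.2. It is a minimal Python illustration, not the program's implementation: the function names and the example values are hypothetical, and giving tied values their average rank is an assumption of this example.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equally long arrays."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Rank values from 1 upward; tied values receive their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def rank_correlation(x, y):
    """Pearson correlation applied to the rank arrays."""
    return pearson(ranks(x), ranks(y))


# Hypothetical character values of two entries; one value is an outlier.
entry_a = [0.1, 2.0, 35.0, 4.0]
entry_b = [1.0, 3.0, 5.0, 4.0]
r_pearson = pearson(entry_a, entry_b)
r_rank = rank_correlation(entry_a, entry_b)
```

Because both entries order the four characters identically, the rank correlation is maximal, whereas the Pearson correlation is lowered by the outlying value. This illustrates the robustness, and the lower sensitivity, attributed to the Rank correlation.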
NOTE: It can be proven that in case of binary data sets
the option Average from experiments offers exactly
the same results when Correct for internal weights
is enabled in the Composite data set settings (see
15.2.4).
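The note can be checked numerically for one coefficient. The Python sketch below, under the assumption that the Simple matching coefficient is used, compares the average of the per-experiment similarities, weighted by the number of tests in each experiment, with the similarity computed on the merged character set. The binary profiles and function name are hypothetical, and the sketch makes no claim about other coefficients.

```python
def simple_matching(x, y):
    """Fraction of characters on which two binary profiles agree."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


# Hypothetical binary profiles of entries A and B in two experiments.
exp1_a, exp1_b = [1, 0, 1, 1], [1, 1, 1, 0]              # 4 tests
exp2_a, exp2_b = [0, 0, 1, 1, 0, 1], [0, 0, 1, 0, 0, 1]  # 6 tests

s1 = simple_matching(exp1_a, exp1_b)
s2 = simple_matching(exp2_a, exp2_b)
n1, n2 = len(exp1_a), len(exp2_a)

# Averaging the two similarities with weights proportional to the number of tests.
averaged = (n1 * s1 + n2 * s2) / (n1 + n2)
# Applying the coefficient to the merged character set.
merged = simple_matching(exp1_a + exp2_a, exp1_b + exp2_b)
```

Both routes count the same matching characters over the same total number of characters, which is why the two values coincide for this coefficient.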
In the case however, that the ranges of the combined
experiments are different, e.g. a range between 0 and 10
for one experiment and between 0 and 100 for another
experiment, the numerical coefficients Pearson
correlation, Cosine correlation, or Canberra are not
suitable, as they would assign much more weight to the
second experiment than to the first. In such cases, you
should either take the similarity values from the
individual experiments (Average from experiments) and
average them into a new matrix, or specify user-defined
weights for the experiments, so that their final weights
are comparable (see the notes following 15.2.4).
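The effect of different ranges can be illustrated with a small calculation. The Python sketch below uses a squared Euclidean distance on a merged character set and reports which share of the distance comes from the first experiment, with and without a user-defined weight. The character values, the weight of 10, and the function names are hypothetical and only serve the illustration.

```python
def squared_contributions(entry_a, entry_b, weights):
    """Contribution of each experiment to the squared Euclidean distance."""
    return {
        name: sum((weights[name] * (a - b)) ** 2
                  for a, b in zip(entry_a[name], entry_b[name]))
        for name in entry_a
    }


def share(contributions, name):
    """Fraction of the total squared distance that stems from one experiment."""
    return contributions[name] / sum(contributions.values())


# Hypothetical values: exp1 ranges from 0 to 10, exp2 from 0 to 100.
entry_a = {"exp1": [2, 9], "exp2": [40, 55]}
entry_b = {"exp1": [8, 1], "exp2": [60, 30]}

unweighted = share(
    squared_contributions(entry_a, entry_b, {"exp1": 1, "exp2": 1}), "exp1")
weighted = share(
    squared_contributions(entry_a, entry_b, {"exp1": 10, "exp2": 1}), "exp1")
```

Without a weight, the first experiment contributes less than a tenth of the distance although its values differ by 60 to 80% of its range. After multiplying its characters by 10, both experiments are expressed on a comparable scale.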
The Categorical coefficient can be chosen in case all the
characters of the individual experiment types are
multistate characters. As opposed to binary, where only
two states are known, multistate characters are defined
as characters that can take more than two states.
However, as opposed to numerical characters, the
different states represent discrete categories, which
cannot be ranked somehow. Examples are Phage types,
Multilocus Sequence Types (MLST), colors, etc.
15.3.6 In the example data sets Phenotest and FAME, the
character sets have different ranges, so select
Average from experiments.
15.3.7 Press <OK> to calculate the cluster analysis.
The resulting dendrogram is based upon the average
matrix of both similarity matrices. In this Composite
Data Set, we have chosen the averaging to correct for
internal weights (see 15.2.4), so since the experiment
type FAME contains more characters than Phenotest, it
is assigned more weight proportionally. Hence, we can
expect that the composite clustering will have a higher
congruence with FAME than with Phenotest. You can
check this as follows:
15.3.8 First make sure that a matrix is present for both
FAME and Phenotest: select Calculate cluster analysis
for both experiments.
15.3.9 Select Clustering > Congruence of experiments
(see 11.12). All-Pheno shows 96.4% similarity with
FAME and only 71.6% with Phenotest.
NOTE: If you want to see the difference when Correct
for internal weights is not enabled, save and close the
Comparison window, open All-pheno in the
Experiments panel, and uncheck Experiment >
Correct for internal weights. Open the comparison
again, select Calculate cluster analysis again for
All-pheno, and then Clustering > Congruence of
experiments. The similarity of All-pheno now is 88.6%
and 84.7% with Phenotest and FAME, respectively.
It is obvious that the possibility of approach 4 described
in 15.1, i.e. merging two character sets into a combined
character set, is only applicable to comparable character
sets. It makes no sense, and is even impossible, to
combine a phenotypic test panel with a sequencing
experiment in this way. When such experiments of
different nature are to be used for consensus groupings,
the only remaining approach is to combine the obtained
individual similarity matrices (approach 3 in 15.1).
However, the option to create an average matrix from
individual experiment matrices only works well in case
two conditions are fulfilled: (i) the expected similarity
range for both experiments is comparable, and (ii) the
matrices are complete, i.e. for each experiment there is a
similarity value present for each pair of entries. Suppose
that two experiment types are to be combined which
generate strongly different similarity levels, e.g. DNA
homology values on the one hand and 16S rDNA
similarity on the other hand. In many cases, DNA
homology values will range from 100% to 40% or less,
whereas 16S rDNA similarity will range between 100%
and 90% or even higher. It is clear that the small but very
significant similarity differences in 16S rDNA homology
will be masked by the much larger differences
(including experimental error) of DNA hybridization,
and will have no contribution to the clustering based
upon averaging of matrices. In such cases, other
methods are needed to compose a consensus matrix,
that “takes the best of it all”.
The principle of averaging matrices is even worse when
one or more matrices are incomplete. Suppose three
entries in BioNumerics, A, B, and C. Consider the
following matrices for these three entries, generated
from 16S rDNA aligned sequence similarity and DNA
hybridization. The DNA hybridization matrix is
incomplete, a situation which happens often.
16S rDNA similarity
       A     B     C
A    100
B     97   100
C     93    98   100
DNA hybridization homology (the value for A and C is missing)
       A     B     C
A    100
B     83   100
C      -    82   100
Averaged composite matrix
       A     B     C
A    100
B     90   100
C     93    90   100
The averaged matrix created in the Composite Data Set
from these two experiments shows averaged values for
(AB) and (BC) but for (AC) it has taken the only
available value, 93%. The resulting matrix provides a
completely distorted view of the relationships between
these three organisms, as it suggests A and C to be
most closely related. In reality however, one can predict,
based upon the lower 16S rDNA similarity, that (AC)
will be much less related than (AB) and (BC).
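The distortion can be reproduced with a few lines of Python. The sketch below averages the two matrices given above pair by pair and simply skips a missing value, which is the behaviour described in the text. The dictionary layout and the function name are choices made for this example.

```python
def average_matrices(matrices):
    """Average pairwise similarities, skipping missing (None) values."""
    pairs = set().union(*(m.keys() for m in matrices))
    averaged = {}
    for pair in pairs:
        values = [m[pair] for m in matrices if m.get(pair) is not None]
        averaged[pair] = sum(values) / len(values) if values else None
    return averaged


rdna = {("A", "B"): 97, ("B", "C"): 98, ("A", "C"): 93}
hybridization = {("A", "B"): 83, ("B", "C"): 82, ("A", "C"): None}

composite = average_matrices([rdna, hybridization])
closest = max(composite, key=composite.get)
```

The pair (A, C) ends up with the highest averaged value only because its low DNA hybridization value is absent, not because the two entries are actually closest.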
This is an obvious example where averaging similarity
matrices is not a good approach, and therefore, another
algorithm has been incorporated in BioNumerics, based
upon linearization of the consensus matrix with respect
to the individual experiment matrices. The consensus
matrix is composed in such a way that it constitutes a
third degree function of each individual experiment
matrix, and the result is that it reflects each of the
constituent matrices as closely as possible.
The consensus matrix can be calculated in BioNumerics as
follows:
15.3.10 If not existing yet, create a new Composite Data
Set “All”, in which you add all experiments available in
DEMOBASE.
15.3.11 Open a comparison containing all but the
STANDARD lanes, and create a matrix (Calculate
cluster analysis) for each experiment.
15.3.12 Select Composite > Calculate consensus matrix.
The consensus matrix and a corresponding consensus
dendrogram are calculated. The resulting groupings can
be considered as the most faithful “compromise” from
all available data.
NOTE: The feature to correct for internal weights
(15.2.4) does not apply to a consensus matrix.
15.4 Cluster analysis of characters
The input for a cluster analysis in a Composite Data Set
is a data matrix. A data matrix of n entries having p
characters looks like in Figure 15-4.: the entries are
presented as rows and the characters as columns. In
BioNumerics, the data matrix does not need to be
complete: some missing character values are allowed,
for example if test results are ambiguous or not
available.
           Char 1   Char 2   …   Char p
Entry 1    Val 11   Val 12   …   Val 1p
Entry 2    Val 21   Val 22   …   Val 2p
…          …        …        …   …
Entry n    Val n1   Val n2   …   Val np
Figure 15-4. Data matrix of n entries and p
characters.
A simple and efficient way to visually associate groups
of characters (columns) with groups of entries (rows) in
a data matrix is to construct a two-way clustering of the
data matrix, i.e. in which the entries are clustered by
means of their character values (the conventional
clustering as described in chapter 11.; also called
Q-clustering), and the characters are clustered by means of
their values per entry (R-clustering).
The result is a data matrix in which both the entries and
the characters are ordered according to their relatedness
(Figure 15-5.), which we will call transversal clustering.
This representation makes it easy to associate visually
clusters of characters with clusters of entries. For
example, the first group of entries (E1, E9, and E5) is
separated from the others by a cluster of characters (C5,
C16, C3, and C11) which are all more positive in the first
cluster than in the other clusters. Another group of three
characters (C14, C17, and C20) separates the second
group of entries (E3, E6, and E2) from the other clusters
because they are less positive.
Figure 15-5. Transversal clustering of entries (horizontal) and characters (vertical).
15.4.1 In BioNumerics, it is possible to calculate a
transversal clustering from a Composite Data Set. As an
example, we use the Composite Data Set All-pheno in
DemoBase including FAME and Phenotest as described
in 15.2.
15.4.2 Create a Comparison window with a selection of
entries and select the Composite Data Set in the
experiment type selection bar (bottom of window). You
can show the character image by pressing the
button
of All-Pheno.
15.4.3 Calculate a cluster analysis of the entries as
described in 11.2.
15.4.4 Choose Composite > Calculate clustering of
characters or click the
button. A dialog box offers a
choice between the Pearson correlation for numerical
characters, the Jaccard, Dice, and Simple Matching
coefficients for binary data, and the Categorical
coefficient for multi-state or categorical characters. For a
description of the coefficients, see 13.1.
15.4.5 Select Pearson correlation and press <OK> to
calculate a character dendrogram, which appears
horizontally in the caption of the data matrix display of
the Composite Data Set.
15.4.6 It may be useful to drag the separator bar between
the image panel and its caption down to obtain more
space for the character dendrogram and the character
names.
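The two directions of a transversal clustering differ only in what is compared: Q-clustering compares rows (entries) of the data matrix, R-clustering compares columns (characters). The Python sketch below shows this with the Pearson correlation by applying the same function to the data matrix and to its transpose. The data values and function names are hypothetical, and the subsequent dendrogram construction is left out.

```python
import math


def pearson(x, y):
    """Pearson correlation; assumes neither array is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def similarity_matrix(rows):
    """Pairwise Pearson correlations between the rows that are passed in."""
    return {
        (i, j): pearson(rows[i], rows[j])
        for i in range(len(rows)) for j in range(i + 1, len(rows))
    }


# Hypothetical data matrix: 3 entries (rows) by 4 characters (columns).
data = [
    [1, 2, 5, 0],
    [2, 4, 3, 1],
    [3, 6, 1, 5],
]

q_matrix = similarity_matrix(data)                                 # entries vs entries
r_matrix = similarity_matrix([list(col) for col in zip(*data)])    # characters vs characters
```

In this example, characters 1 and 2 vary together over the entries and would be grouped first in the character dendrogram, whereas character 3 varies in the opposite direction.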
16. Phylogenetic clustering methods
In addition to the Neighbor Joining method, which we
already described in the previous chapter, BioNumerics
offers two alternative phylogenetic clustering methods,
based on the concepts of maximum parsimony and
maximum likelihood, respectively. Maximum parsimony
can be applied to any data set that can be presented as a
binary or categorical data matrix. As such it can be applied
to Fingerprint Type data on condition that a band
matching is performed (see chapter 10.). It also can be
applied to Character Type data with binary or
categorical character data. In case of non-binary
numerical data, the default Binary conversion settings for
the Character Type will be used (7.14.20). Likewise, the
maximum parsimony method can also be used for
Composite Data Sets. The maximum likelihood
clustering method can only be applied to nucleic acid
sequence data, and we will describe both methods with the
Sequence Type 16S rDNA of database DemoBase. In
case of sequence data, the maximum parsimony and
maximum likelihood clustering methods only work on
aligned sequences: a multiple alignment must be
present.
16.1 Maximum parsimony of Fingerprint
Type data and Character Type data
16.1.1 Since maximum parsimony requires a binary or
categorical data matrix as input, it can only be applied to
Fingerprint Type data for which a band matching is
performed. For the Fingerprint Type you want to cluster
using maximum parsimony, a band matching should be
performed as described in chapter 10. The program will
use the binary band presence table associated with the
band matching as input for maximum parsimony.
16.1.2 In case of Character Type data, the maximum
parsimony can be calculated directly on the data set. If
the data set is non-categorical and non-binary, the
default Binary conversion settings for the Character Type
will be used. You can check this setting by opening the
Character type window and selecting Settings > Binary
conversion settings, which pops up the Conversion to
binary data dialog box (7.14.20).
16.1.3 To calculate a maximum parsimony dendrogram,
select Clustering > Calculate > Maximum parsimony
tree (evolutionary modelling). You can also press the
button, in which case the floating menu as shown
in Figure 11-1. pops up.
The Maximum parsimony cluster analysis dialog box for
character data appears (Figure 16-1.).
Figure 16-1. The Maximum parsimony cluster
analysis dialog box for character data.
16.1.4 Under Data set, you can specify how to treat the
data, i.e. Convert to binary or Treat as categorical. In
case of Fingerprint Type data and binary character sets,
these options are redundant.
16.1.5 BioNumerics uses methods that are described in
the literature to optimize the topology of parsimonious
trees. An alternative method, which sometimes finds
even more parsimonious trees, but which is
considerably slower, is the mathematical principle of
Simulated annealing.
16.1.6 In addition, BioNumerics can do a Bootstrap
analysis on the parsimony clustering, of which you can
enter the number of simulations. Caution: enabling
simulated annealing and at the same time entering a
number of bootstrap simulations will increase the
computing time dramatically. We do not recommend
combining these options.
The resulting Unrooted dendrogram window (Figure 16-3.)
is discussed in the next paragraph.
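To give an idea of what the parsimony of a tree means for a binary or categorical data matrix, the Python sketch below counts the minimum number of state changes that a fixed tree topology requires, using Fitch's algorithm with equal costs for all changes. This is a generic textbook method shown for illustration only; it is not claimed to be the method used by BioNumerics. The band presence table, the tree notation (nested pairs of entry names), and the function names are hypothetical.

```python
def fitch_changes(tree, states):
    """Minimum number of state changes for one character on a binary tree."""
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):          # leaf: its observed state
            return {states[node]}
        left, right = (visit(child) for child in node)
        common = left & right
        if common:
            return common
        changes += 1                       # no shared state: one change needed
        return left | right

    visit(tree)
    return changes


def parsimony(tree, table):
    """Total number of changes over all characters (columns) of the table."""
    n_characters = len(next(iter(table.values())))
    return sum(
        fitch_changes(tree, {entry: row[i] for entry, row in table.items()})
        for i in range(n_characters)
    )


# Hypothetical binary band presence table for four entries.
table = {"A": [1, 0, 1], "B": [1, 0, 0], "C": [0, 1, 0], "D": [0, 1, 1]}

score_ab_cd = parsimony((("A", "B"), ("C", "D")), table)
score_ac_bd = parsimony((("A", "C"), ("B", "D")), table)
```

The topology that groups A with B and C with D needs fewer changes and is therefore the more parsimonious of the two. Searching over topologies, which is the expensive part of the analysis, is not shown.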
16.2 Maximum parsimony clustering of
sequence data
16.2.1 In DemoBase, open the Comparison window with
the aligned sequences shown.
First, we will reduce the number of entries in the
comparison, in order to make maximum likelihood
(16.3) possible in a reasonable time.
16.2.2 Select all entries in the comparison, and then
unselect a couple of entries per cluster, so that some 10
entries from all clusters are unselected in total.
16.2.3 Remove the selected entries from the comparison
with Edit > Cut selection.
16.2.4 Select Clustering > Calculate > Maximum
parsimony tree (evolutionary modelling). You can also
press the corresponding toolbar button, in which case
the floating menu as shown in Figure 11-1. pops up.
From the floating menu you can select Calculate
maximum parsimony tree (evolutionary modelling).
Note that, in case of aligned sequence data, an extra
option Calculate maximum likelihood tree becomes
available, which is also discussed in chapter 16.
The Maximum parsimony clustering dialog box (Figure
16-2.) allows you to specify a cost for each base conversion
(mutation) in the Cost table. The default setting is 100%
for each possible conversion.
Gaps can be dealt with in two ways: the program can
Ignore positions with gaps, or can consider gaps as an
Extra state. In the first case, when gaps are ignored,
every position that contains a gap in one or more
sequences of the multiple alignment, will be excluded
from the analysis.
Figure 16-2. Maximum parsimony clustering dialog
box.
16.2.5 Leave the Cost table unaltered, check Ignore
positions with gaps, enable Optimize topology, and
leave the Number of bootstrap simulations zero.
16.2.6 Press <OK> to start the calculations, which will
take some time.
The result is an Unrooted dendrogram window, of which
the parsimony (the total number of base conversions
over the tree) is given in the status bar (Figure 16-3.).
The entries are represented in the colors of the groups
we have defined earlier.
16.2.7 To zoom in or out on the tree, use the zoom
buttons or Layout > Zoom in and Layout > Zoom
out.
16.2.8 You can toggle between the colors and the
black-and-white representation mode with Layout > Show
group colors or the corresponding toolbar button.
In black-and-white mode, the groups are represented
(and printed) as symbols.
16.2.9 With Layout > Show keys or group numbers or
the corresponding toolbar button, the entry keys are
displayed next to the dendrogram entries.
However, the entry keys may be long and
uninformative for the user, so the entry keys can be
replaced by a group code. The program assigns a letter
to each defined group, and within a group, each entry
receives a number. The group codes are shown as
follows:
16.2.10 In the parent Comparison window, select Layout
> Use group numbers as keys.
16.2.11 A legend to the group numbers can be obtained
with File > Export database fields in the parent
Comparison window.
16.2.12 With Layout > Show branch lengths or the
corresponding toolbar button, the lengths of the
branches, as numbers of base conversions, are shown.
In more complex trees, the spread of the branches may
not be optimal. The program can iteratively optimize the
spread of the branches:
16.2.13 In the Unrooted dendrogram window, select
Layout > Optimize branch spread or the corresponding
toolbar button.
The user can rotate and swap the branches manually if
the tree layout is not satisfactory.
16.2.14 Left click in the proximity of a node or a branch
tip.
16.2.15 While holding down the mouse button, rotate
the branch to the desired position.
16.2.16 If you select entries in the parent Comparison
window or in the Main window, these entries are shown
within a square in the Unrooted dendrogram window.
16.2.17 You can also select entries directly in the
Unrooted dendrogram window, by holding the CTRL key
while clicking in the proximity of a node. All entries
branching off from this node will be selected.
16.2.18 Repeat this action to unselect entries.
16.2.19 To copy the unrooted tree to the clipboard,
select File > Copy image to clipboard or the
corresponding toolbar button.
Figure 16-3. Unrooted maximum parsimony tree. Number of mutations are indicated on the
branches (top) as well as the bootstrap values (bottom).
16.2.20 The unrooted tree can be printed with File >
Print image or the corresponding toolbar button.
Since interpreting unrooted trees is not always easy,
especially with large numbers of entries, it is possible to
create a rooted dendrogram from the unrooted tree. This
process requires an artificial root to be defined as
follows:
16.2.21 Select a branch by clicking in the proximity of
one of the two nodes it connects. The selected branch is
red.
16.2.22 In the menu, choose Layout > Create rooted tree
or the corresponding toolbar button.
The dendrogram in the parent Comparison window now
is a rooted version of the maximum parsimony or
maximum likelihood tree.
NOTE: all functions described under cluster analysis
apply to such unrooted trees, except the incremental
clustering: one cannot delete or add entries while the
tree is automatically updated.
16.2.23 Close the Unrooted dendrogram window.
16.3 Maximum likelihood clustering
In the Comparison window, select Clustering > Calculate
> Maximum likelihood tree (evolutionary modelling).
The Maximum likelihood clustering dialog box shows up
(Figure 16-4.).
Figure 16-4. Maximum likelihood clustering dialog
box.
As for maximum parsimony clustering, a
Mutation rate can be defined for each individual base
conversion. The default of the program is 25% for each
possible mutation. The maximum likelihood clustering
algorithm also allows a standard deviation to be
calculated for each branch (Estimate errors). This is only
an approximate error estimation, but since maximum
likelihood clustering is very slow, performing a
bootstrap analysis is not practical.
16.3.1 Leave the Mutation rate unaltered, and enable
the Estimate errors checkbox.
16.3.2 Press <OK> to start the calculations.
Maximum likelihood clustering is an extremely
time-consuming process; depending on the length of the
sequences, clustering 30-50 entries may take several
hours on a powerful computer. The calculation time
increases with the third power of the number of entries
included.
When the calculations are finished, an unrooted tree is
shown which has all the same functions as described for
maximum parsimony (16.2).
16.3.3 If you want to see the estimated errors on the
branch lengths, use Layout > Show branch lengths or
the corresponding toolbar button.
17. Advanced clustering and consensus trees
17.1 Introduction
Cluster analysis is one of the most popular ways of
revealing and visualizing hierarchical structure in
complex data sets. As explained before (11.1), cluster
analysis is a collective noun for a variety of algorithms
that have the common feature of visualizing the
hierarchical relatedness between samples by grouping
them in a dendrogram or tree. The most universally
applied methods are pairwise clustering algorithms that
use a distance or similarity matrix as input (Figure
17-1.). UPGMA (Unweighted Pair Group Method using
Arithmetic Averages), Complete Linkage, Single
Linkage, and Ward’s method are examples of such
methods. The advantage of these methods is that they
can be applied to any type of data, as long as there exists
a suitable similarity or distance coefficient that can
generate a similarity (distance) matrix from the data. As
such, similarity-based clustering can be applied to
incomplete data sets or data that is not presented in the
form of a data matrix (e.g., electrophoresis band sizes).
[Figure 17-1: a data matrix (samples by characters) is
converted by a distance/similarity coefficient into a
similarity/distance matrix, from which UPGMA clustering
produces a dendrogram.]
Figure 17-1. Steps in similarity based cluster
analysis.
In the analysis steps outlined in Figure 17-1., one should
consider the matrix of pair-wise similarities (or
distances) as the complete comparative information
between all the samples analyzed. Obviously, for larger
numbers of samples, interpreting a similarity matrix
becomes hardly simpler than looking at the original
data. This is why a similarity matrix is not usually
calculated as a final result, but as an intermediate step
for grouping algorithms such as cluster analysis or
multi-dimensional scaling.
The real simplification of the data is obtained by cluster
analysis. Both the power and the weakness of a
dendrogram lie in its ability to present an easy to
interpret, well-structured, hierarchical grouping of the
samples. Indeed, simplification means loss of
information, and there is no way to present the data in a
simple and easily interpretable way, yet holding all the
information. As a consequence, every dendrogram
resulting from a non-artificial data set will contain
errors, the amount of error being proportional to the
complexity of the similarity matrix. A second source of
error results from the fact that hierarchical clustering
always imposes hierarchical structure, even if the data
does not support it. The fact that even a perfectly
random data set results in a dendrogram with branches,
is a clear example of the danger that hierarchical
clustering holds. Various statistical methods allow the
error associated with dendrogram branches or their
uncertainty to be estimated, e.g., standard deviation
values and the cophenetic correlation. Other methods,
such as bootstrap, allow the probability of dendrogram
branches, as a result of the data set, to be indicated.
17.2 Degeneracy of dendrograms
Another problem with pairwise hierarchical clustering
methods such as UPGMA is the degeneracy of the
solution. Whereas UPGMA results in just one tree, in
many cases there exist a number of equally good
alternative solutions. Such degeneracies are very likely
to occur in cases where the similarity matrix contains
multiple identical values. In practice, binary and
categorical data sets and banding patterns treated as
absent/present states result in frequent occurrence of
identical similarity values, whereas quantitative
measurements registered as decimal numbers almost
never yield identical similarity values. To understand
how the occurrence of identical similarity values can
result in multiple possible trees, we consider the
example of three banding patterns (Figure 17-2.). As can
be seen from this simple example, s[A,B] and s[B,C] are
both 0.75, whereas s[A,C] is 0.50. UPGMA
constructs a dendrogram by first searching for the
highest similarity value in the matrix, and linking the
two samples from which it results. In the present
example, [A,B] and [B,C] are equivalent solutions, so two
partial dendrograms can be constructed: one with [A,B]
linked at 75% (solution 1) and the other with [B,C]
linked at 75% (solution 2). In the next step of UPGMA,
the remaining sample is linked at the average of its
similarity with the samples already grouped. In solution
1, this leads to C being linked at 62.5% to [A,B], whereas
in solution 2, A is linked at 62.5% to [B,C]. Both
dendrograms suggest a quite different hierarchical
relatedness but actually neither of them truly reflects the
relationships suggested by the data set and the
similarity matrix.
[Figure 17-2: banding patterns A, B, and C, the resulting
coefficient matrix, and the two UPGMA dendrograms:
solution 1 (A & B first) and solution 2 (B & C first).]
       A      B      C
A   1.00
B   0.75   1.00
C   0.50   0.75   1.00
Figure 17-2. A scenario of three banding patterns
resulting in two possible UPGMA solutions.
Another inconsistency in pairwise clustering results
from the inability to deal with infringements upon the
transitivity rule of identity. When sample A is identical
to sample B, and sample B is identical to sample C, the
transitivity rule predicts that A will be identical to C as
well. Infringements upon this rule are particularly found
in the comparison of banding patterns, where the
identity of bands is judged based upon their distance,
using a position tolerance value that specifies a
maximum distance between bands to be considered
identical. The example below (Figure 17-3.) illustrates
the result of a UPGMA clustering of three banding
patterns for which one band is slightly shifted. With a
position tolerance as indicated on the figure, the pairs of
patterns [A,B] and [B,C] will have a 100% score,
whereas [A,C] will have only 75% similarity as the
distance between their lower bands is greater than the
position tolerance specified. As explained above,
the UPGMA algorithm has two choices to perform the
first linkage, and the results are displayed as solution 1
and solution 2. Neither of the two dendrograms reflects
the discrepancy indicated by the similarity values, but
instead, each dendrogram falsely suggests a hierarchic
structure that is not supported by the data.
[Figure 17-3: banding patterns A, B, and C with the
position tolerance indicated, the resulting coefficient
matrix, and the two UPGMA dendrograms: solution 1
(A & B first) and solution 2 (B & C first).]
       A      B      C
A   1.00
B   1.00   1.00
C   0.75   1.00   1.00
Figure 17-3. Infringement upon the transitivity rule
for sample identity and resulting dendrograms.
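The degeneracy in the example of Figure 17-2. can be verified with a short calculation. The Python sketch below lists every pair that shares the highest similarity value, which are the candidates for the first UPGMA linkage, and then computes the level at which the remaining sample joins as the arithmetic average of its similarities with the members of the pair. The function names and the dictionary layout are choices made for this example; it covers only the three-sample case and is not a full UPGMA implementation.

```python
def first_linkage_candidates(sim):
    """All pairs that share the highest similarity value in the matrix."""
    best = max(sim.values())
    return sorted(tuple(sorted(pair)) for pair, s in sim.items() if s == best)


def linkage_level(sim, pair, other):
    """Average similarity between `other` and the members of the linked pair."""
    return sum(sim[frozenset({member, other})] for member in pair) / len(pair)


# Similarity matrix of Figure 17-2.
sim = {
    frozenset({"A", "B"}): 0.75,
    frozenset({"B", "C"}): 0.75,
    frozenset({"A", "C"}): 0.50,
}

solutions = []
for pair in first_linkage_candidates(sim):
    (other,) = {"A", "B", "C"} - set(pair)
    solutions.append((pair, sim[frozenset(pair)], other, linkage_level(sim, pair, other)))

for pair, first_level, other, second_level in solutions:
    print(pair, "linked at", first_level, "then", other, "joins at", second_level)
```

Two equally valid first linkages are found, each at 0.75, and in both cases the third sample joins at 0.625. The input order of the matrix, not the data, would decide which of the two dendrograms is drawn.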
17.3 Consensus trees
A more truthful representation of the relationships
given in Figure 17-2. and Figure 17-3. can only be
obtained by respecting the indeterminacy resulting from
the identical similarity values. Using the conventional
pairwise linkage dendrogram representation, this
cannot be achieved, and therefore, a new dendrogram
type has been introduced in BioNumerics, allowing
more than two entries or branches to be linked together.
The resulting tree can be called a consensus tree because it
allows all entries that are part of a degeneracy to be
linked at one similarity level in a single consensus
branch (Figure 17-4.). To obtain such a consensus
representation of different possible trees, BioNumerics
will first calculate all possible solutions and draw a
consensus tree that uses pairwise linkage as the primary
criterion, but applies multilinkage in those cases where
branches or entries are degenerated.
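As a rough illustration of the idea, the sketch below keeps only those clusters that occur in every alternative solution, so that entries involved in a degeneracy end up linked together in one branch. The nested-tuple tree representation, the function names and the two example solutions are assumptions made for this sketch and do not describe the internal BioNumerics implementation.

```python
def leaves(tree):
    """Return the set of entry names below a node (a str or a tuple)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clusters(tree):
    """Return the leaf set of every internal branch of a nested-tuple tree."""
    if isinstance(tree, str):
        return set()
    found = {leaves(tree)}
    for child in tree:
        found |= clusters(child)
    return found


def consensus_clusters(*trees):
    """Keep only the clusters that occur in every tree."""
    shared = clusters(trees[0])
    for tree in trees[1:]:
        shared &= clusters(tree)
    return sorted(shared, key=lambda c: (len(c), sorted(c)))


# Two hypothetical solutions that differ only in the first linkage.
solution_1 = ((("A", "B"), "C"), "D")
solution_2 = (("A", ("B", "C")), "D")
for cluster in consensus_clusters(solution_1, solution_2):
    print(sorted(cluster))
```

In this example the pairs (A,B) and (B,C) each occur in only one solution, so the consensus links A, B and C together at one level before joining D.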
[Figure 17-4: dendrograms of entries A, B, C and D for
Solution 1, Solution 2 and the Consensus, on a similarity
scale from 1.00 to 0.50.]
Figure 17-4. Displaying different UPGMA solutions
as a consensus branch.
Another advantage of the presentation method that
supports multilinkage of entries or branches is that it
can be used to calculate consensus trees from trees
generated from different data sets as well. The same
algorithms can be applied to compare the different and
common branches on the trees, and the example shown
in Figure 17-4. could as well be a case where Solution 1
and Solution 2 result from different data sets.
17.4 Advanced clustering tools
The advanced clustering tools in BioNumerics offer
some additional functionality compared to the standard
clustering tools in the Comparison window. This
functionality is related to the possibility of linking more
than two entries or branches together, as shown in
Figure 17-4. As such, it becomes possible to display
multiple solutions of a cluster analysis in a consensus
representation, as well as representing two trees from
different data sets in one consensus tree. In addition,
each tree obtained using the advanced clustering tools is
automatically saved, which makes it possible to have
more than one stored tree per experiment type. This
feature is useful if one wants to compare trees generated
using different similarity coefficients or using different
parameters such as position tolerance in case of banding
patterns.
17.5 Displaying the degeneracy of a tree
In BioNumerics, select a data type that can potentially
result in multiple tree solutions, for example, the
Fingerprint Type RFLP1 in Demobase.
17.5.1 In Demobase, select all entries except those named
STANDARD and create a Comparison.
17.5.2 Select the Fingerprint Type RFLP1 from the
bottom bar of the Comparison window and choose
Clustering > Calculate > Cluster analysis (similarity
matrix). This pops up the Comparison settings dialog box
(Figure 11-2.), which shows five clustering options
(UPGMA, Ward, Neighbor Joining, Single Linkage and
Complete Linkage) and an option Advanced.
17.5.3 If the option Advanced is checked, a button
<Settings> becomes available, which will open the
Advanced cluster analysis dialog box (Figure 17-5.).
Figure 17-5. The Advanced cluster analysis dialog
box.
Under Primary criterion, the criterion for clustering can
be chosen, which can be UPGMA, Single Linkage or
Complete Linkage. All three methods are pairwise
clustering algorithms, i.e. they construct
dendrograms by grouping branches and/or entries pair
by pair, using the highest similarity as the criterion. In
UPGMA the similarity between clusters is calculated as
the average of all individual similarities between the
clusters, whereas in Single Linkage it is the highest
similarity found between the clusters. In Complete
Linkage, it is the lowest similarity found between the
clusters.
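The three primary criteria differ only in how the individual similarities between two clusters are combined. A minimal sketch, assuming a dictionary of pairwise similarities and hypothetical entry names and values:

```python
from itertools import product


def cluster_similarity(sim, cluster_a, cluster_b, method="UPGMA"):
    """Similarity between two clusters under a pairwise linkage criterion.

    `sim` maps frozenset({x, y}) to the similarity of entries x and y.
    """
    values = [sim[frozenset((x, y))] for x, y in product(cluster_a, cluster_b)]
    if method == "UPGMA":
        return sum(values) / len(values)   # average of all individual pairs
    if method == "single":
        return max(values)                 # highest similarity found
    if method == "complete":
        return min(values)                 # lowest similarity found
    raise ValueError(f"unknown method: {method}")


# Hypothetical similarities (in %) between the entries of two clusters.
sim = {
    frozenset(("A", "C")): 90.0,
    frozenset(("A", "D")): 80.0,
    frozenset(("B", "C")): 70.0,
    frozenset(("B", "D")): 60.0,
}
upgma = cluster_similarity(sim, ["A", "B"], ["C", "D"], "UPGMA")
single = cluster_similarity(sim, ["A", "B"], ["C", "D"], "single")
complete = cluster_similarity(sim, ["A", "B"], ["C", "D"], "complete")
print(upgma, single, complete)
```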
The Secondary criterion applies to those cases where two
clusters have the same (highest) similarity with a third,
in which case two different tree solutions exist. The
program will then apply one of the following criteria to
solve the indeterminacy left by the standard clustering
algorithm (i.e., the primary criterion):
(1) Highest overall similarity: the two clusters will be
joined that result in the cluster with the highest overall
similarity with all other members of the comparison.
(2) Largest number of entries: the two clusters will be
joined that result in the cluster with the largest number
of entries.
(3) Most homogeneous clusters: the two clusters will be
joined that result in a cluster that has the highest internal
homogeneity.
Note that criteria (1) and (3) are complementary to each
other as (1) will only consider the external similarity
values of the resulting clusters whereas (3) will only
consider their internal similarity values.
Under Degeneracy, there are three options to deal with
degenerated trees:
(1) Do not calculate will not look for degeneracies and
will display just one solution. The differences with a
conventional cluster analysis are that (i) the solution
presented is the best according to the secondary
criterion specified, and (ii) the resulting tree is saved
automatically as an Advanced Tree and can be used
together with other Advanced Trees to calculate a
Consensus Tree.
(2) The option Primary criterion will calculate all
degeneracies resulting from the primary criterion only
and will not consider any secondary criterion specified.
(3) Primary + secondary criterion will use the specified
secondary criterion to solve the degeneracies resulting
from the primary criterion and will only display the
degeneracies that remain after the secondary criterion. It
is very unlikely that there will remain any degeneracies
with this option checked.
Figure 17-6. Advanced tree representation with a highlighted cluster, indication of the number of degenerated
entries relative to the cluster, and the degenerated entry selected.
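A degeneracy arises whenever more than one pair shares the highest similarity at a linkage step. The sketch below detects this for the first linkage only; the similarity values are hypothetical, and exact equality is used for simplicity, whereas measured similarities may need a small tolerance.

```python
def tied_first_linkages(sim):
    """Return all pairs that share the highest similarity in `sim`.

    `sim` maps (entry, entry) tuples to similarity values. More than
    one returned pair means that the first linkage is degenerate.
    """
    best = max(sim.values())
    return sorted(pair for pair, value in sim.items() if value == best)


# Hypothetical similarities with two equally good first linkages.
sim = {("A", "B"): 0.75, ("A", "C"): 0.50, ("B", "C"): 0.75}
ties = tied_first_linkages(sim)
print(ties)
```

When such a tie is found, a secondary criterion (or a consensus branch) is needed to avoid an arbitrary choice between the alternatives.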
The Cut off above parameter specifies the maximum
allowed number of degenerate entries relative to a
cluster. A degenerate entry is an entry that does not
belong to a given cluster in the present tree, but that
does belong to the cluster in at least one alternative
solution. If zero is entered as cutoff value, no degenerate
entries are allowed and as a consequence, a consensus
tree is generated that includes all possible solutions. If
the field is left blank, the degeneracy of the tree will not
be reduced at all. If a number is entered, for example 2,
all clusters for which there are more than 2 degenerate
entries will be displayed as consensus clusters with the
degenerate entries included.
Each cluster that has degenerate entries relative to it,
will have an indication of the number of degenerate
entries (see Figure 17-6., which shows 1 degenerated
entry for the selected cluster).
17.5.4 When a cluster is selected by clicking on its
branching node, the cluster is filled in gray (Figure
17-6.), which makes it easier to see which entries belong
to it.
17.5.5 If there are degenerated entries relative to the
highlighted cluster, you can find them by choosing
Clustering > Advanced trees > Select degenerate entries.
All degenerate entries relative to the cluster are now
added to the selection.
Interpreting degeneracies and tracing their cause
is sometimes difficult. The larger the tree
and the deeper the branch, the more complex the
degeneracies will be. The example screen in Figure 17-6.
is a capture taken from experiment RFLP1 in the
Demobase. The highlighted cluster has one degenerated
entry, which is selected. The cluster consists of two
subclusters which have an overall average similarity of
93.3%. The single degenerate entry, however, also has an
average similarity of 93.3% with the second subcluster.
The present solution has first linked subcluster 1 to
subcluster 2 and then linked the single entry to the
merged cluster. According to the criterion of UPGMA,
however, an equivalent solution would be to first link
the single entry to subcluster 2 and then link subcluster
1 to this new cluster. When the same clustering is done
with zero as cutoff value, the cluster looks as shown in
Figure 17-7. Note that the three subclusters are now linked
together at the same level. The clusters that connect
always at the displayed similarity level in the solution
obtained using the secondary criterion are represented
by solid lines (in the present case, the single entry),
whereas subclusters that cluster at higher levels using
the secondary criterion are connected by an interrupted
line.
Figure 17-7. Detail of cluster highlighted in Figure
17-6., calculated with a cut off value of zero.
17.6 Creating consensus trees
The advanced clustering tool allows a Consensus Tree to
be calculated from two or more individual
dendrograms. These trees can be conventional
clusterings or Advanced Trees, and can be generated
from the same experiment type or from different
experiment types. In case you want to calculate different
dendrograms from the same experiment type, you
should use the Advanced Clustering tools. To create a
Consensus Tree, the program will look for all branches
that hold exactly the same entries in both trees and
represent them as branches in the Consensus Tree.
17.6.1 As an example, we can calculate two
dendrograms in Demobase: one from experiment
Phenotest using Pearson correlation and the other from
experiment 16S rDNA. You can calculate the trees using
the conventional clustering tools or using the Advanced
Clustering tools.
17.6.2 Select Clustering > Advanced trees > Create
consensus tree, which pops up a dialog box listing the
Stored trees (Figure 17-8.).
Figure 17-8. Stored trees dialog box to calculate a
Consensus Tree.
17.6.3 Select the two calculated trees, which have the
name of the experiment types they were derived from,
and enter a name for the Consensus Tree to be generated
(the default name is Consensus). With the option Correct
for scale differences, the dendrograms will first be
rescaled so that they have the same similarity ranges.
The result is that dendrograms covering a narrow
similarity range will have more impact on the
Consensus Tree when this option is checked.
After clicking <OK>, the Consensus Tree is calculated,
and only the clusters that contain exactly the same
entries in both dendrograms are displayed.
17.7 Managing Advanced Trees
Advanced Trees exist as long as a Comparison is
opened. Unlike conventional trees, however, they are not
stored along with a Comparison and will disappear after
the Comparison window is closed.
17.7.1 An Advanced Tree can be displayed by selecting
it from the list that appears in Clustering > Advanced
trees. The currently displayed tree is flagged in the
menu. The currently displayed tree can be deleted with
Clustering > Advanced trees > Delete current.
A number of dendrogram editing functions under the
Clustering menu are not applicable to Advanced Trees.
18. Minimum Spanning Trees for population modelling
18.1 Introduction
Minimum spanning trees (MSTs) have long been known
in the context of mathematical topology. When a
set of distances is given between n samples, a minimum
spanning tree is the tree that connects all samples in
such a way that the summed distance of all branches of
the tree is minimized.
In a biological context, the MST principle and the
maximum parsimony (MP) principle share the idea that
evolution should be explained with as few events as
possible. There are, however, major differences between
MP and MST. The MP method allows the introduction
of hypothetical samples, i.e. samples that are not part of
the data set. Such hypothetical samples are created to
construct the internal branches of the tree, whereas the
real samples from the data set occupy the branch tips.
The phylogenetic interpretation of the internal branches
is that they are supposed to be common ancestors of
current samples, which do not exist anymore but which
are likely to have existed in the past, under the criterion
of parsimony.
The MST principle, in contrast, requires that all samples
are present in the data set to construct the tree. Internal
branches are also based upon existing samples. This
means that, when a MST is calculated for evolutionary
studies, there are two important conditions that have to
be met: (1) the study must focus on a very short time
frame, assuming that all forms or states are still present,
and (2) the sampled data set must be complete enough
to enable the method to construct a valid tree, i.e.
representing the full biodiversity of forms or states as
closely as possible. Through these restricting conditions,
the method of MST is only applicable for specific
purposes, of which population modelling
(microevolution) is a good example.
The trees resulting from MP on the one hand, and MST
on the other hand, also have a topological difference.
The MP method assumes that two (related) samples have
evolved from one common ancestor through one or
more mutations at either side. This normally results in a
bifurcating (dichotomic) tree: the ancestor at the
connecting node, and the samples at the tip. A MST
chooses the sample with the highest number of related
samples as the root node, and derives the other samples
from this node. This may result in trees with star-like
branches, and allows for a correct classification of
population systems that have a high mutation or
recombination rate, where a large number of SLVs
(single locus variants) may evolve from one common
type [1].
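The definition given at the start of this section (the tree that connects all samples with the smallest summed branch length) can be sketched with Prim's algorithm. This is a generic textbook construction, not the BioNumerics implementation: ties are broken by index order rather than by any biological priority rule, and the type names and distances are hypothetical.

```python
def minimum_spanning_tree(names, dist):
    """Prim's algorithm on a full distance matrix.

    `dist[i][j]` is the distance between samples i and j. Returns a
    list of (parent, child, distance) branches. Ties are broken by
    index order here, not by biological priority rules.
    """
    n = len(names)
    in_tree = {0}
    branches = []
    while len(in_tree) < n:
        # Shortest branch from any sample in the tree to one outside it.
        d, i, j = min(
            (dist[i][j], i, j)
            for i in in_tree
            for j in range(n)
            if j not in in_tree
        )
        in_tree.add(j)
        branches.append((names[i], names[j], d))
    return branches


# Hypothetical distances (number of differing loci) between four types.
names = ["T1", "T2", "T3", "T4"]
dist = [
    [0, 1, 1, 2],
    [1, 0, 2, 3],
    [1, 2, 0, 1],
    [2, 3, 1, 0],
]
branches = minimum_spanning_tree(names, dist)
total = sum(d for _, _, d in branches)
print(branches, total)
```

Note that every node of the resulting tree is an existing sample; no hypothetical ancestors are introduced, in contrast to maximum parsimony.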
An important restriction of MSTs is that they can only be
calculated from a true distance matrix. A criterion for a
true distance matrix is that, given three samples A, B,
and C, the distance from A to C should never be longer
than the summed distance from A to B and B to C. This
restriction implies that MSTs are not compatible with all
data types. For example, a distance matrix based upon
pairwise compared DNA fragment patterns does not
fulfill this criterion, and hence, cannot be used for MST
analysis. On the other hand, a distance matrix based
upon a global band matching table can be used. In
theory, all experiments that produce categorical data
arrays (i.e. multistate character arrays) or binary data
arrays are suitable for analysis with the MST method.
The most typical applications for use with MSTs,
however, are categorical Multilocus Sequence Typing
(MLST) data used in population genetics and
epidemiological studies.
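The criterion for a true distance matrix (the triangle inequality) can be checked directly. The sketch below is a plain illustration with hypothetical matrices; the second matrix mimics pairwise compared patterns in which A equals B and B equals C within tolerance, while A and C differ.

```python
from itertools import permutations


def is_true_distance_matrix(dist):
    """Check d(A,C) <= d(A,B) + d(B,C) for every triple of samples."""
    n = len(dist)
    return all(
        dist[a][c] <= dist[a][b] + dist[b][c]
        for a, b, c in permutations(range(n), 3)
    )


# Hypothetical matrices for three samples A, B and C.
metric = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
# A-B and B-C are identical (distance 0) but A-C is not:
non_metric = [[0, 0, 1], [0, 0, 0], [1, 0, 0]]

print(is_true_distance_matrix(metric))
print(is_true_distance_matrix(non_metric))
```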
18.2 Minimum spanning trees in
BioNumerics
The MST method usually provides many equivalent
solutions for the same problem, i.e. one data set can be
clustered into many MSTs with a different topology but
with the same total distance. Therefore, a number of
priority rules, with respect to the linkage of types in a
tree, have been adopted from the BURST program (see
the MLST website http://www.mlst.net or Feil et al.,
2003 [2]) to reduce the number of possible trees to those
that have the most probable evolutionary interpretation.
These rules assign priority, in decreasing order, to (1)
types that have the highest number of single locus
variants (SLVs) associated, (2) the highest number of
DLVs (double locus variants) associated (in case of
equivalent solutions), and (3) the highest number of
samples belonging to the type. In BioNumerics, the most
frequent states can also be used as a priority rule, and
each of these rules can be assigned the first priority.
1. Maynard Smith, J., N.H. Smith, M. O'Rourke, and B.G.
Spratt. 1993. PNAS 90: 4384-4388.
2. Feil, E.J., J.E. Cooper, H. Grundmann, D.A. Robinson,
M.C. Enright, T. Berendt, S.J. Peacock, J. Maynard Smith, M.
Murphy, B.G. Spratt, C.E. Moore, and N.P.J. Day. 2003. J.
Bacteriol. 185: 3307-3316.
As discussed in the introduction, a pure minimum
spanning tree assumes that all types needed to construct
a correct tree, are present in the sampled data.
Conversely, algorithms like maximum parsimony will
introduce hypothetical nodes for every internal branch,
while the samples from the data set define the branch
tips.
The major problem with the minimum spanning tree
algorithm in this view is that it requires a very complete
data set to obtain a probably correct tree topology. In
reality, a number of existing types may not have been
included in the sampled data set. If such missing
samples represent central nodes in the "true" MST, their
absence may cause the resulting tree to look very
different, with a much larger total spanning.
The MST algorithm in BioNumerics offers an elegant
solution to this problem, by allowing hypothetical types
to be introduced that cause the total spanning of the tree
to decrease significantly. In the context of MLST, these
are usually missing types for which a number of SLVs
(single locus variants) are present in the data set. From
an evolutionary point of view, it is very likely that such
types indeed exist, explaining the existence of SLVs.
18.3 Calculating a minimum spanning
tree
The Demobase doesn't contain a categorical data set
such as MLST type data. However, the MST method can
also be applied to binary data. Therefore, you can either
choose to create a binary data set using the RFLP1
fingerprint data set by calculating a global band
matching table as explained in 10.2, or you can copy the
sample MLST database which is provided on the CD
ROM. This database is a subset of 500 Neisseria
meningitidis strains, downloaded into BioNumerics from
the Multi Locus Sequence Typing home page
(http://www.mlst.net).
Figure 18-1. The Minimum spanning tree dialog box.
18.3.1 To generate a binary data type from RFLP1,
follow the instructions given in 10.1 and 10.2 so as to
obtain a Composite Data Set containing the presence/
absence values of the band classes for all the entries in
Demobase.
18.3.2 To install the sample MLST database of Neisseria,
run the install program MLST Neisseria install.exe on
the CD ROM, located in the folder Examples\MLST.
This program will automatically install a new database
and prompt you for the default database directory
(c:\Program files\BioNumerics\data). If this is the
correct path, press Unzip to install the database.
Otherwise, enter the correct path, and after installation,
change the path in BioNumerics Startup so that MLST
Neisseria.dbs points to the correct directory (see 1.4).
18.3.3 Select all entries in the database and create a new
comparison.
18.3.4 To calculate a minimum spanning tree, select
Clustering > Calculate > Minimum spanning tree
(population modeling), or press the corresponding
button and, from the floating menu that appears, select
the corresponding item.
The Minimum spanning tree dialog box appears as
depicted in Figure 18-1. This dialog box consists of four
panels, concerning (1) the treatment of Hypothetical
types, (2) the Coefficient to calculate the distance matrix,
(3) the Priority rule for linking types in the tree, and
(4) the settings for the Creation of complexes.
•Hypothetical types:
With the checkbox Allow creation of hypothetical types
(missing links), you can allow the algorithm to
introduce hypothetical types as branches of the MST, as
described in 18.2. When enabled, the following criteria
can be specified:
•Create only if total distance is decreased with at least
(default 1) changes: A hypothetical type will only be
accepted if its introduction decreases the total spanning
of the tree by at least the specified number of changes
(one, by default).
•And if at least (default 3) neighbors have no more than
(default 1) changes: The algorithm will only accept
hypothetical types that have at least 3 neighbors
(closest related types) that differ in no more than 1
change (see also 18.2 for the interpretation of this
rule).
•Coefficient:
The choice is offered between Categorical, for
categorical data, Binary, for binary data, and Summed
absolute distance. In the latter option, the sum of the
absolute differences between the values of any two
corresponding states is calculated, and the thus obtained
distances are used to calculate the MST. This option can
be used to cluster non-binary, non-categorical data with
integer values. If non-integer (decimal) values are used,
the program will round them to the closest integers.
In the Summed absolute distance option, an Offset and a
Saturation value can be specified. For each character
compared between two types, the offset value
determines a fixed distance that is added to the distance
of these characters. For each character compared
between two types, the saturation determines the
maximum value the distance can take. In other words,
above the saturation distance, different characters are all
seen equally different. The relation between offset,
saturation, and distance of characters is illustrated in
Figure 18-2. The offset and saturation can be used to tune
the summed distance result between fully categorical
(offset = 1 and saturation = 1) and fully numerical
(offset = 0 and saturation = ∞).
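The effect of the offset and saturation values can be sketched as follows. The exact transformation used by the program is not given in the text, so the form below is an assumption: the offset is added to every non-zero character distance, the result is capped at the saturation value, and identical states keep a distance of zero. The function names and the example types are hypothetical.

```python
import math


def character_distance(value_a, value_b, offset=0, saturation=math.inf):
    """Transformed distance for one character (assumed form)."""
    # Non-integer values are rounded first; note that Python's round()
    # rounds exact halves to the nearest even integer.
    raw = abs(round(value_a) - round(value_b))
    if raw == 0:
        return 0  # assumption: identical states stay at distance 0
    return min(raw + offset, saturation)


def summed_absolute_distance(type_a, type_b, offset=0, saturation=math.inf):
    """Sum of the transformed character distances between two types."""
    return sum(
        character_distance(a, b, offset, saturation)
        for a, b in zip(type_a, type_b)
    )


type_a = [1, 5, 3, 7]
type_b = [1, 2, 4, 7]

numerical = summed_absolute_distance(type_a, type_b)            # 3 + 1
categorical = summed_absolute_distance(type_a, type_b, 1, 1)    # 1 + 1
intermediate = summed_absolute_distance(type_a, type_b, 1, 3)   # 3 + 2
print(numerical, categorical, intermediate)
```

With offset = 1 and saturation = 1, every differing character contributes exactly one change, which is the categorical behavior; with offset = 0 and no saturation, the plain summed absolute difference is obtained.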
•Priority rule
In case of equivalent solutions in terms of calculated
distance, the priority rule allows you to specify a
priority based upon other criteria than distance. The
different options include:
•Highest number of SLVs (single locus variants): in case
two types have an equal distance to a linkage position
in the tree, the type that has the highest number of
single locus variants (i.e. other types that differ only in
one state or character) will be linked first.
•Highest number of SLVs and DLVs (double locus
variants): same as above, but types that differ in two
states will be considered equally.
•Highest number of entries: The program counts how
many entries each unique type contains, and the type
that has the highest number of entries will be assigned
priority, in case of equivalent linkage possibilities.
•Most frequent states: The program calculates a
frequency table for each state of each character. Types
are thus ranked based upon the product of
frequencies of their characters. In case of equivalent
possibilities, types that have the highest rank are
linked first.
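The first two priority rules rely on counting single and double locus variants per type. A minimal sketch, assuming allelic profiles stored as tuples; the type names and profiles are hypothetical, and the selection at the end only illustrates how the two rules can favor different types.

```python
def changes(profile_a, profile_b):
    """Number of characters (loci) in which two types differ."""
    return sum(a != b for a, b in zip(profile_a, profile_b))


def variant_counts(types):
    """Map each type name to its (number of SLVs, number of DLVs)."""
    counts = {}
    for name, profile in types.items():
        others = [p for other, p in types.items() if other != name]
        slv = sum(changes(profile, p) == 1 for p in others)
        dlv = sum(changes(profile, p) == 2 for p in others)
        counts[name] = (slv, dlv)
    return counts


# Hypothetical allelic profiles (seven loci) for four types.
types = {
    "a": (1, 1, 1, 1, 1, 1, 1),
    "b": (2, 1, 1, 1, 1, 1, 1),
    "c": (1, 3, 1, 1, 1, 1, 1),
    "d": (2, 1, 1, 1, 1, 5, 4),
}
counts = variant_counts(types)
# Rule "Highest number of SLVs": only single locus variants count.
best_by_slv = max(counts, key=lambda name: counts[name][0])
# Rule "Highest number of SLVs and DLVs": both are counted equally.
best_by_slv_dlv = max(counts, key=lambda name: sum(counts[name]))
print(counts, best_by_slv, best_by_slv_dlv)
```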
[Figure 18-2: Transformed distance (vertical axis) plotted
against Distance (horizontal axis), with the Offset and
Saturation levels marked.]
Figure 18-2. Graphical representation of the meaning of offset and saturation values.
Figure 18-3. The Minimum spanning tree window.
•Creation of complexes
In epidemiological population genetics based upon
MLST, a clonal complex can be defined as a single group
of isolates sharing identical alleles at all investigated
loci, plus single-locus variants that differ from this
group at only one locus [1]. In another, more relaxed
definition [2,3], a clonal complex includes all types that
differ in x loci or fewer from at least one other type of the
complex (x is usually taken as 1 or 2). Under this
definition, not all types of a complex are necessarily
SLVs or DLVs of one another. The latter definition is
used in BioNumerics.
The maximum number of changes allowed to form
complexes can be specified; the default value is 1. In
addition, one can also specify a minimum number of
types that should be included before the group is
defined as a complex. The default value is 2.
1. Feil, E.J., J. Maynard Smith, M.C. Enright, and B.G.
Spratt. 2000. Genetics 154: 1439-1450.
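Under the relaxed definition, a clonal complex is a group of types connected by chains of small differences. The sketch below groups types accordingly, using the two parameters described in the text (maximum number of changes, minimum number of types) with their stated default values. The function names and profiles are hypothetical, and the procedure is a generic connected-components search, not necessarily the one used by the program.

```python
def changes(profile_a, profile_b):
    """Number of characters (loci) in which two types differ."""
    return sum(a != b for a, b in zip(profile_a, profile_b))


def complexes(types, max_changes=1, min_types=2):
    """Group types linked by chains of at most `max_changes` differences."""
    unvisited = set(types)
    result = []
    while unvisited:
        start = min(unvisited)  # deterministic starting point
        unvisited.remove(start)
        group, frontier = [start], [start]
        while frontier:
            current = frontier.pop()
            linked = sorted(
                name for name in unvisited
                if changes(types[current], types[name]) <= max_changes
            )
            for name in linked:
                unvisited.remove(name)
                group.append(name)
                frontier.append(name)
        if len(group) >= min_types:
            result.append(sorted(group))
    return result


# Hypothetical profiles: a-b and b-c differ in one locus, a-c in two.
types = {
    "a": (1, 1, 1, 1),
    "b": (2, 1, 1, 1),
    "c": (2, 3, 1, 1),
    "d": (5, 6, 7, 8),
}
found = complexes(types)
print(found)
```

Types a and c end up in the same complex although they are not single locus variants of one another, because both are linked through type b.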
2. Feil, E.J., J.E. Cooper, H. Grundmann, D.A. Robinson,
M.C. Enright, T. Berendt, S.J. Peacock, J. Maynard Smith, M.
Murphy, B.G. Spratt, C.E. Moore, and N.P.J. Day. 2003. J.
Bacteriol. 185: 3307-3316.
3. BURST (Based Upon Related Sequence Types) program
description, see the MLST website http://www.mlst.net.
18.4 Interpreting and editing a minimum
spanning tree
After pressing <OK> in the Minimum spanning tree dialog
box, the Minimum spanning tree window will pop up. In
the example shown in Figure 18-3., a band matching
table of RFLP1 in the Demobase was created and
analyzed as Binary, with a Maximum neighbor distance
of 2 changes, while the other parameters were left at
their defaults.
The window is divided into three panels: the
lower left panel (tree panel) displays the MST, the
upper left panel (types panel) lists the selected type or
types, and the right panel (complexes panel) displays the
composition of the complexes.
•Display options
In the Tree panel, each type is represented by one node or
branch tip, displayed as a circle; the circles are connected
by branches. In the default settings, the following
information can be derived from the tree view:
•When sufficiently zoomed (using the zoom buttons),
a letter code will appear within
each circle, uniquely identifying each type. In case of
more than 26 types in total, a two-letter code is used,
of which the second can be a digit 1-9 as well. The
codes are assigned alphabetically according to the
Priority rule specified (see 18.3).
•The length of the branches is proportional to the
distance between the types, and the thickness, dotting,
and graying of the branch lines also indicate the
distance between the nodes.
•The number of entries contained in a type (node) is
indicated using a color ranging from white through three
shades of blue to brown and red.
In the Complexes panel, the complexes are displayed as
defined under the specified calculation settings (see 18.3,
Creation of complexes).
•Each complex is shown as a rooted tree, with the type
having the highest priority, as defined by the Priority
rule (18.3) defining the root. On top of the complex
panel, the character values of the root type are
indicated. The branch lengths of the derived types
(i.e., the types branching from the root) are in
proportion to the distances of these types.
•For each type branching off from the root type, the
change(s) is (are) indicated as two numbers separated
by a colon. The first number is the character number,
and the second number is the value to which the
character has changed. For example, 6:3 means that
character 6 has changed into 3 for this type. If more
than one change has led to a derived type, the changes
are indicated next to each other.
•As on the tree, the types are indicated with a
color reflecting the number of entries contained in the
type. In addition, the number of entries is written just
below the type code.
The Types panel displays the details of the highlighted
type(s) in the tree panel or the complexes panel. If a type is
selected in the tree panel, it becomes highlighted by a red
circle, and marked with a red flag. The same type
becomes highlighted in the complexes panel, by a red
rectangle. For the highlighted type, detailed information
is shown in the types panel.
•On top of the panel, the character names and character
values (on green background) are shown for the
highlighted type. The frequencies of the character
values are indicated in gray.
•To the left of the character list is the name of the type
with the number of entries it contains between brackets.
•To the right of the character list is the number of SLVs
(single locus variants; types differing only in one
character) and DLVs (double locus variants; types
differing in two characters).
•Under the character list, the entries contained in the
type are listed vertically. If the entries are selected in
the Comparison window, this is indicated here as well,
with the same blue arrows. Selections can be made in
this list using the CTRL and SHIFT keys, and the entry
card can be popped up by double clicking on an entry.
•In case more than one type is highlighted in the tree
panel or the complexes panel, the highlighted types are
displayed under each other in the types panel.
Characters that are the same for more than 50% of the
types are shown on a green background. Characters
for which there is no >50% consensus are shown on a
white background. A character that differs from
the majority in a type is indicated in red. Note that, as
soon as more than one type is highlighted, the entries
are not listed anymore in the types panel.
•Edit options
18.4.1 With Edit > Display settings or the corresponding
button, the display
options can be customized in the Display settings dialog
box (Figure 18-4.).
18.4.2 Under Cell color, you can use a color to display
the number of entries, the groups, or the groups pie
charts. The colors are displayed both in the tree panel and
the complex panel.
With Number of entries selected, a differential color
will be assigned to the nodes according to the number of
entries they contain. The intervals can be specified
under Number of entries coding.
With Groups selected, the colors assigned to the groups
(see 11.6) in the comparison (if any), will be given to the
nodes. When a type consists of more than one group, it
will become black. Groups (pie chart) is similar, except
that, in case a type (node) consists of more than one
group, the different groups will be represented in a pie
chart. In the complex panel, the different group colors are
also displayed in the type boxes, in a proportional way.
18.4.3 Number of entries coding is only enabled when
Number of entries is selected under Cell color.
NOTE: By default, the first color (white) is set as <= 0.
This means that only empty nodes are white. This is
useful to visualize hypothetical nodes (see 18.3,
"Calculating a minimum spanning tree") when this option is
enabled. When no hypothetical nodes are allowed, it is
more useful to enter a positive value, for example <= 1,
as has been done in Figure 18-3.
Figure 18-4. The Display settings dialog box in the
MST window.
18.4.4 Under Distance coding, you can specify the
distance between types that corresponds with the
different line types offered by the program.
18.4.5 With Distance reduction, you can change the
length of the branches. In the tree panel, this only
changes the zoom, but in the complex panel, this value
will determine the horizontal distance between the types
displayed. With a distance reduction of e.g. 1.5x, the
distance unit is decreased by a factor 1.5.
18.4.6 The option Display complexes allows you to
choose whether the complexes are displayed or not.
Note that this option only applies to the tree panel; the
complexes remain displayed in the complex panel.
18.4.7 Using Compact complexes, you can choose to
display a full complex as one node on the tree. The
diameter of the circle is (slightly) proportional to the
number of types the complex contains.
18.4.8 With Use color, you can display the image in color
or grayscale mode.
18.4.9 Scale with member count is an option that lets the
diameter of the circles depend on the number of members
they contain.
18.4.10 Under Type labeling, it is possible to select the
(default) Letter code which is automatically assigned to
the types by the program, or any of the information
fields the database contains. In the latter case, types
(nodes) that do not all have the same string will be
marked with ???. Note also that you may have to zoom
in sufficiently to visualize longer labels than the letter
codes. If the labels do not fit within the circle, they are
represented by ... .
18.4.11 The option Show state information relates to the
types panel, where the states of any selected types can be
displayed. When this option is unchecked, the states of
the characters for the selected types are not displayed.
18.4.12 With Show distances, you can have the distances
indicated on the branches of the tree.
As indicated earlier, it is possible to highlight types on
the tree or in the complexes panel. You can use the SHIFT
or CTRL keys to highlight multiple types, or drag a
rectangle with the mouse in the tree or the complex
panel. For a single highlighted type, you can select
individual entries directly in the types panel.
18.4.13 For one or more highlighted types, it is also
possible to select all the entries directly from the tree
panel, by pressing the corresponding button or choosing
Edit > Select all entries in selected nodes from the menu.
18.4.14 Likewise, it is possible for any selected entries to
highlight all the types where these entries occur, using
the corresponding button or Edit > Select nodes that
contain selected entries.
18.4.15 With Edit > Select related nodes, you can
highlight all the types that have no more than a specified
number of changes from the highlighted type(s). When
choosing this menu command, the program asks to
enter the maximum distance from the highlighted
type(s).
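The selection of related nodes can be pictured with a small sketch. It assumes that the distance between two types is simply the number of characters whose states differ; the function names and the example profiles are hypothetical and are not part of BioNumerics.

```python
def changes(profile_a, profile_b):
    """Count the characters whose states differ between two types."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must have the same number of characters")
    return sum(1 for a, b in zip(profile_a, profile_b) if a != b)


def select_related(types, highlighted, max_distance):
    """Return the names of all types within max_distance changes of any
    highlighted type (the highlighted types themselves are included)."""
    related = set()
    for name, profile in types.items():
        if any(changes(profile, types[h]) <= max_distance for h in highlighted):
            related.add(name)
    return related


# Hypothetical types with four categorical characters each.
types = {
    "A": (1, 2, 3, 4),
    "B": (1, 2, 3, 5),   # 1 change from A
    "C": (1, 2, 7, 5),   # 2 changes from A, 1 change from B
    "D": (9, 9, 9, 9),   # 4 changes from A
}

related = select_related(types, highlighted={"A"}, max_distance=1)
print(sorted(related))
```

With a maximum distance of 1 only types A and B are highlighted; raising the maximum distance to 2 would add type C.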
18.4.16 On the tree, the highlighted nodes are, by
default, marked with a red label, and with a red circle as
well. You can choose to hide or show this label using the
corresponding toolbar button or with Edit > Label selected nodes.
18.4.17 The Cut branch tool (Edit > Cut branch tool or the corresponding
toolbar button) is a cursor tool that allows a branch of the tree to
be "cut off" and displayed as one simple end node. A
branch can be cut off by selecting the branch cut tool,
moving the cursor towards one end of a branch and left-clicking. When cut off, the branch is displayed as a green
node which always has the same size, regardless of the
zoom. To disclose the branch again, simply double-click
on the green node.
18.4.18 The complexes present on the minimum
spanning tree can be converted into Groups using File >
Convert complexes to groups.
19. Dimensioning techniques (PCA, MDS and
SOM)
Principal Components Analysis (PCA) and Multi-Dimensional Scaling (MDS) are two alternative
grouping techniques that can both be classified as
dimensioning techniques. In contrast to dendrogram
inferring methods, they do not produce hierarchical
structures like dendrograms. Instead, these techniques
produce two-dimensional or three-dimensional plots in
which the entries studied are spread according to their
relatedness. Unlike a dendrogram, a PCA or MDS plot
does not provide "clusters". The interpretation of the
obtained comparison is, more than in cluster analysis,
left to the user.
The program now asks "Optimize positions".
BioNumerics iteratively recalculates the MDS, each time
again optimizing the positions of the entries in the space
to resemble the similarity matrix as closely as possible. If
you allow the optimization to happen, the calculations
take longer.
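The idea of iteratively optimizing positions can be illustrated with a minimal sketch. It is not the algorithm used by BioNumerics, which is not described here. The sketch assumes that similarities in percent are converted to target distances as (100 - similarity) / 100, and it lowers a simple sum-of-squares "stress" by gradient steps. All function names, the step size and the example matrix are hypothetical.

```python
import math
import random


def stress(points, target):
    """Sum of squared differences between plot distances and target distances."""
    total = 0.0
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            total += (math.dist(points[i], points[j]) - target[i][j]) ** 2
    return total


def optimize_positions(target, dims=3, iterations=200, seed=0):
    """Iteratively move points so that their distances approach `target`."""
    rng = random.Random(seed)
    n = len(target)
    points = [[rng.uniform(-1.0, 1.0) for _ in range(dims)] for _ in range(n)]
    step = 0.1
    current = stress(points, target)
    for _ in range(iterations):
        # Gradient of the stress with respect to every coordinate.
        grad = [[0.0] * dims for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                d = math.dist(points[i], points[j])
                if d == 0.0:
                    continue
                factor = 2.0 * (d - target[i][j]) / d
                for k in range(dims):
                    grad[i][k] += factor * (points[i][k] - points[j][k])
        trial = [[p[k] - step * g[k] for k in range(dims)]
                 for p, g in zip(points, grad)]
        trial_stress = stress(trial, target)
        if trial_stress < current:
            points, current = trial, trial_stress   # accept the improvement
        else:
            step /= 2.0                             # step too large: shrink it
    return points, current


# Hypothetical similarity matrix (percent) for three entries.
similarity = [
    [100, 90, 20],
    [90, 100, 30],
    [20, 30, 100],
]
target = [[(100 - s) / 100.0 for s in row] for row in similarity]

start_points, start_stress = optimize_positions(target, iterations=0)
points, final_stress = optimize_positions(target, iterations=200)
print(start_stress, final_stress)
```

Because a step is only accepted when it lowers the stress, the final stress can never exceed the starting stress. More iterations cost more time, which mirrors the remark that the optimization makes the calculations take longer.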
PCA assumes a data set with a known number of
characters and analyzes the characters directly. PCA is
applicable to all kinds of character data, but not directly
to fingerprint data. Fingerprints can only be analyzed
when converted into a band matching table (see 10.6.3).
19.2 Editing an MDS
19.1.5 Press <Yes> to optimize the positions.
The MDS is calculated and the Coordinate space window is
shown.
The Coordinate space window (Figure 19-1.) shows the
entries as dots in a cubic coordinate system.
MDS does not analyze the original character set, but the
matrix of similarities obtained using a similarity
coefficient. Rather than being a separate grouping
technique, MDS just replaces the clustering step in the
sequence characters > similarity matrix > cluster analysis.
However, it is a valuable alternative to the dendrogram
methods, which often oversimplify the data available in
a similarity matrix, and tend to produce overestimated
hierarchies.
19.1 Calculating an MDS
Any experiment type for which a complete
similarity matrix is available can be analyzed by MDS.
Matrix Types are not suitable for MDS clustering if the
matrices are incomplete.
19.1.1 In the Main window with DemoBase loaded, open
comparison All, or a comparison with all entries except
those defined as STANDARD.
19.1.2 Select FAME in the experiment type selection bar,
and check whether a matrix is available for this
experiment type by looking in the Layout menu if the
menu command Show matrix is enabled (not grayed).
19.1.3 If Show matrix is grayed, first calculate a
dendrogram with Clustering > Calculate > Cluster
analysis (similarity matrix).
19.1.4 Select Dimensioning > Multi-dimensional scaling
or press the corresponding toolbar button.
Figure 19-1. Coordinate space window, resulting
from a PCA or MDS analysis.
19.2.1 To zoom in and zoom out on the image, use the
PgDn and PgUp keys, respectively.
19.2.2 The image can be rotated in real time using the
arrow keys or the horizontal and vertical scroll bars.
By default, the entries are represented as 3D spheres in a
realistic perspective. They appear in the colors as
defined for the groups on the dendrogram (11.6).
19.2.3 With Layout > Show keys or the corresponding toolbar button, you can
display the database keys of the entries instead of the
dots.
However, the entry keys may be long and
uninformative for the user, so the entry keys can be
replaced by a group code. The program assigns a letter
to each defined group, and within a group, each entry
receives a number. The group codes are shown as
follows:
19.2.13 Another very interesting option is Layout >
Show dendrogram (or the corresponding toolbar button).
When this option is enabled, the entries in the
coordinate system are connected by the dendrogram
branches from the parent Comparison window. This is an
ideal combination to co-evaluate a dendrogram and a
coordinate system (PCA or MDS).
19.2.14 To copy the coordinate space image to the
clipboard, select File > Copy image to clipboard or
press the corresponding toolbar button.
19.2.15 The image can be printed with File > Print image
or the corresponding toolbar button. The image will print in color if the colors are
shown on the screen.
19.2.4 In the parent Comparison window, select Layout >
Use group numbers as keys.
19.2.5 A legend to the group numbers can be obtained
with File > Export database fields in the parent
Comparison window.
19.2.6 With Layout > Show group colors or the corresponding toolbar
button, you can toggle between the color representation and the
non-color representation, in which the entry groups are
represented (and printed) as symbols instead of colored
dots.
On the screen, it is easier to evaluate the groups using
colors.
19.2.7 Select an entry in the coordinate system using
CTRL+left mouse click. Selected entries are contained in
a blue cube.
19.2.8 To select several entries at a time, hold down the
SHIFT key while dragging the mouse in the coordinate
system. All entries included in the rectangle will become
selected.
19.3 Calculating a PCA
PCA is typically executed on complete character data. It
does not work on Sequence Types. Fingerprints can only
be analyzed by PCA if a band matching table is first
generated (see 10.2).
19.3.1 In the Main window with DemoBase loaded, open
comparison All, or a comparison with all entries except
those defined as STANDARD.
19.3.2 Select FAME in the experiment type selection bar,
and Dimensioning > Principal Components Analysis or
the corresponding toolbar button.
The Principal Components Analysis dialog box (Figure 19-2.) allows a number of more advanced choices to be
made.
19.2.9 By double-clicking on an entry, its Entry edit
window is popped up.
19.2.10 With Layout > Show construction lines or the corresponding toolbar button,
the entries are displayed on vertical lines starting from
the bottom of the cube. This may facilitate the three-dimensional perception. Disable this option to view the
next features.
19.2.11 With Layout > Show rendered image or the corresponding toolbar button,
you can toggle between the realistic three-dimensional
perspective with entries represented by spheres, and a
simple mode where entries are represented as dots.
19.2.12 With Layout > Preserve aspect ratio enabled, the
relative contributions of the three components are
respected, which means that the coordinate system is no
longer shown as a cube.
Figure 19-2. Principal Components Analysis dialog
box.
The simplest choice is “Use quantitative values”. By
default, this choice is checked, and if the technique
provides quantitative information (not just absent/
present), one will normally want to use this information
CHARACTER / ENTRIES   CHAR 1    CHAR 2    CHAR 3
ENTRY 1               VAL 11    VAL 12    VAL 13
ENTRY 2               VAL 21    VAL 22    VAL 23
ENTRY 3               VAL 31    VAL 32    VAL 33

(The label AVERAGE, VARIANCE appears twice in the figure, once for each direction of the table.)
Figure 19-3. Character table showing the meaning of Average and Variance correction at the Entries and
Characters level.
for the PCA calculation. If this option is unchecked, the
character values will be converted to binary as specified
in the Conversion to binary settings (see 7.14.20).
More sophisticated options are the possibilities to
Subtract average character value over the Entries, and
to Subtract average character value over the Characters.
Figure 19-3. explains how the averaging works.
•Subtraction of the averages over the characters (green in
the figure) results in a PCA plot arranged around the
origin, and therefore, it is recommended for general
purposes.
•Division by the variances over the characters (green in
the figure) results in an analysis in which each
character is equally important. Enabling this option
can be interesting in a study containing characters of
unequal occurrence. For example, if fatty acid
extractions are analyzed for a set of bacteria, some
fatty acids may be present in abundant amounts,
whereas others may occur only in very small amounts.
It is quite possible that the “minor” fatty acids are as
informative as, or even more informative than, the
abundant ones from a taxonomic point of view. If no
correction is applied, those minor fatty acids will be completely
masked by differences in the abundant fatty acids.
Dividing by the variance for each fatty acid
normalizes for such range differences, so that each
character contributes equally to the total separation
of the system.
•Subtraction of the averages over the entries (red in the
figure) results in character sets of which the sum of
characters equals zero for each entry. This feature has
little meaning for general purposes.
•Division by the variances over the entries (red in the
figure) results in character sets for which the intensity
is normalized for all entries. For example, suppose
that you have scanned phenotypic test panels for a
number of strains and want to calculate a PCA. If
some strains have grown less than others, the overall
reaction in the wells will be less developed. Without
correction, well developed and less developed panels
will be separated in the study. Dividing by the variances
normalizes the character sets for such irrelevant
differences, making character sets with different
overall character developments fall together as long as
the relative reactions of the characters are the same.
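The four corrections listed above can be sketched on a small character table. This is only an illustration under stated assumptions: rows are taken to be entries and columns characters, and the sketch rescales by the standard deviation (the square root of the variance), which is the usual way to give every list the same spread. Whether BioNumerics scales by the variance itself or by its square root is not stated here. All function names and values are hypothetical.

```python
from statistics import mean, pstdev


def correct(values, subtract_average=False, divide_by_spread=False):
    """Apply the average and/or spread correction to one list of values."""
    if subtract_average:
        m = mean(values)
        values = [v - m for v in values]
    if divide_by_spread:
        s = pstdev(values)
        if s > 0:                      # a constant list cannot be rescaled
            values = [v / s for v in values]
    return values


def correct_characters(table, **options):
    """Correct each character (column) using statistics taken over all entries."""
    columns = [correct(list(col), **options) for col in zip(*table)]
    return [list(row) for row in zip(*columns)]


def correct_entries(table, **options):
    """Correct each entry (row) using statistics taken over its characters."""
    return [correct(list(row), **options) for row in table]


# Hypothetical fatty acid table: rows are entries, columns are characters.
# The first character is abundant, the second is a "minor" one.
table = [
    [60.0, 1.0],
    [50.0, 3.0],
    [40.0, 2.0],
]

scaled = correct_characters(table, subtract_average=True, divide_by_spread=True)
for row in scaled:
    print(row)
```

After the correction at the Characters level both columns are centered on zero and have the same spread, so the minor character is no longer masked by the abundant one. Calling correct_entries(table, subtract_average=True) instead makes the values of each entry sum to zero.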
NOTE: The two latter features are exactly what is done
by the Pearson product-moment correlation coefficient.
This coefficient subtracts from each character set its
average, and divides the characters by the variance of
the character set. The feature Divide by variance
under Entries should not be used in character sets
where the characters are already expressed as
percentages (for example, fatty acid methyl esters).
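The relation with the Pearson product-moment correlation can be checked with a short sketch. It assumes the common textbook form in which each character set is centered on its average and divided by its standard deviation (the square root of the variance); the function names and the two example panels are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Subtract the average and divide by the standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def pearson(x, y):
    """Pearson product-moment correlation of two character sets."""
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)


weak = [1.0, 2.0, 4.0, 3.0]               # weakly developed panel
strong = [v * 3.0 + 5.0 for v in weak]    # same relative reactions, stronger overall

print(pearson(weak, strong))
```

The two panels differ in overall development but not in their relative reactions, so after standardizing they become identical and their correlation is 1 (up to rounding).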
The lower panel of the dialog box (Figure 19-2.) displays
the Component type. This can be Principal components,
Discriminants (without variance), or Discriminants
(with variance). The first option is to calculate a
principal components analysis, whereas the
Discriminants options are to perform discriminant
analysis. These options are described in paragraph 19.6.
19.3.3 In the Entries and Characters panels, check
Subtract average under Characters, and leave the other
options unchecked.
19.3.4 In the Component type panel, select Principal
components, and press <OK>. Calculation of the PCA is
started.
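What such a calculation produces can be illustrated on a very small scale. The sketch below is not the BioNumerics implementation: it subtracts the average of each character, builds the covariance matrix, and finds only the first component by power iteration, a simple technique chosen here for brevity. Function names and data are hypothetical.

```python
from statistics import mean


def center_characters(table):
    """Subtract each character's average (taken over the entries)."""
    averages = [mean(col) for col in zip(*table)]
    return [[v - a for v, a in zip(row, averages)] for row in table]


def covariance(centered):
    """Covariance matrix of the characters of an already centered table."""
    n, p = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / (n - 1)
             for j in range(p)] for i in range(p)]


def first_component(matrix, iterations=500):
    """Dominant eigenvector and eigenvalue by power iteration."""
    p = len(matrix)
    vector = [1.0] * p
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(p))
                   for i in range(p)]
        norm = sum(v * v for v in product) ** 0.5
        if norm == 0.0:
            break
        vector = [v / norm for v in product]
    eigenvalue = sum(vector[i] * sum(matrix[i][j] * vector[j] for j in range(p))
                     for i in range(p))
    return vector, eigenvalue


# Hypothetical table: four entries (rows) described by two characters.
table = [
    [2.0, 1.9],
    [4.0, 4.1],
    [6.0, 5.8],
    [8.0, 8.2],
]
centered = center_characters(table)
component, variance = first_component(covariance(centered))

# Coordinate of every entry along the first component (the default X axis).
scores = [sum(v * c for v, c in zip(row, component)) for row in centered]
print(component, variance, scores)
```

In this example the two characters rise together, so nearly all of the variance lies along a single component and the entries are spread along it in their original order.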
The resulting window, the Principal components analysis
window, is shown in Figure 19-4.
The window is divided in two panels: the left panel
shows the entries plotted in an X-Y diagram
corresponding to the first two components. In the
caption of the window, the first 20 components are
shown, with their relative contribution and the
cumulative contribution displayed. Also, the
components used as X, Y and Z axes are indicated. The
right panel shows the characters plotted in the same X-Y
diagram. From the right panel, one can see the
contribution each character has to the two displayed
components, and hence, what contribution it has to the
Figure 19-4. Principal components analysis window.
separation of the groups along the same components.
For example, if a group of entries appears left along the
X-axis whereas the other entries appear right, those
characters occurring left on the X-axis are positive for
the left entries and negative for the right entries, and vice
versa.
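The relative and cumulative contributions shown in the caption can be understood with a short calculation. The sketch assumes the usual definition, in which the contribution of a component is its share of the total variance; the manual does not spell out the formula, and the variances below are hypothetical.

```python
def contributions(variances):
    """Relative and cumulative contribution (in percent) of each component."""
    total = sum(variances)
    relative = [100.0 * v / total for v in variances]
    cumulative = []
    running = 0.0
    for r in relative:
        running += r
        cumulative.append(running)
    return relative, cumulative


# Hypothetical variances of four components, largest first.
relative, cumulative = contributions([6.0, 3.0, 0.75, 0.25])
print(relative)
print(cumulative)
```

Here the first two components together account for 90% of the variance, which indicates how much of the structure a two-dimensional plot of these components would retain.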
By default, the first component is used for the X axis, the
second component is used for the Y axis, and the third
component is used for the Z axis. The Z axis is not
shown here, but can be shown in the three-dimensional
representation with Layout > Show 3D plot (see
further).
19.3.5 If you want to assign another component as one of
the axes, select the component in the caption, and
Layout > Use component as X axis, Layout > Use
component as Y axis, or Layout > Use component as Z
axis (or right-click on the component).
•Layout tools:
19.3.6 Switching from color indication for the groups to
symbol indication with Layout > Show group colors or
the corresponding toolbar button.
19.3.7 Showing the keys or a unique label based upon
the groups for the entries with Layout > Show keys or
the corresponding toolbar button.
NOTE: In case keys are assigned automatically by the
program, they are not very informative, so one should
select Layout > Use group numbers as keys in the
underlying Comparison window. A list of the group
codes and the corresponding entry names can be
generated in the underlying Comparison window
with File > Export database fields.
19.3.8 The option Layout > Preserve aspect ratio allows
you to either preserve the aspect ratio of the
components, i.e. the relative discrimination of the
component on the Y axis with respect to the component
on the X axis, or to stretch the components on the axes so
that they fill the image optimally.
19.3.9 With Layout > Zoom in / zoom out or the corresponding toolbar button, you
can zoom in on any part of the entries or characters panel
of the PCA plot: drag the mouse pointer to create a
rectangle; the area within the rectangle will be zoomed
to cover the whole panel.
19.3.10 In order to restore the original size of the image,
simply left-click within the panel. Disable the zoom mode afterwards.
ordered by the selected component in the Comparison
window.
19.3.19 The entry plot can be printed with File > Print
19.3.11 If you move the mouse pointer over the right
panel (characters), the name of the pointed character is
shown.
•Editing tools:
19.3.12 Entries can be selected in a PCA window by
holding the SHIFT key down and selecting the entries in
a rectangle using the left mouse button. Selected entries
are encircled in blue. You can also hold down the CTRL
key while clicking on an entry.
19.3.13 An even more flexible way of selecting entries is
using the lasso selection tool. To activate the lasso
selection tool, choose Layout > Lasso selection tool or
press the
button. With the lasso selection tool
enabled, selections of any shape can be drawn on the
plot. The lasso selection tool menu and button are
flagged when the tool is enabled. To stop using the lasso
selection tool, you have to click the button a second
time, or disable it from the menu.
A PCA is automatically saved along with its parent
Comparison window. It is possible to add entries to an
existing PCA or remove entries from it. The feature to
add entries to an existing PCA is an interesting
alternative way of identifying new entries. They can be
placed in a frame of known database entries, and in this
way, identifying is just looking at the groups they are
closest to. Since the components are not recalculated
when entries are added to an existing PCA, the PCA
does not reflect the full data matrix anymore!
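Placing a new entry in an existing PCA without recalculating the components amounts to a projection. The sketch below assumes that the character averages and the component vectors of the earlier analysis are available; how BioNumerics stores them internally is not described here, and all names and values are hypothetical.

```python
def project(entry, averages, components):
    """Place one entry in an existing PCA without recalculating it."""
    centered = [v - a for v, a in zip(entry, averages)]
    return [sum(c * v for c, v in zip(component, centered))
            for component in components]


# Hypothetical stored results of an earlier PCA on three characters.
averages = [10.0, 5.0, 2.0]
components = [
    [1.0, 0.0, 0.0],   # first component (X axis)
    [0.0, 0.6, 0.8],   # second component (Y axis)
]

new_entry = [12.0, 10.0, 2.0]
x, y = project(new_entry, averages, components)
print(x, y)
```

The averages and components stay fixed, so the new entry has no influence on the axes. This is why a PCA with added entries no longer reflects the full data matrix.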
19.3.14 If you want to add entries to an existing PCA,
you can select new entries in the Main window and copy
them to the clipboard using Edit > Copy selection or the corresponding toolbar button.
and the character plot can be
printed with File > Print image (characters) .
19.3.20 Alternatively, the entry plot can be copied to the
clipboard with File > Copy image to clipboard (entries)
or the corresponding toolbar button, and the character plot can be copied with File
> Copy image to clipboard (characters).
If you want to reconstruct or analyze the PCA system in
other software packages, it is possible to export the
coordinates of the entries along a selected component
(for example the X-axis):
19.3.21 Select a component and File > Export entry
coordinates.
If you want to reconstruct the PCA with the first two
components, you should also export the second
component (Y-axis), by selecting that component and
File > Export entry coordinates.
Similarly, one can export the coordinates for the
characters for a certain component:
19.3.22 Select a component and File > Export character
coordinates.
BioNumerics allows you to display three components at
the same time, by plotting the entries in a 3-dimensional
space.
19.3.23 To create a three-dimensional plot from the PCA,
select Layout > Show 3D plot or press the corresponding toolbar button.
The Coordinate space window is shown. See 19.2 to edit a
PCA in 3-D representation mode.
19.3.15 In the Comparison window, select Edit > Paste
selection. The new entries are placed in the Comparison
window and in the PCA window.
19.3.16 To delete entries from a PCA, select some entries
as in 19.3.12 and in the Comparison window, select Edit >
Cut selection.
If you started the PCA from a Composite Data Set, you
can order the characters according to the selected
component in the underlying Comparison window. This is
an interesting feature to locate characters that separate
groups you are interested in. The feature works as
follows (only Composite Data Sets).
image (entries) or
19.3.17 First determine the component that best
separates the groups.
19.3.18 Select that component in the caption and
Characters > Order characters by component (or right-click on the component). The characters are now
19.3.24 Close the Coordinate space window with File >
Exit.
19.3.25 Close the PCA window with File > Exit.
19.4 Calculating a discriminant analysis
Discriminant analysis is very similar to PCA. The major
difference is that PCA calculates the best discriminating
components for the character table as a whole, without
foreknowledge about groups, whereas discriminant
analysis calculates the best discriminating components
for groups that are defined by the user. In case of
discriminant analysis, these principal components are
then called discriminants. Like PCA, discriminant
analysis is executed on complete character data. It does
not work on Sequence Types. Fingerprints can only be
analyzed by discriminant analysis if a band matching
Figure 19-5. The influence of character spread on discriminant analysis.
(Figure labels: Group 1, Group 2, Character A, Character B.)
table is first generated (see 10.2). Discriminant analysis
also forms the basis for multivariate analysis of variance
(MANOVA), which is explained in paragraph 19.6.
important as character A. This is achieved with option
Discriminants (with variance).
19.4.4 Select Discriminants (with variance) and <OK>.
19.4.1 In database DemoBase, we first create a new
Composite Data Set (see 15.2) for experiment type
PhenoTest. You can name it Phenodata, for example.
19.4.2 Open comparison All, or a comparison with all
entries except those defined as STANDARD.
Since discriminant analysis works on user-delineated
groups, the comparison should contain groups (see
11.6).
19.4.3 Select Phenodata in the experiment type selection
bar, and Dimensioning > Principal Components
Analysis or the corresponding toolbar button.
The Principal Components Analysis dialog box (Figure 19-2.) allows a number of choices to be made under Entries
and Characters, which are described under PCA (19.3).
These choices also apply for discriminant analysis.
However, for discriminant analysis it makes no
difference whether the Divide by variance option under
Characters is enabled or disabled.
The following two options are available for discriminant
analysis: Discriminants (without variance), and
Discriminants (with variance). If you select “with
variance”, each character is divided by its variance. In
order to understand what this implies, consider Figure
19-5.
This example shows two groups, 1 (red) and 2 (green),
that are separated by two characters, A and B. On the
average, group 1 is less positive both for characters A
and B. Character A seems to be better discriminating
between the two groups than character B, because the
centers of the groups are lying further from each other in
case of character A. However, if the internal spread of
the groups is considered, the groups are found to be much
more coherent for character B, which may make this
character at least as valuable for discrimination as
character A. In a non-corrected discriminant analysis,
character A will account for most of the discrimination,
just by the fact that the centers of the groups are more
distant. This is the case in option Discriminants
(without variance). When the characters are divided by
the variances of the groups, the internal spread is
compensated for, and character B will become at least as
The resulting window is identical to the PCA window
described before (Figure 19-4.), and the same features
apply.
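The difference between the two Discriminants options, as described for Figure 19-5, can be illustrated one character at a time. This is a deliberately simplified, single-character sketch and not the multivariate calculation the program performs. Scaling by the square root of the pooled within-group variance is an assumption, and all names and values are hypothetical.

```python
from statistics import mean, pvariance


def separation(group1, group2, divide_by_variance):
    """Distance between two group centers for one character, optionally
    scaled by the pooled spread within the groups."""
    distance = abs(mean(group1) - mean(group2))
    if not divide_by_variance:
        return distance
    pooled = (pvariance(group1) + pvariance(group2)) / 2.0
    return distance / pooled ** 0.5


# Hypothetical values in the spirit of Figure 19-5: (group 1, group 2).
character_a = ([1.0, 3.0, 5.0], [5.0, 7.0, 9.0])   # distant centers, wide spread
character_b = ([2.0, 2.1, 2.2], [3.0, 3.1, 3.2])   # closer centers, tight groups

for label, use_variance in (("without variance", False), ("with variance", True)):
    a = separation(*character_a, divide_by_variance=use_variance)
    b = separation(*character_b, divide_by_variance=use_variance)
    print(label, round(a, 2), round(b, 2))
```

Without the correction character A dominates because its group centers lie further apart. Once the internal spread is taken into account, the coherent groups of character B make it the stronger discriminator.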
If you started the discriminant analysis from a
Composite Data Set, you can order the characters
according to the selected discriminant in the underlying
Comparison window. This is an interesting feature to
locate characters that separate groups you are interested
in. The feature works as follows (only Composite Data
Sets).
19.4.5 First determine the discriminant that best
separates two groups you have in mind. You can
examine the discriminants by right clicking on them in
the caption and Use component as Y-axis (or X-axis).
19.4.6 Select that discriminant in the caption and
Characters > Order characters by component (or rightclick on the discriminant). The characters are now
ordered by the selected discriminant in the Comparison
window.
19.5 Self organizing maps
A self organizing map (SOM, also called Kohonen map)
is a neural network that classifies entries in a two-dimensional space (map) according to their likeness.
Since the technique which is used for grouping, i.e. the
training of a neural network, is completely different
from all previously described methods, SOMs are an
interesting alternative to conventional grouping
methods, including cluster analysis, principal
component analysis, and related techniques. Also,
as in PCA, a SOM can start from the characters as
input, thus avoiding the choice of one or another
similarity coefficient. Unlike PCA, the distance between
entries on the map is not in proportion to the taxonomic
distance between the entries. Rather, a SOM contains
areas of high distance and areas of high similarity. Such
areas can be visualized by different shading, for
example when a darker shading is used in proportion to
the distance in the SOM.
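The training of such a map can be sketched in a few lines. This is a generic textbook-style Kohonen update, not the BioNumerics implementation: the starting learning rate of 0.5, the Gaussian neighborhood, the linear decay and all names are assumptions made for the illustration.

```python
import math
import random


def best_matching_unit(weights, vector):
    """Grid position of the node whose weights are closest to `vector`."""
    return min(weights, key=lambda pos: math.dist(weights[pos], vector))


def train_som(data, size, epochs=100, seed=0):
    """Train a size x size map on a list of character vectors."""
    rng = random.Random(seed)
    dims = len(data[0])
    weights = {(r, c): [rng.random() for _ in range(dims)]
               for r in range(size) for c in range(size)}
    for epoch in range(epochs):
        fraction = 1.0 - epoch / epochs              # shrinks from 1 towards 0
        rate = 0.5 * fraction                        # learning rate
        radius = max(1.0, (size / 2.0) * fraction)   # neighborhood radius
        for vector in data:
            bmu = best_matching_unit(weights, vector)
            for pos, w in weights.items():
                grid_distance = math.dist(pos, bmu)
                if grid_distance <= radius:
                    influence = math.exp(-grid_distance ** 2 / (2.0 * radius ** 2))
                    for k in range(dims):
                        w[k] += rate * influence * (vector[k] - w[k])
    return weights


# Hypothetical character data: two groups of three entries each.
data = [
    [0.10, 0.10], [0.15, 0.10], [0.10, 0.20],
    [0.90, 0.90], [0.85, 0.90], [0.90, 0.80],
]
weights = train_som(data, size=3)
for vector in data:
    print(vector, "->", best_matching_unit(weights, vector))
```

Each entry ends up on the node that resembles it most. Neighboring nodes are pulled along during training, which is what makes nearby cells of the map similar. A larger map means more nodes to update for every entry, so training takes longer.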
When the similarity values with all of the other entries
of a comparison are considered as the character set, a
SOM can also be applied on similarity matrices, which
makes the technique also suitable for grouping of
electrophoresis patterns that are compared pair by pair
using a band matching coefficient such as Dice.
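Using the rows of a similarity matrix as character sets can be sketched as follows. The example assumes that bands have already been assigned to band classes, so that two bands match only if they share the same class; real band matching with position tolerance is more involved. The pattern names and band classes are hypothetical.

```python
def dice(bands_a, bands_b):
    """Dice coefficient (in percent) between two sets of band classes."""
    shared = len(bands_a & bands_b)
    return 100.0 * 2 * shared / (len(bands_a) + len(bands_b))


# Hypothetical band classes present in three electrophoresis patterns.
patterns = {
    "entry1": {1, 2, 3, 4},
    "entry2": {1, 2, 3, 5},
    "entry3": {6, 7},
}

keys = sorted(patterns)
# Each row of the similarity matrix becomes the character set of one entry.
character_sets = {
    a: [dice(patterns[a], patterns[b]) for b in keys] for a in keys
}
for key in keys:
    print(key, character_sets[key])
```

Entries with similar patterns obtain similar rows, so a SOM trained on these rows groups them together even though no fixed character table exists for the fingerprints.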
19.5.7 In the Comparison window, select Edit > Paste
selection. The new entries are placed in the Comparison
window and in the SOM window.
NOTES:
To calculate a self-organizing map based on character
data, use for example the character set FAME in
DemoBase.
19.5.1 In the Main window with DemoBase loaded, open
comparison All, or a comparison with all entries except
those defined as STANDARD.
19.5.2 Select FAME in the experiment type selection bar,
and Dimensioning > Self-organizing map (the command
Dimensioning > Self-organizing map (similarities) is to
calculate the SOM from the similarity matrix).
An input box asks to enter the map size. This is the
number of nodes of the neural network in each
direction. For the default size 10, a neural network
containing 10x6 nodes is generated. The larger the map
is taken, the longer the training takes. Note that the
optimal size of the map depends on the number of
entries compared. For a small number of entries, a small
map size will usually provide better results.
19.5.3 Enter 7 as map size and press <OK>.
(1) An identification based upon a self-organizing map
is only reliable if the new entries belong to one of the
groups the SOM is based upon. A SOM will always
produce a “positive” identification: an unknown profile
will always find a place in the SOM, i.e. the cell
having the highest similarity with the new entry. If,
after adding a new entry to a SOM, the entry falls next
to a known entry of that SOM, this means only that the
new entry has the highest similarity with that
particular cell compared to the other cells; it does not
mean that it is highly related to that entry. Hence,
identification based upon a SOM is only recommended
if you are sure the unknown entries belong to one of the
groups composing the SOM.
(2) Since no new cells can be created in a SOM, one
should never add new entries which are known to
constitute a group that is not represented in the SOM.
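The caution expressed in these notes can be made concrete with a small sketch. It shows that the closest cell is always found, however poorly the new profile fits, and that inspecting the distance to that cell is one possible safeguard. The distance check is a suggestion made here, not a BioNumerics feature, and the cells and profiles are hypothetical.

```python
import math


def identify(cells, profile):
    """Return the closest cell and its distance to the new profile."""
    name = min(cells, key=lambda c: math.dist(cells[c], profile))
    return name, math.dist(cells[name], profile)


# Hypothetical trained cells, each labelled with the group that occupies it.
cells = {
    "group A": [0.1, 0.1],
    "group B": [0.9, 0.9],
}

known_name, known_distance = identify(cells, [0.12, 0.08])
odd_name, odd_distance = identify(cells, [5.0, 4.0])

print(known_name, round(known_distance, 3))
print(odd_name, round(odd_distance, 3))
```

The second profile resembles neither group, yet it is still assigned to group B. Only the much larger distance reveals that this "identification" should not be trusted.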
19.5.8 To delete entries from a SOM, select some entries
and in the Comparison window, select Edit > Cut
selection.
The SOM is calculated and shown (Figure 19-6.). Areas
of high similarity are black. Selected entries in the parent
Comparison window are also selected on the map. Note
that the SOM as shown in Figure 19-6. will not necessarily
correspond to the one you have calculated.
19.5.9 Close the SOM window with File > Exit.
19.5.4 To show the information of a particular entry in
the SOM, right-click on the entry and select Edit
database fields.
In this case, the result of the SOM is based on similarity
values of the entries with each other and hence is
dependent on the similarity coefficient used, and the
tolerance and optimization settings in case of
Fingerprint Types. Obviously, this method only works if
a cluster analysis of the selected experiment is available.
19.5.5 You can (un)select entries on the SOM by left-clicking on an entry while pressing the CTRL key, or
groups of entries by left-clicking and moving the mouse
while pressing the SHIFT key.
19.5.10 To create a Self Organizing Map from a
similarity matrix obtained after cluster analysis, select
Dimensioning > Self-organizing map (similarities).
NOTE: When a SOM is calculated on Fingerprint Type
data, the densitometric curves are used as character
data sets for training of the SOM.
A SOM is automatically saved along with its parent
Comparison window. It is possible to add entries to an
existing SOM or remove entries from it. The feature to
add entries to an existing SOM is an interesting
alternative way of identifying new entries. Added
entries are placed in a frame of known database entries
in the SOM, and in this way, identifying is just looking
at the groups they are joining.
19.5.6 If you want to add entries to an existing SOM, you
can select new entries in the Main window and copy them
to the clipboard using Edit > Copy selection or
the corresponding toolbar button.
Figure 19-6. Self Organizing Map calculated from
data set FAME in the DemoBase of BioNumerics.
If not, first create a cluster analysis with Clustering >
Calculate > Cluster analysis (similarity matrix).
19.5.11 The SOM can be printed with the File > Print
command, or exported via the clipboard as enhanced
metafile using File > Copy to clipboard.
In these cases, the map colors are inverted, i.e. white
corresponds with areas of high similarity, whereas
darker shading corresponds with areas of low similarity.
A SOM is saved along with a comparison. In order to
display a previously calculated SOM in a comparison,
select Dimensioning > Show map.
the possibility of the discriminant analysis to explore
correlations between characters in order to achieve a
better discrimination. Most statistical approaches
assume that the covariance is accounted for; however,
its use becomes dangerous in case the number of
characters is close to, or larger than the number of
entries studied. In such cases, the result of the
discriminant analysis could be that the delineated
groups are perfectly separated. To avoid such unrealistic
separations, you should only allow the program to
account for the covariance when the number of entries is
significantly larger than the number of characters.
19.6 Multivariate analysis of variance
(MANOVA) and discriminant analysis
Multivariate ANalysis Of VAriance (MANOVA) is a
statistical technique which allows the significance of
user-delineated groups to be calculated. Since it is
extremely difficult to prove that delineated groups are
significant, statistical methods usually are based on the
reverse approach, i.e. to prove that the chance
(likelihood) to obtain equally good separations with
randomly generated groups approaches zero.
In addition, a statistical technique related to PCA,
discriminant analysis, allows the determination of
characters that are responsible for the separation of the
delineated groups.
The MANOVA technique only applies to Composite
Data Sets in BioNumerics. If you want to find the
discriminating characters for an experiment, you should
first create a Composite Data Set containing that
experiment.
MANOVA cannot be applied to incomplete data sets. In
other words, all characters must be filled in for each
entry. In case of “open” Character Types (in which the
character set may grow dynamically), absent values
should be considered as zero (see 7.11).
19.6.1 In database DemoBase, we first create a new
Composite Data Set (see 15.2) for experiment type
PhenoTest. You can name it Phenodata, for example.
19.6.2 Open comparison All, or a comparison with all
entries except those defined as STANDARD.
Since MANOVA and discriminant analysis work on
user-delineated groups, the comparison should contain
groups (see 11.6).
19.6.3 Select Phenodata in the experiment type selection
bar, and Groups > Multivariate Analysis of Variance.
19.6.4 The program now pops up a MANOVA dialog box
(Figure 19-7.). The meaning of the variances (diagonal
elements) is similar to the variances explained for
Discriminant Analysis (19.4). The Covariances relate to
Figure 19-7. The MANOVA dialog box.
19.6.5 A second option is Estimate relative character
importance. When this feature is applied, the program
will repeat the discriminant analysis, each time leaving
out one character. The quality of the separation when a
character is left out is then compared to the quality
when the character is not left out, and this is a direct
measure for the importance of that character. Obviously,
the calculations take much longer when the discriminant
analysis is to be calculated p times, p being the number
of characters.
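The leave-one-out procedure can be sketched as follows. The quality score used here is a hypothetical stand-in (between-group variance divided by within-group variance, summed over the characters) and not the measure BioNumerics calculates. Because this stand-in is additive, the importance of a character is simply its own term, which is not true for a real multivariate analysis. All names and values are illustrative.

```python
from statistics import mean, pvariance


def quality(groups, characters):
    """Hypothetical separation score: for each character, the variance of the
    group averages divided by the average variance within the groups, summed."""
    score = 0.0
    for k in characters:
        between = pvariance([mean(e[k] for e in g) for g in groups])
        within = mean(pvariance([e[k] for e in g]) for g in groups)
        score += between / within
    return score


def relative_importance(groups, n_characters):
    """Drop in separation quality when each character is left out in turn."""
    everything = list(range(n_characters))
    full = quality(groups, everything)
    return [full - quality(groups, [c for c in everything if c != k])
            for k in everything]


# Hypothetical data: two groups of three entries, two characters per entry.
group1 = [[1.0, 5.0], [2.0, 4.0], [3.0, 6.0]]
group2 = [[11.0, 5.0], [12.0, 6.0], [13.0, 4.0]]

importance = relative_importance([group1, group2], n_characters=2)
print(importance)
```

Leaving out the first character destroys the separation, whereas leaving out the second changes nothing, so the first character is by far the more important one. The analysis is repeated once per character, which explains the longer calculation time.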
19.6.6 Select Don’t use under Covariance matrix, enable
the option to calculate the relative character importance
and press <OK>.
If one or more characters are identical for all the entries,
this will be reported in a message box and such
characters will be left out from the discriminant analysis.
The resulting MANOVA & discriminant analysis window
is shown in Figure 19-8. This example is based upon
four groups (red, blue, green and yellow).
The window is divided in three panels: the upper panel,
covering the full width of the window, shows the
relative discriminatory value of the characters for each
of the discriminants. A character can have a contribution
to the discrimination in the positive sense (green) or in
the negative sense (purple). The larger the bar, the
greater the contribution, irrespective of the sense. If
character contributions have a different sense, it means
that the one character will be positive in the groups
where the other character will be negative and vice versa.
The relative importance for each character is shown as a
red line, right from the character name.
Note that the total number of discriminants will always
be the number of groups less one. For two groups, there
is only one discriminant; for three groups, there are two
discriminants, etc.
19.6.7 If there are more than two discriminants, you can
scroll through the list of discriminants.
The first discriminant is always the most important, i.e.
it accounts for most of the discrimination; the second
discriminant is the second most important etc. The
percentage discrimination of a discriminant is shown in
bold (left). The sum of the percentages equals 100.
In addition, the parameter L (Wilkinson’s likelihood for
normal distributions) predicts the likelihood of the
obtained discrimination in the assumption that the
groups are drawn from the same population. If L is low,
the entries of the different groups are likely to be drawn
from different populations, in other words, the existence
of the groups is justified. The parameter p is the
probability that a random subdivision in groups would
yield the same degree of discrimination.
The left lower panel maps the entries on the first two
discriminants (first = X axis, second = Y axis). On this
image you can see that the X axis accounts for most of
the discrimination (Figure 19-8). The right panel maps
the characters on the entry groups. To interpret this very
informative panel, you should inspect it together with
the left panel. The more distant a group occurs from the
center along a discriminant axis, the better it may be
characterized by one or more characters. These
characters can be found in the right panel, shown by
their number. Characters are very positive for the group
if they fall in the same direction of the axis as the group;
if they occur in the opposite direction, they are very
negative for the group. The further a group and one or
more characters occur in either direction of an axis, the
more pronounced these characters are either positive or
negative for that group. For example, group A occurs in
Figure 19-8. MANOVA & discriminant analysis window. The circles delineating groups A, B, and C are added
to this figure to illustrate the interpretation of discriminant analysis.
the negative half of the X axis (first discriminant),
whereas group C is the most positive group on this axis.
Characters 16 and 19 (pronounced positive position)
discriminate group C from group A in that they are
much more positive for group C members than for
group A members. Another example: group C is
positive on both the X and Y discriminant, whereas
groups A and B are negative on either the X or Y axis.
From this, one can conclude that characters 4 and 15
discriminate group C from both groups A and B. The
rhomb in the center of each group in the left panel is the
average position of the group.
19.6.8 The third discriminant can be plotted on the
images by selecting it in the upper panel (third table),
and selecting Plot > Use discriminant as X axis or Plot
> Use discriminant as Y axis.
19.6.9 The menu Plot > Order characters by magnitude
allows the characters to be ordered by their contribution
on the selected discriminant.
19.6.10 The menu Plot > Order characters by
importance allows the characters to be ordered by their
relative importance factors (19.6.5).
19.6.11 It is possible to select an entry from the left panel
with CTRL+left mouse click. Selected entries are
encircled in blue.
19.6.12 To select several entries at a time, hold down the
SHIFT key while dragging the mouse in the left panel.
All entries included in the rectangle will become
selected.
19.6.13 Double-clicking on an entry opens its Entry edit
window.
19.6.14 With Plot > Show groups using colors, you can
toggle between the default color mode and the non-color
mode where groups are represented using symbols. In
the non-color mode, non-selected entries are shown in
yellow, whereas selected entries are shown in blue.
The various results of a MANOVA analysis can be
printed or exported as enhanced metafile to the
clipboard for further processing in another package:
19.6.15 Use File > Print report to print a detailed
numerical report of all characters and their contribution
along the discriminants. Similarly, File > Export report
exports this report to the clipboard in tab-delimited or
space-delimited format.
19.6.16 Use File > Print discriminants to print the upper
graphical panel, representing the selected discriminants
and the relative importance of the characters shown as
bar graphs. File > Copy discriminants to clipboard copies
this panel to the clipboard as an enhanced metafile.
19.6.17 Use File > Print correspondence plot to print the
lower two-dimensional plots, representing the entries
(left) and characters (right) plotted along two
discriminants. File > Copy correspondence plot to
clipboard copies these plots to the clipboard as an
enhanced metafile.
20. Chart and statistics tools
20.1 Introduction
A number of simple chart tools are available in
BioNumerics. They can be applied to the database
information fields or to the character data of the entries
in a comparison.
BioNumerics also offers the possibility to perform some
basic statistical analysis on the entries and variables used
in a chart. Given the large variety of information and
Character Types BioNumerics can contain, there are
many different types of charts that can be displayed,
depending on the type of the variable(s) to present. For
each chart one or more standard statistical tests are
implemented. The next sections are intended to provide
some information on the terminology (20.2) and the
mathematical background (20.3) of these tests.
The use of the chart and statistics tools is described in
section 20.4 and following sections.
20.2 Basic terminology
20.2.1 Literature
This manual is not intended as an introduction to basic
statistics. For more detailed literature, we refer to the
following handbooks:
•Press W., Teukolsky S.A., Vetterling W.T., Flannery
B.P., ‘Numerical recipes in C’, Cambridge University
Press, Cambridge.
•Sheskin D.J., ‘Handbook of parametric and
nonparametric statistical procedures’, CRC Press,
Boca Raton.
•Zwillinger D., Kokoska S., ‘Standard probability and
statistics tables and formulae’, Chapman & Hall/
CRC, Boca Raton.
20.2.2 Application of statistic tests
In general terms, the application of a statistical test can be
outlined as follows:
•Make a proposition that will be referred to as the
null-hypothesis. Statistical tests cannot be employed for
proving that a certain hypothesis is true, but only for
proving that all alternative hypotheses can be
rejected. Therefore, the null-hypothesis is what one
wants to reject.
•Determine what statistic will be used. A statistic is a
value that is calculated from the data set by means of
some formula and that is sensitive to the null-hypothesis
being tested.
•If the null-hypothesis is true, the probability function
of the statistic is known.
•If the statistic is located on an unfavorable position in
the probability function, i.e. if its probability is very
small, the null-hypothesis can be rejected. The
opposite is not true: the null-hypothesis cannot be
accepted as fulfilled if the statistic has a favorable
location in the probability distribution.
Note that not all tests are applicable in all situations.
There may be restrictions to e.g. the amount of data in
the sample, or to some basic properties of the data set.
These restrictions are mentioned where the tests are
described.
20.2.3 Parametric or non-parametric tests
Parametric tests basically suppose that the data are
distributed normally; they generally make use of the
values for the mean and the standard deviation.
Non-parametric tests are commonly based on a ranking
of the data. These ranks are distributed uniformly, hence
these tests are independent of any underlying
distribution. The price to pay is that an estimate of the
significance is more complicated and often relies on
approximations. These methods also generally lose
some strength because they lose some information
about the data. In comparison with parametric tests they
require more data to come to an equally significant
result.
For these tests the values of the data points are usually
replaced by their rank among the sample. The data
points are ordered, the lowest in order is assigned rank
one and the highest in order is assigned the rank that
equals the total sample size.
If some of the data points originally have the same
values, they are assigned the mean of the ranks
(called ‘tie rank’) they would have had if they were
different. The sum of the assigned ranks is always equal
to $n(n + 1)/2$, with $n$ the total sample size.
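The ranking with tie ranks described above can be sketched as follows. This is only an illustration of the procedure, not the BioNumerics implementation; the function name `tie_ranks` and the sample values are hypothetical.

```python
def tie_ranks(values):
    """Return the rank of each value; tied values get the mean of their ranks."""
    # Indices of the values, ordered from lowest to highest value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Find the run of equal values that begins at sorted position `start`.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Sorted positions start..end hold ranks start+1..end+1; assign their mean.
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


sample = [3.1, 1.2, 3.1, 0.7, 2.4]   # hypothetical data with one tie
ranks = tie_ranks(sample)
print(ranks)        # the two values 3.1 share rank (4 + 5) / 2 = 4.5
print(sum(ranks))   # n(n + 1)/2 = 15 for n = 5
```

The tie does not change the rank sum, which stays at n(n + 1)/2.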
20.2.4 Categorical or quantitative data
Within the chart tool, a distinction between three types
of variables is made.
•Categorical variable: this type of variable divides a
sample into separate categories or classes. Examples
are database fields such as genus, species, ... Intervals
of quantitative variables can also be treated as
categorical data.
•Quantitative variable: this type of variable can take
either continuous numerical values or binary values.
Continuous numerical values can be converted into
interval data if necessary. Character data are a typical
example for this type of variable.
•Date variable: a variable containing a date. This
variable can be converted into interval data, which
means that it can be interpreted as either a categorical
variable or a quantitative variable.
With combinations of these variables several types of
plots can be created, based upon:
One variable:
•Bar graph: for a single categorical variable
•1-D numerical distribution: for one quantitative variable
Two variables:
•Contingency table: for two categorical variables
•2-D Scatterplot: for two quantitative variables
•2-D ANOVA plot: for one categorical and one
quantitative variable.
Three variables:
•3-D Scatterplot: for three quantitative variables
For an overview of graph types and associated tests for
one and two variables, see Table 20-1.
Some types of plots can be extended in the sense that
they can display information from an additional
categorical variable by means of a color code. These
plots are the 2-D Scatterplot, the 3-D Scatterplot, the
2-D ANOVA plot and the 1-D numerical distribution.
               Categorical                    Quantitative
---            Bar graph (20.3.1)             1-D numerical distribution (20.3.4)
               Chi square test for equal      Kolmogorov-Smirnov test for
               category sizes                 normality
Categorical    Contingency table (20.3.3)     2-D ANOVA plot (20.3.6)
               Chi square test for            See Table 20-3.
               contingency tables
Quantitative   2-D ANOVA plot (20.3.6)        2-D scatterplot (20.3.5)
               See Table 20-3.                See Table 20-2.
Table 20-1. Schematic representation of variable types and corresponding graphs and tests for one and two
variables.
               Parametric                            Non-parametric
Means          T test (20.3.5.1)                     Wilcoxon signed-rank test (20.3.5.2)
Correlations   Pearson correlation test (20.3.5.3)   Spearman rank-order correlation test (20.3.5.4)
Table 20-2. Overview of tests associated with 2-D scatterplots.
               Parametric          Non-parametric
2 categories   T test (20.3.6.1)   Mann-Whitney test (20.3.6.2)
>2 categories  F test (20.3.6.3)   Kruskal-Wallis (20.3.6.4)
Table 20-3. Overview of tests associated with 2-D ANOVA plots.
20.3 Charts and statistics
20.3.1 Bar graph: Chi square test for equal category
sizes
Figure 20-1. Example of a bar graph.
For a bar graph displaying the number of entries for a
categorical variable, one typically likes to know if there
are significant differences in the number of entries per
category. Hence, the null-hypothesis is that all
categories have an equal number of entries.
If this null-hypothesis holds, the expected average count
per category ($N_e$) can be calculated as the total number of
entries divided by the number of categories,
$N_e = N / n$, with $N$ the total number of entries and $n$
the number of categories. The chi square statistic is
calculated from the values for the expected average
count ($N_e$) and the observed entries per category ($N_{oi}$),
$$\chi^2 = \sum_{i=1}^{n} \frac{(N_{oi} - N_e)^2}{N_e},$$
with $n$ the number of categories.
If the null-hypothesis is true and under certain
conditions (see the note below) this statistic
approximately follows a chi square distribution with $n - 1$
degrees of freedom. The p-value that is returned gives the
probability that the statistic is at least as high as the
observed one. If the p-value is low, the null-hypothesis
can be rejected. The significance $s$ of the test is calculated
as the complement of the p-value, $s = 100 \times (1 - p)$.
The values for these parameters can be found in the test
report. How such a chart and report can be created is
explained in section 20.5.
Chi square: 7.191 (2 degrees of freedom)
P value= 0.027440
Significance= 97.2560%
Expected average count per category: 15.67
Figure 20-2. Example of a test report for the chi
square test for equal category sizes applied to a
bar graph like the one shown in Figure 20-1.
Note: This test should not be used if the expected
average count per category is less than 5. If this is the
case, consider combining categories in order to increase
the expected average count.
20.3.2 Bar graph: Simpson and Shannon-Wiener
indices of diversity
A commonly asked question about a number of entries
occurring in different categories is how they are
distributed. Two widely used coefficients to measure the
diversity are the Shannon-Wiener index of diversity and
Simpson’s index of diversity. Both coefficients take into
account the diversity, i.e. the number of categories
present in the sampled population, as well as the
equitability, i.e. the evenness of the distribution of entries
over the different categories.
Simpson’s index of diversity is defined as the
probability that two consecutive entries will belong to
different categories. Given $K$ categories present in a
sampled population of $N$ entries, the probability of
sampling category $i$ twice consecutively is as follows
($n_i$ is the number of entries in category $i$):
$$P_i = \frac{n_i (n_i - 1)}{N (N - 1)}, \quad \text{with } N = \sum_{j=1}^{K} n_j.$$
The probability of sampling any two samples of the
same category is given by $P = \sum_{i=1}^{K} P_i$. Hence, the
probability $D$ of sampling two different categories is
$D = 1 - P$, which is Simpson’s index of diversity.
For a sampled population of $N$ entries belonging to $K$
categories, the Shannon-Wiener index of diversity is
calculated as follows ($n_i$ is the number of entries in
category $i$):
$$H = -\sum_{i=1}^{K} \frac{n_i}{N} \ln\left(\frac{n_i}{N}\right)$$
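As an illustration of sections 20.3.1 and 20.3.2, the sketch below computes the chi square statistic for equal category sizes, its p-value and the two diversity indices. It is a minimal sketch, not the BioNumerics code. The counts `[23, 16, 8]` are hypothetical values chosen to sum to 47, the helper `chi2_sf` is an assumed closed-form evaluation of the chi square tail probability, and Simpson's index is assumed to use the common N(N - 1) normalisation.

```python
import math


def chi2_sf(x, dof):
    """Upper tail probability of a chi square distribution (integer dof)."""
    y = x / 2.0
    # Start from Q(1, y) = exp(-y) for even dof, Q(1/2, y) = erfc(sqrt(y)) for odd dof,
    # then apply Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1).
    if dof % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    while a < dof / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def equal_category_test(counts):
    """Chi square test for equal category sizes: (chi2, p, significance, Ne)."""
    n = len(counts)
    expected = sum(counts) / n                       # Ne = N / n
    chi2 = sum((c - expected) ** 2 / expected for c in counts)
    p = chi2_sf(chi2, n - 1)
    return chi2, p, 100.0 * (1.0 - p), expected


def simpson_index(counts):
    """Simpson's index D = 1 - sum n_i (n_i - 1) / (N (N - 1))."""
    total = sum(counts)
    same = sum(c * (c - 1) for c in counts) / (total * (total - 1))
    return 1.0 - same


def shannon_index(counts):
    """Shannon-Wiener index H = -sum (n_i / N) ln(n_i / N); empty categories are skipped."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


counts = [23, 16, 8]   # hypothetical counts for three categories
chi2, p, significance, expected = equal_category_test(counts)
print(f"Chi square: {chi2:.3f}, P value: {p:.6f}, significance: {significance:.4f}%")
print(f"Expected average count per category: {expected:.2f}")
print(f"Simpson: {simpson_index(counts):.4f}, Shannon-Wiener: {shannon_index(counts):.4f}")
```

With two degrees of freedom the tail probability reduces to exp(-χ²/2), which makes the p-value easy to check by hand.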
20.3.3 Contingency table: chi square test for
contingency tables
A contingency table contains information on the
association between two categorical variables. Each cell
contains the number of entries for a specific combination
of row and column categories. For this kind of
representation of the data, the obvious question is
usually whether the information contained in the rows and
columns is correlated or not. The null-hypothesis is that
there is no association between the rows and columns.
Figure 20-3. Example of a contingency table where
intervals of a numerical variable are used to create
categories.
If the null-hypothesis is true, the expected count per cell
can be calculated. Therefore, we need to know the total
number of cells $n$ in the table, $n = n_i n_j$, with $n_i$ the
number of rows and $n_j$ the number of columns. The
summed numbers of counts in each row and column are
called the marginal row counts (e.g. $N_{row\,i}$ stands for the
marginal row count of row $i$) and marginal column counts
($N_{col\,j}$). If there is no association between rows and
columns, the expected cell count $n_{ij}$ for a cell on row $i$
and column $j$ can be calculated as
$$n_{ij} = \frac{N_{row\,i} \, N_{col\,j}}{N},$$
with $N$ the total number of entries.
Using these expected cell counts ($n_{ij}$) and the observed
counts per cell ($N_{oij}$), a chi square statistic is calculated,
$$\chi^2 = \sum_{i=1}^{n_i} \sum_{j=1}^{n_j} \frac{(N_{oij} - n_{ij})^2}{n_{ij}},$$
with $n_i$ the number of rows and $n_j$ the number of columns.
If the null-hypothesis is true and under certain
conditions (see note below), this statistic approximately
follows a chi square distribution with
$n - n_i - n_j + 1 = (n_i - 1)(n_j - 1)$
degrees of freedom. The p-value that is returned gives the
probability that the statistic is at least as high as the
observed one. If the p-value is low, the null-hypothesis
can be rejected. The significance $s$ of the test can be
calculated as the complement of the p-value,
$s = 100 \times (1 - p)$.
In case there is a significant association, its strength can
be expressed using Cramer’s V. The formula is
$$V = \sqrt{\frac{\chi^2}{N \min(n_i - 1, n_j - 1)}},$$
with $\chi^2$ the value for the statistic, $N$ the total number of
entries, $n_i$ the number of rows and $n_j$ the number of
columns. This gives a value between 0%, in case there is
no association, and 100%, in case there is a perfect
association. Cramer’s V can be used to compare the
strengths of different associations.
Values for the various parameters can be found in the
test report. The marginal column and row counts are
expressed in absolute counts and relative to the total
number of counts in the table. How such a chart and
report can be created is explained in section 20.6.
Chi square: 10.337 (9 degrees of freedom)
P value= 0.323868
Significance= 67.6132%
Cramer's V: 27.08%
Marginal column counts:
1.250    1    2.13%
1.750    7   14.89%
2.250   17   36.17%
2.750   22   46.81%
Marginal row counts:
1.250    1    2.13%
1.750   11   23.40%
2.250   17   36.17%
2.750   18   38.30%
Total count: 47
Average cell count: 2.94
Figure 20-4. Example of a test report for the chi
square test for contingency tables like the one shown in
Figure 20-3.
The contingency table can be displayed showing the
residuals for the cells. The residual is a measure for the
deviation from the expected number of counts in that
cell and is calculated as
$$\frac{N_{oij} - n_{ij}}{n_{ij}},$$
with $N_{oij}$ the observed cell count and $n_{ij}$ the expected
cell count.
Note: This test should not be used if the expected
average count per cell is less than 5. If this is the
case, consider combining categories in order to increase
the expected average count. In practice, this also means
that there should be no empty rows or columns in the
contingency table.
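The chi square statistic and Cramer's V for a contingency table can be sketched as follows. This is a minimal illustration under the formulas of section 20.3.3, not the BioNumerics implementation; the function names and the 2 x 2 table are hypothetical. The last line re-uses the numbers of the test report in Figure 20-4 (chi square 10.337, 47 entries, 4 rows, 4 columns).

```python
import math


def contingency_chi2(table):
    """Return (chi2, degrees of freedom) for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns are not associated.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof


def cramers_v(chi2, total, n_rows, n_cols):
    """Cramer's V as a fraction between 0 and 1."""
    return math.sqrt(chi2 / (total * min(n_rows - 1, n_cols - 1)))


table = [[10, 20],
         [20, 10]]          # hypothetical 2 x 2 table
chi2, dof = contingency_chi2(table)
print(f"Chi square: {chi2:.3f} ({dof} degrees of freedom)")
print(f"Cramer's V: {100 * cramers_v(chi2, 60, 2, 2):.2f}%")
print(f"Cramer's V for the report values: {100 * cramers_v(10.337, 47, 4, 4):.2f}%")
```

Note that an empty row or column gives an expected count of zero and therefore a division by zero, which matches the restriction in the note above.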
20.3.4 1-D numerical distribution function:
Kolmogorov-Smirnov test for normality
For a sample containing a single quantitative variable,
an often recurring question is if it is normally
distributed or not. In this case the null-hypothesis is that
the sample is drawn from a normal distribution. The
mean value $\langle x \rangle$ and corrected standard deviation
$$\sqrt{\frac{\sum_{i=1}^{n} (x_i - \langle x \rangle)^2}{n - 1}}$$
(with $x_i$ the observations
and $n$ the sample size) are calculated from the sample
and are used to determine a normal distribution that can
be used as a model (further referred to as model normal
distribution) for the underlying distribution of the
sample if the null-hypothesis holds.
The Kolmogorov-Smirnov test for normality is applied
to test how different the cumulative distribution of the
sample is from the cumulative distribution of the model
normal distribution. For a sample where each
observation is associated with a single number of events,
the cumulative distribution F(xi) gives for each
observation (xi) the total number of events associated to
all observations in the sample that are smaller or equal
to the observation (xi). Hence, the cumulative
distribution gives at each observation the probability of
obtaining that observation or a lower one.
The test statistic is the maximum difference in absolute
value between the cumulative distribution of the sample
and the cumulative distribution of the model normal
distribution. In case the null-hypothesis is true and
under certain conditions (see note below), the
distribution function for this statistic can be calculated
approximately. The p-value gives the probability that the
statistic obtains a higher value than the observed one. If
the p-value is low, the null hypothesis can be rejected.
The significance of the test can be calculated as the
complement of the p-value,
s = 100 × (1 − p ) .
The values for the parameters can be found in the test
report. How such a chart and report can be created is
explained in section 20.10.
Figure 20-5. Example of a 1-D numerical
distribution and model normal distribution
Mean: 2.352766
Corrected standard deviation: 0.357346
Maximum difference: 0.1413
P value= 0.282993
Significance= 71.7007%
Figure 20-6. Example of a test report for the
Kolmogorov-Smirnov test for normality applied to
a 1-D numerical distribution like the one shown in
Figure 20-5.
NOTES:
(1) The Kolmogorov-Smirnov test for normality should
not be used if the number of data points is smaller than
4. The test becomes more accurate if more data points
are used.
(2) This test cannot be used to prove that a sample
follows a normal distribution, since its aim is only to
reject the null-hypothesis with a certain level of
significance.
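The test statistic of section 20.3.4 (the maximum difference between the cumulative distribution of the sample and that of the model normal distribution) can be sketched as follows. This is an illustration only: the function name and the sample are hypothetical, and the p-value is left out because the manual does not state which approximation is used for it.

```python
import math


def ks_normality_statistic(sample):
    """Return (mean, corrected standard deviation, maximum difference)."""
    n = len(sample)
    mean = sum(sample) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    d_max = 0.0
    for i, x in enumerate(sorted(sample), start=1):
        # Cumulative model normal distribution evaluated at x.
        model = 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))
        # The sample distribution jumps from (i - 1)/n to i/n at x,
        # so both sides of the step are compared with the model.
        d_max = max(d_max, abs(i / n - model), abs(model - (i - 1) / n))
    return mean, sd, d_max


sample = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical observations
mean, sd, d_max = ks_normality_statistic(sample)
print(f"Mean: {mean:.6f}")
print(f"Corrected standard deviation: {sd:.6f}")
print(f"Maximum difference: {d_max:.4f}")
```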
20.3.5 2-D Scatterplot
Scatterplots contain information on two quantitative
variables that are obtained for a set of entries. The
position of each dot on the plot is determined by the
observations. A scatterplot is dealing with paired data
since a specific pair of observations characterizes each
entry that is represented in the plot.
For this kind of plot one could ask (1) if the means are
significantly different or (2) if there is any correlation
between the two variables. For both questions, there is a
parametric and a non-parametric test available.
Figure 20-7. Example of a 2-D scatterplot.
20.3.5.1 Parametric test for means: T test
The null-hypothesis is that the two samples have the
same mean values. Assume the sample observations are
$x_i$ and $y_i$ ($i = 1, \ldots, n$), with $\langle x \rangle$ and $\langle y \rangle$ the respective
mean values and
$$s_x = \frac{\sum_{i=1}^{n} (x_i - \langle x \rangle)^2}{n - 1} \quad \text{and} \quad s_y = \frac{\sum_{i=1}^{n} (y_i - \langle y \rangle)^2}{n - 1}$$
the corrected variances.
For paired data, it is generally not guaranteed that all
entries have a completely independent pair of
observations. The test statistic should be corrected for
the influences this may have on the variance of the
observations. Therefore, the corrected covariance Cov of
the sample,
$$\mathrm{Cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \langle x \rangle)(y_i - \langle y \rangle)}{n - 1},$$
is taken into account. The sample variance can be
expressed by means of the pooled corrected standard
deviation $s_d$. In this case, $s_d$ can be calculated as
$$s_d = \sqrt{\frac{s_x + s_y - 2\,\mathrm{Cov}(x, y)}{n}}.$$
A statistic is defined as $T = (\langle x \rangle - \langle y \rangle) / s_d$. If the
null-hypothesis holds and under certain conditions (see
note below) this statistic follows a t distribution with $n - 1$
degrees of freedom. The p-value gives the probability that
the statistic indeed has the observed value or higher. If
the p-value is small, the null-hypothesis can be rejected.
The significance of the test is calculated as the
complement of the p-value, $s = 100 \times (1 - p)$.
The values for the parameters can be found in the test
report. How such a chart and report can be created is
explained in section 20.7.
Mean values:
c10   1.3396
c14   1.8512
Corrected variances:
c10   0.2870
c14   0.4246
Corrected covariance = -0.1476
Pooled corrected standard deviation = 0.1464
T = -3.495 (46 degrees of freedom)
P value= 0.001060
Significance= 99.8940%
Figure 20-8. Example of a test report for the T test
applied to a 2-D scatterplot like in Figure 20-7.
NOTES:
(1) This test should not be used if the data points are not
normally distributed. In this case the Wilcoxon
signed-rank test can be used.
(2) This test should not be used if the variances of the
two samples are not the same.
20.3.5.2 Non-parametric test for means: Wilcoxon
signed-rank test
The null-hypothesis is that the two samples have the
same mean values. Assume the sample observations are
$x_i$ and $y_i$ ($i = 1, \ldots, n$). The absolute values of the
differences of these observations $d_i = x_i - y_i$ are
ranked (zero values are eliminated from the analysis).
As a first step, these ranks are assigned to rank variables
$R_i$. Afterwards, these $R_i$ get the sign of the corresponding $d_i$.
These two steps turn the $R_i$ into ranks of positive or
negative differences. The sum of ranks of positive
differences (sum of all positive $R_i$) and the sum of ranks of
negative differences (absolute value of the sum of all
negative $R_i$) are determined and the smallest of these
sums is called the Wilcoxon T test statistic.
If the null-hypothesis holds, the expected value for T is
$n(n + 1)/4$ (with $n$ the number of pairs of
observations), while the expected standard deviation on
T is $\sqrt{n(n + 1)(2n + 1)/24}$. Hence, if the
null-hypothesis holds and under certain conditions (see note
below) the statistic defined as
$$\frac{T - n(n + 1)/4}{\sqrt{n(n + 1)(2n + 1)/24}}$$
approximately follows a normal distribution. The p-value gives the
probability that the statistic is at least as high as the
observed one. If the p-value is low, the null hypothesis
can be rejected. The significance of the test is calculated as
the complement of the p-value, $s = 100 \times (1 - p)$.
The values for the parameters can be found in the test
report. How such a chart and report can be created is
explained in section 20.7.
Sum of ranks of positive differences= 303.0
Sum of ranks of negative differences= 825.0
P value= 0.005746 (Normal approximation)
Significance= 99.4254%
Figure 20-9. Example of a test report for the
Wilcoxon signed-rank test applied to a 2-D
scatterplot like in Figure 20-7.
NOTES:
(1) This test should not be used if the population
distribution is not symmetric.
(2) The approximation by using a normal distribution is
only valid if the sample contains more than 20
observations.
20.3.5.3 Parametric test for correlations: Pearson
correlation test
The null hypothesis is that there is no linear relationship
between the sample variables. Assume the observations
in the sample are $x_i$ and $y_i$ ($i = 1, \ldots, n$), with $\langle x \rangle$ and $\langle y \rangle$
the mean values,
$$s_x = \frac{\sum_{i=1}^{n} (x_i - \langle x \rangle)^2}{n} \quad \text{and} \quad s_y = \frac{\sum_{i=1}^{n} (y_i - \langle y \rangle)^2}{n}$$
the variances and
$$\mathrm{Cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \langle x \rangle)(y_i - \langle y \rangle)}{n}$$
the covariance of the sample. Pearson’s correlation $r$ is
calculated as
$$r = \frac{\mathrm{Cov}(x, y)}{\sqrt{s_x s_y}}.$$
In case there is a significant linear correlation, Pearson’s
$r$ can be used to indicate its strength. A positive value for
Pearson’s $r$ is associated with a positive correlation and
would result in a regression line with positive slope. A
negative value for Pearson’s $r$ is associated with a
negative correlation and would result in a regression
line with negative slope.
If the null-hypothesis holds and under certain
conditions (see note below) the statistic defined as
$$|r| \sqrt{\frac{n - 2}{1 - r^2}}$$
approximately follows a t
distribution with $n - 2$ degrees of freedom. Since $|r|$ is used
to calculate the statistic, the p-value can be calculated
using a single tail of the t distribution. The p-value gives
the probability that the statistic obtains a value at least
as high as the observed one. The significance of the test is
calculated as the complement of the p-value,
$s = 100 \times (1 - p)$.
The values for the parameters can be found in the test
report. How such a chart and report can be created is
explained in section 20.7.
Mean values:
c10   1.3396
c14   1.8512
Variances:
c10   0.2809
c14   0.4156
Covariance= -0.1445
Pearson correlation= -42.288%
P value (single tail)= 0.001531 (T test approximation)
Significance= 99.8469%
Figure 20-10. Example of a test report for the
Pearson correlation test applied to a 2-D scatterplot
like in Figure 20-7.
If the samples contain less than 30 observations, an
alternative way for testing the null-hypothesis is offered
by Monte-Carlo simulations. To do this, 10,000 samples
with $n$ pairs of randomly distributed observations are
created. For each of these samples, a value for the
statistic is obtained and is compared to the observed
value. The p-value from the simulations is determined by
the number of times the simulations give a larger value
for the statistic than the value observed in the real
sample. Also here, the significance is calculated as
$s = 100 \times (1 - p)$. The results for the simulated
p-value and significance also appear in the test report.
Note: This test should not be used if the distributions of
$x_i$ or $y_i$ have strong wings or if they are not normally
distributed. However, this test is acceptable for
sufficiently large samples.
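The statistics of sections 20.3.5.1 to 20.3.5.3 can be sketched as follows. This is a minimal illustration, not the BioNumerics implementation: all function names and the small data set are hypothetical. For the Wilcoxon test, the sketch assumes an expected value of n(n + 1)/4 and a two-sided normal p-value; with T = 303 and n = 47 this reproduces the p-value shown in the report of Figure 20-9.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def paired_t(x, y):
    """Paired T statistic and degrees of freedom (20.3.5.1)."""
    n = len(x)
    mx, my = _mean(x), _mean(y)
    sx = sum((a - mx) ** 2 for a in x) / (n - 1)            # corrected variance of x
    sy = sum((b - my) ** 2 for b in y) / (n - 1)            # corrected variance of y
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    sd = math.sqrt((sx + sy - 2.0 * cov) / n)               # pooled corrected standard deviation
    return (mx - my) / sd, n - 1


def wilcoxon_normal_p(t_stat, n):
    """Two-sided normal approximation for the Wilcoxon T statistic (20.3.5.2)."""
    expected = n * (n + 1) / 4.0
    spread = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (t_stat - expected) / spread
    return math.erfc(abs(z) / math.sqrt(2.0))


def pearson_r(x, y):
    """Pearson correlation from uncorrected variances and covariance (20.3.5.3)."""
    n = len(x)
    mx, my = _mean(x), _mean(y)
    sx = sum((a - mx) ** 2 for a in x) / n
    sy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return cov / math.sqrt(sx * sy)


def correlation_t(r, n):
    """Statistic |r| sqrt((n - 2) / (1 - r^2)); undefined for |r| = 1."""
    return abs(r) * math.sqrt((n - 2) / (1.0 - r * r))


x = [1.0, 2.0, 3.0, 4.0]   # hypothetical paired observations
y = [2.0, 2.0, 5.0, 5.0]
t_value, dof = paired_t(x, y)
r = pearson_r(x, y)
print(f"T = {t_value:.3f} ({dof} degrees of freedom)")
print(f"Pearson correlation = {100 * r:.3f}%, statistic = {correlation_t(r, len(x)):.3f}")
print(f"Wilcoxon P value for T = 303, n = 47: {wilcoxon_normal_p(303.0, 47):.6f}")
```

The p-values of the two t-based statistics need the cumulative t distribution, which is not part of the Python standard library and is therefore left out of this sketch.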
20.3.5.4 Non-parametric test for correlations:
Spearman rank-order correlation test
The null-hypothesis is that there is no linear correlation
between the sample rank variables, or equivalently that
there is no monotonic relation between the sample
variables. The sample observations $x_i$ and $y_i$ ($i = 1, \ldots, n$)
are replaced by their rank after ordering them from
smallest to largest. This results in a sample of ranks $R_i$
and $S_i$ ($i = 1, \ldots, n$). The Spearman rank-order correlation
coefficient is defined as
$$r_s = \frac{\mathrm{Cov}(R, S)}{\sqrt{s_R s_S}},$$
with
$$s_R = \frac{\sum_{i=1}^{n} (R_i - \langle R \rangle)^2}{n} \quad \text{and} \quad s_S = \frac{\sum_{i=1}^{n} (S_i - \langle S \rangle)^2}{n}$$
the rank variances,
$$\mathrm{Cov}(R, S) = \frac{\sum_{i=1}^{n} (R_i - \langle R \rangle)(S_i - \langle S \rangle)}{n}$$
the rank covariance and $\langle R \rangle$ and $\langle S \rangle$ the mean values of the
rank variables $R_i$ and $S_i$ respectively.
The null-hypothesis can be tested using the statistic
$$|r_s| \sqrt{\frac{n - 2}{1 - r_s^2}}.$$
If the null-hypothesis holds,
this statistic approximately follows a t distribution with
$n - 2$ degrees of freedom. Since $|r_s|$ is used to calculate the
statistic, the p-value can be calculated using a single tail
of the t distribution. The p-value gives the probability
that the statistic obtains a value at least as high as the
observed one. In this case, a single tail test is performed.
The significance of the test is calculated as the
complement of the p-value, $s = 100 \times (1 - p)$.
The values for the parameters can be found in the test
report. How such a chart and report can be created is
explained in section 20.7.
Rank mean values:
c10   24.0000
c14   24.0000
Rank variances:
c10   184.0000
c14   184.0000
Rank covariance= -77.1489
Spearman rank-order correlation= -41.929%
P value (single tail)= 0.001675 (T test approximation)
Significance= 99.8325%
Figure 20-11. Example of a test report for the
Spearman rank-order correlation test applied to a
2-D scatterplot like in Figure 20-7.
If the samples contain less than 30 observations, an
alternative way for testing the null-hypothesis is offered
by Monte-Carlo simulations. To do this, 10,000 samples
with $n$ pairs of randomly distributed observations are
created. For each of these samples, a value for the
statistic is obtained and is compared to the observed
value. The p-value from the simulations is determined by
the number of times the simulations give a larger value
for the statistic than the value observed in the real
sample. Also here, the significance is calculated as
$s = 100 \times (1 - p)$. The results for the simulated
p-value and significance also appear in the test report.
20.3.6 ANOVA plot
This kind of plot presents a categorical and a quantitative
variable. The categorical variable splits the sample in a
number of groups while the quantitative variable
describes a distribution within each group. This kind of
data is called unpaired.
A typical question is whether the groups have the same
average for the quantitative variable. In case there are
only two groups for the categorical variable, the
parametric T test or the non-parametric Mann-Whitney
test can be applied. If there are three or more groups for
the categorical variable, the parametric F test or the
non-parametric Kruskal-Wallis test can be applied.
Figure 20-12. Example of an ANOVA plot with two
categorical variables.
20.3.6.1 Parametric test for two groups: T test
The null-hypothesis is that the two groups have the
same mean values. Assume the sample group
observations are $x_i$ ($i = 1, \ldots, n$) and $y_j$ ($j = 1, \ldots, m$), with
$\langle x \rangle$ and $\langle y \rangle$ the respective mean values for the groups.
The pooled corrected standard deviation is defined as
$$s_d = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \langle x \rangle)^2 + \sum_{j=1}^{m} (y_j - \langle y \rangle)^2}{n + m - 2} \left( \frac{1}{n} + \frac{1}{m} \right)}.$$
A statistic is defined as $T = (\langle x \rangle - \langle y \rangle) / s_d$.
If the null-hypothesis is true and under certain
conditions (see note below) this statistic follows a t
distribution with $n + m - 2$ degrees of freedom. The p-value
gives the probability that the statistic indeed has the
observed value or higher. If the p-value is small, the
null-hypothesis can be rejected. The significance of the
test is calculated as the complement of the p-value,
$s = 100 \times (1 - p)$.
The values for the parameters can be found in the test
report. How such a chart and report can be created is
explained in section 20.9.
Mean values:
Ambiorix   2.552
Perdrix    2.391
Pooled corrected standard deviation = 0.08509
T = 1.891 (37 degrees of freedom)
P value= 0.066419
Significance= 93.3581%
Figure 20-13. Example of a test report for a T test
applied on an ANOVA plot with two categorical
variables like in Figure 20-12.
NOTES:
(1) This test should not be used if the data points are not
normally distributed.
(2) This test should not be used if the variances of the
two samples are not the same.
Sum of ranks:
Ambiorix   509.5
Perdrix    270.5
P value= 0.157560 (Normal approximation)
Significance= 84.2440%
Figure 20-14. Example of a test report for a
Mann-Whitney test applied on an ANOVA plot with two
categorical variables like in Figure 20-12.
NOTE: this test should not be used if one of the groups
contains less than 8 members.
If the sample contains less than 30 observations, an
alternative way for testing the null-hypothesis is offered
by Monte-Carlo simulations. To do this, 10,000 samples
with two groups of $n$ and $m$ randomly distributed
observations are created. For each of these samples, a
value for the statistic is obtained and is compared to the
observed value. The p-value from the simulations is
determined by the number of times the simulations give
a larger value for the statistic than the value observed in
the real sample. Also here, the significance is calculated
as $s = 100 \times (1 - p)$. The results for the simulated
p-value and significance also appear in the test report.
20.3.6.2 Non-parametric test for two groups: MannWhitney test
The null-hypothesis is that the two groups have the
same median values. Assume the observations in the
sample groups are xi (i=1,…,n) , and yj , ( j=1,…,m). All
observations are combined into one sample and are
ranked. For each group, the sum of ranks is determined
and the smallest of those sums is taken as the U statistic.
If the null-hypothesis holds and under certain
conditions (see note below) this statistic approximately
follows a normal distribution with mean
variance
n m(m + n + 1) 12 .
nm 2
and
The p-value gives the
probability that the statistic indeed has the observed
value or higher. If the p-value is small, the nullhypothesis can be rejected. The significance of the test is
calculated as the complement of the p-value,
s = 100 × (1 − p ) .
The values for the parameters can be found in the test
report. How such a chart and report can be created is
explained in section 20.9.
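The two-group statistics of 20.3.6.1 and 20.3.6.2 can be sketched as follows. This is a minimal sketch under stated assumptions, not the program's own code: the function names are hypothetical, the U statistic is taken as a group's rank sum minus k(k + 1)/2 (the form whose mean is nm/2), and the Mann-Whitney p-value is taken as two-sided. The p-value of the T test is left out because it needs the t distribution, which the Python standard library does not provide.

```python
from math import erf, sqrt

def t_statistic(xs, ys):
    """Return (s_d, T, degrees of freedom) for two independent groups."""
    n, m = len(xs), len(ys)
    mean_x, mean_y = sum(xs) / n, sum(ys) / m
    squares = (sum((x - mean_x) ** 2 for x in xs)
               + sum((y - mean_y) ** 2 for y in ys))
    s_d = sqrt(squares * (1 / n + 1 / m))
    t = (mean_x - mean_y) * sqrt(n + m - 2) / s_d
    return s_d, t, n + m - 2

def mann_whitney_p(rank_sum, size, other_size):
    """Two-sided normal-approximation p-value from one group's rank sum."""
    u = rank_sum - size * (size + 1) / 2           # U for this group
    mean = size * other_size / 2
    variance = size * other_size * (size + other_size + 1) / 12
    z = abs(u - mean) / sqrt(variance)
    upper_tail = 0.5 * (1 - erf(z / sqrt(2)))      # P(Z >= |z|)
    return 2 * upper_tail
```

For example, assuming group sizes of 23 and 16 (the manual does not state them; only n + m = 39 follows from the 37 degrees of freedom in Figure 20-13), `mann_whitney_p(270.5, 16, 23)` gives roughly 0.158, in line with the P value of Figure 20-14.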
Figure 20-15. Example of an ANOVA plot with two
categorical variables
20.3.6.3 Parametric test for more than two groups: F test
Assume that the sample contains g groups. The
null-hypothesis is that all groups have the same mean. The
group sizes are given by n1, n2, …, ng, in total n
observations for the complete sample. The jth
observation in the ith group is denoted as x_ij. The
sample group means are
$$ \langle x\rangle_{\mathrm{group}\,i} = \frac{1}{n_i}\sum_{j=1}^{n_i} x_{ij} , $$
with x_ij all observations within group i. The mean of all
observations is
$$ \langle x\rangle = \frac{1}{g}\sum_{i=1}^{g} \langle x\rangle_{\mathrm{group}\,i} . $$
The total sum of squares,
$$ SST = \sum_{i=1}^{g}\sum_{j=1}^{n_i} (x_{ij} - \langle x\rangle)^2 , $$
is a measure for the variation in the sample around the
mean of all observations. The sum of squares among
groups,
$$ SSA = \sum_{i=1}^{g} n_i \left(\langle x\rangle_{\mathrm{group}\,i} - \langle x\rangle\right)^2 , $$
measures the variation among the group means. The total
within-group sum of squares,
$$ SSW = \sum_{i=1}^{g}\sum_{j=1}^{n_i} \left(x_{ij} - \langle x\rangle_{\mathrm{group}\,i}\right)^2 , $$
gives the variation in the sample within the groups. From
the definitions it is clear that SST=SSA+SSW.
If the null-hypothesis holds and under certain
conditions (see note below) the statistic
$$ F = \frac{SSA\,(n - g)}{SSW\,(g - 1)} $$
approximately follows an F-distribution with g-1 and n-g
degrees of freedom.
The p-value gives the probability that the statistic obtains
a value at least as high as the observed one. The
significance of the test is calculated as the complement of
the p-value,
s = 100 × (1 − p).
The values for the parameters can be found in the test
report. How such a chart and report can be created is
explained in section 20.9.
SST= 3.347
SSA= 0.255
SSW= 3.092
F= 1.814 (2;44 degrees of freedom)
P value= 0.174938 (F approximation)
Significance= 82.5062%
P value= 0.175700 (Simulated)
Significance= 82.4300%
Group means:
Ambiorix 2.552
Perdrix 2.391
Vercingetorix 2.446
Figure 20-16. Example of a test report for an F test
applied to an ANOVA plot with more than two
categorical variables like in Figure 20-15.
In case the sample contains less than 30 observations, an
alternative way for testing the null-hypothesis is offered
by Monte-Carlo simulations. To do this, 10,000 samples
with g groups and n1, n2, …, ng randomly distributed
observations in the groups are created. For each of these
samples, a value for the F statistic is obtained and is
compared to the observed value. The p-value from the
simulations is determined by the number of times the
simulations give a larger value for the F statistic than the
value observed in the real sample. Also here, the
significance is calculated as s = 100 × (1 − p). The
results for the simulated p-value and significance also
appear in the test report.
20.3.6.4 Non-parametric test for more than two
groups: Kruskal-Wallis test
Assume the sample contains g groups. The null-hypothesis is that all groups have the same median. The
number of observations in the groups are given by n1,
n2, …, ng, with n the total number of observations. All
observations are ranked, the rank for the jth observation
in the ith group is denoted by Rij and Ri stands for the
group rank sum of group i.
A statistic is defined as
$$ H = \frac{12}{n(n + 1)} \sum_{i=1}^{g} \frac{R_i^2}{n_i} - 3(n + 1) . $$
If the null-hypothesis holds and under certain
conditions (see below) the statistic approximately
follows a chi-square distribution with g-1 degrees of
freedom.
The p-value gives the probability that the statistic obtains
a value at least as high as the observed one. The
significance of the test is calculated as the complement of
the p-value,
s = 100 × (1 − p ) .
The values for the parameters can be found in the test
report. How such a chart and report can be created is
explained in section 20.9.
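The statistics for more than two groups (the F statistic of 20.3.6.3 and the H statistic of this section) can be sketched in Python. This is a minimal sketch with hypothetical function names, not the program's implementation. Two assumptions are made: the grand mean is taken over all n observations, which is the definition under which SST = SSA + SSW holds exactly and which coincides with the mean of the group means when all groups have the same size; and no correction for tied ranks is applied to H.

```python
def f_statistic(groups):
    """Return (SST, SSA, SSW, F) for a list of groups of observations."""
    g = len(groups)
    n = sum(len(group) for group in groups)
    grand_mean = sum(sum(group) for group in groups) / n   # over all observations
    means = [sum(group) / len(group) for group in groups]
    ssa = sum(len(group) * (mean - grand_mean) ** 2
              for group, mean in zip(groups, means))
    ssw = sum((x - mean) ** 2
              for group, mean in zip(groups, means) for x in group)
    sst = sum((x - grand_mean) ** 2 for group in groups for x in group)
    return sst, ssa, ssw, ssa * (n - g) / (ssw * (g - 1))

def h_statistic(rank_sums, sizes):
    """Kruskal-Wallis H from the group rank sums R_i and group sizes n_i."""
    n = sum(sizes)
    total = sum(r * r / k for r, k in zip(rank_sums, sizes))
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)
```

As a check on the F formula, SSA = 0.255, SSW = 3.092, n = 47 and g = 3 from Figure 20-16 give 0.255 × 44 / (3.092 × 2) ≈ 1.814, the value printed in that report.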
H= 2.377 (2 degrees of freedom)
P value= 0.304666 (Chi square approximation)
Significance= 69.5334%
P value= 0.312600 (simulated)
Significance= 68.7400%
Group rank sums:
Ambiorix 623.5
Perdrix 328.5
Vercingetorix 176.0
Figure 20-17. Example of a test report for the
Kruskal-Wallis test applied to an ANOVA plot with
more than two categorical variables like in Figure
20-15.
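The simulated P value that appears in such a report can be illustrated with a small Monte-Carlo sketch. This is an illustration under assumptions, not the program's procedure: the manual does not say from which distribution the random observations are drawn, so standard normal draws are used here, and the names `simulated_p_value`, `mean_difference` and the `seed` parameter are hypothetical.

```python
import random

def simulated_p_value(statistic, sizes, observed, runs=10_000, seed=0):
    """Fraction of random samples whose statistic exceeds the observed value."""
    rng = random.Random(seed)
    larger = 0
    for _ in range(runs):
        # One random sample: groups of the given sizes from one distribution.
        groups = [[rng.gauss(0.0, 1.0) for _ in range(k)] for k in sizes]
        if statistic(groups) > observed:
            larger += 1
    return larger / runs

def mean_difference(groups):
    """Example statistic: absolute difference between two group means."""
    first, second = groups
    return abs(sum(first) / len(first) - sum(second) / len(second))
```

Any statistic that accepts a list of groups can be passed in, so the same loop serves for the T, F and H statistics; the significance then follows as 100 × (1 − p).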
NOTES:
(1) In case there are only 3 groups, this test should not
be used if one of the groups contains less than 6
observations.
(2) In case there are more than 3 groups, this test should
not be used if one of the groups contains less than 5
observations.
If the sample contains less than 30 observations, an
alternative way for testing the null-hypothesis is offered
by Monte-Carlo simulations. To do this, 10,000 samples
with g groups and n1, n2, …, ng randomly distributed
observations in the groups are created. For each of these
samples, a value for the H statistic is obtained and is
compared to the observed value. The p-value from the
simulations is determined by the number of times the
simulations give a larger value for the H statistic than
the one observed in the real sample. Also here, the
significance is calculated as s = 100 × (1 − p). The
results for the simulated p-value and significance also
appear in the test report.
20.4 Using the plot tool and general
appearance
The plot and statistics tools are available directly from
the Main window or from the Comparison window. In the
Main window, it can be started using Comparison >
Chart / Statistics. When launched from the Main
window, it works on the current selection made in the
database. If launched in the Comparison window, it
works on all entries contained in the comparison.
20.4.1 In the Comparison window, click the chart tool
button or select File > Chart > statistics. This pops up a
dialog box (see Figure 20-18.) that is used to select the
plot components. All components that can be included
in a chart are listed on the left.
Figure 20-18. The Select plot components dialog box
that appears when the chart tool is started; it is
used to select the plot components for the chart.
20.4.2 To add a component to the chart, select a
component from this list by clicking on it and add it to
the list of Used components (displayed at the right) with
the button <Add>. Also in this list, components can be
clicked for selecting them. The selected component can
be removed from the Used components list with the
button <Delete>. For the selected component, the panel
beneath the Used components list displays what data
type it is.
20.4.3 Within this Select plot components dialog box you
can convert a quantitative variable into an interval
variable by checking the Convert to interval data
checkbox. The interval size has to be specified. See lower
right part of the panel displayed in Figure 20-18. The
same procedure has to be followed if a data variable has
to be converted to an interval variable.
20.4.4 For this example, select one numerical variable.
After clicking the <OK> button, the chart appears, as in
Figure 20-19. In this section the general features and
appearances of the Chart and Statistics window are
discussed. The content of the plot will be discussed in
sections 20.5 - 20.11.
20.4.5 To copy the plot of this window select either File
> Copy to clipboard (metafile) or File > Copy to
clipboard (bitmap). A paper copy can be obtained by
selecting File > Print.
20.4.6 For some types of charts, you can export the data
by selecting File > export data (formatted) or File >
export data (tab delimited). These menu items appear in
grey instead of black if they cannot be applied for the
current type of chart.
20.4.7 To change the content of the chart you can use the
Plot menu item. Selecting Plot > Edit components …
pops up the Select plot components window (see Figure
20-18.). This can be used to change the Used components. If
the list of Used components is modified, it is possible
that the plot changes into another type of chart because
the chart functionality selects the optimal representation
for a given set of variables. Of course it is possible to
select another type of chart (see 20.4.8).
20.4.8 In the Plot menu item you can also select another
type of chart. If the chart type chosen is not compatible
with the data type, the message “Invalid type of source
data” appears.
20.4.9 The View menu item is divided into two parts,
separated by a horizontal line. The upper part contains
the menu items for zooming in or zooming out on the
plot. The part below the horizontal line contains menu
items that change the view of the plot and that generally
depend on the kind of chart that is displayed in the
window. These commands will be discussed when the
various charts are presented.
20.4.10 The last menu item is Statistics. Under this item,
a list of statistical tests that can be applied to the selected
type of plot is given. These tests will be discussed when
the various charts are presented.
The toolbar shows the following buttons (from left to
right): the Edit plot components button, the Display bar
graph button, the Display 2D contingency table button, the
Display 2D scatterplot button, the Display 3D scatterplot
button, the Display ANOVA plot button, the Display 1D
distribution function button, the Display 3D bar graph
button, the Zoom in button and the Zoom out button. The
button for the plot type that is presently shown is
flagged in green.
20.4.11 Selections of entries can be made within the
chart, except for 3-D bar graphs. These selections are
also shown in the Comparison window and the Main
window. If the selection is changed in the comparison,
the chart is updated automatically. If another chart type
is selected, the entries keep their selected/unselected
state.
In the following sections the various types of charts and
their statistics are described.
Figure 20-20. A bar graph for one categorical variable.
20.5 Bar graph
20.5.1 Open the chart tool by clicking its toolbar button
in the Comparison window.
20.5.2 Select a categorical variable, e.g. an information
field and add it to the list of Used components, then
press <OK>.
20.5.3 This creates a Chart and Statistics window as
shown in Figure 20-20. The component that is displayed
is indicated beneath the toolbar. In case you selected
more than one categorical variable in the Used
components list (20.5.2), a drop down list can be used to
display another variable.
20.5.4 The entries corresponding to the bars in the chart
can be selected (or unselected) by pressing the CTRL key
while clicking or dragging the mouse.
20.5.5 Select Statistics > Chi square test for equal
category size. This creates a Statistics report, as shown in
Figure 20-21. A description of this test can be found in
20.3.1.
Figure 20-22. A contingency table for two categorical variables.
20.5.6 With Statistics > Index of diversity a report
window is generated which displays Simpson’s index of
diversity and the Shannon-Wiener index of diversity for
the selected entries and categories. The report can be
copied to the clipboard.
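The two indices named in 20.5.6 can be sketched as follows. The manual does not state which variants the program reports, so the formulas here are assumptions: Simpson's index is computed as 1 − Σ n_j(n_j − 1) / (N(N − 1)), the chance that two different entries belong to different categories, and the Shannon-Wiener index as −Σ p_j ln p_j with natural logarithms. The function names are hypothetical.

```python
from collections import Counter
from math import log

def simpson_diversity(labels):
    """1 - sum n_j (n_j - 1) / (N (N - 1)) over the category counts n_j."""
    counts = Counter(labels).values()
    total = sum(counts)
    return 1 - sum(c * (c - 1) for c in counts) / (total * (total - 1))

def shannon_diversity(labels):
    """-sum p_j ln p_j over the category proportions p_j."""
    counts = Counter(labels).values()
    total = sum(counts)
    return -sum(c / total * log(c / total) for c in counts)
```

Both functions take the category label of every selected entry, for example the content of one information field, and return 0 when all entries fall in a single category.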
20.6 Contingency table
20.6.1 Create a Chart and Statistics window with two
categorical variables. This can be done from the
Comparison window by clicking the chart tool button, or
within the Chart and Statistics window by editing the plot
components after clicking the Edit plot components
button. Select two categorical variables into the Used
components list. After clicking <OK>, a Contingency
table like in Figure 20-22. is created.
20.6.2 The contents of the X component and Y
component are indicated in the window. A drop down
list makes it possible to assign another categorical
variable from the used components list to the X
component and Y component.
20.6.3 Cells can be selected (or unselected) in the table by
pressing (CTRL + mouse click).
20.6.4 The contingency table can be displayed showing
row or column percentages by selecting View > Display
row percentages or View > Display column percentages,
respectively.
20.6.5 The contingency table can be displayed in the
Chart and Statistics window showing residuals in the cells,
with View > Display residuals. The residual for a cell is
a measure for the deviation from the expected number
of counts in that cell and is calculated as
$$ \frac{N_{o,ij} - n_{ij}}{\sqrt{n_{ij}}} , $$
with N_oij the observed cell count and n_ij the expected
cell count. This view is closely related to the statistic
test that can be applied to this chart (see 20.3.3).
Figure 20-21. Statistics report for Chi square test for equal category sizes.
Figure 20-19. The Chart and Statistics window.
20.6.6 Select Statistics > Chi square test for contingency
tables to apply the statistical test that is available for this
kind of plot. This creates a Statistics report, as shown in
Figure 20-23. A description of this test can be found in
20.3.3.
20.6.7 In the Chart and Statistics window, you can create
bar graphs for each of the two selected categorical
variables by clicking the Bar graph button.
Figure 20-23. Statistics report for the Chi square test for contingency tables.
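The residual view of 20.6.5 and the Chi square test of 20.6.6 can be related with a short sketch. It rests on two assumptions that this part of the manual does not spell out: the expected count of a cell is taken as row total × column total / grand total, and the residual is taken as the Pearson residual (observed − expected) / √expected, so that the squared residuals add up to the chi-square statistic. The function name is hypothetical.

```python
from math import sqrt

def residuals_and_chi_square(table):
    """Return (residual table, chi-square) for rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    residuals = []
    for i, row in enumerate(table):
        expected = [row_totals[i] * c / total for c in col_totals]
        residuals.append([(o - e) / sqrt(e) for o, e in zip(row, expected)])
    chi_square = sum(r * r for row in residuals for r in row)
    return residuals, chi_square
```

A table whose rows are proportional to each other, such as `[[1, 2], [2, 4]]`, gives residuals of zero in every cell; large positive or negative residuals point at the cells that contribute most to the test statistic.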
Figure 20-24. 2-D scatterplot for two quantitative variables.
20.7 2-D Scatterplot
20.7.1 Create a Chart and Statistics window with two
quantitative variables. This can be done from the
Comparison window by clicking the chart tool button, or
within the Chart and Statistics window by editing the plot
components after clicking the Edit plot components
button. Select two quantitative variables into the Used
components list. After clicking <OK>, a 2-D scatter plot
like in Figure 20-24. is created.
20.7.2 The contents of the X axis and Y axis are indicated
in the window beneath the toolbar. A drop down list
makes it possible to change the variables displayed on
the axes.
20.7.3 Dots can be selected in the chart by holding the
SHIFT key and drawing a rectangle around the dots
with the mouse.
20.7.4 With the menu command View > Regression line,
a regression line can be added to the plot. A Regression
selection dialog box pops up (Figure 20-25.), offering a
choice between several types of regression lines. After
selecting a regression type and clicking <OK> in the
dialog box, a small statistics report is generated. If a
regression line is fitted it is shown as a thick green line,
the 1-sigma uncertainty levels being plotted as thin
green lines.
Figure 20-25. The Regression selection dialog box,
where the type of regression line for the scatter plot
can be selected.
20.7.5 Under the menu item Statistics, a number of
statistical tests can be found: T test for mean value
(paired samples), Wilcoxon signed ranks test (paired
samples), Pearson correlation test and Spearman
rank-order correlation test. Each of these tests generates
a statistics test report. A description of these tests can be
found in 20.3.5.
20.7.6 If one or more categorical variables are present in
the Used components list, additional information from
one of these variables can be displayed in color code by
selecting the variable from the color drop-down list. If
this is the case, you can change the color labels with the
command View > Label with continuous colors.
20.7.7 For each of the quantitative variables used in this
plot, a 1-D distribution function plot can be generated.
This can be done by selecting Plot > 1D distribution
function, or by clicking the Display 1D distribution
function button. For more details on this kind of chart,
see 20.10.
20.8 3-D scatterplot
20.8.1 Create a Chart and Statistics window with three
quantitative variables. This can be done from the
Comparison window by clicking the chart tool button, or
within the Chart and Statistics window by editing the plot
components after clicking the Edit plot components
button. Select three quantitative variables into the Used
components list. This will create a 3-D scatterplot.
20.8.2 The variables that are displayed on the respective
axes are indicated beneath the toolbar. The variables can
be switched between the X axis, Y axis and Z axis. A drop
down list makes it possible to assign another
quantitative variable from the Used components list to the
respective axes.
20.8.3 Dots can be selected in the chart by holding the
SHIFT key and drawing a rectangle around the dots
with the mouse. The corresponding entries are also
selected in the Comparison window and the Main window.
If they are removed from the comparison, the chart is
updated automatically. Selections made in the Chart and
Statistics window are automatically updated in the
Comparison window and vice versa.
20.8.4 By clicking on the plot and holding the left mouse
button, the plot can be rotated in different directions.
The data points in the plot can be displayed as small
dots or as larger spheres, which can be achieved by
checking or unchecking the command View > Show
rendered spheres.
20.8.5 If one or more categorical variables are present in
the Used components list, additional information from
one of these variables can be displayed in color code by
selecting the variable from the color drop-down list. If
this is the case, you can change the color labels with the
command View > Label with continuous colors.
20.8.6 For each of the quantitative variables used in this
plot, a 1-D distribution function plot can be generated.
This can be done by selecting Plot > 1D distribution
function, or by clicking the Display 1D distribution
function button. For more details on this kind of chart,
see 20.10. For each couple of quantitative variables used
in this plot, a 2-D scatterplot can be generated. This can
be done by selecting Plot > 2D scatterplot, or by clicking
the Display 2D scatterplot button. For more details on
this kind of chart, see 20.7.
20.9 ANOVA plot
20.9.1 Create a Chart and Statistics window with one
categorical and one quantitative variable. This can be
done from the Comparison window by clicking the chart
tool button, or within the Chart and Statistics window by
editing the plot components after clicking the Edit plot
components button. Select one categorical and one
quantitative variable for the plot. This creates an
ANOVA plot like in Figure 20-26. The data for each
category is presented on a horizontal line. The scale for
the line is indicated at the top. Each data point is
indicated with a small vertical mark at the position
according to its numerical value and to the category it
belongs to.
20.9.2 The categorical and quantitative variables that are
displayed are indicated beneath the toolbar. A drop
down list makes it possible to assign other variables
from the Used components list to the respective axes.
20.9.3 Vertical marks, indicating the database entries,
can be selected in the chart by holding the SHIFT key
and drawing a rectangle around the marks with the
mouse.
20.9.4 From the menu item Statistics, the ANOVA test
(F test) or the Kruskal-Wallis test (in case the categorical
variable has more than two groups) or the T test or the
Mann-Whitney test (in case the categorical variable has
only two groups) can be launched. For these tests a
statistics test report is generated. A description of these
tests can be found in 20.3.6.
20.9.5 If one or more categorical variables are present in
the Used components list, additional information from
one of these variables can be displayed in color code by
selecting the variable from the color drop-down list. If
this is the case, you can change the color labels with the
command View > Label with continuous colors.
20.9.6 For the quantitative variable used in this plot, a
1-D distribution function plot can be generated. This can
be done by selecting Plot > 1D distribution function, or
by clicking the Display 1D distribution function button.
For more details on this kind of chart, see 20.10. For the
categorical variables used in this plot, a bar graph can be
generated. This can be done by selecting Plot > Bar
graph, or by clicking the Display bar graph button. For
more details on this kind of chart, see 20.5.
Figure 20-26. ANOVA plot for a categorical and a quantitative variable.
20.10 1-D numerical distribution
20.10.1 Create a Chart and Statistics window with one
quantitative variable. This can be done from the
Comparison window by clicking the chart tool button, or
within the Chart and Statistics window by editing the plot
components after clicking the Edit plot components
button. Select only one quantitative variable for the plot.
This will create a 1-D cumulative distribution function
plot like in Figure 20-27. The dots represent the data
points; each dot has a corresponding vertical mark just
above the chart. The smooth green line is the normal
distribution that serves as a model for the data.
20.10.2 The variable that is displayed is indicated
beneath the toolbar. A drop down list is available to
select another variable in case there is more than one
numerical variable in the Used components list.
20.10.3 Data points can be selected in the chart by
holding the SHIFT key and drawing a rectangle around
the vertical marks with the mouse.
20.10.4 Select Statistics > Kolmogorov-Smirnov test for
normality for applying the statistical test that is
available for this kind of plot. This will create a Statistics
report, as shown in Figure 20-28. A description of this
test can be found in 20.3.4.
20.10.5 Instead of a cumulative distribution, the data can
be presented as a bar graph by unchecking the command
View > Display cumulative distribution.
20.10.6 Additional information from a categorical
variable can be displayed in color code. In this case, with
the menu item View, you can change the color code into
a continuous color code and back.
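The comparison behind the 1-D cumulative distribution plot, the dots of the data against the smooth normal curve, can be sketched as the largest vertical gap between the two, which is the quantity a Kolmogorov-Smirnov test is built on. This is a minimal sketch and not the program's implementation: the test itself is described in 20.3.4, and it is an assumption here that the normal model uses the sample mean and the sample standard deviation (n − 1 in the denominator). The function name is hypothetical.

```python
from math import erf, sqrt

def ks_distance_to_normal(values):
    """Largest gap between the empirical CDF and a fitted normal CDF."""
    data = sorted(values)
    n = len(data)
    mean = sum(data) / n
    sd = sqrt(sum((v - mean) ** 2 for v in data) / (n - 1))
    worst = 0.0
    for k, v in enumerate(data):
        model = 0.5 * (1 + erf((v - mean) / (sd * sqrt(2))))  # normal CDF at v
        # The empirical CDF jumps from k/n to (k + 1)/n at v; check both sides.
        worst = max(worst, abs(model - k / n), abs((k + 1) / n - model))
    return worst
```

A small distance means the dots follow the green curve closely; turning the distance into a p-value requires the Kolmogorov-Smirnov distribution and is not attempted here.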
20.11 3-D Bar graph
20.11.1 For categorical variables, also a 3-D bar graph
can be plotted, see Figure 20-29. This can be done by
selecting two categorical variables for the plot and by
clicking the appropriate button or by selecting Plot > 3D
bar graph from the menu.
20.11.2 Under the menu item View , there is the option to
Label the X axis in color, to Label the Y axis in color or
to Label with continuous colors.
20.11.3 By clicking on the plot and holding the left
mouse button, the plot can be rotated in different
directions.
Figure 20-28. Statistics report for a Kolmogorov-Smirnov test.
Figure 20-27. A 1-D numerical distribution function for a single quantitative variable.
Figure 20-29. 3-D bar graph for two categorical variables.
21. Identification with database entries
There are two methods for identification available in
BioNumerics. The simplest one, as described in this
section, is to compare and identify unknown patterns
against a selection of database patterns stored on disk.
The more sophisticated method is to identify unknown
patterns against an identification library (see chapter
22.).
21.1 Creating lists for identification
21.1.1 In DemoBase, select all Ambiorix entries except the
Ambiorix sp. entries: First perform a search with Ambiorix
as genus name, and then perform a second search with
Search in list and Negative search enabled, and sp. as
species string.
21.1.2 Create a comparison with the selected entries, and
save it as Ambiorix.
21.1.3 Exit the Comparison window.
21.2 Identifying unknown entries
First we select some entries which we want to identify.
We consider the Ambiorix sp. entries (those without
species name) as unknown, and we will identify them
against the known Ambiorix entries (the list Ambiorix).
21.2.1 In the Main window, press F4 to clear the
selection.
21.2.2 Select all Ambiorix sp. entries (in the Entry search
dialog box, disable Search in list and Negative search and
enter Ambiorix in the Genus field and sp. in the species
field).
21.2.3 Copy the selected entries to the clipboard using
Edit > Copy selection or the corresponding toolbar
button.
21.2.4 Open the saved comparison Ambiorix.
21.2.5 Paste the selected Ambiorix sp. entries with Edit >
Paste selection.
21.2.6 For the identification purpose, we do not need the
dendrogram panel (left, see paragraph 9.7 and Figure
9-11.), which you can minimize.
21.2.7 Create sufficient space for the matrix panel (right,
see paragraph 9.7 and Figure 9-11.), where the similarity
values will appear.
21.2.8 In the experiment type selection bar, select an
experiment by means of which you want to identify the
unknown entries. Select for example FAME (fatty acid
methyl esters).
21.2.9 Click on the first unknown Ambiorix sp. entry in
the entry names panel (to place the selection bar on it).
21.2.10 In the menu of the Comparison window, choose
Edit > Arrange entries by similarity.
The entry with the selection focus stands on top and all
the other entries in the comparison are arranged by
decreasing similarity with that entry. The similarity
values are shown in the matrix panel.
21.2.11 You can click on the button of FAME to
display the images and drag the horizontal separator
line down to show the complete names of the fatty acids.
The Arrange entries by similarity function can be
repeated for each experiment type and for Composite
Data Sets, in order to compare the different results. The
program uses the similarity coefficient which is
specified in the Experiment type window.
21.2.12 A printout of the list of similarity values can be
obtained with File > Print database fields.
21.2.13 An export file of the similarity values is created
with File > Export database fields.
NOTE: In case of a Fingerprint Type, you can also
show the number of different bands between a
highlighted entry and the other entries, by selecting
Different bands as the default similarity coefficient.
Before selecting Edit > Arrange entries by
similarity, you should enable Layout > Show
distances.
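The ordering produced by Arrange entries by similarity (21.2.10) can be sketched as a sort on the similarity with the entry that has the selection focus. This is an illustration only: the program uses the coefficient set in the Experiment type window, whereas the Jaccard-style percentage below is a stand-in chosen for brevity, and the function names and the profile representation (a set of characters per entry) are hypothetical.

```python
def jaccard_percent(a, b):
    """Toy coefficient: shared characters as a percentage of all characters."""
    union = a | b
    return 100.0 * len(a & b) / len(union) if union else 100.0

def arrange_by_similarity(focus, profiles, similarity=jaccard_percent):
    """Return (name, similarity) pairs: the focus entry first, then the
    other entries by decreasing similarity with the focus entry."""
    ranked = sorted(
        ((name, similarity(profiles[focus], profile))
         for name, profile in profiles.items() if name != focus),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [(focus, similarity(profiles[focus], profiles[focus]))] + ranked
```

Calling the function once per experiment type, each time with that experiment's own coefficient, mirrors the repeated use described in 21.2.11.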
21.3 Fast band-based database screening
of fingerprints
In case of large databases of fingerprint patterns, the
most time-consuming part of a quick database screening
of new or unknown patterns is reading or downloading
all the fingerprint information. BioNumerics offers a
tool that overcomes this bottleneck by generating a
cache containing band information of all available
fingerprints belonging to a Fingerprint Type. When a
database screening is performed, this cache is loaded
rather than the full gel information. This cache-based
fingerprint screening is extremely fast, even for the
largest databases, but is limited to band-based
comparisons of fingerprint patterns. In addition, the
feature is only available in a Connected Database
environment (see chapter 28.), where a special column
holding the quick-access band information is generated
(30.1.5).
21.3.1 The fast band-based identification can be enabled
in the Fingerprint Type window (Experiments panel), by
selecting Settings > Enable fast band matching (this
menu command appears only in a Connected Database).
A question pops up “Do you want to generate cached
patterns for all current fingerprints?”. By answering
<Yes>, a cached pattern will be generated for all patterns
present in the database that belong to the selected
Fingerprint Type. If you answer <No>, a cached pattern
will be created only for new patterns that are added to
the database.
21.3.2 The fast identification tool is launched from the
Main window, where a set of selected entries will be
identified against all other database entries.
21.3.3 A menu command Identification > Fast band
matching (only in a Connected Database) pops up the Fast
band matching dialog box (Figure 21-1.). Under
Experiment type, select the Fingerprint Type you want
to use for the band matching. With Used range, you can
specify a range of the pattern (in percentage distance
from top) within which bands will be compared. The
Tolerance is the same as the Position tolerance explained
in 10.2. With Maximum difference, you can specify the
maximum number of different bands between the
unknown pattern and a database pattern to be included
in the result set. Furthermore, the Result set can be
limited to a certain number (default 20). In the input box
SQL query, it is possible to enter an SQL query, to limit
the search to a subset of entries that match a specific
string entered for an information field.
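A restriction entered in the SQL query box behaves like the condition of a WHERE clause. The sketch below tries such conditions against a small in-memory SQLite table. Everything about the table is hypothetical (its name `entries`, the `KEY` column and the rows); only the `GENUS` and `SPECIES` field names and the Ambiorix and Perdrix values come from this manual. The OR condition is written with the field name repeated on both sides, which is the form standard SQL expects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE entries ("KEY" TEXT, "GENUS" TEXT, "SPECIES" TEXT)')
conn.executemany(
    "INSERT INTO entries VALUES (?, ?, ?)",
    [
        ("E1", "Ambiorix", "sylvestris"),
        ("E2", "Ambiorix", "sp."),
        ("E3", "Perdrix", "sp."),
        ("E4", "Vercingetorix", "sp."),
    ],
)

def select_keys(restriction):
    """Return the keys of the entries that satisfy the restricting condition."""
    # Pasting text into a query is acceptable for a demo, not for untrusted input.
    query = 'SELECT "KEY" FROM entries WHERE ' + restriction + ' ORDER BY "KEY"'
    return [row[0] for row in conn.execute(query)]

only_genus = select_keys("\"GENUS\"='Ambiorix'")
genus_and_species = select_keys("\"GENUS\"='Ambiorix' AND \"SPECIES\"='sylvestris'")
either_genus = select_keys("\"GENUS\"='Ambiorix' OR \"GENUS\"='Perdrix'")
```

In this sketch the three restrictions select two, one and three of the four hypothetical entries, respectively.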
The typical syntax of a restricting SQL query is:
"GENUS"='Ambiorix'
One can also combine statements, for example:
"GENUS"='Ambiorix' AND "SPECIES"='sylvestris'
"GENUS"='Ambiorix' OR "GENUS"='Perdrix'
21.3.4 By pressing <OK> the fast band matching is
executed, and the identification result pops up in the
Fast band matching window (Figure 21-2.). This window is
subdivided in two panels, of which the upper panel lists
the entries to be identified, and the lower panel lists the
result set for the selected entry in the upper panel. The
only matching criterion used is the number of different
bands, which is listed in the outermost left column, Diff.
21.3.5 In both panels of the Fast band matching window,
you can select or unselect entries using the mouse in
combination with the SHIFT or CTRL keys. You can also
pop up the Entry edit cards by double-clicking on an
entry or pressing ENTER.
21.3.6 A text report can be exported with File > Export.
A tab-delimited text file is opened in Notepad, where
the matched entries are listed together with the best
matching database entries, sorted according to number
of different bands.
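The matching criterion of this section, the number of different bands between an unknown pattern and each cached database pattern, can be sketched as follows. This is a minimal sketch under assumptions, not the program's algorithm: bands are given as positions in percentage distance from the top, each band is paired greedily with the first unused band of the other pattern that lies within the tolerance, and the function names and the default values for `tolerance` and `max_difference` are hypothetical. The default result set of 20 follows 21.3.3.

```python
def different_bands(unknown, reference, tolerance):
    """Count the bands of either pattern without a partner within tolerance."""
    free = sorted(reference)
    unmatched = 0
    for band in sorted(unknown):
        partner = next((r for r in free if abs(r - band) <= tolerance), None)
        if partner is None:
            unmatched += 1
        else:
            free.remove(partner)   # each reference band is used at most once
    return unmatched + len(free)

def fast_band_matching(unknown, cache, tolerance=1.0, used_range=(0.0, 100.0),
                       max_difference=3, result_set=20):
    """Return up to result_set (difference, key) pairs, best matches first."""
    low, high = used_range

    def clip(bands):
        # Keep only the bands inside the used range of the pattern.
        return [b for b in bands if low <= b <= high]

    hits = []
    for key, bands in cache.items():
        diff = different_bands(clip(unknown), clip(bands), tolerance)
        if diff <= max_difference:
            hits.append((diff, key))
    return sorted(hits)[:result_set]
```

The `cache` argument stands for the cached band positions per database entry; because only these short lists are read, and not the full gel information, a screening of this kind stays fast even for large databases.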
Figure 21-2. The Fast band matching window.
Figure 21-1. The Fast band matching dialog box.
22. Identification using libraries
A library is a collection of library units, each of which
in turn is a selection of database entries. A library unit
is supposed to be a definable taxon. When generating
libraries for identification, a new library is first created.
Then, library units are defined within that library, to
which the names of the taxa are given. Within each
library unit, a selection of representative entries for that
taxon is entered.
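The relation between a library, its library units and the database entries can be pictured with a small data structure. This is only a conceptual sketch: the class and attribute names are hypothetical and say nothing about how BioNumerics stores libraries on disk.

```python
from dataclasses import dataclass, field

@dataclass
class LibraryUnit:
    name: str                                      # taxon name
    entries: list = field(default_factory=list)    # keys of representative entries

@dataclass
class Library:
    name: str
    units: dict = field(default_factory=dict)      # unit name -> LibraryUnit
    excluded_experiments: set = field(default_factory=set)  # not used for identification

    def add_unit(self, unit_name):
        """Return the unit with this name, creating it if it does not exist."""
        return self.units.setdefault(unit_name, LibraryUnit(unit_name))
```

A library named Demolib would then hold one unit per named species, for example a unit Ambiorix sylvestris containing the keys of the entries chosen as representatives of that taxon.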
22.1 Creating a library
22.1.1 In the Main window with DemoBase loaded, select
Identification > Create new library from the menu.
22.1.2 Enter a name for the library, for example
Demolib.
The Library window of the new library appears (Figure
22-1.). The left panel shows the available experiment
types and Composite Data Sets, and the right panel
shows the library units defined within the library. This
panel is initially empty.
Within the library, you can include and exclude
experiment types and Composite Data Sets. Excluded
experiments will not be used for identification.
22.1.3 Select an experiment which you do not want to use
for identification, for example a Composite Data Set.
22.1.4 In the menu, choose Experiment > Use for
identification. Experiments that are used for
identification are marked with a check mark; experiments
that are not used are marked with a red cross.
22.1.5 Select File > Add new library unit or press the
corresponding toolbar button.
22.1.6 Enter a name of one of the species in the database,
for example Ambiorix sylvestris.
The library unit now shows up in the right panel.
22.1.7 Double-click on the unit, or select it and choose
File > Edit library unit.
The Library unit window which appears is very similar to
the Comparison window, and allows all the same
clustering functions as in the Comparison window (see
chapters 9. and 11.). This allows you to cluster the
members of a library unit internally in order to check the
homogeneity of a defined taxon.
22.1.8 In the database, select all Ambiorix sylvestris
entries and copy them to the clipboard using Edit >
Copy selection or the corresponding toolbar button.
22.1.9 Paste the entries in the library unit with Edit >
Paste selection.
22.1.10 Save the library unit with File > Save.
Figure 22-1. The Library window of a new library.
22.1.11 Repeat 22.1.5 to 22.1.10 to create library units for
the named species.
22.1.12 When finished, close the library with File > Exit.
The library is now listed in the libraries panel of the
Main window. You can open the library and add or edit
units whenever desired.
22.2 Identifying entries against a library
22.2.1 Clear any selected entries in the database with F4.
22.2.2 Select a list of entries, for example all unnamed
species (Ambiorix sp. and Perdrix sp.) and a few entries of
the other species.
22.2.3 Select Demolib under Libraries and choose
Identification > Identify selected entries.
A dialog box appears, as shown in Figure 22-2. Under
Method, you can choose between the conventional,
similarity-based identification (Use similarities), K -
Nearest Neighbor and a neural network, if available
(Use neural network) (see 22.4).
Figure 22-2. The Identification dialog box with
similarity option selected.
22.2.4 With the option Mean similarity, the program
calculates a similarity between the unknown entry and
each entry in the library unit, and then calculates the
average similarity for the entire library unit. These
average similarities are then used in the identification
report.
22.2.5 With the option Maximum similarity, the
program will also calculate all similarities between the
unknown and the library unit entries, but only the
highest similarity value found is used in the
identification report.
22.2.6 If Mean similarity or Maximum similarity is
selected, an option Calculate quality quotients becomes
available.
The Quality quotient is an indication of the confidence of
the identification. It is obtained by comparing the
average similarity between the unknown entry and the
library unit's entries with the average similarity of the
library unit's entries with each other. If the first value is
as high as or higher than the second one, the unknown
entry fits well within the library unit. Thus this quality
indication takes into account the internal heterogeneity
of the taxon defined in the library unit.
22.2.7 With the option K - Nearest Neighbor, the user
has to specify a value K, which is the number of entries
from the whole library having the highest similarity
with the unknown. Suppose that 10 is entered for K: the
10 best matching entries from the whole library will be
retained. The library unit having the largest number of
entries belonging to these K nearest neighbors is
considered the best matching, and gets the highest score.
The score is simply the number of entries of the library
unit that belong to the K nearest neighbors.
22.2.8 If K - Nearest Neighbor is selected, an input field
K value becomes available, where you can enter the
number of nearest neighbors to look for.
NOTE: The value for K is supposed to be smaller than
the number of entries contained in each of the library
units. If this is not the case, the program will warn you
about this conflict when the identification is executed.
22.2.9 The Neural network option is explained in detail
in 22.4. If this option is checked, a drop-down list
becomes available, showing the existing neural
networks, from which you can choose one.
22.2.10 Optionally, a Minimum score can be specified. If
a library unit has a score that is lower than the minimum
score specified, the library unit will not be listed in the
identification report. Obviously, the score depends on
the method selected. If a similarity method is selected,
the score should be a floating-point value between 0 and
100; if K - Nearest Neighbor is selected, the value should
be an integer value between 0 and K.
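The scoring options described in 22.2.4 to 22.2.7 can be sketched as follows. This is only an illustration under stated assumptions: the similarity values and unit names are made up, similarities are taken as already calculated (0 to 100), and fits_within_unit reproduces only the comparison of the two averages described for the Quality quotient, not the way the program turns it into a colored indicator.

```python
def mean_similarity(similarities):
    """Average similarity between the unknown and all entries of one unit."""
    return sum(similarities) / len(similarities)


def maximum_similarity(similarities):
    """Highest similarity between the unknown and any entry of one unit."""
    return max(similarities)


def knn_scores(similarities_per_unit, k):
    """Count, per unit, how many of its entries are among the K best matches."""
    ranked = sorted(
        ((value, unit) for unit, values in similarities_per_unit.items() for value in values),
        key=lambda pair: pair[0],
        reverse=True,
    )
    scores = {unit: 0 for unit in similarities_per_unit}
    for _, unit in ranked[:k]:
        scores[unit] += 1
    return scores


def fits_within_unit(unknown_to_unit, within_unit):
    """True if the unknown is, on average, as similar to the unit's entries
    as those entries are to each other."""
    return mean_similarity(unknown_to_unit) >= mean_similarity(within_unit)


# Made-up similarities between one unknown entry and the entries of two units.
similarities = {
    "unit A": [92.0, 88.0, 90.0],
    "unit B": [60.0, 95.0, 55.0],
}
mean_scores = {unit: mean_similarity(v) for unit, v in similarities.items()}
max_scores = {unit: maximum_similarity(v) for unit, v in similarities.items()}
neighbor_scores = knn_scores(similarities, k=3)
```

With these numbers the mean and K - Nearest Neighbor options favor unit A, whereas the maximum option favors unit B because of a single close entry. This shows why the choice of method matters for heterogeneous units.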
22.2.11 Click Mean similarity, check Calculate quality
quotients, and press <OK>.
The Identification window appears, showing the progress
of the calculations in the caption. Once the calculations
are done, the window is divided into three panels (Figure
22-3.). The left panel lists the unknown entries, and the
right panel the experiment types (columns) by which the
unknown entries were identified. You can select one of
the experiment types (columns), and the central panel
shows the best matching library units for the selected
experiment. For the experiments that are not selected,
the best matching library unit may be a different one.
Therefore, the place of the library unit in
decreasing order of match is indicated for each
experiment (gray number between square brackets). An
ideal identification is when all experiment types show
[1].
22.2.12 You can move the vertical separator lines
between the panels.
The identification scores are the similarity values
obtained using the coefficient which is specified in the
settings of the experiment type. The quality quotients
appear as colored dots next to the identification scores.
They range from red (improbable identification) over
orange, yellow (doubtful identification) to green
(faithful identification).
22.2.13 Print the global identification report with File >
Print report, or create a text report of it with File >
Export report to file.
For routine identification purposes, it can be useful to
store the identification results for each unknown
entry. To do so, create a dedicated field in the
database, and use the following command:
22.2.14 Click on the dedicated field where you want to
store the identification result, and select File > Fill
information field.
22.3 Detailed identification reports
For each unknown entry, you can display a more
detailed identification report.
22.3.1 Select one of the unknown entries and Show >
Detailed report (or double click).
The Detailed identification window (Figure 22-4.) displays
the best matching library units in decreasing order for
the selected experiment type (columns). The experiment
type which was selected in the Identification window will
be selected here too. For the other experiment types, the
order of best matching library units is likely to be
different. Therefore, the place of the library unit in
decreasing order of match is indicated for each
experiment type (gray number between square
brackets). In the ideal case, all experiment types would
show [1].
Figure 22-3. Identification window.
Figure 22-4. Detailed identification window.
22.3.2 You can print or export the report with File >
Print report or File > Export report to file.
22.3.3 With Show > Identification comparison, an
Identification comparison is created.
This window is in fact a Comparison window which lists
the unknown entry and the entries of the library unit.
The unknown entry is displayed in red, except when it is
one of the library unit's members; in that case it is
displayed in blue.
22.4 Creating a neural network
•Theory
A neural network is a means of calculating a function of
which one doesn't have a clear description, but of which
many examples with known input and output are
present. Typically, the input is a set of characters for
each example, and the output is the name of a group to
which the example belongs. The neural network can be
trained with the examples, and if the training succeeds
well, the neural network can be used to perform the
same calculation with other data of which the output is
not known. Usually, all the examples that are fed to the
neural network are divided randomly in a training set
and a validation set. The training set is the part of the
example set that will be used to calculate the neural
network and the validation set is the part that will be
used to validate the network, i.e. check its correctness on
other examples than the ones used for training.
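The random division into a training set and a validation set can be sketched as below. The function name, the fixed seed and the example data are assumptions for illustration; the 25% fraction is the default quoted later in this chapter for Validation samples.

```python
import random


def split_examples(examples, validation_fraction=0.25, seed=0):
    """Randomly divide the examples into (training_set, validation_set)."""
    rng = random.Random(seed)          # fixed seed only to make the sketch repeatable
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


# Made-up examples: (entry key, group name)
examples = [(f"entry{i}", "group A" if i < 10 else "group B") for i in range(20)]
training_set, validation_set = split_examples(examples)
```

The validation examples are never shown to the network during training, so the error measured on them estimates how well the network handles other data.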
A neural network consists of several layers of neurons or
nodes; mostly there are 2 or 3 layers. The first layer is the
input layer, the last one is the output layer, and the
intermediate ones - if present - are called the hidden
layers. Usually there are 0 or 1 hidden layers. Every
neuron or node has a value that is calculated by the
neural network. The values of the neurons in the input
layer are simply the input of the function. Every neuron
in the successive layers takes the value of all the neurons
in the previous layer and performs a calculation on it, to
obtain its own value. Mostly this calculation is a
weighted sum, in which the weights can be different for
every neuron. That value will be used by neurons in
consecutive layers. The number of nodes in the input
layer is equal to the number of characters available for
the data set, i.e. the number of characters in the
experiment which is used to calculate the neural
network. The number of nodes in the output layer is
equal to the number of groups defined in the
identification system. The number of nodes in the
hidden layer - if any - can be chosen and is dependent
on the nature and complexity of the data set and
identification system.
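The layer-by-layer calculation described above can be sketched with plain weighted sums. The weights and input values are made up, and the sketch assumes a pure weighted sum without any additional transfer function, which the text leaves open ("mostly this calculation is a weighted sum").

```python
def layer_values(previous_values, weights):
    """Value of each neuron = weighted sum of all neurons in the previous layer.

    weights[j][i] is the weight connecting neuron i of the previous layer
    to neuron j of the current layer.
    """
    return [sum(w * v for w, v in zip(row, previous_values)) for row in weights]


inputs = [1.0, 0.0, 2.0]                 # three characters -> three input nodes
hidden_weights = [[0.5, 1.0, -1.0],      # two hidden nodes
                  [1.0, 1.0, 1.0]]
output_weights = [[1.0, 0.0],            # two groups -> two output nodes
                  [0.0, 2.0]]

hidden = layer_values(inputs, hidden_weights)
outputs = layer_values(hidden, output_weights)
```

The shapes of the weight tables follow directly from the text: one column per node in the previous layer and one row per node in the current layer.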
During the training cycle the input of a known example
is fed in the neural network and the calculation is
performed. Initially the calculated output will most
likely be very different from what it should be. The
weights between every pair of consecutive neurons are
then adjusted slightly, so that the calculated output
becomes closer to the correct output. This is done using
a process called back-propagation. This means that in the
output layer the errors are calculated, which are the
difference between the correct output and the calculated
output. These errors are then back-propagated to the
neurons in previous layers by multiplying the error by
the weight that connects two neurons, and summing for
every neuron. The weights of the neurons are then
adjusted by the error times a number called the learning
ratio (or learning rate). Furthermore the weight correction
of the previous training cycle times a number called the
momentum is added. The higher the learning ratio and the
momentum, the faster the training, but the higher the
risk that the error doesn't decrease.
This training process is repeated many times (typically a
few thousand times), each time with another known
example chosen randomly from the training set. After
sufficient iterations the calculated outputs will be very
close to what they should be, provided that the number
of layers and number of nodes per hidden layer is
chosen correctly. A higher number of layers and/or
neurons means that training and calculation will take
longer, so a trade-off has to be made. Furthermore, there
is a danger of overtraining when there are too many
layers and/or neurons, which means that the neural
network would be very good for the examples, but not
at all for other inputs. To have an estimate of this, one
usually divides the known examples in a training set and
a validation set. The validation set is not used for training,
but only to check how well the neural network performs
on this set. If it is significantly worse than for the
training set, one knows that there are too many layers
and/or neurons.
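The training cycle described above (forward calculation, back-propagation of the errors, weight adjustment with learning ratio and momentum) can be sketched for a network with one hidden layer. This is not the BioNumerics implementation. The sigmoid transfer function, the absence of bias terms, the squared-error measure, the initial weight range and the toy examples are all assumptions; the values 0.5 and 0.1 are those quoted later in this section for the learning rate and momentum.

```python
import math
import random

LEARNING_RATE = 0.5
MOMENTUM = 0.1

# Toy examples (assumed): two characters as input, two groups as output.
EXAMPLES = [
    ([1.0, 0.0], [1.0, 0.0]),
    ([0.9, 0.1], [1.0, 0.0]),
    ([0.0, 1.0], [0.0, 1.0]),
    ([0.1, 0.9], [0.0, 1.0]),
]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w_hidden, w_output):
    """Return the values of the hidden and output neurons for input x."""
    hidden = [sigmoid(sum(w * v for w, v in zip(row, x))) for row in w_hidden]
    output = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_output]
    return hidden, output


def mean_error(examples, w_hidden, w_output):
    """Average squared difference between correct and calculated output."""
    total = 0.0
    for x, target in examples:
        _, output = forward(x, w_hidden, w_output)
        total += sum((t - o) ** 2 for t, o in zip(target, output))
    return total / len(examples)


def train(examples, n_hidden, iterations, seed=1):
    rng = random.Random(seed)
    n_in, n_out = len(examples[0][0]), len(examples[0][1])
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    w_output = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
    prev_hidden = [[0.0] * n_in for _ in range(n_hidden)]    # previous corrections
    prev_output = [[0.0] * n_hidden for _ in range(n_out)]
    for _ in range(iterations):
        x, target = rng.choice(examples)        # random known example
        hidden, output = forward(x, w_hidden, w_output)
        # Errors in the output layer (scaled by the slope of the sigmoid).
        d_out = [(t - o) * o * (1.0 - o) for t, o in zip(target, output)]
        # Back-propagation: error times connecting weight, summed per hidden neuron.
        d_hid = [h * (1.0 - h) * sum(d_out[k] * w_output[k][j] for k in range(n_out))
                 for j, h in enumerate(hidden)]
        for k in range(n_out):
            for j in range(n_hidden):
                step = LEARNING_RATE * d_out[k] * hidden[j] + MOMENTUM * prev_output[k][j]
                w_output[k][j] += step
                prev_output[k][j] = step
        for j in range(n_hidden):
            for i in range(n_in):
                step = LEARNING_RATE * d_hid[j] * x[i] + MOMENTUM * prev_hidden[j][i]
                w_hidden[j][i] += step
                prev_hidden[j][i] = step
    return w_hidden, w_output


w_hidden, w_output = train(EXAMPLES, n_hidden=3, iterations=3000)
```

Calling mean_error on a separate validation set, rather than on EXAMPLES, gives the overtraining check described above.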
•Application
A neural network can be applied to many problems,
such as control theory, character recognition, statistical
analysis and distinguishing patterns. In practice, a
neural network is very useful to set up an identification
or recognition system based upon complex data sets in
which it is difficult or impossible to identify
discriminatory keys based upon conventional methods
such as calculation of similarity using coefficients,
cluster analysis, principal components analysis etc. An
important requirement for successfully applying neural
networks is that the example data set is sufficiently large
and that many examples are present for each group of
the identification system.
In our software, it is used for determining to what
predefined group or taxon an unknown database entry
belongs, based on measurements that could be a
character set or a fingerprint. This is thus an example of
distinguishing patterns. In this case the output of the
neural network is n values, where n is the number of
predefined groups. Every group is given a number from
1 to n, and thereby corresponds to one of the outputs.
The higher a value in the output, the more likely the
sample belongs to that group. In the training and
validation set the output values are zero, except for the
output that corresponds to the group, which will be one.
After the training has succeeded one can use the network
with measurements on unknown samples. For these, the
highest output determines the group to which the sample
is assigned.
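The output coding described above can be sketched in a few lines. The function names and the example output values are assumptions for illustration.

```python
def target_vector(group_number, n_groups):
    """Training/validation output: 1 for the example's group, 0 elsewhere.

    Groups are numbered from 1 to n, as in the text.
    """
    return [1.0 if g == group_number else 0.0 for g in range(1, n_groups + 1)]


def decide_group(outputs):
    """Return the number (1 to n) of the group with the highest output."""
    best_index = max(range(len(outputs)), key=lambda i: outputs[i])
    return best_index + 1


target = target_vector(2, 4)
group = decide_group([0.10, 0.70, 0.15, 0.05])   # made-up network outputs
```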
In BioNumerics, the choice in hidden layers is limited to
none or one, because more hidden layers usually don't
give any advantage. In extensive tests, one hidden layer
was always sufficient, in many cases no hidden layer
worked just as well. The number of nodes in the hidden
layer can be chosen if the user wants to do so. If the user
doesn't specify this, the neural network will start
without a hidden layer. If it doesn't succeed in lowering
the error, a hidden layer will be created. If this still
doesn't lower the error, the hidden layer is expanded
until the error is below a predefined threshold.
The learning rate and momentum cannot be specified.
Instead we fixed these to 0.5 and 0.1 respectively,
because in our tests these values gave the optimal
trade-off between speed and success.
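The automatic choice of the hidden layer described above can be pictured as a simple loop. This is a sketch under assumptions: train_and_measure is a hypothetical callback that trains a network with the given number of hidden nodes and returns its error, and the step of one node at a time, the threshold value and the upper limit are not specified in the text.

```python
def choose_hidden_nodes(train_and_measure, error_threshold, max_hidden_nodes):
    """Start without a hidden layer and expand it until the error is low enough.

    Returns (number_of_hidden_nodes, error); 0 hidden nodes means no hidden layer.
    """
    for n_hidden in range(0, max_hidden_nodes + 1):
        error = train_and_measure(n_hidden)
        if error < error_threshold:
            return n_hidden, error
    return max_hidden_nodes, error   # threshold not reached within the limit


# Made-up error curve: the error decreases as hidden nodes are added.
result = choose_hidden_nodes(lambda n: 1.0 / (n + 1), error_threshold=0.3,
                             max_hidden_nodes=10)
```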
To train a neural network, a library must be present. See
22.1 to create a new library. To obtain a reliable neural
network, each of the library units must have sufficient
members, many more than just two or three. The
number of entries required also depends on the
heterogeneity of the group: the more heterogeneous a
group, the more entries will be needed to create a
reliable neural network.
22.4.1 Double-click on a library to open it.
22.4.2 Select Experiment > Train neural network or press
the corresponding toolbar button. A dialog box pops up,
listing the existing neural networks for this database, if any.
22.4.3 To add a neural network, press <Add>.
The Neural network training dialog box appears, as
shown in Figure 22-5.
Figure 22-5. The Neural network training dialog box.
22.4.4 Under "Select experiment to be used in the
neural network", you can select the experiment to train
the neural network.
22.4.5 With "Validation samples", it is possible to
specify the percentage of the library entries (i.e. the
example data) to be used as validation set. By default
this value is 25%.
22.4.6 With "Max. number of iterations", you can
specify the maximum number of training cycles to be
performed. By default this value is 20000.
22.4.7 "Number of hidden nodes" allows you to
manually specify the number of hidden nodes. If you
leave this field blank, the program will automatically
determine whether a hidden layer is required, and if so,
the optimal number of hidden nodes. If you enter zero,
no hidden layer will be created.
22.4.8 Enter a name for the neural network under Neural
network name. You can use the name of the experiment
type.
22.4.9 When all parameters are entered, press <Start
training> to start the training process. Depending on the
size of the library, the training process can take several
minutes. An animation of the progress of the training is
shown in the x-t diagram (Figure 22-5.).
22.4.10 During the training, it is possible to interrupt or
abort the process by pressing <Stop>.
22.4.11 If you wish to resume the training process, press
<Continue>. The program will continue the iteration
process until the maximum number is reached.
22.4.12 To save the neural network, press <OK>.
22.4.13 To identify database entries using a neural
network, proceed as explained in 22.2.1 to 22.2.11, but
choose Neural network instead. A drop-down list
showing the existing neural networks will become
available, allowing you to choose one of them for the
identification.
23. Analyzing 2D gels
23.1 Proteomics in a broader context: the
BioNumerics Platform
The BioNumerics 2D application, developed for the
analysis and comparison of two-dimensional, spot-oriented
bitmap files, is physically an integral part of the
BioNumerics software suite. Therefore, it is available as
a module of BioNumerics, referred to as BioNumerics
2D. Together with two other applications that act as
plugins of BioNumerics, GeneMaths and Kodon, the
BioNumerics software forms the basis for an integrated
bioinformatics platform: the BioNumerics Platform. The
obvious advantage of integrating a 2D image analysis
application within a broad bioinformatics platform is
the possibility to link genomics, proteomics,
metabolomics and phenotypic data in one powerful
database.
By its integration in the BioNumerics Platform, the
combined use of BioNumerics 2D and the GeneMaths
software (Applied Maths) will allow the co-evaluation
of the expression of specific proteins with the
simultaneous expression of homologous genes as
evidenced by microarray experiments. Also, the
proteins detected and identified can be linked to DNA
and protein sequences that are kept in the BioNumerics
database and which are amenable to all kinds of
sequence analysis tools such as structural comparison,
chromosome mapping, vector cloning, primer design, and
secondary structure analysis using the Kodon software
(Applied Maths), which is also fully integrated in the
BioNumerics Platform.
Another advantage of integrating 2D gel analysis in a
broad bioinformatics analysis platform is the availability
of numerous powerful analysis tools. These include
cluster analysis of organisms or samples based upon
their (combined) experimental data, or cluster analysis
of characters such as genes or protein spots; a wide
range of dimensioning techniques such as principal
components analysis, discriminant analysis, MANOVA,
or Self-Organizing Maps, are all available to compare in
two ways: organisms/samples amongst each other, or
protein spots and genes amongst different samples.
In BioNumerics, all biological experiments are
functionally classified in five different classes, called
Experiment Types:
•Fingerprint Types: Any densitometric record seen as a
one-dimensional profile of peaks or bands can be
considered as a Fingerprint Type. Fingerprint Types
can be derived from TIFF or bitmap files as well,
which are two-dimensional bitmaps. The condition is
that one must be able to translate the patterns into
densitometric curves.
•2D Gel Types: Any two-dimensional bitmap image
seen as a profile of spots or defined labelled structures.
Examples are 2D protein gel electrophoresis
patterns, 2D DNA electrophoresis profiles, 2D thin
layer chromatograms, or even images from
radioactively labelled cryosections or short half-life
radiotracers.
•Character Types: Any array of named characters, binary
or continuous, with fixed or undefined length can be
classified within the Character Types. The main
difference between Character Types and
electrophoresis types is that in the Character Types,
each character has a well-determined name, whereas
in the electrophoresis types, the bands, peaks or
densitometric values are unnamed (a molecular size is
NOT a well-determined name!).
•Sequence Types: Within the Sequence Types, the user
can enter nucleic acid (DNA and RNA) sequences and
amino acid (protein) sequences.
•A fifth type, Matrix Types, is not a native experiment
type, but the result of a comparison between database
entries, expressed as similarity values between certain
database entries.
Each experiment type is available as a module of the
BioNumerics software. In the following chapters,
BioNumerics 2D will refer to the module 2D Gel Types
within BioNumerics.
Through its integration with BioNumerics, the
BioNumerics 2D software is a perfect tool to be used in
applications such as proteomics, protein expression
studies, drug discovery, functional genomics and
proteome mapping, metabolomics, protein interactions
research, signal transduction pathways, molecular
oncology and clinical screening.
23.2 Data sources for BioNumerics 2D
BioNumerics 2D can handle a variety of file formats
including 8-bit, 12-bit, and 16-bit. The software is able to
cope with images of any size and OD depth. The
software can be used with a variety of staining and
labelling protocols, using different support materials.
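The relation between bit depth, number of gray values and OD resolution can be worked out with a short calculation. The sketch assumes that the OD range is spread linearly over the gray values, which is a simplification; the OD ranges used (2.0 and 3 OD units) are the approximate figures quoted in this section for document scanners.

```python
def gray_levels(bits):
    """Number of distinct gray values for a given bit depth."""
    return 2 ** bits


def od_step(od_range, bits):
    """OD difference between two consecutive gray values (linear mapping assumed)."""
    return od_range / (gray_levels(bits) - 1)


levels = {bits: gray_levels(bits) for bits in (8, 12, 14, 16)}
step_8bit = od_step(2.0, 8)     # about 0.0078 OD per gray value
step_16bit = od_step(3.0, 16)   # much finer steps at 16-bit depth
```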
For the capture of 2D images a variety of densitometers,
cameras or radiation detection devices are used. These
devices differ not only in cost but also in resolution
and dynamic range. Examples of commonly used
equipment for the digitization of gels are:
•Polaroid photo
•Autoradiography film
•CCD (video) cameras
•CCD document scanners
•Fluorescence cameras
•Phosphor-imagers
•Laser densitometers
•…
Expensive laser densitometers have mostly been
replaced by document scanners and video cameras.
Most document scanners have a rather limited OD range
(of about 2.0 OD units covered by an 8-bit gray scale or
256 gray values). These values, however, cover quite
well what is generally obtained using the most
commonly used staining methods or with X-ray film
irradiation. New types of document scanners or imagers
may offer a considerably higher dynamic range
performance (up to 3 OD units) with 12-, 14- or 16-bit
gray scale levels. The BioNumerics 2D software is able
to import the TIFF files from all these types of scanners.
23.3 Applications for BioNumerics 2D
The most obvious application for BioNumerics 2D is the
analysis of 2D protein gel electrophoresis experiments.
Separating, detecting, and quantifying proteins is the
main purpose of modern proteomics research. In order
to correctly identify changes of protein expression levels
(e.g. of disease related proteins), it is extremely
important to use procedures that will allow high
resolution separation and a proper staining or labelling
method.
2D gel electrophoresis separates proteins based on their
iso-electric points (pI values) in a so-called first
dimension performed in a carrier that contains an IPG
(immobilized pH gradient), followed by a second
dimension in a carrier that separates on molecular
weight in a traditional electrophoresis process. There
are currently two techniques
available for the first dimension of 2D gel
electrophoresis: NEPHGE and IPG. NEPHGE stands for
non-equilibrium pH gradient electrophoresis, and is a
technique with high resolution but lower levels of
reproducibility, while IPG (immobilized pH gradient)
has a lower resolution but is easier to handle. The
lack of resolution of the latter technique has been
circumvented by the use of multiple gels with more
limited pH ranges. By using BioNumerics 2D it is
possible to assemble these gels with different pH ranges
into a synthetic gel that will contain the overall information
for the subject being studied.
When coupled to existing databases of known proteins,
characterized according to the above mentioned
parameters, 2D gel electrophoresis can be used to
identify cellular proteins, new cell or tissue components
or to detect alterations in protein expression, metabolic
or physiological activities and will also assist in the
quantitative and qualitative comparison of gels run on
samples obtained under different conditions.
By using the BioNumerics 2D software all relevant and
supporting information can be stored in a structured
database format. This information can be used for
selection and for comparison purposes.
23.4 Getting started with BioNumerics 2D
The next paragraphs will guide the user stepwise
through the different functions of BioNumerics' 2D gel
analysis application. In order to benefit from all the
possibilities of the software, first-time users are
recommended to read this guide thoroughly.
•The Demo databases
In order to assist the user in setting up a database
system, a small sample database is included with the
BioNumerics 2D software. This sample database, which
can be installed from the CD, contains four
examples of 2D gel TIFF files (Furhigh.tif, Furlow.tif,
Wthigh.tif, Wtlow.tif) that will be used as examples in
this guide. These files are obtained with kind permission
from Dr. A.H.M. van Vliet (see footnote 1). They represent
a wild type Campylobacter jejuni strain exposed to low iron
concentration (Wtlow) and high iron concentration
(Wthigh), and a Fur protein (see footnote 2) mutant exposed
to low iron concentration (Furlow) and high iron concentration
(Furhigh). These gels will be further used for
demonstration purposes throughout the 2D gel
chapters. A database Demo2D, containing these four
gels fully analyzed, is also available on the installation
CD, under Demo\Demo2D.
1. van Vliet, A.H.M., K.G. Wooldridge, and J.M. Ketley.
1998. J. Bacteriol. 180: 5291-5298.
2. The Fur protein controls the expression of iron-regulated
proteins.
23.5 Creating a new database
As explained earlier (1.2), BioNumerics databases are
designed to store information in a structured way. New
databases will be added to this structure, automatically
creating the necessary files and folders to allow proper
management and back-up of your data. We will create a
new database for setting up some 2D gel experiments
(see also 5.3).
23.5.1 In the BioNumerics Startup program, press the
<New> button to enter the New database wizard.
23.5.2 Enter a name for the database, e.g. Demo2D, and
press <Next>.
23.5.3 Press <Next> again without changing anything to
the directory defaults.
23.5.4 You are now asked whether or not you want to
create log files. If you enable BioNumerics to create log
files, every change made to a database component
(entry, experiment, etc.) is recorded to the log file with
indication of the kind, the date, and the time of change
(see 5.6).
23.5.5 Press <Finish> to complete the setup of the new
database.
23.5.6 Before creating the final files, BioNumerics
needs to know the type of database you would like to
prepare. Three options are available: Use the local database,
Create a new, empty connected database, or Connect to
an existing connected database.
Details on the use of local and connected databases are
given in section 28.1.
23.5.7 Select <Use the local database>, and press <OK>
to quit the setup of the new database.
23.6 Defining a new 2D experiment type
As with the other experiment types in
BioNumerics, it is possible to create different 2D Gel
Types within the same database. This option is
useful for setting up different kinds of 2D gel
experiments within the same database. All options and
parameters defined for a given kind of 2D gels will be
stored within the 2D Gel Type. Other options and
parameters may be stored within other 2D Gel Types,
without having to overwrite carefully defined settings.
Within a specific 2D Gel Type, gels are normalized to
match each other through a Reference system, similar to
that for 1-D fingerprints (7.4.12). In a 2D gel experiment type,
a reference system is created by choosing a good quality
gel with clearly resolved spots, and defining all spots, or
a subset, as reference spots in the reference system. Other
gels can then be aligned to the reference system, and
thus to each other, by linking a number of
corresponding spots to the reference spots in the
reference system. Such linked spots are called landmarks.
Based upon a number of landmarks defined by the user,
the program can match all the non-landmark spots of
the gel with the remaining reference spots of the
reference system. This matching is done within certain
tolerance boundaries, which can be specified by the
user. In this way, corresponding spots on different gels
are linked to each other by linking them to the same
reference spots on the reference system.
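The idea of landmark-based matching can be sketched as follows. This is a deliberately simplified illustration, not the algorithm used by the program: it assumes that the gel differs from the reference system only by a shift (estimated as the mean offset of the landmarks), that each remaining spot is linked greedily to the nearest free reference spot, and that the tolerance is a single distance in pixels. All names and coordinates are made up.

```python
import math


def landmark_offset(landmarks, reference_spots):
    """Mean (dx, dy) that moves the landmark spots onto their reference spots."""
    n = len(landmarks)
    dx = sum(reference_spots[ref][0] - pos[0] for pos, ref in landmarks) / n
    dy = sum(reference_spots[ref][1] - pos[1] for pos, ref in landmarks) / n
    return dx, dy


def match_spots(gel_spots, reference_spots, landmarks, tolerance):
    """Link non-landmark gel spots to the remaining reference spots."""
    dx, dy = landmark_offset(landmarks, reference_spots)
    used = {ref for _, ref in landmarks}       # reference spots already linked
    matches = {}
    for name, (x, y) in gel_spots.items():
        shifted = (x + dx, y + dy)
        best, best_distance = None, tolerance
        for ref_name, ref_pos in reference_spots.items():
            if ref_name in used:
                continue
            distance = math.dist(shifted, ref_pos)
            if distance <= best_distance:
                best, best_distance = ref_name, distance
        if best is not None:
            matches[name] = best
            used.add(best)
    return matches


reference_spots = {"R1": (10.0, 10.0), "R2": (50.0, 20.0),
                   "R3": (30.0, 40.0), "R4": (70.0, 70.0)}
landmarks = [((12.0, 9.0), "R1"), ((52.0, 19.0), "R2")]   # (gel position, reference spot)
gel_spots = {"s1": (32.5, 39.0), "s2": (90.0, 90.0)}      # non-landmark spots

matches = match_spots(gel_spots, reference_spots, landmarks, tolerance=3.0)
```

Spot s2 stays unmatched because no free reference spot lies within the tolerance, which corresponds to a spot that is present on the gel but absent from the reference system.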
Within the same 2D Gel Type, however, it is possible to
define more than one reference system. This possibility
is useful when, for example, several gels with different
pI ranges are run from the same sample.
23.6.1 To create a new experiment type select
Experiments > Create new 2D gel type from the main
menu.
Alternatively you can press the experiment symbol
in the toolbar of the experiment types panel or
you can right click in the experiment types panel and
select the option: Create new 2D gel type.
23.6.2 The New 2D Gel type wizard prompts you to enter
a name for the new type. Since we are going to work
with the Fur experiments of C. jejuni (see 23.4) enter for
instance “Fur” as name.
23.6.3 Press <Next> and select the correct optical density
depth (OD) of the fingerprint data files. The default
setting corresponds to the most common case, i.e.
two-dimensional TIFF files with 8-bit OD depth (256 gray
values).
23.6.4 After pressing <Next> again, the wizard asks
whether the 2D gels have inverted densitometric values.
This is the case when your image appears to have white
spots on a dark background.
Since BioNumerics 2D recognizes the darkness as the
intensity of a spot, the wizard allows you to
invert the densitometric values.
23.6.5 In case you are processing normally registered
gels, i.e. dark spots on a white background, check
<NO>.
Furthermore, the wizard allows you to adjust the color
of the background and the bands to match the reality.
The red, green and blue components can be adjusted
individually for both the background color and the band
color. Usually, you will leave the colors unaltered. If
you would like to mimic e.g. the blue of Coomassie Blue,
you can move the Band color adjuster for Red (R) to the left,
for Blue (B) to the right and for Green (G) to an intermediate
position that produces the requested color.
23.6.6 In the next step, you are prompted to allow a
Background subtraction. At this time, we leave the
background subtraction disabled, by checking <No>.
Later, we will see how to subtract background (23.8.22).
23.6.7 Press <Finish> to complete the creation of the new
2D Gel Type.
NOTE: You will be able to adjust all of these parameters
later.
23.6.8 The experiment types panel now lists “Fur” under
2D gel types of the database Demo2D.
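Inverting the densitometric values, as offered in step 23.6.4, amounts to mirroring each gray value within the range allowed by the OD depth. The sketch below assumes an image held as a nested list of integer gray values; the function name and the sample image are made up.

```python
def invert_densitometric_values(pixels, bit_depth=8):
    """Mirror every gray value: 0 becomes the maximum value and vice versa."""
    max_value = 2 ** bit_depth - 1
    return [[max_value - value for value in row] for row in pixels]


# Made-up 8-bit image: one bright spot (255) on a dark background.
image = [[0, 10, 0],
         [10, 255, 10]]
inverted = invert_densitometric_values(image)
```

Applying the function twice restores the original image, so the setting can be changed later without loss of information.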
23.7 Importing 2D gel image files
After creating the database Demo2D as described earlier,
you should have a subdirectory
C:\Program Files\BioNumerics\Data\Demo2D\Gel2d
Before 2D gels can be processed, the image files should
appear in this directory. There are several ways to place
image files in this directory.
Using the Windows Explorer, you can copy the files
Furhigh.tif, Furlow.tif, Wthigh.tif, and Wtlow.tif from
the EXAMPLES\2D subfolder on the CD-ROM directly
to the folder C:\Program Files\ BioNumerics\ Data\
Demo2D\ Gel2d.
You can also select File > Add new experiment file in the
BioNumerics main menu, or right click in the Files panel.
Moreover, you can click the files button and
select the image files you want to process in this 2D Gel
Type.
In the latter case, BioNumerics always makes a copy of
the original TIFF file in the C:\Program Files\
BioNumerics\ Data\ Demo2D\ Gel2d subdirectory.
23.7.1 Using one of the above methods, import the files
Furhigh.tif, Furlow.tif, Wthigh.tif, Wtlow.tif from the
directory EXAMPLES\2D on the CD-ROM.
The gel files Furhigh.tif, Furlow.tif, Wthigh.tif and Wtlow.tif are
now available in the Files panel. The files are marked
with a red N, which means that they have not been
edited or normalized yet.
NOTE: Experiment files added to the Files panel can
also be deleted by selecting the file and choosing File >
Delete experiment file from the main menu or by
pressing the corresponding button in the Files panel. Deleted
experiment files are struck through with a red line, but are
not actually deleted until you exit the program. As long
as you haven't closed the program, you can undo the
deletion of the file by selecting File > Delete
experiment file again.
23.8 Processing 2D gel images
The gel analysis work flow of BioNumerics 2D is
arranged in a number of consecutive steps. These steps
will guide you through the process of spot detection,
OD calibration, normalization, quantification, and
finally to the storage of all spot information in the
database. The information is then available for the
matching of multiple gels and for quantitative
comparison or analyses with specific expression analysis
tools.
Before processing can begin, a TIFF file of a 2D gel
experiment usually needs to be ‘cleaned’ so that it can be
used for spot detection and quantification. Basic image
file editing and cleaning can be done in any image
processing package. However, the treatment of a 2D gel
image can involve some very specific routines such as
background removal, spike removal, streak removal and
filtering that are not readily available in traditional
image processing software. BioNumerics 2D is equipped
with a number of useful tools to perform various
‘cleaning’ activities on gel images and provides
algorithms with user-adjustable parameters to deal with
these important corrections. These algorithms include
Filtering (median, Gaussian), 2D background
subtraction (rolling ball principle), Streak removal
(horizontal and vertical), and Spike removal.
The consecutive steps in processing a 2D gel with
BioNumerics 2D are as follows:
1. Spot detection will find spots on the gel and quantify
them by fitting a 2D Gaussian distribution to the spot.
The result is a spot location (determined by the spot’s
mass centre), a spot size (average size in the X and the
Y direction), a spot maximum, and a spot volume.
Since overlapping spots are commonly found in 2D
gels, BioNumerics 2D will detect these automatically
and will propose the best possible separation. The
spot search algorithm contains a number of
parameters that can be varied and therefore it has
been equipped with a very useful preview window.
Convenient editing tools allow for further manual
correction, such as merging or splitting spots, adding,
deleting or redrawing spots.
2. Calibration is an optional process that generates a
calibration curve expressing the relationship between
densitometric values on the scanned image file and
real OD value. An image is usually calibrated by
applying OD calibration strips delivered with the
scanning device. These strips are processed along
with each gel and will compensate for variation
observed between different scans. After calibration,
spot volumes are also shown as relative volumes. These
relative volumes can be recalculated into absolute
quantities in step 4 (Assignment of Metrics).
3. Normalization (gel alignment). The third step of the
gel processing routine allows the mapping of a gel to a
reference system. This reference system can be seen as
an artificial gel to which the others are aligned.
Normally, it is constructed on the basis of a real
experiment, but it can be gradually extended (i.e.
more spots are added) and modified as more gels are
being analyzed and compared with that reference
system. A number of tools are available that will allow
spots on the gel to be matched with the corresponding
spots on the reference system. Based on a number of
easily recognizable homologous spots (landmarks),
BioNumerics 2D will align the gel to the reference
system and will allow all the spots of the gel to be
linked to corresponding spots on the reference system,
within a user-adjustable position tolerance.
4. Assignment of Metrics. This step has two different
purposes. First, the mobility properties of each spot
will be calculated in both dimensions using a
regression. To establish the regression, marker or
reference proteins or other easily recognizable
physical points (incisions, dots, scale indications,
colored molecular weight markers for blots, etc.) can
be used, which correlate with positions of known
molecular weight or pI values. BioNumerics 2D can
use linear or exponential fitting algorithms of the first to
fifth degree, with or without logarithmic dependence.
The 2-dimensional mobility grid thus appearing can
be rotated by the user to correct for artifacts like
non-horizontal shots or mobility inclination. The assigned
mobility properties (metrics) will assist in comparing
a specific spot (with an unknown protein) to known
proteins with known molecular weights and pI
values. Secondly, it is possible to enter concentration
values for spots containing fixed quantities of protein
(expressed in ng or µg). The values of these known
concentration marker spots are used to establish a
regression that calculates spot concentrations for all
spots, based on their calibrated volumes. In this way
any calibrated volume in the gel can be read in
‘amount of protein’, e.g. expressed in µg or ng.
5. Database construction. Identified spots can be
provided with descriptive information and stored in
the database. The descriptive information can be
obtained from existing databases (e.g. 2DPage of
SwissProt) by using the accession number or spot
reference number. Alternatively, the user can build their
own protein reference database by entering specific
information fields. A total of 8 fields of unlimited
length can be selected. Within the frame of the
BioNumerics Platform it is possible to link the 2D gel
information to other experimental data in the database
and to perform multi-experimental comparisons and
data mining (e.g. comparison with micro-array data).
The protein spot query tool allows the selection of
specific proteins from many 2D gels belonging to the
same or different 2D experiment types.
6. Matching 2D gel spots. The 2D gel spot information
fields are accessible for searches in order to retrieve
subsets of spots. By using the advanced query tool of
the BioNumerics 2D software on a selection of several
gels, it is possible to create subsets of spots that can be
analyzed by a variety of comparison tools. The result
of a query, covering several gels, can be displayed as a
histogram and can be analyzed statistically. In
combination with the GeneMaths software, it is
possible to evaluate protein expression profiles, study
time course relationships, etc… on the selected set of
proteins.
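The quantities reported in step 1 (spot location as mass centre, spot size in X and Y, spot maximum and spot volume) can be illustrated with a small moment-based sketch. Note the assumption: BioNumerics 2D is described as fitting a 2D Gaussian to each spot, whereas this example only computes intensity-weighted moments on a hypothetical, background-free patch. The function name and the patch values are invented for illustration.

```python
import math


def spot_statistics(patch):
    """Summarize a spot from a 2D list of darkness values (background ~ 0)."""
    volume = sum(sum(row) for row in patch)
    if volume == 0:
        raise ValueError("patch contains no signal")
    # Intensity-weighted mean position = mass centre of the spot.
    cx = sum(x * v for row in patch for x, v in enumerate(row)) / volume
    cy = sum(y * v for y, row in enumerate(patch) for v in row) / volume
    # Intensity-weighted variances give a size estimate per direction.
    var_x = sum((x - cx) ** 2 * v
                for row in patch for x, v in enumerate(row)) / volume
    var_y = sum((y - cy) ** 2 * v
                for y, row in enumerate(patch) for v in row) / volume
    return {
        "centre": (cx, cy),
        "sigma_x": math.sqrt(var_x),
        "sigma_y": math.sqrt(var_y),
        "maximum": max(max(row) for row in patch),
        "volume": volume,
    }


# Hypothetical 3x3 patch holding one symmetric spot.
patch = [[0, 1, 0],
         [1, 4, 1],
         [0, 1, 0]]
stats = spot_statistics(patch)
```

For a well-separated, roughly Gaussian spot the moments are close to the parameters a Gaussian fit would return; for overlapping spots they are not, which is one reason a fitting approach is used in practice.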
We will now start analyzing the first 2D gel within the
newly created database Demo2D.
23.8.1 Double-click on Wtlow in the Files panel, or select
File > Open experiment file (entries) from the Main
window, or click on the file and press the corresponding
button in the Files panel toolbar.
23.8.2 Since the gel is new (not processed), BioNumerics
2D doesn’t know what 2D Gel Type it belongs to.
Therefore, a list box is first shown, listing all available
2D Gel Types. Select the 2D Gel Type ‘Fur’ and press
<OK>. By clicking <Create New> you can create a new
2D gel experiment type that fits the gel file.
NOTE: The same gel cannot be used in two different 2D
Gel Types.
23.8.3 The gel file is loaded. Depending on the size of the
image this may take some time. The 2D gel file window
appears (Figure 23-1.), showing the image of the gel.
As explained above, processing a 2D gel is a multistep
process. The current step of 2D gel processing is shown
in the upper left corner of the window. Initially, the
window displays "1. Spot detection".
23.8.4 With Edit > Zoom in (+) or Edit > Zoom out (-) the
2D gel image can be sized to fit the screen. Throughout
the gel analysis procedure these tools can also be activated
with the respective zoom buttons or the + and -
keys.
NOTE: Manual tools can be used with greater precision
on an enlarged image. The zoom factor can be read
from the status bar, which also lists the file type, image size
and OD depth. E.g., TIFF: 1110x804x8 (x1.00) means
1110 pixels horizontally by 804 pixels
vertically and an OD depth of 8 bits per pixel. The
zoom factor is 1.00.
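The notation in the example above can be unpacked mechanically. The sketch below parses a string of that shape; it assumes the text always follows the pattern shown in the NOTE (type, width x height x bits, zoom in parentheses), which is inferred from the single example and not from any documented format. The function name is hypothetical.

```python
import re


def parse_status(text):
    """Parse a string such as 'TIFF: 1110x804x8 (x1.00)'."""
    match = re.fullmatch(
        r"\s*(\w+):\s*(\d+)x(\d+)x(\d+)\s*\(x([\d.]+)\)\s*", text)
    if match is None:
        raise ValueError(f"unrecognized status text: {text!r}")
    file_type, width, height, bits, zoom = match.groups()
    return {
        "file_type": file_type,
        "width": int(width),            # pixels horizontally
        "height": int(height),          # pixels vertically
        "bits_per_pixel": int(bits),    # OD depth
        "gray_levels": 2 ** int(bits),  # 8 bits -> 256 levels
        "zoom": float(zoom),
    }


info = parse_status("TIFF: 1110x804x8 (x1.00)")
```

The derived number of gray levels (256 for 8 bits, 65536 for 16 bits) is the same quantity used for the OD range in the gel settings.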
A powerful tool to edit the appearance of the image is
the Tone curve editor. While the Image brightness and
contrast settings act only at the screen (monitor) level,
the Tone curve editor acts at the original TIFF file
information. Although the original TIFF file is never
physically changed, the settings that apply to it will be
modified by its proper Tone curve, which is saved along
with each particular gel. In case a 2D gel image was
scanned in 16-bit mode, the tone curve settings are
applied to the full 16-bit (65536) gray scale information,
allowing much more information to be revealed in areas
with lower contrast.
23.8.5 In the 2D gel file window menu, select Edit > Edit
tone curve or press the tone curve button. The tone curve editor appears
as in Figure 23-2. The upper panel is a distribution plot
of the densitometric values in the TIFF file over the
available range. The right two windows represent a part
of the 2D image Before correction (upper) and After
correction (lower).
23.8.6 You can scroll through the preview images by
left-clicking and moving the mouse while keeping the
mouse button pressed.
23.8.7 Select a part of the preview images which contains
both very weak and dark bands.
On the left, there are two buttons <Linear> and
<Logarithmic>. Both functions introduce a number of
distortion points on the tone curve, and reposition the
tone curve so that it begins at the grayscale level where
the first densitometric values are found, and ends at its
maximum where the darkest densitometric values are
found. This is a simple optimization function that
rescales the used grayscale interval optimally within the
available display range. The difference between linear
and logarithmic is whether a linear or a logarithmic
curve is used.
23.8.8 In case of 8-bit gels, a linear curve is the best
starting point, so press <Linear>. The interval is now
optimized between minimum and maximum available
values, and the preview After correction looks a little bit
brighter.
There are six other buttons that are more or less
self-explanatory: <Decrease zero level> and <Increase zero
level> decrease and increase the starting point of
the curve, respectively.
<Enhance weak bands> and <Enhance dark bands> are
also complementary to each other, the first making the
curve more logarithmic so that more contrast is revealed
in the left part of the curve (bright area), and the second
making the curve more exponential so that more
contrast is revealed in the right part of the curve (dark
area).
<Reduce contrast> and <Increase contrast> make the
curve more sigmoid so that the total contrast of the
image is reduced or enhanced, respectively.
Figure 23-1. The 2D gel file window. Step 1: Spot detection.
Figure 23-2. Gel image tone curve editor.
23.8.9 For the image loaded, pressing <Enhance weak
bands> three times gives a picture with more
contrast.
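The effect of the <Linear> and <Logarithmic> functions can be sketched as a rescaling of the used gray interval onto the full display range. The code below is a simplified illustration under stated assumptions: the real editor works with adjustable distortion points on a curve, whereas these hypothetical functions apply one closed-form mapping to a short list of sample values.

```python
import math


def linear_curve(values, levels=256):
    """Map the used gray interval [min, max] linearly onto 0..levels-1."""
    low, high = min(values), max(values)
    if high == low:
        return [0 for _ in values]
    scale = (levels - 1) / (high - low)
    return [round((v - low) * scale) for v in values]


def logarithmic_curve(values, levels=256):
    """Same interval, but with a logarithmic curve that lifts low values."""
    low, high = min(values), max(values)
    if high == low:
        return [0 for _ in values]
    scale = (levels - 1) / math.log1p(high - low)
    return [round(math.log1p(v - low) * scale) for v in values]


# Hypothetical densitometric values that use only part of the 8-bit range.
used = [40, 60, 100, 200]
linear = linear_curve(used)
logarithmic = logarithmic_curve(used)
```

Both mappings send the lowest used value to 0 and the highest to 255; the logarithmic one spreads the weak (low) values further apart, which is the same idea as <Enhance weak bands>.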
In standard mode, the gel is displayed as a continuum
of gray levels. The Rainbow palettes and Contour
palettes, which consist of multiple color transitions, can
reveal more visual information in areas of poor contrast
(weak and oversaturated areas). In BioNumerics 2D, 6
different palettes have been pre-defined besides the gray
level representation. In the Contour palettes, each range
of the five colors is bordered by a dark transition, which
is useful to delineate the contours on the 2D gel image.
Contour Palette (I) has 5 and Contour Palette (II) 9
different discrete color ranges while Contour Palette
(III) has five colors characterized by a discontinuous
transition. The different palette views are useful
for the evaluation of slight intensity gradients that are
invisible in grayscale, e.g. to assess the efficacy of the
background removal settings used or to judge
double/single spots.
23.8.10 If you press <OK>, the tone curve is saved along
with the gel.
Another interesting viewing mode is the embossed view.
23.8.11 Select Edit > Edit tone curve or press the tone curve
button again.
23.8.12 Click the <Embossed view> check box and press
<OK>.
The use of the embossed option (Figure 23-3.) adds a
third dimension to the display. The gray levels are
transformed to a shaded 3-D shape that enhances the
distinction between higher and lower intensities, for
example, to separate spots in high-intensity areas which
look uniformly black in normal grayscale mode.
Figure 23-3. The Embossed view option from the
Gel tone curve editor.
NOTE: (1) Embossed view cannot be shown in the
preview panel in the tone curve editor. (2) The embossed
view effect is largely lost when a strong zoom factor is
used (>x4.00).
23.8.13 Call the tone curve editor again, uncheck the
embossed view, and press <OK>.
23.8.14 To save the work done at any stage of the
process, you can select File > Save, press the <F2> key or
the save button. It is recommended to save the gel at
regular times.
A last option that will improve the interpretation and
visualization of the 2D gels is the 3-D viewing mode.
23.8.15 Zoom in on an area of the gel with many
overlapping spots of high intensity, using the zoom in
button or Edit > Zoom in (+).
23.8.16 Display the 3-D view window using the 3-D view
button or by File > View 3D image.
This option opens a new window that contains a
scalable three dimensional view of the image. The Z-axis
is used to display the pixel intensity of each individual
point in the gel. The 3-D view is particularly suited to
evaluate and judge individual spots for possible
overlap, presence of spikes or visualisation of noise
(Figure 23-4.). Therefore, the view can be used to judge
the effect of background removal, spike removal or
streak removal (see below 23.8.22). To that end it is
possible to keep several 3-D representation windows
open at the same time, allowing a side by side
comparison for the study of the effect of specific actions
on the 2D gel.
23.8.17 Select View > Show spot outlines to plot the spot
contours on the 3-D image.
23.8.18 By using the Left, Right, Up and Down arrow
keys on the keyboard, the position of the image can be
manipulated in all directions. The image can also be
rotated horizontally and vertically by dragging the
image left/right or up/down using the mouse.
23.8.19 Use the PgUp and PgDn keys to zoom in or out
of the image.
23.8.20 The Insert and Delete keys can be used to raise
or lower the peaks, by resizing the Z-axis.
NOTE: In contrast to the Embossed view, the 3-D view
is best used on an image with strong zoom. The zoomed
area will selectively be displayed and can be viewed
from all sides.
23.8.21 Close the 3-D view window with File > Exit.
23.8.22 Edit the general settings of the 2D gel file window
with Edit > Settings or the settings button.
In this window the Image and Metrics settings can be
defined (Figure 23-5.). We will discuss the Metrics
settings later (see paragraph 23.12).
•Image coloring. With Inverted values, gels with bright
spots on a dark background can be inverted to dark
spots on a bright background. The OD range of the gel
can be specified in number of grayscale levels (256 = 8
bit; 1024 = 10 bit; 4096 = 12 bit; 65536 = 16 bit). As
explained in 23.6, the initial settings related to the
image color display (on a monitor or screen), and
defined during the setting-up of the 2D Gel Type, can
still be changed. The RGB (red-green-blue)
contributions can be changed for the Background
color and the Foreground color. Using this tool it is
possible to mimic e.g. Coomassie blue or silver stain.
•Background subtraction is based on the “rolling ball”
principle, i.e. a ball of a certain size is rolled against
the inner side of the 3-D surface of the gel image.
The depths the ball could enter are removed from the
image. The size of the ball, in pixels, can be entered.
The larger the size of the ball, the less background will
be subtracted, but the faster the calculations will be.
Background removal may be very effective in
removing smear that makes spot quantification more
difficult. Removing too much background holds a
potential danger of removing or excavating large
protein spots from the gel.
•Spike removal is a filtering technique with a
mechanism similar to the rolling ball. A very small ball size is
taken, so that the ball can enter into all regular protein
spots, but not into spikes and noise caused by dust,
scratches etc. Those depths the ball could not enter
into are removed. The size of the ball can be entered in
pixels. The size should be chosen very small, usually
less than 4 pixels.
•Two types of Filtering have been implemented in
BioNumerics 2D to smooth the image: the Median
and the Gaussian filtering. The Median filter is a method
which reduces irregularities that constitute less than
50% of the number of values to average. The Median
filter is therefore very efficient in removing noise and
isolated spikes. The Gaussian filter can be used to
filter out more continuous noise. Gaussian filtering
will remove the continuous noise rather than the
accidental noise spikes. When using Gaussian
filtering on the latter, the spike intensity may be
reduced but not eliminated. The Radius of the filter
can be entered in pixels.
•Streak removal uses a mechanism similar to the rolling
ball, but an ellipse is used instead, in order to separate
streaks from spots. The streak removal algorithm can
look in a Horizontal and Vertical direction for the
presence of continued smear of protein. The Static
(length of the zone) to be considered as smear can be
entered in pixels.
Figure 23-4. 3-D view of a zoomed area of a 2D gel.
Figure 23-5. 2D gel settings dialog box.
Figure 23-6. Automatic spot search dialog box.
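The rolling ball idea behind background subtraction can be approximated on a one-dimensional darkness profile by a grey-scale opening (a sliding minimum followed by a sliding maximum). This is a sketch under explicit assumptions: a flat window of hypothetical `radius` stands in for the ball, the profile values are invented, and the actual BioNumerics algorithm works on the 2D image and may differ in detail.

```python
def sliding(values, radius, func):
    """Apply func (min or max) over a window of +/- radius around each index."""
    n = len(values)
    return [func(values[max(0, i - radius):min(n, i + radius + 1)])
            for i in range(n)]


def subtract_background(profile, radius):
    """Estimate background by grey opening (min, then max) and subtract it."""
    eroded = sliding(profile, radius, min)
    background = sliding(eroded, radius, max)
    return [value - base for value, base in zip(profile, background)]


# Hypothetical darkness profile: one spot (peak) on top of a sloping smear.
profile = [10, 11, 12, 40, 60, 41, 15, 16, 17, 18]
small = subtract_background(profile, radius=1)  # aggressive removal
large = subtract_background(profile, radius=3)  # gentler removal
```

With the small window the estimated background climbs into the spot itself and the peak height drops from 60 to 20, which illustrates the "excavating" risk mentioned above; with the larger window the peak keeps a height of 45 and only the slowly varying smear is removed.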
NOTES:
(1) All the above filtering algorithms will not change
the TIFF files permanently but will have an influence
on the 2D gel representation and on the spot detection
and quantification algorithms. Since the original TIFF
files will not be changed, the settings applied to a
specific gel can be modified at any step at any time.
(2) Since these settings will have a considerable impact
on the spot detection and quantification procedures, we
recommend using these options with care. The spike
removal and streak removal algorithms in particular
should be applied with awareness of their effect
on all spots. These algorithms inevitably
cause some distortion on the protein spots as well. The
smaller the level of the spike / streak removal, the less
the distortion.
23.8.23 For the gel Wtlow select median filtering with
averaging 3 (23.8.22), background subtraction of 30
pixels, horizontal streak removal of 25 pixels, and spike
removal of 3 pixels.
23.8.24 Press <OK> to save the settings.
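To see why a median filter of size 3 removes isolated spikes, consider the following minimal sketch. It assumes a plain 3 x 3 neighbourhood that is clipped at the image border; the function name and the sample image are hypothetical, and the exact window shape used by BioNumerics is not stated in this manual.

```python
from statistics import median


def median_filter(image, size=3):
    """Replace each pixel by the median of its size x size neighbourhood."""
    half = size // 2
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the neighbourhood, clipped at the image border.
            window = [image[j][i]
                      for j in range(max(0, y - half), min(height, y + half + 1))
                      for i in range(max(0, x - half), min(width, x + half + 1))]
            row.append(median(window))
        result.append(row)
    return result


# Hypothetical image: uniform background with a single dust spike.
noisy = [[10, 10, 10, 10],
         [10, 255, 10, 10],
         [10, 10, 10, 10]]
filtered = median_filter(noisy)
```

The spike never makes up half of any window, so every output pixel equals the background value. A Gaussian filter applied to the same image would only spread the spike out and lower its height.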
NOTE: It may be useful to redefine the Tone curve after
applying the image enhancement settings.
NOTE: The above settings can be stored for global use
by the function Edit > Save as default settings. The
default settings can be reloaded at any time by Edit >
Load default settings.
23.8.25 With the command File > Print image or File >
Copy image to clipboard, you can print the unprocessed
2D gel, or copy it to the clipboard.
23.9 Step 1: Spot detection
Spot detection is an important feature in the creation of a
protein database. It takes some experience to include /
exclude specific spots from a variety of gels in a
consistent way.
Besides the possibility to pre-optimize an image for
better spot detection (see 23.8),
BioNumerics 2D provides a preview-based automatic
assignment of spots. The normal procedure is to allow
the software to assign spots automatically, after
adjusting the parameters using the preview window,
and then to inspect the assigned spots and correct
them manually where necessary.
23.9.1 By selecting Spots > Automatic search or by
pressing the automatic search button, an interactive Automatic
spot search window is opened (Figure 23-6.).
The window allows you to see the influence of any
parameter you change on a selected area of the
underlying gel image. The area of the gel can be
changed by dragging the window with the mouse to any
desired position of the gel. The yellow positioning frame
is the unit size which covers 10 pixels.
23.9.2 By choosing <Update preview>, all spots in the
selected zone will be detected, using the current
conditions.
The spots detected are indicated by red circular
indications in the spot preview window (Figure 23-6.).
After changing the target position of the 2D gel you will
have to update the detection by clicking the <Update
preview> button again. Amongst the spot detection
parameters that can be adjusted, are the Estimated spot
size, the Minimum spot size, the Minimum profiling, the
Spot contrast enhancement, the Conglomerate spot
separation, and the Shape/Darkness sensitivity.
The Estimated spot size assists the search algorithm in
finding spots of approximately the size specified by the
user. Depending on the resolution and the size of the
gel, there may be considerable differences in the pixel
size of a single spot. By estimating the average spot size
(or diameter of a virtual circular spot) the software can
start to screen the 2D image for individual spots. The
default value is set at 25.
The Minimum spot size is an additional help for the
algorithm to discriminate spikes from real spots and to
optimize the search algorithm. Spots which have a size
below the indicated minimum spot size will not be
considered by the algorithm. The default value is set at
3.
The Minimum profiling is the elevation of the spot
compared to the highest 2% intensity found on the gel.
The higher the value is set, the darker a spot should be
before it will be found. The default value is 30.
Spot contrast enhancement is an algorithm to reduce the
spot surface relative to the intensity of the spot. This
means that dark spots will be clipped at higher gray
levels than weak spots. The algorithm also has an
influence on the final number of spots found: when
small spots are clipped at higher or lower grays, they
may fall above or below the minimum spot size. Since
the algorithm is applied before the Conglomerate spot
separation, a large Spot contrast enhancement will cause
more small spots to be found around high intensity
spots or areas. The slider bar, in combination with the
preview window, will allow you to quickly evaluate the
most suitable Spot contrast enhancement setting for each
gel.
NOTE: Spot contrast enhancement settings may vary
according to the gel image processing parameters that
have been used. Changing the background subtraction
level may have an impact on the optimal Spot contrast
enhancement settings.
Increasing the Conglomerate spot separation factor will
force the algorithm to pay more attention to the
detection of multiple spots in a core that initially has
been recognized as a single spot. In Figure 23-4. the
frequent overlap of not completely separated protein
spots is illustrated. BioNumerics 2D can be forced to
explore each spot for the presence of subtops which
could be the core of an incompletely resolved protein
peak (increase conglomerate spot separation) or can be
instructed to consider any continued elevation clearly
separated from the background as a single spot
(decrease conglomerate spot separation).
As an additional parameter in this process you can
indicate whether the decision for splitting conglomerate
spots will be based on an evaluation of the basic Shape
(constriction sensitivity), or on the Darkness of the
conglomerate (depression sensitivity). In the latter case,
the presence in the spot surface of multiple cores or
subtops, separated by a valley, will drive the splitting
process. In case shape is selected as a major criterion for
splitting, irregularities in the contour (e.g. a spot with a
typical 8-shape) will trigger the splitting process. The
slider allows you to modify the importance of either
criterion by changing the ratio Shape / Darkness.
23.9.3 Choose the following parameter settings for the
gel Wtlow and press <OK>:
•Estimated spot size: 25 pixels
•Minimum spot size: 5 pixels
•Minimum profiling: 15 %
•Spot contrast enhancement: 70%
•Conglomerate spot separation: 80%
•Shape/darkness: center
Assigned spots are now contoured by a green
borderline, and semitransparently colored in green.
Selected spots are colored in red.
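The interplay of a darkness threshold and a minimum spot size can be illustrated with a toy detector. This is only a sketch under stated assumptions: it treats the threshold `min_darkness` as an absolute pixel value (the manual defines Minimum profiling relative to the darkest 2% of the gel), counts size as a number of pixels, and uses simple 4-connected region labeling instead of the actual search algorithm. All names and values are hypothetical.

```python
def find_spots(image, min_darkness, min_size):
    """Label 4-connected regions at or above min_darkness; drop small ones."""
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    spots = []
    for y in range(height):
        for x in range(width):
            if seen[y][x] or image[y][x] < min_darkness:
                continue
            # Flood-fill one region starting from this pixel.
            stack, pixels = [(y, x)], []
            seen[y][x] = True
            while stack:
                cy, cx = stack.pop()
                pixels.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx]
                            and image[ny][nx] >= min_darkness):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            if len(pixels) >= min_size:
                spots.append(sorted(pixels))
    return spots


# Hypothetical darkness image: one 2x2 spot and one single-pixel spike.
image = [[0, 0, 0, 0, 0],
         [0, 9, 9, 0, 0],
         [0, 9, 9, 0, 8],
         [0, 0, 0, 0, 0]]
spots = find_spots(image, min_darkness=5, min_size=3)
```

With `min_size=3` only the 2x2 region is kept; lowering it to 1 also reports the isolated pixel, which is the kind of spike the Minimum spot size parameter is meant to exclude.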
23.9.4 With Edit > Spot info or the spot info button you can display a
small pop-up window that shows information about the
selected spot (Figure 23-7.).
Figure 23-7. Spot information pop-up window.
23.9.5 The pop-up window can be moved by dragging
the mouse anywhere inside the window, and always
remains in front of the 2D gel editor.
23.9.6 You can close the spot information window by
clicking in the upper left triangular button.
NOTE: It is also possible to display a label for each spot
in the 2D gel file window, using Edit > Label spot
with, which offers the choice between a number of
information fields that can be assigned to a spot (see
24.1). However, such a label can only be displayed if the
gel is fully processed according to the reference system.
This option can be useful too, if already processed gels
are re-edited.
After the automatic spot search, some spots may still
remain undetected, while others may be indicating
small background elevations. Also, some conglomerates
may have been found which are clearly composed of
multiple spots that have not been identified separately
with the current settings. A number of manual spot
editing tools are available to add, remove, separate,
merge, and redraw spots.
23.9.7 Add spot tool. Click the add spot button or press
SHIFT+F2 to change your cursor into a spot adding tool.
It is sufficient to click in the center of an unassigned spot
to assign an additional spot. The software will find the
correct shape of the newly created spot automatically.
23.9.8 At any time you can return to the pointer status of
your cursor by clicking the button again, or pressing
SHIFT+F1.
23.9.9 Remove spot tool. Click on the remove spot button or
press SHIFT+F3, to change the cursor into the spot
removing tool. Simply click on an assigned spot to
delete it.
23.9.10 Selected spots can also be deleted using the DEL
key or with Spots > Delete selected spots.
23.9.11 Groups of spots can be selected at once by
dragging the mouse in pointer mode over the
area to select. These spots can be deleted at once by
pressing the DEL key.
NOTE: The 2D gel file window has a multi-level
undo function that will allow you to undo a large
number of previously performed manipulations. This
undo option therefore enables you to evaluate safely a
number of processing steps on your 2D gel. In case you
are not satisfied with the result of your last
modifications you can get back to the status that had
your last approval by consecutive use of the ‘undo’
function. Press the undo button or the redo
button, or use the commands Edit > Undo last
action (CTRL + Z) or Edit > Redo last action
(CTRL + Y) from the menu. The number of steps you
can undo, however, is not unlimited. It is therefore
advised to save your approved work regularly. Caution:
saving your data will erase the ‘undo’ memory.
23.9.12 Drawing tool (add pixels). Press the drawing button or
SHIFT+F4 to turn the mouse pointer tool into a drawing
pencil.
The tool is always linked to a specific pen size as
displayed by the pen size tool buttons. The
selected pen size is indicated by a green flag.
23.9.13 You can use the pencil to mark new spots on the
2D gel or to extend the contours of an existing spot.
NOTE: It is advised to use the drawing pencil in
combination with the zoom buttons (+ and - keys)
(23.8.4). Using the zoom function, the program will
automatically zoom on the selected spot.
23.9.14 Drawing tool (remove pixels). Press the erasing button or
SHIFT+F5 to turn the mouse pointer tool into a pixel
removing pencil.
23.9.15 You can use the pencil to delete (parts of)
selected spots or to separate larger spots into two or
more smaller spots.
When used in the appropriate zoom mode and with
small pen size, the tool allows you to split conglomerate
spots following a precise user-defined trace. When used
with a large pen size, the tool can be used to erase
complete spots. When used with an intermediate pen
size it can be used to delete parts of the spots that
should not be considered for quantification.
23.9.16 Split selected spot. Select a conglomerate spot
that you may want to split up into two spots.
23.9.17 Press the split button, <F7> or select Spots >
Split selected spot.
Using this function, you can force the program to
calculate the most probable trace to split a spot. When
the tool refuses to split a spot it means that no well
delineated trace has been discovered that could be used
for splitting the spot; therefore, the software considers it
as a single spot that cannot be further divided. If you
still want to split such a spot, you can use the tool
described in 23.9.14.
23.9.18 Merge selected spots. Select two spots which are
more likely to belong to one protein (e.g. the program
sometimes identifies smear as a second spot). Two spots
can be selected together by holding down the CTRL key
while selecting the second.
23.9.19 With the merge button, <F8> or with Spots >
Merge selected spots you can merge two or more spots
that have been selected in advance.
23.9.20 Once you are satisfied with the assigned spots,
you can save the work done by selecting File > Save,
pressing the <F2> key or the save button. It is
advisable to save the gel at regular times.
23.9.21 Press the next step button to move to the
next step, the Calibration step.
23.10 Calibration
Calibration is an essential step when different gels need
to be compared quantitatively. Calibration will improve
the inter-gel comparability by correcting for intensity
differences between scanned gels due to differences in
digitizing of the gel. During the process of calibration,
the relation between the original OD (or counts of
radiation) and the intensities on the TIFF file image can
be defined. Through calibration, each gray level on an
image TIFF file can be assigned a calibrated value on the
basis of a non-linear regression curve. Calibration can
therefore be used to compensate for non-linearity of
scanners in the high OD range, or a non-linear response
of radiographic film to exposing radiation. To that
purpose it is possible to link specific known OD (or
radiation) levels to an area of raw pixel values of the
TIFF file image.
The most obvious way to define this calibration curve is
in combination with a scanner or CCD camera, by the
use of calibration strips that are applied on, or next to,
the gel. These strips represent well-known physical
properties, e.g. OD values. The calibration zones in the
strips can be defined by the user, for which the physical
value can be entered. After calculating the non-linear
calibration curve, every pixel on the 2D gel image can be
translated into a new calibrated value with some
physical property. For quantification purposes, it is
recommended to compare calibrated gels only with
calibrated gels.
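The mapping from raw gray level to calibrated value can be sketched as follows. Note the assumptions: the manual describes a non-linear regression curve, whereas this example uses plain piecewise-linear interpolation between calibration points as a stand-in. The gray levels and the value 2.0 are hypothetical, and each calibration rectangle is assumed to have been reduced to one distinct mean gray level beforehand.

```python
def calibrate(gray, points):
    """Piecewise-linear calibration: gray level -> calibrated value (e.g. OD)."""
    points = sorted(points)  # requires distinct gray levels
    if gray <= points[0][0]:
        return points[0][1]
    for (g0, v0), (g1, v1) in zip(points, points[1:]):
        if gray <= g1:
            # Interpolate between the two neighbouring calibration points.
            return v0 + (v1 - v0) * (gray - g0) / (g1 - g0)
    return points[-1][1]  # clamp above the darkest calibration point


# Hypothetical (mean gray level, entered calibration value) pairs,
# one pair per calibration rectangle.
rectangles = [(12, 0.0), (90, 0.5), (240, 2.0)]
od = calibrate(165, rectangles)
```

Values outside the calibrated range are clamped here rather than extrapolated; a fitted regression curve would behave differently near the ends, so calibration points should cover the full intensity range of interest.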
IMPORTANT NOTE: Calibration is always performed
on the raw, unprocessed image file. In order to have a
realistic view on the calibration strips on the scanned
image, you may need to switch off the background
subtraction, as well as other filters such as streak
removal.
Since the present gel has no calibration strips applied,
we will perform a fictitious calibration using different
intensity areas on the raw gel image.
23.10.1 First, switch off the background subtraction (Edit
> Settings and uncheck Background subtraction).
23.10.2 Calibration rectangles can be defined using the
Add new calibration rectangle tool in the toolbar.
23.10.3 With the calibration rectangle tool cursor
selected, draw a small rectangle in the brightest area of
the gel (bottom left area).
23.10.4 A dialog box pops up, prompting to enter a
calibration value for the defined rectangle. Enter zero
(0).
23.10.5 Next, select a very dark spot, e.g. the lowest spot
in the left molecular weight lane, and draw a small
rectangle in the center of that spot. Enter 3.0.
23.10.6 Lastly, draw a rectangle in the upper center
background part of the image, and enter 0.5.
The calibration rectangles are indicated as blue
rectangles, with a node in the upper left corner and in
the bottom right corner. They can still be modified after
selecting the pointer tool.
NOTE: the Undo function doesn't work in this step of
2D gel processing.
23.10.7 If you click inside a rectangle, it becomes selected
(pink color).
23.10.8 To move a calibration rectangle, drag it to a new
place using the upper left node.
23.10.9 To resize a calibration rectangle, drag the bottom
right node to obtain the desired shape.
23.10.10 To delete a calibration rectangle, select it and
press the DEL key.
23.10.11 If you want to change the calibration value of a
rectangle, double click on the rectangle or select it and
use the option Calibration > Change calibration value.
23.10.12 When all values have been entered you can
calculate the calibration curve by clicking the <Edit
calibration curve button> or by selecting
Calibration > Image calibration.
This will bring the ‘Image calibration window’ on the
screen as displayed in Figure 23-8.
Figure 23-8. The Image calibration window after the
definition of 3 calibration values.
23.10.13 To calculate the calibration curve click the
‘Interpolation’ checkbox and select one of the five fits:
Cubic spline and Polynomial (1) to (5) (the number is
the degree of the polynomial fit).
23.10.14 Select the Cubic spline fit for this calibration
curve.
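The idea behind the calibration curve can be illustrated outside the program. The following Python sketch is not BioNumerics code: it assumes that each calibration rectangle is reduced to one mean raw gray value, fits a polynomial of the chosen degree through the (gray value, calibration value) pairs with NumPy, and optionally clips the result to the lowest and highest calibration values entered, in the spirit of the Clip values at extreme points option. The gray values 240, 200 and 20 and all function names are invented for illustration; only the calibration values 0, 0.5 and 3.0 come from the exercise.

```python
import numpy as np

def fit_calibration(gray_values, calibrated_values, degree=2):
    """Fit a polynomial mapping raw gray level -> calibrated value (e.g. OD)."""
    return np.polyfit(gray_values, calibrated_values, degree)

def apply_calibration(image, coeffs, clip_range=None):
    """Translate every pixel into a calibrated value, optionally clipped."""
    calibrated = np.polyval(coeffs, np.asarray(image, dtype=float))
    if clip_range is not None:
        calibrated = np.clip(calibrated, clip_range[0], clip_range[1])
    return calibrated

# Hypothetical mean gray values of the three calibration rectangles
gray = np.array([240.0, 200.0, 20.0])
# Calibration values entered in steps 23.10.4 to 23.10.6
od = np.array([0.0, 0.5, 3.0])

coeffs = fit_calibration(gray, od, degree=2)
image = np.array([[240, 200], [20, 255]])
calibrated = apply_calibration(image, coeffs, clip_range=(od.min(), od.max()))
```

With three calibration points, a polynomial of degree 2 passes exactly through all of them. A cubic spline, as selected in 23.10.14, would need a spline routine instead of `np.polyfit`.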
Other applications of the calibration curve include:
•When checking the option Clip values at extreme
points in the Image calibration window (Figure 23-8.), all
values higher or lower than the respective highest and
lowest calibration value entered will be clipped to
these respective values. This can be useful if the
dynamic range of the film or the scan is not reliable in
these higher or lower ranges.
•In case not all gels that need to be compared have a
calibration strip, you can save the present curve as the
standard calibration curve. This curve can then be
used for gels without calibration strip, still enabling
reliable quantitative comparison. The gels should be
processed and digitized in a similar way. In addition,
a comparable amount of total protein should have
been loaded. The calibration curve can be saved by
selecting the option Save as default calibration in the
Image calibration window (Figure 23-8.). This curve will
be automatically loaded when the next 2D image for
that experiment type is analyzed. You can verify the
active calibration curve at any moment by pressing
the <Edit calibration curve button> or by
selecting Calibration > Image calibration.
•When you press Remove in the Image calibration
window, the presently calculated curve is removed.
Calibration rectangles will be preserved and a new
curve can be calculated at any time.
•When clicking the button <Apply to tone curve>, the
present tone curve settings (see the tone editor
description in 23.8.5) will be overwritten and will be
replaced by the calibration curve you have defined.
The advantage of applying the calibration curve to the
tone curve is (1) that every gel will be displayed in a
similar way, improving the immediate visual
evaluation of quantitative differences on the screen,
and (2) that OD levels where the scanner provides
poor discrimination can be linearized to offer a better
visual OD depth on the screen.
23.10.15 Press <OK> in the Image calibration window to
save the calibration curve along with the gel.
After calibration, for any point of the 2D gel, the status
bar will show you the gel information as discussed in
section 23.8.4, together with the raw value before
background subtraction, the value after background
subtraction, and the calibrated value.
23.10.16 This concludes part 2, Calibration. Press the
next step button to move to the next step, the
Normalization step.
23.11 Normalization
Normalization of 2D gels is necessary to locate
homologous spots on different gels. In BioNumerics 2D,
normalization makes use of a reference system. The
reference system is a collection of reference spots with
their coordinates, mass centers and X-Y sizes. These
reference spots have a dual function: (1) They can be
linked to homologous spots on other gels, which are
then called landmarks. Once a number of landmarks have
been defined for a gel, the program can map the gel on
the reference system. Mapping of a gel is a process of
distorting the 2D image so that all linked positions on
the gel and the reference system fit each other. Every
pixel on the gel is recalculated based upon the relative
distance and the magnitude of the known displacement
vectors. The process of linking spots to reference spots
and recalculating the image is called normalization. (2)
Once a gel is normalized, the remaining (non-landmark)
spots can be matched with the corresponding reference
spots. When a spot on one gel is linked to the same
reference spot as a spot on another gel, these spots are
considered the same protein. This is the key to compare
different gels with one another.
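The manual does not state which interpolation is used to recalculate positions from the landmark displacement vectors. As a rough illustration only, the Python sketch below assumes inverse-distance weighting: each landmark contributes its displacement vector (reference position minus gel position) with a weight that decreases with distance. The function names, the `power` parameter and the coordinates are hypothetical, and the sketch maps single points rather than resampling a whole image.

```python
import numpy as np

def displacement_at(point, landmarks_gel, landmarks_ref, power=2.0):
    """Interpolate a displacement at `point` from landmark displacement vectors."""
    point = np.asarray(point, dtype=float)
    gel = np.asarray(landmarks_gel, dtype=float)
    ref = np.asarray(landmarks_ref, dtype=float)
    vectors = ref - gel                        # known displacement vectors
    dist = np.linalg.norm(gel - point, axis=1)
    if np.any(dist == 0):                      # exactly on a landmark
        return vectors[np.argmin(dist)]
    weights = 1.0 / dist ** power              # closer landmarks weigh more
    return (weights[:, None] * vectors).sum(axis=0) / weights.sum()

def map_point(point, landmarks_gel, landmarks_ref):
    """Map a gel position onto the reference system."""
    shift = displacement_at(point, landmarks_gel, landmarks_ref)
    return np.asarray(point, dtype=float) + shift

# Hypothetical landmarks: every gel spot lies 2 px left of and 1 px above
# its reference spot
landmarks_gel = [(10.0, 10.0), (90.0, 15.0), (50.0, 80.0)]
landmarks_ref = [(12.0, 11.0), (92.0, 16.0), (52.0, 81.0)]

mapped_landmark = map_point((10.0, 10.0), landmarks_gel, landmarks_ref)
mapped_other = map_point((40.0, 40.0), landmarks_gel, landmarks_ref)
```

A landmark is mapped exactly onto its reference spot, and positions in between follow the nearby landmarks.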
Usually, the reference system is initially built from a 2D
image of a representative gel by defining easily
recognizable protein spots as reference spots on the
reference system. As more gels will be matched with this
reference system, you can add additional reference spots
to the reference system. Adding new spots from
additional gels will allow new spots to be added to the
database and will have no influence on previously
normalized gels. A reference system is shown as a
synthetic gel with spots created from the mass center,
height, X-size and Y-size of the spots that were added to
the reference system.
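How a synthetic gel could be drawn from these parameters is sketched below in Python. The manual only states that the synthetic spots are created from the mass center, height, X-size and Y-size. The Gaussian spot shape, the treatment of the sizes as roughly two standard deviations, and all names are assumptions made for this illustration.

```python
import numpy as np

def render_synthetic_gel(shape, spots):
    """Draw each reference spot as a 2D Gaussian (assumed spot shape).

    spots: iterable of (center_x, center_y, height, x_size, y_size).
    """
    rows, cols = np.indices(shape)
    gel = np.zeros(shape, dtype=float)
    for cx, cy, height, x_size, y_size in spots:
        sx, sy = x_size / 2.0, y_size / 2.0    # size taken as ~2 sigma (assumption)
        gel += height * np.exp(
            -0.5 * (((cols - cx) / sx) ** 2 + ((rows - cy) / sy) ** 2)
        )
    return gel

# Two hypothetical reference spots on a 40 x 60 pixel canvas
synthetic = render_synthetic_gel(
    (40, 60),
    [(20, 10, 100.0, 8, 4), (45, 30, 50.0, 6, 6)],
)
peak = np.unravel_index(np.argmax(synthetic), synthetic.shape)
```

Because each spot is drawn from stored parameters rather than from pixels, spots taken from different gels can be combined on one canvas.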
NOTE: Normalization of the image, i.e., recalculating
the image to fit the reference system, is only performed
for easier visual evaluation. Internally, all
quantification is done on the non-distorted image.
The sequence of steps in the normalization procedure is
schematically summarized as follows:
1. Defining new reference system
↓
2. Adding reference spots to the reference system
↓
3. Defining landmarks (linking spots to reference spots
to align gel)
↓
4. Linking all gel spots to corresponding reference spots
Steps 1 and 2 are done once initially, whereas steps 3
and 4 are done for each new gel. However, new
reference spots may be added to the reference system as
new gels are analyzed (step 2).
Within the same 2D Gel Type, it is possible to create
different reference systems. This makes it possible to
merge gels with different pH ranges of the same sample
into one multiple experiment. Within a 2D Gel Type, one
reference system is the active reference system, to which
a new gel is assigned by default.
•Creating a reference system
When the normalization step is reached for the first time
in the current 2D Gel Type (with the next step button),
there is no reference system available. Therefore the
software will automatically open the ‘Assign to reference
system’ window (Figure 23-9.). This window shows all
available reference systems. Since there is no reference
system available yet, the option <Add new> will be
needed.
Figure 23-9. The Assign to reference system dialog
box.
23.11.1 We add a new reference system by clicking <Add
new>.
23.11.2 In the dialog box ‘Add new reference system’
type the name of the new reference system: Fur and click
<OK>.
The program now asks "Do you want to add all spots of
the current gel to this reference system?". If you answer
<Yes>, all the spots of the current gel will be defined as
reference spots in the new reference system. This can
save you the work of adding the spots manually
afterwards (see 23.11.8 to 23.11.11). If you plan to add
only a selected number of spots to the reference system,
choose <No>.
NOTE: If the program is allowed to add all the spots of
the current gel automatically to the reference system,
the spots of the current gel are automatically
landmarked as well (see Creating landmarks for
normalization).
23.11.3 Answer <No> to the question to add all spots of
the current gel.
23.11.4 Now the reference system appears in the ‘Assign
to reference system’ window and it becomes
automatically selected.
23.11.5 Close the window by clicking <OK>.
In the Normalization step, the window is divided in two
panels. The left panel will display the reference system,
and the right panel shows the current gel. Since the
reference system is empty as yet, nothing is shown in the
left panel, except for the contours of the spots from
the current gel (Figure 23-10.). When you select a spot in
the right panel (click on the spot), its contour in the left
panel (reference system) also becomes highlighted (red)
(see Figure 23-10.).
•Defining reference spots
Setting up a reference system for the first time will
require a number of reference spots to be defined.
Reference spots are protein spots that will allow
matching of future gels with each other. Since gel Wtlow
is the first gel we are analyzing, we will have to define
the reference spots from this gel, and thus, gel Wtlow
will automatically become the reference gel. The
reference spots present in the reference system should be
well spread and should cover all areas of the gel as
uniformly as possible. Usually, there is no drawback in
adding all spots of the gel to the reference system:
reference spots that are not linked to spots on the gel can
be left untouched.
23.11.6 Select a spot on the gel Wtlow. The spot is
highlighted in red, and in the left panel of the reference
system, the contour of the spot is highlighted as well.
Before we can add reference spots to the reference
system, we will need to view the gel in normalized mode.
The logic behind this step is that the reference system is
the basis for normalizing gels, and thus its spots should
be taken from a normalized gel.
23.11.7 Turn the gel into normalized view by selecting
Normalization > Show normalized view or pressing
the corresponding button.
23.11.8 To define the selected spot as a reference spot,
select Normalization > Add selected spot(s) to reference
system.
The reference spot now becomes visible in the reference
system as a synthetic spot, of which the shape is derived
from the mass center, the intensity, the X-size and the
Y-size of the original spot (Figure 23-11.).
23.11.9 By holding down the CTRL key you can select
many spots at once, by clicking them one by one, and
add them all together to the reference system (23.11.8).
23.11.10 Alternatively you can select all spots in a
rectangular zone by dragging a rectangle over the gel
image.
Figure 23-10. The Normalization step, initial view.
Figure 23-11. Reference spot shown on the reference
system: "synthetic" view mode.
23.11.11 Define all spots on the gel as reference spots
using commands 23.11.10 and 23.11.8 (you can first
zoom out to make selection of all spots easier).
The reference system is now shown as a synthetic gel. In
an alternative viewing mode, the reference system can
be shown as the original gel from which the spots were
derived (the reference gel).
23.11.12 Toggle between synthetic reference system and
original reference gel using Normalization > Show
synthetic reference system and Normalization > Show
reference gel.
NOTE: The purpose of a synthetic reference gel is to be
able to combine and display spots from different gels. If
additional reference spots were defined from other gels,
these spots will not be shown when you display the
original gel rather than the synthetic reference system.
•Creating landmarks for normalization
Normalization of a gel happens in two steps. In the first
step, homologous spots on the gel and the reference
system are assigned, so that the image can be corrected
until all homologous spots fall more or less together. In
the second step, the program automatically searches for
the remaining homologous spots on the gel and the
reference system, and links them.
Since the reference system is derived from the current
gel, the gel is already perfectly normalized. Hence, most
features related to this step can be skipped for the first
gel. We will discuss them when analyzing a second gel.
However, to be able to compare 2D gels between
different reference systems, the program needs at least a
few well-distributed landmarks to be defined for each
gel. For that purpose, we will simply perform an
automatic search of landmarks, using a spot matching
algorithm. Since gel and reference system are the same,
the search will involve no manual editing.
23.11.13 Select Normalization > Automatically find
landmarks.
The program asks for the number of landmarks to find. It
is not recommended to enter too high a number. Instead, a
moderate number, e.g. between 5 and 10, will allow the
program to assign only the most pronounced and best
corresponding spots, and the user can further assign
spots manually, after updating the normalization.
23.11.14 After entering a number and pressing <OK>,
the program assigns a number of landmarks, which are
indicated by a green cross on the landmark spot (Figure
23-12.). The cross becomes red if the spot is selected.
Figure 23-12. Spot defined as landmark.
23.11.15 Select Normalization > Update normalization
or press the corresponding button to update the
normalized view according to the landmark data.
The image should not change if all landmarks were
assigned correctly.
•Linking spots with reference spots
Once the gel is properly aligned to the reference gel, all
the spots that have a homologous reference spot should
occur very close to that reference spot. It is then
relatively easy to allow an automatic matching
algorithm to link spots and corresponding reference
spots. However, since alignment is never perfect, some
tolerance needs to be allowed. The size of that tolerance,
entered in pixels, depends on the accuracy of the
normalization, and the resolution of the gels.
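The role of the tolerance can be made concrete with a small Python sketch. The matching algorithm used by the program is not described in this guide, so the greedy nearest-neighbour strategy, the function name and the coordinates below are assumptions; only the idea of a maximum deviation in pixels (default 10, see 23.11.17) is taken from the text.

```python
import math

def link_spots(gel_spots, reference_spots, max_deviation=10.0):
    """Link each gel spot to the closest free reference spot within tolerance.

    Both arguments map a spot identifier to the (x, y) mass center measured
    in the normalized view. Returns {gel_spot_id: reference_spot_id}.
    """
    candidates = []
    for g_id, (gx, gy) in gel_spots.items():
        for r_id, (rx, ry) in reference_spots.items():
            distance = math.hypot(gx - rx, gy - ry)
            if distance <= max_deviation:
                candidates.append((distance, g_id, r_id))
    candidates.sort()                          # closest pairs first
    links, used_refs = {}, set()
    for _, g_id, r_id in candidates:
        if g_id not in links and r_id not in used_refs:
            links[g_id] = r_id
            used_refs.add(r_id)
    return links

# Hypothetical mass centers (in pixels)
gel_spots = {"g1": (100.0, 50.0), "g2": (200.0, 80.0), "g3": (300.0, 300.0)}
reference_spots = {"r1": (103.0, 54.0), "r2": (198.0, 79.0), "r3": (340.0, 300.0)}

links = link_spots(gel_spots, reference_spots, max_deviation=10.0)
```

Here g3 stays unlinked because its nearest reference spot lies 40 pixels away, which exceeds the tolerance.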
Since all spots of the current gel were defined as
reference spots, the program has automatically linked
the spots with the reference spots. A linked spot is
recognizable by a small green square in the center of the
spot, which becomes red when the spot is selected
(Figure 23-13.). When selected, the linked reference spot
also becomes selected (visible as a red mask).
Conversely, when you select a reference spot in the left
panel, the linked protein spot in the right panel (if any)
also becomes selected.
Figure 23-13. Spots that are linked to the reference
system: unselected (left) and selected (right).
23.11.16 Select Normalization > Automatically link
spots or press the corresponding button.
The program asks to enter the maximum deviation (in
pixels).
23.11.17 Leave the default value of 10 and press <OK>.
23.11.18 If the program has made a wrong linkage (not
possible with this gel), you can select the spot and
choose Normalization > Unlink selected spot(s), or
press the DEL key.
23.11.19 This finishes part 3, Normalization, for this gel.
Press the next step button to move to the next
step, the Metrics step.
23.12 Defining metrics
As explained above (23.8), during the Metrics procedure
each spot can be identified by a metric in the horizontal
direction and in the vertical direction. For protein gels
this will usually be a pI value and a molecular weight,
respectively. Since spots are identified using the
reference system, the metrics definition is only an
optional step, which is not necessary for any of the
comparison tools in BioNumerics 2D. The metric
descriptions will facilitate the comparison of spots from
different databases. The X and Y metrics are calculated
based upon spots with known X and/or Y metrics and
using polynomial regressions for which the user can
choose the degree (1-5) and a logarithmic dependency.
A third specification which can be defined for each
protein spot is its quantity. The quantity (also referred to
as Z-metric) is calculated from spots with known
quantity, using a polynomial regression of degree 1 to 5.
As opposed to the X and Y metric, the quantity can have
influence on the comparisons, since spot volumes
derived from scanned 2D images are usually not linear
with spot quantities. By applying spots with known
quantities on the gel, the user can let the program
linearize the spot volumes into physical quantities.
23.12.1 As an exercise, you can enter the molecular
weights and pI values of four known spots, as depicted
in Figure 23-14.
23.12.2 Double click on a spot with known MW and pI.
A dialog box will prompt you to enter a pI value (X), a
MW value (Y) and a quantity (Z).
23.12.3 Enter the appropriate values for pI and MW, and
leave the quantity field blank. Press <OK>.
NOTE: In case a spot used as calibration spot in Figure
23-14. is not defined in your gel, you can return to step
1 to add this spot at any time (using the corresponding
button).
When an X value has been entered for a spot, it is
marked with a bidirectional horizontal arrow.
Likewise, when a Y value has been entered for a spot, it
is marked with a bidirectional vertical arrow. In case
both the X and Y values were entered for a spot, it is
marked by the combination of these two arrows, as
shown in Figure 23-14.
23.12.4 When the pI and MW values have been entered
for the 4 known spots, call Edit > Settings or press the
button.
23.12.5 In the 2D gel settings dialog box, select the Metrics
tab, which looks as in Figure 23-15.
23.12.6 Under Name, enter the X and Y metrics, which
are pI and MW, respectively.
The entered X values (usually pI) and Y values (usually
MW) will be used by the software to calculate a linear or
exponential fit which can be used to identify exact spot
metrics all over the gel. BioNumerics 2D uses fitting
algorithms of 1st to 5th Degree and can add a
Logarithmic dependency to the fitting algorithm.
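As an illustration of what a Degree and a Logarithmic setting can mean, the Python sketch below fits a polynomial to the logarithm of the known metric values and converts the prediction back. It is an assumption about the general technique, not a description of the program's internals, and the marker positions and molecular weights are invented.

```python
import numpy as np

def fit_metric(positions, values, degree=1, logarithmic=False):
    """Return a function predicting the metric at any pixel position.

    With logarithmic=True the polynomial is fitted to log10(values), so the
    predicted metric is 10 ** polynomial(position).
    """
    y = np.log10(values) if logarithmic else np.asarray(values, dtype=float)
    coeffs = np.polyfit(positions, y, degree)

    def predict(position):
        out = np.polyval(coeffs, position)
        return 10.0 ** out if logarithmic else out

    return predict

# Hypothetical marker spots: vertical pixel position -> MW (kDa)
rows = np.array([50.0, 150.0, 250.0, 350.0])
mw = np.array([100.0, 50.0, 25.0, 12.5])

predict_mw = fit_metric(rows, mw, degree=1, logarithmic=True)
mw_at_200 = float(predict_mw(200.0))
```

In this invented data set the molecular weight halves every 100 pixels, so a straight line in log space reproduces all four markers, and the value at row 200 is the geometric mean of 50 and 25.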
23.12.7 For the X metric (pI) choose 1 as Degree, and do
not check Logarithmic.
23.12.8 For the Y metric (MW), select 3 as Degree, and
check Logarithmic, since MW electrophoresis runs
usually exhibit logarithmic dependence.
As an additional option, one can specify whether the
isometric values of the X metric should be strictly
vertical or not. In a pull-down box you can choose
between Vertical only, Rotated, and Rotated & curved.
•When Vertical only is selected, the program assumes
no rotation of the gel, and isometric values are
vertical.
•When Rotated is selected, the program tries to find
the best fit through the given marker points with an
additional rotation freedom. This means that, if a
better fit can be found by rotating the X-isometric lines
over a certain angle, this angle will be used.
•With Rotated & curved, the program is allowed to add
some curvature to the isometric lines to provide an
even better fit.
It is obvious that the latter two options require enough
input values to become reliable, especially the option
Rotated & curved. It is generally not recommended to
use this option.
The same options apply to the Y-metric.
23.12.9 Since the gel is somewhat rotated clockwise, you
can try choosing Rotated for both X and Y metrics.
23.12.10 Press <OK> to close the 2D gel settings dialog box
and confirm the changes.
The 2D gel image now displays a grid, defining the pI
values in the horizontal direction and the MW in the
vertical direction. You will notice that - if the metrics
were entered as in Figure 23-14. - the pI isometrics
(vertical lines) are slightly rotated clockwise.
Figure 23-14. Example for entering molecular weights and pI values for known protein spots in gel Wtlow.
Figure 23-15. The BioNumerics 2D gel settings
dialog box, Metrics tab.
23.12.11 Rotation of a gel can also be compensated for by
manually rotating the isometrics grid. This can be done
by clicking the two rotation buttons to rotate
counterclockwise and clockwise, respectively.
NOTE: Manual rotation should not be performed when
Rotation was selected as an option in the Metrics
settings.
The third metric, the Z metric, is intended to express spot
intensities as physical quantities.
23.12.12 Although there is no real quantitative
information available for this gel, you can consecutively
select a dark spot, an intermediate spot, and a weak
spot, each time entering a lower Z metric value, for
example 100, 50, and 10, respectively.
Spots for which a Quantity (Z) metric has been entered
are marked with a weight symbol.
23.12.13 Call the 2D gel settings dialog box again with Edit
> Settings or by pressing the corresponding button.
23.12.14 Select the Metrics tab (Figure 23-15.).
23.12.15 Under Spot quantification, you can enter a
Name for the metric, for example "Quantity".
23.12.16 As Degree for fitting, enter 2 (second degree
exponential fitting).
The option Clip at max. value makes it possible to
restrict the calculation of quantity to the range within
the marker spots entered. Spots that have more volume
than the highest calibration spot will be clipped at the
maximum value. This option is useful in case you want
to avoid extrapolation of the fit beyond the maximum
value entered.
The option Zero value allows one to specify a quantity
value on the image that corresponds to a zero intensity
value. In other words, it will add a spot to the regression
with zero intensity value and a quantity value of X.
23.12.17 Press <OK> to close the 2D gel settings dialog box
and confirm the changes.
23.12.18 In order to view the information stored for
every spot you can press the <Show spot info> button
or select the menu option Edit > Spot info.
As a result a small window will be opened that will
display the following information (Figure 23-16.):
•The maximum intensity of the spot (maximum gray
level value in image file)
•The total area (or surface taken on the image file) in
number of pixels
•The volume as calculated using all gray values of the
spot
•The quantity as calculated from the metrics
•The absolute position of the mass center using the X
and Y positions in the image file
•The average X and Y size in pixels of the protein spot
•The X and Y metrics, respectively as calculated from
the spot's mass center
Figure 23-16. Spot information box.
Note that standard deviations are calculated for the
maximum value, the area, the volume and the quantity,
in case the gel is a synthetic average of several
individual gels.
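Several of the values in the spot information box can be reproduced from pixel data. The Python sketch below is a simplified illustration and not the program's own computation: it assumes that a spot is given as a boolean mask over the image, that the volume is the plain sum of the gray values inside the mask, and that the mass center is the intensity-weighted mean position. All names and pixel values are hypothetical.

```python
import numpy as np

def spot_info(image, mask):
    """Maximum, area, volume and mass center of one spot (non-empty mask)."""
    image = np.asarray(image, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    values = image[mask]                       # gray values inside the spot
    rows, cols = np.nonzero(mask)              # same row-major order as values
    volume = values.sum()
    return {
        "max": float(values.max()),
        "area": int(mask.sum()),               # number of pixels
        "volume": float(volume),
        "mass_center": (
            float((cols * values).sum() / volume),   # X position
            float((rows * values).sum() / volume),   # Y position
        ),
    }

# A tiny invented image with one three-pixel spot
image = np.zeros((4, 4))
image[1, 1], image[1, 2], image[2, 1] = 2.0, 4.0, 2.0
info = spot_info(image, image > 0)
```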
Processing of gel Wtlow is now finished. The gel can be
saved.
23.12.19 Press <F2> or the save button, or select File >
Save.
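As a recap of the quantity (Z) metric described in this section, the Python sketch below shows one way the Degree, Clip at max. value and Zero value settings could interact. It is an assumption-laden illustration, not the program's implementation: a plain polynomial is fitted from spot volume to quantity, the Zero value adds one extra point at zero intensity, and clipping is done by limiting the volume to that of the largest marker spot. The volumes are invented, and the quantities 100, 50 and 10 follow the exercise in 23.12.12.

```python
import numpy as np

def fit_quantity(volumes, quantities, degree=2, clip_at_max=False, zero_value=None):
    """Return a function converting a spot volume into a quantity."""
    v = [float(x) for x in volumes]
    q = [float(x) for x in quantities]
    v_max = max(v)                             # largest marker spot volume
    if zero_value is not None:
        v.append(0.0)                          # extra regression point at
        q.append(float(zero_value))            # zero intensity
    coeffs = np.polyfit(v, q, degree)

    def quantity(volume):
        if clip_at_max:
            volume = np.minimum(volume, v_max) # no extrapolation above markers
        return np.polyval(coeffs, volume)

    return quantity

# Hypothetical volumes of the dark, intermediate and weak marker spots
quantity = fit_quantity(
    [1000.0, 500.0, 100.0], [100.0, 50.0, 10.0],
    degree=2, clip_at_max=True, zero_value=0.0,
)
q_750 = float(quantity(750.0))
q_big = float(quantity(2000.0))
```

With clipping switched on, a spot twice as large as the biggest marker spot still receives the quantity of that marker instead of an extrapolated value.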
23.13 Describing the 2D gel in the
database
Before this gel can be analyzed, i.e., queried and
compared with other 2D gels, it needs to be described in
the database. We will need to create a database entry
that has this gel linked to it. The entry holds the
descriptive information of the gel: the organism name,
sample number, the experiment conditions, the running
conditions, or whatever information that is applicable.
The way database information fields are created and
filled in is described in 6.3 and 6.4.
In the 2D gel file window, the link to a database entry is
shown in the top panel under the current step and the
file name (Figure 23-10.). Left in the panel, a gray arrow
is shown. If a link is present, the arrow has a pink color,
and the key and database information fields are
displayed right from the arrow.
There are two ways to link the 2D gel to an associated
database entry: (1) if the organism or sample is not
available yet, by letting the program automatically
create a new entry for the gel, or (2) if the organism or
sample is already described in the database, by linking
the gel to that existing entry.
Since our database is empty, we will create a new entry
for the gel Wtlow.
23.13.1 In the 2D gel file window of Wtlow, select File >
Add to database.
The arrow in the database link line becomes pink, and a
key was automatically generated for the entry.
23.13.2 Close the 2D gel file window for Wtlow.
23.13.3 Using Database > Add new information field,
create 3 information fields in the database: Organism,
Taxon, and Condition.
23.13.4 Double click on the new database entry and
enter the following information under:
•Organism: Wild type
•Taxon: Campylobacter jejuni
•Condition: Low Fe concentration
23.14 Normalization of other 2D gels
When processing the first 2D gel, we have defined the
Reference System based upon that gel (23.11), so the
normalization step could be skipped in that gel. We will
now process a second gel to illustrate the normalization
features in particular.
23.14.1 First, create a new database entry with Database
> Add new entries or the corresponding button.
23.14.2 Press <OK> to have the software automatically
assign a key to the entry.
23.14.3 Double click on the new database entry and
enter the following information under:
•Organism: Wild type
•Taxon: Campylobacter jejuni
•Condition: High Fe concentration
23.14.4 In the Files panel of the BioNumerics Main
window, select file Wthigh, and press the corresponding
button to open the file (or double-click on the file name).
A dialog box prompts you to select the 2D gel
experiment type to which the gel should belong.
23.14.5 Select Fur (the only existing 2D Gel Type) and
press <OK>.
NOTE: Opening a second gel in a specific experiment
type will automatically apply the settings that have
been specified in the experiment. Consequently, opening
a second gel or consecutive gels may appear to be slower,
since background and spikes may be removed, smear
removed and filtering applied.
23.14.6 Perform Step 1 and Step 2 of the 2D gel
processing as described earlier (23.9 and 23.10).
When moving to Step 3 (Normalization), a dialog box
prompts you to select the Reference System type to which
the gel should be assigned.
23.14.7 Select Fur (the only existing reference system)
and press <OK>.
In the Normalization step, the window is divided in two
panels. The left panel displays the reference system Fur,
and the right panel shows the current data gel Wthigh.
When you select a spot in the right panel (click on the
spot), its contour in the left panel (reference system) also
becomes highlighted (red instead of blue) (see Figure
23-10.).
The reference system is currently shown as a synthetic
gel. In an alternative viewing mode, the reference
system can be shown as the original gel from which the
spots were derived (the reference gel).
23.14.8 Toggle between synthetic reference system and
original reference gel using Normalization > Show
synthetic reference system and Normalization > Show
reference gel.
NOTE: If additional reference spots were defined from
other gels than the initial reference gel, these spots will
not be shown when you display the original gel rather
than the synthetic reference system.
Normalization of a gel happens in two steps (23.11.12):
(1) homologous spots on the gel and the reference
system are assigned (landmarks), and (2) the program
automatically searches for the remaining homologous
spots on the data gel and the reference system, and links
them.
The selected spot is highlighted by a red contour. The
contour of this spot in the reference system panel (left)
also becomes red.
•Creating landmarks for normalization
The selected spot is highlighted by a red contour. The
contour of this spot in the data gel panel (right) also
becomes red. We will now create a landmark by linking
the spot with its homologous reference spot.
BioNumerics 2D contains an automatic searching tool
for landmarks, using a spot matching algorithm. We
have described this feature earlier (23.11.13). For very
different gels, however, this feature will not always
provide satisfactory results.
23.14.14 Select the corresponding spot on the reference
system (left panel).
23.14.15 Select Normalization > Use spot as landmark,
or press ENTER, or press
NOTE: you can try the automatic landmark finding
tool at any time; if the result is not satisfactory, press
the Undo button.
When matching a new gel to a reference system
manually, the correctness and the distribution of the
chosen landmarks will be a crucial factor in determining
the quality of the match. BioNumerics 2D offers a
number of viewing modes to facilitate the definition of
landmarks and verify their effect on the matching. By
default, the reference system and the data gel are
displayed side by side. In BioNumerics 2D, however,
there are two additional modes for display: the
overlapped mode and the superimposed mode.
23.14.9 Show the gels in overlapped mode using the
menu option Normalization > Show overlapped images
.
We have now created a landmark, which is indicated by a
green cross on the spot (Figure 23-12.). The cross
becomes red if the spot is selected. Since the spots on the
data gel and the reference system are linked, both spots
are highlighted if either of them is selected. If a
landmarked spot is selected, the landmark button
(
) has a green flag.
23.14.16 To show the data gel in normalized view, i.e., to
deform the image so that all landmarked spots fall
together with their reference spots, press
or select
Normalization > Show normalized view.
button.
Once normalized to one or a few landmarks, it may
become easier to assign additional landmarks.
In this mode, you only see one gel at a time, initially the
gel to be matched.
23.14.17 Select a few more spots on the gel and
homologous spots on the reference system, to create
more landmarks.
23.14.10 By pressing the TAB key (or pressing
23.14.18 Select Normalization > Update normalization
or by pressing the
),
you can toggle between viewing the data gel and the
reference system.
or press
23.14.11 Show the images in superimposed mode using the
the menu option Normalization > Show superimposed
Creating landmarks in overlapped mode
images or by pressing the
23.14.19 Show the gels in overlapped mode by pressing
button.
the current landmark data.
the
The gels are now shown in two colors: the data gel to be
matched in orange and the reference system in blue.
Spots that overlap each other in both gels become black.
Creating landmarks in side-by-side mode.
23.14.12 Press
or select Normalization > Show
images side by side.
The spots of the reference gel are also shown on the data
gel (right panel) as blue contours. Likewise, the spots of
the data gel are also shown as blue contours on the
reference gel in the left panel (Figure 23-11.).
23.14.13 Select a spot on the data gel (right panel).
to update the normalized view according to
button.
23.14.20 By pressing the TAB key (or pressing
),
you can toggle between viewing the data gel and the
reference system. The gel currently shown is indicated
in the upper right corner.
There are two ways to create landmarks in this mode.
The first way is the same as in the side-by-side mode.
23.14.21 Select a non-landmarked spot on the data gel,
and click on it to select the spot (marked by a red dot).
23.14.22 Press the TAB key to view the reference gel, and
select the homologous spot.
Chapter 23 - Analyzing 2D gels
221
Figure 23-18. Data gel and reference gel in orange and blue, respectively, in superimposed mode (distortion
maze shown).
23.14.23 Press ENTER (or the corresponding button) to
create a landmark for this spot.
23.14.24 The second way is quicker and easier: click on a
spot to landmark, hold down the left mouse button, and
drag the mouse slightly over the screen.
The reference gel is shown semitransparently in yellow
over the gel (Figure 23-17.). Overlapping spots between
the gel and the reference gel are green. If a spot on the
reference gel is in the close vicinity of the data gel, the
program automatically suggests a link with a red
connecting line, as shown in Figure 23-17.
NOTE: The quick drag-and-drop method can only be
used if the data gel is shown. If this is not the case, press
TAB and try again.
Figure 23-17. Reference gel superimposed on gel in
overlapped mode.
23.14.25 Drag the mouse until the homologous spot of
the reference gel is linked to the selected spot on the
data gel, and release the mouse button.
The superimposed mode
23.14.26 Show the images in superimposed mode using
the menu option Normalization > Show superimposed
images or by pressing the corresponding button.
The gels are now shown in two colors: the data gel in
orange and the reference system in blue. Spots that
overlap each other in both gels become black (Figure 23-18.).
As in the overlapped mode, there are two ways
to create landmarks in this mode. The first way is the
same as in the side-by-side mode:
23.14.27 Select a non-landmarked spot on the data gel,
and click on it to select the spot (marked by a red dot).
23.14.28 Press the TAB key to view the reference gel, and
select the homologous spot.
23.14.29 Press ENTER (or the corresponding button) to
create a landmark for this spot.
23.14.30 The second way is both quicker and easier: click
on a spot from the data gel to landmark (orange), hold
down the left mouse button, and drag the mouse
slightly over the screen to the homologous reference
spot (blue).
NOTE: The quick drag-and-drop method can only be
used when spots on the data gel (orange) are shown. If
this is not the case, press TAB and try again.
23.14.31 Select Normalization > Update normalization
or press the corresponding button to update the
normalized view according to the landmark data.
23.14.32 You can view the distortion applied for the
matching by selecting the menu option Normalization >
Show distortion maze. The result looks as in Figure 23-18.
NOTE: Use the Undo and Redo functions to undo/redo
the last actions. To remove an incorrect landmark, you
can also select it and press the DEL key or the ENTER
key (or the corresponding button).
•Linking spots with reference spots
Once the gel is properly aligned to the reference gel, all
the spots that have a homologous reference spot should
occur very close to that reference spot. It is then
relatively easy to allow an automatic matching
algorithm to link spots and corresponding reference
spots within a certain tolerance, entered in pixels.
A linked spot is recognizable by a small green square,
which becomes red when the spot is selected (Figure 23-13.).
When selected, the linked reference spot also
becomes selected (visible as a red mask). Conversely,
when you select a reference spot in the left panel, the
linked protein spot in the right panel (if any) also
becomes selected.
23.14.33 Select Normalization > Auto link spots or press
the corresponding button.
The program asks you to enter the maximum deviation (in
pixels).
23.14.34 Leave the default value of 10 and press <OK>.
23.14.35 If the program has made a wrong linkage, you
can select the spot and choose Normalization > Unlink
selected spot(s), or press the DEL key or the ENTER key
(or the corresponding button).
23.14.36 This finishes part 2, Normalization, for this gel.
Press the next step button to move to the next
step, the Metrics step, where you may enter some
known spot metrics as discussed in 23.12.
23.14.37 Save the gel Wthigh and close the 2D gel file
window.
The program may ask "Settings have been changed. Do
you want to use the current settings as the new
default?". If you feel the changed settings for the 2D Gel
Type will be useful for future gels as well, answer
<Yes>. If the changes were only necessary to improve
the processing of this individual gel, press <No>.
The program may also ask to confirm that the
"configuration has been changed". This question comes
up when the current gel has been changed.
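The automatic linking described in 23.14.33 and 23.14.34 can be pictured with a small Python sketch. This is only an illustration under stated assumptions, not the algorithm used by BioNumerics: it assumes spot positions are already in normalized coordinates, measures the deviation as a straight-line distance in pixels, and accepts the closest pairs first. The function name auto_link and the data layout are hypothetical.

```python
import math

def auto_link(spots, ref_spots, max_deviation=10.0):
    """Link each gel spot to the nearest unused reference spot within max_deviation pixels.

    spots and ref_spots map a spot name to an (x, y) position in normalized
    coordinates. Returns a dict {spot name: reference spot name}.
    """
    # Collect every candidate pair that lies within the tolerance.
    candidates = []
    for name, (x, y) in spots.items():
        for ref_name, (rx, ry) in ref_spots.items():
            distance = math.hypot(x - rx, y - ry)
            if distance <= max_deviation:
                candidates.append((distance, name, ref_name))

    # Accept the closest pairs first; each spot and reference spot is used once.
    links = {}
    used_refs = set()
    for distance, name, ref_name in sorted(candidates):
        if name not in links and ref_name not in used_refs:
            links[name] = ref_name
            used_refs.add(ref_name)
    return links


# Hypothetical positions, in pixels.
gel = {"s1": (100.0, 100.0), "s2": (203.0, 148.0), "s3": (400.0, 400.0)}
reference = {"r1": (104.0, 103.0), "r2": (200.0, 150.0)}
print(auto_link(gel, reference))
```

With the default tolerance of 10 pixels, s1 and s2 are linked and s3 stays unlinked because no reference spot lies nearby. A smaller tolerance gives fewer but safer links, which is why a poorly normalized gel benefits from more landmarks before auto-linking.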
To illustrate the comparison functions in the next
chapter, it is recommended to add a few other gels to the
database Demo2D. You may want to process gels
Furlow and Furhigh. Alternatively, the database with
the four processed gels can be copied from the
installation CD where it can be found as
Demo\Demo2D. Copy the complete directory Demo2D
to the home directory of BioNumerics, which is
c:\Program files\BioNumerics\Data after a default
installation. If the home directory is different, you can
find the actual path in the BioNumerics Startup
program, by pressing the <Homedir> button.
24. Comparing 2D gels
24.1 Introduction
One of the main purposes of analyzing 2D gels is to
detect proteins that are invariantly expressed or
differentially expressed in different circumstances.
Another application could be to compare patterns of
protein expression between different organisms, in the
same circumstances. All these applications require that
spots representing the same protein are linked to each
other. This is done by (1) normalizing different gels to a
common reference system (23.11) and (2) linking spots
of the gel to the homologous reference spots (23.11.16).
During the normalization procedure, two spots from
different gels may be linked to the same reference spot
(Figure 24-1.). Internally, the software stores a unique
identifier for each protein spot on each gel. The spots on
the reference system also have an identifier. When a spot
is linked to a reference spot, it gets the same identifier as
that reference spot, so that the program recognizes it as
the same protein. When a spot on another gel is linked
to the same reference spot, it also gets the same
identifier, so that the program recognizes the spots on
both gels as the same. If A = B and B = C then A = C.
Figure 24-1. Indirect linking of spots via reference
spots: two spots linked to the same reference spot
are recognized as the same protein.
In the 2D gel matching window (discussed further in this
chapter) there is also a possibility to link spots on two
gels directly to each other, without linking to a reference
spot. One can, for example, link spot A from one gel to
spot B from another gel, without A or B being linked to a
reference spot (Figure 24-2.). In this case, both spots A
and B will get the same identifier and will be recognized
as the same protein, without a reference spot being
present for this protein.
Figure 24-2. Direct linkage of two spots in the 2D gel
matching window.
Each spot in the database represents a certain protein;
linked spots represent the same protein. BioNumerics
stores information fields for each protein. The software
stores a Spot ID, which is assigned automatically, and a
number of additional fields that can be filled in by the
user:
Accession: An accession number for the protein so that
links to external databases can be made;
Description: A description of the protein such as
function, pathway;
Gene name(s): name or code of the gene relating to the
protein, and synonyms;
Field 1, Field 2, Field 3, Field 4, and Field 5: free
user-definable fields.
The Description, Gene name(s) and free fields can
contain strings of unlimited length.
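The rule "if A = B and B = C then A = C" from this introduction can be illustrated with a short Python sketch. It is not the internal data model of BioNumerics: the manual says linked spots receive the same identifier, whereas the sketch uses a union-find structure, a standard technique that gives the same grouping for both indirect links (via a reference spot) and direct links (between two gels). The class name SpotLinks and the (gel, spot number) tuples are hypothetical.

```python
class SpotLinks:
    """Track which spots represent the same protein using a union-find structure."""

    def __init__(self):
        self._parent = {}

    def _find(self, spot):
        # Register unknown spots as their own group, then walk up to the root.
        self._parent.setdefault(spot, spot)
        while self._parent[spot] != spot:
            # Path halving keeps later lookups short.
            self._parent[spot] = self._parent[self._parent[spot]]
            spot = self._parent[spot]
        return spot

    def link(self, spot_a, spot_b):
        """Record that spot_a and spot_b are the same protein."""
        root_a, root_b = self._find(spot_a), self._find(spot_b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def same_protein(self, spot_a, spot_b):
        """Return True if the two spots belong to the same linked group."""
        return self._find(spot_a) == self._find(spot_b)


links = SpotLinks()
links.link(("Ref", 12), ("Gel A", 7))    # indirect link via reference spot 12
links.link(("Ref", 12), ("Gel B", 31))   # second gel linked to the same reference spot
links.link(("Gel A", 40), ("Gel B", 55)) # direct link, no reference spot involved
print(links.same_protein(("Gel A", 7), ("Gel B", 31)))
```

The two spots linked to reference spot 12 end up in one group without ever being linked to each other, which is the behavior shown in Figure 24-1.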
24.2 Matching spots on different gels
The goal of comparing 2D protein gels is to either
compare overall protein expression patterns (in different
conditions or over different organisms) or to examine
the expression level of individual proteins as a function of
different conditions. Such analyses usually require some
kind of clustering or grouping algorithm, but in the first
place, require that homologous spots on different gels
are properly linked together. The 2D gel matching
window is designed for this purpose.
24.2.1 Select the four Campylobacter jejuni entries in the
database Demo2D: Wild type with low Fe
concentration, Wild type with high Fe concentration,
Fur mutant with low Fe concentration, Fur mutant with
high Fe concentration. Use the <space bar> or click on
the entries while holding the SHIFT or CTRL key.
24.2.2 Copy the gels to the clipboard by selecting Edit >
Copy selection or by pressing the corresponding button.
24.2.3 In the Experiments panel, open experiment type
Fur by double clicking or pressing the corresponding
button.
This opens the 2D gel type window for experiment type
Fur (Figure 24-3.).
The 2D gel type window contains information which is
general for the selected 2D experiment type, and which
is saved along with the experiment type, in this case Fur.
This information includes the reference system(s)
defined within the 2D Gel Type, the image processing
settings, the free field names, the spot queries, and the
spot quantification settings. We will deal with queries
later.
24.2.4 The names for the 5 optional free fields (see the
introduction, 24.1) can be entered from the 2D gel type
window, by selecting Settings > Spot label names. A
dialog box pops up where you can enter a name for each
of the free fields.
Figure 24-3. The 2D gel type window.
24.2.5 Call the 2D gel matching window by selecting File >
Create matching window or by pressing the corresponding
button.
The 2D gel matching window (Figure 24-4.) consists of 3
views: the Gel images view, which is selected by default,
the Query spreadsheet view, and the Scatter plot view.
The latter two views are described in paragraphs 24.4
and 24.5, respectively.
The 2D gel matching window is intended to display many
gels next to each other. Initially, only the active reference
system is displayed.
NOTE: The active reference system is the reference
system used for comparisons in the 2D gel matching
window. You can change the active reference system in
the 2D gel type window by selecting the Reference
systems tab and choosing Refsystem > Set as active
reference system.
24.2.6 Paste the gels from the clipboard in the 2D gel
matching window by selecting Edit > Paste entries from
clipboard or pressing the corresponding button.
The 2D gel matching window now displays the reference
system and four 2D gels (Figure 24-4.).
In 5 separate panels, the four gels and the reference
system are displayed. The reference gel is bordered by a
red rectangle. For each gel, the landmarks defined in the
normalization step are indicated as green crosses and
the proposed shifts towards the homologous position on
the synthetic reference gel are indicated.
Figure 24-4. The 2D gel matching window.
24.2.7 The position of the gels can be changed in this
window by clicking on the name of the gel and dragging
it to its new location.
If gels are shown in normalized mode, each
displacement of one gel will be automatically followed
by the other gel images.
Initially, the gels are not shown in normalized mode.
24.2.8 To show all gels in the normalized mode, press
the corresponding button or select Image > Show normalized.
24.2.9 When selecting a protein spot on one of the gels
(or on the reference gel), BioNumerics 2D will indicate
the homologous protein, if present, on all other gels
displayed in the 2D gel matching window. A label is
shown, pointing to the selected spot, and indicating a
quantification value of the spot on each gel. This
quantification value can be the Maximum value, Volume,
Relative volume, Quantified value, or Area. The
quantification value to be shown can be chosen from the
drop down list in the button bar of the 2D gel matching
window.
NOTE: The default settings for the quantification value
can also be specified in the Spot quantification
settings dialog box (see Figure 24-14.).
24.2.10 By selecting the image dragging tool
you can drag the image to any part of the gel.
24.2.11 You can also use the zoom tools to zoom in or
out in each gel window. To zoom
in you can drag a rectangle on the region of interest.
In the 2D gel matching window, it is possible to improve
or correct the normalizations of gels made earlier, by
relinking spots, removing links or adding links.
24.2.12 To link a spot on a gel to a reference spot, click
on the spot, and hold down the left mouse button while
dragging the mouse to the homologous reference spot.
The mouse pointer changes from a prohibition sign into
a symbol of two linked spots.
NOTE: In the same way, spots can also be linked
between gels directly, without linking to a spot on the
reference system. Such spots will also be recognized as
the same protein and hence, share the same information
fields. They will, however, not be assigned a Spot ID
(see further).
24.2.13 In the 2D gel matching window, you can also
display the gels in the superimposed mode by using the
menu option Image > Show overlap or by clicking the
button.
The gels are now shown in two colors: the data gels in
orange and the reference system in blue. Spots that
overlap each other in both gels become black.
24.2.14 If you click on a spot on the gel of interest, you
can drag it to the homologous spot position of the
underlying reference gel. When the correct position on
the reference gel has been reached, the mouse pointer
changes from a prohibition sign into a symbol of two
linked spots. At the same time, the reference spot is
bordered by a red square on the reference gel. You can
release the mouse button to establish the link.
24.2.15 The linked spot is now defined as a landmark
position in the original gel.
24.2.16 For each spot you can break existing links by
selecting the spot on the gel and selecting the menu
option Spots > Break link or by pressing the DEL key.
24.2.17 After you have created new links, you can press
the corresponding button or select the menu option
Image > Update normalization, which will renormalize the
modified gel(s) and return the display to the single-gel,
normalized mode (not superimposed).
24.2.18 You can save the changes by pressing the
corresponding button or by selecting File > Save changes.
At this stage, it is still possible to change the synthetic
reference system by adding or deleting spots.
24.2.19 To add spots to the reference system, select one
or more spots on a gel and choose Refsystem > Add
selected spot(s).
24.2.20 To delete one or more spots from the reference
system, select the spot(s) on the reference gel and choose
Refsystem > Delete selected spot(s).
24.2.21 At any time you can add or delete a gel to/from
the matching by pasting new gels from the clipboard or
by selecting Edit > Cut selected gel from matching
(or the corresponding button).
24.2.22 When doing so it may be useful to change the
layout of the 2D gel matching window by clicking the
corresponding button or by selecting a grid layout from the
Layout menu (1x1 grid, 2x1 grid, 2x2 grid, 3x2 grid, or
5x3 grid).
In order to match two or more data gels more exactly
(not via a reference gel), you can turn any gel into a
temporary reference gel. This will allow two
non-reference gels to be shown in superimposed mode. For
example, you may want to display the gel from which
the reference system was derived (Wtlow) in
superimposed mode rather than the synthetic gel.
24.2.23 Click on the gel to use as temporary standard
(e.g. Wtlow) and select Refsystem > Use selected gel as
temporary standard or press the corresponding button.
NOTE: You can remove the synthetic reference gel from
the 2D gel matching window (Edit > Cut selected
gel from matching) and continue to improve the
matching between the remaining gels. A reference
system can also be added to a 2D gel matching
window, by bringing the 2D gel type window to the
front (Figure 24-3.), selecting the <Reference
systems> tab, choosing a reference system from
the list, and selecting Refsystem > Add to matching
window.
As mentioned earlier, spots that are not present in the
reference system can also be matched between gels;
these spots will have no spot ID.
Based on the 2D gel matching window, the software will
be able to assign an ID code to all protein spots present
on the reference system. Spots that are linked to the
same reference spot in the synthetic reference gel will
have the same ID code.
24.2.24 You can assign the spot IDs by selecting the
option Refsystem > Assign ID code to spots.
The ID code assigned to the spots becomes permanent
once the matching is saved to disk.
24.2.25 You can hide the label by selecting Layout >
Label with > No field.
The label given by the software is only a primary
identifier and can be supplemented by other
information.
24.2.26 To add additional information, double click on
the spot you would like to document, or select Spots >
Change description.
The Spot description fields window will open (Figure 24-5.)
and you can type the information for that spot.
Figure 24-5. The Spot description fields dialog box.
24.2.27 The Spot description fields window has the three
standard fields Accession code, Description and Gene
name(s) as well as five optional free fields where you can
store other types of information. The information
can be typed from the keyboard or can be loaded from
public databases using BioNumerics scripts.
NOTE: The names for the free fields can be changed in
the 2D Gel Type settings, as described in 24.2.4. The
information is available for querying only after saving
the gel.
24.2.28 In case more than one spot is selected and you
select Spots > Change description, the program will
prompt that "There are x spots selected. Do you want to
modify all spots simultaneously?". In case of
confirmation with <Yes>, a variant of the Spot description
fields dialog box pops up (Figure 24-6.), from which you
can choose a field, and enter the string that should be
filled in for the selected spots.
Figure 24-6. Spot description fields editor dialog box
for multiple selected spots.
24.2.29 The information stored can be viewed quickly
for each spot by selecting the menu option Layout >
Show spot info or by pressing the corresponding button.
24.3 Creating 2D spot queries
BioNumerics 2D contains a spot querying tool that
allows searches to be performed based on spot
intensities and spot information fields on gels of selected
entries in the database. The resulting set of spots
composes a data set that is amenable to further statistical
analysis by the BioNumerics software, for example,
cluster analysis, principal components analysis,
discriminant analysis, self-organizing maps, MANOVA,
etc.
If you continue from paragraph 24.2, you may still have
the four database entries in database Demo2D selected,
and the 2D gel type window opened. If not, proceed as in
24.3.1 to 24.3.2.
24.3.1 In the database, select the 2D gels you want to
analyze. In this example we will use all four gels Wtlow,
Wthigh, Furlow, and Furhigh of database Demo2D.
24.3.2 Double click on the 2D experiment type Fur to
open the 2D gel type window (Figure 24-3.).
The query tool allows you to create individual query
components, which can be combined into more complex
queries with logical operators. The available query types
are Intensity query, Significance query, Spot field query,
and Manual selection. The available logical operators are
AND, OR, NOT, and XOR.
•Intensity queries
With intensity queries, spots can be selected based upon
one of the available intensity measures (spot height,
volume, quantity).
24.3.3 To prepare a new query, click the <New intensity
query> button and enter a name, for
example, “Differential expression”. Press <OK>.
24.3.4 In the Intensity query window (Figure 24-7.) select
Volume as the Spot intensity measure.
Other measures to construct intensity queries are
Maximum value (highest pixel intensity), Relative
volume (in %) (related to all spots on the gel as 100%)
and Quantity (according to the Z metric in step 4 of the
normalization; see 23.12.15).
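As a small worked example of the Relative volume measure, the Python sketch below expresses each spot volume as a percentage of the summed volume of all spots on the same gel. It assumes that "all spots on the gel as 100%" means a plain sum of spot volumes; the function name and the sample values are hypothetical.

```python
def relative_volumes(volumes):
    """Return each spot volume as a percentage of the total volume on the gel."""
    total = sum(volumes.values())
    if total == 0:
        raise ValueError("the gel has no measurable spot volume")
    return {spot: 100.0 * volume / total for spot, volume in volumes.items()}


# Hypothetical spot volumes on one gel.
gel_volumes = {"spot 1": 20000, "spot 2": 5000, "spot 3": 75000}
print(relative_volumes(gel_volumes))
```

Because the values are expressed per gel, relative volumes are less sensitive than raw volumes to differences in overall staining or loading between gels.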
•You can also select the option Fuzzy logic which will
use a weighted approach based upon the different
search criteria to determine whether a spot fulfills the
criteria or not. One advantage of this method is that
not all criteria need to be fulfilled exactly: for example
if one criterion specifies that a spot should have at
least a volume of 20000, a spot with a volume of 19000
may be selected as well if other criteria are matching.
Another advantage is that found spots are ranked
according to the overall matching of the search criteria
imposed.
•The Intensity query window also offers the option
Combine using OR, which, when checked, will
combine the Min. and Max. criteria specified for each
gel with OR. This means that, when this option is
checked, each spot for which at least one gel has its
criteria fulfilled, will be selected in the query.
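The effect of the Min. and Max. criteria and of the Combine using OR option can be sketched in Python as follows. This is an illustration only, not the program's implementation: the function name, the data layout, and the choice to treat a spot that is absent on a gel as failing that gel's criterion are all assumptions. Fuzzy logic and Relative to base gel are not modelled.

```python
def intensity_query(spot_values, criteria, combine_using_or=False):
    """Select spots whose per-gel values satisfy the Min./Max. criteria.

    spot_values: {spot id: {gel name: value}}
    criteria:    {gel name: (minimum or None, maximum or None)}
    """
    def gel_ok(values, gel, minimum, maximum):
        value = values.get(gel)
        if value is None:  # spot absent on this gel (assumed to fail)
            return False
        if minimum is not None and value < minimum:
            return False
        if maximum is not None and value > maximum:
            return False
        return True

    # Default: all gels must match. With the OR option, one match is enough.
    combine = any if combine_using_or else all
    return [
        spot for spot, values in spot_values.items()
        if combine(gel_ok(values, gel, lo, hi) for gel, (lo, hi) in criteria.items())
    ]


# Criteria as entered in 24.3.6 and 24.3.7; the spot volumes are hypothetical.
criteria = {
    "Wtlow": (None, 10000), "Furlow": (None, 10000),
    "Wthigh": (10000, None), "Furhigh": (10000, None),
}
spots = {
    1: {"Wtlow": 4000, "Wthigh": 25000, "Furlow": 6000, "Furhigh": 30000},
    2: {"Wtlow": 15000, "Wthigh": 16000, "Furlow": 14000, "Furhigh": 15500},
    3: {"Wtlow": 12000, "Wthigh": 3000, "Furlow": 11000, "Furhigh": 2000},
}
print(intensity_query(spots, criteria))
print(intensity_query(spots, criteria, combine_using_or=True))
```

In this example only spot 1 fulfills the criteria on all four gels. With Combine using OR, spot 2 is selected as well, because it fulfills the criterion on at least one gel.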
24.3.8 Press <OK> to finish formatting the intensity
query.
The query is now displayed in the 2D gel type window
(Queries tab) as a gray box listing the name, the type and
the number of spots found. As more queries are
generated, all of them will be listed.
Figure 24-7. Intensity query window.
24.3.5 Press the button <Add Selected> to include the
selected gels in the query. Four gels should now be
listed. If not, make sure the database entries are selected
(blue arrow).
You can now define the search criteria for the spots. For
each gel you can enter a minimum and a maximum
intensity value, by clicking on the --- field under Min. for
the corresponding gel.
24.3.6 Under Max. for Wtlow click on ---. The field
changes into an input field where you can enter a
maximum volume, for example 10000. Then press
ENTER. Repeat the same for Furlow.
24.3.9 With the query selected, click the <Run selected
queries> button or select Queries > Update.
The number of spots found by the query is displayed in
its box. When the query is bordered by a red rectangle, it
is the active query.
24.3.10 To make a query active, select it and press the
corresponding button.
The active query is the query that will be used in all the
comparison tools of BioNumerics. These include the
spreadsheet comparisons in the 2D gel matching window
(see 24.4), the cluster and grouping analysis tools via a
Composite Data Set (see 24.6), and the advanced
analysis using GeneMaths or GeneMaths XT (24.7).
•Significance queries
24.3.7 Under Min. for Wthigh and Furhigh, enter 10000.
With this setting, the query will look for all proteins that
have a volume of less than 10000 in Wtlow and Furlow,
and higher than 10000 in Wthigh and Furhigh. The
following additional functions are possible in the
Intensity query window:
•When clicking the option Relative to base gel, the
<Set as base gel> button will be activated. You can
then select a gel from the list that will be regarded as
the base gel. For any query that is performed on a
non-base gel, the value set will be used as a multiplication
factor, which will be multiplied with the spot quantity
on the base gel. For example, if you enter 2 as Min., a
spot on a gel should be minimum two times as high as
the corresponding spot on the base gel.
Using a significance query, spots can be selected that are
significantly aberrant from the average expected value,
based upon regression between pairs of gels. For a
comparison of N gels, each gel is compared to each other
gel by mapping all the shared spots into scatterplots and
calculating a best fit regression through the plots (see
also 24.5). This leads to N(N-1)/2 regressions, from
which a spot is selected if it is significantly different on
at least one regression.
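The principle of a significance query can be sketched in Python under simplifying assumptions. The sketch fits a straight line (polynomial degree 1) through the shared spots of every pair of gels, assumes normally distributed residuals with one standard deviation per regression, and selects a spot when its two-sided tail probability falls below the chosen P-value in at least one regression. The Robust and Variable standard deviation options are not modelled, and the function names and data are hypothetical; the actual calculation in BioNumerics may differ.

```python
import math
from itertools import combinations

def linear_fit(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def significance_query(gels, p_value=0.05):
    """Return the spots that are outliers in at least one pairwise regression.

    gels: {gel name: {spot id: value}}
    """
    selected = set()
    for gel_a, gel_b in combinations(gels, 2):  # N(N-1)/2 pairs of gels
        shared = sorted(set(gels[gel_a]) & set(gels[gel_b]))
        xs = [gels[gel_a][s] for s in shared]
        ys = [gels[gel_b][s] for s in shared]
        if len(shared) < 3 or len(set(xs)) < 2:
            continue  # too few points for a meaningful fit
        slope, intercept = linear_fit(xs, ys)
        residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
        sd = math.sqrt(sum(r * r for r in residuals) / (len(shared) - 2))
        if sd == 0:
            continue  # perfect fit, no outliers
        for spot, r in zip(shared, residuals):
            # Two-sided tail probability under a normal residual model.
            if math.erfc(abs(r) / (sd * math.sqrt(2))) < p_value:
                selected.add(spot)
    return selected


# Hypothetical data: ten shared spots, one of which deviates strongly on gel B.
demo = {
    "A": {i: float(i) for i in range(1, 11)},
    "B": {i: float(i) for i in range(1, 11)},
}
demo["B"][5] = 25.0
print(significance_query(demo))
```

For four gels the loop runs over 4 x 3 / 2 = 6 regressions. Note that a single strong outlier also inflates the standard deviation of an ordinary regression, which is the kind of distortion the Robust option is intended to reduce.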
24.3.11 To create a query by significance, select Queries
> New significance query or press the corresponding
button.
Enter a name, for example, “Outliers” and press <OK>.
The Spot significance query dialog box (Figure 24-8.) allows
the choice between the Maximum value (highest pixel
intensity), Volume, Relative volume (in %) (related to all
spots on the gel as 100%) and Quantity (according to the
Z metric in step 4 of the normalization; see 23.12.15).
24.3.12 Press the button <Add Selected> to include the
selected gels in the query. Four gels should now be
listed. If not, make sure the database entries are selected
(blue arrow).
24.3.14 Under Search in, you can specify one of the
information fields, or <All fields>. A partial search
string can be entered using an asterisk (*) as wildcard.
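The wildcard search of a spot field query (24.3.14) can be imitated with the Python standard library. This is a sketch under assumptions: matching is made case-insensitive here, and fnmatchcase also interprets ? and [...] as wildcards, which is broader than the asterisk mentioned in this manual. The function name and the sample spot descriptions are hypothetical.

```python
from fnmatch import fnmatchcase

def spot_field_query(spots, pattern, search_in=None):
    """Return the ids of spots that have a field matching pattern.

    spots:     {spot id: {field name: text}}
    search_in: a field name, or None to search all fields
    """
    needle = pattern.lower()
    hits = []
    for spot_id, fields in spots.items():
        names = fields if search_in is None else [search_in]
        if any(fnmatchcase(fields.get(name, "").lower(), needle) for name in names):
            hits.append(spot_id)
    return hits


# Hypothetical spot information.
spot_info = {
    1: {"Description": "Ferritin", "Gene name(s)": "cft"},
    2: {"Description": "Catalase", "Gene name(s)": "katA"},
}
print(spot_field_query(spot_info, "ferr*", search_in="Description"))
print(spot_field_query(spot_info, "kat*"))
```

The first call searches one field only; the second corresponds to choosing <All fields> under Search in.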
•Manual selections
Manual selections have no search criterion associated
and after creation, they are empty boxes. The purpose is
to add spots manually, and to store such selections.
24.3.15 In case you want to define a set of spots manually,
without the intervention of a criterion-based query, you
can create a manual selection with Queries > New
manual selection or the corresponding button.
A manual selection contains zero spots when created.
However, in the 2D gel matching window (see 24.4), it is
possible to add spots to the active query using the menu
option Spots > Add to active query. With a manual
selection as active query, you can create any set of
manually selected spots. Such manual selections of spots
are saved along with the 2D Gel Type.
Figure 24-8. Spot significance query dialog box.
•With Polynomial degree it is possible to enter the
degree of the regression; to obtain a linear regression,
enter 1.
•With Robust, an iterative algorithm is applied that
assigns less weight to outlier spots, hence obtaining a
less distorted regression in case a few strongly
outlying spots occur.
•Variable standard deviation is an option that
calculates the standard deviation as a function of the
position on the regression curve and not as one single
value obtained from the whole regression.
•With P-value you can specify the significance for a
spot to be considered different. A probability is
calculated for each spot to belong to the distribution,
based upon the regression curve and its standard
deviation limits. The program will select all spots that
have a probability (p value) below the value entered in
at least one regression. Such spots can be considered
to be outliers. With Variable standard deviation
enabled, the spots identified as outliers may differ
from those found with Variable standard deviation
disabled.
•Spot field queries
24.3.13 To create a query based on a spot information
field, select Queries > New spot fields query or press
the corresponding button.
Individual queries can be assembled into composite
queries using one of the logical operators. The individual
queries should then be considered as query components,
which are together part of a composite query, combined
by a logical operator.
AND, combines two or more components. All
conditions of the combined components should be
fulfilled at the same time for a spot to be selected.
OR, combines two or more components. The
condition implied by at least one of the combined
components should be fulfilled for a spot to be selected.
NOT, operates on exactly one component.
This operator inverts the argument (and hence, the
selection) of the query component to which it applies.
XOR, combines two or more components.
Exactly one condition from the combined components
should be fulfilled for a spot to be selected.
NOTE: The buttons for the logical operators contain a
helpful Venn diagram icon that clearly explains the
function of the operator.
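In terms of sets of spots, the four operators described above behave as in the following Python sketch. The function names are hypothetical. Note that XOR as defined here ("exactly one condition fulfilled") is not the same as chaining a pairwise exclusive-or when more than two components are combined: a spot found by all three components would pass a chained exclusive-or but is rejected below.

```python
def combine_and(*components):
    """Spots selected by every component."""
    return set.intersection(*map(set, components))

def combine_or(*components):
    """Spots selected by at least one component."""
    return set.union(*map(set, components))

def combine_xor(*components):
    """Spots selected by exactly one component."""
    counts = {}
    for component in components:
        for spot in set(component):
            counts[spot] = counts.get(spot, 0) + 1
    return {spot for spot, count in counts.items() if count == 1}

def combine_not(component, all_spots):
    """Spots not selected by the component."""
    return set(all_spots) - set(component)


# Hypothetical query results, given as sets of spot IDs.
a, b, c = {1, 2, 3}, {2, 3, 4}, {3, 5}
print(combine_and(a, b, c))
print(combine_or(a, b, c))
print(combine_xor(a, b, c))
print(combine_not(a, range(1, 7)))
```

NOT needs the complete set of spots as a second argument, because inverting a selection is only meaningful relative to all spots that could have been selected.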
As an example, we will create a composite query
containing two components, combined by a logical
operator.
24.3.16 Prepare a new query by clicking the <New
intensity query> button and enter a name,
for example, “Minimal expression”. Press <OK>.
NOTE: In case a query is created with one or more
queries already present, a checkbox Derive from
"QueryName" is present. "QueryName" is the name
of the existing query that is selected when the new
query is created. When this option is checked, the new
query will be a child query of the existing one, which
means that any search conditions specified will apply to
the set of spots resulting from the parent query.
24.3.17 Select Volume, press <Add selected> to add the
selected gels, and under Min., enter 20000. Press the
<Set all> button.
24.3.18 By holding down the CTRL key and clicking
both queries, you can select them simultaneously.
24.3.19 Combine the two selected queries to a more
complex query with OR (using the corresponding
button). Enter a name, for example, Diff+High.
A new composite query appears, graphically displayed
as a new box combining the two query components with
connecting lines (Figure 24-9.).
24.3.20 To run the composite query, click on its box and
press the <Run selected queries> button (or
select Queries > Update).
A question pops up "This query depends on one or more
other queries. Do you want to automatically update
these parent queries?".
24.3.21 Answer <Yes> to update the constituent query
components as well as the resulting composite
query.
Figure 24-9. Composite spot query in the 2D gel type
window.
24.3.22 At present, one of the query components is still
the active query. To make the composite query active,
select it and press the corresponding button (or Queries >
Set as active query).
The active query is the query that will be used in the 2D
gel matching window (see below) and in all the
comparison tools of BioNumerics and GeneMaths/
GeneMaths XT.
24.3.23 Editable queries (Intensity and Field queries) can
be re-edited by selecting the menu option Queries > Edit
query (or double-clicking on the query box).
24.3.24 Queries or query components can be deleted by
selecting the component and choosing the menu option
Queries > Delete query or pressing the corresponding
button.
24.4 Listing spots in spreadsheets
As illustrated above, it is possible to extract a number of
proteins from a selection of 2D gels based upon specific
criteria using spot queries. Such sets of spots can be
retrieved and viewed in the 2D gel matching window (see
also 24.2). Some of the actions below may already have
been carried out (24.2.1 to 24.2.3).
24.4.1 Select the four Campylobacter jejuni entries in the
database Demo2D: Wild type with low Fe
concentration, Wild type with high Fe concentration,
Fur mutant with low Fe concentration, Fur mutant with
high Fe concentration. Use the <space bar> or click on
the entries while holding the SHIFT or CTRL key.
24.4.2 Copy the gels to the clipboard by selecting Edit >
Copy selection or by pressing the corresponding button.
24.4.3 In the Experiments panel, open experiment type
Fur by double clicking or pressing the corresponding
button.
This opens the 2D gel type window for experiment type
Fur (Figure 24-3.).
24.4.4 Call the 2D gel matching window by selecting File >
Create matching window or by pressing the corresponding
button.
24.4.5 Paste the gels from the clipboard in the 2D gel
matching window by selecting Edit > Paste entries from
clipboard or pressing the corresponding button.
The 2D gel matching window now displays the reference
system and four 2D gels (Figure 24-4.).
To show all gels in the normalized mode, press the
corresponding button or select Image > Show normalized.
Figure 24-10. The Query spreadsheet view of the 2D gel matching window.
24.4.6 Press the Query spreadsheet tab or select Layout >
Show query spreadsheet from the menu.
24.4.7 A table like the one in Figure 24-10. is
displayed.
Figure 24-10. shows a list of all protein spots that have
been found. For each spot the maximum value and the
volume are displayed on the four gels. These values can
be used for further analysis (see 24.6 and 24.7).
24.4.8 The layout of the spreadsheet table view can be
modified by selecting Layout > Spot table preferences or
by pressing the corresponding button.
This will open the Spot table preferences window
displayed in Figure 24-11. By default, the Spot ID and
Accession number are displayed as spot information
fields. Other fields that can be displayed include the
Description, Gene name(s), the 5 free Comment fields,
and the metrics properties (pI and MW). As spot
quantity measures, you can display the Maximum
intensity and Volume (defaults) as well as the Area and
Quantity defined by the Z metric. With a separate
checkbox, the Standard deviation can be displayed.
Standard deviations are only shown for spots that are
averaged from different combined gels (see further,
24.8).
Figure 24-11. The Spot table preferences dialog
box.
NOTE: Each individual column of the table can be made
wider or narrower by dragging the header separator lines
to the left or to the right.
24.4.9 The full information stored for each spot can be
viewed quickly by selecting the menu option Layout >
Show spot info or by pressing the
button.
24.4.10 You can double-click on a spot to edit its
information fields (24.2.26).
24.4.11 When you have chosen a specific layout of the
spreadsheet view, you can export the table to the
clipboard with File > Copy to clipboard or by pressing
the
button.
The table is exported as tab-delimited text, which
can be easily imported into other software using standard
paste functions.
24.4.12 Return to the image view by clicking the Gel
images tab or selecting Layout > Show images.
24.4.13 Press the
button to display a list of spot
histograms to the right of the gel images.
On the gel images, the selected spots from the query are
marked with a red rhomb.
24.4.14 The histograms display either the Maximum
value, Volume, Relative volume, Quantified value, or
Area of the spots selected. The type of information
displayed is determined by the Spot quantification
settings in the 2D gel type window (24.6.5), and can also
be changed from the pull-down list in the button bar of
the 2D gel matching window.
Absent proteins are indicated by small red crosses on
the histograms, showing that the spot was not
identified on that gel. Amounts that exceed the
maximum value that can be displayed are marked with
a small horizontal line on top of the bar.
24.4.15 The histograms are automatically scaled to the
highest value found for the selected quantification
parameter. Because of individual excessive values, for
example, the histograms may not fully cover the
vertical range of the graphs. In that case, you
can change the vertical scale using the Up and Down
arrow buttons to the right of the Max. val indication in
the button bar.
NOTE: The maximum value for the histograms can also
be set in the Spot quantification settings dialog box
from the 2D gel type window (24.6.5).
24.4.16 When you click on a spot histogram in the
spreadsheet or the gel image view, the histogram
becomes highlighted.
Simultaneously, the spot corresponding to the selected
histogram will be marked with a small label on the
individual gels where it occurs. The label displays either
the maximum value, volume, relative volume,
quantified value, or area of the selected spots. As noted
above (24.2.9 and 24.4.14), the type of information
displayed is determined by the Spot quantification
settings in the 2D gel type window (24.6.5), and can also
be changed from the pull-down list in the button bar of
the 2D gel matching window.
Likewise, you can assign a fixed label to each spot as
follows:
24.4.17 From the Layout > Label with menu, select the
labeling method (by default No field is selected). Each of
the spot information fields can be selected, including the
free fields.
24.4.18 With Layout > Label query members only you
can display labels for the spots present in the query
only.
An interesting tool is the possibility to add spots
manually to the active query. Likewise, it is possible to
remove spots from the active query. Although any
query can be edited manually, one should realize that an
automatic query, based upon search criteria, will lose
the information about manually added or removed
spots when it is updated. Therefore, it is recommended
to use the Manual selection query ("Spot field queries")
for that purpose. This is done as follows.
24.4.19 First create a manual selection in the 2D gel type
window (Queries tab) with Queries > New manual
selection or the
button.
The manual selection contains zero spots when created.
24.4.20 Make the manual selection active by selecting it
and pressing the
button (or Queries > Set as
active query).
24.4.21 In the 2D gel matching window, select some spots
on a gel by dragging the mouse with the Cursor tool
(
) selected. Selected spots are displayed with a red
dot.
24.4.22 Add the selected spots to the active manual
selection using the menu option Spots > Add to active
query.
Such manual selections of spots are saved along with the
2D Gel Type. A manual selection can be part of a
composite query, and when updated, the manual
selection is preserved.
24.4.23 Likewise, it is possible to remove spots from an
active query by selecting them in the 2D gel matching
window and choosing Spots > Remove from active query.
NOTE: Adding and deleting spots from the active
query works for automatic and composite queries as
well as for empty queries. Deletion or addition of spots
is saved along with the query. In case of automatic and
composite queries, however, any manual work is lost
when the query is updated.
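The difference between the two kinds of query can be sketched in a few lines. This is a toy model written for this guide, not the program's internal data structure; the class name, the criterion function and the spot quantities are all assumptions.

```python
class SpotQuery:
    """Toy model of a spot query: an optional search criterion plus manual edits."""

    def __init__(self, criterion=None):
        self.criterion = criterion  # None means a manual selection
        self.spots = set()

    def add(self, spot_id):
        self.spots.add(spot_id)

    def remove(self, spot_id):
        self.spots.discard(spot_id)

    def update(self, all_spots):
        """Re-run the query against all spots ({spot_id: quantity})."""
        if self.criterion is not None:
            # An automatic query is rebuilt from its criterion,
            # which discards any manual additions or removals.
            self.spots = {s for s, q in all_spots.items() if self.criterion(q)}
        # A manual selection has no criterion, so its contents are preserved.

spots = {"s1": 1200.0, "s2": 90.0, "s3": 640.0}  # invented quantities

automatic = SpotQuery(criterion=lambda q: q > 500)
automatic.update(spots)
automatic.add("s2")      # manual addition to an automatic query
automatic.update(spots)  # the addition is lost on update

manual = SpotQuery()
manual.add("s2")
manual.update(spots)     # the manual selection survives the update

print(sorted(automatic.spots), sorted(manual.spots))
```

In this model the automatic query ends up with s1 and s3 only, whereas the manual selection still contains s2 after the update.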
24.5 Comparing spots in scatter plots
As illustrated in the previous paragraphs, it is possible
to extract a number of proteins from a selection of 2D
gels based upon specific criteria using spot queries. Such
sets of spots can be retrieved and viewed in the 2D gel
matching window (see also 24.2). This paragraph
describes how selected spots can be compared in gel-to-gel
scatter plots between any selection of gels from the
database. The actions 24.5.1 to 24.5.5 below may already
have been carried out (24.2.1 to 24.2.3).
24.5.1 Select the four Campylobacter jejuni entries in the
database Demo2D: Wild type with low Fe
concentration, Wild type with high Fe concentration,
Fur mutant with low Fe concentration, Fur mutant with
high Fe concentration. Use the <space bar> or click on
the entries while holding the SHIFT or CTRL key.
24.5.2 Copy the gels to the clipboard by selecting Edit >
Copy selection or by pressing the corresponding toolbar
button.
24.5.3 In the Experiments panel, open experiment type
Fur by double clicking or pressing the
button.
This opens the 2D gel type window for experiment type
Fur (Figure 24-3.).
24.5.4 Call the 2D gel matching window by selecting File >
Create matching window or by pressing the corresponding
toolbar button.
24.5.5 Paste the gels from the clipboard in the 2D gel
matching window by selecting Edit > Paste entries from
clipboard or pressing the corresponding toolbar button.
The 2D gel matching window now displays the reference
system and four 2D gels (Figure 24-4.).
To show all gels in the normalized mode, press the
button or select Image > Show normalized.
24.5.6 Press the Scatter plots tab or select Layout >
Show scatter plots from the menu.
As a result, a matrix of scatter plots will be displayed
(Figure 24-12.). Each scatter plot is the comparison
between two gels selected in the 2D gel matching window.
The gel names are displayed in the row header and
column header, respectively. The values for the axes are
also indicated in the row and column headers.
If a spot is present in one gel and absent in another gel, it
will be shown as a black dot on the scatter plot between
these gels, having a zero quantity value on the gel where
it is absent. If a spot is absent in two gels, it will not be
shown on the scatter plot between these gels.
24.5.7 The values on the axes are determined by the Spot
quantification settings in the 2D gel type window (24.6.5),
and can be changed from the pull-down list in the
button bar of the window.
The scatter plots are automatically scaled to the highest
value found in any of the gels for the selected
quantification parameter.
24.5.8 Because of individual excessive values, for example,
the spots may not fully cover the range of the
graphs. In that case, you can change the scale for the
quantification parameter used by pressing the Up and
Down arrow buttons to the right of the Max. val indication
in the button bar.
The buttons available in the Scatterplot view are the same
as those described for the Gel images view (24.2). It is
possible to zoom in or out, to show histograms, to
display a spot info box for the selected spot(s), and to
launch GeneMaths/GeneMaths XT for more
sophisticated analysis (24.7).
24.5.9 When the histograms are shown (24.4.13), spots
that are part of the active query are marked by a red
rhomb surrounding the black dot; non-query member
spots are just black dots.
24.5.10 To display only the spots from the active query,
select Layout > Show query spots only.
24.5.11 If you click on a spot in one of the scatter plots, it
will be pointed to by a red arrow on all scatter plots
where the spot occurs in at least one of the two gels.
24.5.12 In the Scatter plots view, it is also possible to
select one or more spots. To select one spot, simply click
on it in one of the scatter plots. A selected spot is marked
as a red dot, while non-selected spots are marked as
black dots.
24.5.13 To select additional spots, hold down the CTRL
key while clicking on other spots.
24.5.14 To select all spots in an area, you can also hold
down the SHIFT key and drag a rectangle over the
scatter plot.
24.5.15 All spots (query and non-query spots) can be
selected at once with Spots > Select all spots, whereas
all spots of the active query can be selected with Spots >
Select all spots in query.
24.5.16 As in the Gel images view, you can add
spots to, or delete spots from, the active query with
Spots > Add to active query and Spots > Remove from
active query, respectively (24.4.22 and 24.4.23).
24.5.17 It is also possible to change the description of a
spot by double clicking on it (24.2.26) or selecting Spots
> Change description to bring up the Spot description
fields dialog box (Figure 24-5.).
24.5.18 You can also change a description field for a set
of selected spots, as explained in 24.2.28.
Figure 24-12. The Scatterplot view in the 2D gel matching window.
It is possible to perform a linear or non-linear regression
on the scatter plots. The program calculates the
regression on the spots that are visualized: if you
chose to show the spots from the active query only
(24.5.10), only these spots will be taken into account for
the regression calculation.
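To make the idea concrete, the following sketch pairs the spot quantities of two gels the way a scatter plot does (a spot absent from one gel gets a zero value there) and fits a regression of polynomial degree 1, optionally forced through zero. The robust step shown here, which refits after leaving out points further than three sigma from the line, is only a simple stand-in chosen for this example: the program's own Robust and Select outliers algorithms are not documented in this guide, and all names and numbers below are assumptions.

```python
import statistics

def scatter_points(gel_x, gel_y):
    """Pair quantities from two gels given as {spot_id: quantity}.

    A spot absent from one gel gets quantity 0 there; a spot absent from
    both gels does not appear in either dictionary and is not plotted.
    """
    ids = sorted(set(gel_x) | set(gel_y))
    return [(i, gel_x.get(i, 0.0), gel_y.get(i, 0.0)) for i in ids]

def fit_line(points, through_zero=False):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(points)
    sx = sum(x for _, x, _ in points)
    sy = sum(y for _, _, y in points)
    sxx = sum(x * x for _, x, _ in points)
    sxy = sum(x * y for _, x, y in points)
    if through_zero:
        return 0.0, sxy / sxx
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return (sy - b * sx) / n, b

def robust_fit(points, through_zero=False, n_sigma=3.0, max_iter=10):
    """Refit repeatedly, leaving out points further than n_sigma from the line."""
    kept = list(points)
    for _ in range(max_iter):
        a, b = fit_line(kept, through_zero)
        residuals = {i: y - (a + b * x) for i, x, y in points}
        # Sigma estimated from the median absolute residual (assumed choice).
        sigma = 1.4826 * statistics.median(abs(r) for r in residuals.values())
        new_kept = [p for p in points if abs(residuals[p[0]]) <= n_sigma * sigma]
        if len(new_kept) < 2 or new_kept == kept:
            break
        kept = new_kept
    outliers = [i for i, _, _ in points if abs(residuals[i]) > n_sigma * sigma]
    return a, b, outliers

# Invented quantities; spot s1 is absent from the first gel.
wild_type = {"s2": 100, "s3": 200, "s4": 300, "s5": 400, "s6": 500, "s7": 600}
mutant = {"s1": 5, "s2": 110, "s3": 190, "s4": 310, "s5": 1500, "s6": 490, "s7": 610}

points = scatter_points(wild_type, mutant)
a, b, outliers = robust_fit(points)
print(round(b, 3), outliers)
```

With these numbers the plain fit is pulled upward by spot s5, while the refit without it has a slope close to 1 and reports s5 as the only outlier.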
24.5.19 To calculate a regression on the scatter plots,
select Layout > Calculate regression lines.
This brings up the Regression dialog box for 2D gel
scatterplots, as shown in Figure 24-13.
Figure 24-13. Regression dialog box for 2D gel
scatterplots.
•With Polynomial degree it is possible to enter the
degree of the regression; to obtain a linear regression,
enter 1.
•Force through zero is an option that forces the
regression line to go through the origin of the scatter
plot.
•Monotonic is an option that will force the regression
to continuously increase in both the X and Y direction.
•With Robust, an iterative algorithm is applied that
assigns less weight to outlier spots, hence obtaining a
less distorted regression in case a few strongly
outlying spots occur.
•Variable sigma is an option that calculates the sigma
limits (standard deviation) as a function of the position
on the regression curve and not as one single value
obtained from the whole regression.
•The Select outliers function will calculate a
probability for each spot to belong to the distribution,
based upon the regression curve and its sigma limits.
The program will select all spots that have a
probability (p value) below a certain threshold which
the user can enter. Such spots can be considered to be
outliers. Note that the p-values, and hence the outliers,
are based upon the average of all scatter plots.
Therefore, spots may be identified as "outliers", and
yet seem to follow the regression closely in individual
scatterplots. With Variable sigma enabled, the spots
identified as outliers may differ from those identified
with Variable sigma disabled.
24.6 Clustering and statistical analysis of
2D gels in BioNumerics
The Active query (24.3.10) forms the basis for
comparative analysis of 2D gel spots in the BioNumerics
software. The result of a query is a table of spot
quantities collected from a number of gels. Such a table
can be visualized in the 2D gel matching window, in the
spreadsheet view (Figure 24-10.), but it can also be
treated as a character table to perform cluster analysis,
principal components analysis and all derived
techniques available in BioNumerics.
24.6.1 Select the four Campylobacter jejuni entries in the
database Demo2D: Wild type with low Fe
concentration, Wild type with high Fe concentration,
Fur mutant with low Fe concentration, Fur mutant with
high Fe concentration. Use the <space bar> or click on
the entries while holding the SHIFT or CTRL key.
Selected entries are marked with a blue arrow.
24.6.2 Create a new comparison with Comparison >
Create new comparison (9.7) or by pressing the
button in the Comparisons panel.
The 2D Gel Type Fur is the only experiment type listed in
the status bar of the Comparison window.
24.6.3 Show the character table of the spot query by
pressing the
button of Fur.
The spot intensities are now displayed as differentially
shaded gray blocks. The spot ID numbers are indicated
in the column header.
24.6.4 Select Clustering > Calculate > Cluster analysis
(similarity matrix) to calculate a dendrogram.
The Comparison settings dialog box allows you to specify
the similarity coefficient to calculate the similarity
matrix, and the clustering method. Cluster analysis of
2D gel spot query tables is identical to clustering
character-based data (see chapter 13). All the clustering
and statistical functions that apply to character data also
apply to 2D gel data.
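As a rough sketch of what such an analysis does with a spot table, the example below treats each gel as a row of spot quantities, uses the Pearson correlation as similarity coefficient and builds a tree by average linkage (UPGMA). Both choices are assumptions made for the example; in the program the coefficient and clustering method are whatever is selected in the Comparison settings dialog box, and the spot quantities here are invented.

```python
import math

def pearson(u, v):
    """Pearson correlation between two equally long lists of quantities."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sd_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    sd_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (sd_u * sd_v)

def upgma(profiles):
    """Average-linkage clustering; returns the tree as nested tuples of gel names."""
    # Each cluster is (member names, tree built so far).
    clusters = [((name,), name) for name in profiles]

    def avg_sim(c1, c2):
        pairs = [(a, b) for a in c1[0] for b in c2[0]]
        return sum(pearson(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two clusters with the highest average similarity.
        i, j = max(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: avg_sim(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = (clusters[i][0] + clusters[j][0], (clusters[i][1], clusters[j][1]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][1]

# Invented character table: one row per gel, one column per spot.
profiles = {
    "WT low Fe": [10, 50, 30, 80, 20],
    "WT high Fe": [12, 48, 33, 79, 22],
    "Fur low Fe": [60, 10, 70, 15, 55],
    "Fur high Fe": [58, 12, 69, 18, 50],
}
tree = upgma(profiles)
print(tree)
```

With this invented table the two wild type gels form one branch and the two Fur mutant gels the other.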
The spot quantities used to construct the character table
in the Comparison window are those chosen in the Spot
quantification settings in the 2D gel type window.
24.6.5 To use another quantification measure, close the
Comparison window, open the 2D gel type window for
Fur, and select Settings > Spot quantification settings
(or press
).
In the Spot quantification dialog box (Figure 24-14.) you
can choose between the Maximum value (in pixel
intensity), the Spot area (number of pixels included), the
Spot volume (sum of pixel intensities), the Relative
volume, and the Quantified value (derived from the Z
metric).
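Three of these measures follow directly from the definitions given above and can be illustrated on a tiny image. The sketch assumes the spot is described by a mask marking which pixels belong to it; the pixel values and the mask are invented, and the Relative volume and Quantified value are left out because their exact definitions are not given here.

```python
def quantify_spot(image, mask):
    """Compute maximum value, area and volume of one spot.

    image: 2D list of pixel intensities.
    mask:  2D list of booleans, True for pixels that belong to the spot.
    """
    pixels = [
        value
        for row, mask_row in zip(image, mask)
        for value, inside in zip(row, mask_row)
        if inside
    ]
    return {
        "maximum": max(pixels),  # highest pixel intensity in the spot
        "area": len(pixels),     # number of pixels included
        "volume": sum(pixels),   # sum of pixel intensities
    }

image = [
    [0, 10, 12, 0],
    [5, 40, 90, 8],
    [0, 35, 60, 0],
]
mask = [
    [False, True, True, False],
    [False, True, True, False],
    [False, True, True, False],
]
print(quantify_spot(image, mask))
```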
Figure 24-14. The spot quantification settings dialog
box.
Under Max. value, one can enter a value which is
important for the bar graph diagrams in the 2D gel
matching window and its spreadsheet view: the bar
graphs are scaled to the maximum value entered.
For example, in case Volume is specified under Spot
intensity value, the value to enter under Max. value
depends on the highest volume found in the gels that
are being compared. In case Maximum value is specified
under Spot intensity value, the value to enter depends
on the OD range of the image, which can for example be
8 bit (255), 12 bit (4095) or 16 bit (65535).
NOTE: The Spot intensity value as well as the Max.
value can also be chosen directly from the 2D gel
matching window (Gel images and Scatter plot tabs).
Under Conversion to binary, you can specify what
threshold to use when applying a binary coefficient to
the analysis of 2D gel character tables. You can specify a
minimum Absolute value or a percentage of the Mean
intensity.
In order for the spot quantification settings to become
effective, you will need to reopen the Comparison
window.
To add some more flexibility to the data set, 2D gel spot
tables can also be analyzed as a Composite Data Set. The
advantage of analyzing 2D spot tables as Composite Data
Sets is that both the columns and the rows can be
clustered (transversal clustering or two-way clustering) to
get a better understanding of the relation between
experiments and characters. See 10.1 and 15.2 to use a
2D Gel Type in a Composite Data Set.
24.7 Analyzing 2D gel spot tables with
GeneMaths or GeneMaths XT
GeneMaths and GeneMaths XT offer some more
advanced statistical tools for the analysis of large data
sets, and are particularly suited for the analysis of
microarrays and gene chips. Since the data generated
by 2D gels and microarrays, and the purpose of both, are
quite similar, GeneMaths and GeneMaths XT are also very
useful for the analysis of 2D gel spot tables comprising
various experiments. Since GeneMaths and GeneMaths
XT are integrated with the BioNumerics software,
analysis of 2D gel spot tables in GeneMaths/GeneMaths
XT is very straightforward.
The Active query (24.3.10) forms the basis for
comparative analysis of 2D gel spots in GeneMaths/
GeneMaths XT. The result of a query is a table of spot
quantities over a number of gels. Such a table can be
visualized in the 2D gel matching window, in the
spreadsheet view (Figure 24-10.), but it can also be
directly imported as a character table to perform cluster
analysis, principal components analysis and all derived
techniques available in GeneMaths/GeneMaths XT.
24.7.1 Open a spreadsheet with a large query as
explained in 24.4.
24.7.2 Both from the spreadsheet and from the gel image
view, you can select File > Statistical analysis or press
the
button.
The GeneMaths analysis window will open with the
protein spots as rows and the experiments (gels) as
columns (Figure 24-15.). All the information fields for
the spots are displayed, whereas for the experiments,
you can choose between the key or one of the
information fields defined for the entries in the
BioNumerics database.
Designed for the exploration and analysis of large
datasets such as microarrays, the GeneMaths and
GeneMaths XT software packages are the ideal tools for
comparative analysis of sets of 2D protein gels as well.
In addition, through their integration with BioNumerics,
they can be used to compare microarray data with 2D
protein gel data. The following main functions can be
applied to 2D protein gels:
•Standardization of data matrix using offset and
scaling functions (arithmetic and Median averages,
Root Mean Square, Standard Deviation);
•Transformation of data matrix (flipping);
•Transversal cluster analysis of rows and columns
using a variety of similarity/distance coefficients, and
several pair-group clustering methods, Ward, and
Neighbor Joining;
•Cluster significance indication based on bootstrap
techniques;
•Special dendrogram layout and visualization tools to
facilitate the interpretation of analyses of extreme
sizes;
•Pattern matching: search for closest matches with
specific profiles, average profiles or theoretical
profiles;
•Single, composite or average profile curve or bar
graph plotting with indication of standard deviations;
•X-Y plots, time course plots, and scatterplots;
•2-D and 3-D Principal Component Analysis and
Discriminant Analysis with or without variance;
•Self-Organizing maps (Kohonen maps).
•Etc.
Figure 24-15. Analysis of 2D gel protein spots over different experiments in GeneMaths.
24.8 Editing reference systems
For a 2D gel experiment type as well as for 1-D
Fingerprint Types, it is possible to create more than one
reference system. This possibility is useful to combine
gels with different properties in the same experiment
type. For example, it is possible to run different 2D gels
for the same sample, each having their own pH ranges,
and merge such gels with different pH ranges into
multiple gels, spanning the full pH range.
A new reference system can be created in the 2D gel file
window (23.11) in the normalization step. Adding
reference spots to the reference system can also be done
during the normalization step. However, for complete
editing functionality of the reference system, the 2D gel
type window should be used (24.2.3).
24.8.1 In database Demo2D, open Fur in the
Experiments panel.
The 2D gel type window (Figure 24-3.) contains two tabs:
Reference systems and Queries.
24.8.2 Select the Reference systems tab to display the
reference systems present for the 2D Gel Type Fur.
Normally only Fur should be listed as a reference
system.
As soon as more than one reference system exists within
a 2D Gel Type, it should also be possible to select an
active reference system, i.e. the reference system used for
comparisons in the 2D gel matching window.
24.8.3 You can change the active reference system in the
2D gel type window by selecting the Reference systems
tab and choosing Refsystem > Set as active reference
system.
24.8.4 To reset a reference system so that all spots are
removed from it, you can select Refsystem > Remove all
spots.
24.8.5 To update a reference system according to recent
editing work done, select Refsystem > Refresh spots.
24.8.6 To delete a reference system, select Refsystem >
Delete.
NOTE: the active reference system cannot be deleted.
Individual spots can also be added or deleted from a
reference system in the 2D gel matching window. To that
end, we will need some additional gels to be added to
the 2D gel matching window.
24.8.7 Select the four Campylobacter jejuni entries in the
database Demo2D: Wild type with low Fe
concentration, Wild type with high Fe concentration,
Fur mutant with low Fe concentration, Fur mutant with
high Fe concentration. Use the <space bar> or click on
the entries while holding the SHIFT or CTRL key.
24.8.8 Copy the gels to the clipboard by selecting Edit >
Copy selection or by pressing the corresponding toolbar
button.
24.8.9 In the 2D gel type window, call the 2D gel matching
window by selecting File > Create matching window or
by pressing the corresponding toolbar button.
24.8.10 Paste the gels from the clipboard in the 2D gel
matching window by selecting Edit > Paste entries from
clipboard or pressing the corresponding toolbar button.
The 2D gel matching window now displays the reference
system and four 2D gels (Figure 24-4.).
24.8.11 To show all gels in the normalized mode, press
the
button or select Image > Show normalized.
24.8.12 Select one or more spots on any gel in the
window by dragging the mouse pointer over the spot(s)
in cursor tool mode (
).
24.8.13 Select Refsystem > Add selected spot(s). The
program asks to confirm to add the selected spots to the
reference system.
If you answer <Yes>, the selected spots become
reference spots in the reference system.
Likewise, you can select spots on the reference system
and delete them with Refsystem > Delete spot(s). A
confirmation is requested.
24.9 Creating synthetic gels
The main purposes of synthetic gels are: (1) averaging
repeats of the same experiment to obtain higher
accuracy; (2) combining images of the same gel with
different exposure times to reveal very weak spots as
well as very dark spots (e.g. when autoradiography is
used); and (3) combining gels of the same sample with
different pH ranges.
Although there is no suitable example in the Demo2D
database, we can use the available gels to explain this
function.
24.9.1 Select the two Campylobacter jejuni entries in the
database Demo2D: Wild type with low Fe concentration
and Wild type with high Fe concentration. Use the
<space bar> or click on the entries while holding the
SHIFT or CTRL key.
24.9.2 Copy the gels to the clipboard by selecting Edit >
Copy selection or by pressing the corresponding toolbar
button.
24.9.3 In the 2D gel type window, call the 2D gel matching
window by selecting File > Create matching window or
by pressing the corresponding toolbar button.
24.9.4 Paste the two gels from the clipboard in the 2D gel
matching window by selecting Edit > Paste entries from
clipboard or pressing the corresponding toolbar button.
The 2D gel matching window now displays the reference
system and two 2D gels (Figure 24-4.).
24.9.5 Show all gels in the normalized mode by pressing
the
button or selecting Image > Show normalized.
24.9.6 Select File > Create synthetic gel.
This will open the Create synthetic gel window as
displayed in Figure 24-16. The panel on the left shows
the shifts towards the averaged spots.
24.9.7 Under Name of the merged gel, type a name, for
example Wt.
In case you are averaging gels with different exposures
to obtain a higher dynamic range for the spots, you can
then define for each gel the range of spot quantities (Z
metric as defined in the Metrics step) to be included in
the synthetic gel.
24.9.8 To define a range for a gel, select the gel under
Name, and press <Change valid range>.
24.9.9 Enter a minimum and/or a maximum spot
quantity.
24.9.10 With <Only spots in active query> checked, the
merged gel will only contain the spots of the current
active query.
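The effect of these settings can be sketched as follows. This is a simplified model under stated assumptions, not the program's merging algorithm: spots are taken to be already matched by ID, quantities within a gel's valid range are averaged, and a standard deviation is only reported when a spot is averaged from more than one gel. Function and variable names, as well as the numbers, are invented.

```python
import statistics

def merge_gels(gels, valid_ranges=None, query=None):
    """Average matched spot quantities over several gels.

    gels:         {gel_name: {spot_id: quantity}}
    valid_ranges: {gel_name: (minimum, maximum)}, None meaning no limit
    query:        set of spot IDs to include, or None for all spots
    Returns {spot_id: (mean, standard deviation or None)}.
    """
    valid_ranges = valid_ranges or {}
    spot_ids = set().union(*(spots.keys() for spots in gels.values()))
    merged = {}
    for spot_id in sorted(spot_ids):
        if query is not None and spot_id not in query:
            continue  # only spots of the active query are merged
        values = []
        for name, spots in gels.items():
            if spot_id not in spots:
                continue
            low, high = valid_ranges.get(name, (None, None))
            quantity = spots[spot_id]
            if (low is None or quantity >= low) and (high is None or quantity <= high):
                values.append(quantity)
        if values:
            sd = statistics.stdev(values) if len(values) > 1 else None
            merged[spot_id] = (statistics.mean(values), sd)
    return merged

# Invented example: spot b is saturated on the long exposure.
gels = {
    "short exposure": {"a": 100.0, "b": 900.0},
    "long exposure": {"a": 120.0, "b": 1000.0, "c": 40.0},
}
merged = merge_gels(gels, valid_ranges={"long exposure": (None, 950.0)})
print(merged)
```

Here spot a is averaged from both gels, spot b is taken from the short exposure only because its value on the long exposure lies outside the valid range, and the weak spot c is contributed by the long exposure alone.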
Figure 24-16. The Create synthetic gel window.
25. Database exchange tools
BioNumerics offers a simple and powerful solution to
exchange database information between research sites
on a peer-to-peer basis. For a selection of database
entries, selected information (e.g., experiment types,
information fields) can be bundled into a Bundle. Such a
bundle is a compact data package contained in one file,
which can be sent to other research sites over the
Internet, as an e-mail attachment or via FTP. The receiver
can open the bundle directly in BioNumerics and
compare the entries contained in it with their own
database.
Besides the numerical information of the experiments, a
bundle contains all the information of the experiment
type, so that BioNumerics can check whether the
experiment types contained in the bundle are
compatible with those of the receiver's database. If an
experiment type in a bundle is not compatible, this
experiment type will be automatically created in the
receiver's database. If the bundle contains a database
information field which is not defined for the database,
this information field will be added to the database.
In case of Fingerprint Types, the bundle holds the
complete information about the reference system used
and the molecular weight regression, so that
BioNumerics can automatically remap the bundle
fingerprints to be compatible with the database
fingerprints.
25.1 Creating a new bundle
25.1.1 In the Main window with DemoBase loaded, select
all entries belonging to Vercingetorix.
25.1.2 Select File > Create new bundle or press the
corresponding toolbar button.
The Create new bundle dialog box (Figure 25-1.) lists the
available database information fields in the left panel
and all available experiment types in the right panel.
You can check each of the database information fields
and experiment types to be incorporated in the bundle.
For Fingerprint Types, the fingerprint images, band
information, and densitometric curves can be
incorporated separately.
Figure 25-1. The Create new bundle dialog box.
25.1.3 Leave all checkboxes checked.
NOTE: With the checkbox GelCompar format, one
can save bundles in the format of GelCompar versions
4.1 and 4.2 and Molecular Analyst Fingerprinting
versions 1.12 through 1.60. Only Fingerprint
information can be saved in this format. BioNumerics
also recognizes and reads GelCompar and Molecular
Analyst Fingerprinting bundles.
25.1.4 Enter a name for the bundle, for example
Vercingetorix, and press <OK> to create the bundle.
A bundle file Vercingetorix.BDL is created in the
BUNDLES directory of DemoBase (see Figure 1-3. for
the directory structure).
25.1.5 As an example for database exchange, copy
Vercingetorix.BDL to the BUNDLES directory of
database Example.
25.2 Opening an existing bundle
25.2.1 Close BioNumerics and restart the main program
under database Example.
25.2.2 In database Example, select File > Open bundle or
press the corresponding toolbar button.
In the Open/close bundles dialog box (Figure 25-2.), you can
browse to the local or network path where the bundle
files can be found with the <Change folder> button. The
default path is the Bundles subdirectory of the current
database. In the right panel, you can select a bundle in
the list of available bundles in the specified path.
25.2.3 Select Vercingetorix and press the <Information>
button.
Figure 25-2. The Open/close bundles dialog box.
This opens the Bundle information dialog box for the
selected bundle (Figure 25-3.). It shows the available
information fields in the bundle, as well as the
experiment types contained in it. If an information field
or an experiment type is recognized as one of the fields
or experiment types in the database, a green dot is
shown to the left of it. If not, a red dot is shown instead.
As soon as the bundle is opened, the missing
information fields and experiment types are
automatically added to the database.
Figure 25-3. The Bundle information dialog box.
For example, in the Example database, we have created
an information field Strain no. This clearly corresponds
to the information field Strain number in the bundle,
but since the names are different, BioNumerics would
add a new information field to the database. To avoid
this, you can rename the information fields in the
bundle.
25.2.4 Select Strain number and press the <Rename>
button under the information fields panel.
25.2.5 Enter Strain no and press <OK>.
The information field Strain no now has a green dot to
its left, indicating that it corresponds to the information
field in the database.
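The matching rule behind the green and red dots is a plain comparison of names, which the following sketch imitates. It is an illustration only: the function name and the field lists are assumptions, and the way the program compares names internally (for example with respect to case) is not specified here.

```python
def match_fields(bundle_fields, database_fields, renames=None):
    """Report which bundle fields are recognized in the database.

    renames maps a bundle field name to the name it should get before
    matching, as done with the <Rename> button.
    Returns {field name after renaming: "green" or "red"}.
    """
    renames = renames or {}
    report = {}
    for field in bundle_fields:
        name = renames.get(field, field)
        report[name] = "green" if name in database_fields else "red"
    return report

# Invented field lists for illustration.
bundle_fields = ["Genus", "Species", "Strain number"]
database_fields = {"Genus", "Species", "Strain no"}

before = match_fields(bundle_fields, database_fields)
after = match_fields(bundle_fields, database_fields,
                     renames={"Strain number": "Strain no"})
print(before)
print(after)
```

Before renaming, Strain number is flagged red and would be added as a new field; after renaming, all three fields match existing database fields.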
A similar problem can happen for the experiment types:
another user may have given a different name to the
same technique, and this would cause BioNumerics to
consider the techniques as different experiment types. If
you know a technique in a bundle is the same as one of
the experiment types defined in the database, you can
also rename it using the <Rename> button under the
experiment types panel.
In addition, in Character Types, the characters may have
been given different names by other users. For example,
institution 1 may have named a character "Alpha-Glucosidase", and institution 2 "a-Glucosidase".
Obviously, BioNumerics will consider these different
names as different tests. To avoid this, you can select the
Character Type and press the <Details> button. A list of
all characters in the experiment type is shown, and those
corresponding to characters in the database's
experiment type are marked with a green dot; the
characters not recognized in the database's experiment
type are marked with a red dot. You can rename such
characters with the <Rename> button.
25.2.6 <Exit> the Bundle information dialog box, and press
<Open> to load the bundle into the database.
If a bundle is loaded, it is marked with a square dot in
the Open/close bundle dialog box.
In the database, entries from a bundle are recognizable
by the bluish color of their entry keys. For all functions,
they behave like normal database entries. If you exit
BioNumerics, they are not automatically loaded when
you run the software again. If you know a saved
comparison contains bundle entries, you should load
the bundles before opening the comparison, in order to
avoid an error message.
25.2.7 You can select all the entries from an opened
bundle by pressing the <Select entries> button in the
Open/close bundle dialog box.
25.2.8 To close a loaded bundle, select it in the list and
press the <Close> button.
25.2.9 Press <Exit> to close the Open/close bundle dialog
box.
NOTE: If you want a bundle to be always opened with
the database when BioNumerics is started up, you
should rename it to contain the prefix @_ before its
name.
26. The BioNumerics client functions
26.1 Introduction
The enormous potential offered by the internet is taken
advantage of in the BioNumerics software with its
integrated Database Sharing Tools. The key component
behind the exchange of data is the so-called Bundle. A
bundle is a packed file that contains a selection of
BioNumerics or GelCompar II database entries (e.g.
bacterial strains), along with one or more experiments
and information fields, as specified by the user. The
creation and use of bundles is described in chapter 25.
BioNumerics and GelCompar II users having the
Database Sharing Tools can exchange information at a
peer-to-peer level as described in chapter 25. However,
in major collaborative research projects, the generation
of one centrally coordinated database will be preferred
over uncontrolled data flow between research sites. It is
here that the BioNumerics Client-Server solution comes
in, based upon a powerful Server suite on the one hand
and the Client tools on the other hand.
Each BioNumerics and GelCompar II software package
that contains the Database Sharing Tools comes as a
client version, which possesses all functionality to
connect and communicate with a BioNumerics Server.
The BioNumerics server is basically a Windows NT
service, which runs on a dedicated computer available
to other computers via the Internet. The server program runs
in the background, below the user level, while processing
tasks and queries from client users. The server provides
two fundamentally different solutions for sharing its
databases and offering its services: a direct TCP/IP based
connection and a web-based communication (see Figure
26-1.).
Solution 1: Direct TCP/IP connection. This solution
provides a direct connection between the server and the
client which allows uploading and downloading
complete database entries, interactive querying, and
batch identification by the client (see Figure 26-1., upper
part). The key components for this direct and powerful
client-server communication are the bundle concept and
TCP/IP internet communication. The server, which
contains the BioNumerics reference database(s), runs
BNSERVER, a Windows NT service which
automatically performs identifications, answers queries,
and maintains log files. The communication happens
through TCP/IP, and bundles are the database
information packages transferred. The client is a
BioNumerics or GelCompar II user possessing the
Database Sharing Tools module. Once a client has
obtained permission from the server to query a
database, e.g. by entering a correct login and password,
it can connect to the server database and perform one or
more of the following functions, according to the
permission settings specified by the server:
•Perform interactive searches on the server database
and download the results;
•Send data to an Incoming directory of the server;
•Request the server to perform remote identifications
and receive back detailed reports with the possibility to
request full information and download.
Solution 2: Web-based communication using HTML.
This solution requires Microsoft’s Internet Information
Server (IIS) at the server site and an additional
BioNumerics tool BNWWW, which is an ISAPI
extension translating HTML requests into commands
for BNSERVER (see Figure 26-1., lower part).
The BioNumerics Webserver can be a separate computer,
for example, placed outside the firewall. In this case, the
server becomes an HTTP-server and communication
happens through HTML. The client only needs a web
browser to submit fingerprints, profiles or sequences to
the server; the result is returned as an HTML page
documented with identification reports, fingerprint
images, and links for requesting detailed information
about particular closest matches. If allowed by the
server, the client can download bundles from its web
browser.
It will only be possible to execute the features described
in this chapter when you have a valid client login on a
BioNumerics server.
26.2 Connecting to the server database
26.2.1 In the BioNumerics Main window, select
Database > Connect to server or press the
corresponding button.
The Server connection dialog box appears (Figure 26-2.),
prompting for the Server computer name or IP address
and a Port number.
26.2.2 Under Server computer name or IP address, you
can either fill in the full computer name of the server
computer (computer name + domain name, separated
by a period) or its IP address. When working under the
same domain name as client, you do not need to enter
the domain name.
Leave the default port number unless a different port
number was specified on the server.
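Before filling in the dialog box, it can be useful to check that the server computer can be reached on the given port. The following is a minimal sketch in Python, assuming that opening a plain TCP connection is an adequate reachability test. The function name is hypothetical, and a successful connection says nothing about whether your login will be accepted.

```python
import socket


def server_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, time-outs and unresolvable names.
        return False

# Usage (hypothetical values, use the name or IP address and the port
# number given by your server administrator):
#     server_reachable("bnserver.example.org", 12345)
```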
Figure 26-1. Schematic representation of the two approaches for Internet exchange used by the BioNumerics
Server: direct TCP/IP based and HTML web-based.
Figure 26-2. The Server connection dialog box.
26.2.3 Press <OK> to establish the connection.
Next, the Database connection dialog box (Figure 26-3.)
prompts you for a Login name and a Password.
26.2.4 Enter your Login name and, if required, your
Password.
26.2.5 Press <OK> to establish the connection with the
database.
If the connection is successfully established, the Server
login window appears (Figure 26-4.), showing the database
components of the server database in the Configuration
tab. The database components seen in this window are
only the components which are made accessible for your
login by the server. If a component is underlined in
green, it means that this component is also present in
your own client database. If it is not underlined and is
shown in gray, the component could not be found in your
client database.
Each server database component contains a checkbox,
which is checked by default. The Fingerprint Types
contain three checkboxes, corresponding to the images,
bands and curves, respectively.
26.2.6 You can uncheck a certain component if you do
not wish to download that component.
For example, if you do not wish to download the
gelstrip images for the Fingerprint Types RFLP1 and
RFLP2 in the demobase (Figure 26-4.), you should
uncheck the first checkbox of RFLP1 and RFLP2.
26.3 Searching and downloading entries
on the server database
Figure 26-3. The Database connection dialog box.
Figure 26-4. The Server login window.
If permitted by the server for the login, the client can
perform a search on the server database, using the
database fields as search criterion. The Configuration
panel in the Server login window displays the total
number of entries in the server database and between
brackets the number of entries selected by the client (see
Figure 26-4.).
26.3.1 Select Database > Search or press the
corresponding button.
The Server database search dialog box appears (Figure 26-5.).
On the left, you can select a Database field, and on the
right you can fill in a Search string. Wildcards can be used to
search for substrings: an asterisk * replaces any range of
characters in the beginning or the end of a string,
whereas a question mark ? replaces one single
character.
Normally, successive searches are additive: new
searches are added to the selection list. The Search in list
checkbox allows you to refine the search within a list of
selected entries, whereas Replace list replaces the
existing selection list with the new one.
With Negative search, all entries that do not match the
specified criteria will be selected.
Case sensitive lets the program make a distinction
between uppercase and lowercase.
Figure 26-5. The Server database search dialog box.
26.3.2 Press <OK> to perform a search on the server
database. When entries are selected, the number is
shown as "selected" between brackets.
26.3.3 The selected entries can be downloaded from the
server with Database > Download selection list or
the corresponding button.
NOTE: Downloading is only possible when permitted
by the server for the login.
26.3.4 A time-consuming data transfer can be stopped at
any time by pressing the corresponding button.
When downloading is complete, the downloaded server
entries are added to the client database and shown in
blue-green color. In addition, they are selected (blue
arrows) and a Comparison window with these entries is
automatically created. Note that only the components
that were checked in the Configuration tab are
downloaded (see 26.2.6).
Thus entries from the server are downloaded in a
temporary bundle; after closing the BioNumerics session,
the entries will not reappear. To save them, you should
either save them in a bundle during the BioNumerics
session (see 25.1) or rename the temporary bundle in the
temp subdirectory of your client database, also before
closing BioNumerics. After a default installation, this
directory is found as
C:\Program files\BioNumerics\data\[DatabaseName]\temp.
The downloaded bundle has the following name format:
YYYY-MM-DD__HHhMMmSSs__XXXX.BDL
You can copy this bundle to the bundles subdirectory of
the same database, and rename it as necessary.
26.3.5 If you want a bundle to be always opened with
the database when BioNumerics is started up, you
should rename it to contain the prefix @_ before its
name.
26.4 Uploading bundles and gels to the
server
•Uploading a bundle
Before you can send a bundle, it must be created in the
client database. To create a bundle, see 25.1. Bundles and
gel files can only be uploaded when allowed for this
login by the server.
26.4.1 Select Database > Send bundle to server or press
the corresponding button.
The default directory shown is the bundles subdirectory
of the client database. If you wish to select a bundle from
elsewhere, you can do so.
26.4.2 Select a bundle from the file dialog box.
•Uploading a gel
Uploading a gel to the server is not done from the Server
login window, although a connection to the server must
be established first as described in 26.2, which
automatically pops up the Server login window.
26.4.3 In the BioNumerics Main window, select the
appropriate Fingerprint Type in the Experiments panel.
26.4.4 In the Files panel, double-click on the gel you
want to upload, to open the gel file.
26.4.5 Select File > Upload file to server from the menu.
A Fingerprint file upload dialog box (Figure 26-6.) prompts
you to enter some comments (optional). In this comment
editor you can write or paste some text.
Furthermore, the option is offered to send the following
files:
Fingerprint lanes: densitometric curves, normalization
info, gelstrips, bands, and quantification, all contained
in the .DEF file.
Image file: the raw image or curve file, usually a .TIF file.
Database fields: the definition of the database entries for
the fingerprint lanes, contained in the .DBS file.
Figure 26-6. Fingerprint file upload dialog box.
26.4.6 Press <OK> to upload the fingerprint file.
26.5 Performing identifications on the
server
When allowed for your login by the server, you can
perform batch identifications on the server database.
26.5.1 In the client database, select one entry, or create a
selection of entries which you wish to identify. Selected
entries are marked with a blue arrow.
26.5.2 In the Server login window, select Database >
Identify selection list, or simply press the
corresponding button.
The Server database identification dialog box (Figure 26-7.)
allows you to select the Experiment which you want to
use for identification.
It further allows you to set a Number of matches, i.e. the
number of best matching server database entries to be
displayed, ranked according to similarity. If you enter 10
(default), the 10 best matching server database entries will
be shown for each unknown entry you have identified.
The Method includes the choices Use complete database
and Use selection list only. In the latter case, any
selection you have made in the server database (see
paragraph 26.3) will be used to identify your unknown
entries.
Figure 26-7. Server database identification dialog
box.
26.5.3 Press <OK> to perform the identification.
When the identification is complete, the Identification tab
of the Server login window pops up (Figure 26-8.). This
window shows the experiment used, the similarity
coefficient used, and a list of all unknown entries
(represented by their keys) with their closest server
database matches (indented). Of the server database
entries, the scores, keys and database fields are shown.
To the left of each unknown entry and each server entry, an
orange arrow is shown (Figure 26-8.).
26.5.4 If you click on the orange arrow to the left of a
particular matching server database entry, it is
downloaded in a temporary bundle (see 26.3.4) to
your client database (shown in the database in
blue-green), and its Entry edit window is popped up.
26.5.5 If you click on the orange arrow to the left of an
unknown entry, all the best matching server database
entries are downloaded in a temporary bundle (see
26.3.4) to your client database (shown in the database in
blue-green); they are selected (blue arrows) and a
Comparison window is popped up with the downloaded
entries.
Figure 26-8. The Identification tab of the Server login window after identification.
NOTE: After closing BioNumerics, the downloaded
entries will not reappear since they were saved in a
temporary bundle in the temp subdirectory. To preserve
these entries, see 26.3.4.
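The preservation step mentioned in the note amounts to copying the temporary bundle from the temp subdirectory to the bundles subdirectory before BioNumerics is closed. The following is a minimal sketch in Python of that file copy, assuming the name format YYYY-MM-DD__HHhMMmSSs__XXXX.BDL given in 26.3. The function name is hypothetical, and because the meaning of the XXXX part is not specified here, the pattern accepts any characters in that position.

```python
import re
import shutil
from pathlib import Path

# Name format quoted in the manual: YYYY-MM-DD__HHhMMmSSs__XXXX.BDL
# (the XXXX part is matched loosely, which is an assumption of this sketch).
TEMP_BUNDLE_NAME = re.compile(
    r"^\d{4}-\d{2}-\d{2}__\d{2}h\d{2}m\d{2}s__.+\.BDL$", re.IGNORECASE
)


def preserve_temp_bundles(database_dir):
    """Copy temporary bundles from <database_dir>/temp to <database_dir>/bundles.

    Returns the names of the files that were copied. Files that already
    exist in the bundles subdirectory are not overwritten.
    """
    temp_dir = Path(database_dir) / "temp"
    bundles_dir = Path(database_dir) / "bundles"
    bundles_dir.mkdir(exist_ok=True)
    copied = []
    for source in sorted(temp_dir.glob("*")):
        if not (source.is_file() and TEMP_BUNDLE_NAME.match(source.name)):
            continue
        target = bundles_dir / source.name
        if not target.exists():
            shutil.copy2(source, target)
            copied.append(source.name)
    return copied
```

The database directory passed to the function would be the directory of the client database, for instance the [DatabaseName] directory of a default installation.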
By default, the identifications performed on the server
are done using the server's Comparison Settings (i.e. the
similarity coefficient used, the position and optimization
settings applied). However, it is possible to perform
identifications using the comparison settings specified
in your local database.
26.5.6 To upload your own local comparison settings for
the selected experiment in the Server login window,
select Database > Upload comparison settings in the
Server login window prior to performing the
identification.
26.5.7 To close a connection with the server, simply exit
the Server login window.
NOTE: Depending on the time-out setting specified on
the server, it is possible that a connection is closed after
a certain time of inactivity from the client side. Use the
menu File > Open database or press the corresponding
button to establish a new connection.
27. Import of data from external databases
The databases in BioNumerics are stored in a binary
format and can only be exported via the available export
functions rather than by reading or copying database
files directly. Within BioNumerics, an additional
BioNumerics database can be opened quite easily using
the menu command File > Open additional database in
the BioNumerics Main window. When the additional
database contains experiments and/or database fields
that are not available in your own database, BioNumerics
will automatically create these components in your own
database in order to be able to display them.
Non-BioNumerics databases can only be accessed in
BioNumerics through import functions (e.g. scripts) or a
common exchange language. Therefore, BioNumerics
allows one to establish a link with an external relational
database using the Open Database Connectivity (ODBC)
protocol. This protocol is supported by almost any
commercial relational database: Access, Excel, FoxPro,
Dbase, Oracle, SQL server, etc… By establishing such a
link between BioNumerics and an external data source,
the user can import data in a completely transparent
way into BioNumerics. Moreover, the BioNumerics
database can be brought up to date using the external
data source by performing automatic downloads.
The database records in the external database are
mapped into BioNumerics entries by making use of the
database key. The user should specify a field of the
external database that corresponds to this key, and then
the software is able to automatically determine which
external record corresponds to which local BioNumerics
database entry.
27.1 Setting up the ODBC link
Use the menu item Database > ODBC link > Configure
external database link in the Main window to call the
ODBC configuration dialog box. This dialog box
contains two information fields, which are to be filled in:
1. The ODBC data source. This field is to be filled in
with a string that defines the external database that will
be linked using ODBC. If you are familiar with ODBC,
you can specify a string manually. Alternatively, you
can press the button <Select>. This action pops up the
standard Windows dialog box that allows one to select
an ODBC data source. In this dialog box, double click on
the name of the appropriate available database software
and select the database file on the hard disk.
2. The Database table or query. In this field, you should
fill in the name of the table or query in the external
database from which you want to import data. If
you are importing data from a spreadsheet program
(e.g. Microsoft Excel), you should first create a “table” in
the spreadsheet. This can be done by selecting the range of
cells that you want to export and assigning a name to this
selection (read the documentation of the spreadsheet
software on how to export data using an ODBC link).
Pressing <OK> in this dialog creates the ODBC database
import window. This window allows the user to specify
how each field in the external database should be
mapped to a particular field in the BioNumerics
database. On the left side, the BioNumerics fields are
listed, while on the right side the external database
fields are listed. Initially, all fields are unlinked. You can
link two fields by selecting the local BioNumerics field
from the left column and the external field from the
right column, and pressing the <Link> button. At this
time, both fields are displayed at the same height, and a
green arrow indicates the established link. You can
remove any existing link by selecting it and pressing
<Unlink>.
Before you can perform any exchange action,
you should make sure that the BioNumerics “Key” field,
which corresponds to the local database keys, is linked
to a field from the external database. This link is
obligatory, because the software needs to know which
record in the external database corresponds to which
entry in BioNumerics.
If the necessary links are established between external
and local database fields, press <OK> to validate the
ODBC link configuration. At this moment, BioNumerics
is ready to download information from the external data
source.
27.2 Import of database fields using
ODBC
•Update all BioNumerics database entries from the
external data source
It is possible to automatically update all the information
fields from each BioNumerics entry, using the data
provided by the external database. To this end, select
Database > ODBC link > Copy from external database
in the Main window. After confirmation, the software
downloads, for each entry, all the database fields that
have been linked to the external data source. If the
external data source contains records that do not have a
corresponding entry in the BioNumerics software, the
program automatically creates new entries in the
BioNumerics database (after confirmation by the user).
In this way, information fields of existing BioNumerics
entries are updated and new entries are automatically added.
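The logic of this update can be illustrated with a short sketch in Python. It is not the BioNumerics implementation: it only assumes that records are matched to entries through the linked Key field, that only linked fields are copied, and that unmatched records give rise to new entries. All names (copy_from_external, the field names, the create_missing flag standing in for the user's confirmation) are hypothetical.

```python
def copy_from_external(local_entries, external_records, links, create_missing=True):
    """Update linked fields of local entries from external records.

    local_entries:    {entry_key: {local_field: value}}, modified in place
    external_records: list of {external_field: value} rows
    links:            {local_field: external_field}; must contain "Key"
    Returns (updated_keys, created_keys).
    """
    if not links.get("Key"):
        raise ValueError("the Key field must be linked to an external field")
    key_field = links["Key"]
    updated, created = [], []
    for record in external_records:
        key = record[key_field]
        if key in local_entries:
            updated.append(key)
        elif create_missing:
            # No corresponding local entry: create one.
            local_entries[key] = {}
            created.append(key)
        else:
            continue
        # Copy only the fields that have been linked.
        for local_field, external_field in links.items():
            if local_field != "Key":
                local_entries[key][local_field] = record.get(external_field)
    return updated, created
```

With links = {"Key": "STRAIN_ID", "Genus": "GENUS"}, a record whose STRAIN_ID matches an existing entry key overwrites that entry's Genus field, and a record with an unknown STRAIN_ID produces a new entry.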
•Download a database field from the external data
source
It is possible to temporarily download an extra database
field from the external data source, into an empty
database field of BioNumerics. To this end, select the
empty database field in the Main window (or create a
new one), and use the menu command Database >
ODBC link > Download field from external database. A
dialog box pops up, showing all the fields present in the
external database. Select the appropriate field and press
<OK> to download the information in the local field.
Note that the downloaded information is only held
temporarily and not stored on disk. The next time you
re-open the same database in BioNumerics, the field will
be again in its initial state.
•Selection of a list using a query in the external
data source
The software allows you to perform a query in the
external database, and to visualise the result as a
selection list in BioNumerics. In the Main window, use
the command Database > ODBC link > Select list from
external database. In the dialog box, you can specify a
table that should be used to search in (alternatively, you
can specify the name of a pre-defined query that is
present in the external database). In the next field, you
can write an SQL WHERE clause that should be used to
build the selection. A complete description of the
possible variants is beyond the scope of the manual, and
can be found in books on the SQL language. Some
possibilities are: "GENUS='Ambiorix'" or "GENUS
LIKE 'Amb%'". The WHERE clause is applied to the
records of the external database, and the resulting
selection is visualised as a selection of the corresponding
entries in the BioNumerics database (provided that they
are present in the local database).
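The effect of such WHERE clauses can be tried out on a small stand-in table. The following sketch uses Python's built-in sqlite3 module in place of the external ODBC data source. The table name, the STRAIN_ID column and the sample rows are hypothetical, and the clause is pasted into the statement only because this is a trusted demonstration. Note that the case sensitivity of LIKE differs between database engines.

```python
import sqlite3


def select_keys(rows, where_clause):
    """Return the keys of the records that satisfy an SQL WHERE clause.

    rows is a list of (strain_id, genus) tuples loaded into an in-memory
    table that stands in for the external database.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE strains (STRAIN_ID TEXT, GENUS TEXT)")
    conn.executemany("INSERT INTO strains VALUES (?, ?)", rows)
    cursor = conn.execute(
        f"SELECT STRAIN_ID FROM strains WHERE {where_clause} ORDER BY STRAIN_ID"
    )
    keys = [row[0] for row in cursor]
    conn.close()
    return keys


SAMPLE_ROWS = [("S1", "Ambiorix"), ("S2", "Ambrosia"), ("S3", "Perdrix")]
exact = select_keys(SAMPLE_ROWS, "GENUS='Ambiorix'")    # equality test
prefix = select_keys(SAMPLE_ROWS, "GENUS LIKE 'Amb%'")  # % matches any suffix
```

Here the first clause selects only the record whose genus is exactly Ambiorix, while the second selects every record whose genus starts with Amb.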
•Getting a detailed report of the external database
record
For each entry in the BioNumerics database, you can
obtain a complete list of all information present in the
external data source. To this end, you should first open
the Entry edit window, e.g. by double clicking on the
name in the entry list. Then use the corresponding button to
create a new window that shows a list of all information
fields that are present for this entry in the external
database. Note that there is no limit to the number of
fields that can be viewed and edited in this way, and
that each field may consist of several lines and can
contain up to 5000 characters.
Moreover, you can change some of these fields, and
upload these changes to the external database using the
corresponding button.
27.3 Import of character data using
ODBC
One can use an ODBC link to an external database for
importing character data into the BioNumerics database.
Open the Character Type that you want to import by
double clicking on its name in the Experiments panel in
the Main window. Then use the command File > Import
from external database. A dialog box pops up, showing
a complete list of all the database fields that are present
in the external data source. The program determines
automatically if any of the characters in the Character
Type corresponds to a database field in the external
database. If so, the field is written in boldface, and the
character will be filled with the values from this field
during the import. You can add new characters to the
Character Type by selecting an unmatched field and
pressing <Create character>. Groups of characters to
add can be selected using the SHIFT key. To import the
data, press <OK>. For every local entry that has a
matching key in the external database, the
corresponding characters of this Character Type will be
filled with information from the external database.
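The two stages of this import, matching external fields to characters and then filling in values for entries with a matching key, can be sketched as follows in Python. This is an illustration only: it assumes that a character corresponds to an external field when the names are identical, which is not stated explicitly above, and all function and field names are hypothetical.

```python
def match_characters(character_names, external_fields):
    """Split external fields into those that match an existing character
    (shown in boldface in the dialog box) and those that do not."""
    known = set(character_names)
    matched = [f for f in external_fields if f in known]
    unmatched = [f for f in external_fields if f not in known]
    return matched, unmatched


def import_characters(local_keys, external_records, key_field, matched):
    """Return {entry_key: {character: value}} for every local entry that
    has a record with a matching key in the external data source."""
    by_key = {record[key_field]: record for record in external_records}
    return {
        key: {name: by_key[key][name] for name in matched}
        for key in local_keys
        if key in by_key
    }
```

An unmatched field would first have to be turned into a character (the <Create character> step) before its values can be imported.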
27.4 Import of sequence data using
ODBC
The ODBC link to an external database can also be used
to import sequence data into the BioNumerics database.
Open the Sequence Type that you want to import by
double clicking on its name in the Experiments panel in
the Main window. Then use the command File > Import
from external database. A dialog box pops up, showing
a complete list of all the database fields that are present
in the external data source. Select the database field that
contains the required sequence information and press
<OK> to import the data. For every local entry that has a
matching key in the external database, the
corresponding sequence will be filled with the data
contained in the selected external database field.
28. Connected databases
BioNumerics offers two possibilities to store its
databases: the program’s own local database engine (the
local database) or an external ODBC compatible database
engine. The latter solution is called a Connected Database.
Currently supported database engines are Microsoft
Access, SQL Server, and Oracle. Others may work as
well but are not guaranteed to be fully compatible in a
standard setup.
NOTE: BioNumerics uses Quoted Identifiers to pass
information to the Connected Database. Some database
systems, for example MySQL, do not use this ANSI
standard by default, but optionally. To use the database
as a BioNumerics Connected Database, make sure that
the use of Quoted Identifiers is enabled in the database
setup.
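What quoted identifiers look like in practice can be shown with a small sketch in Python, using the built-in sqlite3 module as a stand-in for the Connected Database engine. The table and column names are hypothetical and chosen to contain spaces, so that the statements only work when the names are quoted. The exact statements BioNumerics sends are not documented here.

```python
import sqlite3


def quote_identifier(name):
    """Quote an SQL identifier the ANSI way: enclose it in double quotes
    and double any double quote that occurs inside the name."""
    return '"' + name.replace('"', '""') + '"'


conn = sqlite3.connect(":memory:")
table = quote_identifier("Entry Table")   # hypothetical table name
column = quote_identifier("Strain No")    # hypothetical column name
conn.execute(f"CREATE TABLE {table} ({column} TEXT)")
conn.execute(f"INSERT INTO {table} ({column}) VALUES (?)", ("S1",))
rows = conn.execute(f"SELECT {column} FROM {table}").fetchall()
conn.close()

# In MySQL, double-quoted identifiers are only accepted when the
# ANSI_QUOTES SQL mode is enabled, for example with:
#     SET sql_mode = 'ANSI_QUOTES';
```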
Connected Databases are particularly useful in the
following cases:
1. Environments where several users need to access the
same database simultaneously. When the Connected
Database engine is set up to support multi-user access,
BioNumerics will allow multiple users to access and
modify the database simultaneously. Note, in this
respect, that BioNumerics takes a “snapshot” of the
database when the program is launched. As such,
changes to the database made by others while you
have a BioNumerics session open will not be seen in
your current session, until you reload the database
during your session (see 28.7).
2. When sample information and/or experiment data is
already stored in a relational database.
3. Laboratories where vast amounts of data are
generated. In cases where many thousands of
experiment files are accumulated, a powerful
database structure such as Oracle or SQL Server will
be faster and more efficient in use than BioNumerics’
own database.
4. When a more flexible database setup is to be achieved,
for example with different access/permission settings
for different users, and with built-in backup and
restore tools.
In a Connected Database, BioNumerics will require a
number of tables with specific columns to be available
(see 30.1). BioNumerics can either construct its own
tables and appropriate fields or link to existing tables
and fields in the Connected Database. The latter option
is particularly interesting to create a setup where
BioNumerics hooks on to an existing database.
As soon as a valid Connected Database is defined, the
user can start entering information in the Connected
Database. BioNumerics writes and reads the
information directly into and from the external
database, without storing anything locally. Since every
Connected Database has a local BioNumerics database
associated with it, the user has the option to store and
analyze local entries together with entries in the
Connected Database. To make the difference between
locally and externally stored data, all the entries and
experiment data stored in a Connected Database are
underlined. Although the use of Connected Databases
and associated local databases is transparent, it is not
recommended to store entries and experiments in a
mixed way.
NOTE: A number of tables in a GelCompar II
Connected Database deal with Character Types,
Sequence Types, 2D Gel Types, and Matrix Types
(BioNumerics). These tables are also required by
GelCompar II, in order to assure compatibility with
BioNumerics databases and to allow upgrading from
GelCompar II to BioNumerics.
BioNumerics local databases can be converted into a
Connected Database at any time. This process is
irreversible: once a local database has been converted
into a Connected Database, the local database is
removed, and Connected Databases cannot be back-converted into local databases.
NOTE: There exist scripts that can convert database
entries into XML files, which can in turn be extracted
into database entries. This provides a means to convert
Connected Database entries back into local database
entries.
The combined use of local and Connected Databases is
limited to avoid possible conflicts between the two
database systems. In particular, the possibility that local
and Connected experiment types have the same name
but different settings, should be avoided. Therefore, a
few approved possibilities for working with Connected
Databases are supported:
1. Creating a new database in BioNumerics, which is
linked to a new Connected Database. BioNumerics is
allowed to construct the database layout.
2. Creating a new Connected Database in BioNumerics,
by linking to an existing database that has a table
structure already in a BioNumerics compatible format
(e.g., linking to an existing BioNumerics Connected
Database).
3. Creating a new Connected Database, linking to an
existing database which was not created using
BioNumerics.
4. Converting a local database to a new Connected
Database.
These possibilities are described in further paragraphs.
Note that Connected Databases are only available with
the Database Sharing Tools module. The following
paragraphs describing the use of Connected Databases
require Microsoft Access (97 or later), Microsoft SQL
Server, Oracle, or PostgreSQL to be installed.
28.1 Setting up a new Connected
Database
BioNumerics can automatically create a new database in
Microsoft Access. When you are using SQL Server,
Oracle, or PostgreSQL, however, you will have to create
a new blank database before proceeding with the
following steps.
28.1.1 In the BioNumerics Startup program, click <New>
to create a new database.
28.1.2 Enter ConnectedBase as database name.
28.1.3 In the next step, choose <Yes> to automatically
create the required directories, since a local database
associated with the Connected Database is required.
28.1.4 In the next step, click <Yes> to enable the creation
of log files, and press <Finish>.
A new dialog box pops up, prompting for the type of
database: Use the local database, Create a new, empty
connected database, or Connect to an existing connected
database (Figure 28-1.).
Figure 28-1. Database selection dialog box.
28.1.5 Select Create a new, empty connected database
and the database engine of choice.
28.1.6 If the database engine is Access, press <Auto
create (.mdb)> to create a new Access database.
28.1.7 If the database engine is SQL Server, PostgreSQL,
or Oracle, you will need to build a Connection String
using the <Build> button.
28.1.8 Further to 28.1.7 (non-Access databases), the
dialog box that pops up now is generated by your
Windows operating system and may differ depending
on the Windows version installed. Therefore we refer to
the Windows manual or help function to select or create
a DSN file (ODBC Data Source) that specifies the ODBC
driver, and to set up a connection to the database.
28.1.9 Once the database connection is properly
configured, you can press <OK> to quit the database
setup.
28.1.10 Press the <Analyze> button.
The BioNumerics Main window will open with a blank
database.
28.2 Configuring the Connected
Database link in BioNumerics
In the BioNumerics main program, you can set up a
connection to a Connected Database, or configure an
existing connection. In case the program reports
database linkage problems when opening the database,
you will need to use this configuration to create the
required tables in the database.
28.2.1 Select Database > Connected databases.
This opens a list of all currently defined Connected
Databases for this BioNumerics database (normally just
one; see Figure 28-2.).
Figure 28-2. Connected databases list window.
28.2.2 Select the Connected Database of choice and click
<Edit>, or double-click on the name.
This results in the Connected Database configuration dialog
box (Figure 28-3.).
Figure 28-3. Connected Database configuration dialog box.
The upper left input field (Connected database) shows
the connection description file, which can be found in the
local database directory. When BioNumerics has created
a new connected database in the Startup program, the
file has the default name ConnDb.xdb. In the case of the
example Connected Database ConnectedBase, it occurs
as:
c:\Program files\BioNumerics\Data\ConnectedBase\ConnDb.xdb.
This text file can be edited in Notepad or another text
editor.
Under ODBC connection string, the ODBC connection
string is defined. The same string can be found in the
connection description file, under the tag [CONNECT].
28.2.3 The <Build> button allows a new connection
string to be defined. This will call the Windows setup
dialog box to create a new ODBC connection (see also
28.1.8).
28.2.4 By pressing the <Refresh> button, the connection
between BioNumerics and the Connected Database is
refreshed. A tree-like table structure view of the
database is displayed in the upper right panel.
28.2.5 The database type can be selected under Database
(Access, SQL Server, and Oracle). This information is
written under [DATABASETYPE] in the connection
description file.
The second panel in the Connected databases configuration
dialog box concerns the tables of the connected database.
BioNumerics assumes a certain table structure to be able
to store its different kinds of information. This table
structure is described in 30.1. The default table names
are ATTACHMENTS for attachments (see 6.5),
ENTRYTABLE for the entries, EXPERIMENTS for the
experiments, FPRINTFILES for the fingerprint files,
FPRINT for the fingerprint lanes, SEQUENCES for the
sequences, MATRIXVALS for the Matrix Type values,
G2DGELS, G2DQUERIES, G2DQUERYSPOTS,
G2DREFSYS, G2DREFSYSSPOTS, G2DSPOTINFO,
and G2DSPOTS for the 2D Gel Types, EVENTLOG for
the event log file, and SUBSETMEMBERS for the
subset members. Each table should contain a set of
columns with fixed names (30.1). In a database setup
where BioNumerics is connected to an existing database
system, views can be created with table names that
correspond to the required BioNumerics tables, and that
have the required BioNumerics columns. To add
flexibility, however, it is also possible to select different
table names than the default ones. This allows one to
create additional views, for example, where certain
information is shown or hidden. These views can be
saved under a different name, and specific views can be
made visible to users with specific permissions.
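As an illustration of the view approach described above, the following minimal sketch uses Python's built-in sqlite3 module in place of Oracle, SQL Server, or Access. The table ISOLATES and its column names are hypothetical, as are the information fields GENUS and SPECIES; only the view name ENTRYTABLE and the column name KEY are taken from this manual. The exact SQL syntax will differ between database systems.

```python
import sqlite3

# Hypothetical existing table with its own column names (not from the manual).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ISOLATES (ISOLATE_NO TEXT, GENUS_NAME TEXT, SPECIES_NAME TEXT)"
)
conn.executemany(
    "INSERT INTO ISOLATES VALUES (?, ?, ?)",
    [("G001", "Ambiorix", "sylvestris"), ("G002", "Ambiorix", "sp.")],
)

# A view that carries the default table name ENTRYTABLE and the fixed column
# name KEY; GENUS and SPECIES stand in for additional information fields.
conn.execute(
    "CREATE VIEW ENTRYTABLE AS "
    "SELECT ISOLATE_NO AS [KEY], GENUS_NAME AS GENUS, SPECIES_NAME AS SPECIES "
    "FROM ISOLATES"
)

# The view can now be read as if it were the required table.
rows = conn.execute(
    "SELECT [KEY], GENUS FROM ENTRYTABLE ORDER BY [KEY]"
).fetchall()
print(rows)
```

A second view with another name (for example one that leaves out certain columns) could be created in the same way and selected under Database tables for users with restricted permissions.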
The right panel relates to Character Types. Each
Character Type is stored in two separate tables. One
table, <CharacterTypeName>FIELDS, contains the field
(i.e., character) names. When Connected Databases are
used, characters can be described by more than one
information field (see page 58). The name field and the
additional fields are stored in columns in this table. The
second table, <CharacterTypeName>, contains the
character values for the entries. Both tables can also be
chosen under Values and Fields, respectively. The
default names are <CharacterTypeName> and
<CharacterTypeName>FIELDS
(<CharacterTypeName> being the name of the
Character Type).
Under Restricting query, there is a possibility to enter a
query that restricts the number of entries in the database
to those that fulfill a specific query. The use of restricting
queries is explained further in 28.8.
With the option Experiment order statement, it is
possible to define a specific order for the experiments to
show up in the Experiments panel in the Main window. By
default, the experiments are listed alphabetically, which
is indicated by the default SQL string “ORDER BY
[EXPERIMENT]”. [EXPERIMENT] refers to the column
EXPERIMENT in table EXPERIMENTS (see 30.1.3),
which holds the names of the experiments. This means
that the experiments will be sorted by their name. It is
possible to add an extra column to this table, with
information entered by the user, for example an index
number. If this column is specified in the SQL string, the
experiments will be ordered by the index.
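The effect of the order statement can be sketched as follows, again using Python's sqlite3 module as a stand-in for the actual database system. The column SORTINDEX and the experiment names are assumptions made for this illustration; only the table EXPERIMENTS, the column EXPERIMENT, and the default string “ORDER BY [EXPERIMENT]” come from the text above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# EXPERIMENT is the column named in the manual; SORTINDEX is a hypothetical
# user-added column holding an index number.
conn.execute("CREATE TABLE EXPERIMENTS (EXPERIMENT TEXT, SORTINDEX INTEGER)")
conn.executemany(
    "INSERT INTO EXPERIMENTS VALUES (?, ?)",
    [("RFLP1", 2), ("PhenoTest", 3), ("16S rDNA", 1)],
)

def experiment_names(order_statement):
    """Return the experiment names, sorted by the given ORDER BY clause."""
    query = "SELECT EXPERIMENT FROM EXPERIMENTS " + order_statement
    return [row[0] for row in conn.execute(query)]

by_name = experiment_names("ORDER BY [EXPERIMENT]")   # default: by name
by_index = experiment_names("ORDER BY [SORTINDEX]")   # user-defined order
print(by_name)
print(by_index)
```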
In Source file location, the path for storing the source
files (TIFF images and .CRV curve files) is entered. The
path can be a local directory or a network path, for
example on a server computer. To change the path, click
to browse through the computer or the network.
Fingerprint files (TIFF files, CRV files) can also be saved
in the Connected Database when the checkbox Store
fingerprint files in database is checked.
As opposed to earlier versions of BioNumerics, contig
projects are always saved in the Connected Database. As
for the trace files from automated sequencers
(four-channel sequence chromatogram files), the user has the
choice between linking to the original path of the files or
storing them in the database, using the checkbox Store
trace files in database. The trace files are stored in
column DATA of table SEQTRACEFILES (see 30.1.9).
If Store trace files in database is not checked, the
column DATA will hold a link to the original path they
were loaded from.
Also for contigs associated with sequences in a
Connected Database, it is possible to display the contig
status by checking Display sequence contig status.
When checked, the program shows the presence of a
contig file as well as an Approved flag (see 7.18.84).
With the checkbox Use as default database, the database
can be specified to be the default Connected Database or
not. Once a database is specified to be the default
Connected Database, it cannot be disabled anymore!
Two buttons, <Check table structure> and <Auto
construct tables>, allow one to check if all required
tables and fields are present in the Connected Database,
and to automatically insert new tables and fields where
necessary, respectively.
WARNING: when pressing <Auto construct tables>,
BioNumerics will automatically create a new table for
every required table that is not yet linked to an existing
table in the database. For tables already linked, it will
insert all required fields that do not yet exist in the
database. In case you want to link BioNumerics to an
existing database, this may cause a number of tables and
fields to be created, and cause irreversible database
changes! Solutions to link BioNumerics to existing
databases having different table structures are explained
in 28.5.
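Before pressing <Auto construct tables> on an existing database, it can be useful to know which of the default table names are absent. The sketch below shows the kind of read-only check involved; it is not the BioNumerics implementation. It assumes a SQLite database (where table and view names are listed in sqlite_master); other database systems expose this information through their own catalog. The Character Type tables are left out because their names depend on the Character Types defined.

```python
import sqlite3

# Default table names as listed in 28.2 (Character Type tables omitted).
REQUIRED_TABLES = [
    "ATTACHMENTS", "ENTRYTABLE", "EXPERIMENTS", "FPRINTFILES", "FPRINT",
    "SEQUENCES", "MATRIXVALS", "G2DGELS", "G2DQUERIES", "G2DQUERYSPOTS",
    "G2DREFSYS", "G2DREFSYSSPOTS", "G2DSPOTINFO", "G2DSPOTS",
    "EVENTLOG", "SUBSETMEMBERS",
]

def missing_tables(conn, required=REQUIRED_TABLES):
    """List required names that exist neither as a table nor as a view."""
    present = {
        row[0].upper()
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type IN ('table', 'view')"
        )
    }
    return [name for name in required if name not in present]

# Example: a database in which only two of the required tables exist.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ENTRYTABLE ([KEY] TEXT)")
conn.execute("CREATE TABLE EXPERIMENTS (EXPERIMENT TEXT)")
report = missing_tables(conn)
print(report)
```

Every name in the report corresponds to a table that would be newly created unless it is first linked to an existing table or view.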
28.3 Working in a Connected Database
Once a Connected Database is correctly set up, adding,
processing and analyzing data is nearly identical to
working in a local database. Entries stored in the
Connected Database are underlined.
NOTES:
(1) When entry information fields are obtained from a
view (query) in the Connected Database, it will not be
possible to define new information fields directly from
BioNumerics. In that case, you will have to create the
field in Oracle, SQL Server, PostgreSQL, or Access,
add it to the view, and reload the BioNumerics database
(see 28.6.3).
(2) Certain characters, for example a period, that are
allowed in column names in a BioNumerics database,
may not be allowed in the Connected Database. We refer
to the manual of the database system for more
information.
(3) Views with joined columns may be read-only and it
may not be possible to add new records to the database
that are seen through these views (e.g. entries,
experiments). It is possible to bypass this in Oracle or
SQL Server using triggers.
There are a few differences, however, concerning (1)
adding new entries to the database, (2) the default
directories for images and contig files, and (3) the way
log files are recorded and viewed.
•When adding new entries to the database using the
menu command Database > Add new entries, the
choice is offered to add the entries to the local
database or to the Connected Database. When no
Connected Database is the default database, you will
be able to choose between these two possibilities.
Once a Connected Database is specified as the default
database, however, it will only be possible to add new
entries to the Connected Database.
•In a standard Connected Database setup, images,
contig project files, and other source files are stored in
a common directory Sourcefiles under the local
database directory. For example, in case of the newly
created database ConnectedBase, the default
directory for such files is c:\Program files\
BioNumerics\ Data\ ConnectedBase\ Sourcefiles.
•Within Sourcefiles, there are two subdirectories:
Contig and Images. The Images subdirectory contains
the TIFF files for Fingerprint Types, which are placed in
this directory using the command File > Add
experiment file in the BioNumerics Main window. To
make a gel TIFF file visible in the Files panel in a
Connected Database, the file should be present in this
directory. Under Contig, GeneBuilder contig
(sequence assembly) projects can be saved. This will
only be the case if no column Contig is present in the
Connected Database; otherwise (and by default),
contigs are saved in the Connected Database (see
30.1.8). The source file directory can be modified as
described in 28.2. The path can be a network path, for
example on a server computer.
•Log files are stored in a different way in a Connected
Database. The log events are stored in a database table
called EVENTLOG. Different events are stored under
different categories: Database concerns all actions
affecting the database (adding, changing or removing
information fields, adding experiment types, adding
entries, changing entry information etc.). Furthermore,
there is a category EXPER_<ExperimentName>
(<ExperimentName> being the name of the experiment),
relating to changes
made to the experiment type (i.e. normalization
settings in case of Fingerprint Type, adding,
removing, or renaming characters in Character Type,
etc.). A third category reports on changes made to the
data in a certain experiment type. In this category,
components have the name of the experiment type.
•The Event log window (Figure 5-3.) called from the
main program offers the possibility to view the log file
for a Connected Database or the local database under
Database. Under Component, you can choose to view
a specific component, e.g. Database, an experiment
type, or data belonging to an experiment type. With
All, you can view all components together, listed
chronologically. The components can only be selected
when a Connected Database is viewed.
28.4 Linking to an existing database with
standard BioNumerics table structure
Any computer running BioNumerics can link up to an
existing BioNumerics Connected Database at any time.
When this Connected Database has its table structure in
the standard BioNumerics format (see 30.1), this can be
done very easily in the Startup program.
28.4.1 In the BioNumerics Startup program, click <New>
to create a new database.
28.4.2 Enter a name for the Connected Database (this
can be a different name on different computers).
28.4.3 In the next step, choose Yes to automatically create
the required directories, since a local database
associated with the Connected Database is required.
28.4.4 In the next step, choose whether or not to create
log files, and press <Finish>.
A new dialog box pops up, prompting for the type of
database: Local database, New Connected database, or
Connect to an existing connected database (Figure 28-1.).
28.4.5 Select Connect to an existing connected database
and press <Build> to establish the connection to the
database.
28.4.6 The dialog box that pops up now is generated by
your Windows operating system and may differ
depending on the Windows version installed. Therefore
we refer to the Windows manual or help function to
select or create a DSN file (ODBC Data Source) that
specifies the ODBC driver, and to set up a connection to
the database.
28.4.7 Once the database connection is defined, you can
press <OK> to quit the database setup.
28.4.8 Press the <Analyze> button.
The BioNumerics Main window will open. The
Connected Database will be the default database. If the
Connected Database contains the standard table
structure for BioNumerics (see 30.1), no error message is
produced and you can start working immediately.
BioNumerics will automatically recognize the existing
information fields, experiment types, subsets, entries
and data. If the table structure is not in standard
BioNumerics format, however, a dialog box appears,
warning of errors that occurred while trying to open
tables that were not found in the connected database.
See 28.5.10 and further to assign the
correct tables or views from the database.
28.5 Linking to an existing database with
table structure not in BioNumerics
format
This paragraph describes the situation where an Oracle,
SQL Server, PostgreSQL, or Access database, containing
descriptive information on organisms (entries) and/or
experiment data is already present and BioNumerics
should be hooked up to that database in order to read
and write experiment data and information fields.
Before proceeding with the configuration of the
database connection, it will be necessary to make the
database compatible with the BioNumerics table
structure. In a typical case, a number of information
fields and/or experiment fields from the Connected
Database will need to be linked to BioNumerics.
However, these fields will occur in different tables
having different field names. The obvious method in
this case is to create views (or, in Access, queries) in the
database.
•For those BioNumerics tables for which the Connected
Database contains fields to be used, a view (query)
should be constructed in the database. Within that
view (query), those database fields that contain
information to be used by BioNumerics should be
linked to the appropriate field.
•BioNumerics tables for which the Connected Database
contains no fields can be created automatically by
BioNumerics.
•Finally, the database should be configured in such a
way that the BioNumerics tables that contain fields
already present in the database are present either as a
table or as a view, with all the recognized field names as
outlined in 30.1. The names for the tables or views,
however, can be freely chosen.
•Additional tables required by BioNumerics that
contain no fields already present in the database can
be created automatically by BioNumerics.
NOTE: When views are created in the database, to
match the required BioNumerics tables, it is
recommended to name the views using the standard
BioNumerics names for the required tables. This will
allow new users to log on to an existing Connected
Database in the easiest way, by just defining the
connection in the startup program (28.4). By using
different names, new users will have to specify the table/
view names manually in the Connected database
configuration window (Figure 28-3.) after defining
the Connected Database. Using different names for the
views is only useful if it is the intention to assign
different permissions to different users; in this way,
views can be created showing only restricted
information, while other views show full information,
etc.
28.5.1 In the BioNumerics Startup program, click <New>
to create a new database.
28.5.2 Enter a name for the new database.
28.5.3 In the next step, choose <Yes> to automatically
create the required directories, since a local database
associated with the Connected Database is required.
28.5.4 In the next step, choose whether or not to create
log files, and press <Finish>.
A new dialog box pops up, prompting for the type of
database: Local database, New Connected database, or
Connect to an existing connected database (Figure 28-1.).
28.5.5 Select Connect to an existing connected database
and press <Build> to establish the connection to the
database.
28.5.6 The dialog box that pops up now is generated by
your Windows operating system and may differ
depending on the Windows version installed. Therefore
we refer to the Windows manual or help function to
select or create a DSN file (ODBC Data Source) that
specifies the ODBC driver, and to set up a connection to
the database.
28.5.7 Once the database connection is specified, you can
press <OK> to quit the database setup.
28.5.8 Press the <Analyze> button.
The BioNumerics Main window will open. Since the
Connected Database does not contain the standard table
structure for BioNumerics (see 30.1), a dialog box now
appears, warning of several errors that have occurred
while trying to open specific tables in the connected
database that weren’t found.
28.5.9 Press <OK> to close the message(s). The
BioNumerics main program now opens with a blank
database.
In the BioNumerics main program, you can now
configure the database connection as described in 28.2:
28.5.10 Select Database > Connected databases.
This opens a list of all currently defined Connected
Databases for this BioNumerics database (normally just
one; see Figure 28-2.).
28.5.11 Select the Connected Database of choice and
click <Edit>, or double-click on the name.
This opens the Connected Database configuration
dialog box (Figure 28-3.). This window shows the default
suggested table names for the required database
components under Database tables (see 28.2). Some, or
all, of these tables do not correspond to the tables of the
database.
28.5.12 Press the <Refresh> button. The upper right
panel now lists the tables and views in the Connected
Database, as it exists.
28.5.13 You can expand each table/view to display its
fields by clicking on the “+” sign on the tree.
28.5.14 Under Database tables, select the corresponding
table or view for each component.
28.5.15 When this is finished, check the correspondence
by pressing <Check table structure>.
When required, you can further configure the database,
leaving the Connected Database configuration window
open. As soon as the new configuration is done, press
<Refresh> and check the table structure again.
28.5.16 Finally, when all links to existing database
tables/views are made correctly, you can allow
BioNumerics to create additional tables for which there
are no fields available in the external database, by
pressing <Auto construct tables>. BioNumerics will
now only construct tables that are not yet linked, and
fields that are not yet present in the connected tables.
NOTE: it will not be possible for BioNumerics to create
new fields within a view/query. In that case, you will
have to create the field in Oracle, SQL Server or Access,
add it to the view, and reload the BioNumerics database.
28.6 Converting a local database to a
Connected Database
BioNumerics offers the possibility to convert an entire
local database at once to a new Connected Database.
This is an irreversible operation, which causes the local
database to be removed once the conversion is done. It
is therefore strongly recommended to make a backup
copy of the local database before carrying out a
conversion to a Connected Database.
NOTE: It is not recommended to convert a local
database into a Connected Database that already
contains data, using this tool, since experiment types
with the same name would be overwritten. There are
other tools to convert entries and data from a local
database into a Connected Database, using scripts based
upon XML export and import (see also introduction of
this chapter).
To convert a local database into a new Connected
Database, proceed as follows:
28.6.1 Create a new empty database in Oracle, SQL
Server or Access.
28.6.2 Open the local database in the BioNumerics main
program.
28.6.3 In the main program, select Database >
Connected databases.
This opens a list of all currently defined Connected
Databases for this BioNumerics database (normally
empty at this stage; see Figure 28-2.).
28.6.4 Click <New ODBC>.
In the Connected databases configuration dialog box that
appears, click <Build>.
28.6.5 The dialog box that pops up now is generated by
your Windows operating system and may differ
depending on the Windows version installed. Therefore
we refer to the Windows manual or help function to
select or create a DSN file (ODBC Data Source) that
specifies the ODBC driver, and to set up a connection to
the database.
28.6.6 Make sure the Connected Database is checked as
the default database; otherwise, the conversion cannot
be executed.
28.6.7 Check the table structure of the database. If it does
not contain the required tables and fields, press <Auto
construct tables> to allow BioNumerics to construct its
tables.
28.6.8 Once the connection is defined correctly, press
<OK> to close the Connected databases configuration dialog
box.
28.6.9 Close the Connected databases list window.
28.6.10 In the main program, select Database > Convert
local data to connected database.
An important warning message is displayed. If you are
converting the local database to a NEW Connected
Database, and if you have made a backup of the data
before starting this conversion, you can safely click
<OK> to start the conversion.
Depending on the size of the database, the conversion
can take seconds to hours. Fingerprint image files take
most time to convert. When the conversion is finished
successfully, BioNumerics will automatically restart
with the Connected Database, and the contents of the
local database will be removed.
28.7 Opening and closing database
connections
It is possible to connect to other Connected Databases in
addition to the default Connected Database.
•Connecting to multiple Connected Databases
28.7.1 In the main program, select Database >
Connected databases.
This opens a list of all currently defined Connected
Databases for this BioNumerics database (normally just
one; see Figure 28-2.).
28.7.2 Click <New ODBC>.
In the Connected databases configuration dialog box that
appears, click <Build>.
28.7.3 The dialog box that pops up now is generated by
your Windows operating system and may differ
depending on the Windows version installed. Therefore
we refer to the Windows manual or help function to
select or create a DSN file (ODBC Data Source) that
specifies the ODBC driver, and to set up a connection to
the database.
28.7.4 In the Connected databases configuration dialog box,
enter a name for the connected database definition file
(upper left input field, Connected database). This name
should be different from the names of any of the
existing Connected Databases.
28.7.5 Under Source files location, select the directory
where the source files can be found for this Connected
Database. This directory should in any case be different
from the Source files directory of the default Connected
Database.
28.7.6 Once the connection is defined correctly, press
<OK> to close the Connected databases configuration dialog
box.
NOTE: When the two connected databases have the
same Source files directory associated, an error message
is produced at this time: “Another connected database is
already associated with this source files directory.” It
will not be possible to save this new connection until the
source files directories are different.
The new Connected Database is listed in the Connected
databases list window. When you open the main program,
the contents of the two databases are seen together.
•Closing or deleting a Connected Database
28.7.7 In the Connected databases list window (Database >
Connected databases), select the Connected Database
you want to close, and press <Close>.
28.7.8 Confirm with <Yes>. The database disappears
from the list, and the contents of the closed database
disappear from the Main window.
Closing a Connected Database is temporary. When it is
closed, it will automatically be reopened the next time
the BioNumerics analyze program is started up with the
same database.
To delete a connection to a database, press <Delete> in
the Connected databases list window. The Connected
Database will never reappear until you build the
connection again.
•Reloading a Connected Database
If you have modified the Connected Database
directly in Oracle, SQL Server or Access, you can use the
function <Reload> in the Connected databases list window.
Any columns that were added, for example as
information fields, or any entries or data that was added
externally after BioNumerics was started up will be
updated in the BioNumerics Main window.
Reloading a Connected Database can also be useful in
case several persons are working in the database
simultaneously. Any entries added by other persons
will not be seen in your session until you reload the
database.
28.8 Restricting queries
In the Connected databases configuration dialog box (Figure
28-3.), it is possible to enter a restricting query in the input
field Restricting query. A restricting query is of the
general format FieldName=String. FieldName is the
name of the field the restriction is applied to, and String is the
restricting string. As a result, when the BioNumerics
main program is opened with the Connected database,
only those entries having String filled in the field
FieldName will be seen in the database.
In addition, when new entries are added to the database,
they will automatically have their field FieldName filled
with String.
As an example, if the database DemoBase
has been converted to a Connected Database, you can
enter a restricting query to visualize only Ambiorix, as
follows.
28.8.1 In the Connected databases configuration dialog box
under Restricting query, type:
GENUS=Ambiorix
28.8.2 Press <OK> to confirm the changes. The Main
window now only shows Ambiorix.
28.8.3 Add a new entry with Database > Add new
entries. The new entry is automatically called Ambiorix
in its Genus field.
Restricting queries can be combined by separating them
with semicolons. For example, if you want to visualize
only Ambiorix sylvestris entries, enter the following as a
restricting query:
GENUS=Ambiorix;SPECIES=sylvestris
The result is a database that only shows Ambiorix
sylvestris. New entries will automatically be added as
Ambiorix sylvestris.
NOTE: Never use spaces in a restricting query.
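The behaviour of a restricting query can be pictured with the following minimal sketch. It is an illustration of the rules stated above (FieldName=String clauses, separated by semicolons, no spaces), not the code used by BioNumerics; the function names and the representation of entries as dictionaries are assumptions made for this example.

```python
def parse_restricting_query(query):
    """Split 'FIELD=Value;FIELD=Value' into a dict of field restrictions."""
    if " " in query:
        raise ValueError("a restricting query must not contain spaces")
    restrictions = {}
    for clause in query.split(";"):
        if not clause:
            continue  # ignore an empty query or a trailing semicolon
        field, _, value = clause.partition("=")
        restrictions[field] = value
    return restrictions

def visible_entries(entries, restrictions):
    """Keep only the entries whose fields match every restriction."""
    return [
        entry for entry in entries
        if all(entry.get(field) == value for field, value in restrictions.items())
    ]

def new_entry(key, restrictions):
    """A new entry automatically receives the restricted field values."""
    entry = {"KEY": key}
    entry.update(restrictions)
    return entry

# Hypothetical entries; only Ambiorix sylvestris appears in the text above.
entries = [
    {"KEY": "G001", "GENUS": "Ambiorix", "SPECIES": "sylvestris"},
    {"KEY": "G002", "GENUS": "Ambiorix", "SPECIES": "sp."},
    {"KEY": "G003", "GENUS": "Othergenus", "SPECIES": "sylvestris"},
]
restrictions = parse_restricting_query("GENUS=Ambiorix;SPECIES=sylvestris")
shown = visible_entries(entries, restrictions)
added = new_entry("G004", restrictions)
print(shown)
print(added)
```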
29. Preserving the BioNumerics database integrity
In many cases, BioNumerics will be used to construct
large databases of information that has been collected
over a long time span. Obviously, the user should pay
attention to protect such databases from accidental data
losses, e.g. due to hard disk crashes, power
interruptions, etc. Although a practical strategy depends
to a large extent on the computer hardware
configuration, this chapter introduces some basic
considerations to protect the data.
29.1 Taking backups of a database
All the data files that belong to a particular BioNumerics
database are stored on the hard disk in subdirectories of
a single top directory (see also 1.3). If BioNumerics is
opened with this database, this directory is indicated in
the status bar on the bottom of the Main window.
Alternatively, one can open the BioNumerics intro
screen, select a database, and press the <Settings>
button. A dialog box appears which shows the top
directory of the selected database.
Since all important information concerning a database is
stored inside this top directory, one only needs to back
up this complete directory (including subdirectories) to
have a complete copy of all data. When the database
needs to be restored later on, this top directory can be
copied back to the right place on the hard disk. Note that
backups restored from CD-ROM may be read-only. In
this case you will have to make the files write-accessible
before you run BioNumerics with the restored
database.
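The backup and restore steps above can also be scripted. The following is a minimal sketch using only the Python standard library; the function names are chosen for this example, and the directory paths passed to them are up to the user. It copies the complete top directory, including subdirectories, and clears the read-only state of restored files.

```python
import os
import shutil
import stat

def back_up_database(top_dir, backup_dir):
    """Copy the complete database top directory, subdirectories included.

    backup_dir must not exist yet; the path of the copy is returned.
    """
    return shutil.copytree(top_dir, backup_dir)

def make_writable(root):
    """Clear the read-only state of every file under root,
    for example after restoring a backup from CD-ROM."""
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            os.chmod(path, os.stat(path).st_mode | stat.S_IWRITE)
```

To restore, the same copy function can be used in the opposite direction, followed by make_writable on the restored top directory.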
It is possible to create a duplicate of a BioNumerics
database in a similar way. To this end, copy the entire
contents of the database’s top directory to a new
directory. In the BioNumerics intro screen, select <New>
to create a new database. When the database creation
wizard pops up, fill in a name of the duplicate database
and click <Next>. In the next tab, click <Change> to
change the database top directory into the name of the
duplicate directory. In addition, specify <No> to the
question “Do you want to automatically create the required
directories?” Proceed in the usual way to finish the
creation of the database.
29.2 Detecting and correcting faults in a
database
During the use of a BioNumerics database, faults may
occur in the structure that need to be
repaired. For instance, if a power failure or a system
crash occurs while BioNumerics is saving a gel file, this
file may become corrupted and unreadable. Moreover,
due to the relational structure of the database, there may
occur some more subtle inconsistencies in the database.
For example, two different database entries may end up
having the same key. Such a conflict may arise if
someone has copied new data files from one database
into another without taking care.
BioNumerics comes with a special diagnostics/repair
program that is able to detect and solve such problems
in a local database. Connected Databases (chapter 28.)
cannot be inspected using this tool. To start this
program, click on the <Inspect> button in the intro
screen. The program’s Main window is divided in three
parts (Figure 29-1.):
1. Top panel: a list of all the problems the program has
detected in the current active database. If the program
can solve the problem automatically, a <Solve> button
appears right from it. The user can click on this button
to see a menu of possible solutions. If a solution is
clicked, it is applied on the database.
2. Bottom left panel: a list of all experiments present.
3. Bottom right panel: a list of all database keys having a
conflict. For each key, this window indicates the
number of times it occurs as a database entry list and
as each of the experiment types. If duplicates exist,
they are marked in red. Double clicking on a key in
this list opens a detailed entry window, specifying the
place of occurrence of this key in the database and for
every experiment. For each occurrence, the key can be
changed manually using Edit > Change key. In this
way, the user can manually intervene, having
complete control over the way key conflicts are
resolved.
Problems can be solved by either selecting one of the
standard solutions proposed by the software, or
changing the database structure manually. If the
solution involves changing keys in the database, the
user needs to save the changes explicitly by using the
menu item File > Change changes to disk afterwards.
The remainder of this chapter contains a list of problems
that can be detected, together with a discussion of
possible solutions.
Figure 29-1. The BioNumerics diagnose program.
29.3 Missing directories
If a directory is missing in the database directory
structure, the program reports it. The proposed solution
is to create the missing directory.
29.4 Corrupted files
The diagnose program checks the following files for
errors:
•Fingerprint, Character, Sequence & Matrix Types
configuration files (.CNF)
•Fingerprint, character, sequence & matrix data files
(.DEF)
•Database information files (.DBS)
•Comparison files (.CMP)
•Identification libraries and unit files (.LIB & .CMP)
If a corrupted file is detected, the only automatic
solution provided by the software is to delete it.
29.5 Empty files
It may happen that, if a number of database entries have
been removed, an empty file exists on the hard disk (e.g.
a database file or a fingerprint, character or sequence
data file). Although this situation does not harm the
usage of the database, the software detects this and
proposes to remove the empty file(s).
29.6 Database entries with identical keys
In certain circumstances, a database might end up with
two database entries having the same key. For example,
this might occur if one copies database files from one
database to another, without taking care of the key
names. In such a situation, the software is unable to
distinguish between both entries. Two automatic
solutions are proposed:
•Remove the duplicate keys (i.e. replace with empty
strings).
•Automatically rename the duplicate keys in order to
force them to be unique.
29.7 Database entries without keys
If a database entry ends up having no key (i.e. an empty
string), the software is not able anymore to link any
experiments to it or to use it in an analysis. The
proposed automatic solution is to create a new, unique
key for such a database entry.
29.8 Experiments with identical keys
This problem is similar to the previous one. It occurs
when two different “experiments” of the same
experiment type, (e.g. fingerprint lanes of gels belonging
to the same type), have exactly the same key. If this
happens, there is a conflict because both will be linked to
the same database entry. Again, two automatic solutions
are possible:
1. Remove the duplicate keys. In this case, the
experiments that have duplicate keys will become
unlinked, i.e. not associated with any database entry.
Afterwards, the user can re-link them to the
appropriate entries.
2. Automatically rename the duplicate keys into unique
strings. Note that, if this option is chosen, a new
conflict will arise because this will create experiments
with keys that do not correspond to any database
entry. However, this can be solved by automatically
creating new database entries for these experiments
(see next paragraph).
29.9 Experiment keys without database
entries
It might happen that an experiment (e.g. a fingerprint
lane of a gel) has a “phantom link”, i.e. it has a key that
does not correspond to any database entry. This
situation may occur if one removes a database entry
without taking care of the experiments that are linked to
it. The software can perform two automatic actions:
1. Remove the key from that experiment. This means
that, at this moment, the experiment becomes
unlinked (not linked to any database entry).
Afterwards, a new link can be created in BioNumerics.
2. Add a new database entry for this key. This solution
creates a new database entry with a key that
corresponds to the key of the experiment.
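As a rough illustration of the "phantom link" check and its two automatic actions, the following Python sketch works on plain lists of keys. The function names and the convention that an empty string means "unlinked" are assumptions for this example only; they do not describe how BioNumerics stores or repairs its data internally.

```python
def find_phantom_keys(entry_keys, experiment_keys):
    """Return experiment keys that match no database entry (sorted, unique)."""
    entries = set(entry_keys)
    return sorted({k for k in experiment_keys if k and k not in entries})


def unlink_phantoms(entry_keys, experiment_keys):
    """Action 1: remove the key from every experiment with a phantom link."""
    entries = set(entry_keys)
    return [k if k in entries else "" for k in experiment_keys]


def add_missing_entries(entry_keys, experiment_keys):
    """Action 2: add a new database entry for every phantom key."""
    return list(entry_keys) + find_phantom_keys(entry_keys, experiment_keys)
```

The first action changes only the experiments, whereas the second changes only the list of database entries.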
30. Appendix
30.1 Connected Database table structure
30.1.1 Introduction
In the description below, the structures of the tables
required by BioNumerics in a Connected Database are
given (see chapter 28.). The tables are indicated with
their default names. As pointed out in 28.5, however, it
is possible to use different names for these tables or
views in an actual database, which are recorded in the
connected database configuration file (.xdb). The names
of the columns within the tables, however, are fixed.
The object “CLOB” means a large text field. This may be
described differently depending on the database used
(e.g., the Access equivalent is “memo”).
NULL values should be allowed for all fields.
30.1.2 Table ENTRYTABLE
This table contains a record for every entry in the
database.
•KEY (VARCHAR(80))
The unique identifier for every entry in the database
(e.g. isolate number).
Other fields: additional database information fields.
30.1.3 Table EXPERIMENTS
This table contains a record for every experiment type in
the database.
•EXPERIMENT (VARCHAR(80))
Holds the name of the experiment (should be unique
through the whole database).
•TYPE (VARCHAR(80))
Can be “Fingerprint”, “Character”, or “Sequence”.
•SETTINGS (CLOB)
XML string that holds the processing, visualization
and analysis settings of the experiment type.
•TABLES (VARCHAR(160))
Used for character experiments only: holds the name
of the tables that hold character values and additional
character fields (separated by a comma).
30.1.4 Table FPRINTFILES
This table contains a record for every “batch” of
fingerprints that is entered in the database. A batch may
correspond to fingerprints that should be normalized
simultaneously: e.g. they were run on the same
electrophoresis gel, or run in the same batch on a
sequencer, etc.
•FILENAME (VARCHAR(80))
The name of the batch (should be unique for every
batch). In case of scanned electrophoresis gels, this
corresponds to the name of the TIFF image file.
•EXPERIMENT (VARCHAR(80))
Name of the experiment type to which this fingerprint
batch belongs.
•LOCKED (VARCHAR(10))
Whether or not this batch is locked (Yes or No).
•INLINELINK (VARCHAR(80))
If this batch is linked to another batch (for
normalization purposes), this specifies the name of the
batch that contains normalization info.
•BOUNDINGBOX (VARCHAR(200))
Specifies the bounding box of the lanes on a 2D
fingerprint image.
•SETTINGS (VARCHAR(250))
Data processing settings.
•TONECURVE (VARCHAR(200))
Specifies how bitmap pixel values are mapped to grey
shades on the screen.
•REFSYSTEM (CLOB)
Specifies the reference system that is used to
normalize the batch.
•MARKERS (VARCHAR(200))
Holds marker points that may be used to align linked
fingerprint images to each other.
30.1.5 Table FPRINT
This table contains a record for every fingerprint that is
entered in the database.
•KEY (VARCHAR(80))
The unique identification key of the sample to which
this fingerprint belongs.
•EXPERIMENT (VARCHAR(80))
The name of the experiment type to which this
fingerprint belongs.
•FILENAME (VARCHAR(80))
The name of the batch to which this fingerprint
belongs.
•FILEIDX (NUMBER)
The number of the fingerprint inside the fingerprint
file.
•SPLINE (VARCHAR(200))
Holds the exact positioning and size of the gelstrip on
the image.
•CURVESPLINE (VARCHAR(200))
Describes what part of the gelstrip is used for
calculation of the densitometric curve.
•GELSTRIPINFO (VARCHAR(50))
Contains resolution information about the gelstrip
image info.
•GELSTRIP (CLOB)
This field holds the bitmap values of the gelstrip.
•DENSCURVEINFO (VARCHAR(50))
Holds the resolution of the densitometric curve.
•DENSCURVE (CLOB)
Holds the densitometric curve data.
•BANDS (CLOB)
Holds information about the bands assigned on the
fingerprint.
•BANDCONC (CLOB)
Holds information about 2D concentration estimates.
•BANDCONCINFO (CLOB)
Holds information about 2D concentration estimates.
•REFPOS (VARCHAR(250))
Contains the reference positions assigned to this
fingerprint.
•MAPFORWARD (CLOB)
Contains a forward normalization vector.
•MAPBACK (CLOB)
Contains the reverse normalization vector.
•REFSYSTEM (CLOB)
Holds the reference system of the fingerprint.
•TONECURVE (VARCHAR(250))
Contains the tone curve.
•CHPTRN (VARCHAR(250)) (only with “Fast band
matching” enabled)
Contains cached pattern information on the band
positions for a Fingerprint Type with “Fast band
matching” enabled.
30.1.6 Character Values table
Each Character Type has its own table holding character
value information for the database entries. The default
name of this table is the name of the Character Type,
although it is possible to specify any table name (the
exact name is contained in the TABLES column of the
EXPERIMENTS table). Each record in the table
corresponds to a single character value belonging to a
single entry in the database.
•KEY (VARCHAR(80))
Key of the entry this character value belongs to.
•CHARACTER (VARCHAR(80))
Key of the character.
•VALUE (FLOAT)
Numerical value.
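Because a Character Values table stores one record per entry and per character (columns KEY, CHARACTER and VALUE), a client that reads it usually has to regroup the records into one row per entry. The Python sketch below shows one possible way to do this on records that have already been fetched. The function name, the example keys and the use of `None` for a missing value are assumptions for illustration, not part of BioNumerics.

```python
def pivot_character_values(records):
    """Regroup (KEY, CHARACTER, VALUE) records into one row per entry.

    Returns the list of character names (in order of first appearance)
    and a dict mapping each entry key to its list of values. A character
    without a record for a given entry is reported as None.
    """
    table = {}
    characters = []
    for key, character, value in records:
        table.setdefault(key, {})[character] = value
        if character not in characters:
            characters.append(character)
    rows = {key: [row.get(c) for c in characters] for key, row in table.items()}
    return characters, rows


# Hypothetical records, as they might be read from a Character Values table.
example_records = [
    ("iso1", "pH", 7.0),
    ("iso1", "temp", 37.0),
    ("iso2", "pH", 6.5),
]
example_characters, example_rows = pivot_character_values(example_records)
```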
30.1.7 Character Fields table
Contains information about the additional information
fields that can be stored together with characters in a
Character Type. The default table name is the name of
the Character Type, padded with “FIELDS”, but it is
possible to specify any other name (the exact name is
contained in the TABLES column of the EXPERIMENTS
table). Every record in this table corresponds to a single
field for a single character.
•CHARACTER (VARCHAR(80))
Name of the character this information field belongs
to.
•FIELD (VARCHAR(80))
Name of the field.
•CONTENT (VARCHAR(150))
Content of the field.
30.1.8 Table SEQUENCES
This table holds the sequence information stored in the
database. Note that the columns designed for contig files
have changed with respect to earlier versions of the
software.
•KEY (VARCHAR(80))
Key of the database entry this sequence belongs to.
•EXPERIMENT (VARCHAR(80))
Experiment type of the sequence.
•SEQUENCE (CLOB)
Sequence data.
•CONTIGFILE (VARCHAR(80))
Unique ID of the contig file that is associated with this
sequence (if any).
•CONTIG (CLOB)
Holds the contig sequence and its full editing history.
•CONTIGSTATUS (VARCHAR(10))
Contains the status of the contig file, i.e. confirmed or
not.
30.1.9 Table SEQTRACEFILES
This table holds information about the sequence trace
files (four-channel chromatogram files from automated
sequencers).
•KEY (VARCHAR(80))
For use with the Kodon software.
•CONTIGFILE (VARCHAR(80))
Unique ID of the contig that is associated with this
sequence trace file.
•TRACEID (VARCHAR(80))
Unique ID of the trace file.
•DATA (CLOB)
Holds the full trace information including sequence
and the chromatogram files in case the trace files are
stored in the database (28.2). Otherwise, it stores a link
to the path of the trace file.
•INFO (CLOB)
Contains the full editing information of the sequence
trace file.
30.1.10 Table MATRIXVALS
Holds pairwise similarity values. Each record in this
table represents a single similarity value between two
database entries.
•EXPERIMENT (VARCHAR(80))
Name of the experiment type this similarity value
belongs to.
•KEY1 (VARCHAR(80))
Key of the first database entry.
•KEY2 (VARCHAR(80))
Key of the second database entry.
•VALUE (FLOAT)
Similarity value.
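A client reading MATRIXVALS typically needs to look up the similarity between two given entries for one experiment type. The following Python sketch builds such a lookup from fetched records. It assumes that a similarity value is symmetric and that each pair of entries may be stored in either order; the manual does not state this, so treat it as an assumption. The function names and the experiment names in the example are hypothetical.

```python
def build_similarity_lookup(records, experiment):
    """Index (EXPERIMENT, KEY1, KEY2, VALUE) records of one experiment type.

    The pair of keys is stored as a frozenset, so the order in which the
    two entries were recorded does not matter (assumed symmetric values).
    """
    lookup = {}
    for exp, key1, key2, value in records:
        if exp == experiment:
            lookup[frozenset((key1, key2))] = value
    return lookup


def similarity(lookup, key_a, key_b):
    """Return the stored similarity of two entries, or None if absent."""
    return lookup.get(frozenset((key_a, key_b)))


# Hypothetical records, as they might be read from MATRIXVALS.
example_records = [
    ("RFLP", "A", "B", 85.5),
    ("RFLP", "A", "C", 60.0),
    ("PhenoTest", "A", "B", 90.0),
]
rflp = build_similarity_lookup(example_records, "RFLP")
```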
30.1.11 Table SUBSETMEMBERS
This table contains information about the subsets that
were defined in the database. Each record specifies the
membership of a single entry to a single subset.
•KEY (VARCHAR(80))
The key of the database entry.
•SUBSET (VARCHAR(80))
The name of the subset to which this key belongs.
30.1.12 Table EVENTLOG
This table maintains a history list of events that were
generated during the manipulation of the database.
•DATETIME (VARCHAR(80))
Recording date and time of the event.
•LOGIN (VARCHAR(50))
Windows login at the moment the event was
generated.
•TYPE (VARCHAR(10))
Event type.
•SUBJECT (VARCHAR(50))
Database component for which this event was
generated.
•DESCRIPTION (VARCHAR(500))
Description of the event.
30.1.13 Indices in the database
In order to obtain sufficient speed for larger databases, it
is absolutely necessary that a number of indices are
present. This section contains a list of advised indices.
However, depending on the purpose of the database
(emphasis on read or write, database size...), it may be
preferable to modify, add or remove indices. For larger
databases where speed becomes critical, it is strongly
advised to use the tuning tools provided with the
database in order to optimize the various settings and
indices.
•ENTRYTABLE:
KEY (may be defined as primary key).
•EXPERIMENTS:
EXPERIMENT (may be defined as primary key).
This usually won’t contribute to the performance,
since the number of records in this table is usually
very limited.
•FPRINTFILES:
FILENAME (may be defined as primary key).
•FPRINT:
KEY. It should not be defined as unique or as primary
key, since some lanes on a gel image may not be added
to the database and will have an empty key (e.g.
reference lanes).
FILENAME. Note that this field should not be
required, because some databases may contain
fingerprints that are not associated with any batch
(file).
FILENAME,FILEIDX.
•Character values table:
CHARACTER.
KEY.
•Character fields table:
CHARACTER,FIELD.
•SEQUENCES:
KEY.
•MATRIXVALS:
EXPERIMENT,KEY1,KEY2.
•SUBSETMEMBERS:
KEY.
SUBSET.
30.2 Regular expressions
A "regular expression" is a pattern that describes a set of
strings. Regular expressions are constructed
analogously to arithmetic expressions, by using various
operators to combine smaller expressions. `grep'
understands two different versions of regular
expression syntax: "basic" and "extended". In GNU
`grep', there is no difference in available functionality
using either syntax. In other implementations, basic
regular expressions are less powerful. The following
description applies to extended regular expressions;
differences for basic regular expressions are
summarized afterwards.
The fundamental building blocks are the regular
expressions that match a single character. Most
characters, including all letters and digits, are regular
expressions that match themselves. Any metacharacter
with special meaning may be quoted by preceding it
with a backslash. A list of characters enclosed by `[' and
`]' matches any single character in that list; if the first
character of the list is the caret `^', then it matches any
character *not* in the list. For example, the regular
expression `[0123456789]' matches any single digit. A
range of ASCII characters may be specified by giving the
first and last characters, separated by a hyphen.
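The list, negated list and range forms described so far can be tried out with almost any regular expression engine. The short sketch below uses Python's standard `re` module purely as a stand-in; it is an assumption that its behaviour for these constructs matches the engine used by BioNumerics, and the sample string is hypothetical.

```python
import re

sample = "gel07.tif"

# A list of characters in brackets matches any single character in the list.
digits_listed = re.findall(r"[0123456789]", sample)

# The same set written as a range (first and last character, hyphen between).
digits_ranged = re.findall(r"[0-9]", sample)

# A caret as the first character of the list negates it.
non_digits = re.findall(r"[^0-9]", sample)

# A metacharacter such as the period is quoted with a backslash.
literal_dots = re.findall(r"\.", sample)
```

Both digit patterns find the same two characters, while the negated list returns everything else, including the period.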
Finally, certain named classes of characters are
predefined, as follows. Their interpretation depends on
the `LC_CTYPE' locale; the interpretation below is that
of the POSIX locale, which is the default if no
`LC_CTYPE' locale is specified.
`[:alnum:]'
Any of `[:digit:]' or `[:alpha:]'
`[:alpha:]'
Any letter:
`a b c d e f g h i j k l m n o p q r s t u v w x y z',
`A B C D E F G H I J K L M N O P Q R S T U V W X Y
Z'.
`[:blank:]'
Space or tab.
`[:cntrl:]'
Any character with octal codes 000 through 037, or
`DEL' (octal code 177).
`[:digit:]'
Any one of `0 1 2 3 4 5 6 7 8 9'.
`[:graph:]'
Any character that is a `[:alnum:]' or a `[:punct:]'.
`[:lower:]'
Any one of `a b c d e f g h i j k l m n o p q r s t u v w x y
z'.
`[:print:]'
Any character from the `[:graph:]' class, plus the
space character.
`[:punct:]'
Any one of `! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^
_` { | } ~'.
`[:space:]'
Any one of `CR FF HT NL VT SPACE'.
`[:upper:]'
Any one of `A B C D E F G H I J K L M N O P Q R S T
U V W X Y Z'.
`[:xdigit:]'
Any one of `a b c d e f A B C D E F 0 1 2 3 4 5 6 7 8 9'.
For example, `[[:alnum:]]' means `[0-9A-Za-z]', except
the latter form is dependent upon the ASCII character
encoding, whereas the former is portable. (Note that the
brackets in these class names are part of the symbolic
names, and must be included in addition to the brackets
delimiting the bracket list.) Most metacharacters lose
their special meaning inside lists. To include a literal `]',
place it first in the list. Similarly, to include a literal `^',
place it anywhere but first. Finally, to include a literal `-',
place it last.
The period `.' matches any single character. The symbol
`\w' is a synonym for `[[:alnum:]]' and `\W' is a
synonym for `[^[:alnum:]]'.
The caret `^' and the dollar sign `$' are metacharacters
that respectively match the empty string at the
beginning and end of a line. The symbols `\<' and `\>'
respectively match the empty string at the beginning
and end of a word. The symbol `\b' matches the empty
string at the edge of a word, and `\B' matches the empty
string provided it's not at the edge of a word.
A regular expression may be followed by one of several
repetition operators:
`?'
The preceding item is optional and will be matched at
most once.
`*'
The preceding item will be matched zero or more
times.
`+'
The preceding item will be matched one or more
times.
`{N}'
The preceding item is matched exactly N times.
`{N,}'
The preceding item is matched N or more times.
`{N,M}'
The preceding item is matched at least N times, but
not more than M times.
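The anchors and repetition operators can be combined to validate a complete string. The sketch below again uses Python's `re` module as a stand-in engine, which is an assumption; note that Python does not support the named classes such as `[[:alnum:]]` or the `\<` and `\>` word anchors, so explicit ranges and `\b` are used instead. The key format checked here is invented for the example.

```python
import re

# Hypothetical key format, for illustration only: two or three uppercase
# letters followed by four or more digits, e.g. "BN0042".
key_pattern = re.compile(r"^[A-Z]{2,3}[0-9]{4,}$")


def is_valid_key(text):
    """Return True if the whole string has the hypothetical key format."""
    return key_pattern.search(text) is not None


# `?` makes the preceding item optional (matched at most once).
optional_u = re.compile(r"colou?r")

# `\b` matches at the edge of a word, so only the whole word "gel" is found.
word_gel = re.compile(r"\bgel\b")
```

Without the `^` and `$` anchors, the first pattern would also accept strings that merely contain a valid key somewhere inside them.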
Two regular expressions may be concatenated; the
resulting regular expression matches any string formed
by concatenating two substrings that respectively match
the concatenated subexpressions.
Two regular expressions may be joined by the infix
operator `|'; the resulting regular expression matches
any string matching either subexpression.
Repetition takes precedence over concatenation, which
in turn takes precedence over alternation. A whole
subexpression may be enclosed in parentheses to
override these precedence rules.
The backreference `\N', where N is a single digit,
matches the substring previously matched by the Nth
parenthesized subexpression of the regular expression.
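The precedence rules and the backreference are easiest to see by comparing patterns that differ only in their parentheses. The following sketch uses Python's `re` module as a stand-in engine (an assumption, as before); the sample words are arbitrary.

```python
import re

# Alternation has the lowest precedence: this reads as
# "gel at the start" OR "lanes at the end".
loose = re.compile(r"^gel|lanes$")

# Parentheses override precedence: "gel" or "lane", then "s", whole string.
grouped = re.compile(r"^(gel|lane)s$")

# Repetition binds tighter than concatenation: only the "b" is repeated.
repeat_last = re.compile(r"^ab+$")

# With parentheses, the whole subexpression "ab" is repeated.
repeat_group = re.compile(r"^(ab)+$")

# Backreference: \1 must match the same text as the first
# parenthesized subexpression, so this finds a doubled word.
doubled_word = re.compile(r"\b([a-z]+) \1\b")
```

For example, `loose` accepts "gelatin" because the first alternative only requires the string to start with "gel", whereas `grouped` rejects it.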
Index
Numerics
2D background subtraction 204
2D gel image settings dialog box 208
2D Gel Types 7, 29, 201
2D gel types 203
3D view of a zoomed area of a 2D gel 207
A
Absolute value 236
Add (Netkey) 17
Add array of characters 58
Add color 58
Add new calibration rectangle tool 212
Add new entries 24
Add new experiment file 59, 69
Add new reference system 214
Add spot tool 211
Advanced query tool 88
After correction 206
Align external branch 135
Align internal branch 135
amino acid sequences 68
Analysis 19
Analyze 7, 23, 24, 87
Analyzing 2D gel spot tables with GeneMaths 236
Apply to tone curve 213
Area sensitive (coefficient) 107
Arithmetic average 38
Arrange by similarity 193
Assembler 70
Assign to reference system 214
Assignment of Metrics 204
Attachment 88
Auto construct tables 256
Auto create (.mdb) 254
Auto insert notification 13
Automatic spot search 209
Automatic spot search dialog box 209
Average (K-means) 114
Average similarities (jackknife) 115
Average thickness 52
Averaging thickness (curves) 38
B
Background color 208
Background subtraction 30, 38, 203, 208
Background subtraction (2-D image) 35
Background subtraction (BNIMA) 61
Ball size 61
Band class filters 101
Band classes > Add new band class 100
Band classes > Assign band to class 100
Band classes > Auto assign bands to class 100
Band classes > Center class position 100
Band classes > Remove band class 100
Band classes > Remove band from class 100
Band finding (settings) 42
Band height 103
Band matching 97
Band search filters 42
Band search, shoulder sensitivity 43
Band surface 103
Bandmatching 95
Bandmatching > Auto assign all bands to all classes 104
Bandmatching > Band class filter 101
Bandmatching > Comparative Quantification settings 103
Bandmatching > Export bandmatching 101
Bandmatching > Perform band matching 98, 103, 104
Bandmatching > Polymorphic bands only (for selection
list) 104
Bandmatching > Polymorphic classes only (for selection
list) 101
Bandmatching > Search band classes 101
Bands 32, 44, 121
Bands (assigning) 42
Bands > Add new band 43
Bands > Auto search bands 43
Bands > Delete selected band(s) 45
Bands > Mark band(s) as certain 45
Bands > Mark bands as uncertain 45
Binary coefficient 143
Binary coefficients 125
BioNumerics 2D 201
BioNumerics Server 245
Bitmap export 118
BNIMA 61
Bootstrap analysis 113
Build (connected databases) 255
Bundles 241
Bypass normalization 48
C
Calculate > Experiment correlations 118
Calculate > Similarity plot 119
Calculate quality quotients 196
Calibration 204, 212
Calibration > Change calibration value 212
Calibration > Image calibration 212
Calibration curve 54
Canberra metric 125
Case sensitive 88
Case sensitive (client/server) 248
Categorical coefficient 125, 144
Cells > Add disk to mask 62
Cells > Add pixels to mask 62
Cells > Add selected 62
Cells > Edit color scale 63
Cells > Remove pixels from mask 62
Change access (Netkey) 18
Change entry key 24
Change towards end of fingerprint 98, 108
Change valid range 238
Changing fingerprint type 50
Changing sequences in a multiple alignment 133
Character > Change character range value 57
Character file, new 59
Character import, from TIFF 60
Character Types 7, 29, 83, 201
Character types 57
Character value (query) 88
Characters > Add new character 57, 59
Characters > Order characters by component 167, 168
Characters > Use character for comparisons 58
Check table structure 256
Client functions of BioNumerics 245
Clip at max. value 218
Clip values at extreme points 213
Cluster analysis 141
Cluster analysis (similarity matrix) 107, 111, 113, 121,
125, 128, 143, 163, 170, 235
Cluster cutoff method 113
Clustering 95, 128, 129
Clustering > Bootstrap analysis 113
Clustering > Calculate cophenetic correlations 113
Clustering > Calculate error flags 112, 113
Clustering > Collapse/expand branch 110
Clustering > Congruence of experiments 118
Clustering > Maximum likelihood cluster analysis 149
Clustering > Maximum parsimony cluster analysis 148
Clustering > Reroot tree 110, 111
Clustering > Select root 110
Clustering > Swap branches 110
Clustering > Tolerance & optimization analysis 122
Clustering and statistical analysis of 2D gels 235
Clustering of characters 125
Clustering of fingerprints 121
Clustering of sequences 127
Clustering, global alignment 133
Color scale (BNIMA) 61
Combine using OR 228
Comparative quantification 97
Comparing 2D gels 223
Comparison > Chart / Statistics 183
Comparison > Compare two entries 93
Comparison > Create new comparison 93
Complete linkage 108
Component type 165
Composite > Calculate clustering of characters 146
Composite > Calculate consensus matrix 145
Composite > Discriminative characters 104, 142
Composite > Export character table 103, 142
Composite > Show quantification (colors) 103, 105
Composite > Show quantification (values) 103
Composite > Sort by character 105, 142
Composite data set 141
Composite spot query in the 2D gel type window 230
Concentration 103
Condition 219
Conglomerate spot separation 210
Congruence between techniques 118
Connect to an existing connected database 254
Connected database 20
Connected Databases 253
Consensus match 130
Consensus sequence 127
Consider absent values as zero 57
Contour palette (I) 207
Contour palette (II) 207
Contour palette (III) 207
Contour palettes 206
Conversion to binary 125, 236
Cophenetic correlation 112
Copy from character (color) 58
Copy to all characters (color) 58
Copy to character (color) 58
Copy to clipboard (log file) 22
Correct for internal weights 144
Correction parameters (sequence clustering) 129
Correlation type 119
Cosine coefficient 107, 125
Cost table (parsimony) 148
Create a new, empty connected database 254
Create character (ODBC) 252
Create from database field 111
Create keys from tag 68, 69
Create new 2D gel type 203
Create new fingerprint type 32
Creating 2D spot queries 227
Creating a reference system (2D gels) 214
Creating landmarks for normalization 215, 220
Creating synthetic gels 238
Crop > Add new crop 31
Crop > Delete selected crop 31
Crop > Rotate selected crop 31
Cropped 31
Cubic spline 213
Cup type 58
Curves 32, 53
Curves > Spectral analysis 39
D
Database > Add all lanes to database 50
Database > Add lane to database 51
Database > Add new entries 24, 49, 50, 219, 257
Database > Add new information field 24, 219
Database > Change entry key 24
Database > Change fingerprint type of lane 50
Database > Connected databases 255
Database > Download selection list 248
Database > Identify selection list 249
Database > Link lane 50
Database > ODBC link > Configure external database
link 251
Database > ODBC link > Copy from external database
251
Database > ODBC link > Download field from external
database 252
Database > ODBC link > Select list from external database 252
Database > Remove all links 51
Database > Remove entry 24
Database > Remove information field 24
Database > Remove link 51
Database > Remove unlinked entries 24
Database > Rename information field 24
Database > Search 248
Database > Send bundle to server 248
Database > Upload comparison settings 250
Database construction 205
Database directory 20
Database field (query) 88
Database field range (query) 88
Database fields 68, 69
Database settings 20
Database sharing tools 241
Databases 9
Decrease zero level 206
Defining a new 2D experiment type 203
defining metrics 216
Defining reference spots 214
Degree 216
Degree (congruence of techniques) 119
Delete (Netkey) 18
Delete database 20
Demo2D 202, 203, 224
Demobase 13, 23
Densitometric curves 38, 51
Densitometric values (BNIMA) 61
Details (bundle) 242
Dice 107, 108, 122, 125
Different bands (coefficient) 107
Differential expression 228
Dimensioning > Multi-dimensional scaling 163
Dimensioning > Principal Components Analysis 164,
168
Dimensioning > Self organizing map 169
Direct linkage of two spots in the 2D gel matching window 223
Discard unknown bases 129, 134
Disconnect (Netkey) 18
Discriminants (with variance) 168
Discriminants (without variance) 168
Divide by variance 168
Divide by variance (PCA) 165
DNS Configuration 16
DNS host name 16
Do not create keys 68
Drag-and-drop sequence alignment 131
Drawing tool (add pixels) 211
Drawing tool (remove pixels) 211
Duplicate keys 50
Dynamical preview 33, 52
E
Edit > Arrange entries by database field 96
Edit > Arrange entries by field 26
Edit > Arrange entries by field (numerical) 26
Edit > Arrange entries by similarity 193
Edit > Bring selected entries to top 87, 104
Edit > Change brightness & contrast 33, 36, 52
Edit > Change key 263
Edit > Clear selection list 87, 88, 92
Edit > Copy selection 93, 96, 110, 195
Edit > Cut selected gel from matching 226
Edit > Cut selection 93, 95, 96, 101, 110, 134, 148
Edit > Delete current (subset) 93
Edit > Delete selection 93
Edit > Edit tone curve 36, 205, 207
Edit > Freeze left pane 27, 96
Edit > Load default settings 47, 209
Edit > Paste entries from clipboard 224
Edit > Paste entries from clipboard 230, 233
Edit > Paste selection 93, 95, 96, 101, 110, 134, 193, 195
Edit > Previous page 116
Edit > Redo 32
Edit > Redo last action 211
Edit > Rename current (subset) 93
Edit > Rescale curves 39, 45
Edit > Save as default settings 47, 209
Edit > Search entries 45, 87
Edit > Set database field length 27
Edit > Settings 34, 38, 42
Edit > Settings (BNIMA) 61, 62, 64
Edit > Settings (fingerprints) 39
Edit > Show value scale (BNIMA) 61
Edit > Spot info 210, 218
Edit > Undo 32
Edit > Undo last action 211
Edit > Zoom in 32, 116
Edit > Zoom out 32, 116
Edit Calibration curve button 212, 213
Edit database fields 169
Edit image (BNIMA) 61
Editing reference systems 237
EMBL format 68
Embossed view 207
Enable log files 21
Enable the use of log files 20
Enhance dark bands 206
Enhance weak bands 206
Enhanced metafile export 118
Enter the maximum deviation 216
enter the maximum deviation 222
Entries > Add new entries 59, 69
Error flags 112
Estimate errors 149
Estimate relative character importance 170
Estimated spot size 210
Euclidean distance 125
EXAMPLES2D 204
Experiment 23, 47, 49, 98
Experiment > Comparison settings 142
Experiment > Correct for internal weights 142, 144
Experiment > Train neural network 199
Experiment > Use for identification 195
Experiment > Use in composite data set 98, 142
Experiment card 58, 83, 84
Experiment presence (query) 88
Experiments > Create new 2D gel type 203
Experiments > Create new character type 56
Experiments > Create new composite data set 97, 142
Experiments > Create new fingerprint type 29
Experiments > Create new matrix type 81
Experiments > Create new sequence type 68
Experiments > Edit experiment type 47, 57
Export band metrics 83
Export normalized band positions 83
Export normalized curve 83
F
Fields > Add new field 58
Fields > Remove field 58
Fields > Rename field 58
Fields > Set field content 58
Fields > Use as default field 58
File > Add experiment file 257
File > Add image to database 31
File > Add new experiment file 30
File > Add new library unit 195
File > Add to database 219
File > Analyze with GeneMaths 126
File > Approved 79
File > Calculation priority settings 109
File > Clear log file 22
File > Convert complexes to groups 162
File > Copy correspondence plot to clipboard 172
File > Copy discriminants to clipboard 172
File > Copy image to clipboard 148, 164, 209
File > Copy image to clipboard (characters) 167
File > Copy image to clipboard (entries) 167
File > Copy page to clipboard 118
File > Create matching window 230, 233, 238
File > create matching window 224
File > Create new bundle 241
File > Create synthetic gel 238
File > Delete experiment file 31, 60, 69
File > Edit character data 85
File > Edit fingerprint data 85
File > Edit library unit 195
File > Edit sequence data 85
File > Exit 23
File > Export 194
File > Export bands (comparison) 95
File > Export character coordinates 167
File > Export database fields 114, 148, 164, 166, 193
File > Export densitometric curves (comparison) 95
File > Export entry coordinates 167
File > Export report to file 196, 197
File > export sequences 138
File > Export similarity matrix 114
File > Import experiment data 68
File > Import experiment file 81
File > Import from external database 252
File > Link to reference gel 52
File > Load configuration 64
File > Load image (BNIMA) 61
File > Lock 21
File > Open additional database 251
File > Open bundle 241
File > Open database 250
File > Open experiment file (data) 31, 51, 52, 59, 69
File > Open experiment file (entries) 48, 50, 59, 69
File > Open reference gel 52, 53
File > Print all pages 117
File > Print correspondence plot 172
File > Print database fields 193
File > Print discriminants 172
File > Print image 149, 164, 209
File > Print image (characters) 167
File > Print image (entries) 167
File > Print preview 116
File > Print report 196, 197
File > Print this page 117
File > Printer setup 117
File > save changes 226
File > Save configuration as 64
File > Statistical analysis 236
File > Tools > Horizontal mirror of TIFF image 32
File > Tools > Vertical mirror of TIFF image 32
File > Update linked information 52, 53
File > Upload file to server 249
File > View 3D image 207
File > View log file 21
Files 51
Filter programs 19
Filtering 38, 208
Filters 51
Finding a subsequence in multiple alignment 133
Fingerprint bands (query) 88
Fingerprint data editor 32, 37, 40, 44
Fingerprint image import window 30
Fingerprint Types 7, 29, 201
Fingerprint types 30, 47
Force through 100% 119
Foreground 109
Foreground color 208
Furhigh 222
Furlow 222
Furthest neighbor (K-means) 114
fuzzy logic 228
Fuzzy logic (coefficient) 107
G
Gap penalty 129, 134
Gaussian filter 208
Gel image tone curve editor 206
GelCompar version 4.x, import from 55
Gelstrip thickness 52
GeneMaths 126, 236
Genescan tables, importing 53
Genus 111, 193
Global alignment 128
Gower 125
Gray zone (bands) 43
Grid > Add new 62
Grid > Delete 62
Grid > Delete selected 62
Grid definition 61
Group > Create from database field 112
Group > Partitioning of groups 114
Group separation statistics 115
Group violation 115
Group violations 116
Groups 111
Groups > Assign selected to 111
Groups > Assign selected to > None 114
Groups > Create from database field 122
Groups > Group separations 115
Groups > Multivariate Analysis of Variance 170
Groups > Partitioning of groups 115
H
Hidden nodes 199
Home directory 9, 19, 20
Homedir 16
HTML 245
Hue only (BNIMA) 61
I
ID code 20, 21
Identification 193, 195
Identification > Create new library 195
Identification > Fast band matching 194
Identification > Identify selected entries 196
Identification against database entries 193
Idle time background 109
Image > Convert to gray scale > Averaged 31
Image > Convert to gray scale > Blue channel 31
Image > Convert to gray scale > Green channel 31
Image > Convert to gray scale > Red channel 31
Image > Invert 31
Image > Load from original 31
Image > Mirror > Horizontal 31
Image > Mirror > Vertical 31
Image > Rotate > 180° 31
Image > Rotate > 90° left 31
Image > Rotate > 90° right 31
Image > Show normalized 225, 230, 233, 238
Image > Show overlap 226
Image > Update normalization 226
Image coloring 207
Image type (BNIMA) 61
Import using ODBC 251
Importing 2D gel image files 203
Increase contrast 206
Increase zero level 206
Info 16
Inserting and deleting gaps in multiple alignment 131
Inspect 19
Install BioNumerics 13
Install Netkey server program 15
Intensity query 227
Intermediate pen size 211
Internal reference markers 52
Interpolation 212
Inverted values 207
IP address 15, 17
J
Jaccard 107, 125
Jackknife 115
Jeffrey’s X 107
Jukes and Cantor 129, 134
K
Kendall's tau 119
Kimura 2 parameter 134
K-means partitioning 114
Kohonen map 168
L
Lane|Move down 51
Lane|Move up 51
Lane|Remove 51
Lanes > Add marker point 52
Lanes > Add new lane 35
Lanes > Auto search lanes 34, 52
Lanes > Copy geometry 53
Lanes > Delete selected lane 36
Lanes > Paste geometry 53
large pen size 211
Layout 163
Layout > Compress (X dir) 99
Layout > Create rooted tree 149
Layout > Display experiments 94
Layout > Enlarge image size 117
Layout > Label query members only 232
Layout > Label with 232
Layout > Optimize branch spread 148
Layout > Preserve aspect ratio 166
Layout > Reduce image size 117
Layout > Rescale curves 95
Layout > Show 3D plot 166
Layout > Show bands 95
Layout > Show branch lengths 148, 150
Layout > Show construction lines 164
Layout > Show curves as images 95
Layout > Show dendrogram 128, 164
Layout > Show densitometric curves 95
Layout > Show distances 111
Layout > Show gel images 232
Layout > Show group colors 148, 164, 166
Layout > Show image 98, 128
Layout > Show keys 164, 166
Layout > Show keys or group numbers 148
Layout > Show matrix 113, 128
Layout > Show matrix rulers 114
Layout > Show metric scale 95, 99
Layout > Show rendered image 164
Layout > Show similarity matrix 117
Layout > Show similarity values 114
Layout > Show space between gelstrips 95, 118
Layout > Show spot info 231
Layout > show spot info 227
Layout > Show table preferences 231
Layout > Similarity shades 114
Layout > Stretch (X dir) 99
Layout > Use colors 117
Layout > Use component as X axis 166
Layout > Use component as Y axis 166
Layout > Use component as Z axis 166
Layout > Use group numbers as key 166
Layout > Use group numbers as keys 148, 164
Layout > Zoom in 95, 99
Layout > Zoom out 95, 99
Least square filtering 38
Library 195
Linking spots with reference spots 216, 222
Listing spots in spreadsheets 230
Local database 20, 263
Local database, converting to connected database 259
Log files 21
Logarithmic 216
Logarithmic Dependence 50
Login name 247
M
MANOVA 170
manual selection 227
Match against selection only (Jackknife) 115
Matching 2D gel spots 205
Matching spots on different gels 223
Matrix Types 7, 29, 201
Matrix types 81
Maximal similarities (jackknife) 115
Maximum difference 194
Maximum likelihood 147
Maximum number of gaps 128, 129
Maximum parsimony 147
Maximum similarity used 119
Maximum value 52, 228, 229
Maximum value (grayscales) 33
MDS 163
Mean intensity 236
Median filter 38
Merge selected spot 211
Metrics > Assign unit 50
Metrics 222
Metrics > Add marker 49
Metrics > Copy markers from reference system 49, 54
Metrics > Cubic spline fit 50
Metrics range of fingerprint 54
Microplate (BNIMA) 60
Minimal area 43
Minimal expression 229
Minimal profiling 43
Minimum consensus percentage 130
Minimum match sequence 128, 129
Minimum profiling 210
Minimum similarity used 119
Minimum spot size 210
Minimum value (grayscales) 33
Mode filter 38
Molecular sizes (defining) 49
Monotonous fit 119
Multi-dimensional scaling 163
multi-level undo function 211
Multiple alignment 127, 128
Multi-state coefficient 143
Multivariate analysis of variance 170
Mutation rate 149
N
Name of the merged gel 238
Navigator 32
Nearest neighbor 115
Nearest neighbor (K-means) 114
Negative search 88, 193
Negative search (client/server) 248
Neighbor Joining 108, 111, 127, 129, 134
Neighbor match 130
Netkey 15
Network 16
Network settings 16, 18
Neural network 198
New character type 56
New database (creating) 20
New fingerprint type 29
New intensity query 228, 229
New matrix type 81
New ODBC 260
New sequence type 68
Normal priority background 109
Normalization 32, 47, 52, 204
Normalization > Add selected spot(s) to reference system 214
Normalization > Auto assign (bands) 41, 52
Normalization > Auto link spots 216, 222
Normalization > Automatically find landmarks 215
Normalization > Delete all assignments 41
Normalization > Show distortion bars 42
Normalization > Show distortion maze 222
Normalization > Show image side by side 220
Normalization > Show normalized view 39, 40, 42, 214, 220
Normalization > Show overlapped images 220
Normalization > Show reference gel 215, 219
Normalization > Show superimposed images 220, 221
Normalization > Show synthetic reference system 219
Normalization > show synthetic reference system 215
Normalization > Unlink selected spot(s) 216
Normalization > Unlink selected spot(s) 222
Normalization > Update normalization 42, 216, 220, 222
Normalization > use spot as landmark 220
Normalization of other 2D gels 219
Normalized view 52
nucleic acid sequences 68
Number of bootstrap simulations 148
Number of columns (character type) 57
Number of groups 114
Number of nodes 52
Number of rows (character type) 57
Numerical coefficient 143
Numerical coefficients 125
Numerical values 56
O
Ochiai 107
ODBC connection string 255
ODBC, import 58
One dimension 103
Only spots in active query 238
Open entry 24
Open gap penalty 128, 129
Optimization 98, 108, 122
Optimization, find best 122
Optimize positions (MDS) 163
Optimize topology 148
Organism 219
Original 31
P
Pairwise alignment 127
Pairwise alignment settings 128
Parsimony 147
Password 247
Paste data from clipboard 64
PCA 95, 163, 164
Pearson correlation 107, 125, 143
Pheno 57
Plate (characters) 58
Plot > Use discriminant as X axis 172
Plot > Use discriminant as Y axis 172
Polymorphism analysis 97
Polynomial 212
Polynomial degree (BNIMA) 64
Port number 16, 245
Position tolerance 98, 108, 122
Position tolerance, find best 122
Preview (band search) 43
Principal Components Analysis 164
Processed 31
Processing 2D gel images 204
Properties 13, 16
Q
Quality quotient 196
Quantification > Add cells to character set 64
Quantification > Assign value 46
Quantification > Band quantification 46
Quantification > Calculate concentrations 47
Quantification > Define calibration point 64
Quantification > Export to clipboard (BNIMA) 64
Quantification > Search all surfaces 46
Quantification > Search surface of band 46
Quantification units 42
Quantification, comparative 97
Quantified value 235
Quantity 228, 229
Queries 88, 230
Queries > Edit query 230
Queries > New manual selection 229, 232
Queries > new spot fields query 229
Queries > Set as active query 232
Queries > Update 228
Query 230
Query > Delete query 230
R
Radius 209
Rainbow palette 33
Rainbow palettes 206
Rank correlation 125
Raw data 35
Reduce contrast 206
Reference > Use as reference lane 39
Reference lane 52
Reference system 40, 213
Reference systems 237
References > Add external reference position 39
References > Add internal reference position 42
References > Copy normalization 53
References > Paste normalization 53
References > Use all lanes as reference lanes 52
Refresh (connected databases) 256
Refsystem > Add selected spot(s) 226
Refsystem > Add selected spots 238
Refsystem > Add to matching window 226
Refsystem > Assign ID code to spots 226
Refsystem > Delete 237
Refsystem > Delete selected spot(s) 226
Refsystem > Delete spot(s) 238
Refsystem > Refresh spots 237
Refsystem > Remove all spots 237
Refsystem > Set as active reference system 224, 237
Refsystem > use selected gel as temporary standard 226
Registry 9
Regression (congruence of techniques) 119
Regression curve 49, 54
Relative band surface 103
Relative to base gel 228
Relative to max. value (bands) 43
Relative usage (Netkey) 18
Relative volume 103
relative volume 235
Relative volume (in%) 228, 229
Remove spot tool 211
Removing common gaps in a multiple alignment 133
Rename (bundles) 242
Replace list (client/server) 248
Represent as List 58
Represent as Plate 58
Resolution of normalized tracks 47
Restricting queries 260
Restricting query 256
Result set 194
Rotated 217
Rotated & curved 217
Rotation 218
Run selected queries 228, 230
S
Save as default calibration 213
Scripts > Browse Internet 54, 59, 68, 121
Search functions 87
Search in list 88, 193
Search in list (client/server) 248
Security driver 15
Security key 15
Select branch into list 110
Self organizing map 168
Send message (Netkey) 18
Send message to all users (Netkey) 18
Sequence > Align external branch 135
Sequence > Align internal branch 135
Sequence > Calculate global cluster analysis 133, 134
Sequence > Change saved sequence 133
Sequence > Consensus blocks 130
Sequence > Consensus difference 130
Sequence > Create consensus of branch 130, 138
Sequence > Create locked group 132
Sequence > Edit 69
Sequence > Find sequence pattern 133
Sequence > Lock / unlock dendrogram branch 132
Sequence > Multiple alignment 129
Sequence > Neighbor blocks 130
Sequence > Paste from clipboard 69
Sequence > Reload sequence from database 133
Sequence > Show global cluster analysis 134
Sequence > Unlock group 132
Sequence Types 7, 29, 84, 201
Sequence types 68
Server 245
Server computer name 16
Server computer name or IP address 245
Set as base gel 228
Settings 13, 19, 20, 21
Settings (Netkey) 18
Settings > Binary conversion settings 58
Settings > Brightness & contrast 48
Settings > Comparative quantification 48
Settings > Edit reference system 49, 54
Settings > Enable fast band matching 194
Settings > Exclude active region 137
Settings > General settings 47, 58, 65
Settings > Include active region 138
Settings > New reference system (curve) 54
Settings > New reference system (positions) 54
Settings > Set as active reference system 54, 55
Settings > Spot quantification settings 235
Settings > Statistics 116
Settings, database 20
Shape/darkness 210
Shoulder sensitivity 43
Shoulder sensitivity, band search 43
Show > Detailed report 197
Show > Identification comparison 197
Show bands 95, 103
Show dendrogram 111, 112, 118, 134
Show matrix 118, 163
Show quantification (colors) 143
Show spot info 218
Similarity 142
Similarity calculation 128, 129
Simple Matching 125
Single linkage 108
small pen size 211
Solving database problems 263
SOM 168
Source file location 256, 257, 260
Spike removal 204, 208
split selected spot 211
Spot area 235
Spot contrast enhancement 210
Spot detection 204, 209
Spot field query 227
Spot information box 218
Spot information pop up window 210
spot intensity measure 228
Spot quantification 218
Spot removal (2-D image) 35
Spot volume 235
Spots > Add to active query 229
Spots > Add to active query 232
Spots > Automatic search 209
Spots > Break link 226
Spots > Delete selected spots 211
Spots > Merge selected spot 211
Spots > Remove from active query 232
Spots > Split selected spot 211
SQL query 194
Standard 48
Standard deviation 112
Standardized characters 143
Start service (Netkey) 17
Startup program 19
Statistics (Netkey) 18
Status (Netkey) 18
Stop Service 18
Stored trees dialog box 155
streak removal 204
Strips 32, 52, 53
Strips > Increase number of nodes 36
Strips > Make larger 36
Strips > Make smaller 36
Subsequence (query) 88
Subsequence search 133
Subsets 92
Subtract average (PCA) 165
Synthetic gels 238
T
Take from experiments 142, 143, 144
Taxon 219
TCP/IP 15
TCP/IP (client server) 245
The 2D Gel data editor window 206
The 2D gel type window 224
The create synthetic gel window 239
The spot description fields window 227
The spot quantification settings dialog box 236
The spot table preference dialog box 231
Thickness (image strips) 35
Tie handling 115
Tolerance 194
Tolerance & optimization statistics 122
Tone curve 36
Two dimensions (quantification) 103
U
Uncertain bands 43, 98, 108
Unit gap penalty 128, 129
UPGMA 108, 111, 115, 129, 143
Use active zones only 134
Use as default database 256
Use conversion cost 129
Use fast algorithm 128, 129
Use quantitative values (PCA) 164
Use square root 125, 143
Use the local database 254
Used range 194
V
Validation samples 199
Vertical only 217
View > show spot outlines 207
View calibration curve (BNIMA) 64
Volume 103
W
Ward 108, 129
Webserver 245
Wthigh 219
Wtlow 205, 214
Z
Zero value 218