Inaugural-Dissertation
zur
Erlangung der Doktorwürde
der
Naturwissenschaftlich-Mathematischen
Gesamtfakultät
der Ruprecht-Karls-Universität
Heidelberg
vorgelegt von
Diplom-Physiker Frederik Orlando Kaster
aus Kirchheimbolanden
Tag der mündlichen Prüfung: 11. Mai 2011
Bildanalyse für die Lebenswissenschaften –
Rechnerunterstützte Tumordiagnostik
und Digitale Embryomik
Gutachter:
Prof. Dr. Fred A. Hamprecht
Prof. Dr. Wolfgang Schlegel
Dissertation
submitted to the
Combined Faculties for the Natural Sciences and for Mathematics
of the Ruperto-Carola University of Heidelberg, Germany
for the degree of
Doctor of Natural Sciences
Put forward by
Diplom-Physiker Frederik Orlando Kaster
Born in: Kirchheimbolanden
Oral examination: May 11, 2011
Image Analysis for the Life Sciences –
Computer-assisted Tumor Diagnostics
and Digital Embryomics
Referees:
Prof. Dr. Fred A. Hamprecht
Prof. Dr. Wolfgang Schlegel
Zusammenfassung
Die moderne lebenswissenschaftliche Forschung erfordert die Analyse einer derart
großen Menge von Bilddaten, dass sie nur noch automatisiert bewältigt werden kann.
Diese Arbeit stellt einige Möglichkeiten vor, wie automatische Mustererkennungsverfahren zu verbesserter Tumordiagnostik und zur Entschlüsselung der Embryonalentwicklung von Wirbeltieren beitragen können.
Kapitel 1 untersucht einen Ansatz, wie räumliche Kontextinformation zur verbesserten Schätzung von Metabolitenkonzentrationen aus Magnetresonanzspektroskopiebildgebungs-(MRSI-)Daten zwecks robusterer Tumorerkennung verwendet werden
kann, und vergleicht diesen mit einem neuen Alternativverfahren.
Kapitel 2 beschreibt eine Softwarebibliothek zum Training, Testen und Validieren von
Klassifikationsalgorithmen zur Schätzung von Tumorwahrscheinlichkeiten an Hand
von MRSI-Daten. Diese ermöglicht die Anpassung an geänderte experimentelle Bedingungen, den Vergleich verschiedener Klassifikatoren sowie Qualitätskontrolle: dafür
ist kein Expertenwissen aus der Mustererkennung mehr erforderlich.
Kapitel 3 untersucht verschiedene Modelle zum Lernen von Tumorklassifikatoren unter Berücksichtigung der in der Praxis häufig auftretenden Unzuverlässigkeit menschlicher Segmentierungen. Zum ersten Mal werden Modelle für diese Klassifikationsaufgabe verwendet, welche zusätzlich die objektive Information aus den Bildmerkmalen
nutzen.
Kapitel 4 enthält zwei Beiträge zu einem Bildanalysesystem für die automatisierte
Rekonstruktion der Entwicklung von Zebrabärbling-Embryonen an Hand von zeitaufgelösten Mikroskopiebildern: Zwei Verfahren zur Zellkernsegmentierung werden experimentell verglichen, und ein Verfahren zur Verfolgung von Zellkernen über die
Zeit wird vorgestellt und ausgewertet.
Abstract
Current research in the life sciences involves the analysis of such huge amounts of
image data that automation is required. This thesis presents several ways in which
pattern recognition techniques may contribute to improved tumor diagnostics and to
the elucidation of vertebrate embryonic development.
Chapter 1 studies an approach for exploiting spatial context for the improved estimation of metabolite concentrations from magnetic resonance spectroscopy imaging
(MRSI) data with the aim of more robust tumor detection, and compares against a
novel alternative.
Chapter 2 describes a software library for training, testing and validating classification algorithms that estimate tumor probability based on MRSI. It allows flexible
adaptation towards changed experimental conditions, classifier comparison and quality control without need for expertise in pattern recognition.
Chapter 3 studies several models for learning tumor classifiers that allow for the
common unreliability of human segmentations. For the first time, models are used
for this task that additionally employ the objective image information.
Chapter 4 encompasses two contributions to an image analysis pipeline for automatically reconstructing zebrafish embryonic development based on time-resolved microscopy: Two approaches for nucleus segmentation are experimentally compared, and
a procedure for tracking nuclei over time is presented and evaluated.
Acknowledgments
First of all, I would like to thank my supervisor Prof. Dr. Fred Hamprecht for the
opportunity to conduct the research for this PhD thesis in his research group and
for his constant advice during the last years. I thank Dr. Ullrich Köthe for his
helpful advice concerning various areas of image processing, pattern recognition and
software development. I thank my predecessors Dr. Björn Menze and Dr. Michael
Kelm for their previous work on MRSI analysis, which paved the ground for parts
of the research presented in this thesis, and for their helpful advice on the MRSI
quantification and tumor segmentation projects. Dr. Björn Menze provided one of
the expert label sets for the evaluation in chapter 1, and performed the registration of
the real-world radiological data sets studied in chapter 3. Dr. Michael Kelm proposed
the spatially regularized MRSI quantification approach that is validated in chapter
1, and implemented large parts of the software foundation that was required
for bringing the MRSI classification library presented in chapter 2 into clinical use.
I thank Xinghua Lou, Martin Lindner and Bernhard Kausler for the productive
collaboration on the zebrafish digital embryo project: Xinghua Lou developed one
of the segmentation methods evaluated in chapter 4, Martin Lindner implemented
the routines for the computation of the features required for the tracking procedure
and Bernhard Kausler provided manual ground truth for the tracking evaluation.
The other segmentation method studied in chapter 4 as well as the visualization
functionality for segmentation validation was provided via the ILASTIK software
developed by Dr. Christoph Sommer, Christoph Straehle and Dr. Ullrich Köthe: I
thank them for their help with the usage and customization of this software. I thank
Stephan Kassemeyer for helping with the implementation of the software described
in chapter 2. Furthermore I thank all the other present and former members of the
Multidimensional Image Processing group for the good group climate, for the lively
discussions and for the help on various technical and scientific questions, namely
Björn Andres, Sebastian Boppel, Joachim Börger, Luca Fiaschi, Jörg Greis, Matthias
Griessinger, Dr. Michael Hanselmann, Nathan Hüsken, Dr. Marc Kirchner, Anna
Kreshuk, Thorben Kröger, Rahul Nair, Dr. Bernhard Renard, Martin Riedl, Jens
Röder, Patrick Sauer, Christian Scheelen, Björn Voss, Andreas Walstra and Matthias
Wieler, as well as all the other researchers at the Heidelberg Collaboratory for Image
Processing.
During my PhD research time, I was closely affiliated with the Software Development for Integrated Diagnostics and Therapy (SIDT) group of the German Cancer
Research Center. I thank Prof. Dr. Wolfgang Schlegel for all of the financial and
academic support I received from the German Cancer Research Center. I thank the
former and present heads of the SIDT group, Dr. Oliver Nix and Dr. Ralf Floca,
for their advice particularly on questions of software development. Furthermore I
thank all the group members for the good group climate, the lively discussions and
the help on various technical questions, namely Markus Graf, Dr. Martina Hub,
Andreas Jäger, Dr. Sarah Mang, Hermann Prüm, Dirk Simon, Dörte van Straaten
and Lanlan Zhang.
Interdisciplinary projects as presented in this thesis would not have been possible
without close interaction with the medical and biological collaborators. From the
Radiological University Clinic of Heidelberg, I thank Dr. Marc-André Weber for
providing the brain tumor images analyzed in chapter 3. From the Radiology group
of the German Cancer Research Center, I thank Dr. Christian Zechmann, Dr. Patrik Zamecnik, Dr. Lars Gerigk, Dr. Bram Stieltjes and Dr. Christian Thieke for
providing magnetic resonance spectroscopy imagery and expert annotations for the
evaluation of the software presented in chapter 2 and for helpfully commenting upon
the software interfaces from a clinical users’ point of view. I also thank Bernd Merkel
and Markus Harz from the Fraunhofer MeVis Institute for Medical Image Computing
Bremen for developing the graphical user interface that appears in the screenshots
in that chapter. For the acquisition of and their helpful comments upon the MRSI
spectra analyzed in chapter 1, I thank Prof. Dr. Peter Bachert, Sarah Snyder and
Benjamin Schmitt from the Medical Physics in Radiology group of the German Cancer Research Center. From the Institute for Zoology at the University of Heidelberg,
I thank Prof. Dr. Joachim Wittbrodt and Burkhard Höckendorf for acquiring the
zebrafish microscopy images analyzed in chapter 4. From the Computer Graphics
and Visualization group of the University of Heidelberg, I thank Prof. Dr. Heike
Jänicke for providing software for the visualization of these data.
For their support in all administrative affairs, I would like to thank Barbara Werner,
Stephanie Lindemann, Simone Casula, Sarina Faulhaber, Evelyn Verlinden and Karin
Kruljac.
I gratefully acknowledge the financial support by the Helmholtz International Graduate School for Cancer Research, the Federal Ministry of Education and Research
(BMBF) and the Heidelberg Graduate School of Mathematical and Computational
Methods for the Sciences (HGS MathComp).
My final thanks go to Hans and Elfriede Botz for being the best landlord and landlady
one could wish for, and to my family and friends for their constant love and emotional
backing during these trying years.
Contents
1. MRSI quantification with spatial context   19
   1.1. Introduction and motivation   19
   1.2. Background: Magnetic resonance spectroscopic imaging (MRSI)   20
   1.3. Quantification with spatial context   27
   1.4. Related work   29
   1.5. Experimental setup   31
   1.6. Preliminary evaluation by single rater (unblinded)   33
   1.7. Decisive evaluation by two raters (blinded) and results   35
   1.8. Alternative proposal: Regularized initialization by graph cuts   37

2. Software for MRSI analysis   47
   2.1. Introduction and motivation   47
   2.2. Background: Supervised classification   51
   2.3. Related work   57
   2.4. Software architecture   58
        2.4.1. Overview and design principles   58
        2.4.2. The classification functionality   60
        2.4.3. The preprocessing functionality   66
        2.4.4. The parameter tuning functionality   67
        2.4.5. The statistics functionality   70
        2.4.6. The input / output functionality   72
        2.4.7. User interaction and graphical user interface   74
   2.5. Case studies   75
        2.5.1. Exemplary application to 1.5 Tesla data of the prostate   75
        2.5.2. Extending the functionality with a k nearest neighbors classifier   78

3. Brain tumor segmentation based on multiple unreliable annotations   85
   3.1. Introduction and motivation   85
   3.2. Background   86
        3.2.1. Imaging methods for brain tumor detection   86
        3.2.2. Variational inference for graphical models   89
   3.3. Related work   97
        3.3.1. Automated methods for brain tumor segmentation   97
        3.3.2. Learning from unreliable manual annotations   111
   3.4. Modelling and implementation   114
        3.4.1. Novel hybrid models   114
        3.4.2. Inference and implementation   116
   3.5. Experiments   117
        3.5.1. Experiments on simulated brain tumor measurements   118
        3.5.2. Experiments on real brain tumor measurements   122
   3.6. Results   122
        3.6.1. Simulated brain tumor measurements   122
        3.6.2. Real brain tumor measurements   125

4. Live-cell microscopy image analysis   127
   4.1. Introduction and motivation   127
   4.2. Background   131
        4.2.1. The zebrafish Danio rerio as a model for vertebrate development   131
        4.2.2. Digital scanned laser light-sheet fluorescence microscopy (DSLM)   133
        4.2.3. Integer linear programming   134
   4.3. Related work   137
        4.3.1. Cell lineage tree reconstruction   137
        4.3.2. Cell or nucleus segmentation   138
        4.3.3. Cell or nucleus tracking   140
   4.4. Experimental comparison of two nucleus segmentation schemes   142
        4.4.1. Introduction   142
        4.4.2. Evaluation methodology   142
        4.4.3. Results for feature selection and evaluation   148
   4.5. Cell tracking by integer linear programming   154
        4.5.1. Methodology   154
        4.5.2. Experimental results   156

5. Final discussion and outlook   161
   5.1. MRSI quantification with spatial context   161
   5.2. Software for MRSI analysis   162
   5.3. Brain tumor segmentation based on multiple unreliable annotations   164
   5.4. Live-cell microscopy image analysis   165

List of Symbols and Expressions   172
List of Figures   173
List of Tables   175
Bibliography   177
Prologue
Computers are of ever-increasing importance for today’s life sciences. Their influence
is most established in genomics, where they were crucial for sequencing e.g. the
human genome (Lander et al., 2001), and in proteomics, where they can be used
in order to identify the proteins that are present in a biological sample (Colinge &
Bennett, 2007). In general, their use is unavoidable whenever one encounters data
sets that are too large for manual analysis. These data-intensive areas are typically
designated with the suffix “-omics”: besides genomics and proteomics, there are e.g.
connectomics, which studies the connections between all the neurons in a brain
(Lichtman et al., 2008), embryomics, which deals with the detailed study of embryonic
development on a cellular level (Bourgine et al., 2010), and glycomics, which studies the
interactions between the polysaccharides covering the cellular membranes (Raman
et al., 2005). Recently, the same computer-based high-throughput data analysis
techniques have even transcended the boundaries of the life sciences, and have been
fruitfully employed to study cultural trends by analyzing the usage frequencies of
words and word sequences in digitized books from different time points, leading to
the term “culturomics” (Michel et al., 2010).
While biological data can be structured in various ways (e.g. as sequences, trees,
graphs or relational databases), this thesis concentrates on image data which show
the spatial distribution of some interesting quantity. In the simplest case, each point
in space is associated with a single scalar value, e.g. the intensity of emitted light.
For multispectral or multimodal data, every point is associated with several scalar
values: these may be e.g. the intensities of light emitted at different wavelengths.
Most image data in the life sciences come from either of two sources:
• Medical images (Duncan & Ayache, 2000) are important for basic research,
applied clinical research and routine diagnostics of diseases. Different physical mechanisms are exploited to gain information about the interior tissues
of living humans or animals: e.g. X-ray attenuation (computed tomography),
radiofrequency emission due to the relaxation of excited nuclei in a magnetic
field (magnetic resonance imaging) or ultrasound scattering.
• Microscopy images (Rittscher, 2010) are mainly important for basic research,
although they also have relevance for e.g. drug discovery (toxicity assays).
Living (in vivo) or prepared (in vitro) tissues or organisms are illuminated
with either visible light or an electron beam, and magnified images are created
using a lens system.
Chapters 1 – 3 deal with applications from medical image analysis, while a microscopy
image analysis task is studied in chapter 4.
Computerized image analysis answers questions such as:
• Classification: Does a certain location in the image belong to a foreground
class (e.g. a cell) or a background class?
• Object detection: Where is an interesting foreground object roughly located
in the image?
• Segmentation: Exactly which pixels (2D) or voxels (3D) belong to a particular contiguous foreground object?
• Tracking: If images are acquired at different time points, how do several
objects in the image move over time?
• Registration: If several independent images are acquired which all show different aspects of an object, how can they be fused into a single multispectral
image so that all points corresponding to the same location are matched to the
same pixel or voxel?
In some cases, foreground and background can be discriminated by a simple criterion
such as the absolute gray value of an image. More often, they differ in a more complicated way, and human experts are able to tell the two classes apart without being
able to state explicit rules on which they base their decisions. Pattern recognition
techniques make it possible to learn these rules automatically from example images together
with annotations (or labels) provided by the human experts. This allows the use of
generic techniques to solve a huge variety of specific image analysis tasks:
often all task-specific information may be learned from a moderate set of annotated
training data.
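To make this concrete, here is a minimal sketch (not any method actually used in this thesis) of how a generic classifier can learn a foreground/background rule from annotated example pixels; the two features and the class distributions are purely synthetic, and the nearest-centroid rule stands in for the more powerful classifiers discussed later:

```python
import numpy as np

def train_centroids(features, labels):
    """Learn one mean feature vector (centroid) per class from labeled pixels."""
    classes = np.unique(labels)
    return classes, np.array([features[labels == c].mean(axis=0) for c in classes])

def classify(features, classes, centroids):
    """Assign each pixel to the class with the nearest centroid."""
    # distances has shape (n_pixels, n_classes)
    d = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(d, axis=1)]

# Toy data: 2-D feature vectors per pixel (e.g. gray value, gradient magnitude)
rng = np.random.default_rng(0)
fg = rng.normal([5.0, 2.0], 0.5, size=(100, 2))   # "foreground" training pixels
bg = rng.normal([1.0, 0.5], 0.5, size=(100, 2))   # "background" training pixels
X = np.vstack([fg, bg])
y = np.array([1] * 100 + [0] * 100)               # expert annotations

classes, centroids = train_centroids(X, y)
pred = classify(X, classes, centroids)
accuracy = (pred == y).mean()
```

The task-specific knowledge lives entirely in the annotated examples; the training and prediction code stays generic.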
Chapter 1.
Experimental evaluation of MRSI
quantification techniques using spatial
context
1.1. Introduction and motivation
Tumor tissue can be distinguished from healthy tissue by its characteristic biochemical makeup, i.e. by the increase or depletion of characteristic metabolites due to
the idiosyncrasies of tumor metabolism. Magnetic resonance spectroscopy imaging
(MRSI) is a noninvasive technique by which the biochemical composition of tissues
can be studied in the living body (in vivo) in a spatially resolved manner. Extracting the local metabolite concentrations from the MRSI signal is called quantification.
This chapter deals with different approaches by which quantification may be improved
by exploiting the spatial smoothness of the MRSI data: rather than considering the
spectrum in each voxel on its own, prior assumptions can be imposed that neighboring voxels should yield similar quantification results, and it is a plausible hypothesis
that this will lead to a more robust estimation. As is shown in the following, it is
experimentally preferable to impose the smoothness prior in a separate initialization
stage, in which the theoretically predicted spectra are roughly aligned to the data,
rather than in the actual estimation step.¹

¹ Parts of this chapter form part of (Kelm et al., 2011).
1.2. Background: Magnetic resonance spectroscopic
imaging (MRSI)
Nuclear magnetism  MRSI is a medical imaging² modality that makes use of the
Zeeman splitting of nuclear energy states in an external magnetic field. The following
exposition concerns common knowledge; see e.g. (de Graaf, 2008) for a good introductory text. Consider a nucleus $^A_Z\mathrm{X}$ (i.e. $A$ nucleons, $Z$ protons) with the nuclear
spin $\vec I$: the associated magnetic moment is
\[ \vec\mu = g_I \frac{e}{2Mc}\,\vec I = \gamma\,\vec I, \tag{1.1} \]
where $g_I$ denotes the nuclear g-factor, $M$ denotes the nuclear mass and $\gamma$ the gyromagnetic ratio. For the nuclear state characterized by the quantum numbers $I$ and
$m_I$ (with $m_I \in \{-I, -I+1, \dots, I-1, I\}$), the expectation values of the squared
magnitude of the magnetic moment and its $z$-component are given by
\[ \langle \mu^2 \rangle = \gamma^2 \hbar^2 I(I+1), \qquad \langle \mu_z \rangle = \gamma \hbar m_I. \tag{1.2} \]
For most stable nuclei, both $A$ and $Z$ are even, and the nuclear spin $I$ equals zero
in the ground state. Very few stable nuclei (e.g. deuterium) have an even $A$ and an
odd $Z$, which leads to an integral value for $I$. Of the highest relevance for MRI are
stable nuclei with an odd $A$, for which $I$ takes a half-integral value (e.g. $^1$H, $^{13}$C,
$^{19}$F, $^{23}$Na or $^{31}$P).
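As a quick sanity check, the proton gyromagnetic ratio implied by Eq. (1.1) can be evaluated numerically. The sketch below uses the SI form of the formula (so the factor $1/c$ of the Gaussian-unit expression drops out); the proton g-factor and the constants are CODATA values rounded to a few digits:

```python
# Sanity check of the 1H gyromagnetic ratio implied by Eq. (1.1), in SI units.
# Constants are rounded CODATA values (assumed, not taken from this thesis).
e   = 1.602e-19      # elementary charge [C]
m_p = 1.6726e-27     # proton mass [kg]
g_p = 5.586          # proton g-factor (dimensionless)
pi  = 3.141592653589793

gamma = g_p * e / (2 * m_p)      # gyromagnetic ratio [rad/(s T)]
gamma_bar = gamma / (2 * pi)     # [Hz/T]; comes out near the 42.6 MHz/T quoted for 1H
```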
Equilibrium magnetization  In the absence of an external magnetic field, all nuclear states corresponding to the $2I+1$ different quantum numbers $m_I$ are degenerate
and hence equally populated in thermal equilibrium. However, once an external field
$B_0$ is applied along the $z$-axis, Zeeman splitting occurs:
\[ E = -\mu_z B_0 = -\gamma \hbar m_I B_0. \tag{1.3} \]
In the following we restrict ourselves to discussing the case of protium ($^1$H), for
which $I = 1/2$ and $\gamma = 2\pi \times 42.6\,\mathrm{MHz/T}$. Due to its high gyromagnetic ratio and
its high natural abundance, this is the most sensitive nucleus for MR measurements.
There are two Zeeman states ($m_I = 1/2$, i.e. parallel to the external field, and
$m_I = -1/2$, i.e. antiparallel to the field). For a sample of matter (e.g. a human
² To be precise, while medical imaging is the most important application, other applications exist,
e.g. in food safety monitoring, non-destructive industrial testing or analyzing the composition
of crude oil.
body), let $n_{\uparrow\uparrow}$ and $n_{\uparrow\downarrow}$ denote the numbers of nuclei in these two states. Then in
thermal equilibrium,
\[ \frac{n_{\uparrow\uparrow}}{n_{\uparrow\downarrow}} = \exp\left(\frac{\hbar\gamma B_0}{k_B T}\right) \approx 1 + \frac{\hbar\gamma B_0}{k_B T} \quad \text{for small } B_0. \tag{1.4} \]
It should be noted that the relative excess is small: e.g. for realistic values ($B_0 =
1.5\,\mathrm{T}$, $T = 300\,\mathrm{K}$), the ratio is $n_{\uparrow\uparrow}/n_{\uparrow\downarrow} = 1 + 3\times10^{-6}$. However, this minute excess
is responsible for the macroscopic magnetization of the protons in the sample:
\[ M_0 = (n_{\uparrow\uparrow} - n_{\uparrow\downarrow}) \cdot \mu_z \approx \frac{(\gamma\hbar)^2}{4 k_B T}\, N B_0, \tag{1.5} \]
where $N = n_{\uparrow\uparrow} + n_{\uparrow\downarrow}$ is the total number of $^1$H nuclei. At thermal equilibrium,
the gross magnetization is completely aligned with the external field and no net
transverse magnetization occurs (although the magnetic moments of the single spins
precess around the external field, their precession is completely dephased, so that
the transversal components of the magnetic moments cancel out).
Energy transitions by radio-frequency irradiation  Transitions between the different energy levels can be driven by exciting the sample with electromagnetic radio-frequency (RFr) radiation near the resonance (or Larmor) frequency of $f_0 = \gamma B_0/2\pi$
(42.6 MHz/T for $^1$H, corresponding to a wavelength of $7\,\mathrm{m \cdot T}/B_0$), which can be
generated by a transmitter coil. The irradiated RFr field must be orthogonal to the
main external $\vec B_0$ field:
\[ \vec B_1(t) = B_1 \cos(2\pi f t)\,\vec e_x + B_1 \sin(2\pi f t)\,\vec e_y. \tag{1.6} \]
The temporal evolution of the gross magnetization is then governed by the Bloch
equations:
\[ \frac{d\vec M}{dt} = \gamma\, \vec M \times \begin{pmatrix} B_1 \cos(\omega t) \\ B_1 \sin(\omega t) \\ B_0 \end{pmatrix} + \frac{\vec M_0 - \vec M_\parallel}{T_1} - \frac{\vec M_\perp}{T_2} \tag{1.7} \]
with $\vec M_\perp$ and $\vec M_\parallel$ denoting the magnetization components perpendicular and parallel
to the $\vec B_0$ field. Eq. (1.7) consists of three terms: a precession term due to the
excitation field, and two relaxation terms. The latter account for the fact that a gross
magnetization perturbed away from the equilibrium magnetization $\vec M_0$ recovers to the
equilibrium due to energy exchanges between the nuclear spins and the surrounding
heat bath ($T_1$ relaxation, spin-lattice relaxation) and loss of coherence between the
precessing spins ($T_2$ relaxation, spin-spin relaxation). Typical values for water-rich
biological tissues are 1500–2000 ms for $T_1$ and 50–200 ms for $T_2$. Inhomogeneities in
the external field $\vec B_0$ can further speed up the transversal spin dephasing and lead
to effective values of $T_2^* < T_2$.
90° and 180° pulses  The qualitative understanding of the magnetization dynamics
is simplified if they are studied in a coordinate system $(\vec e_{x'}, \vec e_{y'}, \vec e_z)$ rotating in phase
with the $\vec B_1$ vector. In such a system, Eq. (1.7) takes the following form:
\[ \frac{d\vec M'}{dt} = \vec M' \times \begin{pmatrix} \gamma B_1 \\ 0 \\ 2\pi(f_0 - f) \end{pmatrix} + \frac{\vec M_0 - \vec M'_\parallel}{T_1} - \frac{\vec M'_\perp}{T_2} \tag{1.8} \]
Now it is obvious that in resonance ($f = f_0$), $\vec M'$ rotates with angular frequency $\gamma B_1$
around the $\vec e_{x'}$ axis. If such a resonant field is applied for a time of $\pi/(2\gamma B_1)$, the
magnetization rotates into the $xy$-plane and is completely transversal (90° pulse):
all spins precess with complete phase coherence, until they are again dephased due
to the spin-spin relaxation. If the excitation field is applied for twice this time
(180° pulse), the spins first get into phase and then dephase again, so that the net
magnetization points in the $-\vec e_z$ direction.
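The rotation described by Eq. (1.8) can be checked numerically. The sketch below integrates only the precession term (relaxation neglected) in the rotating frame for an on-resonance pulse of duration $\pi/(2\gamma B_1)$; the value of $\gamma B_1$ is an arbitrary illustrative choice:

```python
import numpy as np

# Integrate dM'/dt = M' x (gamma*B1, 0, 0) -- the on-resonance precession term
# of Eq. (1.8), relaxation neglected -- over one 90-degree pulse duration.
gamma_B1 = 2 * np.pi * 1000.0           # rotation rate gamma*B1 [rad/s], illustrative
t_pulse = np.pi / (2 * gamma_B1)        # 90-degree pulse duration
n_steps = 20000
dt = t_pulse / n_steps

M = np.array([0.0, 0.0, 1.0])           # start at equilibrium, along +z
B_eff = np.array([gamma_B1, 0.0, 0.0])  # effective field axis in the rotating frame

for _ in range(n_steps):
    M = M + dt * np.cross(M, B_eff)     # explicit Euler step
    M = M / np.linalg.norm(M)           # renormalize to correct Euler drift

# After the pulse, M should lie (almost) entirely in the transverse plane.
```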
Signal acquisition: FID and spin echo sequence  During relaxation, the precession of the non-equilibrium magnetization causes a transversal RF signal to be
emitted, which can be detected in a receiver coil, typically both in $x$ and in $y$ direction (quadrature detection).³ It is called the free induction decay (FID). The free
induction decay of a single resonance can be described by a damped exponential in
the time domain, and by a Lorentzian in the frequency domain:
\[ g(t) \propto M_0 \exp\left(-\frac{t}{T_2^*} + 2\pi i f_0 t + i\phi_0\right) \tag{1.9} \]
\[ \hat g(f) \propto \frac{M_0 T_2^* \exp(i\phi_0)}{1 + 2\pi i (f - f_0) T_2^*} \tag{1.10} \]
Often, Doppler broadening occurs due to the thermal motion of the protium nuclei in
the sample: hence the Lorentzian is convolved with a Gaussian, resulting in a Voigt
profile. As the FID is often perturbed by the previous RF pulse, a delayed signal
acquisition is often preferable, which can be achieved by the spin-echo (SE) sequence:
the idea is to reverse the rapid dephasing caused by the $B_0$ field inhomogeneities
($T_2^*$) with a 180° pulse in either $x'$ or $y'$ direction, which is applied after a time of
TE/2. This causes all spin precessions to change their direction. Since the absolute
precession speed stays the same, the spins come back into phase at the echo time TE:
hence, a discernible echo signal occurs at that time, and then dephases again with
time constant $T_2^*$. Compared to the original FID, the amplitude of the echo signal
is reduced by a factor of $\exp(-\mathrm{TE}/T_2)$, which accounts for the stochastic dephasing
effects that cannot be reverted by the 180° pulse.

³ There may also be a single transceiver coil, which acts as both the transmitter and the receiver
coil.
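Eqs. (1.9) and (1.10) can be sketched numerically: simulating a single damped resonance and Fourier-transforming it yields a peak at the resonance frequency. The values of $f_0$, $T_2^*$ and the sampling parameters below are illustrative, not measured ones:

```python
import numpy as np

# Simulate the FID of Eq. (1.9) for a single resonance and recover the
# Lorentzian line of Eq. (1.10) via an FFT. All parameters are illustrative.
f0, T2s, phi0 = 200.0, 0.05, 0.0    # resonance [Hz], T2* [s], initial phase [rad]
fs, T = 2048.0, 1.0                 # sampling rate [Hz], acquisition duration [s]
t = np.arange(0, T, 1 / fs)

fid = np.exp(-t / T2s + 2j * np.pi * f0 * t + 1j * phi0)  # complex FID g(t)

spectrum = np.fft.fft(fid)
freqs = np.fft.fftfreq(len(t), d=1 / fs)
peak_freq = freqs[np.argmax(np.abs(spectrum))]
# peak_freq should coincide with f0 up to the frequency resolution 1/T
```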
Chemical shift and MRS  The previous discussion assumed that all $^1$H nuclei in
an external field have the same resonance frequency, irrespective of the molecules
in which they occur. However, that is only approximately correct: due to the
magnetic properties of the surrounding electrons, all nuclei experience an effective
external field that is slightly different from $\vec B_0$ (chemical shift):
\[ \vec B_{\mathrm{eff}} = \vec B_0 (1 - \sigma) = \vec B_0 - \vec\delta \tag{1.11} \]
Usually the induced magnetic field of the electrons opposes the $\vec B_0$ field (Lenz's rule)
so that $\delta > 0$, but $\pi$ electrons may also enhance $\vec B_0$ (e.g. for benzene, $\delta$ is negative).
The typical order of magnitude for $\sigma$ is $10^{-6}$; hence the chemical shift is typically
measured in parts per million (ppm). For $^1$H spectroscopy, it is defined with respect
to Si(CH$_3$)$_4$ (tetramethylsilane), which is assigned a chemical shift of 0. The total
signal is a superposition of the FIDs of all metabolites contained in the sample: after
a Fourier transformation, these FIDs appear as distinct peaks whose amplitudes are
proportional to the metabolite concentrations (Fig. 1.1). Typically, by far the most
protium nuclei are part of water molecules; hence the metabolite signals may be
undetectable against the water background signal, unless it is suppressed either by
specific data acquisition protocols or by postprocessing steps. Experiments in which
the spectral composition of the $^1$H RFr signal is studied are known as magnetic
resonance spectroscopy (MRS).
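The proportionality between peak amplitude and concentration can be illustrated with a synthetic superposition of three damped resonances (cf. Fig. 1.1). The frequencies and relative "concentrations" below are arbitrary mock values, not the actual ppm positions of choline, creatine and NAA:

```python
import numpy as np

# Superposition of three mock-metabolite FIDs; in the frequency domain the
# peak heights scale with the simulated relative concentrations 1 : 2 : 3.
fs, T = 4096.0, 2.0                 # sampling rate [Hz], duration [s]
t = np.arange(0, T, 1 / fs)
T2s = 0.1                           # common T2* [s], illustrative
# (resonance frequency [Hz], relative concentration), all values arbitrary
resonances = [(150.0, 1.0), (250.0, 2.0), (400.0, 3.0)]

fid = sum(a * np.exp(-t / T2s + 2j * np.pi * f * t) for f, a in resonances)
spectrum = np.abs(np.fft.fft(fid))
freqs = np.fft.fftfreq(len(t), d=1 / fs)

# Peak height at each resonance frequency
heights = [spectrum[np.argmin(np.abs(freqs - f))] for f, _ in resonances]
# heights come out approximately in the ratio 1 : 2 : 3
```

The ratios are only approximate because the Lorentzian tails of neighboring peaks overlap, which is also one of the practical difficulties of real quantification.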
Single-voxel localization  In MRS, an entire sample is excited at once, and the
emitted signal from the whole volume is received. This is usually sufficient for studies
of homogeneous substances (e.g. in material characterization), and may also give
valuable information in diagnostic medicine, e.g. about the presence and extent of a
tumor in the brain (Cohen et al., 2005).⁴ However, often one is interested not only
in whether there is a tumor somewhere in the head, and how large it is, but also in
its location: this information is particularly relevant for radiotherapy and surgery
planning (see e.g. Chan et al. (2004)). Common to all spatial localization techniques
(for a good recent overview of the different possibilities see Keevil (2006)) is the use
of gradient fields, i.e. additional spatially varying magnetic fields which are parallel
to the $\vec B_0$ field. Hence the resonance frequency becomes spatially dependent:
\[ f_0(\vec r) = \frac{\gamma}{2\pi}\,(B_0 + \vec G \cdot \vec r). \tag{1.12} \]
These gradient fields are typically switched on only at specific phases during the
measurement process. For slice-selective excitation, a $z$-gradient field is applied only

⁴ Advantages of such whole-brain spectroscopy protocols are the good signal-to-noise ratio (SNR)
and the robustness with respect to positioning errors.
[Four plot panels: real and imaginary part of the signal in the time domain (Time [msec]) and in the frequency domain (Frequency [ppm]).]
Figure 1.1. – Exemplary brain MRSI spectrum in the time and frequency domain. The
three peaks correspond to the most important metabolites of the healthy brain, namely (from
left to right) choline, creatine and N-acetylaspartate (NAA).
during the excitation with a bandwidth-limited RFr pulse: if the bandwidth is given
by $\Delta f$, only the $^1$H nuclei inside an axial slice of thickness
\[ \Delta z = \frac{2\pi\,\Delta f}{\gamma G_z} \tag{1.13} \]
are excited.⁵ The spectrum in a specific volume element (voxel) can be measured
by single-voxel MRS techniques such as the PRESS (Point-REsolved SpectroScopy)

⁵ Strictly speaking, as the excitation pulse must be time-limited, it cannot be exactly frequency-limited at the same time, so that some signal bleeding from the other $z$ slices always occurs.
This is the reason why e.g. the 180° pulses in the PRESS sequence are commonly flanked by two
symmetric spoiler gradient fields that dephase transversal magnetization that was caused by the
imperfect selectivity.
sequence by Bottomley (1987), which consists of one 90° and two refocussing 180°
pulses. Each pulse is accompanied by a slice-selection gradient in a different direction
($x$, $y$ and $z$), so that the second echo only occurs in the intersection of these three
orthogonal planes. If the volume of interest lies near the surface of the sample,
selective excitation can also be achieved by the use of a surface coil, as the $B_1$ field
of a coil of radius $a$ drops with the distance $z$ from the coil as $(a^2 + z^2)^{-3/2}$ ($B_1$
gradient-based localization).
Magnetic resonance spectroscopic imaging (MRSI) If metabolite concentration
maps are desired, the individual MR spectra of a whole grid of voxels inside a volume
of interest must be measured at the same time: this is the application of MRSI. The
easiest technique is based on the spin-echo sequence: it requires Nx ·Ny ·Nz repetitions
for measuring a grid of Nx × Ny × Nz voxels. Each repetition is characterized by a
different combination of gradients Gx , Gy and Gz . While the Gz gradient is applied
during the 90◦ and the 180◦ pulse to achieve slice-selective excitation, the Gx and Gy
gradients are simultaneously applied for a time of T between the 90◦ and the 180◦
pulse: they lead to a spatially dependent phase shift of
∆φ = γT(G_x x + G_y y) = k_x x + k_y y,    (1.14)
with ki := γT Gi . Measuring the signal for the different values of Gx and Gy (and
hence kx and ky ) can be interpreted as sampling the two-dimensional Fourier transformation of the spin density inside the excited slice:
ρ̂(k_x, k_y) = ∫dx ∫dy ρ(x, y) e^{i(k_x x + k_y y)},    (1.15)
and the original spin density can be reconstructed via the inverse Fourier transform.
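The sampling and reconstruction of Eqs. (1.14)–(1.15) can be illustrated with a small one-dimensional sketch (Python/NumPy; the toy spin density and the grid size are arbitrary choices for illustration): each phase-encoding step measures one Fourier coefficient, and the density is recovered by inverting the encoding.

```python
import numpy as np

n = 16
x = np.arange(n)                        # voxel positions (arbitrary units)
k = 2 * np.pi * np.arange(n) / n        # phase-encoding values k = gamma*T*G
rho = np.zeros(n)
rho[5:9] = 1.0                          # toy 1D spin density

# each phase-encoding step yields one sample of rho_hat(k) = sum_x rho(x) e^{ikx}
encoding = np.exp(1j * np.outer(k, x))
rho_hat = encoding @ rho

# reconstruction by inverting the encoding matrix (an inverse DFT up to scaling)
rho_rec = np.linalg.solve(encoding, rho_hat)
```

In practice the inversion is of course performed with the fast Fourier transform rather than with a dense solve; the explicit matrix only mirrors the formula.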
A repetition time TR ≫ TE must elapse between consecutive spin-echo cycles to avoid
any remanent transverse magnetization from the previous cycle. This accounts for
the long time required for MRSI measurements: with a typical repetition time of
TR = 2 s, acquiring a coarse 16 × 16 × 8 volume takes 4096 s, i.e. more than one
hour.6 For 1 H MRSI and standard clinical B0 fields of 1.5 T, voxel sizes of 0.5–
5 cm3 can be achieved by these techniques. The limiting factor is the signal-to-noise
ratio (SNR): too little signal can be captured from smaller voxels. As SNR improves
roughly linearly with increasing B0 field strength (Edelstein et al., 1986), improved
spatial resolution can be achieved at higher field strengths that are currently under
experimental investigation (Henning et al., 2009).
6 Magnetic resonance imaging (MRI) uses similar encoding strategies and also samples the signal in the Fourier domain. However, it can be considerably sped up over MRSI by using the discussed phase modulation strategy only for one of the in-plane directions, and encoding the other direction in the frequency of the acquired signal (frequency modulation): i.e., the corresponding gradient is applied during signal acquisition. This is not an option for standard MRSI protocols, though, as the frequency of the acquired signal already encodes the chemical shift.
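The acquisition-time figure quoted in the text follows from simple counting — one repetition per phase-encoding combination — as this short sketch reproduces:

```python
# phase-encoding grid and repetition time from the example in the text
nx, ny, nz = 16, 16, 8
tr = 2.0                       # repetition time TR [s]

# one spin-echo cycle per (Gx, Gy, Gz) combination
total_s = nx * ny * nz * tr
print(total_s, "s =", total_s / 3600, "h")
```

This is why accelerated encoding schemes are an active research topic for MRSI.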
Clinically relevant metabolites In clinical applications of ¹H MRSI, detection is
possible for metabolites having concentrations down to 1 mmol/l: since the RF
sensitivity is typically not known, only relative quantification is possible (i.e. the
ratios between the concentrations of different metabolites can be estimated, but
not the absolute concentration values). Among the diagnostically most relevant
metabolites that can be detected by 1 H spectroscopy are (Govindaraju et al., 2000):
1. N -acetylaspartate (NAA): This metabolite gives rise to the predominant
resonance in healthy brain tissue. While its biochemical function is still only
poorly understood, it is known to be a characteristic clinical marker for intact
neurons: hence it is depleted in nearly every type of brain lesion (e.g. stroke,
tumors or neurodegeneration).
2. (Phospho-) Creatine plays an important role as an energy buffer and storage
medium, which is required for the regeneration from adenosine diphosphate (ADP)
to adenosine triphosphate (ATP), the most important free energy carrier in cell
metabolism. Creatine is most useful as a normalization reference for other
metabolite concentrations, but is not indicative for pathology by itself.
3. Choline is a precursor for the phospholipids making up the cellular membranes; hence it is enhanced in proliferating tissues with a high activity of
membrane biogenesis (such as tumors).
4. Lactate is generated by anaerobic glycolysis; hence it is a marker for ischemia
and hypoxia and it is commonly increased in tumors, particularly in the necrotic
core.
5. Lipid resonances are broader than the signals of the metabolites mentioned
above, and they typically cannot be captured by a simple parametric (Voigt or
quantum mechanical) model. They arise mostly from free fatty acids, and are
indicative for high-grade tumors or cell necrosis.
6. Citrate is one of the main ingredients of prostatic fluid: hence it is the predominant resonance in the healthy prostate, and it is characteristically depleted in
prostatic cancer.
The sensitivity of MRSI and the metabolites visible in the spectrum can also be influenced by the choice of the echo time TE: as the MR signal decays with exp(−TE/T2),
shorter echo times correspond to better SNR. However, many nuisance signals from
proteins or liquids have very short T2 and have decayed away in long-TE spectra;
hence the signals from the interesting metabolites can be better discernible in these
spectra.
1.3. Quantification with spatial context
Current state-of-the-art procedures for time-domain quantification of MRSI series,
such as AMARES (Vanhamme et al., 1997) or QUEST (Ratiney et al., 2005), estimate the spatially resolved concentrations of relevant metabolites by solving a nonlinear least-squares (NLLS) problem:
θ̂ = arg min_θ Σ_{n=1}^{N} (g_θ(t_n) − y_n)²    (1.16)
In the preceding formula, yn denotes the complex MRSI signal for a specific voxel
acquired at the time tn and gθ (tn ) is a parametric model for this time series, with
the parameter vector θ comprising both the amplitudes of the relevant metabolites
in this voxel (i.e. the final aim of quantification) and additional signal distortion
parameters such as phase or frequency shifts or (Lorentzian or Gaussian) damping
factors. In the following, this procedure will be called the Single Voxel (SV) method,
since the estimation is performed for every voxel on its own and no information from
neighboring voxels is used in this process. However, the non-convexity of this optimization problem may lead to convergence problems, or the procedure may converge
to a wrong local optimum. The time course yn also typically contains considerable
noise (especially for high-resolution measurements) which may cause the parameter
estimates to be biased and to have high variance (Cook et al., 1986).
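A minimal sketch of such a single-voxel NLLS fit is given below (a hypothetical one-metabolite example in Python/SciPy, not the thesis' MATLAB implementation; all parameter values are arbitrary): the complex residual of a Lorentzian model is stacked into a real vector and minimized with a local trust-region optimizer, mirroring Eq. (1.16).

```python
import numpy as np
from scipy.optimize import least_squares

def model(theta, t):
    """One-metabolite Lorentz model: theta = (amplitude, damping, frequency, phase)."""
    a, d, f, phi = theta
    return a * np.exp((-d + 2j * np.pi * f) * t + 1j * phi)

def residuals(theta, t, y):
    r = model(theta, t) - y                    # complex residual g_theta(t_n) - y_n
    return np.concatenate([r.real, r.imag])    # stacked into a real vector for the solver

dt = 1e-3
t = np.arange(256) * dt
rng = np.random.default_rng(0)
true_theta = [2.0, 8.0, 200.0, 0.3]
y = model(true_theta, t) + 0.05 * (rng.standard_normal(256) + 1j * rng.standard_normal(256))

# single-voxel fit (Eq. 1.16); a poor starting value may end in a wrong local optimum
fit = least_squares(residuals, x0=[1.0, 5.0, 199.0, 0.0], args=(t, y))
```

Starting the frequency far from its true value illustrates the non-convexity issue discussed above: the oscillatory residual landscape then traps the local optimizer.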
Similar estimation problems also arise in the analysis of other medical imaging modalities, such as in the construction of kinetic parameter maps for dynamic contrast-enhanced (DCE) MRI measurements. There, spatial regularization has been shown to improve both bias and variance of the parameter estimates and to increase the robustness of the estimation with respect to noise (Kelm et al., 2009).
“Spatial regularization” means that the parameters of different voxels are coupled
via a regularization term penalizing large parameter differences between neighbor
voxels, e.g. using a Generalized Gaussian Markov Random Field (GGMRF) model
(Bouman & Sauer, 1993):
θ̂ = arg min_θ [ Σ_{s∈V} Σ_{n=1}^{N} (g_{θ_s}(t_n) − y_n^s)² + σ² Σ_{s∼t} α_st ‖W(θ_s − θ_t)‖_p^p ]    (1.17)

  = arg max_θ log P((θ_s)_s | (y_n^s)_{s,n})    (1.18)
In this formula, image voxels are indexed by s and t, with s ∼ t denoting a neighborhood relationship (usually only voxels in the same slice are considered as neighbors,
and the standard 4-neighborhood or 8-neighborhood is used). y_n^s denotes the MRSI
signal corresponding to the voxel s. The factor α_st allows one to e.g. weight diagonal
and vertical or horizontal neighbors in an 8-neighborhood differently. W is a diagonal weighting matrix which controls how the different parameters (e.g. amplitudes,
frequency shifts, phase shifts, …) contribute to the penalty term: it is especially
required for incommensurable parameters. σ² is the noise variance, which can be
estimated from the last time points of the MRSI signal, and ‖·‖_p with 1 < p ≤ 2
denotes the standard p-norm (using p < 2 can prevent an over-smoothing of edges,
e.g. in the presence of lesions). In the language of Bayesian statistics, we can interpret the regularization terms as a prior distribution on the set of potential parameter
maps.
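The regularization term of Eq. (1.17) is straightforward to write down for a 4-neighborhood; the following is an illustrative NumPy sketch (uniform α_st, parameter maps stored as one 3-D array), not the thesis implementation:

```python
import numpy as np

def ggmrf_penalty(theta, w, p, alpha=1.0):
    """sum_{s~t} alpha * ||W (theta_s - theta_t)||_p^p over a 4-neighborhood grid.

    theta: (rows, cols, n_params) parameter maps; w: diagonal of the weight matrix W.
    """
    d_vert = (theta[1:, :, :] - theta[:-1, :, :]) * w   # vertical neighbor differences
    d_horz = (theta[:, 1:, :] - theta[:, :-1, :]) * w   # horizontal neighbor differences
    return alpha * (np.abs(d_vert) ** p).sum() + alpha * (np.abs(d_horz) ** p).sum()

# a constant parameter map incurs no penalty; a single jump of height 2 costs 2^p
flat = np.ones((4, 4, 2))
step = np.zeros((1, 2, 1)); step[0, 1, 0] = 2.0
```

The p-exponent makes the edge-preservation effect of p < 2 directly visible: the cost of one large jump grows more slowly than the summed cost of many small ones.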
The Hammersley-Clifford theorem (Clifford, 1990) states that for computing the
optimal parameters on a subset of voxels A given the parameters at all other sites,
only the parameter values in the Markov blanket of A must be known:
arg max_{θ_A} log P(θ_A | θ_{A^c}, (y_n^s)_{s,n}) = arg max_{θ_A} log P(θ_A | θ_{∂A}, (y_n^s)_{s∈A,n})    (1.19)

with

∂A = {s ∈ V | ∃ t ∈ A : s ∼ t}.    (1.20)
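For a grid MRF with a 4-neighborhood, the boundary set ∂A of Eq. (1.20) is easy to enumerate; a small sketch (pure Python, voxels addressed as (row, column) tuples):

```python
def markov_blanket(block, shape):
    """Return the 4-neighborhood Markov blanket dA = {s | exists t in A: s ~ t}."""
    rows, cols = shape
    block = set(block)
    blanket = set()
    for r, c in block:
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            s = (r + dr, c + dc)
            # a blanket site lies inside the grid, borders A, but is not in A itself
            if 0 <= s[0] < rows and 0 <= s[1] < cols and s not in block:
                blanket.add(s)
    return blanket
```

For an interior voxel the blanket is simply its four neighbors; for a whole block it is the one-voxel-wide ring around it.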
This property is used in the Iterated Conditional Modes (ICM) algorithm (Besag,
1986), which finds a local maximum of the joint log-probability by iteratively optimizing the parameters of each voxel given the current (fixed) values of its neighbors.
Convergence may be sped up by the more general block-ICM scheme (Wu et al.,
1994), which iterates over whole blocks of voxels and jointly optimizes the parameters over a whole block of voxels given the fixed parameter values from the Markov
blanket of this block. This block-ICM scheme can be viewed as a compromise between ICM with single-voxel updates and the (infeasible) global optimization problem
in which the parameters of all voxels are jointly optimized: hence it may plausibly be
expected to reach a better solution (i.e. one with higher log-probability) than ICM
with single-voxel updates, although this is not guaranteed.
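The single-voxel ICM update can be sketched on a toy problem with one scalar parameter per voxel and quadratic (p = 2) data and smoothness terms, for which every conditional update has a closed form. This is only an illustration of the ICM idea, not the block-ICM solver used in the thesis:

```python
import numpy as np

def icm(y, lam, sweeps=20):
    """ICM for min sum_s (theta_s - y_s)^2 + lam * sum_{s~t} (theta_s - theta_t)^2."""
    theta = y.copy()
    rows, cols = y.shape
    for _ in range(sweeps):
        for r in range(rows):
            for c in range(cols):
                nbrs = [theta[i, j]
                        for i, j in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                        if 0 <= i < rows and 0 <= j < cols]
                # conditional optimum of voxel (r, c) given its fixed neighbors
                theta[r, c] = (y[r, c] + lam * sum(nbrs)) / (1 + lam * len(nbrs))
    return theta

def energy(theta, y, lam):
    smooth = ((theta[1:, :] - theta[:-1, :]) ** 2).sum() \
           + ((theta[:, 1:] - theta[:, :-1]) ** 2).sum()
    return ((theta - y) ** 2).sum() + lam * smooth

rng = np.random.default_rng(1)
y = rng.standard_normal((8, 8))
theta = icm(y, lam=1.0)
```

Since every update minimizes the conditional energy of one voxel, the total energy never increases — the argument that guarantees ICM's convergence to a local optimum.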
Recently, Kelm (2007) proposed to impose a GGMRF prior on the MRSI parameter
maps and to use the block-ICM algorithm in order to perform inference on this
model: Preliminary studies on simulated MRSI measurements suggested that this
spatial regularization improves the estimation robustness against noise, and decreases
both bias and variance of the parameter estimates in comparison to the single voxel
(SV) model, as had already been established for DCE MRI analysis. In this study,
this claim was tested on real-world MRSI measurements. Preliminary evaluations on
proband MRSI measurements (with a voxel size of 10 × 10 × 10 mm³, as in standard
clinical measurements) showed no improvement of the GGMRF model over
the SV model, and the question arose how realistic the simulated data were and
whether the GGMRF gives any practical advantages for MRSI analysis that justify
the increased computation time: these findings necessitated a rigorous experimental
analysis.
1.4. Related work
There exists a multitude of quantification techniques for MRSI data, so that only a
cursory overview of the field can be given here; for a more comprehensive recent survey,
see (Poullet et al., 2008). They fall into two main categories, time-domain methods
and frequency-domain methods, which may overlap.7 Time-domain methods
fit the measured signal to a parametric model by a non-linear least-squares (NLLS)
estimation, which may be solved using local or global optimization techniques. The
parametric model consists of the spectra of the constituting metabolites, which may
be derived from simple parametric approximations (Lorentzian, Gaussian or Voigt
model), quantum mechanical predictions or experimental in vitro measurements.8
Other approaches do not make prior assumptions about the metabolites contributing
to the spectrum, but e.g. use the expectation maximization (EM) algorithm or some
modification of the singular value decomposition (SVD) to fit an optimal number
of Lorentzians to the FID. Nuisance signals arising from macromolecules (proteins,
lipids) can often neither be predicted theoretically nor measured in vitro, hence they
are rather captured by a nonparametric model such as a spline decomposition, like
in the AQSES procedure by Poullet et al. (2007). Many of the frequency-domain
quantification methods also follow either the NLLS or the SVD approach; alternatives
are peak integration (where no assumptions about the peak shape are made) or
nonparametric regression techniques such as artificial neural networks.
Besides the work by Kelm (2007), upon which this chapter builds, there have been few
comparable approaches on exploiting spatial regularity for improved quantification of
magnetic resonance spectroscopy images. The approach by Croitor Sava et al. (2009)
has the highest similarity to this line of research: like Kelm (2007), they formulate the spatially regularized fitting problem as a Gaussian Markov random field, and
refine the solution over several iteration sweeps through the grid. They also solve the
7 The procedures discussed later in this chapter only make use of scalar products between spectra: hence it does not matter for them whether the spectra are represented in the time domain or in the frequency domain, according to Parseval's theorem (i.e. the unitarity of the Fourier transform).
8 For instance, the AMARES procedure by Vanhamme et al. (1997) uses Lorentzian spectra, while the QUEST procedure by Ratiney et al. (2005) can make use of experimental basis spectra.
intractable optimization problem approximately via an iterated conditional modes
(ICM) approach, i.e. the nonlinear parameters of one voxel are optimized given the
fixed values of its neighbors. Their work differs in two respects: firstly, they combine the spatial regularization with a semi-parametric baseline estimation as in the
AQSES algorithm (Poullet et al., 2007) in order to account for the macromolecular
nuisance signals that occur in the short-echo data they are studying. Secondly, they
account for the parameters in the neighboring voxels not only in the energy functional, but also in the initialization and for determining the search bounds on the
parameters. Sima et al. (2010) present a slight modification of this approach, which
differs only in the implementation of the nonlinear optimization. Instead of solving
the problem in Eqs. (1.16) and (1.21) by e.g. a Levenberg-Marquardt optimizer with
respect to all parameters, the optimization with respect to the linear parameters is
performed in closed form, so that gradients must only be computed with respect to
the nonlinear parameters. This variable projection approach is known to speed up
convergence (Golub & Pereyra, 2003).
Bao & Maudsley (2007) combine the two tasks of MRSI reconstruction (i.e. computing the spatial MRSI distribution from the signal that has been acquired in k-space)
and metabolite quantification into a single probabilistic Bayesian model and add a
spatial regularity prior: they then use an EM approach to find the maximum a posteriori (MAP) solution for this model. Hereby they differ from most other approaches
(as well as the one presented in this chapter), where the MRSI reconstruction is
performed before the quantification: this is typically done via a Fourier transform,
which causes signal bleeding into adjacent voxels and Gibbs ringing due to the limited k-space sampling rate. Registered MRI data are used to identify the positions of
tissue borders, so that the smoothness priors for the metabolite concentrations can
be switched off across these borders.
Furthermore, the LCModel software by Provencher (2010) contains a “Bayesian
learning” procedure which fits first the good-quality spectra in the center of the
FOV, and propagates the phase and frequency corrections thus found towards the
outer voxels, where they serve as soft constraints for the fit. This approach models
the dependencies between the fit parameters in the different voxels as a directed
graphical model in contrast to the undirected graphical models studied in this chapter. Furthermore the inference is solved in a greedy local manner instead of the
global inference methods employed in this chapter: once an inner spectrum has been
fitted, the information from the outer fits cannot be backpropagated to refine this
fit. However, the technical details are kept as a trade secret, so that a thorough
discussion of this method is not possible. Experimentally, it was shown to perform
worse than the approach by Croitor Sava et al. (2009).
1.5. Experimental setup
Spatially regularized models like GGMRF contain the underlying assumption that
the parameters across neighboring voxels are positively correlated: this assumption
holds especially for small voxel sizes. Since small voxels are also associated with
a low signal-to-noise ratio, the advantages of GGMRF should then be particularly
pronounced. In order to study this voxel size effect systematically, two MRSI measurement series of the brain of a healthy proband were run. The measurements were
conducted on a Siemens MAGNETOM Trio™ scanner with the following parameters:
spin-echo (SE) sequence, repetition time 1700 ms, echo time 135 ms, magnetic field
3 Tesla (corresponding to an imaging frequency of 123.23 MHz), dwell time dt = 833
µs, N = 512 recorded time points, matrix size 16 × 16 × 1 voxels. Every series comprised six scans: in the first series, three scans each were performed with a constant
slice thickness of 10 mm or 20 mm and the in-plane side length was reduced, leading
to anisotropic voxels. In the second series, the voxel size was kept isotropic, and
two scans were conducted for each of three different side lengths. These two setups
make it possible to study the effects of increasing the lateral and axial resolution
separately; the second setting (with isotropic voxel sizes), however, is more typical
for clinical MRSI scans. Tables 1.1 and 1.2 show the voxel sizes and field of view (FOV) sizes
for each measurement: Only voxels fully included in the FOV were used for the subsequent analysis (1433 voxels in total). The scan series also differ in the number of
FIDs which were acquired and averaged in order to improve the signal-to-noise ratio
(SNR). The mean SNR for all series is also reported in these tables: it is defined
as absolute height of the highest peak in the frequency spectrum in the vicinity of
the expected metabolite positions, divided by the root mean-square magnitude of
the spectrum in a frequency band containing neither signal nor artifact peaks, as in
(Kreis, 2004).9
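The SNR definition used here — highest peak magnitude near the expected metabolite positions, divided by the RMS magnitude of a signal-free band — can be sketched directly (Python/NumPy; the band limits and the toy spectrum are illustrative placeholders):

```python
import numpy as np

def spectrum_snr(spec, ppm, signal_band, noise_band):
    """SNR as in the text: peak height in the metabolite region over RMS noise."""
    in_sig = (ppm >= signal_band[0]) & (ppm <= signal_band[1])
    in_noi = (ppm >= noise_band[0]) & (ppm <= noise_band[1])
    peak = np.abs(spec[in_sig]).max()
    noise_rms = np.sqrt(np.mean(np.abs(spec[in_noi]) ** 2))
    return peak / noise_rms

# toy spectrum: one peak of height 5 at 2.0 ppm on a flat "noise" floor of 0.5
ppm = np.linspace(-1, 5, 601)
spec = np.full_like(ppm, 0.5)
spec[np.argmin(np.abs(ppm - 2.0))] = 5.0
snr = spectrum_snr(spec, ppm, signal_band=(1.5, 3.6), noise_band=(-1.0, 0.0))
```

As footnote 9 notes, other conventions exist (e.g. peak area instead of height, or noise standard deviation instead of RMS), so such numbers are only comparable within one convention.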
All data were subjected to water suppression with a Hankel singular value decomposition (HSVD) scheme (Pijnappel et al., 1992) before further analysis (the 15 most
prominent SVD components were computed, and all of these components with a
chemical shift > 3.6 ppm or < 1.5 ppm were subtracted from the signal). Furthermore, exponential apodization with a time constant of N · dt/5 was applied in order
to improve the signal-to-noise ratio.
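The HSVD water-removal step can be sketched compactly: estimate damped complex exponentials from the FID via the shift-invariance of the truncated signal subspace, then subtract the components whose frequencies fall into the water region. The following is an illustrative NumPy reimplementation of the textbook algorithm under simplified conditions (noiseless toy FID, component count and suppression band chosen arbitrarily), not the preprocessing code used in the thesis:

```python
import numpy as np

def hsvd(y, dt, k):
    """Model the FID y as k damped complex exponentials (HSVD)."""
    n = len(y)
    rows = n // 2
    # Hankel matrix of the signal
    h = np.array([y[i:i + n - rows + 1] for i in range(rows)])
    u = np.linalg.svd(h, full_matrices=False)[0][:, :k]
    # shift-invariance: eigenvalues of pinv(U_top) @ U_bot are the signal poles
    poles = np.linalg.eigvals(np.linalg.pinv(u[:-1]) @ u[1:])
    freqs = np.angle(poles) / (2 * np.pi * dt)
    # complex amplitudes by linear least squares against the pole basis
    basis = poles[None, :] ** np.arange(n)[:, None]
    amps = np.linalg.lstsq(basis, y, rcond=None)[0]
    return freqs, amps, basis

# toy FID: large "water" component at 0 Hz plus a small "metabolite" at 200 Hz
dt = 1e-3
t = np.arange(512) * dt
water = 10.0 * np.exp(-5.0 * t)
metab = np.exp((-8.0 + 2j * np.pi * 200.0) * t)
y = water + metab

freqs, amps, basis = hsvd(y, dt, k=2)
keep = np.abs(freqs) > 50.0              # drop components inside the water band
y_clean = basis[:, keep] @ amps[keep]
```

On real data the component count must be chosen generously (the text uses 15) and the suppression band is specified in ppm rather than Hz, but the subtraction step is the same.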
For evaluation, the SV estimation (i.e. a nonlinear least-squares fit for every single
voxel) was compared with the results of a block-ICM optimization of the GGMRF
model, using 3×3 voxel blocks with a “chessboard” sweep schedule as in (Kelm et al.,
2009). Prototypical implementations written in MATLAB® were used: the nonlinear
9 While this definition of the SNR is fairly common in the MR spectroscopy community, there are also other, subtly different conventions: this should be considered when comparing SNRs between different publications.
Voxel size [volume]          | # Avg. | FOV size [grid size]           | Mean SNR
10 × 10 × 10 mm³ [1000 µl]   |   3    | 80 × 80 × 10 mm³ [8 × 8]       |   8.56
6.9 × 6.9 × 10 mm³ [473 µl]  |   3    | 80 × 80 × 10 mm³ [11 × 11]     |   4.47
5 × 5 × 10 mm³ [250 µl]      |   3    | 60 × 60 × 10 mm³ [12 × 12]     |   3.18
7 × 7 × 20 mm³ [980 µl]      |   3    | 90 × 90 × 20 mm³ [13 × 13]     |   7.22
10 × 10 × 20 mm³ [2000 µl]   |   3    | 100 × 100 × 20 mm³ [10 × 10]   |  13.03
3.4 × 3.4 × 20 mm³ [236 µl]  |   3    | 45 × 45 × 20 mm³ [13 × 13]     |   2.86

Table 1.1. – Voxel sizes and field of view sizes of the first six MRSI series (constant slice thickness) used for the experimental evaluation of the GGMRF quantification procedure, together with the number of FID averages (# Avg.) and the mean signal-to-noise ratio over all spectra in the series.
Voxel size [volume]          | # Avg. | FOV size [grid size]           | Mean SNR
10 × 10 × 10 mm³ [1000 µl]   |   3    | 80 × 80 × 10 mm³ [8 × 8]       |  20.67
10 × 10 × 10 mm³ [1000 µl]   |   6    | 80 × 80 × 10 mm³ [8 × 8]       |  25.49
8 × 8 × 8 mm³ [512 µl]       |   6    | 80 × 80 × 8 mm³ [10 × 10]      |  15.50
8 × 8 × 8 mm³ [512 µl]       |   3    | 80 × 80 × 8 mm³ [10 × 10]      |  12.24
6 × 6 × 6 mm³ [216 µl]       |   6    | 80 × 80 × 6 mm³ [13 × 13]      |   7.12
6 × 6 × 6 mm³ [216 µl]       |   3    | 80 × 80 × 6 mm³ [13 × 13]      |   5.68

Table 1.2. – Voxel sizes and field of view sizes of the second six MRSI series (isotropic voxels) used for the experimental evaluation of the GGMRF quantification procedure, together with the number of FID averages and the mean signal-to-noise ratio.
optimization was performed with an interior trust-region method for constrained
nonlinear least-squares estimation as implemented in the MATLAB® Optimization
Toolbox (Coleman & Li, 1996). In order to compare the computational requirements
of the two competing methods, the effective quantification time per voxel is reported
(i.e. the quantification time for a whole slice divided by the number of voxels inside
the field of view). The average values on a standard PC (Intel® Core 2 Duo CPU
T9300 @ 2.50 GHz, 3 GB RAM) were 0.31 ± 0.03 sec for the SV method, and
1.24 ± 0.37 sec for the GGMRF method: hence spatial regularization leads to a
fourfold increase in computation time.
The following data model (Lorentz model) for the MRSI signal was used:

g_θ(t_n) = Σ_{m=1}^{M} a_m exp{ [−(d_m^(0) + d_m) + 2πi(f_m^(0) + f_m)] t_n + iφ_m }    (1.21)
M = 3 metabolites were considered (choline / creatine / NAA), with expected frequency shifts f_m^(0) of 196.55 Hz / 216.02 Hz / 341.6 Hz at 3 Tesla, corresponding
to chemical shifts of 3.161 ppm / 3.009 ppm / 2.026 ppm, and expected damping
constants d_m^(0) of 8 s⁻¹ for all three metabolites. a_m denotes the relative amplitudes (i.e. the parameters of interest to be estimated during quantification), φ_m
denotes the phase shifts, and d_m and f_m are correction terms for the damping factors and frequency constants (corresponding to the width and the position of the
Lorentz resonance lines). It is also possible to model the resonance lines as Voigt
profiles (with an additional Gaussian damping term), which was neglected here. The
resulting optimization problem therefore contains twelve free parameters per voxel
(ten, if a common phase shift is shared across all metabolites, i.e. if the constraint
φ1 = φ2 = φ3 is introduced). The spatial regularization term depends on five free
parameters: the parameter p characterizing the p-norm and the entries wa , wf , wd
and wφ of the diagonal weight matrix W , which control how much amplitude, frequency, damping and phase gradients are penalized (these values are shared across
the different metabolites). As proposed by Kelm (2007), the parameter combination
p = 2, wa = 0, wf = 2, wd = 0.2 and wφ = 20/π was used for most experiments
(which had been determined there from the variograms of the fitted parameter maps
for another proband dataset). Note that the amplitudes are usually not explicitly
regularized: it suffices to regularize the other parameters, and since the eventual
interest is in the amplitudes, any bias on them should be avoided.
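Eq. (1.21) with the expected values quoted above can be written down as a short function (illustrative Python sketch; the amplitude values are arbitrary examples, not measured data):

```python
import numpy as np

def lorentz_fid(t, a, d0, d, f0, f, phi):
    """Eq. (1.21): sum over M metabolites of damped complex exponentials."""
    t = np.asarray(t)[:, None]   # shape (N, 1) against (M,) parameter vectors
    return (a * np.exp((-(d0 + d) + 2j * np.pi * (f0 + f)) * t + 1j * phi)).sum(axis=1)

dt = 833e-6                              # dwell time from the measurement protocol
t = np.arange(512) * dt
f0 = np.array([196.55, 216.02, 341.6])   # Cho / Cr / NAA expected frequencies at 3 T
d0 = np.full(3, 8.0)                     # expected damping constants [1/s]
a = np.array([1.0, 0.8, 1.5])            # arbitrary example amplitudes

fid = lorentz_fid(t, a, d0, np.zeros(3), f0, np.zeros(3), np.zeros(3))
```

With all corrections and phases set to zero, the signal at t = 0 is simply the sum of the amplitudes, which is a convenient sanity check for any implementation of the model.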
1.6. Preliminary evaluation by single rater (unblinded)
An objective evaluation of the GGMRF versus the classical SV quantification method
is not possible, since the true metabolite concentrations inside a living brain are
unknown and cannot be measured. MRSI phantoms (tubes containing metabolites
at defined concentrations) are typically employed for the evaluation of SV quantification
techniques, but they are inappropriate for comparing spatially resolved quantification
methods, as the concentration is typically uniform inside the tube and there is no way
to generate smooth concentration gradients. Hence a subjective evaluation approach
was chosen: as long as the main metabolite peaks are identifiable (SNR > 1), a
trained human can usually judge whether they are captured by the fitted model
peaks. Fig. 1.2 shows an exemplary spectrum with its SV fit and several spatially
regularized fits, which can be clearly distinguished into “good” and “poor” fits.
In a preliminary subjective comparison of the SV and GGMRF fits, the quality of
each fit was labeled as “good” or “poor”. With the standard settings of the algorithm
as detailed above, 2 % of the “poor” SV fits could be improved to a “good” GGMRF
[Figure 1.2 – Example spectrum [series 1, scan 3, voxel (7,5)] in the frequency domain (black) with SV fit (blue) and several GGMRF fits with different parameters (SP for "spatially regularized", red), shown in six subplots. Only the real parts of the complex spectra are shown. The black and the blue curve are identical for all six subplots, but the red (regularized) fits differ, as they correspond to different regularization parameters. These are listed in the subplot titles: e.g. "Weights: 0,6.3662,0.2,2,1.5" stands for wa = 0, wφ = 20/π (6.3662), wd = 0.2, wf = 2, p = 1.5. The fit quality was rated "good" for the spatially regularized fits with parameters (0,6.3662,0.2,2,2), (0,6.3662,0.2,2,1.5) and (0,0,0,1,2), and "poor" for the single-voxel fit and the other spatially regularized fits, based on the criterion whether the choline peak was identified correctly or not. Note the particular relevance of frequency regularization for this example, which could be confirmed in the evaluation of the other spectra.]
fit by the spatial regularization, while none of the “good” SV fits were degraded to
“poor” GGMRF fits.
The following modifications of the algorithm were also tried:
1. Different value combinations for the weighting parameters wa , wf , wd and wφ
and the norm parameter p.
2. Augmenting the SV and GGMRF models with a semiparametric baseline estimation to account for macromolecular background signals that cannot be
modeled explicitly (as proposed by Sima & van Huffel (2006)).
3. Reparameterization of the frequency corrections fm . The above data model
assumes the central frequencies of the three metabolite peaks to jitter independently around their respective expected values. However, the mismatch
between expected and true central frequencies may also be due to a miscalibrated frequency axis (e.g. if the local magnetic field deviates from exactly
3 Tesla). In this case, it is preferable to correct all metabolite frequencies by a
common scale factor and offset, and then to add metabolite-specific frequency
jitter within narrower bounds.
4. Constraining the phase shifts φm of the three metabolites to have equal values.
However, none of these modifications yielded better results than the 2 % improvement
by the standard settings: the results were either comparable or worse. Hence
the standard settings were subjected to a decisive evaluation, thereby avoiding the
multiple-comparison problem in statistical hypothesis testing (Shaffer, 1995).
1.7. Decisive evaluation by two raters (blinded) and results
The above preliminary analysis is insufficient for establishing the superiority of the
GGMRF method over the SV method in a scientifically sound manner. The main
reason is that it was performed unblinded (i.e. with the human rater knowing which
curve corresponds to the SV fit and which curve corresponds to the GGMRF fit).
Since the decision whether a fit is “good” or “poor” is necessarily subjective, the
labels will be involuntarily biased by the prior expectations of the labelers, even if
they try their best to label the fits carefully and fairly. Hence a subsequent decisive
analysis was conducted, which was blinded: each spectrum was plotted twice with
the two different fit curves (with no indication of the underlying model) and all plots
were jumbled randomly. Two independent raters labeled the fit quality of each curve
as either “good” or “poor” as above (for a “good” label, all three metabolite peaks
had to be found with the correct peak position, width and amplitude).
Additionally the signal quality of each spectrum was labeled by the two raters as
either “good”, “noisy” (SNR for the choline and creatine peaks < 1) or “containing artifacts” (presence of unidentifiable broad signal components in the spectrum,
possibly caused by lipids). In borderline cases, the label “containing artifacts” took
precedence over “noisy”. Fig. 1.3 shows examples of these three signal quality classes.
Since every spectrum is plotted twice (once with the SV fit curve and once with the
GGMRF fit curve) and the spectra are in random order, we get two independent
signal quality labels from each rater. These labels were gathered in order to study
the conditions more carefully under which spatial regularization leads to improved
fits: For spectra degraded by considerable artifacts, no quantification method is expected to work well, and hence a beneficial effect of GGMRF may be diluted if these
examples are included in the analysis. On the other hand, the spatial regularization
is employed mainly to enhance the noise robustness of the fit and should hence prove
advantageous especially on noisy spectra.
The main evaluation results are listed in Table 1.3: it shows the accuracies of the SV
and GGMRF quantification for all of the twelve scans (i.e. the percentage of “good”
fits among all spectra which have not been assigned a “containing artifacts” label
by the respective rater). The alternative hypothesis that GGMRF quantification
leads to an increase in this percentage was tested against the null hypothesis that
there is no effect. As the two raters clearly have differently strict criteria both for
a good fit and for a good spectrum, a separate test was conducted for each rater.
The values of the percentages also vary considerably between the different scans,
which is understandable due to the differences in voxel size and hence in SNR. Therefore,
a one-sided signed-rank test (Wilcoxon, 1945) was employed, which only assumes
that the percentage differences between the two quantification methods are sampled
independently from the same distribution, which is symmetric around its mean µ:
the alternative hypothesis then corresponds to µ > 0, while the null hypothesis
corresponds to µ ≤ 0. The p-values were 0.0033 for rater A and 0.0294 for rater B, i.e.
there is significant evidence that GGMRF indeed leads to an improved fit accuracy.
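This analysis can be reproduced from the per-scan accuracies of Table 1.3 (values transcribed below). Note that the exact p-values depend on how zeros and ties among the differences are handled, so a reimplementation — here with scipy.stats.wilcoxon — may deviate slightly from the numbers reported in the text:

```python
import numpy as np
from scipy.stats import wilcoxon

# accuracies per scan from Table 1.3, in percent
sv_a    = np.array([84.38, 50.00, 23.02, 61.06, 72.88, 19.64,
                    95.31, 92.19, 93.68, 94.74, 80.36, 69.03])
ggmrf_a = np.array([85.94, 49.06, 24.46, 62.83, 74.58, 25.60,
                    96.88, 93.75, 93.68, 95.79, 82.14, 69.91])
sv_b    = np.array([93.75, 80.70, 58.33, 86.24, 95.45, 52.07,
                    96.88, 93.75, 93.68, 94.74, 81.25, 70.37])
ggmrf_b = np.array([93.75, 79.82, 56.94, 88.07, 98.48, 57.99,
                    96.88, 93.75, 93.68, 95.79, 83.93, 73.15])

# average absolute improvement per rater
mean_a = (ggmrf_a - sv_a).mean()
mean_b = (ggmrf_b - sv_b).mean()

# one-sided signed-rank test: alternative "GGMRF accuracy is higher"
p_a = wilcoxon(ggmrf_a - sv_a, alternative="greater").pvalue
p_b = wilcoxon(ggmrf_b - sv_b, alternative="greater").pvalue
```

Both tests reject the null hypothesis at the 5 % level, consistent with the conclusion in the text.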
However, the absolute value of the difference is small: the average improvement
(Accuracy of GGMRF − Accuracy of SV) is 1.53 % for rater A and 1.25 % for rater B,
while the average relative improvement (Accuracy of GGMRF / Accuracy of SV − 1)
is 4.1 % for rater A and 1.8 % for rater B. Fig. 1.4 shows the absolute and relative
accuracy improvements as a function of in-plane resolution. As could be expected,
the improvements by the spatial regularization are particularly pronounced for very
small voxels: firstly, their smaller SNR causes the NLLS fit to be more prone to run
into local maxima, and secondly, the spatial smoothness assumptions are obviously
fulfilled better for smaller voxels.
[Figure 1.3 – Example spectra for the different signal quality labels: voxel (5,8) in dataset 1 (top left), voxel (7,6) in dataset 11 (top right), voxel (9,1) in dataset 3 (bottom left) and voxel (4,11) in dataset 3 (bottom right). The first three spectra are exemplary for their respective quality classes and received unanimous votes: the top left spectrum was labeled "good" four out of four times, the top right spectrum was always labeled as "containing artifacts" and the bottom left spectrum was always labeled as "noisy". The bottom right spectrum is a typical borderline example: each of the two raters labeled it once as "good" and once as "noisy". The datasets in the two measurement series are labeled from 1 to 12, hence "dataset 11" means the fifth scan in the second measurement series.]
1.8. Alternative proposal: Regularized initialization by
graph cuts
If the NLLS fit fails on a good-quality spectrum, this is typically due to one of the
following three reasons: Either one peak in the spectrum is interpreted both as the
choline and as the creatine peak (Fig. 1.5(a)), or the true choline peak is erroneously
interpreted as the creatine peak and a small noise peak between the creatine and
Chapter 1. MRSI quantification with spatial context
Scan number   Rater A SV   Rater A GGMRF   Rater B SV   Rater B GGMRF
 1             84.38 %      85.94 %         93.75 %      93.75 %
 2             50.00 %      49.06 %         80.70 %      79.82 %
 3             23.02 %      24.46 %         58.33 %      56.94 %
 4             61.06 %      62.83 %         86.24 %      88.07 %
 5             72.88 %      74.58 %         95.45 %      98.48 %
 6             19.64 %      25.60 %         52.07 %      57.99 %
 7             95.31 %      96.88 %         96.88 %      96.88 %
 8             92.19 %      93.75 %         93.75 %      93.75 %
 9             93.68 %      93.68 %         93.68 %      93.68 %
10             94.74 %      95.79 %         94.74 %      95.79 %
11             80.36 %      82.14 %         81.25 %      83.93 %
12             69.03 %      69.91 %         70.37 %      73.15 %
Table 1.3. – Percentage of SV and GGMRF fits that are labeled as “good” by the two
raters, among all spectra in a scan that are assigned a “good” signal quality label by the
respective rater. Scans 7–12 refer to the scans in the second acquisition series.
[Two plots vs. in-plane resolution (3–10 mm), each with curves for rater A and rater B: left, the absolute improvement Accuracy(GGMRF) − Accuracy(SV); right, the relative improvement Accuracy(GGMRF) / Accuracy(SV).]
Figure 1.4. – Absolute and relative accuracy improvement of GGMRF quantification over
SV quantification, as a function of in-plane voxel resolution, for the two raters.
the NAA peak is misinterpreted as the choline peak (Fig. 1.5(b)), or several small
peaks are fitted by one overly wide peak instead of the correct (narrow) peak
(Fig. 1.5(c)).
In order to analyze the reasons why the NLLS quantification fails, it is instructive
to compare the actual peak positions in several spectra from one slice with their
expected values, which can be computed from the B0 field, the temporal sampling
[Three spectrum plots showing data and NLLS fit vs. frequency [ppm], for Series 1, voxel (12,10); Series 4, voxel (13,8); and Series 3, voxel (6,12). Panels: (a) merged choline and creatine peak; (b) choline peak interpreted as creatine peak; (c) several small peaks fitted as one.]
Figure 1.5. – Exemplary spectra showing the reasons for poor NLLS fits. The real part of
spectra in the frequency domain is shown.
rate and the literature values of the chemical shift δ, e.g. as reported by Govindaraju
et al. (2000). Fig. 1.6 shows a representative example: obviously the expected peak
positions are systematically shifted with respect to their actual values. This phenomenon is probably caused by a small systematic deviation of either the B0 field or
the temporal sampling rate from their nominal values. Note that this is a plausible
explanation for fitting results like in Figs. 1.5(a) or 1.5(b): if the initial position of
the choline resonance in the model is closer to the real creatine peak than to the real
choline peak, it gets fitted to this creatine peak, and the creatine resonance in the
model gets either fitted to the same creatine peak (as in Fig. 1.5(a)) or to some other
noise or nuisance peak (as in Fig. 1.5(b)).
Models like in Eqs. (1.16) or (1.18) that vary the parameters of each resonance in
the model separately are ill-suited to correct such systematic errors. One possible
solution would be to introduce couplings between the parameters of the different
resonances, e.g. a repulsion term that prevents different resonances from being mapped to
the same peak in the spectrum. However, a much simpler alternative is to initialize
the model fitting by finding the optimal joint alignment between the model resonances
and the spectrum: For this initialization, we simplify the nonlinear fitting problem
in Eq. (1.16) by keeping the damping constants of the model (1.21) fixed (d1 =
d2 = d3 = 0) and constraining the frequency shifts to be equal for all metabolites
(f1 = f2 = f3 = f ). Then f is the only remaining nonlinear parameter in Eq. (1.16).
For a given value of f , the linear parameters (amplitudes and complex phases) and
hence also the least-squares residuals can be computed in closed form: Let y ∈ ℂ^N
denote the complex signal time course as stacked into a column vector, and let X(f) ∈ ℂ^{N×M}
be a matrix with entries

    X_{nm}(f) = exp[(−d_m^{(0)} + 2πi(f_m^{(0)} + f)) t_n]   (1.22)
[5×5 subgrid of magnitude spectra vs. frequency [ppm], for voxels (4,4) through (8,8).]
Figure 1.6. – Subgrid of magnitude spectra from dataset 3: the plot titles give the x and y
index in the slice. The vertical green bars indicate the expected peak positions for the three
main metabolite resonances, based on the nominal B0 field strength: from left to right, they
correspond to choline, creatine and NAA. One sees clearly that the actual peak positions are
systematically shifted in all of the spectra.
and b ∈ ℂ^M be a complex vector that comprises the metabolite amplitudes and their
complex phases via b_m = a_m e^{iφ_m}. Then the minimum residual sum of squares (RSS)
for a given f is

    RSS(f) = min_b ‖y − X(f) b‖²                          (1.23)
           = ‖y − X(f) (X(f)† X(f))⁻¹ X(f)† y‖²,          (1.24)

where X(f)† denotes the Hermitian adjoint.
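Eq. (1.24) is the standard closed-form linear least-squares solution; a quick numerical sanity check (with random complex data standing in for X(f) and y, not the actual MRSI model) confirms that the projection formula reproduces the residual of a generic least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 64, 3
X = rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))  # stand-in for X(f)
y = rng.standard_normal(N) + 1j * rng.standard_normal(N)            # stand-in for the signal

# RSS via the explicit projection y - X (X^† X)^{-1} X^† y from Eq. (1.24) ...
Xh = X.conj().T
rss_proj = np.linalg.norm(y - X @ np.linalg.solve(Xh @ X, Xh @ y)) ** 2

# ... and via a least-squares solver minimizing ||y - X b||^2 over b.
b, *_ = np.linalg.lstsq(X, y, rcond=None)
rss_lstsq = np.linalg.norm(y - X @ b) ** 2

assert np.isclose(rss_proj, rss_lstsq)
print(rss_proj)
```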
If we assume the metabolite signals to be non-overlapping in frequency space (i.e.
the columns of X(f ) to be approximately orthogonal), Eq. (1.24) can be simplified
considerably.[10] In this case, X(f)† X(f) ≈ N · I becomes nearly diagonal, and we
can write

    RSS(f) ≈ ‖y − (1/N) X(f) X(f)† y‖²                                   (1.25)
           ≈ ‖y‖² − (1/N) ‖X(f)† y‖²                                      (1.26)
           = ‖y‖² − (1/N) (|c₁(f)|² + |c₂(f)|² + ⋯ + |c_M(f)|²),          (1.27)

where c_i(f) = X_i(f)† y and X_i(f) is the i-th column of X(f). Note that the
cross-correlations c_i(f) can also be computed from the Fourier transforms of y and X_i(f)
according to the unitarity of the Fourier transform (i.e. Parseval’s theorem). Since
the Fourier transform of the damped harmonic oscillation X_i(f) is a Lorentzian,
and varying f corresponds to shifting this Lorentzian along the frequency axis, the
cross-correlations c_i(f) can be efficiently computed for different values of f using
a convolution. Then a line search can be performed to find the optimal frequency
shift f* = arg min_f RSS(f), which is then used to initialize the NLLS optimizer.
Fig. 1.7(a) shows a spectrum for which the uninitialized NLLS fit fails. If the NLLS fit
is run after initializing the frequency search values correctly (by a constant shift found
from Eq. (1.27)), the correct minimum is found. Fig. 1.7(b) shows the corresponding
RSS(f) curve: the correct initialization shift at −20 Hz is the global minimum of the
curve.
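The line search over f can be sketched on simulated data as follows. The sampling parameters, resonance frequencies and the simulated −20 Hz shift below are made-up toy values (not the actual acquisition parameters), and the dampings are fixed to zero as described above:

```python
import numpy as np

rng = np.random.default_rng(1)
N, dt = 512, 1e-3                            # samples and sampling interval [s] (assumed)
t = np.arange(N) * dt
f_nominal = np.array([50.0, 120.0, 200.0])   # nominal resonance frequencies [Hz] (assumed)
true_shift = -20.0                           # common frequency miscalibration [Hz]

# Simulated signal: three undamped resonances, all shifted by the same offset, plus noise.
amps = np.array([1.0, 0.8, 1.5])
y = sum(a * np.exp(2j * np.pi * (f0 + true_shift) * t)
        for a, f0 in zip(amps, f_nominal))
y += 0.05 * (rng.standard_normal(N) + 1j * rng.standard_normal(N))

def rss(f):
    # Eq. (1.27): RSS(f) ~ ||y||^2 - (1/N) sum_i |c_i(f)|^2 with c_i(f) = X_i(f)^† y.
    c = np.array([np.exp(2j * np.pi * (f0 + f) * t).conj() @ y for f0 in f_nominal])
    return np.linalg.norm(y) ** 2 - np.sum(np.abs(c) ** 2) / N

# Grid-based line search for the optimal common frequency shift.
grid = np.arange(-30.0, 30.5, 0.5)
f_star = grid[np.argmin([rss(f) for f in grid])]
print(f_star)   # close to the simulated -20 Hz shift
```

In practice one would evaluate the c_i(f) for all grid values at once via a convolution, as described above; the direct evaluation here keeps the sketch short.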
In the presence of very strong spectral artifacts, the initialization according to
Eq. (1.27) may cause the NAA peak to be mapped to the artifact signal instead
of the true NAA signal peak (see Fig. 1.7(c)). Note that in the experiments this only
happened for spectra which were labeled as “containing artifacts” by both raters, and
which were therefore excluded from the evaluation in section 1.7. When examining
the graph of the function RSS(f ) for these pathological spectra, one notes that the
true initialization appears as a local minimum, which is however overshadowed by
the global minimum corresponding to the artifact signal (Fig. 1.7(d)). In this case,
incorporating spatial context in the spirit of Eq. (1.18) is a plausible remedy: the
initialization constants fv of the different voxels are coupled by a GGMRF prior, and
the joint optimum is found by solving

    f* = arg min_f E(f) = arg min_f [ λ Σ_v RSS(f_v) + Σ_{v∼w} |f_v − f_w|^p ].   (1.28)
While the pair potential is a convex function in the vector f , the single-site potentials
RSS(fv ) are in general not convex: hence Eq. (1.28) cannot be tackled by convex
[10] This assumption holds very well between the NAA resonance and the two other resonances, but
less well between the choline and the creatine resonance. However, since this step is only meant
as a rough initialization of the fitting process, and the peak positions are refined afterwards, the
increase in simplicity and computation speed warrants the slight inaccuracy.
[Four panels: (a) spectrum of Series 4, voxel (8,10), with the signal and three NLLS fits (uninitialized, SV initialized, cut initialized), for which single-voxel initialization leads to correct NLLS convergence; (b) the corresponding offset-shifted RSS(f) curve vs. the frequency shift between template and signal [Hz]; (c) the neighboring spectrum at voxel (8,9), for which spatially regularized initialization is required; (d) the corresponding RSS(f) curve.]
Figure 1.7. – Exemplary spectra showing the benefits of single-voxel and regularized initialization. Note that the spectrum in Fig. 1.7(a) and other similar spectra are directly adjacent
to the spectrum in Fig. 1.7(c). Hence the smoothness prior on the frequency initialization
shift can be used to evade the global minimum caused by the artifact peak in Fig. 1.7(d).
For illustration purposes, the RSS(f ) curves were offset-shifted so that the minimum value
of the curve is always zero: this does not influence the solution of the optimization problem.
optimization techniques. Using an ICM or block-ICM procedure as for the problem
in Eq. (1.18) would be possible, but convergence might be slow, and there is no
guarantee that the global optimum is eventually attained.
However, this problem differs from the one in Eq. (1.18) in that the state of each
voxel can be described by a single frequency shift scalar fv instead of several variables
(frequency shifts, dampings and phases of several metabolites). Using an appropriate
discretization for the fv , the exact joint minimum can be computed efficiently by
modelling it as a graph cut problem as in (Ishikawa, 2003). In general, for a set of
linearly ordered labels l_i, the minimization problem

    l* = arg min_l [ Σ_i ψ_i(l_i) + Σ_{i∼j} g(l_i − l_j) ]   (1.29)
for arbitrary single-site potentials ψi and an arbitrary convex function g can be
transformed into an equivalent min-st-cut problem, which is then solved using e.g.
the dual-tree max-flow algorithm (Boykov et al., 2001; Kolmogorov & Zabih, 2004;
Boykov & Kolmogorov, 2004). Experimentally, it was shown that this max-flow implementation gives the best results for graph cut problems of this structure (Boykov
& Kolmogorov, 2004). Note the conceptual difference from the GGMRF model
and its block-ICM optimization heuristic described earlier: Instead of imposing a
smoothness prior on the final model parameters, the regularization only affects their
initialization value (i.e. their rough location), and they are then refined by a usual
single-voxel NLLS optimization. Further differences are that only one nonlinear parameter is optimized over (the most important one, namely the global frequency
calibration), and that therefore the global optimum for this single parameter can be
found efficiently in contrast to the local optimality of block-ICM.
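The flavor of exact optimization over linearly ordered labels can be illustrated on a 1-D chain, where the global optimum of a functional of the form of Eq. (1.29) is found by simple dynamic programming; the full 2-D grid case requires the Ishikawa graph construction solved by max-flow. The potentials below are random stand-ins, and brute force confirms global optimality on this tiny instance:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
n_sites, n_labels, lam = 5, 4, 0.7
psi = rng.random((n_sites, n_labels))     # arbitrary (non-convex) single-site potentials
g = lambda a, b: lam * abs(a - b)         # convex pairwise term lam * |l_i - l_j|

# Dynamic programming along the chain:
# M[i, l] = psi[i, l] + min_{l'} (M[i-1, l'] + g(l, l'))
M = psi[0].copy()
for i in range(1, n_sites):
    M = psi[i] + np.array([min(M[lp] + g(l, lp) for lp in range(n_labels))
                           for l in range(n_labels)])
dp_min = M.min()

# Brute force over all n_labels^n_sites assignments confirms the global optimum.
def energy(labels):
    return sum(psi[i, l] for i, l in enumerate(labels)) + \
           sum(g(labels[i], labels[i + 1]) for i in range(n_sites - 1))
brute_min = min(energy(lab) for lab in product(range(n_labels), repeat=n_sites))

assert abs(dp_min - brute_min) < 1e-9
print(dp_min)
```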
Tables 1.4 and 1.5 show the accuracy improvements of the NLLS quantification procedure by the single-voxel and the spatially regularized (graph-cut) initialization over
the basic NLLS method where no special initialization is performed: for Table 1.4, all
spectra are considered, while Table 1.5 only pertains to artifact-free spectra in analogy to Table 1.3. For the weighting factor from Eq. (1.28), λ = 20 was used, and the
spatial prior was chosen to be linear (p = 1). It can be seen that the single-voxel initialization already leads to considerable improvements over the basic NLLS quantification, which are much more pronounced than the improvements by the GGMRF prior
on the fit parameters.
initialization shifts is mainly beneficial for artifact-containing spectra, but also gives
small improvements over the single-voxel initialization for the artifact-free, but noisy
spectra in e.g. series 6. The improvement of the single-voxel initialization over the
uninitialized NLLS quantification is highly significant both when analyzing all spectra and when analyzing only the artifact-free spectra (in both cases p = 1.26 × 10⁻³
for a one-sided Wilcoxon test, if a fit with “wrong amplitudes” is counted as “poor”).
In contrast, the improvement of the spatially regularized over the single-voxel initialization is significant only when considering all spectra (p = 0.0113), while p = 0.0907
when only the artifact-free spectra are considered. Figs. 1.8(a) and 1.8(b) show the
accuracies as a function of in-plane resolution: as can be expected, the benefits of
                “Wrong amplitudes” as “poor”       “Wrong amplitudes” as “good”
Series        NoInit    SVInit    GCInit         NoInit    SVInit    GCInit
 1            84.4 %   100.0 %   100.0 %        84.4 %   100.0 %   100.0 %
 2            42.1 %    97.5 %    97.5 %        43.0 %    97.5 %    97.5 %
 3            21.5 %    92.4 %    93.1 %        22.2 %    93.8 %    95.1 %
 4            42.6 %    83.4 %    88.8 %        42.6 %    83.4 %    98.8 %
 5            50.0 %    85.0 %    89.0 %        50.0 %    86.0 %    96.0 %
 6            23.1 %    76.9 %    82.8 %        24.9 %    82.8 %    87.6 %
 7            96.9 %   100.0 %   100.0 %        96.9 %   100.0 %   100.0 %
 8            93.8 %   100.0 %   100.0 %        93.8 %   100.0 %   100.0 %
 9            89.0 %   100.0 %   100.0 %        89.0 %   100.0 %   100.0 %
10            90.0 %    98.0 %   100.0 %        90.0 %    98.0 %   100.0 %
11            55.0 %    79.9 %    90.5 %        55.0 %    80.5 %    93.5 %
12            49.1 %    76.3 %    87.6 %        49.1 %    77.5 %    89.9 %
Table 1.4. – Ratio of good NLLS fits among all spectra, for three different initialization
schemes of the frequency shifts: setting all to zero (“NoInit”), single-voxel initialization as
by Eq. (1.27) (“SVInit”) and spatially regularized graph cut initialization as by Eq. (1.28)
(“GCInit”). Note that spectra with artifacts were not discarded before computing these
numbers. The difference between columns 2–4 and columns 5–7 lies in how fits with a
“wrong amplitudes” label were treated: in the former case, they were considered as “poor”
fits, while in the latter case, they were considered to be “good” fits.
the initialization are the highest for highly resolved MRSI measurements with a poor
SNR, for which NLLS is likely to run into local minima, as for the GGMRF model.
The computation times are shown in Fig. 1.9. Apparently, using a single-voxel initialization even saves time over the uninitialized NLLS fit (40 % on average): Computing the initialization is very fast, since all computations can be implemented via
one-dimensional convolutions in the approximate formulation of Eq. (1.27), and the
accelerated convergence of the subsequent NLLS fitting more than makes up for this
initial investment. In contrast, using the spatially regularized initialization leads to
an increase in computation time by 57 % on average, since solving the graph-cut
optimization problem is costly. However, this is still well beneath the computation
times required by the block-ICM algorithm.
                “Wrong amplitudes” as “poor”       “Wrong amplitudes” as “good”
Series        NoInit    SVInit    GCInit         NoInit    SVInit    GCInit
 1            84.4 %   100.0 %   100.0 %        84.4 %   100.0 %   100.0 %
 2            45.5 %    99.1 %    99.1 %        46.4 %    99.1 %    99.1 %
 3            22.1 %    92.9 %    93.6 %        22.9 %    94.3 %    95.7 %
 4            57.0 %    97.5 %    99.2 %        57.0 %    97.5 %    99.2 %
 5            66.2 %   100.0 %   100.0 %        66.2 %   100.0 %   100.0 %
 6            22.6 %    76.8 %    82.7 %        24.4 %    82.7 %    87.5 %
 7            96.9 %   100.0 %   100.0 %        96.9 %   100.0 %   100.0 %
 8            93.8 %   100.0 %   100.0 %        93.8 %   100.0 %   100.0 %
 9            93.7 %   100.0 %   100.0 %        93.7 %   100.0 %   100.0 %
10            94.7 %   100.0 %   100.0 %        94.7 %   100.0 %   100.0 %
11            82.1 %    99.1 %    99.1 %        82.1 %   100.0 %   100.0 %
12            69.9 %    99.1 %    99.1 %        69.9 %    99.1 %    99.1 %
Table 1.5. – Ratio of good NLLS fits among artifact-free spectra, for three different initialization schemes of the frequency shifts (as in Table 1.4). All spectra were discarded for
which at least one of the two signal quality labels by rater A (see Table 1.3) was “containing
artifacts”. The differences between the numbers in the second column of this table, and the
numbers in the second column of Table 1.3 are due to the limited intra-rater reliability.
[Two plots of the percentage of good fits vs. in-plane resolution (3–10 mm), with curves for NoInit, SVInit and GCInit: (a) all spectra used (as in Table 1.4); (b) artifact spectra discarded (as in Table 1.5).]
Figure 1.8. – Accuracy, i.e. percentage of “good” fits among all, for three different initialization schemes (see caption of Table 1.4), plotted against the in-plane voxel resolution.
[Bar plot of the average computation time per voxel [sec] (up to about 0.5 sec) for datasets 1–12, comparing NoInit, SVInit and GCInit.]
Figure 1.9. – Average computation time per voxel for quantifying the different datasets by
the NLLS method, both without any initialization (NoInit), with a single-voxel initialization
(SVInit) as given by Eq. (1.27) and with a spatially regularized initialization (GCInit) as
given by the graph cut functional in Eq. (1.28).
Chapter 2.
Software for MRSI analysis
2.1. Introduction and motivation
Imaging methods for the in vivo diagnostics of tumors fall into three categories based
on the different physical mechanisms they exploit: In computed tomography (CT), X-rays transmitted through the body are attenuated differently by different
tissue types.
(PET) or single photon emission computed tomography (SPECT), one detects the
radiation of radioactive nuclides, which are selectively accumulated in the tumor
region. Finally, magnetic resonance imaging (MRI) exploits the fact that certain
nuclei (most notably protons) have a different energy when aligned in the direction of an
external magnetic field than when they are aligned opposite to it. By injecting a
radiofrequency wave into the imaged body, one can perturb some protons out of
their equilibrium state into a higher-energy state: the radiofrequency signal which
they emit upon relaxation is then measured, and its amplitude is proportional to the
concentration of the protons in the imaged region. This measurement process can
be performed in a spatially resolved fashion, so that a three-dimensional image is
formed.
Standard MRI produces a scalar image based on the total signal of all protons, irrespective of the chemical compound to which they belong: typically, the protons in
water molecules and in lipids make the highest contribution due to the large concentration of these molecules. However, the protons in different compounds can be
distinguished by their resonance frequencies in the magnetic field (the so-called chemical shift), and it is possible to resolve the overall signal not only spatially, but also
spectrally: this leads to magnetic resonance spectroscopy imaging (MRSI) or chemical shift imaging (CSI), for which a complex spectrum is obtained at each image
voxel instead of a single scalar value as in MRI (de Graaf, 2008). Hence it is possible
to measure the local abundance of various biochemical molecules non-invasively, and
thereby gain information about the chemical make-up of the body at different locations: besides water and lipids, most major metabolites can be identified in the MRSI
spectra, e.g. the most common amino acids (glutamate, alanine, …), the reactants
and products of glycolysis (glucose, ATP, pyruvate, lactate), precursors of membrane biosynthesis (choline, myo-inositol, ethanolamine), energy carriers (creatine)
and tissue-specific marker metabolites (citrate for the prostate, N-acetylaspartate or
NAA for the brain). As a downside, these metabolites occur in much lower concentrations than water, hence the spatial resolution must be far coarser than in MRI: only
by collecting signal from a volume of typically 0.2–2 cm³ can a sufficient signal-to-noise
ratio be achieved.
MRSI provides valuable information for the noninvasive diagnosis of various human
diseases, e.g. infantile brain damage (Xu & Vigneron, 2010), multiple sclerosis (Sajja
et al., 2009), hepatitis (Cho et al., 2001) or several psychiatric disorders (Dager et al.,
2008). The most important medical application field lies in tumor diagnostics, especially in the diagnosis and staging of brain, prostate and breast cancer as well
as the monitoring of therapy response (Gillies & Morse, 2005). In tumors, healthy
cells are destroyed and the signals of the biomarkers characteristic for healthy tissue
(e.g. citrate for the prostate, NAA for the brain) are decreased. On the other hand,
biomarkers for pathological metabolic processes often occur in increased concentrations: choline (excessive cell proliferation), lactate (anaerobic glycolysis), mobile
lipids (impaired lipid metabolism). The top right and bottom right spectra in Fig. 2.1
are typical examples of spectra occurring in healthy brain tissue and in brain tumor,
respectively.
While MRSI has proved its efficacy for radiological diagnostics, it is a fairly new
technique that has yet to gain ground in routine radiology and in the training curricula of radiologists. Furthermore, the visual assessment is harder and more time-consuming than for MRI: while most medical imaging modalities provide two- or
three-dimensional data, MRSI provides four-dimensional data due to the additional
spectral dimension. Automated decision-support systems may assist the radiologists
by visualizing the most relevant information in form of easily interpretable nosologic
images (de Edelenyi et al., 2000): from each spectrum, a scalar classification score
is extracted that discriminates well between healthy and tumorous tissue, and all
scores are displayed as a color map. Ideally the scores can even be interpreted as
the probability that the respective spectrum corresponds to a tumor. While such a
decision support system may not completely obviate the need for manual inspection
of the spectra, it can at least guide the radiologist towards suspicious regions that
should be examined more closely, and facilitate the comparison with other imaging
modalities.
Methods for computing the classification scores fall into two categories: quantification-based approaches (Poullet et al., 2008) and pattern recognition-based approaches
(Hagberg, 1998). Quantification approaches exploit the fact that MRSI signals are
[3×3 grid of magnitude spectra [a.u.] vs. frequency (4–1 ppm), displayed on a common scale.]
Figure 2.1. – Exemplary MRSI magnitude spectra of the brain, showing different voxel
classes and signal qualities. All spectra have been water-suppressed and L1 normalized (i.e.
divided by the sum of all channel entries), and they are displayed on a common scale. Note the
three distinct metabolite peaks, which are characteristic for brain MRSI: Choline (3.2 ppm),
creatine (3.0 ppm) and N-acetylaspartate (NAA, 2.0 ppm). NAA is a marker for functional
neurons, hence it has a high concentration in healthy tissue, and a low concentration in
tumor tissue. On the other hand, choline is a marker for membrane biogenesis and has
a higher concentration in tumor tissue than in healthy tissue. Left column: Spectra that
are not evaluable owing to poor SNR or the presence of artifacts. Middle column: Spectra
with poor signal quality, which however have sufficient quality so that the voxel class may
be ascertained. Right column: Spectra with good signal quality. Top row: Spectra from
healthy brain tissue. Middle row: Spectra of undecided voxel class. Bottom row: Spectra
from tumor tissue. Note that the voxel class is only meaningful for the middle and the right
column, and that the spectra in the left column were randomly assigned to the different rows.
physically interpretable as superpositions of metabolite spectra; they can hence be
used to quantify the local relative concentrations of these metabolites by fitting
measured or simulated basis spectra to the spectrum in every voxel. The fitting
parameters (amplitudes, frequency shifts, . . .) may be regarded as a low-dimensional
representation of the signal. Classification scores are then usually computed from amplitude ratios of relevant metabolites: for instance, the choline/creatine and choline/NAA ratios are frequently employed for the diagnosis of brain tumors (Martínez-Bisbal & Celda, 2009).
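As a minimal sketch of such a ratio-based score, the following computes a choline/NAA map and a thresholded nosologic mask; the amplitude maps and the threshold here are hypothetical stand-ins, not output of any real quantification:

```python
import numpy as np

# Hypothetical fitted amplitude maps (arbitrary units) on a small 4x4 MRSI slice.
rng = np.random.default_rng(4)
cho = rng.uniform(0.5, 2.0, (4, 4))    # choline amplitudes
naa = rng.uniform(0.5, 2.0, (4, 4))    # NAA amplitudes

# Classification score per voxel: choline/NAA ratio (higher = more suspicious).
score = cho / naa

# A simple nosologic map: flag voxels whose ratio exceeds a hypothetical threshold.
suspicious = score > 1.0
print(score.round(2))
print(suspicious.sum(), "of", score.size, "voxels flagged")
```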
Pattern recognition approaches forego an explicit data model: instead, the MRSI
signal is preprocessed to a (still high-dimensional) feature vector, and the mapping
of feature vectors to classification scores is learned from manually annotated training
vectors (the so-called supervised learning setting). Because of this need for manually annotated examples, pattern recognition techniques require higher effort from
human experts than quantification-based techniques. Furthermore, they have to be
retrained if the experimental measurement conditions change (e.g. different magnetic
field strength, different imaged organ or different measurement protocol). However,
comparative studies of quantification and pattern recognition methods for prostate
tumor detection showed superior performance of the latter ones, as they are more
robust against measurement artifacts and noise (Kelm et al., 2007). Given a sufficiently large and diverse training dataset, one can even use pattern recognition
to distinguish between different tumor types, e.g. astrocytomas and glioblastomas
(Tate et al., 2006).
MRSI data often have quality defects that render malignancy assessment difficult
or even impossible: low signal-to-noise ratio, line widening because of shimming
errors, head movement effects, lipid contamination, signal bleeding, ghosting etc.
(Kreis, 2004). If these defects become sufficiently grave, even pattern recognition
methods cannot tolerate them, and the resulting classification scores will be clinically
meaningless and should not be used for diagnosis. Fig. 2.1 shows example spectra
of good, poor, and very poor (not evaluable) quality for healthy, undecided and
tumorous tissue. One can deal with this problem by augmenting the classification
score for the malignancy (also called voxel class) with a second score for the signal
quality: If this score is high, the users know that the spectrum has high quality
and that the voxel class score is reliable, while for a low score they know that the
voxel class score is unreliable and the spectrum should be ignored. This may also
save the users’ time, as poor-quality spectra need not be examined in detail. Pattern
recognition approaches have been successfully employed for signal quality prediction,
with similar performance to expert radiologists (Menze et al., 2008).
Most existing software products for MRSI classification incorporate quantification-based algorithms: for instance, they are typically included in the software packages
supplied by MR scanner manufacturers. Furthermore, there are several stand-alone
software products such as LCModel (Provencher, 2001), jMRUI (Stefan et al., 2009)
or MIDAS (Maudsley et al., 2006).
In contrast, the application of pattern recognition-based methods still has to gain
ground in clinical routine: This may be partially due to differences in the flexibility
with which both categories of algorithms can be adjusted to different experimental
conditions (e.g. changes in scanner hardware and in measurement protocols) or to a
different imaged organ. For quantification-based methods one must only update the
metabolite basis spectra to a given experimental setting, which can be achieved by
quantum-mechanical simulation, e.g. with the GAMMA library (Smith et al., 1994).
For pattern recognition-based methods on the other hand, one has to provide manual
labels of spectra from many different patients with a histologically confirmed tumor,
which is time-consuming and requires the effort of one or several medical experts.
Since there exist many different techniques whose relative and absolute performance
on a given task cannot be predicted beforehand, for every change in conditions a
benchmarking experiment as in (Menze et al., 2006) or (García-Gómez et al., 2009)
should also be conducted to select the best classifier and monitor the classification
quality.
While the need for classifier retraining, benchmarking and quality assessment cannot
be obviated, this chapter presents an object-oriented C++ library and a graphical
user interface which assists this task better than existing software.[1] This work is an
extension of the CLARET software (Kelm et al., 2006): While the original prototype
of this software was written in MATLAB, an improved C++ reimplementation was
created for the MeVisLab[2] environment. Most of the functionality described in this
thesis does not exist in the original CLARET version and is hence novel: mainly
the possibility to manually define labels and to train, test, evaluate and compare
various classifiers and preprocessing schemes. The original software was only capable of analyzing MRSI data measured with a specific acquisition protocol (prostate
measurements acquired with an endorectal coil at a 1.5 Tesla scanner with an echo
time of 135 ms and a sampling interval of 0.8 ms). Retraining was only possible
using both specialized tools and specialized knowledge about pattern recognition.
2.2. Background: Supervised classification
The following survey covers common knowledge; for a reference, see e.g. the book
by Hastie et al. (2009).
[1] The contents of this chapter have been published as (Kaster et al., 2009, 2010a,b).
[2] http://www.mevislab.de
Aims and pitfalls of classification Supervised classification is a subarea of statistical learning. It deals with the following question: Assume we have a set of training
examples with associated labels {(xi , yi )|i = 1, ..., n} ⊂ X × Y, with a (discrete or
continuous) feature space X and a finite label space Y. In the following, we set
X ⊆ Rp and Y = {0, . . . , L − 1}. A classifier is a rule that tells us which label g(x̂)
should be given to a new test example x̂ for which the true label ŷ is not known,
based on the training input. Ideally one is also interested in estimates for the probabilities p̂1 , . . . , p̂L that the label ŷ belongs to the different possible classes, rather
than a crisp assignment. The aim of classifier training is a low value for the expected
classification error on a test example
E(x̂,ŷ)[1 − δŷ,g(x̂)] = p(ŷ ≠ g(x̂)),   (2.1)
which is also known as the generalization error.3 The theoretically optimal classifier
(with the smallest generalization error) is the Bayes classifier:
g(x) = arg max_y p(y|x).   (2.2)
However, the conditional distribution p(y|x) is not known in practice. For a suitably
large training set, the training error
(1/n) Σ_{i=1}^n [1 − δ_{g(xi),yi}]   (2.3)
is a lower bound on the generalization error, but it may be a severe underestimation:
there are classifiers which are closely tuned to the training set so that their training
error can go down to zero, but which may perform very poorly on test examples (this
phenomenon is called “overfitting”). Better estimates for the generalization error can
be achieved by cross-validation: the training data are partitioned into different folds,
the classifier is repeatedly trained on all but one folds and tested on the remaining
fold, and the average of all empirical test errors is reported. However, one should note
that cross-validation estimates are in general biased (Bengio & Grandvalet, 2004).
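The K-fold procedure just described can be written down compactly; the following is an illustrative NumPy sketch (the function name and the train/predict callables are hypothetical, not part of the software presented later in this chapter):

```python
import numpy as np

def kfold_error(features, labels, train_fn, predict_fn, n_folds=5, seed=0):
    """K-fold cross-validation: repeatedly train on all but one fold,
    test on the held-out fold, and average the empirical test errors."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(labels)), n_folds)
    errors = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        model = train_fn(features[train], labels[train])
        predictions = predict_fn(model, features[test])
        errors.append(np.mean(predictions != labels[test]))  # empirical 0/1 error
    return float(np.mean(errors))
```

As noted above, this estimate is in general biased; averaging over the folds also yields a variance estimate for the classifier results as a by-product.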
Finally, the bias-variance trade-off is important for understanding the dependence of
many classifiers on their free parameters: For sake of illustration, consider a binary
classification (L = 2). Then,

1 − δŷ,g(x̂) = (ŷ − g(x̂))²   (2.4)

³ The generalization error is the simplest example of a loss function, namely one that treats all misclassifications as equally grave. More flexible loss functions may also be defined: e.g. for the automated tumor classification application considered in this chapter, false positives might be considered more permissible than false negatives; the goal of the classifier is then to minimize the expected value of this loss.
and the generalization error can be decomposed as follows:

E(x̂,ŷ)[(ŷ − g(x̂))²] = E(x̂,ŷ)[(ŷ − Ex̂[g(x̂)])² + (Ex̂[g(x̂)] − g(x̂))²]   (2.5)
= Eŷ[(ŷ − Ex̂[g(x̂)])²] + Ex̂[(g(x̂) − Ex̂[g(x̂)])²].   (2.6)
The second term in Eq. (2.6) measures how the classifier prediction varies around its expected prediction value (the variance), while the first term measures by how much the expected prediction value deviates from the true label (the bias). Many
classifier parameters increase or decrease the local smoothness (or regularity) of the
classifier: by adjusting them, one can often trade higher bias for lesser variance and
vice versa. Often, the optimum compromise between these two conflicting factors
is achieved at a moderate parameter value, which may be found e.g. via cross-validation.
k nearest neighbors Arguably one of the simplest supervised learning techniques
is the k nearest neighbors (kNN) classifier: for every test point, find the k closest
examples among the training data (with respect to a suitable metric on the feature
space X) and assign their majority label. Despite its simplicity, this classifier has good
theoretical guarantees: e.g. in the limit of infinite training data, its generalization
error is at most twice as large as the generalization error of the Bayes classifier
(Stone, 1977). However, for limited training examples the parameter k becomes
important: large values of k enforce regularity of the classifier and decrease variance,
while possibly incurring a bias. In contrast, small values of k commonly lead to small
bias and large variance.
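The procedure can be condensed into a short NumPy sketch (illustrative only; the thesis software uses C++ implementations of its classifiers):

```python
import numpy as np

def knn_predict(X_train, y_train, x_test, k=3):
    """k-nearest-neighbor classification with the Euclidean metric:
    find the k closest training examples and return their majority label."""
    distances = np.linalg.norm(X_train - x_test, axis=1)
    nearest = np.argsort(distances)[:k]       # indices of the k closest points
    votes = np.bincount(y_train[nearest])     # label counts among the neighbors
    return int(np.argmax(votes))              # majority label
```

With k = n, the prediction degenerates to the globally most frequent label (maximum bias, minimum variance); with k = 1, it interpolates the training set (minimum bias, maximum variance).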
Decision trees and random forests Decision tree classifiers (Hastie et al., 2009,
chap. 9.2) iteratively partition the feature space into orthotopes: A binary tree data
structure is initialized with the entire space X as the root node, then the tree is
grown by repeatedly splitting a leaf node into two daughter nodes via the best axis-parallel split, i.e. a rule of the form “if feature i is larger than a threshold θ, then go to the right child, else go to the left child”. The best split is commonly defined as the one causing the maximum decrease in some measure of node impurity
among the training examples: I.e. if the mother node contains N training examples,
of which a fraction p0 ∈ [0, 1] belongs to class 0 and a fraction p1 = 1 − p0 belongs to
class 1, and the left and right child contain NL and NR examples with fractions pL0 ,
pL1, pR0 and pR1, common criteria are searching for the maximum entropy decrease

−p0 log p0 − p1 log p1 + (NL/N)(pL0 log pL0 + pL1 log pL1) + (NR/N)(pR0 log pR0 + pR1 log pR1)   (2.7)

or the maximum Gini impurity decrease

2 p0 p1 − 2 (NL/N) pL0 pL1 − 2 (NR/N) pR0 pR1.   (2.8)
This process is ended either once a maximum tree depth is reached, or once node
purity is reached (i.e. all leaf orthotopes contain only training examples from a
single class). An unlabeled test example is then classified according to the majority
label inside the orthotope in which it is contained. Single decision trees are prone
to overfitting, especially if the tree is grown up to purity. Random forests (Breiman,
2001) confer higher robustness: instead of growing a single tree, an ensemble of
randomized trees is grown, and unlabeled test examples are assigned the majority
label of the tree predictions. In the most common variant, randomization occurs
at two stages: Firstly, each single tree is only trained using a random subset of
the training examples, which is generated by bootstrapping (i.e. sampling with replacement). The remaining examples can be used to estimate the generalization error of this tree (this is called the out-of-bag estimate). Secondly, only a random subset of mtry ≪ p features is considered at each split. This number mtry is the
main adjustable parameter4 for random forests, as it determines the balance between
two conflicting aims of random forest generation: the trees should be diverse to avoid
overfitting (which encourages small mtry values), but also give accurate predictions
(which encourages large mtry values). The rule of thumb mtry = √p often provides
a good compromise.
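The split-selection criterion of Eq. (2.7) can be made concrete with a small NumPy sketch (illustrative only, for binary labels 0/1 as in the text):

```python
import numpy as np

def binary_entropy(p0):
    """Entropy of a node whose class-0 fraction is p0 (natural logarithm)."""
    probs = np.array([p0, 1.0 - p0])
    probs = probs[probs > 0]                  # convention: 0 log 0 = 0
    return float(-(probs * np.log(probs)).sum())

def entropy_decrease(y_mother, y_left, y_right):
    """Decrease in entropy impurity achieved by splitting a mother node
    into a left and a right child, as in Eq. (2.7)."""
    y_mother, y_left, y_right = map(np.asarray, (y_mother, y_left, y_right))
    N, NL, NR = len(y_mother), len(y_left), len(y_right)
    return (binary_entropy(np.mean(y_mother == 0))
            - (NL / N) * binary_entropy(np.mean(y_left == 0))
            - (NR / N) * binary_entropy(np.mean(y_right == 0)))
```

A perfectly separating split of a balanced node attains the maximum possible decrease, log 2; a split whose children have the same class fractions as the mother attains a decrease of zero.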
Linear regression and regularized variants Linear regression (Hastie et al., 2009,
chap. 3) originally addresses a regression problem, namely the prediction of continuous labels
yi ∈ R: however, binary classification may be reduced to this setting by using 0 and
1 as the training labels and binarizing the continuous test predictions via a threshold
at e.g. 0.5. It searches for the optimal linear relationship (in a least-squares sense)
between the features and the labels: if all training labels are stacked in an n × 1
vector y, and all training features in an n × (p + 1) matrix X, it solves the problem

w∗ = arg min_w (y − Xw)²  with w ∈ Rp+1.   (2.9)
In order to allow for a constant offset, we assume that the last column of X is a vector
of ones. Especially in high-dimensional feature spaces (n < p), linear regression
⁴ The second parameter is the number of trees. However, this is mostly determined by the time available for training and prediction: in most cases, more trees give better prediction accuracies, but the effect saturates, and both training and prediction time grow linearly in the number of trees.
becomes an ill-posed problem and may suffer from severe overfitting, poor numerical
conditioning and poor robustness towards noise. The solution lies in regularizing the
regression, i.e. in restricting the effective number of parameters to a value smaller
than n. One possible approach lies in imposing a Gaussian prior on the weight vector,
which leads to ridge regression (RR):
w∗ = arg min_w (y − Xw)² + λ w².   (2.10)
Large values of λ force the weights wj to be small,5 and will additionally make the
problem well-conditioned. A different approach is principal components regression
(PCR): if V = (v1 , . . . , vnPC ) is a matrix built of the nPC principal components of
the feature matrix X (i.e. the eigenvectors of X ⊤ X corresponding to the leading
eigenvalues), then PCR solves the optimization problem
w∗ = arg min_w (y − XV w)².   (2.11)
Hence the dimensionality of the features is reduced from p to nPC . Often the leading
principal components carry most of the discriminative information of the features,
while the other components are mainly noise variables. Concerning the bias-variance
tradeoff mentioned above, large values of λ and small values of nPC will decrease the
variance, but possibly incur a bias. Note that both linear regression and its variants
are linear estimators, i.e. the predictions ŷ of the trained regressor for the training
examples linearly depend on the labels:
ŷ = Sy,   (2.12)

with S being a function of X: e.g., for linear regression,

S = X(X⊤X)⁻¹X⊤.   (2.13)
For this kind of estimator, the leave-one-out cross-validation estimate of the generalization error can be efficiently approximated by the generalized cross-validation (GCV):

GCV = (1/N) Σ_{i=1}^N [(yi − ŷi) / (1 − trace(S)/N)]².   (2.14)
⁵ Although typically no weights will be exactly zero. This behavior can be enforced by imposing an L1 prior on w (LASSO) instead of the L2 prior used in ridge regression. However, in contrast to ridge regression, a closed-form solution of the LASSO problem is no longer available.
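Since ridge regression has the closed-form solution w∗ = (X⊤X + λI)⁻¹X⊤y, both the fit and the GCV score of Eq. (2.14) are cheap to compute. The following is a minimal NumPy sketch (illustrative; the library described in this chapter delegates such computations to external C++ code):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression weights: (X'X + lam*I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def ridge_gcv(X, y, lam):
    """Generalized cross-validation score of Eq. (2.14), using the hat
    matrix S = X (X'X + lam*I)^{-1} X' of this linear estimator."""
    n, p = X.shape
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    residuals = y - S @ y
    return float(np.mean((residuals / (1.0 - np.trace(S) / n)) ** 2))
```

For λ = 0 this reduces to ordinary least squares; scanning a grid of λ values and picking the GCV minimizer is the parameter-tuning strategy used for the regression-based classifiers below.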
Margin-based methods: Support vector machines The support vector machine
(Burges, 1998; Schölkopf & Smola, 2002, SVM) is a binary classification technique
that aims to maximize the margin between the two classes (which are commonly
denoted by −1 and 1 rather than 0 and 1). For the simplest case, assume that the
training examples are linearly separable, i.e. there exists a vector w and a scalar b
such that

yi(w⊤xi + b) > 0 for all i.   (2.15)
Qualitatively, that means that the training examples with labels +1 and −1 lie
on opposite sides of the separating hyperplane {x|w⊤ x + b = 0}, which then acts
as the decision boundary. In this case, w and b are not unique, and the support
vector machine is defined as the separating hyperplane with the maximum margin,
i.e. the separating hyperplane for which the distance to the closest training point is
maximized:
(w∗, b∗) = arg min_{w,b} (1/2) w²  s.t.  yi(w⊤xi + b) ≥ 1 for all i.   (2.16)
In practice, training data are rarely exactly linearly separable. If a linear classifier is
appropriate, but there is always some overlap between the two classes due to noise,
the separability constraints can be relaxed by the introduction of slack variables:
(w∗, b∗, ξ∗) = arg min_{w,b,ξ} (1/2) w² + C Σ_{i=1}^n ξi  s.t.  yi(w⊤xi + b) ≥ 1 − ξi, ξi ≥ 0 for all i.   (2.17)
Note that all training examples with ξi > 1 will be misclassified by the trained SVM.
Large values of C penalize such misclassifications severely, while for small values
of C the criterion that the margin should be large becomes more important. If
a nonlinear classifier is more appropriate, the features can be transformed into a
higher-dimensional space via a transformation x → φ(x): a linear classifier in this
higher-dimensional space then becomes a nonlinear classifier in the original space.
For example, a quadratic decision boundary can be achieved via the mapping φ(x) =
(x, x2 )⊤ . It turns out that for solving the optimization problem in Eq. (2.17) only
the scalar products xi⊤xj are required: the solution in the higher-dimensional space
follows directly by replacing these by φ(xi )⊤ φ(xj ) = K(xi , xj ). This allows the use
of infinite-dimensional mappings φ; an important example is the radial basis function
(RBF) kernel
K(xi, xj) = exp(−‖xi − xj‖² / (2γ²)).   (2.18)
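A Gram-matrix routine for this kernel, following the 2γ² convention of Eq. (2.18) (note that libraries such as LIBSVM parameterize the exponent differently), can be sketched in NumPy; this is illustrative only, not the library's code:

```python
import numpy as np

def rbf_kernel(X1, X2, gamma):
    """RBF kernel matrix of Eq. (2.18):
    K_ij = exp(-||x_i - x_j||^2 / (2 gamma^2))."""
    # pairwise squared Euclidean distances via broadcasting
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * gamma ** 2))
```

In kernelized SVM training, every inner product xi⊤xj is simply replaced by the corresponding entry of this matrix.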
Other methods For space reasons, the previous enumeration of supervised classification methods is incomplete: Important techniques that have not been covered are
e.g. artificial neural networks in their shallow (Bishop, 1994) and deep variants (Bengio, 2009), boosting (Freund & Schapire, 1999) or Gaussian processes (Rasmussen &
Williams, 2006). In general, it depends on the particular data which classifier has the best accuracy, and there are few theoretical results which could predict the superiority of a certain classifier under realistic conditions (a limited amount of training data, an unknown true distribution on X × Y). However, comparative empirical evaluations have shown that randomized tree classifiers such as random forests or boosted
decision trees typically have the highest overall accuracy over a range of real-world
datasets of moderate (Caruana & Niculescu-Mizil, 2006) and high dimension (Caruana et al., 2008). Besides the classical supervised learning setting that has been
discussed in this section, there has been recent research on how classifier accuracy
may be improved by replacing some of the inherent assumptions in the supervised
learning settings by more realistic alternatives. Three important examples for such
assumptions are:
• That the training examples are sampled independently and identically distributed (i.i.d.) from p(x, y). Accounting for statistical dependencies between
different training examples leads to structured output learning (Bakir et al.,
2007).
• That every training feature vector xi comes with a label yi . In practice, labeling
is often costly, so that there may also be a huge pool of feature vectors xi for
which no label is available. Semi-supervised learning explores how to make use
of the information contained in the unlabeled xi (Chapelle et al., 2006).
• That the training procedure has no control over the selection of training data.
In the active learning setting, a training procedure tries to identify feature
vector candidates xi whose labels would be particularly informative for the
classification, and actively requests labels only for these examples (Settles,
2010).
2.3. Related work
There are two alternative software products which employ pattern recognition
methods for the analysis of MRSI spectra: HealthAgents by González-Vélez et al.
(2009) and SpectraClassifier by Ortega-Martorell et al. (2010). What sets this software apart from these two systems is the capability to statistically compare various
different classifiers and to select the best one. SpectraClassifier provides statistical
analysis functionalities for the trained classifiers, but linear discriminant analysis
is the only available classification method. On the other hand, HealthAgents supports different classification algorithms but does not provide statistical evaluation
functionality.
Extensibility was an important design criterion for the library: by providing abstract
interfaces for classifiers, data preprocessing procedures and evaluation statistics, users
may plug in their own classes with moderate effort. In this, it follows ideas similar to general-purpose classification frameworks such as Weka,⁶ TunedIT⁷ or RapidMiner⁸.
However, it is much more focused in scope and tailored towards medical diagnostic
applications. Furthermore, a similar plug-in concept for the analysis of MRSI data
was used by Neuter et al. (2007), but with a focus on quantification techniques as
opposed to pattern recognition techniques, and also lacking statistical evaluation
functionalities.
2.4. Software architecture
2.4.1. Overview and design principles
The software is designed for the following use case: the users label several data
volumes with respect to voxel class (tumor vs. healthy) and signal quality and
save the results (Fig. 2.2). They specify several classifiers to be compared, the free
classifier-specific parameters to be adjusted in parameter optimization (see Fig. 2.3)
and preprocessing steps for the data. A training and test suite is then defined,
which may contain the voxel class classification task, the signal quality classification
task, or both. The users may partition all data volumes explicitly into a separate training and testing set; otherwise a cross-validation scheme is employed: the data are partitioned into several folds, and the classifiers are iteratively trained on all but one fold and tested on the remaining fold. The latter option is advisable if only
few data are available; it has the additional advantage that means and variances for
the classifier results may be estimated.
Every classifier is assigned to a preprocessing pipeline, which transforms the observed
spectra into training and test features. Some elements of this pipeline may be shared
across several classifiers, while others are specific to one classifier. Input data (spectra and labels) are passed, preprocessed and partitioned into cross-validation folds if no explicit test data are provided. The parameters of every classifier are optimized either on the designated training data or on the first fold by minimizing an estimate of the generalization error. The classifiers are then trained with the final parameter
⁶ http://www.cs.waikato.ac.nz/ml/weka/
⁷ http://tunedit.org/
⁸ http://www.rapid-i.com
Figure 2.2. – User interface for the labeling functionality of the MRSI data, showing an
exemplary dataset acquired at a 3 Tesla Siemens Trio scanner. This graphical interface was
implemented by Bernd Merkel and Markus Harz, Fraunhofer MeVis Institute for Medical
Image Computing. Top left: Corresponding morphological dataset in sagittal view (T2-weighted turbo spin-echo sequence in this case). Users can place a marker (blue) to select a
voxel of interest. Middle left: Magnitude spectrum of the selected voxel, which is typical for
a cerebral tumor. Top right: Selected voxel (framed in red) together with the axial slice in
which it is contained. The user-defined labels are overlaid over a synopsis of all spectra in the slice. The label shape encodes the signal quality (dot / asterisk / cross for “not evaluable”
/ “poor” / “good”), while the label color encodes the voxel class (green / yellow / red for
“healthy” / “undecided” / “tumor”). The labels may also be annotated by free-text strings.
Bottom panel: User interface with controls for label definition, text annotation and data
import / export.
values, and performance statistics are computed by comparing the prediction results
on the current test data with the actual test labels. Statistical tests are conducted
to decide whether the classifiers differ significantly in performance. Typically not
only two, but multiple classifiers are compared against each other, which must be
considered when judging significance. Finally the classifiers are retrained on the total
data for predicting the class of unlabeled examples. The user may perform quality
control in order to assess if the performance statistics are sufficient for employment
in the clinic (Fig. 2.4). The trained classifiers may then be loaded and applied to
new datasets, for which no manual labels are available (Fig. 2.5).
The main design criteria were extensibility, maintainability and exception safety. Extensibility was achieved by providing abstract base classes for classifiers, preprocessing procedures and evaluation statistics, so that it is easily possible to add e.g. new
classification methods by deriving from the appropriate class. For maintainability,
dedicated manager objects handle the data flow between the different modules of the
software and maintain the mutual consistency of their internal states upon changes
made by the user. Strong exception safety guarantees are necessitated by the quality requirements for medical software; they were achieved by creating additional resource management classes following the Resource Acquisition Is Initialization (RAII) idiom
(Stroustrup, 2001).
2.4.2. The classification functionality
The design of the classification functionality of this library follows the main aim of
separating classifier-specific functionality (which must be provided by the user when introducing a new classifier) from common functionality that is used by all classifiers and does not need to be changed: the classes derived from the abstract Classifier base class are responsible for the former, while the ClassifierManager class is responsible for the latter. Simple extensibility and avoiding code
repetition were therefore the two main design principles.
Each classification task, e.g. classification with respect to signal quality or with respect to voxel class, is handled by one ClassifierManager object (see Fig. 2.6). It
controls all classifiers which are trained and benchmarked for this task, and ensures
that operations such as training, testing, and the averaging of performance statistics
over cross-validation folds as well as saving and loading are performed for each classifier. It also partitions the training features and labels into several cross-validation
folds, if the users do not define a designated test dataset.
Figure 2.3. – Part of the user interface for classifier training and testing. In this panel, the search grids for automated parameter tuning of the different classifiers may be defined (default values, starting values, increment step sizes and numbers of steps).

A Classifier object encapsulates an algorithm for mapping feature vectors to discrete labels after training. Alternatively, the output can also be a continuous score that quantifies the confidence that a spectrum corresponds to a tumor. Bindings were implemented for several linear and nonlinear classifiers, which
previously had been found to be well-suited for the classification of MRSI spectra
(Menze et al., 2006): support vector machines (SVMs) with a linear and a radial
basis function (RBF) kernel, random forests (RF), ridge regression (RR) and principal components regression (PCR); see (Hastie et al., 2009) for a description of these
methods. The actual classification algorithms are provided by external libraries such
as LIBSVM (Chang & Lin, 2001) and VIGRA (Köthe, 2000).
Both binary classification (with two labels) and multi-class classification (with more than two labels) are supported. Some classifiers (e.g. random forests) natively
support multi-class classification, while for other classifiers (e.g. ridge regression
and principal components regression),9 it can be achieved via a one-vs.-all encoding
scheme,¹⁰ in which each class is classified against all other classes in turn, and the class with the largest score is selected for the prediction (Rifkin & Klautau, 2004). This multi-class functionality allows the future extension of the library to the task of discriminating different tumor types against each other.

⁹ To be precise, these two classifiers are actually regression methods and can be used for binary classification by assigning the labels +1 and −1 to all positive and negative class examples and training a regressor. The transformLabelsToBinary() function maps the original labels to these two numbers.
¹⁰ The virtual isOnlyBinary() function allows one to specify the affiliation of a classifier to these two categories.

Figure 2.4. – Evaluation results for an exemplary training and testing suite. The upper two windows on the right-hand side show the estimated area-under-curve value for a linear support vector machine classifier and its estimated standard deviation (0.554 ± 0.036), while the lower two windows show the same values for a ridge regression classifier (0.809 ± 0.048). This would allow a clinical user to draw the conclusion that only the latter of these classifiers differs significantly from random guessing and may sensibly be used for diagnostics. The poor quality of these classifiers is due to the fact that only a very small training set (2 patients) was used for the purpose of illustrating the user interface design.

Furthermore, every classifier encapsulates an instance of the ClassifierParameterManager class, which controls the parameter combinations that are tested during parameter optimization. Most classifiers have one or more internal parameters that ought to be optimized for each dataset in order to achieve optimal predictive performance
Figure 2.5. – Exemplary application of a trained classifier for the computer-assisted diagnosis of a new dataset. The classifier predictions for both voxel class and signal quality are depicted for a user-defined region of interest: the voxel class is encoded by the color (green for “healthy”, yellow for “undecided”, red for “tumor”), while the signal quality is encoded by the transparency (opaque for a good signal, invisible for a spectrum which is not evaluable). As an alternative to the classifier predictions, it is possible to display precomputed color maps as well as color maps based on the parametric quantification of relevant metabolites.
(see sec. 2.4.4). This is done by optimizing an estimate of the generalization performance (i.e. the performance of the classifier on new test data that were not encountered during the training process) over a prescribed search grid, using the data from one of
the cross-validation folds (or the whole training data, if no cross-validation is used).
This generalization error could be estimated by dividing the training data into another training and test fold, training the classifier on the training part of the training data and testing it on the testing part of the training data.¹¹ However, this would be time-consuming. Fortunately, there exists considerable theoretical as well as empirical evidence (Golub et al., 1979; Breiman, 1996) that efficiently computable approximations of the generalization error are sufficient for parameter adjustment: these
are provided by the function estimatePerformanceCvFold(). For SVMs, this is
an internal cross-validation estimate as described in (Lin et al., 2007), for random
forests, the out-of-bag error, and for regression-based classifiers the generalized cross-validation (Hastie et al., 2009). The optimal parameters are selected by the function
optimizeParametersCvFold() based on the data from one specific cross-validation
fold.
This part of the library may be easily extended by adding new classifiers, as long
as they fit into the supervised classification setting (i.e. based on labeled training
vectors, a function for mapping these vectors to the discrete labels is learnt). Artificial
neural networks, boosted ensemble classifiers or Gaussian process classification are
examples for alternative classification algorithms that could be added in this way.
For this, one only needs to derive from the Classifier abstract base class and to
provide implementations for its abstract methods (including the definition of the
Preprocessor subclass with which this classifier type is associated). For parameter
tuning, one also has to supply an estimate of the classifier accuracy: This may
always be computed via cross-validation, but preferably this estimate should arise
as a by-product of the training or be fast to compute (as e.g. the out-of-bag error for the random forest or the generalized cross-validation). Furthermore
one has to assume the existence of a continuous classification score, which ideally
can be interpreted as a tumor probability. However, for classifiers without such a
probabilistic interpretation it is sufficient to reuse the 0/1 label values as scores: as
long as higher scores correspond to a higher likelihood for the positive (tumor) class,
they can take any values. Only the single-voxel spectra are used for classification,
hence the architecture does not allow classifiers that make explicit use of spatial
context information (so-called probabilistic graphical models).
¹¹ Note that the actual test data must not be used during parameter tuning.
Figure 2.6. – Simplified UML diagram of the classification functionality of the software library: detailed explanations can be found in section 2.4.2. The connections
to the classes TrainTestSuite (see Fig. 2.10), Preprocessor / PreprocessorManager
(Fig. 2.7), ClassifierParameterManager (Fig. 2.8) and SingleClassifierStats /
AllPairClassifierStats (Fig. 2.9) are shown. In this diagram, as in the following ones,
abstract methods are printed in italics: to save space, the implementations of these abstract
methods are not shown if they are provided in the leaves of the inheritance tree. The depiction
here is simplified: actually the non-virtual interface principle is followed, so that protected
visibility is given to all abstract methods, which are then encapsulated by non-virtual public
methods.
2.4.3. The preprocessing functionality
Preprocessing (Fig. 2.7) is the extraction of a feature vector from the raw MRSI
spectra with the aim of improved classification performance. While classification
makes use of both the label and the feature information (supervised process), preprocessing only uses the feature information (unsupervised process). Preprocessor objects may act both on the total data (transformTotal()) and on the data of a single cross-validation fold (transformCvFold()): the distinction may be relevant
since some preprocessing steps (e.g. singular value decomposition) depend on the
actual training data used.
The main goal governing the design of the preprocessing functionality was training
speed: data preprocessing steps which are common to multiple classifiers should
only be performed once. Hence the different preprocessing steps are packaged into
modules (deriving from the Preprocessor abstract base class) and arranged into
cascades. A common PreprocessorManager ensures that every preprocessing step
is only performed once. Hiding the preprocessing functionality from the library users
was an additional criterion: Every subclass of Classifier is statically associated
with a specific Preprocessor subclass and is responsible for registering this subclass
with the PreprocessorManager and passing the data to be preprocessed.
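The caching behavior that motivates this design can be sketched as follows (a Python toy model of the idea; the actual library classes are written in C++ and their interfaces differ):

```python
import numpy as np

class CachingPreprocessor:
    """Toy model of a preprocessor-tree node: each node transforms the output
    of its predecessor and caches the result, so that steps shared by several
    classifiers are executed only once; invalidation propagates to successors."""
    def __init__(self, transform, predecessor=None):
        self.transform = transform
        self.predecessor = predecessor
        self.successors = []
        self._cache = None
        self.runs = 0                        # how often the transform actually ran
        if predecessor is not None:
            predecessor.successors.append(self)

    def output(self, data):
        if self._cache is None:
            inp = data if self.predecessor is None else self.predecessor.output(data)
            self._cache = self.transform(inp)
            self.runs += 1
        return self._cache

    def invalidate(self):
        """Clear this node's cache and, recursively, all its successors'."""
        self._cache = None
        for successor in self.successors:
            successor.invalidate()
```

Two classifiers attached to different children of a shared root node then trigger the root transform only once, which is exactly the saving the preprocessor tree is designed for.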
First, since only the metabolite signals carry diagnostically relevant information, the
nuisance signal caused by water molecules has to be suppressed, using e.g. a Hankel
singular value decomposition filter (Pijnappel et al., 1992). Then the spectra are
transformed from the time domain into the Fourier domain by means of the FFTW
library (Frigo & Johnson, 2005), and the magnitude spectrum is computed. The
subsequent steps may be adjusted by the user, and typically depend on the classifier:
Common MRSI preprocessing steps used by all classifiers are the rebinning of spectral
vectors via a B-spline interpolation scheme, the extraction of diagnostically relevant
parts of the spectrum and L1 normalization (i.e. the spectral vector is normalized
such that the sum of all component magnitudes in a prescribed interval equals one):
these are performed by the class MrsiPreprocessor.12 Other preprocessing steps
are only relevant for some of the classifiers, e.g. the RegressionPreprocessor performs a singular value decomposition of the data which speeds up subsequent ridge
regression or PCR. SVMs perform better when the features have zero mean and unit
variance: this can be achieved by the WhiteningPreprocessor.
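The classifier-independent part of this chain can be summarized in a short NumPy sketch (illustrative only: water suppression and the B-spline rebinning are omitted, and the band indices in the usage example below are made up):

```python
import numpy as np

def preprocess_spectrum(fid, band):
    """Turn a time-domain MRSI signal into a feature vector: Fourier
    transform, magnitude spectrum, extraction of the diagnostically
    relevant band, and L1 normalization of the extracted components."""
    spectrum = np.abs(np.fft.fft(fid))       # magnitude spectrum
    lo, hi = band
    part = spectrum[lo:hi]                   # relevant part of the spectrum
    return part / part.sum()                 # component magnitudes sum to one
```

The resulting feature vector is nonnegative and sums to one, matching the L1 normalization described above.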
Two features of the software implementation support this modular structure: The
PreprocessorManager incorporates a class factory, which ensures that only one instance of each preprocessor class is created. This allows to share preprocessors across
12
More sophisticated steps such as the extraction of wavelet features might be added as well.
66
2.4. Software architecture
various classifiers and prevents duplicate preprocessing steps (such as e.g. performing
the singular value decomposition twice on the same data). Furthermore, preprocessors are typically arranged in a tree structure (via the predecessor and successors
references) and every classifier is assigned to one vertex of this tree, which ensures
that all preprocessing steps on the path from the root to this vertex are applied in
order (creating a pipeline of preprocessing steps). Once the data encapsulated inside
one module changes, all successors are invalidated.
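The invalidation logic described above can be sketched as follows. This is a minimal illustration only: the class SimplePreprocessor, its members and the L1Normalizer example node are invented for this sketch and are not the library's actual Preprocessor interface.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sketch (assumed names, not the library API): each node caches its output
// and invalidates all successors when its input data changes.
struct SimplePreprocessor {
    std::vector<SimplePreprocessor*> successors;
    bool valid = false;
    std::vector<double> cached;

    virtual ~SimplePreprocessor() {}
    // Recompute the output from the input data.
    virtual std::vector<double> apply(const std::vector<double>& in) = 0;

    void setInput(const std::vector<double>& in) {
        cached = apply(in);
        valid = true;
        invalidateSuccessors();        // data changed: children must rerun
    }
    void invalidateSuccessors() {
        for (SimplePreprocessor* s : successors) {
            s->valid = false;
            s->invalidateSuccessors(); // propagate down the tree
        }
    }
};

// Example node: L1 normalization (component magnitudes sum to one),
// mirroring the normalization step described in this section.
struct L1Normalizer : SimplePreprocessor {
    std::vector<double> apply(const std::vector<double>& in) override {
        double sum = 0.0;
        for (double v : in) sum += std::fabs(v);
        std::vector<double> out(in);
        if (sum > 0.0) for (double& v : out) v /= sum;
        return out;
    }
};
```

A classifier attached to a vertex of such a tree would then see exactly the pipeline of steps on the path from the root to its vertex, and a change at any node forces all downstream results to be recomputed.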
When new classifiers are added to the library, the preprocessing part may easily be extended with new preprocessor modules, as long as they fit into the unsupervised
setting (i.e. they only make use of the features, but not of the labels). Besides
implementing the abstract methods of the Preprocessor base class, the association
between the classifier and the preprocessor must be included in the classifier definition by implementing its getPreprocessorStub() method: then the classifier object
ensures that the new preprocessor is correctly registered with the preprocessor manager object. As a limitation, the new preprocessor has to be appended as a new leaf
(or a new root node) to the preprocessor tree: the intermediate results from other
preprocessing steps can only be reused if the order of these steps is not changed.
2.4.4. The parameter tuning functionality
All classifiers have adjustable parameters, which are encapsulated in the ClassifierParameter class (Fig. 2.8). The design of the parameter handling functionality was
guided by the main rationale of handling parameters of different datatypes in a uniform way. Furthermore, automated parameter adjustment over a search grid was enabled (with linear or logarithmic spacing, depending on the range of reasonable parameter values), while hiding the details of the search mechanism from the class users.
Some parameters should be optimized for the specific classification task, as described
in section 2.4.2: for the classifiers supplied by us, these are the slack penalty C
for SVMs, the kernel width γ for SVMs with an RBF kernel, the random subspace
dimension mtry for random forests, the number of principal components nPC for PCR
and the regularization parameter λ for ridge regression. They are represented as a
TypedOptimizableClassifierParameter: besides the actual value, these objects
also contain the search grid of the parameters, namely the starting and end value,
the incrementation step and whether the value should be incremented additively or
multiplicatively (encoded in the field incrInLogSpace). Multiplicative updates are
appropriate for parameters that can span a large range of reasonable values.
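The difference between additive and multiplicative increments can be made concrete with a small stand-alone sketch. The function expandGrid below is invented for illustration and is not part of the library; it only mirrors the role of the start/end/step fields and the incrInLogSpace flag.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sketch (assumed name, not the library API): expand a parameter search grid
// from start, end and step, either additively or multiplicatively.
std::vector<double> expandGrid(double start, double end, double step,
                               bool incrInLogSpace) {
    std::vector<double> grid;
    // small tolerance so that the end value itself is still included
    for (double v = start; v <= end * (1.0 + 1e-12) + 1e-12;
         v = incrInLogSpace ? v * step : v + step) {
        grid.push_back(v);
    }
    return grid;
}
```

With multiplicative updates, expandGrid(1e-2, 1e3, 10.0, true) yields the six values 10^-2, ..., 10^3 used for the SVM slack penalty C in table 2.1, while expandGrid(4.0, 16.0, 2.0, false) yields the additive mtry grid 4, 6, ..., 16.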
There are also parameters which may not be optimized: these are encapsulated as a
TypedClassifierParameter, which only contains the actual value. A good example would be the number of trees of a random forest classifier, since the generalization error typically saturates as more trees are added.
Figure 2.7. – Simplified UML diagram of the preprocessing functionality; see section 2.4.3 for details. The connections to the classes Classifier and ClassifierManager (Fig. 2.6) are shown.
Figure 2.8. – Simplified UML diagram of the parameter tuning functionality; see section 2.4.4 for details. The connection to the class Classifier (Fig. 2.6) is shown.
While all currently used parameters are either integers or floating-point numbers,
one can define parameters of arbitrary type: however, one has to define how this
data type can be written to or retrieved from a file or another I/O medium by implementing the corresponding I/O callbacks (see section 2.4.6 for detailed explanation).
For optimizable parameters, it must also be defined what it means to increase the
parameter by a fixed value (by overloading the operator++() member function).
As a limitation, all parameters are assumed to vary completely independently and
cannot encode constraints coupling the values of multiple parameters.
One should note that the parameter optimization process followed by this library is
exactly the way a human expert would do it: in the absence of universal theoretical
criteria about the choice of good parameters, they have to be tuned empirically
so that a low generalization error is achieved.13 However, this is the most time-consuming part of adapting a classifier to a new experiment, which is now completely
automated by the software.
2.4.5. The statistics functionality
The computation of evaluation statistics is crucial for the automated quality control
of trained classifiers (Fig. 2.9). This part of the library was designed with the following aims in mind: Needless recomputation of intermediate values should be avoided;
thus the binary confusion matrix is computed only once and then cached within a
StatsDataManager object, which can be queried for computing the different statistics
derived from it (e.g. Precision and Recall). The library can be simply extended by
new statistics characterizing a single classifier. Dedicated manager classes (such as
SingleFoldStats, SingleClassifierStats as well as PairClassifierStats and
AllPairsClassifierStats) are each responsible for a well-defined statistical evaluation task: namely, characterizing a classifier for a single cross-validation fold, characterizing a classifier over all folds, characterizing a single pair of classifiers and
characterizing all existing pairs of classifiers. They ensure that this computation is
performed in a consistent way for all classifiers, so that code redundancy is avoided.
The class SingleClassifierStats manages all statistics pertaining to one single
classifier: it is composed of objects of type SingleFoldStats, which in turn manage
all statistics either of a single cross-validation fold (cvData), or the mean and standard deviation values computed over all folds (meanData). A StatsDataManager is
a helper class which caches several intermediate results required for the computation
of the different Statistics.
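The caching idea can be illustrated with a toy example. The class BinaryConfusionCache below is invented for this sketch and its interface is an assumption; the actual StatsDataManager differs, but the principle is the same: the binary confusion matrix is computed in a single pass, and every derived statistic only queries the cached counts.

```cpp
#include <cassert>
#include <vector>

// Sketch (assumed class, not the library API): cache the binary confusion
// matrix once and derive Precision, Recall etc. from the cached counts.
class BinaryConfusionCache {
    unsigned tp_ = 0, fp_ = 0, tn_ = 0, fn_ = 0;
public:
    BinaryConfusionCache(const std::vector<int>& trueLabels,
                         const std::vector<int>& predLabels) {
        // single pass over the labels; later queries reuse these counts
        for (std::size_t i = 0; i < trueLabels.size(); ++i) {
            if (predLabels[i] == 1) (trueLabels[i] == 1 ? tp_ : fp_)++;
            else                    (trueLabels[i] == 1 ? fn_ : tn_)++;
        }
    }
    double precision()   const { return double(tp_) / (tp_ + fp_); }
    double recall()      const { return double(tp_) / (tp_ + fn_); }
    double specificity() const { return double(tn_) / (tn_ + fp_); }
    double ccr() const {
        return double(tp_ + tn_) / (tp_ + tn_ + fp_ + fn_);
    }
};
```

Adding a new statistic then amounts to adding one more accessor that reads the cached counts, without touching the counting pass.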
There are different variants of how these statistics may be computed in a multi-class
classification setting: some of them (e.g. the MisclassificationRate) can handle
multiple classes natively; these statistics form the derived class AllVsAllStat. Other
statistics (e.g. Precision, Recall or FScore) were originally designed for a binary
classification setting. For the latter kind, one must report multiple values, namely
one for each class when discriminated against all others (one-vs.-all encoding), and
they inherit from the OneVsAllStat class. The AreaUnderCurve (AUC) value of the receiver operating characteristic (ROC) curve (Fawcett, 2006) is a special case: while it is also computed in a one-vs.-all fashion, the underlying ROC curves are stored as well. Standard deviation estimates are mostly available only for the meanData averaged over several cross-validation folds, with the exception of the AUC values, for which nonparametric bootstrap estimates can be easily computed (Bandos et al., 2007).
13 If sufficient data were available, it would be preferable to perform this parameter tuning on a separate tuning dataset that is not used in the training and testing of the classifier. Since typically clinics only have access to few validated MRSI data, this approach may not be practicable, and the cross-validation scheme used in this library is the best alternative to deal with scarce data.
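The one-vs.-all reduction underlying these statistics can be sketched in a few lines. The helper oneVsAll is illustrative only and not part of the library.

```cpp
#include <cassert>
#include <vector>

// Sketch (assumed helper, not the library API): binarize multi-class labels
// so that a binary statistic such as Precision or Recall can be computed once
// per class, with the chosen class as "positive" and all others as "negative".
std::vector<int> oneVsAll(const std::vector<int>& labels, int positiveClass) {
    std::vector<int> out(labels.size());
    for (std::size_t i = 0; i < labels.size(); ++i) {
        out[i] = (labels[i] == positiveClass) ? 1 : 0;
    }
    return out;
}
```

For a three-class problem, one obtains three binary label vectors and hence three values of each binary statistic, one per class, which is exactly the multiplicity reported by the OneVsAllStat statistics.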
Besides the statistical characterization of single classifiers, it is also relevant to compare pairs of classifiers in order to assess which one of them is best for the current task,
and whether the differences are statistically significant. The AllPairsClassifierStats class manages the statistics characterizing the differences in misclassification rate between all pairs of classifiers, each of which is represented by a single
PairClassifierStats instance. p-values are computed by statistical hypothesis
tests with the null hypothesis that there is no difference between classifier performances. Implementations are provided for two tests: McNemar’s test (Dietterich,
1998) is used when the data are provided as a separate training and test set, while
a recently proposed conservative t-test variant (Grandvalet & Bengio, 2006) is used
if the users provide only a training dataset, which is then internally partitioned into
cross-validation folds. The latter test assumes that there is an upper bound on the
correlation of misclassification rates across different cross-validation folds, which is
stored in the variable maxCorrelationGrandvalet.14
If there are more than two classifiers, the p-values must be adjusted for the effect of
multiple comparisons: In the case of five classifiers with equal performance, there are
ten pairwise comparisons, and a significant difference (p_raw < 0.001) is expected to occur with a probability of 1 − 0.999^10 ≈ 0.01. After computing all “raw” p-values,
they are corrected using Holm’s step-down or Hochberg’s step-up method (Demšar,
2006), and all results are stored as PValue structures.
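Holm's step-down adjustment as described by Demšar (2006) can be sketched as follows. The function holmAdjust is our name for illustration; the library stores the results as PValue structures.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Sketch of Holm's step-down method: sort the m raw p-values ascending,
// multiply the i-th smallest by (m - i), clip at 1, and enforce monotonicity.
std::vector<double> holmAdjust(std::vector<double> p) {
    const std::size_t m = p.size();
    // remember original positions, then sort indices by ascending p-value
    std::vector<std::size_t> order(m);
    for (std::size_t i = 0; i < m; ++i) order[i] = i;
    std::sort(order.begin(), order.end(),
              [&p](std::size_t a, std::size_t b){ return p[a] < p[b]; });
    std::vector<double> adj(m);
    double running = 0.0;  // adjusted p-values must be non-decreasing
    for (std::size_t i = 0; i < m; ++i) {
        double candidate = double(m - i) * p[order[i]];
        running = std::max(running, std::min(1.0, candidate));
        adj[order[i]] = running;
    }
    return adj;
}
```

For the raw p-values {0.01, 0.04, 0.03}, the smallest is multiplied by 3, the next by 2, the largest by 1, and the monotonicity constraint raises the largest to the preceding adjusted value.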
If there is need to extend the statistics functionality, it is simple to add any statistic
characterizing a single classifier that can be computed from the true labels and the
predicted labels and scores, as these values may be queried from the StatsDataManager object. This comprises all statistics which are commonly used for judging the quality of general classification algorithms. As a limitation, the evaluation
statistics cannot use any information about the spatial distribution of the labels:
hence it is impossible to compute e.g. the Hausdorff distance between the true
and the predicted tumor segmentation. Among the statistical significance tests (like
McNemarPairClassifierStat), one can add any technique that only requires the
mean values of the statistic to be compared from each cross-validation fold. The current design is not prepared for new methods of multi-comparison adjustment beyond Holm's or Hochberg's method: for every method acting only on p-values and computing an adjusted p-value, this would be possible, but would require moderate redesign of this part of the library. Also, the assumption is hardwired that the mean and variance of these evaluation statistics shall be estimated using a cross-validation scheme. The number of cross-validation folds can be specified at the ClassifierManager level: it is theoretically possible to run a leave-one-out validation scheme with this machinery, but that would lead to prohibitive computation times.
14 Note that a classical t-test may not be used, since the variance of misclassification rates is estimated from cross-validation and hence systematically underestimated. Bengio & Grandvalet (2004) showed that unbiased estimation of the variances is not possible; but the procedure used here provides an upper bound on the p-value if the assumptions are fulfilled.
2.4.6. The input / output functionality
The input / output functionality was designed in order to keep it separated from the
modules responsible for the internal computations: hence function objects are passed
to the classifier, preprocessor, etc. objects, which can then be invoked to serialize all types of data encapsulated by these objects. Similar function objects are
used for streaming relevant information outside and listening for user signals at check
points.
For persistence, classifiers, preprocessors, statistics and all other classes with intrinsic
state can be saved and reloaded in a hierarchical data format, and the data input/output can be customized by passing user-defined input and output function objects
derived from the base classes LoadFunctor and SaveFunctor (see Fig. 2.10). For
these function objects, the user must define how to enter and leave a new hierarchy
level (initGroup() and exitGroup()) and how to serialize each supported data type
(save() and load()): for the latter purpose, the function objects must implement
all required instantiations of the LoadFunctorInterface or SaveFunctorInterface
interface template. Exemplary support is provided for HDF515 as the main storage
format (XML would be an obvious alternative). For integration into a user interface,
other function objects may be passed that can either report progress information, e.g.
for updating a progress bar (StreamProgressFunctor), or report status information
(StreamStatusFunctor) or listen for abort requests (AbortCheckFunctor) at regular check points. A ProgressStatusAbortFunctor bundles these three different
functions. The TrainTestSuite manages the actions of the library at the highest
level: the library users mainly interact with this class by adding classifier manager
objects, passing data and retrieving evaluation results.
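The hierarchical key-value idea behind the save functors can be illustrated with a toy in-memory implementation. The class ToySaveFunctor is an assumption made for this sketch, not the library's HDF5 functor: it only shows how initGroup()/exitGroup() define a path prefix under which save() stores values.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Sketch (assumed class, not the library API): hierarchy levels become path
// prefixes in a flat key-value store, mimicking HDF5 groups and datasets.
class ToySaveFunctor {
    std::vector<std::string> groupStack_;
    std::map<std::string, double> store_;
    std::string path(const std::string& key) const {
        std::string p;
        for (const std::string& g : groupStack_) p += g + "/";
        return p + key;
    }
public:
    // enter a new hierarchy level
    void initGroup(const std::string& name) { groupStack_.push_back(name); }
    // leave the current hierarchy level
    void exitGroup() { groupStack_.pop_back(); }
    // serialize one value under the current group path
    void save(const std::string& key, double value) { store_[path(key)] = value; }
    double get(const std::string& fullKey) const { return store_.at(fullKey); }
};
```

A real functor would forward these calls to H5Gcreate/H5Dwrite-style operations (or to an XML writer), which is why any backend supporting named groups and string-keyed values can be plugged in.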
The I/O functionality can simply be extended to other input and output streams,
as long as the data can be stored in a key-value form with string keys, and as long
as a hierarchical structure with group denoted by a name string can be imposed.
Instead of only listening for abort signals, the AbortCheckFunctor could in principle handle more general user requests: but aborting a time-consuming training process is presumably the main requirement for user interaction capabilities.
Figure 2.9. – Simplified UML diagram of the statistical evaluation functionality; see section 2.4.5 for details. The connections to the classes Classifier and ClassifierManager (Fig. 2.6) are shown.
Figure 2.10. – Simplified UML diagram of the data input / output functionality; see section 2.4.6 for details. The connection to the class ClassifierManager (Fig. 2.6) is shown.
2.4.7. User interaction and graphical user interface
In order to further aid the clinical users in spectrum annotation, a graphical user
interface was developed in MeVisLab that displays MRSI spectra from a selected slice in the context of their neighboring spectra; these can then be labeled on an ordinal scale by voxel class and signal quality and imported into the classification library
(Fig. 2.2). Since clinical end users only interact with this user interface, they can start
a training and testing experiment and evaluate the results without expert knowledge
on pattern recognition techniques: they only have to provide their domain knowledge
about the clinical interpretation of MRSI data. For this purpose, a graphical user
interface displays the MRSI spectra of the different voxels both in their spatial context
(upper right of Fig. 2.2) and as enlarged single spectra (middle left of this figure).
It is known that the ability to view MRSI spectra in their surroundings and to
incorporate the information from the neighboring voxels is one of the main reasons
why human experts still perform better at classifying these spectra than automated
methods (Zechmann et al., 2011). Simultaneously one can display a morphological
MR image that is registered to the MRSI grid, which can give additional valuable
information for the labeling process of the raters. Labels are provided on two axes
(signal quality and voxel class / malignancy) that are encoded by marker shape and
color; furthermore it is possible to add free-text annotations to interesting spectra.
After saving the label information in a human-readable text format, clinical users
only have to specify which label files (and associated files with MRSI data) shall be used for training and testing. (As stated in section 2.4.6, it is not required to specify dedicated testing files; in this case, all data are used in turn for both training and testing via a cross-validation scheme.) An expert mode provides the opportunity to select which classifiers to train and test and to set the classifier parameters
manually (Fig. 2.3). Default values are also proposed for these parameters, which gave the best or close to the best accuracy on different prostate datasets acquired at 1.5 Tesla (table 2.1): these values can at least serve as plausible starting values for the parameter fine-tuning on new classification tasks. Alternatively, a search grid of parameter values may be specified, so that the best value is detected automatically: this can improve the classifier accuracy in some cases, while still requiring little understanding of the detailed effects of the different parameters on the part of the users.
Besides the weights of the trained classifiers, the training and testing procedures also generate test statistics that are estimated from the cross-validation schemes
and saved in the HDF5 file format. By inspecting these files, one can get a detailed
overview of the accuracy and reliability of the different classifiers and check whether they yield significantly different results (Fig. 2.4).
Finally, the trained classifiers can be applied to predict the labels of new MRSI
spectra for which no manual labels are available. For a user-selected region of interest,
this information can be displayed in the CLARET software as an easily interpretable nosologic map overlaid on the morphological MR image (Fig. 2.5). The voxel class
is encoded in the color (green for healthy tissue, red for tumor, yellow for undecided
cases), while the signal quality is encoded in the alpha channel (for poor spectra the
nosologic map is transparent, whereas for very good spectra it is nearly opaque).
2.5. Case studies
2.5.1. Exemplary application to 1.5 Tesla data of the prostate
The library was validated on 1.5 Tesla MRSI data of prostate carcinomas. Two
different datasets were used for the training of signal quality and of voxel class classifiers: Dataset 1 (DS1) consisted of 36864 training spectra and 45312 test spectra, for
which only signal quality labels were available; see (Menze et al., 2008) for further
details. For joint signal quality and voxel class classification, 19456 training spectra
from 24 patients with both signal quality and voxel class labels were provided; see
(Kelm et al., 2007) for further details. During preprocessing, 101 magnitude channels were extracted as features for dataset 1, and 41 magnitude channels for dataset
2. No preprocessing steps besides rebinning and selection of the appropriate part of
the spectrum were used. For training the voxel class classifier on dataset 2, only the
2746 spectra with “good” signal quality were used. Since relatively few spectra were
available for dataset 2, an eight-fold cross-validation scheme was used on it rather
than partitioning it into a separate training and test set.
Parameter (classifier)                   | Search grid values       | Final values for DS1 (SQ) / DS2 (SQ) / DS2 (VC)
Slack penalty C (SVM)                    | 10^-2, 10^-1, ..., 10^3  | 10^1 / 10^2 / 10^2
Number of features per node mtry (RF)    | 4, 6, ..., 16            | 16 / 14 / 16
L2 norm penalty λ (RR)                   | 10^-3, 10^-2, ..., 10^2  | 10^-1 / 10^-1 / 10^-2
Number of principal components nPC (PCR) | 10, 15, ..., 40          | 40 / 35 / 25

Table 2.1. – Search grid for automated classifier parameter selection and final values for signal quality (SQ) classification based on dataset 1 (DS1) and signal quality and voxel class (VC) classification based on dataset 2 (DS2).
As classifiers, support vector machines with linear kernel, random forests, principal
component regression and ridge regression were trained, as the training of support
vector machines with an RBF kernel was found to be too time-consuming. The
optimal free hyperparameters were selected from the proposal values in table 2.1
by the automated parameter search capabilities of the library (using ten-fold cross-validation for the SVMs with linear kernel).
With these input data, one achieves state-of-the-art classification performance: For
signal quality prediction on dataset 1, the different classifiers achieved correct classification rates (CCR) of 96.5 % – 97.3 % and area under the ROC curve values of
98.9 % – 99.3 % (see table 2.2). On dataset 2, one obtains correct classification rates
of 89.9 % – 92.2 % and area under curve values of 89.0 % – 94.6 % for the signal
quality prediction task (table 2.3), and correct classification rates of 90.9 % – 93.7 %
as well as area under curve values of 95 % – 98 % for the voxel class prediction task
(table 2.4).
The automated parameter tuning functionality is especially relevant for the use of
support vector machines, since wrong values of the parameter C may lead to a
considerably degraded accuracy. If e.g. the starting value of 0.01 for C had been used for the signal quality classification of dataset 1, the correct classification rate would have dropped to 92.5 % (which means that the number of wrongly classified spectra would have doubled). The other classifiers that are currently available in the library are more robust with respect to the values of their associated parameters.

            SVM        RF         RR         PCR
Precision   0.815      0.869      0.921      0.922
Recall      0.913      0.913      0.797      0.802
Specificity 0.972      0.982      0.991      0.991
F-score     0.861      0.891      0.855      0.857
CCR         0.965      0.973      0.968      0.968
AUC         0.989(14)  0.993(14)  0.990(14)  0.990(14)

Table 2.2. – Evaluation statistics for signal quality classifiers based on dataset 1. The standard deviation of the area under curve value (in parentheses) is estimated as proposed by Bandos et al. (2007). Note that the recall is also known as the “sensitivity”.

            SVM        RF          RR         PCR
Precision   0.73(11)   0.832(57)   0.79(12)   0.79(12)
Recall      0.57(18)   0.58(17)    0.42(17)   0.43(17)
Specificity 0.964(23)  0.9820(62)  0.980(18)  0.979(19)
F-score     0.621(14)  0.67(13)    0.53(15)   0.54(16)
CCR         0.905(37)  0.922(32)   0.899(37)  0.899(38)
AUC         0.891(54)  0.946(57)   0.890(54)  0.890(54)

Table 2.3. – Average evaluation statistics for signal quality classifiers based on dataset 2 (with standard deviations in parentheses). While the standard deviation reported for the area under curve value is estimated as by Bandos et al. (2007) to facilitate the comparison with table 2.2, the other standard deviation estimates are computed from the cross-validation.

            SVM        RF          RR          PCR
Precision   0.908(76)  0.864(27)   0.966(39)   0.900(14)
Recall      0.69(17)   0.753(16)   0.50(21)    0.50(21)
Specificity 0.983(23)  0.9771(87)  0.9966(39)  0.9928(78)
F-score     0.76(12)   0.79(11)    0.63(22)    0.63(21)
CCR         0.932(42)  0.937(42)   0.909(59)   0.909(62)
AUC         0.97(15)   0.98(15)    0.96(15)    0.95(15)

Table 2.4. – Average evaluation statistics for voxel class classifiers based on dataset 2 (see table 2.3 for further explanations).
While these absolute quality measures are highly relevant for the clinical practitioners, a research clinician may also be interested in the question of which classifier to use
for this particular task (and whether there is any difference between the different
classifiers at all). This question could be answered with the statistical hypothesis
testing capabilities of the library, since p-values from McNemar’s test (for dataset
1) and the t-test variant (for dataset 2) characterizing the differences in the correct
classification rates of various classifiers were automatically computed and corrected
for multiple comparisons (both Holm’s step-down and Hochberg’s step-up method
yielded qualitatively the same results). For the signal quality classifiers trained on
dataset 1, random forests differed with high significance from all other classifiers (p < 10^-6). Support vector machines differed significantly from principal components regression (p < 10^-3), and ridge regression showed a barely significant difference to both principal components regression and support vector machines (p < 10^-2),
while all other differences were non-significant. For dataset 2, no (even barely) significant differences could be detected by Grandvalet’s conservative t-test with an
assumed upper bound of 0.7 for the between-fold correlation (even without Holm's or Hochberg's correction); this is presumably due to the small number of data points.
2.5.2. Extending the functionality with a k nearest neighbors classifier
As an exemplary case of how the functionality of the library may be extended, this
subsection describes the addition of a new classifier method in detail, namely the
k nearest neighbors (kNN) method as one of the simplest classifiers (Hastie et al.,
2009). Every test spectrum is assigned the majority label of its k closest neighbors
among the training spectra (with respect to the Euclidean distance).16 This classifier
is represented by a NearestNeighborClassifier class derived from the abstract
Classifier base class:
class EXPORT_CLASSTRAIN
NearestNeighborClassifier : public Classifier {
private:
    // All training spectra
    vigra::Matrix<double> trainingSpectra;
    // All training labels
    vigra::Matrix<double> trainingLabels;
    // Training spectra for the different cross-validation folds
    std::vector<vigra::Matrix<double> > trainingSpectraCvFolds;
    // Training labels for the different cross-validation folds
    std::vector<vigra::Matrix<double> > trainingLabelsCvFolds;
    // Name strings associated with the kNN classifier
    static const std::string knn_name;
    static const std::string k_name;
    static const std::string cv_error_name;
    static const std::string training_spectra_name;
    static const std::string training_labels_name;
protected:
    // Can be used for native multi-class classification
    virtual bool isOnlyBinary() const {
        return false;
    }
public:
    // Stub constructor
    NearestNeighborClassifier() : Classifier(),
        trainingSpectra(), trainingLabels(),
        trainingSpectraCvFolds(), trainingLabelsCvFolds(){
    }
    // Read-only access to classifier name string
    virtual std::string getClassifierName() const {
        return knn_name;
    }
    // Read-only access to error score name string
    virtual std::string getErrorScoreName() const {
        return cv_error_name;
    }
protected:
    /* The following virtual functions are discussed separately */
    ...
};
16 For binary classification, ties can easily be avoided by restricting k to odd values. However, if the user chooses an even k, the classifier errs on the safe side and classifies the spectrum as tumorous in case of a tie.
The only adjustable parameter is the number of nearest neighbors k. By default,
the odd values 1, 3, . . . , 15 shall be considered while optimizing over this parameter:
they may also be adjusted afterwards by the library user. The last argument of the addClassifierParameter() call specifies that this parameter shall be incremented additively rather than multiplicatively.
void
NearestNeighborClassifier::
addClassifierSpecificParameters(){
    unsigned kValue = 5;
    unsigned kLower = 1;
    unsigned kUpper = 15;
    unsigned kIncr = 2;
    parameters->addClassifierParameter( k_name, kValue, kIncr,
                                        kLower, kUpper, false );
}
In this application case, the different spectral features correspond to MRSI channels and can be assumed to be commensurable: hence no preprocessing except for the
general MRSI preprocessing steps is required, and the associated preprocessor is an
instance of the IdentityPreprocessor class, which leaves the features unchanged.
In cases where one cannot assume the features to be commensurable, one should
rather associate this classifier with a preprocessor of type WhiteningPreprocessor
which brings all features to the same scale.
shared_ptr<Preprocessor>
NearestNeighborClassifier::getPreprocessorStubSpecific() const {
    shared_ptr<Preprocessor> output( new IdentityPreprocessor() );
    return output;
}
For didactic reasons, a simple, but admittedly inefficient implementation is proposed.
The training process consists simply of storing the training features and labels:
double
NearestNeighborClassifier::
estimatePerformanceCvFoldSpecific( FoldNr iF,
        const Matrix<double>& features,
        const Matrix<double>& labels ){
    double output = learnCvFoldSpecific( iF, features, labels );
    cvFoldTrained( iF, 0 ) = true;
    return output;
}

double
NearestNeighborClassifier::
learnSpecific( const Matrix<double>& features,
        const Matrix<double>& labels ){
    trainingSpectra = features;
    trainingLabels = labels;
    return estimateByInternalVal( features, labels );
}

double
NearestNeighborClassifier::
learnCvFoldSpecific( FoldNr iFold, const Matrix<double>& features,
        const Matrix<double>& labels ){
    trainingSpectraCvFolds[ iFold ] = features;
    trainingLabelsCvFolds[ iFold ] = labels;
    return estimateByInternalVal( features, labels );
}
The automated parameter optimization requires an estimate for the generalization
error, which must be obtained from one single cross-validation fold: if the data has
for example been split into a training and a testing fold, only the training fold may
be used for this estimation. Otherwise one would incur a bias for the test error
that is computed on the separate testing dataset. Unlike many other classifiers (e.g.
random forests), the kNN classifier does not automatically generate a generalization
error estimate during training: hence one must resort to an internal validation step,
in which the training data is split into an internal “training” and “testing” subset:
struct
NearestNeighborClassifier::
Comparison {
    bool operator()( const pair<double,double>& p1,
                     const pair<double,double>& p2 ){
        return p1.first < p2.first;
    }
};

double
NearestNeighborClassifier::
estimateByInternalVal( const Matrix<double>& features,
                       const Matrix<double>& labels ){
    unsigned k = parameters->getValue<unsigned>( k_name );
    // randomly group into two folds
    vector<int> folds( features.shape(0) );
    for( int i=0; i<features.shape(0); ++i ){
        folds[i] = rand() % 2;
    }
    unsigned correct = 0;
    unsigned wrong = 0;
    for( int i=0; i<features.shape(0); ++i ){
        if( folds[i]==0 ){ // 1: test spectra, 0: training spectra
            continue;
        }
        priority_queue<pair<double,double>, vector<pair<double,double> >,
                       Comparison> currBest;
        unsigned nFound = 0;
        for( int j=0; j<features.shape(0); ++j ){
            if( folds[j]==1 ){
                continue;
            }
            Matrix<double> tempVec = features.rowVector(i);
            tempVec -= features.rowVector(j);
            double newDist = tempVec.squaredNorm();
            if( nFound++ < k ){ // first k spectra automatically pushed
                currBest.push( pair<double,double>( newDist, labels(j,0) ) );
            } else {
                if( newDist < currBest.top().first ){
                    currBest.pop();
                    currBest.push( pair<double,double>( newDist, labels(j,0) ) );
                }
            }
        }
        double maxLabel = retrieveMajority( currBest );
        if( maxLabel == labels(i,0) ){
            correct++;
        } else {
            wrong++;
        }
    }
    return double(wrong)/(correct + wrong);
}
retrieveMajority() is a helper function to retrieve the most common label from the priority queue. Note that the implementation is deliberately simple for didactic reasons and has not been optimized for efficiency: in production code, one would store the training spectra in a balanced data structure like the box-decomposition trees (Arya et al., 1998) used in the ANN library (http://www.cs.umd.edu/~mount/ANN/) for faster retrieval. A similar implementation is used to predict the values of new test examples:
void
NearestNeighborClassifier::
predictLabelsAndScores(const Matrix<double>& featuresTrain,
                       const Matrix<double>& labelsTrain,
                       const Matrix<double>& featuresTest,
                       Matrix<double>& labelsTest,
                       Matrix<double>& scoresTest) const {
  unsigned k = parameters->getValue<unsigned>(k_name);
  labelsTest = Matrix<double>(featuresTest.shape(0), 1);
  scoresTest = Matrix<double>(featuresTest.shape(0), classes.size(), 0.);
  for (int i = 0; i < featuresTest.shape(0); ++i) {
    priority_queue<pair<double, double>, vector<pair<double, double> >,
                   Comparison> currBest;
    unsigned nFound = 0;
    for (int j = 0; j < featuresTrain.shape(0); ++j) {
      Matrix<double> tempVec = featuresTest.rowVector(i);
      tempVec -= featuresTrain.rowVector(j);
      double newDist = tempVec.squaredNorm();
      if (nFound++ < k) {
        currBest.push(pair<double, double>(newDist, labelsTrain(j, 0)));
      } else {
        if (newDist < currBest.top().first) {
          currBest.pop();
          currBest.push(pair<double, double>(newDist, labelsTrain(j, 0)));
        }
      }
    }
    labelsTest(i, 0) = retrieveMajority(currBest);
    while (!currBest.empty()) {
      scoresTest(i, classIndices.find(currBest.top().second)->second) += 1./k;
      currBest.pop();
    }
  }
}
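The retrieveMajority() helper called in both listings is not reproduced in this excerpt. A minimal sketch that is consistent with its usage might look as follows; note that the queue must be taken by value, since predictLabelsAndScores() still drains it afterwards. The tie-breaking behavior is an assumption:

```cpp
#include <map>
#include <queue>
#include <utility>
#include <vector>

using namespace std;

// Hypothetical reconstruction: return the most frequent label among the
// (distance, label) pairs in the priority queue. The queue is passed by
// value, so the caller's copy remains intact. Ties are broken in favor of
// the smallest label value; the actual library code may differ.
template <typename Queue>
double retrieveMajority(Queue currBest) {
  map<double, unsigned> counts;
  while (!currBest.empty()) {
    counts[currBest.top().second] += 1;
    currBest.pop();
  }
  double bestLabel = 0.;
  unsigned bestCount = 0;
  for (map<double, unsigned>::const_iterator it = counts.begin();
       it != counts.end(); ++it) {
    if (it->second > bestCount) {
      bestCount = it->second;
      bestLabel = it->first;
    }
  }
  return bestLabel;
}
```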
The predictLabelsAndScores() routine considerably simplifies the definition of the virtual prediction functions:
void
NearestNeighborClassifier::
predictBinaryScoresSpecific(const Matrix<double>& features,
                            Matrix<double>& scores) const {
  Matrix<double> labels;
  predictLabelsAndScores(trainingSpectra, trainingLabels,
                         features, labels, scores);
}
void
NearestNeighborClassifier::
predictBinaryScoresCvFoldSpecific(FoldNr iFold,
                                  const Matrix<double>& features,
                                  Matrix<double>& scores) const {
  Matrix<double> labels;
  predictLabelsAndScores(trainingSpectraCvFolds[iFold],
                         trainingLabelsCvFolds[iFold],
                         features, labels, scores);
}

void
NearestNeighborClassifier::
predictLabelsSpecific(const Matrix<double>& features,
                      Matrix<double>& labels) const {
  Matrix<double> scores;
  predictLabelsAndScores(trainingSpectra, trainingLabels,
                         features, labels, scores);
}

void
NearestNeighborClassifier::
predictLabelsCvFoldSpecific(FoldNr iFold, const Matrix<double>& features,
                            Matrix<double>& labels) const {
  Matrix<double> scores;
  predictLabelsAndScores(trainingSpectraCvFolds[iFold],
                         trainingLabelsCvFolds[iFold],
                         features, labels, scores);
}
Concerning serialization and deserialization, this classifier is only responsible for its
internal data. In contrast, the serialization of the parameter k is handled by the
associated ParameterManager object, while the evaluation statistics are serialized
by the ClassifierManager.
void
NearestNeighborClassifier::
saveSpecific(shared_ptr<SaveFunctor<string> > saver) const {
  shared_ptr<SaveFunctorInterface<string, Matrix<double> > > matSaver =
    dynamic_pointer_cast<SaveFunctorInterface<string, Matrix<double> > >(
      saver);
  CSI_VERIFY(matSaver);
  matSaver->save(training_spectra_name, trainingSpectra);
  matSaver->save(training_labels_name, trainingLabels);
  for (FoldNr iF = 0; iF < nCvFolds; ++iF) {
    ostringstream currMatName;
    currMatName << getFoldName() << iF << " " << training_spectra_name;
    matSaver->save(currMatName.str(), trainingSpectraCvFolds[iF]);
    currMatName.str(""); // reset the stream buffer
    currMatName << getFoldName() << iF << " " << training_labels_name;
    matSaver->save(currMatName.str(), trainingLabelsCvFolds[iF]);
  }
}
void
NearestNeighborClassifier::
loadSpecific(shared_ptr<LoadFunctor<string> > loader) {
  shared_ptr<LoadFunctorInterface<string, Matrix<double> > > matLoader =
    dynamic_pointer_cast<LoadFunctorInterface<string, Matrix<double> > >(
      loader);
  CSI_VERIFY(matLoader);
  matLoader->load(training_spectra_name, trainingSpectra);
  matLoader->load(training_labels_name, trainingLabels);
  trainingSpectraCvFolds.resize(nCvFolds);
  trainingLabelsCvFolds.resize(nCvFolds); // make room for the label folds, too
  for (FoldNr iF = 0; iF < nCvFolds; ++iF) {
    ostringstream currMatName;
    currMatName << getFoldName() << iF << " " << training_spectra_name;
    matLoader->load(currMatName.str(), trainingSpectraCvFolds[iF]);
    currMatName.str(""); // reset the stream buffer
    currMatName << getFoldName() << iF << " " << training_labels_name;
    matLoader->load(currMatName.str(), trainingLabelsCvFolds[iF]);
  }
}
On the signal quality task for dataset 1 (see section 2.5.1), this classifier achieves a correct classification rate of approximately 95% across all tested values of the parameter k.
Chapter 3.
Brain tumor segmentation based on
multiple unreliable annotations
3.1. Introduction and motivation
The use of machine learning methods for computer-assisted radiological diagnostics faces a common problem: in most situations, it is impossible to obtain reliable ground-truth information, e.g. for the location of a tumor in the images. Instead, one has to resort to manual segmentations by human labelers, which are necessarily imperfect for two reasons. Firstly, humans make labeling mistakes owing to insufficient knowledge or lack of time. Secondly, the medical images upon which they base their judgment may not provide sufficient contrast to discriminate between tumor and non-tumor tissue. In general, this causes both a systematic bias (tumor outlines are consistently too large or too small) and a stochastic fluctuation of the manual segmentations, both of which depend on the specific labeler and the specific imaging modality.
One can alleviate this problem by explicitly modelling the decision process of the human raters: in medical image analysis, this line of research started with the STAPLE algorithm (Warfield et al., 2004) and its extensions (Warfield et al., 2008), while in the field of general computer vision, it can be traced back at least to the work of Smyth et al. (1995). Similar models were developed in other application areas of machine learning (Raykar et al., 2009; Whitehill et al., 2009; Rogers et al., 2010): some of them also make use of image information and produce a classifier, which may then be applied to images for which no annotations are available. The effect of the different imaging modalities on the segmentation has not yet received as much attention.
In this chapter, all these competing methods as well as novel hybrid models are systematically evaluated for the task of computer-assisted tumor segmentation in radiological images: the same machinery is used on annotations provided by multiple human labelers of different quality and on annotations based on multiple imaging modalities. While inference in such models has traditionally been performed via expectation maximization (EM; Dempster et al., 1977), here the underlying inference problems are formulated as probabilistic graphical models (Koller & Friedman, 2009) and thereby rendered amenable to generic inference methods. This facilitates the inference process and makes it easier to study the effect of modifications on the final inference results.[1]
3.2. Background

3.2.1. Imaging methods for brain tumor detection

T1-, T2- and PD-weighting in MRI   For a general introduction to magnetic resonance imaging, such as the principles of signal generation and spatial encoding, see section 1.2. In the following, some additional background about weightings and tissue contrast is provided, since these concepts are crucial for the detection of brain cancers from scalar MR images (in contrast to the spectral MRS images that were considered in the previous two chapters). For references, see e.g. (Yokoo et al., 2010; Kates et al., 1996). As can be derived from Eq. (1.8), the magnitude of the echo signal in a spin-echo sequence is approximately

A ∝ ρ (1 − e^{−TR/T1}) e^{−TE/T2},    (3.1)

with ρ being the density of MR-visible protium nuclei (proton density, PD), TR being the repetition time, i.e. the time between two subsequent 90° excitation pulses,[2] and TE being the echo time, i.e. the time between excitation and signal acquisition.[3] Image contrast between different tissues arises due to different values of the three relevant tissue parameters ρ, T1 and T2. By appropriate choices of the sequence parameters TE and TR, one can weight the relative importance of these parameters: if very small values of TE are chosen (TE ≪ T2 for all relevant tissues) and TR is selected in the range of typical T1 values,[4] the contrast mainly depends on T1 and ρ (T1 weighting). If very large values of TR are chosen (TR ≫ T1 for all relevant tissues) and TE is in the range of typical T2 times,[5] the contrast mainly depends on T2 and ρ (T2 weighting). If both TE ≪ T2 and TR ≫ T1 are chosen, the contrast depends purely on ρ (PD weighting).[6] The best characterization of tissues via MR is achieved by combining the results from different series with different weightings (multimodal imaging).

[1] The contents of this chapter have been published as (Kaster et al., 2011).
[2] If a whole volume is imaged, multiple spin-echo sequences must be performed, which means that repeated excitation occurs before the longitudinal magnetization has completely relaxed to its equilibrium value. Eq. (3.1) describes the state after several previous excitations.
[3] Fast MR imaging techniques such as the FLASH sequence dispense with the refocussing 180° pulse and generate the echo signal purely by gradient pulses. For these techniques, the magnitude follows a similar formula, which however depends on the T2* instead of the T2 time.
[4] These depend on the magnetic field strength. At 1.5 T, typical values are 250 ms for fat, 600 ms for white matter (WM), 750 ms for gray matter (GM) and 4000 ms for water and water-like liquids such as cerebrospinal fluid (CSF).
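The effect of the sequence parameters in Eq. (3.1) can be checked numerically with the typical 1.5 T tissue values quoted in the footnotes; the specific TR/TE choices in the usage example below are illustrative assumptions, not taken from the thesis:

```cpp
#include <cassert>
#include <cmath>

// Spin-echo signal magnitude according to Eq. (3.1):
// A = rho * (1 - exp(-TR/T1)) * exp(-TE/T2).
// All times in milliseconds; rho in arbitrary units.
double spinEchoSignal(double rho, double T1, double T2,
                      double TR, double TE) {
  return rho * (1. - std::exp(-TR / T1)) * std::exp(-TE / T2);
}
```

With TR = 4000 ms and TE = 100 ms (T2 weighting), CSF (ρ = 1, T1 = 4000 ms, T2 = 2000 ms) comes out brighter than gray matter, which in turn is brighter than white matter, reproducing the familiar T2 contrast; with TR = 500 ms and TE = 15 ms (T1 weighting), CSF is the darkest of the three compartments.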
MR contrast agents   The presence of paramagnetic contrast agents in the vicinity of the precessing spins speeds up both T1 and T2 relaxation, by an amount which is approximately linear in the contrast agent concentration c_CA:

1/T1^(CA) = 1/T1 + r1 · c_CA,    1/T2^(CA) = 1/T2 + r2 · c_CA,    (3.2)

where r1 and r2 are the relaxivities of the contrast agent. Most important for clinical applications are gadolinium(III) chelates such as gadopentetate dimeglumine (Gd-DTPA), for which the predominant effect is on the T1 time. While the signal generation in MR imaging is due to the nuclear magnetic moments, the action of MR contrast agents is caused by the magnetic moment of their electron shell, for instance the half-filled f-shell of the Gd(III) ion. In the healthy brain, the blood-brain barrier prevents extravasation of contrast agents so that they stay in the blood pool: hence contrast enhancement in the brain tissue points to a disruption of blood-brain barrier integrity, which may be caused by immature blood vessels (as are often created by tumor angiogenesis), as well as by inflammatory or degenerative diseases of the brain.
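Eq. (3.2) translates directly into code; the relaxivity r1 ≈ 4 L/(mmol·s) used in the usage example is an assumed, literature-typical order of magnitude for Gd-DTPA at 1.5 T, not a value taken from this thesis:

```cpp
#include <cassert>
#include <cmath>

// Effective T1 in the presence of a contrast agent, from Eq. (3.2):
// 1/T1_CA = 1/T1 + r1 * cCA.
// T1 in seconds, r1 in L/(mmol s), cCA in mmol/L.
double t1WithContrastAgent(double T1, double r1, double cCA) {
  return 1. / (1. / T1 + r1 * cCA);
}
```

For example, a tissue with T1 = 1 s and a local agent concentration of 0.5 mmol/L would have its T1 shortened to about a third of its native value, which is what produces the bright enhancement in T1-weighted images.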
Inversion recovery and the FLAIR sequence   The inversion recovery (IR) sequence is an alternative to the spin-echo sequence, in which the order of the 180° and the 90° pulse is interchanged: first the longitudinal magnetization is inverted by a 180° pulse; then, after an inversion time TI, a transversal magnetization is created by a 90° pulse, and the FID signal is acquired directly after the 90° pulse. The signal magnitude is given by

A ∝ ρ |1 − 2e^{−TI/T1}|.    (3.3)

This sequence is frequently used for masking a certain compartment (e.g. fat or CSF) out of the MR image, by setting TI/log(2) equal to the T1 time of this compartment. An important modification is the fluid-attenuated inversion recovery (FLAIR) sequence, which combines inversion recovery with a spin echo (moderate TE, long TR) in order to generate a T2-weighted image in which the CSF signal is masked out: the sequence schema is 180° – TI – 90° – TE/2 – 180° – TE/2 – ACQ.

[5] Typical values are 60 ms for fat, 80 ms for WM, 90 ms for GM and 2000 ms for water or CSF. For T2, the dependency on the magnetic field strength is less pronounced than for T1.
[6] Typical values for ρ are 0.7 g/ml for WM, 0.8 g/ml for GM and 1 g/ml for water or CSF. The difference between the chemical and the MR-visible proton concentration should be noted: lipids contain many immobilized protons that cannot contribute to the MR signal.
Brain tumors   The following description of medical imaging techniques for the detection of brain tumors is common knowledge: for references see e.g. (DeAngelis et al., 2007; Debnam et al., 2007; Mikulis & Roberts, 2007). Brain tumors fall into two categories: primary brain tumors, which originate from the brain (intra-axial tumors) or its direct surroundings (extra-axial tumors), and metastases of an extracranial cancer (e.g. lung cancer, breast cancer or malignant melanoma). Primary brain tumors seldom originate from neural cells, but more typically from the meninges (meningioma) or from glia cells (e.g. astrocytoma, oligodendroglioma, glioblastoma multiforme, schwannoma). Prognostically relevant is the distinction between malignant brain tumors (which proliferate uncontrolledly, invade surrounding tissues and may metastasize) and benign tumors, which stay within a circumscribed area. However, even benign tumors may be fatal without treatment, owing to increased intracranial pressure. Due to their rapid proliferation, malignant brain tumors have a high demand for oxygen (and hence for blood perfusion): they therefore build new blood vessels (tumor angiogenesis), which often have abnormal lining cells, so that the blood-brain barrier may be disrupted inside the tumor. This is the reason why most tumors are surrounded by edema (i.e. blood plasma leaking into the intercellular space of the brain tissue). If the angiogenesis cannot keep pace with the growth of the tumor, the core of the tumor first becomes hypoxic and later necrotic: this is indicative of highly aggressive malignancies. Radiological imaging diagnostics is typically indicated when neurological symptoms are observed, such as deficits in sensation, motion or language, seizures, or impairments of alertness or cognition; metastasis screening should also be performed upon diagnosis of a primary cancer that is known to metastasize frequently to the brain.
Imaging of brain tumors   The first choice for imaging diagnostics is magnetic resonance imaging (see section 1.2); computed tomography (CT) and positron emission tomography typically have lower sensitivity and specificity and are mainly useful either as a supplement or for patients who have a contraindication against high magnetic fields (e.g. metallic implants or cardiac pacemakers). Common tumor imaging protocols comprise two T1-weighted scans (before and after injection of a contrast agent such as Gd-DTPA), a diffusion-weighted scan and either a T2-weighted or a FLAIR scan. Gadolinium enhancement is the best indicator for aggressive (high-grade) malignancies. As necrotic tissue does not take up contrast agents, tumors with a necrotic core typically display a ring-shaped enhancement pattern, while tumors without a necrotic core are uniformly enhanced. However, low-grade and benign brain tumors show no enhancement after Gd-DTPA injection. They can be detected by the second radiological tumor sign, namely abnormal relaxation times: most tumors are hypocellular (with increased T1 and T2 times) and appear as hypointensities in T1-weighted and as hyperintensities in T2-weighted (or FLAIR) images, while some tumors are hypercellular (with decreased relaxation times), for which the effects are exactly reversed. In diffusion-weighted magnetic resonance imaging (DWI), the image intensity is attenuated by a factor of e^{−bD}, where b is a constant and D is the local diffusion coefficient. This is achieved by two gradient fields of equal strength that are applied symmetrically around the 180° pulse. For resting nuclei, they do not affect the signal, as the first gradient field causes a dephasing that is exactly rephased by the second gradient field. However, protium nuclei that have moved along the gradient direction experience a different field strength during rephasing than during dephasing, leading to the attenuation. Diffusion is increased in hypocellular regions; accordingly, hypercellular tumors appear as hyperintensities and hypocellular tumors as hypointensities in diffusion-weighted images. Additional imaging techniques such as MRSI, functional MRI or perfusion-weighted imaging may further improve the differential diagnosis, but they are rarely used in clinical routine (mainly due to time constraints). The gold standard for tumor diagnosis and grading is the histopathological examination of an image-guided biopsy.
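The attenuation factor e^{−bD} can be illustrated with assumed, order-of-magnitude diffusion coefficients (free water D ≈ 3·10⁻³ mm²/s, restricted diffusion in hypercellular tissue D ≈ 0.7·10⁻³ mm²/s; these numbers are not from the thesis):

```cpp
#include <cassert>
#include <cmath>

// Diffusion-weighted signal attenuation factor exp(-b*D),
// with b in s/mm^2 and D in mm^2/s.
double dwiAttenuation(double b, double D) {
  return std::exp(-b * D);
}
```

At b = 1000 s/mm², freely diffusing water is attenuated much more strongly than hypercellular tissue with restricted diffusion, which consequently appears brighter in the diffusion-weighted image.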
3.2.2. Variational inference for graphical models

Graphical models   Probabilistic graphical models (Koller & Friedman, 2009; Wainwright & Jordan, 2008) are a tool for encoding the conditional independence relationships between random variables, and for inferring the values of unobserved (or hidden) variables H = {H_i | i = 1, …, N_H} based on the values of observed variables V = {V_i | i = 1, …, N_V}. This chapter only considers directed graphical models (also known as Bayesian networks), which directly specify the factorization properties of the joint probability density over all variables: if X = H ∪ V, a Bayesian network over the variables X is a directed graph with vertex set X such that

p(X) = ∏_{i=1}^{N_H + N_V} p(X_i | pa_i),    (3.4)

with pa_i denoting the parents of variable X_i in the graph (see Fig. 3.1 for an example). The factors p(X_i | pa_i) are called the conditional probability distributions (CPDs) of the Bayesian network.
Aims of inference Typical inference goals for such models are:
1. Computing the posterior marginals p(Hi |V ).
Figure 3.1. – Simple example of a Bayesian network. The graph nodes correspond to random variables; observed variables are denoted by a gray filling. All variables drawn inside the rectangle stand for arrays of N variables V_1, …, V_N and H_{2,1}, …, H_{2,N} (plate notation, see Buntine (1994)). The edges encode the factorization properties of the joint probability distribution. For this example, p(H, V) = p(H_1) p(H_3) p(H_4 | H_3) ∏_{n=1}^{N} p(H_{2,n} | H_1) p(V_n | H_{2,n}, H_3, H_4).
2. Computing the evidence p(V) of the observations given the current model. This may be useful for selecting a graphical model that captures the structure of the data well. A common problem in model selection is choosing the proper number of hidden variables: more hidden variables typically correspond to higher flexibility, so that the observations can be fitted more accurately, but at the same time the danger of overfitting arises. Bayesian model selection provides an elegant way to tackle this problem: consider two models M_1, M_2 with different numbers of variables. Then

p(V | M_i) = ∫ dH p(V | H) p(H | M_i)    (3.5)

results from a likelihood term p(V | H) and an "Occam's razor" term p(H | M_i). For complex models with more parameters, the observations can usually be fitted better (p(V | H) is higher for the best choice of H), but it becomes less likely that the hidden variables take this particular value out of the much larger space of possible values. Hence both overly simple and overly complex models are discouraged (Kass & Raftery, 1995).

3. Finding the maximum a posteriori (MAP) solution for the hidden variables H* = argmax_H p(H, V) = argmax_H p(H | V).

4. Computing the predictive distribution p(v̂ | V) that specifies which observations v̂ can be expected when sampling from the same graphical model with the same hidden variables.
Exact inference via junction trees   Exact inference on Bayesian networks can be performed by the junction tree algorithm: first the directed graph is transformed into an undirected graph by moralization, i.e. by converting all directed edges into undirected edges and connecting all common parents of a node.[7] Afterwards the moralized graph is chordalized, i.e. edges are introduced in order to remove all chordless cycles of length greater than three. Then a junction tree is constructed on the chordalized graph, i.e. a tree graph whose nodes correspond to the maximal cliques C_i of the chordalized graph and whose edges link cliques sharing the same variables, so that the running intersection property is respected (if a variable is present in two cliques, it must be present in all cliques on the unique path between those two cliques in the junction tree). Then every factor is assigned to some clique in this junction tree: ψ_i(C_i) denotes the product of all CPDs assigned to the clique C_i. Finally, a message-passing algorithm is run, in which messages of the following kind are sent between neighboring cliques in a specific update order:[8]

δ_{i→j}(C_i ∩ C_j) = Σ_{C_i \ C_j} ψ_i(C_i) ∏_{k∼i, k≠j} δ_{k→i}(C_k ∩ C_i)    (3.6)

After messages have been passed along every edge in both directions, the clique marginals are given by

β_i(C_i) = Σ_{X \ C_i} p(X) = ψ_i(C_i) ∏_{k∼i} δ_{k→i}.    (3.7)
Limitations of exact inference   However, the complexity of this junction tree algorithm is exponential in the size of the largest clique in the junction tree for the optimum chordalization, which is called the treewidth of the original moralized graph.[9][10]
[7] This "marrying" of unconnected parents accounts for the "explaining away" property of Bayesian networks. This is best explained by the famous burglary-earthquake example by Pearl (1988). Both a burglary and an earthquake may set off an alarm bell in a house, and we can assume that both events occur independently of each other. However, once we know that the alarm bell rang, both a burglary and an earthquake become more probable; but if we additionally know that a burglary occurred, an earthquake becomes less probable again, and vice versa. This means that the common parent variables of a child variable are not conditionally independent given the child variable, even if they are independent when the child variable is marginalized over.
[8] Eqs. (3.6) and (3.7) describe the sum-product message-passing algorithm that is used to compute posterior marginals. For MAP estimation, all summations have to be replaced by maximizations (max-product algorithm).
[9] To be exact, the treewidth is defined as the minimum, over all chordal graphs containing the original graph, of the size of the largest clique minus one.
[10] There exist graphical models for which the junction tree algorithm has a more favorable complexity: e.g. if all factors are Gaussians, for which the marginalization can be performed analytically (Gaussian processes), the complexity is cubic in the treewidth.
Since there exist different possibilities for the chordalization, determining the optimum chordalization, and hence the treewidth, for a given Bayesian network is not straightforward: in fact, it is an NP-complete problem except for specialized classes of graphs (Bodlaender, 1992). As will be shown later, the graphical models analyzed in this chapter have a treewidth linear in the number of raters and in the number of image features used for the supervised classification; hence exact inference would only be practicable if there were very few raters and if the objective image information were disregarded.
Markov Chain Monte Carlo   However, the computation time can be greatly reduced if one dispenses with exact solutions and allows approximations. Most popular approximate inference techniques fall into one of two categories: Markov Chain Monte Carlo (MCMC) techniques (Andrieu et al., 2003) and variational approximations (Wainwright & Jordan, 2008). MCMC techniques approximate the (intractable) analytical posterior p(H|V) by an empirical point mass density

p_T(H|V) = (1/T) Σ_{t=1}^{T} δ(H − H^{(t)}),    (3.8)

where the T samples H^{(t)} are drawn independently and identically distributed from the true p(H|V). This sampling process is typically achieved by variants of the MR²T² algorithm (Metropolis et al., 1953), in which one or more particles perform random steps in the state space of all possible H, which may or may not be accepted based on the changes in p(H, V): the states of the particles at the different points of their trajectory are then used as the random samples. An important special case is the Gibbs sampler (Geman & Geman, 1984), for which only one hidden variable H_i is updated in each step: namely, it is sampled from the conditional distribution p(H_i | {H_j^{(t)} : j ≠ i}, V) obtained by fixing all other hidden variables to their current values. MCMC techniques have been shown to be practically useful, though computationally expensive, and there are software products such as BUGS (Gilks et al., 1994; Lunn et al., 2000) or INFER.NET (Minka et al., 2009) that can perform generic MCMC inference on a variety of graphical models.
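As a self-contained illustration of the Gibbs sampler (not taken from the thesis software), consider two coupled binary variables with the made-up joint p(h1, h2) ∝ exp(J · [h1 = h2]): each step resamples one variable from its conditional given the other, and by symmetry the exact marginal p(h1 = 1) = 0.5 is recovered in the limit of many sweeps:

```cpp
#include <cassert>
#include <cmath>
#include <random>

// Gibbs sampling for the toy joint p(h1,h2) proportional to
// exp(J * [h1 == h2]) with h1, h2 in {0,1}: each step resamples one
// variable from its conditional given the other (cf. Geman & Geman, 1984).
// Returns the sample-based estimate of p(h1 = 1).
double gibbsMarginalH1(double J, unsigned nSweeps, unsigned seed) {
  std::mt19937 rng(seed);
  std::uniform_real_distribution<double> unif(0., 1.);
  // Conditional: p(h1 = h2 | h2) = e^J / (e^J + 1), and symmetrically for h2.
  const double pAgree = std::exp(J) / (std::exp(J) + 1.);
  int h1 = 0, h2 = 0;
  unsigned long countH1 = 0;
  for (unsigned t = 0; t < nSweeps; ++t) {
    h1 = (unif(rng) < pAgree) ? h2 : 1 - h2;  // sample h1 | h2
    h2 = (unif(rng) < pAgree) ? h1 : 1 - h1;  // sample h2 | h1
    countH1 += static_cast<unsigned long>(h1);
  }
  return static_cast<double>(countH1) / nSweeps;
}
```

The successive samples are correlated (the stronger the coupling J, the slower the mixing), which is why many more sweeps are needed than for i.i.d. sampling.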
Variational inference and Rényi entropies   Variational inference methods follow a different strategy: the true posterior p(H|V), for which inference is intractable, is approximated by the closest q(H) in a family F of distributions that allow tractable inference. "Closest" is here defined with respect to a divergence measure D(p‖q) between pairs of distributions. Commonly, D(p‖q) is selected from the family of Rényi α-entropies (Rényi, 1961; Minka, 2005). If p and q are probability densities, then

D_α(p‖q) = D_{1−α}(q‖p) = ∫ dH [ p(H)/(1−α) + q(H)/α − p(H)^α q(H)^{1−α} / (α(1−α)) ].    (3.9)

The two most important special cases are the inclusive (α = 1) and exclusive (α = 0) Kullback-Leibler (KL) divergence:

D_1(p‖q) = KL(p‖q) = ∫ dH p(H) log[p(H)/q(H)] + ∫ dH (q(H) − p(H)),    (3.10)

D_0(p‖q) = KL(q‖p) = − ∫ dH q(H) log[p(H)/q(H)] − ∫ dH (q(H) − p(H)).    (3.11)

For large values of α, the closest distribution q* to a given distribution p tends towards a majorization of p: for α ≥ 1, p(H = h) > 0 implies that also q*(H = h) > 0 (zero-avoiding property), and in the limit α → ∞, q*(H) > p(H) holds everywhere.[11] The closest q* hence tries to fit the entire shape of the true p as well as possible. In contrast, for small values of α, the best approximation q* tends towards a minorization of the true p: for α ≤ 0, p(H = h) = 0 implies that also q*(H = h) = 0, and in the limit α → −∞, q*(H) < p(H) holds everywhere. The closest q* hence tries to fit the tails of the true p as well as possible.
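The asymmetry between the inclusive and exclusive KL divergence is easily demonstrated numerically with two made-up discrete distributions: if q has mass in a state where p vanishes, KL(p‖q) from Eq. (3.10) stays finite, whereas KL(q‖p) diverges — exactly the support-matching behavior described above:

```cpp
#include <cassert>
#include <cmath>

// Discrete KL divergence KL(p||q) = sum_i p_i log(p_i / q_i) for
// normalized distributions (the integral correction terms of
// Eqs. (3.10)-(3.11) then vanish). Terms with p_i = 0 contribute zero.
double kl(const double* p, const double* q, int n) {
  double d = 0.;
  for (int i = 0; i < n; ++i)
    if (p[i] > 0.) d += p[i] * std::log(p[i] / q[i]);
  return d;
}
```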
Inference by local updates   Finding the closest q* is achieved approximately via an iterative local update scheme (Minka, 2005), in which both the true p and the approximation q are partitioned into factors (the CPDs of the Bayesian network) and the factors of q are locally fit to the factors of p. Assume the following factorizations:

p(H) = ∏_i p_i(H),    q*(H) = ∏_i q_i*(H),    (3.12)

and define

p^{\i}(H) = ∏_{j≠i} p_j(H) = p(H) / p_i(H).    (3.13)

We now want to iteratively select q_i* so that, given the other factors, p is approximated best. The optimal local solution,

q_i* ← argmin_{q_i} D(p_i p^{\i} ‖ q_i q^{*\i}),    (3.14)

[11] Note that we do not require q* to be normalized: after normalization, this property obviously no longer holds.
would be intractable, but if q^{*\i} already approximates p^{\i} adequately, Eq. (3.14) can be approximated by the tractable

q_i* ← argmin_{q_i} D(p_i q^{*\i} ‖ q_i q^{*\i}).    (3.15)

Using the inclusive KL divergence (α = 1) in this local update scheme, together with some additional assumptions, leads to the expectation propagation algorithm by Minka (2001), while the use of the exclusive KL divergence (α = 0) leads to variational message passing (Winn & Bishop, 2005). More general choices of α lead to the power expectation propagation algorithm (Minka, 2004). The advantage of choosing α = 0 is that it provides an exact lower bound on the model evidence: note that

log p(V) = L(q) + KL(q‖p) ≥ L(q) = ∫ dH q(H) log[p(H, V)/q(H)],    (3.16)

which is tractable, as it only involves a marginalization over q(H).
Variational message passing   After this generic view on variational inference techniques, we now discuss the variational message passing (VMP) algorithm by Winn & Bishop (2005) in detail. For the family F, we choose all distributions q that factorize over all variables, and for which inference is hence trivially tractable:

q(H) = ∏_i q_i(H_i).    (3.17)

In this case, the solution of Eq. (3.15) is given by

log q_j*(H_j) = E_{q_i*(H_i), i≠j}[log p(H, V)].    (3.18)
By the graphical model structure, log p(H, V) can be written as a sum of log-factors, most of which do not depend on H_j and are hence irrelevant for the functional form of q_j*(H_j). For evaluating the expectation value in Eq. (3.18), we must only consider the local factors q_i*(H_i) for which i lies in the Markov blanket of j, i.e. is either a child, parent or coparent (i.e. another parent of a child) of j:

log q_j*(H_j) = E_{i∈pa_j}[log p(H_j | pa_j)] + Σ_{k∈ch_j} E_{i∈({k}∪cp_k^{(j)})∩H}[log p(X_k | pa_k)] + const,    (3.19)

with pa_j, ch_j being the sets of parents and children of j, and cp_k^{(j)} = pa_k \ H_j.
Conjugate-exponential models   In order to evaluate Eq. (3.19) efficiently and to summarize the distribution q_j* succinctly, we add the constraint that the factors of p(H, V) must be conjugate-exponential models: consider an arbitrary (observed or unobserved) variable of the graphical model, which shall be denoted by X_1 without loss of generality. Denote the parent nodes of X_1 by Y_1, Y_2, …. Then two conditions must hold:

1. Exponential family: The CPD of X_1 given its parents has the following log-linear form:

log p(X_1 | Y_1, Y_2, …) = φ(Y_1, Y_2, …)^⊤ u_{X_1}(X_1) − g_{X_1}(φ(Y_1, Y_2, …)).    (3.20)

The vector u_{X_1} is called the natural statistics of X_1 and determines the family of distributions to which p(X_1 | Y_1, …) belongs: e.g. for a Gaussian distribution, u_{X_1}(X_1) = (X_1, X_1²)^⊤, while for a Gamma distribution, u_{X_1}(X_1) = (X_1, log X_1)^⊤. The vector φ is called the natural parameters and parameterizes the specific distribution within the family, and the normalization summand g_{X_1} is known as the log-partition function.

2. Conjugacy: The prior distributions log p(Y_i | pa_i) on the parents Y_i must have the same functional parameter dependence on Y_i as p(X_1 | Y_1, …), i.e. if

log p(Y_i | pa_i) = φ_{Y_i}(pa_i)^⊤ u_{Y_i}(Y_i) − g_{Y_i}(φ_{Y_i}(pa_i)),    (3.21)

then it must be possible for all i to write

log p(X_1 | …, Y_i, …) = φ_{X_1 Y_i}(X_1, cp_1^{(i)})^⊤ u_{Y_i}(Y_i) + λ_i(X_1, cp_1^{(i)})    (3.22)

with some functions λ_i and φ_{X_1 Y_i}. This is best explained with a simple example: consider a Gaussian variable X_1 with a mean Y_1 and a precision Y_2, which are themselves random variables:

log p(X_1 | Y_1, Y_2) = log[ √(Y_2/(2π)) exp(−Y_2 (X_1 − Y_1)²/2) ]    (3.23)
= (Y_1 Y_2, −Y_2/2) · (X_1, X_1²)^⊤ + (1/2)(log Y_2 − Y_2 Y_1² − log(2π))    (3.24)
= (X_1 Y_2, −Y_2/2) · (Y_1, Y_1²)^⊤ + (1/2)(log Y_2 − X_1² Y_2 − log(2π))    (3.25)
= (X_1 Y_1 − X_1²/2 − Y_1²/2, 1/2) · (Y_2, log Y_2)^⊤ − (1/2) log(2π).    (3.26)

If written as a function of Y_1, p(X_1 | Y_1, Y_2) has the form of a Gaussian, while written as a function of Y_2, it has the form of a Gamma distribution. Hence conjugacy is only fulfilled if the prior on the mean, p(Y_1 | pa_1), is also a Gaussian and the prior on the precision, p(Y_2 | pa_2), is a Gamma distribution.
Mean parameterization and VMP updates   If the natural statistics vector u_X of an exponential model is a minimal representation (meaning that its components are linearly independent), there are two equivalent parameterizations of this model: the natural parameter vector φ_X, and the mean parameterization, also known as the gradient mapping,

μ_X = E_{p(X)}[u_X(X)] = ∇_{φ_X} g_X(φ_X).    (3.27)

For the simple case of a Gaussian with mean ψ and precision λ, the two parameterizations are given by

φ_X = (λψ, −λ/2)^⊤ = ( μ_{X1} (μ_{X2} − μ_{X1}²)^{−1}, −(μ_{X2} − μ_{X1}²)^{−1}/2 )^⊤,    (3.28)

μ_X = (ψ, ψ² + λ^{−1})^⊤ = ( −φ_{X1}/(2φ_{X2}), (φ_{X1}² − 2φ_{X2})/(4φ_{X2}²) )^⊤.    (3.29)
Let φ̃X denote the inverse gradient mapping from µX to the corresponding φX . If all
CPDs in the VMP problem are conjugate-exponential models, then the qj∗ (Hj ) solving Eq. (3.19) is in the same exponential family as p(Hj |paj ), i.e. it is a multilinear
function of the same statistics vector uHj . Its updated parameter vector is given by
\[
\phi^{*}_{H_j} = \mathbb{E}\bigl[\phi_{H_j}(\mathrm{pa}_j)\bigr]
+ \sum_{k \in \mathrm{ch}_j} \mathbb{E}\Bigl[\phi_{X_k H_j}\bigl(X_k, \mathrm{cp}_k^{(j)}\bigr)\Bigr].
\tag{3.30}
\]
Another key implication of conjugacy is that the expectation values of the natural
parameters in Eq. (3.30) can be uniquely determined, via the inverse gradient
mapping, from the expectation values of the natural statistics of the other variables
in the Markov blanket. As the latter are just the mean parameters of the distributions
of these other variables, these mean parameters capture all the information that
Hj must know about its parents, children and coparents. Hence Eq. (3.30) may be
written as
\begin{align}
\phi^{*}_{H_j} &= \tilde{\phi}_{H_j}\bigl(\{\mu_{H_k}\}_{k \in \mathrm{pa}_j}\bigr)
+ \sum_{k \in \mathrm{ch}_j} \tilde{\phi}_{X_k H_j}\bigl(\mu_{X_k}, \{\mu_{H_i}\}_{H_i \in \mathrm{cp}_k^{(j)}}\bigr)
\tag{3.31}\\
&= \tilde{\phi}_{H_j}\bigl(\{m_{X_i \to H_j}\}_{X_i \in \mathrm{pa}_j}\bigr)
+ \sum_{k \in \mathrm{ch}_j} m_{X_k \to H_j},
\tag{3.32}
\end{align}
with the messages
\begin{align}
m_{X_i \to H_j} &= \mu_{X_i} && \text{for } X_i \in \mathrm{pa}_j,
\tag{3.33}\\
m_{X_k \to H_j} &= \tilde{\phi}_{X_k H_j}\bigl(\mu_{X_k}, \{\mu_{H_i}\}_{H_i \in \mathrm{cp}_k^{(j)}}\bigr) && \text{for } X_k \in \mathrm{ch}_j.
\tag{3.34}
\end{align}
The variational message passing algorithm consists of iteratively updating the parameters of all nodes based on Eq. (3.32), and updating the lower bound on the
evidence L, until a local optimum is reached.
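To make these updates concrete, the sketch below runs mean-field/VMP updates for the conjugate example above (Gaussian observations with a Gaussian prior on the mean and a Gamma prior on the precision); the priors and data are hypothetical, and each factor update absorbs only the moments of its Markov blanket, as in Eqs. (3.33)–(3.34):

```python
import random

# Illustrative mean-field / VMP updates for the conjugate model
#   x_n ~ N(mu, 1/lam),  mu ~ N(mu0, 1/beta0),  lam ~ Gamma(a0, b0).
# Priors and data are hypothetical; not code from this thesis.

random.seed(0)
true_mu, true_lam = 2.0, 4.0
data = [random.gauss(true_mu, true_lam ** -0.5) for _ in range(500)]
n, sx, sxx = len(data), sum(data), sum(x * x for x in data)

mu0, beta0, a0, b0 = 0.0, 1e-3, 1e-3, 1e-3   # vague (hypothetical) priors
E_lam = 1.0                                   # initial moment of q(lam)

for _ in range(50):
    # update q(mu) = N(m, 1/beta): receives the moment <lam> of q(lam)
    beta = beta0 + n * E_lam
    m = (beta0 * mu0 + E_lam * sx) / beta
    E_mu, E_mu2 = m, m * m + 1.0 / beta
    # update q(lam) = Gamma(a, b): receives the moments (<mu>, <mu^2>)
    a = a0 + 0.5 * n
    b = b0 + 0.5 * (sxx - 2.0 * E_mu * sx + n * E_mu2)
    E_lam = a / b

# m approximates the posterior mean of mu; E_lam the noise precision
```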
3.3. Related work
The work presented in this chapter lies at the intersection of two areas, which come
together for the first time: latent variable and latent score models for learning with
unreliable annotations (methodology), which are used for learning brain tumor segmentations from medical imagery (application area). First an overview of the
previous approaches for tackling the application task is given in subsection
3.3.1, while the methodologically related work is discussed in subsection 3.3.2.
3.3.1. Automated methods for brain tumor segmentation
Even for the constrained task of automated brain tumor segmentation in medical
imagery, there exist so many previous approaches that a complete enumeration would
go beyond the scope of this chapter. The following examples should hence be viewed
only as a representative selection.
Methods based on generative models
Generative methods for tumor segmentation can often be formulated in the formalism of graphical models that is also used in this chapter for fusing the information
from various different unreliable sources. However, instead of modelling the labeling
process of the raters, these techniques usually propose probabilistic models for the
generation of the visible image information given the hidden class labels.
For instance, Moon et al. (2002) and Prastawa et al. (2003b) propose an extension of the expectation maximization method by Leemput et al. (1999b) for brain
segmentation with an atlas prior to joint brain, tumor and edema segmentation by
adding class models for tumors and edema. The basic idea is to assume a Gaussian
likelihood for each tissue class (with unknown parameters), to add a spatially varying prior for each class derived from a probabilistic brain atlas, and to jointly learn
the likelihood parameters, the multiplicative bias field (which accounts for smooth
intensity inhomogeneities in the image) and the class assignments of the voxels by
an EM algorithm, with the class assignments and the bias field parameters being
treated as hidden variables. Spatial priors for the tumor and the edema class are
constructed as follows: The difference of two log-transformed T1 -weighted MR scans
before and after gadolinium contrast enhancement is assumed to be bias-free (since the
multiplicative bias fields are assumed to have canceled out). The intensity histogram
of this difference image is modeled by two Gaussians (corresponding to noise) and a
gamma distribution (corresponding to tumor and other enhancing regions like blood
vessels). The posterior probability of the gamma term is then interpreted as tumor
prior. Since edema is mostly observed in WM regions, the edema prior is modeled
experimentally as a fraction of the WM prior.
Nie et al. (2009) account for the different spatial resolutions of the different imaging
modalities by proposing a spatial accuracy-weighted hidden Markov random field
expectation maximization (SHE) procedure for fully automated segmentation of
brain tumors from multi-channel MR images. Typically high-resolution (pre- and
post-contrast) T1 -weighted images are combined with low-resolution T2 -weighted or
FLAIR images by registration: since interpolation is required for resampling the
low-resolution measurements, their accuracy is assumed to be lower. The geometric
mean of distances to the voxels in the original image is used as the accuracy measure.
As a generative model, a Gaussian hidden Markov Random Field (MRF) is used, for
which the clique potentials are weighted by the product of accuracies of all neighbor
pixels contributing to the interpolated signal. Parameter estimation is performed
by the EM algorithm. The procedure is evaluated on the task of segmenting brain
tumors from T1 -weighted, T2 -weighted and FLAIR MR images, after brain stripping
and bias field correction as preprocessing steps. Compared to the results of two
raters, no significant difference to the inter-rater results could be found (measured
by the Jaccard index and volume similarity).[12]
Particularly interesting is the approach by Corso et al. (2006, 2008), who propose a
hybrid of two successful segmentation approaches: generative Bayesian models and
normalized cut segmentation, the latter in the segmentation by weighted aggregation (SWA) approximation. As a generative model, a Gaussian mixture model is
used for each of four classes (brain, tumor, edema, non-brain), whose parameters
are estimated from the training data by the EM algorithm. The normal SWA algorithm generates a hierarchical segmentation by successively merging nodes based on
their affinity (derived from feature distance) and accumulating their statistics: this allows foreground objects of different scales to be detected (corresponding to different hierarchy
[12] The Jaccard index is the ratio of the intersection and the union of the detected and the true tumor volume, while the volume similarity is defined as 1 − |VD − VT |/(VD + VT ), where VD and VT are the detected and the true tumor volume.
levels). The newly proposed algorithm differs in two respects by incorporating the
generative model: every node is assigned a model class, and the affinity is modulated
such that nodes of the same class have an affinity near 1, and nodes of different
classes have an affinity near 0. The parameters are again learned from the training
data by a stochastic search. Only the intensities in the different modalities are used
as features. The algorithm has linear time complexity in the number of voxels v, but
typically high memory requirements for storing the multi-level representation (scaling as v log(v)); on a state-of-the-art PC, segmentation of an image volume takes 1-2
minutes (with ca. 5 minutes required for preprocessing). Evaluation against manual
ground truth on multispectral datasets (pre- and post-contrast T1 -weighted MRI, T2 weighted MRI, FLAIR, which are subsampled to the lowest resolution) yields average
Jaccard scores for tumor and edema detection of 69 % and 62 %. For the majority
of datasets, the median distance between automatic and ground-truth segmentation
is 0 mm (meaning that most voxels of these two boundaries coincide).
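The overlap measures recurring throughout this section can be sketched as follows (toy binary masks, illustrative only; the Dice coefficient, used by some of the later methods, is included for comparison):

```python
# Illustrative implementations of common segmentation overlap measures;
# the masks below are hypothetical toy data, not from the cited studies.

def jaccard(detected, truth):
    """|D intersect T| / |D union T| for masks given as sets of voxel indices."""
    return len(detected & truth) / len(detected | truth)

def volume_similarity(detected, truth):
    """1 - |V_D - V_T| / (V_D + V_T): compares volumes only, ignores location."""
    vd, vt = len(detected), len(truth)
    return 1.0 - abs(vd - vt) / (vd + vt)

def dice(detected, truth):
    """2 |D intersect T| / (|D| + |T|): the Dice similarity coefficient."""
    return 2.0 * len(detected & truth) / (len(detected) + len(truth))

D = {(0, 0), (0, 1), (1, 0), (1, 1)}          # hypothetical detection
T = {(0, 1), (1, 0), (1, 1), (2, 1), (2, 2)}  # hypothetical ground truth
assert jaccard(D, T) == 0.5                   # 3 shared voxels, 6 in the union
```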
Methods based on outlier detection
While generative models can capture well the intensity distributions of the different classes in healthy brain tissue, pathological lesions such as tumors or multiple
sclerosis hyperintensities are often harder to model, and the common assumption of
Gaussianity may be violated. This is the reasoning behind outlier-based segmentation methods, which fit a generative model to the normal tissues and detect all
pathologies as outliers to this model.
Gering et al. (2002) propose a hierarchical classification procedure for learning models
of healthy tissue classes and assigning the voxels to those classes, in which higher
levels may correct wrong decisions made on lower levels. On the lowest level, an EM
algorithm is used to learn the intensity distribution of GM, WM and CSF, treating
the bias field and the class assignments of the single voxels as hidden variables,
exactly as in (Leemput et al., 1999a). Spatial context is introduced on the second
level by imposing a Potts model MRF prior on the class assignments, which is relaxed
to a mean-field approximation for tractability as in (Leemput et al., 1999b). On the
third level, the position of every voxel inside the structure of equally labeled voxels
is considered, mainly its distance from the structure boundaries (e.g. if a WM voxel
lies in the center of the white matter or borders neighboring structures). The prior
probabilities for large distances from the boundary may then be increased, which
favors large homogeneous regions and may remove spurious misclassifications. On
the fourth level, global prior information such as digital atlas priors or priors on the
distances between several structures (such as ventricles and skin) may be imposed.
The fifth level is the interaction with the user, who initializes the iterative fitting
of the models for the healthy classes by providing examples for each class with a
quick brush stroke. Manual correction of misclassified voxels would also be possible
on this level. Several iteration passes over these five levels are then performed until
convergence; tumor voxels are identified as outliers with respect to the Mahalanobis
distance to the center of the class they are assigned to.
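The final outlier test can be sketched as follows (hypothetical intensity data, not the authors' code): voxels whose Mahalanobis distance to the center of their assigned healthy class exceeds a threshold are flagged as tumor candidates.

```python
import numpy as np

# Fit a Gaussian to (hypothetical) healthy-class feature vectors and flag
# outliers by Mahalanobis distance, in the spirit of Gering et al. (2002).

rng = np.random.default_rng(0)
healthy = rng.normal([100.0, 80.0], [5.0, 4.0], size=(1000, 2))  # WM-like class
mean = healthy.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(healthy, rowvar=False))

def mahalanobis(x):
    """Distance of a feature vector to the healthy-class center."""
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

# a typical healthy voxel vs. a hyperintense (tumor-like) voxel
assert mahalanobis(np.array([102.0, 79.0])) < 3.0
assert mahalanobis(np.array([140.0, 120.0])) > 3.0
```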
Gering (2003) proposes a new metric called nearest neighbor pattern matching (NNPM)
for judging the abnormality of an image window. For each window center position, a
set of template windows corresponding to normal texture examples at this location is
provided and the NNPM of the window is defined as the smallest root-mean-squared
distance to any template in the set. In order to resolve texture similarity at different scales, a scale-space representation is used and a joint pathology probability is
defined by treating the probabilities at each resolution as independent (i.e. the joint
probability is the product of the pathology probabilities at the different scales, where
a Gaussian assumption is used to extract a pathology probability from the distance).
Prastawa et al. (2003a, 2004) detect brain tumors as outliers in multispectral MR
images, after robustly learning models for the healthy tissue classes: A probabilistic
brain atlas is used to draw samples for all healthy classes (WM, GM, CSF) from
locations characteristic for the respective class. A Gaussian model is assumed for each
class, whose parameters are estimated with an outlier-robust estimator (Minimum
Covariance Determinant); samples further than three standard deviations apart from
the mean are discarded as outliers (tumor, edema) and assigned to an “abnormal”
class. The distributions of all classes (GM, WM, CSF, abnormal, non-brain) are then
re-estimated nonparametrically by a kernel density estimation, and the posterior
probabilities are computed for all voxels. After estimating and correcting for a bias
field, the whole process is iterated with the posterior probabilities in lieu of the prior
atlas probabilities. After the abnormal class is finally segmented, it is partitioned
into tumor and edema by k-means clustering with k = 2; if there exist two separate
clusters (as measured by the Davies-Bouldin overlap index), the cluster with the
lower mean T2 -weighted intensity is labeled as tumor. The tumor segmentation is
then refined by performing a level set evolution initialized with the distance transform
of the presegmented tumor; then false positives for the edema class are discarded by
performing a connected component analysis and removing all components without
contact to a tumor. This procedure is also iterated, disabling the level set in the final
iteration step. Validation on bispectral datasets with T1 - and T2 -weighting yields
overlap fractions of 77 ± 5% and Hausdorff distances of 12.7 ± 4.1 mm for tumor
segmentation, while intra-rater comparison yields 77 ± 15% and 4.43 ± 0.68 mm.
Methods based on discriminative learning without explicit context
information
The following methods are closest in spirit to the variants of logistic regression that
will later be discussed in this chapter. Instead of directly modeling the joint distributions of features and labels p(x, y), as generative models do, discriminative models
restrict themselves to modeling the conditional distribution p(y|x), which is also the
relevant distribution for prediction purposes. This is an easier task, as the feature
distribution need not be modeled; however, it also poses the risk of overfitting if few
training data are available. First we discuss only discriminative models that account
for purely local image information, without taking spatial context into account:
Schmidt et al. (2005) explore support vector machine (SVM) classification with several combinations of alignment-based features for brain tumor segmentation in multispectral (pre- and post-contrast T1 -weighted and T2 -weighted) MR images in order
to facilitate inter-patient training without need to provide patient-specific training
examples. Preprocessing steps are noise reduction by nonlinear filtering, inter-slice
intensity normalization, intra-volume bias field correction, mutual information-based
multimodal registration, matching to an atlas template by a linear and a nonlinear
step, resampling to the template coordinate system and inter-volume intensity standardization (in all steps methods were used that are mostly robust to the presence
of tumors). Four types of alignment-based local features are then extracted: the
distance transform of the brain area of the template (B feature), spatially dependent
probabilities for the three main normal tissue classes (P features), spatially dependent average intensities for healthy brains in the different modalities (A features)
and the intensity difference to the contralateral voxel to characterize local symmetry
or asymmetry (S features). Textural features are also created by applying a multiscale Gaussian convolution filter bank. A linear kernel SVM is then trained, and the
classification results of test images are postprocessed by repeated median filtering (in
order to remove isolated labels) and selection of the largest connected component.
For the best combination of alignment-based features (P , A and S) together with the
texture features an average Jaccard score of 0.732 is obtained (which outperforms
several other feature sets taken from previous literature).
Zhou et al. (2005) use a one-class learning procedure (one-class RBF kernel SVM)
to learn the appearance of tumorous areas in pre- and post-contrast T1 -weighted
images (only the gray values from both modalities are used as features). This yields
a sensitivity of 83.3 ± 5.1% and a correspondence rate (true positives − half of false
positives, normalized by total number of tumor voxels) of 0.78 ± 0.06, while FCM
(see section 3.3.1) only achieves values of 76.2 ± 4.8% and 0.73 ± 0.07.
Methods based on discriminative learning with incorporated context
information
In cases where the local information is ambiguous, taking spatial context into account
can often improve the segmentation: voxels that are surrounded by tumor voxels have
an increased likelihood of being tumor voxels themselves, and likewise for healthy
tissue. This increase in model complexity comes at the price of increased computational
complexity: finding the MAP solution of a spatially regularized model often leads to
a discrete optimization problem that is intractable or only tractable in special cases,
so that one has to resort to approximate solutions. The following approaches start
from local discriminative classifiers as discussed in the previous section, and augment
them with spatial context information:
Lee et al. (2005) compare three context-sensitive classification procedures (Markov
random fields (MRF) as a generative model, discriminative random fields (DRF) and
support vector random fields (SVRF) as discriminative models) with their context-free degenerate versions (naive Bayes, logistic regression and support vector machines) for the task of segmenting brain tumors from multispectral MR images. The
three context-sensitive models are all graphical models with single-site and pair potentials: for the MRF, the single-site potentials are Gaussians and the pair potential
only depends on the local label assignments (e.g. a Potts potential); for the DRF, the
single-site potentials are a generalized linear model (e.g. logistic regression terms)
and the pair potential may be modulated by the (possibly non-local) features (here
the penalty for different adjacent labels is attenuated if the features at the two voxels differ by a large amount). Finally, for the SVRF, the logit-transformed output of
an SVM is chosen as single-site potential, and the same interaction term as for the
DRF is chosen; it is assumed that the SVRF performs better than the DRF in high-dimensional feature spaces with correlated features. The parameters of an SVRF can
be trained by solving a quadratic program. For inference, the label assignments of
the context-sensitive classifiers are initialized with the locally optimal labels, and the
final label assignment is computed using ICM (see section 1.3). Several preprocessing steps for noise reduction, bias-field correction, inter-slice intensity normalization
and registration to an anatomical template are performed. Using alignment-based
features as in (Schmidt et al., 2005) and evaluating the classifiers on three different tasks (segmenting the enhancing tumor region, the gross tumor region and the
edema region), it turns out that SVRFs perform best for all three tasks (with average
Jaccard indices of 0.825, 0.723 and 0.769).
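The ICM inference scheme shared by these models can be sketched on a toy grid (hypothetical unary costs and Potts weight, not the authors' implementation): starting from the locally optimal labels, each voxel is greedily switched to the label minimizing its unary cost plus a Potts disagreement penalty over its neighbors.

```python
import itertools

H, W, BETA = 4, 4, 0.8   # grid size and Potts smoothness weight (hypothetical)

# unary(i, j) = (cost of label 0, cost of label 1); label 1 = "tumor".
# A tumor blob occupies the lower-right 2x2 block; pixel (0, 0) is a
# noisy outlier whose local evidence wrongly favors "tumor".
def unary(i, j):
    if i >= 2 and j >= 2:
        return (1.0, 0.2)
    if (i, j) == (0, 0):
        return (0.4, 0.2)
    return (0.2, 1.0)

labels = [[min((0, 1), key=lambda y: unary(i, j)[y]) for j in range(W)]
          for i in range(H)]               # initialize with locally optimal labels

def neighbors(i, j):
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        if 0 <= i + di < H and 0 <= j + dj < W:
            yield i + di, j + dj

changed = True
while changed:                              # ICM: greedy coordinate descent
    changed = False
    for i, j in itertools.product(range(H), range(W)):
        def local_energy(y):
            pair = sum(BETA for a, b in neighbors(i, j) if labels[a][b] != y)
            return unary(i, j)[y] + pair    # unary + Potts disagreement penalty
        best = min((0, 1), key=local_energy)
        if best != labels[i][j]:
            labels[i][j], changed = best, True

assert labels[0][0] == 0                    # isolated noisy pixel smoothed away
assert labels[3][3] == 1                    # tumor blob survives
```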
Lee et al. (2006) propose semi-supervised discriminative random fields (SSDRF) as
a semi-supervised generalization of classical discriminative random fields to be used
for general computer vision problems, and use brain tumor segmentation as the main
experimental application example of their article. The unlabeled data are used in
order to decrease the risk of parameter overfitting, by adding the expected conditional
entropy of the unlabeled dataset as a regularization term to the DRF posterior: the
uncertainty for the labeling of the unlabeled training examples should be low. For
parameter estimation, a gradient descent optimization is used (the marginalization
over the unobserved labels may only be performed approximately by resorting to a
pseudolikelihood approximation). Inference for the test examples is performed by
ICM, as for a normal DRF. An evaluation on a dataset of multispectral 3D MRI
scans (pre- and post-contrast T1 -weighted and T2 -weighted) against manual ground
truth yields a significant increase in the average Jaccard index (0.66) compared to
both logistic regression (0.54) and DRF (0.55).
Corso et al. (2007) propose an algorithm called extended graph-shifts to minimize the
energy function of a conditional random fields model for which the number of labels
is unknown beforehand. The image label structure is represented by a hierarchical
graph of progressively aggregated nodes such that each node takes the same label as its
parent node: the root nodes correspond to the different clusters. The hierarchy may
then be transformed by two types of graph shift operations (greedily selecting the operation at each iteration step that maximally decreases the global energy): changing
the parent of a node (thus changing the label of all nodes in the sub-graph) and creating a new subgraph from a node. At the bottom layer (corresponding to the lattice
voxels), every node is assigned a unary potential corresponding to the local evidence
for the different possible labels, which is computed from the probabilistic output of
a Viola-Jones-like boosting cascade trained on about 3000 features (e.g. Haar-like
filters, gradients, local intensity curve-fitting). Also every pair of bottom layer nodes
is assigned a Potts potential term; nodes and edges at the higher hierarchy layers
aggregate the potentials of their children. The label assignment is initialized stochastically, and the hierarchical structure then allows one to decide efficiently which move
decreases the total energy the most. The procedure is evaluated on the tasks of
brain tumor and edema segmentation from multispectral MR images (high-resolution
pre- and post-contrast T1 -weighted MRI, and low-resolution T2 -weighted and FLAIR
MRI), and of multiple sclerosis lesion segmentation from high-resolution unispectral
MRI, training and testing on six datasets each. For tumor and edema segmentation,
Jaccard scores, precision and recall of 86 % / 95 % / 90 % and 88 % / 89 % / 98 %
respectively are achieved, while for multiple sclerosis lesion detection, the detection
rate is 81 % on the test set.
Lee et al. (2008) propose a context-sensitive classifier called pseudo-conditional random fields that yields similar or better accuracy than DRF or SVRF, while being
exactly solvable and computationally much more efficient than the traditional approaches. The local potentials are products of a generalized linear model (for the
feature-conditional label distribution) and a Potts model term on the labels of adjacent voxels favoring smoothness, which is modulated by a multiplicative factor
measuring the similarity of the features of both voxels. Only the generalized linear model term contains adjustable parameters, so that the spatial correlations can
be neglected during training; and inference in the testing phase can be performed
efficiently using graph cuts. An evaluation on the task of segmenting enhancing
and necrotic glioblastomas from multispectral MR images (pre- and post-contrast
T1 -weighted, and T2 -weighted) against manual ground truth leads to Jaccard scores
in the range of 0.82–0.93, which are significantly superior to logistic regression and
comparable to SVRF (see above), while the training time is over 30 times faster than
for the SVRF (38 vs. 1276 seconds on average).
Wels et al. (2008a,b) propose two similar approaches for segmenting on the one hand
pediatric brain tumors, and on the other hand multiple sclerosis lesions from multispectral MR images. The modalities used are T1 -weighted MRI with and without
gadolinium enhancement and T2 -weighted MRI in the first case, and T1 -weighted,
T2 -weighted and FLAIR MRI in the second case. For the tumor application, the
images are preprocessed by brain stripping, anisotropic diffusion filtering, and intensity standardization by dynamic histogram warping. Segmentation is viewed as
MAP estimation in a Markov random field, with the single-site potentials given by
the probabilistic outputs of a probabilistic boosting tree (PBT) classifier trained
on local features (multispectral intensities and gradient magnitudes, and Haar-like
features efficiently computed for each of the modalities from an integral image representation). For the tumor application, a contrast- and distance-attenuated Ising pair
potential is imposed and the MAP inference problem is solved exactly using graph
cuts. For the MS application, a simple Ising pair potential is imposed and the MAP
inference problem is solved approximately using ICM. In the latter case, the final
segmentation is obtained by a Laplacian 2D level set evolution initialized from the
MAP solution for every slice. Typical segmentation times are 5 minutes per dataset.
For the tumor application, Jaccard scores of 0.78 ± 0.17 are obtained when comparing to manual segmentation. The evaluation of the MS application leads to total
detection failure on one out of six datasets, and to similarity indices of 0.68 ± 0.15
for the other five examples.
Methods based on active contours / level set segmentation
Active contour methods model the segmentation contour as the level set of a continuous
function (the embedding function), and minimize an energy functional for the embedding function that accounts for data fidelity (the contour should coincide with
local edge cues), regularity (e.g. the curvature of the contour) and prelearned shape
assumptions. Mathematically, this energy minimization leads to the task of solving
a partial differential equation (PDE). While this formalism can readily incorporate a
large amount of prior knowledge about the final segmentation (such as shape information), it is prone to getting stuck in local minima.
Ho et al. (2002) use level set evolution to adapt an active contour to the tumor
boundaries; the region competition formalism is employed in order to deal with the
fuzzy tumor boundaries. First a tumor probability map is created from two T1-weighted scans with and without gadolinium enhancement (by fitting a Gaussian
mixture model with two components to the difference image); this map tends to be
noisy and to also show blood vessels etc. The active contour is initialized with the 0.5
level set of this probability map, and then evolves by a PDE containing a region
competition term (which causes shrinkage in low probability regions and expansion
in high probability regions), a smoothness term penalizing high curvature and a
uniform smoothing term for increased numerical stability. The procedure is validated
on multispectral MR scans (T1 -weighted with and without gadolinium enhancement
and T2 -weighted) of meningioma and glioblastoma patients, yielding Jaccard scores
in the range 0.85–0.93 and Hausdorff distances of 7–13 voxels as compared to manual
segmentation.
Khotanlou et al. (2006) devise a method for tumor segmentation on unispectral
images (T1 weighting only). After brain-stripping, the histogram-based fuzzy possibilistic c-means clustering method is used to create a rough tumor segmentation
(which minimizes the sum of squared differences between the local gray level and the
cluster center weighted by a sum of a fuzzy membership and a typicality value and
thus ensures higher robustness than ordinary c-means). Misclassification errors are
removed using morphological operations. The final tumor boundaries are obtained
by evolving a deformable triangulated surface, subject to an internal force (controlling surface tension and curvature) and an external force (a Generalized Gradient
Vector Flow field, which is the equilibrium state of diffusing the gradient vector of a
Canny edge map).
Cobzas et al. (2007) combine discriminative learning with problem-specific high-dimensional features, anatomical prior information and variational (level set) segmentation for segmenting brain tumors. The posterior probability as estimated by a logistic regression is used in the external force term of the level set
evolution PDE leading to the final segmentation. After preprocessing the data by
similar steps as in Schmidt et al. (2005) (see above), a logistic regression is trained
based on alignment-based features as in Schmidt et al. (2005) and texture features
(multi-scale Gabor features). The final segmentation is then obtained by running
the level set evolution and removing small surface pieces as a post-processing step.
Evaluation on T1 -weighted and T2 -weighted datasets yields average overlap fractions,
Hausdorff distances and mean distances of 60±14%, 8.1±1.8 mm and 1.74±0.66 mm,
which is considerably better than when using a Gaussian classifier.
Methods based on fuzzy clustering
Fuzzy clustering techniques work by grouping the set of features extracted from all
voxels in the training images into several groups (or clusters), which are given different semantic interpretations: for example, some clusters may be identified with the
different tissue classes (GM, WM, CSF) in the brain, while others may be identified
with pathologies (tumor, edema) or extracerebral regions (bones, skin or air). For
brain lesion detection applications, usually a fuzzy clustering approach is followed
rather than a hard clustering: i.e. every voxel may be assigned to every cluster, with
soft assignment weights that have to sum to 1. Most applications are based on the
fuzzy c-means (FCM) technique that iteratively estimates the cluster centers and the
soft assignments in an interleaved fashion.
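This interleaved update scheme can be sketched in one dimension (toy intensities and hypothetical parameters, not the cited implementations):

```python
import random

# Minimal 1-D fuzzy c-means sketch: soft memberships and cluster centers
# are re-estimated in an interleaved fashion.

def fcm(xs, c=2, m=2.0, iters=50):
    lo, hi = min(xs), max(xs)
    centers = [lo + (i + 0.5) * (hi - lo) / c for i in range(c)]  # spread init
    for _ in range(iters):
        # membership u[k][i] of point k in cluster i; each row sums to 1
        u = []
        for x in xs:
            d = [abs(x - ci) + 1e-12 for ci in centers]
            u.append([1.0 / sum((d[i] / d[j]) ** (2.0 / (m - 1.0))
                                for j in range(c)) for i in range(c)])
        # centers are means weighted by memberships raised to the fuzzifier m
        centers = [sum(u[k][i] ** m * xs[k] for k in range(len(xs)))
                   / sum(u[k][i] ** m for k in range(len(xs)))
                   for i in range(c)]
    return sorted(centers)

random.seed(1)
xs = ([random.gauss(30.0, 2.0) for _ in range(100)]      # hypothetical tissue A
      + [random.gauss(70.0, 2.0) for _ in range(100)])   # hypothetical tissue B
c_low, c_high = fcm(xs)
assert abs(c_low - 30.0) < 2.0 and abs(c_high - 70.0) < 2.0
```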
Fletcher-Heath et al. (2001) combine FCM clustering with subsequent image processing and labeling operations based on explicit knowledge for segmenting non-enhancing brain tumors from multispectral MR images (T1 -, T2 - and PD-weighted).
The input images are fuzzily oversegmented into ten clusters by FCM, and clusters
corresponding to extracranial tissues, white matter and gray matter are identified and
removed (but their locations are remembered in order to guide the subsequent steps).
CSF, necrosis (if present) and tumors are then separated by several knowledge-guided
image processing steps: If the T1 histogram has a bimodal shape, the low-intensity
peak corresponds to a necrosis which is then removed. The ventricles are identified
by extracting a central shape bordered by GM and WM (left-right symmetry information is used if the tumor borders the ventricles). Isolated CSF pixels are then
removed by morphological operations (this assumes a minimum spatial extent of the
tumor). Finally, the most compact region(s) is/are selected as tumor(s), i.e. the
number of tumors must also be known beforehand. The validation yields correct
classification rates ranging from 53 % to 91 % per volume.
Segmentation in 4D images
While most of the other approaches described in this chapter only aim to segment an
image volume acquired at a single time point, tumor progression monitoring studies
require to track e.g. the volume of a tumor over time, so that the response to a therapy can be assessed. The following methods try to improve upon the single-volume
segmentations by using the information from the different time points simultaneously:
Solomon et al. (2004) employ 4D segmentation to track the tumor volume over time
and to assess changes in tumor size objectively; it is assumed that the additional temporal dimension may also lead to improved segmentation at the single time points.
The basis for segmentation is a Gaussian mixture model fitted with an EM algorithm
as in Leemput et al. (1999b), which is augmented with a temporal hidden Markov
model (EM-HMM segmentation). Unispectral, nearly isotropic 3D MRI scans acquired at three different time points are registered and de-skulled. First a rough
segmentation is obtained by k-means clustering, which is used as initialization to the
EM estimation of the Gaussian model parameters (for this purpose the volumes at
all different time points are used). Given the class-conditional observation models,
the class assignment labels are estimated: It is assumed that every voxel at every
time point is characterized by a status label (lesion vs. not lesion) which evolves
by a Markov process (independently from all other voxels), and that the observed
intensity only depends on the current status. Furthermore one assumes that the
transition probability drops exponentially with the distance from the current tissue
boundary, and the exponential coefficient is estimated from the results of the non-temporal EM segmentation at different time points. The posterior of the current
status given all evidence acquired up to the current time point is then computed and
used for fuzzy segmentation; it is also possible to reestimate the class assignments at
earlier time points given the new information (smoothing). In a first experiment
with three different time points, a correlation of 0.89 with the manual segmentation
and a mean Dice similarity coefficient[13] of 0.71 are found. In an extension (Solomon
et al., 2006), an MRF prior is added to the intensity distribution learned by the
EM algorithm and the transition matrix is refined to accommodate more than two
tissue classes (parenchyma, tumor, CSF and blood vessels), so that the Gaussian
model assumptions become more accurate. Evaluation on simulated data shows that
the MRF and the HMM priors and the smoothing step all lead to improvements
as measured by sensitivity and the Jaccard index. Furthermore, evaluation on real
data from three different time points yields segmentation results that are as good as
comparable state-of-the-art segmentation techniques, and which have the same sensitivity as a manual segmentation, albeit a slightly smaller Jaccard similarity compared to the ground truth. The use of a multi-class tissue model leads to a slight decrease in sensitivity, but also to an increased similarity index (owing to fewer false positive detections).
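The temporal core of such an EM-HMM scheme, the per-voxel forward filtering of a two-state lesion label, can be sketched as follows. This is a minimal numpy illustration with invented transition and emission values, not the authors' implementation:

```python
import numpy as np

def forward_filter(obs_lik, trans, prior):
    """Compute p(status_t | observations up to t) for one voxel.

    obs_lik : (T, 2) array, obs_lik[t, s] = p(intensity_t | status s)
    trans   : (2, 2) array, trans[i, j] = p(status_{t+1} = j | status_t = i)
    prior   : (2,) initial status distribution (not lesion, lesion)
    """
    T = obs_lik.shape[0]
    post = np.zeros((T, 2))
    belief = prior * obs_lik[0]
    post[0] = belief / belief.sum()
    for t in range(1, T):
        predicted = post[t - 1] @ trans      # one Markov transition step
        belief = predicted * obs_lik[t]      # fuse with the current scan
        post[t] = belief / belief.sum()
    return post

# Three time points; the voxel looks increasingly lesion-like.
obs_lik = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]])
trans = np.array([[0.95, 0.05], [0.02, 0.98]])  # status labels are "sticky"
post = forward_filter(obs_lik, trans, prior=np.array([0.9, 0.1]))
```

Smoothing, as mentioned above, would additionally propagate evidence backwards in time to reestimate the labels at earlier time points.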
Interactive segmentation methods
The segmentation is typically simplified if no full automation is required, and the
clinical user has the opportunity to either initialize the segmentation by manual seed
placement, or to refine the final segmentation.
The first approach is followed e.g. by Warfield et al. (2000) and Kaus et al. (1999,
2001), where the authors propose an adaptive, template-moderated spatially varying
13. The Dice coefficient of two segmentations is the ratio between the overlap volume and the average volume of the single segmentations.
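In code, the Dice coefficient of footnote 13 amounts to the following (a straightforward numpy sketch; the convention for two empty segmentations is an arbitrary choice here):

```python
import numpy as np

def dice(a, b):
    """Dice coefficient: overlap volume divided by the average volume,
    i.e. 2 |A ∩ B| / (|A| + |B|)."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # convention: two empty segmentations agree perfectly
    return 2.0 * np.logical_and(a, b).sum() / denom

seg1 = np.array([[1, 1, 0], [1, 0, 0]])
seg2 = np.array([[1, 0, 0], [1, 1, 0]])
score = dice(seg1, seg2)  # overlap 2, volumes 3 and 3
```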
classification (ATM SVC) algorithm for multiple segmentation problems of both
healthy and pathological structures, and apply it to brain tumor segmentation,
amongst other tasks. The idea is to combine two segmentation strategies: classification based on local features (which does not account for anatomical information)
and nonlinear registration to an anatomical template (which takes the local features
only partially into account, and has only limited accuracy for pathological or highly
variable organs). A unispectral three-dimensional MR image is initially registered
to a template atlas by a nonlinear registration algorithm. The user has to provide
three or four example labels for each class of interest (e.g. brain (WM & GM), CSF,
tumor, skin, background), which typically requires 5 minutes of user interaction. The
image is then segmented by a kNN classification (section 2.2), using as features both
the voxel intensity and the distance to relevant brain structures, e.g. the ventricles.
Then the registration is refined by matching the atlas with the segmented image,
and the procedure is iterated. While the initial atlas only contains normal brain
structures and no tumor, a tumor segment is added after the first iteration step from
the initial segmentation. Compared to the majority vote of four experts, the tumor
can be segmented with a voxelwise accuracy of 99.7%.
Level-set and active contour segmentations (see above) are also well-suited for a
user-defined initialization, which may alleviate their problems with running into local
energy minima. For instance, Jiang et al. (2004) provide a brain tumor segmentation
method as part of a telemedicine CAD system. They use a level set segmentation
starting from a coarse user-provided manual delineation, with standard terms for the
external and internal force (local curvature and gradient of a simple edge map).
Droske et al. (2005) use level set evolution with an expanding force term to segment brain tumors on T1 -weighted gadolinium-enhanced images, starting from a
user-provided initialization contour inside the tumor. The expansion speed is computed based on an edge map (expansion is slowed down if the edge intensity lies outside of a prescribed interval, which is estimated from the user-defined seed points).
Since no automated convergence diagnostics are included, the user also has to specify the arrival time for the final segmentation. It is also possible to correct or add
intermediate segmentations to ensure convergence to the correct final state.
Besides the dependence on the initial contour, level-set segmentation methods also
typically depend on a number of free parameters whose optimal choice is not always clear beforehand, especially to clinical users. Lefohn et al. (2003) and Cates
et al. (2004) employ fast level-set deformation solvers to interactively tune these free
parameters of the level set partial differential equation (e.g. the trade-off between
curvature term and data term, or the free parameters of the data term). A sparse
approximation of the PDE is used in which only voxels near the isosurface are taken
into account, and a further speed-up of 10–15 is achieved by implementing the solver
on a GPU. Compared with the STAPLE-generated ground truth from four expert
segmentations, even non-radiologist raters achieved a mean precision of 94% (experts: 83%) and an average correct volume fraction of 99.78% ± 0.13%, needing a total time of 6 ± 3 minutes per dataset, whereas an unassisted three-dimensional manual segmentation typically takes 3–5 hours.
The second approach, i.e. enabling the users to perform final corrections on the
segmentation, is followed e.g. by Letteboer et al. (2004): A multiscale watershed
segmentation of the tumor images is created as a preprocessing step (i.e. a scale-space representation is created by convolving with derivatives of Gaussians at different scales, watershed segmentations are performed at the various scales and the
catchment basins are linked across the different scales to ensure that each catchment
basin at a fine scale is contained in exactly one catchment basin at every coarser
scale). In a graphical user interface, the user may first create a rough segmentation
by selecting segments at a coarse scale, and then interactively refine it by adding or
deselecting subsegments at the finer scales: this leads to an increased intraobserver and interobserver similarity, and the time needed for manually delineating the tumor decreases from 22 minutes on average (range 10–40 minutes) to 7 minutes on average (range 1–15 minutes).
Cates et al. (2005) explore the opportunities of the ITK segmentation library for interactive segmentations of brain tumors and different anatomical structures (e.g. optic nerve, eyeball, lateral rectus muscle). Datasets are preprocessed by anisotropic diffusion, and a watershed over-segmentation is computed based on a lower-thresholded
gradient map. A segment hierarchy is then constructed by successively merging watershed basins based on their watershed depth. The users then create the final
segmentation by manual selection of regions in this hierarchy graph. Compared to
the STAPLE consensus of several expert segmentations, this procedure yields a mean
correct classification rate of 99.76 ± 0.14%. Giving the clinical user the opportunity
for manual corrections at the final stage may also increase the acceptance of clinical
radiologists for computer-assisted segmentation systems, and increase the safety of
the patients during subsequent interventions that are planned on the basis of these
segmentations.
Active learning approaches
Creating manual annotations for training a classifier is a time-consuming and tedious task, especially as it has to be performed by clinical radiologists, whose time is
typically scarce. Active learning approaches can speed up this process by proposing
the images (or image parts) for annotation which are expected to give the highest benefit to classifier accuracy. Farhangfar et al. (2009) propose such an active
learning approach for the training of a DRF classifier, and apply it to the tasks of
sky segmentation in natural images and brain tumor segmentation in MR images.
Their approach is similar to the semi-supervised DRF model presented in Lee et al.
(2006) (see above), but the regularization term consists of the expected conditional
entropy of each queried new image to be labeled rather than the expected conditional entropy of all unlabeled images together. A pseudolikelihood approximation is
employed to make the parameter estimation for this regularized likelihood tractable
(for this approximation it is necessary to compute the MAP label estimate for the
unlabeled image by ICM). There are two possible strategies for requesting the next image to be labeled: firstly, select the image with the highest expected conditional
entropy given the current estimate for the posterior distribution of the labels (which
is approximated as a sum over the pixel-wise entropies); this strategy is applied in all steps but the first, where the posterior distribution is not yet initialized. Secondly, select the instance providing the maximum information about the labels of
the other unlabeled instances (which can be computed from the solution of the regularized posterior); this strategy is only used in the initial step as it is computationally
more expensive. Besides sky segmentation, this procedure is evaluated for the task
of brain tumor segmentation from multispectral MR scans (pre- and post-contrast
T1 -weighted and T2 -weighted). Four features are used for each pixel: the intensity
in the T2 -weighted image, the difference between the post-contrast and pre-contrast
T1 -weighted intensities, and the differences of these two gray values to the gray value
of the contralateral voxel. Actively selecting two training images yielded (insignificantly) better F -measures than training on all 71 examples.
Methods exploiting left-right symmetry of the brain
Besides generic segmentation methods for medical imagery, there are also techniques
that depend heavily on the specific properties of brain imagery, namely the approximate left-right symmetry of the brain: this is e.g. exploited by the alignment-based
features of Schmidt et al. (2005), cf. section 3.3.1. Another approach in this direction was proposed by Ray et al. (2008): The authors aim to quickly place a boundary
box around a tumor in unispectral MR images, e.g. for retrieval purposes. For this
they use asymmetry-based features specific for brain tumor segmentation, in order
to profit from the knowledge that tumors tend to disturb the bilateral symmetry of
the brain. A (healthy) template image is matched approximately to the input image,
and for each coronal plane the Bhattacharyya distance14 between the intensity histograms of the two images before this plane and after this plane is computed. The
front and back face of the bounding box then delineate the region where this score
14. The Bhattacharyya coefficient between two histograms is the sum over all bins of the geometric means of the corresponding entries; the Bhattacharyya distance is the negative logarithm of this coefficient.
decreases from front to back, as the intensities of the two images tend to be uncorrelated in this area. Similarly, the left and right face are detected. Dice coefficients
with bounding boxes drawn by expert radiologists lie in the range of 0.7–0.9.
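The plane-wise score of Ray et al. (2008) rests on the histogram comparison of footnote 14, which can be sketched as follows (a numpy illustration with made-up histograms, not the authors' code):

```python
import numpy as np

def bhattacharyya_coefficient(p, q):
    """Sum over bins of the geometric means of the two (normalized)
    histogram entries: 1 for identical histograms, 0 for disjoint ones."""
    p = p / p.sum()
    q = q / q.sum()
    return np.sum(np.sqrt(p * q))

hist_a = np.array([10.0, 30.0, 60.0])
hist_b = np.array([12.0, 28.0, 60.0])   # similar to hist_a
hist_c = np.array([60.0, 30.0, 10.0])   # very different from hist_a
sim_close = bhattacharyya_coefficient(hist_a, hist_b)
sim_far = bhattacharyya_coefficient(hist_a, hist_c)
```

A drop of this score along consecutive coronal planes then marks the region where the intensity content of patient image and template diverge.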
Other approaches There are also multiple other approaches for brain lesion segmentation that cannot be discussed here due to space constraints. Amongst others,
they comprise region growing (Broderick et al., 1996), rule-based techniques (Raya,
1990), semi-supervised classification (Song et al., 2006, 2009), template matching
(Warfield et al., 1995; Hojjatoleslami et al., 1998), mathematical morphology (Gibbs
et al., 1996; He & Narayana, 2002), fuzzy connectedness estimation (Udupa et al.,
1997; Moonis et al., 2002), vector quantization (Karayiannis & Pai, 1999), pyramid segmentation (Pachai et al., 1998), eigenimages (Soltanian-Zadeh et al., 1998),
texture-based classification (Kovalev et al., 2001; Iftekharuddin et al., 2005), Bayesian
classification (Harmouche et al., 2006) and fuzzy logic (Zhu et al., 2005; Dou et al.,
2007).
3.3.2. Learning from unreliable manual annotations
In the common formulation of supervised learning methods (see section 2.2), a mapping from inputs x ∈ X to targets y ∈ Y is learned from training examples
(xi , yi ). Typically, X ⊆ Rp and Y is either continuous (Y ⊆ R, regression setting)
or discrete (Y = {1, . . . , L}, classification setting). Often the targets y come from
human judgment, and one assumes that this judgment is reliable, so that the training
examples (xi , yi ) can be viewed as samples from the true data distribution during the
subsequent classifier training and testing. However, in many cases this assumption
is overly optimistic, since the human labelers may be unreliable and assign some
wrong labels. This is particularly the case for classification based on noisy or ambiguous image information, e.g. for the tasks of finding volcanoes in synthetic aperture radar imagery of Venus (Smyth et al., 1995) or distinguishing between genuine
(Duchenne) and insincere (non-Duchenne) smiles (Whitehill et al., 2009). The most
extreme case is that of adversarial labelers, who deliberately cast wrong labels
in order to degrade the classifier performance: they pose a severe challenge for e.g.
collaborative e-mail spam filtering systems (Attenberg et al., 2009). Applications in
medical image analysis include the segmentation of healthy brain images into the
three main compartments of GM, WM and CSF (Warfield et al., 2004) or the classification of lung nodules detected in CT images into malignant or benign examples
(Raykar et al., 2010). In the following, we will deal with the task of segmenting
brain tumors from multimodal medical images. Fig. 3.2 gives an impression of the
unreliability of human annotators for this task.
Figure 3.2. – Exemplary segmentations of a real-world brain tumor image by a single
expert, based on different imaging modalities. In the background, an axial FLAIR section
of an astrocytoma patient is displayed. The colored lines are the contours of manual tumor
segmentations that were drawn by a senior radiologist on three different MR scans of the
same slice: namely a T2 -weighted scan (magenta), a gadolinium-enhanced T1 -weighted scan
(blue) and this FLAIR scan (red). The other two scans had been affinely registered to
the FLAIR scan beforehand. Note the volume variability of ca. 400 % between the different
modalities. This chapter deals with the question of what single segmentation should be reported
to summarize this information.
In cases in which only a single label and no additional information is provided about
every training example, one can obviously not do better than treating this label as
the truth. However, if several labels from multiple annotators are available, one can
fuse these (possibly conflicting) votes into a consensus label, which should hopefully be more reliable than every single vote, or even estimate the probabilities for the different possible values of the label. It can be expected that the multiple labelers may differ in their reliability: some may be experts for this task, some novices, some may
be meticulous, some careless, and some may even be malicious as in the adversarial
scenario. Ideally the fusion routine should identify the reliable labelers and assign
their votes a higher weight for the final decision. Or, if objective feature information
about the training example is available (that characterizes each example sufficiently
well), one can check whether a rater consistently gives the same labels to examples
having similar features, which may help one to decide whether he or she assigns the
labels rather randomly or based on the visible image information. In the following,
the previously proposed models for fusing unreliable manual annotations are reformulated in the language of probabilistic graphical models (more precisely Bayesian
networks), which has not been done before (Fig. 3.3). This makes the similarities
and differences between the different approaches clearer and allows the use of generic
inference techniques.
In the STAPLE model proposed by Warfield et al. (2004, Fig. 3.3(a)), the discrete observations snr ∈ {0, 1} are noisy views on the true scores tn ∈ {0, 1}, with
n ∈ {1, . . . , N } indexing the image pixels and r ∈ {1, . . . , R} indexing the raters.
The r-th rater is characterized by the sensitivity γr and the specificity 1 − δr , and
the observation model is snr ∼ tn Ber(γr ) + (1 − tn )Ber(δr ), with “Ber” denoting
a Bernoulli distribution. A Bernoulli prior is given for the true class: tn ∼ Ber(p).
While the original formulation fixes p = 0.5 and uses uniform priors for γr and δr , the
priors were modified in order to fulfil the conjugacy requirements for the chosen variational inference techniques: hence Beta priors are imposed on γr ∼ Beta(ase , bse ),
δr ∼ Beta(bsp , asp ) and p ∼ Beta(ap , bp ). A similar Beta prior was independently
introduced by Commowick & Warfield (2010) in order to use prior knowledge about
the relative quality of different raters: While in the following experiments the same
values of ase , bse , asp , bsp were used for all raters, it would also be possible to give
higher a parameters and lower b parameters to raters who are supposed to be more
reliable.15 The prior on p is introduced in order to learn the share of tumor tissue
among all voxels from the data.
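A minimal EM iteration for this discrete STAPLE model can be sketched as follows. This numpy illustration uses maximum-likelihood point updates with flat priors and a fixed prevalence p, rather than the variational treatment with Beta priors described above, and all data are synthetic:

```python
import numpy as np

def staple_em(s, n_iter=50, p=0.5):
    """s: (N, R) binary votes of R raters on N pixels.
    Returns posterior tumor probabilities w and per-rater
    sensitivities and specificities."""
    N, R = s.shape
    gamma = np.full(R, 0.8)   # initial sensitivities
    delta = np.full(R, 0.2)   # initial false-positive rates (1 - specificity)
    for _ in range(n_iter):
        # E-step: posterior of t_n given the votes and current rater models
        log_t1 = np.log(p) + (s * np.log(gamma) + (1 - s) * np.log(1 - gamma)).sum(1)
        log_t0 = np.log(1 - p) + (s * np.log(delta) + (1 - s) * np.log(1 - delta)).sum(1)
        w = 1.0 / (1.0 + np.exp(log_t0 - log_t1))
        # M-step: reestimate the rater parameters from the soft labels
        gamma = np.clip((w[:, None] * s).sum(0) / w.sum(), 1e-6, 1 - 1e-6)
        delta = np.clip(((1 - w)[:, None] * s).sum(0) / (1 - w).sum(), 1e-6, 1 - 1e-6)
    return w, gamma, 1 - delta

# Two reliable raters and one who votes almost at random:
rng = np.random.default_rng(0)
t = rng.random(200) < 0.3                            # hidden truth
good = t[:, None] ^ (rng.random((200, 2)) < 0.05)    # 5% label noise
bad = rng.random((200, 1)) < 0.5
votes = np.hstack([good, bad]).astype(int)
w, sens, spec = staple_em(votes)
```

The estimated sensitivity of the near-random third rater drops towards chance level, so that his or her votes barely influence the consensus.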
The model by Raykar et al. (2009, Fig. 3.3(b)) is the same as (Warfield et al., 2004)
except for the prior on tn: here the authors assume that a feature vector ϕn is observed at the n-th pixel and that tn ∼ Ber({1 + exp(−w⊤ϕn)}^{-1}) follows a logistic regression model. A Gaussian prior is imposed on w ∼ N(0, λ_w^{-1} I). In contrast to
(Warfield et al., 2004), they obtain a classifier that can be used to predict the tumor
probability on unseen test images, for which one has access to the features ϕn but
not to the annotations snr . One may hypothesize that the additional information of
the features ϕn can help to resolve conflicts: in a two-rater scenario, one can decide
that the rater has less noise who labels pixels with similar ϕn more consistently. In
the modified graphical model formulation, a gamma prior for the weight precision
is added: λw ∼ Gam(aw , bw ). Note that this model can be regarded as a direct
multi-rater generalization of logistic regression (Hastie et al., 2009, Ch. 4).
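The key step that distinguishes this model from pure label fusion is that the logistic prior and the rater votes are combined per pixel. A sketch of this posterior computation (numpy, with invented feature values and rater parameters):

```python
import numpy as np

def fuse_posterior(phi, w, s, gamma, delta):
    """Posterior p(t_n = 1 | phi_n, votes) in a Raykar-style model.

    phi   : (N, d) features, w: (d,) logistic regression weights
    s     : (N, R) binary votes
    gamma : (R,) sensitivities, delta: (R,) false-positive rates
    """
    prior = 1.0 / (1.0 + np.exp(-phi @ w))            # logistic prior on t_n
    lik1 = (gamma ** s * (1 - gamma) ** (1 - s)).prod(1)
    lik0 = (delta ** s * (1 - delta) ** (1 - s)).prod(1)
    return prior * lik1 / ((1 - prior) * lik0 + prior * lik1)

phi = np.array([[1.0, 2.0], [1.0, -2.0]])   # two pixels: bias + one feature
w = np.array([0.0, 1.5])
s = np.array([[1, 1], [1, 0]])              # votes of two raters
post = fuse_posterior(phi, w, s, gamma=np.array([0.9, 0.7]),
                      delta=np.array([0.1, 0.3]))
```

On an unseen test image, for which no votes s are available, only the logistic prior remains, which is exactly the learned classifier.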
15. The mean of a Beta(a, b) distribution is a/(a + b).
Whitehill et al. (2009, Fig. 3.3(c)) propose a model in which the misclassification probability depends on both the pixel and the rater: snr ∼ Ber({1 + exp(−tn αr εn)}^{-1}) with the rater accuracy αr ∼ N(μα, λ_α^{-1}) and the pixel difficulty εn with log(εn) ∼ N(με, λ_ε^{-1}) (this parameterization is chosen to constrain εn to be positive).
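This noise model can be illustrated directly: the probability of a correct label is a logistic function of the product of rater accuracy and inverse pixel difficulty, and a negative accuracy yields an adversarial rater (a small numpy sketch with invented parameter values):

```python
import numpy as np

def p_correct(alpha, eps):
    """Probability that a rater with accuracy alpha labels a pixel
    of inverse difficulty eps > 0 correctly: sigmoid(alpha * eps)."""
    return 1.0 / (1.0 + np.exp(-alpha * eps))

easy, hard = 4.0, 0.2        # eps: large value = easy pixel
expert, novice = 3.0, 0.5    # alpha: rater accuracies
# alpha < 0 models an adversarial rater who is wrong more often than chance.
```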
In the continuous variant of STAPLE by Warfield et al. (2008, Fig. 3.3(d)), the
observations ynr are continuous views on a continuous latent score τn . It is assumed
that the noisy ynr and the true τn give information not only whether a given voxel
is tumor or not, but also how far it is away from the tumor boundary: Commonly
ynr is defined as the signed Euclidean distance function16 of the r-th rater, and τn
hence corresponds to the distance transform of the true tumor segmentation, so that
the tumor contours are the zero-level set of τ . The r-th rater can be characterized
by a bias βr and a noise precision λr: ynr ∼ N(τn + βr, λ_r^{-1}), with a Gaussian prior on the true scores: τn ∼ N(0, λ_τ^{-1}). In the modified graphical model formulation, Gaussian priors on the biases are added, i.e. βr ∼ N(0, λ_β^{-1}). For the precisions of
the Gaussians, gamma priors are used: λτ ∼ Gam(aτ , bτ ), λβ ∼ Gam(aβ , bβ ) and
λr ∼ Gam(aλ , bλ ). Note that when thresholding the continuous scores, the tumor
boundary may shift because of the noise, but misclassifications far away from the
boundary are unlikely: this is an alternative to (Whitehill et al., 2009) for achieving
a non-uniform noise model.
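The signed distance representation used here can be sketched with a brute-force numpy implementation (quadratic time, only for tiny illustrative masks; efficient implementations use the linear-time algorithm cited in footnote 16):

```python
import numpy as np

def signed_edt(mask):
    """Signed Euclidean distance transform of a binary mask:
    negative inside the mask, positive outside. Brute force, only
    intended for tiny illustrative images."""
    mask = mask.astype(bool)
    coords = np.argwhere(np.ones_like(mask))   # all pixel coordinates
    inside = np.argwhere(mask)
    outside = np.argwhere(~mask)

    def min_dist(points, targets):
        # distance from every pixel to the nearest target pixel
        d = np.linalg.norm(points[:, None, :] - targets[None, :, :], axis=2)
        return d.min(axis=1)

    dist_to_inside = min_dist(coords, inside).reshape(mask.shape)
    dist_to_outside = min_dist(coords, outside).reshape(mask.shape)
    # unsigned EDT of the mask minus unsigned EDT of its complement
    return dist_to_inside - dist_to_outside

seg = np.zeros((5, 5), dtype=int)
seg[1:4, 1:4] = 1                # a 3x3 "tumor"
tau = signed_edt(seg)
```

The tumor contour is recovered as the zero-level set of the resulting score map.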
3.4. Modelling and implementation
3.4.1. Novel hybrid models
In addition to the previously proposed latent-class and latent-score models, four
novel hybrid models are introduced, which incorporate all aspects of the previous
proposals simultaneously: while they provide a classifier as in (Raykar et al., 2009),
they do not assume misclassifications to occur everywhere equally likely. In the simplest variant (hybrid model 1, Fig. 3.4(a)), the model from (Warfield et al., 2008) is
modified by a linear regression model for the latent score: τn ∼ N(w⊤ϕn, λ_τ^{-1}) with w ∼ N(0, λ_w^{-1}). Note that this model predicts a (noisy) linear relationship between the distance transform values ynr and the features ϕn, while experimentally the local image appearance saturates in the interior of the tumor or the healthy tissue. To alleviate this concern (hybrid model 2, Fig. 3.4(b)), one can interpret ynr as an unobserved malignancy score, which influences the (observed) binary segmentations snr via snr ∼ Ber({1 + exp(−ynr)}^{-1}). This is a simplified version of the procedure presented in Rogers et al. (2010), with a linear regression model for the latent score instead of a Gaussian process regression. Alternatively one can model the raters as using a biased weight vector rather than having a biased view on an ideal score, i.e. ynr ∼ N(vr⊤ϕn, λ_r^{-1}) with vr ∼ N(w, λ_β^{-1} I). Again the score ynr may be observed directly as a distance transform (hybrid model 3, Fig. 3.4(c)) or indirectly via snr (hybrid model 4, Fig. 3.4(d)).

16. The unsigned Euclidean distance transform of a binary mask I is defined as 0 inside of I, and as the Euclidean distance to the closest point of I outside of I. The signed Euclidean distance transform is the difference of the unsigned distance transforms of I and its complement Ī. Using a modification of Dijkstra's all-pairs shortest path algorithm, these measures can be computed for an entire binary image in a time linear in the number of pixels (Fabbri et al., 2008).

[Figure 3.3: Bayesian network diagrams of (a) Warfield et al. (2004), (b) Raykar et al. (2009), (c) Whitehill et al. (2009) and (d) Warfield et al. (2008).]

Figure 3.3. – Graphical model representations of the previously proposed fusion algorithms, partially with new priors added. Red boxes correspond to factors, circles correspond to observed (gray) and unobserved (white) variables. Some factors are deterministic: "Exp" refers to an exponential function, "ScalProd" to a scalar product, and the + and · factors to addition and multiplication. The "BernoulliFromLogOdds" factor means that the output y is a binary variable sampled from a Bernoulli distribution with parameter (1 + e^{−x})^{-1}, where x is the input of the factor. Solid black rectangles are plates indicating an indexed array of variables (Buntine, 1994). The dashed rectangles are "gates" denoting a mixture model with a hidden selector variable (Minka & Winn, 2009).
[Figure 3.4: Bayesian network diagrams of the four newly proposed hybrid models: (a) hybrid model 1, (b) hybrid model 2, (c) hybrid model 3, (d) hybrid model 4.]
Figure 3.4. – Newly proposed hybrid models: for the explanation of the symbols see the
caption of Fig. 3.3.
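The generative process of hybrid model 1 can be sketched by ancestral sampling, which also yields synthetic test data for the inference routines. All hyperparameter values below are arbitrary illustrative choices; this is not the implementation used later in this chapter:

```python
import numpy as np

rng = np.random.default_rng(42)
N, R, d = 500, 3, 4                      # pixels, raters, features

# Precision hyperparameters (arbitrary illustrative values)
lam_w, lam_tau, lam_beta, lam_r = 1.0, 4.0, 4.0, 25.0

phi = rng.standard_normal((N, d))                   # observed features
w = rng.normal(0.0, lam_w ** -0.5, d)               # regression weights
tau = rng.normal(phi @ w, lam_tau ** -0.5)          # true latent scores
beta = rng.normal(0.0, lam_beta ** -0.5, R)         # per-rater biases
# Observed noisy distance-transform values y_{nr} = tau_n + beta_r + noise:
y = tau[:, None] + beta[None, :] + rng.normal(0.0, lam_r ** -0.5, (N, R))
```

Thresholding τ at zero gives the true segmentation, while each rater's segmentation corresponds to the zero-level set of his or her column of y.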
3.4.2. Inference and implementation
For the graphical models considered here, exact inference by the junction tree algorithm is infeasible, especially for the models that make use of the objective image information: if d denotes the number of features in the vector ϕn for the models that use the features ϕ, and d = 1 for the other models, the treewidth of the graphical models in Figs. 3.3 and 3.4 is given by 2R + d. In the absence of efficient exact
algorithms for treewidth computation, this was found by computing experimental
upper and lower bounds by the approximation techniques presented in (Bodlaender
& Koster, 2010a) and (Bodlaender & Koster, 2010b). The tightest upper and lower
bounds were found to coincide, giving the exact treewidth value.17 However, one
can perform approximate inference using e.g. variational message passing (Winn &
Bishop, 2005): the true posterior for the latent variables is approximated by the closest factorizing distribution (as measured by the Kullback-Leibler distance), for which
inference is tractable. As a prerequisite, all priors must be conjugate; this holds for
all models discussed above except (Whitehill et al., 2009). Since the generic variational message passing scheme cannot be applied to that model, the results from the EM inference algorithm provided by the authors are reported instead.
The INFER.NET 2.3 Beta implementation for variational message passing (Minka
et al., 2009) was employed to perform inference on the algorithms by Warfield et al.
(2004), Warfield et al. (2008), Raykar et al. (2009) and the four hybrid models. The
default value of 50 iteration steps was found to be sufficient for convergence, since
doubling the number of steps led to virtually indistinguishable results. For the algorithm by Whitehill et al. (2009), the GLAD 1.0.2 reference implementation was
used.18 Alternative choices for the generic inference method would have been expectation propagation (Minka, 2001) and Gibbs sampling (Gelfand & Smith, 1990).
We found experimentally that expectation propagation had considerably higher memory requirements than variational message passing for our problems, which prevented its use on the available hardware. Gibbs sampling was not
employed since some of the factors incorporated in our models (namely gates and
factor arrays) are not supported by the current INFER.NET implementation. Note
that these are purely practical reasons: in theory, it would have been possible to use
also these two alternatives.
The results of the graphical models were also compared against three simple baseline
procedures: majority voting, training a logistic regression classifier from the segmentations of every single rater and averaging the classifier predictions (ALR), and
training a logistic regression classifier on soft labels (LRS): if S out of R raters voted
for tumor in a certain pixel, it was assigned the soft label S/R ∈ [0, 1].
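The LRS baseline can be sketched with a small gradient-descent logistic regression on the fractional labels S/R (a numpy-only illustration on synthetic data; the actual experiments are not tied to this particular implementation):

```python
import numpy as np

def fit_lrs(phi, votes, n_steps=500, lr=0.5):
    """Logistic regression on soft labels: if S of the R raters voted
    'tumor' at pixel n, its target is the soft label S/R in [0, 1]."""
    soft = votes.mean(axis=1)                  # S/R per pixel
    w = np.zeros(phi.shape[1])
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-phi @ w))
        grad = phi.T @ (p - soft) / len(soft)  # cross-entropy gradient
        w -= lr * grad
    return w

# Toy data: a bias column plus one informative feature.
rng = np.random.default_rng(1)
x = rng.standard_normal(300)
phi = np.column_stack([np.ones_like(x), x])
t = x > 0.0                                    # hidden truth
votes = (t[:, None] ^ (rng.random((300, 3)) < 0.1)).astype(int)
w = fit_lrs(phi, votes)
pred = 1.0 / (1.0 + np.exp(-phi @ w)) > 0.5
```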
3.5. Experiments
Two experiments were performed in order to study the influences of labeler quality and imaging modality separately. In the first experiment, multiple human annotations of varying quality based on one single imaging modality were collected
and fused: for this task, simulated brain tumor measurements were used, for which
ground truth information about the true tumor extent was available, so that the
17
18
The LibTW library was used for these studies: http://www.treewidth.com/docs/libtw.zip
http://mplab.ucsd.edu/~jake/OptimalLabelingRelease1.0.2.tar.gz
results could be evaluated quantitatively. In the second experiment, multiple human
annotations based on real-world image data were collected and fused, which were
all of high quality, but had been derived from different imaging modalities showing
similar physical changes caused by glioma infiltration with different sensitivity.
3.5.1. Experiments on simulated brain tumor measurements
Tumor simulations Simulated brain tumor MR images were generated by means
of the TumorSim 1.0 software (Prastawa et al., 2009).19 The advantage of these
simulations was the existence of ground truth about the true tumor extent (in form
of probability maps for the distribution of white matter, gray matter, cerebrospinal
fluid, tumor and edema). The final task of the classifiers was to discriminate between
“pathological tissue” (tumor and edema) and “healthy tissue” (the rest). Nine image
volumes were used: three for each tumor class that can be simulated by this software (ring-enhancing, uniformly enhancing and non-enhancing, see Fig. 3.5). Each
volumetric image contained 256 × 256 × 181 voxels, and the three different imaging
modalities (T1 -weighted with and without gadolinium enhancement and T2 -weighted)
were considered perfectly registered with respect to each other. The feature vectors
ϕi consisted of four features for each modality: gray value, gradient magnitude and
the responses of a minimum and maximum filter within a 3 × 3 neighborhood. A
row with the constant value 1 was added to learn a constant offset for the linear
or logistic models (since there was no reason to assume that feature values at the
tumor boundary are orthogonal to the final weight vector).
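The per-voxel feature construction just described (gray value, gradient magnitude, 3 × 3 minimum and maximum filters, plus the constant-offset entry) might look as follows in a numpy-only 2D sketch of a single modality; the thesis experiments used 3D volumes and concatenated the features of all three weightings:

```python
import numpy as np

def local_min_max(img):
    """Minimum and maximum over a 3x3 neighborhood (edge-padded)."""
    p = np.pad(img, 1, mode="edge")
    stack = np.stack([p[i:i + img.shape[0], j:j + img.shape[1]]
                      for i in range(3) for j in range(3)])
    return stack.min(axis=0), stack.max(axis=0)

def pixel_features(img):
    """Per-pixel feature vectors: constant 1 (offset term), gray value,
    gradient magnitude, local minimum and local maximum."""
    gy, gx = np.gradient(img.astype(float))
    grad_mag = np.sqrt(gx ** 2 + gy ** 2)
    lo, hi = local_min_max(img)
    feats = np.stack([np.ones_like(img, dtype=float), img.astype(float),
                      grad_mag, lo.astype(float), hi.astype(float)])
    return feats.reshape(5, -1).T              # (n_pixels, 5)

img = np.zeros((8, 8))
img[2:6, 2:6] = 1.0                            # a bright "lesion"
phi = pixel_features(img)
```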
Justification of linear classification term Linear discrimination models like the
model by Raykar et al. (2009) and the hybrid models are appropriate if the decision
boundaries in the selected feature space can be regarded as linear, i.e. if a linear
classifier can distinguish between pathological (tumor or edema) and healthy (GM /
WM/CSF) features just as well as a state-of-the-art nonlinear classifier. In order
to test this, a preparatory experiment was conducted, in which the ground-truth
values for the tissue probabilities were assumed as known (i.e. no multirater setting). The generalization errors of both a linear classifier (logistic regression) and a
nonlinear classifier (random forest, see section 2.2) were estimated for the task of distinguishing between characteristic pathological and characteristic healthy examples.
“Characteristic” meant that the ground-truth probability for the respective class exceeded 0.98. For the estimation of variances, a twelve-fold cross-validation scheme
was used, so that each of the twelve simulated volumes was selected as test dataset
in some fold, and the remaining eleven simulated volumes were used for training.
19. http://www.sci.utah.edu/releases/tumorsim_v1.0/TumorSim_1.0_linux64.zip
Figure 3.5. – Exemplary slices of the three simulated tumor classes: every column shows an
exemplary simulated brain tumor image slice in the three weightings which can be produced
by the TumorSim 1.0 software, namely T1 -weighting with gadolinium enhancement (top),
T1 -weighting without gadolinium enhancement (middle) and T2 -weighting (bottom). The
left column shows an example of a ring-enhancing tumor, the middle column of a uniformly
enhancing tumor, and the right column of a non-enhancing tumor: this corresponds to
decreasing tumor grade from left to right. Note that the appearance of the three classes
only differs in the Gd-enhanced image; under T1 -weighting all appear as hypointensities, and
under T2 -weighting as hyperintensities.
Logistic regression yielded a sensitivity of 97.8 ± 4.8% and a specificity of 97.2 ± 1.0%
(average F -measure: 97.5%), while the random forest classifier yielded a sensitivity
of 89 ± 16% and a specificity of 99.4 ± 0.6% (average F -measure: 93.9%). Since a
high sensitivity is crucial for tumor detection, this means that linear classifiers (and
especially variants of logistic regression) are superior to nonlinear methods for this
classification task.20
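The cross-validation protocol above can be sketched as follows. This is an illustration only: scikit-learn and a synthetic two-Gaussian surrogate stand in for the "characteristic" voxels of the twelve simulated volumes, so the numbers it prints are not the ones reported in the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_volumes, n_per_volume, n_features = 12, 50, 4

# toy surrogate: class-conditional Gaussians, i.e. a (nearly) linear boundary
X, y, groups = [], [], []
for v in range(n_volumes):
    for label in (0, 1):
        X.append(rng.normal(loc=2.0 * label, size=(n_per_volume, n_features)))
        y.append(np.full(n_per_volume, label))
        groups.append(np.full(n_per_volume, v))
X, y, groups = np.vstack(X), np.concatenate(y), np.concatenate(groups)

def leave_one_volume_out(clf):
    """Mean sensitivity and specificity, one simulated volume held out per fold."""
    sens, spec = [], []
    for v in range(n_volumes):
        train, test = groups != v, groups == v
        pred = clf.fit(X[train], y[train]).predict(X[test])
        sens.append(np.mean(pred[y[test] == 1] == 1))
        spec.append(np.mean(pred[y[test] == 0] == 0))
    return np.mean(sens), np.mean(spec)

sens_lr, spec_lr = leave_one_volume_out(LogisticRegression(max_iter=1000))
sens_rf, spec_rf = leave_one_volume_out(
    RandomForestClassifier(n_estimators=50, random_state=0))
print(f"logistic regression: sensitivity {sens_lr:.3f}, specificity {spec_lr:.3f}")
print(f"random forest:       sensitivity {sens_rf:.3f}, specificity {spec_rf:.3f}")
```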
Justification of feature set choice In an extension of the preliminary experiments
described in the previous paragraph, several combinations of image features were
tested in order to find a feature set that is sufficiently discriminative between healthy
and pathological tissue in the ideal case that reliable labels are given. Table 3.1 shows
the different features that were tried, while Fig. 3.6 shows the resulting sensitivities
and specificities. The final choice fell on four features per image weighting (gray
value, gradient, local minimum and local maximum): using fewer features would
have impaired the classification specificity (Fig. 3.6(b)), while using more features
would have given no additional improvements and would have increased the memory
requirements.
Feature                            Length   Binary flag
Gradient magnitude                    1          1
2D Hessian eigenvalues                2          2
2D structure tensor eigenvalues       2          4
Local entropy (3 × 3)                 1          8
Local maximum & minimum (3 × 3)       2         16
Table 3.1. – Image features that were tested in order to find an optimal feature set for
linear classification. Additionally the image gray values were part of each tentative feature
set. While some features are scalars, others comprise several values: this is encoded in the
column “Length”. The final column gives the binary flag by which the features are encoded
in Figs. 3.6(a) and 3.6(b). The mask size used for the computation of the local entropy,
maximum and minimum is indicated in parentheses.
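The bit-vector encoding in the final column can be made concrete with a small decoding helper; the feature names follow Table 3.1, and the gray values, which belong to every set, carry no flag.

```python
# bit-vector encoding of Table 3.1 (the image gray values are part of every
# tentative feature set and are therefore not encoded)
FEATURE_FLAGS = {
    1: "gradient magnitude",
    2: "2D Hessian eigenvalues",
    4: "2D structure tensor eigenvalues",
    8: "local entropy (3x3)",
    16: "local maximum & minimum (3x3)",
}

def decode_feature_set(code):
    """Return the feature names selected by a bit-vector code between 0 and 31."""
    return [name for flag, name in FEATURE_FLAGS.items() if code & flag]

print(decode_feature_set(11))  # 11 = 1 + 2 + 8
print(decode_feature_set(0))   # empty list: gray values only
```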
Label acquisition The image volumes were segmented manually based on hypointensities in the T1 -weighted images, using the manual segmentation functionality of the ITK-SNAP 2.0 software.21 In order to control the rater precision, time
limits of 60, 90, 120 and 180 seconds for labeling a 3D volume were imposed and five
segmentations were created for each limit: one can expect the segmentations to be
precise for generous time limits, and to be noisy when the rater had to label very
20 Obviously linear decision boundaries can also be learned using a nonlinear classifier. However,
for a limited amount of training data (i.e. for all practical purposes), linear classifiers will give
superior classification accuracy if the decision boundary is (approximately) linear, as they are less
prone to overfitting to noise in the data. As a rule, restrictive classifiers that make assumptions
about the data are superior to more general classifiers if the assumptions actually hold in practice.
21 http://www.itksnap.org/pmwiki/pmwiki.php?n=Main.Downloads
[Figure 3.6: box plots of (a) Sensitivity and (b) Specificity, plotted against the feature set representation (0–31).]
Figure 3.6. – Sensitivities and specificities for logistic regression on simulated brain tumor
imagery using different feature subsets, when trained with randomly sampled characteristic
examples for healthy and pathological tissues. Ground truth labels are provided to the
classifier for this purpose. Each selected feature was computed for all three modalities, i.e.
T1 -weighting with and without gadolinium-enhancement, and T2 weighting. Furthermore,
the image gray values were part of each feature set (and the only elements of the set with
the label “0”). A cross-validation scheme is used to estimate the spread of the values that
is visualized by the box plots (see the text for further details). The x label numbers encode
the feature set composition (bit vector representation, see Table 3.1): e.g. 11 = 1 + 2 + 8
corresponds to the set containing gradient, Hessian eigenvalues and entropy filter responses.
fast. The set of raters was the same for the different time constraints, and the other
experimental conditions were also kept constant across the different time constraints.
This expectation was statistically validated: the area under curve value of the receiver operating
characteristic of the ground-truth probability maps compared against the manual
segmentations showed a significant positive trend with respect to the available time
(p = 1.8 × 10⁻⁴, F-test for a linear regression model). Since tight time constraints
are typical in clinical routine, this setting was considered realistic, although
it does not account for rater bias.
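The trend test can be reproduced in outline with scipy: for simple linear regression, the F-test on the model is equivalent to the two-sided t-test on the slope reported by `linregress`. The AUC values below are fabricated for illustration (five segmentations per time limit, as in the experiment); only the protocol, not the numbers, matches the text.

```python
import numpy as np
from scipy.stats import linregress

time_limits = np.repeat([60, 90, 120, 180], 5)    # seconds per 3D volume
rng = np.random.default_rng(1)
# fabricated AUC values with a mild positive trend plus rater noise
auc = 0.80 + 0.0008 * time_limits + rng.normal(0, 0.01, time_limits.size)

result = linregress(time_limits, auc)
print(f"slope = {result.slope:.5f}, p-value = {result.pvalue:.2e}")
```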
The slices with the highest amount of tumor lesion were extracted and partitioned
into nine data subsets in order to estimate the variance of segmentation quality
measures, with each subset containing one third of the slices extracted from three
different tumor datasets (one for each enhancement type). Due to memory restriction, the pixels labeled as “background” by all raters were randomly subsampled to
reduce the sample size. A cross-validation scheme was used to test the linear and
log-linear classifiers (all except those by Warfield et al. (2004), Warfield et al. (2008)
and Whitehill et al. (2009)) on features ϕn not seen during the training process: the
training and testing process was repeated nine times, and each of the data subsets
was chosen in turn as the training dataset (and two different subsets as the test
data).
Choice of prior parameters The following default values for the prior parameters
were used: aSe = 10, bSe = 2, aSp = 10, bSp = 2, aw = 2, bw = 1, ap = 2, bp = 2,
aτ = 2, bτ = 1, aβ = 2, bβ = 1, aλ = 2, bλ = 1. Additional experiments verified that
inference results changed only negligibly when these hyperparameters were varied
over the range of a decade. In order to check the effect of the additional priors that
were introduced into the models of Warfield et al. (2004), Warfield et al. (2008) and
Raykar et al. (2009), additional experiments were run with exactly the same models
as in the original papers (by fixing the corresponding variables or using uniform
priors). However, this led to uniformly worse inference results than in the modified
model formulations as described in section 3.3.2.
3.5.2. Experiments on real brain tumor measurements
For evaluation on real-world measurements, a set of twelve multimodal MR volumes
acquired from glioma patients (T1 -, T2 -, FLAIR- and post-gadolinium T1 -weighting)
was used. All images had previously been affinely registered to the FLAIR volume by
an automated multi-resolution mutual information registration procedure as included
in the MedINRIA22 software. Manual segmentations of pathological tissue (tumor
and edema) were provided separately for every modality on 60 slices extracted from
these volumes (20 axial, 20 sagittal and 20 coronal slices, each intersecting
with the tumor center). In these experiments, the described models are used to
infer a single probability map summarizing all tumor-induced changes in the different
imaging modalities. In particular, every modality is identified with a separate “rater”
with a specific and consistent bias with respect to the joint probability map inferred.
3.6. Results
3.6.1. Simulated brain tumor measurements
Several scenarios (i.e. several compositions of the rating committee) were studied,
which all gave qualitatively similar results for the accuracies of the different models, irrespective of whether “good” raters or “poor” raters were in the majority.
Results are exemplarily reported for the 120/120/90 scenario (i.e. two raters with
22 https://gforge.inria.fr/projects/medinria
Method                    Specificity   Sensitivity   CCR         AUC         Dice
Majority vote             .987(007)     .882(051)     .910(032)   .972(008)   .827(020)
ALR                       .953(018)     .920(036)     .931(025)   .981(005)   .855(031)
LRS                       .953(019)     .919(037)     .931(025)   .981(005)   .855(030)
Warfield et al. (2004)    .987(007)     .882(051)     .910(032)   .972(008)   .827(020)
Warfield et al. (2008)    1.000(001)    .617(130)     .692(139)   .989(003)   .584(211)
Raykar et al. (2009)      .988(006)     .886(045)     .913(028)   .993(003)   .830(024)
Whitehill et al. (2009)   .988(004)     .913(016)     .931(008)   .980(003)   .845(063)
Hybrid model 1            .940(078)     .692(060)     .751(070)   .902(117)   .603(191)
Hybrid model 2            .972(019)     .716(048)     .770(057)   .953(015)   .628(163)
Table 3.2. – Evaluation statistics for the training data (i.e. the manual annotations of the
raters were used for inference), under the 120/120/90 scenario. The first three rows show the
outcome of the three baseline techniques. The best result in each column is marked in italics,
while bold numbers indicate a significant improvement over the best baseline technique (p <
.05, rank-sum test with multiple-comparison adjustment). Estimated standard deviations are
given in parentheses. The outcome of the other scenarios was qualitatively similar (especially
concerning the relative ranking between different inference methods). ALR = Averaged
logistic regression. LRS = Logistic regression with soft labels. CCR = Correct classification
rate (percentage of correctly classified pixels). AUC = Area Under Curve of the receiver
operating characteristics curve obtained when thresholding the ground-truth probability map
at 0.5. Dice = Dice coefficient of the segmentations obtained when thresholding both the
inferred and the ground-truth probability map at 0.5.
a 120 sec constraint and one rater with a 90 sec constraint). Tables 3.2 and 3.3
show the results of various evaluation statistics both for training data (for which
the human annotations were used) and test data. Sensitivity, specificity, correct
classification rate (CCR) and Dice coefficient are computed from the binary images
that are obtained by thresholding both the ground-truth probability map and the
inferred posterior probability map at 0.5. If nfb denotes the number of pixels that
are thereby classified as foreground (tumor) in the ground truth and as background
in the posterior probability map (and nbb , nbf and nff are defined likewise), these
statistics are computed as follows:
Sensitivity = nff / (nfb + nff),        Specificity = nbb / (nbf + nbb),
CCR = (nff + nbb) / (nff + nbb + nbf + nfb),        Dice = 2 nff / (2 nff + nbf + nfb).
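A direct implementation of these statistics from two probability maps thresholded at 0.5, with the counts nff, nfb, nbf and nbb as defined above (the toy maps are made up for illustration):

```python
import numpy as np

def evaluation_statistics(p_true, p_inferred, threshold=0.5):
    gt = p_true > threshold        # foreground (tumor) in the ground truth
    pred = p_inferred > threshold  # foreground in the posterior map
    n_ff = np.sum(gt & pred)       # true positives
    n_fb = np.sum(gt & ~pred)      # false negatives
    n_bf = np.sum(~gt & pred)      # false positives
    n_bb = np.sum(~gt & ~pred)     # true negatives
    return {
        "sensitivity": n_ff / (n_fb + n_ff),
        "specificity": n_bb / (n_bf + n_bb),
        "ccr": (n_ff + n_bb) / (n_ff + n_bb + n_bf + n_fb),
        "dice": 2 * n_ff / (2 * n_ff + n_bf + n_fb),
    }

p_true = np.array([0.9, 0.8, 0.1, 0.2, 0.7, 0.1])
p_inf = np.array([0.8, 0.4, 0.2, 0.1, 0.9, 0.6])
stats = evaluation_statistics(p_true, p_inf)
print(stats)
```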
Method                  Sensitivity   Specificity   CCR         AUC         Dice
ALR                     .937(017)     .924(038)     .928(029)   .978(009)   .837(065)
LRS                     .936(017)     .925(038)     .928(029)   .978(009)   .837(066)
Raykar et al. (2009)    .927(019)     .937(031)     .936(025)   .977(013)   .853(038)
Hybrid model 1          .851(152)     .735(181)     .760(167)   .852(172)   .619(142)
Hybrid model 2          .973(013)     .727(174)     .786(116)   .952(026)   .667(084)
Table 3.3. – Evaluation statistics for the test data (i.e. the manual annotations of the raters
were not used for inference), under the 120/120/90 scenario. Note that one can only employ
the inference methods which make use of the image features ϕn and estimate a weight vector
w: the unobserved test data labels are then treated as missing values and are marginalized
over. All methods which only use the manual annotations (majority voting, and the methods
by Warfield et al. (2004) and Warfield et al. (2008)) cannot be applied to these examples. The
results for the other scenarios were qualitatively similar (especially concerning the relative
ranking between different inference methods). Cf. the caption of table 3.2 for further details.
Additionally, Area Under Curve (AUC) values are reported for the receiver operating characteristic
curve obtained by binarizing the ground-truth probabilities with a fixed threshold of
0.5 and plotting sensitivity against 1 − specificity while the threshold for the posterior
probability map is swept from 0 to 1. Most methods achieve Dice coefficients in the
range of 0.8–0.85, except for the models operating on a continuous score (the hybrid
models and the model by Warfield et al. (2008)). Since the chosen features are
highly discriminative, even simple label fusion schemes such as majority voting give
highly competitive results. Qualitatively, there is little difference between these two
scenarios (and the other ones under study). While some graphical models perform
better than the baseline methods on the training data (namely (Raykar et al., 2009)
and (Warfield et al., 2008)), they bring no improvement on the test data.
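The AUC computation described above (ground truth binarized at a fixed 0.5, threshold on the inferred map swept from 0 to 1) can be sketched as follows; the probability maps here are synthetic stand-ins, not the thesis data.

```python
import numpy as np

def roc_auc(p_true, p_inferred, n_thresholds=101):
    gt = p_true > 0.5                              # binarized ground truth
    tpr, fpr = [], []
    for t in np.linspace(1.0, 0.0, n_thresholds):  # sweep threshold high -> low
        pred = p_inferred >= t
        tpr.append((pred & gt).sum() / gt.sum())          # sensitivity
        fpr.append((pred & ~gt).sum() / (~gt).sum())      # 1 - specificity
    tpr, fpr = np.array(tpr), np.array(fpr)
    # trapezoidal integration of sensitivity over (1 - specificity)
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

rng = np.random.default_rng(2)
p_true = rng.uniform(size=1000)                    # synthetic ground-truth map
p_good = np.clip(p_true + rng.normal(0, 0.1, 1000), 0, 1)  # informative map
auc_good = roc_auc(p_true, p_good)
auc_chance = roc_auc(p_true, rng.uniform(size=1000))
print(f"informative map: AUC = {auc_good:.3f}; chance: AUC = {auc_chance:.3f}")
```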
Unexpectedly, the hybrid models perform worse and with lesser stability than the
simple graphical models, and for hybrid models 3 and 4, the inference converges to
a noninformative posterior probability of 0.5 everywhere. It should be noted that
the posterior estimates of the rater properties do not differ considerably between
corresponding algorithms such as (Warfield et al., 2008) and (Raykar et al., 2009),
hence the usage of image features does not allow one to distinguish between better
and poorer raters more robustly.
In order to account for partial volume effects and blurred boundaries between tumor
and healthy tissue, it is preferable to visualize the tumors as soft probability maps
rather than as crisp segmentations. In Fig. 3.7, the ground-truth tumor probabilities
are compared with the posterior probabilities following from the different models.
The models assuming a latent binary class label (i.e. those by Warfield et al. (2004);
Raykar et al. (2009); Whitehill et al. (2009)) tend to sharpen the boundaries between
[Figure 3.7: six normalized 2D histogram panels of inferred posterior probability (ordinate) against true posterior probability (abscissa): (Warfield et al., 2004) and (Warfield et al., 2008) under 120/120/90, (Warfield et al., 2008) under 60/60/60/180/180, and hybrid models 1–3 under 120/120/90.]
Figure 3.7. – Comparison of ground-truth (abscissa) and inferred posterior (ordinate) tumor
probabilities for simulated brain tumor images, visualized as normalized 2D histograms. All
histograms are normalized such that empty bins are white, and the most populated bin is
drawn black. We show the inference results of (Warfield et al., 2004), (Warfield et al., 2008),
and the hybrid models 1–3. The results of hybrid model 4 were similar to hybrid model 3,
and the results of (Raykar et al., 2009) and (Whitehill et al., 2009) were similar to (Warfield
et al., 2004). Most models gave similar results when the composition of the rater committee
was altered, with the exception of (Warfield et al., 2008): Unexpectedly, this model gave
slightly worse results for a scenario with a majority of better raters (e.g. 120/120/90, top
middle) than for a scenario with a majority of poorer raters (e.g. 60/60/60/180/180, top
right). For the ideal inference method, all bins outside the main diagonal would be white;
Warfield et al. (2004) comes closest.
tumor and healthy tissue excessively, while the latent score models (all others) smooth
them. One can again note that the true and inferred probabilities are completely
uncorrelated for hybrid model 3 (and 4).
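The normalization used in Fig. 3.7 (empty bins white, most populated bin black) corresponds to scaling each 2D histogram by its maximum, for example with np.histogram2d; the probability maps below are synthetic.

```python
import numpy as np

def normalized_2d_histogram(p_true, p_inferred, bins=20):
    """2D histogram scaled so the most populated bin is 1 (drawn black)
    and empty bins are 0 (drawn white)."""
    hist, _, _ = np.histogram2d(p_true, p_inferred, bins=bins,
                                range=[[0.0, 1.0], [0.0, 1.0]])
    return hist / hist.max()

rng = np.random.default_rng(3)
p_true = rng.uniform(size=5000)
p_inferred = np.clip(p_true + rng.normal(0, 0.05, 5000), 0, 1)
h = normalized_2d_histogram(p_true, p_inferred)
# for a good inference method the mass concentrates on the main diagonal
band = np.tril(np.triu(h, -1), 1)          # bins with |i - j| <= 1
print(f"fraction of histogram mass within one bin of the diagonal: "
      f"{band.sum() / h.sum():.2f}")
```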
3.6.2. Real brain tumor measurements
Optimally delineating tumor borders in multi-modal image sequences and obtaining ground truth remain difficult; hence, in the present study only a qualitative
comparison of the different models is undertaken. Fig. 3.8 shows the posterior probability maps for a real-world brain image example. The results of the methods by
(Warfield et al., 2004) and (Warfield et al., 2008) can be regarded as extreme cases:
the former yields a crisp segmentation without accounting for uncertainty near the
Figure 3.8. – Example of a FLAIR slice with manual segmentation of tumor drawn on the
same FLAIR image (white contour), and inferred mean posterior tumor probability maps for
(Warfield et al., 2004) (top left), Warfield et al. (2008) (top right), (Whitehill et al., 2009)
(bottom left) and hybrid model 2 (bottom right). The results of hybrid model 3 and 4 were
nearly identical to (Warfield et al., 2008), the results of hybrid model 1 to model 2, and the
results of (Raykar et al., 2009) to (Whitehill et al., 2009). Tumor probabilities outside the
skull were set to 0.
tumor borders, while the latter assigns a probability near 0.5 to all pixels and is hence
inappropriate for this task. Hybrid model 1 (or 2) and the methods by (Whitehill
et al., 2009) or (Raykar et al., 2009) are better suited for the visualization of uncertainties.
Chapter 4.
Live-cell microscopy image analysis for
the study of zebrafish embryogenesis
4.1. Introduction and motivation
Digital Scanned Laser Light Sheet Fluorescence Microscopy (DSLM; Keller & Stelzer,
2008) is a recent live-cell imaging technique which provides unprecedented spatiotemporal resolution and signal-to-noise ratio at low energy load. This makes it
an excellent tool for in-vivo studies of embryonic development at a cellular level:
in particular, it allows one to determine the detailed fate of each single cell, its
motion, divisions and in some cases eventual death, to construct a digital model of
embryonic development (also called a “digital embryo”) and to extract a cell lineage
tree showing the ancestry and progeny of each cell. However, the huge number of
images that are produced (due to the high spatio-temporal resolution) can no longer
be analyzed manually: hence automated image processing methods are required in
order to extract the biologically relevant information out of the raw image data.
This chapter describes two contributions to an image processing pipeline that shall
eventually be used for high-throughput analysis of nucleus-labeled DSLM imagery.1
The whole pipeline consists of the following parts:
Segmentation After interpolating the image stack in the z direction (so that all
voxels are roughly isotropic), cell nuclei are segmented in a three-stage scheme developed by Lou et al. (2011b): Firstly, foreground seeds are generated by identifying
local maxima (i.e. points where all eigenvalues of the Hessian are negative) that occur robustly across several levels in scale-space and refining them via morphological
closing and opening. These seeds serve as automatically generated foreground labels
for a random forest classifier, while blurred watersheds between the basins flooded
from the foreground seeds serve as background labels. The final segmentation is
1 Parts of this chapter form part of (Lou et al., 2011a).
obtained by solving a discrete energy minimization problem via the graph cut algorithm (Boykov et al., 2001): the energy function incorporates single-site potentials
(the classifier log-posterior probabilities) as well as higher-order terms corresponding
to smoothness and shape priors, and flux priors guarding against the shrinking
bias by which graph cut segmentation is commonly affected. For encoding the shape
assumptions, a multi-object generalization of the gradient vector flow proposed by
Kolmogorov & Boykov (2005) is used.
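A heavily simplified 2D sketch of the seed-generation idea (not the authors' implementation): maxima that persist across several Gaussian scales are kept and refined morphologically. The maximum filter is a cheap stand-in for the "all Hessian eigenvalues negative" criterion, and the dilation step is added here only so the toy single-pixel seeds survive the opening.

```python
import numpy as np
from scipy import ndimage

def foreground_seeds(image, scales=(1.0, 2.0, 4.0), min_scales=2):
    votes = np.zeros(image.shape, dtype=int)
    for s in scales:
        smoothed = ndimage.gaussian_filter(image, sigma=s)
        # a pixel equal to the maximum of its 3x3 neighborhood is a local
        # maximum of the smoothed image at this scale
        votes += smoothed == ndimage.maximum_filter(smoothed, size=3)
    seeds = votes >= min_scales                            # robust across scales
    seeds = ndimage.binary_dilation(seeds, iterations=2)   # give seeds some extent
    seeds = ndimage.binary_closing(seeds)                  # morphological refinement
    return ndimage.binary_opening(seeds)

# toy image: two Gaussian blobs ("nuclei") plus noise
rng = np.random.default_rng(4)
y, x = np.mgrid[0:64, 0:64]
img = (np.exp(-((x - 20) ** 2 + (y - 20) ** 2) / 20.0)
       + np.exp(-((x - 45) ** 2 + (y - 45) ** 2) / 20.0)
       + 0.05 * rng.normal(size=(64, 64)))
seeds = foreground_seeds(img)
print("seed pixels found:", int(seeds.sum()))
```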
For several reasons, this is the hardest as well as the most crucial step of the pipeline.
All later stages assume that the true nuclei form a subset of these segments: while
some segments may later be discarded as misdetections, true nuclei that are missed
cannot be recovered again. Hence the quality requirements for the segmentation are
very high; in particular, the sensitivity should be close to 100 %, while a smaller
specificity can be tolerated. For the same reasons, it should ideally never occur that
two distinct nuclei are erroneously merged (undersegmentation), while the opposite
case of oversegmenting one single nucleus into two segments can later be handled by
discarding one of these segments. Further impeding factors are (see Fig. 4.1):
1. the high variability of nucleus brightness,
2. the inhomogeneous illumination of the images with characteristic striped artifacts, which are probably due to a combination of drifts in the illuminating
laser intensity, and the linear scanning order (see section 4.2.2),
3. the presence of high-intensity speckles that can easily be mistaken for nuclei,
4. the varying texture and sometimes low contrast of the nuclei to be segmented,
5. the leakage of fluorescent dye into the cytoplasm as well as
6. the weak boundaries between neighboring nuclei, which bring a high risk of
undersegmentation.
Besides finding the correct number and positions of nuclei, segmenting the correct
size of the nuclei is an additional challenge, and many state-of-the-art segmentation
methods are prone to shrinkage.
The first contribution of this chapter (section 4.4) is an experimental comparison of
the segmentation scheme detailed above with the results obtained with a recently
introduced interactive segmentation software (Sommer et al., 2010).
Figure 4.1. – Exemplary slice of a DSLM zebrafish image. The red rectangles mark areas
where the different challenges of the data can best be illustrated: highly varying nucleus
brightness (1a and 1b), striped illumination inhomogeneities (2), speckles which often occur
close to real nuclei (3), low contrast (4), presence of fluorescent markers in the cytoplasm
(5), weak boundaries between adjacent nuclei (6).
Feature extraction Connected component labeling is used to transform the binary
image generated by the segmentation step into a list of individual nucleus objects.
The individual objects are efficiently stored in a dictionary of keys-based2 sparse
matrix representation, and the segmented nucleus candidates are characterized by
different features. These may be:
• Geometrical features such as the center of mass position (i.e. the intensity-weighted average position of the segment), the volume, the side lengths of the
smallest bounding box around the segment or the principal components of the
segment (i.e. the semiaxis lengths of an ellipsoid that is fitted to the intensity
distribution).
• Intensity distribution features, i.e. the leading central moments
(mean, variance, skew, kurtosis), the maximum and minimum, and the quartiles
of the intensity distribution inside the segment.
• Texture features: for characterizing texture properties, the statistical geometric features (SGF) by Walker & Jackway (1996) are used. They are computed by binarizing the gray value images inside each segment at different
thresholds, extracting intermediate features on each binary image (e.g. the average squared distance of the connected component centers from the center of gravity) and aggregating statistics (such as the mean or standard deviation) over all intermediate features, which are then used as the final features.
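A minimal sketch of the per-object feature computation with scipy.ndimage (the SGF texture features and the dictionary-of-keys storage are omitted for brevity; the toy segmentation is hypothetical):

```python
import numpy as np
from scipy import ndimage
from scipy.stats import skew, kurtosis

def nucleus_features(binary, intensity):
    labels, n_objects = ndimage.label(binary)      # connected components
    feats = []
    for obj_id, sl in enumerate(ndimage.find_objects(labels), start=1):
        mask = labels[sl] == obj_id
        vals = intensity[sl][mask]
        feats.append({
            "center_of_mass": ndimage.center_of_mass(intensity, labels, obj_id),
            "volume": int(mask.sum()),
            "bbox_side_lengths": tuple(s.stop - s.start for s in sl),
            "mean": float(vals.mean()), "variance": float(vals.var()),
            "skew": float(skew(vals)), "kurtosis": float(kurtosis(vals)),
            "quartiles": tuple(np.percentile(vals, [25, 50, 75])),
        })
    return feats

binary = np.zeros((8, 8), dtype=bool)
binary[1:4, 1:4] = True                            # first toy "nucleus"
binary[5:7, 5:8] = True                            # second toy "nucleus"
intensity = np.arange(64, dtype=float).reshape(8, 8)
feats = nucleus_features(binary, intensity)
for f in feats:
    print(f["volume"], f["bbox_side_lengths"])
```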
Cell tracking In order to efficiently track the large number of nuclei over time,
the jointly optimal association of nuclei is found for every pair of consecutive time
frames. The tracking algorithm is the second contribution of this chapter: hence it
is described in detail in section 4.5.
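As a hedged illustration of pairwise frame-to-frame association (the actual algorithm is described in section 4.5), nuclei of two consecutive frames can be matched by minimizing the summed squared centroid distances with the Hungarian algorithm; the centroids below are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

# toy centroids of segmented nuclei in two consecutive frames (hypothetical)
centroids_t0 = np.array([[10.0, 10.0], [30.0, 30.0], [50.0, 10.0]])
centroids_t1 = np.array([[31.0, 29.0], [11.0, 12.0], [49.0, 11.0]])

cost = cdist(centroids_t0, centroids_t1, "sqeuclidean")
rows, cols = linear_sum_assignment(cost)   # jointly optimal association
for i, j in zip(rows, cols):
    print(f"nucleus {i} at time t -> nucleus {j} at time t+1")
```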
Interactive visualization The results are interactively visualized by a software tool
called Visbricks, which is based on the OpenSceneGraph 3D computer graphics library.3 It offers the following capabilities:
• Visualization of all segmented nuclei in a given subvolume by their center-of-mass positions along with the principal component semiaxes or by volume
rendering with smooth shading.
2 The dictionary of keys representation describes a sparse matrix as a dictionary, with the keys being the row/column index tuples and the values being the nonzero entries of the matrix.
3 http://www.openscenegraph.org
• Validation of individual nuclei by showing the cross-section of a selected nucleus
across the plane defined by the leading principal components together with the
segmentation isocontour.
• Visualization of the 3D trajectories of individual cells and their progeny over
time.
• Synchronized display of the raw image data, nucleus segments and the cell
lineage tree topology.
4.2. Background
4.2.1. The zebrafish Danio rerio as a model for vertebrate
development
The zebrafish (Danio rerio) is a popular aquarium fish that has become one of the
classical model organisms for vertebrate development, along with the Japanese ricefish (Oryzias latipes), the African clawed frog (Xenopus laevis), the chicken (Gallus
gallus domesticus) and the mouse (Mus musculus). Due to the transparency of its
embryos during their first 36 hours of development and its nearly constant size during
the first 16 hours, it is particularly well-suited to in-vivo imaging studies.
In contrast to invertebrate model organisms such as the nematode Caenorhabditis
elegans, the development of zebrafish embryos has no stereotypical course, and even
genetically identical specimens may develop asynchronously. However, the usual
development under optimal incubation conditions (28.5 °C) can be roughly divided
into the following eight periods (Kimmel et al., 1995):
Zygote During the first 45 minutes p.f.,4 cytoplasm streams to the animal pole,
where the nucleus is located: there it forms the so-called blastodisc. Meanwhile
the yolk mass remains at the vegetal pole. At the animal pole, the fertilized egg
undergoes its first mitotic division.
Cleavage From 45 to 145 minutes p.f., the second to seventh mitotic divisions occur
rapidly (at 15 minute intervals), in which all cells in the embryo divide synchronously.
However, the cell cleavage is not complete, and the cells are still connected by cytoplasmic bridges. At the end, the 64-cell stage is reached, and the cells are
arranged in three regular layers.
4 post fertilisationem, i.e. after fertilization.
Blastula During the next three hours (2.25 – 5.25 hours p.f.), the synchrony of
cell cycles is gradually lost, and the average cell cycle duration increases. The cell
arrangement also loses its regularity. The cell cycles 8 and 9 are still rapid and metasynchronous (i.e. the cells divide at nearly the same time), while the subsequent
cell cycles are longer (up to 60 min) and asynchronous. From this stage on, the
cell cleavage is always complete and there are no cytoplasmic bridges connecting
adjacent cells. The cells in the lowest layer, which neighbor the yolk, lose
their integrity and release their cytoplasm and nuclei into the yolk: the yolk syncytial
layer5 arises, in which the nuclei still undergo mitosis, which is however not accompanied by a division of the cytoplasm. In the second half of the blastula period,
epiboly sets in: both the blastodisc and the yolk syncytial layer thin and spread over
the yolk sac, which is roughly halfway engulfed at the end of this stadium (50 %
epiboly).
Gastrula This stage lasts from 5.25 to 10 hours p.f., during which epiboly is
completed (at 100 % epiboly, the yolk sac is fully enclosed by the embryo). In parallel,
a thickened region (the germ ring) appears around the rim of the blastodisc, and
cells accumulate at one particular position along this ring, the embryonic shield.6
The germ ring consists of two germ layers, the epiblast and the hypoblast, with
cells moving from the epiblast down into the interior of the embryo (towards the
hypoblast). As the embryonic shield marks the later dorsal side of the embryo, this
is the first time when the final embryonic axes can be discerned. Near the posterior
end of the embryo, the tail bud starts to develop.
Segmentation From 10 up to 24 hours p.f., the tail extends further from
the tail bud. Along the anteroposterior axis, somites (i.e. primitive body segments)
appear sequentially, which will later form the segments of the vertebral column as
well as the associated muscles. Also, along this axis the notochord is formed, which
induces neurulation: a ridge in the epiblast develops into the neural tube, which is
segmented into neuromeres: these develop into the central nervous system, i.e. the
brain and the spinal cord. Motor axons grow out from the neuromeres towards
the muscle precursors in the somites. This is also the period when the first body
movements start. Rudiments of the kidneys and the eyes appear. In the head, the
pharyngeal arches appear, which will later evolve into the gills and the jaws.
5 The yolk syncytial layer is considered an extraembryonic tissue; it is unique to teleosts (bony fishes).
6 This process is also called involution.
Pharyngula During the second day of development (24 – 48 hours p.f.), the body
axis (which has hitherto been curved) starts to straighten. The circulatory system
begins to develop, as well as the liver, the swim bladder and the gut tract, and around
36 hours all primary organ systems are present. During the end of the segmentation
period and the beginning of the pharyngula period (16 – 32 hours p.f.), the embryo
also experiences a rapid growth phase, in which it grows from its initial size of 1 mm
to nearly 3 mm. Pigmentation sets in and the fins start to develop: Due to the
pigmentation and the rapid growth, this is the time point where the organism can
no longer easily be studied by live microscopy.
Hatching During the third day of development (48 – 72 hours p.f.), the morphogenesis of most organ systems except for the gut is completed. The gills and jaws
are formed from the pharyngeal arches, and cartilage develops in the head and the
fins. Sometime in this period, the larva hatches out of the chorion, in which it has
been confined up to this point.
Early larva By the third day, the morphogenesis of the larva has been completed
and the shape of the body and its organs stays mostly constant from then on. The
swim bladder inflates, and the larva begins autonomous swimming and feeding movements. The larva eventually grows from its size of 3.5 mm (after hatching) to its
final size of 4 cm, and reaches sexual maturity after 12 weeks.
4.2.2. Digital scanned laser light-sheet fluorescence microscopy
(DSLM)
Fluorescence microscopy using GFP Fluorescence microscopy is a microscopy
technique which detects the structures of interest by coupling them with fluorescent
molecules and recording their light emission: since light emission occurs at a longer
wavelength than the absorption of the illumination light by which the fluorophores
are excited, wavelength-selective filters can be used to suppress the illumination background. Arguably the most important fluorophore in biology is the green fluorescent
protein (GFP) of the jellyfish Aequorea victoria, which absorbs blue light at 395 nm
and emits green light at 509 nm (Chalfie et al., 1994). By fusing the GFP gene with
the gene for the histone protein H2B, and transferring this fusion gene to a living
cell via mRNA injection, one can fluorescently label the chromatin within the nuclei
of a cell and its daughter cells after mitosis (Kanda et al., 1998).
Light-sheet-based fluorescence microscopy Conventional fluorescence microscopy techniques use a single lens for illuminating the specimen and for gathering
the emitted light: hence the whole specimen is flooded in light, even if only the focal
plane is currently imaged. This poses problems due to phototoxicity and photobleaching: illuminated fluorophores may cause the death of the cells expressing them
(possibly due to the formation of reactive oxygen radicals), and the fluorophores
themselves may be destroyed after prolonged exposure due to chemical reactions and
covalent bond formation while in the excited state. This limits the applicability
of fluorescence imaging, especially to time-lapse imaging series in which images are
taken at regular intervals over a long time. Light-sheet-based fluorescence microscopy
(LSFM) alleviates this problem by selectively illuminating only the focal plane that
is currently imaged: for this purpose, two separate lenses are used for illumination
and collecting the emitted light (such that the optical axes are perpendicular to each
other), and a thin light sheet (formed by apertures) is used for illumination instead
of flooding the entire specimen (Reynaud et al., 2008).
Digital scanned laser light-sheet fluorescence microscopy DSLM (Keller et al.,
2008; Keller & Stelzer, 2008) is a variant of LSFM, in which a laser beam sequentially
illuminates the entire specimen by a raster scan, thereby enabling 3D imaging. The
advantages of using a laser instead of apertures for the light-sheet formation are an
improved image quality (due to reduced optical aberrations), an increased intensity
homogeneity and an increased illumination efficiency (due to the highly localized
energy deposition). The SNR is typically 1000:1, and hence by a factor of 10–100
better than that of conventional techniques, at an energy deposition which is reduced
by a factor of 10³–10⁶ (leading to minimal photobleaching and phototoxicity and
allowing time-lapse imaging over a long period). The specimen is typically embedded in
a transparent gel such as agarose. CCD cameras are used for fast image capturing:
typically, images of 4.2 megapixels (2048 × 2048) can be acquired with a frame rate
of 15 frames per second, leading to a data rate of 1 Gbit/s for 16 bit images. Lateral
and axial resolutions down to 300 nm and 1000 nm respectively can be achieved, and
a multi-view image acquisition can be used to achieve isotropic image resolution (i.e.
by taking several images from different angles and fusing them in silico).
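The quoted acquisition figures are mutually consistent, as a quick back-of-the-envelope calculation shows:

```python
# 2048 x 2048 pixels, 16 bit per pixel, 15 frames per second
bits_per_second = 2048 * 2048 * 16 * 15
print(bits_per_second / 1e9)  # ≈ 1.0 (Gbit/s)
```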
4.2.3. Integer linear programming
The following mathematical background is common knowledge and covered in standard textbooks such as those by Papadimitriou & Steiglitz (1998) or Wolsey (1998).
A linear program (LP) is a mathematical optimization problem for which both the
optimization objective and the constraints are linear in the variables. In its canonical
form, it is stated as
min_x c⊤x  subject to  Ax ≥ b.        (4.1)
It can be viewed as the minimization of a linear function over the convex polytope
defined by Ax ≥ b. If we denote the number of variables with p and the number of
constraints with m, then x ∈ R^p is the variable vector, c ∈ R^p the cost vector and
A ∈ R^(m×p) the constraint matrix.
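As an illustration, a small LP in this canonical form can be solved with an off-the-shelf solver. The sketch below (the instance itself is made up for illustration) uses scipy.optimize.linprog, which expects upper-bound constraints A_ub x ≤ b_ub, so Ax ≥ b is passed as −Ax ≤ −b:

```python
import numpy as np
from scipy.optimize import linprog

# canonical form: min c^T x  s.t.  Ax >= b, here with additional bounds x >= 0
c = np.array([1.0, 2.0])
A = np.array([[1.0, 1.0]])        # x1 + x2 >= 1
b = np.array([1.0])

# linprog uses A_ub x <= b_ub, so the >= constraints are negated
res = linprog(c, A_ub=-A, b_ub=-b, bounds=[(0, None), (0, None)])
print(res.x, res.fun)             # optimum attained at the polytope vertex (1, 0)
```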
State-of-the-art algorithms for globally solving the LP problem in Eq. (4.1) fall into
two categories:
• Simplex algorithm: This algorithm by Dantzig (1949) makes use of the fact
that the minimum must be attained on one of the vertices of the feasible polytope: starting from an initial vertex, one iteratively visits adjacent vertices
such that the objective decreases. Different pivoting strategies exist which
specify which neighbor to take if there are several possibilities, and may lead
to vast differences in the practical performance of the algorithm (see (Terlaky &
Zhang, 1993) for a somewhat dated overview). Most known pivoting schemes
of the simplex algorithm give exponential worst-case complexity, and it is currently not known whether variants with polynomial complexity exist. However,
these worst-case problem instances are mostly pathological, and the average-case complexity is typically cubic, both for random problem instances and for
a variety of real-world instances.
• Interior-point methods: The first usable LP solver with proven polynomial complexity was proposed by Karmarkar (1984); the previously proposed
ellipsoid algorithm (Aspvall & Stone, 1980) was unfit for practical use due to
numerical stability problems and gave non-competitive performance on all real-world LP instances. In contrast to the simplex algorithm (where the current
candidate solution is always a polytope vertex), it maintains an interior point
of the polytope as the current solution and reaches the optimal solution on the
border only asymptotically. The key idea is the replacement of the inequality constraints by adding a differentiable barrier function to the minimization
objective, which becomes infinite at the border of the feasible region: the minimum of this adjusted objective is then found using Newton updates, and the
contribution of the barrier term is iteratively decreased to zero (for a recent
overview see (Nemirovski & Todd, 2008)).
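The barrier idea can be illustrated on a toy one-dimensional problem, min x subject to x ≥ 1 (my own example, not from the methods cited above): the barrier objective x − µ log(x − 1) has its minimum on the central path at x = 1 + µ, which approaches the true boundary optimum x = 1 as µ → 0.

```python
def barrier_solve(mu0=1.0, shrink=0.5, tol=1e-8):
    """Minimize x subject to x >= 1 via a log-barrier central path."""
    x, mu = 2.0, mu0
    while mu > tol:
        # inner Newton iterations on phi(x) = x - mu * log(x - 1)
        for _ in range(50):
            grad = 1.0 - mu / (x - 1.0)
            hess = mu / (x - 1.0) ** 2
            step = grad / hess
            while x - step <= 1.0:      # damp the step to stay strictly feasible
                step *= 0.5
            x -= step
            if abs(step) < 1e-12:
                break
        mu *= shrink                    # tighten the barrier and re-center
    return x

print(barrier_solve())                  # approaches the boundary optimum x = 1
```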
Integer linear programming (ILP) is a mathematical optimization problem of the
form as in Eq. (4.1), with the additional constraint that x ∈ Z^p must be a vector of
integers. An important special case occurs when the xi are further constrained to
be either 0 or 1 (binary integer programming, BIP): this is one of the classical 21
NP-complete problems identified by Karp (1972) and is hence unlikely to be solvable
in polynomial time. Nonetheless, powerful algorithms exist for finding the global
optimum of ILP instances that can nowadays solve problem instances with up to a
few hundred thousands of constraints and variables (Achterberg et al., 2006). They
fall into three categories:
• Branch and bound: This strategy generates a search tree for all feasible
solutions, and prunes unpromising subtrees to avoid complete enumeration
(which would require exponential time). A subtree typically corresponds to
the subproblem of fixing some variables of the original problem and finding
the optimum over the remaining variables. It uses the fact that solving the
LP relaxation of an ILP subproblem (with the integrality constraints dropped)
gives a lower bound for the optimal solution of the ILP problem: if this lower
bound already exceeds an upper bound for the global optimum (e.g. the best
feasible solution that is currently known) then the subtree may be pruned.
• Cutting plane: This strategy tries to find the integer polytope, i.e. the convex
hull of all integral feasible points, which is usually a strict subset of the convex
polytope of the relaxed LP problem. Obviously, solving the LP relaxation over
the integer polytope would give the solution for the original ILP problem, but
finding this polytope is an NP-complete problem by itself. The cutting plane
algorithm iteratively adds inequality constraints that are met by all feasible
integral points, until the solution of the LP relaxation becomes integral.
• Branch and cut: This is a hybrid of the two other strategies, where the cutting
plane algorithm is applied to the subproblems encountered while traversing
the branch and bound search tree, leading to tighter lower bounds due to the
additional constraints.
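To make the branch-and-bound idea concrete, the following sketch (an illustrative toy implementation on a made-up binary instance, not the solvers benchmarked by Achterberg et al.) repeatedly solves LP relaxations with scipy.optimize.linprog, prunes subproblems whose LP bound is no better than the incumbent, and branches on a fractional variable:

```python
import numpy as np
from scipy.optimize import linprog

def branch_and_bound(c, A_ub, b_ub, bounds):
    """Minimize c^T x s.t. A_ub x <= b_ub with x integral within the given bounds."""
    best = {"obj": np.inf, "x": None}

    def recurse(bounds):
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
        if not res.success:               # subproblem infeasible: prune
            return
        if res.fun >= best["obj"]:        # LP lower bound no better than incumbent: prune
            return
        frac = [i for i, v in enumerate(res.x) if abs(v - round(v)) > 1e-6]
        if not frac:                      # integral LP optimum: new incumbent
            best["obj"], best["x"] = res.fun, np.round(res.x)
            return
        i = frac[0]                       # branch on the first fractional variable
        lo, hi = bounds[i]
        left, right = list(bounds), list(bounds)
        left[i] = (lo, np.floor(res.x[i]))
        right[i] = (np.ceil(res.x[i]), hi)
        recurse(left)
        recurse(right)

    recurse(list(bounds))
    return best["x"], best["obj"]

# toy BIP: max 5x1 + 4x2 + 3x3 (i.e. min of the negated costs), binary variables
x, obj = branch_and_bound(c=[-5.0, -4.0, -3.0],
                          A_ub=[[2, 3, 1], [4, 1, 2], [3, 4, 2]],
                          b_ub=[5, 11, 8],
                          bounds=[(0, 1)] * 3)
print(x, -obj)
```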
It should be noted that the performance of these methods depends on the individual
ILP instance upon which they are applied: while they give fast solutions for many
practically relevant instances, their worst-case complexity is still exponential. However, there is an important subclass of ILP instances that is guaranteed to be
solvable in polynomial time, namely those where the constraint matrix A is totally
unimodular (TU): this means that the determinant of every square submatrix must
be either 0, +1 or −1. For these instances, the constraint polytope is at the same
time the integral polytope, hence the ILP problem has the same solution as its LP
relaxation. Several important network flow problems fall into this category, e.g. the
minimum-cost flow problem, which asks how to route a given amount of flow f from
a source s to a sink t, via a directed network of edges with a transport cost ci and a
maximum capacity b_i:

min_x c⊤x   s.t.   x ≥ 0,  x ≤ b,  Σ_e x_e d_ve = f (δ_vs − δ_vt) for all v.        (4.2)

In this equation, (d_ve)_ve denotes the directed incidence matrix, i.e. for every node v
and edge e,

         +1  if v is the start node of e
d_ve =   −1  if v is the end node of e                                              (4.3)
          0  otherwise
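The integrality guarantee can be observed directly by solving the LP relaxation of a tiny min-cost flow instance (a made-up three-node network, for illustration only): since the incidence-matrix constraints are totally unimodular, an LP solver returns an integral flow without any integrality constraint being imposed.

```python
import numpy as np
from scipy.optimize import linprog

# edges: s->a, s->t, a->t; route f = 3 units of flow from s to t
D = np.array([[ 1,  1,  0],    # node s: +1 where s is the start node of the edge
              [-1,  0,  1],    # node a
              [ 0, -1, -1]])   # node t
cost = np.array([1.0, 3.0, 1.0])
cap = np.array([2.0, 2.0, 2.0])
f = 3.0
rhs = f * np.array([1.0, 0.0, -1.0])   # f * (delta_vs - delta_vt) per node

res = linprog(cost, A_eq=D, b_eq=rhs, bounds=[(0, c) for c in cap])
print(res.x, res.fun)   # the LP optimum is already integral
```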
4.3. Related work
First, section 4.3.1 discusses previous research which has a scope similar to the entire
pipeline to which this chapter contributes, namely reconstructing the cell lineage tree
of an entire organism. In contrast, sections 4.3.2 and 4.3.3 present related work in the
two most important subcomponents of this pipeline, namely nucleus segmentation
and nucleus tracking. Due to the multitude of publications in those areas, only some
selected articles can be discussed that have the highest relevance for putting the
techniques discussed in this chapter into context.
4.3.1. Cell lineage tree reconstruction
Cell lineage reconstruction has been pioneered in the nematode Caenorhabditis elegans, which has a highly invariant cell lineage: 671 cells are generated, of which 113
(for hermaphrodites) undergo programmed cell death. Due to this moderate number
of cells, the first lineage tree could be generated by manual tracing in interferometric
microscopy images of living worms (Sulston et al., 1983).⁷ However, this manual lineage reconstruction becomes impracticable when a large number of specimens are to
be analyzed for their cell lineage, e.g. in order to elucidate the developmental effects
of genetic variants.
Bao et al. (2006) present an automated lineage reconstruction system for confocal
time-lapse microscopy imagery of H2B-GFP labeled C. elegans embryos up to the
350-cell stage: nuclei are identified as local intensity maxima (with the constraint
that there must be a minimum distance between all nucleus pairs) and approximated
by the best spherical fit to the local intensity distribution. Nucleus tracking works by
⁷ It should be noted that this work was awarded the 2006 Nobel Prize in Physiology or Medicine.
a greedy procedure. First each nucleus is tentatively assigned to its nearest neighbor
in the previous time step, and then this assignment is refined by tackling cell divisions
separately: if a nucleus at time t has several children at time t+1, all possible parents
of these nuclei are computed (i.e. those whose distance is lower than a threshold),
and a hand-crafted scoring function is used to select which of these possible parents
are actually selected for each nucleus in the end. A graphical user interface is also
provided for final manual lineage tree correction.
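The first, tentative nearest-neighbor assignment step can be sketched as follows (an illustration of the general idea only, not Bao et al.'s actual code; the distance threshold max_dist is a made-up parameter):

```python
import numpy as np
from scipy.spatial import cKDTree

def tentative_assignment(prev_centers, curr_centers, max_dist):
    """Assign every nucleus at time t+1 to its nearest nucleus at time t,
    or to None if no predecessor lies within max_dist."""
    tree = cKDTree(prev_centers)
    dist, idx = tree.query(curr_centers)
    return [int(i) if d <= max_dist else None for d, i in zip(dist, idx)]

prev = np.array([[0.0, 0.0], [10.0, 0.0]])
curr = np.array([[0.5, 0.0], [9.5, 0.0], [50.0, 50.0]])
print(tentative_assignment(prev, curr, max_dist=2.0))   # [0, 1, None]
```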
Recently, research has been undertaken with the aim of achieving automated cell
lineage reconstruction also in a vertebrate, namely in the zebrafish D. rerio. It
culminates in the publication of the zebrafish cell lineage tree up to the 1000-cell stage (i.e. the first ten mitotic divisions, up to the mid-blastula stage), based on
label-free microscopic imagery (Olivier et al., 2010). The authors employ harmonic-generation microscopy in order to forego the need for fluorescent labeling: Mitotic
nuclei and cell membrane positions are extracted from second-harmonics generation and third-harmonics generation images, as second harmonics are generated at
anisotropic structures (such as microtubule spindles) and third harmonics are generated at interfaces between aqueous and lipidic media (such as cell membranes). A
nearest neighbor scheme with interactive manual corrections is used for cell tracking. Additionally, software for automated segmentation and tracking of cells in
the zebrafish brain (based on laser scanning confocal microscopy) has been published
(Langenberg et al., 2006), but no details about the technical workings are provided.
However, the only published evaluation pipeline for DSLM data is the one presented
in (Keller et al., 2008): There the authors segment cell nuclei by local adaptive thresholding and perform local nucleus tracking by a nearest neighbor search. Nucleus
detection efficiencies of 92 % and tracking accuracies of 99.5 % per frame-by-frame
association can be achieved.
4.3.2. Cell or nucleus segmentation
Multi-scale initialization and graph-cut refinement The first of the segmentation methods studied in section 4.4, which was developed by Lou et al. (2011b),
is most closely related to the nucleus segmentation presented in (Al-Kofahi et al.,
2010): both approaches use blob filter responses that are coherent across multiple
scales as initial seeds for the segmentation, and refine them via a discrete graph cut
optimization. However, the method by Lou et al. (2011b) differs by the use of more
flexible foreground cues based on discriminative random forest classifiers (see section
2.2) instead of the Poisson mixture model employed in the other article, by explicit
shape regularization using a multi-object generalization of the graph cuts algorithm
presented by Boykov et al. (2001), and by being a 3D segmentation in contrast to
the 2D segmentation in (Al-Kofahi et al., 2010).
Classification based on local features The ILASTIK procedure, i.e. the second
of the segmentation methods studied in section 4.4 is most closely related to the
work of Yin et al. (2010). Both approaches extract local features from every pixel
and classify them as either foreground or background; finally simple segmentation
schemes are used to group spatially connected foreground pixels into segmented nuclei. However, the ILASTIK procedure uses the responses of convolutional filters
computed at multiple scales as features, and a random forest as supervised classifier,
while Yin et al. (2010) extract histograms from a patch window around each pixel,
which are then clustered and classified using a Bayesian mixture of experts.
Level-set evolution A different approach is followed by Bourgine et al. (2010) and
Zanella et al. (2010), who tackle the three steps of their segmentation pipeline (image
denoising, center detection and pixel-accurate segmentation) with a common mathematical formalism, namely nonlinear partial differential equations. For denoising,
a variant of anisotropic diffusion is used, objects are distinguished from background
speckles by a level set evolution which causes small objects to vanish quickly, and
voxelwise segmentation is achieved by a level set evolution that can account for
missing boundaries by curvature regularization. This segmentation method is then
applied for detecting and delineating cell nuclei in confocal time-lapse microscopy of
zebrafish embryos. Compared to manual ground truth, they achieve mean Hausdorff
distances (over all nuclei) in the range of 0.35 µm – 0.98 µm, with an average of
0.65 µm. Mosaliganti et al. (2009) propose a different level-set segmentation method
for the same application: they fit a parametric model to the intensity distributions of
training nuclei, and incorporate this as a prior into an active contour-based energy
minimization problem, achieving Dice indices of 0.86 – 0.94 on different datasets.
However, a disadvantage of continuous PDE-based segmentation methods (as opposed to discrete techniques like graph cuts) is that they are prone to get stuck in
local optima of the energy functional.
Gradient flow tracking Li et al. (2007, 2008a) use a three-stage procedure for
segmenting cell nuclei in 3D confocal fluorescence images of C. elegans and D. rerio.
First, the image gradient field is denoised using gradient vector diffusion (i.e. solving
a Navier-Stokes PDE). Secondly, the image is partitioned into the attraction basins
of the gradient vector field, and it is assumed that each basin contains at most one
nucleus. Finally, the local adaptive thresholding method by Otsu (1979) is used to
compute the final nucleus segmentation, which achieves a volume overlap of 90 %
with the manual segmentation ground truth. These methods have been made publicly
available in a software package called ZFIQ (Liu et al., 2008).
4.3.3. Cell or nucleus tracking
Previous approaches for the tracking of cells or nuclei fall into four categories:
1. segmentation and frame-by-frame association,
2. level-set evolution,
3. stochastic filtering, and
4. four-dimensional segmentation of spatiotemporal volumes.
Segmentation and frame-by-frame association In this formalism, independent
nucleus segmentation is performed on each data volume (at each time step), and
afterwards the optimal association of nuclei across different time points is found. In
most cases, integer linear programming is used for matching nuclei between pairs
of subsequent time steps (Al-Kofahi et al., 2006; Padfield et al., 2009a; Li et al.,
2010): the objective is a suitably selected energy function, which makes sure that
e.g. spatially close nuclei are more likely to be matched than distant nuclei, while the
constraints make sure that no cell is used by more than one matching event. There
are slight differences between the various papers (e.g. whether occlusions or entering
and leaving the field of view are modelled), but the mathematical formulation of the
models is nearly identical. The approach presented in this chapter (section 4.5) follows
the same formalism.
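For the special case of a pure one-to-one matching between two frames, the ILP reduces to a linear assignment problem, which can be sketched with scipy's Hungarian-algorithm solver (a simplified illustration with Euclidean-distance costs and made-up coordinates, not the full models of the cited papers, which add terms for divisions, appearances and disappearances):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

prev = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])   # nuclei at time t
curr = np.array([[0.3, 0.2], [0.1, 5.2], [5.1, -0.2]])  # nuclei at time t+1

# pairwise Euclidean distances as matching costs
cost = np.linalg.norm(prev[:, None, :] - curr[None, :, :], axis=-1)
rows, cols = linear_sum_assignment(cost)
matches = dict(zip(rows.tolist(), cols.tolist()))
print(matches)   # each nucleus at t matched to its counterpart at t+1
```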
Level-set evolution Level-set evolution techniques model the cell boundaries as
zero level sets of an implicit embedding function. Hence they are not restricted to
a certain topology and can easily account for splits and merges of objects: Additional objects simply correspond to additional hills in the profile of the embedding
function. Padfield et al. (2009b) segment and track GFP-labeled nuclei in time-lapse
2D fluorescence microscopy imagery by interpreting the images acquired at different
time points as a 3D stack, and use level set evolution (initialized with automatically
placed seed points, which are determined by a classifier) to segment the entire cell
trajectories. For tracking fluorescent cells in time-lapse 3D microscopy, Dufour et al.
(2005) use a coupled active surfaces approach, which identifies every cell with the
zero level set of a single embedding function and adds overlap penalties and volume
preservation priors in order to prevent cells from overlapping with other cells, or from
shrinking or growing too rapidly. A level-set segmentation is then performed on the individual
data volumes, in which the final segmentation of the previous time step always serves
as initialization for the next segmentation task. This approach reaches a tracking
accuracy of 99.2 %.
Stochastic filtering Li et al. (2008b) use a combination of stochastic motion filters
with the techniques presented in the previous two paragraphs for cell tracking in 2D
phase contrast microscopy imagery: first cells are detected using a combination of
region-based and edge-based object detection techniques, then predictions for their
central position in the next time step are cast using an interacting multiple models
filter, which allows cells to have different motion patterns (such as e.g. Brownian
motion or constant acceleration). This prediction is combined with the detection result from the next time step, and incorporated as one term into the energy functional
of a level-set tracking scheme, which computes the definite tracking event across two
subsequent time steps. Explicit rules are then used to compile tracking events into
track segments (spanning multiple time steps), which are finally linked to a lineage
tree using integer linear programming. Tracking accuracies between 87 % and 93 %
can be achieved by this method on different datasets. A simplified version of this
procedure (which uses only object detection and interacting multiple models filtering, but neither level-set evolution nor integer linear programming) is employed by
Genovesio et al. (2006) for tracking quantum dot-labeled endocytosis vesicles in 3D
fluorescence microscopy imagery, achieving a true positive rate of 85 % and a false
positive rate of 6 % among all tracks. However, this is an easier task than cell or
nucleus tracking, since vesicles do not divide.
Four-dimensional segmentation of spatio-temporal volumes If the cells or nuclei show sufficient overlap in subsequent time frames, one can unify segmentation
(in space) and tracking (over time) by segmenting the nucleus tracks in a four-dimensional volume with three spatial and one temporal dimension, as many three-dimensional segmentation techniques can be readily generalized to more dimensions.
Luengo-Oroz et al. (2007) apply four-dimensional mathematical morphology to optical sectioning microscopy of zebrafish embryos with fluorescence-labeled nuclei and
membranes. The nucleus lineage tree is finally used as the seeds in a seeded watershed segmentation of the cell outlines in the cellular membrane channel. 90 % of
all mitosis events are identified correctly by this approach. A disadvantage of this
approach is the high memory load, since a typical single three-dimensional microscopy image volume already has a size of several gigabytes and the four-dimensional
segmentation has to access the data from all time points simultaneously.
4.4. Experimental comparison of two nucleus
segmentation schemes
4.4.1. Introduction
Two nucleus segmentation procedures were experimentally compared: The first of
these methods by Lou et al. (2011b) is fully automated; for a description see section 4.1. Since its final step is solving a shape-regularized graph-cut optimization
problem, it is henceforward referred to as the “regularized graph cut” (RGC) segmentation. The second method uses the Interactive Learning and Segmentation Toolkit
(ILASTIK) software by Sommer et al. (2010) for semiautomatic segmentation: the
users interactively train a random forest classifier (see section 2.2) for the task of distinguishing between foreground (cell nuclei) and background (everything else) based
on locally extracted image features. When a new label is placed on a training image
volume, the current random forest predictions are automatically updated and displayed, so that the users can see where the classifier still performs poorly, and place
their labels at these locations. The trained classifier can then be used to predict the
foreground and background probabilities of all voxels either of the same volume that
it was trained on, or of a new test image volume. These continuous probabilities
can then be converted to a binary segmentation either by simple thresholding or by
more sophisticated schemes that incorporate spatial regularity terms (e.g. graph cut
segmentation). In this chapter, only the simple thresholding method will be used
for this purpose. The main difference between the two methods is that RGC automatically generates foreground and background labels in order to train a classifier,
and sophisticated spatial and shape regularization is used in order to transform the
classifier predictions into a binary segmentation. For ILASTIK in contrast, the labels
are placed manually by a user, and the segmentation is obtained from the classifier
predictions by a trivial procedure.
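The ILASTIK-style workflow described above (local features, a random forest trained on sparse user labels, and simple thresholding of the predicted probabilities) can be sketched on synthetic data. The following is a toy 2D illustration with hand-picked label positions, using scikit-learn rather than the actual ILASTIK code; a small morphological opening is added only to clean up stray pixels in this toy example:

```python
import numpy as np
from scipy import ndimage
from sklearn.ensemble import RandomForestClassifier

# synthetic image: two Gaussian "nuclei" on a weakly noisy background
rng = np.random.default_rng(0)
img = rng.normal(0.0, 0.01, (60, 60))
yy, xx = np.ogrid[:60, :60]
for cy, cx in [(15, 15), (45, 45)]:
    img += np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * 4.0 ** 2))

# local features: Gaussian smoothing at two scales plus a gradient magnitude
feats = np.stack([ndimage.gaussian_filter(img, 1.0),
                  ndimage.gaussian_filter(img, 2.0),
                  ndimage.gaussian_gradient_magnitude(img, 1.0)], axis=-1)
X = feats.reshape(-1, 3)

# sparse "user" labels: 1 = nucleus, 0 = background (row, column positions)
fg = [(15, 15), (45, 45), (14, 16), (44, 44)]
bg = [(0, 0), (0, 59), (59, 0), (30, 30), (5, 30)]
train = [r * 60 + c for r, c in fg + bg]
y = [1] * len(fg) + [0] * len(bg)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X[train], y)
proba = rf.predict_proba(X)[:, 1].reshape(60, 60)
binary = ndimage.binary_opening(proba > 0.5)   # thresholding (plus stray-pixel cleanup)
labels, n_nuclei = ndimage.label(binary)       # connected foreground components
print(n_nuclei)
```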
4.4.2. Evaluation methodology
Dataset The experimental studies described in this chapter are based on a series of
100 DSLM image volumes, showing the animal pole of an H2B-eGFP labeled zebrafish
embryo; see Fig. 4.1 for an exemplary slice. While the native voxel size was 0.3 × 0.3 × 1.0 µm³, the data were resampled in the z-direction, resulting in a nearly isotropic
voxel size of 0.3 × 0.3 × 0.33 µm³. The total number of voxels in a volume after
resampling was 1161 × 1111 × 486 ≈ 6.3 × 10⁸. 60 seconds elapsed between the
acquisition of two subsequent volumes. For comparison: A typical nucleus has a
diameter of 7 µm, the typical mitosis duration in D. rerio is about 6–7 minutes and
the typical migration speed of nuclei is less than 3 µm/min in the interphase of the
cell cycle, and 8 µm/min in the metaphase (Kimmel et al., 1995).
Need for feature selection Classification-based segmentation methods such as the
ones studied in this chapter require local image features, which are typically generated
by convolving the image with Gaussian kernels of different scales and computing the
responses of different image filters that capture properties like edge strength (e.g.
gradient amplitudes), presence and orientation of blobs and ridges (e.g. the Hessian
matrix and its eigenvalues) or local anisotropy (e.g. the structure tensor and its
eigenvalues). The scale of the Gaussian kernel determines an interaction length
between different locations in the image, i.e. how large a neighborhood should be
chosen to take the context of each voxel into account for the classification. The large
size of the image volumes acquired by DSLM necessitates feature selection in order
to apply the ILASTIK segmentation method effectively: for a typical whole-embryo
volume with a size of roughly 4 × 10⁸ voxels, computing all routinely available local
image features would require ca. 250 GB of main memory, which is by far out of
reach for current desktop computers. However, not all features are equally useful for
classification, and selecting a parsimonious feature set allows one to train a classifier
if not on the whole data volume, then at least on a subvolume of the maximal possible
extent.⁸ This is important in order to obtain a classifier that is suitable for classifying
an entire image volume, since the local appearance of the nuclei changes across the
image due to illumination inhomogeneities.
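The quoted memory requirement is plausible under the assumption (mine, for illustration) that the full feature set comprises the 8 filter types of Table 4.1 at 7 scales, with the tensor-valued filters contributing several channels each:

```python
# assumed channel counts per filter type: scalar filters contribute 1 channel each,
# the symmetric 3x3 structure tensor and Hessian 6 each, their eigenvalues 3 each
channels_per_scale = 1 + 6 + 6 + 1 + 3 + 1 + 3 + 1        # = 22
n_channels = channels_per_scale * 7                       # 7 scales
voxels = 4e8                                              # whole-embryo volume
gigabytes = voxels * n_channels * 4 / 1e9                 # 32-bit floats
print(n_channels, gigabytes)   # ≈ 154 channels, ≈ 246 GB
```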
Variable importance estimation Fortunately, the random forest classifier that is
used as part of the ILASTIK segmentation provides two simple measures for the
importance of the different features (Breiman, 2001): one generic measure that can
be computed for every classifier, and one that is specific to the random forest. The
generic method computes for every feature the decrease in classification accuracy
that occurs when the values of this feature for the different training examples are
randomly permuted. This means, if the rows of the n × p matrix X contain the
features extracted for each of the n training examples, the classifier is retrained p
times with training matrices X^(j), j = 1, …, p, where X^(j) differs from X by a random permutation of
the j-th column. The decrease in classification accuracy can be computed separately
for foreground and background examples, or averaged across all classes. A less costly
alternative to this permutation-based variable importance measure uses the fact that
the trees of the random forest classifier are grown by iteratively searching for the
feature cut with the highest decrease in the Gini impurity (see Eq. (2.8)) among a
⁸ This problem could be bypassed via the use of a lazy evaluation scheme, which computes the image features for the slice currently examined by the user on the fly. However, at current processor speeds this procedure would be too slow for real-time responsiveness.
143
Chapter 4. Live-cell microscopy image analysis
randomly selected feature subset. During classifier training, one can create a list
for each feature containing the Gini decreases of cuts on this feature: The mean of
this list is called the mean Gini decrease of this feature, and it has the advantage
that it can be computed as a byproduct of the normal classifier training. For an
overview of alternative possibilities for measuring variable importance, see (Guyon
& Elisseeff, 2003). Note that there are natural groups of features that are sensibly
computed simultaneously (e.g. the three different eigenvalues of a three-dimensional
Hessian matrix): in this case it was decided to select or deselect these features as a
whole, and to use the maximum mean Gini decrease in the group as a measure of
the importance of the entire feature group.
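Both importance measures are available in standard libraries. The sketch below uses synthetic data and scikit-learn rather than the software used here; note that scikit-learn's permutation_importance re-evaluates the fitted forest on permuted columns instead of retraining it:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# synthetic data: the first two features are informative, the rest are noise
X, y = make_classification(n_samples=500, n_features=6, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

gini = rf.feature_importances_                      # mean decrease in Gini impurity
perm = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print(np.argsort(gini)[::-1][:2],                   # top features by Gini decrease
      np.argsort(perm.importances_mean)[::-1][:2])  # top features by permutation
```

Both rankings place the two informative features first, mirroring the agreement between the two measures observed in Fig. 4.3.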
Feature type / Feature scale            0.3   0.7   1     1.6   3.5   5     10
Gaussian smoothing (G)                  3     2     3     4     4     3     3
Structure tensor (S)                    2     2     4     4     4     4     3
Hessian of Gaussian (H)                 2     2     3     4     5     5     5
Smoothed gradient magnitude (M)         1     1     1     2     3     2     2
Hessian of Gaussian eigenvalues (V)     1     2     2     3     4     5     5
Difference of Gaussians (D)             1     1     1     2     2     5     5
Structure tensor eigenvalues (E)        1     3     3     3     4     4     4
Laplacian of Gaussian (L)               1     1     1     1     3     5     4

Table 4.1. – Order in which the different image features were eliminated from the active
feature set used by the ILASTIK software. For instance, a “1” means that the respective
feature was eliminated already in the first iteration, and a “5” means that it was among the
eight best features. The finally selected features are the best 20 that remained after the
third iteration, i.e. all marked with either a “4” or a “5”.
Feature selection scheme Due to possible correlations between features, the variable importance of a particular feature depends on the other features that are used:
specifically, it may gain importance once another feature is deselected. Hence the
variable importance should be recomputed repeatedly during the pruning of the feature
set. As a compromise between evaluation time and accuracy, the following recursive
feature elimination scheme was used, which is similar to the one employed by Menze
et al. (2009):⁹ The ILASTIK software was used for interactive labeling and classifier
training of five image subvolumes of size 400 × 400 × 100 spaced at twenty minutes,
and ten random forest classifiers (consisting of ten trees each) were trained on each
volume using the same labels. Every random forest yielded a separate variable importance estimate for each feature, and the twelve feature groups with the smallest
⁹ The main difference is that Menze et al. (2009) remove a certain fraction (p %) of the worst features in every iteration, while here the same absolute number of features is removed in each iteration.
[Figure 4.2: bar chart of the required memory [MB] (up to ca. 10 000 MB) versus the number of features (56, 44, 32, 20, 8).]
Figure 4.2. – ILASTIK memory requirements of the feature sets remaining at the end of the
different iteration rounds for a 400 × 400 × 100 data volume, assuming a 32-bit floating-point
representation.
medians over all 50 estimates for the maximum mean Gini decrease were discarded,
leaving 44 feature groups in the active feature set. The use of the permutation-based variable importance instead of the Gini decrease would have been an obvious alternative,
but both methods assign similar rankings to the features (compare Figs. 4.3(a) and
4.3(b)). This whole procedure was iterated three more times using the same labels
for each image volume, with 32, 20 and 8 features remaining at the end of the different iteration rounds. Table 4.1 shows the order in which the features are pruned, and
Fig. 4.2 shows the main memory requirements for the different feature sets. The final
feature set was selected based on the quality of the segmentations obtained in the
different iteration rounds, as computed from comparisons against manual ground
truth. A threshold of 0.5 was used to generate a binary segmentation out of the
continuous classifier outputs.
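The elimination scheme can be sketched as follows (a simplified toy version on synthetic data: single features instead of feature groups, and scikit-learn Gini importances instead of the 50-estimate medians over five labeled subvolumes):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=56, n_informative=8,
                           n_redundant=0, shuffle=False, random_state=0)
active = list(range(X.shape[1]))          # start with all 56 features
for n_keep in (44, 32, 20, 8):            # staged elimination rounds, as in the text
    forests = [RandomForestClassifier(n_estimators=10, random_state=s)
               .fit(X[:, active], y) for s in range(10)]
    # median importance over the repeated forests
    imp = np.median([f.feature_importances_ for f in forests], axis=0)
    keep = np.argsort(imp)[::-1][:n_keep] # retain the highest-ranked features
    active = sorted(active[i] for i in keep)
print(len(active), active[:4])
```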
Segmentation evaluation After feature selection, the two competing segmentation methods (ILASTIK and RGC) are validated against the manual ground truth.
Since it is impracticable to train the ILASTIK classifier on every single data volume
separately, it was studied how well the trained classifiers generalize to close time
points: Five image subvolumes were selected for interactive classifier training (at 1,
21, 41, 61 and 81 minutes after the start of the imaging series), and the trained
[Figure 4.3: boxplots of the variable importance for all feature/scale combinations. (a) Mean maximum Gini decrease variable importance measure. (b) Class-averaged permutation-based variable importance measure.]
Figure 4.3. – Comparison of two variable importance measures for the first feature selection
iteration round. The boxplots show the distribution of 50 estimates computed for each feature
(from ten random forest classifiers trained over five datasets). The arrows in Fig. 4.3(a) mark
the features that were removed in this iteration due to having the lowest median value for the
mean maximum Gini decrease. The upper-case letters in the x-axis labels encode the feature
type (see Table 4.1), while the lower-case letters encode the feature scale: “a” stands for the
smallest scale (0.3 voxel lengths), while “g” stands for the largest scale (10 voxel lengths).
classifiers were used for segmenting both these training subvolumes and separate
testing subvolumes which had been acquired four minutes later (at 5, 25, 45, 65 and
85 minutes).10 Again, the size of each subvolume was 400 × 400 × 100. While the
current RGC implementation generates one single deterministic segmentation, the
results of ILASTIK may vary depending on the number of labels and on the binarization threshold; furthermore there is an element of chance as the label placement
is inherently subjective. In order to study the influence of these effects, 25 separate
label sets were independently acquired for each training volume, and each was used
for training a random forest classifier with 100 trees. Each set contained an equal
number of foreground and background labels, and the total number of labels was
varied systematically: five sets each had a total size of 40 (20 + 20), 80, 120, 160
and 200 labels.11 This allowed studying both the effect of the label number (and hence of the user effort) on the segmentation quality, as well as the variability of the classifier for a fixed number of labels. Furthermore, three
different thresholds (0.25, 0.5 and 0.75) were used in order to transform the predicted
probability maps into binary segmentations: One can expect that a low threshold
leads to higher merge rates and recall while reducing split rates and precision (and
that the opposite holds for a high threshold). The following segmentation quality
measures were calculated:
• Precision: the percentage of segments in the computed segmentation that
overlap with at least one nucleus in the ground truth.
• Recall: the percentage of nuclei in the ground truth that overlap with at least
one segment in the computed segmentation.
• F1 measure: The harmonic mean of precision and recall.12
• Merge rate: the percentage of segments in the computed segmentation that
overlap with more than one nucleus in the ground truth.
• Split rate: the percentage of nuclei in the ground truth that overlap with
more than one segment in the computed segmentation.
• Dice index: the ratio between the volume of the intersection between computed and ground-truth segmentation, and the average volume of these two
segmentations.
• Hausdorff distance: the maximum distance of a point in the computed segmentation to the ground-truth segmentation.
10 Since generating the manual ground truth for these ten data volumes was already a time-consuming process, the same training volumes were used to select the number of features and to train the classifiers. It would have been methodologically preferable if a separate dataset had been used for the feature selection.
11 Typical human labeling speeds were 12–18 labels per minute.
12 The subscript 1 indicates that there is a more general Fβ measure, which allows placing higher weight on either precision or recall when forming the mean. For β = 1, both measures are weighted equally.
The precision, recall and F1 measure characterize how well the segmentation detects the cell nuclei as objects. Split and merge rate quantify the occurrence of oversegmentation and undersegmentation, respectively, while the Dice index and Hausdorff distance quantify the voxelwise accuracy of the segmentation.
4.4.3. Results for feature selection and evaluation

[Figure 4.4: precision values (a), recall values (b) and F1 measure values (c) plotted against the time step of the dataset, for RGC and for ILASTIK with 56, 44, 32, 20 and 8 features.]
Figure 4.4. – Object detection accuracy measures of the ILASTIK segmentations for different feature set sizes, with the RGC results shown for comparison purposes. Higher values correspond to better segmentations.
Feature selection results Figs. 4.4, 4.5 and 4.6 show the results of the different segmentation accuracy measures for the feature sets obtained after the different
feature selection iterations. While there is little difference between the results obtained with 56, 44, 32 and 20 features, the segmentation quality degrades markedly when going down to eight features. The precision in particular declines considerably, which means that a classifier using only the final eight features erroneously detects a large number of false positive segments that do not correspond to actual cell nuclei. Fig. 4.5(a) shows that the voxelwise accuracy (as measured by the Dice index) is also impaired. On the other hand, using eight features favors oversegmentation over undersegmentation for the later time steps, which is advantageous for the later tracking procedure (compare Figs. 4.6(a) and 4.6(b)). Both the use of 20 and of eight features would hence be defensible: the decision fell on the 20 features remaining after the fourth iteration round (i.e. those marked with “4” and “5” in Table 4.1), since this choice led to segmentation results that were nearly indistinguishable from using the entire feature set. It should be noted that 16 out of the 20 finally selected features were already among the 20 best features in the first iteration round (see Fig. 4.3(a)), so our procedure of gradually discarding only a small number of features per iteration step can be seen as somewhat overly cautious.
[Figure 4.5: Dice indices (a) and Hausdorff distances in voxels (b) plotted against the time step of the dataset, for RGC and for ILASTIK with 56, 44, 32, 20 and 8 features.]
Figure 4.5. – Voxelwise accuracy measures of the ILASTIK segmentations for different feature set sizes (in parentheses), with the RGC results shown for comparison purposes. For the left plot, higher values correspond to better segmentations, while it is the other way around for the right plot.
Selection of optimal binarization threshold Figs. 4.7, 4.8 and 4.9 show the effect
of varying the ILASTIK binarization threshold on the F1 measure and on the occurrence of oversegmentation and undersegmentation. Choosing an optimal threshold
typically requires resolving a conflict between precision and recall, since raising the
threshold typically increases the former and decreases the latter. The F1 measure
[Figure 4.6: split rates (fraction of true cells that are split, panel a) and merge rates (fraction of segments that are merges, panel b) plotted against the time step of the dataset, for RGC and for ILASTIK with 56, 44, 32, 20 and 8 features.]
Figure 4.6. – Oversegmentation and undersegmentation measures of the ILASTIK segmentations for different feature set sizes, with the RGC results shown for comparison purposes. Lower values correspond to better segmentations.
captures how well these two conflicting goals of good precision and recall can be met
at the same time: as Fig. 4.8 shows, varying the threshold affects the recall more
than the precision, so that higher F1 measures are attained for a lower threshold.
A high recall is also more important than a high precision, since extraneous nuclei may still be suppressed at a later stage during the tracking, while nuclei that are lost in the segmentation stage cannot be recovered later: this is a second argument for choosing a threshold of 0.25 rather than 0.5 or 0.75. On the other hand, lowering the threshold also leads to a decreased split rate (oversegmentation, Fig. 4.8) and an increased merge rate (undersegmentation, Fig. 4.9). Both these effects are nonnegligible, with the effect on the merge rate being more pronounced. Since artificial splits can be tolerated better than artificial merges, this is an argument for selecting a high binarization threshold. Hence a compromise value of 0.5
was chosen, as in the preliminary studies on feature selection.
Comparison of training and test data For most performance measures, the differences between the training and the test datasets were negligible, with the largest
differences occurring for the F1 measures (Figs. 4.10(a) and 4.10(b)) and the recall (Figs. 4.10(e) and 4.10(f)): There the test values were slightly decreased compared to the training values, whereas no effect was noticeable for e.g. the precisions
(Figs. 4.10(c) and 4.10(d)). Note that the absolute numbers should not be compared
since the intrinsic difficulty of segmenting the training and the test datasets may
be different: instead the relative performance of ILASTIK compared to the RGC
segmentation should be used for the comparison, as the RGC method is not affected
[Figure 4.7: F1 measure against the time step of the dataset for binarization thresholds 0.25 (a), 0.5 (b) and 0.75 (c), for RGC and for ILASTIK trained with 40, 80, 120, 160 and 200 labels.]
Figure 4.7. – Effect of varying the segmentation threshold on the F1 measure, for the training datasets. The bar lengths for the ILASTIK results indicate the mean values over the five separately trained classifiers with the same number of labels (in parentheses), and the error bars indicate the standard deviation.
[Figure 4.8: split rate (fraction of true cells that are split) against the time step of the dataset for binarization thresholds 0.25 (a), 0.5 (b) and 0.75 (c), for RGC and for ILASTIK trained with 40, 80, 120, 160 and 200 labels.]
Figure 4.8. – Effect of varying the segmentation threshold on the occurrence of oversegmentation (split rate), for the training datasets.
[Figure 4.9: merge rate (fraction of segments that are merges) against the time step of the dataset for binarization thresholds 0.25 (a), 0.5 (b) and 0.75 (c), for RGC and for ILASTIK trained with 40, 80, 120, 160 and 200 labels.]
Figure 4.9. – Effect of varying the segmentation threshold on the occurrence of undersegmentation (merge rate), for the training datasets.
by the attribution to test or training data.13 Note that the precision improves over
time, while the recall is diminished: In the earlier time steps, the true nuclei have
good contrast and are clearly detectable (high recall), while there are also numerous
speckles that may be mistaken for nuclei by the segmentation method (poor precision). Most of these speckles disappear at later time steps, but at the same time the
average nucleus contrast is reduced and several nuclei are not detected any longer.
Comparison between RGC and ILASTIK The differences between RGC and ILASTIK can be summarized as follows:
• If ILASTIK is trained with a sufficient number of training examples, the two methods do not differ significantly in terms of precision, recall and F1 measure.
This holds both for segmenting the same dataset on which the classifiers are
trained, and when applying the classifier to the data from a neighboring time
step (Fig. 4.10). Typical values for the later time steps are 0.7–0.8 for the
recall, > 0.99 for the precision and 0.8–0.9 for the F1 measure.
• The voxelwise accuracy of both methods is also comparable, both when measured in terms of overlap volumes (Dice measure, Fig. 4.11(a)) and when measured in terms of surface distances (Hausdorff distance, Fig. 4.11(b)). Typical
Dice indices for the later time steps lie between 0.55 and 0.65.
• RGC is more susceptible to oversegmentation (Fig. 4.8) and less susceptible
to undersegmentation (Fig. 4.9), particularly for later time steps. In principle,
the subsequent tracking is more robust towards oversegmentation than towards
undersegmentation. However, the relative sizes of both effects should be taken
into consideration: The merge rate of ILASTIK can be kept under 1 % using
a sufficiently high number of training labels (Fig. 4.9(b)), while the split rate
of RGC exceeds 35 % for the later time steps (Fig. 4.8). This places a heavy
burden on the subsequent tracking and may cause tracking errors, by which
true nuclei are matched with oversegmentation fragments.
Overall, most differences are negligible. If trained with a high number of labels (200 per data volume), ILASTIK has a slight advantage over RGC due to the markedly lower occurrence of oversegmentation, but this should be weighed against the increased human effort caused by the interactivity. Due to the suboptimal recall values, the subsequent tracking step needs to be robust towards false negatives, i.e. nuclei that are missed in some time step.
13 This confounding effect could have been avoided by training the random forests on all ten datasets (1, . . . , 81, 5, . . . , 85) and computing two segmentations for each of the datasets 1, . . . , 81: one with the classifier that was trained on the same dataset and one with the classifier that was trained on the dataset acquired four minutes later. However, this approach was not followed, as it would have been more time-consuming.
[Figure 4.10: F1 measure, precision and recall against the time step, for the training data (time steps 1–81, panels a, c, e) and the test data (time steps 5–85, panels b, d, f), for RGC and for ILASTIK trained with 40, 80, 120, 160 and 200 labels.]
Figure 4.10. – Illustration of the difference between training and testing datasets for the counting accuracy measures, and comparison between the RGC and the ILASTIK segmentation. These graphics show the results for an ILASTIK segmentation threshold of 0.5.
[Figure 4.11: Dice indices (a) and Hausdorff distances in voxels (b) against the time step, for RGC and for ILASTIK trained with 40, 80, 120, 160 and 200 labels.]
Figure 4.11. – Comparison of the RGC and the ILASTIK segmentation with respect to the voxelwise accuracy measures, for the training datasets. These graphics show the results for an ILASTIK segmentation threshold of 0.5. The decrease in Hausdorff distance is partially due to the higher cell density at the later time points. For the left-hand plot, higher values correspond to better segmentations, while the opposite holds for the right-hand plot.
4.5. Cell tracking by integer linear programming
4.5.1. Methodology
After generating segmented nucleus candidates by either of the methods discussed in the previous section, they have to be tracked over time in order to construct the cell lineage tree. This is achieved by finding the optimal joint association between nuclei for every pair of subsequent time frames. The following events are possible:
1. Nucleus i moves to become nucleus j in the next time step (i → j),
2. nucleus i splits into the nuclei j and k (i → j + k),
3. nucleus i disappears in the next time step due to leaving the field of view,
apoptosis or misdetection (i → ⊘),
4. nucleus j from time step t + 1 appears due to entering the field of view or being
misdetected in the previous time step (⊘ → j).
In order to rule out implausible events, children must be among the $k$ nearest neighbors of their parent cell, and the parent-child distance must lie below a threshold $r_{\max}$. All these events have associated costs, which are chosen as follows ($r_i$ denoting the center-of-mass position of nucleus $i$ in voxel lengths):

$$c_{i \to j} = \|r_i - r_j\|^2 \qquad (4.4)$$
$$c_{i \to j+k} = \|r_i - r_j\|^2 + \|r_i - r_k\|^2 + c_{\mathrm{Spl}} \qquad (4.5)$$
$$c_{i \to \varnothing} = c_{\mathrm{Dis}} \qquad (4.6)$$
$$c_{\varnothing \to j} = c_{\mathrm{App}} \qquad (4.7)$$

Obviously, additional features could also be used for computing these costs. The constants $c_{\mathrm{Spl}}$, $c_{\mathrm{Dis}}$ and $c_{\mathrm{App}}$ are chosen such that appearance and disappearance events are heavily penalized compared to splits and moves. Experimentally, the choice $k = 6$, $r_{\max} = 35$, $c_{\mathrm{Spl}} = 100$, $c_{\mathrm{Dis}} = c_{\mathrm{App}} = 10000$ was found to yield acceptable results. Note that as long as $c_{\mathrm{Dis}}$ and $c_{\mathrm{App}}$ are above $2 r_{\max}^2 + c_{\mathrm{Spl}}$, their exact value does not matter, since they preclude the disappearance or appearance of all cells which could be accounted for by some other event.
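The candidate generation and the move/split costs of Eqs. (4.4) and (4.5) can be sketched as follows, using the experimentally chosen parameter values. This is a minimal NumPy illustration with hypothetical names (`candidate_events`); it only enumerates admissible events and their costs, while the actual joint association is solved by the ILP described next (with CPLEX in the text).

```python
import numpy as np
from itertools import combinations

C_SPL, C_DIS, C_APP = 100.0, 10000.0, 10000.0  # cost constants from the text
K, R_MAX = 6, 35.0                             # k nearest neighbors, distance cutoff

def candidate_events(parents, children):
    """Enumerate admissible moves (i -> j) and splits (i -> j + k) with
    their costs. `parents`, `children`: (n, 3) arrays of center-of-mass
    positions in voxel lengths."""
    # squared distances between every parent and every child
    d2 = ((parents[:, None, :] - children[None, :, :]) ** 2).sum(-1)
    moves, splits = {}, {}
    for i, row in enumerate(d2):
        # k nearest children within the distance threshold r_max
        near = [j for j in np.argsort(row)[:K] if row[j] <= R_MAX ** 2]
        for j in near:
            moves[(i, j)] = row[j]                        # Eq. (4.4)
        for j, k in combinations(near, 2):
            splits[(i, j, k)] = row[j] + row[k] + C_SPL   # Eq. (4.5)
    return moves, splits
```

For large frames, the brute-force distance matrix would be replaced by an efficient k-nearest-neighbor query, as done with the ANN library in the text.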
For each possible move (out of the set $\mathcal{M}$) and split (out of the set $\mathcal{S}$), define a binary variable $x$ indicating whether this event takes place or not. Finding the optimum joint association is then an integer linear programming (ILP) problem:

$$\min_x \; \sum_{(i \to j) \in \mathcal{M}} x_{i \to j} \, (c_{i \to j} - c_{i \to \varnothing} - c_{\varnothing \to j}) \; + \sum_{(i \to j+k) \in \mathcal{S}} x_{i \to j+k} \, (c_{i \to j+k} - c_{i \to \varnothing} - c_{\varnothing \to j} - c_{\varnothing \to k})$$

$$\text{s.t.} \quad \sum_{j : (i \to j) \in \mathcal{M}} x_{i \to j} + \sum_{j,k : (i \to j+k) \in \mathcal{S}} x_{i \to j+k} \le 1 \quad \forall \, i,$$
$$\sum_{i : (i \to j) \in \mathcal{M}} x_{i \to j} + \sum_{i,k : (i \to j+k) \in \mathcal{S}} x_{i \to j+k} \le 1 \quad \forall \, j.$$
All cells not accounted for by either a split or a move are assumed to appear or disappear. Typically there are a few hundred thousand variables (one for each split or move) and a few tens of thousands of constraints (one for each nucleus in one of the two frames). Using a state-of-the-art ILP solver (ILOG CPLEX 12.2¹⁴), this
problem can be solved to global optimality within less than a minute per frame
pair on a standard desktop computer. Note that several frame pairs may trivially
be processed in parallel. The ANN library15 is used for efficiently extracting the k
nearest potential child nuclei of each parent nucleus.
14 http://www-01.ibm.com/software/integration/optimization/cplex-optimizer/
15 http://www.cs.umd.edu/~mount/ANN/
4.5.2. Experimental results
For a quantitative performance evaluation, the tracking was run on the first 25
data volumes of the same DSLM series that was used for the evaluation of the
segmentation (see section 4.4.2), after the cell nuclei had been segmented by the
RGC method. For these datasets, manual ground truth for the tracking was prepared
based on their maximum intensity projection maps: this is a 2D image in which the gray value of the pixel with coordinates (x, y) is set to $\max_z I(x, y, z)$. This
visualization technique is commonly used by biologists analyzing volumetric data,
since the increased contrast simplifies the identification of nuclei, but the price is the
loss of z information and the possible occurrence of occlusions. These shortcomings
render the use of maximum intensity projections ineffectual for later time steps where
the nucleus density becomes too high: hence the restriction to only the first 25
volumes. A cell lineage ground truth was constructed by manually tracking local
intensity maxima in this 2D view over time.16
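In NumPy terms, the projection map is a one-line reduction over the z axis; the function name below is hypothetical.

```python
import numpy as np

def max_intensity_projection(volume):
    """Collapse a 3D volume I(x, y, z), stored with z as the last axis,
    to the 2D map max_z I(x, y, z)."""
    return volume.max(axis=2)
```
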
In order to use this tracking ground truth for the validation of the automated tracking
results, the manually selected local intensity maxima had to be matched to the RGC
segments. This was achieved by globally minimizing the sum of squared distances
between the (x, y) positions of the placed marker and the intensity maximum of its
assigned segment (with a distance cutoff of 20 voxel lengths). This optimization
problem can be formulated as an ILP and solved as in section 4.5.1.¹⁷ This matching is potentially error-prone due to occlusions and the disregard of the z dimension, but these imperfections are unavoidable given the origin of the ground truth.
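As the footnote notes, this particular matching problem can be solved directly with the Kuhn-Munkres algorithm; a minimal sketch using SciPy's implementation follows. The function name `match_markers_to_segments` is hypothetical, and the distance cutoff is realized via a large sentinel cost.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Kuhn-Munkres solver

def match_markers_to_segments(markers, segments, cutoff=20.0):
    """Assign each manually placed (x, y) marker to at most one segment
    position by minimizing the sum of squared distances. Pairs farther
    apart than `cutoff` voxel lengths are forbidden."""
    d2 = ((markers[:, None, :] - segments[None, :, :]) ** 2).sum(-1)
    big = 1e12  # sentinel cost for forbidden pairs
    cost = np.where(d2 <= cutoff ** 2, d2, big)
    rows, cols = linear_sum_assignment(cost)
    # keep only matches that respect the distance cutoff
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < big]
```
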
Event           | Ngt  | N′gt | Ntr  | N′tr | Nci  | Precision | Recall
Moves           | 3280 | 3006 | 3107 | 2941 | 2940 | 100.0 %   | 97.8 %
Splits          | 189  | 159  | 247  | 157  | 136  | 86.6 %    | 86.6 %
Appearances     | 2    | 2    | 107  | 67   | 1    | 1.5 %     | 50.0 %
Disappearances  | 4    | 3    | 181  | 72   | 2    | 2.8 %     | 33.3 %

Table 4.2. – Summary of statistics for the tracking evaluation.
We are interested in both the precision and the recall of the tracking, i.e. which
percentage of detected events are actual, and which percentage of actual events are
detected. Let Ngt denote the number of the different events (moves, splits, appearances, disappearances) in the ground truth, and Ntr denote the number of events
16 The manual ground truth is courtesy of Bernhard X. Kausler.
17 This particular problem can actually be solved more efficiently using e.g. the Kuhn-Munkres algorithm (Munkres, 1957), but the difference is irrelevant for the problem sizes encountered here.
found by the automated tracking. In order to disentangle the imperfections of the tracking from the imperfections of the segmentation, we discard all events for which the parent or one of the children could not be matched to an object in the other set: hence N′gt denotes the number of ground-truth events for which all participating intensity maxima are matched to a segment, and N′tr denotes the number of automated tracking events for which all participating segments are matched to an intensity maximum in the ground truth. If Nci is the number of events that are correctly identified, then the precision is defined as Nci/N′tr and the recall as Nci/N′gt. The results are summarized in Table 4.2. Note that precision and recall have similar values for the
interesting events (moves and splits), while the precision exceeds the recall by far for
the events that are caused by artifacts, i.e. appearances and disappearances.18 This
is unsurprising given the scarcity of these events in the ground-truth data, but it indicates that incorrect appearances and disappearances are introduced far too often by the current tracking procedure.
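The precision and recall definitions above reduce to simple set operations once both event lists are restricted to matchable objects; a minimal sketch with a hypothetical event-tuple representation:

```python
def tracking_scores(gt_events, tr_events):
    """Precision = |correct| / |tracked events|, recall = |correct| /
    |ground-truth events|. Events are hashable tuples such as
    ('move', i, j) or ('split', i, j, k); both sets are assumed to be
    already restricted to events whose objects could be matched."""
    correct = gt_events & tr_events  # N_ci
    return len(correct) / len(tr_events), len(correct) / len(gt_events)
```
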
[Figure 4.12: histogram of the number of occurrences of the signed difference “true” z-distance − “wrong” z-distance, ranging from −10 to 80 voxel lengths.]
Figure 4.12. – Histogram showing the distribution of signed differences between the parent-child z distances for the ground-truth events and the events found by the automated tracking, aggregated over all events for which the ground truth and the automated tracking disagree. For this plot, the position of the maximum intensity voxel is used as the position of each segment.
It should be emphasized that the numbers in Table 4.2 are conservative estimates,
i.e. lower bounds for the actual accuracy of the tracking: Firstly, since the ground
truth is only derived from maximum intensity projections, it cannot handle occlusions
properly. Fig. 4.12 shows that in most of the cases where the automated tracking
18 Apoptosis normally does not occur at this early stage.
[Figure 4.13: maximum intensity projections at time steps 4 and 5.]
Figure 4.13. – Exemplary tracking error for which the daughter in the ground truth is clearly distinct from the daughter proposed by the automated tracking. The background image shows the maximum intensity projection, while the circles indicate the position of the parent nucleus (red), the daughter nucleus according to the ground truth (cyan) and the daughter nucleus proposed by the automated tracking (yellow). The circles are centered at the maximum intensity voxel of the respective segment.
and the ground truth disagree, the child segments according to the ground truth
are more than 30 voxel lengths further away from the parent segment along the z
direction than the child segments that are proposed by the automated tracking. This
indicates that occlusion may be a problem, and that the ground truth may connect
nuclei which have very different z positions. Secondly, mitoses typically span several
time steps, and the exact time point of when a parent nucleus loses its identity
and becomes two separate daughter nuclei is ill-defined. In Table 4.2, it is marked
as a tracking error if the automated tracking places the split one minute earlier or
later than in the ground truth, although such a variation has no biological relevance.
Figs. 4.13 – 4.15 illustrate some typical tracking events that are marked as errors.
Only rarely does the daughter found by the automated tracking appear clearly distinct from the ground-truth daughter in the maximum intensity projection, as in Fig. 4.13. More common is the case that these two segments lie in different z planes
and occlude each other in the projection, as in Fig. 4.14. In some cases the daughter
nucleus is tracked correctly, but an additional daughter is introduced by the tracking,
changing a move into a split event (Fig. 4.15).
[Figure 4.14: maximum intensity projections at time steps 22 and 23.]
Figure 4.14. – Exemplary tracking error for which the two daughter candidates lie in different z-planes and occlude each other. Colors as in Fig. 4.13.
[Figure 4.15: maximum intensity projections at time steps 4 and 5.]
Figure 4.15. – Exemplary tracking error where a move event is mistaken for a split event, by introducing an additional parent-daughter track. Colors as in Fig. 4.13.
Chapter 5.
Final discussion and outlook
5.1. MRSI quantification with spatial context
In chapter 1, different methods for improving the accuracy of the simple single-voxel NLLS fit (AMARES) procedure were studied: it could be shown that imposing
a Bayesian smoothness prior on the final fit parameters (GGMRF model) leads to
small but significant improvements. However, improving the initialization step rather
than the optimization step of NLLS fitting was found to give much higher gains in
quantification accuracy, while requiring much less computation time. For most of
the voxels, it was sufficient to optimize the initialization using only single-voxel information, but spatial smoothing of the initialization shifts was found to increase the
robustness against pronounced spectral artifacts. However, the practical importance
of the latter finding is dubious, as it only achieves significant improvements over
the single-voxel initialization on artifact-ridden spectra that should not be used for
diagnostic purposes anyway. Furthermore, the actual metabolite peak positions are
typically stable across the entire volume: hence it may be sufficient to perform a
global calibration of the fit model (for the whole scan) before fitting the single-voxel
spectra.1 As an additional caveat, the results in section 1.8 should be subjected to
a double-blinded multi-rater evaluation before definite conclusions are drawn.
There is also room for improvement in the MRSI datasets used for this study: a
thorough experimental evaluation should comprise data from more probands and a
higher variety of MR scanners, ideally from a multi-center study in the spirit of the
INTERPRET project (Tate et al., 2006). It is particularly important to add pathological MRSI datasets coming from patients with e.g. tumors or multiple sclerosis,
and to study whether the procedures can deal with the higher variability in these
data. However, obtaining highly resolved spectral images (which are required for
evaluations as performed in this chapter) from tumor patients may be difficult, since
standard MR imaging protocols only comprise moderately resolved MRSI (if any),
1 A plausible approach would be to use a robust estimator for the average minimum of RSS(f) over all voxels, such as the median.
which can be adequately quantified using existing quantification methods such as
AMARES. Due to the long measurement time needed for MRSI scans and the stress
that is thereby caused in the patients, acquiring such high-resolution measurements
from highly diseased and mostly elderly people solely for the purpose of benchmarking quantification procedures may not be ethically defensible. Exploratory studies
about the clinical applicability of high-field MR imaging may provide a way out and
yield suitable high-resolution data as a by-product, since improved spatial resolution is one of the chief reasons for increasing magnetic field strength. It should be
noted that pathologies mainly manifest themselves in the respective signal amplitudes, while the signal frequencies (on which the main smoothness assumptions are
imposed both under the GGMRF and the GCInit model) mainly depend on magnetic field inhomogeneities and shimming problems which should be independent of
biological phenomena such as tumors. Hence it is a plausible conjecture that the
benefits of the GGMRF, SVInit and GCInit quantification schemes carry over to
pathological data, but this needs to be checked experimentally.
A sensible extension of this study would be the comparison with a higher number
of competing quantification schemes. Many concepts such as incorporating a semiparametric baseline for nuisance signals (as in the AQSES approach by Poullet et al.
(2007)) can be combined with both the GGMRF and the initialization procedures.
However, the initialization optimization can also be combined with the QUEST approach of using experimental basis spectra for the fit (Ratiney et al., 2005), while
GGMRF depends on an explicit parametric metabolite model. Particularly worthwhile would be the comparison with the “Bayesian learning” procedure provided in
the LCModel software, as this software is commonly regarded as the current state of
the art in MRSI quantification. Another interesting choice would be the proprietary
quantification routines by the major MR scanner manufacturers such as Siemens or
General Electric, which are typically used in clinical routine. As these are commercial products, they are expensive to obtain and their inner workings are opaque,
which makes their use in methodological studies difficult. A comparison of only the
final fit curves would not provide meaningful insights, as each software uses specific
preprocessing steps, which are seldom reproducible by outsiders.
5.2. Software for MRSI analysis
Chapter 2 describes the first C++ library specifically designed for medical applications which allows principled comparison of classifier performance and significance
testing. This will presumably help automated quality assessment and the conduction
of clinical studies. While the absolute performance statistics of the single classifiers
are most relevant for practical quality control in the clinic, the relative comparisons between different classifiers are interesting from a research-oriented point of
view: for instance, they may answer the question of which out-of-the-box classification
techniques work best for the specific task of MRSI analysis, and can check whether
newly proposed classification techniques give a significant advantage over established
methods. Since quantification-based classifiers may easily be incorporated into the
same framework, it will be possible to study the relative merits of quantification-based techniques as opposed to pattern recognition-based techniques on a large set
of patient data.
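Such relative comparisons are typically settled with a paired significance test on a common test set. The following is an illustrative sketch only, not code from the library described in chapter 2: it applies McNemar's test to hypothetical per-voxel correctness indicators of two classifiers.

```python
import math

def mcnemar(correct_a, correct_b):
    """Continuity-corrected McNemar test on paired correctness indicators.

    correct_a, correct_b: per-voxel booleans stating whether classifier A
    (resp. B) predicted that voxel correctly. Returns (chi2, two-sided p).
    """
    b = sum(ca and not cb for ca, cb in zip(correct_a, correct_b))  # A right, B wrong
    c = sum(cb and not ca for ca, cb in zip(correct_a, correct_b))  # B right, A wrong
    if b + c == 0:
        return 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-squared variable with one degree of freedom.
    p = math.erfc(math.sqrt(stat / 2.0))
    return stat, p

# Toy data: the classifiers disagree on 30 voxels; A wins 20 of them.
correct_a = [True] * 20 + [False] * 10 + [True] * 70
correct_b = [False] * 20 + [True] * 10 + [True] * 70
stat, p = mcnemar(correct_a, correct_b)
print(stat, round(p, 3))
```

With these toy numbers the difference is not significant at the 5 % level, which illustrates how many labeled voxels such comparisons need.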
The design of the library is deliberately restricted to single-voxel classifiers that
predict the malignancy or signal quality of each voxel only based on the appearance
of the spectrum inside this voxel, without considering the context of the surrounding
spectra. The reason for this limitation is that automatic single-voxel classification is
a mature technology whose efficacy has been proved in several independent studies,
e.g. those by Tate et al. (2006), García-Gómez et al. (2009) or Menze et al. (2006).
In contrast, classification with spatial context information has not yet been studied
thoroughly: the two-dimensional conditional random field approach by Görlitz et al.
(2007) is the only one in this direction to date. In that article, the authors achieve
a promising, but moderate improvement in prediction accuracy over single-voxel
classification on a simulated dataset (98.7 % compared to 98.2 %). However, it
is as yet far from clear which kinds of spatial context information may be beneficial
for MRSI classification (2D neighborhoods, 3D neighborhoods, long-range context,
comparison with registered MRI), and this question would have to be solved before
a generic interface for such classifiers could be designed.
As next steps, the visualization and data reporting functionalities should be enhanced in order to improve usability: in particular, a more interpretable visualization
of the statistical results may considerably benefit the medical users (for instance,
plots of ROC curves could be provided, or the meaning of the AUC scores could
be explained verbally). The clinical validation on 3 Tesla MRSI measurements of
brain and prostate carcinomas is scheduled for the immediate future. Furthermore
this software will eventually be integrated into the RONDO software platform for
integrated tumor diagnostics and radiotherapy planning,² where it is planned to be a
major workhorse for MRSI analysis. This will provide a good test for the usefulness
of pattern recognition techniques in a clinical routine setting. Since the RONDO
platform shall serve as a general-purpose tool for the radiological assessment of cancer, it must be tunable to different organ systems and measurement settings even by
non-experts; hence the library is well-suited for this purpose.
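The AUC scores mentioned above can be computed directly from the voxel-wise classifier outputs. This stand-alone sketch (with made-up malignancy scores, not the library's actual implementation) derives the ROC curve by sweeping a threshold and integrates it with the trapezoidal rule:

```python
import numpy as np

def roc_points(scores, labels):
    """ROC curve (FPR, TPR) obtained by sweeping a threshold over the scores."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    y = np.asarray(labels)[order]
    tpr = np.cumsum(y) / max(int(y.sum()), 1)            # true positive rate
    fpr = np.cumsum(1 - y) / max(int((1 - y).sum()), 1)  # false positive rate
    return np.concatenate([[0.0], fpr]), np.concatenate([[0.0], tpr])

def auc(scores, labels):
    """Area under the ROC curve, integrated with the trapezoidal rule."""
    fpr, tpr = roc_points(scores, labels)
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Made-up malignancy scores for six voxels (label 1 = tumor).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
print(auc(scores, labels))
```

The AUC equals the fraction of tumor/healthy voxel pairs that are ranked correctly, which is one way its meaning "could be explained verbally" to medical users.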
² http://www.projekt-dot-mobi.de
5.3. Brain tumor segmentation based on multiple
unreliable annotations
In chapter 3, graphical model formulations were introduced to the task of fusing
noisy manual segmentations: e.g. the model by Raykar et al. (2009) had not been
previously employed in this context, and it was found to improve upon simple logistic
regression on the training data. However, these graphical models do not always have
an advantage over simple baseline techniques: compare the results of the method
by Warfield et al. (2004) to majority voting. Hybrid models combining the aspects
of several models did not fare better than simple models. This ran contrary to
the initial expectations, which were based on two assumptions: that different pixels
have a different probability of being mislabeled, and that it is possible to detect these
pixels based on the visual content (these pixels would be assigned high scores far away
from the decision boundary). This may be an artifact of the time-constrained labeling
experiment: if misclassifications can be attributed mostly to chance or carelessness
rather than to ignorance or visual ambiguity, these assumptions obviously do not
hold, and a uniform noise model as in Warfield et al. (2004) or
Raykar et al. (2009) should be used instead. It is furthermore not yet understood
why the slight model change between hybrid models 1 / 2 and hybrid models 3 / 4
leads to the observed failure of inference. For the future, it should be checked if these
effects arise from the use of an approximate inference engine or are inherent to these
models: hence unbiased Gibbs sampling results should be obtained for comparison
purposes, using e.g. the WinBUGS modelling environment (Lunn et al., 2000).
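For intuition, the binary rater model of Warfield et al. (2004) can be written as a short EM loop. This is a toy sketch under simplifying assumptions (independent raters, uniform per-rater noise), not the authors' implementation:

```python
import numpy as np

def staple_binary(votes, n_iter=50):
    """Toy EM label fusion in the spirit of Warfield et al. (2004).

    votes: (n_raters, n_pixels) array of 0/1 annotations.
    Returns (posterior tumor probabilities w, sensitivities p, specificities q).
    """
    votes = np.asarray(votes, dtype=float)
    n_raters, n_pixels = votes.shape
    w = votes.mean(axis=0)                  # init from the majority-vote fraction
    p = np.full(n_raters, 0.8)              # per-rater sensitivities
    q = np.full(n_raters, 0.8)              # per-rater specificities
    prior = float(np.clip(w.mean(), 1e-6, 1 - 1e-6))
    for _ in range(n_iter):
        # E-step: posterior probability that each pixel is tumor, given all votes.
        log_pos = np.log(prior) + (
            votes * np.log(p[:, None]) + (1 - votes) * np.log(1 - p[:, None])
        ).sum(axis=0)
        log_neg = np.log(1 - prior) + (
            (1 - votes) * np.log(q[:, None]) + votes * np.log(1 - q[:, None])
        ).sum(axis=0)
        m = np.maximum(log_pos, log_neg)
        w = np.exp(log_pos - m) / (np.exp(log_pos - m) + np.exp(log_neg - m))
        # M-step: re-estimate rater performance and tumor prevalence.
        p = np.clip((votes * w).sum(axis=1) / w.sum(), 1e-6, 1 - 1e-6)
        q = np.clip(((1 - votes) * (1 - w)).sum(axis=1) / (1 - w).sum(),
                    1e-6, 1 - 1e-6)
        prior = float(np.clip(w.mean(), 1e-6, 1 - 1e-6))
    return w, p, q

# Three raters label six pixels; rater 2 disagrees on pixels 2 and 3.
votes = np.array([[1, 1, 1, 0, 0, 0],
                  [1, 1, 1, 0, 0, 0],
                  [1, 1, 0, 1, 0, 0]])
w, p, q = staple_binary(votes)
print(np.round(w, 2))
```

On this toy input the EM loop downweights the dissenting rater, whereas plain majority voting would treat all three raters as equally reliable.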
The use of simulated data for the evaluation is the main limitation of this approach,
as simulations always present a simplification of reality and cannot account for all
artifacts and other causes for image ambiguity that are encountered in real-world
data. However, this limitation is practically unavoidable, since we are assessing
the imperfections of the currently best clinical practice for the precise delineation
of brain tumors, namely manual segmentation of MR images by human experts.
This assessment requires a superior gold standard by which the human annotations
may be judged, and this can only be obtained from an in silico ground truth. For
animal studies, a possible alternative lies in sacrificing the animals and delineating
the tumor on histological slices which can be examined with better spatial resolution.
However, these kinds of studies are costly and raise ethical concerns. Additionally,
even expert pathologists often differ considerably in their assessment of histological
images (Giannini et al., 2001).
Better segmentations could presumably be achieved by two extensions: more informative features could be obtained by registration of the patient images to a brain
atlas, e.g. in the spirit of Schmidt et al. (2005). An explicit spatial regularization
could be achieved by adding an MRF prior on the latent labels or scores, and employing a mean-field approximation (Zhang, 1992) to jointly estimate the optimum
segmentation and the model parameters.
5.4. Live-cell microscopy image analysis
Chapter 4 compares two alternative approaches for segmenting cell nuclei in DSLM
images of zebrafish embryos, a fully automated approach that uses prior knowledge about the nucleus shape, and an interactive approach that does not account
for shape. It establishes that there is no clear advantage of one approach over the
other: While the fully automated method is more susceptible to oversegmentation
(erroneous fragmentation of nuclei), the semiautomated method rather encounters
problems with undersegmentation (erroneous merging of distinct nuclei). These results hold even when the classifier that forms the core of the interactive method
is applied to an image volume other than the one it was trained on. Furthermore,
the chapter presents a new method for tracking nuclei over time, which uses integer
linear programming for finding a jointly optimal association between segments at
different time points, and shows that it correctly assigns around 90 % of all matches,
as compared against manual ground truth.
At the current stage, neither the segmentation nor the tracking is of sufficient quality
to reconstruct an entire cell lineage of D. rerio over several hours. Since the accuracy
of the tracking is limited by the accuracy of the segmentation, and since tracking
errors accumulate over time, an accuracy of over 99.9 % would be required to keep
the accuracy of the entire lineage tree over 90 %, when it is constructed from 100
time steps. However, the recall values of both segmentation and tracking lie below
90 %: hence their error rates still need to be reduced by a factor of 100. The
two segmentation methods studied in this chapter (ILASTIK and RGC) do not
significantly differ with respect to quality. However, the current accuracy may be
sufficient to answer biological questions that are concerned with average values over
ensembles of cells and do not need to account for the precise fate of every single cell:
e.g. how the average cell motion speed changes over time, or how it is affected by
different genetic mutations.
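The required accuracy follows from simple error compounding, under the assumption made above that errors at different time steps are independent:

```python
# Lineage accuracy over n independent steps equals the per-step accuracy
# raised to the n-th power (cf. the 99.9 % estimate in the text).
n_steps = 100
per_step = 0.999
lineage_acc = per_step ** n_steps
print(round(lineage_acc, 3))            # just above the 90 % target

# Conversely, the minimal per-step accuracy for 90 % lineage accuracy:
required = 0.9 ** (1.0 / n_steps)
print(round(required, 5))
```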
The gravest problem of the tracking procedure is the relatively high number of erroneous appearance and disappearance events. These cause discontinuities in the
cell lineage which preclude the long-term analysis of cell fate. The reason lies in
the limitations of the greedy frame-by-frame processing approach that is currently
employed for the tracking. While this is well-suited for quickly reducing the problem
size and finding all obvious associations between nuclei, it cannot handle artifacts
or ambiguous cases where information from more time points needs to be used to
find the correct matching. For instance, if one nucleus is missed by the segmentation in a particular time step, this leads to a spurious appearance event for its daughter nucleus.
The optimal adoption of appearing nuclei by grandparent nuclei from more than one
time step earlier can be found by solving an ILP problem similar to the one used for
the frame-by-frame tracking.
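For the special case without divisions, appearances or disappearances, the frame-to-frame association reduces to a classical assignment problem. This simplified sketch (the tracking described in chapter 4 solves a richer ILP) applies the Hungarian algorithm to squared centroid distances, with hypothetical coordinates:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_frames(pos_t, pos_t1, max_dist=10.0):
    """Toy one-to-one matching of nucleus centroids between two time frames.

    The actual tracking solves a richer ILP that also models divisions,
    appearances and disappearances; this sketch covers only the special case
    of one-to-one moves, minimizing summed squared centroid distances.
    """
    pos_t = np.asarray(pos_t, dtype=float)
    pos_t1 = np.asarray(pos_t1, dtype=float)
    cost = ((pos_t[:, None, :] - pos_t1[None, :, :]) ** 2).sum(axis=2)
    rows, cols = linear_sum_assignment(cost)   # Hungarian algorithm
    # Matches that are too far apart are treated as dis-/appearance events.
    return [(int(r), int(c)) for r, c in zip(rows, cols)
            if cost[r, c] <= max_dist ** 2]

# Hypothetical centroids: two nuclei move slightly; a third appears far away.
parents = [(0.0, 0.0, 0.0), (5.0, 5.0, 5.0)]
children = [(0.5, 0.0, 0.0), (5.0, 5.5, 5.0), (50.0, 50.0, 50.0)]
print(match_frames(parents, children))
```

The distance cutoff plays the role of the appearance/disappearance penalties in the full ILP: the far-away blob is left unmatched instead of being forced into an implausible association.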
Other promising approaches for improvement include:
• Additional features for cell matching: Only the nucleus position is currently used for the frame-to-frame association, as this is an easily interpretable
criterion which is also used by human annotators. To resolve ambiguous cases,
it may be useful to include e.g. the segment volume or the average intensity,
as these values can be expected to vary little over time within one nucleus.
• Automated cell cycle phase classification: Particularly apoptosis and cell
division events can be identified with high confidence by biological experts,
based on the characteristic appearance of the cells. For 2D cell microscopy images, Wang et al. (2008) were already able to predict the cell cycle phase based
on shape descriptors with 99 % accuracy using statistical learning techniques.
If similar accuracy rates could be achieved for the 3D DSLM images, reduced
confusion between split and move events could be expected.
• Use of motion information: The current optimization objective is to achieve
a low squared distance between the position of each parent nucleus at time t
and the position(s) of its daughter(s) at time t + 1. Since the cells are moving,
it is more plausible to extrapolate the trajectory of the parent nucleus to time
t + 1, and to minimize the distance between the daughter position(s) and the
extrapolated parent position instead. This could be achieved using stochastic
motion modelling as in (Li et al., 2008b), but as a simpler alternative one could
also fit a low-parametric model (e.g. a straight line) to the nucleus trajectory
in the previous few time steps.
• Interleaved segmentation and tracking: The current segmentation uses
no temporal information. However, for determining whether an ambiguous
image patch belongs to the foreground or to the background, it may help to
know whether or not a nucleus exists at the same position in the previous time
step. This information could be incorporated e.g. by propagating the positions
of the nuclei found at time t to the following time step (according to some
motion model) and adding a potential to the regularized graph cut objective
that encourages the new segments to lie close to the previous segments.
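The low-parametric motion model suggested in the third bullet can be as simple as a per-coordinate straight-line fit; a minimal sketch with a hypothetical trajectory:

```python
import numpy as np

def extrapolate_position(track, n_fit=3):
    """Predict the next centroid by a straight-line fit to recent positions.

    track: (T, d) sequence of centroids at times 0..T-1. A degree-1
    least-squares fit per coordinate over the last n_fit points is
    evaluated one step beyond the fitting window.
    """
    track = np.asarray(track, dtype=float)
    pts = track[-n_fit:]
    t = np.arange(len(pts))
    pred = [np.polyval(np.polyfit(t, pts[:, d], deg=1), len(pts))
            for d in range(track.shape[1])]
    return np.array(pred)

# A hypothetical nucleus drifting with constant velocity (1, 0, 2) per step:
print(extrapolate_position([(0, 0, 0), (1, 0, 2), (2, 0, 4)]))
```

Minimizing the distance between each daughter and this extrapolated position, instead of the raw parent position, only changes the cost matrix of the association problem, so the optimization machinery stays the same.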
Further concern is warranted about the reliability of the ground truth that was used
for assessing the accuracies of both segmentation and tracking. For the imagery that
is analyzed here, both segmentation and tracking are ill-defined tasks. After time step
50, the foreground / background contrast becomes so low that the decision whether
to label a particular patch as nucleus or background becomes highly subjective. Nor
is it then a clear-cut decision whether two bright blobs belong to one single segment,
or two separate segments. Possible shortcomings of the tracking ground truth in
the presence of occlusions were already mentioned. Furthermore, if there are several
potential daughter candidates with similar distances from a parent nucleus, there
exists no reliable criterion even for human raters by which the correct association
could be determined. A remedy may be the random expression of fluorescent markers
in the spirit of the Brainbow project in neurobiology (Livet et al., 2007): if only a few
nuclei emit fluorescence light at a particular wavelength, they are easier to identify
at subsequent time points. Additionally, one could accept that there are
unavoidable uncertainties in the reconstruction of the cell lineage, and convey to the
biologist which parts of the lineage are certain and which are ambiguous.
List of Symbols and Expressions
Acronyms
AMM   Adaptive Mixture Model
AUC   Area Under Curve
Cho   Choline
CPD   Conditional Probability Distribution
Cre   Creatine
CSF   Cerebro-Spinal Fluid
CSI   Chemical Shift Imaging
CT   Computer Tomography
DCE   Dynamic Contrast Enhancement
DRF   Discriminative Random Field
DS   Data Set
DSLM   Digital Scanned Light Sheet Microscopy
DWI   Diffusion-Weighted Imaging
EM   Expectation Maximization
FCM   Fuzzy c-Means
FID   Free Induction Decay
FLAIR   Fluid-Attenuated Inverse Recovery
FOV   Field of View
GC   Graph Cut
Gd   Gadolinium
GFP   Green Fluorescent Protein
GGMRF   Generalized Gaussian Markov Random Field
GM   Gray Matter
GMM   Gaussian Mixture Model
GMRF   Gaussian Markov Random Field
GPU   Graphical Processing Unit
HSVD   Hankel Singular Value Decomposition
ICM   Iterated Conditional Modes
ILASTIK   Interactive Learning and Segmentation Toolkit
IR   Inversion Recovery
KL   Kullback-Leibler
LSFM   Light sheet-based fluorescence microscopy
MAP   Maximum a posteriori
MCMC   Markov Chain Monte Carlo
MRF   Markov Random Field
MRI   Magnetic Resonance Imaging
MR   Magnetic Resonance
MRSI   Magnetic Resonance Spectroscopic Imaging
MRS   Magnetic Resonance Spectroscopy
NAA   N-acetylaspartate
NNPM   Nearest Neighbor Pattern Matching
PCR   Principal Components Regression
PDE   Partial Differential Equation
PD   Protium Density
PET   Positron Emission Tomography
p.f.   post fertilisationem
ppm   parts per million
PRESS   Point-Resolved Spectroscopy
RBF   Radial Basis Function
RF   Random Forest
RFr   Radio-Frequency
RGC   Regularized Graph Cut
ROC   Receiver Operating Characteristic
RR   Ridge Regression
RSS   Residual Sum of Squares
SE   Spin Echo
SGF   Statistical Geometric Features
SPECT   Single-Photon Emission Computer Tomography
SP   Spatial regularization
SQ   Signal Quality
STAPLE   Simultaneous Truth and Performance Level Estimation
SVD   Singular Value Decomposition
SVM   Support Vector Machine
SVRF   Support Vector Random Field
SV   Single Voxel
SWA   Segmentation by Weighted Aggregation
TE   Echo Time
TR   Repetition Time
TU   totally unimodular
VC   Voxel Class
VMP   Variational Message Passing
WM   White Matter
Greek Symbols
γ   Gyromagnetic ratio
µ   Magnetic moment
Latin Symbols
B0 / B1   Static longitudinal / oscillating transverse magnetic field
f   Frequency
I   Nuclear spin quantum number
M0   Equilibrium magnetization
M⊥ / M∥   Transverse / longitudinal magnetization
T1 / T2   Spin-lattice / spin-spin relaxation time
List of Figures
1.1. Exemplary MRSI spectrum in the time and frequency domain . . . 24
1.2. Example spectrum with SV fit and several GGMRF fits . . . 34
1.3. Example spectra for the different signal quality labels . . . 37
1.4. Absolute and relative accuracy improvement for SV and GGMRF vs. voxel resolution . . . 38
1.5. Exemplary spectra showing the reasons for poor NLLS fits . . . 39
1.6. Magnitude spectra subgrid with expected peak positions, showing a systematic shift . . . 40
1.7. Exemplary spectra for the benefits of single-voxel and regularized initialization . . . 42
1.8. Percentage of “good” fits among all for three different initialization schemes, plotted against the in-plane resolution . . . 45
1.9. NLLS quantification times without initialization and with SV and regularized initialization . . . 46
2.1. Exemplary MRSI magnitude spectra of the brain . . . 49
2.2. User interface for the labeling functionality of the MRSI data . . . 59
2.3. User interface for classifier training and testing . . . 61
2.4. Evaluation results for an exemplary training and testing suite . . . 62
2.5. Exemplary application of a trained classifier to a new dataset . . . 63
2.6. UML diagram of the classification functionality of the software library . . . 65
2.7. UML diagram of the preprocessing functionality . . . 68
2.8. UML diagram of the parameter tuning functionality . . . 69
2.9. UML diagram of the statistical evaluation functionality . . . 73
2.10. UML diagram of the data input / output functionality . . . 74
3.1. Exemplary Bayesian network . . . 90
3.2. Exemplary segmentations of a real-world brain tumor image by a single expert radiologist, based on different imaging modalities . . . 112
3.3. Graphical model representations of the previously proposed fusion algorithms . . . 115
3.4. Newly proposed hybrid models . . . 116
3.5. Exemplary slices of the three simulated tumor classes . . . 119
3.6. Sensitivities and specificities for logistic regression with different feature sets . . . 121
3.7. Comparison of ground-truth and inferred posterior tumor probabilities for simulated brain tumor images . . . 125
3.8. Exemplary FLAIR slice with inferred mean posterior tumor probability maps for multiple different inference methods . . . 126
4.1. Exemplary slice of a DSLM zebrafish image . . . 129
4.2. Memory requirements for the different feature sets . . . 145
4.3. Comparison of two variable importance measures for the first feature selection iteration . . . 146
4.4. Object detection accuracy measures of the ILASTIK segmentations for different feature set sizes . . . 148
4.5. Voxelwise accuracy measures of the ILASTIK segmentations for different feature set sizes . . . 149
4.6. Oversegmentation and undersegmentation measures of the ILASTIK segmentations for different feature set sizes . . . 150
4.7. Effect of varying the segmentation threshold on the F1 measure, for the training datasets . . . 151
4.8. Effect of varying the segmentation threshold on the occurrence of oversegmentation (split rate), for the training datasets . . . 151
4.9. Effect of varying the segmentation threshold on the occurrence of undersegmentation (merge rate), for the training datasets . . . 151
4.10. Comparison of training and testing datasets for the counting accuracy measures . . . 153
4.11. Comparison of RGC and ILASTIK with respect to voxelwise accuracy measures . . . 154
4.12. Differences between the parent-child z distances for the ground-truth events and the tracking events . . . 157
4.13. Representative example for tracking errors 1: Spatially distinct daughter candidates . . . 158
4.14. Representative example for tracking errors 2: Occlusion . . . 159
4.15. Representative example for tracking errors 3: Move tracked as split . . . 159
List of Tables
1.1. Voxel and FOV sizes (constant slice thickness) . . . 32
1.2. Voxel and FOV sizes (isotropic) . . . 32
1.3. Percentage of SV and GGMRF fits that are labeled as “good” by the two raters . . . 38
1.4. Ratio of good NLLS fits for three different initialization schemes, for all spectra . . . 44
1.5. Ratio of good NLLS fits for three different initialization schemes, for artifact-free spectra . . . 45
2.1. Search grid for automated classifier parameter selection . . . 76
2.2. Evaluation statistics for signal quality classifiers on dataset 1 . . . 77
2.3. Evaluation statistics for signal quality on dataset 2 . . . 77
2.4. Evaluation statistics for voxel class classifiers on dataset 2 . . . 77
3.1. Tested image features . . . 120
3.2. Evaluation statistics for the training data under the 120/120/90 scenario . . . 123
3.3. Evaluation statistics for the test data under the 120/120/90 scenario . . . 124
4.1. Order of ILASTIK feature elimination from the active set . . . 144
4.2. Summary of statistics for the tracking evaluation . . . 156
Bibliography
T. Achterberg, T. Koch, A. Martin (2006). “MIPLIB 2003.” Operations Research
Letters, 34(4), 361–372. The current state of which problems are solved can be
found at http://miplib.zib.de/miplib2003.php.
O. Al-Kofahi, R. Radke, S. Goderie, et al. (2006). “Automated Cell Lineage Construction.” Cell Cycle, 5(3), 327–335.
Y. Al-Kofahi, W. Lassoued, W. Lee, et al. (2010). “Improved Automatic Detection
and Segmentation of Cell Nuclei in Histopathology Images.” IEEE Transactions
on Biomedical Engineering, 57(4), 841–852.
C. Andrieu, N. De Freitas, A. Doucet, et al. (2003). “An introduction to MCMC for
machine learning.” Machine Learning, 50(1), 5–43.
S. Arya, D. Mount, N. Netanyahu, et al. (1998). “An Optimal Algorithm for Approximate Nearest Neighbor Searching.” Journal of the ACM, 45, 891–923.
B. Aspvall, R. Stone (1980). “Khachiyan’s Linear Programming Algorithm.” Journal
of Algorithms, 1, 1–13.
J. Attenberg, K. Weinberger, A. Dasgupta, et al. (2009). “Collaborative Email-Spam
Filtering with Consistently Bad Labels using Feature Hashing.” In: Conference
on Email and Anti-Spam (CEAS).
G. Bakir, R. Hofmann, B. Schölkopf, et al. (eds.) (2007). Predicting Structured Data.
MIT Press.
A. Bandos, H. Rockette, D. Gur (2007). “Exact Bootstrap Variances of the Area
Under ROC curve.” Communications in Statistics: Theory and Methods, 36,
2443–2461.
Y. Bao, A. Maudsley (2007). “Improved Resolution for MR Spectroscopic Imaging.”
IEEE Transactions on Medical Imaging, 26(5), 686–695.
Z. Bao, J. Murray, T. Boyle, et al. (2006). “Automated cell lineage tracing in
Caenorhabditis elegans.” Proceedings of the National Academy of Sciences, 103(8),
2707–2712.
Y. Bengio (2009). “Learning Deep Architectures for AI.” Foundations and Trends
in Machine Learning, 2(1), 1–127.
Y. Bengio, Y. Grandvalet (2004). “No Unbiased Estimator of the Variance of K-Fold
Cross-Validation.” Journal of Machine Learning Research, 5, 1089–1105.
J. Besag (1986). “On the statistical analysis of dirty pictures.” Journal of the Royal
Statistical Society B (Methodological), 48(3), 259–302.
C. Bishop (1994). “Neural networks and their applications.” Reviews of Scientific
Instruments, 65(6), 1803–1832.
H. Bodlaender (1992). “A Tourist Guide through Treewidth.” Tech. Rep. RUU-CS-92-12, Utrecht University.
H. Bodlaender, A. Koster (2010a). “Treewidth computations I: Upper bounds.”
Information and Computation, 208(3), 259–275.
H. Bodlaender, A. Koster (2010b). “Treewidth Computations II: Lower Bounds.”
Tech. Rep. UU-CS-2010-022, Utrecht University.
P. Bottomley (1987). “Spatial Localization in NMR Spectroscopy in Vivo.” Annals
of the New York Academy of Sciences, 508, 333–348.
C. Bouman, K. Sauer (1993). “A generalized Gaussian image model for edge-preserving MAP estimation.” IEEE Transactions on Image Processing, 2(3), 296–310.
P. Bourgine, R. Čunderlík, O. Drblíková-Stašová, et al. (2010). “4D embryogenesis
image analysis using PDE methods of image processing.” Kybernetika, 46(2),
226–259.
Y. Boykov, V. Kolmogorov (2004). “An Experimental Comparison of Min-Cut/Max-Flow Algorithms for Energy Minimization in Vision.” IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9), 1124–1137.
Y. Boykov, O. Veksler, R. Zabih (2001). “Fast Approximate Energy Minimization via
Graph Cuts.” IEEE Transactions on Pattern Analysis and Machine Intelligence,
23(11), 1222–1239.
L. Breiman (1996). “Out-of-Bag Estimation.” Tech. Rep., UC Berkeley.
L. Breiman (2001). “Random Forests.” Machine Learning, 45(1), 5–32.
J. Broderick, S. Narayan, M. Gaskill, et al. (1996). “Volumetric measurement of
multifocal brain lesions.” Journal of Neuroimaging, 6, 36–43.
W. Buntine (1994). “Operations for Learning with Graphical Models.” Journal of
Artificial Intelligence Research, 2, 159–225.
C. Burges (1998). “A Tutorial on Support Vector Machines for Pattern Recognition.”
Data Mining and Knowledge Discovery, 2(2), 121–167.
R. Caruana, N. Karampatziakis, A. Yessenalina (2008). “An Empirical Evaluation
of Supervised Learning in High Dimensions.” In: International Conference on
Machine Learning (ICML), 96 – 103.
R. Caruana, A. Niculescu-Mizil (2006). “An Empirical Comparison of Supervised
Learning Algorithms.” In: International Conference on Machine Learning (ICML),
161–168.
J. Cates, A. Lefohn, R. Whitaker (2004). “GIST: an interactive, GPU-based level set
segmentation tool for 3D medical images.” Medical Image Analysis, 8(3), 217–231.
J. Cates, R. Whitaker, G. Jones (2005). “Case study: an evaluation of user-assisted
hierarchical watershed segmentation.” Medical Image Analysis, 9(6), 566–578.
M. Chalfie, Y. Tu, G. Euskirchen, et al. (1994). “Green fluorescent protein as a
marker for gene expression.” Science, 263(5148), 802–805.
A. Chan, A. Lau, A. Pirzkall, et al. (2004). “Proton magnetic resonance spectroscopy
imaging in the evaluation of patients undergoing gamma knife surgery for Grade
IV glioma.” Journal of Neurosurgery, 101, 467–475.
C. Chang, C. Lin (2001). “LIBSVM: a library for support vector machines.” Software
available at http://www.csie.ntu.tw/~cjlin/libsvm.
O. Chapelle, B. Schölkopf, A. Zien (eds.) (2006). Semi-Supervised Learning. MIT
Press.
S. Cho, M. Kim, H. Kim, et al. (2001). “Chronic hepatitis: in vivo proton MR
spectroscopic evaluation of the liver and correlation with histopathologic findings.”
Radiology, 221(3), 740–746.
P. Clifford (1990). “Markov random fields in statistics.” In: G. Grimmett, D. Welsh
(eds.), Disorder in Physical Systems. A Volume in Honour of John M. Hammersley.
Oxford University Press, Oxford.
D. Cobzas, N. Birkbeck, M. Schmidt, et al. (2007). “3D variational brain tumor
segmentation using a high dimensional feature set.” In: International Conference
on Computer Vision (ICCV 2007).
B. Cohen, E. Knopp, H. Rusinek, et al. (2005). “Assessing Global Invasion of Newly
Diagnosed Glial Tumors with Whole-Brain Proton MR Spectroscopy.” American
Journal of Neuroradiology, 26, 2170–2177.
T. Coleman, Y. Li (1996). “An interior trust-region approach for nonlinear minimization subject to bounds.” SIAM Journal on Optimization, 6, 418–445.
J. Colinge, K. Bennett (2007). PLoS Computational Biology, 3(7), e114.
O. Commowick, S. Warfield (2010). “Incorporating Priors on Expert Performance Parameters for Segmentation Validation and Label Fusion: A Maximum a Posteriori
STAPLE.” In: T. Jiang, et al. (eds.), Proceedings of the 13th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI 2010), Part III, Lecture Notes in Computer Science, vol. 6363/2010, 25–32.
Springer, Berlin.
R. D. Cook, C.-L. Tsai, B. C. Wei (1986). “Bias in nonlinear regression.” Biometrika,
73(3), 615–623.
J. Corso, E. Sharon, S. Dube, et al. (2008). “Efficient multilevel brain tumor segmentation with integrated bayesian model classification.” IEEE Transactions on
Medical Imaging, 27(5), 629–640.
J. Corso, E. Sharon, A. Yuille (2006). “Multilevel segmentation and integrated
Bayesian model classification with an application to brain tumor segmentation.”
In: Medical Image Computing and Computer-Assisted Interventions (MICCAI),
Lecture Notes in Computer Science, vol. 4191, 790–798.
J. Corso, A. Yuille, N. Sicotte, et al. (2007). “Detection and Segmentation of Pathological Structures by the Extended Graph-Shifts Algorithm.” In: Medical Image Computing and Computer-Assisted Intervention (MICCAI), Lecture Notes in
Computer Science, vol. 4791/2007, 985–993. Springer.
A. Croitor Sava, D. Sima, J. Poullet, et al. (2009). “Exploiting spatial information
to estimate metabolite levels in 2D MRSI of heterogeneous brain lesions.” Tech.
Rep. ESAT-SISTA 09-182, Katholieke Universiteit Leuven.
S. Dager, N. Oskin, T. Richards, et al. (2008). “Research Applications of Magnetic
Resonance Spectroscopy (MRS) to Investigate Psychiatric Disorders.” Topics in
Magnetic Resonance Imaging, 19(2), 81–96.
G. Dantzig (1949). “Programming of Interdependent Activities II: Mathematical
Model.” Econometrica, 17(3/4), 200–211.
F. S. de Edelenyi, C. Rubin, F. Estève, et al. (2000). “A new approach for analyzing proton magnetic resonance spectroscopic images of brain tumors: nosologic
images.” Nature Medicine, 6, 1287–1289.
R. de Graaf (2008). In Vivo NMR Spectroscopy: Principles and Techniques. Wiley,
New York.
L. DeAngelis, J. Loeffler, A. Mamelak (2007). “Primary and metastatic brain tumors.” In: R. Pazdur, L. Wagman, K.A.Camphausen, et al. (eds.), Cancer Management: A Multidisciplinary Approach. CMP Healthcare Media, San Francisco
CA.
J. Debnam, L. Ketonen, L. Hamberg, et al. (2007). “Current Techniques Used for
the Radiological Assessment of Intracranial Neoplasms.” Archives of Pathology
and Laboratory Medicine, 131, 252–260.
A. Dempster, N. Laird, D. Rubin, et al. (1977). “Maximum Likelihood from Incomplete Data via the EM Algorithm.” Journal of the Royal Statistical Society. Series
B (Methodological), 39(1), 1–38.
J. Demšar (2006). “Statistical comparisons of classifiers over multiple data sets.”
Journal of Machine Learning Research, 7, 1–30.
T. Dietterich (1998). “Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms.” Neural Computation, 10, 1895–1923.
W. Dou, S. Ruan, Y. Chen, et al. (2007). “A framework of fuzzy information fusion
for the segmentation of brain tumor tissues on MR images.” Image and Vision
Computing, 25(2), 164–171.
M. Droske, B. Meyer, M. Rumpf, et al. (2005). “An adaptive level set method for
interactive segmentation of intracranial tumors.” Neurological Research, 27(4),
363–370.
A. Dufour, V. Shinin, S. Tajbakhsh, et al. (2005). “Segmenting and tracking fluorescent cells in dynamic 3-D microscopy with coupled active surfaces.” IEEE Transactions on Image Processing, 14(9), 1396–1410.
J. Duncan, N. Ayache (2000). “Medical Image Analysis: Progress over Two Decades and the Challenges Ahead.” IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1), 85–106.
W. Edelstein, G. Glover, C. Hardy, et al. (1986). “The Intrinsic Signal-to-Noise Ratio
in NMR Imaging.” Magnetic Resonance in Medicine, 3, 604–618.
R. Fabbri, L. D. F. Costa, J. Torelli, et al. (2008). “2D Euclidean Distance Transform
Algorithms: A Comparative Survey.” ACM Computing Surveys, 40(1), 2:1–2:44.
A. Farhangfar, R. Greiner, C. Szepesvári (2009). “Learning to Segment from a Few
Well-Selected Training Images.” In: International Conference on Machine Learning
(ICML), 305–312.
T. Fawcett (2006). “An introduction to ROC analysis.” Pattern Recognition Letters,
27(8), 861–874.
L. M. Fletcher-Heath, L. O. Hall, D. B. Goldgof, et al. (2001). “Automatic segmentation of non-enhancing brain tumors in magnetic resonance images.” Artificial
Intelligence in Medicine, 21(1-3), 43–63.
Y. Freund, R. Schapire (1999). “A Short Introduction to Boosting.” Journal of the
Japanese Society for Artificial Intelligence, 14(5), 771–780.
M. Frigo, S. Johnson (2005). “The Design and Implementation of FFTW3.” Proceedings of the IEEE, 93(2), 216–231.
J. García-Gomez, J. Luts, M. Julià-Sapé, et al. (2009). “Multiproject-multicenter
evaluation of automatic brain tumor classification by magnetic resonance spectroscopy.” Magnetic Resonance Materials in Physics, Biology and Medicine, 22,
5–18.
A. E. Gelfand, A. F. Smith (1990). “Sampling-Based Approaches to Calculating
Marginal Densities.” Journal of the American Statistical Association, 85(410),
398–409.
S. Geman, D. Geman (1984). “Stochastic relaxation, Gibbs distributions and the
Bayesian restoration of images.” IEEE Transactions on Pattern Analysis and
Machine Intelligence, 6, 721–741.
A. Genovesio, T. Liedl, V. Emiliani, et al. (2006). “Multiple Particle Tracking in 3D+t Microscopy: Method and Application to the Tracking of Endocytosed Quantum Dots.” IEEE Transactions on Image Processing, 15(5), 1062–1070.
D. Gering (2003). “Diagonalized Nearest Neighbor Pattern Matching for Brain Tumor Segmentation.” In: Medical Image Computing and Computer-Assisted Intervention (MICCAI), Lecture Notes in Computer Science, vol. 2879/2003, 670–677.
Springer.
D. Gering, W. Grimson, R. Kikinis (2002). “Recognizing Deviations from Normalcy for Brain Tumor Segmentation.” In: Medical Image Computing and
Computer-Assisted Intervention (MICCAI), Lecture Notes in Computer Science,
vol. 2488/2002, 388–395. Springer.
C. Giannini, B. Scheithauer, A. Weaver, et al. (2001). “Oligodendrogliomas: reproducibility and prognostic value of histologic diagnosis and grading.” Journal of
Neuropathology & Experimental Neurology, 60(3), 248.
P. Gibbs, D. Buckley, S. Blackband, et al. (1996). “Tumour volume determination from MR images by morphological segmentation.” Physics in Medicine and Biology, 41, 2437–2446.
W. Gilks, A. Thomas, D. Spiegelhalter (1994). “A language and program for complex
Bayesian modelling.” The Statistician, 43, 169–178.
R. Gillies, D. Morse (2005). “In Vivo Magnetic Resonance Spectroscopy in Cancer.”
Annual Review of Biomedical Engineering, 7, 287–326.
G. Golub, M. Heath, G. Wahba (1979). “Generalized Cross-Validation as a Method
for Choosing a Good Ridge Parameter.” Technometrics, 21(2), 215–223.
G. Golub, V. Pereyra (2003). “Separable nonlinear least squares: the variable projection method and its applications.” Inverse Problems, 19, R1–R26.
H. González-Vélez, M. Mier, M. Julià-Sapé, et al. (2009). “HealthAgents: distributed
multi-agent brain tumor diagnosis and prognosis.” Applied Intelligence, 30, 191–
202.
L. Görlitz, B. H. Menze, M.-A. Weber, et al. (2007). “Semi-Supervised Tumor Detection in Magnetic Resonance Spectroscopic Images Using Discriminative Random
Fields.” In: Proceedings of the DAGM 2007, Lecture Notes in Computer Science,
vol. 4713/2007, 224–233.
V. Govindaraju, K. Young, A. Maudsley (2000). “Proton NMR chemical shifts and
coupling constants for brain metabolites.” NMR in Biomedicine, 13, 129–153.
Y. Grandvalet, Y. Bengio (2006). “Hypothesis Testing for Cross-Validation.” Tech.
Rep. TR 1285, Département d’Informatique et Recherche Opérationelle, University
of Montréal.
I. Guyon, A. Elisseeff (2003). “An Introduction to Variable and Feature Selection.”
Journal of Machine Learning Research, 3, 1157–1182.
G. Hagberg (1998). “From magnetic resonance spectroscopy to classification of tumors: A review of pattern recognition methods.” NMR in Biomedicine, 11(4–5),
148–156.
R. Harmouche, L. Collins, D. Arnold, et al. (2006). “Bayesian MS Lesion Classification Modeling Regional and Local Spatial Information.” In: 18th International
Conference on Pattern Recognition (ICPR).
T. Hastie, R. Tibshirani, J. Friedman (2009). The Elements of Statistical Learning.
Springer, New York.
R. He, P. Narayana (2002). “Automatic delineation of Gd enhancements on magnetic
resonance images in multiple sclerosis.” Medical Physics, 29, 1536–1546.
A. Henning, A. Fuchs, J. Murdoch, et al. (2009). “Slice-selective FID acquisition,
localized by outer volume suppression (FIDLOVS) for 1H-MRSI of the human
brain at 7 T with minimal signal loss.” NMR in Biomedicine, 22(7), 683–696.
S. Ho, E. Bullitt, G. Gerig (2002). “Level-set evolution with region competition:
automatic 3-D segmentation of brain tumors.” In: 16th International Conference
on Pattern Recognition (ICPR).
S. Hojjatoleslami, F. Kruggel, D. Von Cramon (1998). “Segmentation of white
matter lesions from volumetric MR images.” In: Medical Image Computing and
Computer-Assisted Intervention (MICCAI), Lecture Notes in Computer Science,
vol. 1496/1998, 52–61. Springer.
K. Iftekharuddin, M. Islam, J. Shaik, et al. (2005). “Automatic brain tumor detection
in MRI: methodology and statistical validation.” In: Medical Imaging 2005: Image
Processing, Proceedings of SPIE, vol. 5747, 2012–2022.
H. Ishikawa (2003). “Exact optimization for Markov random fields with convex
priors.” IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(10),
1333–1336.
C. Jiang, X. Zhang, W. Huang, et al. (2004). “Segmentation and Quantification
of Brain Tumor.” In: IEEE International Conference on Virtual Environments,
Human-Computer Interfaces, and Measurement Systems (VECIMS).
T. Kanda, K. Sullivan, G. Wahl (1998). “Histone-GFP fusion protein enables sensitive analysis of chromosome dynamics in living mammalian cells.” Current Biology,
8(7), 377.
N. Karayiannis, P. Pai (1999). “Segmentation of magnetic resonance images using
fuzzy algorithms for learning vector quantization.” IEEE transactions on medical
imaging, 18(2), 172–180.
N. Karmarkar (1984). “A New Polynomial-Time Algorithm for Linear Programming.” Combinatorica, 4(4), 373–395.
R. Karp (1972). “Reducibility Among Combinatorial Problems.” In: R. E. Miller, J. W. Thatcher (eds.), Complexity of Computer Computations, 85–103. Plenum, New York.
R. Kass, A. Raftery (1995). “Bayes Factors.” Journal of the American Statistical
Association, 90(430), 773–795.
F. Kaster, S. Kassemeyer, B. Merkel, et al. (2010a). “An object-oriented library
for systematic training and comparison of classifiers for computer-assisted tumor
diagnosis from MRSI measurements.” In: Bildverarbeitung für die Medizin 2010
– Algorithmen, Systeme, Anwendungen, 97–101.
F. Kaster, B. Kelm, C. Zechmann, et al. (2009). “Classification of Spectroscopic
Images in the DIROlab Environment.” In: World Congress on Medical Physics
and Biomedical Engineering, September 7 - 12, 2009, Munich, Germany, IFMBE
Proceedings, vol. 25/V, 252–255.
F. Kaster, B. Menze, M.-A. Weber, et al. (2011). “Comparative validation of graphical models for learning tumor segmentations from noisy manual annotations.” In:
B. Menze, et al. (eds.), MICCAI 2010 Workshop on Medical Computer Vision
(MCV), Lecture Notes in Computer Science, vol. 6533, 74–85. Springer, Heidelberg.
F. Kaster, B. Merkel, O. Nix, et al. (2010b). “An object-oriented library for systematic training and comparison of classifiers for computer-assisted tumor diagnosis
from MRSI measurements.” Computer Science – Research and Development, in
press.
R. Kates, D. Atkinson, M. Brant-Zawadzki (1996). “Fluid-attenuated Inversion Recovery (FLAIR): Clinical Prospectus of Current and Future Applications.” Topics
in Magnetic Resonance Imaging, 8(6), 389–396.
M. Kaus, S. Warfield, A. Nabavi, et al. (1999). “Segmentation of meningiomas and
low grade gliomas in MRI.” In: Medical Image Computing and Computer-Assisted
Intervention (MICCAI), Lecture Notes in Computer Science, vol. 1679/1999, 1–10.
Springer.
M. Kaus, S. Warfield, A. Nabavi, et al. (2001). “Automated Segmentation of MR
Images of Brain Tumors.” Radiology, 218(2), 586–591.
S. Keevil (2006). “Spatial localization in nuclear magnetic resonance spectroscopy.”
Physics in Medicine and Biology, 51, R579–R636.
P. Keller, A. Schmidt, J. Wittbrodt, et al. (2008). “Reconstruction of zebrafish early
embryonic development by scanned light sheet microscopy.” Science, 322(5904),
1065–1069.
P. Keller, E. Stelzer (2008). “Quantitative in vivo imaging of entire embryos with
Digital Scanned Laser Light Sheet Fluorescence Microscopy.” Current Opinion in
Neurobiology, 18(6), 624–632.
B. Kelm (2007). Evaluation of Vector-Valued Clinical Image Data Using Probabilistic Graphical Models: Quantification and Pattern Recognition. Ph.D. thesis,
Ruprecht-Karls-Universität Heidelberg.
B. Kelm, F. Kaster, A. Henning, et al. (2011). “Using Spatial Prior Knowledge
in the Spectral Fitting of Magnetic Resonance Spectroscopic Images.” NMR in
Biomedicine, accepted.
B. Kelm, B. Menze, T. Neff, et al. (2006). “CLARET: a tool for fully automated evaluation of MRSI with pattern recognition methods.” In: H. Handels, J. Ehrhardt,
A. Horsch, et al. (eds.), Bildverarbeitung für die Medizin 2006 – Algorithmen,
Systeme, Anwendungen, 51–55.
B. Kelm, B. Menze, O. Nix, et al. (2009). “Estimating Kinetic Parameter Maps
from Dynamic Contrast-Enhanced MRI using Spatial Prior Knowledge.” IEEE
Transactions on Medical Imaging, 28(10), 1534 – 1547.
B. Kelm, B. Menze, C. Zechmann, et al. (2007). “Automated Estimation of Tumor Probability in Prostate Magnetic Resonance Spectroscopic Imaging: Pattern
Recognition vs. Quantification.” Magnetic Resonance in Medicine, 57, 150–159.
H. Khotanlou, J. Atif, O. Colliot, et al. (2006). “3D brain tumor segmentation using
fuzzy classification and deformable models.” In: Fuzzy Logic and Applications,
Lecture Notes in Computer Science, vol. 3849/2006, 312–318. Springer.
C. Kimmel, W. Ballard, S. Kimmel, et al. (1995). “Stages of Embryonic Development
of the Zebrafish.” Developmental Dynamics, 203, 253–310.
D. Koller, N. Friedman (2009). Probabilistic Graphical Models – Principles and
Techniques. MIT Press.
V. Kolmogorov, Y. Boykov (2005). “What Metrics Can Be Approximated by GeoCuts, or Global Optimization of Length/Area and Flux.” In: International Conference on Computer Vision (ICCV 2005).
V. Kolmogorov, R. Zabih (2004). “What Energy Functions can be Minimized via
Graph Cuts?” IEEE Transactions on Pattern Analysis and Machine Intelligence,
26(2), 147–159.
U. Köthe (2000). Generische Programmierung für die Bildverarbeitung. Ph.D. thesis, Universität Hamburg. Software available at http://hci.iwr.uni-heidelberg.de/vigra/.
V. Kovalev, F. Kruggel, H. Gertz, et al. (2001). “Three-Dimensional Texture Analysis
of MRI Brain Datasets.” IEEE Transactions on Medical Imaging, 20, 424–433.
R. Kreis (2004). “Issues of spectral quality in clinical 1H magnetic resonance spectroscopy and a gallery of artifacts.” NMR in Biomedicine, 17(6), 361–381.
E. Lander, L. Linton, B. Birren, et al. (2001). “Initial sequencing and analysis of the
human genome.” Nature, 409, 860–921.
T. Langenberg, T. Dracz, A. Oates, et al. (2006). “Analysis and Visualization of
Cell Movement in the Developing Zebrafish Brain.” Developmental Dynamics,
235, 928–933.
C. Lee, M. Schmidt, A. Murtha, et al. (2005). “Segmenting brain tumors with conditional random fields and support vector machines.” In: First International Workshop for Computer Vision for Biomedical Image Applications (CVBIA), Lecture
Notes in Computer Science, vol. 3765/2005, 469–478. Springer.
C. Lee, S. Wang, F. Jiao, et al. (2006). “Learning to model spatial dependency: Semisupervised discriminative random fields.” In: Advances in Neural Information
Processing Systems (NIPS), vol. 19, 793–800.
C. Lee, S. Wang, A. Murtha, et al. (2008). “Segmenting Brain Tumors using Pseudo-Conditional Random Fields.” In: Medical Image Computing and Computer-Assisted Intervention (MICCAI), vol. 5241/2008, 359–366. Springer.
K. Van Leemput, F. Maes, D. Vandermeulen, et al. (1999a). “Automated Model-based
Bias Field Correction of MR Images of the Brain.” IEEE Transactions on Medical
Imaging, 18(10), 885–896.
K. Van Leemput, F. Maes, D. Vandermeulen, et al. (1999b). “Automated Model-based
Tissue Classification of MR Images of the Brain.” IEEE Transactions on Medical
Imaging, 18(10), 897–908.
A. Lefohn, J. Cates, R. Whitaker (2003). “Interactive, GPU-Based Level Sets for 3D Brain Tumor Segmentation.” In: Medical Image Computing and Computer-Assisted Intervention (MICCAI), Lecture Notes in Computer Science, vol. 2878/2003, 564–572.
M. Letteboer, O. Olsen, E. Dam, et al. (2004). “Segmentation of Tumors in Magnetic
Resonance Brain Images Using an Interactive Multiscale Watershed Algorithm.”
Academic Radiology, 11, 1125–1138.
F. Li, X. Zhou, J. Ma, et al. (2010). “Multiple nuclei tracking using integer programming for quantitative cancer cell cycle analysis.” IEEE Transactions on Medical
Imaging, 29(1), 96–105.
G. Li, T. Liu, J. Nie, et al. (2008a). “Segmentation of touching cell nuclei using
gradient flow tracking.” Journal of Microscopy, 231(1), 47–58.
G. Li, T. Liu, A. Tarokh, et al. (2007). “3D cell nuclei segmentation based on gradient
flow tracking.” BMC Cell Biology, 8, 40.
K. Li, E. Miller, M. Chen, et al. (2008b). “Cell population tracking and lineage
construction with spatiotemporal context.” Medical Image Analysis, 12(5), 546–
566.
J. Lichtman, J. Livet, J. Sanes (2008). “A technicolour approach to the connectome.”
Nature Reviews Neuroscience, 9, 417–422.
H. Lin, C. Lin, R. Weng (2007). “A note on Platt’s probabilistic outputs for support
vector machines.” Machine Learning, 68, 267–276.
T. Liu, J. Nie, G. Li, et al. (2008). “ZFIQ: a software package for zebrafish biology.”
Bioinformatics, 24(3), 438–439.
J. Livet, T. Weissman, H. Kang, et al. (2007). “Transgenic strategies for combinatorial expression of fluorescent proteins in the nervous system.” Nature, 450,
56–62.
X. Lou, F. Kaster, M. Lindner, et al. (2011a). “DELTR: Digital Embryo Lineage
Tree Reconstructor.” In: International Symposium on Biomedical Imaging (ISBI),
submitted.
X. Lou, U. Köthe, P. Keller, et al. (2011b). “Accurate Reconstruction of Digital Embryo Volume with Multi-Object Shape Regularization.” Medical Image Analysis,
to be submitted.
M. A. Luengo-Oroz, B. Lombardot, E. Faure, et al. (2007). “A Mathematical Morphology Framework for the 4D Reconstruction of the Early Zebrafish Embryogenesis.” In: International Symposium on Mathematical Morphology.
D. Lunn, A. Thomas, N. Best, et al. (2000). “WinBUGS – A Bayesian modelling
framework: Concepts, structure and extensibility.” Statistics and Computing,
10(4), 325–337.
M. Martínez-Bisbal, B. Celda (2009). “Proton magnetic resonance spectroscopy
imaging in the study of human brain cancer.” Quarterly Journal of Nuclear
Medicine and Molecular Imaging, 53(6), 618–630.
A. Maudsley, A. Darkazanli, J. Alger, et al. (2006). “Comprehensive processing,
display and analysis for in vivo MR spectroscopic imaging.” NMR in Biomedicine,
19(4), 492–503.
B. Menze, B. Kelm, R. Masuch, et al. (2009). “A comparison of random forest and
its Gini importance with standard chemometric methods for the feature selection
and classification of spectral data.” BMC Bioinformatics, 10, 213.
B. H. Menze, B. M. Kelm, M.-A. Weber, et al. (2008). “Mimicking the human
expert: Pattern recognition for an automated assessment of data quality in MRSI.”
Magnetic Resonance in Medicine, 59(6), 1457–1466.
B. H. Menze, M. P. Lichy, P. Bachert, et al. (2006). “Optimal classification of long
echo time in vivo magnetic resonance spectra in the detection of recurrent brain
tumors.” NMR in Biomedicine, 19(5), 599–609.
N. Metropolis, A. Rosenbluth, M. Rosenbluth, et al. (1953). “Equation of state
calculations by fast computing machines.” Journal of Chemical Physics, 21(6),
1087–1092.
J.-B. Michel, Y. Shen, A. Aiden, et al. (2010). “Quantitative Analysis of Culture
Using Millions of Digitized Books.” Science, 331(6014), 176–182.
D. Mikulis, T. Roberts (2007). “Neuro MR: protocols.” Journal of Magnetic Resonance Imaging, 26(4), 838–847.
T. Minka (2001). “Expectation Propagation for approximate Bayesian inference.”
In: Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence
(UAI), 362–369.
T. Minka (2004). “Power EP.” Tech. Rep. MSR-TR-2004-149, Microsoft Research.
T. Minka (2005). “Divergence measures and message passing.” Tech. Rep. MSR-TR-2005-173, Microsoft Research Cambridge.
T. Minka, J. Winn (2009). “Gates.” In: D. Koller, D. Schuurmans, Y. Bengio,
et al. (eds.), Advances in Neural Information Processing Systems (NIPS), vol. 21,
1073–1080. MIT Press, Cambridge MA.
T. Minka, J. Winn, J. Guiver, et al. (2009). “Infer.NET 2.2.” Microsoft Research
Cambridge. http://research.microsoft.com/infernet.
N. Moon, E. Bullitt, K. Van Leemput, et al. (2002). “Automatic Brain and Tumor Segmentation.” In: Medical Image Computing and Computer-Assisted Intervention (MICCAI), Lecture Notes in Computer Science, vol. 2488/2002, 372–379.
Springer.
G. Moonis, J. Liu, J. Udupa, et al. (2002). “Estimation of tumor volume with fuzzyconnectedness segmentation of MR images.” American Journal of Neuroradiology,
23(3), 356–363.
K. Mosaliganti, A. Gelas, A. Gouaillard, et al. (2009). “Detection of Spatially Correlated Objects in 3D Images Using Appearance Models and Coupled Active Contours.” In: G.-Z. Yang, et al. (eds.), Medical Image Computing and ComputerAssisted Intervention (MICCAI 2009), Part II, Lecture Notes in Computer Science,
vol. 5762, 641–648. Springer, Berlin.
J. Munkres (1957). “Algorithms for the Assignment and Transportation Problems.”
Journal of the Society for Industrial and Applied Mathematics, 5(1), 32–38.
A. Nemirovski, M. Todd (2008). “Interior-point methods for optimization.” Acta
Numerica, 17, 191–234.
B. De Neuter, J. Luts, L. Vanhamme, et al. (2007). “Java-based framework for processing and displaying short-echo-time magnetic resonance spectroscopy signals.” Computer Methods and Programs in Biomedicine, 85, 129–137.
J. Nie, Z. Xue, T. Liu, et al. (2009). “Automated brain tumor segmentation using
spatial accuracy-weighted hidden Markov Random Field.” Computerized Medical
Imaging and Graphics, 33, 431–441.
N. Olivier, M. Luengo-Oroz, L. Duloquin, et al. (2010). “Cell Lineage Reconstruction
of Early Zebrafish Embryos Using Label-Free Nonlinear Microscopy.” Science,
329(5994), 967–971.
S. Ortega-Martorell, I. Olier, M. Julià-Sapé, et al. (2010). “SpectraClassifier 1.0: a
user friendly, automated MRS-based classifier-development system.” BMC Bioinformatics, 11, 106.
N. Otsu (1979). “A threshold selection method from gray-level histograms.” IEEE
Transactions on Systems, Man, and Cybernetics, 9, 62–66.
C. Pachai, Y. Zhu, J. Grimaud, et al. (1998). “Pyramidal approach for automatic
segmentation of multiple sclerosis lesions in brain MRI.” Computerized Medical
Imaging and Graphics, 22(5), 399–408.
D. Padfield, J. Rittscher, B. Roysam (2009a). “Coupled Minimum-Cost Flow Cell
Tracking.” In: J. Prince, D. Pham, K. Myers (eds.), Information Processing in
Medical Imaging (IPMI 2009), Lecture Notes in Computer Science, vol. 5636, 374–
385. Springer, Berlin.
D. Padfield, J. Rittscher, N. Thomas, et al. (2009b). “Spatio-temporal cell cycle phase
analysis using level sets and fast marching methods.” Medical Image Analysis,
13(1), 143–155.
C. Papadimitriou, K. Steiglitz (1998). Combinatorial Optimization: Algorithms and
Complexity. Dover Publications.
J. Pearl (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible
Inference. Morgan-Kaufmann.
W. Pijnappel, A. van den Boogaart, R. de Beer, et al. (1992). “SVD-Based Quantification of Magnetic Resonance Signals.” Journal of Magnetic Resonance, 97,
122–134.
J. Poullet, D. Sima, A. Simonetti, et al. (2007). “An automated quantitation of short
echo time MRS spectra in an open source software environment: AQSES.” NMR
in Biomedicine, 20(5), 493–504.
J. Poullet, D. Sima, S. Van Huffel (2008). “MRS signal quantitation: A review of
time- and frequency-domain methods.” Journal of Magnetic Resonance, 195(2),
134–144.
M. Prastawa, E. Bullitt, G. Gerig (2009). “Simulation of Brain Tumors in MR Images
for Evaluation of Segmentation Efficacy.” Medical Image Analysis, 13(2), 297–311.
M. Prastawa, E. Bullitt, S. Ho, et al. (2003a). “Robust estimation for brain tumor segmentation.” In: Medical Image Computing and Computer-Assisted Intervention (MICCAI), Lecture Notes in Computer Science, vol. 2879/2003, 530–537.
Springer.
M. Prastawa, E. Bullitt, S. Ho, et al. (2004). “A brain tumor segmentation framework
based on outlier detection.” Medical Image Analysis, 8(3), 275–283.
M. Prastawa, E. Bullitt, N. Moon, et al. (2003b). “Automatic Brain Tumor Segmentation by Subject Specific Modification of Atlas Priors.” Academic Radiology,
10(12), 1341–1348.
S. Provencher (2001). “Automatic quantitation of localized in vivo 1H spectra with
LCModel.” NMR in Biomedicine, 14(4), 260–264.
S. Provencher (2010). LCModel and LCMgui user’s manual, version 6.2-2. http://sprovencher.com/pub/LCModel/manual/manual.pdf.
R. Raman, S. Raguram, G. Venkataraman, et al. (2005). “Glycomics: an integrated
systems approach to structure-function relationships of glycans.” Nature Methods,
2, 817–824.
C. Rasmussen, C. Williams (2006). Gaussian Processes for Machine Learning. MIT
Press.
H. Ratiney, M. Sdika, Y. Coenradie, et al. (2005). “Time-domain semi-parametric
estimation based on a metabolite basis set.” NMR in Biomedicine, 18, 1–13.
N. Ray, R. Greiner, A. Murtha (2008). “Using Symmetry to Detect Abnormalities
in Brain MRI.” Computer Society of India Communications, 31(19), 7–10.
S. Raya (1990). “Low-level segmentation of 3D Magnetic Resonance brain images:
A rule-based system.” IEEE Transactions on Medical Imaging, 9, 327–337.
V. Raykar, S. Yu, L. Zhao, et al. (2009). “Supervised Learning from Multiple Experts:
Whom to trust when everyone lies a bit.” In: International Conference on Machine
Learning (ICML), 889–896.
V. Raykar, S. Yu, L. Zhao, et al. (2010). “Learning From Crowds.” Journal of
Machine Learning Research, 11, 1297–1322.
A. Rényi (1961). “On measures of entropy and information.” In: Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, 547–561.
E. Reynaud, U. Kržič, K. Greger, et al. (2008). “Light sheet-based fluorescence microscopy: more dimensions, more photons, and less photodamage.” HFSP Journal,
2(5), 266–275.
R. Rifkin, A. Klautau (2004). “In Defense of One-Vs-All Classification.” Journal of
Machine Learning Research, 5, 101–141.
J. Rittscher (2010). “Characterization of Biological Processes through Automated
Image Analysis.” Annual Review of Biomedical Engineering, 12, 315–344.
S. Rogers, M. Girolami, T. Polajnar (2010). “Semi-parametric analysis of multi-rater
data.” Statistics and Computing, 20(3), 317–334.
B. Sajja, J. Wolinsky, P. Narayana (2009). “Proton Magnetic Resonance Spectroscopy in Multiple Sclerosis.” Neuroimaging Clinics of North America, 19(1),
45–58.
M. Schmidt, I. Levner, R. Greiner, et al. (2005). “Segmenting Brain Tumors using
Alignment-Based Features.” In: International Conference on Machine Learning
and Applications (ICMLA), 215–220.
B. Schölkopf, A. Smola (2002). Learning with Kernels. Support Vector Machines,
Regularization, Optimization, and Beyond. MIT Press, Cambridge MA.
B. Settles (2010). “Active Learning Literature Survey.” Tech. Rep. 1648, University
of Wisconsin-Madison.
J. Shaffer (1995). “Multiple Hypothesis Testing.” Annual Review of Psychology, 46,
561–584.
D. Sima, A. Croitor Sava, S. Van Huffel (2010). “Adaptive Alternating Minimization
for Fitting Magnetic Resonance Spectroscopic Imaging Signals.” In: M. Diehl,
et al. (eds.), Recent Advances in Optimization and its Applications in Engineering,
vol. 7, 511–520. Springer, Berlin.
D. Sima, S. van Huffel (2006). “Regularized semiparametric model identification
with application to NMR signal quantification with unknown macromolecular baseline.” Journal of the Royal Statistical Society B (Methodological), 68(3), 383–409.
S. Smith, T. Levante, B. Meier, et al. (1994). “Computer Simulations in Magnetic
Resonance. An Object-Oriented Programming Approach.” Journal of Magnetic
Resonance, A 106(1), 75–105.
P. Smyth, U. Fayyad, M. Burl, et al. (1995). “Inferring Ground Truth From Subjective Labelling of Venus Images.” In: G. Tesauro, D. Toretzy, T. Leen (eds.),
Advances in Neural Information Processing Systems (NIPS), vol. 7, 1085–1092.
MIT Press.
J. Solomon, J. Butman, A. Sood (2004). “Data Driven Brain Tumor Segmentation
in MRI Using Probabilistic Reasoning over Space and Time.” In: Medical Image Computing and Computer-Assisted Intervention (MICCAI), Lecture Notes in
Computer Science, vol. 3216/2004, 301–309. Springer.
J. Solomon, J. Butman, A. Sood (2006). “Segmentation of brain tumors in 4D MR
images using the hidden Markov model.” Computer Methods and Programs in
Biomedicine, 84(2–3), 76–85.
H. Soltanian-Zadeh, D. Peck, J. Windham, et al. (1998). “Brain tumor segmentation
and characterization by pattern analysis of multispectral NMR images.” NMR in
Biomedicine, 11(4–5), 201–208.
C. Sommer, C. Straehle, U. Köthe, et al. (2010). “Interactive Learning and Segmentation Tool Kit.” http://gitorious.org/ilastik/ilastik.git. “master”
branch, commit 087fd66d4db165ff6c14c8573b6543b3e62d5b7e with personal customizations.
Y. Song, C. Zhang, J. Lee, et al. (2006). “A Discriminative Method for SemiAutomated Tumorous Tissues Segmentation of MR Brain Images.” In: Computer
Vision and Pattern Recognition Workshop (CVPRW).
Y. Song, C. Zhang, J. Lee, et al. (2009). “Semi-supervised discriminative classification with application to tumorous tissues segmentation of MR brain images.”
Pattern Analysis & Applications, 12(2), 99–115.
D. Stefan, F. D. Cesare, A. Andrasescu, et al. (2009). “Quantitation of magnetic resonance spectroscopy signals: the jMRUI software package.” Measurement Science
and Technology, 20, 104035.
C. Stone (1977). “Consistent Nonparametric Regression.” Annals of Statistics, 5(4),
595–620.
B. Stroustrup (2001). “Exception Safety: Concepts and Techniques.” In: C. Dony,
J. Knudsen, A. Romanovsky, et al. (eds.), Advances in Exception Handling Techniques, 60–76. Springer, New York.
J. Sulston, E. Schierenberg, J. White, et al. (1983). “The embryonic cell lineage of
the nematode Caenorhabditis elegans.” Developmental Biology, 100(1), 64–119.
A. Tate, J. Underwood, D. Acosta, et al. (2006). “Development of a decision support
system for diagnosis and grading of brain tumours using in vivo magnetic resonance
single voxel spectra.” NMR in Biomedicine, 19, 411–434.
T. Terlaky, S. Zhang (1993). “Pivot rules for linear programming: a survey on recent theoretical developments.” Annals of Operations Research, 46, 202–233.
J. Udupa, L. Wei, S. Samarasekera, et al. (1997). “Multiple sclerosis lesion quantification using fuzzy-connectedness principles.” IEEE Transactions on Medical
Imaging, 16(5), 598–609.
L. Vanhamme, A. van den Boogaart, S. van Huffel (1997). “Improved method for
accurate and efficient quantification of MRS data with use of prior knowledge.”
Journal of Magnetic Resonance, 129(1), 35–43.
M. Wainwright, M. Jordan (2008). “Graphical models, exponential families, and variational inference.” Foundations and Trends in Machine Learning, 1(1–2), 1–305.
R. Walker, P. Jackway (1996). “Statistical Geometric Features – Extensions for Cytological Texture Analysis.” In: Proceedings of the 13th International Conference
on Pattern Recognition (ICPR).
M. Wang, X. Zhou, F. Li, et al. (2008). “Novel cell segmentation and online SVM for
cell cycle phase identification in automated microscopy.” Bioinformatics, 24(1),
94–101.
S. Warfield, J. Dengler, J. Zaers, et al. (1995). “Automatic identification of gray
matter structures from MRI to improve the segmentation of white matter lesions.”
Journal of Image-Guided Surgery, 1(6), 326–338.
S. Warfield, M. Kaus, F. A. Jolesz, et al. (2000). “Adaptive, template moderated,
spatially varying statistical classification.” Medical Image Analysis, 4(1), 43–55.
S. Warfield, K. Zou, W. Wells (2004). “Simultaneous truth and performance level
estimation (STAPLE): an algorithm for the validation of image segmentation.”
IEEE Transactions on Medical Imaging, 23(7), 903–921.
S. Warfield, K. Zou, W. Wells (2008). “Validation of image segmentation by estimating rater bias and variance.” Philosophical Transactions of the Royal Society
A, 366(1874), 2361–2375.
M. Wels, G. Carneiro, A. Aplas, et al. (2008a). “A Discriminative Model-Constrained
Graph Cuts Approach to Fully Automated Pediatric Brain Tumor Segmentation
in 3-D MRI.” In: Medical Image Computing and Computer-Assisted Intervention
(MICCAI), Lecture Notes in Computer Science, vol. 5241/2008, 67–75. Springer.
M. Wels, M. Huber, J. Hornegger (2008b). “Fully Automated Segmentation of Multiple Sclerosis Lesions in Multispectral MRI.” In: Pattern Recognition and Image
Analysis, vol. 18, 347–350. Pleiades.
J. Whitehill, P. Ruvolo, T. Wu, et al. (2009). “Whose vote should count more: Optimal integration of labels from labelers of unknown expertise.” In: Y. Bengio,
D. Schuurmans, J. Lafferty, et al. (eds.), Advances in Neural Information Processing Systems 22, 2035–2043. MIT Press.
F. Wilcoxon (1945). “Individual Comparisons by Ranking Methods.” Biometrics
Bulletin, 1(6), 80–83.
J. Winn, C. Bishop (2005). “Variational Message Passing.” Journal of Machine
Learning Research, 6, 661–694.
L. Wolsey (1998). Integer programming. Wiley-Interscience.
Z. Wu, H.-W. Chung, F. Wehrli (1994). “A Bayesian approach to subvoxel tissue
classification in NMR microscopic images of trabecular bone.” Magnetic Resonance
in Medicine, 31(3), 302–308.
D. Xu, D. Vigneron (2010). “Magnetic Resonance Spectroscopy Imaging of the
Newborn Brain – A Technical Review.” Seminars in Perinatology, 34(1), 20–27.
Z. Yin, R. Bise, M. Chen, et al. (2010). “Cell Segmentation in Microscopy Imagery Using a Bag of Local Bayesian Classifiers.” In: International Symposium on
Biomedical Imaging (ISBI), 125–128.
T. Yokoo, W. Bae, G. Hamilton, et al. (2010). “A Quantitative Approach to Sequence
and Image Weighting.” Journal of Computer-Assisted Tomography, 34, 317–331.
C. Zanella, M. Campana, B. Rizzi, et al. (2010). “Cells Segmentation from 3-D
Confocal Images of Early Zebrafish Embryogenesis.” IEEE Transactions on Image
Processing, 19(3), 770–781.
C. Zechmann, B. Menze, B. Kelm, et al. (2011). “How much spatial context do we
need? Automated versus manual pattern recognition of 3D MRSI data of prostate
cancer patients.” NMR in Biomedicine, submitted.
J. Zhang (1992). “The mean field theory in EM procedures for Markov random
fields.” IEEE Transactions on Signal Processing, 40(10), 2570–2583.
J. Zhou, K. Chan, V. Chong, et al. (2005). “Extraction of Brain Tumor from MR Images Using One-Class Support Vector Machine.” In: IEEE Engineering in Medicine
and Biology 27th Annual Conference.
Y. Zhu, Q. Liao, W. Dou, et al. (2005). “Brain tumor segmentation in MRI based
on fuzzy aggregators.” In: Visual Communications and Image Processing 2005,
Proceedings of SPIE, vol. 5960, 1704–1711.